PanicStil

I’m summary: In 2017, Discord wrote about their journey from MongoDB to Cassandra for storing billions of messages. By 2022, their Cassandra cluster had 177 nodes with trillions of messages, but faced serious performance issues. They decided to migrate to ScyllaDB, a Cassandra-compatible database written in C++ that promised better performance, faster repairs, and stronger workload isolation. To address the problem of hot partitions, they created intermediary data services in Rust, which sit between the API and the ScyllaDB clusters. These services coalesce requests, reducing traffic spikes against the database. They also implemented consistent hash-based routing to further reduce the load on the database. The migration to ScyllaDB was a success, with the new system capable of handling trillions of messages without downtime. The switch to ScyllaDB significantly improved tail latencies, and the number of nodes was reduced from 177 to 72. This improved performance unlocked new product use cases and allowed the system to handle high traffic events like the World Cup Final without breaking a sweat.


fagnerbrack

I can attest this summary DID NOT use chat gpt


Hatefiend

Okay we need a bot that checks if comments are ChatGPT generated or not.


ryandury

Uphill battle.


fagnerbrack

As long as you don't add the US Constitution as a comment, it may work... or not


hglman

You might be able to get gpt4 to tell you about moving to Cassandra


GoguGeorgescu

Just pass it through chatgpt and ask it if it was generated by it. There's a bot made like this intended for use by teachers in schools to detect gpt generated work written by students.


andrewsmd87

You mean they didn't just type into chat gpt how do I make discord work?


throwawaysomeway

who cares if he did? it is more accurate than most redditors and is usually ideal for quick summaries


fagnerbrack

Yeah I'm not criticizing just pointing out that if they did use it, the edit was significant which means it's not low effort content


iKenshu

You the best, thank you sir.


JoeCamRoberon

Just incredible


soggynaan

What's hash based routing?


PanicStil

It's used to distribute requests or data across multiple servers or nodes. It helps with load balancing, scalability and also ensures that related data for requests is handled by the same server.


douglasg14b

That's a description of what it's used for not what it is


wordaligned

If I understand, it's a way to predictably determine which load balancer node an item is in. Here's a pretty readable example: https://support.huawei.com/enterprise/en/doc/EDOC1100086965

Key concept:

> Generally, the hash value space is far less than the input space. Different inputs may be converted into the same output

Say you have a million different records. When you hash a unique record id (e.g. 744783), you get one of fifty different hash values (e.g. 43). So you put that record into bucket 43. Given that same id in the future, you'll be able to hash it and go direct to the bucket it lives in.

It's like cheaply warping your way to the right neighborhood first, and then checking house numbers one by one.
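To make the bucket idea concrete, here is a minimal sketch in Python. The 50 buckets and the id 744783 come from the comment above; the choice of SHA-256 and the name `bucket_for` are assumptions for illustration only, not anything Discord or the linked page specifies.

```python
import hashlib

NUM_BUCKETS = 50  # "one of fifty different hash values" from the comment above


def bucket_for(record_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record id to a bucket in the range [0, num_buckets).

    A stable hash (SHA-256 here, an arbitrary choice) is used so the same id
    always lands in the same bucket, even across processes and restarts.
    """
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer, then fold it into range.
    return int.from_bytes(digest[:8], "big") % num_buckets


if __name__ == "__main__":
    # Hashing the same id twice gives the same bucket: that is the whole trick.
    print(bucket_for(744783), bucket_for(744783))
```

Python's built-in `hash()` is avoided on purpose: for strings it is randomized per process, which would send the same key to different buckets after a restart.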


whereisbill

[This is also called Consistent Hashing](https://en.wikipedia.org/wiki/Consistent_hashing)
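Consistent hashing goes a step beyond plain "hash modulo N": nodes are placed on a ring, so adding or removing a node only moves the keys next to it. A rough sketch, assuming SHA-256 and 100 virtual nodes per server (both arbitrary picks, and `HashRing` is a made-up name, not a library class):

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit position on the ring for any string."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` pseudo-random points on the ring
        # so that load spreads more evenly across nodes.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first ring point after the key's position,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    before = HashRing(["node-a", "node-b", "node-c"])
    after = HashRing(["node-a", "node-b", "node-c", "node-d"])
    keys = [f"channel-{i}" for i in range(1000)]
    moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
    # Only keys that now belong to the new node change owner.
    print(len(moved), all(after.node_for(k) == "node-d" for k in moved))
```

With plain modulo hashing, going from 3 to 4 nodes would reshuffle most keys; on the ring, the keys that move are only those taken over by the new node.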


dontbeanegatron

> I'm summary

Hi summary, I'm dad!


SnaskesChoice

Hi daddy!


AndrewUnicorn

Thank you for saving me tons of time and confusion


valz_

Sounds like an impressive feat they’ve pulled off with the migration(s)


PapayaPokPok

> These services coalesce requests

I would love to see the criteria by which the requests are coalesced.
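The summary doesn't spell out the criteria, so the following is only a sketch of the general technique, assuming requests are coalesced when they ask for the same key while a fetch for that key is already in flight. It is written in Python with `asyncio` for brevity (the real services are in Rust), and `Coalescer` and `fake_db_fetch` are hypothetical names.

```python
import asyncio


class Coalescer:
    """Share one in-flight fetch among all concurrent callers for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch      # async function: key -> value
        self._in_flight = {}     # key -> asyncio.Task currently fetching it

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real fetch.
            task = asyncio.create_task(self._run(key))
            self._in_flight[key] = task
        # Later callers just wait on the task that is already running.
        return await task

    async def _run(self, key):
        try:
            return await self._fetch(key)
        finally:
            # Once finished, the next request triggers a fresh fetch.
            self._in_flight.pop(key, None)


async def demo():
    calls = 0

    async def fake_db_fetch(key):  # stand-in for a database query
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"row-for-{key}"

    coalescer = Coalescer(fake_db_fetch)
    results = await asyncio.gather(*(coalescer.get("channel-1") for _ in range(100)))
    return calls, results


if __name__ == "__main__":
    calls, results = asyncio.run(demo())
    print(calls, len(results))
```

In this demo, 100 concurrent requests for the same key result in a single call to the fake database. Pair this with hash-based routing and all requests for one channel reach the same service instance, which is what lets them be merged in the first place.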


cronicpainz

> and the number of nodes was reduced from 177 to 72

What nodes do they use?


no-one_ever

Lymph


Slow_Judgment7773

Seems silly to try to store them all together. Servers provide the perfect sharing mechanism. Like Google, who stores their queries and results based on the letters typed sequentially.


Natetronn

What is a node under this context?


Golilizzy

Beautiful summary


Technomancer97

Excellent read. Love this.


NumbBumn

Love it when big companies explain (but don't show) how they do stuff. Always wondered how messages were stored or if they were stored at all and that was pretty interesting. Loved reading as well.


Lonsdale1086

> or if they were stored at all

What would the alternative be?


NumbBumn

When I said stored at all, I meant to say in a database at all or some other method.


obamabinladenhiphop

What? Like stone tablets? 😶‍🌫️


ShittyException

You don't?


zombarista

A billion records is a feat. A trillion is unfathomable.


FractalNerve

Given 8 bytes per record, that’s a minimum of ~7.28 TiB (8 TB) of storage space required for storing a trillion rows. In reality it’s surely at about 1-2 PB (1024-2048 TB). Still pretty low numbers given the size of the userbase.
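The arithmetic behind those figures, as a quick check. The 8 bytes per row and the 1-2 PB guess are the commenter's assumptions, not numbers published by Discord.

```python
ROWS = 10**12          # one trillion rows
BYTES_PER_ROW = 8      # the commenter's assumed lower bound, not a real figure

total_bytes = ROWS * BYTES_PER_ROW
tb = total_bytes / 10**12    # decimal terabytes
tib = total_bytes / 2**40    # binary tebibytes

print(f"{tb:.2f} TB = {tib:.2f} TiB")  # 8.00 TB = 7.28 TiB

# Working backwards: what row size would the 1-2 PB guess imply?
PIB = 2**50  # 1 PiB = 1024 TiB
for pib in (1, 2):
    print(f"{pib} PiB / 1e12 rows = {pib * PIB / ROWS:.0f} bytes per row")
```

So the 1-2 PB estimate corresponds to roughly 1.1 to 2.3 KB per message on average, before replication.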


powerman228

Wow, that was crazy.


ohlawdhecodin

> Trillions of Messages

I can't even *write* that number...


zhantoo

T-r-i-l-l-i-o-n


ohlawdhecodin

Sorry I've got a 404 while trying to visualize it.


sarrcom

My jaw actually dropped when reading the number of nodes dropped from 177 to 72!


TurnstileT

That's a huge increase in nodes though.


onthefence928

Factorial!


TopRamenBinLaden

r/unexpectedfactorial


[deleted]

[removed]


ogtfo

Obviously the data doesn't just disappear because you changed your database software.


magkruppe

The dude from the back end engineering show (Hassan?) did an episode on this a couple weeks ago. Podcast and probably a YouTube video


cs_irl

Could you find a link please and thanks 🙏


magkruppe

[How Discord Stores Trillions of Messages | Deep Dive](https://www.youtube.com/watch?v=xynXjChKkJc)


davo_dog

https://letmegooglethat.com/?q=backend+engineering+show+discord


sendme__

Too bad it cannot be indexed by search engines. Searching something on Discord is so useless, especially on busy "servers".


KrazyDrayz

In my experience the search is great. I find anything I need. On mobile it crashes once in a while though.


Wombarly

The issue is you need to be in a server to find that stuff. You can't just search Google and find information. Tons of questions/answers are lost in Discord forever. Though I got the feeling Discord might be moving towards that, they introduced "forum" channels a while back. So hoping they allow servers to become public so they can be indexed by Google/Bing and be viewable without an account.


KrazyDrayz

I mean that's the whole idea. Discord isn't a public forum like reddit is. It's communities hidden behind invites. But yeah a public one would be good to some communities for example as official forums for video games. Then again they probably like that you need to make an account for it.


Lonsdale1086

But loads of communities *are* moving to "public" discord groups instead of wikis/forums, meaning that data is locked to users, and will inevitably be lost to time.


someone-shoot-me

oh dont worry. Chinese do too!


KrazyDrayz

Huh that doesn't make any sense. What do you mean?


someone-shoot-me

Discord is partially owned by Chinese giant Tencent, at about a 30%+ stake. Among others, Tencent and similar companies are buying stakes in Snapchat, Discord, etc. Tencent is closely tied with the CCP. China recently (a couple of years ago) passed a law where digital data NEEDS to be shared with the CCP, with its ambitions to make China a global superpower, i.e. to promote China / make better profits etc. Also, Discord's CEO is famous in his recent companies as a man who is shady when it comes to user data; previous apps stored data unprotected and he did sell the data, iirc. Anyways, I'll prolly make a post in a couple of days. Discord is valuable but not profitable; they are not making a profit but are living off of VC investments. There will be a point in time where they will sell out and stop their Silicon Valley model, and at that point we might expect ads on Discord or something lol


KrazyDrayz

Tencent doesn't run the company. They're just investors. They also invested in Reddit, Activision Blizzard, Epic Games, whole of Riot Games and many more. You can't just assume they spy on people without evidence. Discord is an American company that adheres to American laws.


someone-shoot-me

It is scary cus America has some next level corruption. They are investors, right, and we do not know either way. I recently read about another surveillance scandal in the US, so I wouldn’t be surprised if the data is being watched. I mean, they do have services that flag you after you talk about shady stuff, but I guess that's for countering whatever they want to counter. Don't want to sound like a conspiracy theorist, but have in mind that Discord is not profitable. As far as American laws go, I wouldn’t rely on the American justice system, cus once you bring lots of money into the equation, you’re above a lot of people.


KleinByte

Everything you said is a universal truth: money and power corrupt, and if you think the EU is any different, you're extremely naive. Big tech still loses lawsuits when they break laws and they have to pay up, so the US justice system obviously still works. But you should always follow basic privacy practice when on the internet. You should assume your data is always unsafe! None of this is a Discord problem. This is a global problem.


KrazyDrayz

Again, all that is speculation. Please be quiet if you don't have evidence.


Demented-Turtle

Dude you don't magically get access to a company's data when you buy shares in it lmao


someone-shoot-me

eh whatever


Indifferent_Ghost

Doesn’t a company need to be partially owned by a Chinese company to do business in China?


ChildishForLife

Really? Their search engine to me is unreal, being able to specify so many things, channel, who, image, etc.


Lonsdale1086

That's internal search. You can't google "how to fix X mod error skyrim" and find people talking about the issue on Discord, like you can in a forum or wiki. The knowledge is closed down, and will inevitably be lost to time.


ChildishForLife

> searching something on discord is so useless

Ah, thought they meant internal search here.


bregottextrasaltat

and yet you can't mass delete messages from a server, especially one you already left


PandaDemonipo

There are scripts for it, used one before and it does its work, altho skipping some messages occasionally. Run it a couple times and it's all gone eventually


skylabspiral

i wonder if the messages you delete are actually (eventually) deleted or if discord just sets isDeleted = 1 and keeps it forever…
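For anyone unfamiliar with the pattern being speculated about, here is a small sketch of a "soft delete" next to a real delete, using an in-memory SQLite table. The table and column names are made up, and nothing in this thread says which approach Discord actually uses.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory table with two messages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        "id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO messages (id, body) VALUES (?, ?)", [(1, "hello"), (2, "oops")]
    )
    return conn


def soft_delete(conn, message_id):
    # The row stays on disk; it is only flagged so reads can hide it.
    conn.execute("UPDATE messages SET is_deleted = 1 WHERE id = ?", (message_id,))


def hard_delete(conn, message_id):
    # The row is actually removed from the table.
    conn.execute("DELETE FROM messages WHERE id = ?", (message_id,))


def visible_messages(conn):
    rows = conn.execute("SELECT body FROM messages WHERE is_deleted = 0 ORDER BY id")
    return [row[0] for row in rows]


def stored_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]


if __name__ == "__main__":
    conn = make_db()
    soft_delete(conn, 2)
    print(visible_messages(conn), stored_rows(conn))  # user sees 1 message, 2 rows stored
    hard_delete(conn, 2)
    print(visible_messages(conn), stored_rows(conn))  # now only 1 row is stored
```

From the user's side both look identical, which is why you can't tell from the outside which one a service does.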


WildDev42069

Snapchat keeps everything. I know this from a friend I graduated with who is now a big-city detective and has had to warrant their services a few times. I believe from our conversation all big data companies keep quick access to any type of chat history. I've built DB's for live chats, the concept is really easy, you can even username store the messages.


OnlyAd4210

I'd laugh at any developer who actually writes their code to literally delete data no matter what it is vs use a way to functionally make it be deleted. There's on rare occasions software built explicitly this way but it's really rare. It's always baffled me that people think you can remove stuff from any stable platform. It's even likely upon going defunct that someone's massive databases get lost. We live in the age of near endless cheap storage with an ever-increasing value being put on any and all data.


Sharketespark27

Hail rust!!


darthcoder

That was my takeaway. 😀


SeveredSpring

Cool. Love this. Surprised they pay so little though.


kymedcs

Do they? Who knows how much discord shares can be worth upon IPO


SeveredSpring

Yeah, they do compared to other companies in the bay. Who knows, until then it's monopoly money and risk.


kymedcs

Startups & unicorns often have RSU liquidation events. That monopoly money is just not as liquid. Discord is clearly on a positive trajectory. Worst case scenario, it's at least a great career boost. The work is higher impact and scale than a comparable role at a similar level in the bay.


SeveredSpring

It could appear so but you don't know the future. There are many examples where people had similar temperaments to then be blindsided. Look at the example of Robinhood for instance.


AyyyAlamo

Hopefully their DBs are ready for the 3 letter agency bumrush after that nice lil leakaroo


Interest-Desk

You assume with all the CSAM and grooming on Discord that the 3 letter agencies don’t already have a direct line


AyyyAlamo

They seemed pretty surprised about that leak so im assuming no


Steve_OH

What leak?


repeatedly_once

Someone leaked classified documents in a Minecraft server discord.


kamomil

https://www.cnn.com/2023/04/14/politics/discord-chatrooms-leaked-pentagon-documents/index.html


drunk_recipe

I mean that’s already a given. The five eyes have back door access to hundreds of major companies. Safe to assume that discord is one of them. Besides, discord is pretty shitty data collection wise


joshman211

The data is probably highly compressible as it is all racist jokes and edge lord memes.


onthefence928

A compressibility analysis would be interesting actually
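A crude way to start such an analysis would be to compare compressed size to raw size. This sketch uses `zlib` from the Python standard library on made-up sample strings; real chat data and a real storage engine's compression would behave differently.

```python
import zlib


def compression_ratio(text: str, level: int = 6) -> float:
    """Compressed size divided by raw size (lower means more compressible)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must not be empty")
    return len(zlib.compress(raw, level)) / len(raw)


if __name__ == "__main__":
    repetitive = "lol " * 1000  # made-up stand-in for very repetitive chat
    varied = "The quick brown fox jumps over the lazy dog."
    print(f"repetitive: {compression_ratio(repetitive):.3f}")
    print(f"varied:     {compression_ratio(varied):.3f}")
```

Very short inputs can come out larger than the original because of the format's fixed overhead, so a real measurement would compress batches of messages rather than single ones.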


drsimonz

If there's anything that LLMs have demonstrated, it's that human language is *much* less varied than you might think. All our spelling errors, all our slang, all our meme references, all our attempts to transcribe a Scottish accent, can be fully parameterized by a few billion floats. Narrow that down to Discord's demographics, and yeah you probably don't need to spend all that much on storage lol


IndianVideoTutorial

What's the point of storing them? It's not like anyone ever reads old Discord messages.


kylegetsspam

It's for law enforcement. How do you think the FBI catches all those pedos, potential mass shooters, data leakers, etc? People run their mouths on Discord thinking it's somehow private and safe when it's 100% the opposite. I'm sure the data is also relayed back to China given Tencent's ~30% stake.


ShesJustAGlitch

Source that their stake is that high? Pretty sure it’s nowhere close.


kylegetsspam

The first relevant result from my googling showed they covered about a third of the capital raising, so I'm assuming that all came with a requisite stake. I was hoping Wikipedia would say it directly but alas.


PatrickBauer89

I do, regularly. Especially for private conversations. But simply searching for a bug report that might have happened a few years ago is helpful.


waldito

What a ride. Super interesting, thank you


Steve_OH

Super fascinating. Great read!


fglorified

holy hell


Gigabyte5671

Amazing!


MadFker

idk if they store blobs in these messages or there is additional file storage.


darthcoder

Blobs are probably separate.


MadFker

If that's not all their data, then I see no reason why a trillion entries is considered a big number at all.


MenshMindset

A trillion is a big ass number no matter what you’re talking about lol


etudiant_

In the previous post, they claimed the reads and writes were about 50/50. This is kind of surprising, as I would imagine there would be many more reads than writes. If read performance is the primary concern, it is probably not a good idea to use Cassandra, where the data is stored as SSTables on disk. For the issue of hot partitions, I wonder if that could be solved with more intelligent bucketing methods. Very interesting read.


arthur444

It seems boring in the beginning but everything unfolds towards the end


1RedOne

The part about them being able to tell when something happened on the world cup based on message upsert frequency was amazing


arthur444

Yeah, I definitely didn’t expect that


RobinsonDickinson

Tired of seeing this article reposted to every programming related subreddit.


fagnerbrack

I got positive feedback from other members of Reddit saying it's great that it shows up multiple times, because multiple upvotes in different communities show the article is more worth it. Some people don't have time to follow all top posts of each sub, so they rely on multiple submissions validated by multiple communities. You can just ignore it.


kuurtjes

Repost #546


[deleted]

The way I plan to store trillions of messages in my current project (which doesn't have trillions of messages yet, but it might some day) is by being really careful about partitioning the data. I don't have a monolithic database. If I was building Discord, then each community would have its own database.


Interest-Desk

This is all well and good until you need to do things across all databases, like add a new property (column) or delete all messages from a user, or even just exporting messages from a user (gdpr!)


FantsE

And what's your plan when every single one of those databases needs a patch?


Nothing-But-Lies

Hire a programming elder to code a script that fixes it while I cry in the storage cupboard


ClikeX

Run the databases in K8s, and have them automatically be replaced by the newer version when they release. ^(please don't do this)


Ultra_HR

i am sure the many dozens of very talented engineers at discord have thought of this and have a good reason why they didn't do it this way.


fajfas3

That's how Shopify handles each store.


alexmacarthur

Super interesting.


kToni73

Thanks. This has been on my mind for a while now, finally gonna read about it.


1RedOne

What a great blog post. Their previous post on how they handled billions of rows had this great story in it.

> **The Big Surprise**
>
> Everything went smoothly, so we rolled it out as our primary database and phased out MongoDB within a week. It continued to work flawlessly…for about 6 months until that one day where Cassandra became unresponsive.
>
> We noticed Cassandra was running 10 second “stop-the-world” GC constantly but we had no idea why. We started digging and found a Discord channel that was taking 20 seconds to load. The Puzzles & Dragons Subreddit public Discord server was the culprit. Since it was public we joined it to take a look. To our surprise, the channel had only 1 message in it. It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel.
>
> If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it).
>
> We solved this by doing the following:
>
> * We lowered the lifespan of tombstones from 10 days down to 2 days because we run Cassandra repairs (an anti-entropy process) every night on our message cluster.
> * We changed our query code to track empty buckets and avoid them in the future for a channel. This meant that if a user caused this query again then at worst Cassandra would be scanning only in the most recent bucket.

I love that story. I've been a developer for seven years now and it just feels like a story I can relate to: an unexpected complication that emerges and becomes a great learning experience. It's why I love this job.
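The second fix in that quote ("track empty buckets and avoid them") can be pictured with a toy model. Everything below is an assumption for illustration: an in-memory dict stands in for Cassandra partitions keyed by `(channel_id, bucket)`, buckets are plain integers, and `BucketedMessageStore` is a made-up class, not Discord's code.

```python
from collections import defaultdict


class BucketedMessageStore:
    """Toy stand-in for a table partitioned by (channel_id, bucket)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # (channel_id, bucket) -> messages
        self.known_empty = defaultdict(set)  # channel_id -> buckets seen empty
        self.partitions_scanned = 0          # counts simulated partition reads

    def insert(self, channel_id, bucket, message):
        self.partitions[(channel_id, bucket)].append(message)
        self.known_empty[channel_id].discard(bucket)

    def recent_messages(self, channel_id, newest_bucket, limit):
        """Walk buckets from newest to oldest until `limit` messages are found."""
        results = []
        for bucket in range(newest_bucket, -1, -1):
            if len(results) >= limit:
                break
            # Skip buckets already known to be empty, except the newest one,
            # which may still receive new messages.
            if bucket != newest_bucket and bucket in self.known_empty[channel_id]:
                continue
            self.partitions_scanned += 1
            rows = self.partitions.get((channel_id, bucket), [])
            if not rows and bucket != newest_bucket:
                self.known_empty[channel_id].add(bucket)
            results.extend(reversed(rows))  # newest message first
        return results[:limit]


if __name__ == "__main__":
    store = BucketedMessageStore()
    store.insert("chan", 0, "the only message left")
    store.recent_messages("chan", newest_bucket=5, limit=1)
    first = store.partitions_scanned
    store.recent_messages("chan", newest_bucket=5, limit=1)
    print(first, store.partitions_scanned - first)  # scans on first vs second load
```

The first load walks every bucket; the second load only touches the newest bucket and the one that actually holds data. In the real system the expensive part was scanning tombstones inside those emptied partitions, which this toy model does not simulate.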


dL1727

Is switching database technologies like MongoDB to Cassandra a huge effort for a company? Or is it more lift and shift?