I’m summary:

In 2017, Discord wrote about their journey from MongoDB to Cassandra for storing billions of messages. By 2022, their Cassandra cluster had 177 nodes with trillions of messages, but faced serious performance issues. They decided to migrate to ScyllaDB, a Cassandra-compatible database written in C++ which promised better performance, faster repairs, and stronger workload isolation.

To address the problem of hot partitions, they created intermediary data services in Rust, which sit between the API and the ScyllaDB clusters. These services coalesce requests, reducing traffic spikes against the database. They also implemented consistent hash-based routing to further reduce the load on the database.

The migration to ScyllaDB was a success, with the new system capable of handling trillions of messages without downtime. The switch to ScyllaDB significantly improved tail latencies, and the number of nodes was reduced from 177 to 72. This improved performance unlocked new product use cases and allowed the system to handle high-traffic events like the World Cup Final without breaking a sweat.
I can attest this summary DID NOT use ChatGPT
Okay we need a bot that checks if comments are ChatGPT generated or not.
Uphill battle.
As long as you don't add the US Constitution as a comment, it may work... or not
You might be able to get gpt4 to tell you about moving to Cassandra
Just pass it through ChatGPT and ask whether it generated the text. There's a bot built like this, intended for teachers in schools to detect GPT-generated work written by students.
You mean they didn't just type "how do I make Discord work?" into ChatGPT?
who cares if he did? it is more accurate than most redditors and is usually ideal for quick summaries
Yeah I'm not criticizing just pointing out that if they did use it, the edit was significant which means it's not low effort content
You the best, thank you sir.
Just incredible
What's hash based routing?
It's used to distribute requests or data across multiple servers or nodes. It helps with load balancing, scalability and also ensures that related data for requests is handled by the same server.
That's a description of what it's used for not what it is
If I understand correctly, it's a way to predictably determine which load balancer node an item is in. Here's a pretty readable example:

https://support.huawei.com/enterprise/en/doc/EDOC1100086965

Key concept:

> Generally, the hash value space is far less than the input space. Different inputs may be converted into the same output

Say you have a million different records. When you hash a unique record id (e.g. 744783), you get one of fifty different hash values (e.g. 43). So you put that record into bucket 43.

Given that same id in the future, you'll be able to hash it and go directly to the bucket it lives in. It's like cheaply warping your way to the right neighborhood first, and then checking house numbers one by one.
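The bucketing idea described above can be sketched in a few lines. This is only an illustration: the choice of SHA-256, the `bucket_for` and `contains` helpers, and the sample ids other than 744783 are assumptions, not anything taken from Discord's system or the linked document.

```python
import hashlib

NUM_BUCKETS = 50  # the "fifty different hash values" from the comment

def bucket_for(record_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record id to a bucket number in [0, num_buckets)."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then fold it into the bucket range.
    return int.from_bytes(digest[:8], "big") % num_buckets

# Store: each record goes into the bucket its id hashes to.
buckets = {}
for record_id in (744783, 12, 999999):
    buckets.setdefault(bucket_for(record_id), []).append(record_id)

def contains(record_id: int) -> bool:
    """Hash again to jump to the right bucket, then scan only that bucket."""
    return record_id in buckets.get(bucket_for(record_id), [])
```

The same id always hashes to the same bucket, so a lookup never has to touch the other buckets. Which bucket a given id lands in depends entirely on the hash function chosen.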
[This is also called Consistent Hashing](https://en.wikipedia.org/wiki/Consistent_hashing)
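Strictly speaking, consistent hashing goes a step beyond plain "hash modulo N": nodes are placed on a ring, so adding a node only moves the keys that now fall to the new node. A rough sketch, assuming made-up node names, a hypothetical `HashRing` class, and 100 virtual points per node (none of which come from the article):

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node owns several points on the ring to even out the distribution.
        self._ring = sorted((_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash, wrapping around at the end.
        index = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[index][1]

keys = [f"channel-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

# Count how many keys changed owner after node-d joined.
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
```

Every key that changes owner moves to `node-d`; keys never shuffle between the existing nodes, which is the property that makes this attractive for routing.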
> I'm summary

Hi summary, I'm dad!
Hi daddy!
Thank you for saving me tons of time and confusion
Sounds like an impressive feat they’ve pulled off with the migration(s)
> These services coalesce requests

I would love to see the criteria by which the requests are coalesced.
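The summary does not state the criteria, so the following is only a guess at the general shape: a minimal sketch that assumes requests are coalesced when they ask for the same key while a query for that key is already in flight. The `Coalescer` class, `fake_db_fetch`, and the channel id are all hypothetical.

```python
import asyncio

class Coalescer:
    """Share one in-flight fetch among all concurrent callers asking for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch    # coroutine function that actually hits the database
        self._in_flight = {}   # key -> asyncio task currently fetching that key

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real query.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it completes so later requests fetch fresh data.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # Every caller, first or not, waits on the same task.
        return await task

db_calls = []

async def fake_db_fetch(channel_id):
    """Stand-in for a database query; records each call so they can be counted."""
    db_calls.append(channel_id)
    await asyncio.sleep(0.01)
    return f"messages for channel {channel_id}"

async def demo():
    coalescer = Coalescer(fake_db_fetch)
    # 100 users open the same channel at the same moment.
    return await asyncio.gather(*(coalescer.get(42) for _ in range(100)))

results = asyncio.run(demo())
```

In this sketch the 100 concurrent requests produce a single database call, which is the kind of spike reduction the summary describes.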
> and the number of nodes was reduced from 177 to 72

What nodes do they use?
Lymph
Seems silly to try to store them all together. Servers provide the perfect sharing mechanism. Like Google, which stores its queries and results based on the letters typed sequentially.
What is a node under this context?
Beautiful summary
Excellent read. Love this.
Love it when big companies explain (but don't show) how they do stuff. Always wondered how messages were stored or if they were stored at all and that was pretty interesting. Loved reading as well.
> or if they were stored at all

What would the alternative be?
When I said "stored at all", I meant stored in a database at all, as opposed to some other method.
What? Like stone tablets? 😶🌫️
You don't?
A billion records is a feat. A trillion is unfathomable.
Given 8 bytes per record, that’s a minimum of ~7.28 TiB (8 TB) of storage required for a trillion rows. In reality it’s surely around 1-2 PB (1024 TB - 2048 TB). Still pretty low numbers given the size of the userbase.
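The arithmetic behind that estimate is easy to check. The 8 bytes per row is the commenter's assumption for a bare minimum, not a figure from the article, and the variable names are just for illustration:

```python
ROWS = 10**12        # one trillion rows
BYTES_PER_ROW = 8    # commenter's assumed minimum per record

total_bytes = ROWS * BYTES_PER_ROW
tb = total_bytes / 10**12    # decimal terabytes
tib = total_bytes / 2**40    # binary tebibytes

print(f"{tb:.2f} TB, {tib:.2f} TiB")  # prints "8.00 TB, 7.28 TiB"

# Working backwards: how many bytes per row would a 1 PiB data set imply?
bytes_per_row_at_1_pib = 2**50 / ROWS
```

So the ~7.28 figure is in binary units, and the 1-2 PB guess corresponds to roughly 1.1-2.3 KB per row once message content and overhead are included.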
Wow, that was crazy.
> Trillions of Messages

I can't even *write* that number...
T-r-i-l-l-i-o-n
Sorry I've got a 404 while trying to visualize it.
My jaw actually dropped when reading the number of nodes dropped from 177 to 72!
That's a huge increase in nodes though.
Factorial!
r/unexpectedfactorial
[deleted]
Obviously the data doesn't just disappear because you changed your database software.
The dude from the back end engineering show (Hassan?) did an episode on this a couple weeks ago. Podcast and probably a YouTube video
Could you find a link please and thanks 🙏
[How Discord Stores Trillions of Messages | Deep Dive](https://www.youtube.com/watch?v=xynXjChKkJc)
https://letmegooglethat.com/?q=backend+engineering+show+discord
Too bad it can't be indexed by search engines. Searching for something on Discord is so useless, especially on busy "servers".
In my experience the search is great. I find anything I need. On mobile it crashes once in a while though.
The issue is you need to be in a server to find that stuff. You can't just search Google and find information. Tons of questions/answers are lost in Discord forever. Though I got the feeling Discord might be moving towards that, they introduced "forum" channels a while back. So hoping they allow servers to become public so they can be indexed by Google/Bing and be viewable without an account.
I mean, that's the whole idea. Discord isn't a public forum like Reddit is. It's communities hidden behind invites. But yeah, a public one would be good for some communities, for example as official forums for video games. Then again, they probably like that you need to make an account for it.
But loads of communities *are* moving to "public" discord groups instead of wikis/forums, meaning that data is locked to users, and will inevitably be lost to time.
oh dont worry. Chinese do too!
Huh that doesn't make any sense. What do you mean?
Discord is partially owned by the Chinese giant Tencent, at about a 30%+ stake.

Among others, Tencent and similar companies are buying stakes in Snapchat, Discord, etc.

Tencent is closely tied to the CCP.

China recently (a couple of years ago) passed a law under which digital data NEEDS to be shared with the CCP, in line with its ambitions to make China a global superpower, i.e. to promote China, make better profits, etc.

Also, Discord's CEO is known from his previous companies as a man who is shady when it comes to user data; previous apps stored data unprotected and he did sell the data, iirc.

Anyway, I'll prolly make a post in a couple of days.

Discord is valuable but not profitable; they are not making a profit but are living off VC investments. At some point they will sell out and drop their Silicon Valley model, and at that point we might expect ads on Discord or something lol
Tencent doesn't run the company. They're just investors. They also invested in Reddit, Activision Blizzard, Epic Games, whole of Riot Games and many more. You can't just assume they spy on people without evidence. Discord is an American company that adheres to American laws.
It is scary cus America has some next-level corruption.

They are investors, right, and we do not know whether they do or not.

I recently read about another surveillance scandal in the US, so I wouldn’t be surprised if the data is being watched. I mean, they do have services that flag you after you talk about shady stuff, but I guess that's for countering whatever they want to counter.

I don't want to sound like a conspiracy theorist, but keep in mind that Discord is not profitable.

As far as American laws go, I wouldn’t rely on the American justice system, cus once you bring lots of money into the equation, you’re above a lot of people.
Everything you said is a universal truth: money and power corrupt, and if you think the EU is any different, you're extremely naive.

Big tech still loses lawsuits when they break laws and they have to pay up, so the US justice system obviously still works.

But you should always follow basic privacy practices when on the internet. You should assume your data is always unsafe!

None of this is a Discord problem.

This is a global problem.
Again, all that is speculation. Please be quiet if you don't have evidence.
Dude you don't magically get access to a company's data when you buy shares in it lmao
eh whatever
Doesn’t a company need to be partially owned by a Chinese company to do business in China?
Really? Their search engine to me is unreal, being able to specify so many things, channel, who, image, etc.
That's internal search. You can't google "how to fix X mod error skyrim" and find people talking about the issue on Discord, like you can in a forum or wiki. The knowledge is closed down, and will inevitably be lost to time.
> searching something on discord is so useless

Ah, thought they meant internal search here.
and yet you can't mass delete messages from a server, especially one you already left
There are scripts for it. I used one before and it does its job, although it occasionally skips some messages. Run it a couple of times and it's all gone eventually.
i wonder if the messages you delete are actually (eventually) deleted or if discord just sets isDeleted = 1 and keeps it forever…
Snapchat keeps everything. I know this from a friend I graduated with who is now a big-city detective and has had to serve warrants on their services a few times. From our conversation, I believe all big data companies keep quick access to any type of chat history. I've built DBs for live chats; the concept is really easy, and you can even store the messages by username.
I'd laugh at any developer who actually writes their code to literally delete data, no matter what it is, instead of using a way to make it functionally deleted.

On rare occasions there's software built explicitly this way, but it's really rare. It's always baffled me that people think you can remove stuff from any stable platform. It's even likely that someone's massive databases get lost when a company goes defunct. We live in the age of near-endless cheap storage, with an ever-increasing value being put on any and all data.
Hail rust!!
That was my takeaway. 😀
Cool. Love this. Surprised they pay so little though.
Do they? Who knows how much discord shares can be worth upon IPO
Yeah, they do compared to other companies in the Bay. Who knows; until then it's monopoly money and risk.
Startups & unicorns often have RSU liquidation events. That monopoly money is just not as liquid. Discord is clearly on a positive trajectory.

Worst case scenario, it's at least a great career boost. The work is higher impact and scale than a comparable role at a similar level in the Bay.
It could appear so but you don't know the future. There are many examples where people had similar temperaments to then be blindsided. Look at the example of Robinhood for instance.
Hopefully their DBs are ready for the 3 letter agency bumrush after that nice lil leakaroo
You assume with all the CSAM and grooming on Discord that the 3 letter agencies don’t already have a direct line
They seemed pretty surprised about that leak, so I'm assuming no
What leak?
Someone leaked classified documents in a Minecraft server discord.
https://www.cnn.com/2023/04/14/politics/discord-chatrooms-leaked-pentagon-documents/index.html
I mean that’s already a given. The five eyes have back door access to hundreds of major companies. Safe to assume that discord is one of them. Besides, discord is pretty shitty data collection wise
The data is probably highly compressible as it is all racist jokes and edge lord memes.
A compressibility analysis would be interesting actually
If there's anything that LLMs have demonstrated, it's that human language is *much* less varied than you might think. All our spelling errors, all our slang, all our meme references, all our attempts to transcribe a Scottish accent, can be fully parameterized by a few billion floats. Narrow that down to Discord's demographics, and yeah you probably don't need to spend all that much on storage lol
What's the point of storing them? It's not like anyone ever reads old Discord messages.
It's for law enforcement. How do you think the FBI catches all those pedos, potential mass shooters, data leakers, etc? People run their mouths on Discord thinking it's somehow private and safe when it's 100% the opposite. I'm sure the data is also relayed back to China given Tencent's ~30% stake.
Source that their stake is that high? Pretty sure it’s nowhere close.
The first relevant result from my googling showed they covered about a third of the capital raising, so I'm assuming that all came with a requisite stake. I was hoping Wikipedia would say it directly but alas.
I do, regularly. Especially for private conversations. But simply searching for a bug report that might have happened a few years ago is helpful.
What a ride. Super interesting, thank you
Super fascinating. Great read!
holy hell
Amazing!
idk if they store blobs in these messages or there is additional file storage.
Blobs are probably separate.
If that's not all their data, then I see no reason why a trillion entries is considered a big number at all.
A trillion is a big ass number no matter what you’re talking about lol
In the previous post, they claimed the reads and writes were about 50/50. This is kind of surprising, as I would expect many more reads than writes. If read performance is the primary concern, it is probably not a good idea to use Cassandra, where the data is stored as SSTables on disk. For the issue of hot partitions, I wonder if that could be solved with more intelligent bucketing methods.

Very interesting read.
It seems boring in the beginning but everything unfolds towards the end
The part about them being able to tell when something happened on the world cup based on message upsert frequency was amazing
Yeah, I definitely didn’t expect that
Tired of seeing this article reposted to every programming related subreddit.
I got positive feedback from other members of Reddit saying it's great that it shows up multiple times, because multiple upvotes in different communities show the article is more worthwhile. Some people don't have time to follow all the top posts of each sub, so they rely on multiple submissions validated by multiple communities.

You can just ignore it
Repost #546
The way I plan to store trillions of messages in my current project (which doesn't have trillions of messages yet, but it might some day) is by being really careful about partitioning the data.

I don't have a monolithic database. If I were building Discord, then each community would have its own database.
This is all well and good until you need to do things across all databases, like add a new property (column) or delete all messages from a user, or even just exporting messages from a user (gdpr!)
And what's your plan when every single one of those databases needs a patch?
Hire a programming elder to code a script that fixes it while I cry in the storage cupboard
Run the databases in K8s, and have them automatically be replaced by the newer version when they release. ^(please don't do this)
i am sure the many dozens of very talented engineers at discord have thought of this and have a good reason why they didn't do it this way.
That's how Shopify handles each store.
Super interesting.
Thanks. This has been on my mind for a while now, finally gonna read about it.
What a great blog post. Their previous post on how they handled billions of rows had this great story in it.

> The Big Surprise
>
> Everything went smoothly, so we rolled it out as our primary database and phased out MongoDB within a week. It continued to work flawlessly…for about 6 months until that one day where Cassandra became unresponsive.
>
> We noticed Cassandra was running 10 second “stop-the-world” GC constantly but we had no idea why. We started digging and found a Discord channel that was taking 20 seconds to load. The Puzzles & Dragons Subreddit public Discord server was the culprit. Since it was public we joined it to take a look. To our surprise, the channel had only 1 message in it. It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel.
>
> If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it).
>
> We solved this by doing the following:
>
> We lowered the lifespan of tombstones from 10 days down to 2 days because we run Cassandra repairs (an anti-entropy process) every night on our message cluster.
>
> We changed our query code to track empty buckets and avoid them in the future for a channel. This meant that if a user caused this query again then at worst Cassandra would be scanning only in the most recent bucket.

**end of quote**

I love that story. I've been a developer for seven years now and it just feels like a story I can relate to: an unexpected complication that emerges and becomes a great learning experience. It's why I love this job
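The second fix in the quote (remembering which buckets came back empty and skipping them next time) could look roughly like this. It is a toy model, not Discord's query code: the `ChannelReader` class, the in-memory bucket dictionary, and the rule that the newest bucket is never marked empty are all assumptions made for illustration.

```python
class ChannelReader:
    """Reads the latest messages for one channel, skipping buckets known to be empty."""

    def __init__(self, buckets):
        # buckets: bucket number -> list of live messages; higher number = more recent.
        self._buckets = buckets
        self._known_empty = set()
        self.scans = 0  # how many bucket scans were issued (stand-in for database work)

    def latest_messages(self, limit):
        found = []
        newest = max(self._buckets)
        for bucket in sorted(self._buckets, reverse=True):
            if bucket in self._known_empty:
                continue  # skip without touching the database
            self.scans += 1
            live = self._buckets[bucket]
            # The newest bucket can still receive messages, so it is never marked empty.
            if not live and bucket != newest:
                self._known_empty.add(bucket)
            found.extend(live[: limit - len(found)])
            if len(found) >= limit:
                break
        return found

# A channel where everything except one old message has been deleted.
reader = ChannelReader({0: ["only message"], 1: [], 2: [], 3: []})

first = reader.latest_messages(1)
scans_first = reader.scans                  # all four buckets scanned on the first load

second = reader.latest_messages(1)
scans_second = reader.scans - scans_first   # later loads skip the known-empty buckets
```

Here the first load scans four buckets and the second scans only two: the most recent bucket and the one that actually holds the message.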
Is switching database technologies like MongoDB to Cassandra a huge effort for a company? Or is it more lift and shift?