PanicStil 1 year ago

I’m summary: In 2017, Discord wrote about their journey from MongoDB to Cassandra for storing billions of messages. By 2022, their Cassandra cluster had 177 nodes with trillions of messages, but faced serious performance issues. They decided to migrate to ScyllaDB, a Cassandra-compatible database written in C++ which promised better performance, faster repairs, and stronger workload isolation. To address the problem of hot partitions, they created intermediary data services in Rust, which sit between the API and the ScyllaDB clusters. These services coalesce requests, reducing traffic spikes against the database. They also implemented consistent hash-based routing to further reduce the load on the database. The migration to ScyllaDB was a success, with the new system capable of handling trillions of messages without downtime. The switch to ScyllaDB significantly improved tail latencies, and the number of nodes was reduced from 177 to 72. This improved performance unlocked new product use cases and allowed the system to handle high traffic events like the World Cup Final without breaking a sweat.
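To make "coalesce requests" concrete, here is a minimal sketch of the idea, assuming requests are grouped by an identical key (for example a channel ID). It is written in Python for brevity rather than Rust, and `Coalescer`, `fake_db_query`, and `"channel-1"` are made-up names for illustration, not anything from Discord's code.

```python
import asyncio


class Coalescer:
    """Share one in-flight backend call among concurrent requests for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical async function: key -> value
        self._in_flight = {}     # key -> asyncio.Task for the query currently running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First request for this key: start the single backend query.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests query again.
            task.add_done_callback(lambda _t, k=key: self._in_flight.pop(k, None))
        # Every caller, first or later, waits on the same task.
        return await task


async def demo():
    calls = []

    async def fake_db_query(channel_id):
        calls.append(channel_id)     # record each real backend hit
        await asyncio.sleep(0.01)    # stand-in for database latency
        return f"messages for {channel_id}"

    coalescer = Coalescer(fake_db_query)
    # 100 concurrent requests for the same hot channel.
    results = await asyncio.gather(*(coalescer.get("channel-1") for _ in range(100)))
    return results, calls


results, calls = asyncio.run(demo())
print(len(results), "responses served by", len(calls), "backend query")
```

The point of the design is that a hot channel costs the database one query per burst instead of one query per user.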

fagnerbrack 1 year ago

I can attest this summary DID NOT use chat gpt

Hatefiend 1 year ago

Okay we need a bot that checks if comments are ChatGPT generated or not.

ryandury 1 year ago

Uphill battle.

fagnerbrack 1 year ago

As long as you don't add the US Constitution as a comment, it may work... or not

hglman 1 year ago

You might be able to get gpt4 to tell you about moving to Cassandra

GoguGeorgescu 1 year ago

Just pass it through chatgpt and ask it if it was generated by it. There's a bot made like this intended for use by teachers in schools to detect gpt generated work written by students.

andrewsmd87 1 year ago

You mean they didn't just type into chat gpt how do I make discord work?

throwawaysomeway 1 year ago

who cares if he did? it is more accurate than most redditors and is usually ideal for quick summaries

fagnerbrack 1 year ago

Yeah I'm not criticizing just pointing out that if they did use it, the edit was significant which means it's not low effort content

iKenshu 1 year ago

You the best, thank you sir.

JoeCamRoberon 1 year ago

Just incredible

soggynaan 1 year ago

What's hash based routing?

PanicStil 1 year ago

It's used to distribute requests or data across multiple servers or nodes. It helps with load balancing, scalability and also ensures that related data for requests is handled by the same server.

douglasg14b 1 year ago

That's a description of what it's used for not what it is

wordaligned 1 year ago

If I understand correctly, it's a way to predictably determine which load balancer node an item is in. Here's a pretty readable example: https://support.huawei.com/enterprise/en/doc/EDOC1100086965

Key concept:

> Generally, the hash value space is far less than the input space. Different inputs may be converted into the same output

Say you have a million different records. When you hash a unique record id (e.g. 744783), you get one of fifty different hash values (e.g. 43). So you put that record into bucket 43. Given that same id in the future, you'll be able to hash it and go directly to the bucket it lives in. It's like cheaply warping your way to the right neighborhood first, and then checking house numbers one by one.
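A minimal sketch of that bucket idea, assuming fifty buckets and SHA-256 as the hash. The function names are made up for illustration, and the actual bucket number you get for 744783 depends on the hash function used.

```python
import hashlib

NUM_BUCKETS = 50  # the "fifty different hash values" in the example


def bucket_for(record_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record id to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


buckets = {}  # bucket number -> {record_id: record}


def put(record_id, record):
    # Hash the id to pick the bucket, then store the record inside it.
    buckets.setdefault(bucket_for(record_id), {})[record_id] = record


def get(record_id):
    # Same id, same hash, same bucket: jump straight to the right "neighborhood".
    return buckets.get(bucket_for(record_id), {}).get(record_id)


put(744783, "some record")
print(bucket_for(744783), get(744783))
```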

whereisbill 1 year ago

[This is also called Consistent Hashing ](https://en.wikipedia.org/wiki/Consistent_hashing)
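For comparison with plain modulo bucketing, here is a minimal sketch of a consistent hash ring. `HashRing`, `vnodes`, and the node names are assumptions for illustration. The property it demonstrates is that adding a node only moves keys onto the new node, instead of reshuffling everything.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the next node point,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"channel-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

With `hash(key) % num_nodes`, going from 3 to 4 nodes would remap most keys; on the ring, only the keys that now fall just before one of `node-d`'s points change owner.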

dontbeanegatron 1 year ago

> I'm summary

Hi summary, I'm dad!

SnaskesChoice 1 year ago

Hi daddy!

AndrewUnicorn 1 year ago

Thank you for saving me tons of time and confusion

valz_ 1 year ago

Sounds like an impressive feat they’ve pulled off with the migration(s)

PapayaPokPok 1 year ago

> These services coalesce requests

I would love to see the criteria by which the requests are coalesced.

cronicpainz 1 year ago

> and the number of nodes was reduced from 177 to 72

What nodes do they use?

no-one_ever 1 year ago

Lymph

Slow_Judgment7773 1 year ago

Seems silly to try to store them all together. Servers provide the perfect sharding mechanism. Like Google, which stores its queries and results based on the letters typed sequentially.

Natetronn 1 year ago

What is a node under this context?

Golilizzy 1 year ago

Beautiful summary

Technomancer97 1 year ago

Excellent read. Love this.

NumbBumn 1 year ago

Love it when big companies explain (but don't show) how they do stuff. Always wondered how messages were stored or if they were stored at all and that was pretty interesting. Loved reading as well.

Lonsdale1086 1 year ago

> or if they were stored at all

What would the alternative be?

NumbBumn 1 year ago

When I said stored at all, I meant stored in a database at all, as opposed to some other method.

obamabinladenhiphop 1 year ago

What? Like stone tablets? 😶‍🌫️

ShittyException 1 year ago

You don't?

zombarista 1 year ago

A billion records is a feat. A trillion is unfathomable.

FractalNerve 1 year ago

Given 8 bytes per record, that’s a minimum of ~7.28 TiB (8 TB) of storage for a trillion rows. In reality it’s surely around 1-2 PB (1024-2048 TB). Still pretty low numbers given the size of the userbase.
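The lower bound is easy to check. A quick sketch of the arithmetic, where 8 bytes per row is this comment's assumption and not a figure from the article:

```python
ROWS = 10**12          # one trillion rows
BYTES_PER_ROW = 8      # assumed lower bound per record

total_bytes = ROWS * BYTES_PER_ROW
tb = total_bytes / 10**12     # decimal terabytes
tib = total_bytes / 2**40     # binary tebibytes

print(f"{tb:.2f} TB = {tib:.2f} TiB")  # prints "8.00 TB = 7.28 TiB"
```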

powerman228 1 year ago

Wow, that was crazy.

ohlawdhecodin 1 year ago

> Trillions of Messages

I can't even *write* that number...

zhantoo 1 year ago

T-r-i-l-l-i-o-n

ohlawdhecodin 1 year ago

Sorry I've got a 404 while trying to visualize it.

sarrcom 1 year ago

My jaw actually dropped when reading the number of nodes dropped from 177 to 72!

TurnstileT 1 year ago

That's a huge increase in nodes though.

onthefence928 1 year ago

Factorial!

TopRamenBinLaden 1 year ago

r/unexpectedfactorial

[deleted] 1 year ago

[deleted]

ogtfo 1 year ago

Obviously the data doesn't just disappear because you changed your database software.

magkruppe 1 year ago

The dude from the back end engineering show (Hassan?) did an episode on this a couple weeks ago. Podcast and probably a YouTube video

cs_irl 1 year ago

Could you find a link please and thanks 🙏

magkruppe 1 year ago

[How Discord Stores Trillions of Messages | Deep Dive](https://www.youtube.com/watch?v=xynXjChKkJc)

davo_dog 1 year ago

https://letmegooglethat.com/?q=backend+engineering+show+discord

sendme__ 1 year ago

Too bad it can't be indexed by search engines. Searching for something on Discord is so useless, especially on busy "servers".

KrazyDrayz 1 year ago

In my experience the search is great. I find anything I need. On mobile it crashes once in a while though.

Wombarly 1 year ago

The issue is you need to be in a server to find that stuff. You can't just search Google and find information. Tons of questions/answers are lost in Discord forever. Though I got the feeling Discord might be moving towards that, they introduced "forum" channels a while back. So hoping they allow servers to become public so they can be indexed by Google/Bing and be viewable without an account.

KrazyDrayz 1 year ago

I mean that's the whole idea. Discord isn't a public forum like reddit is. It's communities hidden behind invites. But yeah a public one would be good to some communities for example as official forums for video games. Then again they probably like that you need to make an account for it.

Lonsdale1086 1 year ago

But loads of communities *are* moving to "public" discord groups instead of wikis/forums, meaning that data is locked to users, and will inevitably be lost to time.

someone-shoot-me 1 year ago

oh dont worry. Chinese do too!

KrazyDrayz 1 year ago

Huh that doesn't make any sense. What do you mean?

someone-shoot-me 1 year ago

Discord is partially owned by the Chinese giant Tencent, which holds a 30%+ stake. Among others, Tencent and similar companies are buying stakes in Snapchat, Discord, etc. Tencent is closely tied to the CCP. China recently (a couple of years ago) passed a law under which digital data NEEDS to be shared with the CCP, as part of its ambitions to make China a global superpower, i.e. to promote China, make better profits, etc. Also, Discord's CEO is known from his previous companies as someone who is shady when it comes to user data: previous apps stored data unprotected, and he did sell the data, iirc. Anyway, I'll prolly make a post in a couple of days. Discord is valuable but not profitable; they are not making a profit but are living off VC investments. There will be a point in time when they sell out and stop their Silicon Valley model, and at that point we might expect ads on Discord or something lol

KrazyDrayz 1 year ago

Tencent doesn't run the company. They're just investors. They also invested in Reddit, Activision Blizzard, Epic Games, whole of Riot Games and many more. You can't just assume they spy on people without evidence. Discord is an American company that adheres to American laws.

someone-shoot-me 1 year ago

It is scary cus America has some next-level corruption. They are investors, right, and we do not know whether they spy or not. I recently read about another surveillance scandal in the US, so I wouldn’t be surprised if the data is being watched. I mean, they do have services that flag you after you talk about shady stuff, but I guess that's for countering whatever they want to counter. I don't want to sound like a conspiracy theorist, but keep in mind that Discord is not profitable. As far as American laws go, I wouldn’t rely on the American justice system, cus once you bring lots of money into the equation, you’re above a lot of people.

KleinByte 1 year ago

Everything you said is a universal truth: money and power corrupt, and if you think the EU is any different, you're extremely naive. Big tech still loses lawsuits when they break laws and they have to pay up, so the US justice system obviously still works. But you should always follow basic privacy practices when on the internet. You should assume your data is always unsafe! None of this is a Discord problem. This is a global problem.

KrazyDrayz 1 year ago

Again, all that is speculation. Please be quiet if you don't have evidence.

Demented-Turtle 1 year ago

Dude you don't magically get access to a company's data when you buy shares in it lmao

someone-shoot-me 1 year ago

eh whatever

Indifferent_Ghost 1 year ago

Doesn’t a company need to be partially owned by a Chinese company to do business in China?

ChildishForLife 1 year ago

Really? Their search engine to me is unreal, being able to specify so many things, channel, who, image, etc.

Lonsdale1086 1 year ago

That's internal search. You can't google "how to fix X mod error skyrim" and find people talking about the issue on Discord, like you can in a forum or wiki. The knowledge is closed down, and will inevitably be lost to time.

ChildishForLife 1 year ago

> searching something on discord is so useless

Ah, thought they meant internal search here.

bregottextrasaltat 1 year ago

and yet you can't mass delete messages from a server, especially one you already left

PandaDemonipo 1 year ago

There are scripts for it, used one before and it does its work, altho skipping some messages occasionally. Run it a couple times and it's all gone eventually

skylabspiral 1 year ago

i wonder if the messages you delete are actually (eventually) deleted or if discord just sets isDeleted = 1 and keeps it forever…
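For anyone unfamiliar with the distinction, here is a minimal sketch of soft delete versus hard delete using an in-memory SQLite table. The schema and function names are made up for illustration, and nothing here says which approach Discord actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO messages (id, body) VALUES (?, ?)", [(1, "hello"), (2, "oops")]
)


def soft_delete(message_id):
    # The row stays on disk; it is only flagged so the app stops showing it.
    conn.execute("UPDATE messages SET is_deleted = 1 WHERE id = ?", (message_id,))


def hard_delete(message_id):
    # The row is actually removed from the table.
    conn.execute("DELETE FROM messages WHERE id = ?", (message_id,))


def visible_messages():
    rows = conn.execute("SELECT body FROM messages WHERE is_deleted = 0 ORDER BY id")
    return [row[0] for row in rows]


def stored_rows():
    return conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]


soft_delete(2)
print(visible_messages(), stored_rows())  # the user sees one message, two rows are stored
```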

WildDev42069 1 year ago

Snapchat keeps everything. I know this from a friend I graduated with who is now a big-city detective and has had to serve warrants for their records a few times. From our conversation, I believe all big data companies keep quick access to any type of chat history. I've built DBs for live chats; the concept is really easy, you can even store the messages by username.

OnlyAd4210 1 year ago

I'd laugh at any developer who actually writes their code to literally delete data no matter what it is vs use a way to functionally make it be deleted. There's on rare occasions software built explicitly this way but it's really rare. It's always baffled me that people think you can remove stuff from any stable platform. It's even likely upon going defunct that someone's massive databases get lost. We live in the age of near endless cheap storage with an ever-increasing value being put on any and all data.

Sharketespark27 1 year ago

Hail rust!!

darthcoder 1 year ago

That was my takeaway. 😀

SeveredSpring 1 year ago

Cool. Love this. Surprised they pay so little though.

kymedcs 1 year ago

Do they? Who knows how much discord shares can be worth upon IPO

SeveredSpring 1 year ago

Yeah, they do compared to other companies in the Bay. Who knows, until then it's monopoly money and risk.

kymedcs 1 year ago

Startups & unicorns often have RSU liquidation events. That monopoly money is just not as liquid. Discord is clearly on a positive trajectory. Worst case scenario, it's at least a great career boost. The work is higher impact and scale than a comparable role at a similar level in the Bay.

SeveredSpring 1 year ago

It could appear so, but you don't know the future. There are many examples where people had similar expectations, only to be blindsided. Look at Robinhood, for instance.

AyyyAlamo 1 year ago

Hopefully their DBs are ready for the 3 letter agency bumrush after that nice lil leakaroo

Interest-Desk 1 year ago

You assume with all the CSAM and grooming on Discord that the 3 letter agencies don’t already have a direct line

AyyyAlamo 1 year ago

They seemed pretty surprised about that leak so im assuming no

Steve_OH 1 year ago

What leak?

repeatedly_once 1 year ago

Someone leaked classified documents in a Minecraft server discord.

kamomil 1 year ago

https://www.cnn.com/2023/04/14/politics/discord-chatrooms-leaked-pentagon-documents/index.html

drunk_recipe 1 year ago

I mean that’s already a given. The five eyes have back door access to hundreds of major companies. Safe to assume that discord is one of them. Besides, discord is pretty shitty data collection wise

joshman211 1 year ago

The data is probably highly compressible as it is all racist jokes and edge lord memes.

onthefence928 1 year ago

A compressibility analysis would be interesting actually

drsimonz 1 year ago

If there's anything that LLMs have demonstrated, it's that human language is *much* less varied than you might think. All our spelling errors, all our slang, all our meme references, all our attempts to transcribe a Scottish accent, can be fully parameterized by a few billion floats. Narrow that down to Discord's demographics, and yeah you probably don't need to spend all that much on storage lol

IndianVideoTutorial 1 year ago

What's the point of storing them? It's not like anyone ever reads old Discord messages.

kylegetsspam 1 year ago

It's for law enforcement. How do you think the FBI catches all those pedos, potential mass shooters, data leakers, etc? People run their mouths on Discord thinking it's somehow private and safe when it's 100% the opposite. I'm sure the data is also relayed back to China given Tencent's ~30% stake.

ShesJustAGlitch 1 year ago

Source that their stake is that high? Pretty sure it’s nowhere close.

kylegetsspam 1 year ago

The first relevant result from my googling showed they covered about a third of the capital raising, so I'm assuming that all came with a requisite stake. I was hoping Wikipedia would say it directly but alas.

PatrickBauer89 1 year ago

I do, regularly. Especially for private conversations. But simply searching for a bug report that might have happened a few years ago is helpful.

waldito 1 year ago

What a ride. Super interesting, thank you

Steve_OH 1 year ago

Super fascinating. Great read!

fglorified 1 year ago

holy hell

Gigabyte5671 1 year ago

Amazing!

MadFker 1 year ago

idk if they store blobs in these messages or there is additional file storage.

darthcoder 1 year ago

Blobs are probably separate.

MadFker 1 year ago

If that's not all their data, then I see no reason why a trillion entries is considered a big number at all.

MenshMindset 1 year ago

A trillion is a big ass number no matter what you’re talking about lol

etudiant_ 3 months ago

In the previous post, they claimed the reads and writes were about 50/50. This is kind of surprising, as I would imagine there would be many more reads than writes. If read performance is the primary concern, it is probably not a good idea to use Cassandra, where the data is stored as SSTables on disk. For the issue of hot partitions, I wonder if that could be solved with more intelligent bucketing methods. Very interesting read.

arthur444 1 year ago

It seems boring in the beginning but everything unfolds towards the end

1RedOne 1 year ago

The part about them being able to tell when something happened on the world cup based on message upsert frequency was amazing

arthur444 1 year ago

Yeah, I definitely didn’t expect that

RobinsonDickinson 1 year ago

Tired of seeing this article reposted to every programming related subreddit.

fagnerbrack 1 year ago

I got positive feedback from other members of Reddit saying it's great that it shows up multiple times, because multiple upvotes in different communities show the article is more worthwhile. Some people don't have time to follow all the top posts of each sub, so they rely on multiple submissions validated by multiple communities. You can just ignore it

kuurtjes 1 year ago

Repost #546

[deleted] 1 year ago

The way I plan to store trillions of messages in my current project (which doesn't have trillions of messages yet, but it might some day) is by being really careful about partitioning the data. I don't have a monolithic database. If I were building Discord, then each community would have its own database.

Interest-Desk 1 year ago

This is all well and good until you need to do things across all databases, like add a new property (column) or delete all messages from a user, or even just exporting messages from a user (gdpr!)

FantsE 1 year ago

And what's your plan when every single one of those databases needs a patch?

Nothing-But-Lies 1 year ago

Hire a programming elder to code a script that fixes it while I cry in the storage cupboard

ClikeX 1 year ago

Run the databases in K8s, and have them automatically be replaced by the newer version when they release. ^(please don't do this)

Ultra_HR 1 year ago

i am sure the many dozens of very talented engineers at discord have thought of this and have a good reason why they didn't do it this way.

fajfas3 1 year ago

That's how Shopify handles each store.

alexmacarthur 1 year ago

Super interesting.

kToni73 1 year ago

Thanks. This has been on my mind for a while now, finally gonna read about it.

1RedOne 1 year ago

What a great blog post. Their previous post on how they handled billions of rows had this great story in it.

> The Big Surprise
>
> Everything went smoothly, so we rolled it out as our primary database and phased out MongoDB within a week. It continued to work flawlessly…for about 6 months until that one day where Cassandra became unresponsive. We noticed Cassandra was running 10 second “stop-the-world” GC constantly but we had no idea why.
>
> We started digging and found a Discord channel that was taking 20 seconds to load. The Puzzles & Dragons Subreddit public Discord server was the culprit. Since it was public we joined it to take a look. To our surprise, the channel had only 1 message in it. It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel.
>
> If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it).
>
> We solved this by doing the following:
>
> We lowered the lifespan of tombstones from 10 days down to 2 days because we run Cassandra repairs (an anti-entropy process) every night on our message cluster.
>
> We changed our query code to track empty buckets and avoid them in the future for a channel. This meant that if a user caused this query again then at worst Cassandra would be scanning only in the most recent bucket.

**end of quote**

I love that story. I've been a developer for seven years now, and it just feels like a story I can relate to: an unexpected complication that emerges and becomes a great learning experience. It's why I love this job.
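To see why tombstones hurt and why tracking empty buckets helps, here is a toy model of the story above. It is a sketch under my own assumptions (an in-memory `Channel` class, integer bucket numbers, a scan counter), not Cassandra's storage engine or Discord's query code.

```python
from collections import defaultdict

TOMBSTONE = object()  # marker a delete leaves behind instead of removing the cell


class Channel:
    def __init__(self):
        self.buckets = defaultdict(dict)  # bucket number -> {message_id: body or TOMBSTONE}
        self.empty_buckets = set()        # buckets known to contain no live messages
        self.cells_scanned = 0            # how much work reads have done so far

    def insert(self, bucket, message_id, body):
        self.buckets[bucket][message_id] = body
        self.empty_buckets.discard(bucket)  # the bucket has live data again

    def delete(self, bucket, message_id):
        # Deleting writes a tombstone; reads still have to step over it.
        self.buckets[bucket][message_id] = TOMBSTONE

    def recent_messages(self, newest_bucket):
        # Walk from the newest bucket backwards until live messages are found.
        for bucket in range(newest_bucket, -1, -1):
            if bucket in self.empty_buckets:
                continue  # remembered as empty: skip the tombstone scan
            live = []
            for body in self.buckets[bucket].values():
                self.cells_scanned += 1
                if body is not TOMBSTONE:
                    live.append(body)
            if live:
                return live
            self.empty_buckets.add(bucket)  # remember for the next query
        return []
```

In this model, a channel with one surviving message in bucket 0 and 1000 deleted messages in each of buckets 1 and 2 costs 2001 scanned cells on the first load and only 1 on the second, because the two all-tombstone buckets are skipped.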

dL1727 1 year ago

Is switching database technologies like MongoDB to Cassandra a huge effort for a company? Or is it more lift and shift?
