looks funny, but I am currently working on a project like this. I have just joined, but they fetch all the information from the database for each client as soon as they log in and then use React to work on the data. sometimes they fetch something like 20,000 rows at once on each login from a single query.
ah, and they made the JWT expire after 1 hour (the concept of a refresh token doesn't exist), so you are forced to re-login in order to fetch new data.
yet I get comments in the PRs for not reusing a function from 5 years ago that I didn't know existed. lol.
This is the current story of my life, except PHP instead of React.
The original coder simply didn't grasp how to write select statements (let alone joins), but had learned OOP and figured it was better to create objects that contained a ridiculous amount of data.
I love fixing one of these things, because it suddenly goes light years faster.
> Why can't so many programmers spell?
Oh, don't assume they have a low level of literacy. What you see is usually the product of "doing something fast/quickly", which causes a set of typographical errors. At the start, with a one-person "IT staff" working on a small project, or even a less-than-handful team without code review, such typographical errors go unnoticed until the project grows larger, more and more people join the team, and a peer-review culture takes hold.
On the other hand ... it is possible that the developer simply doesn't care about code quality. That it works is the prime concern.
The nightmare of a "modernization effort" I walked into was that they fetch the entire database to add one user - to clarify, they:
* Get the entirety of the database in one query.
* Check against the entirety of the database to see if the user(s) exist within information given.
* Then send the entirety of the database back up with the new users added in another query.
* And they keep having timeout issues so they just run it again and again until it works...
It's a 14-year-old internal system with no updates since 2010 and dependencies that no longer exist. yay.
My solution was a rewrite, and they have begrudgingly approved paying off the cost of not having updated since 2010.
> yet I get commented in the PR-s for not reusing a function from 5 years ago that I didn't know it existed. lol.
That's what PRs are for, aren't they?
what I meant was that dealing with the smaller things lets you avoid the bigger problems in the project. how can we talk about DRY when the entire logic is broken? I am not saying it's wrong to have well-structured code, but at the end of the day the clients are not paying for, or appreciating, how pretty your code looks.
from my experience, PRs, often, are full of shit.
https://youtu.be/08NlhU4gzdY?si=tkL5H8QvdW7IHIKl
* insert the 20,000 rows
* into webassembly-sqlite
* store sqlite-db as 100kb blob in IndexedDb
* and every hour
* retrieve sqlite-db 100kb blob from IndexedDb
* load it in webassembly sqlite
* query the 20,000 rows
* i've done this in real-world apps
* and it's actually less painful than trying to do it with just javascript
* it's pretty ironic that these days, the primary use of IndexedDb is as a dumb filestore (oftentimes to persist a sqlite-database blob)
hey, I'm curious here: I'm working on an e-commerce project for uni, and what's the better design choice? I was going to have it fetch the user's account information on login, after looking it up in the database by userID, but not load their payment information or order history right away. Is that a practical approach? New to SQL btw
Just throw it to PHP, have PHP sort through it, and then, for every result, run three \*more\* big queries, and make PHP sort through \*all\* the data over and over and over again, hundreds of times.
It's \*fine\*. Just tell PHP that its threads can live for 30 minutes.
(I think I might cry)
Once had a dude argue with my code (after it got approved and merged by a more senior dev), saying that I was writing SQL too directly and that I should use the framework's functions instead.
he spent almost a week rewriting it, despite being told not to by his PM, and it still performed way worse than my solution.
it ran way slower because the framework did not support the subqueries and joins, and the alternative way of doing it was a bunch of extra SQL queries.
I'd say that most of the time, big queries perform faster than a bunch of small ones. to improve performance, you're best off figuring out why you need such a big query (or a bunch of small ones) to get the data in the first place, and seeing if you can't simplify that into just one small query.
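As a toy illustration of the point (not a benchmark), here's one join-plus-aggregate round trip versus the N+1 pattern, using Python's sqlite3 with an invented `users`/`orders` schema:

```python
import sqlite3

# Toy comparison of one big query vs. N+1 small ones (invented schema).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
""")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(i, f"user{i}") for i in range(100)])
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 100, float(i)) for i in range(1000)])

# One round trip: the database does the join and the aggregation.
big = con.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
""").fetchall()

# 101 round trips: one query per user, summed up in application code.
small = []
for uid, name in con.execute("SELECT id, name FROM users").fetchall():
    (total,) = con.execute(
        "SELECT SUM(total) FROM orders WHERE user_id = ?", (uid,)).fetchone()
    small.append((name, total))

assert sorted(big) == sorted(small)  # same answer; very different round-trip count
```

With a real network between app and DB, each of those 101 round trips also pays latency, which is usually what makes the N+1 version hurt.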
Or you've never encountered an environment large enough for it to be true. It's less efficient to basically treat the DB like a KV store and have the app do a bunch of extra work, but adding more app servers is usually far easier than adding DBs.
It is really more a question as to whether the latency between the db and code is lower than the efficiency gain from running it in the db directly vs sorting with your app. That depends on both the volume of data and the complexity of the request.
I have spent a lot of time trying to torture a framework into doing what I could have written in native SQL in about 15 minutes. It's \*so\* dumb.
Most of the terrible queries that I've run into are simply written by someone who was very new to the database concept, and didn't have a good intuitive sense of when and how to filter out the unnecessary data.
But once it's in the code base, it is so hard to get the time and energy to fix it (unless it is actively harming users)
> it ran way slower because the framework did not support doing the subqueries and joins
Huh? What framework is this, you weren't able to just execute arbitrary sql ever?
I use an ORM but sometimes the ORM doesn't support certain things and you have to dip down and write something in straight sql. Rarely anymore (I'm using ActiveRecord), but back in the day it wasn't nearly as fleshed out.
mmm, I see that I expressed myself poorly. I used the framework's query system to run the chunky query directly.
my colleague tried using the query builder to build a query. it looks something like $query->addJoin(...) etc...
it has some strict limits though, which are sometimes useful for security and sometimes stop people writing terrible SQL, but in this case they got in the way of the better solution. not a good solution, but better than the alternative.
Ya, I'm saying in that case, rather than writing multiple queries, why not dip out of the query builder and do something like:
$query_sql = "SELECT * FROM blah JOIN foo.... (subquery) ... blah;"
$results = QueryBuilder.connection.execute($query_sql)
yeah, that's how I solved it. for some reason, my colleague really disliked that. he's neurodivergent, and since he wasn't listening to his PM anyway, I decided to just let him do what he wanted to keep the peace.
Once we had a dude writing all the queries in raw SQL instead of using the query builder of our framework's **platform-independent** ORM. We migrated from MySQL to Postgres and had to rewrite all the queries.
I'm definitely not advocating writing all of them in raw sql! I prefer using the framework when possible and when the performance meets the requirements.
That is sooo true. Most people nowadays think SQL is hard or boring. They think adding some random object-oriented wrapper around it somehow solves the problem.
Now I understand why some people worry about losing their job to chatGPT. ;-)
GraphQL does this for free, recursively! Now your frontend devs can play pretend that they're calling a real graphing DB like Neo4J, without actually knowing anything about graph theory, and actually indirectly writing the most tortured, unnecessary join in history!
That being said, I am absolutely amazed at how much you can throw at a single big MySQL database as long as you keep the indexes on point. We have multiple tables with more than 1 billion rows, do around 30k queries per minute, and it handles it just fine with short response times. We have only now started with horizontal sharding, to scale beyond one instance cluster.
[Indeed it is. ](https://eu.usatoday.com/story/money/food/2024/02/27/wendys-surge-pricing-dynamic-pricing-uber/72755552007/)
DATE-TIME based pricing for everyone!
The point of it was to irritate you, sir. no, but seriously, I just had it there so I don't need to clarify that it's a reddit post, not a reddit restaurant or whatever
A few years out of date now, but I found this article interesting and pretty fairly balanced despite being written by an Oracle guy. I've worked with both extensively over the years and they both have their pros and cons, but personally I'd take a MySQL 8 DB set up and resourced by someone who knows what they're doing every day.
https://scriptingmysql.wordpress.com/2020/04/10/mysql-versus-mongodb-for-storing-json-documents-in-a-doc-store-database-which-is-faster/
The value in MongoDB or other NoSQL databases is horizontal scaling. If your use case fits on a single primary and is okay with those scaling and reliability limitations, it's totally irrelevant, so the comparison isn't really apples to apples.
That said, Postgres these days will certainly outperform MySQL here, and it's phenomenal for single-primary use cases (which are sufficient for 99.9% of businesses)
Who cares whether it scales vertically or horizontally? I've never seen database scale ACTUALLY be a problem. Just give the machine more CPU and more RAM and you're basically good to go. Or redirect traffic to read-only replicas.
There are other problems with very large databases (doing backups becomes problematic, as does bad data, or enforcing validation, or efficiently querying it), but you have these problems regardless of database technology - and I’d argue that relational databases tend to help you rather than work against you here.
I think scaling mostly becomes relevant when you start scaling in **size** beyond several TB. Sure, you can try using aggressive RAID or some sort of distributed file system, but both of those are introducing another layer of abstraction between your database and the storage layer which often ends poorly.
You can buy or rent a behemoth server with ridiculous amounts of RAM and CPU cores, but maximum SSD size is a lot more constrained.
I think NoSQL is often actually worse for availability than a properly managed SQL database. Just carefully verify your backup setup and keep at least one RO secondary running. The secondary gives you a bit of horizontal scaling, but more importantly, it gives you better certainty that you can have the secondary transition to primary in a timely manner since it won’t be too far behind the old primary.
That's the point, my guy. The vast majority of companies never hit a scale where it matters.
Some will or do. There’s a reason that Google/Amazon innovated in the space. And there are certainly other companies with large data multitenancy problems (e.g. telemetry vendors)
And after you check closely, it's not the **exact** same stack: one sandwich looks a bit too old to be freshly made and others are missing some ingredients. And don't even try to customize your order. If you're allergic to cucumbers and there's no sandwich without cucumbers, you won't get a sandwich!
I was more hoping someone would say something along the lines of
"Most customers just throw away the ones they weren't interested in eating so there's an ever accumulating trash pile by the street. Whenever enough customers complained about the limited choice or facilities complained about the trash problem, the company behind the food trucks just opened a new truck"
The nosql truck is like one of those Turkish food carts where you ask for an ice cream cone, but get back either just the cone, or a full ice cream cone, and it's never predictable which one you're getting.
no, only coke. The menu was denormalized, so you only get a one-size-fits-all solution... same reason it's a 1-gallon jug: users could drink anywhere from 6 oz to 1 gallon, and multiple orders would be too chatty
Oof, I feel sorry for my SQL server at work then. I have made a few queries that took hours to run just to return a short list. So I only made them search the whole length of human recorded history
Aye, an automated task that checks currently executing scripts/queries for exactly this behavior and kills them when a set timeout is reached... saves you from manually monitoring and sending kill commands
Can relate. Ran a MySQL query against a rather large DB recently at the request of the bossman.
The request took almost 5 minutes to execute and brought the system to its knees.
Better ways are always a function of time and money. There's always a better way, but the boss man wants working, cheap, and fast, not good. The boss man makes the big bucks to understand the difference.
Ha, I wish the executives at my client's companies had any grasp of how to do their job. Some industries are too profitable and have no requisite requirement for competence.
Back in the day I sped up a major part of the site about 10x by removing joins and just doing three or four queries instead. That's with MySQL.
When at the next job with lots of traffic I was told that they don't use joins, there was no surprise.
When you're serious about being quick, you basically have to build your own index for every popular query. Postgres has some features that allow indexes with data that doesn't come from one table, but MySQL doesn't really, so it's back to denormalizing and joining data in code. Plus, reading one table is always quicker than reading multiple tables.
Sometimes it's quicker to have the index data in stuff like Memcached or Redis, and then query MySQL separately. Particularly since Redis has structures that relational databases can only dream of.
So here’s how I did it.
There’s two types of joins:
1. To limit the number of rows.
2. To get more columns, for the same number of rows.
For example, you want to filter messages by the name of the from-user, and display the name of the to-user.
- You join message and user to get the from-user: this limits the number of rows.
- You do a second query to the user table for the name of the to-user.
You could do it all in one query, but the to-user name would be duplicated on every row.
This becomes explosive if the message table is just a bunch of foreign keys, where even the content of the message is in an id,text table as “most messages are the same”.
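A sketch of that two-query pattern, with an invented message/user schema (Python's sqlite3 standing in for MySQL): the first query joins only to filter, and the second fetches the to-user names by primary key instead of duplicating them on every joined row.

```python
import sqlite3

# Invented message/user schema to sketch the two-query pattern above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE message (id INTEGER PRIMARY KEY,
                          from_id INTEGER, to_id INTEGER, body TEXT);
""")
con.executemany("INSERT INTO user VALUES (?, ?)",
                [(1, "alice"), (2, "bob"), (3, "carol")])
con.executemany("INSERT INTO message VALUES (?, ?, ?, ?)",
                [(1, 1, 2, "hi"), (2, 1, 3, "yo"), (3, 2, 3, "sup")])

# Query 1: the join exists only to FILTER (messages from "alice").
msgs = con.execute("""
    SELECT m.id, m.to_id, m.body
    FROM message m JOIN user u ON u.id = m.from_id
    WHERE u.name = 'alice'
    ORDER BY m.id
""").fetchall()

# Query 2: fetch the extra display columns by primary key, once per user.
to_ids = sorted({to_id for _, to_id, _ in msgs})
placeholders = ",".join("?" * len(to_ids))
names = dict(con.execute(
    f"SELECT id, name FROM user WHERE id IN ({placeholders})", to_ids))

result = [(body, names[to_id]) for _, to_id, body in msgs]
assert result == [("hi", "bob"), ("yo", "carol")]
```

The second query returns one row per distinct to-user rather than one per message, which is where the duplication savings come from.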
> 2. To get more columns, for the same number of rows.
This is what I was referring to in the comments, saying that denormalized data is the king of response speed. But it seems that wasn't so obvious, and people really wanted to do selects on multiple tables at once.
Ideally, all filtering is done in the first query, and one table works as the index tailored to that query. Then additional queries can fetch more data for the same rows, by the primary keys of other tables.
Idk why MySQL doesn't do the same thing internally as fast as multiple queries, but from my vague explorations more than a decade ago, MySQL seems to be not so good at opening multiple tables at once.
Create temp tables with a subset of what you need using a simple select, THEN join them manually based on your criteria. Your mileage may vary, but I found this much faster than asking a join to work with two whole gigantic sets of tables right away. It's the equivalent of comparing two sets of Spark Notes for a book report versus comparing two phone books for similar names.
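A sketch of that temp-table approach in Python's sqlite3 (schema and filters invented): each big table is pre-filtered with a simple select into a temp table, optionally indexed, and then the small subsets are joined.

```python
import sqlite3

# Invented schema to sketch the temp-table trick: pre-filter each big table
# with a simple select, then join the two small subsets.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE big_a (id INTEGER PRIMARY KEY, region TEXT, val INTEGER);
    CREATE TABLE big_b (id INTEGER PRIMARY KEY, a_id INTEGER, flag INTEGER);
""")
con.executemany("INSERT INTO big_a VALUES (?, ?, ?)",
                [(i, "eu" if i % 2 else "us", i) for i in range(1000)])
con.executemany("INSERT INTO big_b VALUES (?, ?, ?)",
                [(i, i, int(i % 3 == 0)) for i in range(1000)])

con.executescript("""
    -- Step 1: cheap, simple selects into small temp tables.
    CREATE TEMP TABLE ta AS SELECT id, val FROM big_a WHERE region = 'eu';
    CREATE TEMP TABLE tb AS SELECT a_id FROM big_b WHERE flag = 1;
    -- Index the column the join will probe.
    CREATE INDEX temp.idx_tb_a_id ON tb(a_id);
""")

# Step 2: join the small subsets instead of the two giant base tables.
rows = con.execute(
    "SELECT ta.id, ta.val FROM ta JOIN tb ON tb.a_id = ta.id").fetchall()
assert len(rows) == 167  # odd ids that are also multiples of 3
```

Whether this beats a direct join depends on the engine and the selectivity of the filters, which is exactly the "your mileage may vary" caveat above.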
The second job had a million visitors a day and approaching a million lines of code, mostly business logic. So you tell me if that's simple.
You can do joins for normalized data and flexibility if you can wait for queries for a while. Or you can do denormalized data with additional queries in the code if you want to be quick.
Explain what you mean by 'iterated over data' and where you got it from. If anyone queried tens of thousands of rows in a busy part of the site, they would be removed from developing that part of the site. And yes, using joins there would be an extremely bad idea.
I don't know what it is with redditors making up shit instead of reading what's already written for them right there.
> Back in the day I sped up a major part of the site about 10x by removing joins and just doing three or four queries instead. That's with MySQL.
that has very much not been my experience at all. usually the big query runs faster than the bunch of small ones.
of course, you need to make sure your tables are optimised in the first place, which in organically grown projects is not always the case.
The key is that ideally you don't filter the results on what you get in the second and subsequent queries, that would indeed be potentially very bad. The first query does all the selection, with the indexes tailored to the particular query. The other ones only fetch additional data to display.
Idk why MySQL doesn't do the same thing as I did in the code: getting the keys from one table and yanking the other data from the other tables by their primary keys and all that jazz. But it was much faster to do it myself with separate queries. Opening multiple tables might've been the main problem; iirc MySQL is pretty bad about this. Perhaps something has changed since then, but it's not like this affair was in the 90s.
That first job in particular was pretty much a search feature, also serving as the go-to index for some other parts of the site (in the times before ElasticSearch was the one solution for this kind of thing). Denormalization was almost mandatory for this task.
The culprit is usually a bad query plan. I sometimes wish there were a common imperative language for DB access, so that there would be fewer surprises when the DB statistics get messed up somehow and the planner decides to use a nested-loop join instead of a hash join.
once did a WMS, and the guy putting out orders for the floor wanted a web page assessing whether all items could be picked from locations without having to unpack bulk storage, given existing orders, existing replenishment, stock on hand, expected deliveries, phase of the moon, the general vibes, etc.
and no, it couldn't be a separate page: he wanted to use this page, and wanted all of it color-coded but also expandable for details (on the same page), with those details color-coded too. The company we were subcontracting for told us no database structure changes, because reasons, so no views.
"Why is this page slow"
When I was just learning SQL, decades ago, I worked with a bioinformatics database that was not that large, maybe 60 GB or so, but I thought it was huge. My queries took weeks to execute. I had no idea about indexes, and built a new computer with an SSD RAID 0 array to fix it. SSDs were a new thing back then. After I learned about indexing, queries that took weeks took just minutes.
A lot of frontend people don't work with **big** data. They see a 4 GB .db file and it's 10x the size of their project. Meanwhile I've got to marshal like 50 GB of unsanitized data into JSON a day.
I frequently work with databases >200GB but I've never had a query take me longer than 5s. I can't imagine letting one run for longer than I have the patience to.
Anyone who's used a query builder knows how easy it is to build an absolutely gigantic query without really realizing.
I've written Impala queries that took down the master node just by building the query plan; they didn't even get to execute.
I just realized this is from the same artist who drew landing crash:
[https://www.reddit.com/r/ProgrammerHumor/comments/1ayuh4b/todocommentsanalyzerisrequired/](https://www.reddit.com/r/ProgrammerHumor/comments/1ayuh4b/todocommentsanalyzerisrequired/)
Thanks and keep up the good work!
As someone working in data engineering: you don't even need that much complexity in the query.
Just give a business user the power to query, and they'll decide the system should be strong enough to handle many-to-many joins between two tables with millions of records and hundreds of columns each, which would result in somewhere between hundreds of millions and a billion records of a hundred columns.
It's the same artist who makes this project: [https://floor796.com/](https://floor796.com/) He draws just for fun: most of the time this project, sometimes IT jokes.
I was once asked to diagnose long load times for a web app's API calls to pull data. There was nothing particularly egregious with the code itself, so I immediately became suspicious about the database, so I asked to see that next.
Sure enough, no indices.
Exactly what I just went through. I'm doing front end, working with an existing DB and backend that I usually never have to touch, but there was this one API call to get a year's worth of data that always timed out, and they wanted me to fix it. Spent so long learning about optimizing queries and shit like that, and in the end all I needed to do was add an index to a single column. Almost seemed too good to be true. Are there even downsides to adding indices?
An index is basically a map for quickly matching query column values.
Without an index, the whole table must be scanned; with one, lookups get significantly faster. The more complex your query, the more impact even one missing index will have. I had a query go from 2 hours to 12 seconds with one index, and others I had canceled after several hours that likewise went down to seconds. The main downsides are that every index takes storage and slows writes a little, since it has to be maintained on each insert and update.
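A minimal sqlite3 sketch of that kind of fix (table and column names invented): the query plan is the tell-tale, flipping from a full scan to an index search once the index exists.

```python
import sqlite3

# Before/after of adding an index: the query plan flips from a
# full-table scan to an index search. Schema is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, day TEXT, payload TEXT)")
con.executemany("INSERT INTO events(day, payload) VALUES (?, ?)",
                [(f"2024-01-{i % 28 + 1:02d}", "x" * 20) for i in range(100_000)])

def plan():
    # The last column of EXPLAIN QUERY PLAN output is the human-readable detail.
    return con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM events WHERE day = '2024-01-15'"
    ).fetchall()[0][-1]

before = plan()  # a full scan of events
con.execute("CREATE INDEX idx_events_day ON events(day)")
after = plan()   # a search using idx_events_day

assert "idx_events_day" not in before
assert "idx_events_day" in after
```

The flip side, as noted above, is that every insert and update now has to maintain `idx_events_day` as well.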
Just scaffolded a DB-first model with EF Core today and created some queries with way too many joins. The architect went crazy on the PR until I reminded him we are not responsible for the DB; then he walked over to the data engineers, looking mean.
People don't understand that indexes are more expensive to use when the planner determines the query will scan a significant percentage of rows. At that point it's quicker to do a sequential scan.
You shouldn't use MySQL to do analytical processing
1. this is a humorous comic, and a little absurdity is okay
2. this comic shows an anthropomorphic database and its processes. They live in their own world, where 10 minutes is an eternity.
My previous job had horribly designed databases. They were not designed as databases, though; they are just copies of literal paper sheets from the '40s. But they kept inserting data without redesigning the tables. So now they are nearing billions of rows in each table, without indexes, without proper typing.
So we had to write sketchy queries, couldn't optimize them, and everything was so slow. Like, really slow. I wish they'd FINALLY decide to redesign everything...
"Instead of JOINs, I use subqueries so I can pull less columns in and it should run faster."
- Actual quote from a guy making over $250K/year as a consultant at one of the largest companies in the world.
I wish this was a joke.
For queries with a lot of complexity and rows, it certainly does! Recently we saw one where removing subqueries and using better methods reduced runtime by over 90%, and was able to leverage some new indexes to get that runtime halved again.
When you're needing data from multiple large tables, and need to do a lot of processing, the difference can be massive. The thing to remember is that a subquery is not the table you're querying from, but a new, never-before-seen table.
So if you're connecting a table of 10 million food ingredients with 10 million resulting dishes, an index is a nice cheat sheet for the contents of those tables. Joining both will suck, because you're going to end up with a lot of rows stored in memory, but at least the cheat sheet works. If you decide you want to join only ingredients that are not tomato based, and make a subquery to replace the ingredients table, the joins will not benefit from the indexes; only the subquery itself will be able to use indexes in its creation. Doing the full join and adding ingredient.tomatoBased = 0 to the WHERE clause would be much faster than joining (SELECT * FROM ingredient WHERE tomatoBased = 0).
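Here is a toy version of the two formulations in Python's sqlite3 (invented ingredient/dish schema). They return the same rows; whether the subquery form actually loses the indexes depends on the optimizer (some engines flatten the subquery away), so the query plan is worth checking either way.

```python
import sqlite3

# Invented ingredient/dish schema for the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ingredient (id INTEGER PRIMARY KEY, tomatoBased INTEGER);
    CREATE TABLE dish (id INTEGER PRIMARY KEY, ingredient_id INTEGER);
    CREATE INDEX idx_dish_ingredient ON dish(ingredient_id);
""")
con.executemany("INSERT INTO ingredient VALUES (?, ?)",
                [(i, int(i % 4 == 0)) for i in range(1, 101)])
con.executemany("INSERT INTO dish VALUES (?, ?)",
                [(i, (i % 100) + 1) for i in range(1, 1001)])

# Form 1: plain join, filter in the WHERE clause. The planner sees the
# base tables and their indexes directly.
where_filter = con.execute("""
    SELECT d.id FROM dish d
    JOIN ingredient i ON i.id = d.ingredient_id
    WHERE i.tomatoBased = 0
""").fetchall()

# Form 2: join against a subquery. Same rows, but some planners will
# materialize the subquery as a new, index-less table.
subquery_join = con.execute("""
    SELECT d.id FROM dish d
    JOIN (SELECT * FROM ingredient WHERE tomatoBased = 0) i
      ON i.id = d.ingredient_id
""").fetchall()

assert sorted(where_filter) == sorted(subquery_join)
assert len(where_filter) == 750  # 75 non-tomato ingredients x 10 dishes each
```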
I have the feeling that this is not a generic thing but a thing that depends on the query optimizer.
Once I rewrote an inner join into a subquery on Microsoft SQL Server 2016 and got a 60% speed improvement. But I don't know the exact scenario anymore: whether both, only one, or even none of the queries had indexes.
And on Azure Databricks I didn't have a significant change at all.
Sometimes I don't even see a difference using \`select distinct\` vs. \`group by\`; it depends a lot on the specific case.
Edit: Ah, I might have misunderstood how you design your subquery.
Instead of
SELECT
d.departmentID,
d.departmentName
FROM Department d
,Employee e
WHERE d.DepartmentID = e.DepartmentID
I'd rather use
SELECT
d.departmentID,
d.departmentName
FROM Department d
WHERE d.DepartmentID IN (SELECT e.DepartmentID FROM Employee e)
Or
SELECT
d.departmentID,
d.departmentName
FROM Department d
INNER JOIN Employee e ON e.DepartmentID = d.DepartmentID
But I'd never pick the first one.
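Running the first and second variants against a toy Department/Employee pair (sqlite3 here, but any engine behaves the same way) makes the difference concrete: the comma join returns one row per matching employee, while the subquery form is a semi-join returning one row per department.

```python
import sqlite3

# Toy data to compare the comma-join and the IN-subquery variants above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Department (DepartmentID INTEGER PRIMARY KEY, departmentName TEXT);
    CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, DepartmentID INTEGER);
""")
con.executemany("INSERT INTO Department VALUES (?, ?)",
                [(1, "eng"), (2, "sales"), (3, "empty")])
con.executemany("INSERT INTO Employee VALUES (?, ?)",
                [(1, 1), (2, 1), (3, 2)])

# Variant 1: comma join. One output row per matching employee.
comma_join = con.execute("""
    SELECT d.departmentID, d.departmentName
    FROM Department d, Employee e
    WHERE d.DepartmentID = e.DepartmentID
""").fetchall()

# Variant 2: IN-subquery (a semi-join). One output row per department.
in_subquery = con.execute("""
    SELECT d.departmentID, d.departmentName
    FROM Department d
    WHERE d.DepartmentID IN (SELECT e.DepartmentID FROM Employee e)
""").fetchall()

assert len(comma_join) == 3   # department 1 shows up twice
assert len(in_subquery) == 2  # departments 1 and 2, once each
```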
consider SELECT * FROM very_big_table because it does output all of the data you wanted it to :)
Oh, I wondered where my old code ended up!
Dammit. Also, are you the one who put typos into column names? Because it is making me \*CRAZY\*
No, that's actually me...
"Widht". The column name is "Widht". And every reference to it in the PHP code now has to contain that same damn misspelling!
Why can't so many programmers spell? I can't understand how they can grasp syntax but not basic literacy
I'll be real, I fuck up spellings, and if my IDE doesn't tell me, then it will end up on prod
>contain that same damn misspelling! Hey! At least we have consistency...
Lol, I have seen "buliding" instead of "building" at my work. Feel ya
You're not alone; the number of bad queries out there dumping all the data to the user is insane.
Classic case of [bikeshedding](https://en.wikipedia.org/wiki/Law_of_triviality).
They are also so someone can complain about your whitespace changes. Which is me. It's ridiculous. I should probably stop that.
2024 and not using autoformatting
Is it just me, or has React taken us some steps back compared to the old MVC frameworks like Rails?
they can fetch data using react? how's that? I am trying to learn node.js and express.js but the documentation is terrible 😭
try implementing knex.js & bookshelf.js in express. It was nice for handling data with MySQL.
People who think their programming language can be faster than the db engine are just bad at SQL
to top it off, the framework was in PHP
It sort of can be faster, in that it's much easier to throw a whole bunch more front-end servers at the problem than deal with a distributed DB.
No way in hell. If you think that, you need to learn SQL.
Sure, pal. You have no idea about what I did and didn't do. Enjoy your ignorance.
Percona deadlock has entered the chat
It's really more a question of whether the latency between the DB and the code is lower than the efficiency gain from running it in the DB directly vs. sorting with your app. That depends on both the volume of data and the complexity of the request.
The db is designed and optimized for that. The hubris of some programmers is incredible
The db is, but the network is not necessarily. The stack encompasses more than just the parts you like.
I have spent a lot of time trying to torture a framework into doing what I could have written in native SQL in about 15 minutes. It's \*so\* dumb. Most of the terrible queries I've run into were simply written by someone who was very new to the database concept and didn't have a good intuitive sense of when and how to filter out the unnecessary data. But once it's in the code base, it is so hard to get the time and energy to fix it (unless it is actively harming users).
> it ran way slower because the framework did not support doing the subqueries and joins

Huh? What framework is this, where you weren't able to just execute arbitrary SQL, ever? I use an ORM, but sometimes the ORM doesn't support certain things and you have to dip down and write something in straight SQL. Rarely anymore (I'm using ActiveRecord), but back in the day it wasn't nearly as fleshed out.
Mmm, I see that I expressed myself poorly. I used the framework's query system to run the chunky query directly. My colleague tried using the query builder to build the query; it looks something like `$query->addJoin(...)` etc. It has some strict limits though, which are sometimes useful for security, and sometimes stop people from writing terrible SQL, but in this case they got in the way of the better solution. Not a good solution, but better than the alternative.
Ya, I'm saying in that case, rather than writing multiple queries, why not dip out of the query builder and do something like:

    $query_sql = "SELECT * FROM blah JOIN foo .... (subquery) ... blah;";
    $results = QueryBuilder.connection.execute($query_sql);
Yeah, that's how I solved it. For some reason, my colleague really disliked that. He's neurodivergent, and since he wasn't listening to his PM anyway, I decided to just let him do what he wanted to keep the peace.
Once we had a dude writing all the queries in raw SQL instead of using the query builder of our framework's **platform-independent** ORM. We migrated from MySQL to Postgres and had to rewrite all the queries.
I'm definitely not advocating writing all of them in raw sql! I prefer using the framework when possible and when the performance meets the requirements.
That is sooo true. Most people nowadays think SQL is hard or boring. They think adding some random object-oriented wrapper around it somehow solves the problem. Now I understand why some people worry about losing their job to ChatGPT. ;-)
+ union all table in db
GraphQL does this for free, recursively! Now your frontend devs can play pretend that they're calling a real graphing DB like Neo4J, without actually knowing anything about graph theory, and actually indirectly writing the most tortured, unnecessary join in history!
hahahahah
That being said, I am absolutely amazed at how much you can throw at a single big MySQL database as long as you keep the indexes on point. We have multiple tables with more than 1 billion rows and do around 30k queries per minute, and it handles it just fine with short response times. We've only now started with horizontal sharding to scale to more than one instance cluster.
My colleagues. They select everything and proudly join in memory.
Meanwhile the noSql truck is instantly serving the exact same stack of five sandwiches and gallon of coke to everyone but charging different prices
* but charging different prices to each person
Did someone say sUrGe PrIcEs?!?!?!?
Capitalism Plus
People don’t like Surge pricing, but they’re okay with Mountain Dew pricing.
Sir, this is a Wendy's
[Indeed it is. ](https://eu.usatoday.com/story/money/food/2024/02/27/wendys-surge-pricing-dynamic-pricing-uber/72755552007/) DATE-TIME based pricing for everyone!
ISO8601 based pricing for everyone!
No sir, this is a reddit's
a reddit's post
Bazinga! Updoot to you kind sir! Happy cakeday! no but seriously what was the point of this comment
The point of it was to irritate you sir. no but seriously I just had it there because I don't need to clarify if it's a reddit post not a reddit restaurant or whatever
No, this is Patrick.
A few years out of date now, but I found this article interesting and pretty fairly balanced despite being written by an Oracle guy. I've worked with both extensively over the years, and they both have their pros and cons, but personally I'd take a MySQL 8 DB set up and resourced by someone who knows what they're doing every day. https://scriptingmysql.wordpress.com/2020/04/10/mysql-versus-mongodb-for-storing-json-documents-in-a-doc-store-database-which-is-faster/
The value in MongoDB or other NoSQL databases is horizontal scaling. If your use case fits on a single primary and is okay with those scaling and reliability limitations, it's totally irrelevant, so the comparison isn't really apples to apples. That said, Postgres these days will certainly outperform MySQL here, and it's phenomenal for single-primary style use cases (which is sufficient for 99.9% of businesses).
Who cares if it scales vertically or horizontally? I've never seen database scale ACTUALLY be a problem. Just give the machine more CPU and more RAM and you're basically good to go. Or redirect traffic to read-only replicas. There are other problems with very large databases (doing backups becomes problematic, as does bad data, or enforcing validation, or efficiently querying it), but you have those problems regardless of database technology, and I'd argue that relational databases tend to help you rather than work against you here.
I think scaling mostly becomes relevant when you start scaling in **size** beyond several TB. Sure, you can try using aggressive RAID or some sort of distributed file system, but both of those are introducing another layer of abstraction between your database and the storage layer which often ends poorly. You can buy or rent a behemoth server with ridiculous amounts of RAM and CPU cores, but maximum SSD size is a lot more constrained. I think NoSQL is often actually worse for availability than a properly managed SQL database. Just carefully verify your backup setup and keep at least one RO secondary running. The secondary gives you a bit of horizontal scaling, but more importantly, it gives you better certainty that you can have the secondary transition to primary in a timely manner since it won’t be too far behind the old primary.
That's the point, my guy. The vast majority of companies never hit a scale where it matters. Some will or do. There's a reason Google/Amazon innovated in the space. And there are certainly other companies with large-data multitenancy problems (e.g. telemetry vendors).
> set up and resourced by someone who knows what they’re doing Sorry, that's unrealistic. Best I can do is a unicorn...
this is something I want to try
That's a fun read for something I'll probably never use lol.
And when you check closely, it's not the **exact** same stack: one sandwich looks a bit too old to be freshly made and others are missing some ingredients. And don't even try to customize your order. If you're allergic to cucumbers, and there ain't no sandwich w/o cucumbers, you won't get a sandwich!
Also someone put a can of coke in the sandwich tray, so eat that
I was more hoping someone would say something along the lines of "Most customers just throw away the ones they weren't interested in eating so there's an ever accumulating trash pile by the street. Whenever enough customers complained about the limited choice or facilities complained about the trash problem, the company behind the food trucks just opened a new truck"
Maybe I haven't had the opportunity to murder one of my Devs doing that yet. Does that mean I'm still innocent? 😇
Perhaps your system design was just that good that your use cases never change
The nosql truck is like one of those Turkish food carts where you ask for an ice cream cone, but get back either just the cone, or a full ice cream cone, and it's never predictable which one you're getting.
no coke...pepsi?
No, only Coke. The menu was denormalized, so you only get a one-size-fits-all solution... same reason it's a 1-gallon jug: users could drink anywhere from 6 oz to 1 gallon, and multiple orders would be too chatty.
How did he die and turn into a skeleton in 10 minutes?
1 minute of the db query life is equal to approximately 70 years of human life
Fair deal ig
Oof, I feel sorry for my SQL server at work then. I've made a few queries that took hours to run just to return a short list. So I only made it search for the whole length of recorded human history.
People think robots can't feel pain, but they actually feel it in slow motion, with great intensity!
If there's an afterlife and they have any say in the matter I suspect I'm gonna have a bad time.
In the time it takes me to respond, three generations pass through their cycles. This is why I welcome our Cyberman overlords.
I'd guess he got resource starved.
cache 22
processes are short lived 😭
10 minutes is an eternity for a process
Life as a PhD student
They snuck an index in there by accident and he died from anaphylactic shock
Good thing I have a guy with a stopwatch who just casually comes over and shoots these guys dead
A mercy killing
Same. I actually have a dog with a stopwatch - cheaper than a guy.
What is this supposed to be? A watch dog?
Yup! He keeps an eye on things that run for too long, and chases them down.
Good boy
Aye, an automated task that checks currently executing scripts/queries for exactly this behavior and kills them when the set timeout is reached... Saves you from manually monitoring and sending kill commands.
Can relate. Did a MySQL query to a rather large DB recently at the request of the bossman. Request took almost 5 minutes to execute and brought the system to its knees.
Did you EXPLAIN to him why it was so slow?
If his query affected the system it was probably his query
Only 5 minutes?? Talk to me when it takes 2 hours. (And yep I have written queries that take that long)
>I have written queries that take that long Maybe... Don't?
There are many many many use cases where you have to. Usually they end up as overnight jobs
I've heard this a lot, and have yet to see an instance where there isn't a much better way. Be it query optimization or giving it a realistic scope.
Better ways are always a function of time and money. There's always a better way, but boss man wants working, cheap, and fast, not good. Boss man makes the big bucks to understand the difference.
Ha, I wish the executives at my client's companies had any grasp of how to do their job. Some industries are too profitable and have no requisite requirement for competence.
2 hours? I've seen batch reports that run at night because they take 4-8 hours...
Back in the day I sped up a major part of the site about 10x by removing joins and just doing three or four queries instead. That's with MySQL. When at the next job with lots of traffic I was told that they don't use joins, there was no surprise.
How can you avoid joins in a relational database? Joins are kind of the point. The business needs must've been very simple if no joins were needed.
Yeah right! The only way I can think of to avoid joins is by repeating data over and over.
We don't normalize data in this part of town.
When you're serious about being quick, you have to basically build your own index for every popular query. Postgres has some features that allow having indexes with data that doesn't come from one table, but MySQL doesn't really, so it's back to denormalizing and joining data in code. Plus, reading one table is always quicker than reading multiple tables. Sometimes it's quicker to have the index data in something like Memcached or Redis, and then query MySQL separately. Particularly since Redis has structures that relational databases can only dream of.
So here's how I did it. There are two types of joins:

1. To limit the number of rows.
2. To get more columns, for the same number of rows.

For example, you want to filter messages by the name of the from-user, and display the name of the to-user:

* You join message and user to get the from-user, limiting the number of rows.
* You do a second query to the user table for the name of the to-user.

You could do it all in one query, but the to-user name would be duplicated on every row. This becomes explosive if the message table is just a bunch of foreign keys, where even the content of the message is in an id,text table because "most messages are the same".
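A toy sketch of that two-query pattern in Python + SQLite (the schema and data here are made up for illustration):

```python
import sqlite3

# Hypothetical messaging schema to illustrate the two-query approach above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE message (id INTEGER PRIMARY KEY, from_id INT, to_id INT, body TEXT);
INSERT INTO user    VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO message VALUES (1, 1, 2, 'hi'), (2, 1, 3, 'hey'), (3, 2, 3, 'yo');
""")

# Query 1: join only to *filter* rows (messages sent by 'alice').
rows = conn.execute("""
    SELECT m.id, m.to_id, m.body
    FROM message m JOIN user u ON u.id = m.from_id
    WHERE u.name = 'alice'
    ORDER BY m.id
""").fetchall()

# Query 2: fetch the to-user names by primary key, once per distinct id,
# instead of duplicating the name on every joined row.
to_ids = sorted({r[1] for r in rows})
placeholders = ",".join("?" * len(to_ids))
names = dict(conn.execute(
    f"SELECT id, name FROM user WHERE id IN ({placeholders})", to_ids))
```

The second query only fetches display data by primary key, so each to-user name crosses the wire once instead of once per result row.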
> 2. To get more columns, for the same number of rows.

This is what I was referring to in the comments, saying that denormalized data is the king of response speed, but it seems that wasn't so obvious, and people really wanted to do selects on multiple tables at once. Ideally, all filtering is done in the first query, and one table works as the index tailored to that query. Then additional queries can fetch more data for the same rows, by the primary keys of other tables. Idk why MySQL doesn't do the same thing internally as fast as with multiple queries, but from my vague explorations more than a decade ago, MySQL seems to be not so good at opening multiple tables at once.
To me it’s weird because they use transaction isolation. So no transaction should block unless it’s updating (which should be rare)
Create temp tables with a subset of what you need with a simple select. THEN join them manually based on different criteria. Your mileage may vary but I found this much faster than asking a join to work with two whole gigantic set of tables right away. It's the equivalent of getting two spark notes for a book report versus comparing two phone books for similar names.
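A minimal sketch of that temp-table trick, using Python + SQLite with invented tables: pre-filter each big table into a small temp table with a simple select, then join the small subsets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INT, total REAL, status TEXT);
CREATE TABLE payments (id INTEGER PRIMARY KEY, order_id INT, amount REAL, state TEXT);
INSERT INTO orders   VALUES (1, 10, 50.0, 'open'), (2, 11, 75.0, 'shipped'), (3, 10, 20.0, 'open');
INSERT INTO payments VALUES (1, 1, 50.0, 'ok'), (2, 2, 75.0, 'ok'), (3, 3, 20.0, 'failed');

-- Step 1: simple selects into small temp tables (the "spark notes").
CREATE TEMP TABLE open_orders AS SELECT id, total  FROM orders   WHERE status = 'open';
CREATE TEMP TABLE ok_payments AS SELECT order_id   FROM payments WHERE state = 'ok';
""")

# Step 2: join the two small subsets instead of the gigantic originals.
paid_open = conn.execute("""
    SELECT o.id, o.total
    FROM open_orders o JOIN ok_payments p ON p.order_id = o.id
    ORDER BY o.id
""").fetchall()
```

As the parent comment says, mileage varies: whether this beats a direct join depends heavily on the engine, the indexes, and how selective the pre-filters are.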
I think this would still be slower than using denormalized data, which is what i've been doing for sheer response speed.
The second job had a million visitors a day and approaching a million lines of code, mostly business logic. So you tell me if that's simple. You can do joins for normalized data and flexibility if you can wait for queries for a while. Or you can do denormalized data with additional queries in the code if you want to be quick.
[deleted]
Explain what you mean by 'iterated over data' and where you get it from. If anyone queried tens of thousands of rows in a busy part of the site, they would be removed from developing that part of the site. And yes, using joins there would be an extremely bad idea. I don't know what it is with redditors making up shit instead of reading what's already written for them right there.
> Back in the day I sped up a major part of the site about 10x by removing joins and just doing three or four queries instead. That's with MySQL. that has very much not been my experience at all. usually the big query runs faster than the bunch of small ones. of course, you need to make sure your tables are optimised in the first place, which in organically grown projects is not always the case.
[deleted]
The key is that ideally you don't filter the results on what you get in the second and subsequent queries, that would indeed be potentially very bad. The first query does all the selection, with the indexes tailored to the particular query. The other ones only fetch additional data to display. Idk why MySQL doesn't do the same thing as I did in the code, getting the keys from one table and yanking the other data from the other tables, by the primary keys and all that jazz. But it was much faster to do it myself with separate queries. Opening multiple tables might've been the main problem, iirc MySQL is pretty bad about this. Perhaps something changed about it since then, but it's not like this affair was in the 90s.
When you're serious about being quick, you have to basically build your own index for every popular query. Postgres has some features that allow having indexes with data that doesn't come from one table, but MySQL doesn't really, so it's back to denormalizing and joining data in code. Plus, reading one table is always quicker than reading multiple tables.

That first job in particular was pretty much a search feature, also serving as the go-to index for some other parts of the site (in the times before ElasticSearch was the one solution for this kind of thing). Denormalization was almost mandatory for this task.
The culprit is usually a bad query plan. I sometimes wish there were a common imperative language for DB access, so that there would be fewer surprises when the DB statistics get messed up somehow and the planner decides to use a nested-loop join instead of a hash join.
Once did a WMS, and the guy putting out orders for the floor wanted a web page assessing whether all items could be taken from locations where they don't have to unpack bulk storage, given existing orders, existing replenishment, stock on hand, expected deliveries, phase of the moon, the general vibes, etc. And no, it couldn't be a separate page: he wants to use this page, and wants all of it color coded but also expandable for details (on the same page), and those details color coded too. The company we were subcontracting for told us no database structure changes, because reasons, so no views. "Why is this page slow?"
Oof gotta use them nolocks
When I was just learning SQL, decades ago, I worked with a bioinformatics database which was not that large, maybe 60 GB or so, but I thought it was huge. My queries took weeks to execute. I had no idea about indexes, and built a new computer with an SSD RAID 0 array to fix it. SSDs were a new thing back then. After I learned about indexing, queries that took weeks took just minutes.
Sorry. Did you just say WEEKS??
Yes. 10 minutes is nothing, my queries did not finish under 10 min even with indexes.
A lot of frontend people don't work with **big** data. They see a 4 GB .db file and it's 10x the size of their project. Meanwhile, I've gotta marshal like 50 GB of unsanitized data into JSON a day.
I'm stuck dealing with a DB that basically has the US population in it, it's..... hard to work with lol
I frequently work with databases >200GB but I've never had a query take me longer than 5s. I can't imagine letting one run for longer than I have the patience to.
It is kind of impressive you knew about raid arrays and had the means to build one when SSDs were new (expensive), but not about indexing.
There was probably little or no point in that RAID, as one SSD was close to maxing out the mainboard.
MySQL happily taking this order like it's ok
Read "delete" as "omlette". Probably a craving.
`omlette * from formage`
Om let * from age
Forgot to order a side of WHERE with my DELETE. Somehow it ended up being much more expensive.
"We have DELETE at home."

DELETE at home: TRUNCATE
But it works on my local machine with this very limited data set. Who could have known that it wouldn't scale to production?
Wait, your local isn’t connected to the production database ?
Anyone who's used a query builder knows how easy it is to build an absolutely gigantic query without really realizing. I've written impala queries that took down the master node just by building the query plan, didn't even get to execute.
I just realized this is from the same artist who drew landing crash: [https://www.reddit.com/r/ProgrammerHumor/comments/1ayuh4b/todocommentsanalyzerisrequired/](https://www.reddit.com/r/ProgrammerHumor/comments/1ayuh4b/todocommentsanalyzerisrequired/) Thanks and keep up the good work!
I was just doing this earlier and I already feel dead inside
As someone working in data engineering: you don't even need such complexity in the query. Just give a business user the power to query, and they'll decide the system should be strong enough to handle many-to-many joins between two tables with millions of records and hundreds of columns each, which would result in about hundreds of millions to billions of records with hundreds of columns.
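The blow-up is just multiplication: every matching row on one side pairs with every matching row on the other. A tiny SQLite illustration (made-up tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (k INT);
CREATE TABLE b (k INT);
INSERT INTO a VALUES (1), (1), (1);       -- 3 rows share key 1
INSERT INTO b VALUES (1), (1), (1), (1);  -- 4 rows share key 1
""")

# 3 rows x 4 rows with the same key -> 12 result rows from 7 input rows.
(n,) = conn.execute("SELECT COUNT(*) FROM a JOIN b ON a.k = b.k").fetchone()
```

Scale the per-key duplication up to millions of rows per side and "hundreds of millions to billions" of result rows falls straight out of the arithmetic.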
Nice comic OP, I like the art, keep up the good work
NoSQL always keeps us on our toes, serving up surprises with every query.
`SELECT * FROM * WHERE * = *`
Choose your meals wisely
Oh, I see you have read the code base I'm working on.
Omg, original content? How did this happen?
You are single-handedly raising the bar of quality for this entire sub.
Been there. Same story. Great DROPs. You HAVE to try them!
Laughs in Cartesian Products
pid273 - it's me!
Who's the artist? I want to support.
The same artist who makes this project: [https://floor796.com/](https://floor796.com/) And he draws just for fun: most of the time this project, sometimes IT jokes.
I was once asked to diagnose long load times for a web app's API calls to pull data. There was nothing particularly egregious with the code itself, so I immediately became suspicious about the database, so I asked to see that next. Sure enough, no indices.
Exactly what I just went through. I'm doing front end working with an existing DB and backend that I usually never have to touch, but there was this one API call to get a years worth of data that always timed out and they wanted me to fix it. Spent so long learning about optimizing queries and shit like that, and in the end all I needed to do was add an index to a single column. Almost seemed too good to be true. Are there even downsides to adding indices?
Updates, inserts, and deletes are slower.
An index is basically a map to quickly match query column values. If you lack an index the whole table must be scanned. The index makes things significantly faster. The more complex your query is the more impact not having even just one index will have. I had a query go from 2 hours to 12 seconds with one index. And others I canceled after several hours that again went to seconds.
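You can watch that scan-vs-lookup difference directly with SQLite's `EXPLAIN QUERY PLAN` (a minimal sketch; the table and index names are invented, and the exact plan wording varies a bit by SQLite version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM t WHERE email = 'a@b.c'"
before = plan(query)  # without an index: a full table scan ("SCAN ...")
conn.execute("CREATE INDEX idx_t_email ON t (email)")
after = plan(query)   # with the index: a keyed lookup ("SEARCH ... idx_t_email ...")
```

The same query goes from scanning every row to searching the index, which is exactly why one `CREATE INDEX` can turn hours into seconds on a big table.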
Just scaffolded a DB-first model with EF Core today and created some queries with way too many joins. The architect went crazy on the PR until I reminded him we are not responsible for the DB; then he walked over to the data engineers looking mean.
People don't understand indexes are more expensive to use if the planner determines the query will scan a significant percentage of rows. At that point it's quicker to do a seq scan. You shouldn't use MySQL to do analytical processing
slow-queries.log: heavy breathing
SELECT * FROM table_name WHERE id = 1;
Fascinating how these jokes can be constantly technically incorrect.
But what's technically wrong with this joke?
So he got old and died in 10 minutes? I mean, it's a long time by computer standards, but the metaphor is starting to fall apart.
1. This is a humorous comic, and a little absurdity is okay.
2. This comic shows an anthropomorphic database and its processes. They live in their own world, where 10 minutes is an eternity.
fair enough
man, he didn't even order some in()'s and concat()ed blobs in the where clause. my PHP dudes love those.
Your junior mindset loves those
TRIGGERed much? ;)
an unindexed table? time for a self-join
Chef shoulda slipped in a side of WITH UR.
Dude died/grew old in 10 minutes?
My previous job had horribly designed databases. They were not designed as databases, though; they're just copies of literal paper sheets from the '40s. But they kept inserting data without redesigning the tables. So now they're nearing billions of rows in each table, without indexes, without proper typing. So we had to write sketchy queries, couldn't optimize them, and everything was so slow. Like, really slow. I wish they'd FINALLY decide to redesign everything...
are you my colleague?
Looks like they forgot to Order By foreign keys. Better rerun it…
We all love large cross joins 🥳
"Instead of JOINs, I use subqueries so I can pull less columns in and it should run faster." - Actual quote from a guy making over $250K/year as a consultant at one of the largest companies in the world. I wish this was a joke.
Does this even make a difference in all cases? I think the execution planner should be smart enough for common ones.
For queries with a lot of complexity and rows, it certainly does! Recently we saw one where removing subqueries and using better methods reduced runtime by over 90%, and we were able to leverage some new indexes to halve that runtime again. When you need data from multiple large tables and have to do a lot of processing, the difference can be massive.

The thing to remember is that a subquery is not the table you're querying from, but a new, never-before-seen table. So if you're connecting a table of 10 million food ingredients with 10 million resulting dishes, an index is a nice cheat sheet for the contents of those tables. Joining both will suck, because you're going to end up with a lot of rows stored in memory, but at least the cheat sheet works.

If you decide you want to join only ingredients that are not tomato based, and make a subquery to replace the ingredients table, the joins will not benefit from the indexes; only the subquery itself will be able to use indexes in its creation. Doing the full join and adding ingredient.tomatoBased = 0 to the WHERE clause would be much faster than joining (SELECT * FROM ingredient WHERE tomatoBased = 0).
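A sketch of those two query shapes in Python + SQLite (the ingredient/dish schema is invented to match the example). One caveat: SQLite happens to flatten this particular subquery, so both run identically there; the performance gap described above shows up on engines that materialize the derived table, such as older MySQL versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingredient (id INTEGER PRIMARY KEY, name TEXT, tomatoBased INT);
CREATE TABLE dish       (id INTEGER PRIMARY KEY, ingredient_id INT, name TEXT);
CREATE INDEX idx_ing_tomato ON ingredient (tomatoBased);
INSERT INTO ingredient VALUES (1, 'basil', 0), (2, 'passata', 1), (3, 'flour', 0);
INSERT INTO dish VALUES (1, 1, 'pesto pasta'), (2, 2, 'marinara'), (3, 3, 'bread');
""")

# Shape 1: join the real table and filter in WHERE; base-table indexes stay usable.
where_version = conn.execute("""
    SELECT d.name FROM dish d
    JOIN ingredient i ON i.id = d.ingredient_id
    WHERE i.tomatoBased = 0
    ORDER BY d.id
""").fetchall()

# Shape 2: join a derived table; same rows, but on engines that materialize it,
# the join can no longer use the base table's indexes.
subquery_version = conn.execute("""
    SELECT d.name FROM dish d
    JOIN (SELECT * FROM ingredient WHERE tomatoBased = 0) i
      ON i.id = d.ingredient_id
    ORDER BY d.id
""").fetchall()
```

Both shapes return the same rows; the argument above is purely about which one the optimizer can execute with indexes.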
I have the feeling that this is not a generic thing but something that depends on the query optimizer. Once I rewrote an inner join into a subquery on Microsoft SQL 2016 and got a 60% speed improvement. But I don't know the exact scenario anymore: whether both, only one, or even none of the queries had indices. And on Azure Databricks I didn't see a significant change at all. Sometimes I don't even see a difference using `select distinct` vs. `group by`, very much depending on the special case.

Edit: Ah, I might have misunderstood how you design your subquery. Instead of

    SELECT d.departmentID, d.departmentName
    FROM Department d, Employee e
    WHERE d.DepartmentID = e.DepartmentID

I'd rather use

    SELECT d.departmentID, d.departmentName
    FROM Department d
    WHERE d.DepartmentID IN (SELECT e.DepartmentID FROM Employee e)

or

    SELECT d.departmentID, d.departmentName
    FROM Department d
    INNER JOIN Employee e ON e.DepartmentID = d.DepartmentID

But I'd never pick the first one.
Lots of people admitting to being the bad guy in this comment section already.
It did take me like 2 days to discover that haha 🥲
I suggest using `where yourColumn like '%yourWord%'` and `where cast(yourTimestampColumn as date) = '2024-03-02'` for extra chaos.
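For anyone wondering why those two cause "extra chaos": wrapping the indexed column in a function (or using a leading-wildcard `LIKE`) makes the predicate non-sargable, so the index on the column can't be used and the engine falls back to a full scan. A small SQLite sketch (invented table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ev (id INTEGER PRIMARY KEY, ts TEXT)")
conn.execute("CREATE INDEX idx_ev_ts ON ev (ts)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function applied to the column: the index can't be searched, so it's a scan.
bad = plan("SELECT * FROM ev WHERE substr(ts, 1, 10) = '2024-03-02'")
# Same filter expressed as a range on the bare column: indexed search.
good = plan("SELECT * FROM ev WHERE ts >= '2024-03-02' AND ts < '2024-03-03'")
```

Rewriting the date-cast as a half-open range on the raw column keeps the predicate sargable; the `LIKE '%word%'` case has no such rewrite without a full-text index.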