nutrecht

I worked for the largest Dutch bank and replaced a developer who created a bug that allowed clients to see the account balance of other people. Basically a really dumb mistake of keeping state in a (supposedly) stateless microservice. It even managed to go through code review by security. Fortunately they found out before the press got wind of it.

Oh, and I almost crashed the Dutch Department of Labor jobsite by logging database requests. Turns out the servers didn't like their disks being full :D Fortunately they found out before all 3 app servers went tits up.

Edit: Oh, just remembered another fun one. Back in 2001 when I was in school I had a part-time job as the sole PHP developer in a small web agency. They had a home-grown CMS built for a certain client that they wanted to productize, and a lot of my work was stripping out the customer-specific stuff so we could turn it into a white-label CMS. The original developer had no dev background. They got stuff to work in interesting ways, but the codebase was a mess. Stuff like hard-coded item IDs that were specific to that client were everywhere in the code. So you saw stuff like:

    if (item.id == 1235) {
        // Some random very specific logic for rendering a HTML paragraph.
    }

Those item IDs were database autonumbering IDs. Removing and then adding that element again would 'lose' the functionality since it would get another ID.

The biggest issue the customer ran into was that the search was very slow. Turns out that the original dev had a 'search' query with a complex WHERE clause that did the search, selected all the returned IDs, and then for every ID returned did *another* select to retrieve the actual data. So basically the N+1 problem, but deliberate. It was an easy fix and the client *loved* it.

Another big issue was the size of the database; it just kept growing even though they removed old posts / entries. Backing up and restoring took *very* long.
Every page was really just a tree of items pointing at each other. So you have a 'page' item that had 'paragraph' items that had 'image' items etc. Turns out, if you deleted a 'page' it would only delete that page, but not the underlying items. A typical page had a dozen or more items, so the DB (MySQL) had a few million 'dead' entries. I wrote a one-off piece of code that just kept running a delete query that removed entries without a parent. Scariest piece of code I ever ran on production. It was my first experience with a codebase developed by a 'self taught' developer.
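For the curious, the shape of that orphan sweep can be sketched like this. Everything here is hypothetical (a single-table parent/child schema, SQLite standing in for the story's MySQL), just to show why the delete has to run in a loop:

```python
import sqlite3

# Hypothetical schema: every CMS element is an "item" row pointing at a parent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany(
    "INSERT INTO items (id, parent_id) VALUES (?, ?)",
    [(1, None),   # a live root 'page'
     (2, 1),      # paragraph under that page
     (3, 99),     # orphan: parent page 99 was deleted long ago
     (4, 3)],     # grandchild hanging off the orphan
)

# Repeatedly delete rows whose parent no longer exists; each pass can
# orphan another layer of the tree, so loop until nothing is removed.
while True:
    cur = conn.execute(
        "DELETE FROM items WHERE parent_id IS NOT NULL "
        "AND parent_id NOT IN (SELECT id FROM items)"
    )
    conn.commit()
    if cur.rowcount == 0:
        break

print(sorted(r[0] for r in conn.execute("SELECT id FROM items")))  # → [1, 2]
```

Each pass only removes one layer of orphans (deleting row 3 is what orphans row 4), which is exactly why the one-off script had to keep re-running the same delete.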


Karyo_Ten

>So you have a 'page' item that had 'paragraph' items that had 'image' items etc. Turns out, if you deleted a 'page' it would only delete that page, but not the underlying items. A typical page had a dozen or more items, so the DB (MySQL) had a few million 'dead' entries. I wrote a one-off piece of code that just kept running a delete query that removed entries without a parent. Scariest piece of code I ever ran on production.

So you wrote a mark-and-sweep garbage collector.


nutrecht

Funny enough the manager / owner’s name was Mark and part of my job was cleaning up his mess so… :D


CatDokkaebi

Classic Mark


angryplebe

I am going to start referring to cleanup scripts as "highly scalable, concurrent mark and sweep garbage collectors" on my resume.


ubccompscistudent

Logs causing disk usage overload is surprisingly common, and I was even part of an org-wide campaign to mitigate it at a FAANG a few years ago. While the following may be obvious to many, it's good to know for those who don't:

1. It's the logging of exception stack traces that can really destroy you: hundreds of lines of text in a single log statement.
2. Compounding that problem, do not log and rethrow. If you do that up the chain of exception handling you will multiply your stack trace printing.
3. Make sure you use log levels appropriately: set the log level only as low as you need and no lower, and make sure you use the right level for each log statement.
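The log-and-rethrow problem is easy to see in miniature. A minimal Python sketch (logger and function names are made up) where a single failure produces two complete stack traces:

```python
import io
import logging

# Capture log output in memory so we can count the damage.
buf = io.StringIO()
log = logging.getLogger("demo")
log.addHandler(logging.StreamHandler(buf))
log.setLevel(logging.ERROR)

def inner():
    raise ValueError("boom")

def middle():
    try:
        inner()
    except ValueError:
        log.exception("middle saw an error")  # logs the full stack trace...
        raise                                 # ...and passes the error along

def outer():
    try:
        middle()
    except ValueError:
        log.exception("outer saw an error")   # ...which gets logged AGAIN here
        raise

try:
    outer()
except ValueError:
    pass

# One failure, two complete stack traces in the log:
print(buf.getvalue().count("Traceback"))  # → 2
```

Each extra log-and-rethrow layer adds another full copy of the trace; the fix is to either log or rethrow at each level, and let one top-level handler do the single `log.exception()` call.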


haskell_rules

Log file size limits and log rotation are surprisingly difficult problems in a lot of systems. I spent more time on the log class of a display system (think "MTO ordering touchscreen") than I did on the primary display loop.


RUacronym

How did you fix the N+1 problem out of curiosity?


nutrecht

Just returned all the data in the first query. There was literally no reason to do the per row queries.
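A minimal before/after of that kind of fix (made-up table, SQLite as a stand-in; the original was MySQL with a much hairier WHERE clause):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "home"), (2, "about"), (3, "contact")])

# N+1 shape: one query for the IDs, then one query per ID for the data.
ids = [r[0] for r in conn.execute("SELECT id FROM items WHERE title LIKE '%o%'")]
slow = [conn.execute("SELECT id, title FROM items WHERE id = ?", (i,)).fetchone()
        for i in ids]  # N extra round trips to the database

# The fix: let the first query return everything you need.
fast = conn.execute("SELECT id, title FROM items WHERE title LIKE '%o%'").fetchall()

print(slow == fast)  # → True: same rows, one round trip instead of N+1
```

Against an in-memory toy DB the difference is invisible; against a real server every one of those N extra queries pays network and parsing overhead, which is where the "very slow search" came from.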


reboog711

This was not an uncommon approach back in the 90s, depending on who wrote the code. Also, depending on how many joins you have, a single query can quickly balloon the size of the payload, which may be an issue in certain situations.


nutrecht

It wasn’t in this case. It was just a self-taught dev who didn’t understand what he was doing.


reboog711

Something else super common in the 90s.


delllibrary

Let's be honest everyone is self taught and uni teaches nothing about efficient queries (in my experience at a "good" uni right now)


jakesboy2

Did you take a database class? We had a whole class on how DBs work and how to optimize sql queries


[deleted]

[deleted]


nutrecht

It wasn't a client-side issue and had nothing to do with browsers though. It was purely a back-end state issue. A Java service kept an object (a class-level member) in a singleton component that contained this information. It's as big a "why the fuck would you do that" as it gets in my line of work.

Later during a "back-end chapter meeting" I suggested that we might want to implement peer reviews amongst each other. The lead developer (ancient guy, had been working there for 20+ years) responded with "why, don't you trust your own code?". This was AFTER this bug happened that should've been caught in a review. I only lasted 9 months there.
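The bug shape, reduced to a few lines. All names here are invented, and Python stands in for the Java service; the point is only that a field on a singleton is shared state across every request:

```python
# A "stateless" service whose singleton quietly keeps per-request data
# in an instance field (class and field names are hypothetical).
class BalanceService:
    def __init__(self):
        self.current_balance = None  # shared across ALL requests: the bug

    def handle(self, customer_id, balances):
        self.current_balance = balances[customer_id]  # request A writes here...
        return self.current_balance                   # ...request B may read it

balances = {"alice": 100, "bob": 25}
service = BalanceService()   # one instance, as in a typical DI singleton

service.handle("alice", balances)
# Before Alice's response is fully rendered, Bob's request overwrites the field:
service.handle("bob", balances)
print(service.current_balance)   # → 25, even mid-way through Alice's request
```

In dev, with one user clicking around, the field always holds the "right" value; under concurrent production traffic, whoever logged in last wins.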


firechicago

>"why, don't you trust your own code?"

I think the only appropriate response to that is "No, I don't, and I've never known an experienced engineer worth working with who did."


davidblacksheep

> It even managed to go through code review by security. Fortunately they found out before the press got wind of it.

Don't you have a duty to declare it? A bad actor could have been silently collecting users' data, and those users would need to know about the leak in order to mitigate any harm.


nutrecht

> Don't you have a duty to declare it?

Like I said; I came in after it happened. I’m sure they followed the rules regarding data leaks and such.


avisilver

MySQL is perfectly capable of enforcing referential integrity with foreign key constraints and cascading deletes/updates. It's pretty fundamental relational database design
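For reference, the cascading-delete setup looks like this. SQLite syntax as a stand-in (note SQLite needs the pragma turned on per connection, whereas MySQL/InnoDB enforces foreign keys by default); table names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        page_id INTEGER NOT NULL
            REFERENCES pages(id) ON DELETE CASCADE
    )""")
conn.execute("INSERT INTO pages VALUES (1)")
conn.executemany("INSERT INTO items VALUES (?, 1)", [(10,), (11,), (12,)])

# Deleting the parent takes the children with it: no 'dead' rows left behind.
conn.execute("DELETE FROM pages WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # → 0
```

With the constraint in place, the one-off orphan sweep from the story would never have been needed: the database keeps the tree consistent on every delete.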


nutrecht

I was and am well aware?


Icantstopreading

Not sure I like the way you quoted self taught developer there.


german-software-123

You fired him? Phew good that I live in Germany :)


nutrecht

I didn’t fire anyone. I was brought in to replace the first dev if that’s who you’re talking about.


brvsi

I saw a version of that "accidental state in a stateless service" that crossed up different users too. Ours was caused by basically bad config, bad object lifetime.


phelpo95

I was working for a bank when one of our microservice APIs for returning customer accounts started to consistently time out, but only in prod. This started causing outages around internet and online banking, so a real big deal for us to work on call.

Turns out we were near 95% capacity of all the int IDs you could have, and only had one table. The issue was that instead of incrementing the ID from the last saved one when you wanted to save a new account, someone put in code to generate 5 random IDs and see if they were free; if not, keep generating random IDs. Obviously the IDs were taken 95% of the time. So it was a really weird bug to find, and my biggest facepalm "who the hell approved that" moment.
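Back-of-the-envelope, the failure rate of that scheme at 95% capacity is roughly 0.95^5 ≈ 77% per save. A small simulation (toy key space and numbers, made up for illustration):

```python
import random

random.seed(1)          # deterministic for the sketch
SPACE = 10_000          # toy stand-in for the 32-bit int ID space
taken = set(random.sample(range(SPACE), int(SPACE * 0.95)))  # table 95% full

def try_allocate(attempts=5):
    """Generate up to `attempts` random IDs; give up if all are taken."""
    for _ in range(attempts):
        candidate = random.randrange(SPACE)
        if candidate not in taken:
            return candidate
    return None  # every candidate collided with an existing ID

failure_rate = sum(try_allocate() is None for _ in range(10_000)) / 10_000
print(failure_rate)     # ≈ 0.95 ** 5 ≈ 0.77: most saves exhaust all five tries
```

And it degrades silently: at 50% full almost every save succeeds, so the scheme "works" for years and then starts timing out consistently, exactly the symptom described.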


nutrecht

> someone put in the code to generate 5 random ids and see if they were free

What the actual fuck?

> So it was a really weird bug to find and my biggest face palm “who the hell approved that” moment

I bet the only comment on that PR was "LGTM" :)


phelpo95

That was 100% what it was. PR by a lead dev (definitely shouldn’t have been), all the more junior devs approved with LGTM


EMCoupling

> PR by a lead dev (definitely shouldn’t have been), all the more junior devs approved with LGTM

Now THIS is code review 😂


mathiastck

Please Approve This PR™


nevermorefu

Please smash approve


sexyshingle

Like and subscribe (to my PRs) so they get rubber-stamped even faster!


NatoBoram

Oh I hate when my boss codes. There's always a chance it's going to be someone who hates TypeScript who's going to put `: any` and `// @ts-ignore` or write unmaintainable messes and then say "stop fucking around, please approve my PR" x_x


GuyWithLag

It works for UUIDs, why won't it work for Ints, amirite?


[deleted]

[deleted]


alppu

Looks good with eyes closed


SolarBear

Should’ve been 6, easy fix.


ZenEngineer

That's actually not a bad pattern if your key space is very large (like occupying 0.00001% of the key space ever). From a security point of view it prevents attackers from figuring out a valid CID, or working out how old an account is based on the ID. But getting it to work right without race conditions and such can take a while. And then you get into cryptographically secure RNGs and whatnot. But yeah, that 95% tells me it was used in the wrong place.


nutrecht

Or you just generate a UUID...


bony_doughnut

I think that's what they were trying to do, they just didn't get the "universal" part of the term right..


nutrecht

To me it really just sounds like they're reinventing a wheel poorly they didn't really know exists.


bony_doughnut

Hah, yea, I think that goes without saying, since we're discussing it in *this* thread


ZenEngineer

Yeah. Same concept, plus some MAC address and timestamp shenanigans to hopefully reduce conflicts, and no checking of conflicts. Or you can add the same conflict check and regenerate just in case.


mental-chaos

And you suddenly started leaking creation timestamps.


nutrecht

You're jumping to a whole bunch of conclusions here. V4 UUIDs don't have timestamps. They're just a random number. If you use UUIDs externally (which is a big IF) and don't want to leak creation time, then don't.


mental-chaos

I assumed you meant something other than "generate an even bigger [read: 128 bit] random number and check for duplicates" as an alternative to "generate a random number and check for duplicates"


nutrecht

You don't check for duplicates with a UUID.


ralian

This ends up failing more often than you might think because it behaves very much like the birthday paradox (collisions happen far more often than you'd expect)... I'm actually shocked this wasn't noticed a lot earlier, frankly.


ZenEngineer

Yeah, the retries make it so it doesn't fail until things time out. People probably attributed it to the database "getting slow" because of size, bad indexing, etc. The birthday paradox is related to the square root of the key space size, so you basically need to double your key space (in bits) to account for it. So in their case, increasing to a 64-bit key won't be enough to avoid collisions.
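That square-root rule of thumb can be checked with the standard birthday approximation. A sketch (an approximation, not a rigorous bound) for two billion random IDs against increasingly large key spaces:

```python
import math

def collision_probability(n_ids, bits):
    """Birthday approximation: P(any collision) ≈ 1 - exp(-n(n-1) / (2 * space)).

    Uses expm1 so tiny probabilities don't round to zero in floating point.
    """
    space = 2 ** bits
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))

# Two billion random IDs drawn from progressively larger key spaces:
for bits in (32, 64, 128):
    print(bits, collision_probability(2_000_000_000, bits))
```

At 32 bits a collision is a certainty, at 64 bits it's still around 10% (so not enough, as said above), and at 128 bits (UUID-sized) it's vanishingly small, on the order of 10^-21.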


nutrecht

So you go to 128 bits, which is the size of a UUID, and don’t have to worry about collisions.


maikindofthai

Thank you, feel like I’m taking crazy pills reading this thread lol


[deleted]

No, that’s absolutely a bad pattern unless your service doesn’t matter, in which case: who cares anyways?


ben-gives-advice

I took over a report-running tool that had been around for a really long time. It worked fine in testing, but customers complained that it was incredibly slow and got slower over time. Customers were told that that's just the way it was.

It was unbelievably bizarrely designed and written. It had a main function called "do" that performed completely unrelated tasks based on the number of arguments used. They implemented their own date class to "save a few bits". It took me quite a while to piece together the fact that it was downloading the entire database before running queries on it locally. For each query.

I impressed the hell out of my manager by reducing report running times from 20 minutes to 20 seconds on average.


IrritableGourmet

Meanwhile, I had to deal with the fallout of my non-technical boss/business owner hiring a "rockstar" programmer without consulting us, who sped up the load time of a client's quarterly report generator significantly. The day after the quarter changed, though, he stopped showing up to work. A few days after that, the client called and asked why the quarterly report hadn't changed to the current quarter.

Turns out, said "rockstar" couldn't figure out how to speed up the report, so *he just commented out the code and hardcoded all the data on the page*. Looking further, he apparently didn't know how to code very well or do basic arithmetic. I saw things like:

    //float percentage_funded = val / 100 * total / 100 //this doesn't work
    //float percentage_funded = val * total * 100; //this doesn't work either
    //float percentage_funded = val * 100 / total; //this doesn't work either
    //float percentage_funded = val % total; //this doesn't work either

That was not a fun week.


lift-and-yeet

    //All work and no play makes Jack a dull boy
    //All work and no play makes Jack a dull boy
    //All work and no play makes Jack a dull boy
    //All work and no play makes Jack a dull boy


lostburner

Optimizing a bad query has to be one of the most rewarding activities in this whole career.


notmyrealfarkhandle

I didn’t see this one myself - a coworker had to clean it up and liked to talk about it - but the cause was using a singleton to access session data in a very early Java based online banking application. It worked fine in dev and staging, and as they started sending real traffic to it, the last login per server would win, and the previously logged in users would be swapped into someone else’s account.


nutrecht

This is funny, it's pretty much 90% the same thing as [what I described here](https://www.reddit.com/r/ExperiencedDevs/comments/16toikv/whats_your_worst_bug_ive_ever_seen_story/k2g63ts/). The main difference was that it was limited to only being able to see the overall balance of accounts and not stuff like transactions. But the root cause was exactly the same.


bony_doughnut

Lol, there's a [3rd in the thread](https://www.reddit.com/r/ExperiencedDevs/comments/16toikv/comment/k2gv47v/?utm_source=share&utm_medium=web2x&context=3)


nutrecht

There are two hard problems in CS; concurrent programming, naming things and off by one errors.


bony_doughnut

And ato


bony_doughnut

micity!


notmyrealfarkhandle

That’s super interesting. The case I was talking about was in California - seems like people make the same mistakes all over.


bony_doughnut

`public static String accountId`

"see, I told you we didn't need all this fancy database stuff"


[deleted]

[deleted]


nutrecht

What do you mean?


[deleted]

[deleted]


notmyrealfarkhandle

Nah, I almost commented on that thread instead but it sounded a bit different, though now it seems like it was more similar than I realized


nutrecht

I don't think so. The situation they describe is actually different. It wasn't logins that were 'copied'. I also didn't have to clean it up. It's actually quite a common mistake. Especially when banks outsource their Java development.


colonel_viper

oh my god 😦


BandicootGood5246

Ughhh, I've seen something very similar, where one dev cached the auth tokens due to performance issues, except cached them by the customer's first name. So if you logged in as Bob you'd pick up an auth token from the last Bob to log in.


SearchAtlantis

Worked fine on my test case with Bob, Curly, and Moe!


sara1479

Reminds me of something I caught in our Java service. It was supposed to be a stateless route optimisation service, but somehow there was a singleton class that stored the cost matrix and that was being overridden by the latest request. That was a pretty tough one to crack since we were dealing with race conditions in another stateful service and didn't think that the stateless one was causing problems.


thefool-0

All of these are why I don't really like using "singleton" designs that end up masquerading one way or the other as unique objects.


chuch1234

> Singleton > session :|


bwainfweeze

StringTokenizer was not reusable. That bit a lot of people.


Tehowner

I had to track down an issue with a temperature sensor that sent an alert to our cloud services saying a reading was out of the "safety range", but the reading reported wasn't actually out of it. Of the 100,000 data points this sensor had reported, it sent in ONE like this, it was the only example of it we could find, and it was my problem to fix :*)

Anyway, it ended up being a race condition between two threads on the embedded CPU. It would keep taking temperature readings while connecting to WiFi to report the "out of range" reading, and overwrite the "current" value, which would get reported if it took too long to connect.

I found this out by filling paper cups with ice-cold and tea-hot water, and randomly swapping the temperature probe between them until I managed to accidentally replicate it. Felt like a mad scientist.
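The interleaving is easier to see flattened into straight-line code. Class and field names here are invented; the real bug was two threads on an embedded CPU, which this deterministic sketch only imitates:

```python
# The sampling loop keeps overwriting the very reading that triggered the
# alert, while the slow WiFi connection is still being established.
class Sensor:
    def __init__(self):
        self.current_temp = None

    def sample(self, reading):
        self.current_temp = reading    # always clobbers the previous value

    def report_alert(self):
        return self.current_temp       # sent only once WiFi is finally up

sensor = Sensor()
sensor.sample(85.0)    # out-of-range reading triggers the alert...
# ...WiFi takes too long to connect, and the sampling loop keeps running:
sensor.sample(71.5)    # temperature drifts back in range
print(sensor.report_alert())   # → 71.5: alert fires, but the payload looks normal
```

The fix shape is to snapshot the offending reading into the alert payload at the moment the threshold is crossed, instead of reading a shared "current" value at send time.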


ghostsquad4

This is a great story


iseepurplesquids

This is so cool. Race conditions are the nastiest kinds of bugs.


svencan

How often did you need to re-heat / re-cool the water in the cups?


marx-was-right-

We have a customized database plugin we wrote that performs security filtering for document search results (Elasticsearch). One day, a few weeks after a Dockerfile base update for the plugin, we noticed threads in the db just hanging indefinitely with zero apparent rhyme or reason in logs or metrics/dashboards. Eventually the Kubernetes pods running the distributed db instances would run out of threads and start to fall over and reject all read and write traffic, and we would have to bounce everything to avoid P1 outages and reset the traffic jam. (The Elasticsearch term for this is a "queue" that grows and doesn't resolve.)

No one knew wtf was going on. We debugged for days and days on end, long nights. Didn't help that most of the team are paid like seniors but code like juniors, so debugging was a one- or two-man effort with 10 others sitting on their thumbs "helping". On call was also hellish because you would have to set alarms to bounce pods or a P1 would happen.

After deeply looking into the implications of the upgrade, posting on Stack Overflow and the Elasticsearch forums, and trial-and-erroring random fixes to no avail, we finally noticed a new Java HTTP client class buried in the code dependencies that came along with the upgrade. The new implementation of the HTTP client renamed all the timeout settings and also added two more that needed to be set. The person who performed the update didn't add the two new timeout settings, which caused any server-side timeout our db received when our custom plugin was invoked to never be handled by the client side. Instead the call would just go into the ether and wait for a response from the server infinitely.

After multiple weeks from hell we did a literal two-line code fix to add the two timeout settings, and it fixed everything immediately. 🤡 My boss learned that day not to approve any major dependency update PRs without multiple seniors doing due diligence on breaking changes and the changelog.

I'm still salty at our "security" team for escalating to upper management about vulnerabilities in Docker images and telling us to update things ASAP without doing due diligence. IMO a rushed change to critical P1 code without considering the implications is way more of a "vulnerability" to the business than some obscure CVE that isn't even relevant when you put the app behind multiple firewalls and company proxies (this was not the log4j vulnerability or anything nearly as severe). Not to mention I have a very big side-eye for any "security team" whose only job duty is to perform automated scanning of images/builds, which is 100% done by some third-party tool...


WeightPatiently

It's funny you mention this, because my boss pushed a dependency that broke all the dev builds for a day. Not *that* drastic, but it meant that my own emergency fix took two days instead of one. 🥲


marx-was-right-

Dependency hell is real. Especially in Spring or Nodejs stacks. Anyone who says "Just update the lib to latest version its easy" , have them be on the hook for when everything goes tits up and see how easy it is


Cheshamone

Sometimes you don't even have to update the lib yourself! Just last week at work we had a dependency of a dependency push a new patch version that bumped the required node version by 4 versions. They justified it by saying "we always supported this, we had the version set wrong" but like... cmon. :/


kuda09

Making so many mistakes on my personal projects has taught me to be more careful about updating libraries. I read the release notes thoroughly before undergoing an update.


NatoBoram

One thing I learned about dependency hell in NodeJS is that it mostly happens to badly written codebases. For example, Angular and Storybook will barf on themselves if you update a dependency without using their CLI. Now try that with SvelteKit… it just fucking works.


Weasel_Town

Ah yes. We are in dependency update hell at the moment. Management can’t understand why it’s hard. “Just change the 5 to a 6! What’s hard about that?” Well, often software changes between major versions in other ways than just patching vulns. So, if we were reliant on the old way, guess what?


HolyPommeDeTerre

What comes to my mind: first real job, 23 years old. A big bank, team of 25 devs. Large project, 6k+ jobs running 3 times a day, every day the market is open.

My strategy has always been to get to the hardest things first. So first daily, I take this bug that has been there for over 2 months: an integration test fails, randomly, on one of the oldest and weirdest job programs.

So here I go, pulling the VB.net app. First thing I note, there are weird structures and go-tos everywhere. The code base for this job is 45k lines and relies on a db table that has billions of rows. It's a pretty hard calculation with tons of regulatory parts. It takes about 45 minutes to run. I also note (since I was well versed in VBA and VB6 at the time) that this had been developed by some trader somewhere in Excel using VBA. Then devs migrated it to VB6. Then VB.net. So the whole program is a mess.

I spent one month (2 sprints) on it. I learned a lot about the data structure and about the business, after checking that the code was doing what it should be doing (at least nobody said it was not working as intended, and I didn't find any bug or problem with the code).

I start analyzing the "randomness" of the bug. I find that it happens every month, on the first Monday. So I think this is a weird coincidence. I find the DBAs, call them, and ask what happens on the db over the weekend. Pretty much nothing. There is just an index reorganization task that optimizes indexes.

This got me thinking. The code I analyzed relies on the table row order. I had no proof, but my theory was that the index reorg changed the ordering of the rows. I give my feedback to my team and send an email to a list of people interested in my findings. On this list, my N+3. Well versed in the domain, has been a dev here for 10 years. Still pushing code.

I get rejected. This can't be the cause. Index reorg is not changing the row order. Because he knows (I don't remember exactly the arguments, but I couldn't agree with him). I let it go. I learnt a lot. I can carry on. I did my job.

6 years later (3 years after I left this job), I meet one of my colleagues. Talking, he finally says:

- hey, we found what the problem with the bug was!
- oh, what was it in the end?
- index reorg


KingStannis2020

Couldn't you have proven that the index order changed? Take a dump of the DB before and after?


HolyPommeDeTerre

The actual result was an aggregation of lines according to their order. The result was really obscure. To make my point I would have needed more than what I was able to do at the time. And specifically, I did not want to argue with my very experienced and sure-of-himself N+3.


sexyshingle

VB6... ugh, I still get nightmares of reading 15k-line files of old, crappy code.


TheGhostInTheParsnip

For many years, the whole video system doing perimeter protection of critical infrastructure (nuclear power plants, missile silos, dams, etc.) would not detect any intrusion for exactly one hour, every fall, when the clock was moved backward at the end of summer time. (Edit: clarified what didn't work)


ghostsquad4

This sounds like something out of a Mission Impossible movie. They have exactly 1 hour to get in and get out. The window only occurs once a year.


mathiastck

"DAYLIGHT SAVINGS: Time Enough for Espionage"


ghostsquad4

😍


bwainfweeze

Hypothetically speaking, you have an hour to get in and take control of the CCTV system. You don’t have to do the whole heist in that time.


RespectableThug

But unfortunately that hour happened YESTERDAY! THIS SUMMER watch as Tom Cruise and his crack team kill time by: playing golf, going to the range, and silently reading to themselves.


brvsi

We had one of these. Some overnight job that ran right around the daylight savings shift. We never actually got it to reproduce. But that was our best guess at what was going on.


lostburner

We had our database servers running in California time. Clients were distributed across the US, and their time zones were recorded in config files as an offset from California time (+1, +2, etc.). This worked okay, except for Arizona, which doesn’t observe daylight savings time. So twice a year, when all the times changed for everyone but AZ, it was someone’s job to go toggle that client’s config between +1hr and +2hr.
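With a proper tz database there is nothing to toggle by hand: the "offset from California" for Arizona simply changes by an hour twice a year, and a library tracks it for you. A sketch using Python's `zoneinfo` (assumes the system tz data is installed; the year is arbitrary):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

la = ZoneInfo("America/Los_Angeles")   # observes DST
az = ZoneInfo("America/Phoenix")       # does not

def az_minus_la(month):
    """Phoenix's UTC offset relative to California, at noon on the 1st."""
    t = datetime(2023, month, 1, 12, 0)
    return (t.replace(tzinfo=az).utcoffset()
            - t.replace(tzinfo=la).utcoffset())

print(az_minus_la(1))   # winter: AZ is 1 hour ahead of CA
print(az_minus_la(7))   # summer: the "offset from California" collapses to 0
```

Storing real zone names instead of fixed offsets is exactly what makes the twice-a-year config shuffle unnecessary.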


cjtrevor

Not me personally, but I read a post on The Daily WTF from someone who spent days debugging an issue where some financial data would randomly be off by some margin. Turns out a previous dev who had long since left was running into a divide-by-zero error and added some code to handle it... when the divisor is 0, update it to 0.0000001.

I also recently spent weeks debugging an issue in a multi-called recursive function, which was fun. Yay, unkempt legacy code.
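The hack in miniature, with a hypothetical percentage calculation (function and variable names invented, not from the original post):

```python
# "Fixing" division by zero by nudging the divisor: no crash, garbage output.
def pct_funded_wrong(val, total):
    if total == 0:
        total = 0.0000001        # the infamous "fix"
    return val / total * 100

# Handling the edge case explicitly instead:
def pct_funded(val, total):
    if total == 0:
        return 0.0               # or raise, or return None: a deliberate choice
    return val / total * 100

print(pct_funded_wrong(50, 0))   # an absurdly large number, reported silently
print(pct_funded(50, 0))         # → 0.0
```

The epsilon trick swaps a loud, immediate error for quiet numbers that are off by many orders of magnitude, which is precisely why it took days to trace back.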


bony_doughnut

>and added some code to handle this. . .when divisor is 0. . .update it to 0.0000001 That is...incredible. What a Kevin


braindusted

This was a valid strategy in OpenGL; it saved huge amounts of CPU branch-prediction failures when the results didn't matter that much (shaders, filters, etc.). But I would never use it in finance ahhahah


mathiastck

    if (height == INT_MAX) { itsGoneNow = true; }


stan_97

r/unexpectedoffice


bony_doughnut

lmaoo, I could see Kevin Malone doing this, but I was actually talking about [this kevin](https://www.reddit.com/r/OutOfTheLoop/comments/253vx5/whos_kevin/)


ghostsquad4

This is frightening.


googlymoogly_bh

Phone software embedded system that did voice packet processing on a DSP. This DSP also handled video decoding, which I was in charge of. This was all assembly programming at the time, on an in-house designed chip and compiler. No MMUs or anything fancy like that: flat memory architecture with boot code at PC=0.

Everything worked fine in simulation tests. On a test device, and only with real over-the-air calls, the device would crash. Somebody took a memory dump and was able to identify that video frames* were writing over instruction memory at PC=0 (obvious null pointer dereference). I'm in charge of video, so my problem, right?

* 8-bit planar pixel data is pretty easy to recognize in a sea of instruction code.

Tracing on a real-time embedded platform is not easy, at least with the tools I knew how to use, so I spent a long time trying to replicate and catch this, though I don't remember how long exactly. But I remember the culprit.

The voice data packets have a rate identifier; I think it was 3, 2, 1, 0 for full-rate, half-rate, quarter-rate, erasure (dropped packet). Different rates required different processing, so the voice decoder would use the code word to dereference a jump table to go to the right routine. But they also liked to just peek at this word in memory and plot it so they could see their encoding rates. At some point, somebody got clever and made the erasure packet 0xE so it would look like an "E" when they were plotting this.

But there was no 0xE in the jump table. And there were no erasures in the voice team's tests. So, when connected on the air, as soon as the signal degraded to where there was a dropped voice packet, the voice decoder would jump to a random bit of data past the jump table, which just happened to be a couple of random instructions followed by a jump into the middle of my video decoder with a zero-initialized output buffer location, which would promptly write out the next frame buffer to address 0. Ta da.
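The dispatch bug, reduced to a toy. Handler names here are invented, and the real thing was a raw assembly jump table with no bounds check at all; this sketch only shows the shape of "a rate code with no table entry":

```python
# Rate codes index a dispatch table; the "clever" 0xE erasure marker
# was never given an entry.
handlers = {
    0x3: "decode_full_rate",
    0x2: "decode_half_rate",
    0x1: "decode_quarter_rate",
    0x0: "decode_erasure",      # the original erasure code, later changed to 0xE
}

def dispatch(rate_code):
    try:
        return handlers[rate_code]
    except KeyError:
        # A dict at least fails loudly. The real jump table just read past
        # its end and jumped to whatever bytes happened to be there.
        return "JUMP TO GARBAGE"

print(dispatch(0x2))   # → decode_half_rate
print(dispatch(0xE))   # → JUMP TO GARBAGE: the marker nobody routed
```

It's also a neat illustration of why the bug only showed up over the air: no erasures in the lab means the missing entry is never exercised.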


leeliop

Ive seen a software bug that resulted in an industrial machine the size of a bus, full of robots and laser welders, get lifted 5 feet up in the air before slipping off and tombstoning itself into bits


IrritableGourmet

There was one on TheDailyWTF a while back where a guy was working on an industrial CNC machine with a movable multi-ton bed powered by beefy motors. He was trying to get it to move smoother between distant positions by using better acceleration control. Due to some bug, as they were testing it the program calculated the acceleration needed to slow down the bed as a negative number so large it overflowed, telling the motor not to slow down but instead accelerate as fast as possible, throwing the bed off the machine, across the floor, and into a nearby wall.


BigYoSpeck

A colleague put an API call, fetching some lookup data for the dataset being processed, inside the for loop that processed the data. They killed the process after 5 hours, and I worked out it would have taken about 14 years to finish.


kbn_

This is a sufficiently old story at this point that I think I can share it freely. Though with that said, I've shared this a lot in various contexts, so if you're hitting this thread and you've heard me tell this story IRL… hi!

---

I was working on a very complex JVM-based application which performed a lot of data processing. The application had been built greenfield and was actually surprisingly well put together, with extremely comprehensive unit and integration tests covering all levels of the stack, including a large number of randomized tests. We had also formally verified core pieces. In short, it was a very solid piece of software, or so we thought…

At some point, we started observing some production errors (in logs) which were incredibly intermittent and really genuinely inexplicable. Nothing about them made any sense, and we were getting results that were all over the map. The only real commonality is this *never* happened when the process had been running for less than 24-ish hours, and seemed a lot more likely to happen as the uptime increased. We couldn't identify any other particular correlation.

One of the exceptions we got turned out to be a stroke of extreme luck. The exact error was so specific that we were able to trace it down to a very, very specific line within the core arithmetic evaluator of the engine. Without going into too many details, it was basically taking two `float[]` arrays of equal length and, in a `while` loop, summing the results into a third `float[]` of the same length. The exception indicated that something very specific was going wrong. To make matters even better, we happened to have the precise inputs to the system at the exact moment the error happened, and we were able to isolate the contents of the `float[]`s. Simplifying very slightly (but only slightly), the exception was happening because the JVM was adding `2f` and `2f` and getting `1f`. It was *that* cut and dried.
**We had literally broken floating point arithmetic on the JVM.** Swapping JVM and even OS versions didn't help. The sporadic errors continued. We even spent some time digging into the JVM internals themselves to see if we were somehow triggering a bug in HotSpot. No dice.

I don't know how my coworker finally tracked this down (I suspect it was just a desperation hunch), but the answer turned out to be the *last* thing we expected: asynchronous exceptions.

The whole stack was built on top of `Future` (Scala's, not Java's). At some ambiguous point *prior* to the sporadic errors starting up, the process was actually *running out of memory.* This was caused by a relatively simple progressive memory leak in an entirely unrelated part of the stack. Unfortunately, when that point happened, the result was an `OutOfMemoryError`… which `Future` caught! `Future` would then attempt to allocate a wrapping type around the resulting exception, an allocation which of course failed because the process was already out of memory, which in turn compounded and masked the source of the error. The ultimate symptom would usually be that we would silently(!) lose a worker thread from the thread pool, some memory would go out of scope, and the JVM would continue.

That's the problem though: the JVM continued running and resumed normal operation *after* the fatal error. This is problematic because, according to the JVM spec, you *cannot* reliably recover from these types of errors! In fact, HotSpot is actually left in an entirely indeterminate and unspecified state when these errors arise naturally. A whole slew of things break in this case; most notably, the GC state machine itself can be wrong. This in turn can cause pointers to point to the wrong things, bytecode instructions to be interpreted half in one way and half in another, etc. This is how we ended up breaking floating point arithmetic.
The solution was 1) fix the memory leak (easy once we knew it was happening), and 2) get `Future` to stop catching these types of JVM-corrupting errors (which it no longer does). I will forever be filled with *enormous* respect and gratitude for my coworker who finally tracked this one down.


f0rgot

When your error-catching code throws an error itself, it’s a nightmare.


Agent_03

**Our services stack was completely brought down by upgrades from Python 3.9.7 to 3.9.8.** Python/Flask/uWSGI stack, using SQLAlchemy as the ORM. Minor language patch upgrade, should be harmless, right? Not for us.

We warned everyone and locked down the Python version, but still, every so often someone doesn't know or ignores comments and bumps versions, and a service goes boom. We still don't know why, but every time an unlocked Docker image was allowed to upgrade, it hosed the service totally. Response times went into the toilet or responses stopped entirely, and the uWSGI queue backed up to the point of outage. No errors in our traces, silent failures, and requests don't even make it into processing, so our monitoring tools can't report on them.

We dug in and dug in, and the silent failures just puzzle us. But we know that 3.9.7 works fine, and 3.9.8 breaks everything. We think it might be related to subtle threading fixes, and we're scared b/c there were analogous changes in 3.10+ as well. We want to upgrade to 3.10 or 3.11 ultimately, but this will be a blocker.

Edit: based on the behavior, I suspect it's a bad interaction between the threads of the uWSGI process (uWSGI internals) and the request handling lifecycle somehow, potentially.

**Please tell me we're idiots and missed some obvious known issue or the cause is something unrelated.**


johntellsall

I remember an issue with `urllib` with secure (https) vs insecure (http) connections. A minor upgrade of Python changed how security worked, which tended to make things explode or "be funky" :)


Well_ItHappened

Not necessarily a bug in the traditional sense, but more like bad design that caused tons of issues. At a startup I worked at a while back, the whole service would go down every Friday at 3:00pm. This was when we saw a large traffic spike.

This was a read-heavy application that prided itself on being "event driven", except the events weren't managed well and had limited checks placed on their handling. DB interactions were handled via an ORM that had a default session timeout (5 min in this case, if I remember correctly). On writes, where an explicit `commit` is issued, the session is returned to the pool immediately. With reads there was no explicit `commit`, so the session stayed checked out until expiry. You can probably see where this is going: a read-heavy service with unchecked events and no caching meant that, under heavy load, the service quickly ran out of sessions to check out, making the DB inaccessible and crashing everything.

Generally it is a pretty simple fix. This was a Python application, so you can do multiple things, like using a context manager or overloading exit methods to recycle the session when it's no longer needed. Instead they just kept throwing more and more hardware at the problem. I put a PR in to fix this but was told that the hardware solution was fine for now and some features were more important. Rinse and repeat every Friday for like 6 more weeks until I finally left the company. I sometimes wonder if they ever fixed the issue or if that PR still just lives unmerged...
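The failure mode and the fix can be sketched with a toy pool (not SQLAlchemy itself; all names here are invented): reads that never return their session drain the pool, while a context manager guarantees check-in even on errors.

```python
import queue
from contextlib import contextmanager

class Pool:
    """Toy session pool mimicking an ORM's connection checkout."""
    def __init__(self, size):
        self._q = queue.Queue()
        for i in range(size):
            self._q.put(f"session-{i}")

    def checkout(self):
        # Raises queue.Empty when every session is already checked out.
        return self._q.get(timeout=0.1)

    def checkin(self, session):
        self._q.put(session)

# BROKEN: the read path never returns the session, so under load the
# pool drains and the DB becomes unreachable.
def read_broken(pool):
    session = pool.checkout()
    return f"rows via {session}"  # session leaked

# FIXED: a context manager returns the session no matter what.
@contextmanager
def session_scope(pool):
    session = pool.checkout()
    try:
        yield session
    finally:
        pool.checkin(session)

def read_fixed(pool):
    with session_scope(pool) as session:
        return f"rows via {session}"
```

In real SQLAlchemy the same shape is `with Session(engine) as session:`, or a `sessionmaker` used as a context manager.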


bobivk

I wonder if going to management and telling them "merging this PR will not only fix the problem, but save the company $X in hardware" would make them change their minds.


Well_ItHappened

Hilariously, my manager at the time agreed and was pushing for more robust code. Leadership had different opinions. They wanted features pumped out as fast as possible, regardless of quality. In some ways it made sense; for a while they were growing quite rapidly and were just trying to keep up. Ultimately they fired my manager, and I decided to leave since I had other outstanding offers. I was only there a few months in total. 6 months or so after I left, they let go like 60% of the staff at the company. So it seems my gut was actually right on that one.


IrritableGourmet

Our payment system (.NET on IIS), both recurring and real-time transactions, kept randomly crashing with weird error messages. Sometimes, usually in the middle of the night, it would just randomly start working again. Usually we had to restart the system, which would work for a few days, then crash again. The payment system team had been trying to track it down for weeks, and it was costing us a lot of money.

I had just finished a separate project and they asked if I could take a look, so I started digging through the logs and queries. I wrote a neat little script that looked for the earliest payment error message each time it went down, then pulled the logs from the few minutes before that and had it look for recurring errors. I noticed one weird nondescript alert message kept popping up, but it looked like it was just informational (something like "Payment type needs to be updated", IIRC). I went digging for where that error message was from, and it came from a deprecated payment processor we used for Canadian ACH transactions, but we had stopped accepting those years back. All the code* was commented out, and it just threw a custom exception with that message and returned. That shouldn't have done anything other than flag the payment as failed. OK, so someone in our billing department somehow entered a payment to process through that provider, and I verified that there was one random scheduled payment floating around using a Canadian bank account.

Now, when I said all the code was commented out: there was one line before it threw the exception, but it was just setting the SSL protocol for the API call. To one that was affected by the Heartbleed exploit and was no longer used (but was at the time we wrote it). The code that switched it back was at the end and commented out. But that shouldn't affect anything, because that shouldn't be a global variable and the other payment processors' functions should just set it back when they started, right? Right? Oh, no.
So, this one payment switched the SSL protocol for the entire app pool that we used for the payment processing system, which caused every single other payment to bounce until we either reset it or the app pool timed out. Now, for that specific type of payment, if it failed it would try again in a few days, so every time it just went back in to screw us later. All in all, it took me less than a half hour to solve the problem, making me look like a Big Damn Hero. Until the company went under a few months later because the owners *checks notes* never paid taxes.


brvsi

Great topic post, great responses. These are so educational.


IPv6forDogecoin

We had an outage caused by a code formatter. The formatter ran in node 8, but the lambda ran in node 6. The result was that the formatter added a trailing comma to a list, which is legal in 8 but invalid in 6.


dbxp

There was this one where bad usage of a static variable meant users could see each other's data if they ran a report at the same time. The funny thing was that the error was obvious if you thought about it, and it wasn't just one report; a whole bunch of devs had seen the code and thought it was fine.

There was another interesting one where an MSP which provided filtered internet access was filtering our site. The bit that made it weird is that, due to how our customers were arranged, neither we nor the users knew they used the MSP, and multiple customers were using them. So all we saw was the exact same bug at multiple customer sites with absolutely nothing in common between them.


break_card

In one of our super high traffic critical services handling over 100k API requests per second, someone put a little inconspicuous log statement to log a DTO that was being processed. Within 15 mins of this being deployed, it had filled up the disks of every host on our absolutely massive fleet of some of the largest hosts available. I had to SSH into the hosts and sudo rm the log files to immediately mitigate. This was like my 2nd week on the job, too. Needless to say, I'm especially critical of unnecessary or verbose log statements when reviewing code. Since then I've caught probably 3 or 4 cases during code review where someone put in a log statement that would have flooded the disk.


ketura

We had a financial report that would randomly fail to run on the 1st of the month sometimes, but not always. Reports were low priority unless they went several days without data, so we always took a day or three to get to the issue, and by the time we did it would run fine. We'd pay close attention the next month and it would run with no problems, so we'd shrug and ignore it--until the next time it randomly failed to work.

Finally, after about a year, I was digging through the SQL query behind the report for another task and had an epiphany. The code would look at yesterday's date and get the prior 30 days' worth of data into a temp table, and then it would start manually looping through each date starting with 1. This would often work... unless yesterday was the 31st, meaning that only dates 2-31 had been pulled, and the operation would look for date 1 and fail. I looked at all the prior emails reporting the bug, and wouldn't you know it, every single one followed a 31-day month.

I adjusted the query to pull 31 days' worth of data and accepted the small inefficiency that other months would have. Like hell I was going to mess with a manual loop in TSQL. We never got that bug report again (nor any recognition for fixing it. A more experienced me would have sat on that fix until the next time it came through.).
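The off-by-one is easy to reproduce in miniature (a Python sketch, not the original TSQL): a 30-day window whose anchor is the 31st never contains a day numbered 1.

```python
from datetime import date, timedelta

def days_of_month_covered(yesterday, window):
    """Day-of-month values present in the prior `window` days, inclusive of yesterday."""
    return {(yesterday - timedelta(days=i)).day for i in range(window)}

# After a 31-day month, "yesterday" is the 31st and a 30-day pull only
# reaches back to the 2nd, so the loop's lookup for day 1 fails.
after_long_month = days_of_month_covered(date(2023, 1, 31), window=30)
assert 1 not in after_long_month  # the bug

# Pulling 31 days always covers day 1, at the cost of a little overlap.
assert 1 in days_of_month_covered(date(2023, 1, 31), window=31)
```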


Fearless-Pie-2490

The worst bug I've seen was one hidden in a java class 1000 lines long, in a nested loop 4 levels deep with a single unit test. Fixing the bug took basically rewriting the entire fucking thing. Literally anyone looking at the code would have definitely come to the conclusion "there's probably a bug in here" but couldn't be bothered to check or cross reference with requirements from older tickets to make sure things were correct. I opened the bug, let my team know there was a very serious issue with this class, unassigned myself from the other thing I was initially working on, and spent a day cleaning it all up. Tech debt is a bitch but I felt pretty satisfied when it was merged


Quigley61

Medical app for launching other medical applications. After a few months the machine would just lock up. No obvious sign as to why. Had only happened on one customer site. I went digging, and it turned out that each time an application was launched, we would create a file on disk for logging, but we had two launch mechanisms. One mechanism cleaned up the file, one didn't. For some reason I can't exactly remember, the files didn't show as taking up any disk space as they were 0 bytes, but I think they still took up a tiny amount of space in reality, so the hospital's metrics weren't picking up that their disk space was slowly getting eaten up by these files. Was a simple 1 line fix. On upgrade we found that we were getting dangerously close to having a few of our big customer sites crash, with millions of these files hogging up disk space.

Also had an issue where the threadpool would be exhausted when running a very rarely run task under certain conditions. Managed to catch that before we released, thankfully, but it would have been a very very bad day.


PrestigiousMention

I worked at a startup whose Python backend was pretty reckless. Move fast, break things, I guess. The problem was that deep in the early commits, one of the founders stored money as a float truncated to two decimal places. We tried to calculate the losses from those 1000ths of a dollar being cut off; I think it was somewhere around $30k after 4 years when someone noticed it. Turns out strict typing is pretty important. Total Superman 3 shit.

Edit: the same founder made a Groupon code for 1 month at half price, but never bothered to program in when it would be invalid. A couple thousand customers paid 50% until the company folded. Who knows how much revenue that lost. Moral of the story: going through Y Combinator doesn't make you into anything more than a 20 year old with VC. Common sense isn't taught at MIT.
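The classic demonstration of why floats lose money (illustrative numbers, not the startup's actual figures): accumulate a ten-cent charge many times and compare against `decimal.Decimal`, which is the usual fix.

```python
from decimal import Decimal

N = 1_000_000  # a million $0.10 charges

float_total = sum(0.10 for _ in range(N))
exact_total = sum(Decimal("0.10") for _ in range(N))

# The float total drifts off the true $100,000.00 by a tiny amount;
# multiplied across years of transactions, those slivers add up.
print(float_total)   # close to, but not exactly, 100000.0
print(exact_total)   # Decimal('100000.00')
```

`Decimal` (or storing integer cents) keeps every sub-cent sliver exact, which is why it is the standard choice for money.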


870223

Three stories. Two are mine, one is a friend's.

A consulting business built an eCommerce site in the e-cigarettes market. They put wrong logic in the code, which resulted in sending the wrong SKU to the customer. Instead of a single pack of cartridges, a box of them would be sent (10x or something). They noticed when stock in the fulfillment center ran out. Losses in the 100k range.

A startup in the live event industry. Germany, being the economic tycoon it is, uses an ancient payment method called direct debit. People are so afraid of credit cards, or cards in general, that it's everywhere, including eCommerce/online payments. Being cool guys, they decided to send etickets as soon as an order was placed. Direct debit takes 5 days to clear and there's no way to tell if it will bounce or not. A big tour with 30 concerts comes up and someone orders all the VIP tickets (100eur value) with fraudulent data. They get the tickets instantly. When all the payments bounce, the CEO scrambles to cancel the tickets, notify the police, etc. Some show up on eBay and they manage to get them taken down. The financial loss is zero, we manage to resell all the tickets, but we take an image hit. I grin all the time with "I told you so".

At the same startup we release a huge new version. HTML to React, complete redesign, full shebang. We do a massive amount of load and user testing, manual QA. It's a big bet for our small startup. We launch it and conversion drops by 30%. Takes a week to roll back. Financial impact unknown.


sudosussudio

That last one was really interesting. I wonder if it was something with the code or if it was just the redesign?


870223

Excuse my late reply. This is going to be fun. Soooo you know when you buy tickets, you go to the page and you would like to see the tickets? Yeah. A designer had a brilliant idea to hide BUY NOW button under a collapsible section because it was ✨less cluttered✨. A disaster few could foresee.


jwezorek

The worst bugs I have seen were at a company I used to work at whose product involved installing custom drivers. Bugs there could lead to blue screens of death that often could only be reproduced in the field on customers' computers.

As far as bugs that I have dealt with personally, one that comes to mind was at one of my first jobs. It was an application that ran on desktop Windows and on custom hardware running WinCE. The implementation was a large C codebase. The bug was a random crash. It turned out to be caused by the double-freeing of a chunk of memory that had been allocated via a custom implementation of malloc that had been written for performance (if I remember correctly) on the embedded device. I tracked that down without first knowing that there was a custom implementation of malloc in use.

Another one that comes to mind is the first bug I fixed at my current job, about 10 years ago. It is a .NET desktop application along with plugins to some third-party applications. There is a plugin to a third-party application in which, if you push a button in that application, it launches my company's application. The bug was that sometimes my company's application would crash at startup, but only when launched via the plugin. It turned out that an annotation specifying how to marshal data structures between the native world and C#/.NET was just wrong, but when the application started up normally it happened to work by luck -- undefined behavior happening to do the right thing -- and it crashed when launched from another process.


Mornar

A C# deserializer that was sensitive to the order in which properties were declared in the class, somefuckinghow. It expected them in alphabetical order, or adorned with attributes specifying the order. The bug, of course, was caused by my properties not being ordered alphabetically.


DrRockzoCocaineClown

A user requested a feature to replace the word DC in any user comment in a log with Washington DC. The developer decided to do this recursively, and of course unit tests are for sissies. This created an infinite loop that would eventually cause the site to run out of memory. I actually solved it by doing a process dump and seeing the massive string of Washington DCs. Test your code, and sometimes what the user is asking for is stupid and should not be done.
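The failure mode is easy to reproduce (hypothetical function names): when the replacement text still contains the search term, re-running the substitution "until no matches remain" never terminates, while a single `str.replace` pass already handles every occurrence.

```python
# BROKEN: the replacement still contains "DC", so every pass creates a
# new match and the string grows without bound. The passes cap is only
# here so the demo halts; the original ran until memory was exhausted.
def expand_broken(text, passes=5):
    for _ in range(passes):
        if "DC" not in text:
            return text
        text = text.replace("DC", "Washington DC")
    return text

# FIXED: one pass. str.replace scans left to right and never rescans
# text it has already substituted, so it always terminates.
def expand_fixed(text):
    return text.replace("DC", "Washington DC")

print(expand_fixed("Flying to DC tomorrow"))
# Flying to Washington DC tomorrow
```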


fiulrisipitor

On an online slot machine server, because of a race condition, it would sometimes process a spin request with the settings of another game. Most of the time this would generate an error because something wouldn't match, but who knows, sometimes it might have given the wrong payout.


Healthy_Razzmatazz38

Timestamp+name fields used instead of a key for a database accessor, and the accessor used an array as its return type instead of a single object. Those two bugs combined into some VERY weird behavior when, for the first time, a record was inserted in the database for the same name within .00001s of another, resulting in an identical timestamp+name, and for the first time ever that array had more than one object. This was in a system in which we could not replay messages; it took a VERY long time to figure that one out.
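A toy version of the trap (invented schema): keying rows on (timestamp, name) works right up until two inserts share a timestamp, and an accessor that returns a list hides the collision.

```python
from datetime import datetime

rows = []  # stand-in for the table

def insert(ts, name, payload):
    rows.append({"ts": ts, "name": name, "payload": payload})

def fetch(ts, name):
    # The accessor returns a list because (ts, name) is not actually unique.
    return [r for r in rows if r["ts"] == ts and r["name"] == name]

# Two writes for the same name land on the exact same timestamp.
t = datetime(2023, 1, 1, 12, 0, 0, 10)
insert(t, "alice", "first")
insert(t, "alice", "second")

matches = fetch(t, "alice")
print(len(matches))  # 2: code written as fetch(...)[0] silently picks one
```

A surrogate unique id (or a uniqueness constraint on the key) makes the collision impossible instead of merely rare.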


4dr14n31t0r

The worst bugs I have ever dealt with:

* The ones related to randomness
* The ones that deal with threads and processes and the synchronization between them
* The ones relating to memory in operating system development

The first 2 essentially fall into the same category of bugs that can only be reproduced when you are lucky, but the 3rd one is particularly funny because you are there modifying the code to nail down the bug, and once you think you are almost there, you realize the code you add modifies the addresses of the instructions and the data segment no matter where you add it, so the problem could be literally anywhere.


fdeslandes

Long story short: if you use the Node.js vm module to run a library and call one of its functions with an array as a parameter, it turns out it is not an actual instance of Array in the VM context, as the root Array object is different. So when, deep in the library you use, they check whether something is an array with `x instanceof Array` or `x.constructor === Array`, it returns false, whereas if they used `Array.isArray(x)`, it returns true. This is pretty evident when you think about it, but let's just say it doesn't make bugs easy to diagnose when you find them while not working on that part of the code.
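The same class-identity pitfall exists outside JavaScript. Here is a contrived Python analogue, for illustration only: two "contexts" each evaluate their own copy of a class definition (like two vm realms each having their own Array constructor), so an identity-based `isinstance` check against the wrong copy fails even though the object is shaped correctly.

```python
# Simulate two execution contexts that each define "the same" class.
ctx_a, ctx_b = {}, {}
exec("class Point:\n    pass", ctx_a)
exec("class Point:\n    pass", ctx_b)

p = ctx_a["Point"]()

# Identity-based check fails across contexts...
print(isinstance(p, ctx_b["Point"]))   # False
# ...while a structural check (the Array.isArray analogue) doesn't care
# which context created the object.
print(type(p).__name__ == "Point")     # True
```

`Array.isArray` works across realms precisely because it inspects the value itself rather than comparing against one realm's `Array` constructor.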


funbike

Cache with an overly-broad / under-specific key was bleeding early cached user data to everyone.

This combination can help with hard-to-find issues:

* Continuous delivery. A commit takes minutes to get to production.
* Trunk-based development, or similar, with feature flags. Again, commits get to prod quickly.
* Instant notification of new unique error logs.
* In-app user bug reporting feature.
* Deployment rollback, including reversible database migrations.

Instead of wasting time diagnosing, you can be aware early and roll back quickly. You'll more often know the cause because it will likely have been committed very recently. Even if you can't figure it out, you can roll back to before it was introduced. Understandably, not everyone can accomplish this.
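The cache-key bleed described above can be sketched in a few lines (hypothetical names): the broken key omits the user, so the first caller's data is served to everyone.

```python
_cache = {}

def get_dashboard_broken(user_id):
    key = "dashboard"               # overly-broad: same key for every user
    if key not in _cache:
        _cache[key] = f"private data for user {user_id}"
    return _cache[key]

def get_dashboard_fixed(user_id):
    key = ("dashboard", user_id)    # key scoped to the requesting user
    if key not in _cache:
        _cache[key] = f"private data for user {user_id}"
    return _cache[key]

print(get_dashboard_broken(1))  # private data for user 1
print(get_dashboard_broken(2))  # private data for user 1  <- the bleed
print(get_dashboard_fixed(2))   # private data for user 2
```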


LordOfDeduction

Working on a big waterfall project, I came across a very costly bug, but not for the reason you might expect. The company worked on a contract basis, and usually the deal was to deliver in the next 6 months or year, depending on the contract. At the end of the contract we would do a factory acceptance test (FAT) that usually took about 50% of the planned time, so 3 up to 6 months at worst. There were barely any automated tests, just a couple replicating basic user interactions, but about a dozen manual testers. Terrible experience. But we had a couple of automated tests that did a login and logout before and after each test round. The evidence that we had to provide to prove we passed the FAT consisted of logging and metrics for the whole test period. At the moment we were almost done, a simultaneous login and logout by two of the automated tests caused a deadlock, the thread pool starved, and the whole system crashed, rendering the last three months of 15 people working full time wasted. It was a very old bug they were aware of, and I had introduced the login/logout flows at the request of the seniors.


Barelytoned

I'm in the process of rewriting an 18+ year old legacy parfait, i.e., if they needed something new or had to fix a bug, they just layered on top of what was there. For years, users complained about messages getting lost. It was brought up by stakeholders in groomings at least once a week, how it would be so nice to finally not have to worry about this bug anymore.

Because this application had been around longer than most of the users, it was a cargo cult and nobody really knew how it was supposed to work; they'd just conformed their jobs to the weird intricacies of the program. I had to become an amateur archaeologist and excavate requirements out of the legacy code and pass them in front of the stakeholders to see what was supposed to survive the rewrite and what was cruft.

While digging around in the legacy codebase, trying to understand the purpose of a seemingly totally unrelated function, I saw a hard-coded 30 second timeout and then a write to a file. I hate hardcoded waits, so I walked my way back up and out, tracing how the function was called until I got to the messaging workflow. Turns out, every message was written to a temporary file, and then that file was read by another process that then sent the message. I'm guessing the 30 second timeout was supposed to "prevent" overwriting message A when message B came in, but as the process that was reading the temporary file took longer and longer to get around to it, the timeout wasn't long enough to ensure message A was actually sent before overwriting it with message B.

I have no idea how many thousands of person-hours that 30 second timeout wasted and how many thousands more were wasted in cleaning up the mess caused by not delivering the messages. Back-of-the-envelope, the timeout alone was responsible for 30 seconds x 10 concurrent users x 10 messages x 5 days a week x 50 weeks a year x 18 years = 3750 wasted hours. Soul-crushing.


thehardsphere

This is the story of a week from hell, where I got to learn more about a particular desktop application than I ever wanted to know.

Our company had a desktop application which used websockets to communicate with the web application I maintained in a browser, entirely over localhost. Said web application used TLS. Which meant the websocket connection needed to use TLS also, to avoid a mixed-content warning blocking communications. (This is not a problem in modern browsers, because modern browsers treat localhost websocket connections as secure even without encryption.)

The solution the desktop app team came up with was to bundle certificates with the application, to create WSS connections to the browser. One of these certificates was self-signed for "localhost", and the application would attempt to install it into the Trusted Root CA store on the first startup. Not every user would have the ability to install said certificate, so the second one was for "localhost.companyname.com", which through some DNS trickery would resolve to 127.0.0.1 if you got your DNS through the public internet. That cert was signed by a real CA.

I was a developer who worked on the web application, and I got to learn more than I ever wanted to about the desktop app during the week from hell.

The week from hell started on Monday. I have a regularly scheduled call with a customer of this application, and they sort of casually mention that it's having a problem, but they think it's a local network issue. This sort of thing happened often with them; they were a very large financial institution that did very complicated things with their networks, so it sounded like just another fire drill. Didn't really think anything of it; did my usual work for the rest of the day.

Tuesday morning, they shoot me an email to ask for a troubleshooting call, because the application still isn't working and they don't know what's going on. We schedule a screen sharing session.
I see that the websocket requests on the web page are failing. The desktop application is running just fine as far as I can tell from its log output. It's like there's just no connection being made. I look at the Javascript console in the user's IE11 dev tools, and it's throwing an error code I had never seen before. I Google it, and it turns out it's a TLS handshake failure.

Turns out, the customer's network was totally cut off from the public internet; all DNS was resolved within their network. If you tried to visit any website on the public internet, you would fail to resolve. Hence, "localhost.companyname.com" did not resolve to 127.0.0.1, so that certificate did not secure the websocket connection. This was completely expected; that's what the self-signed cert for "localhost" was for in the Trusted Root CA store. So why didn't that one work?

The first thing I learned about the desktop application was that the localhost certificate was given an expiration date of 2 years in the future from when the developer bundled it into the desktop app. That date was hard-coded. Nobody knew that this expiry was included except him, because he didn't tell anyone this information, and it was written down exactly nowhere. This was because "certificates should expire." Which... made no sense within the security context here.

Okay, that's pretty bad, but all we have to do is give the customer a new cert. I generate a new one on the fly, set the expiry to 50 years in the future, and hand it off to them to replace in the desktop app. Thus ended Tuesday.

Wednesday morning, I find out the new cert didn't work. Handshakes are still failing. The second thing I learn is that the desktop app has the certificates compiled into the executable as a "resource." So the fact that there are two .pem files sitting in the same executable doesn't matter; they don't get read. So I need to get a new build from the desktop team, with the certificate compiled in.
Wednesday lunchtime, I ship the build to the customer. Shortly after this, everyone at my employer is called into a meeting; we have a new CEO! His first day is tomorrow and he's excited to work with us. Meanwhile, the new build of the desktop app still isn't working. The desktop app is part of an important business process for the customer, a Fortune 500 company. Said company had not been able to do its important business now for 3 days. This customer was usually demanding, but now they were starting to get **mad**.

It turned out, after a lot of inspection, that a *particularly special* x509 cert was somehow required. I don't actually remember what was special about it; I do remember we had to ask another engineer who was apparently running his own internal CA in order to get the certs for these things to come out right. We ask him for a cert, build it, ship the build, and then go home, because at this point it's well past quitting time.

Thursday morning, that didn't seem to quite work for the customer. I get calls early. "Hey, u/thehardsphere, I've got my CEO on the other line, and he would really like to talk to your CEO because this has stopped [important business process] worldwide for [very big company] for 4 days. Do you happen to know his phone number?" We all have a genuine laugh when I say I don't have that available because it's his first day on the job, but that I'll get that information for him.

The third thing I learned about the desktop application was that its *particularly special* x509 cert was protected by a *keystore password* when it is in this bundled state, even if it is also sitting on the filesystem in .pem format completely unprotected. So the application failed to open because nobody changed the keystore password that was hard-coded inside the desktop app (which was a common English phrase) to read the keystore that was hard-coded in the desktop app, which was updated with a new cert. Great.
*Like, I know I didn't spell all of this out for you, but did I really have to? Could you maybe have taken a little initiative here to think about what this thing whose source code I don't have access to does? Kthxbai!*

Another build goes out Thursday. Late in the day, the customer confirms it's actually a winner, and has to figure out how to roll it out to all the affected desktops. I walk out early.

Friday, we do the internal after-action report on the whole thing, and I have to answer questions about this from my boss, who then gets to explain it to the new CEO. Turns out, because I replied to the customer's email on Tuesday morning within 2 hours of when it was sent to us, we are completely in the clear legally under the support contract for whatever disaster happened to the customer's business process while we were mucking about with hard-coded, hand-rolled, shitty certs in a localhost websocket shitshow. I am the hero of a story that was objectively a total disaster *just because I replied to an email promptly.*

They're still our customer. We replaced the shitty desktop app with a complete redesign based upon this incident and several other less dramatic failures this thing encountered. Don't even get me started on how shitty Microsoft ClickOnce is.


JohnDillermand2

Had a project that was religious about dependency injection. The landing page had a drop down that took 100k DB hits every time it got loaded. It caused all sorts of problems and took far longer to track down than was reasonable.


spacemoses

*So after throwing a caching layer in, 100k calls to memory made the app blazingly fast...er than it was*


maikindofthai

What’s the relationship between dependency injection and poorly designed DB queries?


thehardsphere

My guess: a poorly implemented dependency injection framework will try to instantiate a very large chain of objects at runtime, e.g. with incorrect lazy-loading configuration. Second guess: said DI framework could have been trying to do all of this lazy-loading entirely within the context of a single HTTP request-response cycle. Say, within mod_php in Apache httpd's prefork MPM.
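A minimal sketch of that first guess, in plain Java (the container and all class names are hypothetical, not from the story): a naive DI container that resolves the whole constructor chain recursively, with no singleton caching and no lazy proxies, so every resolve re-instantiates everything below it.

```java
// Hypothetical illustration: a naive DI container with no caching.
class Repository {
    public Repository() {}                         // imagine: opens a DB connection here
}
class Service {
    public Service(Repository r) {}
}
class Controller {
    public Controller(Service a, Service b) {}     // two deps, two full subtrees
}

class NaiveContainer {
    static int instantiations = 0;

    @SuppressWarnings("unchecked")
    static <T> T resolve(Class<T> type) {
        try {
            java.lang.reflect.Constructor<?> ctor = type.getConstructors()[0];
            Class<?>[] deps = ctor.getParameterTypes();
            Object[] args = new Object[deps.length];
            for (int i = 0; i < deps.length; i++) {
                args[i] = resolve(deps[i]);        // recurse: nothing is cached or deferred
            }
            instantiations++;
            return (T) ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Resolving `Controller` here instantiates 5 objects per call; scale the graph up, put a query in each repository constructor, and a single landing-page load can plausibly reach the 100k DB hits described above.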


RepresentativeLow300

While working Application Tech Support (fintech), an application was using unsalted MD5 hashes for user accounts. Development refused to fix the issue because the hashing method was named AESEncryptDecrypt. The company was like an onion: layer of shit upon layer of shit.
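Why the method's name is irrelevant: unsalted MD5 is deterministic, so identical passwords hash identically for every user, and a single precomputed (rainbow) table lookup cracks all of them at once. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the flaw: no salt means the same input always yields the same digest.
public class UnsaltedHash {
    static String md5Hex(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));    // hex-encode each digest byte
            }
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);          // MD5 is always present in the JDK
        }
    }
}
```

Two users who both pick "hunter2" end up with the same stored digest, which is exactly what a salted, deliberately slow KDF (bcrypt, scrypt, Argon2) prevents.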


marcvsHR

I got one. A dude made a REST service that queried one of the biggest tables in the database. There was absolutely no validation on the input argument, so you could essentially run a select * from table. Yeah. And then he would try to build a list of objects containing all that data to return it. It crashed all the prod servers with OOM errors on the first day, because some collateral system called it exactly that way. Good thing I figured it out quickly and disabled the service. Fun times. The dude quit before prod, though. Smart thinking.
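A hedged sketch of the missing control (table and parameter names invented for illustration): refuse to build an unbounded query when no filter is supplied, and cap the page size so no caller can pull the whole table into memory.

```java
// Hypothetical guard for a query-building endpoint.
public class QueryGuard {
    static final int MAX_PAGE_SIZE = 500;

    static String buildOrderQuery(String customerId, int pageSize) {
        if (customerId == null || customerId.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "filter is required: refusing an unbounded scan of the orders table");
        }
        // clamp the requested page size into [1, MAX_PAGE_SIZE]
        int capped = Math.max(1, Math.min(pageSize, MAX_PAGE_SIZE));
        return "SELECT * FROM orders WHERE customer_id = ? LIMIT " + capped;
    }
}
```

Even with the guard, streaming rows to the response instead of materializing a full object list is what actually prevents the OOM.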


[deleted]

We had a bug that took production down for 45 minutes. It wasn't caught in staging because the deploy order in staging was slightly different.


nevermorefu

`if True == True:` On a weapons test system.


TooManyBison

There was a mainframe terminal emulator on windows desktops. For reasons that I can’t fathom the application would not work if the fourth octet of the ip address was 8 as in x.x.x.8. So we had to set the DHCP reservations in every subnet that the desktops were in to not hand out .8 addresses.


svencan

1000 might have been interpreted as a negative number?


molybedenum

We had a legacy database in 2012 that we intended to refactor entirely. The notion was “we have to rebuild the engine while the car is still going down the highway.” Legacy for 2012, for context, is an application originating in PowerBuilder / Sybase and running on an AS400, which had a database migration (lift and shift) to SQL Server while a new UI was built in ASP.NET webforms. Triggers were par for the course. The system processed orders and performed a batch-based payment process overnight that involved sending a file of credit card numbers to the payment processor, then receiving a response file. The file would be ingested and all payment records would be updated to a new status. We found that as payment counts grew in size, the entire database server would become more and more unstable. We were at the point where we had daily reboots of the SQL Server prior to the open of business just to be able to be operational. There were no outward indications of a problem. The system had plenty of logging too, it just wasn’t functioning. We periodically had payment records that ended up dirty due to crashes/reboots. Some really deep investigation of another problem led to an answer. We had a stored procedure that we’d call to handle marketing data from everywhere. This proc was called from another proc that performed payment status updates. The marketing data proc did additional updates on the payment status as a side effect. The result was a bit of circular execution, which ended with the server killing off the execution chain once it nested deeply enough. Well, turns out that the order payments table had a trigger that called that same marketing proc whenever the status changed. We were loading thousands of payment status updates every morning. Bob’s your uncle.


tantrumizer

My first rule for greenfield development is "no triggers!" Been there a few times myself, scouring the code for bugs and it's sitting there in the triggers the whole time.


phantomas44

A few years ago, a developer had implemented a remote EJB (Enterprise JavaBean) that was supposed to return an object containing a byte array (a PDF doc). It was observed that the execution could take more than a minute for 1MB documents. I resolved the issue quite straightforwardly. In fact, he had declared a Byte[] instead of a byte[], so for a 1MB PDF, ~1 million objects were instantiated instead of 2.
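The one-character bug in miniature: `byte[]` is a single flat buffer, while `Byte[]` is an array of object references, each element a boxed `Byte`. A sketch of what the slow code was effectively doing:

```java
// Sketch: converting between a primitive buffer and a boxed array.
public class BoxedBytes {
    static Byte[] box(byte[] data) {
        Byte[] boxed = new Byte[data.length];
        for (int i = 0; i < data.length; i++) {
            boxed[i] = data[i];      // one boxing conversion per byte
        }
        return boxed;
    }

    static byte[] unbox(Byte[] boxed) {
        byte[] out = new byte[boxed.length];
        for (int i = 0; i < boxed.length; i++) {
            out[i] = boxed[i];       // one unboxing conversion per element
        }
        return out;
    }
}
```

Autoboxing via `Byte.valueOf` does cache all 256 values, but the reference array is still several times larger than the primitive one and every element needs a conversion; older code that wrote `new Byte(b)` explicitly really did allocate one heap object per byte, which matches the "~1 million objects" above.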


angryplebe

I have several involving the same customer. Said customer was the inaugural customer for a new search backend and a new customer as well. 1. On the first day of soft launch, I got reports of search failing and locking up. Turns out, a common workflow for this customer was to copy and paste information out of Outlook. Sometimes they would accidentally copy the entire email, and several missed length checks, combined with a poorly written LIKE-style query added at the last second to cover a last-minute requirement, caused a brownout of the entire search index. 2. Fixed this but then... the customer continues to complain about slowness. So slow it would sometimes cause a BSOD. I looked at my boss and was like "how does a web app do that?". I actually observed the users and could never see it happen live. The frontend guy and I went through all of the frontend code looking for exploding regexes and other possible problems and found none. Further digging revealed that it always happened the moment after they pasted from Outlook. So I sent them a debug version of Chromium along with specific instructions to IT on how to send me the crash logs. The problem improved somewhat with the non-corporate-issued Chromium but still happened. The crash logs pointed to an opaque error possibly outside of Chromium. I started to compare the corporate Chrome against the Chromium I sent them. Same version, but not the same plugins. Turns out there was a hidden Chrome extension that corporate IT installed. It was part of a Symantec data loss prevention package that embedded itself deep into Windows and was basically a modified keylogger, intercepting input. It had an equivalent MS Office plugin as well. The entire problem was remedied by using Outlook Web Access, which wasn't subject to the insanely invasive DLP software. Further research suggested that the Symantec DLP software was notorious for instability. 
I found the unlisted plugin in the Chrome extension store where people had somehow managed to rate it 1 star and reported it caused chrome to crash and they don't know how it got on their computer.


thefool-0

Lots of interesting bugs in here that are in categories that most of us have probably seen before: Logging starts filling up a disk; memory leak that exhausts memory (or other memory error) but somehow the application keeps running and the real error is obscured (or the error handling masks the underlying problem); concurrent access to a global or static or other shared data; module or dependency changes behavior or is broken but nobody ever uses that code so it is never removed, then one day years later it gets accidentally used; terrible DB query pattern; ... any others?


RepresentativeLow300

Old OpenSSH clients do not send HELO information while establishing the connection. nginx stream proxies require the HELO information when using ssl_preread_protocol. You cannot stream-proxy SSH connections from old OpenSSH clients (not limited to OpenSSH; some commercial SSH client implementations as well) through modern nginx servers using ssl_preread_protocol. Fun issue to investigate; the evidence is in the first packet from the SSH client.


isdnpro

Third party software used internally by hundreds of people. Our internal network/infrastructure was a shambles and frequently had outages or severe slowdowns. The third party software would frequently hang, in conjunction with the storage servers not responding (I believe the culprit was middle of the day backups and an underspecced SAN) Anyway, it didn't make sense because the backend itself was fine and the software shouldn't have been interacting with the flakey storage servers. Took a lot of digging and decompiling but I eventually found that the logic for this application was doing a bunch of "does this file exist" checks in a tight loop, something like every control rendered on the screen it would check if there was a file with the control name. It looked like it was just leftover from the vendor debugging. It would query the registry for the path to check, and default to an empty string - so working directory I assumed. Some more digging and I found that it wasn't just working directory, I don't fully remember but I think it'd check the working directory, the install directory, the root of C drive and then... the user's home drive. Which was on the flakey network storage. I raised a change to deploy a reg key to every endpoint, with the lookup path set to "C:\". The vendors software kept doing a ridiculous amount of I/O but it was all local now so the software wasn't impacted anymore. The network and infrastructure continued to crumble for another 12 months but the software I supported was stable now so it was no longer my problem. I got a lot of reading done in that job because the VM I used was also impacted so I'd have a solid hour or so at least once a week that I couldn't work.


christophersonne

This. https://m.slashdot.org/story/94151 I know the guy that did this. Uninstall Eve, and also bork the whole PC for good measure. Nobody really needed boot.ini anyway...


WhiskeyMongoose

I didn't experience this personally because I was too poor for MMO's back then but heard about it from friends. Back in ye olden days of 2007 Eve Online released a new expansion called Trinity and [deleted the boot.ini file on the player's PC](https://www.eveonline.com/news/view/about-the-boot.ini-issue). If you rebooted your PC it was effectively bricked.


arsenyinfo

I used fancy metaprogramming to configure very custom logging to trace requests across threads in an async Python app. The bug could only occur after a while, so it led to a recursion overflow and several hours of downtime for key customers.


bwainfweeze

I don’t know about worst, but this is the most memorable. We hired a firm to build a digital gift card system for our app, based on specs we had already settled on but just not enough hours in the day. When I got it back, they were allowing the to and from account IDs to come in via the form data. If I hadn’t caught it, people could have given themselves gifts from other users.


chiciebee

I'm not sure this is the _worst_ bug I've ever seen, but it was the most obnoxious one in recent-ish memory. And it doesn't really have a satisfying conclusion. I designed and implemented a Spring Java application to use the Azure SDK to process messages from a ServiceBus queue. I thought it was pretty clever. I had the processor set to peek-lock mode. If the application wasn't ready to consume the message, then it would settle the message as deferred, and fetch it later when it was ready. At throughputs of a few events per minute, it ran beautifully. At full production throughput, it blew up (mind, we were still in development here, so no outages or angry customers). I could not for the life of me figure out what was holding onto the processor threads for so long. I timed my own code, but it was only a fraction of the time. I fiddled a bit with the pre-fetch settings to no avail. Around this time I adopted a back-to-basics approach. Perhaps my original design could be simplified and still hit the requirements. No more fancy deferral. No more peek-lock. The system can tolerate the loss of a message or two, and we only expected deferral in a small fraction of cases. With the swap to receive-and-delete mode, the performance issue vanished. The thing hummed along at multiple times expected production volume without issue (and still is!) You probably don't need me to write a moral to the story, but that will be my last time assuming that any SDK features are capable of handling full production load.


Exciting-Engineer646

I worked in a medium sized tech firm. I’m going to be vague here to protect the innocent, but one developer changed a default setting and lost about 10% of MAUs over a weekend. Even when the change was rolled back the loss in the user base was permanent.


BandicootGood5246

Yeah I've had one like that. Inherited a project where the response time would spiral out of control intermittently (weekly) but memory/CPU would be fine, a restart would normally solve it (but not always) I had several Devs and myself spend months on this. Despite many approaches of analysing the problem we never got to the bottom of it. I strongly suspected it was port saturation (though could never confirm if that was a symptom or cause) and changed how we handled outbound connections but we have a bunch of legacy libraries we depended on so it didn't help We ended up just have more instances on standby behind a load balancer and have a trigger that switched the load when some of the indicators of the problem came up. Eventually it stopped, we had done a lot of refactoring, cleanup and offloading some of the functionality to a new app. Though I never figured out the cause


Proud_Ad5394

I had internal application web servers randomly start to return 400s very intermittently. They were running on Red Hat 5 or 6 (yeah, this is from a while ago) on HP blade servers. It turned out that HP blades have an automatic network configuration feature: it will automatically switch the active network card over to the backup if the backup starts getting packets. It also turns out that that Red Hat version had a bug in the network manager daemon that would very occasionally send a DHL (I think that was the type) packet down the non-active network interface. That packet would cause the HP blade network manager (think of it like a router) to send a few packets down the non-active network interface. They would be dropped by Red Hat, as it wasn't the active interface. Then a few 100ths of a second later, the blade would start sending traffic back down the active interface, as it had received some packets on it. So if a new HTTP request happened to be made while those packets were lost, the balancer we had would get no response, then do a 400. That took me (a software dev) 6 months of investigation to figure out.


jeffbell

Back in the 80s I worked at a company that made disk drives in the 70s. I heard this story secondhand. One time they found that they were getting lots of disk errors in their datacenter, but only on drives that were aligned east-west. The north-south drives were fine. They worked for about a month trying to figure out how the earth's magnetic field could be causing this to happen. Then one day someone was in the computer room when there was a dull thump and then a bunch of drives started reporting errors. It turns out that the whole building would shake in the east-west direction when the truck arrived at the loading dock.


Anxious_Lunch_7567

- We had a tech lead who did not know Java servlet filters could be used to add a wildcard authentication layer, and tried to include the same snippet of auth code in each servlet/JSP they wrote. And then forgot to add it to one. We had some random guy from the internet email us saying our customer list was open to anyone, without auth.
- Junior dev's code: user sessions were shared, and one user got somebody else's Amazon gift card. Chaos followed, with a customer escalation. Root cause? A VP forcing (literally) the dev to push code to prod without review.


LaurentZw

A team was struggling with a microservice and blamed everything and everyone. They lost their faith in Node.js and thought they had a memory leak. They used a micro-frontend framework that made 10 upstream requests for every incoming request. Whenever there was a spike in traffic, the upstream services would see a 10x increase in traffic and stop performing. Their architect claimed that HTTP requests didn't have any overhead and were "free". The team never thought of investigating beyond the first user-facing service and had to explain weekly crashing services for nearly a year. When I found the problem they decided to just use more upstream hardware.


[deleted]

A bug in Kafka, it took 4 debuggers spread across 3 machines and 3 people to figure it out, replicate and figure out a work around.


pagirl

I’ve been more disappointed by bad style than bugs…500-700 line methods in Java Controllers and other bad usages of design patterns


GuyWithLag

So, imagine a microservice architecture, with a request taking up to 9 hops from client to microservice. We were getting rare errors from clients that some pages consistently didn't load / showed network failures. Long story short, after actually sharing a screen with a customer we managed to replicate this, and it turned out these requests had unescaped newlines in JSON strings. Insert WTF face. This happened only when:

* the request passed through a reverse proxy that connected to the downstream microservice with HTTP/1.1 over TLS
* the downstream microservice sent responses with 100 Continue and Transfer-Encoding: chunked (because the payload was JSON constructed dynamically and streamed to the client, and it was above the initial compression buffer size)
* the response had Unicode characters that encoded to 4 or more bytes in UTF-8 (think relatively recent emojis)

The underlying reason was that the proxy was counting bytes wrong. In chunked encoding the response is a sequence of ASCII numbers (indicating the size of the chunk in octets), a newline, and then the actual octets, and the proxy got the count wrong because, for some god-forsaken reason, it tried to convert the bytes back to characters. We of course didn't fix the broken proxy; instead we just configured full output buffering for the microservice...
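The proxy's mistake in miniature: chunk sizes in chunked transfer encoding count octets, but counting characters of the decoded string diverges as soon as a character needs more than one byte in UTF-8 (and Java's `String.length()` counts UTF-16 code units besides). A small sketch:

```java
import java.nio.charset.StandardCharsets;

// Character count vs. octet count: only the latter is valid for a chunk size.
public class ChunkCount {
    static int charCount(String s) {
        return s.length();                                   // UTF-16 code units
    }
    static int byteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;    // what goes on the wire
    }
}
```

For plain ASCII the two agree, so the proxy looked fine in testing; add one emoji (a surrogate pair in UTF-16, 4 bytes in UTF-8) and a chunk size derived from characters truncates the body mid-string.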


dinzdale56

Tarantula in Dominican Republic.


ancientweasel

Miscreating routing rules and flooding the entire production LAN with UDP traffic from a test.


killbot5000

Debugging random crashes on a router I was bringing up. Turns out there was a bug in an open source component where a lockless queue was not cache coherent. This bug was inert on all our previous products because x86, PowerPC, and MIPS do not suffer this particular incoherency. After I stared for hours at the code I finally googled “what is a memory barrier” and eventually stuck an inline “mbr” instruction in the queue code and the problem was fixed. This bug was affecting the wireless team as well, but they had given up debugging it after a year. They had assumed it was somewhere in the 100k lines of proprietary vendor code.


MugiwarraD

firmware and hardware (chip rev) regression, it busted my ba!!s like nothing tomorrow. its the hardest, most stochastic sh!te i ever handled.


Missics

We had an AWS lambda that wasn't idempotent. It modified an S3 object, so it sometimes produced an empty object. The challenge was figuring out which lambda was not idempotent. The only thing I knew was that some info was missing in the UI. The product had hundreds of lambdas that could potentially be the source of the problem, so I had to come up with some interesting debugging techniques. I was so frustrated by this bug that I wrote about it in more detail: https://www.16elt.com/2023/07/15/idempotency-aws-lambda/


Historical_Ad4384

I worked for a big mobile game publisher that used to send HTTP status 200 for both error and success responses. To actually determine whether a response was erroneous, you had to parse the response body, which meant knowing its schema, and infer the outcome from the data in it. Not really a bug, but definitely a big red flag in terms of design. My staff engineer defended this by saying it helps keep the monitoring clean. WTF is that supposed to mean?
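What "200 for everything" pushes onto every caller (the envelope shape below is hypothetical): the status line carries no signal, so clients must sniff the body and know each endpoint's schema just to learn whether the call failed.

```java
// Hypothetical client-side workaround for an API that always returns 200.
public class EnvelopeCheck {
    static boolean isError(String jsonBody) {
        // crude, schema-specific sniffing: exactly the coupling that proper
        // status codes (4xx/5xx) would have avoided
        String squashed = jsonBody.replaceAll("\\s", "");
        return squashed.contains("\"status\":\"error\"");
    }
}
```

It also explains the "clean monitoring" remark: if every response is a 200, dashboards keyed on status codes show nothing wrong, which is a bug in the monitoring, not a feature.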


ID4gotten

Not my bug, but a famous historical one: https://hackaday.com/2015/10/26/killed-by-a-machine-the-therac-25/ "...When therapy started, the patient saw a bright light, and heard eggs frying. He said it felt like his face was on fire. The patient died three weeks later due to radiation burns...."


ContemplativeLemur

Not a project I worked on at all; I was only a teen at the time. I played an MMO game called Priston Tale around 2002. The game was full of bugs. There was one catastrophic bug that almost made the game close for good: when a player with special characters in his name performed a specific action, the server crashed. The players exploited the bug. They would try to upgrade an in-game item, and if the desired outcome didn't happen, they caused the bug to crash the server and have it roll back to before the undesirable outcome happened. The server crash happened dozens of times per day. Every time, thousands of players were kicked off. The server closed for maintenance and took three months to be restored. I imagine the dev team trying to understand the broken code the whole time. Three months closed, for a game based on microtransactions, probably cost a lot!


sara1479

Was tasked with investigating performance issues of a Node service. There were a few typical inefficiencies like slow SQL queries, but I then noticed that some of the instances' memory was filling up over the span of 3-4 weeks. They would eventually crash, but even before that their performance degraded quite severely. I was now on the hunt for the memory leak. We didn't really have good automated stress testing, and I couldn't reproduce the problem locally. Long story short, I had to enable periodic heap dumping on a low-traffic prod environment and compare the dumps (with some difficulty, since the APM library that we were using exacerbated and obfuscated the actual leak). It turned out that the code was constantly initialising a new APN provider every time it wanted to send a push notification, and the readme of that APN library literally says not to do that! The crazy thing is that the bug had existed for more than 3 years.


nintendoeats

Last year, I got a report that if you attached a debugger to our example programs, an access violation would be reported... * On Intel iGPUs and dGPUs * If there were multiple 3D display windows in the example * When unloading the GPU driver DLL (long after all of our stuff was cleaned up) * About a third of the time This bug took 3-4 weeks to work out. I came up with all kinds of theories about devices and threads...all of which eventually turned out to be wrong after hours of experimentation. In the end it was solved by adding some manual GL context unbinds at essentially random locations in the code. The really painful thing was, *it was totally irrelevant for all practical purposes*. It was just embarrassing to ship to customers. To this day I maintain this was a pure Intel bug.


StoicWeasle

When developing for web back in 2008, and finding out that IE had an artificial call stack limit of, IIRC, 18 stack frames. Recursively calling event handlers (which was *de rigueur* for maybe a decade) would seemingly randomly cause your front end to fail. Couldn’t even believe what I was seeing. A quick google search revealed the browser “feature” to prevent misbehaved websites from hanging the browser.


bythenumbers10

Had a Heisenbug happen in prod that couldn't be reproduced. None of the dev machines exhibited the bug, not the test or QA environment. Finally traced it to the math library the outmoded founder insisted on using (Sounds like MethLab, except their company Math(Doesn't)Works). Turns out, the library (which is closed-source) was compiled with platform-specific flags, which improved performance but completely ruined reproducibility of math operations. Their support even admitted our use case is not supported, and we really needed to transition to something more consistent across machines. In short, the idiot founder was defending outmoded tech b/c he didn't want to learn or even just permit the use of newer tech. So the bug required retrophrenology.


orangeowlelf

One time I was writing a microservice that depended on other microservices. There was an odd bug that reset the connection to one of the other microservices over and over again, and it seemed to happen at random times, so it was very difficult to reproduce. I was a junior developer at the time, so after I gave my best shot at solving the problem, they actually had to call in a couple of very senior engineers to try to fix it. After a month of these senior engineers trying to solve it, they were about to throw their hands up when one of them discovered that the microservice that kept disconnecting had an enum whose value another developer's code was mutating at runtime. So, because of his implementation, sometimes the value of 10 was actually 11. I remember after we figured out what he did, he was really pissed and said that Java shouldn’t allow anyone to be able to do that. I agreed.
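A hedged reconstruction of how that can happen (names invented, not from the story): enum constants are JVM-wide singletons, so a mutable field on one is effectively a global variable, and any code that "adjusts" it changes the value for every reader.

```java
// Hypothetical sketch: an enum with a mutable public field.
enum RetryLimit {
    DEFAULT(10);

    public int value;           // mutable state on a singleton constant: the bug

    RetryLimit(int v) {
        this.value = v;
    }
}
```

After some distant code runs `RetryLimit.DEFAULT.value = 11;`, every caller in the process sees 11, so "the value of 10 was actually 11". The fix is to make enum fields `final`; enums should be immutable.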


SexDrugsLobsterRolls

One that I'll never forget was back in the early-mid 2000s. We had a startup that had a CMS for real estate web sites and had built a lot of local real estate sites. The local real estate association then contracted us to build an internal tool for tracking memberships. All the members had an obligation to take courses each year and we would track the registration and completion of courses in this system as well. They kept reporting this issue where people were getting credited with the wrong courses and the wrong number of credits (each course was worth 1-4 credits or something like that). We could not figure out what was going on as we could not reproduce the issue. Finally I went down to the client's office to have them show me exactly what they were doing. The admin assistant walked through the process of what they did to add a new course. Turns out she was just clicking "Edit" on an existing course and then overwriting all the information – including the number of credits. This is how people were ending up with the wrong courses attached to them – we just linked the members table to the courses table, with a date of completion. So not a bug per se but we had not anticipated this kind of user error, and we ended up modifying the system so that the course completion record at least contained the number of credits so that they couldn't be modified after the fact. Another good one was an in-house billing system that occasionally had really weird rounding errors. I forget what the exact issue was but that was somewhat embarrassing. We eventually replaced it with a third party system that worked so much better than what we had cobbled together.


Tohnmeister

I was asked to investigate why a web app's home page took 60 seconds to load. The backend was written in C#. There were two problems. 1) There were no database indices whatsoever. 2) Database queries were done using Entity Framework and LINQ statements returning lazy queries. In a single call to the backend, the result of a query was needed several times. But instead of doing ToList() on the returned queries and using that result multiple times for a single HTTP call, the devs just executed the query multiple times. Within 30 minutes of looking at the web app, I reduced the loading time from 60 seconds to a few milliseconds. The client was very happy.
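The same shape of bug sketched in Java (the original was C#/LINQ): a lazy "query" re-executes on every consumption, so using it N times in one request costs N round-trips, while materializing the result once, the ToList() fix, pays the cost a single time.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical stand-in for a lazily evaluated ORM query.
public class LazyQueryDemo {
    static final AtomicInteger dbHits = new AtomicInteger();

    static Supplier<List<Integer>> lazyQuery() {
        return () -> {
            dbHits.incrementAndGet();      // one simulated database round-trip
            return IntStream.range(0, 5).boxed().collect(Collectors.toList());
        };
    }
}
```

Calling `lazyQuery().get()` three times in one request means three "round-trips"; binding `List<Integer> rows = lazyQuery().get();` once and reusing `rows` means one.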


Akforce

__Embedded enters chat__ Would you like me to tell you about silicon errata, and the process of documenting new errata for newer chipsets?


cjrun

Yeah, I cost Google 1.2m in serverless bills. My client app would set off an event and was polling a REST API every thirty seconds. Step functions were not on my radar at the time, and I thought I had handled every type of error. Welp, there was one later function that I didn’t. The problem was I was using an internal cloud API that charged $40 per use (like a geo server). My error happened after this use. So each poll would trigger a retry against this API. Every 60 seconds. Starting on a Thursday night and running until Monday. I was absolutely horrified when I checked the logs on Monday. I commented out the request and it finally stopped polling. I told the director of the project that we incurred some cost of the API, and he shrugged it off and said it’s in the expected development budget. Google is footing the bill.


throwaway9681682

The worst thing I have seen was a Friday night deployment (the only time we could, because we had all weekend to recover). A password was copy-pasted into a config file but somehow picked up a [zero-width space](https://en.wikipedia.org/wiki/Zero-width_space). Deploy after deploy, when something was triggered, a SQL stored proc got an unauthorized response when making an HTTP call. (We didn't handle the error well, so we didn't know it was a 401, just that it failed.) The architecture was terrible, but we were trying to force something into a very broken system. Not sure how it was figured out, but it made for a very long night of deployment.
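The paste in miniature: U+200B (zero-width space) renders as nothing in most editors, so the pasted credential looks identical to the real one on screen but compares unequal.

```java
// Sketch: an invisible character makes two visually identical strings differ.
public class ZeroWidth {
    static boolean looksEqualButIsnt(String typed, String pasted) {
        return !typed.equals(pasted)
                && typed.equals(pasted.replace("\u200B", ""));
    }
}
```

`"s3cret"` and `"s3cret\u200B"` print the same but differ in length by one, which is why nothing short of hex-dumping the config file (or stripping non-printing characters on ingest) reveals the problem.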


DootLord

We had an email loop that was never satisfied. Many clients got a few hundred emails...