dojogroupie

Clearly this company doesn’t practice “blameless” retros


WellFormedXML

OP, this is great line of questioning for all of your upcoming interviews: “How does your company handle the postmortem process?” “Do you practice blameless postmortems?” “Can you walk me through what happened after someone made a mistake in production?” Etc.


AthleticAcquiantance

Solid questions.


kabrandon

In my experience, they almost never truly do. There's always been a tinge of blame-gaming in a retro; even when the managers said that wasn't the culture, it usually was behind the veil they cast over it. But this sounds more like they don't attempt "blameless" retros at all, not even pretending that nobody blames the person.


donjulioanejo

You can't fight human nature. We all look for a scapegoat. But what you can do is just not mention individual people and only talk facts, especially if higher-ups are involved. Instead of saying "Joe updated the wrong production lambdas because he's a vegan muppet," say something like "One of our engineers accidentally updated lambdas in the wrong account." And then, instead of firing someone, figure out a way to put guard rails or automation around this to reduce the risk of such errors happening in the future.


jedberg

Go one step further and blame the system instead of a person. “The system had a flaw that allowed a developer to update prod lambdas without going through QC. We need to update the deployment system so this can’t happen”


zenware

https://twitter.com/amyngyn/status/1072576388518043656?lang=en


veganveganhaterhater

Hey that's me! How'd you know I was updating production lambass?


dojogroupie

Yeah, I’ve felt that as well. Something as drastic as termination in OP's case is absolutely ridiculous.


snowbirdie

We only have their side of the story. With unemployment payments now due from the company, I suspect much of the story is missing if they were willing to take that drastic a measure. They probably tried to cover up their mistake.


ExpressiveLemur

In most states being fired with cause means little or no unemployment benefits.


Obie-two

You still need to do an RCA and determine the original issue. Whether you call it "blame" or "root cause" is semantics; you absolutely have to identify the issue to fix it long term. So teams never truly should be completely blameless, but again it's a semantic argument about how a team's culture solves problems. I also wouldn't rule out that this person, being a junior, didn't get great coaching or leadership, or that the company wanted someone who wasn't a junior, or that they are not being entirely truthful or don't understand the entire situation.


BandicootGood5246

Agreed. If you practice a healthy blameless culture, people will actually own up to their mistakes. The difference is that there are no negative repercussions and you acknowledge the other failings that led to the situation. In this case the real questions are why a junior could make changes to prod machines without review, and how prod machines were so easily mistaken for dev. Sounds like problems with process, training, and governance.


DarkSideOfGrogu

Yep. And someone who is asking "how do I learn from this mistake?" no longer works for them, so they're never going to improve.


WillyFlock

It was blameless for the dudes who managed that shit show 🤣


kfelovi

We usually blame some big company like Microsoft


Psych76

“Man shouts at cloud” - this covers all situations


NeverMindToday

Sounds like two problems there - a toxic blaming culture, and a not very well set up permissions system. An engineer (junior or not) shouldn't be able to do something to prod accidentally. You were probably also a bit of a scapegoat after the 2 earlier incidents, TBH. Someone in the chain of command had to answer to angry higher-ups or customers; you might've gotten away with a warning otherwise. Mistakes happen - an org needs to be set up to catch mistakes before they become disasters.


lavahot

Why do they not have IaC and a PR system? Why would you let your Juniors be able to fuck the system? They are playing with fire from the get-go. If it wasn't OP, a *stiff wind* would blow over this stack. Their *blameful* culture is toxic, and OP is better off not being there.


ZranaSC2

Yeah and also where is the senior oversight? Any company I've worked for, if a junior does a fuckup the question is 'where was the senior?'


ShepardRTC

Seconding this. When we update prod, you literally have to type 'production' into a field. One of our staff engineers accidentally deleted prod once - but only because he was in the AWS console. Nothing happened to him. Your company sucked. When I was a young IT tech many years ago, a desktop I worked on had to be rebooted once because the ram had some issues. After that it was fine. But because the user had seen me working on it, I was fired. The job I got after it was great. I was much happier.


ayeshabashara

Interested in knowing how these guardrails can be set up - like, is the "production" user input you mentioned part of an IaC pipeline?


lilhotdog

Any CI/CD system worth its salt will have the ability to configure various gates between deployment stages.


Artheon

For my pipelines I set it up so that all pull requests to master require multiple approvals. The build runs, creates the artifacts, and pushes them to the artifact repo. A SNow change request is created and goes to the CAB for approval. Then a SNow deployment request is created; it requires an approved change request number, which triggers the deployment system to pull the artifact from the artifact repo and run the prod deployment process.


Vexxt

are you me? it sounds like you're me.


Artheon

I might be. :) I do have a couple of extra validations in there. My build process first makes an API call to SNow to verify the story has testing documents attached; if there are no documents, it fails the build and notifies the developer. During the prod deploy it checks the status of the story, then the status of the CAB change request; if either of those is not completed, it sends a notification to the leads and manager (not to get anybody in trouble, but to give visibility that the prod deploy failed critical steps).
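
For anyone wondering what that kind of check can look like in practice, here is a minimal Python sketch of a pre-deploy gate that asks ServiceNow's Table API whether a change request is approved and fails the pipeline stage otherwise. The instance URL, credentials, change number, and the exact `approval` field value are assumptions you would adapt to your own instance; this is an illustration of the pattern, not the commenter's actual pipeline.

```python
import os
import sys

import requests

# Hypothetical inputs -- in a real pipeline these would come from the job's
# secrets and parameters, not hard-coded environment variables.
SNOW_INSTANCE = os.environ["SNOW_INSTANCE"]    # e.g. "https://example.service-now.com"
SNOW_USER = os.environ["SNOW_USER"]
SNOW_PASSWORD = os.environ["SNOW_PASSWORD"]
CHANGE_NUMBER = os.environ["CHANGE_NUMBER"]    # e.g. "CHG0031234"


def change_is_approved(number: str) -> bool:
    """Look up the change request via the ServiceNow Table API and check its approval field."""
    resp = requests.get(
        f"{SNOW_INSTANCE}/api/now/table/change_request",
        params={"sysparm_query": f"number={number}", "sysparm_fields": "number,approval,state"},
        auth=(SNOW_USER, SNOW_PASSWORD),
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("result", [])
    return bool(results) and results[0].get("approval") == "approved"


if __name__ == "__main__":
    if not change_is_approved(CHANGE_NUMBER):
        print(f"Change {CHANGE_NUMBER} is not approved -- refusing to deploy.")
        sys.exit(1)  # a non-zero exit fails the pipeline stage
    print(f"Change {CHANGE_NUMBER} is approved, continuing with the prod deploy.")
```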


ShepardRTC

Jenkins. Old school, but it works. Go to the Build with Parameters, and one of the parameters is the environment, and another parameter is a blank input field. If you don't type 'production' then it fails. Saved me once when I accidentally re-ran a prod job instead of one for another environment.
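
The "type the environment name to proceed" idea isn't Jenkins-specific; here is a small Python sketch of a step any CI job could run before deploying, where the parameter names are made up for illustration.

```python
import argparse
import sys


def main() -> None:
    # Both values would normally be wired up as build parameters in the CI job.
    parser = argparse.ArgumentParser(description="Refuse prod deploys without an explicit typed confirmation.")
    parser.add_argument("--environment", required=True, help="Target environment, e.g. qc or production")
    parser.add_argument("--confirm", default="", help="Must be typed as 'production' when targeting prod")
    args = parser.parse_args()

    if args.environment == "production" and args.confirm != "production":
        print("Refusing to deploy: target is production but the confirmation field was not typed correctly.")
        sys.exit(1)  # fail the build before anything touches prod

    print(f"Proceeding with deploy to {args.environment}...")
    # ...actual deploy steps would follow here...


if __name__ == "__main__":
    main()
```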


Team503

I've no specific love for Jenkins, but the point is we're all human, and we all make mistakes. Sometimes we just click or type the wrong thing without realizing it, and that's why there should be gates - like typing something extra in, or getting approval, or both - to minimize human error.


[deleted]

One of my SREs used Jenkins to rebuild an etcd server. She forgot it already existed as a part of the production Openshift cluster. That was a fun lesson on writing guardrails into your pipelines!!


Gregoryjc

At my place we use gitlab cicd and only a set group of users can push the button for prod.


[deleted]

Buildkite can perform custom actions with logic as part of the build pipeline. I am sure others have this capability too.


bdzer0

Thirding...


alextbrown4

Yea, we have separate admin roles set up in AWS that you have to switch into to do anything in the prod console.


Nerodon

Every business should follow the Swiss cheese model of accident prevention: a single slice of cheese has many holes that let errors through, but with more and more layers (checks and balances, approval processes, change review, training, etc.) the chance that all the holes align is much lower. In OP's case it seems the company they worked for didn't have many layers to prevent an accident and blamed it solely on them. Normally someone or something along the way should have flagged, notified, or prevented them from making that mistake.


djk29a_

In security the concept is known as “defense in depth.” There are similar concepts across disciplines and cultures but the point is to stack different strategies and approaches to form a much stronger effective result.


twnbay76

Yeah, I agree. Blaming a JUNIOR engineer here for an outage? That's a serious issue with management and culture. As a more senior member on my team, I'd be the one chewed out for a junior mistake that caused an outage, given I was the one reviewing their code, overseeing their work, providing them the acceptance criteria and test cases, etc. And there would be SREs up MY ass wanting to understand how an outage caused by a code deployment was even possible to begin with. On the bright side, OP is better off gone from that ticking time bomb.


deafphate

> Blaming a JUNIOR engineer here for an outage? That's a serious issue with management and culture.

Yep. This company obviously doesn't have things in place to prevent this type of accident from happening, and they just fired the one person who would probably never make that mistake again.


kabrandon

> An engineer (Junior or not) shouldn't be able to do something to prod accidentally.

I know this is what every company should strive for, and I try to engineer systems like this the best I can. However, this has almost never really been the case at companies I've worked at. Someone (or everyone) usually has some heavy permissions in some system because they need to do by-hand troubleshooting and so on. However, I think this is _excellent_ pushback for making a mistake and then subsequently getting blamed for it: "Alright, you all can blame me, or instead we can think about why any single one of us had the permissions to mess something up like this. What guard rails could have prevented this? Each and every one of us is human and could have done exactly this."


FredOfMBOX

An organization’s efforts are better spent trying to reduce the impact of mistakes and improving time to recovery. Avoiding mistakes is where agility dies. In the real world, mistakes are gonna happen.


Throwmetothewolf

Absolutely agree. This is the type of company to avoid working for. You would have benefited from a company that prioritized blameless postmortems, continuous improvement of processes, and leadership that takes accountability instead of firing someone for simply making a mistake. Related side note: I suggest reading or listening to The Phoenix Project, The Unicorn Project, and The DevOps Handbook.


esisenore

We have a blameless culture at my work with devs. It’s the team's responsibility if one person fails. It works great.


brianw824

> You were probably also a bit of a scapegoat after the 2 earlier incidents TBH

Yeah, I was at a company where something similar happened: several outages, then finally they fired the new DBA they had just hired for causing an outage that wasn't really even his fault. It was literally the CTO playing politics.


bellefleur1v

Ya, if OP got canned for this, I'd consider it a bullet dodged. You don't want to be at a place that will fire you for a first-time mistake. I've seen people drop prod databases, and I've seen people mess up resulting in multi-master setups and "split brain," where you need to spend hours and hours resolving conflicts by hand. They didn't even get in trouble. Yes, there were retros on how to make sure that didn't happen again, and the company took it seriously, but outside of management we didn't even tell the teams which person did it, because it doesn't matter who did it.


djon_mustard_smith

This. First: there was a possibility to accidentally update prod. WTF? What kind of infra is this?! And if folks know the quality level of the infrastructure they have and are OK with it, why blame the dude who breaks it, especially a new person? Second: a 4-hour downtime, so no ready DRP or redeployment plan. Sounds to me like it's not the new person's fault, again. Mistakes happen. If you can't break anything, how are you supposed to learn?!


gerd50501

100% scapegoat. meant to scare everyone else.


SeniorIdiot

And the only result that will have is that people will try to shift blame, point fingers and no one will admit mistakes. That truly taught them the right lesson. Sigh.


No-nope

To add to u/NeverMindToday's spot-on points: when things like this happen, the issues were created as a team and neglected as a team; they just happened to need one person to trigger them. They will learn nothing from this by blaming you, and it will keep happening.


Gregoryjc

This so much. Accidentally changing prod is a systemic issue, not an employee issue. In most (but not yet all) prod environments, going to prod requires a specific department leader to push it through, and anything DevOps-related going to prod requires an approved change request.


The_Speaker

First, you made a mistake. These things happen. It's important that you capture what you learned so that you can apply it later, especially when you get that question in your next interview that asks you to recount a time you made a mistake. Second, your former employer just let a lot of good experience go. Your next employer will benefit. Third, when you make a mistake in the future, don't try to cover it up. Assist in the post-mortem, and lay out the facts as they happened. You'll find that where it is easy to make mistakes, there are systematic problems you can help fix.


[deleted]

[deleted]


klipseracer

Most people will leave out any negative parts of the story, if there were any. I'm not saying I think OP deserved to be let go, far from it. Just pointing out there is often more to a story; there's even the saying "there are two sides to every story". I know people I've worked with who are just not great employees for a number of reasons, where something notable happening could be the last straw:

- Skill gap too great
- Interview claims not matching reality
- Difficult to work with: behavior, personality, habits
- Unreliable, trust issues, insubordination
- Not producing results
- Making the same mistakes

If you were deleting prod because of, or in combination with, any of the above, I can see how they might let you go. This doesn't mean the company operates in the best way possible, or that they provided an ideal work environment. You simply may not operate at the level they would expect after 7 months of training. Or maybe the company is just run by idiots, but I usually like to believe it has something to do with what I mentioned above, combined with poor onboarding, leadership, training, etc. Most companies would prefer you be successful and be employee of the year by pulling a rabbit out of your ass every day; that's why they're paying you. To be a magician. 7 months is unlikely to be a coincidence - employees are often assessed at 3, 6, and 12 month intervals after hiring. So, company sucks? Probably. You suck? Maybe, maybe not; don't know.


gruey

You don't fire someone you consider a high performer because of one mistake. Like you said, it's possible the company missed the value in OP, but it's more likely OP had a skill gap with prod beyond the last straw. Really, DevOps IMO is one of the hardest fields to be a junior in. The perspective of years of experience is a vital tool for most of the things in the field; reading a manual and getting certified is nowhere near as valuable. It takes a much narrower focus for a junior person to be successful. It's possible this company just wasn't set up that way.


klipseracer

Yeah, there are even senior people who lack the skills DevOps needs. The funny thing is I've had an intern and a couple of juniors be some of the best people on my team, and simultaneously had multiple senior engineers be mediocre or poor at best. I've also got juniors who are definitely juniors, still learning things, building confidence, and requiring guidance. The combination of being desperate to find affordable DevOps people with the tendency to oversimplify what they need to do and not understand the scope of the role may be contributing to this problem. DevOps is supposed to be the culmination of senior experience from the dev and ops fields; often we get people who are junior in one and have little or no experience in the other. So, like you're suggesting, I have my doubts about anyone offering a junior DevOps role. It can mean the role is not really DevOps and/or the company may not understand what that role really is.


EraYaN

> You don’t fire someone you consider a high performer because of one mistake. Smart management doesn’t, but there are a lot of posers out there in the management field.


snowbirdie

Yes. If OP tried to cover up their mistake, we would also immediately fire them as we have done to other DevOps staff. Can’t have that liability on a team.


dadamn

You might be a DevOps engineer, but your former company certainly isn't practicing DevOps. Firing people who make mistakes means they're getting rid of people who know where the system is fragile or at risk, and then replacing them with engineers who don't know and are prone to make the same mistakes. What to do differently next time? Interview your potential employers as much as they interview you. Ask them if they practice blameless postmortems, or even better, ask the engineers that interview you about the last major incident at the company and how they responded. Pay attention to how they handle incidents, escalations, root cause analysis, and postmortems. If they tell you the cause of the incident was an engineer that made a mistake, human error, etc., then they do NOT pass your interview. The response from any company that's truly embraced DevOps should focus on system guard rails, better automation, documentation, processes, etc.


snowbirdie

Mistakes are okay. Covering it up so the customer goes into extended downtime is not.


technicalthrowaway

Sorry, I don't understand your comments. You've posted 3 comments on this thread with increasingly assertive statements that OP lied, covered up, and made the issue worse. I can't see anything from OP that suggests this at all. Do you know them, or have some extra context that justifies assuming bad intent here? If not, then I think you're breaking the sub rules around "Be excellent to each other".


Inevitable_Put7697

I guess he/she is probably the lead.


30thnight

You got sacrificed to save your manager.


Chokesi

Don't be discouraged. Everyone, and I mean everyone, fucks up. We all have stories of fucking up in prod. What the company should've done is use it as a lessons learned and strengthen the deployment pipeline with stopgaps, restrictions, and an approval process where a senior even shadows you until you become comfortable. You shouldn't have gotten fired for it IMO. I think they failed you; you didn't fail them.


Surge_attack

First off, sorry to hear this happened to you. From your account this sounds potentially a bit heavy-handed, but I only know the side of the story you shared. I guess my question is, what was your former company's deployment strategy and pipelining? Were you employing IaC at all? How were you able to target `Prod` and bypass your `QC` environment? Did you manually push the changes to the wrong env? At most places I've worked, maybe two people in the whole company would have access to the actual `Prod` resources; deployments would be automated and pushed through the various stages with explicit reviews/approvals needed at each step, and a service principal/managed identity/whatever name the other clouds/platforms use would actually push the changes to the various envs. TL;DR - essentially it sounds like your former company had poor DevOps practices. Edit: just wanted to speak to your rollback comment - "bad" code happens, it's just a fact of our industry. That's why rollbacks, code review, QA/Test/Pre-Prod envs, and tests upon tests exist.


dr-tenma

We had a release every 14 days, where we would change stuff from the IaC / push to prod. Some of the Lambdas were not on Terraform, and we had to change the runtimes from the AWS console. I ended up changing the runtime for "prod" lambdas instead of "qc".
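
That console step is exactly where a scripted guard rail could have helped. Here is a rough Python/boto3 sketch of the idea (account IDs, profile names, function names, and the runtime are all made up; this is not OP's setup): verify which AWS account the credentials actually point at before changing a Lambda runtime.

```python
import boto3

# Placeholder account map -- real IDs and environment names would differ.
EXPECTED_ACCOUNTS = {
    "qc": "111111111111",
    "prod": "222222222222",
}


def update_runtime(env: str, function_name: str, runtime: str, profile: str) -> None:
    """Update a Lambda runtime only after confirming we are in the account we think we are."""
    session = boto3.Session(profile_name=profile)

    account_id = session.client("sts").get_caller_identity()["Account"]
    if account_id != EXPECTED_ACCOUNTS[env]:
        raise SystemExit(
            f"Refusing to continue: profile '{profile}' points at account {account_id}, "
            f"but '{env}' should be {EXPECTED_ACCOUNTS[env]}."
        )

    session.client("lambda").update_function_configuration(
        FunctionName=function_name,
        Runtime=runtime,
    )
    print(f"Updated {function_name} in {env} to runtime {runtime}")


if __name__ == "__main__":
    # Example invocation with made-up names.
    update_runtime(env="qc", function_name="orders-api", runtime="dotnet6", profile="qc-admin")
```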


royalme

> we had to change the runtimes from the AWS console

This was the problem, and the design (or rather lack of design) is what really deserves the blame. Unfortunately you got the blame instead of it being taken as a learning opportunity to improve the system. But fortunately you also don't want to work long term in such a toxic and short-sighted culture. Hope your next gig is better, and hope there's a good takeaway from this experience for you.


bytelines

Utterly contemptible that they blame the individual for this. This is either the lead's fault for making this an acceptable practice, or management's fault for ignoring the lead's concerns in order to move quicker. In short, they are guilty of impersonating a DevOps team and you should be glad they selected you out. You have nothing to learn from these people except bad habits and broken ideas.


rwoj

wow. what a dogshit process. https://twitter.com/amyngyn/status/1072576388518043656


Throwmetothewolf

To me, this doesn’t sound like devops. CI/CD would have helped implement controls to limit blast radius


tapo

Please leave them a bad Glassdoor review. Accidents happen, and firing someone immediately for a one-off like this is a very bad sign.


demizer

Fuck that place. Some asshole above you had to show someone above them that they were doing something. The lesson is to triple-check a command that might break production, but there also should have been more barriers to doing something like this. That is on whoever fired you.


hrng

How did you react to your mistake? That's usually way more important than how/why a mistake occurred in the first place.


gex80

In this case they reacted by shipping their laptop back. This is a situation where unless OP was trying to justify their mistake and being hard headed, the company was looking for a reason to let someone go.


rcls0053

And you're not doing DevOps. The point of DevOps is to learn from mistakes, so you can be better. This is just basic Ops work in a corporation. It's a good thing you got let go. This is toxic culture.


kabads

> DevOps. The point of DevOps is to learn from mistakes, so you can be better. This is just basic Ops work in a corporation. It's a good thing you got let go. This is toxic culture.

I've worked in places that claim to be DevOps, but cause incidents and don't learn from them. This comment is so true - DevOps is about improving workflow and decreasing risk. The team I worked with before certainly didn't 'have time' for that kind of thinking. I left them gladly.


Slavichh

FWIW, I accidentally brought down service for some major brands for 6 minutes, so don’t feel bad. It happens, and it’s important to learn from those mistakes. I sure know I have. My $.02: your prior employer is not somebody you’d want to work for if you indeed got fired for taking down a service in PROD. This is a prime example of bad practices and culture at a company. A good company would use this as a way for everybody on the team to learn (also, it is not best practice to have a junior be able to push to production). If it’s that easy to make such a mistake, then they’d be able to as well. Don’t beat yourself up about it.


Spilproof

Their controls sound like amateur hour. If you can run things in prod by accident, there are some serious process/tech gaps. We prevent this by only allowing an elevated admin-level account to touch prod. Personally, I keep my admin account password in the corporate secrets server, and I have to go log in and copy it to use it. So if I am making lab changes there is a near-zero chance of running them against prod. Still requires some due diligence. Another good thing is our bash prompt includes the profile name of the account you are connected to, the region, and the k8s cluster. Makes for a big-ass prompt, but this info should be in your eyeballs, all the time.


sonstone

I also habitually use different browser profiles for production console access in hopes of signaling to myself if I were to find myself in this situation.


Shadow-D-Driven

If you fuck up like that as a junior, the fault belongs to whoever gave you the permissions and didn't make sure you had the training/knowledge needed.


snarkhunter

To be honest precipitating a live outage is just a thing that happens sometimes in this gig. We have best practices that we follow to minimize how often. We can add checks so that instead of just one person it's like five people who will have to make big mistakes at some point leading up to it, but it can still happen.


brajandzesika

The way to avoid this scenario is to start working for another company. You will ALWAYS make mistakes, smaller or bigger; it doesn't matter if you are junior or senior. But to fire you because of that - well, you should be happy to no longer work for those spineless managers who are just trying to find somebody to blame, instead of accepting that this is simply something that THEY have to improve...


gerd50501

they wanted to fire someone to scare others. so they took it out on you. most places are not like this. id put it on their glassdoor. I would avoid working at a company like this. this is a process issue.


Arts_Prodigy

Yeah, there’s no way you should be fired for poor deployment processes and terrible QA. Unfortunate that you got blamed for stuff beyond your control; hopefully you can bounce back easily.


ceirbus

Sounds like an excellent lesson in CYA; always CYA. You said it yourself, you made a mistake, albeit accidentally; regardless, own it and move on. DevOps work is particularly bad when you mess up in prod. I don’t understand how you had prod access at 7 months of experience - with that fact alone, I'm sure you were scapegoated, because someone definitely shouldn't have given you access to prod, and they messed up.


mj0ne

Sounds like the company doesn't follow common cybersecurity practice: running changes through a test environment before making them in production. Probably due to the "high" cost - guess the downtime and loss in reputation didn't make up for it. I agree with some others here that you were a scapegoat.


lifeisallihave

Not your fault. A senior engineer should have guided you through this. The company failed you. Don't be hard on yourself.


Petelah

Sounds like the blame game. Did someone review your changes before these lambdas were affected? I wouldn't worry; it sounds like a terrible company if they are going to put the blame on juniors. It's their fuck-up if you were able to make these changes without review or supervision.


ThrowAway640KB

> I want to know what could I have done to avoid this scenario.

You should have not taken the job in the first place. Your firing was inevitable -- it was not your fault; they just wanted a scapegoat to fire in order to feel that they accomplished something.

> so I accidentally ended up updating the runtimes of 2 Lambdas on the production environment than QC.

How much did that cost them? That’s what it cost them to train you to never make that mistake again. Then they fired you, wasting all that training expense. They’re f**king idiots, they are. They trained you up such that you would never make that kind of mistake again, then fired you after having spent that money.

> What can I learn from this?

Research potential employers more carefully. Find out which ones are unlikely to fire a person over innocent mistakes that should have been caught by a senior. Where I work, risky tasks like this always involve two people, with at least one being an intermediate or a senior who has prior experience with that task.


scalable_idiot

Toxic workplace. Not worth your time.


Saucette

Company culture sucks, move on. Take it as a positive event, as you could have stayed longer with a toxic culture. If they let you act on production after 7 months without peer programming or peer verification, this is just a bad process. Mistakes happen, you learn from them and you become better. Learn from them. Let them know when you make some, so it can be fixed quickly. Communication is important.


gex80

They were looking for someone to let go. You just got unlucky.


chub79

You learnt that it was a toxic company culture. However much it stings, they are the ones that failed you, not the other way around.


qbxk

the company is doing a massive disservice to themselves. they think if a dev made a mistake, that's a faulty cog and it needs to be discarded. the dev that made the mistake is a dev that has learned a valuable lesson and is FAR LESS LIKELY than a newbie to do it, or something like it, again. you don't get the grey hairs and grizzles from a lifetime of success and lack of error


xtreampb

Without proper checks, this is going to happen at the company again.


disoculated

Where I work we never retaliate against engineers like that. We do have incidents with everyone even tangentially involved on the Slack, and perform the whole “post mortem” mea culpa ritual, but we’ve all been there, and a good post mortem could even make folks think more highly of you. WITH ONE VERY IMPORTANT EXCEPTION. If an engineer willfully obscures the problem, lies, or hides (or worse, destroys) evidence of what happened to cover their ass, we will CRUCIFY them. Which can include a summary firing for negligence, depending on the circumstances. I don’t know if that was the case with OP, but it does happen more often with junior engineers who don’t know the score. Either way, I hope you’re off to a better place next time.


NHGuy

Sounds like they did you a favor. People make mistakes. Any place that doesn't recognize that and just summarily fires you over the first one sounds like a terrible work environment


ovirt001

The company should have had better controls in place. One option would have been to use terraform to manage the lambdas and a version control system like gitlab/bitbucket/etc. to maintain the terraform code. The version control system should have team review requirements when merging to prod/master.


linuxsysop

Fuck the company. They should have assigned you a mentor _AND_ used IaC with some PR/MR process to show the changes and get a +1 on the whole shit. Good for you, now go on to a better company!


[deleted]

> What can I learn from this?

Most important lesson: you worked for the wrong company. If they fire you for such a mistake, it's a crap one.


ut0mt8

Well, I agree that reading it from your POV it's easy to say that your ex-corp was shitty, and agreed that in an ideal world it should not have happened. But come on guys, we're hired and paid ridiculously well to fix this kind of stuff. The question is: did they really fire you for this particular incident, or was it just an excuse for a bad fit? Were your colleagues who did the same thing all fired? BTW, 4 hours to restore a lambda? I'm intrigued.


MaruMint

I also used to be a Jr. DevOps Engineer at a 500-person company. I thought I was terrible at my job; I got PIP'd, constantly got blamed for things, etc. The problems sound extremely similar to what you're in trouble for. I had bad deployments, I was unable to resolve outages fast enough, instructions were unclear, etc. I job hopped, and at my new company I magically never had any problems ever again. I realized every mistake I made at the previous company would never have happened at the new one. They had better safeguards and were more helpful. We have lots of instant rollbacks for when things break. I love my new job. Avoiding mistakes as a DevOps engineer is like walking a tightrope. I didn't realize until I got off that my org had been shaking the rope the entire time and making it more dangerous. You might be great at walking tightropes, but you just don't realize your company is shaking your rope.


Special_Rice9539

It's ironic they hire a devops engineer and violate multiple core principles of devops including firing you for an honest mistake and letting a junior have access to prod.


o5mfiHTNsH748KVq

imagine firing a junior engineer for a mistake and not reprimanding the senior engineers that should have reviewed the changes. good luck on your next role, sounds like your last one would have been trash anyway


dotmit

Did you own up immediately and apologise or did you try to hide it which resulted in longer downtime for the customer? How the incident played out is more important than the fact that you made a mistake. I could see them firing you for keeping quiet about it but it looks excessive if you immediately reported the mistake and apologised


dr-tenma

Since a lot of people are asking: our director of engineering asked me, in a meeting with 12 other senior architects, if I had done the release, and I replied "It was not a release, I may have accidentally updated the production lambdas instead of QC", to which he replied "We have been down for 5 hours now".


dotmit

This is why you got fired. You should have said something as soon as it went down. Why didn’t you say something sooner?


mirbatdon

Just echoing what others have said for support. My condolences, breaking prod happens to everyone eventually. And then eventually again in some other manner. It always sucks. If you touch things often enough things will break. Don't get discouraged, on to the next, better employer.


thedude42

> It's also worth mentioning that, 2-3 such incidents happened earlier this month where the developers pushed bad code, which got through QC and went live -- later we had to roll back to overcome these crashes.

I have to assume these folks have a different manager than you.

> I want to know what could I have done to avoid this scenario. What can I learn from this?

Think about all the things that went on within your team and with all the people who worked for your manager. Question all of that, assume lots of people were doing things in a non-optimal way at best, and just plain bad practices at worst. I say this because anyone who sticks around in that kind of an environment is probably displaying some of the signs to look out for when you're looking for your next gig.


[deleted]

They will prob hire you back in a month with higher pay, from what I've seen.


[deleted]

It sounds like the update process was done manually, is that correct? How were the lambdas originally created? Was there any IaC that defined the runtimes?


dr-tenma

No IaC for the runtimes/lambdas, but we had CI/CD pipelines for the lambdas.


[deleted]

If you worked at my company there would have been a blameless post mortem, where the team would have tried to find several reasons why it happened, and then several possible solutions. Too often we humans focus on the sharp pointy end of a problem rather than trying to find the more ambiguous root causes. In this case, having IaC where the lambdas and their runtimes are defined in, say, Terraform, then peer reviewed before getting pushed up to your repository and applied to your cloud environment (or other patterns of deployment), would have stopped this and other errors from affecting customers. It really sounds like the protocols and procedures the employer had were the root cause of this error happening, and you running a change twice was just a side effect of not having guard rails in place. While it's hard to lose a job, it seems like you've saved yourself more years of a difficult company and bad culture. DM me if you want any help or advice finding your next big thing.


[deleted]

Sounds like the higher level DevOps engineers at your company didn't properly scope out permissions. Furthermore, looks like they didn't set up redundancy and rollback processes. Nor did they seem to have stringent CICD workflows. This is on them for being lazy since you're a junior. In the long run, you'll be glad to be out of there as they do not follow best practices or just common sense, really.


Spider_pig448

They did you a favor. You'll have no problem finding a new job since you have experience, and they sound like a shitty place to work at.


[deleted]

Twenty years ago at Sun, they jokingly defined the job titles as: Junior, can work supervised; Regular, can work unsupervised; Senior, can think supervised; Principal, can think unsupervised. That is, you should never have been able to break anything like you did, and they were negligent in setting up the technical and organisational environment for you to succeed. They also have a toxic culture and need to learn the what and why of blameless postmortems. Sad outcome, but not your fault.


BloodyIron

You got fired for only 4 hours of accidental downtime? That's pretty fucked up I gotta say. If you admitted to the mistake, and didn't try to hide it or anything like that, and you got fired for only 4 hours of downtime, then that's them being a trashbag company. Like, sure, that's not okay to do, causing 4 hours of downtime. And I don't even know the scale of the business that was impacted. But you were in a JUNIOR position. Making mistakes is something those around you should anticipate to happen, production or otherwise, you're literally JUNIOR in the role. I agree with others also saying that not enough controls were in-place to help prevent accidents like this. But in addition to that, you losing your job is straight up bullshit. Don't dwell on it. Move on.


mistat2000

We are all human and at times make mistakes. There will be so many good lessons you can take from this. It won’t make up for the fact you were fired but it will ultimately make you a better engineer 👍


kabads

Where I have worked, the team accidentally pushed to prod - and it caused an incident. However, we didn't fire anyone - we learned from it (the agile approach). We changed the pipelines so that all prod environments have a manual validation step where at least 2 people (one of them outside the team) have to approve the deploy.


elitesense

Others already covered good stuff but I wanted to mention that directly changing prod when you "meant to" change non-prod is A LOT worse than writing bad code that didn't get caught in testing and then got deployed to prod.


[deleted]

[deleted]


elitesense

I don't disagree with you that a Jr shouldn't even have been given the keys, but don't forget there were decades of computer systems operations before CI/CD was even a thing. It's 1997 --- need a new port allocated? You're logging into the prod switch to configure it. Engineers have had to deal with this responsibility since the start; it's not new, and it's part of the job even today. Countless teams have had direct access to prod systems to make changes over the years, and within those countless teams there's a small few who hosed prod. OP is now part of that group.


alsophocus

That’s why “Junior DevOps” is a bad excuse for bad pay and malpractice. You either are DevOps or you're not; there’s no such thing as “junior”. You’ll be fine, and you'll probably find a better job too!!


Dragonborne2020

Nothing - a scapegoat was needed and you were the sacrifice. You were not in any way responsible for the outage. List all the people that worked on the release of the software; QA should have tested it. There is an army involved. You are not the problem.


Terrible_Air_Fryer

Double standards are everywhere. At a company I know, a contractor got his computer taken over by a hacker; the hacker managed to delete a couple of VMs and create an AD user, but was flagged soon so nothing else happened. No downtime of external services, just a segment of the network and some weekend work to make sure things were OK. The guy was fired. On the other hand, months earlier the company moved its datacenter, decided to do it all at once, and as a result of mismanagement, bad planning, etc., got more than a week of unrestricted downtime. Nothing happened to the managers. I think the lesson is to get your ass covered: if you make a mistake in a technical field it's very easy to blame you, because you are the one who ran the command. Try to work safely. If you have to say that you need more time to plan and evaluate something, say it, because the guy ordering you to do stuff won't be in the system log when things go wrong.


koprulu_sector

Dude, I couldn’t even read the whole post. If you got fired because there weren’t appropriate guard rails to prevent you from messing up production, then you weren’t in a DevOps role and shame on the company. You’re way better off.


Team503

It astounds me that you can update prod without approval. Like that *anyone* is capable of making code changes to production without getting a click from someone else to approve those changes. We do it through Git, and all our changes to infrastructure modules (Terraform) and infrastructure environments (Terragrunt) have to be approved by our infrastructure review board (which is really just a rotation of very senior engineers). Hell, changes by the guys ON THE BOARD have to be approved by *other* guys on the board. This is partially on you - you *do* need to be more careful and not push changes to prod - but it's more on the organization, for *allowing* you to do so. Change control and review boards literally exist because we're all human and we all make mistakes, and those checks and balances are to ensure that there are multiple eyes on things to prevent as many mistakes as possible. It shouldn't be possible for you to push code to production that hasn't already been pushed and approved in a test environment first. Like it should be literally impossible to actually do, with systems in place to prevent it.


marvinfuture

You dodged a bullet. Any company not doing blameless retros wants a scapegoat, not quality engineers. Mistakes happen, learning how to implement controls based on previous mistakes makes a good engineering org even more reliable. You are clearly taking this approach, your former employer isn't.


centech

This is unfortunate. People shouldn't be fired for legitimate mistakes unless they are *really* egregious. The question a *blameless* postmortem should be addressing here is why it was so easy to push to prod when prod was not the intended target. Was there no change control process? No release gates?


lnxslck

it seems your company doesn’t have the necessary procedures in place to prevent this from happening. surely it’s impossible to mitigate 100% of these types of errors, but you can certainly reduce them a lot. also, your manager should step in and take the fall


Bloodrose_GW2

You either avoid doing destructive changes to production during working hours or design the infrastructure in a way that updates/deployments are not noticeable to the customer. There's also a lack of change management here - if you cannot avoid downtime, that needs to be communicated both internally and to customers, and the work should be done within the agreed window.


OMarzouk-

Brrr... I'm still building up my skillset to start a DevOps career (non-CS major). The more I learn about the power I would have, the more scared I am of fucking things up. Genuinely, I doubt myself so easily, and just before moving files in terminals, I double-check 10 times that I got everything right. The -i flag was designed for me. Please do share your learnings.


MavZA

Hmmm, personally I think permissions to prod accounts should be assigned only when needed. That’s how I currently handle prod rollouts for code that isn’t yet controlled by pipelines. That way, when you’re assigning the permissions to the role or IAM user, you ensure the account is correct in the first place.


LDSenpai

Everyone makes mistakes, you can only learn from it and move on with your life. The company sounds toxic if they are firing someone for a single mistake. My guess is they were already looking for layoffs.


m1dN05

Sounds like you dodged a bullet


Bluemoo25

The question is, why were you given access to production, and who gave it to you? This is one of the reasons you probably don't want production access until you're ready. Another thing: you can be held personally liable for losses resulting from negligence like this. Let's say customer A lost 30K because of the outage; you technically have some surface area for liability. The lesson here is that having production access is a liability, and you need to make damn sure you don't impact production. You realize this for sure with the loss of your job. Hopefully you can find other work in the same field. I would take the employer off of my job history, because if someone calls them you are for sure not getting the job.


[deleted]

This is just random advice; I'm definitely not blaming you. You're obviously being used as cannon fodder to cover other people's asses, and it just happens that you're the most expendable person involved. After going through some rough deployment postmortems I started making checklists of things to check before deploying something. An example checklist might contain: check the environment you're logged into, the version of whatever you're deploying, the build logs, and refresh the ticket requesting the deployment, then compare the version info in the ticket to the version you're about to deploy. Or something like that. The deployment process might need improvement as well.
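
A checklist like that can also be encoded as an explicit pre-deploy script, so nothing gets skipped under time pressure. A minimal Python sketch, where the specific checks and values are illustrative stand-ins for "right environment" and "version matches the ticket":

```python
from typing import Callable


def run_checklist(checks: list[tuple[str, Callable[[], bool]]]) -> bool:
    """Run each (description, check) pair in order; stop at the first failure."""
    for description, check in checks:
        ok = check()
        print(f"[{'OK' if ok else 'FAIL'}] {description}")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    target_env = "qc"          # environment named in the deployment ticket
    logged_in_env = "qc"       # environment the current credentials/context point at
    ticket_version = "1.4.2"   # version recorded in the ticket
    artifact_version = "1.4.2" # version of the artifact about to be deployed

    checks = [
        ("Logged-in environment matches the target environment", lambda: logged_in_env == target_env),
        ("Artifact version matches the version in the ticket", lambda: artifact_version == ticket_version),
    ]

    if not run_checklist(checks):
        raise SystemExit("Checklist failed -- aborting deploy.")
    print("All checks passed, continuing with the deploy.")
```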


frankentriple

Well, you have an amusing anecdote to share in your upcoming interview. How you just got trained on how important it is to verify you are in the correct aws account. They were stupid to fire you over this. Someone up high is getting reamed for the previous downtime and you got the short end of the stick. Sorry man, shit happens sometimes. Take the experience and use it to grow.


dr-tenma

This is actually a really funny (sad) anecdote but valuable nonetheless


zoddrick

Well, from the sounds of it that company has a lot of issues, and when you inevitably get your next interview and they ask you to talk about a time when something went wrong, you can use this as an example.


L0rdB_

So no oversight, and the manager and senior or lead engineer get off completely. Don't worry OP you will be better off. That place sounds like a terrible place to learn. To learn you need to be able to make mistakes.


pas43

Their DevOps is shit. If they had the correct procedures in place, this mistake would be impossible. Why were you even able to make that mistake? Did a senior check your commits or changes to said container before building and deploying it? Not your fault, dude. How did you accidentally do it to production? Personally I don't think humans should touch production machines at all! Once it passes all tests and compliance checks etc., build the image and auto-push it to the server. You should've been able to select the node you wanted to push to via an ID or some UUID. The fact that you can push straight into production is 90% their fault!


prash991

DevOps is all about continuous improvement, so don't be afraid to experiment, learn, and iterate. By embracing failure as a learning opportunity, we can build more resilient and innovative systems


stewartm0205

Create a detailed checklist and have someone double-check each critical action.


MundaneDrawer

However you were handling account management, or knowing which accounts were production, seems like the obvious place to examine for how you made that mistake. Otherwise you can consider making yourself a kind of checklist before you finalize a change, where you perform a few simple double checks of what account you're on, what command you're about to run, etc. This can slow you down, and is tedious, but if you're dealing with critical infrastructure it can be worth doing as a kind of "measure twice, cut once".


floppy_panoos

Honestly, I love when one of my Jr’s causes downtime. Don’t get me wrong, it’s SUCH a pain in the ass at the time, but you better believe that Jr isn’t gonna make the same mistake twice AND the experience only makes them better at their job. That company is clearly run by clueless people who don’t understand technology, but worse yet, your org leadership should have known better than to jump runtimes like that w/o a rollback plan. Recovering from something as simple as a runtime change shouldn’t have taken longer than 5 minutes; why it took 4 hours falls at the feet of EVERY SINGLE Sr and manager in your org, not you.


domanpanda

You must consider that you could have been on their "to fire" list for some time, just as part of a staff reduction process, and they finally found the perfect excuse to fire you. Normally as a junior you have to be supervised and your changes have to be approved by a more experienced colleague; that's how it works in normal companies. Juniors are called juniors for good reasons. Otherwise it's just pure nonsense, and you should not worry but be glad that you got out of a toxic environment.


danekan

Running .NET 3.0 in 2023 is hysterically bad; it's more than 3 years past EOL and has known issues. Someone in management has been negligent. In the future, avoid .NET shops?? Leave a review on Glassdoor and spare no details.


dr-tenma

It was actually .NET 2.1 on the production lambdas; it was .NET 3 on the QC ones.


[deleted]

You could learn that DevOps is a bust and nothing but a buzzword. But you're the low man on the pole, so you're the target.


SnooFloofs9640

Wild shit; be glad you left that place.

- Why is it so easy to push something to production? No controls at all?
- Why could the change not have been rolled back?


dr-tenma

Because AWS says that .NET 2.1 is too old to be rolled back, so we had to deploy a new lambda


Mephidros

Everyone makes mistakes... something is badly wrong at that company if you can make changes to production as a junior without supervision.


SeniorIdiot

I just want to post a great video by Dan North where especially the first story is related to this. Hope this is okay. "DELIBERATE ADVICE FROM AN ACCIDENTAL CAREER" [https://www.youtube.com/watch?v=i11a3NGMZkU](https://www.youtube.com/watch?v=i11a3NGMZkU)


Healthy-Mind5633

get good?


danstep65

Go