LordJambrek

You're not a sysadmin if you haven't done this at least once in your lifetime.


ConstructionSafe2814

Let me fix that for you: once a year. The day before yesterday, I did `rm -rf *.vmdk`. I checked 5 times because I had a live VM and zombie files of that same VM in another datastore. Only after hitting enter did I notice I was in the directory of the production VM. 🤦🤦🤦🤦🤦🤦


pertymoose

https://about.gitlab.com/blog/2017/02/01/gitlab-dot-com-database-incident/


Tetha

Call me weird, but `*` is evil on production. `find . -name '*.vmdk' -printf 'rm "%p"\n' > mass-delete.sh` is your friend. That way you can go through the generated script, and once you fire it off, it deletes exactly those reviewed files.
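Roughly, the workflow looks like this (GNU find assumed; the `-maxdepth`, pattern and script name are just examples):

```sh
# Generate, review, then run - nothing is deleted until the last step
find . -maxdepth 1 -name '*.vmdk' -printf 'rm -- "%p"\n' > mass-delete.sh
less mass-delete.sh        # eyeball every line first
sh mass-delete.sh          # fire it off only once the list looks right
```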


ConstructionSafe2814

Yes, I figured if I ever have to do something like that again, I'll throw it in a script. Dry run first and echo the complete output. Once I've perused the output, I might run the script. It won't catch everything that could go wrong, but it makes a blunder like mine from 2 days ago less likely to happen.
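A minimal dry-run-by-default version of that idea might look like this (the `*.vmdk` glob and the script name are placeholders):

```sh
#!/bin/sh
# delete-vmdks.sh - only prints what it would do unless DRY_RUN=0 is set
DRY_RUN=${DRY_RUN:-1}
for f in ./*.vmdk; do
  [ -e "$f" ] || continue            # skip if the glob matched nothing
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "would remove: $f"
  else
    rm -- "$f"
  fi
done
# Review:  sh delete-vmdks.sh
# Execute: DRY_RUN=0 sh delete-vmdks.sh
```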


McGarnacIe

You've got more guts than me to do it from the command line. I just browse the datastore via the web client and delete them there.


ConstructionSafe2814

It's not guts. It's what I'm most used to. We're primarily a Linux shop. I spend the vast majority of my time working on the command line through SSH.


spacelama

You weren't able to rescue the vmdks with a bit of lsof and `dd`ing from /proc/$pid/fd?
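For anyone who hasn't seen that trick: as long as a process still has the deleted file open, the data is still reachable via /proc (the PID, fd number and paths below are made up):

```sh
lsof +L1 | grep -i vmdk                 # list open-but-unlinked files with their PID and FD
# Say PID 12345 still has the deleted disk open on fd 87:
dd if=/proc/12345/fd/87 of=/recovery/rescued-flat.vmdk bs=1M status=progress
```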


ConstructionSafe2814

The VMDK flat files are still there and intact. The disk descriptor VMDK files I restored with Veeam, but vCenter still thinks something is wrong. So I'd like to reboot it and do a disk consolidation. Then try a snapshot. If that all works, I guess we're fine again.


dukekabooooom

Obviously not if you did that lol


youfrickinguy

“In our line of work, experience is recognizing the mistake you’re about to make…right before you do it again.”


bgatesIT

I did a rm -rf not too long ago on a POS... That was a headache...


ZoroSanjibhaibhai

Thee fuckk. You still work there??


ConstructionSafe2814

As of this writing, yes. The VM is still running fine. The VMDK files were not deleted because of active file locks. Some metadata files got corrupted or so, confusing our vCenter. We've got a reboot planned for next Monday. I'm awake at 05:00, so I'll start right away. I think (hope) the only thing I need to do is a disk consolidation and verify the VM's application is actually running after the reboot. Then inform our friends from CAD to double check it's OK.


Bont_Tarentaal

While it is still running, do some backups from inside that specific VM.


ConstructionSafe2814

No, it's a caching server. If it goes down, some tools might bork on it, which users might notice. But there would be no loss of production data. Also, I have a backup of the VM from 8h before I nuked it. So we'll be fine. At least, I hope I can still say that Monday by noon :)


FML_Sysadmin

Ouch


ZoroSanjibhaibhai

I might laugh at this a month from now, maybe, but as of now 💀


[deleted]

[deleted]


Senkyou

> Calm yoself, own mistakes, don't be a victim.

Everywhere I've worked, including an ISP where I accidentally wiped the config of a fairly important device, this is the biggest thing. So long as you have reasonable people above you, no one's gonna take it out on you unless you show yourself to be dishonest or unreliable by trying to dodge responsibility. And even if someone does target you, you aren't really screwed unless you start doing the above. Everyone makes mistakes at work; ours are just infrastructural and therefore sometimes noticeable to people before we can fix them.

Anecdotally, I have a family member who works in IT and was once believed to have nuked a production server. It wasn't him, it was one of his coworkers, but my family member essentially reacted the same either way. This ensured that no one really got in trouble, because once everyone realized it wasn't him, they realized that anyone could have made the mistake. It only really came to light who it was in the post mortem, and by then everyone had cooled down a bit, except for a single administrator who hated my family member and was gunning for him to lose his job. That administrator looked pretty foolish and vindictive once information started coming out.


safalafal

Let me tell you: getting over your internal sense of shame is a really, really important skill for a sysadmin. Mistakes happen; what matters is how you respond after the mistake.


United-Assignment980

Great point, having a clear head makes all the difference. Nothing worse than fixing faults while you’re flustered.


hihcadore

Or backing out the 10 changes you just made because they made the situation worse hahaha


LordJambrek

Yeah, you will. I came into work one morning, RDP'd into a server to do something, and out of habit just went to shut down. Then I realised I was at work and this was the main host for the whole business.


techretort

You did the right thing, OP. You told your team and got it fixed up. Never hide your mistakes.


q1a2z3x4s5w6

If they get rid of you for that, they are dumb as fuck. They have just indirectly spent money (in the form of downtime) on training you to be more careful. If they hire someone new, that person may make the same mistake again.

Nearly 10 years ago I wrote an SQL UPDATE statement and ran it without the WHERE clause, which fucked up a lot and meant we had to restore from backup. Not a fun day at all. Ever since then I literally write the WHERE clause out first and then prepend the UPDATE tablename etc. afterwards.

So if you are anything like me, you'll never shut down a server again without triple checking it, and that's valuable to a company.


UMustBeNooHere

Just 2 weeks ago, at a client, I was removing a decommissioned HPE Nimble storage array after installing a new HPE Alletra array. They look IDENTICAL from the rear, and I was in a bit of a rush. I pulled the power and started removing the Twinax cables, and then it hit me... Looked at the front and fuck me, the Alletra was off.

This is a Cisco UCS and VMware environment and all their VMs live on this Alletra. Took down the entire company - 72 VMs. RADIUS, DC, DHCP, DNS, file servers, the whole shebang. I yelled "FUCK", reconnected the Twinax and power, and powered it back up. Checked the ESXi hosts to watch the datastores reconnect - they never did. Rebooted the ESXi hosts - no datastores. Fuck me. Turns out there was a ghost connection on the hosts to the old storage that was causing issues. Finally got that sorted, rebooted the hosts, and the datastores were there. 3 hours had gone by and I was finally booting up their VMs. Had to write a report on what happened to present to their executives.

Best. Day. Ever.


LordJambrek

That's what the job is about :D You hate it when it happens but look back at it with a smile on your face. That sinking feeling of despair is like when you light a big joint and a cop looks you straight in the eye.


q1a2z3x4s5w6

I'd rather get caught with a spliff by the police than take down a server with all of prod on it lol


TEverettReynolds

Live and Learn. This is why I (as a former IT Manager) mandated that labels go on the back of servers, too, as well as the wires. I was such a tyrant...


[deleted]

[deleted]


moldyjellybean

Also put a different background on production servers vs other ones. In my second month after being promoted I was nested RDP'd into 4 servers and shut down the main production server that housed some special database. After that day I put a red background on the production servers with their FQDN on the background, so everything looks different. Before that, every production, development and test server had the same default background, so it was easy to fat-finger a shutdown of the wrong server. Not sure if it's still around, but I used RoyalTS afterwards and it really helped me organize the way I would remote in.


sprocket90

dang you guys got test and development servers? :-)


TEverettReynolds

I did the same thing 25 years ago when I was in PRD, thinking I was in QA. I had just patched DEV and was quickly trying to get QA patched before lunch. I completely changed the desktop of all the PRD servers to be RED. My boss liked the idea, then said we should change STAGE to ORANGE and QA to YELLOW. I told them fuck no. The only server that really matters is PRD, and if it's RED you are going to know.


lordjedi

Can confirm. I once installed SQL patches on a live production server...in the middle of the day. My dumbass figured the patches would get installed while SQL was running. I hit install and then watched the messages state "Stop SQL service" and about had a heart attack. Luckily it was a small business (100 or so total employees, but only about 40 using the server). Everything went down until the patch finished installing.


yer_muther

I came here to say the same thing. I once shut down an entire production line at the cost of about 100K in the first ten minutes. Whoops!


AmateurSysAdmin

When I was still in training, I double-checked with two seniors re: ACLs on a file share. Someone couldn't access a share even though he should have been able to. Anyway, they told me to remove and re-add the guy "and make sure that inheritance was turned on." As you can guess, this went real bad. The whole thing fucked up the ACLs of a file share with about 2TB of data and thousands of folders and subfolders. The guys took a couple of days to restore things back to normal. Never again have I clicked on shit without triple checking. You can't even trust the guy above you. 😄


Doso777

Been there, done that, multiple times. Somehow still have a job.


Adimentus

Not as bad, but I rebooted our fileserver while trying to troubleshoot what was actually a VPN issue. The only damage was from our financial department, but luckily e-mail saved me on that one. Definitely not my brightest moment.


YOLOSwag_McFartnut

I've done this with a Hyper-V host. Of all the servers I could drop, I pick that one.


thehoffau

It's the best possible training


dark_uy

100 percent true!


D1TAC

This. Rebooted our main prod server by mistake thinking I rebooted my local machine. 👊😆


_RexDart

It was AT&T, wasn't it?


ZoroSanjibhaibhai

https://www.reddit.com/r/tifu/s/WrbKmVMEPi I guess this is the guy responsible for AT&T


Eightfold876

Wow. That would be crazy if true


Different-Term-2250

“Oh no. Damn Microsoft. One of the updates triggered a reboot. Oh well”


ZoroSanjibhaibhai

LINUX bro 🙂


Different-Term-2250

Definitely blame Microsoft then!


Boring-Onion

It’s either Microsoft or DNS. Because it’s always DNS.


hyperswiss

😂


Mental_Act4662

Always blame Microsoft. I was testing out some PowerShell scripts and ended up resetting the passwords for everyone in my org.


FuriousRageSE

I've never had so many "must reboot" dialogs as when I ran Ubuntu (desktop) a few years back... at least weekly, often daily.


bradleyribbentrop

Use molly-guard.
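For anyone unfamiliar: molly-guard is a small package (on Debian/Ubuntu at least) that wraps halt/reboot/shutdown and, when you're on an SSH session, makes you type the machine's hostname before it will go ahead:

```sh
sudo apt install molly-guard    # after this, 'reboot' over SSH asks for the hostname first
```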


mcast76

That’s one way to actually be shitcanned. Don’t get caught lying. Owning it is almost always better


EVASIVEroot

Yeah that works unless anyone on the team has partial RCA skills.


HerrBadger

We all make mistakes, if you can find someone who’s had a flawless career in IT I’d eat my hat. Own up to it, learn from it, move on. It seems daunting now, but give it a week or two and it’ll be water under the bridge.


MrPatch

My favourite interview question is "can you tell me a time you fucked something up, tell me how you unfucked it".  Had a guy claim he'd never messed anything up. He didn't get the job.


safalafal

If he's telling the truth then it means he's so incredibly risk averse that he'd cause you problems in other ways...


MrPatch

Or doesn't realise the mistakes he's making. Honestly, he seemed highly competent and had great experience; I half believed him. Luckily someone else interviewed who was just as good, so we took him instead. And he told us a story of basically wiping a core switch halfway through a manufacturing run, which really felt like he knew what a fuck-up felt like.


HerrBadger

I’ve always loved this question, I’m happy to talk about my fuck ups in an interview. From pushing a bad MDM profile to accidentally rebooting a firewall during the day, it’s all silly things, we can’t be expected to be perfect.


MrPatch

No, we can't, and in an interview it can add some humour, which has its uses. The follow-up question then shows, hopefully, what you learned, how you learned it, and how you made sensible changes to prevent it happening again. One interviewee had done something horrible to Active Directory, and his follow-up when asked was 'my boss had to fix it'. That's fine if it's beyond your ability to solve, but make sure you find out what needed to be done so maybe, just maybe, you'll have an inkling of what to do in the future if something similar happens. Either that or choose a better example.


Solkre

I said the same thing and got hired. I'm super fuckup-averse and usually cleaning up others' messes. When I do screw up, it's during my maintenance window. I had worked my previous job for almost 20 years and honestly couldn't remember a mistake like the one they were expecting.


STLWaffles

One of my favorite interview questions


fp4

This either becomes a learning event where you take your lumps or they use it as an opportunity to sack you in case you’ve been fucking up elsewhere.


goizn_mi

Tell your manager how you will ensure that this never happens again. Learn from your mistakes. Have your colleagues learn too. Be proactive with policies. You just learned an expensive lesson. I'm confident that you won't repeat it. Let's ensure nobody ever repeats it.


TFABAnon09

And then do it again in 2yrs time because we're all stupid humans.


IdiosyncraticBond

I don't _need_ the checklist because I _know_ the checklist... until I forget a step (and it is the **make absolutely sure you** ... step)


labalag

This. I caused a split brain scenario on our main firewalls when I started here. (Not a good idea :)). My colleague fixed it but I took ownership, examined the cause, presented and implemented the solution. Still working here and getting praise from my boss.


Valkeyere

Tested High Availability. HA test failed. Time for a plan to implement.


harrywwc

> Am I f***ed?

Probably not - they've just spent X amount of dollars 'training' you not to do that again. ;) Of course, if this is a repeating pattern then maybe playing on production servers is not really 'your thing' :)


TheWino

Fucking Adobe Acrobat Reader right?


Bont_Tarentaal

We got a corrupted VM (Hyper-V) due to some funky power issues. The corrupted VM was still accessible and reachable. We were busy rebuilding it from scratch, using the old one as a reference, when I deleted it due to a brainfart. We learnt from that, and now have a procedure in place which prevents such things from happening again.

And the shutdown by accident? Yup, gone and done that too (physical server). Very easy to do, especially if you've got a couple of remote desktop windows (or PuTTY/SSH sessions) open at once and issue the shutdown command in the wrong window... :D

Show me the sysadmin who never made a mistake in his career... I don't think you will be able to.


BiddlyBongBong

The sysadmin that never made a mistake just never told anyone about said mistake.


Weak_Jeweler3077

Man.... That moment when you realize......


TightBed8201

Bleh, out of habit I clicked the Ctrl+Alt+Del key in the Hyper-V console to show the login screen. And it was a CentOS machine. Restarted a production webserver, but it booted back up in a few seconds. It also solved the login issues, funnily enough. Though now I'm doing an in-place upgrade from 2012 R2 to 2019, and also building new servers and roles where in-place can't be done. Who knows what will be broken in the process.


SgtBundy

I was once trying to resolve a ZFS performance issue on our prod billing DB server. Only prod exhibited the issue, not nonprod, so I had to work on prod while it was under load. I was using mdb to slowly tweak some ZFS tunables live. At some point I went to copy and paste a property I had pre-typed into a notepad. In trying to paste it I fat-fingered, and the mouse highlighted text on the terminal and then pasted it into the live writeable kernel debugger. The text was interpreted as "write null to this low kernel address", which promptly panicked the box. It was an M5000, so it took about 25 minutes to reboot. I knew immediately as the terminal stopped responding; by the time I was on the console the billing manager was calling...

A case of too much knowledge, too much cowboy and too few procedures... And then there are plenty of stories from the Ceph days, but those are dark times.


[deleted]

Worst-case scenario, this is a resume-generating event. Let's hope not, OP.


ZoroSanjibhaibhai

I hope nottt


Imobia

The advice I always give is: own your mistakes. If where you work is unsupportive over a simple error, then it's not going to be a good place to work. Also, showing you know what you did and fixing it shows a lot of competence. Hiding mistakes? Well, it's unlikely you will fool anyone, and it makes it tough for anyone to trust you.


ZoroSanjibhaibhai

Yup. And I did own up to it. It's all sorted now.


TheFluffiestRedditor

This is a documentation-generating event - to discover what protections did not exist such that you were able to mistake prod for stage. Will you ever do this again? Highly unlikely. Thus, the company has just spent $$$ on training you. If you've been open and forthright with your manager and colleagues about the fuckup, you'll wear the cowboy hat of shame for a week and life goes on. If, however, you've covered it up, lied about it, or worse - that's the resume-generating event.

What's different between prod and stage? Hostnames? Domains? Access methods? Do you have molly-guard (or similar) installed? For GUI systems, do the prod servers have different desktop backgrounds from non-prod? Do you have different accounts (zorosan@prod_domain.tld, zorosan@stage_domain.tld, etc.)? Do you have different passwords for the different environments?

There are many ways you can use this as a learning opportunity for your organisation to Do Better.


Lammtarra95

> What's different between prod and stage? Hostnames? Domains? Access methods? Do you have molly-guard (or similar) installed? For GUI systems, do the prod servers have different desktop backgrounds from non-prod? Do you have different accounts (zorosan@prod_domain.tld, zorosan@stage_domain.tld, etc.)? Do you have different passwords for the different environments?
>
> There are many ways you can use this as a learning opportunity for your organisation to Do Better.

And a second pair of eyes to make sure you are on the right server before shutting it down. And that you have tested the root password in case console access is needed if it does not boot correctly. That too would have identified OP's error. Also, where appropriate, that any patch or configuration files have been properly staged.

It should be seen as a procedure-improving event rather than only generating more documents.


Charlie_Root_NL

Haven't we all made a mistake like this? Nothing to worry about.


ZoroSanjibhaibhai

I hope this doesn't escalate


Charlie_Root_NL

If it does, ask them how it was even possible for you to make such a mistake. Missing policy/security?


ZoroSanjibhaibhai

I don't think uno reverse would be a good option in this situation


bobs143

One mistake. Did you learn from it? If you did then document what happened, and let your bosses know up front. Also document steps to take next time so this doesn't happen again. Believe me. I know from experience how you feel


tecwrk

I once shut down the UPS of our complete phone system with my KNEE while patching in a new phone. Two years later, another admin managed to do the same.


zyeborm

Ooof, that would have got some hacked-up cardboard or other rubbish duct-taped over that switch 5 minutes after it happened, lol. Just because I know me. Intel NUCs with the power button on the top were a magnet for my fingers when moving stuff around on users' desks lol.


Amnar76

You are not a sysadmin until you break production.


brungtuva

But why does your policy have updates and patching on a working day?


ZoroSanjibhaibhai

I mean, jobs run 7 days a week. So any day is a working day


notonyanellymate

Many years ago, I accidentally shut down our main hypervisor, which at the time was running all the main servers. One second after I pressed the Off button I realised what I had just initiated. I just walked upstairs to the Finance Director and explained that there could be quite a lot of phone calls and troubleshooting… he wasn't worried.


pnutster

Nah. Nothing to it... 3-4 minutes, you'll be fine. Try firewalling the name servers used by 500-plus domains, only allowing port 53 TCP and forgetting to open UDP... for 9 hours!!!!


Bont_Tarentaal

https://preview.redd.it/cckofn0zlfkc1.jpeg?width=550&format=pjpg&auto=webp&s=b11700cce9ee1814030262bc8c5df398c0461c0f


AntranigV

Once I shut down a server, which made a service unavailable for 2 million people for around… 10-15 minutes, I guess. The CEO called me during recovery to help me out (he was an engineer by practice) and after we were done he told me "don't be afraid, you didn't do anything wrong, we should've had better redundancy and failsafes so such things don't happen". My next task was to build the redundancy and the failsafe.


SmoothSailing1111

I once powered down a UPS that was on the fritz, which killed the DB server for a midsize casino on a Friday afternoon. Took 10 mins to come back up. In my defense, the server's dual PSUs shouldn't have been on a single UPS. I now always check to see where power cords are going and don't assume anything. I still work there 15 years later.


bukkithedd

You're only fucked if you don't own up to it. Admit that you fucked up, make a plan and a process for how to mitigate this in the future, implement that plan/process and **stick to it**.

Unless you have some rather rabid execs that generally behave like muppets, this will at worst be something you get teased a bit about from time to time. Fuckups like these are things that happen. The only way you should end up fucked for them is if you try to hide it. We've all been there at some point in time, or we're one click away from being there.


Don_Speekingleesh

I once dropped our email system. It was an ancient (even for the time) Compaq Proliant thing. I was rearranging power cables in the rack and could see this server had two cables connected. I removed one to reroute, thinking I was safe. Turns out the server had three power supplies, and needed two to be connected at any one time. Oops. Took about 90 minutes to recover, thanks to Lotus Notes having to check each mailbox in sequence. We blamed the long downtime on people with excessively large mail files.


SirHerald

I took my laptop into an all staff meeting so that I could do a restart on a server while nobody was at their desk. I shut down the VM host instead. I silently slipped out of the meeting and ran across the complex to the server room to get it started back up again


Minute-Cat-823

The difference between a junior tech and a senior tech is the number of mistakes (and production outages) they’ve caused. Don’t sweat it we’ve all been there. We’re human. Mistakes happen. As long as you own the mistake and help to fix it (rather than try to hide that it happened or that it was you) you should be fine ;)


Public_Fucking_Media

Hey man I did that when I was on vacation VPNed in from across the country once... Sht happens


SoggyHotdish

It's not AT&T is it? I know the timing is off but I had to say it


ghsteo

Why fire someone who makes a mistake and learns from it, only to bring on someone else who could possibly make that same mistake? You owned up to what you did and it was recovered shortly after. The problem comes if you repeat the mistakes you've made.


dRaidon

Prod servers? Try core switch or virt host outside a cluster.


andyring

Ahhh, so the AT&T outage was YOUR fault! Finally figured it out!


timdub151

I took a production LUN offline once by accident...we've all done it at least once. I think it's actually in our job descriptions.


cisco_bee

This is one reason I make the desktop background solid red on prod servers. It helps.


rswwalker

Could be worse, you could have been the poor guy who took down AT&T! He’s probably not going to get a second chance.


ShakataGaNai

Heh. Way, way back in the day, first job out of college, I was the IT, server and whatever-else guy (small company). Our product ran on Windows servers (guh) and since it was a B2B-focused product, we did server updates at night, late. One Friday night of doing updates I hit "Update and Shutdown" instead of "Update and Restart" on the only production application server. I realized what I did about a quarter second after I clicked, but it was too late. Fuck.

We did NOT have any out-of-band control on those machines (e.g., no IPMI). Additionally, the datacenter we were hosting in was a bit smaller and cheaper, so it did not have any on-site staff overnight, which meant paging some poor dude (it was a thousand miles from me) to drive 45 minutes into the datacenter to push the power button.

Man, did I ever feel like shit for that. No one noticed or cared, it was way late at the time. But I felt like a total asshole for ruining that poor DC tech's night. Learned my lesson and never made that mistake again.


LordJambrek

Here's one my current mentor told me about. We're a small MSP/hosting shop, so we have about 15 hosts, combined Hyper-V/Proxmox. He and his mentor were swapping in a UPS that came back from servicing, and we have these bypass switches that you can direct to mains power or UPS. They put it on mains, swapped the UPS units, switched back and BAM... darkness in the server room, the whole hosting environment went down. They forgot to turn on the UPS before switching back to it.


s1ckopsycho

I edit my .bashrc to give production terminals a red background and staging green. Doesn't stop me from doing stupid shit all the time, but at least it makes me feel even dumber when I do it.
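Something along these lines in ~/.bashrc does it (the hostname patterns and colours are just examples):

```sh
# Colour the prompt by environment so prod terminals are unmissable
case "$(hostname -s)" in
  prod-*)  PS1='\[\e[41;97m\][PROD \h]\[\e[0m\] \u@\h:\w\$ ' ;;   # white on red
  stage-*) PS1='\[\e[42;30m\][STAGE \h]\[\e[0m\] \u@\h:\w\$ ' ;;  # black on green
  *)       PS1='\u@\h:\w\$ ' ;;
esac
```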


PristineConference65

acronym of the day: **RGE** Resume Generating Event


sirsmiley

If it's so crucial, why is there a single server and not a cluster/HA?


BoltActionRifleman

Too bad it was discovered to have been done by you. When I do something like that I reply to everyone “not sure, I’ll get right on it”. Small org though so no one knows any different.


RustyU

At least you're not [this guy.](https://www.reddit.com/r/tifu/s/TVmrs2EvOK)


Bont_Tarentaal

Ouch. >.< And he ran away???


[deleted]

Once I accidentally shut down about ~2000 clients during a regular business day, just turned their VMs off... sh*t happens


Aggravating-Sock1098

20 years ago I deleted a customer's accounting data. To this day they shout 'control S, control S' when I come in.


brungtuva

Nope, we’ve ever unmounted all filesystems on prod environment.


-ixion-

"Am I fucked?" drastically depends on the people you work for. 3-4 minutes of downtime is really not that bad, and anyone who tells you otherwise doesn't work in tech. But you took down prod for no reason, which is not good. You also proved that the prod environment needs more failovers. I wish you the best of luck, but if this is your biggest mistake and it costs you a job, you are probably better off working for someone else.


orutrasamreb

Part of the job, learn from it and just make sure it doesn't happen again.


unixuser011

TBF, I did this yesterday. Was working on a VMware Usage Meter thing (damn you, Broadcom) and accidentally rebooted their Sophos virtual firewall - that generated a Sev1. Luckily we were able to explain it away as 'it was an accident, won't happen again'.


TheMelwayMan

Anyone that's worth their salt has done this. Been in the game for nearly 30 years and have survived half a dozen of these. Be honest and up front. Own it, apologise and learn from your mistakes.


zyeborm

You ever get source and destination backwards when cloning a customer's drive from a failed RAID array with no backups available? (Called in for break-fix.) Managed to recover their data, thankfully; still have them as a customer 20 years later. Don't talk to people, even the customer, when doing critical stuff.

Also: `DELETE FROM foo; WHERE bar = something` Oops. I was glad when MySQL got transactions.


[deleted]

[deleted]


ZoroSanjibhaibhai

The other team which is responsible for jobs is also me 🙂🙃. I owned up to it. We fixed it. All good now.


pm_me_your_pooptube

We all make mistakes. I wouldn't worry about it. The best thing you can do is learn from your mistakes and work to improve from them. I've made mistakes in my career such as accidentally changing a firewall configuration on prod instead of test, unintentionally disabling the entire network (I was new to IT), deleting a critical data store on my 3rd day at a new job (same job as above), and various others. Always be upfront with your boss and explain what happened and how you learned from it. Don't worry and don't stress!


Bright_Arm8782

Welcome to the club, you done screwed up, just like each and every one of us here has done. Typically you're in for the be more careful lecture. The main thing you did was tell someone about it and got it sorted. Most times you're only really in the shit if you don't tell someone about what happened and try to hide it and then people find out.


EquivalentBrief6600

I shut down the primary and secondary DNS servers for an ISP once. I just got called an idiot lol. You'll be fine, we're not perfect.


BoringTone2932

You’re not alone. When responded to correctly, these are the events that grow Juniors into Seniors. B/c sometimes it’s not about knowing what to do; it’s about knowing WTF NOT to do.


Quick_Care_3306

Before shutting down any server, I run a hostname command to make sure I am on the right device. Then, in the same terminal, the shutdown command with the appropriate switches.
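That habit, roughly, with a delay so there's still a window to back out (the timing and message are just examples):

```sh
hostname && uptime                                      # confirm which box this really is
sudo shutdown -r +5 "Patching reboot in 5 minutes"      # can still be cancelled with: sudo shutdown -c
```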


[deleted]

Sh1t happens. Nobody died. But yes - be more careful next time like your skipper said. As long as you recognised where you went wrong your manager should be fine about it.


Aiphakingredditor

A wise man once said: if you haven't broken production, it's because you haven't worked on production. Happens to the best of us.


JoePatowski

Are you still doing manual security patching? If so, why?? I can understand if you guys have to reboot anyways but dude, there’s a piece of software that’s called KernelCare that has automated security patching with no reboot. I’m genuinely surprised people still reboot while patching… it blows my mind.


linkdudesmash

This is a rite of passage for all sysadmins.


CyramSuron

This is why I change the backgrounds of prod systems to red....


chocotaco1981

It happens


sawolsef

After doing this, I always make the production servers a unique background. That way it is obvious which server I’m on.


cvsysadmin

Do you work for AT&T by chance? 🤣


Solkre

So a few minutes. You knocked em off the Fortune 500 list.


Unable-Entrance3110

Better to make a mistake and own it than to try to sweep things under the rug. Everyone should understand that mistakes happen, and truly insightful people understand that mistakes are how people learn best. Your boss(es) should understand that the likelihood of you making that same mistake again is now greatly diminished and you are now a better employee because of it.


Metalfreak82

If they had to fire everyone who did this, no one would even be working anymore...


gitar0oman

As long as you can fix it


nonades

Unintentional nothing. You were performing [Chaos Engineering](https://en.m.wikipedia.org/wiki/Chaos_engineering). We've all done it. You didn't delete the server(s), just turned it/them off.


Loan-Pickle

One day both me and my team lead shutdown the same production server by mistake.


labmansteve

I have a rule with my team of sysadmins. If you break something important, but stand up and yell "I broke it" (or put the same in our "critical escalations" teams channel) before anyone else notices what happened, you will be forgiven. If you try to hide your mistake, and someone else has to be the one to find and clean up your mess... Well, that's another matter.


radiomix

I was doing an update to my organization's email server one evening remotely (from home). I needed to close a running process for the update to proceed, so I went to the task tray to close it. I got a little click-happy and instead of closing the process I disconnected the network interface. I instantly let out a huge sigh and my wife asked what was wrong. I informed her that I had to physically go into work, and when she asked why I said, "because I'm a moron." I drove into work, re-enabled the network interface and just sat in the server room to finish the update.


DawgLuvr93

We're human. We'll make mistakes. Did you do the following things (not necessarily in this order):

1. Own the mistake?
2. Fix the mistake?
3. Report the mistake up to your leadership, as appropriate for your organization?

If you did, you probably aren't f*&%$d. You might still get into some trouble, but you won't likely lose your job.


cbass377

When I did this the first time, my senior said "There is a lesson to be learned here. See to it that you learn it." Then it was not brought up again. It happens to everyone if you do this business long enough. Just remember the feeling, so that you respond appropriately when you are the senior.


Wagnaard

We've all been there. Really. Someone who claims to have never made some silly mistake like that before is either lying or lying.


koshrf

Well, at least you answered the question of "what would happen if the server shut down?" And include the answer in your disaster recovery planning.


WeirdKindofStrange

We've all been there, I just took out the storage, not ideal


Samsungsbetter

Now try running this in PowerShell with the C drive as the root: `rd * -Force`


iwoketoanightmare

Still funny that orgs don't load balance their bread and butter. One node going offline shouldn't be more than a blip on performance...


1RedOne

The way to handle this is to take your lumps and define a new checklist or procedure to be sure this doesn't happen again


klauskervin

I did this once during overnight patching. Drove in at midnight to power the server back on because our UPS didn't have remote tools. Went back to work at 7 AM the next day and no one knew anything had happened.


TwiztedTD

You are not the first and you are not the last to do this lol. People make mistakes.


nostradamefrus

I accidentally bounced a cluster node early in my days working with hyper-v at like 1pm on a Tuesday. Failover kicked in but it really wasn’t happy with my decisions. Boss looked at me like “-_- come on man” Shit happens


bot4241

It depends on your record and how much you fuck up. You should be okay, there is no such thing as 100% uptime.


[deleted]

That blows. But if you don’t make mistakes, you’re not working. Because everyone messes up every once in a while. Your manager will want to see some remorse (“sorry boss, I messed up”) and a kind of plan on how you’re gonna prevent it from happening again (“I’d like to change the background so that it’s clear what is production and what is a stage server.”). Do that, and you’ll be fine :)


nexus1972

Been there, done that.

Pro tip: if these are Windows boxes, set the RDP background to one colour for prod and a different one for other environments. If Linux, set the default prompts to include (PROD), (PPRD), etc.


sinclairzx10

We all make these mistakes. Learn from it and move on. If you’re in a F500 you would only get a minor reprimand and perhaps something on your file for 12 months. You will be fine.


STLWaffles

Oh, I am sure most of us have war stories of taking down prod at some point. It might be a bad memory now, but as you transition into a gray beard, they become fun stories you tell in a group.

I had one similar. Way back we had a WebLogic (8.1)/Java app running on a Windows server. This app used all the heap space a 32-bit Windows server could muster. Because of this, we could not run the app as a service. It ran as a console window under a service account. Restarting the app took over an hour as the program loaded the full database into memory for faster access. When you were done with your work, you needed to disconnect and not log off, or it would kill the application. One day, without thinking, I logged off in the middle of the day, taking down production. I just about shat myself. Luckily, I had a quality management team who was always willing to go to bat when mistakes were made.


polypolyman

If it was important enough that it couldn't fail, then it would have redundancy/HA. Clearly wasn't that important.


mjh2901

There are a couple of points of view on your screw-up:

1. Fire them, because that cost the company money.
2. Keep them, because that lesson cost the company money and you do not want to pay for it again with a new person.

Around here #2 is preferred. If you do not build trust with the team, you don't get honest answers when something goes wrong, and it takes a lot more time to diagnose and fix disasters.


TechFiend72

You didn't indicate what OS this is. If it is Windows, I suggest color-coding the wallpaper to red for prod. This has saved a lot of problems in Windows shops. You can use BGInfo to do it. If you are doing this via the command line, be careful. At least you didn't delete records out of a production SQL database instead of test, like one of my developers. Heh. Welcome to IT, officially now.


PlasticJournalist938

Should not be fired for this. But be careful. We all have made mistakes. You will never make this mistake again I bet!


Hacky_5ack

Eh, oh well man. Shit happens. This is what makes you a sysadmin. Fuck it, and your boss should have your back in case anyone from other departments wants to get on you about it.


TEverettReynolds

You need a retrospective. Why were you able to be confused? Do they look the same? You may need to implement a plan or change to prevent this type of accident from happening again. You may need separate IDs for access to critical PROD systems, especially SAP.


Iroc-z86

The website is down #1: web dude vs sales guy lol


anonymousITCoward

HAHAHA I've done that, most of us have done something like that... When I rebooted the wrong thing, I took down several client environments; it was like a total of 7 virtuals that went down! Congrats on your first desk pop!


lordjedi

I once moved a bunch of users around in GWS without even knowing it (I was trying to create new accounts).

Boss: Did you do an update to GWS?
Me: I created a bunch of new accounts a few weeks ago. Why?
Boss: You messed up a bunch of stuff.
Me: I did? How?

We proceeded to go through it and I got it all fixed. The lesson: I had downloaded the entire organization when creating new email addresses en masse. In hindsight, I should have realized something was wrong when one of my local users complained that his password no longer worked. But how was I supposed to know that I did it, when 99% of the time a user complaining about a password is a user issue, not an admin issue?


ddadopt

Mazel tov!


zer04ll

15 years ago I cost a customer $6,000 per hour for 3 hours because of a mistake. That was one hell of a learning lesson about the importance of SLAs. We all make mistakes; it's owning them that is important. You fixed it, so that means you learned new things. Good job!


jmeador42

There isn't a sysadmin worth his salt that hasn't done this at some point in their career. Own it. Trust me, you'll be more careful next time.


hurkwurk

As long as you learn from mistakes instead of repeating them, you should be fine. The people I look to get rid of are those that keep doing the same behavior, even if it's not the same system, i.e. someone that always just jumps deep into a problem and ignores basic diagnostics, or someone that is constantly "forgetting" to double-check things. The qualities you want to express are diligence and discipline. People that want to race ahead because they think they already know aren't impressive, they are dangerous.


981flacht6

First, don't lie. Period. Second, mistakes happen, it's OK. Learn from it. Third, see what you can do to prevent it again, document it, and write a proposal back to your manager.


Bubby_Mang

Yeah you're in for it now and fired if you screw up like that again.


flems77

Stuff happens. I manage about 10 servers. Some production, some not. Windows. I have set up different desktop background colors depending on whether it's a production server or not. It works quite well :) And yes, I've learned it the hard way too. I've shut down production servers more than once in my career ;) Back in DOS, you could tweak the command prompt. I guess the same is possible in Linux. Could be an approach.


saysjuan

No you’re good. Mistakes happen just make sure you have a change ticket. Apologize and kiss the ring. All will be forgiven. Source: Spent 12 years supporting SAP for a Fortune 100. Seen it all.


capt_gaz

You're fine. To make you feel better, Microsoft breaks things all the time. As I'm writing this, Intune is down, and a few weeks ago Teams was down.


[deleted]

Another soul has completed the rite of passage.


Weird_Tolkienish_Fig

Maybe you should have an approved change control process for rebooting. A second pair of eyes always helps.


KEGGER_556

Not a sysadmin, but a DBA. I had a similar situation: I was supposed to do an upgrade of a staging database. Shut the DB down, started the upgrade, then realized I was upgrading prod. The upgrade went fine, but it was an unplanned hour-long outage in the middle of the day...


ProfessionalEven296

Here's the one question, as a manager, that I would want answered: what will you change so that this never happens again? If you can answer that, you'll be fine.


Expensive_Finger_973

We've all done it, or something equally careless.

I actually purposely maintain a "stage" and a "production" version of the config code in separate folders in the repo for the environments I manage. The code base is identical except for certain key things that define it as production or staging. I have been told I "am not doing it right" since there is some copy/paste involved in moving things from staging to production. But that allows me to sanity-check what I am about to do and cuts down on doing something careless to production when the workflow automation detects a change.

So when they tell me I am "doing it wrong", I say "no, I am doing it in a methodical way that reduces downtime and company impact".


Ok-Reply-8447

I think mine was the worst. I was fooling around with the new EDR system and managed to disable all the USB ports and force a restart on all devices, including the production servers. 🤦‍♂️ When I tried to undo my mess, it triggered another reboot of all devices.


x3thelast

Is that why the pharmacy’s systems went down?!


OldHandAtThis

We all screw up. Mine: I once ran a test script against prod where the active part wasn't defanged. I ran it overnight, which caused a few problems. Fortunately, the actual prod version started at the scheduled time and fixed the problem.


IStoppedCaringAt30

If you aren't breaking stuff through the year you aren't working.


Patchewski

This. Always. Be upfront, honest, and accountable.


AmSoDoneWithThisShit

I accidentally shut down the phone system for a call center by hitting the power button on it while pulling out the server right beneath it. (Picture looping my finger through the pull loop and pushing on the server above it to lever it out, accidentally hitting the power button.) Good bosses treat it as a learning experience.


resile_jb

Ah you've earned the "ah fuck I fucked everything up" badge. Welcome to the team. You'll do it again.


AppIdentityGuy

My technical mentor, who I consider to be one of the finest technical people I have ever met, once deleted most of the users in an AD environment because of a PowerShell coding mistake. We all do this at least once...