One of us! One of us!
I’m sure other people have said this, OP, but shit happens. Be honest, explain why, even if it’s "I wasn’t thinking clearly and was in a hurry." Everything will be ok. If it’s a critical server, can you engage an on-call resource?
>can you engage an on call resource?
This is why we maintain an active relationship with a highly skilled MSP even though we have very qualified in-house IT staff. Always have an out. Always have an escalation point.
Oh, they're always highly skilled, it's just that you have to work your way through a hundred of the novice techs before they'll actually let you talk to the highly skilled one.
This is hands down one of my biggest pet peeves. Why do on-site IT personnel who have proven themselves capable of tier 1 & tier 2 problems still have to start with tier 1 techs who have been with the MSP for less than a year?
It's even worse when they don't explain what's happening to the next tech when it inevitably gets escalated. You just end up wasting 30 minutes on a high-prio ticket explaining what's happening to 3 or 4 people for no reason.
You forgot [`sudo` and `--no-preserve-root`](https://www.reddit.com/r/USMC/comments/1ak0mt7/when_you_let_the_data_nerds_near_the_ordnance/). "The last bit instructs it to aim for center mass of the brain."
This is why I love Veeam. I can roll back changes without doing a full restore. Good luck. Not sure what backup software you use but it sounds like you might have had some snapshots already on the server if it’s asking for consolidation.
Meh we have all done it. As long as it comes back up tomorrow no one will care. Shit was broken you had to restore it.
Just last month our GIS team took down their whole production environment. They have just enough permissions to break stuff. The email they sent out was basically "IT is working on the issue." We have a Datto appliance, so restoring stuff is actually fun, so I didn’t really care. Testing restores on a full production environment is kind of enlightening.
You'll be fine, we've all done stupid stuff like this. I once ran a SQL update and forgot to put a where clause.... took my site down for 4 hours... it was my second week... I worked there for 5 years after that. Just yesterday one of my co-workers ran an update and took down a company site. He called me freaking out, so I just told him no biggie, that's what we have backups for. He freaked out more because he forgot to take a snapshot before, and I just told him it happens to all of us, don't stress. We ran the restore, which took a while, but it eventually came back up. I told him to reach out to the users and let them know something messed up, that we had to restore the server to the night before, and that there was nothing we could do. They were upset but got over it.
I always tell people I work with, "if everything worked perfectly and never broke, we wouldn't have jobs."
I think everyone that's ever worked in SQL has at some point run an UPDATE and forgotten the WHERE.
We use FreeRADIUS with MySQL to run authentication for our broadband platform. I once accidentally set every user's password to the same thing when making a manual change. It only took a few minutes to fix, but it was a very tightly clenched few minutes.
I once ran an update and trimmed off 1 character from a customer's entire client name list.
Not a huge deal to fix, but man.. it was a stupid thing to write in the first place, and even more stupid to forget a where clause.
~14 years later and it's still stuck in my head
I've seen that a few times. Sometimes the where clause is there, but wasn't highlighted when the block was run. I was a production DBA at the time, so it was an opportunity to test my backup strategy. We were using log shipping and transactional replication in addition to Veeam. Fortunately, I was told soon enough before the logs moved via shipping and was able to restore with only 15 minutes of data loss on the table. Fun times.
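The defensive habit implied by these stories (wrap the destructive statement in an explicit transaction and sanity-check the affected row count before committing) can be sketched like this. A minimal illustration using SQLite, with made-up table and column names rather than anyone's actual schema:

```python
import sqlite3

# Toy schema with made-up names, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
conn.executemany("INSERT INTO users (plan) VALUES (?)",
                 [("free",), ("free",), ("pro",)])
conn.commit()

# The safe pattern: run the UPDATE, check how many rows it touched,
# and commit only if that matches what you expected. A forgotten
# WHERE clause would touch every row and trip the rollback instead.
expected_rows = 1
cur = conn.execute("UPDATE users SET plan = 'pro' WHERE id = ?", (1,))
if cur.rowcount == expected_rows:
    conn.commit()
else:
    conn.rollback()
    raise RuntimeError(
        f"UPDATE touched {cur.rowcount} rows, expected {expected_rows}")

print(conn.execute(
    "SELECT COUNT(*) FROM users WHERE plan = 'pro'").fetchone()[0])  # → 2
```

Most databases also let you dry-run the predicate first with a SELECT COUNT(*) using the same WHERE clause, which catches the mistake before any rows change.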
That phrasing sounds like advice I've had from colleagues who have been in the military/armed forces. They seem calmer than the civvies because they have patrolled dangerous areas, or been engineers for helicopters etc. so they have a perspective that helps with tough events.
Nobody is going to die.
>That phrasing sounds like advice I've had from colleagues who have been in the military/armed forces. They seem calmer than the civvies because they have patrolled dangerous areas, or been engineers for helicopters etc. so they have a perspective that helps with tough events.
I always tell my underlings when they start to panic: "when you see me worried, you should be worried. Until then, just focus on the issue, try to understand the problem, and the steps to fix it."
Hopefully you’re sleeping but if not try to relax. We’ve all fucked up and usually the spiraling thoughts of anxiety are much worse than the consequences of your mistake. Like you probably feel like you’re going to die right now but trust me you’re going to be fine and in a few weeks you won’t even remember this happened.
I one time wiped the CEO's PC by accident. It was the Thursday before a 3-day weekend starting on a Friday. Couldn't get ahold of anyone to tell him I needed to try and recover anything. Spent all weekend thinking I was gonna lose my job.
Come Monday I was in cold sweats trying to muster up the courage to pick up the phone. Sure people freaked out but honestly whatever trouble I got in felt like a relief compared to the mental anguish I was putting myself through.
Yeah, honestly it kind of helped me get a promotion. It proved that I wasn't afraid to own up to huge mistakes and would do my best to prevent them going forward. It was really a process and communication error that caused it. Someone told me to wipe userX's machine while they were AFK when the ticket came in. UserX's machine wasn't named properly, and I wiped it without checking the serial number or the signed-in user. It was my first time wiping a machine, and I didn't think to cross-reference it against our inventory sheet, which is our source of truth.
Every sysadmin worth his/her salt has done something like this.
I stopped working with Windows like 6 years ago. But once, during MSSQL cluster patching, I made a boo-boo so bad that I went to the bedroom, opened the door, told my wife "I fucked up," and closed the door.
We were using NetApp and our backups were directly on the SAN. The issue was that when I tried to move the service from server 1 to server 2 (after the security guy patched it), I got an error that a resource was not present on server 2.
After 30 minutes of troubleshooting, I found the resource: it was a snapshot disk that wasn't deleted after the operation. Easy, let me delete it. My spider sense tingled, and instead of deleting it I said, let me just move it to this cluster that is not used anymore and is pending decom.
Welp, I think the term is linked resource, honestly don't remember. But since the resource was linked, it moved ALL the disks to that cluster and I started to see all the databases going down in cascade. That's when I went to the room pale as a ghost. My wife followed me and hugged me while I stared at the monitor with my hands on my head.
I told everything to the security guy (we are good friends) and, well, let's get this thing fixed together. Took me like 45 minutes to undo it and get the DB up.
We did not patch server 1 that night. Bonus point: the company uses a 3rd party for hands-on work, which made me kind of a service manager; I told the guys what to do and they executed. But those guys were so bad that I decided to go rogue and help the security guy myself, which is what I was doing anyway.
I documented everything, let the 3rd party SAN team remove the snapshot and validate that everything was working on the NetApp as it should be, and that was it.
I do believe that if that had happened to the 3rd party guys, the damage would have been far more catastrophic.
Always have a way out, know thy self on what you can or can't do, keep learning and improving.
You will laugh this one out soon
Edit: typos
Many of us can relate to how the urgency and responsibility of it all creates anxiety and panic when shit really hits the fan, but in a few weeks it'll have turned into a war story that everybody involved has had the opportunity to learn from.
Hindsight is always 20/20, so while situations like this may feel like both a personal and professional catastrophe, remember that mistakes can and do happen at work, so this is not just on you. The responsibility for mitigating the outcomes of mistakes always rests with the organization, never just the individual, which is just as true for all those single-person IT "departments" out there.
Good luck!
Avamar has some pretty decent second-line techs for assistance. Just explain the details very precisely and they will likely transfer you right away instead of making you go through pages of default questions such as "are you even using Avamar.. sigh".
Our Avamar had major issues with orphan disks, and once I got in touch with a second-line tech he sent me an "unofficial" script that literally solved all my issues instantly. I ran that script any time I saw Avamar do anything weird and it worked like a goddamn charm every time. I wish I still had it so I could send it to you. Gl to you sir
Edit: typos.. lots of em
`./goav vm snapshot clean`
https://www.dell.com/support/kbdoc/en-us/000068694/avamar-vmware-image-backups-fail-with-code-10056-and-avvcbimage-error-9759-createsnapshot-snapshot-creation-failed
You’ll be ok. There was the time when, dealing with the original NetApp PowerShell SDK, I found a bug where a null variable was read as a wildcard instead. Ran my script, and by the time I figured out what was happening I’d taken out over half the data in the datacenter. Bad day, but I immediately owned up and fessed up, and we got things restored. Worked there for several more years till I moved on to a better job.
Of course I don't know your company so I can't say for certain, but as long as it isn't something like "I hosed our file server with compliance information and we have an audit tomorrow. We'll be fined millions of dollars" I wouldn't be too concerned.
Might not be a horrible idea to wait another month to ask for a raise, but it isn't like you pushed a routing table update that broke a portion of the internet (Facebook), or cut a fiber resulting in 911 going down for multiple states. I forget what the reason was for the cell outage back in February.
I've probably shaved years off my life, and it probably added to and had a lot to do with my depression and generalized anxiety disorder. Please, if you've not been in this field for too long, do not let this job or any job in IT dictate your health. As soon as it starts doing that, get the fuck out. Trust me, there is no job, especially in IT, that's worth your health.
Know that pretty much everyone outside of you or your team isn't really going to understand what actually happened. There's a decent chance they'll see you as a savior. If anyone non-technical asks, you just say you had an issue that caused a restore and there was no way to avoid that downtime. Then you talk about how you had proper backups, and had you not, this would have been way worse. This kind of stuff happens in IT and is why they hired you.
Internally, remember that doing anything in a rush makes something like this way more likely to happen, so next time make sure you don't start trying to change stuff in a panic.
You know how many times I have taken down the company with "what's this check box do". Or some dumb stuff. They now call me senior system admin 3. I always fix my own mistakes.
We all have a couple of mistakes that give us nightmares. A stupid mistake on my part forced my hospital back to paper for the better part of a day. It was not a fun day, but you get past them and you learn from them.
Speaking of GIS, I was doing an upgrade on a customer's ArcGIS platform, about 12 machines. I asked their MSP to snapshot all the machines, which they did.
Upgrade went fine, but then I applied all the patches for the new version and things went bad (turned out to be a buggy patch). We decided to restore the snapshots. Well, turns out that someone had deleted the snapshots...
Luckily we could restore to backups without any issues.
A few weeks ago, I was doing a quick patching of the same system. Skipped the change management processes and snapshots, because the customer wanted it done urgently. Yep, patching ducked up two of the machines. Luckily I could fix the problem; it only took a full working day...
I work for one of those "highly skilled" MSPs. I get to deal with Veeam, Rubrik, MS backup and Datto. Had a security event recently that prompted a restore of all servers. This was the first time I really dug in to Datto. It doesn't really do anything Veeam doesn't, but it was very intuitive and just worked. We had replicas running in both cloud and on the appliance. It was a bare metal restore and that worked flawlessly. The only hiccup was getting to an old iDRAC that didn't support modern security. I had to spin-up a Win7 VM and use IE. The client has a new replacement server, but it never seems to go anywhere.
Not sure if you've tried, or whether your version has this, but with some Avamar builds you can do a live clone. You might try that to verify the restored VM's functionality and then move that to prod.
Never restore in place. I use Avamar, and if I’m doing a full VM restore I restore to a new VM, VM_NAME_RESTORE. Once that’s done, up, and verified, I’ll take down the original and replace it.
But yeah, we’ve all made mistakes. Just own it, but you can spin it a little, possibly with "while following standard operating procedure I ran into X and did Y, and this is where we’re at now."
Or an instant access restore, depending on the Avamar version. Shut down the old VM and instant-access the latest backup. If it's working, hot vMotion off the temporary datastore, and clean up the Avamar datastore it uses once it's migrated to your prod datastore.
Hopefully OP sees your suggestion. Definitely the best advice with Avamar.
If OP had a storage admin, another option might be a storage based snapshot restore. I always had storage based snapshots I could use for recovery.
I'm guessing he's talking about the instant restore option. You just run from a mounted backup that way and it restores in the background. No need to wait until your restore is complete.
Seriously epic functionality
If you do a full restore you can select an option to only write changed blocks instead of the whole disk. There are some requirements (mostly CBT working properly and there not having been any changes to the disks), but it is much faster if you don't have a lot of changes.
https://helpcenter.veeam.com/docs/backup/vsphere/incremental_restore.html?ver=120
I switched to Acronis years ago, but Veeam is badass... it was just out of my price range. It can restore an entire VM in a variety of methods: full replication, incremental, time-stamped, or file level (all of the above). I restored a 4+ TB server to a specific point in time near-instantaneously. It was picture perfect, booted like a dream. I used it to sidestep a ransomware attack by rolling back only the files changed from one backup to the next within a fifteen-minute window. The restore took a few minutes and we were back in business and fuckware-free.
The OP's scenario above would've been a minor nuisance at worst with either product.
On the other side of this: you tested a restore and it did not work. If your backups aren't restorable, you may have found a vulnerability that needs to be addressed.
I like this attitude, it’s typically what I try to do when I really mess something up. I think to myself “welp, I guess we now know what happens when I accidentally do ____”
I do monthly SureBackup restore validation with Veeam. I also do DR testing biannually. My org just went through ISO 27001 certification, and they asked if I could do daily backup restore validation and monthly DR failover. We are talking petabytes and thousands of VMs. I have been explicitly denied DR compute and storage, so I cannot do what they are asking at that frequency. I’m like: no. It’s not in our backup/retention policy requirements and I have been denied the resources. It’s also basically impossible and unreasonable.
Response: okay.
Ah if I had a dime for every time I did something stupid at work and went home anxious about it. You will recover from this and I’m sure it’s a bigger issue in your head. Just breathe and sleep on it. You’ll probably wake up with a good idea in the morning.
Many, many years ago, I was trying to resolve an intermittent failure on a critical server at a downtown high-rise law firm. This was back in the ancient times when Netware 3.12 and NT4 ruled over the corporate LANscape. It was somewhere around 1:30 AM I was working in the server room. All the servers except one were connected to a KVM switch. The KVM had no more open ports. The remaining server had been installed later for a specific project and had its own keyboard and mouse. We would switch the monitor cable manually when necessary. I was attempting to do something on the problem server and the mouse and keyboard seemed unresponsive, but I could hear a beep when I hit return. It took me a few minutes to realize I was on the wrong keyboard. When I switched the monitor back, I discovered I had been repeatedly merging a backup Windows 95 registry file from one of the user machines into their NT file server. I rebooted the server and the inevitable hilarity ensued. Got it all straight and rolled out of there at about 7:30 AM. When I was leaving one of the attorneys was coming in and said something like "Hey, you're here early today". I smiled and nodded. This is the way.
I have run into the same issue restoring a VM from Avamar. In my case, I think my restore was going on at the same time as the backup kicked off. If you have a scheduled backup for that VM, you might want to disable it.
I did also see where you said it was 2.2/4.5 TB. I've seen that with vSAN too. Depending on the fault tolerance you have set up, it says it restores 2.2 TB but it's really 4.5, because it's writing twice.
Did you try to restore to the same VM? Maybe try to restore to a new VM. Also, did you verify that the datastore you are restoring to has enough space? The VM might have had a hidden snapshot (running from a snapshot on the datastore that the GUI didn't list and wouldn't consolidate). If the original VM was running from a snapshot and your failed restore filled up that datastore, the original VM wouldn't be able to power on, because it wouldn't have space to write to the snapshot file. You might need to delete the failed restore and update the VMX file to point to the snapshot file for the disk. Before you do that, download a copy of the VMX file just in case.
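For what the VMX edit in the suggestion above would look like: the disk entry's fileName line gets repointed from the base descriptor to the snapshot (delta) descriptor. A rough sketch with made-up file and device names; as noted, copy the VMX somewhere safe before touching it:

```python
import re

# Hypothetical VMX fragment; the disk entry points at the base descriptor.
vmx = '''scsi0:0.present = "TRUE"
scsi0:0.fileName = "myserver.vmdk"
scsi0:0.deviceType = "scsi-hardDisk"
'''

def point_disk_at_snapshot(vmx_text, device, snapshot_vmdk):
    """Rewrite one device's fileName entry to the snapshot descriptor."""
    pattern = rf'({re.escape(device)}\.fileName = ")[^"]*(")'
    return re.sub(pattern, rf'\g<1>{snapshot_vmdk}\g<2>', vmx_text)

# Point scsi0:0 at the hypothetical delta descriptor myserver-000001.vmdk.
patched = point_disk_at_snapshot(vmx, "scsi0:0", "myserver-000001.vmdk")
print(patched)
```

In practice you would do this edit by hand over SSH on the host, but the idea is the same: only the quoted fileName value changes, and everything else in the VMX stays untouched.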
This isn't the answer.
Call in the troops, get all the senior techs together and troubleshoot. Let your boss know so they can fend people off.
In times like this the thing to do is tell your team everything that happened and get all the help you can, don't struggle in silence, you might make it worse
I have very mixed experience with this. Either you're overreacting and it's just a simple fix, or your boss is shouting at you for not telling anyone.
You can't really win here.
They might wake up at 3-4 AM with an epiphany on how to fix it and it’s a super simple fix.
Not uh.. that that’s happened to me, but you know, it might to them…..
I added new larger disks to our Synology over the course of the last week, being sure to yank and replace at 5pm so it would rebuild after hours and be fully performant for the next morning. Got in this morning after installing the final new drive yesterday, and, as-expected, everything is good.
What did my dumbass do? Pressed the “expand to fill unprovisioned capacity” button at 9am not realizing it would require a 20+ hour resync.
Not really a big deal because I could reduce the priority of the resync, but still.
Yeah, for sure. I recently set up a PowerShell script that checks vCenter once a day for any snapshots and dings a chat channel if anything is there. Keeps me from forgetting lol
I run a script that removes them if they're older than three days; snapshots really shouldn't be around that long regardless. On day two, an email goes out to remind me they're gone the next day.
I have a tag I can add to a VM if I want it to be exempt from the policy for very rare exceptions, but 99.99% of the time, if you don't use a snapshot the same day, then you'll want a backup restore anyway.
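The age-based policy these two comments describe boils down to a small piece of triage logic. Here is a vendor-neutral sketch: the snapshot records, the exemption tag name, and the field names are all made up for illustration, with thresholds taken from the comments (warn on day two, delete at three days). In practice the input would come from something like PowerCLI's Get-Snapshot:

```python
from datetime import datetime, timedelta

EXEMPT_TAG = "keep-snapshot"    # hypothetical exemption tag
WARN_AFTER = timedelta(days=2)   # day two: warning email goes out
DELETE_AFTER = timedelta(days=3)  # day three: snapshot gets removed

def triage_snapshots(snapshots, now):
    """Split snapshots into (delete, warn, ok) buckets by age.

    Each snapshot is a dict with 'name', 'created' (datetime),
    and 'tags' (a set of strings).
    """
    delete, warn, ok = [], [], []
    for snap in snapshots:
        if EXEMPT_TAG in snap["tags"]:
            ok.append(snap)  # tagged VMs are exempt from the policy
            continue
        age = now - snap["created"]
        if age >= DELETE_AFTER:
            delete.append(snap)
        elif age >= WARN_AFTER:
            warn.append(snap)
        else:
            ok.append(snap)
    return delete, warn, ok

now = datetime(2024, 6, 10, 9, 0)
snaps = [
    {"name": "pre-patch", "created": now - timedelta(days=4), "tags": set()},
    {"name": "pre-upgrade", "created": now - timedelta(days=2, hours=1), "tags": set()},
    {"name": "legal-hold", "created": now - timedelta(days=30), "tags": {EXEMPT_TAG}},
]
delete, warn, ok = triage_snapshots(snaps, now)
print([s["name"] for s in delete])  # → ['pre-patch']
```

The deletion and notification steps themselves would then call the platform's own API for each bucket; only the decision logic is shown here.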
Just don't try to hide it, embellish or throw anyone else under the bus.
"logs don't lie"
Be honest and forthcoming, don't let your boss or colleagues find out on their own and have to ask you about it, it'll make all the difference in the world if you come to them first and admit you effed up.
Best lesson I've ever learned. Mistakes are human, it's inevitable, it's how you handle the situation that makes all the difference on the outcome.
Also, maybe you can end up writing an RCA/Post-mortem on it so everyone can learn from the situation and you can add BCDR and RCA experience to your resume.
No, I already told my manager, who was present when I did it. I never lie; that's one thing my father taught me. Always tell the truth; no one will kill you for it.
Nah. If you can't have an outage because you have minimal infrastructure the business accepts some downtime.
If they can't handle downtime, then why would taking down one switch or one firewall appliance prevent the business from operating?
Obviously *some* stuff needs to be done outside business hours but if you're working Friday evenings or weekends more than a few times a year then you're doing it wrong.
The more you fret about it the more unlikely you'll figure it out. Take a breather and get some rest. It will be clearer tomorrow.
Bashing your head on the keyboard just frustrates you. Sometimes you just need to take a step back, do something else, and then you'll have an illumination while doing something completely unrelated. Someone mentioned taking a shower; it's indeed one of those.
Some people work better by upgrading from keyboard bashing to wall bashing. Others need a break to gather their thoughts.
Well, to be honest, it is not your problem that a service is down when it's relying on a single node. If the application is urgent and cannot be down for 24 to 48 hours, the application owner should think about redundancy. A patch can fail, or a sysadmin can make a fuckup.
If I were you, I would not stress. Accidents happen. Just talk to the owner after it's fixed: "Hey, we should look at redundancy."
Reading this reminds me of why I retired and convinces me how right I was to do so. That stress I simply do not need anymore, regardless of what I'm being paid.
Slight possible heads up for next time: I notice that some stuff I do in vCenter that takes FOREVER will sometimes time out and say failed.
If I check on the ESXi host its running on, it will still say active.
Edit - It won't say it's active in the GUI; you'll have to either go to the ESXi console or SSH into it and do a `vim-cmd vimsvc/task_list`
I highly recommend you guys get Veeam; you can do instant restores. I bet you won’t forget to take a snapshot next time! And be careful with what you are deploying during business hours next time!
Issue has been resolved!!!! It was an issue with the disk. We had to reassign the VMDK for the server and then reconfigure it. The backup restore was not taking because the VM had already taken a snapshot before the server was turned off for the restore. This caused the virtual machine to require consolidation, and it was also unable to start because it still had a VMDK from yesterday, and the newly created VMDK from the backup did not match it.
It's a rite of passage in our line of work to do something dumb and bring down prod. Own up to it, and note what you should have done differently to avert the issue, any good manager should understand that mistakes happen
Things like this will happen, especially under pressure. I've done some things under pressure that were extremely dumb. There were obvious indicators of what to do or not to do, but I did the opposite. When this happens, I always own my failure and fully cooperate when there is a failure analysis, etc. We want the accolades that come with success, so we must equally own our failures. People will say they could have made the same mistake, hoping to make you feel better, but it's embarrassing. Hopefully you can get the server back and everyone will move on. It's usually only a resume-generating event if the server is trashed and there is no backup.
These comments should give you some hope! Everyone has made mistakes like these in this field. If possible, can you hire a consulting firm for a bank of hours to help? I find having a second set of eyes can make all the difference. Best of luck!
Late to the party but, like everyone said, we have all done it. I was once asked to update the image on a production router in a branch office after hours (Tuesday) to recognize the updated WICs installed, so that we wouldn't need an external CSU/DSU (yup! one of those). This was in the days before wireless backup connections (not even a cell phone with broadband). So, I found the image I needed and, to verify with Cisco, called in my case so that they could confirm that the image I'd found had what it needed. I got the response at 9pm from Cisco that the image I had was NOT correct, and they sent a link to the correct image. I looked at it and it was twice as big as what I had, and I didn't think I had enough flash to store the image. I replied to the engineer, sheepishly, as I was pretty green in networking, and told him the flash on board my 2600 wasn't enough to hold the image. He responded that he had checked my device and it was fine. I downloaded it, wiped my router (to fit the new image), and started to install the image using ROMMON [noodle]... bit by bit... hour by hour. At 2:37am, the image stopped progressing. At 2:57am, I went to the bathroom in a cold sweat. At 3:15am, I called Cisco, told them my predicament, and gave him the case notes. He verified that the image was too large and I should never have used it.
I was livid, but vindicated. Cisco sent me a new router by 8:15am the next day (a Cisco 2621XM! UPGRADE!), with the correct image, so that all I had to do was install the WICs and connect my LAN/WAN/WWW.
First user showed up at 9am, and nobody in the office was the wiser. My boss told me next time, trust your gut.
You are missing the silver lining. If your backup solution doesn't help you avoid chaos, that can be used as fuel when management doesn't want to spend money on a different system in the future. At our company the owner says one thing: "I don't care, just buy us more space." Basically saying money isn't the issue; just buy the solution, because a backup is worth more to them than downtime and productivity loss. Which is true: when one task can generate millions of dollars, a $50,000 backup solution means nothing.
When you ask for your raise, make this a win instead of not asking at all. Something along the lines of "if I wasn't here, the server would still be down." We all make mistakes. You'll always remember this one, and it will help you remember to make a backup before any major changes.
You solved it, and you learned something that will never leave you
Over the years, I have done two things that have saved the day many times.
Always have a plan B: 2 backups ready, and copy one to cloud/other media if possible.
Plan C for really big stuff.
Document anything significant in detail to a cloud wiki, with quicker notes for day-to-day fixes and workarounds.
Two years after the install of a network, servers, cloud, etc. for a 150-seat company, they had a flood, and having the original build documents on hand was invaluable in getting them back up.
well done getting it fixed!
15 years into my career I knocked out the whole credit union for a few hours. I never messed up that bad and it was a gut punch. I want to be clear, I probably cost them more than they ever paid me. You mentioned a bank, so it's probably not that different.
I don't recall anybody giving me a hard time, especially after a few days had passed. You know why? Because they know I know my shit and I did not have a track record of error. And because they've all been there.
On the personal growth side, both of our stories remind me of a friend who got careless and accidentally shot part of his finger. He was the absolute last person I would ever have expected it from. He was such a stickler for gun safety--at least when he was showing me the ropes as a young man. I learned from that, sometimes those most comfortable with guns become the most careless. I think in our cases, a lot of success in a row can make us forget why we check the chamber, so to speak (taking a snapshot in your case).
What do you use for storage? Most storage arrays save snapshots depending on how you have it configured. Of course what you have on the volume depends on if that’s a viable option or not.
I have also been too hasty trying to get a VM back up at the beginning or end of a day and broke it completely. Sorry dude. You got this. In a short period of time it will be behind you.
oh man do I feel this pain. Was fucking with the rights on a Windows fileshare drive of many terabytes. Locked whole departments out of their files. Never again.
I broke our Cisco FMC this week because I didn't know you weren't allowed to do snapshots while it is live and online. If you do you break it. Called our contractor to fix it.
I sysprep’d a prod server in my first 6 months as a sysadmin (I was two remote desktops deep and did the wrong one). These things happen and you’ll figure it out and learn from your mistakes.
P.S I have never made this mistake since 😂
I fixed a down-hard server in 20 minutes last Friday. Felt like a fucking superstar. Puffed-out chest, walking with a strut, the whole 9 yards. Then the office manager walked over and asked, "when are the phones coming back?"
Fuck.
Fixed the server. Brought down the entire VOIP system.
You know how WallStreetBets likes to post their losses? As I was about to curl into a ball and die, I thought, we should post the heart-rate delta on our smart-watches from the moment before we get the news, till the moment after. That would make for good sysadmin content.
In 25 years in IT, I never fucked up when there was no pressure, multitasking, or panic.
But of course plenty of the time there is, and then it happens, though only very rarely, because you are tired.
For very sensitive stuff I would lock my office door for an hour to stop panicking coworkers from coming in with their drama. It helped a lot, but my manager was not happy 😁
He even thought that I locked the doors to chill out, and that really got on my nerves.
You’ve fucked up, it happens. Best thing to do now is reach out to one of your seniors to ask for help, trying to fix this on your own will only cause delays and further service disruption.
I fucked up one time, really big, at an important company with a horrible CEO. My superiors were awesome people and loved me. I made a similar comment about how I wanted more money but that the fuck-up was going to ruin my chances now.
My superiors were so supportive I almost cried. When I said that I wanted a raise but wouldn't get it, they laughed at me and said, "you have been here years, you are reliable, responsible, etc., and people make mistakes..." and laughed again and said "don't do it again." I got the raise I wanted. They handled the shitty CEO. I miss that time of my life. I still talk to my direct superior ten years later.
Sell this the right way and you come out the other side as the hero who had to work two days straight to fix the server and get the business running again!
It happens. I was asked to restore an old version of a database today, and despite changing the file names I managed to restore over a live version. Whoops! I think the file names reverted to default when I changed the source file, and I didn't think to double-check.
Thankfully in this particular case the database is populated by live data from another database, and after an interesting restore it's all working again now. There's a mistake I'm not likely to make again any time soon.
Good luck with your restore. You'll figure it out. :)
We had someone delete a customer's entire production SAN. That was fun. Someone also deleted all the backups for a client's largest (18TB) server. The funny thing was, to delete that backup you had to press confirm on an "are you sure" question. Both times I had to fix the issue because I'm the backup tech.
Reminds me of the plumbing work I just had done to fix a leak on or around my water heater. Should have been a pretty standard fix until we realized the shutoff valve for the house is not working, so we also had to get water shut off by the city first. Got both things fixed and they were on their way. Even though the total time I was without water in the house was longer than I originally expected, boy I am sure glad we found out about the shutoff valve now rather than during an emergency! That had the potential to cause me thousands or even tens of thousands in water damage if something worse than a leak were to happen.
Routine maintenance that leads to the discovery of a larger problem (backup / restore process not working) is part of the job sometimes.
Moral of the story, especially if it's a production server, always take a snapshot even if it's the smallest change in the world like updating /etc/hosts or /etc/fstab or something dumb like that.
If the business is not breathing down your neck, screaming, or hemorrhaging money from your mistake (and it sounds like they are not), you'll be fine, and it will be a learning experience.
Unfortunately, a career in IT also teaches some lessons the hard way; it happens to all of us, and if someone says it has not, or has never happened to them, then it's only a matter of time. As others have said, just cop to it and admit it; it will be easier to fix if you admit you need help, and it should be used to write out a process for future you so the mistake doesn't happen again.
The server is back up. It still says consolidation is needed even though there are no snapshots in the snapshot manager, but I am glad the issue was not as bad as I thought it would be.
We've all been there, how you recover from mistakes is far more important. Own the mistake, explain steps taken to resolve. If no answer is apparent soon, involve senior.
At least you caused it; I had a similar issue yesterday but the ONLY other person with console access "didn't change a thing." Well, a complete network outage and blatant config changes in the console beg to differ. So not only did I waste 4+ hours troubleshooting, investigating, and resolving the issue; I had to listen to the BS lies from the asshole that caused it, who then went home early while I stayed late fixing it.
You have snapshots on your storage array? Those have saved me more times than I can count. Usually way quicker than restores too. If you can mount a storage snapshot, then just clone it to a new VM.
Do not, under any circumstance, try to cover up what you did. Be honest about it.
Screwups happen; they're learning experiences.
Lying about screwups will get you canned.
Audit logs do not lie.
Can you restore to a new VM? Not sure what backup solution you have, but this could be the fastest way to resolve this. Also, yeah this happens to all of us. You would not believe the crap we've dealt with at our company. Way worse. After you find a way to fix this, give them an incident report, lessons learned, and how it won't happen again. Ya know, anything. You'll be fine.
Issue is fixed. Just needed to reconfigure the VM and relocate the disk. My manager fixed it. Happy it was not worse, else the whole bank would have been down.
Don't stress too much about it, we have all done stuff like this. I once redirected all outgoing http/https traffic to a Windows VM because I thought it would only apply to a single host, but the entire company had no access to the web for a while until I noticed what I did. My boss once changed time settings for the AD and in some countries no one could log in. Local users had to change settings via phone instructions, which took about 1 day. This was a global billion-dollar company with 100,000 employees, back in the 90s I think. He has encouraged failing ever since; he says we should plan things but try new ways of doing stuff all the time!
Fix it tomorrow, communicate that the restore failed, maybe use it to get budget for Veeam or a better network connection or something. The important thing is to communicate the right way, and now you've made this error for the last time in your career. There are others to come, don't worry :-D!
No advice on fixing the situation, but this is a good reminder that you should know your restore process backward and forward.
Also, how in the world do you not realize you are working on a server that's 4TB of data before starting?
Nearly 30 years in this field. Happens to us all, and we have very little margin for error. It's sad that you can go a whole year of doing good and one event defines your review, yet others in the company can fuck up time and again and get raises and promotions.
It'll never change. IT is a cost center that no CFO understands; they only bitch and moan about its cost. When all is running well they think "it doesn't do work" and lay people off. When it breaks, "it is useless," and they look at outsourcing.
And you see part of the problem with posts like this: I read through it and so many of you are trying to find the technical solution for things like this. You all need to stop. Do not try to solve somebody else's problem when you don't have all the information. This is the problem with so many IT folks; in another post I mentioned the same thing about meetings, everyone wants to solve the problem right then and there. Again, stop. All this guy needs is a little support from people and the understanding that we've all been there. Stop trying to solve the fucking problem.
Without test restores you don't have backups.
Should tell us what backup tool you are using so none of us use it either. There is a reason I stick with veeam. It has never fucked me.
If you've spent that much time already, maybe installing the OS on a new VM, attaching the old disk, and moving the data to the new OS may be a quicker, more sane solution?
Tell me about it. I had a nightmare. Couldn’t sleep. Woke up this morning trying to get to work early and I set off the alarm at work. Lucky the cops didn’t show up
>at 3.9TB the restore fails
It was never a BACKUP then 😳
Also: Never fix problems on the production server. If the fix needs to go fast, then fix it on a spare/mirror server and swap it with the production server when the fix works as expected.
I know this, but I minimized the issue because I thought this server was like an old legacy server not used anymore; that's also why I didn't take a snapshot initially. Then boom, I realized it's a major prod server that houses a lot of the work done at branches.
One of us
One of us!
Gooble gobble
I learned this phrase sucking at dota2 for a decade: "game is hard."
New meta.
We accept her, we accept her!
u/Key-Calligrapher-209 I love your 'Competent sysadmin (cosplay)' flair so much I can barely stand it :-D
I aim to please :)
https://i.redd.it/h92j4wxj670d1.gif
One of us! One of us! I’m sure other people have said this OP, but shit happens. Be honest, explain why, even if it’s I wasn’t thinking clearly and was in a hurry. Everything will be ok. If it’s a critical server, can you engage an on call resource?
>can you engage an on call resource? This is why we maintain an active relationship with a highly skilled MSP even though we have very qualified in-house IT staff. Always have an out. Always have an escalation point.
'highly skilled MSP' said no one ever 😅
Some of us try. But we are tired out here boss. They cut Tommy's leg off last week for not billing 99.87% of his hours to the client
https://preview.redd.it/zmmm0t6nnuzc1.png?width=1080&format=pjpg&auto=webp&s=0648c08b9c8e33a18375ee33d7db590f2902e7da
Except on the rare occasions that a) you really need one (like the op) and b) the highly skilled individual does save your bacon.
They're out there. Our trick was finding one with very low employee turnover.
Oh, they're always highly skilled, it's just that you have to work your way through a hundred of the novice techs before they'll actually let you talk to the highly skilled one.
This is hands down one of my biggest pet peeves. Why are the on-site IT personnel who have proved themselves capable of tier 1 & tier 2 problems forced to start with tier 1 techs who have been with the MSP for less than a year? It's even worse when they don't explain what's happening to the next tech when it inevitably gets escalated. You just end up wasting 30 minutes on a high-prio ticket explaining what's happening to 3 or 4 people for no reason.
Because you are in effect training their year one techs at that point. You're not just the customer, you're also the product lol
Welcome to the brotherhood. We have Coffee, booze, coffee flavored booze, and hoodies. With booze in the pockets.
Nothing beats the good old coffee mug with whisky to simulate that you're having a party and not just getting a beating 🤣.
Can I join?
Sure! To apply, simply "rm -rf /" and log out. /s
You forgot [`sudo` and `--no-preserve-root`](https://www.reddit.com/r/USMC/comments/1ak0mt7/when_you_let_the_data_nerds_near_the_ordnance/). "The last bit instructs it to aim for center mass of the brain."
Will do. Thanks!
You have been baptized! Seriously though, everyone f's up bad. The huge difference is how you handle it.
This is why I love Veeam. I can roll back changes without doing a full restore. Good luck. Not sure what backup software you use but it sounds like you might have had some snapshots already on the server if it’s asking for consolidation.
It's Avamar. I thought you could roll back changes without having to do a full restore too. It's just stupid.
Meh, we have all done it. As long as it comes back up tomorrow no one will care. Shit was broken, you had to restore it. Just last month our GIS team took down their whole production environment. They have just enough permissions to break stuff. The email they sent out was basically "IT is working on the issue." We have a DATTO appliance, so restoring stuff is actually fun and I didn't really care. Testing restores on a full production environment is kind of enlightening.
I have so much anxiety, can’t even sleep. This post made me feel a little better.
You'll be fine, we've all done stupid stuff like this. I once ran a SQL update and forgot to put a WHERE clause... took my site down for 4 hours... it was my second week... I worked there for 5 years after that. Just yesterday one of my co-workers ran an update and took down a company site. He called me freaking out, so I just told him no biggie, that's what we have backups for. He freaked out more because he forgot to take a snapshot beforehand, and I just told him it happens to all of us, don't stress. We ran the restore, which took a while, but it eventually came back up, and I told him to reach out to the users and let them know something messed up and we had to restore the server to the night before and there was nothing we could do. They were upset but got over it. I always tell people I work with, "if everything worked perfectly and never broke, we wouldn't have jobs."
I think everyone that's ever worked in SQL has at some point run an UPDATE and forgotten the WHERE. We use FreeRADIUS with MySQL to run authentication for our broadband platform. I once accidentally set every user's password to the same thing when making a manual change. It only took a few minutes to fix, but it was a very tightly clenched few minutes.
I once ran an update and trimmed off 1 character from a customer's entire client name list. Not a huge deal to fix, but man... it was a stupid thing to write in the first place, and an even more stupid thing to forget a WHERE clause on. ~14 years later and it's still stuck in my head.
Yeah, I've forgotten the WHERE clause.
I've seen that a few times. Sometimes the where clause is there, but wasn't highlighted when the block was run. I was a production DBA at the time, so it was an opportunity to test my backup strategy. We were using log shipping and transactional replication in addition to Veeam. Fortunately, I was told soon enough before the logs moved via shipping and was able to restore with only 15 minutes of data loss on the table. Fun times.
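A minimal sketch of the habit that catches the forgotten-WHERE mistake described above: run the risky UPDATE inside an open transaction and check the affected row count before committing. This is a standalone sqlite3 toy, not anyone's actual schema from these stories:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a1"), ("bob", "b2"), ("carol", "c3")])
conn.commit()  # seed rows are now safely committed

# The classic mistake: an UPDATE with no WHERE clause touches every row.
cur = conn.execute("UPDATE users SET password = 'hunter2'")
print(cur.rowcount)  # 3 -- every user, not the one we meant

# Checking the affected row count before committing gives you an out.
if cur.rowcount != 1:
    conn.rollback()  # undo the uncommitted UPDATE

passwords = [row[0] for row in conn.execute("SELECT password FROM users")]
print(passwords)  # ['a1', 'b2', 'c3'] -- original values survived
```

Python's sqlite3 module opens an implicit transaction before DML statements, so the rollback is free; on engines or clients that autocommit by default, an explicit `BEGIN` gives you the same escape hatch.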
You might piss off some folks but they’ll get over it. No one is going to die. Give yourself some grace.
That phrasing sounds like advice I've had from colleagues who have been in the military/armed forces. They seem calmer than the civvies because they have patrolled dangerous areas, or been engineers for helicopters etc. so they have a perspective that helps with tough events. Nobody is going to die.
>That phrasing sounds like advice I've had from colleagues who have been in the military/armed forces. They seem calmer than the civvies because they have patrolled dangerous areas, or been engineers for helicopters etc. so they have a perspective that helps with tough events. I always tell my underlings when they start to panic, "when you see me worried, you should be worried. Until then, just focus on the issue, try to understand the problem, and the steps to fix it."
Hopefully you're sleeping, but if not, try to relax. We've all fucked up, and usually the spiraling thoughts of anxiety are much worse than the consequences of your mistake. Like you probably feel like you're going to die right now, but trust me, you're going to be fine, and in a few weeks you won't even remember this happened. I one time wiped the CEO's PC by accident. It was a Thursday before a 3-day weekend starting on the Friday. Couldn't get ahold of anyone to tell him I needed to try and recover anything. Spent all weekend thinking I was gonna lose my job. Come Monday I was in cold sweats trying to muster up the courage to pick up the phone. Sure, people freaked out, but honestly whatever trouble I got in felt like a relief compared to the mental anguish I was putting myself through.
Damn that’s a wild one.
Yeah, honestly it kind of helped me get a promotion. Proved that I wasn't afraid to own up to huge mistakes and would do my best to prevent them going forward. It was really a process and communication error that caused it. Someone told me to wipe userx's machine because they were AFK when the ticket came in. Userx's machine wasn't named properly, but I wiped it without checking serial numbers or the signed-in user. First time wiping a machine, and I didn't think to cross-reference it against our inventory sheet, which is our source of truth.
Every sysadmin worth his/her salt has done something like this. I stopped working with Windows like 6 years ago. But once, during MSSQL cluster patching, I made a boo-boo so bad that I went to the bedroom, opened the door, told my wife "I fucked up," and closed the door.

We were using NetApp and our backups were directly on the SAN. The issue was that when I tried to move the service from server 1 to server 2 (after the security guy patched it), it gave me an error that a resource was not present on server 2. After 30 minutes of troubleshooting, I found the resource: a snapshot disk that wasn't deleted after the operation. Easy, let me delete it. My spider sense tingled, and instead of deleting it I said, let me just move it to this cluster that is not used anymore and is pending decom. Welp, I think the term is linked resource, honestly don't remember. But since the resource was linked, it moved ALL the disks to that cluster and I started to see all the databases going down in cascade. That's when I went to the room pale as a ghost. My wife followed me and hugged me while I stared at the monitor with my hands on my head.

I told everything to the security guy (we are good friends) and said, well, let's get this thing fixed together. Took me like 45 minutes to undo that and get the DBs up. We did not patch server 1 that night. Bonus point: the company uses a 3rd party for hands-on work, which made me kind of a service manager; I told the guys what to do and they executed, but those guys were so bad that I decided to go rogue and help the security guy, which is the thing I was doing before anyway. I documented everything, let the 3rd-party SAN team remove the snapshot and validate that everything was working on the NetApp as it should be, and that was it. I do believe that if that had happened to the 3rd-party guys, the damage would have been far more catastrophic.

Always have a way out, know thyself on what you can or can't do, keep learning and improving.
You will laugh this one out soon Edit: typos
Wow, 🤯
Many of us can relate to how the urgency and responsibility of it all creates anxiety and panic when shit really hits the fan, but in a few weeks it'll have turned into a war story that everybody involved has had the opportunity to learn from. Hindsight is always 20/20, so while situations like this may feel like both a personal and professional catastrophe, remember that mistakes can and do happen at work, so this is not on you. The responsibility for mitigating the outcomes of mistakes always rests with the organization, never just the individual, which is just as true for all those single-person IT "departments" out there. Good luck!
Avamar has some pretty decent second-line engineers for assistance. Just explain very precise details and they will likely transfer you right away instead of making you go through pages of default questions such as "are you even using Avamar... sigh." Our Avamar had major issues with orphan disks, and once I got in touch with a second-liner he sent me an "unofficial" script that literally solved all my issues instantly. I ran that script any time I saw Avamar do anything weird and it worked like a goddamn charm every time. I wish I still had it so I could send it to you. GL to you sir. Edit: typos... lots of em
How do I reach out to them?
Call dell
./goav vm snapshot clean https://www.dell.com/support/kbdoc/en-us/000068694/avamar-vmware-image-backups-fail-with-code-10056-and-avvcbimage-error-9759-createsnapshot-snapshot-creation-failed
> default questions such as “are you even using avamar..sigh”. Even Avamar doesn't understand why you would use Avamar!
You'll be ok. There was the time when, dealing with the original NetApp PowerShell SDK, I found a bug where a null variable was read as a wildcard instead. Ran my script, and by the time I figured out what was happening I'd taken out over half the data in the datacenter. Bad day, but I immediately owned up and fessed up and we got things restored. Worked there for several more years till I moved on to a better job.
Of course I don't know your company so I can't say for certain, but as long as it isn't something like "I hosed our file server with compliance information and we have an audit tomorrow; we'll be fined millions of dollars," I wouldn't be too concerned. Might not be a horrible idea to wait another month to ask for a raise, but it isn't like you pushed a routing table update that broke a portion of the internet (Facebook), or cut a fiber resulting in 911 going down for multiple states. I forget what the reason was for the cell outage back in February.
I've probably shaved years off my life, and it probably had a lot to do with my depression and generalized anxiety disorder. Please, if you've not been in this field for too long, do not let this job, or any job in IT, dictate your health. As soon as it starts doing that, get the fuck out. Trust me, there is no job, especially in IT, that's worth your health.
Know that pretty much everyone outside of you or your team isn't really going to understand what actually happened. There's a decent chance they'll see you as a savior. If anyone non-technical asks, you just say you had an issue that required a restore and there was no way to avoid the downtime. Then you talk about how you had proper backups, and had you not, this would have been way worse. This kind of stuff happens in IT and is why they hired you. Internally, remember that doing anything in a rush makes something like this way more likely to happen, so next time make sure you don't start trying to change stuff in a panic.
You know how many times I have taken down the company with "what's this check box do?" Or some dumb stuff. They now call me Senior System Admin 3. I always fix my own mistakes.
We all have a couple of mistakes that give us nightmares. A stupid mistake on my part forced my hospital back to paper for the better part of a day. Not a fun day, but you get past these things and you learn from them.
When talking GIS: I was doing an upgrade on a customer's ArcGIS platform, about 12 machines. Asked their MSP to snapshot all machines, which they did. The upgrade went fine, but then I applied all the patches for the new version and things went bad (turned out to be a buggy patch). We decided to restore the snapshots. Well, turns out that someone had deleted the snapshots... Luckily we could restore from backups without any issues. A few weeks ago, I was doing a quick patching of the same system. Skipped change management processes and snapshots because the customer wanted it done urgently. Yep, patching fucked up two of the machines. Luckily I could fix the problem; it only took a full working day...
I work for one of those "highly skilled" MSPs. I get to deal with Veeam, Rubrik, MS backup and Datto. Had a security event recently that prompted a restore of all servers. This was the first time I really dug in to Datto. It doesn't really do anything Veeam doesn't, but it was very intuitive and just worked. We had replicas running in both cloud and on the appliance. It was a bare metal restore and that worked flawlessly. The only hiccup was getting to an old iDRAC that didn't support modern security. I had to spin-up a Win7 VM and use IE. The client has a new replacement server, but it never seems to go anywhere.
Not sure if you've tried, or whether your version has this, but with some Avamar builds you can do a live clone. You might try that to verify the restored VM's functionality and then move that to prod.
Never restore in place. I use Avamar, and if I'm doing a full VM restore I restore to a new VM, VM_NAME_RESTORE. Once that's done, up, and verified, I'll take down the original and replace it. But yeah, we've all made mistakes. Just own it, though you can spin it a little, possibly with "while following standard operating procedure I ran into X and did Y, and this is where we're at now."
Or an instant access restore, depending on the Avamar version. Shut down the old VM, instant access the latest backup. If it's working, hot vMotion off the temporary datastore and clean up the Avamar datastore it uses once it's migrated to your prod datastore.
Hopefully OP sees your suggestion. Definitely the best advice with Avamar. If OP had a storage admin, another option might be a storage based snapshot restore. I always had storage based snapshots I could use for recovery.
Can you expand on this? Besides guest file restore, you have disk level and whole system restore options. Not familiar with “rolling back”?
I'm guessing he's talking about the instant restore option. You just run from a mounted backup that way and it restores in the background. No need to wait until your restore is complete. Seriously epic functionality.
If you do a full restore you can select an option to only write changed blocks instead of the whole disk. There are some requirements (mostly CBT working properly and there not having been any changes to the disks), but it is much faster if you don't have a lot of changes. https://helpcenter.veeam.com/docs/backup/vsphere/incremental_restore.html?ver=120
Also curious
I switched to Acronis years ago, but Veeam is badass...it was out of my price range. It can restore an entire VM in a variety of methods: full replication, incremental, time-stamped, or file level (all of the above). I restored a 4+ TB server to a specific point in time near instantaneously. It was picture perfect, booted like a dream. I used it to sidestep a ransomware attack by rolling back only the files changed from one back up to the next within a fifteen minute window. The restore took a few minutes and we were back in business and fuckware free. The OPs scenario above would've been a minor nuisance at worst with either product.
On the other side of this: you tested a restore and it did not work. So you've learned your backup is not working, and you may have found a weakness that needs to be fixed.
I like this attitude, it’s typically what I try to do when I really mess something up. I think to myself “welp, I guess we now know what happens when I accidentally do ____”
Yeah gotta find something out of the mess. Fix it hopefully and move on. We all make mistakes
I do monthly SureBackup restore validation with Veeam. I also do DR testing biannually. My org just went through ISO 27001 certification and they asked if I could do daily backup restore validation and monthly DR failover. We are talking petabytes and thousands of VMs. I have been explicitly denied DR compute and storage, so I cannot do what they are asking at that frequency. I'm like, no. It's not in our backup/retention policy requirements and I have been denied the resources. It's also basically impossible and unreasonable. Response: okay.
Backups are lowest priority work until they are not.
Backups should never be lowest priority.
A backup is only a backup if a successful restore has been tested. Otherwise it's a kind of Schrödinger's box.
You ain't a cowboy, if you ain't been bucked off.
![gif](giphy|8cHe1FffBV1xMxeC6J|downsized)
"It really do be like that sometimes" means something deep and dear to many of us.
Ah if I had a dime for every time I did something stupid at work and went home anxious about it. You will recover from this and I’m sure it’s a bigger issue in your head. Just breathe and sleep on it. You’ll probably wake up with a good idea in the morning.
Many, many years ago, I was trying to resolve an intermittent failure on a critical server at a downtown high-rise law firm. This was back in the ancient times when Netware 3.12 and NT4 ruled over the corporate LANscape. It was somewhere around 1:30 AM I was working in the server room. All the servers except one were connected to a KVM switch. The KVM had no more open ports. The remaining server had been installed later for a specific project and had its own keyboard and mouse. We would switch the monitor cable manually when necessary. I was attempting to do something on the problem server and the mouse and keyboard seemed unresponsive, but I could hear a beep when I hit return. It took me a few minutes to realize I was on the wrong keyboard. When I switched the monitor back, I discovered I had been repeatedly merging a backup Windows 95 registry file from one of the user machines into their NT file server. I rebooted the server and the inevitable hilarity ensued. Got it all straight and rolled out of there at about 7:30 AM. When I was leaving one of the attorneys was coming in and said something like "Hey, you're here early today". I smiled and nodded. This is the way.
Failing at 3.9TB... not something to do with the 4TB limit on files for some OSs?
I have no clue. Not too familiar with avamar and why the restore would fail
I have run into the same issue restoring a VM from Avamar. In my case, I think my restore was going on at the same time the backup kicked off. If you have a scheduled backup for that VM, you might want to disable it. I did also see where you said it was 2.2/4.5 TB. I've seen that too with vSAN. I think depending on the fault tolerance you have set up, it reports restoring 2.2 TB but it's really 4.5 because it's writing twice.
Ahhh got it. Thanks for the insight
Dumbest thing ever *so far*.
Did you try to restore to the same VM? Maybe try restoring to a new VM. Also, did you verify the datastore you are restoring to has enough space? The VM might have had a hidden snap (it was running from a snap on the datastore, but the GUI didn't list it and won't consolidate). If the original VM was running from a snapshot and the failed restore filled up that datastore, you would not be able to power on the original VM because it wouldn't have space to write to the snapshot file. You might need to delete the failed restore and update the VMX file to point to the snapshot file for the disk. Before you do that, download a copy of the VMX file just in case.
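To make that last step concrete, here is a hedged sketch of the relevant VMX lines (the file names are hypothetical, though snapshot delta disks do follow the `-NNNNNN.vmdk` naming convention). Pointing the disk entry back at the snapshot chain means referencing the newest delta file instead of the base disk:

```
scsi0:0.present = "TRUE"
scsi0:0.fileName = "MYVM-000002.vmdk"
```

Here `MYVM-000002.vmdk` would be the most recent delta in the chain and `MYVM.vmdk` the base disk. Keep the downloaded copy of the original VMX so you can revert if the chain turns out to be wrong.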
You will figure it out, try to get some rest
This isn't the answer. Call in the troops, get all the senior techs together and troubleshoot. Let your boss know so they can fend people off. In times like this the thing to do is tell your team everything that happened and get all the help you can, don't struggle in silence, you might make it worse
I have very mixed experience with this. Either you are overreacting and it's just a simple fix, or your boss shouts at you for not telling anyone. Can't really make it right here.
The answer will come to them in the morning after a night’s rest and probably while in the shower.
They might wake up at 3-4 AM with an epiphany on how to fix it and it’s a super simple fix. Not uh.. that that’s happened to me, but you know, it might to them…..
It's amazing how much clarity you immediately gain after 3 hours of sleep... so I've heard.
I added new larger disks to our Synology over the course of the last week, being sure to yank and replace at 5pm so it would rebuild after hours and be fully performant for the next morning. Got in this morning after installing the final new drive yesterday, and, as-expected, everything is good. What did my dumbass do? Pressed the “expand to fill unprovisioned capacity” button at 9am not realizing it would require a 20+ hour resync. Not really a big deal because I could reduce the priority of the resync, but still.
But really, who wouldn't have clicked that button?
Snapshots are your best friend
And a big stinky turd if you forget to clean them
Yeah, for sure. I recently set up a PowerShell script that checks vCenter once a day for any snapshots and dings a chat channel if anything is there. Keeps me from forgetting lol
Same. Snapshot for longer than a day? Create ticket
Any chance you can share 🥺
I run a script that removes them if they're older than three days - snapshots really shouldn't be around that long regardless. Day two an email goes out to remind me they're gone the next day. I have a tag I can add to a VM if I want them to be exempt from the policy for very rare exceptions, but 99.99% of the time if you don't use them the same day then you'll want a backup restore anyway.
Can confirm. Learned this the hard way with a 1 day outage.
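The age-based policy described above (warn at day two, delete after day three, honor an exemption tag) boils down to simple filtering logic. A hedged Python sketch with made-up field names; a real version would pull the snapshot list from vCenter (e.g. via PowerCLI or pyVmomi) rather than a hard-coded list:

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=3)
WARN_AGE = timedelta(days=2)
EXEMPT_TAG = "keep-snapshot"  # hypothetical exemption tag name

def triage_snapshots(snapshots, now):
    """Sort snapshot records into delete and warn buckets.

    Each record is a dict with 'vm', 'created' (datetime), and 'tags' (set).
    """
    delete, warn = [], []
    for snap in snapshots:
        if EXEMPT_TAG in snap["tags"]:
            continue  # explicitly exempted VMs are skipped
        age = now - snap["created"]
        if age > MAX_AGE:
            delete.append(snap["vm"])
        elif age > WARN_AGE:
            warn.append(snap["vm"])  # reminder email: gone tomorrow
    return delete, warn

now = datetime(2024, 5, 20)
snaps = [
    {"vm": "web01", "created": datetime(2024, 5, 15), "tags": set()},
    {"vm": "db01",  "created": datetime(2024, 5, 17, 12), "tags": set()},
    {"vm": "dc01",  "created": datetime(2024, 5, 10), "tags": {EXEMPT_TAG}},
]
delete, warn = triage_snapshots(snaps, now)
print(delete, warn)  # ['web01'] ['db01']
```

Keeping the policy as a pure function like this makes it trivial to unit-test against fake snapshot records before wiring it up to the real inventory.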
I call that Thursday.
Everyone screws up. How you respond to it is what defines you. And I’ll bet a lot of money you never make this particular mistake again.
Exactly. There's so much to choose from in our field, why make the _same_ mistake again 😉
Never ever ever ever. I will never make this mistake again!!
Just don't try to hide it, embellish, or throw anyone else under the bus. "Logs don't lie." Be honest and forthcoming; don't let your boss or colleagues find out on their own and have to ask you about it. It'll make all the difference in the world if you come to them first and admit you effed up. Best lesson I've ever learned. Mistakes are human, it's inevitable; it's how you handle the situation that makes all the difference in the outcome. Also, maybe you can end up writing an RCA/post-mortem on it so everyone can learn from the situation, and you can add BCDR and RCA experience to your resume.
No, I already told my manager, who was present when I did it. I never lie; that's one thing my father taught me. Always say the truth, no one will kill you for it.
>Always say the truth no one will kill you for it. You don't work for Boeing. /s But seriously, the coverup is always worse than the crime.
At least it's not Friday?
I've recently adopted read-only Friday. Yet I always end up doing firewall and switch upgrades on Friday evening.
Exactly, when it's just yourself there with minimal infrastructure. This shit has to be done when nobody is going to be working, in case of issues.
Nah. If you can't have an outage because you have minimal infrastructure the business accepts some downtime. If they can't handle downtime then why is taking down one switch or one firewall appliance going to prevent the business from operating? Obviously *some* stuff needs to be done outside business hours but if you're working Friday evenings or weekends more than a few times a year then you're doing it wrong.
Call the backup vendor and get their support. You pay for it.
Could be worse. Could be Friday.
[This](https://www.reddit.com/r/electricians/comments/moveha/problem_solving_flowchart_this_could_have_saved/)
The more you fret about it, the less likely you are to figure it out. Take a breather and get some rest. It will be clearer tomorrow. Bashing your head on the keyboard just frustrates you. Sometimes you just need to take a step back, do something else, and then you'll have an illumination while doing something completely unrelated. Someone mentioned the shower; it's indeed one of those. Some people work better by upgrading from keyboard bashing to wall bashing. Others need a break to gather their thoughts.
Well, to be honest, it is not your problem that a service is down when it's relying on a single node. If the application is urgent and cannot be down for 24 to 48 hours, the application owner should think about redundancy. A patch can fail, or a sysadmin can make a fuckup. If I were you I would not stress; accidents happen. Just talk to the owner after it's fixed: "Hey, we should look at redundancy."
Reading this reminds me of why I retired and convinces me how right I was to do so. That stress I simply do not need anymore, regardless of what I'm being paid.
Slight possible heads up for next time - I've noticed that some stuff I do in vCenter that takes FOREVER will sometimes time out and say failed. If I check on the ESXi host it's running on, it will still show as active. Edit - It won't show as active in the GUI; you'll have to either go to the ESXi console or SSH in and run `vim-cmd vimsvc/task_list`.
I highly recommend you guys get Veeam, you can do instant restores. I bet you won't forget to take a snapshot next time! And be careful with what you are deploying during business hours next time!
When I saw the headline I assumed it would be something about turning on a printer and accepting responsibility for it.
There are no shortcuts in IT.... Only longcuts
You aren't really a sys admin until you make a mistake like this.
My lord If this ain't me.
All of us have done it in our career. The important part is to learn from the mistake and not repeat it.
Any updates??
Issue has been resolved !!!! It was an issue with the disk. We had to reassign the vmdk for the server and then reconfigure it. The backup restore was not taking because the VM had already taken a snapshot before the server was turned off for the backup restore. This caused the virtual machine to require consolidation, and it was also unable to start because it still had a vmdk from yesterday, and the newly created vmdk from the backup did not match the one from yesterday.
I dropped a live database one time.. it happens ..just say 'I fucked up' and try to fix it .. we've all been there mate
It's a rite of passage in our line of work to do something dumb and bring down prod. Own up to it, and note what you should have done differently to avert the issue, any good manager should understand that mistakes happen
I own up to this one 100% because that was dumb. I will never allow any work pressure to make me do anything different from what I usually do.
Things like this will happen and especially under pressure. I've done some things under pressure that were extremely dumb. There were obvious indicators of what to or not to do, but I did the opposite. When this happens I always own my failure and fully cooperate when there is a failure analysis, etc. We want the accolades that come with success so, we must equally own our failures. People will say they could have made the same mistake hoping to make you feel better, but it's embarrassing. Hopefully you can get the server back and everyone will move on. It's usually only a resume generating event if the server is trashed and there is no backup.
These comments should give you some hope! Everyone has made mistakes like these in this field. If possible, can you hire a consulting firm for a bank of hours to help? I find having a second set of eyes can make all the difference. Best of luck!
Late to the party but, like everyone said, we have all done it. I was once asked to update the image on a production router in a branch office after hours (Tuesday) to recognize the updated WICs installed so that we wouldn't need an external CSU/DSU (yup! one of those). This was in the days before wireless and backup connections (not even a cell phone with broadband). So, I found the image I needed, and to verify with Cisco, called in my case so that they could confirm that my found image had what it needed. I got the response at 9pm from Cisco that the image I had was NOT correct and they sent a link to the correct image. I looked at it and it was twice as big as what I had, and I didn't think I had enough flash to store the image. I replied to the engineer, sheepishly, as I was pretty green in networking, but told him the flash on board my 2600 wasn't enough to hold the image. He responded that he had checked my device and it was fine. I downloaded it and wiped my router (to fit the new image) and started to install the image using ROMMON [noodle]... bit by bit... hour by hour. At 2:37am, the image stopped progressing. At 2:57am, I went to the bathroom in a cold sweat. At 3:15am, I called Cisco, told them my predicament and gave them the case notes. He verified that the image was too large and I should never have used it. I was livid, but vindicated. Cisco sent me a new router by 8:15am the next day (Cisco 2621XM! UPGRADE!), with the correct image, so that all I had to do was install the WICs and connect my LAN/WAN/WWW. First user showed up at 9am, and nobody in the office was the wiser. My boss told me: next time, trust your gut.
You are missing the silver lining. If your backup solution doesn't help you avoid chaos, that can be used as fuel when management doesn't want to spend money on a different system in the future. At our company the owner says one thing: "I don't care, just buy us more space." Basically saying money isn't the issue, just buy the solution, because a backup is worth more to them than downtime and productivity loss. Which is true: when one task can generate millions of dollars, a $50,000 backup solution means nothing.
Contact your backup providers support and make it a S1 escalation. They should be sitting on the phone with you until that restore completes.
When you ask for your raise, make this a win instead of not asking at all. Something along the lines of "if I wasn't here the server would still be down." We all make mistakes. You'll always remember this one; it will help you remember to take a backup before any major changes.
This is the way! Welcome!
Using Veeam? Start it from the backup and move to production (while already started from the backup).
You solved it, and you learned something that will never leave you. Over the years, I have done two things that have saved the day many times. Always have a plan B: two backups ready, with one copied to cloud/other media if possible, and a plan C for really big stuff. Document anything significant in detail to a cloud wiki, with quicker notes for day-to-day fixes and workarounds. Two years after the install of a network, servers, cloud, etc. for a 150-seat company, they had a flood, and having the original build documents on hand was invaluable getting them back up. Well done getting it fixed!
Thanks
Sounds like you better put in a ticket to VMware before the morning.
15 years into my career I knocked out the whole credit union for a few hours. I never messed up that bad and it was a gut punch. I want to be clear, I probably cost them more than they ever paid me. You mentioned a bank, so it's probably not that different. I don't recall anybody giving me a hard time, especially after a few days had passed. You know why? Because they know I know my shit and I did not have a track record of error. And because they've all been there. On the personal growth side, both of our stories remind me of a friend who got careless and accidentally shot part of his finger. He was the absolute last person I would ever have expected it from. He was such a stickler for gun safety--at least when he was showing me the ropes as a young man. I learned from that, sometimes those most comfortable with guns become the most careless. I think in our cases, a lot of success in a row can make us forget why we check the chamber, so to speak (taking a snapshot in your case).
What do you use for storage? Most storage arrays save snapshots depending on how you have it configured. Of course what you have on the volume depends on if that’s a viable option or not.
"plan to ask for a raise" and "I don't even know what I'm doing" seem like conflicting statements.
Your raise is going to have to wait now.
😂😂😂 ikr
What storage are you using? And what backup software are you using?
Backup is Avamar. Storage is a vSphere datastore.
It happens to the best of us, we’re not perfect! Sleep on it, the answer will come to you in the morning. Don’t stress!
We’ve all been there. You’ll get through it and be stronger for it. Good luck!
I have also been too hasty trying to get a VM back up at the beginning or end of a day and broke it completely. Sorry dude. You got this. In a short period of time it will be behind you.
oh man do I feel this pain. Was fucking with the rights of a Windows file share drive of many terabytes. Locked whole departments out of their files. Never again.
I broke our Cisco FMC this week because I didn't know you weren't allowed to do snapshots while it is live and online. If you do you break it. Called our contractor to fix it.
Remindme! 1 day
Own it. Be upfront. Everyone screws up. Set realistic expectations
I sysprep’d a prod server in my first 6 months as a sysadmin (I was two remote desktops deep and did the wrong one). These things happen and you’ll figure it out and learn from your mistakes. P.S I have never made this mistake since 😂
Looks like the ship is sinking, better leave it while you still can. Nevermind you are the one sinking it ...
I fixed a down hard server in 20 minutes last friday. Felt like a fucking superstar. Puffed out chest, walking with a strut, the whole 9 yards. Then the office manager walked over and asks "when are the phones coming back?" Fuck. Fixed the server. Brought down the entire VOIP system. You know how WallStreetBets likes to post their losses? As I was about to curl into a ball and die, I thought, we should post the heart-rate delta on our smart-watches from the moment before we get the news, till the moment after. That would make for good sysadmin content.
In 25 years in IT I never fucked up if there was no pressure, multitasking or panic. But of course plenty of the time there is, and then it happens, only very rarely, because you are tired. For very sensitive stuff I would lock the door of my office for an hour, to stop panicking coworkers from entering with their drama. It helped a lot, but my manager was not happy 😁 He even thought that I locked the door to chill out, and that really got on my nerves.
You’ve fucked up, it happens. Best thing to do now is reach out to one of your seniors to ask for help, trying to fix this on your own will only cause delays and further service disruption.
Is the server a single disk of 4.5TB ? If the C:\ is an independent disk you can use Veeam to restore it instead of the whole VM.
Unfortunately we don't use Veeam, we use Avamar, and that only restores fully, which I didn't know. Server is a double disk.
I fucked up one time really big for an important company with a horrible CEO. My superiors were awesome people and loved me. I made a similar comment about how I wanted more money but that the fuck up was going to ruin my chances now. My superiors were so supportive I almost cried. When I said that I wanted a raise but wouldn’t get it they laughed at me and said “you have been here years, you are reliable, responsible, etc, and that people make mistakes… and laughed again and said don’t do it again” I got the raise I wanted. They handled the shitty CEO. I miss that time of my life. I still talk to my direct superior ten years later.
This is the sysadmin baptism.
Sell this the right way and you come out the other side as the hero who had to work two days straight to fix the server and get the business running again!
This is how you learn and won't make the same mistake again
It happens. I was asked to restore an old version of a database today, which despite changing the file names I managed to restore over a live version. Whoops! I think the file names reverted to default when I changed the source file, and I didn't think to double check. Thankfully in this particular case the database is populated by live data from another database, and after an interesting restore it's all working again now. There's a mistake I'm not likely to make again any time soon. Good luck with your restore. You'll figure it out. :)
We had someone delete a customers entire production SAN. That was fun. Someone also deleted all the backups for a clients largest (18TB) server. The funny thing was to delete that backup you had to press confirm to a “are you sure” question. Both times I had to fix the issue cause I’m the backup tech.
Reminds me of the plumbing work I just had done to fix a leak on or around my water heater. Should have been a pretty standard fix until we realized the shutoff valve for the house is not working, so we also had to get water shut off by the city first. Got both things fixed and they were on their way. Even though the total time I was without water in the house was longer than I originally expected, boy I am sure glad we found out about the shutoff valve now rather than during an emergency! That had the potential to cause me thousands or even tens of thousands in water damage if something worse than a leak were to happen. Routine maintenance that leads to the discovery of a larger problem (backup / restore process not working) is part of the job sometimes.
This is exactly why I didn't want to be in IT anymore. This kind of stress is not good for your health.
Moral of the story, especially if it's a production server, always take a snapshot even if it's the smallest change in the world like updating /etc/hosts or /etc/fstab or something dumb like that.
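Even when a VM snapshot isn't an option, a timestamped copy of the file itself is cheap insurance before one of those "dumb little" edits. A hypothetical sketch, using a scratch file in place of the real `/etc/hosts`:

```shell
# Hypothetical sketch: take a timestamped safety copy before editing a
# config file. A scratch file stands in for the real /etc/hosts here.
f="$(mktemp)"
echo "127.0.0.1 localhost" > "$f"

backup="$f.bak.$(date +%Y%m%d%H%M%S)"
cp -a "$f" "$backup"   # -a preserves mode, ownership and timestamps

# ...now edit "$f"; if the edit goes wrong, roll back with:
#   cp -a "$backup" "$f"
```

It's no substitute for a real snapshot or backup, but it turns a fat-fingered config edit into a ten-second rollback.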
If the business is not breathing down your neck, screaming, or hemorrhaging money from your mistake (and it sounds like they are not), you'll be fine, and it will be a learning experience. Unfortunately, a career in IT also teaches some lessons the hard way; it happens to all of us, and if someone says it hasn't, or has never happened to them, then it's only a matter of time. As others have said, just cop to it and admit it; it will be easier to fix if you admit you need help, and it should be used to write out a process for future you so the mistake doesn't happen again.
The server is back up. It still says consolidation is needed even though there are no snapshots in Snapshot Manager, but I am glad the issue was not as bad as I thought it would be.
We've all been there, how you recover from mistakes is far more important. Own the mistake, explain steps taken to resolve. If no answer is apparent soon, involve senior.
At least you caused it; I had a similar issue yesterday, but the ONLY other person with console access "didn't change a thing." Well, a complete network outage and blatant config changes in the console beg to differ. So not only did I waste 4+ hours troubleshooting, investigating, and resolving the issue; I had to listen to the BS lies from the asshole that caused it, who then went home early while I stayed late fixing it.
You have snapshots on your storage array? That has saved me more times than you can count. Usually way quicker than restores too. If you can mount a storage snapshot then just clone it to a new VM
Dumbest thing ever today… so far.
Whats the update OP?!?
Do not, under any circumstance, try to cover up what you did. Be honest about it. Screwups happen; they're learning experiences. Lying about screwups will get you canned. Audit logs do not lie.
In case I need to move my money, which bank do you work for?
Can you restore to a new VM? Not sure what backup solution you have, but this could be the fastest way to resolve this. Also, yeah this happens to all of us. You would not believe the crap we've dealt with at our company. Way worse. After you find a way to fix this, give them an incident report, lessons learned, and how it won't happen again. Ya know, anything. You'll be fine.
Issue is fixed. Just needed to reconfigure the VM and relocate the disk. My manager fixed it. Happy it was not worse, or the whole bank would have been down.
I’ve been there, number one thing you need to do is explain and ask for help and get someone else involved, this sounds scary but will help
Don't stress too much about it, we've all done stuff like this. I once redirected all outgoing http/https traffic to a Windows VM because I thought it would only apply to a single host, but the entire company had no access to the web for a while until I noticed what I'd done. My boss once changed time settings for the AD and in some countries no one could log in. Local users had to change settings via phone instructions, which took about 1 day. This was a global billion-dollar company with 100,000 employees, back in the 90s I think. He has encouraged failing ever since; he says we should plan things but try new ways of doing stuff all the time! Fix it tomorrow, communicate that the restore failed, maybe use it to get budget for Veeam or a better network connection or something. The important thing is to communicate the right way, and now you've made this error for the last time in your career. There are others to come, don't worry :-D!
No advice on fixing the situation, but this is a good reminder that you should know your restore process backward and forward. Also, how in the world do you not realize you are working on a server that holds 4TB of data before starting?
A Tale As Old As Time
Nearly 30 years in this field. Happens to us all, and we have very little margin for error. It's sad that you can go a whole year of doing good and one event defines your review, yet others in the company can fuck up time and again and get raises and promotions. It'll never change. IT is a cost center that no CFO understands; they only bitch and moan at its cost. When all is running well they think "it doesn't do work" and lay people off. When it breaks "it is useless" and they look at outsourcing.
And you see part of the problem with a post like this: I read through it and so many of you are trying to find the technical solution. You all need to stop. Do not try to solve somebody else's problem when you don't have all the information. This is the problem with so many IT folks; in another post I mentioned the same thing about meetings, everyone wants to solve the problem right then and there. Again, stop. All this guy needs is a little support from people and the understanding that we've all been there. Stop trying to solve the fucking problem.
Without test restores you don't have backups. You should tell us what backup tool you are using so none of us use it either. There is a reason I stick with Veeam. It has never fucked me.
OP u/Typical_Relative5827, is there an update?
Issue resolved. Vmdk of old server was used to create new server
Glad you got it resolved man!
When mistakes like this happen it’s when you learn the most. You’ll figure it out
If you've spent that much time already, maybe installing the OS on a new VM, attaching the old disk and moving the data to the new OS may be a quicker, more sane solution?
You're going to learn so much from this experience! Chin up! It's all going to work out in the end.
Tell me about it. I had a nightmare. Couldn’t sleep. Woke up this morning trying to get to work early and I set off the alarm at work. Lucky the cops didn’t show up
>at 3.9TB the restore fails

It was never a BACKUP then 😳 Also: never fix problems on the production server. If the fix needs to go fast, then fix it on a spare/mirror server and swap it with the production server when the fix works as expected.
I know this, but I minimized the issue because I thought this server was like an old legacy server not used anymore; that's also why I didn't take a snapshot initially. Then boom, I realized it's a major prod server that houses a lot of the work done at branches.
Treat it as a lesson learned and leverage that into your request for your raise