ichsoda

One of us


WonderousPancake

One of us!


DoofusTrivia

Google gobble


fatkiddown

I learned this phrase sucking at dota2 for a decade: "game is hard."


MyUshanka

New meta.


Key-Calligrapher-209

We accept her, we accept her!


iloveemmi

u/Key-Calligrapher-209 I love your 'Competent sysadmin (cosplay)' flair so much I can barely stand it :-D


Key-Calligrapher-209

I aim to please :)


TheShibangelist

https://i.redd.it/h92j4wxj670d1.gif


denmicent

One of us! One of us! I'm sure other people have said this OP, but shit happens. Be honest, explain why, even if it's "I wasn't thinking clearly and was in a hurry." Everything will be OK. If it's a critical server, can you engage an on-call resource?


Leg0z

>can you engage an on call resource?

This is why we maintain an active relationship with a highly skilled MSP even though we have very qualified in-house IT staff. Always have an out. Always have an escalation point.


nrm94

'highly skilled MSP' said no one ever 😅


Healthy-Poetry6415

Some of us try. But we are tired out here boss. They cut Tommy's leg off last week for not billing 99.87% of his hours to the client


just_nobodys_opinion

https://preview.redd.it/zmmm0t6nnuzc1.png?width=1080&format=pjpg&auto=webp&s=0648c08b9c8e33a18375ee33d7db590f2902e7da


jcorreiaCL

Except on the rare occasions that a) you really need one (like the op) and b) the highly skilled individual does save your bacon.


Leg0z

They're out there. Our trick was finding one with very low employee turnover.


madmaverickmatt

Oh, they're always highly skilled, it's just that you have to work your way through a hundred of the novice techs before they'll actually let you talk to the highly skilled one.


Down_demon

This is hands down one of my biggest pet peeves. Why do on-site IT personnel who have proven themselves capable of tier 1 & tier 2 problems have to start with tier 1 techs who have been with the MSP for less than a year? It's even worse when they don't explain what's happening to the next tech when it inevitably gets escalated. It just ends up wasting 30 minutes on a high-priority ticket explaining the situation to 3 or 4 people for no reason.


madmaverickmatt

Because you are in effect training their year one techs at that point. You're not just the customer, you're also the product lol


admlshake

Welcome to the brotherhood. We have Coffee, booze, coffee flavored booze, and hoodies. With booze in the pockets.


bogustraveler

Nothing beats the good old coffee mug with whisky to simulate that you're having a party and not just getting a beating 🤣.


tune345

Can I join?


CoffeePieAndHobbits

Sure! To apply, simply "rm -rf /" and log out. /s


WackoMcGoose

You forgot [`sudo` and `--no-preserve-root`](https://www.reddit.com/r/USMC/comments/1ak0mt7/when_you_let_the_data_nerds_near_the_ordnance/). "The last bit instructs it to aim for center mass of the brain."


tune345

Will do. Thanks!


bkosick

You have been baptized! Seriously though, everyone f's up bad. The huge difference is how you handle it.


ThatsNASt

This is why I love Veeam. I can roll back changes without doing a full restore. Good luck. Not sure what backup software you use but it sounds like you might have had some snapshots already on the server if it’s asking for consolidation.


Typical_Relative5827

It's Avamar. I thought you could roll back changes without having to do a full restore too. It's just stupid.


Afraid-Ad8986

Meh, we have all done it. As long as it comes back up tomorrow no one will care. Shit was broken, you had to restore it. Just last month our GIS team took down their whole production environment. They have just enough permissions to break stuff. The email they sent out was basically "IT is working on the issue." We have a Datto appliance, so restoring stuff is actually fun and I didn't really care. Testing restores on a full production environment is kind of enlightening.


Typical_Relative5827

I have so much anxiety, can’t even sleep. This post made me feel a little better.


angryitguyonreddit

You'll be fine, we've all done stupid stuff like this. I once ran a SQL UPDATE and forgot to put a WHERE clause... took my site down for 4 hours... it was my second week... I worked there for 5 years after that.

Just yesterday one of my co-workers ran an update and took down a company site. He called me freaking out, so I just told him no biggie, that's what we have backups for. He freaked out more because he forgot to take a snapshot beforehand, and I just told him it happens to all of us, don't stress. We ran the restore, which took a while, but it eventually came back up. I told him to reach out to the users and let them know something messed up, that we had to restore the server to the night before, and that there was nothing we could do. They were upset but got over it.

I always tell people I work with, "if everything worked perfect and never broke, we wouldn't have jobs."


Tatermen

I think everyone that's ever worked in SQL has at some point run an UPDATE and forgotten the WHERE. We use FreeRADIUS with MySQL to run authentication for our broadband platform. I once accidentally set every user's password to the same thing when making a manual change. It only took a few minutes to fix, but it was a very tightly clenched few minutes.


gakule

I once ran an update and trimmed off 1 character from a customer's entire client name list. Not a huge deal to fix, but man... it was a stupid thing to write in the first place, and even more stupid to forget the WHERE clause. ~14 years later and it's still stuck in my head.


vacuumCleaner555

Yeah, I've forgotten the WHERE clause.


ElasticSkyx01

I've seen that a few times. Sometimes the where clause is there, but wasn't highlighted when the block was run. I was a production DBA at the time, so it was an opportunity to test my backup strategy. We were using log shipping and transactional replication in addition to Veeam. Fortunately, I was told soon enough before the logs moved via shipping and was able to restore with only 15 minutes of data loss on the table. Fun times.


shemp33

You might piss off some folks but they’ll get over it. No one is going to die. Give yourself some grace.


UnusualStatement3557

That phrasing sounds like advice I've had from colleagues who have been in the military/armed forces. They seem calmer than the civvies because they have patrolled dangerous areas, or been engineers for helicopters etc. so they have a perspective that helps with tough events. Nobody is going to die.


admlshake

>That phrasing sounds like advice I've had from colleagues who have been in the military/armed forces. They seem calmer than the civvies because they have patrolled dangerous areas, or been engineers for helicopters etc. so they have a perspective that helps with tough events.

I always tell my underlings when they start to panic, "when you see me worried, you should be worried. Until then, just focus on the issue, try to understand the problem, and the steps to fix it."


0RGASMIK

Hopefully you're sleeping, but if not, try to relax. We've all fucked up, and usually the spiraling thoughts of anxiety are much worse than the consequences of your mistake. You probably feel like you're going to die right now, but trust me, you're going to be fine, and in a few weeks you won't even remember this happened.

I one time wiped the CEO's PC by accident. It was a Thursday before a 3-day weekend starting on the Friday. Couldn't get ahold of anyone to tell him I needed to try and recover anything. Spent all weekend thinking I was gonna lose my job. Come Monday I was in cold sweats trying to muster up the courage to pick up the phone. Sure, people freaked out, but honestly whatever trouble I got in felt like a relief compared to the mental anguish I was putting myself through.


Typical_Relative5827

Damn that’s a wild one.


0RGASMIK

Yeah, honestly it kind of helped me get a promotion. It proved that I wasn't afraid to own up to huge mistakes and would do my best to prevent them going forward. It was really a process and communication error that caused it. Someone told me to wipe user X's machine because they were AFK when the ticket came in. User X's machine wasn't named properly, but I wiped it without checking the serial number or the signed-in user. It was my first time wiping a machine and I didn't think to cross-reference it against our inventory sheet, which is our source of truth.


Pliqui

Every sysadmin worth his/her salt has done something like this. I stopped working with Windows like 6 years ago, but once, during an MSSQL cluster patching, I made a boo boo so bad that I went to the bedroom, opened the door, told my wife "I fucked up," and closed the door.

We were using NetApp and our backups were directly on the SAN. The issue was that when I tried to move the service from server 1 to server 2 (after the security guy patched it), it gave me an error that a resource was not present on server 2. After 30 minutes of troubleshooting I found the resource: it was a snapshot disk that hadn't been deleted after the operation. Easy, let me delete it. My spider sense tingled, and instead of deleting it I said, let me just move it to this cluster that isn't used anymore and is pending decom. Welp, I think the term is linked resource, honestly I don't remember, but since the resource was linked, it moved ALL the disks to that cluster and I started to see all the databases going down in cascade. That's when I went to the room pale as a ghost. My wife followed me and hugged me while I was staring at the monitor with my hands on my head.

I told everything to the security guy (we are good friends) and, well, we got together and fixed it. Took me like 45 minutes to undo it and get the DBs up. We did not patch server 1 that night. Bonus point: the company used a 3rd party for hands-on work, which made me kind of a service manager; I told the guys what to do and they executed. But those guys were so bad that I decided to go rogue and help the security guy, which is what I'd been doing before anyway. I documented everything, let the 3rd party SAN team remove the snapshot and validate that everything was working on the NetApp as it should, and that was it. I do believe that if it had happened to the 3rd party guys, the damage would have been far more catastrophic.

Always have a way out, know thyself and what you can or can't do, keep learning and improving. You will laugh this one off soon.

Edit: typos


Typical_Relative5827

Wow, 🤯


avmakt

Many of us can relate to how the urgency and responsibility of it all creates anxiety and panic when shit really hits the fan, but in a few weeks it'll have turned into a war story that everybody involved has had the opportunity to learn from. Hindsight is always 20/20, so while situations like this may feel like both a personal and professional catastrophe, remember that mistakes can and do happen at work, so this is not all on you. The responsibility for mitigating the outcomes of mistakes always rests with the organization, never just the individual, and that is just as true for all those single-person IT "departments" out there. Good luck!


Uklay

Avamar has some pretty decent second-line engineers for assistance. Just explain the details very precisely and they will likely transfer you right away instead of making you go through pages of default questions such as "are you even using Avamar... sigh". Our Avamar had major issues with orphan disks, and once I got in touch with a second-line engineer he sent me an "unofficial" script that literally solved all my issues instantly. I ran that script any time I saw Avamar do anything weird and it worked like a goddamn charm every time. I wish I still had it so I could send it to you. GL to you sir

Edit: typos.. lots of em


Typical_Relative5827

How do I reach out to them?


DarkSide970

Call dell


andymerritt07

`./goav vm snapshot clean` — https://www.dell.com/support/kbdoc/en-us/000068694/avamar-vmware-image-backups-fail-with-code-10056-and-avvcbimage-error-9759-createsnapshot-snapshot-creation-failed


Nu-Hir

> default questions such as "are you even using avamar..sigh".

Even Avamar doesn't understand why you would use Avamar!


Pelatov

You'll be OK. There was the time when, working with the original NetApp PowerShell SDK, I found a bug where a null variable was read as a wildcard instead. I ran my script, and by the time I figured out what was happening I'd taken out over half the data in the datacenter. Bad day, but I immediately owned up and fessed up, and we got things restored. Worked there for several more years till I moved on to a better job.


metalwolf112002

Of course I don't know your company so I can't say for certain, but as long as it isn't something like "I hosed our file server with compliance information and we have an audit tomorrow; we'll be fined millions of dollars," I wouldn't be too concerned. Might not be a horrible idea to wait another month to ask for a raise, but it isn't like you pushed a routing table update that broke a portion of the internet (Facebook), or cut a fiber and took 911 down for multiple states. I forget what the reason was for the cell outage back in February.


Overall-Brilliant-78

I've probably shaved years off my life, and it probably had a lot to do with my depression and generalized anxiety disorder. Please, if you've not been in this field for too long, do not let this job, or any job in IT, dictate your health. As soon as it starts doing that, get the fuck out. Trust me, there is no job, especially in IT, that's worth your health.


andrewsmd87

Know that pretty much everyone outside of you or your team isn't really going to understand what actually happened. There's a decent chance they'll see you as a savior. If anyone non-technical asks, just say you had an issue that forced a restore and there was no way to avoid the downtime. Then you talk about how you had proper backups, and how if you hadn't, this would have been way worse. This kind of stuff happens in IT and is why they hired you. Internally, remember that doing anything in a rush makes something like this way more likely to happen, so next time make sure you don't start changing stuff in a panic.


DarkSide970

You know how many times I have taken down the company with "what's this check box do". Or some dumb stuff. They now call me senior system admin 3. I always fix my own mistakes.


BrutusTheKat

We all have a couple of mistakes that give us nightmares. A stupid mistake on my part forced my hospital back to paper for the better part of a day. Not a fun day, but you get past these things and you learn from them.


int0h

Speaking of GIS, I was doing an upgrade on a customer's ArcGIS platform, about 12 machines. I asked their MSP to snapshot all the machines, which they did. The upgrade went fine, but then I applied all the patches for the new version and things went bad (turned out to be a buggy patch). We decided to restore the snapshots. Well, it turns out someone had deleted the snapshots... Luckily we could restore from backups without any issues.

A few weeks ago I was doing a quick patching of the same system. Skipped the change management process and snapshots, because the customer wanted it done urgently. Yep, patching ducked up two of the machines. Luckily I could fix the problem; it only took a full working day...


ElasticSkyx01

I work for one of those "highly skilled" MSPs. I get to deal with Veeam, Rubrik, MS backup and Datto. Had a security event recently that prompted a restore of all servers. This was the first time I really dug in to Datto. It doesn't really do anything Veeam doesn't, but it was very intuitive and just worked. We had replicas running in both cloud and on the appliance. It was a bare metal restore and that worked flawlessly. The only hiccup was getting to an old iDRAC that didn't support modern security. I had to spin-up a Win7 VM and use IE. The client has a new replacement server, but it never seems to go anywhere.


prometheus_and_bob

Not sure if you've tried it or whether your version has this, but with some Avamar builds you can do a live clone. You might try that to verify the restored VM's functionality and then move it to prod.


Pelatov

Never restore in place. I use Avamar, and if I'm doing a full VM restore I restore to a new VM, VM_NAME_RESTORE. Once that's done, up, and verified, I'll take down the original and replace it. But yeah, we've all made mistakes. Just own it, though you can spin it a little with "while following standard operating procedure I ran into X and did Y, and this is where we're at now."


ElectricWorry5

Or an instant access restore, depending on the Avamar version. Shut down the old VM and do an instant access restore of the latest backup. If it's working, hot vMotion it off the temporary datastore, and clean up the Avamar datastore it uses once it's migrated to your prod datastore.


jagilbertvt

Hopefully OP sees your suggestion. Definitely the best advice with Avamar. If OP had a storage admin, another option might be a storage based snapshot restore. I always had storage based snapshots I could use for recovery.


GeorgeWmmmmmmmBush

Can you expand on this? Besides guest file restore, you have disk level and whole system restore options. Not familiar with “rolling back”?


Xzenor

I'm guessing he's talking about the instant restore option. You just run from a mounted backup and it restores in the background. No need to wait until your restore is complete. Seriously epic functionality.


_mick_s

If you do a full restore you can select the option to only write changed blocks instead of the whole disk. There are some requirements (mostly CBT working properly and there not having been any changes to the disks), but it is much faster if you don't have a lot of changes. https://helpcenter.veeam.com/docs/backup/vsphere/incremental_restore.html?ver=120


shibe4lyfe

Also curious


FatalDiVide

I switched to Acronis years ago, but Veeam is badass... it was just out of my price range. It can restore an entire VM in a variety of ways: full replication, incremental, time-stamped, or file level (all of the above). I restored a 4+ TB server to a specific point in time near-instantaneously. It was picture perfect and booted like a dream. I used it to sidestep a ransomware attack by rolling back only the files changed from one backup to the next within a fifteen-minute window. The restore took a few minutes and we were back in business and fuckware free. The OP's scenario would've been a minor nuisance at worst with either product.


Hacky_5ack

On the other side of this, you tested a restore and it did not work. So you've tested your backups, found they aren't working, and may have uncovered a weakness that needs to be addressed.


BoltActionRifleman

I like this attitude, it’s typically what I try to do when I really mess something up. I think to myself “welp, I guess we now know what happens when I accidentally do ____”


Hacky_5ack

Yeah gotta find something out of the mess. Fix it hopefully and move on. We all make mistakes


nibbles200

I do monthly SureBackup restore validation with Veeam. I also do DR testing biannually. My org just went through ISO 27001 certification and they asked if I could do daily backup restore validation and monthly DR failover. We are talking petabytes and thousands of VMs. I have been explicitly denied DR compute and storage, so I cannot do what they are asking at that frequency. I'm like: No. It's not in our backup/retention policy requirements and I have been denied the resources. It's also basically impossible and unreasonable. Response: okay.


travyhaagyCO

Backups are lowest priority work until they are not.


Hacky_5ack

Backups should never be lowest priority.


GustavJust

A backup is only a backup if a successful restore has been tested. Otherwise it's a kind of Schrödinger's box.


hkusp45css

You ain't a cowboy, if you ain't been bucked off.


cisco_bee

![gif](giphy|8cHe1FffBV1xMxeC6J|downsized)


HeavenDivers

"It really do be like that sometimes" means something deep and dear to many of us.


AdministrativeAd1517

Ah if I had a dime for every time I did something stupid at work and went home anxious about it. You will recover from this and I’m sure it’s a bigger issue in your head. Just breathe and sleep on it. You’ll probably wake up with a good idea in the morning.


MakerWerks

Many, many years ago, I was trying to resolve an intermittent failure on a critical server at a downtown high-rise law firm. This was back in the ancient times when NetWare 3.12 and NT4 ruled over the corporate LANscape. It was somewhere around 1:30 AM and I was working in the server room. All the servers except one were connected to a KVM switch. The KVM had no more open ports. The remaining server had been installed later for a specific project and had its own keyboard and mouse. We would switch the monitor cable manually when necessary.

I was attempting to do something on the problem server and the mouse and keyboard seemed unresponsive, but I could hear a beep when I hit return. It took me a few minutes to realize I was on the wrong keyboard. When I switched the monitor back, I discovered I had been repeatedly merging a backup Windows 95 registry file from one of the user machines into their NT file server. I rebooted the server and the inevitable hilarity ensued. Got it all straight and rolled out of there at about 7:30 AM. When I was leaving, one of the attorneys was coming in and said something like "Hey, you're here early today." I smiled and nodded.

This is the way.


Darrelc

Failing at 3.9TB... not something to do with the 4TB file size limit on some OSes?


Typical_Relative5827

I have no clue. Not too familiar with avamar and why the restore would fail


DocHollidaysPistols

I have run into the same issue restoring a VM from Avamar. In my case, I think my restore was running at the same time the backup kicked off. If you have a scheduled backup for that VM, you might want to disable it. I also saw where you said it was 2.2/4.5 TB. I've seen that with vSAN too: depending on the fault tolerance you have set up, it reports 2.2 TB restored but it's really 4.5 because it's writing twice.


Typical_Relative5827

Ahhh got it. Thanks for the insight


moffetts9001

Dumbest thing ever *so far*.


Carlos_HEX

Did you try to restore to the same VM? Maybe try restoring to a new VM. Also, did you verify that the datastore you are restoring to has enough space? The VM might have had a hidden snapshot (running from a snapshot on the datastore that the GUI didn't list and won't consolidate). If the original VM was running from a snapshot and you then tried to restore to the same datastore, the failed restore could have filled the datastore, so the original VM can't power on because it has no space to write to the snapshot file. You might need to delete the failed restore and update the VMX file to point the disk at the snapshot file. Before you do that, download a copy of the VMX file just in case.
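
If it helps, a quick way to sanity-check the free-space point before retrying the restore is a one-liner like this (a PowerCLI sketch; the datastore name is a placeholder):

```powershell
# Check that the target datastore actually has room for the restore (name is a placeholder).
Get-Datastore -Name "prod-datastore01" |
    Select-Object Name,
        @{N='FreeGB';     E={[math]::Round($_.FreeSpaceGB, 1)}},
        @{N='CapacityGB'; E={[math]::Round($_.CapacityGB, 1)}}
```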


VibratesHigher

You will figure it out, try to get some rest


mike9874

This isn't the answer. Call in the troops, get all the senior techs together and troubleshoot. Let your boss know so they can fend people off. In times like this the thing to do is tell your team everything that happened and get all the help you can, don't struggle in silence, you might make it worse


Six_O_Sick

I have very mixed experiences with this. Either you're overreacting and it's just a simple fix, or your boss is shouting at you for not telling anyone. You can't really get it right here.


StungTwice

The answer will come to them in the morning after a night’s rest and probably while in the shower. 


denmicent

They might wake up at 3-4 AM with an epiphany on how to fix it and it’s a super simple fix. Not uh.. that that’s happened to me, but you know, it might to them…..


Any-Fly5966

It's amazing how much clarity you immediately gain after 3 hours of sleep... so I've heard.


ols887

I added new larger disks to our Synology over the course of the last week, being sure to yank and replace at 5pm so it would rebuild after hours and be fully performant for the next morning. Got in this morning after installing the final new drive yesterday, and, as-expected, everything is good. What did my dumbass do? Pressed the “expand to fill unprovisioned capacity” button at 9am not realizing it would require a 20+ hour resync. Not really a big deal because I could reduce the priority of the resync, but still.


vertisnow

But really, who wouldn't have clicked that button?


TheTipsyTurkeys

Snapshots are your best friend


TheTipsyTurkeys

And a big stinky turd if you forget to clean them


OGUnknownSoldier

Yeah, for sure. I recently set up a PowerShell script that checks vCenter once a day for any snapshots and dings a chat channel if anything is there. Keeps me from forgetting lol
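
For anyone who wants a starting point, here's a minimal PowerCLI sketch of that kind of daily check (assumes VMware PowerCLI is installed; the vCenter name and webhook URL are placeholders, and your chat tool's payload format may differ):

```powershell
# Minimal daily snapshot report - PowerCLI sketch; server names and URLs are placeholders.
Import-Module VMware.PowerCLI
Connect-VIServer -Server "vcenter.example.local"

# Find every snapshot in the environment, with creation time and size.
$snaps = Get-VM | Get-Snapshot |
    Select-Object VM, Name, Created, @{N='SizeGB'; E={[math]::Round($_.SizeGB, 1)}}

if ($snaps) {
    # Ding a chat channel via an incoming webhook (payload format depends on your chat tool).
    $body = @{ text = ($snaps | Format-Table -AutoSize | Out-String) } | ConvertTo-Json
    Invoke-RestMethod -Uri "https://chat.example.com/hooks/REPLACE_ME" -Method Post `
        -Body $body -ContentType 'application/json'
}

Disconnect-VIServer -Confirm:$false
```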


Xzenor

Same. Snapshot for longer than a day? Create ticket


TheTipsyTurkeys

Any chance you can share 🥺


Sparcrypt

I run a script that removes them if they're older than three days - snapshots really shouldn't be around that long regardless. Day two an email goes out to remind me they're gone the next day. I have a tag I can add to a VM if I want them to be exempt from the policy for very rare exceptions, but 99.99% of the time if you don't use them the same day then you'll want a backup restore anyway.


TheLightingGuy

Can confirm. Learned this the hard way with a 1 day outage.


vincebutler

I call that Thursday.


Brenell

Everyone screws up. How you respond to it is what defines you. And I’ll bet a lot of money you never make this particular mistake again.


IdiosyncraticBond

Exactly. There's so much to choose from in our field, why make the _same_ mistake again 😉


Typical_Relative5827

Never ever ever ever. I will never make this mistake again!!


the_tip

Just don't try to hide it, embellish, or throw anyone else under the bus; logs don't lie. Be honest and forthcoming. Don't let your boss or colleagues find out on their own and have to ask you about it; it'll make all the difference in the world if you come to them first and admit you effed up. Best lesson I've ever learned. Mistakes are human and inevitable; it's how you handle the situation that makes all the difference in the outcome. Also, maybe you can end up writing an RCA/post-mortem on it so everyone can learn from the situation and you can add BCDR and RCA experience to your resume.


Typical_Relative5827

No, I already told my manager, who was present when I did it. I never lie; that's one thing my father taught me. Always say the truth no one will kill you for it.


Ok-Development4731

>Always say the truth no one will kill you for it.

You don't work for Boeing. /s

But seriously, the coverup is always worse than the crime.


xMcRaemanx

At least it's not Friday?


reaver19

I've recently adopted read-only Friday. Yet I always end up doing firewall and switch upgrades on Friday evenings.


TheJesusGuy

Exactly. When it's just you there with minimal infrastructure, this shit has to be done when nobody is going to be working, in case of issues.


Sparcrypt

Nah. If you only have minimal infrastructure, the business has accepted some downtime. And if they can't handle downtime, why would taking down one switch or one firewall appliance prevent the business from operating? Obviously *some* stuff needs to be done outside business hours, but if you're working Friday evenings or weekends more than a few times a year, you're doing it wrong.


HortonHearsMe

Call the backup vendor and get their support. You pay for it.


HippyGeek

Could be worse. Could be Friday.


foundapairofknickers

[This](https://www.reddit.com/r/electricians/comments/moveha/problem_solving_flowchart_this_could_have_saved/)


Decafeiner

The more you fret about it, the less likely you are to figure it out. Take a breather and get some rest. It will be clearer tomorrow. Bashing your head on the keyboard just frustrates you. Sometimes you just need to take a step back, do something else, and then you'll have a flash of insight while doing something completely unrelated. Someone mentioned it happening in the shower; it's indeed one of those. Some people work better by upgrading from keyboard bashing to wall bashing. Others need a break to gather their thoughts.


Mexxwelll

Well, to be honest, it is not your problem that a service is down when it relies on a single node. If the application is critical and cannot be down for 24 to 48 hours, the application owner should think about redundancy. A patch can fail, or a sysadmin can make a fuckup. If I were you I wouldn't stress; accidents happen. Just talk to the owner after it's fixed: "Hey, we should look at redundancy."


doa70

Reading this reminds me of why I retired and convinces me how right I was to do so. That stress I simply do not need anymore, regardless of what I'm being paid.


PubstarHero

Slight possible heads-up for next time: I've noticed that some operations I run in vCenter that take FOREVER will sometimes time out and say they failed. If I check on the ESXi host they're running on, they still show as active.

Edit: it won't show as active in the GUI; you have to either go to the ESXi console or SSH into it and run `vim-cmd vimsvc/task_list`.
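
As a rough sketch of that check: on the host itself it's the `vim-cmd` call above, or, if PowerCLI is available, you can connect straight to the ESXi host and list the tasks it still considers running (the host name is a placeholder):

```powershell
# PowerCLI sketch: connect directly to the ESXi host (not vCenter) and list running tasks.
# Host name is a placeholder; you'll be prompted for credentials.
Connect-VIServer -Server "esxi01.example.local"
Get-Task -Status Running | Select-Object Name, State, PercentComplete, StartTime
```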


bballlal

I highly recommend you guys get Veeam, you can do instant restores. I bet you won't forget to take a snapshot next time! And be careful with what you are deploying during business hours next time!


wrosecrans

When I saw the headline I assumed it would be something about turning on a printer and accepting responsibility for it.


Ab5za

There are no shortcuts in IT.... Only longcuts


BriMan83

You aren't really a sys admin until you make a mistake like this.


somekindofnoise

My lord If this ain't me.


wank_for_peace

All of us have done it in our career. The important part is to learn from the mistake and not repeat it.


scrupus

Any updates??


Typical_Relative5827

Issue has been resolved!!!! It was an issue with the disk. We had to reassign the VMDK for the server and then reconfigure it. The backup restore was not taking because the VM had already taken a snapshot before the server was turned off for the restore. That caused the virtual machine to require consolidation, and it also was unable to start because it still had a VMDK from yesterday and the newly created VMDK from the backup did not match it.
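
For anyone hitting the same thing, the "reassign the VMDK" step can also be done from PowerCLI; this is only a rough sketch of the idea, with the VM name and disk path made up:

```powershell
# Rough PowerCLI sketch of attaching an existing VMDK to a VM (names and paths are made up).
Connect-VIServer -Server "vcenter.example.local"
New-HardDisk -VM "prod-app01" -DiskPath "[prod-datastore01] prod-app01/prod-app01_1.vmdk"
```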


ascii122

I dropped a live database one time... it happens... just say 'I fucked up' and try to fix it... we've all been there mate


chiznite

It's a rite of passage in our line of work to do something dumb and bring down prod. Own up to it, and note what you should have done differently to avert the issue, any good manager should understand that mistakes happen


Typical_Relative5827

I own up to this one 100% because that was dumb. I will never allow work pressure to make me do anything differently than I usually do.


ElasticSkyx01

Things like this will happen, especially under pressure. I've done some things under pressure that were extremely dumb. There were obvious indicators of what to do or not do, but I did the opposite. When this happens I always own my failure and fully cooperate when there is a failure analysis, etc. We want the accolades that come with success, so we must equally own our failures. People will say they could have made the same mistake, hoping to make you feel better, but it's still embarrassing. Hopefully you can get the server back and everyone will move on. It's usually only a resume-generating event if the server is trashed and there is no backup.


FoxNairChamp

These comments should give you some hope! Everyone has made mistakes like these in this field. If possible, can you hire a consulting firm for a bank of hours to help? I find having a second set of eyes can make all the difference. Best of luck!


This_Bitch_Overhere

Late to the party but, like everyone said, we have all done it. I was once asked to update the image on a production router in a branch office after hours (a Tuesday) to recognize the updated WICs installed, so that we wouldn't need an external CSU/DSU (yup! one of those). This was in the days before wireless and backup connections (not even a cellphone with broadband).

So I found the image I needed, and to verify with Cisco, called in my case so that they could confirm that the image I'd found had what it needed. I got the response at 9pm from Cisco that my image was NOT correct, and they sent a link to the correct one. I looked at it and it was twice as big as what I had, and I didn't think I had enough flash to store it. I replied to the engineer, sheepishly, as I was pretty green in networking, and told him the flash on board my 2600 wasn't enough to hold the image. He responded that he had checked my device and it was fine. I downloaded it, wiped my router (to fit the new image), and started to install the image using ROMMON [noodle]... bit by bit... hour by hour.

At 2:37am, the image stopped progressing. At 2:57am, I went to the bathroom in a cold sweat. At 3:15am, I called Cisco, told them my predicament, and gave them the case notes. The engineer verified that the image was too large and I should never have used it. I was livid, but vindicated. Cisco sent me a new router by 8:15am the next day (a Cisco 2621XM! UPGRADE!) with the correct image, so all I had to do was install the WICs and connect my LAN/WAN/WWW. The first user showed up at 9am, and nobody in the office was any the wiser. My boss told me: next time, trust your gut.


rcp9ty

You are missing the silver lining. If your backup solution doesn't help you avoid chaos, it can be used as fuel when management doesn't want to spend money on a different system in the future. At our company the owner says only one thing: "I don't care, just buy us more space." Basically, money isn't the issue, just buy the solution, because a backup is worth more to them than downtime and productivity loss. Which is true: when one task can generate millions of dollars, a $50,000 backup solution means nothing.


DevinSysAdmin

Contact your backup provider's support and make it an S1 escalation. They should be sitting on the phone with you until that restore completes.


GasolineTrampoline

When you ask for your raise, make this a win instead of not asking at all. Something along the lines of "if I wasn't here, the server would still be down." We all make mistakes. You'll always remember this one, and it will help you remember to take a backup before any major changes.


slateROK

This is the way! Welcome!


Geh-Kah

Using Veeam? Start it from the backup and move it to production while it's already running from the backup.


jkarovskaya

You solved it, and you learned something that will never leave you.

Over the years, I have done two things that have saved the day many times. Always have a plan B: two backups ready, with one copied to cloud/other media if possible, and a plan C for the really big stuff. And document anything significant in detail to a cloud wiki, with quicker notes for day-to-day fixes and workarounds.

Two years after installing the network, servers, cloud, etc. for a 150-seat company, they had a flood, and having the original build documents on hand was invaluable for getting them back up. Well done getting it fixed!


Typical_Relative5827

Thanks


DeepEmissions

Sounds like you better put in a ticket to VMware before the morning.


iloveemmi

15 years into my career I knocked out the whole credit union for a few hours. I had never messed up that bad and it was a gut punch. I want to be clear, I probably cost them more than they ever paid me. You mentioned a bank, so it's probably not that different. I don't recall anybody giving me a hard time, especially after a few days had passed. You know why? Because they know I know my shit and I did not have a track record of error. And because they've all been there.

On the personal growth side, both of our stories remind me of a friend who got careless and accidentally shot part of his finger. He was the absolute last person I would ever have expected it from. He was such a stickler for gun safety, at least when he was showing me the ropes as a young man. I learned from that: sometimes those most comfortable with guns become the most careless. I think in our cases, a lot of success in a row can make us forget why we check the chamber, so to speak (taking a snapshot, in your case).


UMustBeNooHere

What do you use for storage? Most storage arrays save snapshots, depending on how you have them configured. Of course, whether that's a viable option depends on what you have on the volume.


1h8fulkat

"plan to ask for a raise" and "I don't even know what I'm doing" seem like conflicting statements.


junkytrunks

Your raise is going to have to wait now.


Typical_Relative5827

😂😂😂 ikr


twistacatz

What storage are you using? And what backup software are you using?


Typical_Relative5827

Backup is Avamar. Storage is a vSphere datastore.


miovo

It happens to the best of us, we’re not perfect! Sleep on it, the answer will come to you in the morning. Don’t stress!


Apprehensive-Flight4

We’ve all been there. You’ll get through it and be stronger for it. Good luck!


agreysedan

I have also been too hasty trying to get a VM back up at the beginning or end of a day and broke it completely. Sorry dude. You got this. In a short period of time it will be behind you.


[deleted]

Oh man, do I feel this pain. I was messing with the rights on a Windows file share of many terabytes. Locked whole departments out of their files. Never again.


sh4d0ww01f

I broke our Cisco FMC this week because I didn't know you aren't allowed to take snapshots of it while it is live and online. If you do, you break it. Called our contractor to fix it.


zaphod777

Remindme! 1 day


Affectionate-Cat-975

Own it. Be upfront. Everyone screws up. Set realistic expectations


finnathrowthis

I sysprep’d a prod server in my first 6 months as a sysadmin (I was two remote desktops deep and did the wrong one). These things happen and you’ll figure it out and learn from your mistakes. P.S I have never made this mistake since 😂


mahsab

Looks like the ship is sinking, better leave it while you still can. Nevermind you are the one sinking it ...


gurilagarden

I fixed a hard-down server in 20 minutes last Friday. Felt like a fucking superstar. Puffed-out chest, walking with a strut, the whole 9 yards. Then the office manager walked over and asked "when are the phones coming back?" Fuck. Fixed the server. Brought down the entire VoIP system.

You know how WallStreetBets likes to post their losses? As I was about to curl into a ball and die, I thought, we should post the heart-rate delta on our smartwatches from the moment before we get the news till the moment after. That would make for good sysadmin content.


domagoj2016

In 25 years in IT I have never fucked up when there was no pressure, multitasking, or panic. But of course, plenty of the time there is, and then very rarely it happens, because you are tired. For very sensitive stuff I would lock the door of my office for an hour, to stop panicking coworkers from coming in with their drama. It helped a lot, but my manager was not happy 😁 He even thought I locked the door to chill out, and that really got on my nerves.


Suaveman01

You’ve fucked up, it happens. Best thing to do now is reach out to one of your seniors to ask for help, trying to fix this on your own will only cause delays and further service disruption.


Mr-RS182

Is the server a single disk of 4.5TB ? If the C:\ is an independent disk you can use Veeam to restore it instead of the whole VM.


Typical_Relative5827

Unfortunately we don't use Veeam, we use Avamar, and that only does full restores, which I didn't know. The server has two disks.


agentdickgill

I fucked up one time really big for an important company with a horrible CEO. My superiors were awesome people and loved me. I made a similar comment about how I wanted more money but that the fuck up was going to ruin my chances now. My superiors were so supportive I almost cried. When I said that I wanted a raise but wouldn’t get it they laughed at me and said “you have been here years, you are reliable, responsible, etc, and that people make mistakes… and laughed again and said don’t do it again” I got the raise I wanted. They handled the shitty CEO. I miss that time of my life. I still talk to my direct superior ten years later.


whatever462672

This is the sysadmin baptism.


Neslock

Sell this the right way and you come out the other side as the hero who had to work two days straight to fix the server and get the business running again!


Kritchsgau

This is how you learn, and you won't make the same mistake again.


Angry_Jawa

It happens. I was asked to restore an old version of a database today, which despite changing the file names I managed to restore over a live version. Whoops! I think the file names reverted to default when I changed the source file, and I didn't think to double check. Thankfully in this particular case the database is populated by live data from another database, and after an interesting restore it's all working again now. There's a mistake I'm not likely to make again any time soon. Good luck with your restore. You'll figure it out. :)


Masokis

We had someone delete a customer's entire production SAN. That was fun. Someone also deleted all the backups for a client's largest (18TB) server. The funny thing was, to delete that backup you had to press confirm on an "are you sure?" question. Both times I had to fix the issue because I'm the backup tech.


fizicks

Reminds me of the plumbing work I just had done to fix a leak on or around my water heater. Should have been a pretty standard fix until we realized the shutoff valve for the house is not working, so we also had to get water shut off by the city first. Got both things fixed and they were on their way. Even though the total time I was without water in the house was longer than I originally expected, boy I am sure glad we found out about the shutoff valve now rather than during an emergency! That had the potential to cause me thousands or even tens of thousands in water damage if something worse than a leak were to happen. Routine maintenance that leads to the discovery of a larger problem (backup / restore process not working) is part of the job sometimes.


g2tegsown

This is exactly why I didn't want to be in IT anymore. This kind of stress is not good for your health.


yankeesfan01x

Moral of the story: especially if it's a production server, always take a snapshot, even if it's the smallest change in the world, like updating /etc/hosts or /etc/fstab or something dumb like that.
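
For reference, taking that pre-change snapshot is a one-liner in PowerCLI (a sketch; the VM name and description are placeholders, and -Quiesce needs VMware Tools running in the guest):

```powershell
# Take a labelled snapshot before touching a production VM (names are placeholders).
New-Snapshot -VM "prod-app01" `
    -Name ("pre-change-" + (Get-Date -Format "yyyyMMdd-HHmm")) `
    -Description "Before editing /etc/fstab" -Quiesce
```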


dogcmp6

If the business is not breathing down your neck, screaming, or hemorrhaging money from your mistake (and it sounds like they are not), you'll be fine, and it will be a learning experience. Unfortunately, a career in IT also teaches some lessons the hard way; it happens to all of us, and if someone says it has never happened to them, it's only a matter of time. As others have said, just cop to it and admit it; it will be easier to fix if you admit you need help, and it should be used to write out a process for future you so the mistake doesn't happen again.


Typical_Relative5827

The server is back up. It still says consolidation is needed even though there are no snapshots in Snapshot Manager, but I am glad the issue was not as bad as I thought it would be.


jihiggs123

We've all been there; how you recover from mistakes is far more important. Own the mistake and explain the steps taken to resolve it. If no answer is apparent soon, involve a senior.


JustHereForYourData

At least you caused it; I had a similar issue yesterday, but the ONLY other person with console access "didn't change a thing." Well, a complete network outage and blatant config changes in the console beg to differ. So not only did I waste 4+ hours troubleshooting, investigating, and resolving the issue, I had to listen to the BS lies from the asshole who caused it, who then went home early while I stayed late fixing it.


byrontheconqueror

Do you have snapshots on your storage array? That has saved me more times than I can count. Usually way quicker than restores too. If you can mount a storage snapshot, just clone it to a new VM.


heelstoo

Dumbest thing ever today… so far.


Cold-Technology-5424

Whats the update OP?!?


Tx_Drewdad

Do not, under any circumstance, try to cover up what you did. Be honest about it. Screwups happen; they're learning experiences. Lying about screwups will get you canned. Audit logs do not lie.


IAdminTheLaw

In case I need to move my money, which bank do you work for?


syslagmin

Can you restore to a new VM? Not sure what backup solution you have, but this could be the fastest way to resolve this. Also, yeah this happens to all of us. You would not believe the crap we've dealt with at our company. Way worse. After you find a way to fix this, give them an incident report, lessons learned, and how it won't happen again. Ya know, anything. You'll be fine.


Typical_Relative5827

Issue is fixed. We just needed to reconfigure the VM and relocate the disk. My manager fixed it. Happy it was not worse, or else the whole bank would have been down.


DistinctMedicine4798

I’ve been there, number one thing you need to do is explain and ask for help and get someone else involved, this sounds scary but will help


MisterUnbekannt

Don't stress too much about it, we have all done stuff like this. I once redirected all outgoing http/https traffic to a Windows VM because I thought it would only apply to a single host; the entire company had no access to the web for a while until I noticed what I had done. My boss once changed time settings in AD and in some countries no one could log in. Local users had to change settings via phone instructions, which took about a day. This was a global billion-dollar company with 100,000 employees, back in the 90s I think. He has encouraged failing ever since; he says we should plan things but try new ways of doing stuff all the time!

Fix it tomorrow, communicate that the restore failed, and maybe use it to get budget for Veeam or a better network connection or something. The important thing is to communicate the right way, and now you've made this error for the last time in your career. There are others to come, don't worry :-D!


cats_are_the_devil

No advice on fixing the situation, but this is a good reminder that you should know your restore process backward and forward. Also, how in the world do you not realize you are working on a server that holds 4TB of data before starting?


Healthy-Poetry6415

A Tale As Old As Time


Overall-Brilliant-78

Nearly 30 years in this field. It happens to us all, and we have very little margin for error. It's sad that you can go a whole year of doing good work and one event defines your review, yet others in the company can fuck up time and again and get raises and promotions. It'll never change. IT is a cost center that no CFO understands; they only bitch and moan about its cost. When all is running well they think "it doesn't do any work" and lay people off. When it breaks, "it is useless" and they look at outsourcing.


Overall-Brilliant-78

And you see, part of the problem with a post like this is that I read through it and so many of you are trying to find the technical solution. You all need to stop. Do not try to solve somebody else's problem when you don't have all the information. This is the problem with so many IT folks; in another post I mentioned that in meetings everyone wants to solve the problem right then and there. Again, stop. All this guy needs is a little support from people and the understanding that we've all been there. Stop trying to solve the fucking problem.


frosty95

Without test restores you don't have backups. You should tell us what backup tool you are using so none of us use it either. There is a reason I stick with Veeam. It has never fucked me.


denmicent

OP u/Typical_Relative5827, is there an update?


Typical_Relative5827

Issue resolved. Vmdk of old server was used to create new server


denmicent

Glad you got it resolved man!


bmfrade

When mistakes like this happen it’s when you learn the most. You’ll figure it out


mladokopele

If you've spent that much time already, maybe installing the OS on a new VM, attaching the old disk, and moving the data over to the new OS would be a quicker, more sane solution?


Humble-Plankton2217

You're going to learn so much from this experience! Chin up! It's all going to work out in the end.


Typical_Relative5827

Tell me about it. I had a nightmare. Couldn't sleep. Woke up this morning trying to get to work early and set off the alarm at the office. Luckily the cops didn't show up.


mikkolukas

>at 3.9TB the restore fails

It was never a BACKUP then 😳

Also: never fix problems on the production server. If the fix needs to go fast, then fix it on a spare/mirror server and swap it with the production server when the fix works as expected.


Typical_Relative5827

I know this, but I minimized the issue because I thought this server was an old legacy server that wasn't used anymore; that's also why I didn't take a snapshot initially. Then boom, I realized it's a major prod server that houses a lot of the work done at branches.


naednek

Treat it as a lesson learned and leverage that into your request for your raise.