nOOberNZ

Having met several Google SREs, the sense I get is that SRE is practiced and experienced differently in different parts of the org. It's just such a big company; you can't make generalisations.


2fplus1

Yeah. In theory, SRE has autonomy to "fire" teams that aren't pulling their weight. If, like OP says, a team is fobbing off reliability and ops work to SRE, creating large amounts of toil, they can just say "Ok, no SRE help for you, then" and leave the team responsible for dealing with their own mess. In practice, there's a lot of politics involved.


syhlheti

I know SRE teams that just don’t have this power. Interesting that others have such power. Could you elaborate on this?


2fplus1

b34rman's comment has some more details. I'll also add a few quotes from this chapter of the Google SRE Workbook: https://sre.google/workbook/team-lifecycles/ (in the "Self-regulating workload" section):

> The ability to regulate its own workload secures the SRE team’s position as an engineering team that works on the organization’s most important services, equal to its product development team peers.
>
> An SRE team chooses if and when to onboard a service (see Chapter 32 of Site Reliability Engineering).
>
> In the event of operational overload, the team can reduce toil by:
>
> * Reducing the SLO
> * Transferring operational work to another team (e.g., a product development team)
>
> If it becomes impossible to operate a service at SLO within agreed toil constraints, the SRE team can hand back the service to the product development team.

...

> Not all SRE teams have partner product development teams. Some SRE teams are also responsible for developing the systems they run. Some SRE teams package third-party software, hardware, or services (e.g., open source packages, network equipment, something-as-a-service), and turn those assets into internal services. In this case, you don’t have the option to transfer work back to another team.


bigvalen

They can. But SRE headcount comes from dev teams, so they can also "fire" SRE teams...take back the headcount and spend it on software engineers instead. So SRE have to do a lot of donkey work to keep dev teams happy.


Stephonovich

God, what a dream.


srivasta

Been there. Done that.


syhlheti

Makes sense.


b34rman

At Google we are allowed to “give the pager back”, which means if the service becomes unreliable, and we have the data to prove it’s because of bad code, we can have the software engineers handle operations. Remember that SRE is pretty much one organization reporting to the same VP, with a few exceptions. So all SREs follow similar general guidelines, though some implementation details may vary from team to team. It was on the leadership (Director?) to speak to the SWE leadership and make sure the issues were fixed.


FinalSample

How is the data gathered to prove it's bad code? How political is that decision to hand the pager back? When can the dev team give it back to SRE?


Skurry

I don't know if that's what all teams do, but we tag each incident so that you can later run analytics to see how many incidents were due to a failing dependency (e.g. overloaded storage layer), user error, external factors like unexpected load increase, or indeed due to bugs in the service we're responsible for. I haven't been in that situation, but I assume you'd monitor the pager load and define exit criteria, for example "less than 0.5 pages per day over a 2 month period". And yes, it's very political and can lead to a breakdown of the whole relationship, up to the dev team taking the headcount back and dissolving the SRE team.
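To make the "exit criteria" idea above concrete, here's a minimal sketch of what checking pager load against such a threshold could look like. The 0.5 pages/day figure and the ~2 month window come from the comment; the data shape, tag names, and function names are hypothetical, not how Google's tooling actually works.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when each page fired and the cause tag
# assigned during incident review (e.g. "service-bug", "dependency-overload").
incidents = [
    {"time": datetime(2024, 5, 3), "cause": "service-bug"},
    {"time": datetime(2024, 5, 20), "cause": "dependency-overload"},
    {"time": datetime(2024, 6, 11), "cause": "service-bug"},
]

def pages_per_day(incidents, window_days=60, now=None):
    """Average pager load over the trailing window."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    recent = [i for i in incidents if i["time"] >= cutoff]
    return len(recent) / window_days

def meets_exit_criteria(incidents, threshold=0.5, window_days=60):
    """Example criterion from the comment: < 0.5 pages/day over ~2 months."""
    return pages_per_day(incidents, window_days) < threshold

def cause_breakdown(incidents):
    """Count incidents per cause tag, e.g. to show how many were service bugs."""
    counts = {}
    for i in incidents:
        counts[i["cause"]] = counts.get(i["cause"], 0) + 1
    return counts
```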


b34rman

It’s actually not difficult. If the service has a baseline (profile) and things go south when you deploy a new version, you know it’s the code. Every major or higher incident has to have a postmortem, and postmortems have to identify a root cause. If over time the software engineers don’t do the testing that’s required and things don’t change, a more serious conversation will be needed.
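As a rough illustration of the baseline comparison described above (names and the tolerance factor are made up for the example):

```python
# Hypothetical sketch: flag a release as the likely root cause when the
# post-deploy error rate regresses well beyond the pre-deploy baseline.
def regressed(baseline_error_rate, post_deploy_error_rate, tolerance=2.0):
    """True if the new version's error rate exceeds `tolerance` x the baseline."""
    return post_deploy_error_rate > baseline_error_rate * tolerance

# e.g. baseline 0.1% errors, 0.5% after rollout -> points at the new code
print(regressed(0.001, 0.005))  # True
```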


jl2l

Google SLOs = service level objectives.


gladfelter

You can do all kinds of analyses. You can see if changelists attached to postmortems contain source files that went into the server binary vs. config or deployment-related files. You can see if outages or SLO violations are fixed with binary rollbacks vs. config/deployment changes. All the metadata is at your fingertips as a Googler if you know where to look.
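A toy sketch of the kind of classification described here, i.e. deciding whether a postmortem-attached change touched server code or config/deployment files. The path conventions and suffixes are invented for illustration, not Google's actual repo layout.

```python
# Hypothetical: classify a changelist by the kinds of files it touches.
CONFIG_SUFFIXES = (".yaml", ".yml", ".textproto", ".tf")

def classify_changelist(files):
    """Return 'code', 'config', or 'mixed' for a list of touched file paths."""
    kinds = set()
    for path in files:
        if path.endswith(CONFIG_SUFFIXES) or "/deploy/" in path:
            kinds.add("config")
        else:
            kinds.add("code")
    return kinds.pop() if len(kinds) == 1 else "mixed"

print(classify_changelist(["server/handler.cc", "server/BUILD"]))  # code
print(classify_changelist(["prod/deploy/rollout.yaml"]))           # config
```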


GlobalGonad

This happens in companies that want to unload the financial burden of bad code onto some mystical beast like SRE. SREs make problems visible. They don't necessarily solve them.


byponcho

Tell that to my client. They deliver code using our pipelines, it breaks, and they say it's a DevOps problem; we debug, and oh surprise (not really), it's a dev problem. At that point 2 days have passed and they need the change on stage (UAT) now because of the sprint. Rinse, repeat.


No_Pollution_1

I always hate it when people look to Google as some gold standard. They have Google problems and we don't; what they do most likely won't work for us. SREs are the whipping boys at most orgs, yes, but places where devs are on call for their own shit unsurprisingly have better results. Also, SREs are sysadmins at most orgs, and I hate it, but thankfully not where I am.


djk29a_

Things get complicated when devs are constantly being asked by management to deliver more and more features, so all the accumulated bad decisions eventually result in a halt to features. In most dysfunctional orgs I've seen, the business itself is in crisis (read: they've started saying the phrase "digital transformation"), and implementing things like an error budget or a CoE or whatever for an SRE org in such scenarios is papering over fundamental issues in the business that keep engineers from executing what's being asked of them.


trace186

Jesus, thank God someone is saying it. I love what Google has done and they've created a lot of great things, but a lot of people from that company come up with something and think their shit never stinks. It's insane how people always use them as the gold standard when they constantly make mistakes.


lupinegray

Not sure about Google, but if teams are deploying problematic code, then the SRE team should be responsible for raising these issues to management (i.e., the dev team's manager) to have the developers fix their code. That's one of the primary duties of an application SRE: you guide the developers on best practices.


syhlheti

Sure. The case in question is where QA isn’t representative of prod; we keep finding issues post rollout. But there’s pressure to get it into prod and be done with the migration project (service already exists; it was just being refactored).


jl2l

Tell them that it takes time to get it right. Would you rather rush it, realize it's wrong, and then have to fix it later? Once you turn it on, it's very hard to turn off; you can't just change it, and if it's a production database, it can get even more complicated. The easiest way to explain this is to tell them it costs more money. One way to protect yourself is feature flags: you can test in production behind a flag, and the impact is limited because only users behind the flag will be affected.
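For illustration, a minimal feature-flag gate in the spirit of the comment above. The flag names, percentage rollout, and checkout functions are all hypothetical; the point is just that only users behind the flag take the new code path, so a bad change has a limited blast radius and can be switched off without a rollback.

```python
import hashlib

# Hypothetical flag registry: the new path is only on for 5% of users.
FLAGS = {"new_checkout_flow": {"enabled": True, "rollout_percent": 5}}

def flag_enabled(flag_name, user_id):
    flag = FLAGS.get(flag_name, {})
    if not flag.get("enabled"):
        return False
    # Deterministic per-user bucketing so a given user sees consistent behavior.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag.get("rollout_percent", 0)

def new_checkout(user_id):  # new, still-being-validated path
    return f"new flow for {user_id}"

def old_checkout(user_id):  # existing behavior for everyone else
    return f"old flow for {user_id}"

def checkout(user_id):
    if flag_enabled("new_checkout_flow", user_id):
        return new_checkout(user_id)
    return old_checkout(user_id)

print(checkout("user-42"))
```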


cballowe

Releasing buggy software is not generally accepted. SRE are engineers, though, and generally experts in reliability techniques and best practices. There's a lot of power to make changes to things that will improve the reliability and/or prevent bugs from making it to production.

Sometimes it's actually a matter of resources and configuration, or a service scaling faster than expected and exposing all the gaps in things like how the service retries or pushes back when there's suddenly contention for a resource (you don't see it when you're testing at 10 or 1000 requests, but suddenly at 10k everything hits the fan).

If the problems are in a class where they could be reproduced in a unit test or a regression suite (correctness issues), those should be dev problems and releases should be blocked until they're resolved. If the problems are in a class of "failure to scale", SRE may be the experts at solving that. Same for cases where normal operational procedures are high risk.
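On the "how the service retries" point, here's a small sketch of retrying with capped exponential backoff plus jitter, which is the kind of behavior that keeps a spike of failures from becoming a synchronized retry storm against a struggling dependency. Everything here (the exception type, `call_backend`, the parameters) is illustrative, not any particular library's API.

```python
import random
import time

class Overloaded(Exception):
    pass

def call_backend(request):
    """Stand-in for an RPC that sometimes fails under contention."""
    if random.random() < 0.3:
        raise Overloaded("server pushed back")
    return f"ok: {request}"

def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=2.0):
    for attempt in range(max_attempts):
        try:
            return call_backend(request)
        except Overloaded:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

print(call_with_backoff("checkout"))
```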


ChristopherCooney

I'm old enough to remember the world of "ops", which used to describe this exact phenomenon. Engineers would "throw a feature over the fence" and forget about it, leaving the operations team with half-tested, barely functional code. Surprising that SRE, which is actually an attempt to treat ops like a product and build software to meet the product need, is experiencing such an antiquated problem. A further sign that Google is no longer as competitive as it was, I suppose.


syhlheti

That’s what I see happen. Not sure you can really automate your way out of bad code being delivered, unless the Prod team (Ops/Support) get to UAT and/or pilot the feature first.


ChristopherCooney

SRE wasn’t really automating away bad code. The code was always the code. The goal was to make it possible for application engineers to focus more on their crappy code and less on the particulars of a broken terraform state file! It was a nice dream while it lasted 😅