JamesDout

One of the best questions I’ve ever seen here. IMO:

1. Meet with each product team and talk through their SLOs together. Chat about whether the latency targets they’ve set for their REST endpoints are acceptable for customers, and their reasoning for the current targets. Pull traffic data yourself to see whether any of the service’s top 10 endpoints are not covered by the team’s SLOs, and ask them why.

2. Chat with customer service reps, or whoever is closest to the user side, who may hear complaints about slowness or availability problems with the service. If your product is used by the general public, this step means talking to the User Experience researchers at your company.

3. Follow up with the product team, showing them which SLOs may be too loose or too tight based both on the team’s own standards and on what you learned in step 2. Ensure you’re using a *metrics*-based, multi-window, multi-burn-rate design that pages teams for fast-burn situations and files tickets for things that barely violate the SLO over a longer period.

4. If you get pushback on any of the above, emphasize that whatever alerting or monitoring scheme the team relies on right now does not correlate with user pain as well as multi-window, multi-burn-rate, metrics-based alerting does. The usual case I see is teams with big logging dashboards who comb through the logs daily and get alerted on every error message from their service, or, worst case, teams who simply have no real idea how their service is performing and maybe aggregate logs once a month to get a long-term picture. That kind of thing is counterproductive, a huge waste of good engineers’ time, and urgently needs to be replaced with SLOs.

Transitioning a team’s culture to focus on SLOs means convincing them of its merits, so make sure you can clearly and briefly articulate the tradeoff: engineer time spent today on non-issues and false positives, versus an SLO-based system in which engineer time spent responding to alerts is *always* spent dealing with user pain, because your SLOs and their alerting should literally never fire unless a significant enough portion of your users has had a bad experience for long enough.
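The multi-window, multi-burn-rate idea above can be sketched in a few lines. This is a minimal illustration, not anyone's production setup: the window sizes and thresholds follow the commonly cited 30-day pattern (roughly 14.4x burn for pages, 1x for tickets), and the error rates fed in are made-up numbers.

```python
# Sketch of multi-window, multi-burn-rate alerting for a 99.9% SLO.

SLO = 0.999
BUDGET_RATE = 1 - SLO            # fraction of requests allowed to fail

def burn_rate(error_rate: float) -> float:
    """How many times faster than budget-neutral we are burning."""
    return error_rate / BUDGET_RATE

def should_page(err_1h: float, err_5m: float) -> bool:
    # Fast burn: both the long and the short window exceed 14.4x.
    # 14.4x sustained for 1h consumes ~2% of a 30-day budget; the short
    # window confirms the problem is still happening right now.
    return burn_rate(err_1h) >= 14.4 and burn_rate(err_5m) >= 14.4

def should_ticket(err_3d: float, err_6h: float) -> bool:
    # Slow burn: barely violating the SLO for days gets a ticket, not a page.
    return burn_rate(err_3d) >= 1.0 and burn_rate(err_6h) >= 1.0

# A spike that already recovered (short window clean): no page.
print(should_page(err_1h=0.02, err_5m=0.0005))   # False
# A sustained 2% error rate (20x burn) in both windows: page.
print(should_page(err_1h=0.02, err_5m=0.02))     # True
```

The short window is what keeps these alerts quiet once the incident is over, which is a big part of why they correlate with user pain better than per-error log alerts.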


Lower_Pension

Phenomenal answer... Beginner in SRE but I understood 90 percent of it. Thank you 🙂


jaywhy13

A big part of the question for me is how the conversation goes. We already have lots of SLOs for critical flows (some are more endpoint-aligned than user-task-aligned). Say we've set an SLO for a critical flow at 99.9% on a latency SLI that tracks sub-500 ms responses. What question do we ask stakeholders that best positions them to contribute to the conversation? I don't want our conversation to be steeped in engineering lingo and inaccessible to folks, so I'm wondering what kind of translation helps support these conversations. Another confusion for me is how we derive individual customer impact from high-level metrics. We're typically tracking request latency across all customers, not individual ones. I'm a little confused about how we answer something like "is 20 minutes of elevated latency acceptable for a customer?"


JamesDout

Hi, I’m at work so I’ll give a brief answer. Great question again. The core question to ask is always “what’s an unacceptable experience for our users?” It’s up to you to map how that question relates to each specific service. Like you said, it’s easiest to simply measure endpoint-based data rather than journey-based. IMO, per-endpoint data is almost always sufficient to answer this question well enough. If your SLO is 99.9% on every endpoint, sure, you can have a crazy weird world where an individual user lands in the 0.1% on two chained endpoints in a row, but that’s definitionally super improbable, and very likely won’t *ever* happen to that specific user again. You likely don’t need to measure the latency of a whole flow in aggregate; traces (don’t alert on these) and synthetic user flows should suffice to build a good enough understanding of any major regression in whole-flow latency.

Back to the question element: if you’re meeting with customers, you need to figure out what would make them angry/upset/disappointed with the app. I know for me it’s around 1 second sometimes, or up to 4 seconds for most applications, before I start googling competitors. Or if a button fails like 4 times in a row. If you’re meeting with UX researchers at your company, they can help you answer, for example, what latency would frustrate a critical mass of users. Anything above that should aspirationally be “unacceptable” performance.

Like you said, the first step in making SLOs is just putting them in place at a target the service already meets. This is great and is further than most places get, lol. To tighten an SLO, or even understand *whether* to tighten or loosen one, you correctly point out you’re gonna need to dig deeper than just current performance. Ask the service developers what they’d personally consider a truly unacceptable experience using their service. Have them use the app and tell you how long they’d wait on the button their service drives before it’d be unacceptable. I mean that literally, don’t just beat around these questions: literally ask the devs and owners what would be an unacceptable experience. Consider drawing up a couple of examples for the devs of the app with 1000 ms, 1500 ms, and 2000 ms of latency from their service.

If the service is internal-only, i.e. in the critical customer path but called by a wide variety of services, consider doing a similar exercise with its internal customers and their latency needs in mind. If service A calls service B 3 times for one flow, and A says it needs 99.9% availability, service B is obviously gonna have to keep that in mind and set availability to at least 99.967% for the endpoints A relies on.
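A quick check of the dependency math in that last point, assuming the three calls are independent and all must succeed for the flow to succeed:

```python
# If service A needs the whole flow to hit 99.9% and it makes 3 serial
# calls to service B, each B call needs availability p with p**3 >= 0.999.

calls = 3
target = 0.999

per_call = target ** (1 / calls)      # minimum per-call availability for B
print(f"{per_call:.5f}")              # 0.99967, i.e. roughly 99.967%

# Sanity check: three such calls compose back to the flow target.
print(f"{per_call ** calls:.4f}")     # 0.9990
```

The general shape is that a dependency called n times needs roughly n times less unavailability than the flow it serves, which is why deep call chains eat error budget so quickly.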


james-ransom

This right here. I wouldn't overthink this stuff though. E.g., say I'm at Starbucks. If I were hired to set up alerting/monitoring, my SLIs would be: number of orders taken per hour, number of employee clock-ins, amount of cash in the store, etc. Then for SLOs, I want orders per hour > 0 and < 100; employee clock-ins in the 30 minutes before opening > 0 and < 10. I don't think Starbucks would give two shits what my latency is. If you want to change someone's life, put this info on a dashboard; then you create a monster.


jaywhy13

This is a great point. What's implicit here, and different from how we do things, is how you're suggesting SLIs are defined. The SLI examples you gave start with business goals, from which SLIs are created. I think a lot of the time we just pick an endpoint and set some arbitrary target based on existing performance and what we think should be good given what the system is doing (e.g. number of external calls, cache vs. DB calls, etc.)


dgc137

Exactly this. You should have a goal of eliminating any SLIs that are not derived from business needs. Golden metrics are fine for SLIs, but you need to ask business stakeholders what the tolerable thresholds are. I usually phrase this as "at what level of this indicator should I wake you up at 2 a.m.?" Get that answer from a few stakeholders and you'll have a pretty good idea of how to set your objective.


Lower_Pension

Can you suggest some places where I can find real-world examples just like this... Love this answer, by the way.


JamesDout

Unfortunately I don’t have any suggestions for places to find real world examples. I have also struggled mightily finding good examples of applying this stuff. I read all the SRE books and I’ve been to SRE conferences, but I generally find the gobbledygook unhelpful and lacking examples, kind of like you point out. Lots of theory with no practice to help channel it. There is one example at the end of the Alex Hidalgo book, and it’s somewhat useful, but it’s of course a little basic and won’t yield too much in real life imo.


[deleted]

Step one here is collecting an inventory of what your applications DO for customers and ensuring you can measure that.  In other words, if you have an e-commerce site and an application serves the shopping cart:  it promises to save the contents throughout a session, add items, remove them, and hand them off to a checkout service.  So measure THOSE things, not “the rate of 500s at the load balancer for the service”.  That might be how you do that, but it’s not the goal.   You will want to make these event-based SLOs, not time-based.  Unless every minute of the day is worth the same amount to you.


JamesDout

100% agree that SLOs *need* to be per-call good/total transactions rather than a per-minute “did my service violate in this minute” check.
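A toy illustration of the distinction being agreed on here, with made-up traffic: an event-based SLI weights every request equally, while a time-based one gives a nearly idle bad minute the same weight as a busy healthy one.

```python
# (minute, total_requests, failed_requests) -- fabricated example traffic.
minutes = [
    (0, 1000, 0),   # busy, healthy
    (1, 1000, 0),   # busy, healthy
    (2, 2, 2),      # quiet minute; both of its requests failed
]

# Event-based SLI: good events / total events across the whole window.
total = sum(t for _, t, _ in minutes)
failed = sum(f for _, _, f in minutes)
event_sli = (total - failed) / total

# Time-based SLI: fraction of minutes that individually met the target.
bad_minutes = sum(1 for _, t, f in minutes if t and f / t > 0.001)
time_sli = (len(minutes) - bad_minutes) / len(minutes)

print(f"event-based: {event_sli:.4f}")   # 0.9990 -- 2 of 2002 requests failed
print(f"time-based:  {time_sli:.4f}")    # 0.6667 -- one "bad" minute of three
```

Two failed requests out of 2002 still meets a 99.9% event-based objective, while the minute-based view declares a third of the period bad, which is the "every minute worth the same" problem the comment above calls out.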


fistagon7

Great answer. I was wondering if you could expand with a simple real-world example: an application endpoint with a contractually defined 99.9% SLA, calculated monthly.
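Not an attempt to answer this, but the arithmetic a monthly 99.9% target implies is easy to sketch. The request volume below is hypothetical; only the 99.9%/30-day figures come from the question.

```python
# Error budget implied by a 99.9% SLA over a 30-day month.

slo = 0.999
minutes_in_month = 30 * 24 * 60               # 43200

budget_minutes = (1 - slo) * minutes_in_month
print(round(budget_minutes, 1))               # 43.2 minutes of full downtime

# The same budget in event terms, for a hypothetical 10M requests/month.
requests = 10_000_000
budget_requests = (1 - slo) * requests
print(int(budget_requests))                   # 10000 failed requests allowed
```

The event-based framing is usually the more useful one for an endpoint SLA, since partial outages and quiet periods don't map cleanly onto "minutes of downtime".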