Wednesday, November 20 • 3:20pm - 3:55pm
How to Include Latency in SLO-based Alerting - Björn Rabenstein, Grafana Labs

Sign up or log in to save this to your schedule and see who's attending!

Feedback form is now closed.
Chapter 5 of “The Site Reliability Workbook” is an excellent study of how to create meaningful alerts based on SLOs by measuring the rate at which the error budget is burned over different time windows. This rather complex approach is blissfully straight-forward to implement in Prometheus, as demonstrated in the chapter itself. However, all of it is based on error rates, leaving latency concerns out of scope. Björn “Beorn” Rabenstein will explore various options of applying the same ideas to latency-based SLOs. The foundation is a precise and meaningful definition of the SLO. From there, Beorn will explore various techniques to translate the SLO into an error budget and how to measure its burn rate with Prometheus. Once that is done, creating error-budget-based alerts is relatively simple. There are, however, pitfalls and trade-offs along the way, which Beorn will help cope with.

avatar for Björn Rabenstein

Björn Rabenstein

Engineer, Grafana Labs
Björn “Beorn” Rabenstein is an engineer at Grafana and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.

slides pdf

Wednesday November 20, 2019 3:20pm - 3:55pm
Room 16AB - San Diego Convention Center Mezzanine Level