The Reliability team at Spotify took over the monitoring stack and decreased incident pages by 42% within six months. At first, the team spent all of its time managing on-call alerts and tech debt. Now on-call alerts are infrequent and manageable, and the team is on a path to using entirely open source products.
The stack was developed years earlier, when few mature open source solutions were available. Lauren will describe how migrating to new tools (Grafana and Prometheus) reduced the team's backlog and on-call pages. She will also cover the improvements the team made to its own open source products (Heroic and FFWD) and why it chose to continue using and maintaining them. Lastly, she will discuss a new tool that the team plans to repurpose and open source in the near future.
Lauren is a Site Reliability Engineer at Spotify on the Observability team. She is currently working on maintaining the monitoring and alerting stack, as well as implementing tracing.