Observability
Tuesday, November 19

10:55am PST

Blazin’ Fast PromQL - Tom Wilkie, Grafana Labs
PromQL, the Prometheus Query Language, is a concise, powerful and increasingly popular language for querying time series data. But PromQL queries can take a long time when they have to consider >100k series and months of data. Even with Prometheus’ compression, a 90 day query over 200k series can touch ~100GB of data.

In this talk we will present a series of techniques employed by Cortex (a CNCF project for clustered Prometheus) for accelerating PromQL queries -- namely query results caching, time slice parallelisation, aggregation sharding and automatic recording rule substitutions.

But there’s more: we will show how you can use this technology to get these improvements with Thanos and Prometheus. Finally, we will cover optimisations to the PromQL engine by the Cortex team, and how these have already been merged upstream to benefit the whole community.
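As a rough illustration of the time slice parallelisation mentioned in the abstract, the sketch below splits a long query range into interval-aligned sub-ranges that could be evaluated in parallel and cached independently. It is not Cortex's implementation; the function name `split_range` and the use of plain integer timestamps in seconds are assumptions made for this example.

```python
def split_range(start, end, interval):
    """Split the half-open range [start, end) into sub-ranges.

    Sub-range edges are aligned to multiples of `interval`, so two
    overlapping queries produce identical inner slices (which makes the
    slices usable as cache keys). All values are integer seconds.
    This is a hypothetical helper, not part of any Prometheus or Cortex API.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")

    slices = []
    cur = start
    while cur < end:
        # Next interval boundary strictly after the current position.
        boundary = (cur // interval + 1) * interval
        slice_end = min(boundary, end)
        slices.append((cur, slice_end))
        cur = slice_end
    return slices


DAY = 24 * 60 * 60

# A 90 day query becomes 90 one-day slices that can run concurrently.
ninety_days = split_range(0, 90 * DAY, DAY)
```

Aligning the slices to fixed boundaries, rather than counting from the query's own start time, is what would let a results cache reuse slices across queries whose ranges only partly overlap. A real implementation would also have to account for the query step and inclusive end timestamps, which this sketch ignores.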


Tom Wilkie

VP Product, Grafana Labs
Tom is VP Product at Grafana Labs, but really he is a software engineer. Tom is a maintainer on the Prometheus project and a maintainer and the original author of Cortex, both CNCF projects. Previously Tom founded Kausal, a company working on Prometheus, and worked at companies such...

Tuesday November 19, 2019 10:55am - 11:30am PST
Room 11AB - San Diego Convention Center Upper Level

11:50am PST

No-Nonsense Observability Improvement - Cory Watson, SignalFx
Observability has gone from a thing you read about on Twitter or Medium thinkpieces to something your organization “has”. Maybe you’ve got a few new observability tools deployed. How is that working out for you?

Regardless of your adoption level – from logs on local boxes up to the highest cardinality traces and feature analysis – at the end of the day these are tools, not magic spells. How do you teach, train, use, evangelize, and measure the impact they have in your organization?

Cory has been a part of solo and large observability teams, in-house and vendor, and worked with dozens of companies. In this session he’ll explain some no-nonsense, tool agnostic methods for wringing more value out of what you have, identifying when to use new tools, how to handle migrations, how to measure value, and how to deal with “why does this cost so much?”

Cory Watson

Technical Director, SignalFx
Cory Watson is Director of Technology at SignalFx, leading high impact, customer-focused projects around observability and monitoring. Cory started his journey as an SRE at Twitter, and continued on to found the observability team at Stripe. He is a strong voice in the observability...

Tuesday November 19, 2019 11:50am - 12:25pm PST
Room 17AB - San Diego Convention Center Mezzanine Level

2:25pm PST

From Issue to PR Merged: A Fluentd “Tail” - Jordan Hamel, Amgen
Do you often find yourself opening an issue or looking for an alternative open-source project with support for your use case? Not sure where to start in contributing a fix for an issue?
Getting involved in the Fluentd ecosystem and submitting a PR helped make it possible for Amgen to effortlessly collect CloudTrail logs from hundreds of AWS accounts owned by separate teams.
We'll look at how to use Fluentd to collect and annotate logs stored in any format or account in AWS, whether hundreds of accounts or any other number are in use. We'll also walk through the details of contributing the now-merged PR to the Fluentd S3 plugin that made this possible.
Whether you're a new or long-time user of Fluentd, come and be inspired to consider contributing back to observability related open-source projects like Fluentd and the benefits it can bring to your organization and the community.

Jordan Hamel

Sr Mgr Software Development Engineering, Amgen
Jordan Hamel is a software engineer currently at Amgen who cares about making sure software and the users like each other. Having previously led E-commerce operations for years at Newegg.com, he is a huge fan and supporter of making the user experience as observable as possible and...

Tuesday November 19, 2019 2:25pm - 3:00pm PST
Room 7AB - San Diego Convention Center Upper Level

3:20pm PST

Weighing a Cloud: Measuring Your Kubernetes Clusters - Han Kang, Google & Elana Hashman, Red Hat
Kubernetes is complicated. Instrumenting it can be worse. Measuring the components of a distributed system shouldn't be as daunting as being asked to weigh a literal cloud.

In this talk, we'll go over the components of a Kubernetes control plane and show you where to look to figure out what is actually happening. We will show you common cluster issues and how they would look in your instrumentation, so that you can diagnose clusters more effectively.

Starting in version 1.14, Kubernetes metrics were overhauled to provide consistent, high quality metrics. Han Kang and Elana Hashman will go over the changes and the potential ingestion implications of this overhaul and how it may affect you.

Han Kang

Senior Staff Software Engineer, Google
Han Kang is a Senior Staff Software Engineer at Google. Han co-chairs SIG Instrumentation while also participating in SIG API Machinery, focusing on operational aspects of managing Kubernetes clusters.

Elana Hashman

Principal Software Engineer, Red Hat
Elana Hashman currently works for Red Hat as a Principal Software Engineer on the OpenShift Container Platform Node Team, working upstream in Kubernetes SIG Node. Previously, she served as an SRE and technical lead on Azure Red Hat OpenShift. She is a subproject lead for the SIG Node...

Tuesday November 19, 2019 3:20pm - 3:55pm PST
Room 30ABCDE - San Diego Convention Center Upper Level

Wednesday, November 20

11:50am PST

Doing Things Prometheus Can’t Do with Prometheus - Tim Simmons, DigitalOcean
The current Cloud Native Observability dogma is that metrics (and logs and traces) are “not good enough” and that this brave new world needs brave new Observability tools. This is false.

This session will focus on how to utilize Prometheus and friends to solve problems that are typically cited as limitations. This talk is for anyone interested in learning how Prometheus can solve the majority of your Observability problems, no vendor required.

An outline of this talk is:
- How to thoughtfully utilize existing Observability tools
- Deploying High Availability Prometheus
- Effectively interacting with high-cardinality data
- Long-term metrics storage
- Doing “machine learning” on metrics
- Handling thousands of alerts in a sane way (https://twitter.com/timsimlol/status/1145790451129167872)
- How to measure *everything* with Prometheus
- Fostering a healthy Observability culture with SLOs

Tim Simmons

Senior Engineer, DigitalOcean
Tim Simmons is a Senior Engineer on the Observability Platforms team at DigitalOcean. He primarily cares for DigitalOcean's internal Prometheus infrastructure. On a normal day, he helps his colleagues with PromQL queries, writes custom Prometheus exporters, and builds tools around...

Wednesday November 20, 2019 11:50am - 12:25pm PST
Ballroom Sec 20AB - San Diego Convention Center Upper Level

11:50am PST

Shipping Metrics From the Edge - Matthias Loibl, Red Hat
Computing is getting pushed to the edge: it may be your car, TV, washing machine, or toaster. All these devices have a lot of computing power these days. While extending the cloud to the edge is getting solved with projects like KubeEdge or k3s, in this talk we want to take a closer look at how to run Prometheus on them. We want to configure Prometheus so that we can replicate its data to a central collection point running Thanos on Kubernetes in a replicated setup, and then use all the shipped metrics to query efficiently across the entire fleet.

Matthias Loibl

Senior Software Engineer, Polar Signals

Wednesday November 20, 2019 11:50am - 12:25pm PST
Room 11AB - San Diego Convention Center Upper Level

2:25pm PST

Beyond Getting Started: Using OpenTelemetry to Its Full Potential - Sergey Kanzhelev, Microsoft & Morgan McLean, Google
OpenTelemetry is a cloud-native set of APIs and libraries used to generate, collect, and export telemetry from distributed systems. This session goes beyond a basic introduction, and demonstrates how you can customize OpenTelemetry’s components and architecture for the unique needs of your app. Attendees will learn how to set up and configure built-in data collectors, how to write their own instrumentation, how to extend and enrich automatically collected telemetry with app-specific information, and how to send this data to Prometheus and Jaeger for analysis.

Morgan McLean

Product Manager, Google
Morgan is a co-founder of OpenCensus and OpenTelemetry, and has spent much of his career as an engineer and product manager working on distributed systems and developer tools. Morgan is responsible for Google's distributed tracing, profiling, and debugging tools, including Stackdriver...

Sergey Kanzhelev

Staff Software Engineer, Google
Sergey Kanzhelev is a seasoned open source and cloud native maintainer working actively on Kubernetes. Sergey is serving as co-chair of SIG Node. He is also one of the OpenTelemetry founders. He works on the engineering aspects of software and their practical application. With the Kubernetes...

Wednesday November 20, 2019 2:25pm - 3:00pm PST
Room 6C - San Diego Convention Center Upper Level

3:20pm PST

How to Include Latency in SLO-based Alerting - Björn Rabenstein, Grafana Labs
Chapter 5 of “The Site Reliability Workbook” is an excellent study of how to create meaningful alerts based on SLOs by measuring the rate at which the error budget is burned over different time windows. This rather complex approach is blissfully straightforward to implement in Prometheus, as demonstrated in the chapter itself. However, all of it is based on error rates, leaving latency concerns out of scope. Björn “Beorn” Rabenstein will explore various options for applying the same ideas to latency-based SLOs. The foundation is a precise and meaningful definition of the SLO. From there, Beorn will explore various techniques to translate the SLO into an error budget and to measure its burn rate with Prometheus. Once that is done, creating error-budget-based alerts is relatively simple. There are, however, pitfalls and trade-offs along the way, which Beorn will help you navigate.
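The burn-rate idea in the abstract can be made concrete with a little arithmetic. The sketch below assumes one possible latency SLO definition, "a target fraction of requests completes within a latency threshold", so that a slow request counts as "bad" in the same way a failed request does for an error-rate SLO. The function names, the example numbers, and the default threshold are assumptions for illustration, not the speaker's method.

```python
def burn_rate(bad_fraction, slo_target):
    """Return how fast the error budget is being consumed.

    bad_fraction: observed fraction of requests that missed the SLO in
                  some time window (here: requests slower than the
                  latency threshold), between 0 and 1.
    slo_target:   required fraction of good requests, e.g. 0.99.
    A burn rate of 1 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1")
    return bad_fraction / budget


def should_page(long_window_bad, short_window_bad, slo_target, threshold=14.4):
    """Multi-window check: page only if both windows burn too fast.

    The short window confirms that the problem is still ongoing.
    The default threshold is an assumed example value.
    """
    return (burn_rate(long_window_bad, slo_target) > threshold
            and burn_rate(short_window_bad, slo_target) > threshold)


# Hypothetical SLO: 99% of requests complete within the latency threshold.
# 20% slow requests in both windows is a burn rate of about 20.
page_now = should_page(0.20, 0.20, slo_target=0.99)

# The short window has recovered to 5% slow requests (burn rate about 5).
page_recovered = should_page(0.20, 0.05, slo_target=0.99)
```

The example threshold of 14.4 corresponds to spending 2% of a 30-day budget within one hour (0.02 × 720 hours). Other windows and thresholds follow the same calculation.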

Björn Rabenstein

Engineer, Grafana Labs
Björn “Beorn” Rabenstein is an engineer at Grafana Labs and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.

Wednesday November 20, 2019 3:20pm - 3:55pm PST
Room 16AB - San Diego Convention Center Mezzanine Level

4:25pm PST

Debugging Live Applications the Kubernetes Way: From a Sidecar - Joe Elliott, Grafana Labs
Linux features a number of powerful debugging tools that give us insight into how our applications run in a real environment. Through live demonstration, this session will present a straightforward way to begin debugging applications in a Kubernetes-native way: from a sidecar. Sidecars offer a low-impact way of profiling applications without installing packages or making messy changes to your nodes.

The techniques demonstrated will include recording LTTng events, CPU profiling, generating Flame Graphs and dynamic tracing with BCC. These techniques will be performed against a .NET Core sample application, but that will not be the focus of the session.

Joe Elliott

Backend Engineer, Grafana Labs
Joe Elliott is a Backend Engineer at Grafana Labs. Since Kubernetes 1.5 he has been building and maintaining microservice platforms on AWS for development teams to deploy their applications. Joe maintains several open source applications on GitHub that publish metrics, manage Grafana...

Wednesday November 20, 2019 4:25pm - 5:00pm PST
Ballroom Sec 20AB - San Diego Convention Center Upper Level

5:20pm PST

The Great Cardinality Disasters of Our Time - Bryan Boreham, Weaveworks & Chris Marchbanks, Splunk
Many Cloud Native tools generate Prometheus metrics; together they form a great combination to operate and monitor your infrastructure. But sometimes things go wrong: a quirk in the metric labels can make the volume of data explode, and, soon after, your Prometheus will explode too.

Chris and Bryan will share their war-stories such as receiving 46,000 simultaneous alerts or squashing the source of 100kB label values. Then, they will provide top tips to avoid this happening to your tools in the future.

Bryan Boreham

Distinguished Engineer, Grafana Labs
Bryan is a Distinguished Engineer at Grafana Labs, the observability company. After first getting into programming as a kid, creating a video game called "Splat", Bryan's career has ranged from charting pie sales at a bakery to real-time pricing of billion-dollar bond trades. At Grafana...

Chris Marchbanks

Senior Software Engineer, Splunk
Chris is a Software Engineer at Splunk where he delivers observability for teams working on multiple internal Kubernetes clusters. He is a team member for two CNCF projects, Prometheus and Cortex. Outside of work, Chris enjoys skiing uphill in the mountains of Colorado.

Wednesday November 20, 2019 5:20pm - 5:55pm PST
Room 17AB - San Diego Convention Center Mezzanine Level

Thursday, November 21

10:55am PST

Deep Linking Metrics and Traces with OpenTelemetry, OpenMetrics and M3 - Rob Skillington, Chronosphere
Metrics and traces are two pillars of Observability and are often used in a complementary fashion. Metrics can give you a high level view of an application’s responses and performance, and tracing can give you a detailed view of requests through applications. Often when using metrics in graphs or alerts you want to be able to jump to an example of a request represented by a given metric datapoint, which is difficult to do today. In this talk we show an example of this using an OpenTelemetry exporter to publish trace IDs as exemplars using the OpenMetrics exposition format.

We then walk through configuring Jaeger as a tracing backend and M3 as a metrics backend to store the trace ID alongside a datapoint. We show how it is then possible to go from a metrics graph that visualizes the latency of your application to a trace that fell into a latency bucket using the deep link of the trace ID.

Rob Skillington

CTO, Chronosphere
Rob Skillington is the CTO at Chronosphere and creator of the open source M3, a Prometheus long-term storage metrics platform. He is also a member of OpenMetrics, an open standard for transmitting metrics at scale.

Thursday November 21, 2019 10:55am - 11:30am PST
Room 11AB - San Diego Convention Center Upper Level

11:50am PST

Exporting Kubernetes Event Objects for Better Observability - Mustafa Akın & Ahmet Şeker, Atlassian
Objects in Kubernetes, such as Pods, Deployments, Ingresses and Services, publish events to indicate status updates or problems. Most of the time these events are overlooked, and their one-hour lifespan means important updates can be missed. They are also not searchable and cannot be aggregated.

We are open-sourcing our internal tool for publishing the events in Kubernetes to Opsgenie, Slack, Elasticsearch, Webhooks, Kinesis, Pub/Sub. It has a configuration language for matching events based on various criteria, such as the content and the related object’s labels. It also has the capability to route the events intelligently, inspired by Prometheus Alertmanager.

For instance, you can notify the owner of a Pod about runtime OCI failures, or aggregate how many times images are pulled and how many times the container sandbox changes for various resource labels.

Mustafa Akın

SRE, Atlassian
Mustafa works at Atlassian Opsgenie as a Senior Site Reliability Engineer. He works with Kubernetes and Golang to keep Opsgenie up at all times, and works on observability and tracing. In his free time, he works on scheduling algorithms for Kubernetes for his PhD studies.

Ahmet Şeker

SRE, Atlassian
Ahmet is an Engineering Manager on the Atlassian Opsgenie SRE Team. Besides his management and SRE tasks, he is working to build a unified build system at Opsgenie. He and his team are the main drivers of Opsgenie's K8s journey.

Thursday November 21, 2019 11:50am - 12:25pm PST
Room 16AB - San Diego Convention Center Mezzanine Level
