Operations
Tuesday, November 19

10:55am PST

Making the Most Out of Kubernetes Audit Logs - Laurent Bernaille & Robert Boll, Datadog
The Kubernetes audit logs are a rich source of information: all of the calls made to the API server are stored, along with additional metadata such as usernames, timings, and source IPs. They help to answer questions such as “What is overloading my control plane?” or “Which sequence of events led to this problematic situation?”. These questions are hard to answer otherwise—especially in large clusters. At Datadog, we have been running clusters with 1000+ nodes for more than a year and during that time, the audit logs have proved invaluable.

In this talk, we will first introduce the audit logs, explain how they are configured, and review the type of data they store. We will then demo a functioning setup and show a few different types of analysis techniques. Finally, we will describe in detail several scenarios where they have helped us to diagnose complex problems.


Laurent Bernaille

Staff Engineer, Datadog
Laurent Bernaille worked several years as a consultant specializing in cloud, containers, and automation and helped organizations migrate to the public cloud and adopt containers. He is now Principal Engineer at Datadog and works closely with infrastructure teams, which are responsible...

Robert Boll

Senior Director of Engineering, Datadog

Tuesday November 19, 2019 10:55am - 11:30am PST
Room 6C - San Diego Convention Center Upper Level

11:50am PST

Take Envoy Beyond a K8s Service Mesh - to Legacy Bare Metal and VMs + More - Steve Sloka & Steven Wong, VMware
Envoy’s mission is to extract network and communication security code from applications in a way that developers and users can deploy components that just work no matter where they run or what hosts them.

This session will show how to leverage Envoy to achieve interoperation of applications and services, split across Kubernetes and traditional VM or bare metal hosts. We’ll look at how to incrementally bring Kubernetes into an existing application architecture based on existing VM or bare metal applications and services.

Specific examples will demonstrate:
- Using Contour with Envoy as an Ingress and load balancer solution with a richer feature set than some common alternatives
- Sending requests from VM workloads to Kubernetes services
- Directing requests from Kubernetes to services running on a VM
- Dynamic traffic steering across K8s and VM workloads at the same time

Steven Wong

Staff Engineer, VMware
Steve Wong has been active in the Kubernetes community since 2015. He is a co-chair of the CNCF Working Group. Steve is co-chair of the VMware User Group on the Kubernetes project. He has implemented industrial control systems for many factories, pipelines, and process control systems...

Steve Sloka

Sr. Member of Technical Staff, VMware
Steve Sloka is a Sr. Member of Technical Staff at VMware based in Pittsburgh, PA dealing with all things Cloud, Containers, and Kubernetes. Steve is a maintainer of Contour & Gimbal and is a contributor to many other open source projects. Steve is also a Kubernetes contributor and...

Tuesday November 19, 2019 11:50am - 12:25pm PST
Room 28ABCDE - San Diego Convention Center Upper Level

2:25pm PST

Living with the Pathology of the Cloud: How AWS Runs Lots of Clusters - Micah Hausler, Amazon
Disk speed screeches to a crawl, packets get dropped, connections time out: welcome to the cloud! Most of the time the cloud "just works", but when it doesn’t, how do Kubernetes and etcd handle failure? In this talk Micah will discuss considerations for building and configuring cloud native systems for failure, including how Amazon EKS’s architecture and design account for outages and dependency failures. Micah will also cover lessons learned from managing lots and lots of Kubernetes and etcd for customers around the world.

Micah Hausler

Principal Engineer, AWS
Micah is a Kubernetes contributor, a member of the Kubernetes Security Response Committee, and a Principal Engineer working on EKS at Amazon Web Services.

Tuesday November 19, 2019 2:25pm - 3:00pm PST
Room 29ABCD - San Diego Convention Center Upper Level

3:20pm PST

Building the Cloud Native Kernel: Kubernetes Release Engineering - Tim Pepper & Stephen Augustus, VMware
Is Kubernetes a kernel or distribution? Yes! It is necessarily both!

CRDs, out-of-tree cloud providers, and CNI/CSI/CRI abstractions evolve Kubernetes’ core toward an extensible kernel.

At KubeCon NA 2017, Tim Hockin and Michael Rubin started a conversation on formalizing “Kubernetes upstream as a distro”, proposing we clean up thinking/processes, define tools/standards, and incentivize distros to stay close. They argued for a Kubernetes reference distribution focused on correctness and stability.

So where is it?

After a slow start, we have momentum in 2019 to improve conformance and API stability and to better document support stances. However, to understand why we don’t (yet) have an upstream reference distro, we need to dive deep on build/release/test tooling.

This talk will summarize Kubernetes distro issues/advances and potential contribution areas for individuals and companies.

Stephen Augustus

Lead, Cloud Native Tools & Advocacy, VMware
Stephen Augustus is an active leader in the Kubernetes community. He currently serves as a Special Interest Group Chair (Release, PM), a Release Manager, and a subproject owner for Azure. Stephen leads the Cloud Native Developer Strategy team at VMware, driving meaningful interactions...

Tim Pepper

Principal Engineer, VMware
Tim Pepper is a Principal Engineer in VMware's Open Source Technology Center with over 25 years in open source, working as an open source developer advocate and contributor to Kubernetes (emeritus Steering Committee elected member, emeritus Code of Conduct Committee elected member...

Tuesday November 19, 2019 3:20pm - 3:55pm PST
Room 1AB - San Diego Convention Center Upper Level

4:25pm PST

Scaling Kubernetes to Thousands of Nodes Across Multiple Clusters, Calmly - Ben Hughes, Airbnb
In under a year, Airbnb went from 600 Kubernetes nodes across a couple of handcrafted clusters to over 5000 nodes on tens of clusters. Successful adoption of Kubernetes by services led to more and faster adoption, which in turn led to challenges of scale. Facing this, Airbnb switched to a multiple production cluster architecture to get around single cluster scalability limits and ensure ample capacity for services.

This process increased the consistency of the cluster configurations while reducing manual operations. This talk will discuss the problems that were faced during scaling, the shape of the solutions, specific approaches that worked well (and didn’t), and how this was accomplished without a drastic shift away from existing pre-Kubernetes infrastructure tooling. A key result was reducing the time to create a new, production-ready cluster from over a week to under an hour.


Ben Hughes

Software Engineer, Airbnb
Ben Hughes has worked on database scaling, Ruby and Node.js performance, incident response, and Kubernetes at Airbnb. He has previously spoken about [Scaling Airbnb](https://www.oreilly.com/library/view/velocity-conference-new/9781491900406/video191370.html) at VelocityConf NY, [Alerting](https://www.youtube.com/watch?v=MYmVu_IMC20...

Tuesday November 19, 2019 4:25pm - 5:00pm PST
Ballroom Sec 20CD - San Diego Convention Center Upper Level
Wednesday, November 20

10:55am PST

Running Large-Scale Stateful Workloads On Kubernetes at Lyft - Surinder Singh & Anmol Khurana, Lyft
Along with core services, K8s at Lyft also forms the base for running a large variety of stateful data processing jobs, including Spark, Flink, and other jobs run via various ML and data processing pipelines.

At Lyft, K8s has become the driver for the majority of our data processing needs, running tens of thousands of concurrent jobs. Operating the platform at this scale presents a unique set of challenges, which get more complex with highly variable load patterns.

In this talk, the speakers will share their journey through some of these challenges and learnings.
- Potential pitfalls of running stateful jobs on K8s.
- Knobs/tweaks to optimize K8s for stateful jobs.
- Running K8s in a cloud environment.
- Building a fault-tolerant self-healing system with multiple K8s clusters underneath.

The talk will also focus on optimizations done to support the widely used workloads at Lyft.

Surinder Singh

Software Engineer, Lyft
Surinder Singh is a software engineer at Lyft in Seattle. He led the execution plane for Flyte, Lyft’s open-source machine learning and data processing pipelines platform. Before Lyft, Surinder was at Microsoft, where he worked on Azure Storage and SQL Server Query Optimizer.

Anmol Khurana

Software Engineer, Lyft
Anmol Khurana is a software engineer at Lyft. He is part of the Data Platform team, responsible for leading the effort on Containerized Spark on K8s. Before Lyft, Anmol was at Amazon for 5+ years, mostly with the AWS Elastic Block Store team.

Wednesday November 20, 2019 10:55am - 11:30am PST
Ballroom Sec 20AB - San Diego Convention Center Upper Level

11:50am PST

Don’t Catch Feelings, Catch Issues With Kuberhealthy - Joshulyne Park & Shilla Saebi, Comcast
Kuberhealthy is a synthetic monitoring operator for both apps and Kubernetes clusters. Learn how to increase application and cluster observability by replicating real workflows and carefully checking that the expected behavior occurs. With Kuberhealthy, our team has been able to reliably monitor all critical Kubernetes cluster functionality in order to catch issues before our developers do. With Kuberhealthy, you can write your own tests of any kind in your own container and Kuberhealthy will manage everything else, including the creation of Prometheus metrics.

As we’ve transitioned more and more cloud workloads to elastic, self-healing Kubernetes clusters, the job of keeping the clusters running smoothly has become more challenging and important. That’s why we’re so excited to share Kuberhealthy, a new open-source tool we built at Comcast to keep our Kubernetes clusters running at their best.

Joshulyne Park

Cloud Engineer, Comcast Technology Solutions
Joshulyne Park is a Cloud Engineer working on building a highly scalable and reliable Kubernetes platform to support all of Comcast Technology Solutions products and services. She is a graduate of Comcast's Career Opportunities and Rotational Experiences (CORE) technology program...

Shilla Saebi

Program Manager, Open Source, Comcast
Shilla Saebi is an Open Source Program Manager who focuses on community and has been with Comcast for almost a decade. She has worked in many diverse roles within the tech industry in positions ranging from operations engineering, system administration, customer service, and network...

Wednesday November 20, 2019 11:50am - 12:25pm PST
Room 6F - San Diego Convention Center Upper Level

2:25pm PST

Fidelity’s Move to “Finance Grade” Kubernetes with GitOps - Alexis Richardson, Weaveworks & Rajarajan Pudupatti SJ, Fidelity Investments
At Fidelity Investments, every application must meet a unique mix of regulatory, security and governance requirements to protect millions of customers.

When Fidelity adopted Kubernetes for cloud application delivery, they teamed up with AWS and Weaveworks to use GitOps as a tool to analyze and implement a compliant platform. In this session, Rajan Pudupatti, Cloud Platforms Architect at Fidelity Investments and Alexis Richardson, CEO of Weaveworks will present the story. They’ll share when to automate, how to secure your CD pipeline, the process for adding deployment policy for clusters and applications, and connecting enterprise development tools to cloud automation services.

The session covers challenges and lessons learned implementing the Kubernetes platform with GitOps best practices, to operate efficiently and securely at scale.

Alexis Richardson

CEO & Founder, Weaveworks
Alexis is CEO and co-founder of Weaveworks, and was the first chair of the CNCF TOC. He is also known for popularising the terms and practices of GitOps. Previously, at Pivotal, as head of products for Spring, RabbitMQ, Redis and vFabric, he "rebooted" Spring and transitioned the...

Rajarajan Pudupatti SJ

Director, Cloud Platform Architecture, Fidelity Investments
Rajarajan is a Cloud Platform Architect at Fidelity Investments. At Fidelity, he drives the engineering behind implementing various container & serverless platforms at enterprise scale. His current focus is on building an ecosystem of frameworks, tools and design patterns that will...

Wednesday November 20, 2019 2:25pm - 3:00pm PST
Exhibit Hall AB - San Diego Convention Center Ground Level

3:20pm PST

To Infinite Scale and Beyond: Operating Kubernetes Past the Steady State - Austin Lamon, Spotify & Jago Macleod, Google
Operating large distributed systems at significant scale is challenging. Most discussions focus on scalability either at a single point in time under sustained load, or explore challenges related to changes in incoming traffic.

But running distributed systems at scale is about more than steady states and transitions between them. Equally challenging, and often overlooked, are the operational challenges of running at scale: upgrading many and/or large clusters; deploying applications to and across multiple clusters in a reasonable way; balancing freedom and consistency across multiple teams. In this case study, Google and Spotify share some of the challenges of running Kubernetes at scale, together with concrete solutions, patterns, and common pitfalls we have found together. Intended for cluster operators and developers from organizations of any size and on any provider.


Jago Macleod

Engineering Director, Google
Jago Macleod is an Engineering Director at Google, where he is responsible for much of Google’s investment in Kubernetes, including productization through GKE, GDC, and Anthos. In this role since 2017, Jago has had the privilege of leading the ‘Kubernetes Kernel’ team, including...

Austin Lamon

Group Product Manager, Spotify
Austin Lamon is a software engineer turned product manager who is passionate about building scalable & resilient products that delight developers & customers. He currently leads product for Spotify's Core Infrastructure team in Stockholm and New York building the service platform...

Wednesday November 20, 2019 3:20pm - 3:55pm PST
Pacific Ballroom, Salon 20-22 - Marriott Marquis San Diego Marina Hotel

4:25pm PST

Wait, People Run Kubernetes on Mainframes? - Elizabeth K. Joseph, IBM
When you think of container orchestration, mainframes probably aren't the first thing that comes to mind.

But modern mainframes run Linux as a first class citizen and KVM can be used for virtualization, opening a whole world of open source tooling integration via libvirt and related virtualization tooling. The careful observer may have already discovered that the mainframe architecture (s390x) is one of the architectures that's built for every Kubernetes release.

How did this come to be? Who uses these mainframe builds of Kubernetes? Why would you run a distributed container orchestration service on a platform that's a symbol of the monolith we're looking to leave? Drawing upon my work with distributed systems and containers, including time spent on OpenStack, Apache Mesos and Kubernetes, and my new experience with mainframes, this talk answers all of those questions and more.

Elizabeth Joseph

Developer Advocate, IBM
Elizabeth K. Joseph is a Linux systems administrator turned developer advocate for IBM Z where she works with the community to explore Linux workloads on mainframes. She has previously worked on distributed systems, including OpenStack and Apache Mesos, and has written books on Ubuntu...

Wednesday November 20, 2019 4:25pm - 5:00pm PST
Pacific Ballroom, Salon 14-15 - Marriott Marquis San Diego Marina Hotel

5:20pm PST

The Myth of the Monocluster - Matt Silverlock, Google
Building out a single monolithic Kubernetes cluster and trying to migrate all the things rarely, if ever, works out, and Kubernetes doesn't change that. It becomes harder to gather non-conflicting requirements, or avoid scope creep as new teams have what seem like reasonable asks (to them). Not to mention the technical challenges & increased blast radius of a big cluster.

How can we start with smaller teams, help them migrate and operationalize their clusters, learn from the inevitable mistakes, document the shortcuts, and use that as the framework for future teams?

Let's talk through what we need to ask ourselves in order to migrate to Kubernetes, how to divide & conquer (our clusters), and some lessons learnt from working with large organizations.

Matt Silverlock

Customer Engineer, Google
Matt is a customer-facing engineer at Google, and regularly works with organizations actively moving to Kubernetes, from DIY on-prem, unmanaged on VMs, or managed platforms like GKE (and sometimes, a mix of all three). This gives him first-hand insight into how organizations are building...

Wednesday November 20, 2019 5:20pm - 5:55pm PST
Ballroom Sec 20CD - San Diego Convention Center Upper Level
Thursday, November 21

10:55am PST

Handling Risky Business: Cluster Upgrades - Puneet Pruthi, Lyft
Have you ever had to upgrade your Kubernetes clusters to update to a new release version, push new features or patch critical security vulnerabilities? Did it ever feel daunting to live update API masters or etcds? Can you automate such an operation?

We hope to share our musings at Lyft on solving the complexity of automating cluster upgrades and how that is incorporated into the design of k8srotator, a Kubernetes custom controller.

As multiple components operating in cohesion make a cluster healthy, there are numerous points of failure that can occur during an upgrade cycle. Although there are varied ways of operating a Kubernetes cluster, the issues encountered during the process are common.

Attendees will walk away with knowledge about different cluster upgrade failure scenarios and ways to automate such operations without being in constant fear of losing the cluster state.

Puneet Pruthi

Engineering Manager, Lyft
Puneet is the Engineering Manager for the Cloud Orchestration Team at Lyft, which maintains the platform for microservices to interact with cloud providers. Previously he was a Senior Software Engineer on the Compute Team where he worked on supporting the Kubernetes Infrastructure and...

Thursday November 21, 2019 10:55am - 11:30am PST
Ballroom Sec 20CD - San Diego Convention Center Upper Level

11:50am PST

Am I Using It Right? Checking Best Practices on Live Kubernetes Clusters - Varsha Varadarajan & Adam Wolfe Gordon, DigitalOcean
While Kubernetes is stable, best practices for using it are a moving target. Some are generally applicable, others unique to a particular configuration or platform. Following best practices helps ensure workloads stay running as expected through cluster maintenance and upgrades, but checking them can feel like playing whack-a-mole in the dark.

This talk introduces a new open source tool, clusterlint, that checks compliance with best practices. Unlike other linters that work on deployment manifests, clusterlint identifies risks and problems in running Kubernetes clusters, making it useful for finding potential problems before performing cluster maintenance.

We'll discuss what clusterlint checks, why, how it works, how we use it in DigitalOcean's managed Kubernetes product to warn users of danger, and future plans for the tool.

Adam Wolfe Gordon

Senior Engineer II, DigitalOcean
Adam Wolfe Gordon is a senior engineer focused on product strategy at DigitalOcean. Among other things, he previously worked as the tech lead for DigitalOcean's Kubernetes and container registry products. Adam is interested in infrastructure products, and likes to spend as much time...

Varsha Varadarajan

Engineering Intern, DigitalOcean
Varsha is a software engineer currently pursuing a Master's degree in Computer Science. She previously worked at ThoughtWorks in the continuous delivery domain; and as an intern at DigitalOcean on managed Kubernetes, where clusterlint was created. She likes working on Kubernetes related...

Thursday November 21, 2019 11:50am - 12:25pm PST
Ballroom Sec 20AB - San Diego Convention Center Upper Level

2:25pm PST

The Gotchas of Zero-Downtime Traffic /w Kubernetes - Leigh Capili, Weaveworks
Noticing your customers receive 503's every now-and-then?
Do they spike when you're updating your app or rotating your k8s cluster nodes?
Maybe you used to have this problem -- then you added some strange settings and it's mostly working now…

What most people need from Kubernetes regarding web-traffic is a repeatable but under-documented combo of esoteric, non-default options.

We'll walk through the basic needs of shaping traffic and apply that knowledge to the states of compute, rollout, and canonical networking we see with k8s.
Expect tidbits about CRI, CNI, Ingress, and the design trade-offs present in Kubernetes and its APIs.

You’ll leave this session knowing how to keep your apps serving successful requests for a myriad of edge-cases.

Leigh Capili

Developer Experience Engineer, Weaveworks
Leigh is a Kubernetes Contributor and works in Developer Experience with Weaveworks. He authored kubeadm's etcd mTLS implementation and is currently working toward k8s component-standards and cluster-addons. Previously, he helped design a functional state-store for...

Thursday November 21, 2019 2:25pm - 3:00pm PST
Ballroom Sec 20AB - San Diego Convention Center Upper Level

3:20pm PST

The Elephant in the Kubernetes Room: Team Interactions at Scale - Manuel Pais, Independent
Kubernetes helps us tame sprawling microservices architectures and address increased operational complexity. Kubernetes gives developers abstractions and APIs to deploy and run their services.

Yet, the elephant in the room is that to run, maintain and evolve Kubernetes clusters, we need more ops expertise and most likely a dedicated team to do so.

The question that begs to be asked is whether we risk going back to pre-DevOps isolation between Dev and Ops teams. Is the tradeoff between better operational tools and introducing a new dependency layer on the path to production for application teams worthwhile? Are we making life easier for application teams or instead reducing their end-to-end ownership?

Manuel will then introduce Team Topologies, a balanced approach for thinking about team responsibilities and interactions which can help get the most value out of your Kubernetes adoption.

Manuel Pais

Co-Author, "Team Topologies"
Manuel Pais is co-author of Team Topologies: Organizing Business and Technology Teams for Fast Flow. Recognized by TechBeacon as a DevOps thought leader, Manuel is an independent IT organizational consultant and trainer, focused on team interactions, delivery practices, and accel...

Thursday November 21, 2019 3:20pm - 3:55pm PST
Room 6F - San Diego Convention Center Upper Level

4:25pm PST

Enforcing Service Mesh Structure using OPA Gatekeeper - Sandeep Parikh, Google
Organizations need the ability to apply rules to their workloads and services, at scale and distinct from the development of those services. Policies and policy enablement provide those governance capabilities with declarative approaches. OPA Gatekeeper integrates with Kubernetes and is able to provide the right guardrails to enforce structure and keep your deployments running smoothly. In this session we'll talk about policy management and how OPA Gatekeeper can help manage policies at scale. We'll walk through the high-level architecture of Gatekeeper along with applied examples and demonstrate how it can be used to manage security and traffic management mechanisms found in service mesh deployments.

Sandeep Parikh

DevRel Engineer, Google Cloud
Sandeep is a DevRel Engineer for Google Cloud, where he focuses on making it easier for developers & operators to adopt DevOps and cloud native tools and processes. Sandeep’s background is in software engineering and he's worked for Google, VMware, Apple, MongoDB, and many others...

Thursday November 21, 2019 4:25pm - 5:00pm PST
Room 6F - San Diego Convention Center Upper Level

5:20pm PST

Governance on K8s: How to Solve Ownership, Metering & Capacity Planning - Micheal Benedict & Yongwen Xu, Pinterest
Pinterest is a cloud-first visual discovery engine that serves over 250MM users. To support this scale, there are thousands of services running on tens of thousands of hosts, processing 300+PB of data. We operate large Kubernetes clusters across several availability zones and regions. The clusters are auto-scaled, with support for pod-level auto-scaling. Finally, to effectively utilize resources within the clusters, we operate heterogeneous workloads on a kitchen sink of instance types. Given this,
1. Who owns what?
2. What is driving utilization?
3. How do we plan capacity effectively with minimal overhead?

In this talk, we will share how we built a governance platform to address the above by defining canonical ownership, metering and reporting resource utilization (at various granularities), and finally adding a policy enforcement mechanism (e.g., pre-emption, placement, etc.).

Micheal Benedict

Head of Engineering Productivity, Pinterest
Micheal Benedict heads the Engineering Productivity organization at Pinterest that is responsible for languages strategy, source code management, build systems & CI/CD platform. Previously, Micheal led products for the Compute Platform at Twitter. Micheal holds a master's degree in...

Yongwen Xu

Technical Lead - Engineering Productivity, Pinterest
Yongwen Xu is the Tech Lead of the Engineering Productivity Team at Pinterest. Previously, Yongwen worked as a staff engineer at Sun and Oracle developing large-scale distributed systems. He holds a PhD in computer science from the University of Hawaii at Manoa.

Thursday November 21, 2019 5:20pm - 5:55pm PST
Ballroom Sec 20CD - San Diego Convention Center Upper Level
