Machine Learning + Data
Tuesday, November 19

10:55am PST

Running Apache Samza on Kubernetes - Weiqing Yang, LinkedIn Corporation
Apache Samza is a distributed stream processing framework that allows you to process and analyze your data in real time. It has been widely used at LinkedIn and other companies at large scale. Recently, we added Kubernetes as a new scheduler backend for Samza to run in distributed mode. In this talk, we will dive deep into the technical details of how Samza runs natively on Kubernetes by leveraging the primitives Kubernetes provides for scheduling, storage, etc. We will also compare running Samza on Kubernetes with existing solutions such as YARN and standalone mode. Finally, we will share some practices for running Kubernetes as a container orchestration framework for other big data processing engines.
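As a rough illustration only (Samza itself is a Java framework, and the class and messages below are invented), Samza's per-message processing model with local state can be sketched like this:

```python
# Hypothetical sketch of a Samza-style stateful stream task: the framework
# calls process() once per incoming message, in partition order, and the
# task keeps local state (which real Samza backs with a store like RocksDB).
class WordCountTask:
    def __init__(self):
        self.counts = {}  # local, per-task state

    def process(self, message):
        for word in message.split():
            self.counts[word] = self.counts.get(word, 0) + 1

task = WordCountTask()
for msg in ["kubernetes runs samza", "samza streams"]:
    task.process(msg)
# task.counts["samza"] == 2
```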


Weiqing Yang

Software Engineer, LinkedIn
Weiqing has been working on big data computation frameworks since 2015 and is an Apache Spark/HBase/Hadoop/Samza contributor. She is currently a software engineer on the streaming infrastructure team at LinkedIn, working on Samza, Brooklin, etc. Before that, she worked on the Spark team at... Read More →

Tuesday November 19, 2019 10:55am - 11:30am PST
Room 1AB - San Diego Convention Center Upper Level
  Machine Learning + Data

11:50am PST

Enabling Kubeflow with Enterprise-Grade Auth for On-Prem Deployments - Yannis Zarkadas, Arrikto & Krishna Durai, Cisco
Kubeflow is an open source machine learning platform built on Kubernetes. Every service in Kubeflow is implemented either as a Custom Resource Definition (CRD) (e.g., TensorFlow Job) or as a standalone service (e.g., Kubeflow Pipelines).

As enterprises start to adopt Kubeflow, the need for access control, authentication, and authorization is emerging. An enterprise-grade solution must authenticate and authorize on two API layers: the Kubernetes APIs and the APIs of Kubeflow's standalone services. For better adoption, the solution should also integrate seamlessly with the user management systems enterprises already run, such as LDAP or Active Directory (AD).

We present how we combined open-source, cloud-native technologies to design and implement a flexible, modular solution for enterprise authentication and authorization in Kubeflow. The talk will include a live demo.
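The two API layers described above can be pictured with a toy check (illustrative only, not Kubeflow's actual implementation; the users, rules, and service names are invented): a request must pass both a Kubernetes RBAC-style rule and an application-level access list.

```python
# Toy two-layer authorization: a request succeeds only if it is allowed
# by the Kubernetes-layer RBAC rules AND the standalone service's ACL.
K8S_RBAC = {("alice", "notebooks", "create")}   # (subject, resource, verb)
APP_ACL = {"pipelines": {"alice", "bob"}}       # service -> permitted users

def authorize(user, service, resource, verb):
    k8s_ok = (user, resource, verb) in K8S_RBAC
    app_ok = user in APP_ACL.get(service, set())
    return k8s_ok and app_ok

# alice passes both layers; bob is on the app ACL but fails K8s RBAC
authorize("alice", "pipelines", "notebooks", "create")  # True
```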


Yannis Zarkadas

Software Engineer, Arrikto
Yannis is a software engineer at Arrikto, working with Kubeflow and the Kubernetes sig-storage group. He loves contributing to open source projects and has authored the Cassandra Operator in Rook and the official Scylla Operator, which he is currently maintaining.

Krishna Durai

Software Engineer, Cisco
Krishna is a software engineer at Cisco, Bangalore and is a contributor to the Kubeflow open-source project. He has been designing and engineering AI platforms in enterprise domains like healthcare.

Tuesday November 19, 2019 11:50am - 12:25pm PST
Room 16AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

2:25pm PST

Introducing KFServing: Serverless Model Serving on Kubernetes - Ellis Bigelow, Google & Dan Sun, Bloomberg
Production-grade serving of ML models is a challenging task for data scientists. In this talk, we'll discuss how KFServing powers real-world inference in production at Bloomberg, supporting business domains such as NLP, computer vision, and time-series analysis. KFServing (https://github.com/kubeflow/kfserving) provides a Kubernetes CRD for serving ML models on arbitrary frameworks. It aims to solve 80% of model serving use cases by providing performant, high-level abstraction interfaces for common ML frameworks. It provides a consistent and richly featured abstraction that supports bleeding-edge serving features like CPU/GPU auto-scaling, scale to and from zero, and canary rollouts. KFServing's charter includes a rich roadmap toward a complete story for mission-critical ML, including inference graphs, model explainability, outlier detection, and payload logging.
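One of the features named above, canary rollouts, comes down to a weighted traffic split between two model revisions. A minimal sketch (illustrative only; KFServing configures this declaratively via its CRD rather than in application code):

```python
import random

# Toy canary router: send roughly canary_percent% of inference requests
# to the canary model revision, the rest to the default revision.
def route(canary_percent, rng=random.random):
    return "canary" if rng() * 100 < canary_percent else "default"

random.seed(0)
hits = sum(route(10) == "canary" for _ in range(10000))
# hits is close to 1000, i.e. ~10% of traffic reaches the canary
```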


Dan Sun

Software Engineer Team Lead, Bloomberg
Dan Sun is the team lead of the Data Science Serverless Runtime team at Bloomberg. Focused on building managed, mission-critical ML inference solutions for production, he strives to understand and tackle data scientists' complex problems. He also has many years of experience at Bloomberg... Read More →

Ellis Bigelow

Software Engineer, Google
Ellis Bigelow is a software engineer at Google Cloud developing next-generation systems for the AI Platform Prediction Service. In addition to his efforts on Google's managed product, he leads the open source project Kubeflow/KFServing, a Kubernetes-based serverless inferencing platform... Read More →

Tuesday November 19, 2019 2:25pm - 3:00pm PST
Room 15AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

3:20pm PST

Towards Continuous Computer Vision Model Improvement with Kubeflow - Derek Hao Hu & Yanjia Li, Snap Inc.
With deep learning gaining popularity in industry, there is plenty of material on model training and serving. In production, however, machine learning typically isn't complete after a single round of training. Model owners need to improve trained models regularly, and good machine learning pipelines achieve this by leveraging continuous feedback.

In this talk, we will demonstrate how Kubeflow and Kubeflow Pipelines are used to continuously improve computer vision models at Snapchat. We will walk through how we orchestrate multiple components with Kubeflow Pipelines to extract data, label images, and (re)train machine learning models. We will also discuss best practices for authoring Kubeflow Pipelines components, based on our experience developing and deploying these components for production use.
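The extract, label, (re)train loop described above can be pictured as plain functions chained the way a pipeline wires component outputs to inputs (a toy sketch only; the function names and data are invented, and real Kubeflow Pipelines components run as containers):

```python
# Toy continuous-improvement pipeline: each function stands in for one
# pipeline component, with outputs flowing into the next component.
def extract_data():
    return ["img1", "img2", "img3"]

def label_images(images):
    return {img: f"label_for_{img}" for img in images}

def train_model(labeled):
    # stand-in "model": just records how many examples it saw
    return {"examples_seen": len(labeled)}

model = train_model(label_images(extract_data()))
# model["examples_seen"] == 3
```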


Derek Hao Hu

Software Engineer, Snap Inc.
Derek Hao Hu is a software engineer at Snap on the Perception team. He's been working on building machine learning infrastructure, components, pipelines and tools that power different types of computer vision experiences inside Snapchat.

Yanjia Li

Software Engineer, Snap Inc.
Yanjia Li is a Software Engineer on the Perception team of Snap. He has been working on the algorithms and systems behind various computer vision products in Snapchat. One of his focus areas is building the software to handle large-scale deep learning model training and inference... Read More →

Tuesday November 19, 2019 3:20pm - 3:55pm PST
Room 31ABC - San Diego Convention Center Upper Level

4:25pm PST

Measuring and Optimizing Kubeflow Clusters at Lyft - Konstantin Gizdarski, Lyft & Richard Liu, Google
Machine learning workloads are often resource-intensive operations. As companies adopt more of these workloads, tracking resource consumption and optimizing spending becomes more challenging.

At Lyft, we developed a system that scrapes metrics from Kubernetes clusters and persists them in data warehouses. We then built a pipeline that transforms snapshots into cluster utilization metrics along the dimensions of CPU, memory, and GPU. Finally, we join these metrics with our cost and usage dataset, so teams can budget resources accordingly and reduce spending.

In this talk, we will give an overview of Infraspend, our infrastructure for tracking Kubernetes usage. Attendees will learn how the data we collected helped Lyft reduce spending on Kubeflow clusters, and will gain insight into how Kubernetes clusters can be optimized without compromising performance or stability.
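The snapshot-to-utilization-to-cost pipeline described above reduces to a simple aggregation and join. A minimal sketch, with invented numbers and cluster names (Infraspend's actual schema is not public):

```python
# Toy version of the metric pipeline: aggregate usage snapshots into a
# utilization ratio, then join with cost data to estimate wasted spend.
snapshots = [  # hypothetical per-interval CPU samples for one cluster
    {"cluster": "kubeflow-a", "cpu_used": 40, "cpu_alloc": 100},
    {"cluster": "kubeflow-a", "cpu_used": 60, "cpu_alloc": 100},
]
cost = {"kubeflow-a": 500.0}  # monthly spend in dollars

def utilization(rows):
    used = sum(r["cpu_used"] for r in rows)
    alloc = sum(r["cpu_alloc"] for r in rows)
    return used / alloc

util = utilization(snapshots)             # 0.5
wasted = cost["kubeflow-a"] * (1 - util)  # 250.0 dollars of idle capacity
```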


Richard Liu

Senior Software Engineer, Google
Richard Liu is a Senior Software Engineer at Google Cloud. He is currently an owner and maintainer of the TensorFlow operator and Katib projects in Kubeflow. Previously he had worked as a software developer at Microsoft Azure.

Konstantin Gizdarski

Software Engineer, Lyft
Konstantin Gizdarski is a Software Engineer at Lyft, where he has been working on, among other things, surfacing the utilization and efficiency of Kubernetes infrastructure. Previously, he worked on machine learning and product at both Facebook and Stripe.

Tuesday November 19, 2019 4:25pm - 5:00pm PST
Room 6C - San Diego Convention Center Upper Level
  Machine Learning + Data
Wednesday, November 20

10:55am PST

Advanced Model Inferencing Leveraging KNative, Istio and Kubeflow Serving - Animesh Singh, IBM & Clive Cox, Seldon
Model inference use cases are becoming a requirement as models move into the next phase of production deployment. More and more users are encountering use cases around canary deployments, scale-to-zero, and other serverless characteristics. Advanced use cases are also emerging around model explainability, A/B tests, ensemble models, multi-armed bandits, etc.

In this talk, the speakers will detail how to handle these use cases using Kubeflow Serving and the cloud-native Kubernetes stack of Istio and Knative. Knative and Istio enable autoscaling, scale-to-zero, canary deployments, and routing traffic to the best-performing models. This can be combined with Knative eventing, the Istio observability stack, and the KFServing Transformer for pre-/post-processing and payload logging, which in turn enables drift and outlier detection to be deployed. We will demonstrate where KFServing currently stands and where it is heading.
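The scale-to-zero behavior mentioned above boils down to a replica-count decision driven by observed load. A toy sketch of a Knative-style autoscaling decision (illustrative only; the target value is an invented default, and the real autoscaler works over windowed metrics):

```python
import math

# Toy scaling decision: replicas track observed request concurrency,
# and zero traffic scales the service all the way to zero replicas.
def desired_replicas(concurrency, target_per_replica=10):
    if concurrency == 0:
        return 0  # scale to zero when idle
    return math.ceil(concurrency / target_per_replica)

desired_replicas(25)  # 3 replicas for 25 concurrent requests
```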


Animesh Singh

Distinguished Engineer and CTO - Watson Data and AI OSS Platform, IBM
Animesh Singh is CTO and Director for IBM Watson Data and AI Open Technology, responsible for Data and AI Open Technology strategy. Creating, designing and implementing IBM's Data and AI engine for AI and ML platform, leading IBM's Trusted AI efforts, driving the strategy and execution... Read More →

Clive Cox

CTO, Seldon
Clive is CTO of Seldon. Seldon helps enterprises put machine learning into production. Clive developed Seldon's open source Kubernetes-based machine learning deployment platform, Seldon Core. He is also a core contributor to the Kubeflow and KFServing projects.

Wednesday November 20, 2019 10:55am - 11:30am PST
Room 17AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

11:50am PST

Building and Managing a Centralized Kubeflow Platform at Spotify - Keshi Dai & Ryan Clough, Spotify
Machine learning workflows at Spotify have been migrated to Kubernetes by adopting Kubeflow and Kubeflow Pipelines. This helps teams increase model development speed and reduce the time to productionize a machine learning model.

In this talk, we will share the best practices Spotify has learned from managing Kubernetes for backend services and show how we applied them to building a centralized Kubeflow platform. We treat infrastructure as code and have established a customizable, repeatable deployment process. Even with only a handful of machine learning and data engineers, we are able to manage multiple Kubernetes clusters and machine learning workloads at scale.

We will also show how teams at Spotify use the Kubeflow platform as a one-stop shop for machine learning development, helping them build better products that improve the user listening experience.


Keshi Dai

ML Infra Engineer, Spotify
Keshi Dai is a Senior ML Engineer on the Spotify Machine Learning Platform team. He has been building and managing a centralized Kubeflow platform to help machine learning engineers at Spotify adopt Kubernetes. Recently, he has also been leading the effort to evaluate managed... Read More →

Ryan Clough

Senior ML Engineer, Spotify
Ryan Clough is a Senior Engineer on Spotify's Machine Learning Infrastructure team. Alongside his colleagues, he is responsible for designing and building the platform and tools that ML practitioners all across Spotify use to bring ML solutions from an idea all the way to production... Read More →

Wednesday November 20, 2019 11:50am - 12:25pm PST
Pacific Ballroom, Salon 20-22 - Marriott Marquis San Diego Marina Hotel
  Machine Learning + Data

2:25pm PST

Panel: Enterprise-grade, On-prem Kubeflow in the Financial Sector - Laura Schornack, JPMorgan Chase; Jeff Fogarty, US Bank; Josh Bottum, Arrikto; & Thea Lamkin, Google
This presentation will explore the journeys of two ML architects from JPMorgan Chase and US Bank, who have deployed Kubeflow in their on-premise environments. These subject matter experts will review their pre-installation checklists, their software architectures, and their operating expectations. They will pinpoint the features critical to an enterprise-grade deployment, such as authentication and authorization, data management, credentials management, and support for air-gapped environments. They will also discuss their collaboration with Kubeflow code contributors to define requirements and develop new functionality. The talk will include a review of planned Kubeflow enhancements and a roadmap for their delivery by contributors to the Kubeflow On-Prem Special Interest Group (SIG).


Josh Bottum

Vice President, Arrikto
I am a Kubeflow Community Product Manager and VP at Arrikto. We simplify storage architectures and operations for K8s platforms.

Jeff Fogarty

Innovation Engineer, US Bank
Jeff Fogarty is an Innovation Engineer at US Bank, supporting a team of data scientists. He participates in the Kubeflow open source community, focusing on on-prem functionality. Jeff speaks at technical events and conferences including the Kubeflow Contributors Summit and Cloud Native... Read More →

Thea Lamkin

Open Source Developer Relations Program Manager, Google
Thea Lamkin leads Google's Open Source Developer Relations Program for Kubeflow. Thea sets the developer program strategy for Kubeflow and executes on the tactical work items and events necessary to make Kubeflow a success. Thea specializes in Open Source Community Architecture, Developer... Read More →

Laura Schornack

Sr. Architect, JPMorgan Chase
Laura Schornack is a JPMorgan Chase lead design architect and expert engineer for shared services. Previously, she worked for other leading tech organizations such as IBM and Nokia. She holds a degree in computer science from the University of Illinois at Urbana-Champaign. Laura presents... Read More →

Wednesday November 20, 2019 2:25pm - 3:00pm PST
Room 14AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

3:20pm PST

Kubeflow: Multi-Tenant, Self-Serve, Accelerated Platform for Practitioners - Kam Kasravi, Intel & Kunming Qu, Google
The Kubeflow platform provides a self-serve, multi-tenant platform on Kubernetes for ML developers. Users can train their models on accelerated hardware in an isolated environment. Jobs can be configured and triggered from a notebook with no DevOps involvement. We leverage optimized libraries such as Intel® DAAL and Intel® MKL-DNN, now included in TensorFlow 1.14+. Models can be monitored using the Application CR deployed with Kubeflow. All attendees can join the demo, create their own workspace, and try out features. Attendees will walk away understanding how to run multi-tenancy on Kubernetes with Kubeflow.

- Self-serve multi-tenant workspace
- Workspace owners can share / revoke access
- System admin can reset access policy & resource quota per workspace
- Multi-tenancy service is transparent to other apps
- A UI is available to simplify the user experience
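The owner-controlled share/revoke behavior listed above can be sketched as a toy access model (illustrative only; Kubeflow's real multi-tenancy is implemented with profiles and Kubernetes RBAC, not a Python class):

```python
# Toy self-serve workspace: only the owner may grant or revoke access.
class Workspace:
    def __init__(self, owner):
        self.owner = owner
        self.members = {owner}

    def share(self, granter, user):
        if granter != self.owner:
            raise PermissionError("only the owner can share access")
        self.members.add(user)

    def revoke(self, granter, user):
        if granter != self.owner:
            raise PermissionError("only the owner can revoke access")
        self.members.discard(user)

ws = Workspace("alice")
ws.share("alice", "bob")  # bob now has access until alice revokes it
```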


Kunming Qu

Software Engineer, Google
Kunming Qu is a software engineer at Google working on the Kubeflow project, a Kubernetes-based platform that helps developers and enterprises deploy and use ML cloud-natively everywhere. He has been focusing on the Kubeflow deployment experience, Identity-Aware integration, multi-tenant clusters, enabling... Read More →

Kam Kasravi

Senior Software Engineer, Intel
Kam works at Intel and is an early contributor to Kubeflow. His focus has been on multi-tenancy, the kfctl/kustomize CLI, device/hardware integration, and application CR composition. Kam's speaking history includes Scala conferences and a number of Kubernetes/Kubeflow related user meetings... Read More →

Wednesday November 20, 2019 3:20pm - 3:55pm PST
Room 17AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

4:25pm PST

Realizing End to End Reproducible Machine Learning on Kubernetes - Suneeta Mall, Nearmap
Industry adoption of data science has grown rapidly in the last few years. The probabilistic nature of this space requires the right tools and techniques to ensure that the answers produced are reliable. Models are derived from data, which is almost always evolving, massive (as in deep learning), and in need of clean-up and pre-processing before use. Reproducibility, reporting, tracking, and management around the tasks of 1) data: collection, pre-processing, and often feature engineering, and 2) model: training, tuning, evaluation, and serving are essential.

With tools such as Pachyderm, Kubeflow, Katib, ModelDB, Seldon and Argo, an automated end-to-end reproducible machine learning framework can be built on Kubernetes. This talk will detail how the aforementioned tools can be used to build an automated, reproducible machine learning framework.
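One building block of the reproducibility described above is content-addressing a run: if the same data version, code version, and hyperparameters always map to the same run ID, results can be traced and reproduced. A minimal sketch (the versioning scheme here is invented; tools like Pachyderm and ModelDB each have their own):

```python
import hashlib
import json

# Derive a deterministic run ID from a training run's inputs, so
# identical inputs always reproduce the same ID (and lineage record).
def run_id(data_version, code_version, params):
    payload = json.dumps(
        {"data": data_version, "code": code_version, "params": params},
        sort_keys=True,  # canonical ordering keeps the hash stable
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

a = run_id("v1", "abc123", {"lr": 0.01})
b = run_id("v1", "abc123", {"lr": 0.01})
# a == b: same inputs, same run ID; changing any input changes the ID
```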


Suneeta Mall

Head of AI Engineering, Harrison.ai

Wednesday November 20, 2019 4:25pm - 5:00pm PST
Room 16AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

5:20pm PST

Flyte: Cloud Native Machine Learning & Data Processing Platform - Ketan Umare & Haytham AbuelFutuh, Lyft
Flyte is the backbone for large-scale machine learning and data processing (ETL) pipelines at Lyft. It is used across business-critical applications ranging from ETA and pricing to mapping and autonomous vehicles. At its core is a Kubernetes-native workflow engine that executes 10M+ containers per month across thousands of workflows. The talk will focus on:
- Architecture of Flyte and its specification language to orchestrate compute and manage data flow across disparate systems like Spark, Flink, Tensorflow, Hive, etc.
- Deploying highly scalable and fault tolerant Kubernetes Operators
- Learnings from operating Flyte across multiple Kubernetes clusters and using other CNCF technologies like gRPC, Envoy, FluentD, Kustomize and Prometheus.
- Use-cases where Flyte can be leveraged
The talk will conclude with a demo of a machine learning pipeline built using the open source version of Flyte.
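The workflow orchestration described above, where each node's inputs are upstream nodes' outputs, can be sketched as a tiny topological executor (illustrative only; a real Flyte workflow compiles to a typed specification and runs each node as a container):

```python
# Toy DAG executor in the spirit of a workflow engine: run each node
# once all of its dependencies have produced results.
def run_workflow(nodes, deps):
    """nodes: name -> fn(*upstream_results); deps: name -> upstream names."""
    results, done = {}, set()
    while len(done) < len(nodes):  # assumes deps form an acyclic graph
        for name, fn in nodes.items():
            if name not in done and all(d in done for d in deps.get(name, [])):
                results[name] = fn(*[results[d] for d in deps.get(name, [])])
                done.add(name)
    return results

out = run_workflow(
    {"extract": lambda: [1, 2, 3],
     "transform": lambda xs: [x * 2 for x in xs],
     "load": lambda xs: sum(xs)},
    {"transform": ["extract"], "load": ["transform"]},
)
# out["load"] == 12
```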


Haytham AbuelFutuh

Software Engineer, Lyft
Haytham Abuelfutuh is a software engineer at Lyft and leads the Flyte backend team. During his tenure at Lyft, Haytham has helped build Flyte from the ground up, built and shipped Kubernetes operators and investigated and optimized Flyte system performance on k8s. Before Lyft, Haytham... Read More →

Ketan Umare

Chief Software Architect, Union.ai
Ketan Umare is the TSC Chair for Flyte (incubating under LF AI & Data). He is also currently the Chief Software Architect at Union.ai. Previously he had multiple Senior Lead roles at Lyft, Oracle and Amazon ranging from Cloud, Distributed storage, Mapping (map making) and machine... Read More →

Wednesday November 20, 2019 5:20pm - 5:55pm PST
Room 14AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data
Thursday, November 21

10:55am PST

Improving Performance of Deep Learning Workloads With Volcano - Ti Zhou, Baidu Inc
Baidu has internally improved the performance of large-scale deep learning workloads by using the Volcano project. Volcano's CRD-based computing resource model makes it possible to use resources more efficiently and configure computing models more flexibly. The project provides a unified abstraction over underlying capabilities such as gang scheduling, fair share, priority queues, and job suspend/resume, which makes up for the missing functionality of the native job-based training operators.

After adopting Volcano, Baidu's internal resource utilization increased by 15%, and training task completion speed increased by 10%. This talk will introduce the overall functionality of Volcano, the transformation of existing operators to support Volcano, and a comparison of the performance of deep learning training tasks before and after adopting Volcano.
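Gang scheduling, one of the capabilities named above, is an all-or-nothing placement rule: a distributed training job is admitted only if every member pod can be placed at once, avoiding partial placements that deadlock each other. A toy sketch (illustrative only; Volcano implements this via PodGroups in the scheduler):

```python
# Toy gang scheduling: admit the whole job or nothing at all.
def gang_schedule(pods_needed, free_slots):
    if pods_needed <= free_slots:
        return free_slots - pods_needed, True   # place every pod of the gang
    return free_slots, False                    # place none; avoid deadlock

free, admitted = gang_schedule(pods_needed=4, free_slots=10)    # (6, True)
free2, admitted2 = gang_schedule(pods_needed=8, free_slots=free)  # (6, False)
```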


Ti Zhou

Senior Architect, Baidu
Ti Zhou, Kubernetes member and LF AI & Data TAC member, currently serves as a senior architect at Baidu Inc., focusing on the PaddlePaddle deep learning framework and Baidu Cloud Container Engine, helping developers deploy cloud-native machine learning on private and public clouds.

Thursday November 21, 2019 10:55am - 11:30am PST
Room 1AB - San Diego Convention Center Upper Level
  Machine Learning + Data

11:50am PST

Kubernetizing Big Data and ML Workloads at Uber - Mayank Bansal & Min Cai, Uber
Uber relies on big data and ML to make business-critical decisions such as pricing and trip ETAs. Today, workloads such as Hive and Spark run on YARN. To save millions of dollars through efficient use of cluster resources, Uber plans to use Kubernetes to co-locate big data/ML and microservice workloads.

Kubernetes is the de facto standard for running microservices. However, compared to YARN, it still lacks many features such as hierarchical resource pools, elastic resource sharing, and gang scheduling. To bridge this gap, we have re-architected Peloton as a set of Kubernetes scheduler and controller plugins so that we can provide feature parity with YARN.

This talk will cover:
- Learnings of running large-scale BigData/ML on Kubernetes with Peloton
- Colocation of mixed workloads
- Federation across zones
- Feature and API parity with YARN
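Elastic resource sharing, one of the YARN features mentioned above, means idle capacity flows to pools that still have demand, weighted by each pool's share. A toy water-filling allocator (illustrative only; Peloton's hierarchical pools are far richer, and the pool names and numbers are invented):

```python
# Toy weighted fair-share allocation: each round, active pools receive
# their weighted share of remaining capacity, capped by leftover demand;
# pools whose demand is met drop out and free capacity for the rest.
def fair_share(capacity, demands, weights):
    alloc = {p: 0.0 for p in demands}
    active = {p for p in demands if demands[p] > 0}
    remaining = float(capacity)
    while remaining > 1e-9 and active:
        total_w = sum(weights[p] for p in active)
        grants = {p: min(remaining * weights[p] / total_w,
                         demands[p] - alloc[p]) for p in active}
        for p, g in grants.items():
            alloc[p] += g
        remaining -= sum(grants.values())
        active = {p for p in active if demands[p] - alloc[p] > 1e-9}
    return alloc

# the "ml" pool keeps only what it needs; the surplus flows to "etl"
alloc = fair_share(10, {"ml": 2, "etl": 8}, {"ml": 1, "etl": 1})
```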


Min Cai

Sr. Staff Engineer, Uber
Min Cai is a Sr. Staff Engineer on the Compute Platform team at Uber working on all-active datacenters, cluster management, and microservice deployment systems. He received his Ph.D. in Computer Science from the University of Southern California. Before joining Uber, he was a Sr. Staff... Read More →

Mayank Bansal

Staff Engineer, Uber
Mayank Bansal is a Staff Engineer on the data infrastructure team at Uber. He is a co-author of Peloton, and an Apache Hadoop committer and Oozie PMC member and committer. Previously he worked at eBay on the Hadoop platform team, leading YARN and MapReduce efforts. Prior to... Read More →

Thursday November 21, 2019 11:50am - 12:25pm PST
Room 15AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

2:25pm PST

Networking Optimizations for Multi-Node Deep Learning on Kubernetes - Rajat Chopra, NVIDIA & Erez Cohen, Mellanox
Training a neural network may take days or weeks, even on a top-of-the-line GPU. To reduce training time, distributed computation is often employed to spread the work across multiple GPUs and multiple nodes; Horovod is a prime example of such a scalable architecture. At NVIDIA, in collaboration with the community, we have configured Kubernetes and multi-node infrastructure to deliver performance that scales as we add more GPUs and nodes. This talk presents the networking problems and solutions discovered during this journey.

A non-exhaustive list of solutions includes: CNI for multiple networks using SR-IOV; enabling RDMA over InfiniBand and Ethernet (RoCE) for low latency, high throughput, and direct GPU-to-NIC connectivity (GPUDirect); enforcing PCI affinity of GPUs with respect to network interfaces; using source-based routing within pods for L3 networks; and much more.
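The PCI affinity point above amounts to choosing, for each GPU, a NIC on the same NUMA node so GPUDirect traffic avoids crossing the CPU root complex. A toy sketch with an invented two-socket topology (real systems discover this from the PCIe tree, e.g. via `nvidia-smi topo`):

```python
# Toy GPU-to-NIC affinity: prefer the NIC on the GPU's own NUMA node.
GPU_NUMA = {"gpu0": 0, "gpu1": 0, "gpu2": 1, "gpu3": 1}  # hypothetical layout
NIC_NUMA = {"mlx5_0": 0, "mlx5_1": 1}

def nearest_nic(gpu):
    node = GPU_NUMA[gpu]
    for nic, nic_node in NIC_NUMA.items():
        if nic_node == node:
            return nic  # same NUMA node: shortest PCIe path
    return next(iter(NIC_NUMA))  # fall back to any NIC

nearest_nic("gpu3")  # "mlx5_1", the NIC local to NUMA node 1
```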


Erez Cohen

Vice President for CloudX & AI Program, Mellanox
Erez Cohen is Mellanox's Vice President for CloudX & AI Programs, responsible for strategy, architecture, and implementation. The CloudX program spans multiple cloud solutions, including OpenStack, Kubernetes, Microsoft, and VMware, and incorporates Mellanox's state-of-the-art network... Read More →

Rajat Chopra

Principal Engineer, Nvidia
Rajat Chopra currently works at NVIDIA on AI/deep learning infrastructure projects, including Kubernetes on edge devices, multi-node multi-rail RDMA for deep learning jobs, and layer 4 packet handling for a GPU cloud. He is also an expert in container networking, with founding... Read More →

Thursday November 21, 2019 2:25pm - 3:00pm PST
Room 5AB - San Diego Convention Center Upper Level
  Machine Learning + Data

3:20pm PST

Building a Medical AI with Kubernetes and Kubeflow - Jeremie Vallee, Babylon Health
Engineering AI systems at scale can be difficult, especially in highly regulated environments like healthcare. Many challenges arise, such as ensuring reproducibility, controlling data access policies, and running highly secure infrastructure. But with some planning and meticulous engineering, this can be achieved.

At Babylon Health, we've leveraged Kubernetes, Kubeflow, Argo, Istio, OPA, and many other Cloud Native technologies to provide a secure research platform for building and scaling medical AI models across the world.

In this talk, we will share our experience so far, give an overview of how these components fit together, and explain our vision for the future of our platform. We will demonstrate how using open-source CNCF technologies can help you achieve your goal of experimenting, training and serving your AI models at scale, while operating in a regulated environment.


Jeremie Vallee

AI Infrastructure Lead, Babylon Health
Jeremie is a Cloud Infrastructure Engineer at Babylon Health, using cloud-native technologies to scale AI model training. When he's not writing YAML, you can find him running in one of London's many parks, or being lost in a music festival somewhere in France. But mostly... Read More →

Thursday November 21, 2019 3:20pm - 3:55pm PST
Room 11AB - San Diego Convention Center Upper Level
  Machine Learning + Data

4:25pm PST

GPU as a Service Over K8s: Drive Productivity and Increase Utilization - Yaron Haviv, Iguazio
Building machine learning applications is hard. Surprisingly, it's not the data science that's hard, but all the operations around it. GPUs accelerate performance but pose challenges such as GPU resource sharing, software dependencies, and data bottlenecks. In a cloud-native era, data scientists are looking for a GPU-powered machine learning PaaS like AWS SageMaker or Google AI Platform, but one based on open source technologies, free of vendor lock-in, and deployable on-premises. Yaron Haviv will demonstrate how to integrate Kubernetes, Kubeflow, high-speed data layers, and GPU-powered servers to build self-service machine learning platforms. He will show how GPU resources can be pooled to maximize utilization and increase scalability, how to use RAPIDS for 10x faster data processing, and how to integrate GPUs with the rest of the machine learning stack.
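The GPU pooling idea above can be pictured with a toy first-fit allocator: jobs draw fractional slices from whichever pooled device has room, instead of pinning a whole GPU per user (illustrative only; real GPU sharing needs driver or scheduler support, and the fractions here are invented):

```python
# Toy shared GPU pool: first-fit allocation of fractional GPU slices.
def allocate(pool, request):
    """pool: device -> free fraction of that GPU; request: fraction needed."""
    for device, free in pool.items():
        if free >= request:
            pool[device] = round(free - request, 6)
            return device
    return None  # pool exhausted; the job would be queued

pool = {"gpu0": 1.0, "gpu1": 1.0}
d1 = allocate(pool, 0.5)  # fits on gpu0
d2 = allocate(pool, 0.7)  # gpu0 has only 0.5 left, so it lands on gpu1
```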


Yaron Haviv

CTO, Iguazio
Yaron Haviv is a serial entrepreneur with deep technological experience in the fields of ML, big data, cloud, storage, and networking. Prior to Iguazio, Yaron was the Vice President of Datacenter Solutions at Mellanox, where he led technology innovation, software development and... Read More →

Thursday November 21, 2019 4:25pm - 5:00pm PST
Room 17AB - San Diego Convention Center Mezzanine Level
  Machine Learning + Data

5:20pm PST

Supercharge Kubeflow Performance on GPU Clusters - Meenakshi Kaushik & Neelima Mukiri, Cisco
AI/ML applications on Kubernetes can be optimized for performance at many levels.

This presentation provides an overview of optimizations such as:
- Distributed training on multiple GPUs with optimal selection of interconnects between the GPUs and CPUs.
- Utilizing different types of GPUs/Servers for different workloads like training and inference.
- OS level optimizations to get optimal performance on the hardware.
- Usage of GPU Passthrough for optimal utilization and performance.

This presentation will also cover how the choice of a machine learning framework, such as Kubeflow, can impact performance and hardware utilization.


Meenakshi Kaushik

Leader, Product Manager, Cisco
Meenakshi Kaushik leads product management for the Cisco Panoptica security platform. She is interested in the AI and ML space and is excited to see how the technology can enhance human well-being and productivity.

Neelima Mukiri

Principal Engineer, Cisco
Neelima Mukiri is a Principal Engineer in Cisco's Cloud Platform Solutions group working on the architecture and development of Cisco's Container Platform. Prior to this she worked on core virtualization layer at VMware and systems software in Samsung Electronics.

Thursday November 21, 2019 5:20pm - 5:55pm PST
Room 11AB - San Diego Convention Center Upper Level
  Machine Learning + Data
