Thursday, November 21 • 2:25pm - 3:00pm
Networking Optimizations for Multi-Node Deep Learning on Kubernetes - Rajat Chopra, NVIDIA & Erez Cohen, Mellanox

Training a neural network may take days or weeks, even on a top-of-the-line GPU. To reduce training time, distributed computation is often employed to spread the work across multiple GPUs and multiple nodes. Horovod is a well-known example of such a scalable architecture. At NVIDIA, in collaboration with the community, we have configured Kubernetes and multi-node infrastructure to deliver performance that scales as we add more GPUs and nodes. This talk presents the networking problems discovered during this journey and their solutions.

The non-exhaustive list includes solutions such as CNI for multiple networks using SR-IOV; enabling RDMA over InfiniBand (IB) and Ethernet (RoCE) to provide low latency, high throughput, and direct GPU-to-NIC connectivity (GPUDirect); enforcing PCI affinity of GPUs with respect to network interfaces; using source-based routing within pods for L3 networks; and much more.

Erez Cohen

Vice President for CloudX & AI Program, Mellanox
Erez Cohen is Mellanox Vice President for CloudX & AI Programs, responsible for strategy, architecture and implementation. The CloudX program spans multiple cloud solutions, including OpenStack, Kubernetes, Microsoft and VMware, and incorporates Mellanox state-of-the-art network...

Rajat Chopra

Principal Engineer, NVIDIA
Rajat Chopra is currently working at NVIDIA on AI/deep-learning infrastructure projects, which include Kubernetes on edge devices, multi-node multi-rail RDMA for deep learning jobs, and layer 4 packet handling for a GPU cloud. He is also an expert in container networking with founding...

Thursday November 21, 2019 2:25pm - 3:00pm
Room 5AB - San Diego Convention Center Upper Level