Accelerating Machine Learning with GPUs in Kubernetes using the NVIDIA Device Plugin

NVIDIA Device Plugin for Kubernetes plays a crucial role in enabling organizations to harness the power of GPUs for accelerating machine learning workloads.

Keegan McCallum

Published on March 12, 2024


Generative AI is having a moment right now, in no small part due to the immense scale of computing resources being leveraged to train and serve these models. Kubernetes has revolutionized the way we deploy and manage applications at scale, making it a natural choice for building large-scale computing platforms.

GPUs, with their parallel processing capabilities and high memory bandwidth, have become the go-to hardware for accelerating machine learning tasks. NVIDIA’s CUDA platform has emerged as the dominant framework for GPU computing, enabling developers to harness the power of GPUs for a wide range of applications. By combining the capabilities of Kubernetes with the extreme parallel computing power of modern GPUs like the NVIDIA H100, organizations are pushing the boundaries of what is possible with computers, from realistic video generation to analyzing entire novels’ worth of text and accurately answering questions about the contents.

However, orchestrating GPU-accelerated workloads in Kubernetes environments presents its own set of challenges. This is where the NVIDIA Device Plugin comes into play. It integrates with Kubernetes, allowing you to expose GPUs on each node, monitor their health, and enable containers to leverage these powerful accelerators. By combining these two best-of-breed solutions, organizations are building robust, performant computing platforms to power the next generation of intelligent software.

Understanding the NVIDIA Device Plugin for Kubernetes

The NVIDIA Device Plugin is a Kubernetes DaemonSet that simplifies the management of GPU resources across a cluster. Its primary function is to automatically expose the number of GPUs on each node, making them discoverable and allocatable by the Kubernetes scheduler. This allows pods to request and consume GPU resources in much the same way as CPU and memory. Under the hood, the device plugin communicates with the kubelet on each node, providing information about the available GPUs and their capacities. It also monitors the health of the GPUs and reports any issues to Kubernetes.
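
To make the "discoverable and allocatable" part concrete, here is a rough sketch of what the relevant portion of a node object might look like once the plugin is running (for example, as printed by kubectl get node -o yaml). The resource name nvidia.com/gpu is the one the plugin advertises; the counts and sizes are made-up values for a hypothetical four-GPU node.

```yaml
# Illustrative excerpt of a Node object; all numbers are hypothetical.
status:
  capacity:
    cpu: "32"
    memory: 263854388Ki
    nvidia.com/gpu: "4"   # GPUs the device plugin reported to the kubelet
  allocatable:
    cpu: "32"
    memory: 263752000Ki
    nvidia.com/gpu: "4"   # GPUs the scheduler can still hand out to pods
```

If nvidia.com/gpu is missing from this section on a GPU node, the plugin pod on that node is the first place to look.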

Some of the benefits of the NVIDIA Device Plugin include:

  1. Automatic GPU discovery and allocation, eliminating the need to manually configure GPU resources on each node.
  2. Seamless integration with Kubernetes, allowing you to manage GPUs with familiar tools and workflows.
  3. GPU health monitoring, allowing Kubernetes to maintain stability and reliability for GPU-accelerated workloads.
  4. Resource sharing, which allows multiple pods to utilize the same GPU. This is crucial in today's environment, where GPUs are scarce and expensive.

Installing and Configuring the NVIDIA Device Plugin


  • Ensure that your GPU nodes have the necessary NVIDIA drivers (version ~= 384.81) installed.
  • Install the nvidia-container-toolkit (version >= 1.7.0) on each GPU node.
  • Configure the nvidia-container-runtime as the default runtime for Docker or containerd.
  • Run Kubernetes version >= 1.10.
  • If you are using a managed service such as AWS EKS, these prerequisites are handled for you by default on GPU nodes.

Deploying the Device Plugin

First, we’ll install the DaemonSet using Helm. To install the latest version (v0.14.5 at the time of writing) into a cluster with default settings, the most basic command is:

helm upgrade -i nvdp nvidia-device-plugin \
  --repo https://nvidia.github.io/k8s-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version v0.14.5

This will install OR upgrade a helm release named nvdp into the nvidia-device-plugin namespace, with default settings.

This will give you a basic setup, but there are many reasons you may want to customize the chart via values.yaml. We’ll dive into some of the most useful options as well as some best practices; the chart's own values.yaml lists the full set of values. You’ll likely want to add taints to your GPU nodes (the method will depend on your Kubernetes setup and how you are provisioning nodes) and then configure tolerations and node selection so that the device plugin is scheduled only on GPU-enabled nodes. We’ll dive deeper into these types of configurations in part 2 of this series.
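
As a rough sketch of the taint-and-toleration setup described above, a values.yaml along these lines could be a starting point. It assumes the chart version you install exposes top-level tolerations and nodeSelector values (check the chart's values.yaml to confirm), that your GPU nodes carry a taint with the key nvidia.com/gpu, and that they are labeled with example.com/gpu-node, which is a hypothetical label invented for this example.

```yaml
# values.yaml sketch; the taint key and node label are assumptions.
# GPU nodes are assumed to have been tainted with something like:
#   kubectl taint nodes <node-name> nvidia.com/gpu=present:NoSchedule
tolerations:
  - key: nvidia.com/gpu        # must match the taint key on your GPU nodes
    operator: Exists
    effect: NoSchedule
nodeSelector:
  example.com/gpu-node: "true" # hypothetical label applied to GPU nodes only
```

The toleration lets the plugin run on tainted GPU nodes, while the node selector keeps it off nodes that have no GPU to advertise.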

Configuring GPU Sharing and Oversubscription

The nvidia-device-plugin supports three strategies for GPU sharing and oversubscription, allowing you to optimize GPU utilization based on your workloads’ requirements. Here is a quick overview of each, with examples of how to configure it via values.yaml:

  • Time-slicing: This strategy allows multiple workloads to share a GPU by interleaving their execution. Each workload is allocated a time slice during which it has exclusive access to the GPU. Time-slicing is useful when you have many small workloads that don’t require the full power of a GPU simultaneously. One important point to note is that nothing special is done to isolate workloads that are granted replicas from the same underlying GPU: each workload has access to the full GPU memory and runs in the same fault domain as all of the others (meaning that if one pod’s workload crashes, they all do). In my experience, time-slicing usually isn’t what you’re looking for when it comes to GPU resource sharing. It basically lets all the pods access a single GPU in a free-for-all manner, executing without any regard for each other. If you have workloads that don’t mind this, such as Jupyter notebooks for research that aren’t utilizing the GPU at the same time, this setting could be useful, but I’d recommend looking at the other options first unless you know what you’re doing.

    Example values.yaml for time-slicing:

config:
  map:
    default: |-
      version: v1
      sharing:
        timeSlicing:
          resources:
          - name: nvidia.com/gpu
            replicas: 10
  • Multi-Instance GPUs (MIG): To mitigate the potential downsides of time-slicing, NVIDIA supports MIG. MIG is a feature supported on certain NVIDIA GPUs (e.g., A100) that enables partitioning a single GPU into multiple smaller, isolated instances. Each instance behaves like a separate GPU with its own memory and compute resources. MIG is beneficial when you have workloads with varying resource requirements and want to ensure strict isolation between them. This is in contrast to MPS, which gives you more fine-grained control over memory and compute resource allocation but doesn’t provide full memory protection and error isolation between workloads. MIG supports both mixed and single strategies for exposing GPUs to Kubernetes; NVIDIA's documentation on MIG support in Kubernetes describes how each works. Mixed is more flexible, and I’d recommend using it unless you have a cluster large enough that exposing only a single MIG type per node is feasible. MIG is only supported on certain data-center GPUs from the Ampere generation onward (for example, A100, A30, and H100), and while it is less flexible than MPS, it is the most complete solution for workload isolation if your workloads require that.

    Example values.yaml for MIG:

config:
  map:
    default: |
      version: v1
      flags:
        migStrategy: "mixed"
  • CUDA Multi-Process Service (MPS): MPS is a runtime service that enables multiple CUDA processes to share a single GPU context. It allows fine-grained sharing of GPU resources among multiple pods by running CUDA kernels concurrently. This mode feels the most similar to the way Kubernetes allocates CPU and memory in a fine-grained way, and it is supported on almost every CUDA-compatible GPU. MPS will split up a GPU into equal slices of compute and memory, and the MPS control daemon will enforce these limits. Sharing with MPS is currently not supported on devices with MIG enabled. MPS is suitable when you have workloads that can efficiently share GPU resources without strict isolation requirements. If you don’t have strict isolation requirements, MPS is probably the right choice for you.

    Example values.yaml for MPS:

config:
  map:
    default: |-
      version: v1
      sharing:
        mps:
          resources:
          - name: nvidia.com/gpu
            replicas: 10
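
From the pod's side, the three strategies differ mainly in which resource name you request. The sketch below shows a pod asking for a single MIG slice under the mixed strategy, where each MIG profile is exposed as its own resource type. It assumes a node with an A100 40GB partitioned into 1g.5gb instances; the pod name and image are hypothetical placeholders.

```yaml
# Sketch only: the pod name, image, and MIG profile are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: mig-slice-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: registry.example.com/my-cuda-app:latest # hypothetical image
      resources:
        limits:
          # Mixed strategy: request the MIG profile by name.
          nvidia.com/mig-1g.5gb: 1
          # With the time-slicing or MPS examples (replicas: 10) and default
          # resource naming, a pod would instead request one replica:
          # nvidia.com/gpu: 1
```

Keeping the request in limits mirrors how whole GPUs are requested, so switching a workload between sharing strategies is mostly a matter of changing the resource name.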

This should be a good introduction to GPU sharing to get you started. We will go into more detail about advanced configuration and best practices in part 2 of this series.

Allocating GPUs to Pods Using the NVIDIA Device Plugin

Allocating GPUs to pods when using the nvidia-device-plugin is straightforward and should feel familiar to anyone comfortable with Kubernetes. It is highly recommended to use NVIDIA base images for your containers so that all the necessary dependencies are installed and configured properly for your underlying workload. Setting a limit for nvidia.com/gpu is crucial; otherwise, all GPUs will be exposed inside the container. Finally, make sure to include tolerations for any taints set on your nodes so that the pod can be scheduled appropriately. Here’s a barebones example of a GPU-enabled pod:

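apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 # NVIDIA sample image; substitute your own
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu # must match the taint key on your GPU nodes
    operator: Exists
    effect: NoSchedule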


The NVIDIA Device Plugin for Kubernetes plays a crucial role in enabling organizations to harness the power of GPUs for accelerating machine learning workloads. By abstracting the complexities of GPU management and providing seamless integration with Kubernetes, it empowers developers and data scientists to focus on building and deploying their models without worrying about the underlying infrastructure.

We’re just scratching the surface here, so if you’re interested in learning more, please check out part 2 of this series, where we’ll go into detail on advanced configuration, troubleshooting common issues, and some of the limitations of using the nvidia-device-plugin alone to manage GPUs. Also, check out the additional resources at the end of this article!

Further Reading and Resources