Dynamic Resource Allocation (DRA) for GPUs in Kubernetes

Original: https://github.com/NVIDIA/k8s-dra-driver?tab=readme-ov-file

Introduction

Dynamic Resource Allocation (DRA) is an upcoming Kubernetes feature that puts resource scheduling in the hands of third-party developers. From an end-user's perspective, it moves away from the limited "countable" interface for requesting access to resources (e.g. "nvidia.com/gpu: 2"), providing an API more akin to that of persistent volumes. Under the hood, it uses CDI to do its device injection.

NVIDIA has been working with Intel on this feature for the past two years, and we are excited to see it finally gain traction within the community. DRA was merged as an alpha feature in Kubernetes 1.26 (released in December 2022), and its graduation to beta and GA will follow soon after.

In the context of GPUs, this unlocks a host of new features without the need for awkward solutions shoehorned on top of the existing device plugin API.

These features include:

  • Controlled GPU Sharing (both within a pod and across pods)
  • Multiple GPU models per node (e.g. T4 and A100)
  • Specifying arbitrary constraints for a GPU (min/max memory, device model, etc.)
  • Natural support for MPS
  • Dynamic allocation of MIG devices
  • Dynamic repurposing of a GPU from full to MIG mode
  • Dynamic repurposing of a GPU for use as Passthrough vs. vGPU
  • … the list goes on …

A reference implementation of our DRA resource driver for GPUs is already available and a demo showcasing a subset of the features listed above can be found here.

User-Facing API

Dynamic Resource Allocation (DRA) is a generalization of the Persistent Volumes API for generic resources. As such, it allows one to separate the declaration of a resource to be consumed from its actual consumption. This moves away from the limited "countable" API provided by device plugins today to something much more flexible in terms of controlling which resources are consumed (and where).

Using our reference DRA resource driver for GPUs as an example, the comparison below shows the difference between how one requests access to two GPUs under the existing device plugin model and under DRA.

Existing device plugin:

apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: ctr
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 2

DRA resource driver for GPUs:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: ctr
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu0
      - name: gpu1
  resourceClaims:
  - name: gpu0
    source:
      resourceClaimTemplate: gpu-template
  - name: gpu1
    source:
      resourceClaimTemplate: gpu-template

Because a ResourceClaim is declared separately from the containers that consume it, a single claim can be referenced by multiple containers within a pod, or even by multiple pods. The examples below show how our reference DRA resource driver for GPUs can be used to share a single GPU within a pod and across pods.

GPU Sharing within a Pod:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: ctr0
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
  - name: ctr1
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplate: gpu-template

GPU Sharing across Pods:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: shared-gpu
spec:
  resourceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod0
spec:
  containers:
  - name: ctr
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: shared-gpu
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: ctr
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimName: shared-gpu

Moreover, ResourceClaims can be annotated with a set of parameters defined by the developer of the DRA resource driver for a given ResourceClass. These parameters allow users to attach additional constraints to their resource requests.

For example, our reference DRA resource driver for GPUs defines the following two claim parameter objects for use with claims against the gpu.nvidia.com resource class:

apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: ...
spec:
  count: ...
  migEnabled: ...

---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  name: ...
spec:
  profile: ...
  gpuClaimName: ...

The GpuClaimParameters object gives users the ability to request more than one GPU from a single resource claim (via the count parameter), as well as specify whether or not they want to receive GPUs that have MIG mode enabled (via the migEnabled parameter).

The MigDeviceClaimParameters object gives users the ability to specify the profile of a MIG device they would like access to (via the profile parameter) as well as an optional reference to the specific GPU they would like their MIG device to be allocated on (via the gpuClaimName parameter).

Note: both of these claim parameter objects are reference implementations, and we plan to extend or replace them before they are released. Any feedback on what you would like to see here is greatly appreciated.

An example of using GpuClaimParameters to request eight GPUs from a single resource claim can be seen below:

apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: eight-gpus
spec:
  count: 8

---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: eight-gpus-template
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: GpuClaimParameters
      name: eight-gpus

---
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: ctr
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: eight-gpus
  resourceClaims:
  - name: eight-gpus
    source:
      resourceClaimTemplate: eight-gpus-template

A more complex example involving both a GpuClaimParameters object and a MigDeviceClaimParameters object can be seen below.

In this example, we create a single pod with two containers, each of which wants access to its own 3g.40gb MIG device.

  1. To ensure that the two MIG devices ultimately come from the same underlying GPU, we first create a GpuClaimParameters object requesting access to a MIG-enabled GPU. We call this claim parameters object mig-enabled-gpu.
  2. We then create a ResourceClaimTemplate, also called mig-enabled-gpu, which binds the gpu.nvidia.com resource class to the mig-enabled-gpu claim parameters object.
  3. Next, we create a MigDeviceClaimParameters object specifying the 3g.40gb profile. This object also includes a forward reference to the (yet-to-be-created) resource claim of the MIG-enabled GPU on which this MIG device should be created (shared-gpu). Note that this is the name of the resource claim itself, not the claim parameters object we called mig-enabled-gpu. We call this new claim parameters object mig-3g.40gb.
  4. We then create a ResourceClaimTemplate, also called mig-3g.40gb, which binds the gpu.nvidia.com resource class to the mig-3g.40gb claim parameters object.
  5. Next, we create the actual resource claims themselves inside the pod spec: one resource claim called shared-gpu, which references the mig-enabled-gpu resource claim template, as well as two other resource claims, each referencing the mig-3g.40gb resource claim template. These two resource claims are called mig-3g-0 and mig-3g-1, respectively.
  6. Finally, we reference each of these resource claims in the resources.claims sections of our two containers. Both containers refer to the same underlying shared-gpu claim, with each container pointing to one of mig-3g-0 or mig-3g-1, respectively.

apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  name: mig-enabled-gpu
spec:
  migEnabled: true

---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: mig-enabled-gpu
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: GpuClaimParameters
      name: mig-enabled-gpu

---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  name: mig-3g.40gb
spec:
  profile: 3g.40gb
  gpuClaimName: shared-gpu

---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: mig-3g.40gb
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: MigDeviceClaimParameters
      name: mig-3g.40gb

---
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  containers:
  - name: ctr0
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: shared-gpu
      - name: mig-3g-0
  - name: ctr1
    image: nvidia/cuda
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: shared-gpu
      - name: mig-3g-1
  resourceClaims:
  - name: shared-gpu
    source:
      resourceClaimTemplate: mig-enabled-gpu
  - name: mig-3g-0
    source:
      resourceClaimTemplate: mig-3g.40gb
  - name: mig-3g-1
    source:
      resourceClaimTemplate: mig-3g.40gb

As mentioned previously, GpuClaimParameters and MigDeviceClaimParameters are just reference specifications, and we plan to iterate on them further before they get released. Any feedback on how you would like to see these evolve would be greatly appreciated.
In the following section, we discuss the details of the DRA resource driver architecture and how it interacts with Kubernetes to make the user-facing API described above possible.

DRA Resource Driver Architecture

At a high-level, a DRA resource driver is responsible for:

  • Defining a ResourceClass associated with a specific type of resource (e.g. gpu.nvidia.com), as sketched after this list
  • Processing any class parameter objects associated with this ResourceClass
  • Watching for incoming ResourceClaims that reference this ResourceClass
  • Processing any claim parameter objects associated with this ResourceClaim
  • Coordinating with the Kubernetes scheduler to find a node where a given ResourceClaim should be allocated
  • Allocating the ResourceClaim on that node
  • Cleaning up any allocated ResourceClaims once they get deleted
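
To make the first of these responsibilities concrete, below is a minimal sketch of what a ResourceClass owned by a GPU resource driver might look like. The driverName value is illustrative; the actual value is whatever name the driver registers under.

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  # The name users reference from their claims (e.g. resourceClassName: gpu.nvidia.com).
  name: gpu.nvidia.com
# The DRA resource driver responsible for claims against this class
# (illustrative value; must match the name the driver registers under).
driverName: gpu.resource.nvidia.com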

To accomplish this, DRA resource drivers consist of two separate-but-coordinating components:

  • A centralized controller running with high-availability
  • A node-local kubelet plugin running as a daemonset
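
As a rough illustration of how these two components might be packaged, the manifest below sketches a Deployment for the centralized controller and a DaemonSet for the node-local kubelet plugin. The names, image, labels, and command-line flags are hypothetical, and details such as multiple controller replicas with leader election, service accounts, and RBAC are omitted.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-dra-controller          # hypothetical name
spec:
  replicas: 1                       # HA / leader election omitted for brevity
  selector:
    matchLabels:
      app: gpu-dra-controller
  template:
    metadata:
      labels:
        app: gpu-dra-controller
    spec:
      containers:
      - name: controller
        image: example.com/gpu-dra-driver:v0.1.0   # hypothetical image
        args: ["--mode=controller"]                # hypothetical flag
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-dra-kubelet-plugin      # hypothetical name
spec:
  selector:
    matchLabels:
      app: gpu-dra-kubelet-plugin
  template:
    metadata:
      labels:
        app: gpu-dra-kubelet-plugin
    spec:
      containers:
      - name: kubelet-plugin
        image: example.com/gpu-dra-driver:v0.1.0   # hypothetical image
        args: ["--mode=kubelet-plugin"]            # hypothetical flag
        volumeMounts:
        # Standard kubelet plugin directories used for registration and for
        # hosting the plugin's gRPC socket.
        - name: plugins-registry
          mountPath: /var/lib/kubelet/plugins_registry
        - name: plugins
          mountPath: /var/lib/kubelet/plugins
      volumes:
      - name: plugins-registry
        hostPath:
          path: /var/lib/kubelet/plugins_registry
      - name: plugins
        hostPath:
          path: /var/lib/kubelet/plugins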

The centralized controller serves to:

  1. Coordinate with the K8s scheduler to decide which nodes an incoming ResourceClaim can be serviced on
  2. Perform the actual ResourceClaim allocation once the scheduler picks a node to allocate it on
  3. Perform the deallocation of a ResourceClaim once it has been deleted

The node-local kubelet plugin serves to:

  1. Advertise any node-local state required by the centralized controller to make its allocation decisions
  2. Perform any node-local operations required as part of allocating a ResourceClaim on a node
  3. Pass the set of devices associated with an allocated ResourceClaim to the kubelet so it can ultimately be forwarded to the underlying container runtime
  4. Perform any node-local operations required as part of freeing a ResourceClaim on a node

To help illustrate how these responsibilities are carried out by each component, the following section walks through the process of deploying a DRA resource driver and then allocating a ResourceClaim associated with a newly created pod.

Allocating a ResourceClaim

The Kubernetes Enhancement Proposal for DRA is the definitive source for all of the details about how the internals of DRA work. It defines a number of modes of operation, including delayed vs. immediate allocation, shared vs. non-shared resource claims, etc. As the names suggest, immediate allocation occurs as soon as a resource claim is created (i.e. there is no need to wait for a consumer to begin the allocation process), whereas delayed allocation does not occur until a pod that references the resource claim is first created. Likewise, shared resource claims can have multiple pods consuming them, whereas non-shared resource claims can only have one.
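
One place these modes surface directly in the API is the allocationMode field of a ResourceClaim. A minimal sketch, reusing the gpu.nvidia.com resource class from the earlier examples:

apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaim
metadata:
  name: prealloc-gpu
spec:
  resourceClassName: gpu.nvidia.com
  # Allocate as soon as the claim is created, rather than waiting for a
  # consuming pod; the default mode is WaitForFirstConsumer (delayed).
  allocationMode: Immediate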

In this section we walk through the process of allocating a shared resource claim with delayed allocation (which is the default). The steps are broken down into phases, with a diagram showing the steps from each phase in action. The steps themselves are annotated as “1.x” from Phase 1, “2.x” from Phase 2, etc.

Phase 1 - Setup:

  1. Admin registers a ResourceClass pointing to a specific DRA resource driver as its owner
  2. Admin deploys DRA resource driver in the cluster
  3. Each DRA kubelet plugin begins advertising its node-local state for the centralized controller to pick up on

Phase 2 - Pod Creation:

  1. User creates a ResourceClaim referencing a registered ResourceClass
  2. User submits a pod to the API server referencing the ResourceClaim in one of its containers

Phase 3 - Node Selection:

  1. Scheduler picks up a pod from the API server and begins to schedule it on a node
  2. Scheduler sees the pod’s ResourceClaim and its ResourceClass pointing to a specific DRA resource driver
  3. Scheduler provides a list of potential nodes on which it is considering scheduling the pod that references the ResourceClaim. It does this through a special PodScheduling object in the API server (a sketch of this object follows this list)
  4. DRA resource driver picks up the PodScheduling object
  5. DRA resource driver narrows down the list of potential nodes to just those where the ResourceClaim could possibly be allocated. It writes this back to the PodScheduling object
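
A rough sketch of what this PodScheduling object might look like is shown below. The node and claim names are illustrative, and the exact field set may differ between Kubernetes releases.

apiVersion: resource.k8s.io/v1alpha1
kind: PodScheduling
metadata:
  # Created with the same name and namespace as the pod being scheduled.
  name: pod
spec:
  # Written by the scheduler: nodes it is considering for the pod.
  potentialNodes:
  - node-a
  - node-b
  - node-c
status:
  # Written back by the DRA resource driver: nodes where this claim
  # cannot be allocated, which the scheduler will then avoid.
  resourceClaims:
  - name: gpu
    unsuitableNodes:
    - node-c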

Phase 4 - Claim Allocation:

  1. Scheduler considers other scheduling constraints in relation to each of the nodes in the narrowed down list
  2. Scheduler picks a node and sets a field in the ResourceClaim with a reference to it
  3. DRA resource driver picks up the selected node from the ResourceClaim
  4. DRA resource driver allocates the ResourceClaim for use on the node
  5. DRA resource driver marks the allocation as complete in the ResourceClaim object
  6. Scheduler picks up the allocation completion from the ResourceClaim object
  7. Scheduler schedules the pod on the node
  8. Scheduler writes the scheduled node back to the Pod object

Phase 5 - Container Start:

  1. Kubelet picks up the pod from the API server and begins creating its containers
  2. Kubelet calls out to DRA’s kubelet plugin to get the list of CDI devices associated with the ResourceClaim
  3. Kubelet passes the CDI devices to the container-runtime via CRI (a sketch of a CDI specification follows this list)
  4. Container runtime starts the container with access to the devices associated with the ResourceClaim
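
The CDI devices referenced here are defined in CDI specification files that the kubelet plugin writes on each node (typically under /etc/cdi or /var/run/cdi). Below is a minimal, illustrative sketch of such a specification; the device name and device node paths are assumptions for illustration only.

cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
# Referenced by its fully-qualified name, e.g. nvidia.com/gpu=gpu0.
- name: gpu0
  containerEdits:
    deviceNodes:
    - path: /dev/nvidia0      # illustrative device nodes to inject
    - path: /dev/nvidiactl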

Writing your own DRA resource driver

As mentioned in the previous section, DRA resource drivers consist of two separate-but-coordinating components: a centralized controller and a daemonset of node-local kubelet plugins. Most of the work required by the centralized controller to coordinate with the scheduler can be handled by boilerplate code; only the business logic required to actually allocate ResourceClaims against the ResourceClasses owned by the resource driver needs to be customized. As such, the following package is provided by Kubernetes, containing APIs for invoking this boilerplate code as well as a Driver interface that one can implement to provide this custom business logic:

k8s.io/dynamic-resource-allocation/controller

Likewise, boilerplate code can be used to register the kubelet plugin with the kubelet, as well as start a gRPC server to implement DRA’s kubelet plugin API. The following package is provided for this purpose:

k8s.io/dynamic-resource-allocation/kubeletplugin

The set of functions defined by the controller’s Driver interface can be seen below:

// Function signatures elided for brevity.
type Driver interface {
    // Resolve any class parameter object referenced by a ResourceClass.
    GetClassParameters()
    // Resolve any claim parameter object referenced by a ResourceClaim.
    GetClaimParameters()
    // Allocate a ResourceClaim once the scheduler has picked a node for it
    // (or immediately, for claims using immediate allocation).
    Allocate()
    // Free a ResourceClaim once it has been deleted.
    Deallocate()
    // Filter out nodes on which a pod's ResourceClaims cannot be allocated.
    UnsuitableNodes()
}

Likewise, the set of functions defined by the gRPC API for the node-local kubelet plugin are:

service Node {
    // Called by the kubelet to perform any node-local preparation for an
    // allocated ResourceClaim and to return its associated CDI devices.
    rpc NodePrepareResource (NodePrepareResourceRequest)
        returns (NodePrepareResourceResponse) {}

    // Called by the kubelet to undo that preparation and free the
    // node-local resources associated with a ResourceClaim.
    rpc NodeUnprepareResource (NodeUnprepareResourceRequest)
        returns (NodeUnprepareResourceResponse) {}
}

Examples of implementing these can be found in our reference DRA resource driver for GPUs.

Status & Future Work

As mentioned previously, the alpha release of DRA was merged as part of Kubernetes 1.26 in December 2022, with graduation to beta and GA to follow. Our reference DRA resource driver for GPUs is feature-complete in terms of supporting all of the APIs required by DRA, but there is still quite a bit of room for improvement. We plan to continue iterating on this resource driver, with an official release to coincide with the beta release of DRA itself. These are still the early days of DRA, and we are excited to see what other technologies this new feature helps unlock.
