Upgrading the NVIDIA GPU Operator for Kubernetes to Patch Vulnerability

The NVIDIA GPU Operator for Kubernetes is an essential tool for running production workloads on our Sorites HPC cluster.

On 25th September 2024, NVIDIA released a Security Bulletin reporting a vulnerability (CVE-2024-0132) in the NVIDIA Container Toolkit, which is used by the GPU Operator as well as for running standalone GPU workloads on Docker and containerd.

Security Bulletin: NVIDIA Container Toolkit - September 2024

A patch has been applied to the toolkit, and an upgrade is required for clusters running affected versions of the Toolkit and GPU Operator.

The vulnerability has also featured in several news articles:

Patch now: Critical Nvidia bug allows container escape, complete host takeover - The Register

NVIDIA Container Toolkit Vulnerability Exposes Systems to Remote Code Execution - Cyber Security News

Critical Nvidia Container Flaw Exposes Cloud AI Systems to Host Takeover - Security Week

I successfully patched our cluster this morning, which only took a few minutes, so I thought I'd share the steps I took.

The upgrade was initially tested on a standalone single-node cluster that we use for development and testing, to ensure it worked before deploying to the 24-node production cluster.

The instructions below relate to Ubuntu 24.04 but should work on other Linux flavours. On non-Debian-based systems, please refer to your own package manager's instructions for upgrading the standalone Container Toolkit.

Check your GPU Operator Version

Identify a pod running the GPU Operator. The pod name will be of the format 'gpu-operator-xxxxxxxxxx-xxxxx'.

john@nearchus:~$ kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-7584s                                  1/1     Running     0          6m14s
gpu-operator-587db77bb5-ks59m                                1/1     Running     0          6m32s
gpu-operator-node-feature-discovery-gc-fc549bd94-t5w69       1/1     Running     0          17m
gpu-operator-node-feature-discovery-master-b4bb855b7-tvm9r   1/1     Running     0          17m
gpu-operator-node-feature-discovery-worker-jbn4t             1/1     Running     0          10d
nvidia-container-toolkit-daemonset-qxxhf                     1/1     Running     0          4m46s
nvidia-cuda-validator-x4kb8                                  0/1     Completed   0          4m31s
nvidia-dcgm-exporter-dw74z                                   1/1     Running     0          6m14s
nvidia-device-plugin-daemonset-8k588                         1/1     Running     0          6m17s
nvidia-operator-validator-smxpt                              1/1     Running     0          5m43s
        

Find out which version of the GPU Operator container image your pod is running from the image tag, e.g.

john@nearchus:~$ kubectl describe pod gpu-operator-759dd5976b-vhqdn -n gpu-operator
Name:                 gpu-operator-759dd5976b-vhqdn
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 nearchus/192.168.0.183
Start Time:           Tue, 17 Sep 2024 09:11:21 +0000
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v24.6.1
                      helm.sh/chart=gpu-operator-v24.6.1
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=759dd5976b
Annotations:          cni.projectcalico.org/containerID: e3d2833ac8f08890655f5fb004b9eefe1989e52ee67374dacacd51a8a293fa4e
                      cni.projectcalico.org/podIP: 10.1.148.235/32
                      cni.projectcalico.org/podIPs: 10.1.148.235/32
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   10.1.148.235
IPs:
  IP:           10.1.148.235
Controlled By:  ReplicaSet/gpu-operator-759dd5976b
Containers:
  gpu-operator:
    Container ID:  containerd://5a9dcf28aa62bd36bdd957580a4b303c35907041d5bb79c63fff422083838bfa
    Image:         nvcr.io/nvidia/gpu-operator:v24.6.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:d51c3a34aaa9a5dfbdd3b710ee18d9eaa50aa0fb3518bacd541053d77c5c1098
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
      --zap-time-encoding=epoch
      --zap-log-level=info
    State:          Running
      Started:      Tue, 17 Sep 2024 09:11:32 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      OPERATOR_NAMESPACE:    gpu-operator (v1:metadata.namespace)
      DRIVER_MANAGER_IMAGE:  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r58mm (ro)        

In this example we are using container image version v24.6.1, which is vulnerable and so needs to be upgraded.
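If you only need the image tag, the full describe output is overkill; a jsonpath query can extract the image directly (a sketch assuming the `app=gpu-operator` label shown in the describe output above):

```shell
# Print just the operator image (tag included) for the first matching pod
kubectl get pods -n gpu-operator -l app=gpu-operator \
  -o jsonpath='{.items[0].spec.containers[0].image}'
```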

Upgrade the GPU Operator Helm Chart

Update the NVIDIA repository:

helm repo update nvidia        
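To confirm that the patched release is now available in the repository, you can list the published chart versions (this assumes the repository was added under the name `nvidia`, as used throughout this guide):

```shell
# List published chart versions; the patched release should appear near the top
helm search repo nvidia/gpu-operator --versions | head -n 5
```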

Specify the Operator release tag in an environment variable:

export RELEASE_TAG=v24.6.2        

Fetch the values from the chart and save them in a file to perform the upgrade:

helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml

Execute the following command:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator  --set operator.upgradeCRD=true --disable-openapi-validation -f values-$RELEASE_TAG.yaml *options        

It is absolutely critical to replace *options with the options used in the original Helm chart installation.
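If you no longer have a record of those options, Helm can recover them: `helm get values` shows the user-supplied values of the existing release, which correspond to the `--set` flags and values files used at installation.

```shell
# Show only the user-supplied values from the original installation
helm get values gpu-operator -n gpu-operator

# Include the chart's default values as well
helm get values gpu-operator -n gpu-operator --all
```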

NVIDIA Documentation - Installing the GPU Operator

In our cluster, running Microk8s, we used the following command options as detailed in the documentation:

john@nearchus:~$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --set operator.upgradeCRD=true --disable-openapi-validation -f values-$RELEASE_TAG.yaml $HELM_OPTIONS \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true \
    --set driver.enabled=false
Release "gpu-operator" has been upgraded. Happy Helming!
NAME: gpu-operator
LAST DEPLOYED: Fri Sep 27 09:31:32 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 3
TEST SUITE: None
        

NVIDIA Documentation - Upgrading the GPU Operator

In the documentation, I followed 'Option 2: Automatically Upgrading CRDs Using a Helm Hook' to upgrade the GPU Operator.

Wait a couple of minutes, then check that the pods are running normally and the images have been upgraded:

kubectl get pods -n gpu-operator
kubectl describe pod gpu-operator-xxxxxxxxxx-xxxxx -n gpu-operator        

In our case:


john@nearchus:~$ kubectl describe pod gpu-operator-587db77bb5-ks59m -n gpu-operator
Name:                 gpu-operator-587db77bb5-ks59m
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 nearchus/192.168.0.183
Start Time:           Fri, 27 Sep 2024 09:29:50 +0000
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v24.6.2
                      helm.sh/chart=gpu-operator-v24.6.2
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=587db77bb5
Annotations:          cni.projectcalico.org/containerID: c3eb1a96d94b0a0892fae9ea669bb7e9db5c711cca5c0a4e1a5af664ec51f0a0
                      cni.projectcalico.org/podIP: 10.1.148.217/32
                      cni.projectcalico.org/podIPs: 10.1.148.217/32
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   10.1.148.217
IPs:
  IP:           10.1.148.217
Controlled By:  ReplicaSet/gpu-operator-587db77bb5
Containers:
  gpu-operator:
    Container ID:  containerd://7aaba5f3266437d47a78ea60fb83c07f96ba932334344e4ef15e7e01e96dc0d4
    Image:         nvcr.io/nvidia/gpu-operator:v24.6.2
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:8e0969cffc030a89c4acd68e64d41dd54e3bce8a794106b178d4dbd636a07f1c
    Port:          8080/TCP
    Host Port:     0/TCP        

which shows the cluster is now running the patched image, tag v24.6.2.
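Rather than describing each pod individually, a single jsonpath query can list every pod in the namespace alongside the image(s) it runs, which makes it quick to spot anything still on an old tag (a sketch using standard kubectl output formatting):

```shell
# One line per pod: pod name, then the image(s) its containers run
kubectl get pods -n gpu-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```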

NVIDIA Container Toolkit on standalone Docker and Containerd

This is more straightforward. If you followed the NVIDIA documentation to set up the repositories and install the Container Toolkit, then updating your apt cache and upgrading the toolkit package will pull the patched version.

sudo apt-get update
sudo apt-get upgrade nvidia-container-toolkit -y        

I verified that this works on another computer.
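You can confirm the upgrade took effect afterwards: `nvidia-ctk --version` prints the installed toolkit CLI version, and `apt-cache policy` shows the installed and candidate package versions.

```shell
# Version of the toolkit CLI
nvidia-ctk --version

# Installed and candidate package versions according to apt
apt-cache policy nvidia-container-toolkit
```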

NVIDIA Documentation - Installing the Container Toolkit

I hope this guide is helpful and please comment if you have any questions.

