Upgrading the NVIDIA GPU Operator for Kubernetes to Patch Vulnerability
The NVIDIA GPU Operator for Kubernetes is an essential tool for running production workloads on our Sorites HPC cluster.
On 25th September 2024, NVIDIA released a Security Bulletin identifying a vulnerability in the NVIDIA Container Toolkit, which is used by the GPU Operator as well as for running standalone GPU workloads on Docker and Containerd.
A patch has been applied to the toolkit, and an upgrade is required on clusters running affected versions of the Toolkit and GPU Operator.
The vulnerability has also featured in several news articles.
I successfully patched our cluster this morning, which only took a few minutes, so I thought I'd share the steps I took.
The upgrade was initially tested on a standalone single-node cluster that we use for development and testing, to make sure it worked before deploying to the 24-node production cluster.
The instructions below relate to Ubuntu 24.04, but should work on other Linux flavours. On non-Debian-based systems, please refer to your own package manager's instructions for upgrading the standalone Container Toolkit.
Check your GPU Operator Version
Identify a pod running the GPU Operator. The pod name will be of the format 'gpu-operator-xxxxxxxxxx-xxxxx'.
john@nearchus:~$ kubectl get pods -n gpu-operator
NAME                                                         READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-7584s                                  1/1     Running     0          6m14s
gpu-operator-587db77bb5-ks59m                                1/1     Running     0          6m32s
gpu-operator-node-feature-discovery-gc-fc549bd94-t5w69       1/1     Running     0          17m
gpu-operator-node-feature-discovery-master-b4bb855b7-tvm9r   1/1     Running     0          17m
gpu-operator-node-feature-discovery-worker-jbn4t             1/1     Running     0          10d
nvidia-container-toolkit-daemonset-qxxhf                     1/1     Running     0          4m46s
nvidia-cuda-validator-x4kb8                                  0/1     Completed   0          4m31s
nvidia-dcgm-exporter-dw74z                                   1/1     Running     0          6m14s
nvidia-device-plugin-daemonset-8k588                         1/1     Running     0          6m17s
nvidia-operator-validator-smxpt                              1/1     Running     0          5m43s
Find out which version of the GPU Operator container image your pod is running by checking the image tag, e.g.:
john@nearchus:~$ kubectl describe pod gpu-operator-759dd5976b-vhqdn -n gpu-operator
Name:                 gpu-operator-759dd5976b-vhqdn
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 nearchus/192.168.0.183
Start Time:           Tue, 17 Sep 2024 09:11:21 +0000
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v24.6.1
                      helm.sh/chart=gpu-operator-v24.6.1
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=759dd5976b
Annotations:          cni.projectcalico.org/containerID: e3d2833ac8f08890655f5fb004b9eefe1989e52ee67374dacacd51a8a293fa4e
                      cni.projectcalico.org/podIP: 10.1.148.235/32
                      cni.projectcalico.org/podIPs: 10.1.148.235/32
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   10.1.148.235
IPs:
  IP:  10.1.148.235
Controlled By:  ReplicaSet/gpu-operator-759dd5976b
Containers:
  gpu-operator:
    Container ID:  containerd://5a9dcf28aa62bd36bdd957580a4b303c35907041d5bb79c63fff422083838bfa
    Image:         nvcr.io/nvidia/gpu-operator:v24.6.1
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:d51c3a34aaa9a5dfbdd3b710ee18d9eaa50aa0fb3518bacd541053d77c5c1098
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
      --zap-time-encoding=epoch
      --zap-log-level=info
    State:          Running
      Started:      Tue, 17 Sep 2024 09:11:32 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:     200m
      memory:  100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      OPERATOR_NAMESPACE:    gpu-operator (v1:metadata.namespace)
      DRIVER_MANAGER_IMAGE:  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r58mm (ro)
In this example we are running container image version v24.6.1, which is vulnerable and therefore needs to be upgraded.
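If you just want the image tag without reading through the whole describe output, a jsonpath query will pull it out directly. This is only a convenience sketch; it assumes the operator pod carries the app=gpu-operator label shown in the output above:
kubectl get pods -n gpu-operator -l app=gpu-operator -o jsonpath='{.items[0].spec.containers[0].image}'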
Upgrade the GPU Operator Helm Chart
Update the NVIDIA repository:
helm repo update nvidia
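This assumes the NVIDIA Helm repository was added under the alias 'nvidia' when the chart was first installed. If it was not, add it first using the repository URL from the GPU Operator installation documentation:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia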
Specify the Operator release tag in an environment variable:
export RELEASE_TAG=v24.6.2
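If you want to confirm which chart versions the repository currently offers before pinning the tag, helm search can list them, e.g.:
helm search repo nvidia/gpu-operator --versions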
Fetch the values from the chart and save them to a file, ready for the upgrade:
helm show values nvidia/gpu-operator --version=$RELEASE_TAG > values-$RELEASE_TAG.yaml
Execute the following command:
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --set operator.upgradeCRD=true --disable-openapi-validation -f values-$RELEASE_TAG.yaml *options
It is absolutely critical to replace *options with the options used in the original Helm chart installation.
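If you no longer have the original installation command to hand, the user-supplied values from that install can usually be recovered from the release itself, which makes it easier to reconstruct the required options:
helm get values gpu-operator -n gpu-operator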
In our cluster, running MicroK8s, we used the following command options, as detailed in the documentation:
john@nearchus:~$ helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
--set operator.upgradeCRD=true --disable-openapi-validation -f values-$RELEASE_TAG.yaml $HELM_OPTIONS \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
--set driver.enabled=false
Release "gpu-operator" has been upgraded. Happy Helming!
NAME: gpu-operator
LAST DEPLOYED: Fri Sep 27 09:31:32 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 3
TEST SUITE: None
In the documentation, I followed 'Option 2: Automatically Upgrading CRDs Using a Helm Hook' to upgrade the GPU Operator.
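If you want to reassure yourself that the Helm hook has applied the CRD changes, listing the NVIDIA CRDs is a quick check; clusterpolicies.nvidia.com is the GPU Operator's main CRD and should appear in the output:
kubectl get crds | grep nvidia.com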
Wait 2 minutes, then check that the pods are running normally and images have been upgraded:
kubectl get pods -n gpu-operator
kubectl describe pod gpu-operator-xxxxxxxxxx-xxxxx -n gpu-operator
In our case:
john@nearchus:~$ kubectl describe pod gpu-operator-587db77bb5-ks59m -n gpu-operator
Name:                 gpu-operator-587db77bb5-ks59m
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 nearchus/192.168.0.183
Start Time:           Fri, 27 Sep 2024 09:29:50 +0000
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v24.6.2
                      helm.sh/chart=gpu-operator-v24.6.2
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=587db77bb5
Annotations:          cni.projectcalico.org/containerID: c3eb1a96d94b0a0892fae9ea669bb7e9db5c711cca5c0a4e1a5af664ec51f0a0
                      cni.projectcalico.org/podIP: 10.1.148.217/32
                      cni.projectcalico.org/podIPs: 10.1.148.217/32
                      openshift.io/scc: restricted-readonly
Status:               Running
IP:                   10.1.148.217
IPs:
  IP:  10.1.148.217
Controlled By:  ReplicaSet/gpu-operator-587db77bb5
Containers:
  gpu-operator:
    Container ID:  containerd://7aaba5f3266437d47a78ea60fb83c07f96ba932334344e4ef15e7e01e96dc0d4
    Image:         nvcr.io/nvidia/gpu-operator:v24.6.2
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:8e0969cffc030a89c4acd68e64d41dd54e3bce8a794106b178d4dbd636a07f1c
    Port:          8080/TCP
    Host Port:     0/TCP
which shows the cluster is now running the patched image, tag v24.6.2.
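Rather than describing each pod one by one, a single jsonpath query can print every pod in the namespace alongside the image it is running, which makes it easy to spot anything still on an old tag. A sketch, listing only the first container of each pod:
kubectl get pods -n gpu-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'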
NVIDIA Container Toolkit on standalone Docker and Containerd
This is more straightforward. If you followed the NVIDIA documentation to set up the repositories and install the Container Toolkit, then updating your apt cache and upgrading the toolkit package will pull in the patched version.
sudo apt-get update
sudo apt-get upgrade nvidia-container-toolkit -y
I verified that this works on another computer.
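To double-check which toolkit version you ended up with, compare the installed package against the fixed release listed in the Security Bulletin, e.g.:
apt-cache policy nvidia-container-toolkit
nvidia-ctk --version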
I hope this guide is helpful and please comment if you have any questions.