I recently ran into an issue where the GPU Operator prevented the Machine Config Operator from applying cluster updates because it could not unload the NVIDIA driver. In my case the node was 'cl1gpu08.cluster.prod.example.com', which is why that name shows up in the commands below.

The fix was actually simple. First, disable the GPU Operator operands on the node:

$ oc label node/cl1gpu08.cluster.prod.example.com nvidia.com/gpu.deploy.operands=false

Next, make sure there are no NVIDIA GPU Operator workloads left running on that node:

$ oc -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com

If you're impatient, you can go ahead and delete the remaining pods yourself and restart the machine-config-daemon (a sketch of the commands I'd use is at the end of this post).

Once the node is back, remove the label again so that the GPU Operator operands can be scheduled on that node:

$ oc label node/cl1gpu08.cluster.prod.example.com nvidia.com/gpu.deploy.operands-
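For the impatient path mentioned above, here's roughly what I'd run; treat it as a sketch rather than a verified recipe. The 'openshift-machine-config-operator' namespace and the 'k8s-app=machine-config-daemon' label are the OpenShift defaults, but double-check them on your cluster before deleting anything.

$ oc -n nvidia-gpu-operator delete pods --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com

The machine-config-daemon is managed by a DaemonSet, so deleting its pod on the node simply restarts it:

$ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-daemon --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com

Afterwards you can watch the MachineConfigPools to confirm the update is progressing again:

$ oc get mcp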