I recently ran into an issue where the GPU Operator prevented the Machine Config Operator from applying cluster updates because the NVIDIA driver could not be unloaded.
In my case, the node name was 'cl1gpu08.cluster.prod.example.com', which is referenced in the commands below, so substitute your own node name.
The fix was actually simple. First, disable the GPU Operator operands on the node:
$ oc label node/cl1gpu08.cluster.prod.example.com nvidia.com/gpu.deploy.operands=false
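To double-check that the label is in place, any way of inspecting the node's labels works; I just filter the output, for example:
$ oc get node/cl1gpu08.cluster.prod.example.com --show-labels | tr ',' '\n' | grep nvidia.com/gpu.deploy.operands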
Next, make sure there are no NVIDIA GPU Operator workloads running on that node:
$ oc -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com
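The operand pods (driver, container toolkit, device plugin, and so on) should terminate on their own once the label is set; if you want to follow along, the same command with --watch will show them disappear:
$ oc -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com --watch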
If you're impatient, you can go ahead and delete the remaining pods and restart the machine-config-daemon pod on the node.
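Something along these lines should do it (the machine-config-daemon namespace and label used here are the OpenShift defaults; adjust them if your cluster differs):
$ oc -n nvidia-gpu-operator delete pods --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com
$ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-daemon --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com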
Once the node is back, remove the label again (setting it back to 'true' works as well) so that the GPU Operator operands can be scheduled on that node:
$ oc label node/cl1gpu08.cluster.prod.example.com nvidia.com/gpu.deploy.operands- --overwrite
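To verify that everything recovered, I then check that the operand pods are scheduled on the node again and that the Machine Config Pool has finished updating ('worker' is just the pool my node belongs to, substitute yours):
$ oc -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com
$ oc get mcp worker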
Sources used:
- NVIDIA GPU Operator Common Deployment Scenarios
Feel free to comment and/or suggest a topic.