If a system does not boot anymore, it's usually easiest to boot from live media and chroot into the installation to troubleshoot the issue at hand. I'll be using the Arch Linux installation ISO to chroot into a Debian install to fix a kernel update that's gone sideways.

Once in the Arch live system, make sure that the disks are all detected:

root@archiso ~ # lsblk
NAME                       MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0                        7:0    0 853.9M  1 loop /run/archiso/airootfs
sr0                         11:0    1   1.2G  0 rom  /run/archiso/bootmnt
vda                        254:0    0    32G  0 disk
├─vda1                     254:1    0   976M  0 part
└─vda2                     254:2    0    31G  0 part
  ├─vg_base-lv_root        253:0    0   3.8G  0 lvm
  ├─vg_base-lv_usr         253:1    0   5.7G  0 lvm
  ├─vg_base-lv_var         253:2    0   3.8G  0 lvm
  ├─vg_base-lv_var_log     253:3    0   3.8G  0 lvm
  ├─vg_base-lv_var_tmp     253:4    0   1.9G  0 lvm
  ├─vg_base-lv_tmp         253:5    0   488M  0 lvm
  ├─vg_base-lv_home        253:6    0   976M  0 lvm
  └─vg_base-lv_...
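From there, the usual next steps are activating the volume group, mounting the logical volumes and chrooting in. A minimal sketch, assuming the Debian root sits on vg_base-lv_root, /usr and /var live on their own LVs as shown above, and vda1 is the /boot partition (adjust the names to your layout):

root@archiso ~ # vgchange -ay vg_base
root@archiso ~ # mount /dev/vg_base/lv_root /mnt
root@archiso ~ # mount /dev/vg_base/lv_usr /mnt/usr
root@archiso ~ # mount /dev/vg_base/lv_var /mnt/var
root@archiso ~ # mount /dev/vda1 /mnt/boot
root@archiso ~ # arch-chroot /mnt /bin/bash

arch-chroot takes care of bind-mounting /proc, /sys and /dev for you; on a live medium that doesn't ship it, you'd bind-mount those yourself before running a plain chroot /mnt /bin/bash. Inside the chroot you can then repair the kernel the Debian way (e.g. reinstall the kernel package, update-initramfs -u, update-grub).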
I've recently run into an issue where the GPU Operator prevented the Machine Config Operator from applying cluster updates because it couldn't unload the NVIDIA driver. In my case, the node name was 'cl1gpu08.cluster.prod.example.com'; it's going to be referenced in some of the commands below.

The fix was actually simple. First, disable the GPU Operator operands on the node:

$ oc label node/cl1gpu08.cluster.prod.example.com nvidia.com/gpu.deploy.operands=false

Next, make sure there are no NVIDIA GPU Operator workloads running on that node:

$ oc -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com

If you're impatient, you can go ahead and delete the remaining pods yourself and restart the machine-config-daemon on that node (a rough sketch of both follows after the sources). Once the node is back, remove the label (or set it back to 'true') so that the GPU Operator workloads can be scheduled on that node again:

$ oc label node/cl1gpu08.cluster.prod.example.com nvidia.com/gpu.deploy.operands- --overwrite

Sources used:
- ...
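For the impatient path mentioned above, a rough sketch of those two steps. The machine-config-daemon namespace and label here are the OpenShift defaults; adjust them if your cluster differs:

$ oc -n nvidia-gpu-operator delete pods --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com
$ oc -n openshift-machine-config-operator delete pod -l k8s-app=machine-config-daemon --field-selector spec.nodeName=cl1gpu08.cluster.prod.example.com

Deleting the machine-config-daemon pod just makes its DaemonSet recreate it, which is enough for it to retry the pending machine config on that node.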