GPU detection on a DGX node may fail with the following error:
[bright82->device[node001]]% sysinfo | grep -i gpu
Number of GPUs 1
GPU Driver Version Unsupported GPU or cannot connect to DCGM, please check the output of 'service cuda-dcgm status'
GPU0 Name
The cuda-dcgm package cannot be installed or started on the node, because it conflicts with NVIDIA's datacenter-gpu-manager package:
file /usr/share/nvidia-validation-suite from install of cuda-dcgm-1:1.4.6.1-59_cm8.2.x86_64 conflicts with file from package datacenter-gpu-manager-1.5.9-1.x86_64
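The conflict message shows that both packages ship the path /usr/share/nvidia-validation-suite. Before installing or removing anything, it can help to check which of the two packages is already present on the node. The following is a minimal sketch, assuming an RPM-based node (as the package names suggest). The helper name dcgm_packages_present is hypothetical, and it only classifies a package list read from standard input.

```shell
# Hypothetical helper: reads one package name per line on stdin and reports
# which of the two DCGM packages named in the conflict message are listed.
dcgm_packages_present() {
    found_nvidia=no
    found_cuda=no
    while IFS= read -r pkg || [ -n "$pkg" ]; do
        case "$pkg" in
            datacenter-gpu-manager*) found_nvidia=yes ;;
            cuda-dcgm*)              found_cuda=yes ;;
        esac
    done
    case "$found_nvidia/$found_cuda" in
        yes/yes) echo both ;;
        yes/no)  echo datacenter-gpu-manager ;;
        no/yes)  echo cuda-dcgm ;;
        *)       echo none ;;
    esac
}

# Usage on the node (assumes rpm is available):
#   rpm -qa | dcgm_packages_present
#   rpm -qf /usr/share/nvidia-validation-suite   # package that owns the path
```

If the result is datacenter-gpu-manager, NVIDIA's package is already installed, and the conflict above is what would be expected when cuda-dcgm is installed on top of it.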
Bright's GPU detection uses nv-hostengine, which is provided by NVIDIA's dcgm.service. Once this service is enabled and running on the node, the GPUs can be detected:
[root@node001 ~]# systemctl status dcgm.service
● dcgm.service - DCGM service
   Loaded: loaded (/usr/lib/systemd/system/dcgm.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2019-11-05 13:27:25 UTC; 3s ago
 Main PID: 77500 (nv-hostengine)
    Tasks: 6
   Memory: 9.2M
   CGroup: /system.slice/dcgm.service
           └─77500 /usr/bin/nv-hostengine -n
Nov 05 13:27:25 node001 systemd[1]: Started DCGM service.
Nov 05 13:27:25 node001 nv-hostengine[77500]: DCGM initialized
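If the status output shows the service as disabled or inactive instead, it can be enabled and started with standard systemctl subcommands. The following is a minimal sketch rather than a Bright-specific procedure. The function name ensure_dcgm_service and the SYSTEMCTL override variable are hypothetical, and the override exists only so that the logic can be dry-run without touching a real node.

```shell
# Hypothetical helper: enable and start dcgm.service if needed, then print
# its active state. SYSTEMCTL may be set to another command for dry runs;
# it defaults to the real systemctl.
ensure_dcgm_service() {
    sc=${SYSTEMCTL:-systemctl}
    # Enable the unit so that it starts at boot.
    if ! "$sc" is-enabled --quiet dcgm.service; then
        "$sc" enable dcgm.service || return 1
    fi
    # Start the unit if it is not running yet.
    if ! "$sc" is-active --quiet dcgm.service; then
        "$sc" start dcgm.service || return 1
    fi
    # Print the final state; the exit status is nonzero if the unit is not active.
    "$sc" is-active dcgm.service
}

# Usage, as root on the node:
#   ensure_dcgm_service
```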
[root@bright82 ~]# cmsh -c 'device sysinfo node001 | grep -i gpu'
Number of GPUs 8
GPU Driver Version
GPU0 Name NVIDIA
GPU0 Power Limit 0 W
GPU1 Name NVIDIA
GPU1 Power Limit 0 W
GPU2 Name NVIDIA
GPU2 Power Limit 0 W
GPU3 Name NVIDIA
GPU3 Power Limit 0 W
GPU4 Name NVIDIA
GPU4 Power Limit 0 W
GPU5 Name NVIDIA
GPU5 Power Limit 0 W
GPU6 Name NVIDIA
GPU6 Power Limit 0 W
GPU7 Name NVIDIA
GPU7 Power Limit 0 W
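The before-and-after sysinfo outputs above differ in two ways: the DCGM connection error disappears, and the reported number of GPUs changes. This check can be scripted. The following is a minimal sketch that assumes the sysinfo text has the same line layout as the output shown above. The function name check_gpu_sysinfo is hypothetical, and it reads the sysinfo text from standard input.

```shell
# Hypothetical helper: reads sysinfo text on stdin. Returns 1 if the DCGM
# connection error is present; otherwise prints the value from the
# "Number of GPUs" line and returns 0.
check_gpu_sysinfo() {
    out=$(cat)
    case "$out" in
        *"cannot connect to DCGM"*)
            echo "DCGM is not reachable on the node" >&2
            return 1 ;;
    esac
    # The GPU count is the last field of the "Number of GPUs" line.
    printf '%s\n' "$out" | awk '/Number of GPUs/ { print $NF }'
}

# Usage on the head node, with the command used earlier in this article:
#   cmsh -c 'device sysinfo node001 | grep -i gpu' | check_gpu_sysinfo
```

For the failing output at the top of this article the function returns a nonzero status, and for the working output it prints the GPU count.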