From the 8.2 admin manual:
3.13.2 GPU Unit Configuration Example: The Dell PowerEdge C410x
An example of a GPU unit is the Dell PowerEdge C410x, which comes in a 3U chassis, holds up to 16 Tesla M-series GPUs (as cards in slots), and can allocate up to 4 GPUs per node.
It can be configured to work with Bright Cluster Manager as follows:
1. The GPUs in the unit are assigned to nodes using the direct web interface provided by the Baseboard Management Controller (BMC) of the C410x. This configuration is done outside of Bright Cluster Manager. The assignment of the GPUs can be done only according to the fixed groupings permitted by the web interface.
For example, if the unit has 16 GPUs in total (1, 2,..., 16), and there are 4 nodes in the cluster (node001, node002,..., node004), then an appropriate mapping of GPU indices to nodes may be:
- node001 is assigned 1, 2, 15, 16
- node002 is assigned 3, 4, 13, 14
- node003 is assigned 5, 6, 11, 12
- node004 is assigned 7, 8, 9, 10
This mapping is decided by accessing the C410x BMC with a browser, and making choices within the "Port Map" resource (figure 3.23). 4 mappings are displayed (Mapping 1, 2, 3, 4), with columns displaying the choices possible for each mapping. In the available mapping choices, the iPass value indicates a port, which corresponds to a node. Thus iPass 1 is linked to node001, and its corresponding PCIE values (1, 2, 15, 16) are the GPU indices here. Each iPass port can have 0 (N/A), 2, or 4 PCIE values associated with it, so that the GPU unit can supply up to 4 GPUs to each node.
The GPU indices (PCIE values), that is, the numbers 1, 2,..., 16, are used to identify the cards in Bright Cluster Manager. The lowest GPU index associated with a particular node (for example, 5 rather than 6 for node003) is used in the next step to make Bright Cluster Manager aware that an association exists for that node, as well as which GPU card is the first one associated with it.
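The constraints described above can be sketched in a few lines. The following is an illustrative snippet only, not part of Bright Cluster Manager: it models the port map chosen above, checks the rules the BMC web interface enforces (0, 2, or 4 PCIE values per iPass port, each GPU index used at most once), and derives the lowest index per node, which is what identifies the first GPU card in the next step.

```python
# Hypothetical sketch (not a Bright Cluster Manager tool): model the
# C410x port-map choice and check its constraints.
port_map = {
    "node001": [1, 2, 15, 16],
    "node002": [3, 4, 13, 14],
    "node003": [5, 6, 11, 12],
    "node004": [7, 8, 9, 10],
}

# Each iPass port may carry 0, 2, or 4 PCIE values (GPUs).
for node, gpus in port_map.items():
    assert len(gpus) in (0, 2, 4), f"{node}: invalid GPU count {len(gpus)}"

# Every GPU index may be assigned to at most one node.
all_gpus = [g for gpus in port_map.values() for g in gpus]
assert len(all_gpus) == len(set(all_gpus)), "a GPU index is assigned twice"

# The lowest index per node identifies that node's first GPU card.
lowest = {node: min(gpus) for node, gpus in port_map.items() if gpus}
print(lowest)  # {'node001': 1, 'node002': 3, 'node003': 5, 'node004': 7}
```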
2. The GPUs can be assigned to the nodes in Bright Cluster Manager as follows:
First, the GPU unit holding the GPUs can be assigned to the GPU unit resource. This can be carried out by adding the hostname of the GPU unit, for example in cmsh with:
[root@bright82 ~]# cmsh
[bright82->device]% add gpuunit schwartz
This drops the administrator into the gpuunit object, which has the hostname schwartz here. The equivalent addition in Bright View can be done via Devices-->GPU Units-->Add button.
Next, the IP address, network, and BMC user name and password can be set to appropriate values:
[bright82->device*[schwartz*]]% set ip 10.148.0.10
[bright82->device*[schwartz*]]% set network bmcnet
[bright82->device*[schwartz*]]% bmcsettings
[bright82->device*[schwartz*]->bmcsettings]% set username darkhelmet
[bright82->device*[schwartz*]->bmcsettings*]% set userid 1002
[bright82->device*[schwartz*]->bmcsettings*]% set password 12345
Here, it is assumed that the BMC network to which the BMC network interface is connected has already been created. If it is not already set up, then the network can be configured as described in section 3.7.
The value of owners must be set to the list of nodes that have access to the GPUs. Also, the reserved variable userdefined1 must be set to a list of key=value pairs, so that each key (a node in owners) is assigned a value (the PCIE values for that node):
[bright82->device[schwartz]]% set owners node001 node002 node003 node004
[bright82->device[schwartz*]]% set userdefined1 node001=1,2,15,16 node002=3,4,13,\
14 node003=5,6,11,12 node004=7,8,9,10
As a convenience, if the indices for a node are consecutive, as is the case for node004 here, then only the lowest index need be set for that node. The line setting the userdefined1 value can thus equivalently be carried out with:
[bright82->device[schwartz*]]% set userdefined1 node001=1,2,15,16 node002=3,4,13,\
14 node003=5,6,11,12 node004=7
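The shorthand expansion can be made concrete with a small sketch. This is a hypothetical helper, not a Bright Cluster Manager command; the assumption that a lone index stands for a run of 4 consecutive indices (GPUS_PER_NODE = 4) reflects the C410x limit of 4 GPUs per node described above.

```python
# Hypothetical helper (not part of cmsh): expand the userdefined1
# shorthand, where a single index for a node stands for a run of
# consecutive indices starting at that index.
GPUS_PER_NODE = 4  # assumption: the C410x supplies up to 4 GPUs per node

def expand(userdefined1: str) -> dict:
    mapping = {}
    for pair in userdefined1.split():
        node, _, values = pair.partition("=")
        indices = [int(v) for v in values.split(",")]
        if len(indices) == 1:  # shorthand: only the lowest index given
            indices = list(range(indices[0], indices[0] + GPUS_PER_NODE))
        mapping[node] = indices
    return mapping

short = "node001=1,2,15,16 node002=3,4,13,14 node003=5,6,11,12 node004=7"
print(expand(short)["node004"])  # [7, 8, 9, 10]
```

The expanded result matches the full form given in the first version of the set command above.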
Once these values are committed, Bright Cluster Manager is able to track GPU assignment consistently and automatically.
3. Finally, CUDA drivers should be installed on the relevant nodes so that they can make use of the GPUs. The details on how to do this are given in section 7.4 of the Installation Manual.