agi
System Administrator
Asked a question 8 months ago

I am seeing failing health checks for our "schedulers" on 3 of 4 compute nodes, and only 1 of our 60+ jobs is running (on the remaining node) under Slurm. Can you help us figure out what the problem is so we can get our jobs running on all compute nodes at full capacity? Thanks


We would recommend checking the /var/log/slurmd log file on the compute nodes, as it may contain useful information. The /var/log/slurmctld log file on the head node may also help.
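For example (the exact log locations can vary with the Slurm and Bright versions and with the SlurmdLogFile/SlurmctldLogFile settings in slurm.conf):

# On a failing compute node
tail -n 100 /var/log/slurmd
# On the head node
grep -i error /var/log/slurmctld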

Also, check the drain status of the nodes:
# cmsh
% device
% drainstatus
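If nodes turn out to be drained or down, their state can also be checked and cleared from the Slurm side. A minimal sketch, assuming a node named node001 (substitute your own node names):

# List nodes that are down or drained, with the recorded reason
sinfo -R
# Show the full state of one node
scontrol show node node001
# Return a drained node to service once the underlying problem is fixed
scontrol update NodeName=node001 State=RESUME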

Slurm by default allocates one job per node.
Please see the following Knowledge Base article for details on how to configure Slurm to share resources: 
https://kb.brightcomputing.com/knowledge-base/how-do-i-share-resources-in-slurm/17
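In short, sharing resources means switching Slurm from whole-node allocation to a consumable-resource selection plugin. A minimal sketch of the kind of slurm.conf settings involved (partition and node names are placeholders, the exact parameters depend on your Slurm version, and on a Bright cluster such changes are normally made through the cluster manager as described in the KB article, not by editing slurm.conf directly):

# slurm.conf (excerpt)
# Schedule individual cores and memory instead of whole nodes
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Allow multiple jobs per node on the default partition
PartitionName=defq Nodes=node[001-004] Default=YES OverSubscribe=YES State=UP

Changing SelectType generally requires restarting slurmctld on the head node and slurmd on the compute nodes before the new allocation behaviour takes effect.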
