Asked a question 7 years ago

Stale files from MPI jobs filled /dev/shm, what now?


Sometimes when compute nodes stay up for long periods of time, /dev/shm gets filled with stale files. This can happen if MPI jobs abort in an unexpected way.
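Before cleaning anything, it is worth confirming the symptom with standard tools. A minimal check (the exact file names left behind vary by MPI stack):

```shell
# show how full the tmpfs backing /dev/shm is
df -h /dev/shm

# list leftover segments; orphaned files typically belong to users
# whose jobs have already finished or aborted
ls -la /dev/shm
```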

The stale files get cleaned up if the node is rebooted. A cleanup that avoids a reboot is also possible, simply by remounting /dev/shm, but this may affect MPI jobs using /dev/shm at that time.

A gentler way to deal with this is to have a script clean /dev/shm if needed. It can be run each time a job attempts to start by adding it as a custom prejob health check.

The following script deletes files under /dev/shm that don't belong to users currently running jobs on the node:



#!/bin/bash

SHMDIR=/dev/shm

# do not remove stale root files
ignoretoken="-not -user root"

# get the users on the node via ps, since w/who don't work without a login
# (UIDs above 1000 are assumed to be regular users)
for user in $(ps -eo euid,euser --sort +euid --no-headers | awk '{if ($1 > 1000) print $2;}' | uniq); do
    ignoretoken="${ignoretoken} -not -user $user"
done

# clean up everything else under /dev/shm
# (ignoretoken is deliberately left unquoted so find receives separate arguments)
find "$SHMDIR" -mindepth 1 $ignoretoken -delete
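Before wiring the script into monitoring, it can be dry-run by swapping find's -delete action for -print, which previews the files that would be removed. The sketch below demonstrates this against a throwaway directory standing in for /dev/shm (the directory and file name are illustrative):

```shell
#!/bin/sh
# use a temporary directory as a stand-in for /dev/shm in this demo
SHMDIR=$(mktemp -d)
touch "$SHMDIR/stale_segment"

# -print instead of -delete lists what would be removed, deleting nothing
find "$SHMDIR" -mindepth 1 -print

rm -rf "$SHMDIR"
```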

The following cmsh session adds the script as a prejob health check:

# cmsh
% monitoring healthchecks
% add clear_shm
% set command /path/to/custom/script
% commit
% setup healthconf <category name>
% add clear_shm
% set checkinterval prejob
% commit