How do I upgrade from 6.0 to 6.1 for SLES?

Upgrading from Bright 6.0 to 6.1 for SLES11

The procedure below can be used to upgrade a Bright 6.0 installation with failover to Bright 6.1 on the following supported SLES11 versions:
  • SLES11 SP2

Prerequisites

  • IMPORTANT: Support for SLES11 SP1 has been dropped in Bright 6.1, so upgrade to SLES11 SP2 before upgrading to Bright 6.1.
  • Make sure a full backup of both head nodes is available and working.
  • Turn off all nodes.
  • If there is a cloud setup configured:
    • Cluster Extension Scenario:
      • Terminate cloud nodes
      • Terminate cloud director(s) (see the cmsh sketch after this list)
  • Extra distribution packages will be installed. For enterprise Linux distributions (SUSE Linux Enterprise), a valid RPM repository must be configured and accessible.
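
The cloud clean-up above can be done from cmsh on the active head node. A minimal sketch in the notation used later in this guide; the node range and the terminate syntax are assumptions, so substitute the actual cloud node and cloud director names for your cluster:

 # Terminate cloud nodes first, then the cloud director(s)
 cmsh -> device terminate -n cnode001..cnode1000
 cmsh -> device terminate -n <cloud-director-hostname>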

On the primary (active) and secondary (passive) head nodes

Apply existing updates to Bright 6.0

zypper clean
zypper refresh
zypper update 

Update Bright zypper repo configuration to 6.1

If the head node has access to the internet:

In /etc/zypp/repos.d/Cluster_Manager_Base.repo and /etc/zypp/repos.d/Cluster_Manager_Updates.repo, change all occurrences of 6.0 to 6.1.
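
This can be scripted with perl in the same style as the repo-disabling one-liners below; a sketch, assuming the version string appears only in the repo name and baseurl lines:

# Bump the Bright repo version from 6.0 to 6.1 in both repo files
perl -pi -e 's/6\.0/6\.1/g' /etc/zypp/repos.d/Cluster_Manager_Base.repo
perl -pi -e 's/6\.0/6\.1/g' /etc/zypp/repos.d/Cluster_Manager_Updates.repo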

OR

Use a Bright 6.1 DVD/ISO as a repo:

# Mount DVD
mount -o loop /path/to/bright/dvd /mnt1

# Create repo file
cat <<EOF >/etc/zypp/repos.d/cm6.1-dvd.repo
[cm-dvdrepo-6.1]
name=Bright Cluster Manager 6.1 DVD Repo
baseurl=file:///mnt1/data/cm-rpms/6.1
enabled=1
gpgcheck=1
EOF

# Disable the Bright 6.0 repo:
perl -pi -e 's/enabled=1/enabled=0/g' /etc/zypp/repos.d/Cluster_Manager_Base.repo
perl -pi -e 's/enabled=1/enabled=0/g' /etc/zypp/repos.d/Cluster_Manager_Updates.repo 

Lock update of workload manager packages

zypper addlock slurm* pbspro* sge* torque* cm-hwloc 

Create FRESH file, clear zypper cache, refresh repositories

touch /cm/FRESH
zypper clean
zypper refresh 

Create back up of current cluster configuration

 service cmd stop
 cmd -x /root/cm/cmd-backup-6.0.xml

 # Stop all workload managers
 /etc/init.d/slurm stop
 /etc/init.d/sgeexecd stop
 /etc/init.d/sgemaster.sge1 stop
 /etc/init.d/pbs stop
 /etc/init.d/torque_mom stop
 /etc/init.d/torque_server stop
 /etc/init.d/maui stop
 /etc/init.d/moab stop

 # Unmount shared storage (only in failover setup)
 umount /cm/shared 
 umount /home
 

Create a backup of the MySQL configuration file and .bashrc

cp /etc/my.cnf{,.ok}
cp /root/.bashrc{,.ok} 

Remove cm-config-cm from Bright 6.0 and install the one for Bright 6.1

 rpm -e --nodeps cm-config-cm
 zypper install cm-config-cm

 # Remove the 6.0 globalarrays RPMs; they cause conflicts during dependency resolution
 zypper remove globalarrays-{gcc,open64}-openmpi-64
 

Upgrade CM packages to Bright 6.1

 zypper update 

Remove old/obsolete packages

 # Remove packages that have been obsoleted in Bright 6.1
 rpm -e $(rpm -qa | grep -E "^gotoblas|lm_sensors|rsync-cm|cmgui-json-dist" | grep _cm6.0) --nodeps

 # Remove packages for which names have changed in Bright 6.1
 rpm -e $(rpm -qa | grep -E "^fftw|freeipmi|ipmitool|globalarrays|conman|iozone|stresscpu" | grep _cm6.0)

 # Bright 6.1 will no longer provide mpich3 RPMs. The mpich RPMs will be upgraded to version 3.x
 # If the mpich3 provided by Bright is not being used, then it can be removed.

 rpm -e $(rpm -qa | grep -E "^mpich3-" | grep _cm6.0)
 

Install new packages introduced by Bright 6.1

zypper install openblas cm-conman cm-freeipmi cm-iozone cm-ipmitool stresscpu fftw{2,3}-openmpi-gcc-64 globalarrays-openmpi-gcc-64 

Install new distribution packages

# Bright 6.1 uses syslog-ng
zypper install syslog-ng 
service syslog restart 

Get list of all remaining Bright 6.0 packages

rpm -qa | grep _cm6.0 

Upgrade cuda RPMS (optional)

 # If cuda RPMS were installed on the Bright 6.0 cluster, the following RPMS will remain:
 rpm -qa | grep -E "^cuda" | grep _cm6.0

 # To remove all Bright 6.0 cuda RPMS (If multiple cuda versions were installed)
 zypper remove $(rpm -qa | grep -E "^cuda" | grep _cm6.0)

 # To remove only specific versions of cuda RPMS (for example cuda42):
 zypper remove $(rpm -qa | grep -E "^cuda42" | grep _cm6.0)

 # Install latest cuda RPMS from Bright 6.1
 zypper install cuda52*
 

Remove /cm/FRESH file

 # This is important, because future updates to any Bright config RPMs will overwrite 
 # config files in default locations, treating this as a FRESH install.

 rm /cm/FRESH
 

Check for rpmsave or rpmnew files, and fix/update them

 # Some updates may have resulted in rpmsave and/or rpmnew files being created. All of these need to be processed.

 # Use the following command to find the rpmsave and/or rpmnew files:
 
 find / -name "*.rpmnew" -o -name "*.rpmsave"
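
For each file found, the usual approach is to diff the packaged version against the configuration file currently in place and merge any local changes by hand. A minimal sketch for the .rpmnew case (.rpmsave pairs are compared the other way around; paths containing spaces are not handled):

 # Show what each .rpmnew would change relative to the config file in use
 for new in $(find / -name "*.rpmnew"); do
     diff -u "${new%.rpmnew}" "$new"
 done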

*** VERY IMPORTANT ***
# It is important to use the new cmd.conf

# Create backup of existing configuration
cp /cm/local/apps/cmd/etc/cmd.conf{,.old}

# Update required information.

1. Copy all username and password information from /cm/local/apps/cmd/etc/cmd.conf to /cm/local/apps/cmd/etc/cmd.conf.rpmnew (see the grep sketch after this list).
 The most important ones are:
 DBPass
 LDAPPass
 LDAPReadOnlyOnlyPass

2. AdvancedConfig directives:
The ProvisioningNodeAutoUpdate and ProvisioningNodeAutoUpdateTimer directives have become obsolete.
If you were using these, set the provisioningnodeautoupdatetimeout property in the base partition after the upgrade has completed.

IMPORTANT: Please copy all other AdvancedConfig directives that are being used, to /cm/local/apps/cmd/etc/cmd.conf.rpmnew
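
To see the credential values that have to be carried over, they can be grepped out of the old cmd.conf before editing the .rpmnew file; a sketch, assuming the directive names listed above:

# Print the credential directives from the old cmd.conf for copying into cmd.conf.rpmnew
grep -E "^(DBPass|LDAPPass|LDAPReadOnly)" /cm/local/apps/cmd/etc/cmd.conf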

# Use new file
cp /cm/local/apps/cmd/etc/cmd.conf{.rpmnew,}

# Use new cmd init script and then restart cmdaemon
cp /etc/init.d/cmd{,.old}
cp /etc/init.d/cmd{.rpmnew,}
service cmd restart

Restore the backed-up MySQL configuration and .bashrc

mv /etc/my.cnf{.ok,}
cp /root/.bashrc{.ok,} 

Fix /etc/motd

perl -pi -e "s/Cluster Manager ID: #00000/Cluster Manager ID: #$(cat /cm/CLUSTERMANAGERID)/g" /etc/motd 

On the primary (active) head node, update all software images

For each software image (e.g. /cm/images/default-image), perform the following:

export IMAGE=/cm/images/default-image 

Apply existing updates to software image

zypper --root $IMAGE clean
zypper --root $IMAGE update 

Update repo configuration files

If the head node has access to the internet:
In $IMAGE/etc/zypp/repos.d/Cluster_Manager_Base.repo and $IMAGE/etc/zypp/repos.d/Cluster_Manager_Updates.repo, change all occurrences of 6.0 to 6.1.
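
As on the head node, this can be scripted with perl; a sketch, assuming the same repo file names inside the image:

# Bump the Bright repo version from 6.0 to 6.1 in the image's repo files
perl -pi -e 's/6\.0/6\.1/g' $IMAGE/etc/zypp/repos.d/Cluster_Manager_Base.repo
perl -pi -e 's/6\.0/6\.1/g' $IMAGE/etc/zypp/repos.d/Cluster_Manager_Updates.repo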

OR

# Update using a Bright 6.1 DVD/ISO:
mount -o loop /path/to/bright/dvd /mnt1    # if not already mounted
mkdir $IMAGE/mnt1; mount -o bind /mnt1 $IMAGE/mnt1

# Create zypper repo configuration file for Bright 6.1

cat <<EOF >$IMAGE/etc/zypp/repos.d/cm6.1-dvd.repo
[cm-dvdrepo-6.1]
name=Bright Cluster Manager 6.1 DVD Repo
baseurl=file:///mnt1/data/cm-rpms/6.1
enabled=1
gpgcheck=1
EOF

# Disable the Bright 6.0 repo:
perl -pi -e 's/enabled=1/enabled=0/g' $IMAGE/etc/zypp/repos.d/Cluster_Manager_Base.repo
perl -pi -e 's/enabled=1/enabled=0/g' $IMAGE/etc/zypp/repos.d/Cluster_Manager_Updates.repo 

Lock update of workload manager packages

zypper --root $IMAGE addlock slurm* pbspro* sge* torque* cm-hwloc 

Create FRESH file, clear zypper cache, refresh repositories

touch $IMAGE/cm/FRESH
zypper --root $IMAGE clean
zypper --root $IMAGE refresh 

Remove cm-config-cm from Bright 6.0 and install the one for Bright 6.1

rpm --root $IMAGE -e --nodeps cm-config-cm
zypper --root $IMAGE install cm-config-cm 

Upgrade CM packages to Bright 6.1

zypper --root $IMAGE update 

Remove old/obsolete packages

rpm --root $IMAGE -e $(rpm --root $IMAGE -qa | grep -E "^freeipmi|ipmitool|lm_sensors|rsync-cm" | grep _cm6.0)

Install new packages introduced by Bright 6.1

zypper --root $IMAGE install cm-freeipmi cm-ipmitool 

Get list of all remaining Bright 6.0 packages

rpm --root $IMAGE -qa | grep _cm6.0 

Remove /cm/FRESH file

 # This is important, because future updates to any Bright config RPMs will overwrite 
 # config files in default locations, treating this as a FRESH install.

 rm $IMAGE/cm/FRESH
 

Check for rpmsave or rpmnew files, and fix/update them

 # Some updates may have resulted in rpmsave and/or rpmnew files being created. All of these need to be processed. 
 
 # Use the following command to find the rpmsave and/or rpmnew files:
 
 find $IMAGE/ -name "*.rpmnew" -o -name "*.rpmsave"

*** VERY IMPORTANT ***
# It is important to use the new cmd.conf

# Create backup of existing configuration
cp $IMAGE/cm/local/apps/cmd/etc/cmd.conf{,.old}

# Update required information (*** VERY IMPORTANT ***).

1. Copy all username and password information from $IMAGE/cm/local/apps/cmd/etc/cmd.conf to $IMAGE/cm/local/apps/cmd/etc/cmd.conf.rpmnew.
 The most important one is:
 LDAPReadOnlyOnlyPass

2. AdvancedConfig directives:
The ProvisioningNodeAutoUpdate and ProvisioningNodeAutoUpdateTimer directives have become obsolete.
If you were using these, set the provisioningnodeautoupdatetimeout property in the base partition after the upgrade has completed.

IMPORTANT: Please copy all other AdvancedConfig directives that are being used, to $IMAGE/cm/local/apps/cmd/etc/cmd.conf.rpmnew

# Use new file
cp $IMAGE/cm/local/apps/cmd/etc/cmd.conf{.rpmnew,}

# Use new cmd init script
cp $IMAGE/etc/init.d/cmd{,.old}
cp $IMAGE/etc/init.d/cmd{.rpmnew,}

Propagate changes in the software image(s) to the secondary (passive) head node

cmsh -> softwareimage -> updateprovisioners

 # Wait for 'Provisioning completed' event
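
 The same step can be run non-interactively; a sketch, assuming cmsh's -c option for passing commands on the command line:

 # Push the updated software image(s) to all provisioning nodes
 cmsh -c "softwareimage; updateprovisioners"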

Add Xeon Phi settings (optional)

# Add mic metric:
cmsh
$ monitoring metrics
$ add mic
$ set command /cm/local/apps/cmd/scripts/metrics/sample_mic
$ set classofmetric prototype
$ set timeout 30
$ commit

# Add mic gres type into slurmserver role:
cmsh
$ device roles master
$ use slurmserver
$ append grestypes mic
$ commit

Upgrading Workload Managers (optional)

Workload manager RPMS from Bright 6.0 will remain:

rpm -qa | grep -E "slurm|pbspro|torque|cm-hwloc|sge" | grep _cm6.0

# Back up all workload configuration files
cp /cm/shared/apps/slurm/var/etc/slurm.conf{,.bak}
cp /etc/pbs.conf{,.bak}
cp $IMAGE/etc/pbs.conf{,.bak}
 

Upgrade packages

Slurm

Remove the zypper lock for slurm and cm-hwloc, then update:

 zypper removelock slurm* cm-hwloc
 zypper update
 zypper --root $IMAGE removelock slurm* cm-hwloc
 zypper --root $IMAGE update

PBS Pro

Remove zypper lock for pbspro updates and update:

 # On the active head node
 zypper removelock pbspro*
 zypper remove pbspro-slave
 zypper install pbspro-client
 zypper update

 zypper --root $IMAGE removelock pbspro*
 zypper --root $IMAGE remove pbspro-slave
 zypper --root $IMAGE install pbspro-client 

Torque

Remove zypper lock for torque updates and update:

 # On the active head node
 zypper removelock torque*
 zypper update
 zypper --root $IMAGE removelock torque*
 zypper --root $IMAGE update 

Open Grid Scheduler/SGE

Remove zypper lock for sge updates and update:

 # On the active head node
 zypper removelock sge*
 zypper update
 zypper --root $IMAGE removelock sge*
 zypper --root $IMAGE update 

Update provisioners

Propagate changes in the software image(s) to the secondary (passive) head node

 # On the active head node
 cmsh -> softwareimage -> updateprovisioners

 # Wait for 'Provisioning completed' event 

Clean up and reboot head nodes

Re-do shared storage setup (if failover setup) from active head node

cmha-setup -> Shared Storage 

Repair slurm config on the active head node (if slurm power save was enabled)

 # Create back up of existing slurm.conf
 cp /etc/slurm/slurm.conf{,.bak}
 
 # Remove power save definitions between old markers (including markers).
 sed -i '/# ##### CM-POWER-SAVE-ENABLE #####/,/# ##### CM-POWER-SAVE-ENABLE #####/d' /etc/slurm/slurm.conf
 
 # Check diff with backup file, to make sure only the duplicate power save defs were removed
 diff /etc/slurm/slurm.conf /etc/slurm/slurm.conf.bak

 # Re-read slurm config
 scontrol reconfigure
 

 Update Slurm prologs

 # In slurm.conf replace
 PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog
 # with
 PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-healthchecker
 Prolog=/cm/local/apps/cmd/scripts/prolog
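
 A scripted way to make this substitution; a sketch that assumes slurm.conf lives at /etc/slurm/slurm.conf (as in the power-save step above), that the PrologSlurmctld line matches exactly, and that no Prolog line exists yet:

 # Back up slurm.conf, switch PrologSlurmctld to the healthchecker script and add the Prolog line
 cp /etc/slurm/slurm.conf{,.prolog-bak}
 sed -i 's|^PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog$|PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-healthchecker\nProlog=/cm/local/apps/cmd/scripts/prolog|' /etc/slurm/slurm.conf

 # Re-read the slurm config
 scontrol reconfigure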

Unmount ISO (if it was used as the repo) and reboot

 # On the active head node:
 umount /mnt1
 umount $IMAGE/mnt1
 reboot

 # On the passive head node:
 umount /mnt1
 reboot 

Boot cloud director, cloud nodes and regular nodes

 # From active head node

 # Boot cloud director(s)
 cmsh -> device power on -n <cloud-director-hostname>

 # Boot cloud nodes
 cmsh -> device power on -n cnode001..cnode1000

 # Boot regular nodes
 cmsh -> device power on -n node001..node1000