How to Install HTCondor from sources on top of a Bright Cluster

HTCondor can be installed on top of a Bright Cluster as follows:

Note: The following instructions have been tested on Bright 7.3 with CentOS 7 as the base OS.

On the head node

1. Install dependencies:

# yum install setools-console policycoreutils-python perl-Date-Manip.noarch

2. Untar the sources:

# tar -xzvf condor-8.4.9-x86_64_RedHat7-stripped.tar.gz

3. Add condor user:

# cmsh
% user add condor
% commit

4. Install Condor using the condor_install script (note that Condor must not be owned by root, so specify a non-root user with the --owner option):

# cd condor-8.4.9-x86_64_RedHat7-stripped/

# ./condor_install --prefix=/cm/shared/apps/condor/8.4.9 --owner=condor --install-dir=/cm/shared/apps/condor/8.4.9

Installing Condor from /root/condor-8.4.9-x86_64_RedHat7-stripped to /cm/shared/apps/condor/8.4.9

Condor has been installed into:

   /cm/shared/apps/condor/8.4.9

Configured condor using these configuration files:

 global: /cm/shared/apps/condor/8.4.9/etc/condor_config

 local:  /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/condor_config.local

In order for Condor to work properly you must set your CONDOR_CONFIG

environment variable to point to your Condor configuration file:

/cm/shared/apps/condor/8.4.9/etc/condor_config before running Condor

commands/daemons.

Created scripts which can be sourced by users to setup their

Condor environment variables.  These are:

  sh: /cm/shared/apps/condor/8.4.9/condor.sh

 csh: /cm/shared/apps/condor/8.4.9/condor.csh

5. Copy the Condor environment setup scripts to /etc/profile.d:

# cp --preserve /cm/shared/apps/condor/8.4.9/condor.{sh,csh} /etc/profile.d/

6. Configure Condor with condor_configure script:

# ./condor_configure --type=manager,submit --verbose --install-dir=/cm/shared/apps/condor/8.4.9

Condor will be run as user: condor

Install directory: /cm/shared/apps/condor/8.4.9

Main config file: /cm/shared/apps/condor/8.4.9/etc/condor_config

Local directory: /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2

Local config file: /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/condor_config.local

Writing settings to file: /cm/shared/apps/condor/8.4.9/etc/condor_config

CONDOR_HOST=ma-c-12-30-b73-c7u2.cm.cluster
COLLECTOR_NAME=
DAEMON_LIST=COLLECTOR MASTER NEGOTIATOR SCHEDD

Configured condor using these configuration files:

 global: /cm/shared/apps/condor/8.4.9/etc/condor_config

 local:  /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/condor_config.local

In order for Condor to work properly you must set your CONDOR_CONFIG

environment variable to point to your Condor configuration file:

/cm/shared/apps/condor/8.4.9/etc/condor_config before running Condor

commands/daemons.

Created scripts which can be sourced by users to setup their

Condor environment variables.  These are:

  sh: /cm/shared/apps/condor/8.4.9/condor.sh

 csh: /cm/shared/apps/condor/8.4.9/condor.csh

7. Modify the condor_config file so that the various configuration parameters point to the correct paths, and expand the Condor pool beyond a single host by setting ALLOW_WRITE to match all of the hosts:

# cat /cm/shared/apps/condor/8.4.9/etc/condor_config | grep -vE "^#|^$" 
RELEASE_DIR = /cm/shared/apps/condor/8.4.9
LOCAL_DIR = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)
LOCAL_CONFIG_FILE = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config
use SECURITY : HOST_BASED
ALLOW_WRITE = *.cm.cluster
use ROLE : Personal
CONDOR_HOST = master.cm.cluster
UID_DOMAIN = cm.cluster
FILESYSTEM_DOMAIN = cm.cluster
LOCK = /tmp/condor-lock.0.0129490057743205
CONDOR_IDS = 1001.1001
CONDOR_ADMIN = root@master.cm.cluster
MAIL = /usr/bin/mail
JAVA = /usr/bin/java
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
DAEMON_LIST = MASTER COLLECTOR SCHEDD NEGOTIATOR
STARTD_DEBUG = D_FULLDEBUG
COLLECTOR_DEBUG = D_FULLDEBUG
COLLECTOR_HOST = $(CONDOR_HOST):9618
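
The ALLOW_WRITE setting controls which hosts may advertise themselves to, and thus join, the pool. For illustration, a couple of hypothetical patterns in condor_config syntax (the hostnames are stand-ins; adjust to your cluster's internal domain):

```
# Match every host in the cluster's internal DNS domain:
ALLOW_WRITE = *.cm.cluster

# Or list name patterns explicitly (hypothetical names):
ALLOW_WRITE = master.cm.cluster, node*.cm.cluster
```

Keeping the pattern restricted to the internal domain avoids opening the pool to arbitrary external hosts.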

8. Create a startup/boot script for the Condor services:

# cp --preserve /cm/shared/apps/condor/8.4.9/etc/examples/condor.service /lib/systemd/system/
# cat /lib/systemd/system/condor.service

[Unit]

Description=Condor Distributed High-Throughput-Computing

After=syslog.target network-online.target nslcd.service ypbind.service

Wants=network-online.target

[Service]
Environment=CONDOR_CONFIG=/cm/shared/apps/condor/8.4.9/etc/condor_config
ExecStart=/cm/shared/apps/condor/8.4.9/sbin/condor_master -f
ExecStop=/cm/shared/apps/condor/8.4.9/sbin/condor_off -master
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=1minute
StandardOutput=syslog
LimitNOFILE=16384

[Install]
WantedBy=multi-user.target

# systemctl enable condor.service
Created symlink from /etc/systemd/system/multi-user.target.wants/condor.service to /usr/lib/systemd/system/condor.service.

9. Start the Condor service:

# systemctl restart condor.service
# systemctl status condor.service

● condor.service - Condor Distributed High-Throughput-Computing

  Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled; vendor preset: disabled)

  Active: active (running) since Fri 2016-12-30 11:24:23 CET; 4s ago

Main PID: 15093 (condor_master)

  CGroup: /system.slice/condor.service

          ├─15093 /cm/shared/apps/condor/8.4.9/sbin/condor_master -f

          ├─15118 condor_procd -A /tmp/condor-lock.0.0129490057743205/procd_pipe -L /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/log/ProcLog -R 1000000 -S 60 -C 1001

          ├─15119 condor_collector -f

          ├─15132 condor_negotiator -f

          └─15133 condor_schedd -f

Dec 30 11:24:23 ma-c-12-30-b73-c7u2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

Dec 30 11:24:23 ma-c-12-30-b73-c7u2 systemd[1]: Starting Condor Distributed High-Throughput-Computing...

# ps aux | grep condor

condor     15093  0.0  0.1  42884  5516 ?        Ss   11:24   0:00 /cm/shared/apps/condor/8.4.9/sbin/condor_master -f

root       15118  0.0  0.1  23004  4580 ?        S    11:24   0:00 condor_procd -A /tmp/condor-lock.0.0129490057743205/procd_pipe -L /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/log/ProcLog -R 1000000 -S 60 -C 1001

condor     15119  0.0  0.1  64064  6352 ?        Ss   11:24   0:00 condor_collector -f

condor     15132  0.0  0.1  42884  5480 ?        Ss   11:24   0:00 condor_negotiator -f

condor     15133  0.0  0.1  63268  7144 ?        Ss   11:24   0:00 condor_schedd -f

root       15211  0.0  0.0 112648   956 pts/0    S+   11:25   0:00 grep --color=auto condor

In the software image (assuming default-image is the image currently used by the compute nodes)

1. Install dependencies:

# yum install setools-console policycoreutils-python perl-Date-Manip.noarch --installroot=/cm/images/default-image

2. Install Condor inside the software image

Create a local configuration directory for each compute node (substitute node001/node002 with the correct node names, and repeat or loop for the required number of nodes):

# cp -r --preserve /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/ /cm/shared/apps/condor/8.4.9/local.node001/

# cp -r --preserve /cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2/ /cm/shared/apps/condor/8.4.9/local.node002/
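
Copying the directory once per node gets tedious on larger clusters. A hypothetical dry run of the loop, printing the cp commands for node001..node004 instead of executing them (adjust the range and paths to match your cluster):

```shell
# Dry run: print the per-node copy commands rather than executing them.
# The source directory is the head node's local directory created earlier.
src=/cm/shared/apps/condor/8.4.9/local.ma-c-12-30-b73-c7u2
for i in $(seq 1 4); do
  printf 'cp -r --preserve %s/ /cm/shared/apps/condor/8.4.9/local.node%03d/\n' "$src" "$i"
done
```

Dropping the printf wrapper and running cp directly performs the actual copies.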

Create a startup/boot script for starting Condor services in the software image:

# cat /cm/images/default-image/lib/systemd/system/condor.service

[Unit]
Description=Condor Distributed High-Throughput-Computing
After=syslog.target network-online.target nslcd.service ypbind.service network.target
Wants=network-online.target network.target

[Service]
Environment=CONDOR_CONFIG=/cm/shared/apps/condor/8.4.9/local.%H/condor_config.local
ExecStart=/cm/shared/apps/condor/8.4.9/sbin/condor_master -f
ExecStop=/cm/shared/apps/condor/8.4.9/sbin/condor_off -master
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=1minute
StandardOutput=syslog
LimitNOFILE=16384

[Install]
WantedBy=multi-user.target
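
The %H in the Environment line is a systemd specifier that expands to the node's hostname, so every compute node picks up its own local.<node> configuration from the shared filesystem. A small sketch of the expansion, using the path from the unit file above:

```shell
# Mimic systemd's %H specifier, which expands to the machine's hostname,
# to show which per-node config file a given compute node would load.
H=$(hostname)
echo "/cm/shared/apps/condor/8.4.9/local.${H}/condor_config.local"
```

This is why a single unit file in the image works for all nodes, as long as a matching local.<node> directory exists for each of them.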

Copy the condor_config file to condor_config.local under each local.<node> directory, after changing DAEMON_LIST so that compute nodes run only the MASTER and STARTD daemons:

# cat /cm/shared/apps/condor/8.4.9/local.node001/condor_config.local | grep -vE "^#|^$"
RELEASE_DIR = /cm/shared/apps/condor/8.4.9
LOCAL_DIR = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)
LOCAL_CONFIG_FILE = /cm/shared/apps/condor/8.4.9/local.$(HOSTNAME)/condor_config.local
LOCAL_CONFIG_DIR = $(LOCAL_DIR)/config
use SECURITY : HOST_BASED
use ROLE : Personal
CONDOR_HOST = master
ALLOW_WRITE = *
UID_DOMAIN = cm.cluster
FILESYSTEM_DOMAIN = cm.cluster
LOCK = /tmp/condor-lock.0.0129490057743205
CONDOR_IDS = 1001.1001
CONDOR_ADMIN = root@master.cm.cluster
MAIL = /usr/bin/mail
JAVA = /usr/bin/java
JAVA_MAXHEAP_ARGUMENT = -Xmx1024m
COLLECTOR_HOST = $(CONDOR_HOST):9618
DAEMON_LIST = MASTER STARTD
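
Deriving a compute node's condor_config.local from the head node's condor_config is mostly a matter of swapping the daemon list. A hypothetical sketch using sed on a stand-in file (the real source and destination would be the shared paths above):

```shell
# Demonstrate the DAEMON_LIST swap on a stand-in config file.
src=$(mktemp)
cat > "$src" <<'EOF'
CONDOR_HOST = master
DAEMON_LIST = MASTER COLLECTOR SCHEDD NEGOTIATOR
EOF
# Compute nodes only need the master and the execute daemon (startd).
sed 's/^DAEMON_LIST = .*/DAEMON_LIST = MASTER STARTD/' "$src"
rm -f "$src"
```

Redirecting the sed output to each local.<node>/condor_config.local applies the change per node.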

3. Reboot the compute nodes so that they are provisioned with the modified software image.

Check the pool status from the head node after the nodes are up:

# condor_status

Name                 OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

node001.cm.cluster   LINUX      X86_64 Unclaimed Idle      0.000  993  0+00:30:04
node002.cm.cluster   LINUX      X86_64 Unclaimed Idle      0.150  993  0+00:00:04

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       0         2       0          0        0

               Total     2     0       0         2       0          0        0


Submitting a job

Submitting jobs as root is not allowed, so switch to a non-root user (cmsupport in this example) before submitting:

# su - cmsupport
$ cat hostname.sh
#!/bin/bash
hostname -f
date
sleep 20
date
echo "exit"

$ cat hostname.condor
############
# Example job file
############
Universe=vanilla
Executable=/home/cmsupport/hostname.sh
input=/dev/null
output=hostname.out
error=hostname.error
Queue
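
For reference, a hypothetical variant of the submit description file that adds a job log and queues two copies of the job; $(Cluster) and $(Process) are standard HTCondor submit macros that keep the per-job output files apart:

```
Universe    = vanilla
Executable  = /home/cmsupport/hostname.sh
Input       = /dev/null
Output      = hostname.$(Cluster).$(Process).out
Error       = hostname.$(Cluster).$(Process).error
Log         = hostname.$(Cluster).log
Queue 2
```

The Log file records the job's lifecycle events (submission, execution host, completion) and is useful when debugging jobs that stay idle.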

$ condor_submit hostname.condor
$ condor_q

-- Schedd: ma-c-12-30-b73-c7u2.cm.cluster : <10.141.255.254:50275?...
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  9.0   cmsupport      12/30 17:33   0+00:00:06 R  0   0.0  hostname.sh

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended


$ cat hostname.out

node002.cm.cluster

Fri Dec 30 17:33:58 CET 2016

Fri Dec 30 17:34:18 CET 2016

exit

$ condor_history 
ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD            
  9.0   cmsupport      12/30 17:33   0+00:00:20 C  12/30 17:34 /home/cmsupport/hostname.sh
  6.0   cmsupport      12/30 17:22   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  5.0   cmsupport      12/30 16:58   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  8.0   cmsupport      12/30 17:33   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  7.0   cmsupport      12/30 17:32   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  1.0   cmsupport      12/30 16:38   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  2.0   cmsupport      12/30 16:40   0+00:00:00 X         ???  /home/cmsupport/hostname.sh
  4.0   condor         12/30 16:43   0+00:00:00 X         ???  /home/condor/hostname.sh