Condor installation on online Linux pool
Preparation questions for Condor installation:
1. What machine will be the central manager?
onlam2. At some point, perhaps I will set up condor_had (High Availability Daemon) to allow failover to onlldap.
2. Which machines should be allowed to submit jobs?
onl01-onl14. Since onl01-onl04 are still old nodes waiting for replacement, I'll start from onl14 and work down. Eventually, perhaps other online nodes can be added to the pool.
3. Will Condor run as root or not?
Condor daemons will start up under the root account.
4. Who will administer Condor?
Initially, Wayne Betts. (Technically, anyone with root access could do it.)
5. Will you have a Unix user named condor, and will its home directory be shared?
Yes, there will be a condor user in NIS with a shared home directory, just like the usual human accounts, at least for the onlNN hosts. If additional submit hosts are included, they will have to be considered on a case-by-case basis.
6. Where should the machine-specific directories for Condor go?
The machine-specific directories are spool, log, execute and lock. One option is to store these in condor's shared home directory, in which case the configuration file for the onlNN nodes should have "LOCAL_DIR=$(TILDE)/hosts/$(HOSTNAME)". Note that it is recommended that Condor's lock directory not be on NFS, but so far I have not found a way to specify the lock directory location separately from the other three (spool, log and execute).
Alternatively, a local Condor directory (e.g. /var/condor) could be created on each node in the pool, in which case the configuration file should have "LOCAL_DIR=/var/condor". I'm leaning towards this second option to reduce network activity (and presumably increase robustness) and to distribute the disk usage.
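The two options could be expressed in condor_config roughly like this (a sketch, not yet verified against this install):

```
# Option 1: machine-specific directories under condor's shared NFS home
# LOCAL_DIR = $(TILDE)/hosts/$(HOSTNAME)

# Option 2: a local directory on each node (the one I'm leaning towards)
LOCAL_DIR = /var/condor
```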
7. Where should parts of the Condor system be installed?
Configuration files: The global configuration file will be in Condor's shared home directory (~condor/etc/condor_config). There will also be a link from ~condor/condor_config to ~condor/etc/condor_config. There may be additional local configuration files for each machine in /etc/condor, which will have to be specified in the global configuration file using LOCAL_CONFIG_FILE.
Release directory: RELEASE_DIR will be specified as "$(TILDE)" in condor_config. Links from /usr/local/[s]bin will be created on each node to point to the binaries in ~condor/[s]bin.
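A sketch of the corresponding condor_config entries (the /etc/condor path is the per-machine local-file layout described above):

```
RELEASE_DIR = $(TILDE)
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
```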
8. Am I using AFS?
Yes and no. None of the Condor installation will use AFS, though individual users may use files from AFS. Since Condor jobs have no internal AFS authentication mechanism, users will have to make sure they only read world-readable files in AFS and avoid writing anything to AFS, unless they generate credentials manually. (I'll be curious to see if this works.)
9. Do I have enough disk space for Condor?
I'm assuming there will be no checkpointing and that usage will be relatively light, so the local Condor directories should not grow terribly large, but we'll have to see.
Installation:
I downloaded "condor-7.4.1-linux-x86-rhel5-dynamic-1.i386.rpm" to get started on onlam2. It installs into /opt/condor-7.4.1. The rpm installation does not put any services into the startup configuration, but there is a sample init script included (/opt/condor-7.4.1/etc/examples/condor.boot). Once edited to point to the actual condor_master location, this works for starting and stopping Condor with the usual "service condor start/stop".
(Oops, I just noticed there are at least TWO sample init scripts, one named condor.boot and one named condor.init. I'm trying to figure out which is the better starting point... The README doesn't mention condor.init, but condor.init appears to be much better developed than condor.boot: condor.boot has only start and stop functions, while condor.init supports {start|stop|restart|try-restart|reload|force-reload|status}.)
For the condor.init script to work as-is, I created a link to the condor_master executable:
ln -s /opt/condor-7.4.1/sbin/condor_master /usr/sbin/condor_master
Edited /opt/condor-7.4.1/etc/condor_config as appropriate (saving a copy of the original in /opt/condor-7.4.1/etc/condor_config.original).
[root@onlam2 condor-7.4.1]# mkdir /etc/condor
[root@onlam2 condor-7.4.1]# touch /etc/condor/condor_config.local
[root@onlam2 condor-7.4.1]# /opt/condor-7.4.1/condor_configure --type manager,submit,execute
[root@onlam2 condor-7.4.1]# ln -s /opt/condor-7.4.1/sbin/condor_master /usr/sbin/condor_master
(This is necessary for the init script to find the condor_master executable)
[root@onlam2 condor-7.4.1]# mkdir /var/condor
[root@onlam2 condor-7.4.1]# mkdir /var/condor/log
[root@onlam2 condor-7.4.1]# mkdir /var/condor/spool
[root@onlam2 condor-7.4.1]# mkdir /var/condor/lock
[root@onlam2 condor-7.4.1]# mkdir /var/condor/execute
[root@onlam2 condor-7.4.1]# chown condor:condor /var/condor/*
[root@onlam2 condor-7.4.1]# mkdir /var/run/condor
[root@onlam2 condor-7.4.1]# chown condor:condor /var/run/condor
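The directory setup above can be condensed into a short script for rolling out to the other onlNN nodes (a sketch, assuming LOCAL_DIR=/var/condor; CONDOR_LOCAL_DIR is a stand-in variable I've added so the path can be overridden):

```shell
#!/bin/sh
# Per-node setup for Condor's machine-specific directories,
# assuming LOCAL_DIR=/var/condor as chosen above.
CONDOR_LOCAL_DIR="${CONDOR_LOCAL_DIR:-/var/condor}"

# The four machine-specific directories: log, spool, lock, execute
for d in log spool lock execute; do
    mkdir -p "$CONDOR_LOCAL_DIR/$d"
done

# PID directory for the init script
mkdir -p /var/run/condor

# chown only if the condor account exists on this node (it comes from NIS)
if id condor >/dev/null 2>&1; then
    chown -R condor:condor "$CONDOR_LOCAL_DIR" /var/run/condor
fi
```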
[root@onlam2 condor-7.4.1]# iptables -I INPUT -p tcp -s 130.199.60.0/23 --dport 9600:9700 -j ACCEPT
[root@onlam2 condor-7.4.1]# iptables -I INPUT -p udp -s 130.199.60.0/23 --dport 9600:9700 -j ACCEPT
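For the firewall rules above to actually line up with Condor's traffic, the daemons should be pinned to the same port range in condor_config (hedged: LOWPORT/HIGHPORT are the knobs I believe control this):

```
# Restrict Condor's network traffic to the range opened in iptables
LOWPORT = 9600
HIGHPORT = 9700
```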
With this installation, Condor environment variables need to be set by sourcing the appropriate file:
sh: /opt/condor-7.4.1/condor.sh
csh: /opt/condor-7.4.1/condor.csh
I put copies of these in /etc/profile.d so they are automatically loaded into users' environments at login.
I briefly tried using condor_master to start and stop things, but so far it baffles me how to use it and the related executables -- in the same sbin directory there are 11 executable files that are identical except for their names: condor, condor_checkpoint, condor_master_off, condor_off, condor_on, condor_reconfig, condor_reconfig_schedd, condor_reschedule, condor_restart, condor_set_shutdown and condor_vacate. None of condor_off, condor_master_off and "condor_off -master" seems to do anything -- no Condor processes are stopped as a result of any of these commands. Something to come back to later...
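The byte-identical files are presumably a single program that changes behavior based on the name it was invoked under ($0), which would explain why eleven differently-named tools can be the same executable. A minimal sketch of that pattern (the messages are made up, not real Condor output):

```shell
#!/bin/sh
# Name-based dispatch: one program, many names. The program inspects
# the name it was invoked under and acts accordingly.
dispatch() {
    case "$(basename "$1")" in
        condor_on)      echo "enable daemons" ;;
        condor_off)     echo "disable daemons" ;;
        condor_restart) echo "restart daemons" ;;
        *)              echo "unknown: $(basename "$1")" ;;
    esac
}

dispatch /usr/sbin/condor_off   # prints: disable daemons
```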
I also need to better understand what "condor_configure" does beyond setting the daemons to start up. I have no idea whether condor_configure changes the condor_config file, reads from it, or ignores it completely.
The configuration file is quite large, confusing, and in some cases misleading.
Figured out one gotcha - there is a default keyboard idle requirement of 15 minutes, and typing in an ssh session counts as keyboard use, so if the nodes allow interactive logins, jobs will likely never start. I'm trying to figure out how to change that policy (it seems to live in the default START expression rather than being a ClassAd I can simply remove).
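One likely fix (untested here) is to override the default policy expressions in condor_config so jobs start and keep running regardless of keyboard activity:

```
# Always start jobs and never suspend/preempt on keyboard activity
START    = True
SUSPEND  = False
CONTINUE = True
PREEMPT  = False
KILL     = False
```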
It somehow decided there are 6 CPUs in a box that has only two CPUs (plus hyperthreading, which could conceivably make it look like 4) - how the heck did it come up with 6?
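If the detected count stays wrong, the advertised number can apparently be forced in condor_config (hedged: NUM_CPUS is the knob I'd try first):

```
# Override Condor's CPU detection on this node
NUM_CPUS = 2
```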
Random tidbits of useful stuff:
condor_q -pool onlam2.starp.bnl.gov -global (This is the same as "condor_q -global", if CONDOR_HOST (the central manager) is set properly.)