HLT to OLP notes

These notes describe turning l402 and l403 into members of the OLP HTCondor pool for use with "grid" job submission of offline production and simulation jobs from Leve and Lidia.
As the work is ongoing, details will be added.  Two inter-related goals of these notes are:

  1. a guide to converting additional HLT machines into "OLP"-like worker nodes
  2. a guide to undoing the changes, piecemeal or in bulk, relatively quickly should the primary mission of the HLT machines require it

Networking

First and foremost is the need to get network connectivity for access to AFS (for STAR software), database servers, and the online CephFS (for the OSG WNC and possibly future STAR software deployments).  Adding an interface to the starp network is the natural choice for this.  l402 and l403 have three network interfaces each.  It appears that historically two of them were bonded together (eth0 and eth1 @ 1Gb/s each) but that bond is no longer in use.  The third NIC is a 10Gb/s NIC, connected to the local l4 network.  eth0 is being reserved for use with IPMI, so eth1 will be used for the connection to starp.  Starp addresses are reserved (l402-onl and l403-onl) and can be configured on eth1 with no changes to the existing eth2 interface on the L4 network.

DONE: configured eth1 interfaces with reserved starp addresses and plugged in network cables.  (For some unknown reason, getting the eth1 interface to work properly required a reboot of the machine after the configuration.)
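
For reference, the eth1 setup amounts to a static ifcfg file roughly like the following (a sketch only - the address values are placeholders to be filled in with the reserved l402-onl / l403-onl assignments):

      # /etc/sysconfig/network-scripts/ifcfg-eth1 (sketch; address values are placeholders)
      DEVICE=eth1
      ONBOOT=yes
      BOOTPROTO=static
      IPADDR=<reserved starp address for l402-onl or l403-onl>
      NETMASK=<starp netmask>
      # routing/GATEWAY left to match the existing setup; eth2 on the l4 network is untouched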

Reversal:  one could simply unplug the starp cable from eth1, and if desired, disable the eth1 interface

"Grid" User setup

Users running grid jobs will need to be configured.  "lbhajdu" will be the primary user, but others could also submit jobs.  Rather than set up the users individually, configuring the HLT machines as NIS clients to onlldap and onlam3 is preferable (a bit of overkill, since it includes on the order of a hundred users, very few of whom will ever use the HLT system as "grid" users, but easy to do).  All of the existing HLT locally-configured (in /etc/passwd) "flesh and bones" users have UIDs matching the NIS user list, so there should be no conflicts there.

Jobs in the general online Condor queue related to online activity very likely expect to find the user's shared home directory and various NFS mounts (eg, the event pool, trigger and scaler files).  Grid jobs (at least as Leve is submitting them for offline production/simulation) have no dependence on the user's home directory or other shared filesystems (except AFS and the OSG WNC in Ceph, described below).   Furthermore, rpm package sets are not planned (at least at this point) to be matched between the OLP and the HLT machines.  These distinctions in job requirements between the typical grid job and local online jobs indicate we should probably have a mechanism in Condor to prohibit local "online" jobs from landing on the HLT machines.  This sounds like it should be a simple matter within HTCondor, but I need to look into it.

Locally configured users use /star/u/$USER for home, while the NIS users are a mix of /ldaphome/$USER and /star/u/$USER.  This is problematic, because /star/u is already a symlink to /net/l409/home/, so we can't simply link /star/u to /ldaphome.  We can NFS mount the OLP home directories under /ldaphome to satisfy some cases, but for every NIS user with ~ = /star/u/$USER, individual directories will need to be added to l409's /home directory.  This is tedious to set up and maintain, but it can be done, at least initially, with only four users practically expected to use this - lbhajdu, didenko, wbetts and maybe starlib.  This assumes that any jobs coming through condor do not care about a shared home directory.

DONE:  added NIS client configuration for OLP (/etc/yp.conf and /etc/nsswitch.conf), started ypbind, set automounter to mount onlldap.starp.bnl.gov:/ldaphome/* on /ldaphome/* (one line in /etc/auto.master and /etc/auto.ldaphome).  lbhajdu and wbetts have home directories on l409, while didenko and starlib use /ldaphome
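
The relevant configuration lines are roughly as follows (a sketch; the NIS domain name is whatever the OLP uses and is shown here as a placeholder):

      # /etc/yp.conf (sketch)
      domain <olp-nis-domain> server onlldap.starp.bnl.gov
      domain <olp-nis-domain> server onlam3.starp.bnl.gov

      # /etc/nsswitch.conf - append nis to the relevant databases
      passwd:  files nis
      shadow:  files nis
      group:   files nis

      # /etc/auto.master - one added line
      /ldaphome /etc/auto.ldaphome

      # /etc/auto.ldaphome - wildcard map for the OLP home directories
      *  onlldap.starp.bnl.gov:/ldaphome/&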

TODO (?):  Create home directories on l409 as needed for any other users expected to submit grid jobs with a home directory of /star/u/$USER.  There might not be any other users, and even if there are, they may not need existing home directories if their jobs are set to use $OSG_WN_TMP for instance.

Reversal:  The NIS client is easily stopped and disabled ("service ypbind stop; chkconfig ypbind off") and NFS mount for /ldaphome is easily commented out in /etc/auto.master (plus "service autofs reload") and dismounted by hand if necessary.  /ldaphome can be removed.  References to "nis" can be removed from /etc/nsswitch.conf and "domain" lines can be removed from /etc/yp.conf

Kernel change

In order to use the CephFS filesystem (see below), a non-stock kernel will be installed from the elrepo-kernel repository.  Either 4.8.6-1 (as almost all of the OLP uses) or the latest 4.12.x.  The stock kernel series will still be installed (or can be re-installed from sl repos if lost for some reason), so it can be booted at any time.  (Note, this may be trickier for the machines with the Xeon-Phi cards, but l402 and l403 do not have them.)

Unfortunately, the mpss-modules package (which includes mic.ko, the kernel module needed to use the Xeon Phi cards) does not build cleanly against the 4.12 series kernel.  l402 and l403 do not have Xeon Phi cards, but most of the l4 machines do.  The first Xeon Phi-equipped test machine is l426.  So instead of the kernel-ml branch, on l426 the kernel-lt and kernel-lt-devel packages (3.10.107-1) have been installed (also from the elrepo-kernel repo), and the mpss-modules package builds cleanly from the SRPM.

DONE: 
      rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
      installed elrepo-release, then installed kernel-ml and kernel-ml-devel (or kernel-lt and kernel-lt-devel)
      modified /etc/grub.conf to boot into the 4.12 (or 3.10) kernel and rebooted into the new kernel
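
A sketch of the package and boot-loader steps (the exact elrepo-release package name/version for SL6 is not recorded here and is left as a placeholder):

      yum install <elrepo-release RPM for SL6>    # from elrepo.org; elrepo-kernel is enabled in elrepo.repo
      yum install kernel-ml kernel-ml-devel       # or kernel-lt kernel-lt-devel on Xeon Phi machines
      # point "default=" in /etc/grub.conf at the new kernel entry, then reboot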

Reversal:  change the default kernel in /etc/grub.conf to the desired (presumably the latest stock 2.6.x kernel) and reboot.

CephFS access

The OSG worker node client software is installed in the OLP CephFS as a shared file system, so access to CephFS will be configured:

DONE:
        created /etc/auto.ceph
        added ceph line to /etc/auto.master
        copied in /etc/ceph/client.cephfs2
        rpm --import https://download.ceph.com/keys/release.asc
        installed various ceph RPMs and dependencies (see list of added RPMs below)
        mkdir /etc/ceph; mkdir /mnt/auto; ln -s /mnt/auto/ceph /ceph; service autofs reload
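
The automount entries look roughly like this (a sketch; the monitor host list, CephFS path, and mount options are placeholders - the real values mirror the OLP's existing auto.ceph and the copied-in client.cephfs2 key):

        # /etc/auto.master - one added line
        /mnt/auto /etc/auto.ceph

        # /etc/auto.ceph (sketch)
        ceph  -fstype=ceph,name=cephfs2,secretfile=/etc/ceph/client.cephfs2  <mon1>,<mon2>,<mon3>:/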

Reversal: the simplest reversal will be to umount the CephFS, comment it out in /etc/auto.master and "service autofs reload".  Leaving the configuration intact is probably desirable for future use.

AFS

The most comprehensive access to STAR software is generally through AFS, so the AFS client software will be installed and configured.  [Note that there has long been an interest in making the STAR software stack available locally in the online realm instead of via AFS, and some work has been done toward installations within the online CephFS, but it is not yet production-ready in any capacity.]

DONE: installed openafs RPMs (listed below), /etc/krb5.conf, /usr/vice/etc/ThisCell, /usr/vice/etc/cacheinfo (cache size adjusted from the default 100000 to 2000000), enabled and started the openafs-client service.

ln -s /afs/rhic.bnl.gov/x8664_sl6/opt/star /opt/star
ln -s /afs/rhic.bnl.gov/rcassoft/x8664_sl5/cernlib/i686-slc5-gcc43-opt /cern
ln -s /afs/rhic.bnl.gov/rcassoft/x8664_sl5/cernlib/x86_64-slc5-gcc43-desy /cern64
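
For reference, the client cell and cache settings mentioned above amount to the following (the cache directory is the default; only the size was changed):

      # /usr/vice/etc/ThisCell
      rhic.bnl.gov

      # /usr/vice/etc/cacheinfo  (mountpoint:cachedir:size in 1K blocks)
      /afs:/usr/vice/cache:2000000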
 
Reversal: stop and disable the openafs-client service.  rm /cern; rm /cern64; rm /opt/star
Leaving the afs and kerberos configurations intact is likely desirable for future use.

scratch space

Local scratch space is desirable on each host running grid jobs for storing incoming DAQ files and possibly job temporary and output files.  It is good to shoot for ~10GB per job (TBC), so with 15-16 jobs (TBD) running at a time, at least 160GB should be available.  The OLP hosts have 1.8TB filesystems for scratch space mounted on /scratch - current production jobs on the OLP are using /scratch/lbhajdu for instance.  Currently l402 and l403 have /scratch -> /tmp, but /tmp is in the root filesystem, which has less than 40GB free.  /scratch could instead be symlinked to /b/scratch or /data/scratch for instance, where there are larger filesystems, but the filesystems under /b and /data are not entirely free either, so there could be space issues.  Hongwei's initial suggestion is to use /b.  Looking ahead a bit to l40[5678] and maybe l409, there is insufficient local storage on these systems, but there are empty 2.5" drive bays.  S&C can likely provide drives for these machines to serve as scratch space.

DONE: mkdir /b/scratch; chmod 777 /b/scratch; rm /scratch ; ln -s /b/scratch /scratch

Reversal: rm /scratch; ln -s /tmp /scratch; rmdir /b/scratch #assuming /b/scratch is empty

HTCondor configuration

HTCondor on l402 and l403 is 8.4.9-1, which I think is compatible with the HTCondor installations currently in use with the OLP.  So a couple of configuration file swaps in /etc/condor (saving the existing ones for reversal) should be all that is needed to add them to the OLP condor pool.

DONE: (on l402 only) cp -a /etc/condor /etc/condor.HLT.July_28_2017

copied /etc/condor/condor_config and /etc/condor/condor_config.local from onl30.  Removed schedd from daemon list and set NUM_SLOTS to 1 for initial testing
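
In condor_config.local the changes amount to something like the following excerpt (a sketch; the remainder of the file is as copied from onl30):

      # /etc/condor/condor_config.local (excerpt, sketch)
      # no SCHEDD - these nodes only execute jobs, local submission is not needed
      DAEMON_LIST = MASTER, STARTD
      # single slot for initial testing; to be raised to 15 when satisfied
      NUM_SLOTS = 1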

TODO: test; when satisfied, increase NUM_SLOTS to 15

Reversal:  restore the original HLT HTCondor config files, restart the condor service

There is perhaps a better approach here.  The above assumes the HLT node becomes essentially dedicated to the OLP grid jobs.  It would be preferable from the HLT/tracking side to continue to be able to run their non-grid jobs (and even have priority over the grid jobs?) in the manner they have become accustomed to (after all, that's what they will continue to use when the HLT cluster returns to HLT/online tasks during the run).  It is possible that the existing HLT condor pool can be used by the grid jobs through Condor's flocking mechanism.  I need to look into this.

TODO:  investigate HTCondor configuration to allow flocking from stargrid03 to the HLT condor negotiator while limiting the HLT machines which will accept grid jobs.
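
If flocking turns out to be the right approach, the configuration would look roughly like the sketch below (the knob names are standard HTCondor; the host names and the restriction expression are assumptions to be verified):

      # On stargrid03 (the submitting schedd): add the HLT central manager to the flock list
      FLOCK_TO = $(FLOCK_TO), <hlt-central-manager>

      # On the HLT central manager: accept flocked jobs from stargrid03
      FLOCK_FROM = $(FLOCK_FROM), stargrid03

      # On HLT worker nodes that should NOT accept grid jobs: refuse the grid user(s)
      START = $(START) && (Owner != "lbhajdu")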

GCC change

Several RPMs for gcc-4.8.2 will be installed, using /opt/gcc/4.8.2/.  The "stock" Scientific Linux gcc (4.4.7) will still be present, but its executable will be "gcc44", and similarly for the other gcc components.  A script will be provided that renames the original gcc components and replaces them with symlinks to version 4.8.2 components (eg. /usr/bin/gcc will be a symlink to /opt/gcc/4.8.2/bin/gcc).  A reversal script will also be provided to remove the symlinks and rename the gcc44 components back to their original names.  A note about this - kernel modules generally must be built with a gcc version similar to what the kernel was built with.  If no STAR jobs are running, this can be accomplished by simply wrapping the build process with the two scripts - restore gcc 4.4 first, build the module, and then re-run the original conversion script to restore gcc-4.8.2 as the default.  I do not know if this will interfere with HLT and tracking uses relying on STAR software installed within the HLT realm (ie, non-AFS) if the local software stack does not include branch(es) for use with gcc-4.8.2.  This seems safe enough for l402 and l403 initially, but could be a wrinkle to work out, especially on other machines used for multiple purposes.
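
The actual switching scripts are not reproduced here, but conceptually each one does something like the following for every component (gcc, g++, gfortran, cpp, ...):

      # comp_switch.sh, per component (sketch): park the stock binary and point the name at 4.8.2
      mv /usr/bin/gcc /usr/bin/gcc44
      ln -s /opt/gcc/4.8.2/bin/gcc /usr/bin/gcc

      # gcc_switch_back.sh, per component (sketch): remove the symlink and restore the stock binary
      rm /usr/bin/gcc
      mv /usr/bin/gcc44 /usr/bin/gcc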

There is a complication here.  /opt/gcc already exists as a link to /net/l409/software/gcc.  This prevents the rcas gcc packages from installing.  So I have installed the rcas gcc 4.8.2 packages on l409 (where /opt/gcc -> /software/gcc).

DONE: copied in /root/bin/comp_switch.sh and /root/bin/gcc_switch_back.sh

Installed the rcas gcc 4.8.2 packages on l409, executed /root/bin/comp_switch.sh
edited /etc/init.d/dkms_autoinstaller to wrap the dkms action with gcc_switch_back.sh and comp_switch.sh

Reversal: execute /root/bin/gcc_switch_back.sh, remove the two switching calls in /etc/init.d/dkms_autoinstaller, remove the rcas packages on xeon-phi-dev/l409 if desired.

Iptables rules:

It seems this isn't necessary, as iptables is wide open on the HLT machines.

Miscellaneous Changes:

DONE: SSH keys for mpoat and wbetts added to root
DONE: eth1's MAC address is registered with ITD
DONE: mkdir /root/bin

Miscellaneous Notes:

The initial configuration of eth1 on l402 and l403 had a bizarre result in both cases.  The interface was configured properly and packets were going both in and out as expected (seen in tcpdump watching only the eth1 interface), but testing with ping and ssh in and out within starp failed.  It was as if the communication between the transport layer and the application layer within the kernel was broken (though networking was fine with the local HLT network).  This seems VERY strange to me.  ifdown/ifup eth1 and even restarting the network service did not fix it.  In desperation, I eventually tried rebooting, and that fixed it, though I have no idea why.

I changed /ceph/osg/osg-wn-client/setup.csh to use OSG_LOCATION=/ceph/osg/osg-wn-client (instead of the previous /mnt/ceph/osg/osg-wn-client).  That was not a change on the HLT nodes, but was necessary because of the difference in mount points for ceph between the HLT nodes (using /mnt/auto/ceph) and the OLP nodes (using /mnt/ceph).  Since /ceph is a symlink to the actual mountpoint in both cases, using /ceph as the base path should be fine.
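
The edit amounts to a one-line change along these lines (an excerpt sketch of setup.csh):

      # /ceph/osg/osg-wn-client/setup.csh (excerpt)
      setenv OSG_LOCATION /ceph/osg/osg-wn-client    # was /mnt/ceph/osg/osg-wn-client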

TODO (?): Is Ordo necessary?  Sending logs to ITD?

RPM changes

added (all x86_64 unless otherwise noted):

elrepo-release [/etc/yum.repos.d/elrepo.repo - modified to disable the main elrepo, but enable the elrepo-kernel]
kernel-ml
kernel-ml-devel
ceph-release
ceph

libcephfs1
ceph-common
boost-system
boost-thread
boost-program-options
gdisk
gperftools-libs
leveldb
libbabeltrace
librados2
librbd1
libunwind
lttng-ust
python-babel.noarch
python-backports
python-backports-ssl_match_hostname.noarch
python-cephfs
python-chardet.noarch
python-flask.noarch
python-jinja2-26.noarch
python-rados
python-rbd
python-requests.noarch
python-urllib3.noarch
python-werkzeug.noarch
userspace-rcu

dkms-openafs

dkms
kmod-openafs
openafs
openafs-client
openafs-kernel-source
openafs-krb5

Installed on xeon-phi-dev/l409:

rcassoft-gcc
rcassoft-gcc-c++
rcassoft-gcc-gfortran
rcassoft-libquadmath-devel
rcassoft-libquadmath-devel.i686
rcassoft-libstdc++-devel
rcassoft-libstdc++-devel.i686
rcassoft-libstdc++-docs

removed:

none