Batch system, resource management system

Under:

SUMS (aka, STAR Scheduler)


SUMS, the product of the STAR Scheduler project, stands for Star Unified Meta-Scheduler. This tool is currently documented on its own pages. SUMS provides a uniform user interface to submitting jobs on "a" farm that is, regardless of the batch system used, the language it provides (in XML) is identical. The scheduling is controlled by policies handling all the details on fitting your jobs in the proper queue, requesting proper resource allocation and so on. In other words, it isolates users from the infrastructure details.

You would benefit from starting with the following documents:

LSF

LSF was dropped from BNL facility support in July 2008 due to licensing cost. Please, refer to the historical revision for information about it. If a link brought you here, please update or send a note to the page owner.

Condor

Quick start ...

Condor Pools at BNL

The condor pools are segmented into four pools:

  • production
  • users
  • OpenScienceGrid (OSG)
  • and general.

The four pools (like queues) span all STAR machines (CAS and CRS). On all STAR machines priority goes as follows

  • production = 2
  • users = 1
  • OSG = 1
  • general = 0

A few rules apply

  • Production jobs cannot be evicted on CRS machines ...
  • Users and OSG jobs can be evicted
    • Eviction happens after those have 3 hours of runtime from the time they start.
    • For example, if a production job wants a node being used by a user job that has been running for two hours then that user job has one hour left before it gets kicked out ...
  • This time limit comes into effect when a higher priority job wants the slot (i.e. production vs. user or production versus OSG)
  • general queue jobs are evicted IMMEDIATELY when the slot is wanted by ANY STAR job (production, user, or OSG).

This provides the general structure of the Condor policy in place for STAR. The other policy options in place goes as follows:

  1. The following options apply to all machines but only come into play on machines that also execute LSF jobs or have interactive jobs allowing a rise of the load
    1. production jobs cannot start unless the non-Condor 1min load <= 1.4.
    1. Non-production jobs cannot start unless the non-Condor 5min load <= 1.4
  2. On the CAS machines, general jobs are evicted when the non-Condor 1min load <= 1.4. (this is so that LSF jobs can evict general queue jobs).
  3. General queue jobs will not start on any node unless 1min < 1.4, swap > 200M, memory > 100M.
  4. CAS nodes only run one Condor job at a time (for LSF reasons)
  5. CRS nodes can run three production jobs at a time, two jobs otherwise.
  6. User fairshare is in place. The time limit is the exact same as described above (3 hours from t0)
  7. Not really something I control but I should mention that production normally does not run on the rcas machines (it keeps them blocked)

What you need to know about Condor

First of all, we recommend you use SUMS to submit to Condor as we would take care of adding codes, tricks, tweaks to make sure your jobs run smoothly. But if you really don't want to, here are a few issues you may encounter:

  • Unless you use the GetEnv=true  datacard directive in your condor job description, Condor jobs will start with a blank set of environment variables unlike a shell startup. Especially, none of
    SHELL, HOME, LOGNAME, PATH, TERM and MAIL
    will be define. The absence of $HOME will have for side effect that, whenever a job starts, your .cshrc and .login will not be seen hence, your STAR environment will not be loaded. You must take this into account and execute the STAR login by hand (within your job file).
    Note that using GetEnv=true has its own sde effects which includes a full copy of the environment variables as defined from the submitter node. This will NOT be suitable for distributed computing jobs. The use of getenv() C primitive in your code is especially questionable (it will unlikely return a valid value) and
    • STAR user may look at this post for more information on how to use ROOT function calls for defining some of the above.
    • You may also use getent shell command (if exists) to get the value of your home directory
    • A combinations of getpwuid(), getpwnam() would allow to define $USER and $HOME
       
  • Condor follows a multi-submitter node model with no centralized repository for all jobs. As a consequence, whenever you use a command such as condor_rm, you would kill the jobs you have submitted from that node only. To kill jobs submitted from other submitter nodes (any interactive node at BNL is a potential submitter node), you need to loop over the possibilities and use the -name command line option.
     
  • Condor will keep your jobs indefinitely in the Pool unless you either remove the jobs or specify a condition allowing for jobs to be automatically removed upon status and expiration time. A few examples below could be used for the PeriodicRemove Condor datacard
    • To automatically remove jobs which have been in the queue for more than 2 days but marked as status 5 (held for a reason or another and not moving) use
      (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
    • To automatically remove jobsruning the the queue for more than 2 days but using less than 10% of the CPU (probably looping or inefficient jobs blocking a job slot), use
      (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && 
                          ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10))
    The full current condition SUMS add to each job is
    PeriodicRemove  = (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && 
                       ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10)) || 
                      (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)

Some condor commands

This is not meant to be an exhaustive set of commands nor a tutorial. You are invited to read to the manpages for condor_submit, condor_rm, condor_q, condor_status. Those will be most of what you will need to use on a daily basis. Help for version 6.9 is available online.

  • Query and information
    • condor_q -submitter $USER
      List jobs of specific submitter $USER from all the queues in the pool
    • condor_q -analyze $JOBID
      Perform an approximate analysis to determine how many resources are available to run the requested jobs.
    • condor_status -submitters
      shows the numbers of running/idle/held jobs for each user on all machines
    • condor_status -claimed
      Summarize jobs by servers as claimed
    • condor_status -avail
      Summarize resources which are available
       
  • Removing jobs, controlling them
    • condor_rm $USER
      removes all of your jobs submitted from this machine
    • condor_rm -name $node $USER
      removed all jobs for $USER submitted from machine $node
    • condor_rm -forcex $JOBID
      Forces the immediate local removal of jobs in undefined state (only affects jobs already being removed). This is needed if condor_q -submitter shows your job but condor_q -analyze $JOBID does not (indicating an out of sync information at Condor level).
    • condor_release $USER
      releases all of your held jobs back into the pending pool for $USER
    • condor_vacate -fast
      may be used to remove all obs from the submitter node job queue. This is a fast mode command (no checks).
       
  • More advanced
    • condor_status -constraint 'RemoteUser == "$USER@bnl.gov"'
      lists the machines on which your jobs are currently running
    • condor_q -submitter username -format "%d" ClusterId -format "  %d" JobStatus -format "   %s\n" Cmd
      shows the job id,  status, and command for all of your jobs.  1==Idle, 2==Running for Status.  I use something like this because the default output of condor_q truncates the command at 80 characters and prevents you from seeing the actually scheduler job ID associated with the Condor job.  I'll work on improving this command, but this is what I've got for now.
    • To access the reason for job 26875.0 to be held from a submitter node advertized to be rcas6007, use the following command to have a human readable format
      condor_q -pool condor02.rcf.bnl.gov:9664 -name rcas6007 -format "%s\n" HoldReason 26875.0