Batch system, resource management system

SUMS (aka, STAR Scheduler)


SUMS, the product of the STAR Scheduler project, stands for STAR Unified Meta-Scheduler. This tool is currently documented on its own pages. SUMS provides a uniform user interface for submitting jobs on "a" farm: regardless of the batch system used, the job description language it provides (XML based) is identical. Scheduling is controlled by policies handling all the details of fitting your jobs into the proper queue, requesting the proper resource allocation, and so on. In other words, it isolates users from the infrastructure details.

You would benefit from starting with the following documents:

LSF

LSF was dropped from BNL facility support in July 2008 due to licensing costs. Please refer to the historical revision of this page for information about it. If a link brought you here, please update it or send a note to the page owner. The LSF information has been kept unpublished.

Condor

Quick start ...

Condor Pools at BNL

The Condor resources are segmented into four pools, summarized below (extracted from this RACF page):

  • Production jobs (+Experiment = "star", +Job_Type = "crs"): high priority CRS jobs, no time limit; may use all the job slots on CRS nodes and up to 1/2 of the available job slots per system on CAS nodes. The CRS portion is not available to normal users, and using this Job_Type as a regular user will fail.
  • User normal jobs (+Experiment = "star", +Job_Type = "cas"): short jobs, 3 to 5 hour soft limit (when resources are requested by others), 40 hour hard limit; this has higher priority than the "long" Job_Type.
  • User long jobs (+Experiment = "star", +Job_Type = "long"): long running jobs, 5 day soft limit (when resources are requested by others), 10 day hard limit; may use 1 job slot per system on a subset of machines.
  • General queue (+Experiment = "general" or +Experiment = "star"): general queue shared by multiple experiments, 2 hours of guaranteed time minimum (jobs can be evicted afterward by any experiment's specific jobs claiming the slot).

 

The Condor configuration does not create a simple notion of queues but rather a notion of pools. Pools are groups of resources spanning all STAR machines (RCAS and RCRS nodes) and even other experiments' nodes. The list above tends to suggest four such pools, although we will see below that life is more complicated than that.

First, it is important to understand that the +Experiment attribute is used only for accounting purposes; what actually distinguishes a user job from a production job or a general job are the other attributes.

How your jobs will run is selected by the +Job_Type attribute. When it is unspecified, the general queue (spanning all RHIC machines at the facility) is assumed, but your job may not get the same time limits; we will discuss the restrictions later. The descriptions in the list above give the CPU time limits and additional constraints, such as the number of slots one may claim within a given category. Note that +Job_Type = "crs" is reserved and its access is enforced by Condor (only starreco may use this type).

In addition to using +Job_Type, which as we have seen controls what comes closest to a queue in Condor, one may need to restrict one's jobs to a subset of machines by using the CPU_type attribute in the Requirements expression (if you are not completely lost by now, you are doing well ;-) ).  An example to illustrate this:

+Experiment = "star"
+Job_type = "cas"
Requirements = (CPU_type != "crs") && (CPU_Experiment == "star")

In this example, a cas job (interpret this as "a normal user analysis job") is being run on behalf of the STAR experiment. The CPUs/nodes requested are those belonging to the STAR experiment, and they must not be RCRS nodes. By specifying those two requirements, the user makes sure that the jobs will run on RCAS nodes only (CPU_type != "crs") AND, regardless of a possible switch to +Experiment = "general", that the jobs will still run only on nodes belonging to STAR.

In this second example

+Experiment = "star"
+Job_type = "cas"
Requirements = (CPU_Experiment == "star")

we have pretty much the same request as before, but the jobs may also run on RCRS nodes. However, if data production is running (+Job_type = "crs", which only starreco may start), the user's jobs will likely be evicted (production jobs have higher priority). A user who does not want to take that risk should specify the Requirements expression from the first example.

Pool rules

A few rules apply, summarized below:

  • Production jobs cannot be evicted from their claimed slots. Since they have higher priority than user jobs even on CAS nodes, this means that as soon as production starts, its pool of slots will slowly but surely be taken; user jobs may use those slots during lulls in production utilization.
  • User jobs can be evicted. Eviction happens after 3 hours of runtime, counted from the time they start, but only if the slot they are running in is claimed by another job. For example, if a production job wants a node being used by a user job that has been running for two hours, that user job has one hour left before it gets kicked out.
  • This time limit comes into effect only when a higher priority job wants the slot (e.g. a production job preempting a user job).
  • General queue jobs are evicted after their two hours of guaranteed time when the slot is wanted by ANY STAR job (production or user).
  • General queue jobs will also be evicted if they consume more than 750 MB of memory.

This provides the general structure of the Condor policy in place for STAR. The other policy options in place are as follows:

  1. The following applies to all machines: the 1 minute load average has to be less than 1.4 on a two-CPU node for a job to start.
  2. General queue jobs will not start on any node unless the 1 minute load average is < 1.4, swap > 200 MB, and memory > 100 MB (see the condor_status example after this list).
  3. User fairshare is in place.
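
As a quick sanity check of these criteria, you can ask Condor which slots currently appear able to accept a job. This is only a sketch: LoadAvg and Memory are standard Condor machine ClassAd attributes, but the exact expressions and thresholds used in the STAR START policy (including the swap requirement) may differ from what is shown here.

# approximate check only; the real START expression may use different attributes or thresholds
condor_status -constraint '(LoadAvg < 1.4) && (Memory > 100)'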

In the land of confusion ...

Users are often confused about the meaning of their job priority. Condor will consider a user's job priority and start that user's jobs in priority order (the larger the number, the more likely the job will start first), but those priorities have NO meaning across two distinct users. In other words, it is not because user A sets job priorities an order of magnitude larger than user B's that user A's jobs will start first. Job priority only provides a mechanism for a user to specify which of their own idle jobs in the queue are most important. Jobs with higher numerical priority should run before those with lower priority, although because jobs can be submitted from multiple machines, this is not always the case. Job priorities are listed by the condor_q command in the PRIO column.

The effective user priority, on the other hand, is dynamic and changes as a user is given access to resources over a period of time. A lower numerical effective user priority (EUP) indicates a higher priority. Condor's fairshare mechanism is implemented via the EUP, so the condor_userprio command provides an indication of your fairshare standing.

You should be able to use condor_qedit to manually modify the job priority attribute, if desired. If a job does not run for weeks, there is likely a problem with its submit file or one of its inputs, in particular its Requirements line. You can use condor_q -analyze JOBID or condor_q -better-analyze JOBID to determine why it cannot be scheduled.
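
For example, a possible sequence for inspecting and then bumping one of your own jobs; the job ID below is a placeholder, and the ClassAd attribute name is assumed to be JobPrio on the installed Condor version:

condor_q -better-analyze 1234.0    # explain why this (placeholder) job has not matched a slot
condor_qedit 1234.0 JobPrio 10     # assumed attribute name; raises its priority among your own jobs only
condor_q -submitter $USER          # the priority column should now reflect the change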

 

What you need to know about Condor

First of all, we recommend that you use SUMS to submit to Condor, as it takes care of adding the code, tricks and tweaks needed to make sure your jobs run smoothly. But if you really don't want to, here are a few issues you may encounter:

  • Unless you use the GetEnv = true datacard directive in your Condor job description, Condor jobs will start with a blank set of environment variables, unlike a shell startup. In particular, none of
    SHELL, HOME, LOGNAME, PATH, TERM and MAIL
    will be defined. A side effect of the absence of $HOME is that, whenever a job starts, your .cshrc and .login will not be sourced and hence your STAR environment will not be loaded. You must take this into account and execute the STAR login by hand (within your job file); a minimal submit description illustrating GetEnv is sketched after this list.
    Note that using GetEnv = true has its own side effects, which include a full copy of the environment variables as defined on the submitter node; this will NOT be suitable for distributed computing jobs. The use of the getenv() C primitive in your code is especially questionable (it is unlikely to return a valid value). A few workarounds:
    • STAR users may look at this post for more information on how to use ROOT function calls to define some of the above.
    • You may also use the getent shell command (if it exists) to get the value of your home directory.
    • A combination of getpwuid() and getpwnam() would allow you to define $USER and $HOME.
       
  • Condor follows a multi-submitter-node model with no centralized repository of all jobs. As a consequence, a command such as condor_rm will kill only the jobs you submitted from that particular node. To kill jobs submitted from other submitter nodes (any interactive node at BNL is a potential submitter node), you need to loop over the possibilities and use the -name command line option, as sketched after this list.
     
  • Condor will keep your jobs in the pool indefinitely unless you either remove them or specify a condition allowing jobs to be automatically removed based on their status and an expiration time. The few examples below could be used for the PeriodicRemove Condor datacard.
    • To automatically remove jobs which have been in the queue for more than 2 days but are marked as status 5 (held for one reason or another and not moving), use
      (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
    • To automatically remove jobs that have been running for more than 15 hours (54000 seconds in the expression below) while using less than 10% of the CPU (probably looping or inefficient jobs blocking a job slot), use
      (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && 
                          ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10))
    The full current condition SUMS adds to each job is
    PeriodicRemove  = (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && 
                       ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10)) || 
                      (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
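
As referenced in the environment item above, here is a minimal Condor submit description showing the GetEnv directive in context. It is only a sketch under assumptions: the vanilla universe, the wrapper script name and the output/log file names are illustrative, not STAR-mandated values.

# minimal sketch; the universe choice and the file names are illustrative assumptions
universe    = vanilla
# hypothetical wrapper script which executes the STAR login by hand before the payload
executable  = myjob.csh
# copy the submitter node's environment into the job (see the caveats above)
getenv      = true
output      = myjob.out
error       = myjob.err
log         = myjob.log
queue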
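
And for the multi-submitter-node item, a hypothetical tcsh loop removing your jobs from several submit nodes; the node names are placeholders for whichever nodes you actually submitted from:

# node names below are placeholders
foreach node ( rcas6007 rcas6008 rcas6009 )
    condor_rm -name $node $USER
end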

Some condor commands

This is not meant to be an exhaustive set of commands nor a tutorial. You are invited to read the man pages for condor_submit, condor_rm, condor_q and condor_status; those cover most of what you will need on a daily basis. Help for version 6.9 is available online.

  • Query and information
    • condor_q -submitter $USER
      Lists the jobs of the specific submitter $USER from all the queues in the pool
    • condor_q -submitter $USER -format "%d\n" ClusterId
      Shows the job ID for all jobs of $USER. This command may succeed even when an unconstrained condor_q would fail because of a very large number of jobs
    • condor_q -analyze $JOBID
      Perform an approximate analysis to determine how many resources are available to run the requested jobs.
    • condor_status -submitters
      shows the number of running/idle/held jobs for each submitter on all machines
    • condor_status -claimed
      Summarizes the resources (slots) currently claimed, grouped by server
    • condor_status -avail
      Summarize resources which are available
       
  • Removing jobs, controlling them
    • condor_rm $USER
      removes all of your jobs submitted from this machine
    • condor_rm -name $node $USER
      removes all jobs for $USER submitted from machine $node
    • condor_rm -forcex $JOBID
      Forces the immediate local removal of jobs in an undefined state (only affects jobs already being removed). This is needed if condor_q -submitter shows your job but condor_q -analyze $JOBID does not (indicating out-of-sync information at the Condor level).
    • condor_release $USER
      releases all of your held jobs back into the pending pool for $USER
    • condor_vacate -fast
      may be used to evict the jobs currently running on the node where the command is issued. This is a fast-mode command (no graceful checks) and applies to running jobs only (not pending ones)
       
  • More advanced
    • condor_status -constraint 'RemoteUser == "$USER@bnl.gov"'
      lists the machines on which your jobs are currently running (substitute your account name for $USER; the single quotes prevent the shell from expanding the variable)
    • condor_q -submitter username -format "%d" ClusterId -format "  %d" JobStatus -format "   %s\n" Cmd
      shows the job ID, status, and command for all of your jobs (for the status, 1 == Idle and 2 == Running). Something like this is useful because the default output of condor_q truncates the command at 80 characters, which prevents you from seeing the actual scheduler job ID associated with the Condor job.
    • To see the reason why job 26875.0 is held (the job having been submitted from the submitter node advertised as rcas6007), use the following command to get a human readable message:
      condor_q -pool condor02.rcf.bnl.gov:9664 -name rcas6007 -format "%s\n" HoldReason 26875.0