SUMS, the product of the STAR Scheduler project, stands for Star Unified Meta-Scheduler. This tool is currently documented on its own pages. SUMS provides a uniform user interface for submitting jobs on "a" farm: regardless of the batch system used underneath, the job description language it provides (in XML) is identical. The scheduling is controlled by policies that handle all the details of fitting your jobs into the proper queue, requesting the proper resource allocations and so on. In other words, it isolates users from the infrastructure details.
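To give a feel for that XML interface, here is a minimal sketch of a SUMS job description. The macro name, output path and catalog query are hypothetical placeholders; consult the SUMS documentation for the authoritative syntax and the full set of options.
<?xml version="1.0" encoding="utf-8" ?>
<job maxFilesPerProcess="100">
   <!-- command run by each process; $FILELIST expands to the list of input files assigned to it -->
   <command>root4star -q -b myAnalysis.C\(\"$FILELIST\"\)</command>
   <!-- where the standard output of each process goes ($JOBID identifies the process) -->
   <stdout URL="file:/star/u/username/scheduler/$JOBID.out" />
   <!-- input data set resolved through the file catalog (the query shown is a placeholder) -->
   <input URL="catalog:star.bnl.gov?production=P08ic,filetype=daq_reco_MuDst,storage=local" nFiles="all" />
</job>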
You would benefit from starting with the following documents:
LSF was dropped from BNL facility support in July 2008 due to licensing costs. Please refer to the historical revision for information about it. If a link brought you here, please update it or send a note to the page owner. The information has been kept unpublished.
The Condor resources are segmented into four pools, as extracted from this RACF page:
pool | +Experiment | +Job_Type | limits and remarks
production jobs | +Experiment = "star" | +Job_Type = "crs" | high priority CRS jobs, no time limit, may use all the slots on CRS nodes and up to 1/2 of the available job slots per system on CAS; the CRS portion is not available to normal users and using this Job_Type as a normal user will fail
user normal jobs | +Experiment = "star" | +Job_Type = "cas" | short jobs, 3 to 5 hour soft limit (when resources are requested by others), 40 hour hard limit; this has higher priority than the "long" Job_Type
user long jobs | +Experiment = "star" | +Job_Type = "long" | long running jobs, 5 day soft limit (when resources are requested by others), 10 day hard limit, may use 1 job slot per system on a subset of machines
general queue | +Experiment = "general" or +Experiment = "star" | (unspecified) | general queue shared by multiple experiments, 2 hours guaranteed time minimum (jobs can be evicted afterward by any experiment's specific jobs claiming the slot)
The Condor configuration does not create a simple notion of queues but rather a notion of pools. Pools are groups of resources spanning all STAR machines (RCAS and RCRS nodes) and even other experiments' nodes. The first column tends to suggest four such pools, although we will see below that life is more complicated than that.
First, it is important to understand that the +Experiment attribute is only used for accounting purposes; what makes the difference between a user job, a production job or a general job is really the other attributes.
Selection of how your jobs will run is the role of the +Job_Type attribute. When it is unspecified, the general queue (spanning all RHIC machines at the facility) is assumed, but your job may not get the same time limit. We will discuss the restrictions later. The 4th column of the table above shows the CPU time limits and additional constraints, such as the number of slots within a given category one may claim. Note that +Job_Type = "crs" is reserved and its access is enforced by Condor (only starreco may access this type).
In addition to using +Job_Type, which as we have seen controls what comes as close as possible to a queue in Condor, one may need to restrict one's jobs to run on a subset of machines by using the CPU_Type and CPU_Experiment attributes in the Requirements expression (if you are not completely lost by now, you are good ;-0 ). An example to illustrate this:
+Experiment = "star" +Job_type = "cas" Requirements = (CPU_type != "crs") && (CPU_Experiment == "star")
In this example, a cas job (interpret this as "a normal user analysis job") is being run on behalf of the experiment star. The CPUs / nodes requested are those belonging to the STAR experiment, and the nodes must not be RCRS nodes. By specifying those two requirements, the user is trying to make sure that the jobs will be running on RCAS nodes only (CPU_type != "crs") AND, regardless of a possible switch to +Experiment = "general", the jobs will still be running only on nodes belonging to STAR.
In this second example
+Experiment = "star" +Job_type = "cas" Requirements = (CPU_Experiment == "star")
we have pretty much the same request as before, but the jobs may also run on RCRS nodes. However, if data production runs (+Job_type = "crs", which only starreco may start), the user's jobs will likely be evicted (as production jobs have higher priority); a user who does not want to take that risk should specify the first Requirements expression.
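For completeness, here is a minimal sketch of how such attributes might be embedded in a full condor_submit description file. The executable name, output/log paths and the choice of the vanilla universe are hypothetical placeholders; SUMS generates the equivalent file for you when you let it do the submission.
# minimal condor_submit description file (script name and paths are placeholders)
Universe     = vanilla
Executable   = run_analysis.csh
Arguments    = $(Process)
Output       = /star/u/username/condor/job_$(Process).out
Error        = /star/u/username/condor/job_$(Process).err
Log          = /star/u/username/condor/job.log
GetEnv       = false
# STAR-specific attributes and requirements discussed above
+Experiment  = "star"
+Job_type    = "cas"
Requirements = (CPU_type != "crs") && (CPU_Experiment == "star")
Queue 1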
A few rules apply, as summarized below:
This provides the general structure of the Condor policy in place for STAR. The other policy options in place go as follows:
In the land of confusion ...
Also, users are often confused about the meaning of their job priority. Condor will consider a user's job priority and start jobs in priority order (where the larger the number, the more likely the job will start first), but those priorities have NO meaning across two distinct users. In other words, it is not because user A sets job priorities larger by an order of magnitude compared to user B that their jobs will start first. Job priority only provides a mechanism for a user to specify which of their own idle jobs in the queue are most important. Jobs with higher numerical priority should run before those with lower priority, although because jobs can be submitted from multiple machines, this is not always the case. Job priorities are listed by the condor_q command in the PRIO column.
The effective user priority, on the other hand, is dynamic and changes as a user is given access to resources over a period of time. A lower numerical effective user priority (EUP) indicates a higher priority. Condor's fair-share mechanism is implemented via EUP. The condor_userprio command hence provides an indication of your fair-share standing.
You should be able to use condor_qedit to manually modify the "Priority" parameter, if desired. If a job does not run for weeks, there is likely a problem with its submit file or one of its inputs, and in particular its Requirements line. You can use condor_q -analyze JOBID, or condor_q -better-analyze JOBID, to determine why it cannot be scheduled.
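As an illustration only (the job ID 123456.0 is a made-up placeholder, and JobPrio is the job ClassAd attribute that corresponds to the PRIO column), the commands might look like:
condor_q -better-analyze 123456.0     # explain why this job is not being matched to a slot
condor_qedit 123456.0 JobPrio 10      # raise this job's priority relative to your other idle jobs
condor_userprio                       # check effective user priorities (fair share across users)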
First of all, we recommend you use SUMS to submit to Condor, as it takes care of adding the code, tricks and tweaks needed to make sure your jobs run smoothly. But if you really don't want to, here are a few issues you may encounter:
The environment variables SHELL, HOME, LOGNAME, PATH, TERM and MAIL will not be defined. The absence of $HOME has the side effect that, when a job starts, your .cshrc and .login will not be sourced and hence your STAR environment will not be loaded. You must take this into account and execute the STAR login by hand (within your job file).
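A minimal sketch of such a job file is shown below. The group login script path, library version and macro name are placeholders; substitute whatever your site documentation prescribes for loading the STAR environment.
#!/bin/csh
# Job wrapper: load the STAR environment by hand, since .cshrc/.login are not sourced.
# The group login script path below is a placeholder -- check your site documentation.
source /path/to/star/group/star_login.csh
# Optionally select a library version (the version name is a placeholder)
starver SL08c
# Run the actual payload (the macro name is a placeholder)
root4star -q -b myAnalysis.C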
For example, the following expression matches jobs that have been sitting in the held state (JobStatus == 5) for more than two days:
(((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
while this one matches running jobs (JobStatus == 2) that have been running for more than 15 hours (54000 seconds) while consuming less than 10% CPU on average, a typical sign of a hung job:
(JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10))
Both can be combined into a PeriodicRemove expression so that such jobs are removed automatically:
PeriodicRemove = (JobStatus == 2 && (CurrentTime - JobCurrentStartDate > (54000)) && ((RemoteUserCpu+RemoteSysCpu)/(CurrentTime-JobCurrentStartDate) < 0.10)) || (((CurrentTime - EnteredCurrentStatus) > (2*24*3600)) && JobStatus == 5)
This is not meant to be an exhaustive set of commands nor a tutorial. You are invited to read the man pages for condor_submit, condor_rm, condor_q and condor_status; those will cover most of what you will need on a daily basis. Help for version 6.9 is available online.