Year 10, W52 production status

 

Farm Load CRS

Avg Load (15, 5, 1 min): 93%, 92%, 92%, with the following distribution:
- 94.91% of the cluster is well utilized, consistent with expectations.

The load profile is shown below.

All is well load-wise for production.

There have been reports of slow IO on disk, with analysis not going well (a 50% slowdown is seen, due to slow access to central storage). This is also visible in the CAS load profile and pattern, where a maximum load of 2k versus the usual 3k would be consistent with a 50% slowdown.
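
As a quick cross-check of that reading (a minimal sketch in Python; the 3k typical peak comes from the text above, and treating the missing load as pure IO wait is an assumption):

    # If jobs stall waiting on storage, the visible farm load scales like
    # cpu_time / (cpu_time + io_wait): CPUs sit idle while jobs wait.
    typical_peak = 3000.0   # usual maximum load quoted above
    observed_peak = 2000.0  # maximum load seen this week

    utilization = observed_peak / typical_peak  # fraction of wall time on CPU
    slowdown = 1.0 / utilization - 1.0          # extra wall time per unit of CPU time
    print(f"utilization = {utilization:.2f}, implied slowdown = {slowdown:.0%}")
    # -> utilization = 0.67, implied slowdown = 50%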

 

Job efficiency

The first plot is the stream ratio for the past week, that is, the proportion of each stream in production.

Apart from the past few days, the mix was very even.

The second plot is the RealTime/CPUTime ratio - a value of 1 is best, and anything diverging from it indicates inefficiency.

Production status and problems

Production overview

Totals so far:

  • Number of events in MuDST: 183,372,603 (25% more events than two weeks ago)
  • Total size of MuDST: 94.7 TB
  • Total number of MuDST files: 107,223
  • Average file size: 0.90 GB
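
As a quick consistency check on the totals above (a sketch; the only assumption is the binary convention 1 TB = 1024 GB):

    total_tb = 94.7     # total MuDST size from the list above
    n_files = 107_223   # total number of MuDST files

    avg_gb = total_tb * 1024 / n_files
    print(f"average file size = {avg_gb:.2f} GB")  # -> 0.90 GB, matching the value above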

Breakdown of the number of events by trigger

Cumulative progress

As seen above, a net change occurred two weeks ago, producing a camel shape in the progress curve. The first slowdown can be understood as a temporary loss of CPU due to the farm repair (see this announcement). But the slowdown persisted, requiring a different explanation.

Summary / incremental status

From the slopes, we can infer a slowdown by a factor of 2-3. The CPU times have also been reported as being twice as slow. This can be seen in the average processing numbers below:

          AuAu200_production (adc)  [ht] 2.45189 total MB/evt 0.66783 MB/evt Real=50.74 sec/evt CPU=42.65 sec/evt IO=0.05 (0.06) MB/sec
              AuAu200_production [gamma] 2.51921 total MB/evt 0.66450 MB/evt Real=53.64 sec/evt CPU=46.45 sec/evt IO=0.05 (0.05) MB/sec
                AuAu200_production [upc] 0.44832 total MB/evt 0.10611 MB/evt Real=15.64 sec/evt CPU=9.49 sec/evt IO=0.03 (0.05) MB/sec
       AuAu200_production (adc)  [gamma] 2.58274 total MB/evt 0.66855 MB/evt Real=65.64 sec/evt CPU=45.62 sec/evt IO=0.04 (0.06) MB/sec
         AuAu200_production (adc)  [upc] 0.48205 total MB/evt 0.11321 MB/evt Real=18.44 sec/evt CPU=8.33 sec/evt IO=0.03 (0.06) MB/sec
                 AuAu200_production [ht] 2.37594 total MB/evt 0.64161 MB/evt Real=53.38 sec/evt CPU=46.69 sec/evt IO=0.04 (0.05) MB/sec
               AuAu200_production (adc)  2.07553 total MB/evt 0.55890 MB/evt Real=44.09 sec/evt CPU=38.88 sec/evt IO=0.05 (0.05) MB/sec
                      AuAu200_production 2.01229 total MB/evt 0.53893 MB/evt Real=39.70 sec/evt CPU=38.46 sec/evt IO=0.05 (0.05) MB/sec
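
To make the inefficiency explicit, the Real/CPU ratios for a few of the lines above can be computed directly (a minimal sketch; the values are copied from the listing):

    # (production chain, Real sec/evt, CPU sec/evt) from the averages above
    rows = [
        ("AuAu200_production (adc) [ht]",    50.74, 42.65),
        ("AuAu200_production (adc) [gamma]", 65.64, 45.62),
        ("AuAu200_production",               39.70, 38.46),
    ]
    for name, real, cpu in rows:
        print(f"{name:34s} Real/CPU = {real / cpu:.2f}")
    # The (adc) [gamma] jobs run at Real/CPU ~ 1.44, i.e. ~44% extra wall
    # time spent waiting, consistent with the slow-storage picture above.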

Comparing to the projections from two weeks ago, the numbers are much larger (39 sec/evt now versus 26.3 before for st_physics, for example). Assuming we were to restore the initial performance rate, the revised projections would be as follows:

Stream                    Goals         Done         Left   %left  Real (sec/evt)  CPU (sec/evt)  Real/CPU  Time left (days)
st_gamma [+]          3,000,000    1,951,097    1,048,903  34.96%           38.80          29.79      1.30              0.19
st_ht (all)          58,752,491   21,588,450   37,164,041  63.26%           35.69          28.34      1.26              6.34
st_physics          207,000,000  135,762,154   71,237,846  34.41%           26.56          25.95      1.02              9.04
st_pmdftp [+]         5,000,000    5,307,771     -307,771  -6.16%           19.05          17.68      1.08             -0.03
st_upc               37,520,000   17,203,157   20,316,843  54.15%           11.57           6.00      1.93              1.12
Grand totals & avg  311,272,491  181,812,629  129,767,633  41.69%           26.33          21.55      1.22             16.69
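
The "Time left (days)" column can be reproduced with simple arithmetic (a sketch; the ~2,400 concurrent job slots is an assumption inferred from these numbers, not a quoted farm size):

    SLOTS = 2400          # assumed number of concurrent job slots (inferred)
    SEC_PER_DAY = 86400

    def days_left(events_left: int, real_sec_per_evt: float) -> float:
        """Wall-clock days to drain the backlog at the given per-event rate."""
        return events_left * real_sec_per_evt / (SLOTS * SEC_PER_DAY)

    print(f"st_physics: {days_left(71_237_846, 26.56):.2f} days")  # ~9.1 (table: 9.04)
    print(f"st_ht:      {days_left(37_164_041, 35.69):.2f} days")  # ~6.4 (table: 6.34)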

... hence still 17 days to go: in two weeks, we only did 4 days' worth of what we would have done before. However, we had a 3-day farm downtime (hence 4+3=7 days accounted for, while the Run 10, W50 production status projected 21 days left). It remains to explain an offset of 21-7*2=7 days' worth of data.

This can be understood by looking at the problem initially reported as ticket RT # 2058. A code change is causing some jobs to get "stuck". We have had 8k jobs in this category automatically killed by the production system (the system detects inactive jobs and kicks them out of the queue). This is believed to be due to a change in the maximum number of allowed vertices (from 20 to 500). This is under testing and the results should be available by the 30th.
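
For reference, such an inactivity check can work along these lines (a minimal sketch of the general idea only; the production system's actual detection logic and thresholds are not known here):

    import time

    def cpu_ticks(pid: int) -> int:
        """Sum of user + system CPU ticks from /proc/<pid>/stat (Linux);
        assumes the process name contains no spaces."""
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().split()
        return int(fields[13]) + int(fields[14])  # utime + stime

    def is_stuck(pid: int, window: float = 600.0) -> bool:
        """Flag a job as stuck if it accrues no CPU time over `window` seconds."""
        before = cpu_ticks(pid)
        time.sleep(window)
        return cpu_ticks(pid) - before == 0
    # Jobs flagged this way are the candidates the system kicks out of the queue.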

Since we earlier projected an ETA of mid-January to the beginning of February, we are still on target (those projections are snapshots that allow us to recover quickly - the 17 days remaining is on the optimistic side, assuming no problems, only duty factors).

 

We note that we will be done with the st_physics stream soon, and this may create another efficiency condition whereby the "stream effect" becomes a predominant factor in any potential production slowdown. We do, however, have several options at this stage:

  • Our new database servers are operational - database snapshots may be used if necessary
  • We may mix some of the st_physics with the rest of production for greater efficiency and farm use

We do not know of other problems at this stage.

Follow-up

While backing off the vertex change may have an effect on the CPU time, the number of "stuck" jobs did not change. On 2010/12/30 I instructed Lidia to rebuild the library with the specific codes below:

	OK, leave the job on rcrs6101 running (don't kill it) and
rebuild the library with
* a base code from SL10j
* add the following
asps
Simulation/starsim/atmain/agdummy.age - subroutine gcalor micset added;
Simulation/starsim/atgeant/agrot.F - modified to switch off AGROTF_TEST;

StChain
StMaker.cxx - added y2008c, y2009b, y2010b geometry tags;

StEmcADCtoEMaker
StBemcData.cxx, StEmcADCtoEMaker.cxx - patched for BSMDe mapping problem
in P10ih and P10ij productions;
StEmcRawMaker
StBemcRaw.cxx, StBemcRaw.h, StEmcRawMaker.cxx - patched for BSMDe
mapping problem in P10ih and P10ij productions;
StEmcUtil
database/StEmcDecoder.cxx, StEmcDecoder.h - patched for BSMDE mapping
problem in P10ih and P10ij productions;
StEmcSimulatorMaker
StEmcSimulatorMaker.cxx - modified to make the energy assignments of the
newly created strips happen before they are added to the mEmcMcHits
collection; the cross talk leakage is now proportional to the difference
of energies of two neighbor strips;


StGammaMaker
StGammaPythiaEventMaker.cxx - modified to move the StMcEvent check to
the beginning of Make() to avoid crashing when the instance is not present;
StGammaRawMaker.h - added mutator to exclude BEMC towers; set excluded
BEMC towers in StGammaRawMaker;
StGammaRawMaker.cxx - excluded desired BEMC towers; bug fixed in tower
exclusion code;


pams
/geometry/geometry/geometry.g - defined TPCE04r (reduced TPC envelope
radius) and BTOF67 (btof sensitive volume size fix) and incorporated
them into Y2011 tag; added tags: Y2008c, Y2009c, y2010b;
/geometry/geometry/btofgeo/gbtofgeo7.g - reverted to previous version of
btofgeo6 code; fixes in TOF geometry are applied now in btofgeo7;
/geometry/geometry/tpcegeo/tpcegeo3.g - reverted max radius to previous
value; reduced radius is set by TPCE04r flag in geometry.g;


StarDb
VmcGeometry/y2011.h - modified MUTD geometry: 27 backlegs changed to
28 backlegs;
Geometry.y2008c.C, Geometry.y2009c.C, Geometry.y2010b.C,
Geometry.y2008b.C, Geometry.y2009b.C, Geometry.y2010a.C, y2008c.h,
y2009c.h, y2010b.h - added new files for tags y2008b, y2008c, y2009b,
y2009c, y2010a and y2010b;

	Let us see if this fixes the issue.
	Thank you,

This set of updates works, as illustrated by the graph below: the rise in progress speed is back to previous levels.

It remains to identify the problem (this could be done in "dev").