Run 10 production times by dataset

Following on from A look at production times for AuAu 7 and 39 GeV, which focused on the inefficiency of production jobs due to time spent accessing the DB, I tried to assess the computing resources needed for the Run 10 datasets based solely on CPU time per event (i.e. ignoring time spent waiting for DB responses). I have done this by looking at quantities reported in a random sampling of production log files over the course of the past few months.

In the first set of plots, the top left panel shows the raw distribution of CPU times for AuAu7 st_physics jobs versus the number of events in the job. The two large groupings of entries in the plot are from adc (lower event multiplicity) and non-adc (higher event multiplicity) files. As previously known, the files with just a few events are dominated by initialization. We subsequently decided to only produce files with at least 20 events (see Follow-up #1 on Effect of stream data on database performance, a 2010 study), and all other plots on this page will feature two things: (1) a cut requiring a minimum of 20 events in the job, and (2) a weighting of data points by sqrt(# events).

The lower left panel shows the mean CPU time per event versus the SI2k rating of the processors on which the jobs ran (obtained here) for st_physics jobs at each of the Run 10 collision energies (AuAu 200, 62, 39, 11, 7.7 GeV; lower energy events have lower multiplicity and so take less time to process). Statistical error bars are smaller than the markers. The plot demonstrates that CPU time per event can vary significantly (again, as expected) with the CPU spec rating.

The plot in the lower right panel tries to normalize by multiplying the CPU time by the spec rating (this plot has the same 40:1 vertical range as the bottom left plot). This flattens the distributions significantly, with the exception of the nodes with a 1169 SI2k rating, which generally seem to outperform their spec despite having among the least memory per core (1 GB; other nodes have 1 GB [1290 SI2k], 2 GB [1996 & 2862 SI2k], or 3 GB [3578 SI2k]). The upper right plot shows how the distribution tightens up using this metric.
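As a concrete reference for the event cut, the sqrt(# events) weighting, and the spec * CPU time metric used throughout this page, here is a minimal sketch (Python; not the code actually used for this study). The record fields (n_events, cpu_sec, si2k) are assumed names rather than the actual log or catalog schema.

    import math

    MIN_EVENTS = 20  # drop short jobs dominated by initialization overhead

    def job_metric(n_events, cpu_sec, si2k):
        """Return (metric, weight) for one job, or None if it fails the event cut.

        metric = SI2k rating * CPU seconds per event  [SI2k-sec per event]
        weight = sqrt(# events), as used to weight the data points.
        """
        if n_events < MIN_EVENTS:
            return None
        return si2k * cpu_sec / n_events, math.sqrt(n_events)

    def weighted_mean_rms(pairs):
        """Weighted mean and RMS of an iterable of (value, weight) pairs."""
        pairs = list(pairs)
        wsum = sum(w for _, w in pairs)
        mean = sum(v * w for v, w in pairs) / wsum
        var = sum(w * (v - mean) ** 2 for v, w in pairs) / wsum
        return mean, math.sqrt(var)

    # Hypothetical job: 150 events, 900 s of CPU, on a 1996 SI2k core
    # -> roughly 12k SI2k-sec per event, in line with the st_physics values below.
    # print(job_metric(150, 900, 1996))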

[Plots: CPU time (left column) and CPU speed * time (right column)]

Even using the metric of spec * CPU time per event, it is important to recognize that a notable spread in performance remains for some of the data, as demonstrated by the following two plots. These show log10(spec * CPU time / event) [SI2k-sec] vs. stream ID for AuAu7 and AuAu39 (left and right, respectively), with the aforementioned cut and weighting, where the stream IDs are:

  0. st_physics
  1. st_ht
  2. st_gamma
  3. st_hlt
  4. st_mtd
  5. st_monitor
  6. st_pmdftp
  7. st_zerobias
  8. st_upc
  9. st_atomcules

Not all of these streams exist in all datasets (see the tables below), and I have truncated these particular plots at stream ID 9, as I have no st_atomcules logs. Also, I have staggered the adc files by offsetting them slightly (by 0.2) from these stream IDs.
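For illustration only, a sketch of how such a staggered scatter plot could be drawn; the input format is an assumption and this is not the code actually used.

    import math
    import matplotlib.pyplot as plt

    def plot_metric_vs_stream(jobs, adc_offset=0.2, min_events=20):
        """Scatter of log10(SI2k * CPU sec / event) versus stream ID.

        `jobs` is an iterable of (stream_id, is_adc, si2k, cpu_sec, n_events)
        tuples; adc jobs are shifted by `adc_offset` along the x axis, and
        marker sizes scale with sqrt(# events) to mimic the weighting.
        """
        xs, ys, sizes = [], [], []
        for stream_id, is_adc, si2k, cpu_sec, n_events in jobs:
            if n_events < min_events:
                continue
            xs.append(stream_id + (adc_offset if is_adc else 0.0))
            ys.append(math.log10(si2k * cpu_sec / n_events))
            sizes.append(math.sqrt(n_events))
        plt.scatter(xs, ys, s=sizes)
        plt.xlabel("stream ID (adc staggered by +0.2)")
        plt.ylabel("log10(SI2k * CPU time / event) [SI2k-sec]")
        plt.show()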

AuAu7
AuAu39

 

In the studied data, some of the AuAu7 streams have a very broad distribution extending up to approximately 10M SI2k-sec per event. Such jobs took several days (0.5M seconds!) to run, almost certainly because of the previously observed events with large hit counts (see Limiting production of events with large hit counts), which cause a huge amount of time to be spent in the track finder. These long-running production jobs are not seen in the AuAu39 data. Unfortunately, there was no simple cut available in this analysis to exclude such high multiplicity events or the jobs which processed them (though jobs which took even longer were killed during production), and this will bias my results for the 7.7 GeV data.

Nevertheless, even the AuAu39 data show some spread in resource needs per event between different jobs: the RMS of the metric is about 20% of the measured means for AuAu39.

Anyhow, taking the means of these weighted distributions yields the following numbers per event per stream, shown as a plot (statistical errors on the means are shown; colors as before: AuAu 200, 62, 39, 11, 7.7 GeV) and as a table. I only obtained log files for some of the streams, but the st_ht, st_hlt, and st_mtd streams show a clear bias towards more time per event than st_physics. Generally, the adc files also take a little longer than the non-adc files.
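A sketch of how the table entries below (weighted means with statistical errors) could be produced from per-job records. The grouping keys and the error convention (weighted RMS over the square root of the effective sample size) are my assumptions; the exact prescription is not spelled out above.

    from collections import defaultdict
    import math

    def stream_means(records):
        """Weighted mean and its statistical error per (energy, stream, is_adc).

        `records` is an iterable of (energy, stream, is_adc, metric, weight)
        tuples, with metric = SI2k * CPU sec per event and weight = sqrt(# events).
        """
        groups = defaultdict(list)
        for energy, stream, is_adc, metric, weight in records:
            groups[(energy, stream, is_adc)].append((metric, weight))
        table = {}
        for key, pairs in groups.items():
            wsum = sum(w for _, w in pairs)
            mean = sum(v * w for v, w in pairs) / wsum
            var = sum(w * (v - mean) ** 2 for v, w in pairs) / wsum
            n_eff = wsum ** 2 / sum(w * w for _, w in pairs)  # effective entries
            table[key] = (mean, math.sqrt(var / n_eff))       # (mean, stat. error)
        return table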

<spec * time / event> [SI2k-sec]:

ID  stream        200           62            39            11          7.7
0   st_physics    34165 ± 226   -             12395 ± 6     9366 ± 2    6015 ± 14
    _adc          30837 ± 178   15138 ± 131   12318 ± 22    9529 ± 8    9476 ± 270
1   st_ht         57835 ± 680   39010 ± 1     26379 ± 34    -           45401 ± 2329
    _adc          -             -             27486 ± 147   -           19784 ± 1834
2   st_gamma      -             -             -             -           -
    _adc          -             -             -             -           -
3   st_hlt        -             -             28368 ± 74    -           10159 ± 13
    _adc          -             -             38470 ± 1     -           12817 ± 20
4   st_mtd        -             -             25549 ± 89    -           65656 ± 1453
    _adc          -             -             29032 ± 2017  -           8438 ± 623
5   st_monitor    -             -             13547 ± 32    -           4587 ± 57
    _adc          -             -             17244 ± 212   -           5782 ± 140
6   st_pmdftp     68743 ± 448   16760 ± 86    13933 ± 11    -           3298 ± 16
    _adc          -             -             14341 ± 34    -           5513 ± 76
7   st_zerobias   -             -             -             -           -
    _adc          34700 ± 967   -             2517 ± 5      -           2768 ± 2
8   st_upc        18365 ± 750   -             2149 ± 4      -           -
    _adc          -             -             6087 ± 210    -           -
9   st_atomcules  -             -             -             -           -
    _adc          -             -             -             -           -

 

Event counts [M] as reported by the file catalog for AuAuX_production, where X is the collision energy:

ID  stream        200    62     39     11     7.7
0   st_physics    690    191    253    55     100
    _adc          14     4      5      1      2
1   st_ht         59     34     14     -      3
    _adc          1      0.7    0.3    -      0.1
2   st_gamma      5      -      -      -      -
    _adc          0.1    -      -      -      -
3   st_hlt        16     4      1      8      5
    _adc          0.3    0.1    0.3    0.2    0.1
4   st_mtd        15     3      1      0.4    0.9
    _adc          0.3    0.1    0.0    0.1    0.0
5   st_monitor    -      25     9      3      15
    _adc          -      0.2    0.2    0.1    0.3
6   st_pmdftp     5      4      23     -      0.2
    _adc          0.1    0.1    0.5    -      0.0
7   st_zerobias   0.0    0.0    0.0    -      0.0
    _adc          2      0.6    0.5    0.4    1
8   st_upc        38     3      0.4    -      -
    _adc          0.8    0.1    0.0    -      -
9   st_atomcules  17     1      -      -      -
    _adc          0.3    0.0    -      -      -
Totals            865    272    309    68     128

 

Because many numbers are missing, it is difficult to be precise. But using estimates from the two tables above, I arrive at the following approximate total CPU needs [SI2k-sec]:

200: 865M * 40k = 34.6T SI2k-sec
62: 272M * 18k = 4.9T SI2k-sec
39: 309M * 14k = 4.3T SI2k-sec
11: 68M * 9.4k = 0.6T SI2k-sec
7.7: 128M * 7k = 0.9T SI2k-sec
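For bookkeeping: millions of events times thousands of SI2k-sec per event gives units of 10^9 SI2k-sec, so the totals above come out in tera-SI2k-sec. A quick arithmetic check (Python):

    # Totals = (events [millions]) * (mean cost [thousands of SI2k-sec per event]);
    # 1e6 * 1e3 = 1e9, so dividing the product by 1000 gives tera-SI2k-sec.
    datasets = {
        "200": (865, 40),   # (events [M], <spec * time / event> [kSI2k-sec])
        "62":  (272, 18),
        "39":  (309, 14),
        "11":  (68, 9.4),
        "7.7": (128, 7),
    }
    for energy, (mevents, kcost) in datasets.items():
        print(f"{energy}: {mevents * kcost / 1e3:.1f}T SI2k-sec")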

On a farm with approximately 3M SI2k available (essentially the current state, with only half of the rcas6xxx nodes plus all of the rcrs6xxx nodes available to production), and assuming perhaps 80% efficiency in completing jobs (accounting for, e.g., service outages and failed jobs), these translate into the following production time estimates:

200: 167 days
62: 24 days
39: 21 days
11: 3 days
7.7: 4 days
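These estimates follow from dividing each total by the effective farm throughput; a small sketch, using the 3M SI2k and 80% figures quoted above:

    SEC_PER_DAY = 86400.0
    farm_si2k, efficiency = 3.0e6, 0.8     # current farm and assumed job efficiency

    totals_t = {"200": 34.6, "62": 4.9, "39": 4.3, "11": 0.6, "7.7": 0.9}  # T SI2k-sec
    for energy, total in totals_t.items():
        days = total * 1e12 / (farm_si2k * efficiency * SEC_PER_DAY)
        print(f"{energy}: {days:.0f} days")

    # Doubling farm_si2k to ~6M SI2k (the pending node arrival) halves these
    # estimates, e.g. roughly 83 days for the 200 GeV data.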

The pending arrival of new nodes to the farm should essentially double the available SI2k for production. At 6M SI2k, the 200 GeV production should be completed within 3 months of its start.

On a related note, $100k can procure a little over 0.4M SI2k. This would represent less than 7% additional capacity for production (and less than 5% of the full RCF Linux farm capacity) beyond the pending node arrival. That capacity could reduce the overall production time of AuAu200+39 by about 1 week.
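For completeness, the roughly one week figure can be reproduced as follows, assuming the extra 0.4M SI2k sits on top of the ~6M SI2k expected after the pending node arrival:

    total = (34.6 + 4.3) * 1e12            # AuAu200 + AuAu39 [SI2k-sec]
    eff, sec_per_day = 0.8, 86400.0
    days_without = total / (6.0e6 * eff * sec_per_day)
    days_with    = total / (6.4e6 * eff * sec_per_day)
    print(f"saving: {days_without - days_with:.1f} days")   # about 6 days, i.e. ~1 week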

-Gene