Cloud computing - FastOffline run 11

 A summary of the STAR FastOffline real-time Run 11 data processing on Cloud VMs.

Deployment of STAR software, initially on NERSC/Magellan:

  1.  STAR VM w/ SL10k: euca-run-instances -t c1.xlarge emi-48080D8D
  2.  STAR VM w/ SL11a: euca-run-instances -t c1.xlarge emi-6F2A0E46
  3.  STAR VM w/ SL11b: euca-run-instances -t c1.xlarge emi-FA4D10D5
  4.  STAR VM w/ SL11c: euca-run-instances -t c1.xlarge emi-6E5B0E5C
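The launch step above can be sketched as a small command builder; only the `-t c1.xlarge` type and the image ids come from the list, while the wrapper function and the use of the `-n` instance-count option are illustrative assumptions:

```python
# Hedged sketch of the VM launch step: building the euca-run-instances call
# for one of the images listed above. Only -t and the image id come from the
# list; the -n instance-count flag and this wrapper are assumptions.
def run_instances_cmd(emi_id, instance_type="c1.xlarge", count=1):
    cmd = ["euca-run-instances", "-t", instance_type, emi_id]
    if count > 1:
        # euca2ools accepts an instance count via -n (assumption noted above)
        cmd[1:1] = ["-n", str(count)]
    return cmd
```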

 Eventually we operated a coherent cluster of over 100 VMs drawn from three resource pools, although never all three at once at full load.

Data processing scheme:

  1. daq files were dispatched by STAR FastOffline to the 3 TB /star/data13/ disk at RCF
  2. globus-online was used to serially transfer the daq files to the 20 TB /scratch disk at NERSC
  3. scp was used to transfer daq files to the VMs in parallel, and to transfer the produced MuDst+StEvent files back to the /scratch disk at NERSC
  4. globus-online pulled the MuDst+StEvent files back to /star/data13
  5. FastOffline cataloged the produced files from data13 and saved them in HPSS @ RCF
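The parallel scp stage (step 3) can be sketched as follows; the helper only builds the command lines, and the host names, paths, and round-robin batching scheme are illustrative assumptions, not the production script:

```python
# Hedged sketch of step 3: pushing daq files to a VM with several concurrent
# scp streams. Hosts, paths and the batching scheme are illustrative.
def scp_push_cmds(daq_files, vm_host, dest_dir, streams=4):
    """Round-robin the daq files into `streams` batches, one scp command
    per batch, so the batches can run as parallel scp processes."""
    batches = [daq_files[i::streams] for i in range(streams)]
    return [["scp", *batch, f"{vm_host}:{dest_dir}"]
            for batch in batches if batch]
```

Launching each command in its own process (e.g. with subprocess.Popen) gives the parallel transfer; the return trip for the produced MuDst+StEvent files works the same way in reverse.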

A STAR DB snapshot was generated every 2 hours and uploaded to the VMs once per day, at a time fixed per VM but chosen at random. Every VM ran 5 independent local DB instances, launched on consecutive days.
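The "fixed random time" upload can be sketched by mapping each VM identifier to a fixed minute of the day, so the daily uploads are staggered but reproducible; the hashing scheme below is an assumption for illustration:

```python
# Hedged sketch: each VM derives a fixed but effectively random minute of the
# day for its daily DB snapshot upload, spreading the load on the snapshot
# server. The hash-based mapping is an assumption for illustration.
import hashlib

def upload_minute_of_day(vm_id: str) -> int:
    """Deterministically map a VM identifier to a minute in [0, 1440)."""
    h = int(hashlib.sha1(vm_id.encode()).hexdigest(), 16)
    return h % (24 * 60)
```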

A single BFC job lasted up to 3 days.

 

Summary of performance (estimates, accurate to ~10-20%)

  • # of processed daq files: ~18,000
  • data volume pushed: RCF --> NERSC ~70 TB, sent back ~40 TB
  • # of simultaneous jobs varied from 160 to 750
  • continuous processing over 4 months
  • total CPU harvest of ~25,000 CPU days, or ~70 CPU years
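A quick consistency check of the quoted estimates (the ~10-20% accuracy applies throughout; the 30.5-day month is an assumption):

```python
# Sanity-check the performance summary above using only its own numbers.
cpu_days = 25_000
cpu_years = cpu_days / 365            # ~68.5, matching the quoted ~70 CPU years

wall_days = 4 * 30.5                  # ~122 days of continuous processing
avg_jobs = cpu_days / wall_days       # ~205 simultaneous jobs on average,
                                      # inside the quoted 160-750 range
print(round(cpu_years, 1), round(avg_jobs))
```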

 


Fig 1. Example of peak load on VMs


Fig 2. Total load on VMs over last 3 months