Scalability Issue Troubleshooting at EC2

Jobs run at EC2 show scalability problems when more than 20-50 jobs are submitted at once. The pathology can only be seen once the jobs have completed their run cycle, that is to say, after the jobs copy back the files they have produced and the local batch system reports the job as having finished. The symptoms are as follows:

No stdout from the job, as defined in the .condorg file by “output=”, comes back.
No stderr from the job, as defined in the .condorg file by “error=”, comes back.

It should be noted that the stdout/stderr can be recovered from the gatekeeper at EC2 by copying it back with scp. The stdout/stderr resides in:

/home/torqueuser/.globus/job/[gk name]/*/stdout

/home/torqueuser/.globus/job/[gk name]/*/stderr

The command would be:

scp -r root@[gk name]:/home/torqueuser/.globus/job /star/data08/users/lbhajdu/vmtest/io/
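Once the job directory has been copied back, it helps to see at a glance which jobs actually left a stdout or stderr file behind. The following is a minimal sketch, assuming the copied tree keeps the layout of the paths above (one directory per gatekeeper name, one subdirectory per job); the function name `summarize_recovered_logs` is hypothetical.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def summarize_recovered_logs(root: str) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Return (job directory, stdout bytes, stderr bytes) for each job under root.

    Assumed layout (taken from the paths above): <root>/<gk name>/<job id>/stdout
    and <root>/<gk name>/<job id>/stderr. A size of None means the file is missing.
    """
    root_path = Path(root)
    rows = []
    for job_dir in sorted(p for p in root_path.glob("*/*") if p.is_dir()):
        sizes = []
        for name in ("stdout", "stderr"):
            log_file = job_dir / name
            sizes.append(log_file.stat().st_size if log_file.is_file() else None)
        rows.append((str(job_dir.relative_to(root_path)), sizes[0], sizes[1]))
    return rows
```

With the scp command above, the root to pass would be /star/data08/users/lbhajdu/vmtest/io/job. Rows with a missing or zero-size stdout are the ones worth inspecting first.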

Jobs are still reported as running under condor_q on the submitting end long after they have finished, and the batch system on the other end reports them as finished.

Below is a standard sample .condorg file from a job:

[stargrid01] /<1>data08/users/lbhajdu/vmtest/> cat
globusscheduler= ec2-75-101-199-159.compute-1.amazonaws.com/jobmanager-pbs
output =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.log
error =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.err
log =schedC3A7967022377B3E5F2DCCE2C60CB79D_998.condorg.log
transfer_executable= true
notification =never
universe =globus
stream_output =false
stream_error =false
queue
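Since the missing files are exactly the ones named by “output=” and “error=”, it can be useful to read those settings back out of a .condorg file when checking where the logs should have landed. The sketch below is a deliberately simple reader for flat `key = value` files like the sample above; it is not a full parser for the Condor submit language (no macros, no line continuations), and the names `parse_submit_description` and `missing_log_settings` are hypothetical.

```python
from typing import Dict, List, Tuple

# The sample file shown above, reproduced as a string for illustration.
SAMPLE = """\
globusscheduler= ec2-75-101-199-159.compute-1.amazonaws.com/jobmanager-pbs
output =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.log
error =/star/data08/users/starreco/prodlog/P08ie/log/C3A7967022377B3E5F2DCCE2C60CB79D_998.err
log =schedC3A7967022377B3E5F2DCCE2C60CB79D_998.condorg.log
transfer_executable= true
notification =never
universe =globus
stream_output =false
stream_error =false
queue
"""


def parse_submit_description(text: str) -> Tuple[Dict[str, str], bool]:
    """Return (settings, queue_seen) for a flat key=value submit file."""
    settings: Dict[str, str] = {}
    queue_seen = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.split()[0].lower() == "queue":
            queue_seen = True
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not key=value
            settings[key.strip().lower()] = value.strip()
    return settings, queue_seen


def missing_log_settings(settings: Dict[str, str]) -> List[str]:
    """List which of output/error are absent or empty."""
    return [k for k in ("output", "error") if not settings.get(k)]
```

Calling `parse_submit_description(SAMPLE)` gives a dictionary whose `output` and `error` entries are the paths to check on the submitting side.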

The job parameters:

Work flow:

Copy in event generator configuration
Run raw event generator
Copy back raw event file (*.fzd)
Run reconstruction on raw events
Copy back reconstructed files(*.root)
Clean Up

Work flow processes : globus-url-copy -> pythia -> globus-url-copy -> root4star -> globus-url-copy

Note: Some low runtime processes not shown
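The work flow above is a strictly sequential chain, so a job wrapper only needs to run each step in order and stop at the first failure. The sketch below shows that shape under stated assumptions: the real command lines for globus-url-copy, pythia and root4star are not given here, so every step is a stand-in that just prints a message, and `run_workflow`, `placeholder` and `EXAMPLE_STEPS` are hypothetical names. The optional log file sends each step's output to a file instead of the job's stdout.

```python
import subprocess
import sys
from typing import List, Optional, Sequence, Tuple

Step = Tuple[str, Sequence[str]]


def placeholder(label: str) -> List[str]:
    """Stand-in command; the real command lines are not part of this page."""
    return [sys.executable, "-c", f"print('placeholder: ' + {label!r})"]


# One entry per step of the work flow listed above.
EXAMPLE_STEPS: List[Step] = [
    ("copy in event generator configuration", placeholder("globus-url-copy")),
    ("run raw event generator", placeholder("pythia")),
    ("copy back raw event file", placeholder("globus-url-copy")),
    ("run reconstruction", placeholder("root4star")),
    ("copy back reconstructed files", placeholder("globus-url-copy")),
    ("clean up", placeholder("cleanup")),
]


def run_workflow(steps: List[Step], log_path: Optional[str] = None) -> Optional[str]:
    """Run steps in order; return the label of the first failing step, or None."""
    log = open(log_path, "ab") if log_path else None
    try:
        for label, command in steps:
            result = subprocess.run(
                command,
                stdout=log,  # None means inherit the job's stdout
                stderr=subprocess.STDOUT if log else None,
            )
            if result.returncode != 0:
                return label  # later steps are skipped
        return None
    finally:
        if log:
            log.close()
```

A call such as `run_workflow(EXAMPLE_STEPS, "job.log")` returns None when every step succeeds, and the name of the failing step otherwise.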

Run time:

23 hours@1000 events

1 hour@10-100 events

Output:

15M rcf1504_*_1000evts.fzd

18M rcf1504_*_1000evts.geant.root

400K rcf1504_*_1000evts.hist.root

1.3M rcf1504_*_1000evts.minimc.root

3.7M rcf1504_*_1000evts.MuDst.root

60K rcf1504_*_1000evts.tags.root

14MB stdout log, later reduced to 5KB by redirecting the output to a file and copying that file back via globus-url-copy.
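To put these sizes in perspective, the arithmetic below adds up the per-job output and the total stdout volume for a batch of jobs. It is only a back-of-the-envelope sketch: it reads "M" as 1024 KB (an assumption), and it does not show that stdout volume is the cause of the failures.

```python
# File sizes from the listing above, in KB ("M" read as 1024 KB, an assumption).
OUTPUT_KB = {
    "fzd": 15 * 1024,
    "geant.root": 18 * 1024,
    "hist.root": 400,
    "minimc.root": 1.3 * 1024,
    "MuDst.root": 3.7 * 1024,
    "tags.root": 60,
}
STDOUT_KB_ORIGINAL = 14 * 1024  # stdout log before the redirect
STDOUT_KB_REDIRECTED = 5        # stdout log after the redirect


def per_job_mb(stdout_kb: float) -> float:
    """Total data one job sends back (output files plus stdout), in MB."""
    return (sum(OUTPUT_KB.values()) + stdout_kb) / 1024


def batch_stdout_mb(n_jobs: int, stdout_kb: float) -> float:
    """Total stdout volume for n_jobs jobs finishing together, in MB."""
    return n_jobs * stdout_kb / 1024


for n in (10, 20, 40, 100):
    print(n, "jobs:", batch_stdout_mb(n, STDOUT_KB_ORIGINAL), "MB of stdout")
```

The output files come to roughly 38 MB per job, so the original 14MB stdout log adds more than a third on top of that, while the redirected log is negligible.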

Paths:

Jobs submitted from:

/star/data08/users/lbhajdu/vmtest/

Output copied back to:

/star/data08/users/lbhajdu/vmtest/data

STD redirect copied back to:

/star/data08/users/starreco/prodlog/P08ie/log

The tests:

We first tested 100 nodes, with 14MB of text going to stdout. This failed with the symptoms above.
Next test was with 10 nodes, with 14MB of text going to stdout. This worked without any problems.
Next test was 20 nodes, with 14MB of text going to stdout. This worked without any problems.
Next test was 40 nodes, with 14MB of text going to stdout. This failed with the symptoms above.
Next we redirected (“>”) the output of the event generator and the reconstruction to a file and copied this file back directly with globus-url-copy after the job was finished. We tested again with 40 nodes. The stdout is now only 15K. This time it worked without any problems. (Was this just coincidence?)
Next we tried with 75 nodes and the redirected output trick. This failed with the symptoms above.
Next we tried with 50 nodes. This failed with the symptoms above.
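The test results above can be tabulated to bracket the failure threshold for each configuration. The sketch below records each test as (nodes, stdout redirected?, succeeded?) exactly as described, and reports the largest passing and smallest failing node count; the name `bracket` is hypothetical, and with one run per setting the brackets are indicative only.

```python
from typing import List, Optional, Tuple

# (nodes, stdout redirected to a file?, succeeded?), in the order the tests were run.
TESTS: List[Tuple[int, bool, bool]] = [
    (100, False, False),
    (10, False, True),
    (20, False, True),
    (40, False, False),
    (40, True, True),
    (75, True, False),
    (50, True, False),
]


def bracket(tests: List[Tuple[int, bool, bool]], redirected: bool) -> Tuple[Optional[int], Optional[int]]:
    """Return (largest passing node count, smallest failing node count)."""
    passed = [n for n, r, ok in tests if r == redirected and ok]
    failed = [n for n, r, ok in tests if r == redirected and not ok]
    return (max(passed) if passed else None, min(failed) if failed else None)


print("14MB stdout:      ", bracket(TESTS, redirected=False))
print("redirected stdout:", bracket(TESTS, redirected=True))
```

On these data the threshold lies between 20 and 40 nodes with the full stdout and between 40 and 50 nodes with the redirect, so the redirect appears to move the limit without removing it.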
We have consulted Alain Roy, who has advised an upgrade of Globus and Condor-G. He says the upgrade of Condor-G is most likely to help. Tim has upgraded the image with the latest version of Globus, and I will be submitting from stargrid05, which has a newer Condor-G version. The software versions are listed here:

Stargrid01
- Condor/Condor-G 6.8.8
- Globus Toolkit, pre web-services, client 4.0.5
- Globus Toolkit, web-services, client 4.0.5
Stargrid05
- $CondorVersion: 7.0.5 Sep 20 2008 BuildID: 105846
- Globus Toolkit, pre web-services, client 4.0.7
- Globus Toolkit, pre web-services, server 4.0.7

We have tested on a five-node cluster (1 head node, 4 workers) and discovered a problem with stargrid05: jobs do not get transferred over to the submitting side. The RCF has been contacted; we know this is on our side. It was decided we should not submit until we can try from stargrid05.