EC2 and Nimbus testing

After the EC2 test snafu, we discussed stepping away from EC2 and attempting to find the problem on the ANL (Nimbus) cluster, to see whether the issue could be reproduced there.

As background information, my feeling is that we should at least attempt to start from a known-to-work state, i.e. have software stack versions on both ends (submitter / receiver) similar to what we know to be working.

The base OSG 0.8.0 is known to work for STAR (BNL to PDSF or BNL to WSU works). OSG 0.8.0 is equivalent to VDT 1.8.1, which implies GT 4.0.5. Release information was provided in the email summary below:

From: Jerome LAURET <jlauret @ bnl.gov>

Date: Wed, 04 Feb 2009 19:00:10 -0500
We can at least start from the end backward (in fact, past VDT
there is no point as we are not using OSG specific services;
there may be a point in VDT though as some core components may
be the reason).
http://twiki.grid.iu.edu/bin/view/ArchivedDocumentation/OSG/OSG080/WebHome
http://twiki.grid.iu.edu/bin/view/ArchivedDocumentation/OSG/OSG080/VdtRelease
http://vdt.cs.wisc.edu/releases/1.8.1/contents.html

After meeting internally with Wayne Betts, Lidia Didenko, and Levente Hajdu, I came up with the set of questions below:

  • Are the VM images different depending on gatekeeper (GK) or worker node (WN)?
    • Do we need to inflate two images and update two nodes?
  • Batch system – is there anything else to do with jobmanager/pbs, or is there anything we should know before installing any VDT/OSG deployment?
  • Certificate revocation list – how do we handle this? Would Tim provide the recipe?

The general direction is that STAR would be willing to provide help to upgrade the VM image (or images, depending on the answers to the above) on Tuesday the 10th of February. Wayne Betts would assist, and this would take at least 3 hours of his time.

In the interim, I suggest the following test suite (with an indication of what we would and would not learn from each test):

  • Start with no upgrade - use the ANL cluster and the same images
    • test #1: short jobs, same dimension in terms of number of nodes / jobs (60, or whatever the maximum is at ANL) – very short jobs all terminating at about the same time
      • success: we move to test #2
      • failure: we move to test #2 as well <=> no strong conclusion (but the suspicion that it could be related to jobs ending at the same time should be retained)
    • test #2: short jobs, same dimension – but short jobs not all terminating at the same time; this can be accomplished by a sleep time following some distribution (TBD) – see the payload sketch after this list
      • success: if test #1 failed, then we conclude a problem exists when jobs terminate at the same time. Redo tests #1 & #2 to confirm
      • failure: we do not proceed with test #3 – we need to understand why it fails at this point by tracing a failed job
         
    • test #3: we could also use a short test reproducing the log file size and see if this is the problem. This can be accomplished using a simple "chatty" Perl script redirected to a log (see the sketch after this list). Conditions should be approximately what we had in real life
      • success: no conclusions, move to test #4
      • failure: producing and transferring the logs may be the problem (focus on this)
         
    • test #4: run the same jobs as we had on EC2 (real simulation jobs, same conditions)
      NB: only worthwhile if #3 is a success
      • success: no need for the next test / strong evidence of a problem specific to EC2; redo test #4 to confirm the success
      • failure: fails in the same place as on EC2 / further debugging possible; possibly try test #5 to understand
         
    • test #5 (only if the previous tests fail): Levente suggests this additional test, if possible to implement, i.e. instead of trying to use 50 nodes, use one worker node and start 50 processes (see the fork sketch after this list)
      • pre-condition: at least up to test #2 was a success (no need to do this otherwise)
      • advantage: may be able to reproduce the same problem with minimal node investment.
         
  • Whenever the above are done (and hopefully before the planned upgrade)
    • If the tests fail on ANL as they did on EC2, upgrade
    • Redo the full test suite after the upgrade (with the same conditional stops wherever mentioned)
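
For reference, a minimal sketch of the job payload for tests #1 and #2 follows. This is only an illustration: the sleep distribution for test #2 is still TBD above, so the uniform random window used here (and the MAX_SLEEP knob) is a placeholder, not a decision.

    #!/usr/bin/env perl
    # Minimal payload sketch for tests #1 / #2.
    #   MAX_SLEEP=0  -> test #1: all jobs end at about the same time
    #   MAX_SLEEP>0  -> test #2: termination times spread out
    #                   (distribution TBD; uniform used as a placeholder)
    use strict;
    use warnings;

    my $max_sleep = $ENV{MAX_SLEEP} || 0;                  # seconds
    my $delay     = $max_sleep ? int(rand($max_sleep)) : 0;

    print "payload start on ", `hostname`;
    sleep($delay);
    print "payload done after ${delay}s extra sleep\n";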
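
The "chatty" script for test #3 only needs to produce roughly the same log volume as a real simulation job. A sketch, where the target size (TARGET_MB) and the filler line are placeholders to be tuned to the real-life conditions:

    #!/usr/bin/env perl
    # Chatty payload sketch for test #3: write ~TARGET_MB of output to
    # STDOUT, to be redirected to a log file by the job wrapper.
    use strict;
    use warnings;

    my $target_mb = $ENV{TARGET_MB} || 100;                # placeholder size
    my $line      = ("x" x 99) . "\n";                     # 100-byte line
    my $nlines    = int($target_mb * 1024 * 1024 / 100);

    print $line for 1 .. $nlines;

It would be run as e.g. perl chatty.pl > job.log from the job wrapper, so the log ends up where the real jobs would write it.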
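
Test #5 could then be driven by a small fork wrapper like the one below; payload.pl is a stand-in name for whichever payload script the earlier tests settled on.

    #!/usr/bin/env perl
    # Test #5 driver sketch: on one worker node, start NPROC payload
    # copies in parallel instead of spreading them over NPROC nodes.
    use strict;
    use warnings;

    my $nproc = $ENV{NPROC} || 50;

    for (1 .. $nproc) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                                   # child
            exec("perl", "payload.pl") or die "exec failed: $!";
        }
    }
    wait() for 1 .. $nproc;                                # reap children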