Summary of Apple's Xgrid cluster software and the steps we've taken to get it up and running at MIT.


Xgrid jobmanager status report

  • xgrid.pm can submit and cancel jobs successfully, haven't tested "poll" since the server is running WS-GRAM.
  • Xgrid SEG module monitors jobs successfully.  Current version of Xgrid logs directly to /var/log/system.log (only readable by admin group), so there's a permissions issue to resolve there.  My understanding is that the SEG module can run with elevated permissions if needed, but at the moment I'm using ACLs to explicitly allow user "globus" to read the system.log.  Unfortunately the ACLs get reset when the logs are rotated nightly.
  • CVS is up-to-date, but I can't promise that all of the Globus packaging stuff actually works.  I ended up installing both Perl module and the C library into my Globus installation by hand.
  • Current test environment uses SimpleCA, but I've applied for a server certificate at pki1.doegrids.org as part of the STAR VO.

Important Outstanding Issues

  • streaming stdout/stderr and stagingOut files is a little tricky.  Xgrid requires an explicit call to "xgrid -job results", otherwise it  just keeps all job info in the controller DB.  I haven't yet figured out where to inject this system call in the WS-GRAM job life cycle, so I'm asking for help on gram-dev@globus.org.
  • Need to decide how to do authentication.  Xgrid offers two options on the extreme ends of the spectrum.  On the one hand we can use a common password for all users, and on the other hand we can use K5 tickets.  Submitting a job using WS-GRAM involves a roundtrip user account -> container account -> user account via sudo, and I don't know how to forward a TGT for the user account through all of that.  I looked around and saw a "pkinit" effort that promised to do passwordless generation of TGTs from grid certs, but it doesn't seem like it's quite ready for primetime.