SGE Job Manager patch

Under:
This page should hold a draft of the SGE Job Manager fixes that we want to send to the VDT team.
  • Missing environment variable definitions
    • In the BEGIN section, check that $SGE_ROOT, $SGE_CELL and the commands ($qsub, $qstat, etc.) are defined properly
    • In the SUBMIT, POLL and CLEAR sections, locate the line
      $ENV{"SGE_ROOT"} = $SGE_ROOT;
      
      and add the line
      $ENV{"SGE_CELL"} = $SGE_CELL;
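The BEGIN-section check described above could look roughly like the following sketch. The helper name check_sge_settings and the hash keys are hypothetical and not part of the job manager, which keeps these values in variables such as $SGE_ROOT and $qsub. It also assumes that each command variable holds a full path to an executable.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in the SGE settings.
# %cfg maps names to values, e.g. SGE_ROOT => '/opt/sge', qsub => '/opt/sge/bin/qsub'.
sub check_sge_settings {
    my (%cfg) = @_;
    my @problems;

    # SGE_ROOT and SGE_CELL must be non-empty strings.
    for my $var (qw(SGE_ROOT SGE_CELL)) {
        push @problems, "$var is not defined"
            unless defined $cfg{$var} && length $cfg{$var};
    }

    # Assumption: each command is configured as a full path to an executable.
    for my $cmd (qw(qsub qstat qdel)) {
        my $path = $cfg{$cmd};
        if (!defined $path || !length $path) {
            push @problems, "\$$cmd is not defined";
        }
        elsif (!-x $path) {
            push @problems, "\$$cmd ($path) is not executable";
        }
    }
    return @problems;
}
```

An empty return list means that all settings passed the check; otherwise each entry names one missing or unusable setting.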
      
  • Bug in finding the correct job ID when clearing jobs
    • in the CLEAR section, locate the line
      system("$qdel $job_id > /dev/null 2 > /dev/null");
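      and replace it with the following block
      $ENV{"SGE_ROOT"} = $SGE_ROOT;
      $ENV{"SGE_CELL"} = $SGE_CELL;
      # Keep only the first of the three "|"-separated fields; leave
      # $job_id untouched if it does not have that form.
      if ($job_id =~ /(.*)\|(.*)\|(.*)/) {
        $job_id = $1;
      }
      # "2>" must be written without a space, otherwise "2" is passed to
      # qdel as an argument and stderr is not redirected.
      system("$qdel $job_id > /dev/null 2> /dev/null");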
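The job ID extraction in the block above can be tried in isolation with a small sketch like this one. The helper name extract_sge_job_id and the sample identifiers are hypothetical. The only thing taken from the patch is that the identifier has three "|"-separated fields and that the first one is the SGE job ID; what the other two fields contain is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first field of a composite
# "field1|field2|field3" job identifier. An identifier without
# that form is returned unchanged.
sub extract_sge_job_id {
    my ($job_id) = @_;
    if ($job_id =~ /(.*)\|(.*)\|(.*)/) {
        return $1;
    }
    return $job_id;
}

# Sample identifiers (made up for illustration).
print extract_sge_job_id('4711|/tmp/job.out|/tmp/job.err'), "\n";
print extract_sge_job_id('4711'), "\n";
```

Checking the result of the match before using $1 matters because $1 keeps its value from an earlier successful match when the current match fails.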
  • The SGE Job Manager modifies both the standard output and the standard error file names by appending .real. This fails when a user specifies /dev/null for either of those files. The problem occurs in two places, first starting at line 318:
        #####
        # Where to write output and error?
        #
        if(($description->jobtype() eq "single") && ($description->count() > 1))
        {
          #####
          # It's a single job and we use job arrays
          #
          $sge_job_script->print("#\$ -o "
                                 . $description->stdout() . ".\$TASK_ID\n");
          $sge_job_script->print("#\$ -e "
                                 . $description->stderr() . ".\$TASK_ID\n");
        }
        else
        {
            # [dwm] Don't use real output paths; copy the output there later.
            #       Globus doesn't seem to handle streaming of the output
            #       properly and can result in the output being lost.
            # FIXME: We would prefer continuous streaming.  Try to determine
            #       precisely what's failing so that we can fix the problem.
            #       See Globus bug #1288.
          $sge_job_script->print("#\$ -o " . $description->stdout() . ".real\n");
          $sge_job_script->print("#\$ -e " . $description->stderr() . ".real\n");
        }
     
    
    and then again at line 659:
          if(($description->jobtype() eq "single") && ($description->count() > 1))
          #####
          # Jobtype is single and count>1. Therefore, we used job arrays. We
          # need to merge individual output/error files into one.
          #
          {
            # [dwm] Use append, not overwrite to work around file streaming issues.
            system ("$cat $job_out.* >> $job_out");
            system ("$cat $job_err.* >> $job_err");
          }
          else
          {
            # [dwm] We still need to append the job output to the GASS cache file.
            #       We can't let SGE do this directly because it appears to
            #       *overwrite* the file, not append to it -- which the Globus
            #       file streaming components don't seem to handle properly.
            #       So append the output manually now.
            system("$cat $job_out.real >> $job_out");
          }
    
  • The snippet of code above is also missing a statement for the standard error. At the end, instead of:
            #       So append the output manually now.
            system("$cat $job_out.real >> $job_out");
          }
    
    it should read:
            #       So append the output manually now.
            system("$cat $job_out.real >> $job_out");
            system("$cat $job_err.real >> $job_err");
          }
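The notes above describe the /dev/null failure but do not give a fix for it. One possible approach, shown here only as a sketch, is to skip the .real suffix and the later append step whenever the requested file is /dev/null. The helper names sge_output_path and append_real_output are hypothetical and do not exist in the job manager. The sketch also copies the file in Perl instead of calling $cat, which is a substitution made to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the path handed to the "#$ -o" / "#$ -e" lines.
sub sge_output_path {
    my ($path) = @_;
    return $path if $path eq '/dev/null';    # nothing to copy back later
    return "$path.real";
}

# Hypothetical helper: append "<path>.real" to <path>.
# Returns 1 if data was appended, 0 if there was nothing to do or an error.
sub append_real_output {
    my ($path) = @_;
    return 0 if $path eq '/dev/null';
    my $real = "$path.real";
    return 0 unless -e $real;
    open(my $in,  '<',  $real) or return 0;
    open(my $out, '>>', $path) or return 0;
    print {$out} $_ while <$in>;
    close $in;
    close $out or return 0;
    return 1;
}
```

With helpers like these, the same call would be made once for the standard output file and once for the standard error file, which also covers the missing standard error statement noted above.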
    
  • Additionally, if deployed in a CHOS environment, the job manager should be modified with the following additions at line 567:
        $ENV{"SGE_ROOT"} = $SGE_ROOT;
        if ( -r "$ENV{HOME}/.chos" ){
          $chos=`cat $ENV{HOME}/.chos`;
          $chos=~s/\n.*//s;   # keep only the first line (/s lets "." match newlines)
          $ENV{CHOS}=$chos;
        }
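As an alternative to the backtick call to cat, the first line of the .chos file can be read directly in Perl. The sketch below assumes, as the snippet above does, that only the first line of the file is meaningful; the helper name read_chos is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a CHOS selector file
# without its trailing newline, or undef if the file cannot be read
# or is empty.
sub read_chos {
    my ($file) = @_;
    return undef unless -r $file;
    open(my $fh, '<', $file) or return undef;
    my $chos = <$fh>;
    close $fh;
    return undef unless defined $chos;
    chomp $chos;
    return $chos;
}

# Set CHOS only when a home directory and a readable .chos file exist.
if (defined $ENV{HOME}) {
    my $chos = read_chos("$ENV{HOME}/.chos");
    $ENV{CHOS} = $chos if defined $chos;
}
```

Reading the file in Perl avoids starting a shell and makes the "first line only" behaviour explicit.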