SGE Job Manager patch
This page should hold a draft of the fixes that we want to send
to the VDT team about the SGE Job Manager.
- Missing environment variable definitions
- Bug in finding the correct job id when clearing jobs
- In the CLEAR section, locate the line
system("$qdel $job_id > /dev/null 2> /dev/null");
and replace it with the following block:
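# Make the SGE installation visible to qdel.
$ENV{"SGE_ROOT"} = $SGE_ROOT;
$ENV{"SGE_CELL"} = $SGE_CELL;
# Keep only the first of the three "|"-separated fields of the job id.
$job_id =~ /(.*)\|(.*)\|(.*)/;
$job_id = $1;
system("$qdel $job_id > /dev/null 2> /dev/null");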
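The job id extraction above can be sketched on its own as follows. In Perl, $1 keeps its value from the last successful match when a later match fails, so this sketch only uses $1 after checking that the match succeeded. The helper name sge_job_id and the sample ids are hypothetical and are not part of the job manager.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the job manager): return the first
# field of a three-field "a|b|c" job id, or the id unchanged when it
# does not have that shape.
sub sge_job_id {
    my ($job_id) = @_;
    if ($job_id =~ /(.*)\|(.*)\|(.*)/) {
        return $1;
    }
    return $job_id;
}

# Made-up sample values, for illustration only.
print sge_job_id("1234|a|b"), "\n";
print sge_job_id("1234"), "\n";
```

Whether the job manager ever sees an id without the separators is not known here; the pass-through branch is only a precaution.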
- The SGE Job Manager modifies both the standard output and standard error
file names by appending .real. This fails when a user specifies /dev/null for either of those files. The problem occurs in two places, first starting at line 318:
#####
# Where to write output and error?
#
if(($description->jobtype() eq "single") && ($description->count() > 1))
{
    #####
    # It's a single job and we use job arrays
    #
    $sge_job_script->print("#\$ -o "
                           . $description->stdout() . ".\$TASK_ID\n");
    $sge_job_script->print("#\$ -e "
                           . $description->stderr() . ".\$TASK_ID\n");
}
else
{
    # [dwm] Don't use real output paths; copy the output there later.
    #       Globus doesn't seem to handle streaming of the output
    #       properly and can result in the output being lost.
    # FIXME: We would prefer continuous streaming.  Try to determine
    #        precisely what's failing so that we can fix the problem.
    #        See Globus bug #1288.
    $sge_job_script->print("#\$ -o " . $description->stdout() . ".real\n");
    $sge_job_script->print("#\$ -e " . $description->stderr() . ".real\n");
}
and then again at line 659:
if(($description->jobtype() eq "single") && ($description->count() > 1))
#####
# Jobtype is single and count>1. Therefore, we used job arrays. We
# need to merge individual output/error files into one.
#
{
    # [dwm] Use append, not overwrite to work around file streaming issues.
    system ("$cat $job_out.* >> $job_out");
    system ("$cat $job_err.* >> $job_err");
}
else
{
    # [dwm] We still need to append the job output to the GASS cache file.
    #       We can't let SGE do this directly because it appears to
    #       *overwrite* the file, not append to it -- which the Globus
    #       file streaming components don't seem to handle properly.
    #       So append the output manually now.
    system("$cat $job_out.real >> $job_out");
}
The snippet of code above is also missing a statement for the standard error.
At the end, instead of:
    # So append the output manually now.
    system("$cat $job_out.real >> $job_out");
}
it should read:
    # So append the output manually now.
    system("$cat $job_out.real >> $job_out");
    system("$cat $job_err.real >> $job_err");
}
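For the /dev/null failure described above, one possible guard is to append .real only when the requested path is not /dev/null. This is a minimal sketch under that assumption, not a tested patch against the job manager; the helper name real_path and the sample paths are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the file name handed to SGE via -o / -e.
# /dev/null is passed through untouched; any other path gets ".real".
sub real_path {
    my ($path) = @_;
    return $path eq "/dev/null" ? $path : "$path.real";
}

# Made-up sample paths, for illustration only.
print real_path("/dev/null"), "\n";
print real_path("/home/user/job.out"), "\n";
```

If such a guard were used when writing the job script, the later cat of the .real file would presumably need to be skipped for /dev/null as well, since no .real file would exist in that case.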
Additionally, if deployed in a CHOS environment, the job manager should be
modified with the following additions at line 567:
$ENV{"SGE_ROOT"} = $SGE_ROOT;
# Export the user's CHOS selection if ~/.chos is readable.
if ( -r "$ENV{HOME}/.chos" ){
    $chos=`cat $ENV{HOME}/.chos`;
    $chos=~s/\n.*//;
    $ENV{CHOS}=$chos;
}
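The same lookup can be written without spawning cat. The sketch below assumes that only the first line of ~/.chos is meaningful, which the original does not state; the helper name first_line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a file without its
# trailing newline, or undef when the file cannot be opened or is empty.
sub first_line {
    my ($file) = @_;
    open(my $fh, '<', $file) or return undef;
    my $line = <$fh>;
    close($fh);
    return undef unless defined $line;
    chomp $line;
    return $line;
}

# Set CHOS only when HOME is known and ~/.chos yields a value.
if (defined $ENV{HOME}) {
    my $chos = first_line("$ENV{HOME}/.chos");
    $ENV{CHOS} = $chos if defined $chos;
}
```

Leaving CHOS unset when the file is missing matches the behavior of the -r test in the block above.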