Commit dd53679

Update derecho shared batch job submission (CICE-Consortium#1091)
Derecho shared node jobs intermittently abort with the error message "start failed on dec2436: No reply from shepherd after 108s" due to PBS/MPI launch conflicts. Derecho qstat output was also recently changed to include completed jobs, which prevented the job checking scripts from detecting that jobs had completed.

- Update Derecho shared batch job submission to both increase the number of shared node jobs and control the number of jobs per shared node by submitting the shared jobs on more cores than needed. In the end, an upgrade to PBS seemed to fix the shared node aborts, so the logic that requests the extra cores is commented out in this PR. Derecho will continue to be closely watched.
- Fix a potential bug in setting ICE_MACHINE_QSTAT if the string has spaces in it.
- Update the job checking logic to skip PBS output lines that show completed jobs by adding grep -iv " historical ". This is far from ideal and not particularly future proof, but PBS qstat has become a mess.
- Update create_fails.csh to identify test suite jobs that failed to run and to generate a script, rerun.csh, that resubmits them.
1 parent 1215d25 commit dd53679

6 files changed: 47 additions & 11 deletions

cice.setup

Lines changed: 1 addition & 1 deletion
@@ -1244,7 +1244,7 @@ EOF0
 
 if ($?ICE_MACHINE_QSTAT) then
 cat >! ${tsdir}/poll_queue.env << EOF0
-setenv ICE_MACHINE_QSTAT ${ICE_MACHINE_QSTAT}
+setenv ICE_MACHINE_QSTAT "${ICE_MACHINE_QSTAT}"
 EOF0
 endif
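
The quoting fix in cice.setup matters once the qstat command carries an option. The following is a minimal Python sketch, not part of the commit, that uses POSIX-style word splitting from `shlex` as an approximation of how csh would tokenize the generated `setenv` line. The value `qstat -x` is an assumed example of a command string with a space in it.

```python
import shlex


def setenv_words(line: str) -> list[str]:
    """Split a command line into words using POSIX quoting rules.

    This only approximates csh tokenization, which is enough to show
    the effect of the added double quotes.
    """
    return shlex.split(line)


# Assumed example value; any command string containing a space behaves the same.
qstat = "qstat -x"

unquoted = setenv_words(f"setenv ICE_MACHINE_QSTAT {qstat}")
quoted = setenv_words(f'setenv ICE_MACHINE_QSTAT "{qstat}"')

print(unquoted)  # ['setenv', 'ICE_MACHINE_QSTAT', 'qstat', '-x']
print(quoted)    # ['setenv', 'ICE_MACHINE_QSTAT', 'qstat -x']
```

Without the quotes, `setenv` receives the option as an extra word instead of as part of the value, which is the failure mode the commit message describes as a potential bug.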

configuration/scripts/cice.batch.csh

Lines changed: 11 additions & 2 deletions
@@ -35,22 +35,31 @@ EOFB
 
 else if (${ICE_MACHINE} =~ derecho*) then
 set memstr = ""
-if (${ncores} <= 8 && ${runlength} <= 1 && ${batchmem} <= 20) then
+set mycorespernode = ${corespernode}
+# trying to avoid shared node launch errors
+#if (${ncores} <= 24 && ${runlength} <= 1 && ${batchmem} <= 20) then
+if (${ncores} <= 16 && ${runlength} <= 1 && ${batchmem} <= 20) then
 set queue = "develop"
+# # set develop cores to 16 or 32 to limit the number of jobs per shared node
+# if (${mycorespernode} < 32) then
+# @ corenum = (${mycorespernode} / 16 + 1) * 16
+# set mycorespernode = ${corenum}
+# endif
 set memstr = ":mem=${batchmem}GB"
 endif
 cat >> ${jobfile} << EOFB
 #PBS -q ${queue}
 #PBS -l job_priority=regular
 #PBS -N ${ICE_CASENAME}
 #PBS -A ${acct}
-#PBS -l select=${nnodes}:ncpus=${corespernode}:mpiprocs=${taskpernodelimit}:ompthreads=${nthrds}${memstr}
+#PBS -l select=${nnodes}:ncpus=${mycorespernode}:mpiprocs=${taskpernodelimit}:ompthreads=${nthrds}${memstr}
 #PBS -l walltime=${batchtime}
 #PBS -j oe
 #PBS -W umask=022
 #PBS -o ${ICE_CASEDIR}
 
 ###PBS -m be
+
 EOFB
 
 else if (${ICE_MACHINE} =~ gadi*) then
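
To make the commented-out core rounding concrete, here is a small Python sketch of the same arithmetic together with the active develop-queue condition. It is an illustration only; the function names are hypothetical, and `//` stands in for the integer division that csh `@` expressions perform.

```python
def uses_develop_queue(ncores: int, runlength: int, batchmem: int) -> bool:
    """Active condition in the diff for routing a job to the develop queue."""
    return ncores <= 16 and runlength <= 1 and batchmem <= 20


def develop_cores(corespernode: int) -> int:
    """Mirror of the commented-out rounding: requests below 32 cores
    are bumped to 16 or 32; larger requests are left unchanged."""
    if corespernode < 32:
        # Integer division, as in csh '@' arithmetic.
        return (corespernode // 16 + 1) * 16
    return corespernode


for cores in (4, 8, 16, 24, 32, 128):
    print(cores, "->", develop_cores(cores))
# 4 -> 16
# 8 -> 16
# 16 -> 32
# 24 -> 32
# 32 -> 32
# 128 -> 128
```

Note that a request of exactly 16 cores rounds up to 32 rather than staying at 16, because the expression always adds one full block of 16 before multiplying.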

configuration/scripts/tests/baseline.script

Lines changed: 2 additions & 1 deletion
@@ -136,7 +136,8 @@ if (${ICE_BFBCOMP} != ${ICE_SPVAL}) then
 set cnt = 0
 if (${job} =~ [0-9]*) then
 while ($qstatjob)
-set qstatus = `${ICE_MACHINE_QSTAT} $job | grep $job | wc -l`
+# historical avoids completed jobs on PBS (-x) and extra $job avoids superfluous header lines
+set qstatus = `${ICE_MACHINE_QSTAT} $job | grep -iv " historical " | grep $job | wc -l`
 # ${ICE_MACHINE_QSTAT} $job
 # echo $job $qstatus
 if ($qstatus == 0) then
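
The updated polling filter can be sketched in Python as follows. This is an approximation for illustration: it uses substring matching where `grep $job` would treat the job id as a regular expression, and the sample qstat text is invented for the example and is not real PBS output.

```python
def active_job_lines(qstat_output: str, job: str) -> int:
    """Count qstat lines that mention the job and are not historical.

    Mirrors: qstat $job | grep -iv " historical " | grep $job | wc -l
    """
    count = 0
    for line in qstat_output.splitlines():
        if " historical " in line.lower():
            continue  # grep -iv " historical "
        if job in line:
            count += 1  # grep $job (substring match as a simplification)
    return count


# Invented sample text, only to exercise the filter.
running = "Job id   Name    State\n1234567  case_a  R\n"
finished = "Job id   Name    State\n1234567  case_a   Historical   F\n"

print(active_job_lines(running, "1234567"))   # 1
print(active_job_lines(finished, "1234567"))  # 0
```

A count of zero is what the surrounding loop treats as "job completed", so dropping the historical lines restores that signal when qstat also reports finished jobs.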

configuration/scripts/tests/create_fails.csh

Lines changed: 24 additions & 1 deletion
@@ -3,9 +3,16 @@
 echo " "
 set tmpfile = create_fails.tmp
 set outfile = fails.ts
+set runfile = rerun.csh
+
+set delim = `pwd | rev | cut -d . -f 1 | rev`
 
 ./results.csh >& /dev/null
-cat results.log | grep ' run\| test' | grep -v "#" | grep -v PASS | cut -f 2 -d " " | sort -u >! $tmpfile
+
+#fails, both "run" and "test" that failed
+# treat decomp special
+cat results.log | grep ' run\| test' | grep -v "#" | grep -v PASS | cut -f 2 -d " " | grep -v _decomp_ | sort -u >! $tmpfile
+cat results.log | grep ' run\| test' | grep -v "#" | grep -v PASS | cut -f 2 -d " " | grep _decomp_ | rev | cut -d _ -f 2- | rev | sort -u >> $tmpfile
 
 echo "# Test Grid PEs Sets" >! $outfile
 foreach line ( "`cat $tmpfile`" )
@@ -16,9 +23,25 @@ foreach line ( "`cat $tmpfile`" )
 set opts = `echo $line | cut -d "_" -f 6- | sed 's/_/,/g'`
 echo "$test $grid $pes $opts" >> $outfile
 end
+rm $tmpfile
 
+#rerun, only "run" that failed
+# treat decomp special
+cat results.log | grep ' run' | grep -v "#" | grep -v PASS | cut -f 2 -d " " | grep -v _decomp_ | sort -u >! $tmpfile
+cat results.log | grep ' run' | grep -v "#" | grep -v PASS | cut -f 2 -d " " | grep _decomp_ | rev | cut -d _ -f 2- | rev | sort -u >> $tmpfile
+
+echo "#/bin/csh" >! $runfile
+foreach line ( "`cat $tmpfile`" )
+#echo $line
+echo "cd ${line}.${delim}; ./*.submit; cd ../; sleep 5" >> $runfile
+end
+chmod +x $runfile
 rm $tmpfile
+
 echo "$0 done"
+echo " "
+echo "Failed runs can be resubmitted by running $runfile"
+echo " "
 echo "Failed tests can be rerun with the test suite file...... $outfile"
 echo "To run a new test suite, copy $outfile to the top directory and do something like"
 echo " ./cice.setup --suite $outfile ..."
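
The selection pipelines in create_fails.csh are dense, so here is a hedged Python sketch of what they appear to compute. The function names, the sample log lines, and the directory path are all hypothetical; only the filtering steps are taken from the diff. Python's `sorted` may also order names slightly differently from `sort -u` under some locales.

```python
def failed_cases(log_lines, kinds=(" run", " test")):
    """Return failed case names, mirroring the grep/cut/sort pipelines."""
    names = []
    for line in log_lines:
        if not any(kind in line for kind in kinds):
            continue  # grep ' run\| test'  (or grep ' run' for reruns)
        if "#" in line or "PASS" in line:
            continue  # grep -v "#" | grep -v PASS
        fields = line.split(" ")
        if len(fields) > 1:
            names.append(fields[1])  # cut -f 2 -d " "
    # Non-decomp names are kept as-is; decomp names lose their last
    # underscore-separated field (rev | cut -d _ -f 2- | rev).
    plain = sorted({n for n in names if "_decomp_" not in n})
    decomp = sorted({n.rsplit("_", 1)[0] for n in names if "_decomp_" in n})
    return plain + decomp


def rerun_commands(cases, cwd):
    """Build the resubmit lines written to rerun.csh."""
    delim = cwd.rsplit(".", 1)[-1]  # pwd | rev | cut -d . -f 1 | rev
    return [f"cd {case}.{delim}; ./*.submit; cd ../; sleep 5" for case in cases]


# Hypothetical results.log content and suite directory.
log = [
    "PASS m_intel_smoke_gx3_1x1 run",
    "FAIL m_intel_smoke_gx3_1x1 test",
    "FAIL m_intel_restart_gx3_8x2 run",
    "FAIL m_intel_restart_gx3_8x2 test",
    "FAIL m_intel_decomp_gx3_4x2_slenderX1 run",
    "FAIL m_intel_decomp_gx3_4x2_roundrobin run",
]

fails = failed_cases(log)
reruns = failed_cases(log, kinds=(" run",))
commands = rerun_commands(reruns, "/scratch/user/testsuite.t01")
print(fails)
print(commands[0])
```

In this sketch, a case whose run passed but whose test failed ends up in the list used for fails.ts but not in the rerun list, and the two decomp entries collapse to a single case name.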

configuration/scripts/tests/poll_queue.csh

Lines changed: 2 additions & 1 deletion
@@ -13,7 +13,8 @@ foreach line ("`cat suite.jobs`")
 set qstatjob = 1
 if (${job} =~ [0-9]*) then
 while ($qstatjob)
-set qstatus = `${ICE_MACHINE_QSTAT} $job | grep $job | wc -l`
+# historical avoids completed jobs on PBS (-x) and extra $job avoids superfluous header lines
+set qstatus = `${ICE_MACHINE_QSTAT} $job | grep -iv " historical " | grep $job | wc -l`
 # echo $job $qstatus
 if ($qstatus == 0) then
 echo "Job $job completed"

doc/source/user_guide/ug_testing.rst

Lines changed: 7 additions & 5 deletions
@@ -448,11 +448,13 @@ which means by default the test suite builds and submits the jobs. By defining
 By leveraging the **cice.setup** command line arguments ``--setup-only``, ``--setup-build``, and ``--setup-build-run`` as well as the environment variables SUITE_BUILD, SUITE_RUN, and SUITE_SUBMIT, users can run **cice.setup** and **suite.submit** in various combinations to quickly setup, setup and build, submit, resubmit, run interactively, or rebuild and resubmit full testsuites quickly and easily. See :ref:`examplesuites` for an example.
 
 The script **create_fails.csh** will process the output from results.csh and generate a new
-test suite file, **fails.ts**, from the failed tests.
-**fails.ts** can then be edited and passed into ``cice.setup --suite fails.ts ...`` to rerun
-subsets of failed tests to more efficiently move thru the development, testing, and
-validation process. However, a full test suite should be run on the final development
-version of the code.
+test suite file, **fails.ts**, from the failed tests. It will also generate a script called
+**rerun.csh** for runs that failed to complete. **rerun.csh** can be executed from the testsuite directory and
+runs that failed to complete will be resubmitted.
+**fails.ts** can be passed into ``cice.setup --suite fails.ts ...`` to setup a new test
+suite based on the failed tests to more efficiently move thru the development, testing, and
+validation process. However, ultimately, once all code changes are complete, a full test suite
+should be run on the final development version of the code.
 
 To report the test results, as is required for Pull Requests to be accepted into
 the main the CICE Consortium code see :ref:`testreporting`.
