Commit dd53679
Update derecho shared batch job submission (CICE-Consortium#1091)
Derecho shared node jobs intermittently abort with error message
"start failed on dec2436: No reply from shepherd after 108s"
due to PBS/MPI launch conflicts. Derecho qstat output was also recently changed to include completed jobs, which prevented the job checking scripts from identifying jobs that had completed.
Update derecho shared batch job submission to both increase the number of shared node jobs and control the number of jobs per shared node by submitting the shared jobs on more cores than needed. In the end, an upgrade to PBS seemed to fix the shared node aborts, so this change was commented out in the PR. Derecho will continue to be closely watched.
Fix potential bug in setting ICE_MACHINE_QSTAT if the string has spaces in it.
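The spaces problem can be illustrated with a minimal sketch. It uses Python's `shlex` module to mimic shell word splitting; the `setenv` line and the value `qstat --hypothetical-flag` are assumptions made for illustration and are not taken from the CICE scripts.

```python
import shlex

# Hypothetical value: the real ICE_MACHINE_QSTAT string is not shown in the commit message.
qstat_cmd = "qstat --hypothetical-flag"

# Without quoting, the shell splits the value on the space into separate words.
unquoted = f"setenv ICE_MACHINE_QSTAT {qstat_cmd}"
# With quoting, the value stays a single word.
quoted = f"setenv ICE_MACHINE_QSTAT {shlex.quote(qstat_cmd)}"

unquoted_words = shlex.split(unquoted)
quoted_words = shlex.split(quoted)

print(unquoted_words)  # the value is spread over two words
print(quoted_words)    # the value is kept as one word
```

`shlex` follows POSIX shell rules rather than csh rules, so this only approximates how the actual scripts behave.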
Update job checking logic to skip PBS output that shows completed jobs by adding -v " historical ". This is far from ideal and not particularly future-proof, but PBS qstat has become a mess.
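The filtering idea can be sketched as follows, assuming the `-v` is an inverse match in the style of `grep -v`. The function name `drop_matching_lines` and the sample text are hypothetical; the real Derecho qstat output format is not shown in the commit message.

```python
def drop_matching_lines(text: str, marker: str) -> list[str]:
    """Return the lines of text that do not contain marker (inverse match)."""
    return [line for line in text.splitlines() if marker not in line]


# Hypothetical qstat-like output, invented purely for illustration.
sample = (
    "1234.host  job_a  user  R\n"
    "1235.host  job_b  user  historical  F\n"
    "1236.host  job_c  user  Q\n"
)

# Keep only lines that do not carry the " historical " marker.
active = drop_matching_lines(sample, " historical ")
for line in active:
    print(line)
```

As the commit message notes, matching on a literal string is fragile: any change to how completed jobs are labeled would defeat the filter.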
Update create fails to identify test suite jobs that failed to run then generate a script to resubmit them.
1 parent 1215d25 commit dd53679
File tree
6 files changed (+47, -11 lines changed)
- configuration/scripts
- tests
- doc/source/user_guide