<script type="text/javascript" language="JavaScript" src='/js/header.js'></script>
<!-- Start content - do not edit above this line -->
<script type='text/javascript' language='JavaScript'>document.querySelector('title').textContent = 'Swarm on Biowulf';</script>
<div class="title">Swarm on Biowulf</div>
<p>
<table width=100%><tr><td>
<table width=270px align=left style="margin-right:10px"><tr><td>
<div class="toc">
<div class="tocHeading" width=25%>Quick Links</div>
<div class="tocItem"><a href="#videos">Video Tutorials</a></div>
<div class="tocItem"><a href="#usage">Usage</a></div>
<div class="tocItem"><a href="#details">Details</a></div>
<div class="tocItem"><a href="#input">Input</a></div>
<div class="tocItem"><a href="#directives">File Directives</a></div>
<div class="tocItem"><a href="#output">Output</a></div>
<div class="tocItem"><a href="#examples">Examples</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#stdin">STDIN/STDOUT</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#fixed">Fixed output path</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#mixed">Mixed asynchronous and serial commands</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#environment">Setting environment variables</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#lscratch">Using local scratch</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#bundling">-b, --bundle</a></div>
<div class="tocItem" style="margin-left:20px"><A href="#gandt">-g and -t (memory and threads)</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#p">-p, --processes-per-subjob</a></div>
<div class="tocItem" style="margin-left:20px"><A href="#time">--time</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#dependency">--dependency</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#module">--module</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#sbatch">--sbatch</a></div>
<div class="tocItem" style="margin-left:20px"><a href="#devel">--devel, --verbose</a></div>
<div class="tocItem"><A href="#generate">Generating a swarm file</a></div>
<div class="tocItem"><A href="#monitor">Monitoring a swarm</a></div>
<div class="tocItem"><A href="#delete">Deleting/Canceling a swarm</a></div>
<div class="tocItem"><a href="#download">Download</a></div>
</div>
</table>
</td><td>
Swarm is a script designed to simplify submitting a group of commands to
the Biowulf cluster. Some programs do not scale well or can't use distributed memory.
Other programs may be 'embarrassingly parallel', in that many independent jobs need to be
run. These programs are well suited to running 'swarms of jobs'.
The swarm script simplifies these computational problems.<br /><br />
Note that swarm is <b><em>NOT</em></b> a workflow manager. It is merely a convenience
wrapper for the Slurm <b><code>sbatch --array</code></b> command.
</td></tr></table>
<p>Swarm reads a list of command lines (termed "commands" or "processes") from a swarm command file (termed the "swarmfile"), then automatically
submits those commands to the batch system to execute.
Command lines in the swarmfile should appear just as they would be entered on a Linux command line.
Swarm encapsulates each command line in a single temporary command script, then submits all command scripts to the Biowulf
cluster as a <a href="http://slurm.schedmd.com/job_array.html">Slurm job array</a>.
By default, swarm runs one command per core on a node, making optimum use of a node.
Thus, a node with 16 cores will run 16 commands <b>in parallel</b>.</p>
<p>For example, create a file that looks something like this (<b>NOTE:</b> lines that begin with a <b>#</b>
character are interpreted as comments and are not executed):</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
# My first swarmfile -- this file is file.swarm
uptime
uptime
uptime
uptime</pre>
<p>Then submit to the batch system:</p>
<pre class="term"><b>[biowulf]$</b> swarm --verbose 1 file.swarm
4 commands run in 4 subjobs, each command requiring 1.5 gb and 1 thread
12345</pre>
<p>This will result in a single <b>job</b> (jobid 12345) of four <b>subjobs</b> (subjobids 0, 1, 2, 3), with each swarmfile line being run independently as a single subjob.
By default, each subjob is allocated 1.5 gb of memory and 1 core (consisting of 2 cpus).
The subjobs will be executed within the same directory from which the swarm was submitted.</p>
<p>The following diagram visualizes how the job array will look:</p>
<pre class="term">------------------------------------------------------------
SWARM
├── subjob 0: 1 command (1 cpu, 1.50 gb)
| ├── uptime
├── subjob 1: 1 command (1 cpu, 1.50 gb)
| ├── uptime
├── subjob 2: 1 command (1 cpu, 1.50 gb)
| ├── uptime
├── subjob 3: 1 command (1 cpu, 1.50 gb)
| ├── uptime
------------------------------------------------------------</pre>
<p>All output will be written to that same directory. By default, swarm will create two output files for each independent subjob, one for
STDOUT and one for STDERR. The format is <em>name</em>_<em>jobid</em>_<em>subjobid</em>.<em>{e,o}</em>:</p>
<pre class="term"><b>[biowulf]$</b> ls
file.swarm swarm_12345_0.o swarm_12345_1.o swarm_12345_2.o swarm_12345_3.o
swarm_12345_0.e swarm_12345_1.e swarm_12345_2.e swarm_12345_3.e</pre>
<!-- ======================================================================================================== -->
<!-- VIDEOS -->
<!-- ======================================================================================================== -->
<div class="heading"><a name="videos"></a>Video Tutorials</div>
<a href="/apps/swarm.html" style="font-size:12px">back to top</a><br />
<iframe width="560" height="315" src="https://www.youtube.com/embed/2skKVOlBXKk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<!-- ======================================================================================================== -->
<!-- USAGE -->
<!-- ======================================================================================================== -->
<div class="heading"><a name="usage"></a>Usage</div>
<a href="/apps/swarm.html" style="font-size:12px">back to top</a><br />
<pre class="term">
Usage: swarm [swarm options] [sbatch options] swarmfile
Basic options:
<a href="#gandt"><b>-g,--gb-per-process</b></a> [float]
gb per process (can be fractions of GB, e.g. 3.5)
<a href="#gandt"><b>-t,--threads-per-process</b></a> [int]/"auto"
threads per process (can be an integer or the word
auto). This option is only valid for
multi-threaded swarms (-p 1)
<a href="#p"><b>-p,--processes-per-subjob</b></a> [int]
processes per subjob (default = 1), this option is
only valid for single-threaded swarms (-t 1)
<a href="#bundling"><b>-b,--bundle</b></a> [int] bundle more than one command line per subjob and
run sequentially (this automatically multiplies the
time needed per subjob)
--noht don't use hyperthreading, equivalent to slurm
option --threads-per-core=1
--usecsh use tcsh as the shell instead of bash
--err-exit exit the subjob immediately on first non-zero exit
status
<a href="#module"><b>-m,--module</b></a> [str] provide a list of environment modules to load prior
to execution (comma delimited)
--no-comment don't ignore text following comment character #
--comment-char [char] use something other than # as the comment character
--maxrunning [int] limit the number of simultaneously running subjobs
--merge-output combine STDOUT and STDERR into a single file per
subjob (.o)
--logdir [dir] directory to which .o and .e files are to be
written (default is current working directory)
--noout completely throw away STDOUT
--noerr completely throw away STDERR
<a href="#time"><b>--time-per-command</b></a> [str] time per command (same as --time)
<a href="#time"><b>--time-per-subjob</b></a> [str] time per subjob, regardless of -b or -p
Development options:
--no-scripts don't create temporary swarm scripts (with --debug
or --devel)
--no-run don't actually run
--debug don't actually run
<a href="#devel"><b>--devel</b></a> combine --debug and --no-scripts, and be very
chatty
<a href="#devel"><b>-v,--verbose</b></a> [int] can range from 0 to 6, with 6 the most verbose
--silent don't give any feedback, just jobid
-h,--help print this help message
-V,--version print version and exit
sbatch options:
-J,--job-name [str] set the name of the job
<a href="#dependency"><b>--dependency</b></a> [str] set up dependency (i.e. run swarm before or after)
<a href="#time"><b>--time</b></a> [str] change the walltime for each subjob (default is
04:00:00, or 4 hours)
<a href="/docs/userguide.html#licenses"><b>-L,--licenses</b></a> [str] obtain software licenses (e.g. --licenses=matlab)
<a href="/docs/userguide.html#partitions"><b>--partition</b></a> [str] change the partition (default is norm)
<a href="#lscratch"><b>--gres</b></a> [str] set generic resources for swarm
--qos [str] set quality of service for swarm
--reservation [str] select a slurm reservation
--exclusive allocate a single node per subjob, same as -t auto
<a href="#sbatch"><b>--sbatch</b></a> [str] add sbatch-specific options to swarm; these options
will be added last, which means that swarm options
for allocation of cpus and memory take precedence
Environment variables:
The following environment variables will affect how sbatch allocates
resources:
SBATCH_JOB_NAME Same as --job-name
SBATCH_TIMELIMIT Same as --time
SBATCH_PARTITION Same as --partition
SBATCH_QOS Same as --qos
SBATCH_RESERVATION Same as --reservation
SBATCH_EXCLUSIVE Same as --exclusive
The following environment variables are set within a swarm:
SWARM_PROC_ID can be 0 or 1
For more information, type "man swarm".</pre>
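<p>As a quick illustration (the option values and swarmfile name below are arbitrary), several of these options can be combined on one command line:</p>
<pre class="term"><b>[biowulf]$</b> swarm -g 4 -t 8 --time 02:00:00 --module samtools file.swarm</pre>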
<!-- ======================================================================================================== -->
<!-- Details -->
<!-- ======================================================================================================== -->
<div class="heading"><a name="details"></a>Details</div>
<a href="/apps/swarm.html" style="font-size:12px">back to top</a><br />
<table width=100%><tr><td>
<img src="/images/swarm_fig_1.png" alt="swarm_fig_1">
</td><td>
A <b>node</b> consists of a hierarchy of resources.
<ul>
<li>A <b>socket</b> is a receptacle on the motherboard for one physically packaged processor; each processor can contain one or more cores.</li>
<li>A <b>core</b> is a complete private set of registers, execution units, and retirement queues needed to execute programs.
Nodes on the biowulf cluster can have 8, 16, or 32 cores.</li>
<li>A <b>cpu</b> has the attributes of one core, but is managed and scheduled as a single logical processor by the operating system.
<b>Hyperthreading</b> is the implementation of multiple cpus on a single core.
All nodes on the biowulf cluster have hyperthreading enabled, with 2 cpus per core.</li>
</ul>
</td></tr></table>
<p>Slurm allocates on the basis of <b>cores</b>. The smallest subjob runs on a single core, meaning the <b>smallest number of cpus that swarm can allocate is 2</b>.</p>
<table width=100%><tr><td>
Swarm reads a swarmfile and creates a single <b>subjob</b> per line. By default a subjob is allocated to a single core.
Each line from a swarmfile has access to <b>2 cpus</b>.
Running swarm with the option <b>-t 2</b> is thus no different than running swarm without the -t option, as both cpus (hyperthreads)
are available to each subjob.
</td><td>
<img src="/images/swarm_fig_2.png" alt="swarm_fig_2">
</td></tr></table>
<table width=100%><tr><td>
<img src="/images/swarm_fig_3.png" alt="swarm_fig_3">
</td><td>
If commands in the swarmfile are multi-threaded, passing the -t option guarantees enough cpus will be available to the generated slurm subjobs.
For example, if the commands require either 3 or 4 threads, giving the <b>-t 3</b> or <b>-t 4</b> option allocates <b>2 cores per subjob</b>.
</td></tr></table>
<p>The nodes on the biowulf cluster are configured to constrain threads within the cores the subjob is allocated. Thus, if a multi-threaded
command exceeds the cpus available, <b>the command will run much slower than normal!</b>
This may not be reflected in the overall cpu load for the node.</p>
<p>
Memory is allocated <b>per subjob</b> by swarm, and is strictly enforced by slurm.
If a single subjob exceeds its memory allocation (by default 1.5 GB per swarmfile line), then
<b>the subjob will be killed by the batch system</b>.
See <a href="#gandt">below</a> for examples on how to allocate threads and memory.
</p>
<p>
More than one swarmfile line can be run per subjob using the <b>-p</b> option. This is only valid for single-threaded
swarms (i.e. <b>-t 1</b>). Under these circumstances, all cpus are used. See <a href="#p">below</a>
for more information on <b>-p</b>.
</p>
<!-- ======================================================================================================== -->
<!-- Input -->
<!-- ======================================================================================================== -->
<div class="heading"><a name="input"></a>Input</div>
<a href="/apps/swarm.html" style="font-size:12px">back to top</a><br />
<h3>The swarmfile</h3>
<p>The only required argument for swarm is a swarmfile. Each line in
the swarmfile is run as a single command. For example, the swarmfile <b><em>file.swarm</em></b></p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
uptime
uptime
uptime
uptime</pre>
<p>when submitted like this</p>
<pre class="term"><b>[biowulf]$</b> swarm file.swarm</pre>
<p>will create a swarm of 4 subjobs, with each subjob running the single command "uptime".</p>
<h3>Bundling</h3>
<p>There are occasions when running a single swarmfile line per subjob is inappropriate, such as when commands
are very short (e.g. a few seconds) or when there are many thousands or millions of commands in a swarmfile. In
these circumstances, it makes more sense to <em><b>bundle</b></em> the swarm. For example, a swarmfile of 10,000
commands when run with a bundle value of 40 will generate 250 subjobs (10000/40 = 250):</p>
<pre class="term"><b>[biowulf]$</b> swarm --devel -b 40 file.swarm
10000 commands run in 250 subjobs, each requiring 1 gb and 1 thread, running 40 commands serially per subjob</pre>
<p><b>NOTE</b>: If a swarmfile results in more than 1000 subjobs, swarm will <b>autobundle the commands automatically</b>.</p>
<p><b>ALSO</b>: The time needed per subjob will be automatically multiplied by the bundle factor. If the total time
per subjob exceeds the maximum walltime of the partition, an error will be given and the swarm will not be submitted.</p>
<h3>Comments</h3>
<p>By default, any text on a single line that follows a <b><em>#</em></b> character is assumed to be a comment,
and is ignored. For example,</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
# Here are my commands
uptime # this gives the current load status
pwd # this gives the current working directory
hostname # this gives the host name</pre>
<p>However, there are some applications that require a <b><em>#</em></b> character in the input:</p>
<pre class="term"><b>[biowulf]$</b> cat odd.file.swarm
bogus_app -n 365#AX -w -another-flag=nonsense > output</pre>
<p>The option <b>--no-comment</b> can be given to avoid removal of text following the <b><em>#</em></b> character.
Alternatively, another comment character can be designated using the <b>--comment-char</b> option.</p>
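<p>For example (a hypothetical sketch; the choice of '%' as the alternate comment character is arbitrary):</p>
<pre class="term"><b>[biowulf]$</b> swarm --no-comment odd.file.swarm
<b>[biowulf]$</b> swarm --comment-char '%' odd.file.swarm</pre>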
<h3>Command lists</h3>
<p>Multiple commands can be run serially (one after the other) when they are separated by a semi-colon (;). This
is also known as a command list. For example,</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
hostname ; date ; sleep 200 ; uptime
hostname ; date ; sleep 200 ; uptime
hostname ; date ; sleep 200 ; uptime
hostname ; date ; sleep 200 ; uptime
<b>[biowulf]$</b> swarm file.swarm</pre>
<p>will create 4 subjobs, each running independently on a single core. Each subjob will run "hostname", followed
by "date", then "sleep 200", then "uptime", all in order.</p>
<h3>Complex commands</h3>
<p>Environment variables can be set, directory locations can be changed, subshells can be spawned all within
a single command list, and conditional statements can be given. For example, if you wanted to run some
commands in a newly created random temporary directory, you could use this:</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
export d=/data/user/${RANDOM} ; mkdir -p $d ; if [[ -d $d ]] ; then cd $d && pwd ; else echo "FAIL" >&2 ; fi
export d=/data/user/${RANDOM} ; mkdir -p $d ; if [[ -d $d ]] ; then cd $d && pwd ; else echo "FAIL" >&2 ; fi
export d=/data/user/${RANDOM} ; mkdir -p $d ; if [[ -d $d ]] ; then cd $d && pwd ; else echo "FAIL" >&2 ; fi
export d=/data/user/${RANDOM} ; mkdir -p $d ; if [[ -d $d ]] ; then cd $d && pwd ; else echo "FAIL" >&2 ; fi</pre>
<p><b>NOTE</b>: By default, command lists are interpreted as bash commands. If a swarmfile contains tcsh- or csh-specific
commands, swarm may fail unless <b>--usecsh</b> is included.</p>
<h3>Line continuation markers</h3>
<p>Application commands can be very long, with dozens of options and flags, and multiple commands separated by
semi-colons. To ease file editing, line continuation markers can be used to break up a single swarm command
into multiple lines. For example, the swarmfile</p>
<pre class="term">cd /data/user/project; KMER="CCCTAACCCTAACCCTAA"; jellyfish count -C -m ${#KMER} -t 32 -c 7 -s 1000000000 -o /lscratch/$SLURM_JOB_ID/39sHMC_Tumor_genomic <(samtools bam2fq /data/user/bam/0A4HMC/DNA/genomic/39sHMC_genomic.md.bam ); echo ${KMER} | jellyfish query /lscratch/$SLURM_JOB_ID/39sHMC_Tumor_genomic_0 > 39sHMC_Tumor_genomic.telrpt.count</pre>
<p>can be written like this:</p>
<pre class="term">cd /data/user/project; KMER="CCCTAACCCTAACCCTAA"; \
jellyfish count -C \
-m ${#KMER} \
-t 32 \
-c 7 \
-s 1000000000 \
-o /lscratch/$SLURM_JOB_ID/39sHMC_Tumor_genomic \
<(samtools bam2fq /data/user/bam/0A4HMC/DNA/genomic/39sHMC_genomic.md.bam ); \
echo ${KMER} | jellyfish query /lscratch/$SLURM_JOB_ID/39sHMC_Tumor_genomic_0 > 39sHMC_Tumor_genomic.telrpt.count</pre>
<h3>Modules</h3>
<p><a href="modules.html">Environment modules</a> can be loaded for an entire swarm using the <b>--module</b>
option. For example:</p>
<pre class="term">swarm --module python,tophat,ucsc,samtools,vcftools -g 4 -t 8 file.swarm</pre>
<h3><a name="directives"></a>Swarmfile Directives</h3>
<p>All swarm options can be incorporated into the swarmfile using swarmfile directives. Options preceded by <b><tt>#SWARM</tt></b> in the swarmfile (flush against the left side) will be evaluated the same as command line options.</p>
<p>For example, if the contents of swarmfile is as follows:</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
#SWARM -t 4 -g 20 --time 40
command arg1
command arg2
command arg3
command arg4</pre>
<p>and is submitted like so:</p>
<pre class="term"><b>[biowulf]$</b> swarm file.swarm</pre>
<p>then each subjob will request 4 cpus, 20 GB of RAM and 40 minutes of walltime.</p>
<p>Multiple lines of swarmfile directives can be inserted, like so:</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
#SWARM --threads-per-process 8
#SWARM --gb-per-process 8
#SWARM --sbatch '--mail-type=FAIL --export=var=100,nctype=12 --chdir=/data/user/test'
#SWARM --logdir /data/user/swarmlogs
command
command
command
command</pre>
<p>The precedence for options is handled in the same way as sbatch, but with options provided with the <b><tt>--sbatch</tt></b> option last:</p>
<pre> command line > environment variables > swarmfile directives > --sbatch options </pre>
<p>Thus, if the swarmfile has:</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
#SWARM -t 4 -g 20 --time 40 --partition norm
command arg1
command arg2
command arg3
command arg4</pre>
<p>and is submitted like so:</p>
<pre class="term"><b>[biowulf]$</b> SBATCH_PARTITION=quick swarm -g 10 --time=10 file.swarm</pre>
<p>then each subjob will request 4 cpus, 10 GB of RAM and 10 minutes of walltime. The amount of memory and walltime requested with command line options and the partition chosen with the <b><tt>SBATCH_PARTITION</tt></b> environment variable supersedes the amount requested with swarmfile directives.</p>
<p><b>NOTE:</b> All lines with correctly formatted <b><tt>#SWARM</tt></b> directives will be removed even if <b>--no-comment</b> or a non-default <b>--comment-char</b> is given.</p>
<!-- ======================================================================================================== -->
<!-- Output -->
<!-- ======================================================================================================== -->
<div class="heading"><a name="output"></a>Output</div>
<a href="/apps/swarm.html" style="font-size:12px">back to top</a><br />
<h3>Default output files</h3>
<p>STDOUT and STDERR output from subjobs executed under swarm will be
directed to a file named <b>swarm_<em>jobid_subjobid</em>.o</b> and <b>swarm_<em>jobid_subjobid</em>.e</b>, respectively. </p>
<p class="alert">Please pay attention to the memory requirements of your swarm jobs!
When a swarm job runs out of memory, the node stalls and the job is eventually killed or
dies.
At the bottom of the .e file, you may see a warning like this:</p>
<pre class="term">slurmstepd: Exceeded job memory limit at some point. Job may have been partially swapped out to disk.</pre>
<p> If a job dies before it is finished, this output may not be available. Contact
<a href="mailto:[email protected]">[email protected]</a> when you have a question about why
a swarm stopped prematurely.</p>
<h3>Renaming output files</h3>
<p>The sbatch option <b>--job-name</b> can be used to rename the default output files.</p>
<pre class="term"><b>[biowulf]$</b> swarm -f file.swarm --job-name programAOK
...
<b>[biowulf]$</b> ls
programAOK_21381_0.e programAOK_21381_2.e programAOK_21381_4.e programAOK_21381_6.e
programAOK_21381_0.o programAOK_21381_2.o programAOK_21381_4.o programAOK_21381_6.o
programAOK_21381_1.e programAOK_21381_3.e programAOK_21381_5.e programAOK_21381_7.e
programAOK_21381_1.o programAOK_21381_3.o programAOK_21381_5.o programAOK_21381_7.o</pre>
<h3>Combining STDOUT and STDERR into a single file per subjob</h3>
<p>Including the <b>--merge-output</b> option will cause the STDERR output to be combined into the file used
for STDOUT. For swarm, that means the content of the .e files are written to the .o file. Keep in mind that
interweaving of content will occur.</p>
<pre class="term"><b>[biowulf]$</b> swarm --merge-output file.swarm
...
<b>[biowulf]$</b> ls
swarm_50158339_0.o swarm_50158339_1.o swarm_50158339_4.o swarm_50158339_7.o
swarm_50158339_10.o swarm_50158339_2.o swarm_50158339_5.o swarm_50158339_8.o
swarm_50158339_11.o swarm_50158339_3.o swarm_50158339_6.o swarm_50158339_9.o</pre>
<h3>Writing output files to a separate directory</h3>
<p>By default, the STDOUT and STDERR files are written to the same directory from which the swarm
was submitted. To redirect the files to a different directory, use <b>--logdir</b>:</p>
<pre class="term">swarm --logdir /path/to/another/directory file.swarm</pre>
<h3>Redirecting output</h3>
<P>Input/output redirects (and everything in the swarmfile) should be bash compatible. For example,</p>
<pre class="term"><b>[biowulf]$</b> cat bash_file.swarm
program1 -o -f -a -n 1 > output1.txt 2>&1
program1 -o -f -a -n 2 > output2.txt 2>&1
<b>[biowulf]$</b> swarm bash_file.swarm</pre>
<p>csh-style redirects like '<b>program >& output</b>' will not work correctly unless
the <b>--usecsh</b> option is included. For example,</p>
<pre class="term"><b>[biowulf]$</b> cat csh_file.swarm
program1 -o -f -a -n 1 >& output1.txt
program1 -o -f -a -n 2 >& output2.txt
<b>[biowulf]$</b> swarm <b>--usecsh</b> csh_file.swarm</pre>
<p>Be aware of programs that write directly to a file using a fixed filename.
A file will be overwritten and garbled if multiple processes are writing to the same file.
If you run multiple instances of such programs, then for each instance you will
need to either a) change the name of the file in the command <b>or</b> b) alter the path to the file. See
the <b>EXAMPLES</b> section for some ideas.</p>
<!-- ======================================================================================================== -->
<!-- EXAMPLES -->
<!-- ======================================================================================================== -->
<div class="heading"><a name="examples"></a>Examples</div>
<a href="/apps/swarm.html" style="font-size:12px">back to top</a><br />
<table><tr><td>
<table width=270px align=left style="margin-right:10px;"><tr><td>
<div class="toc">
<div class="tocHeading">Quick Links</div>
<div class="tocItem"><a href="#stdin">STDIN/STDOUT</a></div>
<div class="tocItem"><a href="#bundling">-b, --bundle</a></div>
<div class="tocItem"><a href="#gandt">-g and -t</a></div>
<div class="tocItem"><a href="#p">-p, --processes-per-subjob</a></div>
<div class="tocItem"><a href="#time">--time</a></div>
<div class="tocItem"><a href="#dependency">--dependency</a></div>
<div class="tocItem"><a href="#fixed">Fixed output path</a></div>
<div class="tocItem"><a href="#mixed">Mixed asynchronous and serial commands</a></div>
<div class="tocItem"><a href="#module">--module</a></div>
<div class="tocItem"><a href="#environment">Setting environment variables</a></div>
<div class="tocItem"><a href="#sbatch">--sbatch</a></div>
<div class="tocItem"><a href="#devel">--devel, --verbose</a></div>
</div>
</table>
</td><td>
To see how swarm works, first create a file containing a few simple
commands, then use swarm to submit them to the batch queue:
</td></tr></table>
<pre class="term"><b>[biowulf]$</b> cat > file.swarm
date
hostname
ls -l
^D
<b>[biowulf]$</b> swarm file.swarm</pre>
<p>Use <b>sjobs</b> to monitor the status of your request; an
"R" in the "St"atus column indicates your job is running.
This particular example will probably run to completion
before you can give the sjobs command. To see the output from the commands, see
the files named <b>swarm_<em>#_#</em>.o</b>.</p>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- STDIN/STOUT -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="stdin"></a><b>A program that reads to STDIN and writes to STDOUT</b></p>
<p>For each invocation of the program the names for the input and output files
vary:</p>
<pre class="term"><b>[biowulf]$</b> cat > runbix
./bix < testin1 > testout1
./bix < testin2 > testout2
./bix < testin3 > testout3
./bix < testin4 > testout4
^D</pre>
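<p>The swarmfile is then submitted as usual:</p>
<pre class="term"><b>[biowulf]$</b> swarm runbix</pre>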
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- bundling -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="bundling"></a><b>Bundling large numbers of commands</b></p>
<p class="alert">By default any swarmfile with > 1000 commands will be <b>autobundled</b>
unless it is deliberately bundled with the <b>-b</b> flag.</p>
<p>If you have over 1000 commands, especially if each one runs for a short
time, you should 'bundle' your jobs with the <b>-b</b> flag. For example, if the
swarmfile contains 2560 commands, the following swarm command will group them into
bundles of 40 commands each, producing 64 command bundles. Swarm will then submit the
64 command bundles, rather than the 2560 commands individually, as a single swarm job.
This would result in a swarm of 64 (2560/40) subjobs.</p>
<pre class="term"><b>[biowulf]$</b> swarm -b 40 file.swarm</pre>
<p>Note that commands in a bundle will run sequentially on the assigned node.</p>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- gandt -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="gandt"></a><b>Allocating memory and threads with -g and -t options</b></p>
<p>If the subjobs require significant amounts of memory (> 1.5 GB) or threads (> 1 per core), a swarm can
run fewer subjobs per node than the number of cores available
on a node. For example, if the commands in a swarmfile need up to 40 GB of
memory each using 8 threads, running swarm with --devel shows what might happen:</p>
<pre class="term"><b>[biowulf]$</b> swarm -g 40 -t 8 --devel file.swarm
14 commands run in 14 subjobs, each requiring 40 gb and 8 threads</pre>
<p>If a command needs to use as many cpus on a node as possible, add the option <b>-t auto</b>.
This allocates an entire node exclusively to each subjob, allowing
the subjob to use all available cpus on the node.</p>
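<p>For example, to give every command in the swarmfile exclusive use of a node:</p>
<pre class="term"><b>[biowulf]$</b> swarm -t auto file.swarm</pre>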
<p class="alert">The default partition <b>norm</b> has nodes with a maximum of 248GB memory. If <b>-g</b> exceeds 373GB, swarm will give a warning message:</p>
<pre class="term"><b>[biowulf]$</b> swarm -g 400 file.swarm
ERROR: -g 400 requires --partition largemem</pre>
<p>To allocate more than 373GB of memory per command, include <b>--partition largemem</b>:</p>
<pre class="term"><b>[biowulf]$</b> swarm -g 500 --partition largemem file.swarm</pre>
<p>For more information about partitions, please see <a href="/docs/userguide.html#partitions">https://hpc.nih.gov/docs/userguide.html#partitions</a></p>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- p -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="p"></a><b>Using -p option to "pack" commands</b></p>
<p>By default, swarm allocates a single command line per subjob. If the command is single-threaded, then swarm wastes half the
cpus allocated, because the slurm batch system allocates no less than a single core (or two cpus) per subjob. This effect
can be seen using the <b>jobload</b> command for a 4-command swarm:</p>
<pre class="term"><b>[biowulf]$</b> swarm file.swarm
219433
<b>[biowulf]$</b> jobload -u user
JOBID TIME NODES CPUS THREADS LOAD MEMORY
Alloc Running Used/Alloc
219433_3 0:37 cn0070 2 1 50% 1.0 GB/1.5 GB
219433_2 0:37 cn0070 2 1 50% 1.0 GB/1.5 GB
219433_1 0:37 cn0069 2 1 50% 1.0 GB/1.5 GB
219433_0 0:37 cn0069 2 1 50% 1.0 GB/1.5 GB
USER SUMMARY
Jobs: 2
Nodes: 2
CPUs: 4
Load Avg: 50%</pre>
<p>To use all the cpus allocated to a single-threaded swarm, the <b>-p</b> option sets the number
of commands run per subjob. With <b>-p 2</b>, half as many subjobs are created, each using twice as many cpus and twice as much memory:</p>
<pre class="term"><b>[biowulf]$</b> swarm -p 2 file.swarm
219434
<b>[biowulf]$</b> jobload -u user
JOBID TIME NODES CPUS THREADS LOAD MEMORY
Alloc Running Used/Alloc
219434_1 0:24 cn0069 2 2 100% 2.0 GB/3.0 GB
219434_0 0:24 cn0069 2 2 100% 2.0 GB/3.0 GB
USER SUMMARY
Jobs: 2
Nodes: 2
CPUs: 4
Load Avg: 100%</pre>
<p>In this case, we are "packing" 2 commands per subjob.</p>
<p class="alert"><b>NOTE:</b> The cpus on the biowulf cluster are <i>hypercores</i>, and some programs run more inefficiently
when packed onto hypercores. Please test your application to see if it actually benefits from running two commands per core rather than one.</p>
<p>Keep in mind:</p>
<ul>
<li><b>-p</b> is only available to single-threaded swarms (i.e. <b>-t 1</b>).</li>
<li>The default file output format is different using <b>-p</b>. The file names end with an extra suffix indicating the cpu from the subjob:</li>
</ul>
<pre class="term">
<b>[biowulf]$</b> swarm -p 2 ../file.swarm
14 commands run in 7 subjobs, each command requiring 1.5 gb and 1 thread, packing 2 processes per subjob
221574
<b>[biowulf]$</b> ls
swarm_221574_0_0.e swarm_221574_1_1.e swarm_221574_3_0.e swarm_221574_4_1.e swarm_221574_6_0.e
swarm_221574_0_0.o swarm_221574_1_1.o swarm_221574_3_0.o swarm_221574_4_1.o swarm_221574_6_0.o
swarm_221574_0_1.e swarm_221574_2_0.e swarm_221574_3_1.e swarm_221574_5_0.e swarm_221574_6_1.e
swarm_221574_0_1.o swarm_221574_2_0.o swarm_221574_3_1.o swarm_221574_5_0.o swarm_221574_6_1.o
swarm_221574_1_0.e swarm_221574_2_1.e swarm_221574_4_0.e swarm_221574_5_1.e
swarm_221574_1_0.o swarm_221574_2_1.o swarm_221574_4_0.o swarm_221574_5_1.o</pre>
<p>In the case where each swarm subjob must create or use a unique directory or file, an environment variable <b><tt>SWARM_PROC_ID</tt></b> is
available to distinguish between the 0 and 1 processes running with -p 2.</p>
<p>For example, in order to create a unique directory in allocated /lscratch for each subjob, this bash code example can be used:</p>
<pre class="term">
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
export TAG=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID} && mkdir /lscratch/${SLURM_JOB_ID}/${TAG} && touch /lscratch/${SLURM_JOB_ID}/${TAG}/foo.{0..4} && tar czf /data/user/${TAG}.tgz /lscratch/${SLURM_JOB_ID}/${TAG}/foo.*
</pre>
<p>In this case, while the files created within each distinct <b><tt>/lscratch/${SLURM_JOB_ID}/${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}_${SWARM_PROC_ID}</tt></b> directory are identical to all
the other swarm subjobs, the final tarball is unique:</p>
<pre class="term">
<b>[biowulf]$</b> ls /data/user/output
221574_0_0.tgz 221574_0_1.tgz 221574_1_0.tgz 221574_1_1.tgz 221574_2_0.tgz 221574_2_1.tgz
221574_3_0.tgz 221574_3_1.tgz 221574_4_0.tgz 221574_4_1.tgz 221574_5_0.tgz 221574_5_1.tgz
221574_6_0.tgz 221574_6_1.tgz
</pre>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- time option -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="time"></a><b>Setting walltime with --time</b></p>
<P>By default all jobs and subjobs have a walltime of 4 hours. If a swarm subjob exceeds its walltime, <b>it will be killed!</b>
On the other hand, if your swarm subjobs have a very short walltime, then their priority on the queue may be elevated. Therefore,
it is best practice to set a walltime using the <b>--time</b> option that reflects the estimated execution time of the subjobs.
For example, if the command lines in a swarm are expected to require no more than half an hour to complete, the swarm command should be:</p>
<pre class="term"><b>[biowulf]$</b> swarm --time 00:30:00 file.swarm</pre>
<p>Because a subjob is expected to be running a single command from the swarmfile, the value of <b>--time</b> can be considered
the amount of time to run a single command. When a swarm is bundled, the value for <b>--time</b> is then
multiplied by the bundle factor. For example, if
a swarm that normally creates 64 commands is bundled to run 4 commands serially, the value of <b>--time</b> is
multiplied by 4:</p>
<pre class="term"><b>[biowulf]$</b> swarm <b>--time 00:30:00 -b 4</b> --devel file.swarm
64 commands run in 16 subjobs, each command requiring 1.5 gb and 1 thread, running 4 processes serially per subjob
sbatch --array=0-15 --job-name=swarm <b>--time=2:00:00</b> --cpus-per-task=2 --partition=norm --mem=1536</pre>
<p>If a swarm has more than 1000 commands and is autobundled, there is a chance that the time requested will exceed
the maximum allowed. In that case, an error will be thrown:</p>
<pre class="term">
ERROR: Total time for bundled commands is greater than partition walltime limit.
Try lowering the time per command (--time=04:00:00), lowering the bundle factor
(if not autobundled), picking another partition, or splitting up the swarmfile.</pre>
<p>See the <a href="/docs/userguide.html#wall"> Biowulf User Guide for a discussion of walltime limits</a>.</p>
<p>There are two additional options for setting the time of a swarm. <b>--time-per-command</b> is identical to <b>--time</b>, and
merely serves as a more obvious explanation of time allocation.</p>
<p><b>--time-per-subjob</b> overrides the time adjustments applied when <a href="#bundling">bundling</a> or <a href="#p">packing</a> commands.
This option can be used when a single command takes less than 1 minute to complete and there are a high number of commands bundled per
subjob:</p>
<pre class="term"><b>[biowulf]$</b> swarm <b>--time-per-subjob 00:30:00 -b 4</b> --devel file.swarm
64 commands run in 16 subjobs, each command requiring 1.5 gb and 1 thread, running 4 processes serially per subjob
sbatch --array=0-15 --job-name=swarm <b>--time=30:00</b> --cpus-per-task=2 --partition=norm --mem=1536</pre>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- dependency -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="dependency"></a><b>Handling job dependencies</b></p>
<P>
If a swarm is run as a single step in a pipeline, job dependencies can be handled with the <b>--dependency</b> option.
For example, a first script (first.sh) is to be run to generate some initial data files. Once this job is finished, a swarm of
commands (swarmfile.txt) is run to take the output of the first script and process it. Then, a last script (last.sh) is run
to consolidate the output of the swarm and further process it into its final form.
</P>
<P>
Below, the swarm is run with a dependency on the first script. Then the last script is run with a dependency on the swarm.
The swarm will sit in a pending state until the first job (10001) is completed, and the last job will sit in a pending state until
the entire swarm (10002) is completed.
</P>
<pre class="term"><b>[biowulf]$</b> sbatch first.sh
10001
<b>[biowulf]$</b> swarm --dependency afterany:10001 file.swarm
10002
<b>[biowulf]$</b> sbatch --dependency=afterany:10002 last.sh
10003</pre>
<P>
The jobid of a job can be captured from the sbatch command and passed to subsequent submissions in a script (master.sh).
For example, here is a bash script which automates the above procedure, passing its first argument ($1) to the first script. In this way,
the master script can be reused for different inputs:</p>
<pre class="term"><b>[biowulf]$</b> cat master.sh
#!/bin/bash
jobid1=$(sbatch first.sh $1)
echo $jobid1
jobid2=$(swarm --dependency afterany:$jobid1 file.swarm)
echo $jobid2
jobid3=$(sbatch --dependency=afterany:$jobid2 last.sh)
echo $jobid3</pre>
<P>Now, master.sh can be submitted with a single argument</p>
<pre class="term"><b>[biowulf]$</b> bash master.sh mydata123
10001
10002
10003
<b>[biowulf]$</b></pre>
<p>You can check on the job status using squeue:</p>
<pre class="term"><b>[biowulf]$</b> squeue -u user
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
10002_[0-3] norm swarm user PD 0:00 1 (Dependency)
10003 norm last.sh user PD 0:00 1 (Dependency)
10001 norm first.sh user R 0:33 1 cn0121</pre>
<P>
The dependency key 'afterany' means run only after the entire job finishes, regardless of its exit status. Swarm passes the exit
status of the last command executed back to Slurm, and Slurm consolidates all the exit statuses of the subjobs in the job array into
a single exit status.
</p>
<P>The final statuses for the jobs can be seen with sacct. The individual subjobs from swarm are designated
by <b>jobid_subjobid</b>:</p>
<pre class="term"><b>[biowulf]$</b> sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10001 first.sh norm user 2 COMPLETED 0:0
10001.batch batch user 1 COMPLETED 0:0
10002_3 swarm norm user 2 FAILED 2:0
10002_3.bat+ batch user 1 FAILED 2:0
10003 last.sh norm user 2 COMPLETED 0:0
10003.batch batch user 1 COMPLETED 0:0
10002_0 swarm norm user 2 COMPLETED 0:0
10002_0.bat+ batch user 1 COMPLETED 0:0
10002_1 swarm norm user 2 COMPLETED 0:0
10002_1.bat+ batch user 1 COMPLETED 0:0
10002_2 swarm norm user 2 COMPLETED 0:0
10002_2.bat+ batch user 1 COMPLETED 0:0</pre>
<p>If any of the subjobs in the swarm failed, the job is marked as <b>FAILED</b>. In almost all cases, it is better
to rely on <b>afterany</b> rather than <b>afterok</b>, since the latter may cause the dependent job to
remain queued forever:</P>
<pre class="term"><b>[biowulf]$</b> sjobs
................Requested............................
User JobId JobName Part St Runtime Nodes CPUs Mem Dependency Features Nodelist
user 10003 last.sh norm PD 0:00 1 1 2.0GB/cpu afterok:10002_* (null) (DependencyNeverSatisfied)</pre>
<p>See <a href="/docs/userguide.html#depend">the Biowulf User Guide</a>, or <a href="http://slurm.schedmd.com/job_exit_code.html">
SchedMD for a discussion on how Slurm handles exit codes</a>.</P>
<p class="alert">NOTE: Setting <b>-p</b> causes multiple commands to run per subjob. Because of this, the exit status of the
subjob can come from any of the multiple processes in the subjob. </p>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- Fixed filepath -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="fixed"></a><b>A program that writes to a fixed filepath</b></p>
<p>If a program writes to a fixed filename, then you may need to run the
program in different directories. First create the necessary directories (for
instance run1, run2), and in the swarmfile cd to the unique output
directory before running the program: (cd using either an absolute path
beginning with "/" or a relative path from your home directory). Lines with
leading "#" are considered comments and ignored.</p>
<pre class="term"><b>[biowulf]$</b> cat > file.swarm
# Run ped program using different directory
# for each run
cd pedsystem/run1; ../ped
cd pedsystem/run2; ../ped
cd pedsystem/run3; ../ped
cd pedsystem/run4; ../ped
...
<b>[biowulf]$</b> swarm file.swarm</pre>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- mixed -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="mixed"></a><b>Running mixed asynchronous and serial commands in a swarm</b></p>
<p>There are occasions when a single swarm command can contain a mixture of asynchronous and serial commands. For
example, collating the results of several commands into a single output and then running another command on the pooled
results. If run interactively, it would look like this:</p>
<pre class="term">
<b>[biowulf]$</b> cmdA < inp.1 > out.1
<b>[biowulf]$</b> cmdA < inp.2 > out.2
<b>[biowulf]$</b> cmdA < inp.3 > out.3
<b>[biowulf]$</b> cmdA < inp.4 > out.4
<b>[biowulf]$</b> cmdB -i out.1 -i out.2 -i out.3 -i out.4 > final_result
</pre>
<p>It would be more efficient if the four <b>cmdA</b> commands could run asynchronously (in parallel), and then
the last <b>cmdB</b> command would wait until they were all done and then run, all on the same node and in the same
swarm command. This can be achieved by running the <b>cmdA</b> commands as background processes in a subshell, using this one-liner in a swarmfile:</p>
<pre class="term">
( cmdA < inp.1 > out.1 & cmdA < inp.2 > out.2 & \
cmdA < inp.3 > out.3 & cmdA < inp.4 > out.4 & wait ) ; \
cmdB -i out.1 -i out.2 -i out.3 -i out.4 > final_result
</pre>
<p>Here, the <b>cmdA</b> commands are all run asynchronously in four background processes, and the <b>wait</b> command
is given to prevent <b>cmdB</b> from running until all the background processes are finished. Note that line
continuation markers were used for easier editing.</p>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- module -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="module"></a><b>Using --module option</b></p>
<p>It is sometimes difficult to set the environment properly before running commands. The
easiest way to do this on Biowulf is with <a href="modules.html">
environment modules</a>. Running commands via swarm complicates the issue, because the modules
must be loaded prior to every line in the swarmfile. Instead, you can use the <b>--module</b>
option to load a list of modules:</p>
<pre class="term"><b>[biowulf]$</b> swarm --module ucsc,matlab,python/2.7 file.swarm</pre>
<p>Here, the environment is set to use the UCSC executables, Matlab, and an older, non-default
version of Python.</p>
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- lscratch -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="lscratch"></a><b>Using local scratch</b></p>
<p><a href="http://hpc.nih.gov/docs/userguide.html#local">Local scratch disk space is NOT automatically available under Slurm</a>. Instead, local scratch disk space
is allocated using <b>--gres</b>. Here is an example of how to allocate 200GB of local scratch disk space for <u>each swarm command</u>:</p>
<pre class="term"><b>[biowulf$</b> swarm --gres=lscratch:200 file.swarm</pre>
<p>Including <b>--gres=lscratch:<i>N</i></b>, where <b><i>N</i></b> is the number of GB required, will create a subdirectory on the node
corresponding to the jobid, e.g.:</p>
<pre class="term"><b>/lscratch/987654/</b></pre>
<p>This local scratch directory can be accessed dynamically using the <b>$SLURM_JOB_ID</b> environment variable:</p>
<pre class="term"><b>/lscratch/$SLURM_JOB_ID/</b></pre>
<p>/lscratch/$SLURM_JOB_ID is a <b>temporary work directory</b>. Each swarm subjob should do most if not all of its work in this temporary work directory. This means that any input data should be copied to /lscratch before running any commands, and the output should be copied back to the original location after completion.</p>
<p>Here is a generic example of how to use /lscratch in a swarm:</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
TWD=/lscratch/$SLURM_JOB_ID; cp input1 $TWD; cmd -i $TWD/input1 -o $TWD/output1; cp $TWD/output1 .
TWD=/lscratch/$SLURM_JOB_ID; cp input2 $TWD; cmd -i $TWD/input2 -o $TWD/output2; cp $TWD/output2 .
TWD=/lscratch/$SLURM_JOB_ID; cp input3 $TWD; cmd -i $TWD/input3 -o $TWD/output3; cp $TWD/output3 .
TWD=/lscratch/$SLURM_JOB_ID; cp input4 $TWD; cmd -i $TWD/input4 -o $TWD/output4; cp $TWD/output4 .</pre>
<p>Local scratch space is allocated <b>per subjob</b>. By default, that means each command or command list (single line in
swarmfile) is allocated its own independent local scratch space. <b>HOWEVER</b>, there are two situations where some
thought must be given to local scratch space:</p>
<ul>
<li><b>bundled swarms</b> - <a href="#bundling">Bundled swarms</a> serialize multiple commands into a single subjob. Since local scratch space is
allocated per subjob, each command in the subjob inherits the same local scratch space, and each
command should be written to deal with any "leftover" files from the previous commands. A simple solution might be to
clean out the local scratch space at the end of each command. For example:<br>
<pre class="term">cd /lscratch/$SLURM_JOB_ID ; command1 arg1 arg2 ; rm -rf /lscratch/$SLURM_JOB_ID/*</pre>
</li>
<li><b>-p 2</b> - If the <tt><b><a href="#p">-p 2</a></b></tt> option is given to swarm, then the allocated local scratch space is shared
between the 2 commands in a single subjob. In this case, make sure to allocate <b>twice</b> as much local scratch space as
normal, as shown in the example after this list.</li>
</ul>
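<p>For example, if each command needs 200GB of local scratch space and <b>-p 2</b> is used, request double that amount per subjob (a sketch based on the allocation shown above):</p>
<pre class="term"><b>[biowulf]$</b> swarm -p 2 --gres=lscratch:400 file.swarm</pre>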
<center><hr width="500" /></center>
<a href="/apps/swarm.html" style="font-size:12px; float: right;">back to top</a></p>
<!-- ------------------------------------------------------------------------------------- -->
<!-- environment variables -->
<!-- ------------------------------------------------------------------------------------- -->
<p><a name="environment"></a><b>Setting environment variables</b></p>
<P>If an entire swarm requires one or more environment variables to be set, the sbatch option <b>--export</b>
can be used to set the variables prior to running. In this example, we need to set the BOWTIE_INDEXES environment variable
to the correct path for all subjobs in the swarm:</P>
<pre class="term"><b>[biowulf]$</b> swarm --sbatch "--export=BOWTIE_INDEXES=/fdb/igenomes/Mus_musculus/UCSC/mm9/Sequence/BowtieIndex/" file.swarm</pre>
<p class="alert"><b>NOTE:</b> Environment variables set with the <b>--sbatch "--export="</b> option are defined
<b>PRIOR</b> to the job being submitted. This means they cannot be set using Slurm-generated environment
variables, such as $SLURM_JOB_ID or $SLURM_MEM_PER_NODE.</p>
<p>However, if each command line in the swarm requires a unique set of environment variables, this must be done in the swarmfile. For example, setting TMPDIR to a unique subdirectory of /lscratch/$SLURM_JOB_ID:</p>
<pre class="term"><b>[biowulf]$</b> cat file.swarm
export TMPDIR=/lscratch/$SLURM_JOB_ID/xyz1; mkdir $TMPDIR; cmdxyz -x 1 -y 1 -z 1
export TMPDIR=/lscratch/$SLURM_JOB_ID/xyz2; mkdir $TMPDIR; cmdxyz -x 2 -y 2 -z 2