radically changed how swarms are cleaned up
davidhoover committed Oct 8, 2019
1 parent 5feaacb commit abea638
Showing 4 changed files with 598 additions and 65 deletions.
12 changes: 6 additions & 6 deletions Makefile
@@ -1,15 +1,15 @@
help:
@echo targets: help swarm swarm_manager webpages manpages
@echo targets: help swarm swarm_cleanup webpages manpages

swarm:
@echo installing swarm
install -p -m 0755 -o root -g root swarm /usr/local/bin
ls -l /usr/local/bin/swarm

swarm_manager:
@echo installing swarm_manager
install -p -m 0740 -o helixmon -g staff swarm_manager /usr/local/sbin
ls -l /usr/local/sbin/swarm_manager
swarm_cleanup:
@echo installing swarm_cleanup.pl
install -p -m 0740 -o root -g staff swarm_cleanup.pl /usr/local/sbin
ls -l /usr/local/sbin/swarm_cleanup.pl

webpages:
@echo installing webpages
@@ -21,4 +21,4 @@ manpages:
install -p -m 0644 -o root -g root swarm.1 /usr/local/share/man/man1
ls -l /usr/local/share/man/man1/swarm.1

.PHONY: swarm swarm_manager webpages manpages
.PHONY: swarm swarm_cleanup webpages manpages
128 changes: 70 additions & 58 deletions README.md
@@ -45,55 +45,33 @@ If a submission fails, then no symlink will be created.

When a user runs swarm in development mode (--devel), no temporary directory or files are created.

## Clean up
## Testing

Because the space in /spin1/swarm is limited, old directories need to be removed. We want to keep the directories and files around for a while to use in investigations, but not forever. The leftovers are cleaned up daily by /usr/local/sbin/swarm_manager in a root cron job on biowulf. At the moment, subdirectories and their accompanying symlinks are deleted either when the full swarm ended at least 5 days ago or, if the swarm was never run, when the modification time is more than 5 days old.
Swarm has several options for testing things.

When run in --dry-run mode, swarm_manager prints out a summary of the known swarms, with a list of swarms and their corresponding status:
**--devel:** This option prevents swarm from creating command or batch scripts, prevents it from actually submitting to sbatch, and prevents it from logging to the standard logfile. It also increases the verbosity level of swarm.

```
$ swarm_manager --dry-run --human
Running in dry-run mode
R/P=3123,F=18136,U=5260
...
6WKbhKwgdE 53685609 F 2017-11-12T07:18:59 2017-11-12T09:28:59 2017-11-18T06:30:01 mmouse
MX6u9pc6qA 53719443 F 2017-11-12T17:45:46 2017-11-12T17:46:08 2017-11-18T06:30:01 mmouse
7Vs9wt_bj7 -1 U 2017-11-22T02:57:31 -1 -1 jeb
wM4wmAGBf2 53685210 F 2017-11-12T07:14:45 2017-11-12T08:30:37 2017-11-18T06:30:01 mmouse
sYczw00gOi 54462141 R/P 2017-11-22T00:22:02 -1 -1 dduck
yJdZ0Teclu -1 U 2017-11-11T15:44:40 -1 2017-11-17T06:30:02 mmouse
CkHn5oWBWZ 54462144 R/P 2017-11-22T00:22:09 -1 -1 dduck
4YWcIPcqp9 54461506 F 2017-11-21T23:56:10 2017-11-21T23:56:13 -1 dduck
NcvQIL0TNN 53764812 F 2017-11-13T08:33:59 2017-11-15T18:48:45 2017-11-21T06:30:01 ggoofy
z8OVfg_Az_ 54303744 F 2017-11-20T11:51:34 2017-11-20T11:51:36 -1 dduck2
...
```
**Columns**
* 1: unique tag that identifies the swarm
* 2: the slurm jobid for the jobarray -- unsubmitted swarms are set to -1
* 3: the metastate; R/P = running or pending, F = finished, U = unsubmitted
* 4: submit/create time -- this is always set
* 5: end time -- running/pending/unsubmitted swarms are set to -1
* 6: delete time -- undeleted swarms are set to -1
* 7: user
**--verbose:** This option makes swarm more chatty, and accepts an integer from 0 (silent) to 4. Running a swarm with many commands at level 4 will give a lot of output, so beware.

Routine daily cleaning is done by including the --routine option. Running in --dry-run mode shows what directories will get removed:
**--debug:** This option is similar to --devel, except that the scripts are actually created. The temporary directory for the swarm.batch and command scripts begins with 'dev', rather than 'tmp' like normal.

**--no-run:** A hidden à la carte option that prevents swarm from actually submitting to sbatch.

**--no-log:** A hidden à la carte option that prevents swarm from logging.

**--logfile:** A hidden à la carte option that redirects the logfile from the standard logfile to one of your choice.

**--no-scripts:** Don't create command and batch scripts.

In the tests subdirectory, there are two scripts that can be run to test the current build of swarm. **test.sh** runs a series of swarm commands that are expected to succeed, and **fail.sh** runs a series of swarm commands that are expected to fail. They are run in **--devel** mode, so nothing is ever submitted to the cluster or logged.

The script **sample.pl** extracts the last 100 or so lines from the swarm logfile and generates possible options for testing swarm. The **--sbatch** option comes out malformed because it doesn't contain any quotes, so you will need to add them back in to construct proper swarm commands.

## Logging

* swarm logs to /usr/local/logs/swarm.log
* swarm_cleanup.pl logs to /usr/local/logs/swarm_cleanup.log

```
$ ./swarm_manager --dry-run --human --routine
Running in dry-run mode
...
2017-11-22T09:37:02 rm -rf /spin1/swarm/dduck/bKuRukIcN0 /spin1/swarm/dduck/54095527
2017-11-22T09:37:02 rm -rf /spin1/swarm/ggoofy/mgeLmPLATB /spin1/swarm/ggoofy/54095086
2017-11-22T09:37:02 rm -rf /spin1/swarm/mmouse/qdcR8dNqZZ /spin1/swarm/mmouse/54096680
...
/swarm usage: 2.03 GB ( 4.1%), 240274 files ( 6.9%)
======================================================================
Swarm directories scanned: 26524
Swarm directories deleted: 3
======================================================================
/swarm usage: 2.03 GB ( 4.1%), 240274 files ( 6.9%)
```
## Index File

An index file /usr/local/logs/swarm_tempdir.idx is updated when a swarm is created. This file contains the creation timestamp, user, unique tag, number of commands, and P value (either 1 or 2):
@@ -106,29 +84,63 @@ An index file /usr/local/logs/swarm_tempdir.idx is updated when a swarm is creat
1509020225,ggoofy,64PZ3h80nB,1000,1
```
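
As an illustration of the record layout described above, here is a minimal Python sketch that splits one index line into named fields. The `TempdirRecord` type and `parse_tempdir_line` function are hypothetical names introduced for this example; they are not part of swarm. The sketch also assumes the timestamp is in Unix epoch seconds.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class TempdirRecord(NamedTuple):
    """One line of swarm_tempdir.idx (field names are illustrative)."""
    created: datetime   # creation timestamp (assumed to be epoch seconds)
    user: str           # user who created the swarm
    tag: str            # unique tag that identifies the swarm
    ncommands: int      # number of commands in the swarm
    p_value: int        # P value, either 1 or 2


def parse_tempdir_line(line: str) -> TempdirRecord:
    """Split a comma-delimited index line into typed fields."""
    created, user, tag, ncommands, p_value = line.strip().split(",")
    return TempdirRecord(
        created=datetime.fromtimestamp(int(created), tz=timezone.utc),
        user=user,
        tag=tag,
        ncommands=int(ncommands),
        p_value=int(p_value),
    )


record = parse_tempdir_line("1509020225,ggoofy,64PZ3h80nB,1000,1")
print(record.user, record.tag, record.ncommands, record.p_value)
```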

## Logging
## Clean up

* swarm logs to /usr/local/logs/swarm.log
* swarm_manager logs to /usr/local/logs/swarm_cleanup.log
Because the space in /spin1/swarm is limited, old directories need to be removed. We want to keep the directories and files around for a while to use in investigations, but not forever. The leftovers are cleaned up daily by /usr/local/sbin/swarm_cleanup.pl in a root cron job on biowulf. At the moment, subdirectories and their accompanying symlinks are deleted either when the full swarm ended at least 5 days ago or, if the swarm was never run, when the modification time is more than 5 days old.

## Testing
Under normal use, swarm_cleanup.pl first identifies all jobarrays from the biowulf_job_table of the slurmacct database (using the replicate slave). Then it parses the swarm_tempdir.idx, recent swarm.log and sbatch.log files and determines the status and age of all created swarms. Swarms that are at least 5 days old and are either inactive (they finished in slurm) or unknown (never submitted to slurm) are deleted from /spin1/swarm.

Swarm has several options for testing things.
```
swarm_cleanup.pl --delete-age 5
```
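
The deletion rule can be sketched in a few lines of Python. This is only an illustration of the rule as described in this README, not the logic in swarm_cleanup.pl itself: the `should_delete` function, its parameters, and the single-letter states are hypothetical, and the sketch assumes that inactive swarms are aged from their end time and unknown swarms from their creation time.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_delete(state: str, created: datetime, finished: Optional[datetime],
                  now: datetime, delete_age_days: int = 5) -> bool:
    """Decide whether a swarm tempdir is old enough to remove.

    state is 'a' (active: pending or running), 'i' (inactive: finished)
    or 'u' (unknown: never submitted to slurm).
    """
    cutoff = now - timedelta(days=delete_age_days)
    if state == "i":
        # Inactive swarms are aged from the time the full swarm ended.
        return finished is not None and finished <= cutoff
    if state == "u":
        # Unknown swarms have no end time, so age them from creation (assumption).
        return created <= cutoff
    # Active swarms are never deleted.
    return False


now = datetime(2019, 10, 8, 7, 0, 0)
print(should_delete("i", datetime(2019, 10, 1), datetime(2019, 10, 2), now))
print(should_delete("a", datetime(2019, 9, 1), None, now))
```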

**--devel:** This option prevents swarm from creating command or batch scripts, prevents it from actually submitting to sbatch, and prevents it from logging to the standard logfile. It also increases the verbosity level of swarm.
When a swarm tempdir directory is deleted, it is recorded in /usr/local/logs/swarm_cleanup.idx as a comma-delimited list:

**--verbose:** This option makes swarm more chatty, and accepts an integer from 0 (silent) to 4. Running a swarm with many commands at level 4 will give a lot of output, so beware.
```
1569079783,0,1570533216,user1,ztTSHJ6gJ9,0
1570022156,1570022418,1570533216,user2,zuNcmPhc1r,37909839
```

**--debug:** This option is similar to --devel, except that the scripts are actually created. The temporary directory for the swarm.batch and command scripts begins with 'dev', rather than 'tmp' like normal.
The fields are:

**--no-run:** A hidden à la carte option that prevents swarm from actually submitting to sbatch.
* time created
* time finished (zero if not known)
* time deleted
* user
* tempdir
* jobid (zero if not known)
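
For anyone post-processing this file, the six fields listed above could be read with a short Python sketch like the one below. `CleanupRecord` and `parse_cleanup_line` are hypothetical names used only for illustration. The sketch assumes the three times are Unix epoch seconds and maps the "zero if not known" values to `None`.

```python
from typing import NamedTuple, Optional


class CleanupRecord(NamedTuple):
    """One line of swarm_cleanup.idx (field names are illustrative)."""
    created: int             # time created (assumed epoch seconds)
    finished: Optional[int]  # time finished, None if not known
    deleted: int             # time deleted
    user: str
    tempdir: str
    jobid: Optional[int]     # slurm jobid, None if not known


def parse_cleanup_line(line: str) -> CleanupRecord:
    """Split a comma-delimited record, turning the zero placeholders into None."""
    created, finished, deleted, user, tempdir, jobid = line.strip().split(",")
    return CleanupRecord(
        created=int(created),
        finished=int(finished) or None,
        deleted=int(deleted),
        user=user,
        tempdir=tempdir,
        jobid=int(jobid) or None,
    )


unsubmitted = parse_cleanup_line("1569079783,0,1570533216,user1,ztTSHJ6gJ9,0")
finished = parse_cleanup_line("1570022156,1570022418,1570533216,user2,zuNcmPhc1r,37909839")

# Age of each tempdir at deletion time, in days.
for rec in (unsubmitted, finished):
    print(rec.user, rec.tempdir, round((rec.deleted - rec.created) / 86400, 1))
```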

**--no-log:** A hidden à la carte option that prevents swarm from logging.
If --email is given, an email like the following is sent to the users listed in the option:

**--logfile:** A hidden à la carte option that redirects the logfile from the standard logfile to one of your choice.
```
$ swarm_cleanup.pl --delete-age 5 --email userA,userB,userC
...
/swarm usage: 11.15 GB (22.3%), 2268917 files (32.4%)
======================================================================
Swarm directories scanned: 8222
Swarm directories deleted: 2407
======================================================================
/swarm usage: 7.56 GB (15.1%), 1373878 files (19.6%)
```

**--no-scripts:** Don't create command and batch scripts.
When run in --dry-run mode, swarm_cleanup.pl generates a unique swarm_cleanup.idx file in the current working directory.

In the tests subdirectory, there are two scripts that can be run to test the current build of swarm. **test.sh** runs a series of swarm commands that are expected to succeed, and **fail.sh** runs a series of swarm commands that are expected to fail. They are run in **--devel** mode, so nothing is ever submitted to the cluster or logged.
A final tally is written to swarm_cleanup.log:

The script **sample.pl** extracts the last 100 or so lines from the swarm logfile and generates possible options for testing swarm. The **--sbatch** option comes out malformed because it doesn't contain any quotes, so you will need to add them back in to construct proper swarm commands.
```
2019-10-08T07:13:36 f: a=85 i=5726 u=2411 d: i=796 u=1611 rss=195264 (813 seconds)
```

The format of the log is as follows:

```
f -- found
d -- deleted
a: active (pending or running)
i: inactive (finished)
u: unknown (swarm not submitted to slurm)
rss: current memory usage
```

Thus, the above line from the logfile indicates that the script found 85 pending or running swarms, 5726 inactive swarms, and 2411 unknown swarms. Of the inactive swarms, 796 were deleted because they ended 5 or more days ago. Of the unknown swarms, 1611 were deleted because they were created 5 or more days ago. The entire cleanup process took about 14 minutes (813 seconds) and required about 191 MB of RAM.
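
To track these tallies over time, the log line can be parsed with a regular expression. The Python sketch below is one possible approach and is not part of swarm. The `TALLY_RE` pattern and `parse_tally` function are hypothetical names, the pattern is inferred from the single sample line above, and the rss value is assumed to be in KiB (which is consistent with the roughly 191 MB quoted above).

```python
import re

# Pattern inferred from the sample line; other log lines may differ.
TALLY_RE = re.compile(
    r"^(?P<timestamp>\S+) "
    r"f: a=(?P<found_active>\d+) i=(?P<found_inactive>\d+) u=(?P<found_unknown>\d+) "
    r"d: i=(?P<deleted_inactive>\d+) u=(?P<deleted_unknown>\d+) "
    r"rss=(?P<rss>\d+) \((?P<seconds>\d+) seconds\)$"
)


def parse_tally(line: str) -> dict:
    """Return the counters from one tally line as integers, plus the timestamp."""
    match = TALLY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a tally line: {line!r}")
    fields = match.groupdict()
    timestamp = fields.pop("timestamp")
    tally = {name: int(value) for name, value in fields.items()}
    tally["timestamp"] = timestamp
    return tally


tally = parse_tally(
    "2019-10-08T07:13:36 f: a=85 i=5726 u=2411 d: i=796 u=1611 rss=195264 (813 seconds)"
)
total_deleted = tally["deleted_inactive"] + tally["deleted_unknown"]
print(total_deleted, round(tally["rss"] / 1024), round(tally["seconds"] / 60))
```

The total of the two deleted counters matches the "Swarm directories deleted" figure in the sample email above, which makes a convenient sanity check when comparing the log against the email report.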