This is a port of https://github.com/WeichenZhou/NanoPal-and-Cas9-targeted-enrichment-pipelines/ from multiple bash scripts into a single Snakemake-orchestrated pipeline.
To run, you'll need Snakemake and Singularity installed on the workstation or cluster you want to use.
For Great Lakes you should be able to do module load snakemake singularity and
be all set.
For workstations, you'll need one with Singularity installed (clover
definitely has it, not sure about the others). Then you'll need to install
Snakemake. You could do this through Conda, or with a boring old virtualenv if
you prefer.
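For example, a plain virtualenv setup looks something like this (just a sketch;
Conda works equally well, and the paths are arbitrary):

python3 -m venv ~/venvs/nanopal
source ~/venvs/nanopal/bin/activate
pip install snakemake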
There are three layers of configuration.
config/base.json contains basic configuration for the pipeline itself, e.g.
which versions of the various tools to use. There are also config/local.json
and config/glakes.json that let you change those values when running locally or
on Great Lakes, but you likely won't need to touch them (we should probably
just remove them entirely).
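If you want to see which knobs exist, the quickest way is to look at the files
themselves, e.g. (jq is optional here, plain cat works just as well):

jq . config/base.json config/glakes.json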
profiles/glakes/config.yaml contains the necessary configuration for using
Slurm on Great Lakes to parallelize the pipeline. You might need to edit this
if you want to e.g. change which Slurm account the pipeline runs its tasks
under.
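If you're not sure which Slurm account(s) you can charge jobs to, a standard
Slurm query like this should list them:

sacctmgr show associations user=$USER format=account,cluster,partition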
runs/*.json is where you'll configure an individual run of a pipeline.
A single run config here looks something like this:
{
  "batch_id": "hapmap-01",
  "mobile_elements": ["LINE", "AluYa", "AluYb"],
  "reference_version": "GRCh38",
  "exclusions": "GM12878",
  "reference": "/scratch/apboyle_root/apboyle0/slosh/nanopal-snakemake-test-runs/ref/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna",
  "scratch_path": "/scratch/apboyle_root/apboyle0/slosh/nanopal-snakemake-test-runs/scratch",
  "container_path": "/scratch/apboyle_root/apboyle0/slosh/nanopal-snakemake-test-runs/containers",
  "datasets": {
    "p2-acq1": ["/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240606_1417_P2S-01935-B_PAW86158_158598d3/57b768b9ec81290d2b070df0f53667e32169b91c.sorted.bam"],
    "p2-acq2": ["/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240529_1010_P2S-01935-A_PAK58332_285f9616/c05f7e73c663028fe6c54432fd8ec97f675e48b2.sorted.bam",
                "/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240529_1356_P2S-01935-A_PAK58332_17eb569e/aeaebaa1dadbf6ba45a132233e48005ad66ff1a5.sorted.bam"],
    "minion": ["/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240523_1455_MN40776_FAY82215_1b3c041f/74afcd8e597181d7f8bb617588d99aa322fa4cf5.sorted.bam"],
    "everything": ["/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240606_1417_P2S-01935-B_PAW86158_158598d3/57b768b9ec81290d2b070df0f53667e32169b91c.sorted.bam",
                   "/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240529_1010_P2S-01935-A_PAK58332_285f9616/c05f7e73c663028fe6c54432fd8ec97f675e48b2.sorted.bam",
                   "/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240529_1356_P2S-01935-A_PAK58332_17eb569e/aeaebaa1dadbf6ba45a132233e48005ad66ff1a5.sorted.bam",
                   "/nfs/turbo/boylelab/nanopore_data/MEI/LINE/20240523_1455_MN40776_FAY82215_1b3c041f/74afcd8e597181d7f8bb617588d99aa322fa4cf5.sorted.bam"]
  },
  "chromosomes": [
    "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
    "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17",
    "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"
  ]
}

exclusions denotes which set of non-reference-but-previously-known MEIs should
be excluded from being called as novel MEIs. For example,
"exclusions": "GM12878" will pull out previously-identified MEIs in the GM12878
cell line and avoid reporting them as novel insertions. Use
"exclusions": "none" to disable this filtering and report all non-reference
MEIs detected as novel MEIs.
batch_id is a name for the run (TODO we should probably rename this to
run_id). The pipeline will store all its output under a directory with this
name in the scratch_path directory.
container_path is the path to a directory to store the Singularity containers
(see below).
datasets describes the location of the dataset(s) you want to run the pipeline
on.
Each entry in the datasets dictionary should have a unique ID as the key and
a list of basecalled_output.tar or BAM files as the value. The data from all
the files in the list will be combined and used as input to the pipeline, which
can be handy if e.g.:
- You have multiple separate ONT data directories for the same sample because
  you had to pause and restart the sequencer (the p2-acq2 dataset in the
  example).
- You want to combine multiple sequencing runs into one big pool and run the
  pipeline on all the data together (the everything dataset in the example).
If you only have one file for an entry you can optionally use a single string
for the value for convenience, instead of having to wrap it up as
a single-element list (the minion dataset in the example).
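Since it's easy to fat-finger a path in these configs, it can be worth
sanity-checking the JSON and the dataset paths before kicking off a run.
Something like this should work (a sketch, assuming jq is installed and
runs/foo.json is your run config):

jq . runs/foo.json > /dev/null && echo "JSON parses"
jq -r '.datasets[] | if type == "string" then . else .[] end' runs/foo.json \
  | while read -r f; do [ -e "$f" ] || echo "missing: $f"; done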
Note that if you're running on Great Lakes and point to files on scratch you
must use the /scratch/… form of the paths, and not /gpfs/accounts/….
/scratch on Great Lakes is just a symlink to /gpfs/accounts, so in theory it
shouldn't matter, but ARC have explicitly configured Singularity on Great Lakes
to map /scratch/… into the containers and have not mapped /gpfs/accounts/…,
so only the /scratch/… form of the paths will work inside the containers.
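A quick way to catch this before a run is to grep your run config for /gpfs
paths (a sketch; adjust the filename to your own config):

grep -n '/gpfs/accounts' runs/foo.json && echo "rewrite these as /scratch/… paths"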
The pipeline will download its own containers to the directory specified in
container_path, except for the custom containers we build ourselves. You'll
need to build those yourself (or find someone who already has them built) and
copy them over manually before you run for the first time.
Unfortunately you can't build Singularity containers on Great Lakes, so you'll
need to have Singularity installed on a machine you have root access to (or
that's been specially configured to allow non-root container building). Once
you've got that you should be able to build the *.sif files with:
singularity build --fakeroot nanopal-binaries.sif nanopal-binaries.def
singularity build --fakeroot palmer.sif palmer.def
Then copy those to the container_path wherever you're trying to run.
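For example, to push freshly built containers from a workstation up to Great
Lakes, something like this should work (a sketch; substitute your own uniqname,
login host, and the container_path from your run config):

scp nanopal-binaries.sif palmer.sif \
    uniqname@greatlakes.arc-ts.umich.edu:/scratch/apboyle_root/apboyle0/slosh/nanopal-snakemake-test-runs/containers/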
There may be some other snags I've forgotten to document here; ping me (slosh)
if you hit an error.
Once you've got your runs/whatever.json and the manually-built containers
ready, you should be all set to run. There are a couple of helper scripts that
will make it a little less tedious to invoke Snakemake:
- run-local.sh for running directly on a workstation.
- run-glakes.sh for running on Great Lakes via Slurm.
Take a look at the individual scripts to see exactly what they're doing, but
here are some examples to get you started. To do a dry run and just print what
Snakemake thinks it needs to do:
script          run config                     snakemake args
vvvvvvvvvvvvvvv vvvvvvvvvvvvvvvvvvvvvvvvvvvvvv vvvvvvvvvvvvvvvvvvvvvvvvv
./run-glakes.sh runs/real-data-test-gl.json    _all --cores=36 --dry-run
./run-local.sh  runs/real-data-test-local.json _all --cores=12 --dry-run
Unfortunately you do need to pass --cores even for a dry run. The reason is
that some of the commands that get invoked depend in a non-trivial way on the
number of cores used, which means Snakemake will consider them changed if the
number of cores changes (and the default --cores is 1), which will make the
dry run look like it'll have to do more work than the real run actually will.
To do a full run for real on a workstation (e.g. clover):
./run-local.sh runs/real-data-test-local.json _all --cores=12 > snakemake.log
Note that you'll probably want to do this in a screen or tmux session so it
doesn't get killed if your wifi dies.
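For example (a minimal tmux sketch; screen works just as well):

tmux new -s nanopal
./run-local.sh runs/real-data-test-local.json _all --cores=12 > snakemake.log

Detach with Ctrl-b d and reattach later with tmux attach -t nanopal.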
When running on the cluster you're not supposed to do any real work on the login
nodes, so you should use Slurm to invoke the ./run-glakes.sh call on a Slurm
worker node. The run.sbat file does that, so you can use
sbatch run.sbat runs/foo.json _all
to kick off the Snakemake controller with the given config and Snakemake
target, which will then fan out the tasks to other nodes.
You can monitor the progress by watching the log file with:
tail -f snakemake-batch-run-<controller job id>.out
and/or by watching the Snakemake tasks in the queue with something like:
squeue --me --format="%.10i %60j %.6u %.12a %.10P %.2t %.10M %.10L %.6C %.10m %R"
The pipeline will put all of its output into subdirectories of
<scratch_path>/<batch_id>:
/scratch/apboyle_root/apboyle0/slosh/nanopal-snakemake-test-runs/scratch/mei-ad
├── [ 4.0K] _benchmarks/
├── [ 4.0K] _logs/
├── [ 4.0K] mei_db/
├── [ 4.0K] reference/
├── [ 64K] 1096_Control_Frontal-Cortex/
├── [ 4.0K] 1096_Control_Occipital-Cortex/
├── [ 64K] 1096_Control_Parietal-Cortex/
└── [ 4.0K] 1096_Control_Cerebellum/
The _benchmarks directory stores the benchmark tsvs collected automatically
by Snakemake, which can be nice if you want to try to optimize the
runtime/memory requested for a particular task.
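If you want to eyeball those numbers, something like this will dump them all in
a readable form (a sketch; it assumes the benchmark files are written with a
.tsv extension):

find _benchmarks -name '*.tsv' | while read -r f; do
  echo "== $f"
  column -t "$f"
done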
_logs stores the logs for all the various Snakemake tasks.
mei_db and reference are top-level Snakemake steps whose results are used
for all the datasets.
All the rest of the directories are the working directories for individual
datasets. There will be one subdirectory per sample with the Snakemake steps
for that sample inside. Steps that are parallelized across targets will have an
extra per-target directory as well:
/scratch/apboyle_root/apboyle0/slosh/nanopal-snakemake-test-runs/scratch/mei-ad
└── [ 4.0K] 1096_Control_Cerebellum/
├── [ 4.0K] input/…
├── [ 4.0K] alignment/…
├── [ 4.0K] find_revcomp_reads/…
├── [ 4.0K] AluYa/
│ ├── [ 4.0K] find_on_target/…
│ ├── [ 4.0K] gather_matches/…
│ ├── [ 4.0K] intersect/…
│ ├── [ 4.0K] intersect_again/…
│ ├── [ 64K] palmer/…
│ ├── [ 4.0K] palmer_on_target/…
│ └── [ 4.0K] parse_cigar/…
├── [ 4.0K] AluYb/
│ ├── [ 4.0K] find_on_target/…
│ ├── [ 4.0K] gather_matches/…
│ ├── [ 4.0K] intersect/…
│ ├── [ 4.0K] intersect_again/…
│ ├── [ 64K] palmer/…
│ ├── [ 4.0K] palmer_on_target/…
│ └── [ 4.0K] parse_cigar/…
└── [ 4.0K] LINE/
├── [ 4.0K] find_on_target/…
├── [ 4.0K] gather_matches/…
├── [ 4.0K] intersect/…
├── [ 4.0K] intersect_again/…
├── [ 64K] palmer/…
├── [ 4.0K] palmer_on_target/…
└── [ 4.0K] parse_cigar/…
The final output numbers (which were just dumped to standard output in the
vanilla version of Nanopal) will be under
<target>/intersect_again/result-log.txt (TODO we should add another Snakemake
step to collect all the end results into a single place for easier
viewing/downloading).
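Until that step exists, a quick way to pull the result logs together by hand is
something like this (a sketch, run from inside <scratch_path>/<batch_id>):

find . -path '*/intersect_again/result-log.txt' | while read -r f; do
  echo "== $f"
  cat "$f"
done > all-results.txt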