diff --git a/docs/advanced/assembly.rst b/docs/advanced/assembly.md similarity index 76% rename from docs/advanced/assembly.rst rename to docs/advanced/assembly.md index 1e9dc8f2..203d63de 100644 --- a/docs/advanced/assembly.rst +++ b/docs/advanced/assembly.md @@ -1,11 +1,9 @@ -Pre-Assambly-processing ------------------------- +# Pre-Assambly-processing -Normalization Parameters -`````````````````````````` +## Normalization Parameters To improve assembly time and often assemblies themselves, coverage is -normalized across kmers to a target depth and can be set using:: +normalized across kmers to a target depth and can be set using: # kmer length over which we calculated coverage normalization_kmer_length: 21 @@ -14,30 +12,22 @@ normalized across kmers to a target depth and can be set using:: # reads must have at least this many kmers over min depth to be retained normalization_minimum_kmers: 8 +## Error Correction - -Error Correction -`````````````````````````` - -Optionally perform error correction using ``tadpole.sh`` from BBTools:: +Optionally perform error correction using `tadpole.sh` from BBTools: perform_error_correction: true +# Assembly Parameters +## Assembler -Assembly Parameters ------------------------- - - -Assembler -`````````````````````````` - -Currently, the supported assemblers are 'spades' and 'megahit' with the -default setting of:: +Currently, the supported assemblers are \'spades\' and \'megahit\' with +the default setting of: assembler: megahit -Both assemblers have settings that can be altered in the configuration:: +Both assemblers have settings that can be altered in the configuration: # minimum multiplicity for filtering (k_min+1)-mers megahit_min_count: 2 @@ -58,11 +48,9 @@ Both assemblers have settings that can be altered in the configuration:: # comma-separated list of k-mer sizes (must be odd and less than 128) spades_k: auto +## Contig Filtering -Contig Filtering -`````````````````````````` - -After assembly, contigs can be filtered based on several metrics:: +After assembly, contigs can be filtered based on several metrics: # Discard contigs with lower average coverage. minimum_average_coverage: 5 diff --git a/docs/advanced/index.md b/docs/advanced/index.md new file mode 100644 index 00000000..1f3894b2 --- /dev/null +++ b/docs/advanced/index.md @@ -0,0 +1,8 @@ +# Advanced Usage + +```{toctree} +:maxdepth: 2 + +assembly +qc +``` diff --git a/docs/advanced/qc.md b/docs/advanced/qc.md new file mode 100644 index 00000000..359043f6 --- /dev/null +++ b/docs/advanced/qc.md @@ -0,0 +1,104 @@ +# Quality control of reads + +## Adapter Trimming + +FASTA file paths for adapter sequences to be trimmed from the sequence +ends. + +We provide the adapter reference FASTA included in `bbmap` +for various + + preprocess_adapters: /database_dir/adapters.fa + +## Quality Trimming + +Trim regions with an average quality below this threshold. Higher is +more stringent. + + preprocess_minimum_base_quality: 10 + +## Adapter Trimming at Read Tips + +Allow shorter kmer matches down to `mink` at the read ends. +0 disables. + + preprocess_adapter_min_k: 8 + +## Allowable Mismatches in Adapter Hits + +Maximum number of substitutions between the target adapter kmer and the +query sequence kmer. Lower is more stringent. + + preprocess_allowable_kmer_mismatches: 1 + +## Contaminant Kmer Length + +Kmer length used for finding contaminants. Contaminant matches shorter +than this length will not be found. 
+ + preprocess_reference_kmer_match_length: 27 + +## Read Length Threshold + +This is applied after quality and adapter trimming have been applied to +the sequence. + + preprocess_minimum_passing_read_length: 51 + +## Sequence Complexity Filter + +Require this fraction of each nucleotide per sequence to eliminate low +complexity reads. + + preprocess_minimum_base_frequency: 0.05 + +## Contamination Parameters + +Contamination reference sequences in the form of nucleotide FASTA files +can be provided and filtered from the reads using the following +parameters. + +If \'rRNA\' is defined, it will be added back to metagenomes but not to +metatranscriptomes. Additional references can be added arbitrarily, such +as:: : + + contaminant_references: + rRNA: /database_dir/silva_rfam_all_rRNAs.fa + phiX: /database_dir/phiX174_virus.fa + +Don\'t look for indels longer than this: + + contaminant_max_indel: 20 + +Fraction of max alignment score required to keep a site: + + contaminant_min_ratio: 0.65 + +mapping kmer length; range 8-15; longer is faster but uses more memory; +shorter is more sensitive: + + contaminant_kmer_length: 12 + +Minimum number of seed hits required for candidate sites: + + contaminant_minimum_hits: 1 + +Set behavior on ambiguously-mapped reads (with multiple top-scoring +mapping locations): + +- best (use the first best site) +- toss (consider unmapped, retain in reads for assembly) +- random (select one top-scoring site randomly) +- all (retain all top-scoring sites) + + contaminant_ambiguous: best + +For host decontamination we suggest the following genomes, where +contaminants and low complexity regions were masked. + +Many thanks to Brian Bushnell for providing the genomes of +\[human\](),\[mouse\](), +\[dog\](), +and +\[cat\](). +\[Source\]() diff --git a/docs/advanced/qc.rst b/docs/advanced/qc.rst deleted file mode 100644 index e2220b60..00000000 --- a/docs/advanced/qc.rst +++ /dev/null @@ -1,127 +0,0 @@ -Quality control of reads -------------------------- - - -Adapter Trimming -`````````````````````````` - -FASTA file paths for adapter sequences to be trimmed from the sequence ends. - -We provide the adapter reference FASTA included in `bbmap` for various - -:: - - preprocess_adapters: /database_dir/adapters.fa - - -Quality Trimming -`````````````````````````` - -Trim regions with an average quality below this threshold. Higher is more -stringent. - -:: - - preprocess_minimum_base_quality: 10 - - -Adapter Trimming at Read Tips -```````````````````````````````````````````````````` - -Allow shorter kmer matches down to `mink` at the read ends. 0 disables. - -:: - - preprocess_adapter_min_k: 8 - - -Allowable Mismatches in Adapter Hits -```````````````````````````````````````````````````` - -Maximum number of substitutions between the target adapter kmer and the query -sequence kmer. Lower is more stringent. - -:: - - preprocess_allowable_kmer_mismatches: 1 - - -Contaminant Kmer Length -`````````````````````````` - -Kmer length used for finding contaminants. Contaminant matches shorter than -this length will not be found. - -:: - - preprocess_reference_kmer_match_length: 27 - - -Read Length Threshold -`````````````````````````` - -This is applied after quality and adapter trimming have been applied to the -sequence. - -:: - - preprocess_minimum_passing_read_length: 51 - - -Sequence Complexity Filter -`````````````````````````` - -Require this fraction of each nucleotide per sequence to eliminate low -complexity reads. 
- -:: - - preprocess_minimum_base_frequency: 0.05 - - -Contamination Parameters -`````````````````````````` - -Contamination reference sequences in the form of nucleotide FASTA files can be -provided and filtered from the reads using the following parameters. - -If 'rRNA' is defined, it will be added back to metagenomes but not to metatranscriptomes. -Additional references can be added arbitrarily, such as:: -:: - - contaminant_references: - rRNA: /database_dir/silva_rfam_all_rRNAs.fa - phiX: /database_dir/phiX174_virus.fa - -Don't look for indels longer than this:: - - contaminant_max_indel: 20 - - -Fraction of max alignment score required to keep a site:: - - contaminant_min_ratio: 0.65 - -mapping kmer length; range 8-15; longer is faster but uses more memory; shorter is more sensitive:: - - contaminant_kmer_length: 12 - -Minimum number of seed hits required for candidate sites:: - - contaminant_minimum_hits: 1 - -Set behavior on ambiguously-mapped reads (with multiple top-scoring mapping locations): - -- best (use the first best site) -- toss (consider unmapped, retain in reads for assembly) -- random (select one top-scoring site randomly) -- all (retain all top-scoring sites) - -:: - - contaminant_ambiguous: best - -For host decontamination we suggest the following genomes, where contaminants and low complexity regions were masked. - -Many thanks to Brian Bushnell for providing the genomes of [human](https://drive.google.com/file/d/0B3llHR93L14wd0pSSnFULUlhcUk/edit?resourcekey=0-PsIKmg2q4EvTGWGOUjsKGQ),[mouse](https://drive.google.com/file/d/0B3llHR93L14wYmJYNm9EbkhMVHM/view?resourcekey=0-jSsdejBncqPu4eiFfJvf1w), -[dog](https://drive.google.com/file/d/0B3llHR93L14wTHdWRG55c2hPUXM/view?resourcekey=0-nJ2WQzTQYrTizK0pllVRZg), and [cat](https://drive.google.com/file/d/0B3llHR93L14wOXJhWXRlZjBpVUU/view?resourcekey=0-xxh33oYWp5FGBpRzobD_uw). [Source](https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/37175-introducing-removehuman-human-contaminant-removal?p=286481#post286481) diff --git a/docs/conf.py b/docs/conf.py index d736ef11..35d0a438 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -17,7 +17,7 @@ # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. # -# import os +import os # import sys # sys.path.insert(0, os.path.abspath('.')) @@ -36,8 +36,12 @@ "sphinx.ext.todo", "sphinx.ext.viewcode", "sphinx.ext.napoleon", + "myst_parser", + "sphinx.ext.autosectionlabel", ] +autosectionlabel_prefix_document = True + # Add any paths that contain templates here, relative to this directory. templates_path = ["_templates"] @@ -69,12 +73,12 @@ # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. -language = None +language = "en" # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This patterns also effect to html_static_path and html_extra_path -exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "old"] +exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "old", os.path.abspath("../CHANGELOG.md")] # The name of the Pygments (syntax highlighting) style to use. 
pygments_style = "sphinx" diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 00000000..3ec0ec69 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,42 @@ +[![image](https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg)](https://anaconda.org/bioconda/metagenome-atlas) + +[![image](https://img.shields.io/conda/dn/bioconda/metagenome-atlas.svg?label=Bioconda)](https://bioconda.github.io/recipes/metagenome-atlas/README.html) + +[![image](https://img.shields.io/twitter/follow/SilasKieser.svg?style=social&label=Follow)](https://twitter.com/search?f=tweets&q=%40SilasKieser%20%23metagenomeAtlas&src=typd) + +# Metagenome-Atlas + +![Metagenome-atlas logo](../resources/images/atlas_image.png) + +Metagenome-Atlas is a easy-to-use metagenomic pipeline based on +[snakemake](https://snakemake.github.io/). It handles all steps from QC, +Assembly, Binning, to Annotation. + +You can start using atlas with three commands: + + mamba install -c bioconda -c conda-forge metagenome-atlas={latest_version} + atlas init --db-dir databases path/to/fastq/files + atlas run + +where `{latest_version}` should be replaced by + +[![image](https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg)](https://anaconda.org/bioconda/metagenome-atlas) + +## Publication + +> ATLAS: a Snakemake workflow for assembly, annotation, and genomic +> binning of metagenome sequence data. Kieser, S., Brown, J., Zdobnov, +> E. M., Trajkovski, M. & McCue, L. A. BMC Bioinformatics 21, 257 +> (2020). doi: +> [10.1186/s12859-020-03585-4](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03585-4) + +```{toctree} +:maxdepth: 2 +:caption: Documentation + +usage/getting_started +usage/output +usage/configuration +advanced/index +usage/changelog +``` diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index 971b50ca..00000000 --- a/docs/index.rst +++ /dev/null @@ -1,58 +0,0 @@ - -.. image:: https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg - :target: https://anaconda.org/bioconda/metagenome-atlas - -.. image:: https://img.shields.io/conda/dn/bioconda/metagenome-atlas.svg?label=Bioconda - :target: https://bioconda.github.io/recipes/metagenome-atlas/README.html - - -.. image:: https://img.shields.io/twitter/follow/SilasKieser.svg?style=social&label=Follow - :target: https://twitter.com/search?f=tweets&q=%40SilasKieser%20%23metagenomeAtlas&src=typd - - -.. |logo| image:: ../resources/images/atlas_image.png - :alt: Metagenome-atlas logo - - - - -Metagenome-Atlas -**************** - -|logo| - -Metagenome-Atlas is a easy-to-use metagenomic pipeline based on `snakemake `_. -It handles all steps from QC, Assembly, Binning, to Annotation. - -You can start using atlas with three commands:: - - mamba install -c bioconda -c conda-forge metagenome-atlas={latest_version} - atlas init --db-dir databases path/to/fastq/files - atlas run - -where `{latest_version}` should be replaced by - - -.. image:: https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg - :target: https://anaconda.org/bioconda/metagenome-atlas - - -.. _publication: - -Publication -=========== - - ATLAS: a Snakemake workflow for assembly, annotation, and genomic binning of metagenome sequence data. - Kieser, S., Brown, J., Zdobnov, E. M., Trajkovski, M. & McCue, L. A. - BMC Bioinformatics 21, 257 (2020). - doi: `10.1186/s12859-020-03585-4 `_ - - -.. 
toctree::
-   :maxdepth: 2
-   :caption: Documentation
-
-   usage/getting_started
-   usage/output
-   usage/configuration
-   usage/changelog
diff --git a/docs/usage/configuration.md b/docs/usage/configuration.md
new file mode 100644
index 00000000..f3b85f46
--- /dev/null
+++ b/docs/usage/configuration.md
@@ -0,0 +1,335 @@
+# Configure Atlas
+
+## Remove reads from Host
+
+One of the most important steps in quality control is to remove reads
+originating from the host's genome. You can add any number of genomes to
+be removed.
+
+We recommend using genomes in which repetitive sequences are masked. See
+here for more details on the [human
+genome](http://seqanswers.com/forums/archive/index.php/t-42552.html).
+
+## Co-abundance Binning {#cobinning}
+
+While binning each sample individually is faster, using co-abundance for
+binning is recommended. Quantifying the coverage of contigs across
+multiple samples provides valuable insights about contig co-variation.
+
+There are two primary strategies for co-abundance binning:
+
+1. **Cross mapping:** Map the reads from multiple samples to each
+   sample's contigs.
+2. **Co-binning:** Concatenate contigs from multiple samples and map
+   all the reads to these combined contigs.
+
+`final_binner: metabat2` is used for cross-mapping, while
+`vamb` or `SemiBin` is used for co-binning.
+
+The samples to be binned together are specified using the
+`BinGroup` in the `sample.tsv` file. The size of
+the BinGroup should be selected based on the binner and the co-binning
+strategy in use.
+
+Cross-mapping complexity scales quadratically with the size of the
+BinGroup, since each sample's reads are mapped to every other sample's
+contigs. This might yield better results for complex metagenomes,
+although no definitive benchmark is known. Co-binning, on the other
+hand, is more efficient, as it maps each sample's reads only once to a
+potentially large assembly.
+
+### Default Behavior
+
+Starting with version 2.18, Atlas places every sample in a single
+BinGroup and defaults to `vamb` as the binner unless there
+are very few samples. For fewer than 8 samples, `metabat` is
+the default binner.
+
+::: note
+::: title
+Note
+:::
+
+This represents a departure from previous versions, where each sample
+had its own BinGroup. Running `vamb` in those versions would
+consider all samples, regardless of their BinGroup. This change might
+cause errors if you use a `sample.tsv` file from an older
+Atlas version. Typically, you can resolve this by assigning a unique
+BinGroup to each sample.
+:::
+
+The mapping threshold has been adjusted to 95% identity (single-sample
+binning uses 97%) to allow reads from different strains, but not from
+other species, to map to contigs from a different sample.
+
+If you're co-binning more than 150-200 samples or cross-mapping more
+than 50 samples, Atlas will issue a warning about excessive samples
+in a BinGroup. Although VAMB's official publication suggests it can
+handle up to 1000 samples, this demands substantial resources.
+
+Therefore, splitting your samples into multiple BinGroups is
+recommended. Ideally, related samples, or those where the same species
+are anticipated, should belong to the same BinGroup.
+
+### Single-sample Binning
+
+To employ single-sample binning, simply assign each sample to its own
+BinGroup and select `metabat` or `DASTool` as
+the `final_binner`.
+
+Although it's not recommended, it's feasible to use
+`DASTool` and feed it inputs from `metabat` and
+other co-abundance-based binners.
+ +Add the following lines to your \`config.yaml\`: + +``` yaml +final_binner: DASTool + +binner: + - metabat + - maxbin + - vamb +``` + +## Long reads {#longreads} + +Limitation: Hybrid assembly of long and short reads is supported with +spades and metaSpades. However, metaSpades needs a paired-end short-read +library. + +The path of the (preprocessed) long reads should be added manually to +the sample table under a new column heading \'longreads\'. + +In addition, the type of the long reads should be defined in the config +file: `longread_type` one of \[\"pacbio\", \"nanopore\", \"sanger\", +\"trusted-contigs\", \"untrusted-contigs\"\] + +## Example config file + +```yaml +################################################################### +#### _______ _ _____ #### +#### /\ |__ __| | | /\ / ____| #### +#### / \ | | | | / \ | (___ #### +#### / /\ \ | | | | / /\ \ \___ \ #### +#### / ____ \ | | | |____ / ____ \ ____) | #### +#### /_/ \_\ |_| |______| /_/ \_\ |_____/ #### +#### #### +################################################################### + +# For more details about the config values see: +# https://metagenome-atlas.rtfd.io + +######################## +# Execution parameters +######################## +# threads and memory (GB) for most jobs especially from BBtools, which are memory demanding +threads: 8 +mem: 60 + +# threads and memory for jobs needing high amount of memory. e.g GTDB-tk,checkm or assembly +large_mem: 250 +large_threads: 16 +assembly_threads: 8 +assembly_memory: 250 +simplejob_mem: 10 +simplejob_threads: 4 + +#Runtime only for cluster execution +runtime: #in h + default: 5 + assembly: 48 + long: 24 + simplejob: 1 + +# directory where databases are downloaded with 'atlas download' +database_dir: databases + +######################## +# Quality control +######################## +data_type: metagenome # metagenome or metatranscriptome +interleaved_fastqs: false + +# remove (PCR)-duplicated reads using clumpify +deduplicate: true +duplicates_only_optical: false +duplicates_allow_substitutions: 2 + +# used to trim adapters from reads and read ends +preprocess_adapters: /path/to/databases/adapters.fa +preprocess_minimum_base_quality: 10 +preprocess_minimum_passing_read_length: 51 +# 0.05 requires at least 5 percent of each nucleotide per sequence +preprocess_minimum_base_frequency: 0.05 +preprocess_adapter_min_k: 8 +preprocess_allowable_kmer_mismatches: 1 +preprocess_reference_kmer_match_length: 27 +# error correction where PE reads overlap +error_correction_overlapping_pairs: true +#contamination references can be added such that -- key: /path/to/fasta +contaminant_references: + PhiX: /path/to/databases/phiX174_virus.fa +# host:/path/to/host_genome.fasta + +# We won't allow large indels +contaminant_max_indel: 20 +contaminant_min_ratio: 0.65 +contaminant_kmer_length: 13 +contaminant_minimum_hits: 1 +contaminant_ambiguous: best + +######################## +# Pre-assembly-processing +######################## + +# Advanced Error correction +error_correction_before_assembly: true +spades_skip_BayesHammer: true # Skip error correction in spades assembler +error_correction_kmer: 31 # can be longer e.g. 62 but takes more memory + +# remove reads with k-mers that cannot be used for assembly. +# Filter reads that have a 10% of k-mers below a minimum depth. 
+error_correction_remove_lowdepth: false +error_correction_minimum_kmer_depth: 1 # +error_correction_aggressive: false + +# Merging of pairs +# join R1 and R2 at overlap; unjoined reads are still utilized +merge_pairs_before_assembly: true +merging_k: 62 + +######################## +# Assembly +######################## +# megahit OR spades +assembler: spades + +minimum_contig_length: 1000 +# Megahit +#----------- +# 2 is for metagenomes, 3 for genomes with 30x coverage +megahit_min_count: 2 +megahit_k_min: 21 +megahit_k_max: 121 +megahit_k_step: 20 +megahit_merge_level: 20,0.98 +megahit_prune_level: 2 +megahit_low_local_ratio: 0.2 +# ['default','meta-large','meta-sensitive'] +megahit_preset: default + +# Spades +#------------ +spades_use_scaffolds: true # if false use contigs +#Comma-separated list of k-mer sizes to be used (all values must be odd, less than 128 and listed in ascending order). +spades_k: auto +spades_preset: meta # meta, ,normal, rna single end libraries doesn't work for metaspades +spades_extra: "" +longread_type: none # [none,"pacbio", "nanopore", "sanger", "trusted-contigs", "untrusted-contigs"] +# Preprocessed long reads can be defined in the sample table with 'longreads' , for more info see the spades manual + +# Filtering +#------------ +# filter out assembled noise +# this is more important for assembly from megahit +filter_contigs: false +# trim contig tips +contig_trim_bp: 0 +# require contigs to have read support +minimum_average_coverage: 1 +minimum_percent_covered_bases: 20 +minimum_mapped_reads: 0 + +######################## +# Quantification +######################## + +# Mapping reads to contigs +#-------------------------- +contig_min_id: 0.9 +contig_map_paired_only: true +contig_max_distance_between_pairs: 1000 +maximum_counted_map_sites: 10 +minimum_map_quality: 0 + +######################## +# Binning +######################## + +final_binner: vamb # [SemiBin, vamb, metabat, DASTool] + +semibin_options: "" + +metabat: + sensitivity: sensitive + min_contig_length: 1500 # metabat needs >1500 + +maxbin: + max_iteration: 50 + prob_threshold: 0.9 + min_contig_length: 1000 + +DASTool: + search_engine: "diamond" + score_threshold: 0.5 # Score threshold until selection algorithm will keep selecting bins [0..1]. 
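+
+# Note: the genome_filter_criteria below is an expression over the bin quality
+# statistics (e.g. CheckM completeness/contamination and assembly metrics);
+# only bins that fulfil it are kept as genomes.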
+ +genome_filter_criteria: "(Completeness-5*Contamination >50 ) & (Length_scaffolds >=50000) & (Ambigious_bases <1e6) & (N50 > 5*1e3) & (N_scaffolds < 1e3)" + +filter_chimieric_bins: true # filter chimeric bins using GUNC +gunc_database: "progenomes" # 'progenomes' or 'gtdb' + +genome_dereplication: + ANI: 0.95 ## Genome dreplication threshold 0.95 is more or less species + overlap: 0.2 + +rename_mags_contigs: true #Rename contigs of representative MAGs + +######################## +# Annotations +####################### + +annotations: + - gtdb_tree + - gtdb_taxonomy + - genes + - kegg_modules + - dram + +######################## +# Gene catalog +####################### +genecatalog: + source: contigs # [contigs, genomes] Predict genes from all contigs or only from the representative genomes + clustermethod: linclust # [mmseqs or linclust] see mmseqs for more details + minlength_nt: 270 # min length + minid: 0.90 # min id for gene clustering for the main gene catalog used for annotation + coverage: 0.9 + extra: " " + SubsetSize: 500000 + +gene_annotations: + - eggNOG + # - dram + +eggNOG_use_virtual_disk: false # coping the eggNOG DB to a virtual disk can speed up the annotation +virtual_disk: "/dev/shm" # But you need 37G extra ram +``` + +## Detailed configuration + +toctree:: + +: + + maxdepth + + : 1 + + ../advanced/qc ../advanced/assembly diff --git a/docs/usage/configuration.rst b/docs/usage/configuration.rst deleted file mode 100644 index 4864fc6a..00000000 --- a/docs/usage/configuration.rst +++ /dev/null @@ -1,118 +0,0 @@ - -.. -_configuration: - -Configure Atlas -*************** - -.. -_contaminants: - -Remove reads from Host -====================== - -One of the most important steps in the Quality control is to remove reads from the host's genome. -You can add any number of genomes to be removed. - -We recommend using genomes where repetitive sequences are masked. -See here for more details `human genome `_. - - -Co-abundance Binning -==================== - -.. _cobinning: - -While binning each sample individually is faster, using co-abundance for binning is recommended. -Quantifying the coverage of contigs across multiple samples provides valuable insights about contig co-variation. - -There are two primary strategies for co-abundance binning: - -1. **Cross mapping:** Map the reads from multiple samples to each sample's contigs. -2. **Co-binning:** Concatenate contigs from multiple samples and map all the reads to these combined contigs. - -`final_binner: metabat2` is used for cross-mapping, while `vamb` or `SemiBin` is used for co-binning. - -The samples to be binned together are specified using the `BinGroup` in the `sample.tsv` file. -The size of the BinGroup should be selected based on the binner and the co-binning strategy in use. - -Cross-mapping complexity scales quadratically with the size of the BinGroup since each sample's reads are mapped to each other. -This might yield better results for complex metagenomes, although no definitive benchmark is known. -On the other hand, co-binning is more efficient, as it maps a sample's reads only once to a potentially large assembly. - -Default Behavior ----------------- - -Starting with version 2.18, Atlas places every sample in a single BinGroup and defaults to `vamb` as the binner unless there are very few samples. -For fewer than 8 samples, `metabat` is the default binner. - -.. note:: - This represents a departure from previous versions, where each sample had its own BinGroup. 
- Running `vamb` in those versions would consider all samples, regardless of their BinGroup. - This change might cause errors if using a `sample.tsv` file from an older Atlas version. - Typically, you can resolve this by assigning a unique BinGroup to each sample. - -The mapping threshold has been adjusted to 95% identity (single sample binning is 97%) to allow reads from different strains — -but not other species — to map to contigs from a different sample. - -If you're co-binning more than 150-200 samples or cross-mapping more than 50 samples, Atlas will issue a warning regarding excessive samples in a BinGroup. -Although VAMB's official publication suggests it can handle up to 1000 samples, this demands substantial resources. - -Therefore, splitting your samples into multiple BinGroups is recommended. -Ideally, related samples, or those where the same species are anticipated, should belong to the same BinGroup. - -Single-sample Binning ---------------------- - -To employ single-sample binning, simply assign each sample to its own BinGroup and select `metabat` or `DASTool` as the `final_binner`. - -Although it's not recommended, it's feasible to use `DASTool` and feed it inputs from `metabat` and other co-abundance-based binners. - -Add the following lines to your `config.yaml`: - - -.. code-block:: yaml - - final_binner: DASTool - - binner: - - metabat - - maxbin - - vamb - - - -.. _longreads: - -Long reads -========== - -Limitation: Hybrid assembly of long and short reads is supported with spades and metaSpades. -However, metaSpades needs a paired-end short-read library. - -The path of the (preprocessed) long reads should be added manually to the -sample table under a new column heading 'longreads'. - -In addition, the type of the long reads should be defined in the config file: -``longread_type`` one of ["pacbio", "nanopore", "sanger", "trusted-contigs", "untrusted-contigs"] - - -Example config file -=================== - - -..include:: ../../config/template_config.yaml - :code: - - - - -Detailed configuration -====================== - -.. -toctree:: - :maxdepth: 1 - - ../advanced/qc - ../advanced/assembly diff --git a/docs/usage/getting_started.md b/docs/usage/getting_started.md new file mode 100644 index 00000000..0b2b3416 --- /dev/null +++ b/docs/usage/getting_started.md @@ -0,0 +1,416 @@ +# Getting Started + +## Setup + +### Conda package manager + +Atlas has **one dependency**: [conda](http://anaconda.org/). All +databases and other dependencies are installed **on the fly**. Atlas is +based on snakemake, which allows to run steps of the workflow in +parallel on a cluster. + +If you want to try atlas and have a linux computer (OSX may also work), +you can use our [example data](usage/getting_started:Example Data) for testing. + +For real metagenomic data atlas should be run on a *linux* sytem, +with enough memory (min \~50GB but assembly usually requires 250GB). + +You need to install [anaconda](http://anaconda.org/) or miniconda. If +you haven\'t done it already, you need to configure conda with the +bioconda-channel and the conda-forge channel. This are sources for +packages beyond the default one. Setting strict channel priority can +prevent quite some annoyances. + +``` bash +conda config --set channel_priority strict +conda config --add channels bioconda +conda config --add channels conda-forge +``` + +The order is important by the way. + +### Install mamba + +Conda can be a bit slow because there are so many packages. 
A good way +around this is to use [mamba](https://github.com/TheSnakePit/mamba) +(another snake).: + + conda install mamba + +From now on, you can replace `conda install` with `mamba install` and +see how much faster this snake is. + +### Install metagenome-atlas + +We recommend to install metagenome-atlas into a conda environment e.g. +named `atlasenv`. We also recommend to specify the latest version of +metagenome-atlas. + +``` bash +mamba create -y -n atlasenv metagenome-atlas={latest_version} +source activate atlasenv +``` + +where `{latest_version}` should be replaced by + +[![image](https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg)](https://anaconda.org/bioconda/metagenome-atlas) + +### Install metagenome-atlas from GitHub + +Alternatively, you can install metagenome Atlas directly from GitHub. +This allows you to access versions that are not yet in the conda +release, e.g. versions that are still in development. + +``` bash +git clone https://github.com/metagenome-atlas/atlas.git +cd atlas + +# optional change to different branch +# git checkout branchname + +# create dependencies for atlas +mamba env create -n atlas-dev --file atlasenv.yml +conda activate atlas-dev + +# install atlas version. Changes in the files are directly available in the atlas dev version +pip install --editable . +cd .. +``` + +## Example Data {#example-data} + +If you want to test atlas on a small example data, here is a two sample, +three genome minimal metagenome dataset, to test atlas. Even when atlas +will run faster on the test data, it will anyway download all the +databases and requirements, for a complete run, which can take a certain +amount of time and especially disk space (>100Gb). + +The database dir of the test run should be the same as for the later +atlas executions. + +The example data can be downloaded as following + +``` bash +wget https://zenodo.org/record/3992790/files/test_reads.tar.gz +tar -xzf test_reads.tar.gz +``` + +## Usage + +### Start a new project + +Let\'s apply atlas on your data or on our [example data](usage/getting_started:Example Data): + + atlas init --db-dir databases path/to/fastq_files + +This command parses the folder for fastq files (extension `.fastq(.gz)` +or `.fq(.gz)` , gzipped or not). fastq files can be arranged in +subfolders, in which case the subfolder name will be used as a sample +name. If you have paired-end reads the files are usually distinguishable +by `_R1/_R2` or simple `_1/_2` in the file names. Atlas searches for +these patterns and lists the paired-end files for each sample. + +The command creates a `samples.tsv` and a `config.yaml` in the working +directory. + +Have a look at them with a normal text editor and check if the sample +names are inferred correctly. The sample names are used for the naming +of contigs, genes, and genomes. Therefore, the sample names should +consist only of digits and letters and start with a letter (Even though +one `-` is allowed). Atlas tries to simplify the file name to obtain +unique sample names, if it doesn\'t succeed it simply puts S1, S2, \... +as sample names. + +See the +[example sample table](../reports/samples.tsv) + +The `BinGroup` parameter is used during the genomic binning. In short: +If you have between 5 and 150 samples the default (putting everything in +one group) is fine. If you have less than 5 samples, put every sample in +an individual BinGroup and use `metabat` as final binner. If +you have more samples see the **cobinning** +section for more details. 
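+
+For orientation, a `samples.tsv` might look roughly like the sketch below.
+This is only an illustration: the sample names and the `BinGroup` column
+follow the rules described above, but the exact column names for the read
+files can differ between atlas versions, so treat them as placeholders and
+check the table that `atlas init` actually generated for you.
+
+``` text
+sample    BinGroup   R1                              R2
+Sample1   All        /path/to/Sample1_R1.fastq.gz    /path/to/Sample1_R2.fastq.gz
+Sample2   All        /path/to/Sample2_R1.fastq.gz    /path/to/Sample2_R2.fastq.gz
+```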
+ +::: note +::: title +Note +::: + +If you want to use **long reads** for a hybrid assembly, you can also specify them in the +sample table. +::: + +You should also check the `config.yaml` file, especially: + +- You may want to add ad + [host genomes](advanced/qc:Contamination Parameters) to be + removed. +- You may want to change the resources configuration, depending on the + system you run atlas on. + +Details about the parameters can be found in the section +[Configuration](configuration.md) + +Keep in mind that all databases are installed in the directory specified +with `--db-dir` so choose it wisely. + +``` text +Usage: atlas init [OPTIONS] PATH_TO_FASTQ + + Write the file CONFIG and complete the sample names and paths for all + FASTQ files in PATH. + + PATH is traversed recursively and adds any file with '.fastq' or '.fq' in + the file name with the file name minus extension as the sample ID. + +Options: + -d, --db-dir PATH location to store databases (need ~50GB) + [default: /Users/silas/Documents/GitHub/atla + s/databases] + + -w, --working-dir PATH location to run atlas + --assembler [megahit|spades] assembler [default: spades] + --data-type [metagenome|metatranscriptome] + sample data type [default: metagenome] + --interleaved-fastq fastq files are paired-end in one files + (interleaved) + + --threads INTEGER number of threads to use per multi-threaded + job + + --skip-qc Skip QC, if reads are already pre-processed + -h, --help Show this message and exit. +``` + +### Start a new project with public data + +Since v2.9 atlas has possibility to start a new project from public data +stored in the short read archive (SRA). + +You can run `atlas init-public ` and specify any ids, like +bioprojects, or other SRA ids. + +Atlas does the following steps: + +> 1. Search SRA for the corresponding sequences (Runs) and save them in +> the file `SRA/RunInfo_original.tsv`. For example, if you specify a +> Bioproject, it fetches the information for all runs of this +> project. +> 2. Atlas filters the runs to contain only valid metagenome sequences. +> E.g. exclude singleton reads, 16S. The output will be saved in +> `RunInfo.tsv` +> 3. Sometimes the same Sample is sequenced on different lanes, which +> will result into multiple runs from the same sample. Atlas will +> **merge** runs from the same biosample. +> 4. Prepare a sample table and a config.yaml similar to the +> `atlas init` command. + +If you are not happy with the filtering atlas performs, you can go back +to the `SRA/RunInfo_original.tsv` and create a new `RunInfo.tsv`. If you +then rerun `atlas init-public continue` it will continue from your +modified RunInfo and do step 3. & 4. above. + +Limitations: For now atlas, cannot handle a mixture of paired and single +end reads, so we focus primarily on the paired end. If you have +longreads for your project, you would need to specify them yourself in +the sample.tsv. + +During the run, the reads are downloaded from SRA in the likely most +efficient way using prefetch and parallel, fastq.gz generation. The +download step has checkpoints, so if the pipeline gets interrupted, you +can restart where you left off. Using the command line arguments +`--restart-times 3 and --keep-going` You can even ask atlas to do +multiple restarts before stopping. + +The downloaded reads are directly processed. 
However, if you only want
+to download the reads you can use:
+
+    atlas run None download_sra
+
+#### Example: Downloading reads from the human microbiome project 2
+
+    atlas init-public --working-dir HMP2 PRJNA398089
+
+Gives the output:
+
+    [Atlas] INFO: Downloading runinfo from SRA
+    [Atlas] INFO: Start with 2979 runs from 2979 samples
+    [Atlas] INFO: Runs have the folowing values for LibrarySource: METAGENOMIC, METATRANSCRIPTOMIC
+            Select only runs LibrarySource == METAGENOMIC, Filtered out 762 runs
+    [Atlas] INFO: Runs have the folowing values for LibrarySelection: PCR, RT-PCR, RANDOM
+            Select only runs LibrarySelection == RANDOM, Filtered out 879 runs
+    [Atlas] INFO: Selected 1338 runs from 1338 samples
+    [Atlas] INFO: Write filtered runinfo to HMP2/RunInfo.tsv
+    [Atlas] INFO: Prepared sample table with 1338 samples
+    [Atlas] INFO: Configuration file written to HMP2/config.yaml
+        You may want to edit it using any text editor.
+
+### Run atlas
+
+    atlas run genomes
+
+`atlas run` needs to know the working directory with a `samples.tsv`
+inside it.
+
+Take note of the `--dryrun` parameter; see the section
+[snakemake](usage/getting_started:Useful command line options) for other handy snakemake
+arguments.
+
+We recommend running atlas on a [cluster](usage/getting_started:Cluster execution)
+system, which can be set up with a few more commands.
+
+``` text
+Usage: atlas run [OPTIONS] [qc|assembly|binning|genomes|genecatalog|None|all]
+                 [SNAKEMAKE_ARGS]...
+
+  Runs the ATLAS pipeline
+
+  By default all steps are executed but a sub-workflow can be specified.
+  Needs a config-file and expects to find a sample table in the working-
+  directory. Both can be generated with 'atlas init'
+
+  Most snakemake arguments can be appended to the command for more info see
+  'snakemake --help'
+
+  For more details, see: https://metagenome-atlas.readthedocs.io
+
+Options:
+  -w, --working-dir PATH  location to run atlas.
+  -c, --config-file PATH  config-file generated with 'atlas init'
+  -j, --jobs INTEGER      use at most this many jobs in parallel (see cluster
+                          submission for more details).
+
+  --profile TEXT          snakemake profile e.g. for cluster execution.
+  -n, --dryrun            Test execution.  [default: False]
+  -h, --help              Show this message and exit.
+```
+
+# Execute Atlas
+
+## Cluster execution {#cluster-execution}
+
+### Automatic submitting to cluster systems
+
+Thanks to the underlying snakemake, Atlas can submit parts of the
+pipeline automatically to a cluster or cloud system and request the
+appropriate resources. As soon as one job has finished, it launches the
+next one. This lets you use the full capacity of your cluster system;
+you may even need to take care not to spam the other users of the
+cluster.
+
+Instead of running all steps of the pipeline in one cluster job, atlas
+automatically submits each step to your cluster system, specifying the
+necessary threads, memory, and runtime based on the values in the
+config file. Atlas periodically checks the status of each cluster job
+and can re-run failed jobs or continue with other jobs.
+
+See atlas scheduling jobs on a cluster in action.
+
+If you have a common cluster system (Slurm, LSF, PBS, ...) we have an
+easy setup (see below). Otherwise, if you have a different cluster
+system, file a GitHub issue (feature request) so we can help you bring
+the magic of atlas to your cluster system.
For more information about +cluster- and cloud submission, have a look at the [snakemake cluster +docs](https://snakemake.readthedocs.io/en/stable/executing/cluster-cloud.html). + +### Set up of cluster execution + +You need cookiecutter to be installed, which comes with atlas + +Then run: + + cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git + +This opens an interactive shell dialog and ask you for the name of the +profile and your cluster system. We recommend you keep the default name +`cluster`. The profile was tested on `slurm`, `lsf` and `pbs`. + +The resources (threads, memory and time) are defined in the atlas config +file (hours and GB). + +**Specify queues and accounts** + +If you have different **queues/partitions** on your cluster system you +should tell atlas about them so it can *automatically choose the best +queue*. Adapt the template for the queues.tsv: + + cp ~/.config/snakemake/cluster/queues.tsv.example ~/.config/snakemake/cluster/queues.tsv + +Now enter the information about the queues/partitions on your particular +system. + +If you need to specify **accounts** or other options for one or all +rules you can do this for all rules or for specific rules in the +`~/.config/snakemake/cluster/cluster_config.yaml`. In addition, using +this file you can overwrite the resources defined in the config file. + +Example for `cluster_config.yaml` with queues defined: + + __default__: + # default parameter for all rules + account: project_1345 + nodes: 1 + +Now, you can run atlas on a cluster with: + + atlas run --profile cluster + +As the whole pipeline can take several days, I usually run atlas itself +on a cluster in a long running queue. + +If a job fails, you will find the \"external jobid\" in the error +message. You can investigate the job via this ID. + +The atlas argument `--jobs` now becomes the number of jobs +simultaneously submitted to the cluster system. You can set this as high +as 99 if your colleagues don\'t mind you over-using the cluster system. + +## Single machine execution {#local} + +If you don\'t want to use the +[automatic scheduling](usage/getting_started:Cluster execution) you can +use atlas on a single machine (local execution) with a lot of memory and +threads ideally. In this case I recommend you the following options. The +same applies if you submit a single job to a cluster running atlas. + +Atlas detects how many CPUs and how much memory is available on your +system and it will schedule as many jobs in parallel as possible. If you +have less resources available than specified in the config file, the +jobs are downscaled. + +By default atlas will use all cpus and 95% of all the available memory. +If you are not happy with that, or you need to specify an exact amount +of memory/ cpus you can use the command line arguments `--jobs` and +`--max-mem` to do so. + +## Cloud execution + +Atlas, like any other snakemake pipeline can also easily be submitted to +cloud systems. I suggest looking at the [snakemake +doc](https://snakemake.readthedocs.io/en/stable/executing/cluster-cloud.html). +Keep in mind any snakemake command line argument can just be appended to +the atlas command. + +## Useful command line options {#useful-command-line-options} + +Atlas builds on snakemake. We designed the command line interface in a +way that additional snakemake arguments can be added to an atlas run +call. + +For instance the `--profile` used for cluster execution. 
Other handy +snakemake command line arguments include: + +> `--keep-going`, which allows atlas in the case of a failed job to +> continue with independent steps. +> +> `--report`, which allows atlas to generate a user-friendly run report +> (e.g., by specifying `--report report.html`). This report includes the +> steps used in the analysis workflow and the versions of software tools +> used at each step. See discussions +> [#523](https://github.com/metagenome-atlas/atlas/discussions/523) and +> [#514](https://github.com/metagenome-atlas/atlas/discussions/514))). + +For a full list of snakemake arguments see the [snakemake +doc](https://snakemake.readthedocs.io/en/stable/executing/cli.html#all-options). diff --git a/docs/usage/getting_started.rst b/docs/usage/getting_started.rst deleted file mode 100644 index 3c53f42b..00000000 --- a/docs/usage/getting_started.rst +++ /dev/null @@ -1,382 +0,0 @@ -.. _conda: http://anaconda.org/ -.. _mamba: https://github.com/TheSnakePit/mamba - -Getting Started -*************** - -Setup -===== - -Conda package manager ---------------------- - -Atlas has **one dependency**: conda_. All databases and other dependencies are installed **on the fly**. -Atlas is based on snakemake, which allows to run steps of the workflow in parallel on a cluster. - -If you want to try atlas and have a linux computer (OSX may also work), you can use our `example data`_ for testing. - -For real metagenomic data atlas should be run on a _linux_ sytem, with enough memory (min ~50GB but assembly usually requires 250GB). - - - -You need to install `anaconda `_ or miniconda. -If you haven't done it already, you need to configure conda with the bioconda-channel and the conda-forge channel. This are sources for packages beyond the default one. -Setting strict channel priority can prevent quite some annoyances. - -.. code-block:: bash - conda config --set channel_priority strict - conda config --add channels bioconda - conda config --add channels conda-forge - -The order is important by the way. - -Install mamba -------------- - -Conda can be a bit slow because there are so many packages. A good way around this is to use mamba_ (another snake).:: - - conda install mamba - - -From now on, you can replace ``conda install`` with ``mamba install`` and see how much faster this snake is. - -Install metagenome-atlas ------------------------- - -We recommend to install metagenome-atlas into a conda environment e.g. named ``atlasenv``. -We also recommend to specify the latest version of metagenome-atlas. - -.. code-block:: bash - - mamba create -y -n atlasenv metagenome-atlas={latest_version} - source activate atlasenv - -where `{latest_version}` should be replaced by - -.. image:: https://anaconda.org/bioconda/metagenome-atlas/badges/version.svg - :target: https://anaconda.org/bioconda/metagenome-atlas - - - - -Install metagenome-atlas from GitHub ------------------------------------- - -Alternatively, you can install metagenome Atlas directly from GitHub. This allows you to access versions that are not yet in the conda release, e.g. versions that are still in development. - -.. code-block:: bash - - git clone https://github.com/metagenome-atlas/atlas.git - cd atlas - - # optional change to different branch - # git checkout branchname - - # create dependencies for atlas - mamba env create -n atlas-dev --file atlasenv.yml - conda activate atlas-dev - - # install atlas version. Changes in the files are directly available in the atlas dev version - pip install --editable . - cd .. - - - - - -.. 
_`example data`: - -Example Data -============ - -If you want to test atlas on a small example data, here is a two sample, three genome minimal metagenome dataset, -to test atlas. Even when atlas will run faster on the test data, -it will anyway download all the databases and requirements, for a complete run, -which can take a certain amount of time and especially disk space (>100Gb). - -The database dir of the test run should be the same as for the later atlas executions. - -The example data can be downloaded as following - -.. code-block:: bash - - wget https://zenodo.org/record/3992790/files/test_reads.tar.gz - tar -xzf test_reads.tar.gz - - - -Usage -===== - -Start a new project -------------------- - -Let's apply atlas on your data or on our `example data`_:: - - atlas init --db-dir databases path/to/fastq_files - -This command parses the folder for fastq files (extension ``.fastq(.gz)`` or ``.fq(.gz)`` , gzipped or not). fastq files can be arranged in subfolders, in which case the subfolder name will be used as a sample name. If you have paired-end reads the files are usually distinguishable by ``_R1/_R2`` or simple ``_1/_2`` in the file names. Atlas searches for these patterns and lists the paired-end files for each sample. - -The command creates a ``samples.tsv`` and a ``config.yaml`` in the working directory. - -Have a look at them with a normal text editor and check if the sample names are inferred correctly. The sample names are used for the naming of contigs, genes, and genomes. Therefore, the sample names should consist only of digits and letters and start with a letter (Even though one ``-`` is allowed). Atlas tries to simplify the file name to obtain unique sample names, if it doesn't succeed it simply puts S1, S2, ... as sample names. - - -See the :download:`example sample table <../reports/samples.tsv>` - -The ``BinGroup`` parameter is used during the genomic binning. -In short: If you have between 5 and 150 samples the default (putting everything in one group) is fine. -If you have less than 5 samples, put every sample in an individual BinGroup and use `metabat` as final binner. -If you have more samples see the :ref:`cobinning` section for more details. - -.. note:: If you want to use :ref:`long reads ` for a hybrid assembly, you can also specify them in the sample table. - - -You should also check the ``config.yaml`` file, especially: - - -- You may want to add ad :ref:`host genomes ` to be removed. -- You may want to change the resources configuration, depending on the system you run atlas on. - -Details about the parameters can be found in the section :ref:`Configuration` - -Keep in mind that all databases are installed in the directory specified with ``--db-dir`` so choose it wisely. - - -.. code-block:: text - - Usage: atlas init [OPTIONS] PATH_TO_FASTQ - - Write the file CONFIG and complete the sample names and paths for all - FASTQ files in PATH. - - PATH is traversed recursively and adds any file with '.fastq' or '.fq' in - the file name with the file name minus extension as the sample ID. 
- - Options: - -d, --db-dir PATH location to store databases (need ~50GB) - [default: /Users/silas/Documents/GitHub/atla - s/databases] - - -w, --working-dir PATH location to run atlas - --assembler [megahit|spades] assembler [default: spades] - --data-type [metagenome|metatranscriptome] - sample data type [default: metagenome] - --interleaved-fastq fastq files are paired-end in one files - (interleaved) - - --threads INTEGER number of threads to use per multi-threaded - job - - --skip-qc Skip QC, if reads are already pre-processed - -h, --help Show this message and exit. - - - -Start a new project with public data ------------------------------------- - -Since v2.9 atlas has possibility to start a new project from public data stored in the short read archive (SRA). - -You can run ``atlas init-public `` and specify any ids, like bioprojects, or other SRA ids. - -Atlas does the following steps: - - 1. Search SRA for the corresponding sequences (Runs) and save them in the file ``SRA/RunInfo_original.tsv``. For example, if you specify a Bioproject, it fetches the information for all runs of this project. - 2. Atlas filters the runs to contain only valid metagenome sequences. E.g. exclude singleton reads, 16S. The output will be saved in ``RunInfo.tsv`` - 3. Sometimes the same Sample is sequenced on different lanes, which will result into multiple runs from the same sample. Atlas will **merge** runs from the same biosample. - 4. Prepare a sample table and a config.yaml similar to the ``atlas init`` command. - - -If you are not happy with the filtering atlas performs, you can go back to the ``SRA/RunInfo_original.tsv`` and create a new ``RunInfo.tsv``. -If you then rerun ``atlas init-public continue`` it will continue from your modified RunInfo and do step 3. & 4. above. - - -Limitations: For now atlas, cannot handle a mixture of paired and single end reads, so we focus primarily on the paired end. -If you have longreads for your project, you would need to specify them yourself in the sample.tsv. - -During the run, the reads are downloaded from SRA in the likely most efficient way using prefetch and parallel, fastq.gz generation. -The download step has checkpoints, so if the pipeline gets interrupted, you can restart where you left off. -Using the command line arguments ``--restart-times 3 and --keep-going`` You can even ask atlas to do multiple restarts before stopping. - -The downloaded reads are directly processed. However, if you only want to download the reads you can use:: - - atlas run None download_sra - -Example: Downloading reads from the human microbiome project2 -````````````````````````````````````````````````````````````` -:: - - atlas init-public --working-dir HMP2 PRJNA398089 - -Gives the output:: - - [Atlas] INFO: Downloading runinfo from SRA - [Atlas] INFO: Start with 2979 runs from 2979 samples - [Atlas] INFO: Runs have the folowing values for LibrarySource: METAGENOMIC, METATRANSCRIPTOMIC - Select only runs LibrarySource == METAGENOMIC, Filtered out 762 runs - [Atlas] INFO: Runs have the folowing values for LibrarySelection: PCR, RT-PCR, RANDOM - Select only runs LibrarySelection == RANDOM, Filtered out 879 runs - [Atlas] INFO: Selected 1338 runs from 1338 samples - [Atlas] INFO: Write filtered runinfo to HMP2/RunInfo.tsv - [Atlas] INFO: Prepared sample table with 1338 samples - [Atlas] INFO: Configuration file written to HMP2/config.yaml - You may want to edit it using any text editor. 
- - - - - -Run atlas ---------- - -:: - - atlas run genomes - - -``atlas run`` need to know the working directory with a ``samples.tsv`` inside it. - -Take note of the ``--dryrun`` parameter, see the section :ref:`snakemake` for other handy snakemake arguments. - -We recommend to use atlas on a :ref:`cluster` system, which can be set up in a view more commands. - - -.. code-block:: text - - Usage: atlas run [OPTIONS] [qc|assembly|binning|genomes|genecatalog|None|all] - [SNAKEMAKE_ARGS]... - - Runs the ATLAS pipeline - - By default all steps are executed but a sub-workflow can be specified. - Needs a config-file and expects to find a sample table in the working- - directory. Both can be generated with 'atlas init' - - Most snakemake arguments can be appended to the command for more info see - 'snakemake --help' - - For more details, see: https://metagenome-atlas.readthedocs.io - - Options: - -w, --working-dir PATH location to run atlas. - -c, --config-file PATH config-file generated with 'atlas init' - -j, --jobs INTEGER use at most this many jobs in parallel (see cluster - submission for more details). - - --profile TEXT snakemake profile e.g. for cluster execution. - -n, --dryrun Test execution. [default: False] - -h, --help Show this message and exit. - - -Execue Atlas -************ - - -.. _cluster: - -Cluster execution -================= - -Automatic submitting to cluster systems ---------------------------------------- - -Thanks to the underlying snakemake Atlas can submit parts of the pipeline automatically to a cluster system and define the appropriate resources. If one job has finished it launches the next one. -This allows to use the full capacity of your cluster system. You even need to pay attention not to spam the other users of the cluster. - - - - -Thanks to the underlying snakemake system, atlas can submit parts of the pipeline to clusters and cloud systems. Instead of running all steps of the pipeline in one cluster job, atlas can automatically submit each step to your cluster system, specifying the necessary threads, memory, and runtime, based on the values in the config file. Atlas periodically checks the status of each cluster job and can re-run failed jobs or continue with other jobs. - -See atlas scheduling jobs on a cluster in action ``_. - -If you have a common cluster system (Slurm, LSF, PBS ...) we have an easy set up (see below). Otherwise, if you have a different cluster system, file a GitHub issue (feature request) so we can help you bring the magic of atlas to your cluster system. -For more information about cluster- and cloud submission, have a look at the `snakemake cluster docs `_. - -Set up of cluster execution ---------------------------- - -You need cookiecutter to be installed, which comes with atlas - -Then run:: - - cookiecutter --output-dir ~/.config/snakemake https://github.com/metagenome-atlas/clusterprofile.git - -This opens an interactive shell dialog and ask you for the name of the profile and your cluster system. -We recommend you keep the default name ``cluster``. The profile was tested on ``slurm``, ``lsf`` and ``pbs``. - -The resources (threads, memory and time) are defined in the atlas config file (hours and GB). - -**Specify queues and accounts** - - -If you have different **queues/partitions** on your cluster system you should tell atlas about them so it can *automatically choose the best queue*. 
Adapt the template for the ``queues.tsv``::

    cp ~/.config/snakemake/cluster/queues.tsv.example ~/.config/snakemake/cluster/queues.tsv

Now enter the information about the queues/partitions of your particular system.


If you need to specify **accounts** or other options, you can do this for all rules or for specific rules in ``~/.config/snakemake/cluster/cluster_config.yaml``. Using this file you can also overwrite the resources defined in the config file.

Example for ``cluster_config.yaml`` with an account defined::

    __default__:
      # default parameters for all rules
      account: project_1345
      nodes: 1


Now you can run atlas on a cluster with::

    atlas run --profile cluster


As the whole pipeline can take several days, I usually run atlas itself on the cluster in a long-running queue.

  .. The mapping between resources and cluster options is defined in ``~/.config/snakemake/cluster/key_mapping.yaml``.


If a job fails, you will find the "external jobid" in the error message.
You can investigate the job via this ID.


The atlas argument ``--jobs`` now becomes the number of jobs submitted simultaneously to the cluster system. You can set this as high as 99 if your colleagues don't mind you over-using the cluster.


.. _local:

Single machine execution
========================

If you don't want to use the :ref:`automatic scheduling <cluster>` you can run atlas on a single machine (local execution), ideally one with a lot of memory and threads. In this case I recommend the following options. The same applies if you submit a single job running atlas to a cluster.

Atlas detects how many CPUs and how much memory are available on your system and schedules as many jobs in parallel as possible. If fewer resources are available than specified in the config file, the jobs are downscaled.

By default atlas uses all CPUs and 95% of the available memory. If you are not happy with that, or you need to specify an exact amount of memory/CPUs, you can use the command line arguments ``--jobs`` and ``--max-mem``.


Cloud execution
===============

Atlas, like any other snakemake pipeline, can also easily be submitted to cloud systems. I suggest looking at the `snakemake doc `_. Keep in mind that any snakemake command line argument can simply be appended to the atlas command.


.. _snakemake:

Useful command line options
===========================

Atlas builds on snakemake. We designed the command line interface so that additional snakemake arguments can be added to an ``atlas run`` call, for instance the ``--profile`` argument used for cluster execution. Other handy snakemake command line arguments include:

 - ``--keep-going``, which allows atlas to continue with independent steps when a job fails.

 - ``--report``, which lets atlas generate a user-friendly run report (e.g., by specifying ``--report report.html``). This report includes the steps used in the analysis workflow and the versions of the software tools used at each step. See discussions `#523 `_ and `#514 `_.

For a full list of snakemake arguments see the `snakemake doc `_.

diff --git a/docs/usage/output.md b/docs/usage/output.md
new file mode 100644
index 00000000..bc934028
--- /dev/null
+++ b/docs/usage/output.md
@@ -0,0 +1,321 @@

# Expected output

![Atlas is a workflow for assembly and binning of metagenomic reads](../../resources/images/atlas_list.png)

There are two main workflows implemented in atlas: A. *Genomes* and B. *Genecatalog*.
The first aims at producing metagenome-assembled genomes (MAGs), whereas the
latter produces a gene catalog. The steps of quality control and assembly are
shared by both workflows.

::: note
Have a look at the example output.
:::

## Quality control

    atlas run qc

Runs quality control of single- or paired-end reads and summarizes the main QC
stats in [reports/QC_report.html](../reports/QC_report.html).

Per sample it generates:

- `QC/reads/{sample}_{fraction}.fastq.gz`

### Fractions

When the input is paired-end, the reads are output in three fractions: R1, R2
and se. The se fraction contains the paired-end reads that lost their mate
during filtering.

The se reads are not used further, as they usually represent an insignificant
number of reads.

## Assembly

    atlas run assembly

Besides the [reports/assembly_report.html](../reports/assembly_report.html),
this rule outputs the following files per sample:

- `Assembly/fasta/{sample}.fasta`
- `{sample}/sequence_alignment/{sample}.bam`

## Binning

    atlas run binning

When you use different binners (e.g. vamb, metabat, DASTool), Atlas produces
for each binner and sample:

- `{sample}/binning/{binner}/cluster_attribution.tsv`

which shows the attribution of contigs to bins. For the `final_binner` it
produces

- `reports/bin_report_{binner}.html`

See an [example](../reports/bin_report_DASTool.html) summarizing the quality
of all bins.

::: seealso
In version 2.8 the new binners *vamb* and *SemiBin* were added. First
experience shows that they outperform the default binners (metabat, maxbin +
DASTool). They use a co-binning approach that exploits the co-abundance of
contigs across samples. For more information see the [detailed explanation
here](https://silask.github.io/post/phd-thesis/Thesis_Silas_Kieser.pdf) on
page 14.
:::

::: note
Keep in mind that maxbin, DASTool, and SemiBin are biased towards prokaryotes.
If you want to try to bin (small) eukaryotes, use metabat or vamb. For more
information about eukaryotes see [the discussion
here](https://github.com/metagenome-atlas/atlas/discussions/427).
:::

## Genomes

    atlas run genomes

Binning can predict the same genome several times from different samples. To
remove this redundancy we use DeRep to filter and de-replicate the genomes. By
default the threshold is set to **97.5%**, which corresponds roughly to the
*sub-species level*. The best-quality genome of each cluster is chosen as its
representative. The representative MAGs are then renamed and used for
annotation and quantification.

The fasta sequences of the dereplicated and renamed genomes can be found in
`genomes/genomes`, and their quality estimates are in
`genomes/checkm/completeness.tsv`.

### Quantification

The quantification of the genomes can be found in:

- `genomes/counts/median_coverage_genomes.tsv`
- `genomes/counts/raw_counts_genomes.tsv`

::: seealso
See the [Atlas example](https://github.com/metagenome-atlas/Tutorial) for how
to analyze these abundances.
:::
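If you want to explore these tables outside of atlas, here is a minimal sketch
with pandas. The tables are tab-separated; whether samples end up as rows or
columns can depend on the atlas version, so check the shape first.

``` python
import pandas as pd

# genome abundance tables written by `atlas run genomes`
coverage = pd.read_csv("genomes/counts/median_coverage_genomes.tsv",
                       sep="\t", index_col=0)
raw_counts = pd.read_csv("genomes/counts/raw_counts_genomes.tsv",
                         sep="\t", index_col=0)

print(coverage.shape)
print(coverage.head())

# e.g. turn the median coverage into relative abundances per sample
# (assumes samples as rows and genomes as columns; transpose if needed)
relative_abundance = coverage.div(coverage.sum(axis=1), axis=0)
```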
### Annotations

The annotations can be turned on and off in the config file:

    annotations:
      - genes
      - gtdb_tree
      - gtdb_taxonomy
      - kegg_modules
      - dram

The `genes` option produces predicted genes and translated protein sequences,
which are stored in `genomes/annotations/genes`.

**Taxonomic annotation**

A taxonomy for the genomes is proposed by the [Genome Taxonomy
Database](https://gtdb.ecogenomic.org/) (GTDB). The results can be found in
`genomes/taxonomy`. The genomes are placed in a phylogenetic tree, separately
for bacteria and archaea, using the GTDB markers.

In addition, a tree for bacteria and archaea can be generated based on the
checkm markers. All trees are rooted at the midpoint. The files can be found
in `genomes/tree`.

**Functional annotation**

Since version 2.8, we use [DRAM](https://github.com/shafferm/DRAM) to annotate
the genomes with functional annotations, e.g. KEGG and CAZy, as well as to
**infer pathways**, or more specifically KEGG modules.

The functional annotations for each genome can be found in
`genomes/annotations/dram/` and contain the following files:

- `kegg_modules.tsv` Table of all KEGG modules
- `annotations.tsv` Table of all annotations
- `distil/metabolism_summary.xlsx` Excel summary of all annotations

The tool also produces a nice report in
[distil/product.html](../reports/dram_product.html).

## Gene Catalog

    atlas run all
    # or
    atlas run genecatalog

The gene catalog takes all genes predicted from the contigs and clusters them
according to the configuration. It quantifies them by simply mapping reads to
the genes (cds sequences) and annotates them using EggNOG mapper.

This rule produces the following output files for the whole dataset:

- `Genecatalog/gene_catalog.fna`
- `Genecatalog/gene_catalog.faa`
- `Genecatalog/annotations/eggNog.tsv.gz`
- `Genecatalog/counts/`

Since version 2.15 the outputs of the quantification are stored in two HDF
files in the folder `Genecatalog/counts/`:

- `median_coverage.h5`
- `Nmapped_reads.h5`

together with the statistics per gene and per sample:

- `gene_coverage_stats.parquet`
- `sample_coverage_stats.tsv`

The HDF files only contain a matrix of abundances or counts under the name
`data`. The sample names are stored as attributes. The gene names (e.g.
`Gene00001`) are simply the row numbers.

You can open the HDF file in Python or R as follows:

``` python
import h5py

filename = "path/to/atlas_dir/Genecatalog/counts/median_coverage.h5"

with h5py.File(filename, 'r') as hdf_file:
    data_matrix = hdf_file['data'][:]
    sample_names = hdf_file['data'].attrs['sample_names'].astype(str)
```

``` R
library(rhdf5)

filename = "path/to/atlas_dir/Genecatalog/counts/median_coverage.h5"

data <- h5read(filename, "data")

attributes = h5readAttributes(filename, "data")

colnames(data) <- attributes$sample_names
```

You don't need to load the full data. You can select only a subset of the
genes, e.g. the genes with annotations, or the genes that are not singletons.
To find out whether a gene is a singleton you can use the file
`gene_coverage_stats.parquet`.

``` R
library(rhdf5)
library(dplyr)
library(tibble)

# read only a subset of the data
indexes_of_genes_to_load = c(2,5,100,150) # e.g. genes with annotations
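# Alternative sketch: derive the index vector from the per-gene statistics
# instead of hard-coding it (assumes the 'arrow' package is installed).
# The column name used below ("Samples_nz_counts", meant as "number of samples
# in which the gene was detected") is only an assumption -- inspect
# names(gene_stats) for the real columns of gene_coverage_stats.parquet.
gene_stats <- arrow::read_parquet(file.path(atlas_dir, "Genecatalog/counts/gene_coverage_stats.parquet"))
indexes_of_genes_to_load <- which(gene_stats$Samples_nz_counts > 1) # e.g. drop singletons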
abundance_file <- file.path(atlas_dir,"Genecatalog/counts/median_coverage.h5")


# get dimension of data
h5overview = h5ls(abundance_file)
dim = h5overview[1,"dim"] %>% stringr::str_split(" x ", simplify=T) %>% as.numeric
cat("Load ", length(indexes_of_genes_to_load), " out of ", dim[1], " genes\n")


data <- h5read(file = abundance_file, name = "data",
               index = list(indexes_of_genes_to_load, NULL))

# add sample names
attributes = h5readAttributes(abundance_file, "data")
colnames(data) <- attributes$sample_names


# add gene names (e.g. Gene00001) as rownames
gene_names = paste0("Gene", formatC(format="d", indexes_of_genes_to_load, flag="0", width=ceiling(log10(max(dim[1])))))
rownames(data) <- gene_names


data[1:5,1:5]
```

If you do this, you can use the information in the file
`Genecatalog/counts/sample_coverage_stats.tsv` to normalize the counts.

Here is the R code to calculate the gene copies per million (analogous to
transcripts per million) for the subset of genes.

``` R
# Load gene stats per sample
gene_stats_file = file.path(atlas_dir,"Genecatalog/counts/sample_coverage_stats.tsv")

gene_stats <- read.table(gene_stats_file, sep='\t', header=T, row.names=1)

gene_stats <- t(gene_stats) # might be transposed, sample names should be the index

head(gene_stats)

# calculate copies per million
total_coverage <- gene_stats[colnames(data), "Sum_coverage"]

# naive element-wise division gives wrong results
# (R recycles the per-sample totals down the columns)
#gene_gcpm <- data / total_coverage * 1e6

gene_gcpm <- data %*% diag(1/total_coverage) * 1e6
colnames(gene_gcpm) <- colnames(data)

gene_gcpm[1:5,1:5]
```

::: seealso
See the Atlas Tutorial.
:::

Before version 2.15 the counts were stored in a parquet file, which can be
opened easily with `pandas.read_parquet` or `arrow::read_parquet`. However,
you need to load the full data into memory.

``` R
parquet_file <- file.path(atlas_dir,"Genecatalog/counts/median_coverage.parquet")
gene_abundances <- arrow::read_parquet(parquet_file)

# transform tibble to a matrix
gene_matrix = as.matrix(gene_abundances[,-1])
rownames(gene_matrix) <- gene_abundances$GeneNr


# calculate copies per million (divide each column by its per-sample total)
gene_gcpm = sweep(gene_matrix, 2, colSums(gene_matrix), "/") * 1e6


gene_gcpm[1:5,1:5]
```

## All

The option `atlas run all` runs both the Genecatalog and the Genome workflow
and creates mapping tables between Genecatalog and Genomes. However, in the
future the two workflows are expected to diverge further, so that each can
better fulfill its aim.

If you want to run both workflows together you can do so with:

    atlas run genomes genecatalog

If you are interested in mapping the genes to the genomes, see the discussion
at <https://github.com/metagenome-atlas/atlas/issues/413>.

diff --git a/docs/usage/output.rst b/docs/usage/output.rst
deleted file mode 100644
index e0f2237a..00000000
--- a/docs/usage/output.rst
+++ /dev/null
@@ -1,332 +0,0 @@
-.. |scheme| image:: ../../resources/images/atlas_list.png
-   :alt: Atlas is a workflow for assembly and binning of metagenomic reads
-
-.. _thesis: https://github.com/TheSnakePit/mamba
-
-Expected output
-***************
-
-|scheme|
-
-
-There are two main workflows implemented in atlas. A. *Genomes* and B. *Genecatalog*. The first aims in producing metagenome assembled genomes (MAGs) where as the later produces a gene catalog. The steps of Quality control and and
-
-.. note:: Have a look at the example output at ``_ .
- -Quality control -=============== - -:: - - atlas run qc - -Runs quality control of single or paired end reads and summarizes the main QC stats in -`reports/QC_report.html`_. - -.. _reports/QC_report.html: ../_static/QC_report.html - -Per sample it generates: - - - ``QC/reads/{sample}_{fraction}.fastq.gz`` - - -.. _fractions: - -Fractions: ----------- -When the input was paired end, we will put out three the reads in three fractions R1,R2 and se -The se are the paired end reads which lost their mate during the filtering. - -The se reads are no longer used as they usually represent an insignificant number of reads. - - -Assembly -=============== - -:: - - atlas run assembly - - - -Besides the `reports/assembly_report.html`_ this rule outputs the following files per sample: - - - ``Assembly/fasta/{sample}.fasta`` - - ``{sample}/sequence_alignment/{sample}.bam`` - - - -.. _reports/assembly_report.html: ../_static/assembly_report.html - - - - - - -Binning -=============== -:: - - atlas run binning - - - -When you use different binners (e.g. vamb, metabat, DASTool), -then Atlas will produce for each binner and sample: - - - ``{sample}/binning/{binner}/cluster_attribution.tsv`` - -which shows the attribution of contigs to bins. For the final_binner it produces the - - - ``reports/bin_report_{binner}.html`` - -See an `example <../_static/bin_report.html>`_ as a summary of the quality of all bins. - -.. seealso:: In version 2.8 the new binners *vamb* and *SemiBin* were added. First experience show that they outperform the default binner (metabat, maxbin + DASTool). They use a new approach of co-binning which uses the co-abundance from different samples. For more information see the `detailed explanation here `_ on page 14 - -.. note:: Keep also in mind that maxbin, DASTool, and SemiBin are biased for prokaryotes. If you want to try to bin (small) Eukaryotes use metabat or vamb. More information about Eukaryotes see `the discussion here `_. - - -Genomes -=============== -:: - - atlas run genomes - - -Binning can predict several times the same genome from different samples. To remove this reduncancy we use DeRep to filter and de-replicate the genomes. By default the threshold is set to **97.5%**, which corresponds somewhat to the *sub-species level*. The best quality genome for each cluster is choosen as the representative for each cluster. The represenative MAG are then renamed and used for annotation and quantification. - -The fasta sequence of the dereplicated and renamed genomes can be found in ``genomes/genomes`` -and their quality estimation are in ``genomes/checkm/completeness.tsv``. - -Quantification --------------- - -The quantification of the genomes can be found in: - - - ``genomes/counts/median_coverage_genomes.tsv`` - - ``genomes/counts/raw_counts_genomes.tsv`` - -.. seealso:: See in `Atlas example `_ how to analyze these abundances. - -Annotations ------------ - -The annotation can be turned of and on in the config file:: - - annotations: - - genes - - gtdb_tree - - gtdb_taxonomy - - kegg_modules - - dram - - -The ``genes`` option produces predicted genes and translated protein sequences which are stored in ``genomes/annotations/genes``. - - - -**Taxonomic adnnotation** - - -A taxonomy for the genomes is proposed by the Genome `Taxonomy database `_ (GTDB). -The results can be found in ``genomes/taxonomy``. -The genomes are placed in a phylogenetic tree separately for bacteria and archaea using the GTDB markers. 
- -In addition a tree for bacteria and archaea can be generated based on the checkm markers. -All trees are properly rooted using the midpoint. The files can be found in ``genomes/tree`` - -**Functional annotation** - -Sicne version 2.8, We use `DRAM `_ to annotate the genomes with Functional annotations, e.g. KEGG and CAZy as well as to **infere pathways**, or more specifically Kegg modules. - -The Functional annotations for each genome can be found in ``genomes/annotations/dram/`` - -and are contain the following files: - - - ``kegg_modules.tsv`` Table of all Kegg modules - - ``annotations.tsv`` Table of all annotations - - ``distil/metabolism_summary.xlsx`` Excel of the summary of all annotations - - The tool alos produces a nice report in `distil/product.html`_. - -.. _distil/product.html: ../_static/dram_product.html - - - -Gene Catalog -=============== - -:: - - atlas run all - # or - atlas run genecatalog - -The gene catalog takes all genes predicted from the contigs and clusters them -according to the configuration. It quantifies them by simply mapping reads to the genes (cds sequences) and annotates them using EggNOG mapper. - -This rule produces the following output file for the whole dataset. - - - ``Genecatalog/gene_catalog.fna`` - - ``Genecatalog/gene_catalog.faa`` - - ``Genecatalog/annotations/eggNog.tsv.gz`` - - ``Genecatalog/counts/`` - - - -Since version 2.15 the output of the quantification are stored in 2 hdf-files`in the folder ``Genecatalog/counts/``: - - ``median_coverage.h5`` - - ``Nmapped_reads.h5.fna`` - -Together with the statistics per gene and per sample. - - ``gene_coverage_stats.parquet`` - - ``sample_coverage_stats.tsv`` - - - -The hdf only contains a matrix of abundances or counts under the name ``data``. The sample names are stored as attributes. -The gene names (e.g. ``Gene00001``) are simply the row number. - - - - - - -You can open the hdf file in R or python as following: - - -.. code-block:: python - - import h5py - - filename = "path/to/atlas_dir/Genecatalog/counts/median_coverage_genomes.h5" - - with h5py.File(filename, 'r') as hdf_file: - - data_matrix = hdf_file['data'][:] - sample_names = hdf_file['data'].attrs['sample_names'].astype(str) - - -.. code-block:: R - - library(rhdf5) - - - filename = "path/to/atlas_dir/Genecatalog/counts/median_coverage_genomes.h5" - - data <- h5read(filename, "data") - - attributes= h5readAttributes(filename, "data") - - colnames(data) <- attributes$sample_names - - -You don't need to load the full data. -You could only select a subset of genes, e.g. the genes with annotations, or genes that are not singletons. -To find out which gene is a singleton or not you can use the file ``gene_coverage_stats.parquet`` - - -.. code-block:: R - - library(rhdf5) - library(dplyr) - library(tibble) - - # read only subset of data - indexes_of_genes_to_load = c(2,5,100,150) # e.g. genes with annotations - abundance_file <- file.path(atlas_dir,"Genecatalog/counts/median_coverage.h5") - - - # get dimension of data - - h5overview=h5ls(abundance_file) - dim= h5overview[1,"dim"] %>% stringr::str_split(" x ",simplify=T) %>% as.numeric - cat("Load ",length(indexes_of_genes_to_load), " out of ", dim[1] , " genes\n") - - - data <- h5read(file = abundance_file, name = "data", - index = list(indexes_of_genes_to_load, NULL)) - - # add sample names - attributes= h5readAttributes(abundance_file, "data") - colnames(data) <- attributes$sample_names - - - # add gene names (e.g. 
Gene00001) as rownames - gene_names = paste0("Gene", formatC(format="d",indexes_of_genes_to_load,flag="0",width=ceiling(log10(max(dim[1]))))) - rownames(data) <- gene_names - - - data[1:5,1:5] - -If you do this you can use the information in the file ``Genecatalog/counts/sample_coverage_stats.tsv`` to normalize the counts. - -Here is the R code to calculate the gene copies per million (analogous to transcript per million) for the subset of genes. - -.. code-block:: R - - # Load gene stats per sample - gene_stats_file = file.path(atlas_dir,"Genecatalog/counts/sample_coverage_stats.tsv") - - gene_stats <- read.table(gene_stats_file,sep='\t',header=T,row.names=1) - - gene_stats <- t(gene_stats) # might be transposed, sample names should be index - - head(gene_stats) - - # calculate copies per million - total_covarage <- gene_stats[colnames(data) ,"Sum_coverage"] - - # gives wrong results - #gene_gcpm<- data / total_covarage *1e6 - - gene_gcpm<- data %*% diag(1/total_covarage) *1e6 - colnames(gene_gcpm) <- colnames(data) - - gene_gcpm[1:5,1:5] - -.. seealso:: See in Atlas Tutorial - - -Before version 2.15 the output of the counts were stored in a parquet file. -The parquet file can be opended easily with ``pandas.read_parquet`` or ``arrow::read_parquet```. -However you need to load the full data into memory. - -.. code-block:: R - - parquet_file <- file.path(atlas_dir,"Genecatalog/counts/median_coverage.parquet") - gene_abundances<- arrow::read_parquet(parquet_file) - - # transform tibble to a matrix - gene_matrix= as.matrix(gene_abundances[,-1]) - rownames(gene_matrix) <- gene_abundances$GeneNr - - - #calculate copies per million - gene_gcpm= gene_matrix/ colSums(gene_matrix) *1e6 - - - gene_gcpm[1:5,1:5] - - - - - - - - - - - - -All -=== - -The option of ``atlas run all`` runs both Genecatalog and Genome workflows and creates mapping tables between Genecatalog and Genomes. However, in future the two workflows are expected to diverge more and more to fulfill their aim better. - -If you want to run both workflows together you can do this by:: - - atlas run genomes genecatalog - -If you are interested in mapping the genes to the genomes see the discussion at https://github.com/metagenome-atlas/atlas/issues/413