8 changes: 8 additions & 0 deletions CONTRIBUTORS.yaml
@@ -3055,6 +3055,14 @@ VerenaMoo:
name: Verena Moosmann
joined: 2024-12

vinisalazar:
name: Vini Salazar
joined: 2025-10
orcid: 0000-0002-8362-3195
affiliations:
- unimelb
- melbournebioinformatics

vivekbhr:
name: Vivek Bhardwaj
joined: 2017-09
11 changes: 11 additions & 0 deletions topics/microbiome/tutorials/metagenomics-binning/tutorial.bib
@@ -1,3 +1,14 @@
@article{nissen2021improved,
title={Improved metagenome binning and assembly using deep variational autoencoders},
author={Nissen, Jakob Nybo and Johansen, Joachim and Alles{\o}e, Rosa Lundbye and S{\o}nderby, Casper Kaae and Armenteros, Jose Juan Almagro and Gr{\o}nbech, Christopher Heje and Jensen, Lars Juhl and Nielsen, Henrik Bj{\o}rn and Petersen, Thomas Nordahl and Winther, Ole and others},
journal={Nature biotechnology},
volume={39},
number={5},
pages={555--560},
year={2021},
publisher={Nature Publishing Group US New York}
}

@article{maxbin2015,
author = {Wu, Yu-Wei and Simmons, Blake A. and Singer, Steven W.},
title = "{MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets}",
70 changes: 35 additions & 35 deletions topics/microbiome/tutorials/metagenomics-binning/tutorial.md
@@ -4,17 +4,16 @@ title: Binning of metagenomic sequencing data
zenodo_link: https://zenodo.org/record/7818827
extra:
zenodo_link_results: https://zenodo.org/record/7845138
level: Introductory
level: Intermediate
questions:
- What does metagenomic binning refer to?
- Which tools should be used for metagenomic binning?
- How to assess the quality of metagenomic data binning?
- Which tools may be used for metagenomic binning?
- How to assess the quality of metagenomic binning?
objectives:
- Describe what metagenomics binning is
- Describe common problems in metagenomics binning
- What software tools are available for metagenomics binning
- Binning of contigs into metagenome-assembled genomes (MAGs) using MetaBAT 2 software
- Evaluation of MAG quality and completeness using CheckM software
- Describe what metagenomics binning is.
- Describe common challenges in metagenomics binning.
- Perform metagenomic binning using MetaBAT 2 software.
- Evaluate MAG quality and completeness using CheckM software.
time_estimation: 2H
key_points:
- Metagenomics binning is a computational approach to grouping together DNA sequences
@@ -32,6 +31,11 @@ contributions:
authorship:
- npechl
- fpsom
requirements:
- type: internal
topic: metagenomics
tutorials:
- metagenomics-assembly
subtopic: metagenomics
tags:
- binning
@@ -56,11 +60,14 @@ recordings:

---


Metagenomics is the study of genetic material recovered directly from environmental samples, such as soil, water, or gut contents, without the need for isolation or cultivation of individual organisms. Metagenomics binning is a process used to classify DNA sequences obtained from metagenomic sequencing into discrete groups, or bins, based on their similarity to each other.

The goal of metagenomics binning is to assign the DNA sequences to the organisms or taxonomic groups that they originate from, allowing for a better understanding of the diversity and functions of the microbial communities present in the sample. This is typically achieved through computational methods that include sequence similarity, composition, and other features to group the sequences into bins.

> <comment-title></comment-title>
> Before starting this tutorial, it is recommended to do the [**Metagenomics Assembly Tutorial**]({% link topics/microbiome/tutorials/metagenomics-assembly/tutorial.md %}).
{: .comment}

There are several approaches to metagenomics binning, including:

- **Sequence composition-based binning**: This method is based on the observation that different genomes have distinct sequence composition patterns, such as GC content or codon usage bias. By analyzing these patterns in metagenomic data, sequence fragments can be assigned to individual genomes or groups of genomes.
@@ -76,7 +83,7 @@ There are several approaches to metagenomics binning, including:
Each of these methods has its strengths and limitations, and the choice of binning method depends on the specific characteristics of the metagenomic data set and the research question being addressed.


**Metagenomics binning is a complex process that involves many steps and can be challenging due to several problems that can occur during the process**. Some of the most common problems encountered in metagenomics binning include:
**Metagenomic binning is a complex process that involves many steps and can be challenging due to several problems that can occur during the process**. Some of the most common problems encountered in metagenomic binning include:

- **High complexity**: Metagenomic samples contain DNA from multiple organisms, which can lead to high complexity in the data.
- **Fragmented sequences**: Metagenomic sequencing often generates fragmented sequences, which can make it difficult to assign reads to the correct bin.
@@ -86,20 +93,23 @@ Each of these methods has its strengths and limitations, and the choice of binni
- **Chimeric sequences**: Sequences that are the result of sequencing errors or contamination can lead to chimeric sequences, which can make it difficult to accurately bin reads.
- **Strain variation**: Organisms within a species can exhibit significant genetic variation, which can make it difficult to distinguish between different strains in a metagenomic sample.

There are plenty of computational tools to perform metafenomics binning. Some of the most widely used include:
There are plenty of algorithms that perform metagenomic binning. Some of the most widely used include:

- **MaxBin** ({%cite maxbin2015%}): A popular de novo binning algorithm that uses a combination of sequence features and marker genes to cluster contigs into genome bins.
- **MetaBAT** ({%cite Kang2019%}): Another widely used de novo binning algorithm that employs a hierarchical clustering approach based on tetranucleotide frequency and coverage information.
- **CONCOCT** ({%cite Alneberg2014%}): A de novo binning tool that uses a clustering algorithm based on sequence composition and coverage information to group contigs into genome bins.
- **MyCC** ({%cite Lin2016%}): A reference-based binning tool that uses sequence alignment to identify contigs belonging to the same genome or taxonomic group.
- **GroopM** ({%cite Imelfort2014%}): A hybrid binning tool that combines reference-based and de novo approaches to achieve high binning accuracy.
- **MetaWRAP** ({%cite Uritskiy2018%}): A comprehensive metagenomic analysis pipeline that includes various modules for quality control, assembly, binning, and annotation.
- **Anvi'o** ({%cite Eren2015%}): A platform for visualizing and analyzing metagenomic data, including features for binning, annotation, and comparative genomics.
- **SemiBin** ({%cite Pan2022%}): A command-line tool for metagenomic binning with deep learning that handles both short and long reads.
- **Vamb** ({%cite nissen2021improved%}): An algorithm that uses variational autoencoders (VAEs) to encode sequence composition and coverage information.

Other tools also include:
- **MetaWRAP** ({%cite Uritskiy2018%}): A comprehensive metagenomic analysis pipeline that includes various modules for quality control, assembly, binning, and annotation.
- **Anvi'o** ({%cite Eren2015%}): A platform for visualizing and analyzing metagenomic data, including features for binning, annotation, and comparative genomics. Uses CONCOCT as the default binning backend.

A benchmark study of metagenomics software can be found at {%cite Sczyrba2017%}. MetaBAT 2 outperforms the original MetaBAT and other alternatives in both accuracy and computational efficiency, with all tools run on default parameters ({%cite Sczyrba2017%}).

**In this tutorial, we will learn how to run metagenomic binning tools and evaluate the quality of the results**. In order to do that, we will use data from the study: [Temporal shotgun metagenomic dissection of the coffee fermentation ecosystem](https://www.ebi.ac.uk/metagenomics/studies/MGYS00005630#overview) and MetaBAT 2 algorithm. MetaBAT is a popular software tool for metagenomics binning, and there are several reasons why it is often used:
**In this tutorial, we will learn how to run metagenomic binning tools and evaluate the quality of the results**. In order to do that, we will use data from the study: [Temporal shotgun metagenomic dissection of the coffee fermentation ecosystem](https://www.ebi.ac.uk/metagenomics/studies/MGYS00005630#overview) and the MetaBAT 2 algorithm. MetaBAT is a popular software tool for metagenomics binning, and there are several reasons why it is often used:
- *High accuracy*: MetaBAT uses a combination of tetranucleotide frequency, coverage depth, and read linkage information to bin contigs, which has been shown to be highly accurate and efficient.
- *Easy to use*: MetaBAT has a user-friendly interface and can be run on a standard desktop computer, making it accessible to a wide range of researchers with varying levels of computational expertise.
- *Flexibility*: MetaBAT can be used with a variety of sequencing technologies, including Illumina, PacBio, and Nanopore, and can be applied to both microbial and viral metagenomes.
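
As a rough illustration of the composition signal mentioned above, the following minimal Python sketch computes tetranucleotide frequency (TNF) vectors for a few toy contigs and compares them with a Euclidean distance. It is not MetaBAT's implementation: the helper names, the toy sequences, and the choice of distance are assumptions made only for this example, and real binners also use coverage information.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over A, C, G, T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return normalized 4-mer frequencies for a DNA sequence (hypothetical helper)."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only unambiguous 4-mers are counted; windows containing N are ignored.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy contigs (made up): a and b are GC-rich, c is AT-rich.
contig_a = "ATGCGCGCGGCCGCGCATGCGCGGCC"
contig_b = "GCGCGGCCGCATGCGCGCGGCCGCGC"
contig_c = "ATATTTAAATATTAATTTAAATATAT"

tnf_a = tetranucleotide_frequencies(contig_a)
tnf_b = tetranucleotide_frequencies(contig_b)
tnf_c = tetranucleotide_frequencies(contig_c)

dist_ab = euclidean_distance(tnf_a, tnf_b)
dist_ac = euclidean_distance(tnf_a, tnf_c)
print(f"distance a-b: {dist_ab:.3f}")
print(f"distance a-c: {dist_ac:.3f}")
```

Contigs with similar composition (here `contig_a` and `contig_b`) end up closer together than contigs with different composition, which is the intuition behind composition-based binning.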
@@ -186,7 +196,7 @@ As explained before, there are many challenges to metagenomics binning. The most
- Chimeric sequences.
- Strain variation.

![Image show the binning process where sequences are grouped together based on genome signatures like the kmer profiles of each contig, contig coverage, or GC content](./binning.png "Binning"){:width="60%"}
![Metagenomic binning involves grouping contigs into 'bins' based on sequence composition, coverage, or other properties.](./images/binning.png "Metagenomic binning involves grouping contigs into 'bins' based on sequence composition, coverage, or other properties."){:width="60%"}

In this tutorial we will learn how to use **MetaBAT 2** {%cite Kang2019%} tool through Galaxy. **MetaBAT** stands for "Metagenome Binning based on Abundance and Tetranucleotide frequency". It is:

@@ -196,21 +206,11 @@ In this tutorial we will learn how to use **MetaBAT 2** {%cite Kang2019%} tool t
We will use the uploaded assembled FASTA files as input to the algorithm (for simplicity, all other parameters will keep their default values).

> <hands-on-title>Individual binning of short-reads with MetaBAT 2</hands-on-title>
> 1. {% tool [MetaBAT 2](toolshed.g2.bx.psu.edu/repos/iuc/megahit/megahit/1.2.9+galaxy0) %} with parameters:
> 1. {% tool [MetaBAT 2](https://toolshed.g2.bx.psu.edu/view/iuc/metabat2/01f02c5ddff8) %} with parameters:
> - *"Fasta file containing contigs"*: `assembly fasta files`
> <!-- - In *Advanced options*
> - *"Percentage of good contigs considered for binning decided by connection among contigs"*: `default 95`
> - *"Minimum score of an edge for binning"*: `default 60`
> - *"Maximum number of edges per node"*: `default 200`
> - *"TNF probability cutoff for building TNF graph"*: `default 0`
> - *"Turn off additional binning for lost or small contigs?"*: `default "No"`
> - *"Minimum mean coverage of a contig in each library for binning"*: `default 1`
> - *"Minimum total effective mean coverage of a contig for binning "*: `default 1`
> - *"For exact reproducibility"*: `default 0`
> - In *Output options*
> - *"Minimum size of a bin as the output"*: `default 200000`
> - *"Output only sequence labels as a list in a column without sequences?"*: `default "No"`
> - *"Save cluster memberships as a matrix format?"*: `"Yes"` -->
> - In **Advanced options**, keep all as **default**.
> - In **Output options:**
> - *"Save cluster memberships as a matrix format?"*: `"Yes"`
>
{: .hands_on}

@@ -244,20 +244,20 @@ These output files can be further analyzed and used for downstream applications
> > ```
> >
> >
> > 2. Create a collection named `MEGAHIT Contig`, rename your pairs with the sample name
> > 2. Create a collection named `MetaBAT2 Bins` and add the zip files to it.
> >
> {: .hands_on}
{: .comment}

> <question-title></question-title>
> <question-title>Binning metrics</question-title>
>
> 1. How many bins have been identified for the ERR2231567 sample?
> 2. How many sequences are contained in the second bin?
> 2. How many contigs are in the bin with most contigs? What about the one with the least?
>
> > <solution-title></solution-title>
> >
> > 1. There are 6 bins identified
> > 2. 167 sequences are classified into the second bin.
> > 1. There are 6 bins identified.
> > 2. 7170 in the one with the most contigs, and 140 in the one with the least (these numbers may differ slightly depending on the version of MetaBAT2).
> >
> {: .solution}
>
@@ -269,7 +269,7 @@ De-replication is the process of identifying sets of genomes that are the "same"

A common use for genome de-replication is the case of individual assembly of metagenomic data. If metagenomic samples are collected in a series, a common way to assemble the short reads is with a “co-assembly”. That is, combining the reads from all samples and assembling them together. The problem with this is assembling similar strains together can severely fragment assemblies, precluding recovery of a good genome bin. An alternative option is to assemble each sample separately, and then “de-replicate” the bins from each assembly to make a final genome set.

![Image shows the process of individual assembly on two strains and five samples, after individual assembly of samples two samples are chosen for de-replication process. In parallel, co-assembly on all five samples is performed](./individual-assembly.png "Individual assembly followed by de-replication vs co-assembly"){:width="80%"}
![Image shows the process of individual assembly on two strains and five samples, after individual assembly of samples two samples are chosen for de-replication process. In parallel, co-assembly on all five samples is performed](./individual-assembly.png "Individual assembly followed by de-replication vs co-assembly."){:width="80%"}

MetaBAT 2 does not explicitly perform dereplication in the sense of identifying groups of identical or highly similar genomes in a given dataset. Instead, MetaBAT 2 focuses on improving the accuracy of binning by leveraging various features such as read coverage, differential coverage across samples, and sequence composition. It aims to distinguish between different genomes present in the metagenomic dataset and assign contigs to the appropriate bins.
