Annotate-Contigs:

This tool takes a k-mer/contig table as input and produces an annotation file with mapping information for each k-mer/contig including position, intron/exon/intergenic location, gene name, CIGAR etc. Should work with any reference genome and annoattion as input. In this pipeline, we use two alignment tools, STAR and Minimap2. STAR is used for aligning sequences that are 200 bases or shorter, while Minimap2 is used for aligning sequences greater than or equal 200 bases.

This tool was initially thought as a less restrictive alternative to DEkupl-annotation. So many of its aspects are similar.

Usage
Installation
- Option 1: With singularity
- Option 2: From source
Configuration
Output files
Ontology

Usage

In order to run the tool, you will need at least 4 specific files
- Input file : A tsv/csv file with at least 2 named columns. One contains the sequences you want to annotate, and each sequence must have a unique identifier. Typical input files are Dekupl-run/Kamrat outputs (example available in data/input_table.tsv).
- Config file : As the pipeline is designed with snakemake, any run requires a configuration file. See below for specifications of available parameters.
- Genome and annotation files : Associated fasta and gtf files of an organism (gz). Typically downloaded from Ensembl or Gencode websites.(examples available in data/)

Installation

Clone the Repository

To download this project, use the following command:

   git clone https://github.com/Transipedia/Annotate-contigs && cd Annotate-contigs

We recommand to use singularity to use the tool, or using the manual installation.

Option 1: With singularity

Step 1: Upload Singularity Image You can upload the singularity image directly from the link

wget https://zenodo.org/records/13789508/files/annotatecontig.sif?download=1 -O annotatecontig.sif

Step 2: Create your configuration file This tool is designed to work with Snakemake, which means that all user inputs must be defined in a configuration file (config.json). You can find an example of this configuration file in the repository. A comprehensive list of all parameters is provided in the following section.
Step 3: Run with mounted volumes It is advised to mount certain volumes (input/output directories). By default, a Singularity image cannot access external data. To fix this, you need to mount your directories as volumes. Using the parameter -B /store:/store tells Singularity to reference your store directory when mentioned (notably in your configuration file). It is recommended that all your input files be located in the /store directory.
```
singularity -v run -B /home:/home annotatecontig.sif -s ./Snakefile --configfile ./config.json --cores $nb_cores  
```

Option 2: From conda

Step 1: Install dependancies. Before using the tool, install the dependencies. You can install them manually using the conda environnement file annotatecontig.yml :
```
conda env create -f annotatecontig.yml
conda activate AnnotateContig
```

Step 2: Edit config file & run with Snakemake.

snakemake --configfile config.json --cores $nb_cores

Configuration :

inside the config.json you find all the parameters:

Mandatory parameters :

mode: can be either "index" or "table".

"index": Use this when running the pipeline for the first time to build the indexes. "table": Use this if the STAR and Minimap2 indexes already exist, and you only need to generate the table.
input_file: Path to the file containing sequences to annotate (supports tsv/csv, gzipped or uncompressed). Example in (data/input_table.tsv)
map_to: Name of the organism to be used in output files and the annotation table. (e.g: map_to : "human")
reference: Path to the fasta.gz file used to build the index for the specified organism. Only required in "index" mode, otherwise, you put "". Example for test reference
annotation: Path to the gtf.gz file used to build the index for the specified organism. Required in "index" and "table" mode. Example for test annotation
minimap2_index: Path to the pre-built Minimap2 genome index for the organism, if previously created. if "index" mode, add ""
star_index: Path to the pre-built STAR genome index for the organism, if previously created. if "index" mode, add ""
preset: (Default : "splice") Adjusts internal parameters of Minimap2 (e.g., k-mer size, scoring schemes, alignment heuristics) to optimize performance and accuracy for specific data types, you can>

The preset used for index building must be consistent. Some presets may not provide information about chimeric reads. In such cases, you may need to build the index again using a different preset.

*About the GTF

Only the "exon" features of the GTF file will be used. In order for the program to run properly, the mandatory attributes (column 9) are : "gene_id", "transcript_id", "gene_type".

Optionnal parameters (and default values) :

sequence_col: (Default :"contig"). Name of the column in input file containing the sequences to annotate.
id_col: (Default:"tag"). Name of the column in input file containing the unique identifier of the sequence.
output_dir: (Default:"./output"). Path to where the results will be generated.
keep_col: (Default:"all"). Either "all" or a list of column names you want to keep from the input file.
library_type: (Default:"rf"). Strandedness: "rf", "fr" or "unstranded".
supp_map_to: (Default:[""]). List of supplementary reference names you want to map your sequences to, with no further information (using blast).
supp_map_to_fasta:(Defaults:[""]). For each reference in supp_map_to, path to its fasta sequence. An exemple of a typical fasta file you could use (Human repeats from Dfam) is available in data/.
contamination : (Default: False). Toggle contaminant detection.
database : If contamination is set to True, specify the path to the BLAST-generated database here.
threads : Number of threads.

*About supplementary alignment

Any amount of supplementary alignment columns can be added to the output. For each supplementary reference provided, a single column will be added at the end of the output file specifying where the annotated sequence was aligned on this reference.

Example: with a reference of human repeats provided in this repository (data/human_repeat_ref.fasta):

"supp_map_to":["HumanRepeats"],
"supp_map_to_fasta" : ["/home/Documents/Annotate-contigs/data/human_repeat_ref.fasta"],

Example with Multiple References: If you have two supplementary references, e.g., human repeats and viral elements, the configuration would look like this:

"supp_map_to": ["HumanRepeats", "ViralElements"],
"supp_map_to_fasta": ["/home/Documents/Annotate-contigs/data/human_repeat_ref.fasta","/home/Documents/Annotate-contigs/data/viral_elements_ref.fasta" ]

N.B : If "supp_map_to" and "supp_map_to_fasta" are set to [""], the pipeline will skip the supplementary alignment step, and the corresponding column will be missing from the final output table.

Output file

Table merged_annotation.tsv, summarizing for each contig, its location on the genome (if it's aligned), the sequence alignment informations, and other optionnal alignment informations. If you choose to detect contaminations and repeats, you will get two additional columns (Repeats and Contamination).

N.B : You will also find some intermediate files in the output folder, specifically query_lt_200.fa and query_gt_200.fa.

If query_lt_200.fa is empty, it means that all the sequences in your query have a length of less than 200 bases, so you will have empty output files from STAR.
If query_gt_200.fa is empty, it means that all the sequences in your query have a length greater than 200 bases, so you will have empty output files from Minimap2.

Annotated values

Term	Type	Description
mapped_to	Str	Reference to which the sequence was aligned
chromosome	Str	Chromosome
start	Int	Beginning of the alignment on the reference
end	Int	End of the alignment on the reference
strand	Char	Strand of the alignment (+/-). set to "." in unstranded data.
cigar	Str	CIGAR string from the SAM alignment.
nb_insertion	Int	Number of insertions in the alignment (infered from cigar)
nb_deletion	Int	Number of deletions in the alignment (infered from cigar)
nb_splice	Int	Number of splices in the alignment (infered from cigar)
nb_snv	Int	Number of SNV in the contigs (computed as the number of mismatches minus indels)
clipped_3p	Int	Number of clipped bases (soft/hard) from 3prim contig
clipped_5p	Int	Number of clipped bases (soft/hard) from 5prim contig
query_cover	Float	Fraction of the query that have been aligned to the reference
alignment_identity	Float	Fraction of exact match over the query alignment length (splices do not count)
nb_hit	Int	Number of alignment given for the contig (NH field)
nb_mismatches	Int	Number of mismatches in the alignment (NM field)
gene_id	Str	Overlapping gene ID (from GTF ID field)
gene_symbol	Str	Overlapping gene symbol (from GTF Name field)
gene_biotype	Str	Overlapping gene biotype (from GTF biotype field)
gene_strand	Char	Overlapping gene strand (+/-)
as_gene_id	Str	Overlapping antisense gene ID (from GFF ID field). Defined only when working with stranded datas.
as_gene_symbol	Str	Overlapping antisense gene symbol (from GFF Name field). Defined only when working with stranded datas.
as_gene_strand	Char	Overlapping antisense gene strand (+/-). Defined only when working with stranded datas.
as_gene_biotype	Str	Overlapping antisense gene biotype (from GFF biotype field). Defined only when working with stranded datas.
is_exonic	Bool	Overlap between an exon and the contig. Same strand if working with stranded datas, both strand otherwise.
is_intronic	Bool	Overlap between an intron and the contig. Same strand if working with stranded datas, both strand otherwise.
is_chimeric	Bool	The contig contains a chimeric junction
is_circ	Bool	The chimeric junction behaves like a circular RNA.
seg1_cj	Str	First segment of the chimeric junction
seg2_cj	Str	Second segment of the chimeric junction

Example

Details

Example usage

After installing the tool, in the data folder, we have the directory containing the files we'll need.
Assuming that you have already generated the minimap2 and STAR indexes,
retrieved the annotation.gtf.gz file,
and also downloaded and decompressed the BLAST database.

Config file should be like this:

{
  "mode" : "table",
  "input_file": "Annotate-contigs/data/input_file.tsv",
  "sequence_col": "contig",
  "id_col": "tag",
  "map_to" : "human",
  "supp_map_to" : ["repeats"],
  "preset" : "splice",
  "supp_map_to_fasta" : ["Annotate-contigs/data/human_repeat_ref.fasta"],
  "output_dir": "Example",
  "keep_col": ["contig","tag"],
  "threads": 5,
  "library_type": "unstranded",
  "reference": "",
  "annotation" : "annotation.gtf.gz",
  "star_index" : "human_index/star",
  "minimap2_index" : "human_index/minimap2/human.mmi",
  "contamination": True,
  "database": "Annotate-contigs/databases/db_bacteria_virus_fungi"
}

Then, execute this command line :

singularity -v run -B /home:/home annotatecontig.sif -s ./Snakefile --configfile ./config.json --cores 4

At the end of the execution, you wil have a folder named "Example" containing the following structure:

.
|-- LOGS  
|-- Minimap2_human  
|-- STAR_human  
|-- blast  
|-- contamination  
|-- human_tmp  
|-- merged_annotation.tsv  
|-- query.fa.gz  
|-- query_gt_200.fa  
`-- query_lt_200.fa

The final file, which contains the global annotation table, is merged_annotation.tsv. It is located in the output folder inside data/.

Note: The last two columns are optional. In this example, we chose to include repeat elements and detect contaminations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Annotate-Contigs:

Usage

Installation

Clone the Repository

Option 1: With singularity

Option 2: From conda

Configuration :

Mandatory parameters :

*About the GTF

Optionnal parameters (and default values) :

*About supplementary alignment

Output file

Annotated values

Example

Config file should be like this:

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 67 Commits
data		data
script		script
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
Snakefile		Snakefile
annotatecontig.yml		annotatecontig.yml
config.json		config.json

Transipedia/Annotate-contigs

Folders and files

Latest commit

History

Repository files navigation

Annotate-Contigs:

Usage

Installation

Clone the Repository

Option 1: With singularity

Option 2: From conda

Configuration :

Mandatory parameters :

*About the GTF

Optionnal parameters (and default values) :

*About supplementary alignment

Output file

Annotated values

Example

Config file should be like this:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages