Tutorial: annotation and interpretation of mammalian genomes

This is a tutorial on how to annotate a newly sequenced mammalian genome. It takes the user from a FASTA sequence to a high-quality GFF file annotated with gene symbols. We break the process into five main steps:

  1. Repeat masking
  2. Generating protein-coding RNA (mRNA) gene models
  3. Combining and filtering gene models
  4. Adding non-coding RNA (ncRNA) gene models
  5. Assigning gene symbols

We recommend tools and best practices for each step and provide code to help the user execute each task, along with installation guidance and additional tips based on common challenges we encountered.

Generally, genome annotation has no ground truth to compare against, so we use different sources of evidence to annotate the most likely gene models. These gene models are hypotheses about where genes are located on the genome; false positives and false negatives will always exist. This pipeline uses existing tools and quality-checking software to minimize both error rates and create a high-quality annotation.

Our main recommendations in brief

In this tutorial, we discuss in depth which processes and tools we recommend depending on your data availability and quality. In brief, our recommendations are:

  1. Soft-mask your genome using Earl Grey; use the soft-masked genome for all of the following steps
  2. Find two or three closely related genomes on RefSeq (or Ensembl) by searching their genome database
  3. Lift over the annotations using Liftoff (works best for more closely related species)
  4. Lift over the annotations using TOGA (works best for more distantly related species)
  5. If you have access to high-quality RNA-seq data (2x100bp or longer) from the species you wish to annotate, align it with HISAT2 and assemble transcripts with StringTie2; do this for as many tissues as possible (see the sketch after this list)
  6. Perform annotation with BRAKER3 using any or no RNA-seq data from your species
  7. Use Mikado to combine and filter the annotations generated in steps 3-6
  8. Use Infernal + Rfam, MirMachine, and the relevant Earl Grey outputs to annotate ncRNAs
  9. Predict gene symbols using the OrthoFinder script or by intersecting gene liftover results with bedtools
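
To illustrate step 5, here is a minimal sketch of the HISAT2 + StringTie2 pass for a single tissue. File names are placeholders, thread counts should match your system, and samtools is assumed for sorting:

   # Build a HISAT2 index from the soft-masked genome
   hisat2-build genome.softmasked.fa genome_idx

   # Align paired-end RNA-seq for one tissue; --dta tailors alignments for
   # downstream transcriptome assembly
   hisat2 -p 16 --dta -x genome_idx -1 liver_R1.fastq.gz -2 liver_R2.fastq.gz \
     | samtools sort -@ 16 -o liver.sorted.bam

   # Assemble transcripts for this tissue with StringTie2
   stringtie liver.sorted.bam -p 16 -o liver.gtf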

For each annotation generated in steps 3, 4, 6, and 7, we recommend converting the GTF/GFF file to a protein FASTA file with gffread and generating a BUSCO score to assess the quality of the annotation. (This is not possible for StringTie-generated annotations, which lack the CDS features required for the conversion.) The combined, filtered annotation generated by Mikado in step 7 should score higher than any input annotation; if not, rerun Mikado with modified settings.
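
As an illustration, a minimal sketch of this check. File names are placeholders, and the BUSCO lineage shown (mammalia_odb10) is an assumption; pick the closest lineage available for your species:

   # Extract the protein sequences implied by the CDS features
   gffread -y annotation.proteins.fa -g genome.softmasked.fa annotation.gff

   # Score the completeness of the predicted proteome with BUSCO
   busco -i annotation.proteins.fa -l mammalia_odb10 -m proteins -c 16 -o busco_annotation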

How to use this repository

We provide three ways to use this tutorial to cater to different preferences and computational set-ups.

1. A customizable step-by-step set of instructions

These instructions can be found in the folders labeled by step (e.g. "1-RepeatMasking"), each of which contains a README explaining how to use the relevant tools. Explanations are accompanied by pseudocode and screenshots of what to expect (screenshots in progress), and R notebooks performing intermediate formatting steps are available where required. Installation instructions for all of the required tools can be found in setup/InstallAndDownload.md, which also covers database downloads. Once installation is complete, move on to PreprocessReferenceSpecies.md, which downloads and processes the files from the particular reference species used in the pipeline. Finally, begin the step-by-step workflow by navigating to "1-RepeatMasking" and continuing through all the steps. The step-by-step workflow is meant as an instructional tutorial; it uses general file names that you may wish to modify, and it assumes that you will organize the files and directories used in the workflow yourself.

2. A series of scripts that run the recommended tools

These scripts are found in /scripts; together they run each step of the tutorial.

Execute_GAT_in_serial.sh contains the variables required to run the tutorial, as well as flow control to run each step in serial.

Execute_GAT_in_serial.sh starts by pointing to where you downloaded GAT, so scripts should be executed with that path as an argument, e.g.

bash GAT-InstallAndDownload.sh ~/GenomeAnnotationTutorial # where ~/ stands for /path-to-GAT/

The Snakemake pipeline brings the total runtime down to ~20 h because it controls which scripts can be run in parallel. Execute_GAT_in_serial.sh also tells you which scripts can be run in parallel if you'd rather run them manually with your own data. We do not recommend running each tool serially once you move to a full mammalian genome. For example, BRAKER (short read), TOGA, and BRAKER (long read) could each take ~150 h for the 2.6 Gb naked mole-rat genome with RNA-seq or Iso-Seq from 5-10 tissues. Running these steps in parallel saves ~300 hours (~12 days) of runtime.
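
As a sketch, the three long-running steps could be submitted as independent cluster jobs on an SGE system. The script names here are hypothetical; check /scripts for the real ones:

   # Submit the three long-running annotation steps in parallel
   qsub -cwd -pe smp 16 -l h_rt=168:00:00 braker_short_read.sh
   qsub -cwd -pe smp 16 -l h_rt=168:00:00 toga.sh
   qsub -cwd -pe smp 16 -l h_rt=168:00:00 braker_long_read.sh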

The most time-consuming steps are Earl Grey and BRAKER (short read)/TOGA/BRAKER (long read), which are each in the 100-200 h range. With a 3 Gb mammalian genome, the remaining steps may take an additional ~50 hours combined. As such, we'd expect an end-to-end annotation with this tutorial to take 2-3 weeks of runtime.

Assuming the tutorial and reference directories are properly put together, executing each script on your own should only take a few hours of hands-on time (e.g., setup, submitting groups of scripts that should be run in parallel, making sure that each step has the expected output). Once the Snakemake pipeline is set up, annotation should take no hands-on time beyond setting up the config file for each assembly.

The two key paths, which must be set to match your system, are:

sourceDir=/path-to/miniconda3/
tutorialDir=/path-to/GenomeAnnotationTutorial

3. A Snakemake pipeline

GAT_Snakemake provides automated control of the genome annotation tutorial. It expects the same directory structure and setup as the other ways of using the tutorial (i.e., the same /data and /external directories, the same annotation_tutorial conda environment, and access to singularity as a module or in your path).

To run the pipeline, you need to change the config.yaml file to match your directory structure.

Make sure singularity is in your path and the annotation_tutorial environment is activated:

   module load singularity # or module load apptainer or whatever you need to get singularity in your environment
   conda activate annotation_tutorial

Then, run Snakemake using the parameters of your HPC. This is what it looks like on ours:

snakemake --jobs 750 --latency-wait 60 --cluster "qsub -cwd -V -o snakemake.output.log -e snakemake.error.log -pe smp {threads} -l h_vmem={params.memory_per_thread} {params.extra_cluster_opt} -l h_stack=32M -l h_rt={params.walltime} -P simpsonlab -b y" "$@"
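
If your cluster uses SLURM rather than SGE, a roughly equivalent invocation might look like the following sketch (adjust options to your site):

   snakemake --jobs 750 --latency-wait 60 \
     --cluster "sbatch --cpus-per-task={threads} --mem-per-cpu={params.memory_per_thread} --time={params.walltime} -o snakemake.output.log -e snakemake.error.log"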

Example data

Assuming everything is properly installed and you have adjusted the two variables described below, you can run the example data by submitting Execute_GAT_in_serial.sh as a job. We recommend ~72 h runtime, 16 cores, and 64 GB of RAM for the tutorial data.

The example data is pulled from Zenodo. We assume that the wget command below is run in /path-to-GAT/GenomeAnnotationTutorial (see /setup/GAT-InstallAndDownload.md for details).

We include a chromosome from our naked mole-rat assembly (plus RNA-seq and Iso-Seq data) for trying out this tutorial.

wget https://zenodo.org/records/14962941/files/example_data.tar.gz
tar -xvzf example_data.tar.gz

Assuming everything is properly set up, running the tutorial without flow control involves submitting the execute script with two positional arguments:

bash Execute_GAT_in_serial.sh path-to-GAT path-to-Conda

For us this looks like:

bash Execute_GAT_in_serial.sh \
/.mounts/labs/simpsonlab/users/dsokolowski/projects/GenomeAnnotationTutorial \
/.mounts/labs/simpsonlab/users/dsokolowski/miniconda3

Even annotating one chromosome is relatively resource-intensive, so we recommend submitting this as a job (recommended: 64 GB of memory, 16 cores, 72 h runtime).

Lastly, this script contains a module load singularity line. If you access singularity differently (e.g., via apptainer, a conda environment, etc.), replace that line with whatever you need to make singularity accessible.

Running the example data using the Snakemake pipeline takes slightly more work.

To run the tutorial with the example data, edit config.yaml: change "/path-to-conda/miniconda3/" to the path of the miniconda directory where the annotation_tutorial environment lives (e.g., /.mounts/labs/simpsonlab/users/dsokolowski/miniconda3/), and change "path-to-GAT/" to the path where you cloned "GenomeAnnotationTutorial" (e.g., /.mounts/labs/simpsonlab/users/dsokolowski/projects/GenomeAnnotationTutorial).
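
A sketch of these two edits with sed, using the example paths above:

   # Point config.yaml at your miniconda installation and tutorial clone
   sed -i 's|/path-to-conda/miniconda3/|/.mounts/labs/simpsonlab/users/dsokolowski/miniconda3/|' config.yaml
   sed -i 's|path-to-GAT/|/.mounts/labs/simpsonlab/users/dsokolowski/projects/GenomeAnnotationTutorial/|' config.yaml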

Otherwise, follow the steps in "3. A Snakemake pipeline".

List of tools

The main tools used in this tutorial are listed below:

  - Earl Grey (repeat annotation and soft-masking)
  - Liftoff and TOGA (annotation liftover)
  - HISAT2 and StringTie2 (RNA-seq alignment and transcript assembly)
  - BRAKER3 (gene prediction)
  - Mikado (combining and filtering gene models)
  - Infernal + Rfam and MirMachine (ncRNA annotation)
  - OrthoFinder and bedtools (gene symbol assignment)
  - gffread and BUSCO (format conversion and quality assessment)

Instructions on how to install these tools can be found in InstallAndDownload.md in the set-up directory.

Notes on computational requirements

We expect the user to be familiar with installing and running command-line tools and using R. Many tools can be run on a desktop, but some are very computationally intensive and require the use of a high-performance compute cluster (e.g. Compute Canada). The speed of many tools will improve if they have access to multiple threads and can therefore run tasks in parallel. To check how many threads you can specify when running tools, check the documentation of your compute cluster or run nproc on your local desktop.

Virtual environments

Because genome annotation relies on a number of different command-line tools, we recommend using virtual environments whenever possible to manage dependency conflicts. Virtual environments can be created with tools like Docker, Conda, or Python's venv.

Docker containers exist for certain tools; a container packages a tool with all of its requirements, ready to use. You can check for one by running docker search name_of_tool. If a container is listed, you can typically use it by running docker run -v "$(pwd)":/tmp name_of_container command. Here, docker run runs the container; -v "$(pwd)":/tmp mounts your current working directory into the container at the path /tmp, so the tool inside the container can read and write your files; name_of_container is the name of whatever container you want to use; and command is the command of the tool you want to run.
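
For instance, a sketch with a hypothetical image name (use docker search to find a real one):

   # Look for published containers for a tool
   docker search mikado

   # Run the tool from a container, mounting the current directory at /tmp so
   # the files in your working directory are visible inside the container
   # (some_user/mikado is a hypothetical image name)
   docker run -v "$(pwd)":/tmp some_user/mikado mikado --help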

Conda environments can be created using the command conda create -n name_of_environment required_package_1 required_package_2 ..., where additional packages separated by spaces can replace the ellipsis. Conda considers all the packages requested by the user and creates an isolated environment in which all package versions should be compatible. If you only need one particular package, Conda will automatically find and install all of its dependencies into the environment. Environments are activated by running conda activate name_of_environment and deactivated with conda deactivate.
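
For example, a one-tool environment (BUSCO is available through the bioconda channel; the environment name is arbitrary):

   # Create an isolated environment containing BUSCO and its dependencies
   conda create -n busco_env -c conda-forge -c bioconda busco

   # Use the tool, then leave the environment
   conda activate busco_env
   busco --version
   conda deactivate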

Python environments can be created with venv or virtualenv and work similarly to Conda environments: you create a virtual environment, activate it, and install the Python packages a particular tool requires, without conflicting with anything else on your machine. To create a virtual environment, run virtualenv /path/to/virtual/env, where the path is wherever you'd like to store it. Activate the environment with source /path/to/virtual/env/bin/activate and deactivate it with deactivate.
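
Putting the steps together (the path and package are illustrative; Mikado, for instance, is installable with pip):

   # Create and activate a virtual environment
   virtualenv ~/envs/annotation_env        # or: python3 -m venv ~/envs/annotation_env
   source ~/envs/annotation_env/bin/activate

   # Install what the tool needs, isolated from the rest of the system
   pip install mikado

   # Leave the environment when finished
   deactivate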

Running long jobs

Some of the tools we recommend take a long time to run: sometimes hours, sometimes days. In that case, you probably want the job to run in the background so that you can do other things on your computer. You can do this with the nohup command, which wraps around any command you want to execute and lets the job keep running after you exit the terminal - just don't turn your computer off. A typical invocation is nohup command argument1 argument2 > output.log 2>&1 &, where the trailing & sends the job to the background and the redirection sends everything that would normally be printed to the screen into output.log (without redirection, nohup writes it to nohup.out).
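
For example, to run the serial execute script in the background while logging its output (paths as in the earlier example):

   # Run the pipeline in the background and keep a log
   nohup bash Execute_GAT_in_serial.sh /path-to-GAT/GenomeAnnotationTutorial /path-to/miniconda3 > gat.log 2>&1 &
   echo $!          # the process ID, in case you need to check on or stop the job
   tail -f gat.log  # watch progress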
