Tutorial: annotation and interpretation of mammalian genomes

This is a tutorial on how to annotate a newly sequenced mammalian genome. It takes the user from a FASTA sequence to a high-quality GFF file annotated with gene symbols. We break the process into five main steps:

  1. Repeat masking
  2. Generating protein-coding RNA (mRNA) gene models
  3. Combining and filtering gene models
  4. Adding non-coding RNA (ncRNA) gene models
  5. Assigning gene symbols

We recommend tools and best practices for each step and provide code to help the user execute each task, along with installation guidance and additional tips based on common challenges we encountered.

Generally, genome annotation has no ground truth to compare against, so we use different sources of evidence to annotate the most likely gene models. These gene models are hypotheses about where genes are located on the genome; false positives and false negatives will always exist. This pipeline uses existing tools and quality-checking software to minimize both error rates and create a high-quality annotation.

Our main recommendations in brief

In this tutorial, we discuss in depth which processes and tools we recommend depending on your data availability and quality. In brief, our recommendations are:

  1. Soft-mask your genome using Earl Grey; use the soft-masked genome for all of the following steps
  2. Find two or three closely related genomes on RefSeq (or Ensembl) by searching their genome database
  3. Lift over the annotations using Liftoff (works best for more closely related species)
  4. Lift over the annotations using TOGA (works best for more distantly related species)
  5. If you have access to high-quality RNA-seq data (2x100bp or longer) from the species you wish to annotate, align it with HISAT2 and assemble transcripts with StringTie2; do this for as many tissues as possible (see the sketch after this list)
  6. Perform annotation with BRAKER3 using any or no RNA-seq data from your species
  7. Use Mikado to combine and filter the annotations generated in steps 3-6
  8. Use Infernal + Rfam, MirMachine, and the relevant Earl Grey outputs to annotate ncRNAs
  9. Predict gene symbols using the OrthoFinder script or by intersecting gene liftover results with bedtools
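
To illustrate step 5, here is a minimal sketch of the HISAT2 + StringTie2 pass for a single tissue. File names are placeholders, thread counts should match your system, and samtools is assumed for sorting:

   # Build a HISAT2 index from the soft-masked genome
   hisat2-build genome.softmasked.fa genome_idx

   # Align paired-end RNA-seq for one tissue; --dta tailors alignments for
   # downstream transcriptome assembly
   hisat2 -p 16 --dta -x genome_idx -1 liver_R1.fastq.gz -2 liver_R2.fastq.gz \
     | samtools sort -@ 16 -o liver.sorted.bam

   # Assemble transcripts for this tissue with StringTie2
   stringtie liver.sorted.bam -p 16 -o liver.gtf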

For each annotation generated in steps 3, 4, 6, and 7, we recommend converting the GTF/GFF file to a protein FASTA file with gffread and generating a BUSCO score to assess the quality of the annotation. (This is not possible for StringTie-generated annotations, which lack the CDS features required for the conversion.) The combined, filtered annotation generated by Mikado in step 7 should score higher than any input annotation; if not, rerun Mikado with modified settings.
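
As an illustration, a minimal sketch of this check. File names are placeholders, and the BUSCO lineage shown (mammalia_odb10) is an assumption; pick the closest lineage available for your species:

   # Extract the protein sequences implied by the CDS features
   gffread -y annotation.proteins.fa -g genome.softmasked.fa annotation.gff

   # Score the completeness of the predicted proteome with BUSCO
   busco -i annotation.proteins.fa -l mammalia_odb10 -m proteins -c 16 -o busco_annotation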

How to use this repository

We provide three ways to use this tutorial to cater to different preferences and computational set-ups.

1. A customizable step-by-step set of instructions

These instructions can be found in the folders labeled by step (e.g. "1-RepeatMasking"), each of which contains a README explaining how to use the relevant tools. Explanations are accompanied by pseudocode and screenshots of what to expect (screenshots in progress), and R notebooks performing intermediate formatting steps are available where required. Installation instructions for all of the required tools can be found in setup/InstallAndDownload.md, which also covers database downloads. Once installation is complete, move on to PreprocessReferenceSpecies.md, which downloads and processes the files from the particular reference species used in the pipeline. Finally, begin the step-by-step workflow by navigating to "1-RepeatMasking" and continuing through all the steps. The step-by-step workflow is meant as an instructional tutorial; it uses general file names that you may wish to modify, and it assumes that you will organize the files and directories used in the workflow yourself.

2. A series of scripts that run the recommended tools

These scripts are found in /scripts; together they run each step of the tutorial.

Execute_GAT_in_serial.sh contains the variables required to run the tutorial, as well as flow control to run each step in serial.

Execute_GAT_in_serial.sh starts by pointing to where you downloaded GAT, so scripts should be executed with that path as an argument, e.g.

bash GAT-InstallAndDownload.sh ~/GenomeAnnotationTutorial # where ~/ stands for /path-to-GAT/

The Snakemake pipeline brings the total runtime down to ~20 h because it controls which scripts can be run in parallel. Execute_GAT_in_serial.sh also tells you which scripts can be run in parallel if you'd rather run them manually with your own data. We do not recommend running each tool serially once you move to a full mammalian genome. For example, BRAKER (short read), TOGA, and BRAKER (long read) could each take ~150 h for the 2.6 Gb naked mole-rat genome with RNA-seq or Iso-Seq from 5-10 tissues. Running these steps in parallel saves ~300 hours (~12 days) of runtime.
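
As a sketch, the three long-running steps could be submitted as independent cluster jobs on an SGE system. The script names here are hypothetical; check /scripts for the real ones:

   # Submit the three long-running annotation steps in parallel
   qsub -cwd -pe smp 16 -l h_rt=168:00:00 braker_short_read.sh
   qsub -cwd -pe smp 16 -l h_rt=168:00:00 toga.sh
   qsub -cwd -pe smp 16 -l h_rt=168:00:00 braker_long_read.sh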

The most time-consuming steps are Earl Grey and BRAKER (short read)/TOGA/BRAKER (long read), which are each in the 100-200 h range. With a 3 Gb mammalian genome, the remaining steps may take an additional ~50 hours combined. As such, we'd expect an end-to-end annotation with this tutorial to take 2-3 weeks of runtime.

Assuming the tutorial and reference directories are properly put together, executing each script on your own should only take a few hours of hands-on time (e.g., setup, submitting groups of scripts that should be run in parallel, making sure that each step has the expected output). Once the Snakemake pipeline is set up, annotation should take no hands-on time beyond setting up the config file for each assembly.

The two key paths, which must be set to match your system, are:

sourceDir=/path-to/miniconda3/
tutorialDir=/path-to/GenomeAnnotationTutorial

3. A Snakemake pipeline

GAT_Snakemake provides automated control of the genome annotation tutorial. It expects the same directory structure and setup as the other ways of using the tutorial (i.e., the same /data and /external directories, the same annotation_tutorial conda environment, and access to singularity as a module or in your path).

To run the pipeline, you need to change the config.yaml file to match your directory structure.

Make sure singularity is in your path and the annotation_tutorial environment is activated:

   module load singularity # or module load apptainer or whatever you need to get singularity in your environment
   conda activate annotation_tutorial

Then, run Snakemake using the parameters of your HPC. This is what it looks like on ours:

snakemake --jobs 750 --latency-wait 60 --cluster "qsub -cwd -V -o snakemake.output.log -e snakemake.error.log -pe smp {threads} -l h_vmem={params.memory_per_thread} {params.extra_cluster_opt} -l h_stack=32M -l h_rt={params.walltime} -P simpsonlab -b y" "$@"
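
If your cluster uses SLURM rather than SGE, a roughly equivalent invocation might look like the following sketch (adjust options to your site):

   snakemake --jobs 750 --latency-wait 60 \
     --cluster "sbatch --cpus-per-task={threads} --mem-per-cpu={params.memory_per_thread} --time={params.walltime} -o snakemake.output.log -e snakemake.error.log"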

Example data

Assuming everything is properly installed and you have adjusted the two variables described below, you can run the example data by submitting Execute_GAT_in_serial.sh as a job. We recommend ~72 h runtime, 16 cores, and 64 GB of RAM for the tutorial data.

The example data is pulled from Zenodo. We assume that the wget command below is run in /path-to-GAT/GenomeAnnotationTutorial (see /setup/GAT-InstallAndDownload.md for details).

We include a chromosome from our naked mole-rat assembly (plus RNA-seq and Iso-Seq data) for trying out this tutorial.

wget https://zenodo.org/records/14962941/files/example_data.tar.gz
tar -xvzf example_data.tar.gz

Assuming everything is properly set up, running the tutorial without flow control involves submitting the execute script with two positional arguments:

bash Execute_GAT_in_serial.sh path-to-GAT path-to-Conda

For us this looks like:

bash Execute_GAT_in_serial.sh \
/.mounts/labs/simpsonlab/users/dsokolowski/projects/GenomeAnnotationTutorial \
/.mounts/labs/simpsonlab/users/dsokolowski/miniconda3

Even annotating one chromosome is relatively resource-intensive, so we recommend submitting this as a job (recommended: 64 GB of memory, 16 cores, 72 h runtime).

Lastly, this script contains a module load singularity line. If you access singularity differently (e.g., via apptainer, a conda environment, etc.), replace that line with whatever you need to make singularity accessible.

Running the example data using the Snakemake pipeline takes slightly more work.

To run the tutorial with the example data, edit config.yaml: change "/path-to-conda/miniconda3/" to the path of the miniconda directory where the annotation_tutorial environment lives (e.g., /.mounts/labs/simpsonlab/users/dsokolowski/miniconda3/), and change "path-to-GAT/" to the path where you cloned "GenomeAnnotationTutorial" (e.g., /.mounts/labs/simpsonlab/users/dsokolowski/projects/GenomeAnnotationTutorial).
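
A sketch of these two edits with sed, using the example paths above:

   # Point config.yaml at your miniconda installation and tutorial clone
   sed -i 's|/path-to-conda/miniconda3/|/.mounts/labs/simpsonlab/users/dsokolowski/miniconda3/|' config.yaml
   sed -i 's|path-to-GAT/|/.mounts/labs/simpsonlab/users/dsokolowski/projects/GenomeAnnotationTutorial/|' config.yaml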

Otherwise, follow the steps in "3. A Snakemake pipeline".

List of tools

The main tools used in this tutorial are listed below:

  - Earl Grey (repeat annotation and soft-masking)
  - Liftoff and TOGA (annotation liftover)
  - HISAT2 and StringTie2 (RNA-seq alignment and transcript assembly)
  - BRAKER3 (gene prediction)
  - Mikado (combining and filtering gene models)
  - Infernal + Rfam and MirMachine (ncRNA annotation)
  - OrthoFinder and bedtools (gene symbol assignment)
  - gffread and BUSCO (format conversion and quality assessment)

Instructions on how to install these tools can be found in InstallAndDownload.md in the set-up directory.

Notes on computational requirements

We expect the user to be familiar with installing and running command-line tools and using R. Many tools can be run on a desktop, but some are very computationally intensive and require the use of a high-performance compute cluster (e.g. Compute Canada). The speed of many tools will improve if they have access to multiple threads and can therefore run tasks in parallel. To check how many threads you can specify when running tools, check the documentation of your compute cluster or run nproc on your local desktop.

Virtual environments

Because genome annotation relies on a number of different command-line tools, we recommend using virtual environments whenever possible to manage dependency conflicts. Virtual environments can be created with tools like Docker, Conda, or Python's venv.

Docker containers exist for certain tools; a container packages a tool with all of its requirements, ready to use. You can check for one by running docker search name_of_tool. If a container is listed, you can typically use it by running docker run -v "$(pwd)":/tmp name_of_container command. Here, docker run runs the container; -v "$(pwd)":/tmp mounts your current working directory into the container at the path /tmp, so the tool inside the container can read and write your files; name_of_container is the name of whatever container you want to use; and command is the command of the tool you want to run.
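
For instance, a sketch with a hypothetical image name (use docker search to find a real one):

   # Look for published containers for a tool
   docker search mikado

   # Run the tool from a container, mounting the current directory at /tmp so
   # the files in your working directory are visible inside the container
   # (some_user/mikado is a hypothetical image name)
   docker run -v "$(pwd)":/tmp some_user/mikado mikado --help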

Conda environments can be created using the command conda create -n name_of_environment required_package_1 required_package_2 ..., where additional packages separated by spaces can replace the ellipsis. Conda considers all the packages requested by the user and creates an isolated environment in which all package versions should be compatible. If you only need one particular package, Conda will automatically find and install all of its dependencies into the environment. Environments are activated by running conda activate name_of_environment and deactivated with conda deactivate.
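
For example, a one-tool environment (BUSCO is available through the bioconda channel; the environment name is arbitrary):

   # Create an isolated environment containing BUSCO and its dependencies
   conda create -n busco_env -c conda-forge -c bioconda busco

   # Use the tool, then leave the environment
   conda activate busco_env
   busco --version
   conda deactivate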

Python environments can be created with venv or virtualenv and work similarly to Conda environments: you create a virtual environment, activate it, and install the Python packages a particular tool requires, without conflicting with anything else on your machine. To create a virtual environment, run virtualenv /path/to/virtual/env, where the path is wherever you'd like to store it. Activate the environment with source /path/to/virtual/env/bin/activate and deactivate it with deactivate.
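
Putting the steps together (the path and package are illustrative; Mikado, for instance, is installable with pip):

   # Create and activate a virtual environment
   virtualenv ~/envs/annotation_env        # or: python3 -m venv ~/envs/annotation_env
   source ~/envs/annotation_env/bin/activate

   # Install what the tool needs, isolated from the rest of the system
   pip install mikado

   # Leave the environment when finished
   deactivate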

Running long jobs

Some of the tools we recommend take a long time to run: sometimes hours, sometimes days. In that case, you probably want the job to run in the background so that you can do other things on your computer. You can do this with the nohup command, which wraps around any command you want to execute and lets the job keep running after you exit the terminal - just don't turn your computer off. A typical invocation is nohup command argument1 argument2 > output.log 2>&1 &, where the trailing & sends the job to the background and the redirection sends everything that would normally be printed to the screen into output.log (without redirection, nohup writes it to nohup.out).
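
For example, to run the serial execute script in the background while logging its output (paths as in the earlier example):

   # Run the pipeline in the background and keep a log
   nohup bash Execute_GAT_in_serial.sh /path-to-GAT/GenomeAnnotationTutorial /path-to/miniconda3 > gat.log 2>&1 &
   echo $!          # the process ID, in case you need to check on or stop the job
   tail -f gat.log  # watch progress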
