Convergent evolution in protein antigens is common across pathogens and has also been documented in SARS-CoV-2 (hCoV-19); it is most likely driven by the selective pressure exerted by immunity elicited by previous infection or vaccination. There is a pressing need for tools that allow automated analysis of convergent mutations.
In response to this need, we developed ConvMut, a tool to analyze genetic sequence data to identify patterns of recurrent mutations in SARS-CoV-2 evolution. To this end, we exploited the granular phylogenetic tree representation developed by PANGO, allowing us to observe what we call deltas, i.e., groups of mutations that are acquired on top of the immediately upstream tree nodes. Deltas comprise amino acid substitutions, insertions, and deletions. ConvMut can perform individual protein analysis to identify the most common single mutations acquired independently in a given subtree (starting from a user-selected root). Such mutations are represented in a barplot that can be sorted by frequency or position, and filtered by region of interest. Lineages are then gathered into clusters according to their sets of shared mutations. Finally, an interactive graph orders the evolutionary steps of clusters, details the acquired amino acid changes for each sublineage, and allows tracing the evolutionary path up to a selected lineage.
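The core notion of a delta can be illustrated with a small, self-contained sketch: the delta of a lineage is the set difference between its mutations and those of its immediate parent. The lineage names, mutation sets, and function below are purely illustrative toy data, not ConvMut's actual API or real PANGO designations:

```python
# Toy sketch of the "delta" concept: the set of mutations a lineage
# acquires on top of its immediate parent in the phylogenetic tree.
# All lineage/mutation data below is illustrative, not real designations.

parents = {"BA.2.75": "BA.2", "XBB.1.5": "XBB.1"}
mutations = {
    "BA.2":    {"S:G339D", "S:N501Y"},
    "BA.2.75": {"S:G339D", "S:N501Y", "S:K356T", "S:G446S"},
    "XBB.1":   {"S:G339D", "S:V83A"},
    "XBB.1.5": {"S:G339D", "S:V83A", "S:F486P"},
}

def delta(lineage: str) -> set[str]:
    """Mutations acquired by `lineage` relative to its immediate parent."""
    parent = parents.get(lineage)
    if parent is None:  # tree root: every mutation counts as acquired
        return set(mutations.get(lineage, set()))
    return mutations[lineage] - mutations[parent]
```

Counting how often the same mutation appears in the deltas of independent branches is then enough to surface candidate convergent mutations.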
Other unique tools are paired with the main functionality of ConvMut to support a complete analysis, such as a frequency analysis of the nucleotide or amino acid changes observed at a given residue across a selected phylogenetic subtree.
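As a rough illustration of what such a residue-level frequency analysis computes, the sketch below counts the amino acid changes observed at a chosen Spike residue across a collection of mutation lists. The helper function, the `S:<ref><pos><alt>` mutation notation, and the toy data are assumptions for illustration, not ConvMut's implementation:

```python
# Hypothetical sketch of a residue-level frequency analysis: count which
# amino acid changes occur at a chosen Spike residue across a set of
# per-lineage mutation lists (toy data, illustrative only).
from collections import Counter
import re

def changes_at_residue(mutation_lists, residue: int) -> Counter:
    """Count amino acid changes observed at `residue` (e.g. S:F486P)."""
    counts: Counter = Counter()
    for muts in mutation_lists:
        for m in muts:
            match = re.fullmatch(r"S:([A-Z])(\d+)([A-Z*])", m)
            if match and int(match.group(2)) == residue:
                counts[match.group(1) + str(residue) + match.group(3)] += 1
    return counts
```

A sorted view of these counts is essentially what a frequency barplot over a subtree would display for that residue.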
ConvMut will facilitate the design of antiviral anti-Spike monoclonal antibodies and Spike-based vaccines with longer-lasting efficacy, minimizing development and marketing failures.
The ConvMut software is deployed on the GISAID platform (https://gisaid.org/) and usable by all users with a valid registration, leveraging the real-time updated EpiCoV database.
For use on local servers, we provide an open implementation that can be downloaded from this GitHub repository, run on any correctly formatted input, and deployed as described next.
This software requires a computer with:
- Docker CLI v2 (or above) or Docker Desktop >= 4.0.0
- a terminal / command prompt
- storage space: minimum 16 GB
- memory: minimum 16 GB
All the commands are intended to be used in a terminal window inside the directory where this file resides.
Software dependencies are installed through pip within a Docker container (convmut-open-data-base-image) that prepares the virtual environment for running the software. To prepare the virtual environment, start Docker and run:
```shell
docker compose build base && docker compose build
```

Note for software developers: whenever a change to the dependencies (i.e., the requirements.txt file) is made, rebuild the virtual environment with the `--no-cache` option to apply the change.
Open a new terminal on your system, navigate to the ConvMut directory and launch the Input Data Updater Service by executing the command
```shell
docker compose run --rm open-data-updater
```

Whenever you need to update an outdated version of the data, repeat the instructions in this section. If the Input Data Updater Service detects a newer version of the data, it will be downloaded. The data update will be automatically reflected in the Application Frontend with some delay, or one can force the update immediately by restarting the application as described in the next section.
Troubleshooting tips: if the files are not downloaded or updated, there might be an issue with your internet connection, or the source files might have changed location or access protocol. The Input Data Updater Service normally silences any failure, but you can make failures explicit by appending the options `--no-silent-date-check --no-silent-download` to the above command.
- Open a new terminal on your system, navigate to the ConvMut directory and launch both the Application Data Updater Service and the Streamlit Service by executing the command

  ```shell
  docker compose up open-data-frontend
  ```

- Open a browser and navigate to the address https://localhost:65265.
- Start exploring!
> [!WARNING]
> **Startup times**: on the very first run, the application usually takes ~10 minutes before getting ready. During this time, the necessary Application Data is generated. Subsequent runs won't introduce any delay, even if you stop/restart the application, until the Dynamic Input Data is updated.
Open Docker Desktop and delete the images and containers related to ConvMut. Then delete this repository.
The system can be viewed as a stack of layers, including data and services.
The Data provider layer includes the application's input data sources; specifically, we employ Pangolin designations and UCSC-built lineage constellations.
The download of the Dynamic Input Data into a Docker Data Volume is managed by the Input Data Updater Service of the Application Backend. When launched, the Input Data Updater Service compares the local files with those available from the corresponding sources and downloads the related files whenever an updated version is available. Within Dynamic Input Data, we expect some files containing information about the current clades (e.g., Pango-lineages), their mutations (along a chosen phylogeny), and hierarchical relationships.
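The freshness check described above can be pictured as a simple comparison between the local copy and the remote source. The sketch below is a hypothetical simplification: the function name and the use of file modification times are assumptions for illustration, not the service's actual logic:

```python
# Illustrative sketch (not ConvMut's actual code) of the update check the
# Input Data Updater Service performs: download only when the remote copy
# of an input file is newer than the local one (or the local one is missing).
from pathlib import Path

def needs_update(local: Path, remote_last_modified: float) -> bool:
    """True when the local file is missing or older than the remote version."""
    if not local.exists():
        return True
    return local.stat().st_mtime < remote_last_modified
```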
The Application Backend (packaged as a Docker Application Container) integrates:
- the Input Data Updater Service providing and updating the Dynamic Input Data
- the Application Data Updater Service, which transforms the dynamic input data (served through the Docker Data Volume) into a format that offers better performance for the computation of queries (see Application Data); the task runs at application startup and afterwards in the background, re-computing the application data whenever the input source is updated.
- the Streamlit Service, backed by the Streamlit application framework, enclosing the API, the UI modules, and the main application logic.
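One way to picture the kind of transformation the Application Data Updater Service performs is re-indexing per-lineage mutation lists into a reverse lookup, so that a query such as "which lineages carry this mutation?" becomes a single dictionary access. This is a hypothetical sketch of the idea; the data layout and function name are assumptions, not the actual Application Data format:

```python
# Hypothetical sketch of a query-friendly re-indexing step: invert
# lineage -> mutations lists into a mutation -> lineages lookup table.
# The input/output shapes are illustrative, not ConvMut's real format.
from collections import defaultdict

def build_mutation_index(lineage_mutations: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each mutation to the set of lineages that carry it."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for lineage, muts in lineage_mutations.items():
        for m in muts:
            index[m].add(lineage)
    return dict(index)
```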
By Application Data we refer to a collection of files encoded in a format that helps the application answer user queries efficiently. This is composed of:
- files representing the domain knowledge (e.g., protein annotations, the reference sequence) – as this information is not supposed to change in the future, we refer to this as "static data" and ship it together with the application;
- data files dynamically generated by the Application Data Updater Service.
Finally, we have the Application Frontend, a graphical interface allowing the user to explore the convergent mutations.
We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based.
Consider citing this work in your research as:
Tommaso Alfonsi, Anna Bernasconi, Emma Fanfoni, Cesare Ernesto Maria Gruber, Fabrizio Maggi, Daniele Focosi.
ConvMut: Exploration of viral convergent mutations along phylogenies. bioRxiv
https://doi.org/10.1101/2024.12.16.628620
https://annabernasconi.faculty.polimi.it/
Phone: +39 02 2399 3494