This repository contains the code used to analyse the submissions for the Inaugural Flatiron Cryo-EM Heterogeneity Challenge.
This repository explains how to preprocess a submission (80 maps and corresponding probability distribution), and analyze it. Challenge participants can benchmark their submissions locally against the ground truth and other submissions that are available on the cloud via the Open Science Foundation project The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge.
This is a work in progress: while the code is unlikely to change, we are still writing better tutorials and documentation and exploring other ideas for analyzing the data. We are also making it easier for others to contribute their own metrics and methods, and we are in the process of distributing the code on PyPI.
The data is available via the Open Science Foundation project The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge. You can download it via a web browser, or programmatically with wget as per this script.
NOTE: We recommend downloading the data with the script and wget, as downloads from the web browser might be unstable.
Running the code in this repository requires a system with Python 3.10-3.13. The code is tested automatically via GitHub Actions on the latest stable Ubuntu version for Python 3.10, 3.11, 3.12, and 3.13. No specialized hardware is required to run the pipelines. GPU acceleration is available through PyTorch, and the execution of the Map to Map pipelines can optionally be accelerated with Dask.
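As a minimal sketch (not code from this repository), opting into GPU acceleration follows the standard PyTorch device pattern:

```python
# Sketch only: the pipelines use PyTorch, so GPU use follows the usual
# torch device pattern. Nothing here is specific to this repository.
import torch

def pick_device() -> torch.device:
    # Fall back to CPU when no CUDA-capable GPU is present
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = pick_device()
maps = torch.randn(8, 16, 16, 16, device=device)  # toy batch of volumes
```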
Note: the installation time may vary depending on the virtual environment framework used. Using Python's default venv, the installation should take between one and two minutes.
Before installation, please clone this GitHub repository locally:
git clone [email protected]:flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1.git
The library in this repository may be installed with pip. We recommend creating a virtual environment (using conda or pyenv), since we have dependencies such as PyTorch and ASPIRE, which are better handled in an isolated environment. After creating your environment, make sure to activate it and run:
cd /path/to/Cryo-EM-Heterogeneity-Challenge-1
pip install .
Alternatively, you can install without cloning the repository by running:
pip install git+https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1.git
although this will not give you access to the tests and tutorials.
If you are interested in developing, please, install the repository in editable mode and install the development dependencies with the following commands:
cd /path/to/Cryo-EM-Heterogeneity-Challenge-1
pip install -e ".[dev]"
We recommend testing your installation by running
cd /path/to/Cryo-EM-Heterogeneity-Challenge-1
pytest tests/
This will make sure all the pipelines run on a mock set of submissions.
We have prepared a demo dataset for running all our pipelines. The instructions for downloading the required data and running the pipelines can be found here.
If you want to run our code on the full challenge data, or your own local data, please complete the following steps.
As the submissions for the challenge are anonymous, we do not provide the raw submissions. To run our preprocessing pipeline on new submissions, please see our preprocessing tutorial. We provide access to all preprocessed submissions below.
1. Download the full challenge data from The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge
The data required to reproduce the results from the challenge can be downloaded from OSF. The data consists of three folders: dataset_1_submissions, dataset_2_submissions, and Ground_truth. The first two folders contain the preprocessed submissions for the experimental and simulated datasets, respectively. The third folder contains the GT-based mock submissions (Averaged GT and Sampled GT), as well as the volumes used for generating the data for the simulated datasets (maps_gt_flat.pt). These are provided as a .pt file to ease the execution of our analysis pipelines.
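Since the ground-truth volumes are stored as a single .pt file, they can be inspected with torch.load. The round-trip below is illustrative only; the shape used (80 maps, each a flattened volume) is an assumption for the example, not the actual layout of maps_gt_flat.pt:

```python
# Illustrative sketch: save/load round-trip for a .pt file of flattened maps.
# The shape (80 maps, each a flattened 16^3 volume) is an assumption for this
# example, not the verified dimensions of maps_gt_flat.pt.
import torch

def load_flat_maps(path: str) -> torch.Tensor:
    # torch.load deserializes the tensor saved with torch.save
    return torch.load(path)

# Hypothetical round-trip demonstrating the file format
fake_maps = torch.randn(80, 16 ** 3)
torch.save(fake_maps, "maps_demo.pt")
loaded = load_flat_maps("maps_demo.pt")
```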
The tutorial notebooks explain how to setup the config files, and how to run each pipeline. We provide examples of config files for each pipeline. Given a config file, each pipeline can be executed from the command line as:
run_svd_pipeline --config path/to/config_files/config_svd.yaml
run_map_to_map_pipeline --config path/to/config_files/config_map_to_map.yaml
run_distribution_to_distribution_pipeline --config path/to/config_files/config_distribution_to_distribution.yaml
Examples of config files can be found here.
Running this pipeline for the simulated dataset submissions takes around 45 minutes on a compute node with 64 CPU cores and 128 GB of RAM. The runtime divides into 1) preparing the submissions (30 min) and 2) computing the metrics (15 min). Since the prepared submissions are saved to disk (optional; see the tutorial), subsequent executions of this metric will be faster.
See our SVD Tutorial for a step-by-step on how to reproduce the results presented in the paper.
Running this pipeline takes several hours or more per submission, depending on which metrics are requested. On a compute node with dozens of CPU cores and several hundred GB of RAM, the l2, bioem, and corr metrics take several minutes; fsc takes several hours, with res taking minutes but requiring the fsc results. The other metrics take much longer, and we recommend using only a few hundred ground truth volumes as references for them.
Running this pipeline takes several minutes (we used a compute node with dozens of CPU cores and several hundred GB of RAM, but much less would suffice), and it scales linearly with the number of replicates requested. Depending on the CVXPY solver and tolerance values requested, it can take significantly longer. We recommend ECOS, SCS, or CLARABEL.
If you find any bug or have a suggestion on the code feel free to open an issue here.
We also welcome any help with the development of this repository. If you want to contribute with your own suggestions, code, or fixes, we recommend creating a fork of this repository to avoid any incompatibilities with newer versions of the software. Once you are happy with your new code, please, make a PR from your fork to this repository.
We are also working on pipelines to simplify the extension of the code with new metrics or functionalities, so stay tuned!
- Miro A. Astore, Geoffrey Woollard, David Silva-Sánchez, Wenda Zhao, Khanh Dao Duc, Nikolaus Grigorieff, Pilar Cossio, and Sonya M. Hanson. "The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge". 9 June 2023. DOI:10.17605/OSF.IO/8H6FZ
- David Herreros, for testing, CI, and debugging in this repo