MutClust: Mutual Rank-Based Clustering and GO Enrichment Analysis

MutClust is a Python package for RNA-seq gene coexpression analysis. It performs mutual rank (MR)-based clustering of coexpressed genes and identifies enriched Gene Ontology (GO) terms for the resulting clusters. The package is optimized for speed and can run a whole-genome coexpression analysis in minutes.


Features

  • Mutual Rank Analysis: Calculates the correlation matrix from RNA-seq data and derives MR from Pearson correlation coefficients to identify coexpressed genes.
  • MR Filtering: Filters gene pairs by MR threshold and applies exponential decay to the retained MR values.
  • Leiden Clustering: Groups coexpressed genes into clusters based on mutual rank and exponential decay weights.
  • Eigen-genes: Calculates an eigen-gene (first principal component) for each cluster.
  • Gene Annotations: Merges cluster members with gene annotations, if provided.
  • GO Enrichment Analysis: Identifies enriched GO terms for each cluster using GOATOOLS.
  • Highly Configurable: Supports adjustable thresholds, resolution parameters, and multi-threading for performance optimization.
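
To make the mutual rank step concrete, here is a minimal NumPy sketch. It is not MutClust's implementation (which relies on PyNetCor for speed). It assumes the common definition MR(a, b) = sqrt(rank of b among a's partners × rank of a among b's partners), with partners ranked by decreasing Pearson correlation; whether MutClust ranks signed or absolute correlations is not stated here. The function name `mutual_rank` and the toy matrix are illustrative only.

```python
import numpy as np

def mutual_rank(expr):
    """Return the mutual rank matrix for a genes x samples array (assumed definition)."""
    pcc = np.corrcoef(expr)               # gene-by-gene Pearson correlations
    np.fill_diagonal(pcc, -np.inf)        # push self-correlation to the last rank
    # order[i] lists the partners of gene i from most to least correlated
    order = np.argsort(-pcc, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(pcc.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, pcc.shape[1] + 1)  # rank 1 = best partner
    mr = np.sqrt(ranks * ranks.T)         # geometric mean of the two directed ranks
    np.fill_diagonal(mr, 0.0)             # a gene has no mutual rank with itself
    return mr

# Toy data: 4 genes (rows) x 5 samples (columns)
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0, 4.0],
])
mr = mutual_rank(expr)
print(np.round(mr, 3))
```

Low MR values mark pairs that are each other's top-ranked partners, which is why a threshold on MR is a natural filter for coexpression edges.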

Installation

You can install MutClust directly from PyPI:

pip install mutclust

Note: Because of a known dependency issue with PyNetCor, MutClust cannot currently be installed from PyPI on macOS; it installs properly on Linux.

Alternatively, you can clone the repository and install it locally:

git clone https://github.com/eporetsky/mutclust.git
cd mutclust
pip install .

Conda Environment (Recommended)

A conda environment simplifies dependency management. It is especially useful for installing clusterone (required for some workflows) from bioconda:

conda env create -f environment.yml
conda activate mutclust

This installs all core dependencies and bioconda::clusterone, and sets up MutClust in editable mode, so changes to the source code take effect in the CLI immediately.

Docker Installation

For users who prefer containerized deployment, MutClust is available as a Docker container:

# Build the container
docker build -t mutclust .

# Run MutClust with your data
docker run -v /path/to/your/data:/data mutclust mutclust mr -i /data/your_expression.tsv -o /data/results

The container uses Ubuntu 20.04 and includes all necessary dependencies. Mount your data directory to /data inside the container to access your files.


Usage

MutClust now provides a Click-based command-line interface (CLI) with three main subcommands:

  • mutclust mr: Calculate mutual rank from an expression dataset
  • mutclust cls: Run clustering analysis on a given MR table
  • mutclust enr: Run GO enrichment analysis on clusters

Basic Usage

# Calculate mutual rank from expression data
mutclust mr -i input.tsv -o output_prefix

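# Run clustering analysis on a mutual rank table
# (eigen-gene calculation is on by default and needs --expression; pass --no-eigengene to skip it)
mutclust cls -i output_prefix.mrs.tsv -o output_prefix --expression input.tsv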

# Run GO enrichment analysis on clusters
mutclust enr -c output_prefix.clusters.tsv -go go-basic.obo -gf tair.gaf -o output_prefix

Subcommand Arguments

mutclust mr

Argument              Short    Description                                           Default
--input               -i       Path to the RNA-seq dataset (TSV format).             Required
--output              -o       Output prefix for the results.                        Required
--mr-threshold        -m       Mutual rank threshold for filtering.                  100
--e-value             -e       Exponential decay constant.                           10
--threads             -t       Number of threads for correlation calculation.        4
--save-intermediate            Save intermediate files (PCC, MR, filtered pairs).    Optional
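
The --mr-threshold and --e-value options can be read as a filter followed by a weighting step. The sketch below assumes the exponential decay form ED = exp(-(MR - 1) / e), which gives values close to the ED column in the example output of this README (MR 6 with e = 10 gives about 0.61), and assumes that pairs with MR above the threshold are dropped. The exact formula and comparison should be checked against the source; `decay_weight` and `filter_pairs` are hypothetical helper names.

```python
import math

def decay_weight(mr, e=10.0):
    """Exponential decay weight: 1.0 at MR = 1, shrinking toward 0 as MR grows."""
    return math.exp(-(mr - 1.0) / e)

def filter_pairs(pairs, mr_threshold=100.0, e=10.0):
    """Keep (gene1, gene2, mr) tuples with mr <= threshold and append the weight."""
    return [(g1, g2, mr, decay_weight(mr, e))
            for g1, g2, mr in pairs if mr <= mr_threshold]

pairs = [
    ("GeneA", "GeneB", 10.2),
    ("GeneB", "GeneC", 6.0),
    ("GeneA", "GeneC", 250.0),   # above the default threshold, so it is dropped
]
kept = filter_pairs(pairs)
for g1, g2, mr, ed in kept:
    print(f"{g1}\t{g2}\t{mr}\t{ed:.2f}")
```

A smaller decay constant makes the weights fall off faster, so only the very top-ranked pairs contribute meaningfully to clustering.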

mutclust cls

Argument                     Short    Description                                            Default
--input                      -i       Path to Mutual Rank (MR) table (TSV format).           Required
--output                     -o       Output prefix for the results.                         Required
--annotations                -a       Path to the gene annotation file.                      Optional
--resolution                 -r       Resolution parameter for Leiden clustering.            0.1
--eigengene/--no-eigengene            Calculate eigen-genes for clusters.                    True
--expression                          Path to RNA-seq dataset for eigen-gene calculation.    Required if --eigengene
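
An eigen-gene summarizes a cluster as the first principal component of its members' expression profiles. The following is a minimal NumPy sketch under the assumption that each gene is z-scored across samples before the decomposition; whether MutClust standardizes the data, and how it scales or orients the component, is not stated here. `cluster_eigengene` is a hypothetical name.

```python
import numpy as np

def cluster_eigengene(expr):
    """First principal component across samples for a genes x samples array."""
    centered = expr - expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    z = centered / np.where(std == 0, 1.0, std)   # leave flat genes at zero
    # Rows of vt are unit-length directions in sample space, ordered by variance explained
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    eigengene = vt[0]
    # The sign of a principal component is arbitrary; align it with the mean profile
    if np.dot(eigengene, z.mean(axis=0)) < 0:
        eigengene = -eigengene
    return eigengene

# Toy cluster: 3 genes x 4 samples, all rising across samples
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.2, 7.9],
    [0.5, 1.1, 1.4, 2.2],
])
eg = cluster_eigengene(expr)
print(np.round(eg, 4))
```

The result has one value per sample, which is why eigen-gene calculation needs the expression table in addition to the MR table.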

mutclust enr

Argument       Short    Description                                         Default
--clusters     -c       Path to clusters file (TSV format).                 Required
--go-obo       -go      Path to the Gene Ontology (GO) OBO file.            Required
--go-gaf       -gf      Path to the GO annotation file (GAF format).        Required
--output       -o       Output prefix for the results.                      Required
--expression            Path to RNA-seq dataset for background gene set.    Optional
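
For intuition about the enrichment statistics, the following standard-library sketch computes a one-sided hypergeometric p-value (equivalent to a one-sided Fisher's exact test) and a fold change for a single GO term in a single cluster. It is not a substitute for GOATOOLS and applies no multiple-testing correction. The fold-change definition used here (cluster fraction divided by background fraction) is an assumption about what the FC column reports, and the counts are made up for illustration.

```python
from math import comb

def enrichment(k, n, K, N):
    """k of n cluster genes carry the term; K of N background genes carry it."""
    # P(X >= k) when n genes are drawn without replacement from the background
    total = comb(N, n)
    p = sum(comb(K, i) * comb(N - K, n - i)
            for i in range(k, min(n, K) + 1)) / total
    fold_change = (k / n) / (K / N)
    return p, fold_change

# Hypothetical counts: 5 of 25 cluster genes vs 200 of 20000 background genes
p_val, fc = enrichment(k=5, n=25, K=200, N=20000)
print(f"p-val={p_val:.3g}\tFC={fc:.1f}")
```

The background size N is what --expression influences: restricting the background to expressed genes changes K and N, and therefore both the p-value and the fold change.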

Example Workflow

# Step 1: Calculate mutual rank
tab="data/AtCol-0.cpm.tsv"
mutclust mr -i "$tab" -o results/atcol0

# Step 2: Cluster genes
mutclust cls -i results/atcol0.mrs.tsv -o results/atcol0 --annotations annotations/AtCol-0.annot.tsv --expression "$tab"

# Step 3: GO enrichment
mutclust enr -c results/atcol0.clusters.tsv -go go-basic.obo -gf tair.gaf -o results/atcol0 --expression "$tab"
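
The same three steps can be scripted from Python. This sketch only assembles the commands documented above and hands them to `subprocess.run`; the helper names `build_pipeline` and `run_pipeline` are hypothetical, and the paths are the ones from the example workflow.

```python
import subprocess

def build_pipeline(expression, prefix, annotations, go_obo, go_gaf):
    """Return the three mutclust commands as argument lists, in run order."""
    return [
        ["mutclust", "mr", "-i", expression, "-o", prefix],
        ["mutclust", "cls", "-i", f"{prefix}.mrs.tsv", "-o", prefix,
         "--annotations", annotations, "--expression", expression],
        ["mutclust", "enr", "-c", f"{prefix}.clusters.tsv", "-go", go_obo,
         "-gf", go_gaf, "-o", prefix, "--expression", expression],
    ]

def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failing step."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

commands = build_pipeline(
    expression="data/AtCol-0.cpm.tsv",
    prefix="results/atcol0",
    annotations="annotations/AtCol-0.annot.tsv",
    go_obo="go-basic.obo",
    go_gaf="tair.gaf",
)
# run_pipeline(commands)  # uncomment once mutclust is installed and the files exist
```

Passing argument lists rather than a single shell string avoids quoting problems when paths contain spaces.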

Input File Formats

RNA-seq Dataset

  • Format: Tab-separated values (TSV).
  • Layout: Gene IDs in the first column (used as the row index), followed by one column per sample.
  • Example:
geneID    Sample1    Sample2    Sample3
GeneA     1.23       2.34       3.45
GeneB     4.56       5.67       6.78
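
Before running mutclust mr, it can help to confirm that a file matches this layout. The sketch below uses only the standard library and an in-memory copy of the example; `read_expression` is a hypothetical helper and is not how MutClust itself parses its input.

```python
import csv
import io

def read_expression(handle):
    """Parse a TSV with gene IDs in the first column into (samples, {gene: values})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], [float(v) for v in row[1:]]
        if len(values) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(values)}")
        if gene in table:
            raise ValueError(f"duplicate gene ID: {gene}")
        table[gene] = values
    return samples, table

example = io.StringIO(
    "geneID\tSample1\tSample2\tSample3\n"
    "GeneA\t1.23\t2.34\t3.45\n"
    "GeneB\t4.56\t5.67\t6.78\n"
)
samples, table = read_expression(example)
print(samples, len(table))
```

Non-numeric cells raise a ValueError from float(), and ragged rows or duplicate gene IDs are reported by name, which catches the most common formatting problems early.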

Gene Annotation File

  • Format: Tab-separated values (TSV).
  • Columns: geneID and additional annotation fields.
  • Example:
geneID    description
GeneA     Photosynthesis-related protein
GeneB     Transcription factor

GO OBO File

  • Description: The Gene Ontology (GO) OBO file contains the ontology structure.
  • Source: Download from Gene Ontology.

GO GAF File

  • Description: The Gene Annotation File (GAF) maps genes to GO terms.
  • Source: Download from Gene Ontology.

Output Files

  1. Filtered MR and exponential decay values (<output_prefix>.mrs.tsv):

    • Lists coexpressed gene pairs with their MR and exponential decay (ED) values.
    • Columns: Gene1, Gene2, MR, ED.

    Example:

    Gene1    Gene2    MR    ED
    GeneA    GeneB    10.2  0.39
    GeneB    GeneC    6     0.6
  2. Clustered Genes (<output_prefix>.clusters.tsv):

    • Lists the genes in each cluster.
    • Columns: clusterID, geneID, followed by annotation columns if annotations were provided.

    Example:

    clusterID    geneID    Annotations
    c1           GeneA     ...
    c1           GeneB     ...
  3. GO Enrichment Results (<output_prefix>_go_enrichment_results.tsv):

    • Contains enriched GO terms for each cluster.
    • Columns: cluster, type, size, term, p-val, FC, desc.

    Example:

    cluster    type    size    term       p-val       FC    desc
    c1         BP      25      GO:0008150 0.00123     3.5   Biological Process
  4. Eigen-gene values (<output_prefix>.eigen.tsv):

    • Eigen-gene values for each cluster across samples.
    • Columns: geneID (holds the cluster ID) and one column per sample.

    Example:

    geneID    Sample1    Sample2    Sample3
    c1        0.707107   0.707107   0.707107
    c2        0.577350   0.577350   0.577350
    c3        0.500000   0.500000   0.500000
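
Downstream scripts often start by grouping the clusters table by cluster ID. This is a standard-library sketch that uses the documented column names (clusterID, geneID); the inline text stands in for a real <output_prefix>.clusters.tsv, and the gene names are placeholders.

```python
import csv
import io
from collections import defaultdict

def genes_by_cluster(handle):
    """Map each clusterID to the list of geneIDs assigned to it."""
    clusters = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        clusters[row["clusterID"]].append(row["geneID"])
    return dict(clusters)

clusters_tsv = io.StringIO(
    "clusterID\tgeneID\n"
    "c1\tGeneA\n"
    "c1\tGeneB\n"
    "c2\tGeneC\n"
)
clusters = genes_by_cluster(clusters_tsv)
for cluster_id, genes in sorted(clusters.items()):
    print(cluster_id, len(genes), ",".join(genes))
```

Because the rows are read by column name, any extra annotation columns in the file are ignored rather than causing an error.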

Dependencies

The following Python libraries are required and will be installed automatically:

  • numpy
  • pandas
  • pynetcor
  • python-igraph
  • goatools
  • scikit-learn
  • click

Other dependencies (such as clusterone) can be installed via conda/bioconda as needed.


License

This project is licensed under the MIT License. See the LICENSE file for details.


Contributing

Contributions, suggestions and issues are welcome!
