Skip to content

ScaleSC: A superfast and scalable single cell RNA-seq data analysis pipeline powered by GPU. https://doi.org/10.1101/2025.01.28.635256

License

Notifications You must be signed in to change notification settings

interactivereport/ScaleSC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

79 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ScaleSC

A GPU-accelerated tool for large scale scRNA-seq pipeline.

Highlights • Why ScaleSC • Installation • Tutorial • API Reference

Highlights

  • Fast scRNA-seq pipeline including QC, Normalization, Batch-effect Removal, Dimension Reduction in a similar syntax as scanpy and rapids-singlecell.
  • Scale to dataset with more than 10M cells on a single GPU. (A100 80G)
  • Chunk the data to avoid the int32 limitation in cupyx.scipy.sparse used by rapids-singlecell that disables the computing for moderate-size dataset (~1.3M) without Multi-GPU support.
  • Reconcile output at each step to scanpy to reproduce the same results as on CPU end.
  • Improvement on harmonypy which allows dataset with more than 10M cells and more than 1000 samples to be run on a single GPU.
  • Speedup and optimize NSForest algorithm using GPU for better maker gene identification.
  • Merge clusters according to the gene expression of detected markers by NSForest.

Why ScaleSC

What can ScaleSC do?

pipeline

ScaleSC Overview

overview



Overview of different packages*
scanpy scalesc rapids-singlecell
GPU Support ❌ ✅ ✅
int32 Issue in Sparse ✅ ✅ ❌
Upper Limit of #cell 5M ~20M ~1M
Upper Limit of #sample <100 >1000 <100



Time comparsion between scanpy(CPU) and scalesc(GPU) on A100(80G)

time-comp

How To Install

Note: ScaleSC requires a high-end GPU (> 24G VRAM) and a matching CUDA version to support GPU-accelerated computing.

Requirements:

  • RAPIDS from Nvidia
  • rapids-singlecell, an alternative of scanpy that employs GPU for acceleration.
  • Conda, version >=22.11 is strongly encoruaged, because conda-libmamba-solver is set as default, which significant speeds up solving dependencies.
  • pip, a python package installer.

Environment Setup:

  1. Install RAPIDS through Conda,
    conda create -n scalesc -c rapidsai -c conda-forge -c nvidia \ rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8 Users have flexibility to install it according to their systems by using this online selector. We highly recommand to install **RAPIDS**>=24.12, it solves a bug related to the leiden algorithm which results in too many clusters.

  2. Activate conda env,
    conda activate scalesc

  3. Install rapids-singlecell using pip,
    pip install rapids-singlecell

  4. Install scaleSC,

    • pull scaleSC from github
      git clone https://github.com/interactivereport/scaleSC.git
    • enter the folder and install scaleSC
      cd scaleSC
      pip install .
  5. check env:

    • python -c "import scalesc; print(scalesc.__version__)" == 0.1.0
    • python -c "import cupy; print(cupy.__version__)" >= 13.3.0
    • python -c "import cuml; print(cuml.__version__)" >= 24.10
    • python -c "import cupy; print(cupy.cuda.is_available())" = True
    • python -c "import xgboost; print(xgboost.__version__) >= 2.1.1, optionally for marker annotation

Tutorial:

Citation

Please cite ScaleSC, and Scanpy, Rapids-singlecell, NSForest, AnnData according to their instructions respectively.

Updates:

  • 2/26/2025:
    • adding a parameter threshold in function adata_cluster_merge to support cluster merging at various scales according to user's specification. threshold is between 0 and 1. set to 0 by default.
    • updating a few more examples of cluster merging in the tutorial.
    • future work: adding supports for loading from large .h5ad files.

Contact

API Reference

class ScaleSC

ScaleSC integrated pipeline in a scanpy-like style.

It will automatcially load dataset in chunks, see scalesc.util.AnnDataBatchReader for details, and all methods in this class manipulate this chunked data.

Args:

  • data_dir (str): Data folder of the dataset.
  • max_cell_batch (int): Maximum number of cells in a single batch.
  • Default: 100000.
  • preload_on_cpu (bool): If load the entire chunked data on CPU. Default: True
  • preload_on_gpu (bool): If load the entire chunked data on GPU, preload_on_cpu will be overwritten to True when this sets to True. Default is True.
  • save_raw_counts (bool): If save adata_X to disk after QC filtering.
  • Default: False.
  • save_norm_counts (bool): If save adata_X data to disk after normalization.
  • Default: False.
  • save_after_each_step (bool): If save adata (without .X) to disk after each step.
  • Default: False.
  • output_dir (str): Output folder. Default: './results'.
  • gpus (list): List of GPU ids, [0] is set if this is None. Default: None.

method __init__

__init__(
    data_dir,
    max_cell_batch=100000.0,
    preload_on_cpu=True,
    preload_on_gpu=True,
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='results',
    gpus=None
)

property adata

AnnData: An AnnData object that used to store all intermediate results without the count matrix.

Note: This is always on CPU.


property adata_X

AnnData: An AnnData object that used to store all intermediate results including the count matrix. Internally, all chunks should be merged on CPU to avoid high GPU consumption, make sure to invoke to_CPU() before calling this object.


method calculate_qc_metrics

calculate_qc_metrics()

Calculate quality control metrics.


method clear

clear()

Clean the memory


method filter_cells

filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)

Filter genes based on number of a QC metric.

Args:

  • min_count (int): Minimum number of counts required for a cell to pass filtering.
  • max_count (int): Maximum number of counts required for a cell to pass filtering.
  • qc_var (str='n_genes_by_counts'): Feature in QC metrics that used to filter cells.
  • qc (bool=False): Call calculate_qc_metrics before filtering.

method filter_genes

filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)

Filter genes based on number of a QC metric.

Args:

  • min_count (int): Minimum number of counts required for a gene to pass filtering.
  • max_count (int): Maximum number of counts required for a gene to pass filtering.
  • qc_var (str='n_cells_by_counts'): Feature in QC metrics that used to filter genes.
  • qc (bool=False): Call calculate_qc_metrics before filtering.

method filter_genes_and_cells

filter_genes_and_cells(
    min_counts_per_gene=0,
    min_counts_per_cell=0,
    max_counts_per_gene=None,
    max_counts_per_cell=None,
    qc_var_gene='n_cells_by_counts',
    qc_var_cell='n_genes_by_counts',
    qc=False
)

Filter genes based on number of a QC metric.

Note:

This is an efficient way to perform a regular filtering on genes and cells without repeatedly iterating over chunks.

Args:

  • min_counts_per_gene (int): Minimum number of counts required for a gene to pass filtering.
  • max_counts_per_gene (int): Maximum number of counts required for a gene to pass filtering.
  • qc_var_gene (str='n_cells_by_counts'): Feature in QC metrics that used to filter genes.
  • min_counts_per_cell (int): Minimum number of counts required for a cell to pass filtering.
  • max_counts_per_cell (int): Maximum number of counts required for a cell to pass filtering.
  • qc_var_cell (str='n_genes_by_counts'): Feature in QC metrics that used to filter cells.
  • qc (bool=False): Call calculate_qc_metrics before filtering.

method harmony

harmony(sample_col_name, n_init=10, max_iter_harmony=20)

Use Harmony to integrate different experiments.

Note:

This modified harmony function can easily scale up to 15M cells with 50 pcs on GPU (A100 80G). Result after harmony is stored into adata.obsm['X_pca_harmony'].

Args:

  • sample_col_name (str): Column of sample ID.
  • n_init (int=10): Number of times the k-means algorithm is run with different centroid seeds.
  • max_iter_harmony (int=20): Maximum iteration number of harmony.

method highly_variable_genes

highly_variable_genes(n_top_genes=4000, method='seurat_v3')

Annotate highly variable genes.

Note:

Only seurat_v3 is implemented. Raw count matrix is expected as input for seurat_v3. HVGs are set to True in adata.var['highly_variable'].

Args:

  • n_top_genes (int=4000): Number of highly-variable genes to keep.
  • method (str='seurat_v3'): Choose the flavor for identifying highly variable genes.

method leiden

leiden(resolution=0.5, random_state=42)

Performs Leiden clustering using rapids-singlecell.

Args:

  • resolution (float=0.5): A parameter value controlling the coarseness of the clustering. (called gamma in the modularity formula). Higher values lead to more clusters.
  • random_state (int=42): Random seed.

method neighbors

neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pac_harmony', algorithm='cagra')

Compute a neighborhood graph of observations using rapids-singlecell.

Args:

  • n_neighbors (int=20): The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation.
  • n_pcs (int=50): Use this many PCs.
  • use_rep (str='X_pca_harmony'): Use the indicated representation.
  • algorithm (str='cagra'): The query algorithm to use.

method normalize_log1p

normalize_log1p(target_sum=10000.0)

Normalize counts per cell then log1p.

Note:

If save_raw_counts or save_norm_counts is set, write adata_X to disk here automatically.

Args:

  • target_sum (int=1e4): If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.

method normalize_log1p_pca

normalize_log1p_pca(
    target_sum=10000.0,
    n_components=50,
    hvg_var='highly_variable'
)

An alternative for calling normalize_log1p and pca together.

Note:

Used when preload_on_cpu is False.


method pca

pca(n_components=50, hvg_var='highly_variable')

Principal component analysis.

Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn.

Note:

Flip the directions according to the largest values in loadings. Results will match up with scanpy perfectly. Calculated PCA matrix is stored in adata.obsm['X_pca'].

Args:

  • n_components (int=50): Number of principal components to compute.
  • hvg_var (str='highly_variable'): Use highly variable genes only.

method save

save(data_name=None)

Save adata to disk.

Note:

Save to 'output_dir/data_name.h5ad'.

Args:

  • data_name (str): If None, set as data_dir.

method savex

savex(name, data_name=None)

Save adata to disk in chunks.

Note:

Each chunk will be saved individually in a subfolder under output_dir. Save to 'output_dir/name/data_name_i.h5ad'.

Args:

  • name (str): Subfolder name.
  • data_name (str): If None, set as data_dir.

method to_CPU

to_CPU()

Move all chunks to CPU.


method to_GPU

to_GPU()

Move all chunks to GPU.


method umap

umap(random_state=42)

Embed the neighborhood graph using rapids-singlecell.

Args:

  • random_state (int=42): Random seed.

class AnnDataBatchReader

Chunked dataloader for extremely large single-cell dataset. Return a data chunk each time for further processing.

method __init__

__init__(
    data_dir,
    preload_on_cpu=True,
    preload_on_gpu=False,
    gpus=None,
    max_cell_batch=100000,
    max_gpu_memory_usage=48.0,
    return_anndata=True
)

property shape


method batch_to_CPU

batch_to_CPU()

method batch_to_GPU

batch_to_GPU()

method batchify

batchify(axis='cell')

Return a data generator if preload_on_cpu is set as True.


method clear

clear()

method get_merged_adata_with_X

get_merged_adata_with_X()

method gpu_wrapper

gpu_wrapper(generator)

method read

read(fname)

method set_cells_filter

set_cells_filter(filter, update=True)

Update cells filter and applied on data chunks if update set to True, otherwise, update filter only.


method set_genes_filter

set_genes_filter(filter, update=True)

Update genes filter and applied on data chunks if update set to True, otherwise, update filter only.

Note:

Genes filter can be set sequentially, a new filter should be always compatible with the previous filtered data.


method update_by_cells_filter

update_by_cells_filter(filter)

method update_by_genes_filter

update_by_genes_filter(filter)

This file was automatically generated via lazydocs.

About

ScaleSC: A superfast and scalable single cell RNA-seq data analysis pipeline powered by GPU. https://doi.org/10.1101/2025.01.28.635256

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages