A GPU-accelerated tool for large-scale scRNA-seq pipelines.
Highlights • Why ScaleSC • Installation • Tutorial • API Reference
- Fast scRNA-seq pipeline including QC, normalization, batch-effect removal, and dimension reduction, with a syntax similar to `scanpy` and `rapids-singlecell`.
- Scales to datasets with more than 10M cells on a single GPU (A100 80G).
- Chunks the data to avoid the `int32` limitation in `cupyx.scipy.sparse` (used by `rapids-singlecell`) that blocks computation even for moderate-size datasets (~1.3M cells) without multi-GPU support; see the sketch after the comparison table below.
- Reconciles the output at each step with `scanpy` to reproduce the same results as on the CPU end.
- Improves on `harmonypy` so that datasets with more than 10M cells and more than 1000 samples can be run on a single GPU.
- Speeds up and optimizes the `NSForest` algorithm on GPU for better marker gene identification.
- Merges clusters according to the gene expression of markers detected by `NSForest`.
| | scanpy | scalesc | rapids-singlecell |
| --- | --- | --- | --- |
| GPU Support | ❌ | ✅ | ✅ |
| Avoids `int32` Issue in Sparse | ✅ | ✅ | ❌ |
| Upper Limit of #cells | 5M | ~20M | ~1M |
| Upper Limit of #samples | <100 | >1000 | <100 |
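For intuition on the `int32` limitation called out above: `cupyx.scipy.sparse` indexes nonzeros with 32-bit integers, so a sparse matrix breaks once its total nonzero count passes 2³¹ − 1. A back-of-the-envelope check in Python (the ~2,000 nonzero genes per cell is an assumed figure for illustration):

```python
# CSR matrices in cupyx.scipy.sparse store nonzero indices as int32 by
# default, so indexing overflows once nnz exceeds 2**31 - 1.
cells = 1_300_000          # the "moderate-size dataset (~1.3M cells)" above
nnz_per_cell = 2_000       # assumed average nonzero genes per cell
total_nnz = cells * nnz_per_cell
print(total_nnz, total_nnz > 2**31 - 1)   # 2.6e9 nonzeros -> True: overflow
```

Chunking the matrix keeps each chunk's nonzero count below this bound, which is why ScaleSC can stay on a single GPU.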
Requirements:
- RAPIDS from NVIDIA.
- rapids-singlecell, a GPU-accelerated alternative to scanpy.
- Conda, version >= 22.11 is strongly encouraged, because conda-libmamba-solver is then the default solver, which significantly speeds up dependency resolution.
- pip, a Python package installer.
Environment Setup:
- Install RAPIDS through Conda:

  ```bash
  conda create -n scalesc -c rapidsai -c conda-forge -c nvidia \
      rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8'
  ```

  Users can also tailor the installation to their systems using the online selector. We highly recommend installing **RAPIDS** >= 24.12, which fixes a bug in the Leiden algorithm that produced too many clusters.

- Activate the conda env:

  ```bash
  conda activate scalesc
  ```

- Install rapids-singlecell using pip:

  ```bash
  pip install rapids-singlecell
  ```

- Install scaleSC:

  ```bash
  # pull scaleSC from github
  git clone https://github.com/interactivereport/scaleSC.git
  # enter the folder and install scaleSC
  cd scaleSC
  pip install .
  ```

- Check the env:

  ```bash
  python -c "import scalesc; print(scalesc.__version__)"    # == 0.1.0
  python -c "import cupy; print(cupy.__version__)"          # >= 13.3.0
  python -c "import cuml; print(cuml.__version__)"          # >= 24.10
  python -c "import cupy; print(cupy.cuda.is_available())"  # True
  python -c "import xgboost; print(xgboost.__version__)"    # >= 2.1.1, optional, for marker annotation
  ```
- See this tutorial for details.
Please cite ScaleSC, as well as Scanpy, Rapids-singlecell, NSForest, and AnnData, according to their respective instructions.
- 2/26/2025:
  - Added a parameter `threshold` to the function `adata_cluster_merge` to support cluster merging at different scales according to the user's specification. `threshold` ranges between 0 and 1; set to 0 by default.
  - Updated a few more examples of cluster merging in the tutorial.
  - Future work: add support for loading from large `.h5ad` files.
ScaleSC integrated pipeline in a scanpy-like style.
It automatically loads the dataset in chunks (see `scalesc.util.AnnDataBatchReader` for details), and all methods in this class operate on the chunked data.
Args:
- `data_dir` (`str`): Data folder of the dataset.
- `max_cell_batch` (`int`): Maximum number of cells in a single batch. Default: 100000.
- `preload_on_cpu` (`bool`): Whether to load the entire chunked data on CPU. Default: `True`.
- `preload_on_gpu` (`bool`): Whether to load the entire chunked data on GPU; `preload_on_cpu` is overwritten to `True` when this is set to `True`. Default: `True`.
- `save_raw_counts` (`bool`): Whether to save `adata_X` to disk after QC filtering. Default: `False`.
- `save_norm_counts` (`bool`): Whether to save `adata_X` to disk after normalization. Default: `False`.
- `save_after_each_step` (`bool`): Whether to save `adata` (without `.X`) to disk after each step. Default: `False`.
- `output_dir` (`str`): Output folder. Default: `'./results'`.
- `gpus` (`list`): List of GPU IDs; `[0]` is used if this is `None`. Default: `None`.
```python
__init__(
    data_dir,
    max_cell_batch=100000.0,
    preload_on_cpu=True,
    preload_on_gpu=True,
    save_raw_counts=False,
    save_norm_counts=False,
    save_after_each_step=False,
    output_dir='results',
    gpus=None
)
```
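A minimal end-to-end sketch of how the pieces fit together, assuming the pipeline class is exposed as `scalesc.ScaleSC`; the data path, sample column, and thresholds are hypothetical, while method names and defaults follow the API documented below:

```python
import scalesc as ssc

scsc = ssc.ScaleSC(data_dir='./data/my_dataset', output_dir='./results')
scsc.calculate_qc_metrics()                          # compute QC metrics first
scsc.filter_genes_and_cells(min_counts_per_gene=3,
                            min_counts_per_cell=200)
scsc.highly_variable_genes(n_top_genes=4000)         # seurat_v3 expects raw counts
scsc.normalize_log1p(target_sum=1e4)
scsc.pca(n_components=50)
scsc.harmony(sample_col_name='sample_id')            # batch-effect removal
scsc.neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony')
scsc.leiden(resolution=0.5)
scsc.umap()
scsc.to_CPU()                                        # merge chunks off-GPU
scsc.save(data_name='my_dataset_processed')
```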
`AnnData`: An `AnnData` object used to store all intermediate results, without the count matrix.
Note: This is always kept on CPU.
`AnnData`: An `AnnData` object used to store all intermediate results, including the count matrix. Internally, all chunks are merged on CPU to avoid high GPU memory consumption; make sure to invoke `to_CPU()` before accessing this object.
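A minimal sketch of that access pattern, assuming the property is exposed as `adata_X` (the name follows the arguments above) and reusing the hypothetical `scsc` object from the quick-start:

```python
scsc.to_CPU()               # move all chunks back to CPU first, per the note
adata_full = scsc.adata_X   # merged AnnData including the count matrix
print(adata_full.shape)
```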
calculate_qc_metrics()
Calculate quality control metrics.
clear()
Clean up the memory.
filter_cells(min_count=0, max_count=None, qc_var='n_genes_by_counts', qc=False)
Filter cells based on a QC metric.
Args:
- `min_count` (`int`): Minimum number of counts required for a cell to pass filtering.
- `max_count` (`int`): Maximum number of counts allowed for a cell to pass filtering.
- `qc_var` (`str`='n_genes_by_counts'): Feature in the QC metrics used to filter cells.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
filter_genes(min_count=0, max_count=None, qc_var='n_cells_by_counts', qc=False)
Filter genes based on a QC metric.
Args:
- `min_count` (`int`): Minimum number of counts required for a gene to pass filtering.
- `max_count` (`int`): Maximum number of counts allowed for a gene to pass filtering.
- `qc_var` (`str`='n_cells_by_counts'): Feature in the QC metrics used to filter genes.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
```python
filter_genes_and_cells(
    min_counts_per_gene=0,
    min_counts_per_cell=0,
    max_counts_per_gene=None,
    max_counts_per_cell=None,
    qc_var_gene='n_cells_by_counts',
    qc_var_cell='n_genes_by_counts',
    qc=False
)
```
Filter genes and cells based on QC metrics.
Note:
This is an efficient way to perform a regular filtering on genes and cells without repeatedly iterating over chunks.
Args:
- `min_counts_per_gene` (`int`): Minimum number of counts required for a gene to pass filtering.
- `max_counts_per_gene` (`int`): Maximum number of counts allowed for a gene to pass filtering.
- `qc_var_gene` (`str`='n_cells_by_counts'): Feature in the QC metrics used to filter genes.
- `min_counts_per_cell` (`int`): Minimum number of counts required for a cell to pass filtering.
- `max_counts_per_cell` (`int`): Maximum number of counts allowed for a cell to pass filtering.
- `qc_var_cell` (`str`='n_genes_by_counts'): Feature in the QC metrics used to filter cells.
- `qc` (`bool`=`False`): Call `calculate_qc_metrics` before filtering.
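A hedged usage sketch (thresholds are illustrative; `scsc` is the hypothetical pipeline object from the quick-start above):

```python
# One pass over the chunks instead of separate gene and cell passes.
scsc.filter_genes_and_cells(
    min_counts_per_gene=3,    # gene must be detected in >= 3 cells (n_cells_by_counts)
    min_counts_per_cell=200,  # cell must express >= 200 genes (n_genes_by_counts)
    qc=True,                  # compute QC metrics in the same pass
)
```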
harmony(sample_col_name, n_init=10, max_iter_harmony=20)
Use Harmony to integrate different experiments.
Note:
This modified harmony function easily scales up to 15M cells with 50 PCs on a single GPU (A100 80G). The result after harmony is stored in `adata.obsm['X_pca_harmony']`.
Args:
- `sample_col_name` (`str`): Column of sample ID.
- `n_init` (`int`=10): Number of times the k-means algorithm is run with different centroid seeds.
- `max_iter_harmony` (`int`=20): Maximum number of harmony iterations.
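Usage sketch (the `sample_id` column and the `adata` attribute name are assumptions carried over from the quick-start):

```python
scsc.harmony(sample_col_name='sample_id', n_init=10, max_iter_harmony=20)
emb = scsc.adata.obsm['X_pca_harmony']   # corrected embedding, per the note above
```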
highly_variable_genes(n_top_genes=4000, method='seurat_v3')
Annotate highly variable genes.
Note:
Only `seurat_v3` is implemented. A raw count matrix is expected as input for `seurat_v3`. HVGs are set to `True` in `adata.var['highly_variable']`.
Args:
- `n_top_genes` (`int`=4000): Number of highly variable genes to keep.
- `method` (`str`='seurat_v3'): Flavor for identifying highly variable genes.
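For example, flag HVGs on the raw counts and then restrict PCA to them (a sketch; the flag column matches the `hvg_var` parameter of `pca` below):

```python
scsc.highly_variable_genes(n_top_genes=4000, method='seurat_v3')  # on raw counts
scsc.normalize_log1p()                                # normalize after HVG selection
scsc.pca(n_components=50, hvg_var='highly_variable')  # PCA on the flagged genes only
```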
leiden(resolution=0.5, random_state=42)
Performs Leiden clustering using `rapids-singlecell`.
Args:
- `resolution` (`float`=0.5): A parameter controlling the coarseness of the clustering (called gamma in the modularity formula); higher values lead to more clusters.
- `random_state` (`int`=42): Random seed.
neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')
Compute a neighborhood graph of observations using `rapids-singlecell`.
Args:
- `n_neighbors` (`int`=20): Size of the local neighborhood (in terms of number of neighboring data points) used for manifold approximation.
- `n_pcs` (`int`=50): Use this many PCs.
- `use_rep` (`str`='X_pca_harmony'): Use the indicated representation.
- `algorithm` (`str`='cagra'): The query algorithm to use.
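The graph feeds directly into `leiden` and `umap` (a sketch; the `'leiden'` key in `.obs` follows the rapids-singlecell convention and is an assumption here):

```python
scsc.neighbors(n_neighbors=20, n_pcs=50, use_rep='X_pca_harmony', algorithm='cagra')
scsc.leiden(resolution=0.5, random_state=42)
scsc.umap(random_state=42)
clusters = scsc.adata.obs['leiden']   # assumed result key
```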
normalize_log1p(target_sum=10000.0)
Normalize counts per cell then log1p.
Note:
If `save_raw_counts` or `save_norm_counts` is set, `adata_X` is automatically written to disk at this step.
Args:
- `target_sum` (`int`=1e4): If `None`, after normalization each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
```python
normalize_log1p_pca(
    target_sum=10000.0,
    n_components=50,
    hvg_var='highly_variable'
)
```
An alternative to calling `normalize_log1p` and `pca` together.
Note:
Used when `preload_on_cpu` is `False`.
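A sketch of that streaming setup, with both preload flags off so `preload_on_gpu` does not override `preload_on_cpu` (paths and thresholds are hypothetical):

```python
scsc = ssc.ScaleSC(data_dir='./data/my_dataset',
                   preload_on_cpu=False, preload_on_gpu=False)
scsc.calculate_qc_metrics()
scsc.filter_genes_and_cells(min_counts_per_gene=3, min_counts_per_cell=200)
scsc.highly_variable_genes(n_top_genes=4000)
scsc.normalize_log1p_pca(target_sum=1e4, n_components=50)  # fused single pass
```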
pca(n_components=50, hvg_var='highly_variable')
Principal component analysis.
Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn.
Note:
Directions are flipped according to the largest values in the loadings, so results match up with scanpy perfectly. The calculated PCA matrix is stored in `adata.obsm['X_pca']`.
Args:
- `n_components` (`int`=50): Number of principal components to compute.
- `hvg_var` (`str`='highly_variable'): Use highly variable genes only.
save(data_name=None)
Save `adata` to disk.
Note:
Saves to '`output_dir`/`data_name`.h5ad'.
Args:
- `data_name` (`str`): If `None`, set to `data_dir`.
savex(name, data_name=None)
Save `adata` to disk in chunks.
Note:
Each chunk is saved individually in a subfolder under `output_dir`, as '`output_dir`/`name`/`data_name`_`i`.h5ad'.
Args:
- `name` (`str`): Subfolder name.
- `data_name` (`str`): If `None`, set to `data_dir`.
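For instance (hypothetical names), the call below would write `./results/checkpoint/my_dataset_0.h5ad`, `./results/checkpoint/my_dataset_1.h5ad`, and so on:

```python
scsc.savex(name='checkpoint', data_name='my_dataset')  # one file per chunk
```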
to_CPU()
Move all chunks to CPU.
to_GPU()
Move all chunks to GPU.
umap(random_state=42)
Embed the neighborhood graph using `rapids-singlecell`.
Args:
- `random_state` (`int`=42): Random seed.
Chunked dataloader for extremely large single-cell datasets. Returns one data chunk at a time for further processing.
```python
__init__(
    data_dir,
    preload_on_cpu=True,
    preload_on_gpu=False,
    gpus=None,
    max_cell_batch=100000,
    max_gpu_memory_usage=48.0,
    return_anndata=True
)
```
batch_to_CPU()
batch_to_GPU()
batchify(axis='cell')
Return a data generator if `preload_on_cpu` is set to `True`.
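A usage sketch of the reader, assuming each yielded chunk behaves like an `AnnData` when `return_anndata=True` (the data path is hypothetical):

```python
from scalesc.util import AnnDataBatchReader   # module path per the pipeline doc above

reader = AnnDataBatchReader(data_dir='./data/my_dataset',
                            preload_on_cpu=True, max_cell_batch=100_000)
for chunk in reader.batchify(axis='cell'):    # iterate over cell-wise chunks
    print(chunk.shape)                        # each chunk holds at most 100k cells
```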
clear()
get_merged_adata_with_X()
gpu_wrapper(generator)
read(fname)
set_cells_filter(filter, update=True)
Update the cells filter and apply it to the data chunks if `update` is set to `True`; otherwise, update the filter only.
set_genes_filter(filter, update=True)
Update the genes filter and apply it to the data chunks if `update` is set to `True`; otherwise, update the filter only.
Note:
Genes filters can be set sequentially; a new filter must always be compatible with the previously filtered data.
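A hedged sketch of sequential gene filters, assuming filters are boolean masks over the *current* gene axis (the mask format is an assumption; `reader` is the object from the example above):

```python
import numpy as np

n_genes = 30_000                                 # hypothetical gene count
first = np.random.rand(n_genes) > 0.1            # keep ~90% of genes
reader.set_genes_filter(first)
# A follow-up filter must be sized to the already-filtered data:
second = np.random.rand(int(first.sum())) > 0.5
reader.set_genes_filter(second)
```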
update_by_cells_filter(filter)
update_by_genes_filter(filter)
This file was automatically generated via lazydocs.