A tool for decomposing a doublet (two single cells captured and sequenced together) into its two constituent single cells based on their transcriptomes.
The matrices in the "umi" folder include expression values for only 8,676 selected genes, while the matrices in the "FullGenes" folder retain information for all available genes.
- Disk space and memory
To generate 400k artificial doublets with around 9k genes included, you will need around 30GB of disk space and 60GB of memory.
- GPU
To train a model on 400k artificial doublets with information on 9k genes, a GPU with at least 10GB of memory (for example, an RTX 3080) is required.
In our application, training finished in half an hour on a 24GB Titan RTX.
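As a rough plausibility check on the figures above, the size of a dense doublet matrix can be estimated from its shape and element type. The sketch below is only back-of-the-envelope arithmetic: the element type actually used by `GenAD_v1.py` is not stated here, so both `float64` and `float32` are shown as assumptions, and `array_size_gb` is a hypothetical helper name.

```python
import numpy as np

def array_size_gb(n_rows: int, n_cols: int, dtype=np.float64) -> float:
    """Return the size in GB (10^9 bytes) of a dense n_rows x n_cols array."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9

n_doublets = 400_000
n_genes = 8_676  # number of selected genes in the "umi" matrices

# Assumption: the array is dense; the real dtype may differ.
size_f64 = array_size_gb(n_doublets, n_genes, np.float64)
size_f32 = array_size_gb(n_doublets, n_genes, np.float32)
print(f"float64: {size_f64:.1f} GB, float32: {size_f32:.1f} GB")
```

Under the `float64` assumption the array alone is close to 28 GB, which is in line with the disk space quoted above; intermediate copies made during generation would add to peak memory.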
Anaconda is needed to set up a suitable Python environment for running DeepDoublet:
conda env create --file environment.yml
source activate DeepDoublet
python GenAD_v1.py -h
- Example: generate 400,000 hepatocyte-LEC artificial doublets
python GenAD_v1.py --s1 umi/hep_umi.csv --s2 umi/LEC_umi.csv --sampleSize 400000 --outdir ADoub_new
The artificial doublets are organized and stored as follows:
- Folder: `ADoub_new`
  - Files:
    - `ADoub.npy`: Contains all artificial doublets as a NumPy array.
    - `ADoub_meta.csv`: Stores the composition metadata for each artificial doublet.
- Folder: `umi`
  - Files:
    - `LEC_umi.csv`: Contains UMI (Unique Molecular Identifier) counts for LEC cells.
    - `hep_umi.csv`: Contains UMI counts for HEP cells.
The `ADoub_meta.csv` file contains the composition information for each artificial doublet. Each row corresponds to an artificial doublet stored in `ADoub.npy`.
| Column | Description |
|---|---|
| `s1` | Index of the first single cell |
| `MF_s1` | Proportion of the first single cell (`s1`) |
| `s2` | Index of the second single cell |
| `MF_s2` | Proportion of the second single cell (`s2`) |
Single-cell transcriptome information is stored in the `umi` folder. The folder contains two CSV files:
- `LEC_umi.csv`
  - Row Names: Gene names
  - Column Names: Cell names
  - Description: UMI counts for LEC cells. Only selected genes are included.
- `hep_umi.csv`
  - Row Names: Gene names
  - Column Names: Cell names
  - Description: UMI counts for HEP cells. Only selected genes are included.
Notes:
- Ensure that the gene names across `LEC_umi.csv` and `hep_umi.csv` are consistent if they are to be used together.
- The selected genes in these CSV files should correspond to the genes used in downstream analyses.
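One way to check the first note is to compare the row indices of the two matrices after loading them with pandas. The sketch below uses small in-memory toy tables rather than the real files; `check_gene_consistency` and the gene and cell names are hypothetical and only illustrate the idea.

```python
import pandas as pd

def check_gene_consistency(a: pd.DataFrame, b: pd.DataFrame) -> bool:
    """Return True if both matrices list the same genes in the same order."""
    return a.index.equals(b.index)

# Toy stand-ins for LEC_umi.csv and hep_umi.csv (genes as rows, cells as columns).
genes = ["GeneA", "GeneB", "GeneC"]
lec = pd.DataFrame({"LEC_cell1": [1, 0, 3], "LEC_cell2": [0, 2, 1]}, index=genes)
hep = pd.DataFrame({"hep_cell1": [5, 1, 0]}, index=genes)

# Same genes, but in a different row order.
hep_shuffled = hep.loc[["GeneC", "GeneA", "GeneB"]]

same = check_gene_consistency(lec, hep)
same_shuffled = check_gene_consistency(lec, hep_shuffled)

# Reordering the rows to match the other matrix restores consistency.
hep_aligned = hep_shuffled.reindex(lec.index)
```

Note that `reindex` fills genes missing from the second matrix with `NaN`, so a follow-up check for missing values is advisable when the gene sets might differ.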
The transcriptome of each artificial doublet is calculated using the following formula:
`Transcriptome_ADoub = Transcriptome_s1 × MF_s1 + Transcriptome_s2 × MF_s2`

Where:
- `Transcriptome_s1`: Transcriptome data of the first single cell (`s1`)
- `MF_s1`: Proportion of the first single cell in the doublet
- `Transcriptome_s2`: Transcriptome data of the second single cell (`s2`)
- `MF_s2`: Proportion of the second single cell in the doublet

For example, if `s1`, `MF_s1`, `s2`, and `MF_s2` are 596, 0.7, 250, and 0.3, respectively, the corresponding artificial doublet is

`Transcriptome_s1_596 × 0.7 + Transcriptome_s2_250 × 0.3`
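
The mixing formula can be illustrated with a small NumPy sketch. The count vectors below are made-up toy values over four genes, not data from the repository, and `mix_doublet` is a hypothetical helper name; only the weighted sum itself comes from the formula above.

```python
import numpy as np

def mix_doublet(t_s1, t_s2, mf_s1: float, mf_s2: float) -> np.ndarray:
    """Weighted sum of two single-cell transcriptomes (gene-wise)."""
    t_s1 = np.asarray(t_s1, dtype=float)
    t_s2 = np.asarray(t_s2, dtype=float)
    return t_s1 * mf_s1 + t_s2 * mf_s2

# Toy UMI counts for two cells over the same four genes (assumed values).
t_s1 = np.array([10, 0, 4, 2])
t_s2 = np.array([0, 20, 4, 6])

# Proportions taken from the example: MF_s1 = 0.7, MF_s2 = 0.3.
doublet = mix_doublet(t_s1, t_s2, 0.7, 0.3)
print(doublet)  # approximately [7.0, 6.0, 4.0, 3.2]
```

Both vectors must cover the same genes in the same order for the gene-wise sum to be meaningful.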
import numpy as np
import pandas as pd
# Paths to the metadata and doublets
meta_path = 'ADoub_new/ADoub_meta.csv'
doublets_path = 'ADoub_new/ADoub.npy'
lec_umi_path = 'umi/LEC_umi.csv'
hep_umi_path = 'umi/hep_umi.csv'
# Load metadata
metadata = pd.read_csv(meta_path)
# Load single cell transcriptomes
# Load LEC and HEP UMI counts
lec_umi = pd.read_csv(lec_umi_path, index_col=0)
hep_umi = pd.read_csv(hep_umi_path, index_col=0)
# Load artificial doublets
ADoub = np.load(doublets_path)

The two deep learning models for decomposing hepatocyte-LEC doublets into hepatocytes and LECs, along with 400,000 artificial doublets for additional testing, have been uploaded here: https://figshare.com/articles/dataset/DeepDoublet_identifies_neighboring_cell_dependent_gene_expression/27217101
- Train Decomposition Model: `Decomposition_Training.ipynb`
- Predict with Decomposition Model: `Decomposition_Predict.ipynb`
- Differential Expression Analysis: `DEA.ipynb`
- Program: `LR.py`
- LOG file: `LR.log`