Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add raw_location input for DE processing (SCP-5950) #387

Merged
merged 2 commits into from
Mar 20, 2025

Conversation

jlchang
Copy link
Collaborator

@jlchang jlchang commented Mar 19, 2025

This is a redo of PR#386 to fix a merge conflict.

Background

From AnnData DE backfill and conversation with study owners, we know that raw count data is not always stored in the adata.raw slot. (The scanpy tutorial has recommended using adata.layers['counts'] since 2022). To enable DE processing on raw count data that is not in the adata.raw location, additional information needs to be passed to ingest pipeline.

This PR represents a breaking change for DE because DE jobs will now expect a --raw-location parameter, where .raw is used to indicate the adata.raw slot and all other values are assumed to be an adata.layers key value.

Manual testing

Download the following files :
HTAPP_compliant_layers_counts.h5ad: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP_compliant_layers_counts.h5ad
(you may already have the following cluster and metadata files from PR#372 or PR#374)
HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz (Note: file will lose the .gz suffix BUT will still need to be decompressed :(
HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz (Note: file will lose the .gz suffix BUT will still need to be decompressed

  1. Set up for ingest testing as per usual (from ingest directory, run source ../scripts/setup-mongo-dev.sh)
  2. Run the following ingest pipeline job:
    python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location 'counts' --annotation-name leiden --de-type pairwise --group1 "1" --group2 "2" --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/HTAPP_compliant_layers_counts.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression

Compared to HTAPP-330-SMP-1082_compliant.h5ad (which had raw counts in .raw), HTAPP_compliant_layers_counts.h5ad had the top DE gene, FTH1, deleted from the dataset.

Confirm that the job generates "umap--leiden--1--2--study--wilcoxon.tsv" and the top DE gene is MT-CO3 and not FTH1:

names scores logfoldchanges pvals pvals_adj pct_nz_group feature_id
0 MT-CO3 7.004 0.7007 2.481e-12 1.651e-09 1 ENSG00000198938
1 RPS27 6.46 1.039 1.045e-10 5.08e-08 0.9778 ENSG00000177954
2 RPL12 6.257 0.9006 3.934e-10 1.657e-07 0.9111 ENSG00000197958
3 RPS3A 6.172 1.611 6.732e-10 2.745e-07 0.8556 ENSG00000145425
4 RPL28 6.121 1.408 9.279e-10 3.449e-07 0.9222 ENSG00000108107
5 RPL41 5.75 0.71 8.903e-09 3.126e-06 0.9889 ENSG00000229117

Copy link

codecov bot commented Mar 19, 2025

Codecov Report

Attention: Patch coverage is 52.94118% with 8 lines in your changes missing coverage. Please review.

Project coverage is 76.04%. Comparing base (0067ebd) to head (c48a38b).
Report is 3 commits behind head on development.

Files with missing lines Patch % Lines
ingest/de.py 50.00% 8 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #387      +/-   ##
===============================================
- Coverage        76.07%   76.04%   -0.04%     
===============================================
  Files               30       30              
  Lines             4493     4504      +11     
===============================================
+ Hits              3418     3425       +7     
- Misses            1075     1079       +4     
Files with missing lines Coverage Δ
ingest/cli_parser.py 82.83% <100.00%> (+0.12%) ⬆️
ingest/ingest_pipeline.py 58.16% <ø> (ø)
ingest/de.py 84.64% <50.00%> (-0.84%) ⬇️
🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jlchang jlchang requested review from eweitz and bistline and removed request for eweitz March 20, 2025 13:31
@jlchang jlchang merged commit 17f7483 into development Mar 20, 2025
5 checks passed
@jlchang jlchang deleted the jlc_layers_de branch March 20, 2025 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants