Skip to content

Enable raw_location input for DE processing (SCP-5950) #386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

jlchang
Copy link
Contributor

@jlchang jlchang commented Mar 14, 2025

Background

From AnnData DE backfill and conversation with study owners, we know that raw count data is not always stored in the adata.raw slot. (The scanpy tutorial has recommended using adata.layers['counts'] since 2022). To enable DE processing on raw count data that is not in the adata.raw location, additional information needs to be passed to ingest pipeline.

This PR represents a breaking change for DE because DE jobs will now expect a --raw-location parameter, where .raw is used to indicate the adata.raw slot and all other values are assumed to be an adata.layers key value.

Manual testing

Download the following files :
HTAPP_compliant_layers_counts.h5ad: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP_compliant_layers_counts.h5ad
(you may already have the following cluster and metadata files from PR#372 or PR#374)
HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz (Note: file will lose the .gz suffix BUT will still need to be decompressed :(
HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz (Note: file will lose the .gz suffix BUT will still need to be decompressed

  1. Set up for ingest testing as per usual (from ingest directory, run source ../scripts/setup-mongo-dev.sh)
  2. Run the following ingest pipeline job:
    python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location 'counts' --annotation-name leiden --de-type pairwise --group1 "1" --group2 "2" --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/HTAPP_compliant_layers_counts.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression

Compared to HTAPP-330-SMP-1082_compliant.h5ad (which had raw counts in .raw), HTAPP_compliant_layers_counts.h5ad had the top DE gene, FTH1, deleted from the dataset.

Confirm that the job generates "umap--leiden--1--2--study--wilcoxon.tsv" and the top DE gene is MT-CO3 and not FTH1:

names scores logfoldchanges pvals pvals_adj pct_nz_group feature_id
0 MT-CO3 7.004 0.7007 2.481e-12 1.651e-09 1 ENSG00000198938
1 RPS27 6.46 1.039 1.045e-10 5.08e-08 0.9778 ENSG00000177954
2 RPL12 6.257 0.9006 3.934e-10 1.657e-07 0.9111 ENSG00000197958
3 RPS3A 6.172 1.611 6.732e-10 2.745e-07 0.8556 ENSG00000145425
4 RPL28 6.121 1.408 9.279e-10 3.449e-07 0.9222 ENSG00000108107
5 RPL41 5.75 0.71 8.903e-09 3.126e-06 0.9889 ENSG00000229117

Copy link

codecov bot commented Mar 14, 2025

Codecov Report

Attention: Patch coverage is 52.94118% with 8 lines in your changes missing coverage. Please review.

Project coverage is 76.03%. Comparing base (3486c28) to head (fd73c11).
Report is 5 commits behind head on development.

Files with missing lines Patch % Lines
ingest/de.py 50.00% 8 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development     #386      +/-   ##
===============================================
- Coverage        76.06%   76.03%   -0.04%     
===============================================
  Files               30       30              
  Lines             4491     4502      +11     
===============================================
+ Hits              3416     3423       +7     
- Misses            1075     1079       +4     
Files with missing lines Coverage Δ
ingest/cli_parser.py 82.83% <100.00%> (+0.12%) ⬆️
ingest/ingest_pipeline.py 58.16% <ø> (ø)
ingest/de.py 84.64% <50.00%> (-0.84%) ⬇️
🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jlchang jlchang requested review from eweitz and bistline and removed request for eweitz March 14, 2025 20:36
@jlchang jlchang changed the title Enable raw_location input for DE processing Enable raw_location input for DE processing (SCP-5950) Mar 17, 2025
Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good!

Copy link
Contributor

@bistline bistline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Question: do we want to specify the location with the period/dot (.) or just interpolate that in at runtime? In other words, have the user say raw instead of .raw? I'm fine either way, more thinking about collecting that information from the user and having it in a form/data model.

@jlchang
Copy link
Contributor Author

jlchang commented Mar 18, 2025

Question: do we want to specify the location with the period/dot (.) or just interpolate that in at runtime? In other words, have the user say raw instead of .raw? I'm fine either way, more thinking about collecting that information from the user and having it in a form/data model.

Because the keys for layers can be any string, it's possible that someone might decide to call their layer "raw" but they'd be unlikely to call it ".raw". I'm agnostic about how we present it in the upload wizard UI, I just wanted to be able to easily distinguish between the .raw slot and adata.layers['raw'] in the CLI for ingest.

@bistline
Copy link
Contributor

Question: do we want to specify the location with the period/dot (.) or just interpolate that in at runtime? In other words, have the user say raw instead of .raw? I'm fine either way, more thinking about collecting that information from the user and having it in a form/data model.

Because the keys for layers can be any string, it's possible that someone might decide to call their layer "raw" but they'd be unlikely to call it ".raw". I'm agnostic about how we present it in the upload wizard UI, I just wanted to be able to easily distinguish between the .raw slot and adata.layers['raw'] in the CLI for ingest.

Ah good call - I sorta glossed over the difference between adata.raw and adata.layers['raw']. Let's leave it as is and we can deal with that complexity in the upload wizard.

@jlchang jlchang closed this Mar 19, 2025
@bistline bistline deleted the jlc_allow_layers_de branch April 15, 2025 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants