Enable raw_location input for DE processing (SCP-5950) #386
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## development #386 +/- ##
===============================================
- Coverage 76.06% 76.03% -0.04%
===============================================
Files 30 30
Lines 4491 4502 +11
===============================================
+ Hits 3416 3423 +7
- Misses 1075 1079 +4
Code looks good!
Looks good!
Question: do we want to specify the location with the period/dot (.) or just interpolate that in at runtime? In other words, have the user say raw instead of .raw? I'm fine either way, more thinking about collecting that information from the user and having it in a form/data model.
Because the keys for layers can be any string, it's possible that someone might decide to call their layer "raw" but they'd be unlikely to call it ".raw". I'm agnostic about how we present it in the upload wizard UI, I just wanted to be able to easily distinguish between the .raw slot and adata.layers['raw'] in the CLI for ingest.
Ah good call - I sorta glossed over the difference between
Background
From the AnnData DE backfill and conversations with study owners, we know that raw count data is not always stored in the adata.raw slot. (The scanpy tutorial has recommended using adata.layers['counts'] since 2022.) To enable DE processing on raw count data that is not in the adata.raw location, additional information needs to be passed to the ingest pipeline.
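The distinction raised in the review thread (".raw" for the adata.raw slot, any other string for an adata.layers key) can be sketched roughly as follows. This is an illustration only, not the code in this PR: `select_raw_counts` and `RAW_SLOT` are hypothetical names, and a stand-in object is used in place of a real AnnData object so the sketch runs without extra dependencies.

```python
from types import SimpleNamespace

RAW_SLOT = ".raw"  # sentinel for the adata.raw slot (hypothetical constant name)


def select_raw_counts(adata, raw_location):
    """Return the raw count matrix named by raw_location (hypothetical helper)."""
    if raw_location == RAW_SLOT:
        if adata.raw is None:
            raise ValueError("raw_location is '.raw' but adata.raw is empty")
        return adata.raw.X
    # Any other value is treated as a key into adata.layers
    if raw_location not in adata.layers:
        raise KeyError(f"no layer named {raw_location!r} in adata.layers")
    return adata.layers[raw_location]


# Stand-in with the same attribute names as an AnnData object
fake_adata = SimpleNamespace(
    raw=SimpleNamespace(X="matrix from the .raw slot"),
    layers={"raw": "matrix from layers['raw']", "counts": "matrix from layers['counts']"},
)

print(select_raw_counts(fake_adata, ".raw"))    # the adata.raw slot
print(select_raw_counts(fake_adata, "raw"))     # a layer that happens to be named "raw"
print(select_raw_counts(fake_adata, "counts"))  # the layer the scanpy tutorial recommends
```

Keeping the leading dot in the sentinel means a layer literally named "raw" stays addressable, which is the ambiguity the reply in the thread was guarding against.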
This PR represents a breaking change for DE because DE jobs will now expect a --raw-location parameter, where .raw is used to indicate the adata.raw slot and all other values are assumed to be an adata.layers key value.
Manual testing
Download the following files:
HTAPP_compliant_layers_counts.h5ad: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP_compliant_layers_counts.h5ad
(you may already have the following cluster and metadata files from PR#372 or PR#374)
HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz (Note: the file will lose the .gz suffix but will still need to be decompressed.)
HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz: gs://fc-2f8ef4c0-b7eb-44b1-96fe-a07f0ea9a982/test_Data/differential_expression/HTAPP-330-SMP-1082/HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz (Note: the file will lose the .gz suffix but will still need to be decompressed.)
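Since the notes above say the downloaded files lose their .gz suffix while remaining compressed, one way to check and decompress them is sketched below using only the standard library. The helper names `is_gzipped` and `decompress_if_gzipped` are hypothetical and not part of the ingest pipeline. The check relies on the two magic bytes that begin every gzip stream.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def decompress_if_gzipped(src, dest):
    """Write a decompressed copy of src to dest; plain copy if src is not gzipped."""
    if is_gzipped(src):
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    else:
        shutil.copyfile(src, dest)
    return dest
```

Pass a `dest` path that differs from `src`, since the sketch writes a new file rather than decompressing in place.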
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location 'counts' --annotation-name leiden --de-type pairwise --group1 "1" --group2 "2" --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/HTAPP-330-SMP-1082_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/HTAPP-330-SMP-1082_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/HTAPP_compliant_layers_counts.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression
Compared to HTAPP-330-SMP-1082_compliant.h5ad (which had raw counts in .raw), HTAPP_compliant_layers_counts.h5ad had the top DE gene, FTH1, deleted from the dataset.
Confirm that the job generates "umap--leiden--1--2--study--wilcoxon.tsv" and the top DE gene is MT-CO3 and not FTH1.
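The final check can also be scripted. The sketch below is a hedged example, not part of the PR: it assumes the output TSV has a header row, that rows are already ordered with the top DE gene first, and that gene names live in a column called "genes". All three are assumptions about the file layout, so adjust `gene_column` to match the actual output. `top_de_gene` is a hypothetical helper name.

```python
import csv


def top_de_gene(tsv_path, gene_column="genes"):
    """Return the gene name in the first data row of a DE results TSV.

    Assumes (not confirmed by the PR) a header row, rows sorted with the
    top DE gene first, and a gene-name column called `gene_column`.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        first_row = next(reader, None)
    if first_row is None:
        raise ValueError(f"{tsv_path} has no data rows")
    if gene_column not in first_row:
        raise KeyError(f"column {gene_column!r} not found in {tsv_path}")
    return first_row[gene_column]
```

For the manual test above, this would be called on "umap--leiden--1--2--study--wilcoxon.tsv" and the result compared against MT-CO3.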