Fixing query logic & order of operations for automated AnnData DE jobs (SCP-5946) #2213

bistline · 2025-03-06T20:32:30Z

BACKGROUND & CHANGES

This update addresses two corner cases when processing differential expression jobs from AnnData files:

The query to find the associated matrix file (i.e. the AnnData file) can fail if the ExpressionFileInfo document is not properly initialized. There was only one recorded instance of this, so it is unclear if there is a related bug or this is simply user error, but it causes DifferentialExpressionResult objects to not properly save, even though the output files successfully generated. Jobs will continually retry on this file every time a qualifying ingest event checks if the study is eligible for DE. Now, instead of directly querying on that nested document, the convenience method study_file.is_raw_counts_file? is used as this is more robust across different file types. Also, the filename query was updated to only reference upload_file_name as this value is always set, regardless of how the file was uploaded.
If raw counts data is extracted after the initial file upload by the user or an admin specifying via the upload wizard, automated downstream DE jobs do not properly launch because the check for eligibility happens before the file is updated to indicate it had raw counts data extracted. Now, the order of these operations is switched and DE eligibility is checked last.

MANUAL TESTING

To test the fix for the file query issue:

Open a Rails console session and find an AnnData file that has raw counts data:

study_file = StudyFile.where(file_type: 'AnnData').detect(&:is_raw_counts_file?)

Make a copy of the ExpressionFileInfo document and then remove it from the source file

exp_info = study_file.expression_file_info
study_file.update(expression_file_info: nil)
=> true

Confirm the file still registers as a raw counts file even without the nested document:

study_file.reload
study_file.expression_file_info
=> nil
study_file.is_raw_counts_file?
=> true

Restore the file to its original state:

study_file.update(expression_file_info_attributes: exp_info.attributes)
=> true
study_file.expression_file_info
=> #<ExpressionFileInfo _id: 67bddff294ec8f57097e0cd5, ...

To test the DE job launch issue:

Find a DE result where the source file was AnnData:

result = DifferentialExpressionResult.where(is_author_de: false).detect { |r| r.matrix_file && r.matrix_file.is_viz_anndata? }

Remove the extract raw count cell names and delete the DE reults:

study = result.study
filename = 'h5ad_frag.matrix.raw.mtx.gz'
study_file = result.matrix_file
query = { name: "#{filename} Cells", cluster_name: filename, array_type: 'cells', linear_data_type: 'Study', linear_data_id: study.id, study_file_id: study_file.id, cluster_group_id: nil, subsample_annotation: nil, subsample_threshold: nil }
DataArray.where(query).delete_all
=> # will return info on documents processed, likely only 1
result.destroy
=> true

Load the same study in the upload wizard, and click Save on the expression tab form. This should launch a job to re-extract the raw counts info in development.log:

Request object sent to Google Batch API, excluding 'environment' parameters:
---
:task:
  :commands:
  - python
  - ingest_pipeline.py
  - "--study-id"
  - 67bddf8394ec8f57097e0cc5
  - "--study-file-id"
  - 67bddfd89b706436d767c635
  - "--user-metrics-uuid"
  - e20680f6-bcfb-4d37-b66a-be2c9c7c5ebd
  - ingest_anndata
  - "--ingest-anndata"
  - "--anndata-file"
  - gs://fc-cb8ee4bd-e427-4ef5-803d-b3a0e3a6fc8c/compliant_liver.h5ad
  - "--extract"
  - '["raw_counts"]'

Once this job completes, confirm that you see automated DE jobs launching in development.log (there should be at least one, but it will depend on eligible annotations/clusters in this study):

IngestJob poller: projects/broad-singlecellportal-bistlin/locations/us-central1/jobs/job-2502cbed-cbbb-4b3d-b2d2-47544d1e9fb2 is done!
IngestJob poller: projects/broad-singlecellportal-bistlin/locations/us-central1/jobs/job-2502cbed-cbbb-4b3d-b2d2-47544d1e9fb2 status: SUCCEEDED
SCP168 is eligible for differential expression, launching available jobs
SCP168 has annotations eligible for DE; validating inputs
Checking DE job for SCP168: umap (author_cell_type--group--study)
==> DE job for SCP168: umap (author_cell_type--group--study) successfully launched
Checking DE job for SCP168: umap (cell_type__ontology_label--group--study)
==> DE job for SCP168: umap (cell_type__ontology_label--group--study) successfully launched
SCP168 yielded 2 differential expression jobs; 0 skipped

codecov · 2025-03-06T21:07:28Z

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.30%. Comparing base (12f666c) to head (47c8816).
Report is 2 commits behind head on development.

Files with missing lines	Patch %	Lines
app/models/ingest_job.rb	66.66%	1 Missing ⚠️

Additional details and impacted files

@@               Coverage Diff               @@
##           development    #2213      +/-   ##
===============================================
- Coverage        70.34%   70.30%   -0.05%     
===============================================
  Files              332      332              
  Lines            28492    28492              
  Branches          2518     2518              
===============================================
- Hits             20042    20030      -12     
- Misses            8303     8315      +12     
  Partials           147      147

Files with missing lines	Coverage Δ
app/models/ingest_job.rb	`55.74% <66.66%> (ø)`

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

jlchang

Functional review: manual tests worked as described

Fixing query logic & order of operations for AnnData DE jobs

47c8816

bistline requested review from eweitz and jlchang March 6, 2025 20:32

eweitz approved these changes Mar 6, 2025

View reviewed changes

jlchang approved these changes Mar 10, 2025

View reviewed changes

bistline added the build failure: false positive Build error confirmed as false positive. E.g. upstream service has a problem. label Mar 10, 2025

bistline merged commit b7948ce into development Mar 10, 2025
4 of 5 checks passed

github-actions bot deleted the jb-de-result-matrix-file-query branch March 10, 2025 18:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing query logic & order of operations for automated AnnData DE jobs (SCP-5946) #2213

Fixing query logic & order of operations for automated AnnData DE jobs (SCP-5946) #2213

bistline commented Mar 6, 2025 •

edited

Loading

codecov bot commented Mar 6, 2025 •

edited

Loading

jlchang left a comment

Fixing query logic & order of operations for automated AnnData DE jobs (SCP-5946) #2213

Fixing query logic & order of operations for automated AnnData DE jobs (SCP-5946) #2213

Conversation

bistline commented Mar 6, 2025 • edited Loading

BACKGROUND & CHANGES

MANUAL TESTING

codecov bot commented Mar 6, 2025 • edited Loading

Codecov Report

jlchang left a comment

Choose a reason for hiding this comment

bistline commented Mar 6, 2025 •

edited

Loading

codecov bot commented Mar 6, 2025 •

edited

Loading