Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing query logic & order of operations for automated AnnData DE jobs (SCP-5946) #2213

Merged
merged 1 commit into from
Mar 10, 2025

Conversation

bistline
Copy link
Contributor

@bistline bistline commented Mar 6, 2025

BACKGROUND & CHANGES

This update addresses two corner cases when processing differential expression jobs from AnnData files:

  1. The query to find the associated matrix file (i.e. the AnnData file) can fail if the ExpressionFileInfo document is not properly initialized. There was only one recorded instance of this, so it is unclear if there is a related bug or this is simply user error, but it causes DifferentialExpressionResult objects to not properly save, even though the output files successfully generated. Jobs will continually retry on this file every time a qualifying ingest event checks if the study is eligible for DE. Now, instead of directly querying on that nested document, the convenience method study_file.is_raw_counts_file? is used as this is more robust across different file types. Also, the filename query was updated to only reference upload_file_name as this value is always set, regardless of how the file was uploaded.

  2. If raw counts data is extracted after the initial file upload by the user or an admin specifying via the upload wizard, automated downstream DE jobs do not properly launch because the check for eligibility happens before the file is updated to indicate it had raw counts data extracted. Now, the order of these operations is switched and DE eligibility is checked last.

MANUAL TESTING

To test the fix for the file query issue:

  1. Open a Rails console session and find an AnnData file that has raw counts data:
study_file = StudyFile.where(file_type: 'AnnData').detect(&:is_raw_counts_file?)
  1. Make a copy of the ExpressionFileInfo document and then remove it from the source file
exp_info = study_file.expression_file_info
study_file.update(expression_file_info: nil)
=> true                                               
  1. Confirm the file still registers as a raw counts file even without the nested document:
study_file.reload
study_file.expression_file_info
=> nil
study_file.is_raw_counts_file?
=> true
  1. Restore the file to its original state:
study_file.update(expression_file_info_attributes: exp_info.attributes)
=> true
study_file.expression_file_info
=> #<ExpressionFileInfo _id: 67bddff294ec8f57097e0cd5, ...

To test the DE job launch issue:

  1. Find a DE result where the source file was AnnData:
result = DifferentialExpressionResult.where(is_author_de: false).detect { |r| r.matrix_file && r.matrix_file.is_viz_anndata? }
  1. Remove the extract raw count cell names and delete the DE reults:
study = result.study
filename = 'h5ad_frag.matrix.raw.mtx.gz'
study_file = result.matrix_file
query = { name: "#{filename} Cells", cluster_name: filename, array_type: 'cells', linear_data_type: 'Study', linear_data_id: study.id, study_file_id: study_file.id, cluster_group_id: nil, subsample_annotation: nil, subsample_threshold: nil }
DataArray.where(query).delete_all
=> # will return info on documents processed, likely only 1
result.destroy
=> true
  1. Load the same study in the upload wizard, and click Save on the expression tab form. This should launch a job to re-extract the raw counts info in development.log:
Request object sent to Google Batch API, excluding 'environment' parameters:
---
:task:
  :commands:
  - python
  - ingest_pipeline.py
  - "--study-id"
  - 67bddf8394ec8f57097e0cc5
  - "--study-file-id"
  - 67bddfd89b706436d767c635
  - "--user-metrics-uuid"
  - e20680f6-bcfb-4d37-b66a-be2c9c7c5ebd
  - ingest_anndata
  - "--ingest-anndata"
  - "--anndata-file"
  - gs://fc-cb8ee4bd-e427-4ef5-803d-b3a0e3a6fc8c/compliant_liver.h5ad
  - "--extract"
  - '["raw_counts"]'
  1. Once this job completes, confirm that you see automated DE jobs launching in development.log (there should be at least one, but it will depend on eligible annotations/clusters in this study):
IngestJob poller: projects/broad-singlecellportal-bistlin/locations/us-central1/jobs/job-2502cbed-cbbb-4b3d-b2d2-47544d1e9fb2 is done!
IngestJob poller: projects/broad-singlecellportal-bistlin/locations/us-central1/jobs/job-2502cbed-cbbb-4b3d-b2d2-47544d1e9fb2 status: SUCCEEDED
SCP168 is eligible for differential expression, launching available jobs
SCP168 has annotations eligible for DE; validating inputs
Checking DE job for SCP168: umap (author_cell_type--group--study)
==> DE job for SCP168: umap (author_cell_type--group--study) successfully launched
Checking DE job for SCP168: umap (cell_type__ontology_label--group--study)
==> DE job for SCP168: umap (cell_type__ontology_label--group--study) successfully launched
SCP168 yielded 2 differential expression jobs; 0 skipped

@bistline bistline requested review from eweitz and jlchang March 6, 2025 20:32
Copy link

codecov bot commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.30%. Comparing base (12f666c) to head (47c8816).
Report is 2 commits behind head on development.

Files with missing lines Patch % Lines
app/models/ingest_job.rb 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #2213      +/-   ##
===============================================
- Coverage        70.34%   70.30%   -0.05%     
===============================================
  Files              332      332              
  Lines            28492    28492              
  Branches          2518     2518              
===============================================
- Hits             20042    20030      -12     
- Misses            8303     8315      +12     
  Partials           147      147              
Files with missing lines Coverage Δ
app/models/ingest_job.rb 55.74% <66.66%> (ø)

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@jlchang jlchang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional review: manual tests worked as described

@bistline bistline added the build failure: false positive Build error confirmed as false positive. E.g. upstream service has a problem. label Mar 10, 2025
@bistline bistline merged commit b7948ce into development Mar 10, 2025
4 of 5 checks passed
@github-actions github-actions bot deleted the jb-de-result-matrix-file-query branch March 10, 2025 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build failure: false positive Build error confirmed as false positive. E.g. upstream service has a problem.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants