Enable raw_location input for DE processing (SCP-5950) #386

Closed
wants to merge 10 commits into from
160 changes: 80 additions & 80 deletions .github/workflows/minify_ontologies.yml
@@ -2,93 +2,93 @@ name: Minify ontologies

 on:
   pull_request:
-    types: [opened] # Only trigger on PR "opened" event
-  # push: # Uncomment, update branches to develop / debug
-  #   branches:
-  #     jlc_show_de_pairwise
+    types: [opened] # Only trigger on PR "opened" event
+  push: # Uncomment, update branches to develop / debug
+    branches:
+      jlc_allow_layers_de

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.head_ref }}

      - name: Copy and decompress ontologies in repo
        run: cd ingest/validation/ontologies; mkdir tmp; cp -r *.min.tsv.gz tmp/; gzip -d tmp/*.min.tsv.gz

      - name: Minify newest ontologies
        run: cd ingest/validation; python3 minify_ontologies.py; gzip -dkf ontologies/*.min.tsv.gz

      - name: Diff and commit changes
        run: |
          #!/bin/bash

          # Revert the default `set -e` in GitHub Actions, to e.g. ensure
          # "diff" doesn't throw an error when something is found
          set +e
          # set -x # Enable debugging

          cd ingest/validation/ontologies

          # Define directories
          SOURCE_DIR="."
          TMP_DIR="tmp"

          # Ensure TMP_DIR exists
          if [ ! -d "$TMP_DIR" ]; then
            echo "Temporary directory $TMP_DIR does not exist."
            exit 1
          fi

          # Flag to track if there are any changes
          CHANGES_DETECTED=false

          # Find and diff files
          for FILE in $(find "$SOURCE_DIR" -type f -name "*.min.tsv"); do
            # Get the base name of the file
            BASENAME=$(basename "$FILE")
            # Construct the path to the corresponding file in the TMP_DIR
            TMP_FILE="$TMP_DIR/$BASENAME"

            # Check if the corresponding file exists in TMP_DIR
            if [ -f "$TMP_FILE" ]; then
              # Run the diff command
              echo "Diffing $FILE and $TMP_FILE"
              diff "$FILE" "$TMP_FILE" > diff_output.txt
              # Check if diff output is not empty
              if [ -s diff_output.txt ]; then
                echo "Differences found in $BASENAME"
                cat diff_output.txt
                # Copy the updated file to the source directory (if needed)
                cp "$TMP_FILE" "$FILE"
                # Mark that changes have been detected
                CHANGES_DETECTED=true
                # Stage the modified file
                git add "$FILE".gz
              else
                echo "No differences in $BASENAME"
              fi
            else
              echo "No corresponding file found in $TMP_DIR for $BASENAME"
            fi
          done

          if [ "$CHANGES_DETECTED" = true ]; then
            # Update version to signal downstream caches should update
            echo "$(date +%s) # validation cache key" > version.txt
            git add version.txt

            # Configure Git
            git config --global user.name "github-actions"
            git config --global user.email "[email protected]"

            # Commit changes
            git commit -m "Update minified ontologies via GitHub Actions"
            git push
          else
            echo "No changes to commit."
          fi
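
The "Diff and commit changes" step compares each regenerated `*.min.tsv` file with the copy saved in `tmp/` and records whether anything changed. The same comparison can be sketched in Python as below. This is only an illustration, not part of the PR: `find_changed_ontologies` is a hypothetical helper, it uses a non-recursive glob (the workflow's `find` is recursive), and it leaves out the copy, `git add`, and commit steps.

```python
import filecmp
import tempfile
from pathlib import Path


def find_changed_ontologies(source_dir: Path, tmp_dir: Path) -> list[str]:
    """Return names of *.min.tsv files in source_dir that differ from tmp_dir's copy.

    Hypothetical helper mirroring the shell loop; it only reports differences.
    """
    changed = []
    for new_file in sorted(source_dir.glob("*.min.tsv")):
        old_file = tmp_dir / new_file.name
        if not old_file.is_file():
            print(f"No corresponding file found in {tmp_dir} for {new_file.name}")
            continue
        # shallow=False compares file contents instead of only os.stat() data
        if filecmp.cmp(new_file, old_file, shallow=False):
            print(f"No differences in {new_file.name}")
        else:
            print(f"Differences found in {new_file.name}")
            changed.append(new_file.name)
    return changed


if __name__ == "__main__":
    # Small demonstration on throwaway files with made-up contents
    with tempfile.TemporaryDirectory() as root_name:
        root = Path(root_name)
        (root / "tmp").mkdir()
        (root / "cl.min.tsv").write_text("CL_0000001\tnew label\n")
        (root / "tmp" / "cl.min.tsv").write_text("CL_0000001\told label\n")
        print(find_changed_ontologies(root, root / "tmp"))
```

A non-empty return value corresponds to `CHANGES_DETECTED=true` in the shell script, the point at which the workflow rewrites `version.txt` and commits.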
10 changes: 8 additions & 2 deletions ingest/cli_parser.py
@@ -1,5 +1,4 @@
-"""Helper functions for ingest_pipeline.py
-"""
+"""Helper functions for ingest_pipeline.py"""

 import argparse
 import ast
@@ -281,6 +280,13 @@ def create_parser():
         help="Accepted values: 'pairwise' or 'rest' (default)",
     )

+    parser_differential_expression.add_argument(
+        "--raw-location",
+        required=True,
+        help="location of raw counts. '.raw' for raw slot, "
+        "else adata.layers key value",
+    )
+
     parser_differential_expression.add_argument(
         "--study-accession",
         required=True,
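
To see how the new `--raw-location` flag behaves when parsed, here is a minimal standalone sketch. It reuses only the argument definition shown in the hunk above. The `build_parser` function and the stripped-down subparser setup are assumptions made for illustration and do not reproduce the real `create_parser()`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, reduced parser containing only the new flag."""
    parser = argparse.ArgumentParser(prog="ingest_pipeline.py")
    subparsers = parser.add_subparsers(dest="command")
    de_parser = subparsers.add_parser("differential_expression")
    de_parser.add_argument(
        "--raw-location",
        required=True,
        help="location of raw counts. '.raw' for raw slot, "
        "else adata.layers key value",
    )
    return parser


# argparse exposes "--raw-location" as the attribute "raw_location"
args = build_parser().parse_args(
    ["differential_expression", "--raw-location", "counts"]
)
print(args.raw_location)
```

Because the flag is declared with `required=True`, argparse exits with a usage error when `differential_expression` is invoked without it.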
61 changes: 44 additions & 17 deletions ingest/de.py
@@ -403,6 +403,7 @@ def run_scanpy_de(
     ):
         method = extra_params.get("method")
         de_type = extra_params.get("de_type")
+        raw_location = extra_params.get("raw_location")

         try:
             DifferentialExpression.assess_annotation(annotation, metadata, extra_params)
@@ -432,24 +433,50 @@ def run_scanpy_de(
             )

         if matrix_file_type == "h5ad":
-            if orig_adata.raw is not None:
-                adata = AnnData(
-                    # using .copy() for the AnnData components is good practice
-                    # but we won't be using orig_adata for analyses
-                    # choosing to avoid .copy() for memory efficiency
-                    X=orig_adata.raw.X,
-                    obs=orig_adata.obs,
-                    var=orig_adata.var,
-                )
+            if raw_location == ".raw":
+                if orig_adata.raw is not None:
+                    DifferentialExpression.de_logger.info(
+                        f"Performing DE on {raw_location} data"
+                    )
+                    adata = AnnData(
+                        # using .copy() for the AnnData components is good practice
+                        # but we won't be using orig_adata for analyses
+                        # choosing to avoid .copy() for memory efficiency
+                        X=orig_adata.raw.X,
+                        obs=orig_adata.obs,
+                        var=orig_adata.var,
+                    )
+                else:
+                    msg = f'{matrix_file_path} does not have a .raw attribute'
+                    print(msg)
+                    log_exception(
+                        DifferentialExpression.dev_logger,
+                        DifferentialExpression.de_logger,
+                        msg,
+                    )
+                    raise ValueError(msg)
             else:
-                msg = f'{matrix_file_path} does not have a .raw attribute'
-                print(msg)
-                log_exception(
-                    DifferentialExpression.dev_logger,
-                    DifferentialExpression.de_logger,
-                    msg,
-                )
-                raise ValueError(msg)
+                if raw_location in orig_adata.layers.keys():
+                    DifferentialExpression.de_logger.info(
+                        f"Performing DE on adata.layers['{raw_location}'] data"
+                    )
+                    adata = AnnData(
+                        # using .copy() for the AnnData components is good practice
+                        # but we won't be using orig_adata for analyses
+                        # choosing to avoid .copy() for memory efficiency
+                        X=orig_adata.layers[raw_location],
+                        obs=orig_adata.obs,
+                        var=orig_adata.var,
+                    )
+                else:
+                    msg = f'{matrix_file_path} does not have adata.layers["{raw_location}"]'
+                    print(msg)
+                    log_exception(
+                        DifferentialExpression.dev_logger,
+                        DifferentialExpression.de_logger,
+                        msg,
+                    )
+                    raise ValueError(msg)
         # AnnData expects gene x cell so dense and mtx matrices require transposition
         else:
             adata = adata.transpose()
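
The branching added to `run_scanpy_de` reduces to one decision: take the matrix from `.raw` when `raw_location` is `".raw"`, otherwise treat `raw_location` as a key into `adata.layers`, and raise `ValueError` when the requested source is missing. The sketch below isolates that decision. `select_raw_matrix` is a hypothetical name, logging is omitted, and the `SimpleNamespace` objects are stand-ins for illustration rather than real `AnnData` instances.

```python
from types import SimpleNamespace


def select_raw_matrix(adata, raw_location):
    """Return the raw-count matrix named by raw_location, or raise ValueError.

    Hypothetical helper: works on any object exposing `.raw` and `.layers`.
    """
    if raw_location == ".raw":
        if adata.raw is None:
            raise ValueError("input does not have a .raw attribute")
        return adata.raw.X
    if raw_location in adata.layers:
        return adata.layers[raw_location]
    raise ValueError(f'input does not have adata.layers["{raw_location}"]')


# Stand-in with raw counts stored under layers["counts"] and no .raw slot
layered = SimpleNamespace(raw=None, layers={"counts": [[1, 0], [0, 2]]})
print(select_raw_matrix(layered, "counts"))

# Stand-in with raw counts stored in the .raw slot
with_raw = SimpleNamespace(raw=SimpleNamespace(X=[[3, 4]]), layers={})
print(select_raw_matrix(with_raw, ".raw"))
```

Note that a `raw_location` of `None` (for example, if the key is absent from `extra_params`) falls through to the layers branch and ends in the `ValueError`, which matches the control flow in the hunk.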
12 changes: 7 additions & 5 deletions ingest/ingest_pipeline.py
@@ -66,18 +66,20 @@
# Differential expression analysis (sparse matrix)
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cell_type__ontology_label --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/differential_expression/sparse/sparsemini_matrix.mtx --gene-file ../tests/data/differential_expression/sparse/sparsemini_features.tsv --barcode-file ../tests/data/differential_expression/sparse/sparsemini_barcodes.tsv --matrix-file-type mtx --annotation-file ../tests/data/differential_expression/sparse/sparsemini_metadata.txt --cluster-file ../tests/data/differential_expression/sparse/sparsemini_cluster.txt --cluster-name de_sparse_integration --study-accession SCPsparsemini --differential-expression

-# Differential expression analysis (h5ad matrix)
-python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name louvain --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad --matrix-file-type h5ad --annotation-file ../tests/data/anndata/h5ad_frag.metadata.tsv --cluster-file ../tests/data/anndata/h5ad_frag.cluster.X_umap.tsv --cluster-name umap --study-accession SCPdev --differential-expression
+# Differential expression analysis (h5ad matrix, raw count in raw slot)
+python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location '.raw' --annotation-name cell_type__ontology_label --de-type rest --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/compliant_liver_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/compliant_liver_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/compliant_liver.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression
+
+# Differential expression analysis (h5ad matrix, raw count in adata.layers['counts'])
+python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location 'counts' --annotation-name cell_type__ontology_label --de-type rest --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/compliant_liver_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/compliant_liver_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/compliant_liver_layers_counts.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression

# Pairwise differential expression analysis (dense matrix)
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cell_type__ontology_label --de-type pairwise --group1 "['cholinergic neuron']" --group2 "cranial somatomotor neuron" --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/differential_expression/de_dense_matrix.tsv --matrix-file-type dense --annotation-file ../tests/data/differential_expression/de_dense_metadata.tsv --cluster-file ../tests/data/differential_expression/de_dense_cluster.tsv --cluster-name de_integration --study-accession SCPdev --differential-expression

# Pairwise differential expression analysis (sparse matrix)
python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cell_type__ontology_label --de-type pairwise --group1 "['endothelial cell']" --group2 "smooth muscle cell" --annotation-type group --annotation-scope study --matrix-file-path ../tests/data/differential_expression/sparse/sparsemini_matrix.mtx --gene-file ../tests/data/differential_expression/sparse/sparsemini_features.tsv --barcode-file ../tests/data/differential_expression/sparse/sparsemini_barcodes.tsv --matrix-file-type mtx --annotation-file ../tests/data/differential_expression/sparse/sparsemini_metadata.txt --cluster-file ../tests/data/differential_expression/sparse/sparsemini_cluster.txt --cluster-name de_sparse_integration --study-accession SCPsparsemini --differential-expression

-# Pairwise differential expression analysis (h5ad matrix)
-python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --annotation-name cell_type__ontology_label --de-type pairwise --group1 "['mature B cell']" --group2 "plasma cell" --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/compliant_liver_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/compliant_liver_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/compliant_liver.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression
-
+# Pairwise differential expression analysis (h5ad matrix, raw count in raw slot)
+python ingest_pipeline.py --study-id addedfeed000000000000000 --study-file-id dec0dedfeed1111111111111 differential_expression --raw-location '.raw' --annotation-name cell_type__ontology_label --de-type pairwise --group1 "mature B cell" --group2 "plasma cell" --annotation-type group --annotation-scope study --annotation-file ../tests/data/anndata/compliant_liver_h5ad_frag.metadata.tsv.gz --cluster-file ../tests/data/anndata/compliant_liver_h5ad_frag.cluster.X_umap.tsv.gz --cluster-name umap --matrix-file-path ../tests/data/anndata/compliant_liver.h5ad --matrix-file-type h5ad --study-accession SCPdev --differential-expression
"""

import json
Binary file modified ingest/validation/ontologies/cl.min.tsv.gz
Binary file not shown.
Binary file modified ingest/validation/ontologies/efo.min.tsv.gz
Binary file not shown.
Binary file modified ingest/validation/ontologies/mondo.min.tsv.gz
Binary file not shown.
Binary file modified ingest/validation/ontologies/ncbitaxon.min.tsv.gz
Binary file not shown.
3 changes: 1 addition & 2 deletions ingest/validation/ontologies/version.txt
@@ -1,2 +1 @@
-1738072997 # validation cache key
-
+1742329182 # validation cache key
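
The value in `version.txt` is written by `date +%s` in the workflow, so it is a Unix timestamp in seconds followed by a comment. A consumer could read it along the lines of the sketch below. `parse_cache_key` is a hypothetical helper for illustration; how downstream caches actually read this file is not shown in this PR.

```python
from datetime import datetime, timezone


def parse_cache_key(line: str) -> int:
    """Extract the integer timestamp from a line like '<epoch> # validation cache key'."""
    return int(line.split("#", 1)[0].strip())


old_key = parse_cache_key("1738072997 # validation cache key")
new_key = parse_cache_key("1742329182 # validation cache key")

# A larger key means the ontologies were regenerated more recently
print(new_key > old_key)
print(datetime.fromtimestamp(new_key, tz=timezone.utc).isoformat())
```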