
Conversation

@shntnu (Contributor) commented Jul 9, 2025

Summary

Adds a YAML-driven runner for copairs morphological profiling analyses, enabling configuration-based workflows without writing Python code.

Changes

  • Added copairs_runner.py with modular preprocessing pipeline
  • Created YAML configuration system for analysis parameters
  • Included example configs for phenotypic activity and consistency analyses
  • Added documentation and example outputs

Usage

uv run copairs_runner.py configs/activity_analysis.yaml

Test plan

Run bash run_examples.sh to execute both example analyses with LINCS data.

🤖 Generated with Claude Code

… analyses

This commit introduces a flexible, configuration-driven runner for copairs morphological
profiling analyses. The runner provides a declarative way to specify data loading,
preprocessing pipelines, and analysis parameters through YAML configuration files.

Key features:
- YAML-based configuration for all analysis parameters
- Modular preprocessing pipeline with 9 built-in steps:
  - filter, dropna, remove_nan_features, split_multilabel, filter_active,
    aggregate_replicates, merge_metadata, filter_single_replicates,
    apply_assign_reference
- Support for both average precision and mean average precision analyses
- Automatic plotting of mAP vs -log10(p-value) scatter plots
- Flexible output formats (CSV/Parquet)
- Example configurations for phenotypic activity and consistency analyses
- Example outputs included for documentation (PNG plots)

The runner uses inline script dependencies (PEP 723) for easy execution with uv,
and includes comprehensive documentation in CLAUDE.md for AI-assisted development.

Example usage:
  uv run copairs_runner.py configs/activity_analysis.yaml

Includes working examples using LINCS Cell Painting data that can be run with:
  bash run_examples.sh
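
For orientation, below is a minimal sketch of what such a config might look like, loaded the way a PyYAML-based script would read it. Section and step names are illustrative only; the actual schema is defined by the example configs under configs/.

  import yaml

  # Illustrative only -- see configs/*.yaml for the real schema.
  example = """
  data:
    path: profiles.csv
  preprocessing:
    steps:
      - type: dropna
      - type: filter
        params:
          query: "Metadata_pert_type == 'trt'"
  analysis:
    type: average_precision
  output:
    format: csv
  """
  cfg = yaml.safe_load(example)
  print(cfg["preprocessing"]["steps"][1]["params"]["query"])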

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@jfredinh marked this pull request as ready for review July 9, 2025 14:57
@jfredinh commented Jul 9, 2025

Whoops, pressed "ready for review" by mistake.

Note: the test case runs fine, but when running on the JUMP set it fails.
Was the "add_column_from_query" function removed?

@shntnu (Contributor, Author) left a comment

Can you drop in the yaml here?

@shntnu (Contributor, Author) commented Jul 9, 2025

Ah, you mean this one:

  - type: add_column_from_query
    params:
      query: '(Metadata_moa == "EGFR inhibitor") & (Metadata_mmoles_per_liter > 1)'
      column_name: "Metadata_is_high_dose_EGFR_inhibitor"
      fill_value: False  # Optional: fill NaN values (e.g., when moa or concentration is missing)
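
For context, a step like this could be implemented as a thin wrapper around pandas query: rows matching the expression get True, everything else (including rows with missing metadata) keeps the fill_value. This is only an illustrative sketch, not the runner's actual code.

  import pandas as pd

  def add_column_from_query(df, query, column_name, fill_value=False):
      """Flag rows matching a pandas query expression in a new boolean column."""
      df = df.copy()
      df[column_name] = fill_value                       # default; also covers rows with missing values
      df.loc[df.query(query).index, column_name] = True
      return df

  profiles = pd.DataFrame({
      "Metadata_moa": ["EGFR inhibitor", "EGFR inhibitor", None],
      "Metadata_mmoles_per_liter": [10.0, 0.1, 10.0],
  })
  out = add_column_from_query(
      profiles,
      '(Metadata_moa == "EGFR inhibitor") & (Metadata_mmoles_per_liter > 1)',
      "Metadata_is_high_dose_EGFR_inhibitor",
  )
  # Only the first row is flagged True; the missing-moa row keeps the fill_value.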

shntnu and others added 3 commits July 9, 2025 13:40
… resolution

- Add `add_column` preprocessing step to create boolean columns from query expressions
- Implement config-relative path resolution for better portability
- Update example configs to use relative paths from config directory
- Write plot outputs to examples directory for documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add polars-based lazy filtering for large parquet files
- Support HTTP/S3 URLs for data and metadata loading
- Implement config-relative path resolution for URLs and local files
- Add enhanced logging to show loaded metadata/feature columns
- Standardize config naming: example_<analysis>_<dataset>.yaml
- Add JUMP TARGET2 example using S3 data with lazy filtering
- Fix parameter naming consistency (on_columns vs on)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Contributor, Author) commented Jul 9, 2025

@jfredinh ready

see

  # 4. Add negative control indicator column
  - type: add_column
    params:
      query: "Metadata_JCP2022 == 'JCP2022_033924'"
      column: "Metadata_negcon"

  # 5. Force negcons not to have activity calculated
  - type: apply_assign_reference
    params:
      condition: "Metadata_negcon"
      reference_col: "Metadata_reference_index"
      default_value: -1

…figuration format

- Fix incorrect config examples in CLAUDE.md (use 'path' not 'profiles', remove non-existent params)
- Update README.md with correct config structure (add 'steps:' under preprocessing)
- Document lazy loading and URL/S3 support
- Correct example config filenames to match actual files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu requested a review from jfredinh July 9, 2025 19:12
shntnu and others added 3 commits July 9, 2025 15:15
- Reorder config sections for logical flow (preprocessing before analysis)
- Update null_size to more typical value (10000) with guidance comment
- Preprocessing now appears after data section where it belongs in the pipeline

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add clear comments distinguishing lazy filtering (before load) vs preprocessing filter (after load)
- Document that lazy filtering uses SQL syntax (polars) while preprocessing uses pandas query syntax
- Provide examples of both filtering approaches in README and docstring
- Update CLAUDE.md with detailed guidance on when to use each approach

This addresses potential confusion about the two different filtering mechanisms.
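
To make the distinction concrete, here is a hedged sketch of the two syntaxes side by side (the file path and column names are illustrative, and the runner's exact config keys for each mechanism are not shown here):

  import polars as pl

  # 1) Lazy filtering (before load): SQL evaluated by polars while scanning the
  #    parquet file, so only matching rows are ever materialized.
  lf = pl.scan_parquet("profiles.parquet")
  ctx = pl.SQLContext(profiles=lf)
  subset = ctx.execute(
      "SELECT * FROM profiles WHERE Metadata_PlateType = 'TARGET2'"
  ).collect()

  # 2) Preprocessing filter (after load): pandas query syntax on the in-memory frame.
  df = subset.to_pandas()
  df = df.query("Metadata_pert_type == 'trt'")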

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Document when to add vs avoid new preprocessing steps
- Emphasize alternatives (lazy filtering, external preprocessing)
- Add minimal implementation guide for necessary steps
- Update README to reference CONTRIBUTING.md

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Contributor, Author) commented Jul 9, 2025

@jfredinh I added this https://github.com/broadinstitute/monorepo/blob/copairs-runner/libs/copairs_runner/CONTRIBUTING.md to clarify what goes into preprocessing

- Add context that copairs typically runs on already-processed profiles
- Note that extensive preprocessing needs may indicate upstream issues
- Reframe criteria to focus on last-mile transformations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@jfredinh left a comment

I tested out the latest version of the code.
Everything seems to run as expected, including when playing around with the arguments.

I reviewed all of the changed files except the copairs_runner.py file. Should I go through that one as well?

My only suggestion after this is to change the output naming based on a user-specified analysis name: either generate a separate results folder from that name, or include it as a substring in the full path, to make it easier to identify the results from several different YAMLs.

@jfredinh requested a review from Copilot July 9, 2025 21:28

shntnu and others added 2 commits July 9, 2025 18:42
- Remove unimplemented metadata_regex from docs and configs
- Fix filter_active parameter naming (on_column -> on_columns)
- Ensure consistent parameter naming across implementation and examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Contributor, Author) commented Jul 9, 2025

I tested out the latest version of the code. Everything seems to run as expected, including when playing around with the arguments.

Thanks for testing!

I reviewed all of the changed files except the copairs_runner.py file. Should I go through that one as well?

Copilot did a pretty decent job -- and it looks like you've read through it too, so that's good enough.

My only suggestion after this is to change the output naming based on a user-specified analysis name: either generate a separate results folder from that name, or include it as a substring in the full path, to make it easier to identify the results from several different YAMLs.

I'll address that.

I've addressed all of Copilot's comments via 9e793ea.

@shntnu (Contributor, Author) commented Jul 9, 2025

My only suggestion after this is to change the output naming based on a user-specified analysis name: either generate a separate results folder from that name, or include it as a substring in the full path, to make it easier to identify the results from several different YAMLs.

I think the runner is the wrong place to address this -- that should happen in whatever system we use to create the config files.

shntnu and others added 4 commits July 9, 2025 21:15
- Change path resolution from config-relative to CWD-relative
- Remove automatic defaults for environment variables (must be set explicitly)
- Update all example configs to use ${COPAIRS_DATA} and ${COPAIRS_OUTPUT}
- Update documentation to reflect new path behavior
- Add environment variable setup to run_examples.sh

BREAKING CHANGE: Paths in configs are now resolved relative to current working directory instead of config file location. Environment variables used in configs must be explicitly set.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Replace PyYAML with OmegaConf for config loading
- Update environment variable syntax to ${oc.env:VAR}
- Add OmegaConf.resolve() to handle interpolations
- Convert OmegaConf containers to dicts where needed
- Update documentation to reflect dependency change

This migration provides better config interpolation support and
built-in environment variable handling.
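
As a quick illustration of the new syntax (the config keys here are illustrative), ${oc.env:VAR} pulls the value from the environment when the config is resolved:

  import os
  from omegaconf import OmegaConf

  os.environ["COPAIRS_DATA"] = "/data/lincs"  # must be set explicitly; no silent default
  cfg = OmegaConf.create({"data": {"path": "${oc.env:COPAIRS_DATA}/profiles.parquet"}})
  OmegaConf.resolve(cfg)           # resolve ${oc.env:...} interpolations in place
  print(cfg.data.path)             # /data/lincs/profiles.parquet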

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add hydra-core dependency for configuration management
- Replace argparse with @hydra.main decorator
- Update __init__ to accept DictConfig directly (no backward compatibility)
- Use Hydra's built-in logging configuration
- Keep OmegaConf.to_container() calls with resolve=True for ListConfig compatibility
- Update all documentation and examples to use new Hydra CLI syntax
  - Use --config-name instead of positional config file argument
  - Replace --verbose with hydra.verbose=true
  - Document parameter override syntax
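
A minimal sketch of what the Hydra entry point looks like after this change (the decorator arguments shown are assumptions, not the runner's exact values):

  import hydra
  from omegaconf import DictConfig, OmegaConf

  @hydra.main(version_base=None, config_path=None, config_name="config")
  def main(cfg: DictConfig) -> None:
      # Hydra parses the command line, loads the named config, applies key=value
      # overrides, and configures logging before this function runs.
      print(OmegaConf.to_yaml(cfg))

  if __name__ == "__main__":
      main()

Invocation then switches from a positional config file to something like `uv run copairs_runner.py --config-name <name> some.param=value hydra.verbose=true`.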

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add hydra.run.dir and hydra.job.chdir to use existing output/ directory
- Update all file paths to be relative since Hydra changes working directory
- Use ${hydra:runtime.cwd} for data paths to reference original directory
- Preserve workflow where second analysis reads from first analysis output

This allows copairs_runner to:
1. Keep all outputs in the same directory for inter-analysis dependencies
2. Store Hydra's .hydra/ config snapshots alongside results
3. Maintain backward compatibility with existing workflows

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
shntnu and others added 2 commits July 10, 2025 11:18
…d 3-file pattern

Major refactoring to simplify and standardize output handling:

- Replaced separate save methods with unified dictionary-based save_results()
- Fixed output pattern: always saves ap_scores, map_results, and map_plot
- Simplified configuration: just output.directory and output.name (no more plotting section)
- Extracted plot creation from saving for better separation of concerns
- Updated all example configs to use new simplified structure
- Improved dictionary access patterns for safer config handling
- Updated README with clear output documentation
- Refactored CLAUDE.md to focus on development context, referring to README for usage

This makes the runner more predictable and easier to extend while maintaining backward compatibility for the analysis logic.
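
A rough sketch of the dictionary-based save (the file naming and figure handling are assumptions; the three fixed keys follow the commit message above):

  from pathlib import Path

  def save_results(results, directory, name, fmt="csv"):
      """Write the three fixed outputs: ap_scores, map_results, and map_plot."""
      out = Path(directory)
      out.mkdir(parents=True, exist_ok=True)
      for key in ("ap_scores", "map_results"):
          path = out / f"{name}_{key}.{fmt}"
          if fmt == "parquet":
              results[key].to_parquet(path)
          else:
              results[key].to_csv(path, index=False)
      results["map_plot"].savefig(out / f"{name}_map_plot.png", dpi=150)  # a matplotlib Figure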

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… consistency

- Renamed 'data' to 'input' throughout codebase for better symmetry with 'output'
- Updated all config files (YAML) to use 'input:' instead of 'data:'
- Updated code to use self.config["input"] instead of self.config["data"]
- Updated all documentation (README.md, CLAUDE.md) to reflect the change

This creates cleaner configuration with input/output symmetry and better reflects
that the section configures all input aspects (paths, filtering, columns), not just data.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
shntnu and others added 5 commits July 10, 2025 11:38
…tion

Add explanation for why preprocessing uses list-based configuration instead
of dict-based, documenting the trade-offs between explicit ordering and
easier command-line overrides.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…workflows

The LINCS workflow uses a shared directory where consistency analysis reads
outputs from activity analysis. Without nested subdirectories, the second
analysis would overwrite .hydra/ runtime files from the first, making it
impossible to track configurations.

Changes:
- Use nested subdirectories (shared/activity/ and shared/consistency/)
- Update relative path in consistency config to ../activity/
- Document this critical design pattern in README.md and CLAUDE.md
- Add explanatory comments in YAML configs

This ensures both analyses maintain separate Hydra environments while
preserving the dependency relationship.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Fix aggregate_replicates to use observed=True in groupby operation
  This prevents pandas from creating combinations for unobserved
  categorical values, which could cause issues with categorical columns
- Improve filter_active logging to show both row count and number of
  active perturbations for better debugging
- Update terminology from "compounds" to "perturbations" for consistency
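
The observed=True change matters specifically for categorical grouping columns; a small self-contained example of the difference:

  import pandas as pd

  df = pd.DataFrame({
      "Metadata_broad_sample": pd.Categorical(["A", "A", "B"], categories=["A", "B", "C"]),
      "Metadata_Plate": pd.Categorical(["p1", "p1", "p2"]),
      "feat_1": [0.1, 0.3, 0.5],
  })
  # observed=False would emit every category combination -- including ("C", ...) groups
  # that never occur in the data; observed=True keeps only the groups actually present.
  agg = df.groupby(["Metadata_broad_sample", "Metadata_Plate"], observed=True)["feat_1"].mean()
  print(agg)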

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…ared workflow

Add example configurations for JUMP CPCNN data that demonstrate the shared
folder workflow pattern. Note that the consistency example groups by plate
rather than actual targets - these are primarily workflow demonstrations
showing how dependent analyses can share outputs.

Both configs:
- Use the shared folder pattern (shared/activity/ and shared/consistency/)
- Work with JUMP CPCNN embeddings (X_1, X_2, X_3 features)
- Apply lazy filtering for efficient parquet handling
- Filter to TARGET2 plates only

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…roposals

Add comprehensive roadmap documenting potential Hydra enhancements including:
- Dynamic directory naming with interpolation
- Multirun parameter sweeps and override_dirname
- Configuration inheritance and custom resolvers
- Error handling and robustness improvements

Includes caveat emptor note explaining AI-assisted creation process using
Claude and Context7 for Hydra documentation verification.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

# Add significance threshold line
ax.axhline(
    -np.log10(0.05), color="#d6604d", linestyle="--", linewidth=1.5, alpha=0.8
)


At the moment this is hardcoded to a threshold of 0.05, while the threshold is a parameter in the config files. Do we want to drop this line, since significance is already shown by the color of each dot in the scatter plot? Or add the threshold value as an input to the create_map_plot function?

shntnu and others added 21 commits August 4, 2025 19:24
…andalone usage

- Move script to src/copairs_runner/ for proper package structure
- Add pyproject.toml with build config and dependencies
- Support both `uv run copairs-runner` (installed) and direct script execution
- Update README with installation instructions and GitHub raw URL usage
- Add sync notes between PEP 723 inline deps and pyproject.toml
- Configure Hatchling to read version from src/copairs_runner/__init__.py
- Follows Python packaging best practice for single source of truth
- Prevents version mismatch between pyproject.toml and __init__.py
…bility

- Set config_path=None to allow flexible config location
- Update README to show --config-path usage for installed package
- Fixes MissingConfigException when running as installed package
- Add docstring to main() with clear usage examples
- Shows config-path and parameter override patterns
- Improves user experience when running --help
…entation

- Update all file paths in docs to reflect src/copairs_runner/ structure
- Remove docstring from main() to avoid confusion with Hydra help
- Update .gitignore with package-related patterns (dist/, *.egg-info/, etc.)
- Fix run_examples.sh to use new script location
- Update CLAUDE.md to describe package design instead of single-file
- Clarify README.md about absolute paths for installed package usage
…ngWithCopyWarning

Replace df[column] = False with df.loc[:, column] = False to avoid ambiguity
when modifying DataFrames that might be views.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…en adding columns

Create an explicit copy of the DataFrame when adding new columns to avoid
pandas warning about setting values on a view. This is cleaner than suppressing
the warning and makes the intent clear.
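
In other words, the pattern is roughly this (a sketch, not the runner's exact code):

  import pandas as pd

  def add_flag_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
      # Working on an explicit copy makes the intent clear; assigning a new column
      # on a slice/view of another DataFrame is what triggers SettingWithCopyWarning.
      df = df.copy()
      df[column] = False
      return df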
…ing step

- Add duckdb dependency to PEP 723 header and imports
- Extend _preprocess_merge_metadata to handle .duckdb files with table parameter
- Maintain backward compatibility with existing CSV file usage
- Support both tables and views in DuckDB files
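
A hedged sketch of what the DuckDB branch of metadata loading might look like (the function name and signature are illustrative):

  import duckdb
  import pandas as pd

  def load_metadata(path, table=None):
      """Read metadata from a CSV file, or from a table/view inside a .duckdb file."""
      if path.endswith(".duckdb"):
          with duckdb.connect(path, read_only=True) as con:
              return con.execute(f"SELECT * FROM {table}").df()
      return pd.read_csv(path)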
…ctices

- Replace custom resolve_path() with hydra.utils.to_absolute_path()
- Simplify config interpolations from ${hydra:runtime.cwd}/${oc.env:VAR} to ${oc.env:VAR,.}
- Add consistent fallback default (".") for both COPAIRS_DATA and COPAIRS_OUTPUT
- Update load_data() to use resolve_path() consistently for input paths
- Fix preprocessing config structure to require 'steps' key
- Update README documentation to reflect simplified syntax
- Update run_examples.sh to use --config-dir for consistency

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add configurable output format option to save results as either CSV or Parquet files.
Users can now specify 'format: parquet' in the output configuration section.
Defaults to CSV for backwards compatibility.

Also fixed unused params warning in _preprocess_remove_nan_features method.
…rt to filter_active

- Fixed hardcoded 0.05 threshold in create_map_plot visualization
  - Now uses threshold from mean_average_precision config params
  - Allows different analyses to use appropriate significance levels

- Enhanced filter_active preprocessing to support Parquet files
  - Renamed parameter from activity_csv to activity_file for clarity
  - Auto-detects file format by extension (.parquet or .csv)
  - Maintains backward compatibility with CSV files

- Updated example configs to use new activity_file parameter
- Added documentation note about using uv run python
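
The format auto-detection amounts to choosing a reader by file extension; a sketch (the helper name is illustrative):

  from pathlib import Path
  import pandas as pd

  def load_activity(activity_file):
      """Read activity results, picking the reader from the file extension."""
      path = Path(activity_file)
      if path.suffix == ".parquet":
          return pd.read_parquet(path)
      return pd.read_csv(path)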

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Runner now discovers and processes all preprocessing_* sections
- Sections are executed in alphabetical order for determinism
- Example config demonstrates splitting into metadata, filters, and features
- Maintains backward compatibility with single preprocessing section
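
Conceptually the discovery works like this (a sketch with a hypothetical helper; the real implementation may differ):

  def gather_preprocessing_steps(config):
      """Collect steps from every preprocessing* section, in alphabetical order."""
      steps = []
      for key in sorted(config):
          if key.startswith("preprocessing"):
              steps.extend(config[key].get("steps", []))
      return steps

  cfg = {
      "preprocessing_b_filters": {"steps": [{"type": "filter"}]},
      "preprocessing_a_metadata": {"steps": [{"type": "merge_metadata"}]},
  }
  print([s["type"] for s in gather_preprocessing_steps(cfg)])
  # ['merge_metadata', 'filter'] -- alphabetical section order, not insertion order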
Adapt to upstream copairs PR #98 which now returns both mean_average_precision
and mean_normalized_average_precision columns. The visualization now shows both
metrics side-by-side in separate subplots:

- Left plot: traditional mAP (0 to 1 range)
- Right plot: normalized mAP (-1 to 1 range, clipped to 0 with triangle markers)
- Consistent x-axis ranges and annotations for clipped negative values
- Updated test hashes to reflect new dual-metric output format
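
A simplified sketch of the dual-panel plot (the p-value column name and the handling of clipped points are assumptions; per the commit above, the real plot also marks clipped negatives with triangle markers):

  import matplotlib.pyplot as plt
  import numpy as np

  def create_map_plot(map_results, threshold=0.05):
      """mAP and normalized mAP vs -log10(p-value), side by side."""
      fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
      y = -np.log10(map_results["corrected_p_value"])     # assumed p-value column name
      ax1.scatter(map_results["mean_average_precision"], y, s=10)
      ax1.set_xlabel("mAP")
      ax1.set_xlim(0, 1)
      # Negative normalized mAP values are clipped to 0 for display.
      ax2.scatter(map_results["mean_normalized_average_precision"].clip(lower=0), y, s=10)
      ax2.set_xlabel("normalized mAP (negatives clipped to 0)")
      ax2.set_xlim(0, 1)
      for ax in (ax1, ax2):
          ax.axhline(-np.log10(threshold), color="#d6604d", linestyle="--", linewidth=1.5)
      ax1.set_ylabel("-log10(p-value)")
      return fig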

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>