
Conversation

@shntnu (Contributor) commented Jul 9, 2025

Summary

Adds a YAML-driven runner for copairs morphological profiling analyses, enabling configuration-based workflows without writing Python code.

Changes

  • Added copairs_runner.py with modular preprocessing pipeline
  • Created YAML configuration system for analysis parameters
  • Included example configs for phenotypic activity and consistency analyses
  • Added documentation and example outputs

Usage

uv run copairs_runner.py configs/activity_analysis.yaml

Test plan

Run bash run_examples.sh to execute both example analyses with LINCS data.

🤖 Generated with Claude Code

… analyses

This commit introduces a flexible, configuration-driven runner for copairs morphological
profiling analyses. The runner provides a declarative way to specify data loading,
preprocessing pipelines, and analysis parameters through YAML configuration files.

Key features:
- YAML-based configuration for all analysis parameters
- Modular preprocessing pipeline with 9 built-in steps:
  - filter, dropna, remove_nan_features, split_multilabel, filter_active,
    aggregate_replicates, merge_metadata, filter_single_replicates,
    apply_assign_reference
- Support for both average precision and mean average precision analyses
- Automatic plotting of mAP vs -log10(p-value) scatter plots
- Flexible output formats (CSV/Parquet)
- Example configurations for phenotypic activity and consistency analyses
- Example outputs included for documentation (PNG plots)

The runner uses inline script dependencies (PEP 723) for easy execution with uv,
and includes comprehensive documentation in CLAUDE.md for AI-assisted development.

Example usage:
  uv run copairs_runner.py configs/activity_analysis.yaml

Includes working examples using LINCS Cell Painting data that can be run with:
  bash run_examples.sh
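
For orientation, below is a minimal sketch of what such a config might look like, loaded the way a PyYAML-based script would read it. Section and step names are illustrative only; the actual schema is defined by the example configs under configs/.

  import yaml

  # Illustrative only -- see configs/*.yaml for the real schema.
  example = """
  data:
    path: profiles.csv
  preprocessing:
    steps:
      - type: dropna
      - type: filter
        params:
          query: "Metadata_pert_type == 'trt'"
  analysis:
    type: average_precision
  output:
    format: csv
  """
  cfg = yaml.safe_load(example)
  print(cfg["preprocessing"]["steps"][1]["params"]["query"])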

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@jfredinh marked this pull request as ready for review July 9, 2025 14:57
@jfredinh commented Jul 9, 2025

Whoops, pressed "ready for review" by mistake.

Note: the test case runs fine, but when running on the JUMP set it fails.
Was the "add_column_from_query" function removed?

@shntnu (Contributor, Author) left a comment

Can you drop in the yaml here?

@shntnu (Contributor, Author) commented Jul 9, 2025

Ah, you mean this one:

  - type: add_column_from_query
    params:
      query: '(Metadata_moa == "EGFR inhibitor") & (Metadata_mmoles_per_liter > 1)'
      column_name: "Metadata_is_high_dose_EGFR_inhibitor"
      fill_value: False  # Optional: fill NaN values (e.g., when moa or concentration is missing)
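
For context, a step like this could be implemented as a thin wrapper around pandas query: rows matching the expression get True, everything else (including rows with missing metadata) keeps the fill_value. This is only an illustrative sketch, not the runner's actual code.

  import pandas as pd

  def add_column_from_query(df, query, column_name, fill_value=False):
      """Flag rows matching a pandas query expression in a new boolean column."""
      df = df.copy()
      df[column_name] = fill_value                       # default; also covers rows with missing values
      df.loc[df.query(query).index, column_name] = True
      return df

  profiles = pd.DataFrame({
      "Metadata_moa": ["EGFR inhibitor", "EGFR inhibitor", None],
      "Metadata_mmoles_per_liter": [10.0, 0.1, 10.0],
  })
  out = add_column_from_query(
      profiles,
      '(Metadata_moa == "EGFR inhibitor") & (Metadata_mmoles_per_liter > 1)',
      "Metadata_is_high_dose_EGFR_inhibitor",
  )
  # Only the first row is flagged True; the missing-moa row keeps the fill_value.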

shntnu and others added 3 commits July 9, 2025 13:40
… resolution

- Add `add_column` preprocessing step to create boolean columns from query expressions
- Implement config-relative path resolution for better portability
- Update example configs to use relative paths from config directory
- Write plot outputs to examples directory for documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add polars-based lazy filtering for large parquet files
- Support HTTP/S3 URLs for data and metadata loading
- Implement config-relative path resolution for URLs and local files
- Add enhanced logging to show loaded metadata/feature columns
- Standardize config naming: example_<analysis>_<dataset>.yaml
- Add JUMP TARGET2 example using S3 data with lazy filtering
- Fix parameter naming consistency (on_columns vs on)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Contributor, Author) commented Jul 9, 2025

@jfredinh ready

see

  # 4. Add negative control indicator column
  - type: add_column
    params:
      query: "Metadata_JCP2022 == 'JCP2022_033924'"
      column: "Metadata_negcon"

  # 5. Force negcons not to have activity calculated
  - type: apply_assign_reference
    params:
      condition: "Metadata_negcon"
      reference_col: "Metadata_reference_index"
      default_value: -1

…figuration format

- Fix incorrect config examples in CLAUDE.md (use 'path' not 'profiles', remove non-existent params)
- Update README.md with correct config structure (add 'steps:' under preprocessing)
- Document lazy loading and URL/S3 support
- Correct example config filenames to match actual files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu requested a review from jfredinh July 9, 2025 19:12
shntnu and others added 3 commits July 9, 2025 15:15
- Reorder config sections for logical flow (preprocessing before analysis)
- Update null_size to more typical value (10000) with guidance comment
- Preprocessing now appears after data section where it belongs in the pipeline

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add clear comments distinguishing lazy filtering (before load) vs preprocessing filter (after load)
- Document that lazy filtering uses SQL syntax (polars) while preprocessing uses pandas query syntax
- Provide examples of both filtering approaches in README and docstring
- Update CLAUDE.md with detailed guidance on when to use each approach

This addresses potential confusion about the two different filtering mechanisms.
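
To make the distinction concrete, here is a hedged sketch of the two syntaxes side by side (the file path and column names are illustrative, and the runner's exact config keys for each mechanism are not shown here):

  import polars as pl

  # 1) Lazy filtering (before load): SQL evaluated by polars while scanning the
  #    parquet file, so only matching rows are ever materialized.
  lf = pl.scan_parquet("profiles.parquet")
  ctx = pl.SQLContext(profiles=lf)
  subset = ctx.execute(
      "SELECT * FROM profiles WHERE Metadata_PlateType = 'TARGET2'"
  ).collect()

  # 2) Preprocessing filter (after load): pandas query syntax on the in-memory frame.
  df = subset.to_pandas()
  df = df.query("Metadata_pert_type == 'trt'")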

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Document when to add vs avoid new preprocessing steps
- Emphasize alternatives (lazy filtering, external preprocessing)
- Add minimal implementation guide for necessary steps
- Update README to reference CONTRIBUTING.md

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Contributor, Author) commented Jul 9, 2025

@jfredinh I added this https://github.com/broadinstitute/monorepo/blob/copairs-runner/libs/copairs_runner/CONTRIBUTING.md to clarify what goes into preprocessing

- Add context that copairs typically runs on already-processed profiles
- Note that extensive preprocessing needs may indicate upstream issues
- Reframe criteria to focus on last-mile transformations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@jfredinh left a comment

I tested out the latest version of the code.
Everything seems to run as expected, including when playing around with the arguments.

I reviewed all of the changed files except the copairs_runner.py file. Should I go through that one as well?

My only suggestion after this is to change the output naming based on a user-specified analysis name: either generate a separate results folder from that name, or include it as a substring in the full path, to make it easier to identify the results from several different YAMLs.

@jfredinh requested a review from Copilot July 9, 2025 21:28

shntnu and others added 2 commits July 9, 2025 18:42
- Remove unimplemented metadata_regex from docs and configs
- Fix filter_active parameter naming (on_column -> on_columns)
- Ensure consistent parameter naming across implementation and examples

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Contributor, Author) commented Jul 9, 2025

I tested out the latest version of the code. Everything seems to run as expected, including when playing around with the arguments.

Thanks for testing!

I reviewed all of the changed files except the copairs_runner.py file. Should I go through that one as well?

Copilot did a pretty decent job -- and it looks like you've read through it too, so that's good enough.

My only suggestion after this is to change the output naming based on a user-specified analysis name: either generate a separate results folder from that name, or include it as a substring in the full path, to make it easier to identify the results from several different YAMLs.

I'll address that.

I've addressed all of Copilot's comments via 9e793ea.

@shntnu (Contributor, Author) commented Jul 9, 2025

My only suggestion after this is to change the output naming based on a user-specified analysis name: either generate a separate results folder from that name, or include it as a substring in the full path, to make it easier to identify the results from several different YAMLs.

I think the runner is the wrong place to address this -- that should happen in whatever system we use to create the config files.

shntnu and others added 4 commits July 9, 2025 21:15
- Change path resolution from config-relative to CWD-relative
- Remove automatic defaults for environment variables (must be set explicitly)
- Update all example configs to use ${COPAIRS_DATA} and ${COPAIRS_OUTPUT}
- Update documentation to reflect new path behavior
- Add environment variable setup to run_examples.sh

BREAKING CHANGE: Paths in configs are now resolved relative to current working directory instead of config file location. Environment variables used in configs must be explicitly set.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Replace PyYAML with OmegaConf for config loading
- Update environment variable syntax to ${oc.env:VAR}
- Add OmegaConf.resolve() to handle interpolations
- Convert OmegaConf containers to dicts where needed
- Update documentation to reflect dependency change

This migration provides better config interpolation support and
built-in environment variable handling.
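
As a quick illustration of the new syntax (the config keys here are illustrative), ${oc.env:VAR} pulls the value from the environment when the config is resolved:

  import os
  from omegaconf import OmegaConf

  os.environ["COPAIRS_DATA"] = "/data/lincs"  # must be set explicitly; no silent default
  cfg = OmegaConf.create({"data": {"path": "${oc.env:COPAIRS_DATA}/profiles.parquet"}})
  OmegaConf.resolve(cfg)           # resolve ${oc.env:...} interpolations in place
  print(cfg.data.path)             # /data/lincs/profiles.parquet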

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add hydra-core dependency for configuration management
- Replace argparse with @hydra.main decorator
- Update __init__ to accept DictConfig directly (no backward compatibility)
- Use Hydra's built-in logging configuration
- Keep OmegaConf.to_container() calls with resolve=True for ListConfig compatibility
- Update all documentation and examples to use new Hydra CLI syntax
  - Use --config-name instead of positional config file argument
  - Replace --verbose with hydra.verbose=true
  - Document parameter override syntax
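
A minimal sketch of what the Hydra entry point looks like after this change (the decorator arguments shown are assumptions, not the runner's exact values):

  import hydra
  from omegaconf import DictConfig, OmegaConf

  @hydra.main(version_base=None, config_path=None, config_name="config")
  def main(cfg: DictConfig) -> None:
      # Hydra parses the command line, loads the named config, applies key=value
      # overrides, and configures logging before this function runs.
      print(OmegaConf.to_yaml(cfg))

  if __name__ == "__main__":
      main()

Invocation then switches from a positional config file to something like `uv run copairs_runner.py --config-name <name> some.param=value hydra.verbose=true`.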

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add hydra.run.dir and hydra.job.chdir to use existing output/ directory
- Update all file paths to be relative since Hydra changes working directory
- Use ${hydra:runtime.cwd} for data paths to reference original directory
- Preserve workflow where second analysis reads from first analysis output

This allows copairs_runner to:
1. Keep all outputs in the same directory for inter-analysis dependencies
2. Store Hydra's .hydra/ config snapshots alongside results
3. Maintain backward compatibility with existing workflows

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
shntnu and others added 2 commits July 10, 2025 11:18
…d 3-file pattern

Major refactoring to simplify and standardize output handling:

- Replaced separate save methods with unified dictionary-based save_results()
- Fixed output pattern: always saves ap_scores, map_results, and map_plot
- Simplified configuration: just output.directory and output.name (no more plotting section)
- Extracted plot creation from saving for better separation of concerns
- Updated all example configs to use new simplified structure
- Improved dictionary access patterns for safer config handling
- Updated README with clear output documentation
- Refactored CLAUDE.md to focus on development context, referring to README for usage

This makes the runner more predictable and easier to extend while maintaining backward compatibility for the analysis logic.
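
A rough sketch of the dictionary-based save (the file naming and figure handling are assumptions; the three fixed keys follow the commit message above):

  from pathlib import Path

  def save_results(results, directory, name, fmt="csv"):
      """Write the three fixed outputs: ap_scores, map_results, and map_plot."""
      out = Path(directory)
      out.mkdir(parents=True, exist_ok=True)
      for key in ("ap_scores", "map_results"):
          path = out / f"{name}_{key}.{fmt}"
          if fmt == "parquet":
              results[key].to_parquet(path)
          else:
              results[key].to_csv(path, index=False)
      results["map_plot"].savefig(out / f"{name}_map_plot.png", dpi=150)  # a matplotlib Figure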

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… consistency

- Renamed 'data' to 'input' throughout codebase for better symmetry with 'output'
- Updated all config files (YAML) to use 'input:' instead of 'data:'
- Updated code to use self.config["input"] instead of self.config["data"]
- Updated all documentation (README.md, CLAUDE.md) to reflect the change

This creates cleaner configuration with input/output symmetry and better reflects
that the section configures all input aspects (paths, filtering, columns), not just data.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
shntnu and others added 5 commits July 10, 2025 11:38
…tion

Add explanation for why preprocessing uses list-based configuration instead
of dict-based, documenting the trade-offs between explicit ordering and
easier command-line overrides.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…workflows

The LINCS workflow uses a shared directory where consistency analysis reads
outputs from activity analysis. Without nested subdirectories, the second
analysis would overwrite .hydra/ runtime files from the first, making it
impossible to track configurations.

Changes:
- Use nested subdirectories (shared/activity/ and shared/consistency/)
- Update relative path in consistency config to ../activity/
- Document this critical design pattern in README.md and CLAUDE.md
- Add explanatory comments in YAML configs

This ensures both analyses maintain separate Hydra environments while
preserving the dependency relationship.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Fix aggregate_replicates to use observed=True in groupby operation
  This prevents pandas from creating combinations for unobserved
  categorical values, which could cause issues with categorical columns
- Improve filter_active logging to show both row count and number of
  active perturbations for better debugging
- Update terminology from "compounds" to "perturbations" for consistency
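
The observed=True change matters specifically for categorical grouping columns; a small self-contained example of the difference:

  import pandas as pd

  df = pd.DataFrame({
      "Metadata_broad_sample": pd.Categorical(["A", "A", "B"], categories=["A", "B", "C"]),
      "Metadata_Plate": pd.Categorical(["p1", "p1", "p2"]),
      "feat_1": [0.1, 0.3, 0.5],
  })
  # observed=False would emit every category combination -- including ("C", ...) groups
  # that never occur in the data; observed=True keeps only the groups actually present.
  agg = df.groupby(["Metadata_broad_sample", "Metadata_Plate"], observed=True)["feat_1"].mean()
  print(agg)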

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…ared workflow

Add example configurations for JUMP CPCNN data that demonstrate the shared
folder workflow pattern. Note that the consistency example groups by plate
rather than actual targets - these are primarily workflow demonstrations
showing how dependent analyses can share outputs.

Both configs:
- Use the shared folder pattern (shared/activity/ and shared/consistency/)
- Work with JUMP CPCNN embeddings (X_1, X_2, X_3 features)
- Apply lazy filtering for efficient parquet handling
- Filter to TARGET2 plates only

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…roposals

Add comprehensive roadmap documenting potential Hydra enhancements including:
- Dynamic directory naming with interpolation
- Multirun parameter sweeps and override_dirname
- Configuration inheritance and custom resolvers
- Error handling and robustness improvements

Includes caveat emptor note explaining AI-assisted creation process using
Claude and Context7 for Hydra documentation verification.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

# Add significance threshold line
ax.axhline(
    -np.log10(0.05), color="#d6604d", linestyle="--", linewidth=1.5, alpha=0.8
)


At the moment this is hardcoded to a threshold of 0.05, while the threshold is a parameter in the config files. Do we want to drop this line, since significance is already shown by the color of each dot in the scatter plot? Or add the threshold value as an input to the create_map_plot function?

shntnu and others added 21 commits August 4, 2025 19:24
…andalone usage

- Move script to src/copairs_runner/ for proper package structure
- Add pyproject.toml with build config and dependencies
- Support both `uv run copairs-runner` (installed) and direct script execution
- Update README with installation instructions and GitHub raw URL usage
- Add sync notes between PEP 723 inline deps and pyproject.toml
- Configure Hatchling to read version from src/copairs_runner/__init__.py
- Follows Python packaging best practice for single source of truth
- Prevents version mismatch between pyproject.toml and __init__.py
…bility

- Set config_path=None to allow flexible config location
- Update README to show --config-path usage for installed package
- Fixes MissingConfigException when running as installed package
- Add docstring to main() with clear usage examples
- Shows config-path and parameter override patterns
- Improves user experience when running --help
…entation

- Update all file paths in docs to reflect src/copairs_runner/ structure
- Remove docstring from main() to avoid confusion with Hydra help
- Update .gitignore with package-related patterns (dist/, *.egg-info/, etc.)
- Fix run_examples.sh to use new script location
- Update CLAUDE.md to describe package design instead of single-file
- Clarify README.md about absolute paths for installed package usage
…ngWithCopyWarning

Replace df[column] = False with df.loc[:, column] = False to avoid ambiguity
when modifying DataFrames that might be views.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…en adding columns

Create an explicit copy of the DataFrame when adding new columns to avoid
pandas warning about setting values on a view. This is cleaner than suppressing
the warning and makes the intent clear.
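
In other words, the pattern is roughly this (a sketch, not the runner's exact code):

  import pandas as pd

  def add_flag_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
      # Working on an explicit copy makes the intent clear; assigning a new column
      # on a slice/view of another DataFrame is what triggers SettingWithCopyWarning.
      df = df.copy()
      df[column] = False
      return df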
…ing step

- Add duckdb dependency to PEP 723 header and imports
- Extend _preprocess_merge_metadata to handle .duckdb files with table parameter
- Maintain backward compatibility with existing CSV file usage
- Support both tables and views in DuckDB files
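
A hedged sketch of what the DuckDB branch of metadata loading might look like (the function name and signature are illustrative):

  import duckdb
  import pandas as pd

  def load_metadata(path, table=None):
      """Read metadata from a CSV file, or from a table/view inside a .duckdb file."""
      if path.endswith(".duckdb"):
          with duckdb.connect(path, read_only=True) as con:
              return con.execute(f"SELECT * FROM {table}").df()
      return pd.read_csv(path)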
…ctices

- Replace custom resolve_path() with hydra.utils.to_absolute_path()
- Simplify config interpolations from ${hydra:runtime.cwd}/${oc.env:VAR} to ${oc.env:VAR,.}
- Add consistent fallback default (".") for both COPAIRS_DATA and COPAIRS_OUTPUT
- Update load_data() to use resolve_path() consistently for input paths
- Fix preprocessing config structure to require 'steps' key
- Update README documentation to reflect simplified syntax
- Update run_examples.sh to use --config-dir for consistency

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Add configurable output format option to save results as either CSV or Parquet files.
Users can now specify 'format: parquet' in the output configuration section.
Defaults to CSV for backwards compatibility.

Also fixed unused params warning in _preprocess_remove_nan_features method.
…rt to filter_active

- Fixed hardcoded 0.05 threshold in create_map_plot visualization
  - Now uses threshold from mean_average_precision config params
  - Allows different analyses to use appropriate significance levels

- Enhanced filter_active preprocessing to support Parquet files
  - Renamed parameter from activity_csv to activity_file for clarity
  - Auto-detects file format by extension (.parquet or .csv)
  - Maintains backward compatibility with CSV files

- Updated example configs to use new activity_file parameter
- Added documentation note about using uv run python
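
The format auto-detection amounts to choosing a reader by file extension; a sketch (the helper name is illustrative):

  from pathlib import Path
  import pandas as pd

  def load_activity(activity_file):
      """Read activity results, picking the reader from the file extension."""
      path = Path(activity_file)
      if path.suffix == ".parquet":
          return pd.read_parquet(path)
      return pd.read_csv(path)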

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Runner now discovers and processes all preprocessing_* sections
- Sections are executed in alphabetical order for determinism
- Example config demonstrates splitting into metadata, filters, and features
- Maintains backward compatibility with single preprocessing section
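
Conceptually the discovery works like this (a sketch with a hypothetical helper; the real implementation may differ):

  def gather_preprocessing_steps(config):
      """Collect steps from every preprocessing* section, in alphabetical order."""
      steps = []
      for key in sorted(config):
          if key.startswith("preprocessing"):
              steps.extend(config[key].get("steps", []))
      return steps

  cfg = {
      "preprocessing_b_filters": {"steps": [{"type": "filter"}]},
      "preprocessing_a_metadata": {"steps": [{"type": "merge_metadata"}]},
  }
  print([s["type"] for s in gather_preprocessing_steps(cfg)])
  # ['merge_metadata', 'filter'] -- alphabetical section order, not insertion order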
Adapt to upstream copairs PR #98 which now returns both mean_average_precision
and mean_normalized_average_precision columns. The visualization now shows both
metrics side-by-side in separate subplots:

- Left plot: traditional mAP (0 to 1 range)
- Right plot: normalized mAP (-1 to 1 range, clipped to 0 with triangle markers)
- Consistent x-axis ranges and annotations for clipped negative values
- Updated test hashes to reflect new dual-metric output format
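
A simplified sketch of the dual-panel plot (the p-value column name and the handling of clipped points are assumptions; per the commit above, the real plot also marks clipped negatives with triangle markers):

  import matplotlib.pyplot as plt
  import numpy as np

  def create_map_plot(map_results, threshold=0.05):
      """mAP and normalized mAP vs -log10(p-value), side by side."""
      fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
      y = -np.log10(map_results["corrected_p_value"])     # assumed p-value column name
      ax1.scatter(map_results["mean_average_precision"], y, s=10)
      ax1.set_xlabel("mAP")
      ax1.set_xlim(0, 1)
      # Negative normalized mAP values are clipped to 0 for display.
      ax2.scatter(map_results["mean_normalized_average_precision"].clip(lower=0), y, s=10)
      ax2.set_xlabel("normalized mAP (negatives clipped to 0)")
      ax2.set_xlim(0, 1)
      for ax in (ax1, ax2):
          ax.axhline(-np.log10(threshold), color="#d6604d", linestyle="--", linewidth=1.5)
      ax1.set_ylabel("-log10(p-value)")
      return fig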

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>