feat(copairs_runner): add configurable YAML-driven runner for copairs analyses #91
Conversation
… analyses
This commit introduces a flexible, configuration-driven runner for copairs morphological
profiling analyses. The runner provides a declarative way to specify data loading,
preprocessing pipelines, and analysis parameters through YAML configuration files.
Key features:
- YAML-based configuration for all analysis parameters
- Modular preprocessing pipeline with 9 built-in steps:
- filter, dropna, remove_nan_features, split_multilabel, filter_active,
aggregate_replicates, merge_metadata, filter_single_replicates,
apply_assign_reference
- Support for both average precision and mean average precision analyses
- Automatic plotting of mAP vs -log10(p-value) scatter plots
- Flexible output formats (CSV/Parquet)
- Example configurations for phenotypic activity and consistency analyses
- Example outputs included for documentation (PNG plots)
The runner uses inline script dependencies (PEP 723) for easy execution with uv,
and includes comprehensive documentation in CLAUDE.md for AI-assisted development.
Example usage:
uv run copairs_runner.py configs/activity_analysis.yaml
Includes working examples using LINCS Cell Painting data that can be run with:
bash run_examples.sh
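The declarative style described above can be illustrated with a minimal, hypothetical config. The key names here are illustrative and may not match the runner's exact schema; the `pos_sameby`/`neg_diffby` parameters come from the copairs API:

```yaml
input:
  path: data/profiles.parquet

preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_dose > 0"       # pandas query syntax

analysis:
  average_precision:
    pos_sameby: [Metadata_pert_id]       # pairs sharing a perturbation are positives
    pos_diffby: []
    neg_sameby: []
    neg_diffby: [Metadata_pert_id]       # pairs differing in perturbation are negatives

output:
  directory: output
  name: activity
  format: csv
```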
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Whoops, pressed "ready to review" by mistake. Note: the test case runs fine, but when running on the JUMP set it fails.
Can you drop in the yaml here?
ah, you mean this one:

```yaml
- type: add_column_from_query
  params:
    query: '(Metadata_moa == "EGFR inhibitor") & (Metadata_mmoles_per_liter > 1)'
    column_name: "Metadata_is_high_dose_EGFR_inhibitor"
    fill_value: False  # Optional: fill NaN values (e.g., when moa or concentration is missing)
```
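As a rough sketch of what such a step could do, here is a stand-in written with plain dicts instead of the runner's actual pandas implementation; the function name and its exact behavior are assumptions based on the config above:

```python
def add_column_from_query(rows, predicate, column_name, fill_value=False):
    """Add a boolean column: True where predicate(row) holds, and fill_value
    where the predicate cannot be evaluated (e.g. a required field is missing)."""
    for row in rows:
        try:
            row[column_name] = bool(predicate(row))
        except (KeyError, TypeError):
            row[column_name] = fill_value
    return rows

rows = [
    {"Metadata_moa": "EGFR inhibitor", "Metadata_mmoles_per_liter": 3.0},
    {"Metadata_moa": "DMSO", "Metadata_mmoles_per_liter": 0.0},
    {"Metadata_mmoles_per_liter": 2.0},  # moa missing -> falls back to fill_value
]
flagged = add_column_from_query(
    rows,
    lambda r: r["Metadata_moa"] == "EGFR inhibitor"
    and r["Metadata_mmoles_per_liter"] > 1,
    "Metadata_is_high_dose_EGFR_inhibitor",
)
```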
… resolution
- Add `add_column` preprocessing step to create boolean columns from query expressions
- Implement config-relative path resolution for better portability
- Update example configs to use relative paths from config directory
- Write plot outputs to examples directory for documentation
- Add polars-based lazy filtering for large parquet files
- Support HTTP/S3 URLs for data and metadata loading
- Implement config-relative path resolution for URLs and local files
- Add enhanced logging to show loaded metadata/feature columns
- Standardize config naming: example_<analysis>_<dataset>.yaml
- Add JUMP TARGET2 example using S3 data with lazy filtering
- Fix parameter naming consistency (on_columns vs on)
@jfredinh ready see
…figuration format
- Fix incorrect config examples in CLAUDE.md (use 'path' not 'profiles', remove non-existent params)
- Update README.md with correct config structure (add 'steps:' under preprocessing)
- Document lazy loading and URL/S3 support
- Correct example config filenames to match actual files

- Reorder config sections for logical flow (preprocessing before analysis)
- Update null_size to more typical value (10000) with guidance comment
- Preprocessing now appears after data section where it belongs in the pipeline

- Add clear comments distinguishing lazy filtering (before load) vs preprocessing filter (after load)
- Document that lazy filtering uses SQL syntax (polars) while preprocessing uses pandas query syntax
- Provide examples of both filtering approaches in README and docstring
- Update CLAUDE.md with detailed guidance on when to use each approach

This addresses potential confusion about the two different filtering mechanisms.

- Document when to add vs avoid new preprocessing steps
- Emphasize alternatives (lazy filtering, external preprocessing)
- Add minimal implementation guide for necessary steps
- Update README to reference CONTRIBUTING.md
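Side by side, the two filtering mechanisms might look like this in a config. The keys `lazy_filter` and `filter` are illustrative, not necessarily the runner's exact names:

```yaml
input:
  path: s3://bucket/profiles.parquet
  # lazy filter: SQL-style expression pushed down by polars before loading
  lazy_filter: "Metadata_PlateType = 'TARGET2'"

preprocessing:
  steps:
    # preprocessing filter: pandas query syntax, applied after loading
    - type: filter
      params:
        query: "Metadata_PlateType == 'TARGET2'"
```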
@jfredinh I added this https://github.com/broadinstitute/monorepo/blob/copairs-runner/libs/copairs_runner/CONTRIBUTING.md to clarify what goes into preprocessing
- Add context that copairs typically runs on already-processed profiles
- Note that extensive preprocessing needs may indicate upstream issues
- Reframe criteria to focus on last-mile transformations
I tested out the latest version of the code.
Everything seems to run as expected, including when playing around with the arguments.
I reviewed all of the changed files except the copairs_runner.py file. Should I go through that one as well?
My only suggestion after this is to change the output naming based on a user-specified name for the analysis: either generate a separate results folder named after it, or include the name as a substring in the full path. The intention is to make it easier to correctly identify the results from several different YAMLs.
Co-authored-by: Copilot <[email protected]>
- Remove unimplemented metadata_regex from docs and configs
- Fix filter_active parameter naming (on_column -> on_columns)
- Ensure consistent parameter naming across implementation and examples
Thanks for testing!
Copilot did a pretty decent job -- and it looks like you've read through it too, so that's good enough.
I've addressed all of Copilot's comments via 9e793ea
I think the runner is the wrong place to address this -- that should happen in whatever system we use to create the config files.
- Change path resolution from config-relative to CWD-relative
- Remove automatic defaults for environment variables (must be set explicitly)
- Update all example configs to use ${COPAIRS_DATA} and ${COPAIRS_OUTPUT}
- Update documentation to reflect new path behavior
- Add environment variable setup to run_examples.sh
BREAKING CHANGE: Paths in configs are now resolved relative to current working directory instead of config file location. Environment variables used in configs must be explicitly set.
- Replace PyYAML with OmegaConf for config loading
- Update environment variable syntax to ${oc.env:VAR}
- Add OmegaConf.resolve() to handle interpolations
- Convert OmegaConf containers to dicts where needed
- Update documentation to reflect dependency change
This migration provides better config interpolation support and
built-in environment variable handling.
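For illustration, the environment-variable syntax change described above might look like this in a config (the key name is hypothetical; `${oc.env:...}` is OmegaConf's built-in resolver):

```yaml
# before (plain ${VAR}-style substitution)
# output_dir: ${COPAIRS_OUTPUT}/activity

# after (OmegaConf's environment-variable resolver)
output_dir: ${oc.env:COPAIRS_OUTPUT}/activity
```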
- Add hydra-core dependency for configuration management
- Replace argparse with @hydra.main decorator
- Update __init__ to accept DictConfig directly (no backward compatibility)
- Use Hydra's built-in logging configuration
- Keep OmegaConf.to_container() calls with resolve=True for ListConfig compatibility
- Update all documentation and examples to use new Hydra CLI syntax
- Use --config-name instead of positional config file argument
- Replace --verbose with hydra.verbose=true
- Document parameter override syntax
- Add hydra.run.dir and hydra.job.chdir to use existing output/ directory
- Update all file paths to be relative since Hydra changes working directory
- Use ${hydra:runtime.cwd} for data paths to reference original directory
- Preserve workflow where second analysis reads from first analysis output
This allows copairs_runner to:
1. Keep all outputs in the same directory for inter-analysis dependencies
2. Store Hydra's .hydra/ config snapshots alongside results
3. Maintain backward compatibility with existing workflows
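A sketch of the kind of Hydra settings described; the directory name is illustrative, while `hydra.run.dir` and `hydra.job.chdir` are standard Hydra configuration keys:

```yaml
hydra:
  run:
    dir: output        # write into the existing output/ directory
  job:
    chdir: true        # change the working directory into hydra.run.dir
```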
…d 3-file pattern
Major refactoring to simplify and standardize output handling:
- Replaced separate save methods with unified dictionary-based save_results()
- Fixed output pattern: always saves ap_scores, map_results, and map_plot
- Simplified configuration: just output.directory and output.name (no more plotting section)
- Extracted plot creation from saving for better separation of concerns
- Updated all example configs to use new simplified structure
- Improved dictionary access patterns for safer config handling
- Updated README with clear output documentation
- Refactored CLAUDE.md to focus on development context, referring to README for usage

This makes the runner more predictable and easier to extend while maintaining backward compatibility for the analysis logic.
… consistency
- Renamed 'data' to 'input' throughout codebase for better symmetry with 'output'
- Updated all config files (YAML) to use 'input:' instead of 'data:'
- Updated code to use self.config["input"] instead of self.config["data"]
- Updated all documentation (README.md, CLAUDE.md) to reflect the change

This creates cleaner configuration with input/output symmetry and better reflects that the section configures all input aspects (paths, filtering, columns), not just data.
…tion
Add explanation for why preprocessing uses list-based configuration instead of dict-based, documenting the trade-offs between explicit ordering and easier command-line overrides.
…workflows
The LINCS workflow uses a shared directory where consistency analysis reads outputs from activity analysis. Without nested subdirectories, the second analysis would overwrite .hydra/ runtime files from the first, making it impossible to track configurations.
Changes:
- Use nested subdirectories (shared/activity/ and shared/consistency/)
- Update relative path in consistency config to ../activity/
- Document this critical design pattern in README.md and CLAUDE.md
- Add explanatory comments in YAML configs

This ensures both analyses maintain separate Hydra environments while preserving the dependency relationship.
- Fix aggregate_replicates to use observed=True in groupby operation; this prevents pandas from creating combinations for unobserved categorical values, which could cause issues with categorical columns
- Improve filter_active logging to show both row count and number of active perturbations for better debugging
- Update terminology from "compounds" to "perturbations" for consistency
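The `observed=True` fix matters whenever the grouping column is categorical. A minimal pandas illustration with toy data (not the runner's code):

```python
import pandas as pd

df = pd.DataFrame({
    "Metadata_pert": pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"]),
    "feature": [1.0, 2.0, 3.0],
})

# observed=False keeps the unobserved category "c" (an empty, NaN-valued group)
with_unobserved = df.groupby("Metadata_pert", observed=False)["feature"].mean()

# observed=True drops categories with no rows, which is what aggregation wants
observed_only = df.groupby("Metadata_pert", observed=True)["feature"].mean()
```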
…ared workflow
Add example configurations for JUMP CPCNN data that demonstrate the shared folder workflow pattern. Note that the consistency example groups by plate rather than actual targets; these are primarily workflow demonstrations showing how dependent analyses can share outputs.
Both configs:
- Use the shared folder pattern (shared/activity/ and shared/consistency/)
- Work with JUMP CPCNN embeddings (X_1, X_2, X_3 features)
- Apply lazy filtering for efficient parquet handling
- Filter to TARGET2 plates only
…roposals
Add comprehensive roadmap documenting potential Hydra enhancements including:
- Dynamic directory naming with interpolation
- Multirun parameter sweeps and override_dirname
- Configuration inheritance and custom resolvers
- Error handling and robustness improvements

Includes caveat emptor note explaining AI-assisted creation process using Claude and Context7 for Hydra documentation verification.
```python
# Add significance threshold line
ax.axhline(
    -np.log10(0.05), color="#d6604d", linestyle="--", linewidth=1.5, alpha=0.8
)
```
At the moment this is hardcoded to a threshold of 0.05, while the threshold is a parameter in the config files. Do we want to drop this line, since the same information is already conveyed by the color of each dot in the scatter plot? Or add the threshold value as an input to the create_map_plot function?
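One way to parameterize it, sketched as a hypothetical helper (this is not the actual create_map_plot signature):

```python
import math

def significance_line_y(threshold: float = 0.05) -> float:
    """y-position of the significance line in -log10(p-value) space."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be in (0, 1)")
    return -math.log10(threshold)
```

create_map_plot could then accept the configured threshold and call `ax.axhline(significance_line_y(threshold), ...)` instead of hardcoding 0.05.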
…andalone usage
- Move script to src/copairs_runner/ for proper package structure
- Add pyproject.toml with build config and dependencies
- Support both `uv run copairs-runner` (installed) and direct script execution
- Update README with installation instructions and GitHub raw URL usage
- Add sync notes between PEP 723 inline deps and pyproject.toml

- Configure Hatchling to read version from src/copairs_runner/__init__.py
- Follows Python packaging best practice for single source of truth
- Prevents version mismatch between pyproject.toml and __init__.py

…bility
- Set config_path=None to allow flexible config location
- Update README to show --config-path usage for installed package
- Fixes MissingConfigException when running as installed package

- Add docstring to main() with clear usage examples
- Shows config-path and parameter override patterns
- Improves user experience when running --help

…entation
- Update all file paths in docs to reflect src/copairs_runner/ structure
- Remove docstring from main() to avoid confusion with Hydra help
- Update .gitignore with package-related patterns (dist/, *.egg-info/, etc.)
- Fix run_examples.sh to use new script location
- Update CLAUDE.md to describe package design instead of single-file
- Clarify README.md about absolute paths for installed package usage
…ngWithCopyWarning
Replace `df[column] = False` with `df.loc[:, column] = False` to avoid ambiguity when modifying DataFrames that might be views.

…en adding columns
Create an explicit copy of the DataFrame when adding new columns to avoid pandas warning about setting values on a view. This is cleaner than suppressing the warning and makes the intent clear.

…ing step
- Add duckdb dependency to PEP 723 header and imports
- Extend _preprocess_merge_metadata to handle .duckdb files with table parameter
- Maintain backward compatibility with existing CSV file usage
- Support both tables and views in DuckDB files
…ctices
- Replace custom resolve_path() with hydra.utils.to_absolute_path()
- Simplify config interpolations from ${hydra:runtime.cwd}/${oc.env:VAR} to ${oc.env:VAR,.}
- Add consistent fallback defaults (`.`) for both COPAIRS_DATA and COPAIRS_OUTPUT
- Update load_data() to use resolve_path() consistently for input paths
- Fix preprocessing config structure to require 'steps' key
- Update README documentation to reflect simplified syntax
- Update run_examples.sh to use --config-dir for consistency
Add configurable output format option to save results as either CSV or Parquet files. Users can now specify 'format: parquet' in the output configuration section. Defaults to CSV for backwards compatibility. Also fixed unused params warning in _preprocess_remove_nan_features method.
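A sketch of how the format option might select the output path. This is a hypothetical helper, not the runner's exact code:

```python
from pathlib import Path

def output_path(directory: str, name: str, fmt: str = "csv") -> Path:
    """Build the results file path; defaults to CSV for backwards compatibility."""
    if fmt not in {"csv", "parquet"}:
        raise ValueError(f"unsupported output format: {fmt}")
    return Path(directory) / f"{name}.{fmt}"
```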
… (vs creating one separately)
…ot stop if the checksum changes)
…rt to filter_active
- Fixed hardcoded 0.05 threshold in create_map_plot visualization
  - Now uses threshold from mean_average_precision config params
  - Allows different analyses to use appropriate significance levels
- Enhanced filter_active preprocessing to support Parquet files
  - Renamed parameter from activity_csv to activity_file for clarity
  - Auto-detects file format by extension (.parquet or .csv)
  - Maintains backward compatibility with CSV files
- Updated example configs to use new activity_file parameter
- Added documentation note about using uv run python
- Runner now discovers and processes all preprocessing_* sections
- Sections are executed in alphabetical order for determinism
- Example config demonstrates splitting into metadata, filters, and features
- Maintains backward compatibility with single preprocessing section
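The discovery-and-ordering behavior described above can be sketched as a hypothetical helper (the real runner's implementation may differ):

```python
def discover_preprocessing_sections(config: dict) -> list[str]:
    """Return preprocessing section names in deterministic (alphabetical) order.

    A plain 'preprocessing' section is kept for backward compatibility."""
    return sorted(
        k for k in config
        if k == "preprocessing" or k.startswith("preprocessing_")
    )

config = {
    "input": {},
    "preprocessing_filters": {},
    "preprocessing_features": {},
    "analysis": {},
}
sections = discover_preprocessing_sections(config)
```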
Adapt to upstream copairs PR #98, which now returns both mean_average_precision and mean_normalized_average_precision columns. The visualization now shows both metrics side-by-side in separate subplots:
- Left plot: traditional mAP (0 to 1 range)
- Right plot: normalized mAP (-1 to 1 range, clipped to 0 with triangle markers)
- Consistent x-axis ranges and annotations for clipped negative values
- Updated test hashes to reflect new dual-metric output format
Summary
Adds a YAML-driven runner for copairs morphological profiling analyses, enabling configuration-based workflows without writing Python code.
Changes
- copairs_runner.py with modular preprocessing pipeline
Usage
Test plan
Run `bash run_examples.sh` to execute both example analyses with LINCS data.
🤖 Generated with Claude Code