broadinstitute · shntnu · Jul 9, 2025
diff --git a/libs/copairs_runner/.gitignore b/libs/copairs_runner/.gitignore
@@ -0,0 +1,3 @@
+input/
+output/
+.claude/settings.local.json
diff --git a/libs/copairs_runner/CLAUDE.md b/libs/copairs_runner/CLAUDE.md
@@ -0,0 +1,109 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Overview
+
+copairs_runner is a configurable Python script for running copairs analyses on Cell Painting data. It's part of a larger monorepo focused on morphological profiling and drug discovery through cellular imaging.
+
+## Key Commands
+
+### Running Analyses
+```bash
+# Run analysis with a config file
+uv run copairs_runner.py <config_file.yaml>
+
+# Run with verbose logging
+uv run copairs_runner.py <config_file.yaml> --verbose
+
+# Run the example analyses (downloads data if needed)
+bash run_examples.sh
+```
+
+### Development Commands
+```bash
+# Lint code using uvx (following monorepo standards)
+uvx ruff check copairs_runner.py
+
+# Auto-fix linting issues
+uvx ruff check copairs_runner.py --fix
+
+# Format code
+uvx ruff format copairs_runner.py
+
+# Run tests (when implemented)
+pytest tests/
+```
+
+## Architecture
+
+### Core Components
+1. **copairs_runner.py**: Main script with inline dependencies (PEP 723)
+   - `CopairsRunner` class: Handles data loading, preprocessing, analysis, and visualization
+   - Key methods:
+     - `run()`: Main pipeline orchestrator
+     - `preprocess_data()`: Applies configurable preprocessing pipeline
+     - `run_average_precision()`: Calculates AP for compound activity
+     - `run_mean_average_precision()`: Calculates mAP with significance testing
+     - `plot_map_results()`: Creates scatter plots of mAP vs -log10(p-value)
+     - `save_results()`: Saves results to CSV/Parquet files
+
+2. **Configuration System**: YAML-based configuration with sections for:
+   - `data`: Input paths and metadata patterns
+   - `preprocessing`: Pipeline steps (filtering, aggregation, etc.)
+   - `average_precision`/`mean_average_precision`: Analysis parameters
+   - `output`: Result file paths
+   - `plotting`: Visualization settings
+
+### Preprocessing Pipeline
+The runner supports the preprocessing steps below. They run in the order they appear in the config, not in the order listed here:
+1. `filter`: Apply pandas query expressions
+2. `dropna`: Remove rows with NaN values in specified columns
+3. `remove_nan_features`: Remove feature columns containing NaN
+4. `split_multilabel`: Split pipe-separated values into lists
+5. `filter_active`: Keep only rows marked active in an activity CSV (uses its `below_corrected_p` column)
+6. `aggregate_replicates`: Aggregate by taking median of features
+7. `merge_metadata`: Merge external CSV metadata
+8. `filter_single_replicates`: Remove groups with fewer than `min_replicates` members
+9. `apply_assign_reference`: Apply `copairs.matching.assign_reference_index`
+
+## Important Context
+
+### Monorepo Standards
+This project is part of a monorepo that uses:
+- **uv** for package management (transitioning from Poetry)
+- **ruff** for formatting and linting
+- **pytest** for testing (>90% coverage target)
+- **numpy** documentation style
+- Conventional commits for commit messages
+
+### Current State
+- The script uses inline dependencies (PEP 723 format)
+- Has a minimal pyproject.toml for ruff configuration
+- No test suite exists
+- Examples use LINCS Cell Painting data from GitHub
+- Configuration files demonstrate typical usage patterns
+
+### Dependencies
+Required packages (from inline script metadata):
+- python >= 3.8
+- pandas, numpy, copairs, pyyaml, matplotlib, seaborn
+
+## Common Tasks
+
+### Adding New Preprocessing Steps
+1. Implement a new method `_preprocess_<step_name>` in the `CopairsRunner` class
+2. The method should accept `df` and `params` arguments
+3. Add documentation for the new step in the `preprocess_data()` docstring
+4. Use the step in your YAML config with `type: <step_name>`
+
+### Creating New Analysis Configs
+1. Copy an existing config from `configs/`
+2. Modify data paths and preprocessing steps
+3. Adjust analysis parameters as needed
+4. Run with: `uv run copairs_runner.py your_config.yaml`
+
+### Debugging
+- Use the `--verbose` flag for detailed logging
+- Check intermediate results with `save_intermediate: true` in preprocessing
+- Examine output CSV files for analysis results
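The "Adding New Preprocessing Steps" list in CLAUDE.md implies that steps are looked up by name from the YAML `type` field. A minimal sketch of one way such a dispatch could work is shown here. It assumes a `getattr`-based lookup, which the diff does not confirm; `SketchRunner` and the `min_dose` step are hypothetical, and plain lists of dicts stand in for the pandas DataFrame so the sketch has no dependencies.

```python
class SketchRunner:
    """Hypothetical stand-in for CopairsRunner; only the dispatch pattern is shown."""

    def __init__(self, config):
        self.config = config

    def preprocess_data(self, rows):
        # Apply each configured step in the order it appears in the config.
        for step in self.config.get("preprocessing", []):
            handler = getattr(self, f"_preprocess_{step['type']}", None)
            if handler is None:
                raise ValueError(f"Unknown preprocessing step: {step['type']}")
            rows = handler(rows, step.get("params", {}))
        return rows

    def _preprocess_dropna(self, rows, params):
        # Drop rows that have a missing value in any of the listed columns.
        columns = params["columns"]
        return [r for r in rows if all(r.get(c) is not None for c in columns)]

    def _preprocess_min_dose(self, rows, params):
        # Hypothetical new step: keep rows above a dose threshold.
        return [r for r in rows if r["Metadata_dose"] > params["threshold"]]


config = {
    "preprocessing": [
        {"type": "dropna", "params": {"columns": ["Metadata_target"]}},
        {"type": "min_dose", "params": {"threshold": 0.1}},
    ]
}
rows = [
    {"Metadata_target": "EGFR", "Metadata_dose": 1.0},
    {"Metadata_target": None, "Metadata_dose": 1.0},
    {"Metadata_target": "HDAC1", "Metadata_dose": 0.05},
]
kept = SketchRunner(config).preprocess_data(rows)
```

With a lookup like this, adding a step only requires a new `_preprocess_<step_name>` method taking the data and `params`; an unrecognized `type` fails early with a clear error rather than being silently skipped.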
diff --git a/libs/copairs_runner/README.md b/libs/copairs_runner/README.md
@@ -0,0 +1,70 @@
+# Copairs Runner
+
+YAML-driven runner for [copairs](https://github.com/broadinstitute/copairs).
+
+## Usage
+
+```bash
+uv run copairs_runner.py config.yaml
+```
+
+## Configuration
+
+```yaml
+# Required
+data:
+  path: "data.csv"  # or .parquet
+
+average_precision:
+  params:
+    pos_sameby: ["Metadata_compound"]
+    pos_diffby: []
+    neg_sameby: []
+    neg_diffby: ["Metadata_compound"]
+
+output:
+  path: "results.csv"
+
+# Optional
+preprocessing:
+  - type: filter
+    params:
+      query: "Metadata_dose > 0.1"
+
+mean_average_precision:
+  params:
+    sameby: ["Metadata_compound"]
+    null_size: 1000000
+    threshold: 0.05
+    seed: 0
+
+plotting:
+  enabled: true
+  path: "plot.png"
+```
+
+## Preprocessing Steps
+
+- `filter`: Filter rows with pandas query
+- `dropna`: Remove rows with NaN in specified columns
+- `aggregate_replicates`: Median aggregation by group
+- `merge_metadata`: Join external CSV
+- `split_multilabel`: Split pipe-separated values
+- See `copairs_runner.py` docstring for complete list
+
+## Examples
+
+- `configs/activity_analysis.yaml`: Phenotypic activity
+- `configs/consistency_analysis.yaml`: Target consistency
+
+Run both: `./run_examples.sh`
+
+### Example Output
+
+The runner generates scatter plots showing mean average precision (mAP) vs statistical significance:
+
+**Phenotypic Activity Assessment:**
+![Activity Plot](examples/example_activity_plot.png)
+
+**Phenotypic Consistency (Target-based):**
+![Consistency Plot](examples/example_consistency_plot.png)
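The README describes the output plots as mAP against statistical significance, and the configs label the y-axis `-log10(p-value)` with a `threshold` of 0.05. The sketch below shows how those axis values and the significance cutoff relate. The result tuples are made-up illustrative values, `neg_log10` is a hypothetical helper, and whether the runner plots raw or corrected p-values is not stated in this diff.

```python
import math


def neg_log10(p_value):
    """Transform a p-value so that smaller p-values plot higher on the y-axis."""
    return -math.log10(p_value)


threshold = 0.05  # same value as `threshold` in the example configs

# (label, mAP, p-value): illustrative values only, not real results.
results = [
    ("sample_1", 0.62, 0.001),
    ("sample_2", 0.18, 0.30),
]

points = [
    {
        "label": label,
        "x": map_score,                   # x-axis: mAP
        "y": neg_log10(p),                # y-axis: -log10(p-value)
        "significant": p < threshold,     # above the horizontal cutoff line
    }
    for label, map_score, p in results
]
cutoff_y = neg_log10(threshold)
```

A point counts as significant exactly when its `y` value lies above `cutoff_y`, which is why such plots usually draw a horizontal line at that height.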
diff --git a/libs/copairs_runner/configs/activity_analysis.yaml b/libs/copairs_runner/configs/activity_analysis.yaml
@@ -0,0 +1,51 @@
+# Configuration for phenotypic activity analysis
+# Matches the phenotypic_activity.ipynb example
+
+data:
+  path: "input/2016_04_01_a549_48hr_batch1_plateSQ00014812.csv"
+  metadata_regex: "^Metadata"
+
+preprocessing:
+  # Remove constant columns (as done in notebook)
+  # Note: This is handled differently in the runner, but we can achieve similar results
+
+  # Assign reference index for controls (DMSO)
+  - type: apply_assign_reference
+    params:
+      condition: "Metadata_broad_sample == 'DMSO'"
+      reference_col: "Metadata_reference_index"
+      default_value: -1
+
+average_precision:
+  params:
+    # Positive pairs: replicates of the same compound
+    pos_sameby: ["Metadata_broad_sample", "Metadata_reference_index"]
+    pos_diffby: []
+
+    # Negative pairs: compound vs control
+    neg_sameby: []
+    neg_diffby: ["Metadata_broad_sample", "Metadata_reference_index"]
+
+    # Using default distance (cosine) as in notebook
+
+mean_average_precision:
+  params:
+    sameby: ["Metadata_broad_sample"]  # Group by compound
+    null_size: 1000000                 # As used in notebook
+    threshold: 0.05
+    seed: 0                            # As used in notebook
+
+output:
+  path: "output/activity_map_runner.csv"
+  save_ap_scores: true  # Save AP scores to match notebook output
+
+plotting:
+  enabled: true
+  path: "output/map_activity_plot.png"
+  format: "png"  # or pdf, svg, etc.
+  title: "Phenotypic Activity Assessment"
+  xlabel: "mAP"
+  ylabel: "-log10(p-value)"
+  annotation_prefix: "Phenotypically active"
+  figsize: [8, 6]
+  dpi: 100
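The activity config combines `apply_assign_reference` with `pos_sameby`/`neg_diffby` on both `Metadata_broad_sample` and `Metadata_reference_index`. The sketch below illustrates the pairing logic this setup appears to rely on, under two assumptions not confirmed by this diff: that control rows receive their own row index as the reference value while all other rows get `default_value`, and that "diffby" means the rows differ on every listed column. The function here is a plain-Python stand-in, not the copairs implementation, and the `BRD-*` sample names are hypothetical.

```python
from itertools import combinations


def assign_reference_index(rows, is_reference, reference_col, default_value=-1):
    """Give each reference (control) row a unique index; others get default_value."""
    return [
        {**row, reference_col: i if is_reference(row) else default_value}
        for i, row in enumerate(rows)
    ]


def same_on_all(a, b, columns):
    return all(a[c] == b[c] for c in columns)


def differ_on_all(a, b, columns):
    return all(a[c] != b[c] for c in columns)


rows = [
    {"Metadata_broad_sample": "DMSO"},
    {"Metadata_broad_sample": "DMSO"},
    {"Metadata_broad_sample": "BRD-A"},
    {"Metadata_broad_sample": "BRD-A"},
    {"Metadata_broad_sample": "BRD-B"},
]
indexed = assign_reference_index(
    rows,
    is_reference=lambda r: r["Metadata_broad_sample"] == "DMSO",
    reference_col="Metadata_reference_index",
)

cols = ["Metadata_broad_sample", "Metadata_reference_index"]
pairs = list(combinations(range(len(indexed)), 2))
positive_pairs = [(i, j) for i, j in pairs if same_on_all(indexed[i], indexed[j], cols)]
negative_pairs = [(i, j) for i, j in pairs if differ_on_all(indexed[i], indexed[j], cols)]
```

Under these assumptions, the only positive pair is the two `BRD-A` replicates, the two DMSO rows never pair with each other, and every negative pair is compound versus control, matching the comments in the config.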
diff --git a/libs/copairs_runner/configs/consistency_analysis.yaml b/libs/copairs_runner/configs/consistency_analysis.yaml
@@ -0,0 +1,67 @@
+# Configuration for phenotypic consistency analysis
+# Matches the phenotypic_consistency.ipynb example
+
+data:
+  path: "input/2016_04_01_a549_48hr_batch1_plateSQ00014812.csv"
+  metadata_regex: "^Metadata"
+
+preprocessing:
+  # Filter to only active compounds based on activity analysis
+  - type: filter_active
+    params:
+      activity_csv: "output/activity_map_runner.csv"
+      on_column: "Metadata_broad_sample"
+
+  # Remove rows with missing targets (implicit in notebook via query)
+  - type: dropna
+    params:
+      columns: ["Metadata_target"]
+
+  # Aggregate replicates by taking median of features (as done in notebook)
+  - type: aggregate_replicates
+    params:
+      groupby: ["Metadata_broad_sample", "Metadata_target"]
+
+  # Split the pipe-separated target values into lists for multilabel analysis
+  - type: split_multilabel
+    params:
+      column: "Metadata_target"
+      separator: "|"
+
+average_precision:
+  # Use multilabel since compounds have multiple targets (separated by |)
+  multilabel: true
+
+  params:
+    # Positive pairs: compounds sharing the same target
+    pos_sameby: ["Metadata_target"]
+    pos_diffby: []
+
+    # Negative pairs: compounds with different targets
+    neg_sameby: []
+    neg_diffby: ["Metadata_target"]
+
+    # For multilabel analysis, specify the column
+    multilabel_col: "Metadata_target"
+
+mean_average_precision:
+  params:
+    sameby: ["Metadata_target"]  # Group by target
+    null_size: 1000000           # As used in notebook
+    threshold: 0.05
+    seed: 0                      # As used in notebook
+
+output:
+  path: "output/target_maps_runner.csv"
+  save_ap_scores: false
+
+plotting:
+  enabled: true
+  path: "output/map_consistency_plot.png"
+  format: "png"  # or pdf, svg, etc.
+  title: "Phenotypic Consistency Assessment"
+  xlabel: "mAP"
+  ylabel: "-log10(p-value)"
+  annotation_prefix: "Phenotypically consistent"
+  figsize: [8, 6]
+  dpi: 100
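The consistency config splits pipe-separated `Metadata_target` values into lists and enables `multilabel: true`. The sketch below shows what that split produces and how two compounds can be treated as sharing a target when their label lists overlap. It is a plain-Python illustration with hypothetical helper names and made-up sample-to-target assignments, not the runner's or copairs' actual code.

```python
def split_multilabel(rows, column, separator="|"):
    """Replace a separator-joined string with a list of labels."""
    return [{**row, column: row[column].split(separator)} for row in rows]


def share_label(a, b, column):
    """True if the two rows have at least one label in common."""
    return bool(set(a[column]) & set(b[column]))


# Illustrative rows only; the sample names and target assignments are made up.
rows = [
    {"Metadata_broad_sample": "BRD-A", "Metadata_target": "EGFR|ERBB2"},
    {"Metadata_broad_sample": "BRD-B", "Metadata_target": "EGFR"},
    {"Metadata_broad_sample": "BRD-C", "Metadata_target": "HDAC1"},
]
split_rows = split_multilabel(rows, "Metadata_target", separator="|")

a_and_b_share = share_label(split_rows[0], split_rows[1], "Metadata_target")
a_and_c_share = share_label(split_rows[0], split_rows[2], "Metadata_target")
```

This is why the split happens after `aggregate_replicates` in the config: grouping on the original string column is straightforward, whereas list-valued cells are not hashable and cannot serve as group keys directly.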