Commits
49 commits
328d232
feat(copairs_runner): add configurable YAML-driven runner for copairs…
shntnu Jul 9, 2025
ec785de
Update README.md
shntnu Jul 9, 2025
623b8a0
feat(copairs_runner): add preprocessing step and config-relative path…
shntnu Jul 9, 2025
67654dc
feat(copairs_runner): add lazy loading and standardize config naming
shntnu Jul 9, 2025
e05de00
fix: format
shntnu Jul 9, 2025
4f31ff1
docs(copairs_runner): update CLAUDE.md and README.md with current con…
shntnu Jul 9, 2025
8a22d4b
docs(copairs_runner): improve README configuration example
shntnu Jul 9, 2025
5d561f9
docs(copairs_runner): clarify lazy vs preprocessing filtering
shntnu Jul 9, 2025
8519b69
docs(copairs_runner): add CONTRIBUTING.md with preprocessing guidelines
shntnu Jul 9, 2025
1509b5e
docs(copairs_runner): emphasize copairs as end-stage analysis
shntnu Jul 9, 2025
12e3b81
Update libs/copairs_runner/run_examples.sh
shntnu Jul 9, 2025
9e793ea
fix(copairs_runner): address PR review comments
shntnu Jul 9, 2025
7537e9b
refactor(copairs_runner): switch to CWD-relative path resolution
shntnu Jul 10, 2025
050d3d2
refactor(copairs_runner): migrate from PyYAML to OmegaConf
shntnu Jul 10, 2025
0aeb68b
feat(copairs_runner): migrate from argparse to Hydra
shntnu Jul 10, 2025
e0960ab
feat(copairs_runner): configure Hydra to use existing output directory
shntnu Jul 10, 2025
6e10d9e
docs(copairs_runner): update documentation for Hydra migration
shntnu Jul 10, 2025
ae16e82
fix(copairs_runner): cleanup
shntnu Jul 10, 2025
3c8d354
fix(copairs_runner): typo
shntnu Jul 10, 2025
69caf31
refactor(copairs_runner): simplify code and improve documentation
shntnu Jul 10, 2025
d7ed157
feat(copairs_runner): implement Hydra best practices for path handling
shntnu Jul 10, 2025
832edaf
refactor(copairs_runner): implement unified output handling with fixe…
shntnu Jul 10, 2025
ead89f4
refactor(copairs_runner): rename 'data' config section to 'input' for…
shntnu Jul 10, 2025
16abb52
docs(copairs_runner): add design note on preprocessing list configura…
shntnu Jul 10, 2025
303a982
fix(copairs_runner): prevent Hydra runtime file overwrites in shared …
shntnu Jul 10, 2025
2d6d2b0
fix(copairs_runner): add observed=True to groupby and improve logging
shntnu Jul 10, 2025
992cdf7
feat(copairs_runner): add JUMP CPCNN example configs demonstrating sh…
shntnu Jul 10, 2025
61bd877
docs(copairs_runner): add ROADMAP.md with AI-assisted Hydra feature p…
shntnu Jul 10, 2025
e6e36a1
feat(copairs_runner): convert to installable package while keeping st…
shntnu Aug 4, 2025
e0c34c7
fix(copairs_runner): add missing __init__.py and fix wheel build config
shntnu Aug 4, 2025
fd1071e
fix: add quotes
shntnu Aug 4, 2025
a8cf8e0
refactor(copairs_runner): use dynamic versioning from __init__.py
shntnu Aug 4, 2025
7054308
fix(copairs_runner): remove hardcoded config path for package compati…
shntnu Aug 4, 2025
ffe737c
docs(copairs_runner): add concise help message for CLI usage
shntnu Aug 4, 2025
92f0b91
refactor(copairs_runner): finalize package structure and update docum…
shntnu Aug 5, 2025
b0193fb
fix(copairs_runner): use explicit .loc accessor to avoid pandas Setti…
shntnu Aug 5, 2025
80bf20d
fix(copairs_runner): add .copy() to prevent SettingWithCopyWarning wh…
shntnu Aug 5, 2025
b848ef1
feat(copairs_runner): add DuckDB support to merge_metadata preprocess…
shntnu Aug 13, 2025
7f5eb86
fix: empty string is None
shntnu Aug 14, 2025
f39f010
refactor(copairs_runner): simplify path handling using Hydra best pra…
shntnu Aug 15, 2025
70c7f29
feat(copairs_runner): add Parquet output format support
shntnu Aug 26, 2025
b5ec577
test: turn the run_examples.sh script into a minimal integration test…
shntnu Aug 26, 2025
e98c3a8
fix: drop claim about test passing because it is fairly loose (does n…
shntnu Aug 26, 2025
108863d
fix(copairs_runner): remove hardcoded threshold and add Parquet suppo…
shntnu Aug 27, 2025
6417bae
feat(copairs_runner): add support for modular preprocessing sections
shntnu Aug 27, 2025
8523bf0
Set read_only=true to allow concurrent reads
shntnu Aug 27, 2025
b93522b
feat(copairs_runner): add support for normalized mAP visualization
shntnu Sep 7, 2025
e322e23
fix(copairs_runner): fix font
shntnu Sep 8, 2025
48f744c
fix(copairs_runner): clarify
shntnu Sep 8, 2025
3 changes: 3 additions & 0 deletions libs/copairs_runner/.gitignore
@@ -0,0 +1,3 @@
input/
output/
.claude/settings.local.json
185 changes: 185 additions & 0 deletions libs/copairs_runner/CLAUDE.md
@@ -0,0 +1,185 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

copairs_runner is a configurable Python script for running copairs analyses on Cell Painting data. It's part of a larger monorepo focused on morphological profiling and drug discovery through cellular imaging.

## Key Commands

### Running Analyses
```bash
# Run analysis with a config file
uv run copairs_runner.py <config_file.yaml>

# Run with verbose logging
uv run copairs_runner.py <config_file.yaml> --verbose

# Run the example analyses (downloads data if needed)
bash run_examples.sh
```

### Development Commands
```bash
# Lint code using uvx (following monorepo standards)
uvx ruff check copairs_runner.py

# Auto-fix linting issues
uvx ruff check copairs_runner.py --fix

# Format code
uvx ruff format copairs_runner.py

# Run tests (when implemented)
pytest tests/
```

## Architecture

### Core Components
1. **copairs_runner.py**: Main script with inline dependencies (PEP 723)
- `CopairsRunner` class: Handles data loading, preprocessing, analysis, and visualization
- Key methods:
- `run()`: Main pipeline orchestrator
- `load_data()`: Supports CSV/Parquet from local files, URLs, and S3
- `preprocess_data()`: Applies configurable preprocessing pipeline
- `run_average_precision()`: Calculates AP for compound activity
- `run_mean_average_precision()`: Calculates mAP with significance testing
- `plot_map_results()`: Creates scatter plots of mAP vs -log10(p-value)
- `save_results()`: Saves results to CSV/Parquet files

2. **Configuration System**: YAML-based configuration with sections for:
- `data`: Input paths, metadata patterns, and lazy loading options
- `preprocessing`: Pipeline steps (filtering, aggregation, etc.)
- `average_precision`/`mean_average_precision`: Analysis parameters
- `output`: Result file paths
- `plotting`: Visualization settings

### Preprocessing Pipeline
The runner supports these preprocessing steps (order determined by config):
1. `filter`: Apply pandas query expressions
2. `dropna`: Remove rows with NaN values in specified columns
3. `remove_nan_features`: Remove feature columns containing NaN
4. `split_multilabel`: Split pipe-separated values into lists
5. `filter_active`: Filter based on activity CSV with below_corrected_p column
6. `aggregate_replicates`: Aggregate by taking median of features
7. `merge_metadata`: Merge external CSV metadata
8. `filter_single_replicates`: Remove groups with < min_replicates members
9. `apply_assign_reference`: Apply copairs.matching.assign_reference_index
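
The steps above can be sketched as a small config-driven dispatcher. This is a hypothetical mini-implementation for illustration only; the class name, step schema, and method bodies are assumptions, not `CopairsRunner`'s actual code:

```python
# Hypothetical mini-dispatcher; CopairsRunner's real implementation may differ.
from typing import Any, Dict, List

import pandas as pd


class MiniRunner:
    def _preprocess_filter(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
        # Apply a pandas query expression to keep matching rows.
        return df.query(params["query"])

    def _preprocess_dropna(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
        # Drop rows with NaN in the listed columns.
        return df.dropna(subset=params["columns"])

    def preprocess_data(self, df: pd.DataFrame, steps: List[Dict[str, Any]]) -> pd.DataFrame:
        # Each step's "type" names a method; execution order comes from the config.
        for step in steps:
            method = getattr(self, f"_preprocess_{step['type']}")
            df = method(df, step.get("params", {}))
        return df


df = pd.DataFrame({"Metadata_dose": [0.05, 0.5, None], "feature1": [1.0, 2.0, 3.0]})
steps = [
    {"type": "dropna", "params": {"columns": ["Metadata_dose"]}},
    {"type": "filter", "params": {"query": "Metadata_dose > 0.1"}},
]
result = MiniRunner().preprocess_data(df, steps)
```

The `getattr`-based lookup is what lets new `_preprocess_<name>` methods become available in YAML configs without touching the pipeline loop.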

## Important Context

### Monorepo Standards
This project is part of a monorepo that uses:
- **uv** for package management (transitioning from Poetry)
- **ruff** for formatting and linting
- **pytest** for testing (>90% coverage target)
- **numpy** documentation style
- Conventional commits for commit messages

### Current State
- The script uses inline dependencies (PEP 723 format)
- Has a minimal pyproject.toml for ruff configuration
- No test suite exists yet
- Examples use LINCS Cell Painting data from GitHub
- Supports lazy loading for large parquet files using polars
- Configuration files demonstrate typical usage patterns

### Dependencies
Required packages (from inline script metadata):
- python >= 3.8
- pandas, numpy, copairs, pyyaml, pyarrow, matplotlib, seaborn, polars

### Data Loading Capabilities
- Supports local files, HTTP URLs, and S3 paths
- Automatic data download and caching for URLs
- Lazy loading for large parquet files with polars
- Path resolution relative to config file location
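
A hedged sketch of the extension-based dispatch implied above (illustrative only; the runner's actual loader also handles caching and lazy polars paths):

```python
# Illustrative loader sketch; not the runner's actual implementation.
import os
import tempfile

import pandas as pd


def load_table(path: str) -> pd.DataFrame:
    """Load CSV or Parquet; pandas resolves http(s)/s3 paths via fsspec."""
    if path.endswith(".parquet"):
        return pd.read_parquet(path)
    return pd.read_csv(path)


# Demo with a throwaway local CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "profiles.csv")
    pd.DataFrame({"Metadata_compound": ["c1", "c2"]}).to_csv(demo, index=False)
    table = load_table(demo)
```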

## Common Tasks

### Adding New Preprocessing Steps
1. Implement a new method `_preprocess_<step_name>` in `CopairsRunner` class
2. The method should accept `df` and `params` arguments
3. Add documentation for the new step in the `preprocess_data()` docstring
4. Use the step in your YAML config with `type: <step_name>`
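
For instance, steps 1-2 might produce something like this hypothetical `_preprocess_drop_constant_features` (name, parameters, and behavior invented for illustration; it is not part of the actual runner):

```python
# Hypothetical step following the recipe above; not an existing runner step.
from typing import Any, Dict

import pandas as pd


def _preprocess_drop_constant_features(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """Drop non-metadata columns that hold a single unique value."""
    prefix = params.get("metadata_prefix", "Metadata")
    meta = [c for c in df.columns if c.startswith(prefix)]
    features = [c for c in df.columns if c not in meta]
    constant = [c for c in features if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


# self is unused here, so None stands in for the runner instance in this demo.
df = pd.DataFrame({"Metadata_plate": ["p1", "p2"], "f1": [1.0, 1.0], "f2": [0.1, 0.2]})
cleaned = _preprocess_drop_constant_features(None, df, {})
```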

### Creating New Analysis Configs
1. Copy an existing config from `configs/`
2. Modify data paths and preprocessing steps
3. Adjust analysis parameters as needed
4. Run with: `uv run copairs_runner.py your_config.yaml`

### Working with Large Datasets
For memory-efficient processing:
1. Use lazy filtering in the data config for parquet files:
```yaml
data:
  path: "huge_dataset.parquet"
  use_lazy_filter: true
  filter_query: "Metadata_PlateType == 'TARGET2'"  # SQL syntax
  columns: ["Metadata_compound", "feature1", "feature2"]  # optional
```
This filters BEFORE loading into memory using polars.

2. For standard filtering after loading, use preprocessing:
```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"  # pandas query syntax
```

3. Enable `save_intermediate: true` in preprocessing for debugging

Note: lazy filtering uses SQL syntax (polars), while preprocessing uses pandas query syntax.

### Debugging
- Use `--verbose` flag for detailed logging
- Check intermediate results with `save_intermediate: true` in preprocessing
- Examine output CSV files for analysis results
- Review preprocessing logs to understand data transformations

## Configuration Examples

### Minimal Activity Analysis
```yaml
data:
  path: "path/to/profiles.parquet"
  # metadata_regex: "^Metadata"  # optional, this is the default

average_precision:
  params:
    pos_sameby: ["Metadata_broad_sample"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_broad_sample", "Metadata_Plate"]

mean_average_precision:
  params:
    sameby: ["Metadata_broad_sample"]
    null_size: 1000000
    threshold: 0.05
    seed: 0

output:
  path: "results/map_results.csv"
```
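
To make the `pos_`/`neg_` parameters concrete, here is a toy reading of `sameby`/`diffby` as pair constraints. This is an illustrative assumption about the semantics, not copairs' implementation, which is vectorized and more general:

```python
# Toy illustration of sameby/diffby pair selection.
import itertools

import pandas as pd

meta = pd.DataFrame(
    {
        "Metadata_broad_sample": ["A", "A", "B"],
        "Metadata_Plate": ["p1", "p2", "p1"],
    }
)


def pairs(df, sameby, diffby):
    out = []
    for i, j in itertools.combinations(df.index, 2):
        same_ok = all(df.at[i, c] == df.at[j, c] for c in sameby)
        diff_ok = all(df.at[i, c] != df.at[j, c] for c in diffby)
        if same_ok and diff_ok:
            out.append((i, j))
    return out


# Positives: replicates sharing the same compound sample.
pos = pairs(meta, ["Metadata_broad_sample"], [])
# Negatives: different sample AND different plate.
neg = pairs(meta, [], ["Metadata_broad_sample", "Metadata_Plate"])
```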

### Advanced Preprocessing Pipeline
```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_broad_sample != 'DMSO'"
    - type: aggregate_replicates
      params:
        groupby: ["Metadata_broad_sample", "Metadata_Plate"]
    - type: apply_assign_reference
      params:
        reference_query: "Metadata_broad_sample == 'DMSO'"
        not_reference_query: "Metadata_broad_sample != 'DMSO'"
```
57 changes: 57 additions & 0 deletions libs/copairs_runner/CONTRIBUTING.md
@@ -0,0 +1,57 @@
# Contributing to copairs_runner

## Preprocessing Steps

The preprocessing pipeline intentionally provides a minimal DSL to avoid recreating pandas/SQL in YAML. Before adding new steps, consider whether users should handle the transformation externally.

**Important context**: Copairs analysis typically happens at the end of a morphological profiling pipeline. By this stage, your data should already be:
- Quality-controlled and normalized
- Aggregated to appropriate levels
- Filtered for relevant samples
- Properly annotated with metadata

If you find yourself needing extensive preprocessing here, it likely indicates issues with your upstream pipeline.

### Alternatives to New Steps

1. **Lazy filtering** - For large parquet files, use polars' SQL syntax before loading:
```yaml
data:
  use_lazy_filter: true
  filter_query: "Metadata_PlateType == 'TARGET2'"
```

2. **External preprocessing** - Complex transformations belong in Python/SQL scripts, not YAML configs

3. **Composition** - Combine existing steps rather than creating specialized ones

### When to Add a Step

Add a step only if it:
- Integrates with copairs-specific functionality (e.g., `apply_assign_reference`)
- Handles last-mile transformations specific to copairs analysis
- Requires runner context (resolved paths, metadata patterns)
- Has been requested by multiple users

Remember: needing complex preprocessing at this stage often indicates upstream processing gaps.

### Implementation

```python
def _preprocess_<step_name>(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """One-line description."""
    # Implementation
    logger.info("Log what happened")
    return df
```

Update the `preprocess_data()` docstring with parameters and add a usage example.
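
As a concrete (hypothetical) instance of the template, a clipping step might look like the following. The step name, `bound` parameter, and metadata convention are assumptions for illustration, not existing runner code:

```python
# Hypothetical example of the template above; not an existing runner step.
from typing import Any, Dict

import pandas as pd


def _preprocess_clip_features(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """Clip feature columns to a symmetric bound."""
    bound = params["bound"]
    features = [c for c in df.columns if not c.startswith("Metadata")]
    df = df.copy()  # avoid SettingWithCopyWarning on views
    df[features] = df[features].clip(lower=-bound, upper=bound)
    return df


# self is unused, so None stands in for the runner instance in this demo.
df = pd.DataFrame({"Metadata_well": ["A01"], "f1": [12.5]})
clipped = _preprocess_clip_features(None, df, {"bound": 5.0})
```

Note it stays within the design constraints below: single responsibility, a handful of lines, and an explicit `.copy()`.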

### Design Constraints

- Keep implementations under ~10 lines
- Single responsibility per step
- Clear parameter validation
- Informative error messages

The goal is providing just enough convenience without creating a parallel data manipulation framework. Most preprocessing should happen before data reaches this runner.
81 changes: 81 additions & 0 deletions libs/copairs_runner/README.md
@@ -0,0 +1,81 @@
# Copairs Runner

YAML-driven runner for [copairs](https://github.com/broadinstitute/copairs).

## Usage

```bash
uv run copairs_runner.py config.yaml
```

## Configuration

```yaml
# Required sections
data:
  path: "data.csv"  # or .parquet, URLs, S3 paths

  # For large parquet files - filter BEFORE loading into memory:
  # use_lazy_filter: true
  # filter_query: "Metadata_PlateType == 'TARGET2'"  # SQL syntax
  # columns: ["Metadata_col1", "feature_1", "feature_2"]  # optional

# Optional sections
preprocessing:
  steps:
    # Standard filtering - happens AFTER data is loaded:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"  # pandas query syntax

average_precision:
  params:
    pos_sameby: ["Metadata_compound"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_compound"]

output:
  path: "results.csv"

mean_average_precision:
  params:
    sameby: ["Metadata_compound"]
    null_size: 10000  # Typically 10000-100000
    threshold: 0.05
    seed: 0

plotting:
  enabled: true
  path: "plot.png"
```

## Preprocessing Steps

- `filter`: Filter rows with pandas query
- `dropna`: Remove rows with NaN
- `aggregate_replicates`: Median aggregation by group
- `merge_metadata`: Join external CSV
- `split_multilabel`: Split pipe-separated values
- See the `copairs_runner.py` docstring for the complete list
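
A small illustrative pipeline combining two of the steps above (parameter names follow the examples elsewhere in this repo's docs; treat the exact values as assumptions):

```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_compound != 'DMSO'"
    - type: aggregate_replicates
      params:
        groupby: ["Metadata_compound", "Metadata_Plate"]
```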

## Examples

- `configs/example_activity_lincs.yaml`: Phenotypic activity
- `configs/example_consistency_lincs.yaml`: Target consistency

Run all examples: `./run_examples.sh`

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on adding preprocessing steps.

### Example Output

The runner generates scatter plots showing mean average precision (mAP) vs statistical significance:

**Phenotypic Activity Assessment:**
![Activity Plot](examples/example_activity_plot.png)

**Phenotypic Consistency (Target-based):**
![Consistency Plot](examples/example_consistency_plot.png)