Changes from 2 commits
49 commits
328d232
feat(copairs_runner): add configurable YAML-driven runner for copairs…
shntnu Jul 9, 2025
ec785de
Update README.md
shntnu Jul 9, 2025
623b8a0
feat(copairs_runner): add preprocessing step and config-relative path…
shntnu Jul 9, 2025
67654dc
feat(copairs_runner): add lazy loading and standardize config naming
shntnu Jul 9, 2025
e05de00
fix: format
shntnu Jul 9, 2025
4f31ff1
docs(copairs_runner): update CLAUDE.md and README.md with current con…
shntnu Jul 9, 2025
8a22d4b
docs(copairs_runner): improve README configuration example
shntnu Jul 9, 2025
5d561f9
docs(copairs_runner): clarify lazy vs preprocessing filtering
shntnu Jul 9, 2025
8519b69
docs(copairs_runner): add CONTRIBUTING.md with preprocessing guidelines
shntnu Jul 9, 2025
1509b5e
docs(copairs_runner): emphasize copairs as end-stage analysis
shntnu Jul 9, 2025
12e3b81
Update libs/copairs_runner/run_examples.sh
shntnu Jul 9, 2025
9e793ea
fix(copairs_runner): address PR review comments
shntnu Jul 9, 2025
7537e9b
refactor(copairs_runner): switch to CWD-relative path resolution
shntnu Jul 10, 2025
050d3d2
refactor(copairs_runner): migrate from PyYAML to OmegaConf
shntnu Jul 10, 2025
0aeb68b
feat(copairs_runner): migrate from argparse to Hydra
shntnu Jul 10, 2025
e0960ab
feat(copairs_runner): configure Hydra to use existing output directory
shntnu Jul 10, 2025
6e10d9e
docs(copairs_runner): update documentation for Hydra migration
shntnu Jul 10, 2025
ae16e82
fix(copairs_runner): cleanup
shntnu Jul 10, 2025
3c8d354
fix(copairs_runner): typo
shntnu Jul 10, 2025
69caf31
refactor(copairs_runner): simplify code and improve documentation
shntnu Jul 10, 2025
d7ed157
feat(copairs_runner): implement Hydra best practices for path handling
shntnu Jul 10, 2025
832edaf
refactor(copairs_runner): implement unified output handling with fixe…
shntnu Jul 10, 2025
ead89f4
refactor(copairs_runner): rename 'data' config section to 'input' for…
shntnu Jul 10, 2025
16abb52
docs(copairs_runner): add design note on preprocessing list configura…
shntnu Jul 10, 2025
303a982
fix(copairs_runner): prevent Hydra runtime file overwrites in shared …
shntnu Jul 10, 2025
2d6d2b0
fix(copairs_runner): add observed=True to groupby and improve logging
shntnu Jul 10, 2025
992cdf7
feat(copairs_runner): add JUMP CPCNN example configs demonstrating sh…
shntnu Jul 10, 2025
61bd877
docs(copairs_runner): add ROADMAP.md with AI-assisted Hydra feature p…
shntnu Jul 10, 2025
e6e36a1
feat(copairs_runner): convert to installable package while keeping st…
shntnu Aug 4, 2025
e0c34c7
fix(copairs_runner): add missing __init__.py and fix wheel build config
shntnu Aug 4, 2025
fd1071e
fix: add quotes
shntnu Aug 4, 2025
a8cf8e0
refactor(copairs_runner): use dynamic versioning from __init__.py
shntnu Aug 4, 2025
7054308
fix(copairs_runner): remove hardcoded config path for package compati…
shntnu Aug 4, 2025
ffe737c
docs(copairs_runner): add concise help message for CLI usage
shntnu Aug 4, 2025
92f0b91
refactor(copairs_runner): finalize package structure and update docum…
shntnu Aug 5, 2025
b0193fb
fix(copairs_runner): use explicit .loc accessor to avoid pandas Setti…
shntnu Aug 5, 2025
80bf20d
fix(copairs_runner): add .copy() to prevent SettingWithCopyWarning wh…
shntnu Aug 5, 2025
b848ef1
feat(copairs_runner): add DuckDB support to merge_metadata preprocess…
shntnu Aug 13, 2025
7f5eb86
fix: empty string is None
shntnu Aug 14, 2025
f39f010
refactor(copairs_runner): simplify path handling using Hydra best pra…
shntnu Aug 15, 2025
70c7f29
feat(copairs_runner): add Parquet output format support
shntnu Aug 26, 2025
b5ec577
test: turn the run_examples.sh script into a minimal integration test…
shntnu Aug 26, 2025
e98c3a8
fix: drop claim about test passing because it is fairly loose (does n…
shntnu Aug 26, 2025
108863d
fix(copairs_runner): remove hardcoded threshold and add Parquet suppo…
shntnu Aug 27, 2025
6417bae
feat(copairs_runner): add support for modular preprocessing sections
shntnu Aug 27, 2025
8523bf0
Set read_only=true to allow concurrent reads
shntnu Aug 27, 2025
b93522b
feat(copairs_runner): add support for normalized mAP visualization
shntnu Sep 7, 2025
e322e23
fix(copairs_runner): fix font
shntnu Sep 8, 2025
48f744c
fix(copairs_runner): clarify
shntnu Sep 8, 2025
3 changes: 3 additions & 0 deletions libs/copairs_runner/.gitignore
@@ -0,0 +1,3 @@
input/
output/
.claude/settings.local.json
109 changes: 109 additions & 0 deletions libs/copairs_runner/CLAUDE.md
@@ -0,0 +1,109 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

copairs_runner is a configurable Python script for running copairs analyses on Cell Painting data. It is part of a larger monorepo focused on morphological profiling and drug discovery through cellular imaging.

## Key Commands

### Running Analyses
```bash
# Run analysis with a config file
uv run copairs_runner.py <config_file.yaml>

# Run with verbose logging
uv run copairs_runner.py <config_file.yaml> --verbose

# Run the example analyses (downloads data if needed)
bash run_examples.sh
```

### Development Commands
```bash
# Lint code using uvx (following monorepo standards)
uvx ruff check copairs_runner.py

# Auto-fix linting issues
uvx ruff check copairs_runner.py --fix

# Format code
uvx ruff format copairs_runner.py

# Run tests (when implemented)
pytest tests/
```

## Architecture

### Core Components
1. **copairs_runner.py**: Main script with inline dependencies (PEP 723)
   - `CopairsRunner` class: Handles data loading, preprocessing, analysis, and visualization
   - Key methods:
     - `run()`: Main pipeline orchestrator
     - `preprocess_data()`: Applies configurable preprocessing pipeline
     - `run_average_precision()`: Calculates AP for compound activity
     - `run_mean_average_precision()`: Calculates mAP with significance testing
     - `plot_map_results()`: Creates scatter plots of mAP vs -log10(p-value)
     - `save_results()`: Saves results to CSV/Parquet files

2. **Configuration System**: YAML-based configuration with sections for:
   - `data`: Input paths and metadata patterns
   - `preprocessing`: Pipeline steps (filtering, aggregation, etc.)
   - `average_precision`/`mean_average_precision`: Analysis parameters
   - `output`: Result file paths
   - `plotting`: Visualization settings
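
As a rough illustration of the metrics named in the methods above, the following sketch computes average precision (AP) for one ranked list and averages AP scores per group to get mAP. It is not the copairs implementation: the function names, the boolean-list input, and the example groups are assumptions made for illustration, and significance testing is left out.

```python
from statistics import mean


def average_precision(ranked_labels):
    """AP for one query profile.

    ranked_labels: booleans sorted by decreasing similarity to the query,
    True where the neighbour forms a positive pair with the query.
    """
    hits = 0
    precisions = []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            # Precision at this rank, recorded only at positive hits.
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0


def mean_average_precision(ap_by_group):
    """Average the per-profile AP scores within each group (e.g. compound)."""
    return {group: mean(scores) for group, scores in ap_by_group.items()}


# Hypothetical example: two compounds with two replicate queries each.
ap_by_group = {
    "compound_A": [
        average_precision([True, True, False, False]),
        average_precision([True, False, True, False]),
    ],
    "compound_B": [
        average_precision([False, True, False, True]),
        average_precision([False, False, True, True]),
    ],
}
map_scores = mean_average_precision(ap_by_group)
print(map_scores)
```

Positives ranked near the top push AP toward 1, which is why `compound_A` scores higher than `compound_B` in this toy input.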

### Preprocessing Pipeline
The runner supports these preprocessing steps (order determined by config):
1. `filter`: Apply pandas query expressions
2. `dropna`: Remove rows with NaN values in specified columns
3. `remove_nan_features`: Remove feature columns containing NaN
4. `split_multilabel`: Split pipe-separated values into lists
5. `filter_active`: Filter based on an activity CSV with a `below_corrected_p` column
6. `aggregate_replicates`: Aggregate by taking median of features
7. `merge_metadata`: Merge external CSV metadata
8. `filter_single_replicates`: Remove groups with fewer than `min_replicates` members
9. `apply_assign_reference`: Apply `copairs.matching.assign_reference_index`
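
To make "order determined by config" concrete, here is a minimal sketch that applies two of the steps above in sequence. It uses plain lists of dicts in place of a DataFrame so it has no third-party dependencies. The function bodies and the `groupby` parameter of `filter_single_replicates` are assumptions for illustration, not the runner's actual code.

```python
from collections import Counter

# Toy rows standing in for a profile table (hypothetical values).
rows = [
    {"Metadata_compound": "A", "Metadata_target": "EGFR|KDR"},
    {"Metadata_compound": "A", "Metadata_target": "EGFR|KDR"},
    {"Metadata_compound": "B", "Metadata_target": "TP53"},
]


def split_multilabel(rows, column, separator="|"):
    """Turn a separator-delimited string column into a list of labels."""
    return [{**row, column: row[column].split(separator)} for row in rows]


def filter_single_replicates(rows, groupby, min_replicates=2):
    """Drop rows whose group has fewer than min_replicates members."""
    counts = Counter(row[groupby] for row in rows)
    return [row for row in rows if counts[row[groupby]] >= min_replicates]


# Steps run in the order they are listed, mirroring a config's list order.
steps = [
    (filter_single_replicates, {"groupby": "Metadata_compound", "min_replicates": 2}),
    (split_multilabel, {"column": "Metadata_target", "separator": "|"}),
]

processed = rows
for step, params in steps:
    processed = step(processed, **params)

print(processed)
```

Compound `B` has a single row, so it is dropped before the target column is split.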

## Important Context

### Monorepo Standards
This project is part of a monorepo that uses:
- **uv** for package management (transitioning from Poetry)
- **ruff** for formatting and linting
- **pytest** for testing (>90% coverage target)
- **numpy** documentation style
- Conventional commits for commit messages

### Current State
- The script uses inline dependencies (PEP 723 format)
- Has a minimal pyproject.toml for ruff configuration
- No test suite exists
- Examples use LINCS Cell Painting data from GitHub
- Configuration files demonstrate typical usage patterns

### Dependencies
Required packages (from inline script metadata):
- python >= 3.8
- pandas, numpy, copairs, pyyaml, matplotlib, seaborn

## Common Tasks

### Adding New Preprocessing Steps
1. Implement a new method `_preprocess_<step_name>` in `CopairsRunner` class
2. The method should accept `df` and `params` arguments
3. Add documentation for the new step in the `preprocess_data()` docstring
4. Use the step in your YAML config with `type: <step_name>`
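
The steps above imply a name-based dispatch from `type: <step_name>` to a `_preprocess_<step_name>` method. A minimal sketch of that pattern follows. `MiniRunner`, the `drop_low_dose` step, and its `min_dose` parameter are hypothetical, and a list of dicts stands in for `df`; the real `CopairsRunner` may dispatch differently.

```python
class MiniRunner:
    """Hypothetical stand-in for CopairsRunner, showing only the dispatch."""

    def preprocess_data(self, df, steps):
        for step in steps:
            # Resolve `type: <step_name>` to the method `_preprocess_<step_name>`.
            method = getattr(self, f"_preprocess_{step['type']}", None)
            if method is None:
                raise ValueError(f"Unknown preprocessing step: {step['type']}")
            df = method(df, step.get("params", {}))
        return df

    def _preprocess_drop_low_dose(self, df, params):
        # Hypothetical new step: keep rows at or above a minimum dose.
        return [row for row in df if row["Metadata_dose"] >= params["min_dose"]]


runner = MiniRunner()
data = [{"Metadata_dose": 0.05}, {"Metadata_dose": 1.0}]
config_steps = [{"type": "drop_low_dose", "params": {"min_dose": 0.1}}]
result = runner.preprocess_data(data, config_steps)
print(result)
```

With this pattern, adding a step needs no registry change: defining the method is enough for the config to reach it.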

### Creating New Analysis Configs
1. Copy an existing config from `configs/`
2. Modify data paths and preprocessing steps
3. Adjust analysis parameters as needed
4. Run with: `uv run copairs_runner.py your_config.yaml`

### Debugging
- Use `--verbose` flag for detailed logging
- Check intermediate results with `save_intermediate: true` in preprocessing
- Examine output CSV files for analysis results
70 changes: 70 additions & 0 deletions libs/copairs_runner/README.md
@@ -0,0 +1,70 @@
# Copairs Runner

YAML-driven runner for [copairs](https://github.com/broadinstitute/copairs).

## Usage

```bash
uv run copairs_runner.py config.yaml
```

## Configuration

```yaml
# Required
data:
  path: "data.csv"  # or .parquet

average_precision:
  params:
    pos_sameby: ["Metadata_compound"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_compound"]

output:
  path: "results.csv"

# Optional
preprocessing:
  - type: filter
    params:
      query: "Metadata_dose > 0.1"

mean_average_precision:
  params:
    sameby: ["Metadata_compound"]
    null_size: 1000000
    threshold: 0.05
    seed: 0

plotting:
  enabled: true
  path: "plot.png"
```
```
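
The configuration above marks `data`, `average_precision`, and `output` as required. A small check along these lines could catch a missing section before an analysis starts. This is a sketch operating on an already-parsed config dict; `missing_sections` is a hypothetical helper, not part of the runner.

```python
# Sections marked "Required" in the configuration example.
REQUIRED_SECTIONS = ("data", "average_precision", "output")


def missing_sections(config):
    """Return the required top-level sections absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Parsed form of a minimal config (as a YAML loader would return it).
config = {
    "data": {"path": "data.csv"},
    "average_precision": {
        "params": {
            "pos_sameby": ["Metadata_compound"],
            "pos_diffby": [],
            "neg_sameby": [],
            "neg_diffby": ["Metadata_compound"],
        }
    },
}

problems = missing_sections(config)
if problems:
    print(f"Config is missing required sections: {problems}")
```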

## Preprocessing Steps

- `filter`: Filter rows with pandas query
- `dropna`: Remove rows with NaN
- `aggregate_replicates`: Median aggregation by group
- `merge_metadata`: Join external CSV
- `split_multilabel`: Split pipe-separated values
- See `copairs_runner.py` docstring for complete list

## Examples

- `configs/activity_analysis.yaml`: Phenotypic activity
- `configs/consistency_analysis.yaml`: Target consistency

Run both: `./run_examples.sh`

### Example Output

The runner generates scatter plots showing mean average precision (mAP) vs statistical significance:

**Phenotypic Activity Assessment:**
![Activity Plot](examples/example_activity_plot.png)

**Phenotypic Consistency (Target-based):**
![Consistency Plot](examples/example_consistency_plot.png)
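
For reading the y-axis of these plots, a short sketch of the -log10 transform and the significance cutoff may help. The helper names are hypothetical, and the strict `<` comparison against `threshold` is an assumption rather than something confirmed from copairs.

```python
import math


def neg_log10(p_value):
    """Map a p-value to the plot's y-axis scale; smaller p gives a larger value."""
    return -math.log10(p_value)


def is_significant(corrected_p, threshold=0.05):
    """Assumed rule: significant when the corrected p-value is below threshold."""
    return corrected_p < threshold


# With threshold 0.05, the cutoff sits at about 1.3 on the y-axis.
cutoff_y = neg_log10(0.05)
print(round(cutoff_y, 3))
```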
51 changes: 51 additions & 0 deletions libs/copairs_runner/configs/activity_analysis.yaml
@@ -0,0 +1,51 @@
# Configuration for phenotypic activity analysis
# Matches the phenotypic_activity.ipynb example

data:
  path: "input/2016_04_01_a549_48hr_batch1_plateSQ00014812.csv"
  metadata_regex: "^Metadata"

preprocessing:
  # Remove constant columns (as done in notebook)
  # Note: This is handled differently in the runner, but we can achieve similar results

  # Assign reference index for controls (DMSO)
  - type: apply_assign_reference
    params:
      condition: "Metadata_broad_sample == 'DMSO'"
      reference_col: "Metadata_reference_index"
      default_value: -1

average_precision:
  params:
    # Positive pairs: replicates of the same compound
    pos_sameby: ["Metadata_broad_sample", "Metadata_reference_index"]
    pos_diffby: []

    # Negative pairs: compound vs control
    neg_sameby: []
    neg_diffby: ["Metadata_broad_sample", "Metadata_reference_index"]

    # Using default distance (cosine) as in notebook

mean_average_precision:
  params:
    sameby: ["Metadata_broad_sample"]  # Group by compound
    null_size: 1000000  # As used in notebook
    threshold: 0.05
    seed: 0  # As used in notebook

output:
  path: "output/activity_map_runner.csv"
  save_ap_scores: true  # Save AP scores to match notebook output

plotting:
  enabled: true
  path: "output/map_activity_plot.png"
  format: "png"  # or pdf, svg, etc.
  title: "Phenotypic Activity Assessment"
  xlabel: "mAP"
  ylabel: "-log10(p-value)"
  annotation_prefix: "Phenotypically active"
  figsize: [8, 6]
  dpi: 100
67 changes: 67 additions & 0 deletions libs/copairs_runner/configs/consistency_analysis.yaml
@@ -0,0 +1,67 @@
# Configuration for phenotypic consistency analysis
# Matches the phenotypic_consistency.ipynb example

data:
  path: "input/2016_04_01_a549_48hr_batch1_plateSQ00014812.csv"
  metadata_regex: "^Metadata"

preprocessing:
  # Filter to only active compounds based on activity analysis
  - type: filter_active
    params:
      activity_csv: "output/activity_map_runner.csv"
      on_column: "Metadata_broad_sample"

  # Remove rows with missing targets (implicit in notebook via query)
  - type: dropna
    params:
      columns: ["Metadata_target"]

  # Aggregate replicates by taking median of features (as done in notebook)
  - type: aggregate_replicates
    params:
      groupby: ["Metadata_broad_sample", "Metadata_target"]

  # Split the pipe-separated target values into lists for multilabel analysis
  - type: split_multilabel
    params:
      column: "Metadata_target"
      separator: "|"

average_precision:
  # Use multilabel since compounds have multiple targets (separated by |)
  multilabel: true

  params:
    # Positive pairs: compounds sharing the same target
    pos_sameby: ["Metadata_target"]
    pos_diffby: []

    # Negative pairs: compounds with different targets
    neg_sameby: []
    neg_diffby: ["Metadata_target"]

    # For multilabel analysis, specify the column
    multilabel_col: "Metadata_target"

mean_average_precision:
  params:
    sameby: ["Metadata_target"]  # Group by target
    null_size: 1000000  # As used in notebook
    threshold: 0.05
    seed: 0  # As used in notebook

output:
  path: "output/target_maps_runner.csv"
  save_ap_scores: false

plotting:
  enabled: true
  path: "output/map_consistency_plot.png"
  format: "png"  # or pdf, svg, etc.
  title: "Phenotypic Consistency Assessment"
  xlabel: "mAP"
  ylabel: "-log10(p-value)"
  annotation_prefix: "Phenotypically consistent"
  figsize: [8, 6]
  dpi: 100