Commits
49 commits
328d232
feat(copairs_runner): add configurable YAML-driven runner for copairs…
shntnu Jul 9, 2025
ec785de
Update README.md
shntnu Jul 9, 2025
623b8a0
feat(copairs_runner): add preprocessing step and config-relative path…
shntnu Jul 9, 2025
67654dc
feat(copairs_runner): add lazy loading and standardize config naming
shntnu Jul 9, 2025
e05de00
fix: format
shntnu Jul 9, 2025
4f31ff1
docs(copairs_runner): update CLAUDE.md and README.md with current con…
shntnu Jul 9, 2025
8a22d4b
docs(copairs_runner): improve README configuration example
shntnu Jul 9, 2025
5d561f9
docs(copairs_runner): clarify lazy vs preprocessing filtering
shntnu Jul 9, 2025
8519b69
docs(copairs_runner): add CONTRIBUTING.md with preprocessing guidelines
shntnu Jul 9, 2025
1509b5e
docs(copairs_runner): emphasize copairs as end-stage analysis
shntnu Jul 9, 2025
12e3b81
Update libs/copairs_runner/run_examples.sh
shntnu Jul 9, 2025
9e793ea
fix(copairs_runner): address PR review comments
shntnu Jul 9, 2025
7537e9b
refactor(copairs_runner): switch to CWD-relative path resolution
shntnu Jul 10, 2025
050d3d2
refactor(copairs_runner): migrate from PyYAML to OmegaConf
shntnu Jul 10, 2025
0aeb68b
feat(copairs_runner): migrate from argparse to Hydra
shntnu Jul 10, 2025
e0960ab
feat(copairs_runner): configure Hydra to use existing output directory
shntnu Jul 10, 2025
6e10d9e
docs(copairs_runner): update documentation for Hydra migration
shntnu Jul 10, 2025
ae16e82
fix(copairs_runner): cleanup
shntnu Jul 10, 2025
3c8d354
fix(copairs_runner): typo
shntnu Jul 10, 2025
69caf31
refactor(copairs_runner): simplify code and improve documentation
shntnu Jul 10, 2025
d7ed157
feat(copairs_runner): implement Hydra best practices for path handling
shntnu Jul 10, 2025
832edaf
refactor(copairs_runner): implement unified output handling with fixe…
shntnu Jul 10, 2025
ead89f4
refactor(copairs_runner): rename 'data' config section to 'input' for…
shntnu Jul 10, 2025
16abb52
docs(copairs_runner): add design note on preprocessing list configura…
shntnu Jul 10, 2025
303a982
fix(copairs_runner): prevent Hydra runtime file overwrites in shared …
shntnu Jul 10, 2025
2d6d2b0
fix(copairs_runner): add observed=True to groupby and improve logging
shntnu Jul 10, 2025
992cdf7
feat(copairs_runner): add JUMP CPCNN example configs demonstrating sh…
shntnu Jul 10, 2025
61bd877
docs(copairs_runner): add ROADMAP.md with AI-assisted Hydra feature p…
shntnu Jul 10, 2025
e6e36a1
feat(copairs_runner): convert to installable package while keeping st…
shntnu Aug 4, 2025
e0c34c7
fix(copairs_runner): add missing __init__.py and fix wheel build config
shntnu Aug 4, 2025
fd1071e
fix: add quotes
shntnu Aug 4, 2025
a8cf8e0
refactor(copairs_runner): use dynamic versioning from __init__.py
shntnu Aug 4, 2025
7054308
fix(copairs_runner): remove hardcoded config path for package compati…
shntnu Aug 4, 2025
ffe737c
docs(copairs_runner): add concise help message for CLI usage
shntnu Aug 4, 2025
92f0b91
refactor(copairs_runner): finalize package structure and update docum…
shntnu Aug 5, 2025
b0193fb
fix(copairs_runner): use explicit .loc accessor to avoid pandas Setti…
shntnu Aug 5, 2025
80bf20d
fix(copairs_runner): add .copy() to prevent SettingWithCopyWarning wh…
shntnu Aug 5, 2025
b848ef1
feat(copairs_runner): add DuckDB support to merge_metadata preprocess…
shntnu Aug 13, 2025
7f5eb86
fix: empty string is None
shntnu Aug 14, 2025
f39f010
refactor(copairs_runner): simplify path handling using Hydra best pra…
shntnu Aug 15, 2025
70c7f29
feat(copairs_runner): add Parquet output format support
shntnu Aug 26, 2025
b5ec577
test: turn the run_examples.sh script into a minimal integration test…
shntnu Aug 26, 2025
e98c3a8
fix: drop claim about test passing because it is fairly loose (does n…
shntnu Aug 26, 2025
108863d
fix(copairs_runner): remove hardcoded threshold and add Parquet suppo…
shntnu Aug 27, 2025
6417bae
feat(copairs_runner): add support for modular preprocessing sections
shntnu Aug 27, 2025
8523bf0
Set read_only=true to allow concurrent reads
shntnu Aug 27, 2025
b93522b
feat(copairs_runner): add support for normalized mAP visualization
shntnu Sep 7, 2025
e322e23
fix(copairs_runner): fix font
shntnu Sep 8, 2025
48f744c
fix(copairs_runner): clarify
shntnu Sep 8, 2025
3 changes: 3 additions & 0 deletions libs/copairs_runner/.gitignore
@@ -0,0 +1,3 @@
input/
output/
.claude/settings.local.json
185 changes: 185 additions & 0 deletions libs/copairs_runner/CLAUDE.md
@@ -0,0 +1,185 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

copairs_runner is a configurable Python script for running copairs analyses on Cell Painting data. It's part of a larger monorepo focused on morphological profiling and drug discovery through cellular imaging.

## Key Commands

### Running Analyses
```bash
# Run analysis with a config file
uv run copairs_runner.py <config_file.yaml>

# Run with verbose logging
uv run copairs_runner.py <config_file.yaml> --verbose

# Run the example analyses (downloads data if needed)
bash run_examples.sh
```

### Development Commands
```bash
# Lint code using uvx (following monorepo standards)
uvx ruff check copairs_runner.py

# Auto-fix linting issues
uvx ruff check copairs_runner.py --fix

# Format code
uvx ruff format copairs_runner.py

# Run tests (when implemented)
pytest tests/
```

## Architecture

### Core Components
1. **copairs_runner.py**: Main script with inline dependencies (PEP 723)
- `CopairsRunner` class: Handles data loading, preprocessing, analysis, and visualization
- Key methods:
- `run()`: Main pipeline orchestrator
- `load_data()`: Supports CSV/Parquet from local files, URLs, and S3
- `preprocess_data()`: Applies configurable preprocessing pipeline
- `run_average_precision()`: Calculates AP for compound activity
- `run_mean_average_precision()`: Calculates mAP with significance testing
- `plot_map_results()`: Creates scatter plots of mAP vs -log10(p-value)
- `save_results()`: Saves results to CSV/Parquet files

2. **Configuration System**: YAML-based configuration with sections for:
- `data`: Input paths, metadata patterns, and lazy loading options
- `preprocessing`: Pipeline steps (filtering, aggregation, etc.)
- `average_precision`/`mean_average_precision`: Analysis parameters
- `output`: Result file paths
- `plotting`: Visualization settings

### Preprocessing Pipeline
The runner supports these preprocessing steps (order determined by config):
1. `filter`: Apply pandas query expressions
2. `dropna`: Remove rows with NaN values in specified columns
3. `remove_nan_features`: Remove feature columns containing NaN
4. `split_multilabel`: Split pipe-separated values into lists
5. `filter_active`: Filter based on activity CSV with below_corrected_p column
6. `aggregate_replicates`: Aggregate by taking median of features
7. `merge_metadata`: Merge external CSV metadata
8. `filter_single_replicates`: Remove groups with < min_replicates members
9. `apply_assign_reference`: Apply copairs.matching.assign_reference_index
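
The steps above can be sketched as a small config-driven dispatcher. This is a hypothetical mini-implementation for illustration only; the class name, step schema, and method bodies are assumptions, not `CopairsRunner`'s actual code:

```python
# Hypothetical mini-dispatcher; CopairsRunner's real implementation may differ.
from typing import Any, Dict, List

import pandas as pd


class MiniRunner:
    def _preprocess_filter(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
        # Apply a pandas query expression to keep matching rows.
        return df.query(params["query"])

    def _preprocess_dropna(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
        # Drop rows with NaN in the listed columns.
        return df.dropna(subset=params["columns"])

    def preprocess_data(self, df: pd.DataFrame, steps: List[Dict[str, Any]]) -> pd.DataFrame:
        # Each step's "type" names a method; execution order comes from the config.
        for step in steps:
            method = getattr(self, f"_preprocess_{step['type']}")
            df = method(df, step.get("params", {}))
        return df


df = pd.DataFrame({"Metadata_dose": [0.05, 0.5, None], "feature1": [1.0, 2.0, 3.0]})
steps = [
    {"type": "dropna", "params": {"columns": ["Metadata_dose"]}},
    {"type": "filter", "params": {"query": "Metadata_dose > 0.1"}},
]
result = MiniRunner().preprocess_data(df, steps)
```

The `getattr`-based lookup is what lets new `_preprocess_<name>` methods become available in YAML configs without touching the pipeline loop.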

## Important Context

### Monorepo Standards
This project is part of a monorepo that uses:
- **uv** for package management (transitioning from Poetry)
- **ruff** for formatting and linting
- **pytest** for testing (>90% coverage target)
- **numpy** documentation style
- Conventional commits for commit messages

### Current State
- The script uses inline dependencies (PEP 723 format)
- Has a minimal pyproject.toml for ruff configuration
- No test suite exists yet
- Examples use LINCS Cell Painting data from GitHub
- Supports lazy loading for large parquet files using polars
- Configuration files demonstrate typical usage patterns

### Dependencies
Required packages (from inline script metadata):
- python >= 3.8
- pandas, numpy, copairs, pyyaml, pyarrow, matplotlib, seaborn, polars

### Data Loading Capabilities
- Supports local files, HTTP URLs, and S3 paths
- Automatic data download and caching for URLs
- Lazy loading for large parquet files with polars
- Path resolution relative to config file location
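
A hedged sketch of the extension-based dispatch implied above (illustrative only; the runner's actual loader also handles caching and lazy polars paths):

```python
# Illustrative loader sketch; not the runner's actual implementation.
import os
import tempfile

import pandas as pd


def load_table(path: str) -> pd.DataFrame:
    """Load CSV or Parquet; pandas resolves http(s)/s3 paths via fsspec."""
    if path.endswith(".parquet"):
        return pd.read_parquet(path)
    return pd.read_csv(path)


# Demo with a throwaway local CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "profiles.csv")
    pd.DataFrame({"Metadata_compound": ["c1", "c2"]}).to_csv(demo, index=False)
    table = load_table(demo)
```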

## Common Tasks

### Adding New Preprocessing Steps
1. Implement a new method `_preprocess_<step_name>` in `CopairsRunner` class
2. The method should accept `df` and `params` arguments
3. Add documentation for the new step in the `preprocess_data()` docstring
4. Use the step in your YAML config with `type: <step_name>`
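
For instance, steps 1-2 might produce something like this hypothetical `_preprocess_drop_constant_features` (name, parameters, and behavior invented for illustration; it is not part of the actual runner):

```python
# Hypothetical step following the recipe above; not an existing runner step.
from typing import Any, Dict

import pandas as pd


def _preprocess_drop_constant_features(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """Drop non-metadata columns that hold a single unique value."""
    prefix = params.get("metadata_prefix", "Metadata")
    meta = [c for c in df.columns if c.startswith(prefix)]
    features = [c for c in df.columns if c not in meta]
    constant = [c for c in features if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


# self is unused here, so None stands in for the runner instance in this demo.
df = pd.DataFrame({"Metadata_plate": ["p1", "p2"], "f1": [1.0, 1.0], "f2": [0.1, 0.2]})
cleaned = _preprocess_drop_constant_features(None, df, {})
```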

### Creating New Analysis Configs
1. Copy an existing config from `configs/`
2. Modify data paths and preprocessing steps
3. Adjust analysis parameters as needed
4. Run with: `uv run copairs_runner.py your_config.yaml`

### Working with Large Datasets
For memory-efficient processing:
1. Use lazy filtering in the data config for parquet files:
```yaml
data:
  path: "huge_dataset.parquet"
  use_lazy_filter: true
  filter_query: "Metadata_PlateType == 'TARGET2'"  # SQL syntax
  columns: ["Metadata_compound", "feature1", "feature2"]  # optional
```
This filters BEFORE loading into memory using polars.

2. For standard filtering after loading, use preprocessing:
```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"  # pandas query syntax
```

3. Enable `save_intermediate: true` in preprocessing for debugging

Note: lazy filtering uses SQL syntax (polars), while preprocessing uses pandas query syntax.

### Debugging
- Use `--verbose` flag for detailed logging
- Check intermediate results with `save_intermediate: true` in preprocessing
- Examine output CSV files for analysis results
- Review preprocessing logs to understand data transformations

## Configuration Examples

### Minimal Activity Analysis
```yaml
data:
  path: "path/to/profiles.parquet"
  # metadata_regex: "^Metadata"  # optional, this is the default

average_precision:
  params:
    pos_sameby: ["Metadata_broad_sample"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_broad_sample", "Metadata_Plate"]

mean_average_precision:
  params:
    sameby: ["Metadata_broad_sample"]
    null_size: 1000000
    threshold: 0.05
    seed: 0

output:
  path: "results/map_results.csv"
```
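
To make the `pos_`/`neg_` parameters concrete, here is a toy reading of `sameby`/`diffby` as pair constraints. This is an illustrative assumption about the semantics, not copairs' implementation, which is vectorized and more general:

```python
# Toy illustration of sameby/diffby pair selection.
import itertools

import pandas as pd

meta = pd.DataFrame(
    {
        "Metadata_broad_sample": ["A", "A", "B"],
        "Metadata_Plate": ["p1", "p2", "p1"],
    }
)


def pairs(df, sameby, diffby):
    out = []
    for i, j in itertools.combinations(df.index, 2):
        same_ok = all(df.at[i, c] == df.at[j, c] for c in sameby)
        diff_ok = all(df.at[i, c] != df.at[j, c] for c in diffby)
        if same_ok and diff_ok:
            out.append((i, j))
    return out


# Positives: replicates sharing the same compound sample.
pos = pairs(meta, ["Metadata_broad_sample"], [])
# Negatives: different sample AND different plate.
neg = pairs(meta, [], ["Metadata_broad_sample", "Metadata_Plate"])
```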

### Advanced Preprocessing Pipeline
```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_broad_sample != 'DMSO'"
    - type: aggregate_replicates
      params:
        groupby: ["Metadata_broad_sample", "Metadata_Plate"]
    - type: apply_assign_reference
      params:
        reference_query: "Metadata_broad_sample == 'DMSO'"
        not_reference_query: "Metadata_broad_sample != 'DMSO'"
```
57 changes: 57 additions & 0 deletions libs/copairs_runner/CONTRIBUTING.md
@@ -0,0 +1,57 @@
# Contributing to copairs_runner

## Preprocessing Steps

The preprocessing pipeline intentionally provides a minimal DSL to avoid recreating pandas/SQL in YAML. Before adding new steps, consider whether users should handle the transformation externally.

**Important context**: Copairs analysis typically happens at the end of a morphological profiling pipeline. By this stage, your data should already be:
- Quality-controlled and normalized
- Aggregated to appropriate levels
- Filtered for relevant samples
- Properly annotated with metadata

If you find yourself needing extensive preprocessing here, it likely indicates issues with your upstream pipeline.

### Alternatives to New Steps

1. **Lazy filtering** - For large parquet files, use polars' SQL syntax before loading:
```yaml
data:
  use_lazy_filter: true
  filter_query: "Metadata_PlateType == 'TARGET2'"
```

2. **External preprocessing** - Complex transformations belong in Python/SQL scripts, not YAML configs

3. **Composition** - Combine existing steps rather than creating specialized ones

### When to Add a Step

Add a step only if it:
- Integrates with copairs-specific functionality (e.g., `apply_assign_reference`)
- Handles last-mile transformations specific to copairs analysis
- Requires runner context (resolved paths, metadata patterns)
- Has been requested by multiple users

Remember: needing complex preprocessing at this stage often indicates upstream processing gaps.

### Implementation

```python
def _preprocess_<step_name>(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """One-line description."""
    # Implementation
    logger.info("Log what happened")
    return df
```

Update the `preprocess_data()` docstring with parameters and add a usage example.
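
As a concrete (hypothetical) instance of the template, a clipping step might look like the following. The step name, `bound` parameter, and metadata convention are assumptions for illustration, not existing runner code:

```python
# Hypothetical example of the template above; not an existing runner step.
from typing import Any, Dict

import pandas as pd


def _preprocess_clip_features(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """Clip feature columns to a symmetric bound."""
    bound = params["bound"]
    features = [c for c in df.columns if not c.startswith("Metadata")]
    df = df.copy()  # avoid SettingWithCopyWarning on views
    df[features] = df[features].clip(lower=-bound, upper=bound)
    return df


# self is unused, so None stands in for the runner instance in this demo.
df = pd.DataFrame({"Metadata_well": ["A01"], "f1": [12.5]})
clipped = _preprocess_clip_features(None, df, {"bound": 5.0})
```

Note it stays within the design constraints below: single responsibility, a handful of lines, and an explicit `.copy()`.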

### Design Constraints

- Keep implementations under ~10 lines
- Single responsibility per step
- Clear parameter validation
- Informative error messages

The goal is providing just enough convenience without creating a parallel data manipulation framework. Most preprocessing should happen before data reaches this runner.
81 changes: 81 additions & 0 deletions libs/copairs_runner/README.md
@@ -0,0 +1,81 @@
# Copairs Runner

YAML-driven runner for [copairs](https://github.com/broadinstitute/copairs).

## Usage

```bash
uv run copairs_runner.py config.yaml
```

## Configuration

```yaml
# Required sections
data:
  path: "data.csv"  # or .parquet, URLs, S3 paths

  # For large parquet files - filter BEFORE loading into memory:
  # use_lazy_filter: true
  # filter_query: "Metadata_PlateType == 'TARGET2'"  # SQL syntax
  # columns: ["Metadata_col1", "feature_1", "feature_2"]  # optional

# Optional sections
preprocessing:
  steps:
    # Standard filtering - happens AFTER data is loaded:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"  # pandas query syntax

average_precision:
  params:
    pos_sameby: ["Metadata_compound"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_compound"]

output:
  path: "results.csv"

mean_average_precision:
  params:
    sameby: ["Metadata_compound"]
    null_size: 10000  # Typically 10000-100000
    threshold: 0.05
    seed: 0

plotting:
  enabled: true
  path: "plot.png"
```

## Preprocessing Steps

- `filter`: Filter rows with pandas query
- `dropna`: Remove rows with NaN
- `aggregate_replicates`: Median aggregation by group
- `merge_metadata`: Join external CSV
- `split_multilabel`: Split pipe-separated values
- See the `copairs_runner.py` docstring for the complete list
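
A small illustrative pipeline combining two of the steps above (parameter names follow the examples elsewhere in this repo's docs; treat the exact values as assumptions):

```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_compound != 'DMSO'"
    - type: aggregate_replicates
      params:
        groupby: ["Metadata_compound", "Metadata_Plate"]
```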

## Examples

- `configs/example_activity_lincs.yaml`: Phenotypic activity
- `configs/example_consistency_lincs.yaml`: Target consistency

Run all examples: `./run_examples.sh`

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on adding preprocessing steps.

### Example Output

The runner generates scatter plots showing mean average precision (mAP) vs statistical significance:

**Phenotypic Activity Assessment:**
![Activity Plot](examples/example_activity_plot.png)

**Phenotypic Consistency (Target-based):**
![Consistency Plot](examples/example_consistency_plot.png)