Merged
1 change: 1 addition & 0 deletions AGENTS.md
@@ -30,6 +30,7 @@
 ## Lessons Learned

 - Keep adapters focused on source parsing and structural graph emission; move cross-ontology ID normalization to resolvers.
+- Raw source adapters should not populate `sources` or `provenance`; let the framework stamp canonical datasource/version metadata during ETL.
 - For ontology xrefs, maintain an explicit allowlist and perform case-insensitive prefix checks.
 - When adding new datasource version handling, use named parameters for `DatasourceVersionInfo` to avoid argument-order regressions.
 - When an edge can be emitted by multiple sources and later merged, keep source-specific payload in a `details` list instead of top-level edge fields.
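
The xref lesson above can be illustrated with a minimal sketch. The allowlist contents and the function name `is_allowed_xref` are hypothetical and are not taken from the repository; the point is only that the prefix comparison is done on a case-folded prefix against an explicit set.

```python
# Hypothetical allowlist; the real set of accepted prefixes is project-specific.
ALLOWED_XREF_PREFIXES = frozenset({"mesh", "omim", "umls"})


def is_allowed_xref(xref: str) -> bool:
    """Return True when a CURIE-style xref has an allowlisted prefix."""
    prefix, sep, local_id = xref.partition(":")
    if not sep or not local_id:
        # No "prefix:id" structure, so there is nothing to check against.
        return False
    return prefix.strip().casefold() in ALLOWED_XREF_PREFIXES


xrefs = ["MESH:D003920", "mesh:D003920", "Wikipedia:Diabetes", "OMIM"]
kept = [x for x in xrefs if is_allowed_xref(x)]
```

Here `kept` retains both spellings of the MeSH xref and drops the unlisted prefix and the bare token.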
99 changes: 99 additions & 0 deletions designs/ppi/bioplex_ppi_ingest_design.md
@@ -0,0 +1,99 @@
# BioPlex PPI Ingest Design

## Status

Implemented, validated in the working graph and working MySQL paths, and promoted to `pharos.yaml` / `target_graph.yaml`.

## Goal

Add a first-pass BioPlex protein-protein interaction ingest for Pharos.

## Source Choice

Use the official undirected BioPlex 3.0 interaction releases from the BioPlex download page:

- `BioPlex_293T_Network_10K_Dec_2019.tsv`
- `BioPlex_HCT116_Network_5.5K_Dec_2019.tsv`

Rationale:

- These are the current official-release network files exposed on the BioPlex site.
- They match the current graph `PPIEdge` model better than the directed bait-prey files.
- They avoid the noisier unfiltered candidate-interaction lists.

## Source URLs

- Landing page: `https://bioplex.hms.harvard.edu/interactions.php`
- Data index: `https://bioplex.hms.harvard.edu/data/`
- 293T release: `https://bioplex.hms.harvard.edu/data/BioPlex_293T_Network_10K_Dec_2019.tsv`
- HCT116 release: `https://bioplex.hms.harvard.edu/data/BioPlex_HCT116_Network_5.5K_Dec_2019.tsv`

## Version Strategy

- Use BioPlex release label `3.0` as the dataset version.
- Capture per-file `Last-Modified` dates into `input_files/auto/bioplex/bioplex_version.tsv`.
- Let adapter-side `download_date` come from file mtime unless we later decide to persist it explicitly.

Observed current `/data` filenames include a December 2019 stamp even though the site still presents them as the current BioPlex 3.0 official releases.
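
As a rough sketch of the version-capture step, the snippet below turns an HTTP `Last-Modified` header value into one row of a version TSV. The column names (`file`, `version`, `version_date`), the helper names, and the sample header value are assumptions for illustration; the actual layout of `bioplex_version.tsv` is not shown in this design.

```python
import csv
import io
from email.utils import parsedate_to_datetime

# Assumed column layout; the real version file may differ.
FIELDNAMES = ["file", "version", "version_date"]


def version_row(filename: str, last_modified: str, version: str = "3.0") -> dict:
    """Build one version-file row from an HTTP Last-Modified header value."""
    modified = parsedate_to_datetime(last_modified)
    return {
        "file": filename,
        "version": version,
        "version_date": modified.date().isoformat(),
    }


def write_version_tsv(rows, handle) -> None:
    """Write rows as tab-separated text with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Illustrative header value, not the real server response.
rows = [
    version_row(
        "BioPlex_293T_Network_10K_Dec_2019.tsv",
        "Mon, 02 Dec 2019 15:04:05 GMT",
    )
]
buffer = io.StringIO()
write_version_tsv(rows, buffer)
```

In a real download step the header value would come from the HTTP response for each file rather than a literal string.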

## Observed File Shape

Current BioPlex 3.0 files have the same column layout that the legacy TCRD BioPlex loader expected:

- `GeneA`
- `GeneB`
- `UniprotA`
- `UniprotB`
- `SymbolA`
- `SymbolB`
- `pW`
- `pNI`
- `pInt`

Observed counts:

- `BioPlex_293T_Network_10K_Dec_2019.tsv`: `118,162` rows
- `BioPlex_HCT116_Network_5.5K_Dec_2019.tsv`: `70,966` rows

Observed payload details:

- no self-pairs in either file
- gene IDs are numeric Entrez Gene identifiers
- isoform-suffixed UniProt accessions are common
- `UniprotA` can be the literal string `UNKNOWN`
  - `293T`: `4,950` rows
  - `HCT116`: `1,688` rows
- `UniprotB` did not contain `UNKNOWN` in the profiled files
- `pInt` values range from about `0.75` to `1.0`
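
The payload checks listed above could be reproduced with a small profiling pass along these lines. This is a sketch only: the function name `profile_bioplex` is hypothetical, self-pairs are detected by comparing `GeneA` with `GeneB` (an assumption about how the check was done), and the two sample rows are synthetic rather than taken from the BioPlex files.

```python
import csv
import io
from collections import Counter


def profile_bioplex(handle) -> dict:
    """Count rows, self-pairs, and UNKNOWN accessions, and track the pInt range."""
    counts = Counter()
    p_int_values = []
    for row in csv.DictReader(handle, delimiter="\t"):
        counts["rows"] += 1
        if row["GeneA"] == row["GeneB"]:
            counts["self_pairs"] += 1
        if row["UniprotA"] == "UNKNOWN":
            counts["unknown_uniprot_a"] += 1
        if row["UniprotB"] == "UNKNOWN":
            counts["unknown_uniprot_b"] += 1
        p_int_values.append(float(row["pInt"]))
    return {
        "rows": counts["rows"],
        "self_pairs": counts["self_pairs"],
        "unknown_uniprot_a": counts["unknown_uniprot_a"],
        "unknown_uniprot_b": counts["unknown_uniprot_b"],
        "p_int_min": min(p_int_values) if p_int_values else None,
        "p_int_max": max(p_int_values) if p_int_values else None,
    }


# Synthetic sample with the observed header; values are placeholders.
sample = (
    "GeneA\tGeneB\tUniprotA\tUniprotB\tSymbolA\tSymbolB\tpW\tpNI\tpInt\n"
    "1\t2\tP00001\tQ00002-2\tGENE1\tGENE2\t0.01\t0.04\t0.95\n"
    "3\t4\tUNKNOWN\tQ00004\tGENE3\tGENE4\t0.05\t0.15\t0.80\n"
)
profile = profile_bioplex(io.StringIO(sample))
```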

## Implemented Mapping

Current graph mapping:

- emit `PPIEdge`
- use UniProt accessions as the primary emitted identifier family for endpoint proteins
- fall back to `NCBIGene` when BioPlex reports `UniprotA='UNKNOWN'`
- preserve BioPlex confidence-style fields into:
  - `p_wrong`
  - `p_ni`
  - `p_int`

Implementation choice:

- configure one adapter instance per file so provenance distinguishes `293T` versus `HCT116` via the version string
- do not populate adapter-level `sources`; the ETL framework stamps canonical datasource/version metadata
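
A minimal sketch of the mapping described above, under stated assumptions: the `UniProtKB:` prefix, the plain-dict edge shape, the list-valued confidence fields, and the correspondence `pW` to `p_wrong`, `pNI` to `p_ni`, `pInt` to `p_int` are inferred for illustration and are not confirmed by this design. The helper names are hypothetical. Isoform suffixes are passed through unchanged, and no `sources` are set, matching the rule that the ETL framework stamps them.

```python
UNKNOWN = "UNKNOWN"


def endpoint_id(uniprot: str, gene_id: str) -> str:
    """Prefer the UniProt accession; fall back to the Entrez Gene ID."""
    uniprot = uniprot.strip()
    if uniprot and uniprot != UNKNOWN:
        return f"UniProtKB:{uniprot}"  # prefix is an assumption
    return f"NCBIGene:{gene_id.strip()}"


def to_edge(row: dict) -> dict:
    """Map one BioPlex row to a PPIEdge-like dict (hypothetical shape)."""
    return {
        "start_id": endpoint_id(row["UniprotA"], row["GeneA"]),
        "end_id": endpoint_id(row["UniprotB"], row["GeneB"]),
        # List-valued so that merged edges can hold one value per source.
        "p_wrong": [float(row["pW"])],
        "p_ni": [float(row["pNI"])],
        "p_int": [float(row["pInt"])],
    }


# Synthetic row, not taken from the BioPlex files.
row = {
    "GeneA": "3", "GeneB": "4",
    "UniprotA": "UNKNOWN", "UniprotB": "Q00004-2",
    "SymbolA": "GENE3", "SymbolB": "GENE4",
    "pW": "0.05", "pNI": "0.15", "pInt": "0.80",
}
edge = to_edge(row)
```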

## Validation Summary

Validated outcomes:

- merged graph edges can carry multiple `p_int` values when the same canonical pair is supported by both BioPlex cell lines
- `UNKNOWN` UniProt rows resolve through `NCBIGene:*` fallback when the reviewed target graph contains the mapped protein
- downstream `ncats_ppi` exports BioPlex rows with scalar `p_int`, `p_ni`, and `p_wrong` using `max(...)` collapse for merged graph lists
- promoted into:
  - `src/use_cases/pharos/pharos.yaml`
  - `src/use_cases/pharos/target_graph.yaml`
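
The `max(...)` collapse mentioned above can be sketched as follows. The helper name `collapse_max` and the dict-based edge are hypothetical, and the numbers are made up; the snippet only shows how a list-valued graph field becomes one scalar per export column.

```python
from typing import Optional, Sequence


def collapse_max(values: Optional[Sequence[float]]) -> Optional[float]:
    """Collapse a list-valued graph field to one scalar; empty input stays empty."""
    if not values:
        return None
    return max(values)


# Made-up merged edge supported by both cell lines.
merged_edge = {
    "p_int": [0.91, 0.99],
    "p_ni": [0.05, 0.008],
    "p_wrong": [0.04, 0.002],
}
export_row = {
    field: collapse_max(merged_edge.get(field))
    for field in ("p_int", "p_ni", "p_wrong")
}
```

Note that each column is collapsed independently, so the three exported values need not come from the same cell line.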

Open follow-up questions:

- whether cell-line provenance should eventually be carried in a dedicated edge field instead of only in provenance / sources
127 changes: 127 additions & 0 deletions designs/ppi/reactome_ppi_ingest_design.md
@@ -0,0 +1,127 @@
# Reactome PPI Ingest Design

## Status

Implemented, validated in the working graph and working MySQL paths, and promoted to `pharos.yaml` / `target_graph.yaml`.

## Goal

Add a first-pass Reactome-derived protein-protein interaction ingest for Pharos.

## Source Choice

Use the official human Reactome tab-delimited interaction file:

- `reactome.homo_sapiens.interactions.tab-delimited.txt`

Rationale:

- This is the current official human interaction export on the Reactome download site.
- It matches the old TCRD loader input format.
- It includes interaction type and context/PMID fields that can map naturally into the current PPI model and TCRD export.

## Source URLs

- Download docs: `https://reactome.org/download-data?id=62&ml=1`
- Directory index: `https://reactome.org/download/current/interactors/`
- Human tab-delimited file: `https://reactome.org/download/current/interactors/reactome.homo_sapiens.interactions.tab-delimited.txt`

## Version Strategy

- Use the Reactome database version recorded in `input_files/auto/reactome/reactome_version.tsv`
- Use the PPI file `Last-Modified` header as `version_date`
- Let adapter-side `download_date` come from file mtime unless we later decide to persist it explicitly

## Documented File Shape

Reactome documents the tab-delimited human interaction file as:

1. interactor 1 protein ID
2. interactor 1 Ensembl gene ID(s)
3. interactor 1 Entrez Gene ID(s)
4. interactor 2 protein ID
5. interactor 2 Ensembl gene ID(s)
6. interactor 2 Entrez Gene ID(s)
7. interaction type
8. interaction context
9. PubMed IDs
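
The nine documented columns can be captured in a small record type, sketched below. The class and field names are hypothetical. Multi-valued columns are kept as raw strings because this design does not state their delimiter, and the sample line uses placeholder values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactomeInteractionRow:
    """One line of the tab-delimited file, in documented column order."""
    protein_1: str
    ensembl_1: str
    entrez_1: str
    protein_2: str
    ensembl_2: str
    entrez_2: str
    interaction_type: str
    context: str
    pubmed_ids: str


def parse_line(line: str) -> ReactomeInteractionRow:
    """Split one data line into the nine documented fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return ReactomeInteractionRow(*fields)


# Placeholder values; not a real Reactome record.
sample_line = "\t".join([
    "uniprotkb:P00001", "ENSG_PLACEHOLDER_1", "1",
    "uniprotkb:Q00002", "ENSG_PLACEHOLDER_2", "2",
    "physical association", "reactome:R-HSA-0000000", "12345",
]) + "\n"
parsed = parse_line(sample_line)
```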

## Legacy Comparison

The old UNM TCRD loader used the same human tab-delimited Reactome interaction file and:

- required both interactors to have UniProt IDs
- populated `interaction_type`
- skipped duplicate interaction rows
- skipped self-pairs

This should be treated as a comparison point only; current behavior should still be validated against the real file after download.

## Observed File Profile

Observed counts from the downloaded file:

- total rows: `123,895`
- rows where both interactors are UniProt proteins: `83,545`
- rows filtered out because at least one interactor is not a UniProt protein: `40,350`
- protein self-pairs: `7,677`
- duplicate unordered protein-pair-plus-type rows: `57,386`
- distinct unordered protein-pair-plus-type combinations: `26,159`
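
The observed counts can be cross-checked with simple arithmetic. The variable names below are illustrative; the numbers are the ones listed above. Because the duplicate and distinct counts add up to the full protein-protein row count, they appear to have been computed before self-pairs were removed, which is an inference from the numbers rather than something the profile states.

```python
total_rows = 123_895
protein_rows = 83_545
non_protein_rows = 40_350
self_pairs = 7_677
duplicate_rows = 57_386
distinct_combinations = 26_159

# Protein-protein rows plus filtered rows account for every row in the file.
assert protein_rows + non_protein_rows == total_rows

# Duplicates plus distinct combinations account for every protein-protein row,
# so both counts seem to include self-pairs.
assert duplicate_rows + distinct_combinations == protein_rows

# Protein-protein rows that remain once self-pairs are dropped.
non_self_protein_rows = protein_rows - self_pairs
```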

Observed payload behavior:

- non-protein rows include identifiers such as `ChEBI:*`
- almost every protein-protein row has PubMed references
- every row has a Reactome context string like `reactome:R-HSA-...`
- current interaction types include values such as:
  - `physical association`
  - `enzymatic reaction`
  - `cleavage reaction`
  - `dephosphorylation reaction`

## Implemented Mapping

Current first-pass graph mapping:

- emit `PPIEdge`
- keep only rows where both interactors are UniProt IDs
- skip self-pairs
- canonicalize unordered protein pairs
- dedupe repeated source rows by unordered pair plus interaction type
- preserve:
  - `interaction_type` as a graph list field
  - `contexts` as a graph list field
  - `pmids` as a graph list field
- do not populate adapter-level `sources`; the ETL framework stamps canonical datasource/version metadata
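
The filtering and dedupe rules above can be sketched as one pass over pre-parsed rows. Several details are assumptions made for illustration: the `uniprotkb:` prefix test for "is a UniProt ID", the dict-based edge shape, the function name `build_edges`, and the choice to merge contexts and PMIDs from duplicate rows into the surviving edge (the design only says duplicates are deduped).

```python
UNIPROT_PREFIX = "uniprotkb:"  # assumed marker for UniProt interactors


def build_edges(rows):
    """rows: iterable of (id_a, id_b, interaction_type, context, pmid_list)."""
    edges = {}
    for id_a, id_b, interaction_type, context, pmids in rows:
        # Keep only rows where both interactors are UniProt IDs.
        if not (id_a.startswith(UNIPROT_PREFIX) and id_b.startswith(UNIPROT_PREFIX)):
            continue
        # Skip self-pairs.
        if id_a == id_b:
            continue
        # Canonicalize the unordered pair so (A, B) and (B, A) share a key.
        start_id, end_id = sorted((id_a, id_b))
        key = (start_id, end_id, interaction_type)
        edge = edges.setdefault(key, {
            "start_id": start_id,
            "end_id": end_id,
            "interaction_type": [interaction_type],
            "contexts": [],
            "pmids": [],
        })
        if context and context not in edge["contexts"]:
            edge["contexts"].append(context)
        for pmid in pmids:
            if pmid not in edge["pmids"]:
                edge["pmids"].append(pmid)
    return list(edges.values())


# Synthetic rows: two orderings of one pair, a self-pair, and a non-protein row.
rows = [
    ("uniprotkb:P1", "uniprotkb:P2", "physical association", "reactome:R-HSA-1", ["1"]),
    ("uniprotkb:P2", "uniprotkb:P1", "physical association", "reactome:R-HSA-2", ["1", "2"]),
    ("uniprotkb:P1", "uniprotkb:P1", "physical association", "reactome:R-HSA-3", ["3"]),
    ("ChEBI:1", "uniprotkb:P2", "physical association", "reactome:R-HSA-4", ["4"]),
]
edges = build_edges(rows)
```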

## Legacy Downstream Comparison

Direct inspection of `pharos319.ncats_ppi` showed:

- `StringDB` populated only `score`
- `BioPlex` populated only `p_int`, `p_ni`, and `p_wrong`
- `Reactome` rows left `evidence`, `interaction_type`, `score`, `p_int`, `p_ni`, and `p_wrong` empty

## Current Downstream Mapping

Current IFX_ODIN downstream decision:

- keep Reactome `pmids`, `contexts`, and `interaction_type` in the graph
- map `pmids` to `ncats_ppi.evidence` as pipe-delimited PMIDs
- map `interaction_type` to `ncats_ppi.interaction_type`
- keep `contexts` graph-only for now
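
A hedged sketch of the downstream decision: PMIDs become a pipe-delimited `evidence` string and `contexts` is not exported. How a multi-valued `interaction_type` list is flattened into the scalar column is not specified here, so the pipe join for that field, along with the function name and dict shapes, is an assumption.

```python
from typing import Optional


def pipe_join(values) -> Optional[str]:
    """Join list values with '|'; return None for missing or empty lists."""
    if not values:
        return None
    return "|".join(str(v) for v in values)


def to_ncats_ppi_fields(edge: dict) -> dict:
    """Map Reactome graph fields to the two downstream columns."""
    return {
        "evidence": pipe_join(edge.get("pmids")),
        # Assumption: multiple interaction types are joined the same way.
        "interaction_type": pipe_join(edge.get("interaction_type")),
        # `contexts` is intentionally not exported (graph-only for now).
    }


# Synthetic edge for illustration.
edge = {
    "pmids": ["111", "222"],
    "interaction_type": ["physical association"],
    "contexts": ["reactome:R-HSA-0000000"],
}
fields = to_ncats_ppi_fields(edge)
```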

## Validation Summary

Validated outcomes:

- Reactome-backed graph edges landed with non-empty `pmids`, `contexts`, and `interaction_type`
- Reactome merged cleanly with both BioPlex and STRING on shared canonical pairs
- downstream `ncats_ppi` rows now carry Reactome PMIDs in `evidence` and Reactome interaction types in `interaction_type`
- promoted into:
  - `src/use_cases/pharos/pharos.yaml`
  - `src/use_cases/pharos/target_graph.yaml`

## Open Follow-Ups

- decide whether Reactome context should eventually have its own dedicated downstream column or lookup table
13 changes: 3 additions & 10 deletions designs/ppi/string_ppi_ingest_design.md
@@ -4,8 +4,7 @@

 Implemented and validated in the working graph and working MySQL paths.

-This first pass covers **STRING human protein-protein interactions** only.
-BioPlex and Reactome PPI remain follow-up sources.
+STRING, BioPlex, and Reactome PPI are now all implemented for Pharos.

 ## Scope

@@ -15,9 +14,7 @@ Implemented source:

 Explicitly deferred:

-- BioPlex PPI
-- Reactome PPI
-- source-specific `interaction_type` / `evidence` population
+- richer STRING channel-specific fields from `.protein.links.full...`

 ## Files Added / Changed

@@ -102,14 +99,14 @@ Confirmed old IFX_ODIN / Pharos readback behavior:
 - `PPIEdge`
 - `start_node`: `Protein(id="ENSEMBL:ENSP...")`
 - `end_node`: `Protein(id="ENSEMBL:ENSP...")`
-- `sources`: STRING provenance list
 - `score`: list-valued, emitted as `[combined_score]`

 Implementation choices:

 - `score_cutoff` is an adapter parameter with default `400`
 - rows below the cutoff are discarded before they enter the graph
 - self-pairs are discarded before they enter the graph
+- adapter does not populate `sources`; the ETL framework stamps canonical datasource/version metadata
 - `max_rows` is supported for bounded validation runs and counts **kept emitted
   edges**, not scanned raw lines
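
The implementation choices in this hunk can be sketched as a generator. The function name, the tuple input of already-parsed `(protein_a, protein_b, combined_score)` values, and the dict edge shape are hypothetical; the cutoff default, the self-pair filter, the list-valued `score`, and the rule that `max_rows` counts kept edges come from the list above.

```python
from typing import Iterable, Iterator, Optional, Tuple


def iter_kept_edges(
    rows: Iterable[Tuple[str, str, int]],
    score_cutoff: int = 400,
    max_rows: Optional[int] = None,
) -> Iterator[dict]:
    """Yield edges that pass the filters, stopping after max_rows kept edges."""
    emitted = 0
    for protein_a, protein_b, combined_score in rows:
        if max_rows is not None and emitted >= max_rows:
            break
        if combined_score < score_cutoff:
            continue  # below the cutoff: never enters the graph
        if protein_a == protein_b:
            continue  # self-pair: never enters the graph
        emitted += 1  # counts kept edges, not scanned lines
        yield {
            "start_id": f"ENSEMBL:{protein_a}",
            "end_id": f"ENSEMBL:{protein_b}",
            "score": [combined_score],
            # No `sources` here; the ETL framework stamps them.
        }


# Synthetic rows with placeholder protein IDs.
rows = [
    ("ENSP1", "ENSP2", 300),
    ("ENSP1", "ENSP1", 900),
    ("ENSP1", "ENSP3", 400),
    ("ENSP2", "ENSP3", 700),
    ("ENSP3", "ENSP4", 999),
]
kept = list(iter_kept_edges(rows, max_rows=2))
```

With `max_rows=2`, four raw rows are scanned before the second kept edge is emitted, which is the distinction the last bullet draws.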

@@ -237,10 +234,6 @@ pairs before export.

 - Profile whether STRING `.protein.links.full...` is worth revisiting for richer
   channel-specific fields
-- Add Reactome PPI ingest
-  - populate `interaction_type`
-  - decide whether the Reactome evidence/context column should map to `evidence`
-- Add BioPlex PPI ingest
 - Decide whether downstream `ncats_ppi` export should collapse duplicate canonical
   pairs before reciprocal row generation, or continue to preserve one SQL row pair
   per graph edge
1 change: 1 addition & 0 deletions src/constants.py
@@ -18,6 +18,7 @@ class DataSourceName(SimpleEnum):
     Cellosaurus = "Cellosaurus"
     Reactome = "Reactome"
     STRING = "STRING"
+    BioPlex = "BioPlex"
     WikiPathways = "WikiPathways"
     PathwayCommons = "PathwayCommons"
     CLO = "Cell Line Ontology (CLO)"