ncats · KeithKelleher · Apr 24, 2026 · Apr 24, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -30,6 +30,7 @@
 ## Lessons Learned
 
 - Keep adapters focused on source parsing and structural graph emission; move cross-ontology ID normalization to resolvers.
+- Raw source adapters should not populate `sources` or `provenance`; let the framework stamp canonical datasource/version metadata during ETL.
 - For ontology xrefs, maintain an explicit allowlist and perform case-insensitive prefix checks.
 - When adding new datasource version handling, use named parameters for `DatasourceVersionInfo` to avoid argument-order regressions.
 - When an edge can be emitted by multiple sources and later merged, keep source-specific payload in a `details` list instead of top-level edge fields.

diff --git a/designs/ppi/bioplex_ppi_ingest_design.md b/designs/ppi/bioplex_ppi_ingest_design.md
@@ -0,0 +1,99 @@
+# BioPlex PPI Ingest Design
+
+## Status
+
+Implemented, validated in the working graph and working MySQL paths, and promoted to `pharos.yaml` / `target_graph.yaml`.
+
+## Goal
+
+Add a first-pass BioPlex protein-protein interaction ingest for Pharos.
+
+## Source Choice
+
+Use the official undirected BioPlex 3.0 interaction releases from the BioPlex download page:
+
+- `BioPlex_293T_Network_10K_Dec_2019.tsv`
+- `BioPlex_HCT116_Network_5.5K_Dec_2019.tsv`
+
+Rationale:
+
+- These are the current official-release network files exposed on the BioPlex site.
+- They match the current graph `PPIEdge` model better than the directed bait-prey files.
+- They avoid the noisier unfiltered candidate-interaction lists.
+
+## Source URLs
+
+- Landing page: `https://bioplex.hms.harvard.edu/interactions.php`
+- Data index: `https://bioplex.hms.harvard.edu/data/`
+- 293T release: `https://bioplex.hms.harvard.edu/data/BioPlex_293T_Network_10K_Dec_2019.tsv`
+- HCT116 release: `https://bioplex.hms.harvard.edu/data/BioPlex_HCT116_Network_5.5K_Dec_2019.tsv`
+
+## Version Strategy
+
+- Use BioPlex release label `3.0` as the dataset version.
+- Capture per-file `Last-Modified` dates into `input_files/auto/bioplex/bioplex_version.tsv`.
+- Let adapter-side `download_date` come from file mtime unless we later decide to persist it explicitly.
+
+Observed current `/data` filenames include a December 2019 stamp even though the site still presents them as the current BioPlex 3.0 official releases.
+
+## Observed File Shape
+
+Current BioPlex 3.0 files have the same column layout that the legacy TCRD BioPlex loader expected:
+
+- `GeneA`
+- `GeneB`
+- `UniprotA`
+- `UniprotB`
+- `SymbolA`
+- `SymbolB`
+- `pW`
+- `pNI`
+- `pInt`
+
+Observed counts:
+
+- `BioPlex_293T_Network_10K_Dec_2019.tsv`: `118,162` rows
+- `BioPlex_HCT116_Network_5.5K_Dec_2019.tsv`: `70,966` rows
+
+Observed payload details:
+
+- no self-pairs in either file
+- gene IDs are numeric Entrez Gene identifiers
+- isoform-suffixed UniProt accessions are common
+- `UniprotA` can be the literal string `UNKNOWN`
+  - `293T`: `4,950` rows
+  - `HCT116`: `1,688` rows
+- `UniprotB` did not contain `UNKNOWN` in the profiled files
+- `pInt` values range from about `0.75` to `1.0`
+
+## Implemented Mapping
+
+Current graph mapping:
+
+- emit `PPIEdge`
+- use UniProt accessions as the primary emitted identifier family for endpoint proteins
+- fall back to `NCBIGene` when BioPlex reports `UniprotA='UNKNOWN'`
+- preserve the BioPlex confidence-style columns as graph fields:
+  - `pW` as `p_wrong`
+  - `pNI` as `p_ni`
+  - `pInt` as `p_int`
+
+Implementation choice:
+
+- configure one adapter instance per file so provenance distinguishes `293T` versus `HCT116` via the version string
+- do not populate adapter-level `sources`; the ETL framework stamps canonical datasource/version metadata
+
+## Validation Summary
+
+Validated outcomes:
+
+- merged graph edges can carry multiple `p_int` values when the same canonical pair is supported by both BioPlex cell lines
+- `UNKNOWN` UniProt rows resolve through `NCBIGene:*` fallback when the reviewed target graph contains the mapped protein
+- downstream `ncats_ppi` exports BioPlex rows with scalar `p_int`, `p_ni`, and `p_wrong`, collapsing each merged graph list with `max(...)`
+- promoted into:
+  - `src/use_cases/pharos/pharos.yaml`
+  - `src/use_cases/pharos/target_graph.yaml`
+
+Open follow-up questions:
+
+- whether cell-line provenance should eventually be carried in a dedicated edge field instead of only in the version string recorded in provenance / sources
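
Reviewer sketch (not part of the patch): the BioPlex endpoint fallback and the downstream `max(...)` collapse described in the design above could look roughly like this. The function names, the dict-shaped edge, and the `UniProtKB:` prefix are assumptions for illustration only; the design confirms just the `NCBIGene:*` fallback, the column names, and the three target fields.

```python
from typing import Dict, List, Optional


def endpoint_id(uniprot: str, gene_id: str) -> str:
    """Pick the emitted identifier for one BioPlex endpoint.

    Falls back to the Entrez Gene ID when the UniProt column holds the
    literal string UNKNOWN. The "UniProtKB:" prefix is an assumption.
    """
    if uniprot.strip().upper() == "UNKNOWN":
        return f"NCBIGene:{gene_id}"
    return f"UniProtKB:{uniprot}"


def row_to_edge(row: Dict[str, str]) -> Dict[str, object]:
    """Map one parsed TSV row (keyed by header name) to a hypothetical edge dict."""
    return {
        "start_id": endpoint_id(row["UniprotA"], row["GeneA"]),
        "end_id": endpoint_id(row["UniprotB"], row["GeneB"]),
        # List-valued so that edges merged across cell lines can carry
        # one value per contributing file.
        "p_wrong": [float(row["pW"])],
        "p_ni": [float(row["pNI"])],
        "p_int": [float(row["pInt"])],
    }


def collapse_max(values: List[float]) -> Optional[float]:
    """Collapse a merged graph list to a scalar for export; None when empty."""
    return max(values) if values else None
```

With a row whose `UniprotA` is `UNKNOWN`, `row_to_edge` emits the start node as `NCBIGene:<GeneA>`, and `collapse_max` reduces a merged `p_int` list to the single value written to `ncats_ppi`.
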
diff --git a/designs/ppi/reactome_ppi_ingest_design.md b/designs/ppi/reactome_ppi_ingest_design.md
@@ -0,0 +1,127 @@
+# Reactome PPI Ingest Design
+
+## Status
+
+Implemented, validated in the working graph and working MySQL paths, and promoted to `pharos.yaml` / `target_graph.yaml`.
+
+## Goal
+
+Add a first-pass Reactome-derived protein-protein interaction ingest for Pharos.
+
+## Source Choice
+
+Use the official human Reactome tab-delimited interaction file:
+
+- `reactome.homo_sapiens.interactions.tab-delimited.txt`
+
+Rationale:
+
+- This is the current official human interaction export on the Reactome download site.
+- It matches the old TCRD loader input format.
+- It includes interaction type and context/PMID fields that can map naturally into the current PPI model and TCRD export.
+
+## Source URLs
+
+- Download docs: `https://reactome.org/download-data?id=62&ml=1`
+- Directory index: `https://reactome.org/download/current/interactors/`
+- Human tab-delimited file: `https://reactome.org/download/current/interactors/reactome.homo_sapiens.interactions.tab-delimited.txt`
+
+## Version Strategy
+
+- Use the Reactome database version recorded in `input_files/auto/reactome/reactome_version.tsv`
+- Use the PPI file `Last-Modified` header as `version_date`
+- Let adapter-side `download_date` come from file mtime unless we later decide to persist it explicitly
+
+## Documented File Shape
+
+Reactome documents the tab-delimited human interaction file as:
+
+1. interactor 1 protein ID
+2. interactor 1 Ensembl gene ID(s)
+3. interactor 1 Entrez Gene ID(s)
+4. interactor 2 protein ID
+5. interactor 2 Ensembl gene ID(s)
+6. interactor 2 Entrez Gene ID(s)
+7. interaction type
+8. interaction context
+9. PubMed IDs
+
+## Legacy Comparison
+
+The old UNM TCRD loader used the same human tab-delimited Reactome interaction file and:
+
+- required both interactors to have UniProt IDs
+- populated `interaction_type`
+- skipped duplicate interaction rows
+- skipped self-pairs
+
+This should be treated as a comparison point only; current behavior should still be validated against the real file after download.
+
+## Observed File Profile
+
+Observed counts from the downloaded file:
+
+- total rows: `123,895`
+- rows where both interactors are UniProt proteins: `83,545`
+- filtered non-protein rows: `40,350`
+- protein self-pairs: `7,677`
+- duplicate unordered protein-pair-plus-type rows: `57,386`
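+- distinct unordered protein-pair-plus-type combinations: `26,159` (together with the duplicate rows this accounts for all `83,545` protein-protein rows, so both counts are taken before self-pairs are removed)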
+
+Observed payload behavior:
+
+- non-protein rows include identifiers such as `ChEBI:*`
+- almost every protein-protein row has PubMed references
+- every row has a Reactome context string like `reactome:R-HSA-...`
+- current interaction types include values such as:
+  - `physical association`
+  - `enzymatic reaction`
+  - `cleavage reaction`
+  - `dephosphorylation reaction`
+
+## Implemented Mapping
+
+Current first-pass graph mapping:
+
+- emit `PPIEdge`
+- keep only rows where both interactors are UniProt IDs
+- skip self-pairs
+- canonicalize unordered protein pairs
+- dedupe repeated source rows by unordered pair plus interaction type
+- preserve:
+  - `interaction_type` as a graph list field
+  - `contexts` as a graph list field
+  - `pmids` as a graph list field
+- do not populate adapter-level `sources`; the ETL framework stamps canonical datasource/version metadata
+
+## Legacy Downstream Comparison
+
+Direct inspection of `pharos319.ncats_ppi` showed:
+
+- `StringDB` populated only `score`
+- `BioPlex` populated only `p_int`, `p_ni`, and `p_wrong`
+- `Reactome` rows left `evidence`, `interaction_type`, `score`, `p_int`, `p_ni`, and `p_wrong` empty
+
+## Current Downstream Mapping
+
+Current IFX_ODIN downstream decision:
+
+- keep Reactome `pmids`, `contexts`, and `interaction_type` in the graph
+- map `pmids` to `ncats_ppi.evidence` as pipe-delimited PMIDs
+- map `interaction_type` to `ncats_ppi.interaction_type`
+- keep `contexts` graph-only for now
+
+## Validation Summary
+
+Validated outcomes:
+
+- Reactome-backed graph edges landed with non-empty `pmids`, `contexts`, and `interaction_type`
+- Reactome merged cleanly with both BioPlex and STRING on shared canonical pairs
+- downstream `ncats_ppi` rows now carry Reactome PMIDs in `evidence` and Reactome interaction types in `interaction_type`
+- promoted into:
+  - `src/use_cases/pharos/pharos.yaml`
+  - `src/use_cases/pharos/target_graph.yaml`
+
+## Open Follow-Ups
+
+- decide whether Reactome context should eventually have its own dedicated downstream column or lookup table
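
Reviewer sketch (not part of the patch): the Reactome filtering, canonicalization, and dedupe steps listed under "Implemented Mapping" could be expressed roughly as below. The `uniprotkb:` prefix test, the comma PMID separator, the function names, and the choice to accumulate contexts and PMIDs across duplicate rows are all assumptions; the design only states the column order and the dedupe key.

```python
from typing import Dict, Iterable, List, Tuple

UNIPROT_PREFIX = "uniprotkb:"  # assumed prefix for protein interactors
PMID_SEPARATOR = ","           # assumed separator inside the PubMed column

EdgeKey = Tuple[str, str, str]  # (first protein, second protein, interaction type)


def collect_reactome_edges(lines: Iterable[str]) -> Dict[EdgeKey, Dict[str, List[str]]]:
    """Group raw tab-delimited rows into one entry per unordered pair plus type."""
    edges: Dict[EdgeKey, Dict[str, List[str]]] = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # malformed or truncated row
        left, right, interaction_type = cols[0], cols[3], cols[6]
        # Keep only rows where both interactors are UniProt IDs; this also
        # drops a "#"-prefixed header line and ChEBI-style interactors.
        if not (left.startswith(UNIPROT_PREFIX) and right.startswith(UNIPROT_PREFIX)):
            continue
        if left == right:
            continue  # skip self-pairs
        first, second = sorted((left, right))  # canonicalize the unordered pair
        edge = edges.setdefault(
            (first, second, interaction_type), {"contexts": [], "pmids": []}
        )
        candidates = (
            ("contexts", [cols[7]]),
            ("pmids", cols[8].split(PMID_SEPARATOR)),
        )
        for field, raw_values in candidates:
            for value in raw_values:
                value = value.strip()
                if value and value not in edge[field]:
                    edge[field].append(value)
    return edges


def evidence_string(pmids: List[str]) -> str:
    """Format PMIDs as the pipe-delimited value described for ncats_ppi.evidence."""
    return "|".join(pmids)
```

Keying the dictionary on the sorted pair plus interaction type means that `A, B` and `B, A` rows with the same type land on the same entry, which is the dedupe rule the design describes.
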
diff --git a/designs/ppi/string_ppi_ingest_design.md b/designs/ppi/string_ppi_ingest_design.md
@@ -4,8 +4,7 @@
 
 Implemented and validated in the working graph and working MySQL paths.
 
-This first pass covers **STRING human protein-protein interactions** only.
-BioPlex and Reactome PPI remain follow-up sources.
+STRING, BioPlex, and Reactome PPI are now all implemented for Pharos.
 
 ## Scope
 
@@ -15,9 +14,7 @@ Implemented source:
 
 Explicitly deferred:
 
-- BioPlex PPI
-- Reactome PPI
-- source-specific `interaction_type` / `evidence` population
+- richer STRING channel-specific fields from `.protein.links.full...`
 
 ## Files Added / Changed
 
@@ -102,14 +99,14 @@ Confirmed old IFX_ODIN / Pharos readback behavior:
 - `PPIEdge`
   - `start_node`: `Protein(id="ENSEMBL:ENSP...")`
   - `end_node`: `Protein(id="ENSEMBL:ENSP...")`
-  - `sources`: STRING provenance list
   - `score`: list-valued, emitted as `[combined_score]`
 
 Implementation choices:
 
 - `score_cutoff` is an adapter parameter with default `400`
 - rows below the cutoff are discarded before they enter the graph
 - self-pairs are discarded before they enter the graph
+- adapter does not populate `sources`; the ETL framework stamps canonical datasource/version metadata
 - `max_rows` is supported for bounded validation runs and counts **kept emitted
   edges**, not scanned raw lines
 
@@ -237,10 +234,6 @@ pairs before export.
 
 - Profile whether STRING `.protein.links.full...` is worth revisiting for richer
   channel-specific fields
-- Add Reactome PPI ingest
-  - populate `interaction_type`
-  - decide whether the Reactome evidence/context column should map to `evidence`
-- Add BioPlex PPI ingest
 - Decide whether downstream `ncats_ppi` export should collapse duplicate canonical
   pairs before reciprocal row generation, or continue to preserve one SQL row pair
   per graph edge
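
Reviewer sketch (not part of the patch): the STRING adapter rules in the hunk above (`score_cutoff` default `400`, self-pair removal, and `max_rows` counting kept emitted edges rather than scanned lines) could be written as the generator below. The function name, the tuple-shaped input, the dict-shaped edge, and the removal of a leading taxon prefix such as `9606.` are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def iter_string_edges(
    rows: Iterable[Tuple[str, str, str]],
    score_cutoff: int = 400,
    max_rows: Optional[int] = None,
) -> Iterator[Dict[str, object]]:
    """Yield edge dicts from (protein1, protein2, combined_score) tuples."""
    kept = 0
    for protein1, protein2, combined_score in rows:
        # max_rows limits emitted edges, so discarded rows do not count.
        if max_rows is not None and kept >= max_rows:
            break
        score = int(combined_score)
        if score < score_cutoff:
            continue  # below the cutoff: never enters the graph
        if protein1 == protein2:
            continue  # self-pair: never enters the graph
        kept += 1
        yield {
            # Assumed ID shape: drop the taxon prefix, add the ENSEMBL prefix.
            "start_id": "ENSEMBL:" + protein1.split(".", 1)[-1],
            "end_id": "ENSEMBL:" + protein2.split(".", 1)[-1],
            "score": [score],  # list-valued, as in the documented edge shape
        }
```

Because the counter advances only when an edge is yielded, a bounded validation run with `max_rows=2` still returns two edges even if many low-score rows are scanned first.
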
diff --git a/src/constants.py b/src/constants.py
@@ -18,6 +18,7 @@ class DataSourceName(SimpleEnum):
     Cellosaurus = "Cellosaurus"
     Reactome = "Reactome"
     STRING = "STRING"
+    BioPlex = "BioPlex"
     WikiPathways = "WikiPathways"
     PathwayCommons = "PathwayCommons"
     CLO = "Cell Line Ontology (CLO)"