ncats · KeithKelleher · Apr 17, 2026 · Apr 16, 2026
diff --git a/designs/ppi/string_ppi_ingest_design.md b/designs/ppi/string_ppi_ingest_design.md
@@ -0,0 +1,246 @@
+# STRING PPI Ingest Design
+
+## Status
+
+Implemented and validated in the working graph and working MySQL paths.
+
+This first pass covers **STRING human protein-protein interactions** only.
+BioPlex and Reactome PPI remain follow-up sources.
+
+## Scope
+
+Implemented source:
+
+- STRING human PPI
+
+Explicitly deferred:
+
+- BioPlex PPI
+- Reactome PPI
+- source-specific `interaction_type` / `evidence` population
+
+## Files Added / Changed
+
+### New / updated ingest code
+
+- `src/input_adapters/string/string_ppi.py`
+- `src/models/ppi.py`
+- `workflows/pharos.Snakefile`
+- `src/use_cases/working.yaml`
+
+### New / updated downstream code
+
+- `src/input_adapters/pharos_arango/tcrd/ppi.py`
+- `src/output_adapters/sql_converters/tcrd.py`
+- `src/use_cases/working_mysql.yaml`
+
+### Tests
+
+- `tests/test_string_ppi.py`
+- `tests/test_ppi_record_merging.py`
+- `tests/test_tcrd_output_converter.py`
+
+## Source Inputs
+
+Implemented download target:
+
+- `https://stringdb-downloads.org/download/protein.links.v12.0/9606.protein.links.v12.0.txt.gz`
+
+Stored under:
+
+- `input_files/auto/string/9606.protein.links.v12.0.txt.gz`
+- `input_files/auto/string/string_version.tsv`
+
+Version strategy:
+
+- `version`: hardcoded as `12.0` in the Snakemake rule
+- `version_date`: derived from HTTP `Last-Modified`
+- `download_date`: adapter file mtime unless explicitly present in version TSV
+
+## Raw File Profiling
+
+Observed file shape:
+
+- header: `protein1 protein2 combined_score`
+- IDs are `9606.ENSP...`
+- no self-pairs were observed in the raw file
+
+Observed counts from the local v12.0 file:
+
+- total rows: `13,715,404`
+- rows kept at `score >= 400`: `1,858,944`
+- rows filtered out below `400`: `11,856,460`
+- percent filtered out by the cutoff: `86.45%`
+
+## Legacy Comparison
+
+Old TCRD loader reviewed:
+
+- `https://github.com/unmtransinfo/TCRD/blob/master/loaders/load-STRINGDB.py`
+
+Confirmed old behavior:
+
+- input file: `9606.protein.links.v11.0.txt`
+- strips `9606.` and works with bare `ENSP...`
+- inserts `ppitype='STRINGDB'`
+- inserts scalar `score=<combined_score>`
+- skips unmapped proteins
+- skips self-pairs
+- does **not** apply a score cutoff during load
+
+Confirmed old IFX_ODIN / Pharos readback behavior:
+
+- `src/input_adapters/pharos_mysql/ppi_adapter.py` keeps only `StringDB` rows with
+  `score >= 400`
+- it filters `protein_id < other_id`, which implies `ncats_ppi` stores reciprocal
+  rows and the graph readback collapses them to one undirected edge
+
+## Implemented Graph Mapping
+
+`StringPPIAdapter` emits:
+
+- `PPIEdge`
+  - `start_node`: `Protein(id="ENSEMBL:ENSP...")`
+  - `end_node`: `Protein(id="ENSEMBL:ENSP...")`
+  - `sources`: STRING provenance list
+  - `score`: list-valued, emitted as `[combined_score]`
+
+Implementation choices:
+
+- `score_cutoff` is an adapter parameter with default `400`
+- rows below the cutoff are discarded before they enter the graph
+- self-pairs are discarded before they enter the graph
+- `max_rows` is supported for bounded validation runs and counts **kept emitted
+  edges**, not scanned raw lines
+
+Intentionally not populated for STRING first pass:
+
+- `p_int`
+- `p_ni`
+- `p_wrong`
+- `interaction_type`
+- `evidence`
+
+Reason:
+
+- the selected STRING file only provides `combined_score`
+- preserving guessed mappings into legacy fields would be speculative
+
+## Resolver Path
+
+STRING emits `ENSEMBL:ENSP...` protein IDs and relies on `tcrd_targets` for
+canonicalization into reviewed target-graph proteins.
+
+Observed partial-run behavior with:
+
+- `score_cutoff: 400`
+- `max_rows: 10000`
+- `collapse_reviewed_targets: true`
+
+Results:
+
+- adapter emitted `10,000` edges
+- `test_pharos` stored `9,425` `PPIEdge` docs
+- the difference was mostly resolver drop-off from unresolved reviewed-target
+  coverage, not adapter parsing or cutoff logic
+
+Spot checks performed during validation:
+
+- no adapter-emitted self-loops
+- no graph self-loop `PPIEdge` docs
+- graph `PPIEdge` score range: `400..999`
+
+## Graph Merge Behavior
+
+`PPIEdge.score` is now list-valued.
+
+Reason:
+
+- some final graph edges aggregate multiple STRING rows after target resolution and
+  reviewed-target collapse
+- a scalar `score` was being overwritten under `KeepLast`
+- a list-valued `score` preserves all merged values in the graph
+
+This was validated with a dedicated record-merging test.
+
+## Downstream MySQL Mapping
+
+`TCRDOutputConverter` now exports `PPIEdge` to `ncats_ppi`.
+
+Implemented behavior:
+
+- export reciprocal rows for parity with `pharos319`
+- normalize graph source label `STRING` to legacy `StringDB` in `ppitypes`
+- write `score = max(score_list)` for the SQL row
+
+Why `max(score)`:
+
+- the graph preserves all merged STRING scores
+- the legacy SQL schema stores only one scalar `score`
+- `max(score)` is a conservative first-pass collapse rule
+
+## Validation Results
+
+### Local test coverage
+
+Executed:
+
+- `pytest tests/test_string_ppi.py`
+- `pytest tests/test_string_ppi.py tests/test_ppi_record_merging.py`
+- `pytest tests/test_tcrd_output_converter.py`
+
+Result:
+
+- all targeted tests passed
+
+### Partial graph validation
+
+Working graph setup used:
+
+- reviewed UniProt proteins
+- `StringPPIAdapter(score_cutoff=400, max_rows=10000)`
+- database: `test_pharos`
+
+Observed graph contents:
+
+- `Protein`: `20,332`
+- `PPIEdge`: `9,425`
+- all `PPIEdge` rows carried STRING provenance
+- score range in graph: `400..999`
+- no graph self-loops
+
+### Working MySQL validation
+
+`working_mysql.yaml` was reduced to:
+
+- `ProteinAdapter`
+- `ProteinPPIAdapter`
+
+Observed downstream contents in `pharos400_working`:
+
+- `ncats_ppi` rows: `18,850`
+- score range: `400..999`
+
+Important note:
+
+- `18,850` rows corresponds to reciprocal export of `9,425` graph edges
+- however, not every unordered protein pair appears exactly twice
+- some unordered pairs appear four times because multiple graph `PPIEdge` docs
+  collapse to the same canonical protein pair after reviewed-target resolution,
+  and each graph edge is then exported in both directions
+
+This is currently acceptable for parity-oriented validation, but it is a follow-up
+cleanup topic if we later want the downstream SQL path to deduplicate canonical
+pairs before export.
+
+## Open Follow-Ups
+
+- Profile whether STRING `.protein.links.full...` is worth revisiting for richer
+  channel-specific fields
+- Add Reactome PPI ingest
+  - populate `interaction_type`
+  - decide whether the Reactome evidence/context column should map to `evidence`
+- Add BioPlex PPI ingest
+- Decide whether downstream `ncats_ppi` export should collapse duplicate canonical
+  pairs before reciprocal row generation, or continue to preserve one SQL row pair
+  per graph edge
diff --git a/playbooks/ingest_playbook.md b/playbooks/ingest_playbook.md
@@ -184,3 +184,8 @@ Provide a repeatable workflow for adding a new data source to the target graph i
   regenerates files on a weekly Sunday schedule — the `Last-Modified` header captures that publish
   date reliably. Write it to a small TSV in Snakemake so the adapter can read it as `version_date`.
 - **Use named parameters for `DatasourceVersionInfo`.** This avoids argument-order regressions when version handling evolves.
+
+### Edge merging and downstream parity
+- **If duplicate source rows can collapse onto the same final graph edge, make merge-sensitive edge payload list-valued in the graph.** STRING PPIs needed `score: List[float]` instead of a scalar because reviewed-target collapse could merge multiple source rows onto one final `PPIEdge`; otherwise `KeepLast` silently overwrote earlier scores.
+- **A graph edge and a downstream SQL row do not need the same cardinality.** For PPIs, the graph should keep one conceptual undirected edge with merged payload, while the legacy `ncats_ppi` table expects reciprocal directed rows for query-order parity. Model the graph cleanly first, then adapt row cardinality explicitly in the SQL converter.
+- **When a legacy downstream table stores one scalar but the graph preserves many values, choose and document an explicit collapse rule.** For STRING PPI export, `max(score)` was used for `ncats_ppi.score`; do not let scalar selection happen implicitly through merge order.
diff --git a/src/constants.py b/src/constants.py
@@ -17,6 +17,7 @@ class DataSourceName(SimpleEnum):
     CCLE = "Cancer Cell Line Encyclopedia (CCLE)"
     Cellosaurus = "Cellosaurus"
     Reactome = "Reactome"
+    STRING = "STRING"
     WikiPathways = "WikiPathways"
     PathwayCommons = "PathwayCommons"
     CLO = "Cell Line Ontology (CLO)"

diff --git a/src/input_adapters/pharos_arango/tcrd/ppi.py b/src/input_adapters/pharos_arango/tcrd/ppi.py
@@ -0,0 +1,51 @@
+from typing import Generator, List
+
+from src.input_adapters.pharos_arango.tcrd.protein import PharosArangoAdapter
+from src.models.datasource_version_info import DataSourceDetails
+from src.models.ppi import PPIEdge
+from src.models.protein import Protein
+
+
+def protein_ppi_query(last_key: str = None, limit: int = 10000) -> str:
+    filter_clause = f'FILTER rel._key > "{last_key}"' if last_key else ""
+    return f"""
+    FOR rel IN `PPIEdge`
+        {filter_clause}
+        SORT rel._key
+        LIMIT {limit}
+        RETURN rel
+    """
+
+
+def ppi_version_query() -> str:
+    return """FOR rel IN `PPIEdge` LIMIT 1 RETURN rel.creation"""
+
+
+class ProteinPPIAdapter(PharosArangoAdapter):
+    batch_size = 10_000
+
+    def get_all(self) -> Generator[List[PPIEdge], None, None]:
+        last_key = None
+        while True:
+            rows = list(self.runQuery(protein_ppi_query(last_key=last_key, limit=self.batch_size)))
+            if not rows:
+                break
+
+            yield [
+                PPIEdge(
+                    start_node=Protein(id=row["start_id"]),
+                    end_node=Protein(id=row["end_id"]),
+                    provenance=row.get("provenance"),
+                    sources=row.get("sources") or [],
+                    p_int=row.get("p_int"),
+                    p_ni=row.get("p_ni"),
+                    p_wrong=row.get("p_wrong"),
+                    score=row.get("score") or [],
+                )
+                for row in rows
+            ]
+            last_key = rows[-1]["_key"]
+
+    def get_version_info_query(self) -> DataSourceDetails:
+        raw_version_info = self.runQuery(ppi_version_query())[0]
+        return DataSourceDetails.parse_tsv(raw_version_info)
diff --git a/src/input_adapters/pharos_mysql/ppi_adapter.py b/src/input_adapters/pharos_mysql/ppi_adapter.py
@@ -47,6 +47,6 @@ def get_all(self) -> List[Relationship]:
                 p_int=row[3],
                 p_ni=row[4],
                 p_wrong=row[5],
-                score=row[6]
+                score=[row[6]] if row[6] is not None else []
             )
             for row in results]