Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -20,3 +20,4 @@ coverage.xml
lib
.pybiomart.sqlite
proxy.log
pounce_audit
129 changes: 129 additions & 0 deletions designs/drugcentral_indications_discovery_2026_04.md
@@ -0,0 +1,129 @@
# DrugCentral Indications Summary

Date: 2026-04-14

## Goal

Add DrugCentral indications to the Pharos working graph and validate the TCRD MySQL conversion path.

## Scope Chosen

- Source: current DrugCentral PostgreSQL `omop_relationship_doid_view`
- Relationship scope: `indication` only
- Protein gating: only drugs with human target activity in `act_table_full`
- Graph outputs:
- `Disease`
- `ProteinDiseaseEdge`
- Working config only:
- graph in `src/use_cases/working.yaml`
- MySQL in `src/use_cases/working_mysql.yaml`

Reference counts from discovery:

- Raw source counts:
- current DrugCentral `indication` rows: `12047`
- current indication structures: `2723`
- current indication rows with `UMLS`: `9562`
- Legacy downstream count:
- `pharos319.disease` rows with `dtype='DrugCentral Indication'`: `13919`
- Working graph count:
- `test_pharos.ProteinDiseaseEdge` rows: `41663`
- Initial working MySQL count:
- `pharos400_working.disease` rows with `dtype='DrugCentral Indication'`: `62140`

Note:

- the raw DrugCentral source count is not directly comparable to the MySQL `disease` table count
- the source count is pre-expansion
- the MySQL `disease` count is post-expansion across protein targets
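The fan-out behind this note can be shown with a toy example. The drug names and accessions below are hypothetical placeholders, not real DrugCentral rows: each source indication row is multiplied across the drug's human target accessions, which is why downstream counts exceed the raw source count.

```python
# Toy illustration of pre- vs post-expansion counts.
# Drug names and accessions are hypothetical, not real DrugCentral data.
indications = [("drugA", "Hypertension"), ("drugB", "Alcoholism")]
human_targets = {"drugA": ["P00001", "P00002"], "drugB": ["P00003"]}

# Expansion: one (protein, disease) association per human target accession.
expanded = [
    (accession, disease)
    for drug, disease in indications
    for accession in human_targets.get(drug, [])
]

print(len(indications), len(expanded))  # pre-expansion 2, post-expansion 3
```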

Deferred:

- contraindication / off-label use
- approval metadata
- full Pharos config promotion

## Key Design Decisions

- Use `UMLS` as the primary source disease ID.
- Node Normalizer coverage was stronger for `UMLS` than `SNOMEDCT`.
- `SNOMEDCT` rows were already paired with `UMLS`, so it was not needed as a fallback.
- Do not rely on DrugCentral `DOID` as a fallback.
- It did not add useful extra coverage once `UMLS` / `SNOMEDCT` were considered.
- Preserve text-only indication concepts.
- If a row has `UMLS`, emit `Disease.id = UMLS:<cui>`.
- If a row has no `UMLS`, emit a stable local ID `DrugCentral:INDICATION:<hash>`.
- Preserve source metadata in edge details:
- `drug_name`
- `snomed_id`
- `doid`

Resolver investigation summary:

- We checked the current Node Normalizer integration used by Pharos / target_graph.
- `UMLS` resolved at a higher rate than `SNOMEDCT` for DrugCentral indication IDs.
- Where both existed, `UMLS` and `SNOMEDCT` usually normalized to the same concept, but not always.
- That was enough to justify a deterministic rule:
- use `UMLS` as the emitted disease ID
- keep `SNOMEDCT` as metadata
- avoid mixing the two at adapter time

## Legacy Comparison

Legacy `pharos319` DrugCentral indications:

- lived in `disease` with `dtype = 'DrugCentral Indication'`
- populated `name` and `drug_name`
- sometimes populated `did`
- often had no discrete disease ID at all

This mattered for first-pass design because legacy Pharos did preserve text-only indication names downstream.

Additional legacy comparison:

- among the Pharos disease association sources reviewed in `pharos319`, DrugCentral was the only one with true name-only disease association rows that had neither `did` nor `mondoid`

## What Was Implemented

Code/config changes:

- new DrugCentral indication adapter
- `DiseaseAssociationDetail` extended for DrugCentral metadata
- `working.yaml` updated to ingest DrugCentral indications into `test_pharos`
- `working_mysql.yaml` updated to read disease data from `test_pharos`
- TCRD converter updated so:
- `disease.did = detail.source_id`
- `disease.drug_name = detail.drug_name`

## Graph Validation

Validation of `test_pharos` showed:

- DrugCentral indication diseases were loaded
- many `UMLS` diseases normalized to `MONDO`
- text-only indication concepts were preserved as local `DrugCentral:INDICATION:*` disease nodes

Representative outcomes:

- `Anesthesia for cesarean section` preserved
- `Local anesthesia` preserved
- `Alcoholism` normalized to `MONDO`
- `Metastatic Breast Carcinoma` preserved as a local DrugCentral indication concept

## MySQL Validation

The initial `pharos400_working` run surfaced two converter issues:

- `disease.drug_name` was not being populated
- local graph IDs were leaking into `disease.did`

Those converter issues were patched.

## Remaining Follow-Up

- Recheck `pharos400_working` after the latest converter rerun:
- `disease.did`
- `disease.drug_name`
- `disease.mondoid`
- `ncats_disease` preservation of text-only names
- Compare this pattern with other disease association ingests such as CTD to decide whether the local-ID strategy should remain DrugCentral-specific or become a broader convention.
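A minimal recheck could assert the two patched behaviors directly on sampled rows. The rows below are hypothetical stand-ins with placeholder values, not output from `pharos400_working`:

```python
# Hypothetical sample of post-rerun disease rows; all values are illustrative.
sample_rows = [
    {"dtype": "DrugCentral Indication", "did": "UMLS:C0000001",
     "drug_name": "exampledrug", "mondoid": "MONDO:0000001"},
    {"dtype": "DrugCentral Indication", "did": None,
     "drug_name": "otherdrug", "mondoid": None},
]

for row in sample_rows:
    # Local graph IDs must no longer leak into disease.did.
    assert not (row["did"] or "").startswith("DrugCentral:INDICATION:")
    # drug_name must be populated for every DrugCentral Indication row.
    assert row["drug_name"]
```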
1 change: 1 addition & 0 deletions designs/graph_views_metadata.md
@@ -5,6 +5,7 @@
Add graph-owned live views/exports that `qa_browser` can discover and execute
generically.


The first target is `current_tdls`. Because TDL fields are populated in
post-processing, that view should be declared in post-processing YAML such as
[`pharos_aql_post.yaml`](/Users/kelleherkj/IdeaProjects/IFX_ODIN/src/use_cases/pharos/pharos_aql_post.yaml).
25 changes: 18 additions & 7 deletions playbooks/ingest_playbook.md
@@ -58,18 +58,29 @@ Provide a repeatable workflow for adding a new data source to the target graph i
- Note that the TCRD format is not always a natural fit, but often captures important historical scope.
- For Pharos-related sources, inspect the old loader implementation in the TCRD repository when it helps explain legacy field choices or filtering.

6) **Review data that makes it into TCRD**
6) **Check identifier normalization coverage**
- Before adapter implementation, inspect how the configured resolver path will normalize the source IDs.
- For Pharos / target_graph disease ingest, check the current Node Normalizer integration in `src/id_resolvers/node_normalizer.py`.
- When the source offers multiple disease identifier families, profile each candidate family separately (for example `UMLS`, `SNOMEDCT`, `DOID`) rather than assuming the most ontology-like one is best.
- Query the resolver service metadata when helpful, for example Node Normalizer `GET /get_curie_prefixes`, to confirm accepted prefixes.
- Measure real coverage on distinct source IDs, not just a few spot checks.
- Record both:
- percent of source IDs that resolve at all
- representative canonical prefixes returned by the resolver
- Use these findings to choose what raw source ID the adapter should emit and leave canonicalization to the resolver layer whenever possible.
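One way to measure that coverage is a small helper that takes the resolver as a callable, so it works with any client wrapper around `src/id_resolvers/node_normalizer.py`. The helper and the stub resolver in the usage note are illustrative sketches, not part of the repo:

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def measure_coverage(
    source_ids: Iterable[str],
    resolve: Callable[[str], Optional[str]],
) -> tuple[float, Counter]:
    """Return (fraction of distinct IDs that resolve, canonical prefix counts)."""
    distinct = list(dict.fromkeys(source_ids))  # dedupe, preserve order
    prefixes: Counter = Counter()
    resolved = 0
    for curie in distinct:
        canonical = resolve(curie)
        if canonical:
            resolved += 1
            prefixes[canonical.split(":", 1)[0]] += 1
    return resolved / len(distinct), prefixes
```

With a stub resolver that maps two of three distinct UMLS CUIs to MONDO IDs, this reports two-thirds coverage with a single `MONDO` canonical prefix, matching the two metrics the step asks you to record.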

7) **Review data that makes it into TCRD**
- Currently Pharos uses pharos319.
- Review the relevant tables and row counts to understand what was ingested previously.
- Compare previous ingest output against the current raw payload to separate legacy limitations from current source reality.
- When relevant, also inspect `pharos400` to understand what the newer MySQL path already captures or still misses.

7) **Pause and propose the implementation plan**
8) **Pause and propose the implementation plan**
- Summarize the intended adapter scope, node/edge model, resolver dependencies, and validation plan.
- Keep the first pass intentionally minimal.
- Get user confirmation before making code changes.

8) **Implement an InputAdapter**
9) **Implement an InputAdapter**
- Inherit from `src/interfaces/input_adapter.py` (or `FlatFileAdapter`).
- Implement:
- `get_all`
@@ -78,18 +89,18 @@
- Emit `Node` / `Relationship` models that match the schema.
- Keep adapters focused on source parsing and structural graph emission.

9) **Map to the data model**
10) **Map to the data model**
- Confirm existing node/edge classes or add new ones in `src/models/`.
- Use stable IDs and consistent prefixes.
- Avoid speculative parsing when source text is ambiguous; preserve the source text when parsing would be lossy.
- Keep source-specific payload that may merge later inside `details` structures instead of flattening it into top-level edge fields.

10) **Wire configuration into YAML**
11) **Wire configuration into YAML**
- Add the adapter to `src/use_cases/working.yaml` first.
- Pass file paths and version metadata file paths via `kwargs`.
- Only after the working ingest is validated, promote the finalized configuration into `src/use_cases/pharos/target_graph.yaml`.

11) **Validate the working ingest**
12) **Validate the working ingest**
- Ask the user to run the working ETL path.
- Validate that counts, labels, IDs, provenance, and key edge endpoints look correct.
- Validate that representative input-file records land where expected in the working graph and, when available, in the working MySQL output.
@@ -100,7 +111,7 @@ Provide a repeatable workflow for adding a new data source to the target graph i
- which source-specific columns are populated in `pharos319` but still empty in the working MySQL output
- whether graph data is present in the working graph but not yet mapped into downstream tables

12) **Update the design document**
13) **Update the design document**
- Revise the design doc to reflect what actually ended up in the code:
- Final field mappings and any decisions that changed during implementation
- Actual node/edge counts produced
118 changes: 118 additions & 0 deletions src/input_adapters/drug_central/drug_indication.py
@@ -0,0 +1,118 @@
from collections import OrderedDict
import hashlib
from typing import Generator, List, Union

from sqlalchemy import text

from src.constants import DataSourceName, Prefix
from src.input_adapters.drug_central.drug_node import DrugCentralAdapter
from src.interfaces.input_adapter import InputAdapter
from src.models.datasource_version_info import DatasourceVersionInfo
from src.models.disease import Disease, DiseaseAssociationDetail, ProteinDiseaseEdge
from src.models.node import EquivalentId, Node, Relationship
from src.models.protein import Protein


class DrugCentralIndicationAdapter(InputAdapter, DrugCentralAdapter):
batch_size = 10000

def get_datasource_name(self) -> DataSourceName:
return DataSourceName.DrugCentral

def get_version(self) -> DatasourceVersionInfo:
return self.version_info

def get_all(self) -> Generator[List[Union[Node, Relationship]], None, None]:
diseases_by_id: OrderedDict[str, Disease] = OrderedDict()
edge_by_key: OrderedDict[tuple[str, str], ProteinDiseaseEdge] = OrderedDict()
seen_detail_keys: set[tuple[str, str, str, str | None, str | None]] = set()

with self.get_session() as session:
rows = session.execute(text("""
select distinct
o.struct_id,
s.name as drug_name,
o.concept_name,
o.umls_cui,
o.snomed_conceptid,
o.doid,
a.accession
from omop_relationship_doid_view o
join structures s
on s.id = o.struct_id
join act_table_full a
on a.struct_id = o.struct_id
where o.relationship_name = 'indication'
and a.organism = 'Homo sapiens'
and a.accession is not null
and trim(a.accession) <> ''
order by o.struct_id, o.concept_name, o.umls_cui, a.accession
""")).mappings()

for row in rows:
disease_name = (row["concept_name"] or "").strip()
if not disease_name:
continue
umls_cui = (row["umls_cui"] or "").strip() or None
disease_id = self._disease_id(disease_name, umls_cui)

diseases_by_id.setdefault(
disease_id,
Disease(
id=disease_id,
name=disease_name,
),
)

snomed_id = (
EquivalentId(id=str(row["snomed_conceptid"]).strip(), type=Prefix.SNOMEDCT).id_str()
if row["snomed_conceptid"] is not None and str(row["snomed_conceptid"]).strip()
else None
)
doid = (row["doid"] or "").strip() or None
drug_name = (row["drug_name"] or "").strip() or None

for accession in self._split_accessions(row["accession"]):
protein_id = EquivalentId(id=accession, type=Prefix.UniProtKB).id_str()
edge_key = (protein_id, disease_id)
detail_key = (protein_id, disease_id, drug_name or "", snomed_id, doid)
if detail_key in seen_detail_keys:
continue
seen_detail_keys.add(detail_key)

detail = DiseaseAssociationDetail(
source="DrugCentral Indication",
source_id=EquivalentId(id=umls_cui, type=Prefix.UMLS).id_str() if umls_cui else None,
drug_name=drug_name,
snomed_id=snomed_id,
doid=doid,
)

if edge_key not in edge_by_key:
edge_by_key[edge_key] = ProteinDiseaseEdge(
start_node=Protein(id=protein_id),
end_node=diseases_by_id[disease_id],
details=[detail],
)
else:
edge_by_key[edge_key].details.append(detail)

yield list(diseases_by_id.values())
edge_values = list(edge_by_key.values())
for i in range(0, len(edge_values), self.batch_size):
yield edge_values[i:i + self.batch_size]

@staticmethod
def _split_accessions(raw_accessions: str) -> List[str]:
return [
token.strip()
for token in raw_accessions.split("|")
if token and token.strip()
]

@staticmethod
def _disease_id(disease_name: str, umls_cui: str | None) -> str:
if umls_cui:
return EquivalentId(id=umls_cui, type=Prefix.UMLS).id_str()
digest = hashlib.sha1(disease_name.strip().lower().encode("utf-8")).hexdigest()[:16]
return f"DrugCentral:INDICATION:{digest}"
5 changes: 5 additions & 0 deletions src/input_adapters/pharos_arango/tcrd/disease.py
@@ -63,6 +63,11 @@ def get_all(self) -> Generator[List[Union[Disease, DiseaseParentEdge]], None, No
rows.append(disease)
yield rows

db = self.get_db()
if not db.has_collection("DiseaseParentEdge"):
yield []
return

parents = self.runQuery(disease_parent_query())
yield [
DiseaseParentEdge(
2 changes: 1 addition & 1 deletion src/input_adapters/pounce_sheets/pounce_node_builder.py
@@ -388,7 +388,7 @@ def _experiment_nodes(
meta_sheet=ExperimentWorkbook.ProteinDataMetaSheet.name,
data_sheet=ExperimentWorkbook.ProteinDataSheet.name,
analyte_id_col=analyte_id_col,
data_type="protein data",
data_type="raw data",
parser=exp_parser
)

3 changes: 3 additions & 0 deletions src/models/disease.py
@@ -48,6 +48,9 @@ class DiseaseAssociationDetail:
confidence: Optional[float] = None
zscore: Optional[float] = None
url: Optional[str] = None
drug_name: Optional[str] = None
snomed_id: Optional[str] = None
doid: Optional[str] = None

def to_dict(self):
return asdict(self)
4 changes: 2 additions & 2 deletions src/output_adapters/sql_converters/tcrd.py
@@ -552,19 +552,19 @@ def disease_converter(self, obj: dict) -> List[mysqlDisease]:
disease_name = self._disease_name(obj)
rows = []
for ordinal, detail in enumerate(self._iter_disease_details(obj)):
source_disease_id = detail.get('source_id') or resolved_disease_id
assoc_key = self._disease_assoc_key(obj['start_id'], resolved_disease_id, detail, ordinal)
rows.append(mysqlDisease(
id=self.resolve_id('disease_assoc', assoc_key),
dtype=detail.get('source') or '',
protein_id=self.resolve_id('protein', obj['start_id']),
name=disease_name,
ncats_name=disease_name,
did=source_disease_id,
did=detail.get('source_id'),
evidence="|".join(detail.get('evidence_terms') or detail.get('evidence_codes') or []) or None,
zscore=detail.get('zscore'),
conf=detail.get('confidence'),
reference=detail.get('url'),
drug_name=detail.get('drug_name'),
mondoid=mondoid,
provenance=obj['provenance'],
))