diff --git a/designs/diseases/monarch_diseases_ingest_design.md b/designs/diseases/monarch_diseases_ingest_design.md
new file mode 100644
index 0000000..9c3398a
--- /dev/null
+++ b/designs/diseases/monarch_diseases_ingest_design.md
@@ -0,0 +1,31 @@
+# Monarch Diseases Discovery Note
+
+## Status
+Rejected as a new standalone ingest source.
+
+## Why
+- The current public Monarch disease-association file is a Translator-style aggregated release, not a source-native Monarch assertion set.
+- `infores:monarchinitiative` appears as the aggregator, not the primary source.
+- The only `primary_knowledge_source` values in the current dump are:
+  - `infores:omim` (`6830` rows)
+  - `infores:clingen` (`218` rows)
+- So treating the file as a single `Monarch` ingest would blur provenance and mostly duplicate planned OMIM work.
+
+## What We Found
+- Current file: `https://data.monarchinitiative.org/monarch-kg-dev/latest/tsv/gene_associations/gene_disease.9606.tsv.gz`
+- Current dump is human-only and already normalized to `HGNC -> MONDO`.
+- Edge payload is slim: mostly predicate plus provenance.
+- Old `pharos319` Monarch rows came from a much older Monarch/MySQL export and are not reproduced by the current public file.
+- The ClinGen-backed slice adds some distinct pairs, but that is better thought of as ClinGen content than as an independent Monarch source.
+
+## Decision
+- Do not ingest this file as `Monarch`.
+- Move on to direct OMIM discovery.
+- Optionally revisit ClinGen later using a direct ClinGen download surface rather than the aggregated Monarch release.
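For future rechecks of this decision, the provenance tally that produced the row counts above can be sketched as follows. This is a hedged illustration, not ingest code: the sample rows and the `subject`/`predicate`/`object` column names are assumptions based on the Translator edge layout; only `primary_knowledge_source` and the `infores:` values are taken from this note.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the Translator-style edge layout; the real
# gene_disease.9606.tsv.gz is gzipped and much larger, but the tally is the same.
SAMPLE_TSV = (
    "subject\tpredicate\tobject\tprimary_knowledge_source\taggregator_knowledge_source\n"
    "HGNC:1100\tbiolink:gene_associated_with_condition\tMONDO:0007254\tinfores:omim\tinfores:monarchinitiative\n"
    "HGNC:1101\tbiolink:gene_associated_with_condition\tMONDO:0012933\tinfores:clingen\tinfores:monarchinitiative\n"
)

def tally_primary_sources(handle) -> Counter:
    """Count primary_knowledge_source values in a Translator-style edge TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["primary_knowledge_source"] for row in reader)

counts = tally_primary_sources(io.StringIO(SAMPLE_TSV))
print(dict(counts))
```

If a future release starts reporting primary sources other than `infores:omim` and `infores:clingen`, the rejection above should be revisited.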
+
+## References
+- Monarch KG downloads: `https://monarchinitiative.org/kg/downloads`
+- Release artifact root: `https://data.monarchinitiative.org/monarch-kg-dev/latest/`
+- Old loaders:
+  - `load-MonarchDiseases.py`
+  - `load-MonarchOrthologDiseases.py`
diff --git a/designs/protein_classes/panther_classes_ingest_design.md b/designs/protein_classes/panther_classes_ingest_design.md
new file mode 100644
index 0000000..c17b6b5
--- /dev/null
+++ b/designs/protein_classes/panther_classes_ingest_design.md
@@ -0,0 +1,295 @@
+# PANTHER Protein Classes Ingest Design
+
+## Goal
+
+Add PANTHER protein classes and evolutionary family/subfamily membership to the Pharos graph ingest, then bridge the protein-class portion into the Pharos MySQL path.
+
+## Discovery Date
+
+- 2026-04-14
+
+## Source Inputs
+
+Current official files inspected during discovery:
+
+- `https://data.pantherdb.org/PANTHER19.0/ontology/Protein_Class_19.0`
+- `https://data.pantherdb.org/PANTHER19.0/ontology/Protein_class_relationship`
+- `https://data.pantherdb.org/ftp/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/PTHR19.0_human`
+
+Legacy TCRD comparison loader:
+
+- `https://github.com/unmtransinfo/TCRD/blob/master/loaders/load-PANTHERClasses.py`
+
+## Legacy TCRD Behavior
+
+The old loader populated:
+
+- `panther_class`
+- `p2pc`
+
+Legacy ingest behavior observed from the loader:
+
+- Loaded class definitions from `Protein_Class_14.0`
+- Loaded parent relationships from `Protein_class_relationship`
+- Loaded protein-to-class assignments from the human sequence classification file
+- Matched proteins by UniProt first, then HGNC fallback
+- Stored parent IDs as pipe-delimited `parent_pcids`
+- Skipped rows without class assignments
+
+Initial Pharos MySQL comparison during discovery:
+
+- `pharos319.panther_class`: `256` rows
+- `pharos319.p2pc`: `22,520` rows
+- `pharos400.panther_class`: `0` rows
+- `pharos400.p2pc`: `0` rows
+
+This appears to be a real gap in the newer Pharos path rather than an already-migrated target-graph ingest.
+
+## Observed Payload Shape
+
+### `Protein_Class_19.0`
+
+Observed shape:
+
+- 3 metadata comment lines starting with `!`
+- 239 non-comment class rows
+- 4 tab-delimited columns per data row
+  - `pcid`
+  - hierarchical numeric code
+  - class name
+  - description
+
+Examples:
+
+- `PC00000` root class with blank description
+- `PC00197` name `transmembrane signal receptor`
+
+Notes:
+
+- The file path is under `PANTHER19.0`, but the file header says `version: 17.0` and `date: 1/11/2022`
+- No duplicate class IDs were observed in the current file
+
+### `Protein_class_relationship`
+
+Observed shape:
+
+- 2 metadata comment lines starting with `!`
+- 214 non-comment rows
+- 5 tab-delimited columns per data row
+  - child `pcid`
+  - child name
+  - parent `pcid`
+  - parent name
+  - level/order code
+
+Notes:
+
+- Each child class had one parent in the current file
+- 214 distinct child IDs were observed
+
+### `PTHR19.0_human`
+
+Observed shape:
+
+- 19,450 rows
+- 11 tab-delimited columns on every row
+- No header row
+
+Observed columns by position:
+
+1. `species|HGNC|UniProtKB` compound field
+2. UniProt accession
+3. gene symbol
+4. PANTHER family / subfamily ID
+5. family / subfamily name
+6. protein name
+7. molecular function GO bundle
+8. biological process GO bundle
+9. cellular component GO bundle
+10. protein class bundle
+11. pathway bundle
+
+Protein class encoding in the current file:
+
+- Class assignments are in column `10` as `name#PCxxxxx`
+- Multiple assignments are semicolon-delimited
+- Example: `G-protein modulator#PC00022;protein-binding activity modulator#PC00095`
+
+Profile summary:
+
+- 13,987 rows with at least one protein class
+- 5,463 rows without any protein class assignment
+- 21,948 total protein-to-class links parsed
+- 194 distinct class IDs observed in the sequence file
+- 6,939 rows had multiple class assignments
+
+Important drift from the legacy loader:
+
+- The old loader parsed class tokens from `row[8]`
+- In the current file, protein class IDs are in `row[9]`
+- `row[10]` now contains pathway data
+
+## Identifier Findings
+
+Protein identifiers available per sequence row:
+
+- UniProt accession in column 2
+- HGNC ID embedded in column 1
+- gene symbol in column 3
+
+Initial ingest choice:
+
+- Emit `Protein` nodes keyed by the UniProt accession and rely on the configured target graph resolver path
+- Keep HGNC as fallback parsing material only if resolver coverage shows UniProt misses
+
+Class identifiers:
+
+- Stable source IDs are PANTHER class IDs such as `PC00197`
+- IFX constant already exists for `PANTHER.FAMILY` in `src/constants.py`
+
+## Graph Mapping
+
+Implemented graph scope:
+
+- New `PantherFamily` node for evolutionary family / subfamily membership
+- New `ProteinPantherFamilyEdge`
+- New `PantherFamilyParentEdge`
+- New `PantherClass` node
+- New `ProteinPantherClassEdge`
+- New `PantherClassParentEdge`
+
+Implemented `PantherClass` node fields:
+
+- `id`: `PCxxxxx`
+- `name`
+- `description`
+- `hierarchy_code`
+
+Decision:
+
+- Do not put `parent_pcids` on graph nodes
+- Keep parent-child structure represented only as edges in the graph
+- If `pharos400` needs `parent_pcids`, derive it downstream during table materialization rather than denormalizing the source graph
+
+### Evolutionary family / subfamily graph
+
+Observed shape from `PTHR19.0_human`:
+
+- Column 4 carries IDs like `PTHR23158:SF54`
+- All inspected rows used the `family:subfamily` form
+- 7,526 distinct top-level `PTHR...` family IDs were observed
+- The file groups proteins into evolutionary family / subfamily membership
+- This is distinct from the `PCxxxxx` protein class hierarchy
+
+Implemented modeling:
+
+- Emit one `PantherFamily` node type using `PANTHER.FAMILY`
+- Represent both family and subfamily with a `level` field
+- Emit parent edges from subfamily to family
+- Emit protein membership to the subfamily node only
+- Do not infer extra deeper hierarchy beyond family -> subfamily
+
+Name handling decision:
+
+- The current human sequence file provides a stable text label per top-level family
+- It does not clearly provide a separate trustworthy subfamily label
+- Keep family `name`
+- Leave subfamily `name` unset rather than speculating
+
+## Final Inclusion / Exclusion Decisions
+
+Included:
+
+- PANTHER protein class nodes from the ontology file
+- parent-child class hierarchy
+- human protein-to-class edges from the current human sequence classification file
+- PANTHER family / subfamily nodes from the human sequence classification file
+- subfamily -> family edges
+- protein -> subfamily edges
+- `panther_class` / `p2pc` materialization in `working_mysql.yaml`
+
+Excluded:
+
+- GO annotations embedded in the sequence file
+- pathway assignments embedded in the sequence file
+- hash-prefixed ontology rows such as `#PC...`
+- inline ontology comment rows such as `#removed ...`
+
+## Open Questions / Risks
+
+- Versioning is not clean: the ontology URLs live under `PANTHER19.0`, while the ontology file header reports `17.0`
+- Current cleaned class set is smaller than legacy `pharos319.panther_class`
+- Current class hierarchy differs from legacy `pharos319` for at least some nodes, for example `PC00233`
+
+## Resolver Coverage Audit
+
+Resolver coverage was checked against the current `PTHR19.0_human` UniProt IDs.
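The audit percentages are plain matched/total ratios over the 19,450 sequence rows; as a sanity check, they can be recomputed directly (the counts below are the ones reported in this audit, and the `coverage` helper is illustrative, not part of the codebase):

```python
def coverage(matched: int, total: int) -> str:
    """Format a resolver hit rate the way this audit reports it."""
    return f"{matched:,} / {total:,} matched ({matched / total:.2%})"

print(coverage(19_433, 19_450))  # TargetGraphProteinResolver -> 99.91%
print(coverage(19_368, 19_450))  # TCRDTargetResolver -> 99.58%
```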
+
+Results:
+
+- `TargetGraphProteinResolver`: `19,433 / 19,450` matched (`99.91%`)
+- `TCRDTargetResolver`: `19,368 / 19,450` matched (`99.58%`)
+
+Interpretation:
+
+- UniProt-only emission is the correct identifier strategy for this source
+- The remaining misses are not an identifier-family problem
+- The extra drop in `TCRDTargetResolver` appears to reflect Pharos target-universe coverage rather than parser failure
+- Representative `TCRDTargetResolver` misses include some olfactory receptors, immunoglobulin variants, keratin-associated proteins, and unnamed / locus-style entries
+
+## Implementation Notes
+
+Implemented pieces:
+
+1. Added download rules and version metadata under `input_files/auto/panther/`.
+2. Added `PantherFamily`, `PantherClass`, and related edge models.
+3. Added the flat-file graph ingest adapter in `src/input_adapters/panther/panther_classes.py`.
+4. Added the Arango -> MySQL bridge adapters in `src/input_adapters/pharos_arango/tcrd/panther.py`.
+5. Added `TCRDOutputConverter` mappings for `panther_class` and `p2pc`.
+6. Validated in `working.yaml` and `working_mysql.yaml`.
+7. Promoted the graph adapter into `src/use_cases/pharos/target_graph.yaml` and `src/use_cases/pharos/pharos.yaml`.
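The class-token parsing inside the flat-file adapter (implementation piece 3) can be illustrated in isolation; the regex mirrors the `_PCID_RE` pattern from `panther_classes.py`, and the sample cell is the example quoted in the payload section:

```python
import re

# Same pattern as the adapter's _PCID_RE: class IDs ride along as
# `name#PCxxxxx` tokens, semicolon-delimited, in column 10 of PTHR19.0_human.
PCID_RE = re.compile(r"#(PC\d{5})")

cell = "G-protein modulator#PC00022;protein-binding activity modulator#PC00095"
print(PCID_RE.findall(cell))  # ['PC00022', 'PC00095']
```

Anchoring on `#PC` plus exactly five digits is what lets the adapter ignore the class names and survive reordering of the semicolon-delimited tokens.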
+
+`parent_pcids` decision:
+
+- Graph nodes do not store `parent_pcids`
+- `parent_pcids` is derived downstream for MySQL from direct `PantherClassParentEdge` links
+- This matches legacy Pharos API expectations more closely than storing full ancestry
+- Duplicate parent IDs are intentionally not preserved
+
+## Validation Targets
+
+Minimum validation after implementation:
+
+- node count for `PantherClass`
+- edge count for protein-to-class links
+- edge count for parent-child links
+- spot-check several known proteins from `PTHR19.0_human`
+- compare raw distinct class IDs and protein-class link counts to working graph output
+- compare working MySQL path if applicable before promotion
+
+## Validation Outcome
+
+Validated graph outcome in rebuilt working graph:
+
+- `PantherClass` is searchable on `name` and `description`
+- `PantherClass` has no `#PC...` artifact rows after cleanup
+- `PantherFamily` / `PantherClass` collections and edges loaded successfully
+
+Validated MySQL outcome in `pharos400_working`:
+
+- `panther_class`: `210` rows
+- `p2pc`: `21,848` rows
+- `#PC...` rows in `panther_class`: `0`
+- multi-parent `parent_pcids` rows: `0`
+
+Representative MySQL rows:
+
+- `PC00021 -> PC00197`
+- `PC00197 -> PC00000`
+- `PC00233 -> PC00197`
+
+Observed source cleanup impact:
+
+- The downloaded ontology file contains hash-prefixed rows and inline comment rows
+- Those rows are not used in the human protein assignment file
+- Excluding them produces a cleaner protein-facing class set for Pharos
diff --git a/playbooks/ingest_playbook.md b/playbooks/ingest_playbook.md
index d6de4d0..9747301 100644
--- a/playbooks/ingest_playbook.md
+++ b/playbooks/ingest_playbook.md
@@ -119,6 +119,12 @@ Provide a repeatable workflow for adding a new data source to the target graph i
   - Follow-up scope intentionally deferred from the first pass
   - Include the exact validation steps and comparison points used to accept the ingest.
+14) **Update the playbook with reusable lessons learned** + - After the ingest is working, capture any lessons that would help future ingests in the `Lessons Learned` section of this playbook. + - Keep these lessons generally applicable across sources and workflows. + - Prefer patterns, failure modes, and workflow improvements over source-specific facts. + - Do not duplicate details that belong only in the source design doc. + --- ## Lessons Learned diff --git a/src/constants.py b/src/constants.py index f97ebb6..1df535c 100755 --- a/src/constants.py +++ b/src/constants.py @@ -46,6 +46,7 @@ class DataSourceName(SimpleEnum): DiseaseOntology = "Disease Ontology" CTD = "CTD" UBERON = 'UBERON' + PANTHERClasses = "PANTHER Protein Classes" class Prefix(SimpleEnum): CAS = 'CAS' # from TNN @@ -89,6 +90,7 @@ class Prefix(SimpleEnum): OMIM = 'OMIM' # from TNN OMIM_PS = 'OMIM.PS' # from TNN orphanet = 'orphanet' # from TNN + PANTHER_CLASS = 'PANTHER.CLASS' PANTHER_FAMILY = 'PANTHER.FAMILY' # from TNN PANTHER_PATHWAY = 'PANTHER.PATHWAY' # from TNN PMC = 'PMC' # from TNN diff --git a/src/input_adapters/panther/panther_classes.py b/src/input_adapters/panther/panther_classes.py new file mode 100644 index 0000000..e7a6e3f --- /dev/null +++ b/src/input_adapters/panther/panther_classes.py @@ -0,0 +1,297 @@ +import csv +import os +import re +from datetime import datetime +from typing import Generator, List, Optional, Union + +from src.constants import DataSourceName, Prefix +from src.interfaces.input_adapter import InputAdapter +from src.models.datasource_version_info import DatasourceVersionInfo, parse_to_date +from src.models.node import EquivalentId, Node, Relationship +from src.models.panther_class import ( + PantherClass, + PantherClassParentEdge, + PantherFamily, + PantherFamilyParentEdge, + ProteinPantherClassEdge, + ProteinPantherFamilyEdge, +) +from src.models.protein import Protein + +_PCID_RE = re.compile(r"#(PC\d{5})") + + +def _panther_class_node_id(pcid: str) -> str: + 
return EquivalentId(id=pcid, type=Prefix.PANTHER_CLASS).id_str() + + +def _panther_family_node_id(panther_family_id: str) -> str: + return EquivalentId(id=panther_family_id, type=Prefix.PANTHER_FAMILY).id_str() + + +class PantherClassesAdapter(InputAdapter): + def __init__( + self, + class_file_path: str, + relationship_file_path: str, + sequence_classification_file_path: str, + version_file_path: Optional[str] = None, + max_rows: Optional[int] = None, + ): + self.class_file_path = class_file_path + self.relationship_file_path = relationship_file_path + self.sequence_classification_file_path = sequence_classification_file_path + self.version_file_path = version_file_path + self.max_rows = max_rows + + def get_datasource_name(self) -> DataSourceName: + return DataSourceName.PANTHERClasses + + def get_version(self) -> DatasourceVersionInfo: + version = None + version_date = None + download_date = None + if self.version_file_path and os.path.exists(self.version_file_path): + with open(self.version_file_path, "r", encoding="utf-8") as handle: + reader = csv.DictReader(handle, delimiter="\t") + row = next(reader, None) + if row: + version = row.get("version") or None + version_date = parse_to_date(row.get("version_date")) + download_date = parse_to_date(row.get("download_date")) + + if download_date is None: + timestamps = [] + for path in ( + self.class_file_path, + self.relationship_file_path, + self.sequence_classification_file_path, + ): + if os.path.exists(path): + timestamps.append(os.path.getmtime(path)) + if timestamps: + download_date = datetime.fromtimestamp(max(timestamps)).date() + + return DatasourceVersionInfo( + version=version, + version_date=version_date, + download_date=download_date, + ) + + def get_all(self) -> Generator[List[Union[Node, Relationship]], None, None]: + family_nodes = self._load_family_nodes() + if family_nodes: + yield list(family_nodes.values()) + + class_nodes = self._load_class_nodes() + yield list(class_nodes.values()) + + 
family_parent_edges = self._load_family_parent_edges(family_nodes) + if family_parent_edges: + yield family_parent_edges + + parent_edges = self._load_parent_edges() + if parent_edges: + yield parent_edges + + family_edges = self._load_family_membership_edges(family_nodes) + for i in range(0, len(family_edges), self.batch_size): + yield family_edges[i:i + self.batch_size] + + protein_edges = self._load_protein_edges(class_nodes) + for i in range(0, len(protein_edges), self.batch_size): + yield protein_edges[i:i + self.batch_size] + + def _load_family_nodes(self) -> dict: + nodes = {} + with open(self.sequence_classification_file_path, "r", encoding="utf-8", errors="replace") as handle: + reader = csv.reader(handle, delimiter="\t") + kept_rows = 0 + for row in reader: + if len(row) < 5: + continue + family_id = (row[3] or "").strip() + family_name = (row[4] or "").strip() or None + if not family_id: + continue + + top_level_id = family_id.split(":", 1)[0] + if top_level_id not in nodes: + nodes[top_level_id] = PantherFamily( + id=_panther_family_node_id(top_level_id), + source_id=top_level_id, + level="family", + name=family_name, + source="PANTHER", + ) + + if family_id not in nodes: + nodes[family_id] = PantherFamily( + id=_panther_family_node_id(family_id), + source_id=family_id, + level="subfamily" if ":" in family_id else "family", + source="PANTHER", + ) + + if self.max_rows is not None: + kept_rows += 1 + if kept_rows >= self.max_rows: + break + return nodes + + def _load_family_parent_edges(self, family_nodes: dict) -> List[PantherFamilyParentEdge]: + edges: List[PantherFamilyParentEdge] = [] + seen = set() + for source_id in family_nodes.keys(): + if ":" not in source_id: + continue + parent_id = source_id.split(":", 1)[0] + key = (source_id, parent_id) + if key in seen: + continue + seen.add(key) + edges.append( + PantherFamilyParentEdge( + start_node=PantherFamily(id=_panther_family_node_id(source_id)), + 
end_node=PantherFamily(id=_panther_family_node_id(parent_id)), + ) + ) + return edges + + def _load_class_nodes(self) -> dict: + nodes = {} + for row in self._iter_tsv_rows(self.class_file_path): + if len(row) < 3: + continue + pcid = row[0].strip() + if pcid.startswith("#"): + continue + if not pcid or pcid in nodes: + continue + hierarchy_code = row[1].strip() if len(row) > 1 else None + name = row[2].strip() if len(row) > 2 else None + description = row[3].strip() if len(row) > 3 else None + nodes[pcid] = PantherClass( + id=_panther_class_node_id(pcid), + source_id=pcid, + name=name or None, + description=description or None, + hierarchy_code=hierarchy_code or None, + ) + return nodes + + def _load_parent_edges(self) -> List[PantherClassParentEdge]: + edges = [] + seen = set() + for row in self._iter_tsv_rows(self.relationship_file_path): + if len(row) < 3: + continue + child_pcid = row[0].strip() + parent_pcid = row[2].strip() + if not child_pcid or not parent_pcid: + continue + key = (child_pcid, parent_pcid) + if key in seen: + continue + seen.add(key) + edges.append( + PantherClassParentEdge( + start_node=PantherClass(id=_panther_class_node_id(child_pcid)), + end_node=PantherClass(id=_panther_class_node_id(parent_pcid)), + ) + ) + return edges + + def _load_protein_edges(self, class_nodes: dict) -> List[ProteinPantherClassEdge]: + edges: List[ProteinPantherClassEdge] = [] + seen = set() + kept_rows = 0 + with open(self.sequence_classification_file_path, "r", encoding="utf-8", errors="replace") as handle: + reader = csv.reader(handle, delimiter="\t") + for row in reader: + if len(row) < 10: + continue + protein_classes = row[9].strip() + if not protein_classes: + continue + uniprot_id = self._extract_uniprot_id(row) + if not uniprot_id: + continue + pcids = _PCID_RE.findall(protein_classes) + if not pcids: + continue + kept_rows += 1 + protein_node = Protein(id=EquivalentId(id=uniprot_id, type=Prefix.UniProtKB).id_str()) + for pcid in pcids: + if pcid not in 
class_nodes: + continue + key = (protein_node.id, pcid) + if key in seen: + continue + seen.add(key) + edges.append( + ProteinPantherClassEdge( + start_node=protein_node, + end_node=PantherClass(id=_panther_class_node_id(pcid)), + source="PANTHER", + ) + ) + if self.max_rows is not None and kept_rows >= self.max_rows: + break + return edges + + def _load_family_membership_edges(self, family_nodes: dict) -> List[ProteinPantherFamilyEdge]: + edges: List[ProteinPantherFamilyEdge] = [] + seen = set() + kept_rows = 0 + with open(self.sequence_classification_file_path, "r", encoding="utf-8", errors="replace") as handle: + reader = csv.reader(handle, delimiter="\t") + for row in reader: + if len(row) < 4: + continue + family_id = (row[3] or "").strip() + if not family_id or family_id not in family_nodes: + continue + uniprot_id = self._extract_uniprot_id(row) + if not uniprot_id: + continue + kept_rows += 1 + protein_node = Protein(id=EquivalentId(id=uniprot_id, type=Prefix.UniProtKB).id_str()) + key = (protein_node.id, family_id) + if key in seen: + continue + seen.add(key) + edges.append( + ProteinPantherFamilyEdge( + start_node=protein_node, + end_node=PantherFamily(id=_panther_family_node_id(family_id)), + source="PANTHER", + ) + ) + if self.max_rows is not None and kept_rows >= self.max_rows: + break + return edges + + @staticmethod + def _extract_uniprot_id(row: list) -> Optional[str]: + if len(row) > 1 and row[1].strip(): + return row[1].strip() + compound_field = row[0].strip() if row else "" + parts = compound_field.split("|") + for part in parts: + if part.startswith("UniProtKB="): + return part.split("=", 1)[1].strip() or None + return None + + def _iter_tsv_rows(self, file_path: str): + with open(file_path, "r", encoding="utf-8", errors="replace") as handle: + reader = csv.reader(handle, delimiter="\t") + for row in reader: + if not row: + continue + first = (row[0] or "").strip() + if not first or first.startswith("!"): + continue + if file_path == 
self.class_file_path and first.startswith("#") and not first.startswith("#PC"): + continue + yield row diff --git a/src/input_adapters/pharos_arango/tcrd/panther.py b/src/input_adapters/pharos_arango/tcrd/panther.py new file mode 100644 index 0000000..7d9f39b --- /dev/null +++ b/src/input_adapters/pharos_arango/tcrd/panther.py @@ -0,0 +1,100 @@ +from typing import Generator, List + +from src.input_adapters.pharos_arango.tcrd.protein import PharosArangoAdapter +from src.models.datasource_version_info import DataSourceDetails +from src.models.panther_class import PantherClass, ProteinPantherClassEdge +from src.models.protein import Protein + + +def panther_class_query() -> str: + return """FOR d IN `PantherClass` RETURN d""" + + +def panther_class_parent_query() -> str: + return """FOR rel IN `PantherClassParentEdge` RETURN { "child": rel.start_id, "parent": rel.end_id }""" + + +def protein_panther_class_query(last_key: str = None, limit: int = 10000) -> str: + filter_clause = f'FILTER rel._key > "{last_key}"' if last_key else "" + return f""" + FOR rel IN `ProteinPantherClassEdge` + {filter_clause} + SORT rel._key + LIMIT {limit} + RETURN rel + """ + + +def panther_version_query() -> str: + return """FOR d IN `PantherClass` LIMIT 1 RETURN d.creation""" + + +class PantherClassAdapter(PharosArangoAdapter): + def _load_parent_map(self) -> dict[str, list[str]]: + db = self.get_db() + if not db.has_collection("PantherClassParentEdge"): + return {} + parent_rows = self.runQuery(panther_class_parent_query()) + parent_map: dict[str, set[str]] = {} + for row in parent_rows: + child = row.get("child") + parent = row.get("parent") + if not child or not parent: + continue + parent_map.setdefault(child, set()).add(parent) + return { + child_id: sorted(parent_ids) + for child_id, parent_ids in parent_map.items() + } + + @staticmethod + def _source_id(node_id: str) -> str: + return node_id.split(":", 1)[1] if ":" in node_id else node_id + + def get_all(self) -> 
Generator[List[PantherClass], None, None]: + parent_map = self._load_parent_map() + rows = [] + for row in self.runQuery(panther_class_query()): + parent_ids = parent_map.get(row["id"], []) + rows.append( + PantherClass( + id=row["id"], + source_id=row.get("source_id"), + name=row.get("name"), + description=row.get("description"), + hierarchy_code=row.get("hierarchy_code"), + provenance=row.get("provenance"), + sources=row.get("sources") or [], + parent_pcids="|".join(self._source_id(parent_id) for parent_id in parent_ids) or None, + ) + ) + yield rows + + def get_version_info_query(self) -> DataSourceDetails: + raw_version_info = self.runQuery(panther_version_query())[0] + return DataSourceDetails.parse_tsv(raw_version_info) + + +class ProteinPantherClassAdapter(PharosArangoAdapter): + batch_size = 10_000 + + def get_all(self) -> Generator[List[ProteinPantherClassEdge], None, None]: + last_key = None + while True: + rows = list(self.runQuery(protein_panther_class_query(last_key=last_key, limit=self.batch_size))) + if not rows: + break + + yield [ + ProteinPantherClassEdge( + start_node=Protein(id=row["start_id"]), + end_node=PantherClass(id=row["end_id"]), + source=row.get("source"), + ) + for row in rows + ] + last_key = rows[-1]["_key"] + + def get_version_info_query(self) -> DataSourceDetails: + raw_version_info = self.runQuery(panther_version_query())[0] + return DataSourceDetails.parse_tsv(raw_version_info) diff --git a/src/models/panther_class.py b/src/models/panther_class.py new file mode 100644 index 0000000..7c60915 --- /dev/null +++ b/src/models/panther_class.py @@ -0,0 +1,50 @@ +from dataclasses import dataclass +from typing import Optional + +from src.core.decorators import search +from src.models.node import Node, Relationship +from src.models.protein import Protein + + +@dataclass +class PantherFamily(Node): + source_id: Optional[str] = None + level: Optional[str] = None + name: Optional[str] = None + source: Optional[str] = None + + +@dataclass 
+class PantherFamilyParentEdge(Relationship): + start_node: PantherFamily + end_node: PantherFamily + + +@dataclass +class ProteinPantherFamilyEdge(Relationship): + start_node: Protein + end_node: PantherFamily + source: Optional[str] = None + + +@dataclass +@search(text_fields=["name", "description"]) +class PantherClass(Node): + source_id: Optional[str] = None + parent_pcids: Optional[str] = None + name: Optional[str] = None + description: Optional[str] = None + hierarchy_code: Optional[str] = None + + +@dataclass +class PantherClassParentEdge(Relationship): + start_node: PantherClass + end_node: PantherClass + + +@dataclass +class ProteinPantherClassEdge(Relationship): + start_node: Protein + end_node: PantherClass + source: Optional[str] = None diff --git a/src/output_adapters/sql_converters/tcrd.py b/src/output_adapters/sql_converters/tcrd.py index e9e4e0a..d306750 100755 --- a/src/output_adapters/sql_converters/tcrd.py +++ b/src/output_adapters/sql_converters/tcrd.py @@ -8,6 +8,7 @@ from src.models.keyword import ProteinKeywordEdge from src.models.ligand import Ligand, ProteinLigandEdge from src.models.node import EquivalentId +from src.models.panther_class import PantherClass, ProteinPantherClassEdge from src.models.pathway import ProteinPathwayEdge from src.models.protein import Protein from src.models.tcrd_disease_ontology import MondoTerm, MondoTermParentEdge, DOTerm, DOTermParentEdge @@ -17,7 +18,7 @@ GeneRif, GeneRif2Pubmed, Protein2Pubmed, Ligand as mysqlLigand, LigandActivity, Uberon, UberonParent, Tissue as mysqlTissue, Expression, Gtex, Mondo, MondoParent, MondoXref, Disease as mysqlDisease, DiseaseType, DO, DOParent, - NcatsDisease, NcatsD2DA, Pathway as mysqlPathway, + NcatsDisease, NcatsD2DA, Pathway as mysqlPathway, PantherClass as mysqlPantherClass, P2PC, ) from src.output_adapters.sql_converters.output_converter_base import SQLOutputConverter from src.shared.sqlalchemy_tables.pharos_tables_new import Base as TCRDBase @@ -61,6 +62,9 @@ def 
__init__(self): ProteinDiseaseEdge: [self.disease_type_converter, self.disease_converter, self.ncats_d2da_converter], # Pathway ProteinPathwayEdge: [self.pathway_converter], + # Panther + PantherClass: [self.panther_class_converter], + ProteinPantherClassEdge: [self.p2pc_converter], # Keyword ProteinKeywordEdge: [self.keyword_xref_converter], } @@ -599,6 +603,23 @@ def pathway_converter(self, obj: dict) -> mysqlPathway: provenance=obj['provenance'], ) + # --- Panther --- + + def panther_class_converter(self, obj: dict) -> mysqlPantherClass: + return mysqlPantherClass( + id=self.resolve_id('panther_class', obj['id']), + pcid=obj.get('source_id') or obj['id'], + parent_pcids=obj.get('parent_pcids'), + name=obj.get('name') or '', + description=obj.get('description'), + ) + + def p2pc_converter(self, obj: dict) -> P2PC: + return P2PC( + panther_class_id=self.resolve_id('panther_class', obj['end_id']), + protein_id=self.resolve_id('protein', obj['start_id']), + ) + # --- Keyword --- def keyword_xref_converter(self, obj: dict) -> Xref: diff --git a/src/use_cases/pharos/TCRD_TODO.md b/src/use_cases/pharos/TCRD_TODO.md index 567c4a2..2eb609c 100644 --- a/src/use_cases/pharos/TCRD_TODO.md +++ b/src/use_cases/pharos/TCRD_TODO.md @@ -34,13 +34,15 @@ Each row is a protein-facing Pharos/TCRD concept. Data source checkboxes = inges | **ProteinGoTermEdge** | [x] UniProt GAF
<br>[x] GO GAF | `ProteinGoTermEdge` | [x] `goa` |
 | **Ligand** | [x] IUPHAR<br>[x] ChEMBL<br>[x] DrugCentral | `Ligand` | [x] `ncats_ligands` |
 | **ProteinLigandEdge** | [x] IUPHAR<br>[x] ChEMBL<br>[x] DrugCentral | `ProteinLigandEdge` | [x] `ncats_ligand_activity` |
-| **Disease** | [x] MONDO<br>[x] Disease Ontology<br>[x] UniProt curated<br>[x] CTD<br>[x] JensenLab DISEASES *(promoted in `pharos.yaml` / `target_graph.yaml`)* | `Disease` | [x] `ncats_disease` |
+| **Disease** | [x] MONDO<br>[x] Disease Ontology<br>[x] UniProt curated<br>[x] CTD<br>[x] JensenLab DISEASES *(promoted in `pharos.yaml` / `target_graph.yaml`)*<br>[x] DrugCentral Indication | `Disease` | [x] `ncats_disease` |
 | **DiseaseParentEdge** | [x] MONDO | `DiseaseParentEdge` | [x] `mondo_parent`<br>[x] `ancestry_mondo` |
 | **DODiseaseParentEdge** | [x] Disease Ontology | `DODiseaseParentEdge` | [x] `do_parent`<br>[x] `ancestry_do` |
-| **ProteinDiseaseEdge** | [x] UniProt curated<br>[x] CTD *(side-lifted from gene associations by the TCRD target resolver)*<br>[x] JensenLab DISEASES *(Knowledge, Experiment/TIGA, and Text Mining; promoted in `pharos.yaml` / `target_graph.yaml`; working/full configs apply `textmining_min_zscore: 6.0` to stay close to historical Pharos text-mining scope)* | `ProteinDiseaseEdge` | [x] `disease_type`<br>[x] `disease`<br>[x] `ncats_d2da` |
+| **ProteinDiseaseEdge** | [x] UniProt curated<br>[x] CTD *(side-lifted from gene associations by the TCRD target resolver)*<br>[x] JensenLab DISEASES *(Knowledge, Experiment/TIGA, and Text Mining; promoted in `pharos.yaml` / `target_graph.yaml`; working/full configs apply `textmining_min_zscore: 6.0` to stay close to historical Pharos text-mining scope)*<br>[x] DrugCentral Indication | `ProteinDiseaseEdge` | [x] `disease_type`<br>[x] `disease`<br>[x] `ncats_d2da` |
 | **Pathway** | [x] UniProt<br>[x] Reactome<br>[x] WikiPathways<br>[x] PathwayCommons | `Pathway` | [x] `pathway` |
 | **PathwayParentEdge** | [x] Reactome | `PathwayParentEdge` | not exported to legacy TCRD MySQL |
 | **ProteinPathwayEdge** | [x] UniProt<br>[x] Reactome<br>[x] WikiPathways *(side-lifted from gene associations by the TCRD target resolver)*<br>[x] PathwayCommons *(side-lifted from gene associations by the TCRD target resolver)* | `ProteinPathwayEdge` | [x] `pathway` |
+| **PantherClass** | [x] PANTHER Classes *(promoted in `pharos.yaml` / `target_graph.yaml`)* | `PantherClass` | [x] `panther_class` *(via `tcrd.yaml`; validated in `working_mysql.yaml` first)* |
+| **ProteinPantherClassEdge** | [x] PANTHER Classes *(promoted in `pharos.yaml` / `target_graph.yaml`)* | `ProteinPantherClassEdge` | [x] `p2pc` *(via `tcrd.yaml`; validated in `working_mysql.yaml` first)* |
 | **Keyword** | [x] UniProt | `Keyword` | [x] `xref` *(UniProt Keyword xtype)* |
 | **ProteinKeywordEdge** | [x] UniProt | `ProteinKeywordEdge` | [x] `xref` *(UniProt Keyword xtype)* |
 | | *— post-processing (pharos_aql_post.yaml) —* | | |
@@ -61,18 +63,13 @@ These tables are populated directly from ontology source files during the TCRD b
 ---
 
 ## Planned Data Sources
-
-### Additional Disease Associations (ProteinDiseaseEdge)
-- DrugCentral Indication
-- ERAM *(punt for now: public download appears stale/legacy; if we need ERAM coverage, prefer copying or migrating the legacy `eRAM` rows from `pharos319` rather than building a fresh ingest from the public files)*
-- Expression Atlas *(punt for now: old TCRD used a bulk Atlas export plus custom preprocessing, but current Atlas appears to require per-experiment harvesting from FTP; revisit only as a larger dedicated project, not a quick ingest)*
-- Monarch
-- OMIM
+### Target Disease Associations
+- maybe ClinGen - old pharos didn't have it, but maybe it's useful
 
 ### New Concepts
 - Protein-Protein Interactions — STRING, BioPlex, Reactome PPI
 - Orthologs — OMA, EggNOG, Inparanoid
-- Protein Classes - PANTHER, DTO
+- Protein Classes - DTO
 - Phenotype — IMPC, JAX/MGI
 - GWAS
 - Protein & Disease Novelty (this might be TINx, I'm not sure)
@@ -98,6 +95,22 @@ These tables are populated directly from ontology source files during the TCRD b
 ### Requires License
 - DisGeNET Disease Associations
 - KEGG Pathway
+- OMIM investigation follow-up *(low priority: legacy TCRD/Pharos loaded licensed OMIM files into `omim`, `omim_ps`, and `phenotype` with `ptype='OMIM'`, and the frontend did not surface that content. Revisit only if we explicitly want a phenotype/trait ingest, not as a target-disease source.)* + +### Punted / Not Doing Right Now +- ERAM *(punt for now: public download appears stale/legacy; if we need ERAM coverage, prefer copying or migrating the legacy `eRAM` rows from `pharos319` rather than building a fresh ingest from the public files)* +- Expression Atlas *(punt for now: old TCRD used a bulk Atlas export plus custom preprocessing, but current Atlas appears to require per-experiment harvesting from FTP; revisit only as a larger dedicated project, not a quick ingest)* +- Monarch as a standalone disease-association source *(do not ingest the current dump as `Monarch`; the public file is a Translator-style aggregate whose primary sources are `infores:omim` and `infores:clingen`)* + +### Findings From Investigation +- OMIM is not a legacy Pharos target-disease association source + - old `load-OMIM.py` populated `omim`, `omim_ps`, and `phenotype`, not `disease` + - `pharos319` currently has `14147` `phenotype` rows with `ptype='OMIM'` + - the old frontend did not surface that OMIM phenotype content +- Current Monarch disease dump is not a clean standalone `Monarch` source + - the current public file is a Translator-style aggregate + - `infores:monarchinitiative` is the aggregator, while the primary sources are `infores:omim` and `infores:clingen` + - do not ingest it as `Monarch`; revisit direct OMIM or direct ClinGen instead --- @@ -117,8 +130,9 @@ Use `src/use_cases/working.yaml` and `src/use_cases/working_mysql.yaml` to valid - `MondoTableAdapter` maps `Disease.mondo_description` into `MondoTerm.mondo_description`, and `TCRDOutputConverter.mondo_table_converter()` writes that into `mondo.def`. 
- [x] Verify `do.def` is populated from source-file Disease Ontology data - `DOTableAdapter` maps `Disease.do_description` into `DOTerm.do_description`, and `TCRDOutputConverter.do_table_converter()` writes that into `do.def`. -- [ ] Compare `mondo.comment` population against `pharos319` - - Current source-file path preserves MONDO comments separately from the merged disease graph. +- [x] Compare `mondo.comment` population against `pharos319` + - Current source-file path already maps MONDO comments into `mondo.comment` through `MondoTableAdapter` and `TCRDOutputConverter.mondo_table_converter()`. + - Current source-file adapter output includes `1210` commented MONDO terms versus `773` in `pharos319`; the difference appears to reflect source-version drift, not a missing comment mapping. ### Disease Association Table @@ -127,12 +141,26 @@ Use `src/use_cases/working.yaml` and `src/use_cases/working_mysql.yaml` to valid - [x] Map `ProteinDiseaseEdge.details` into source-specific `disease` association rows - Current converter emits one `disease` row per edge detail rather than one row per merged graph edge. - [x] Populate `disease.evidence` from disease association details -- [ ] Decide whether any disease-detail text should populate `disease.description` - - `pharos319` has disease descriptions for some sources, but current working MySQL leaves `disease.description` empty. -- [ ] Decide whether disease association detail metadata should populate `disease.source` - - `pharos319` uses `disease.source` for some sources; current working MySQL leaves it empty. -- [ ] Document source-specific fields that remain intentionally unsupported in the working converter - - Examples from `pharos319`: `drug_name`, `log2foldchange`, `pvalue`, `score`, `S2O`, `updated`. +- [x] Decide whether any disease-detail text should populate `disease.description` + - Keep the generic converter behavior as-is: do not map edge-detail text into `disease.description`. 
+ - Legacy `pharos319` only populated `disease.description` for `UniProt Disease`, and that text is disease-level descriptive content rather than generic edge-detail payload. + - In the current graph model, source-native disease descriptions belong on the `Disease` node (for example `Disease.uniprot_description`), not in per-source association detail rows. +- [x] Decide whether disease association detail metadata should populate `disease.source` + - Keep `disease.source` intentionally empty in the generic working converter for now. + - In legacy `pharos319`, `disease.source` was source-specific subsource metadata for only some ingests (for example `DisGeNET` values like `CTD_human` / `PSYGENET`, and `eRAM` source bundles like `CLINVAR|CTD_human|GHR|ORPHANET|UNIPROT`). + - That does not map cleanly to a generic cross-source field, so any future population should be source-specific and explicit rather than inferred from `details`. +- [x] Document source-specific fields that remain intentionally unsupported in the working converter + - Intentionally supported already: + - `drug_name` for `DrugCentral Indication` + - `zscore`, `conf`, and `reference` for Jensen text-mining / knowledge rows + - Intentionally unsupported in the generic converter: + - `description` as an edge-detail sink + - `source` as a generic subsource field + - `log2foldchange` and `pvalue` from `Expression Atlas` + - `score` and `source` from `DisGeNET` + - `source` from `eRAM` + - `S2O` / `O2S` from legacy `Monarch` + - `updated` flags from legacy source-specific rows - [x] Populate Jensen-compatible disease association fields in working/full MySQL - `disease.did` now preserves source disease IDs, while `mondoid` remains a best-effort FK-backed resolved MONDO mapping. - `disease.conf` and `disease.evidence` now populate for Jensen Knowledge / Experiment rows. 
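+
+The supported/unsupported split above amounts to a per-source allowlist over edge-detail fields. A minimal sketch of that idea follows; the names (`SUPPORTED_DETAIL_FIELDS`, `detail_to_disease_row`) are illustrative only and are not the actual `TCRDOutputConverter` API:
+
+```python
+# Illustrative sketch, not the real converter: captures the decision that only
+# a few source-specific detail fields flow into `disease` rows, while fields
+# like `description`, `source`, `log2foldchange`, `pvalue`, `score`, `S2O`,
+# and `updated` are intentionally dropped by the generic path.
+SUPPORTED_DETAIL_FIELDS = {
+    "DrugCentral Indication": {"drug_name"},
+    # Jensen text-mining / knowledge rows keep their scoring metadata.
+    "JensenLab Knowledge": {"zscore", "conf", "reference"},
+    "JensenLab Experiment": {"zscore", "conf", "reference"},
+}
+
+# Generic columns every source's association row keeps (assumed names).
+ALWAYS_KEPT = {"dtype", "did", "evidence"}
+
+
+def detail_to_disease_row(detail: dict) -> dict:
+    """Filter one edge-detail dict down to the fields its source supports."""
+    allowed = ALWAYS_KEPT | SUPPORTED_DETAIL_FIELDS.get(detail.get("dtype"), set())
+    return {k: v for k, v in detail.items() if k in allowed}
+```
+
+Under this sketch, a `DisGeNET`-style detail carrying `score` and `source` would collapse to just its generic columns, which matches the "intentionally unsupported" list above.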
@@ -145,7 +173,6 @@ Use `src/use_cases/working.yaml` and `src/use_cases/working_mysql.yaml` to valid - [x] Add `ncats_disease` output for canonical disease nodes from the graph - [x] Limit `ncats_disease` output to diseases that have target associations - `DiseaseAdapter` now supports `associated_only: true` for the TCRD build path. -- [ ] Decide how non-`MONDO:` / non-`DOID:` associated disease nodes should be represented downstream +- [x] Decide how non-`MONDO:` / non-`DOID:` associated disease nodes should be represented downstream - Current graph includes associated diseases with prefixes such as `UMLS`, `OMIM`, `HP`, `EFO`, `NCIT`, and `MESH`. -- [ ] Compare disease ID normalization expectations against `pharos319` - - Especially for legacy `MIM:` / `OMIM:` / `UMLS:` disease identifiers that appear in association rows. + - the IdResolver handles the mappings diff --git a/src/use_cases/pharos/pharos.yaml b/src/use_cases/pharos/pharos.yaml index b51bb0d..cf792d2 100644 --- a/src/use_cases/pharos/pharos.yaml +++ b/src/use_cases/pharos/pharos.yaml @@ -329,6 +329,14 @@ input_adapters: file_path: ./input_files/auto/pathwaycommons/pc-hgnc.gmt.gz version_file_path: ./input_files/auto/pathwaycommons/pathwaycommons_version.tsv + - import: ./src/input_adapters/panther/panther_classes.py + class: PantherClassesAdapter + kwargs: + class_file_path: ./input_files/auto/panther/Protein_Class_19.0 + relationship_file_path: ./input_files/auto/panther/Protein_class_relationship + sequence_classification_file_path: ./input_files/auto/panther/PTHR19.0_human + version_file_path: ./input_files/auto/panther/panther_classes_version.tsv + output_adapters: - import: ./src/output_adapters/arango_output_adapter.py class: ArangoOutputAdapter diff --git a/src/use_cases/pharos/target_graph.yaml b/src/use_cases/pharos/target_graph.yaml index 439bfea..f240502 100644 --- a/src/use_cases/pharos/target_graph.yaml +++ b/src/use_cases/pharos/target_graph.yaml @@ -371,6 +371,14 @@ input_adapters: 
file_path: ./input_files/auto/pathwaycommons/pc-hgnc.gmt.gz version_file_path: ./input_files/auto/pathwaycommons/pathwaycommons_version.tsv + - import: ./src/input_adapters/panther/panther_classes.py + class: PantherClassesAdapter + kwargs: + class_file_path: ./input_files/auto/panther/Protein_Class_19.0 + relationship_file_path: ./input_files/auto/panther/Protein_class_relationship + sequence_classification_file_path: ./input_files/auto/panther/PTHR19.0_human + version_file_path: ./input_files/auto/panther/panther_classes_version.tsv + output_adapters: - import: ./src/output_adapters/arango_output_adapter.py class: ArangoOutputAdapter diff --git a/src/use_cases/pharos/tcrd.yaml b/src/use_cases/pharos/tcrd.yaml index a0670f1..0686df9 100644 --- a/src/use_cases/pharos/tcrd.yaml +++ b/src/use_cases/pharos/tcrd.yaml @@ -105,6 +105,18 @@ input_adapters: kwargs: database_name: *source_database + - import: ./src/input_adapters/pharos_arango/tcrd/panther.py + class: PantherClassAdapter + credentials: *source_credentials + kwargs: + database_name: *source_database + + - import: ./src/input_adapters/pharos_arango/tcrd/panther.py + class: ProteinPantherClassAdapter + credentials: *source_credentials + kwargs: + database_name: *source_database + - import: ./src/input_adapters/pharos_arango/tcrd/keyword.py class: ProteinKeywordAdapter credentials: *source_credentials diff --git a/src/use_cases/working.yaml b/src/use_cases/working.yaml index d39f966..9809511 100644 --- a/src/use_cases/working.yaml +++ b/src/use_cases/working.yaml @@ -34,9 +34,13 @@ input_adapters: file_path: ./input_files/manual/target_graph/protein_ids.tsv collapse_reviewed_targets: true - - import: ./src/input_adapters/drug_central/drug_indication.py - class: DrugCentralIndicationAdapter - credentials: ./src/use_cases/secrets/drugcentral_credentials.yaml + - import: ./src/input_adapters/panther/panther_classes.py + class: PantherClassesAdapter + kwargs: + class_file_path: 
./input_files/auto/panther/Protein_Class_19.0 + relationship_file_path: ./input_files/auto/panther/Protein_class_relationship + sequence_classification_file_path: ./input_files/auto/panther/PTHR19.0_human + version_file_path: ./input_files/auto/panther/panther_classes_version.tsv output_adapters: - import: ./src/output_adapters/arango_output_adapter.py diff --git a/src/use_cases/working_mysql.yaml b/src/use_cases/working_mysql.yaml index 2cafd79..c5dd6d8 100644 --- a/src/use_cases/working_mysql.yaml +++ b/src/use_cases/working_mysql.yaml @@ -20,25 +20,25 @@ resolvers: - Transcript input_adapters: - - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py - class: MondoTableAdapter - kwargs: - file_path: ./input_files/auto/mondo/mondo.json - - - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py - class: MondoTableParentEdgeAdapter - kwargs: - file_path: ./input_files/auto/mondo/mondo.json - - - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py - class: DOTableAdapter - kwargs: - file_path: ./input_files/auto/disease_ontology/doid.json - - - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py - class: DOTableParentEdgeAdapter - kwargs: - file_path: ./input_files/auto/disease_ontology/doid.json +# - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py +# class: MondoTableAdapter +# kwargs: +# file_path: ./input_files/auto/mondo/mondo.json +# +# - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py +# class: MondoTableParentEdgeAdapter +# kwargs: +# file_path: ./input_files/auto/mondo/mondo.json +# +# - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py +# class: DOTableAdapter +# kwargs: +# file_path: ./input_files/auto/disease_ontology/doid.json +# +# - import: ./src/input_adapters/pharos_source_tcrd/ontology_tables.py +# class: DOTableParentEdgeAdapter +# kwargs: +# file_path: ./input_files/auto/disease_ontology/doid.json - import: 
./src/input_adapters/pharos_arango/tcrd/protein.py class: ProteinAdapter @@ -65,25 +65,37 @@ input_adapters: # kwargs: # database_name: *source_database # -# - import: ./src/input_adapters/pharos_arango/tcrd/keyword.py -# class: ProteinKeywordAdapter -# credentials: *source_credentials -# kwargs: -# database_name: *source_database -# - - import: ./src/input_adapters/pharos_arango/tcrd/disease.py - class: DiseaseAdapter + - import: ./src/input_adapters/pharos_arango/tcrd/panther.py + class: PantherClassAdapter credentials: *source_credentials kwargs: database_name: *source_database - associated_only: true - - import: ./src/input_adapters/pharos_arango/tcrd/disease.py - class: ProteinDiseaseAdapter + - import: ./src/input_adapters/pharos_arango/tcrd/panther.py + class: ProteinPantherClassAdapter credentials: *source_credentials kwargs: database_name: *source_database +# - import: ./src/input_adapters/pharos_arango/tcrd/keyword.py +# class: ProteinKeywordAdapter +# credentials: *source_credentials +# kwargs: +# database_name: *source_database +# +# - import: ./src/input_adapters/pharos_arango/tcrd/disease.py +# class: DiseaseAdapter +# credentials: *source_credentials +# kwargs: +# database_name: *source_database +# associated_only: true +# +# - import: ./src/input_adapters/pharos_arango/tcrd/disease.py +# class: ProteinDiseaseAdapter +# credentials: *source_credentials +# kwargs: +# database_name: *source_database + output_adapters: - import: ./src/output_adapters/mysql_output_adapter.py class: TCRDOutputAdapter diff --git a/tests/test_panther_classes.py b/tests/test_panther_classes.py new file mode 100644 index 0000000..c123964 --- /dev/null +++ b/tests/test_panther_classes.py @@ -0,0 +1,121 @@ +from pathlib import Path + +from src.input_adapters.panther.panther_classes import PantherClassesAdapter +from src.models.panther_class import ( + PantherClass, + PantherClassParentEdge, + PantherFamily, + PantherFamilyParentEdge, + ProteinPantherClassEdge, + 
ProteinPantherFamilyEdge, +) + + +def _write_fixture(path: Path, content: str) -> None: + path.write_text(content, encoding="utf-8") + + +def test_panther_adapter_emits_family_and_class_graph(tmp_path): + class_path = tmp_path / "Protein_Class_19.0" + relationship_path = tmp_path / "Protein_class_relationship" + sequence_path = tmp_path / "PTHR19.0_human" + version_path = tmp_path / "panther_classes_version.tsv" + + _write_fixture( + class_path, + ( + "! version: 17.0\n" + "! date: 1/11/2022\n" + "#removed PC00255 because only one family in class\n" + "#PC00255\t1.08.01.08.00\tTATA-binding transcription factor\tCommented out class\n" + "PC00000\t1.00.00.00.00\tprotein class\t\n" + "PC00197\t1.01.00.00.00\ttransmembrane signal receptor\tA receptor class\n" + "PC00021\t1.01.01.00.00\tG-protein coupled receptor\tGPCR class\n" + ), + ) + _write_fixture( + relationship_path, + ( + "! version: 17.0\n" + "PC00197\treceptor\tPC00000\tprotein class\t05\n" + "PC00021\tGPCR\tPC00197\treceptor\t01\n" + ), + ) + _write_fixture( + sequence_path, + ( + "HUMAN|HGNC=1|UniProtKB=P11111\tP11111\tGENE1\tPTHR10000:SF1\tFAMILY ONE\tProtein One\t\t\t\t" + "transmembrane signal receptor#PC00197;G-protein coupled receptor#PC00021\t\n" + "HUMAN|HGNC=2|UniProtKB=P22222\tP22222\tGENE2\tPTHR10000:SF2\tFAMILY ONE\tProtein Two\t\t\t\t" + "\t\n" + ), + ) + _write_fixture( + version_path, + "version\tversion_date\tdownload_date\n19.0\t2026-04-14\t2026-04-14\n", + ) + + adapter = PantherClassesAdapter( + class_file_path=str(class_path), + relationship_file_path=str(relationship_path), + sequence_classification_file_path=str(sequence_path), + version_file_path=str(version_path), + ) + + batches = list(adapter.get_all()) + + family_nodes = batches[0] + class_nodes = batches[1] + family_parent_edges = batches[2] + class_parent_edges = batches[3] + family_edges = batches[4] + class_edges = batches[5] + + assert {type(obj) for obj in family_nodes} == {PantherFamily} + assert {node.id for node in 
family_nodes} == { + "PANTHER.FAMILY:PTHR10000", + "PANTHER.FAMILY:PTHR10000:SF1", + "PANTHER.FAMILY:PTHR10000:SF2", + } + family_node_map = {node.id: node for node in family_nodes} + assert family_node_map["PANTHER.FAMILY:PTHR10000"].level == "family" + assert family_node_map["PANTHER.FAMILY:PTHR10000"].name == "FAMILY ONE" + assert family_node_map["PANTHER.FAMILY:PTHR10000:SF1"].level == "subfamily" + assert family_node_map["PANTHER.FAMILY:PTHR10000:SF1"].name is None + + assert {type(obj) for obj in class_nodes} == {PantherClass} + assert {node.id for node in class_nodes} == { + "PANTHER.CLASS:PC00000", + "PANTHER.CLASS:PC00197", + "PANTHER.CLASS:PC00021", + } + assert "PANTHER.CLASS:#PC00255" not in {node.id for node in class_nodes} + + assert all(isinstance(edge, PantherFamilyParentEdge) for edge in family_parent_edges) + assert {(edge.start_node.id, edge.end_node.id) for edge in family_parent_edges} == { + ("PANTHER.FAMILY:PTHR10000:SF1", "PANTHER.FAMILY:PTHR10000"), + ("PANTHER.FAMILY:PTHR10000:SF2", "PANTHER.FAMILY:PTHR10000"), + } + + assert all(isinstance(edge, PantherClassParentEdge) for edge in class_parent_edges) + assert {(edge.start_node.id, edge.end_node.id) for edge in class_parent_edges} == { + ("PANTHER.CLASS:PC00197", "PANTHER.CLASS:PC00000"), + ("PANTHER.CLASS:PC00021", "PANTHER.CLASS:PC00197"), + } + + assert all(isinstance(edge, ProteinPantherFamilyEdge) for edge in family_edges) + assert {(edge.start_node.id, edge.end_node.id) for edge in family_edges} == { + ("UniProtKB:P11111", "PANTHER.FAMILY:PTHR10000:SF1"), + ("UniProtKB:P22222", "PANTHER.FAMILY:PTHR10000:SF2"), + } + + assert all(isinstance(edge, ProteinPantherClassEdge) for edge in class_edges) + assert {(edge.start_node.id, edge.end_node.id) for edge in class_edges} == { + ("UniProtKB:P11111", "PANTHER.CLASS:PC00197"), + ("UniProtKB:P11111", "PANTHER.CLASS:PC00021"), + } + + version = adapter.get_version() + assert version.version == "19.0" + assert version.version_date.isoformat() 
== "2026-04-14" + assert version.download_date.isoformat() == "2026-04-14" diff --git a/tests/test_pharos_arango_panther_adapter.py b/tests/test_pharos_arango_panther_adapter.py new file mode 100644 index 0000000..efd916b --- /dev/null +++ b/tests/test_pharos_arango_panther_adapter.py @@ -0,0 +1,88 @@ +from src.input_adapters.pharos_arango.tcrd.panther import ( + PantherClassAdapter, + ProteinPantherClassAdapter, + panther_class_query, + panther_class_parent_query +) + + +def test_panther_class_adapter_hydrates_parent_pcids(): + adapter = PantherClassAdapter.__new__(PantherClassAdapter) + + class_rows = [ + { + "id": "PANTHER.CLASS:PC00000", + "source_id": "PC00000", + "name": "protein class", + "description": None, + "hierarchy_code": "1.00.00.00.00", + "source": "PANTHER", + "provenance": "PANTHER Protein Classes\t19.0\t2026-04-14\t2026-04-14", + "sources": ["PANTHER Protein Classes\t19.0\t2026-04-14\t2026-04-14"], + }, + { + "id": "PANTHER.CLASS:PC00021", + "source_id": "PC00021", + "name": "G-protein coupled receptor", + "description": "GPCR", + "hierarchy_code": "1.01.01.00.00", + "source": "PANTHER", + "provenance": "PANTHER Protein Classes\t19.0\t2026-04-14\t2026-04-14", + "sources": ["PANTHER Protein Classes\t19.0\t2026-04-14\t2026-04-14"], + }, + ] + parent_rows = [ + {"child": "PANTHER.CLASS:PC00021", "parent": "PANTHER.CLASS:PC00000"}, + ] + + class FakeDb: + @staticmethod + def has_collection(name): + return name == "PantherClassParentEdge" + + def fake_run_query(query): + if query == panther_class_query(): + return class_rows + if query == panther_class_parent_query(): + return parent_rows + return [] + + adapter.get_db = lambda: FakeDb() + adapter.runQuery = fake_run_query + + batches = list(adapter.get_all()) + assert len(batches) == 1 + nodes = {node.id: node for node in batches[0]} + + assert nodes["PANTHER.CLASS:PC00000"].parent_pcids is None + assert nodes["PANTHER.CLASS:PC00021"].parent_pcids == "PC00000" + + +def 
test_protein_panther_class_adapter_emits_edges(): + adapter = ProteinPantherClassAdapter.__new__(ProteinPantherClassAdapter) + adapter.batch_size = 10_000 + + edge_rows = [ + { + "_key": "1", + "start_id": "IFXProtein:ABC123", + "end_id": "PANTHER.CLASS:PC00021", + "source": "PANTHER", + } + ] + + def fake_run_query(query): + if "FOR rel IN `ProteinPantherClassEdge`" in query: + rows = edge_rows.copy() + edge_rows.clear() + return rows + return [] + + adapter.runQuery = fake_run_query + + batches = list(adapter.get_all()) + assert len(batches) == 1 + edge = batches[0][0] + assert edge.start_node.id == "IFXProtein:ABC123" + assert edge.end_node.id == "PANTHER.CLASS:PC00021" + assert edge.source == "PANTHER" diff --git a/tests/test_tcrd_panther_converter.py b/tests/test_tcrd_panther_converter.py new file mode 100644 index 0000000..365d7f5 --- /dev/null +++ b/tests/test_tcrd_panther_converter.py @@ -0,0 +1,39 @@ +from src.output_adapters.sql_converters.tcrd import TCRDOutputConverter + + +def test_panther_class_converter_maps_parent_pcids_and_source_id(): + converter = TCRDOutputConverter() + + row = converter.panther_class_converter( + { + "id": "PANTHER.CLASS:PC00021", + "source_id": "PC00021", + "parent_pcids": "PC00197", + "name": "G-protein coupled receptor", + "description": "GPCR class", + } + ) + + assert row.id == 1 + assert row.pcid == "PC00021" + assert row.parent_pcids == "PC00197" + assert row.name == "G-protein coupled receptor" + assert row.description == "GPCR class" + + +def test_p2pc_converter_uses_preloaded_protein_and_panther_ids(): + converter = TCRDOutputConverter() + converter.id_mapping = { + "protein": {"IFXProtein:ABC123": 7}, + "panther_class": {"PANTHER.CLASS:PC00021": 13}, + } + + row = converter.p2pc_converter( + { + "start_id": "IFXProtein:ABC123", + "end_id": "PANTHER.CLASS:PC00021", + } + ) + + assert row.protein_id == 7 + assert row.panther_class_id == 13 diff --git a/workflows/pharos.Snakefile b/workflows/pharos.Snakefile index 
8970446..565df34 100644 --- a/workflows/pharos.Snakefile +++ b/workflows/pharos.Snakefile @@ -38,7 +38,11 @@ rule all: "../input_files/auto/jensenlab/tissues_version.tsv", "../input_files/auto/wikipathways/wikipathways_human.gmt", "../input_files/auto/wikipathways/wikipathways_version.tsv", - "../input_files/auto/disease_ontology/doid.json" + "../input_files/auto/disease_ontology/doid.json", + "../input_files/auto/panther/Protein_Class_19.0", + "../input_files/auto/panther/Protein_class_relationship", + "../input_files/auto/panther/PTHR19.0_human", + "../input_files/auto/panther/panther_classes_version.tsv" rule download_ctd: output: @@ -246,6 +250,30 @@ rule download_pathwaycommons: curl -fs https://download.baderlab.org/PathwayCommons/PC2/v14/datasources.txt | python3 -c "import sys,re; from datetime import datetime; data=sys.stdin.read(); m=re.search(r'PC version (\d+) (\d+ \w+ \d+)',data); v=m.group(1); dt=datetime.strptime(m.group(2),'%d %b %Y').date().isoformat(); open('{output[1]}','w').write('version\\tversion_date\\n'+v+'\\t'+dt+'\\n')" """ +rule download_panther_classes: + output: + "../input_files/auto/panther/Protein_Class_19.0", + "../input_files/auto/panther/Protein_class_relationship", + "../input_files/auto/panther/PTHR19.0_human", + "../input_files/auto/panther/panther_classes_version.tsv" + shell: + """ + mkdir -p ../input_files/auto/panther + class_url='https://data.pantherdb.org/PANTHER19.0/ontology/Protein_Class_19.0' + rel_url='https://data.pantherdb.org/PANTHER19.0/ontology/Protein_class_relationship' + seq_url='https://data.pantherdb.org/ftp/sequence_classifications/current_release/PANTHER_Sequence_Classification_files/PTHR19.0_human' + + curl -fL -o {output[0]} "$class_url" + curl -fL -o {output[1]} "$rel_url" + curl -fL -o {output[2]} "$seq_url" + + class_lm=$(curl -fsI "$class_url" | awk -F': ' 'tolower($1)=="last-modified"{{print $2}}') + rel_lm=$(curl -fsI "$rel_url" | awk -F': ' 'tolower($1)=="last-modified"{{print $2}}') + 
seq_lm=$(curl -fsI "$seq_url" | awk -F': ' 'tolower($1)=="last-modified"{{print $2}}') + download_date=$(date -u +%F) + python3 -c "import email.utils,sys; vals=[v for v in sys.argv[1:4] if v.strip()]; dates=[email.utils.parsedate_to_datetime(v).date().isoformat() for v in vals]; version_date=max(dates) if dates else ''; version='19.0'; open(sys.argv[4],'w').write('version\\tversion_date\\tdownload_date\\n'+version+'\\t'+version_date+'\\t'+sys.argv[5]+'\\n')" "$class_lm" "$rel_lm" "$seq_lm" {output[3]} "$download_date" + """ + rule download_wikipathways: output: "../input_files/auto/wikipathways/wikipathways_human.gmt",