diff --git a/CLAUDE.md b/CLAUDE.md index 813a9d8f..7456ed91 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -86,9 +86,9 @@ semantic type plus data collection, reports, exports, and DuckDB. - **`snakefiles/`** — Snakemake rule definitions wiring data handlers to compendium creators. - **`node.py`** — Core classes: `NodeFactory`, `SynonymFactory`, `DescriptionFactory`, `TaxonFactory`, `InformationContentFactory`, `TSVSQLiteLoader`. -- **`babel_utils.py`** — Download/FTP utilities, state management. +- **`babel_utils.py`** — Download/FTP utilities, `glom()` (clique merging), `write_compendium()` + (compendium builder), state management. - **`util.py`** — Logging, config loading, Biolink Model Toolkit (bmt) access. -- **`make_cliques.py`** — Union-find clique merging logic. - **`exporters/`** — Output format handlers (KGX, Parquet, JSONL). - **`reports/`**, **`synonyms/`**, **`metadata/`** — Report generation, synonym files, provenance. @@ -100,7 +100,8 @@ semantic type plus data collection, reports, exports, and DuckDB. - **Biolink Model** integration via `bmt` — types, valid prefixes, and naming conventions all follow the Biolink Model. - **Concord files** are the core data structure: tab-separated `CURIE1 \t Relation \t CURIE2` - triples expressing cross-references between vocabularies. + triples expressing cross-references between vocabularies. The `glom()` function in + `babel_utils.py` merges them into equivalence cliques. ### Biolink Model Usage @@ -128,7 +129,7 @@ identifier that owns them and are not promoted to the first entry. ### Conflation -Gene+Protein and Drug+Chemical each have dedicated conflation modules (`geneprotein.py`, +GeneProtein and DrugChemical conflation each have dedicated conflation modules (`geneprotein.py`, `drugchemical.py`) that merge their respective cliques. See `docs/Conflation.md`. ### Directories at Runtime @@ -161,3 +162,12 @@ don't miss out on any valid identifiers without very good reason. 
If you're changing how identifiers are filtered in one compendium, think about whether that will affect which identifiers should be included in the other compendia to prevent any identifiers from being missed or being added twice. + +## Documentation + +When making a significant change, check if it affects any of the documentation +files (`docs/*.md`, `*.md`) and update them if necessary. Suggest adding +new documentation files if necessary. + +When writing documentation files, avoid using horizontal pipes unless necessary -- +section headings are sufficient for dividing up documentation. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f3637e6f..0fc3c069 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -28,56 +28,19 @@ us triage and prioritize them correctly. would also appreciate if you can include what you expect the tool to return. Any other details you can provide, especially anything that will be help us replicate the issue, will be very helpful. -1. After you have reported a bug, helping to triage, prioritize and group it - will be very helpful: - - We triage issues into one of the - [milestones](https://github.com/NCATSTranslator/Babel/milestones): - - [Needs investigation](https://github.com/NCATSTranslator/Babel/milestone/12) - refers to issues that need to be investigated further -- either to figure - out what is causing the issue or to communicate with the user community - to understand what should occur. - - [Immediate](https://github.com/NCATSTranslator/Babel/milestone/35) need - to be fixed immediately. Issues I'm currently working on will be placed - here. - - [Needed soon](https://github.com/NCATSTranslator/Babel/milestone/30) - refers to issues that should be fixed in the next few months: not - immediately, but sooner rather than later. - - [Needed later](https://github.com/NCATSTranslator/Babel/milestone/31) - refers to issues that should be fixed eventually, but are not needed - immediately. 
- - We prioritize issues with one of the three priority tags: - [Priority: Low](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20Low%22), - [Priority: Medium](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20Medium%22), - [Priority: High](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20High%22). - The idea is that issues with the highest priority will determine which will be - investigated/tested first, and which are most likely to move from Needed later/Needed soon into - Immediate for working on. - - We estimate effort on tasks using a series of - ["T-shirt sizes"](https://asana.com/resources/t-shirt-sizing): - [Size: XS](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20XS%22), - [Size: S](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20S%22), - [Size: M](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20M%22), - [Size: L](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20L%22), - [Size: XL](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20XL%22). - These are to help distinguish between tasks that are easy to complete (extra small) and those - that will require a lot of thinking, programming and testing (extra large). - - You can group issues in two ways: - - GitHub lets you chose a "parent" issue for each issue, which is useful for issues that are - related to each other. We try to build "issues of issues" that group together similar issues - that might require similar fixes (e.g. - [our issue tracking deprecated identifiers](https://github.com/NCATSTranslator/Babel/issues/93)). - If you find an issue related to yours, please feel free to add yours as a child of the - existing issue or vice versa. 
- - You can use labels to group similar issues. We don't have a lot of labels - for you to choose from, but feel free to add any that make sense! +1. For guidance on how to assign priority, impact and size fields, group related + issues, and track when your issue is likely to be addressed, see + [docs/Triage.md](./docs/Triage.md). ## Contributing source code -Babel is structured around its [Snakemake files](./src/snakefiles), which call -into its [data handlers](./src/datahandlers) and -[compendia creators](./src/createcompendia). The heart of its data are its -concord files, which contain cross-references between different databases. These -are combined into compendium files and synonyms. +For an overview of how Babel's source code is organized — including the two-phase pipeline, +the role of concord files, and the key patterns used throughout the codebase — see +[docs/Architecture.md](./docs/Architecture.md). + +For a detailed guide to the development workflow — including how to obtain prerequisites, build +individual compendia, and ideas for making the pipeline easier to work with — see +[docs/Development.md](./docs/Development.md). We use three linters to check the style of submitted code in GitHub pull requests -- don't worry if this is difficult to do at your end, as it is easy to @@ -96,8 +59,6 @@ fix in a pull request: ### Contributing tests -TODO - Tests are written using [pytest](https://pytest.org/) and are present in the `tests` directory. You can run these tests by running `PYTHONPATH=. uv run pytest`. @@ -106,22 +67,17 @@ Tests are written using [pytest](https://pytest.org/) and are present in the [working on that](https://github.com/NCATSTranslator/Babel/issues/602), and if you can help get them to pass, that would be great! 
-### Writing a new concord or compendium - -TODO +### Writing a new concord, compendium, or data source -### Adding a new source of identifiers, synonyms or descriptions - -TODO +See [docs/Architecture.md](./docs/Architecture.md) for an overview of where new code goes, +and [docs/Development.md](./docs/Development.md) for the development workflow. ## Want to work on the frontends instead? Babel has two frontends: the [Node Normalizer] for exposing information about cliques, and the [Name Resolver], which lets you search by synonyms or names. - -- -- -- +Both of these could use help with issues that are specific to them! Please check +their GitHub repositories to see what improvements they need. [babel issue tracker]: https://github.com/NCATSTranslator/Babel/issues/ [name resolver]: https://github.com/NCATSTranslator/NameResolution diff --git a/README.md b/README.md index 384c8e2e..8ebc1bff 100644 --- a/README.md +++ b/README.md @@ -105,7 +105,7 @@ identifier for any identifier provided. In addition to returning the preferred identifier and all the secondary identifiers for a clique, NodeNorm will also return its Biolink type and -["information content" score](#what-are-information-content-values), and +["information content" score](./docs/Understanding.md#what-are-information-content-values), and optionally any descriptions we have for these identifiers. It also includes some endpoints for normalizing an entire TRAPI message and @@ -129,160 +129,28 @@ You can find out more about NameRes at its ## Understanding Babel outputs -### How does Babel choose a preferred identifier for a clique? - -After determining the equivalent identifiers that belong in a single clique, -Babel sorts them in the order of CURIE prefixes for that Biolink type as -determined by the Biolink Model. 
For example, for a -[biolink:SmallMolecule](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes), -any CHEBI identifiers will appear first, followed by any UNII identifiers, and -so on. The first identifier in this list is the preferred identifier for the -clique. - -[Conflations](./docs/Conflation.md) are lists of identifiers that are merged in -that order when that conflation is applied. The preferred identifier for the -clique is therefore the preferred identifier of the first clique being -conflated. - -- For GeneProtein conflation, the preferred identifier is a gene. -- For DrugChemical conflation, Babel uses the - [following algorithm](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/src/createcompendia/drugchemical.py#L466-L538): - 1. We first choose an overall Biolink type for the conflated clique. To do this, we use a - ["preferred Biolink type" order](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L32-L50) - that can be configured in [config.yaml](./config.yaml) and choose the most preferred Biolink - type that is present in the conflated clique. - 1. We then group the cliques to be conflated by the prefix of their preferred - identifier, and sort them based on the preferred prefix order for the - chosen Biolink type. - 1. If there are multiple cliques with the same prefix in their preferred - identifier, we use the following criteria to sort them: - 1. A clique with a lower information content value will be sorted before - those with a higher information content or no information content at - all. - 1. A clique with more identifiers are sorted before those with fewer - identifiers. - 1. A clique whose preferred identifier has a smaller numerical suffix will - be sorted before those with a larger numerical suffix. - -### How does Babel choose a preferred label for a clique? 
- -For most Biolink types, the preferred label for a clique is the label of the -preferred identifier. There is a -[`demote_labels_longer_than`](https://github.com/NCATSTranslator/Babel/blob/master/config.yaml#L437) -configuration parameter that -- if set -- will cause labels that are longer than -the specified number of characters to be ignored unless no labels shorter than -that length are present. This is to avoid overly long labels when a more concise -label is available. - -Biolink types that are chemicals (i.e. -[biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/) and its -subclasses) have a special list of -[preferred name boost prefixes](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L416-L426) -that are used to prioritize labels. This list is currently: - -1. DRUGBANK -1. DrugCentral -1. CHEBI -1. MESH -1. CHEMBL.COMPOUND -1. GTOPDB -1. HMDB -1. RXCUI -1. PUBCHEM.COMPOUND - -[Conflations](./docs/Conflation.md) are lists of identifiers that are merged in -that order when that conflation is applied. The preferred label for the -conflated clique is therefore the preferred label of the first clique being -conflated. - -## Where do the clique descriptions come from? - -Currently, all descriptions for NodeNorm concepts come from -[UberGraph](https://github.com/INCATools/ubergraph/). You will note that -descriptions are collected for every identifier within a clique, and then the -description associated with the most preferred identifier is provided for the -preferred identifier. Descriptions are not included in NameRes, but the -`description` flag can be used to include any descriptions when returning -cliques from NodeNorm. - -### What are "information content" values? 
- -Babel obtains information content values for over 3.8 million concepts from -[Ubergraph](https://github.com/INCATools/ubergraph?tab=readme-ov-file#graph-organization) -based on the number of terms related to the specified term as either a subclass -or any existential relation. They are decimal values that range from 0.0 -(high-level broad term with many subclasses) to 100.0 (very specific term with -no subclasses). - -## Reporting incorrect Babel cliques - -### I've found two or more identifiers in separate cliques that should be considered identical - -Please report this "split" clique as an issue to the -[Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues). At a -minimum, please include the identifiers (CURIEs) for the identifiers that should -be combined. Links to a NodeNorm instance showing the two cliques are very -helpful. Evidence supporting the lumping, such as a link to an external database -that makes it clear that these identifiers refer to the same concept, are also -very helpful: while we have some ability to combine cliques manually if needed -urgently for some application, we prefer to find a source of mappings that would -combine the two identifiers, allowing us to improve cliquing across Babel. - - -### I've found two or more identifiers combined in a single clique that actually identify different concepts - - -Please report this "lumped" clique as an issue to the -[Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues). At a -minimum, please include the identifiers (CURIEs) for the identifiers that should -be split. Links to a NodeNorm instance showing the lumped clique is very -helpful. 
Evidence, such as a link to an external database that makes it clear -that these identifiers refer to the same concept, are also very helpful: while -we have some ability to combine cliques manually if needed urgently for some -application, we prefer to find a source of mappings that would combine the two -identifiers, allowing us to improve cliquing across Babel. +For a detailed explanation of how Babel constructs cliques, chooses preferred identifiers and +labels, sources descriptions, and calculates information content values — as well as guidance on +reporting incorrect cliques — see [docs/Understanding.md](./docs/Understanding.md). ## Running Babel ### How can I run Babel? -Babel is difficult to run, primarily because of its inefficient memory handling --- we currently need around 500G of memory to build the largest compendia -(Protein and DrugChemical conflated information), although the smaller compendia -should be buildable with far less memory. We are working on reducing these -restrictions as far as possible. You can read more about -[Babel's build process](docs/RunningBabel.md), and please do contact us if you -run into any problems or would like some assistance. - -We have [detailed instructions for running Babel](docs/RunningBabel.md), but the -short version is: - -- We use [uv](https://docs.astral.sh/uv/) to manage Python dependencies. You can - use the - [Docker image](https://github.com/NCATSTranslator/Babel/pkgs/container/babel) - if you run into any difficulty setting up the prerequisites. -- We use [Snakemake](https://snakemake.github.io/) to handle the dependency - management. - -Therefore, you should be able to run Babel by cloning this repository and -running: - -```shell -$ uv run snakemake --cores [NUMBER OF CORES TO USE] -``` - -The number of cores can be specified as `all` in order to use all available cores on your machine. 
- -The [./slurm/run-babel-on-slurm.sh](./slurm/run-babel-on-slurm.sh) Bash script -can be used to start running Babel as a Slurm job. You can set the BABEL_VERSION -environment variable to document which version of Babel you are running. +Babel requires significant memory — around 500 GB to build the largest compendia (Protein and +DrugChemical conflated), though smaller compendia need far less. It uses +[uv](https://docs.astral.sh/uv/) for Python dependency management and +[Snakemake](https://snakemake.github.io/) for build orchestration. See +[docs/RunningBabel.md](docs/RunningBabel.md) for detailed instructions, configuration, and +Slurm job setup. ## Contributing to Babel If you want to contribute to Babel, start with the [Contributing to Babel](./CONTRIBUTING.md) documentation. This will provide guidance on how the source code is organized, what contributions are most -useful, and how to run the tests. +useful, and how to run the tests. For a deeper look at the development +workflow and ideas for improving it, see [Developing Babel](./docs/Development.md). ## Contact information diff --git a/docs/Architecture.md b/docs/Architecture.md new file mode 100644 index 00000000..6ec252d6 --- /dev/null +++ b/docs/Architecture.md @@ -0,0 +1,140 @@ +# Babel Architecture + +This document describes how Babel's source code is organized, how data flows through the pipeline, +and the key patterns and data structures that appear throughout the codebase. It is intended for +contributors who want to understand the system before making changes. + +For instructions on how to run and configure the pipeline, see +[RunningBabel.md](./RunningBabel.md). For the development workflow and known challenges, see +[Development.md](./Development.md). + +## Pipeline overview + +Babel's pipeline has two phases, orchestrated by [Snakemake](https://snakemake.github.io/): + +1. 
**Data collection** — individual data handlers download source data from FTP servers and the + web, then parse and normalize it into two kinds of files per source: + - `labels` files: CURIE → preferred name mappings + - `synonyms` files: CURIE → predicate → synonym mappings + + These files are written into `babel_downloads/[PREFIX]/`. + +2. **Compendium building** — for each semantic type (e.g. chemicals, genes, anatomy), a compendium + creator module reads the relevant label and synonym files, extracts the identifiers for that + type into `babel_outputs/intermediate/[PIPELINE]/ids/`, produces pairwise cross-reference files + called **concords**, merges the concords into equivalence cliques using a union-find algorithm, + and writes enriched JSONL compendia to `babel_outputs/compendia/[BIOLINK TYPE].txt`. + +The top-level `Snakefile` coordinates the whole pipeline by including ~20 specialized snakefiles +from `src/snakefiles/` — one per semantic type, plus files for data collection, reports, exports, +and DuckDB integration. + +## Configuration + +The main configuration file is [`config.yaml`](../config.yaml) at the repository root. It contains: + +- Directory paths for inputs and outputs +- Version strings for the current build +- Per-semantic-type lists of valid CURIE prefixes and their priority ordering +- Chemical-specific settings such as `preferred_name_boost_prefixes` and `demote_labels_longer_than` + +The `UMLS_API_KEY` environment variable is required for downloading UMLS and RxNorm data. You +can obtain a UMLS API key by setting up a [UMLS Terminology Services](https://uts.nlm.nih.gov/uts/) +account and looking up your API key in [your profile](https://uts.nlm.nih.gov/uts/profile). 
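To make the compendium-building phase from the pipeline overview more concrete, here is a minimal sketch of merging concord triples into equivalence cliques. It assumes tab-separated `CURIE1<TAB>Relation<TAB>CURIE2` lines and uses a plain dictionary of shared sets as a simpler stand-in for a full union-find structure. The name `merge_concords` is hypothetical and is not the signature of Babel's own merging code.

```python
def merge_concords(lines):
    """Merge concord triples into equivalence cliques (illustrative sketch only)."""
    clique_of = {}  # CURIE -> the set (clique) that currently contains it
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        curie1, _relation, curie2 = parts
        merged = {curie1, curie2}
        # Union in any cliques these two CURIEs already belong to.
        for curie in (curie1, curie2):
            merged |= clique_of.get(curie, set())
        # Point every member at the newly merged clique.
        for member in merged:
            clique_of[member] = merged
    # Many CURIEs share one set object, so deduplicate by object identity.
    unique = {id(clique): clique for clique in clique_of.values()}
    return sorted(sorted(clique) for clique in unique.values())


if __name__ == "__main__":
    example = [
        "A:1\tskos:exactMatch\tB:1\n",
        "C:1\tskos:exactMatch\tD:1\n",
        "B:1\tskos:exactMatch\tC:1\n",  # bridges the two cliques above
    ]
    for clique in merge_concords(example):
        print(clique)
```

The relation column is ignored here because every triple is treated as an equivalence; a real pipeline may need to treat different relations differently.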
+ +## Source code layout + +All Python and Snakemake source code lives under `src/`: + +| Directory / file | Purpose | +|------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `src/datahandlers/` | ~37 modules, one per external data source. Each module downloads, parses, and normalizes data from a specific source (ChEBI, UniProt, NCBI Gene, DrugBank, MeSH, etc.). | +| `src/createcompendia/` | ~15 modules, one per semantic type (chemicals, genes, proteins, anatomy, disease/phenotype, etc.). These consume data handler outputs, build concords, and write final compendia. | +| `src/snakefiles/` | Snakemake rule files that wire data handlers to compendium creators and define the full dependency graph. | +| `src/node.py` | Core factory classes: `NodeFactory`, `SynonymFactory`, `DescriptionFactory`, `TaxonFactory`, `InformationContentFactory`, `TSVSQLiteLoader`. | +| `src/babel_utils.py` | Core pipeline utilities: download/FTP helpers, `glom()` (clique merging), `write_compendium()` (compendium builder), and state management helpers. | +| `src/util.py` | Logging setup, config loading, [Biolink Model Toolkit](https://github.com/biolink/biolink-model-toolkit) access. | +| `src/exporters/` | Output format handlers for KGX, Apache Parquet, and JSONL. | +| `src/reports/` | Report generation code. | +| `src/synonyms/` | Synonym file generation. | +| `src/metadata/` | Provenance and metadata handling. | + +## Key data structures + +### Concord files + +Concord files are the central intermediate data structure in Babel. Each concord file is a +tab-separated file of triples: + +```text +CURIE1 Relation CURIE2 +``` + +Each triple expresses that `CURIE1` and `CURIE2` are related by `Relation` (typically +`skos:exactMatch` or an equivalent). 
The compendium building phase reads all concord files for a +semantic type and feeds them into the `glom()` function in `src/babel_utils.py` to merge them into +equivalence cliques. + +### Compendium JSONL + +Each line of a compendium file is a JSON object representing one clique. A clique includes: + +- `identifiers` — list of all equivalent CURIEs, in preferred-prefix order +- `ic` — information content score (from UberGraph) +- `taxa` — associated taxa (for genes, proteins, etc.) +- `preferred_name` — the preferred human-readable label for the clique +- `descriptions` — descriptions collected from UberGraph +- `type` — Biolink semantic type + +The first identifier in `identifiers` is the preferred identifier for the clique. See +[DataFormats.md](./DataFormats.md) for the full format specification. + +## Key patterns + +### Factory pattern for large datasets + +`NodeFactory`, `SynonymFactory`, `DescriptionFactory`, `TaxonFactory`, and +`InformationContentFactory` (all in `src/node.py`) use a factory pattern for lazy loading. +Rather than loading entire datasets into memory up front, they load data on demand and cache +results. This is important because many source files are gigabytes in size. + +### TSVSQLiteLoader + +`TSVSQLiteLoader` (in [`src/node.py`](../src/node.py)) loads tab-separated files into in-memory +SQLite databases that spill to disk when memory pressure is high. This avoids the need to hold +entire large TSV files in RAM, which would be infeasible given Babel's data volumes. + +### Clique merging via `glom()` + +`glom()` in [`src/babel_utils.py`](../src/babel_utils.py) merges concord triples into equivalence +cliques. It maintains a dictionary (`conc_set`) where every CURIE key points to its equivalence +set. For each new `(CURIE1, relation, CURIE2)` triple, it unions all existing sets that contain +either CURIE, then adds both CURIEs to the resulting set. At the end, each value in `conc_set` is +one clique. 
`write_compendium()` in the same file drives the overall compendium-building process, +calling `glom()` and then sorting, enriching, and writing the output. + +### Biolink Model integration + +All semantic types, valid CURIE prefixes, and naming conventions follow the +[Biolink Model](https://biolink.github.io/biolink-model/). The +[Biolink Model Toolkit](https://github.com/biolink/biolink-model-toolkit) is accessed via +[`src/util.py`](../src/util.py) and is used throughout the codebase to validate types, look up +preferred prefix orderings, and check whether a given prefix is valid for a type. + +### Conflation modules + +GeneProtein and DrugChemical conflations each have dedicated conflation modules +([`src/createcompendia/geneprotein.py`](../src/createcompendia/geneprotein.py) and +[`src/createcompendia/drugchemical.py`](../src/createcompendia/drugchemical.py)) that merge their +respective cliques after the initial compendium build. See [Conflation.md](./Conflation.md) for +details on what conflation means and how it works. + +## Output directories + +When the pipeline runs, it creates and populates these directories: + +| Directory | Contents | +|-------------------------------|-----------------------------------------------------------------------------------------------------| +| `babel_downloads/` | Cached source data, organized by prefix (e.g. `babel_downloads/CHEBI/`). Can be reused across runs. | +| `babel_outputs/intermediate/` | Intermediate build artifacts: ids files, concord files, per-type label and synonym aggregates. | +| `babel_outputs/` | Final outputs: compendia (JSONL), synonym files, reports, and exports (Parquet, KGX). 
| diff --git a/docs/DataFormats.md b/docs/DataFormats.md index 4b9dbe18..6230acdf 100644 --- a/docs/DataFormats.md +++ b/docs/DataFormats.md @@ -57,7 +57,7 @@ This entry consists of the following fields: | Field | Value | Meaning | |------------------|-------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| ic | 100 | Information content value (see [the main README](../README.md#what-are-information-content-values)). They are decimal values that range from 0.0 (high-level broad term with many subclasses) to 100.0 (very specific term with no subclasses). | +| ic | 100 | Information content value (see [Understanding.md](./Understanding.md#what-are-information-content-values)). They are decimal values that range from 0.0 (high-level broad term with many subclasses) to 100.0 (very specific term with no subclasses). | | identifiers | _See below_ | A list of identifiers for this clique. This is arranged in the same order as the valid ID prefixes for this type in the Biolink Model, e.g. [starting with NCBIGene and ENSEMBL for `biolink:Gene`](https://biolink.github.io/biolink-model/Gene/#valid-id-prefixes). | | identifiers[0].i | NCBIGene:2358 | A CURIE representing this identifier. You can use the [Biolink Model prefixmap](https://github.com/biolink/biolink-model/tree/master/project/prefixmap) to expand this into a full concept IRI. | | identifiers[0].l | G6PC1 | A label for this identifier. This will almost always be from the source of the CURIE (in this case, the label is from the NCBI Gene database). 
| diff --git a/docs/Deployment.md b/docs/Deployment.md index 6ab14e4a..f4591d68 100644 --- a/docs/Deployment.md +++ b/docs/Deployment.md @@ -1,4 +1,4 @@ -# Release information for Babel, NodeNorm and NameRes +# Deployment information for Babel, NodeNorm and NameRes There are two main installations of NodeNorm that would be of interest to users who aren't system administrators for these tools: @@ -19,16 +19,18 @@ to users who aren't system administrators for these tools: server. 2. Update the Translator-devops repo with the URL to these Babel output files. 3. Create a [Redis R3 External] instance to store identifiers. - 4. Run the [NodeNorm loader] to load the Babel outputs into the redis r3 instance. - 5. Create a [NodeNorm web server] to share the data in the redis r3 instance. -4. Deploy a new NameRes instance + 4. Run the [NodeNorm loader] to load the Babel outputs into a Redis instance. + 5. Create a [NodeNorm web server] to share the data in a Redis instance. +4. Deploy a new NameRes instance (either + [locally](https://github.com/NCATSTranslator/NameResolution/blob/master/documentation/Deployment.md) + or + [on Kubernetes](https://github.com/helxplatform/translator-devops/tree/ed25b5f5bfe2383ade8457da97341c90500f5291/helm/name-lookup)) 1. Create an empty Apache Solr instance. 2. Load it with synonym information from Babel outputs. 3. Write out a Solr backup and store it as a tarball. 4. Copy the Solr backup to a publicly accessible URL. 5. Update the Translator-devops repo with the new URL. - 6. Create a NameRes instance that will download the Solr backup and start the instance with it - (see [NameRes devops] for information). + 6. Create a NameRes instance that will download the Solr backup and start the instance with it. 5. Use the [Babel Validator] to test this release and check how it performs compared to the previous release. 6. 
Use the @@ -55,4 +57,3 @@ to users who aren't system administrators for these tools: [Redis R3 External]: https://github.com/helxplatform/translator-devops/tree/ed25b5f5bfe2383ade8457da97341c90500f5291/helm/redis-r3-external [NodeNorm loader]: https://github.com/helxplatform/translator-devops/tree/ed25b5f5bfe2383ade8457da97341c90500f5291/helm/node-normalization-loader [NodeNorm web server]: https://github.com/helxplatform/translator-devops/tree/ed25b5f5bfe2383ade8457da97341c90500f5291/helm/node-normalization-web-server -[NameRes devops]: https://github.com/helxplatform/translator-devops/tree/ed25b5f5bfe2383ade8457da97341c90500f5291/helm/name-lookup diff --git a/docs/Development.md b/docs/Development.md new file mode 100644 index 00000000..6c8beac5 --- /dev/null +++ b/docs/Development.md @@ -0,0 +1,241 @@ +# Developing Babel + +This document describes the current development workflow for Babel and ideas for improving it. + +## Current Development Process + +Developing a change to Babel is significantly more complicated than developing most software, +because the pipeline operates on very large data files that take hours to download, gigabytes of +disk space to store, and hundreds of gigabytes of RAM to process. This creates a feedback loop +that is slow by necessity. + +### Typical workflow + +1. **Build prerequisites.** Before writing any code, you need the input files for the Snakemake + step you plan to modify. There are two ways to get these: + - Run the upstream Snakemake rules yourself (which may themselves require large downloads). + - Copy intermediate files from the last successful full run, e.g. from the SLURM cluster. + +2. **Write code.** Implement the change locally, iterating against whatever prerequisite files are + available. + +3. 
**Run the relevant target.** For example, if you changed how anatomy compendia are built: + + ```bash + uv run snakemake --cores 1 anatomy + ``` + + This produces compendia and synonym files for the anatomy semantic type, but does _not_ trigger + the pipeline-wide reports, which require _all_ compendia, synonym, and conflation files. + +4. **Review your own output.** Because reports are not available, you must inspect the output files + manually — checking JSONL structure, spot-checking a few cliques, and reviewing completeness + reports. There is no automated feedback at this stage. + +5. **Merge.** The change goes in without confirmation that it behaves correctly in the context of + the full pipeline. + +6. **Wait for SLURM.** The next time someone runs the full pipeline on the SLURM cluster, you + find out whether your change worked. If it did not, the turnaround for a fix is another full + run. + +### Why this is hard + +- **Download cost.** Many data sources (UMLS, UniChem, PubChem, UniProt TrEMBL) are multi-gigabyte + downloads that take hours. You cannot easily re-download them per experiment. +- **Memory cost.** The chemical and protein compendia steps require 512 GB of RAM, which is not + available on a laptop or a typical workstation. +- **Cross-type report dependencies.** The `all_reports` target in `src/snakefiles/reports.snakefile` + requires every compendium, synonym, and conflation file to exist before it will run. Building one + semantic type's compendium in isolation does not satisfy these dependencies. +- **No unit-testable seams.** The core compendium-building logic (`createcompendia/`) reads and + writes large files. There is no easy way to run it against a small, synthetic dataset without + manually constructing the full file layout. +- **Opaque intermediate state.** Snakemake tracks what has been built via file existence and + timestamps. There is no summary of what prerequisites are present and what is missing. 
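The manual review step in the workflow above could be partly scripted. The following is a minimal sketch, assuming each compendium line is a JSON object with a `type` field and an `identifiers` list whose entries store the CURIE under the key `i` (as in the compendium format described in DataFormats.md). The helper name `summarize_compendium` is hypothetical and does not exist in this repository.

```python
import json
from collections import Counter


def summarize_compendium(lines):
    """Summarize cliques from an iterable of compendium JSONL lines (sketch only)."""
    clique_count = 0
    largest_clique = 0
    prefix_counts = Counter()
    type_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        clique = json.loads(line)
        identifiers = clique.get("identifiers", [])
        clique_count += 1
        largest_clique = max(largest_clique, len(identifiers))
        type_counts[clique.get("type", "unknown")] += 1
        for identifier in identifiers:
            # The CURIE prefix is everything before the first colon.
            prefix_counts[identifier["i"].split(":", 1)[0]] += 1
    return {
        "cliques": clique_count,
        "largest_clique": largest_clique,
        "prefixes": dict(prefix_counts),
        "types": dict(type_counts),
    }


if __name__ == "__main__":
    sample = [
        '{"type": "biolink:Gene", "identifiers": [{"i": "NCBIGene:2358"}]}',
    ]
    print(summarize_compendium(sample))
```

Because the function accepts any iterable of lines, it can be pointed at an open file handle for a real compendium or at a small hand-written sample, which makes it usable without a full pipeline run.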
+ +--- + +## Ideas for Improvement + +The suggestions below range from small scripts you could add this week to multi-month architectural +changes. All are worth considering. + +### Small, practical improvements + +#### 1. Proposed per-compendium assessment script (`src/scripts/assess_compendium.py`, not yet implemented) + +A standalone CLI script that would take a compendium JSONL file as input and print a human-readable +summary: number of cliques, clique size distribution, identifier prefix breakdown, large-clique +examples, and any structural validation errors. This would mirror what the pipeline's `assess` rules +do today, but could be run against _any_ compendium file, including one built from a partial +dataset, without needing the full pipeline to have run. This script and its CLI entrypoint are +**not** implemented yet. + +For example, if such a CLI were added, you might run: + +```bash +# Hypothetical example; `assess-compendium` does not exist in this repository today. +uv run assess-compendium babel_outputs/compendia/AnatomicalEntity.txt +``` + +#### 2. Proposed compendium diff script (`src/scripts/diff_compendia.py`, not yet implemented) + +A CLI script that compares two compendium files and reports: + +- Cliques that appear in one but not the other. +- Cliques whose membership changed (identifiers added or removed). +- Cliques whose preferred identifier (clique leader) changed. +- Summary statistics: total cliques gained/lost, total identifiers gained/lost. + +This would immediately tell you whether a code change had the intended effect when you copy a +before/after snapshot. + +```bash +uv run diff-compendia old/AnatomicalEntity.txt new/AnatomicalEntity.txt +``` + +#### 3. CURIE lookup script (`src/scripts/lookup_curie.py`) + +A CLI script that searches all compendium files in a directory for a given CURIE and prints the +full clique it belongs to. Useful for spot-checking whether a specific identifier was correctly +merged. 
+ +```bash +uv run lookup-curie MESH:D014867 --compendia-dir babel_outputs/compendia/ +``` + +#### 4. Concord inspector script (`src/scripts/inspect_concord.py`) + +A CLI script that reads one or more concord files (the `CURIE1 \t relation \t CURIE2` files in +`intermediate/*/concords/`) and shows statistics: which prefixes appear, how many cross-references +exist per prefix pair, and examples of entries. This makes it easier to verify that a concord +generation step is working before building the full compendium. + +```bash +uv run inspect-concord babel_outputs/intermediate/chemicals/concords/CHEBI +``` + +#### 5. Snakemake dependency checker (`src/scripts/check_prerequisites.py`) + +A script that reads `config.yaml` and checks which intermediate and download files are present on +disk, printing a table of what is available versus missing. This would tell you immediately which +prerequisites you need to copy before you can run a particular target. + +```bash +uv run check-prerequisites --target anatomy +``` + +#### 6. Per-type report targets in `src/snakefiles/reports.snakefile` + +Currently `all_reports` requires all compendia. Adding per-type report targets (e.g., +`anatomy_report`, `chemical_report`) that only require the files for that semantic type would let +you run a meaningful report in isolation. The per-compendium content report rules already exist +(`generate_content_report_for_compendium_*`); they just need to be wired into per-type aggregate +rules. + +#### 7. Structured logging for compendium building + +The compendium-building Python code (`createcompendia/`) currently logs to a mix of `print` and +Python logging. Adding structured JSON log output (e.g., counts of identifiers processed, concords +merged, cliques formed at each stage) would make it possible to write a script that summarizes +pipeline behavior from logs alone, without inspecting output files. + +--- + +### Medium-effort improvements + +#### 8. 
Mini-dataset fixtures for each data source + +Each data handler in `src/datahandlers/` reads a specific file format. For each handler, create a +small, representative fixture file (a few hundred rows) checked into `tests/fixtures/`. Then write +a test that runs the handler against the fixture and checks the resulting `labels`, `synonyms`, and +`ids` files. This would let you run a fast integration test for a data handler without downloading +anything. + +This builds naturally on the existing test infrastructure in `tests/`. + +#### 9. Development config with smaller targets (`config.dev.yaml`) + +A second `config.yaml` variant that points to smaller fixture datasets and has reduced prefix +lists. Running `uv run snakemake --configfile config.dev.yaml --cores 4` would exercise the full +pipeline structure (all rules, all file handoffs) against toy data, completing in minutes rather +than hours. The output would be structurally valid but not biologically complete. + +#### 10. Standalone compendium builder script + +A script that accepts a list of concord files and id files as command-line arguments and runs +just the clique `glom()` method to produce a compendium file, without any Snakemake +involvement. This decouples the algorithmic core from the orchestration layer, making it easy to +experiment with clique-merging logic on captured intermediate files. + +```bash +uv run build-cliques \ + --ids intermediate/chemicals/ids/* \ + --concords intermediate/chemicals/concords/* \ + --output my_test_compendium.jsonl +``` + +#### 11. Remote intermediate file cache + +A script (or Snakemake rule) that syncs a canonical set of intermediate files from object storage +(S3, GCS, or a shared NFS path) to your local machine. This means developers don't have to run +data collection steps themselves — they pull the outputs of the last successful full run. 
Combined +with a clear versioning scheme (tied to the data source versions in `config.yaml`), this could +eliminate most of the prerequisite-gathering step. + +#### 12. Compendium regression test suite + +After each full pipeline run, serialize summary statistics for every compendium (total cliques, +cliques per prefix, median clique size, etc.) as a JSON file and commit it to the repository. +On subsequent runs, compare against this baseline and fail if any metric changes by more than a +configurable threshold. This would catch regressions automatically and provide the feedback loop +that is currently missing. + +--- + +### Large, sweeping changes + +#### 13. Isolate semantic types into independent sub-pipelines + +Currently the reports depend on all semantic types together, creating a hard global dependency. +If each semantic type were a self-contained sub-pipeline — with its own report, its own +completeness check, and its own done-marker — developers could run and validate a single type +end-to-end without touching any other type. This would require refactoring the report rules but +would not change the pipeline logic. + +#### 14. Unit-testable Python API for compendium building + +The compendium-building code in `createcompendia/` directly reads and writes files. Refactoring it +so that each function accepts Python data structures (lists of ID tuples, concord triples) and +returns clique structures — with file I/O as a separate layer — would make every step independently +unit-testable. Snakemake rules would remain as thin wrappers that read inputs from disk, call the +Python API, and write outputs to disk. + +This is the highest-value architectural change for long-term maintainability. + +#### 15. Streaming / chunked processing with DuckDB + +Several steps require hundreds of gigabytes of RAM because they load entire files into memory. +`TSVSQLiteLoader` already attempts to mitigate this with an in-memory SQLite database. 
A further +step would be to use DuckDB (already present in the pipeline for exports) as the primary +intermediate store throughout compendium building — storing ids, concords, and partial cliques in +DuckDB tables on disk, and performing joins and aggregations inside DuckDB rather than in Python +memory. This would reduce RAM requirements substantially and make more steps runnable on +development hardware. + +#### 16. Containerized development environment with prebuilt downloads + +A Docker image (separate from the production image) that includes a curated, compressed snapshot +of all download data — not full production datasets, but representative subsets sufficient to +exercise every code path. A developer could `docker pull` this image, mount their source code, and +run the full pipeline against it in a few hours on a workstation. This is the closest thing to a +reproducible, low-friction development environment for a pipeline of this scale. + +#### 17. Per-data-source version pinning and change detection + +Currently, when an upstream data source changes (e.g. a new UMLS release), it is not always clear +which parts of the pipeline are affected. Adding an explicit version manifest — a file that records +the version/checksum of each downloaded resource — would allow a script to compare against the +previous manifest and report exactly which downstream compendia need to be rebuilt. This would +make release preparation more predictable and reduce unnecessary re-runs. diff --git a/docs/README.md b/docs/README.md index 04d90a2a..90580852 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,16 +1,28 @@ -# Documentation +# Babel documentation -This folder collects a number of documentation files that provide information on different -aspects of Babel and its output. +This folder contains reference documentation for Babel, organized by audience. 
-## Babel model and formats +## For Babel users — understanding and using outputs -* [Conflation](./Conflation.md) describes the conflation options available in Babel. -* [Data Formats](./DataFormats.md) describes the output data formats produces by Babel. -* [Downloads](./Downloads.md) describes the Babel downloads we publish to the internet. +| Document | Description | +|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [Understanding.md](./Understanding.md) | How Babel constructs cliques, chooses preferred identifiers and labels, sources descriptions and IC values, and what split/lumped cliques are. Also covers how to report incorrect cliques. | +| [DataFormats.md](./DataFormats.md) | Compendium, synonym, and conflation file format specification. | +| [Conflation.md](./Conflation.md) | What conflations are, how GeneProtein and DrugChemical conflation work, and when to use them. | +| [Downloads.md](./Downloads.md) | Where to download published Babel outputs and which formats are available. | -## Running and deploying Babel +## For pipeline operators — running and deploying -* [Babel Jupyter Notebook](./Babel.ipynb) shows you what running Babel looks like. -* [Running Babel](./RunningBabel.md) provides additional information on running Babel. -* [Deployment](./Deployment.md) provides information on deploying Babel's outputs. +| Document | Description | +|--------------------------------------|-----------------------------------------------------------------------------------------| +| [RunningBabel.md](./RunningBabel.md) | Build instructions, configuration, Snakemake targets, and system requirements. | +| [Deployment.md](./Deployment.md) | Release checklist and deployment instructions for Node Normalization and Name Resolver. 
| +| [Babel.ipynb](./Babel.ipynb) | Interactive Jupyter notebook demonstrating what running Babel looks like. | + +## For contributors and maintainers + +| Document | Description | +|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| [Architecture.md](./Architecture.md) | Source code layout, data-flow narrative, key data structures (concord files, compendium JSONL), and key patterns (factory pattern, TSVSQLiteLoader, union-find, Biolink Model integration). | +| [Development.md](./Development.md) | Development workflow, how to obtain prerequisites, how to build individual compendia, known challenges, and ideas for improving the pipeline. | +| [Triage.md](./Triage.md) | **Part 1 (for users):** how to file a useful bug report, assign priority/impact/size, and track when your issue will be addressed. **Part 2 (for developers):** triage checklist, automated test syntax, and sprint planning heuristics. | diff --git a/docs/Triage.md b/docs/Triage.md new file mode 100644 index 00000000..cf3a8617 --- /dev/null +++ b/docs/Triage.md @@ -0,0 +1,252 @@ +# Babel Issue Triage + +This document describes how issues in the [Babel issue tracker] are triaged and prioritized using +the [Babel sprints GitHub project]. It is written for two audiences: + +- **Part 1: For users of Babel outputs** — how to file a useful bug report, how to assign + priority, impact and size when you file an issue, and how to read the project board to estimate + when your issue is likely to be addressed. +- **Part 2: For Babel developers** — how to triage incoming issues, how to add automated tests, + and how to select issues for the next sprint. 
+ +--- + +## Part 1: Reporting and tracking issues (for users) + +### Filing a bug report + +Before filing an issue, check whether a similar issue already exists in the +[Babel issue tracker]. If it does, you can add a comment with additional examples or "+1" the +issue to signal that it affects you too. You can also add your issue as a sub-issue of an existing +issue if the same underlying bug seems to be the cause. + +When filing a new issue, please include: + +- The identifiers or concept names that are behaving incorrectly (ideally as a table or TSV/CSV + attachment). +- What you expected Babel to return and what it actually returned. +- Which frontend you noticed the problem in: [Node Normalizer], [Name Resolver], or direct + inspection of Babel output files. +- Any additional context that will help replicate the problem. + +### Assigning priority, impact and size + +When you file an issue, please fill in three fields in the [Babel sprints GitHub project] to +help us understand how urgently it needs to be addressed. If you are unsure about any of these, +please leave them blank. Developers will fill them in during triage. + +#### Priority + +How urgent is this to fix? + +| Value | Meaning | +|--------------|-----------------------------------------------------------------------------------------------------------------------| +| **Critical** | Causes outright failures or produces seriously wrong results that are actively misleading downstream users right now. | +| **High** | Significantly degrades the quality or usability of Babel outputs, but a workaround exists. | +| **Medium** | A noticeable quality problem, but not one that breaks workflows. | +| **Low** | A minor issue or a nice-to-have improvement. | + +#### Impact + +How beneficial will fixing this issue be to Babel users? 
+ +| Value | Meaning | +|--------------|------------------------------------------------------------------------------------------------------------| +| **Enormous** | Will significantly improve clique or output quality, or will make future development substantially easier. | +| **High** | Will provide a large benefit to users or developers. | +| **Medium** | Will provide a moderate benefit to users or developers. | +| **Low** | Will provide a small benefit to users or developers. | + +#### Size + +How much effort do you think this fix will require? (This is an estimate; developers may adjust it.) + +| Value | Approximate effort | +|--------|---------------------------------------------------------------------------| +| **XS** | Trivial change — a configuration tweak or a one-line fix. | +| **S** | Small — a few hours of focused work. | +| **M** | Medium — up to a day or two of work. | +| **L** | Large — requires investigation and several days of implementation. | +| **XL** | Extra large — a substantial piece of work that may span multiple sprints. | + +### Grouping related issues + +If your issue looks like it may be caused by the same underlying bug as an existing issue, you +can set the **Parent issue** field to that issue. This helps developers see patterns and fix +related issues together. 
+ +You can also set the **Component** property to identify which part of Babel is affected: + +| Component | What it covers | +|-------------------------|---------------------------------------------------------------| +| Process | The overall pipeline for running Babel | +| Cliques and identifiers | What identifiers are or are not in a clique | +| Downloaders | Code that downloads source data | +| Metadata | Information content, taxon, or other metadata stored on nodes | +| Biolink types | How Biolink semantic types are assigned to cliques | +| Conflations | GeneProtein and DrugChemical conflation | +| Preferred labels | How preferred labels are chosen | +| Synonyms | Which synonyms are included | +| New data sources | Requests to add a new data source | +| Validation and reports | Validating Babel output or producing a report | +| Documentation | Improving or fixing Babel documentation | +| NodeNorm | [Node Normalizer] frontend | +| NameRes | [Name Resolver] frontend | + +### Tracking when your issue will be addressed + +Babel development is organized into two-week **sprints** using the +[Babel sprints GitHub project]. You can use the project board to see: + +- **Backlog** — issues that have been triaged and are waiting to be scheduled. +- **Ready** — issues that are queued for the current or next sprint. +- **In progress** — issues being actively worked on right now. +- **Needs review** — issues with a pull request awaiting review. +- **Done** — issues completed in recent sprints. + +At the start of each sprint, leftover items from the previous sprint are carried forward, and +then the highest-priority issues from the backlog are added. If an issue is unexpectedly large +or is displaced by a higher-priority item, it may be deferred to a later sprint. In general, a +**Critical + Enormous** issue will be scheduled very quickly, while a **Low + Low** issue may +sit in the backlog for a long time. 
+ +To estimate when your issue is likely to be addressed, look at how many **Critical** and **High** +priority issues are currently in the backlog ahead of yours. Issues are typically ordered by +priority first and then impact. + +--- + +## Part 2: Triage guide (for developers) + +### Triage checklist + +When a new issue arrives, work through the following steps: + +1. **Reproduce or understand the report.** Read the issue carefully. Can the problem be confirmed + from the description? If not, ask the reporter for more information (e.g. which Babel build + they are using, example identifiers). + +2. **Check for duplicates.** Search for existing issues that describe the same problem. If a + duplicate exists, close the new issue with a reference to the original (or add the new issue + as a sub-issue of the original). + +3. **Set Priority, Impact and Size.** If the reporter has filled these in, review them and adjust + if necessary. If they are blank, set them now based on your assessment. Don't be shy about + changing them in the future if necessary. + +4. **Set the Component field.** Choose the appropriate **Component** value (see table above) + if that would be useful. This can help group together related issues and for filtering during + sprint planning. + +5. **Link to a parent issue.** If this issue is one instance of a broader known problem (e.g. a + deprecated identifier source, or a class of missing cliques), set the **Parent issue** field. + +6. **Set Status to Backlog.** Unless the issue is Critical and needs immediate scheduling, move + it to the **Backlog** column. + +7. **Add an automated test.** See the next section for how to embed tests directly in the issue + description. + +### Adding automated tests to issues + +The [babel-validation] project can run automated checks against live NodeNorm and NameRes +instances that are triggered by issues. You can embed tests directly in issue descriptions or +comments using two syntaxes. 
+ +#### Wiki syntax (single assertion) + +```text +{{BabelTest|AssertionType|param1|param2|...}} +``` + +For example, to assert that two CURIEs resolve to the same clique: + +```text +{{BabelTest|ResolvesWith|MESH:D014867|DRUGBANK:DB09145}} +``` + +#### YAML syntax (multiple assertions) + +Use a fenced code block with the language tag `yaml` and a top-level property `babel_tests`: + +````text +This can be inserted anywhere in the issue. + +```yaml +babel_tests: + ResolvesWith: + - ['MESH:D014867', 'DRUGBANK:DB09145'] + HasLabel: + - ['MESH:D014867', 'Water'] +``` + +Having text after the fenced code block (or multiple code blocks) is fine too. +```` + +#### Available assertion types + +You can see an up-to-date list of supported assertions +[in the Babel Validation repository](https://github.com/TranslatorSRI/babel-validation/blob/3eeeccfb0d15451e45ecade7603404e096b30fb0/src/babel_validation/assertions/README.md). + + + +**NodeNorm assertions:** + +| Assertion | What it tests | +|----------------------|-------------------------------------------------------------------------------| +| `Resolves` | Each CURIE returns a non-null result from NodeNorm. | +| `DoesNotResolve` | Each CURIE intentionally fails to normalize. | +| `ResolvesWith` | Two or more CURIEs normalize to identical results. | +| `DoesNotResolveWith` | Two or more CURIEs do NOT resolve to the same entity. | +| `HasLabel` | A CURIE's primary label exactly matches the expected string (case-sensitive). | +| `ResolvesWithType` | CURIEs resolve with a specified Biolink semantic type. | + +**NameRes assertions:** + +| Assertion | What it tests | +|----------------|-------------------------------------------------------------------------------------| +| `SearchByName` | A CURIE appears in the top N NameRes results for a given text string (default N=5). 
| + +**Special:** + +| Assertion | Meaning | +|-----------|----------------------------------------------------------------------------------| +| `Needed` | Placeholder marking that a test needs to be written. Always fails as a reminder. | + +When adding tests to an issue, use `{{BabelTest|Needed}}` as a placeholder if you know a test +is needed but do not yet know the exact expected values. + +### Sprint planning + +Sprints are two weeks long. At the start of each sprint: + +1. **Carry over unfinished items.** Any issues still **In progress** or **Ready** that were not + completed automatically move to the next sprint. + +2. **Review the backlog.** Sort the backlog by Priority (descending) then Impact (descending). + Consider Size to avoid overloading a sprint — a sprint full of XL issues will not complete on + time. + +3. **Select issues for the sprint.** Choose the highest-priority issues that together represent a + realistic amount of work for two weeks. Move selected issues to **Ready**. + +4. **Adjust if needed.** An issue may be removed from the current sprint mid-sprint if it turns + out to be much larger than estimated, or if a Critical issue arrives that must take precedence. + In either case, the deferred issue should be the first candidate for the next sprint. + +#### Heuristics for issue selection + +- Prefer **Critical** issues regardless of impact. +- Among **High** and **Medium** priority issues, prefer those with **Enormous** or **High** impact. +- Prefer **XS** and **S** issues when a sprint already contains several large items — clearing + small issues reduces backlog pressure. +- Issues with automated tests (see above) are easier to verify once fixed; prefer these when all + else is equal. +- Group issues sharing the same **Parent issue** or **Component** — fixing a class of bugs together + is more efficient than fixing them one at a time across different sprints. 
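The backlog ordering described under "Sprint planning" (Priority descending, then Impact descending) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Issue` dataclass, the two rank tables, and the sample issue numbers are hypothetical and are not part of Babel or of the GitHub project, and every issue is assumed to have Priority and Impact filled in.

```python
from dataclasses import dataclass

# Hypothetical rank tables mirroring the Priority and Impact values defined in
# this document; a lower rank sorts earlier.
PRIORITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}
IMPACT_RANK = {"Enormous": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass
class Issue:
    """Hypothetical stand-in for an issue on the project board."""

    number: int
    priority: str
    impact: str
    size: str


def sort_backlog(issues):
    """Order issues by Priority (most urgent first), then by Impact (most beneficial first)."""
    return sorted(issues, key=lambda i: (PRIORITY_RANK[i.priority], IMPACT_RANK[i.impact]))


# Made-up issues used only to demonstrate the ordering.
backlog = [
    Issue(101, "Low", "Low", "XS"),
    Issue(102, "High", "Medium", "M"),
    Issue(103, "Critical", "Low", "L"),
    Issue(104, "High", "Enormous", "S"),
]

for issue in sort_backlog(backlog):
    print(issue.number, issue.priority, issue.impact, issue.size)
```

Size is deliberately left out of the sort key: as the steps above note, it is a capacity check applied while selecting issues, not an ordering criterion.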
+ +[Babel issue tracker]: https://github.com/NCATSTranslator/Babel/issues/ +[Babel sprints GitHub project]: https://github.com/orgs/NCATSTranslator/projects/36 +[babel-validation]: https://github.com/TranslatorSRI/babel-validation +[Name Resolver]: https://github.com/NCATSTranslator/NameResolution +[Node Normalizer]: https://github.com/NCATSTranslator/NodeNormalization diff --git a/docs/Understanding.md b/docs/Understanding.md new file mode 100644 index 00000000..928e410f --- /dev/null +++ b/docs/Understanding.md @@ -0,0 +1,123 @@ +# Understanding Babel outputs + + + +This document explains *how and why* Babel constructs its outputs: how cliques are formed, +how preferred identifiers and labels are chosen, where descriptions and IC values come from, +and how to report errors. For a description of *what the output files look like*, see +[DataFormats.md](./DataFormats.md). + +## How does Babel choose a preferred identifier for a clique? + +After determining the equivalent identifiers that belong in a single clique, +Babel sorts them in the order of CURIE prefixes for that Biolink type as +determined by the Biolink Model. For example, for a +[biolink:SmallMolecule](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes), +any CHEBI identifiers will appear first, followed by any UNII identifiers, and +so on. The first identifier in this list is the preferred identifier for the +clique. + +[Conflations](./Conflation.md) are lists of identifiers that are merged in +that order when that conflation is applied. The preferred identifier for the +clique is therefore the preferred identifier of the first clique being +conflated. + +- For GeneProtein conflation, the preferred identifier is a gene. +- For DrugChemical conflation, Babel uses the + [following algorithm](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/src/createcompendia/drugchemical.py#L466-L538): + 1. 
We first choose an overall Biolink type for the conflated clique. To do this, we use a + ["preferred Biolink type" order](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L32-L50) + that can be configured in [config.yaml](../config.yaml) and choose the most preferred Biolink + type that is present in the conflated clique. + 1. We then group the cliques to be conflated by the prefix of their preferred + identifier, and sort them based on the preferred prefix order for the + chosen Biolink type. + 1. If there are multiple cliques with the same prefix in their preferred + identifier, we use the following criteria to sort them: + 1. A clique with a lower information content value will be sorted before + those with a higher information content or no information content at + all. + 1. A clique with more identifiers will be sorted before those with fewer + identifiers. + 1. A clique whose preferred identifier has a smaller numerical suffix will + be sorted before those with a larger numerical suffix. + +## How does Babel choose a preferred label for a clique? + +For most Biolink types, the preferred label for a clique is the label of the preferred identifier. +There is a +[`demote_labels_longer_than`](https://github.com/NCATSTranslator/Babel/blob/738f5e917e910847fac76ab13e847b15cf68b759/config.yaml#L420) +configuration parameter that -- if set -- will cause labels that are longer than the specified +number of characters to be ignored unless no labels shorter than that length are present. This is to +avoid overly long labels when a more concise label is available. + +Biolink types that are chemicals (i.e. 
+[biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/) and its +subclasses) have a special list of +[preferred name boost prefixes](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L416-L426) +that are used to prioritize labels. 
This list is currently: + +1. DRUGBANK +1. DrugCentral +1. CHEBI +1. MESH +1. CHEMBL.COMPOUND +1. GTOPDB +1. HMDB +1. RXCUI +1. PUBCHEM.COMPOUND + +[Conflations](./Conflation.md) are lists of identifiers that are merged in +that order when that conflation is applied. The preferred label for the +conflated clique is therefore the preferred label of the first clique being +conflated. + +## Where do the clique descriptions come from? + +Currently, all descriptions for NodeNorm concepts come from +[Ubergraph](https://github.com/INCATools/ubergraph/). Descriptions are +collected for every identifier within a clique, and the description +associated with the most preferred identifier is then provided for the +preferred identifier. Descriptions are not included in NameRes, but the +`description` flag can be used to include any descriptions when returning +cliques from NodeNorm. + +## What are "information content" values? + +Babel obtains information content values for over 3.8 million concepts from +[Ubergraph](https://github.com/INCATools/ubergraph?tab=readme-ov-file#graph-organization) +based on the number of terms related to the specified term as either a subclass +or any existential relation. They are decimal values that range from 0.0 +(high-level broad term with many subclasses) to 100.0 (very specific term with +no subclasses). + +## Reporting incorrect Babel cliques + +### I've found two or more identifiers in separate cliques that should be considered identical + +Please report this "split" clique as an issue to the +[Babel GitHub repository](https://github.com/NCATSTranslator/Babel/issues). At a +minimum, please include the identifiers (CURIEs) that should +be combined. Links to a NodeNorm instance showing the two cliques are very +helpful. 
Evidence supporting the lumping, such as a link to an external database +that makes it clear that these identifiers refer to the same concept, is also +very helpful: while we have some ability to combine cliques manually if needed +urgently for some application, we prefer to find a source of mappings that would +combine the two identifiers, allowing us to improve cliquing across Babel. + + +### I've found two or more identifiers combined in a single clique that actually identify different concepts + + +Please report this "lumped" clique as an issue to the +[Babel GitHub repository](https://github.com/NCATSTranslator/Babel/issues). At a +minimum, please include the identifiers (CURIEs) that should +be split. A link to a NodeNorm instance showing the lumped clique is very +helpful. Evidence, such as a link to an external database that makes it clear +that these identifiers refer to different concepts, is also very helpful: +we prefer to find the source of the mapping that combined these identifiers +so that it can be corrected, allowing us to improve cliquing across Babel. diff --git a/src/datahandlers/chembl.py b/src/datahandlers/chembl.py index d9ed663a..9bac1a11 100644 --- a/src/datahandlers/chembl.py +++ b/src/datahandlers/chembl.py @@ -83,7 +83,7 @@ def pull_labels(self, ofname): # Sometimes the CHEMBL label is identical to the chemblid. We don't want those (https://github.com/TranslatorSRI/Babel/issues/430). 
if label == chemblid: - continue + label = '' outf.write(f"{CHEMBLCOMPOUND}:{chemblid}\t{label}\n") diff --git a/src/node.py b/src/node.py index 63b80c54..5a5fafe7 100644 --- a/src/node.py +++ b/src/node.py @@ -73,8 +73,13 @@ def load_synonyms(self, prefix): with open(labelfname) as inf: for line in inf: x = line.strip().split("\t") - lbs[x[0]].add(("http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", x[1])) - count_labels += 1 + if len(x) == 1: + lbs[x[0]].add( ('http://www.geneontology.org/formats/oboInOwl#hasExactSynonym', '') ) + elif len(x) == 2: + lbs[x[0]].add(("http://www.geneontology.org/formats/oboInOwl#hasExactSynonym", x[1])) + count_labels += 1 + else: + logger.warning(f"Unexpected number of columns in {labelfname} ({len(x)}), skipping: {line.strip()}") synfname = os.path.join(self.synonym_dir, prefix, "synonyms") if os.path.exists(synfname): with open(synfname) as inf: @@ -504,8 +509,15 @@ def load_extra_labels(self, prefix): if os.path.exists(labelfname): with open(labelfname) as inf: for line in inf: - x = line.strip().split("\t") - lbs[x[0]] = x[1] + x = line.strip().split('\t', maxsplit=1) + if len(x) == 1: + # We have an identifier, but we explicitly don't have a label. + lbs[x[0]] = '' + elif len(x) == 2: + # We have an identifier and a label. + lbs[x[0]] = x[1] + else: + logger.warning(f"bad line in {labelfname}: {line.strip()}") self.extra_labels[prefix] = lbs def apply_labels(self, input_identifiers, labels): diff --git a/tests/conftest.py b/tests/conftest.py index 295a7f3d..36c8b8a6 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -3,9 +3,10 @@ import pytest from src.node import NodeFactory +from src.util import get_config -# Biolink Model version used throughout the test suite. Should match config.yaml. -BIOLINK_VERSION = "4.3.6" +# Biolink Model version derived from config.yaml — the single source of truth. 
+BIOLINK_VERSION = get_config()["biolink_version"]
 
 
 @pytest.fixture(scope="session")
diff --git a/tests/test_node_factory.py b/tests/test_node_factory.py
index bd6cd9b2..86437bf4 100644
--- a/tests/test_node_factory.py
+++ b/tests/test_node_factory.py
@@ -1,7 +1,13 @@
+import os
+
 import pytest
 
 import src.prefixes as pref
 from src.LabeledID import LabeledID
+from src.node import NodeFactory
+from src.util import get_config
+
+BIOLINK_VERSION = get_config()["biolink_version"]
 
 # Node schema (as of the current codebase):
 # {"identifiers": [{"identifier": CURIE, "label": str}, ...], "type": str}
@@ -191,3 +197,35 @@ def test_pubchem_ignore_CID(node_factory):
     # :999 has label "CID1" which is ignored; :111 ("longerlabel") is preferred instead
     assert node["identifiers"][1]["identifier"] == f"{pref.PUBCHEMCOMPOUND}:111"
     assert node["identifiers"][1]["label"] == "longerlabel"
+
+
+@pytest.mark.unit
+def test_load_extra_labels_single_column(tmp_path):
+    """load_extra_labels() must not raise on single-column lines (identifier with no label)."""
+    label_dir = tmp_path / "CHEMBL.COMPOUND"
+    label_dir.mkdir()
+    (label_dir / "labels").write_text(
+        "CHEMBL.COMPOUND:CHEMBL1\tWater\n"
+        "CHEMBL.COMPOUND:CHEMBL2\n"
+    )
+    fac = NodeFactory(str(tmp_path), BIOLINK_VERSION)
+    fac.common_labels = {}
+    fac.load_extra_labels("CHEMBL.COMPOUND")
+    assert fac.extra_labels["CHEMBL.COMPOUND"]["CHEMBL.COMPOUND:CHEMBL1"] == "Water"
+    assert fac.extra_labels["CHEMBL.COMPOUND"]["CHEMBL.COMPOUND:CHEMBL2"] == ""
+
+
+@pytest.mark.unit
+def test_load_extra_labels_tab_in_label(tmp_path):
+    """load_extra_labels() must preserve labels that themselves contain a tab (maxsplit=1)."""
+    label_dir = tmp_path / "CHEMBL.COMPOUND"
+    label_dir.mkdir()
+    (label_dir / "labels").write_text(
+        "CHEMBL.COMPOUND:CHEMBL1\tWater\n"
+        "CHEMBL.COMPOUND:CHEMBL2\n"
+        "CHEMBL.COMPOUND:CHEMBL3\tWater\tbottle\n"
+    )
+    fac = NodeFactory(str(tmp_path), BIOLINK_VERSION)
+    fac.common_labels = {}
+    fac.load_extra_labels("CHEMBL.COMPOUND")
+    assert fac.extra_labels["CHEMBL.COMPOUND"]["CHEMBL.COMPOUND:CHEMBL3"] == "Water\tbottle"
diff --git a/tests/test_synonym_factory.py b/tests/test_synonym_factory.py
new file mode 100644
index 00000000..c08e9ed6
--- /dev/null
+++ b/tests/test_synonym_factory.py
@@ -0,0 +1,30 @@
+from collections import defaultdict
+
+import pytest
+
+from src.node import SynonymFactory
+from src.util import get_config
+
+BIOLINK_VERSION = get_config()["biolink_version"]
+
+HAS_EXACT_SYNONYM = "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"
+
+
+@pytest.mark.unit
+def test_load_synonyms_single_column(tmp_path):
+    """load_synonyms() must handle single-column lines (identifier with no label) without dropping them."""
+    label_dir = tmp_path / "CHEMBL.COMPOUND"
+    label_dir.mkdir()
+    (label_dir / "labels").write_text(
+        "CHEMBL.COMPOUND:CHEMBL1\tWater\n"
+        "CHEMBL.COMPOUND:CHEMBL2\n"
+    )
+    sf = object.__new__(SynonymFactory)
+    sf.synonym_dir = tmp_path
+    sf.synonyms = {}
+    sf.common_synonyms = defaultdict(set)
+
+    sf.load_synonyms("CHEMBL.COMPOUND")
+
+    assert (HAS_EXACT_SYNONYM, "Water") in sf.synonyms["CHEMBL.COMPOUND"]["CHEMBL.COMPOUND:CHEMBL1"]
+    assert (HAS_EXACT_SYNONYM, "") in sf.synonyms["CHEMBL.COMPOUND"]["CHEMBL.COMPOUND:CHEMBL2"]
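
Note on the label-file parsing changed above: the sketch below is a minimal standalone illustration, not code from this patch, of how one line of a `labels` file splits into an identifier and an optional label. The helper name `parse_label_line` is hypothetical. It assumes the same `strip()` followed by `split('\t', maxsplit=1)` sequence that the patched `load_extra_labels()` uses.

```python
def parse_label_line(line: str) -> tuple[str, str]:
    """Split one labels-file line into (identifier, label).

    Hypothetical helper for illustration only; it mirrors the
    strip() + split('\t', maxsplit=1) sequence used in the patch.
    """
    # strip() removes the newline and also any trailing tab, so a line
    # written as "ID\t\n" (empty label) arrives here as just "ID".
    parts = line.strip().split("\t", maxsplit=1)
    identifier = parts[0]
    # With maxsplit=1 there are at most two parts; everything after the
    # first tab, including further tabs, stays in the label.
    label = parts[1] if len(parts) == 2 else ""
    return identifier, label


if __name__ == "__main__":
    for raw in (
        "CHEMBL.COMPOUND:CHEMBL1\tWater\n",
        "CHEMBL.COMPOUND:CHEMBL2\t\n",
        "CHEMBL.COMPOUND:CHEMBL3\tWater\tbottle\n",
    ):
        print(parse_label_line(raw))
```

Because `strip()` also removes a trailing tab, a line written as `ID\t\n`, which is what the `chembl.py` change emits when the label is blanked, reaches the loaders as a single column. With `maxsplit=1` the split returns at most two parts, so the final `else` branch in `load_extra_labels()` would not be expected to trigger.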