Draft
34 commits
23821d4
We still add IDs to the labels file, but blank duplicate labels.
gaurav Sep 18, 2025
eed6d9e
Added check for malformed label file.
gaurav Sep 22, 2025
7e66c59
Added a maxsplit.
gaurav Sep 22, 2025
c158239
Support label files with unlabeled identifiers.
gaurav Sep 23, 2025
33a5913
Added support for blank labels and synonyms.
gaurav Sep 23, 2025
b072b3a
Add Triage.md documenting sprint-based issue triage process
gaurav Mar 6, 2026
bce6a40
Add docs/Development.md describing the development workflow and impro…
gaurav Mar 5, 2026
dd68022
Fix MD032: add blank line before list (rumdl fmt)
gaurav Mar 5, 2026
9aa2a53
Potential fix for pull request finding
gaurav Mar 15, 2026
3340074
Potential fix for pull request finding
gaurav Mar 15, 2026
a451710
Consolidate triage/dev docs and fix cross-references
gaurav Mar 15, 2026
dcfbe9c
Reorganize docs: create Understanding.md and Architecture.md, slim RE…
gaurav Mar 15, 2026
1443080
Potential fix for pull request finding
gaurav Mar 15, 2026
c1a15ba
Potential fix for pull request finding
gaurav Mar 15, 2026
277d33c
Potential fix for pull request finding
gaurav Mar 15, 2026
12c08e4
Replaced a branch-based path to a commit-based path.
gaurav Mar 15, 2026
1f7a5de
Fixed Markdown issues using rumdl.
gaurav Mar 15, 2026
dc47eb2
Improved docs/Architecture.md.
gaurav Mar 15, 2026
ca86ccd
Fix incorrect make_cliques.py reference in Architecture.md and CLAUDE.md
gaurav Mar 15, 2026
6ea01e4
Removed unnecessary horizontal rule.
gaurav Mar 15, 2026
ba1bef8
Apply suggestion from @gaurav
gaurav Mar 15, 2026
7a4cc78
Gave Claude some advice on documentation.
gaurav Mar 15, 2026
b7f861f
Some improvements.
gaurav Mar 15, 2026
ba077b7
Replaced Gene+Protein and Drug+Chemical conflations with proper names.
gaurav Mar 15, 2026
398b0a9
Improved some documentation.
gaurav Mar 15, 2026
6f1eec8
Fixed some Markdown issues.
gaurav Mar 15, 2026
05f2f8a
Improved triage documentation.
gaurav Mar 15, 2026
616264a
Fixed Markdown issues.
gaurav Mar 15, 2026
f6a54d9
Improved documentation a bit.
gaurav Mar 15, 2026
11e606f
Reorganize documentation, including updated issue triage information …
gaurav Mar 15, 2026
70dcc2a
Merge branch 'master' into get-test-suite-working-again
gaurav Mar 15, 2026
109059f
Merge branch 'get-test-suite-working-again' into fix-chembl-issue-584
gaurav Mar 15, 2026
d84c76b
Add unit tests for two of the NodeFactory/SynonymFactory label-loadin…
gaurav Mar 15, 2026
7acee1d
Merge branch 'add-pipeline-tests-for-shared-identifiers' into fix-che…
gaurav Mar 15, 2026
18 changes: 14 additions & 4 deletions CLAUDE.md
@@ -86,9 +86,9 @@ semantic type plus data collection, reports, exports, and DuckDB.
- **`snakefiles/`** — Snakemake rule definitions wiring data handlers to compendium creators.
- **`node.py`** — Core classes: `NodeFactory`, `SynonymFactory`, `DescriptionFactory`,
`TaxonFactory`, `InformationContentFactory`, `TSVSQLiteLoader`.
- **`babel_utils.py`** — Download/FTP utilities, state management.
- **`babel_utils.py`** — Download/FTP utilities, `glom()` (clique merging), `write_compendium()`
(compendium builder), state management.
- **`util.py`** — Logging, config loading, Biolink Model Toolkit (bmt) access.
- **`make_cliques.py`** — Union-find clique merging logic.
- **`exporters/`** — Output format handlers (KGX, Parquet, JSONL).
- **`reports/`**, **`synonyms/`**, **`metadata/`** — Report generation, synonym files, provenance.

@@ -100,7 +100,8 @@ semantic type plus data collection, reports, exports, and DuckDB.
- **Biolink Model** integration via `bmt` — types, valid prefixes, and naming conventions all follow
the Biolink Model.
- **Concord files** are the core data structure: tab-separated `CURIE1 \t Relation \t CURIE2`
triples expressing cross-references between vocabularies.
triples expressing cross-references between vocabularies. The `glom()` function in
`babel_utils.py` merges them into equivalence cliques.
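
To illustrate what merging concord triples into equivalence cliques involves, here is a minimal Python sketch built on a union-find structure. It is not Babel's `glom()`: the function name `merge_concord_triples`, its signature, and the choice of union-find are assumptions made for this example, and the CURIEs in the usage lines are placeholders.

```python
from collections import defaultdict


def merge_concord_triples(triples):
    """Group CURIEs into equivalence cliques (illustrative, not Babel's glom()).

    `triples` is an iterable of (curie1, relation, curie2) tuples, as would be
    read from a tab-separated concord file.
    """
    parent = {}

    def find(curie):
        # Register unseen CURIEs as their own root, then walk up to the root,
        # halving the path as we go.
        parent.setdefault(curie, curie)
        while parent[curie] != curie:
            parent[curie] = parent[parent[curie]]
            curie = parent[curie]
        return curie

    for curie1, _relation, curie2 in triples:
        root1, root2 = find(curie1), find(curie2)
        if root1 != root2:
            parent[root2] = root1  # merge the two cliques

    cliques = defaultdict(set)
    for curie in list(parent):
        cliques[find(curie)].add(curie)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in cliques.values())


# Placeholder CURIEs, parsed the way a concord line would be split on tabs.
lines = ["A:1\tskos:exactMatch\tB:2", "B:2\tskos:exactMatch\tC:3", "D:4\tskos:exactMatch\tE:5"]
example_cliques = merge_concord_triples(line.split("\t") for line in lines)
```

Two CURIEs end up in the same clique whenever a chain of triples connects them, even if no single concord file lists them together.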

### Biolink Model Usage

@@ -128,7 +129,7 @@ identifier that owns them and are not promoted to the first entry.

### Conflation

Gene+Protein and Drug+Chemical each have dedicated conflation modules (`geneprotein.py`,
GeneProtein and DrugChemical conflation each have dedicated conflation modules (`geneprotein.py`,
`drugchemical.py`) that merge their respective cliques. See `docs/Conflation.md`.

### Directories at Runtime
@@ -161,3 +162,12 @@ don't miss out on any valid identifiers without very good reason. If you're chan
identifiers are filtered in one compendium, think about whether that will affect which identifiers
should be included in the other compendia to prevent any identifiers from being missed or being
added twice.

## Documentation

When making a significant change, check if it affects any of the documentation
files (`docs/*.md`, `*.md`) and update them if necessary. Suggest adding
new documentation files if necessary.

When writing documentation files, avoid using horizontal rules unless necessary --
section headings are sufficient for dividing up documentation.
74 changes: 15 additions & 59 deletions CONTRIBUTING.md
@@ -28,56 +28,19 @@ us triage and prioritize them correctly.
would also appreciate if you can include what you expect the tool to return.
Any other details you can provide, especially anything that will help us
replicate the issue, will be very helpful.
1. After you have reported a bug, helping to triage, prioritize and group it
will be very helpful:
- We triage issues into one of the
[milestones](https://github.com/NCATSTranslator/Babel/milestones):
- [Needs investigation](https://github.com/NCATSTranslator/Babel/milestone/12)
refers to issues that need to be investigated further -- either to figure
out what is causing the issue or to communicate with the user community
to understand what should occur.
- [Immediate](https://github.com/NCATSTranslator/Babel/milestone/35) refers
to issues that need to be fixed immediately. Issues I'm currently working
on will be placed here.
- [Needed soon](https://github.com/NCATSTranslator/Babel/milestone/30)
refers to issues that should be fixed in the next few months: not
immediately, but sooner rather than later.
- [Needed later](https://github.com/NCATSTranslator/Babel/milestone/31)
refers to issues that should be fixed eventually, but are not needed
immediately.
- We prioritize issues with one of the three priority tags:
[Priority: Low](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20Low%22),
[Priority: Medium](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20Medium%22),
[Priority: High](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20High%22).
Priority determines which issues will be investigated/tested first, and which are
most likely to move from Needed later/Needed soon into Immediate for working on.
- We estimate effort on tasks using a series of
["T-shirt sizes"](https://asana.com/resources/t-shirt-sizing):
[Size: XS](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20XS%22),
[Size: S](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20S%22),
[Size: M](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20M%22),
[Size: L](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20L%22),
[Size: XL](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20XL%22).
These are to help distinguish between tasks that are easy to complete (extra small) and those
that will require a lot of thinking, programming and testing (extra large).
- You can group issues in two ways:
- GitHub lets you choose a "parent" issue for each issue, which is useful for issues that are
related to each other. We try to build "issues of issues" that group together similar issues
that might require similar fixes (e.g.
[our issue tracking deprecated identifiers](https://github.com/NCATSTranslator/Babel/issues/93)).
If you find an issue related to yours, please feel free to add yours as a child of the
existing issue or vice versa.
- You can use labels to group similar issues. We don't have a lot of labels
for you to choose from, but feel free to add any that make sense!
1. For guidance on how to assign priority, impact and size fields, group related
issues, and track when your issue is likely to be addressed, see
[docs/Triage.md](./docs/Triage.md).

## Contributing source code

Babel is structured around its [Snakemake files](./src/snakefiles), which call
into its [data handlers](./src/datahandlers) and
[compendia creators](./src/createcompendia). The heart of its data are its
concord files, which contain cross-references between different databases. These
are combined into compendium files and synonyms.
For an overview of how Babel's source code is organized — including the two-phase pipeline,
the role of concord files, and the key patterns used throughout the codebase — see
[docs/Architecture.md](./docs/Architecture.md).

For a detailed guide to the development workflow — including how to obtain prerequisites, build
individual compendia, and ideas for making the pipeline easier to work with — see
[docs/Development.md](./docs/Development.md).

We use three linters to check the style of submitted code in GitHub pull
requests -- don't worry if this is difficult to do at your end, as it is easy to
@@ -96,8 +59,6 @@ fix in a pull request:

### Contributing tests

TODO

Tests are written using [pytest](https://pytest.org/) and are present in the
`tests` directory. You can run these tests by running
`PYTHONPATH=. uv run pytest`.
@@ -106,22 +67,17 @@ Tests are written using [pytest](https://pytest.org/) and are present in the
[working on that](https://github.com/NCATSTranslator/Babel/issues/602), and if
you can help get them to pass, that would be great!

### Writing a new concord or compendium

TODO
### Writing a new concord, compendium, or data source

### Adding a new source of identifiers, synonyms or descriptions

TODO
See [docs/Architecture.md](./docs/Architecture.md) for an overview of where new code goes,
and [docs/Development.md](./docs/Development.md) for the development workflow.

## Want to work on the frontends instead?

Babel has two frontends: the [Node Normalizer] for exposing information about
cliques, and the [Name Resolver], which lets you search by synonyms or names.

-
-
-
Both of these could use help with issues that are specific to them! Please check
their GitHub repositories to see what improvements they need.

[babel issue tracker]: https://github.com/NCATSTranslator/Babel/issues/
[name resolver]: https://github.com/NCATSTranslator/NameResolution
156 changes: 12 additions & 144 deletions README.md
@@ -105,7 +105,7 @@ identifier for any identifier provided.

In addition to returning the preferred identifier and all the secondary
identifiers for a clique, NodeNorm will also return its Biolink type and
["information content" score](#what-are-information-content-values), and
["information content" score](./docs/Understanding.md#what-are-information-content-values), and
optionally any descriptions we have for these identifiers.

It also includes some endpoints for normalizing an entire TRAPI message and
@@ -129,160 +129,28 @@ You can find out more about NameRes at its

## Understanding Babel outputs

### How does Babel choose a preferred identifier for a clique?

After determining the equivalent identifiers that belong in a single clique,
Babel sorts them in the order of CURIE prefixes for that Biolink type as
determined by the Biolink Model. For example, for a
[biolink:SmallMolecule](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes),
any CHEBI identifiers will appear first, followed by any UNII identifiers, and
so on. The first identifier in this list is the preferred identifier for the
clique.
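
The ordering described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Babel's code: `choose_preferred_identifier` is a hypothetical helper, the prefix list is passed in by hand instead of being read from the Biolink Model, and the identifiers are placeholders.

```python
def curie_prefix(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def choose_preferred_identifier(curies, prefix_order):
    """Sort CURIEs by prefix order; return (preferred, sorted_curies).

    Hypothetical helper: prefixes missing from `prefix_order` sort last, and
    identifiers sharing a prefix keep their input order (sorted() is stable).
    """
    rank = {prefix: position for position, prefix in enumerate(prefix_order)}
    ordered = sorted(curies, key=lambda c: rank.get(curie_prefix(c), len(rank)))
    return ordered[0], ordered


# Assumed prefix order for the example: CHEBI before UNII, as in the text.
prefix_order = ["CHEBI", "UNII"]
clique = ["OTHER:9", "UNII:PLACEHOLDER", "CHEBI:0000"]
preferred, ordered = choose_preferred_identifier(clique, prefix_order)
```

Here `preferred` is the CHEBI placeholder because CHEBI comes first in the assumed prefix order, and the identifier with an unlisted prefix falls to the end.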

[Conflations](./docs/Conflation.md) are lists of identifiers that are merged in
that order when that conflation is applied. The preferred identifier for the
clique is therefore the preferred identifier of the first clique being
conflated.

- For GeneProtein conflation, the preferred identifier is a gene.
- For DrugChemical conflation, Babel uses the
[following algorithm](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/src/createcompendia/drugchemical.py#L466-L538):
1. We first choose an overall Biolink type for the conflated clique. To do this, we use a
["preferred Biolink type" order](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L32-L50)
that can be configured in [config.yaml](./config.yaml) and choose the most preferred Biolink
type that is present in the conflated clique.
1. We then group the cliques to be conflated by the prefix of their preferred
identifier, and sort them based on the preferred prefix order for the
chosen Biolink type.
1. If there are multiple cliques with the same prefix in their preferred
identifier, we use the following criteria to sort them:
1. A clique with a lower information content value will be sorted before
those with a higher information content or no information content at
all.
1. A clique with more identifiers is sorted before those with fewer
identifiers.
1. A clique whose preferred identifier has a smaller numerical suffix will
be sorted before those with a larger numerical suffix.
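
The three tie-breaking criteria for cliques that share a prefix map naturally onto a tuple sort key. The sketch below is one possible reading of those criteria, not the code linked above: the dictionary keys (`preferred`, `identifiers`, `ic`) and the function names are hypothetical, and the treatment of identifiers with no numerical suffix (sorted last) is an assumption.

```python
import math
import re


def numerical_suffix(curie):
    """Return the trailing integer of a CURIE, or infinity if there is none."""
    match = re.search(r"(\d+)$", curie)
    return int(match.group(1)) if match else math.inf


def tie_break_key(clique):
    """Sort key for cliques whose preferred identifiers share a prefix.

    `clique` is a hypothetical dict with keys "preferred", "identifiers" and
    "ic" (information content, or None when unknown).
    """
    ic = clique["ic"]
    return (
        ic if ic is not None else math.inf,  # lower IC first, missing IC last
        -len(clique["identifiers"]),         # more identifiers first
        numerical_suffix(clique["preferred"]),  # smaller suffix first
    )


cliques = [
    {"preferred": "X:30", "identifiers": ["X:30"], "ic": None},
    {"preferred": "X:20", "identifiers": ["X:20", "Y:1"], "ic": 50.0},
    {"preferred": "X:10", "identifiers": ["X:10"], "ic": 50.0},
    {"preferred": "X:7", "identifiers": ["X:7"], "ic": 50.0},
    {"preferred": "X:5", "identifiers": ["X:5"], "ic": 80.0},
]
sorted_preferred = [c["preferred"] for c in sorted(cliques, key=tie_break_key)]
```

Because Python compares tuples element by element, each later criterion only matters when all earlier ones tie.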

### How does Babel choose a preferred label for a clique?

For most Biolink types, the preferred label for a clique is the label of the
preferred identifier. There is a
[`demote_labels_longer_than`](https://github.com/NCATSTranslator/Babel/blob/master/config.yaml#L437)
configuration parameter that -- if set -- will cause labels that are longer than
the specified number of characters to be ignored unless no labels shorter than
that length are present. This is to avoid overly long labels when a more concise
label is available.
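
As a rough sketch of how a length cutoff like `demote_labels_longer_than` could behave, the following Python function picks the first label that fits within the limit and falls back to the first label overall when none do. The function name, its arguments, and the exact comparison (`<=`) are assumptions for illustration; Babel's actual handling may differ in detail.

```python
def choose_preferred_label(labels, demote_labels_longer_than=None):
    """Pick a label from `labels`, which are ordered by identifier preference.

    Hypothetical helper: labels longer than the limit are skipped unless
    every available label exceeds it. Blank labels are ignored.
    """
    candidates = [label for label in labels if label]
    if demote_labels_longer_than is not None:
        short_enough = [
            label for label in candidates
            if len(label) <= demote_labels_longer_than
        ]
        if short_enough:
            candidates = short_enough
    return candidates[0] if candidates else None


labels = ["an extremely long systematic name for a compound", "short name"]
with_limit = choose_preferred_label(labels, demote_labels_longer_than=15)
without_limit = choose_preferred_label(labels)
all_too_long = choose_preferred_label(labels, demote_labels_longer_than=5)
```

With the limit set, the shorter label wins even though it belongs to a less preferred identifier. When every label is too long, the most preferred one is still returned rather than nothing.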

Biolink types that are chemicals (i.e.
[biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/) and its
subclasses) have a special list of
[preferred name boost prefixes](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L416-L426)
that are used to prioritize labels. This list is currently:

1. DRUGBANK
1. DrugCentral
1. CHEBI
1. MESH
1. CHEMBL.COMPOUND
1. GTOPDB
1. HMDB
1. RXCUI
1. PUBCHEM.COMPOUND

[Conflations](./docs/Conflation.md) are lists of identifiers that are merged in
that order when that conflation is applied. The preferred label for the
conflated clique is therefore the preferred label of the first clique being
conflated.

### Where do the clique descriptions come from?

Currently, all descriptions for NodeNorm concepts come from
[Ubergraph](https://github.com/INCATools/ubergraph/). You will note that
descriptions are collected for every identifier within a clique, and then the
description associated with the most preferred identifier is provided for the
preferred identifier. Descriptions are not included in NameRes, but the
`description` flag can be used to include any descriptions when returning
cliques from NodeNorm.

### What are "information content" values?

Babel obtains information content values for over 3.8 million concepts from
[Ubergraph](https://github.com/INCATools/ubergraph?tab=readme-ov-file#graph-organization)
based on the number of terms related to the specified term as either a subclass
or any existential relation. They are decimal values that range from 0.0
(high-level broad term with many subclasses) to 100.0 (very specific term with
no subclasses).

## Reporting incorrect Babel cliques

### I've found two or more identifiers in separate cliques that should be considered identical

Please report this "split" clique as an issue to the
[Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues). At a
minimum, please include the identifiers (CURIEs) that should be combined. Links
to a NodeNorm instance showing the two cliques are very helpful. Evidence
supporting the lumping, such as a link to an external database that makes it
clear that these identifiers refer to the same concept, is also very helpful:
while we have some ability to combine cliques manually if needed urgently for
some application, we prefer to find a source of mappings that would combine the
two identifiers, allowing us to improve cliquing across Babel.

<!-- rumdl-disable MD013 -->
### I've found two or more identifiers combined in a single clique that actually identify different concepts
<!-- rumdl-enable MD013 -->

Please report this "lumped" clique as an issue to the
[Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues). At a
minimum, please include the identifiers (CURIEs) that should be split. Links to
a NodeNorm instance showing the lumped clique are very helpful. Evidence, such
as a link to an external database that makes it clear that these identifiers
refer to different concepts, is also very helpful: while
we have some ability to combine cliques manually if needed urgently for some
application, we prefer to find a source of mappings that would combine the two
identifiers, allowing us to improve cliquing across Babel.
For a detailed explanation of how Babel constructs cliques, chooses preferred identifiers and
labels, sources descriptions, and calculates information content values — as well as guidance on
reporting incorrect cliques — see [docs/Understanding.md](./docs/Understanding.md).

## Running Babel

### How can I run Babel?

Babel is difficult to run, primarily because of its inefficient memory handling
-- we currently need around 500G of memory to build the largest compendia
(Protein and DrugChemical conflated information), although the smaller compendia
should be buildable with far less memory. We are working on reducing these
restrictions as far as possible. You can read more about
[Babel's build process](docs/RunningBabel.md), and please do contact us if you
run into any problems or would like some assistance.

We have [detailed instructions for running Babel](docs/RunningBabel.md), but the
short version is:

- We use [uv](https://docs.astral.sh/uv/) to manage Python dependencies. You can
use the
[Docker image](https://github.com/NCATSTranslator/Babel/pkgs/container/babel)
if you run into any difficulty setting up the prerequisites.
- We use [Snakemake](https://snakemake.github.io/) to handle the dependency
management.

Therefore, you should be able to run Babel by cloning this repository and
running:

```shell
$ uv run snakemake --cores [NUMBER OF CORES TO USE]
```

The number of cores can be specified as `all` in order to use all available cores on your machine.

The [./slurm/run-babel-on-slurm.sh](./slurm/run-babel-on-slurm.sh) Bash script
can be used to start running Babel as a Slurm job. You can set the BABEL_VERSION
environment variable to document which version of Babel you are running.
Babel requires significant memory — around 500 GB to build the largest compendia (Protein and
DrugChemical conflated), though smaller compendia need far less. It uses
[uv](https://docs.astral.sh/uv/) for Python dependency management and
[Snakemake](https://snakemake.github.io/) for build orchestration. See
[docs/RunningBabel.md](docs/RunningBabel.md) for detailed instructions, configuration, and
Slurm job setup.

## Contributing to Babel

If you want to contribute to Babel, start with the
[Contributing to Babel](./CONTRIBUTING.md) documentation. This will provide
guidance on how the source code is organized, what contributions are most
useful, and how to run the tests.
useful, and how to run the tests. For a deeper look at the development
workflow and ideas for improving it, see [Developing Babel](./docs/Development.md).

## Contact information
