NCATSTranslator · gaurav · Sep 18, 2025 · Sep 22, 2025 · Sep 22, 2025 · Sep 23, 2025
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -86,9 +86,9 @@ semantic type plus data collection, reports, exports, and DuckDB.
 - **`snakefiles/`** — Snakemake rule definitions wiring data handlers to compendium creators.
 - **`node.py`** — Core classes: `NodeFactory`, `SynonymFactory`, `DescriptionFactory`,
   `TaxonFactory`, `InformationContentFactory`, `TSVSQLiteLoader`.
-- **`babel_utils.py`** — Download/FTP utilities, state management.
+- **`babel_utils.py`** — Download/FTP utilities, `glom()` (clique merging), `write_compendium()`
+  (compendium builder), state management.
 - **`util.py`** — Logging, config loading, Biolink Model Toolkit (bmt) access.
-- **`make_cliques.py`** — Union-find clique merging logic.
 - **`exporters/`** — Output format handlers (KGX, Parquet, JSONL).
 - **`reports/`**, **`synonyms/`**, **`metadata/`** — Report generation, synonym files, provenance.
 
@@ -100,7 +100,8 @@ semantic type plus data collection, reports, exports, and DuckDB.
 - **Biolink Model** integration via `bmt` — types, valid prefixes, and naming conventions all follow
   the Biolink Model.
 - **Concord files** are the core data structure: tab-separated `CURIE1 \t Relation \t CURIE2`
-  triples expressing cross-references between vocabularies.
+  triples expressing cross-references between vocabularies. The `glom()` function in
+  `babel_utils.py` merges them into equivalence cliques.
 
 ### Biolink Model Usage
 
@@ -128,7 +129,7 @@ identifier that owns them and are not promoted to the first entry.
 
 ### Conflation
 
-Gene+Protein and Drug+Chemical each have dedicated conflation modules (`geneprotein.py`,
+The GeneProtein and DrugChemical conflations each have dedicated modules (`geneprotein.py`,
 `drugchemical.py`) that merge their respective cliques. See `docs/Conflation.md`.
 
 ### Directories at Runtime
@@ -161,3 +162,12 @@ don't miss out on any valid identifiers without very good reason. If you're chan
 identifiers are filtered in one compendium, think about whether that will affect which identifiers
 should be included in the other compendia to prevent any identifiers from being missed or being
 added twice.
+
+## Documentation
+
+When making a significant change, check whether it affects any of the documentation
+files (`docs/*.md`, `*.md`) and update them as needed. Suggest adding
+new documentation files where appropriate.
+
+When writing documentation files, avoid using horizontal rules (`---`) unless necessary --
+section headings are sufficient for dividing up documentation.
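
Note on the CLAUDE.md change above: the new text says `glom()` merges concord triples into equivalence cliques, and the removed `make_cliques.py` entry described that logic as union-find. The sketch below shows, under stated assumptions, what union-find merging of `CURIE1 \t Relation \t CURIE2` lines can look like. It is not the code in `babel_utils.py`: the function name `merge_concords`, the relation names, and the example CURIEs are all hypothetical, and the real `glom()` signature does not appear in this diff.

```python
from collections import defaultdict


def merge_concords(lines):
    """Group CURIEs from concord lines into equivalence cliques.

    Illustrative only: a hypothetical helper, not Babel's glom().
    """
    parent = {}

    def find(curie):
        # Register unseen CURIEs as their own root, then walk up to the root.
        parent.setdefault(curie, curie)
        root = curie
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[curie] != root:
            parent[curie], curie = root, parent[curie]
        return root

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each concord line is a tab-separated triple; the relation is ignored here.
        curie1, _relation, curie2 = line.split("\t")
        root1, root2 = find(curie1), find(curie2)
        if root1 != root2:
            parent[root2] = root1

    # Collect members under their root, then return a deterministic ordering.
    cliques = defaultdict(set)
    for curie in parent:
        cliques[find(curie)].add(curie)
    return sorted(sorted(members) for members in cliques.values())


# Made-up example data, not taken from any real concord file.
example = [
    "CHEBI:1\tskos:exactMatch\tMESH:D1",
    "MESH:D1\tskos:exactMatch\tUNII:X",
    "HP:1\tskos:exactMatch\tMONDO:2",
]
cliques = merge_concords(example)
print(cliques)
```

In this sketch the first two lines share `MESH:D1`, so their three CURIEs end up in one clique, while the third line forms a separate clique. Choosing a preferred identifier within each clique is a separate step that this sketch does not attempt.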
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -28,56 +28,19 @@ us triage and prioritize them correctly.
    would also appreciate if you can include what you expect the tool to return.
    Any other details you can provide, especially anything that will be help us
    replicate the issue, will be very helpful.
-1. After you have reported a bug, helping to triage, prioritize and group it
-   will be very helpful:
-   - We triage issues into one of the
-     [milestones](https://github.com/NCATSTranslator/Babel/milestones):
-     - [Needs investigation](https://github.com/NCATSTranslator/Babel/milestone/12)
-       refers to issues that need to be investigated further -- either to figure
-       out what is causing the issue or to communicate with the user community
-       to understand what should occur.
-     - [Immediate](https://github.com/NCATSTranslator/Babel/milestone/35) need
-       to be fixed immediately. Issues I'm currently working on will be placed
-       here.
-     - [Needed soon](https://github.com/NCATSTranslator/Babel/milestone/30)
-       refers to issues that should be fixed in the next few months: not
-       immediately, but sooner rather than later.
-     - [Needed later](https://github.com/NCATSTranslator/Babel/milestone/31)
-       refers to issues that should be fixed eventually, but are not needed
-       immediately.
-   - We prioritize issues with one of the three priority tags:
-     [Priority: Low](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20Low%22),
-     [Priority: Medium](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20Medium%22),
-     [Priority: High](https://github.com/NCATSTranslator/Babel/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22Priority%3A%20High%22).
-     The idea is that issues with the highest priority will determine which will be
-     investigated/tested first, and which are most likely to move from Needed later/Needed soon into
-     Immediate for working on.
-   - We estimate effort on tasks using a series of
-     ["T-shirt sizes"](https://asana.com/resources/t-shirt-sizing):
-     [Size: XS](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20XS%22),
-     [Size: S](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20S%22),
-     [Size: M](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20M%22),
-     [Size: L](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20L%22),
-     [Size: XL](https://github.com/NCATSTranslator/Babel/issues?q=state%3Aopen%20label%3A%22Size%3A%20XL%22).
-     These are to help distinguish between tasks that are easy to complete (extra small) and those
-     that will require a lot of thinking, programming and testing (extra large).
-   - You can group issues in two ways:
-     - GitHub lets you chose a "parent" issue for each issue, which is useful for issues that are
-       related to each other. We try to build "issues of issues" that group together similar issues
-       that might require similar fixes (e.g.
-       [our issue tracking deprecated identifiers](https://github.com/NCATSTranslator/Babel/issues/93)).
-       If you find an issue related to yours, please feel free to add yours as a child of the
-       existing issue or vice versa.
-     - You can use labels to group similar issues. We don't have a lot of labels
-       for you to choose from, but feel free to add any that make sense!
+1. For guidance on how to assign priority, impact and size fields, group related
+   issues, and track when your issue is likely to be addressed, see
+   [docs/Triage.md](./docs/Triage.md).
 
 ## Contributing source code
 
-Babel is structured around its [Snakemake files](./src/snakefiles), which call
-into its [data handlers](./src/datahandlers) and
-[compendia creators](./src/createcompendia). The heart of its data are its
-concord files, which contain cross-references between different databases. These
-are combined into compendium files and synonyms.
+For an overview of how Babel's source code is organized — including the two-phase pipeline,
+the role of concord files, and the key patterns used throughout the codebase — see
+[docs/Architecture.md](./docs/Architecture.md).
+
+For a detailed guide to the development workflow — including how to obtain prerequisites, build
+individual compendia, and ideas for making the pipeline easier to work with — see
+[docs/Development.md](./docs/Development.md).
 
 We use three linters to check the style of submitted code in GitHub pull
 requests -- don't worry if this is difficult to do at your end, as it is easy to
@@ -96,8 +59,6 @@ fix in a pull request:
 
 ### Contributing tests
 
-TODO
-
 Tests are written using [pytest](https://pytest.org/) and are present in the
 `tests` directory. You can run these tests by running
 `PYTHONPATH=. uv run pytest`.
@@ -106,22 +67,17 @@ Tests are written using [pytest](https://pytest.org/) and are present in the
 [working on that](https://github.com/NCATSTranslator/Babel/issues/602), and if
 you can help get them to pass, that would be great!
 
-### Writing a new concord or compendium
-
-TODO
+### Writing a new concord, compendium, or data source
 
-### Adding a new source of identifiers, synonyms or descriptions
-
-TODO
+See [docs/Architecture.md](./docs/Architecture.md) for an overview of where new code goes,
+and [docs/Development.md](./docs/Development.md) for the development workflow.
 
 ## Want to work on the frontends instead?
 
 Babel has two frontends: the [Node Normalizer] for exposing information about
 cliques, and the [Name Resolver], which lets you search by synonyms or names.
-
--
--
--
+Both of these could use help with issues that are specific to them! Please check
+their GitHub repositories to see what improvements they need.
 
 [babel issue tracker]: https://github.com/NCATSTranslator/Babel/issues/
 [name resolver]: https://github.com/NCATSTranslator/NameResolution

diff --git a/README.md b/README.md
@@ -105,7 +105,7 @@ identifier for any identifier provided.
 
 In addition to returning the preferred identifier and all the secondary
 identifiers for a clique, NodeNorm will also return its Biolink type and
-["information content" score](#what-are-information-content-values), and
+["information content" score](./docs/Understanding.md#what-are-information-content-values), and
 optionally any descriptions we have for these identifiers.
 
 It also includes some endpoints for normalizing an entire TRAPI message and
@@ -129,160 +129,28 @@ You can find out more about NameRes at its
 
 ## Understanding Babel outputs
 
-### How does Babel choose a preferred identifier for a clique?
-
-After determining the equivalent identifiers that belong in a single clique,
-Babel sorts them in the order of CURIE prefixes for that Biolink type as
-determined by the Biolink Model. For example, for a
-[biolink:SmallMolecule](https://biolink.github.io/biolink-model/SmallMolecule/#valid-id-prefixes),
-any CHEBI identifiers will appear first, followed by any UNII identifiers, and
-so on. The first identifier in this list is the preferred identifier for the
-clique.
-
-[Conflations](./docs/Conflation.md) are lists of identifiers that are merged in
-that order when that conflation is applied. The preferred identifier for the
-clique is therefore the preferred identifier of the first clique being
-conflated.
-
-- For GeneProtein conflation, the preferred identifier is a gene.
-- For DrugChemical conflation, Babel uses the
-  [following algorithm](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/src/createcompendia/drugchemical.py#L466-L538):
-    1. We first choose an overall Biolink type for the conflated clique. To do this, we use a
-       ["preferred Biolink type" order](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L32-L50)
-       that can be configured in [config.yaml](./config.yaml) and choose the most preferred Biolink
-       type that is present in the conflated clique.
-    1. We then group the cliques to be conflated by the prefix of their preferred
-       identifier, and sort them based on the preferred prefix order for the
-       chosen Biolink type.
-    1. If there are multiple cliques with the same prefix in their preferred
-       identifier, we use the following criteria to sort them:
-        1. A clique with a lower information content value will be sorted before
-           those with a higher information content or no information content at
-           all.
-        1. A clique with more identifiers are sorted before those with fewer
-           identifiers.
-        1. A clique whose preferred identifier has a smaller numerical suffix will
-           be sorted before those with a larger numerical suffix.
-
-### How does Babel choose a preferred label for a clique?
-
-For most Biolink types, the preferred label for a clique is the label of the
-preferred identifier. There is a
-[`demote_labels_longer_than`](https://github.com/NCATSTranslator/Babel/blob/master/config.yaml#L437)
-configuration parameter that -- if set -- will cause labels that are longer than
-the specified number of characters to be ignored unless no labels shorter than
-that length are present. This is to avoid overly long labels when a more concise
-label is available.
-
-Biolink types that are chemicals (i.e.
-[biolink:ChemicalEntity](https://biolink.github.io/biolink-model/ChemicalEntity/) and its
-subclasses) have a special list of
-[preferred name boost prefixes](https://github.com/NCATSTranslator/Babel/blob/f3ff2103e74bc9b6bee9483355206b32e8f9ae9b/config.yaml#L416-L426)
-that are used to prioritize labels. This list is currently:
-
-1. DRUGBANK
-1. DrugCentral
-1. CHEBI
-1. MESH
-1. CHEMBL.COMPOUND
-1. GTOPDB
-1. HMDB
-1. RXCUI
-1. PUBCHEM.COMPOUND
-
-[Conflations](./docs/Conflation.md) are lists of identifiers that are merged in
-that order when that conflation is applied. The preferred label for the
-conflated clique is therefore the preferred label of the first clique being
-conflated.
-
-## Where do the clique descriptions come from?
-
-Currently, all descriptions for NodeNorm concepts come from
-[UberGraph](https://github.com/INCATools/ubergraph/). You will note that
-descriptions are collected for every identifier within a clique, and then the
-description associated with the most preferred identifier is provided for the
-preferred identifier. Descriptions are not included in NameRes, but the
-`description` flag can be used to include any descriptions when returning
-cliques from NodeNorm.
-
-### What are "information content" values?
-
-Babel obtains information content values for over 3.8 million concepts from
-[Ubergraph](https://github.com/INCATools/ubergraph?tab=readme-ov-file#graph-organization)
-based on the number of terms related to the specified term as either a subclass
-or any existential relation. They are decimal values that range from 0.0
-(high-level broad term with many subclasses) to 100.0 (very specific term with
-no subclasses).
-
-## Reporting incorrect Babel cliques
-
-### I've found two or more identifiers in separate cliques that should be considered identical
-
-Please report this "split" clique as an issue to the
-[Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues). At a
-minimum, please include the identifiers (CURIEs) for the identifiers that should
-be combined. Links to a NodeNorm instance showing the two cliques are very
-helpful. Evidence supporting the lumping, such as a link to an external database
-that makes it clear that these identifiers refer to the same concept, are also
-very helpful: while we have some ability to combine cliques manually if needed
-urgently for some application, we prefer to find a source of mappings that would
-combine the two identifiers, allowing us to improve cliquing across Babel.
-
-<!-- rumdl-disable MD013 -->
-### I've found two or more identifiers combined in a single clique that actually identify different concepts
-<!-- rumdl-enable MD013 -->
-
-Please report this "lumped" clique as an issue to the
-[Babel GitHub repository](https://github.com/TranslatorSRI/Babel/issues). At a
-minimum, please include the identifiers (CURIEs) for the identifiers that should
-be split. Links to a NodeNorm instance showing the lumped clique is very
-helpful. Evidence, such as a link to an external database that makes it clear
-that these identifiers refer to the same concept, are also very helpful: while
-we have some ability to combine cliques manually if needed urgently for some
-application, we prefer to find a source of mappings that would combine the two
-identifiers, allowing us to improve cliquing across Babel.
+For a detailed explanation of how Babel constructs cliques, chooses preferred identifiers and
+labels, sources descriptions, and obtains information content values, as well as guidance on
+reporting incorrect cliques, see [docs/Understanding.md](./docs/Understanding.md).
 
 ## Running Babel
 
 ### How can I run Babel?
 
-Babel is difficult to run, primarily because of its inefficient memory handling
--- we currently need around 500G of memory to build the largest compendia
-(Protein and DrugChemical conflated information), although the smaller compendia
-should be buildable with far less memory. We are working on reducing these
-restrictions as far as possible. You can read more about
-[Babel's build process](docs/RunningBabel.md), and please do contact us if you
-run into any problems or would like some assistance.
-
-We have [detailed instructions for running Babel](docs/RunningBabel.md), but the
-short version is:
-
-- We use [uv](https://docs.astral.sh/uv/) to manage Python dependencies. You can
-  use the
-  [Docker image](https://github.com/NCATSTranslator/Babel/pkgs/container/babel)
-  if you run into any difficulty setting up the prerequisites.
-- We use [Snakemake](https://snakemake.github.io/) to handle the dependency
-  management.
-
-Therefore, you should be able to run Babel by cloning this repository and
-running:
-
-```shell
-$ uv run snakemake --cores [NUMBER OF CORES TO USE]
-```
-
-The number of cores can be specified as `all` in order to use all available cores on your machine.
-
-The [./slurm/run-babel-on-slurm.sh](./slurm/run-babel-on-slurm.sh) Bash script
-can be used to start running Babel as a Slurm job. You can set the BABEL_VERSION
-environment variable to document which version of Babel you are running.
+Babel requires significant memory — around 500 GB to build the largest compendia (Protein and
+DrugChemical conflated), though smaller compendia need far less. It uses
+[uv](https://docs.astral.sh/uv/) for Python dependency management and
+[Snakemake](https://snakemake.github.io/) for build orchestration. See
+[docs/RunningBabel.md](docs/RunningBabel.md) for detailed instructions, configuration, and
+Slurm job setup.
 
 ## Contributing to Babel
 
 If you want to contribute to Babel, start with the
 [Contributing to Babel](./CONTRIBUTING.md) documentation. This will provide
 guidance on how the source code is organized, what contributions are most
-useful, and how to run the tests.
+useful, and how to run the tests. For a deeper look at the development
+workflow and ideas for improving it, see [Developing Babel](./docs/Development.md).
 
 ## Contact information