Data formats

Babel outputs use three custom formats: compendia files, synonym files, and conflation files.

Compendia files

Compendia files are JSON Lines (JSONL) files in the compendia/ directory. Each line consists of a single "clique" -- a set of identifiers that Babel believes represent the same concept. Here is an example from compendia/Gene.txt for the glucose-6-phosphatase catalytic subunit 1 (G6PC1) gene, pretty-printed for readability (in the file, each clique occupies a single line).

{
  "ic": "100",
  "identifiers": [
    {
      "i": "NCBIGene:2538",
      "l": "G6PC1",
      "d": [],
      "t": [
        "NCBITaxon:9606"
      ]
    },
    {
      "i": "ENSEMBL:ENSG00000131482",
      "l": "G6PC1 (Hsap)",
      "d": [],
      "t": []
    },
    {
      "i": "HGNC:4056",
      "l": "G6PC1",
      "d": [],
      "t": []
    },
    {
      "i": "OMIM:613742",
      "d": [],
      "t": []
    },
    {
      "i": "UMLS:C1414892",
      "l": "G6PC1 gene",
      "d": [],
      "t": []
    }
  ],
  "preferred_name": "G6PC1",
  "taxa": [
    "NCBITaxon:9606"
  ],
  "type": "biolink:Gene"
}

This entry consists of the following fields:

| Field | Value | Meaning |
| --- | --- | --- |
| ic | 100 | Information content value (see Understanding.md). This is a decimal value that ranges from 0.0 (high-level broad term with many subclasses) to 100.0 (very specific term with no subclasses). |
| identifiers | See below | A list of identifiers for this clique. This is arranged in the same order as the valid ID prefixes for this type in the Biolink Model, e.g. starting with NCBIGene and ENSEMBL for biolink:Gene. |
| identifiers[0].i | NCBIGene:2538 | A CURIE representing this identifier. You can use the Biolink Model prefixmap to expand this into a full concept IRI. |
| identifiers[0].l | G6PC1 | A label for this identifier. This will almost always be from the source of the CURIE (in this case, the label is from the NCBI Gene database). |
| identifiers[0].d | (blank in this example, but usually 1-3 sentences) | A description of this identifier or concept from this source. |
| identifiers[0].t | ["NCBITaxon:9606"] | A list of taxa that this concept is found in, as NCBITaxon CURIEs. NCBITaxon:9606 refers to the species Homo sapiens. |
| preferred_name | G6PC1 | The preferred name for this clique. This is not currently used by NodeNorm, but will be in the future. |
| taxa | ["NCBITaxon:9606"] | A list of taxa that this concept is found in, as NCBITaxon CURIEs. This is combined from the individual taxa of each identifier. |
| descriptions | (blank in this example, but usually a list of descriptions) | A list of descriptions, created by combining the descriptions from all the identifiers. |
| type | biolink:Gene | The Biolink type of this concept. Must be a class from the Biolink Model with a biolink: prefix. |

The first identifier in the identifiers list is considered the "clique leader" or "preferred ID" for the clique. When any identifier in the clique is normalized, the clique leader is used to represent the entire clique. The preferred name is not necessarily the label of the clique leader -- another name may be chosen to clarify the meaning of the clique or to provide a better label for display in the Translator UI.
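As a minimal sketch of how the clique leader can be used for normalization, the following Python builds an in-memory lookup from every identifier in a clique to that clique's leader. It assumes each line parses into the structure shown above. The function name build_leader_index is hypothetical and is not part of Babel or NodeNorm, and the sample line is abbreviated from the G6PC1 example.

```python
import json


def build_leader_index(lines):
    """Map every identifier CURIE to the leader (first identifier) of its clique."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        clique = json.loads(line)
        identifiers = clique["identifiers"]
        leader = identifiers[0]["i"]  # the first identifier is the clique leader
        for identifier in identifiers:
            index[identifier["i"]] = leader
    return index


# One compendium line, abbreviated from the G6PC1 example above.
sample_line = json.dumps({
    "identifiers": [
        {"i": "NCBIGene:2538", "l": "G6PC1", "d": [], "t": ["NCBITaxon:9606"]},
        {"i": "ENSEMBL:ENSG00000131482", "l": "G6PC1 (Hsap)", "d": [], "t": []},
        {"i": "HGNC:4056", "l": "G6PC1", "d": [], "t": []},
    ],
    "preferred_name": "G6PC1",
    "taxa": ["NCBITaxon:9606"],
    "type": "biolink:Gene",
})

leader_index = build_leader_index([sample_line])
print(leader_index["HGNC:4056"])
```

For a real compendium, the same function could be passed an open file handle, since iterating over a file yields one line at a time. Holding a full compendium in a dictionary may need a lot of memory.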

Synonym files

Synonym files are JSONL files, where each entry is a JSON document describing a concept and all its synonyms.

{
  "clique_identifier_count": 5,
  "curie": "NCBIGene:2538",
  "curie_suffix": 2538,
  "names": [
    "GSD1",
    "G6PC",
    "G6PT",
    "GSD1a",
    "G6PC1",
    "G6Pase",
    "G-6-Pase",
    "G6PC gene",
    "G6PC1 gene",
    "G6Pase-alpha",
    "G6PC1 (Hsap)",
    "G6PT, FORMERLY",
    "glucose-6-phosphatase alpha",
    "GLUCOSE-6-PHOSPHATASE, CATALYTIC",
    "GLUCOSE-6-PHOSPHATASE, CATALYTIC, 1",
    "glucose-6-phosphatase catalytic subunit",
    "glucose-6-phosphatase catalytic subunit 1",
    "glycogen storage disease type I, von Gierke disease",
    "glucose-6-phosphatase, catalytic (glycogen storage disease type I, von Gierke disease)"
  ],
  "preferred_name": "G6PC1",
  "shortest_name_length": 4,
  "taxa": [
    "NCBITaxon:9606"
  ],
  "types": [
    "Gene",
    "GeneOrGeneProduct",
    "GenomicEntity",
    "ChemicalEntityOrGeneOrGeneProduct",
    "PhysicalEssence",
    "OntologyClass",
    "BiologicalEntity",
    "ThingWithTaxon",
    "NamedThing",
    "Entity",
    "PhysicalEssenceOrOccurrent",
    "MacromolecularMachineMixin"
  ]
}

This entry consists of the following fields:

| Field | Value | Meaning |
| --- | --- | --- |
| clique_identifier_count | 5 | The number of identifiers in the corresponding clique (i.e. for NCBIGene:2538). |
| curie | NCBIGene:2538 | The CURIE for this entry. Note that the equivalent identifiers are not included. |
| curie_suffix | 2538 | If the CURIE suffix is completely numerical, it will be stored in this field as a number. This is used to sort search results, with lower CURIE suffixes appearing first. |
| names | ["GSD1", "G6PC", ...] | A list of synonyms for this concept. It is usually arranged from shortest to longest, except for conflated cliques, which list all the synonyms for the first identifier, followed by all the synonyms for the second identifier, and so on. |
| preferred_name | G6PC1 | The preferred name for this clique. |
| shortest_name_length | 4 | The length of the shortest synonym in the names list, used to sort results by shortest name. |
| taxa | ["NCBITaxon:9606"] | The list of taxa that this concept is found in. This should be identical to the entry in the corresponding Compendia file. |
| types | ["Gene", "GeneOrGeneProduct", ...] | A list of Biolink types (without the biolink: prefix) for this concept. This is arranged in the order provided by the Biolink Model Toolkit, starting with the narrowest concept, expanding to the broadest, followed by mixins. |

Note that the synonym files are generated with DrugChemical conflation turned on, but GeneProtein conflation turned off.
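The curie_suffix and shortest_name_length fields follow from the curie and names fields as described in the table above. The sketch below recomputes them, which can be handy for sanity-checking an entry. The function name derived_fields is hypothetical, and this only illustrates the field definitions; it is not Babel's own implementation.

```python
def derived_fields(curie, names):
    """Recompute the fields that are derived from a synonym entry's CURIE and names."""
    # Everything after the first colon is the CURIE suffix.
    suffix = curie.split(":", 1)[1]
    fields = {"shortest_name_length": min(len(name) for name in names)}
    # curie_suffix is only present when the suffix is completely numerical.
    if suffix.isdecimal():
        fields["curie_suffix"] = int(suffix)
    return fields


# A few of the names from the G6PC1 example above.
g6pc1_names = ["GSD1", "G6PC", "G6PC1", "glucose-6-phosphatase alpha"]
print(derived_fields("NCBIGene:2538", g6pc1_names))
print(derived_fields("ENSEMBL:ENSG00000131482", ["G6PC1 (Hsap)"]))
```

The second call returns no curie_suffix, because the ENSEMBL suffix is not completely numerical.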

Conflation files

There are two conflation files: GeneProtein.txt and DrugChemical.txt, corresponding to the two currently supported conflation methods. Both files have the same format: a JSONL file where each entry is a list of clique leaders (i.e. the first identifier for a clique in the Compendia files) that should be combined under that conflation. For example, the following entry from GeneProtein.txt indicates that if either NCBIGene:2538 or UniProtKB:P35575 is queried with GeneProtein conflation turned on, then a combined clique of both identifiers should be returned.

["NCBIGene:2538", "UniProtKB:P35575"]
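A minimal sketch of how such an entry could be applied, assuming the cliques are already available as a dictionary from clique leader to identifier list. The function names load_conflation and conflated_identifiers are hypothetical and do not reflect how NodeNorm is actually implemented. The identifier lists are taken from the G6PC1 examples in this document.

```python
import json


def load_conflation(lines):
    """Map each clique leader to the full list of leaders it is conflated with."""
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        leaders = json.loads(line)
        for leader in leaders:
            lookup[leader] = leaders
    return lookup


def conflated_identifiers(leader, lookup, cliques):
    """Concatenate the cliques of all conflated leaders, in conflation-file order."""
    combined = []
    # A leader with no conflation entry is returned as its own clique.
    for member in lookup.get(leader, [leader]):
        combined.extend(cliques[member])
    return combined


# Cliques keyed by clique leader (identifier CURIEs only).
cliques = {
    "NCBIGene:2538": ["NCBIGene:2538", "ENSEMBL:ENSG00000131482", "HGNC:4056",
                      "OMIM:613742", "UMLS:C1414892"],
    "UniProtKB:P35575": ["UniProtKB:P35575", "PR:P35575", "UMLS:C4549614"],
}
lookup = load_conflation(['["NCBIGene:2538", "UniProtKB:P35575"]'])
print(conflated_identifiers("UniProtKB:P35575", lookup, cliques))
```

Because the gene leader comes first in the conflation entry, the gene identifiers come first in the combined list, regardless of which of the two leaders was queried.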

Here is the NodeNorm response when normalizing UniProtKB:P35575 with both GeneProtein conflation and individual types turned on:

{
  "UniProtKB:P35575": {
    "id": {
      "identifier": "NCBIGene:2538",
      "label": "G6PC1"
    },
    "equivalent_identifiers": [
      {
        "identifier": "NCBIGene:2538",
        "label": "G6PC1",
        "type": "biolink:Gene"
      },
      {
        "identifier": "ENSEMBL:ENSG00000131482",
        "label": "G6PC1 (Hsap)",
        "type": "biolink:Gene"
      },
      {
        "identifier": "HGNC:4056",
        "label": "G6PC1",
        "type": "biolink:Gene"
      },
      {
        "identifier": "OMIM:613742",
        "type": "biolink:Gene"
      },
      {
        "identifier": "UMLS:C1414892",
        "label": "G6PC1 gene",
        "type": "biolink:Gene"
      },
      {
        "identifier": "UniProtKB:P35575",
        "label": "G6PC1_HUMAN Glucose-6-phosphatase catalytic subunit 1 (sprot)",
        "type": "biolink:Protein"
      },
      {
        "identifier": "PR:P35575",
        "label": "glucose-6-phosphatase catalytic subunit 1 (human)",
        "type": "biolink:Protein"
      },
      {
        "identifier": "UMLS:C4549614",
        "label": "G6PC1 protein, human",
        "type": "biolink:Protein"
      }
    ],
    "type": [
      "biolink:Gene",
      "biolink:GeneOrGeneProduct",
      "biolink:GenomicEntity",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:PhysicalEssence",
      "biolink:OntologyClass",
      "biolink:BiologicalEntity",
      "biolink:ThingWithTaxon",
      "biolink:NamedThing",
      "biolink:PhysicalEssenceOrOccurrent",
      "biolink:MacromolecularMachineMixin",
      "biolink:Protein",
      "biolink:GeneProductMixin",
      "biolink:Polypeptide",
      "biolink:ChemicalEntityOrProteinOrPolypeptide"
    ],
    "information_content": 88.2
  }
}

Note that this includes both biolink:Gene identifiers (such as HGNC:4056) and biolink:Protein identifiers (such as UniProtKB:P35575).
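To separate the gene and protein identifiers in a conflated response, the equivalent identifiers can be grouped by their individual type. The sketch below assumes a response shaped like the one above, with a type on every equivalent identifier. The function name identifiers_by_type is hypothetical, and the sample response is abbreviated.

```python
from collections import defaultdict


def identifiers_by_type(response, curie):
    """Group the equivalent identifiers for one queried CURIE by Biolink type."""
    grouped = defaultdict(list)
    for entry in response[curie]["equivalent_identifiers"]:
        grouped[entry["type"]].append(entry["identifier"])
    return dict(grouped)


# Abbreviated from the response above (labels and most identifiers omitted).
sample_response = {
    "UniProtKB:P35575": {
        "id": {"identifier": "NCBIGene:2538", "label": "G6PC1"},
        "equivalent_identifiers": [
            {"identifier": "NCBIGene:2538", "type": "biolink:Gene"},
            {"identifier": "HGNC:4056", "type": "biolink:Gene"},
            {"identifier": "UniProtKB:P35575", "type": "biolink:Protein"},
            {"identifier": "PR:P35575", "type": "biolink:Protein"},
        ],
    }
}

print(identifiers_by_type(sample_response, "UniProtKB:P35575"))
```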