The Know2BIO dataset is periodically updated and released. The latest release is from 2023-08-18 and consists of three versions:
- Full dataset: The full Know2BIO Knowledge Graph, consisting of 219,169 nodes, 6,181,160 edges, and their node features. Access to this dataset is subject to licensing restrictions; it can be obtained by acquiring the necessary licenses and filling out the webform below. Instructions on obtaining these licenses will be released soon.
- Safe release: The Know2BIO Knowledge Graph without licensing-restricted information, consisting of 152,845 nodes, 3,282,063 edges, and their node features. This dataset can be obtained by filling out the webform below.
- Sampled safe release: A sample of the Know2BIO safe release containing roughly 1% of all edges, drawn proportionally across edge types. This dataset is accessible under the 'sampled_know2bio_safe_release' folder. An example of its node features is included in this directory under 'sampled_node_features.json'. The full set of node features for this sampled dataset is accessible at this link.
To access the full dataset or the safe release dataset, please fill out this webform: https://forms.gle/3HdKRtvW7ce9PKpw6
Summary:
Manually download the following data into the dataset/create_edge_files_utils/input folder:
- UMLS thesaurus "MRCONSO.RRF" (make a UMLS account): https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html
- DrugBank full database (make a DrugBank account): https://go.drugbank.com/releases/5-1-9/downloads/all-full-database
- DrugBank compound structures (make a DrugBank account): https://go.drugbank.com/releases/5-1-9/downloads/all-structure-links
Execute these commands:
python create_edge_files.py
python ./prepare_kgs/prepare_kgs.py ./output/edges_to_use/ ./know2bio ./input_lists/all_kg_edges.txt
python ./prepare_kgs/split_dataset.py ./know2bio 0.8
python ./prepare_kgs/prepare_benchmark.py ./know2bio ../../benchmark/data
cd ../../benchmark/data/K2BIO
python n-n.py
Execute python create_edge_files.py to create all edges.
Note: This executes the edge-creation scripts in a suitable order. Some scripts must be run first (e.g., compound_to_compound_alignment and gene_to_protein), while others can be run in any order, or skipped if you do not want to create their respective edge types. Alternatively, you can still create the edges and simply choose not to use the edge file a script produces. A minimal sketch of running only a subset of these scripts is shown below.
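If you only need certain edge types, one option (a minimal sketch, not the official pipeline) is to call the individual scripts directly in the required order; the .py file names and the optional scripts listed here are assumptions for illustration:

```python
# Minimal sketch: run the alignment scripts first, then any subset of edge scripts.
# Script file names are assumed from the module names mentioned above.
import subprocess

must_run_first = [
    "compound_to_compound_alignment.py",  # required before other compound edges
    "gene_to_protein.py",                 # required before other gene/protein edges
]
optional_edge_scripts = [
    "protein_to_protein.py",              # hypothetical examples; include only the
    "compound_to_protein.py",             # edge types you actually want
]

for script in must_run_first + optional_edge_scripts:
    subprocess.run(["python", script], check=True,
                   cwd="dataset/create_edge_files_utils")
```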
Following construction of the Know2BIO dataset as detailed above, the edge files must be assembled into a knowledge graph and prepared in a specific data format for use with our benchmark knowledge graph representation learning models.
Scripts for constructing the knowledge graph for the Know2BIO dataset are found in the prepare_kgs folder.
Set up the environment to run the script, using pip install -r prepare_kgs_requirements.txt
The script takes as input the directory where Know2BIO is downloaded, the output directory, and a list of all input files grouped by which KG each file is a part of. The KGs are split into train/validation/test sets following an 80/10/10 split and are processed into the correct format for benchmarking.
Run kg_sampler.py to sample a subset of the knowledge graph for testing purposes.
- Adding or Removing Know2BIO Edges: Including or excluding certain data files from Know2BIO allows the construction of a use-case-specific knowledge graph. This is achieved by providing a tailored file list for the knowledge graph construction. Included within this repository is the input_lists folder, which specifies the edge lists to be used for knowledge graph construction. The ont_bridge_inst_list.txt file lists all edge files generated when constructing Know2BIO and specifies which files are considered part of each view (i.e., instance view, ontology view, bridge view). To make a tailored file, we recommend copying this file and deleting the file names that should not be included (a minimal filtering sketch is shown after the examples below).
Examples of specific use cases are included within the input_lists folder and detailed below:
- A protein-protein interaction knowledge graph: This protein-centric knowledge graph includes only known protein-protein interactions, their relations to genes, and their relations to biological pathways. No disease or drug information is included, as their inclusion could hinder prediction of protein-protein interactions. An example input list for this knowledge graph is protein_protein_interaction_kg_list.txt.
- A drug-target interaction knowledge graph: This knowledge graph focuses specifically on protein-drug interactions between DrugBank drugs, UniProt proteins, MeSH diseases, and Reactome biological pathways. This KG does not include data files from different knowledge bases which serve a similar purpose (e.g., using Reactome instead of Reactome and KEGG; using DrugBank only instead of DrugBank and MeSH compounds) to reduce the complexity of the KG when integrating from multiple sources. An example input list for this knowledge graph is drug_target_interaction_kg_list.txt.
Furthermore, these KGs can be constructed to test the predictive power on biologically relevant triples (e.g., training a model on the entire Know2BIO and predicting only on protein-drug edges); this will be detailed soon.
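As a rough illustration of building a tailored file list (a minimal sketch; it assumes ont_bridge_inst_list.txt lists one edge-file name per line alongside any view headers, and the excluded file names are hypothetical):

```python
# Minimal sketch: copy ont_bridge_inst_list.txt and drop the edge files you do not want.
# The excluded file names below are hypothetical examples.
exclude = {"compound_to_disease.csv", "compound_treats_disease.csv"}

with open("input_lists/ont_bridge_inst_list.txt") as src, \
     open("input_lists/my_tailored_list.txt", "w") as dst:
    for line in src:
        if line.strip() not in exclude:   # keep view headers and all other files
            dst.write(line)
```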
- Adding Your Own Data: To add your own edges, follow the format of the other edge files (columns for head, relation, tail, and optionally weight). Each node should have a prefix following the conventions we used: nodes should begin with a prefix for the node type, followed by a colon and then the node's ID in that namespace. The node type prefixes we used are ATC, DrugBank_Compound, Entrez, HMDB_Metabolite, KEGG_Pathway, MeSH_Anatomy, MeSH_Compound, MeSH_Disease, MeSH_Tree_Anatomy, MeSH_Tree_Disease, Reactome_Pathway, Reactome_Reaction, SMPDB_Pathway, UniProt, biological_process, cellular_component, and molecular_function; the latter three are for GO terms. A minimal sketch of writing such an edge file is shown below.
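For example, a custom edge file could be written as follows (a minimal sketch; the relation names, weights, output path, and the use of a comma-separated header row are assumptions — match them to the existing edge files):

```python
# Minimal sketch: write a custom edge file with prefixed node identifiers.
# Relation names, weights, and the output path are hypothetical.
import csv

edges = [
    ("UniProt:P04637", "my_custom_interaction", "UniProt:P02340", 0.9),
    ("Entrez:7157",    "my_custom_association", "MeSH_Disease:D009369", 1.0),
]

with open("output/edges_to_use/my_custom_edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["head", "relation", "tail", "weight"])  # adjust to match the other edge files
    writer.writerows(edges)
```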
The prepare_kgs.py script prepares the input data from output/edges_to_use and splits the dataset into different views (i.e., instance view, ontology view, bridge view, and the aggregate view). These four combined knowledge graphs will be output to the know2bio folder. Run the command below to execute it.
python ./prepare_kgs/prepare_kgs.py ./output/edges_to_use/ ./know2bio ./input_lists/ont_bridge_inst_list.txt
The split_dataset.py script splits the knowledge graphs constructed in the previous step into separate knowledge graphs for training and evaluation. The input folder (with kg1f_instances.txt, kg2f_ontologies.txt, and alignf_bridges.txt) is provided as the first argument, with the proportion of the KG used for training as the second argument. The train graph is required to span all nodes in the knowledge graph, for every connected component with more than 10 nodes.
python ./prepare_kgs/split_dataset.py ./know2bio 0.8
The prepare_benchmark.py script automatically reformats the resulting aggregate, instance, and ontology view data into the data folder of the benchmark section of the repository. This generates the entity2id.txt, relation2id.txt, whole2id.txt, train2id.txt, test2id.txt, and valid2id.txt files in the corresponding data folder. The n-n.py script must then be executed to generate the remaining data files and successfully update the dataset for modeling.
python ./prepare_kgs/prepare_benchmark.py ./know2bio ../../benchmark/data
cd ../../benchmark/data/K2BIO
python n-n.py
The prepare_multimodal_data.py script automatically combines the resulting node features in the multi-modal_data folder into the node_features.json file. This file summarizes the node features for each node in the knowledge graph. To generate this file, make sure all parsed files are within the multi-modal_data folder and execute the script below.
python ./prepare_kgs/prepare_multimodal_data.py
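Once generated, the file can be inspected with standard JSON tooling. A minimal sketch is shown below; the exact schema of node_features.json is an assumption (a mapping from prefixed node identifiers to their features):

```python
# Minimal sketch: load node_features.json and inspect one node's features.
# Assumes the file maps node identifiers to feature dictionaries.
import json

with open("node_features.json") as f:
    node_features = json.load(f)

print(len(node_features), "nodes with features")
example_id = next(iter(node_features))   # e.g., "UniProt:P04637"
print(example_id, node_features[example_id])
```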
- Motivation
- Composition
- Collection process
- Preprocessing/cleaning/labeling
- Uses
- Distribution
- Maintenance
Knowledge Graph Benchmark of Biomedical Instances and Ontologies (Know2Bio) is a comprehensive and evolving general-purpose heterogeneous knowledge graph encompassing data from 29 diverse sources, representing 11 biomedical categories and capturing intricate relationships. The current version of Know2BIO consists of approximately 216,000 nodes and 6,500,000 edges, and it can be automatically updated to ensure its relevance and currency.
Know2Bio was created as a general-purpose biomedical knowledge graph. It is intended to be used as a resource for biomedical discovery and a real-world benchmark dataset for knowledge graph representation learning models.
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
Due to review policy, author information will be released shortly.
Due to review policy, funding details will be disclosed shortly.
Know2Bio integrates data from 29 diverse sources, capturing intricate relationships across 11 biomedical categories. It currently consists of approximately 216,000 nodes and 6,500,000 edges. The biomedical categories are anatomy, biological process, cellular component, compounds/drugs, disease, drug class, genes, molecular function, pathways, proteins, and reactions.
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
The dataset is available in three formats: 1) raw input files (.csv) detailing individually extracted biomedical knowledge obtained via APIs and downloads; these files also include intermediate files for mapping between ontologies, as well as node features (e.g., text descriptions, sequence data, structure data) and edge weights, which were not included in the combined dataset because they were not used in model evaluation. 2) A combined KG following the head-relation-tail (h,r,t) convention, as a comma-separated text file; these KGs are released for the ontology view, instance view, and bridge view, as well as a combined whole KG. 3) To facilitate benchmark comparison between different KG embedding models, we also release the train, validation, and test split KGs.
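The combined KG files can be read with standard tooling. A minimal sketch is shown below (the file name is taken from the construction section above, and the exact layout per release is an assumption):

```python
# Minimal sketch: load one of the released (head, relation, tail) KG files.
# Assumes a comma-separated file without a header row.
import pandas as pd

triples = pd.read_csv("know2bio/kg1f_instances.txt",
                      names=["head", "relation", "tail"])
print(triples.shape)
print(triples.head())
```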
The current version of Know2BIO consists of approximately 216,000 nodes and 6,500,000 edges across 11 biomedical categories. Tables with detailed node and edge counts will be added soon.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
The dataset contains all possible instances. Individual input files, test/train/validation splits, and a sampled subset of the dataset are also available.
Data are released as processed triples (head, relation, tail) as well as raw text description and sequence data.
There is no label or target associated with each instance.
Triples involving DrugBank are removed in this release due to licensing restrictions. A mechanism to access the full dataset will be released soon.
Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?
Relationships between instances are explicit, as represented in the knowledge graph format.
To facilitate benchmark comparison between different KG embedding models, we also release the train, validation, and test split KGs. These KGs are released for the ontology view, instance view, and bridge view, as well as a combined whole KG. The resulting KG was split into a train and test KG. The largest strongly connected component from the main KG was used to generate the train and test split, designating as close to 30% of all nodes to the test set as possible, for each view. The remaining portion is divided into train and validation sets with a ratio of 9:1. A table enumerating the dataset split will be added.
To construct our KG, we integrate data from 29 biomedical data sources spanning several biomedical disciplines. This data integration required careful selection of data sources from which we extracted data. It also required data identifiers (IDs) to be mapped to common IDs through various intermediary resources. This is needed to unify knowledge on each biomedical entity/concept because they (e.g., genes) are often represented by different IDs (e.g., gene name; IDs from Entrez, Ensembl, HGNC) in different data sources. However, this process can be circuitous. For example, to unify knowledge on compounds and the proteins they target (i.e., Compound (DrugBank ID) -targets- Protein (UniProt ID)) taken from the Therapeutic Target Database (TTD), the following relationships are aligned: Compound (TTD ID) -targets- Protein (TTD ID) from TTD, Protein (TTD ID) -is- Protein (UniProt name) from UniProt, and Protein (UniProt name) -is- Protein (UniProt ID) from UniProt. This creates Compound (TTD ID) -targets- Protein (UniProt ID) edges. But to unify this with the same compounds represented by DrugBank IDs elsewhere in the KG, the following relationships are aligned: Compound (DrugBank ID) -is- Compounds (old TTD, CAS, PubChem, and ChEBI IDs) from DrugBank (4 relationships), and Compounds (CAS, PubChem, and ChEBI) -is- Compound (new TTD) from TTD (3 relationships).
Relationships are also backed by varying levels of evidence (e.g., confidence scores for protein-protein interactions from STRING, gene-disease associations from DisGeNET). To select appropriate thresholds for inclusion in our KG, we investigate how confidence scores are calculated, what past researchers have selected, KB author recommendations, and resulting data availability.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
The dataset is self-contained. The source code used to generate the dataset links to external resources. A table describing the websites and APIs used will be added.
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?
No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
No.
The dataset relates to biomedical research, human health, and disease.
No.
Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset?
No.
Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)?
No.
To construct our KG, we integrate data from 29 biomedical data sources spanning several biomedical disciplines. This data integration required careful selection of data sources from which we extracted data. It also required data identifiers (IDs) to be mapped to common IDs through various intermediary resources. This is needed to unify knowledge on each biomedical entity/concept because they (e.g., genes) are often represented by different IDs (e.g., gene name; IDs from Entrez, Ensembl, HGNC) in different data sources. However, this process can be circuitous. For example, to unify knowledge on compounds and the proteins they target (i.e., Compound (DrugBank ID) -targets- Protein (UniProt ID)) taken from the Therapeutic Target Database (TTD), the following relationships are aligned: Compound (TTD ID) -targets- Protein (TTD ID) from TTD, Protein (TTD ID) -is- Protein (UniProt name) from UniProt, and Protein (UniProt name) -is- Protein (UniProt ID) from UniProt. This creates Compound (TTD ID) -targets- Protein (UniProt ID) edges. But to unify this with the same compounds represented by DrugBank IDs elsewhere in the KG, the following relationships are aligned: Compound (DrugBank ID) -is- Compounds (old TTD, CAS, PubChem, and ChEBI IDs) from DrugBank (4 relationships), and Compounds (CAS, PubChem, and ChEBI) -is- Compound (new TTD) from TTD (3 relationships).
Relationships are also backed by varying levels of evidence (e.g., confidence scores for protein-protein interactions from STRING, gene-disease associations from DisGeNET). To select appropriate thresholds for inclusion in our KG, we investigate how confidence scores are calculated, what past researchers have selected, KB author recommendations, and resulting data availability.
The data sources include various databases, knowledge bases, API services, and knowledge graphs: MyGene.info, MyChem.info, MyDisease.info, Bgee, KEGG, PubMed, MeSH, SIDER, UMLS, CTD, PathFX, DisGeNET, TTD, Hetionet, Uberon, Mondo, PharmGKB, DrugBank, Reactome, DO, ClinGen, ClinVar, UniProt, GO, STRING, InxightDrugs, SMPDB, HGNC, and GRNdb.
What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)?
See above. Manual human inspection of API results was performed to detect errors and confirm successful downloads.
If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
Due to review policy, author information will be released shortly.
Creation of the data collection pipeline took place over the year and a half prior to publication and release; the new dataset was generated a few months before release and will be periodically updated.
No.
No.
Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?
N.A.
N.A.
N.A.
If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?
N.A.
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted?
N.A.
Raw data and intermediate files downloaded from APIs and other sources are included with the submission. Resulting files are represented as knowledge graph triples (i.e., head, relation, tail).
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?
Namespace identifiers, indicating which biomedical resource a node originated from, were prepended to node names (e.g., "Reactome:" was prepended to all Reactome pathways and reactions). Relations with 0 weight were removed. Relation names were assigned to match the knowledge graph triple convention.
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)?
Yes. It is included.
Yes. It is included.
We have performed a benchmark for knowledge graph representation learning models.
If the accompanying paper is accepted, these can be found as papers which cite the publication.
This knowledge graph is a general-purpose knowledge graph for biomedical knowledge. It can be used for identifying drug targets and therapeutics, discovering biomarkers for disease, among other purposes.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This knowledge graph framework is easily extensible to other biomedical knowledge bases not currently included. In the preparation of this dataset, there is no unfair treatment of individuals or groups or other undesirable harms beyond what is already represented in the source biomedical knowledge bases. To our knowledge, we have not included knowledge bases of concern for these harms.
This knowledge graph is intended for research purposes and should not be used for medical advice or clinical decision making without consulting a medical professional.
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
This dataset is available for download and reuse under the MIT license.
If this paper is accepted, we will release the DOI to access the dataset.
This dataset will be distributed upon paper acceptance.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
MIT license.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
A few data sources impose licensing restrictions, and their data will not be included with our dataset release. Instructions on how to obtain licenses and retrieve the complete dataset will be released soon.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
No.
Due to review policy, institute information will be released shortly.
Due to review policy, author information will be released shortly.
Not currently.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
The dataset and accompanying code will be periodically updated and communicated through the project GitHub.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)?
No.
We will maintain versioning of new releases of the dataset.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
Users are welcome to extend, augment, build on, and contribute to the dataset, per mechanisms on GitHub.