Which scientific concepts, that have never been investigated jointly, will lead to the most impactful research?
📖 Read our paper here:
Forecasting high-impact research topics via machine learning on evolving knowledge graphs
Xuemei Gu, Mario Krenn
Note
The full dynamic knowledge graph and datasets can be downloaded at 10.5281/zenodo.10692137.
The benchmark dataset can be downloaded at 10.5281/zenodo.14527306.
```
create_concept
├── Concept_Corpus
│   ├── s0_get_preprint_metadata.ipynb: Get metadata from chemRxiv, medRxiv, bioRxiv (arXiv data from Kaggle)
│   ├── s1_make_metadate_arxivstyle.ipynb: Preprocess metadata from the different sources
│   ├── s2_combine_all_preprint_metadate.ipynb: Combine the metadata
│   ├── s3_get_concepts.ipynb: Extract concepts with NLP techniques (for instance RAKE)
│   └── s4_improve_concept.ipynb: Further improve the full concept list
└── Domain_Concept
    ├── s0_prepare_optics_quantum_data.ipynb: Get papers for a specific domain (optics and quantum physics in our case)
    ├── s1_split_domain_papers.py: Prepare data for parallelization
    ├── s2_get_domain_concepts.py: Get domain-specific vertices from the full concept list
    ├── s3_merge_concepts.py: Post-process the domain-specific concepts
    ├── s4_improve_concepts.ipynb: Further improve the concept list
    ├── s5_improve_manually_concepts.py: Final manual inspection of the concepts for grammar, non-conceptual phrases, verbs, ordinal numbers, conjunctions, adverbials, and so on, to improve quality
    └── full_domain_concepts.txt: Final list of 37,960 concepts (the vertices of the knowledge graph)
```
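At their core, the domain-concept steps reduce to matching a curated phrase list against paper metadata. A minimal sketch of that matching step (plain Python; the concept list and abstract are toy stand-ins, and the real pipeline extracts candidates with RAKE rather than this simplified whole-phrase lookup):

```python
import re

# Toy stand-ins for full_domain_concepts.txt and a preprint abstract.
concepts = ["quantum entanglement", "optical frequency comb", "machine learning"]
abstract = ("We use machine learning to study quantum entanglement "
            "generated by an optical frequency comb.")

def concepts_in_text(concept_list, text):
    """Return the concepts that occur as whole phrases in the text."""
    found = []
    lowered = text.lower()
    for c in concept_list:
        # Word-boundary match, so 'comb' alone would not match inside 'combine'.
        if re.search(r"\b" + re.escape(c.lower()) + r"\b", lowered):
            found.append(c)
    return found

print(concepts_in_text(concepts, abstract))
# → ['quantum entanglement', 'optical frequency comb', 'machine learning']
```

The word-boundary regex avoids the most common false positives of naive substring search; the real concept list additionally goes through the manual cleaning pass described above.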
```
create_dynamic_edges
├── _get_openalex_workdata.py: Get metadata from OpenAlex
├── _get_openalex_workdata_parallel_run1.py: Get parts of the metadata from OpenAlex (run in many parts)
├── get_concept_pairs.py: Create the edges of the knowledge graph (edges carry time and citation information)
├── merge_concept_pairs.py: Combine the edge files
└── process_edge_to_pandas_frame.py: Post-process and store the full dynamic knowledge graph
```
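An edge of the dynamic graph is a co-occurrence of two concepts in one paper, stamped with that paper's year and citation count. A sketch of the core step of edge creation (toy records; the field names and record layout are illustrative, not the actual schema of get_concept_pairs.py):

```python
from itertools import combinations

# Toy paper records: concepts found in each paper plus its metadata.
papers = [
    {"concepts": ["quantum memory", "photonic chip", "machine learning"],
     "year": 2019, "citations": 41},
    {"concepts": ["photonic chip", "machine learning"],
     "year": 2021, "citations": 7},
]

def build_edges(papers):
    """Each co-occurring concept pair in a paper becomes one edge
    occurrence carrying the paper's year and citation count."""
    edges = []
    for p in papers:
        for u, v in combinations(sorted(p["concepts"]), 2):
            edges.append((u, v, p["year"], p["citations"]))
    return edges

edges = build_edges(papers)   # 4 edge occurrences from the two toy papers
```

Sorting the concepts before pairing gives each edge a canonical (u, v) order, so the same pair from different papers can later be merged into one dynamic edge.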
```
.
├── prepare_unconnected_pair_solution.ipynb: Find unconnected concept pairs (for training, testing, and evaluation)
├── prepare_adjacency_pagerank.py: Prepare the dynamic knowledge graph and compute its properties
├── prepare_node_pair_citation_data_years.ipynb: Prepare citation data for individual concept nodes and concept pairs for specific years
├── create_dynamic_concepts
│   ├── get_concept_citation.py: Create dynamic concepts from the knowledge graph (concepts carry time and citation information)
│   ├── merge_concept_citation.py: Combine the dynamic-concept files
│   ├── process_concept_to_pandas_frame.py: Post-process and store the full dynamic concepts
│   ├── merge_concept_pairs.py: Combine the dynamic concepts
│   └── process_edge_to_pandas_frame.py: Post-process and store the full dynamic concepts
└── prepare_eval_data
    ├── prepare_eval_feature_data.py: Prepare knowledge-graph features (for the evaluation dataset)
    └── prepare_eval_feature_data_condition.py: Prepare knowledge-graph features (for the evaluation dataset, conditioned on existence in the future)
```
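The unconnected-pair step can be sketched as follows (toy graph with integer vertex ids; the real notebook operates on the 37,960-vertex graph and samples candidate pairs rather than enumerating all of them):

```python
from itertools import combinations

# Toy graph: vertices are concept indices, edges exist up to the cutoff year.
vertices = [0, 1, 2, 3]
edges_until_cutoff = {(0, 1), (1, 2)}   # stored with u < v

def unconnected_pairs(vertices, edges):
    """All vertex pairs NOT yet linked at the cutoff year -- these are
    the candidates whose future impact the model predicts."""
    return [(u, v) for u, v in combinations(sorted(vertices), 2)
            if (u, v) not in edges]

print(unconnected_pairs(vertices, edges_until_cutoff))
# → [(0, 2), (0, 3), (1, 3), (2, 3)]
```

Storing edges with u < v makes the membership test a single set lookup per candidate pair.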
```
.
├── train_model_2019_run.py: Train the neural network on 2016 -> 2019 data (evaluated from 2019 -> 2022)
├── train_model_2019_condition.py: Train the neural network on 2016 -> 2019 data (evaluated from 2019 -> 2022, conditioned on existence in the future)
├── train_model_2019_individual_feature.py: Train the neural network on 2016 -> 2019 data (evaluated from 2019 -> 2022) on individual features
└── train_model_2022_run.py: Train on 2019 -> 2022 data (for real future predictions toward 2025)
```
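The 2016 -> 2019 setup implies labels of the following shape: a pair unconnected in 2016 counts as positive if, by 2019, it has become connected and gathered enough citations. A hedged sketch of that label construction (the threshold value and the exact impact definition here are illustrative, not the paper's):

```python
def make_label(pair, edges_future, citations_future, impact_threshold=20):
    """Label an unconnected pair from the earlier snapshot: positive if,
    by the later snapshot, the pair is connected AND its accumulated
    citations reach the threshold (threshold value is illustrative)."""
    connected = pair in edges_future
    impactful = citations_future.get(pair, 0) >= impact_threshold
    return int(connected and impactful)

# Toy 2019 snapshot: one new edge with its citation count.
edges_2019 = {(0, 2)}
citations_2019 = {(0, 2): 35}
print(make_label((0, 2), edges_2019, citations_2019))  # → 1
print(make_label((0, 3), edges_2019, citations_2019))  # → 0
```

The conditioned variant (train_model_2019_condition.py) restricts evaluation to pairs that do become connected, which in this sketch would mean filtering on `connected` before scoring.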
Feature descriptions for an unconnected pair of concepts (u, v). Features are computed on yearly snapshots of the evolving knowledge graph: index ranges of six (node features) or three (pair features) hold one value per snapshot year, while index ranges of two hold one value per node. Pairs of rows that share a description use two alternative definitions of the same quantity; the precise formulas are given in the paper.

Feature Type | Feature Index | Feature Description
---|---|---
node feature | 0-5 | Number of neighbors of each node
 | 6-7 | Number of new neighbors of each node since the previous snapshot year
 | 8-9 | Number of new neighbors of each node since two snapshot years earlier
 | 10-11 | Rank of the number of new neighbors of each node (since the previous snapshot year)
 | 12-13 | Rank of the number of new neighbors of each node (since two snapshot years earlier)
 | 14-19 | PageRank score of each node
node citation feature | 20-25 | Yearly citations of each node
 | 26-31 | Total citations of each node
 | 32-37 | Total citations of each node (second variant)
 | 38-43 | Number of papers mentioning each node
 | 44-49 | Average yearly citations of each node
 | 50-55 | Average total citations of each node
 | 56-61 | Average total citations of each node (second variant)
 | 62-63 | New citations of each node since the previous snapshot year
 | 64-65 | New citations of each node since two snapshot years earlier
 | 66-67 | Rank of the new citations of each node (since the previous snapshot year)
 | 68-69 | Rank of the new citations of each node (since two snapshot years earlier)
 | 70-71 | Number of new papers mentioning each node since the previous snapshot year
 | 72-73 | Number of new papers mentioning each node since two snapshot years earlier
 | 74-75 | Rank of the number of new papers mentioning each node (since the previous snapshot year)
 | 76-77 | Rank of the number of new papers mentioning each node (since two snapshot years earlier)
pair feature | 78-80 | Number of shared neighbors of the two nodes
 | 81-83 | Geometric similarity coefficient of the pair
 | 84-86 | Cosine similarity coefficient of the pair
 | 87-89 | Simpson coefficient of the pair
 | 90-92 | Preferential-attachment coefficient of the pair
 | 93-95 | Sørensen–Dice coefficient of the pair
 | 96-98 | Jaccard coefficient of the pair
pair citation feature | 99-101 | Ratio of the sum of citations received by the two nodes
 | 102-104 | Ratio of the product of citations received by the two nodes
 | 105-107 | Sum of the average yearly citations received by the two nodes
 | 108-110 | Sum of the average total citations received by the two nodes
 | 111-113 | Sum of the total citations received by the two nodes
 | 114-116 | Sum of the average total citations received by the two nodes (second variant)
 | 117-119 | Minimum of the citations received by either node
 | 120-122 | Maximum of the citations received by either node
 | 123-125 | Minimum of the total citations received by either node
 | 126-128 | Maximum of the total citations received by either node
 | 129-131 | Minimum of the total citations received by either node (second variant)
 | 132-134 | Maximum of the total citations received by either node (second variant)
 | 135-137 | Minimum of the number of papers mentioning either node
 | 138-140 | Maximum of the number of papers mentioning either node
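The pair features (indices 78-98) are standard link-prediction similarity measures over the two concepts' neighbor sets. A sketch for a single graph snapshot (set-based definitions; the geometric index is written here in its common squared-overlap form, and all dictionary keys are our own names, not identifiers from the code):

```python
import math

def pair_features(neigh_u, neigh_v):
    """Similarity measures between two concepts' neighbor sets for one
    snapshot (the pipeline repeats this for three snapshot years)."""
    shared = len(neigh_u & neigh_v)
    du, dv = len(neigh_u), len(neigh_v)
    union = len(neigh_u | neigh_v)
    return {
        "shared_neighbors": shared,
        "geometric": shared**2 / (du * dv) if du and dv else 0.0,
        "cosine": shared / math.sqrt(du * dv) if du and dv else 0.0,
        "simpson": shared / min(du, dv) if min(du, dv) else 0.0,
        "preferential_attachment": du * dv,
        "sorensen_dice": 2 * shared / (du + dv) if du + dv else 0.0,
        "jaccard": shared / union if union else 0.0,
    }

f = pair_features({1, 2, 3, 4}, {3, 4, 5})   # 2 shared neighbors
```

The guards against empty neighbor sets matter in practice: freshly added concepts can have degree zero in early snapshots.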
To run the benchmarks, download the data at 10.5281/zenodo.14527306 and unzip the file into the benchmark_code folder.
```
benchmark_code
├── loops_fcNN.py: Fully connected neural-network model
├── loops_transformer.py: Transformer model
├── loops_tree.py: Random-forest model
├── loops_xgboost.py: XGBoost model
└── other Python files: Post-processing; produce Figures 6-8 from the evaluation of the different models
```
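As a rough illustration of what the fully connected benchmark consumes: a 141-dimensional feature vector (indices 0-140 above) mapped to a single impact score. The layer sizes and weights below are invented for the sketch; the real loops_fcNN.py defines its own architecture and training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical architecture: 141 input features -> two ReLU hidden
# layers -> one sigmoid impact score. Weights are random, untrained.
W1, b1 = 0.1 * rng.normal(size=(141, 64)), np.zeros(64)
W2, b2 = 0.1 * rng.normal(size=(64, 16)), np.zeros(16)
W3, b3 = 0.1 * rng.normal(size=(16, 1)), np.zeros(1)

def forward(x):
    """Forward pass producing a score in (0, 1) per input row."""
    h1 = np.maximum(0.0, x @ W1 + b1)                  # ReLU
    h2 = np.maximum(0.0, h1 @ W2 + b2)                 # ReLU
    return 1.0 / (1.0 + np.exp(-(h2 @ W3 + b3)))       # sigmoid

score = forward(rng.normal(size=(1, 141)))   # one feature vector -> one score
```

The tree, XGBoost, and transformer benchmarks consume the same 141-feature vectors; only the model mapping features to the score changes.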
Three examples with about 10M evaluation samples each (2019-2022), together with the raw outputs of a neural network trained on 2016-2019 data (accessible at 10.5281/zenodo.14527306), are provided in the fpr_example folder for producing Figure 11.
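Figure 11 concerns false-positive rates derived from such raw network outputs. A minimal sketch of computing the FPR at one decision threshold from scores and binary labels (toy data; not the repository's evaluation code):

```python
def fpr_at_threshold(scores, labels, threshold):
    """False-positive rate: the fraction of negative samples whose raw
    network output reaches the decision threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        return 0.0
    false_pos = sum(1 for s in negatives if s >= threshold)
    return false_pos / len(negatives)

scores = [0.9, 0.8, 0.4, 0.3, 0.1]   # raw network outputs
labels = [1,   0,   1,   0,   0]     # ground truth from the 2022 graph
fpr = fpr_at_threshold(scores, labels, 0.5)   # one of three negatives
```

Sweeping the threshold over the sorted scores yields the full FPR curve; on ~10M samples this is done with vectorized cumulative sums rather than a Python loop.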