Forecasting high-impact research topics via machine learning on evolving knowledge graphs

artificial-scientist-lab/Impact4Cast

Impact4Cast

License: MIT arXiv ICML AI4Science

Which scientific concepts, that have never been investigated jointly, will lead to the most impactful research?

📖 Read our paper here:
Forecasting high-impact research topics via machine learning on evolving knowledge graphs
Xuemei Gu, Mario Krenn

workflow

Note

Full Dynamic Knowledge Graph and Datasets can be downloaded at 10.5281/zenodo.10692137
Dataset for Benchmark can be downloaded at 10.5281/zenodo.14527306

create_concept
│ 
├── Concept_Corpus
│   ├── s0_get_preprint_metadata.ipynb: Get metadata from chemRxiv, medRxiv, bioRxiv (arXiv data from Kaggle)
│   ├── s1_make_metadate_arxivstyle.ipynb: Preprocessing metadata from different sources
│   ├── s2_combine_all_preprint_metadate.ipynb: Combining metadata
│   ├── s3_get_concepts.ipynb: Use NLP techniques (for instance RAKE) to extract concepts
│   └── s4_improve_concept.ipynb: Further improvements of full concept list
│   
└── Domain_Concept
    ├── s0_prepare_optics_quantum_data.ipynb: Get papers for specific domain (optics and quantum physics in our case).
    ├── s1_split_domain_papers.py: Prepare data for parallelization.
    ├── s2_get_domain_concepts.py: Get domain-specific vertices in full concept list.
    ├── s3_merge_concepts.py: Postprocessing domain-specific concepts
    ├── s4_improve_concepts.ipynb: Further improve concept lists
    ├── s5_improve_manually_concepts.py: Manually inspect the concepts in the very end for grammar, non-conceptual phrases, verbs, ordinal numbers, conjunctions, adverbials and so on, to improve quality
    └── full_domain_concepts.txt: Final list of 37,960 concepts (represent vertices of knowledge graph)
create_dynamic_edges
├── _get_openalex_workdata.py: Get metadata from OpenAlex
├── _get_openalex_workdata_parallel_run1.py: Get parts of the metadata from OpenAlex (run in many parts)
├── get_concept_pairs.py: Create edges of the knowledge graph (edges carry the time and citation information).
├── merge_concept_pairs.py: Combining edges files
└── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic knowledge graph
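
As a rough illustration of what "edges carry the time and citation information" could look like, here is a minimal sketch. It is not the code in `get_concept_pairs.py`; the paper dictionaries, their field names (`concepts`, `year`, `citations`), and the helper `concept_pair_edges` are all hypothetical and chosen only for this example.

```python
from itertools import combinations

def concept_pair_edges(papers):
    """Return one (u, v, year, citations) edge per concept pair co-occurring in a paper."""
    edges = []
    for paper in papers:
        # Sort and deduplicate so each undirected pair appears once, as (u, v) with u < v.
        concepts = sorted(set(paper["concepts"]))
        for u, v in combinations(concepts, 2):
            edges.append((u, v, paper["year"], paper["citations"]))
    return edges

# Hypothetical toy input: concept indices found in two papers.
papers = [
    {"concepts": [3, 1, 2], "year": 2018, "citations": 12},
    {"concepts": [2, 3], "year": 2020, "citations": 4},
]
edges = concept_pair_edges(papers)
```

Keeping one edge per paper, rather than one per pair, preserves the publication year and citation count of every co-occurrence, so the graph can later be cut at any year.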

workflow

.
├── prepare_unconnected_pair_solution.ipynb: Find unconnected concept pairs (for training, testing and evaluating)
├── prepare_adjacency_pagerank.py: Prepare dynamic knowledge graph and compute properties
├── prepare_node_pair_citation_data_years.ipynb: Prepare citation data for both individual concept nodes and concept pairs for specific years
│
├──create_dynamic_concepts
│  ├── get_concept_citation.py: Create dynamic concepts from the knowledge graph (concepts carry the time and citation information). 
│  ├── merge_concept_citation.py: Combining dynamic concepts files
│  ├── process_concept_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│  ├── merge_concept_pairs.py: Combining dynamic concepts
│  └── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│
└──prepare_eval_data
   ├── prepare_eval_feature_data.py: Prepare features of knowledge graph (for evaluation dataset)
   └── prepare_eval_feature_data_condition.py: Prepare features of knowledge graph (for evaluation dataset, conditioned on existence in the future)
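
The idea of finding unconnected concept pairs at a given cutoff year can be sketched as follows. This is an assumption-laden toy version, not the logic of `prepare_unconnected_pair_solution.ipynb`: the edge tuples `(u, v, year)` and the helper `unconnected_pairs` are hypothetical, and exhaustive enumeration is only practical for tiny graphs (for tens of thousands of concepts one would presumably sample pairs instead).

```python
from itertools import combinations

def unconnected_pairs(num_nodes, edges, cutoff_year):
    """List all node pairs (u, v), u < v, with no edge dated on or before cutoff_year."""
    connected = {
        (min(u, v), max(u, v))
        for u, v, year in edges
        if year <= cutoff_year
    }
    return [pair for pair in combinations(range(num_nodes), 2) if pair not in connected]

# Hypothetical toy graph with 4 concepts; the 2021 edge lies after the cutoff.
edges = [(0, 1, 2015), (1, 2, 2018), (0, 3, 2021)]
pairs = unconnected_pairs(4, edges, cutoff_year=2019)
```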

workflow

.
├── train_model_2019_run.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022).
├── train_model_2019_condition.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022, conditioned on existence in the future)
├── train_model_2019_individual_feature.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022) on individual features
└── train_model_2022_run.py: Training 2019 -> 2022 (for real future predictions of 2025)
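
The training and evaluation windows listed above (2016 -> 2019 and 2019 -> 2022) can be pictured with a small labelling sketch. Everything here is an assumption made for illustration: the helper `label_pairs`, the edge tuples `(u, v, year, citations)`, and in particular `impact_threshold`, which is a hypothetical parameter and not a value taken from this repository.

```python
def label_pairs(pairs, edges, start_year, end_year, impact_threshold):
    """Label a pair 1 if its edges created in (start_year, end_year] gather enough citations."""
    citations = {}
    for u, v, year, cites in edges:
        if start_year < year <= end_year:
            key = (min(u, v), max(u, v))
            citations[key] = citations.get(key, 0) + cites
    return [
        1 if citations.get((min(u, v), max(u, v)), 0) >= impact_threshold else 0
        for u, v in pairs
    ]

# Hypothetical toy edges: (u, v, publication year, citations).
edges = [(0, 2, 2017, 3), (0, 2, 2018, 9), (1, 3, 2018, 2), (2, 3, 2021, 50)]
pairs = [(0, 2), (1, 3), (2, 3)]

# Same pairs, two windows: one for training, one for evaluation.
train_labels = label_pairs(pairs, edges, 2016, 2019, impact_threshold=10)
eval_labels = label_pairs(pairs, edges, 2019, 2022, impact_threshold=10)
```

Using the same function for both windows keeps the training and evaluation labels comparable; only the year range changes.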
Feature descriptions for an unconnected pair of concepts (u, v)

| Feature Type | Feature Index | Feature Description |
| --- | --- | --- |
| node feature | 0-5 | Number of neighbors for each node ($u$ or $v$) until the year $y$, $y{-}1$, $y{-}2$; denoted as $N_{u,y}$, $N_{v,y}$, $N_{u,y-1}$, $N_{v,y-1}$, $N_{u,y-2}$, and $N_{v,y-2}$, ordered as indices 0–5 |
| node feature | 6-7 | Number of new neighbors for each node ($u$ or $v$) between year $y{-}1$ and $y$; i.e., $N_{u,y}{-}N_{u,y-1}$ and $N_{v,y}{-}N_{v,y-1}$ |
| node feature | 8-9 | Number of new neighbors for each node ($u$ or $v$) between year $y{-}2$ and $y$; i.e., $N_{u,y}{-}N_{u,y-2}$ and $N_{v,y}{-}N_{v,y-2}$ |
| node feature | 10-11 | Rank of the number of new neighbors for each node ($u$ or $v$) between year $y{-}1$ and $y$; i.e., rank($N_{u,y}{-}N_{u,y-1}$) and rank($N_{v,y}{-}N_{v,y-1}$) |
| node feature | 12-13 | Rank of the number of new neighbors for each node ($u$ or $v$) between year $y{-}2$ and $y$; i.e., rank($N_{u,y}{-}N_{u,y-2}$) and rank($N_{v,y}{-}N_{v,y-2}$) |
| node feature | 14-19 | PageRank scores of each node ($u$ or $v$) until the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{PR}_{u,y}$, $\mathrm{PR}_{v,y}$, $\mathrm{PR}_{u,y-1}$, $\mathrm{PR}_{v,y-1}$, $\mathrm{PR}_{u,y-2}$ and $\mathrm{PR}_{v,y-2}$ |
| node citation feature | 20-25 | Yearly citations for each node ($u$ or $v$) in year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{C}_{u,y}$, $\mathrm{C}_{v,y}$, $\mathrm{C}_{u,y-1}$, $\mathrm{C}_{v,y-1}$, $\mathrm{C}_{u,y-2}$ and $\mathrm{C}_{v,y-2}$ |
| node citation feature | 26-31 | Total citations for each node ($u$ or $v$) since the first publication to the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{Ct}_{u,y}$, $\mathrm{Ct}_{v,y}$, $\mathrm{Ct}_{u,y-1}$, $\mathrm{Ct}_{v,y-1}$, $\mathrm{Ct}_{u,y-2}$ and $\mathrm{Ct}_{v,y-2}$ |
| node citation feature | 32-37 | Total citations for each node ($u$ or $v$) in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{Ct}^{\Delta 3}_{u,y}$, $\mathrm{Ct}^{\Delta 3}_{v,y}$, $\mathrm{Ct}^{\Delta 3}_{u,y{-}1}$, $\mathrm{Ct}^{\Delta 3}_{v,y{-}1}$, $\mathrm{Ct}^{\Delta 3}_{u,y{-}2}$, and $\mathrm{Ct}^{\Delta 3}_{v,y{-}2}$ |
| node citation feature | 38-43 | Number of papers mentioning node $u$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$, similar for node $v$; denoted and ordered as $\mathrm{Pn}_{u,y}$, $\mathrm{Pn}_{v,y}$, $\mathrm{Pn}_{u,y-1}$, $\mathrm{Pn}_{v,y-1}$, $\mathrm{Pn}_{u,y-2}$, and $\mathrm{Pn}_{v,y-2}$ |
| node citation feature | 44-49 | Average yearly citations for each node ($u$ or $v$) in the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Cm}_{u,y}$, $\mathrm{Cm}_{v,y}$, $\mathrm{Cm}_{u,y-1}$, $\mathrm{Cm}_{v,y-1}$, $\mathrm{Cm}_{u,y-2}$ and $\mathrm{Cm}_{v,y-2}$; e.g., $\mathrm{Cm}_{u,y}=\mathrm{C}_{u,y}/\mathrm{Pn}_{u,y}$ |
| node citation feature | 50-55 | Average total citations for each node ($u$ or $v$) since the first publication to the years $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Ctm}_{u,y}$, $\mathrm{Ctm}_{v,y}$, $\mathrm{Ctm}_{u,y-1}$, $\mathrm{Ctm}_{v,y-1}$, $\mathrm{Ctm}_{u,y-2}$ and $\mathrm{Ctm}_{v,y-2}$; e.g., $\mathrm{Ctm}_{u,y}=\mathrm{Ct}_{u,y}/\mathrm{Pn}_{u,y}$ |
| node citation feature | 56-61 | Average total citations for each node ($u$ or $v$) in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{Ctm}^{\Delta 3}_{u,y}$, $\mathrm{Ctm}^{\Delta 3}_{v,y}$, $\mathrm{Ctm}^{\Delta 3}_{u,y-1}$, $\mathrm{Ctm}^{\Delta 3}_{v,y-1}$, $\mathrm{Ctm}^{\Delta 3}_{u,y-2}$ and $\mathrm{Ctm}^{\Delta 3}_{v,y-2}$; e.g., $\mathrm{Ctm}^{\Delta 3}_{u,y}=\mathrm{Ct}^{\Delta 3}_{u,y}/\mathrm{Pn}_{u,y}$ |
| node citation feature | 62-63 | New citations for each node ($u$ or $v$) between years $y{-}1$ and $y$; i.e., $\mathrm{Ct}_{u,y}{-}\mathrm{Ct}_{u,y-1}$ and $\mathrm{Ct}_{v,y}{-}\mathrm{Ct}_{v,y-1}$ |
| node citation feature | 64-65 | New citations for each node ($u$ or $v$) between years $y{-}2$ and $y$; i.e., $\mathrm{Ct}_{u,y}{-}\mathrm{Ct}_{u,y-2}$ and $\mathrm{Ct}_{v,y}{-}\mathrm{Ct}_{v,y-2}$ |
| node citation feature | 66-67 | Rank of the new citations for each node ($u$ or $v$) between years $y{-}1$ and $y$; i.e., rank($\mathrm{C}_{u,y}{-}\mathrm{C}_{u,y-1}$) and rank($\mathrm{C}_{v,y}{-}\mathrm{C}_{v,y-1}$) |
| node citation feature | 68-69 | Rank of the new citations for each node ($u$ or $v$) between years $y{-}2$ and $y$; i.e., rank($\mathrm{C}_{u,y}{-}\mathrm{C}_{u,y-2}$) and rank($\mathrm{C}_{v,y}{-}\mathrm{C}_{v,y-2}$) |
| node citation feature | 70-71 | Number of papers mentioning node $u$ between years $y{-}1$ and $y$, similar for node $v$; i.e., $\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-1}$ and $\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-1}$ |
| node citation feature | 72-73 | Number of papers mentioning node $u$ between years $y{-}2$ and $y$, similar for node $v$; i.e., $\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-2}$ and $\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-2}$ |
| node citation feature | 74-75 | Rank of the number of papers mentioning node $u$ between years $y{-}1$ and $y$, similar for node $v$; i.e., rank($\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-1}$) and rank($\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-1}$) |
| node citation feature | 76-77 | Rank of the number of papers mentioning node $u$ between years $y{-}2$ and $y$, similar for node $v$; i.e., rank($\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-2}$) and rank($\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-2}$) |
| pair feature | 78-80 | Number of shared neighbors between nodes $u$ and $v$ until the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Ns}_{y}$, $\mathrm{Ns}_{y-1}$ and $\mathrm{Ns}_{y-2}$; e.g., $\mathrm{Ns}_{y}=\mathrm{N}_{u,y} \cap \mathrm{N}_{v,y}$ |
| pair feature | 81-83 | Geometric similarity coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{Geo}_{y}$, $\mathrm{Geo}_{y-1}$, and $\mathrm{Geo}_{y-2}$; e.g., $\mathrm{Geo}_{y} = \mathrm{Ns}_{y}^{2}/(\mathrm{N}_{u,y}\times \mathrm{N}_{v,y})$ |
| pair feature | 84-86 | Cosine similarity coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Cos}_{y}$, $\mathrm{Cos}_{y-1}$, and $\mathrm{Cos}_{y-2}$; e.g., $\mathrm{Cos}_{y} = \sqrt{\mathrm{Geo}_{y}}$ |
| pair feature | 87-89 | Simpson coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Sim}_{y}$, $\mathrm{Sim}_{y-1}$, and $\mathrm{Sim}_{y-2}$; e.g., $\mathrm{Sim}_{y} = \mathrm{Ns}_{y}/\min(\mathrm{N}_{u,y}, \mathrm{N}_{v,y})$ |
| pair feature | 90-92 | Preferential attachment coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Pre}_{y}$, $\mathrm{Pre}_{y-1}$, and $\mathrm{Pre}_{y-2}$; e.g., $\mathrm{Pre}_{y} =\mathrm{N}_{u,y}\times \mathrm{N}_{v,y}$ |
| pair feature | 93-95 | Sørensen–Dice coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Sor}_{y}$, $\mathrm{Sor}_{y-1}$, and $\mathrm{Sor}_{y-2}$; e.g., $\mathrm{Sor}_{y} = 2\mathrm{Ns}_{y}/(\mathrm{N}_{u,y}+\mathrm{N}_{v,y})$ |
| pair feature | 96-98 | Jaccard coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{Jac}_{y}$, $\mathrm{Jac}_{y-1}$, and $\mathrm{Jac}_{y-2}$; e.g., $\mathrm{Jac}_{y} = \mathrm{Ns}_{y}/(\mathrm{N}_{u,y}+\mathrm{N}_{v,y}-\mathrm{Ns}_{y})$ |
| pair citation feature | 99-101 | Ratio of the sum of citations received by nodes $u$ and $v$ until the year $y$ to the total number of papers mentioning either concept, similar for years $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{r1}_{y}$, $\mathrm{r1}_{y-1}$, and $\mathrm{r1}_{y-2}$; e.g., $\mathrm{r1}_{y}=(\mathrm{Ct}_{u,y} + \mathrm{Ct}_{v,y}) / (\mathrm{Pn}_{u,y}+\mathrm{Pn}_{v,y})$ |
| pair citation feature | 102-104 | Ratio of the product of citations received by nodes $u$ and $v$ until the year $y$ to the total number of papers mentioning either concept, similar for years $y{-}1$, $y{-}2$; denoted and ordered as $\mathrm{r2}_{y}$, $\mathrm{r2}_{y-1}$, and $\mathrm{r2}_{y-2}$; e.g., $\mathrm{r2}_{y}=(\mathrm{Ct}_{u,y} \times \mathrm{Ct}_{v,y}) / (\mathrm{Pn}_{u,y}+\mathrm{Pn}_{v,y})$ |
| pair citation feature | 105-107 | Sum of average citations received by nodes $u$ and $v$ in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{s}_{y}$, $\mathrm{s}_{y-1}$, and $\mathrm{s}_{y-2}$; e.g., $\mathrm{s}_{y}=\mathrm{Cm}_{u,y}+\mathrm{Cm}_{v,y}$ |
| pair citation feature | 108-110 | Sum of average total citations received by nodes $u$ and $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{st}_{y}$, $\mathrm{st}_{y-1}$, and $\mathrm{st}_{y-2}$; e.g., $\mathrm{st}_{y}=\mathrm{Ctm}_{u,y}+\mathrm{Ctm}_{v,y}$ |
| pair citation feature | 111-113 | Sum of the total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{st}^{\Delta 3}_{y}$, $\mathrm{st}^{\Delta 3}_{y-1}$, and $\mathrm{st}^{\Delta 3}_{y-2}$; e.g., $\mathrm{st}^{\Delta 3}_{y}=\mathrm{Ct}^{\Delta 3}_{u,y}+\mathrm{Ct}^{\Delta 3}_{v,y}$ |
| pair citation feature | 114-116 | Sum of average total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{stm}^{\Delta 3}_{y}$, $\mathrm{stm}^{\Delta 3}_{y-1}$, and $\mathrm{stm}^{\Delta 3}_{y-2}$; e.g., $\mathrm{stm}^{\Delta 3}_{y}=\mathrm{Ctm}^{\Delta 3}_{u,y}+\mathrm{Ctm}^{\Delta 3}_{v,y}$ |
| pair citation feature | 117-119 | Minimum number of citations received by either node $u$ or $v$ in years $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{minC}_{y}$, $\mathrm{minC}_{y-1}$, and $\mathrm{minC}_{y-2}$; e.g., $\mathrm{minC}_{y}=\min(\mathrm{C}_{u,y}, \mathrm{C}_{v,y})$ |
| pair citation feature | 120-122 | Maximum number of citations received by either node $u$ or $v$ in years $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{maxC}_{y}$, $\mathrm{maxC}_{y-1}$, and $\mathrm{maxC}_{y-2}$; e.g., $\mathrm{maxC}_{y}=\max(\mathrm{C}_{u,y}, \mathrm{C}_{v,y})$ |
| pair citation feature | 123-125 | Minimum number of total citations received by nodes $u$ and $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{minCt}_{y}$, $\mathrm{minCt}_{y-1}$, and $\mathrm{minCt}_{y-2}$; e.g., $\mathrm{minCt}_{y}= \min(\mathrm{Ct}_{u,y}, \mathrm{Ct}_{v,y})$ |
| pair citation feature | 126-128 | Maximum number of total citations received by nodes $u$ and $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{maxCt}_{y}$, $\mathrm{maxCt}_{y-1}$, and $\mathrm{maxCt}_{y-2}$; e.g., $\mathrm{maxCt}_{y}=\max(\mathrm{Ct}_{u,y}, \mathrm{Ct}_{v,y})$ |
| pair citation feature | 129-131 | Minimum number of total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{minCt}^{\Delta 3}_{y}$, $\mathrm{minCt}^{\Delta 3}_{y-1}$, and $\mathrm{minCt}^{\Delta 3}_{y-2}$; e.g., $\mathrm{minCt}^{\Delta 3}_{y}= \min(\mathrm{Ct}^{\Delta 3}_{u,y}, \mathrm{Ct}^{\Delta 3}_{v,y})$ |
| pair citation feature | 132-134 | Maximum number of total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{maxCt}^{\Delta 3}_{y}$, $\mathrm{maxCt}^{\Delta 3}_{y-1}$, and $\mathrm{maxCt}^{\Delta 3}_{y-2}$; e.g., $\mathrm{maxCt}^{\Delta 3}_{y}= \max(\mathrm{Ct}^{\Delta 3}_{u,y}, \mathrm{Ct}^{\Delta 3}_{v,y})$ |
| pair citation feature | 135-137 | Minimum number of papers mentioning the node $u$ or node $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{minPn}_{y}$, $\mathrm{minPn}_{y-1}$ and $\mathrm{minPn}_{y-2}$; e.g., $\mathrm{minPn}_{y}= \min(\mathrm{Pn}_{u,y}, \mathrm{Pn}_{v,y})$ |
| pair citation feature | 138-140 | Maximum number of papers mentioning the node $u$ or node $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$; denoted and ordered as $\mathrm{maxPn}_{y}$, $\mathrm{maxPn}_{y-1}$ and $\mathrm{maxPn}_{y-2}$; e.g., $\mathrm{maxPn}_{y}= \max(\mathrm{Pn}_{u,y}, \mathrm{Pn}_{v,y})$ |
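
The pair features with indices 78-98 all derive from the two neighbor sets, so they can be sketched in a few lines. This is a minimal illustration of the formulas listed above, not the repository's feature code: the function name `pair_features`, the set-based input, and the zero-division guards (returning 0.0 when a node has no neighbors) are assumptions made for this example.

```python
import math

def pair_features(neighbors_u, neighbors_v):
    """Compute the similarity coefficients for one pair from two neighbor sets (one year)."""
    n_u, n_v = len(neighbors_u), len(neighbors_v)
    ns = len(neighbors_u & neighbors_v)  # Ns: number of shared neighbors
    # Assumption: coefficients default to 0.0 when a denominator would be zero.
    geo = ns ** 2 / (n_u * n_v) if n_u and n_v else 0.0
    return {
        "Ns": ns,
        "Geo": geo,                                              # Ns^2 / (N_u * N_v)
        "Cos": math.sqrt(geo),                                   # sqrt(Geo)
        "Sim": ns / min(n_u, n_v) if min(n_u, n_v) else 0.0,     # Simpson
        "Pre": n_u * n_v,                                        # preferential attachment
        "Sor": 2 * ns / (n_u + n_v) if n_u + n_v else 0.0,       # Sørensen–Dice
        "Jac": ns / (n_u + n_v - ns) if n_u + n_v - ns else 0.0, # Jaccard
    }

# Hypothetical neighbor sets of u and v at year y; they share neighbors 3 and 4.
features = pair_features({1, 2, 3, 4}, {3, 4, 5, 6})
```

Calling the same function on the neighbor sets at years $y$, $y{-}1$, and $y{-}2$ would give the three values per coefficient that the table orders consecutively.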

To run the benchmark, download the data from 10.5281/zenodo.14527306 and unzip the file in the benchmark_code folder.

benchmark_code
├── loops_fcNN.py: fully connected neural network model
├── loops_transformer.py: transformer model
├── loops_tree.py: random forest model
├── loops_xgboost.py: XGBoost model
└── other Python files: Post-processing; produce Figures 6-8 from the evaluation of the different models.

The fpr_example folder contains three examples for producing Figure 11. They are based on about 10M evaluation samples (2019-2022) with raw outputs from a neural network trained on 2016-2019 data, accessible at 10.5281/zenodo.14527306.
