Impact4Cast

Which scientific concepts, that have never been investigated jointly, will lead to the most impactful research?

📖 Read our paper here:
Forecasting high-impact research topics via machine learning on evolving knowledge graphs
Xuemei Gu, Mario Krenn

Note

Full Dynamic Knowledge Graph and Datasets can be downloaded at 10.5281/zenodo.10692137
Dataset for Benchmark can be downloaded at 10.5281/zenodo.14527306

Prepare an evolving, citation-augmented knowledge graph

Creating a list of scientific concepts

create_concepts
│ 
├── Concept_Corpus
│   ├── s0_get_preprint_metadata.ipynb: Get metadata from chemRxiv, medRxiv, bioRxiv (arXiv data from Kaggle)
│   ├── s1_make_metadate_arxivstyle.ipynb: Preprocessing metadata from different sources
│   ├── s2_combine_all_preprint_metadate.ipynb: Combining metadata
│   ├── s3_get_concepts.ipynb: Use NLP techniques (for instance RAKE) to extract concepts
│   └── s4_improve_concept.ipynb: Further improvements of full concept list
│   
└── Domain_Concept
    ├── s0_prepare_optics_quantum_data.ipynb: Get papers for specific domain (optics and quantum physics in our case).
    ├── s1_split_domain_papers.py: Prepare data for parallelization.
    ├── s2_get_domain_concepts.py: Get domain-specific vertices in full concept list.
    ├── s3_merge_concepts.py: Postprocessing domain-specific concepts
    ├── s4_improve_concepts.ipynb: Further improve concept lists
    ├── s5_improve_manually_concepts.py: Manually inspect the concepts in the very end for grammar, non-conceptual phrases, verbs, ordinal numbers, conjunctions, adverbials and so on, to improve quality
    └── full_domain_concepts.txt: Final list of 37,960 concepts (represent vertices of knowledge graph)
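
The concept-extraction step (s3_get_concepts.ipynb) mentions RAKE as one possible technique. The following is a minimal, self-contained sketch of the RAKE scoring idea, not the notebook's actual code: the tiny stopword list, the function names, and the example sentence are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (a real run would use a full one).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or",
             "we", "is", "are", "to", "with", "by", "that", "this"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each phrase as the sum of deg(w)/freq(w) over its words."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

scores = rake_scores("We study quantum entanglement in a photonic crystal.")
```

Longer phrases score higher here, which is why later cleaning steps (such as s4 and s5 above) are useful for removing non-conceptual words like "study".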

Creating dynamic knowledge graph

create_dynamic_edges
├── _get_openalex_workdata.py: Get metadata from OpenAlex
├── _get_openalex_workdata_parallel_run1.py: Get parts of the metadata from OpenAlex (run in many parts)
├── get_concept_pairs.py: Create edges of the knowledge graph (edges carry the time and citation information).
├── merge_concept_pairs.py: Combining edges files
└── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic knowledge graph
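
As a rough illustration of what "edges carry the time and citation information" can mean, the sketch below emits one edge record per concept pair that co-occurs in a paper. It is not the code of get_concept_pairs.py: the paper dictionary keys, the plain substring matching, and the tuple layout are assumptions.

```python
from itertools import combinations

def concept_pairs(papers, concepts):
    """Return (u, v, year, citations) for every concept pair co-occurring in a paper."""
    edges = []
    for paper in papers:
        text = paper["text"].lower()
        # Assumption: a concept counts as "mentioned" if it appears as a substring.
        found = sorted(c for c in concepts if c in text)
        for u, v in combinations(found, 2):
            edges.append((u, v, paper["year"], paper["citations"]))
    return edges

# Hypothetical toy input.
papers = [
    {"text": "Quantum entanglement in a photonic crystal", "year": 2018, "citations": 12},
    {"text": "Photonic crystal cavities", "year": 2020, "citations": 3},
]
concepts = ["quantum entanglement", "photonic crystal"]
edges = concept_pairs(papers, concepts)
```

Only the first paper mentions both concepts, so it contributes the single edge; the second paper mentions one concept and contributes none.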

Prepare other data

.
├── prepare_unconnected_pair_solution.ipynb: Find unconnected concept pairs (for training, testing and evaluating)
├── prepare_adjacency_pagerank.py: Prepare dynamic knowledge graph and compute properties
├── prepare_node_pair_citation_data_years.ipynb: Prepare citation data for both individual concept nodes and concept pairs for specific years
│
├──create_dynamic_concepts
│  ├── get_concept_citation.py: Create dynamic concepts from the knowledge graph (concepts carry the time and citation information). 
│  ├── merge_concept_citation.py: Combining dynamic concepts files
│  ├── process_concept_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│  ├── merge_concept_pairs.py: Combining dynamic concepts
│  └── process_edge_to_pandas_frame.py: Post-processing, store the full dynamic concepts
│
└──prepare_eval_data
   ├── prepare_eval_feature_data.py: Prepare features of knowledge graph (for evaluation dataset)
   └── prepare_eval_feature_data_condition.py: Prepare features of knowledge graph (for evaluation dataset, conditioned on existence in the future)
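
The idea of collecting unconnected concept pairs at a cutoff year can be sketched with rejection sampling. This is an illustrative assumption about the approach, not the content of prepare_unconnected_pair_solution.ipynb; the function name, the integer node ids, and the (u, v, year) edge layout are hypothetical.

```python
import random

def sample_unconnected_pairs(edges, num_nodes, cutoff_year, n_samples, seed=0):
    """Sample distinct node pairs that share no edge up to and including cutoff_year."""
    connected = {
        (min(u, v), max(u, v))
        for u, v, year in edges
        if year <= cutoff_year and u != v
    }
    # Never ask for more pairs than actually exist, so the loop always terminates.
    max_pairs = num_nodes * (num_nodes - 1) // 2 - len(connected)
    n_samples = min(n_samples, max_pairs)

    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_samples:
        u, v = rng.sample(range(num_nodes), 2)
        pair = (min(u, v), max(u, v))
        if pair not in connected:
            pairs.add(pair)
    return sorted(pairs)

# Hypothetical toy graph: the edge (0, 2) appears only after the cutoff year.
toy_edges = [(0, 1, 2015), (1, 2, 2018), (0, 2, 2021)]
unconnected = sample_unconnected_pairs(toy_edges, num_nodes=4, cutoff_year=2019, n_samples=10)
```

In this toy graph the pair (0, 2) is unconnected at the 2019 cutoff but connected by 2021, which is the kind of pair whose later outcome a forecasting model is trained to predict.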

🤖Forecasting with Neural Network

.
├── train_model_2019_run.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022).
├── train_model_2019_condition.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022, conditioned on existence in the future)
├── train_model_2019_individual_feature.py: Training neural network from 2016 -> 2019 (evaluated from 2019 -> 2022) on individual features
└── train_model_2022_run.py: Training neural network from 2019 -> 2022 (for real future predictions of 2025)

Feature descriptions for an unconnected pair of concepts (u, v)

Feature Type	Feature Index	Feature Description
node feature	0-5	Number of neighbors for each node ($u$ or $v$) until the year $y$, $y{-}1$, $y{-}2$ denoted as $N_{u,y}$, $N_{v,y}$, $N_{u,y-1}$, $N_{v,y-1}$, $N_{u,y-2}$, and $N_{v,y-2}$, ordered as indices 0–5
	6-7	Number of new neighbors for each node ($u$ or $v$) between year $y{-}1$ and $y$ i.e., $N_{u,y}{-}N_{u,y-1}$ and $N_{v,y}{-}N_{v,y-1}$
	8-9	Number of new neighbors for each node ($u$ or $v$) between year $y{-}2$ and $y$ i.e., $N_{u,y}{-}N_{u,y-2}$ and $N_{v,y}{-}N_{v,y-2}$
	10-11	Rank of the number of new neighbors for each node ($u$ or $v$) between year $y{-}1$ and $y$ i.e., rank($N_{u,y}{-}N_{u,y-1}$) and rank($N_{v,y}{-}N_{v,y-1}$)
	12-13	Rank of the number of new neighbors for each node ($u$ or $v$) between year $y{-}2$ and $y$ i.e., rank($N_{u,y}{-}N_{u,y-2}$) and rank($N_{v,y}{-}N_{v,y-2}$)
	14-19	PageRank scores of each node ($u$ or $v$) until the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{PR}_{u,y}$, $\mathrm{PR}_{v,y}$, $\mathrm{PR}_{u,y-1}$, $\mathrm{PR}_{v,y-1}$, $\mathrm{PR}_{u,y-2}$ and $\mathrm{PR}_{v,y-2}$
node citation feature	20-25	Yearly citation for each node ($u$ or $v$) in year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{C}_{u,y}$, $\mathrm{C}_{v,y}$, $\mathrm{C}_{u,y-1}$, $\mathrm{C}_{v,y-1}$, $\mathrm{C}_{u,y-2}$ and $\mathrm{C}_{v,y-2}$
	26-31	Total citation for each node ($u$ or $v$) since the first publication to the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{Ct}_{u,y}$, $\mathrm{Ct}_{v,y}$, $\mathrm{Ct}_{u,y-1}$, $\mathrm{Ct}_{v,y-1}$, $\mathrm{Ct}_{u,y-2}$ and $\mathrm{Ct}_{v,y-2}$
	32-37	Total citations for each node ($u$ or $v$) in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{Ct}^{\Delta 3}_{u,y}$, $\mathrm{Ct}^{\Delta 3}_{v,y}$, $\mathrm{Ct}^{\Delta 3}_{u,y{-}1}$, $\mathrm{Ct}^{\Delta 3}_{v,y{-}1}$, $\mathrm{Ct}^{\Delta 3}_{u,y{-}2}$, and $\mathrm{Ct}^{\Delta 3}_{v,y{-}2}$
	38-43	Number of papers mentioning node $u$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$, similar for node $v$ denoted and ordered as $\mathrm{Pn}_{u,y}$, $\mathrm{Pn}_{v,y}$, $\mathrm{Pn}_{u,y-1}$, $\mathrm{Pn}_{v,y-1}$, $\mathrm{Pn}_{u,y-2}$, and $\mathrm{Pn}_{v,y-2}$
	44-49	Average yearly citations for each node ($u$ or $v$) in the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Cm}_{u,y}$, $\mathrm{Cm}_{v,y}$, $\mathrm{Cm}_{u,y-1}$, $\mathrm{Cm}_{v,y-1}$, $\mathrm{Cm}_{u,y-2}$ and $\mathrm{Cm}_{v,y-2}$ e.g., $\mathrm{Cm}_{u,y}=\mathrm{C}_{u,y}/\mathrm{Pn}_{u,y}$
	50-55	Average total citations for each node ($u$ or $v$) since the first publications to the years $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Ctm}_{u,y}$, $\mathrm{Ctm}_{v,y}$, $\mathrm{Ctm}_{u,y-1}$, $\mathrm{Ctm}_{v,y-1}$, $\mathrm{Ctm}_{u,y-2}$ and $\mathrm{Ctm}_{v,y-2}$; e.g., $\mathrm{Ctm}_{u,y}=\mathrm{Ct}_{u,y}/\mathrm{Pn}_{u,y}$
	56-61	Average total citations for each node ($u$ or $v$) in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{Ctm}^{\Delta 3}_{u,y}$, $\mathrm{Ctm}^{\Delta 3}_{v,y}$, $\mathrm{Ctm}^{\Delta 3}_{u,y-1}$, $\mathrm{Ctm}^{\Delta 3}_{v,y-1}$, $\mathrm{Ctm}^{\Delta 3}_{u,y-2}$ and $\mathrm{Ctm}^{\Delta 3}_{v,y-2}$ e.g., $\mathrm{Ctm}^{\Delta 3}_{u,y}=\mathrm{Ct}^{\Delta 3}_{u,y}/\mathrm{Pn}_{u,y}$
	62-63	New citations for each node ($u$ or $v$) between years $y{-}1$ and $y$ i.e., $\mathrm{Ct}_{u,y}{-}\mathrm{Ct}_{u,y-1}$ and $\mathrm{Ct}_{v,y}{-}\mathrm{Ct}_{v,y-1}$
	64-65	New citations for each node ($u$ or $v$) between years $y{-}2$ and $y$ i.e., $\mathrm{Ct}_{u,y}{-}\mathrm{Ct}_{u,y-2}$ and $\mathrm{Ct}_{v,y}{-}\mathrm{Ct}_{v,y-2}$
	66-67	Rank of the new citations for each node ($u$ or $v$) between years $y{-}1$ and $y$ i.e., rank($\mathrm{Ct}_{u,y}{-}\mathrm{Ct}_{u,y-1}$) and rank($\mathrm{Ct}_{v,y}{-}\mathrm{Ct}_{v,y-1}$)
	68-69	Rank of the new citations for each node ($u$ or $v$) between years $y{-}2$ and $y$ i.e., rank($\mathrm{Ct}_{u,y}{-}\mathrm{Ct}_{u,y-2}$) and rank($\mathrm{Ct}_{v,y}{-}\mathrm{Ct}_{v,y-2}$)
	70-71	Number of papers mentioning node $u$ between years $y{-}1$ and $y$, similar for node $v$ i.e., $\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-1}$ and $\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-1}$
	72-73	Number of papers mentioning node $u$ between years $y{-}2$ and $y$, similar for node $v$ i.e., $\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-2}$ and $\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-2}$
	74-75	Rank of the number of papers mentioning node $u$ between years $y{-}1$ and $y$, similar for node $v$ i.e., rank($\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-1}$) and rank($\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-1}$)
	76-77	Rank of the number of papers mentioning node $u$ between years $y{-}2$ and $y$, similar for node $v$ i.e., rank($\mathrm{Pn}_{u,y}-\mathrm{Pn}_{u,y-2}$) and rank($\mathrm{Pn}_{v,y}-\mathrm{Pn}_{v,y-2}$)
pair feature	78-80	Number of shared neighbors between nodes $u$ and $v$ until the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Ns}_{y}$, $\mathrm{Ns}_{y-1}$ and $\mathrm{Ns}_{y-2}$; e.g., $\mathrm{Ns}_{y}=\mathrm{N}_{u,y} \cap \mathrm{N}_{v,y}$
	81-83	Geometric similarity coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{Geo}_{y}$, $\mathrm{Geo}_{y-1}$, and $\mathrm{Geo}_{y-2}$; e.g., $\mathrm{Geo}_{y} = \mathrm{Ns}_{y}^{2}/(\mathrm{N}_{u,y}\times \mathrm{N}_{v,y})$
	84-86	Cosine similarity coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Cos}_{y}$, $\mathrm{Cos}_{y-1}$, and $\mathrm{Cos}_{y-2}$; e.g., $\mathrm{Cos}_{y} = \sqrt{\mathrm{Geo}_{y}}$
	87-89	Simpson coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Sim}_{y}$, $\mathrm{Sim}_{y-1}$, and $\mathrm{Sim}_{y-2}$; e.g., $\mathrm{Sim}_{y} = \mathrm{Ns}_{y}/\min(\mathrm{N}_{u,y}, \mathrm{N}_{v,y})$
	90-92	Preferential attachment coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Pre}_{y}$, $\mathrm{Pre}_{y-1}$, and $\mathrm{Pre}_{y-2}$; e.g., $\mathrm{Pre}_{y} =\mathrm{N}_{u,y}\times \mathrm{N}_{v,y}$
	93-95	Sørensen–Dice coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Sor}_{y}$, $\mathrm{Sor}_{y-1}$, and $\mathrm{Sor}_{y-2}$; e.g., $\mathrm{Sor}_{y} = 2\mathrm{Ns}_{y}/(\mathrm{N}_{u,y}+\mathrm{N}_{v,y})$
	96-98	Jaccard coefficient for the pair $(u, v)$ for the year $y$, $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{Jac}_{y}$, $\mathrm{Jac}_{y-1}$, and $\mathrm{Jac}_{y-2}$; e.g., $\mathrm{Jac}_{y} = \mathrm{Ns}_{y}/(\mathrm{N}_{u,y}+\mathrm{N}_{v,y}-\mathrm{Ns}_{y})$
pair citation feature	99-101	Ratio of the sum of citations received by nodes $u$ and $v$ until the year $y$ to the total number of papers mentioning either concept, similar for years $y{-}1$, $y{-}2$ denoted and ordered as $\mathrm{r1}_{y}$, $\mathrm{r1}_{y-1}$, and $\mathrm{r1}_{y-2}$; e.g., $\mathrm{r1}_{y}=(\mathrm{Ct}_{u,y}+\mathrm{Ct}_{v,y}) / (\mathrm{Pn}_{u,y}+\mathrm{Pn}_{v,y})$
	102-104	Ratio of the product of citations received by nodes $u$ and $v$ until the year $y$ to the total number of papers mentioning either concept, similar for years $y-1$, $y-2$ denoted and ordered as $\mathrm{r2}_{y}$, $\mathrm{r2}_{y-1}$, and $\mathrm{r2}_{y-2}$; e.g., $\mathrm{r2}_{y}=(\mathrm{Ct}_{u,y} \times \mathrm{Ct}_{v,y}) / (\mathrm{Pn}_{u,y}+\mathrm{Pn}_{v,y})$
	105-107	Sum of average citations received by nodes $u$ and $v$ in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{s}_{y}$, $\mathrm{s}_{y-1}$, and $\mathrm{s}_{y-2}$; e.g., $\mathrm{s}_{y}=\mathrm{Cm}_{u,y}+\mathrm{Cm}_{v,y}$
	108-110	Sum of average total citations received by nodes $u$ and $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{st}_{y}$, $\mathrm{st}_{y-1}$, and $\mathrm{st}_{y-2}$; e.g., $\mathrm{st}_{y}=\mathrm{Ctm}_{u,y}+\mathrm{Ctm}_{v,y}$
	111-113	Sum of the total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{st}^{\Delta 3}_{y}$, $\mathrm{st}^{\Delta 3}_{y-1}$, and $\mathrm{st}^{\Delta 3}_{y-2}$; e.g., $\mathrm{st}^{\Delta 3}_{y}=\mathrm{Ct}^{\Delta 3}_{u,y}+\mathrm{Ct}^{\Delta 3}_{v,y}$
	114-116	Sum of average total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{stm}^{\Delta 3}_{y}$, $\mathrm{stm}^{\Delta 3}_{y-1}$, and $\mathrm{stm}^{\Delta 3}_{y-2}$; e.g., $\mathrm{stm}^{\Delta 3}_{y}=\mathrm{Ctm}^{\Delta 3}_{u,y}+\mathrm{Ctm}^{\Delta 3}_{v,y}$
	117-119	Minimum number of citations received by either node $u$ or $v$ in years $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{minC}_{y}$, $\mathrm{minC}_{y-1}$, and $\mathrm{minC}_{y-2}$; e.g., $\mathrm{minC}_{y}=\min(\mathrm{C}_{u,y}, \mathrm{C}_{v,y})$
	120-122	Maximum number of citations received by either node $u$ or $v$ in years $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{maxC}_{y}$, $\mathrm{maxC}_{y-1}$, and $\mathrm{maxC}_{y-2}$; e.g., $\mathrm{maxC}_{y}=\max(\mathrm{C}_{u,y}, \mathrm{C}_{v,y})$
	123-125	Minimum number of total citations received by nodes $u$ and $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{minCt}_{y}$, $\mathrm{minCt}_{y-1}$, and $\mathrm{minCt}_{y-2}$; e.g., $\mathrm{minCt}_{y}= \min(\mathrm{Ct}_{u,y}, \mathrm{Ct}_{v,y})$
	126-128	Maximum number of total citations received by nodes $u$ and $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{maxCt}_{y}$, $\mathrm{maxCt}_{y-1}$, and $\mathrm{maxCt}_{y-2}$; e.g., $\mathrm{maxCt}_{y}=\max(\mathrm{Ct}_{u,y}, \mathrm{Ct}_{v,y})$
	129-131	Minimum number of total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{minCt}^{\Delta 3}_{y}$, $\mathrm{minCt}^{\Delta 3}_{y-1}$, and $\mathrm{minCt}^{\Delta 3}_{y-2}$; e.g., $\mathrm{minCt}^{\Delta 3}_{y}= \min(\mathrm{Ct}^{\Delta 3}_{u,y}, \mathrm{Ct}^{\Delta 3}_{v,y})$.
	132-134	Maximum number of total citations received by nodes $u$ and $v$ in the last 3 years ending in the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{maxCt}^{\Delta 3}_{y}$, $\mathrm{maxCt}^{\Delta 3}_{y-1}$, and $\mathrm{maxCt}^{\Delta 3}_{y-2}$; e.g., $\mathrm{maxCt}^{\Delta 3}_{y}= \max(\mathrm{Ct}^{\Delta 3}_{u,y}, \mathrm{Ct}^{\Delta 3}_{v,y})$.
	135-137	Minimum number of papers mentioning the node $u$ or node $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{minPn}_{y}$, $\mathrm{minPn}_{y-1}$ and $\mathrm{minPn}_{y-2}$; e.g., $\mathrm{minPn}_{y}= \min(\mathrm{Pn}_{u,y}, \mathrm{Pn}_{v,y})$
	138-140	Maximum number of papers mentioning the node $u$ or node $v$ from the first publication to the year $y$, $y{-}1$, and $y{-}2$ denoted and ordered as $\mathrm{maxPn}_{y}$, $\mathrm{maxPn}_{y-1}$ and $\mathrm{maxPn}_{y-2}$; e.g., $\mathrm{maxPn}_{y}= \max(\mathrm{Pn}_{u,y}, \mathrm{Pn}_{v,y})$
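
To make the pair features (indices 78-98) concrete, here is a small sketch that evaluates the formulas from the table for a single year, given the neighbor sets of $u$ and $v$. It is not taken from features_utils.py: the function name, the dictionary keys, and the convention of returning 0 when a denominator is zero are assumptions.

```python
import math

def pair_features(neighbors_u, neighbors_v):
    """Compute the pair features of the table from two neighbor sets (one year)."""
    n_u, n_v = len(neighbors_u), len(neighbors_v)
    ns = len(neighbors_u & neighbors_v)  # Ns: number of shared neighbors

    def safe_div(a, b):
        # Assumption: isolated nodes yield 0 instead of a division error.
        return a / b if b else 0.0

    geo = safe_div(ns ** 2, n_u * n_v)
    return {
        "Ns": ns,
        "Geo": geo,                               # Ns^2 / (N_u * N_v)
        "Cos": math.sqrt(geo),                    # sqrt(Geo)
        "Sim": safe_div(ns, min(n_u, n_v)),       # Simpson
        "Pre": n_u * n_v,                         # preferential attachment
        "Sor": safe_div(2 * ns, n_u + n_v),       # Sørensen–Dice
        "Jac": safe_div(ns, n_u + n_v - ns),      # Jaccard
    }

# Hypothetical neighbor sets: two shared neighbors out of four each.
feats = pair_features({"a", "b", "c", "d"}, {"c", "d", "e", "f"})
```

For this toy input, Ns is 2, Geo is 0.25, Cos, Sim and Sor are 0.5, Pre is 16, and Jac is 1/3. Repeating the call with the neighbor sets at $y{-}1$ and $y{-}2$ would give the remaining two entries of each feature triple.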

Perform benchmarking

Download the benchmark dataset from 10.5281/zenodo.14527306 and unzip it into the benchmark_code folder.

benchmark_code
├── loops_fcNN.py: fully connected neural network model
├── loops_transformer.py: transformer model
├── loops_tree.py: random forest model
├── loops_xgboost.py: XGBoost model
└── other python files: Post-processing; produce Figures 6-8 from the evaluation of the different models.

Three examples with about 10M evaluation samples (2019-2022) and raw outputs from a neural network trained on 2016-2019 data (accessible at 10.5281/zenodo.14527306) are used in the fpr_example folder to produce Figure 11.
