Can LLMs Effectively Leverage Graph Structural Information: When and Why

We provide three main components:

A new dataset arxiv-2023, whose test nodes are chosen from arXiv Computer Science (CS) papers published in 2023.
A unified dataloader for cora, pubmed, ogbn-arxiv, arxiv-2023 and ogbn-product as well as their raw text.
A simple template for testing ChatGPT on these datasets. See template.ipynb.

1. New Dataset: `arxiv-2023`

arxiv-2023 is collected to be compared with ogbn-arxiv. Both datasets represent directed citation networks where each node corresponds to a paper published on arXiv and each edge indicates one paper citing another.

Statistics of `ogbn-arxiv` and `arxiv-2023` datasets

Dataset	#Nodes (Full Dataset)	#Edges (Full Dataset)	In-Degree/Out-Degree (Test Set)	Average Degree (Test Set)	Published Year (Test Set)
`ogbn-arxiv`	169343	1166243	1.33/11.1	12.43	2019
`arxiv-2023`	33868	305672	0.16/10.6	10.76	2023

Proportional distribution of labels in `ogbn-arxiv` and `arxiv-2023` datasets. Each label represents an arXiv Computer Science Category.

2. Unified Dataloader for Datasets and Raw Text

Download Datasets and Raw Text

We provide the dataset and raw text for arxiv-2023 in this repo. You may need to download the dataset and raw text for other datasets.

cora and pubmed: download here. and place the datasets at /dataset/cora/ and /dataset/pubmed/ respectively.
ogbn-arxiv and ogbn-product: as you run the dataloader, ogb will automatically download the dataset for you. But you need to download the raw text by yourself. For ogbn-arxiv, download here and place the file at /dataset/ogbn_arxiv/titleabs.tsv. For ogbn-product, download here and place the folder at /dataset/ogbn-products/Amazon-3M.raw

Set up environment and OpenAI API key

You need to set up your OpenAI API key as OPENAI_API_KEY environment variable. See here for details.

Required packages include openai, pytorch, PyG, ogb etc.

Data Loading API

>>> from utils.utils import load_data
>>> data, text = load_data("arxiv_2023", use_text=True)
>>> print(data)
Data(x=[33868, 128], edge_index=[2, 305672], y=[33868, 1], paper_id=[33868], train_mask=[33868], val_mask=[33868], test_mask=[33868], num_nodes=33868, train_id=[19461], val_id=[4682], test_id=[668])
>>> print(text.keys())
dict_keys(['title', 'abs', 'label', 'id'])

Citation

If you find this repo helpful for your research, please consider citing our paper below.

@misc{huang2023llms,
      title={Can LLMs Effectively Leverage Graph Structural Information: When and Why}, 
      author={Jin Huang and Xingjian Zhang and Qiaozhu Mei and Jiaqi Ma},
      year={2023},
      eprint={2309.16595},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Name	Name	Last commit message	Last commit date
Latest commit Jn-Huang update citation Oct 3, 2023 6a25eca · Oct 3, 2023 History 5 Commits
attention	attention	init repo	Sep 29, 2023
dataset/arxiv_2023	dataset/arxiv_2023	init repo	Sep 29, 2023
few_shot_examples	few_shot_examples	init repo	Sep 29, 2023
figures	figures	update readme	Sep 29, 2023
utils	utils	update readme	Sep 29, 2023
.gitignore	.gitignore	init repo	Sep 29, 2023
README.md	README.md	update citation	Oct 3, 2023
template.ipynb	template.ipynb	update readme	Sep 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Can LLMs Effectively Leverage Graph Structural Information: When and Why

1. New Dataset: `arxiv-2023`

Statistics of `ogbn-arxiv` and `arxiv-2023` datasets

Proportional distribution of labels in `ogbn-arxiv` and `arxiv-2023` datasets. Each label represents an arXiv Computer Science Category.

2. Unified Dataloader for Datasets and Raw Text

Download Datasets and Raw Text

Set up environment and OpenAI API key

Data Loading API

Citation

About

Releases

Packages

Languages

TRAIS-Lab/LLM-Structured-Data

Folders and files

Latest commit

History

Repository files navigation

Can LLMs Effectively Leverage Graph Structural Information: When and Why

1. New Dataset: arxiv-2023

Statistics of ogbn-arxiv and arxiv-2023 datasets

Proportional distribution of labels in ogbn-arxiv and arxiv-2023 datasets. Each label represents an arXiv Computer Science Category.

2. Unified Dataloader for Datasets and Raw Text

Download Datasets and Raw Text

Set up environment and OpenAI API key

Data Loading API

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

1. New Dataset: `arxiv-2023`

Statistics of `ogbn-arxiv` and `arxiv-2023` datasets

Proportional distribution of labels in `ogbn-arxiv` and `arxiv-2023` datasets. Each label represents an arXiv Computer Science Category.

Packages