GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Table of Contents
- What is GraphGen?
- Latest Updates
- Quick Start
- System Architecture
- Acknowledgements
- Citation
- License
- Star History
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please see the paper and the best-practice guide for details.
Below are post-training results for a model whose SFT data is over 50% produced by GraphGen and our data-cleaning pipeline.
| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|---|---|---|---|
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |
GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
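For intuition, expected calibration error measures how far a model's stated confidence drifts from its actual accuracy; knowledge regions where the trainee model is poorly calibrated can be treated as gaps worth targeting. Below is a minimal binned-ECE sketch in Python (illustrative only, not the repository's implementation; the inputs are hypothetical per-question confidences and correctness flags):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in this bin
    return ece

# Illustrative usage: knowledge points with a large calibration gap would be
# prioritized when generating QA pairs.
print(expected_calibration_error([0.95, 0.80, 0.60, 0.30], [1, 0, 1, 0]))
```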
- 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
- 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
- 2025.04.21: We have released the initial version of GraphGen.
Experience GraphGen through the Web demo or the Backup Web Entrance.
For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask there.
- Install uv

  ```bash
  # If you hit network issues, you can install uv with pipx or pip instead; see the uv docs for details
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Clone the repository

  ```bash
  git clone --depth=1 https://github.com/open-sciencelab/GraphGen
  cd GraphGen
  ```
- Create a new uv environment

  ```bash
  uv venv --python 3.10
  ```
- Configure the dependencies and launch the web UI

  ```bash
  uv pip install -r requirements.txt
  uv run webui/app.py
  ```
- Install GraphGen

  ```bash
  uv pip install graphg
  ```
- Run in CLI

  ```bash
  SYNTHESIZER_MODEL=your_synthesizer_model_name \
  SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
  SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
  TRAINEE_MODEL=your_trainee_model_name \
  TRAINEE_BASE_URL=your_base_url_for_trainee_model \
  TRAINEE_API_KEY=your_api_key_for_trainee_model \
  graphg --output_dir cache
  ```
- Configure the environment

  - Create an `.env` file in the root directory:

    ```bash
    cp .env.example .env
    ```

  - Set the following environment variables:

    ```bash
    # Synthesizer is the model used to construct the KG and generate data
    SYNTHESIZER_MODEL=your_synthesizer_model_name
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model

    # Trainee is the model to be trained on the generated data
    TRAINEE_MODEL=your_trainee_model_name
    TRAINEE_BASE_URL=your_base_url_for_trainee_model
    TRAINEE_API_KEY=your_api_key_for_trainee_model
    ```
- (Optional) Customize generation parameters in the `graphgen/configs/` folder by editing the corresponding YAML file, e.g.:

  ```yaml
  # configs/cot_config.yaml
  input_data_type: raw
  input_file: resources/input_examples/raw_demo.jsonl
  output_data_type: cot
  tokenizer: cl100k_base
  # additional settings...
  ```
- Generate data

  Pick the desired format and run the matching script:

  | Format | Script to run | Notes |
  |---|---|---|
  | cot | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q&A pairs |
  | atomic | `bash scripts/generate/generate_atomic.sh` | Atomic Q&A pairs covering basic knowledge |
  | aggregated | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs incorporating complex, integrated knowledge |
  | multi-hop | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q&A pairs |
- Get the generated data (a short inspection sketch follows after these steps)

  ```bash
  ls cache/data/graphgen
  ```
- Build the Docker image

  ```bash
  docker build -t graphgen .
  ```

- Run the Docker container

  ```bash
  docker run -p 7860:7860 graphgen
  ```
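As noted in the "Get the generated data" step above, you can take a quick look at the output with a few lines of Python. This is a minimal sketch that assumes JSON/JSONL-style records under `cache/data/graphgen`; adjust the path and parsing to the files your run actually produces:

```python
import json
from pathlib import Path

# Illustrative only: preview the first record of each generated file.
# The directory layout and file format (assumed JSON/JSONL here) may differ in your run.
out_dir = Path("cache/data/graphgen")
for path in sorted(out_dir.glob("**/*.json*")):
    with path.open(encoding="utf-8") as f:
        first_line = f.readline().strip()
    if not first_line:
        continue
    try:
        record = json.loads(first_line)
    except json.JSONDecodeError:
        record = first_line  # not JSONL; fall back to the raw line
    print(f"{path.name}: {str(record)[:200]}")
```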
See the analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
- SiliconFlow: abundant LLM APIs, some models are free
- LightRAG: simple and efficient graph retrieval solution
- ROGRAG: a robustly optimized GraphRAG framework
- DB-GPT: an AI-native data app development framework
If you find this repository useful, please consider citing our work:
@misc{chen2025graphgenenhancingsupervisedfinetuning,
title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
year={2025},
eprint={2505.20416},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.20416},
}
This project is licensed under the Apache License 2.0.