GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Table of Contents
- What is GraphGen?
- Latest Updates
- Quick Start
- System Architecture
- Acknowledgements
- Citation
- License
- Star History
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please see the paper and the best-practice guide for details.
Below are post-training results for a model whose SFT data is over 50% produced by GraphGen and our data-cleaning pipeline.
| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|---|---|---|---|
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Knowledge | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |
GraphGen begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error (ECE) metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
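For intuition, expected calibration error measures how far a model's stated confidence drifts from its actual accuracy; knowledge regions where the trainee model is poorly calibrated can be treated as gaps worth targeting. Below is a minimal binned-ECE sketch in Python (illustrative only, not the repository's implementation; the inputs are hypothetical per-question confidences and correctness flags):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in this bin
    return ece

# Illustrative usage: knowledge points with a large calibration gap would be
# prioritized when generating QA pairs.
print(expected_calibration_error([0.95, 0.80, 0.60, 0.30], [1, 0, 1, 0]))
```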
- 2025.08.14: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
- 2025.07.31: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
- 2025.04.21: We have released the initial version of GraphGen.
Experience GraphGen through the Web demo or the Backup Web Entrance.
For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask there.
- Install uv

  ```bash
  # If you hit network issues, you can install uv with pipx or pip instead; see the uv docs for details
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```
- Clone the repository

  ```bash
  git clone --depth=1 https://github.com/open-sciencelab/GraphGen
  cd GraphGen
  ```
- Create a new uv environment

  ```bash
  uv venv --python 3.10
  ```
- Configure the dependencies and launch the web UI

  ```bash
  uv pip install -r requirements.txt
  uv run webui/app.py
  ```
- Install GraphGen

  ```bash
  uv pip install graphg
  ```
- Run in CLI

  ```bash
  SYNTHESIZER_MODEL=your_synthesizer_model_name \
  SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
  SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
  TRAINEE_MODEL=your_trainee_model_name \
  TRAINEE_BASE_URL=your_base_url_for_trainee_model \
  TRAINEE_API_KEY=your_api_key_for_trainee_model \
  graphg --output_dir cache
  ```
- Configure the environment

  - Create an `.env` file in the root directory:

    ```bash
    cp .env.example .env
    ```

  - Set the following environment variables:

    ```bash
    # Synthesizer is the model used to construct the KG and generate data
    SYNTHESIZER_MODEL=your_synthesizer_model_name
    SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
    SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model

    # Trainee is the model to be trained on the generated data
    TRAINEE_MODEL=your_trainee_model_name
    TRAINEE_BASE_URL=your_base_url_for_trainee_model
    TRAINEE_API_KEY=your_api_key_for_trainee_model
    ```
- (Optional) Customize generation parameters in the `graphgen/configs/` folder by editing the corresponding YAML file, e.g.:

  ```yaml
  # configs/cot_config.yaml
  input_data_type: raw
  input_file: resources/input_examples/raw_demo.jsonl
  output_data_type: cot
  tokenizer: cl100k_base
  # additional settings...
  ```
- Generate data

  Pick the desired format and run the matching script:

  | Format | Script to run | Notes |
  |---|---|---|
  | cot | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q&A pairs |
  | atomic | `bash scripts/generate/generate_atomic.sh` | Atomic Q&A pairs covering basic knowledge |
  | aggregated | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q&A pairs incorporating complex, integrated knowledge |
  | multi-hop | `bash scripts/generate/generate_multihop.sh` | Multi-hop reasoning Q&A pairs |
- Get the generated data (a short inspection sketch follows after these steps)

  ```bash
  ls cache/data/graphgen
  ```
- Build the Docker image

  ```bash
  docker build -t graphgen .
  ```

- Run the Docker container

  ```bash
  docker run -p 7860:7860 graphgen
  ```
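As noted in the "Get the generated data" step above, you can take a quick look at the output with a few lines of Python. This is a minimal sketch that assumes JSON/JSONL-style records under `cache/data/graphgen`; adjust the path and parsing to the files your run actually produces:

```python
import json
from pathlib import Path

# Illustrative only: preview the first record of each generated file.
# The directory layout and file format (assumed JSON/JSONL here) may differ in your run.
out_dir = Path("cache/data/graphgen")
for path in sorted(out_dir.glob("**/*.json*")):
    with path.open(encoding="utf-8") as f:
        first_line = f.readline().strip()
    if not first_line:
        continue
    try:
        record = json.loads(first_line)
    except json.JSONDecodeError:
        record = first_line  # not JSONL; fall back to the raw line
    print(f"{path.name}: {str(record)[:200]}")
```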
See the analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
- SiliconFlow: abundant LLM APIs, some models are free
- LightRAG: simple and efficient graph retrieval solution
- ROGRAG: a robustly optimized GraphRAG framework
- DB-GPT: an AI-native data app development framework
If you find this repository useful, please consider citing our work:
@misc{chen2025graphgenenhancingsupervisedfinetuning,
title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
year={2025},
eprint={2505.20416},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.20416},
}
This project is licensed under the Apache License 2.0.