A MULTI-GENERATOR ENSEMBLE FRAMEWORK FOR NATURAL LANGUAGE TO SQL

XGenerationLab/XiYan-SQL

Chinese version | English version

XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL

Introduction in Short

XiYan-SQL is an innovative framework for LLM-based Text-to-SQL.

It contains:

  1. M-Schema: a semi-structured schema representation method.

  2. XiYanSQL-QwenCoder-32B: an LLM training strategy, with a fine-tuned generation model for SQLite.

  3. Ensemble Strategy: a multi-generator ensemble strategy with a selection model (to be released soon).

  4. DateResolver: a model with enhanced date understanding and reasoning, mainly for Chinese.

  5. MoMQ: a multi-dialect Text-to-SQL MoE model based on Qwen (to be released soon).

  6. Database Description Generation: a method, with corresponding code, for automatically generating database descriptions for Text-to-SQL (to be released soon).
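To make the idea of a semi-structured schema representation concrete, here is a minimal sketch that serializes a SQLite database into a text block listing tables, columns, types, keys, and a few example values. The layout, the function name `describe_schema`, and the demo database are all assumptions for illustration; this is not the actual M-Schema format.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection, db_id: str, n_examples: int = 2) -> str:
    """Serialize tables, columns, types, keys, and sample values as text.

    Illustrative layout only; not the real M-Schema format.
    """
    lines = [f"[DB_ID] {db_id}", "[Schema]"]
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    foreign_keys = []
    for table in tables:
        lines.append(f"# Table: {table}")
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, name, col_type, _notnull, _default, pk in columns:
            examples = [row[0] for row in conn.execute(
                f'SELECT DISTINCT "{name}" FROM "{table}" '
                f'WHERE "{name}" IS NOT NULL ORDER BY "{name}" LIMIT {n_examples}'
            )]
            key = ", Primary Key" if pk else ""
            lines.append(f"({name}: {col_type}{key}, Examples: {examples})")
        # PRAGMA foreign_key_list yields (id, seq, table, from, to, ...).
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            foreign_keys.append(f"{table}.{fk[3]} = {fk[2]}.{fk[4]}")
    if foreign_keys:
        lines.append("[Foreign keys]")
        lines.extend(foreign_keys)
    return "\n".join(lines)


# Hypothetical demo database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE concert (
    concert_id INTEGER PRIMARY KEY,
    singer_id INTEGER REFERENCES singer(singer_id),
    year INTEGER
);
INSERT INTO singer VALUES (1, 'Ada'), (2, 'Bo');
INSERT INTO concert VALUES (10, 1, 2023);
""")
schema_text = describe_schema(conn, "concert_demo")
print(schema_text)
```

A text block of this kind can be placed in the prompt ahead of the user question, so the model sees column types and sample values rather than only table names.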

News🔥

  • Jan. 22, 2025 🌟 We have released XiYanSQL-QwenCoder-32B and open-sourced its model weights.

  • Jan. 09, 2025 🌟 XiYanSQL-QwenCoder-32B achieves an EX score of 69.03% on the BIRD test set, setting a new SOTA for a single fine-tuned model.

  • Dec. 17, 2024 🌟 New SOTA on BIRD: XiYan-SQL reaches the top of the BIRD leaderboard with an EX score of 75.63%, outperforming second place by 0.84 points.

  • Dec. 13, 2024 We release the model and source code of DateResolver.

  • Dec. 12, 2024 Try our model: ModelScope
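The EX scores quoted above are execution accuracy: a predicted query counts as correct when running it returns the same result as the reference query. The sketch below shows one simplified way to compute it on SQLite. The helper names, the toy table, and the rule of comparing rows as unordered multisets are assumptions; official benchmark scripts may differ in details such as duplicate handling and timeouts.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    hits = 0
    for predicted, gold in pairs:
        predicted_rows = run(conn, predicted)
        # A query that fails to execute never counts as a hit.
        hits += predicted_rows is not None and predicted_rows == run(conn, gold)
    return hits / len(pairs)


# Hypothetical toy data: one table holding the values 1, 2, 3.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")

pairs = [
    ("SELECT x FROM t WHERE x > 1", "SELECT x FROM t WHERE x >= 2"),  # same rows
    ("SELECT COUNT(*) FROM t", "SELECT COUNT(x) FROM t"),             # same rows
    ("SELECT x FROM t", "SELECT x FROM t WHERE x > 1"),               # different rows
    ("SELEC x FROM t", "SELECT x FROM t"),                            # syntax error
]
score = execution_accuracy(conn, pairs)
print(f"EX = {score:.2%}")
```

Note that the metric rewards queries that are written differently but return the same result, which is why the first two pairs count as correct.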

Full Intro.

To tackle the challenges of large language model performance in natural-language-to-SQL tasks, we introduce XiYan-SQL, an innovative framework that employs a multi-generator ensemble strategy to improve candidate generation. We introduce M-Schema, a semi-structured schema representation method designed to enhance the understanding of database structures. To enhance the quality and diversity of the generated candidate SQL queries, XiYan-SQL combines the significant potential of in-context learning (ICL) with the precise control of supervised fine-tuning. On the one hand, we propose a series of training strategies to fine-tune models that generate high-quality candidates with diverse preferences. On the other hand, we implement the ICL approach with an example selection method based on named entity recognition to prevent overemphasis on entities. A refiner then optimizes each candidate by correcting logical or syntactical errors. To address the challenge of identifying the best candidate, we fine-tune a selection model to distinguish the nuances of candidate SQL queries. Experimental results on datasets covering multiple dialects demonstrate the robustness of XiYan-SQL across different scenarios. Overall, XiYan-SQL achieves state-of-the-art execution accuracy: 75.63% on the BIRD test set, 89.65% on the Spider test set, 69.86% on SQL-Eval, and 41.20% on NL2GQL. The proposed framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods.
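The generate, refine, and select flow described above can be sketched as follows. This is a toy illustration under stated assumptions, not the XiYan-SQL implementation: the generators are stub functions returning fixed SQL instead of LLM calls, the refiner is a hypothetical one-shot repair callback, and the fine-tuned selection model is replaced by a simple majority vote over execution results.

```python
import sqlite3
from collections import defaultdict


def execute(conn, sql):
    """Return (rows, error); exactly one of the two is None."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr), None
    except sqlite3.Error as exc:
        return None, str(exc)


def answer(conn, question, generators, refiner):
    """Collect candidates, repair failing ones once, then pick one."""
    groups = defaultdict(list)  # execution result -> candidates producing it
    for generate in generators:
        sql = generate(question)
        rows, error = execute(conn, sql)
        if error is not None:
            # Refine step: a single repair attempt driven by the error message.
            sql = refiner(sql, error)
            rows, error = execute(conn, sql)
        if error is None:
            groups[repr(rows)].append(sql)
    if not groups:
        return None
    # Stand-in for the selection model: take the largest agreement group.
    return max(groups.values(), key=len)[0]


# Hypothetical demo data and stub components.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.0), (3, 5.0);
""")

generators = [
    lambda q: "SELECT SUM(amount) FROM orders",
    lambda q: "SELECT TOTAL(amount) FROM orders",
    lambda q: "SELECT SUM(amont) FROM orders",   # misspelled column, fails
    lambda q: "SELECT MAX(amount) FROM orders",  # runs, but answers another question
]


def toy_refiner(sql, error):
    """Toy repair rule: fix the one misspelling used in this demo."""
    return sql.replace("amont", "amount")


best = answer(conn, "What is the total order amount?", generators, toy_refiner)
print(best)
```

Majority voting is used here only because it needs no trained model. A learned selector can prefer a minority candidate when the majority is wrong, which voting cannot do.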

Coming Soon🕒

  1. The complete code for XiYan-SQL will be released. (Feb. 2025)

  2. The fine-tuned model for SQLite will be released. (Jan. 2025, done)

  3. A method and corresponding code for automatic description generation for Text-to-SQL will be provided. (Jan. 2025)

  4. The code and model of DateResolver will be released. (Dec. 2024, done)

  5. The MoMQ model and training code will be released. (Jan. 2025)

Timeline

The major events.

| Date | Event |
| --- | --- |
| 2024-05 | Proposing M-Schema, involving ICL in SQL generation; achieving 86.98% on the Spider test set (SOTA 86.6%) |
| 2024-09 | Proposing DateResolver |
| 2024-10 | Proposing an MoE model, MoMQ |
| 2024-11 | Proposing the Training Strategy and Ensemble Strategy; achieving 89.65% on the Spider test set (new SOTA) and 69.86% on SQL-Eval (new SOTA); achieving 41.20% on NL2GQL and a competitive score of 72.23% on BIRD dev (4th) |
| 2024-12 | Reaching the top of the BIRD leaderboard with an EX score of 75.63% and an R-VES of 71.41 (new SOTA) |
| 2025-01 | XiYanSQL-QwenCoder-32B achieves an EX score of 69.03% on the BIRD test set, a new SOTA using only a single fine-tuned model; XiYanSQL-QwenCoder-32B has been released |

Recruitment

We're looking for summer interns, research interns, new graduates, and internal transfers!

Please contact: Zhiling Luo, [email protected]

Application

Everyone is welcome to try XiYan GBI, the intelligent data querying solution based on XiYan-SQL. We welcome feedback on your experience with the product and any suggestions for improvement.

For product introduction, please visit: https://help.aliyun.com/zh/model-studio/user-guide/brief-introduction-of-gbi-products

To try the product, please visit: https://bailian.console.aliyun.com/xiyan

Product DingTalk Group: 94725009401

Citation

If you find our work helpful, please consider citing it.

@article{xiyansql,
      title={XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL}, 
      author={Yingqi Gao and Yifu Liu and Xiaoxia Li and Xiaorong Shi and Yin Zhu and Yiming Wang and Shiqi Li and Wei Li and Yuntao Hong and Zhiling Luo and Jinyang Gao and Liyu Mou and Yu Li},
      year={2024},
      journal={arXiv preprint arXiv:2411.08599},
      url={https://arxiv.org/abs/2411.08599},
      primaryClass={cs.AI}
}
