
fix(knowledge): fix general parser chunks exceeding the token limit and breaking LightRAG indexing #670

Merged

xerrors merged 1 commit into xerrors:main from guoyi8:fix/chunking-token-limit on May 3, 2026

Conversation


@guoyi8 commented on May 1, 2026

Summary

  • Fix naive_merge producing an oversized chunk when a single line exceeds chunk_token_num, which made LightRAG fail with "Chunk token length 3140 exceeds chunk_token_size 1200"
  • Add a shared hard-split helper, nlp.hard_split_by_token_limit() (promoted from laws.py, DRY); a sketch follows this list
  • Add an _ensure_chunk_token_limit() safety net in general.py
  • Include 16 unit tests covering over-long Chinese/English text, empty text, a zero token limit, and other edge cases
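For illustration, here is a minimal sketch of what a character-level hard splitter could look like, assuming a pluggable count_tokens callable that stands in for whatever tokenizer the project actually uses. This is not the PR's implementation; the binary search is just one way to find the longest prefix that fits the limit.

```python
def hard_split_by_token_limit(text: str, limit: int, count_tokens) -> list[str]:
    """Greedily slice text into pieces of at most `limit` tokens.

    Sketch only: works on raw characters, so it also handles CJK text
    with no whitespace to split on. Returns [] for empty text or a
    non-positive limit, matching the edge cases the PR's tests mention.
    """
    if limit <= 0 or not text:
        return []
    pieces = []
    while text:
        # Binary-search the longest prefix whose token count fits the
        # limit; always take at least one character to guarantee progress.
        lo, hi = 1, len(text)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if count_tokens(text[:mid]) <= limit:
                lo = mid
            else:
                hi = mid - 1
        pieces.append(text[:lo])
        text = text[lo:]
    return pieces
```

With a trivial one-token-per-character counter such as `lambda s: len(s)`, a 3000-character line and a limit of 1200 splits into pieces of 1200, 1200, and 600 characters.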

Background

LightRAG has several related upstream issues (HKUDS/LightRAG #2387, #102, #2126), none of which has a merged fix, so we need our own safeguard at the pre-chunking stage.

Test plan

  • pytest test/unit/test_chunking_token_limit.py — 16/16 passed
  • Codex code review passed (one test-isolation issue, now fixed)
  • End-to-end test in a Docker environment: uploaded a document containing an over-long single line and confirmed indexing succeeded

naive_merge does not guarantee that output chunks stay within the token
limit; when a single line exceeds chunk_token_num it produces an
oversized chunk, making LightRAG fail with
"Chunk token length 3140 exceeds chunk_token_size 1200".

- nlp.py: add the shared hard-split helper hard_split_by_token_limit
- general.py: add the _ensure_chunk_token_limit safety net
- laws.py: drop the local duplicate function and use the nlp version (DRY)
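As a companion sketch, the general.py safety net could simply re-split whatever naive_merge produced. This builds on the hypothetical hard_split_by_token_limit sketch above and is not the PR's exact code:

```python
def _ensure_chunk_token_limit(chunks, limit, count_tokens):
    """Post-process naive_merge output: re-split any chunk over the limit.

    Sketch only; mirrors the fallback this PR describes for general.py.
    """
    safe = []
    for chunk in chunks:
        if count_tokens(chunk) <= limit:
            safe.append(chunk)
        else:
            # An over-long single line slipped through naive_merge;
            # hard-split it so LightRAG never sees an oversized chunk.
            safe.extend(hard_split_by_token_limit(chunk, limit, count_tokens))
    return safe
```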
@xerrors xerrors merged commit 887b939 into xerrors:main May 3, 2026
