
fix(knowledge): fix general parser chunks exceeding the token limit and breaking LightRAG indexing #670

Merged

xerrors merged 1 commit into xerrors:main from guoyi8:fix/chunking-token-limit on May 3, 2026

Conversation


@guoyi8 commented on May 1, 2026

Summary

  • Fix naive_merge producing an oversized chunk when a single line exceeds chunk_token_num, which made LightRAG fail with "Chunk token length 3140 exceeds chunk_token_size 1200"
  • Add a shared hard-split helper, nlp.hard_split_by_token_limit() (promoted from laws.py, DRY); a sketch follows this list
  • Add an _ensure_chunk_token_limit() safety net in general.py
  • Include 16 unit tests covering over-long Chinese/English text, empty text, a zero token limit, and other edge cases
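For illustration, here is a minimal sketch of what a character-level hard splitter could look like, assuming a pluggable count_tokens callable that stands in for whatever tokenizer the project actually uses. This is not the PR's implementation; the binary search is just one way to find the longest prefix that fits the limit.

```python
def hard_split_by_token_limit(text: str, limit: int, count_tokens) -> list[str]:
    """Greedily slice text into pieces of at most `limit` tokens.

    Sketch only: works on raw characters, so it also handles CJK text
    with no whitespace to split on. Returns [] for empty text or a
    non-positive limit, matching the edge cases the PR's tests mention.
    """
    if limit <= 0 or not text:
        return []
    pieces = []
    while text:
        # Binary-search the longest prefix whose token count fits the
        # limit; always take at least one character to guarantee progress.
        lo, hi = 1, len(text)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if count_tokens(text[:mid]) <= limit:
                lo = mid
            else:
                hi = mid - 1
        pieces.append(text[:lo])
        text = text[lo:]
    return pieces
```

With a trivial one-token-per-character counter such as `lambda s: len(s)`, a 3000-character line and a limit of 1200 splits into pieces of 1200, 1200, and 600 characters.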

Background

LightRAG has several related upstream issues (HKUDS/LightRAG #2387, #102, #2126), none of which has a merged fix, so we need our own safeguard at the pre-chunking stage.

Test plan

  • pytest test/unit/test_chunking_token_limit.py — 16/16 passed
  • Codex code review passed (one test-isolation issue, now fixed)
  • End-to-end test in a Docker environment: uploaded a document containing an over-long single line and confirmed indexing succeeded

naive_merge does not guarantee that output chunks stay within the token
limit; when a single line exceeds chunk_token_num it produces an
oversized chunk, making LightRAG fail with
"Chunk token length 3140 exceeds chunk_token_size 1200".

- nlp.py: add the shared hard-split helper hard_split_by_token_limit
- general.py: add the _ensure_chunk_token_limit safety net
- laws.py: drop the local duplicate function and use the nlp version (DRY)
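As a companion sketch, the general.py safety net could simply re-split whatever naive_merge produced. This builds on the hypothetical hard_split_by_token_limit sketch above and is not the PR's exact code:

```python
def _ensure_chunk_token_limit(chunks, limit, count_tokens):
    """Post-process naive_merge output: re-split any chunk over the limit.

    Sketch only; mirrors the fallback this PR describes for general.py.
    """
    safe = []
    for chunk in chunks:
        if count_tokens(chunk) <= limit:
            safe.append(chunk)
        else:
            # An over-long single line slipped through naive_merge;
            # hard-split it so LightRAG never sees an oversized chunk.
            safe.extend(hard_split_by_token_limit(chunk, limit, count_tokens))
    return safe
```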
@xerrors xerrors merged commit 887b939 into xerrors:main May 3, 2026
