New chunking classes #996

Merged
merged 2 commits into microsoft:main from extendedchunks
Feb 6, 2025

@dluc commented Feb 6, 2025

The original chunkers ported from SK had some bugs introduced while refactoring, leading to incorrect splits. This is a full rewrite following the original logic, with some changes:

  • remove the MaxTokensPerLine setting
  • overlap no longer uses sentences; it copies raw tokens from the previous chunk instead
  • the markdown chunker uses better splitting logic, although it should be rewritten to use a markdown parser
  • chunkers now work with a Chunk class, which is also used by the file parsers. This will allow porting properties from files to chunks, such as page number and other metadata
  • chunkers now take a dependency on tokenizers directly, rather than just on TokenCount
  • chunkers have moved out of Core into a dedicated NuGet package, for future reuse outside KM
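
To illustrate the token-based overlap described above, here is a minimal Python sketch. It is not the code from this PR: the function names, the parameters, and the whitespace tokenizer are hypothetical stand-ins, and the real chunkers also split on sentence and paragraph boundaries, which this sketch ignores. It only shows how the tail tokens of one chunk can be copied to the start of the next.

```python
from typing import Callable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def chunk_with_overlap(
    text: str,
    max_tokens: int,
    overlap_tokens: int,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[List[str]]:
    """Split text into chunks of at most max_tokens tokens.

    Each chunk after the first starts with the last overlap_tokens tokens
    of the previous chunk, copied as raw tokens (no sentence logic).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    tokens = tokenize(text)
    step = max_tokens - overlap_tokens  # new tokens contributed by each chunk
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the chunk just added already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap("a b c d e f g", max_tokens=4, overlap_tokens=1):
        print(chunk)
```

With `max_tokens=4` and `overlap_tokens=1`, the second chunk begins with the last token of the first. Passing the tokenizer in as a function mirrors the idea of chunkers depending on a tokenizer directly rather than on a token count alone.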

@dluc force-pushed the extendedchunks branch 4 times, most recently from 291fd6b to 87adf99, February 6, 2025 22:36
@dluc merged commit a490102 into microsoft:main Feb 6, 2025
6 checks passed
@dluc deleted the extendedchunks branch February 6, 2025 22:53