Haystack splitter using docling chunkers #4

Open
dmartinol opened this issue Jan 7, 2025 · 2 comments

@dmartinol

Request to design a Haystack DocumentSplitter based on the Docling chunkers.

docling_splitter is a reference implementation using the HybridChunker developed by @ilya-kolchinsky

@vagenas
Contributor

vagenas commented Jan 7, 2025

@dmartinol @ilya-kolchinsky docling-haystack provides the DoclingConverter which can produce the chunks/splits based on a provided Docling BaseChunker right away using the DOC_CHUNKS export type:

from docling_haystack.converter import DoclingConverter, ExportType
from docling.chunking import HybridChunker

# example setup; instantiate chunker as needed
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2")

# set up converter; DOC_CHUNKS makes it emit one document per chunk
converter = DoclingConverter(
    export_type=ExportType.DOC_CHUNKS,
    chunker=chunker,
)

This already provides the capability of getting "splits" created by native Docling chunkers — you may just have to skip the explicit "splitting" pipeline step in this branch of the logic.
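
To illustrate the "skip the splitting step" point, here is a minimal, library-free sketch of that branch logic. Everything in it is a hypothetical stand-in introduced for illustration (`Doc`, `fake_chunking_converter`, `fake_whole_doc_converter`, `fake_splitter`, `index`); none of these names come from docling-haystack or Haystack, and the real components would take their place in an actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    content: str


def fake_chunking_converter(paths: List[str]) -> List[Doc]:
    # Stand-in for a converter that already emits chunks
    # (the DOC_CHUNKS case): several Docs per input file.
    return [Doc(f"{p}#chunk{i}") for p in paths for i in range(2)]


def fake_whole_doc_converter(paths: List[str]) -> List[Doc]:
    # Stand-in for a converter that emits one Doc per input file.
    return [Doc(f"{p}#full") for p in paths]


def fake_splitter(docs: List[Doc]) -> List[Doc]:
    # Stand-in for an explicit splitting step: two parts per Doc.
    return [Doc(f"{d.content}/part{i}") for d in docs for i in range(2)]


def index(
    paths: List[str],
    converter: Callable[[List[str]], List[Doc]],
    already_chunked: bool,
) -> List[Doc]:
    """Run conversion, then split only if the converter did not chunk."""
    docs = converter(paths)
    if not already_chunked:
        docs = fake_splitter(docs)
    return docs


# Chunking converter: the splitter is skipped entirely.
chunks = index(["a.pdf"], fake_chunking_converter, already_chunked=True)

# Whole-document converter: the splitter still runs.
parts = index(["a.pdf"], fake_whole_doc_converter, already_chunked=False)
```

The only design choice here is the `already_chunked` flag: the caller decides per branch whether the converter output should pass through a splitter, so chunked output is never split a second time.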

Let me know if this answers the question, otherwise we can also have a quick chat.

@dmartinol
Author

This seems to do what we need, thank you!
ATM we cannot test this package because we're stuck with legacy versions of docling (<=2.8.3), but I think @ilya-kolchinsky should have more freedom to check this out. I will raise an enhancement issue on their repo for now.

@vagenas vagenas self-assigned this Jan 9, 2025