Haystack splitter using docling chunkers #4

dmartinol · 2025-01-07T10:31:44Z

Request to design a Haystack DocumentSplitter based on the Docling chunkers.

docling_splitter is a reference implementation using the HybridChunker developed by @ilya-kolchinsky.

vagenas · 2025-01-07T12:45:26Z

@dmartinol @ilya-kolchinsky docling-haystack provides the DoclingConverter, which can produce the chunks/splits directly from a provided Docling BaseChunker when the DOC_CHUNKS export type is used:

from docling_haystack.converter import DoclingConverter, ExportType
from docling.chunking import HybridChunker

# example setup; instantiate chunker as needed
chunker = HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2")

# set up converter
converter = DoclingConverter(
  export_type=ExportType.DOC_CHUNKS,
  chunker=chunker,
)

Full example: https://ds4sd.github.io/docling/examples/rag_haystack/
More resources: https://ds4sd.github.io/docling/integrations/haystack/

This already yields the "splits" created by native Docling chunkers, so in this branch of the logic you can simply skip the explicit "splitting" pipeline step.
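A minimal sketch of that branching, assuming a Haystack 2.x indexing pipeline with an in-memory document store. The helper names `indexing_edges` and `build_indexing_pipeline`, the `use_docling_chunks` flag, and the word-based fallback splitter settings are hypothetical and only for illustration; adapt them to your own setup:

```python
def indexing_edges(use_docling_chunks: bool) -> list[tuple[str, str]]:
    """Return the (sender, receiver) connections for the indexing pipeline.

    Hypothetical helper: it only encodes which components get wired together.
    """
    if use_docling_chunks:
        # DOC_CHUNKS branch: the converter output is already chunked,
        # so it is connected to the writer with no splitter in between.
        return [("converter.documents", "writer.documents")]
    # Fallback branch: the converter output still needs a splitting step.
    return [
        ("converter.documents", "splitter.documents"),
        ("splitter.documents", "writer.documents"),
    ]


def build_indexing_pipeline(use_docling_chunks: bool = True):
    """Assemble the pipeline (requires docling, docling-haystack, haystack-ai)."""
    # Imports are local so indexing_edges() stays usable without these packages.
    from docling.chunking import HybridChunker
    from docling_haystack.converter import DoclingConverter, ExportType
    from haystack import Pipeline
    from haystack.components.preprocessors import DocumentSplitter
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    pipe = Pipeline()
    if use_docling_chunks:
        converter = DoclingConverter(
            export_type=ExportType.DOC_CHUNKS,
            chunker=HybridChunker(tokenizer="sentence-transformers/all-MiniLM-L6-v2"),
        )
    else:
        converter = DoclingConverter(export_type=ExportType.MARKDOWN)
        # Assumed fallback settings; any Haystack splitter configuration would do.
        pipe.add_component(
            "splitter", DocumentSplitter(split_by="word", split_length=200)
        )
    pipe.add_component("converter", converter)
    pipe.add_component(
        "writer", DocumentWriter(document_store=InMemoryDocumentStore())
    )
    for sender, receiver in indexing_edges(use_docling_chunks):
        pipe.connect(sender, receiver)
    return pipe
```

With the packages installed, something like `build_indexing_pipeline().run({"converter": {"paths": ["report.pdf"]}})` should index the Docling chunks directly (`report.pdf` is a placeholder path).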

Let me know if this answers the question, otherwise we can also have a quick chat.

dmartinol · 2025-01-09T09:48:40Z

This seems to do what we need, thank you!
ATM we cannot test this package because we're stuck with legacy versions of docling (<=2.8.3), but I think @ilya-kolchinsky should have more freedom to check this out. I will raise an enhancement issue on their repo for now.

dmartinol mentioned this issue Jan 7, 2025

feat: Expose document store Python API in instructlab/instructlab rag submodule instructlab/instructlab#2832

Merged

6 tasks

vagenas self-assigned this Jan 9, 2025
