
integrations: PDF segmentation and ingest via Aryn model / Sycamore #356

Open
jac-cbi opened this issue Oct 3, 2024 · 6 comments

jac-cbi commented Oct 3, 2024

I'm fired up about a Rust-implemented document parsing / embedding engine for my code and documents. Sadly, I don't see good PDF ingestion support in the codebase.

Ideally, I'd like to import PDFs from academic papers, webpage printouts, or even long screenshots right out of a headless chrome instance. To do this effectively, the image(s) need to be segmented, classified, and then each segment handed to the best LLM for OCR / interpretation.

I've looked at Sycamore, which has published its segmentation LLM on HF. Sadly, it's Python all the way down. I'm looking for something more production-ready and default-safe with untrusted data.


jac-cbi commented Oct 3, 2024

Oops, just found the original code: https://github.com/DS4SD/DocLayNet


timonv commented Oct 3, 2024

Hey @jac-cbi, we haven't done much yet with non-text data. If the PDFs are well-formed and don't need OCR, a simpler reader could be added faster.

That said, processing non-text / binary data is super interesting. An LLM would then generate context on the segment, which would then be stored in a database?


timonv commented Oct 3, 2024

Potential reason to continue with #66


jac-cbi commented Oct 4, 2024

> Hey @jac-cbi, we haven't done much yet with non-text data. If the PDFs are well-formed and don't need OCR, a simpler reader could be added faster.

Sure, I looked into pdf-rs, but I have some security concerns. My end state is a daily driver: dumping all my RSS feeds, blog posts I find, academic papers, exploit write-ups, code repositories, etc. into a knowledge base I can query offline. I'm concerned about wayward JavaScript (PDF, SVG, web) and injected text that's not visible to a human reader (knowledge base poisoning), just to name a few threats I can think of at the moment.

> That said, processing non-text / binary data is super interesting. An LLM would then generate context on the segment, which would then be stored in a database?

Yes, my idea is to use untrusted, isolated processes (e.g. Cloudflare's headless browser, or a headless Chrome running as nobody) to render the source material into images as they would be seen by a human, then process and ingest them as a human would. The Aryn model is the most critical piece, as it segments structured webpages / PDF images into labeled regions, which can then be passed to different vision LLMs for processing. The final result is vectors / embeddings stored in a vector database, e.g. qdrant.

Basically, I'd like to treat the rendered images as a security boundary. Everything upstream of rendering is untrusted; the images are trusted to at least be what a human could review and confirm if doubt arises later.
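To make the shape of that pipeline concrete, here is a rough std-only Rust sketch of the render → segment → describe → store flow. Every trait and type name here is hypothetical (this is not a Swiftide or Aryn API); segmentation and OCR are assumed to be provided by external services behind trait objects:

```rust
// Hypothetical stages for the render-then-ingest flow described above.
// All names are illustrative, not real Swiftide APIs.

/// A page rendered to an image inside an isolated, untrusted process.
/// Everything before this point is untrusted; the image is the boundary.
pub struct RenderedPage {
    pub png_bytes: Vec<u8>,
}

/// A labeled region produced by a segmentation model (e.g. the Aryn model).
pub struct Segment {
    pub label: String, // e.g. "Table", "Figure", "Text"
    pub png_bytes: Vec<u8>,
}

/// Splits a rendered page into labeled regions.
pub trait Segmenter {
    fn segment(&self, page: &RenderedPage) -> Vec<Segment>;
}

/// Turns a segment into text via a vision LLM chosen per label.
pub trait VisionOcr {
    fn describe(&self, segment: &Segment) -> String;
}

/// Embeds text and stores it in a vector database (e.g. qdrant).
pub trait VectorStore {
    fn upsert(&mut self, text: &str) -> Result<(), String>;
}

/// Drives each page through segmentation, description, and storage;
/// returns how many segments were ingested.
pub fn ingest(
    pages: &[RenderedPage],
    segmenter: &dyn Segmenter,
    ocr: &dyn VisionOcr,
    store: &mut dyn VectorStore,
) -> Result<usize, String> {
    let mut count = 0;
    for page in pages {
        for segment in segmenter.segment(page) {
            let text = ocr.describe(&segment);
            store.upsert(&text)?;
            count += 1;
        }
    }
    Ok(count)
}
```

The trait objects keep the untrusted renderer and the model backends swappable; only already-rendered image bytes ever cross into this code.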


timonv commented Oct 4, 2024

Hah, quite a nerd snipe; sounds like a cool project. If I understand correctly, that means there are two parts to this: multimodal support, i.e. via mistral.rs, and supporting binary data in Swiftide?

Right now the latter doesn't have priority for me. That said, it sounds very cool. If you're interested in picking it up, I'd love to think along.


timonv commented Nov 8, 2024

@jac-cbi I'm still interested in picking this up further and gave it a bit more thought. Previously, I got mistral.rs to work in Swiftide, but I can't release that in Swiftide itself because the library isn't published as a crate. Although I'd much prefer to do it that way, an alternative solution could be a separate repository with git dependencies.

As for having non-textual data in the pipeline, I think this can be very valuable in general. The solution we came up with is to make a Node generic over its internal chunk value, which has the added benefit of allowing transformers that only work with specific kinds of data. I'll create an issue for that in a bit, as it basically opens the door to multimodal processing, which I think is a must-have.
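A minimal sketch of what a Node generic over its chunk value could look like, with all names hypothetical (not the actual Swiftide types). The point is that a transformer declares its input and output chunk types, so e.g. an image-only transformer can't be handed a text node, and the mismatch is caught at compile time:

```rust
// Illustrative only: hypothetical generic Node and Transformer, not
// the real Swiftide types.

/// A pipeline node whose payload can be text, image bytes, or anything else.
pub struct Node<T> {
    pub chunk: T,
    pub metadata: Vec<(String, String)>,
}

/// A transformer from one chunk type to another. Transformers that only
/// make sense for a specific kind of data simply fix `In` accordingly.
pub trait Transformer<In, Out> {
    fn transform(&self, node: Node<In>) -> Node<Out>;
}

/// Example: a transformer from raw image bytes to a text description,
/// standing in for a vision-LLM call.
pub struct DescribeImage;

impl Transformer<Vec<u8>, String> for DescribeImage {
    fn transform(&self, node: Node<Vec<u8>>) -> Node<String> {
        Node {
            chunk: format!("image of {} bytes", node.chunk.len()),
            metadata: node.metadata, // carry metadata through unchanged
        }
    }
}
```

Feeding `DescribeImage` a `Node<String>` would be a type error, which is exactly the "transformers that only work with specific kinds of data" property described above.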

Could you let me know if you're still interested in setting this up? I'd be happy to take a look at setting up mistral.rs in a separate repository if that's the case.
