
integrations: PDF segmentation and ingest via Aryn model / Sycamore #356

Open
jac-cbi opened this issue Oct 3, 2024 · 6 comments

jac-cbi commented Oct 3, 2024

I'm fired up about a Rust-implemented document parsing / embedding engine for my code and documents. Sadly, I don't see good PDF ingestion support in the codebase.

Ideally, I'd like to import PDFs from academic papers, webpage printouts, or even long screenshots right out of a headless chrome instance. To do this effectively, the image(s) need to be segmented, classified, and then each segment handed to the best LLM for OCR / interpretation.

I've looked at Sycamore, which has published its segmentation LLM on HF. Sadly, it's Python all the way down. I'm looking for something more production-ready and default-safe with untrusted data.


jac-cbi commented Oct 3, 2024

Oops, just found the original code: https://github.com/DS4SD/DocLayNet


timonv commented Oct 3, 2024

Hey @jac-cbi, we haven't done much yet with non-text data. If the PDFs are well-formed and don't need OCR, a simpler reader could be added faster.

That said, processing non-text / binary data is super interesting. An LLM would then generate context on the segment, which would then be stored in a database?


timonv commented Oct 3, 2024

Potential reason to continue with #66


jac-cbi commented Oct 4, 2024

> Hey @jac-cbi, we haven't done much yet with non-text data. If the PDFs are well-formed and don't need OCR, a simpler reader could be added faster.

Sure, I looked into pdf-rs, but I have some security concerns. My end state is a daily driver: dumping all my RSS feeds, blog posts I find, academic papers, exploit write-ups, code repositories, etc. into a knowledge base I can query offline. I'm concerned about wayward JavaScript (PDF, SVG, web) and injected text that's not visible to a human reader (knowledge base poisoning), just to name a few threats I can think of at the moment.

> That said, processing non-text / binary data is super interesting. An LLM would then generate context on the segment, which would then be stored in a database?

Yes, my idea is to use untrusted, isolated processes (e.g. Cloudflare's headless browser, or a headless Chrome running as nobody) to render the source material into images as they would be seen by a human, then process and ingest them as a human would. The Aryn model is the most critical piece, as it segments structured webpages / PDF images into labeled regions, which can then be passed to different vision LLMs for processing. The final result is vectors / embeddings stored in a vector database, e.g. qdrant.

Basically, I'd like to treat the rendered images as a security boundary. Everything upstream of rendering is untrusted; the images are trusted to at least be what a human could review and confirm if doubt arises later.
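To make the shape of that pipeline concrete, here is a rough std-only Rust sketch of the render → segment → describe → store flow. Every trait and type name here is hypothetical (this is not a Swiftide or Aryn API); segmentation and OCR are assumed to be provided by external services behind trait objects:

```rust
// Hypothetical stages for the render-then-ingest flow described above.
// All names are illustrative, not real Swiftide APIs.

/// A page rendered to an image inside an isolated, untrusted process.
/// Everything before this point is untrusted; the image is the boundary.
pub struct RenderedPage {
    pub png_bytes: Vec<u8>,
}

/// A labeled region produced by a segmentation model (e.g. the Aryn model).
pub struct Segment {
    pub label: String, // e.g. "Table", "Figure", "Text"
    pub png_bytes: Vec<u8>,
}

/// Splits a rendered page into labeled regions.
pub trait Segmenter {
    fn segment(&self, page: &RenderedPage) -> Vec<Segment>;
}

/// Turns a segment into text via a vision LLM chosen per label.
pub trait VisionOcr {
    fn describe(&self, segment: &Segment) -> String;
}

/// Embeds text and stores it in a vector database (e.g. qdrant).
pub trait VectorStore {
    fn upsert(&mut self, text: &str) -> Result<(), String>;
}

/// Drives each page through segmentation, description, and storage;
/// returns how many segments were ingested.
pub fn ingest(
    pages: &[RenderedPage],
    segmenter: &dyn Segmenter,
    ocr: &dyn VisionOcr,
    store: &mut dyn VectorStore,
) -> Result<usize, String> {
    let mut count = 0;
    for page in pages {
        for segment in segmenter.segment(page) {
            let text = ocr.describe(&segment);
            store.upsert(&text)?;
            count += 1;
        }
    }
    Ok(count)
}
```

The trait objects keep the untrusted renderer and the model backends swappable; only already-rendered image bytes ever cross into this code.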


timonv commented Oct 4, 2024

Hah, quite a nerd snipe; sounds like a cool project. If I understand correctly, that means there are two parts to this: multimodal support, i.e. via mistral.rs, and supporting binary data in Swiftide?

Right now the latter doesn't have priority for me. That said, it sounds very cool. If you're interested in picking it up, I'd love to think along.


timonv commented Nov 8, 2024

@jac-cbi I'm still interested in picking this up further and gave it a bit more thought. Previously, I got mistral.rs to work in Swiftide, but I can't release that in Swiftide itself because the library isn't published as a crate. Although I'd much prefer to do it that way, an alternative solution could be a separate repository with git dependencies.

As for having non-textual data in the pipeline, I think this can be very valuable in general. The solution we came up with is to make a Node generic over its internal chunk value, which has the added benefit of allowing transformers that only work with specific kinds of data. I'll create an issue for that in a bit, as it basically opens the door to multimodal processing, which I think is a must-have.
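A minimal sketch of what a Node generic over its chunk value could look like, with all names hypothetical (not the actual Swiftide types). The point is that a transformer declares its input and output chunk types, so e.g. an image-only transformer can't be handed a text node, and the mismatch is caught at compile time:

```rust
// Illustrative only: hypothetical generic Node and Transformer, not
// the real Swiftide types.

/// A pipeline node whose payload can be text, image bytes, or anything else.
pub struct Node<T> {
    pub chunk: T,
    pub metadata: Vec<(String, String)>,
}

/// A transformer from one chunk type to another. Transformers that only
/// make sense for a specific kind of data simply fix `In` accordingly.
pub trait Transformer<In, Out> {
    fn transform(&self, node: Node<In>) -> Node<Out>;
}

/// Example: a transformer from raw image bytes to a text description,
/// standing in for a vision-LLM call.
pub struct DescribeImage;

impl Transformer<Vec<u8>, String> for DescribeImage {
    fn transform(&self, node: Node<Vec<u8>>) -> Node<String> {
        Node {
            chunk: format!("image of {} bytes", node.chunk.len()),
            metadata: node.metadata, // carry metadata through unchanged
        }
    }
}
```

Feeding `DescribeImage` a `Node<String>` would be a type error, which is exactly the "transformers that only work with specific kinds of data" property described above.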

Could you let me know if you're still interested in setting this up? I'd be happy to take a look at setting up mistral.rs in a separate repository if that's the case.
