integrations: PDF segmentation and ingest via Aryn model / Sycamore #356
Oops, just found the original code: https://github.com/DS4SD/DocLayNet
Hey @jac-cbi, we haven't done much yet with non-text data. If the PDFs are well formed and don't need OCR, a simpler reader could be added faster. That said, processing non-text / binary data is super interesting. An LLM would then generate context on the segment, which would then be stored in a database?
Potential reason to continue with #66
Sure, I looked into
Yes, my idea is to use untrusted, isolated processes (e.g. Cloudflare's headless browser, or a headless Chrome running as Basically, I'd like to treat the rendered images as a security boundary. Everything to be rendered is untrusted; the images are trusted to at least be what a human can review and confirm if doubt arises later.
Hah, quite a nerd snipe, sounds like a cool project. If I understand correctly, that means there are two parts to this: multimodal support, i.e. via mistral rs, and supporting binary data in Swiftide? Right now the latter doesn't have priority for me. That said, it sounds very cool. If you're interested in picking it up, I'd love to think along.
@jac-cbi I'm still interested in picking this up further and gave it a bit more thought. Previously, I got mistral-rs to work in Swiftide, but I can't release it in Swiftide itself because the library isn't released as a crate. Although I'd much prefer to do it that way, an alternative solution could be a separate repository with git dependencies. As for having non-textual data in the pipeline, I think this can be very valuable in general. The solution we came up with is to make a Could you let me know if you're still interested in setting this up? I'd be happy to take a look at setting up mistral rs in a separate repository if that is the case.
I'm fired up about a Rust-implemented document parsing / embedding engine for my code and documents. Sadly, I don't see good PDF ingestion in the code.
Ideally, I'd like to import PDFs from academic papers, webpage printouts, or even long screenshots right out of a headless chrome instance. To do this effectively, the image(s) need to be segmented, classified, and then each segment handed to the best LLM for OCR / interpretation.
I've looked at Sycamore, which has published their segmentation LLM on HF. Sadly, it's Python all the way down. I'm looking for something more production-ready, and default-safe with untrusted data.
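To make the segment, classify, and route idea above a bit more concrete, here is a rough sketch in plain Rust. Everything in it is hypothetical: `SegmentKind`, `Interpreter`, the `Segmenter` trait, and `SplitSegmenter` are made-up names for illustration, not Swiftide, Sycamore, or mistral-rs APIs, and the dummy segmenter only stands in for a real layout model.

```rust
/// Hypothetical layout classes a segmentation model might emit.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SegmentKind {
    Text,
    Table,
    Figure,
    Formula,
}

/// Hypothetical backends that could interpret a cropped segment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Interpreter {
    Ocr,
    TableModel,
    VisionModel,
    MathModel,
}

/// Region of a rendered page image, in pixels.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BBox {
    x: u32,
    y: u32,
    width: u32,
    height: u32,
}

/// One classified region on one page.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct Segment {
    page: usize,
    bbox: BBox,
    kind: SegmentKind,
}

/// Stand-in for a layout model: given page dimensions, return classified regions.
/// A real implementation would take the rendered image itself.
trait Segmenter {
    fn segment(&self, page: usize, width: u32, height: u32) -> Vec<Segment>;
}

/// Dummy segmenter (assumption for the sketch): the top quarter of every page
/// is text, the remainder is a figure.
struct SplitSegmenter;

impl Segmenter for SplitSegmenter {
    fn segment(&self, page: usize, width: u32, height: u32) -> Vec<Segment> {
        let top = height / 4;
        vec![
            Segment {
                page,
                bbox: BBox { x: 0, y: 0, width, height: top },
                kind: SegmentKind::Text,
            },
            Segment {
                page,
                bbox: BBox { x: 0, y: top, width, height: height - top },
                kind: SegmentKind::Figure,
            },
        ]
    }
}

/// Pick a backend for each segment class.
fn route(kind: SegmentKind) -> Interpreter {
    match kind {
        SegmentKind::Text => Interpreter::Ocr,
        SegmentKind::Table => Interpreter::TableModel,
        SegmentKind::Figure => Interpreter::VisionModel,
        SegmentKind::Formula => Interpreter::MathModel,
    }
}

/// Segment every page (given as width/height pairs) and attach the chosen backend.
fn plan_pages(segmenter: &dyn Segmenter, pages: &[(u32, u32)]) -> Vec<(Segment, Interpreter)> {
    pages
        .iter()
        .enumerate()
        .flat_map(|(page, &(width, height))| segmenter.segment(page, width, height))
        .map(|segment| {
            let interpreter = route(segment.kind);
            (segment, interpreter)
        })
        .collect()
}
```

Calling `plan_pages(&SplitSegmenter, &[(800, 1000)])` would yield two entries for page 0: a text segment routed to `Interpreter::Ocr` and a figure segment routed to `Interpreter::VisionModel`. The point of the trait is that the segmenter could live in an isolated process and only hand back boxes and labels.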