Allow enable OCR extraction from PDF #624

GryBsh · 2023-10-13T11:20:34Z

GryBsh
Oct 13, 2023

I'd like to be able to opt in to OCR for PDF documents.
I understand that tesseract doesn't support this, but Form Recognizer does.

dluc · 2023-12-29T04:22:40Z

dluc
Dec 29, 2023
Maintainer

Hi @GryBsh, the solution supports Azure Form Recognizer, now known as Azure AI Document Intelligence.

0 replies

GryBsh · 2023-12-30T11:25:22Z

GryBsh
Dec 30, 2023
Author

I also took this issue to our MS account team and I got the same answer.
Let me tell you what I told them: read your own code: https://github.com/microsoft/kernel-memory/blob/0b8e4cc5592000096f39d80fce1302d24e9e9b39/service/Core/DataFormats/Pdf/PdfDecoder.cs

No, the solution does NOT support OCR of any kind on PDFs. The assumption is that the PDFs have already been OCR'd well. So I don't think that "completed" tag is very accurate.

0 replies

Matt-Scheetz · 2024-02-22T22:21:11Z

Matt-Scheetz
Feb 22, 2024

Bump

0 replies

crickman · 2024-02-26T16:42:54Z

crickman
Feb 26, 2024

@Matt-Scheetz - You can examine this project to see how to integrate tesseract into kernel-memory: https://github.com/microsoft/chat-copilot

Otherwise, Azure Form Recognizer is supported if you add the configuration: https://github.com/microsoft/kernel-memory/blob/main/service/Service/appsettings.json#L338

0 replies

dluc · 2024-02-26T20:08:42Z

dluc
Feb 26, 2024
Maintainer

@GryBsh sorry about the misunderstanding. What I meant to say is that KM has integrated Azure Form Recognizer as an optional OCR solution; however, the integration is used only for images. For PDFs, KM always uses UglyToad.PdfPig, which is free and was added earlier, if I remember correctly.

In order to use Azure Form Recognizer for PDFs, we'll need to make "PDF extraction" configurable, allowing users to choose between Azure AI Document Intelligence, UglyToad.PdfPig, or any other injectable class. It would be a nice feature to have, though currently we don't have a timeline for it. If someone is willing to work on it and send a PR, it would definitely be welcome.
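
For anyone picking this up, here is a rough sketch of the "configurable PDF extraction" idea described above. It is written in Python only to keep it short (KM itself is .NET), and none of these names exist in KM: `PdfTextExtractor`, `TextLayerExtractor`, `OcrExtractor`, `build_extractor`, and the registry keys are all assumptions made for illustration. The two extractor classes are stand-ins and do not call PdfPig or Azure.

```python
from typing import Callable, Dict, Protocol


class PdfTextExtractor(Protocol):
    """Hypothetical interface: anything that turns PDF bytes into text."""

    def extract(self, data: bytes) -> str: ...


class TextLayerExtractor:
    """Stand-in for a text-layer reader (the role UglyToad.PdfPig plays today)."""

    def extract(self, data: bytes) -> str:
        # Placeholder: a real implementation would parse the embedded text layer.
        return "<text layer>"


class OcrExtractor:
    """Stand-in for an OCR backend such as Azure AI Document Intelligence."""

    def __init__(self, ocr: Callable[[bytes], str]) -> None:
        # The OCR call is injected so this sketch needs no cloud SDK.
        self._ocr = ocr

    def extract(self, data: bytes) -> str:
        return self._ocr(data)


def build_extractor(
    name: str,
    registry: Dict[str, Callable[[], PdfTextExtractor]],
) -> PdfTextExtractor:
    """Pick an extractor by its configured name (hypothetical config value)."""
    try:
        return registry[name]()
    except KeyError:
        raise ValueError(f"Unknown PDF extractor: {name!r}") from None


if __name__ == "__main__":
    registry = {
        "text-layer": TextLayerExtractor,
        # Fake OCR function; a real one would call the OCR service.
        "ocr": lambda: OcrExtractor(lambda data: "<ocr text>"),
    }
    extractor = build_extractor("ocr", registry)
    print(extractor.extract(b"%PDF-"))
```

The point of the registry is that the PDF decoder would depend only on the interface, so switching between a text-layer reader and OCR becomes a configuration choice rather than a code change.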

0 replies

GryBsh · 2024-06-05T09:24:06Z

GryBsh
Jun 5, 2024
Author

Sigh. Yes, now that it's been moved to a discussion, to be kept with the tasks to be groomed, I now have to submit ANOTHER feature request.
And this is why we don't mark things completed and move them to discussion before they've been legitimately triaged.
Expect a new feature request.

1 reply

dluc Jun 5, 2024
Maintainer

@GryBsh we collect all the feature requests in this section, where we can discuss them and where people can upvote features. We didn't mark this discussion as completed. GitHub marked the original issue as completed because of the migration from Issue to Discussion; unfortunately, we don't control that.

If you open new feature requests for other features, please use this section under Discussions, thanks!
