Allow enable OCR extraction from PDF #624
Replies: 6 comments 1 reply
-
Hi @GryBsh, the solution supports Azure Form Recognizer, now known as Azure AI Document Intelligence.
-
I also took this issue to our MS account team and got the same answer. No, the solution does NOT support OCR of any kind on PDFs. The assumption is that the PDFs have already been OCR'd well. So I don't think that "completed" tag is very accurate.
-
@Matt-Scheetz - You can examine this project to see how to integrate Tesseract into kernel-memory: https://github.com/microsoft/chat-copilot Otherwise, Azure Form Recognizer is supported if you add the configuration: https://github.com/microsoft/kernel-memory/blob/main/service/Service/appsettings.json#L338
-
@GryBsh, sorry about the misunderstanding. What I meant to say is that KM has integrated Azure Form Recognizer as an optional OCR solution; however, the integration is used only for images. For PDFs, KM always uses UglyToad.PdfPig. In order to use Azure Form Recognizer for PDFs, we'll need to make "PDF extraction" configurable, allowing a choice between Azure Doc Intelligence, UglyToad.PdfPig, or any other injectable class. It would be a nice feature to have, though currently we don't have a timeline for it. If someone is willing to work on it and send a PR, it would definitely be welcome.
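To make the "configurable PDF extraction" idea concrete, here is a rough, language-neutral sketch in Python. It is not KM's API: KM itself is a .NET project, and every name below (`PdfTextExtractor`, `TextLayerExtractor`, `OcrExtractor`, `select_extractor`, and the registry keys) is hypothetical. The two extractors take injected callables standing in for a text-layer parser and an OCR service, so nothing here calls a real library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class PdfTextExtractor(Protocol):
    """Anything that can turn raw PDF bytes into text (hypothetical interface)."""

    def extract(self, pdf_bytes: bytes) -> str: ...


@dataclass
class TextLayerExtractor:
    """Reads text already embedded in the PDF; performs no OCR."""

    read_text_layer: Callable[[bytes], str]  # injected parser (assumption)

    def extract(self, pdf_bytes: bytes) -> str:
        return self.read_text_layer(pdf_bytes)


@dataclass
class OcrExtractor:
    """Sends the PDF to an OCR backend, e.g. a document-analysis service."""

    run_ocr: Callable[[bytes], str]  # injected OCR client (assumption)

    def extract(self, pdf_bytes: bytes) -> str:
        return self.run_ocr(pdf_bytes)


def select_extractor(
    name: str, registry: Dict[str, PdfTextExtractor]
) -> PdfTextExtractor:
    """Pick the extractor named in configuration, failing loudly if unknown."""
    if name not in registry:
        known = ", ".join(sorted(registry))
        raise ValueError(f"Unknown PDF extractor '{name}'. Known: {known}")
    return registry[name]


if __name__ == "__main__":
    # Fake backends so the sketch runs without any PDF or OCR dependency.
    registry: Dict[str, PdfTextExtractor] = {
        "text-layer": TextLayerExtractor(lambda data: ""),
        "ocr": OcrExtractor(lambda data: "text recovered by OCR"),
    }
    extractor = select_extractor("ocr", registry)  # value would come from config
    print(extractor.extract(b"%PDF-1.7 ..."))
```

The point of the design is that OCR stays opt-in: the default configuration keeps the text-layer extractor, and a scanned-PDF workload switches to the OCR one by changing a single setting.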
-
Sigh. Yes, now that it's been moved to a discussion, I have to submit ANOTHER feature request to keep this among the tasks to be groomed.
-
I'd like to be able to opt in to OCR for PDF documents.
I understand that Tesseract doesn't support this, but Form Recognizer does.