Extract OCR text from a PDF/PPT that contains only images (no text) #1112

yamuna83 · 2025-12-16T17:29:52Z

I am using Kernel Memory to process a set of documents (pdf, docx, images, ppt, pptx), and that works fine.
When I process a PDF file that contains only images, no OCR text is extracted.
I get these warnings:
warn: Microsoft.KernelMemory.DocumentStorage.AzureBlobs.AzureBlobsStorage[0] The file user-documents/000d0000-ac13-0242-b807-08de3cc174d8/Digital_Strategy.pdf.extract.txt is empty
warn: Microsoft.KernelMemory.Handlers.SaveRecordsHandler[0] Pipeline 'user-documents/000d0000-ac13-0242-b807-08de3cc174d8': step save_records: no records found, cannot save, moving to next pipeline step.

My current OCR settings:
"ImageOcrType": "AzureAIDocIntel",

Any help would be appreciated.

dluc · 2026-01-07T15:27:30Z

Maintainer

Hi @yamuna83, unfortunately that's a known limitation, and the version of KM you are using has been archived. That said, the fix should be simple enough: patch this file:

kernel-memory/archived/km-v1/service/Core/Handlers/TextExtractionHandler.cs

Lines 186 to 235 in 94b69d3

    
    private async Task<(string text, FileContent content, bool skipFile)> ExtractTextAsync(
        DataPipeline.FileDetails uploadedFile,
        BinaryData fileContent,
        CancellationToken cancellationToken)
    {
        // Define default empty content
        var content = new FileContent(MimeTypes.PlainText);

        if (string.IsNullOrEmpty(uploadedFile.MimeType))
        {
            uploadedFile.Log(this, $"File MIME type is empty, ignoring the file {uploadedFile.Name}");
            this._log.LogWarning("Empty MIME type, file '{0}' will be ignored", uploadedFile.Name);
            return (text: string.Empty, content, skipFile: true);
        }

        // Checks if there is a decoder that supports the file MIME type. If multiple decoders support this type, it means that
        // the decoder has been redefined, so it takes the last one.
        var decoder = this._decoders.LastOrDefault(d => d.SupportsMimeType(uploadedFile.MimeType));
        if (decoder is not null)
        {
            this._log.LogDebug("Extracting text from file '{0}' mime type '{1}' using extractor '{2}'",
                uploadedFile.Name, uploadedFile.MimeType, decoder.GetType().FullName);
            content = await decoder.DecodeAsync(fileContent, cancellationToken).ConfigureAwait(false);
        }
        else
        {
            uploadedFile.Log(this, $"File MIME type not supported: {uploadedFile.MimeType}. Ignoring the file {uploadedFile.Name}.");
            this._log.LogWarning("File MIME type not supported: {0} - ignoring the file {1}", uploadedFile.MimeType, uploadedFile.Name);
            return (text: string.Empty, content, skipFile: true);
        }

        var textBuilder = new StringBuilder();
        foreach (var section in content.Sections)
        {
            var sectionContent = section.Content.Trim();
            if (string.IsNullOrEmpty(sectionContent)) { continue; }

            textBuilder.Append(sectionContent);

            // Add a clean page separation
            if (section.SentencesAreComplete)
            {
                textBuilder.AppendLineNix();
                textBuilder.AppendLineNix();
            }
        }

        var text = textBuilder.ToString().Trim();
        return (text, content, skipFile: false);
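
The excerpt above returns whatever the decoder produced, so an image-only PDF ends up with an empty `text`, which matches the "is empty" warning in the question. The thread does not say what the patch should look like. As one possible shape, and only a sketch under assumptions, an OCR fallback could run when the decoded text is empty. Everything below is hypothetical: `OcrFallbackSketch`, `ExtractWithFallbackAsync`, the `pageImages` parameter (page images already rendered or extracted from the PDF by some other means), and the `ocr` delegate are stand-ins for illustration, not Kernel Memory APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper, not part of Kernel Memory.
public static class OcrFallbackSketch
{
    // Returns the decoder's text when it is non-empty. Otherwise runs the
    // supplied OCR delegate over each page image and joins the non-empty
    // results with a blank line between pages.
    public static async Task<string> ExtractWithFallbackAsync(
        string decodedText,
        IReadOnlyList<byte[]> pageImages,
        Func<byte[], Task<string>> ocr)
    {
        var text = (decodedText ?? string.Empty).Trim();

        // Keep the decoder output if there is any, or if no OCR is available.
        if (text.Length > 0 || ocr is null || pageImages is null)
        {
            return text;
        }

        var builder = new StringBuilder();
        foreach (var image in pageImages)
        {
            var pageText = (await ocr(image).ConfigureAwait(false) ?? string.Empty).Trim();
            if (pageText.Length == 0) { continue; }

            // Separate pages with a blank line.
            if (builder.Length > 0) { builder.Append("\n\n"); }
            builder.Append(pageText);
        }

        return builder.ToString();
    }
}
```

In a real patch, the OCR delegate would presumably wrap whichever OCR engine the service is configured with (here `AzureAIDocIntel`), and producing the page images from the PDF is a separate step that this sketch leaves out.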
