Document page numbers #669

JonathanVelkeneers · 2024-06-17T12:21:25Z

Is there a way to get PDF page numbers from the Citations list?

The PdfDecoder seems to add them to the results, but as far as I can tell from reading the code and inspecting the results (tags and payload), they go unused.

https://github.com/microsoft/kernel-memory/blob/main/service/Core/DataFormats/Pdf/PdfDecoder.cs
https://github.com/microsoft/kernel-memory/blob/main/service/Core/Handlers/TextExtractionHandler.cs

dluc · 2024-06-17T17:41:12Z

Maintainer

This is currently not possible. We did some initial work to extract the information, but there's more work left to do. We need a new text partitioning class that can split text while preserving its metadata (page numbers, titles, etc.), so that the metadata can be stored in the memory DB.
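
For illustration only, here is a minimal Python sketch of what a metadata-preserving partitioner could look like. It is not Kernel Memory code (which is C#): the `Section` and `Chunk` types, the `partition` function, and the word-count limit are all hypothetical, and a real partitioner would count tokens rather than words.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Text extracted from one page (hypothetical type, not a Kernel Memory class)."""
    page: int
    text: str


@dataclass
class Chunk:
    """A piece of text plus every page that contributed words to it."""
    text: str
    pages: list[int]


def partition(sections: list[Section], max_words: int) -> list[Chunk]:
    """Split sections into chunks of at most max_words words, keeping page numbers."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    chunks: list[Chunk] = []
    words: list[str] = []
    pages: list[int] = []
    for section in sections:
        for word in section.text.split():
            words.append(word)
            # Record the page the first time it contributes to the current chunk.
            if not pages or pages[-1] != section.page:
                pages.append(section.page)
            if len(words) == max_words:
                chunks.append(Chunk(" ".join(words), pages))
                words, pages = [], []
    if words:
        # Flush the final, possibly shorter, chunk.
        chunks.append(Chunk(" ".join(words), pages))
    return chunks
```

Because each chunk carries its own `pages` list, that list could then be stored alongside the chunk (for example as a tag) when the record is written.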

alkampfergit Jul 25, 2024

I think this would be worth doing. When indexing large documents, users really want to know where the relevant text sits in the original document. From what I saw in the code, the problem is that after extraction all the text is collapsed into a single string, and the TextChunker then chunks that string as a whole, so the page boundaries are lost.

From what I can see, the solution would probably be to create another chunker. Am I wrong?
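
To make the collapsing problem concrete, here is a small Python sketch of one possible alternative to a new chunker: keep the collapsed string, but record the character offset at which each page starts, so any chunk's character span can be mapped back to pages afterwards. The function names and the newline separator are assumptions for illustration and are not taken from Kernel Memory or from the linked branch.

```python
from bisect import bisect_right


def collapse(pages: list[tuple[int, str]]) -> tuple[str, list[int], list[int]]:
    """Join page texts, recording the character offset where each page starts."""
    parts, starts, numbers = [], [], []
    offset = 0
    for number, text in pages:
        starts.append(offset)
        numbers.append(number)
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline separator added by join
    return "\n".join(parts), starts, numbers


def pages_for_span(start: int, end: int, starts: list[int], numbers: list[int]) -> list[int]:
    """Return the page numbers overlapped by the half-open character span [start, end)."""
    first = bisect_right(starts, start) - 1
    last = bisect_right(starts, end - 1) - 1
    return numbers[first:last + 1]
```

With this approach the existing chunker only needs to report the start and end offsets of each chunk. The trade-off is that any step that rewrites the text after collapsing (trimming, normalization) would invalidate the offsets.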

I started a possible PR here: https://github.com/alkampfergit/kernel-memory/tree/feature/pages-in-memoryrecords. @dluc, when you have time, could you tell me whether the direction is OK?
