Process OCR XML format for PMC content by haohangyan · Pull Request #1501 · gyorilab/indra

haohangyan · 2026-05-08T14:10:56Z

This PR extends extract_paragraphs() so that it can process full-text content in OCR XML format. The aim is to extract text from some historical papers distributed as PDFs.
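For illustration only, here is a minimal sketch of what paragraph extraction from OCR-derived XML could look like. It is not the code in this PR. The element name `ocr-text`, the function name `extract_ocr_paragraphs`, and the blank-line paragraph convention are all assumptions made for the example; the actual PMC OCR schema and the PR's `_extract_from_pmc_ocr` helper may differ.

```python
import re
import xml.etree.ElementTree as ET


def extract_ocr_paragraphs(xml_string, ocr_tag="ocr-text"):
    """Return a list of paragraph strings from OCR-style XML.

    ASSUMPTION: OCR output is stored as plain text inside elements named
    `ocr_tag` (a hypothetical name), with blank lines separating paragraphs.
    """
    root = ET.fromstring(xml_string)
    paragraphs = []
    for elem in root.iter(ocr_tag):
        # Collect all text under the element, including nested children.
        text = "".join(elem.itertext())
        # Split on blank lines, then collapse line breaks inside a paragraph.
        for block in re.split(r"\n\s*\n", text):
            cleaned = " ".join(block.split())
            if cleaned:
                paragraphs.append(cleaned)
    return paragraphs


if __name__ == "__main__":
    sample = (
        "<article><body><ocr-text>First line\nsame paragraph.\n\n"
        "Second paragraph.</ocr-text></body></article>"
    )
    for para in extract_ocr_paragraphs(sample):
        print(para)
```

A caller could first check whether the document contains any such OCR elements and fall back to the regular paragraph extraction path otherwise, which matches the "check for ocr xml file and then extract" flow described in the first commit message.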

haohangyan added 3 commits May 6, 2026 14:15

Add check for ocr xml file and then extract paragraphs from xml

d25693c

Add comment in _extract_from_pmc_ocr

d4dc945

Fix comment typos

3c75db2

haohangyan wants to merge 3 commits into gyorilab:master from haohangyan:pmc_xml
