Process OCR XML format for PMC content #1501

Open: haohangyan wants to merge 3 commits into gyorilab:master from haohangyan:pmc_xml


@haohangyan
Contributor

This PR extends extract_paragraphs() so that it can process full-text content in the OCR XML format. The goal is to extract text from some historical papers that are available as PDFs.
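
For readers unfamiliar with this kind of extraction, here is a minimal sketch of pulling paragraph text out of an XML document. It is not the code in this PR: the helper name `extract_ocr_paragraphs`, the `paragraph_tag` parameter, and the sample XML layout are all assumptions made for illustration, since the actual OCR XML structure is not described here.

```python
import xml.etree.ElementTree as ET


def extract_ocr_paragraphs(xml_string, paragraph_tag="p"):
    """Return the non-empty text of every `paragraph_tag` element.

    Hypothetical helper, not the PR's implementation: the tag name and
    the layout of the OCR XML are assumptions for this illustration.
    """
    root = ET.fromstring(xml_string)
    paragraphs = []
    for elem in root.iter(paragraph_tag):
        # itertext() also collects text from nested child elements;
        # split/join collapses the irregular whitespace common in OCR output.
        text = " ".join("".join(elem.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Made-up sample document, only to exercise the helper above.
sample = """<article>
  <body>
    <p>First   OCR
    paragraph.</p>
    <p></p>
    <p>Second <i>OCR</i> paragraph.</p>
  </body>
</article>"""

print(extract_ocr_paragraphs(sample))
```

The sketch skips empty elements and normalizes whitespace, which tends to matter for scanned documents; whether the PR does the same is not stated above.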
