-
Notifications
You must be signed in to change notification settings - Fork 709
Fix pdf parsing bug #938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix pdf parsing bug #938
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request fixes a PDF parsing bug by updating the text extraction logic to correctly handle multi-column layouts.
- Replaces the use of page.get_text("text", sort=True) with a blocks-based extraction
- Concatenates text blocks in the order provided by the PDF parser
Co-authored-by: Copilot <[email protected]>
paperqa/readers.py
Outdated
# Extract text blocks, which are already in the correct order, from the page | ||
blocks = page.get_text("blocks", sort=False) | ||
|
||
# Concatenate text blocks into a single string | ||
text = "\n".join(block[4] for block in blocks if len(block) > 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we lose with sort=False
? I am wondering why we had sort=True
originally (it predates my time at FutureHouse)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem wasn't sort=True. The problem was getting "text" rather than "blocks".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the problem isn't sort=False
, can you revert to sort=True
there? Just to keep diff smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using sort=False
retains the correct order of blocks in two-column pdfs (as well as one-column pdfs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need lint
workflow and tests to pass. Our CI doesn't work for contributors since the OpenAI API key secret doesn't propagate outside of the FutureHouse org.
That being said, looks like failing tests (locally for me) are:
test_get_directory_index
test_get_directory_index_w_manifest
Can you:
- Get these to pass (adjusting the assertions)
- Expand them to account for
pasa.pdf
paperqa/readers.py
Outdated
# Extract text blocks, which are already in the correct order, from the page | ||
blocks = page.get_text("blocks", sort=False) | ||
|
||
# Concatenate text blocks into a single string | ||
text = "\n".join(block[4] for block in blocks if len(block) > 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the problem isn't sort=False
, can you revert to sort=True
there? Just to keep diff smaller
Thanks, @jamesbraza. I fixed the failing unit tests. |
@jamesbraza could you kick off the workflow, please. Thanks! |
The line
text = page.get_text("text", sort=True)
in
readers.py
doesn't respect multiple columns. For example, applied to pasa.pdf (in tests/stub_data), the first line of text is extracted as "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research" but the first half of that comes from the first column while the second half comes from the second column.Replacing that line of code with
extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.