Fix pdf parsing bug #938

Open: wants to merge 7 commits into base: main

Conversation

markenki

The line
text = page.get_text("text", sort=True)
in readers.py doesn't respect multiple columns. For example, applied to pasa.pdf (in tests/stub_data), the first line of text is extracted as "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research" but the first half of that comes from the first column while the second half comes from the second column.

Replacing that line of code with

# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks)

extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.
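The difference described above can be sketched without a PDF at hand. This is a simplified model, not PyMuPDF's implementation: the tuples are hypothetical stand-ins shaped like the items returned by page.get_text("blocks"), that is (x0, y0, x1, y1, text, block_no, block_type), with invented coordinates and a placeholder second line for the right-hand column. The helper names join_blocks and read_row_wise are illustrative only.

```python
from itertools import zip_longest

# Hypothetical stand-ins shaped like PyMuPDF "blocks" tuples:
# (x0, y0, x1, y1, text, block_no, block_type). Coordinates are invented.
blocks = [
    (50.0, 100.0, 290.0, 130.0,
     "We introduce PaSa, an advanced Paper Search\n"
     "agent powered by large language models.", 0, 0),
    (310.0, 100.0, 550.0, 130.0,
     "Academic paper search lies at the core of research\n"
     "(next line of the second column)", 1, 0),
]


def join_blocks(blocks):
    """Concatenate block texts in the order given, as the proposed fix does."""
    return "\n".join(block[4] for block in blocks)


def read_row_wise(blocks):
    """Simplified model of the failure: merge lines sharing a row, left to right."""
    columns = [block[4].splitlines() for block in sorted(blocks, key=lambda b: b[0])]
    rows = zip_longest(*columns, fillvalue="")
    return "\n".join(" ".join(part for part in row if part) for row in rows)


# Row-wise reading mixes both columns into the first line.
print(read_row_wise(blocks).splitlines()[0])
# Block-wise joining keeps the first column's text together.
print(join_blocks(blocks).splitlines()[0])
```

Joining whole blocks keeps each column's lines adjacent, which is the behaviour the replacement code relies on.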

@Copilot Copilot AI review requested due to automatic review settings April 16, 2025 23:37
@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. bug Something isn't working labels Apr 16, 2025

@Copilot Copilot AI left a comment

Pull Request Overview

This pull request fixes a PDF parsing bug by updating the text extraction logic to correctly handle multi-column layouts.

  • Replaces the use of page.get_text("text", sort=True) with a blocks-based extraction
  • Concatenates text blocks in the order provided by the PDF parser

Comment on lines 43 to 47
# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks if len(block) > 4)
Collaborator

What do we lose with sort=False? I am wondering why we had sort=True originally (it predates my time at FutureHouse)

Author

The problem wasn't sort=True. The problem was getting "text" rather than "blocks".

Collaborator

If the problem isn't sort=False, can you revert to sort=True there? Just to keep diff smaller

Author

Using sort=False retains the correct order of blocks in two-column PDFs (as well as one-column PDFs).
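To make the sort question concrete, here is a minimal sketch of how a position-based sort can reorder blocks on a two-column page. The PyMuPDF documentation describes sort as ordering output by vertical, then horizontal coordinates; the exact key used below (bottom edge y1, then left edge x0) is an assumption, and the coordinates and block texts are hypothetical.

```python
# Hypothetical two-column page. Tuples mimic PyMuPDF "blocks" items:
# (x0, y0, x1, y1, text, block_no, block_type).
stored_order = [
    (50.0, 100.0, 290.0, 200.0, "left-1", 0, 0),
    (50.0, 210.0, 290.0, 400.0, "left-2", 1, 0),
    (310.0, 100.0, 550.0, 150.0, "right-1", 2, 0),
    (310.0, 160.0, 550.0, 400.0, "right-2", 3, 0),
]


def sort_by_position(blocks):
    """Assumed sort key: bottom edge (y1) first, then left edge (x0)."""
    return sorted(blocks, key=lambda b: (b[3], b[0]))


def texts(blocks):
    """Return just the text field of each block, in order."""
    return [b[4] for b in blocks]


print(texts(stored_order))
print(texts(sort_by_position(stored_order)))
```

In this model the stored order reads the left column and then the right column, while the position sort starts with a right-column block whose bottom edge happens to be higher on the page. Whether the stored order matches reading order depends on how the PDF was produced.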

Collaborator

@jamesbraza jamesbraza left a comment

We'll need lint workflow and tests to pass. Our CI doesn't work for contributors since the OpenAI API key secret doesn't propagate outside of the FutureHouse org.

That being said, looks like failing tests (locally for me) are:

  • test_get_directory_index
  • test_get_directory_index_w_manifest

Can you:

  1. Get these to pass (adjusting the assertions)
  2. Expand them to account for pasa.pdf
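One way the pasa.pdf expectation could be phrased as a regression check is sketched below. It operates on already-extracted text, so it does not depend on PyMuPDF; the helper name is_column_safe is hypothetical and this is not the project's test code. The two strings are the ones quoted in the PR description.

```python
# Strings taken from the PR description for pasa.pdf.
EXPECTED = (
    "We introduce PaSa, an advanced Paper Search\n"
    "agent powered by large language models."
)
INTERLEAVED = (
    "We introduce PaSa, an advanced Paper Search "
    "Academic paper search lies at the core of research"
)


def is_column_safe(text: str) -> bool:
    """True if the first-column sentence is intact and not merged with column two."""
    return EXPECTED in text and INTERLEAVED not in text


# Usage with stand-in strings instead of a real extraction result.
good = "Abstract\n" + EXPECTED + "\n"
bad = "Abstract\n" + INTERLEAVED + "\n"
print(is_column_safe(good), is_column_safe(bad))
```

In a real test, the text argument would come from the reader applied to the first page of tests/stub_data/pasa.pdf.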

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels Apr 17, 2025
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 21, 2025
@markenki
Author

We'll need lint workflow and tests to pass. Our CI doesn't work for contributors since the OpenAI API key secret doesn't propagate outside of the FutureHouse org.

That being said, looks like failing tests (locally for me) are:

  • test_get_directory_index
  • test_get_directory_index_w_manifest

Can you:

  1. Get these to pass (adjusting the assertions)
  2. Expand them to account for pasa.pdf

Thanks, @jamesbraza. I fixed the failing unit tests.

@markenki
Author

@jamesbraza, could you kick off the workflow, please? Thanks!

Labels: bug (Something isn't working), size:M (This PR changes 30-99 lines, ignoring generated files.)
3 participants