pdf parsing doesn't handle multi-column papers correctly

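In readers.py, the text extracted from multi-column PDF documents doesn't respect column boundaries: each line of text runs straight across the page, so the columns end up interleaved. This appears to be caused by `sort=True`, which orders the output by vertical and then horizontal position. To fix this, the following line: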

```python
text = page.get_text("text", sort=True)
```

should be replaced by these lines:

```python
# Extract text blocks from the page; each block is a tuple whose item 4 is the block's text
blocks = page.get_text("blocks")
# Concatenate the blocks in the order returned (no positional sort), which keeps each column together
text = "\n".join(block[4] for block in blocks)
```
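For anyone who wants to try the idea without a PDF at hand, here is a minimal sketch of the same concatenation as a standalone helper. It assumes PyMuPDF's block tuple layout `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type == 0` for text blocks. The name `blocks_to_text` and the sample data are hypothetical and not part of readers.py. Skipping non-text blocks and stripping whitespace are additions of this sketch, not part of the fix proposed above.

```python
def blocks_to_text(blocks):
    """Join the text of PyMuPDF-style block tuples in the order given."""
    parts = []
    for block in blocks:
        # Assumed tuple layout: (x0, y0, x1, y1, text, block_no, block_type)
        text, block_type = block[4], block[6]
        if block_type != 0:
            # Skip non-text blocks (for example images)
            continue
        parts.append(text.strip())
    # Drop blocks that were empty after stripping
    return "\n".join(part for part in parts if part)


# Hypothetical two-column page: both left-column blocks are listed before
# the right-column block, even though the right one starts higher up than
# the second left one.
sample_blocks = [
    (50, 100, 290, 200, "Left column, first paragraph.\n", 0, 0),
    (50, 210, 290, 300, "Left column, second paragraph.\n", 1, 0),
    (310, 100, 550, 200, "Right column, first paragraph.\n", 2, 0),
]

print(blocks_to_text(sample_blocks))
```

In readers.py this would be called as `text = blocks_to_text(page.get_text("blocks"))`. Note that it only helps when the PDF stores its blocks in reading order, since nothing here reorders them.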

I'd submit a pull request, but it seems I don't have sufficient permissions to do so.

Thanks!

pdf parsing doesn't handle multi-column papers correctly #937