errors in reading tabular data in PDF breaking qa

First, amazing effort. Really appreciate the hard work and making a greatly useful tool. 

I have a very large PDF (1600 pages) consisting mainly of a table with ~7000 rows containing various pieces of information that I would like to query against. Traditional SQL or pandas filtering doesn't work because certain rows relate to one another via the enclosed text descriptions in certain rows (i.e. Item 10 can't be combined with Item 20 or Item 15 can only be combined when 16, 20, 30 are satisfied etc.) hence the need for a QA tool (the descriptions are a bit more verbose and subtle than that but that is the simplified situation). 

The trouble is when I run the QA (via script), all of the item numbers (column 1) are off by one row to their description column (column 2). As in when it lists the Item number and it's description in the response, the item number actually refers to the NEXT row description, likely because of way the PDF table is formatted.

To make life easier, I actually have an XML file which contains all of this tabular data (and also a CSV etc.). 

With that context in mind, before I go off and try and fix this off by one issue in the PDF tables, is it possible to ingest e.g. an XML file or just the CSV tabular data and use the QA system to query relationships between various rows based on the description column. I could just export the CSV to a PDF in a better format than the one I'm using above, but wanted to check if there's a best way to handle this.

For a bonus, there is e.g. the first 50 pages of the PDF I'd like to keep (text information not in tables) then concatenate it the XML/CSV information. Is this possible? Might you have an example script to perform this QA? I'm just using the basic README query at the moment:

```python
answer_response = ask(
    "What are the conflicts between using Item 15 and Item 20 together?",
    settings=Settings(paper_directory="mypdfs",embedding="text-embedding-3-large"),
)
```

Thank you for any help in advance. As a bonus, is there a way to max out the accuracy at higher cost in the above as these items I describe are medical related? 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

errors in reading tabular data in PDF breaking qa #845

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

errors in reading tabular data in PDF breaking qa #845

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions