errors in reading tabular data in PDF breaking qa #845

Open
bgriffen opened this issue Feb 1, 2025 · 1 comment
Labels
bug Something isn't working

Comments


bgriffen commented Feb 1, 2025

First, amazing effort. Really appreciate the hard work and making a greatly useful tool.

I have a very large PDF (1600 pages) consisting mainly of a table with ~7000 rows containing various pieces of information that I would like to query against. Traditional SQL or pandas filtering doesn't work because certain rows relate to one another via the text descriptions in those rows (e.g. Item 10 can't be combined with Item 20, or Item 15 can only be combined when 16, 20, and 30 are satisfied), hence the need for a QA tool. The actual descriptions are more verbose and subtle than that, but that is the simplified situation.

The trouble is that when I run the QA (via script), all of the item numbers (column 1) are off by one row relative to their description column (column 2). That is, when the response lists an item number and its description, the item number actually refers to the NEXT row's description, likely because of the way the PDF table is formatted.

To make life easier, I actually have an XML file which contains all of this tabular data (and also a CSV etc.).

With that context in mind, before I go off and try to fix this off-by-one issue in the PDF tables: is it possible to ingest e.g. an XML file or just the CSV tabular data, and use the QA system to query relationships between various rows based on the description column? I could just export the CSV to a PDF in a better format than the one I'm using above, but wanted to check if there's a best way to handle this.

As a bonus, I'd like to keep the first 50 pages of the PDF (text information not in tables) and then concatenate them with the XML/CSV information. Is this possible? Might you have an example script to perform this QA? I'm just using the basic README query at the moment:

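# Import assumed from the paper-qa README; the original snippet omitted it.
from paperqa import Settings, ask

answer_response = ask(
    "What are the conflicts between using Item 15 and Item 20 together?",
    settings=Settings(paper_directory="mypdfs", embedding="text-embedding-3-large"),
)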

Thank you in advance for any help. One more question: is there a way to maximize accuracy in the above, even at higher cost? The items I describe are medical-related.


dosubot bot commented Feb 1, 2025

Hey @bgriffen! I'm Dosu, a bot here to help you with bug fixes, answering questions, and becoming a contributor. How can I assist you today while waiting for a human maintainer?

The current information available does not specify whether the QA system directly supports ingesting XML or CSV files. However, you can consider a few approaches to handle your situation:

  1. Pre-process XML/CSV Data: Since you have the data in XML and CSV formats, you could pre-process these files to ensure each item number stays paired with its own description, and then convert them into a format that the QA system can handle, such as structured text or a well-formatted PDF. This might involve using a script to parse the XML/CSV and output a clean text file or PDF.

  2. Concatenating PDF and Other Formats: For combining the first 50 pages of your PDF with the XML/CSV data, you could extract the text from these pages and append the processed XML/CSV data. This combined document can then be used as input for the QA system.

  3. Custom Script for Querying: You might need to write a custom script to handle the ingestion and querying of your data. This script could parse the XML/CSV, align the data correctly, and then format it for the QA system. Unfortunately, there are no specific examples or scripts provided in the documentation for this exact use case.
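
The pre-processing step in the list above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not a documented workflow for the QA tool: the column names `item` and `description` and the helper `rows_to_records` are hypothetical, and whether the QA system accepts the resulting plain text directly is not confirmed in this thread. The idea is simply to emit one self-contained line per row so that an item number can never drift away from its description.

```python
import csv
import io


def rows_to_records(csv_text, item_col="item", desc_col="description"):
    """Turn each CSV row into one self-contained line of text.

    The column names are hypothetical; adjust them to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        item = row[item_col].strip()
        # Collapse internal whitespace so a multi-line cell stays on one line.
        desc = " ".join(row[desc_col].split())
        records.append(f"Item {item}: {desc}")
    return records


# Tiny made-up sample; the second description spans two lines in the CSV.
sample = (
    "item,description\n"
    "10,Cannot be combined with Item 20\n"
    '15,"Requires Items 16, 20\nand 30"\n'
)
records = rows_to_records(sample)
print("\n".join(records))
```

The joined records could then be appended to the text extracted from the first 50 pages and saved as a single document, if the QA tool turns out to accept that format.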

If you decide to convert your XML/CSV data into a PDF, ensure that the formatting aligns correctly to avoid the row misalignment issues you encountered. This might involve using a tool or library that allows precise control over table formatting in PDFs.
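
One way to check for the row misalignment described in the issue, before trusting any converted document, is to compare what was extracted against the CSV as ground truth. The sketch below is only an assumption-laden illustration: `detect_row_shift` is a hypothetical helper, both inputs are assumed to be lists of `(item, description)` pairs in table order, and descriptions are compared by exact string equality, which real extracted text would likely need normalizing for.

```python
def detect_row_shift(extracted, reference, max_shift=3):
    """Return the offset k where every extracted item carries the description
    found k rows after that item in the reference, or None if no offset fits.

    extracted: (item, description) pairs parsed from the PDF (hypothetical input).
    reference: (item, description) pairs from the CSV, in table order.
    """
    ref_index = {item: i for i, (item, _) in enumerate(reference)}
    # Try the smallest offsets first so a correctly aligned table returns 0.
    for shift in sorted(range(-max_shift, max_shift + 1), key=abs):
        checked = 0
        consistent = True
        for item, desc in extracted:
            i = ref_index.get(item)
            if i is None or not (0 <= i + shift < len(reference)):
                continue  # nothing to compare against for this row
            checked += 1
            if reference[i + shift][1] != desc:
                consistent = False
                break
        if consistent and checked:
            return shift
    return None


# Made-up data: each extracted item shows the NEXT row's description.
reference = [("10", "A"), ("15", "B"), ("20", "C")]
extracted = [("10", "B"), ("15", "C")]
shift = detect_row_shift(extracted, reference)
print(shift)
```

A result of `0` means the pairs line up, a non-zero result indicates a consistent shift of that many rows, and `None` means no single offset within `max_shift` explains the data.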

While these suggestions require some manual setup, they could help you leverage the QA system more effectively with your data.

dosubot bot added the bug label on Feb 1, 2025