Research-oriented tooling for extracting, normalizing, heuristically interpreting, and optionally reviewing Table 1-style epidemiology tables from PDFs.
- The main user command is now `table1-parser parse`, which runs the available pipeline stages once and writes all currently available paper outputs.
- The `extract` and `normalize` commands are still available for stage-specific inspection and debugging.
- `TableDefinition` is now implemented as a deterministic, value-free semantic representation of the table structure.
- `ParsedTable` is now emitted as the final structured value layer, with conservative numeric parsing and soft Table 1 `n (%)` heuristics.
- The repository also contains heuristic interpretation, diagnostics, paper-context artifacts, R inspection helpers, and a standalone LLM variable-plausibility review command.
The goal of this project is to parse Table 1-style tables from epidemiology papers into structured representations that can be inspected, validated, and used by downstream tools.
The intended pipeline is:
```
PDF -> ExtractedTable -> NormalizedTable -> TableDefinition -> ParsedTable
```
Each stage has a different purpose:
- `ExtractedTable`: raw table extraction from the PDF. This preserves the recovered grid, cell text, page information, and extraction metadata.
- `NormalizedTable`: a cleaned and organized version of the extracted table. This separates header rows from body rows, preserves row structure, and computes row-level signals that help later interpretation.
- `TableDefinition`: the value-free semantic stage. This captures which variables appear in the rows, which ones are categorical, what their levels are, and how the columns were constructed, without including the table's printed values.
- `ParsedTable`: the final semantic output. This combines variable definitions, column meanings, and parsed table values into a structured format.
This separation is intentional. It keeps extraction, normalization, semantic interpretation, and final value output distinct, which makes the system easier to debug and safer to extend.
At the moment, the repository can persist:
- `ExtractedTable`
- `NormalizedTable`
- `TableProfile`
- `TableDefinition`
- `ParsedTable`
- `TableProcessingStatus`
- paper-level markdown context
- paper-level variable inventory
- per-table paper-context bundles
- optional LLM variable-plausibility reviews
Today, a single call to `table1-parser parse` writes those artifacts from one extraction pass.
The `parse` command is deterministic and does not call an LLM. Optional LLM review is run separately with `table1-parser review-variable-plausibility`.
Clone the repository and enter the project directory:
```bash
git clone <repo-url>
cd parseTable1
```

Install the package:

```bash
python3 -m pip install -e .
```

Run the available parse pipeline once and write all current stage outputs:

```bash
table1-parser parse path/to/paper.pdf
```

By default this writes:
```
outputs/papers/<paper_stem>/extracted_tables.json
outputs/papers/<paper_stem>/normalized_tables.json
outputs/papers/<paper_stem>/table_profiles.json
outputs/papers/<paper_stem>/table_definitions.json
outputs/papers/<paper_stem>/parsed_tables.json
outputs/papers/<paper_stem>/table_processing_status.json
outputs/papers/<paper_stem>/paper_markdown.md
outputs/papers/<paper_stem>/paper_sections.json
outputs/papers/<paper_stem>/paper_visual_inventory.json
outputs/papers/<paper_stem>/paper_references.json
outputs/papers/<paper_stem>/paper_variable_inventory.json
outputs/papers/<paper_stem>/table_contexts/table_0_context.json
```
For example:

```bash
table1-parser parse testpapers/cobaltpaper.pdf
```

This writes the extraction, normalization, table-profile, table-definition, final parsed-table, processing-status, visual-reference, and paper-context outputs in one run.
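As a quick sanity check after a run, a short Python sketch like the one below can confirm that the expected artifacts were written. The paper stem and file names come from the list above; the snippet itself is illustrative and not part of the CLI.

```python
import json
from pathlib import Path

# Illustrative post-run check; "cobaltpaper" matches the example above.
paper_dir = Path("outputs/papers/cobaltpaper")
expected = [
    "extracted_tables.json", "normalized_tables.json", "table_profiles.json",
    "table_definitions.json", "parsed_tables.json", "table_processing_status.json",
]
for name in expected:
    status = "ok" if (paper_dir / name).exists() else "MISSING"
    print(f"{status:7s} {name}")
```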
To run the optional LLM variable-plausibility review:
```bash
table1-parser review-variable-plausibility path/to/paper.pdf
```

When LLM provider configuration is available, this reruns the deterministic pipeline and writes:

```
outputs/papers/<paper_stem>/table_variable_plausibility_llm.json
```
The review is intentionally narrow. It scores whether each deterministic variable label, type, and categorical level structure looks plausible for an epidemiology Table 1. It does not rewrite `table_definitions.json`.
To write debug payloads and monitoring records for the review:
```bash
LLM_DEBUG=true table1-parser review-variable-plausibility path/to/paper.pdf
```

If you want just the raw extraction stage:

```bash
table1-parser extract path/to/paper.pdf
```

By default this writes JSON to:

```
outputs/papers/<paper_stem>/extracted_tables.json
```

For example:

```bash
table1-parser extract testpapers/cobaltpaper.pdf
```

produces:

```
outputs/papers/cobaltpaper/extracted_tables.json
```
If you want the JSON on stdout instead of a file:
```bash
table1-parser extract path/to/paper.pdf --stdout
```

If you want a different root output directory:

```bash
table1-parser parse path/to/paper.pdf --outdir results
```

This writes to:

```
results/papers/<paper_stem>/
```
The `extract`, `normalize`, `parse`, and `review-variable-plausibility` commands all accept `--outdir`.
If you want just the normalization stage:
```bash
table1-parser normalize path/to/paper.pdf
```

By default this writes JSON to:

```
outputs/papers/<paper_stem>/normalized_tables.json
```

The normalization JSON can also be printed to stdout:

```bash
table1-parser normalize path/to/paper.pdf --stdout
```

The repository includes small base-R helpers for visual inspection of normalized, parsed, and trace-oriented JSON files.
```bash
Rscript R/visualize_table_from_json.R outputs/papers/cobaltpaper/normalized_tables.json
```

From an interactive R session:

```r
source("R/visualize_table_from_json.R")
options(width = 200)
visualize_table_from_json("outputs/papers/cobaltpaper/normalized_tables.json")
```
For paper-level inspection there is also:
source("R/inspect_paper_outputs.R")
summarize_table_processing("outputs/papers/cobaltpaper")
show_paper_table_inventory("outputs/papers/cobaltpaper")
show_table_structure("outputs/papers/cobaltpaper", table_number = 1L)
show_table_processing("outputs/papers/cobaltpaper", table_number = 1L)
show_paper_variable_candidates("outputs/papers/cobaltpaper")
show_paper_variable_mentions("outputs/papers/cobaltpaper", source_type = "text_based", mention_role = "variable")
show_table_context("outputs/papers/cobaltpaper", table_number = 1L)Use table_number for public inspection. Extraction-order indices are retained only as low-level provenance/debug handles.
If `review-variable-plausibility` has been run:
source("R/inspect_paper_outputs.R")
outputs <- load_paper_outputs("outputs/papers/cobaltpaper")
llm_variable_plausibility_df(outputs, table_number = 1L)
show_llm_variable_plausibility("outputs/papers/cobaltpaper", table_number = 1L)
list_llm_variable_plausibility_debug_runs("outputs/papers/cobaltpaper")
summarize_llm_variable_plausibility_monitoring("outputs/papers/cobaltpaper")These helpers are meant to make it easier to inspect the paper-level candidate variable inventory, processing status, deterministic table structure, optional LLM variable-plausibility review, and table-context retrieval artifacts.
The default root output directory is `outputs`.
Under that, outputs are organized by paper:
```
outputs/
  papers/
    cobaltpaper/
      extracted_tables.json
      normalized_tables.json
      table_profiles.json
      table_definitions.json
      parsed_tables.json
      table_processing_status.json
      paper_markdown.md
      paper_sections.json
      paper_visual_inventory.json
      paper_references.json
      paper_variable_inventory.json
      table_contexts/
        table_0_context.json
      table_variable_plausibility_llm.json   # when review-variable-plausibility is run
      llm_variable_plausibility_debug/       # when LLM_DEBUG=true
```
This keeps outputs for each paper in a separate directory and leaves room for trace and interpretation-stage outputs.
The `parse` command is intended to populate this directory with every deterministic stage output from a single pipeline run.
The `review-variable-plausibility` command adds the optional review artifact and any review debug files.
When `LLM_DEBUG=true`, variable-plausibility review debug artifacts are written under `outputs/papers/<paper_stem>/llm_variable_plausibility_debug/<timestamp>/`.
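For a bird's-eye view across papers, a small sketch like this one (illustrative only) can list what each paper directory currently contains:

```python
from pathlib import Path

# Illustrative walk over the documented layout: one directory per paper.
papers_root = Path("outputs/papers")
for paper_dir in sorted(papers_root.iterdir()):
    if paper_dir.is_dir():
        names = sorted(p.name for p in paper_dir.iterdir())
        print(paper_dir.name, "->", ", ".join(names))
```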
The easiest way to inspect one paper is:
- start with `extracted_tables.json` if the recovered table grid looks wrong
- read `normalized_tables.json` to see the cleaned table structure
- read `table_profiles.json` to see how each table was routed
- read `table_definitions.json` to see the deterministic row and column interpretation
- read `paper_visual_inventory.json` and `paper_references.json` to see actual in-paper tables/figures and anchored prose mentions
- read `paper_variable_inventory.json` to see the paper-level candidate variable reference list
- read `parsed_tables.json` to see the final structured values
- read `table_processing_status.json` to see whether each table parsed cleanly, was rescued, or failed
- use `paper_sections.json` and `table_contexts/*.json` separately when you want to inspect the paper-context artifacts that may support later grounding work
- read `table_variable_plausibility_llm.json` when present to see the optional LLM variable-plausibility review
In practice:
- `extracted_tables.json`: best for raw PDF recovery, page numbers, and extracted captions
- `normalized_tables.json`: best for stable row and column indices
- `table_definitions.json`: best for the syntax-first semantic baseline
- `paper_variable_inventory.json`: best for the paper-level candidate variable list and text/table provenance
- `table_profiles.json`: best for understanding whether a table was treated as descriptive, estimate-like, or unknown
- `parsed_tables.json`: best for the final structured row, column, and value output
- `table_processing_status.json`: best for understanding parse outcome, rescue attempts, and failure reasons
- `table_variable_plausibility_llm.json`: best for reviewing LLM plausibility scores for deterministic variables, types, and levels
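The status file is also a good programmatic entry point. This sketch just pretty-prints it to decide which stage artifact to open next; no internal field names are assumed.

```python
import json
import pprint

# Triage sketch: check processing status first, then decide which stage
# artifact (extracted, normalized, parsed) to open.
status = json.load(open("outputs/papers/cobaltpaper/table_processing_status.json"))
pprint.pprint(status)
```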
The repository keeps syntax and semantics separate.
- syntax: what rows and columns physically exist in the table. Main files: `extracted_tables.json`, `normalized_tables.json`
- semantics: what the rows mean and which levels belong under them. Main files: `table_definitions.json`, `parsed_tables.json`
The deterministic semantic files refer back to the same `table_id` and row-index space. The optional LLM review also preserves those variable identities, but it is a QA artifact rather than a replacement semantic definition.
The paper-level variable inventory is complementary rather than competitive with those table-level artifacts. It is the paper-scoped candidate reference list that later semantic work can consult while still keeping one table at a time in the LLM prompt.
If you want to find the table caption or where the paper refers to a table:
- `extracted_tables.json` contains the extracted `title`, `caption`, and source `page_num`
- `table_definitions.json` carries `title` and `caption` forward into the value-free semantic layer
- `table_contexts/table_<n>_context.json` carries: `table_label`, `title`, `caption`, linked `reference_ids`, linked `resolved_visual_ids`, and retrieved `passages`
- `paper_visual_inventory.json` lists actual in-paper table and figure objects, including table IDs, figure captions, and whether each visual has a non-self text reference
- `paper_references.json` lists anchored table and figure mentions and marks each as `resolved`, `unresolved`, `external_or_bibliographic`, or `ambiguous`
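A minimal sketch for pulling captions out of `extracted_tables.json`; the container shape (a bare list of table records vs. a wrapper object) is an assumption here, so adapt it after looking at the file once.

```python
import json

data = json.load(open("outputs/papers/cobaltpaper/extracted_tables.json"))
# Assumption: either a bare list of table records or a wrapper with a "tables" key.
records = data if isinstance(data, list) else data.get("tables", [])
for t in records:
    print(t.get("page_num"), "|", t.get("title"), "|", (t.get("caption") or "")[:80])
```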
The most useful passages for table references are usually those with:
`match_type = "table_reference"`
Each retrieved passage includes:
`passage_id`, `section_id`, `heading`, and `text`
You can use `section_id` to find the broader section in `paper_sections.json`.
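A sketch of that cross-reference is below. The passage fields come from the list above; the exact layout of `paper_sections.json` (assumed here to be a list of dicts keyed by `section_id`) is an assumption, so check the file and adapt as needed.

```python
import json

# Cross-reference sketch: map each retrieved passage back to its section.
ctx = json.load(open("outputs/papers/cobaltpaper/table_contexts/table_0_context.json"))
sections = json.load(open("outputs/papers/cobaltpaper/paper_sections.json"))

for passage in ctx.get("passages", []):
    sid = passage.get("section_id")
    match = next((s for s in sections
                  if isinstance(s, dict) and s.get("section_id") == sid), None)
    print(sid, "|", passage.get("heading"), "| section found:", match is not None)
```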
Current limitation:
- the context files currently provide section IDs, headings, and passage text
- visual references provide section, paragraph, and character anchors, but not yet page anchors or line numbers inside the paper markdown
The LLM variable-plausibility review should be treated as an inspection artifact, not hidden ground truth.
To evaluate it:
- inspect `table_definitions.json` first to understand the deterministic variables and levels
- inspect `table_variable_plausibility_llm.json` or use `show_llm_variable_plausibility(...)` in R
- review low-scoring variables and their notes
- check whether categorical variables have sensible child levels and whether binary variables look like one-row indicators
- inspect `normalized_tables.json` if the LLM is reacting to a row that was misclassified before the review
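For the second step, a plain dump is often enough to orient before reaching for the R helpers; no internal field names of the review artifact are assumed here.

```python
import json

# Orientation dump of the review artifact; inspect the structure before
# relying on any particular field name.
review = json.load(open("outputs/papers/cobaltpaper/table_variable_plausibility_llm.json"))
print(json.dumps(review, indent=2)[:2000])
```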
Current limitation:
- the review does not change `table_definitions.json`
- for now, users inspect the deterministic parse and the plausibility review side by side
The deterministic `parse` command does not call an LLM.
The optional `review-variable-plausibility` command uses environment-variable-based configuration. If provider setup is missing, it skips provider calls with a setup warning and writes an empty review artifact.
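A tiny pre-flight check can confirm the environment before running the review. It only mirrors the variables documented below and does not replicate the CLI's own configuration logic.

```python
import os

# Illustrative pre-flight check for the documented provider variables.
required = {
    "openai": ["OPENAI_API_KEY", "OPENAI_MODEL"],
    "qwen": ["DASHSCOPE_API_KEY", "QWEN_MODEL", "QWEN_BASE_URL"],
}
provider = os.environ.get("LLM_PROVIDER", "")
missing = [v for v in required.get(provider, []) if not os.environ.get(v)]
print("provider:", provider or "<unset>", "| missing:", missing or "none")
```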
Minimum OpenAI setup:
```bash
export LLM_PROVIDER=openai
export OPENAI_API_KEY=your_api_key_here
export OPENAI_MODEL=gpt-4.1-mini
export LLM_TEMPERATURE=0
export LLM_TIMEOUT_SECONDS=60
export LLM_MAX_RETRIES=2
export LLM_DEBUG=false
export LLM_SDK_DEBUG=false
```

Minimum Qwen setup:
```bash
export LLM_PROVIDER=qwen
export DASHSCOPE_API_KEY=your_api_key_here
export QWEN_MODEL=qwen-plus
export QWEN_BASE_URL=https://dashscope.aliyuncs.com/api/v1
export LLM_TEMPERATURE=0
export LLM_TIMEOUT_SECONDS=60
export LLM_MAX_RETRIES=2
export LLM_DEBUG=false
```
The repository also includes internal scripts for pipeline inspection, diagnostics, and synthetic data generation.
Pipeline summary:

```bash
python3 scripts/debug_pipeline.py testpapers/cobaltpaper.pdf
```

Diagnostics:

```bash
python3 scripts/debug_quality_report.py testpapers/cobaltpaper.pdf
```

Variable-plausibility LLM debug artifacts are written by `table1-parser review-variable-plausibility` when `LLM_DEBUG=true`. The per-run directory contains `llm_variable_plausibility_monitoring.json` plus per-table files such as `variable_plausibility_llm_input.json`, `variable_plausibility_llm_metrics.json`, `variable_plausibility_llm_output.json`, and `variable_plausibility_llm_review.json`.
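A short sketch for listing those timestamped debug runs and their files, using only the paths and file names documented above:

```python
from pathlib import Path

# List timestamped review debug runs and the JSON files inside each.
debug_root = Path("outputs/papers/cobaltpaper/llm_variable_plausibility_debug")
if debug_root.exists():
    for run_dir in sorted(debug_root.iterdir()):
        print(run_dir.name)
        for f in sorted(run_dir.glob("*.json")):
            print("  ", f.name)
```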
The repository keeps the table pipeline JSON-first. The current output and intermediate JSON design is documented in:
For the current LLM review path, the main contract models are in `table1_parser.llm.variable_plausibility_schemas`.
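Rather than assuming the model names in that module, it can be introspected directly; this sketch requires the package to be installed (`pip install -e .`).

```python
# Hedged sketch: list the public names in the documented contract module.
from table1_parser.llm import variable_plausibility_schemas

print([name for name in dir(variable_plausibility_schemas) if not name.startswith("_")])
```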