Skip to content

Update test_network.py#619

Closed
bosd wants to merge 1 commit into
masterfrom
bosd-fix585-attempt
Closed

Update test_network.py#619
bosd wants to merge 1 commit into
masterfrom
bosd-fix585-attempt

Conversation

@bosd

@bosd bosd commented Aug 14, 2025

Copy link
Copy Markdown
Collaborator

No description provided.

@bosd bosd force-pushed the bosd-fix585-attempt branch 3 times, most recently from 47e78f4 to 56537cc Compare August 14, 2025 15:07
@bosd bosd force-pushed the bosd-fix585-attempt branch from ec30867 to b6bed22 Compare May 20, 2026 12:41
@bosd

bosd commented May 20, 2026

Copy link
Copy Markdown
Collaborator Author

Triage update

The substance here is not obsolete — the parser changes are legitimate:

  • compute_plausible_gaps: h_textlines[i].x0 - h_textlines[i-1].x1 (gap between text edges) replaces x0 - x0 (gap between text starts). The new formula is the right one for column-spacing detection — text-width-dependent biases disappear.
  • max(..., default=0) on the merged-zones loop in search_header_from_body_bbox is a real safety fix for the empty-iterator case.
  • New good_energy.pdf fixture is a genuine reproducer for Unable to specify column separators for only one table when there are multiple tables #585-style column-separator confusion.

Why CI is fully red:

  • pre-commit 3.10 fail = branch predates a couple of pre-commit-config.yaml / black / flake8 changes that landed since June.
  • Every test row failing in lockstep = the network gap-formula change shifts what counts as a column boundary on the existing test fixtures (foo.pdf, budget.pdf, …). Either those fixtures need their pinned column/row coords updated, or the formula change needs to be opt-in.

Action items if you want to resurrect this:

  1. Rebase onto current master (mechanical).
  2. Run tests/test_network.py locally and decide for each diff between old and new output: is the new output correct (then update the fixture)? Or did we lose information (then narrow the formula change)?
  3. Add a test specifically exercising good_energy.pdf (the regression that prompted the PR).

I won't auto-rebase or close — this is your work and the parser side has nuance. Happy to do the rebase + fixture update pass if you confirm the direction; just say so. Leaving open for now.

…eparator confusion

Rebase of the older bosd-fix585-attempt onto current master, keeping
only the network parser fixes + the new test fixture (the old branch
had also gone behind master enough that its 'parsing_report' delta
was a straight revert of #739's confidence metric — dropped).

Substance preserved from the original branch:

* compute_plausible_gaps: use the gap between text *edges*
  (h_textlines[i].x0 - h_textlines[i-1].x1) instead of between text
  *starts* (x0 - x0). The new formula is the right one for column
  spacing detection; text-width-dependent biases disappear.
* search_table_body: handle 'no aligned textline found' explicitly
  by returning None instead of dereferencing it downstream.
* search_header_from_body_bbox: add 'default=0' to the merged-zones
  max() so an empty iterator doesn't raise.
* _generate_table_bbox: refactor for two distinct cases — user-
  provided table_areas (run network detection on the area's textlines
  only) vs discovery mode (scan whole page).
* drop noqa: C901 on find_closest_tls — the refactor brought
  cyclomatic complexity back under the limit.

New fixture / tests:

* tests/files/good_energy.pdf: a reproducer for the column-separator
  confusion that motivated the original #585 issue.
* tests/test_network.py: test_issue_585 + test_issue_585_network_flavor_with_table_areas
  exercising the table_areas + columns path on multiple_tables.pdf and
  good_energy.pdf.

If existing test_network fixtures now produce slightly different
column boundaries (the gap-edge change can do that on edge cases),
update them per-fixture rather than narrowing the formula.
@bosd bosd force-pushed the bosd-fix585-attempt branch from b6bed22 to 9119be3 Compare May 20, 2026 13:14
bosd pushed a commit to bosd/camelot that referenced this pull request May 20, 2026
…mal)

Two real crash-path bugs from the year-old camelot-dev#619 (bosd-fix585-attempt):

* TextNetworks.search_table_body: return None early when
  most_connected_textline() has nothing to return. Previously
  dereferenced None on the next line.
* TextNetworks.search_header_from_body_bbox: max(..., default=0)
  on the merged-zones generator so an empty zones list no longer
  raises ValueError: max() arg is an empty sequence.

The _generate_table_bbox refactor was included in an earlier attempt
of this PR but turned out to depend on a _get_user_provided_bboxes
helper that didn't exist on master, breaking every test_network. It
also pushed cyclomatic complexity from 11 to 16 (past the flake8 C901
limit). Dropping it; the refactor's value can be revisited as its
own focused PR.

The gap-edge formula change (.x0 - .x1 instead of .x0 - .x0) stays
held back for the planned camelot-dev#619 split 3/3 with fixture recalibration.
bosd added a commit that referenced this pull request May 20, 2026
…1/3) (#744)

Co-authored-by: bosd <ebo@stefcy.com>
bosd added a commit that referenced this pull request May 20, 2026
…t 2/3) (#745)

Co-authored-by: bosd <ebo@stefcy.com>
@bosd

bosd commented May 20, 2026

Copy link
Copy Markdown
Collaborator Author

Closing as superseded by a clean 3-PR split:

The defensive bits no longer need this branch as their staging ground; closing for hygiene. Will reopen or supersede with a new PR for 3/3 once the local validation against the #585 reproducer is done.

@bosd bosd closed this May 20, 2026
bosd pushed a commit to bosd/camelot that referenced this pull request May 20, 2026
…ormula

A single-line marker at the deferred change site so the work doesn't
slip out of memory when the 2.0 release window closes. The patch
itself is one character per gap line ('.x0 - .x0' -> '.x0 - .x1')
but needs the two xfailed regression tests in tests/test_network.py
(merged in camelot-dev#745) to flip to xpass before it can ship — and that
requires local-run validation against the camelot-dev#585 reproducers
(multiple_tables.pdf, good_energy.pdf).

Pairs with a separate tracking issue opened alongside this commit.
bosd pushed a commit to bosd/camelot that referenced this pull request May 21, 2026
…ter (camelot-dev#770)

Local validation (debug/619 notebook) showed the naive .x0->.x1
gap-formula swap fixes camelot-dev#585 but regresses 12 network/hybrid fixtures
because the gap consumers are stride-calibrated. Replace the
'TODO: apply this change' comment with an 'NB: do NOT apply this
naively, here's why' pointer to camelot-dev#770's analysis, so the next person
doesn't re-attempt the same dead end. The two xfailed camelot-dev#585 tests
remain as the documented limitation.
bosd pushed a commit to bosd/camelot that referenced this pull request May 21, 2026
…ormula

A single-line marker at the deferred change site so the work doesn't
slip out of memory when the 2.0 release window closes. The patch
itself is one character per gap line ('.x0 - .x0' -> '.x0 - .x1')
but needs the two xfailed regression tests in tests/test_network.py
(merged in camelot-dev#745) to flip to xpass before it can ship — and that
requires local-run validation against the camelot-dev#585 reproducers
(multiple_tables.pdf, good_energy.pdf).

Pairs with a separate tracking issue opened alongside this commit.
bosd pushed a commit to bosd/camelot that referenced this pull request May 21, 2026
…ter (camelot-dev#770)

Local validation (debug/619 notebook) showed the naive .x0->.x1
gap-formula swap fixes camelot-dev#585 but regresses 12 network/hybrid fixtures
because the gap consumers are stride-calibrated. Replace the
'TODO: apply this change' comment with an 'NB: do NOT apply this
naively, here's why' pointer to camelot-dev#770's analysis, so the next person
doesn't re-attempt the same dead end. The two xfailed camelot-dev#585 tests
remain as the documented limitation.
bosd pushed a commit to bosd/camelot that referenced this pull request May 21, 2026
…ormula

A single-line marker at the deferred change site so the work doesn't
slip out of memory when the 2.0 release window closes. The patch
itself is one character per gap line ('.x0 - .x0' -> '.x0 - .x1')
but needs the two xfailed regression tests in tests/test_network.py
(merged in camelot-dev#745) to flip to xpass before it can ship — and that
requires local-run validation against the camelot-dev#585 reproducers
(multiple_tables.pdf, good_energy.pdf).

Pairs with a separate tracking issue opened alongside this commit.
bosd pushed a commit to bosd/camelot that referenced this pull request May 21, 2026
…ter (camelot-dev#770)

Local validation (debug/619 notebook) showed the naive .x0->.x1
gap-formula swap fixes camelot-dev#585 but regresses 12 network/hybrid fixtures
because the gap consumers are stride-calibrated. Replace the
'TODO: apply this change' comment with an 'NB: do NOT apply this
naively, here's why' pointer to camelot-dev#770's analysis, so the next person
doesn't re-attempt the same dead end. The two xfailed camelot-dev#585 tests
remain as the documented limitation.
bosd added a commit that referenced this pull request May 21, 2026
Co-authored-by: bosd <ebo@stefcy.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant