refactor: derive inferredParagraphs from rendered body, not extraction #52

Open
ethanj wants to merge 2 commits into main from refactor/derive-inferred-paragraphs

ethanj commented May 1, 2026

Third of four pre-0.6.0 audit-fix PRs.

What

Drop the LLM-generated extraction-time guess for inferredParagraphs and have the lint rule unconditionally derive the count from the rendered body.

The field used to live in three places:

  1. CONCEPT_EXTRACTION_TOOL asked the model to estimate how many paragraphs in the future page would be inferred — a guess made before the page even existed.
  2. The compile path persisted that estimate as inferredParagraphs in the page frontmatter.
  3. checkInferredWithoutCitations trusted the metadata when present and only counted uncited body paragraphs as a fallback.

The metadata value and the body-count fallback regularly disagreed: the model's estimate was unreliable, and once a page was hand-edited the cached value drifted further.

The fix

Body is now the single source of truth.

  • src/compiler/prompts.ts: drop inferred_paragraphs from the tool schema, the prompt's metadata bullet list, RawConcept, and mapRawConcept. The LLM no longer sees the field.
  • src/utils/types.ts: drop inferredParagraphs from ProvenanceMetadata. ExtractedConcept and WikiFrontmatter inherit the slimmer shape via extends ProvenanceMetadata.
  • src/utils/markdown.ts: parseProvenanceMetadata no longer emits the field. Legacy on-disk pages with the field still parse; the loader ignores the unrecognised key.
  • src/compiler/provenance.ts: addProvenanceMeta no longer writes the field. New compiles produce frontmatter without it.
  • src/compiler/index.ts: drop the dead Math.max reconciliation branch in reconcileConceptMetadata.
  • src/linter/rules.ts: checkInferredWithoutCitations always counts uncited prose paragraphs in the body. No metadata path. Catches hand-edits and stays accurate after any page revision.

Behaviour summary: a fully-cited page can no longer be falsely flagged because of a stale frontmatter value, and a body with too many uncited paragraphs always fires the warning regardless of what the (now-absent) metadata field says.
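
As a rough illustration of the body-driven count described above, here is a minimal sketch. It is not the code in src/linter/rules.ts: the function names, the `^[...]` citation marker, and the blank-line paragraph split are all assumptions made for the example.

```typescript
// Hypothetical sketch only. The citation marker format and every name
// below are assumptions, not the project's actual implementation.
const CITATION = /\^\[[^\]]+\]/; // assumed marker, e.g. ^[notes.md]

// Equivalent to the literal /^\p{L}/u: the block starts with a letter in any script.
const PROSE_START = new RegExp("^\\p{L}", "u");

function stripFrontmatter(page: string): string {
  // Drop a leading YAML block so stale metadata cannot influence the count.
  return page.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, "");
}

function countUncitedParagraphs(page: string): number {
  return stripFrontmatter(page)
    .split(/\r?\n\s*\r?\n/) // blank-line-separated blocks
    .map((block) => block.trim())
    .filter((block) => PROSE_START.test(block)) // skips headings, list items, fences
    .filter((block) => !CITATION.test(block)) // keeps only blocks with no citation
    .length;
}

const page = [
  "---",
  "title: Example",
  "inferredParagraphs: 0", // legacy key: has no effect on the count
  "---",
  "# Heading",
  "",
  "A cited claim. ^[notes.md]",
  "",
  "An uncited claim.",
  "",
  "Ещё один абзац без источника.",
].join("\n");

console.log(countUncitedParagraphs(page)); // 2
```

Because the count is recomputed from the body on every lint run, there is no cached value to go stale after a hand-edit.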

Test plan

  • npx tsc --noEmit clean
  • npm run build succeeds
  • npm test — 630 pass / 3 skipped (smoke), no regressions
  • npm run fallow:ci — 0 issues above threshold
  • Existing tests covering the metadata-trust path replaced with body-driven equivalents
  • New regression test pins the "legacy inferredParagraphs frontmatter is intentionally ignored" behaviour so a future re-introduction of the metadata path would break
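
The "legacy frontmatter is intentionally ignored" behaviour in the last bullet can be pictured with a small sketch. This is an assumption-laden illustration, not the real parseProvenanceMetadata: the `confidence` field and both names ending in `Sketch` are hypothetical stand-ins, and only the absence of inferredParagraphs is taken from this PR.

```typescript
// Hypothetical shape: `confidence` stands in for whatever fields the real
// ProvenanceMetadata carries. Only the missing inferredParagraphs key
// reflects this PR.
interface ProvenanceMetadataSketch {
  confidence?: number;
}

function parseProvenanceSketch(
  raw: Record<string, unknown>,
): ProvenanceMetadataSketch {
  const out: ProvenanceMetadataSketch = {};
  const confidence = raw["confidence"];
  if (typeof confidence === "number") out.confidence = confidence;
  // Every other key, including a legacy inferredParagraphs, is dropped.
  return out;
}

// Frontmatter as it might appear on a page compiled before this change.
const legacy = { confidence: 0.8, inferredParagraphs: 7 };
const parsed = parseProvenanceSketch(legacy);

console.log("inferredParagraphs" in parsed); // false
```

Copying only recognised keys means old pages load without migration, and a test asserting the key is absent fails as soon as the metadata path is reintroduced.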

Up next (last audit follow-up)

  • Lower-priority: dedupe checkSchemaCrossLinks / checkPageCrossLinks shared logic; surface seed pages in generation.pages

ethanj added 2 commits May 1, 2026 00:20
Codex's post-merge schema-overlap audit flagged that
inferredParagraphs was an unreliable signal: the LLM was asked at
extraction time to estimate how many paragraphs in the FUTURE page
would be inferred, and the lint rule then trusted that guess when
present, falling back to counting uncited body paragraphs only when
absent. The two paths regularly disagreed.

Drop the extraction-time guess entirely. The rendered body is now the
single source of truth.

  - src/compiler/prompts.ts: drop the `inferred_paragraphs` field from
    CONCEPT_EXTRACTION_TOOL, the prompt's metadata bullet list,
    RawConcept, and mapRawConcept. The LLM no longer produces or even
    sees this field.
  - src/utils/types.ts: drop `inferredParagraphs` from
    ProvenanceMetadata. ExtractedConcept and WikiFrontmatter inherit
    the slimmer shape via `extends ProvenanceMetadata`. Doc updated
    to explain that the field has moved to body-derived lint.
  - src/utils/markdown.ts: drop `parseInferredParagraphs` and stop
    emitting the field from `parseProvenanceMetadata`. Legacy on-disk
    pages with the field still parse fine — the loader just ignores
    the unrecognised key.
  - src/compiler/provenance.ts: drop the inferredParagraphs branch
    from `addProvenanceMeta` so new compiles never write the field.
  - src/compiler/index.ts: drop the now-dead `Math.max` reconciliation
    branch from `reconcileConceptMetadata`.
  - src/linter/rules.ts: `checkInferredWithoutCitations` now
    unconditionally counts uncited prose paragraphs in the body. No
    metadata path. Catches hand-edits and stays accurate after any
    page revision.

Tests: updated `confidence-metadata.test.ts` (parser, frontmatter
round-trip, parseConcepts), `compile-provenance.test.ts`,
`compile-claim-provenance.test.ts`, the just-added
`provenance-metadata-shape.test.ts`, and rewrote the two integration
tests in `confidence-metadata-integration.test.ts` to drive the
excess-inferred-paragraphs rule via body content rather than the
removed metadata field. Added a regression test pinning the new
"legacy frontmatter is intentionally ignored" behaviour so a future
re-introduction of the metadata path would break.

Three findings from Codex review on PR #52:

1. ASCII-only prose detection. /^[A-Za-z]/ silently skipped CJK,
   Cyrillic, Greek, and Arabic paragraphs, so excess-inferred-paragraphs
   would stop firing on pages produced via `--lang Chinese`,
   `--lang Japanese`, etc. (#46). Switch to /^\p{L}/u and add a
   regression test that pins detection of CJK + Cyrillic + Japanese
   prose blocks.

2. README still documented inferredParagraphs as a frontmatter field
   and claimed merge reconciliation took the max — both contradicted
   the new behaviour. Drop the field from the example frontmatter,
   rewrite the reconciliation sentence, and update the lint-rule
   description to make clear the count comes from the body.

3. Stale JSDoc on reconcileConceptMetadata listed an
   `inferredParagraphs: max` rule that no longer exists. Replaced with
   a note explaining the field is body-derived now.
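
To make finding 1 concrete, the sketch below compares the two patterns on a few sample blocks. The sample strings are invented for illustration; only the two regular expressions come from the commit message.

```typescript
// Old pattern from finding 1: matches only blocks starting with an ASCII letter.
const asciiOnly = /^[A-Za-z]/;

// New pattern, written via the constructor; equivalent to the literal /^\p{L}/u.
const anyLetter = new RegExp("^\\p{L}", "u");

// Illustrative sample blocks (not taken from the project's test suite).
const samples = [
  "Plain English prose.",
  "这是一个段落。",
  "Это абзац.",
  "これは段落です。",
  "# Heading",
  "- list item",
  "1. numbered item",
];

for (const block of samples) {
  console.log(
    JSON.stringify(block),
    "ascii:", asciiOnly.test(block),
    "unicode:", anyLetter.test(block),
  );
}
```

The Chinese, Russian, and Japanese blocks fail the ASCII pattern but pass `\p{L}`, while headings, list items, and numbered items start with a non-letter and are rejected by both.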