refactor: derive inferredParagraphs from rendered body, not extraction #52
Open
Codex's post-merge schema-overlap audit flagged that
inferredParagraphs was an unreliable signal: the LLM was asked at
extraction time to estimate how many paragraphs in the FUTURE page
would be inferred, and the lint rule then trusted that guess when
present, falling back to counting uncited body paragraphs only when
absent. The two paths regularly disagreed.
Drop the extraction-time guess entirely. The rendered body is now the
single source of truth.
- src/compiler/prompts.ts: drop the `inferred_paragraphs` field from
CONCEPT_EXTRACTION_TOOL, the prompt's metadata bullet list,
RawConcept, and mapRawConcept. The LLM no longer produces or even
sees this field.
- src/utils/types.ts: drop `inferredParagraphs` from
ProvenanceMetadata. ExtractedConcept and WikiFrontmatter inherit
the slimmer shape via `extends ProvenanceMetadata`. Doc updated
to explain that the field has moved to body-derived lint.
- src/utils/markdown.ts: drop `parseInferredParagraphs` and stop
emitting the field from `parseProvenanceMetadata`. Legacy on-disk
pages with the field still parse fine — the loader just ignores
the unrecognised key.
- src/compiler/provenance.ts: drop the inferredParagraphs branch
from `addProvenanceMeta` so new compiles never write the field.
- src/compiler/index.ts: drop the now-dead `Math.max` reconciliation
branch from `reconcileConceptMetadata`.
- src/linter/rules.ts: `checkInferredWithoutCitations` now
unconditionally counts uncited prose paragraphs in the body. No
metadata path. Catches hand-edits and stays accurate after any
page revision.
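The body-derived count described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual `checkInferredWithoutCitations`: the citation marker syntax (`^[source.md]`), the helper names, and the `MAX_UNCITED` threshold are all assumptions made for the example.

```typescript
// Assumption: citations look like ^[source.md]; the real marker syntax may differ.
const CITATION = /\^\[[^\]]+\]/;

// A block is treated as prose when it opens with a letter. Headings, list
// items, fences and tables open with punctuation and are therefore skipped.
// Built with the RegExp constructor; equivalent to the literal /^\p{L}/u.
const PROSE_START = new RegExp("^\\p{L}", "u");

// Hypothetical helper: counts prose paragraphs that carry no citation marker.
function countUncitedParagraphs(body: string): number {
  return body
    .split(/\n\s*\n/) // paragraphs are separated by blank lines
    .map((block) => block.trim())
    .filter((block) => PROSE_START.test(block) && !CITATION.test(block))
    .length;
}

// Hypothetical threshold; the project's actual limit is not stated in this PR.
const MAX_UNCITED = 2;

function hasExcessInferredParagraphs(body: string): boolean {
  return countUncitedParagraphs(body) > MAX_UNCITED;
}
```

Because the count is recomputed from the body on every lint run, there is no cached value that can drift when a page is hand-edited.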
Tests: updated `confidence-metadata.test.ts` (parser, frontmatter
round-trip, parseConcepts), `compile-provenance.test.ts`,
`compile-claim-provenance.test.ts`, the just-added
`provenance-metadata-shape.test.ts`, and rewrote the two integration
tests in `confidence-metadata-integration.test.ts` to drive the
excess-inferred-paragraphs rule via body content rather than the
removed metadata field. Added a regression test pinning the new
"legacy frontmatter is intentionally ignored" behaviour so a future
re-introduction of the metadata path would break.
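The "legacy frontmatter is intentionally ignored" behaviour can be pictured with a small sketch. The `ProvenanceSketch` shape, the `confidence` field, and `parseProvenanceSketch` are hypothetical stand-ins; the real `ProvenanceMetadata` and `parseProvenanceMetadata` are not shown in this PR text.

```typescript
// Hypothetical, trimmed-down shape. The real ProvenanceMetadata likely has
// other fields; `confidence` is assumed here purely for illustration.
interface ProvenanceSketch {
  confidence?: number;
}

// Copies only the keys the sketch recognises. Anything else in the
// frontmatter, including a legacy `inferredParagraphs`, is dropped.
function parseProvenanceSketch(
  frontmatter: Record<string, unknown>
): ProvenanceSketch {
  const result: ProvenanceSketch = {};
  const confidence = frontmatter["confidence"];
  if (typeof confidence === "number") {
    result.confidence = confidence;
  }
  return result;
}

// A legacy on-disk page that still carries the removed field.
const legacyFrontmatter: Record<string, unknown> = {
  confidence: 0.8,
  inferredParagraphs: 5,
};
const parsedLegacy = parseProvenanceSketch(legacyFrontmatter);
```

A regression test in this style would assert that `parsedLegacy` has no `inferredParagraphs` key, so re-adding a metadata path would make the test fail.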
Three findings from codex review on PR #52:
1. ASCII-only prose detection. `/^[A-Za-z]/` silently skipped CJK, Cyrillic, Greek, and Arabic paragraphs, so excess-inferred-paragraphs would stop firing on pages produced via `--lang Chinese`, `--lang Japanese`, etc. (#46). Switch to `/^\p{L}/u` and add a regression test that pins detection of CJK + Cyrillic + Japanese prose blocks.
2. README still documented inferredParagraphs as a frontmatter field and claimed merge reconciliation took the max; both contradicted the new behaviour. Drop the field from the example frontmatter, rewrite the reconciliation sentence, and update the lint-rule description to make clear the count comes from the body.
3. Stale JSDoc on reconcileConceptMetadata listed an `inferredParagraphs: max` rule that no longer exists. Replaced with a note explaining the field is body-derived now.
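To make the first finding concrete, here is a small comparison of the two patterns on sample paragraph openers. The sample strings and variable names are made up for the example; only the two regular expressions come from the review notes.

```typescript
// The old pattern: matches only blocks that open with an ASCII letter.
const ASCII_PROSE = /^[A-Za-z]/;

// The replacement: any Unicode letter. Built with the RegExp constructor;
// equivalent to the literal /^\p{L}/u.
const UNICODE_PROSE = new RegExp("^\\p{L}", "u");

// Illustrative paragraph openers (not taken from the project's tests).
const samples: string[] = [
  "Plain English prose.",
  "这是一个没有引用的段落。", // Chinese
  "Это абзац без ссылок.", // Russian (Cyrillic)
  "これは引用のない段落です。", // Japanese
  "# Heading",
  "- list item",
];

const asciiHits = samples.filter((s) => ASCII_PROSE.test(s));
const unicodeHits = samples.filter((s) => UNICODE_PROSE.test(s));

console.log(asciiHits.length); // 1: only the English paragraph
console.log(unicodeHits.length); // 4: all prose, no heading or list item
```

The heading and list item are still excluded under the Unicode pattern because `#` and `-` are punctuation, not letters.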
Third of four pre-0.6.0 audit-fix PRs.
What
Drop the LLM-generated extraction-time guess for `inferredParagraphs` and have the lint rule unconditionally derive the count from the rendered body.
The field used to live in three places:
- `CONCEPT_EXTRACTION_TOOL` asked the model to estimate how many paragraphs in the future page would be inferred, a guess made before the page even existed.
- `inferredParagraphs` in the page frontmatter.
- `checkInferredWithoutCitations` trusted the metadata when present and only counted uncited body paragraphs as a fallback.
The two paths regularly disagreed: the model's estimate was unreliable, and once the page was hand-edited the cached value drifted further.
The fix
Body is now the single source of truth.
- `src/compiler/prompts.ts`: drop `inferred_paragraphs` from the tool schema, the prompt's metadata bullet list, `RawConcept`, and `mapRawConcept`. The LLM no longer sees the field.
- `src/utils/types.ts`: drop `inferredParagraphs` from `ProvenanceMetadata`. `ExtractedConcept` and `WikiFrontmatter` inherit the slimmer shape via `extends ProvenanceMetadata`.
- `src/utils/markdown.ts`: `parseProvenanceMetadata` no longer emits the field. Legacy on-disk pages with the field still parse; the loader ignores the unrecognised key.
- `src/compiler/provenance.ts`: `addProvenanceMeta` no longer writes the field. New compiles produce frontmatter without it.
- `src/compiler/index.ts`: drop the dead `Math.max` reconciliation branch in `reconcileConceptMetadata`.
- `src/linter/rules.ts`: `checkInferredWithoutCitations` always counts uncited prose paragraphs in the body. No metadata path. Catches hand-edits and stays accurate after any page revision.
Behaviour summary: a fully-cited page can no longer be falsely flagged because of a stale frontmatter value, and a body with too many uncited paragraphs always fires the warning regardless of what the (now-absent) metadata field says.
Test plan
- `npx tsc --noEmit` clean
- `npm run build` succeeds
- `npm test`: 630 pass / 3 skipped (smoke), no regressions
- `npm run fallow:ci`: 0 issues above threshold
- Regression test pins the "legacy `inferredParagraphs` frontmatter is intentionally ignored" behaviour so a future re-introduction of the metadata path would break
Up next (last audit follow-up)
- `checkSchemaCrossLinks` / `checkPageCrossLinks` shared logic; surface seed pages in `generation.pages`