refactor: derive inferredParagraphs from rendered body, not extraction by ethanj · Pull Request #52 · atomicmemory/llm-wiki-compiler

ethanj · 2026-05-01T07:20:45Z

Third of four pre-0.6.0 audit-fix PRs.

What

Drop the LLM-generated extraction-time guess for inferredParagraphs and have the lint rule unconditionally derive the count from the rendered body.

The field used to live in three places:

CONCEPT_EXTRACTION_TOOL asked the model to estimate how many paragraphs in the future page would be inferred — a guess made before the page even existed.
The compile path persisted that estimate as inferredParagraphs in the page frontmatter.
checkInferredWithoutCitations trusted the metadata when present and only counted uncited body paragraphs as a fallback.

The two paths regularly disagreed — the model's estimate was unreliable, and once the page was hand-edited the cached value drifted further.

The fix

Body is now the single source of truth.

src/compiler/prompts.ts — drop inferred_paragraphs from the tool schema, the prompt's metadata bullet list, RawConcept, and mapRawConcept. The LLM no longer sees the field.
src/utils/types.ts — drop inferredParagraphs from ProvenanceMetadata. ExtractedConcept and WikiFrontmatter inherit the slimmer shape via extends ProvenanceMetadata.
src/utils/markdown.ts — parseProvenanceMetadata no longer emits the field. Legacy on-disk pages with the field still parse — the loader ignores the unrecognised key.
src/compiler/provenance.ts — addProvenanceMeta no longer writes the field. New compiles produce frontmatter without it.
src/compiler/index.ts — drop the dead Math.max reconciliation branch in reconcileConceptMetadata.
src/linter/rules.ts — checkInferredWithoutCitations always counts uncited prose paragraphs in the body. No metadata path. Catches hand-edits and stays accurate after any page revision.
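
A minimal sketch of what a body-driven count could look like, not the code in src/linter/rules.ts. The citation marker syntax (`^[source.md]`) and the helper names `stripFrontmatter` and `countUncitedParagraphs` are assumptions made for illustration; this PR does not show them. Fenced code blocks that contain blank lines are not handled specially here.

```typescript
// ASSUMPTION: citations are written as "^[file.md]" inside a paragraph.
// The PR does not state the marker syntax; swap in the real one.
const CITATION_MARKER = /\^\[[^\]]+\]/;

// A block counts as prose when it starts with a letter. Headings, list items
// and tables start with punctuation, so they are skipped.
const PROSE_START = /^\p{L}/u;

/** Strip a leading YAML frontmatter block, if present (hypothetical helper). */
function stripFrontmatter(page: string): string {
  return page.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n?/, "");
}

/** Count prose paragraphs in the body that carry no citation marker. */
function countUncitedParagraphs(page: string): number {
  return stripFrontmatter(page)
    .split(/\r?\n\s*\r?\n/) // paragraphs are separated by blank lines
    .map((block) => block.trim())
    .filter((block) => PROSE_START.test(block))
    .filter((block) => !CITATION_MARKER.test(block)).length;
}

const page = [
  "---",
  "title: Example",
  "inferredParagraphs: 0", // stale legacy value, never consulted
  "---",
  "# Heading",
  "",
  "A cited claim. ^[a.md]",
  "",
  "An uncited claim.",
  "",
  "- a list item",
  "",
  "Another uncited claim.",
].join("\n");

console.log(countUncitedParagraphs(page)); // 2
```

Because the function never reads the frontmatter, a stale value there cannot change the result in either direction.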

Behaviour summary: a fully-cited page can no longer be falsely flagged because of a stale frontmatter value, and a body with too many uncited paragraphs always fires the warning regardless of what the (now-absent) metadata field says.

Test plan

npx tsc --noEmit clean
npm run build succeeds
npm test — 630 pass / 3 skipped (smoke), no regressions
npm run fallow:ci — 0 issues above threshold
Existing tests covering the metadata-trust path replaced with body-driven equivalents
New regression test pins the "legacy inferredParagraphs frontmatter is intentionally ignored" behaviour so a future re-introduction of the metadata path would break
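
The "legacy frontmatter is intentionally ignored" behaviour can be pictured with a parser that copies only the keys it knows about. This is a sketch under stated assumptions: `parseProvenanceSketch` is a hypothetical stand-in, and `confidence` is a placeholder field, since the real ProvenanceMetadata shape is not listed in this PR.

```typescript
// ASSUMPTION: a trimmed-down metadata shape. `confidence` is a placeholder;
// the real ProvenanceMetadata fields are not shown in this PR.
interface ProvenanceSketch {
  confidence?: number;
}

/** Copy over only the keys the current schema recognises. */
function parseProvenanceSketch(raw: Record<string, unknown>): ProvenanceSketch {
  const meta: ProvenanceSketch = {};
  const confidence = raw["confidence"];
  if (typeof confidence === "number") {
    meta.confidence = confidence;
  }
  // `inferredParagraphs` is deliberately never read, so legacy pages that
  // still carry it parse cleanly and the value cannot reach the lint rule.
  return meta;
}

const legacyFrontmatter = { confidence: 0.8, inferredParagraphs: 7 };
const parsed = parseProvenanceSketch(legacyFrontmatter);

console.log("inferredParagraphs" in parsed); // false
console.log(parsed.confidence); // 0.8
```

A regression test along these lines fails as soon as someone reintroduces a read of the legacy key.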

Up next (last audit follow-up)

Lower-priority: dedupe checkSchemaCrossLinks / checkPageCrossLinks shared logic; surface seed pages in generation.pages

Codex's post-merge schema-overlap audit flagged that inferredParagraphs was an unreliable signal: the LLM was asked at extraction time to estimate how many paragraphs in the FUTURE page would be inferred, and the lint rule then trusted that guess when present, falling back to counting uncited body paragraphs only when absent. The two paths regularly disagreed.

Drop the extraction-time guess entirely. The rendered body is now the single source of truth.

- src/compiler/prompts.ts: drop the `inferred_paragraphs` field from CONCEPT_EXTRACTION_TOOL, the prompt's metadata bullet list, RawConcept, and mapRawConcept. The LLM no longer produces or even sees this field.
- src/utils/types.ts: drop `inferredParagraphs` from ProvenanceMetadata. ExtractedConcept and WikiFrontmatter inherit the slimmer shape via `extends ProvenanceMetadata`. Doc updated to explain that the field has moved to body-derived lint.
- src/utils/markdown.ts: drop `parseInferredParagraphs` and stop emitting the field from `parseProvenanceMetadata`. Legacy on-disk pages with the field still parse fine; the loader just ignores the unrecognised key.
- src/compiler/provenance.ts: drop the inferredParagraphs branch from `addProvenanceMeta` so new compiles never write the field.
- src/compiler/index.ts: drop the now-dead `Math.max` reconciliation branch from `reconcileConceptMetadata`.
- src/linter/rules.ts: `checkInferredWithoutCitations` now unconditionally counts uncited prose paragraphs in the body. No metadata path. Catches hand-edits and stays accurate after any page revision.

Tests: updated `confidence-metadata.test.ts` (parser, frontmatter round-trip, parseConcepts), `compile-provenance.test.ts`, `compile-claim-provenance.test.ts`, the just-added `provenance-metadata-shape.test.ts`, and rewrote the two integration tests in `confidence-metadata-integration.test.ts` to drive the excess-inferred-paragraphs rule via body content rather than the removed metadata field. Added a regression test pinning the new "legacy frontmatter is intentionally ignored" behaviour so a future re-introduction of the metadata path would break.

Three findings from codex review on PR #52:

1. ASCII-only prose detection. /^[A-Za-z]/ silently skipped CJK, Cyrillic, Greek, and Arabic paragraphs, so excess-inferred-paragraphs would stop firing on pages produced via `--lang Chinese`, `--lang Japanese`, etc. (#46). Switch to /^\p{L}/u and add a regression test that pins detection of CJK + Cyrillic + Japanese prose blocks.
2. README still documented inferredParagraphs as a frontmatter field and claimed merge reconciliation took the max; both contradicted the new behaviour. Drop the field from the example frontmatter, rewrite the reconciliation sentence, and update the lint-rule description to make clear the count comes from the body.
3. Stale JSDoc on reconcileConceptMetadata listed an `inferredParagraphs: max` rule that no longer exists. Replaced with a note explaining the field is body-derived now.
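
To make the first finding concrete, the two patterns quoted above can be run side by side. The sample strings and the helper names `startsWithAsciiLetter` and `startsWithAnyLetter` are illustrative and not taken from the repository's tests.

```typescript
// Old check: only ASCII letters count as the start of a prose paragraph.
const ASCII_ONLY = /^[A-Za-z]/;
// New check: any Unicode letter. The `u` flag is required for \p{L}.
const ANY_LETTER = /^\p{L}/u;

function startsWithAsciiLetter(block: string): boolean {
  return ASCII_ONLY.test(block);
}

function startsWithAnyLetter(block: string): boolean {
  return ANY_LETTER.test(block);
}

// Illustrative samples, not the repository's fixtures.
const samples: string[] = [
  "Plain English paragraph.",
  "这是一个中文段落。",
  "Это абзац на русском языке.",
  "これは日本語の段落です。",
  "- a list item, not prose",
];

for (const text of samples) {
  console.log(startsWithAsciiLetter(text), startsWithAnyLetter(text), text);
}
```

Only the English sample passes the ASCII check, while all four prose samples pass the Unicode one. The list item fails both, so non-prose blocks are still skipped.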

ethanj added 2 commits May 1, 2026 00:20

ethanj wants to merge 2 commits into main from refactor/derive-inferred-paragraphs
