Write entity-extraction prompt + alias table for teachers/traditions #308

@meninoebom

Description

What to build

Brandon's contribution. Write the LLM prompt that powers the extractor module (M4).

The prompt receives:

  • A transcript chunk (or full episode — your call on which is better)
  • The known vocabulary: list of teacher names + slugs (from data/teachers/*.json) and tradition names + slugs (from data/traditions/*.mdx)
  • An alias table you maintain (e.g., "Joko" → joko-beck, "HHDL" → dalai-lama-14, "Luang Por" → ajahn-chah, "Adya" → adyashanti)
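A minimal sketch of how the alias table could be looked up once it is loaded. The `AliasTable` shape and the helper names (`normalizeSurfaceForm`, `buildAliasIndex`, `resolveAlias`) are assumptions for illustration; the issue only says that aliases map a surface form to a canonical slug. The four entries are the examples given above.

```typescript
// Hypothetical shape: the issue only says aliases map a surface form to a canonical slug.
type AliasTable = { [alias: string]: string };

// Example entries copied from the issue description.
const aliases: AliasTable = {
  'Joko': 'joko-beck',
  'HHDL': 'dalai-lama-14',
  'Luang Por': 'ajahn-chah',
  'Adya': 'adyashanti',
};

// Lowercase, trim, and collapse internal whitespace so that
// "  luang   por " and "Luang Por" produce the same lookup key.
function normalizeSurfaceForm(surfaceForm: string): string {
  return surfaceForm.trim().replace(/\s+/g, ' ').toLowerCase();
}

// Build a lookup keyed by the normalized alias.
function buildAliasIndex(table: AliasTable): AliasTable {
  const index: AliasTable = {};
  for (const alias of Object.keys(table)) {
    index[normalizeSurfaceForm(alias)] = table[alias];
  }
  return index;
}

// Return the canonical slug for a surface form, or undefined if it is not a known alias.
function resolveAlias(index: AliasTable, surfaceForm: string): string | undefined {
  const key = normalizeSurfaceForm(surfaceForm);
  return Object.prototype.hasOwnProperty.call(index, key) ? index[key] : undefined;
}

const aliasIndex = buildAliasIndex(aliases);
console.log(resolveAlias(aliasIndex, 'luang por')); // "ajahn-chah"
```

Normalizing on both sides keeps the curated file readable (aliases stay in their natural casing) while still matching the casing and spacing variants a transcript tends to produce. Whether ambiguous aliases (one nickname, several teachers) are allowed is left open here.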

It returns structured JSON:

type ExtractedMention = {
  entity_type: 'teacher' | 'tradition' | 'concept';
  entity_slug: string;       // canonical slug; concept mentions can use a free-form slug
  surface_form: string;       // the exact text matched
  confidence: number;         // 0..1
}
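Since the model's output is untrusted text, it may help to validate it against this type before anything is written downstream. The following is only a sketch: `isExtractedMention` and `parseMentions` are hypothetical helper names, and silently dropping malformed entries is an assumption chosen to match the precision-first posture, not a decided behavior.

```typescript
type ExtractedMention = {
  entity_type: 'teacher' | 'tradition' | 'concept';
  entity_slug: string;
  surface_form: string;
  confidence: number;
};

const ENTITY_TYPES = ['teacher', 'tradition', 'concept'];

// Runtime check that an arbitrary parsed value matches ExtractedMention.
function isExtractedMention(value: unknown): value is ExtractedMention {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as { [key: string]: unknown };
  const entityType = record['entity_type'];
  const entitySlug = record['entity_slug'];
  const surfaceForm = record['surface_form'];
  const confidence = record['confidence'];
  return (
    typeof entityType === 'string' &&
    ENTITY_TYPES.indexOf(entityType) !== -1 &&
    typeof entitySlug === 'string' &&
    entitySlug.length > 0 &&
    typeof surfaceForm === 'string' &&
    surfaceForm.length > 0 &&
    typeof confidence === 'number' &&
    confidence >= 0 &&
    confidence <= 1 // NaN fails both comparisons, so it is rejected too
  );
}

// Parse the raw model response and keep only well-formed mentions.
// Throws if the response is not valid JSON or is not a JSON array.
function parseMentions(raw: string): ExtractedMention[] {
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON array of mentions');
  }
  const mentions: ExtractedMention[] = [];
  for (const item of parsed) {
    if (isExtractedMention(item)) mentions.push(item);
  }
  return mentions;
}
```

This checks shape only. Whether a teacher or tradition slug actually exists in the known vocabulary would be a separate check.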

Deliverables:

  • src/lib/extract/prompt.ts exports buildExtractionPrompt(chunk, vocabulary, aliases).
  • data/podcasts/aliases.json — your curated alias table for things you know about.
  • A short doc in docs/podcast/extraction.md explaining the precision/recall posture (we want ≥90% precision, accept lower recall).
  • Hand-label 20 chunks (in tests/fixtures/labeled-chunks.json) so that #309 (Implement entity-extraction pipeline, writes to mentions table) can score the prompt against ground truth.
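One possible starting point for buildExtractionPrompt(chunk, vocabulary, aliases), to make the deliverable concrete. Everything beyond the function name and its three arguments is an assumption: the `Vocabulary` and `AliasTable` shapes, the helper functions, and especially the prompt wording, which is a placeholder to be replaced by the real prompt.

```typescript
// Hypothetical input shapes; the issue names the arguments but not their types.
type VocabularyEntry = { name: string; slug: string };
type Vocabulary = { teachers: VocabularyEntry[]; traditions: VocabularyEntry[] };
type AliasTable = { [alias: string]: string };

// One "- Name (slug)" line per known entity.
function formatEntries(entries: VocabularyEntry[]): string {
  return entries.map((e) => `- ${e.name} (${e.slug})`).join('\n');
}

// One '- "alias" -> slug' line per alias, sorted so the prompt is deterministic.
function formatAliases(aliases: AliasTable): string {
  return Object.keys(aliases)
    .sort()
    .map((alias) => `- "${alias}" -> ${aliases[alias]}`)
    .join('\n');
}

// Placeholder wording: the instructions below are illustrative only.
function buildExtractionPrompt(
  chunk: string,
  vocabulary: Vocabulary,
  aliases: AliasTable
): string {
  return [
    'Extract mentions of teachers, traditions, and concepts from the transcript chunk below.',
    'Favor precision over recall: when unsure whether a mention refers to a known entity, omit it.',
    'For teachers and traditions, use only slugs from the lists below. Concepts may use a free-form slug.',
    '',
    'Known teachers:',
    formatEntries(vocabulary.teachers),
    '',
    'Known traditions:',
    formatEntries(vocabulary.traditions),
    '',
    'Aliases (surface form -> slug):',
    formatAliases(aliases),
    '',
    'Return only a JSON array of objects with the fields',
    'entity_type ("teacher" | "tradition" | "concept"), entity_slug, surface_form, confidence (0..1).',
    '',
    'Transcript chunk:',
    '"""',
    chunk,
    '"""',
  ].join('\n');
}
```

Keeping the function pure (string in, string out, sorted aliases) makes the "unit-tested for shape" criterion easy to satisfy with a snapshot or a few substring checks.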

Why it matters

The vocabulary IS the moat. Your domain knowledge of how teachers are referenced in dharma talks (nicknames, honorifics, transliteration variants) is what separates this from a generic NER pass. No amount of model upgrades will fix a bad prompt or missing alias table.

Acceptance criteria

  • buildExtractionPrompt function exists and is unit-tested for shape
  • aliases.json has at least 20 entries
  • docs/podcast/extraction.md records the posture and known limitations
  • labeled-chunks.json has 20 hand-labeled chunks for evaluation
  • One end-to-end smoke test feeds a labeled chunk through Claude and prints the result for manual review
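For reference, the ≥90% precision target can be checked against a labeled chunk roughly as follows. This is a sketch under assumptions: the format of labeled-chunks.json is not defined in this issue, so the code only assumes each prediction and each label carries an entity_type and entity_slug. Matching on that pair (ignoring surface_form) and treating an empty prediction list as precision 1 are both illustrative conventions, and `scoreMentions` is a hypothetical name.

```typescript
// Minimal fields needed for matching; assumed to be present in both
// model output and the hand-labeled fixtures.
type MentionKey = { entity_type: string; entity_slug: string };
type Score = { truePositives: number; precision: number; recall: number };

// Collapse mentions to unique "type:slug" keys so repeated mentions
// of the same entity in one chunk are counted once.
function uniqueKeys(mentions: MentionKey[]): string[] {
  const seen: { [key: string]: true } = {};
  const keys: string[] = [];
  for (const m of mentions) {
    const key = m.entity_type + ':' + m.entity_slug;
    if (!Object.prototype.hasOwnProperty.call(seen, key)) {
      seen[key] = true;
      keys.push(key);
    }
  }
  return keys;
}

// precision = correct predictions / all predictions
// recall    = correct predictions / all labeled mentions
function scoreMentions(predicted: MentionKey[], labeled: MentionKey[]): Score {
  const predictedKeys = uniqueKeys(predicted);
  const labeledKeys = uniqueKeys(labeled);
  const truePositives = predictedKeys.filter((k) => labeledKeys.indexOf(k) !== -1).length;
  return {
    truePositives,
    // Convention (assumption): nothing predicted means nothing wrong, so precision is 1.
    precision: predictedKeys.length === 0 ? 1 : truePositives / predictedKeys.length,
    recall: labeledKeys.length === 0 ? 1 : truePositives / labeledKeys.length,
  };
}
```

To score all 20 chunks together, summing true positives, predictions, and labels across chunks before dividing avoids over-weighting chunks with very few mentions.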

Dependencies

None. This can be drafted in parallel with other work; #309 needs it before it can ship.

Labels: feature (New functionality), needs-eyes (Requires human review before/during build)
