Hi @hydropix,
Thank you for this amazing project! I've been using it to translate long web novels, and while the current architecture is fantastic, I'd like to propose a feature that would significantly improve translation consistency across long documents: A Two-Pass NER (Named Entity Recognition) Pipeline & Glossary Management System.
The Problem / Use Case
When translating long novels (e.g., 1800+ chapters), the AI often struggles with entity consistency. This is especially problematic when the source text is an existing machine-translated English version rather than the raw Chinese original. For instance, a protagonist named "Li Fan" might randomly get translated as "Li Fanking" or "Li Fangue" in later chapters.
While the current Custom Instructions system helps, managing a manual glossary for hundreds of characters, sects, and locations is tedious.
Proposed Solution: Two-Pass NER Pipeline
I propose adding an optional NER pipeline that extracts entities before the actual translation begins, lets the user verify them, and then injects them into the translation prompt.
Phase 1: NER Core Extraction
- An option to pre-scan batches of chapters and extract entities using an LLM (e.g., Gemma 3, Mistral).
- Categorize extracted entities into: Characters, Locations, Sects, Items.
- Output this as a structured Glossary (JSON format), as sketched below.
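To make this concrete, here's a rough sketch of what the extraction pass could look like. The prompt wording, the `query_llm` helper, and the exact JSON schema are placeholders on my part, not existing code:

```python
import json

# The four proposed default entity categories.
ENTITY_TYPES = ["characters", "locations", "sects", "items"]

# Hypothetical prompt; the exact wording would need tuning per model.
NER_PROMPT = (
    "Extract every named entity from the chapter below. Return ONLY valid "
    f"JSON with the keys {ENTITY_TYPES}, each mapping a source name to its "
    "canonical translated form.\n\nChapter:\n{chapter_text}"
)

def extract_entities(chapter_text: str, query_llm) -> dict:
    """Run one NER pass over a chapter. `query_llm` stands in for
    whatever LLM client the project already uses."""
    raw = query_llm(NER_PROMPT.format(chapter_text=chapter_text))
    entities = json.loads(raw)  # real code would strip markdown fences first
    # Keep only the expected categories; models sometimes invent extras.
    return {k: entities.get(k, {}) for k in ENTITY_TYPES}

def merge_glossaries(per_chapter: list[dict]) -> dict:
    """Merge per-chapter results. First-seen translation wins, so the
    earliest chapters define the canonical names."""
    merged: dict = {k: {} for k in ENTITY_TYPES}
    for glossary in per_chapter:
        for category, names in glossary.items():
            for source, target in names.items():
                merged.setdefault(category, {}).setdefault(source, target)
    return merged
```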
Phase 2: Glossary Management (Web UI & CLI)
- Add a new "Glossary & NER" section in the Web UI.
- Allow users to auto-generate, upload (JSON/CSV; see the loader sketch below), edit, and download glossaries.
- Users can review and fix the auto-generated translations (e.g., enforcing "Qingxuan Sect" across all chapters).
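For the file format, here's one possible shape for the glossary plus a small loader covering both upload formats. The schema, the invented example names, and the CSV column order (`category,source,translation`) are all assumptions:

```python
import csv
import json

# One possible glossary shape (schema is a suggestion, names invented):
EXAMPLE_GLOSSARY = {
    "characters": {"李凡": "Li Fan"},
    "locations": {"青云山": "Qingyun Mountain"},
    "sects": {"清玄宗": "Qingxuan Sect"},
    "items": {"噬魂剑": "Soul-Devouring Sword"},
}

def load_glossary(path: str) -> dict:
    """Load an uploaded glossary. CSV rows are assumed to be
    `category,source,translation` with no header row."""
    if path.endswith(".json"):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    glossary: dict = {}
    with open(path, newline="", encoding="utf-8") as f:
        for category, source, translation in csv.reader(f):
            glossary.setdefault(category, {})[source] = translation
    return glossary
```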
Phase 3: Translation Pipeline Integration
- Modify `generate_translation_prompt()` to accept the active glossary.
- Inject the glossary into the system prompt (e.g., "Use these translations for consistency: 李凡 → Li Fan").
- Store the active `glossary_id` inside the Checkpoint data so that if the translation is paused/resumed, the glossary context is preserved (rough sketch below).
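Here's roughly what the integration could look like. `generate_translation_prompt()` is the real function name from the codebase, but the extra parameter, the helper, and the checkpoint fields are assumptions for illustration:

```python
def format_glossary_block(glossary: dict) -> str:
    """Flatten the glossary into a prompt section, e.g.
    'Use these translations for consistency: 李凡 → Li Fan'."""
    pairs = [
        f"{source} → {target}"
        for names in glossary.values()
        for source, target in names.items()
    ]
    return "Use these translations for consistency:\n" + "\n".join(pairs)

# Hypothetical signature change to the existing prompt builder:
def generate_translation_prompt(chunk: str, glossary: dict | None = None) -> str:
    system_prompt = "Translate the following text into English."  # placeholder
    if glossary:
        system_prompt += "\n\n" + format_glossary_block(glossary)
    return f"{system_prompt}\n\n{chunk}"

# The checkpoint payload would gain a single field so a resumed job
# reloads the same glossary (field names are illustrative):
checkpoint = {
    "job_id": "job-123",
    "last_chunk_index": 42,
    "glossary_id": "my-novel-glossary",  # the new field proposed here
}
```

One practical refinement: for glossaries with hundreds of entries, injecting only the entries whose source form actually appears in the current chunk would keep the prompt short.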
Why This Adds Value
- Quality: Drastically reduces hallucinations in character/location names.
- Automation: Saves users hours of manually reading and compiling glossary text files.
- Seamless Integration: It perfectly complements your existing prompt injection and format adapter architecture.
Open Questions for Implementation
If you are open to this idea, I'd love to discuss:
- Would you prefer the glossary data to be stored as simple local JSON files (a `/glossaries` folder) or integrated into `jobs.db`? (A rough sketch of the latter follows below.)
- Are `characters`, `locations`, `sects`, and `items` sufficient default entity types for the prompt engineering?
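On the first question, to make the trade-off concrete, this is roughly what the `jobs.db` option could look like (table and column names are made up):

```python
import sqlite3

def init_glossary_table(db_path: str = "jobs.db") -> None:
    """Create a glossaries table next to the existing job tables.
    Table and column names are illustrative only."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS glossaries (
                glossary_id TEXT PRIMARY KEY,
                name        TEXT NOT NULL,
                data        TEXT NOT NULL  -- the glossary JSON, stored as text
            )
            """
        )
```

Flat JSON files would be easier to hand-edit and version-control, while the `jobs.db` route makes referencing a `glossary_id` from checkpoints trivial; I'd be happy with either.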
I've already thought about the codebase architecture for this and would be more than happy to help submit PRs for the backend logic or UI components if you think this is a good direction for the project!
Looking forward to your thoughts.