Hi @hydropix,
Thank you for this amazing project! I've been using it to translate long web novels, and while the current architecture is fantastic, I'd like to propose a feature that would significantly improve translation consistency across long documents: A Two-Pass NER (Named Entity Recognition) Pipeline & Glossary Management System.
The Problem / Use Case
When translating long novels (e.g., 1800+ chapters), the AI often struggles with entity consistency. This is especially problematic when the source text is an existing machine-translated English version rather than the raw Chinese original. For instance, a protagonist named "Li Fan" might randomly get translated as "Li Fanking" or "Li Fangue" in later chapters.
While the current Custom Instructions system helps, managing a manual glossary for hundreds of characters, sects, and locations is tedious.
Proposed Solution: Two-Pass NER Pipeline
I propose adding an optional NER pipeline that extracts entities before the actual translation begins, lets the user verify them, and then injects them into the translation prompt.
Phase 1: NER Core Extraction
- An option to pre-scan batches of chapters and extract entities using an LLM (e.g., Gemma 3, Mistral).
- Categorize extracted entities into: Characters, Locations, Sects, Items.
- Output this as a structured Glossary (JSON format), as sketched below.
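To make this concrete, here's a rough sketch of what the extraction pass could look like. The prompt wording, the `query_llm` helper, and the exact JSON schema are placeholders on my part, not existing code:

```python
import json

# The four proposed default entity categories.
ENTITY_TYPES = ["characters", "locations", "sects", "items"]

# Hypothetical prompt; the exact wording would need tuning per model.
NER_PROMPT = (
    "Extract every named entity from the chapter below. Return ONLY valid "
    f"JSON with the keys {ENTITY_TYPES}, each mapping a source name to its "
    "canonical translated form.\n\nChapter:\n{chapter_text}"
)

def extract_entities(chapter_text: str, query_llm) -> dict:
    """Run one NER pass over a chapter. `query_llm` stands in for
    whatever LLM client the project already uses."""
    raw = query_llm(NER_PROMPT.format(chapter_text=chapter_text))
    entities = json.loads(raw)  # real code would strip markdown fences first
    # Keep only the expected categories; models sometimes invent extras.
    return {k: entities.get(k, {}) for k in ENTITY_TYPES}

def merge_glossaries(per_chapter: list[dict]) -> dict:
    """Merge per-chapter results. First-seen translation wins, so the
    earliest chapters define the canonical names."""
    merged: dict = {k: {} for k in ENTITY_TYPES}
    for glossary in per_chapter:
        for category, names in glossary.items():
            for source, target in names.items():
                merged.setdefault(category, {}).setdefault(source, target)
    return merged
```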
Phase 2: Glossary Management (Web UI & CLI)
- Add a new "Glossary & NER" section in the Web UI.
- Allow users to auto-generate, upload (JSON/CSV; see the loader sketch below), edit, and download glossaries.
- Users can review and fix the auto-generated translations (e.g., enforcing "Qingxuan Sect" across all chapters).
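For the file format, here's one possible shape for the glossary plus a small loader covering both upload formats. The schema, the invented example names, and the CSV column order (`category,source,translation`) are all assumptions:

```python
import csv
import json

# One possible glossary shape (schema is a suggestion, names invented):
EXAMPLE_GLOSSARY = {
    "characters": {"李凡": "Li Fan"},
    "locations": {"青云山": "Qingyun Mountain"},
    "sects": {"清玄宗": "Qingxuan Sect"},
    "items": {"噬魂剑": "Soul-Devouring Sword"},
}

def load_glossary(path: str) -> dict:
    """Load an uploaded glossary. CSV rows are assumed to be
    `category,source,translation` with no header row."""
    if path.endswith(".json"):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    glossary: dict = {}
    with open(path, newline="", encoding="utf-8") as f:
        for category, source, translation in csv.reader(f):
            glossary.setdefault(category, {})[source] = translation
    return glossary
```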
Phase 3: Translation Pipeline Integration
- Modify `generate_translation_prompt()` to accept the active glossary.
- Inject the glossary into the system prompt (e.g., "Use these translations for consistency: 李凡 → Li Fan").
- Store the active `glossary_id` inside the Checkpoint data so that if the translation is paused/resumed, the glossary context is preserved (rough sketch below).
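Here's roughly what the integration could look like. `generate_translation_prompt()` is the real function name from the codebase, but the extra parameter, the helper, and the checkpoint fields are assumptions for illustration:

```python
def format_glossary_block(glossary: dict) -> str:
    """Flatten the glossary into a prompt section, e.g.
    'Use these translations for consistency: 李凡 → Li Fan'."""
    pairs = [
        f"{source} → {target}"
        for names in glossary.values()
        for source, target in names.items()
    ]
    return "Use these translations for consistency:\n" + "\n".join(pairs)

# Hypothetical signature change to the existing prompt builder:
def generate_translation_prompt(chunk: str, glossary: dict | None = None) -> str:
    system_prompt = "Translate the following text into English."  # placeholder
    if glossary:
        system_prompt += "\n\n" + format_glossary_block(glossary)
    return f"{system_prompt}\n\n{chunk}"

# The checkpoint payload would gain a single field so a resumed job
# reloads the same glossary (field names are illustrative):
checkpoint = {
    "job_id": "job-123",
    "last_chunk_index": 42,
    "glossary_id": "my-novel-glossary",  # the new field proposed here
}
```

One practical refinement: for glossaries with hundreds of entries, injecting only the entries whose source form actually appears in the current chunk would keep the prompt short.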
Why This Adds Value
- Quality: Drastically reduces hallucinations in character/location names.
- Automation: Saves users hours of manually reading and compiling glossary text files.
- Seamless Integration: It perfectly complements your existing prompt injection and format adapter architecture.
Open Questions for Implementation
If you are open to this idea, I'd love to discuss:
- Would you prefer the glossary data to be stored as simple local JSON files (a `/glossaries` folder) or integrated into `jobs.db`? (A rough sketch of the latter follows below.)
- Are `characters`, `locations`, `sects`, and `items` sufficient default entity types for the prompt engineering?
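On the first question, to make the trade-off concrete, this is roughly what the `jobs.db` option could look like (table and column names are made up):

```python
import sqlite3

def init_glossary_table(db_path: str = "jobs.db") -> None:
    """Create a glossaries table next to the existing job tables.
    Table and column names are illustrative only."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS glossaries (
                glossary_id TEXT PRIMARY KEY,
                name        TEXT NOT NULL,
                data        TEXT NOT NULL  -- the glossary JSON, stored as text
            )
            """
        )
```

Flat JSON files would be easier to hand-edit and version-control, while the `jobs.db` route makes referencing a `glossary_id` from checkpoints trivial; I'd be happy with either.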
I've already thought about the codebase architecture for this and would be more than happy to help submit PRs for the backend logic or UI components if you think this is a good direction for the project!
Looking forward to your thoughts.