Language/Region-specific character mapping tables for AI and LLM text normalization. Instantly convert curly quotes, fancy dashes, ß, spaces, and other problematic or regionally uncommon Unicode characters to plain, standard, or country-preferred forms. Perfect for AI text cleanup, pre/post-processing pipelines, or privacy-friendly browser tools.

patrickdobler/llm-text-normalizer-mappings


Language Typography & Character Mapping for AI Text Cleanup

This project provides language-specific mapping tables to standardize and normalize typographic and special characters in AI-generated or processed texts.
Mappings cover characters that are regionally uncommon, discouraged, or problematic for further text processing.

Why?

  • Many LLMs and online AI services output Unicode characters that are undesirable or unconventional for end users in specific countries (e.g. curly quotes, em dashes, narrow spaces, ß).
  • This repo defines clear rules to convert these to standard, widely-supported, or country-preferred equivalents.
  • Main use case: LLM or AI text pre/post-processing in privacy-focused, in-browser, or server-side tools.

Usage

Import the JSON mapping for your target language/region and apply its character replacements in your text pipeline.

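Usage example (Python-style pseudocode; mapping is the dictionary loaded from the chosen JSON file, text is the string to clean):

for char, replacement in mapping.items():
    text = text.replace(char, replacement)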
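
The loop above applies the rules one after another, so the output of one rule can be rewritten by a later rule. The sketch below shows one possible alternative that replaces everything in a single pass. It assumes each mapping file is a flat JSON object of the form {source: replacement}; that layout is inferred from the loop and not confirmed here. The names load_mapping, normalize, and sample_mapping are hypothetical helpers for illustration, not part of this repository.

```python
import json
import re
from pathlib import Path


def load_mapping(path):
    """Load a mapping file, assumed to be a flat JSON object {source: replacement}."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def normalize(text, mapping):
    """Replace every key of `mapping` found in `text` in one pass over the text."""
    if not mapping:
        return text
    # Try longer keys first so a multi-character key wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical sample entries for illustration; the real tables live in the JSON files.
sample_mapping = {
    "\u201c": '"',   # left curly double quote
    "\u201d": '"',   # right curly double quote
    "\u2014": "-",   # em dash
    "\u00df": "ss",  # sharp s
    "\u202f": " ",   # narrow no-break space
}

print(normalize("\u201cStra\u00dfe\u201d \u2014 ok", sample_mapping))
# prints: "Strasse" - ok
```

With a real file, the call would look like normalize(text, load_mapping("swiss-german.json")). Because the text is scanned only once, a replacement is never itself replaced, which keeps the result independent of the order of entries in the file.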

Supported Languages

  • Swiss German (swiss-german.json)
  • German (german.json)
  • French (french.json)
  • Italian (italian.json)
  • English (International) (english-international.json)
  • English (US) (english-us.json)
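
If a tool needs to pick one of these files from a user's locale, a small lookup with a language-only fallback is one way to do it. This is a minimal sketch: the file names come from the list above, but the locale tags, the fallback rule, and the names MAPPING_FILES and mapping_file_for are assumptions made for illustration and are not defined by this repository.

```python
# Assumed association between locale tags and the mapping files listed above.
MAPPING_FILES = {
    "de-CH": "swiss-german.json",
    "de": "german.json",
    "fr": "french.json",
    "it": "italian.json",
    "en": "english-international.json",
    "en-US": "english-us.json",
}


def mapping_file_for(locale, default="english-international.json"):
    """Return the mapping file name for a locale such as 'de_CH' or 'en-US'."""
    parts = locale.replace("_", "-").split("-")
    language = parts[0].lower()
    # Try the full language-region tag first, then the bare language.
    if len(parts) > 1:
        tag = f"{language}-{parts[1].upper()}"
        if tag in MAPPING_FILES:
            return MAPPING_FILES[tag]
    return MAPPING_FILES.get(language, default)


print(mapping_file_for("de_CH"))  # prints: swiss-german.json
print(mapping_file_for("de-AT"))  # prints: german.json
```

Falling back to the bare language means a region without its own table still gets the closest available mapping. Whether that is appropriate depends on the region's conventions, so the default is left as a parameter.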

Contributing

Feel free to open issues or PRs for new mappings, edge cases, or country-specific improvements!
