mongolian cyrillic work march 2025 #187

Closed
wants to merge 2 commits into from

650 changes: 613 additions & 37 deletions scriptshifter/tables/data/_cyrillic_base.yml

Large diffs are not rendered by default.

12 changes: 8 additions & 4 deletions scriptshifter/tables/data/asian_cyrillic.yml
@@ -1,5 +1,5 @@
general:
name: Asian (Cyrillic)
name: Cyrillic (Generic)
parents:
- _cyrillic_base

@@ -384,15 +384,19 @@ roman_to_script:
"(|)": "\u0482"
"(^)": "\u0488"
"(')": "\u0489"


# Two Less-than signs mapped to Left-pointing double angle quotation mark
"\u003C\u003C": "\u00AB"
# Two Greater-than signs mapped to Right-pointing double angle quotation mark
"\u003E\u003E": "\u00BB"
Comment on lines +388 to 391
Collaborator

Is this a transliteration task, specific to Cyrillic, or a normalization task that applies elsewhere?

We now have a `normalize` section available in the configuration. It is currently only available for S2R (script to Roman), but I could implement it for the other direction too.
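
To illustrate the distinction the question draws, here is a minimal Python sketch. It assumes normalization is a script-independent pre-pass that folds variant characters into one canonical form, and that transliteration is a separate script-specific map. The function names and data are hypothetical. They are not ScriptShifter's API and do not show the actual syntax of the `normalize` section.

```python
def normalize(text, variants):
    """Fold variant spellings into one canonical form (script-independent pre-pass)."""
    for canonical, alternatives in variants.items():
        for alt in alternatives:
            text = text.replace(alt, canonical)
    return text


def map_chars(text, table):
    """Apply a script-specific map (single-character keys only, for brevity)."""
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical normalization data: guillemets folded into doubled angle brackets.
VARIANTS = {"<<": ["\u00AB"], ">>": ["\u00BB"]}
# Hypothetical Cyrillic-specific map, reduced to two letters.
CYRILLIC = {"\u0434": "d", "\u0430": "a"}


def script_to_roman(text):
    # Normalize first, then transliterate, so the quote handling
    # does not have to be repeated in every script's table.
    return map_chars(normalize(text, VARIANTS), CYRILLIC)


if __name__ == "__main__":
    print(script_to_roman("\u00AB\u0434\u0430\u00BB"))
```

If the quotation-mark handling is meant to apply beyond Cyrillic, keeping it in a shared pre-pass like this avoids duplicating the two entries in each table.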


script_to_roman:
map:

"\u00AB": "\""
"\u00BB": "\""
# Left-pointing double angle quotation mark mapped to Two Less-than signs
"\u00AB": "\u003C\u003C"
# Right-pointing double angle quotation mark mapped to Two Greater-than signs
"\u00BB": "\u003E\u003E"
"\u2116": "No\u0332"
"\u0400": "E\u0300"
"\u0401": "E\u0308"
54 changes: 17 additions & 37 deletions scriptshifter/tables/data/bulgarian.yml
@@ -5,54 +5,34 @@ general:

roman_to_script:
map:
"G": "\u0413"
"g": "\u0433"
# this conversion shouldn't be needed, but does no harm
"ZH": "\u0416"
"Zh": "\u0416"
"zh": "\u0436"
"I\uFE20E\uFE21": "\u0462"
# this conversion shouldn't be needed, but does no harm
"I\uFE20e\uFE21": "\u0462"
# this conversion shouldn't be needed, but does no harm
# this conversion shouldn't be needed, but does no harm
"I": "\u0418"
"i\uFE20e\uFE21": "\u0463"
"i": "\u0438"
# this conversion shouldn't be needed, but does no harm
"SHT": "\u0429"
"Sht": "\u0429"
"sht": "\u0449"
"T\uFE20S\uFE21": "\u0426"
# this conversion shouldn't be needed, but does no harm
"T\uFE20s\uFE21": "\u0426"
"t\uFE20s\uFE21": "\u0446"
"U\u0310": "\u046A"
"U\u0306": "\u042A"
# Mapping from precomposed non-MARC-8 Latin equivalent
"\u016C": "\u042A"
"u\u0306": "\u044A"
# Mapping from precomposed non-MARC-8 Latin equivalent
"\u016D": "\u044A"
"U\u0310": "\u046A"
"u\u0310": "\u046B"
# this conversion is ambiguous - \u042A is also theoretically possible
"\u02BA": "\u044A"
# upper case hard sign is unlikely to occur
"\u02BA\u0332": "\u042A"

script_to_roman:
map:
"\u044C": ""
"\u042C": ""
"\u044A": ""
"\u042A%": "u\u0306"
"\u042A": ""
"\u0413": "G"
"\u0433": "g"
"\u0416": "Zh"
"\u0436": "zh"
"\u0462": "I\uFE20E\uFE21"
"\u0418": "I"
"\u0463": "i\uFE20e\uFE21"
"\u0438": "i"
"\u0429": "Sht"
"\u042A": "U\u0306"
# Capital letter hard sign at the end of a word (rare)
"%\u042A": "\u02BA\u0332"
Collaborator

Note that SS (ScriptShifter) behaves the opposite way to Transliterator with respect to the % sign, which denotes a word boundary. So, to match a token at the end of a word, the % sign should go at the end of the token.

Collaborator

Suggested change
"%\u042A": "\u02BA\u0332"
"\u042A%": "\u02BA\u0332"
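
The word-boundary rule from the review comment can be sketched in Python as follows. This is a self-contained illustration, not ScriptShifter's actual matching code. The `transliterate` function, the longest-key-first ordering, and the definition of a word end (end of input or a non-alphabetic next character) are all assumptions made for the example.

```python
WORD_BOUNDARY = "%"  # marker taken from the review comment

# Reduced table: the small hard sign at word end vs. anywhere else.
TABLE = {
    "\u044A%": "\u02BA",   # hard sign followed by a word boundary
    "\u044A": "u\u0306",   # hard sign in any other position
}


def transliterate(text, table):
    """Greedy replacement; a trailing % in a key anchors the token to a word end."""
    # Try longer keys first so the boundary-anchored key is tested before the plain one.
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            at_end = key.endswith(WORD_BOUNDARY)
            token = key[:-1] if at_end else key
            if not text.startswith(token, i):
                continue
            nxt = i + len(token)
            # Assumed definition: a word ends at end of input or before a non-letter.
            if at_end and nxt < len(text) and text[nxt].isalpha():
                continue
            out.append(table[key])
            i = nxt
            break
        else:
            # No key matched here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("\u0432\u044A \u043F\u044A\u0442", TABLE))
```

With the `%` placed before the token instead, as in the original lines, a matcher following this convention would anchor the token to the start of a word and the final hard sign would fall through to the plain mapping.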

"\u042C": "\u02B9\u0332"
"\u0449": "sht"
"\u0426": "T\uFE20S\uFE21"
"\u0446": "t\uFE20s\uFE21"
"\u044A": "u\u0306"
# Small letter hard sign at the end of a word (rare)
"%\u044A": "\u02BA"
Collaborator

Same as above: the % sign should go at the end of the token.

Suggested change
"%\u044A": "\u02BA"
"\u044A%": "\u02BA"

"\u044C": "\u02B9"
"\u046A": "U\u0310"
"\u046B": "u\u0310"
"\u042A": "u\u016C"
"\u044A": "u\u016D"
