mongolian cyrillic work march 2025 #187
base: main
Conversation
I suggested some changes.
Note that, as I mentioned elsewhere, Python has a very simple function to normalize pre-composed Unicode characters to their decomposed version, and vice versa. If the decomposed form is what we consider the canonical one, I can apply the decomposition step to all Roman text input before the transliteration, so that we only need to map decomposed sequences. Let me know if this is something you want to do.
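For instance, a minimal sketch of what that step could look like, using the standard unicodedata module (the helper names below are only illustrative, not what is implemented in SS):

```python
import unicodedata

def decompose(text: str) -> str:
    # NFD splits precomposed characters into base letter + combining marks,
    # e.g. "\u017D" (Z with caron) becomes "Z" + "\u030C" (combining caron).
    return unicodedata.normalize("NFD", text)

def recompose(text: str) -> str:
    # NFC is the inverse: it recombines base + combining mark sequences
    # into precomposed characters where they exist.
    return unicodedata.normalize("NFC", text)

assert decompose("\u017D") == "Z\u030C"
assert recompose("Z\u030C") == "\u017D"
```

Running something like this on all Roman input before the R2S lookup would mean the tables only ever need to map decomposed sequences.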
# Two Less-than signs mapped to Left-pointing double angle quotation mark
"\u003C\u003C": "\u00AB"
# Two Greater-than signs mapped to Right-pointing double angle quotation mark
"\u003E\u003E": "\u00BB"
Is this a transliteration task, specific to Cyrillic, or a normalization task that applies elsewhere?
We now have a normalize section available in the configuration. It is only available in S2R at the moment, but I could implement it for the other direction too.
"\u0429": "Sht" | ||
"\u042A": "U\u0306" | ||
# Capital letter hard sign at the end of a word (rare) | ||
"%\u042A": "\u02BA\u0332" |
Note that SS has the opposite behavior to Transliterator when it comes to the % sign, in that it denotes a word boundary. So, for matching a token at the end of a word, the % sign should be at the end of the token.
"%\u042A": "\u02BA\u0332" | |
"\u042A%": "\u02BA\u0332" |
"\u0446": "t\uFE20s\uFE21" | ||
"\u044A": "u\u0306" | ||
# Small letter hard sign at the end of a word (rare) | ||
"%\u044A": "\u02BA" |
Ditto as above.
"%\u044A": "\u02BA" | |
"\u044A%": "\u02BA" |
@@ -64,6 +53,17 @@ church_slavonic:
chuvash_cyrillic:
  marc_code: chv
  name: Chuvash (Cyrillic)
cyrillic_generic:
  description: 'Multi-purpose transliteration for most languages that use the Cyrillic script:
I see languages that already have their own specific table in this list. Should they be removed? This message is only informational but it can be misleading.
"D\uFE20z\uFE21": "\u0405" | ||
"d\uFE20Z\uFE21": "\u0405" | ||
# Mapping from precomposed non-MARC-8 Latin equivalent | ||
"\u01F1": "\u405" |
"\u01F1": "\u405" | |
"\u01F1": "\u0405" |
"\u01F2": "\u405" | ||
"d\uFE20z\uFE21": "\u0455" | ||
# Mapping from precomposed non-MARC-8 Latin equivalent | ||
"\u01F3": "\u455" |
"\u01F3": "\u455" | |
"\u01F3": "\u0455" |
"\u0111": "\u0452" | ||
"dz\u030C": "\u045F" | ||
"dz": "\u0455" | ||
"Z\u030C": "\u0416" | ||
# Mapping from precomposed non-MARC-8 Latin equivalent | ||
"\u017D": "\u0416" |
I have found a very simple way to convert all precomposed Unicode characters into their decomposed form. I can add that as a default step to the R2S process. In that case, we don't need to worry about mapping precomposed characters if they have a decomposed version.
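One caveat worth checking, sketched below with Python's standard unicodedata module (the assertions are illustrative only, not part of this PR): NFD only undoes canonical compositions, so digraph letters like DZ only break apart under NFKD, and their explicit table mappings would still be needed even with a default decomposition step.

```python
import unicodedata

# Canonically precomposed characters are covered by a default NFD step:
assert unicodedata.normalize("NFD", "\u017D") == "Z\u030C"  # Z with caron -> Z + combining caron

# Digraph letters such as DZ (U+01F1) only decompose under NFKD (compatibility),
# so NFD leaves them untouched and their table mappings remain necessary:
assert unicodedata.normalize("NFD", "\u01F1") == "\u01F1"
assert unicodedata.normalize("NFKD", "\u01F1") == "DZ"
```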
script_to_roman:

  map:
    # ga"
# ga" | |
# ga" | |
Also, as a general rule, it is best to submit pull requests focused on one feature, so the addition of Tod and Manchu could have been submitted separately. It's fine to keep these in now, but future PRs will be easier to review and approve if they are separated by feature.
These commits are from Randy and Rachel working in our development version of SS. They probably need a little finessing to merge.
---- Updates from Randy ----