mongolian cyrillic work march 2025 #187

Open
wants to merge 2 commits into
base: main

Member

@thisismattmiller thisismattmiller commented Mar 6, 2025

These commits are from Randy and Rachel working in our development version of SS. They probably need a little finessing to merge.

---- Updates from Randy ----

  1. I enhanced the _cyrillic_base.yml file to include default mappings for all Cyrillic script characters. Now the language-specific *.yml files will control those language-based exceptions to the Cyrillic base set. That is the way it should have been working all along but the Cyrillic base file was greatly deficient. I didn’t create it initially. I think Stefano did. He must not have known which characters to include. I’ve taken the approach that the huge Cyrillic set is shared by all languages with a few exceptional practices in some languages. A majority of the languages that use the Cyrillic script do not have any exceptional practices in transliteration from the Cyrillic base set.
  2. I created a new mapping file called “cyrillic_generic.yml” that basically duplicates the “asian_cyrillic.yml” mapping. I would like to remove the “asian_cyrillic.yml” mapping completely. Its name seems to have confused people into thinking it gives accurate conversion for all Asian languages that use the Cyrillic script. It does NOT. The generic Cyrillic mapping doesn’t provide correct mappings for all Cyrillic-based languages BUT it does allow you to switch between the Cyrillic and Latin scripts with unique encodings on both sides. These conversions are correct for many languages. Those for which it is not correct DO have their own language-specific mapping that is also available in ScriptShifter. The rationale behind having a generic Cyrillic is that sometimes you have a book with Russian and Komi, or Chechen and Russian. The Russian table alone isn’t enough but the generic Cyrillic works for all three languages. I’d like to create a list of all the Cyrillic languages for which the Cyrillic Generic mapping works perfectly. It’s a lot of languages.
  3. I adjusted the Bulgarian *.yml file to correct, or I should say “enhance”, the exceptional mapping for the Cyrillic letter “Ъ/ъ”. For almost all Cyrillic-based languages this letter pair maps to the “hard sign” (double prime) \u02BA, but for Bulgarian the letters are mapped to the Latin U/u with combining breve. The problem here, and elsewhere, was that the Latin side of the mapping was referencing the Unicode precomposed Latin letters Ŭ/ŭ, which have completely different Unicode character values. It’s no wonder they weren’t converting properly. This problem is widespread: whenever someone has a Latin script base letter followed by a combining diacritical mark of some kind, it could also be represented by a single precomposed Latin script character. I discussed this problem with Sally. I think Stefano may be able to solve the problem more globally by adding a normalization process to the Latin side. The Voyager system automatically normalizes most precomposed Latin script letters to their base letter followed by a combining character. NOTE: Not all Latin letters with combining character(s) can be represented by a precomposed Unicode character. The MARC-8 and Unicode combining characters allow for what we’ve called an “open repertoire” of possible combinations. The Unicode Consortium added hundreds of precomposed letters with diacritics, especially within the Latin, Cyrillic, and Greek repertoires. Despite the richness of Unicode, some combinations are missing. For example, “U̇/u̇” (letters U/u with combining dot above) can only be encoded with the Latin base letter followed by the combining dot above; there is no precomposed Latin character for this combination. (O/o with dot above does have precomposed forms, U+022E/U+022F, so it can appear either way.) MANY of the Cyrillic-based non-Slavic languages have special O/o and U/u pairs that have been mapped to “Ȯ/ȯ” and “U̇/u̇”.
There are other unusual combinations of Latin base letters with combining marks specified in the ALA-LC Romanization Tables that are NOT represented by precomposed Unicode characters. If we were developing the ALA-LC transliteration schemes today we might limit our choice of modified Latin letters to those in Unicode, but we can’t easily change our legacy data to that new approach. So, the problem will have to be dealt with somehow, perhaps by partial fixes.
  4. I fixed the conversion mappings in the Macedonian *.yml file which did not match the most recent version of the ALA-LC documentation. Macedonian had shared the table with Serbian, with inherent problems. Both were changed in 2021.
  5. I fixed the conversion mappings in the Serbian *.yml file which, like Macedonian, shared one table until 2021, which resulted in problems. Both Macedonian and Serbian are working fine now in the Test App.
  6. The problem reported for Moldovan Cyrillic was related to the deficiencies in the _cyrillic_base.yml file. Once the enhanced Cyrillic base file is in production, the problems will go away. There were probably problems for other languages as well, except that no one noticed them or reported them.
  7. I made some adjustments to the tod_mongolian.yml file to improve the mappings. The “Tod” (clear) version of the Mongolian script is used in northwest China by Uighurs and Oirats. Until recently I had tried to generate the original script for the dialects that use the special Mongolian letters that make up the Tod script. Tod uses a few letters from the basic Mongolian alphabet, but replaces many with letters that are easier to interpret. That’s why it’s called the “clear” (tod) script: it eliminates the ambiguities of the regular Mongolian traditional script. I won’t go into details. Creating a _mongolian_base.yml like we have for Cyrillic makes no sense. Tod Mongolian and Mongolian are almost completely mutually exclusive in terms of the characters they use; they share few. Unicode defines them all in the same block because they are so closely related otherwise.
  8. I added a new conversion mapping file for the Manchu language which, like 7) above, defines additional Mongolian script characters. Manchu, however, is a different language. One of its dialects, Sibo (Xibe), uses the same letters. I almost named this new mapping “Manchu/Sibo” but there is very little published in Sibo. Sibo is, however, still spoken in northern China. Manchu, the official language of the Qing empire (the last in China, which ended in 1912), is virtually extinct. Chinese has completely replaced Manchu as a national language of China. There are, however, a lot of Chinese government documents in Manchu that seem to be the focus of a lot of research. LC has a contractor cataloging many of them. Having a way to generate the original Manchu script is important.
  9. I talked to Stefano about adding MARC subfield identifiers, which are alphanumeric, to the list of strings in the “_ignore_base.yml”. MARC subfields are always introduced by the “‡” (\u2021) followed by a single lowercase Latin letter (a-z) or digit (0-9). Transliterator has always ignored subfield codes but ScriptShifter treats them as regular alphanumeric data. Given the presence of the subfield delimiter, it should be easy to avoid converting these. I would have updated the ignore list myself but Stefano suggested a more efficient way to specify this array logically. I will leave that up to him to address at some point, sooner rather than later, I hope. In the interim, I could add the list of 36 strings to the “_ignore_base.yml” file. What do you think? I use ScriptShifter quite often in Voyager for scripts that Voyager will not allow me to enter directly. The subfield codes get converted in error and I have to fix them. It’s a pain.
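
For item 9, here is a minimal sketch of the two options mentioned: the explicit list of 36 strings, and a single pattern that expresses the same set logically. The names `SUBFIELD_DELIM`, `subfield_codes`, `SUBFIELD_RE`, and `split_protected` are hypothetical, and nothing here reflects how ScriptShifter actually reads `_ignore_base.yml`.

```python
import re
import string

SUBFIELD_DELIM = "\u2021"  # the double dagger that introduces a MARC subfield code

def subfield_codes():
    """Return the 36 literal strings: delimiter + a-z, then delimiter + 0-9."""
    return [SUBFIELD_DELIM + c for c in string.ascii_lowercase + string.digits]

# The same set expressed as one pattern instead of 36 literals.
SUBFIELD_RE = re.compile(SUBFIELD_DELIM + r"[a-z0-9]")

def split_protected(text):
    """Split text into (chunk, is_subfield_code) pairs.

    Chunks flagged True would be passed through unchanged; the rest
    would be handed to the transliteration step.
    """
    parts = []
    pos = 0
    for m in SUBFIELD_RE.finditer(text):
        if m.start() > pos:
            parts.append((text[pos:m.start()], False))
        parts.append((m.group(), True))
        pos = m.end()
    if pos < len(text):
        parts.append((text[pos:], False))
    return parts
```

Either form identifies the same 36 codes; the pattern is simply shorter to maintain if the configuration format can express it.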

@thisismattmiller thisismattmiller requested a review from scossu March 6, 2025 18:45
Collaborator

@scossu scossu left a comment

I suggested some changes.

Note that, as I mentioned elsewhere, Python has a very simple function to normalize pre-composed Unicode characters to their decomposed version, and vice versa. If the decomposed form is what we consider the canonical one, I can apply the decomposition step to all Roman text input before the transliteration, so that we only need to map decomposed sequences. Let me know if this is something you want to do.
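
Assuming the function meant here is `unicodedata.normalize` from the standard library (the comment does not name it), the decomposition step could look roughly like this. `decompose` is a hypothetical helper name, not something taken from ScriptShifter.

```python
import unicodedata

def decompose(text):
    """Canonical decomposition (NFD): precomposed letters become base + combining marks."""
    return unicodedata.normalize("NFD", text)

# Precomposed U+016D (u with breve) becomes "u" + U+0306 (combining breve),
# which is the form the Bulgarian mapping uses.
assert decompose("\u016D") == "u\u0306"

# Input that is already decomposed is left unchanged, so the step can be
# applied to all Roman input without special cases.
assert decompose("u\u0306") == "u\u0306"

# The reverse direction (NFC) composes only where a precomposed character exists:
# "u" + U+0307 (combining dot above) has none, so it stays as two code points.
assert unicodedata.normalize("NFC", "u\u0307") == "u\u0307"
```

With a step like this in front of the transliteration, the mapping tables would only need entries for the decomposed sequences.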

Comment on lines +388 to 391
# Two Less-than signs mapped to Left-pointing double angle quotation mark
"\u003C\u003C": "\u00AB"
# Two Greater-than signs mapped to Right-pointing double angle quotation mark
"\u003E\u003E": "\u00BB"
Collaborator

Is this a transliteration task, specific to Cyrillic, or a normalization task that applies elsewhere?

We now have a normalize section available in the configuration. It is only available in S2R at the moment, but I could implement it for the other direction too.

"\u0429": "Sht"
"\u042A": "U\u0306"
# Capital letter hard sign at the end of a word (rare)
"%\u042A": "\u02BA\u0332"
Collaborator

Note that SS has an opposite behavior to Transliterator when it comes to the % sign, in that it denotes a word boundary. So, for matching a token at the end of a word, the % sign should be at the end of the token.
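
To illustrate why the position of the % sign matters, here is a small sketch that reads a leading or trailing % in a map key as a word boundary. It is only an approximation for discussion: `token_to_regex` is a hypothetical helper, and ScriptShifter's real matching logic may work quite differently.

```python
import re

def token_to_regex(token):
    """Compile a map key, reading a leading or trailing % as a word boundary."""
    at_start = len(token) > 1 and token.startswith("%")
    at_end = len(token) > 1 and token.endswith("%")
    core = token[1:] if at_start else token
    core = core[:-1] if at_end else core
    pattern = (r"\b" if at_start else "") + re.escape(core) + (r"\b" if at_end else "")
    return re.compile(pattern)

word_final = token_to_regex("\u044A%")    # hard sign followed by a word boundary
word_initial = token_to_regex("%\u044A")  # hard sign preceded by a word boundary

# The hard sign ends the first word here, so the trailing-% key matches.
assert word_final.search("\u043E\u0431\u044A \u0434\u0430")
# Mid-word hard sign: the trailing-% key does not match.
assert not word_final.search("\u043E\u0431\u044A\u0435\u043A\u0442")
# The leading-% key only matches a word-initial hard sign, so it misses the word-final case.
assert not word_initial.search("\u043E\u0431\u044A \u0434\u0430")
```

Under this reading, a key written as `"%\u044A"` would never fire for a hard sign at the end of a word, which is the case the comment in the mapping file describes.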

Collaborator

Suggested change
- "%\u042A": "\u02BA\u0332"
+ "\u042A%": "\u02BA\u0332"

"\u0446": "t\uFE20s\uFE21"
"\u044A": "u\u0306"
# Small letter hard sign at the end of a word (rare)
"%\u044A": "\u02BA"
Collaborator

Ditto as above.

Suggested change
- "%\u044A": "\u02BA"
+ "\u044A%": "\u02BA"

@@ -64,6 +53,17 @@ church_slavonic:
chuvash_cyrillic:
marc_code: chv
name: Chuvash (Cyrillic)
cyrillic_generic:
description: 'Multi-purpose transliteration for most languages that use the Cyrillic script:
Collaborator

I see languages that already have their own specific table in this list. Should they be removed? This message is only informational but it can be misleading.

"D\uFE20z\uFE21": "\u0405"
"d\uFE20Z\uFE21": "\u0405"
# Mapping from precomposed non-MARC-8 Latin equivalent
"\u01F1": "\u405"
Collaborator

Suggested change
- "\u01F1": "\u405"
+ "\u01F1": "\u0405"

"\u01F2": "\u405"
"d\uFE20z\uFE21": "\u0455"
# Mapping from precomposed non-MARC-8 Latin equivalent
"\u01F3": "\u455"
Collaborator

Suggested change
- "\u01F3": "\u455"
+ "\u01F3": "\u0455"

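In a YAML double-quoted string, a `\u` escape takes exactly four hexadecimal digits, which is why `"\u405"` is malformed and `"\u0405"` is needed. A rough way to catch this class of typo before review is a plain-text scan like the one below. It is a regex over raw lines, not a YAML parser; `find_short_u_escapes` is a hypothetical helper, and it does not try to handle escaped backslashes.

```python
import re

# "\u" followed by fewer than four hex digits, then a non-hex character or end of line.
SHORT_U_ESCAPE = re.compile(r"\\u[0-9A-Fa-f]{0,3}(?![0-9A-Fa-f])")

def find_short_u_escapes(yaml_text):
    """Return (line_number, escape) pairs for \\u escapes with fewer than four hex digits."""
    hits = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        for m in SHORT_U_ESCAPE.finditer(line):
            hits.append((lineno, m.group()))
    return hits
```

Run over the text of a mapping file, it reports the line number and the offending escape for each short sequence it finds.
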
"\u0111": "\u0452"
"dz\u030C": "\u045F"
"dz": "\u0455"
"Z\u030C": "\u0416"
# Mapping from precomposed non-MARC-8 Latin equivalent
"\u017D": "\u0416"
Collaborator

I have found a very simple way to convert all precomposed Unicode characters into their decomposed form. I can add that as a default step to the R2S process. In that case, we don't need to worry about mapping precomposed characters if they have a decomposed version.
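
The "if they have a decomposed version" condition is worth checking per key, because the precomposed keys in this diff do not all behave the same way. A small sketch, assuming canonical decomposition (NFD) via the standard library is the step in question; `has_canonical_decomposition` is a hypothetical helper name.

```python
import unicodedata

def has_canonical_decomposition(key):
    """True if NFD changes the key, so NFD-normalized input never contains it as written."""
    return unicodedata.normalize("NFD", key) != key

# U+017D (Z with caron) decomposes to "Z" + U+030C, so the "Z\u030C" entry covers it.
assert unicodedata.normalize("NFD", "\u017D") == "Z\u030C"
assert has_canonical_decomposition("\u017D")

# U+01F1 (DZ digraph) has only a compatibility decomposition, and U+0111
# (d with stroke) has none, so NFD leaves both unchanged.
assert not has_canonical_decomposition("\u01F1")
assert not has_canonical_decomposition("\u0111")
```

So under NFD, an entry such as `"\u017D"` would become redundant, while entries keyed on `"\u01F1"` or `"\u0111"` would presumably still be needed.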

script_to_roman:

map:
# ga"
Collaborator

Suggested change
# ga"
# ga"

Collaborator

scossu commented Mar 16, 2025

Also, as a general rule, it is best to submit pull requests focused on one feature, so the addition of Tod and Manchu could have been submitted separately. It's fine to keep these in now, but future PRs will be easier to review and approve if they are separated by feature.
