Some case-swapping edge cases are handled incorrectly #25

Open
jaynetics opened this issue Dec 22, 2024 · 1 comment


Issue extracted from #21.

Onigmo applies the mappings from https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt in both directions in case-insensitive mode, but only for literals, ranges, positive Unicode properties outside of charsets, positive or negated properties inside charsets, and possibly some other exotic cases.

This applies even if the mapping is between one char and multiple chars, and even if the single char is part of a charset (e.g. /[ß]/i), which allows some unquantified charsets to match more than one char.

E.g. the literal ß maps to ["ss", "sS", "sſ", "Ss", "SS", "Sſ", "ſs", "ſS", "ſſ", "ẞ"], and vice versa, even if in a charset (/[ß]/i), but only if present as a literal, part of a positive range, part of a positive property, or part of a negated property in a charset (i.e. not when part of a char type, a positive posix class, a negated charset containing literals, ranges, or char types, a plain negated property, a negated property in a negated charset, or a negated posix class).
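
The variant list above looks like every two-character combination of s, S and ſ, plus the capital ẞ. A minimal Ruby sketch that rebuilds the list that way and prints what the local engine does with each variant (the constant names are illustrative, and no particular match result is assumed):

```ruby
# The three characters that fold to "s": ASCII s, ASCII S, and long s (U+017F).
S_FORMS = ["s", "S", "ſ"].freeze

# All ordered pairs of those forms, plus capital sharp s (U+1E9E).
SHARP_S_VARIANTS = (S_FORMS.product(S_FORMS).map(&:join) + ["ẞ"]).freeze

# Print whether each variant matches /ß/i on the Ruby version at hand.
SHARP_S_VARIANTS.each do |variant|
  puts "#{variant.inspect} =~ /ß/i: #{variant.match?(/ß/i)}"
end
```

Printing instead of asserting keeps the sketch usable for comparing Ruby/Onigmo versions, since the results may differ between them.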

Not yet tested: codepoint lists, meta & control escapes, absence groups, effects of backrefs or subexp calls, nested charsets, combining various expressions in charsets, free-spacing mode, ...

RUBY_VERSION # => "3.3.6"

# literal matching cases
'SS'[/ß/i] # => "SS"
'ASSE'[/ß/i] # => "SS"
'ASSE'[/AßE/i] # => "ASSE"
'ASSE'[/A[ß]E/i] # => "ASSE" # (!)

# non-literal matching cases
'SS'[/\u00DF/i] # => "SS"
'SS'[/[\u00DE-\u00E0]/i] # => "SS"
'SS'[/[Þ-à]/i] # => "SS"
'SS'[/[Þ-à&&Þ-á]/i] # => "SS"
'SS'[/[T-\u{10FFFF}]/i] # => "S" # because s -> S ?
'SS'[/[t-\u{10FFFF}]/i] # => "S" # because ſ -> S ?
'SS'[/\p{word}/i] # => "S"
'ASSE'[/A\p{word}E/i] # => "ASSE"
'ASSE'[/A[\p{word}]E/i] # => "ASSE"
'ASSE'[/A[\p{^Mark}]E/i] # => "ASSE"

# non-matching cases
'ASSE'[/A.E/i] # => nil
'ASSE'[/A[.]E/i] # => nil
'ASSE'[/A[^x]E/i] # => nil
'ASSE'[/A[^x-y]E/i] # => nil
'ASSE'[/A[^\d]E/i] # => nil
'ASSE'[/A(?u:\w)E/i] # => nil
'ASSE'[/A\p{^Mark}E/i] # => nil
'ASSE'[/A[[:word:]]E/i] # => nil
'ASSE'[/A[[:^digit:]]E/i] # => nil
'ASSE'[/A[\p{^Mark}]E/i] # => nil
'ASSE'[/A[^\p{^word}]E/i] # => nil
'ASSE'[/A[ß]{2}E/i] # => nil

# inverse direction
'ß'[/SS/i] # => "ß"
'ß'[/ss/i] # => "ß"

slevithan commented Dec 22, 2024

Per your description, this case-insensitive mapping from one to multiple chars seems like a mess. I ran just your first test in Oniguruma, and it's doing the same thing as Onigmo.

Note that JS doesn't do this for Unicode case folding (which it applies when flag i is combined with u or v). So a good starting point might be to not yet worry about expansions to multiple chars, but still support other aspects of Unicode case folding (which js_regex doesn't yet do) like titlecase chars and Turkish İ ı.
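
One way to picture the "no expansions yet" starting point is a filter that keeps only the case variants that are exactly one character long. The sketch below uses Ruby's built-in String case methods as a stand-in for the Unicode data files. `single_char_case_variants` is a hypothetical helper, not part of js_regex, and it only follows mappings that start from the given character (so it would not find ẞ when given ß).

```ruby
# Hypothetical helper: collect the case variants of a single character
# that are themselves one character long, dropping any multi-char expansion.
def single_char_case_variants(char)
  candidates = [
    char,
    char.downcase,
    char.upcase,
    char.capitalize,      # may yield a titlecase form for some characters
    char.swapcase,
    char.downcase(:fold)  # Unicode case folding, may expand to several chars
  ]
  candidates.uniq.select { |variant| variant.length == 1 }
end

%w[a ß ǆ].each do |char|
  puts "#{char.inspect} -> #{single_char_case_variants(char).inspect}"
end
```

Turkic mappings such as İ/ı are locale-specific and are not covered here. Ruby exposes them separately through the `:turkic` option of the same methods.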

Not sure what the recommendation is for case-insensitive length expansions in regular expressions per UTS 18 (i.e., not sure whether JS or Onigmo more closely follows the Unicode recommendations).

Edit: See also https://www.unicode.org/Public/UNIDATA/CaseFolding.txt
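
CaseFolding.txt marks each mapping with a status: C (common), S (simple), F (full, may expand to several chars) or T (Turkic special cases). A rough sketch of separating length-preserving folds from expanding ones, using a handful of lines copied from that file as inline sample data (`parse_case_folding` and the constant name are made up for illustration, and a real implementation would read the whole file):

```ruby
# A few entries in CaseFolding.txt format: code; status; mapping; # name
SAMPLE_CASE_FOLDING = <<~DATA
  0041; C; 0061; # LATIN CAPITAL LETTER A
  0049; T; 0131; # LATIN CAPITAL LETTER I
  00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
  0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
  1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
  1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
DATA

# Hypothetical parser: turns each data line into a hash of strings.
def parse_case_folding(text)
  text.each_line.filter_map do |line|
    data = line.sub(/#.*/, "").strip # drop the trailing comment
    next if data.empty?

    code, status, mapping = data.split(";").map(&:strip)
    {
      from: [code.hex].pack("U"),
      status: status,
      to: mapping.split.map(&:hex).pack("U*")
    }
  end
end

entries = parse_case_folding(SAMPLE_CASE_FOLDING)

# C and S entries map one char to one char; F entries can change the length.
length_preserving = entries.select { |e| %w[C S].include?(e[:status]) }
expanding = entries.select { |e| e[:status] == "F" }

puts "length-preserving: #{length_preserving.map { |e| e.values_at(:from, :to) }.inspect}"
puts "expanding: #{expanding.map { |e| e.values_at(:from, :to) }.inspect}"
```

Restricting to C and S entries would correspond to simple case folding, which sidesteps length-changing matches entirely.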
