Some case-swapping edge cases are handled incorrectly #25

Open
@jaynetics

Issue extracted from #21.

Onigmo applies the mappings in https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt in both directions in case-insensitive mode, but only for literals, ranges, positive Unicode properties outside of charsets, positive or negated properties inside charsets, and possibly some other exotic cases.

This applies even if the mapping is between one char and multiple chars, and even if the single char is part of a charset (e.g. /[ß]/i), which allows some unquantified charsets to match more than one char.

E.g. the literal ß maps to ["ss", "sS", "sſ", "Ss", "SS", "Sſ", "ſs", "ſS", "ſſ", "ẞ"], and vice versa, even inside a charset (/[ß]/i). This only happens if it is present as a literal, as part of a positive range, as part of a positive property, or as part of a negated property in a charset. It does not happen when it is part of a char type, a positive POSIX class, a negated charset containing literals, ranges, or char types, a plain negated property, a negated property in a negated charset, or a negated POSIX class.
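
The nine two-character entries in the list above are exactly the ordered pairs drawn from s, S and ſ, with ẞ (U+1E9E) as the tenth entry. A minimal sketch that rebuilds the list and checks each entry against /ß/i follows. The constant names are made up for illustration, and no output is shown because the results may depend on the Ruby/Onigmo version:

```ruby
# Hypothetical constant names; not part of Ruby or Onigmo.
S_FORMS = %w[s S ſ].freeze

# All ordered pairs of s-like chars, plus capital sharp s (U+1E9E).
SHARP_S_VARIANTS = (S_FORMS.product(S_FORMS).map(&:join) + ['ẞ']).freeze

# Print the text matched by /ß/i for each variant (nil means no match).
SHARP_S_VARIANTS.each do |variant|
  puts "#{variant.inspect} => #{variant[/ß/i].inspect}"
end
```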

Not yet tested: codepoint lists, meta & control escapes, absence groups, effects of backrefs or subexp calls, nested charsets, combining various expressions in charsets, free-spacing mode, ...
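
One possible way to work through the untested constructs is a small probe that compiles each pattern source with the /i flag and records what it matches. This is only a sketch: `probe` is a hypothetical helper, the pattern list is an arbitrary sample, and no results are claimed here.

```ruby
# Hypothetical helper: compile each pattern source with the /i flag and
# record what it matches in `subject` (nil for no match, or the error
# message if the source does not compile).
def probe(subject, sources)
  sources.to_h do |source|
    result = begin
      subject[Regexp.new(source, Regexp::IGNORECASE)]
    rescue RegexpError => e
      "RegexpError: #{e.message}"
    end
    [source, result]
  end
end

# A few of the untested constructs, written as pattern sources.
untested = [
  'A\u{DF 45}',   # codepoint list (ß followed by E)
  'A[[ß]]E',      # nested charset
  'A[a-zß]E',     # literal combined with a range in a charset
  'A(?x: ß )E'    # free-spacing mode
]

probe('ASSE', untested).each do |source, result|
  puts "#{source} => #{result.inspect}"
end
```

Passing pattern sources as strings rather than regexp literals keeps a single invalid pattern from aborting the whole run.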

RUBY_VERSION # => "3.3.6"

# literal matching cases
'SS'[/ß/i] # => "SS"
'ASSE'[/ß/i] # => "SS"
'ASSE'[/AßE/i] # => "ASSE"
'ASSE'[/A[ß]E/i] # => "ASSE" # (!)

# non-literal matching cases
'SS'[/\u00DF/i] # => "SS"
'SS'[/[\u00DE-\u00E0]/i] # => "SS"
'SS'[/[Þ-à]/i] # => "SS"
'SS'[/[Þ-à&&Þ-á]/i] # => "SS"
'SS'[/[T-\u{10FFFF}]/i] # => "S" # because s -> S ?
'SS'[/[t-\u{10FFFF}]/i] # => "S" # because ſ -> S ?
'SS'[/\p{word}/i] # => "S"
'ASSE'[/A\p{word}E/i] # => "ASSE"
'ASSE'[/A[\p{word}]E/i] # => "ASSE"
'ASSE'[/A[\p{^Mark}]E/i] # => "ASSE"

# non-matching cases
'ASSE'[/A.E/i] # => nil
'ASSE'[/A[.]E/i] # => nil
'ASSE'[/A[^x]E/i] # => nil
'ASSE'[/A[^x-y]E/i] # => nil
'ASSE'[/A[^\d]E/i] # => nil
'ASSE'[/A(?u:\w)E/i] # => nil
'ASSE'[/A\p{^Mark}E/i] # => nil
'ASSE'[/A[[:word:]]E/i] # => nil
'ASSE'[/A[[:^digit:]]E/i] # => nil
'ASSE'[/A[\p{^Mark}]E/i] # => nil
'ASSE'[/A[^\p{^word}]E/i] # => nil
'ASSE'[/A[ß]{2}E/i] # => nil

# inverse direction
'ß'[/SS/i] # => "ß"
'ß'[/ss/i] # => "ß"
