Regexp: support for case-insensitive Unicode matching #2130

Open
balajirrao wants to merge 10 commits into mozilla:master from balajirrao:regexp-unicode-caseinsensitive

@balajirrao
Contributor

@balajirrao balajirrao commented Oct 17, 2025

Enable Unicode case-insensitive regex matching (/iu flag combination) using approximate case folding.
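
As a rough illustration of what "approximate case folding" can mean on the JVM, the sketch below compares two code points by running each through the JDK's simple upper-then-lower mapping. The class and method names (`ApproxCaseFold`, `fold`, `matchesIgnoreCase`) are hypothetical and are not taken from this PR; the approach is an assumption about one possible approximation, and it can disagree with the simple case folding in Unicode's `CaseFolding.txt` for a few characters (U+0130 is a likely example).

```java
final class ApproxCaseFold {
    // Hypothetical helper, not Rhino's actual code: approximates simple case
    // folding using the JDK's per-code-point case mappings.
    public static int fold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Two code points are treated as equal under /iu if they fold to the same value.
    public static boolean matchesIgnoreCase(int a, int b) {
        return fold(a) == fold(b);
    }

    public static void main(String[] args) {
        // U+212A KELVIN SIGN compared with 'k'
        System.out.println(matchesIgnoreCase('k', 0x212A));
        // U+017F LATIN SMALL LETTER LONG S compared with 's'
        System.out.println(matchesIgnoreCase('s', 0x017F));
        // Unrelated letters
        System.out.println(matchesIgnoreCase('a', 'b'));
    }
}
```

Working on `int` code points rather than `char` values keeps the comparison valid for supplementary characters, which matters once the `u` flag is set.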

@balajirrao balajirrao force-pushed the regexp-unicode-caseinsensitive branch from 6a24f28 to 647c882 Compare October 17, 2025 16:06
@rbri
Collaborator

rbri commented Nov 21, 2025

@balajirrao any plans for finishing this? Waiting for that to make the separate engine PR...

@balajirrao
Contributor Author

@rbri I thought I was going to finish it and then I hit a wall. I believe that in order to do this in the general case, we'd need icu4j. I'm considering creating a module outside of Rhino, say, rhino-icu4j, that when included would offer complete Unicode support in regexps and possibly in other cases too. How does that sound?

@andreabergia
Contributor

andreabergia commented Nov 28, 2025

> @rbri I thought I was going to finish it and then I hit a wall. I believe that in order to do this in the general case, we'd need icu4j. I'm considering creating a module outside of Rhino, say, rhino-icu4j, that when included would offer complete Unicode support in regexps and possibly in other cases too. How does that sound?

IMHO that's the right approach. An opt-in module that, if present, adds the capability. If not, we can error out with "not supported". It would be a good improvement on what we do now.

@aardvark179
Contributor

I'm not sure the complement classes present an insurmountable wall. icu4j would certainly offer a route to a complete implementation, but it would also be entirely reasonable to calculate classes, and their complements, when needed. Looping from 0 to MAX_CODE_POINT and building a range structure doesn't actually take much time, and most Unicode classes have ranges that can be represented pretty compactly.
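
A minimal sketch of that idea, assuming a class can be expressed as an `IntPredicate` over code points: one pass from 0 to `Character.MAX_CODE_POINT` collects maximal inclusive ranges, and the complement is derived from the gaps between them. `CodePointRanges`, `build`, and `complement` are hypothetical names for illustration, not code from this PR or from Rhino.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class CodePointRanges {
    // Scan every code point once and collect maximal inclusive ranges
    // {start, end} for which the predicate holds.
    public static List<int[]> build(IntPredicate inClass) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a range"
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (inClass.test(cp)) {
                if (start < 0) {
                    start = cp;
                }
            } else if (start >= 0) {
                ranges.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, Character.MAX_CODE_POINT});
        }
        return ranges;
    }

    // Complement of a sorted, non-overlapping range list: the gaps between ranges.
    public static List<int[]> complement(List<int[]> ranges) {
        List<int[]> out = new ArrayList<>();
        int next = Character.MIN_CODE_POINT;
        for (int[] r : ranges) {
            if (r[0] > next) {
                out.add(new int[] {next, r[0] - 1});
            }
            next = r[1] + 1;
        }
        if (next <= Character.MAX_CODE_POINT) {
            out.add(new int[] {next, Character.MAX_CODE_POINT});
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> upper = build(Character::isUpperCase);
        System.out.println("uppercase ranges: " + upper.size());
        System.out.println("complement ranges: " + complement(upper).size());
    }
}
```

Because the ranges come out sorted and non-overlapping, membership can later be checked with a binary search, and the complement never needs a second scan over the code point space.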

@balajirrao balajirrao force-pushed the regexp-unicode-caseinsensitive branch 3 times, most recently from fa34971 to e8f1bf2 Compare February 24, 2026 09:44
@balajirrao balajirrao force-pushed the regexp-unicode-caseinsensitive branch from e8f1bf2 to 462c8b0 Compare February 24, 2026 10:29
@balajirrao balajirrao marked this pull request as ready for review February 24, 2026 13:38
@balajirrao balajirrao force-pushed the regexp-unicode-caseinsensitive branch from 462c8b0 to db0763e Compare February 24, 2026 14:08
@balajirrao
Contributor Author

balajirrao commented Feb 24, 2026

I've finally managed to finish it up.

@aardvark179 It turns out I didn't need to compute the case fold of arbitrary Unicode regions in u mode. It's needed only for v mode. The spec was clear on this; it was MDN that confused me.

@rbri would appreciate you taking a look when you have a chance!

@balajirrao balajirrao changed the title Regexp: support for case-insensitive unicode matching Regexp: support for case-insensitive Unicode matching Feb 24, 2026
@rbri
Collaborator

rbri commented Mar 1, 2026

@balajirrao I did a smoke test with this and also took the chance to ask some LLMs to create test cases for it. It all looks good; I think we can go with it.


4 participants