<regex>: Add multiline option and make non-multiline mode the default #5535

Open
wants to merge 1 commit into
base: main
from

Conversation

muellerj2
Contributor

@muellerj2 muellerj2 commented May 23, 2025

Resolves #73 (also tracked by DevCom-268592 / VSO-629739) and implements LWG-2503. Arguably also resolves DevCom-436138 to the degree that it is reasonable (namely, when the anchor appears in the regex before it starts branching); see the benchmark. Unblocks four libcxx tests.

This PR also aligns the multiline mode with ECMAScript's specification. The anchors now match at any of ECMAScript's line terminators: carriage returns, line feeds, line separators and paragraph separators. Before, the anchors only matched at line feeds.
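As a rough illustration of the anchor rule described above (not the PR's actual matcher code), the check could be sketched as follows. The helper names `is_line_terminator`, `caret_matches_at`, and `dollar_matches_at` are hypothetical, and the sketch assumes the text is held as UTF-32 code points.

```cpp
#include <cstddef>
#include <string>

// Assumption: the four ECMAScript line terminators named in the description.
constexpr bool is_line_terminator(char32_t c) {
    return c == U'\n'       // line feed (U+000A)
        || c == U'\r'       // carriage return (U+000D)
        || c == U'\u2028'   // line separator
        || c == U'\u2029';  // paragraph separator
}

// Hypothetical helper: may '^' match at position pos of text?
// Without multiline mode, only the very start of the text qualifies.
bool caret_matches_at(const std::u32string& text, std::size_t pos, bool multiline) {
    if (pos == 0) {
        return true;
    }
    return multiline && pos <= text.size() && is_line_terminator(text[pos - 1]);
}

// Hypothetical helper: may '$' match at position pos of text?
// Without multiline mode, only the very end of the text qualifies.
bool dollar_matches_at(const std::u32string& text, std::size_t pos, bool multiline) {
    if (pos == text.size()) {
        return true;
    }
    return multiline && pos < text.size() && is_line_terminator(text[pos]);
}
```

In this sketch, "^bibe" could match in the middle of the text only when `multiline` is true and the preceding code point is one of the four terminators.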

The PR provides _REGEX_MAKE_MULTILINE_MODE_DEFAULT as an escape hatch that restores multiline mode as the default; when it is defined, non-multiline mode is not available.

For POSIX grammars, the new multiline option has no effect. While I find this unfortunate, this behavior appears to have been specified in [re.synopt].

To simplify the logic in the matcher and avoid some preprocessor #ifdefs, the matcher's internal copy of the regex syntax flags _Sflags is mutated before matching starts:

  • The multiline flag is set for all grammars when the escape hatch is defined.
  • The multiline flag is cleared for POSIX grammars when the escape hatch is not defined.

These mutations ensure that multiline mode is enabled if and only if the multiline flag is set in _Sflags.
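The two mutations above could be sketched roughly like this. The flag names and bit values are hypothetical stand-ins (the real `_Sflags` layout is not shown in this description), and the escape hatch is modeled as a runtime `bool` rather than a preprocessor macro so that both cases can be exercised in one program.

```cpp
// Hypothetical stand-ins for the syntax flags; not the library's real values.
enum syntax_flags : unsigned {
    flag_ecmascript = 1u << 0,  // ECMAScript grammar selected
    flag_posix      = 1u << 1,  // stands for any POSIX grammar
    flag_multiline  = 1u << 2   // multiline option requested
};

// Returns the matcher's internal copy of the flags after the mutation.
// escape_hatch_defined models whether _REGEX_MAKE_MULTILINE_MODE_DEFAULT is defined.
constexpr unsigned normalize_flags(unsigned flags, bool escape_hatch_defined) {
    return escape_hatch_defined
               ? (flags | flag_multiline)  // set for all grammars
               : ((flags & flag_posix)
                      ? (flags & ~static_cast<unsigned>(flag_multiline))  // cleared for POSIX
                      : flags);            // ECMAScript: keep what the user asked for
}
```

After this step, the matcher in the sketch would only need to test `flag_multiline`, with no further reference to the grammar or the escape hatch.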

I see one potential concern with the implementation in this PR: even if the escape hatch is defined, the matcher still changes behavior, because anchors now match at all ECMAScript line terminators, not just at line feeds. It can reasonably be argued that the behavior should be completely unchanged when the escape hatch is defined. Even so, I opted to submit the implementation with ECMAScript-conforming line terminators first because this simplifies the implementation a lot.

Benchmark

Run only for the pattern "^bibe", to show that this resolves DevCom-436138.

Benchmark                        Time        CPU   Iterations
bm_lorem_search/"^bibe"/2     56.4 ns    57.8 ns     10000000
bm_lorem_search/"^bibe"/3     55.9 ns    54.4 ns     11200000
bm_lorem_search/"^bibe"/4     55.5 ns    56.2 ns     10000000

@muellerj2 muellerj2 requested a review from a team as a code owner May 23, 2025 21:53
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 23, 2025
@muellerj2 muellerj2 force-pushed the regex-implement-multiline-option branch from c8e4499 to 38709a6 Compare May 23, 2025 21:55
@muellerj2 muellerj2 force-pushed the regex-implement-multiline-option branch from 38709a6 to 1ef8a05 Compare May 23, 2025 21:57
@StephanTLavavej StephanTLavavej self-assigned this May 27, 2025
@StephanTLavavej StephanTLavavej added LWG Library Working Group issue regex meow is a substring of homeowner labels May 27, 2025
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

LWG-2503 multiline option should be added to syntax_option_type
2 participants