Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<regex>: Clean up parsing logic for quantifiers #5253

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

muellerj2
Copy link
Contributor

The parser uses some confusing control flow to generate the NFA for quantifiers. The control flow goes like this: The parser sets the _Fl_final flag on the last node generated before processing the quantifier (by calling _Builder::_Mark_final() in _Parser::_Quantifier()). Then it is checked in _Builder::_Add_rep() whether the last generated node is a node that matches a string with more than one character. If so, the function removes the last character from this node and then adds the character to the NFA again by calling _Builder::_Add_char(). Because the _Fl_final flag is set on the last string-matching node (that we just removed a character from), _Builder::_Add_char() doesn't just re-add the character to the string-matching node again, but instead creates a new string node that matches this one character. Finally, the repetition is added to the NFA.

There are no other usages of the _Fl_final in <regex>, so supporting this confusing parsing and NFA generation logic seems to be its only purpose.

This PR cleans up this logic and removes any usage of the _Fl_final flag: When the last generated node is a string-matching node with more than one character, _Builder::_Add_rep() now explicitly calls _Builder::_Add_str_node() to generate a new string-matching node (so no more confusing checking of the _Fl_final flag) and then moves the last character from the previous string-matching node to the newly created one (by calling _Buf::_Del() and _Builder::_Add_char2()).

_Builder::_Add_char() and _Builder::_Add_rep() are renamed, otherwise regular expressions could be miscompiled when old and new functions are mixed. The _Builder::_Mark_final() function is no longer used and thus removed. The new tests make sure we don't accidentally miscompile quantifiers (e.g., by not moving the last character from a preceding string-matching node to a new one, or by incorporating too few or too many nodes in the repetition).

The _Fl_final flag is no longer set on nodes, but I think this flag was never used by the matcher. I can't come up with any reasonable use for it when matching (the flag marks the last node in a repetition, except when it marks the last node before a repetition) and I also couldn't find any usage of the flag in the matcher in the oldest toolset 14.16.27023 (VS 2017) I still had on one of my machines. But it would prudent to check the very first VS 2015 toolset as well.

Alternatively, we can also avoid the source code archaeology and just insert the following line at the very beginning of _Builder::_Add_rep2():

_Current->_Flags |= _Fl_final; // TRANSITION, ABI

Then the flag gets set on the same nodes before and after this PR.

@muellerj2 muellerj2 requested a review from a team as a code owner January 26, 2025 13:53
@StephanTLavavej
Copy link
Member

I also couldn't find any usage of the flag in the matcher in the oldest toolset 14.16.27023 (VS 2017) I still had on one of my machines. But it would prudent to check the very first VS 2015 toolset as well.

We weren't in the habit of changing <regex> after VS 2015, so I would consider that sufficient proof.

@StephanTLavavej StephanTLavavej added enhancement Something can be improved regex Everyone's favorite header labels Jan 27, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Something can be improved regex Everyone's favorite header
Projects
Status: Initial Review
Development

Successfully merging this pull request may close these issues.

2 participants