<regex>
: Clean up parsing logic for quantifiers
#5253
+33
−18
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The parser uses some confusing control flow to generate the NFA for quantifiers. The control flow goes like this: The parser sets the
_Fl_final
flag on the last node generated before processing the quantifier (by calling_Builder::_Mark_final()
in_Parser::_Quantifier()
). Then it is checked in_Builder::_Add_rep()
whether the last generated node is a node that matches a string with more than one character. If so, the function removes the last character from this node and then adds the character to the NFA again by calling_Builder::_Add_char()
. Because the_Fl_final
flag is set on the last string-matching node (that we just removed a character from),_Builder::_Add_char()
doesn't just re-add the character to the string-matching node again, but instead creates a new string node that matches this one character. Finally, the repetition is added to the NFA.There are no other usages of the
_Fl_final
in<regex>
, so supporting this confusing parsing and NFA generation logic seems to be its only purpose.This PR cleans up this logic and removes any usage of the
_Fl_final
flag: When the last generated node is a string-matching node with more than one character,_Builder::_Add_rep()
now explicitly calls_Builder::_Add_str_node()
to generate a new string-matching node (so no more confusing checking of the_Fl_final
flag) and then moves the last character from the previous string-matching node to the newly created one (by calling_Buf::_Del()
and_Builder::_Add_char2()
)._Builder::_Add_char()
and_Builder::_Add_rep()
are renamed, otherwise regular expressions could be miscompiled when old and new functions are mixed. The_Builder::_Mark_final()
function is no longer used and thus removed. The new tests make sure we don't accidentally miscompile quantifiers (e.g., by not moving the last character from a preceding string-matching node to a new one, or by incorporating too few or too many nodes in the repetition).The
_Fl_final
flag is no longer set on nodes, but I think this flag was never used by the matcher. I can't come up with any reasonable use for it when matching (the flag marks the last node in a repetition, except when it marks the last node before a repetition) and I also couldn't find any usage of the flag in the matcher in the oldest toolset 14.16.27023 (VS 2017) I still had on one of my machines. But it would prudent to check the very first VS 2015 toolset as well.Alternatively, we can also avoid the source code archaeology and just insert the following line at the very beginning of
_Builder::_Add_rep2()
:_Current->_Flags |= _Fl_final; // TRANSITION, ABI
Then the flag gets set on the same nodes before and after this PR.