BUG:Some bug in Balancing Group of Regular Expressions #111161
Comments
Verified the repro, including verifying generated results via
However, the repro should be simplified if possible so that it is easier to verify that the result is not correct and to help find the root cause.
Is the discussion at #110976 for issues with
FWIW: Generated C# results
// <auto-generated/>
#nullable enable
#pragma warning disable CS0162 // Unreachable code
#pragma warning disable CS0164 // Unreferenced label
#pragma warning disable CS0219 // Variable assigned but never used
namespace ConsoleApp332
{
    partial class Program
    {
        /// <remarks>
        /// Pattern:<br/>
        /// <code>\\d+((?'x'[a-z-[b]]+)).(?<=(?'2-1'(?'x1'..)).{6})b(?(2)(?'Group2Captured'.)|(?'Group2NotCaptured'.))</code><br/>
        /// Options:<br/>
        /// <code>RegexOptions.IgnoreCase</code><br/>
        /// Explanation:<br/>
        /// <code>
        /// ○ Match a Unicode digit greedily at least once.<br/>
        /// ○ 1st capture group.<br/>
        /// ○ "x" capture group.<br/>
        /// ○ Match a character in the set [A-Za-z\u0130\u212A-[Bb]] greedily at least once.<br/>
        /// ○ Match any character other than '\n'.<br/>
        /// ○ Zero-width positive lookbehind.<br/>
        /// ○ Match a character other than '\n' exactly 6 times right-to-left.<br/>
        /// ○ Balancing group. Captures the 2nd capture group and uncaptures the 1st capture group.<br/>
        /// ○ "x1" capture group.<br/>
        /// ○ Match a character other than '\n' exactly 2 times right-to-left.<br/>
        /// ○ Match a character in the set [Bb].<br/>
        /// ○ Atomic group.<br/>
        /// ○ Conditionally match one of two expressions depending on whether the 2nd capture group matched.<br/>
        /// ○ Matched: "Group2Captured" capture group.<br/>
        /// ○ Match any character other than '\n'.<br/>
        /// ○ Not Matched: "Group2NotCaptured" capture group.<br/>
        /// ○ Match any character other than '\n'.<br/>
        /// </code>
        /// </remarks>
        [global::System.CodeDom.Compiler.GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "10.0.11.11703")]
        private static partial global::System.Text.RegularExpressions.Regex TestRegex() => global::System.Text.RegularExpressions.Generated.TestRegex_0.Instance;
    }
}
namespace System.Text.RegularExpressions.Generated
{
    using System;
    using System.Buffers;
    using System.CodeDom.Compiler;
    using System.Collections;
    using System.ComponentModel;
    using System.Globalization;
    using System.Runtime.CompilerServices;
    using System.Text.RegularExpressions;
    using System.Threading;
    /// <summary>Custom <see cref="Regex"/>-derived type for the TestRegex method.</summary>
    [GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "10.0.11.11703")]
    file sealed class TestRegex_0 : Regex
    {
        /// <summary>Cached, thread-safe singleton instance.</summary>
        internal static readonly TestRegex_0 Instance = new();
        /// <summary>Initializes the instance.</summary>
        private TestRegex_0()
        {
            base.pattern = "\\d+((?'x'[a-z-[b]]+)).(?<=(?'2-1'(?'x1'..)).{6})b(?(2)(?'Group2Captured'.)|(?'Group2NotCaptured'.))";
            base.roptions = RegexOptions.IgnoreCase;
            ValidateMatchTimeout(Utilities.s_defaultTimeout);
            base.internalMatchTimeout = Utilities.s_defaultTimeout;
            base.factory = new RunnerFactory();
            base.CapNames = new Hashtable { { "0", 0 } , { "1", 1 } , { "2", 2 } , { "Group2Captured", 5 } , { "Group2NotCaptured", 6 } , { "x", 3 } , { "x1", 4 } };
            base.capslist = new string[] {"0", "1", "2", "x", "x1", "Group2Captured", "Group2NotCaptured" };
            base.capsize = 7;
        }
        /// <summary>Provides a factory for creating <see cref="RegexRunner"/> instances to be used by methods on <see cref="Regex"/>.</summary>
        private sealed class RunnerFactory : RegexRunnerFactory
        {
            /// <summary>Creates an instance of a <see cref="RegexRunner"/> used by methods on <see cref="Regex"/>.</summary>
            protected override RegexRunner CreateInstance() => new Runner();
            /// <summary>Provides the runner that contains the custom logic implementing the specified regular expression.</summary>
            private sealed class Runner : RegexRunner
            {
                /// <summary>Scan the <paramref name="inputSpan"/> starting from base.runtextstart for the next match.</summary>
                /// <param name="inputSpan">The text being scanned by the regular expression.</param>
                protected override void Scan(ReadOnlySpan<char> inputSpan)
                {
                    // Search until we can't find a valid starting position, we find a match, or we reach the end of the input.
                    while (TryFindNextPossibleStartingPosition(inputSpan) &&
                           !TryMatchAtCurrentPosition(inputSpan) &&
                           base.runtextpos != inputSpan.Length)
                    {
                        base.runtextpos++;
                        if (Utilities.s_hasTimeout)
                        {
                            base.CheckTimeout();
                        }
                    }
                }
                /// <summary>Search <paramref name="inputSpan"/> starting from base.runtextpos for the next location a match could possibly start.</summary>
                /// <param name="inputSpan">The text being scanned by the regular expression.</param>
                /// <returns>true if a possible match was found; false if no more matches are possible.</returns>
                private bool TryFindNextPossibleStartingPosition(ReadOnlySpan<char> inputSpan)
                {
                    int pos = base.runtextpos;
                    // Any possible match is at least 5 characters.
                    if (pos <= inputSpan.Length - 5)
                    {
                        // The pattern begins with a Unicode digit.
                        // Find the next occurrence. If it can't be found, there's no match.
                        int i = inputSpan.Slice(pos).IndexOfAnyDigit();
                        if (i >= 0)
                        {
                            base.runtextpos = pos + i;
                            return true;
                        }
                    }
                    // No match found.
                    base.runtextpos = inputSpan.Length;
                    return false;
                }
                /// <summary>Determine whether <paramref name="inputSpan"/> at base.runtextpos is a match for the regular expression.</summary>
                /// <param name="inputSpan">The text being scanned by the regular expression.</param>
                /// <returns>true if the regular expression matches at the current position; otherwise, false.</returns>
                private bool TryMatchAtCurrentPosition(ReadOnlySpan<char> inputSpan)
                {
                    int pos = base.runtextpos;
                    int matchStart = pos;
                    int capture_starting_pos = 0;
                    int capture_starting_pos1 = 0;
                    int capture_starting_pos2 = 0;
                    int capture_starting_pos3 = 0;
                    int capture_starting_pos4 = 0;
                    int capture_starting_pos5 = 0;
                    int charloop_capture_pos = 0;
                    int charloop_capture_pos1 = 0;
                    int charloop_starting_pos = 0, charloop_ending_pos = 0;
                    int charloop_starting_pos1 = 0, charloop_ending_pos1 = 0;
                    ReadOnlySpan<char> slice = inputSpan.Slice(pos);
                    // Match a Unicode digit greedily at least once.
                    //{
                        charloop_starting_pos = pos;
                        int iteration = 0;
                        while ((uint)iteration < (uint)slice.Length && char.IsDigit(slice[iteration]))
                        {
                            iteration++;
                        }
                        if (iteration == 0)
                        {
                            UncaptureUntil(0);
                            return false; // The input didn't match.
                        }
                        slice = slice.Slice(iteration);
                        pos += iteration;
                        charloop_ending_pos = pos;
                        charloop_starting_pos++;
                        goto CharLoopEnd;
                    CharLoopBacktrack:
                        UncaptureUntil(charloop_capture_pos);
                        if (Utilities.s_hasTimeout)
                        {
                            base.CheckTimeout();
                        }
                        if (charloop_starting_pos >= charloop_ending_pos ||
                            (charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOfAny(Utilities.s_nonAscii_53BE860A7BB3C901EBE8EECDBB69D761C6C74DF0564F8B7A7926DECC0EC263B1)) < 0)
                        {
                            UncaptureUntil(0);
                            return false; // The input didn't match.
                        }
                        charloop_ending_pos += charloop_starting_pos;
                        pos = charloop_ending_pos;
                        slice = inputSpan.Slice(pos);
                    CharLoopEnd:
                        charloop_capture_pos = base.Crawlpos();
                    //}
                    // Advance the next matching position.
                    if (base.runtextpos < pos)
                    {
                        base.runtextpos = pos;
                    }
                    // 1st capture group.
                    //{
                        capture_starting_pos = pos;
                        // "x" capture group.
                        //{
                            capture_starting_pos1 = pos;
                            // Match a character in the set [A-Za-z\u0130\u212A-[Bb]] greedily at least once.
                            //{
                                charloop_starting_pos1 = pos;
                                int iteration1 = slice.IndexOfAnyExcept(Utilities.s_nonAscii_53BE860A7BB3C901EBE8EECDBB69D761C6C74DF0564F8B7A7926DECC0EC263B1);
                                if (iteration1 < 0)
                                {
                                    iteration1 = slice.Length;
                                }
                                if (iteration1 == 0)
                                {
                                    goto CharLoopBacktrack;
                                }
                                slice = slice.Slice(iteration1);
                                pos += iteration1;
                                charloop_ending_pos1 = pos;
                                charloop_starting_pos1++;
                                goto CharLoopEnd1;
                            CharLoopBacktrack1:
                                UncaptureUntil(charloop_capture_pos1);
                                if (Utilities.s_hasTimeout)
                                {
                                    base.CheckTimeout();
                                }
                                if (charloop_starting_pos1 >= charloop_ending_pos1 ||
                                    (charloop_ending_pos1 = inputSpan.Slice(charloop_starting_pos1, charloop_ending_pos1 - charloop_starting_pos1).LastIndexOfAnyExcept('\n')) < 0)
                                {
                                    goto CharLoopBacktrack;
                                }
                                charloop_ending_pos1 += charloop_starting_pos1;
                                pos = charloop_ending_pos1;
                                slice = inputSpan.Slice(pos);
                            CharLoopEnd1:
                                charloop_capture_pos1 = base.Crawlpos();
                            //}
                            base.Capture(3, capture_starting_pos1, pos);
                            goto CaptureSkipBacktrack;
                        CaptureBacktrack:
                            goto CharLoopBacktrack1;
                        CaptureSkipBacktrack:;
                        //}
                        base.Capture(1, capture_starting_pos, pos);
                        goto CaptureSkipBacktrack1;
                    CaptureBacktrack1:
                        goto CaptureBacktrack;
                    CaptureSkipBacktrack1:;
                    //}
                    // Match any character other than '\n'.
                    if (slice.IsEmpty || slice[0] == '\n')
                    {
                        goto CaptureBacktrack1;
                    }
                    pos++;
                    slice = inputSpan.Slice(pos);
                    // Zero-width positive lookbehind.
                    {
                        slice = inputSpan.Slice(pos);
                        int positivelookbehind_starting_pos = pos;
                        if (Utilities.s_hasTimeout)
                        {
                            base.CheckTimeout();
                        }
                        // Match a character other than '\n' exactly 6 times right-to-left.
                        {
                            for (int i = 0; i < 6; i++)
                            {
                                if ((uint)(pos - 1) >= inputSpan.Length || inputSpan[pos - 1] == '\n')
                                {
                                    goto CaptureBacktrack1;
                                }
                                pos--;
                            }
                        }
                        // Balancing group. Captures the 2nd capture group and uncaptures the 1st capture group.
                        {
                            capture_starting_pos2 = pos;
                            if (!base.IsMatched(1))
                            {
                                goto CaptureBacktrack1;
                            }
                            // "x1" capture group.
                            {
                                capture_starting_pos3 = pos;
                                // Match a character other than '\n' exactly 2 times right-to-left.
                                {
                                    for (int i = 0; i < 2; i++)
                                    {
                                        if ((uint)(pos - 1) >= inputSpan.Length || inputSpan[pos - 1] == '\n')
                                        {
                                            goto CaptureBacktrack1;
                                        }
                                        pos--;
                                    }
                                }
                                base.Capture(4, capture_starting_pos3, pos);
                            }
                            base.TransferCapture(2, 1, capture_starting_pos2, pos);
                        }
                        pos = positivelookbehind_starting_pos;
                        slice = inputSpan.Slice(pos);
                    }
                    // Match a character in the set [Bb].
                    if (slice.IsEmpty || ((slice[0] | 0x20) != 'b'))
                    {
                        goto CaptureBacktrack1;
                    }
                    // Conditionally match one of two expressions depending on whether the 2nd capture group matched.
                    {
                        pos++;
                        slice = inputSpan.Slice(pos);
                        if (base.IsMatched(2))
                        {
                            // The 2nd capture group captured a value. Match the first branch.
                            // "Group2Captured" capture group.
                            {
                                capture_starting_pos4 = pos;
                                // Match any character other than '\n'.
                                if (slice.IsEmpty || slice[0] == '\n')
                                {
                                    goto CaptureBacktrack1;
                                }
                                pos++;
                                slice = inputSpan.Slice(pos);
                                base.Capture(5, capture_starting_pos4, pos);
                            }
                        }
                        else
                        {
                            // Otherwise, match the second branch.
                            // "Group2NotCaptured" capture group.
                            {
                                capture_starting_pos5 = pos;
                                // Match any character other than '\n'.
                                if (slice.IsEmpty || slice[0] == '\n')
                                {
                                    goto CaptureBacktrack1;
                                }
                                pos++;
                                slice = inputSpan.Slice(pos);
                                base.Capture(6, capture_starting_pos5, pos);
                            }
                        }
                    }
                    // The input matched.
                    base.runtextpos = pos;
                    base.Capture(0, matchStart, pos);
                    return true;
                    // <summary>Undo captures until it reaches the specified capture position.</summary>
                    [MethodImpl(MethodImplOptions.AggressiveInlining)]
                    void UncaptureUntil(int capturePosition)
                    {
                        while (base.Crawlpos() > capturePosition)
                        {
                            base.Uncapture();
                        }
                    }
                }
            }
        }
    }
    /// <summary>Helper methods used by generated <see cref="Regex"/>-derived implementations.</summary>
    [GeneratedCodeAttribute("System.Text.RegularExpressions.Generator", "10.0.11.11703")]
    file static class Utilities
    {
        /// <summary>Default timeout value set in <see cref="AppContext"/>, or <see cref="Regex.InfiniteMatchTimeout"/> if none was set.</summary>
        internal static readonly TimeSpan s_defaultTimeout = AppContext.GetData("REGEX_DEFAULT_MATCH_TIMEOUT") is TimeSpan timeout ? timeout : Regex.InfiniteMatchTimeout;
        /// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
        internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
        /// <summary>Finds the next index of any character that matches a Unicode digit.</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfAnyDigit(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptDigits);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
                do
                {
                    if (char.IsDigit(span[i]))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
            return -1;
        }
/// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f".</summary>
internal static readonly SearchValues<char> s_asciiExceptDigits = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f");
/// <summary>Supports searching for characters in or not in "ACDEFGHIJKLMNOPQRSTUVWXYZacdefghijklmnopqrstuvwxyzİK".</summary>
internal static readonly SearchValues<char> s_nonAscii_53BE860A7BB3C901EBE8EECDBB69D761C6C74DF0564F8B7A7926DECC0EC263B1 = SearchValues.Create("ACDEFGHIJKLMNOPQRSTUVWXYZacdefghijklmnopqrstuvwxyzİK");
    }
}
Discussion at #110976 for issues with
The main problem in this issue is not the inconsistency between the match results of the interpreted and the compiled regular expression. In Microsoft's documentation on balancing group definitions, when the balancing grouping construct has the following format:
But, what happens if match of
Description
In the balancing group (?'g1-g2'exp), when the content matched by exp precedes the latest capture of g2, g1.Captures.Count and the actual behavior of g1 are inconsistent.
Checking the group's captures with Group.Captures shows an empty collection. However, a conditional such as (?(g1)yes|no) matches yes, which indicates that a capture actually exists.
For more bugs in balancing groups, see Bug in Balancing Groups - Part 2.
The test cases use fairly complex regular expressions.
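The inconsistency described above could be probed with a small program along these lines. This is only a sketch: the pattern is the one that appears in the generated code quoted in the comments, while the input "0123xyzw-bQ", the class name BalancingGroupProbe, and the helper names are assumptions made for illustration and are not the reporter's original repro. ForwardCase shows the documented situation, where exp follows the latest capture of the second group and the first group should receive the text between the two. Describe reports Captures.Count for group 2 next to the branch that (?(2)yes|no) actually took. No expected output is given for the issue pattern, since that result is exactly what is in question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancingGroupProbe
{
    // Pattern copied from the generated code quoted in the comments.
    public const string IssuePattern =
        @"\d+((?'x'[a-z-[b]]+)).(?<=(?'2-1'(?'x1'..)).{6})b(?(2)(?'Group2Captured'.)|(?'Group2NotCaptured'.))";

    // Hypothetical input (not the reporter's). The lookbehind makes 'x1' match
    // text that lies before the latest capture of group 1.
    public const string SampleInput = "0123xyzw-bQ";

    // Documented "forward" case for contrast: exp (the closing parenthesis)
    // follows the latest capture of 'open', so 'inner' should hold the text
    // between the two parentheses.
    public static string ForwardCase()
    {
        Match m = Regex.Match("(abc)", @"(?'open'\()[^()]*(?'inner-open'\))");
        return m.Groups["inner"].Value;
    }

    // Reports what Group.Captures says about group 2 alongside the branch
    // that the conditional (?(2)yes|no) actually took.
    public static string Describe(string input, RegexOptions extra = RegexOptions.None)
    {
        Match m = Regex.Match(input, IssuePattern, RegexOptions.IgnoreCase | extra);
        if (!m.Success)
        {
            return "Success: False";
        }

        int group2Count = m.Groups[2].Captures.Count;
        bool tookYes = m.Groups["Group2Captured"].Success;
        bool tookNo = m.Groups["Group2NotCaptured"].Success;
        return $"Success: True; group 2 Captures.Count = {group2Count}; " +
               $"Group2Captured = {tookYes}; Group2NotCaptured = {tookNo}";
    }

    public static void Main()
    {
        Console.WriteLine(ForwardCase());
        Console.WriteLine(Describe(SampleInput));                        // interpreter
        Console.WriteLine(Describe(SampleInput, RegexOptions.Compiled)); // compiled
    }
}
```

If the report shows a Captures.Count of 0 for group 2 together with Group2Captured = True, that would be the mismatch this issue describes. Running both the interpreted and the compiled variants helps separate this bug from engine-specific differences.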
Reproduction Steps
Output:
Expected behavior
Or
Actual behavior
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response