automata: fix subtle DFA performance bug

BurntSushi · BurntSushi · commit 356d3c950414 · 2023-10-09T16:51:00.000-04:00
This commit fixes a subtle *performance* bug in the start state
computation. The issue here is rather tricky, but it boils down to the
fact that the way the look-behind assertions are computed in the start
state is not quite precisely equivalent to how they're computed during
normal state generation. Namely, in normal state generation, we only
compute look-behind assertions if the NFA actually has one (or one
similar to it) in its graph somewhere. If it doesn't, then there's no
point in saving whether the assertion is satisfied or not.

Logically speaking, this doesn't matter too much, because if the
look-around assertions don't match up with how they're computed in the
start state, a new state will simply be created. Not a huge deal, but
wasteful. The real problem is that the new state will no longer be
considered a start state. It will just be like any other normal state.
We rely on being able to detect start states at search time to know when
to trigger the prefilter. So if we re-generate start states as non-start
states, then we may end up not triggering the prefilter. That's bad.

rebar actually caught this bug via the
`imported/sherlock/line-boundary-sherlock-holmes` benchmark, which
recorded a 20x slowdown due to the prefilter not running. Owch!

This specifically was caused by the start states unconditionally
attaching half-starting word boundary assertions whenever they were
satisfied, where as normal state generation only does this when there is
actually a half-starting word boundary assertion in the NFA. So this led
to re-generating start states needlessly.

Interestingly, the start state computation was unconditionally attaching
all different types of look-behind assertions, and thus in theory, this
problem already existed under different circumstances. My hypothesis is
that it wasn't "as bad" because it was mostly limited to line
terminators. But the half-starting word boundary assertion is much more
broadly applicable.

We remedy this not only for the half-starting word boundary assertion,
but for all others as well. I also did manual mutation testing in this
start state computation and found a few branches not covered by tests.
We add those tests here.

Thanks rebar!
diff --git a/regex-automata/src/util/determinize/mod.rs b/regex-automata/src/util/determinize/mod.rs
@@ -587,67 +587,95 @@ pub(crate) fn set_lookbehind_from_start(
 ) {
     let rev = nfa.is_reverse();
     let lineterm = nfa.look_matcher().get_line_terminator();
+    let lookset = nfa.look_set_any();
     match *start {
         Start::NonWordByte => {
-            builder.set_look_have(|have| {
-                have.insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::WordByte => {
-            builder.set_is_from_word();
+            if lookset.contains_word() {
+                builder.set_is_from_word();
+            }
         }
         Start::Text => {
-            builder.set_look_have(|have| {
-                have.insert(Look::Start)
-                    .insert(Look::StartLF)
-                    .insert(Look::StartCRLF)
-                    .insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_anchor_haystack() {
+                builder.set_look_have(|have| have.insert(Look::Start));
+            }
+            if lookset.contains_anchor_line() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::StartLF).insert(Look::StartCRLF)
+                });
+            }
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::LineLF => {
             if rev {
-                builder.set_is_half_crlf();
-                builder.set_look_have(|have| have.insert(Look::StartLF));
+                if lookset.contains_anchor_crlf() {
+                    builder.set_is_half_crlf();
+                }
+                if lookset.contains_anchor_line() {
+                    builder.set_look_have(|have| have.insert(Look::StartLF));
+                }
             } else {
-                builder.set_look_have(|have| have.insert(Look::StartCRLF));
+                if lookset.contains_anchor_line() {
+                    builder.set_look_have(|have| have.insert(Look::StartCRLF));
+                }
             }
-            if lineterm == b'\n' {
+            if lookset.contains_anchor_line() && lineterm == b'\n' {
                 builder.set_look_have(|have| have.insert(Look::StartLF));
             }
-            builder.set_look_have(|have| {
-                have.insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::LineCR => {
-            if rev {
-                builder.set_look_have(|have| have.insert(Look::StartCRLF));
-            } else {
-                builder.set_is_half_crlf();
+            if lookset.contains_anchor_crlf() {
+                if rev {
+                    builder.set_look_have(|have| have.insert(Look::StartCRLF));
+                } else {
+                    builder.set_is_half_crlf();
+                }
             }
-            if lineterm == b'\r' {
+            if lookset.contains_anchor_line() && lineterm == b'\r' {
                 builder.set_look_have(|have| have.insert(Look::StartLF));
             }
-            builder.set_look_have(|have| {
-                have.insert(Look::WordStartHalfAscii)
-                    .insert(Look::WordStartHalfUnicode)
-            });
+            if lookset.contains_word() {
+                builder.set_look_have(|have| {
+                    have.insert(Look::WordStartHalfAscii)
+                        .insert(Look::WordStartHalfUnicode)
+                });
+            }
         }
         Start::CustomLineTerminator => {
-            builder.set_look_have(|have| have.insert(Look::StartLF));
+            if lookset.contains_anchor_line() {
+                builder.set_look_have(|have| have.insert(Look::StartLF));
+            }
             // This is a bit of a tricky case, but if the line terminator was
             // set to a word byte, then we also need to behave as if the start
             // configuration is Start::WordByte. That is, we need to mark our
             // state as having come from a word byte.
-            if utf8::is_word_byte(lineterm) {
-                builder.set_is_from_word();
-            } else {
-                builder.set_look_have(|have| {
-                    have.insert(Look::WordStartHalfAscii)
-                        .insert(Look::WordStartHalfUnicode)
-                });
+            if lookset.contains_word() {
+                if utf8::is_word_byte(lineterm) {
+                    builder.set_is_from_word();
+                } else {
+                    builder.set_look_have(|have| {
+                        have.insert(Look::WordStartHalfAscii)
+                            .insert(Look::WordStartHalfUnicode)
+                    });
+                }
             }
         }
     }
diff --git a/testdata/line-terminator.toml b/testdata/line-terminator.toml
@@ -38,6 +38,18 @@ unescape = true
 line-terminator = '\xFF'
 utf8 = false
 
+# This tests a tricky case where the line terminator is set to \r. This ensures
+# that the StartLF look-behind assertion is tracked when computing the start
+# state.
+[[test]]
+name = "carriage"
+regex = '(?m)^[a-z]+'
+haystack = 'ABC\rabc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+line-terminator = '\r'
+
 # This tests that we can set the line terminator to a byte corresponding to a
 # word character, and things work as expected.
 [[test]]
diff --git a/testdata/word-boundary-special.toml b/testdata/word-boundary-special.toml
@@ -651,3 +651,37 @@ regex = '\b{end-half}'
 haystack = "b𝛃"
 matches = [[5, 5]]
 unicode = true
+
+# Specialty tests.
+
+# Since \r is special cased in the start state computation (to deal with CRLF
+# mode), this test ensures that the correct start state is computed when the
+# pattern starts with a half word boundary assertion.
+[[test]]
+name = "word-start-half-ascii-carriage"
+regex = '\b{start-half}[a-z]+'
+haystack = 'ABC\rabc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+
+# Since \n is also special cased in the start state computation, this test
+# ensures that the correct start state is computed when the pattern starts with
+# a half word boundary assertion.
+[[test]]
+name = "word-start-half-ascii-linefeed"
+regex = '\b{start-half}[a-z]+'
+haystack = 'ABC\nabc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+
+# Like the carriage return test above, but with a custom line terminator.
+[[test]]
+name = "word-start-half-ascii-customlineterm"
+regex = '\b{start-half}[a-z]+'
+haystack = 'ABC!abc'
+matches = [[4, 7]]
+bounds = [4, 7]
+unescape = true
+line-terminator = '!'