Skip to content

Commit 356d3c9

Browse files
committed
automata: fix subtle DFA performance bug
This commit fixes a subtle *performance* bug in the start state computation. The issue here is rather tricky, but it boils down to the fact that the way the look-behind assertions are computed in the start state is not quite precisely equivalent to how they're computed during normal state generation. Namely, in normal state generation, we only compute look-behind assertions if the NFA actually has one (or one similar to it) in its graph somewhere. If it doesn't, then there's no point in saving whether the assertion is satisfied or not. Logically speaking, this doesn't matter too much, because if the look-around assertions don't match up with how they're computed in the start state, a new state will simply be created. Not a huge deal, but wasteful. The real problem is that the new state will no longer be considered a start state. It will just be like any other normal state. We rely on being able to detect start states at search time to know when to trigger the prefilter. So if we re-generate start states as non-start states, then we may end up not triggering the prefilter. That's bad. rebar actually caught this bug via the `imported/sherlock/line-boundary-sherlock-holmes` benchmark, which recorded a 20x slowdown due to the prefilter not running. Owch! This specifically was caused by the start states unconditionally attaching half-starting word boundary assertions whenever they were satisfied, where as normal state generation only does this when there is actually a half-starting word boundary assertion in the NFA. So this led to re-generating start states needlessly. Interestingly, the start state computation was unconditionally attaching all different types of look-behind assertions, and thus in theory, this problem already existed under different circumstances. My hypothesis is that it wasn't "as bad" because it was mostly limited to line terminators. But the half-starting word boundary assertion is much more broadly applicable. We remedy this not only for the half-starting word boundary assertion, but for all others as well. I also did manual mutation testing in this start state computation and found a few branches not covered by tests. We add those tests here. Thanks rebar!
1 parent 674a952 commit 356d3c9

File tree

3 files changed

+111
-37
lines changed

3 files changed

+111
-37
lines changed

regex-automata/src/util/determinize/mod.rs

Lines changed: 65 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -587,67 +587,95 @@ pub(crate) fn set_lookbehind_from_start(
587587
) {
588588
let rev = nfa.is_reverse();
589589
let lineterm = nfa.look_matcher().get_line_terminator();
590+
let lookset = nfa.look_set_any();
590591
match *start {
591592
Start::NonWordByte => {
592-
builder.set_look_have(|have| {
593-
have.insert(Look::WordStartHalfAscii)
594-
.insert(Look::WordStartHalfUnicode)
595-
});
593+
if lookset.contains_word() {
594+
builder.set_look_have(|have| {
595+
have.insert(Look::WordStartHalfAscii)
596+
.insert(Look::WordStartHalfUnicode)
597+
});
598+
}
596599
}
597600
Start::WordByte => {
598-
builder.set_is_from_word();
601+
if lookset.contains_word() {
602+
builder.set_is_from_word();
603+
}
599604
}
600605
Start::Text => {
601-
builder.set_look_have(|have| {
602-
have.insert(Look::Start)
603-
.insert(Look::StartLF)
604-
.insert(Look::StartCRLF)
605-
.insert(Look::WordStartHalfAscii)
606-
.insert(Look::WordStartHalfUnicode)
607-
});
606+
if lookset.contains_anchor_haystack() {
607+
builder.set_look_have(|have| have.insert(Look::Start));
608+
}
609+
if lookset.contains_anchor_line() {
610+
builder.set_look_have(|have| {
611+
have.insert(Look::StartLF).insert(Look::StartCRLF)
612+
});
613+
}
614+
if lookset.contains_word() {
615+
builder.set_look_have(|have| {
616+
have.insert(Look::WordStartHalfAscii)
617+
.insert(Look::WordStartHalfUnicode)
618+
});
619+
}
608620
}
609621
Start::LineLF => {
610622
if rev {
611-
builder.set_is_half_crlf();
612-
builder.set_look_have(|have| have.insert(Look::StartLF));
623+
if lookset.contains_anchor_crlf() {
624+
builder.set_is_half_crlf();
625+
}
626+
if lookset.contains_anchor_line() {
627+
builder.set_look_have(|have| have.insert(Look::StartLF));
628+
}
613629
} else {
614-
builder.set_look_have(|have| have.insert(Look::StartCRLF));
630+
if lookset.contains_anchor_line() {
631+
builder.set_look_have(|have| have.insert(Look::StartCRLF));
632+
}
615633
}
616-
if lineterm == b'\n' {
634+
if lookset.contains_anchor_line() && lineterm == b'\n' {
617635
builder.set_look_have(|have| have.insert(Look::StartLF));
618636
}
619-
builder.set_look_have(|have| {
620-
have.insert(Look::WordStartHalfAscii)
621-
.insert(Look::WordStartHalfUnicode)
622-
});
637+
if lookset.contains_word() {
638+
builder.set_look_have(|have| {
639+
have.insert(Look::WordStartHalfAscii)
640+
.insert(Look::WordStartHalfUnicode)
641+
});
642+
}
623643
}
624644
Start::LineCR => {
625-
if rev {
626-
builder.set_look_have(|have| have.insert(Look::StartCRLF));
627-
} else {
628-
builder.set_is_half_crlf();
645+
if lookset.contains_anchor_crlf() {
646+
if rev {
647+
builder.set_look_have(|have| have.insert(Look::StartCRLF));
648+
} else {
649+
builder.set_is_half_crlf();
650+
}
629651
}
630-
if lineterm == b'\r' {
652+
if lookset.contains_anchor_line() && lineterm == b'\r' {
631653
builder.set_look_have(|have| have.insert(Look::StartLF));
632654
}
633-
builder.set_look_have(|have| {
634-
have.insert(Look::WordStartHalfAscii)
635-
.insert(Look::WordStartHalfUnicode)
636-
});
655+
if lookset.contains_word() {
656+
builder.set_look_have(|have| {
657+
have.insert(Look::WordStartHalfAscii)
658+
.insert(Look::WordStartHalfUnicode)
659+
});
660+
}
637661
}
638662
Start::CustomLineTerminator => {
639-
builder.set_look_have(|have| have.insert(Look::StartLF));
663+
if lookset.contains_anchor_line() {
664+
builder.set_look_have(|have| have.insert(Look::StartLF));
665+
}
640666
// This is a bit of a tricky case, but if the line terminator was
641667
// set to a word byte, then we also need to behave as if the start
642668
// configuration is Start::WordByte. That is, we need to mark our
643669
// state as having come from a word byte.
644-
if utf8::is_word_byte(lineterm) {
645-
builder.set_is_from_word();
646-
} else {
647-
builder.set_look_have(|have| {
648-
have.insert(Look::WordStartHalfAscii)
649-
.insert(Look::WordStartHalfUnicode)
650-
});
670+
if lookset.contains_word() {
671+
if utf8::is_word_byte(lineterm) {
672+
builder.set_is_from_word();
673+
} else {
674+
builder.set_look_have(|have| {
675+
have.insert(Look::WordStartHalfAscii)
676+
.insert(Look::WordStartHalfUnicode)
677+
});
678+
}
651679
}
652680
}
653681
}

testdata/line-terminator.toml

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,18 @@ unescape = true
3838
line-terminator = '\xFF'
3939
utf8 = false
4040

41+
# This tests a tricky case where the line terminator is set to \r. This ensures
42+
# that the StartLF look-behind assertion is tracked when computing the start
43+
# state.
44+
[[test]]
45+
name = "carriage"
46+
regex = '(?m)^[a-z]+'
47+
haystack = 'ABC\rabc'
48+
matches = [[4, 7]]
49+
bounds = [4, 7]
50+
unescape = true
51+
line-terminator = '\r'
52+
4153
# This tests that we can set the line terminator to a byte corresponding to a
4254
# word character, and things work as expected.
4355
[[test]]

testdata/word-boundary-special.toml

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -651,3 +651,37 @@ regex = '\b{end-half}'
651651
haystack = "b𝛃"
652652
matches = [[5, 5]]
653653
unicode = true
654+
655+
# Specialty tests.
656+
657+
# Since \r is special cased in the start state computation (to deal with CRLF
658+
# mode), this test ensures that the correct start state is computed when the
659+
# pattern starts with a half word boundary assertion.
660+
[[test]]
661+
name = "word-start-half-ascii-carriage"
662+
regex = '\b{start-half}[a-z]+'
663+
haystack = 'ABC\rabc'
664+
matches = [[4, 7]]
665+
bounds = [4, 7]
666+
unescape = true
667+
668+
# Since \n is also special cased in the start state computation, this test
669+
# ensures that the correct start state is computed when the pattern starts with
670+
# a half word boundary assertion.
671+
[[test]]
672+
name = "word-start-half-ascii-linefeed"
673+
regex = '\b{start-half}[a-z]+'
674+
haystack = 'ABC\nabc'
675+
matches = [[4, 7]]
676+
bounds = [4, 7]
677+
unescape = true
678+
679+
# Like the carriage return test above, but with a custom line terminator.
680+
[[test]]
681+
name = "word-start-half-ascii-customlineterm"
682+
regex = '\b{start-half}[a-z]+'
683+
haystack = 'ABC!abc'
684+
matches = [[4, 7]]
685+
bounds = [4, 7]
686+
unescape = true
687+
line-terminator = '!'

0 commit comments

Comments
 (0)