This commit fixes a subtle *performance* bug in the start state
computation. The issue here is rather tricky, but it boils down to the
fact that the way the look-behind assertions are computed in the start
state is not precisely equivalent to how they're computed during
normal state generation. Namely, in normal state generation, we only
compute look-behind assertions if the NFA actually has one (or one
similar to it) in its graph somewhere. If it doesn't, then there's no
point in saving whether the assertion is satisfied or not.
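
As a rough illustration of that rule, here is a minimal sketch in which a state records only the satisfied look-behind assertions that also appear in the NFA. The `LookSet` type, its constants, and `look_have_for_state` are hypothetical names invented for this example and are not the crate's actual API. The real check is also looser than a plain intersection, since it accepts assertions "similar to" the one satisfied.

```rust
/// A tiny bitset of look-behind assertions (hypothetical, for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct LookSet(u8);

impl LookSet {
    const EMPTY: LookSet = LookSet(0);
    const START_LINE: LookSet = LookSet(1 << 0);
    const WORD_START_HALF: LookSet = LookSet(1 << 1);

    /// Assertions present in either set.
    fn union(self, other: LookSet) -> LookSet {
        LookSet(self.0 | other.0)
    }

    /// Assertions present in both sets.
    fn intersect(self, other: LookSet) -> LookSet {
        LookSet(self.0 & other.0)
    }
}

/// Record only those satisfied assertions that the NFA actually contains.
/// Anything else cannot affect matching, so saving it would only create
/// distinct states that behave identically.
fn look_have_for_state(satisfied: LookSet, nfa_look_set: LookSet) -> LookSet {
    satisfied.intersect(nfa_look_set)
}
```

With an NFA that contains no assertions, `look_have_for_state` returns `LookSet::EMPTY` regardless of what happens to be satisfied at the current position.
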
Logically speaking, this doesn't matter too much, because if the
look-behind assertions don't match up with how they're computed in the
start state, a new state will simply be created. Not a huge deal, but
wasteful. The real problem is that the new state will no longer be
considered a start state. It will just be like any other normal state.
We rely on being able to detect start states at search time to know when
to trigger the prefilter. So if we re-generate start states as non-start
states, then we may end up not triggering the prefilter. That's bad.
rebar actually caught this bug via the
`imported/sherlock/line-boundary-sherlock-holmes` benchmark, which
recorded a 20x slowdown due to the prefilter not running. Ouch!
This specifically was caused by the start states unconditionally
attaching half-starting word boundary assertions whenever they were
satisfied, whereas normal state generation only does this when there is
actually a half-starting word boundary assertion in the NFA. So this led
to re-generating start states needlessly.
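
The mismatch can be reproduced with a toy model. The sketch below assumes a state cache keyed by the set of NFA states plus the recorded look-behind bits, with the start tag applied only when a state is first created. `Cache`, `get_or_add`, `simulate`, and the `WORD_START_HALF` bit are all hypothetical and heavily simplified; they are not the real implementation.

```rust
use std::collections::HashMap;

/// Hypothetical bit for a half-starting word boundary assertion.
const WORD_START_HALF: u8 = 1 << 0;

/// Key identifying a DFA state: the NFA states it contains plus the
/// look-behind assertions recorded as satisfied.
type Key = (Vec<usize>, u8);

#[derive(Default)]
struct Cache {
    ids: HashMap<Key, usize>,
    is_start: Vec<bool>,
}

impl Cache {
    /// Return the ID for `key`, creating a new state if none exists.
    /// The start tag is only applied when the state is created.
    fn get_or_add(&mut self, key: Key, start: bool) -> usize {
        if let Some(&id) = self.ids.get(&key) {
            return id;
        }
        let id = self.is_start.len();
        self.ids.insert(key, id);
        self.is_start.push(start);
        id
    }
}

/// Build a start state, then reach the same NFA state set again through
/// normal state generation. Returns the start state's ID, the ID reached
/// on the revisit, and whether the revisited state is tagged as a start
/// state (and so would trigger the prefilter).
fn simulate(mask_in_start: bool) -> (usize, usize, bool) {
    let nfa_looks: u8 = 0; // this NFA contains no look-behind assertions
    let satisfied: u8 = WORD_START_HALF; // but one happens to be satisfied
    let nfa_states = vec![0, 1];

    let mut cache = Cache::default();
    // Old behavior: attach whatever is satisfied. New behavior: mask it.
    let start_looks = if mask_in_start { satisfied & nfa_looks } else { satisfied };
    let start_id = cache.get_or_add((nfa_states.clone(), start_looks), true);

    // Normal state generation always masks by what the NFA contains.
    let revisited_id = cache.get_or_add((nfa_states, satisfied & nfa_looks), false);
    (start_id, revisited_id, cache.is_start[revisited_id])
}
```

In this model, `simulate(false)` creates a second, untagged state for the same NFA state set, while `simulate(true)` maps both computations to the one tagged start state.
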
Interestingly, the start state computation was unconditionally attaching
all different types of look-behind assertions, and thus in theory, this
problem already existed under different circumstances. My hypothesis is
that it wasn't "as bad" because it was mostly limited to line
terminators. But the half-starting word boundary assertion is much more
broadly applicable.
We remedy this not only for the half-starting word boundary assertion,
but for all others as well. I also did manual mutation testing in this
start state computation and found a few branches not covered by tests.
We add those tests here.
Thanks rebar!