Fast pattern matching #306

nightlark · 2024-12-17T20:34:14Z

Regex is attractive because it is ubiquitous and Python ships a built-in library for it. But matching hundreds of patterns against large files one regex at a time is slow. An algorithm such as Aho-Corasick, which matches multiple patterns simultaneously with a finite-state machine and needs only roughly one pass over a file, should be much faster.
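To make the idea concrete, here is a minimal pure-Python sketch of Aho-Corasick for literal strings. The function names (`build_automaton`, `find_all`) are illustrative and not part of the existing code, and the sketch assumes every pattern is a non-empty string.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links, and output lists for literal patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix in the trie
    out = [[]]    # out[state] -> indices of patterns that end at this state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(idx)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper ones are derived.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns, automaton):
    """Return (start_offset, pattern) for every occurrence, in a single pass."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

For example, `find_all("ushers", pats, build_automaton(pats))` with `pats = ["he", "she", "his", "hers"]` reports `she`, `he`, and `hers` after reading the text once. A pure-Python loop like this may still be slower than `re` for a handful of patterns; the expected benefit is with hundreds of patterns.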

Look into options for recognizing the static string prefix of each regular expression (which is hopefully not empty) and turning the identified prefixes into a fast first-pass matcher. The full regular expression would then be checked only once its prefix has been found in a file.
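A rough sketch of the prefix idea, under some loud assumptions: patterns are compiled without flags such as `re.IGNORECASE` or `re.VERBOSE`, and any pattern containing `|` is treated as having no usable prefix. `literal_prefix` and `scan` are hypothetical helper names, and the plain `in` check stands in for whatever multi-pattern matcher is eventually used.

```python
import re

_META = set(".^$*+?{}[]()|\\")
_OPTIONAL_QUANTIFIERS = set("*?{")  # these may make the preceding char optional


def literal_prefix(pattern):
    """Return a literal string every match must start with ('' if unknown)."""
    if "|" in pattern:
        return ""  # conservative: alternation may bypass the prefix
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        # Escaped punctuation such as "\." is a literal character.
        if ch == "\\" and i + 1 < len(pattern) and not pattern[i + 1].isalnum():
            prefix.append(pattern[i + 1])
            i += 2
            continue
        if ch in _META:
            break
        prefix.append(ch)
        i += 1
    # "ab*" only guarantees "a": drop a char that a quantifier makes optional.
    if i < len(pattern) and pattern[i] in _OPTIONAL_QUANTIFIERS and prefix:
        prefix.pop()
    return "".join(prefix)


def scan(text, patterns):
    """Run each regex only if its literal prefix occurs in the text."""
    results = []
    for pat in patterns:
        prefix = literal_prefix(pat)
        if prefix and prefix not in text:
            continue  # the regex cannot match, skip it
        rx = re.compile(pat)
        results.extend((pat, m.start()) for m in rx.finditer(text))
    return results
```

Patterns with an empty prefix are always run through the full regex, so the pre-filter only ever skips work and should not change which matches are reported, as long as the assumptions above hold.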

Hyperscan, via the pyperscan (https://pypi.org/project/pyperscan) PyPI package, is another possibility. It would need to fall back on the existing code that uses the built-in re module where it is unavailable, but on platforms where it is available this might be close to a drop-in replacement.

These two options may not be mutually exclusive either: if both get implemented and Hyperscan is significantly faster, then we could use Hyperscan when available and fall back to Aho-Corasick pre-filtering as a portable pure-Python option.
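One way the selection could look, as a sketch only: it checks whether an optional package is importable and otherwise reports the portable path. The import name `pyperscan` is assumed to match the PyPI package name, and `choose_backend` and the returned labels are hypothetical; no pyperscan API is called here.

```python
import importlib.util


def choose_backend(preferred=("pyperscan",)):
    """Return the first importable optional backend, else the portable fallback."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    # Fallback label is an assumption: Aho-Corasick pre-filter plus built-in re.
    return "aho-corasick+re"
```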

nightlark added the enhancement New feature or request label Dec 17, 2024
