Regex is nice due to its ubiquity (and Python has a built-in library for it), but matching hundreds of patterns against large files is slow. An algorithm such as Aho-Corasick, which matches multiple patterns simultaneously with a finite-state machine (FSM) and only needs to go through a file approximately once, should be much faster.
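For reference, a minimal pure-Python sketch of Aho-Corasick over a set of literal strings might look like the following. The names `build_automaton` and `find_all` are hypothetical and not part of the existing code; this only illustrates the single-pass idea and makes no claim about real-world speed.

```python
from collections import deque

def build_automaton(words):
    """Build goto/fail/output tables for a set of literal strings."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> longest proper suffix state
    out = [[]]    # out[state] -> words that end at this state
    for w in words:
        if not w:
            continue  # an empty literal would match everywhere; skip it
        s = 0
        for ch in w:
            nxt = goto[s].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[s][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            s = nxt
        out[s].append(w)
    # Breadth-first pass to fill in failure links and inherited outputs.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, word) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    s = 0
    hits = []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits
```

Usage would be something like `find_all(data, build_automaton(prefixes))`, building the automaton once and reusing it for every file.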
Look into options for recognizing the static string prefix of each regular expression (which hopefully is not an empty string). The identified prefixes could then be turned into a fast first-pass matcher, so that a regular expression is only checked once its prefix has been found in a file.
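A rough sketch of the prefix idea, assuming patterns are compiled without flags and used with search semantics. `literal_prefix` and `scan` are hypothetical names, the extraction is deliberately conservative (it gives up on alternation, groups, and anchors), and the plain substring test stands in for a real multi-pattern first pass:

```python
import re

_META = set(".^$*+?{}[]()|\\")
_QUANTIFIERS = set("*+?{")

def literal_prefix(pattern):
    """Return a literal string that every match must start with, or ''."""
    # Conservative: any alternation could make the leading text optional.
    if "|" in pattern:
        return ""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Escaped punctuation is a literal; \d, \b, \w etc. are not.
            if i + 1 < len(pattern) and not pattern[i + 1].isalnum():
                out.append(pattern[i + 1])
                i += 2
                continue
            break
        if ch in _META:
            break
        out.append(ch)
        i += 1
    # A quantifier applies to the preceding literal, so that one is not required.
    if i < len(pattern) and pattern[i] in _QUANTIFIERS and out:
        out.pop()
    return "".join(out)

def scan(text, patterns):
    """Run each regex only if its literal prefix occurs somewhere in text."""
    hits = []
    for pat in patterns:
        prefix = literal_prefix(pat)
        if prefix and prefix not in text:
            continue  # the regex cannot match, so skip the expensive check
        hits.extend(m.group(0) for m in re.finditer(pat, text))
    return hits
```

Patterns whose prefix comes back empty still run on every file, so the benefit depends on how many of the existing patterns start with a usable literal.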
Hyperscan, via the pyperscan (https://pypi.org/project/pyperscan) PyPI package, is another possibility. We would still need to fall back on the existing code that uses the built-in `re` module, but on platforms where Hyperscan is available this might be close to a drop-in replacement.
These two options may not be mutually exclusive either: if both get implemented and Hyperscan is significantly faster, we could use Hyperscan when available and fall back to Aho-Corasick pre-filtering as a portable pure-Python option.
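The selection logic could be as small as checking whether the optional package is importable. This sketch only detects availability and does not call any pyperscan API; `pick_backend` and the backend labels are hypothetical:

```python
import importlib.util

def pick_backend(module_name="pyperscan"):
    """Prefer the optional native backend when its module can be found."""
    if importlib.util.find_spec(module_name) is not None:
        return "hyperscan"
    # Portable pure-Python fallback.
    return "aho-corasick"
```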