Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast pattern matching #306

Open
nightlark opened this issue Dec 17, 2024 · 0 comments
Open

Fast pattern matching #306

nightlark opened this issue Dec 17, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@nightlark
Copy link
Collaborator

nightlark commented Dec 17, 2024

Regex is nice due to its ubiquity (and Python having a built-in library for it). But it is slow to match hundreds of patterns against large files. An algorithm such as Aho-corasick that can simultaneously match multiple patterns with an FSM and only needs to go through a file approx. once should be much faster.

Look into options for recognizing the static string prefix portion of a regular expression (that hopefully is not an empty string), and turn all of the prefixes identified into something that can do a fast first pass match, and only check the regular expressions that match the prefix once a match for the prefix is found in a file.

hyperscan via the pyperscan (https://pypi.org/project/pyperscan) PyPI package is another possibility (would need to fall back on the existing code built-in re module, but on platforms where it is available this might be close to a drop-in replacement).

These two options may not be mutually exclusive either -- if both get implemented and hyperscan is significantly faster, then we could use hyperscan when available, then fall-back to Aho-Corasick pre-filtering as a portable pure Python option.

@nightlark nightlark added the enhancement New feature or request label Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant