-
-
Notifications
You must be signed in to change notification settings - Fork 60
(feat) add RE2::Set bindings #231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Ran a small bench across different implementations. This tends to perform best in situations where your patterns do not match. |
Could you share code of your benchmark? I plan to add a benchmarking utility. |
'use strict';
const RE2 = require('./re2');
function makePatterns(n) {
const arr = [];
for (let i = 0; i < n; ++i) arr.push('token' + i + '(?:[a-z]+)?');
return arr;
}
function makeInputs(patternCount, count, withHits) {
const arr = [];
for (let j = 0; j < count; ++j) {
if (withHits) {
arr.push('xx' + (j % patternCount) + ' ' + (j & 7) + ' token' + (j % patternCount) + ' tail');
} else {
arr.push('xx' + (j % patternCount) + ' ' + (j & 7) + ' tok' + (j % patternCount) + ' tail');
}
}
return arr;
}
function makeLongAllHits(patternCount) {
const parts = [];
for (let i = 0; i < patternCount; ++i) parts.push('some prefix ' + i + ' token' + i + ' suffix ' + (i & 7));
return parts.join(' | ');
}
function makeLongNoHits(patternCount) {
const parts = [];
for (let i = 0; i < patternCount; ++i) parts.push('item' + i + ' nohit ' + (i & 7));
return parts.join(' | ').repeat(2);
}
function measure(fn) {
const start = process.hrtime.bigint();
const result = fn();
const ms = Number(process.hrtime.bigint() - start) / 1e6;
return { timeMs: ms, result };
}
const configs = [
{ name: 'multi-small-50-hits', patterns: 50, inputs: makeInputs(50, 4000, true), mode: 'multi', desc: '4k small strings, many hits' },
{ name: 'multi-small-50-nohits', patterns: 50, inputs: makeInputs(50, 4000, false), mode: 'multi', desc: '4k small strings, no hits' },
{ name: 'multi-small-200-hits', patterns: 200, inputs: makeInputs(200, 4000, true), mode: 'multi', desc: '4k small strings, many hits' },
{ name: 'multi-small-200-nohits', patterns: 200, inputs: makeInputs(200, 4000, false), mode: 'multi', desc: '4k small strings, no hits' },
{ name: 'single-long-50-hits', patterns: 50, inputs: [makeLongAllHits(50)], mode: 'single', desc: '~3 KB string, all tokens present' },
{ name: 'single-long-50-nohits', patterns: 50, inputs: [makeLongNoHits(50)], mode: 'single', desc: '~3 KB string, no tokens' },
{ name: 'single-long-200-hits', patterns: 200, inputs: [makeLongAllHits(200)], mode: 'single', desc: '~7 KB string, all tokens present' },
{ name: 'single-long-200-nohits', patterns: 200, inputs: [makeLongNoHits(200)], mode: 'single', desc: '~8 KB string, no tokens' },
];
const results = [];
for (const cfg of configs) {
const patterns = makePatterns(cfg.patterns);
const set = new RE2.Set(patterns);
const re2List = patterns.map((p) => new RE2(p));
const jsList = patterns.map((p) => new RegExp(p));
const avgLen = cfg.inputs.reduce((n, s) => n + s.length, 0) / cfg.inputs.length;
if (cfg.mode === 'multi') {
const setRes = measure(() => {
let m = 0;
for (const s of cfg.inputs) m += set.test(s) ? 1 : 0;
return m;
});
const re2Res = measure(() => {
let m = 0;
for (const s of cfg.inputs) {
for (const re of re2List) { if (re.test(s)) { ++m; break; } }
}
return m;
});
const jsRes = measure(() => {
let m = 0;
for (const s of cfg.inputs) {
for (const re of jsList) { if (re.test(s)) { ++m; break; } }
}
return m;
});
results.push({ ...cfg, avgLen, inputsCount: cfg.inputs.length, set: setRes.timeMs, re2: re2Res.timeMs, js: jsRes.timeMs, matches: setRes.result });
} else {
const setRes = measure(() => set.match(cfg.inputs[0]).length);
const re2Res = measure(() => re2List.reduce((n, re) => n + (re.test(cfg.inputs[0]) ? 1 : 0), 0));
const jsRes = measure(() => jsList.reduce((n, re) => n + (re.test(cfg.inputs[0]) ? 1 : 0), 0));
results.push({ ...cfg, avgLen, inputsCount: cfg.inputs.length, set: setRes.timeMs, re2: re2Res.timeMs, js: jsRes.timeMs, matches: setRes.result });
}
}
console.table(results.map(r => ({
scenario: r.name,
patterns: r.patterns,
inputs: r.inputsCount,
avgInputLen: Math.round(r.avgLen),
matches: r.matches,
setMs: r.set.toFixed(3),
re2Ms: r.re2.toFixed(3),
jsMs: r.js.toFixed(3),
desc: r.desc,
})));Let me know if you hit any issues. |
|
Hey @uhop, just following up on this to check if there's any changes needed or anything I can help with to get this merged :) |
|
Looks good, just swamped at the moment to do the proper due diligence. |
uhop
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general: impressive! The pre-processing of data is outstanding. The post-processing in match() can be improved.
In order to test it I would suggest to create a text, sprinkle some emoji in it and see if matching works and offsets are correct. Alternatively translate some simple text (3-4 words are enough) to some known multi-byte symbols (I usually use Russian) and match words verifying offsets.
Other notes (no need to act on them):
- We need to expand the representation of incoming strings. We may save some time by using
DataViewas returned values as well. I created #233 to track it. - For some reason a pattern of reusing the same string as input is common. That's why internally we "cache" the last string: https://github.com/uhop/node-re2/blob/master/lib/wrapped_re2.h#L127 — I would suggest to split off the required functionality from
RE2, create a base class and reuse it inRE2::Set. This is purely (speculative) perf improvement, which can be done later.
| return Nan::ThrowTypeError("Invalid anchor option for RE2.Set."); | ||
| } | ||
| } | ||
| (void)anchorName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious: why do we need it?
| auto result = Nan::New<v8::Array>(matches.size()); | ||
| for (size_t i = 0, n = matches.size(); i < n; ++i) | ||
| { | ||
| Nan::Set(result, i, Nan::New(matches[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood correctly, matches[i] is an index into a UTF-8 buffer. It would work for buffers but wouldn't work for strings with multi-byte (non-ASCII) characters, e.g., emojis or East-Asian symbols.
Take a look how exec() does it. The problem is that the process of calculating string offsets is linear, so the best bet is to sort indices and calculate distances between them forming offsets afterwards.
Obviously string offsets should be calculated only for strings. Buffers are fine.
RE2.Setbinding (iterable patterns, flags/anchor parsing, match/test/toString plus flags/sources/source/size/anchor props) and exported it from the addon.Closes #43