-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use TextEncoder
and TextDecoder
for utf8 strings
#4513
base: master
Are you sure you want to change the base?
Conversation
master, af1a9c9
seia-soto:textencoder, 1de005e
|
packages/adblocker/src/data-view.ts
Outdated
this.buffer[this.pos++] = str.charCodeAt(i); | ||
} | ||
const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos + 4)); | ||
this.pushUint32(written); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that wasteful in terms of final engine size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encodeInto
doesn't emit C-style NULL character which indicates the end of string at the end. We can consume the string dynamically when reading the binary but I think it's simpler to have a final length of string at the front.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was about using 32-bits instead of potentially 8-bits or 16-bits for smaller strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can change this to 16 bits (= ~65535 chars). 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduced in d6032eb
(#4513)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preferring to use getLength
and pushLength
as pushUint16
is breaking the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they're barely called which is not my expect. I only got 5 logs:
17 65 989 117 46 93
@seia-soto Did you figure out why they are barely called and if it's expected? Because if this method is almost not called maybe it's preferable to keep a simpler code for the variable length part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@remusao I got all pushUTF8
calls and these are the entries:
- Tracker DB properties
- Cosmetic selectors with unicode
- Use of
optionValue
in Network filters (csp, redirect, and replace) - Resource content (including scriptlets)
- Preprocessor conditions
I don't think we will see those frequently in testing or benchmarking setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we will see those frequently in testing or benchmarking setup.
We need to check with production assets being loaded. I don't think you will learn anything otherwise. With which lists did you do the measurement above?
The point being: if when loading full assets with all lists/resources/etc. this method is not called a lot, then I think it's best to keep the simpler version without the variadic length logic (that only makes sense if it has a measurable impact on final binary size)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@remusao Here are the results:
textencoder
current branchtextencoder-2
current implementation with fixed length of 4 bytesnpm:adblocker
from npm
The processing time can be ignored as system is under pretty heavy-load compared to its performance.
aa@MacBookPro temp0 % ./start.sh
Switched to branch 'textencoder'
seia-soto/adblocker:textencoder: 6.559ms
[size]
Reserved bytes: 1674549B (~1635.3017578125 kB)
Consumed bytes: 1674549B (~1635.3017578125 kB)
Switched to a new branch 'textencoder-2'
seia-soto/adblocker:textencoder-2: 7.465ms
[size]
Reserved bytes: 1924346B (~1879.244140625 kB)
Consumed bytes: 1924346B (~1879.244140625 kB)
npm:adblocker: 6.808ms
[size]
Reserved bytes: 1718671B (~1678.3896484375 kB)
Consumed bytes: 1718671B (~1678.3896484375 kB)
refs https://github.com/seia-soto/adblocker-test-textencoder-diffing
- ~65535 ASCII only characters
> 147.50420889870574 / 156.1296767089117 // benchEngineDeserialization
0.944754463135876
> 147.50420889870574 / 148.7394726802865 // benchEngineSerialization
0.9916951179177841 seia-soto:textencoder, 65764e9
master, de7bfb5
|
This PR is awaiting for the final review. Further changes are expected to be categorized as a performance improvement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other uses of encode
in this file?
@@ -20,6 +20,9 @@ export const EMPTY_UINT32_ARRAY = new Uint32Array(0); | |||
// Check if current architecture is little endian | |||
const LITTLE_ENDIAN: boolean = new Int8Array(new Int16Array([1]).buffer)[0] === 1; | |||
|
|||
// TextEncoder doesn't need to be recreated every time unlike TextDecoder | |||
const TEXT_ENCODER = new TextEncoder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively you could do:
const encoder = new TextEncoder();
const encode = encoder.encode.bind(encoder);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encode
is already exported by punycode
. Would you like to have another variable name?
Co-authored-by: Krzysztof Modras <[email protected]>
Co-authored-by: Krzysztof Modras <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add a test for the exact filter that was a problem in #4424
Also, would be great to have a fuzz test setup to test serialize/deserialize on random data
fixes #4424
This PR replaces punycode encoder and decoder with
TextEncoder
andTextDecoder
for utf8 strings.\ufeff
should be skipped when decoding to ensure the original formUint8Array.subarray
doesn't copy the array but provides a direct interface to subarrayTextEncoder.encodeInto
doesn't produce EOL character ( NULL, U+0000 ) but we don't care becauseTextDecoder
can stop nicely when provided buffer endsTo be safe: