Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use TextEncoder and TextDecoder for utf8 strings #4513

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

seia-soto
Copy link
Member

fixes #4424

This PR replaces punycode encoder and decoder with TextEncoder and TextDecoder for utf8 strings.

  • BOM character \ufeff should be skipped when decoding to ensure the original form
  • Uint8Array.subarray doesn't copy the array but provides a direct interface to subarray
  • TextEncoder.encodeInto doesn't produce EOL character ( NULL, U+0000 ) but we don't care because TextDecoder can stop nicely when provided buffer ends

To be safe:

  • The performance should be evaluated carefully
  • The output binary size needs to be evaluated carefully

@seia-soto seia-soto added the PR: Internal 🏠 Changes only affect internals label Dec 10, 2024
@seia-soto seia-soto self-assigned this Dec 10, 2024
@seia-soto seia-soto requested a review from remusao as a code owner December 10, 2024 08:29
@seia-soto
Copy link
Member Author

seia-soto commented Dec 11, 2024

> (6635933 / 6634657) * 100 // raw > ads + trackers + annoyances (104550 network + 69076 hide)
100.01923234313395

master, af1a9c9

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49534 network + 38171 hide)
 + raw 3358745 bytes
 + gzip 1783830 bytes
 + brotli 1467652 bytes
> ads (0 network + 38171 hide)
 + raw 1825421 bytes
 + gzip 880805 bytes
 + brotli 708974 bytes
> ads (49534 network + 0 hide)
 + raw 1533629 bytes
 + gzip 905016 bytes
 + brotli 765458 bytes
> ads + trackers (102405 network + 38324 hide)
 + raw 4958193 bytes
 + gzip 2635039 bytes
 + brotli 2172546 bytes
> ads + trackers (0 network + 38324 hide)
 + raw 1847893 bytes
 + gzip 890432 bytes
 + brotli 715577 bytes
> ads + trackers (102405 network + 0 hide)
 + raw 3110601 bytes
 + gzip 1745248 bytes
 + brotli 1459959 bytes
> ads + trackers + annoyances (104585 network + 69043 hide)
 + raw 6635933 bytes
 + gzip 3468626 bytes
 + brotli 2862312 bytes
> ads + trackers + annoyances (0 network + 69043 hide)
 + raw 3455897 bytes
 + gzip 1674036 bytes
 + brotli 1363518 bytes
> ads + trackers + annoyances (104585 network + 0 hide)
 + raw 3180345 bytes
 + gzip 1793551 bytes
 + brotli 1501761 bytes

seia-soto:textencoder, 1de005e

aa@MacBookPro adblocker % yarn tsx ./tools/engine-size.ts 
> ads (49501 network + 38172 hide)
 + raw 3357509 bytes
 + gzip 1782885 bytes
 + brotli 1468373 bytes
> ads (0 network + 38172 hide)
 + raw 1824997 bytes
 + gzip 880743 bytes
 + brotli 706714 bytes
> ads (49501 network + 0 hide)
 + raw 1532817 bytes
 + gzip 904485 bytes
 + brotli 764578 bytes
> ads + trackers (102370 network + 38325 hide)
 + raw 4957369 bytes
 + gzip 2634088 bytes
 + brotli 2171825 bytes
> ads + trackers (0 network + 38325 hide)
 + raw 1847925 bytes
 + gzip 890324 bytes
 + brotli 719134 bytes
> ads + trackers (102370 network + 0 hide)
 + raw 3109745 bytes
 + gzip 1744659 bytes
 + brotli 1459817 bytes
> ads + trackers + annoyances (104550 network + 69076 hide)
 + raw 6634657 bytes
 + gzip 3468231 bytes
 + brotli 2859315 bytes
> ads + trackers + annoyances (0 network + 69076 hide)
 + raw 3455473 bytes
 + gzip 1674098 bytes
 + brotli 1362984 bytes
> ads + trackers + annoyances (104550 network + 0 hide)
 + raw 3179489 bytes
 + gzip 1793145 bytes
 + brotli 1501288 bytes

@seia-soto seia-soto requested a review from chrmod December 11, 2024 07:45
this.buffer[this.pos++] = str.charCodeAt(i);
}
const { written } = TEXT_ENCODER.encodeInto(raw, this.buffer.subarray(this.pos + 4));
this.pushUint32(written);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that wasteful in terms of final engine size?

Copy link
Member Author

@seia-soto seia-soto Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encodeInto doesn't emit C-style NULL character which indicates the end of string at the end. We can consume the string dynamically when reading the binary but I think it's simpler to have a final length of string at the front.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about using 32-bits instead of potentially 8-bits or 16-bits for smaller strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change this to 16 bits (= ~65535 chars). 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced in d6032eb (#4513)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferring to use getLength and pushLength as pushUint16 is breaking the test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're barely called which is not my expect. I only got 5 logs:

17
65
989
117
46
93

@seia-soto Did you figure out why they are barely called and if it's expected? Because if this method is almost not called maybe it's preferable to keep a simpler code for the variable length part.

Copy link
Member Author

@seia-soto seia-soto Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@remusao I got all pushUTF8 calls and these are the entries:

  1. Tracker DB properties
  2. Cosmetic selectors with unicode
  3. Use of optionValue in Network filters (csp, redirect, and replace)
  4. Resource content (including scriptlets)
  5. Preprocessor conditions

I don't think we will see those frequently in testing or benchmarking setup.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will see those frequently in testing or benchmarking setup.

We need to check with production assets being loaded. I don't think you will learn anything otherwise. With which lists did you do the measurement above?

The point being: if when loading full assets with all lists/resources/etc. this method is not called a lot, then I think it's best to keep the simpler version without the variadic length logic (that only makes sense if it has a measurable impact on final binary size)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@remusao Here are the results:

  • textencoder current branch
  • textencoder-2 current implementation with fixed length of 4 bytes
  • npm:adblocker from npm

The processing time can be ignored as system is under pretty heavy-load compared to its performance.

aa@MacBookPro temp0 % ./start.sh 
Switched to branch 'textencoder'
seia-soto/adblocker:textencoder: 6.559ms
[size]
Reserved bytes: 1674549B (~1635.3017578125 kB)
Consumed bytes: 1674549B (~1635.3017578125 kB)

Switched to a new branch 'textencoder-2'
seia-soto/adblocker:textencoder-2: 7.465ms
[size]
Reserved bytes: 1924346B (~1879.244140625 kB)
Consumed bytes: 1924346B (~1879.244140625 kB)

npm:adblocker: 6.808ms
[size]
Reserved bytes: 1718671B (~1678.3896484375 kB)
Consumed bytes: 1718671B (~1678.3896484375 kB)

refs https://github.com/seia-soto/adblocker-test-textencoder-diffing

@seia-soto
Copy link
Member Author

seia-soto commented Dec 12, 2024

> 147.50420889870574 / 156.1296767089117 // benchEngineDeserialization
0.944754463135876
> 147.50420889870574 / 148.7394726802865 // benchEngineSerialization
0.9916951179177841

seia-soto:textencoder, 65764e9

  • benchEngineDeserialization: 147.50420889870574 op/s
  • benchEngineSerialization: 147.50420889870574 op/s

master, de7bfb5

  • benchEngineDeserialization: 156.1296767089117 op/s
  • benchEngineSerialization: 148.7394726802865 op/s

@chrmod chrmod added PR: Bug Fix 🐛 Increment patch version when merged and removed PR: Internal 🏠 Changes only affect internals labels Dec 23, 2024
@seia-soto
Copy link
Member Author

This PR is awaiting for the final review. Further changes are expected to be categorized as a performance improvement.

Copy link
Member

@chrmod chrmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other uses of encode in this file?

packages/adblocker/src/data-view.ts Outdated Show resolved Hide resolved
packages/adblocker/src/data-view.ts Show resolved Hide resolved
@@ -20,6 +20,9 @@ export const EMPTY_UINT32_ARRAY = new Uint32Array(0);
// Check if current architecture is little endian
const LITTLE_ENDIAN: boolean = new Int8Array(new Int16Array([1]).buffer)[0] === 1;

// TextEncoder doesn't need to be recreated every time unlike TextDecoder
const TEXT_ENCODER = new TextEncoder();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively you could do:

const encoder = new TextEncoder();
const encode = encoder.encode.bind(encoder);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

encode is already exported by punycode. Would you like to have another variable name?

Copy link
Member

@chrmod chrmod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a test for the exact filter that was a problem in #4424

Also, would be great to have a fuzz test setup to test serialize/deserialize on random data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: Bug Fix 🐛 Increment patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Overflow error when loading uBO quick fix list
4 participants