
Conversation

@folkertdev (Member) commented Jan 12, 2026

The copy operation will ultimately use the same instructions:

        vmovups xmm0, xmmword ptr [r8 + rdx + 16]
        vmovups xmmword ptr [rax + rdx], xmm0
        vmovups xmm0, xmmword ptr [r8 + rdx + 32]
        vmovups xmmword ptr [rax + rdx + 16], xmm0

However, presenting this to LLVM as a 128-bit load and store from the start optimizes better than the `memmove` that `ptr::copy` generates.
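For illustration, a minimal sketch of the two variants being compared. The `copy16` signature and the `ip`/`op` parameter names are assumptions based on the discussion below; the actual code in lib/common/zstd_internal.rs may differ:

```rust
use core::ptr;

/// Copy exactly 16 bytes from `ip` to `op`.
///
/// Safety: `ip` must be valid for a 16-byte read and `op` for a
/// 16-byte write. The value is read in full before it is written,
/// so overlap within a single call is harmless.
unsafe fn copy16(op: *mut u8, ip: *const u8) {
    // One 16-byte load into a temporary, then one 16-byte store;
    // LLVM lowers this to a single 128-bit load/store pair
    // (e.g. `vmovups` on x86-64).
    let chunk = ptr::read_unaligned(ip as *const [u8; 16]);
    ptr::write_unaligned(op as *mut [u8; 16], chunk);
}

/// The previous variant: `ptr::copy` is semantically a `memmove`,
/// which LLVM treats as an opaque copy and optimizes less
/// aggressively in the surrounding loop.
unsafe fn copy16_via_ptr_copy(op: *mut u8, ip: *const u8) {
    ptr::copy(ip, op, 16);
}
```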

  measurement          mean ± σ            min … max           outliers         delta
  wall_time           257ms ± 2.91ms     253ms …  265ms          1 ( 5%)        ⚡-  2.3% ±  1.1%
  peak_rss           24.1MB ± 34.2KB    24.0MB … 24.2MB          3 (15%)          -  0.0% ±  0.1%
  cpu_cycles         1.12G  ± 13.4M     1.10G  … 1.16G           1 ( 5%)        ⚡-  2.6% ±  1.1%
  instructions       4.04G  ±  646      4.04G  … 4.04G           1 ( 5%)          -  0.1% ±  0.0%
  cache_references   58.7M  ± 1.11M     58.0M  … 63.4M           1 ( 5%)          +  0.3% ±  1.1%
  cache_misses       5.90M  ± 48.8K     5.79M  … 5.99M           1 ( 5%)          -  2.0% ±  1.8%
  branch_misses      4.74M  ± 9.44K     4.73M  … 4.75M           0 ( 0%)          -  0.3% ±  0.1%

This change removes one (1) instruction, but gives a ~2% performance improvement on the benchmarks.

There isn't that much to review really, but maybe you have useful thoughts/notes?

folkertdev requested a review from bjorn3 January 12, 2026 16:04
@codecov (bot) commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Flag | Coverage Δ |
|------|------------|
| test-aarch64-apple-darwin | 32.94% <100.00%> (-0.04%) ⬇️ |
| test-aarch64-unknown-linux-gnu | 31.68% <100.00%> (-0.03%) ⬇️ |
| test-i686-unknown-linux-gnu | 31.74% <100.00%> (-0.03%) ⬇️ |
| test-x86_64-unknown-linux-gnu | 33.38% <100.00%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| lib/common/zstd_internal.rs | 91.52% <100.00%> (+0.29%) ⬆️ |

... and 4 files with indirect coverage changes


@bjorn3 (Collaborator) commented Jan 13, 2026

Do you have a diff of the functions that get optimized better with this?

@folkertdev (Member, Author) commented

This change is most relevant for `ZSTD_wildcopy`. In the godbolts I've extracted the most commonly-used branch of that function. The change removes one `add` instruction from one of the hottest parts of the program. I also suspect the `movups` address calculations are somehow improved, though I don't really understand why that would be.
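For context, a rough sketch of the loop shape in question, assuming a simplified `ZSTD_wildcopy` reduced to its 16-byte-stride branch (the real function handles more cases), using the `copy16` sketch from above:

```rust
/// Simplified sketch of the hot branch of a `ZSTD_wildcopy`-style
/// loop: copy `length` bytes in 16-byte strides, deliberately
/// overshooting the end rather than handling a tail precisely.
///
/// Safety: the destination must have slack past `op + length`,
/// since up to 15 extra bytes are written.
unsafe fn wildcopy(mut op: *mut u8, mut ip: *const u8, length: usize) {
    let op_end = op.add(length);
    loop {
        copy16(op, ip); // one 128-bit load/store pair per iteration
        op = op.add(16);
        ip = ip.add(16);
        if op >= op_end {
            break;
        }
    }
}
```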

@folkertdev (Member, Author) commented

You can replace the body of `copy16` with `std::ptr::copy(ip, op, 16)` to compare.

@bjorn3 (Collaborator) commented Jan 13, 2026

I don't see any difference when I do that.

Edit: Never mind. There were two `copy16` functions. I changed the wrong one.

@folkertdev (Member, Author) commented

Strangely, even though we now emit the same assembly, and the function is called the same number of times according to cachegrind, we still spend much more time on it than the C version.

Alignment doesn't seem to really change anything, and I don't really see how it could be the cache.

folkertdev merged commit 192dd3f into main Jan 13, 2026
19 checks passed
bjorn3 deleted the improve-copy-16 branch January 13, 2026 13:59