
Conversation

@folkertdev (Member) commented Jan 12, 2026

The copy operation will ultimately use the same instructions:

        vmovups xmm0, xmmword ptr [r8 + rdx + 16]
        vmovups xmmword ptr [rax + rdx], xmm0
        vmovups xmm0, xmmword ptr [r8 + rdx + 32]
        vmovups xmmword ptr [rax + rdx + 16], xmm0

However, presenting this to LLVM as a 128-bit load and store from the start optimizes better than the `memmove` that `ptr::copy` generates.
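For illustration, a minimal sketch of the two variants being compared. The `copy16` signature and the `ip`/`op` parameter names are assumptions based on the discussion below; the actual code in lib/common/zstd_internal.rs may differ:

```rust
use core::ptr;

/// Copy exactly 16 bytes from `ip` to `op`.
///
/// Safety: `ip` must be valid for a 16-byte read and `op` for a
/// 16-byte write. The value is read in full before it is written,
/// so overlap within a single call is harmless.
unsafe fn copy16(op: *mut u8, ip: *const u8) {
    // One 16-byte load into a temporary, then one 16-byte store;
    // LLVM lowers this to a single 128-bit load/store pair
    // (e.g. `vmovups` on x86-64).
    let chunk = ptr::read_unaligned(ip as *const [u8; 16]);
    ptr::write_unaligned(op as *mut [u8; 16], chunk);
}

/// The previous variant: `ptr::copy` is semantically a `memmove`,
/// which LLVM treats as an opaque copy and optimizes less
/// aggressively in the surrounding loop.
unsafe fn copy16_via_ptr_copy(op: *mut u8, ip: *const u8) {
    ptr::copy(ip, op, 16);
}
```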

  measurement          mean ± σ            min … max           outliers         delta
  wall_time           257ms ± 2.91ms     253ms …  265ms          1 ( 5%)        ⚡-  2.3% ±  1.1%
  peak_rss           24.1MB ± 34.2KB    24.0MB … 24.2MB          3 (15%)          -  0.0% ±  0.1%
  cpu_cycles         1.12G  ± 13.4M     1.10G  … 1.16G           1 ( 5%)        ⚡-  2.6% ±  1.1%
  instructions       4.04G  ±  646      4.04G  … 4.04G           1 ( 5%)          -  0.1% ±  0.0%
  cache_references   58.7M  ± 1.11M     58.0M  … 63.4M           1 ( 5%)          +  0.3% ±  1.1%
  cache_misses       5.90M  ± 48.8K     5.79M  … 5.99M           1 ( 5%)          -  2.0% ±  1.8%
  branch_misses      4.74M  ± 9.44K     4.73M  … 4.75M           0 ( 0%)          -  0.3% ±  0.1%

This change removes one (1) instruction, but gives a ~2% performance improvement on the benchmarks.

There isn't that much to review really, but maybe you have useful thoughts/notes?

folkertdev requested a review from bjorn3 January 12, 2026 16:04
@codecov (bot) commented Jan 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Flag | Coverage Δ |
|------|------------|
| test-aarch64-apple-darwin | 32.94% <100.00%> (-0.04%) ⬇️ |
| test-aarch64-unknown-linux-gnu | 31.68% <100.00%> (-0.03%) ⬇️ |
| test-i686-unknown-linux-gnu | 31.74% <100.00%> (-0.03%) ⬇️ |
| test-x86_64-unknown-linux-gnu | 33.38% <100.00%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| lib/common/zstd_internal.rs | 91.52% <100.00%> (+0.29%) ⬆️ |

... and 4 files with indirect coverage changes


@bjorn3 (Collaborator) commented Jan 13, 2026

Do you have a diff of the functions that get optimized better with this?

@folkertdev (Member, Author) commented

This change is most relevant for `ZSTD_wildcopy`. In the godbolts I've extracted the most commonly-used branch of that function. The change removes one `add` instruction from one of the hottest parts of the program. I also suspect the `movups` address calculations are somehow improved, though I don't really understand why that would be.
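For context, a rough sketch of the loop shape in question, assuming a simplified `ZSTD_wildcopy` reduced to its 16-byte-stride branch (the real function handles more cases), using the `copy16` sketch from above:

```rust
/// Simplified sketch of the hot branch of a `ZSTD_wildcopy`-style
/// loop: copy `length` bytes in 16-byte strides, deliberately
/// overshooting the end rather than handling a tail precisely.
///
/// Safety: the destination must have slack past `op + length`,
/// since up to 15 extra bytes are written.
unsafe fn wildcopy(mut op: *mut u8, mut ip: *const u8, length: usize) {
    let op_end = op.add(length);
    loop {
        copy16(op, ip); // one 128-bit load/store pair per iteration
        op = op.add(16);
        ip = ip.add(16);
        if op >= op_end {
            break;
        }
    }
}
```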

@folkertdev (Member, Author) commented

You can replace the body of `copy16` with `std::ptr::copy(ip, op, 16)` to compare.

@bjorn3 (Collaborator) commented Jan 13, 2026

I don't see any difference when I do that.

Edit: Never mind. There were two `copy16` functions. I changed the wrong one.

@folkertdev (Member, Author) commented

Strangely, even though we now emit the same assembly, and the function is called the same number of times according to cachegrind, we still spend much more time on it than the C version.

Alignment doesn't seem to really change anything, and I don't really see how it could be the cache.

folkertdev merged commit 192dd3f into main Jan 13, 2026
19 checks passed
bjorn3 deleted the improve-copy-16 branch January 13, 2026 13:59