ValueUseState calculation too coarse for x64's sinkable_load? #10010

Open
alexcrichton opened this issue Jan 14, 2025 · 2 comments

@alexcrichton
Member

I was profiling some SIMD-using code today and noticed that the WebAssembly SIMD instruction v128.load32_splat wasn't generating the x64 instruction I thought it was supposed to. I expected a single memory-operand broadcast (vbroadcastss), but here is the CLIF input and the lowering I got:

function u0:0(i64, i32x4) -> i32x4 tail {
block0(v0: i64, v1: i32x4):
    v2 = iadd_imm v0, 100
    jump block2(v0, v1)

block2(v3: i64, v5: i32x4):
    v6 = load.i32 v3
    v7 = splat.i32x4 v6
    v8 = iadd v5, v7
    v11 = iadd_imm v3, 4
    v12 = icmp eq v11, v2
    brif v12, block2(v11, v8), block3

block3:
    return v8
}

$ cargo run compile ../clif/wasm_func_0.clif --target x86_64 --set has_avx2 --set has_avx -D
Disassembly of 41 bytes <u0:0>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 8d 4f 64             leaq    0x64(%rdi), %rcx
   8:   8b 17                   movl    (%rdi), %edx
   a:   c5 f9 6e e2             vmovd   %edx, %xmm4
   e:   c4 e2 79 58 f4          vpbroadcastd    %xmm4, %xmm6
  13:   c5 f9 fe c6             vpaddd  %xmm6, %xmm0, %xmm0
  17:   48 83 c7 04             addq    $4, %rdi
  1b:   48 39 cf                cmpq    %rcx, %rdi
  1e:   0f 84 e4 ff ff ff       je      8
  24:   48 89 ec                movq    %rbp, %rsp
  27:   5d                      popq    %rbp
  28:   c3                      retq

Notably, at offset e it's not using vbroadcastss but instead a combo of movl, vmovd, and vpbroadcastd. I don't have a great way of measuring the relative performance of one versus the other; my hope was to convince Cranelift to generate one or the other so I could measure.

The reason this lowering rule isn't firing seems to be:

...
 TRACE cranelift_codegen::machinst::lower      > arg v6 used, old state Unused, new Once
...
 TRACE cranelift_codegen::machinst::lower      > arg v8 used, old state Once, new Multiple
...
 TRACE cranelift_codegen::machinst::lower      >  -> DFS reaches v6
 TRACE cranelift_codegen::machinst::lower      >  -> became Multiple
...
 TRACE cranelift_codegen::machinst::lower      > lowering: inst inst3: v7 = splat.i32x4 v6
 TRACE cranelift_codegen::machinst::lower      > get_input_for_val: val v6 at cur_inst Some(inst3) cur_scan_entry_color Some(InstColor(4))
 TRACE cranelift_codegen::machinst::lower      >  -> src inst inst2
 TRACE cranelift_codegen::machinst::lower      >  -> has lowering side effect: true
 TRACE cranelift_codegen::machinst::lower      >  -> side-effecting op inst2 for val v6: use state Multiple
 TRACE cranelift_codegen::machinst::lower      > put_value_in_regs: val v6

(or at least I think this is why sinkable_load isn't firing)

I know I've run up against compute_use_states in the past, and it's subtle enough that I can't ever seem to keep it in my head. I can't wrap my head around it this time either...
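If I'm reading the trace right, my mental model of the propagation is roughly the following (a simplified sketch, not the actual compute_use_states code; the value names mirror the CLIF above):

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum UseState {
    Unused,
    Once,
    Multiple,
}

// Record one use of `v`; when `v` saturates to Multiple, flow Multiple
// backward through the operands of its defining instruction, since a
// lowering rule rooted at any user of `v` may match arbitrarily deep.
fn record_use<'a>(
    v: &'a str,
    defs: &HashMap<&'a str, Vec<&'a str>>,
    state: &mut HashMap<&'a str, UseState>,
) {
    let new = match state.get(v).copied().unwrap_or(UseState::Unused) {
        UseState::Unused => UseState::Once,
        _ => UseState::Multiple,
    };
    state.insert(v, new);
    if new == UseState::Multiple {
        // Backward DFS, mirroring the "DFS reaches v6 / became Multiple" trace.
        let mut stack = defs.get(v).cloned().unwrap_or_default();
        while let Some(op) = stack.pop() {
            if state.insert(op, UseState::Multiple) != Some(UseState::Multiple) {
                stack.extend(defs.get(op).cloned().unwrap_or_default());
            }
        }
    }
}

fn main() {
    // Defs from the CLIF above: v7 = splat.i32x4 v6; v8 = iadd v5, v7.
    // (v6 is defined by the load, the side-effecting instruction.)
    let defs: HashMap<&str, Vec<&str>> = HashMap::from([
        ("v7", vec!["v6"]),
        ("v8", vec!["v5", "v7"]),
    ]);
    let mut state: HashMap<&str, UseState> = HashMap::new();

    record_use("v6", &defs, &mut state); // use by v7 = splat v6
    record_use("v7", &defs, &mut state); // use by v8 = iadd v5, v7
    record_use("v8", &defs, &mut state); // use as branch arg at the brif
    record_use("v8", &defs, &mut state); // use by the return in block3

    // v8's second use makes it Multiple; the DFS then forces v7 and v6 to
    // Multiple too, so the load defining v6 is no longer treated as sinkable.
    assert_eq!(state["v6"], UseState::Multiple);
}

So even though the splat is v6's only direct user, the two uses of v8 flow backward through the iadd and the splat and mark v6 Multiple, which is what the "side-effecting op inst2 for val v6: use state Multiple" line reflects.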

@cfallin
Member

cfallin commented Jan 14, 2025

So the issue here is that v8 is indeed used multiple times, and multiplicity is necessarily transitive, because we can't know how deep the rule-matching will go; so we can't allow the load to sink.

Imagine the following (very smart, but possible) instruction selection rules:

  • A branch argument (v8 in the args to block2 at the brif) is lowered with a special mega-CISC instruction that merges an add, a splat, and a load all together with the branch. We would thus perform the load's side effect here at the branch, if we permit the load to sink.
  • A return-value sequence is lowered with a special mega-CISC instruction that merges an add, a splat, and a load all together, puts the result in the return-value register, and returns. We would thus perform the load's side effect here at the return op, if we permit the load to sink.

This particular example is a bit far-fetched, but it's important that we don't encode even more subtle knowledge about the possible instruction selection combinations into the core algorithms; they have to assume that rule-matching can go arbitrarily deep. We cannot allow the load to sink twice to two different locations, so the multiplicity analysis (correctly) propagates this multiple-use status backward to the load and prevents it from sinking.
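To make the hazard concrete, here's a toy illustration in plain Rust (nothing to do with the actual lowering code; the_load is just a stand-in with an observable side effect): if both hypothetical lowerings above sank the same load, the memory read would execute twice, and duplicated reads are observable.

use std::sync::atomic::{AtomicU32, Ordering};

// Stand-in for the load that defines v6 above: each call has an observable
// effect (here, bumping a counter; think MMIO, or memory another thread writes).
static CELL: AtomicU32 = AtomicU32::new(0);

fn the_load() -> u32 {
    CELL.fetch_add(1, Ordering::SeqCst)
}

fn main() {
    // Original program: one load, its value reused twice (like v8's two uses).
    let v = the_load();
    let at_branch = v;
    let at_return = v;
    assert_eq!(at_branch, at_return); // always holds: the load ran once

    // "Sunk twice": each lowering site re-executes the load.
    let at_branch2 = the_load();
    let at_return2 = the_load();
    // The two uses now observe different values; the side effect ran twice.
    assert_ne!(at_branch2, at_return2);
}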

> I know I've run up against compute_use_states in the past, and it's subtle enough that I can't ever seem to keep it in my head. I can't wrap my head around it this time either...

Believe it or not, this was the simplest abstraction I could come up with at the time (I know, I'm not totally happy either...). It certainly errs on the side of safety, but it did prevent real bugs we had / kept creating at the time. I know this last came up here, and the explanation I gave there is as from-first-principles as I could make it; suggestions are welcome for ways to explain it better and/or better algorithms that preserve all the safety properties we want!

@cfallin
Member

cfallin commented Jan 14, 2025

I'll note as well that we could relax multiplicity if we "cut" it at certain points by pinning values into registers -- e.g., constrain that v8 must be computed into a register, then allow it to be computed just once (and avoid multiplicity flowing backward). However, placing those cut-points well is a provably NP-hard combinatorial optimization problem, which doesn't play nicely with our general ethos of single-pass algorithms (or at worst, fast fixpoint analyses before our main lowering pass), so we haven't really pursued it further...
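As a sketch of what such a cut-point could mean (purely hypothetical; nothing like this is implemented): if the analysis knew v8 would be pinned into a register, its multiplicity would be absorbed there instead of flowing back through v7 to v6, and the load could still sink into its single splat use.

use std::collections::{HashMap, HashSet};

#[allow(dead_code)] // Unused isn't exercised in this tiny example
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum UseState {
    Unused,
    Once,
    Multiple,
}

// Hypothetical variant of the backward propagation: values in `pinned` are
// forced into a register, so their multiplicity is absorbed there instead of
// flowing back through their operands.
fn mark_multiple<'a>(
    root: &'a str,
    defs: &HashMap<&'a str, Vec<&'a str>>,
    pinned: &HashSet<&'a str>,
    state: &mut HashMap<&'a str, UseState>,
) {
    if pinned.contains(root) {
        return; // cut: `root` is computed once into a register and reused
    }
    let mut stack = defs.get(root).cloned().unwrap_or_default();
    while let Some(v) = stack.pop() {
        if pinned.contains(v) {
            continue; // another cut-point
        }
        if state.insert(v, UseState::Multiple) != Some(UseState::Multiple) {
            stack.extend(defs.get(v).cloned().unwrap_or_default());
        }
    }
}

fn main() {
    // v7 = splat.i32x4 v6; v8 = iadd v5, v7, as in the CLIF above.
    let defs: HashMap<&str, Vec<&str>> =
        HashMap::from([("v7", vec!["v6"]), ("v8", vec!["v5", "v7"])]);
    // Suppose some (hypothetical) earlier decision pinned v8 into a register.
    let pinned: HashSet<&str> = HashSet::from(["v8"]);
    let mut state: HashMap<&str, UseState> = HashMap::from([
        ("v6", UseState::Once),
        ("v7", UseState::Once),
        ("v8", UseState::Multiple), // branch arg + return
    ]);

    mark_multiple("v8", &defs, &pinned, &mut state);

    // With the cut at v8, v6 stays Once, so the load feeding the splat could
    // still be sunk.
    assert_eq!(state["v6"], UseState::Once);
}

The `pinned` set is hand-picked in this toy; deciding which values to pin across a whole function is where the combinatorial-optimization difficulty comes in.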
