Strange... When I tested PR #2248 I checked the disassembled Example 0 and then Example 1, so I concluded that the speed-up is due to the reduced … I just disassembled the code with the suggested change:

Details

Disassembled:

Details

This code loads correctly, but the speed is clearly slower...

---

So, I don't know what the Metal compiler on your computer does, but on my M2 Max, the kernel below gives me a run time of 18.7 ms/token for 7B.

---

Well, yes, it looks like the Metal compiler does need a hand here and there.

This is not my experience. In fact, letting a thread in a simd group compute half a block at a time for …
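For context, here is a minimal sketch of the "one thread computes half a block at a time" idea for q4_0. This is a toy illustration under assumptions (the llama.cpp q4_0 layout of a 16-bit scale `d` followed by 16 packed bytes, and a dispatch of a single 32-thread simdgroup), not the kernel from the discussion:

```metal
#include <metal_stdlib>
using namespace metal;

#define QK4_0 32

// q4_0 block layout as in llama.cpp: a 16-bit scale followed by 32 packed 4-bit quants
typedef struct {
    half    d;
    uint8_t qs[QK4_0 / 2];   // low nibble of qs[j] = element j, high nibble = element j + 16
} block_q4_0;

// Toy kernel: each pair of threads handles one block, so every thread
// dequantizes and accumulates half a block (16 values) at a time.
// Assumes a dispatch of one 32-thread simdgroup covering the first 16 blocks of x/y.
kernel void dot_q4_0_half_block(
        device const block_q4_0 * x   [[buffer(0)]],
        device const float      * y   [[buffer(1)]],
        device float            * out [[buffer(2)]],
        uint tiisg [[thread_index_in_simdgroup]])
{
    const uint ib    = tiisg / 2;     // block handled by this pair of threads
    const uint hb    = tiisg & 1;     // 0 = low nibbles (elements 0..15), 1 = high nibbles (16..31)
    const uint shift = hb ? 4 : 0;

    device const block_q4_0 & b  = x[ib];
    device const float      * yb = y + ib*QK4_0 + hb*16;

    float sum = 0.0f;
    for (uint j = 0; j < QK4_0/2; ++j) {
        const int q = (b.qs[j] >> shift) & 0xF;   // one 4-bit quant
        sum += (q - 8) * yb[j];
    }
    sum *= (float) b.d;

    // combine the 32 partial half-block sums across the simdgroup
    const float total = simd_sum(sum);
    if (tiisg == 0) {
        out[0] = total;
    }
}
```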

---

Apple doesn't provide any kind of disassembler as part of its developer tools to let us take a look at the low-level code. However, thanks to the efforts of the open-source community, we do have a functional disassembler.
Usage
Clone the repository https://github.com/dougallj/applegpu, and run the command `python compiler_explorer.py test.metal`. You can have any kind of macros or templates in your `.metal` file, but you can only have one `kernel` function. Detailed explanations for each instruction can be found at https://dougallj.github.io/applegpu/docs.html. I feel that the instructions for the Apple GPU are a bit RISC-like: there are instructions to load and store values between memory and registers, and other instructions that operate on registers, but no "load-operate" instructions.
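For example, a minimal `test.metal` could look like this (a hypothetical toy kernel, only meant to show the one-kernel-per-file shape of the input):

```metal
#include <metal_stdlib>
using namespace metal;

// hypothetical minimal input for compiler_explorer.py:
// macros and templates are fine, but only one kernel function per file
kernel void add_one(device const float * src [[buffer(0)]],
                    device float       * dst [[buffer(1)]],
                    uint tid [[thread_position_in_grid]])
{
    dst[tid] = src[tid] + 1.0f;
}
```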
Example 0

Here is the `kernel_mul_mat_q4_0_f32` function from the master branch. I removed some logic for simplicity.

Details
And here is the disassembled code (very long):
Details
The kernel starts at the label `compute shader:`, and I usually analyze the structure by finding the `jmp` and `device_load` instructions. In this code there are two `jmp_exec_any`, at `0x9b2` and `0x9cc`, jumping to `0x298` and `0x164`. These correspond to the two loops in our code. Between `0x164` and `0x298` we can see a block of `device_load`:

The first `device_load` loads 4 `i32` into the 32-bit registers `r5`, `r6`, `r7`, `r8` from the address stored in `r39`, `r40`. The whole 8 `device_load` load 32 `float`. Those are the 32 `float` stored in `y_curr` before the inner loop starts. The `wait 1` means waiting until the `group 1` load instructions have finished, which are the first four `device_load` in this block.

Between `0x298` and `0x9b2` is our inner loop, where we again look for `device_load`. Notice that `r48h` means the high 16 bits of register `r48` and `r48l` means the low 16 bits.

In our code the inner loop runs 4 times, each time loading one block. In the assembly the inner loop actually runs 2 times, each time loading 2 blocks. When it loads a block, it first loads one 16-bit value, then another 16-bit value, then four 16-bit values, then two 16-bit values, and finally one last 16-bit value. Before and after these `device_load` we also see a lot of `mov` and `bfeil`, meaning the GPU copies a 16-bit value to another register and masks its high or low 8 bits.
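For reference, that 1 + 1 + 4 + 2 + 1 pattern of 16-bit loads adds up to 18 bytes, which is exactly the size of a q4_0 block as defined in the llama.cpp sources (presumably one load for the scale and eight for the packed quants):

```metal
#define QK4_0 32

// q4_0 block as defined in llama.cpp: 2 bytes of scale + 16 bytes of packed
// nibbles = 18 bytes, matching the nine 16-bit device_load per block above.
typedef struct {
    half    d;               // 16-bit scale
    uint8_t qs[QK4_0 / 2];   // 32 x 4-bit quants packed into 16 bytes
} block_q4_0;
```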
Example 1

Here is the `kernel_mul_mat_q4_0_f32` function from PR #2248. I removed some logic for simplicity.

Details
And here is the disassembled code (very long):
Details
The structure is similar to Example 0, so we only analyze the inner loop:
Now when it loads a block it first loads one 16-bit value, then four 16-bit values, and then another four 16-bit values. Before and after these `device_load` we don't see `mov` and `bfeil` any more, because now we operate directly on 16-bit values.
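To illustrate the difference, here is a sketch of the idea only (not the code from PR #2248): if the packed quants are read as `uint16_t` instead of `uint8_t`, the nibbles can be extracted with plain 16-bit shifts and masks, so no byte-wise `mov`/`bfeil` is needed. The element ordering of q4_0 is ignored here for brevity.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch: dequantize one q4_0 block by reading its 16 packed
// bytes as 8 x uint16_t and extracting nibbles with 16-bit shifts/masks.
kernel void dequant_q4_0_16bit_sketch(
        device const half     * d   [[buffer(0)]],  // per-block scale
        device const uint16_t * qs  [[buffer(1)]],  // 8 x uint16_t = 16 packed bytes
        device float          * out [[buffer(2)]],
        uint tid [[thread_position_in_grid]])       // expects 8 threads
{
    const float    scale = (float) d[0];
    const uint16_t q     = qs[tid];                 // four 4-bit quants per 16-bit value

    out[4*tid + 0] = (int((q >>  0) & 0xF) - 8) * scale;
    out[4*tid + 1] = (int((q >>  4) & 0xF) - 8) * scale;
    out[4*tid + 2] = (int((q >>  8) & 0xF) - 8) * scale;
    out[4*tid + 3] = (int((q >> 12) & 0xF) - 8) * scale;
}
```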