 # Matrix-Multiplication Profiling Examples

-This code times the execution of a C = A x B matrix multiplication.
-C and A are column-major, B is row-major.
-A is MxK, B is KxN, C is MxN.
-
-The first multiplication product is checked for correctness on the host.
-
-There are three programs:
-* `sgemm-basic` (`basic.cu`): A global-memory multiplication
-* `sgemm-tiled` (`tiled.cu`): a shared-memory tiled multiplication
-* `sgemm-regtiled-coarsened` (`regtiled_coarsened.cu`): a register-tiled and coarsened multiplication
+This code contains global-memory, shared-memory tiled, and joint shared-memory and register-tiled matrix-matrix multiplications.


 ## Module 1: Nvidia Nsight Compute

-* `1-1-basic`: (`pinned_basic.cu`)
-* `1-2-tiled`: (`pinned_tiled.cu`)
-* `1-3-joint`: (`pinned_joint.cu`)
+Examples for using Nsight Compute to compare kernel performance.
+
+* `1-1-pinned-basic`: (`1_1_pinned_basic.cu`)
+* `1-2-pinned-tiled`: (`1_1_pinned_tiled.cu`)
+* `1-3-pinned-joint`: (`1_1_pinned_joint.cu`)

 ## Module 2: Nvidia Nsight Systems

-* `2-1-pageable-basic`: (`pageable_basic.cu`)
-* `2-2-pinned-basic`: (`pinned_basic.cu`)
-* `2-3-pinned-joint`: (`pinned_joint_wall.cu`)
-* `2-3-pinned-joint`: (`pinned_joint_overlap.cu`)
+Examples for using Nsight Systems to compare data transfer, and the relationship between data transfer and end-to-end time.
+
+* `2-1-pageable-basic`: (`2_1_pageable_basic.cu`)
+* `2-2-pinned-basic`: (`2_2_pinned_basic.cu`)
+* `2-3-pinned-tiled`: (`2_3_pinned_tiled.cu`)
+* `2-4-pinned-tiled-overlap`: (`2_4_pinned_tiled_overlap.cu`)
+* `2-5-pinned-joint`: (`2_5_pinned_joint.cu`)
+* `2-6-pinned-joint-overlap`: (`2_6_pinned_joint_overlap.cu`)

 All programs share the same basic options:

 * Three optional positional arguments to set M, N, and K.
 * `--iters <int>` the number of measured iterations (default `5`)
 * `--warmup <int>` the number of warmup iterations (default `5`)
 * `--check`: check correctness (default `false`). Only use for small multiplications
-
-## Optimizing regtiled_coarsened
-
-regtiled_coarsened.cu:
-
-theoretical occupancy is 75% instead of 100%.
-limited to 12 blocks per SM by registers
-
-Achieved occupancy is only 6.25%
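
The options listed in the README (three optional positional arguments for M, N, and K, plus `--iters`, `--warmup`, and `--check`) could be handled on the host roughly as in the sketch below. This is not the repository's actual code: the `Options` struct, `parse_options`, `reference_sgemm`, and the default sizes for M, N, and K are all hypothetical names and values chosen for illustration. The reference multiply assumes the layout given in the removed README text (A and C column-major, B row-major), which may no longer hold for every example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical option container. The defaults for iters, warmup, and check
// follow the README; the M/N/K defaults are an assumption.
struct Options {
  int m = 256, n = 256, k = 256;
  int iters = 5;
  int warmup = 5;
  bool check = false;
};

// Parse up to three positional integers (M, N, K, in that order) and the
// --iters, --warmup, and --check flags. Unrecognized extras are ignored.
Options parse_options(int argc, const char *const *argv) {
  Options opt;
  int *positional[3] = {&opt.m, &opt.n, &opt.k};
  int seen = 0;
  for (int i = 1; i < argc; ++i) {
    const std::string arg = argv[i];
    if (arg == "--iters") {
      if (i + 1 < argc) opt.iters = std::atoi(argv[++i]);
    } else if (arg == "--warmup") {
      if (i + 1 < argc) opt.warmup = std::atoi(argv[++i]);
    } else if (arg == "--check") {
      opt.check = true;
    } else if (seen < 3) {
      *positional[seen++] = std::atoi(argv[i]);
    }
  }
  return opt;
}

// Host reference for a --check style comparison.
// Assumed layout: A is MxK column-major, B is KxN row-major,
// C is MxN column-major.
std::vector<float> reference_sgemm(const std::vector<float> &a,
                                   const std::vector<float> &b,
                                   int m, int n, int k) {
  std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
  for (int col = 0; col < n; ++col) {
    for (int row = 0; row < m; ++row) {
      float acc = 0.0f;
      for (int i = 0; i < k; ++i) {
        // A(row, i) is at row + i*m; B(i, col) is at i*n + col.
        acc += a[row + static_cast<std::size_t>(i) * m] *
               b[static_cast<std::size_t>(i) * n + col];
      }
      c[row + static_cast<std::size_t>(col) * m] = acc;  // C(row, col)
    }
  }
  return c;
}
```

In a `.cu` file, `main` could call `parse_options(argc, argv)` directly, run `warmup` untimed launches followed by `iters` timed ones, and compare the device result against `reference_sgemm` only when `check` is set. The triple loop costs O(M·N·K) on the host, which is consistent with the README's advice to use `--check` only for small multiplications.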