Commit 1e803b3

improve sgemm/README.md
1 parent 8aad0c6 commit 1e803b3

2 files changed

+18
-26
lines changed

README.md

Lines changed: 4 additions & 0 deletions
@@ -24,6 +24,10 @@ If you are developing a workflow and want stability, choose a tag like `amd64-10
 * [Part 3: Nsight Compute](https://youtu.be/UNX0KNMQlW8)
 * [Part 4: Nsight Systems](https://youtu.be/YHrmnaPgFfY)
 
+## Examples
+
+* [sgemm](sgemm) Featuring basic, shared-memory tiled, and joint shared-memory and register tiling.
+
 ## Installing Nsight Systems and Nsight Compute
 
 There is a command-line (CLI) and graphical (GUI) version of each tool.

sgemm/README.md

Lines changed: 14 additions & 26 deletions
@@ -1,42 +1,30 @@
 # Matrix-Multiplication Profiling Examples
 
-This code times the execution of a C = A x B matrix multiplication.
-C and A are column-major, B is row-major.
-A is MxK, B is KxN, C is MxN.
-
-The first multiplication product is checked for correctness on the host.
-
-There are three programs:
-* `sgemm-basic` (`basic.cu`): A global-memory multiplication
-* `sgemm-tiled` (`tiled.cu`): a shared-memory tiled multiplication
-* `sgemm-regtiled-coarsened` (`regtiled_coarsened.cu`): a register-tiled and coarsened multiplication
+This code contains a global memory, shared-memory tiled, and joint shared-memory and register-tiled matrix matrix multiplications.
 
 
 ## Module 1: Nvidia Nsight Compute
 
-* `1-1-basic`: (`pinned_basic.cu`)
-* `1-2-tiled`: (`pinned_tiled.cu`)
-* `1-3-joint`: (`pinned_joint.cu`)
+Examples for using Nsight Compute to compare kernel performance.
+
+* `1-1-pinned-basic`: (`1_1_pinned_basic.cu`)
+* `1-2-pinned-tiled`: (`1_1_pinned_tiled.cu`)
+* `1-3-pinned-joint`: (`1_1_pinned_joint.cu`)
 
 ## Module 2: Nvidia Nsight Systems
 
-* `2-1-pageable-basic`: (`pageable_basic.cu`)
-* `2-2-pinned-basic`: (`pinned_basic.cu`)
-* `2-3-pinned-joint`: (`pinned_joint_wall.cu`)
-* `2-3-pinned-joint`: (`pinned_joint_overlap.cu`)
+Examples for using Nsight Systems to compare data transfer, and relationship between data transfer and end-to-end time.
+
+* `2-1-pageable-basic`: (`2_1_pageable_basic.cu`)
+* `2-2-pinned-basic`: (`2_2_pinned_basic.cu`)
+* `2-3-pinned-tiled`: (`2_3_pinned_tiled.cu`)
+* `2-4-pinned-tiled-overlap`: (`2_4_pinned_tiled_overlap.cu`)
+* `2-5-pinned-joint`: (`2_5_pinned_joint.cu`)
+* `2-6-pinned-joint-overlap`: (`2_6_pinned_joint_overlap.cu`)
 
 All programs share the same basic options:
 
 * Three optional positional arguments to set M, N, and K.
 * `--iters <int>` the number of measured iterations (default `5`)
 * `--warmup <int>` the number of warmup iterations (default `5`)
 * `--check`: check correctness (default `false`). Only use for small multiplications
-
-## Optimizing regtiled_coarsened
-
-regtiled_coarsened.cu:
-
-theoretical occupancy is 75% instead of 100%.
-limited to 12 blocks per SM by registers
-
-Achieved occupancy is only 6.25%
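
The README text removed in this commit describes the data layout (A is MxK and column-major, B is KxN and row-major, C is MxN and column-major) and mentions a host-side correctness check, which `--check` presumably enables. As a rough illustration of what such a check could look like, here is a minimal host-only sketch. The names `sgemm_reference` and `all_close` and the tolerance value are hypothetical and are not taken from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host reference for C = A x B using the layout from the README:
// A (M x K) and C (M x N) are column-major, B (K x N) is row-major.
std::vector<float> sgemm_reference(const std::vector<float> &a,
                                   const std::vector<float> &b,
                                   std::size_t m, std::size_t n,
                                   std::size_t k) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        // A(i, p) lives at p * m + i (column-major),
        // B(p, j) lives at p * n + j (row-major).
        acc += a[p * m + i] * b[p * n + j];
      }
      c[j * m + i] = acc; // C(i, j), column-major
    }
  }
  return c;
}

// Hypothetical element-wise comparison with a relative tolerance.
// The default tolerance is an assumption, not a value from the repository.
bool all_close(const std::vector<float> &x, const std::vector<float> &y,
               float rel_tol = 1e-4f) {
  if (x.size() != y.size()) {
    return false;
  }
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float scale = std::max(1.0f, std::fabs(y[i]));
    if (std::fabs(x[i] - y[i]) > rel_tol * scale) {
      return false;
    }
  }
  return true;
}
```

A check along these lines would copy the device result back to the host and compare it against `sgemm_reference` with `all_close`. The reference costs O(M·N·K) on a single CPU thread, which is consistent with the README's advice to use `--check` only for small multiplications.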
