improve sgemm/README.md

cwpearson · cwpearson · commit 1e803b352453 · 2020-04-16T12:47:36.000-05:00
diff --git a/README.md b/README.md
@@ -24,6 +24,10 @@ If you are developing a workflow and want stability, choose a tag like `amd64-10
     * [Part 3: Nsight Compute](https://youtu.be/UNX0KNMQlW8)
     * [Part 4: Nsight Systems](https://youtu.be/YHrmnaPgFfY)
 
+## Examples
+
+* [sgemm](sgemm): basic, shared-memory tiled, and joint shared-memory and register-tiled matrix multiplication.
+
 ## Installing Nsight Systems and Nsight Compute
 
 There is a command-line (CLI) and graphical (GUI) version of each tool.
diff --git a/sgemm/README.md b/sgemm/README.md
@@ -1,42 +1,30 @@
 # Matrix-Multiplication Profiling Examples
 
-This code times the execution of a C = A x B matrix multiplication.
-C and A are column-major, B is row-major.
-A is MxK, B is KxN, C is MxN.
-
-The first multiplication product is checked for correctness on the host.
-
-There are three programs:
-* `sgemm-basic` (`basic.cu`): A global-memory multiplication
-* `sgemm-tiled` (`tiled.cu`): a shared-memory tiled multiplication
-* `sgemm-regtiled-coarsened` (`regtiled_coarsened.cu`): a register-tiled and coarsened multiplication
+This code contains global-memory, shared-memory tiled, and joint shared-memory and register-tiled matrix-matrix multiplications.
 
 
 ## Module 1: Nvidia Nsight Compute
 
-* `1-1-basic`: (`pinned_basic.cu`)
-* `1-2-tiled`: (`pinned_tiled.cu`)
-* `1-3-joint`: (`pinned_joint.cu`)
+Examples for using Nsight Compute to compare kernel performance.
+
+* `1-1-pinned-basic`: (`1_1_pinned_basic.cu`)
+* `1-2-pinned-tiled`: (`1_2_pinned_tiled.cu`)
+* `1-3-pinned-joint`: (`1_3_pinned_joint.cu`)
 
 ## Module 2: Nvidia Nsight Systems
 
-* `2-1-pageable-basic`: (`pageable_basic.cu`)
-* `2-2-pinned-basic`: (`pinned_basic.cu`)
-* `2-3-pinned-joint`: (`pinned_joint_wall.cu`)
-* `2-3-pinned-joint`: (`pinned_joint_overlap.cu`)
+Examples for using Nsight Systems to compare data transfers and to examine the relationship between data transfer and end-to-end time.
+
+* `2-1-pageable-basic`: (`2_1_pageable_basic.cu`)
+* `2-2-pinned-basic`: (`2_2_pinned_basic.cu`)
+* `2-3-pinned-tiled`: (`2_3_pinned_tiled.cu`)
+* `2-4-pinned-tiled-overlap`: (`2_4_pinned_tiled_overlap.cu`)
+* `2-5-pinned-joint`: (`2_5_pinned_joint.cu`)
+* `2-6-pinned-joint-overlap`: (`2_6_pinned_joint_overlap.cu`)
 
 All programs share the same basic options:
 
 * Three optional positional arguments to set M, N, and K.
 * `--iters <int>` the number of measured iterations (default `5`)
 * `--warmup <int>` the number of warmup iterations (default `5`)
 * `--check`: check correctness (default `false`). Only use for small multiplications
-
-## Optimizing regtiled_coarsened
-
-regtiled_coarsened.cu: 
-
-theoretical occupancy is 75% instead of 100%.
-limited to 12 blocks per SM by registers
-
-Achieved occupancy is only 6.25%