
Performance investigations on RDNA2 cards. #28

Open
Azralee opened this issue Feb 6, 2021 · 23 comments

@Azralee
Contributor

Azralee commented Feb 6, 2021

I ran the ray tracer through AMD's GPU profiler to check how it runs on my RX 6800. (I will upload the results in a different issue.)
It reported that the RT shader is limited to half occupancy by its LDS usage of 4 KB, and that it uses 80 vector registers.
Decreasing LDS usage to 3072 B could increase the occupancy to up to 12 parallel wavefronts (warps on NVIDIA hardware), the maximum for 80 vector registers. This should improve performance, as less time is spent idle.

Reducing LDS usage to 2048 B would allow further optimizations to VGPR (Vector General Purpose Register) usage. Reducing VGPRs to 64 would allow full occupancy and presumably maximum performance.

Edit: I believe LDS is AMD's name for Workgroup Memory.

@Azralee
Contributor Author

Azralee commented Feb 6, 2021

This is way more complicated than the original comment stated. I need to study this more.

@Azralee Azralee closed this as completed Feb 6, 2021
@GPSnoopy
Owner

GPSnoopy commented Feb 7, 2021

Yes, I think it might be specific to AMD, as they use the normal shader resources for BVH traversals. Register usage might be different on other cards.

I tried using NVIDIA Nsight Graphics 2021.1.0 last night to profile the Vulkan ray tracing shaders in Scene 1. It hinted that the main bottlenecks were L1TEX (I assume the L1 cache) and MIO (Memory IO), but the feature is in beta and light on information. It doesn't yet correlate with the source code (or even the SPIR-V bytecode) in all but the simplest cases, so it's hard to identify the actual location of the bottlenecks in the shaders.

@Azralee
Contributor Author

Azralee commented Feb 7, 2021

Yeah, I tried to use AMD's static analyzer (RGA) yesterday, but they have not released an update with the latest SPIR-V tools yet.

AMD's Vulkan traversal code seems under-optimized at the moment, especially in comparison with the DX12 one.
I have attached some traces. Metro Exodus seems to be the most optimized RT title on AMD that I have found so far. And even Control (which runs awfully) seems to use half the LDS of the Vulkan examples.

Trace from this project: (screenshot)
Trace from Quake2 RTX: (screenshot)
Trace from Metro Exodus (DX12): (screenshot)
Trace from Control (Corridor of DOOM, as Digital Foundry puts it) (RT passes: BVH, reflections, diffuse reflections, GI, contact shadows): (screenshot)

@Azralee
Contributor Author

Azralee commented Feb 9, 2021

I think I figured out the performance difference.
(heatmap screenshot: RT-Heatmap)
If you take a look at the heatmap, you can see that it is relatively coarse: specifically, it is made of batches of 8x8 pixels, i.e. 64 pixels. This means that the wavefront runs until the last ray has terminated. Combined with the currently relatively low occupancy, this could explain the odd results.

@Azralee Azralee reopened this Feb 9, 2021
@Azralee Azralee changed the title Performance improvements on RDNA2 cards. Performance differenceon RDNA2 cards. Feb 9, 2021
@Azralee Azralee changed the title Performance differenceon RDNA2 cards. Performance difference on RDNA2 cards. Feb 9, 2021
@Azralee
Contributor Author

Azralee commented Feb 9, 2021

It could probably be faster to trace a single sample per shader invocation and dispatch a bunch of samples per pixel at the same time; no idea if it's possible, though.

@Azralee
Contributor Author

Azralee commented Feb 9, 2021

This hypothesis seems convincing. I ran the ray tracer at 5120x2880 and at 1280x720:
(heatmap screenshots at 2880p and 720p)
The 64-pixel blocks are very visible in 720p and cover a lot of the central statue, while in 2880p the hot blocks are much less widely distributed. This looks much closer to the NVIDIA heatmap in README.md and corresponds more to the naively expected behaviour.

To test whether this improves performance, I ran the benchmark at different resolutions but identical ray counts (2880p has 16x the pixels of 720p), for 60 s per scene:

| Resolution \ Scene | Rays per frame | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 720p, 16 samples | 14,745,600 | 70.02 | 69.37 | 30.60 | 55.71 | 20.23 |
| 2880p, 1 sample | 14,745,600 | 71.67 | 70.76 | 35.51 | 54.79 | 20.33 |
| Performance of higher res | 100% | 102.36% | 102.00% | 116.05% | 98.35% | 100.50% |

The performance is highly scene dependent, with most scenes being more or less within the margin of error. Scene 3 with the Lucy statues is improved by quite a lot, just as expected from the heatmaps.

@Azralee
Contributor Author

Azralee commented Feb 9, 2021

The bad performance in the Cornell box scenes makes a lot of sense, since only very few triangles (36, if I count correctly) make up the scene.
Ampere doubled Ray-Triangle intersection throughput, which is basically all that happens in the Cornell box. On real models, like Lucy, the BVH hierarchy should be much deeper. If we assume a BVH4 (as the RDNA2 ISA guide implies), then the 448K triangles of each Lucy statue should sit in a BVH more than 10 layers deep, while the Cornell box should fit in a BVH about 5 layers deep in total.
The Cornell box is effectively a micro-benchmark of Ray-Triangle performance, so it's no wonder that Ampere runs so fast here.

@Azralee Azralee changed the title Performance difference on RDNA2 cards. Performance investigations on RDNA2 cards. Mar 8, 2021
@Azralee
Contributor Author

Azralee commented Mar 8, 2021

I did some more testing and found a way to force wave32 execution for the RT shaders on RDNA2. This alone improved performance by ~6%. Moving line 101 directly behind the load in line 85 reduced the number of vector registers used and increased performance further:
FPS table:

| Execution mode | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- |
| wave32 optimized | 48.78 | 48.14 | 23.12 | 36.26 | 13.83 |
| wave32 | 43.67 | 43.09 | 20.13 | 33.09 | 12.04 |
| wave64 (base) | 40.99 | 40.66 | 18.94 | 31.59 | 11.29 |

Improvements over base:

| Execution mode | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- |
| wave32 optimized | 19.00% | 18.40% | 22.07% | 14.78% | 22.50% |
| wave32 | 6.54% | 5.98% | 6.28% | 4.75% | 6.64% |
| wave64 (base) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |

Unfortunately, there is currently no way to indicate a preference for wave32 outside of the compiler's heuristics.

I will open a pull request later.

Azralee pushed a commit to Azralee/RayTracingInVulkan that referenced this issue Mar 8, 2021
This patch improves performance on RDNA2 cards by 19%, as documented in issue GPSnoopy#28. It does this by switching the execution mode to wave32 (via the change in line 27) and reducing the amount of used vector registers in Wave32 mode to increase occupancy (via the imageStore in line 87).
@CasperTheCat

I recently found AMD's video on GPUOpen that covers some performance tips for DXR 1.1.
https://gpuopen.com/videos/amd-rdna2-directx-raytracing/

At least in DirectX12, AMD recommends moving TraceRay to the compute queue and dispatching in 8x4 tiles. This dispatch size appears to line up with your findings, namely the optimality of wave32 and the LDS pressure.

@GPSnoopy
Owner

It does feel like it's on AMD to do these optimisations automatically in their JIT compiler. Keep in mind that NVIDIA's Vulkan ray tracing performance pretty much doubled since it was introduced two years ago, purely thanks to driver improvements. I'm hoping AMD can address some of the low-hanging fruit relatively quickly.

@Azralee
Contributor Author

Azralee commented Mar 20, 2021

Well, the newest driver automatically defaults to wave32, so that's good. It makes the first part of my pull request unnecessary.

@Azralee
Contributor Author

Azralee commented Mar 20, 2021

The second part also does nothing anymore, so they are indeed working on it.

@Azralee
Contributor Author

Azralee commented May 10, 2021

I think I have an avenue for some improvements (Page 21):
http://www.cs.uu.nl/docs/vakken/magr/2016-2017/slides/lecture%2003%20-%20the%20perfect%20BVH.pdf

@GPSnoopy
Owner

Split box-quads into smaller polygons?

@Azralee
Contributor Author

Azralee commented May 10, 2021

That was my idea.

@Azralee
Contributor Author

Azralee commented May 10, 2021

I'm currently implementing a (very) primitive function that recursively splits every triangle given to it in half.

@Azralee
Contributor Author

Azralee commented May 10, 2021

This was a dead end, unfortunately. I only got performance degradation using this "tessellation" function:

```cpp
// Recursively splits every triangle along its longest edge, doubling the
// triangle count per recursion level. Only the position of the new vertex is
// interpolated; the other attributes are copied from one of the edge endpoints.
std::function<void(std::vector<Vertex>&, std::vector<uint32_t>&, int)> divideTriangles =
	[&](std::vector<Vertex>& vertices, std::vector<uint32_t>& indices, int depth)
{
	if (depth <= 0) return;
	for (size_t i = 0; i < indices.size(); i += 6)
	{
		// Find the longest edge of the triangle.
		const std::array<std::pair<uint32_t, uint32_t>, 3> edges{ {{0, 1}, {0, 2}, {2, 1}} };
		double length = 0.0;
		size_t pair = 0;
		for (size_t j = 0; j < 3; j++)
		{
			const auto edge = edges.at(j);
			const vec3 pos1 = vertices.at(indices.at(i + edge.first)).Position;
			const vec3 pos2 = vertices.at(indices.at(i + edge.second)).Position;
			const auto dist = distance(pos1, pos2);
			if (length < dist)
			{
				length = dist;
				pair = j;
			}
		}

		// Insert a new vertex at the midpoint of the longest edge.
		const auto edge = edges.at(pair);
		const vec3 pos1 = vertices.at(indices.at(i + edge.first)).Position;
		const vec3 pos2 = vertices.at(indices.at(i + edge.second)).Position;
		Vertex newVertex = vertices.at(indices.at(i + edge.second));
		newVertex.Position += (pos1 - pos2) / 2.0; // midpoint of the edge
		const auto vertexIndex = static_cast<uint32_t>(vertices.size());
		vertices.push_back(newVertex);

		// Build the two triangles replacing the original one.
		std::vector<uint32_t> newIndices;
		switch (pair) {
		case 0:
			newIndices = { indices.at(i), indices.at(i + 2), vertexIndex, indices.at(i + 1), indices.at(i + 2), vertexIndex };
			break;
		case 1:
			newIndices = { indices.at(i), indices.at(i + 1), vertexIndex, indices.at(i + 1), indices.at(i + 2), vertexIndex };
			break;
		case 2:
			newIndices = { indices.at(i), indices.at(i + 1), vertexIndex, indices.at(i), indices.at(i + 2), vertexIndex };
			break;
		}

		// Overwrite the original triangle with the first half and insert the
		// second half right after it. The inserted triangle is skipped by the
		// i += 6 stride, so each original triangle is only split once per pass.
		for (size_t j = 0; j < 3; j++)
		{
			indices.at(i + j) = newIndices.at(j);
		}
		indices.insert(indices.begin() + i + 3, newIndices.begin() + 3, newIndices.end());
	}
	divideTriangles(vertices, indices, depth - 1);
};
```

@Azralee
Contributor Author

Azralee commented May 10, 2021

On the plus side, I now have an excellent tool to test the importance of the L3 cache for RT on RDNA2.

@GPSnoopy
Owner

Not too surprising. Internally you would expect the drivers to do this if it was beneficial.

@Azralee
Contributor Author

Azralee commented May 10, 2021

It was an experiment. I was fascinated by how the performance dropped when increasing the "tessellation factor". Scene 5 was completely unaffected (probably due to the complex Lucy model), while Scene 4 dropped by ~30% when I increased the number of polygons to 64x.

@Azralee
Contributor Author

Azralee commented May 11, 2021

I would love to see how Ampere would fare with this. The only thing that should change with the increased number of triangles is the depth and size of the BVH; the number of Ray-Triangle intersections should stay exactly the same.

@darksylinc

darksylinc commented Jul 12, 2021

I'm just passing through!

Unfortunately there is currently no way to indicate a preference for wave32, outside the compiler heuristics.

Does VK_EXT_subgroup_size_control work for ray tracing PSOs? That extension lets you explicitly control wave32/wave64 mode.
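For reference, if the driver reports the ray tracing stages in `requiredSubgroupSizeStages`, the subgroup size could in principle be pinned per stage like this (a sketch against the EXT-suffixed names; whether the RDNA2 Vulkan driver honours it for ray tracing pipelines is exactly the open question):

```cpp
// Sketch: request wave32 for a ray generation stage via
// VK_EXT_subgroup_size_control. Only valid if
// VkPhysicalDeviceSubgroupSizeControlPropertiesEXT::requiredSubgroupSizeStages
// includes VK_SHADER_STAGE_RAYGEN_BIT_KHR on this device.
VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroupSizeInfo = {};
subgroupSizeInfo.sType =
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
subgroupSizeInfo.requiredSubgroupSize = 32; // force wave32

VkPipelineShaderStageCreateInfo stageInfo = {};
stageInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
stageInfo.pNext = &subgroupSizeInfo;
stageInfo.stage = VK_SHADER_STAGE_RAYGEN_BIT_KHR;
// stageInfo.module / stageInfo.pName are filled in as usual before
// creating the ray tracing pipeline.
```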

@unphased

Also passing through. I'm currently researching the relative performance between Vulkan Ray Tracing and DXR. Do you have an impression of this? Has it more or less reached parity? I'm only interested in open APIs like Vulkan; however, it would be good to be aware of any shortcomings (if any) it might currently have compared to DX12.

Also hoping to see Vulkan Ray Tracing supported efficiently on Metal via MoltenVK. I wonder how far off that'll be.
