
Performance investigations on RDNA2 cards. #28

Open
Azralee opened this issue Feb 6, 2021 · 23 comments

@Azralee
Contributor

Azralee commented Feb 6, 2021

I ran the ray tracer through AMD's GPU profiler to check how it runs on my RX 6800. (I will upload the results in a different issue.)
It reported that the RT shader is limited to half occupancy by its LDS usage of 4 KB, and that it uses 80 vector registers.
Decreasing LDS usage to 3072 B could increase the occupancy to up to 12 parallel wavefronts (warps on NVIDIA hardware), the maximum for 80 vector registers. This should improve performance, as less time is spent idle.

Reducing LDS usage to 2048 B would allow further optimizations to VGPR (Vector General Purpose Register) usage. Reducing VGPRs to 64 would allow full occupancy and presumably maximum performance.

Edit: I believe LDS is AMD's name for Workgroup Memory.

@Azralee
Contributor Author

Azralee commented Feb 6, 2021

This is way more complicated than the original comment stated. I need to study this more.

@Azralee Azralee closed this as completed Feb 6, 2021
@GPSnoopy
Owner

GPSnoopy commented Feb 7, 2021

Yes, I think it might be specific to AMD, as they use the normal shader resources for BVH traversals. Register usage might be different on other cards.

I tried using NVIDIA Nsight Graphics 2021.1.0 last night to profile the Vulkan ray tracing shaders in Scene 1. It hinted that the main bottlenecks were L1TEX (I assume the L1 cache) and MIO (Memory IO), but the feature is in beta and light on information. It doesn't yet correlate with the source code (or even the SPIR-V bytecode) in all but the simplest cases, so it's hard to identify the actual location of the bottlenecks in the shaders.

@Azralee
Contributor Author

Azralee commented Feb 7, 2021

Yeah, I tried to use AMD's static analyzer (RGA) yesterday, but they have not released an update with the latest SPIR-V tools yet.

AMD's Vulkan traversal code seems under-optimized at the moment, especially in comparison with the DX12 one.
I have attached some traces. Metro Exodus seems to be the most optimized RT title on AMD that I have found so far. And even Control (which runs awfully) seems to use half the LDS of the Vulkan examples.

Trace from this project: (screenshot)
Trace from Quake2 RTX: (screenshot)
Trace from Metro Exodus (DX12): (screenshot)
Trace from Control (Corridor of DOOM, as Digital Foundry puts it) (RT passes: BVH, reflections, diffuse reflections, GI, contact shadows): (screenshot)

@Azralee
Contributor Author

Azralee commented Feb 9, 2021

I think I figured out the performance difference.
(heatmap screenshot: RT-Heatmap)
If you take a look at the heatmap, you can see that it is relatively coarse: specifically, it is made of batches of 8x8 pixels, i.e. 64 pixels. This means that the wavefront runs until the last ray has terminated. Combined with the currently relatively low occupancy, this could explain the odd results.

@Azralee Azralee reopened this Feb 9, 2021
@Azralee Azralee changed the title Performance improvements on RDNA2 cards. Performance differenceon RDNA2 cards. Feb 9, 2021
@Azralee Azralee changed the title Performance differenceon RDNA2 cards. Performance difference on RDNA2 cards. Feb 9, 2021
@Azralee
Contributor Author

Azralee commented Feb 9, 2021

It could probably be faster to trace a single sample per shader invocation and dispatch a bunch of samples per pixel at the same time; no idea if it's possible, though.

@Azralee
Contributor Author

Azralee commented Feb 9, 2021

This hypothesis seems convincing. I ran the ray tracer at 5120x2880 and at 1280x720:
(heatmap screenshots at 2880p and 720p)
The 64-pixel blocks are very visible in 720p and cover a lot of the central statue, while in 2880p the hot blocks are much less widely distributed. This looks much closer to the NVIDIA heatmap in README.md and corresponds more to the naively expected behaviour.

To test whether this improves performance, I ran the benchmark at different resolutions but identical ray counts (2880p has 16x the pixels of 720p), for 60 s per scene:

| Resolution \ Scene | Rays per frame | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- | --- |
| 720p, 16 samples | 14,745,600 | 70.02 | 69.37 | 30.60 | 55.71 | 20.23 |
| 2880p, 1 sample | 14,745,600 | 71.67 | 70.76 | 35.51 | 54.79 | 20.33 |
| Performance of higher res | 100% | 102.36% | 102.00% | 116.05% | 98.35% | 100.50% |

The performance is highly scene dependent, with most scenes being more or less within the margin of error. Scene 3 with the Lucy statues is improved by quite a lot, just as expected from the heatmaps.

@Azralee
Contributor Author

Azralee commented Feb 9, 2021

The bad performance in the Cornell box scenes makes a lot of sense, since only very few triangles (36, if I count correctly) make up the scene.
Ampere doubled Ray-Triangle intersection throughput, which is basically all that happens in the Cornell box. On real models, like Lucy, the BVH hierarchy should be much deeper. If we assume a BVH4 (as the RDNA2 ISA guide implies), then the 448K triangles of each Lucy statue should sit in a BVH more than 10 layers deep, while the Cornell box should fit in a BVH about 5 layers deep in total.
The Cornell box is effectively a micro-benchmark of Ray-Triangle performance, so it's no wonder that Ampere runs so fast here.

@Azralee Azralee changed the title Performance difference on RDNA2 cards. Performance investigations on RDNA2 cards. Mar 8, 2021
@Azralee
Contributor Author

Azralee commented Mar 8, 2021

I did some more testing and found a way to force wave32 execution for the RT shaders on RDNA2. This alone improved performance by ~6%. Moving line 101 directly behind the load in line 85 reduced the number of vector registers used and increased performance further:
FPS table:

| Execution mode | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- |
| wave32 optimized | 48.78 | 48.14 | 23.12 | 36.26 | 13.83 |
| wave32 | 43.67 | 43.09 | 20.13 | 33.09 | 12.04 |
| wave64 (base) | 40.99 | 40.66 | 18.94 | 31.59 | 11.29 |

Improvements over base:

| Execution mode | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 |
| --- | --- | --- | --- | --- | --- |
| wave32 optimized | 19.00% | 18.40% | 22.07% | 14.78% | 22.50% |
| wave32 | 6.54% | 5.98% | 6.28% | 4.75% | 6.64% |
| wave64 (base) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |

Unfortunately, there is currently no way to indicate a preference for wave32 outside of the compiler's heuristics.

I will open a pull request later.

Azralee pushed a commit to Azralee/RayTracingInVulkan that referenced this issue Mar 8, 2021
This patch improves performance on RDNA2 cards by 19%, as documented in issue GPSnoopy#28. It does this by switching the execution mode to wave32 (via the change in line 27) and reducing the amount of used vector registers in Wave32 mode to increase occupancy (via the imageStore in line 87).
@CasperTheCat

I recently found AMD's video on GPUOpen that covers some performance tips for DXR 1.1.
https://gpuopen.com/videos/amd-rdna2-directx-raytracing/

At least in DirectX12, AMD recommends moving TraceRay to the compute queue and dispatching in 8x4 tiles. This dispatch size appears to line up with your findings, namely the optimality of wave32 and the LDS pressure.

@GPSnoopy
Owner

It does feel like it's on AMD to do these optimisations automatically in their JIT compiler. Keep in mind that NVIDIA's Vulkan ray tracing performance pretty much doubled since it was introduced two years ago, purely thanks to driver improvements. I'm hoping AMD can address some of the low-hanging fruit relatively quickly.

@Azralee
Contributor Author

Azralee commented Mar 20, 2021

Well, the newest driver automatically defaults to wave32, so that's good. It makes the first part of my pull request unnecessary.

@Azralee
Contributor Author

Azralee commented Mar 20, 2021

The second part also does nothing anymore, so they are indeed working on it.

@Azralee
Contributor Author

Azralee commented May 10, 2021

I think I have an avenue for some improvements (Page 21):
http://www.cs.uu.nl/docs/vakken/magr/2016-2017/slides/lecture%2003%20-%20the%20perfect%20BVH.pdf

@GPSnoopy
Owner

Split box-quads into smaller polygons?

@Azralee
Contributor Author

Azralee commented May 10, 2021

That was my idea.

@Azralee
Contributor Author

Azralee commented May 10, 2021

I'm currently implementing a (very) primitive function that recursively splits every triangle given to it in half.

@Azralee
Contributor Author

Azralee commented May 10, 2021

This was a dead end, unfortunately. I only got performance degradation using this "tessellation" function:

```cpp
// Recursively splits every triangle along its longest edge, doubling the
// triangle count per recursion level. Only the position of the new vertex is
// interpolated; the other attributes are copied from one of the edge endpoints.
std::function<void(std::vector<Vertex>&, std::vector<uint32_t>&, int)> divideTriangles =
	[&](std::vector<Vertex>& vertices, std::vector<uint32_t>& indices, int depth)
{
	if (depth <= 0) return;
	for (size_t i = 0; i < indices.size(); i += 6)
	{
		// Find the longest edge of the triangle.
		const std::array<std::pair<uint32_t, uint32_t>, 3> edges{ {{0, 1}, {0, 2}, {2, 1}} };
		double length = 0.0;
		size_t pair = 0;
		for (size_t j = 0; j < 3; j++)
		{
			const auto edge = edges.at(j);
			const vec3 pos1 = vertices.at(indices.at(i + edge.first)).Position;
			const vec3 pos2 = vertices.at(indices.at(i + edge.second)).Position;
			const auto dist = distance(pos1, pos2);
			if (length < dist)
			{
				length = dist;
				pair = j;
			}
		}

		// Insert a new vertex at the midpoint of the longest edge.
		const auto edge = edges.at(pair);
		const vec3 pos1 = vertices.at(indices.at(i + edge.first)).Position;
		const vec3 pos2 = vertices.at(indices.at(i + edge.second)).Position;
		Vertex newVertex = vertices.at(indices.at(i + edge.second));
		newVertex.Position += (pos1 - pos2) / 2.0; // midpoint of the edge
		const auto vertexIndex = static_cast<uint32_t>(vertices.size());
		vertices.push_back(newVertex);

		// Build the two triangles replacing the original one.
		std::vector<uint32_t> newIndices;
		switch (pair) {
		case 0:
			newIndices = { indices.at(i), indices.at(i + 2), vertexIndex, indices.at(i + 1), indices.at(i + 2), vertexIndex };
			break;
		case 1:
			newIndices = { indices.at(i), indices.at(i + 1), vertexIndex, indices.at(i + 1), indices.at(i + 2), vertexIndex };
			break;
		case 2:
			newIndices = { indices.at(i), indices.at(i + 1), vertexIndex, indices.at(i), indices.at(i + 2), vertexIndex };
			break;
		}

		// Overwrite the original triangle with the first half and insert the
		// second half right after it. The inserted triangle is skipped by the
		// i += 6 stride, so each original triangle is only split once per pass.
		for (size_t j = 0; j < 3; j++)
		{
			indices.at(i + j) = newIndices.at(j);
		}
		indices.insert(indices.begin() + i + 3, newIndices.begin() + 3, newIndices.end());
	}
	divideTriangles(vertices, indices, depth - 1);
};
```

@Azralee
Contributor Author

Azralee commented May 10, 2021

On the plus side, I now have an excellent tool to test the importance of the L3 cache for RT on RDNA2.

@GPSnoopy
Owner

Not too surprising. Internally you would expect the drivers to do this if it was beneficial.

@Azralee
Contributor Author

Azralee commented May 10, 2021

It was an experiment. I was fascinated by how the performance dropped when increasing the "tessellation factor". Scene 5 was completely unaffected (probably due to the complex Lucy model), while Scene 4 dropped by ~30% when I increased the number of polygons to 64x.

@Azralee
Contributor Author

Azralee commented May 11, 2021

I would love to see how Ampere would fare with this. The only thing that should change with the increased number of triangles is the depth and size of the BVH; the number of Ray-Triangle intersections should stay exactly the same.

@darksylinc

darksylinc commented Jul 12, 2021

I'm just passing through!

Unfortunately there is currently no way to indicate a preference for wave32, outside the compiler heuristics.

Does VK_EXT_subgroup_size_control work for ray tracing PSOs? That extension lets you explicitly control wave32/wave64 mode.
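For reference, if the driver reports the ray tracing stages in `requiredSubgroupSizeStages`, the subgroup size could in principle be pinned per stage like this (a sketch against the EXT-suffixed names; whether the RDNA2 Vulkan driver honours it for ray tracing pipelines is exactly the open question):

```cpp
// Sketch: request wave32 for a ray generation stage via
// VK_EXT_subgroup_size_control. Only valid if
// VkPhysicalDeviceSubgroupSizeControlPropertiesEXT::requiredSubgroupSizeStages
// includes VK_SHADER_STAGE_RAYGEN_BIT_KHR on this device.
VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroupSizeInfo = {};
subgroupSizeInfo.sType =
    VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
subgroupSizeInfo.requiredSubgroupSize = 32; // force wave32

VkPipelineShaderStageCreateInfo stageInfo = {};
stageInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
stageInfo.pNext = &subgroupSizeInfo;
stageInfo.stage = VK_SHADER_STAGE_RAYGEN_BIT_KHR;
// stageInfo.module / stageInfo.pName are filled in as usual before
// creating the ray tracing pipeline.
```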

@unphased

Also passing through. I'm currently researching the relative performance between Vulkan Ray Tracing and DXR. Do you have an impression of this? Has it more or less reached parity? I'm only interested in open APIs like Vulkan; however, it would be good to be aware of any shortcomings (if any) it might currently have compared to DX12.

Also hoping to see Vulkan Ray Tracing supported efficiently on Metal via MoltenVK. I wonder how far off that'll be.
