Performance investigations on RDNA2 cards. #28
Comments
This is way more complicated than the original comment stated. I need to study this more. |
Yes, I think it might be specific to AMD, as they use the normal shader resources for BVH traversals. Register usage might be different on other cards. Last night I tried NVIDIA Nsight Graphics 2021.1.0 to profile the Vulkan ray tracing shaders in Scene 1. It hinted that the main bottlenecks were L1TEX (I assume the L1 cache) and MIO (memory IO), but the feature is in beta and light on information. It does not yet seem to correlate with the source code (or even the SPIR-V bytecode) in all but the simplest cases, so it's hard to identify where the bottlenecks actually are in the shaders. |
It could probably be faster to trace a single sample per shader and dispatch a bunch of samples per pixel at the same time, no idea if it's possible though. |
The bad performance in the Cornell box scenes makes a lot of sense, since only very few triangles (36 if I count correctly) make up the scene. |
I did some more testing and found a way to force wave32 execution on the RT shaders on RDNA2. This improved performance by ~6%. Moving line 101 directly behind the load in line 85 reduced the number of vector registers used and increased performance further:
Improvements over base:
Unfortunately there is currently no way to indicate a preference for wave32, outside the compiler heuristics. I will open a pull request later. |
This patch improves performance on RDNA2 cards by 19%, as documented in issue GPSnoopy#28. It does this by switching the execution mode to wave32 (via the change in line 27) and by reducing the number of vector registers used in wave32 mode to increase occupancy (via the imageStore in line 87).
I recently found AMD's video on GPUOpen that covers some performance tips for DXR 1.1. At least in DirectX 12, AMD recommends moving TraceRay to the compute queue and dispatching in 8x4 tiles. This dispatch size appears to line up with your findings, namely the optimality of wave32 and the LDS pressure. |
It does feel like it's on AMD to do these optimisations automatically in their JIT compiler. Keep in mind that NVIDIA VK ray tracing performance pretty much doubled since it was introduced two years ago, purely thanks to driver improvements. I'm hoping AMD can address some of the low-hanging fruit relatively quickly. |
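To make the tile arithmetic concrete: an 8x4 tile holds 32 threads, which is exactly one wave32 wavefront. The following is a minimal sketch of how a dispatch grid could be sized for such tiles. The names `DispatchSize`, `CeilDiv` and `TilesFor` are hypothetical and not taken from the project or from AMD's material.

```cpp
#include <cstdint>

// Hypothetical helper types and names, for illustration only.
struct DispatchSize
{
    std::uint32_t x;
    std::uint32_t y;
};

// Tile dimensions from the recommendation quoted above: 8 * 4 = 32 threads.
constexpr std::uint32_t kTileWidth = 8;
constexpr std::uint32_t kTileHeight = 4;

// Integer division rounded up, so partially covered tiles are still dispatched.
constexpr std::uint32_t CeilDiv(std::uint32_t value, std::uint32_t divisor)
{
    return (value + divisor - 1) / divisor;
}

// Number of 8x4 tiles needed to cover an image of the given size.
constexpr DispatchSize TilesFor(std::uint32_t width, std::uint32_t height)
{
    return { CeilDiv(width, kTileWidth), CeilDiv(height, kTileHeight) };
}
```

Image sizes that are not a multiple of the tile size produce a partially filled last row or column of tiles, so the shader would still need a bounds check.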
Well the newest driver automatically defaults to wave32, so that's good. Makes the first part of my pull request unnecessary. |
The second part also does nothing anymore, so they are indeed working on it. |
I think I have an avenue for some improvements (Page 21): |
Split box-quads into smaller polygons? |
That was my idea. |
I'm currently implementing a (very) primitive function that splits all triangles given to it in half. It does this recursively. |
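The function itself is not shown in the thread. As a rough sketch of what such a recursive halving could look like, the version below cuts each triangle at the midpoint of its longest edge and repeats this `depth` times, so the triangle count grows by a factor of 2^depth (depth 6 gives 64x). The types `Vec3` and `Triangle`, the function names, and the choice of the longest edge are all assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical minimal types; the project's real vertex format is richer.
struct Vec3 { float x, y, z; };
using Triangle = std::array<Vec3, 3>;

Vec3 Midpoint(const Vec3& a, const Vec3& b)
{
    return { 0.5f * (a.x + b.x), 0.5f * (a.y + b.y), 0.5f * (a.z + b.z) };
}

float LengthSquared(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Splits one triangle into two by cutting its longest edge at the midpoint.
// Vertex order is rotated, never flipped, so the winding is preserved.
void SplitInHalf(const Triangle& t, Triangle& first, Triangle& second)
{
    std::size_t longest = 0;
    float best = -1.0f;
    for (std::size_t i = 0; i < 3; ++i)
    {
        const float len = LengthSquared(t[i], t[(i + 1) % 3]);
        if (len > best) { best = len; longest = i; }
    }

    const Vec3 a = t[longest];
    const Vec3 b = t[(longest + 1) % 3];
    const Vec3 c = t[(longest + 2) % 3];
    const Vec3 m = Midpoint(a, b);

    first = Triangle{ a, m, c };
    second = Triangle{ m, b, c };
}

// Halves every triangle, then recurses; the output has 2^depth times as many triangles.
std::vector<Triangle> Subdivide(const std::vector<Triangle>& triangles, int depth)
{
    if (depth <= 0)
    {
        return triangles;
    }

    std::vector<Triangle> result;
    result.reserve(triangles.size() * 2);
    for (const Triangle& t : triangles)
    {
        Triangle first, second;
        SplitInHalf(t, first, second);
        result.push_back(first);
        result.push_back(second);
    }
    return Subdivide(result, depth - 1);
}
```

Because the geometry covered stays the same, a sketch like this only changes how many primitives the BVH has to hold, not the surface the rays see.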
Unfortunately, this was a dead end. I only got performance degradation using this "tessellation" function:
|
On the plus side I now have an excellent tool to test the importance of the L3 cache for RT on RDNA2. |
Not too surprising. Internally you would expect the drivers to do this if it was beneficial. |
It was an experiment. I was fascinated by how the performance dropped as I increased the "tessellation factor". Scene 5 was completely unaffected (probably due to the complex Lucy model), and Scene 4 dropped by ~30% when I increased the number of polygons to 64x. |
I would love to see how Ampere would fare with this. The only thing that should change with the increased number of triangles is the depth and size of the BVH. The number of ray-triangle intersections should stay exactly the same. |
I'm just passing through!
Does VK_EXT_subgroup_size_control work for RayTracing PSO? That extension lets you explicitly control wave32/64 mode. |
Also passing through. I'm currently researching the relative performance of Vulkan Ray Tracing and DXR. Do you guys have an impression of this? Has it more or less reached parity? I'm only interested in open APIs like Vulkan, but it would be good to be aware of any shortcomings it might currently have compared to DX12. Also hoping to see Vulkan Ray Tracing supported efficiently on Metal via MoltenVK. Wonder how far off that'll be. |
I ran the raytracer through AMD's GPU profiler to check how it runs on my RX 6800. (I will upload the results in a different issue.)
It reported that the RT shader is limited to half occupancy by its LDS usage of 4 KB, and that it uses 80 vector registers.
Decreasing LDS usage to 3072 B could increase the occupancy to up to 12 parallel wavefronts (warps on NVIDIA hardware), which is the maximum for 80 vector registers. This should improve performance, as less time is spent idle.
Reducing LDS usage to 2048 B would allow further optimizations of VGPR (Vector General Purpose Register) usage. Reducing VGPRs to 64 would allow full occupancy and presumably maximum performance.
Edit: I believe LDS is AMD's name for workgroup memory.
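The register side of this reasoning can be checked with a small calculation. The sketch below assumes 16 wavefront slots per SIMD and a budget of 1024 VGPRs per lane in wave32 mode. Both figures are assumptions chosen because they reproduce the numbers quoted above (12 wavefronts at 80 VGPRs, full occupancy at 64), and they are not stated in the profiler output. The LDS limit and register allocation granularity are deliberately not modelled, and the function name is hypothetical.

```cpp
// Assumed hardware limits (not taken from the profiler report):
// 16 wavefront slots per SIMD, 1024 VGPRs per lane in wave32 mode.
constexpr int kMaxWavesPerSimd = 16;
constexpr int kVgprBudgetPerLane = 1024;

// Upper bound on resident wavefronts per SIMD imposed by VGPR usage alone.
// Ignores LDS limits and register allocation granularity.
constexpr int MaxWavesForVgprs(int vgprsPerThread)
{
    return (kVgprBudgetPerLane / vgprsPerThread) < kMaxWavesPerSimd
        ? (kVgprBudgetPerLane / vgprsPerThread)
        : kMaxWavesPerSimd;
}
```

Under these assumptions, 80 VGPRs gives floor(1024 / 80) = 12 wavefronts, and 64 VGPRs gives 1024 / 64 = 16, which is the cap.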