Generic "parent device lost" error/crash after #6092 #6279
Comments
I think that's related to which is about to get fixed by However, after the above PR this is still a validation error, which it almost certainly shouldn't be; see this comment thread here: #6253 (comment). Anyway, I'm not certain all of this is the case, and device lost itself shouldn't happen in the first place, so if you have more information about your system and an as-minimal-as-reasonably-possible repro, that would be great! EDIT: Didn't pay enough attention to the fact that this is on wgpu-master only and dismissed the bisect result too quickly. That's quite curious, maybe some object lifetime got messed up 🤔. cc: @teoxoy |
Thanks for the quick reply! I don't understand the internals here, so it's quite possible it's related to the linked issues, but it's not at startup, so I think it could be different. I didn't do a good job describing some other symptoms:
So it seems like something race-y, perhaps. Will try a single-threaded setup at some point. Otherwise, some more specs: 4070 GPU (on Optimus or whatever it's called now); haven't tried other GPUs yet. |
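A minimal sketch of what a "single-threaded setup" could look like: worker threads hand their work to one thread over a channel, and only that thread ever submits. The `Submission` type and `run_serialized` function are hypothetical and stand in for real command buffers and a real queue; nothing here is wgpu API.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical unit of work; in a real app this would wrap command buffers.
struct Submission {
    id: u32,
}

/// Spawns `producers` threads that each send `per_producer` submissions,
/// and returns the ids in the order the single consumer received them.
fn run_serialized(producers: u32, per_producer: u32) -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<Submission>();

    let handles: Vec<_> = (0..producers)
        .map(|p| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..per_producer {
                    // Producers never touch the queue; they only send work.
                    tx.send(Submission { id: p * per_producer + i }).unwrap();
                }
            })
        })
        .collect();

    // Drop the original sender so the receive loop ends once all producers finish.
    drop(tx);

    // Only this thread "submits", so submissions are strictly serialized.
    let mut submitted = Vec::new();
    for submission in rx {
        submitted.push(submission.id);
    }

    for handle in handles {
        handle.join().unwrap();
    }
    submitted
}
```

If the crash disappears with a setup like this, that points toward a race in multi-threaded submission; if it persists, the cause is probably elsewhere.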
Do we have steps to reproduce this? |
I'm getting this also, but no idea how to repro. But I too have another thread submitting the queue occasionally. |
Is the exact error message, so it occurs on queue submit, not on surface configure. There's nothing on the stack trace, so dunno how to debug really. Sounds exactly like what's described above in @ArthurBrussee's comment. Randomly at runtime. wgpu 22.1.0 |
Hmm, having a validation error attributed to a device loss sounds wrong. That sounds like it should be classified as an internal error, rather than a validation error. That probably doesn't really matter WRT the root cause for the OP, though. We are aware of issues with multi-threaded command submission, and this is likely a symptom. Because this is likely due to raciness, it's hard to comment on how to reproduce it, though. 😅 |
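To illustrate the classification point, here is a small sketch that maps failure causes onto error categories, with a lost device reported as an internal error rather than a validation error. The category names follow the three WebGPU error filters (validation, out-of-memory, internal); the `SubmitFailure` enum and `categorize` function are hypothetical and do not reflect wgpu's actual error types.

```rust
/// Error categories, named after the WebGPU error filters.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorCategory {
    Validation,
    OutOfMemory,
    Internal,
}

/// Hypothetical failure causes a submit call might report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SubmitFailure {
    /// The caller passed something invalid (a usage error).
    InvalidCommandBuffer,
    /// An allocation failed.
    AllocationFailed,
    /// The device went away underneath us; not the caller's fault.
    ParentDeviceLost,
}

/// Maps a failure to the category it would be reported under.
fn categorize(failure: SubmitFailure) -> ErrorCategory {
    match failure {
        SubmitFailure::InvalidCommandBuffer => ErrorCategory::Validation,
        SubmitFailure::AllocationFailed => ErrorCategory::OutOfMemory,
        // A lost device says nothing about the validity of the caller's input,
        // so this sketch treats it as internal rather than validation.
        SubmitFailure::ParentDeviceLost => ErrorCategory::Internal,
    }
}
```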
I managed to capture one crash of mine with debug logs.
I'm not sure if this is the exact cause, but it could give some clues. I've recently updated my wgpu, and I've also recently made some changes to how I create bind groups (not every frame for these particular ones). Some of those might cause this to pop up for me now. |
Pretty sure I'm doing some things wrong as well. |
This might be related. I tried doing submits only from the main thread, and this error still keeps happening. My app is so complex that it's hard to extract a repro step :S |
This is a particularly frustrating one, because I don't even always get any validation errors. But every run of my game eventually reaches this error and panics. Wouldn't mind some pointers on how to approach a reproducible report when the submit error stack trace contains nothing. |
After seeing #6318 I hoped I could catch the error by testing using
I don't think this should be happening... |
@hakolao: That indicates that
Assuming the above is correct, we have two problems:
This seems like a hazard that would apply to more places than just this one; it's not obvious that we should let go of the device's snatch lock before we call Ugh! |
I bet we can artificially reproduce (2) by forcibly returning a HAL error of some kind, instead of performing the HAL call as normal. That would let us focus on resolving it. |
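The fault-injection idea could be sketched like this: a stand-in queue that can be told to fail its next submit with a device-lost error instead of doing the work. `FakeQueue`, `HalError`, and `fail_next_submit` are all hypothetical names for illustration, not wgpu or wgpu-hal API.

```rust
/// Hypothetical backend error, standing in for a real HAL error.
#[derive(Debug, PartialEq, Eq)]
enum HalError {
    DeviceLost,
}

/// Hypothetical stand-in for a backend queue.
struct FakeQueue {
    /// When set, the next submit fails instead of doing work.
    inject_failure: bool,
    /// Ids of submissions that actually went through.
    submitted: Vec<u32>,
}

impl FakeQueue {
    fn new() -> Self {
        FakeQueue { inject_failure: false, submitted: Vec::new() }
    }

    /// Arms the fault: exactly one upcoming submit will fail.
    fn fail_next_submit(&mut self) {
        self.inject_failure = true;
    }

    fn submit(&mut self, id: u32) -> Result<(), HalError> {
        if self.inject_failure {
            // Clear the flag so only this one call fails.
            self.inject_failure = false;
            return Err(HalError::DeviceLost);
        }
        self.submitted.push(id);
        Ok(())
    }
}
```

Driving the error path deterministically like this makes it possible to exercise the handling code without waiting for a real device loss.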
So, I suppose this #6229 causes the lock acquisition problems. Maybe I'm beginning to suspect that my issue could be driver induced. Because I can't find anything wrong, and no longer am seeing any validation errors either. Dunno though... |
@ErichDonGubler @teoxoy how likely is it that I am doing something wrong when getting device is lost? Or how likely is it that something is wrong on the wgpu side? I read here that for others this error relates to barriers, synchronization or memory, but these are areas that wgpu should handle automatically. Can't see any validation errors (fixed those) using I began getting these after starting to reuse already created bind groups (chunks that I render, and texture atlas). Nothing else comes to mind that would have changed. And because I've tested that this issue happens already as far back as wgpu 0.19 (didn't bother to go further back), I kinda want to suspect that it's my bug. I also updated my GPU drivers. I'll keep debugging, but this has been a difficult problem to investigate. |
You could try turning on these features: Lines 96 to 107 in c0fa1bc
and looking at the callstack to pinpoint the vulkan call that's returning the lost error. |
Thanks, I'll try those. I realized I do still have some remaining warnings that I had missed from vulkan validation, so gonna check those also. |
Could you share the vulkan validation errors? wgpu's validation should in principle catch issues earlier. |
|
This was an easy fix, and probably an easy one to validate for you as well.
I was passing a single sampler to the bind group, but had set count to 256 (that's how many textures I've got in an array). I had made a refactor to reuse samplers (instead of having one per image... silly, right...). But forgot to remove the count from the I could imagine something like this could potentially crash... Now I'm only seeing an occasional resize validation error, but other stuff is just warnings / perf things. Some of which are too much effort for no gain. |
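The kind of check that would catch this mismatch can be sketched as follows. The `count: Option<NonZeroU32>` field mirrors the shape of the `count` field on wgpu's bind group layout entry, but `LayoutEntry`, `CountMismatch`, and `check_binding_count` are hypothetical and are not wgpu's actual validation code.

```rust
use std::num::NonZeroU32;

/// Hypothetical, simplified layout entry. `count` is `None` for a
/// non-array binding and `Some(n)` for an array of `n` resources.
struct LayoutEntry {
    binding: u32,
    count: Option<NonZeroU32>,
}

/// Describes a disagreement between the layout and the supplied resources.
#[derive(Debug, PartialEq, Eq)]
struct CountMismatch {
    binding: u32,
    declared: u32,
    provided: usize,
}

/// Checks that the number of resources supplied for a binding matches
/// the count declared in the layout (a non-array binding counts as 1).
fn check_binding_count(entry: &LayoutEntry, provided: usize) -> Result<(), CountMismatch> {
    let declared = entry.count.map_or(1, NonZeroU32::get);
    if provided == declared as usize {
        Ok(())
    } else {
        Err(CountMismatch { binding: entry.binding, declared, provided })
    }
}
```

With a layout that declares 256 samplers and a bind group that supplies one, this returns an error up front instead of leaving the mismatch for the driver to trip over.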
From our maintainer meeting agenda today, we decided that this issue does not need to block the v23 release. We need to eventually fix validation to catch this sort of issue, but it doesn't prevent programs that are correct from running. |
Ruffle has started getting a bunch of reports of this during application startup from our users since the update to v23, and I'm fairly sure we're doing things correctly. No crazy sampler counts or other thread shenanigans, no relevant validation warnings that I can see. Is there anything else this could be? |
It seems not to happen on Metal. The first time I saw the error was after running a program for 10 minutes on Vulkan (NVIDIA GPU). |
@Dinnerbone is there a specific wgpu call that returns "Parent device is lost"? |
I got rid of my last such errors by switching bind groups that had texture arrays of fairly big quantities to use a texture atlas. I could never figure out the last but rare one, except that it stopped happening when I reduced the number of inputs in the bindgroup. |
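For readers unfamiliar with the atlas approach: instead of binding many textures, all tiles live in one texture and each tile is addressed by a UV rectangle. A minimal sketch, assuming a uniform grid of equally sized tiles; the function name and the `[u, v, width, height]` layout are assumptions for illustration.

```rust
/// Returns `[u, v, width, height]` in normalized texture coordinates for
/// the tile at `index` in a uniform grid, or `None` if the index is out
/// of range or the grid is empty.
fn atlas_uv(index: u32, tiles_per_row: u32, tiles_per_col: u32) -> Option<[f32; 4]> {
    let total = tiles_per_row.checked_mul(tiles_per_col)?;
    if total == 0 || index >= total {
        return None;
    }
    // Tiles are laid out row by row, left to right.
    let col = index % tiles_per_row;
    let row = index / tiles_per_row;
    let width = 1.0 / tiles_per_row as f32;
    let height = 1.0 / tiles_per_col as f32;
    Some([col as f32 * width, row as f32 * height, width, height])
}
```

For example, 256 textures fit a 16 by 16 grid, so the bind group needs a single texture and sampler rather than a 256-element array.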
@teoxoy I don't think we know (or at least I do not) since none of us have encountered the issue. It was reported by end users in ruffle-rs/ruffle#18690 and all the duplicate issues listed beneath that issue. |
We'll need to add better error reporting to verify, but looking at the code that fails, it's likely either wgpu instance creation, surface creation, or request_adapter_and_device. |
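One way to get that better error reporting is to tag each initialization step so that a generic message such as "Parent device is lost" arrives with the stage that produced it. This is a sketch only: `InitStage`, `InitError`, and `at_stage` are hypothetical names, and the stages simply follow the three candidates listed above.

```rust
use std::fmt;

/// The initialization steps named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InitStage {
    Instance,
    Surface,
    AdapterAndDevice,
}

/// An error annotated with the stage it came from.
#[derive(Debug, PartialEq, Eq)]
struct InitError {
    stage: InitStage,
    message: String,
}

impl fmt::Display for InitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let stage = match self.stage {
            InitStage::Instance => "instance creation",
            InitStage::Surface => "surface creation",
            InitStage::AdapterAndDevice => "adapter/device request",
        };
        write!(f, "{} failed: {}", stage, self.message)
    }
}

/// Wraps any displayable error with the stage in which it occurred.
fn at_stage<T, E: fmt::Display>(stage: InitStage, result: Result<T, E>) -> Result<T, InitError> {
    result.map_err(|e| InitError { stage, message: e.to_string() })
}
```

Wrapping each step in `at_stage(...)` and propagating with `?` would let user reports identify the failing call without a debugger.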
Oh it's init, that's very interesting |
I'm still debugging with Ruffle - none of us developers can reproduce it so it's just going by trickled in reports. From what I can tell, it's throwing in Edit: Looking at the source, I suspect it's now showing up due to #6119 (specifically e4c5b47) - likely Edit 2: Aaactually they mapped to the same errors before, crap. I don't know then. |
Sounds far-fetched, but the new errors you're seeing could actually be related to There's a new validation that causes a device lost error message on startup |
Description
After updating an egui + burn app to wgpu master, I observed random crashes with a generic "validation failed: parent device lost" error. Nothing in particular seems to cause the crash; even just drawing the app while firing off empty submit() calls seemed to crash.
After a bisection it seems to come down to this specific commit: ce9c9b7
It doesn't look that suspicious but does touch some raw pointers so idk... I definitely can't tell what's wrong anyway.
I can try to work on a smaller repro than "draw an egui app while another thread fires off submit()" but maybe this already gives enough hints.
Thanks for having a look!
Platform
Vulkan + Windows