app_rpt: Acquire blocklock mutex when hard hanging up link channel. #460
I needed to add Line 6852 in 3457244
If I remove it, this crash goes away and we get a different error/crash (same as case 2 in the issue comments):
Ton of errors to follow:
|
What's line 3091 in your channel.c? |
3091 looks like this:
|
OK, I think I've compiled with debug and better backtrace flags. Let's see if this is any more helpful: |
I had a theory that we are trying to hang up the channel twice -> once due to Commenting out Wish I understood what the return codes are used for - did a bit of digging and can't find any docs. |
So, changing this: Line 4163 in e4dee90
to Line 6848 in e4dee90
removing Again, I'm not sure if it's the "right way", but it doesn't crash :) |
This is the set of changes (probably clearer this way): |
I don't think To be clear, the channel may be locked by another thread, which seems probable if it's being serviced by another thread too. But no_lock is if the current thread has already locked the channel, which it doesn't appear we have.
|
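To illustrate the distinction drawn above, between a channel that another thread may have locked and a "no lock" style flag meaning the current thread already holds the lock, here is a minimal sketch using POSIX mutexes. Every name in it (`toy_link`, `toy_hard_hangup`, `already_locked`) is hypothetical and does not come from app_rpt or Asterisk. The `hung_up` guard is an assumption added so that a second hangup becomes a no-op; it is not confirmed to be what this PR does.

```c
#include <pthread.h>

/* Hypothetical stand-in for a link and its channel; NOT app_rpt's real struct. */
struct toy_link {
    pthread_mutex_t lock; /* protects the fields below */
    int hangups;          /* how many times the real hangup work ran */
    int hung_up;          /* non-zero once the link has been hung up */
};

/*
 * Hard hang up a link.
 * already_locked != 0 means the CALLING thread already holds l->lock,
 * so we must not lock again (a plain mutex would deadlock).
 * already_locked == 0 means we take the lock ourselves, waiting if
 * another thread currently holds it.
 */
static void toy_hard_hangup(struct toy_link *l, int already_locked)
{
    if (!already_locked) {
        pthread_mutex_lock(&l->lock);
    }
    if (!l->hung_up) {
        /* Assumption: hanging up twice is the bug, so do the work once. */
        l->hung_up = 1;
        l->hangups++;
    }
    if (!already_locked) {
        pthread_mutex_unlock(&l->lock);
    }
}
```

Passing a non-zero `already_locked` when the current thread does not actually hold the lock would leave the hangup unprotected, which is the misuse the comment above warns about.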
That didn't work either. #1: We have a channel hangup message, which caused #2 Is simply testing the channel with Is there a description of the expected return codes from rpt_exec (or any xxx_exec) somewhere? I've hunted a bit and don't see one. |
I think I found the description:
non-zero is calling a hangup. |
This is updated to what may be more appropriate (still has Essentially remove the Still not really sure why |
Returning 0 from a dialplan application ( A softhangup can be called from any thread to queue a hangup on a channel (typically owned by another thread). It will typically cause internal applications like The PBX thread normally owns a channel, in which case returning -1 would be sufficient to hang it up. But due to the whole keepalive thing where a separate thread is then responsible for the channel, that changes things there.
chan wouldn't be NULL (or it would be visible in the backtrace). But if chan itself is invalid for some reason, then that could indeed be an issue.
As I said above, 0 means continue and -1 means (failure). Failure can be handled by the TryExec application, but otherwise the dialplan will terminate and the channel is hung up. There used to be more return values like AST_PBX_KEEPALIVE, but those have mostly been removed. |
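The return-code convention and the soft hangup behaviour described in this comment can be modelled with a small toy program. This is only a sketch under stated assumptions: `toy_chan`, `toy_softhangup`, `toy_app`, and `toy_run_dialplan` are hypothetical names and none of this is Asterisk's pbx.c code. It shows three things: an application returning 0 lets execution continue, a non-zero return stops it, and a soft hangup only queues a request that the owning thread later notices and turns into a single real hangup.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy channel; hypothetical, NOT Asterisk's struct ast_channel. */
struct toy_chan {
    atomic_int soft_hangup; /* set by any thread to request a hangup */
    int apps_run;           /* applications executed so far */
    int hangups;            /* times the channel was actually hung up */
};

typedef int (*toy_app_fn)(struct toy_chan *chan);

/* Any thread may call this; it only queues the request. */
static void toy_softhangup(struct toy_chan *chan)
{
    atomic_store(&chan->soft_hangup, 1);
}

/* An application that returns 0 (continue) unless a hangup is pending. */
static int toy_app(struct toy_chan *chan)
{
    chan->apps_run++;
    return atomic_load(&chan->soft_hangup) ? -1 : 0;
}

/* An application that queues a hangup on its channel but still returns 0. */
static int toy_app_request_hangup(struct toy_chan *chan)
{
    chan->apps_run++;
    toy_softhangup(chan);
    return 0;
}

/*
 * Toy "PBX thread": runs applications in order, stops at the first
 * non-zero return, then hangs the channel up exactly once.
 */
static int toy_run_dialplan(struct toy_chan *chan, const toy_app_fn *apps, size_t n)
{
    int res = 0;
    for (size_t i = 0; i < n && res == 0; i++) {
        res = apps[i](chan);
    }
    chan->hangups++; /* the owner performs the one real hangup */
    return res;
}
```

In this model the hangup happens in exactly one place, the thread that owns the channel. The crash discussed in this thread concerns the case where a second thread is also responsible for the channel, which the toy deliberately leaves out.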
Is this the code that calls |
I'm not sure what the next step is on this one. While there are a couple of ways to "not crash", it sounds like they are not the right way to deal with the deadlock on the channel. |
Nope, Dial is another application, just like Rpt. The code would be part of pbx.c
Yeah, I think what would probably be most helpful, at least to me, is if there is a way I can reliably reproduce this myself. That was instrumental in getting the other change resolved as we could quickly test and iterate. If there's any way to do that here, that would be great. Unfortunately, I have to deal with moving some equipment, and I'm going to be out of town next week, so I might be a bit slow in being able to look into this. |
You need the server node and a client node: If you have another way to issue AMI commands, just use the commands from the script. Anything under .10 seconds typically will trigger the failed to send !NEWKEY1! error then soon after a 2 to 0 message. (and crash anywhere in between) I found you can't issue them through the command line and you can't create a macro with both connect/disconnect in it - neither are fast enough to get it to fail. Thinking about it a bit more -> maybe putting a sleep in the right spot in the I'll be here when you have something :) |
|
I don't know how I got that link... this looks more like the spot: If we exit non-zero, it looks like we fall down to the |
I really believe this is pointing to us using |
@InterLinked1 Any new thoughts on how we can fix this? |
Sorry, it's been a bit since I've looked at this. I think The best way would be if we could get a backtrace from that scenario. Is that something you can reproduce easily? We can force it to crash at that line if you add |
core-asterisk-2025-02-20T00-53-49Z-full.txt This is the core dump from adding I placed the assert(0) here: |
This looks suspicious to me, if it's crashing in timing.c. But apparently handle is NULL there and it shouldn't be. Just to confirm, do you have a timing module loaded? Weird crashes can occur in Asterisk if you don't have one loaded, and there isn't anything that enforces you have one loaded. |
I'm fairly sure it's loaded. I will recompile everything just to make sure I didn't cross pollinate one of the files or something odd like that. |
One of these, right? I didn't see a loading error in the log. |
Yeah, that should do it, interesting. Just wanted to make sure that wasn't it! Also, the line is the one containing the error above (Hard hangup) |
Resolves: #459