Skip to content

fix: Prevent client destruction race in services.call_service (backport #1255)#1269

Merged
bjsowa merged 2 commits into
kiltedfrom
mergify/bp/kilted/pr-1255
May 27, 2026
Merged

fix: Prevent client destruction race in services.call_service (backport #1255)#1269
bjsowa merged 2 commits into
kiltedfrom
mergify/bp/kilted/pr-1255

Conversation

@mergify

@mergify mergify Bot commented May 27, 2026

Copy link
Copy Markdown

Summary

services.call_service and AdvertisedActionHandler both create rclpy entities (rclpy.Client and rclpy.action.ActionServer, respectively) on the rosbridge node directly from a worker thread, while the rclpy executor is concurrently spinning that same node. The executor's wait-set rebuild can race with the worker thread's entity registration / destruction.

The symptom varies by entity:

  • Service clients silently lose their backing handle: the executor ends up holding a Client whose PyCapsule has been freed, and the next rclpy binding call on the affected node raises TypeError: Object of type 'NoneType' is not an instance of 'capsule'. After this trips once, every subsequent call_service on the affected websocket returns result: false and the connection is unrecoverable until the rosbridge process restarts.

  • Action servers crash harder: the worker thread reaches _rclpy.ActionServer(...) in rclpy/action/server.py:__init__ while the executor is mid-wait-set-rebuild and the process SIGSEGVs. This is the actual cause of the test_capabilities hang on Rolling/Lyrical CI — pytest's xunit fires after the segfault, but ctest waits the full 60 s before declaring the test timed out, masking the real signal in run logs.

Approach

Same idea as the IncomingQueue race fixed in #1183: route the lifecycle calls through the executor via executor.create_task(...) so they serialize with the executor's own work instead of competing with it.

Extract the executor-routing pattern into rosbridge_library.internal.executor_helpers — one module, two helpers:

Helper Returns Used for
run_on_executor(node, fn) fn()'s return value Entity creation, where the caller needs the new handle before continuing
schedule_on_executor(node, fn) nothing (fire-and-forget) Entity destruction, where the caller just needs the teardown to happen on the right thread

Both fall back to inline invocation when the node has no attached executor, which keeps unit-test setups (no executor) working unchanged.

What changed

  • services.call_service now routes create_client through run_on_executor (it needs the resulting Client before it can call wait_for_service / call_async). The three destroy_client sites — wait-for-service failure, response timeout, and the successful path — go through schedule_on_executor.
  • AdvertisedActionHandler.__init__ routes ActionServer(...) through run_on_executor. (ActionServer destruction was already routed through the executor in feat: Improve action unadvertising #1248, so only the creation side needed the fix.)
  • ActionTester in test/internal/actions/test_actions.py builds its own ActionServer from the test thread while the fixture's executor is spinning. The same SIGSEGV reproduced there, so its ActionServer(...) call now goes through run_on_executor too.
  • Scope is intentionally limited to the create/destroy lifecycle. The read-only setup calls in call_service (resolve_service_name, get_service_names_and_types) stay on the worker thread.

Regression test

rosbridge_library/test/internal/services/test_stress_service_clients.py exercises the race window:

  • test_many_parallel_service_calls — 40 ServiceCallers in flight at once against the same service.
  • test_barrier_synchronized_service_calls — 20 callers start at the same instant via a threading.Barrier.
  • test_repeated_rapid_service_calls — 100 sequential calls with no settle time between iterations.
  • test_interleaved_waves_of_service_calls — each wave of callers starts while the previous wave is still tearing down its clients.

After each scenario, the test asserts the executor's spin thread is still alive, has logged no errors, and can still process newly submitted tasks. The action-server race is covered by the existing test_action_capabilities and test_actions tests, which previously SIGSEGV'd and now pass.

Back-ports

services.py and advertise_action.py are identical on jazzy and humble; sibling PRs target those branches. Happy to defer those to mergify if preferred.


This is an automatic backport of pull request #1255 done by Mergify.

…lities (#1255)

Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler`
create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`,
respectively) on the rosbridge node directly from a worker thread.  The same
node is concurrently spun by the rclpy executor, so the executor's wait-set
rebuild can race with the worker thread's entity registration / destruction
and either crash inside `_rclpy` (action servers SIGSEGV in
`rclpy/action/server.py:__init__`) or leave the executor holding a handle
whose underlying PyCapsule has been freed (services surface this later as
`TypeError: Object of type 'NoneType' is not an instance of 'capsule'`).

This is the service- and action-capability analogue of the IncomingQueue
race fixed in #1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own work
instead of competing with it.

- Extract the executor-routing pattern into
  `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor`
  (synchronous; returns the value, used when the caller needs the entity
  back) and `schedule_on_executor` (fire-and-forget, used for destructors).
  Both fall back to inline invocation when the node has no attached
  executor, preserving unit-test behaviour.
- Apply `run_on_executor` to client creation in `services.call_service` and
  ActionServer creation in `AdvertisedActionHandler.__init__`. The three
  `destroy_client` sites in `services.call_service` now use
  `schedule_on_executor`. ActionServer destruction was already routed
  through the executor in #1248.
- `ActionTester` in `test_actions.py` constructs its own ActionServer from
  the test thread while the fixture's executor is already spinning. Route
  that creation through `run_on_executor` for the same reason; without it
  the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy.

`test_stress_service_clients.py` exercises the service race window
(parallel callers, barrier-synchronized starts, rapid sequential cycles,
and interleaved waves of concurrent callers) and asserts the executor
remains healthy and responsive afterwards.

Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
(cherry picked from commit 06bf324)

# Conflicts:
#	rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
#	rosbridge_library/test/internal/actions/test_actions.py
@mergify mergify Bot added the conflicts label May 27, 2026
@mergify

mergify Bot commented May 27, 2026

Copy link
Copy Markdown
Author

Cherry-pick of 06bf324 has failed:

On branch mergify/bp/kilted/pr-1255
Your branch is up to date with 'origin/kilted'.

You are currently cherry-picking commit 06bf324.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   rosbridge_library/src/rosbridge_library/internal/executor_helpers.py
	modified:   rosbridge_library/src/rosbridge_library/internal/services.py
	modified:   rosbridge_library/test/capabilities/test_action_capabilities.py
	new file:   rosbridge_library/test/internal/services/test_stress_service_clients.py

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
	both modified:   rosbridge_library/test/internal/actions/test_actions.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@bjsowa bjsowa added kilted and removed conflicts labels May 27, 2026
@bjsowa bjsowa merged commit 7631e3f into kilted May 27, 2026
5 checks passed
@bjsowa bjsowa deleted the mergify/bp/kilted/pr-1255 branch May 27, 2026 20:19
@bjsowa

bjsowa commented May 27, 2026

Copy link
Copy Markdown
Member

@Mergifyio backport jazzy

@mergify

mergify Bot commented May 27, 2026

Copy link
Copy Markdown
Author

backport jazzy

✅ Backports have been created

Details

Cherry-pick of 7631e3f has failed:

On branch mergify/bp/jazzy/pr-1269
Your branch is up to date with 'origin/jazzy'.

You are currently cherry-picking commit 7631e3f.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   rosbridge_library/src/rosbridge_library/internal/executor_helpers.py
	modified:   rosbridge_library/src/rosbridge_library/internal/services.py
	modified:   rosbridge_library/test/capabilities/test_action_capabilities.py
	new file:   rosbridge_library/test/internal/services/test_stress_service_clients.py

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
	both modified:   rosbridge_library/test/internal/actions/test_actions.py

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

bjsowa added a commit that referenced this pull request May 27, 2026
#1255) (#1270)

* fix: Prevent client destruction race in services.call_service (backport #1255) (#1269)

* fix: prevent rclpy entity-lifecycle race in service and action capabilities (#1255)

Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler`
create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`,
respectively) on the rosbridge node directly from a worker thread.  The same
node is concurrently spun by the rclpy executor, so the executor's wait-set
rebuild can race with the worker thread's entity registration / destruction
and either crash inside `_rclpy` (action servers SIGSEGV in
`rclpy/action/server.py:__init__`) or leave the executor holding a handle
whose underlying PyCapsule has been freed (services surface this later as
`TypeError: Object of type 'NoneType' is not an instance of 'capsule'`).

This is the service- and action-capability analogue of the IncomingQueue
race fixed in #1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own work
instead of competing with it.

- Extract the executor-routing pattern into
  `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor`
  (synchronous; returns the value, used when the caller needs the entity
  back) and `schedule_on_executor` (fire-and-forget, used for destructors).
  Both fall back to inline invocation when the node has no attached
  executor, preserving unit-test behaviour.
- Apply `run_on_executor` to client creation in `services.call_service` and
  ActionServer creation in `AdvertisedActionHandler.__init__`. The three
  `destroy_client` sites in `services.call_service` now use
  `schedule_on_executor`. ActionServer destruction was already routed
  through the executor in #1248.
- `ActionTester` in `test_actions.py` constructs its own ActionServer from
  the test thread while the fixture's executor is already spinning. Route
  that creation through `run_on_executor` for the same reason; without it
  the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy.

`test_stress_service_clients.py` exercises the service race window
(parallel callers, barrier-synchronized starts, rapid sequential cycles,
and interleaved waves of concurrent callers) and asserts the executor
remains healthy and responsive afterwards.

Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
(cherry picked from commit 06bf324)

# Conflicts:
#	rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
#	rosbridge_library/test/internal/actions/test_actions.py

* Fix conflicts

---------

Co-authored-by: Joshua Whitley <josh@electrifiedautonomy.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
(cherry picked from commit 7631e3f)

# Conflicts:
#	rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
#	rosbridge_library/test/internal/actions/test_actions.py

* Fix conflicts

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
bjsowa added a commit that referenced this pull request May 27, 2026
#1255) (#1272)

* fix: Prevent client destruction race in services.call_service (backport #1255) (#1269)

* fix: prevent rclpy entity-lifecycle race in service and action capabilities (#1255)

Both `services.call_service` and `advertise_action`'s `AdvertisedActionHandler`
create rclpy entities (`rclpy.Client` and `rclpy.action.ActionServer`,
respectively) on the rosbridge node directly from a worker thread.  The same
node is concurrently spun by the rclpy executor, so the executor's wait-set
rebuild can race with the worker thread's entity registration / destruction
and either crash inside `_rclpy` (action servers SIGSEGV in
`rclpy/action/server.py:__init__`) or leave the executor holding a handle
whose underlying PyCapsule has been freed (services surface this later as
`TypeError: Object of type 'NoneType' is not an instance of 'capsule'`).

This is the service- and action-capability analogue of the IncomingQueue
race fixed in #1183: route the lifecycle calls through the executor via
`executor.create_task(...)` so they serialize with the executor's own work
instead of competing with it.

- Extract the executor-routing pattern into
  `rosbridge_library.internal.executor_helpers`, exposing `run_on_executor`
  (synchronous; returns the value, used when the caller needs the entity
  back) and `schedule_on_executor` (fire-and-forget, used for destructors).
  Both fall back to inline invocation when the node has no attached
  executor, preserving unit-test behaviour.
- Apply `run_on_executor` to client creation in `services.call_service` and
  ActionServer creation in `AdvertisedActionHandler.__init__`. The three
  `destroy_client` sites in `services.call_service` now use
  `schedule_on_executor`. ActionServer destruction was already routed
  through the executor in #1248.
- `ActionTester` in `test_actions.py` constructs its own ActionServer from
  the test thread while the fixture's executor is already spinning. Route
  that creation through `run_on_executor` for the same reason; without it
  the rclpy SIGSEGV reproduces on a workspace with up-to-date rclpy.

`test_stress_service_clients.py` exercises the service race window
(parallel callers, barrier-synchronized starts, rapid sequential cycles,
and interleaved waves of concurrent callers) and asserts the executor
remains healthy and responsive afterwards.


(cherry picked from commit 06bf324)

# Conflicts:
#	rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
#	rosbridge_library/test/internal/actions/test_actions.py

* Fix conflicts

---------



(cherry picked from commit 7631e3f)

# Conflicts:
#	rosbridge_library/src/rosbridge_library/capabilities/advertise_action.py
#	rosbridge_library/test/internal/actions/test_actions.py

* Fix conflicts

---------



(cherry picked from commit f5202d9)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Błażej Sowa <bsowa123@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants