
pm: device_runtime: Use its own queue #87496

Open. ceolin wants to merge 1 commit into main from pm/device-runtime/async

Conversation

@ceolin (Member) commented Mar 21, 2025

Device runtime is using the system workqueue to do operations that are mostly blocking (e.g. suspending a device). This should not happen.

This commit defines a dedicated workqueue for the device runtime async operations.

The test for this API was written assuming the system workqueue priority (which is cooperative), so we need to change the test priority to be lower than the device runtime workqueue priority.
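
For reference, setting up a dedicated workqueue in Zephyr generally looks like the sketch below. This is a minimal illustration, not the PR's actual code: the names, stack size, and priority are assumptions.

```c
#include <zephyr/kernel.h>

/* Hypothetical names and sizing -- illustrative, not the PR's actual values. */
#define PM_DEVICE_RUNTIME_WQ_STACK_SIZE 1024
#define PM_DEVICE_RUNTIME_WQ_PRIO       K_PRIO_COOP(1)

K_THREAD_STACK_DEFINE(pm_device_runtime_wq_stack, PM_DEVICE_RUNTIME_WQ_STACK_SIZE);

static struct k_work_q pm_device_runtime_wq;

static int pm_device_runtime_wq_init(void)
{
	k_work_queue_init(&pm_device_runtime_wq);
	k_work_queue_start(&pm_device_runtime_wq, pm_device_runtime_wq_stack,
			   K_THREAD_STACK_SIZEOF(pm_device_runtime_wq_stack),
			   PM_DEVICE_RUNTIME_WQ_PRIO, NULL);
	return 0;
}
```

Async operations would then be submitted with k_work_submit_to_queue() against this queue rather than k_work_submit(), which always targets the system workqueue.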

@ceolin (Member, Author) commented Mar 21, 2025

@bjarki-andreasen thanks for pointing this problem out

@ceolin force-pushed the pm/device-runtime/async branch from 2d0a7ec to 71ed1f3 on March 21, 2025 at 20:50
@bjarki-andreasen (Collaborator) left a comment

Really nice!

@JordanYates (Collaborator)

Should this be optional? Asynchronous PM is not necessarily used at all in an application, but this will incur the additional thread regardless. I'm really not a fan of these "one job" workqueues. What is the use case that triggered this change?

@bjarki-andreasen (Collaborator) commented Mar 22, 2025

> Should this be optional? Asynchronous PM is not necessarily used at all in an application, but this will incur the additional thread regardless. I'm really not a fan of these "one job" workqueues. What is the use case that triggered this change?

Async will become optional as well.

A use case that triggered this change is the modem subsystem, which is async and built around the system workqueue. A shared workqueue is not allowed to block, otherwise it can't be shared; work has to be async. pm_device_runtime_get() and put() are blocking actions, yet users often call these from system workqueue items, either directly from their app, or indirectly through pm_device_runtime_put_async(). This breaks drivers and subsystems which use the system workqueue internally, since its thread can't both be blocked waiting for the modem driver to suspend and do the work required to suspend the modem driver, so it deadlocks.

A design decision that triggered the change is simply that no blocking API may be called from a shared work queue, so pretty much all device driver APIs and pm actions are not allowed.
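
The deadlock pattern described above can be reduced to a short sketch (the device handle and names below are illustrative, not taken from the modem subsystem itself):

```c
#include <zephyr/kernel.h>
#include <zephyr/pm/device_runtime.h>

/* Illustrative device handle; in a real application this would come
 * from e.g. DEVICE_DT_GET(). */
static const struct device *modem_dev;

/* Runs on the system workqueue and blocks until the modem suspends. */
static void release_modem_handler(struct k_work *work)
{
	/* Blocks here. If the modem driver's suspend path needs another
	 * system workqueue item to finish, that item is queued behind
	 * this handler and never runs: deadlock. */
	(void)pm_device_runtime_put(modem_dev);
}
static K_WORK_DEFINE(release_modem_work, release_modem_handler);
```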

@JordanYates (Collaborator)

> A design decision that triggered the change is simply that no blocking API may be called from a shared work queue, so pretty much all device driver APIs and pm actions are not allowed.

Hard disagree on this one; Bluetooth tried to enforce such a black-and-white rule and it caused many problems, see #84282.
Please provide the RFC or architecture meeting notes for the decision, it is certainly not documented on https://docs.zephyrproject.org/latest/kernel/services/threads/workqueue.html#system-workqueue

@ceolin (Member, Author) left a comment

> Should this be optional? Asynchronous PM is not necessarily used at all in an application, but this will incur the additional thread regardless. I'm really not a fan of these "one job" workqueues. What is the use case that triggered this change?

I am not a fan either. The reason that triggered it is what @bjarki-andreasen explained. There is a resource penalty in having its own queue, but it makes this code less error prone. To minimize the drawbacks I plan to make it optional (async support), since a lot of applications may not need it.

@bjarki-andreasen (Collaborator)

> A design decision that triggered the change is simply that no blocking API may be called from a shared work queue, so pretty much all device driver APIs and pm actions are not allowed.

> Hard disagree on this one; Bluetooth tried to enforce such a black-and-white rule and it caused many problems, see #84282.
> Please provide the RFC or architecture meeting notes for the decision, it is certainly not documented on https://docs.zephyrproject.org/latest/kernel/services/threads/workqueue.html#system-workqueue

The system work queue is serviced by a single thread. It can only execute one work item at a time. If item "a" blocks until item "b" is executed, item "b" is never executed. Since any module can use the sys work queue internally, there is no guarantee that item "a" will not end up blocked by item "b".

If the above statement is correct, there is nothing to disagree with; it simply can't be used safely with blocking APIs.
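
To make the item "a" / item "b" scenario concrete, here is a minimal sketch (names invented for illustration) that wedges the system workqueue:

```c
#include <zephyr/kernel.h>

static K_SEM_DEFINE(done, 0, 1);

/* Item "b": would give the semaphore, but it never runs while "a"
 * occupies the single system workqueue thread. */
static void work_b_handler(struct k_work *work)
{
	k_sem_give(&done);
}
static K_WORK_DEFINE(work_b, work_b_handler);

/* Item "a": submits "b" to the same queue, then blocks waiting for it.
 * "b" can only run after "a" returns, so this never completes: deadlock. */
static void work_a_handler(struct k_work *work)
{
	k_work_submit(&work_b);
	k_sem_take(&done, K_FOREVER);
}
static K_WORK_DEFINE(work_a, work_a_handler);

int main(void)
{
	k_work_submit(&work_a); /* system workqueue thread is now stuck */
	return 0;
}
```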

I will create a PR to add this to the docs. We can discuss from there :)

@bjarki-andreasen (Collaborator) commented Mar 23, 2025

@JordanYates There is an entry in the docs here warning about the deadlock https://docs.zephyrproject.org/latest/kernel/services/threads/workqueue.html#work-item-lifecycle:

"A handler function can use any kernel API available to threads. However, operations that are potentially blocking (e.g. taking a semaphore) must be used with care, since the workqueue cannot process subsequent work items in its queue until the handler function finishes executing."

It's not that strongly worded; "must be used with care" is essentially what I'm referring to. Calling pm_device_action_run() is not using it with care, it's hoping the driver does not depend on a work item internally to perform the action...

A user has no way of knowing why calling a PM device API on a particular device suddenly makes seemingly random parts of the system stop responding; it's a particularly unintuitive behavior that is really hard to narrow down.

@bjarki-andreasen (Collaborator)

PR for documenting and asserting safe use of system work queue #87522

@nashif (Member) commented Mar 24, 2025

Can this be made a choice, i.e. a choice between using the system workq and a dedicated workq? That way, depending on the application and the resources available, you can use the existing system workq, or have one dedicated for PM if the system workq becomes crowded with blocking consumers and you can afford another thread.

@ceolin (Member, Author) commented Mar 24, 2025

> Can this be made a choice, i.e. a choice between using the system workq and a dedicated workq? That way, depending on the application and the resources available, you can use the existing system workq, or have one dedicated for PM if the system workq becomes crowded with blocking consumers and you can afford another thread.

It can be done. There are possible issues associated with this, but we can make it optional, with the default being its own queue.
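
For illustration, the selection could be a compile-time switch along these lines. CONFIG_PM_DEVICE_RUNTIME_DEDICATED_WQ is a hypothetical symbol invented for this sketch, not something this PR defines:

```c
#include <zephyr/kernel.h>

/* The dedicated queue, as sketched earlier. */
extern struct k_work_q pm_device_runtime_wq;

/* Hypothetical Kconfig symbol -- invented for this sketch. */
#ifdef CONFIG_PM_DEVICE_RUNTIME_DEDICATED_WQ
#define PM_DEVICE_RUNTIME_WQ (&pm_device_runtime_wq)
#else
/* Fall back to the system workqueue, accepting the deadlock caveats
 * discussed above. */
#define PM_DEVICE_RUNTIME_WQ (&k_sys_work_q)
#endif

static inline int pm_device_runtime_submit(struct k_work *work)
{
	return k_work_submit_to_queue(PM_DEVICE_RUNTIME_WQ, work);
}
```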

@bjarki-andreasen (Collaborator)

> Can this be made a choice, i.e. a choice between using the system workq and a dedicated workq? That way, depending on the application and the resources available, you can use the existing system workq, or have one dedicated for PM if the system workq becomes crowded with blocking consumers and you can afford another thread.

Not safely; it breaks drivers which use the system workqueue internally. It's not a question of "crowded", it's a question of deadlocking :)
