Yang Yang (2):
  PM: runtime: Fix I/O hang due to race between resume and runtime
    disable
  blk-mq: Fix I/O hang caused by incomplete device resume

 block/blk-pm.c               | 1 +
 drivers/base/power/runtime.c | 3 ++-
 include/linux/pm.h           | 1 +
 3 files changed, 4 insertions(+), 1 deletion(-)

--
2.34.1

On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote:
>
> Yang Yang (2):
>   PM: runtime: Fix I/O hang due to race between resume and runtime
>     disable
>   blk-mq: Fix I/O hang caused by incomplete device resume

This is a no-go as far as I'm concerned.

Please address the issue differently.

On 11/26/25 3:31 AM, Rafael J. Wysocki wrote:
> Please address the issue differently.

It seems unfortunate to me that __pm_runtime_barrier() can cause
pm_request_resume() to hang. Would it be safe to remove the
cancel_work_sync() call from __pm_runtime_barrier() since
pm_runtime_work() calls functions that check disable_depth
when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would
this be sufficient to fix the reported deadlock?

Thanks,

Bart.

On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 11/26/25 3:31 AM, Rafael J. Wysocki wrote:
> > Please address the issue differently.
>
> It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang.

I wouldn't call it a hang.

__pm_runtime_barrier() removes the work item queued by
pm_request_resume(), but at the time when it is called, which is
device_suspend_late(), the work item queued by pm_request_resume()
cannot make progress anyway. It will only be able to make progress
when the PM workqueue is unfrozen at the end of the system resume
transition.

> Would it be safe to remove the
> cancel_work_sync() call from __pm_runtime_barrier() since
> pm_runtime_work() calls functions that check disable_depth
> when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would
> this be sufficient to fix the reported deadlock?

If you want the resume work item to survive the system suspend/resume
cycle, __pm_runtime_disable() may be changed to make that happen, but
this still will not allow the work to make progress until the system
resume ends.

I'm not sure if this would help to address the issue at hand though.

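The ordering described in the reply above can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions: every toy_* name is hypothetical, nothing here is kernel code, and the model keeps just three facts from the message: pm_request_resume() only queues a work item, that item cannot run while the PM workqueue is frozen, and __pm_runtime_barrier() removes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model. These types are hypothetical, not the kernel's. */
enum toy_status { TOY_SUSPENDED, TOY_ACTIVE };
enum toy_request { TOY_REQ_NONE, TOY_REQ_RESUME };

struct toy_dev {
	enum toy_status status;
	enum toy_request pending;	/* stands in for the queued work item */
};

static bool toy_wq_frozen;		/* stands in for the frozen PM workqueue */

/* Models pm_request_resume(): it only queues a request. */
static void toy_request_resume(struct toy_dev *d)
{
	if (d->status != TOY_ACTIVE)
		d->pending = TOY_REQ_RESUME;
}

/* Models the workqueue: no work item makes progress while it is frozen. */
static void toy_run_workqueue(struct toy_dev *d)
{
	if (toy_wq_frozen)
		return;
	if (d->pending == TOY_REQ_RESUME)
		d->status = TOY_ACTIVE;
	d->pending = TOY_REQ_NONE;
}

/* Models __pm_runtime_barrier() removing the queued work item. */
static void toy_barrier(struct toy_dev *d)
{
	d->pending = TOY_REQ_NONE;
}

/* Returns the device status after a modelled system suspend/resume cycle. */
static enum toy_status toy_scenario(bool call_barrier)
{
	struct toy_dev d = { TOY_SUSPENDED, TOY_REQ_NONE };

	toy_wq_frozen = true;		/* system suspend in progress */
	toy_request_resume(&d);		/* queued, but cannot run yet */
	toy_run_workqueue(&d);		/* frozen: no progress */
	if (call_barrier)
		toy_barrier(&d);	/* the request is dropped here */
	toy_wq_frozen = false;		/* end of the system resume transition */
	toy_run_workqueue(&d);
	return d.status;
}
```

In this model the request makes no progress before the workqueue is thawed in either case. The barrier only decides whether the request still exists at that point, which matches the observation that keeping the work item alive would not by itself let it run earlier.
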
On Wed, Nov 26, 2025 at 5:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 11/26/25 3:31 AM, Rafael J. Wysocki wrote:
> > > Please address the issue differently.
> >
> > It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang.
>
> I wouldn't call it a hang.
>
> __pm_runtime_barrier() removes the work item queued by
> pm_request_resume(), but at the time when it is called, which is
> device_suspend_late(), the work item queued by pm_request_resume()
> cannot make progress anyway. It will only be able to make progress
> when the PM workqueue is unfrozen at the end of the system resume
> transition.
>
> > Would it be safe to remove the
> > cancel_work_sync() call from __pm_runtime_barrier() since
> > pm_runtime_work() calls functions that check disable_depth
> > when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would
> > this be sufficient to fix the reported deadlock?
>
> If you want the resume work item to survive the system suspend/resume
> cycle, __pm_runtime_disable() may be changed to make that happen, but
> this still will not allow the work to make progress until the system
> resume ends.
>
> I'm not sure if this would help to address the issue at hand though.

I actually have a better idea: Why don't we resume all devices that
have runtime resume work items pending at the time when
device_suspend() is called?

Arguably, somebody wanted them to runtime-resume, so they should be
resumed before being prepared for system suspend and that will
eliminate the issue at hand (because devices cannot suspend during
system suspend/resume).

On Wed, Nov 26, 2025 at 6:21 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Nov 26, 2025 at 5:59 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@acm.org> wrote:
> > >
> > > On 11/26/25 3:31 AM, Rafael J. Wysocki wrote:
> > > > Please address the issue differently.
> > >
> > > It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang.
> >
> > I wouldn't call it a hang.
> >
> > __pm_runtime_barrier() removes the work item queued by
> > pm_request_resume(), but at the time when it is called, which is
> > device_suspend_late(), the work item queued by pm_request_resume()
> > cannot make progress anyway. It will only be able to make progress
> > when the PM workqueue is unfrozen at the end of the system resume
> > transition.
> >
> > > Would it be safe to remove the
> > > cancel_work_sync() call from __pm_runtime_barrier() since
> > > pm_runtime_work() calls functions that check disable_depth
> > > when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would
> > > this be sufficient to fix the reported deadlock?
> >
> > If you want the resume work item to survive the system suspend/resume
> > cycle, __pm_runtime_disable() may be changed to make that happen, but
> > this still will not allow the work to make progress until the system
> > resume ends.
> >
> > I'm not sure if this would help to address the issue at hand though.
>
> I actually have a better idea: Why don't we resume all devices that
> have runtime resume work items pending at the time when
> device_suspend() is called?
>
> Arguably, somebody wanted them to runtime-resume, so they should be
> resumed before being prepared for system suspend and that will
> eliminate the issue at hand (because devices cannot suspend during
> system suspend/resume).

Wait, there is a pm_runtime_barrier() call in device_suspend() that
does just that and additionally it calls __pm_runtime_barrier(), so
all of the pending runtime PM work items should be cancelled by it.

So it looks like the device in question is runtime-suspended at that
point and only later blk_pm_resume_queue() is called to resume it.
I'm wondering where it is called from.

And maybe pm_runtime_resume() should be called for it from its
->suspend() callback?

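The suggestion at the end of the message above can be sketched with another userspace toy model. All sk_* names are hypothetical and this is not driver code; it assumes only what the message states: the barrier in device_suspend() resumes a device that has a resume request pending and cancels pending work, while a device with nothing pending at that point stays runtime-suspended unless something resumes it synchronously, for example from its ->suspend() callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model. These types are hypothetical, not the kernel's. */
enum sk_status { SK_SUSPENDED, SK_ACTIVE };

struct sk_dev {
	enum sk_status status;
	bool resume_pending;	/* stands in for a queued resume work item */
};

/* Models pm_runtime_resume(): synchronous, no workqueue involved. */
static void sk_runtime_resume(struct sk_dev *d)
{
	d->status = SK_ACTIVE;
	d->resume_pending = false;
}

/* Models the barrier in device_suspend(): honor a pending resume, cancel the rest. */
static void sk_runtime_barrier(struct sk_dev *d)
{
	if (d->resume_pending)
		sk_runtime_resume(d);
	d->resume_pending = false;
}

/* Hypothetical driver ->suspend() callback that resumes the device first. */
static int sk_driver_suspend(struct sk_dev *d)
{
	sk_runtime_resume(d);
	/* Driver-specific quiescing would follow here. */
	return 0;
}

/* Returns the device status once the modelled suspend path has run. */
static enum sk_status sk_status_after_suspend(bool pending_at_barrier,
					      bool resume_in_callback)
{
	struct sk_dev d = { SK_SUSPENDED, pending_at_barrier };

	sk_runtime_barrier(&d);
	if (resume_in_callback)
		sk_driver_suspend(&d);
	return d.status;
}
```

In the model, a device with no request pending at barrier time is left suspended, and the synchronous resume in the callback is what brings it to the active state before any later request would have to wait on the frozen workqueue.
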