Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable

Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Rafael J. Wysocki 5 days, 1 hour ago
On Wednesday, November 26, 2025 8:34:54 PM CET Rafael J. Wysocki wrote:
> On Wed, Nov 26, 2025 at 8:16 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Wed, Nov 26, 2025 at 7:06 PM Bart Van Assche <bvanassche@acm.org> wrote:
> > >
> > > On 11/26/25 3:30 AM, Rafael J. Wysocki wrote:
> > > > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@vivo.com> wrote:
> > > >>   T1:                                   T2:
> > > >>   blk_queue_enter
> > > >>   blk_pm_resume_queue
> > > >>   pm_request_resume
> > > >
> > > > Shouldn't this be pm_runtime_resume() rather?
> > >
> > > I tried to make that change on an Android device. As a result, the
> > > kernel complaint shown below appeared. My understanding is that sleeping
> > > in atomic context can trigger a deadlock and hence is not allowed.
> > >
> > > [   13.728890][    T1] WARNING: CPU: 6 PID: 1 at
> > > kernel/sched/core.c:9714 __might_sleep+0x78/0x84
> > > [   13.758800][    T1] Call trace:
> > > [   13.759027][    T1]  __might_sleep+0x78/0x84
> > > [   13.759340][    T1]  __pm_runtime_resume+0x40/0xb8
> > > [   13.759781][    T1]  __bio_queue_enter+0xc0/0x1cc
> > > [   13.760153][    T1]  blk_mq_submit_bio+0x884/0xadc
> > > [   13.760548][    T1]  __submit_bio+0x2c8/0x49c
> > > [   13.760879][    T1]  __submit_bio_noacct_mq+0x38/0x88
> > > [   13.761242][    T1]  submit_bio_noacct_nocheck+0x4fc/0x7b8
> > > [   13.761631][    T1]  submit_bio+0x214/0x4c0
> > > [   13.761941][    T1]  mpage_readahead+0x1b8/0x1fc
> > > [   13.762284][    T1]  blkdev_readahead+0x18/0x28
> > > [   13.762660][    T1]  page_cache_ra_unbounded+0x310/0x4d8
> > > [   13.763072][    T1]  page_cache_ra_order+0xc0/0x5b0
> > > [   13.763434][    T1]  page_cache_sync_ra+0x17c/0x268
> > > [   13.763782][    T1]  filemap_read+0x4c4/0x12f4
> > > [   13.764125][    T1]  blkdev_read_iter+0x100/0x164
> > > [   13.764475][    T1]  vfs_read+0x188/0x348
> > > [   13.764789][    T1]  __se_sys_pread64+0x84/0xc8
> > > [   13.765180][    T1]  __arm64_sys_pread64+0x1c/0x2c
> > > [   13.765556][    T1]  invoke_syscall+0x58/0xf0
> > > [   13.765876][    T1]  do_el0_svc+0x8c/0xe0
> > > [   13.766172][    T1]  el0_svc+0x50/0xd4
> > > [   13.766583][    T1]  el0t_64_sync_handler+0x20/0xf4
> > > [   13.766932][    T1]  el0t_64_sync+0x1bc/0x1c0
> > > [   13.767294][    T1] irq event stamp: 2589614
> > > [   13.767592][    T1] hardirqs last  enabled at (2589613): [<ffffffc0800eaf24>] finish_lock_switch+0x70/0x108
> > > [   13.768283][    T1] hardirqs last disabled at (2589614): [<ffffffc0814b66f4>] el1_dbg+0x24/0x80
> > > [   13.768875][    T1] softirqs last  enabled at (2589370): [<ffffffc080082a7c>] ____do_softirq+0x10/0x20
> > > [   13.769529][    T1] softirqs last disabled at (2589349): [<ffffffc080082a7c>] ____do_softirq+0x10/0x20
> > >
> > > I think that the filemap_invalidate_lock_shared() call in
> > > page_cache_ra_unbounded() forbids sleeping in submit_bio().
> >
> > The wait_event() macro in __bio_queue_enter() calls might_sleep() at
> > the very beginning, so why would it not complain?
> >
> > IIUC, this is the WARN_ONCE() in __might_sleep() about the task state
> > being different from TASK_RUNNING, which triggers because
> > prepare_to_wait_event() changes the task state to
> > TASK_UNINTERRUPTIBLE.
> >
> > This means that calling pm_runtime_resume() cannot be part of the
> > wait_event() condition, so blk_pm_resume_queue() and the wait_event()
> > macros involving it would need some rewriting.
> 
> Interestingly enough, the pm_request_resume() call in
> blk_pm_resume_queue() is not even necessary in the __bio_queue_enter()
> case because pm is false there and it doesn't even check
> q->rpm_status.
> 
> So in fact the resume is only necessary in blk_queue_enter() if pm is nonzero.

If I'm not completely in the weeds, something like the patch below should be
doable.

Also, I'd consider using pm_runtime_get_noresume() and pm_runtime_put_noidle()
in blk_queue_enter() and blk_queue_exit(), respectively, in the "pm != 0" case
to prevent the device from suspending while the .q_usage_counter ref is held.
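
As a sketch (not even compile-tested, and the helper names below are made
up), that would mean something like:

/*
 * Sketch only: pin the runtime PM usage counter of q->dev while a
 * BLK_MQ_REQ_PM caller holds a q_usage_counter reference, so a runtime
 * suspend cannot be started in the meantime.  Neither helper sleeps,
 * so they can be called from any context.
 */
static inline void blk_pm_get(struct request_queue *q)
{
	if (q->dev)
		pm_runtime_get_noresume(q->dev);
}

static inline void blk_pm_put(struct request_queue *q)
{
	if (q->dev)
		pm_runtime_put_noidle(q->dev);
}

blk_queue_enter() would call the former in the "pm != 0" case and
blk_queue_exit() the latter, which of course means that blk_queue_exit()
would need to know that the reference was obtained with BLK_MQ_REQ_PM.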

---
 block/blk-core.c |    6 +++---
 block/blk-pm.h   |    7 ++++---
 2 files changed, 7 insertions(+), 6 deletions(-)

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EAGAIN;
 
+		/* if necessary, resume .dev (assume success). */
+		blk_pm_resume_queue(pm, q);
 		/*
 		 * read pair of barrier in blk_freeze_queue_start(), we need to
 		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
@@ -318,9 +320,7 @@ int blk_queue_enter(struct request_queue
 		 */
 		smp_rmb();
 		wait_event(q->mq_freeze_wq,
-			   (!q->mq_freeze_depth &&
-			    blk_pm_resume_queue(pm, q)) ||
-			   blk_queue_dying(q));
+			   !q->mq_freeze_depth || blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
 	}
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -10,9 +10,10 @@ static inline int blk_pm_resume_queue(co
 {
 	if (!q->dev || !blk_queue_pm_only(q))
 		return 1;	/* Nothing to do */
-	if (pm && q->rpm_status != RPM_SUSPENDED)
-		return 1;	/* Request allowed */
-	pm_request_resume(q->dev);
+
+	if (pm)
+		pm_runtime_resume(q->dev);
+
 	return 0;
 }
 
Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Bart Van Assche 5 days ago
On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
>   		if (flags & BLK_MQ_REQ_NOWAIT)
>   			return -EAGAIN;
>   
> +		/* if necessary, resume .dev (assume success). */
> +		blk_pm_resume_queue(pm, q);
>   		/*
>   		 * read pair of barrier in blk_freeze_queue_start(), we need to
>   		 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and

blk_queue_enter() may be called from the suspend path so I don't think
that the above change will work. As an example, the UFS driver submits a
SCSI START STOP UNIT command from its runtime suspend callback. The call
chain is as follows:

   ufshcd_wl_runtime_suspend()
     __ufshcd_wl_suspend()
       ufshcd_set_dev_pwr_mode()
         ufshcd_execute_start_stop()
           scsi_execute_cmd()
             scsi_alloc_request()
               blk_queue_enter()
             blk_execute_rq()
             blk_mq_free_request()
               blk_queue_exit()

Thanks,

Bart.
Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Rafael J. Wysocki 5 days ago
On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >               if (flags & BLK_MQ_REQ_NOWAIT)
> >                       return -EAGAIN;
> >
> > +             /* if necessary, resume .dev (assume success). */
> > +             blk_pm_resume_queue(pm, q);
> >               /*
> >                * read pair of barrier in blk_freeze_queue_start(), we need to
> >                * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
>
> blk_queue_enter() may be called from the suspend path so I don't think
> that the above change will work.

Why would the existing code work then?

Are you suggesting that q->rpm_status should still be checked before
calling pm_runtime_resume() or do you mean something else?

> As an example, the UFS driver submits a
> SCSI START STOP UNIT command from its runtime suspend callback. The call
> chain is as follows:
>
>    ufshcd_wl_runtime_suspend()
>      __ufshcd_wl_suspend()
>        ufshcd_set_dev_pwr_mode()
>          ufshcd_execute_start_stop()
>            scsi_execute_cmd()
>              scsi_alloc_request()
>                blk_queue_enter()
>              blk_execute_rq()
>              blk_mq_free_request()
>                blk_queue_exit()

In any case, calling pm_request_resume() from blk_pm_resume_queue() in
the !pm case is a mistake.
Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Bart Van Assche 4 days, 23 hours ago
On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
>>>                if (flags & BLK_MQ_REQ_NOWAIT)
>>>                        return -EAGAIN;
>>>
>>> +             /* if necessary, resume .dev (assume success). */
>>> +             blk_pm_resume_queue(pm, q);
>>>                /*
>>>                 * read pair of barrier in blk_freeze_queue_start(), we need to
>>>                 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
>>
>> blk_queue_enter() may be called from the suspend path so I don't think
>> that the above change will work.
> 
> Why would the existing code work then?

The existing code works reliably on a very large number of devices.
Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for
requests that should be processed even if the power status is changing
(RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is
as follows: process this request even if a power state change is
ongoing.
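
For reference, the relevant mainline logic is roughly this (paraphrased):

	/* blk_queue_enter(): */
	const bool pm = flags & BLK_MQ_REQ_PM;

	/* blk_pm_resume_queue(): */
	if (pm && q->rpm_status != RPM_SUSPENDED)
		return 1;	/* RQF_PM request: allowed while the status is changing */
	pm_request_resume(q->dev);
	return 0;
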
> Are you suggesting that q->rpm_status should still be checked before
> calling pm_runtime_resume() or do you mean something else?
The purpose of the code changes from a previous email is not entirely
clear to me so I'm not sure what the code should look like. But to
answer your question, calling blk_pm_resume_queue() if the runtime
status is RPM_SUSPENDED should be safe.
>> As an example, the UFS driver submits a
>> SCSI START STOP UNIT command from its runtime suspend callback. The call
>> chain is as follows:
>>
>>     ufshcd_wl_runtime_suspend()
>>       __ufshcd_wl_suspend()
>>         ufshcd_set_dev_pwr_mode()
>>           ufshcd_execute_start_stop()
>>             scsi_execute_cmd()
>>               scsi_alloc_request()
>>                 blk_queue_enter()
>>               blk_execute_rq()
>>               blk_mq_free_request()
>>                 blk_queue_exit()
> 
> In any case, calling pm_request_resume() from blk_pm_resume_queue() in
> the !pm case is a mistake.
Hmm ... we may disagree about this. Does what I wrote above make it clear
why blk_pm_resume_queue() is called if pm == false?

Thanks,

Bart.
Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Rafael J. Wysocki 4 days, 9 hours ago
On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> > On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>
> >> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>                if (flags & BLK_MQ_REQ_NOWAIT)
> >>>                        return -EAGAIN;
> >>>
> >>> +             /* if necessary, resume .dev (assume success). */
> >>> +             blk_pm_resume_queue(pm, q);
> >>>                /*
> >>>                 * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>                 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>
> >> blk_queue_enter() may be called from the suspend path so I don't think
> >> that the above change will work.
> >
> > Why would the existing code work then?
>
> The existing code works reliably on a very large number of devices.

Well, except that it doesn't work during system suspend and
hibernation when the PM workqueue is frozen.  I think that we agree
here.

This needs to be addressed because it may very well cause system
suspend to deadlock.

There are two possible ways to address it I can think of:

1. Changing blk_pm_resume_queue() and its users to carry out a
synchronous resume of q->dev instead of calling pm_request_resume()
and (effectively) waiting for the queued-up runtime resume of q->dev
to take effect.

This would be my preferred option, but at this point I'm not sure if
it's viable.
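
Building on the diff from earlier in this thread, the sketch for 1. would be
something like the following (the open question being whether the synchronous
call is safe in all blk_pm_resume_queue() call sites, given the wait_event()
context issue discussed earlier):

static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q)
{
	if (!q->dev || !blk_queue_pm_only(q))
		return 1;	/* Nothing to do */

	/* Sketch: resume synchronously so that a failure is visible here. */
	if (pm)
		return pm_runtime_resume(q->dev) >= 0;

	return 0;
}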

2. Stop freezing the PM workqueue before system suspend/hibernation
and adapt device_suspend_late() to that.

This should be doable, even though it is a bit risky because it may
uncover some latent bugs (the freezing of the PM workqueue has been
there forever), but it wouldn't address the problem entirely because
device_suspend_late() would still need to disable runtime PM for the
device (and for some devices it is disabled earlier), so
pm_request_resume() would just start to fail at that point and if
blk_queue_enter() were called after that point for a device supporting
runtime PM, it might deadlock.

> Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for
> requests that should be processed even if the power status is changing
> (RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is
> as follows: process this request even if a power state change is
> ongoing.

I see.

The behavior depends on whether or not q->pm_only is set.  If it is
not set, both blk_queue_enter() and __bio_queue_enter() will allow the
request to be processed.

If q->pm_only is set, __bio_queue_enter() will wait until it gets
cleared and in that case pm_request_resume(q->dev) is called to make
that happen (did I get it right?).  This is a bit fragile because what
if the async resume of q->dev fails for some reason?  You deadlock
instead of failing the request.

Unlike __bio_queue_enter(), blk_queue_enter() additionally checks the
runtime PM status of the queue if q->pm_only is set and it will allow
the request to be processed in that case so long as q->rpm_status is
not RPM_SUSPENDED.  However, if the queue status is RPM_SUSPENDED,
pm_request_resume(q->dev) will be called like in the
__bio_queue_enter() case.

I'm not sure why pm_request_resume(q->dev) needs to be called from
within blk_pm_resume_queue().  Arguably, it should be sufficient to
call it once before using the wait_event() macro, if the conditions
checked by blk_pm_resume_queue() are not met.
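
That is, with a hypothetical side-effect-free check (blk_pm_queue_usable()
below is a made-up name), blk_queue_enter() might do something like:

	/* Request the resume once, up front, if the queue is not usable yet. */
	if (!blk_pm_queue_usable(pm, q))
		pm_request_resume(q->dev);

	wait_event(q->mq_freeze_wq,
		   (!q->mq_freeze_depth && blk_pm_queue_usable(pm, q)) ||
		   blk_queue_dying(q));

The check itself would not sleep, so it could stay in the wait_event()
condition.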

> > Are you suggesting that q->rpm_status should still be checked before
> > calling pm_runtime_resume() or do you mean something else?
> The purpose of the code changes from a previous email is not entirely
> clear to me so I'm not sure what the code should look like. But to
> answer your question, calling blk_pm_resume_queue() if the runtime
> status is RPM_SUSPENDED should be safe.
> >> As an example, the UFS driver submits a
> >> SCSI START STOP UNIT command from its runtime suspend callback. The call
> >> chain is as follows:
> >>
> >>     ufshcd_wl_runtime_suspend()
> >>       __ufshcd_wl_suspend()
> >>         ufshcd_set_dev_pwr_mode()
> >>           ufshcd_execute_start_stop()
> >>             scsi_execute_cmd()
> >>               scsi_alloc_request()
> >>                 blk_queue_enter()
> >>               blk_execute_rq()
> >>               blk_mq_free_request()
> >>                 blk_queue_exit()
> >
> > In any case, calling pm_request_resume() from blk_pm_resume_queue() in
> > the !pm case is a mistake.
>   Hmm ... we may disagree about this. Does what I wrote above make clear
> why blk_pm_resume_queue() is called if pm == false?

Yes, it does, thanks!
Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by YangYang 12 hours ago
On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
>>>>>                 if (flags & BLK_MQ_REQ_NOWAIT)
>>>>>                         return -EAGAIN;
>>>>>
>>>>> +             /* if necessary, resume .dev (assume success). */
>>>>> +             blk_pm_resume_queue(pm, q);
>>>>>                 /*
>>>>>                  * read pair of barrier in blk_freeze_queue_start(), we need to
>>>>>                  * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
>>>>
>>>> blk_queue_enter() may be called from the suspend path so I don't think
>>>> that the above change will work.
>>>
>>> Why would the existing code work then?
>>
>> The existing code works reliably on a very large number of devices.
> 
> Well, except that it doesn't work during system suspend and
> hibernation when the PM workqueue is frozen.  I think that we agree
> here.
> 
> This needs to be addressed because it may very well cause system
> suspend to deadlock.
> 
> There are two possible ways to address it I can think of:
> 
> 1. Changing blk_pm_resume_queue() and its users to carry out a
> synchronous resume of q->dev instead of calling pm_request_resume()
> and (effectively) waiting for the queued-up runtime resume of q->dev
> to take effect.
> 
> This would be my preferred option, but at this point I'm not sure if
> it's viable.
> 

After __pm_runtime_disable() is called from device_suspend_late(), 
dev->power.disable_depth is set, preventing rpm_resume() from making 
progress until the system resume completes, regardless of whether 
rpm_resume() is invoked synchronously or asynchronously.
Performing a synchronous resume of q->dev seems to have a similar 
effect to removing the following code block from 
__pm_runtime_barrier(), which is invoked by __pm_runtime_disable():

1428     if (dev->power.request_pending) {
1429         dev->power.request = RPM_REQ_NONE;
1430         spin_unlock_irq(&dev->power.lock);
1431
1432         cancel_work_sync(&dev->power.work);
1433
1434         spin_lock_irq(&dev->power.lock);
1435         dev->power.request_pending = false;
1436     }

> 2. Stop freezing the PM workqueue before system suspend/hibernation
> and adapt device_suspend_late() to that.
> 
> This should be doable, even though it is a bit risky because it may
> uncover some latent bugs (the freezing of the PM workqueue has been
> there forever), but it wouldn't address the problem entirely because
> device_suspend_late() would still need to disable runtime PM for the
> device (and for some devices it is disabled earlier), so
> pm_request_resume() would just start to fail at that point and if
> blk_queue_enter() were called after that point for a device supporting
> runtime PM, it might deadlock.
> 
>> Maybe there is a misunderstanding? RQF_PM / BLK_MQ_REQ_PM are set for
>> requests that should be processed even if the power status is changing
>> (RPM_SUSPENDING or RPM_RESUMING). The meaning of the 'pm' variable is
>> as follows: process this request even if a power state change is
>> ongoing.
> 
> I see.
> 
> The behavior depends on whether or not q->pm_only is set.  If it is
> not set, both blk_queue_enter() and __bio_queue_enter() will allow the
> request to be processed.
> 
> If q->pm_only is set, __bio_queue_enter() will wait until it gets
> cleared and in that case pm_request_resume(q->dev) is called to make
> that happen (did I get it right?).  This is a bit fragile because what
> if the async resume of q->dev fails for some reason?  You deadlock
> instead of failing the request.
> 
> Unlike __bio_queue_enter(), blk_queue_enter() additionally checks the
> runtime PM status of the queue if q->pm_only is set and it will allow
> the request to be processed in that case so long as q->rpm_status is
> not RPM_SUSPENDED.  However, if the queue status is RPM_SUSPENDED,
> pm_request_resume(q->dev) will be called like in the
> __bio_queue_enter() case.
> 
> I'm not sure why pm_request_resume(q->dev) needs to be called from
> within blk_pm_resume_queue().  Arguably, it should be sufficient to
> call it once before using the wait_event() macro, if the conditions
> checked by blk_pm_resume_queue() are not met.
> 
>>> Are you suggesting that q->rpm_status should still be checked before
>>> calling pm_runtime_resume() or do you mean something else?
>> The purpose of the code changes from a previous email is not entirely
>> clear to me so I'm not sure what the code should look like. But to
>> answer your question, calling blk_pm_resume_queue() if the runtime
>> status is RPM_SUSPENDED should be safe.
>>>> As an example, the UFS driver submits a
>>>> SCSI START STOP UNIT command from its runtime suspend callback. The call
>>>> chain is as follows:
>>>>
>>>>      ufshcd_wl_runtime_suspend()
>>>>        __ufshcd_wl_suspend()
>>>>          ufshcd_set_dev_pwr_mode()
>>>>            ufshcd_execute_start_stop()
>>>>              scsi_execute_cmd()
>>>>                scsi_alloc_request()
>>>>                  blk_queue_enter()
>>>>                blk_execute_rq()
>>>>                blk_mq_free_request()
>>>>                  blk_queue_exit()
>>>
>>> In any case, calling pm_request_resume() from blk_pm_resume_queue() in
>>> the !pm case is a mistake.
>>    Hmm ... we may disagree about this. Does what I wrote above make clear
>> why blk_pm_resume_queue() is called if pm == false?
> 
> Yes, it does, thanks!

Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Rafael J. Wysocki 3 hours ago
On Mon, Dec 1, 2025 at 10:46 AM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> > On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>
> >> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> >>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>>
> >>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>>>                 if (flags & BLK_MQ_REQ_NOWAIT)
> >>>>>                         return -EAGAIN;
> >>>>>
> >>>>> +             /* if necessary, resume .dev (assume success). */
> >>>>> +             blk_pm_resume_queue(pm, q);
> >>>>>                 /*
> >>>>>                  * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>>>                  * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>>>
> >>>> blk_queue_enter() may be called from the suspend path so I don't think
> >>>> that the above change will work.
> >>>
> >>> Why would the existing code work then?
> >>
> >> The existing code works reliably on a very large number of devices.
> >
> > Well, except that it doesn't work during system suspend and
> > hibernation when the PM workqueue is frozen.  I think that we agree
> > here.
> >
> > This needs to be addressed because it may very well cause system
> > suspend to deadlock.
> >
> > There are two possible ways to address it I can think of:
> >
> > 1. Changing blk_pm_resume_queue() and its users to carry out a
> > synchronous resume of q->dev instead of calling pm_request_resume()
> > and (effectively) waiting for the queued-up runtime resume of q->dev
> > to take effect.
> >
> > This would be my preferred option, but at this point I'm not sure if
> > it's viable.
> >
>
> After __pm_runtime_disable() is called from device_suspend_late(),
> dev->power.disable_depth is set, preventing rpm_resume() from making
> progress until the system resume completes, regardless of whether
> rpm_resume() is invoked synchronously or asynchronously.

This isn't factually correct.  rpm_resume() will make progress when
runtime PM is disabled, but it will not resume the target device.
That's what disabling runtime PM means.

Of course, when runtime PM is disabled for the given device,
rpm_resume() will return an error code that can be checked.  However,
if pm_request_resume() is called before disabling runtime PM for the
device and runtime PM is disabled for it before the work item queued
by pm_request_resume() runs, the failure will be silent from the
caller's perspective.
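
In other words, with a synchronous call the caller at least gets to see the
failure, along the lines of:

	int error = pm_runtime_resume(q->dev);

	if (error < 0)
		return error;	/* e.g. -EACCES after runtime PM has been disabled */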

> Performing a synchronous resume of q->dev seems to have a similar
> effect to removing the following code block from
> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
>
> 1428     if (dev->power.request_pending) {
> 1429         dev->power.request = RPM_REQ_NONE;
> 1430         spin_unlock_irq(&dev->power.lock);
> 1431
> 1432         cancel_work_sync(&dev->power.work);
> 1433
> 1434         spin_lock_irq(&dev->power.lock);
> 1435         dev->power.request_pending = false;
> 1436     }

It is different.

First of all, synchronous runtime resume is not affected by the
freezing of the runtime PM workqueue.  Next, see the remark above
regarding returning an error code.  Finally, so long as
__pm_runtime_resume() acquires power.lock before
__pm_runtime_disable(), the synchronous resume will be waited for by
the latter.

Generally speaking, if blk_queue_enter() or __bio_queue_enter() may
run in parallel with device_suspend_late() for q->dev, the driver of
that device is defective, because it is responsible for preventing
this situation from happening.  The most straightforward way to
achieve that is to provide a .suspend() callback for q->dev that will
runtime-resume it (and, of course, q->dev will need to be prepared for
system suspend as appropriate after that).
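
For example (a sketch of a fictitious driver; the names are made up):

/*
 * Sketch: a system .suspend() callback that runtime-resumes the device
 * first, so blk_queue_enter() cannot race with runtime PM being disabled
 * in device_suspend_late().  The PM core has already pinned the usage
 * count via pm_runtime_get_noresume() in device_prepare() at this point.
 */
static int foo_suspend(struct device *dev)
{
	int ret;

	ret = pm_runtime_resume_and_get(dev);
	if (ret < 0)
		return ret;

	/* ... quiesce the device for system sleep ... */

	pm_runtime_put_noidle(dev);
	return 0;
}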

If blk_queue_enter() or __bio_queue_enter() is allowed to race with
disabling runtime PM for q->dev, failure to resume q->dev is always
possible and there are no changes that can be made to
pm_runtime_disable() to prevent that from happening.  If
__pm_runtime_disable() wins the race, it will increment
power.disable_depth and rpm_resume() will bail out when it sees that
no matter what.

You should not conflate "runtime PM doesn't work when it is disabled"
with "asynchronous runtime PM doesn't work after freezing the PM
workqueue".  They are both true, but they are not the same.
Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by YangYang 9 hours ago
On 2025/12/1 17:46, YangYang wrote:
> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
>> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
>>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>
>>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
>>>>>> --- a/block/blk-core.c
>>>>>> +++ b/block/blk-core.c
>>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
>>>>>>                 if (flags & BLK_MQ_REQ_NOWAIT)
>>>>>>                         return -EAGAIN;
>>>>>>
>>>>>> +             /* if necessary, resume .dev (assume success). */
>>>>>> +             blk_pm_resume_queue(pm, q);
>>>>>>                 /*
>>>>>>                  * read pair of barrier in blk_freeze_queue_start(), we need to
>>>>>>                  * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
>>>>>
>>>>> blk_queue_enter() may be called from the suspend path so I don't think
>>>>> that the above change will work.
>>>>
>>>> Why would the existing code work then?
>>>
>>> The existing code works reliably on a very large number of devices.
>>
>> Well, except that it doesn't work during system suspend and
>> hibernation when the PM workqueue is frozen.  I think that we agree
>> here.
>>
>> This needs to be addressed because it may very well cause system
>> suspend to deadlock.
>>
>> There are two possible ways to address it I can think of:
>>
>> 1. Changing blk_pm_resume_queue() and its users to carry out a
>> synchronous resume of q->dev instead of calling pm_request_resume()
>> and (effectively) waiting for the queued-up runtime resume of q->dev
>> to take effect.
>>
>> This would be my preferred option, but at this point I'm not sure if
>> it's viable.
>>
> 
> After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing 
> rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked 
> synchronously or asynchronously.
> Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from 
> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
> 
> 1428     if (dev->power.request_pending) {
> 1429         dev->power.request = RPM_REQ_NONE;
> 1430         spin_unlock_irq(&dev->power.lock);
> 1431
> 1432         cancel_work_sync(&dev->power.work);
> 1433
> 1434         spin_lock_irq(&dev->power.lock);
> 1435         dev->power.request_pending = false;
> 1436     }
> 

Since both synchronous and asynchronous resumes face similar issues,
it may be sufficient to keep using the asynchronous resume path as long as
pending work items are not canceled while the PM workqueue is frozen.
This allows the pending work to proceed normally once the PM workqueue
is unfrozen.

---
  drivers/base/power/main.c    |  2 +-
  drivers/base/power/runtime.c | 17 +++++++++++------
  include/linux/pm_runtime.h   |  6 +++---
  3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 1de1cd72b616..d5c3d7a6777e 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1635,7 +1635,7 @@ static void device_suspend_late(struct device *dev, pm_message_t state, bool asy
  	 * Disable runtime PM for the device without checking if there is a
  	 * pending resume request for it.
  	 */
-	__pm_runtime_disable(dev, false);
+	__pm_runtime_disable(dev, false, true);

  	if (dev->power.syscore)
  		goto Skip;
diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 1b11a3cd4acc..ff3fdfba2dc8 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -1421,11 +1421,16 @@ EXPORT_SYMBOL_GPL(__pm_runtime_set_status);
   *
   * Should be called under dev->power.lock with interrupts disabled.
   */
-static void __pm_runtime_barrier(struct device *dev)
+static void __pm_runtime_barrier(struct device *dev, bool frozen)
  {
  	pm_runtime_deactivate_timer(dev);

-	if (dev->power.request_pending) {
+	/*
+	 * If the PM workqueue has already been frozen, the following
+	 * operations are unnecessary. This allows any pending work to
+	 * continue execution once the PM workqueue is unfrozen.
+	 */
+	if (!frozen && dev->power.request_pending) {
  		dev->power.request = RPM_REQ_NONE;
  		spin_unlock_irq(&dev->power.lock);

@@ -1485,7 +1490,7 @@ int pm_runtime_barrier(struct device *dev)
  		retval = 1;
  	}

-	__pm_runtime_barrier(dev);
+	__pm_runtime_barrier(dev, false);

  	spin_unlock_irq(&dev->power.lock);
  	pm_runtime_put_noidle(dev);
@@ -1519,7 +1524,7 @@ void pm_runtime_unblock(struct device *dev)
  	spin_unlock_irq(&dev->power.lock);
  }

-void __pm_runtime_disable(struct device *dev, bool check_resume)
+void __pm_runtime_disable(struct device *dev, bool check_resume, bool frozen)
  {
  	spin_lock_irq(&dev->power.lock);

@@ -1550,7 +1555,7 @@ void __pm_runtime_disable(struct device *dev, bool check_resume)
  	update_pm_runtime_accounting(dev);

  	if (!dev->power.disable_depth++) {
-		__pm_runtime_barrier(dev);
+		__pm_runtime_barrier(dev, frozen);
  		dev->power.last_status = dev->power.runtime_status;
  	}

@@ -1893,7 +1898,7 @@ void pm_runtime_reinit(struct device *dev)
   */
  void pm_runtime_remove(struct device *dev)
  {
-	__pm_runtime_disable(dev, false);
+	__pm_runtime_disable(dev, false, false);
  	pm_runtime_reinit(dev);
  }

diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index 0b436e15f4cd..102060a9ebc7 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -80,7 +80,7 @@ extern int pm_runtime_barrier(struct device *dev);
  extern bool pm_runtime_block_if_disabled(struct device *dev);
  extern void pm_runtime_unblock(struct device *dev);
  extern void pm_runtime_enable(struct device *dev);
-extern void __pm_runtime_disable(struct device *dev, bool check_resume);
+extern void __pm_runtime_disable(struct device *dev, bool check_resume, bool frozen);
  extern void pm_runtime_allow(struct device *dev);
  extern void pm_runtime_forbid(struct device *dev);
  extern void pm_runtime_no_callbacks(struct device *dev);
@@ -288,7 +288,7 @@ static inline int pm_runtime_barrier(struct device *dev) { return 0; }
  static inline bool pm_runtime_block_if_disabled(struct device *dev) { return true; }
  static inline void pm_runtime_unblock(struct device *dev) {}
  static inline void pm_runtime_enable(struct device *dev) {}
-static inline void __pm_runtime_disable(struct device *dev, bool c) {}
+static inline void __pm_runtime_disable(struct device *dev, bool c, bool f) {}
  static inline bool pm_runtime_blocked(struct device *dev) { return true; }
  static inline void pm_runtime_allow(struct device *dev) {}
  static inline void pm_runtime_forbid(struct device *dev) {}
@@ -775,7 +775,7 @@ static inline int pm_runtime_set_suspended(struct device *dev)
   */
  static inline void pm_runtime_disable(struct device *dev)
  {
-	__pm_runtime_disable(dev, true);
+	__pm_runtime_disable(dev, true, false);
  }

  /**
-- 

Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable
Posted by Rafael J. Wysocki 3 hours ago
On Mon, Dec 1, 2025 at 1:56 PM YangYang <yang.yang@vivo.com> wrote:
>
> On 2025/12/1 17:46, YangYang wrote:
> > On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> >> On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>
> >>> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> >>>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>>>
> >>>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>>>>> --- a/block/blk-core.c
> >>>>>> +++ b/block/blk-core.c
> >>>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>>>>                 if (flags & BLK_MQ_REQ_NOWAIT)
> >>>>>>                         return -EAGAIN;
> >>>>>>
> >>>>>> +             /* if necessary, resume .dev (assume success). */
> >>>>>> +             blk_pm_resume_queue(pm, q);
> >>>>>>                 /*
> >>>>>>                  * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>>>>                  * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>>>>
> >>>>> blk_queue_enter() may be called from the suspend path so I don't think
> >>>>> that the above change will work.
> >>>>
> >>>> Why would the existing code work then?
> >>>
> >>> The existing code works reliably on a very large number of devices.
> >>
> >> Well, except that it doesn't work during system suspend and
> >> hibernation when the PM workqueue is frozen.  I think that we agree
> >> here.
> >>
> >> This needs to be addressed because it may very well cause system
> >> suspend to deadlock.
> >>
> >> There are two possible ways to address it I can think of:
> >>
> >> 1. Changing blk_pm_resume_queue() and its users to carry out a
> >> synchronous resume of q->dev instead of calling pm_request_resume()
> >> and (effectively) waiting for the queued-up runtime resume of q->dev
> >> to take effect.
> >>
> >> This would be my preferred option, but at this point I'm not sure if
> >> it's viable.
> >>
> >
> > After __pm_runtime_disable() is called from device_suspend_late(), dev->power.disable_depth is set, preventing
> > rpm_resume() from making progress until the system resume completes, regardless of whether rpm_resume() is invoked
> > synchronously or asynchronously.
> > Performing a synchronous resume of q->dev seems to have a similar effect to removing the following code block from
> > __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
> >
> > 1428     if (dev->power.request_pending) {
> > 1429         dev->power.request = RPM_REQ_NONE;
> > 1430         spin_unlock_irq(&dev->power.lock);
> > 1431
> > 1432         cancel_work_sync(&dev->power.work);
> > 1433
> > 1434         spin_lock_irq(&dev->power.lock);
> > 1435         dev->power.request_pending = false;
> > 1436     }
> >
>
> Since both synchronous and asynchronous resumes face similar issues,

No, they don't.

> it may be sufficient to keep using the asynchronous resume path as long as
> pending work items are not canceled while the PM workqueue is frozen.

Except for two things:

1. If blk_queue_enter() or __bio_queue_enter() is allowed to race with
disabling runtime PM, queuing up the resume work item may fail in the
first place.
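
This particular failure would at least be detectable at the call site,
because queuing the request fails synchronously once runtime PM is
disabled, e.g.:

	int ret = pm_request_resume(q->dev);

	if (ret < 0)
		return ret;	/* -EACCES if __pm_runtime_disable() won the race */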

2. If a device runtime resume work item is queued up before the whole
system is suspended, it may not make sense to run that work item after
resuming the whole system because the state of the system as a whole
is generally different at that point.

> This allows the pending work to proceed normally once the PM workqueue
> is unfrozen.

Not really.