drivers/base/power/main.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
On Friday, July 11, 2025 3:54:00 PM CEST Rafael J. Wysocki wrote:
> On Fri, Jul 11, 2025 at 3:38 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Fri, Jul 11, 2025 at 3:08 PM Tudor Ambarus <tudor.ambarus@linaro.org> wrote:
> > >
> > >
> > > Hi, Rafael,
> > >
> > > On 3/14/25 12:50 PM, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > According to [1], the handling of device suspend and resume, and
> > > > particularly the latter, involves unnecessary overhead related to
> > > > starting new async work items for devices that cannot make progress
> > > > right away because they have to wait for other devices.
> > > >
> > > > To reduce this problem in the resume path, use the observation that
> > > > starting the async resume of the children of a device after resuming
> > > > the parent is likely to produce less scheduling and memory management
> > > > noise than starting it upfront while at the same time it should not
> > > > increase the resume duration substantially.
> > > >
> > > > Accordingly, modify the code to start the async resume of the device's
> > > > children when the processing of the parent has been completed in each
> > > > stage of device resume and only start async resume upfront for devices
> > > > without parents.
> > > >
> > > > Also make it check if a given device can be resumed asynchronously
> > > > before starting the synchronous resume of it in case it will have to
> > > > wait for another that is already resuming asynchronously.
> > > >
> > > > In addition to making the async resume of devices more friendly to
> > > > systems with relatively less computing resources, this change is also
> > > > preliminary for analogous changes in the suspend path.
> > > >
> > > > On the systems where it has been tested, this change by itself does
> > > > not affect the overall system resume duration in a measurable way.
> > > >
> > > > Link: https://lore.kernel.org/linux-pm/20241114220921.2529905-1-saravanak@google.com/ [1]
> > > > Suggested-by: Saravana Kannan <saravanak@google.com>
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > I'd like to let you know of a suspend crash that I'm dealing with
> > > when using the OOT pixel6 drivers on top of v6.16-rc4.
> >
> > Well, thanks, but there's not much I can do about it.
> >
> > It is also better to start a new thread in such cases than to reply to
> > a patch submission.
> >
> > > Similar to what Jon reported, everything gets back to normal if
> > > I disable pm_async or if I revert the following patches:
> > > 443046d1ad66 PM: sleep: Make suspend of devices more asynchronous
> > > aa7a9275ab81 PM: sleep: Suspend async parents after suspending children
> > > 0cbef962ce1f PM: sleep: Resume children after resuming the parent
> > >
> > > I also reverted their fixes when testing:
> > > 8887abccf8aa PM: sleep: Add locking to dpm_async_resume_children()
> > > d46c4c839c20 PM: sleep: Fix power.is_suspended cleanup for direct-complete devices
> > > 079e8889ad13 PM: sleep: Fix list splicing in device suspend error paths
> > >
> > > It seems that the hang happens in dpm_suspend() at
> > > async_synchronize_full() time after a driver fails to suspend.
> > > The phone then naturally resets with an APC watchdog.
> > >
> > > [ 519.142279][ T7917] lwis lwis-eeprom-m24c64x: Can't suspend because eeprom-m24c64x is in use!
> > > [ 519.143556][ T7917] lwis-i2c eeprom@2: PM: dpm_run_callback(): platform_pm_suspend returns -16
> > > [ 519.143872][ T7917] lwis-i2c eeprom@2: PM: platform_pm_suspend returned -16 after 1596 usecs
> > > [ 519.144197][ T7917] lwis-i2c eeprom@2: PM: failed to suspend: error -16
> > > [ 519.144448][ T7917] PM: tudor: dpm_suspend: after while loop, list_empty(&dpm_prepared_list)? 1
> > > [ 519.144779][ T7917] PM: tudor: dpm_suspend: before async_synchronize_full
> > >
> > > The extra prints are because of:
> > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> > > index d9d4fc58bc5a..3efe538c2ec2 100644
> > > --- a/drivers/base/power/main.c
> > > +++ b/drivers/base/power/main.c
> > > @@ -1967,10 +1967,15 @@ int dpm_suspend(pm_message_t state)
> > > break;
> > > }
> > > }
> > > + pr_err("tudor: %s: after while loop, list_empty(&dpm_prepared_list)? %d\n",
> > > + __func__, list_empty(&dpm_prepared_list));
> > >
> > > mutex_unlock(&dpm_list_mtx);
> > >
> > > + pr_err("tudor: %s: before async_synchronize_full\n", __func__);
> > > async_synchronize_full();
> > > + pr_err("tudor: %s: after async_synchronize_full();\n", __func__);
> > > +
> > > if (!error)
> > > error = async_error;
> > >
> > > The synchronous suspend works because its strict, one-by-one ordering
> > > ensures that device dependencies are met and that no device is suspended
> > > while another is still using it. The asynchronous suspend fails because
> > > it creates a race condition where the lwis-eeprom-m24c64x is called for
> > > suspension before the process using it has been suspended, leading to a
> > > fatal "device busy" error. Should the failure of a device suspend be
> > > fatal?
> >
> > It shouldn't in principle, but it depends on what exactly is involved and how.
> >
> > It looks like something is blocking on power.completion somewhere.
> > I'll check the code, maybe a complete() is missing in an error path or
> > similar.
>
> It doesn't look like anything is missing in the core, so the suspend
> failure seems to be triggering a deadlock of some sort.
Well, I'm taking this back.
The following scenario definitely can happen:
1. Device A is async and it depends on device B that is sync.
2. Async suspend is scheduled for A before the processing of B is started.
3. A is waiting for B.
4. In the meantime, an unrelated device fails to suspend and returns an error.
5. The processing of B doesn't start at all and its power.completion is not
updated.
6. A is still waiting for B when async_synchronize_full() is called.
7. Deadlock ensues.
If this is what happens in your case, the (untested) patch below should help
(unless I messed it up, that is).
---
drivers/base/power/main.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1323,6 +1323,22 @@
device_links_read_unlock(idx);
}
+static void dpm_async_suspend_complete_all(struct list_head *device_list)
+{
+ struct device *dev;
+
+ guard(mutex)(&async_wip_mtx);
+
+ list_for_each_entry_reverse(dev, device_list, power.entry) {
+ /*
+ * In case the device is being waited for and async processing
+ * has not started for it yet, let the waiters make progress.
+ */
+ if (!dev->power.work_in_progress)
+ complete_all(&dev->power.completion);
+ }
+}
+
/**
* resume_event - Return a "resume" message for given "suspend" sleep state.
* @sleep_state: PM message representing a sleep state.
@@ -1499,6 +1515,7 @@
mutex_lock(&dpm_list_mtx);
if (error || async_error) {
+ dpm_async_suspend_complete_all(&dpm_late_early_list);
/*
* Move all devices to the target list to resume them
* properly.
@@ -1701,6 +1718,7 @@
mutex_lock(&dpm_list_mtx);
if (error || async_error) {
+ dpm_async_suspend_complete_all(&dpm_suspended_list);
/*
* Move all devices to the target list to resume them
* properly.
@@ -1994,6 +2012,7 @@
mutex_lock(&dpm_list_mtx);
if (error || async_error) {
+ dpm_async_suspend_complete_all(&dpm_prepared_list);
/*
* Move all devices to the target list to resume them
* properly.
On 7/12/25 8:54 AM, Rafael J. Wysocki wrote: > On Friday, July 11, 2025 3:54:00 PM CEST Rafael J. Wysocki wrote: >> On Fri, Jul 11, 2025 at 3:38 PM Rafael J. Wysocki <rafael@kernel.org> wrote: >>> >>> On Fri, Jul 11, 2025 at 3:08 PM Tudor Ambarus <tudor.ambarus@linaro.org> wrote: >>>> >>>> >>>> Hi, Rafael, >>>> >>>> On 3/14/25 12:50 PM, Rafael J. Wysocki wrote: >>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>>>> >>>>> According to [1], the handling of device suspend and resume, and >>>>> particularly the latter, involves unnecessary overhead related to >>>>> starting new async work items for devices that cannot make progress >>>>> right away because they have to wait for other devices. >>>>> >>>>> To reduce this problem in the resume path, use the observation that >>>>> starting the async resume of the children of a device after resuming >>>>> the parent is likely to produce less scheduling and memory management >>>>> noise than starting it upfront while at the same time it should not >>>>> increase the resume duration substantially. >>>>> >>>>> Accordingly, modify the code to start the async resume of the device's >>>>> children when the processing of the parent has been completed in each >>>>> stage of device resume and only start async resume upfront for devices >>>>> without parents. >>>>> >>>>> Also make it check if a given device can be resumed asynchronously >>>>> before starting the synchronous resume of it in case it will have to >>>>> wait for another that is already resuming asynchronously. >>>>> >>>>> In addition to making the async resume of devices more friendly to >>>>> systems with relatively less computing resources, this change is also >>>>> preliminary for analogous changes in the suspend path. >>>>> >>>>> On the systems where it has been tested, this change by itself does >>>>> not affect the overall system resume duration in a measurable way. >>>>> >>>>> Link: https://lore.kernel.org/linux-pm/20241114220921.2529905-1-saravanak@google.com/ [1] >>>>> Suggested-by: Saravana Kannan <saravanak@google.com> >>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>>> >>>> I'd like to let you know of a suspend crash that I'm dealing with >>>> when using the OOT pixel6 drivers on top of v6.16-rc4. >>> >>> Well, thanks, but there's not much I can do about it. >>> >>> It is also better to start a new thread in such cases than to reply to >>> a patch submission. >>> >>>> Similar to what Jon reported, everything gets back to normal if >>>> I disable pm_async or if I revert the following patches: >>>> 443046d1ad66 PM: sleep: Make suspend of devices more asynchronous >>>> aa7a9275ab81 PM: sleep: Suspend async parents after suspending children >>>> 0cbef962ce1f PM: sleep: Resume children after resuming the parent >>>> >>>> I also reverted their fixes when testing: >>>> 8887abccf8aa PM: sleep: Add locking to dpm_async_resume_children() >>>> d46c4c839c20 PM: sleep: Fix power.is_suspended cleanup for direct-complete devices >>>> 079e8889ad13 PM: sleep: Fix list splicing in device suspend error paths >>>> >>>> It seems that the hang happens in dpm_suspend() at >>>> async_synchronize_full() time after a driver fails to suspend. >>>> The phone then naturally resets with an APC watchdog. >>>> >>>> [ 519.142279][ T7917] lwis lwis-eeprom-m24c64x: Can't suspend because eeprom-m24c64x is in use! >>>> [ 519.143556][ T7917] lwis-i2c eeprom@2: PM: dpm_run_callback(): platform_pm_suspend returns -16 >>>> [ 519.143872][ T7917] lwis-i2c eeprom@2: PM: platform_pm_suspend returned -16 after 1596 usecs >>>> [ 519.144197][ T7917] lwis-i2c eeprom@2: PM: failed to suspend: error -16 >>>> [ 519.144448][ T7917] PM: tudor: dpm_suspend: after while loop, list_empty(&dpm_prepared_list)? 1 >>>> [ 519.144779][ T7917] PM: tudor: dpm_suspend: before async_synchronize_full >>>> >>>> The extra prints are because of: >>>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >>>> index d9d4fc58bc5a..3efe538c2ec2 100644 >>>> --- a/drivers/base/power/main.c >>>> +++ b/drivers/base/power/main.c >>>> @@ -1967,10 +1967,15 @@ int dpm_suspend(pm_message_t state) >>>> break; >>>> } >>>> } >>>> + pr_err("tudor: %s: after while loop, list_empty(&dpm_prepared_list)? %d\n", >>>> + __func__, list_empty(&dpm_prepared_list)); >>>> >>>> mutex_unlock(&dpm_list_mtx); >>>> >>>> + pr_err("tudor: %s: before async_synchronize_full\n", __func__); >>>> async_synchronize_full(); >>>> + pr_err("tudor: %s: after async_synchronize_full();\n", __func__); >>>> + >>>> if (!error) >>>> error = async_error; >>>> >>>> The synchronous suspend works because its strict, one-by-one ordering >>>> ensures that device dependencies are met and that no device is suspended >>>> while another is still using it. The asynchronous suspend fails because >>>> it creates a race condition where the lwis-eeprom-m24c64x is called for >>>> suspension before the process using it has been suspended, leading to a >>>> fatal "device busy" error. Should the failure of a device suspend be >>>> fatal? >>> >>> It shouldn't in principle, but it depends on what exactly is involved and how. >>> >>> It looks like something is blocking on power.completion somewhere. >>> I'll check the code, maybe a complete() is missing in an error path or >>> similar. >> >> It doesn't look like anything is missing in the core, so the suspend >> failure seems to be triggering a deadlock of some sort. > > Well, I'm taking this back. > > The following scenario definitely can happen: > > 1. Device A is async and it depends on device B that is sync. > 2. Async suspend is scheduled for A before the processing of B is started. > 3. A is waiting for B. > 4. In the meantime, an unrelated device fails to suspend and returns an error. > 5. The processing of B doesn't start at all and its power.completion is not > updated. > 6. A is still waiting for B when async_synchronize_full() is called. > 7. Deadlock ensues. > > If this is what happens in your case, the (untested) patch below should help > (unless I messed it up, that is). Thanks, Rafael. I added few prints (see updated patch below) to figure out whether complete_all(&dev->power.completion) is called in my case, and it seems it's not, I still get the APC watchdog: [ 724.361425][ T8468] lwis-i2c eeprom@2: PM: calling platform_pm_suspend @ 8468, parent: platform [ 724.361751][ T8468] lwis lwis-eeprom-m24c64x: Can't suspend because eeprom-m24c64x is in use! [ 724.362098][ T8468] lwis-i2c eeprom@2: PM: dpm_run_callback(): platform_pm_suspend returns -16 [ 724.362427][ T8468] lwis-i2c eeprom@2: PM: platform_pm_suspend returned -16 after 679 usecs [ 724.362750][ T8468] lwis-i2c eeprom@2: PM: failed to suspend: error -16 [ 724.362999][ T8468] PM: tudor: dpm_async_suspend_complete_all: enter [ 724.363242][ T8468] PM: tudor: dpm_suspend: before async_synchronize_full diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index d9d4fc58bc5a..0e186bc38a00 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1281,6 +1281,27 @@ static void dpm_async_suspend_parent(struct device *dev, async_func_t func) dpm_async_with_cleanup(dev->parent, func); } +static void dpm_async_suspend_complete_all(struct list_head *device_list) +{ + struct device *dev; + + + pr_err("tudor: %s: enter\n", __func__); + guard(mutex)(&async_wip_mtx); + + list_for_each_entry_reverse(dev, device_list, power.entry) { + /* + * In case the device is being waited for and async processing + * has not started for it yet, let the waiters make progress. + */ + pr_err("tudor: %s: in device list\n", __func__); + if (!dev->power.work_in_progress) { + pr_err("tudor: %s: call complete_all\n", __func__); + complete_all(&dev->power.completion); + } + } +} + /** * resume_event - Return a "resume" message for given "suspend" sleep state. * @sleep_state: PM message representing a sleep state. @@ -1459,6 +1480,7 @@ static int dpm_noirq_suspend_devices(pm_message_t state) mutex_lock(&dpm_list_mtx); if (error || async_error) { + dpm_async_suspend_complete_all(&dpm_late_early_list); /* * Move all devices to the target list to resume them * properly. @@ -1663,6 +1685,7 @@ int dpm_suspend_late(pm_message_t state) mutex_lock(&dpm_list_mtx); if (error || async_error) { + dpm_async_suspend_complete_all(&dpm_late_early_list); /* * Move all devices to the target list to resume them * properly. @@ -1959,6 +1982,7 @@ int dpm_suspend(pm_message_t state) mutex_lock(&dpm_list_mtx); if (error || async_error) { + dpm_async_suspend_complete_all(&dpm_late_early_list); /* * Move all devices to the target list to resume them * properly. @@ -1970,9 +1994,12 @@ int dpm_suspend(pm_message_t state) mutex_unlock(&dpm_list_mtx); + pr_err("tudor: %s: before async_synchronize_full\n", __func__); async_synchronize_full(); if (!error) error = async_error; + pr_err("tudor: %s: after async_synchronize_full();\n", __func__); + if (error) dpm_save_failed_step(SUSPEND_SUSPEND);
On Mon, Jul 14, 2025 at 9:09 AM Tudor Ambarus <tudor.ambarus@linaro.org> wrote: > > > > On 7/12/25 8:54 AM, Rafael J. Wysocki wrote: > > On Friday, July 11, 2025 3:54:00 PM CEST Rafael J. Wysocki wrote: > >> On Fri, Jul 11, 2025 at 3:38 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > >>> > >>> On Fri, Jul 11, 2025 at 3:08 PM Tudor Ambarus <tudor.ambarus@linaro.org> wrote: > >>>> > >>>> > >>>> Hi, Rafael, > >>>> > >>>> On 3/14/25 12:50 PM, Rafael J. Wysocki wrote: > >>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >>>>> > >>>>> According to [1], the handling of device suspend and resume, and > >>>>> particularly the latter, involves unnecessary overhead related to > >>>>> starting new async work items for devices that cannot make progress > >>>>> right away because they have to wait for other devices. > >>>>> > >>>>> To reduce this problem in the resume path, use the observation that > >>>>> starting the async resume of the children of a device after resuming > >>>>> the parent is likely to produce less scheduling and memory management > >>>>> noise than starting it upfront while at the same time it should not > >>>>> increase the resume duration substantially. > >>>>> > >>>>> Accordingly, modify the code to start the async resume of the device's > >>>>> children when the processing of the parent has been completed in each > >>>>> stage of device resume and only start async resume upfront for devices > >>>>> without parents. > >>>>> > >>>>> Also make it check if a given device can be resumed asynchronously > >>>>> before starting the synchronous resume of it in case it will have to > >>>>> wait for another that is already resuming asynchronously. > >>>>> > >>>>> In addition to making the async resume of devices more friendly to > >>>>> systems with relatively less computing resources, this change is also > >>>>> preliminary for analogous changes in the suspend path. > >>>>> > >>>>> On the systems where it has been tested, this change by itself does > >>>>> not affect the overall system resume duration in a measurable way. > >>>>> > >>>>> Link: https://lore.kernel.org/linux-pm/20241114220921.2529905-1-saravanak@google.com/ [1] > >>>>> Suggested-by: Saravana Kannan <saravanak@google.com> > >>>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > >>>> > >>>> I'd like to let you know of a suspend crash that I'm dealing with > >>>> when using the OOT pixel6 drivers on top of v6.16-rc4. > >>> > >>> Well, thanks, but there's not much I can do about it. > >>> > >>> It is also better to start a new thread in such cases than to reply to > >>> a patch submission. > >>> > >>>> Similar to what Jon reported, everything gets back to normal if > >>>> I disable pm_async or if I revert the following patches: > >>>> 443046d1ad66 PM: sleep: Make suspend of devices more asynchronous > >>>> aa7a9275ab81 PM: sleep: Suspend async parents after suspending children > >>>> 0cbef962ce1f PM: sleep: Resume children after resuming the parent > >>>> > >>>> I also reverted their fixes when testing: > >>>> 8887abccf8aa PM: sleep: Add locking to dpm_async_resume_children() > >>>> d46c4c839c20 PM: sleep: Fix power.is_suspended cleanup for direct-complete devices > >>>> 079e8889ad13 PM: sleep: Fix list splicing in device suspend error paths > >>>> > >>>> It seems that the hang happens in dpm_suspend() at > >>>> async_synchronize_full() time after a driver fails to suspend. > >>>> The phone then naturally resets with an APC watchdog. > >>>> > >>>> [ 519.142279][ T7917] lwis lwis-eeprom-m24c64x: Can't suspend because eeprom-m24c64x is in use! > >>>> [ 519.143556][ T7917] lwis-i2c eeprom@2: PM: dpm_run_callback(): platform_pm_suspend returns -16 > >>>> [ 519.143872][ T7917] lwis-i2c eeprom@2: PM: platform_pm_suspend returned -16 after 1596 usecs > >>>> [ 519.144197][ T7917] lwis-i2c eeprom@2: PM: failed to suspend: error -16 > >>>> [ 519.144448][ T7917] PM: tudor: dpm_suspend: after while loop, list_empty(&dpm_prepared_list)? 1 > >>>> [ 519.144779][ T7917] PM: tudor: dpm_suspend: before async_synchronize_full > >>>> > >>>> The extra prints are because of: > >>>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > >>>> index d9d4fc58bc5a..3efe538c2ec2 100644 > >>>> --- a/drivers/base/power/main.c > >>>> +++ b/drivers/base/power/main.c > >>>> @@ -1967,10 +1967,15 @@ int dpm_suspend(pm_message_t state) > >>>> break; > >>>> } > >>>> } > >>>> + pr_err("tudor: %s: after while loop, list_empty(&dpm_prepared_list)? %d\n", > >>>> + __func__, list_empty(&dpm_prepared_list)); > >>>> > >>>> mutex_unlock(&dpm_list_mtx); > >>>> > >>>> + pr_err("tudor: %s: before async_synchronize_full\n", __func__); > >>>> async_synchronize_full(); > >>>> + pr_err("tudor: %s: after async_synchronize_full();\n", __func__); > >>>> + > >>>> if (!error) > >>>> error = async_error; > >>>> > >>>> The synchronous suspend works because its strict, one-by-one ordering > >>>> ensures that device dependencies are met and that no device is suspended > >>>> while another is still using it. The asynchronous suspend fails because > >>>> it creates a race condition where the lwis-eeprom-m24c64x is called for > >>>> suspension before the process using it has been suspended, leading to a > >>>> fatal "device busy" error. Should the failure of a device suspend be > >>>> fatal? > >>> > >>> It shouldn't in principle, but it depends on what exactly is involved and how. > >>> > >>> It looks like something is blocking on power.completion somewhere. > >>> I'll check the code, maybe a complete() is missing in an error path or > >>> similar. > >> > >> It doesn't look like anything is missing in the core, so the suspend > >> failure seems to be triggering a deadlock of some sort. > > > > Well, I'm taking this back. > > > > The following scenario definitely can happen: > > > > 1. Device A is async and it depends on device B that is sync. > > 2. Async suspend is scheduled for A before the processing of B is started. > > 3. A is waiting for B. > > 4. In the meantime, an unrelated device fails to suspend and returns an error. > > 5. The processing of B doesn't start at all and its power.completion is not > > updated. > > 6. A is still waiting for B when async_synchronize_full() is called. > > 7. Deadlock ensues. > > > > If this is what happens in your case, the (untested) patch below should help > > (unless I messed it up, that is). > > Thanks, Rafael. > > I added few prints (see updated patch below) to figure out whether > complete_all(&dev->power.completion) is called in my case, and it seems > it's not, I still get the APC watchdog: > > [ 724.361425][ T8468] lwis-i2c eeprom@2: PM: calling platform_pm_suspend @ 8468, parent: platform > [ 724.361751][ T8468] lwis lwis-eeprom-m24c64x: Can't suspend because eeprom-m24c64x is in use! > [ 724.362098][ T8468] lwis-i2c eeprom@2: PM: dpm_run_callback(): platform_pm_suspend returns -16 > [ 724.362427][ T8468] lwis-i2c eeprom@2: PM: platform_pm_suspend returned -16 after 679 usecs > [ 724.362750][ T8468] lwis-i2c eeprom@2: PM: failed to suspend: error -16 > [ 724.362999][ T8468] PM: tudor: dpm_async_suspend_complete_all: enter > [ 724.363242][ T8468] PM: tudor: dpm_suspend: before async_synchronize_full Well, this most likely happens because -> > > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index d9d4fc58bc5a..0e186bc38a00 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -1281,6 +1281,27 @@ static void dpm_async_suspend_parent(struct device *dev, async_func_t func) > dpm_async_with_cleanup(dev->parent, func); > } > > +static void dpm_async_suspend_complete_all(struct list_head *device_list) > +{ > + struct device *dev; > + > + > + pr_err("tudor: %s: enter\n", __func__); > + guard(mutex)(&async_wip_mtx); > + > + list_for_each_entry_reverse(dev, device_list, power.entry) { > + /* > + * In case the device is being waited for and async processing > + * has not started for it yet, let the waiters make progress. > + */ > + pr_err("tudor: %s: in device list\n", __func__); > + if (!dev->power.work_in_progress) { > + pr_err("tudor: %s: call complete_all\n", __func__); > + complete_all(&dev->power.completion); > + } > + } > +} > + > /** > * resume_event - Return a "resume" message for given "suspend" sleep state. > * @sleep_state: PM message representing a sleep state. > @@ -1459,6 +1480,7 @@ static int dpm_noirq_suspend_devices(pm_message_t state) > mutex_lock(&dpm_list_mtx); > > if (error || async_error) { > + dpm_async_suspend_complete_all(&dpm_late_early_list); > /* > * Move all devices to the target list to resume them > * properly. > @@ -1663,6 +1685,7 @@ int dpm_suspend_late(pm_message_t state) > mutex_lock(&dpm_list_mtx); > > if (error || async_error) { > + dpm_async_suspend_complete_all(&dpm_late_early_list); > /* > * Move all devices to the target list to resume them > * properly. > @@ -1959,6 +1982,7 @@ int dpm_suspend(pm_message_t state) > mutex_lock(&dpm_list_mtx); > > if (error || async_error) { > + dpm_async_suspend_complete_all(&dpm_late_early_list); -> There is a bug here which is not present in the patch I've sent. It should be dpm_async_suspend_complete_all(&dpm_prepared_list); It is also there in dpm_noirq_suspend_devices() above, but it probably doesn't matter. > /* > * Move all devices to the target list to resume them > * properly. > @@ -1970,9 +1994,12 @@ int dpm_suspend(pm_message_t state) > > mutex_unlock(&dpm_list_mtx); > > + pr_err("tudor: %s: before async_synchronize_full\n", __func__); > async_synchronize_full(); > if (!error) > error = async_error; > + pr_err("tudor: %s: after async_synchronize_full();\n", __func__); > + > > if (error) > dpm_save_failed_step(SUSPEND_SUSPEND);
On 7/14/25 8:29 AM, Rafael J. Wysocki wrote: >> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >> index d9d4fc58bc5a..0e186bc38a00 100644 >> --- a/drivers/base/power/main.c >> +++ b/drivers/base/power/main.c >> @@ -1281,6 +1281,27 @@ static void dpm_async_suspend_parent(struct device *dev, async_func_t func) >> dpm_async_with_cleanup(dev->parent, func); >> } >> >> +static void dpm_async_suspend_complete_all(struct list_head *device_list) >> +{ >> + struct device *dev; >> + >> + >> + pr_err("tudor: %s: enter\n", __func__); >> + guard(mutex)(&async_wip_mtx); >> + >> + list_for_each_entry_reverse(dev, device_list, power.entry) { >> + /* >> + * In case the device is being waited for and async processing >> + * has not started for it yet, let the waiters make progress. >> + */ >> + pr_err("tudor: %s: in device list\n", __func__); >> + if (!dev->power.work_in_progress) { >> + pr_err("tudor: %s: call complete_all\n", __func__); >> + complete_all(&dev->power.completion); >> + } >> + } >> +} >> + >> /** >> * resume_event - Return a "resume" message for given "suspend" sleep state. >> * @sleep_state: PM message representing a sleep state. >> @@ -1459,6 +1480,7 @@ static int dpm_noirq_suspend_devices(pm_message_t state) >> mutex_lock(&dpm_list_mtx); >> >> if (error || async_error) { >> + dpm_async_suspend_complete_all(&dpm_late_early_list); >> /* >> * Move all devices to the target list to resume them >> * properly. >> @@ -1663,6 +1685,7 @@ int dpm_suspend_late(pm_message_t state) >> mutex_lock(&dpm_list_mtx); >> >> if (error || async_error) { >> + dpm_async_suspend_complete_all(&dpm_late_early_list); >> /* >> * Move all devices to the target list to resume them >> * properly. >> @@ -1959,6 +1982,7 @@ int dpm_suspend(pm_message_t state) >> mutex_lock(&dpm_list_mtx); >> >> if (error || async_error) { >> + dpm_async_suspend_complete_all(&dpm_late_early_list); > -> There is a bug here which is not present in the patch I've sent. My bad, I edited by hand, sorry. > > It should be > > dpm_async_suspend_complete_all(&dpm_prepared_list); Wonderful, it seems this makes suspend happy on downstream pixel6! I'm running some more tests and get back to you in a few hours. > > It is also there in dpm_noirq_suspend_devices() above, but it probably > doesn't matter. > >> /* >> * Move all devices to the target list to resume them >> * properly. >> @@ -1970,9 +1994,12 @@ int dpm_suspend(pm_message_t state) >> >> mutex_unlock(&dpm_list_mtx); >> >> + pr_err("tudor: %s: before async_synchronize_full\n", __func__); >> async_synchronize_full(); >> if (!error) >> error = async_error; >> + pr_err("tudor: %s: after async_synchronize_full();\n", __func__); >> + >> >> if (error) >> dpm_save_failed_step(SUSPEND_SUSPEND);
On 7/14/25 11:35 AM, Tudor Ambarus wrote: > > > On 7/14/25 8:29 AM, Rafael J. Wysocki wrote: >>> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c >>> index d9d4fc58bc5a..0e186bc38a00 100644 >>> --- a/drivers/base/power/main.c >>> +++ b/drivers/base/power/main.c >>> @@ -1281,6 +1281,27 @@ static void dpm_async_suspend_parent(struct device *dev, async_func_t func) >>> dpm_async_with_cleanup(dev->parent, func); >>> } >>> >>> +static void dpm_async_suspend_complete_all(struct list_head *device_list) >>> +{ >>> + struct device *dev; >>> + >>> + >>> + pr_err("tudor: %s: enter\n", __func__); >>> + guard(mutex)(&async_wip_mtx); >>> + >>> + list_for_each_entry_reverse(dev, device_list, power.entry) { >>> + /* >>> + * In case the device is being waited for and async processing >>> + * has not started for it yet, let the waiters make progress. >>> + */ >>> + pr_err("tudor: %s: in device list\n", __func__); >>> + if (!dev->power.work_in_progress) { >>> + pr_err("tudor: %s: call complete_all\n", __func__); >>> + complete_all(&dev->power.completion); >>> + } >>> + } >>> +} >>> + >>> /** >>> * resume_event - Return a "resume" message for given "suspend" sleep state. >>> * @sleep_state: PM message representing a sleep state. >>> @@ -1459,6 +1480,7 @@ static int dpm_noirq_suspend_devices(pm_message_t state) >>> mutex_lock(&dpm_list_mtx); >>> >>> if (error || async_error) { >>> + dpm_async_suspend_complete_all(&dpm_late_early_list); >>> /* >>> * Move all devices to the target list to resume them >>> * properly. >>> @@ -1663,6 +1685,7 @@ int dpm_suspend_late(pm_message_t state) >>> mutex_lock(&dpm_list_mtx); >>> >>> if (error || async_error) { >>> + dpm_async_suspend_complete_all(&dpm_late_early_list); >>> /* >>> * Move all devices to the target list to resume them >>> * properly. >>> @@ -1959,6 +1982,7 @@ int dpm_suspend(pm_message_t state) >>> mutex_lock(&dpm_list_mtx); >>> >>> if (error || async_error) { >>> + dpm_async_suspend_complete_all(&dpm_late_early_list); >> -> There is a bug here which is not present in the patch I've sent. > > My bad, I edited by hand, sorry. > >> >> It should be >> >> dpm_async_suspend_complete_all(&dpm_prepared_list); > > > Wonderful, it seems this makes suspend happy on downstream pixel6! > I'm running some more tests and get back to you in a few hours. > Solves failures on pixel6 downstream: Reported-and-tested-by: Tudor Ambarus <tudor.ambarus@linaro.org> Thanks!
© 2016 - 2025 Red Hat, Inc.