cpu/suspend: Do a partial hotplug during suspend

[RFC PATCH v1] cpu/suspend: Do a partial hotplug during suspend

Posted by Saravana Kannan 1 year, 2 months ago

The hotplug state machine goes through 100+ states when transitioning
from online to offline. And on the way back, it goes through these
states in reverse.

When a CPU goes offline, some of the states that occur after a CPU is
powered off are about freeing up various per-CPU resources like
kmalloc caches, pages, network buffers, etc. All of these states make
sense when a CPU is permanently hotplugged off.

However, when offlining a CPU during suspend, we just want to power
down the CPUs to that the system can enter suspend. In this scenario,
we could simply stop the hotplug state machine right after the CPU has
been power off. During resume, we can simply resume the CPU to an
online state from the state where we paused the offline.

This save both time both during suspend and resume and it is
proportional to the number of CPUs in the system. So, if systems with
a large number of CPUs, we can expect this to have a huge amount of
time saved.

On a Pixel 6, averaging across 100+ suspend/resumes cycles, the total
time to power off 7 of the 8 CPUs goes from 51 ms down to 24 ms.
Similarly, the average time to power off each individual CPU (they are
different) also goes down by 50%.

The average time spent powering up CPUs goes down from 34 ms to 32 ms.
Keep in mind that the time saved during resume is not easily
quantified by looking at CPU onlining times. This is because the
actual time savings comes later when per-CPU resources do not need to
be reallocated and would speed up actions like allocations, etc that
can pick up memory from per-CPU kmalloc caches, etc.

Signed-off-by: Saravana Kannan <saravanak@google.com>
---

Hi Thomas/Peter,

The hotplug state machine rewrite is great! Enables all kinds of
optimizations for suspend/resume.

About this patch, I'm not sure if the exact state the hotplug state is
paused at (CPUHP_WORKQUEUE_PREP) will work for all arch/boards, but
this is the general idea.

If it works as is, great! At a glance, it looks like it should work
though. None of the other stages between this and CPUHP_OFFLINE seem
to be touching hardware.

If CPUHP_WORKQUEUE_PREP doesn't work, then we can make it a config
option to select the state or an arch call or something along those
lines.

What are your thoughts on this? How would you like me to proceed?

Btw, ideally, I'd have liked to have paused at CPUHP_TMIGR_PREPARE,
but the hrtimers unwinding seems to be broken if we fail/unwind before
reaching "hrtimers:prepare". I'll send out a separate email on that.

Thanks,
Saravana

 kernel/cpu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index d293d52a3e00..ca21ac017471 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1649,7 +1649,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 	if (st->state >= target)
 		goto out;
 
-	if (st->state == CPUHP_OFFLINE) {
+	if (st->state < CPUHP_BP_KICK_AP) {
 		/* Let it fail before we try to bring the cpu up */
 		idle = idle_thread_get(cpu);
 		if (IS_ERR(idle)) {
@@ -1931,7 +1931,7 @@ int freeze_secondary_cpus(int primary)
 		}
 
 		trace_suspend_resume(TPS("CPU_OFF"), cpu, true);
-		error = _cpu_down(cpu, 1, CPUHP_OFFLINE);
+		error = _cpu_down(cpu, 1, CPUHP_WORKQUEUE_PREP);
 		trace_suspend_resume(TPS("CPU_OFF"), cpu, false);
 		if (!error)
 			cpumask_set_cpu(cpu, frozen_cpus);
-- 
2.47.0.338.g60cca15819-goog

Re: [RFC PATCH v1] cpu/suspend: Do a partial hotplug during suspend

Posted by Peter Zijlstra 1 year, 2 months ago

On Mon, Nov 18, 2024 at 06:05:15PM -0800, Saravana Kannan wrote:
> The hotplug state machine goes through 100+ states when transitioning
> from online to offline. And on the way back, it goes through these
> states in reverse.
> 
> When a CPU goes offline, some of the states that occur after a CPU is
> powered off are about freeing up various per-CPU resources like
> kmalloc caches, pages, network buffers, etc. All of these states make
> sense when a CPU is permanently hotplugged off.
> 
> However, when offlining a CPU during suspend, we just want to power
> down the CPUs to that the system can enter suspend. In this scenario,
> we could simply stop the hotplug state machine right after the CPU has
> been power off. During resume, we can simply resume the CPU to an
> online state from the state where we paused the offline.
> 
> This save both time both during suspend and resume and it is
> proportional to the number of CPUs in the system. So, if systems with
> a large number of CPUs, we can expect this to have a huge amount of
> time saved.
> 
> On a Pixel 6, averaging across 100+ suspend/resumes cycles, the total
> time to power off 7 of the 8 CPUs goes from 51 ms down to 24 ms.
> Similarly, the average time to power off each individual CPU (they are
> different) also goes down by 50%.
> 
> The average time spent powering up CPUs goes down from 34 ms to 32 ms.
> Keep in mind that the time saved during resume is not easily
> quantified by looking at CPU onlining times. This is because the
> actual time savings comes later when per-CPU resources do not need to
> be reallocated and would speed up actions like allocations, etc that
> can pick up memory from per-CPU kmalloc caches, etc.
> 
> Signed-off-by: Saravana Kannan <saravanak@google.com>
> ---
> 
> Hi Thomas/Peter,
> 
> The hotplug state machine rewrite is great! Enables all kinds of
> optimizations for suspend/resume.
> 
> About this patch, I'm not sure if the exact state the hotplug state is
> paused at (CPUHP_WORKQUEUE_PREP) will work for all arch/boards, but
> this is the general idea.
> 
> If it works as is, great! At a glance, it looks like it should work
> though. None of the other stages between this and CPUHP_OFFLINE seem
> to be touching hardware.
> 
> If CPUHP_WORKQUEUE_PREP doesn't work, then we can make it a config
> option to select the state or an arch call or something along those
> lines.
> 
> What are your thoughts on this? How would you like me to proceed?

Well, if we push this one step further, why do we need hotplug at all?
Can't we just keep them up and idle?

That is, if we look at suspend_enter(), you'll note that
PM_SUSPEND_TO_IDLE happens before the whole disable_secondary_cpus()
thing.

So million-dollar question, can this pixel thing do suspend to idle?


Traditionally hybernate is the whole save-to-disk and power machine off
thing, and then there was suspend (to RAM) which was some dodgy as heck
BIOS thing (on x86) which required all non-boot CPUs to be 'dead'.

But does your (aaargh64) platform actually require us to take out the
non-boot CPUs, or is this just histerical raisins?

Re: [RFC PATCH v1] cpu/suspend: Do a partial hotplug during suspend

Posted by Saravana Kannan 1 year, 2 months ago

On Tue, Nov 19, 2024 at 1:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 18, 2024 at 06:05:15PM -0800, Saravana Kannan wrote:
> > The hotplug state machine goes through 100+ states when transitioning
> > from online to offline. And on the way back, it goes through these
> > states in reverse.
> >
> > When a CPU goes offline, some of the states that occur after a CPU is
> > powered off are about freeing up various per-CPU resources like
> > kmalloc caches, pages, network buffers, etc. All of these states make
> > sense when a CPU is permanently hotplugged off.
> >
> > However, when offlining a CPU during suspend, we just want to power
> > down the CPUs to that the system can enter suspend. In this scenario,
> > we could simply stop the hotplug state machine right after the CPU has
> > been power off. During resume, we can simply resume the CPU to an
> > online state from the state where we paused the offline.
> >
> > This save both time both during suspend and resume and it is
> > proportional to the number of CPUs in the system. So, if systems with
> > a large number of CPUs, we can expect this to have a huge amount of
> > time saved.
> >
> > On a Pixel 6, averaging across 100+ suspend/resumes cycles, the total
> > time to power off 7 of the 8 CPUs goes from 51 ms down to 24 ms.
> > Similarly, the average time to power off each individual CPU (they are
> > different) also goes down by 50%.
> >
> > The average time spent powering up CPUs goes down from 34 ms to 32 ms.
> > Keep in mind that the time saved during resume is not easily
> > quantified by looking at CPU onlining times. This is because the
> > actual time savings comes later when per-CPU resources do not need to
> > be reallocated and would speed up actions like allocations, etc that
> > can pick up memory from per-CPU kmalloc caches, etc.
> >
> > Signed-off-by: Saravana Kannan <saravanak@google.com>
> > ---
> >
> > Hi Thomas/Peter,
> >
> > The hotplug state machine rewrite is great! Enables all kinds of
> > optimizations for suspend/resume.
> >
> > About this patch, I'm not sure if the exact state the hotplug state is
> > paused at (CPUHP_WORKQUEUE_PREP) will work for all arch/boards, but
> > this is the general idea.
> >
> > If it works as is, great! At a glance, it looks like it should work
> > though. None of the other stages between this and CPUHP_OFFLINE seem
> > to be touching hardware.
> >
> > If CPUHP_WORKQUEUE_PREP doesn't work, then we can make it a config
> > option to select the state or an arch call or something along those
> > lines.
> >
> > What are your thoughts on this? How would you like me to proceed?
>
> Well, if we push this one step further, why do we need hotplug at all?
> Can't we just keep them up and idle?
>
> That is, if we look at suspend_enter(), you'll note that
> PM_SUSPEND_TO_IDLE happens before the whole disable_secondary_cpus()
> thing.
>
> So million-dollar question, can this pixel thing do suspend to idle?

Unfortunately not. You saw my rant about firmware and s2idle bugs at
LPC. But yes, I'm going my part towards pushing for s2idle over s2ram.

And even if this Pixel could do it, there are a lot of devices in use
today that will never get a firmware update to enable s2idle. So, why
have all of them waste time and energy doing useless steps during
suspend?

> Traditionally hybernate is the whole save-to-disk and power machine off
> thing, and then there was suspend (to RAM) which was some dodgy as heck
> BIOS thing (on x86) which required all non-boot CPUs to be 'dead'.

My change would also help with the time it takes to power off the CPUs
during hibernate :) If it'll work (otherwise, we can make sure this
applies only to suspend).

> But does your (aaargh64) platform actually require us to take out the
> non-boot CPUs, or is this just histerical raisins?

Lol, I had to google histerical raisins to understand what it meant. I
might start using this :)
I'm pretty sure we need to call into the firmware to power off the CPU
so it can do all the housekeeping before powering down the caches.

Thanks,
Saravana

Re: [RFC PATCH v1] cpu/suspend: Do a partial hotplug during suspend

Posted by Peter Zijlstra 1 year, 2 months ago

On Tue, Nov 19, 2024 at 06:28:00PM -0800, Saravana Kannan wrote:
> On Tue, Nov 19, 2024 at 1:28 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > Well, if we push this one step further, why do we need hotplug at all?
> > Can't we just keep them up and idle?
> >
> > That is, if we look at suspend_enter(), you'll note that
> > PM_SUSPEND_TO_IDLE happens before the whole disable_secondary_cpus()
> > thing.
> >
> > So million-dollar question, can this pixel thing do suspend to idle?
> 
> Unfortunately not. You saw my rant about firmware and s2idle bugs at
> LPC. But yes, I'm going my part towards pushing for s2idle over s2ram.

Right, so with Google doing their own chips, I think you stand a fair
chance of making it work 'soon', right? :-)

> And even if this Pixel could do it, there are a lot of devices in use
> today that will never get a firmware update to enable s2idle. So, why
> have all of them waste time and energy doing useless steps during
> suspend?

Right, so if we really want to go do this, we should add place-holder
state for suspend, something like CPUHP_SUSPEND and document the
requirements and audit all existing states now skipped to meet
requirements.

I think it should go somewhere right between CPUHP_BP_PREPARE_DYN_END
and CPUHP_BRINGUP_CPU. WORKQUEUE_PREP seems awefully random, and the
typical purpose of the _PREPARE stages is to allocate memory/resources
such that STARTING can do its thing, similarly _DEAD is about freeing
resources that got unused during _DYING.

So the most logical setup would be to skip the entire _DEAD/_PREPARE
cycle.

> > Traditionally hybernate is the whole save-to-disk and power machine off
> > thing, and then there was suspend (to RAM) which was some dodgy as heck
> > BIOS thing (on x86) which required all non-boot CPUs to be 'dead'.
> 
> My change would also help with the time it takes to power off the CPUs
> during hibernate :) If it'll work (otherwise, we can make sure this
> applies only to suspend).

So I'm not sure you can actually skip this during hibernate. The thing
is, we load the image from the boot CPU in a state where the secondary
CPUs have never yet been loaded up. It might be possible to skip the
DEAD/PREPARE cycle, but it would also mean the image must contain the
full PREPARE resources. So if it all works, then the result is a larger
image, for a slightly faster cycle, but since hibernate is already super
slow, I don't think this trade-off is worth it.

Re: [RFC PATCH v1] cpu/suspend: Do a partial hotplug during suspend

Posted by Saravana Kannan 1 year, 2 months ago

On Wed, Nov 20, 2024 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 19, 2024 at 06:28:00PM -0800, Saravana Kannan wrote:
> > On Tue, Nov 19, 2024 at 1:28 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Well, if we push this one step further, why do we need hotplug at all?
> > > Can't we just keep them up and idle?
> > >
> > > That is, if we look at suspend_enter(), you'll note that
> > > PM_SUSPEND_TO_IDLE happens before the whole disable_secondary_cpus()
> > > thing.
> > >
> > > So million-dollar question, can this pixel thing do suspend to idle?
> >
> > Unfortunately not. You saw my rant about firmware and s2idle bugs at
> > LPC. But yes, I'm going my part towards pushing for s2idle over s2ram.
>
> Right, so with Google doing their own chips, I think you stand a fair
> chance of making it work 'soon', right? :-)

I can neither confirm or deny :)

>
> > And even if this Pixel could do it, there are a lot of devices in use
> > today that will never get a firmware update to enable s2idle. So, why
> > have all of them waste time and energy doing useless steps during
> > suspend?
>
> Right, so if we really want to go do this, we should add place-holder
> state for suspend, something like CPUHP_SUSPEND and document the
> requirements and audit all existing states now skipped to meet
> requirements.

Yup! This is exactly what I had in mind. But didn't want to go full
out on the patch before I got some sanity check here.

>
> I think it should go somewhere right between CPUHP_BP_PREPARE_DYN_END
> and CPUHP_BRINGUP_CPU.

I was thinking before CPUHP_BP_PREPARE_DYN because I saw some drivers
doing whatever the heck they do in CPUHP_BP_PREPARE_DYN. It'll be much
easier to do audits of non-dynamic stuff and keep it within
requirements.

> WORKQUEUE_PREP seems awefully random, and the
> typical purpose of the _PREPARE stages is to allocate memory/resources
> such that STARTING can do its thing, similarly _DEAD is about freeing
> resources that got unused during _DYING.

Yeah, I understood all this. I wanted to pick CPUHP_TMIGR_PREPARE
(mentioned in my first email) because it was right before
CPUHP_BP_PREPARE_DYN (and if you skip over CPUHP_MIPS_SOC_PREPARE
which sounds like a hardware step). But hrtimers seem to have a bug --
if the sequence fails anywhere in between CPUHP_AP_HRTIMERS_DYING and
CPUHP_HRTIMERS_PREPARE things fail badly.

So, for now I'd say we get in something like CPUHP_SUSPEND wherever it
works right now (after WORKQUEUE_PREP) and slowly move it up till we
get it right before CPUHP_BP_PREPARE_DYN.

> So the most logical setup would be to skip the entire _DEAD/_PREPARE
> cycle.

Makes sense to me.

On a separate note, I'm kinda confused by state machine stages where
only one of the startup/teardown callbacks are set up. For example,
I'd think the workqueue_prepare_cpu() would be combined with
workqueue_online_cpu()/workqueue_offline_cpu(). Why is online() not
sufficient to undo whatever offline() did?

>
> > > Traditionally hybernate is the whole save-to-disk and power machine off
> > > thing, and then there was suspend (to RAM) which was some dodgy as heck
> > > BIOS thing (on x86) which required all non-boot CPUs to be 'dead'.
> >
> > My change would also help with the time it takes to power off the CPUs
> > during hibernate :) If it'll work (otherwise, we can make sure this
> > applies only to suspend).
>
> So I'm not sure you can actually skip this during hibernate. The thing
> is, we load the image from the boot CPU in a state where the secondary
> CPUs have never yet been loaded up. It might be possible to skip the
> DEAD/PREPARE cycle, but it would also mean the image must contain the
> full PREPARE resources. So if it all works, then the result is a larger
> image, for a slightly faster cycle, but since hibernate is already super
> slow, I don't think this trade-off is worth it.

Ok, makes sense. We'll have to make some changes so that this doesn't
run for hibernate (it will as this patch is written).

-Saravana

Re: [RFC PATCH v1] cpu/suspend: Do a partial hotplug during suspend

Posted by Thomas Gleixner 1 year, 1 month ago

On Wed, Nov 20 2024 at 13:02, Saravana Kannan wrote:
> On Wed, Nov 20, 2024 at 12:42 AM Peter Zijlstra <peterz@infradead.org> wrote:
> I was thinking before CPUHP_BP_PREPARE_DYN because I saw some drivers
> doing whatever the heck they do in CPUHP_BP_PREPARE_DYN. It'll be much
> easier to do audits of non-dynamic stuff and keep it within
> requirements.
>
>> WORKQUEUE_PREP seems awefully random, and the
>> typical purpose of the _PREPARE stages is to allocate memory/resources
>> such that STARTING can do its thing, similarly _DEAD is about freeing
>> resources that got unused during _DYING.
>
> Yeah, I understood all this. I wanted to pick CPUHP_TMIGR_PREPARE
> (mentioned in my first email) because it was right before
> CPUHP_BP_PREPARE_DYN (and if you skip over CPUHP_MIPS_SOC_PREPARE
> which sounds like a hardware step). But hrtimers seem to have a bug --
> if the sequence fails anywhere in between CPUHP_AP_HRTIMERS_DYING and
> CPUHP_HRTIMERS_PREPARE things fail badly.

Yes, that's known and someone is working on it. Here is the thread:

  https://lore.kernel.org/all/87wmg9oyzk.ffs@tglx

> So, for now I'd say we get in something like CPUHP_SUSPEND wherever it
> works right now (after WORKQUEUE_PREP) and slowly move it up till we
> get it right before CPUHP_BP_PREPARE_DYN.
>
>> So the most logical setup would be to skip the entire _DEAD/_PREPARE
>> cycle.
>
> Makes sense to me.
>
> On a separate note, I'm kinda confused by state machine stages where
> only one of the startup/teardown callbacks are set up. For example,
> I'd think the workqueue_prepare_cpu() would be combined with
> workqueue_online_cpu()/workqueue_offline_cpu(). Why is online() not
> sufficient to undo whatever offline() did?

Some of this is purely historical and was more or less blindly converted
from the original notifier chains.

Other things are one-off initializations to allocate and initialize
memory, which is never freed again under the assumption that the CPUs
will come back online.

But yes, this needs to be looked at on a state by state basis.

Thanks,

        tglx