cpu/hotplug: ensure the starting section runs fully regardless of target

[PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Koichiro Den 1 year, 2 months ago

When CONFIG_CPU_HOTPLUG_STATE_CONTROL=y, writing a state within the
STARTING section to 'hotplug/target' file for a fully online cpu can
cause a kernel crash [1]. This occurs because take_cpu_down() disables
the CPU, but the state machine does not reach CPUHP_AP_OFFLINE. As a
result, when cpu stopper thread finishes its work and idle task takes
over, cpuhp_report_idle_dead() crashes on 'BUG_ON(st->state !=
CPUHP_AP_OFFLINE)'.

In the opposite direction, start_secondary() assumes all startup
callbacks have been invoked and transitions to CPUHP_AP_ONLINE_IDLE,
regardless of the written target. This can result in some callbacks in
the section being silently skipped.

Callbacks in STARTING section must not fail in any case and seem
expected to be executed in one continuous sequence. So, modify both
take_cpu_down() and notify_cpu_starting() to ignore st->target and fully
traverse the STARTING section to its appropriate end state. This
resolves the issue and ensures symmetric behavior for both directions.

[1]: example of reproduction steps:

  # grep 'tick:dying' /sys/devices/system/cpu/hotplug/states
    143: tick:dying # whatever in the middle of the section can be used.
                    # (st->cant_stop needs to be false)
  # cat /sys/devices/system/cpu/cpu7/hotplug/target
    238             # fully online
  # echo 143 > /sys/devices/system/cpu/cpu7/hotplug/target

    [  145.091832] ------------[ cut here ]------------
    [  145.092928] kernel BUG at kernel/cpu.c:1365!
    [  145.093960] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    --(snip)--

  With this patch, the crash no longer occurs and the state transitions
  to the opposite end of the section.

  # echo 143 > /sys/devices/system/cpu/cpu7/hotplug/target
  # cat /sys/devices/system/cpu/cpu7/hotplug/state
    88              # "idle:dead"

Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
---
 kernel/cpu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 85fd7ac4561e..34749792b37e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1269,7 +1269,6 @@ void clear_tasks_mm_cpumask(int cpu)
 static int take_cpu_down(void *_param)
 {
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
-	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
 	int err, cpu = smp_processor_id();
 
 	/* Ensure this CPU doesn't handle any more interrupts. */
@@ -1285,8 +1284,9 @@ static int take_cpu_down(void *_param)
 
 	/*
 	 * Invoke the former CPU_DYING callbacks. DYING must not fail!
+	 * Regardless of st->target, it must run through to CPUHP_AP_OFFLINE.
 	 */
-	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
+	cpuhp_invoke_callback_range_nofail(false, cpu, st, CPUHP_AP_OFFLINE);
 
 	/* Park the stopper thread */
 	stop_machine_park(cpu);
@@ -1593,15 +1593,15 @@ void smp_shutdown_nonboot_cpus(unsigned int primary_cpu)
 void notify_cpu_starting(unsigned int cpu)
 {
 	struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
-	enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
 
 	rcutree_report_cpu_starting(cpu);	/* Enables RCU usage on this CPU. */
 	cpumask_set_cpu(cpu, &cpus_booted_once_mask);
 
 	/*
 	 * STARTING must not fail!
+	 * Regardless of st->target, it must run through to CPUHP_AP_ONLINE.
 	 */
-	cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
+	cpuhp_invoke_callback_range_nofail(true, cpu, st, CPUHP_AP_ONLINE);
 }
 
 /*
-- 
2.43.0

Re: [PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Thomas Gleixner 1 year, 2 months ago

On Sat, Dec 07 2024 at 23:47, Koichiro Den wrote:
>  static int take_cpu_down(void *_param)
>  {
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> -	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
>  	int err, cpu = smp_processor_id();
>  
>  	/* Ensure this CPU doesn't handle any more interrupts. */
> @@ -1285,8 +1284,9 @@ static int take_cpu_down(void *_param)
>  
>  	/*
>  	 * Invoke the former CPU_DYING callbacks. DYING must not fail!
> +	 * Regardless of st->target, it must run through to CPUHP_AP_OFFLINE.
>  	 */
> -	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
> +	cpuhp_invoke_callback_range_nofail(false, cpu, st, CPUHP_AP_OFFLINE);

This is really the wrong place. This want's to be enforced at the sysfs
interface already and reject writes which are between AP_OFFLINE and
AP_ONLINE.
  
It's utterly confusing to write a particular target and then magically
end up at some other state.

Thanks,

        tglx

Re: [PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Koichiro Den 1 year, 2 months ago

On Sun, Dec 08, 2024 at 09:34:37PM +0100, Thomas Gleixner wrote:
> On Sat, Dec 07 2024 at 23:47, Koichiro Den wrote:
> >  static int take_cpu_down(void *_param)
> >  {
> >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> > -	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> >  	int err, cpu = smp_processor_id();
> >  
> >  	/* Ensure this CPU doesn't handle any more interrupts. */
> > @@ -1285,8 +1284,9 @@ static int take_cpu_down(void *_param)
> >  
> >  	/*
> >  	 * Invoke the former CPU_DYING callbacks. DYING must not fail!
> > +	 * Regardless of st->target, it must run through to CPUHP_AP_OFFLINE.
> >  	 */
> > -	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
> > +	cpuhp_invoke_callback_range_nofail(false, cpu, st, CPUHP_AP_OFFLINE);
> 
> This is really the wrong place. This want's to be enforced at the sysfs
> interface already and reject writes which are between AP_OFFLINE and
> AP_ONLINE.
>   
> It's utterly confusing to write a particular target and then magically
> end up at some other state.

Ok, I'll send v2. Thanks for the review.

-Koichiro Den

> 
> Thanks,
> 
>         tglx

Re: [PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Koichiro Den 1 year, 2 months ago

On Mon, Dec 09, 2024 at 09:12:44AM +0900, Koichiro Den wrote:
> On Sun, Dec 08, 2024 at 09:34:37PM +0100, Thomas Gleixner wrote:
> > On Sat, Dec 07 2024 at 23:47, Koichiro Den wrote:
> > >  static int take_cpu_down(void *_param)
> > >  {
> > >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> > > -	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> > >  	int err, cpu = smp_processor_id();
> > >  
> > >  	/* Ensure this CPU doesn't handle any more interrupts. */
> > > @@ -1285,8 +1284,9 @@ static int take_cpu_down(void *_param)
> > >  
> > >  	/*
> > >  	 * Invoke the former CPU_DYING callbacks. DYING must not fail!
> > > +	 * Regardless of st->target, it must run through to CPUHP_AP_OFFLINE.
> > >  	 */
> > > -	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
> > > +	cpuhp_invoke_callback_range_nofail(false, cpu, st, CPUHP_AP_OFFLINE);
> > 
> > This is really the wrong place. This want's to be enforced at the sysfs
> > interface already and reject writes which are between AP_OFFLINE and
> > AP_ONLINE.
> >   
> > It's utterly confusing to write a particular target and then magically
> > end up at some other state.
> 
> Ok, I'll send v2. Thanks for the review.

Now I'm wondering whether we should go further..
Because in PREPARE section, things are not fully symmetric, so
there is a problem like an example below:

    E.g.

    (1) writing <some state in between> into 'target' and then (2) bringing
    the the cpu fully online again by writing a large value leaves
    hrtimer_cpu_base's 'online' field at 0 because hrtimers:prepare does not
    re-run.

           - hrtimers:prepare (CPUHP_HRTIMERS_PREPARE)
           - ...
    ------ - <some state in between>
     ^  :  - ...
     :  :  - hrtimers:dying (CPUHP_AP_HRTIMERS_DYING)
    (1)(2)
     :  :
     :  v

While I understand this is still a debug option, it seems to me that there are
several approaches to consider here. I'm inclined toward (a).

(a). prohibit writing all halfway states in PREPARE+STARTING sections, i.e.

     --- a/kernel/cpu.c
     +++ b/kernel/cpu.c
     @@ -2759,7 +2759,8 @@ static ssize_t target_store(struct device *dev,
                     return ret;
     
      #ifdef CONFIG_CPU_HOTPLUG_STATE_CONTROL
     -       if (target < CPUHP_OFFLINE || target > CPUHP_ONLINE)
     +       if (target != CPUHP_OFFLINE || target > CPUHP_ONLINE ||
     +           target < CPUHP_AP_OFFLINE_IDLE)
                     return -EINVAL;
      #else
             if (target != CPUHP_OFFLINE && target != CPUHP_ONLINE)

(b). make all fully symmetric. (I'm not sure whether it could be possible)
(c). add all safety catch at sysfs interface. (For the above example, once
     it goes down to <some state in between>, disallow to go up without
     once going down to a state earlier than hrtimers:prepare beforehand.
     I guess this would mess up source code unnecessarily though.)
...

Let me know what you think.

Thanks,

-Koichiro Den


> 
> -Koichiro Den
> 
> > 
> > Thanks,
> > 
> >         tglx

Re: [PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Thomas Gleixner 1 year, 2 months ago

On Mon, Dec 09 2024 at 13:19, Koichiro Den wrote:
> Now I'm wondering whether we should go further..
> Because in PREPARE section, things are not fully symmetric, so
> there is a problem like an example below:
>
>     E.g.
>
>     (1) writing <some state in between> into 'target' and then (2) bringing
>     the the cpu fully online again by writing a large value leaves
>     hrtimer_cpu_base's 'online' field at 0 because hrtimers:prepare does not
>     re-run.
>
>            - hrtimers:prepare (CPUHP_HRTIMERS_PREPARE)
>            - ...
>     ------ - <some state in between>
>      ^  :  - ...
>      :  :  - hrtimers:dying (CPUHP_AP_HRTIMERS_DYING)
>     (1)(2)
>      :  :
>      :  v

That got broken by me with commit:

  5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")

and want's to be fixed.

> While I understand this is still a debug option, it seems to me that there are
> several approaches to consider here. I'm inclined toward (a).
>
> (a). prohibit writing all halfway states in PREPARE+STARTING sections, i.e.
>
>      --- a/kernel/cpu.c
>      +++ b/kernel/cpu.c
>      @@ -2759,7 +2759,8 @@ static ssize_t target_store(struct device *dev,
>                      return ret;
>      
>       #ifdef CONFIG_CPU_HOTPLUG_STATE_CONTROL
>      -       if (target < CPUHP_OFFLINE || target > CPUHP_ONLINE)
>      +       if (target != CPUHP_OFFLINE || target > CPUHP_ONLINE ||
>      +           target < CPUHP_AP_OFFLINE_IDLE)
>                      return -EINVAL;
>       #else
>              if (target != CPUHP_OFFLINE && target != CPUHP_ONLINE)

That's lame.

> (b). make all fully symmetric. (I'm not sure whether it could be possible)

There is no requirement to make everything symmetric as there are
prepare steps which just allocate memory and do some basic
initialization. So I rather go through the steps and keep the ability to
invoke them fine granular in all directions (except for the atomic AP
states which started this discussion.

> (c). add all safety catch at sysfs interface. (For the above example, once
>      it goes down to <some state in between>, disallow to go up without
>      once going down to a state earlier than hrtimers:prepare beforehand.
>      I guess this would mess up source code unnecessarily though.)

That's unmaintainable.

Thanks,

        tglx

Re: [PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Koichiro Den 1 year, 2 months ago

On Mon, Dec 09, 2024 at 12:51:59PM +0100, Thomas Gleixner wrote:
> On Mon, Dec 09 2024 at 13:19, Koichiro Den wrote:
> > Now I'm wondering whether we should go further..
> > Because in PREPARE section, things are not fully symmetric, so
> > there is a problem like an example below:
> >
> >     E.g.
> >
> >     (1) writing <some state in between> into 'target' and then (2) bringing
> >     the the cpu fully online again by writing a large value leaves
> >     hrtimer_cpu_base's 'online' field at 0 because hrtimers:prepare does not
> >     re-run.
> >
> >            - hrtimers:prepare (CPUHP_HRTIMERS_PREPARE)
> >            - ...
> >     ------ - <some state in between>
> >      ^  :  - ...
> >      :  :  - hrtimers:dying (CPUHP_AP_HRTIMERS_DYING)
> >     (1)(2)
> >      :  :
> >      :  v
> 
> That got broken by me with commit:
> 
>   5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> 
> and want's to be fixed.

I see, so let me address it while I'm at it.

> 
> > While I understand this is still a debug option, it seems to me that there are
> > several approaches to consider here. I'm inclined toward (a).
> >
> > (a). prohibit writing all halfway states in PREPARE+STARTING sections, i.e.
> >
> >      --- a/kernel/cpu.c
> >      +++ b/kernel/cpu.c
> >      @@ -2759,7 +2759,8 @@ static ssize_t target_store(struct device *dev,
> >                      return ret;
> >      
> >       #ifdef CONFIG_CPU_HOTPLUG_STATE_CONTROL
> >      -       if (target < CPUHP_OFFLINE || target > CPUHP_ONLINE)
> >      +       if (target != CPUHP_OFFLINE || target > CPUHP_ONLINE ||
> >      +           target < CPUHP_AP_OFFLINE_IDLE)
> >                      return -EINVAL;
> >       #else
> >              if (target != CPUHP_OFFLINE && target != CPUHP_ONLINE)
> 
> That's lame.

Hmm, could be. (By the way this was incorrect and I've fixed it here:
https://lore.kernel.org/all/20241207144721.2828390-1-koichiro.den@canonical.com/T/#mf6529a51167ffd1be52e1b67169442f486beb084

> 
> > (b). make all fully symmetric. (I'm not sure whether it could be possible)
> 
> There is no requirement to make everything symmetric as there are
> prepare steps which just allocate memory and do some basic
> initialization. So I rather go through the steps and keep the ability to
> invoke them fine granular in all directions (except for the atomic AP
> states which started this discussion.

Makes sense, I'm going to look carefully into every possible step and will
get back with v2 later. Thanks a lot for the review and discussion.

-Koichiro Den

> 
> > (c). add all safety catch at sysfs interface. (For the above example, once
> >      it goes down to <some state in between>, disallow to go up without
> >      once going down to a state earlier than hrtimers:prepare beforehand.
> >      I guess this would mess up source code unnecessarily though.)
> 
> That's unmaintainable.
> 
> Thanks,
> 
>         tglx

Re: [PATCH] cpu/hotplug: ensure the starting section runs fully regardless of target

Posted by Koichiro Den 1 year, 2 months ago

On Mon, Dec 09, 2024 at 01:19:43PM +0900, Koichiro Den wrote:
> On Mon, Dec 09, 2024 at 09:12:44AM +0900, Koichiro Den wrote:
> > On Sun, Dec 08, 2024 at 09:34:37PM +0100, Thomas Gleixner wrote:
> > > On Sat, Dec 07 2024 at 23:47, Koichiro Den wrote:
> > > >  static int take_cpu_down(void *_param)
> > > >  {
> > > >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> > > > -	enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> > > >  	int err, cpu = smp_processor_id();
> > > >  
> > > >  	/* Ensure this CPU doesn't handle any more interrupts. */
> > > > @@ -1285,8 +1284,9 @@ static int take_cpu_down(void *_param)
> > > >  
> > > >  	/*
> > > >  	 * Invoke the former CPU_DYING callbacks. DYING must not fail!
> > > > +	 * Regardless of st->target, it must run through to CPUHP_AP_OFFLINE.
> > > >  	 */
> > > > -	cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
> > > > +	cpuhp_invoke_callback_range_nofail(false, cpu, st, CPUHP_AP_OFFLINE);
> > > 
> > > This is really the wrong place. This want's to be enforced at the sysfs
> > > interface already and reject writes which are between AP_OFFLINE and
> > > AP_ONLINE.
> > >   
> > > It's utterly confusing to write a particular target and then magically
> > > end up at some other state.
> > 
> > Ok, I'll send v2. Thanks for the review.
> 
> Now I'm wondering whether we should go further..
> Because in PREPARE section, things are not fully symmetric, so
> there is a problem like an example below:
> 
>     E.g.
> 
>     (1) writing <some state in between> into 'target' and then (2) bringing
>     the the cpu fully online again by writing a large value leaves
>     hrtimer_cpu_base's 'online' field at 0 because hrtimers:prepare does not
>     re-run.
> 
>            - hrtimers:prepare (CPUHP_HRTIMERS_PREPARE)
>            - ...
>     ------ - <some state in between>
>      ^  :  - ...
>      :  :  - hrtimers:dying (CPUHP_AP_HRTIMERS_DYING)
>     (1)(2)
>      :  :
>      :  v
> 
> While I understand this is still a debug option, it seems to me that there are
> several approaches to consider here. I'm inclined toward (a).
> 
> (a). prohibit writing all halfway states in PREPARE+STARTING sections, i.e.
> 
>      --- a/kernel/cpu.c
>      +++ b/kernel/cpu.c
>      @@ -2759,7 +2759,8 @@ static ssize_t target_store(struct device *dev,
>                      return ret;
>      
>       #ifdef CONFIG_CPU_HOTPLUG_STATE_CONTROL
>      -       if (target < CPUHP_OFFLINE || target > CPUHP_ONLINE)
>      +       if (target != CPUHP_OFFLINE || target > CPUHP_ONLINE ||
>      +           target < CPUHP_AP_OFFLINE_IDLE)
>                      return -EINVAL;
>       #else
>              if (target != CPUHP_OFFLINE && target != CPUHP_ONLINE)

Oops, sorry this was my mistake. it should be:

       --- a/kernel/cpu.c
       +++ b/kernel/cpu.c
       @@ -2759,7 +2759,8 @@ static ssize_t target_store(struct device *dev
                       return ret;
       
        #ifdef CONFIG_CPU_HOTPLUG_STATE_CONTROL
       -       if (target < CPUHP_OFFLINE || target > CPUHP_ONLINE)
       +       if (target < CPUHP_OFFLINE || target > CPUHP_ONLINE ||
       +           (target > CPUHP_OFFLINE && target < CPUHP_AP_OFFLINE_IDLE))
                       return -EINVAL;
        #else
               if (target != CPUHP_OFFLINE && target != CPUHP_ONLINE)

> 
> (b). make all fully symmetric. (I'm not sure whether it could be possible)
> (c). add all safety catch at sysfs interface. (For the above example, once
>      it goes down to <some state in between>, disallow to go up without
>      once going down to a state earlier than hrtimers:prepare beforehand.
>      I guess this would mess up source code unnecessarily though.)
> ...
> 
> Let me know what you think.
> 
> Thanks,
> 
> -Koichiro Den
> 
> 
> > 
> > -Koichiro Den
> > 
> > > 
> > > Thanks,
> > > 
> > >         tglx