[PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]

Frederic Weisbecker posted 1 patch 1 year, 10 months ago
kernel/time/timer_migration.c | 20 --------------------
1 file changed, 20 deletions(-)
[PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
Posted by Frederic Weisbecker 1 year, 10 months ago
On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
> I notice CPU3 didn't have its own non-deferrable timer queued (local or
> global), so could the following happen?
> 
> 	timer_base_try_to_set_idle():
> 	  __get_next_timer_interrupt():
> 	    fetch_next_timer_interrupt():
> 	      // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
> 	      // tevt->local == tevt->gloabl = KTIME_MAX
> 	    timer_use_tmigr():
> 	      tmigr_cpu_deactivate():
> 	        __tmigr_cpu_deactivate():
> 		  // tmc->cpuevt.ignore untouched still == true
> 		  walk_groups(&tmigr_inactive_up, ...):
> 		    tmigr_inactive_up():
> 		      data->remote = true;
> 		      tmigr_update_events():
> 		        if (child) { // child is NULL
> 			  ...
> 			} else {
> 			  first_childevt = evt = data->evt;
> 
> 			  if (evt->ignore && !remote)
> 			    return true; // no remote tick is picked.
> 			  ...
> 			}

Nice catch! Florian can you try the following?

From b0e335371ed758f68bf4f501246298c98a615b04 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <frederic@kernel.org>
Date: Fri, 15 Mar 2024 00:21:01 +0100
Subject: [PATCH] timer/migration: Remove buggy early return on deactivation

When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.

If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:

           [GRP0:0]
         migrator = 0
         active   = 0
         nextevt  = T0i
          /         \
         0           1
      active (T0i)  idle (T1)

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.

           [GRP0:0]
         migrator = NONE
         active   = NONE
         nextevt  = T0i
          /         \
         0           1
      idle (T0i)  idle (T1)

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.

This isn't proper to single level hierarchy though. A similar issue,
although slightly different, may arise on multi-level:

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.

                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level next
event is now T2 which is returned to CPU 0 as its next effective expiry
to take account for as the global idle migrator. However T1 has been
ignored along the way, leaving it unhandled.

Fix those issues with removing the buggy related early return. Ignored
child events must not prevent from evaluating the other events within
the same group.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/timer_migration.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 8f49b6b96dfd..611cd904f035 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,26 +751,6 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 
 		first_childevt = evt = data->evt;
 
-		/*
-		 * Walking the hierarchy is required in any case when a
-		 * remote expiry was done before. This ensures to not lose
-		 * already queued events in non active groups (see section
-		 * "Required event and timerqueue update after a remote
-		 * expiry" in the documentation at the top).
-		 *
-		 * The two call sites which are executed without a remote expiry
-		 * before, are not prevented from propagating changes through
-		 * the hierarchy by the return:
-		 *  - When entering this path by tmigr_new_timer(), @evt->ignore
-		 *    is never set.
-		 *  - tmigr_inactive_up() takes care of the propagation by
-		 *    itself and ignores the return value. But an immediate
-		 *    return is required because nothing has to be done in this
-		 *    level as the event could be ignored.
-		 */
-		if (evt->ignore && !remote)
-			return true;
-
 		raw_spin_lock(&group->lock);
 
 		childstate.state = 0;
-- 
2.34.1
Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
Posted by Anna-Maria Behnsen 1 year, 10 months ago
Hi Frederic,

I'm sorry sending my concerns late, but I was on sick leave. Keep in
mind, it is definitely possible that my brain is not yet in the timer
migration hierarchy mood after the sick leave :) so please correct me
whenever I'm wrong.

Frederic Weisbecker <frederic@kernel.org> writes:

> On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
>> I notice CPU3 didn't have its own non-deferrable timer queued (local or
>> global), so could the following happen?
>> 
>> 	timer_base_try_to_set_idle():
>> 	  __get_next_timer_interrupt():
>> 	    fetch_next_timer_interrupt():
>> 	      // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
>> 	      // tevt->local == tevt->gloabl = KTIME_MAX
>> 	    timer_use_tmigr():
>> 	      tmigr_cpu_deactivate():
>> 	        __tmigr_cpu_deactivate():
>> 		  // tmc->cpuevt.ignore untouched still == true
>> 		  walk_groups(&tmigr_inactive_up, ...):
>> 		    tmigr_inactive_up():
>> 		      data->remote = true;
>> 		      tmigr_update_events():
>> 		        if (child) { // child is NULL
>> 			  ...
>> 			} else {
>> 			  first_childevt = evt = data->evt;
>> 
>> 			  if (evt->ignore && !remote)
>> 			    return true; // no remote tick is picked.
>> 			  ...
>> 			}
>
> Nice catch! Florian can you try the following?
>
> From b0e335371ed758f68bf4f501246298c98a615b04 Mon Sep 17 00:00:00 2001
> From: Frederic Weisbecker <frederic@kernel.org>
> Date: Fri, 15 Mar 2024 00:21:01 +0100
> Subject: [PATCH] timer/migration: Remove buggy early return on deactivation
>
> When a CPU enters into idle and deactivates itself from the timer
> migration hierarchy without any global timer of its own to propagate,
> the group event of that CPU is set to "ignore" and tmigr_update_events()
> accordingly performs an early return without considering timers queued
> by other CPUs.
>
> If the hierarchy has a single level, and the CPU is the last one to
> enter idle, it will ignore others' global timers, as in the following
> layout:
>
>            [GRP0:0]
>          migrator = 0
>          active   = 0
>          nextevt  = T0i
>           /         \
>          0           1
>       active (T0i)  idle (T1)
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
>
>            [GRP0:0]
>          migrator = NONE
>          active   = NONE
>          nextevt  = T0i
>           /         \
>          0           1
>       idle (T0i)  idle (T1)
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
> unhandled.

This is broken - indeed.

> This isn't proper to single level hierarchy though. A similar issue,
> although slightly different, may arise on multi-level:
>
>                             [GRP1:0]
>                          migrator = GRP0:0
>                          active   = GRP0:0
>                          nextevt  = T0:0i, T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = 0              migrator = NONE
>            active   = 0              active   = NONE
>            nextevt  = T0i            nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>         idle         idle            idle         idle
>
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
>
>                             [GRP1:0]
>                          migrator = GRP0:0
>                          active   = GRP0:0
>                          nextevt  = T0:0i, T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = NONE           migrator = NONE
>            active   = NONE           active   = NONE
>            nextevt  = T0i            nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>         idle         idle            idle         idle
>
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1. The change only propagated up to 1st
> level so far.

Right. T0 doesn't has to be enqueued into the timer queue of GRP0:0 as
this timer could be ignored. So nothing changes directly in GRP0:0.

>                             [GRP1:0]
>                          migrator = NONE
>                          active   = NONE
>                          nextevt  = T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = NONE           migrator = NONE
>            active   = NONE           active   = NONE
>            nextevt  = T0i            nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>         idle         idle            idle         idle
>
> 2) The change now propagates up to the top. tmigr_update_events() finds
> that the child event is ignored and thus removes it. The top level next
> event is now T2 which is returned to CPU 0 as its next effective expiry
> to take account for as the global idle migrator. However T1 has been
> ignored along the way, leaving it unhandled.

Now propagation goes on as GRP0:0 is completely idle. When executing
tmigr_update_events() in the next step of walking the hierarchy via
tmigr_inactive_up(), the arguments for tmigr_update_events() are set in
the following way:

  group = GRP1:0
  child = GRP0:0

Then at the begin of tmigr_update_events() the group event of child is
updated - so all ignored events are removed (T0i), and the
child->groupevt and child->next_expiry is updated with T1. This
reevaluated child->groupevt is then queued/updated in the GRP1:0
timerqueue.

So T1 will be handled!

As there is no parent, the top level group event is updated (see goto
label "check_toplvl") and T1 will be still the first event.

> Fix those issues with removing the buggy related early return. Ignored
> child events must not prevent from evaluating the other events within
> the same group.

I would prefere to keep this early return but skip it, when there is
!group->parent (only a single level in hierarchy).

Then it would prevent taking the group lock and making some random
event updates which are done nevertheless on the next iteration of the
hierarchy walk.

Thanks,

	Anna-Maria
Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
Posted by Frederic Weisbecker 1 year, 10 months ago
Le Tue, Mar 26, 2024 at 05:41:00PM +0100, Anna-Maria Behnsen a écrit :
> Frederic Weisbecker <frederic@kernel.org> writes:
> Now propagation goes on as GRP0:0 is completely idle. When executing
> tmigr_update_events() in the next step of walking the hierarchy via
> tmigr_inactive_up(), the arguments for tmigr_update_events() are set in
> the following way:
> 
>   group = GRP1:0
>   child = GRP0:0
> 
> Then at the begin of tmigr_update_events() the group event of child is
> updated - so all ignored events are removed (T0i), and the
> child->groupevt and child->next_expiry is updated with T1. This
> reevaluated child->groupevt is then queued/updated in the GRP1:0
> timerqueue.
> 
> So T1 will be handled!
> 
> As there is no parent, the top level group event is updated (see goto
> label "check_toplvl") and T1 will be still the first event.

Bah! Good point, I got confused there...

> 
> > Fix those issues with removing the buggy related early return. Ignored
> > child events must not prevent from evaluating the other events within
> > the same group.
> 
> I would prefere to keep this early return but skip it, when there is
> !group->parent (only a single level in hierarchy).
> 
> Then it would prevent taking the group lock and making some random
> event updates which are done nevertheless on the next iteration of the
> hierarchy walk.

Ok sounds like a good plan!

Thanks.

> 
> Thanks,
> 
> 	Anna-Maria
> 
[PATCH] timers/migration: Return early on deactivation
Posted by Anna-Maria Behnsen 1 year, 10 months ago
Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.

But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i, T1        nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :

  lock(GRP0:0->lock);
  timerqueue_del(GRP0:0, T0i);
  unlock(GRP0:0->lock);


                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:0, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):

  lock(GRP0:0->lock);
  lock(GRP1:0->lock);
  evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
					in GRP0:0
  ... update GRP1:0 group event and timerqueue ...
  unlock(GRP1:0->lock);
  unlock(GRP0:0->lock);

So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.

Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group.

Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/timer_migration.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,6 +751,31 @@ bool tmigr_update_events(struct tmigr_gr
 
 		first_childevt = evt = data->evt;
 
+		/*
+		 * Walking the hierarchy is required in any case when a
+		 * remote expiry was done before. This ensures to not lose
+		 * already queued events in non active groups (see section
+		 * "Required event and timerqueue update after a remote
+		 * expiry" in the documentation at the top).
+		 *
+		 * The two call sites which are executed without a remote expiry
+		 * before, are not prevented from propagating changes through
+		 * the hierarchy by the return:
+		 *  - When entering this path by tmigr_new_timer(), @evt->ignore
+		 *    is never set.
+		 *  - tmigr_inactive_up() takes care of the propagation by
+		 *    itself and ignores the return value. But an immediate
+		 *    return is required because nothing has to be done in this
+		 *    level as the event could be ignored.
+		 *
+		 * But, if the hierarchy has only a single level so @group is
+		 * the top level group, make sure first event information of the
+		 * group is updated properly and also handled properly, so skip
+		 * this fast return path.
+		 */
+		if (evt->ignore && !remote && group->parent)
+			return true;
+
 		raw_spin_lock(&group->lock);
 
 		childstate.state = 0;
Re: [PATCH] timers/migration: Return early on deactivation
Posted by Frederic Weisbecker 1 year, 10 months ago
Le Thu, Apr 04, 2024 at 06:50:26PM +0200, Anna-Maria Behnsen a écrit :
> Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
> deactivation") removed the logic to return early in tmigr_update_events()
> on deactivation. With this the problem with a not properly updated first
> global event in a hierarchy containing only a single group was fixed.
> 
> But when having a look at this code path with a hierarchy with more than a
> single level, now unnecessary work is done (example is partially copied
> from the message of the commit mentioned above):
> 
>                             [GRP1:0]
>                          migrator = GRP0:0
>                          active   = GRP0:0
>                          nextevt  = T0:0i, T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = 0              migrator = NONE
>            active   = 0              active   = NONE
>            nextevt  = T0i, T1        nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>       active         idle            idle       idle
> 
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
> 
>                             [GRP1:0]
>                          migrator = GRP0:0
>                          active   = GRP0:0
>                          nextevt  = T0:0i, T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = NONE           migrator = NONE
>            active   = NONE           active   = NONE
>            nextevt  = T1             nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>         idle         idle            idle         idle
> 
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". Without this
> early return the following steps happen in tmigr_update_events() when
> child = null and group = GRP0:0 :
> 
>   lock(GRP0:0->lock);
>   timerqueue_del(GRP0:0, T0i);
>   unlock(GRP0:0->lock);
> 
> 
>                             [GRP1:0]
>                          migrator = NONE
>                          active   = NONE
>                          nextevt  = T0:0, T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = NONE           migrator = NONE
>            active   = NONE           active   = NONE
>            nextevt  = T1             nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>         idle         idle            idle         idle
> 
> 2) The change now propagates up to the top. Then tmigr_update_events()
> updates the group event of GRP0:0 and executes the following steps
> (child = GRP0:0 and group = GRP0:0):
> 
>   lock(GRP0:0->lock);
>   lock(GRP1:0->lock);
>   evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
> 					in GRP0:0
>   ... update GRP1:0 group event and timerqueue ...
>   unlock(GRP1:0->lock);
>   unlock(GRP0:0->lock);
> 
> So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
> the timerqueue is redundand as this is done nevertheless in 2) when
> tmigr_next_groupevt(GRP0:0) is executed.
> 
> Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
> deactivation") and add a condition into return path to skip the return
> only, when hierarchy contains a single group.
> 
> Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Just some comment nits:

> ---
>  kernel/time/timer_migration.c |   25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -751,6 +751,31 @@ bool tmigr_update_events(struct tmigr_gr
>  
>  		first_childevt = evt = data->evt;
>  
> +		/*
> +		 * Walking the hierarchy is required in any case when a
> +		 * remote expiry was done before. This ensures to not lose
> +		 * already queued events in non active groups (see section
> +		 * "Required event and timerqueue update after a remote
> +		 * expiry" in the documentation at the top).
> +		 *
> +		 * The two call sites which are executed without a remote expiry
> +		 * before, are not prevented from propagating changes through
> +		 * the hierarchy by the return:
> +		 *  - When entering this path by tmigr_new_timer(), @evt->ignore
> +		 *    is never set.
> +		 *  - tmigr_inactive_up() takes care of the propagation by
> +		 *    itself and ignores the return value. But an immediate
> +		 *    return is required because nothing has to be done in this
> +		 *    level as the event could be ignored.

It's not exactly required, it's an optimization. How about:

"""
But an immediate return is possible if there is a parent, sparing group
locking at this level, because the upper walking call to the parent will
take care about removing this event from within the group and update
next_expiry accordingly.
"""

> +		 *
> +		 * But, if the hierarchy has only a single level so @group is
> +		 * the top level group, make sure first event information of the
> +		 * group is updated properly and also handled properly, so skip
> +		 * this fast return path.

"""
However if there is no parent, ie: the hierarchy has only a single level so
@group is the top level group, make sure the first event information of the
group is updated properly and also handled properly, so skip
this fast return path.
"""

Thanks.

> +		 */
> +		if (evt->ignore && !remote && group->parent)
> +			return true;
> +
>  		raw_spin_lock(&group->lock);
>  
>  		childstate.state = 0;
[PATCH v2] timers/migration: Return early on deactivation
Posted by Anna-Maria Behnsen 1 year, 10 months ago
Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.

But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i, T1        nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :

  lock(GRP0:0->lock);
  timerqueue_del(GRP0:0, T0i);
  unlock(GRP0:0->lock);


                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:0, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):

  lock(GRP0:0->lock);
  lock(GRP1:0->lock);
  evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
					in GRP0:0
  ... update GRP1:0 group event and timerqueue ...
  unlock(GRP1:0->lock);
  unlock(GRP0:0->lock);

So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.

Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.

Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/timer_migration.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,6 +751,33 @@ bool tmigr_update_events(struct tmigr_gr
 
 		first_childevt = evt = data->evt;
 
+		/*
+		 * Walking the hierarchy is required in any case when a
+		 * remote expiry was done before. This ensures to not lose
+		 * already queued events in non active groups (see section
+		 * "Required event and timerqueue update after a remote
+		 * expiry" in the documentation at the top).
+		 *
+		 * The two call sites which are executed without a remote expiry
+		 * before, are not prevented from propagating changes through
+		 * the hierarchy by the return:
+		 *  - When entering this path by tmigr_new_timer(), @evt->ignore
+		 *    is never set.
+		 *  - tmigr_inactive_up() takes care of the propagation by
+		 *    itself and ignores the return value. But an immediate
+		 *    return is possible if there is a parent, sparing group
+		 *    locking at this level, because the upper walking call to
+		 *    the parent will take care about removing this event from
+		 *    within the group and update next_expiry accordingly.
+		 *
+		 * However if there is no parent, ie: the hierarchy has only a
+		 * single level so @group is the top level group, make sure the
+		 * first event information of the group is updated properly and
+		 * also handled properly, so skip this fast return path.
+		 */
+		if (evt->ignore && !remote && group->parent)
+			return true;
+
 		raw_spin_lock(&group->lock);
 
 		childstate.state = 0;
[tip: timers/urgent] timers/migration: Return early on deactivation
Posted by tip-bot2 for Anna-Maria Behnsen 1 year, 10 months ago
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     7a96a84bfbee96871bb16c70ee3e93d564e190f4
Gitweb:        https://git.kernel.org/tip/7a96a84bfbee96871bb16c70ee3e93d564e190f4
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Fri, 05 Apr 2024 10:53:21 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Fri, 05 Apr 2024 11:05:16 +02:00

timers/migration: Return early on deactivation

Commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") removed the logic to return early in tmigr_update_events()
on deactivation. With this the problem with a not properly updated first
global event in a hierarchy containing only a single group was fixed.

But when having a look at this code path with a hierarchy with more than a
single level, now unnecessary work is done (example is partially copied
from the message of the commit mentioned above):

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i, T1        nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". Without this
early return the following steps happen in tmigr_update_events() when
child = null and group = GRP0:0 :

  lock(GRP0:0->lock);
  timerqueue_del(GRP0:0, T0i);
  unlock(GRP0:0->lock);


                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:0, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T1             nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. Then tmigr_update_events()
updates the group event of GRP0:0 and executes the following steps
(child = GRP0:0 and group = GRP0:0):

  lock(GRP0:0->lock);
  lock(GRP1:0->lock);
  evt = tmigr_next_groupevt(GRP0:0); -> this removes the ignored events
					in GRP0:0
  ... update GRP1:0 group event and timerqueue ...
  unlock(GRP1:0->lock);
  unlock(GRP0:0->lock);

So the dance in 1) with locking the GRP0:0->lock and removing the T0i from
the timerqueue is redundand as this is done nevertheless in 2) when
tmigr_next_groupevt(GRP0:0) is executed.

Revert commit 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on
deactivation") and add a condition into return path to skip the return
only, when hierarchy contains a single group. Adapt comments accordingly.

Fixes: 4b6f4c5a67c0 ("timer/migration: Remove buggy early return on deactivation")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87cyr49on2.fsf@somnus

---
 kernel/time/timer_migration.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index e3075e4..ccba875 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,6 +751,33 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 
 		first_childevt = evt = data->evt;
 
+		/*
+		 * Walking the hierarchy is required in any case when a
+		 * remote expiry was done before. This ensures to not lose
+		 * already queued events in non active groups (see section
+		 * "Required event and timerqueue update after a remote
+		 * expiry" in the documentation at the top).
+		 *
+		 * The two call sites which are executed without a remote expiry
+		 * before, are not prevented from propagating changes through
+		 * the hierarchy by the return:
+		 *  - When entering this path by tmigr_new_timer(), @evt->ignore
+		 *    is never set.
+		 *  - tmigr_inactive_up() takes care of the propagation by
+		 *    itself and ignores the return value. But an immediate
+		 *    return is possible if there is a parent, sparing group
+		 *    locking at this level, because the upper walking call to
+		 *    the parent will take care about removing this event from
+		 *    within the group and update next_expiry accordingly.
+		 *
+		 * However if there is no parent, ie: the hierarchy has only a
+		 * single level so @group is the top level group, make sure the
+		 * first event information of the group is updated properly and
+		 * also handled properly, so skip this fast return path.
+		 */
+		if (evt->ignore && !remote && group->parent)
+			return true;
+
 		raw_spin_lock(&group->lock);
 
 		childstate.state = 0;
Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
Posted by Florian Fainelli 1 year, 10 months ago

On 3/14/2024 6:14 PM, Frederic Weisbecker wrote:
> On Thu, Mar 14, 2024 at 03:05:53PM -0700, Boqun Feng wrote:
>> I notice CPU3 didn't have its own non-deferrable timer queued (local or
>> global), so could the following happen?
>>
>> 	timer_base_try_to_set_idle():
>> 	  __get_next_timer_interrupt():
>> 	    fetch_next_timer_interrupt():
>> 	      // nextevt_local == nextevt_global == basej + NEXT_TIMER_MAX_DELTA
>> 	      // tevt->local == tevt->gloabl = KTIME_MAX
>> 	    timer_use_tmigr():
>> 	      tmigr_cpu_deactivate():
>> 	        __tmigr_cpu_deactivate():
>> 		  // tmc->cpuevt.ignore untouched still == true
>> 		  walk_groups(&tmigr_inactive_up, ...):
>> 		    tmigr_inactive_up():
>> 		      data->remote = true;
>> 		      tmigr_update_events():
>> 		        if (child) { // child is NULL
>> 			  ...
>> 			} else {
>> 			  first_childevt = evt = data->evt;
>>
>> 			  if (evt->ignore && !remote)
>> 			    return true; // no remote tick is picked.
>> 			  ...
>> 			}
> 
> Nice catch! Florian can you try the following?
> 
>  From b0e335371ed758f68bf4f501246298c98a615b04 Mon Sep 17 00:00:00 2001
> From: Frederic Weisbecker <frederic@kernel.org>
> Date: Fri, 15 Mar 2024 00:21:01 +0100
> Subject: [PATCH] timer/migration: Remove buggy early return on deactivation
> 
> When a CPU enters into idle and deactivates itself from the timer
> migration hierarchy without any global timer of its own to propagate,
> the group event of that CPU is set to "ignore" and tmigr_update_events()
> accordingly performs an early return without considering timers queued
> by other CPUs.
> 
> If the hierarchy has a single level, and the CPU is the last one to
> enter idle, it will ignore others' global timers, as in the following
> layout:
> 
>             [GRP0:0]
>           migrator = 0
>           active   = 0
>           nextevt  = T0i
>            /         \
>           0           1
>        active (T0i)  idle (T1)
> 
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> 
>             [GRP0:0]
>           migrator = NONE
>           active   = NONE
>           nextevt  = T0i
>            /         \
>           0           1
>        idle (T0i)  idle (T1)
> 
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
> unhandled.
> 
> This isn't proper to single level hierarchy though. A similar issue,
> although slightly different, may arise on multi-level:
> 
>                              [GRP1:0]
>                           migrator = GRP0:0
>                           active   = GRP0:0
>                           nextevt  = T0:0i, T0:1
>                           /              \
>                [GRP0:0]                  [GRP0:1]
>             migrator = 0              migrator = NONE
>             active   = 0              active   = NONE
>             nextevt  = T0i            nextevt  = T2
>             /         \                /         \
>            0 (T0i)     1 (T1)         2 (T2)      3
>          idle         idle            idle         idle
> 
> 0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
> upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
> CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2
> 
>                              [GRP1:0]
>                           migrator = GRP0:0
>                           active   = GRP0:0
>                           nextevt  = T0:0i, T0:1
>                           /              \
>                [GRP0:0]                  [GRP0:1]
>             migrator = NONE           migrator = NONE
>             active   = NONE           active   = NONE
>             nextevt  = T0i            nextevt  = T2
>             /         \                /         \
>            0 (T0i)     1 (T1)         2 (T2)      3
>          idle         idle            idle         idle
> 
> 1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
> pushed as its next expiry and its own event kept as "ignore". As a result
> tmigr_update_events() ignores T1. The change only propagated up to 1st
> level so far.
> 
>                              [GRP1:0]
>                           migrator = NONE
>                           active   = NONE
>                           nextevt  = T0:1
>                           /              \
>                [GRP0:0]                  [GRP0:1]
>             migrator = NONE           migrator = NONE
>             active   = NONE           active   = NONE
>             nextevt  = T0i            nextevt  = T2
>             /         \                /         \
>            0 (T0i)     1 (T1)         2 (T2)      3
>          idle         idle            idle         idle
> 
> 2) The change now propagates up to the top. tmigr_update_events() finds
> that the child event is ignored and thus removes it. The top level next
> event is now T2 which is returned to CPU 0 as its next effective expiry
> to take account for as the global idle migrator. However T1 has been
> ignored along the way, leaving it unhandled.
> 
> Fix those issues with removing the buggy related early return. Ignored
> child events must not prevent from evaluating the other events within
> the same group.
> 
> Reported-by: Boqun Feng <boqun.feng@gmail.com>
> Reported-by: Florian Fainelli <f.fainelli@gmail.com>

Gave it a good 20 cold boots and suspend to DRAM cycles and none of 
those triggered the reported behavior:

Reported-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>

Thanks for the quick fix Frederic!
-- 
Florian
Re: [PATCH] timer/migration: Remove buggy early return on deactivation [was Re: Unexplained long boot delays [Was Re: [GIT PULL] RCU changes for v6.9]]
Posted by Frederic Weisbecker 1 year, 10 months ago
On Fri, Mar 15, 2024 at 02:14:47AM +0100, Frederic Weisbecker wrote:
> This isn't proper to single level hierarchy though. A similar issue,
> although slightly different, may arise on multi-level:
> 
>                             [GRP1:0]
>                          migrator = GRP0:0
>                          active   = GRP0:0
>                          nextevt  = T0:0i, T0:1
>                          /              \
>               [GRP0:0]                  [GRP0:1]
>            migrator = 0              migrator = NONE
>            active   = 0              active   = NONE
>            nextevt  = T0i            nextevt  = T2
>            /         \                /         \
>           0 (T0i)     1 (T1)         2 (T2)      3
>         idle         idle            idle         idle

The "idle" below "0" above should be "active"...
[tip: timers/urgent] timer/migration: Remove buggy early return on deactivation
Posted by tip-bot2 for Frederic Weisbecker 1 year, 10 months ago
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     4b6f4c5a67c07417bf29d896c76f513a4be07516
Gitweb:        https://git.kernel.org/tip/4b6f4c5a67c07417bf29d896c76f513a4be07516
Author:        Frederic Weisbecker <frederic@kernel.org>
AuthorDate:    Fri, 15 Mar 2024 02:14:47 +01:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Sat, 16 Mar 2024 19:55:46 +01:00

timer/migration: Remove buggy early return on deactivation

When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.

If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:

           [GRP0:0]
         migrator = 0
         active   = 0
         nextevt  = T0i
          /         \
         0           1
      active (T0i)  idle (T1)

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.

           [GRP0:0]
         migrator = NONE
         active   = NONE
         nextevt  = T0i
          /         \
         0           1
      idle (T0i)  idle (T1)

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.

This isn't proper to single level hierarchy though. A similar issue,
although slightly different, may arise on multi-level:

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.

                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level next
event is now T2 which is returned to CPU 0 as its next effective expiry
to take account for as the global idle migrator. However T1 has been
ignored along the way, leaving it unhandled.

Fix those issues with removing the buggy related early return. Ignored
child events must not prevent from evaluating the other events within
the same group.

Reported-by: Boqun Feng <boqun.feng@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen

---
 kernel/time/timer_migration.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 8f49b6b..611cd90 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -751,26 +751,6 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 
 		first_childevt = evt = data->evt;
 
-		/*
-		 * Walking the hierarchy is required in any case when a
-		 * remote expiry was done before. This ensures to not lose
-		 * already queued events in non active groups (see section
-		 * "Required event and timerqueue update after a remote
-		 * expiry" in the documentation at the top).
-		 *
-		 * The two call sites which are executed without a remote expiry
-		 * before, are not prevented from propagating changes through
-		 * the hierarchy by the return:
-		 *  - When entering this path by tmigr_new_timer(), @evt->ignore
-		 *    is never set.
-		 *  - tmigr_inactive_up() takes care of the propagation by
-		 *    itself and ignores the return value. But an immediate
-		 *    return is required because nothing has to be done in this
-		 *    level as the event could be ignored.
-		 */
-		if (evt->ignore && !remote)
-			return true;
-
 		raw_spin_lock(&group->lock);
 
 		childstate.state = 0;