Frederic reported the following possible race:
              [GRP0:0]
           migrator = 0
           active   = 0
           nextevt  = KTIME_MAX
            /      \
           0        1 .. 7
        active      idle
0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.
                               [GRP1:0]
                         migrator = TMIGR_NONE
                         active   = NONE
                         nextevt  = KTIME_MAX
                                       \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = TMIGR_NONE
           active   = 0              active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
        active      idle              !online
1) CPU 8 is booting and creates a new group in first level GRP0:1 and
   therefore also a new top level group GRP1:0. For now the setup code
   has only proceeded as far as the connection between GRP0:1 and the
   new top level group. The connection between CPU 8 and GRP0:1 is not
   yet established and CPU 8 is still !online.
                               [GRP1:0]
                         migrator = TMIGR_NONE
                         active   = NONE
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = TMIGR_NONE
           active   = 0              active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
        active      idle              !online
2) The setup code now connects GRP0:0 to GRP1:0 and observes, while in
   tmigr_connect_child_parent(), that GRP0:0's migrator is not
   TMIGR_NONE. So it prepares to call tmigr_active_up() on it, but
   hasn't done so yet.
                               [GRP1:0]
                         migrator = TMIGR_NONE
                         active   = NONE
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = TMIGR_NONE
           active   = NONE           active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
          idle      idle              !online
3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
   tmigr_update_events() and it propagates the change to the top (no
   change there and no wakeup programmed since there is no timer).
                               [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = TMIGR_NONE
           active   = NONE           active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
          idle      idle              !online
4) Now the setup code finally calls tmigr_active_up() and sets GRP0:0
   active in GRP1:0.
                               [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0, GRP0:1
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = 8
           active   = NONE           active   = 8
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \                    |
           0        1 .. 7              8
          idle      idle              active
5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up()
   out of tmigr_cpu_online().
                               [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T8
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = TMIGR_NONE
           active   = NONE           active   = NONE
           nextevt  = KTIME_MAX      nextevt  = T8
            /      \                    |
           0        1 .. 7              8
          idle      idle               idle
6) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But GRP0:0 is not really active, so T8 gets ignored.
The hierarchy update done in step 3 is not noticed by the setup code. So
a wrong migrator is set in the top level group and a timer could get
ignored.
Rework the activation of an already existing group in setup code after
adding a new top level group and use memory barriers (as already used in
tmigr_inactive_up()) to be sure that child state updates and group state
updates are ordered properly.
The update of the group event's ignore bit is now skipped. But this is
fine, as this bit is only required when queueing the event into the
timer queue of the parent group. As this is always the update of the top
level group, it doesn't matter.
Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
kernel/time/timer_migration.c | 49 ++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 17 deletions(-)
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 0e1c53dac390..7971512a60b0 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1448,6 +1448,37 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
+{
+	union tmigr_state curstate, childstate;
+	bool walk_done;
+
+	/*
+	 * Memory barrier is paired with the cmpxchg in tmigr_inactive_up() to
+	 * make sure updates of child and group states are ordered. The ordering
+	 * is mandatory, as the update of the group state is only valid for when
+	 * child state is active.
+	 */
+	curstate.state = atomic_read_acquire(&group->migr_state);
+
+	for (;;) {
+		childstate.state = atomic_read(&child->migr_state);
+		if (!childstate.active)
+			return;
+
+		if (__tmigr_active_up(group, &walk_done, &curstate, child->childmask))
+			break;
+
+		/*
+		 * The memory barrier is paired with the cmpxchg() in
+		 * tmigr_inactive_up() to make sure the updates of child and
+		 * group states are ordered. It is required only when the
+		 * try_cmpxchg() in __tmigr_active_up() fails.
+		 */
+		smp_mb__after_atomic();
+	}
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1522,8 +1553,6 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
-
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
@@ -1551,21 +1580,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 * executed with the formerly top level group (child) and the newly
 	 * created group (parent).
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
-
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	tmigr_setup_active_up(parent, child);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
--
2.39.2
On Mon, Jun 24, 2024 at 04:53:56PM +0200, Anna-Maria Behnsen wrote:
> [...]
> 1) CPU 8 is booting and creates a new group in first level GRP0:1 and
> therefore also a new top group GRP1:0. For now the setup code proceeded
> only until the connected between GRP0:1 to the new top group. The
*connection
> connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
> still !online.
>
> [...]
Very nice fix, thanks!
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Just some comment suggestions below:
> [...]
> +static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
> +{
> +	union tmigr_state curstate, childstate;
> +	bool walk_done;
> +
> +	/*
> +	 * Memory barrier is paired with the cmpxchg in tmigr_inactive_up() to
> +	 * make sure updates of child and group states are ordered. The ordering
> +	 * is mandatory, as the update of the group state is only valid for when
> +	 * child state is active.
> +	 */
> +	curstate.state = atomic_read_acquire(&group->migr_state);
> +
> +	for (;;) {
	/*
	 * If the child hasn't yet propagated anything to the top level, the above
	 * acquire has no effect. However thanks to child locking (see comment in
	 * tmigr_connect_child_parent()), either the latest child->migr_state is
	 * observed or the remote CPU now observes the new parent and is about to
	 * propagate to the new parent. In the latter case it will either beat the
	 * current trial update or overwrite it.
	 */
And another comment later:
> +		childstate.state = atomic_read(&child->migr_state);
> +		if (!childstate.active)
> +			return;
> [...]
> @@ -1522,8 +1553,6 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
>  static void tmigr_connect_child_parent(struct tmigr_group *child,
>  				       struct tmigr_group *parent)
>  {
> -	union tmigr_state childstate;
> -
	/*
	 * CPU 0				CPU 1
	 * -----				------
	 * // tmigr_connect_child_parent	// tmigr_inactive_up()
	 * LOCK child				WRITE group->migr_state
	 * WRITE child->parent			// tmigr_update_events()
	 * UNLOCK child				LOCK group
	 * // tmigr_setup_active_up		UNLOCK group
	 * READ child->migr_state		// walk_groups()
	 *					READ group->parent
	 *
	 * Due to RELEASE/ACQUIRE semantics, it is ensured that either the
	 * current CPU sees the latest migr_state update on the remote group
	 * or the remote group will observe the new parent.
	 *
	 * Early lockless returns in tmigr_update_events() involve either
	 * observing the new parent or dealing with an intermediate parent that
	 * will trigger relevant locking while walking up.
	 */
Thanks!
On Tue, Jun 25, 2024 at 12:01:27AM +0200, Frederic Weisbecker wrote:
> > [...]
> > +	for (;;) {
>
> /*
> * If the child hasn't yet propagated anything to the top level, the above
> * acquire has no effect. However thanks to child locking (see comment in
> * tmigr_connect_child_parent()), either the latest child->migr_state is
> * observed or the remote CPU now observes the new parent and is about to
> * propagate to the new parent. In the latter case it will either beat the
> * current trial update or overwrite it.
> */
>
> And another comment later:
>
> > +		childstate.state = atomic_read(&child->migr_state);
> > +		if (!childstate.active)
> > +			return;
Hmm, on the other hand the remote group doesn't use locking if some tmc
activates after that, so there is no guarantee that it will see the new
parent. Does that matter?
Thanks.