timer_migration: Fix a possible race and improvements

[PATCH 0/3] timer_migration: Fix a possible race and improvements

Posted by Anna-Maria Behnsen 1 year, 7 months ago

Borislav reported a warning in timer migration deactive path

  https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local

Sadly it doesn't reproduce directly. But with the change of timing (by
adding a trace prinkt before the warning), it is possible to trigger the
warning reliable at least in my test setup. The problem here is a racy
check agains group->parent pointer. This is also used in other places in
the code and fixing this racy usage is adressed by the first patch.

While working with the code, I saw two things which could be improved
(tracing and update of per cpu group wakeup value). This improvements are
adressed by the other two patches.

Patches are available here:

  https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc

Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org

Thanks,

Anna-Maria

---
Anna-Maria Behnsen (3):
      timer_migration: Do not rely always on group->parent
      timer_migration: Spare write when nothing changed
      timer_migration: Improve tracing

 kernel/time/timer_migration.c | 55 ++++++++++++++++++++-----------------------
 kernel/time/timer_migration.h | 12 +++++++++-
 2 files changed, 36 insertions(+), 31 deletions(-)

Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements

Posted by Frederic Weisbecker 1 year, 7 months ago

Le Fri, Jun 21, 2024 at 11:37:05AM +0200, Anna-Maria Behnsen a écrit :
> Borislav reported a warning in timer migration deactive path
> 
>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
> 
> Sadly it doesn't reproduce directly. But with the change of timing (by
> adding a trace prinkt before the warning), it is possible to trigger the
> warning reliable at least in my test setup. The problem here is a racy
> check agains group->parent pointer. This is also used in other places in
> the code and fixing this racy usage is adressed by the first patch.
> 
> While working with the code, I saw two things which could be improved
> (tracing and update of per cpu group wakeup value). This improvements are
> adressed by the other two patches.
> 
> Patches are available here:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc
> 
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> 
> Thanks,
> 
> Anna-Maria
> 
> ---

This made me stare at the group creation again and I might have found
something. Does the following race look plausible to you?


                  [GRP0:0]
               migrator = 0
               active   = 0
               nextevt  = KTIME_MAX
               /         \
              0         1 .. 7
          active         idle

0) Hierarchy has only 8 CPUs (single node for now with only CPU 0
   as active.

   
                             [GRP1:0]
                        migrator = TMIGR_NONE
                        active   = NONE
                        nextevt  = KTIME_MAX
                                         \
                 [GRP0:0]                  [GRP0:1]
              migrator = 0              migrator = TMIGR_NONE
              active   = 0              active   = NONE
              nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
          active         idle                !online

1) CPU 8 is booting and creates a new node and a new top. For now it's
   only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
   __tmigr_cpu_activate() on itself yet.


                             [GRP1:0]
                        migrator = TMIGR_NONE
                        active   = NONE
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = 0              migrator = TMIGR_NONE
              active   = 0              active   = NONE
              nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
          active         idle                active

2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
   tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
   prepares to call tmigr_active_up() on it. It hasn't done it yet.


                             [GRP1:0]
                        migrator = TMIGR_NONE
                        active   = NONE
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE        migrator = TMIGR_NONE
              active   = NONE              active   = NONE
              nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
            idle         idle                active

3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events()
   and it propagates the change to the top (no change there and no wakeup
   programmed since there is no timer).


                             [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE       migrator = TMIGR_NONE
              active   = NONE             active   = NONE
              nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
            idle         idle                active

4) Now CPU 8 finally calls tmigr_active_up() to GRP0:0

                             [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0, GRP0:1
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE       migrator = 8
              active   = NONE             active   = 8
              nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
            idle         idle                active

5) And out of tmigr_cpu_online() CPU 8 calls tmigr_active_up() on itself

                             [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0
                        nextevt  = T8
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE         migrator = TMIGR_NONE
              active   = NONE               active   = NONE
              nextevt  = KTIME_MAX          nextevt  = T8
                /         \                    |
              0          1 .. 7                8
            idle         idle                  idle

5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But it's not really active, so T8 gets ignored.


And if that race looks plausible, does the following fix look good?

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 84413114db5c..0609cb8c770e 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1525,7 +1525,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	child->childmask = BIT(parent->num_children++);
 
 	raw_spin_unlock(&parent->lock);
-	raw_spin_unlock_irq(&child->lock);
 
 	trace_tmigr_connect_child_parent(child);
 
@@ -1559,6 +1558,14 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 		 */
 		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 	}
+	/*
+	 * Keep the lock up to that point so that if the child goes idle
+	 * concurrently, either it sees the new parent with its active state
+	 * after locking on tmigr_update_events() and propagates afterwards
+	 * its idle state up, or the current booting CPU will observe TMIGR_NONE
+	 * on the remote child and it won't propagate a spurious active state.
+	 */
+	raw_spin_unlock_irq(&child->lock);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)

Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements

Posted by Anna-Maria Behnsen 1 year, 7 months ago

Frederic Weisbecker <frederic@kernel.org> writes:

> Le Fri, Jun 21, 2024 at 11:37:05AM +0200, Anna-Maria Behnsen a écrit :
>> Borislav reported a warning in timer migration deactive path
>> 
>>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
>> 
>> Sadly it doesn't reproduce directly. But with the change of timing (by
>> adding a trace prinkt before the warning), it is possible to trigger the
>> warning reliable at least in my test setup. The problem here is a racy
>> check agains group->parent pointer. This is also used in other places in
>> the code and fixing this racy usage is adressed by the first patch.
>> 
>> While working with the code, I saw two things which could be improved
>> (tracing and update of per cpu group wakeup value). This improvements are
>> adressed by the other two patches.
>> 
>> Patches are available here:
>> 
>>   https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc
>> 
>> Cc: Frederic Weisbecker <frederic@kernel.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> 
>> Thanks,
>> 
>> Anna-Maria
>> 
>> ---
>
> This made me stare at the group creation again and I might have found
> something. Does the following race look plausible to you?

Yes...

>
>                   [GRP0:0]
>                migrator = 0
>                active   = 0
>                nextevt  = KTIME_MAX
>                /         \
>               0         1 .. 7
>           active         idle
>
> 0) Hierarchy has only 8 CPUs (single node for now with only CPU 0
>    as active.
>
>    
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                                          \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = 0              migrator = TMIGR_NONE
>               active   = 0              active   = NONE
>               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>           active         idle                !online
>
> 1) CPU 8 is booting and creates a new node and a new top. For now it's
>    only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
>    __tmigr_cpu_activate() on itself yet.

NIT: In step 1) CPU8 is not yet connected to GRP0:1 as the second while
loop is not yet finished, but nevertheless...

>
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = 0              migrator = TMIGR_NONE
>               active   = 0              active   = NONE
>               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>           active         idle                active
>
> 2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
>    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
>    prepares to call tmigr_active_up() on it. It hasn't done it yet.

NIT: CPU8 keeps its state !online until step 5.

>
>
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE        migrator = TMIGR_NONE
>               active   = NONE              active   = NONE
>               nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
>    GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events()
>    and it propagates the change to the top (no change there and no wakeup
>    programmed since there is no timer).
>
>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE       migrator = TMIGR_NONE
>               active   = NONE             active   = NONE
>               nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 4) Now CPU 8 finally calls tmigr_active_up() to GRP0:0

... oh no. Here it starts to go mad. Good catch!

>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0, GRP0:1
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE       migrator = 8
>               active   = NONE             active   = 8
>               nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 5) And out of tmigr_cpu_online() CPU 8 calls tmigr_active_up() on itself
>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0
>                         nextevt  = T8
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE         migrator = TMIGR_NONE
>               active   = NONE               active   = NONE
>               nextevt  = KTIME_MAX          nextevt  = T8
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                  idle
>
> 5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
>    But it's not really active, so T8 gets ignored.
>
>
> And if that race looks plausible, does the following fix look good?

Holding the lock will not help as the state is not protected by the
lock. Step 4) would nevertheless happen. To properly fix this, we need a
memory barrier. I added a comment back then in tmigr_active_up() why the
barrier is not required there but without having this corner case in
mind. Or am I wrong?

To solve it, we could

a) add a memory barrier also in tmigr_active_up() and read the
   childstate there always (similar to tmigr_inactive_up()).

b) create a separate tmigr_setup_active_up() function with this memory
   barrier and with reading the childstate there after the memory
   barrier.

I would propose to go with b) to do not impact active_up().

I dropped the walk_hierarchy information for tmigr_setup_active_up() and
also do not set the group event ignore bit. This shouldn't be required,
as this active update could only happen between the new top level and
the formerly top level group. And the event ignore bit on the top level
isn't taken into account.

This is not tested yet, but it could look like the following. When it
makes sense to you and if I didn't miss something else - I would start
testing and functionality verification ;)

---8<----
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -623,12 +623,37 @@ static u64 tmigr_next_groupevt_expires(s
 		return evt->nextevt.expires;
 }
 
+static bool __tmigr_active_up(struct tmigr_group *group, bool *walk_done, union tmigr_state *curstate, u8 childmask)
+{
+	union tmigr_state newstate;
+
+	newstate = *curstate;
+	*walk_done = true;
+
+	if (newstate.migrator == TMIGR_NONE) {
+		newstate.migrator = childmask;
+
+		/* Changes need to be propagated */
+		*walk_done = false;
+	}
+
+	newstate.active |= childmask;
+	newstate.seq++;
+
+	if (atomic_try_cmpxchg(&group->migr_state, &curstate->state, newstate.state)) {
+		trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+		return true;
+	}
+
+	return false;
+}
+
 static bool tmigr_active_up(struct tmigr_group *group,
 			    struct tmigr_group *child,
 			    void *ptr)
 {
-	union tmigr_state curstate, newstate;
 	struct tmigr_walk *data = ptr;
+	union tmigr_state curstate;
 	bool walk_done;
 	u8 childmask;
 
@@ -640,23 +665,10 @@ static bool tmigr_active_up(struct tmigr
 	 */
 	curstate.state = atomic_read(&group->migr_state);
 
-	do {
-		newstate = curstate;
-		walk_done = true;
-
-		if (newstate.migrator == TMIGR_NONE) {
-			newstate.migrator = childmask;
-
-			/* Changes need to be propagated */
-			walk_done = false;
-		}
-
-		newstate.active |= childmask;
-		newstate.seq++;
-
-	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
-
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+	for(;;) {
+		if (__tmigr_active_up(group, &walk_done, &curstate, childmask))
+		    break;
+	}
 
 	if (walk_done == false)
 		data->childmask = group->childmask;
@@ -1436,6 +1448,35 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
+{
+	union tmigr_state curstate, childstate;
+	bool walk_done;
+
+	/*
+	 * FIXME: Memory barrier is required here as the child state
+	 * could have changed in the meantime
+	 */
+	curstate.state = atomic_read_acquire(&group->migr_state);
+
+	for (;;) {
+		childstate.state = atomic_read(&child->migr_state);
+		if (!childstate.active)
+			return;
+
+		if (__tmigr_active_up(group, &walk_done, &curstate, child->childmask))
+			break;
+
+		/*
+		 * The memory barrier is paired with the cmpxchg() in
+		 * tmigr_inactive_up() to make sure the updates of child and group
+		 * states are ordered. It is required only when the
+		 * try_cmpxchg() in __tmigr_active_up() fails.
+		 */
+		smp_mb__after_atomic();
+	}
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1510,8 +1551,6 @@ static struct tmigr_group *tmigr_get_gro
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
-
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
@@ -1539,21 +1578,7 @@ static void tmigr_connect_child_parent(s
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
-
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	tmigr_setup_active_up(parent, child);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)

Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements

Posted by Frederic Weisbecker 1 year, 7 months ago

On Mon, Jun 24, 2024 at 10:58:26AM +0200, Anna-Maria Behnsen wrote:
> Frederic Weisbecker <frederic@kernel.org> writes:
> > 0) Hierarchy has only 8 CPUs (single node for now with only CPU 0
> >    as active.
> >
> >    
> >                              [GRP1:0]
> >                         migrator = TMIGR_NONE
> >                         active   = NONE
> >                         nextevt  = KTIME_MAX
> >                                          \
> >                  [GRP0:0]                  [GRP0:1]
> >               migrator = 0              migrator = TMIGR_NONE
> >               active   = 0              active   = NONE
> >               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> >                 /         \                    |
> >               0          1 .. 7                8
> >           active         idle                !online
> >
> > 1) CPU 8 is booting and creates a new node and a new top. For now it's
> >    only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
> >    __tmigr_cpu_activate() on itself yet.
> 
> NIT: In step 1) CPU8 is not yet connected to GRP0:1 as the second while
> loop is not yet finished, but nevertheless...

Yes, I was in the second loop in my mind, but that didn't transcribe well :-)

> 
> >
> >                              [GRP1:0]
> >                         migrator = TMIGR_NONE
> >                         active   = NONE
> >                         nextevt  = KTIME_MAX
> >                        /                  \
> >                  [GRP0:0]                  [GRP0:1]
> >               migrator = 0              migrator = TMIGR_NONE
> >               active   = 0              active   = NONE
> >               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> >                 /         \                    |
> >               0          1 .. 7                8
> >           active         idle                active
> >
> > 2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
> >    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
> >    prepares to call tmigr_active_up() on it. It hasn't done it yet.
> 
> NIT: CPU8 keeps its state !online until step 5.

Oops, copy paste hazards.

> Holding the lock will not help as the state is not protected by the
> lock.

No but any access happening before an UNLOCK is guaranteed to be visible
after the next LOCK. This makes sure that either:

1) If the remote CPU going inactive has passed tmigr_update_events() with
its unlock of group->lock then the onlining CPU will see the TMIGR_NONE

or:

1) If the onlining CPU locks before the remote CPU calls tmigr_update_events(),
   then the subsequent LOCK on group->lock in tmigr_update_events() will acquire
   the new parent link and propagate the up the inactive state.

And yes there is an optimization case where we don't lock:

		if (evt->ignore && !remote && group->parent)
			return true;

And that would fall through the cracks. So, forget about that lock idea.


> +static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
> +{
> +	union tmigr_state curstate, childstate;
> +	bool walk_done;
> +
> +	/*
> +	 * FIXME: Memory barrier is required here as the child state
> +	 * could have changed in the meantime
> +	 */
> +	curstate.state = atomic_read_acquire(&group->migr_state);
> +
> +	for (;;) {
> +		childstate.state = atomic_read(&child->migr_state);
> +		if (!childstate.active)
> +			return;

Ok there could have been a risk that we miss the remote CPU going active. But
again thanks to the lock this makes sure that either we observe the childstate
as active or the remote CPU sees the link and propagates its active state. And
the unlocked optimization tmigr_update_events() still works because either it
sees the new parent and proceeds or it sees an intermediate parent and the next
ones will be locked.

Phew!

I'll do a deeper review this evening but it _looks_ ok.

Thanks.

Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements

Posted by Anna-Maria Behnsen 1 year, 7 months ago

Frederic Weisbecker <frederic@kernel.org> writes:

> On Mon, Jun 24, 2024 at 10:58:26AM +0200, Anna-Maria Behnsen wrote:
>> Frederic Weisbecker <frederic@kernel.org> writes:
>> +static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
>> +{
>> +	union tmigr_state curstate, childstate;
>> +	bool walk_done;
>> +
>> +	/*
>> +	 * FIXME: Memory barrier is required here as the child state
>> +	 * could have changed in the meantime
>> +	 */
>> +	curstate.state = atomic_read_acquire(&group->migr_state);
>> +
>> +	for (;;) {
>> +		childstate.state = atomic_read(&child->migr_state);
>> +		if (!childstate.active)
>> +			return;
>
> Ok there could have been a risk that we miss the remote CPU going active. But
> again thanks to the lock this makes sure that either we observe the childstate
> as active or the remote CPU sees the link and propagates its active state. And
> the unlocked optimization tmigr_update_events() still works because either it
> sees the new parent and proceeds or it sees an intermediate parent and the next
> ones will be locked.
>
> Phew!
>
> I'll do a deeper review this evening but it _looks_ ok.
>

I will send you a v2 of the whole timer_migration series where the fix
is also splitted. And review should then be a little easier.

Thanks,

        Anna-Maria