Frederic reported the following possible race:
              [GRP0:0]
           migrator = 0
           active   = 0
           nextevt  = KTIME_MAX
            /      \
           0        1 .. 7
        active      idle
0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.
                               [GRP1:0]
                         migrator = TMIGR_NONE
                         active   = NONE
                         nextevt  = KTIME_MAX
                                       \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = TMIGR_NONE
           active   = 0              active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
        active      idle              !online
1) CPU 8 is booting and creates a new group in first level GRP0:1 and
   therefore also a new top level group GRP1:0. For now the setup code
   has only proceeded as far as the connection between GRP0:1 and the
   new top level group. The connection between CPU 8 and GRP0:1 is not
   yet established and CPU 8 is still !online.
                               [GRP1:0]
                         migrator = TMIGR_NONE
                         active   = NONE
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = TMIGR_NONE
           active   = 0              active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
        active      idle              !online
2) The setup code now connects GRP0:0 to GRP1:0 and observes, while in
   tmigr_connect_child_parent(), that GRP0:0's migrator is not
   TMIGR_NONE. So it prepares to call tmigr_active_up() on it, but
   hasn't done so yet.
                               [GRP1:0]
                         migrator = TMIGR_NONE
                         active   = NONE
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = TMIGR_NONE
           active   = NONE           active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
          idle      idle              !online
3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
   tmigr_update_events() and it propagates the change to the top (no
   change there and no wakeup programmed since there is no timer).
                               [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = TMIGR_NONE
           active   = NONE           active   = NONE
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \
           0        1 .. 7              8
          idle      idle              !online
4) Now the setup code finally calls tmigr_active_up() and sets GRP0:0
   active in GRP1:0.
                               [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0, GRP0:1
                         nextevt  = KTIME_MAX
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = 8
           active   = NONE           active   = 8
           nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
            /      \                    |
           0        1 .. 7              8
          idle      idle              active
5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up()
   out of tmigr_cpu_online().
                               [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T8
                          /            \
              [GRP0:0]                  [GRP0:1]
           migrator = TMIGR_NONE     migrator = TMIGR_NONE
           active   = NONE           active   = NONE
           nextevt  = KTIME_MAX      nextevt  = T8
            /      \                    |
           0        1 .. 7              8
          idle      idle               idle
6) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But GRP0:0 is not really active, so T8 gets ignored.
The hierarchy update done in step 3 is not noticed by the setup code. So
a wrong migrator is set in the top level group and a timer could get
ignored.
Rework the activation of an already existing group in setup code after
adding a new top level group and use memory barriers (as already used in
tmigr_inactive_up()) to be sure that child state updates and group state
updates are ordered properly.
The update of the group event's ignore bit is now skipped. But this is
fine, as this bit is only required when queueing the event into the
timer queue of the parent group. As this is always the update of the top
level group, it doesn't matter.
Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
kernel/time/timer_migration.c | 49 ++++++++++++++++++++++++++++---------------
1 file changed, 32 insertions(+), 17 deletions(-)
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 0e1c53dac390..7971512a60b0 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1448,6 +1448,37 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
+{
+	union tmigr_state curstate, childstate;
+	bool walk_done;
+
+	/*
+	 * Memory barrier is paired with the cmpxchg in tmigr_inactive_up() to
+	 * make sure updates of child and group states are ordered. The ordering
+	 * is mandatory, as the update of the group state is only valid for when
+	 * child state is active.
+	 */
+	curstate.state = atomic_read_acquire(&group->migr_state);
+
+	for (;;) {
+		childstate.state = atomic_read(&child->migr_state);
+		if (!childstate.active)
+			return;
+
+		if (__tmigr_active_up(group, &walk_done, &curstate, child->childmask))
+			break;
+
+		/*
+		 * The memory barrier is paired with the cmpxchg() in
+		 * tmigr_inactive_up() to make sure the updates of child and
+		 * group states are ordered. It is required only when the
+		 * try_cmpxchg() in __tmigr_active_up() fails.
+		 */
+		smp_mb__after_atomic();
+	}
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1522,8 +1553,6 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
-
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
@@ -1551,21 +1580,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 * executed with the formerly top level group (child) and the newly
 	 * created group (parent).
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
-
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	tmigr_setup_active_up(parent, child);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
--
2.39.2
On Mon, Jun 24, 2024 at 04:53:56PM +0200, Anna-Maria Behnsen wrote:
> [...]
> 1) CPU 8 is booting and creates a new group in first level GRP0:1 and
> therefore also a new top group GRP1:0. For now the setup code proceeded
> only until the connected between GRP0:1 to the new top group. The
*connection
> connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
> still !online.
>
> [...]
Very nice fix, thanks!
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Just some comment suggestions below:
> [...]
> +static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
> +{
> +	union tmigr_state curstate, childstate;
> +	bool walk_done;
> +
> +	/*
> +	 * Memory barrier is paired with the cmpxchg in tmigr_inactive_up() to
> +	 * make sure updates of child and group states are ordered. The ordering
> +	 * is mandatory, as the update of the group state is only valid for when
> +	 * child state is active.
> +	 */
> +	curstate.state = atomic_read_acquire(&group->migr_state);
> +
> +	for (;;) {
	/*
	 * If the child hasn't yet propagated anything to the top level, the above
	 * acquire has no effect. However thanks to child locking (see comment in
	 * tmigr_connect_child_parent()), either the latest child->migr_state is
	 * observed or the remote CPU now observes the new parent and is about to
	 * propagate to the new parent. In the latter case it will either beat the
	 * current trial update or overwrite it.
	 */
And another comment later:
> +		childstate.state = atomic_read(&child->migr_state);
> +		if (!childstate.active)
> +			return;
> [...]
> @@ -1522,8 +1553,6 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
>  static void tmigr_connect_child_parent(struct tmigr_group *child,
>  				       struct tmigr_group *parent)
>  {
> -	union tmigr_state childstate;
> -
	/*
	 * CPU 0				CPU 1
	 * -----				------
	 * // tmigr_connect_child_parent	// tmigr_inactive_up()
	 * LOCK child				WRITE group->migr_state
	 * WRITE child->parent			// tmigr_update_events()
	 * UNLOCK child				LOCK group
	 * // tmigr_setup_active_up		UNLOCK group
	 * READ child->migr_state		// walk_groups()
	 *					READ group->parent
	 *
	 * Due to RELEASE/ACQUIRE semantics, it is ensured that either the
	 * current CPU sees the latest migr_state update on the remote group
	 * or the remote group will observe the new parent.
	 *
	 * Early lockless returns in tmigr_update_events() involve either
	 * observing the new parent or dealing with an intermediate parent that
	 * will trigger relevant locking while walking up.
	 */
Thanks!
On Tue, Jun 25, 2024 at 12:01:27AM +0200, Frederic Weisbecker wrote:
> > [...]
> > +	for (;;) {
>
> /*
> * If the child hasn't yet propagated anything to the top level, the above
> * acquire has no effect. However thanks to child locking (see comment in
> * tmigr_connect_child_parent()), either the latest child->migr_state is
> * observed or the remote CPU now observes the new parent and is about to
> * propagate to the new parent. In the latter case it will either beat the
> * current trial update or overwrite it.
> */
>
> And another comment later:
>
> > +		childstate.state = atomic_read(&child->migr_state);
> > +		if (!childstate.active)
> > +			return;
Hmm, on the other hand the remote group doesn't use locking if some tmc
activates after that, so there is no guarantee that it will see the new
parent. Does that matter?
Thanks.