From nobody Mon Dec 15 21:46:54 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 552512206A2 for ; Tue, 14 Jan 2025 23:15:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896514; cv=none; b=LNaWn36LIZj4ez8h8UTnZtjhAVO0cS/OtrOfXYrMTwOnShfulSd4/teoNu5mYFHG5q6906Qjg6uQsbt7RkrYf0KZtNzF8CgekejJVrlSISVp1amTWaSbFLEFnxBVsON3F1xuRBzxWwQUMikSvGB+GhJK5SrqtVbNoIGv793eZLM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896514; c=relaxed/simple; bh=Fn4RfTrADqe2s6TIycpEv4JkOCG1UMzd1VZQ0iUnj0I=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=eGpXwEcrA28ect2gcyUklcgD2JtsMdlOYVQWIaSMa/iAwXdL7MfErZIjlkP+4Hvw7HGzenFONzOgO9zcxmzaLdeGL1iIVNMJgSuD5S4EZ6XVVZuY+ZF7AeNCl5pRIpPjQzVyRPE4fhCIY7OAOs2jJXbtDTFoIgLPXxsodVATrLI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=Y6uERvPw; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="Y6uERvPw" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3436AC4CEE4; Tue, 14 Jan 2025 23:15:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1736896514; bh=Fn4RfTrADqe2s6TIycpEv4JkOCG1UMzd1VZQ0iUnj0I=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=Y6uERvPwEzrEqptKANDSIYg5ELh1JZwI1JiQtudiv8yWNbTI5TWLw0SN9P8qfo+AD sx00c7MI+kiHB2gMlZAt/ZKZBnMt9nipA2K32hBqTeG7uNoeuVBgtjNWB6XbIa+c3r EF9qK8thDirCI+z3mTechhlt0dHejRtwxapQ5ybu+yxyEug2r7W1fgb6wJvhEypgY1 3I2w4+HL1Jsfx96T9YL1HwgdsfLx/MJ6mHr2zFaABbgT7UOvK6MfENn8ZEPMrzNsYW krmZZJRDoW8jTQKKvSA9QSS8c4rg9iRx3TeGa6eDPUzfPIPW2CtTahZ8lJznH/z64C ZcV35VRLQDGSw== From: Frederic Weisbecker To: Thomas Gleixner Cc: LKML , Frederic Weisbecker , anna-maria@linutronix.de Subject: [PATCH 1/4] timers/migration: Fix another race between hotplug and idle entry/exit Date: Wed, 15 Jan 2025 00:15:04 +0100 Message-ID: <20250114231507.21672-2-frederic@kernel.org> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20250114231507.21672-1-frederic@kernel.org> References: <20250114231507.21672-1-frederic@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The following commit: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") has fixed a race between idle exit and cpu hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is still a situation that remains unhandled: [GRP0:0] migrator =3D TMIGR_NONE active =3D NONE groupmask =3D 0 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator =3D CPU 0 active =3D CPU 0 groupmask =3D 0 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator =3D CPU 0 active =3D CPU 0, CPU 1 groupmask =3D 0 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator =3D TMIGR_NONE active =3D NONE groupmask =3D 0 / \ [GRP0:0] [GRP0:1] migrator =3D CPU 0 migrator =3D TMIGR_NONE active =3D CPU 0, CPU1 active =3D NONE groupmask =3D 2 groupmask =3D 1 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator =3D 0 (!!!) active =3D NONE groupmask =3D 0 / \ [GRP0:0] [GRP0:1] migrator =3D CPU 0 migrator =3D TMIGR_NONE active =3D CPU 0, CPU1 active =3D NONE groupmask =3D 2 groupmask =3D 1 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched but the freshly updated groupmask of GRP0:0 may not be visible due to lack of ordering! As a result tmigr_active_up() is called to GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then becomes the migrator for GRP1:0 forever. As a result, timers on a fully idle system get ignored. One possible fix would be to define TMIGR_NONE as "0" so that such a race would have no effect. And after all TMIGR_NONE doesn't need to be anything else. However this would leave an uncomfortable state machine where gears happen not to break by chance but are vulnerable to future modifications. Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of any newly created top level. This groupmask is guaranteed to be visible upon fetching the corresponding group for the 1st time: _ By the upcoming CPU thanks to CPU hotplug synchronization between the control CPU (BP) and the booting one (AP). _ By the control CPU since the groupmask and parent pointers are initialized locally. _ By all CPUs belonging to the same group than the control CPU because they must wait for it to ever become idle before needing to walk to the new top. The cmpcxhg() on ->migr_state then makes sure its groupmask is visible. With this pre-initialization, it is guaranteed that if a future top level is linked to an old one, it is walked through with a valid groupmask. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplu= g prepare callback") Signed-off-by: Frederic Weisbecker --- kernel/time/timer_migration.c | 29 ++++++++++++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 8d57f7686bb0..c8a8ea2e5b98 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1487,6 +1487,21 @@ static void tmigr_init_group(struct tmigr_group *gro= up, unsigned int lvl, s.seq =3D 0; atomic_set(&group->migr_state, s.state); =20 + /* + * If this is a new top-level, prepare its groupmask in advance. + * This avoids accidents where yet another new top-level is + * created in the future and made visible before the current groupmask. + */ + if (list_empty(&tmigr_level_list[lvl])) { + group->groupmask =3D BIT(0); + /* + * The previous top level has prepared its groupmask already, + * simply account it as the first child. + */ + if (lvl > 0) + group->num_children =3D 1; + } + timerqueue_init_head(&group->events); timerqueue_init(&group->groupevt.nextevt); group->groupevt.nextevt.expires =3D KTIME_MAX; @@ -1550,8 +1565,20 @@ static void tmigr_connect_child_parent(struct tmigr_= group *child, raw_spin_lock_irq(&child->lock); raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING); =20 + if (activate) { + /* + * @child is the old top and @parent the new one. In this + * case groupmask is pre-initialized and @child already + * accounted, along with its new sibling corresponding to the + * CPU going up. + */ + WARN_ON_ONCE(child->groupmask !=3D BIT(0) || parent->num_children !=3D 2= ); + } else { + /* Adding @child for the CPU going up to @parent. */ + child->groupmask =3D BIT(parent->num_children++); + } + child->parent =3D parent; - child->groupmask =3D BIT(parent->num_children++); =20 raw_spin_unlock(&parent->lock); raw_spin_unlock_irq(&child->lock); --=20 2.46.0 From nobody Mon Dec 15 21:46:54 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E32AC21ADD1 for ; Tue, 14 Jan 2025 23:15:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896516; cv=none; b=h950RSSMaoSrATF7JV0jzSOCLgETHmwWxDfbXY6BrpoTG2xqIhH/Lf3BDcTnPdm5p/MAVTvsZ3MLBshsA4R58t9ZUomWfdTvQhsXAOPh6bR+i8tYYsRVtrkQWNcJdw4ejwiVEEITYa30Q7aC9AFWoYJMfgDfpPuEdeEqofQvW2Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896516; c=relaxed/simple; bh=dTHKcq79s49NqYofgmRV0XYmTwp0lakY7TEkHdT8jJg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=gWmbykTfThzm73lH8LzPbzp2ynJBqnwBFYBxVvJOATnZT2vE6v3Mr2zeYKghI58TrNQKuQ72X93BAYBA6pQzJbDfPhjUrvm11xw+Jkrei7fBWpbaQ9OQq+DpouQQFeG6VVR22Q3DiMOvD+5OiDodFrGuW1iX5e9INYLPabyNXps= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=gH+hpNky; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="gH+hpNky" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8974BC4CEE3; Tue, 14 Jan 2025 23:15:14 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1736896515; bh=dTHKcq79s49NqYofgmRV0XYmTwp0lakY7TEkHdT8jJg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=gH+hpNkyHfe4F4tGlOoeXndVsJVAXMv8MHTxrVh/paFGm/QG5wWW1p3QsyhZGQI6O MdvMYTLZKg3So7BEe66dscBUf8bd+aLa0VbAetl3zzq9N40N1d7iWaXiCeyN7pTsAr h2arPMVF46Iaf+/kRIjP6eE/MgfUdnDA53uMYa7TaYI3u1adduehl8QyxLmL+RGQwO VerjjIqvoNX68Va5Nd2EGpUh6mfQklh3AbU0t+UQdt4dHd121uy0zALzPYnHNnA4BM 8qFqojvQ8laNgBEVxY3s6VsBEK8ZOmM/BER1AJXRqOpnwkDuUf4MoMedCV/cRKefFx 9EOiAO0OmH/xA== From: Frederic Weisbecker To: Thomas Gleixner Cc: LKML , Frederic Weisbecker , anna-maria@linutronix.de Subject: [PATCH 2/4] timers/migration: Enforce group initialization visibility to tree walkers Date: Wed, 15 Jan 2025 00:15:05 +0100 Message-ID: <20250114231507.21672-3-frederic@kernel.org> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20250114231507.21672-1-frederic@kernel.org> References: <20250114231507.21672-1-frederic@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The following commit: 2522c84db513 ("timers/migration: Fix another race between hotplug and idle entry/exit") has fixed yet another race between idle exit and cpu hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is yet another situation that remains unhandled: [GRP0:0] migrator =3D TMIGR_NONE active =3D NONE groupmask =3D 1 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator =3D CPU 0 active =3D CPU 0 groupmask =3D 1 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator =3D CPU 0 active =3D CPU 0, CPU 1 groupmask =3D 1 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator =3D TMIGR_NONE active =3D NONE groupmask =3D 1 / \ [GRP0:0] [GRP0:1] migrator =3D CPU 0 migrator =3D TMIGR_NONE active =3D CPU 0, CPU1 active =3D NONE groupmask =3D 1 groupmask =3D 2 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator =3D GRP0:0 active =3D GRP0:0 groupmask =3D 1 / \ [GRP0:0] [GRP0:1] migrator =3D CPU 0 migrator =3D TMIGR_NONE active =3D CPU 0, CPU1 active =3D NONE groupmask =3D 1 groupmask =3D 2 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched and the pre-initialized groupmask of GRP0:0 is also visible. As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active and migrator. CPU 0 is returning to __walk_groups() but suffers again a #VMEXIT. [GRP1:0] migrator =3D GRP0:0 active =3D GRP0:0 groupmask =3D 1 / \ [GRP0:0] [GRP0:1] migrator =3D CPU 0 migrator =3D TMIGR_NONE active =3D CPU 0, CPU1 active =3D NONE groupmask =3D 1 groupmask =3D 2 / \ \ 0 1 2..7 8 active active idle !online 5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no effect since CPU 0 did it already. [GRP1:0] migrator =3D GRP0:0 active =3D GRP0:0, GRP0:1 groupmask =3D 1 / \ [GRP0:0] [GRP0:1] migrator =3D CPU 0 migrator =3D CPU 8 active =3D CPU 0, CPU1 active =3D CPU 8 groupmask =3D 1 groupmask =3D 2 / \ \ \ 0 1 2..7 8 active active idle active 6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through CPUHP_AP_TMIGR_ONLINE which propagates activation. [GRP2:0] migrator =3D TMIGR_NONE active =3D NONE groupmask =3D 1 / \ [GRP1:0] [GRP1:1] migrator =3D GRP0:0 migrator =3D TMIGR_NONE active =3D GRP0:0, GRP0:1 active =3D NONE groupmask =3D 1 groupmask =3D 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator =3D CPU 0 migrator =3D CPU 8 migrator =3D T= MIGR_NONE active =3D CPU 0, CPU1 active =3D CPU 8 active =3D N= ONE groupmask =3D 1 groupmask =3D 2 groupmask =3D 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to GRP2:0. [GRP2:0] migrator =3D 0 (!!!) active =3D NONE groupmask =3D 1 / \ [GRP1:0] [GRP1:1] migrator =3D GRP0:0 migrator =3D TMIGR_NONE active =3D GRP0:0, GRP0:1 active =3D NONE groupmask =3D 1 groupmask =3D 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator =3D CPU 0 migrator =3D CPU 8 migrator =3D T= MIGR_NONE active =3D CPU 0, CPU1 active =3D CPU 8 active =3D N= ONE groupmask =3D 1 groupmask =3D 2 groupmask =3D 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP2:0 is visible and fetched but the pre-initialized groupmask of GRP1:0 is not because no ordering made its initialization visible. As a result tmigr_active_up() may be called to GRP2:0 with a "0" child's groumask. Leaving the timers ignored for ever when the system is fully idle. The race is highly theoretical and perhaps impossible in practice but the groupmask of the child is not the only concern here as the whole initialization of the child is not guaranteed to be visible to any tree walker racing against hotplug (idle entry/exit, remote handling, etc...). Although the current code layout seem to be resilient to such hazards, this doesn't tell much about the future. Fix this with enforcing address dependency between group initialization and the write/read to the group's parent's pointer. Fortunately that doesn't involve any barrier addition in the fast paths. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplu= g prepare callback") Signed-off-by: Frederic Weisbecker --- kernel/time/timer_migration.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index c8a8ea2e5b98..371a62a749aa 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -534,8 +534,13 @@ static void __walk_groups(up_f up, struct tmigr_walk *= data, break; =20 child =3D group; - group =3D group->parent; + /* + * Pairs with the store release on group connection + * to make sure group initialization is visible. + */ + group =3D READ_ONCE(group->parent); data->childmask =3D child->groupmask; + WARN_ON_ONCE(!data->childmask); } while (group); } =20 @@ -1578,7 +1583,12 @@ static void tmigr_connect_child_parent(struct tmigr_= group *child, child->groupmask =3D BIT(parent->num_children++); } =20 - child->parent =3D parent; + /* + * Make sure parent initialization is visible before publishing it to a + * racing CPU entering/exiting idle. This RELEASE barrier enforces an + * address dependency that pairs with the READ_ONCE() in __walk_groups(). + */ + smp_store_release(&child->parent, parent); =20 raw_spin_unlock(&parent->lock); raw_spin_unlock_irq(&child->lock); --=20 2.46.0 From nobody Mon Dec 15 21:46:54 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 26356214A95 for ; Tue, 14 Jan 2025 23:15:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896517; cv=none; b=bIbue7SXj0ilm2X2yrNnwi4BtzJBL5kT8WbpyK3NVfTDNkAvWSXsbBQDf/tRok1RgN5ev5jEVh+R4UTin59xfXOZG/jbBn5XfT5ge2dw3/GFXuMbYscrauEtl9jZvEvLC/UDuqyNVcCkSErG+lgMoSa+aIJEU4QIxTK5Jay52gc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896517; c=relaxed/simple; bh=+dkBUYgtIbpGLYVDZf92NPvsm2ocuJ6CpzQYfGq8foM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=YhhBaGBvm6QoT1bshd6AEFfaPQyftosuJb/ndx9JQK7xCqYxDPzrIsj0O61ojvk/sHieXoBkFiNWLpB6KCE7msIz/inTv6Simx+OCMcFjgB5p/BZOmKQ0M4UYjxNGkjKcreqrHw0+GTX5O0S0n63B+z1U/ZL3B0szNEhQ8macKc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=TKNNAPHm; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="TKNNAPHm" Received: by smtp.kernel.org (Postfix) with ESMTPSA id DFE6FC4CEE4; Tue, 14 Jan 2025 23:15:15 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1736896517; bh=+dkBUYgtIbpGLYVDZf92NPvsm2ocuJ6CpzQYfGq8foM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=TKNNAPHmyILEXASXER/Hbd4Ty1F1MvsWD+FYUNrfTvh5J4DCEF884LDcL+Jwig1Al qPkOt38BPQyIrqkpxfAvhm3VRYhH5hyrmCDybesWzS/QWtRh6jZdFj4JpdZG43g0Iw hZXQ9I5sFeUDofDENNWLfrvauW4mEbydVNOR/c3w+kTtu4Lmd2B2JsZHSNdUM22VBa wLt6sjYnatnxQdHuoCTAc0rqhmkfSDy5yP2/T89idJ2NPYg0wvRb9rgiRSJ8rXTOGw Pv6dR4uibYk+ouRp2aAdmO5oC1EoNdQ6mf3PSLtaXFiW/bIXOs88+tkNw9zWk7LKel htdKSD4GHFx0g== From: Frederic Weisbecker To: Thomas Gleixner Cc: LKML , Frederic Weisbecker , anna-maria@linutronix.de, kernel test robot Subject: [PATCH 3/4] timers/migration: Annotate accesses to ignore flag Date: Wed, 15 Jan 2025 00:15:06 +0100 Message-ID: <20250114231507.21672-4-frederic@kernel.org> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20250114231507.21672-1-frederic@kernel.org> References: <20250114231507.21672-1-frederic@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The group's ignore flag is: _ read under the group's lock (idle entry, remote expiry) _ turned on/off under the group's lock (idle entry, remote expiry) _ turned on locklessly on idle exit When idle entry or remote expiry clear the "ignore" flag of a group, the operation must be synchronized against other concurrent idle entry or remote expiry to make sure the related group timer is never missed. To enforce this synchronization, both "ignore" clear and read are performed under the group lock. On the contrary, whether idle entry or remote expiry manage to observe the "ignore" flag turned on by a CPU exiting idle is a matter of optimization. If that flag set is missed or cleared concurrently, the worst outcome is a migrator wasting time remotely handling a "ghost" timer. This is why the ignore flag can be set locklessly. Unfortunately, the related lockless accesses are bare and miss appropriate annotations. KCSAN rightfully complains: BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0: __tmigr_cpu_activate tmigr_cpu_activate timer_clear_idle tick_nohz_restart_sched_tick tick_nohz_idle_exit do_idle cpu_startup_entry kernel_init do_initcalls clear_bss reserve_bios_regions common_startup_64 read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1: print_report kcsan_report_known_origin kcsan_setup_watchpoint tmigr_next_groupevt tmigr_update_events tmigr_inactive_up __walk_groups+0x50/0x77 walk_groups __tmigr_cpu_deactivate tmigr_cpu_deactivate __get_next_timer_interrupt timer_base_try_to_set_idle tick_nohz_stop_tick tick_nohz_idle_stop_tick cpuidle_idle_call do_idle Although the relevant accesses could be marked as data_race(), the "ignore" flag being read several times within the same tmigr_update_events() function is confusing and error prone. Prefer reading it once in that function and make use of similar/paired accesses elsewhere with appropriate comments when necessary. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com Signed-off-by: Frederic Weisbecker --- kernel/time/timer_migration.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 371a62a749aa..066c9ddca4ec 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -569,7 +569,7 @@ static struct tmigr_event *tmigr_next_groupevt(struct t= migr_group *group) while ((node =3D timerqueue_getnext(&group->events))) { evt =3D container_of(node, struct tmigr_event, nextevt); =20 - if (!evt->ignore) { + if (!READ_ONCE(evt->ignore)) { WRITE_ONCE(group->next_expiry, evt->nextevt.expires); return evt; } @@ -665,7 +665,7 @@ static bool tmigr_active_up(struct tmigr_group *group, * lock is held while updating the ignore flag in idle path. So this * state change will not be lost. */ - group->groupevt.ignore =3D true; + WRITE_ONCE(group->groupevt.ignore, true); =20 return walk_done; } @@ -726,6 +726,7 @@ bool tmigr_update_events(struct tmigr_group *group, str= uct tmigr_group *child, union tmigr_state childstate, groupstate; bool remote =3D data->remote; bool walk_done =3D false; + bool ignore; u64 nextexp; =20 if (child) { @@ -744,11 +745,19 @@ bool tmigr_update_events(struct tmigr_group *group, s= truct tmigr_group *child, nextexp =3D child->next_expiry; evt =3D &child->groupevt; =20 - evt->ignore =3D (nextexp =3D=3D KTIME_MAX) ? true : false; + /* + * This can race with concurrent idle exit (activate). + * If the current writer wins, a useless remote expiration may + * be scheduled. If the activate wins, the event is properly + * ignored. + */ + ignore =3D (nextexp =3D=3D KTIME_MAX) ? true : false; + WRITE_ONCE(evt->ignore, ignore); } else { nextexp =3D data->nextexp; =20 first_childevt =3D evt =3D data->evt; + ignore =3D evt->ignore; =20 /* * Walking the hierarchy is required in any case when a @@ -774,7 +783,7 @@ bool tmigr_update_events(struct tmigr_group *group, str= uct tmigr_group *child, * first event information of the group is updated properly and * also handled properly, so skip this fast return path. */ - if (evt->ignore && !remote && group->parent) + if (ignore && !remote && group->parent) return true; =20 raw_spin_lock(&group->lock); @@ -788,7 +797,7 @@ bool tmigr_update_events(struct tmigr_group *group, str= uct tmigr_group *child, * queue when the expiry time changed only or when it could be ignored. */ if (timerqueue_node_queued(&evt->nextevt)) { - if ((evt->nextevt.expires =3D=3D nextexp) && !evt->ignore) { + if ((evt->nextevt.expires =3D=3D nextexp) && !ignore) { /* Make sure not to miss a new CPU event with the same expiry */ evt->cpu =3D first_childevt->cpu; goto check_toplvl; @@ -798,7 +807,7 @@ bool tmigr_update_events(struct tmigr_group *group, str= uct tmigr_group *child, WRITE_ONCE(group->next_expiry, KTIME_MAX); } =20 - if (evt->ignore) { + if (ignore) { /* * When the next child event could be ignored (nextexp is * KTIME_MAX) and there was no remote timer handling before or --=20 2.46.0 From nobody Mon Dec 15 21:46:54 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DF6B7223701 for ; Tue, 14 Jan 2025 23:15:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896520; cv=none; b=sE4jqCElqnoCqWiJDz99wdrtJ7O0T0dQVApDMR4vqoE09j3ZCopUZA4OHsUMr7GIo0PuuRoXHXw/PhwHo+mxeYJ6EPTFZ2WU+GefNfd0o0e51B0uyPmEHzbX1jirooHOc8EiJrp8fNou92Z1TtNG95L5hVNY+mjzd1ZB0963scI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736896520; c=relaxed/simple; bh=sWUWAF396kiMsEfoNAL5bSK9ZkpcgbDFPSBBHHw4b90=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=bdbUBIL5n4M0CVjNYmCs/NBWWAPiUxOhWzzc16DsicqvSZxhGiCvblytIqx0Q8EzyRO6nVHzhYlVPaTwff/r6BagDOrXxPso9cg+1DTIiTzUbupUIJ/vvBm/xxrDRomDXGrZSrYqW/zqgtcPYr3J8tHnmDhJXvueoY6VHlq88KQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=VDuUDRl6; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="VDuUDRl6" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 67BBCC4CEDD; Tue, 14 Jan 2025 23:15:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1736896518; bh=sWUWAF396kiMsEfoNAL5bSK9ZkpcgbDFPSBBHHw4b90=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=VDuUDRl614Bl0+2dKvwwYXd8wbkWGC8G/nfDNCDWvxcMFIwHsdZemKC1Ug23LHNCF zDU+vx+WAQbY4hsBUNFIJjwEi3zv2iA+9h03IJebp1uUL/ghxj1VvFx0lyZcCzdOMW m21T8igTXtQSM0P6QrQp503BKIVi4Kv5jxmc1JoP2H7xT04qTcTnMTKgmt8qqj88hr tuzM++61o1NP5IvQ98z4P1HIDMyL3fO/9UH9tIaA2g4U+tYKhJhu+E09xg4IjfvATG /pybvPR+Tg63Zz/ypR/1ovIM/yVhcgMolJuOKnJGI5yN2Zpqz/9Js+GJOmMhQhUPYD QMf1/onpTjY5g== From: Frederic Weisbecker To: Thomas Gleixner Cc: LKML , Frederic Weisbecker , anna-maria@linutronix.de Subject: [PATCH 4/4] timers/migration: Simplify top level detection on group setup Date: Wed, 15 Jan 2025 00:15:07 +0100 Message-ID: <20250114231507.21672-5-frederic@kernel.org> X-Mailer: git-send-email 2.46.0 In-Reply-To: <20250114231507.21672-1-frederic@kernel.org> References: <20250114231507.21672-1-frederic@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Having a single group on a given level is enough to know this is the top level, because a root has to have at least two children, unless that root is the only group and the children are actual CPUs. Simplify the test in tmigr_setup_groups() accordingly. Signed-off-by: Frederic Weisbecker --- kernel/time/timer_migration.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 066c9ddca4ec..9cb9b6584ea1 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1670,9 +1670,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsig= ned int node) * be different from tmigr_hierarchy_levels, contains only a * single group. */ - if (group->parent || i =3D=3D tmigr_hierarchy_levels || - (list_empty(&tmigr_level_list[i]) && - list_is_singular(&tmigr_level_list[i - 1]))) + if (group->parent || list_is_singular(&tmigr_level_list[i - 1])) break; =20 } while (i < tmigr_hierarchy_levels); --=20 2.46.0