include/linux/hrtimer.h | 1 + kernel/cpu.c | 2 +- kernel/time/hrtimer.c | 9 +++++++++ 3 files changed, 11 insertions(+), 1 deletion(-)
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a cpu hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE. Since hrtimers_prepare_cpu() does not run, .hres_active
remains set to 1 throughout. However, during the cpuhotplug down, the
tick and clockevents are shut down at CPUHP_AP_TICK_DYING. On the return
to the online state, for instance CFS incorrectly assumes that hrtick is
already active, and the chance of clockevent device to transition to
oneshot mode is also lost forever for the cpu, unless it goes back to a
lower state than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue; .online is not reset to 1 after
the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Reset these fields at the appropriate points. Ideally, .hres_active
would be reset in tick_cpu_dying() as underlying clockevent is shut down
there. However, since these operations occur in the atomic AP section
with interrupts disabled in any case, this patch resets .hres_active to
0 in hrtimers_cpu_dying() for simplicity.
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
---
include/linux/hrtimer.h | 1 +
kernel/cpu.c | 2 +-
kernel/time/hrtimer.c | 9 +++++++++
3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 7ef5f7ef31a9..a5fdd305cabd 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -387,6 +387,7 @@ extern void sysrq_timer_list_show(void);
int hrtimers_prepare_cpu(unsigned int cpu);
#ifdef CONFIG_HOTPLUG_CPU
+int hrtimers_cpu_starting(unsigned int cpu);
int hrtimers_cpu_dying(unsigned int cpu);
#else
#define hrtimers_cpu_dying NULL
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 85fd7ac4561e..34f1a09349fc 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2180,7 +2180,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
},
[CPUHP_AP_HRTIMERS_DYING] = {
.name = "hrtimers:dying",
- .startup.single = NULL,
+ .startup.single = hrtimers_cpu_starting,
.teardown.single = hrtimers_cpu_dying,
},
[CPUHP_AP_TICK_DYING] = {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 80fe3749d2db..98f23c9341f5 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2246,6 +2246,14 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
}
}
+int hrtimers_cpu_starting(unsigned int cpu)
+{
+ struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+ cpu_base->online = 1;
+ return 0;
+}
+
int hrtimers_cpu_dying(unsigned int dying_cpu)
{
int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
@@ -2275,6 +2283,7 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
raw_spin_unlock(&new_base->lock);
+ old_base->hres_active = 0;
old_base->online = 0;
raw_spin_unlock(&old_base->lock);
--
2.43.0
On Fri, Dec 20 2024 at 22:44, Koichiro Den wrote:
> Reset these fields at the appropriate points. Ideally, .hres_active
> would be reset in tick_cpu_dying() as underlying clockevent is shut down
> there. However, since these operations occur in the atomic AP section
> with interrupts disabled in any case, this patch resets .hres_active to
> 0 in hrtimers_cpu_dying() for simplicity.
It does not matter where it is reset and from a hrtimer point of view
the dying() callback is the appropriate place.
Thanks,
tglx
On Thu, Jan 16 2025 at 10:16, Thomas Gleixner wrote: > On Fri, Dec 20 2024 at 22:44, Koichiro Den wrote: >> Reset these fields at the appropriate points. Ideally, .hres_active >> would be reset in tick_cpu_dying() as underlying clockevent is shut down >> there. However, since these operations occur in the atomic AP section >> with interrupts disabled in any case, this patch resets .hres_active to >> 0 in hrtimers_cpu_dying() for simplicity. > > It does not matter where it is reset and from a hrtimer point of view > the dying() callback is the appropriate place. That said. The change is actually incomplete as the other fields which are related to the per CPU base state are not reset either.
On Thu, Jan 16, 2025 at 10:24:07AM GMT, Thomas Gleixner wrote: > On Thu, Jan 16 2025 at 10:16, Thomas Gleixner wrote: > > > On Fri, Dec 20 2024 at 22:44, Koichiro Den wrote: > >> Reset these fields at the appropriate points. Ideally, .hres_active > >> would be reset in tick_cpu_dying() as underlying clockevent is shut down > >> there. However, since these operations occur in the atomic AP section > >> with interrupts disabled in any case, this patch resets .hres_active to > >> 0 in hrtimers_cpu_dying() for simplicity. > > > > It does not matter where it is reset and from a hrtimer point of view > > the dying() callback is the appropriate place. > > That said. The change is actually incomplete as the other fields which > are related to the per CPU base state are not reset either. > Thank you for the feedback and handling it. -Koichiro Den
On Fri, Dec 20 2024 at 22:44, Koichiro Den wrote:
> int hrtimers_prepare_cpu(unsigned int cpu);
> #ifdef CONFIG_HOTPLUG_CPU
> +int hrtimers_cpu_starting(unsigned int cpu);
This does not compile for CONFIG_HOTPLUG_CPU=n
> int hrtimers_cpu_dying(unsigned int cpu);
> #else
> #define hrtimers_cpu_dying NULL
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 85fd7ac4561e..34f1a09349fc 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2180,7 +2180,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
> },
> [CPUHP_AP_HRTIMERS_DYING] = {
> .name = "hrtimers:dying",
> - .startup.single = NULL,
> + .startup.single = hrtimers_cpu_starting,
> .teardown.single = hrtimers_cpu_dying,
> },
> [CPUHP_AP_TICK_DYING] = {
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 80fe3749d2db..98f23c9341f5 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -2246,6 +2246,14 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
> }
> }
>
> +int hrtimers_cpu_starting(unsigned int cpu)
> +{
> + struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
this_cpu_ptr()
> + cpu_base->online = 1;
With that setting cpu_base->online in the prepare_cpu() callback
does not make any sense.
I'll fix it up for you this time.
Thanks,
tglx
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: 2f8dea1692eef2b7ba6a256246ed82c365fdc686
Gitweb: https://git.kernel.org/tip/2f8dea1692eef2b7ba6a256246ed82c365fdc686
Author: Koichiro Den <koichiro.den@canonical.com>
AuthorDate: Fri, 20 Dec 2024 22:44:21 +09:00
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 16 Jan 2025 13:06:14 +01:00
hrtimers: Handle CPU state correctly on hotplug
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:
Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, for instance CFS incorrectly assumes that the hrtick is already
active, and the chance of the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a lower state
than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue; cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Aside of that, the bulk of the per CPU state is not reset either, which
means there are dangling pointers in the worst case.
Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.
[ tglx: Make the new callback unconditionally available, remove the online
modification in the prepare() callback and clear the remaining
state in the starting callback instead of the prepare callback ]
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
---
include/linux/hrtimer.h | 1 +
kernel/cpu.c | 2 +-
kernel/time/hrtimer.c | 11 ++++++++++-
3 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 7ef5f7e..f7bfdcf 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -386,6 +386,7 @@ extern void __init hrtimers_init(void);
extern void sysrq_timer_list_show(void);
int hrtimers_prepare_cpu(unsigned int cpu);
+int hrtimers_cpu_starting(unsigned int cpu);
#ifdef CONFIG_HOTPLUG_CPU
int hrtimers_cpu_dying(unsigned int cpu);
#else
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b605334..0509a97 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2179,7 +2179,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
},
[CPUHP_AP_HRTIMERS_DYING] = {
.name = "hrtimers:dying",
- .startup.single = NULL,
+ .startup.single = hrtimers_cpu_starting,
.teardown.single = hrtimers_cpu_dying,
},
[CPUHP_AP_TICK_DYING] = {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 80fe374..030426c 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2202,6 +2202,15 @@ int hrtimers_prepare_cpu(unsigned int cpu)
}
cpu_base->cpu = cpu;
+ hrtimer_cpu_base_init_expiry_lock(cpu_base);
+ return 0;
+}
+
+int hrtimers_cpu_starting(unsigned int cpu)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+
+ /* Clear out any left over state from a CPU down operation */
cpu_base->active_bases = 0;
cpu_base->hres_active = 0;
cpu_base->hang_detected = 0;
@@ -2210,7 +2219,6 @@ int hrtimers_prepare_cpu(unsigned int cpu)
cpu_base->expires_next = KTIME_MAX;
cpu_base->softirq_expires_next = KTIME_MAX;
cpu_base->online = 1;
- hrtimer_cpu_base_init_expiry_lock(cpu_base);
return 0;
}
@@ -2286,5 +2294,6 @@ int hrtimers_cpu_dying(unsigned int dying_cpu)
void __init hrtimers_init(void)
{
hrtimers_prepare_cpu(smp_processor_id());
+ hrtimers_cpu_starting(smp_processor_id());
open_softirq(HRTIMER_SOFTIRQ, hrtimer_run_softirq);
}
The following commit has been merged into the timers/urgent branch of tip:
Commit-ID: e00954d8b0a2d54075131fbad4a11b2b7355eee1
Gitweb: https://git.kernel.org/tip/e00954d8b0a2d54075131fbad4a11b2b7355eee1
Author: Koichiro Den <koichiro.den@canonical.com>
AuthorDate: Fri, 20 Dec 2024 22:44:21 +09:00
Committer: Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 16 Jan 2025 10:39:25 +01:00
hrtimers: Handle CPU state correctly on hotplug
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:
Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, for instance CFS incorrectly assumes that the hrtick is already
active, and the chance of the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a lower state
than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue; cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Aside of that, the bulk of the per CPU state is not reset either, which
means there are dangling pointers in the worst case.
Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.
[ tglx: Make the new callback unconditionally available, remove the online
modification in the prepare() callback and clear the remaining
state in the starting callback instead of the prepare callback ]
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
---
include/linux/hrtimer.h | 1 +
kernel/cpu.c | 2 +-
kernel/time/hrtimer.c | 10 +++++++++-
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 7ef5f7e..f7bfdcf 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -386,6 +386,7 @@ extern void __init hrtimers_init(void);
extern void sysrq_timer_list_show(void);
int hrtimers_prepare_cpu(unsigned int cpu);
+int hrtimers_cpu_starting(unsigned int cpu);
#ifdef CONFIG_HOTPLUG_CPU
int hrtimers_cpu_dying(unsigned int cpu);
#else
diff --git a/kernel/cpu.c b/kernel/cpu.c
index b605334..0509a97 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2179,7 +2179,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
},
[CPUHP_AP_HRTIMERS_DYING] = {
.name = "hrtimers:dying",
- .startup.single = NULL,
+ .startup.single = hrtimers_cpu_starting,
.teardown.single = hrtimers_cpu_dying,
},
[CPUHP_AP_TICK_DYING] = {
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 80fe374..edb04af 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -2202,6 +2202,15 @@ int hrtimers_prepare_cpu(unsigned int cpu)
}
cpu_base->cpu = cpu;
+ hrtimer_cpu_base_init_expiry_lock(cpu_base);
+ return 0;
+}
+
+int hrtimers_cpu_starting(unsigned int cpu)
+{
+ struct hrtimer_cpu_base *cpu_base = this_cpu_ptr(&hrtimer_bases);
+
+ /* Clear out any left over state from a CPU down operation */
cpu_base->active_bases = 0;
cpu_base->hres_active = 0;
cpu_base->hang_detected = 0;
@@ -2210,7 +2219,6 @@ int hrtimers_prepare_cpu(unsigned int cpu)
cpu_base->expires_next = KTIME_MAX;
cpu_base->softirq_expires_next = KTIME_MAX;
cpu_base->online = 1;
- hrtimer_cpu_base_init_expiry_lock(cpu_base);
return 0;
}
On Thu, Jan 16, 2025 at 09:50:38AM -0000, tip-bot2 for Koichiro Den wrote:
> The following commit has been merged into the timers/urgent branch of tip:
>
> Commit-ID: e00954d8b0a2d54075131fbad4a11b2b7355eee1
> Gitweb: https://git.kernel.org/tip/e00954d8b0a2d54075131fbad4a11b2b7355eee1
> Author: Koichiro Den <koichiro.den@canonical.com>
> AuthorDate: Fri, 20 Dec 2024 22:44:21 +09:00
> Committer: Thomas Gleixner <tglx@linutronix.de>
> CommitterDate: Thu, 16 Jan 2025 10:39:25 +01:00
>
> hrtimers: Handle CPU state correctly on hotplug
>
> Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
> through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
> CPUHP_ONLINE:
>
> Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
> to 1 throughout. However, during a CPU unplug operation, the tick and the
> clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
> state, for instance CFS incorrectly assumes that the hrtick is already
> active, and the chance of the clockevent device to transition to oneshot
> mode is also lost forever for the CPU, unless it goes back to a lower state
> than CPUHP_HRTIMERS_PREPARE once.
>
> This round-trip reveals another issue; cpu_base.online is not set to 1
> after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
>
> Aside of that, the bulk of the per CPU state is not reset either, which
> means there are dangling pointers in the worst case.
>
> Address this by adding a corresponding startup() callback, which resets the
> stale per CPU state and sets the online flag.
>
> [ tglx: Make the new callback unconditionally available, remove the online
> modification in the prepare() callback and clear the remaining
> state in the starting callback instead of the prepare callback ]
>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
> ---
> include/linux/hrtimer.h | 1 +
> kernel/cpu.c | 2 +-
> kernel/time/hrtimer.c | 10 +++++++++-
> 3 files changed, 11 insertions(+), 2 deletions(-)
This causes the below. Lemme zap it.
[ 0.324444] ------------[ cut here ]------------
[ 0.325783] WARNING: CPU: 0 PID: 0 at kernel/time/hrtimer.c:1076 enqueue_hrtimer+0x77/0x80
[ 0.325783] Modules linked in:
[ 0.325783] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc7+ #1
[ 0.325783] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 2023.11-8 02/21/2024
[ 0.325783] RIP: 0010:enqueue_hrtimer+0x77/0x80
[ 0.325783] Code: cb 7e 48 8b 05 42 8c b5 01 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 39 c1 ff ff 65 ff 0d 9a 42 cb 7e 75 9f 0f 1f 44 00 00 eb 98 <0f> 0b eb 9d 0f 1f 44 00 00 0f 1f 44 00 00 41 57 41 56 49 89 ce 41
[ 0.325783] RSP: 0000:ffffffff82603c28 EFLAGS: 00010046
[ 0.325783] RAX: ff11000074225e80 RBX: ff11000074225ec0 RCX: 00000000448b22fc
[ 0.325783] RDX: 0000000000000008 RSI: ff11000074225ec0 RDI: ff110000742349c0
[ 0.325783] RBP: ff110000742349c0 R08: 000000390fc0a600 R09: fffffffffffffffe
[ 0.325783] R10: 0000000000000000 R11: 0000000000000000 R12: ff11000074225e80
[ 0.325783] R13: ff11000074225ec0 R14: ff11000074225ec0 R15: 0000000000000040
[ 0.325783] FS: 0000000000000000(0000) GS:ff11000074200000(0000) knlGS:0000000000000000
[ 0.325783] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.325783] CR2: ff11000004601000 CR3: 0000000002e22001 CR4: 0000000000771ef0
[ 0.325783] PKRU: 55555554
[ 0.325783] Call Trace:
[ 0.325783] <TASK>
[ 0.325783] ? __warn+0x89/0x130
[ 0.325783] ? enqueue_hrtimer+0x77/0x80
[ 0.325783] ? report_bug+0x164/0x190
[ 0.325783] ? handle_bug+0x58/0x90
[ 0.325783] ? exc_invalid_op+0x17/0x70
[ 0.325783] ? asm_exc_invalid_op+0x1a/0x20
[ 0.325783] ? enqueue_hrtimer+0x77/0x80
[ 0.325783] hrtimer_start_range_ns+0x258/0x370
[ 0.325783] start_dl_timer+0xa6/0x130
[ 0.325783] enqueue_dl_entity+0x43d/0xa60
[ 0.325783] dl_server_start+0x42/0x190
[ 0.325783] enqueue_task_fair+0x22f/0x5e0
[ 0.325783] enqueue_task+0x32/0x150
[ 0.325783] ? post_init_entity_util_avg+0x29/0x160
[ 0.325783] wake_up_new_task+0x19b/0x300
[ 0.325783] kernel_clone+0x126/0x430
[ 0.325783] user_mode_thread+0x5f/0x90
[ 0.325783] ? rest_init+0xd0/0xd0
[ 0.325783] rest_init+0x1e/0xd0
[ 0.325783] start_kernel+0x5fc/0x870
[ 0.325783] x86_64_start_reservations+0x18/0x30
[ 0.325783] x86_64_start_kernel+0x96/0xa0
[ 0.325783] common_startup_64+0x13e/0x141
[ 0.325783] </TASK>
[ 0.325783] ---[ end trace 0000000000000000 ]---
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
© 2016 - 2025 Red Hat, Inc.