[PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true

Victor Hassan posted 1 patch 1 year ago
kernel/time/tick-broadcast.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
[PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Victor Hassan 1 year ago
If a broadcast timer is registered after the system switched to oneshot
mode, a hang_task err could occur like that:

INFO: task kworker/u15:0:7 blocked for more than 120 seconds.
      Tainted: G            E     5.15.41-android13-8-00002-xxx #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:0   state:D stack: 9808 pid:  7 ppid: 2 flags:0x00000008
Workqueue: events_unbound deferred_probe_work_func.cfi_jt
Call trace:
 __switch_to+0y240/0x490
 __schedule+0x620/0xafc
 schedule+0x110/0x204
 schedule_hrtimeout_range_clock+0x9c/0x118
 usleep_range_state+0x150/0x1ac
 _regulator_do_enable+0x528/0x878
 set_machine_constraints+0x6a0/0xf2c
 regulator_register+0x3ac/0x7ac
 devm_regulator_register+0xbc/0x120
 pmu_ext_regulator_probe+0xb0/0x1b4 [pmu_ext_regulator]
 platform_probe+0x70/0x194
 really_probe+0x320/0x68c
 __driver_probe_device+0x204/0x260
 driver_probe_device+0x48/0x1e0

When the new broadcast timer was registered after the system switched
to oneshot mode, the broadcast timer was not used as periodic. If the
oneshot mask was set incorrectly, all cores which did not enter cpu_idle
state can't enter cpu_idle normally, causing the hrtimer mechanism to
break. Like:

* CPU 1 stop its tick, next event is in one hour. It calls
  tick_broadcast_enter() and goes to sleep.
* CPU 1 gets an interrupt that enqueues a new timer expiring in the next jiffy
  (note it's not yet actually programmed in the tick device)
* CPU 1 call tick_broadcast_exit().
* CPU 0 registers new broadcast device and sets CPU 1 in tick_broadcast_oneshot_mask
* CPU 0 runs the broadcast callback, sees that the next timer for CPU 1
  is in one hour (because the recently enqueued timer for CPU 1 hasn't been programmed
  yet), so it programs the broadcast to that 1 hour deadline.
* CPU 1 runs tick_nohz_idle_stop_tick() which eventually writes and program
  dev->next_event to next jiffy
* CPU 1 runs into cpuidle_enter_state(), and tick_broadcast_enter() is ignored because
  the CPU is already in tick_broadcast_oneshot_mask, so the dev->next_event
  change isn't propagated to broadcast.
* CPU 1 goes to sleep for 1 hour.

This patch fixes the issue by moving the update action about oneshot
mask to a more strict conditions. The tick_broadcast_setup_oneshot would
be called in two typical condition, and they all will work.

1. tick_handle_periodic -> tick_broadcast_setup_oneshot

The origin broadcast was periodic, so it can set the oneshot_mask bits
for those waiting for periodic broadcast and program the broadcast timer
to fire.

2. tick_install_broadcast_device -> tick_broadcast_setup_oneshot

The origin broadcast was oneshot, so the cores which enter the cpu_idle
already used the oneshot_mask bits. It is unnecessary to update the
oneshot_mask.

Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Signed-off-by: Victor Hassan <victor@allwinnertech.com>
---
 kernel/time/tick-broadcast.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 93bf2b4e47e5..fdbbba487978 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -1041,12 +1041,13 @@ static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 		 */
 		cpumask_copy(tmpmask, tick_broadcast_mask);
 		cpumask_clear_cpu(cpu, tmpmask);
-		cpumask_or(tick_broadcast_oneshot_mask,
-			   tick_broadcast_oneshot_mask, tmpmask);
 
 		if (was_periodic && !cpumask_empty(tmpmask)) {
 			ktime_t nextevt = tick_get_next_period();
 
+			cpumask_or(tick_broadcast_oneshot_mask,
+				   tick_broadcast_oneshot_mask, tmpmask);
+
 			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
 			tick_broadcast_init_next_event(tmpmask, nextevt);
 			tick_broadcast_set_event(bc, cpu, nextevt);
-- 
2.29.0
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
Victor!

On Wed, Apr 12 2023 at 08:34, Victor Hassan wrote:

Thanks for tracking this problem down!

> If a broadcast timer is registered after the system switched to oneshot
> mode, a hang_task err could occur like that:
>
> INFO: task kworker/u15:0:7 blocked for more than 120 seconds.
>       Tainted: G            E     5.15.41-android13-8-00002-xxx #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/u16:0   state:D stack: 9808 pid:  7 ppid: 2 flags:0x00000008
> Workqueue: events_unbound deferred_probe_work_func.cfi_jt
> Call trace:
>  __switch_to+0y240/0x490
>  __schedule+0x620/0xafc
>  schedule+0x110/0x204
>  schedule_hrtimeout_range_clock+0x9c/0x118
>  usleep_range_state+0x150/0x1ac
>  _regulator_do_enable+0x528/0x878
>  set_machine_constraints+0x6a0/0xf2c
>  regulator_register+0x3ac/0x7ac
>  devm_regulator_register+0xbc/0x120
>  pmu_ext_regulator_probe+0xb0/0x1b4 [pmu_ext_regulator]
>  platform_probe+0x70/0x194
>  really_probe+0x320/0x68c
>  __driver_probe_device+0x204/0x260
>  driver_probe_device+0x48/0x1e0

That backtrace is not really helpful to explain the problem. That's just the
consequence, i.e. the symptom.

Backtraces can help to document the call chain leading to a problem. As
this is _not_ leading to the problem, the backtrace is just a
distraction. See:

 https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog

> When the new broadcast timer was registered after the system switched
> to oneshot mode, the broadcast timer was not used as periodic. If the
> oneshot mask was set incorrectly, all cores which did not enter cpu_idle
> state can't enter cpu_idle normally, causing the hrtimer mechanism to
> break.

This is not really a proper problem description. It's obvious that
things break when a mask is set incorrectly. But that lacks a
description of the context and the why the mask is incorrect.

> Like:
>
> * CPU 1 stop its tick, next event is in one hour. It calls
>   tick_broadcast_enter() and goes to sleep.

So there is already a broadcast device installed, right?

> * CPU 1 gets an interrupt that enqueues a new timer expiring in the next jiffy
>   (note it's not yet actually programmed in the tick device)
> * CPU 1 call tick_broadcast_exit().
> * CPU 0 registers new broadcast device and sets CPU 1 in tick_broadcast_oneshot_mask

This lacks an explanation why CPU0 sets CPU1 in that mask. It does not
_set_ it explicitely, only implicitely by ORing the periodic broadcast
cpumask over.

Now the question is why is CPU1 set in the periodic broadcast mask when
the CPU already switched over to NOHZ mode?

That needs to be explained too.

> * CPU 0 runs the broadcast callback, sees that the next timer for CPU 1
>   is in one hour (because the recently enqueued timer for CPU 1 hasn't been programmed
>   yet), so it programs the broadcast to that 1 hour deadline.
> * CPU 1 runs tick_nohz_idle_stop_tick() which eventually writes and program
>   dev->next_event to next jiffy
> * CPU 1 runs into cpuidle_enter_state(), and tick_broadcast_enter() is ignored because
>   the CPU is already in tick_broadcast_oneshot_mask, so the dev->next_event
>   change isn't propagated to broadcast.
> * CPU 1 goes to sleep for 1 hour.

Also please use tabular style to explain the parallel events as
explained in the documentation.

> This patch fixes the issue by moving the update action about oneshot

git grep 'This patch' Documentation/process/

> Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")

Pointless new line.

> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Ditto. And please use the tag ordering from the tip documentation.

> Signed-off-by: Victor Hassan <victor@allwinnertech.com>
> ---
>  kernel/time/tick-broadcast.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index 93bf2b4e47e5..fdbbba487978 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -1041,12 +1041,13 @@ static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
>  		 */
>  		cpumask_copy(tmpmask, tick_broadcast_mask);
>  		cpumask_clear_cpu(cpu, tmpmask);
> -		cpumask_or(tick_broadcast_oneshot_mask,
> -			   tick_broadcast_oneshot_mask, tmpmask);

This breaks the case when the broadcast device was already in use before
switching to oneshot broadcast mode and the underlying clock event
device does not support periodic state, i.e. it operates internally in
one shot state.

The condition avoids reprogramming in that case because the device is
already armed for the next tick when the periodic broadcast mask is not
empty, so no further action required.

But with moving the OR operation into the condition the CPUs in the
periodic broadcast mask are then not woken up.

There is a distinction between the tick/broadcast mode and the
clockevent device state.

tick/broadcast			clockevent

TICK_MODE_PERIODIC              CLOCK_EVENT_STATE_PERIODIC (if supported) or _ONESHOT
TICK_MODE_ONESHOT  (NOHZ)       CLOCK_EVENT_STATE_ONESHOT

The tick/broadcast mode is a software state. The clockevent state is a
hardware state.

After more analysis of that code it turns out that this is even more
broken because of this:

CPU0                       CPU1

                           idle()
                             tick_broadcast_enter()
                                 test_and_set_cpu(cpu, oneshot_mask);
                                 shutdown_cpu_local_device();
                                 tick_broadcast_set_event();
                             sleep_deep();

                           // All good. Broadcast will wake the CPU up

install_new_broadcast_device(newdev)
  tick_broadcast_setup_oneshot(newdev)
    if (was_periodic)  <- Path not taken because device is in shutdown state
       ...
    else
      newdev->next_event = KTIME_MAX;

So what switches the new device into oneshot state and what wakes CPU1
on time?

The switch to oneshot mode happens when the next CPU goes idle and
invokes tick_broadcast_set_event() because that sets up one shot state
implicitly.

The wake-up on time for CPU1 happens only when the next CPU goes idle
before the expiry time and queues a broadcast event which is the same or
earlier than CPU1s event, but that's far from correct.

It will eventually be delivered, but that might be way too late and even
cause stalls or hung task events in the worst case.

Duh. What was that Gleixner dude thinking...

Just for the record. I hated that broadcast code from day one.

The irony is that the only architecture which required it back in the
days (x86) and caused me to write this horror in order to make NOHZ
possible has by now functional timers which just work even in deeper
idle states. Therefore x86 does not use that code anymore on any
halfways contemporary system.

Though all other architectures had to make the same mistake again...

Completely untested patch below.

Thanks,

        tglx
---
From: Thomas Gleixner <tglx@linutronix.de>
Subject: tick/broadcast: Make broadcast device replacement work correctly
Date: Wed, 12 Apr 2023 08:34:25 +0800

When a tick broadcast clockevent device is initialized for one shot mode
then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
cpumask into the oneshot broadcast cpumask.

This is required when switching from periodic broadcast mode to oneshot
broadcast mode to ensure that CPUs which are waiting for periodic
broadcast are woken up on the next tick.

But it is subtly broken, when an active broadcast device is replaced and
the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
this and debugged the issue.

Then the OR of the periodic broadcast CPU mask is wrong as the periodic
cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
unless explicitly cleared via tick_broadcast_disable().

That means that this sets all other CPUs which have tick broadcasting
enabled at that point unconditionally in the oneshot broadcast mask.

If the affected CPUs were already idle and had their bits set in the
oneshot broadcast mask then this does no harm. But for non idle CPUs
which were not set this corrupts their state.

On their next invocation of tick_broadcast_enable() they observe the bit
set, which indicates that the broadcast for the CPU is already set up.
As a consequence they fail to update the broadcast event even if their
earliest expiring timer is before the actually programmed broadcast
event.

If the programmed broadcast event is far in the future, then this can
cause stalls or trigger the hung task detector.

Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
this is the initial switch over from periodic to oneshot broadcast which
must take the periodic broadcast mask into account. In the case of
initialization of a replacement device this prevents that the broadcast
oneshot mask is modified.

There is a second problem with broadcast device replacement in this
function. The broadcast device is only armed when the previous state of
the device was periodic.

That is correct for the switch from periodic broadcast mode to oneshot
broadcast mode as the underlying broadcast device could operate in
oneshot state already due to lack of periodic state in hardware. In that
case it is already armed to expire at the next tick.

For the replacement case this is wrong as the device is in shutdown
state. That means that any already pending broadcast event will not be
armed.

This went unnoticed because any CPU which goes idle will observe that
the broadcast device has an expiry time of KTIME_MAX and therefore any
CPUs next timer event will be earlier and cause a reprogramming of the
broadcast device. But that does not guarantee that the events of the
CPUs which were already in idle are delivered on time.

Fix this by arming the newly installed device for an immediate event
which will reevaluate the per CPU expiry times and reprogram the
broadcast device accordingly. This is simpler than caching the last
expiry time in yet another place or saving it before the device exchange
and handing it down to the setup function. Replacement of broadcast
devices is not a frequent operation and usually happens once somewhere
late in the boot process.

Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")
Reported-by: Victor Hassan <victor@allwinnertech.com>
Not-Yet-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-broadcast.c |  106 ++++++++++++++++++++++++++++++-------------
 1 file changed, 75 insertions(+), 31 deletions(-)

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -35,14 +35,15 @@ static __cacheline_aligned_in_smp DEFINE
 #ifdef CONFIG_TICK_ONESHOT
 static DEFINE_PER_CPU(struct clock_event_device *, tick_oneshot_wakeup_device);
 
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic);
 static void tick_broadcast_clear_oneshot(int cpu);
 static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
 # ifdef CONFIG_HOTPLUG_CPU
 static void tick_broadcast_oneshot_offline(unsigned int cpu);
 # endif
 #else
-static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
+static inline void
+tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic) { BUG(); }
 static inline void tick_broadcast_clear_oneshot(int cpu) { }
 static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
 # ifdef CONFIG_HOTPLUG_CPU
@@ -264,7 +265,7 @@ int tick_device_uses_broadcast(struct cl
 		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 			tick_broadcast_start_periodic(bc);
 		else
-			tick_broadcast_setup_oneshot(bc);
+			tick_broadcast_setup_oneshot(bc, false);
 		ret = 1;
 	} else {
 		/*
@@ -500,7 +501,7 @@ void tick_broadcast_control(enum tick_br
 			if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 				tick_broadcast_start_periodic(bc);
 			else
-				tick_broadcast_setup_oneshot(bc);
+				tick_broadcast_setup_oneshot(bc, false);
 		}
 	}
 out:
@@ -1020,48 +1021,89 @@ static inline ktime_t tick_get_next_peri
 /**
  * tick_broadcast_setup_oneshot - setup the broadcast device
  */
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
+					 bool from_periodic)
 {
 	int cpu = smp_processor_id();
+	ktime_t nexttick = 0;
 
 	if (!bc)
 		return;
 
 	/* Set it up only once ! */
-	if (bc->event_handler != tick_handle_oneshot_broadcast) {
-		int was_periodic = clockevent_state_periodic(bc);
-
-		bc->event_handler = tick_handle_oneshot_broadcast;
-
+	if (bc->event_handler == tick_handle_oneshot_broadcast) {
 		/*
-		 * We must be careful here. There might be other CPUs
-		 * waiting for periodic broadcast. We need to set the
-		 * oneshot_mask bits for those and program the
-		 * broadcast device to fire.
+		 * The CPU which switches from periodic to oneshot mode
+		 * sets the broadcast oneshot bit for all other CPUs which
+		 * are in the general (periodic) broadcast mask to ensure
+		 * that CPUs which wait for the periodic broadcast are
+		 * woken up.
+		 *
+		 * Clear the bit for the local CPU as the set bit would
+		 * prevent the first tick_broadcast_enter() after this CPU
+		 * switched to oneshot state to program the broadcast
+		 * device.
 		 */
+		tick_broadcast_clear_oneshot(cpu);
+	}
+
+
+	bc->event_handler = tick_handle_oneshot_broadcast;
+	bc->next_event = KTIME_MAX;
+
+	/*
+	 * When the tick mode is switched from periodic to oneshot it must
+	 * be ensured that CPUs which are waiting for periodic broadcast
+	 * get their wake-up at the next tick.  This is achieved by ORing
+	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
+	 *
+	 * For other callers, e.g. broadcast device replacement,
+	 * tick_broadcast_oneshot_mask must not be touched as this would
+	 * set bits for CPUs which are already NOHZ, but not idle. Their
+	 * next tick_broadcast_enter() would observe the bit set and fail
+	 * to update the expiry time and the broadcast event device.
+	 */
+	if (from_periodic) {
 		cpumask_copy(tmpmask, tick_broadcast_mask);
+		/* Remove the local CPU as it is obviously not idle */
 		cpumask_clear_cpu(cpu, tmpmask);
-		cpumask_or(tick_broadcast_oneshot_mask,
-			   tick_broadcast_oneshot_mask, tmpmask);
+		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
 
-		if (was_periodic && !cpumask_empty(tmpmask)) {
-			ktime_t nextevt = tick_get_next_period();
+		/*
+		 * Ensure that the oneshot broadcast handler will wake the
+		 * CPUs which are still waiting for periodic broadcast.
+		 */
+		nexttick = tick_get_next_period();
+		tick_broadcast_init_next_event(tmpmask, nexttick);
 
-			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
-			tick_broadcast_init_next_event(tmpmask, nextevt);
-			tick_broadcast_set_event(bc, cpu, nextevt);
-		} else
-			bc->next_event = KTIME_MAX;
-	} else {
 		/*
-		 * The first cpu which switches to oneshot mode sets
-		 * the bit for all other cpus which are in the general
-		 * (periodic) broadcast mask. So the bit is set and
-		 * would prevent the first broadcast enter after this
-		 * to program the bc device.
+		 * If the underlying broadcast clock event device is
+		 * already in oneshot state, then there is nothing to do.
+		 * The device was already armed for the next tick
+		 * in tick_handle_broadcast_periodic()
 		 */
-		tick_broadcast_clear_oneshot(cpu);
+		if (clockevent_state_oneshot(bc))
+			return;
 	}
+
+	/*
+	 * When switching from periodic to oneshot mode arm the broadcast
+	 * device for the next tick.
+	 *
+	 * If the broadcast device has been replaced in oneshot mode and
+	 * the oneshot broadcast mask is not empty, then arm it to expire
+	 * immediately in order to reevaluate the next expiring timer.
+	 * nexttick is 0 and therefore in the past which will cause the
+	 * clockevent code to force an event.
+	 *
+	 * For both cases the programming can be avoided when the oneshot
+	 * broadcast mask is empty.
+	 *
+	 * tick_broadcast_set_event() implicitly switches the broadcast
+	 * device to oneshot state.
+	 */
+	if (!cpumask_empty(tick_broadcast_oneshot_mask))
+		tick_broadcast_set_event(bc, cpu, nexttick);
 }
 
 /*
@@ -1070,14 +1112,16 @@ static void tick_broadcast_setup_oneshot
 void tick_broadcast_switch_to_oneshot(void)
 {
 	struct clock_event_device *bc;
+	enum tick_device_mode oldmode;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 
+	oldmode = tick_broadcast_device.mode;
 	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
 	bc = tick_broadcast_device.evtdev;
 	if (bc)
-		tick_broadcast_setup_oneshot(bc);
+		tick_broadcast_setup_oneshot(bc, oldmode == TICKDEV_MODE_PERIODIC);
 
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Victor Hassan 1 year ago

On 4/16/2023 5:01 AM, Thomas Gleixner wrote:
> Victor!
> 
> On Wed, Apr 12 2023 at 08:34, Victor Hassan wrote:
> 
> Thanks for tracking this problem down!
> 
>> If a broadcast timer is registered after the system switched to oneshot
>> mode, a hang_task err could occur like that:
>>
>> INFO: task kworker/u15:0:7 blocked for more than 120 seconds.
>>        Tainted: G            E     5.15.41-android13-8-00002-xxx #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> task:kworker/u16:0   state:D stack: 9808 pid:  7 ppid: 2 flags:0x00000008
>> Workqueue: events_unbound deferred_probe_work_func.cfi_jt
>> Call trace:
>>   __switch_to+0y240/0x490
>>   __schedule+0x620/0xafc
>>   schedule+0x110/0x204
>>   schedule_hrtimeout_range_clock+0x9c/0x118
>>   usleep_range_state+0x150/0x1ac
>>   _regulator_do_enable+0x528/0x878
>>   set_machine_constraints+0x6a0/0xf2c
>>   regulator_register+0x3ac/0x7ac
>>   devm_regulator_register+0xbc/0x120
>>   pmu_ext_regulator_probe+0xb0/0x1b4 [pmu_ext_regulator]
>>   platform_probe+0x70/0x194
>>   really_probe+0x320/0x68c
>>   __driver_probe_device+0x204/0x260
>>   driver_probe_device+0x48/0x1e0
> 
> That backtrace is not really helpful to explain the problem. That's just the
> consequence, i.e. the symptom.
> 
> Backtraces can help to document the call chain leading to a problem. As
> this is _not_ leading to the problem, the backtrace is just a
> distraction. See:
> 
>   https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
> 
>> When the new broadcast timer was registered after the system switched
>> to oneshot mode, the broadcast timer was not used as periodic. If the
>> oneshot mask was set incorrectly, all cores which did not enter cpu_idle
>> state can't enter cpu_idle normally, causing the hrtimer mechanism to
>> break.
> 
> This is not really a proper problem description. It's obvious that
> things break when a mask is set incorrectly. But that lacks a
> description of the context and the why the mask is incorrect.
> 
>> Like:
>>
>> * CPU 1 stop its tick, next event is in one hour. It calls
>>    tick_broadcast_enter() and goes to sleep.
> 
> So there is already a broadcast device installed, right?
> 
>> * CPU 1 gets an interrupt that enqueues a new timer expiring in the next jiffy
>>    (note it's not yet actually programmed in the tick device)
>> * CPU 1 call tick_broadcast_exit().
>> * CPU 0 registers new broadcast device and sets CPU 1 in tick_broadcast_oneshot_mask
> 
> This lacks an explanation why CPU0 sets CPU1 in that mask. It does not
> _set_ it explicitely, only implicitely by ORing the periodic broadcast
> cpumask over.
> 
> Now the question is why is CPU1 set in the periodic broadcast mask when
> the CPU already switched over to NOHZ mode?
> 
> That needs to be explained too.
> 
>> * CPU 0 runs the broadcast callback, sees that the next timer for CPU 1
>>    is in one hour (because the recently enqueued timer for CPU 1 hasn't been programmed
>>    yet), so it programs the broadcast to that 1 hour deadline.
>> * CPU 1 runs tick_nohz_idle_stop_tick() which eventually writes and program
>>    dev->next_event to next jiffy
>> * CPU 1 runs into cpuidle_enter_state(), and tick_broadcast_enter() is ignored because
>>    the CPU is already in tick_broadcast_oneshot_mask, so the dev->next_event
>>    change isn't propagated to broadcast.
>> * CPU 1 goes to sleep for 1 hour.
> 
> Also please use tabular style to explain the parallel events as
> explained in the documentation.
> 
>> This patch fixes the issue by moving the update action about oneshot
> 
> git grep 'This patch' Documentation/process/
> 
>> Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")
> 
> Pointless new line.
> 
>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> 
> Ditto. And please use the tag ordering from the tip documentation.
> 
>> Signed-off-by: Victor Hassan <victor@allwinnertech.com>
>> ---
>>   kernel/time/tick-broadcast.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
>> index 93bf2b4e47e5..fdbbba487978 100644
>> --- a/kernel/time/tick-broadcast.c
>> +++ b/kernel/time/tick-broadcast.c
>> @@ -1041,12 +1041,13 @@ static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
>>   		 */
>>   		cpumask_copy(tmpmask, tick_broadcast_mask);
>>   		cpumask_clear_cpu(cpu, tmpmask);
>> -		cpumask_or(tick_broadcast_oneshot_mask,
>> -			   tick_broadcast_oneshot_mask, tmpmask);
> 
> This breaks the case when the broadcast device was already in use before
> switching to oneshot broadcast mode and the underlying clock event
> device does not support periodic state, i.e. it operates internally in
> one shot state.
> 
> The condition avoids reprogramming in that case because the device is
> already armed for the next tick when the periodic broadcast mask is not
> empty, so no further action required.
> 
> But with moving the OR operation into the condition the CPUs in the
> periodic broadcast mask are then not woken up.
> 
> There is a distinction between the tick/broadcast mode and the
> clockevent device state.
> 
> tick/broadcast			clockevent
> 
> TICK_MODE_PERIODIC              CLOCK_EVENT_STATE_PERIODIC (if supported) or _ONESHOT
> TICK_MODE_ONESHOT  (NOHZ)       CLOCK_EVENT_STATE_ONESHOT
> 
> The tick/broadcast mode is a software state. The clockevent state is a
> hardware state.

Yes, you are right.

tick_setup_device
     -> td->mode = TICKDEV_MODE_PERIODIC;
     -> tick_setup_periodic
         -> tick_set_periodic_handler
         -> if dev->features & CLOCK_EVT_FEAT_PERIODIC // may not support
             -> clockevents_switch_state(dev, CLOCK_EVT_STATE_PERIODIC);
         -> else
             -> clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)

> 
> After more analysis of that code it turns out that this is even more
> broken because of this:
> 
> CPU0                       CPU1
> 
>                             idle()
>                               tick_broadcast_enter()
>                                   test_and_set_cpu(cpu, oneshot_mask);
>                                   shutdown_cpu_local_device();
>                                   tick_broadcast_set_event();
>                               sleep_deep();
> 
>                             // All good. Broadcast will wake the CPU up
> 
> install_new_broadcast_device(newdev)
>    tick_broadcast_setup_oneshot(newdev)
>      if (was_periodic)  <- Path not taken because device is in shutdown state

Are you saying that the "tick_broadcast_enter->broadcast_shutdown_local" 
path will turn off the cpu1 tick device(as the broadcast)?

I think this only happens when CPU1's tick device is used as the 
broadcast device. However, the "broadcast_needs_cpu" path prevents this 
from happening, right?

Nevertheless, there is still an issue here. At this point, the broadcast 
will be in oneshot state (was_periodic is still false). The reason why 
this has not caused any serious problems may be because other CPUs will 
quickly enter idle to help refresh the broadcast.

>         ...
>      else
>        newdev->next_event = KTIME_MAX;
> 
> So what switches the new device into oneshot state and what wakes CPU1
> on time?
> 
> The switch to oneshot mode happens when the next CPU goes idle and
> invokes tick_broadcast_set_event() because that sets up one shot state
> implicitly.
> 
> The wake-up on time for CPU1 happens only when the next CPU goes idle
> before the expiry time and queues a broadcast event which is the same or
> earlier than CPU1s event, but that's far from correct.
> 
> It will eventually be delivered, but that might be way too late and even
> cause stalls or hung task events in the worst case.
> 
> Duh. What was that Gleixner dude thinking...
> 
> Just for the record. I hated that broadcast code from day one.
> 
> The irony is that the only architecture which required it back in the
> days (x86) and caused me to write this horror in order to make NOHZ
> possible has by now functional timers which just work even in deeper
> idle states. Therefore x86 does not use that code anymore on any
> halfways contemporary system.
> 
> Though all other architectures had to make the same mistake again...
> 
> Completely untested patch below.
> 
> Thanks,
> 
>          tglx
> ---
> From: Thomas Gleixner <tglx@linutronix.de>
> Subject: tick/broadcast: Make broadcast device replacement work correctly
> Date: Wed, 12 Apr 2023 08:34:25 +0800
> 
> When a tick broadcast clockevent device is initialized for one shot mode
> then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
> cpumask into the oneshot broadcast cpumask.
> 
> This is required when switching from periodic broadcast mode to oneshot
> broadcast mode to ensure that CPUs which are waiting for periodic
> broadcast are woken up on the next tick.
> 
> But it is subtly broken, when an active broadcast device is replaced and
> the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
> this and debugged the issue.
> 
> Then the OR of the periodic broadcast CPU mask is wrong as the periodic
> cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
> unless explicitly cleared via tick_broadcast_disable().
> 
> That means that this sets all other CPUs which have tick broadcasting
> enabled at that point unconditionally in the oneshot broadcast mask.
> 
> If the affected CPUs were already idle and had their bits set in the
> oneshot broadcast mask then this does no harm. But for non idle CPUs
> which were not set this corrupts their state.
> 
> On their next invocation of tick_broadcast_enable() they observe the bit
> set, which indicates that the broadcast for the CPU is already set up.
> As a consequence they fail to update the broadcast event even if their
> earliest expiring timer is before the actually programmed broadcast
> event.
> 
> If the programmed broadcast event is far in the future, then this can
> cause stalls or trigger the hung task detector.
> 
> Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
> this is the initial switch over from periodic to oneshot broadcast which
> must take the periodic broadcast mask into account. In the case of
> initialization of a replacement device this prevents that the broadcast
> oneshot mask is modified.
> 
> There is a second problem with broadcast device replacement in this
> function. The broadcast device is only armed when the previous state of
> the device was periodic.
> 
> That is correct for the switch from periodic broadcast mode to oneshot
> broadcast mode as the underlying broadcast device could operate in
> oneshot state already due to lack of periodic state in hardware. In that
> case it is already armed to expire at the next tick.
> 
> For the replacement case this is wrong as the device is in shutdown
> state. That means that any already pending broadcast event will not be
> armed.
> 
> This went unnoticed because any CPU which goes idle will observe that
> the broadcast device has an expiry time of KTIME_MAX and therefore any
> CPUs next timer event will be earlier and cause a reprogramming of the
> broadcast device. But that does not guarantee that the events of the
> CPUs which were already in idle are delivered on time.
> 
> Fix this by arming the newly installed device for an immediate event
> which will reevaluate the per CPU expiry times and reprogram the
> broadcast device accordingly. This is simpler than caching the last
> expiry time in yet another place or saving it before the device exchange
> and handing it down to the setup function. Replacement of broadcast
> devices is not a frequent operation and usually happens once somewhere
> late in the boot process.
> 
> Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")
> Reported-by: Victor Hassan <victor@allwinnertech.com>
> Not-Yet-Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/time/tick-broadcast.c |  106 ++++++++++++++++++++++++++++++-------------
>   1 file changed, 75 insertions(+), 31 deletions(-)
> 
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -35,14 +35,15 @@ static __cacheline_aligned_in_smp DEFINE
>   #ifdef CONFIG_TICK_ONESHOT
>   static DEFINE_PER_CPU(struct clock_event_device *, tick_oneshot_wakeup_device);
>   
> -static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
> +static void tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic);
>   static void tick_broadcast_clear_oneshot(int cpu);
>   static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
>   # ifdef CONFIG_HOTPLUG_CPU
>   static void tick_broadcast_oneshot_offline(unsigned int cpu);
>   # endif
>   #else
> -static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
> +static inline void
> +tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic) { BUG(); }
>   static inline void tick_broadcast_clear_oneshot(int cpu) { }
>   static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
>   # ifdef CONFIG_HOTPLUG_CPU
> @@ -264,7 +265,7 @@ int tick_device_uses_broadcast(struct cl
>   		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
>   			tick_broadcast_start_periodic(bc);
>   		else
> -			tick_broadcast_setup_oneshot(bc);
> +			tick_broadcast_setup_oneshot(bc, false);
>   		ret = 1;
>   	} else {
>   		/*
> @@ -500,7 +501,7 @@ void tick_broadcast_control(enum tick_br
>   			if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
>   				tick_broadcast_start_periodic(bc);
>   			else
> -				tick_broadcast_setup_oneshot(bc);
> +				tick_broadcast_setup_oneshot(bc, false);
>   		}
>   	}
>   out:
> @@ -1020,48 +1021,89 @@ static inline ktime_t tick_get_next_peri
>   /**
>    * tick_broadcast_setup_oneshot - setup the broadcast device
>    */
> -static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
> +static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
> +					 bool from_periodic)
>   {
>   	int cpu = smp_processor_id();
> +	ktime_t nexttick = 0;
>   
>   	if (!bc)
>   		return;
>   
>   	/* Set it up only once ! */
> -	if (bc->event_handler != tick_handle_oneshot_broadcast) {
> -		int was_periodic = clockevent_state_periodic(bc);
> -
> -		bc->event_handler = tick_handle_oneshot_broadcast;
> -
> +	if (bc->event_handler == tick_handle_oneshot_broadcast) {
>   		/*
> -		 * We must be careful here. There might be other CPUs
> -		 * waiting for periodic broadcast. We need to set the
> -		 * oneshot_mask bits for those and program the
> -		 * broadcast device to fire.
> +		 * The CPU which switches from periodic to oneshot mode
> +		 * sets the broadcast oneshot bit for all other CPUs which
> +		 * are in the general (periodic) broadcast mask to ensure
> +		 * that CPUs which wait for the periodic broadcast are
> +		 * woken up.
> +		 *
> +		 * Clear the bit for the local CPU as the set bit would
> +		 * prevent the first tick_broadcast_enter() after this CPU
> +		 * switched to oneshot state to program the broadcast
> +		 * device.
>   		 */
> +		tick_broadcast_clear_oneshot(cpu);
> +	}
> +
> +
> +	bc->event_handler = tick_handle_oneshot_broadcast;
> +	bc->next_event = KTIME_MAX;
> +
> +	/*
> +	 * When the tick mode is switched from periodic to oneshot it must
> +	 * be ensured that CPUs which are waiting for periodic broadcast
> +	 * get their wake-up at the next tick.  This is achieved by ORing
> +	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
> +	 *
> +	 * For other callers, e.g. broadcast device replacement,
> +	 * tick_broadcast_oneshot_mask must not be touched as this would
> +	 * set bits for CPUs which are already NOHZ, but not idle. Their
> +	 * next tick_broadcast_enter() would observe the bit set and fail
> +	 * to update the expiry time and the broadcast event device.
> +	 */
> +	if (from_periodic) {
>   		cpumask_copy(tmpmask, tick_broadcast_mask);
> +		/* Remove the local CPU as it is obviously not idle */
>   		cpumask_clear_cpu(cpu, tmpmask);
> -		cpumask_or(tick_broadcast_oneshot_mask,
> -			   tick_broadcast_oneshot_mask, tmpmask);
> +		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
>   
> -		if (was_periodic && !cpumask_empty(tmpmask)) {
> -			ktime_t nextevt = tick_get_next_period();
> +		/*
> +		 * Ensure that the oneshot broadcast handler will wake the
> +		 * CPUs which are still waiting for periodic broadcast.
> +		 */
> +		nexttick = tick_get_next_period();
> +		tick_broadcast_init_next_event(tmpmask, nexttick);
>   
> -			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
> -			tick_broadcast_init_next_event(tmpmask, nextevt);
> -			tick_broadcast_set_event(bc, cpu, nextevt);
> -		} else
> -			bc->next_event = KTIME_MAX;
> -	} else {
>   		/*
> -		 * The first cpu which switches to oneshot mode sets
> -		 * the bit for all other cpus which are in the general
> -		 * (periodic) broadcast mask. So the bit is set and
> -		 * would prevent the first broadcast enter after this
> -		 * to program the bc device.
> +		 * If the underlying broadcast clock event device is
> +		 * already in oneshot state, then there is nothing to do.
> +		 * The device was already armed for the next tick
> +		 * in tick_handle_broadcast_periodic()
>   		 */
> -		tick_broadcast_clear_oneshot(cpu);
> +		if (clockevent_state_oneshot(bc))
> +			return;
>   	}
> +
> +	/*
> +	 * When switching from periodic to oneshot mode arm the broadcast
> +	 * device for the next tick.
> +	 *
> +	 * If the broadcast device has been replaced in oneshot mode and
> +	 * the oneshot broadcast mask is not empty, then arm it to expire
> +	 * immediately in order to reevaluate the next expiring timer.
> +	 * nexttick is 0 and therefore in the past which will cause the
> +	 * clockevent code to force an event.
> +	 *
> +	 * For both cases the programming can be avoided when the oneshot
> +	 * broadcast mask is empty.
> +	 *
> +	 * tick_broadcast_set_event() implicitly switches the broadcast
> +	 * device to oneshot state.
> +	 */
> +	if (!cpumask_empty(tick_broadcast_oneshot_mask))
> +		tick_broadcast_set_event(bc, cpu, nexttick);
>   }
>   
>   /*
> @@ -1070,14 +1112,16 @@ static void tick_broadcast_setup_oneshot
>   void tick_broadcast_switch_to_oneshot(void)
>   {
>   	struct clock_event_device *bc;
> +	enum tick_device_mode oldmode;
>   	unsigned long flags;
>   
>   	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
>   
> +	oldmode = tick_broadcast_device.mode;
>   	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
>   	bc = tick_broadcast_device.evtdev;
>   	if (bc)
> -		tick_broadcast_setup_oneshot(bc);
> +		tick_broadcast_setup_oneshot(bc, oldmode == TICKDEV_MODE_PERIODIC);
>   
>   	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
>   }
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
On Sun, Apr 23 2023 at 22:16, Victor Hassan wrote:
> On 4/16/2023 5:01 AM, Thomas Gleixner wrote:
>> After more analysis of that code it turns out that this is even more
>> broken because of this:
>> 
>> CPU0                       CPU1
>> 
>>                             idle()
>>                               tick_broadcast_enter()
>>                                   test_and_set_cpu(cpu, oneshot_mask);
>>                                   shutdown_cpu_local_device();
>>                                   tick_broadcast_set_event();
>>                               sleep_deep();
>> 
>>                             // All good. Broadcast will wake the CPU up
>> 
>> install_new_broadcast_device(newdev)
>>    tick_broadcast_setup_oneshot(newdev)
>>      if (was_periodic)  <- Path not taken because device is in shutdown state
>
> Are you saying that the "tick_broadcast_enter->broadcast_shutdown_local" 
> path will turn off the cpu1 tick device(as the broadcast)?

No. On CPU1 the idle path does:

    - Mark the CPU in the broadcast oneshot mask
    - Conditionally shut down the CPU local clock event device
    - Set the broadcast event

> I think this only happens when CPU1's tick device is used as the 
> broadcast device. However, the "broadcast_needs_cpu" path prevents this 
> from happening, right?

No. broadcast_shutdown_local() checks first whether

    - the broadcast device is hrtimer based, i.e. a software emulation
    - the broadcast device is armed
    - the CPU handling the hrtimer is the current CPU

If all apply then the CPU local device cannot be shut down.

Then it checks whether:

    - the broadcast device is hrtimer based, i.e. a software emulation
    - the CPU local event is before the broadcast event, as it cannot
      reprogram the remote CPUs clockevent device

If all apply then the CPU local device cannot be shut down.

> Nevertheless, there is still an issue here. At this point, the broadcast 
> will be in oneshot state (was_periodic is still false). The reason why 
> this has not caused any serious problems may be because other CPUs will 
> quickly enter idle to help refresh the broadcast.

The installation of a new broadcast device is a rare event and yes, if
the CPU which installed it or come other CPU goes idle shortly after it
will arm the broadcast event and stuff keeps moving, but that's far from
correct.

Thanks,

        tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
On Mon, Apr 24 2023 at 20:28, Thomas Gleixner wrote:

Btw, does the patch fix your issue?

Thanks,

         tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Victor Hassan 1 year ago

On 4/25/2023 2:31 AM, Thomas Gleixner wrote:
> On Mon, Apr 24 2023 at 20:28, Thomas Gleixner wrote:
> 
> Btw, does the patch fix your issue?
This is a probabilistic problem and I am currently testing it.

> 
> Thanks,
> 
>           tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Victor Hassan 12 months ago
On 4/26/2023 10:50 AM, Victor Hassan wrote:
> 
> 
> On 4/25/2023 2:31 AM, Thomas Gleixner wrote:
>> On Mon, Apr 24 2023 at 20:28, Thomas Gleixner wrote:
>>
>> Btw, does the patch fix your issue?
> This is a probabilistic problem and I am currently testing it.

I have tested for 7*24 hours and there have been no issues as before.
> 
>>
>> Thanks,
>>
>>           tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Frederic Weisbecker 1 year ago
Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
> @@ -1020,48 +1021,89 @@ static inline ktime_t tick_get_next_peri
>  /**
>   * tick_broadcast_setup_oneshot - setup the broadcast device
>   */
> -static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
> +static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
> +					 bool from_periodic)
>  {
>  	int cpu = smp_processor_id();
> +	ktime_t nexttick = 0;
>  
>  	if (!bc)
>  		return;
>  
>  	/* Set it up only once ! */
> -	if (bc->event_handler != tick_handle_oneshot_broadcast) {
> -		int was_periodic = clockevent_state_periodic(bc);
> -
> -		bc->event_handler = tick_handle_oneshot_broadcast;
> -
> +	if (bc->event_handler == tick_handle_oneshot_broadcast) {
>  		/*
> -		 * We must be careful here. There might be other CPUs
> -		 * waiting for periodic broadcast. We need to set the
> -		 * oneshot_mask bits for those and program the
> -		 * broadcast device to fire.
> +		 * The CPU which switches from periodic to oneshot mode
> +		 * sets the broadcast oneshot bit for all other CPUs which
> +		 * are in the general (periodic) broadcast mask to ensure
> +		 * that CPUs which wait for the periodic broadcast are
> +		 * woken up.
> +		 *
> +		 * Clear the bit for the local CPU as the set bit would
> +		 * prevent the first tick_broadcast_enter() after this CPU
> +		 * switched to oneshot state to program the broadcast
> +		 * device.
>  		 */
> +		tick_broadcast_clear_oneshot(cpu);

So this path is reached when we setup/exchange a new tick device
on a CPU after the broadcast device has been set to oneshot, right?

Why does it have a specific treatment? Is it for optimization? Or am I
missing a correctness based reason?

> +	}
> +
> +
> +	bc->event_handler = tick_handle_oneshot_broadcast;
> +	bc->next_event = KTIME_MAX;
> +
> +	/*
> +	 * When the tick mode is switched from periodic to oneshot it must
> +	 * be ensured that CPUs which are waiting for periodic broadcast
> +	 * get their wake-up at the next tick.  This is achieved by ORing
> +	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
> +	 *
> +	 * For other callers, e.g. broadcast device replacement,
> +	 * tick_broadcast_oneshot_mask must not be touched as this would
> +	 * set bits for CPUs which are already NOHZ, but not idle. Their
> +	 * next tick_broadcast_enter() would observe the bit set and fail
> +	 * to update the expiry time and the broadcast event device.
> +	 */
> +	if (from_periodic) {
>  		cpumask_copy(tmpmask, tick_broadcast_mask);
> +		/* Remove the local CPU as it is obviously not idle */
>  		cpumask_clear_cpu(cpu, tmpmask);
> -		cpumask_or(tick_broadcast_oneshot_mask,
> -			   tick_broadcast_oneshot_mask, tmpmask);
> +		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
>  
> -		if (was_periodic && !cpumask_empty(tmpmask)) {
> -			ktime_t nextevt = tick_get_next_period();
> +		/*
> +		 * Ensure that the oneshot broadcast handler will wake the
> +		 * CPUs which are still waiting for periodic broadcast.
> +		 */
> +		nexttick = tick_get_next_period();
> +		tick_broadcast_init_next_event(tmpmask, nexttick);
>  
> -			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
> -			tick_broadcast_init_next_event(tmpmask, nextevt);
> -			tick_broadcast_set_event(bc, cpu, nextevt);
> -		} else
> -			bc->next_event = KTIME_MAX;
> -	} else {
>  		/*
> -		 * The first cpu which switches to oneshot mode sets
> -		 * the bit for all other cpus which are in the general
> -		 * (periodic) broadcast mask. So the bit is set and
> -		 * would prevent the first broadcast enter after this
> -		 * to program the bc device.
> +		 * If the underlying broadcast clock event device is
> +		 * already in oneshot state, then there is nothing to do.
> +		 * The device was already armed for the next tick
> +		 * in tick_handle_broadcast_periodic()
>  		 */
> -		tick_broadcast_clear_oneshot(cpu);
> +		if (clockevent_state_oneshot(bc))
> +			return;
>  	}
> +
> +	/*
> +	 * When switching from periodic to oneshot mode arm the broadcast
> +	 * device for the next tick.
> +	 *
> +	 * If the broadcast device has been replaced in oneshot mode and
> +	 * the oneshot broadcast mask is not empty, then arm it to expire
> +	 * immediately in order to reevaluate the next expiring timer.
> +	 * nexttick is 0 and therefore in the past which will cause the
> +	 * clockevent code to force an event.
> +	 *
> +	 * For both cases the programming can be avoided when the oneshot
> +	 * broadcast mask is empty.
> +	 *
> +	 * tick_broadcast_set_event() implicitly switches the broadcast
> +	 * device to oneshot state.
> +	 */
> +	if (!cpumask_empty(tick_broadcast_oneshot_mask))
> +		tick_broadcast_set_event(bc, cpu, nexttick);

For the case where the other CPUs have already installed their
tick devices and if that function is called with from_periodic=true,
the other CPUs will notice the oneshot change on their next call to
tick_broadcast_enter() thanks to the lock, right? So the tick broadcast
will keep firing until all CPUs have been through idle once and called
tick_broadcast_exit(), right? Because only them can clear themselves
from tick_broadcast_oneshot_mask, am I understanding this correctly?

I'm trying to find the opportunity for a race with dev->next_event
being seen as too far ahead in the future but can't manage so far...

Thanks.
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
On Wed, Apr 19 2023 at 15:36, Frederic Weisbecker wrote:
> Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
>> @@ -1020,48 +1021,89 @@ static inline ktime_t tick_get_next_peri
>>  /**
>>   * tick_broadcast_setup_oneshot - setup the broadcast device
>>   */
>> -static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
>> +static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
>> +					 bool from_periodic)
>>  {
>>  	int cpu = smp_processor_id();
>> +	ktime_t nexttick = 0;
>>  
>>  	if (!bc)
>>  		return;
>>  
>>  	/* Set it up only once ! */
>> -	if (bc->event_handler != tick_handle_oneshot_broadcast) {
>> -		int was_periodic = clockevent_state_periodic(bc);
>> -
>> -		bc->event_handler = tick_handle_oneshot_broadcast;
>> -
>> +	if (bc->event_handler == tick_handle_oneshot_broadcast) {
>>  		/*
>> -		 * We must be careful here. There might be other CPUs
>> -		 * waiting for periodic broadcast. We need to set the
>> -		 * oneshot_mask bits for those and program the
>> -		 * broadcast device to fire.
>> +		 * The CPU which switches from periodic to oneshot mode
>> +		 * sets the broadcast oneshot bit for all other CPUs which
>> +		 * are in the general (periodic) broadcast mask to ensure
>> +		 * that CPUs which wait for the periodic broadcast are
>> +		 * woken up.
>> +		 *
>> +		 * Clear the bit for the local CPU as the set bit would
>> +		 * prevent the first tick_broadcast_enter() after this CPU
>> +		 * switched to oneshot state to program the broadcast
>> +		 * device.
>>  		 */
>> +		tick_broadcast_clear_oneshot(cpu);
>
> So this path is reached when we setup/exchange a new tick device
> on a CPU after the broadcast device has been set to oneshot, right?
>
> Why does it have a specific treatment? Is it for optimization? Or am I
> missing a correctness based reason?

This path is taken during the switch from periodic to oneshot mode. The
way how this works is:

boot()
  setup_periodic()
    setup_periodic_broadcast()

  // From here on everything depends on the periodic broadcasting

  highres_clocksource_becomes_available()
    tick_clock_notify() <- Set's the .check_clocks bit on all CPUs

Now the first CPU which observes that bit switches to oneshot mode, but
the other CPUs might be waiting for the periodic broadcast at that
point. So the periodic to oneshot transition does:

  		cpumask_copy(tmpmask, tick_broadcast_mask);
		/* Remove the local CPU as it is obviously not idle */
  		cpumask_clear_cpu(cpu, tmpmask);
		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);

I.e. it makes sure that _ALL_ not yet converted CPUs will get woken up
by the new oneshot broadcast handler. 

Now when the other CPUs will observe the check_clock bit after that they
need to clear their bit in the oneshot mask while switching themself
from periodic to oneshot one otherwise the next tick_broadcast_enter()
would do nothing. That's all serialized by broadcast lock, so no race.

But that has nothing to do with switching the underlying clockevent
device. At that point all CPUs are already in oneshot mode and
tick_broadcast_oneshot_mask is correct.

So that will take the other code path:

    if (bc->event_handler == tick_handle_oneshot_broadcast) {
       // not taken because the new device is not yet set up
       return;
    }

    if (from_periodic) {
       // not taken because the switchover already happened
       // Here is where the cpumask magic happens
    }
    
> For the case where the other CPUs have already installed their
> tick devices and if that function is called with from_periodic=true,
> the other CPUs will notice the oneshot change on their next call to
> tick_broadcast_enter() thanks to the lock, right? So the tick broadcast
> will keep firing until all CPUs have been through idle once and called
> tick_broadcast_exit(), right? Because only them can clear themselves
> from tick_broadcast_oneshot_mask, am I understanding this correctly?

No. See above. It's about the check_clock bit handling on the other
CPUs.

It seems I failed miserably to explain that coherently with the tons of
comments added. Hrmpf :(

> I'm trying to find the opportunity for a race with dev->next_event
> being seen as too far ahead in the future but can't manage so far...

There is none :)

Thanks,

        tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Frederic Weisbecker 1 year ago
On Fri, Apr 21, 2023 at 11:32:15PM +0200, Thomas Gleixner wrote:
> On Wed, Apr 19 2023 at 15:36, Frederic Weisbecker wrote:
> This path is taken during the switch from periodic to oneshot mode. The
> way how this works is:
> 
> boot()
>   setup_periodic()
>     setup_periodic_broadcast()
> 
>   // From here on everything depends on the periodic broadcasting
> 
>   highres_clocksource_becomes_available()
>     tick_clock_notify() <- Set's the .check_clocks bit on all CPUs
> 
> Now the first CPU which observes that bit switches to oneshot mode, but
> the other CPUs might be waiting for the periodic broadcast at that
> point. So the periodic to oneshot transition does:
> 
>   		cpumask_copy(tmpmask, tick_broadcast_mask);
> 		/* Remove the local CPU as it is obviously not idle */
>   		cpumask_clear_cpu(cpu, tmpmask);
> 		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
> 
> I.e. it makes sure that _ALL_ not yet converted CPUs will get woken up
> by the new oneshot broadcast handler. 
> 
> Now when the other CPUs will observe the check_clock bit after that they
> need to clear their bit in the oneshot mask while switching themself
> from periodic to oneshot one otherwise the next tick_broadcast_enter()
> would do nothing. That's all serialized by broadcast lock, so no race.
> 
> But that has nothing to do with switching the underlying clockevent
> device. At that point all CPUs are already in oneshot mode and
> tick_broadcast_oneshot_mask is correct.
> 
> So that will take the other code path:
> 
>     if (bc->event_handler == tick_handle_oneshot_broadcast) {
>        // not taken because the new device is not yet set up
>        return;
>     }
> 
>     if (from_periodic) {
>        // not taken because the switchover already happened
>        // Here is where the cpumask magic happens
>     }
>

I see, I guess I got lost somewhere into the tree of the possible
callchains :)

tick_broadcast_setup_oneshot()
	tick_broadcast_switch_to_oneshot
		tick_install_broadcast_device
			tick_check_new_device
				clockevents_notify_released
					clockevents_register_device (new device)
				clockevents_register_device (new device)
		tick_switch_to_oneshot
			tick_init_highres
				 hrtimer_switch_to_hres
					hrtimer_run_queues (timer softirq)
			tick_nohz_switch_to_nohz
				tick_check_oneshot_change (test and clear check_clock)
					hrtimer_run_queues (timer softirq))
	tick_device_uses_broadcast
		tick_setup_device
			tick_install_replacement
				clockevents_replace
					__clockevents_unbind
						clockevents_unbind
							unbind_device_store (sysfs)
							clockevents_unbind_device (driver)
			tick_check_new_device
				clockevents_notify_released
					clockevents_register_device (new device)
				clockevents_register_device (new device)
	tick_broadcast_control
		tick_broadcast_enable (cpuidle driver register, cpu up, ...)
		tick_broadcast_disable (cpuidle driver unregister, ...)
		tick_broadcast_force (amd apic bug setup)


Ok I get the check_clock game. But then, why do we need to reprogram
again the broadcast device to fire in one jiffy if the caller is
tick_nohz_switch_to_nohz() (that is the (bc->event_handler ==
tick_handle_oneshot_broadcast) branch)? In that case the broadcast device
should have been programmed already by the CPU that first switched the
current broadcast device, right?

> > For the case where the other CPUs have already installed their
> > tick devices and if that function is called with from_periodic=true,
> > the other CPUs will notice the oneshot change on their next call to
> > tick_broadcast_enter() thanks to the lock, right? So the tick broadcast
> > will keep firing until all CPUs have been through idle once and called
> > tick_broadcast_exit(), right? Because only them can clear themselves
> > from tick_broadcast_oneshot_mask, am I understanding this correctly?
> 
> No. See above. It's about the check_clock bit handling on the other
> CPUs.
> 
> It seems I failed miserably to explain that coherently with the tons of
> comments added. Hrmpf :(

Don't pay too much attention, confusion is my vehicle to explore any code
that I'm not used to. But yes I must confess the
(bc->event_handler == tick_handle_oneshot_broadcast) may deserve a comment
remaining where we come from (ie: low-res hrtimer softirq).

Thanks.
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
On Tue, May 02 2023 at 13:19, Frederic Weisbecker wrote:
> On Fri, Apr 21, 2023 at 11:32:15PM +0200, Thomas Gleixner wrote:
> Ok I get the check_clock game. But then, why do we need to reprogram
> again the broadcast device to fire in one jiffy if the caller is
> tick_nohz_switch_to_nohz() (that is the (bc->event_handler ==
> tick_handle_oneshot_broadcast) branch)? In that case the broadcast device
> should have been programmed already by the CPU that first switched the
> current broadcast device, right?

That clearly lacks a return in that path.

>> It seems I failed miserably to explain that coherently with the tons of
>> comments added. Hrmpf :(
>
> Don't pay too much attention, confusion is my vehicle to explore any code
> that I'm not used to. But yes I must confess the
> (bc->event_handler == tick_handle_oneshot_broadcast) may deserve a comment
> remaining where we come from (ie: low-res hrtimer softirq).

Updated patch below.

Thanks,

        tglx
---
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -35,14 +35,15 @@ static __cacheline_aligned_in_smp DEFINE
 #ifdef CONFIG_TICK_ONESHOT
 static DEFINE_PER_CPU(struct clock_event_device *, tick_oneshot_wakeup_device);
 
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic);
 static void tick_broadcast_clear_oneshot(int cpu);
 static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
 # ifdef CONFIG_HOTPLUG_CPU
 static void tick_broadcast_oneshot_offline(unsigned int cpu);
 # endif
 #else
-static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
+static inline void
+tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic) { BUG(); }
 static inline void tick_broadcast_clear_oneshot(int cpu) { }
 static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
 # ifdef CONFIG_HOTPLUG_CPU
@@ -264,7 +265,7 @@ int tick_device_uses_broadcast(struct cl
 		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 			tick_broadcast_start_periodic(bc);
 		else
-			tick_broadcast_setup_oneshot(bc);
+			tick_broadcast_setup_oneshot(bc, false);
 		ret = 1;
 	} else {
 		/*
@@ -500,7 +501,7 @@ void tick_broadcast_control(enum tick_br
 			if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 				tick_broadcast_start_periodic(bc);
 			else
-				tick_broadcast_setup_oneshot(bc);
+				tick_broadcast_setup_oneshot(bc, false);
 		}
 	}
 out:
@@ -1020,48 +1021,95 @@ static inline ktime_t tick_get_next_peri
 /**
  * tick_broadcast_setup_oneshot - setup the broadcast device
  */
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
+					 bool from_periodic)
 {
 	int cpu = smp_processor_id();
+	ktime_t nexttick = 0;
 
 	if (!bc)
 		return;
 
-	/* Set it up only once ! */
-	if (bc->event_handler != tick_handle_oneshot_broadcast) {
-		int was_periodic = clockevent_state_periodic(bc);
-
-		bc->event_handler = tick_handle_oneshot_broadcast;
-
+	/*
+	 * When the broadcast device was switched to oneshot by the first
+	 * CPU handling the NOHZ change, the other CPUs will reach this
+	 * code via hrtimer_run_queues() -> tick_check_oneshot_change()
+	 * too. Set up the broadcast device only once!
+	 */
+	if (bc->event_handler == tick_handle_oneshot_broadcast) {
 		/*
-		 * We must be careful here. There might be other CPUs
-		 * waiting for periodic broadcast. We need to set the
-		 * oneshot_mask bits for those and program the
-		 * broadcast device to fire.
+		 * The CPU which switched from periodic to oneshot mode
+		 * set the broadcast oneshot bit for all other CPUs which
+		 * are in the general (periodic) broadcast mask to ensure
+		 * that CPUs which wait for the periodic broadcast are
+		 * woken up.
+		 *
+		 * Clear the bit for the local CPU as the set bit would
+		 * prevent the first tick_broadcast_enter() after this CPU
+		 * switched to oneshot state to program the broadcast
+		 * device.
 		 */
+		tick_broadcast_clear_oneshot(cpu);
+		return;
+	}
+
+
+	bc->event_handler = tick_handle_oneshot_broadcast;
+	bc->next_event = KTIME_MAX;
+
+	/*
+	 * When the tick mode is switched from periodic to oneshot it must
+	 * be ensured that CPUs which are waiting for periodic broadcast
+	 * get their wake-up at the next tick.  This is achieved by ORing
+	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
+	 *
+	 * For other callers, e.g. broadcast device replacement,
+	 * tick_broadcast_oneshot_mask must not be touched as this would
+	 * set bits for CPUs which are already NOHZ, but not idle. Their
+	 * next tick_broadcast_enter() would observe the bit set and fail
+	 * to update the expiry time and the broadcast event device.
+	 */
+	if (from_periodic) {
 		cpumask_copy(tmpmask, tick_broadcast_mask);
+		/* Remove the local CPU as it is obviously not idle */
 		cpumask_clear_cpu(cpu, tmpmask);
-		cpumask_or(tick_broadcast_oneshot_mask,
-			   tick_broadcast_oneshot_mask, tmpmask);
+		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
 
-		if (was_periodic && !cpumask_empty(tmpmask)) {
-			ktime_t nextevt = tick_get_next_period();
+		/*
+		 * Ensure that the oneshot broadcast handler will wake the
+		 * CPUs which are still waiting for periodic broadcast.
+		 */
+		nexttick = tick_get_next_period();
+		tick_broadcast_init_next_event(tmpmask, nexttick);
 
-			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
-			tick_broadcast_init_next_event(tmpmask, nextevt);
-			tick_broadcast_set_event(bc, cpu, nextevt);
-		} else
-			bc->next_event = KTIME_MAX;
-	} else {
 		/*
-		 * The first cpu which switches to oneshot mode sets
-		 * the bit for all other cpus which are in the general
-		 * (periodic) broadcast mask. So the bit is set and
-		 * would prevent the first broadcast enter after this
-		 * to program the bc device.
+		 * If the underlying broadcast clock event device is
+		 * already in oneshot state, then there is nothing to do.
+		 * The device was already armed for the next tick
+		 * in tick_handle_broadcast_periodic()
 		 */
-		tick_broadcast_clear_oneshot(cpu);
+		if (clockevent_state_oneshot(bc))
+			return;
 	}
+
+	/*
+	 * When switching from periodic to oneshot mode arm the broadcast
+	 * device for the next tick.
+	 *
+	 * If the broadcast device has been replaced in oneshot mode and
+	 * the oneshot broadcast mask is not empty, then arm it to expire
+	 * immediately in order to reevaluate the next expiring timer.
+	 * nexttick is 0 and therefore in the past which will cause the
+	 * clockevent code to force an event.
+	 *
+	 * For both cases the programming can be avoided when the oneshot
+	 * broadcast mask is empty.
+	 *
+	 * tick_broadcast_set_event() implicitly switches the broadcast
+	 * device to oneshot state.
+	 */
+	if (!cpumask_empty(tick_broadcast_oneshot_mask))
+		tick_broadcast_set_event(bc, cpu, nexttick);
 }
 
 /*
@@ -1070,14 +1118,16 @@ static void tick_broadcast_setup_oneshot
 void tick_broadcast_switch_to_oneshot(void)
 {
 	struct clock_event_device *bc;
+	enum tick_device_mode oldmode;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 
+	oldmode = tick_broadcast_device.mode;
 	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
 	bc = tick_broadcast_device.evtdev;
 	if (bc)
-		tick_broadcast_setup_oneshot(bc);
+		tick_broadcast_setup_oneshot(bc, oldmode == TICKDEV_MODE_PERIODIC);
 
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Victor Hassan 12 months ago

On 5/2/2023 8:38 PM, Thomas Gleixner wrote:
> On Tue, May 02 2023 at 13:19, Frederic Weisbecker wrote:
>> On Fri, Apr 21, 2023 at 11:32:15PM +0200, Thomas Gleixner wrote:
>> Ok I get the check_clock game. But then, why do we need to reprogram
>> again the broadcast device to fire in one jiffy if the caller is
>> tick_nohz_switch_to_nohz() (that is the (bc->event_handler ==
>> tick_handle_oneshot_broadcast) branch)? In that case the broadcast device
>> should have been programmed already by the CPU that first switched the
>> current broadcast device, right?
> 
> That clearly lacks a return in that path.
> 
>>> It seems I failed miserably to explain that coherently with the tons of
>>> comments added. Hrmpf :(
>>
>> Don't pay too much attention, confusion is my vehicle to explore any code
>> that I'm not used to. But yes I must confess the
>> (bc->event_handler == tick_handle_oneshot_broadcast) may deserve a comment
>> remaining where we come from (ie: low-res hrtimer softirq).
> 
> Updated patch below.
> 
> Thanks,
> 
>          tglx
> ---
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -35,14 +35,15 @@ static __cacheline_aligned_in_smp DEFINE
>   #ifdef CONFIG_TICK_ONESHOT
>   static DEFINE_PER_CPU(struct clock_event_device *, tick_oneshot_wakeup_device);
>   
> -static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
> +static void tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic);
>   static void tick_broadcast_clear_oneshot(int cpu);
>   static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
>   # ifdef CONFIG_HOTPLUG_CPU
>   static void tick_broadcast_oneshot_offline(unsigned int cpu);
>   # endif
>   #else
> -static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
> +static inline void
> +tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic) { BUG(); }
>   static inline void tick_broadcast_clear_oneshot(int cpu) { }
>   static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
>   # ifdef CONFIG_HOTPLUG_CPU
> @@ -264,7 +265,7 @@ int tick_device_uses_broadcast(struct cl
>   		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
>   			tick_broadcast_start_periodic(bc);
>   		else
> -			tick_broadcast_setup_oneshot(bc);
> +			tick_broadcast_setup_oneshot(bc, false);
>   		ret = 1;
>   	} else {
>   		/*
> @@ -500,7 +501,7 @@ void tick_broadcast_control(enum tick_br
>   			if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
>   				tick_broadcast_start_periodic(bc);
>   			else
> -				tick_broadcast_setup_oneshot(bc);
> +				tick_broadcast_setup_oneshot(bc, false);
>   		}
>   	}
>   out:
> @@ -1020,48 +1021,95 @@ static inline ktime_t tick_get_next_peri
>   /**
>    * tick_broadcast_setup_oneshot - setup the broadcast device
>    */
> -static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
> +static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
> +					 bool from_periodic)
>   {
>   	int cpu = smp_processor_id();
> +	ktime_t nexttick = 0;
>   
>   	if (!bc)
>   		return;
>   
> -	/* Set it up only once ! */
> -	if (bc->event_handler != tick_handle_oneshot_broadcast) {
> -		int was_periodic = clockevent_state_periodic(bc);
> -
> -		bc->event_handler = tick_handle_oneshot_broadcast;
> -
> +	/*
> +	 * When the broadcast device was switched to oneshot by the first
> +	 * CPU handling the NOHZ change, the other CPUs will reach this
> +	 * code via hrtimer_run_queues() -> tick_check_oneshot_change()
> +	 * too. Set up the broadcast device only once!
> +	 */
> +	if (bc->event_handler == tick_handle_oneshot_broadcast) {
>   		/*
> -		 * We must be careful here. There might be other CPUs
> -		 * waiting for periodic broadcast. We need to set the
> -		 * oneshot_mask bits for those and program the
> -		 * broadcast device to fire.
> +		 * The CPU which switched from periodic to oneshot mode
> +		 * set the broadcast oneshot bit for all other CPUs which
> +		 * are in the general (periodic) broadcast mask to ensure
> +		 * that CPUs which wait for the periodic broadcast are
> +		 * woken up.
> +		 *
> +		 * Clear the bit for the local CPU as the set bit would
> +		 * prevent the first tick_broadcast_enter() after this CPU
> +		 * switched to oneshot state to program the broadcast
> +		 * device.
>   		 */
> +		tick_broadcast_clear_oneshot(cpu);
> +		return;
> +	}
> +
> +
> +	bc->event_handler = tick_handle_oneshot_broadcast;
> +	bc->next_event = KTIME_MAX;
> +
> +	/*
> +	 * When the tick mode is switched from periodic to oneshot it must
> +	 * be ensured that CPUs which are waiting for periodic broadcast
> +	 * get their wake-up at the next tick.  This is achieved by ORing
> +	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
> +	 *
> +	 * For other callers, e.g. broadcast device replacement,
> +	 * tick_broadcast_oneshot_mask must not be touched as this would
> +	 * set bits for CPUs which are already NOHZ, but not idle. Their
> +	 * next tick_broadcast_enter() would observe the bit set and fail
> +	 * to update the expiry time and the broadcast event device.
> +	 */
> +	if (from_periodic) {
>   		cpumask_copy(tmpmask, tick_broadcast_mask);
> +		/* Remove the local CPU as it is obviously not idle */
>   		cpumask_clear_cpu(cpu, tmpmask);
> -		cpumask_or(tick_broadcast_oneshot_mask,
> -			   tick_broadcast_oneshot_mask, tmpmask);
> +		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
>   
> -		if (was_periodic && !cpumask_empty(tmpmask)) {
> -			ktime_t nextevt = tick_get_next_period();
> +		/*
> +		 * Ensure that the oneshot broadcast handler will wake the
> +		 * CPUs which are still waiting for periodic broadcast.
> +		 */
> +		nexttick = tick_get_next_period();
> +		tick_broadcast_init_next_event(tmpmask, nexttick);
>   
> -			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
> -			tick_broadcast_init_next_event(tmpmask, nextevt);
> -			tick_broadcast_set_event(bc, cpu, nextevt);
> -		} else
> -			bc->next_event = KTIME_MAX;
> -	} else {
>   		/*
> -		 * The first cpu which switches to oneshot mode sets
> -		 * the bit for all other cpus which are in the general
> -		 * (periodic) broadcast mask. So the bit is set and
> -		 * would prevent the first broadcast enter after this
> -		 * to program the bc device.
> +		 * If the underlying broadcast clock event device is
> +		 * already in oneshot state, then there is nothing to do.
> +		 * The device was already armed for the next tick
> +		 * in tick_handle_broadcast_periodic()
>   		 */
> -		tick_broadcast_clear_oneshot(cpu);
> +		if (clockevent_state_oneshot(bc))
> +			return;
>   	}
> +
> +	/*
> +	 * When switching from periodic to oneshot mode arm the broadcast
> +	 * device for the next tick.
> +	 *
> +	 * If the broadcast device has been replaced in oneshot mode and
> +	 * the oneshot broadcast mask is not empty, then arm it to expire
> +	 * immediately in order to reevaluate the next expiring timer.
> +	 * nexttick is 0 and therefore in the past which will cause the
> +	 * clockevent code to force an event.
> +	 *
> +	 * For both cases the programming can be avoided when the oneshot
> +	 * broadcast mask is empty.
> +	 *
> +	 * tick_broadcast_set_event() implicitly switches the broadcast
> +	 * device to oneshot state.
> +	 */
> +	if (!cpumask_empty(tick_broadcast_oneshot_mask))
> +		tick_broadcast_set_event(bc, cpu, nexttick);
>   }
>   
>   /*
> @@ -1070,14 +1118,16 @@ static void tick_broadcast_setup_oneshot
>   void tick_broadcast_switch_to_oneshot(void)
>   {
>   	struct clock_event_device *bc;
> +	enum tick_device_mode oldmode;
>   	unsigned long flags;
>   
>   	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
>   
> +	oldmode = tick_broadcast_device.mode;
>   	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
>   	bc = tick_broadcast_device.evtdev;
>   	if (bc)
> -		tick_broadcast_setup_oneshot(bc);
> +		tick_broadcast_setup_oneshot(bc, oldmode == TICKDEV_MODE_PERIODIC);
>   
>   	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
>   }

That looks good to me.
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Frederic Weisbecker 12 months ago
On Tue, May 02, 2023 at 02:38:57PM +0200, Thomas Gleixner wrote:
> Updated patch below.
> 
> Thanks,

Looks good from my layperson's eyes, just a doubt about a comment below:

> +
> +	/*
> +	 * When switching from periodic to oneshot mode arm the broadcast
> +	 * device for the next tick.
> +	 *
> +	 * If the broadcast device has been replaced in oneshot mode and
> +	 * the oneshot broadcast mask is not empty, then arm it to expire
> +	 * immediately in order to reevaluate the next expiring timer.
> +	 * nexttick is 0 and therefore in the past which will cause the

Is nexttick really in the past? It's set to tick_next_period...

Thanks.

> +	 * clockevent code to force an event.
> +	 *
> +	 * For both cases the programming can be avoided when the oneshot
> +	 * broadcast mask is empty.
> +	 *
> +	 * tick_broadcast_set_event() implicitly switches the broadcast
> +	 * device to oneshot state.
> +	 */
> +	if (!cpumask_empty(tick_broadcast_oneshot_mask))
> +		tick_broadcast_set_event(bc, cpu, nexttick);
>  }
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 12 months ago
On Thu, May 04 2023 at 00:27, Frederic Weisbecker wrote:

> On Tue, May 02, 2023 at 02:38:57PM +0200, Thomas Gleixner wrote:
>> Updated patch below.
>> 
>> Thanks,
>
> Looks good from my layperson's eyes, just a doubt about a comment below:
>
>> +
>> +	/*
>> +	 * When switching from periodic to oneshot mode arm the broadcast
>> +	 * device for the next tick.
>> +	 *
>> +	 * If the broadcast device has been replaced in oneshot mode and
>> +	 * the oneshot broadcast mask is not empty, then arm it to expire
>> +	 * immediately in order to reevaluate the next expiring timer.
>> +	 * nexttick is 0 and therefore in the past which will cause the
>
> Is nexttick really in the past? It's set to tick_next_period...

Only in the non-replacement, i.e. the 'from_periodic' codepath, no?

Thanks,

        tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Frederic Weisbecker 12 months ago
On Thu, May 04, 2023 at 12:53:52AM +0200, Thomas Gleixner wrote:
> On Thu, May 04 2023 at 00:27, Frederic Weisbecker wrote:
> 
> > On Tue, May 02, 2023 at 02:38:57PM +0200, Thomas Gleixner wrote:
> >> Updated patch below.
> >> 
> >> Thanks,
> >
> > Looks good from my layperson's eyes, just a doubt about a comment below:
> >
> >> +
> >> +	/*
> >> +	 * When switching from periodic to oneshot mode arm the broadcast
> >> +	 * device for the next tick.
> >> +	 *
> >> +	 * If the broadcast device has been replaced in oneshot mode and
> >> +	 * the oneshot broadcast mask is not empty, then arm it to expire
> >> +	 * immediately in order to reevaluate the next expiring timer.
> >> +	 * nexttick is 0 and therefore in the past which will cause the
> >
> > Is nexttick really in the past? It's set to tick_next_period...
> 
> Only in the non-replacement, i.e. the 'from_periodic' codepath, no?

Bah, missed that, right.

A shy Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thanks!
[PATCH v3] tick/broadcast: Make broadcast device replacement work correctly
Posted by Thomas Gleixner 12 months ago
When a tick broadcast clockevent device is initialized for one shot mode
then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
cpumask into the oneshot broadcast cpumask.

This is required when switching from periodic broadcast mode to oneshot
broadcast mode to ensure that CPUs which are waiting for periodic
broadcast are woken up on the next tick.

But it is subtly broken, when an active broadcast device is replaced and
the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
this and debugged the issue.

Then the OR of the periodic broadcast CPU mask is wrong as the periodic
cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
unless explicitly cleared via tick_broadcast_disable().

That means that this sets all other CPUs which have tick broadcasting
enabled at that point unconditionally in the oneshot broadcast mask.

If the affected CPUs were already idle and had their bits set in the
oneshot broadcast mask then this does no harm. But for non idle CPUs
which were not set this corrupts their state.

On their next invocation of tick_broadcast_enable() they observe the bit
set, which indicates that the broadcast for the CPU is already set up.
As a consequence they fail to update the broadcast event even if their
earliest expiring timer is before the actually programmed broadcast
event.

If the programmed broadcast event is far in the future, then this can
cause stalls or trigger the hung task detector.

Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
this is the initial switch over from periodic to oneshot broadcast which
must take the periodic broadcast mask into account. In the case of
initialization of a replacement device this prevents that the broadcast
oneshot mask is modified.

There is a second problem with broadcast device replacement in this
function. The broadcast device is only armed when the previous state of
the device was periodic.

That is correct for the switch from periodic broadcast mode to oneshot
broadcast mode as the underlying broadcast device could operate in
oneshot state already due to lack of periodic state in hardware. In that
case it is already armed to expire at the next tick.

For the replacement case this is wrong as the device is in shutdown
state. That means that any already pending broadcast event will not be
armed.

This went unnoticed because any CPU which goes idle will observe that
the broadcast device has an expiry time of KTIME_MAX and therefore any
CPUs next timer event will be earlier and cause a reprogramming of the
broadcast device. But that does not guarantee that the events of the
CPUs which were already in idle are delivered on time.

Fix this by arming the newly installed device for an immediate event
which will reevaluate the per CPU expiry times and reprogram the
broadcast device accordingly. This is simpler than caching the last
expiry time in yet another place or saving it before the device exchange
and handing it down to the setup function. Replacement of broadcast
devices is not a frequent operation and usually happens once somewhere
late in the boot process.

Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")
Reported-by: Victor Hassan <victor@allwinnertech.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/tick-broadcast.c |  114 ++++++++++++++++++++++++++++++-------------
 1 file changed, 82 insertions(+), 32 deletions(-)

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -35,14 +35,15 @@ static __cacheline_aligned_in_smp DEFINE
 #ifdef CONFIG_TICK_ONESHOT
 static DEFINE_PER_CPU(struct clock_event_device *, tick_oneshot_wakeup_device);
 
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic);
 static void tick_broadcast_clear_oneshot(int cpu);
 static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
 # ifdef CONFIG_HOTPLUG_CPU
 static void tick_broadcast_oneshot_offline(unsigned int cpu);
 # endif
 #else
-static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
+static inline void
+tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic) { BUG(); }
 static inline void tick_broadcast_clear_oneshot(int cpu) { }
 static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
 # ifdef CONFIG_HOTPLUG_CPU
@@ -264,7 +265,7 @@ int tick_device_uses_broadcast(struct cl
 		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 			tick_broadcast_start_periodic(bc);
 		else
-			tick_broadcast_setup_oneshot(bc);
+			tick_broadcast_setup_oneshot(bc, false);
 		ret = 1;
 	} else {
 		/*
@@ -500,7 +501,7 @@ void tick_broadcast_control(enum tick_br
 			if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 				tick_broadcast_start_periodic(bc);
 			else
-				tick_broadcast_setup_oneshot(bc);
+				tick_broadcast_setup_oneshot(bc, false);
 		}
 	}
 out:
@@ -1020,48 +1021,95 @@ static inline ktime_t tick_get_next_peri
 /**
  * tick_broadcast_setup_oneshot - setup the broadcast device
  */
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
+					 bool from_periodic)
 {
 	int cpu = smp_processor_id();
+	ktime_t nexttick = 0;
 
 	if (!bc)
 		return;
 
-	/* Set it up only once ! */
-	if (bc->event_handler != tick_handle_oneshot_broadcast) {
-		int was_periodic = clockevent_state_periodic(bc);
-
-		bc->event_handler = tick_handle_oneshot_broadcast;
-
+	/*
+	 * When the broadcast device was switched to oneshot by the first
+	 * CPU handling the NOHZ change, the other CPUs will reach this
+	 * code via hrtimer_run_queues() -> tick_check_oneshot_change()
+	 * too. Set up the broadcast device only once!
+	 */
+	if (bc->event_handler == tick_handle_oneshot_broadcast) {
 		/*
-		 * We must be careful here. There might be other CPUs
-		 * waiting for periodic broadcast. We need to set the
-		 * oneshot_mask bits for those and program the
-		 * broadcast device to fire.
+		 * The CPU which switched from periodic to oneshot mode
+		 * set the broadcast oneshot bit for all other CPUs which
+		 * are in the general (periodic) broadcast mask to ensure
+		 * that CPUs which wait for the periodic broadcast are
+		 * woken up.
+		 *
+		 * Clear the bit for the local CPU as the set bit would
+		 * prevent the first tick_broadcast_enter() after this CPU
+		 * switched to oneshot state to program the broadcast
+		 * device.
 		 */
+		tick_broadcast_clear_oneshot(cpu);
+		return;
+	}
+
+
+	bc->event_handler = tick_handle_oneshot_broadcast;
+	bc->next_event = KTIME_MAX;
+
+	/*
+	 * When the tick mode is switched from periodic to oneshot it must
+	 * be ensured that CPUs which are waiting for periodic broadcast
+	 * get their wake-up at the next tick.  This is achieved by ORing
+	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
+	 *
+	 * For other callers, e.g. broadcast device replacement,
+	 * tick_broadcast_oneshot_mask must not be touched as this would
+	 * set bits for CPUs which are already NOHZ, but not idle. Their
+	 * next tick_broadcast_enter() would observe the bit set and fail
+	 * to update the expiry time and the broadcast event device.
+	 */
+	if (from_periodic) {
 		cpumask_copy(tmpmask, tick_broadcast_mask);
+		/* Remove the local CPU as it is obviously not idle */
 		cpumask_clear_cpu(cpu, tmpmask);
-		cpumask_or(tick_broadcast_oneshot_mask,
-			   tick_broadcast_oneshot_mask, tmpmask);
+		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
 
-		if (was_periodic && !cpumask_empty(tmpmask)) {
-			ktime_t nextevt = tick_get_next_period();
+		/*
+		 * Ensure that the oneshot broadcast handler will wake the
+		 * CPUs which are still waiting for periodic broadcast.
+		 */
+		nexttick = tick_get_next_period();
+		tick_broadcast_init_next_event(tmpmask, nexttick);
 
-			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
-			tick_broadcast_init_next_event(tmpmask, nextevt);
-			tick_broadcast_set_event(bc, cpu, nextevt);
-		} else
-			bc->next_event = KTIME_MAX;
-	} else {
 		/*
-		 * The first cpu which switches to oneshot mode sets
-		 * the bit for all other cpus which are in the general
-		 * (periodic) broadcast mask. So the bit is set and
-		 * would prevent the first broadcast enter after this
-		 * to program the bc device.
+		 * If the underlying broadcast clock event device is
+		 * already in oneshot state, then there is nothing to do.
+		 * The device was already armed for the next tick
+		 * in tick_handle_broadcast_periodic()
 		 */
-		tick_broadcast_clear_oneshot(cpu);
+		if (clockevent_state_oneshot(bc))
+			return;
 	}
+
+	/*
+	 * When switching from periodic to oneshot mode arm the broadcast
+	 * device for the next tick.
+	 *
+	 * If the broadcast device has been replaced in oneshot mode and
+	 * the oneshot broadcast mask is not empty, then arm it to expire
+	 * immediately in order to reevaluate the next expiring timer.
+	 * nexttick is 0 and therefore in the past which will cause the
+	 * clockevent code to force an event.
+	 *
+	 * For both cases the programming can be avoided when the oneshot
+	 * broadcast mask is empty.
+	 *
+	 * tick_broadcast_set_event() implicitly switches the broadcast
+	 * device to oneshot state.
+	 */
+	if (!cpumask_empty(tick_broadcast_oneshot_mask))
+		tick_broadcast_set_event(bc, cpu, nexttick);
 }
 
 /*
@@ -1070,14 +1118,16 @@ static void tick_broadcast_setup_oneshot
 void tick_broadcast_switch_to_oneshot(void)
 {
 	struct clock_event_device *bc;
+	enum tick_device_mode oldmode;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 
+	oldmode = tick_broadcast_device.mode;
 	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
 	bc = tick_broadcast_device.evtdev;
 	if (bc)
-		tick_broadcast_setup_oneshot(bc);
+		tick_broadcast_setup_oneshot(bc, oldmode == TICKDEV_MODE_PERIODIC);
 
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
[tip: timers/urgent] tick/broadcast: Make broadcast device replacement work correctly
Posted by tip-bot2 for Thomas Gleixner 11 months, 4 weeks ago
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID:     f9d36cf445ffff0b913ba187a3eff78028f9b1fb
Gitweb:        https://git.kernel.org/tip/f9d36cf445ffff0b913ba187a3eff78028f9b1fb
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Sat, 06 May 2023 18:40:57 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Mon, 08 May 2023 23:18:16 +02:00

tick/broadcast: Make broadcast device replacement work correctly

When a tick broadcast clockevent device is initialized for one shot mode
then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
cpumask into the oneshot broadcast cpumask.

This is required when switching from periodic broadcast mode to oneshot
broadcast mode to ensure that CPUs which are waiting for periodic
broadcast are woken up on the next tick.

But it is subtly broken, when an active broadcast device is replaced and
the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
this and debugged the issue.

Then the OR of the periodic broadcast CPU mask is wrong as the periodic
cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
unless explicitly cleared via tick_broadcast_disable().

That means that this sets all other CPUs which have tick broadcasting
enabled at that point unconditionally in the oneshot broadcast mask.

If the affected CPUs were already idle and had their bits set in the
oneshot broadcast mask then this does no harm. But for non idle CPUs
which were not set this corrupts their state.

On their next invocation of tick_broadcast_enable() they observe the bit
set, which indicates that the broadcast for the CPU is already set up.
As a consequence they fail to update the broadcast event even if their
earliest expiring timer is before the actually programmed broadcast
event.

If the programmed broadcast event is far in the future, then this can
cause stalls or trigger the hung task detector.

Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
this is the initial switch over from periodic to oneshot broadcast which
must take the periodic broadcast mask into account. In the case of
initialization of a replacement device this prevents that the broadcast
oneshot mask is modified.

There is a second problem with broadcast device replacement in this
function. The broadcast device is only armed when the previous state of
the device was periodic.

That is correct for the switch from periodic broadcast mode to oneshot
broadcast mode as the underlying broadcast device could operate in
oneshot state already due to lack of periodic state in hardware. In that
case it is already armed to expire at the next tick.

For the replacement case this is wrong as the device is in shutdown
state. That means that any already pending broadcast event will not be
armed.

This went unnoticed because any CPU which goes idle will observe that
the broadcast device has an expiry time of KTIME_MAX and therefore any
CPUs next timer event will be earlier and cause a reprogramming of the
broadcast device. But that does not guarantee that the events of the
CPUs which were already in idle are delivered on time.

Fix this by arming the newly installed device for an immediate event
which will reevaluate the per CPU expiry times and reprogram the
broadcast device accordingly. This is simpler than caching the last
expiry time in yet another place or saving it before the device exchange
and handing it down to the setup function. Replacement of broadcast
devices is not a frequent operation and usually happens once somewhere
late in the boot process.

Fixes: 9c336c9935cf ("tick/broadcast: Allow late registered device to enter oneshot mode")
Reported-by: Victor Hassan <victor@allwinnertech.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/87pm7d2z1i.ffs@tglx

---
 kernel/time/tick-broadcast.c | 120 ++++++++++++++++++++++++----------
 1 file changed, 88 insertions(+), 32 deletions(-)

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index 93bf2b4..771d1e0 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -35,14 +35,15 @@ static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(tick_broadcast_lock);
 #ifdef CONFIG_TICK_ONESHOT
 static DEFINE_PER_CPU(struct clock_event_device *, tick_oneshot_wakeup_device);
 
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc);
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic);
 static void tick_broadcast_clear_oneshot(int cpu);
 static void tick_resume_broadcast_oneshot(struct clock_event_device *bc);
 # ifdef CONFIG_HOTPLUG_CPU
 static void tick_broadcast_oneshot_offline(unsigned int cpu);
 # endif
 #else
-static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc) { BUG(); }
+static inline void
+tick_broadcast_setup_oneshot(struct clock_event_device *bc, bool from_periodic) { BUG(); }
 static inline void tick_broadcast_clear_oneshot(int cpu) { }
 static inline void tick_resume_broadcast_oneshot(struct clock_event_device *bc) { }
 # ifdef CONFIG_HOTPLUG_CPU
@@ -264,7 +265,7 @@ int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
 		if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 			tick_broadcast_start_periodic(bc);
 		else
-			tick_broadcast_setup_oneshot(bc);
+			tick_broadcast_setup_oneshot(bc, false);
 		ret = 1;
 	} else {
 		/*
@@ -500,7 +501,7 @@ void tick_broadcast_control(enum tick_broadcast_mode mode)
 			if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
 				tick_broadcast_start_periodic(bc);
 			else
-				tick_broadcast_setup_oneshot(bc);
+				tick_broadcast_setup_oneshot(bc, false);
 		}
 	}
 out:
@@ -1020,48 +1021,101 @@ static inline ktime_t tick_get_next_period(void)
 /**
  * tick_broadcast_setup_oneshot - setup the broadcast device
  */
-static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
+static void tick_broadcast_setup_oneshot(struct clock_event_device *bc,
+					 bool from_periodic)
 {
 	int cpu = smp_processor_id();
+	ktime_t nexttick = 0;
 
 	if (!bc)
 		return;
 
-	/* Set it up only once ! */
-	if (bc->event_handler != tick_handle_oneshot_broadcast) {
-		int was_periodic = clockevent_state_periodic(bc);
-
-		bc->event_handler = tick_handle_oneshot_broadcast;
-
+	/*
+	 * When the broadcast device was switched to oneshot by the first
+	 * CPU handling the NOHZ change, the other CPUs will reach this
+	 * code via hrtimer_run_queues() -> tick_check_oneshot_change()
+	 * too. Set up the broadcast device only once!
+	 */
+	if (bc->event_handler == tick_handle_oneshot_broadcast) {
 		/*
-		 * We must be careful here. There might be other CPUs
-		 * waiting for periodic broadcast. We need to set the
-		 * oneshot_mask bits for those and program the
-		 * broadcast device to fire.
+		 * The CPU which switched from periodic to oneshot mode
+		 * set the broadcast oneshot bit for all other CPUs which
+		 * are in the general (periodic) broadcast mask to ensure
+		 * that CPUs which wait for the periodic broadcast are
+		 * woken up.
+		 *
+		 * Clear the bit for the local CPU as the set bit would
+		 * prevent the first tick_broadcast_enter() after this CPU
+		 * switched to oneshot state to program the broadcast
+		 * device.
+		 *
+		 * This code can also be reached via tick_broadcast_control(),
+		 * but this cannot avoid the tick_broadcast_clear_oneshot()
+		 * as that would break the periodic to oneshot transition of
+		 * secondary CPUs. But that's harmless as the below only
+		 * clears already cleared bits.
 		 */
+		tick_broadcast_clear_oneshot(cpu);
+		return;
+	}
+
+
+	bc->event_handler = tick_handle_oneshot_broadcast;
+	bc->next_event = KTIME_MAX;
+
+	/*
+	 * When the tick mode is switched from periodic to oneshot it must
+	 * be ensured that CPUs which are waiting for periodic broadcast
+	 * get their wake-up at the next tick.  This is achieved by ORing
+	 * tick_broadcast_mask into tick_broadcast_oneshot_mask.
+	 *
+	 * For other callers, e.g. broadcast device replacement,
+	 * tick_broadcast_oneshot_mask must not be touched as this would
+	 * set bits for CPUs which are already NOHZ, but not idle. Their
+	 * next tick_broadcast_enter() would observe the bit set and fail
+	 * to update the expiry time and the broadcast event device.
+	 */
+	if (from_periodic) {
 		cpumask_copy(tmpmask, tick_broadcast_mask);
+		/* Remove the local CPU as it is obviously not idle */
 		cpumask_clear_cpu(cpu, tmpmask);
-		cpumask_or(tick_broadcast_oneshot_mask,
-			   tick_broadcast_oneshot_mask, tmpmask);
+		cpumask_or(tick_broadcast_oneshot_mask, tick_broadcast_oneshot_mask, tmpmask);
 
-		if (was_periodic && !cpumask_empty(tmpmask)) {
-			ktime_t nextevt = tick_get_next_period();
+		/*
+		 * Ensure that the oneshot broadcast handler will wake the
+		 * CPUs which are still waiting for periodic broadcast.
+		 */
+		nexttick = tick_get_next_period();
+		tick_broadcast_init_next_event(tmpmask, nexttick);
 
-			clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
-			tick_broadcast_init_next_event(tmpmask, nextevt);
-			tick_broadcast_set_event(bc, cpu, nextevt);
-		} else
-			bc->next_event = KTIME_MAX;
-	} else {
 		/*
-		 * The first cpu which switches to oneshot mode sets
-		 * the bit for all other cpus which are in the general
-		 * (periodic) broadcast mask. So the bit is set and
-		 * would prevent the first broadcast enter after this
-		 * to program the bc device.
+		 * If the underlying broadcast clock event device is
+		 * already in oneshot state, then there is nothing to do.
+		 * The device was already armed for the next tick
+		 * in tick_handle_broadcast_periodic()
 		 */
-		tick_broadcast_clear_oneshot(cpu);
+		if (clockevent_state_oneshot(bc))
+			return;
 	}
+
+	/*
+	 * When switching from periodic to oneshot mode arm the broadcast
+	 * device for the next tick.
+	 *
+	 * If the broadcast device has been replaced in oneshot mode and
+	 * the oneshot broadcast mask is not empty, then arm it to expire
+	 * immediately in order to reevaluate the next expiring timer.
+	 * @nexttick is 0 and therefore in the past which will cause the
+	 * clockevent code to force an event.
+	 *
+	 * For both cases the programming can be avoided when the oneshot
+	 * broadcast mask is empty.
+	 *
+	 * tick_broadcast_set_event() implicitly switches the broadcast
+	 * device to oneshot state.
+	 */
+	if (!cpumask_empty(tick_broadcast_oneshot_mask))
+		tick_broadcast_set_event(bc, cpu, nexttick);
 }
 
 /*
@@ -1070,14 +1124,16 @@ static void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
 void tick_broadcast_switch_to_oneshot(void)
 {
 	struct clock_event_device *bc;
+	enum tick_device_mode oldmode;
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
 
+	oldmode = tick_broadcast_device.mode;
 	tick_broadcast_device.mode = TICKDEV_MODE_ONESHOT;
 	bc = tick_broadcast_device.evtdev;
 	if (bc)
-		tick_broadcast_setup_oneshot(bc);
+		tick_broadcast_setup_oneshot(bc, oldmode == TICKDEV_MODE_PERIODIC);
 
 	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Frederic Weisbecker 1 year ago
Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
> From: Thomas Gleixner <tglx@linutronix.de>
> Subject: tick/broadcast: Make broadcast device replacement work correctly
> Date: Wed, 12 Apr 2023 08:34:25 +0800
> 
> When a tick broadcast clockevent device is initialized for one shot mode
> then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
> cpumask into the oneshot broadcast cpumask.
> 
> This is required when switching from periodic broadcast mode to oneshot
> broadcast mode to ensure that CPUs which are waiting for periodic
> broadcast are woken up on the next tick.
> 
> But it is subtly broken, when an active broadcast device is replaced and
> the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
> this and debugged the issue.
> 
> Then the OR of the periodic broadcast CPU mask is wrong as the periodic
> cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
> unless explicitly cleared via tick_broadcast_disable().
> 
> That means that this sets all other CPUs which have tick broadcasting
> enabled at that point unconditionally in the oneshot broadcast mask.
> 
> If the affected CPUs were already idle and had their bits set in the
> oneshot broadcast mask then this does no harm. But for non idle CPUs
> which were not set this corrupts their state.
> 
> On their next invocation of tick_broadcast_enable() they observe the bit
> set, which indicates that the broadcast for the CPU is already set up.
> As a consequence they fail to update the broadcast event even if their
> earliest expiring timer is before the actually programmed broadcast
> event.
> 
> If the programmed broadcast event is far in the future, then this can
> cause stalls or trigger the hung task detector.
> 
> Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
> this is the initial switch over from periodic to oneshot broadcast which
> must take the periodic broadcast mask into account. In the case of
> initialization of a replacement device this prevents that the broadcast
> oneshot mask is modified.
> 
> There is a second problem with broadcast device replacement in this
> function. The broadcast device is only armed when the previous state of
> the device was periodic.

Any chance the patch could be cut in two then?

Thanks.
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
On Mon, Apr 17 2023 at 17:02, Frederic Weisbecker wrote:
> Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
>> There is a second problem with broadcast device replacement in this
>> function. The broadcast device is only armed when the previous state of
>> the device was periodic.
>
> Any chance the patch could be cut in two then?

Let me try.
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Thomas Gleixner 1 year ago
On Tue, Apr 18 2023 at 10:50, Thomas Gleixner wrote:

> On Mon, Apr 17 2023 at 17:02, Frederic Weisbecker wrote:
>> Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
>>> There is a second problem with broadcast device replacement in this
>>> function. The broadcast device is only armed when the previous state of
>>> the device was periodic.
>>
>> Any chance the patch could be cut in two then?
>
> Let me try.

Not really as fixing that part depends on fixing the original issue and
once the original issue is fixed the issue for the replacement case is
more or less automatically there.

Thanks,

        tglx
Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except was_periodic was true
Posted by Frederic Weisbecker 1 year ago
Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
> > * CPU 1 stop its tick, next event is in one hour. It calls
> >   tick_broadcast_enter() and goes to sleep.
> 
> So there is already a broadcast device installed, right?

Yes

> 
> > * CPU 1 gets an interrupt that enqueues a new timer expiring in the next jiffy
> >   (note it's not yet actually programmed in the tick device)
> > * CPU 1 call tick_broadcast_exit().
> > * CPU 0 registers new broadcast device and sets CPU 1 in tick_broadcast_oneshot_mask
> 
> This lacks an explanation why CPU0 sets CPU1 in that mask. It does not
> _set_ it explicitely, only implicitely by ORing the periodic broadcast
> cpumask over.
> 
> Now the question is why is CPU1 set in the periodic broadcast mask when
> the CPU already switched over to NOHZ mode?
> 
> That needs to be explained too.

I probably got confused with that tick_broadcast_mask, so it's only set for
periodic broadcast? Should it be renamed to tick_periodic_broadcast_mask to
disambiguate my eternal confusion?

> 
> > * CPU 0 runs the broadcast callback, sees that the next timer for CPU 1
> >   is in one hour (because the recently enqueued timer for CPU 1 hasn't been programmed
> >   yet), so it programs the broadcast to that 1 hour deadline.
> > * CPU 1 runs tick_nohz_idle_stop_tick() which eventually writes and program
> >   dev->next_event to next jiffy
> > * CPU 1 runs into cpuidle_enter_state(), and tick_broadcast_enter() is ignored because
> >   the CPU is already in tick_broadcast_oneshot_mask, so the dev->next_event
> >   change isn't propagated to broadcast.
> > * CPU 1 goes to sleep for 1 hour.
> 
> Also please use tabular style to explain the parallel events as
> explained in the documentation.

Yeah my bad I asked Victor to integrate that scenario that popped out
of me misunderstanding that code. Not even mentioning the form.

Now to review your proposal.

Thanks.