[PATCH v11 8/8] timers: Exclude isolated cpus from timer migration

Posted by Gabriele Monaco 1 month, 3 weeks ago
The timer migration mechanism allows active CPUs to pull timers from
idle ones to improve the overall idle time. This is however undesired
when CPU-intensive workloads run on isolated cores, as the algorithm
would move the timers from housekeeping to isolated cores, negatively
affecting the isolation.

Exclude isolated cores from the timer migration algorithm by extending
the concept of unavailable cores, currently used for offline ones, to
isolated ones:
* A core is unavailable if isolated or offline;
* A core is available if not isolated and online.

A core is considered unavailable as isolated if it belongs to:
* the isolcpus (domain) list
* an isolated cpuset
Except if it is:
* in the nohz_full list (already idle for the hierarchy)
* the nohz timekeeper core (must be available to handle global timers)

CPUs are added to the hierarchy during late boot, excluding isolated
ones; the hierarchy is also adapted when the cpuset isolation changes.

Due to how the timer migration algorithm works, any CPU that is part of
the hierarchy can have its global timers pulled by remote CPUs and has
to pull remote timers in turn; skipping only the pulling of remote
timers would break the logic.
For this reason, prevent isolated CPUs from pulling remote global
timers, but also the other way around: any global timer started on an
isolated CPU will run there. This does not break the concept of
isolation (global timers don't come from outside the CPU) and, if
considered inappropriate, can usually be mitigated with other isolation
techniques (e.g. IRQ pinning).
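
For example (hypothetical driver code, using the standard timer API): a
timer armed without TIMER_PINNED is a global timer, and with this
change it expires on the isolated CPU that armed it rather than ever
being pulled by a remote CPU:

 #include <linux/timer.h>
 #include <linux/jiffies.h>

 static struct timer_list example_timer;

 static void example_timer_fn(struct timer_list *t)
 {
 	/* With isolated CPUs out of the hierarchy, this runs on the
 	 * (isolated) CPU that armed the timer. */
 }

 static void example_arm(void)
 {
 	/* No TIMER_PINNED flag: this is a global timer */
 	timer_setup(&example_timer, example_timer_fn, 0);
 	mod_timer(&example_timer, jiffies + HZ);
 }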

This effect was noticed on a 128-core machine running oslat on the
isolated cores (1-31,33-63,65-95,97-127). The tool monopolises the
CPUs, and the CPU with the lowest number in a timer migration hierarchy
(here 1 and 65) appears as always active and continuously pulls global
timers from the housekeeping CPUs. This ends up moving driver work
(e.g. delayed work) to isolated CPUs and causes latency spikes:

before the change:

 # oslat -c 1-31,33-63,65-95,97-127 -D 62s
 ...
  Maximum:     1203 10 3 4 ... 5 (us)

after the change:

 # oslat -c 1-31,33-63,65-95,97-127 -D 62s
 ...
  Maximum:      10 4 3 4 3 ... 5 (us)

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 include/linux/timer.h         |   9 +++
 kernel/cgroup/cpuset.c        |   3 +
 kernel/time/timer_migration.c | 103 +++++++++++++++++++++++++++++++++-
 3 files changed, 112 insertions(+), 3 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 0414d9e6b4fc..62e1cea71125 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -188,4 +188,13 @@ int timers_dead_cpu(unsigned int cpu);
 #define timers_dead_cpu		NULL
 #endif
 
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask);
+#else
+static inline int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
+{
+	return 0;
+}
+#endif
+
 #endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 7b66ccedbc53..2e73fc450a81 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1393,6 +1393,9 @@ static void update_exclusion_cpumasks(bool isolcpus_updated)
 
 	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
 	WARN_ON_ONCE(ret < 0);
+
+	ret = tmigr_isolated_exclude_cpumask(isolated_cpus);
+	WARN_ON_ONCE(ret < 0);
 }
 
 /**
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 0a3a26e766d0..07b63be18f83 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -10,6 +10,7 @@
 #include <linux/spinlock.h>
 #include <linux/timerqueue.h>
 #include <trace/events/ipi.h>
+#include <linux/sched/isolation.h>
 
 #include "timer_migration.h"
 #include "tick-internal.h"
@@ -436,6 +437,23 @@ static inline bool tmigr_is_not_available(struct tmigr_cpu *tmc)
 	return !(tmc->tmgroup && tmc->available);
 }
 
+/*
+ * Returns true if @cpu should be excluded from the hierarchy as isolated.
+ * Domain isolated CPUs don't participate in timer migration, nohz_full CPUs
+ * are still part of the hierarchy but become idle (from a tick and timer
+ * migration perspective) when they stop their tick. This lets the timekeeping
+ * CPU handle their global timers. Marking also isolated CPUs as idle would be
+ * too costly, hence they are completely excluded from the hierarchy.
+ * This check is necessary, for instance, to prevent offline isolated CPUs from
+ * being incorrectly marked as available once getting back online.
+ */
+static inline bool tmigr_is_isolated(int cpu)
+{
+	return (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
+		cpuset_cpu_is_isolated(cpu)) &&
+	       housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE);
+}
+
 /*
  * Returns true, when @childmask corresponds to the group migrator or when the
  * group is not active - so no migrator is set.
@@ -1451,6 +1469,8 @@ static int tmigr_clear_cpu_available(unsigned int cpu)
 
 	cpumask_clear_cpu(cpu, tmigr_available_cpumask);
 	scoped_guard(raw_spinlock_irq, &tmc->lock) {
+		if (!tmc->available)
+			return 0;
 		tmc->available = false;
 		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
 
@@ -1470,7 +1490,7 @@ static int tmigr_clear_cpu_available(unsigned int cpu)
 	return 0;
 }
 
-static int tmigr_set_cpu_available(unsigned int cpu)
+static inline int _tmigr_set_cpu_available(unsigned int cpu)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
 
@@ -1480,6 +1500,8 @@ static int tmigr_set_cpu_available(unsigned int cpu)
 
 	cpumask_set_cpu(cpu, tmigr_available_cpumask);
 	scoped_guard(raw_spinlock_irq, &tmc->lock) {
+		if (tmc->available)
+			return 0;
 		trace_tmigr_cpu_available(tmc);
 		tmc->idle = timer_base_is_idle();
 		if (!tmc->idle)
@@ -1489,14 +1511,89 @@ static int tmigr_set_cpu_available(unsigned int cpu)
 	return 0;
 }
 
+static int tmigr_set_cpu_available(unsigned int cpu)
+{
+	if (tmigr_is_isolated(cpu))
+		return 0;
+	return _tmigr_set_cpu_available(cpu);
+}
+
+static bool tmigr_should_isolate_cpu(int cpu, void *ignored)
+{
+	/*
+	 * The tick CPU can be marked as isolated by the cpuset code, however
+	 * we cannot mark it as unavailable to avoid having no global migrator
+	 * for the nohz_full CPUs.
+	 */
+	return tick_nohz_cpu_hotpluggable(cpu);
+}
+
+static void tmigr_cpu_isolate(void *ignored)
+{
+	tmigr_clear_cpu_available(smp_processor_id());
+}
+
+static void tmigr_cpu_unisolate(void *ignored)
+{
+	tmigr_set_cpu_available(smp_processor_id());
+}
+
+static void tmigr_cpu_unisolate_force(void *ignored)
+{
+	/*
+	 * Required at boot to restore the tick CPU if nohz_full is available.
+	 * Hotplug handlers don't check for tick CPUs during runtime.
+	 */
+	_tmigr_set_cpu_available(smp_processor_id());
+}
+
+int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
+{
+	cpumask_var_t cpumask;
+
+	lockdep_assert_cpus_held();
+
+	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+		return -ENOMEM;
+
+	cpumask_and(cpumask, exclude_cpumask, tmigr_available_cpumask);
+	cpumask_and(cpumask, cpumask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+	on_each_cpu_cond_mask(tmigr_should_isolate_cpu, tmigr_cpu_isolate, NULL,
+			      1, cpumask);
+
+	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
+	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
+	on_each_cpu_mask(cpumask, tmigr_cpu_unisolate, NULL, 1);
+
+	free_cpumask_var(cpumask);
+	return 0;
+}
+
 /*
  * NOHZ can only be enabled after clocksource_done_booting(). Don't
  * bother trashing the cache in the tree before.
  */
 static int __init tmigr_late_init(void)
 {
-	return cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
-				 tmigr_set_cpu_available, tmigr_clear_cpu_available);
+	int cpu, ret;
+
+	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
+				tmigr_set_cpu_available, tmigr_clear_cpu_available);
+	if (ret)
+		return ret;
+	/*
+	 * The tick CPU may not be marked as available in the above call, this
+	 * can occur only at boot as hotplug handlers are not called on the
+	 * tick CPU. Force it enabled here.
+	 */
+	for_each_possible_cpu(cpu) {
+		if (!tick_nohz_cpu_hotpluggable(cpu)) {
+			ret = smp_call_function_single(
+				cpu, tmigr_cpu_unisolate_force, NULL, 1);
+			break;
+		}
+	}
+	return ret;
 }
 
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
-- 
2.50.1
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Frederic Weisbecker 1 month ago
On Fri, Aug 08, 2025 at 06:01:42PM +0200, Gabriele Monaco wrote:
>  /*
>   * NOHZ can only be enabled after clocksource_done_booting(). Don't
>   * bother trashing the cache in the tree before.
>   */
>  static int __init tmigr_late_init(void)
>  {
> -	return cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
> -				 tmigr_set_cpu_available, tmigr_clear_cpu_available);
> +	int cpu, ret;
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
> +				tmigr_set_cpu_available, tmigr_clear_cpu_available);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * The tick CPU may not be marked as available in the above call, this
> +	 * can occur only at boot as hotplug handlers are not called on the
> +	 * tick CPU. Force it enabled here.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		if (!tick_nohz_cpu_hotpluggable(cpu)) {
> +			ret = smp_call_function_single(
> +				cpu, tmigr_cpu_unisolate_force, NULL, 1);
> +			break;
> +		}
> +	}

Why not evaluate tick_nohz_cpu_hotpluggable() from tmigr_clear_cpu_available()
instead of this force IPI?

But if I understand correctly, this will be handled by cpuset, right?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Gabriele Monaco 1 month ago
On Mon, 2025-09-01 at 14:41 +0200, Frederic Weisbecker wrote:
> On Fri, Aug 08, 2025 at 06:01:42PM +0200, Gabriele Monaco wrote:
> >  /*
> >   * NOHZ can only be enabled after clocksource_done_booting().
> > Don't
> >   * bother trashing the cache in the tree before.
> >   */
> >  static int __init tmigr_late_init(void)
> >  {
> > -	return cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE,
> > "tmigr:online",
> > -				 tmigr_set_cpu_available,
> > tmigr_clear_cpu_available);
> > +	int cpu, ret;
> > +
> > +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE,
> > "tmigr:online",
> > +				tmigr_set_cpu_available,
> > tmigr_clear_cpu_available);
> > +	if (ret)
> > +		return ret;
> > +	/*
> > +	 * The tick CPU may not be marked as available in the
> > above call, this
> > +	 * can occur only at boot as hotplug handlers are not
> > called on the
> > +	 * tick CPU. Force it enabled here.
> > +	 */
> > +	for_each_possible_cpu(cpu) {
> > +		if (!tick_nohz_cpu_hotpluggable(cpu)) {
> > +			ret = smp_call_function_single(
> > +				cpu, tmigr_cpu_unisolate_force,
> > NULL, 1);
> > +			break;
> > +		}
> > +	}
> 
> Why not evaluate tick_nohz_cpu_hotpluggable() from
> tmigr_clear_cpu_available() instead of this force IPI?

The idea is that this IPI runs once during late boot, only for the tick
CPU, while the call to tick_nohz_cpu_hotpluggable() would run at every
hotplug event if I moved it to tmigr_clear_cpu_available().
In that scenario, it's guaranteed to return true (apart from the very
first call).

I don't have a strong opinion against running that check every time
(even though it's only needed at boot) and removing this IPI, but in my
understanding that's one of the things Thomas found confusing [1].

Am I missing anything here?

> But if I understand correctly, this will be handled by cpuset, right?

Currently tick_nohz_cpu_hotpluggable() is called by
tmigr_should_isolate_cpu() and that is called by cpuset code, changing
cpuset would save that call but won't deal with the tick CPU not
enabled at boot time, unless I'm misunderstanding what Waiman implied.

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/875xgqqrel.ffs@tglx
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Frederic Weisbecker 1 month ago
On Mon, Sep 01, 2025 at 03:48:15PM +0200, Gabriele Monaco wrote:
> On Mon, 2025-09-01 at 14:41 +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 08, 2025 at 06:01:42PM +0200, Gabriele Monaco wrote:
> > >  /*
> > >   * NOHZ can only be enabled after clocksource_done_booting().
> > > Don't
> > >   * bother trashing the cache in the tree before.
> > >   */
> > >  static int __init tmigr_late_init(void)
> > >  {
> > > -	return cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE,
> > > "tmigr:online",
> > > -				 tmigr_set_cpu_available,
> > > tmigr_clear_cpu_available);
> > > +	int cpu, ret;
> > > +
> > > +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE,
> > > "tmigr:online",
> > > +				tmigr_set_cpu_available,
> > > tmigr_clear_cpu_available);
> > > +	if (ret)
> > > +		return ret;
> > > +	/*
> > > +	 * The tick CPU may not be marked as available in the
> > > above call, this
> > > +	 * can occur only at boot as hotplug handlers are not
> > > called on the
> > > +	 * tick CPU. Force it enabled here.
> > > +	 */
> > > +	for_each_possible_cpu(cpu) {
> > > +		if (!tick_nohz_cpu_hotpluggable(cpu)) {
> > > +			ret = smp_call_function_single(
> > > +				cpu, tmigr_cpu_unisolate_force,
> > > NULL, 1);
> > > +			break;
> > > +		}
> > > +	}
> > 
> > Why not evaluate tick_nohz_cpu_hotpluggable() from
> > tmigr_clear_cpu_available() instead of this force IPI?
> 
> The idea is that this IPI runs once during late boot only for the tick
> CPU, while the call to tick_nohz_cpu_hotpluggable() would be running at
> every hotplug event if I move it to tmigr_clear_cpu_available.
> In that scenario, it's guaranteed to return true (besides the very
> first call).
> 
> I don't have a strong opinion against running that check every time
> although it's needed only at boot time and remove this IPI, but in my
> understanding that's one of the thing Thomas was finding confusing [1].
> 
> Am I missing anything here?

Right, Thomas didn't like it, but the organization of the code has changed
a bit since then with the late initcall. If the best we can do to work around
the situation is to make the CPU unavailable regardless and then undo that
right after with an IPI, then it's a good sign that we should just simplify
and eventually check tick_nohz_cpu_hotpluggable() from tmigr_is_isolated().

> > But if I understand correctly, this will be handled by cpuset, right?
> 
> Currently tick_nohz_cpu_hotpluggable() is called by
> tmigr_should_isolate_cpu() and that is called by cpuset code, changing
> cpuset would save that call but won't deal with the tick CPU not
> enabled at boot time, unless I'm misunderstanding what Waiman implied.

Good point!

Thanks.

> 
> Thanks,
> Gabriele
> 
> [1] - https://lore.kernel.org/lkml/875xgqqrel.ffs@tglx
> 

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Gabriele Monaco 1 month ago
On Mon, 2025-09-01 at 23:11 +0200, Frederic Weisbecker wrote:
> On Mon, Sep 01, 2025 at 03:48:15PM +0200, Gabriele Monaco wrote:
> > On Mon, 2025-09-01 at 14:41 +0200, Frederic Weisbecker wrote:
> > > Why not evaluate tick_nohz_cpu_hotpluggable() from
> > > tmigr_clear_cpu_available() instead of this force IPI?
> > 
> > The idea is that this IPI runs once during late boot only for the
> > tick CPU, while the call to tick_nohz_cpu_hotpluggable() would be
> > running at every hotplug event if I move it to
> > tmigr_clear_cpu_available. In that scenario, it's guaranteed to
> > return true (besides the very first call).
> > 
> > I don't have a strong opinion against running that check every time
> > although it's needed only at boot time and remove this IPI, but in
> > my understanding that's one of the thing Thomas was finding
> > confusing [1].
> > 
> > Am I missing anything here?
> 
> Right, Thomas didn't like it, but the organization of the code has
> changed a bit since then with the late initcall. If the best we can
> do to workaround the situation is to make the CPU unavailable
> regardless and then undo that right after with an IPI, then it's a
> good sign that we should just simplify and eventually check
> tick_nohz_cpu_hotpluggable() from tmigr_is_isolated().

Makes sense.
I'd be tempted to use a static branch, but since the call to
tick_nohz_cpu_hotpluggable() isn't really heavy, we should be fine
just including it in the tmigr_is_isolated() check.
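
Something along these lines, I guess (just a sketch on top of the
current tmigr_is_isolated(), untested):

static inline bool tmigr_is_isolated(int cpu)
{
	return (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
		cpuset_cpu_is_isolated(cpu)) &&
	       housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) &&
	       /* Never treat the tick CPU as isolated */
	       tick_nohz_cpu_hotpluggable(cpu);
}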


> > > But if I understand correctly, this will be handled by cpuset,
> > > right?
> > 
> > Currently tick_nohz_cpu_hotpluggable() is called by
> > tmigr_should_isolate_cpu() and that is called by cpuset code,
> > changing cpuset would save that call but won't deal with the tick
> > CPU not enabled at boot time, unless I'm misunderstanding what
> > Waiman implied.
> 
> Good point!

Here I'm a bit unsure how to proceed though. We want to fail any single
isolated cpuset that includes the tick CPU under nohz_full. I can do it
directly in isolcpus_nohz_conflict and that looks easy.

But is that going to be clear for the user?
Can the user even know what the tick CPU is? Besides /assuming/ 0.

Thanks,
Gabriele

> Thanks.
> 
> > 
> > Thanks,
> > Gabriele
> > 
> > [1] - https://lore.kernel.org/lkml/875xgqqrel.ffs@tglx
> > 
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Frederic Weisbecker 1 month ago
On Tue, Sep 02, 2025 at 01:08:25PM +0200, Gabriele Monaco wrote:
> On Mon, 2025-09-01 at 23:11 +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 01, 2025 at 03:48:15PM +0200, Gabriele Monaco wrote:
> > > On Mon, 2025-09-01 at 14:41 +0200, Frederic Weisbecker wrote:
> > > > Why not evaluate tick_nohz_cpu_hotpluggable() from
> > > > tmigr_clear_cpu_available() instead of this force IPI?
> > > 
> > > The idea is that this IPI runs once during late boot only for the
> > > tick CPU, while the call to tick_nohz_cpu_hotpluggable() would be
> > > running at every hotplug event if I move it to
> > > tmigr_clear_cpu_available. In that scenario, it's guaranteed to
> > > return true (besides the very first call).
> > > 
> > > I don't have a strong opinion against running that check every time
> > > although it's needed only at boot time and remove this IPI, but in
> > > my understanding that's one of the thing Thomas was finding
> > > confusing [1].
> > > 
> > > Am I missing anything here?
> > 
> > Right, Thomas didn't like it, but the organization of the code has
> > changed a bit since then with the late initcall. If the best we can
> > do to workaround the situation is to make the CPU unavailable
> > regardless and then undo that right after with an IPI, then it's a
> > good sign that we should just simplify and eventually check
> > tick_nohz_cpu_hotpluggable() from tmigr_is_isolated().
> 
> Makes sense.
> I'd be tempted using a static branch but since the call to
> tick_nohz_cpu_hotpluggable() isn't really heavy, we can just be fine
> including it in the tmigr_is_isolated() check.

Right.

> 
> 
> > > > But if I understand correctly, this will be handled by cpuset,
> > > > right?
> > > 
> > > Currently tick_nohz_cpu_hotpluggable() is called by
> > > tmigr_should_isolate_cpu() and that is called by cpuset code,
> > > changing cpuset would save that call but won't deal with the tick
> > > CPU not enabled at boot time, unless I'm misunderstanding what
> > > Waiman implied.
> > 
> > Good point!
> 
> Here I'm a bit unsure how to proceed though. We want to fail any single
> isolated cpuset that includes the tick CPU under nohz_full. I can do it
> directly in isolcpus_nohz_conflict and that looks easy.
> 
> But is that going to be clear for the user?
> Can the user even know what the tick CPU is? Besides /assuming/ 0.

You're right, the user can't know in advance which CPU is the tick CPU.
I don't mind whether or not we prevent cpuset from isolating the
timekeeper, but either way a pr_info() could be helpful to tell the
user that either:

* The isolated timekeeper will have limited isolation
or
* The timekeeper can't be isolated.
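
For instance something like this, next to the tmigr call in
update_exclusion_cpumasks() or wherever it fits (just a sketch, exact
wording and placement up to you):

	/* Sketch: the nohz_full timekeeper can not be fully isolated */
	for_each_cpu(cpu, isolated_cpus) {
		if (!tick_nohz_cpu_hotpluggable(cpu)) {
			pr_info("CPU%d is the nohz_full timekeeper, its isolation will be limited\n",
				cpu);
			break;
		}
	}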

Thanks.

> 
> Thanks,
> Gabriele
> 
> > Thanks.
> > 
> > > 
> > > Thanks,
> > > Gabriele
> > > 
> > > [1] - https://lore.kernel.org/lkml/875xgqqrel.ffs@tglx
> > > 
> 

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Gabriele Monaco 3 weeks, 5 days ago
On Tue, 2025-09-02 at 14:45 +0200, Frederic Weisbecker wrote:
> > On Mon, 2025-09-01 at 23:11 +0200, Frederic Weisbecker wrote:
> > > Right, Thomas didn't like it, but the organization of the code
> > > has changed a bit since then with the late initcall. If the best
> > > we can do to workaround the situation is to make the CPU
> > > unavailable regardless and then undo that right after with an
> > > IPI, then it's a good sign that we should just simplify and
> > > eventually check tick_nohz_cpu_hotpluggable() from
> > > tmigr_is_isolated().
> > Here I'm a bit unsure how to proceed though. We want to fail any
> > single isolated cpuset that includes the tick CPU under nohz_full.
> > I can do it directly in isolcpus_nohz_conflict and that looks easy.
> > 
> > But is that going to be clear for the user?
> > Can the user even know what the tick CPU is? Besides /assuming/ 0.
> 
> You're right the user can't know in advance which CPU is the tick.
> I don't mind if we prevent or not cpuset from allowing to isolate
> the timekeeper but either way, a pr_info() could be helpful to tell
> that either:
> 
> * The isolated timekeeper will have limited isolation
> or
> * The timekeeper can't be isolated.
> 

As far as I understand, we have now decided to fail cpuset isolated
assignments involving the tick CPU when nohz_full is active.

What prevents us from doing the same in the isolcpus code?
Or at least un-isolate the tick CPU, as happens for nohz_full?

The patch currently avoids the tick CPU in the isolated tmigr logic but
allows it anywhere else.
I believe if we are changing this, cpusets and isolcpus should at least
be consistent.

I remember I went down this path because it was mentioned the tick CPU
is allowed to change. Is this really a concern when nohz_full is
active?

If the tick CPU can change after boot, we would probably need to track
this change either way (in tmigr or isolation/cpuset), but that doesn't
seem to be the case to me.

If we guarantee the tick CPU is never isolated, we can get rid of the
tick_nohz_cpu_hotpluggable() check for good.

Am I missing anything?

Thanks,
Gabriele
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Waiman Long 1 month, 3 weeks ago
On 8/8/25 12:01 PM, Gabriele Monaco wrote:
> The timer migration mechanism allows active CPUs to pull timers from
> idle ones to improve the overall idle time. This is however undesired
> when CPU intensive workloads run on isolated cores, as the algorithm
> would move the timers from housekeeping to isolated cores, negatively
> affecting the isolation.
>
> Exclude isolated cores from the timer migration algorithm, extend the
> concept of unavailable cores, currently used for offline ones, to
> isolated ones:
> * A core is unavailable if isolated or offline;
> * A core is available if non isolated and online;
>
> A core is considered unavailable as isolated if it belongs to:
> * the isolcpus (domain) list
> * an isolated cpuset
> Except if it is:
> * in the nohz_full list (already idle for the hierarchy)
> * the nohz timekeeper core (must be available to handle global timers)
>
> CPUs are added to the hierarchy during late boot, excluding isolated
> ones, the hierarchy is also adapted when the cpuset isolation changes.
>
> Due to how the timer migration algorithm works, any CPU part of the
> hierarchy can have their global timers pulled by remote CPUs and have to
> pull remote timers, only skipping pulling remote timers would break the
> logic.
> For this reason, prevent isolated CPUs from pulling remote global
> timers, but also the other way around: any global timer started on an
> isolated CPU will run there. This does not break the concept of
> isolation (global timers don't come from outside the CPU) and, if
> considered inappropriate, can usually be mitigated with other isolation
> techniques (e.g. IRQ pinning).
>
> This effect was noticed on a 128 cores machine running oslat on the
> isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs,
> and the CPU with lowest count in a timer migration hierarchy (here 1
> and 65) appears as always active and continuously pulls global timers,
> from the housekeeping CPUs. This ends up moving driver work (e.g.
> delayed work) to isolated CPUs and causes latency spikes:
>
> before the change:
>
>   # oslat -c 1-31,33-63,65-95,97-127 -D 62s
>   ...
>    Maximum:     1203 10 3 4 ... 5 (us)
>
> after the change:
>
>   # oslat -c 1-31,33-63,65-95,97-127 -D 62s
>   ...
>    Maximum:      10 4 3 4 3 ... 5 (us)
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   include/linux/timer.h         |   9 +++
>   kernel/cgroup/cpuset.c        |   3 +
>   kernel/time/timer_migration.c | 103 +++++++++++++++++++++++++++++++++-
>   3 files changed, 112 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index 0414d9e6b4fc..62e1cea71125 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -188,4 +188,13 @@ int timers_dead_cpu(unsigned int cpu);
>   #define timers_dead_cpu		NULL
>   #endif
>   
> +#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> +extern int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask);
> +#else
> +static inline int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
> +{
> +	return 0;
> +}
> +#endif
> +
>   #endif
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 7b66ccedbc53..2e73fc450a81 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1393,6 +1393,9 @@ static void update_exclusion_cpumasks(bool isolcpus_updated)
>   
>   	ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
>   	WARN_ON_ONCE(ret < 0);
> +
> +	ret = tmigr_isolated_exclude_cpumask(isolated_cpus);
> +	WARN_ON_ONCE(ret < 0);
>   }
>   
>   /**
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 0a3a26e766d0..07b63be18f83 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -10,6 +10,7 @@
>   #include <linux/spinlock.h>
>   #include <linux/timerqueue.h>
>   #include <trace/events/ipi.h>
> +#include <linux/sched/isolation.h>
>   
>   #include "timer_migration.h"
>   #include "tick-internal.h"
> @@ -436,6 +437,23 @@ static inline bool tmigr_is_not_available(struct tmigr_cpu *tmc)
>   	return !(tmc->tmgroup && tmc->available);
>   }
>   
> +/*
> + * Returns true if @cpu should be excluded from the hierarchy as isolated.
> + * Domain isolated CPUs don't participate in timer migration, nohz_full CPUs
> + * are still part of the hierarchy but become idle (from a tick and timer
> + * migration perspective) when they stop their tick. This lets the timekeeping
> + * CPU handle their global timers. Marking also isolated CPUs as idle would be
> + * too costly, hence they are completely excluded from the hierarchy.
> + * This check is necessary, for instance, to prevent offline isolated CPUs from
> + * being incorrectly marked as available once getting back online.
> + */
> +static inline bool tmigr_is_isolated(int cpu)
> +{
> +	return (!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
> +		cpuset_cpu_is_isolated(cpu)) &&
> +	       housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE);
> +}
> +
>   /*
>    * Returns true, when @childmask corresponds to the group migrator or when the
>    * group is not active - so no migrator is set.
> @@ -1451,6 +1469,8 @@ static int tmigr_clear_cpu_available(unsigned int cpu)
>   
>   	cpumask_clear_cpu(cpu, tmigr_available_cpumask);
>   	scoped_guard(raw_spinlock_irq, &tmc->lock) {
> +		if (!tmc->available)
> +			return 0;
>   		tmc->available = false;
>   		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
>   
> @@ -1470,7 +1490,7 @@ static int tmigr_clear_cpu_available(unsigned int cpu)
>   	return 0;
>   }
>   
> -static int tmigr_set_cpu_available(unsigned int cpu)
> +static inline int _tmigr_set_cpu_available(unsigned int cpu)
>   {
>   	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
>   
> @@ -1480,6 +1500,8 @@ static int tmigr_set_cpu_available(unsigned int cpu)
>   
>   	cpumask_set_cpu(cpu, tmigr_available_cpumask);
>   	scoped_guard(raw_spinlock_irq, &tmc->lock) {
> +		if (tmc->available)
> +			return 0;
>   		trace_tmigr_cpu_available(tmc);
>   		tmc->idle = timer_base_is_idle();
>   		if (!tmc->idle)
> @@ -1489,14 +1511,89 @@ static int tmigr_set_cpu_available(unsigned int cpu)
>   	return 0;
>   }
>   
> +static int tmigr_set_cpu_available(unsigned int cpu)
> +{
> +	if (tmigr_is_isolated(cpu))
> +		return 0;
> +	return _tmigr_set_cpu_available(cpu);
> +}
> +
> +static bool tmigr_should_isolate_cpu(int cpu, void *ignored)
> +{
> +	/*
> +	 * The tick CPU can be marked as isolated by the cpuset code, however
> +	 * we cannot mark it as unavailable to avoid having no global migrator
> +	 * for the nohz_full CPUs.
> +	 */
> +	return tick_nohz_cpu_hotpluggable(cpu);
> +}
We may have to update the cpuset code to fail isolated partition 
formation if it includes the nohz_full tick CPU as that CPU cannot be 
fully isolated. That will also make this patch simpler.
> +
> +static void tmigr_cpu_isolate(void *ignored)
> +{
> +	tmigr_clear_cpu_available(smp_processor_id());
> +}
> +
> +static void tmigr_cpu_unisolate(void *ignored)
> +{
> +	tmigr_set_cpu_available(smp_processor_id());
> +}
> +
> +static void tmigr_cpu_unisolate_force(void *ignored)
> +{
> +	/*
> +	 * Required at boot to restore the tick CPU if nohz_full is available.
> +	 * Hotplug handlers don't check for tick CPUs during runtime.
> +	 */
> +	_tmigr_set_cpu_available(smp_processor_id());
> +}
> +
> +int tmigr_isolated_exclude_cpumask(struct cpumask *exclude_cpumask)
> +{
> +	cpumask_var_t cpumask;
> +
> +	lockdep_assert_cpus_held();
> +
> +	if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	cpumask_and(cpumask, exclude_cpumask, tmigr_available_cpumask);
> +	cpumask_and(cpumask, cpumask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
> +	on_each_cpu_cond_mask(tmigr_should_isolate_cpu, tmigr_cpu_isolate, NULL,
> +			      1, cpumask);
> +
> +	cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
> +	cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
> +	on_each_cpu_mask(cpumask, tmigr_cpu_unisolate, NULL, 1);
> +
> +	free_cpumask_var(cpumask);
> +	return 0;
> +}
> +
>   /*
>    * NOHZ can only be enabled after clocksource_done_booting(). Don't
>    * bother trashing the cache in the tree before.
>    */
>   static int __init tmigr_late_init(void)
>   {
> -	return cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
> -				 tmigr_set_cpu_available, tmigr_clear_cpu_available);
> +	int cpu, ret;
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
> +				tmigr_set_cpu_available, tmigr_clear_cpu_available);
> +	if (ret)
> +		return ret;
> +	/*
> +	 * The tick CPU may not be marked as available in the above call, this
> +	 * can occur only at boot as hotplug handlers are not called on the
> +	 * tick CPU. Force it enabled here.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		if (!tick_nohz_cpu_hotpluggable(cpu)) {
> +			ret = smp_call_function_single(
> +				cpu, tmigr_cpu_unisolate_force, NULL, 1);
> +			break;
> +		}
> +	}
> +	return ret;
>   }
Can you integrate the 
tick_nohz_cpu_hotpluggable/tmigr_should_isolate_cpu check into 
tmigr_set_cpu_available() instead of special-casing the tick CPU here?

Cheers,
Longman
Re: [PATCH v11 8/8] timers: Exclude isolated cpus from timer migration
Posted by Gabriele Monaco 1 month, 3 weeks ago

On Tue, 2025-08-12 at 13:20 -0400, Waiman Long wrote:
> On 8/8/25 12:01 PM, Gabriele Monaco wrote:
> > +static bool tmigr_should_isolate_cpu(int cpu, void *ignored)
> > +{
> > +	/*
> > +	 * The tick CPU can be marked as isolated by the cpuset
> > code, however
> > +	 * we cannot mark it as unavailable to avoid having no
> > global migrator
> > +	 * for the nohz_full CPUs.
> > +	 */
> > +	return tick_nohz_cpu_hotpluggable(cpu);
> > +}
> We may have to update the cpuset code to fail isolated partition 
> formation if it includes the nohz_full tick CPU as that CPU cannot be
> fully isolated. That will also make this patch simpler.

Good idea, I can check that!

[...]
> >   /*
> >    * NOHZ can only be enabled after clocksource_done_booting().
> > Don't
> >    * bother trashing the cache in the tree before.
> >    */
> >   static int __init tmigr_late_init(void)
> >   {
> > -	return cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE,
> > "tmigr:online",
> > -				 tmigr_set_cpu_available,
> > tmigr_clear_cpu_available);
> > +	int cpu, ret;
> > +
> > +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE,
> > "tmigr:online",
> > +				tmigr_set_cpu_available,
> > tmigr_clear_cpu_available);
> > +	if (ret)
> > +		return ret;
> > +	/*
> > +	 * The tick CPU may not be marked as available in the
> > above call, this
> > +	 * can occur only at boot as hotplug handlers are not
> > called on the
> > +	 * tick CPU. Force it enabled here.
> > +	 */
> > +	for_each_possible_cpu(cpu) {
> > +		if (!tick_nohz_cpu_hotpluggable(cpu)) {
> > +			ret = smp_call_function_single(
> > +				cpu, tmigr_cpu_unisolate_force,
> > NULL, 1);
> > +			break;
> > +		}
> > +	}
> > +	return ret;
> >   }
> Can you integrate the
> tick_nohz_cpu_hotpluggable/tmigr_should_isolate_cpu check into 
> tmigr_set_cpu_available() instead of special-casing the tick CPU
> here?


I could but I'm afraid Thomas won't like that [1]:
> This is really horribly confusing. This function is also invoked from
> the CPU offline hotplug callback and in context of CPU hotplug this
> check makes absolutely no sense at all.

The whole thing has changed a bit since then, but I think his point
still stands: tick_nohz_cpu_hotpluggable() is not meaningful from
hotplug callbacks apart from that very first invocation.

Or am I missing your point?

Thanks,
Gabriele

[1] - https://lore.kernel.org/lkml/875xgqqrel.ffs@tglx/