The timer migration mechanism allows active CPUs to pull timers from
idle ones to improve the overall idle time. This is, however, undesired
when CPU-intensive workloads run on isolated cores, as the algorithm
would move timers from housekeeping to isolated cores, negatively
affecting the isolation.
This effect was noticed on a 128-core machine running oslat on the
isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs,
and the CPU with the lowest number in a timer migration hierarchy (here
1 and 65) appears as always active and continuously pulls global timers
from the housekeeping CPUs. This ends up moving driver work (e.g.
delayed work) to isolated CPUs and causes latency spikes:
before the change:
# oslat -c 1-31,33-63,65-95,97-127 -D 62s
...
Maximum: 1203 10 3 4 ... 5 (us)
after the change:
# oslat -c 1-31,33-63,65-95,97-127 -D 62s
...
Maximum: 10 4 3 4 3 ... 5 (us)
Exclude isolated cores from the timer migration algorithm by extending
the concept of unavailable cores, currently used for offline ones, to
isolated ones:
* A core is unavailable if isolated or offline;
* A core is available if neither isolated nor offline.
A core is considered unavailable as isolated if it belongs to:
* the isolcpus (domain) list
* an isolated cpuset
Except if it is:
* in the nohz_full list (already idle for the hierarchy)
* the nohz timekeeper core (must be available to handle global timers)
Due to how the timer migration algorithm works, any CPU that is part of
the hierarchy can have its global timers pulled by remote CPUs and has
to pull remote timers itself; skipping only the pulling of remote
timers would break the logic.
For this reason, we prevent isolated CPUs from pulling remote global
timers, but also the other way around: any global timer started on an
isolated CPU will run there. This does not break the concept of
isolation (global timers don't come from outside the CPU) and, if
considered inappropriate, can usually be mitigated with other isolation
techniques (e.g. IRQ pinning).
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
include/linux/timer.h | 9 ++++++
kernel/cgroup/cpuset.c | 3 ++
kernel/time/timer_migration.c | 54 +++++++++++++++++++++++++++++++++++
3 files changed, 66 insertions(+)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 10596d7c3a346..a8b683d9ce25d 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -190,4 +190,13 @@ int timers_dead_cpu(unsigned int cpu);
#define timers_dead_cpu NULL
#endif
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern int tmigr_isolated_exclude_cpumask(cpumask_var_t exclude_cpumask);
+#else
+static inline int tmigr_isolated_exclude_cpumask(cpumask_var_t exclude_cpumask)
+{
+ return 0;
+}
+#endif
+
#endif
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index e3494ed677f5c..7b26226f2a276 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1378,6 +1378,9 @@ static void update_exclusion_cpumasks(bool isolcpus_updated)
ret = workqueue_unbound_exclude_cpumask(isolated_cpus);
WARN_ON_ONCE(ret < 0);
+
+ ret = tmigr_isolated_exclude_cpumask(isolated_cpus);
+ WARN_ON_ONCE(ret < 0);
}
/**
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 25439f961ccf8..a14fbca7457fe 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -10,6 +10,7 @@
#include <linux/spinlock.h>
#include <linux/timerqueue.h>
#include <trace/events/ipi.h>
+#include <linux/sched/isolation.h>
#include "timer_migration.h"
#include "tick-internal.h"
@@ -1449,6 +1450,13 @@ static int tmigr_cpu_unavailable(unsigned int cpu)
int migrator;
u64 firstexp;
+ /*
+ * The tick CPU can be marked as isolated by the cpuset code, however
+ * we cannot mark it as unavailable to avoid having no global migrator
+ * for the nohz_full CPUs.
+ */
+ if (!tick_nohz_cpu_hotpluggable(cpu))
+ return 0;
raw_spin_lock_irq(&tmc->lock);
tmc->available = false;
WRITE_ONCE(tmc->wakeup, KTIME_MAX);
@@ -1478,6 +1486,20 @@ static int tmigr_cpu_available(unsigned int cpu)
if (WARN_ON_ONCE(!tmc->tmgroup))
return -EINVAL;
+ /*
+ * Domain isolated CPUs don't participate in timer migration, nohz_full
+ * CPUs are still part of the hierarchy but are always considered idle.
+ * Additionally, the tick CPU during nohz_full cannot be disabled.
+ * Checking here guarantees that CPUs isolated at boot (e.g. isolcpus)
+ * are not marked as available when they first become online.
+ * During runtime, any offline isolated CPU is also not incorrectly
+ * marked as available once it gets back online.
+ */
+ if ((!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
+ cpuset_cpu_is_isolated(cpu)) &&
+ housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) &&
+ tick_nohz_cpu_hotpluggable(cpu))
+ return 0;
raw_spin_lock_irq(&tmc->lock);
trace_tmigr_cpu_available(tmc);
tmc->idle = timer_base_is_idle();
@@ -1489,6 +1511,38 @@ static int tmigr_cpu_available(unsigned int cpu)
return 0;
}
+static void tmigr_remote_cpu_unavailable(void *ignored)
+{
+ tmigr_cpu_unavailable(smp_processor_id());
+}
+
+static void tmigr_remote_cpu_available(void *ignored)
+{
+ tmigr_cpu_available(smp_processor_id());
+}
+
+int tmigr_isolated_exclude_cpumask(cpumask_var_t exclude_cpumask)
+{
+ cpumask_var_t cpumask;
+ int ret = 0;
+
+ lockdep_assert_cpus_held();
+
+ if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_and(cpumask, exclude_cpumask, tmigr_available_cpumask);
+ cpumask_and(cpumask, cpumask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
+ on_each_cpu_mask(cpumask, tmigr_remote_cpu_unavailable, NULL, 0);
+
+ cpumask_andnot(cpumask, cpu_online_mask, exclude_cpumask);
+ cpumask_andnot(cpumask, cpumask, tmigr_available_cpumask);
+ on_each_cpu_mask(cpumask, tmigr_remote_cpu_available, NULL, 0);
+
+ free_cpumask_var(cpumask);
+ return ret;
+}
+
static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
int node)
{
--
2.49.0
On Fri, May 30 2025 at 16:20, Gabriele Monaco wrote:
timer migation?
> The timer migration mechanism allows active CPUs to pull timers from
> idle ones to improve the overall idle time. This is however undesired
> when CPU intensive workloads run on isolated cores, as the algorithm
> would move the timers from housekeeping to isolated cores, negatively
> affecting the isolation.
>
> This effect was noticed on a 128 cores machine running oslat on the
> isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs,
> and the CPU with lowest count in a timer migration hierarchy (here 1
> and 65) appears as always active and continuously pulls global timers,
> from the housekeeping CPUs. This ends up moving driver work (e.g.
> delayed work) to isolated CPUs and causes latency spikes:
>
> before the change:
>
> # oslat -c 1-31,33-63,65-95,97-127 -D 62s
> ...
> Maximum: 1203 10 3 4 ... 5 (us)
>
> after the change:
>
> # oslat -c 1-31,33-63,65-95,97-127 -D 62s
> ...
> Maximum: 10 4 3 4 3 ... 5 (us)
It's nice to add numbers, but you should put them at the end _after_ you
described the change. At this point 'after the change' makes no sense as
the reader does not read backwards or jump around randomly in the text.
> Exclude isolated cores from the timer migration algorithm, extend the
> concept of unavailable cores, currently used for offline ones, to
> isolated ones:
> * A core is unavailable if isolated or offline;
> * A core is available if isolated and offline;
>
> A core is considered unavailable as isolated if it belongs to:
> * the isolcpus (domain) list
> * an isolated cpuset
> Except if it is:
> * in the nohz_full list (already idle for the hierarchy)
> * the nohz timekeeper core (must be available to handle global timers)
>
> Due to how the timer migration algorithm works, any CPU part of the
> hierarchy can have their global timers pulled by remote CPUs and have to
> pull remote timers, only skipping pulling remote timers would break the
> logic.
> For this reason, we prevent isolated CPUs from pulling remote global
s/we//
> timers, but also the other way around: any global timer started on an
> isolated CPU will run there. This does not break the concept of
> isolation (global timers don't come from outside the CPU) and, if
> considered inappropriate, can usually be mitigated with other isolation
> techniques (e.g. IRQ pinning).
> +#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
> +extern int tmigr_isolated_exclude_cpumask(cpumask_var_t exclude_cpumask);
s/cpumask_var_t/struct cpumask */
cpumask_var_t is only for declaration purposes to avoid ifdeffery, but a
function always gets a pointer to the cpumask itself.
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 25439f961ccf8..a14fbca7457fe 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -10,6 +10,7 @@
> #include <linux/spinlock.h>
> #include <linux/timerqueue.h>
> #include <trace/events/ipi.h>
> +#include <linux/sched/isolation.h>
>
> #include "timer_migration.h"
> #include "tick-internal.h"
> @@ -1449,6 +1450,13 @@ static int tmigr_cpu_unavailable(unsigned int cpu)
> int migrator;
> u64 firstexp;
>
> + /*
> + * The tick CPU can be marked as isolated by the cpuset code, however
> + * we cannot mark it as unavailable to avoid having no global migrator
> + * for the nohz_full CPUs.
> + */
> + if (!tick_nohz_cpu_hotpluggable(cpu))
> + return 0;
This is really horribly confusing. This function is also invoked from
the CPU offline hotplug callback and in context of CPU hotplug this
check makes absolutely no sense at all.
Please do this check in the smp function call which handles this
isolation muck.
> raw_spin_lock_irq(&tmc->lock);
> tmc->available = false;
This clearly lacks a check whether available is true or not. If it's
false, it must return right here.
To avoid goto or other ugliness, please convert that locking to a scoped
guard in a separate patch. Then you can simply do
if (!tmc->available)
return 0;
inside the lock scope and the compiler will mop it up for you.
> WRITE_ONCE(tmc->wakeup, KTIME_MAX);
> @@ -1478,6 +1486,20 @@ static int tmigr_cpu_available(unsigned int cpu)
> if (WARN_ON_ONCE(!tmc->tmgroup))
> return -EINVAL;
>
> + /*
> + * Domain isolated CPUs don't participate in timer migration, nohz_full
> + * CPUs are still part of the hierarchy but are always considered idle.
> + * Additionally, the tick CPU during nohz_full cannot be disabled.
> + * Checking here guarantees that CPUs isolated at boot (e.g. isolcpus)
> + * are not marked as available when they first become online.
> + * During runtime, any offline isolated CPU is also not incorrectly
> + * marked as available once it gets back online.
> + */
> + if ((!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
> + cpuset_cpu_is_isolated(cpu)) &&
> + housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) &&
> + tick_nohz_cpu_hotpluggable(cpu))
> + return 0;
Same nonsense as above.
> raw_spin_lock_irq(&tmc->lock);
Same missing check for available == true
> trace_tmigr_cpu_available(tmc);
> tmc->idle = timer_base_is_idle();
> @@ -1489,6 +1511,38 @@ static int tmigr_cpu_available(unsigned int cpu)
> return 0;
> }
>
> +static void tmigr_remote_cpu_unavailable(void *ignored)
This is an SMP function call, so what's the _remote_ for? It runs on the
CPU on which the call is scheduled. Please make it entirely clear what
this is about. This is about isolation and not about available. The fact
that it sets/clears the availability is just an implementation detail.
static void tmigr_cpu_isolate(void *ignored)
{
unsigned int cpu = smp_processor_id();
/* Big fat comment */
if (!tick_nohz_cpu_hotpluggable(cpu))
return;
tmigr_set_cpu_available(cpu);
}
> +{
> + tmigr_cpu_unavailable(smp_processor_id());
> +}
> +
> +static void tmigr_remote_cpu_available(void *ignored)
> +{
> + tmigr_cpu_available(smp_processor_id());
> +}
> +
> +int tmigr_isolated_exclude_cpumask(cpumask_var_t exclude_cpumask)
> +{
> + cpumask_var_t cpumask;
> + int ret = 0;
> +
> + lockdep_assert_cpus_held();
> +
> + if (!alloc_cpumask_var(&cpumask, GFP_KERNEL))
> + return -ENOMEM;
> +
> + cpumask_and(cpumask, exclude_cpumask, tmigr_available_cpumask);
> + cpumask_and(cpumask, cpumask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE));
> + on_each_cpu_mask(cpumask, tmigr_remote_cpu_unavailable, NULL, 0);
Why are those function calls async?
Shouldn't it be guaranteed that the change has been committed once this
function returns? If not then this wants a comment why it does not matter.
Thanks,
tglx
On Fri, 2025-06-20 at 19:00 +0200, Thomas Gleixner wrote:
> On Fri, May 30 2025 at 16:20, Gabriele Monaco wrote:
>
> [...]
>
> > +	if ((!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
> > +	     cpuset_cpu_is_isolated(cpu)) &&
> > +	    housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) &&
> > +	    tick_nohz_cpu_hotpluggable(cpu))
> > +		return 0;
>
> Same nonsense as above.
>
Thanks for the review, I agree with moving the check above, but I
believe this should remain here.
tmigr_cpu_available is called at boot time and also applies the
boot-time isolation parameters (tmigr_isolated_exclude_cpumask is only
used by the cpuset code).
Now let's assume a machine booted with the arguments isolcpus=0-3
nohz_full=5-7.
Without checking for tick_nohz_cpu_hotpluggable() here, we would not
set the tick cpu (0) as available at boot, ending up in the unwanted
corner cases discussed in the v5 of the series.
I could remove this call here (which is mostly redundant after boot)
and enable explicitly the tick cpu in another way, but this still seems
cleaner to me.
Does it make sense to you? Is the comment in the code unclear?
Thanks again,
Gabriele
On Tue, Jun 24 2025 at 10:05, Gabriele Monaco wrote:
Please trim your replies. It's a pain to scroll through 160 quoted lines
to find the gist of the mail...
> On Fri, 2025-06-20 at 19:00 +0200, Thomas Gleixner wrote:
>> > + if ((!housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
>> > + cpuset_cpu_is_isolated(cpu)) &&
>> > + housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) &&
>> > + tick_nohz_cpu_hotpluggable(cpu))
>> > + return 0;
>>
>> Same nonsense as above.
>>
> tmigr_cpu_available is called at boot time and is applying also the
> boot time isolation parameters (tmigr_isolated_exclude_cpumask is only
> used by the cpuset code).
>
> Now let's assume a machine booted with the arguments isolcpus=0-3
> nohz_full=5-7.
>
> Without checking for tick_nohz_cpu_hotpluggable() here, we would not
> set the tick cpu (0) as available at boot, ending up in the unwanted
> corner cases discussed in the v5 of the series.
>
> I could remove this call here (which is mostly redundant after boot)
> and enable explicitly the tick cpu in another way, but this still seems
> cleaner to me.
>
> Does it make sense to you? Is the comment in the code unclear?
It does not make sense and the comment does not change that.
The point is that tmigr_init() is an early initcall which is invoked
before SMP is initialized and APs are brought up.
At this point CPU0 can neither be isolated nor nohz full for obvious
reasons, no?
Thanks,
tglx
On Tue, 2025-06-24 at 15:20 +0200, Thomas Gleixner wrote:
> The point is that tmigr_init() is an early initcall which is invoked
> before SMP is initialized and APs are brought up.
>
> At this point CPU0 can neither be isolated nor nohz full for obvious
> reasons, no?

Right, but as far as I understood, the first call to
tmigr_set_cpu_available() happens after the isolcpus parameter has been
parsed so we know at least cpu0 is going to be isolated.

On my machine it works reliably this way.

I'm a bit lost in the init code but seeing housekeeping_init() before
rcu_init(), which in turn should be required for some RCU-related
early_initcalls, makes me believe this order is guaranteed to be
respected. Or am I missing something?

Thanks,
Gabriele
On Tue, Jun 24, 2025 at 04:06:41PM +0200, Gabriele Monaco wrote:
> Right, but as far as I understood, the first call to
> tmigr_set_cpu_available() happens after the isolcpus parameter has been
> parsed so we know at least cpu0 is going to be isolated.

Right, I think you need to keep those checks, because if CPU 0 is in
isolcpus and CPU 5 is nohz_full, CPU 0 will later become the timekeeper
and must stay in the tmigr hierarchy.

OTOH if CPU 0 is in isolcpus and there are no nohz_full CPUs, then CPU 0
doesn't want to go into the hierarchy.

cpuset isolated partitions are different because they issue SMP calls,
whereas isolcpus is defined at boot.

An alternative for isolcpus could be to make a late initcall and do the
smp calls from there, just like is done for cpusets.

Thanks.

--
Frederic Weisbecker
SUSE Labs
On Tue, Jun 24 2025 at 16:52, Frederic Weisbecker wrote:
> Le Tue, Jun 24, 2025 at 04:06:41PM +0200, Gabriele Monaco a écrit :
>> Right, but as far as I understood, the first call to
>> tmigr_set_cpu_available() happens after the isolcpus parameter has been
>> parsed so we know at least cpu0 is going to be isolated.
>>
>> On my machine it works reliably this way. I'm a bit lost in the init
>> code but seeing housekeeping_init() before rcu_init(), which in turn
>> should be required for some RCU-related early_initcalls, makes me
>> believe this order is guaranteed to be respected.
>> Or am I missing something?
>
> Right I think you need to keep those checks because if CPU 0 is isolcpus
> and CPU 5 is nohz_full, CPU 0 will become later the timekeeper and must stay
> in the tmigr hierarchy.
>
> OTOH if CPU 0 is isolcpus and there is no nohz_full CPUs, then CPU 0 doesn't
> want to go to the hierarchy.
>
> cpuset isolated partitions are different because they issue SMP calls whereas
> isolcpus is defined on boot.
>
> An alternative for isolcpus could be to make a late initcall and do the smp
> calls from there just like is done for cpusets.
There is zero reason for isolcpus and nohz full muck to be
active/evaluated during early boot. That's all irrelevant and just
complicates things further.
Can we please clean this up and make it a sensible design instead of
duct taping new functionality into it in completely incomprehensible
ways?
Especially under the aspect that all this should become run-time
modifiable. That requires a run-time switch mechanism anyway, so the
obvious design choice is to utilize that run-time switch late in the
boot sequence to set this stuff up before user space starts and leave
the boot process alone and simple.
The KISS principle applies here fully.
Thanks,
tglx
On Wed, Jun 25, 2025 at 12:45:32PM +0200, Thomas Gleixner wrote:
> Especially under the aspect that all this should become run-time
> modifiable. That requires a run-time switch mechanism anyway, so the
> obvious design choice is to utilize that run-time switch late in the
> boot sequence to set this stuff up before user space starts and leave
> the boot process alone and simple.
>
> The KISS principle applies here fully.

Ok so the late initcall should work.

Thanks.

--
Frederic Weisbecker
SUSE Labs
On Wed, 2025-06-25 at 15:16 +0200, Frederic Weisbecker wrote:
> Ok so the late initcall should work.

Thanks both for the reviews.
I'm a bit puzzled by what is expected now, though.

The late initcall would work just fine to replace the call to
tick_nohz_cpu_hotpluggable(), indeed superfluous for hotplug calls;
however, the checks for housekeeping CPUs are required to prevent
isolated CPUs getting online from becoming available, and so will run
on early boot too (without any practical reason, only because the
hotplug handlers run there).

I might avoid it by playing with cpuhp_setup_state_nocalls perhaps, but
that feels even more hacky.

Otherwise, I can refactor the code to maintain a separate field
(isolated), restore the 'online' field and keep the functions for
online/offline and isolation as separate as possible, while considering:

    available = !isolated && online

This would make reading housekeeping masks superfluous in hotplug (and
boot) code, but again, it doesn't look simpler to me.

Am I missing some obviously elegant solution here?

Thanks,
Gabriele
On Wed, Jun 25, 2025 at 03:46:02PM +0200, Gabriele Monaco wrote:
> Am I missing some obviously elegant solution here?

I keep being confused as well but yes, I think you're right, we need to
keep the checks anyway on CPU up.

--
Frederic Weisbecker
SUSE Labs