NOHZ idle load balancing is done among the CPUs in nohz.idle_cpus_mask.
A CPU is added to nohz.idle_cpus_mask in:
do_idle()
-> tick_nohz_idle_stop_tick()
-> nohz_balance_enter_idle()
nohz_balance_enter_idle() is only called if (see the sketch below):
1) the tick is stopped (TS_FLAG_STOPPED is set)
2) and the tick was not already stopped before tick_nohz_idle_stop_tick()
   stopped it (!was_stopped)
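For reference, the pre-patch gating in tick_nohz_idle_stop_tick() looks
roughly like this (simplified sketch based on the hunk below, not the
exact upstream code):

	int was_stopped = tick_sched_flag_test(ts, TS_FLAG_STOPPED);

	tick_nohz_stop_tick(ts, cpu);

	/* Only taken when *this* call is the one that stopped the tick */
	if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
		ts->idle_jiffies = ts->last_jiffies;
		nohz_balance_enter_idle(cpu);
	}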
When CONFIG_NO_HZ_FULL is set and the CPU is in the nohz_full list,
'was_stopped' may always be true: the TS_FLAG_STOPPED flag may already
have been set by tick_nohz_full_stop_tick(), so nohz_balance_enter_idle()
never gets a chance to be called.
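A rough timeline of the assumed ordering on a nohz_full CPU:

tick_nohz_full_stop_tick()
-> TS_FLAG_STOPPED is set
...CPU becomes idle...
do_idle()
-> tick_nohz_idle_stop_tick()
   -> was_stopped reads as true (flag already set)
   -> the '!was_stopped' check fails
   -> nohz_balance_enter_idle() is skipped, the CPU is never added
      to nohz.idle_cpus_mask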
As a result, the CPU can stay in a 'wrong' state:
1) the tick is stopped (TS_FLAG_STOPPED is set)
2) and the CPU is not in nohz.idle_cpus_mask
3) and the CPU stays idle
Neither periodic nor NOHZ idle load balancing can move tasks to this
CPU. Some CPUs stay idle while others are busy.
In nohz_balance_enter_idle(), 'rq->nohz_tick_stopped' is already checked
to avoid setting nohz.idle_cpus_mask twice, so there is no need to guard
the call with the '!was_stopped' condition.
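For reference, the guard in question (abridged sketch of
nohz_balance_enter_idle() in kernel/sched/fair.c; exact details can
differ between kernel versions):

	void nohz_balance_enter_idle(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		...
		/* Spare repeated updates if the tick was already stopped */
		if (rq->nohz_tick_stopped)
			goto out;
		...
		rq->nohz_tick_stopped = 1;

		cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
		atomic_inc(&nohz.nr_cpus);
		...
	}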
With this patch, the CPU is added to nohz.idle_cpus_mask as expected.
Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
kernel/time/tick-sched.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..b900a120ab54 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1229,8 +1229,9 @@ void tick_nohz_idle_stop_tick(void)
 		ts->idle_sleeps++;
 		ts->idle_expires = expires;
 
-		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
-			ts->idle_jiffies = ts->last_jiffies;
+		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+			if (!was_stopped)
+				ts->idle_jiffies = ts->last_jiffies;
 			nohz_balance_enter_idle(cpu);
 		}
 	} else {
--
2.34.1
On Thu, Aug 21, 2025 at 04:27:06AM +0000, Adam Li wrote:
> [...]
> +		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> +			if (!was_stopped)
> +				ts->idle_jiffies = ts->last_jiffies;
>  			nohz_balance_enter_idle(cpu);

The current state is indeed broken and some people have already tried to
fix it. The thing is, nohz_full does not want dynamic isolation because it
is deemed to run a single task. Therefore those tasks must be placed
manually in order not to break isolation guarantees by accident.

In fact nohz_full doesn't make much sense without isolcpus (or isolated
cpuset v2 partitions), and I even intend to make nohz_full depend on
domain isolation in the long term.

Thanks.

>  		}
>  	} else {

-- 
Frederic Weisbecker
SUSE Labs
On Thu, 4 Sep 2025, Frederic Weisbecker wrote:

> The current state is indeed broken and some people have already tried to
> fix it. The thing is, nohz_full does not want dynamic isolation because it
> is deemed to run a single task. Therefore those tasks must be placed
> manually in order not to break isolation guarantees by accident.
>
> In fact nohz_full doesn't make much sense without isolcpus (or isolated
> cpuset v2 partitions), and I even intend to make nohz_full depend on
> domain isolation in the long term.

I have never used isolcpus with nohz_full. AFAICT isolcpus is deprecated
and cpusets are unnecessarily complex overhead.
On Thu, Sep 04, 2025 at 09:10:29AM -0700, Christoph Lameter (Ampere) wrote:
> I have never used isolcpus with nohz_full. AFAICT isolcpus is deprecated
> and cpusets are unnecessarily complex overhead.

isolcpus for domain isolation is indeed on its way to long-term
deprecation, and the only possible replacement is cpuset, whose overhead
is only visible on partition creation and update.

We could argue about the interface; the point is that nohz_full doesn't
make sense without domain isolation.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
On Fri, 5 Sep 2025, Frederic Weisbecker wrote:

> isolcpus for domain isolation is indeed on its way to long-term
> deprecation, and the only possible replacement is cpuset, whose overhead
> is only visible on partition creation and update.
>
> We could argue about the interface; the point is that nohz_full doesn't
> make sense without domain isolation.

Most use cases I see use nohz_full on all CPUs and rely on the OS to
exempt the shepherd CPU.