[PATCH RESEND 1/2] tick/nohz: Fix wrong NOHZ idle CPU state

Adam Li posted 2 patches 1 month, 1 week ago
[PATCH RESEND 1/2] tick/nohz: Fix wrong NOHZ idle CPU state
Posted by Adam Li 1 month, 1 week ago
NOHZ idle load balance is done among CPUs in nohz.idle_cpus_mask.
A CPU is added to nohz.idle_cpus_mask in:
do_idle()
  -> tick_nohz_idle_stop_tick()
     -> nohz_balance_enter_idle()

nohz_balance_enter_idle() is called if:
1) tick is stopped (TS_FLAG_STOPPED is set)
2) and tick was not already stopped before tick_nohz_idle_stop_tick()
   stops the tick (!was_stopped)

When CONFIG_NO_HZ_FULL is set and the CPU is in the nohz_full list
then 'was_stopped' may always be true.
The flag 'TS_FLAG_STOPPED' may be already set in
tick_nohz_full_stop_tick(). So nohz_balance_enter_idle() has no chance
to be called.

As a result, CPU will stay in a 'wrong' state:
1) tick is stopped (TS_FLAG_STOPPED is set)
2) and CPU is not in nohz.idle_cpus_mask
3) and CPU stays idle

Neither the periodic nor the NOHZ idle load balancing can move task
to this CPU. Some CPUs keep idle while others busy.

In nohz_balance_enter_idle(), 'rq->nohz_tick_stopped' is checked to avoid
duplicated nohz.idle_cpus_mask setting. So for nohz_balance_enter_idle()
there is no need to check the '!was_stopped' condition.

This patch will add the CPU to nohz.idle_cpus_mask as expected.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
 kernel/time/tick-sched.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c527b421c865..b900a120ab54 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1229,8 +1229,9 @@ void tick_nohz_idle_stop_tick(void)
 		ts->idle_sleeps++;
 		ts->idle_expires = expires;
 
-		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
-			ts->idle_jiffies = ts->last_jiffies;
+		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+			if (!was_stopped)
+				ts->idle_jiffies = ts->last_jiffies;
 			nohz_balance_enter_idle(cpu);
 		}
 	} else {
-- 
2.34.1
Re: [PATCH RESEND 1/2] tick/nohz: Fix wrong NOHZ idle CPU state
Posted by Frederic Weisbecker 4 weeks, 1 day ago
Le Thu, Aug 21, 2025 at 04:27:06AM +0000, Adam Li a écrit :
> NOHZ idle load balance is done among CPUs in nohz.idle_cpus_mask.
> A CPU is added to nohz.idle_cpus_mask in:
> do_idle()
>   -> tick_nohz_idle_stop_tick()
>      -> nohz_balance_enter_idle()
> 
> nohz_balance_enter_idle() is called if:
> 1) tick is stopped (TS_FLAG_STOPPED is set)
> 2) and tick was not already stopped before tick_nohz_idle_stop_tick()
>    stops the tick (!was_stopped)
> 
> When CONFIG_NO_HZ_FULL is set and the CPU is in the nohz_full list
> then 'was_stopped' may always be true.
> The flag 'TS_FLAG_STOPPED' may be already set in
> tick_nohz_full_stop_tick(). So nohz_balance_enter_idle() has no chance
> to be called.
> 
> As a result, CPU will stay in a 'wrong' state:
> 1) tick is stopped (TS_FLAG_STOPPED is set)
> 2) and CPU is not in nohz.idle_cpus_mask
> 3) and CPU stays idle
> 
> Neither the periodic nor the NOHZ idle load balancing can move task
> to this CPU. Some CPUs keep idle while others busy.
> 
> In nohz_balance_enter_idle(), 'rq->nohz_tick_stopped' is checked to avoid
> duplicated nohz.idle_cpus_mask setting. So for nohz_balance_enter_idle()
> there is no need to check the '!was_stopped' condition.
> 
> This patch will add the CPU to nohz.idle_cpus_mask as expected.
> 
> Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
> ---
>  kernel/time/tick-sched.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c527b421c865..b900a120ab54 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1229,8 +1229,9 @@ void tick_nohz_idle_stop_tick(void)
>  		ts->idle_sleeps++;
>  		ts->idle_expires = expires;
>  
> -		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> -			ts->idle_jiffies = ts->last_jiffies;
> +		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> +			if (!was_stopped)
> +				ts->idle_jiffies = ts->last_jiffies;
>  			nohz_balance_enter_idle(cpu);

The current state is indeed broken and some people have already tried to fix it.
The thing is nohz_full don't want dynamic isolation because it is deemed to run a
single task. Therefore those tasks must be placed manually in order not to break
isolation guarantees by accident.

In fact nohz_full doesn't make much sense without isolcpus (or isolated cpuset
v2 partitions) and I even intend to make nohz_full depend on domain isolation
in the long term.

Thanks.

>  		}
>  	} else {
> -- 
> 2.34.1
> 

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH RESEND 1/2] tick/nohz: Fix wrong NOHZ idle CPU state
Posted by Christoph Lameter (Ampere) 4 weeks, 1 day ago
On Thu, 4 Sep 2025, Frederic Weisbecker wrote:

> The current state is indeed broken and some people have already tried to fix it.
> The thing is nohz_full don't want dynamic isolation because it is deemed to run a
> single task. Therefore those tasks must be placed manually in order not to break
> isolation guarantees by accident.
>
> In fact nohz_full doesn't make much sense without isolcpus (or isolated cpuset
> v2 partitions) and I even intend to make nohz_full depend on domain isolation
> in the long term.

I have never used isolcpus with nohz_full. AFAICT isolcpus is depreciated
and cpusets are unnecessary complex overhead.
Re: [PATCH RESEND 1/2] tick/nohz: Fix wrong NOHZ idle CPU state
Posted by Frederic Weisbecker 4 weeks ago
Le Thu, Sep 04, 2025 at 09:10:29AM -0700, Christoph Lameter (Ampere) a écrit :
> On Thu, 4 Sep 2025, Frederic Weisbecker wrote:
> 
> > The current state is indeed broken and some people have already tried to fix it.
> > The thing is nohz_full don't want dynamic isolation because it is deemed to run a
> > single task. Therefore those tasks must be placed manually in order not to break
> > isolation guarantees by accident.
> >
> > In fact nohz_full doesn't make much sense without isolcpus (or isolated cpuset
> > v2 partitions) and I even intend to make nohz_full depend on domain isolation
> > in the long term.
> 
> I have never used isolcpus with nohz_full. AFAICT isolcpus is depreciated
> and cpusets are unnecessary complex overhead.

isolcpus for domain isolation is indeed in the way for long term deprecation
and the only replacement possible is cpuset, which overhead is only visible
on partition creation and update.

We could argue on the interface, the point is that nohz_full doesn't make sense
without domain isolation.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [PATCH RESEND 1/2] tick/nohz: Fix wrong NOHZ idle CPU state
Posted by Christoph Lameter (Ampere) 3 weeks, 4 days ago
On Fri, 5 Sep 2025, Frederic Weisbecker wrote:

> isolcpus for domain isolation is indeed in the way for long term deprecation
> and the only replacement possible is cpuset, which overhead is only visible
> on partition creation and update.
>
> We could argue on the interface, the point is that nohz_full doesn't make sense
> without domain isolation.

Most use cases I see use nohz_full on all cpus and rely on the OS to
exempt the sheperd cpu.