These days most of the system have multi cores. The likelyhood of
at least one or more CPUs in nohz (idle state) is higher.
So move likely to unlikely.
Allow stats balancing to complete when there are no nr_cpus as the check
happens later. This may do an additional stats based load balancing
which would reset has_blocked_load. Code also looks saner by removing
that uncharactiristic return in between.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd1c78d2c272..5ceb9126d441 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12456,10 +12456,10 @@ static void nohz_balancer_kick(struct rq *rq)
/*
* None are in tickless mode and hence no need for NOHZ idle load
- * balancing:
+ * balancing, do stats update if its due
*/
- if (likely(!atomic_read(&nohz.nr_cpus)))
- return;
+ if (unlikely(!atomic_read(&nohz.nr_cpus)))
+ goto out;
if (rq->nr_running >= 2) {
flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
--
2.47.3
Hello Shrikanth,
On 1/2/2026 6:17 PM, Shrikanth Hegde wrote:
> These days most of the system have multi cores. The likelyhood of
> at least one or more CPUs in nohz (idle state) is higher.
> So move likely to unlikely.
>
> Allow stats balancing to complete when there are no nr_cpus as the check
> happens later. This may do an additional stats based load balancing
> which would reset has_blocked_load. Code also looks saner by removing
> that uncharactiristic return in between.
>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
> kernel/sched/fair.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cd1c78d2c272..5ceb9126d441 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12456,10 +12456,10 @@ static void nohz_balancer_kick(struct rq *rq)
>
> /*
> * None are in tickless mode and hence no need for NOHZ idle load
> - * balancing:
> + * balancing, do stats update if its due
> */
> - if (likely(!atomic_read(&nohz.nr_cpus)))
> - return;
> + if (unlikely(!atomic_read(&nohz.nohz_balancer_kick)))
> + goto out;
Since we are sure that "nohz.nr_cpus" is 0, there is a good chance
find_new_ilb() in kick_ilb() will not find any CPU to run balance on, so
why not just retain that return?
The "flags" can only be set to (NOHZ_NEXT_KICK | NOHZ_STATS_KICK) on
this path and kick_ilb() will simply return early without updating
"nohz.next_balance" when it doesn't see NOHZ_BALANCE_KICK and fails to
find any CPU. Might as well keep the early return.
>
> if (rq->nr_running >= 2) {
> flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
--
Thanks and Regards,
Prateek
On 1/5/26 9:22 AM, K Prateek Nayak wrote: > Hello Shrikanth, > > On 1/2/2026 6:17 PM, Shrikanth Hegde wrote: >> These days most of the system have multi cores. The likelyhood of >> at least one or more CPUs in nohz (idle state) is higher. >> So move likely to unlikely. >> >> Allow stats balancing to complete when there are no nr_cpus as the check >> happens later. This may do an additional stats based load balancing >> which would reset has_blocked_load. Code also looks saner by removing >> that uncharactiristic return in between. >> >> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> >> --- >> kernel/sched/fair.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index cd1c78d2c272..5ceb9126d441 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -12456,10 +12456,10 @@ static void nohz_balancer_kick(struct rq *rq) >> >> /* >> * None are in tickless mode and hence no need for NOHZ idle load >> - * balancing: >> + * balancing, do stats update if its due >> */ >> - if (likely(!atomic_read(&nohz.nr_cpus))) >> - return; >> + if (unlikely(!atomic_read(&nohz.nohz_balancer_kick))) >> + goto out; Did something got edited here? > Since we are sure that "nohz.nr_cpus" is 0, there is a good chance > find_new_ilb() in kick_ilb() will not find any CPU to run balance on, so > why not just retain that return? > > The "flags" can only be set to (NOHZ_NEXT_KICK | NOHZ_STATS_KICK) on > this path and kick_ilb() will simply return early without updating > "nohz.next_balance" when it doesn't see NOHZ_BALANCE_KICK and fails to > find any CPU. Might as well keep the early return. > The only reason why flags would be set is, if nohz.has_blocked_load is set and time is after next_blocked. In that case, doing a stats based balance will make nohz.has_blocked_load=0 and subsequent invocations flags =0 and no load balance will happen if nr_cpus stays 0. However, if we just, has_blocked_load might remains stale value. Isn't that the case?
Hello Shrikanth,
On 1/5/2026 10:37 AM, Shrikanth Hegde wrote:
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -12456,10 +12456,10 @@ static void nohz_balancer_kick(struct rq *rq)
>>> /*
>>> * None are in tickless mode and hence no need for NOHZ idle load
>>> - * balancing:
>>> + * balancing, do stats update if its due
>>> */
>>> - if (likely(!atomic_read(&nohz.nr_cpus)))
>>> - return;
>>> + if (unlikely(!atomic_read(&nohz.nr_cpus)))
>>> + goto out;
>
> Did something got edited here?
Welp! Stray edit. My bad. Reverted back to original diff.
>
>> Since we are sure that "nohz.nr_cpus" is 0, there is a good chance
>> find_new_ilb() in kick_ilb() will not find any CPU to run balance on, so
>> why not just retain that return?
>>
>> The "flags" can only be set to (NOHZ_NEXT_KICK | NOHZ_STATS_KICK) on
>> this path and kick_ilb() will simply return early without updating
>> "nohz.next_balance" when it doesn't see NOHZ_BALANCE_KICK and fails to
>> find any CPU. Might as well keep the early return.
>>
>
> The only reason why flags would be set is, if nohz.has_blocked_load
> is set and time is after next_blocked. In that case, doing a stats
> based balance will make nohz.has_blocked_load=0 and subsequent invocations
> flags =0 and no load balance will happen if nr_cpus stays 0.
>
> However, if we just, has_blocked_load might remains stale value.
>
> Isn't that the case?
So cumulatively, including Patch 3, we do:
flags = 0;
if (READ_ONCE(nohz.has_blocked_load) && ...)
flags = NOHZ_STATS_KICK;
if (time_before(now, nohz.next_balance))
goto out; /* Checks nohz.idle_cpus_mask in find_new_ilb() ... (1) */
if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
goto out; /* Still goes to kick_ilb() ... (2) */
...
out:
if (READ_ONCE(nohz.needs_update))
flags |= NOHZ_NEXT_KICK;
/* assume either NOHZ_STATS_KICK or NOHZ_NEXT_KICK is set */
kick_ilb()
{
if (flags & NOHZ_BALANCE_KICK) /* Not possible */
...
ilb_cpu = find_new_ilb(); /* Find CPU in nohz.idle_cpus_mask */
If we arrive here from (2), we know "nohz.idle_cpus_mask" was empty a
while back and we've not updated any global "nohz" state. If we don't
find an ilb_cpu, we just do:
if (ilb_cpu < 0)
return;
So why not simply return from (2)?
--
Thanks and Regards,
Prateek
Hi Prateek,
>
> On 1/5/2026 10:37 AM, Shrikanth Hegde wrote:
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -12456,10 +12456,10 @@ static void nohz_balancer_kick(struct rq *rq)
>>>> /*
>>>> * None are in tickless mode and hence no need for NOHZ idle load
>>>> - * balancing:
>>>> + * balancing, do stats update if its due
>>>> */
>>>> - if (likely(!atomic_read(&nohz.nr_cpus)))
>>>> - return;
>>>> + if (unlikely(!atomic_read(&nohz.nr_cpus)))
>>>> + goto out;
>>
>> Did something got edited here?
>
> Welp! Stray edit. My bad. Reverted back to original diff.
>
>>
>>> Since we are sure that "nohz.nr_cpus" is 0, there is a good chance
>>> find_new_ilb() in kick_ilb() will not find any CPU to run balance on, so
>>> why not just retain that return?
>>>
>>> The "flags" can only be set to (NOHZ_NEXT_KICK | NOHZ_STATS_KICK) on
>>> this path and kick_ilb() will simply return early without updating
>>> "nohz.next_balance" when it doesn't see NOHZ_BALANCE_KICK and fails to
>>> find any CPU. Might as well keep the early return.
>>>
>>
>> The only reason why flags would be set is, if nohz.has_blocked_load
>> is set and time is after next_blocked. In that case, doing a stats
>> based balance will make nohz.has_blocked_load=0 and subsequent invocations
>> flags =0 and no load balance will happen if nr_cpus stays 0.
>>
>> However, if we just, has_blocked_load might remains stale value.
>>
>> Isn't that the case?
>
> So cumulatively, including Patch 3, we do:
>
> flags = 0;
>
> if (READ_ONCE(nohz.has_blocked_load) && ...)
> flags = NOHZ_STATS_KICK;
>
> if (time_before(now, nohz.next_balance))
> goto out; /* Checks nohz.idle_cpus_mask in find_new_ilb() ... (1) */
>
> if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
> goto out; /* Still goes to kick_ilb() ... (2) */
>
> ...
>
> out:
> if (READ_ONCE(nohz.needs_update))
> flags |= NOHZ_NEXT_KICK;
>
> /* assume either NOHZ_STATS_KICK or NOHZ_NEXT_KICK is set */
> kick_ilb()
> {
> if (flags & NOHZ_BALANCE_KICK) /* Not possible */
> ...
>
> ilb_cpu = find_new_ilb(); /* Find CPU in nohz.idle_cpus_mask */
>
>
> If we arrive here from (2), we know "nohz.idle_cpus_mask" was empty a
> while back and we've not updated any global "nohz" state. If we don't
> find an ilb_cpu, we just do:
>
> if (ilb_cpu < 0)
> return;
>
> So why not simply return from (2)?
>
I see, kick_ilb though called will not do a balance since ilb_cpu was not found.
I don't want to have that return in between the two out's.
How about we do below? When there are no idle CPUs left, both has_blocked_load
and needs_update should be reset. no?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 805b53d9709e..fa0e6065bc9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12377,6 +12377,15 @@ static inline int find_new_ilb(void)
return ilb_cpu;
}
+ /* There is no idle CPU left.
+ * reset has_blocked_load and needs_update, such that unless
+ * some CPU enters idle state, it will not trigger kick_ilb
+ */
+ if (READ_ONCE(nohz.has_blocked_load))
+ WRITE_ONCE(nohz.has_blocked_load, 0);
+ if (READ_ONCE(nohz.needs_update))
+ WRITE_ONCE(nohz.needs_update, 0);
+
return -1;
}
Hello Shrikanth,
On 1/5/2026 4:09 PM, Shrikanth Hegde wrote:
>> So cumulatively, including Patch 3, we do:
>>
>> flags = 0;
>>
>> if (READ_ONCE(nohz.has_blocked_load) && ...)
>> flags = NOHZ_STATS_KICK;
>>
>> if (time_before(now, nohz.next_balance))
>> goto out; /* Checks nohz.idle_cpus_mask in find_new_ilb() ... (1) */
>>
>> if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>> goto out; /* Still goes to kick_ilb() ... (2) */
>>
>> ...
>>
>> out:
>> if (READ_ONCE(nohz.needs_update))
>> flags |= NOHZ_NEXT_KICK;
>>
>> /* assume either NOHZ_STATS_KICK or NOHZ_NEXT_KICK is set */
>> kick_ilb()
>> {
>> if (flags & NOHZ_BALANCE_KICK) /* Not possible */
>> ...
>>
>> ilb_cpu = find_new_ilb(); /* Find CPU in nohz.idle_cpus_mask */
>>
>>
>> If we arrive here from (2), we know "nohz.idle_cpus_mask" was empty a
>> while back and we've not updated any global "nohz" state. If we don't
>> find an ilb_cpu, we just do:
>>
>> if (ilb_cpu < 0)
>> return;
>>
>> So why not simply return from (2)?
>>
>
> I see, kick_ilb though called will not do a balance since ilb_cpu was not found.
>
> I don't want to have that return in between the two out's.
>
> How about we do below? When there are no idle CPUs left, both has_blocked_load
> and needs_update should be reset. no?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 805b53d9709e..fa0e6065bc9c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12377,6 +12377,15 @@ static inline int find_new_ilb(void)
> return ilb_cpu;
> }
>
> + /* There is no idle CPU left.
> + * reset has_blocked_load and needs_update, such that unless
> + * some CPU enters idle state, it will not trigger kick_ilb
> + */
> + if (READ_ONCE(nohz.has_blocked_load))
> + WRITE_ONCE(nohz.has_blocked_load, 0);
> + if (READ_ONCE(nohz.needs_update))
> + WRITE_ONCE(nohz.needs_update, 0);
> +
I'm slightly skeptical - find_new_ilb() will also fail to find any CPU
if idle_cpu() returns false momentarily.
Those CPUs can again go back to idle without updating
"nohz.has_blocked_load" - tick_nohz_idle_stop_tick() will skip
nohz_balance_enter_idle() if the CPU already had TS_FLAG_STOPPED set.
For those cases, we'll need to retain the "nohz" state as is until a
suitable ILB CPU can be found - this is the reason why we delay
clearing the "nohz" state until _nohz_idle_balance() and reconstruct it
once the CPU is done with idle balancing.
There are also nuances like the smp_mb__after_atomic() in
nohz_balance_enter_idle() which requires us to check the
"nohz.idle_cpus_mask" after we are done clearing "nohz.needs_update"
and "nohz.has_blocked_load".
> return -1;
> }
>
>
--
Thanks and Regards,
Prateek
On 1/6/26 8:33 AM, K Prateek Nayak wrote:
> Hello Shrikanth,
>
> On 1/5/2026 4:09 PM, Shrikanth Hegde wrote:
>>> So cumulatively, including Patch 3, we do:
>>>
>>> flags = 0;
>>>
>>> if (READ_ONCE(nohz.has_blocked_load) && ...)
>>> flags = NOHZ_STATS_KICK;
>>>
>>> if (time_before(now, nohz.next_balance))
>>> goto out; /* Checks nohz.idle_cpus_mask in find_new_ilb() ... (1) */
>>>
>>> if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
>>> goto out; /* Still goes to kick_ilb() ... (2) */
>>>
>>> ...
>>>
>>> out:
>>> if (READ_ONCE(nohz.needs_update))
>>> flags |= NOHZ_NEXT_KICK;
>>>
>>> /* assume either NOHZ_STATS_KICK or NOHZ_NEXT_KICK is set */
>>> kick_ilb()
>>> {
>>> if (flags & NOHZ_BALANCE_KICK) /* Not possible */
>>> ...
>>>
>>> ilb_cpu = find_new_ilb(); /* Find CPU in nohz.idle_cpus_mask */
>>>
>>>
>>> If we arrive here from (2), we know "nohz.idle_cpus_mask" was empty a
>>> while back and we've not updated any global "nohz" state. If we don't
>>> find an ilb_cpu, we just do:
>>>
>>> if (ilb_cpu < 0)
>>> return;
>>>
>>> So why not simply return from (2)?
>>>
>>
>> I see, kick_ilb though called will not do a balance since ilb_cpu was not found.
>>
>> I don't want to have that return in between the two out's.
>>
>> How about we do below? When there are no idle CPUs left, both has_blocked_load
>> and needs_update should be reset. no?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 805b53d9709e..fa0e6065bc9c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12377,6 +12377,15 @@ static inline int find_new_ilb(void)
>> return ilb_cpu;
>> }
>>
>> + /* There is no idle CPU left.
>> + * reset has_blocked_load and needs_update, such that unless
>> + * some CPU enters idle state, it will not trigger kick_ilb
>> + */
cpumask_empty check.
>> + if (READ_ONCE(nohz.has_blocked_load))
>> + WRITE_ONCE(nohz.has_blocked_load, 0);
>> + if (READ_ONCE(nohz.needs_update))
>> + WRITE_ONCE(nohz.needs_update, 0);
>> +
>
> I'm slightly skeptical - find_new_ilb() will also fail to find any CPU
> if idle_cpu() returns false momentarily.
>
Yes. I even thought to put cpumask_empty check here and then do these updates.
But since there might be a window where a remote CPU may be going idle,
might race with it and possibly lose a chance for idle balance.
Since it is extreme case of 100% busy, all these checks may not be necessary.
Will put a return instead :)
Thanks for pointing that out.
> Those CPUs can again go back to idle without updating
> "nohz.has_blocked_load" - tick_nohz_idle_stop_tick() will skip
> nohz_balance_enter_idle() if the CPU already had TS_FLAG_STOPPED set.
>
> For those cases, we'll need to retain the "nohz" state as is until a
> suitable ILB CPU can be found - this is the reason why we delay
> clearing the "nohz" state until _nohz_idle_balance() and reconstruct it
> once the CPU is done with idle balancing.
>
> There are also nuances like the smp_mb__after_atomic() in
That reminds me, need to upgrade this to smp_mb now, given atomic is gone.
> nohz_balance_enter_idle() which requires us to check the
> "nohz.idle_cpus_mask" after we are done clearing "nohz.needs_update"
> and "nohz.has_blocked_load".
>
>> return -1;
>> }
>>
>>
>
Hello Shrikanth, On 1/6/2026 9:53 AM, Shrikanth Hegde wrote: >> There are also nuances like the smp_mb__after_atomic() in > > That reminds me, need to upgrade this to smp_mb now, given atomic is gone. Isn't cpumask_set_cpu() also an atomic op? set_bit() has a comment stating it is a "relaxed atomic operation" in asm-generic/bitops/instrumented-atomic.h which makes it similar to a atomic_inc(). -- Thanks and Regards, Prateek
On 1/6/26 11:20 AM, K Prateek Nayak wrote: > Hello Shrikanth, > > On 1/6/2026 9:53 AM, Shrikanth Hegde wrote: >>> There are also nuances like the smp_mb__after_atomic() in >> >> That reminds me, need to upgrade this to smp_mb now, given atomic is gone. > > Isn't cpumask_set_cpu() also an atomic op? > > set_bit() has a comment stating it is a "relaxed atomic operation" in > asm-generic/bitops/instrumented-atomic.h which makes it similar to a > atomic_inc(). " * set_bit - Atomically set a bit in memory * @nr: the bit to set * @addr: the address to start counting from * * This is a relaxed atomic operation (no implied memory barriers). " I believed the comment. If it was relaxed atomic operation then one would need a smp barrier. But it calls arch_set_bit. Looking at x86 implementation in arch/x86/include/asm/bitops.h it looks like it does full smp_mb. Whereas on powerpc it doesn't. So smp_mb__after_atomic can remain as is. Effective change to v3 is return instead of out.
© 2016 - 2026 Red Hat, Inc.