nohz_balance_enter_idle:
cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
atomic_inc(&nohz.nr_cpus)
nohz_balance_exit_idle:
cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
atomic_dec(&nohz.nr_cpus)
nohz_balancer_kick:
if (unlikely(!atomic_read(&nohz.nr_cpus)))
	goto out;
So idle_cpus_mask already contains the same information. Instead of doing
a costly atomic read on large systems, it is better to check whether the
cpumask is empty to make the same decision about triggering idle load
balance.

There might be a race between the cpumask_empty() check and a remote CPU
setting its bit in the mask. In that case the idle load balance will simply
be triggered at the next tick. A race with clearing the bit is not a concern
either, since _nohz_idle_balance() checks whether a CPU is actually idle
before balancing it, as sketched below.

cpumask_empty() uses ffs, so it should not be very costly.
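
For reference, a trimmed-down sketch of the loop in _nohz_idle_balance()
that makes a stale set bit harmless (abridged and paraphrased, not the
verbatim kernel code; see kernel/sched/fair.c for the exact version):

	for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu + 1) {
		/*
		 * A CPU whose bit is still set but which has gone busy
		 * again in the meantime is simply skipped, so a stale
		 * bit only costs this one extra check.
		 */
		if (!idle_cpu(balance_cpu))
			continue;

		/* ... rebalance on behalf of balance_cpu ... */
	}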
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/fair.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5534822fd754..3afa79d733dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7144,7 +7144,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
static struct {
cpumask_var_t idle_cpus_mask;
- atomic_t nr_cpus;
int has_blocked; /* Idle CPUS has blocked load */
int needs_update; /* Newly idle CPUs need their next_balance collated */
unsigned long next_balance; /* in jiffy units */
@@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
* None are in tickless mode and hence no need for NOHZ idle load
* balancing, do stats update if its due
*/
- if (unlikely(!atomic_read(&nohz.nr_cpus)))
+ if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
goto out;
if (rq->nr_running >= 2) {
@@ -12563,7 +12562,6 @@ void nohz_balance_exit_idle(struct rq *rq)
rq->nohz_tick_stopped = 0;
cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
- atomic_dec(&nohz.nr_cpus);
set_cpu_sd_state_busy(rq->cpu);
}
@@ -12621,7 +12619,6 @@ void nohz_balance_enter_idle(int cpu)
rq->nohz_tick_stopped = 1;
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
- atomic_inc(&nohz.nr_cpus);
/*
* Ensures that if nohz_idle_balance() fails to observe our
--
2.43.0
* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> nohz_balance_enter_idle:
> cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
> atomic_inc(&nohz.nr_cpus)
>
> nohz_balance_exit_idle:
> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
> atomic_dec(&nohz.nr_cpus)
>
> nohz_balancer_kick:
> if (unlikely(!atomic_read(&nohz.nr_cpus)))
> 	goto out;
>
> So idle_cpus_mask already contains the same information. Instead of doing
> a costly atomic read on large systems, it is better to check whether the
> cpumask is empty to make the same decision about triggering idle load
> balance.
>
> There might be a race between the cpumask_empty() check and a remote CPU
> setting its bit in the mask. In that case the idle load balance will simply
> be triggered at the next tick. A race with clearing the bit is not a concern
> either, since _nohz_idle_balance() checks whether a CPU is actually idle
> before balancing it.
>
> cpumask_empty() uses ffs, so it should not be very costly.
>
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> static struct {
> cpumask_var_t idle_cpus_mask;
> - atomic_t nr_cpus;
> int has_blocked; /* Idle CPUS has blocked load */
> int needs_update; /* Newly idle CPUs need their next_balance collated */
> unsigned long next_balance; /* in jiffy units */
> @@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
> * None are in tickless mode and hence no need for NOHZ idle load
> * balancing, do stats update if its due
> */
> - if (unlikely(!atomic_read(&nohz.nr_cpus)))
> + if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
> goto out;
So the thing is, if the goal is to avoid cacheline
bouncing, this won't fundamentally change the
situation:
> rq->nohz_tick_stopped = 0;
> cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
> - atomic_dec(&nohz.nr_cpus);
nohz.idle_cpus_mask will be on a single 64-byte
cacheline even on 512 CPU systems, and the
cpumask_clear_cpu() and cpumask_set_cpu() calls will
dirty the cacheline and make it bounce with exactly the
same frequency as the atomic_inc/dec() of nohz.nr_cpus
does today.
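
As a quick back-of-the-envelope check (plain user-space C, not kernel
code; the 64-byte line size and the mask_bytes() helper name are just
assumptions for illustration), the inline bitmap for up to 512 CPUs does
fit in a single cacheline, so every cpumask_set_cpu()/cpumask_clear_cpu()
touches that same line:

	#include <stdio.h>

	#define CACHELINE_BYTES 64

	/* Bytes of bitmap storage needed for nr_cpus bits, rounded up to longs. */
	static unsigned long mask_bytes(unsigned int nr_cpus)
	{
		unsigned int bits_per_long = 8 * sizeof(unsigned long);
		unsigned long longs = (nr_cpus + bits_per_long - 1) / bits_per_long;

		return longs * sizeof(unsigned long);
	}

	int main(void)
	{
		unsigned int sizes[] = { 64, 128, 512, 1024, 2048 };

		for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
			unsigned long bytes = mask_bytes(sizes[i]);
			unsigned long lines = (bytes + CACHELINE_BYTES - 1) / CACHELINE_BYTES;

			printf("%4u CPUs -> %3lu bytes of mask -> %lu cacheline(s)\n",
			       sizes[i], bytes, lines);
		}
		return 0;
	}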
From the 0/4 boilerplate description:
> It was noted that, when running on large systems,
> the nohz.nr_cpus cacheline was bouncing quite often.
> There are atomic inc/dec and read operations happening
> on many CPUs at a time, so it is possible for this
> line to bounce often.
That the nr_cpus modification is an atomic op doesn't
change the situation much in terms of cacheline
bouncing, because the cacheline dirtying will still
cause comparable levels of bouncing on modern CPUs with
modern cache coherency protocols.
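
Note also that the mask updates themselves are atomic RMWs on that shared
line; roughly (simplified from include/linux/cpumask.h, modulo kernel
version differences):

	static __always_inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
	{
		/* atomic RMW on the word holding this CPU's bit */
		set_bit(cpumask_check(cpu), cpumask_bits(dstp));
	}

	static __always_inline void cpumask_clear_cpu(unsigned int cpu, struct cpumask *dstp)
	{
		/* atomic RMW on the word holding this CPU's bit */
		clear_bit(cpumask_check(cpu), cpumask_bits(dstp));
	}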
If nohz.idle_cpus_mask and nohz.nr_cpus are in
separate cachelines, then this patch might eliminate
about half of the bounces - but AFAICS they are right
next to each other, so unless it's off-stack cpumasks,
they should be in the same cacheline. Half of 'bad
bouncing' is still kinda 'bad bouncing'. :-)
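
For reference, what decides whether the mask words sit
right next to nr_cpus or in a separate allocation is
roughly this (abridged from include/linux/cpumask.h):

	#ifdef CONFIG_CPUMASK_OFFSTACK
	typedef struct cpumask *cpumask_var_t;	 /* only a pointer is embedded in struct nohz */
	#else
	typedef struct cpumask cpumask_var_t[1]; /* bitmap words embedded right before nr_cpus */
	#endif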
I'm not really objecting to the patch, because it would
reduce cacheline bouncing in the offstack-mask case,
but the explanation isn't very clear about these
details.
Thanks,
Ingo