At load balance time, balancing of the last level cache domains and
above needs to be serialized. The scheduler checks the atomic variable
sched_balance_running first and only then sees whether a load balance
is due. This is an expensive operation, as multiple CPUs can attempt
to acquire sched_balance_running at the same time.

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
and running OLTP workloads, 7.6% of CPU cycles are spent on the cmpxchg
of sched_balance_running. Most of the time, a balance attempt is
aborted immediately after acquiring sched_balance_running because the
load balance is not yet due.

Instead, check whether the balance is due before acquiring
sched_balance_running. This skips many useless acquisitions of
sched_balance_running and knocks the 7.6% CPU overhead in
sched_balance_domains() down to 0.05%. Throughput of the OLTP workload
improved by 11%.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
---
kernel/sched/fair.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..5e5f7a770b2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
interval = get_sd_balance_interval(sd, busy);
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
+ need_serialize = sd->flags & SD_SERIALIZE;
+ if (need_serialize) {
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ goto out;
+ }
+
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
* The LBF_DST_PINNED logic could have changed
@@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
+ if (need_serialize)
+ atomic_set_release(&sched_balance_running, 0);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
--
2.32.0
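For reference, this is roughly how the affected section of
sched_balance_domains() reads with the patch applied, reconstructed from
the hunks above (context elided by the hunks is marked):

	interval = get_sd_balance_interval(sd, busy);

	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		need_serialize = sd->flags & SD_SERIALIZE;
		if (need_serialize) {
			/* Contend for the flag only when a balance is actually due. */
			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
				goto out;
		}

		if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
			/*
			 * The LBF_DST_PINNED logic could have changed
			 * ... (rest of this block elided by the hunks above)
			 */
		}
		sd->last_balance = jiffies;
		interval = get_sd_balance_interval(sd, busy);
		if (need_serialize)
			atomic_set_release(&sched_balance_running, 0);
	}
out:
	if (time_after(next_balance, sd->last_balance + interval)) {
		next_balance = sd->last_balance + interval;
		update_next_balance = 1;
	}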
On Wed, 16 Apr 2025 at 05:51, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> At load balance time, balance of last level cache domains and
> above needs to be serialized. The scheduler checks the atomic var
> sched_balance_running first and then see if time is due for a load
> balance. This is an expensive operation as multiple CPUs can attempt
> sched_balance_running acquisition at the same time.
>
> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> sched_balance_running. Most of the time, a balance attempt is aborted
> immediately after acquiring sched_balance_running as load balance time
> is not due.
>
> Instead, check balance due time first before acquiring
> sched_balance_running. This skips many useless acquisitions
> of sched_balance_running and knocks the 7.6% CPU overhead on
> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> improved by 11%.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e580..5e5f7a770b2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> interval = get_sd_balance_interval(sd, busy);
>
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + goto out;
> + }
> +
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> * The LBF_DST_PINNED logic could have changed
> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> + if (need_serialize)
> + atomic_set_release(&sched_balance_running, 0);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> out:
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
> --
> 2.32.0
>
On Fri, Jun 06, 2025 at 03:51:34PM +0200, Vincent Guittot wrote:
> On Wed, 16 Apr 2025 at 05:51, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > At load balance time, balance of last level cache domains and
> > above needs to be serialized. The scheduler checks the atomic var
> > sched_balance_running first and then see if time is due for a load
> > balance. This is an expensive operation as multiple CPUs can attempt
> > sched_balance_running acquisition at the same time.
> >
> > On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> > running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> > sched_balance_running. Most of the time, a balance attempt is aborted
> > immediately after acquiring sched_balance_running as load balance time
> > is not due.
> >
> > Instead, check balance due time first before acquiring
> > sched_balance_running. This skips many useless acquisitions
> > of sched_balance_running and knocks the 7.6% CPU overhead on
> > sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> > improved by 11%.
> >
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> > Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> >
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
>

Reviewed-by: Mel Gorman <mgorman@techsingularity.net>

I've been missing for a while and even now am on a reduced workload, so
I'm only looking at this patch now. It was never merged, but why? It
looks like a no-brainer to avoid an atomic operation with minimal
effort, even if it only applies to balancing across NUMA domains.
Performance looks better for a small number of workloads on
multi-socket machines, including some Zen variants. Most results were
neutral, which is not very surprising given the path affected. I made no
effort to determine how hot this particular path is for any of the
tested workloads, but nothing obviously superseded this patch or made it
irrelevant.

--
Mel Gorman
SUSE Labs
On 4/16/2025 11:58 AM, Tim Chen wrote:
> At load balance time, balance of last level cache domains and
> above needs to be serialized. The scheduler checks the atomic var
> sched_balance_running first and then see if time is due for a load
> balance. This is an expensive operation as multiple CPUs can attempt
> sched_balance_running acquisition at the same time.
>
> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> sched_balance_running. Most of the time, a balance attempt is aborted
> immediately after acquiring sched_balance_running as load balance time
> is not due.
>
> Instead, check balance due time first before acquiring
> sched_balance_running. This skips many useless acquisitions
> of sched_balance_running and knocks the 7.6% CPU overhead on
> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> improved by 11%.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> ---

This change is straightforward (simple enough) to mitigate the costly
atomic operation, and it looks good from my understanding, so:

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu
Hello Tim,
On 4/16/2025 9:28 AM, Tim Chen wrote:
> At load balance time, balance of last level cache domains and
> above needs to be serialized. The scheduler checks the atomic var
> sched_balance_running first and then see if time is due for a load
> balance. This is an expensive operation as multiple CPUs can attempt
> sched_balance_running acquisition at the same time.
>
> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> sched_balance_running. Most of the time, a balance attempt is aborted
> immediately after acquiring sched_balance_running as load balance time
> is not due.
>
> Instead, check balance due time first before acquiring
> sched_balance_running. This skips many useless acquisitions
> of sched_balance_running and knocks the 7.6% CPU overhead on
> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> improved by 11%.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
I tested this series on a 3rd Generation EPYC system and I could see
minor improvements across the board for most workloads. Feel free to
include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e580..5e5f7a770b2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> interval = get_sd_balance_interval(sd, busy);
>
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + goto out;
> + }
> +
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> * The LBF_DST_PINNED logic could have changed
> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> + if (need_serialize)
> + atomic_set_release(&sched_balance_running, 0);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> out:
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
On 4/16/25 09:28, Tim Chen wrote:
> At load balance time, balance of last level cache domains and
> above needs to be serialized. The scheduler checks the atomic var
> sched_balance_running first and then see if time is due for a load
> balance. This is an expensive operation as multiple CPUs can attempt
> sched_balance_running acquisition at the same time.
>
> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> sched_balance_running. Most of the time, a balance attempt is aborted
> immediately after acquiring sched_balance_running as load balance time
> is not due.
>
> Instead, check balance due time first before acquiring
> sched_balance_running. This skips many useless acquisitions
> of sched_balance_running and knocks the 7.6% CPU overhead on
> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> improved by 11%.
>
Hi Tim.
The time check makes sense, especially on large systems, mainly due to NEWIDLE balance.

One more point to add: a lot of the time, the CPU which acquired
sched_balance_running does not end up doing the load balance, since it
is not the CPU meant to do the load balance.

See this thread:
https://lore.kernel.org/all/1e43e783-55e7-417f-a1a7-503229eb163a@linux.ibm.com/

The best thing probably is to acquire it only if this CPU has passed the
time check and is actually going to do the load balance.
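As a rough illustration (not code from any posted patch), that ordering
amounts to something like the following, where this_cpu_would_balance is
a hypothetical placeholder for the outcome of a should_we_balance()-style
check:

	/*
	 * Illustrative sketch only: contend for sched_balance_running only
	 * after both the time check and a should_we_balance()-style check
	 * say this CPU is the one that will do the balance.
	 */
	if (time_after_eq(jiffies, sd->last_balance + interval) &&
	    this_cpu_would_balance) {
		if (!(sd->flags & SD_SERIALIZE) ||
		    !atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
			/* ... do the balance ... */
			if (sd->flags & SD_SERIALIZE)
				atomic_set_release(&sched_balance_running, 0);
		}
	}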
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e580..5e5f7a770b2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> interval = get_sd_balance_interval(sd, busy);
>
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + goto out;
> + }
> +
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> * The LBF_DST_PINNED logic could have changed
> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> + if (need_serialize)
> + atomic_set_release(&sched_balance_running, 0);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> out:
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
Hi Shrikanth,
On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
>
>
> On 4/16/25 09:28, Tim Chen wrote:
>> At load balance time, balance of last level cache domains and
>> above needs to be serialized. The scheduler checks the atomic var
>> sched_balance_running first and then see if time is due for a load
>> balance. This is an expensive operation as multiple CPUs can attempt
>> sched_balance_running acquisition at the same time.
>>
>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
>> sched_balance_running. Most of the time, a balance attempt is aborted
>> immediately after acquiring sched_balance_running as load balance time
>> is not due.
>>
>> Instead, check balance due time first before acquiring
>> sched_balance_running. This skips many useless acquisitions
>> of sched_balance_running and knocks the 7.6% CPU overhead on
>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
>> improved by 11%.
>>
>
> Hi Tim.
>
> Time check makes sense specially on large systems mainly due to NEWIDLE
> balance.
>
Could you elaborate a little on this statement? There is no timeout
mechanism like periodic load balancer for the NEWLY_IDLE, right?
> One more point to add, A lot of time, the CPU which acquired
> sched_balance_running,
> need not end up doing the load balance, since it not the CPU meant to do
> the load balance.
>
> This thread.
> https://lore.kernel.org/all/1e43e783-55e7-417f-
> a1a7-503229eb163a@linux.ibm.com/
>
>
> Best thing probably is to acquire it if this CPU has passed the time
> check and as well it is
> actually going to do load balance.
>
>
This is a good point, and we might only want to deal with periodic load
balancer rather than NEWLY_IDLE balance. Because the latter is too
frequent and contention on the sched_balance_running might introduce
high cache contention.
thanks,
Chenyu
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
>> ---
>> kernel/sched/fair.c | 16 ++++++++--------
>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e43993a4e580..5e5f7a770b2f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq
>> *rq, enum cpu_idle_type idle)
>> interval = get_sd_balance_interval(sd, busy);
>> - need_serialize = sd->flags & SD_SERIALIZE;
>> - if (need_serialize) {
>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>> - goto out;
>> - }
>> -
>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>> + need_serialize = sd->flags & SD_SERIALIZE;
>> + if (need_serialize) {
>> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0,
>> 1))
>> + goto out;
>> + }
>> +
>> if (sched_balance_rq(cpu, rq, sd, idle,
>> &continue_balancing)) {
>> /*
>> * The LBF_DST_PINNED logic could have changed
>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
>> *rq, enum cpu_idle_type idle)
>> }
>> sd->last_balance = jiffies;
>> interval = get_sd_balance_interval(sd, busy);
>> + if (need_serialize)
>> + atomic_set_release(&sched_balance_running, 0);
>> }
>> - if (need_serialize)
>> - atomic_set_release(&sched_balance_running, 0);
>> out:
>> if (time_after(next_balance, sd->last_balance + interval)) {
>> next_balance = sd->last_balance + interval;
>
On 4/16/25 11:58, Chen, Yu C wrote:
> Hi Shrikanth,
>
> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
>>
>>
>> On 4/16/25 09:28, Tim Chen wrote:
>>> At load balance time, balance of last level cache domains and
>>> above needs to be serialized. The scheduler checks the atomic var
>>> sched_balance_running first and then see if time is due for a load
>>> balance. This is an expensive operation as multiple CPUs can attempt
>>> sched_balance_running acquisition at the same time.
>>>
>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
>>> sched_balance_running. Most of the time, a balance attempt is aborted
>>> immediately after acquiring sched_balance_running as load balance time
>>> is not due.
>>>
>>> Instead, check balance due time first before acquiring
>>> sched_balance_running. This skips many useless acquisitions
>>> of sched_balance_running and knocks the 7.6% CPU overhead on
>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
>>> improved by 11%.
>>>
>>
>> Hi Tim.
>>
>> Time check makes sense specially on large systems mainly due to
>> NEWIDLE balance.
scratch the NEWLY_IDLE part from that comment.
>>
>
> Could you elaborate a little on this statement? There is no timeout
> mechanism like periodic load balancer for the NEWLY_IDLE, right?
Yes. NEWLY_IDLE is very opportunistic.
>
>> One more point to add, A lot of time, the CPU which acquired
>> sched_balance_running,
>> need not end up doing the load balance, since it not the CPU meant to
>> do the load balance.
>>
>> This thread.
>> https://lore.kernel.org/all/1e43e783-55e7-417f-
>> a1a7-503229eb163a@linux.ibm.com/
>>
>>
>> Best thing probably is to acquire it if this CPU has passed the time
>> check and as well it is
>> actually going to do load balance.
>>
>>
>
> This is a good point, and we might only want to deal with periodic load
> balancer rather than NEWLY_IDLE balance. Because the latter is too
> frequent and contention on the sched_balance_running might introduce
> high cache contention.
>
But NEWLY_IDLE doesn't serialize using sched_balance_running and can
end up consuming a lot of cycles. If we did serialize it using
sched_balance_running, it would definitely cause a lot of contention as is.

The point was: before acquiring it, it would be better if this CPU is
definitely going to do the load balance. Otherwise there are chances of
missing the actual load balance.
> thanks,
> Chenyu
>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>> ---
>>> kernel/sched/fair.c | 16 ++++++++--------
>>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e43993a4e580..5e5f7a770b2f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq
>>> *rq, enum cpu_idle_type idle)
>>> interval = get_sd_balance_interval(sd, busy);
>>> - need_serialize = sd->flags & SD_SERIALIZE;
>>> - if (need_serialize) {
>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>> - goto out;
>>> - }
>>> -
>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>>> + need_serialize = sd->flags & SD_SERIALIZE;
>>> + if (need_serialize) {
>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
>>> 0, 1))
>>> + goto out;
>>> + }
>>> +
>>> if (sched_balance_rq(cpu, rq, sd, idle,
>>> &continue_balancing)) {
>>> /*
>>> * The LBF_DST_PINNED logic could have changed
>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
>>> *rq, enum cpu_idle_type idle)
>>> }
>>> sd->last_balance = jiffies;
>>> interval = get_sd_balance_interval(sd, busy);
>>> + if (need_serialize)
>>> + atomic_set_release(&sched_balance_running, 0);
>>> }
>>> - if (need_serialize)
>>> - atomic_set_release(&sched_balance_running, 0);
>>> out:
>>> if (time_after(next_balance, sd->last_balance + interval)) {
>>> next_balance = sd->last_balance + interval;
>>
On Wed, 2025-04-16 at 14:46 +0530, Shrikanth Hegde wrote:
>
> On 4/16/25 11:58, Chen, Yu C wrote:
> > Hi Shrikanth,
> >
> > On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
> > >
> > >
> > > On 4/16/25 09:28, Tim Chen wrote:
> > > > At load balance time, balance of last level cache domains and
> > > > above needs to be serialized. The scheduler checks the atomic var
> > > > sched_balance_running first and then see if time is due for a load
> > > > balance. This is an expensive operation as multiple CPUs can attempt
> > > > sched_balance_running acquisition at the same time.
> > > >
> > > > On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> > > > running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> > > > sched_balance_running. Most of the time, a balance attempt is aborted
> > > > immediately after acquiring sched_balance_running as load balance time
> > > > is not due.
> > > >
> > > > Instead, check balance due time first before acquiring
> > > > sched_balance_running. This skips many useless acquisitions
> > > > of sched_balance_running and knocks the 7.6% CPU overhead on
> > > > sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> > > > improved by 11%.
> > > >
> > >
> > > Hi Tim.
> > >
> > > Time check makes sense specially on large systems mainly due to
> > > NEWIDLE balance.
>
> scratch the NEWLY_IDLE part from that comment.
>
> > >
> >
> > Could you elaborate a little on this statement? There is no timeout
> > mechanism like periodic load balancer for the NEWLY_IDLE, right?
>
> Yes. NEWLY_IDLE is very opportunistic.
>
> >
> > > One more point to add, A lot of time, the CPU which acquired
> > > sched_balance_running,
> > > need not end up doing the load balance, since it not the CPU meant to
> > > do the load balance.
> > >
> > > This thread.
> > > https://lore.kernel.org/all/1e43e783-55e7-417f-
> > > a1a7-503229eb163a@linux.ibm.com/
> > >
> > >
> > > Best thing probably is to acquire it if this CPU has passed the time
> > > check and as well it is
> > > actually going to do load balance.
> > >
> > >
> >
> > This is a good point, and we might only want to deal with periodic load
> > balancer rather than NEWLY_IDLE balance. Because the latter is too
> > frequent and contention on the sched_balance_running might introduce
> > high cache contention.
> >
>
> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
> endup consuming a lot of cycles. But if we serialize using
> sched_balance_running, it would definitely cause a lot contention as is.
>
>
> The point was, before acquiring it, it would be better if this CPU is
> definite to do the load balance. Else there are chances to miss the
> actual load balance.
>
You mean doing a should_we_balance() check? I think we should not even
consider that if the balance time is not due, so the balance-due check
should come first.

Do you have any objection to switching the order of the time-due check
and the serialization on sched_balance_running, as in this patch?
Adding a check to see whether this is the right balancing CPU could be
an orthogonal change.
97% of CPU cycles in sched_balance_domains() are not spent doing useful load balancing work,
but simply in the acquisition of sched_balance_running in the OLTP workload we tested.
:
: 104 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
: 105 {
: 106 return arch_cmpxchg(&v->counter, old, new);
0.00 : ffffffff81138f8e: xor %eax,%eax
0.00 : ffffffff81138f90: mov $0x1,%ecx
0.00 : ffffffff81138f95: lock cmpxchg %ecx,0x2577d33(%rip) # ffffffff836b0cd0 <sched_balance_running>
: 110 sched_balance_domains():
: 12146 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
97.01 : ffffffff81138f9d: test %eax,%eax
0.00 : ffffffff81138f9f: jne ffffffff81138fbb <sched_balance_domains+0x20b>
: 12150 if (time_after_eq(jiffies, sd->last_balance + interval)) {
0.00 : ffffffff81138fa1: mov 0x16cfa18(%rip),%rax # ffffffff828089c0 <jiffies_64>
0.00 : ffffffff81138fa8: sub 0x48(%r14),%rax
0.00 : ffffffff81138fac: cmp %rdx,%rax
0.00 : ffffffff81138faf: jns ffffffff8113900f <sched_balance_domains+0x25f>
: 12155 raw_atomic_set_release():
So trying to skip this unnecessary acquisition and consider load balancing only when time is due.
Tim
>
> > thanks,
> > Chenyu
> >
> > > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> > > > Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> > > > ---
> > > > kernel/sched/fair.c | 16 ++++++++--------
> > > > 1 file changed, 8 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index e43993a4e580..5e5f7a770b2f 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq
> > > > *rq, enum cpu_idle_type idle)
> > > > interval = get_sd_balance_interval(sd, busy);
> > > > - need_serialize = sd->flags & SD_SERIALIZE;
> > > > - if (need_serialize) {
> > > > - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > > > - goto out;
> > > > - }
> > > > -
> > > > if (time_after_eq(jiffies, sd->last_balance + interval)) {
> > > > + need_serialize = sd->flags & SD_SERIALIZE;
> > > > + if (need_serialize) {
> > > > + if (atomic_cmpxchg_acquire(&sched_balance_running,
> > > > 0, 1))
> > > > + goto out;
> > > > + }
> > > > +
> > > > if (sched_balance_rq(cpu, rq, sd, idle,
> > > > &continue_balancing)) {
> > > > /*
> > > > * The LBF_DST_PINNED logic could have changed
> > > > @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
> > > > *rq, enum cpu_idle_type idle)
> > > > }
> > > > sd->last_balance = jiffies;
> > > > interval = get_sd_balance_interval(sd, busy);
> > > > + if (need_serialize)
> > > > + atomic_set_release(&sched_balance_running, 0);
> > > > }
> > > > - if (need_serialize)
> > > > - atomic_set_release(&sched_balance_running, 0);
> > > > out:
> > > > if (time_after(next_balance, sd->last_balance + interval)) {
> > > > next_balance = sd->last_balance + interval;
> > >
>
>
On 4/16/25 21:49, Tim Chen wrote:
> On Wed, 2025-04-16 at 14:46 +0530, Shrikanth Hegde wrote:
>>
>>
> You mean doing a should_we_balance() check? I think we should not
> even consider that if balance time is not due and this balance due check should
> come first.
>
Hi Tim.
This is the code I was suggesting.
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c19459c8042..e712934973e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11723,6 +11723,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
break;
}
}
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
@@ -11738,6 +11753,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
struct rq *busiest;
struct rq_flags rf;
struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+ int need_serialize = 0;
struct lb_env env = {
.sd = sd,
.dst_cpu = this_cpu,
@@ -11760,6 +11776,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
goto out_balanced;
}
+ need_serialize = (idle!=CPU_NEWLY_IDLE) && sd->flags & SD_SERIALIZE;
+ if (need_serialize) {
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ return ld_moved;
+ }
+
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11884,6 +11906,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
+ if (need_serialize)
+ atomic_set_release(&sched_balance_running, 0);
goto redo;
}
goto out_all_pinned;
@@ -12000,6 +12024,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if (need_serialize)
+ atomic_set_release(&sched_balance_running, 0);
+
return ld_moved;
}
@@ -12124,21 +12151,6 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}
-/*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- * is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- * execution, as non-SD_SERIALIZE domains will still be
- * load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
/*
* Scale the max sched_balance_rq interval with the number of CPUs in the system.
* This trades load-balance latency on larger machines for less cross talk.
@@ -12188,7 +12200,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -12213,12 +12225,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
interval = get_sd_balance_interval(sd, busy);
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -12232,9 +12238,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
-out:
+
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;
On Thu, 2025-04-17 at 14:49 +0530, Shrikanth Hegde wrote:
>
> On 4/16/25 21:49, Tim Chen wrote:
> > On Wed, 2025-04-16 at 14:46 +0530, Shrikanth Hegde wrote:
> > >
>
> > >
> > You mean doing a should_we_balance() check? I think we should not
> > even consider that if balance time is not due and this balance due check should
> > come first.
> >
> Hi Tim.
>
> This is the code I was suggesting.
Thanks for suggesting this alternative patch.
The way I read it, it moves the serialization check into sched_balance_rq()
so we do the serialization check after the balance due check. Functionally
it is equivalent to my proposed patch.
However, I think your original goal is to give the CPU that could
actually balance the sched domain a chance to run in the current balance
period, and not have it skipped over if the current CPU cannot balance.

With your patch, if should_we_balance() fails, we still advance
sd->last_balance, so the CPU that could balance has to wait for its
chance in the next period. I think you'll need to pass the outcome of
should_we_balance() from sched_balance_rq() to sched_balance_domains(),
and not advance sd->last_balance when should_we_balance() fails.
This would give the CPU that could balance a chance to run in the
current balance period.
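A rough sketch of what that would require (hypothetical; the extra swb
out-parameter is made up for illustration and is not part of any posted
patch):

	int swb = 0;	/* hypothetical: did should_we_balance() pick this CPU? */

	if (time_after_eq(jiffies, sd->last_balance + interval)) {
		if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing, &swb)) {
			/* ... update idle/busy as today ... */
		}
		if (swb) {
			/* Only the CPU that actually balanced closes this interval. */
			sd->last_balance = jiffies;
			interval = get_sd_balance_interval(sd, busy);
		}
	}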
That said, I'm not actually proposing that we do the above. For many large systems,
the overhead in balancing top domains is already high and increasing balance frequency is
not necessarily desirable.
Thanks.
Tim
>
> ---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c19459c8042..e712934973e7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11723,6 +11723,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> break;
> }
> }
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (above the NODE topology level) - only one load-balancing instance
> + * may run at a time, to reduce overhead on very large systems with
> + * lots of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize rebalance_domains()
> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
> +
>
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> @@ -11738,6 +11753,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> struct rq *busiest;
> struct rq_flags rf;
> struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> + int need_serialize = 0;
> struct lb_env env = {
> .sd = sd,
> .dst_cpu = this_cpu,
> @@ -11760,6 +11776,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + need_serialize = (idle!=CPU_NEWLY_IDLE) && sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + return ld_moved;
> + }
> +
> group = sched_balance_find_src_group(&env);
> if (!group) {
> schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11884,6 +11906,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> if (!cpumask_subset(cpus, env.dst_grpmask)) {
> env.loop = 0;
> env.loop_break = SCHED_NR_MIGRATE_BREAK;
> + if (need_serialize)
> + atomic_set_release(&sched_balance_running, 0);
> goto redo;
> }
> goto out_all_pinned;
> @@ -12000,6 +12024,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> sd->balance_interval < sd->max_interval)
> sd->balance_interval *= 2;
> out:
> + if (need_serialize)
> + atomic_set_release(&sched_balance_running, 0);
> +
> return ld_moved;
> }
>
> @@ -12124,21 +12151,6 @@ static int active_load_balance_cpu_stop(void *data)
> return 0;
> }
>
> -/*
> - * This flag serializes load-balancing passes over large domains
> - * (above the NODE topology level) - only one load-balancing instance
> - * may run at a time, to reduce overhead on very large systems with
> - * lots of CPUs and large NUMA distances.
> - *
> - * - Note that load-balancing passes triggered while another one
> - * is executing are skipped and not re-tried.
> - *
> - * - Also note that this does not serialize rebalance_domains()
> - * execution, as non-SD_SERIALIZE domains will still be
> - * load-balanced in parallel.
> - */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> -
> /*
> * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> * This trades load-balance latency on larger machines for less cross talk.
> @@ -12188,7 +12200,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;
> int update_next_balance = 0;
> - int need_serialize, need_decay = 0;
> + int need_decay = 0;
> u64 max_cost = 0;
>
> rcu_read_lock();
> @@ -12213,12 +12225,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> interval = get_sd_balance_interval(sd, busy);
>
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> @@ -12232,9 +12238,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> -out:
> +
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
> update_next_balance = 1;
>
On 4/16/25 21:49, Tim Chen wrote:
> On Wed, 2025-04-16 at 14:46 +0530, Shrikanth Hegde wrote:
>>
>> On 4/16/25 11:58, Chen, Yu C wrote:
>>> Hi Shrikanth,
>>>
>>> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
>>>>
>>>>
>>>> On 4/16/25 09:28, Tim Chen wrote:
>>>>> At load balance time, balance of last level cache domains and
>>>>> above needs to be serialized. The scheduler checks the atomic var
>>>>> sched_balance_running first and then see if time is due for a load
>>>>> balance. This is an expensive operation as multiple CPUs can attempt
>>>>> sched_balance_running acquisition at the same time.
>>>>>
>>>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
>>>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
>>>>> sched_balance_running. Most of the time, a balance attempt is aborted
>>>>> immediately after acquiring sched_balance_running as load balance time
>>>>> is not due.
>>>>>
>>>>> Instead, check balance due time first before acquiring
>>>>> sched_balance_running. This skips many useless acquisitions
>>>>> of sched_balance_running and knocks the 7.6% CPU overhead on
>>>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
>>>>> improved by 11%.
>>>>>
>>>>
>>>> Hi Tim.
>>>>
>>>> Time check makes sense specially on large systems mainly due to
>>>> NEWIDLE balance.
>>
>> scratch the NEWLY_IDLE part from that comment.
>>
>>>>
>>>
>>> Could you elaborate a little on this statement? There is no timeout
>>> mechanism like periodic load balancer for the NEWLY_IDLE, right?
>>
>> Yes. NEWLY_IDLE is very opportunistic.
>>
>>>
>>>> One more point to add, A lot of time, the CPU which acquired
>>>> sched_balance_running,
>>>> need not end up doing the load balance, since it not the CPU meant to
>>>> do the load balance.
>>>>
>>>> This thread.
>>>> https://lore.kernel.org/all/1e43e783-55e7-417f-
>>>> a1a7-503229eb163a@linux.ibm.com/
>>>>
>>>>
>>>> Best thing probably is to acquire it if this CPU has passed the time
>>>> check and as well it is
>>>> actually going to do load balance.
>>>>
>>>>
>>>
>>> This is a good point, and we might only want to deal with periodic load
>>> balancer rather than NEWLY_IDLE balance. Because the latter is too
>>> frequent and contention on the sched_balance_running might introduce
>>> high cache contention.
>>>
>>
>> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
>> endup consuming a lot of cycles. But if we serialize using
>> sched_balance_running, it would definitely cause a lot contention as is.
>>
>>
>> The point was, before acquiring it, it would be better if this CPU is
>> definite to do the load balance. Else there are chances to miss the
>> actual load balance.
>>
> You mean doing a should_we_balance() check? I think we should not
> even consider that if balance time is not due and this balance due check should
> come first.
Time check first makes sense.
>
> Do you have objection to switching the order of the time due check and serialization/sched_balance_running
> around as in this patch? Adding a change to see if this is the right balancing CPU could be
> an orthogonal change.
This check could come after the time check. Even after the time check,
a CPU may acquire it only to release it later, while a legitimate CPU
couldn't acquire it and bailed out.
>
> 97% of CPU cycles in sched_balance_domains() are not spent doing useful load balancing work,
> but simply in the acquisition of sched_balance_running in the OLTP workload we tested.
>
> :
> : 104 static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
> : 105 {
> : 106 return arch_cmpxchg(&v->counter, old, new);
> 0.00 : ffffffff81138f8e: xor %eax,%eax
> 0.00 : ffffffff81138f90: mov $0x1,%ecx
> 0.00 : ffffffff81138f95: lock cmpxchg %ecx,0x2577d33(%rip) # ffffffff836b0cd0 <sched_balance_running>
> : 110 sched_balance_domains():
> : 12146 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> 97.01 : ffffffff81138f9d: test %eax,%eax
> 0.00 : ffffffff81138f9f: jne ffffffff81138fbb <sched_balance_domains+0x20b>
> : 12150 if (time_after_eq(jiffies, sd->last_balance + interval)) {
> 0.00 : ffffffff81138fa1: mov 0x16cfa18(%rip),%rax # ffffffff828089c0 <jiffies_64>
> 0.00 : ffffffff81138fa8: sub 0x48(%r14),%rax
> 0.00 : ffffffff81138fac: cmp %rdx,%rax
> 0.00 : ffffffff81138faf: jns ffffffff8113900f <sched_balance_domains+0x25f>
> : 12155 raw_atomic_set_release():
>
> So trying to skip this unnecessary acquisition and consider load balancing only when time is due.
>
> Tim
>
>>
>>> thanks,
>>> Chenyu
>>>
>>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>>> ---
>>>>> kernel/sched/fair.c | 16 ++++++++--------
>>>>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index e43993a4e580..5e5f7a770b2f 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct rq
>>>>> *rq, enum cpu_idle_type idle)
>>>>> interval = get_sd_balance_interval(sd, busy);
>>>>> - need_serialize = sd->flags & SD_SERIALIZE;
>>>>> - if (need_serialize) {
>>>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>>>> - goto out;
>>>>> - }
>>>>> -
>>>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>>>>> + need_serialize = sd->flags & SD_SERIALIZE;
>>>>> + if (need_serialize) {
>>>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
>>>>> 0, 1))
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> if (sched_balance_rq(cpu, rq, sd, idle,
>>>>> &continue_balancing)) {
>>>>> /*
>>>>> * The LBF_DST_PINNED logic could have changed
>>>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
>>>>> *rq, enum cpu_idle_type idle)
>>>>> }
>>>>> sd->last_balance = jiffies;
>>>>> interval = get_sd_balance_interval(sd, busy);
>>>>> + if (need_serialize)
>>>>> + atomic_set_release(&sched_balance_running, 0);
>>>>> }
>>>>> - if (need_serialize)
>>>>> - atomic_set_release(&sched_balance_running, 0);
>>>>> out:
>>>>> if (time_after(next_balance, sd->last_balance + interval)) {
>>>>> next_balance = sd->last_balance + interval;
>>>>
>>
>>
>
On 4/16/25 14:46, Shrikanth Hegde wrote:
>
>
> On 4/16/25 11:58, Chen, Yu C wrote:
>> Hi Shrikanth,
>>
>> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
>>>
>>>
>>> On 4/16/25 09:28, Tim Chen wrote:
>>>> At load balance time, balance of last level cache domains and
>>>> above needs to be serialized. The scheduler checks the atomic var
>>>> sched_balance_running first and then see if time is due for a load
>>>> balance. This is an expensive operation as multiple CPUs can attempt
>>>> sched_balance_running acquisition at the same time.
>>>>
>>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
>>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
>>>> sched_balance_running. Most of the time, a balance attempt is aborted
>>>> immediately after acquiring sched_balance_running as load balance time
>>>> is not due.
>>>>
>>>> Instead, check balance due time first before acquiring
>>>> sched_balance_running. This skips many useless acquisitions
>>>> of sched_balance_running and knocks the 7.6% CPU overhead on
>>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
>>>> improved by 11%.
>>>>
>>>
>>> Hi Tim.
>>>
>>> Time check makes sense specially on large systems mainly due to
>>> NEWIDLE balance.
>
> scratch the NEWLY_IDLE part from that comment.
>
>>>
>>
>> Could you elaborate a little on this statement? There is no timeout
>> mechanism like periodic load balancer for the NEWLY_IDLE, right?
>
> Yes. NEWLY_IDLE is very opportunistic.
>
>>
>>> One more point to add, A lot of time, the CPU which acquired
>>> sched_balance_running,
>>> need not end up doing the load balance, since it not the CPU meant to
>>> do the load balance.
>>>
>>> This thread.
>>> https://lore.kernel.org/all/1e43e783-55e7-417f-
>>> a1a7-503229eb163a@linux.ibm.com/
>>>
>>>
>>> Best thing probably is to acquire it if this CPU has passed the time
>>> check and as well it is
>>> actually going to do load balance.
>>>
>>>
>>
>> This is a good point, and we might only want to deal with periodic load
>> balancer rather than NEWLY_IDLE balance. Because the latter is too
>> frequent and contention on the sched_balance_running might introduce
>> high cache contention.
>>
>
> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
> endup consuming a lot of cycles. But if we serialize using
> sched_balance_running, it would definitely cause a lot contention as is.
>
>
> The point was, before acquiring it, it would be better if this CPU is
> definite to do the load balance. Else there are chances to miss the
> actual load balance.
>
>
Sorry, forgot to add.
Do we really need newidle balancing running all the way up to NUMA? Or is it enough if it runs up to PKG,
with the regular (idle) balance taking care of NUMA by serializing it?
- if (sd->flags & SD_BALANCE_NEWIDLE) {
+ if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
pulled_task = sched_balance_rq(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE,
Anyways, having a policy around this SD_SERIALIZE would be a good thing.
>> thanks,
>> Chenyu
>>
>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>> ---
>>>> kernel/sched/fair.c | 16 ++++++++--------
>>>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index e43993a4e580..5e5f7a770b2f 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct
>>>> rq *rq, enum cpu_idle_type idle)
>>>> interval = get_sd_balance_interval(sd, busy);
>>>> - need_serialize = sd->flags & SD_SERIALIZE;
>>>> - if (need_serialize) {
>>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>>> - goto out;
>>>> - }
>>>> -
>>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>>>> + need_serialize = sd->flags & SD_SERIALIZE;
>>>> + if (need_serialize) {
>>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
>>>> 0, 1))
>>>> + goto out;
>>>> + }
>>>> +
>>>> if (sched_balance_rq(cpu, rq, sd, idle,
>>>> &continue_balancing)) {
>>>> /*
>>>> * The LBF_DST_PINNED logic could have changed
>>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
>>>> *rq, enum cpu_idle_type idle)
>>>> }
>>>> sd->last_balance = jiffies;
>>>> interval = get_sd_balance_interval(sd, busy);
>>>> + if (need_serialize)
>>>> + atomic_set_release(&sched_balance_running, 0);
>>>> }
>>>> - if (need_serialize)
>>>> - atomic_set_release(&sched_balance_running, 0);
>>>> out:
>>>> if (time_after(next_balance, sd->last_balance + interval)) {
>>>> next_balance = sd->last_balance + interval;
>>>
>
On Wed, 16 Apr 2025 at 11:29, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
>
>
> On 4/16/25 14:46, Shrikanth Hegde wrote:
> >
> >
> > On 4/16/25 11:58, Chen, Yu C wrote:
> >> Hi Shrikanth,
> >>
> >> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
> >>>
> >>>
> >>> On 4/16/25 09:28, Tim Chen wrote:
> >>>> At load balance time, balance of last level cache domains and
> >>>> above needs to be serialized. The scheduler checks the atomic var
> >>>> sched_balance_running first and then see if time is due for a load
> >>>> balance. This is an expensive operation as multiple CPUs can attempt
> >>>> sched_balance_running acquisition at the same time.
> >>>>
> >>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> >>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> >>>> sched_balance_running. Most of the time, a balance attempt is aborted
> >>>> immediately after acquiring sched_balance_running as load balance time
> >>>> is not due.
> >>>>
> >>>> Instead, check balance due time first before acquiring
> >>>> sched_balance_running. This skips many useless acquisitions
> >>>> of sched_balance_running and knocks the 7.6% CPU overhead on
> >>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> >>>> improved by 11%.
> >>>>
> >>>
> >>> Hi Tim.
> >>>
> >>> Time check makes sense specially on large systems mainly due to
> >>> NEWIDLE balance.
> >
> > scratch the NEWLY_IDLE part from that comment.
> >
> >>>
> >>
> >> Could you elaborate a little on this statement? There is no timeout
> >> mechanism like periodic load balancer for the NEWLY_IDLE, right?
> >
> > Yes. NEWLY_IDLE is very opportunistic.
> >
> >>
> >>> One more point to add, A lot of time, the CPU which acquired
> >>> sched_balance_running,
> >>> need not end up doing the load balance, since it not the CPU meant to
> >>> do the load balance.
> >>>
> >>> This thread.
> >>> https://lore.kernel.org/all/1e43e783-55e7-417f-
> >>> a1a7-503229eb163a@linux.ibm.com/
> >>>
> >>>
> >>> Best thing probably is to acquire it if this CPU has passed the time
> >>> check and as well it is
> >>> actually going to do load balance.
> >>>
> >>>
> >>
> >> This is a good point, and we might only want to deal with periodic load
> >> balancer rather than NEWLY_IDLE balance. Because the latter is too
> >> frequent and contention on the sched_balance_running might introduce
> >> high cache contention.
> >>
> >
> > But NEWLY_IDLE doesn't serialize using sched_balance_running and can
> > endup consuming a lot of cycles. But if we serialize using
> > sched_balance_running, it would definitely cause a lot contention as is.
> >
> >
> > The point was, before acquiring it, it would be better if this CPU is
> > definite to do the load balance. Else there are chances to miss the
> > actual load balance.
> >
> >
>
> Sorry, forgot to add.
>
> Do we really need newidle running all the way till NUMA? or if it runs till PKG is it enough?
> the regular (idle) can take care for NUMA by serializing it?
>
> - if (sd->flags & SD_BALANCE_NEWIDLE) {
> + if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
Why not just clear SD_BALANCE_NEWIDLE in your sched domain when you
set SD_SERIALIZE?
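As a minimal sketch of that suggestion (hypothetical; placing it in the
NUMA-level domain setup, e.g. sd_init() in kernel/sched/topology.c, is an
assumption):

	/* Hypothetical: serialized (NUMA) domains opt out of newidle balancing. */
	if (sd->flags & SD_SERIALIZE)
		sd->flags &= ~SD_BALANCE_NEWIDLE;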
>
> pulled_task = sched_balance_rq(this_cpu, this_rq,
> sd, CPU_NEWLY_IDLE,
>
>
> Anyways, having a policy around this SD_SERIALIZE would be a good thing.
>
> >> thanks,
> >> Chenyu
> >>
> >>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> >>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> >>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> >>>> ---
> >>>> kernel/sched/fair.c | 16 ++++++++--------
> >>>> 1 file changed, 8 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index e43993a4e580..5e5f7a770b2f 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct
> >>>> rq *rq, enum cpu_idle_type idle)
> >>>> interval = get_sd_balance_interval(sd, busy);
> >>>> - need_serialize = sd->flags & SD_SERIALIZE;
> >>>> - if (need_serialize) {
> >>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >>>> - goto out;
> >>>> - }
> >>>> -
> >>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> >>>> + need_serialize = sd->flags & SD_SERIALIZE;
> >>>> + if (need_serialize) {
> >>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
> >>>> 0, 1))
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> if (sched_balance_rq(cpu, rq, sd, idle,
> >>>> &continue_balancing)) {
> >>>> /*
> >>>> * The LBF_DST_PINNED logic could have changed
> >>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
> >>>> *rq, enum cpu_idle_type idle)
> >>>> }
> >>>> sd->last_balance = jiffies;
> >>>> interval = get_sd_balance_interval(sd, busy);
> >>>> + if (need_serialize)
> >>>> + atomic_set_release(&sched_balance_running, 0);
> >>>> }
> >>>> - if (need_serialize)
> >>>> - atomic_set_release(&sched_balance_running, 0);
> >>>> out:
> >>>> if (time_after(next_balance, sd->last_balance + interval)) {
> >>>> next_balance = sd->last_balance + interval;
> >>>
> >
>
On 4/16/2025 3:17 PM, Vincent Guittot wrote:
>>
>> Sorry, forgot to add.
>>
>> Do we really need newidle running all the way till NUMA? or if it runs till PKG is it enough?
>> the regular (idle) can take care for NUMA by serializing it?
>>
>> - if (sd->flags & SD_BALANCE_NEWIDLE) {
>> + if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
>
> Why not just clearing SD_BALANCE_NEWIDLE in your sched domain when you
> set SD_SERIALIZE
I've some questions around "sched_balance_running":
o Since this is a single flag across the entire system, it also implies
CPUs cannot concurrently do load balancing across different NUMA
domains which seems reasonable since a load balance at lower NUMA
domain can potentially change the "nr_numa_running" and
"nr_preferred_running" stats for the higher domain but if this is the
case, a newidle balance at lower NUMA domain can interfere with a
concurrent busy / newidle load balancing at higher NUMA domain.
Is this expected? Should newidle balance be serialized too?
(P.S. I copied over the serialize logic from sched_balance_domains()
into sched_balance_newidle() and did not see any difference in my
testing, but perhaps there are benchmarks out there that care for
this; a rough sketch of what was tried is shown after this list)
o If the intention of SD_SERIALIZE was to actually "serializes
load-balancing passes over large domains (above the NODE topology
level)" as the comment above "sched_balance_running" states, and
this question is specific to x86 - when enabling SNC on Intel or
NPS on AMD servers, the first NUMA domain is in fact as big as the
NODE (now PKG domain) if not smaller. Is it okay to clear
SD_SERIALIZE for these domains since they are small enough now?
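A rough sketch of the experiment mentioned in the P.S. above
(hypothetical; the exact placement inside sched_balance_newidle()'s
domain walk is assumed):

	/* Hypothetical sketch of serializing newidle balance on SD_SERIALIZE domains. */
	if (sd->flags & SD_SERIALIZE) {
		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
			continue;	/* another CPU is balancing a serialized domain */
	}

	pulled_task = sched_balance_rq(this_cpu, this_rq, sd, CPU_NEWLY_IDLE,
				       &continue_balancing);

	if (sd->flags & SD_SERIALIZE)
		atomic_set_release(&sched_balance_running, 0);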
--
Thanks and Regards,
Prateek
On Thu, Apr 17, 2025 at 05:01:37PM +0530, K Prateek Nayak wrote:
> On 4/16/2025 3:17 PM, Vincent Guittot wrote:
> > >
> > > Sorry, forgot to add.
> > >
> > > Do we really need newidle running all the way till NUMA? or if it runs till PKG is it enough?
> > > the regular (idle) can take care for NUMA by serializing it?
> > >
> > > - if (sd->flags & SD_BALANCE_NEWIDLE) {
> > > + if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
> >
> > Why not just clearing SD_BALANCE_NEWIDLE in your sched domain when you
> > set SD_SERIALIZE
>
> I've some questions around "sched_balance_running":
>
> o Since this is a single flag across the entire system, it also implies
> CPUs cannot concurrently do load balancing across different NUMA
> domains which seems reasonable since a load balance at lower NUMA
> domain can potentially change the "nr_numa_running" and
> "nr_preferred_running" stats for the higher domain but if this is the
> case, a newidle balance at lower NUMA domain can interfere with a
> concurrent busy / newidle load balancing at higher NUMA domain.
> Is this expected? Should newidle balance be serialized too?
Serializing new-idle might create too much idle time.
> (P.S. I copied over the serialize logic from sched_balance_domains()
> into sched_balance_newidle() and did not see any difference in my
> testing but perhaps there are benchmarks out there that care for
> this)
>
> o If the intention of SD_SERIALIZE was to actually "serializes
> load-balancing passes over large domains (above the NODE topology
> level)" as the comment above "sched_balance_running" states, and
> this question is specific to x86 - when enabling SNC on Intel or
> NPS on AMD servers, the first NUMA domain is in fact as big as the
> NODE (now PKG domain) if not smaller. Is it okay to clear
> SD_SERIALIZE for these domains since they are small enough now?
You'll have to dive into the history here, but IIRC it was from SGI back
in the day, where NUMA factors were quite large and load-balancing
across numa was a pain.
Small really isn't the criteria, but inter-node latency might be, we
also have this node_reclaim_distance thing.
Not quite sure what makes sense, someone should tinker I suppose, see
what works with today's hardware.
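
One way such tinkering could look, going from memory of sd_init() in
kernel/sched/topology.c (which today sets SD_SERIALIZE for all SD_NUMA
levels), is to make SD_SERIALIZE depend on the same distance check that
already gates the fork/exec/wake-affine flags. A sketch only, not a
tested change:

	} else if (sd->flags & SD_NUMA) {
		sd->cache_nice_tries = 2;

		sd->flags &= ~SD_PREFER_SIBLING;
		if (sched_domains_numa_distance[tl->numa_level] > node_reclaim_distance) {
			/* only serialize balancing of sufficiently distant nodes */
			sd->flags |= SD_SERIALIZE;
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
		}
	}

Whether node_reclaim_distance is the right threshold for serialization is
exactly the kind of thing that would need measuring.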
Hello Peter,

On 4/17/2025 5:31 PM, Peter Zijlstra wrote:
>> o Since this is a single flag across the entire system, it also implies
>> CPUs cannot concurrently do load balancing across different NUMA
>> domains which seems reasonable since a load balance at lower NUMA
>> domain can potentially change the "nr_numa_running" and
>> "nr_preferred_running" stats for the higher domain but if this is the
>> case, a newidle balance at lower NUMA domain can interfere with a
>> concurrent busy / newidle load balancing at higher NUMA domain.
>> Is this expected? Should newidle balance be serialized too?
>
> Serializing new-idle might create too much idle time.

In the context of busy and idle balancing, what are your thoughts on a
per-sd "serialize" flag?

>
>> (P.S. I copied over the serialize logic from sched_balance_domains()
>> into sched_balance_newidle() and did not see any difference in my
>> testing but perhaps there are benchmarks out there that care for
>> this)
>>
>> o If the intention of SD_SERIALIZE was to actually "serializes
>> load-balancing passes over large domains (above the NODE topology
>> level)" as the comment above "sched_balance_running" states, and
>> this question is specific to x86 - when enabling SNC on Intel or
>> NPS on AMD servers, the first NUMA domain is in fact as big as the
>> NODE (now PKG domain) if not smaller. Is it okay to clear
>> SD_SERIALIZE for these domains since they are small enough now?
>
> You'll have to dive into the history here, but IIRC it was from SGI back
> in the day, where NUMA factors were quite large and load-balancing
> across numa was a pain.

Let me dig up the git history and see if any interesting details hide there.

>
> Small really isn't the criteria, but inter-node latency might be, we
> also have this node_reclaim_distance thing.
>
> Not quite sure what makes sense, someone should tinker I suppose, see
> what works with today's hardware.

I'll try some experiments over the weekend to see if my machine turns up
happy with non-serialized lb for inter-PKG load balancing with NPS turned
on. I'll probably piggy back off of "node_reclaim_distance" heuristics.

--
Thanks and Regards,
Prateek
On Fri, Apr 18, 2025 at 10:56:04AM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 4/17/2025 5:31 PM, Peter Zijlstra wrote:
> > > o Since this is a single flag across the entire system, it also implies
> > > CPUs cannot concurrently do load balancing across different NUMA
> > > domains which seems reasonable since a load balance at lower NUMA
> > > domain can potentially change the "nr_numa_running" and
> > > "nr_preferred_running" stats for the higher domain but if this is the
> > > case, a newidle balance at lower NUMA domain can interfere with a
> > > concurrent busy / newidle load balancing at higher NUMA domain.
> > > Is this expected? Should newidle balance be serialized too?
> >
> > Serializing new-idle might create too much idle time.
>
> In the context of busy and idle balancing, what are your thoughts on a
> per-sd "serialize" flag?

My sekret hope is that this push stuff can rid us of all the idle balance
bits. But yeah, early days on that.

Other than that, I don't quite see why we should split that, busy
balancing is the one that runs more often and is the one that should be
serialized to avoid too much cross node traffic and all that, no?

The idle thing is less often, why not limit that?
Hello Peter,

On 4/18/2025 2:58 PM, Peter Zijlstra wrote:
> On Fri, Apr 18, 2025 at 10:56:04AM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 4/17/2025 5:31 PM, Peter Zijlstra wrote:
>>>> o Since this is a single flag across the entire system, it also implies
>>>> CPUs cannot concurrently do load balancing across different NUMA
>>>> domains which seems reasonable since a load balance at lower NUMA
>>>> domain can potentially change the "nr_numa_running" and
>>>> "nr_preferred_running" stats for the higher domain but if this is the
>>>> case, a newidle balance at lower NUMA domain can interfere with a
>>>> concurrent busy / newidle load balancing at higher NUMA domain.
>>>> Is this expected? Should newidle balance be serialized too?
>>>
>>> Serializing new-idle might create too much idle time.
>>
>> In the context of busy and idle balancing, what are your thoughts on a
>> per-sd "serialize" flag?
>
> My sekret hope is that this push stuff can rid us of all the idle balance
> bits. But yeah, early days on that.
>
> Other than that, I don't quite see why we should split that, busy
> balancing is the one that runs more often and is the one that should be
> serialized to avoid too much cross node traffic and all that, no?
>
> The idle thing is less often, why not limit that?

Makes sense. I'll add it to the set of my weekend experiment runs.

--
Thanks and Regards,
Prateek
On 4/16/25 15:17, Vincent Guittot wrote:
> On Wed, 16 Apr 2025 at 11:29, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>>
>>
>>
>> On 4/16/25 14:46, Shrikanth Hegde wrote:
>>>
>>>
>>> On 4/16/25 11:58, Chen, Yu C wrote:
>>>> Hi Shrikanth,
>>>>
>>>> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
>>>>>
>>>>>
>>>>> On 4/16/25 09:28, Tim Chen wrote:
>>>>>> At load balance time, balance of last level cache domains and
>>>>>> above needs to be serialized. The scheduler checks the atomic var
>>>>>> sched_balance_running first and then see if time is due for a load
>>>>>> balance. This is an expensive operation as multiple CPUs can attempt
>>>>>> sched_balance_running acquisition at the same time.
>>>>>>
>>>>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
>>>>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
>>>>>> sched_balance_running. Most of the time, a balance attempt is aborted
>>>>>> immediately after acquiring sched_balance_running as load balance time
>>>>>> is not due.
>>>>>>
>>>>>> Instead, check balance due time first before acquiring
>>>>>> sched_balance_running. This skips many useless acquisitions
>>>>>> of sched_balance_running and knocks the 7.6% CPU overhead on
>>>>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
>>>>>> improved by 11%.
>>>>>>
>>>>>
>>>>> Hi Tim.
>>>>>
>>>>> Time check makes sense specially on large systems mainly due to
>>>>> NEWIDLE balance.
>>>
>>> scratch the NEWLY_IDLE part from that comment.
>>>
>>>>>
>>>>
>>>> Could you elaborate a little on this statement? There is no timeout
>>>> mechanism like periodic load balancer for the NEWLY_IDLE, right?
>>>
>>> Yes. NEWLY_IDLE is very opportunistic.
>>>
>>>>
>>>>> One more point to add, A lot of time, the CPU which acquired
>>>>> sched_balance_running,
>>>>> need not end up doing the load balance, since it not the CPU meant to
>>>>> do the load balance.
>>>>>
>>>>> This thread.
>>>>> https://lore.kernel.org/all/1e43e783-55e7-417f-
>>>>> a1a7-503229eb163a@linux.ibm.com/
>>>>>
>>>>>
>>>>> Best thing probably is to acquire it if this CPU has passed the time
>>>>> check and as well it is
>>>>> actually going to do load balance.
>>>>>
>>>>>
>>>>
>>>> This is a good point, and we might only want to deal with periodic load
>>>> balancer rather than NEWLY_IDLE balance. Because the latter is too
>>>> frequent and contention on the sched_balance_running might introduce
>>>> high cache contention.
>>>>
>>>
>>> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
>>> endup consuming a lot of cycles. But if we serialize using
>>> sched_balance_running, it would definitely cause a lot contention as is.
>>>
>>>
>>> The point was, before acquiring it, it would be better if this CPU is
>>> definite to do the load balance. Else there are chances to miss the
>>> actual load balance.
>>>
>>>
>>
>> Sorry, forgot to add.
>>
>> Do we really need newidle running all the way till NUMA? or if it runs till PKG is it enough?
>> the regular (idle) can take care for NUMA by serializing it?
>>
>> - if (sd->flags & SD_BALANCE_NEWIDLE) {
>> + if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
>
> Why not just clearing SD_BALANCE_NEWIDLE in your sched domain when you
> set SD_SERIALIZE
Hi Vincent.
There is even a kernel parameter "relax_domain_level" which one can make use of.
concern was newidle does this without acquiring the sched_balance_running,
while busy,idle try to acquire this for NUMA.
Slightly different topic: It (kernel parameter) also resets SD_BALANCE_WAKE. But is it being used?
I couldn't find out how it is used.
>
>>
>> pulled_task = sched_balance_rq(this_cpu, this_rq,
>> sd, CPU_NEWLY_IDLE,
>>
>>
>> Anyways, having a policy around this SD_SERIALIZE would be a good thing.
>>
>>>> thanks,
>>>> Chenyu
>>>>
>>>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>>>> ---
>>>>>> kernel/sched/fair.c | 16 ++++++++--------
>>>>>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index e43993a4e580..5e5f7a770b2f 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct
>>>>>> rq *rq, enum cpu_idle_type idle)
>>>>>> interval = get_sd_balance_interval(sd, busy);
>>>>>> - need_serialize = sd->flags & SD_SERIALIZE;
>>>>>> - if (need_serialize) {
>>>>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>>>>> - goto out;
>>>>>> - }
>>>>>> -
>>>>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>>>>>> + need_serialize = sd->flags & SD_SERIALIZE;
>>>>>> + if (need_serialize) {
>>>>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
>>>>>> 0, 1))
>>>>>> + goto out;
>>>>>> + }
>>>>>> +
>>>>>> if (sched_balance_rq(cpu, rq, sd, idle,
>>>>>> &continue_balancing)) {
>>>>>> /*
>>>>>> * The LBF_DST_PINNED logic could have changed
>>>>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
>>>>>> *rq, enum cpu_idle_type idle)
>>>>>> }
>>>>>> sd->last_balance = jiffies;
>>>>>> interval = get_sd_balance_interval(sd, busy);
>>>>>> + if (need_serialize)
>>>>>> + atomic_set_release(&sched_balance_running, 0);
>>>>>> }
>>>>>> - if (need_serialize)
>>>>>> - atomic_set_release(&sched_balance_running, 0);
>>>>>> out:
>>>>>> if (time_after(next_balance, sd->last_balance + interval)) {
>>>>>> next_balance = sd->last_balance + interval;
>>>>>
>>>
>>
On Wed, 16 Apr 2025 at 16:14, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>
>
>
> On 4/16/25 15:17, Vincent Guittot wrote:
> > On Wed, 16 Apr 2025 at 11:29, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> >>
> >>
> >>
> >> On 4/16/25 14:46, Shrikanth Hegde wrote:
> >>>
> >>>
> >>> On 4/16/25 11:58, Chen, Yu C wrote:
> >>>> Hi Shrikanth,
> >>>>
> >>>> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
> >>>>>
> >>>>>
> >>>>> On 4/16/25 09:28, Tim Chen wrote:
> >>>>>> At load balance time, balance of last level cache domains and
> >>>>>> above needs to be serialized. The scheduler checks the atomic var
> >>>>>> sched_balance_running first and then see if time is due for a load
> >>>>>> balance. This is an expensive operation as multiple CPUs can attempt
> >>>>>> sched_balance_running acquisition at the same time.
> >>>>>>
> >>>>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
> >>>>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
> >>>>>> sched_balance_running. Most of the time, a balance attempt is aborted
> >>>>>> immediately after acquiring sched_balance_running as load balance time
> >>>>>> is not due.
> >>>>>>
> >>>>>> Instead, check balance due time first before acquiring
> >>>>>> sched_balance_running. This skips many useless acquisitions
> >>>>>> of sched_balance_running and knocks the 7.6% CPU overhead on
> >>>>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
> >>>>>> improved by 11%.
> >>>>>>
> >>>>>
> >>>>> Hi Tim.
> >>>>>
> >>>>> Time check makes sense specially on large systems mainly due to
> >>>>> NEWIDLE balance.
> >>>
> >>> scratch the NEWLY_IDLE part from that comment.
> >>>
> >>>>>
> >>>>
> >>>> Could you elaborate a little on this statement? There is no timeout
> >>>> mechanism like periodic load balancer for the NEWLY_IDLE, right?
> >>>
> >>> Yes. NEWLY_IDLE is very opportunistic.
> >>>
> >>>>
> >>>>> One more point to add, A lot of time, the CPU which acquired
> >>>>> sched_balance_running,
> >>>>> need not end up doing the load balance, since it not the CPU meant to
> >>>>> do the load balance.
> >>>>>
> >>>>> This thread.
> >>>>> https://lore.kernel.org/all/1e43e783-55e7-417f-
> >>>>> a1a7-503229eb163a@linux.ibm.com/
> >>>>>
> >>>>>
> >>>>> Best thing probably is to acquire it if this CPU has passed the time
> >>>>> check and as well it is
> >>>>> actually going to do load balance.
> >>>>>
> >>>>>
> >>>>
> >>>> This is a good point, and we might only want to deal with periodic load
> >>>> balancer rather than NEWLY_IDLE balance. Because the latter is too
> >>>> frequent and contention on the sched_balance_running might introduce
> >>>> high cache contention.
> >>>>
> >>>
> >>> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
> >>> endup consuming a lot of cycles. But if we serialize using
> >>> sched_balance_running, it would definitely cause a lot contention as is.
> >>>
> >>>
> >>> The point was, before acquiring it, it would be better if this CPU is
> >>> definite to do the load balance. Else there are chances to miss the
> >>> actual load balance.
> >>>
> >>>
> >>
> >> Sorry, forgot to add.
> >>
> >> Do we really need newidle running all the way till NUMA? or if it runs till PKG is it enough?
> >> the regular (idle) can take care for NUMA by serializing it?
> >>
> >> - if (sd->flags & SD_BALANCE_NEWIDLE) {
> >> + if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
> >
> > Why not just clearing SD_BALANCE_NEWIDLE in your sched domain when you
> > set SD_SERIALIZE
>
> Hi Vincent.
>
> There is even a kernel parameter "relax_domain_level" which one can make use of.
> concern was newidle does this without acquiring the sched_balance_running,
> while busy,idle try to acquire this for NUMA.
>
>
>
> Slightly different topic: It (kernel parameter) also resets SD_BALANCE_WAKE. But is it being used?
> I couldn't find out how it is used.
Hi Shrikanth,
The define below does the link
#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
..
wake_flags |= WF_TTWU;
..
cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
select_task_rq_fair()
int sd_flag = wake_flags & 0xF;
..
for_each_domain(cpu, tmp) {
..
if (tmp->flags & sd_flag)
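
Condensed into one place (paraphrasing the current code rather than quoting
it literally), the mapping works roughly like this:

	/* kernel/sched/sched.h ties the wake flags to the SD_* balance flags */
	#define WF_EXEC		0x02	/* Wakeup after exec; maps to SD_BALANCE_EXEC */
	#define WF_FORK		0x04	/* Wakeup after fork; maps to SD_BALANCE_FORK */
	#define WF_TTWU		0x08	/* Wakeup;            maps to SD_BALANCE_WAKE */

	static_assert(WF_EXEC == SD_BALANCE_EXEC);
	static_assert(WF_FORK == SD_BALANCE_FORK);
	static_assert(WF_TTWU == SD_BALANCE_WAKE);

	/* select_task_rq_fair(): only domains carrying the matching flag matter */
	int sd_flag = wake_flags & 0xF;

	for_each_domain(cpu, tmp) {
		if (tmp->flags & sd_flag)
			sd = tmp;	/* widest domain allowing this balance type */
		...
	}

So clearing SD_BALANCE_WAKE (e.g. via relax_domain_level) should simply stop
the slow-path domain walk from considering those levels on task wakeup.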
>
> >
> >>
> >> pulled_task = sched_balance_rq(this_cpu, this_rq,
> >> sd, CPU_NEWLY_IDLE,
> >>
> >>
> >> Anyways, having a policy around this SD_SERIALIZE would be a good thing.
> >>
> >>>> thanks,
> >>>> Chenyu
> >>>>
> >>>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> >>>>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
> >>>>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
> >>>>>> ---
> >>>>>> kernel/sched/fair.c | 16 ++++++++--------
> >>>>>> 1 file changed, 8 insertions(+), 8 deletions(-)
> >>>>>>
> >>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>>> index e43993a4e580..5e5f7a770b2f 100644
> >>>>>> --- a/kernel/sched/fair.c
> >>>>>> +++ b/kernel/sched/fair.c
> >>>>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct
> >>>>>> rq *rq, enum cpu_idle_type idle)
> >>>>>> interval = get_sd_balance_interval(sd, busy);
> >>>>>> - need_serialize = sd->flags & SD_SERIALIZE;
> >>>>>> - if (need_serialize) {
> >>>>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> >>>>>> - goto out;
> >>>>>> - }
> >>>>>> -
> >>>>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> >>>>>> + need_serialize = sd->flags & SD_SERIALIZE;
> >>>>>> + if (need_serialize) {
> >>>>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
> >>>>>> 0, 1))
> >>>>>> + goto out;
> >>>>>> + }
> >>>>>> +
> >>>>>> if (sched_balance_rq(cpu, rq, sd, idle,
> >>>>>> &continue_balancing)) {
> >>>>>> /*
> >>>>>> * The LBF_DST_PINNED logic could have changed
> >>>>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
> >>>>>> *rq, enum cpu_idle_type idle)
> >>>>>> }
> >>>>>> sd->last_balance = jiffies;
> >>>>>> interval = get_sd_balance_interval(sd, busy);
> >>>>>> + if (need_serialize)
> >>>>>> + atomic_set_release(&sched_balance_running, 0);
> >>>>>> }
> >>>>>> - if (need_serialize)
> >>>>>> - atomic_set_release(&sched_balance_running, 0);
> >>>>>> out:
> >>>>>> if (time_after(next_balance, sd->last_balance + interval)) {
> >>>>>> next_balance = sd->last_balance + interval;
> >>>>>
> >>>
> >>
>
On 4/18/25 20:32, Vincent Guittot wrote:
> On Wed, 16 Apr 2025 at 16:14, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>>
>>
>>
>> On 4/16/25 15:17, Vincent Guittot wrote:
>>> On Wed, 16 Apr 2025 at 11:29, Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 4/16/25 14:46, Shrikanth Hegde wrote:
>>>>>
>>>>>
>>>>> On 4/16/25 11:58, Chen, Yu C wrote:
>>>>>> Hi Shrikanth,
>>>>>>
>>>>>> On 4/16/2025 1:30 PM, Shrikanth Hegde wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 4/16/25 09:28, Tim Chen wrote:
>>>>>>>> At load balance time, balance of last level cache domains and
>>>>>>>> above needs to be serialized. The scheduler checks the atomic var
>>>>>>>> sched_balance_running first and then see if time is due for a load
>>>>>>>> balance. This is an expensive operation as multiple CPUs can attempt
>>>>>>>> sched_balance_running acquisition at the same time.
>>>>>>>>
>>>>>>>> On a 2 socket Granite Rapid systems enabling sub-numa cluster and
>>>>>>>> running OLTP workloads, 7.6% of cpu cycles are spent on cmpxchg of
>>>>>>>> sched_balance_running. Most of the time, a balance attempt is aborted
>>>>>>>> immediately after acquiring sched_balance_running as load balance time
>>>>>>>> is not due.
>>>>>>>>
>>>>>>>> Instead, check balance due time first before acquiring
>>>>>>>> sched_balance_running. This skips many useless acquisitions
>>>>>>>> of sched_balance_running and knocks the 7.6% CPU overhead on
>>>>>>>> sched_balance_domain() down to 0.05%. Throughput of the OLTP workload
>>>>>>>> improved by 11%.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Tim.
>>>>>>>
>>>>>>> Time check makes sense specially on large systems mainly due to
>>>>>>> NEWIDLE balance.
>>>>>
>>>>> scratch the NEWLY_IDLE part from that comment.
>>>>>
>>>>>>>
>>>>>>
>>>>>> Could you elaborate a little on this statement? There is no timeout
>>>>>> mechanism like periodic load balancer for the NEWLY_IDLE, right?
>>>>>
>>>>> Yes. NEWLY_IDLE is very opportunistic.
>>>>>
>>>>>>
>>>>>>> One more point to add, A lot of time, the CPU which acquired
>>>>>>> sched_balance_running,
>>>>>>> need not end up doing the load balance, since it not the CPU meant to
>>>>>>> do the load balance.
>>>>>>>
>>>>>>> This thread.
>>>>>>> https://lore.kernel.org/all/1e43e783-55e7-417f-
>>>>>>> a1a7-503229eb163a@linux.ibm.com/
>>>>>>>
>>>>>>>
>>>>>>> Best thing probably is to acquire it if this CPU has passed the time
>>>>>>> check and as well it is
>>>>>>> actually going to do load balance.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> This is a good point, and we might only want to deal with periodic load
>>>>>> balancer rather than NEWLY_IDLE balance. Because the latter is too
>>>>>> frequent and contention on the sched_balance_running might introduce
>>>>>> high cache contention.
>>>>>>
>>>>>
>>>>> But NEWLY_IDLE doesn't serialize using sched_balance_running and can
>>>>> endup consuming a lot of cycles. But if we serialize using
>>>>> sched_balance_running, it would definitely cause a lot contention as is.
>>>>>
>>>>>
>>>>> The point was, before acquiring it, it would be better if this CPU is
>>>>> definite to do the load balance. Else there are chances to miss the
>>>>> actual load balance.
>>>>>
>>>>>
>>>>
>>>> Sorry, forgot to add.
>>>>
>>>> Do we really need newidle running all the way till NUMA? or if it runs till PKG is it enough?
>>>> the regular (idle) can take care for NUMA by serializing it?
>>>>
>>>> - if (sd->flags & SD_BALANCE_NEWIDLE) {
>>>> + if (sd->flags & SD_BALANCE_NEWIDLE && !(sd->flags & SD_SERIALIZE)) {
>>>
>>> Why not just clearing SD_BALANCE_NEWIDLE in your sched domain when you
>>> set SD_SERIALIZE
>>
>> Hi Vincent.
>>
>> There is even a kernel parameter "relax_domain_level" which one can make use of.
>> concern was newidle does this without acquiring the sched_balance_running,
>> while busy,idle try to acquire this for NUMA.
>>
>>
>>
>> Slightly different topic: It (kernel parameter) also resets SD_BALANCE_WAKE. But is it being used?
>> I couldn't find out how it is used.
>
> Hi Shrikanth,
>
> The define below does the link
>
> #define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
>
> int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> ..
> wake_flags |= WF_TTWU;
> ..
> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> select_task_rq_fair()
> int sd_flag = wake_flags & 0xF;
> ..
> for_each_domain(cpu, tmp) {
> ..
> if (tmp->flags & sd_flag)
>
Thanks Vincent, Prateek.
>>
>>>
>>>>
>>>> pulled_task = sched_balance_rq(this_cpu, this_rq,
>>>> sd, CPU_NEWLY_IDLE,
>>>>
>>>>
>>>> Anyways, having a policy around this SD_SERIALIZE would be a good thing.
>>>>
>>>>>> thanks,
>>>>>> Chenyu
>>>>>>
>>>>>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>>>>>> Reported-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>>>>>> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
>>>>>>>> ---
>>>>>>>> kernel/sched/fair.c | 16 ++++++++--------
>>>>>>>> 1 file changed, 8 insertions(+), 8 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>>>> index e43993a4e580..5e5f7a770b2f 100644
>>>>>>>> --- a/kernel/sched/fair.c
>>>>>>>> +++ b/kernel/sched/fair.c
>>>>>>>> @@ -12220,13 +12220,13 @@ static void sched_balance_domains(struct
>>>>>>>> rq *rq, enum cpu_idle_type idle)
>>>>>>>> interval = get_sd_balance_interval(sd, busy);
>>>>>>>> - need_serialize = sd->flags & SD_SERIALIZE;
>>>>>>>> - if (need_serialize) {
>>>>>>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>>>>>>> - goto out;
>>>>>>>> - }
>>>>>>>> -
>>>>>>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>>>>>>>> + need_serialize = sd->flags & SD_SERIALIZE;
>>>>>>>> + if (need_serialize) {
>>>>>>>> + if (atomic_cmpxchg_acquire(&sched_balance_running,
>>>>>>>> 0, 1))
>>>>>>>> + goto out;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> if (sched_balance_rq(cpu, rq, sd, idle,
>>>>>>>> &continue_balancing)) {
>>>>>>>> /*
>>>>>>>> * The LBF_DST_PINNED logic could have changed
>>>>>>>> @@ -12238,9 +12238,9 @@ static void sched_balance_domains(struct rq
>>>>>>>> *rq, enum cpu_idle_type idle)
>>>>>>>> }
>>>>>>>> sd->last_balance = jiffies;
>>>>>>>> interval = get_sd_balance_interval(sd, busy);
>>>>>>>> + if (need_serialize)
>>>>>>>> + atomic_set_release(&sched_balance_running, 0);
>>>>>>>> }
>>>>>>>> - if (need_serialize)
>>>>>>>> - atomic_set_release(&sched_balance_running, 0);
>>>>>>>> out:
>>>>>>>> if (time_after(next_balance, sd->last_balance + interval)) {
>>>>>>>> next_balance = sd->last_balance + interval;
>>>>>>>
>>>>>
>>>>
>>
Hello Shrikanth,

On 4/16/2025 7:44 PM, Shrikanth Hegde wrote:
> Slightly different topic: It (kernel parameter) also resets SD_BALANCE_WAKE. But is it being used?
> I couldn't find out how it is used.

So the usage is very convoluted. SD_BALANCE_WAKE is the same as WF_TTWU.
kernel/sched/sched.h makes sure of it, with some static asserts.

In select_task_rq_fair(), the for_each_domain() loop has:

	if (tmp->flags & sd_flag)
		sd = tmp;

where sd_flag is (wake_flags & 0xF). This boils down to
(sd->flags & SD_BALANCE_FORK) for fork (WF_FORK) and
(sd->flags & SD_BALANCE_WAKE) for wakeup (WF_TTWU).

There is more convoluted interaction with the arch specific
"node_reclaim_distance" that can clear SD_BALANCE_FORK, SD_BALANCE_EXEC,
and SD_WAKE_AFFINE if a node is deemed very far to explore balance on
wakeup / fork.

Probably some of these flags can be merged now that things have evolved,
but it requires a bit of auditing before jumping in.

--
Thanks and Regards,
Prateek