Currently, the CPU selection logic for nohz idle balance keeps no
history, so CPU0 is always chosen whenever it is in the nohz CPU mask.
That is unfair to the tasks residing on NUMA node 0, and it gets worse
on machines with a large number of CPUs, where nohz idle balance can be
very heavy.
To address this, add a member to "nohz" that records which CPU was
chosen last time, and pick the next one on each round of nohz idle
balance.
Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c798d2795243..ba6930c79e25 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7197,6 +7197,7 @@ static struct {
atomic_t nr_cpus;
int has_blocked; /* Idle CPUS has blocked load */
int needs_update; /* Newly idle CPUs need their next_balance collated */
+ int last_cpu; /* Last cpu chosen to do nohz idle balance */
unsigned long next_balance; /* in jiffy units */
unsigned long next_blocked; /* Next update of blocked load in jiffies */
} nohz ____cacheline_aligned;
@@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
- for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
+ for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu + 1) {
- if (ilb_cpu == smp_processor_id())
+ if (ilb_cpu == smp_processor_id() || !cpumask_test_cpu(ilb_cpu, hk_mask))
continue;
- if (idle_cpu(ilb_cpu))
+ if (idle_cpu(ilb_cpu)) {
+ nohz.last_cpu = ilb_cpu;
return ilb_cpu;
+ }
}
return -1;
--
2.43.0
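
For illustration, here is a minimal user-space sketch of the rotation
behaviour that the for_each_cpu_wrap() change above aims at; the CPU
count and mask contents below are made up, and this is not the kernel
code itself:

/*
 * Toy model: remember the CPU picked last time and start the next
 * search just after it, wrapping around, so ILB duty is spread over
 * the idle CPUs instead of always landing on the first set bit.
 */
#include <stdio.h>

#define NR_CPUS 8

/* stand-in for nohz.idle_cpus_mask: 1 means the CPU is nohz-idle */
static int idle_mask[NR_CPUS] = { 1, 0, 1, 1, 0, 1, 0, 1 };

/* mirrors the proposed nohz.last_cpu */
static int last_cpu = -1;

static int find_new_ilb_rotating(int this_cpu)
{
	for (int i = 0; i < NR_CPUS; i++) {
		int cpu = (last_cpu + 1 + i) % NR_CPUS;

		if (cpu == this_cpu || !idle_mask[cpu])
			continue;
		last_cpu = cpu;
		return cpu;
	}
	return -1;
}

int main(void)
{
	/* successive rounds pick 0, 2, 3, 5, 7, 0, ... rather than 0 every time */
	for (int round = 0; round < 6; round++)
		printf("round %d -> ilb cpu %d\n", round, find_new_ilb_rotating(4));
	return 0;
}
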
On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@hygon.cn> wrote:
>
> Currently, cpu selection logic for nohz idle balance lacks history info
> that leads to cpu0 is always chosen if it's in nohz cpu mask. It's not
> fair fot the tasks reside in numa node0. It's worse in the machine with
> large cpu number, nohz idle balance may be very heavy.
Could you provide more details about why it's not fair for tasks that
reside on NUMA node 0? CPU0 is idle, so the ILB doesn't steal time from
other tasks.
Do you have figures or use cases to highlight this unfairness?
>
> To address this issue, adding a member to "nohz" to indicate who is
> chosen last time and choose next for this round of nohz idle balance.
>
> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c798d2795243..ba6930c79e25 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7197,6 +7197,7 @@ static struct {
> atomic_t nr_cpus;
> int has_blocked; /* Idle CPUS has blocked load */
> int needs_update; /* Newly idle CPUs need their next_balance collated */
> + int last_cpu; /* Last cpu chosen to do nohz idle balance */
> unsigned long next_balance; /* in jiffy units */
> unsigned long next_blocked; /* Next update of blocked load in jiffies */
> } nohz ____cacheline_aligned;
> @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
>
> hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
>
> - for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> + for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu + 1) {
>
> - if (ilb_cpu == smp_processor_id())
> + if (ilb_cpu == smp_processor_id() || !cpumask_test_cpu(ilb_cpu, hk_mask))
> continue;
>
> - if (idle_cpu(ilb_cpu))
> + if (idle_cpu(ilb_cpu)) {
> + nohz.last_cpu = ilb_cpu;
> return ilb_cpu;
> + }
> }
>
> return -1;
> --
> 2.43.0
>
> -----Original Message-----
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Sent: Wednesday, March 19, 2025 4:46 PM
> To: Jianyong Wu <wujianyong@hygon.cn>
> Cc: mingo@redhat.com; peterz@infradead.org; jianyong.wu@outlook.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
>
> On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@hygon.cn> wrote:
> >
> > Currently, cpu selection logic for nohz idle balance lacks history
> > info that leads to cpu0 is always chosen if it's in nohz cpu mask.
> > It's not fair fot the tasks reside in numa node0. It's worse in the
> > machine with large cpu number, nohz idle balance may be very heavy.
>
> Could you provide more details about why it's not fair for tasks that reside on
> numa node 0 ? cpu0 is idle so ilb doesn't steal time to other tasks.
>
> Do you have figures or use cases to highlight this unfairness ?
>
[Jianyong Wu]
Yeah, here is a test case.
In a system with a large number of CPUs (in my scenario, there are 256
CPUs), when the entire system is under low load, if you bind two or more
CPU-bound jobs to a single CPU other than CPU0, you'll notice that the
softirq utilization on CPU0 can reach approximately 10%, while it
remains negligible on other CPUs. Checking /proc/softirqs makes it
evident that a large number of SCHED softirqs are executed only on CPU0.
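
As a rough sketch only (not the exact jobs used in the test above; the
CPU number and the number of busy loops are just examples), pinning two
CPU-bound processes to one CPU other than CPU0 can look like this, which
should make CPU0's SCHED line in /proc/softirqs grow noticeably, per the
observation above:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 1;	/* any CPU except 0 */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* two CPU-bound children inherit the affinity and share that CPU */
	for (int i = 0; i < 2; i++) {
		if (fork() == 0)
			for (;;)
				;
	}
	pause();	/* parent idles; watch /proc/softirqs and mpstat meanwhile */
	return 0;
}
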
> >
> > To address this issue, adding a member to "nohz" to indicate who is
> > chosen last time and choose next for this round of nohz idle balance.
> >
> > Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> > ---
> > kernel/sched/fair.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> > c798d2795243..ba6930c79e25 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7197,6 +7197,7 @@ static struct {
> > atomic_t nr_cpus;
> > int has_blocked; /* Idle CPUS has blocked load
> */
> > int needs_update; /* Newly idle CPUs need their
> next_balance collated */
> > + int last_cpu; /* Last cpu chosen to do nohz
> idle balance */
> > unsigned long next_balance; /* in jiffy units */
> > unsigned long next_blocked; /* Next update of blocked load in
> jiffies */
> > } nohz ____cacheline_aligned;
> > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> >
> > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> >
> > - for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > + for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu
> > + + 1) {
> >
> > - if (ilb_cpu == smp_processor_id())
> > + if (ilb_cpu == smp_processor_id() ||
> > + !cpumask_test_cpu(ilb_cpu, hk_mask))
> > continue;
> >
> > - if (idle_cpu(ilb_cpu))
> > + if (idle_cpu(ilb_cpu)) {
> > + nohz.last_cpu = ilb_cpu;
> > return ilb_cpu;
> > + }
> > }
> >
> > return -1;
> > --
> > 2.43.0
> >
On Wed, 19 Mar 2025 at 10:03, Jianyong Wu <wujianyong@hygon.cn> wrote:
>
>
>
> > -----Original Message-----
> > From: Vincent Guittot <vincent.guittot@linaro.org>
> > Sent: Wednesday, March 19, 2025 4:46 PM
> > To: Jianyong Wu <wujianyong@hygon.cn>
> > Cc: mingo@redhat.com; peterz@infradead.org; jianyong.wu@outlook.com;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> >
> > On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@hygon.cn> wrote:
> > >
> > > Currently, cpu selection logic for nohz idle balance lacks history
> > > info that leads to cpu0 is always chosen if it's in nohz cpu mask.
> > > It's not fair fot the tasks reside in numa node0. It's worse in the
> > > machine with large cpu number, nohz idle balance may be very heavy.
> >
> > Could you provide more details about why it's not fair for tasks that reside on
> > numa node 0 ? cpu0 is idle so ilb doesn't steal time to other tasks.
> >
> > Do you have figures or use cases to highlight this unfairness ?
> >
> [Jianyong Wu]
> Yeah, here is a test case.
> In a system with a large number of CPUs (in my scenario, there are 256 CPUs), when the entire system is under a low load, if you try to bind two or more CPU - bound jobs to a single CPU other than CPU0, you'll notice that the softirq utilization for CPU0 can reach approximately 10%, while it remains negligible for other CPUs. By checking the /proc/softirqs file, it becomes evident that a significant number of SCHED softirqs are only executed on CPU0.
Yes, but this 10% of softirq time would otherwise have been idle time,
so why should we care?
Conversely, if all your CPUs are busy, then CPU0 will do the busy load
balance of the parent sched domain levels whereas other CPUs will not,
and that time is stolen from the task running on CPU0, which could be
unfair compared to tasks running on other CPUs.
> > >
> > > To address this issue, adding a member to "nohz" to indicate who is
> > > chosen last time and choose next for this round of nohz idle balance.
> > >
> > > Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> > > ---
> > > kernel/sched/fair.c | 9 ++++++---
> > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> > > c798d2795243..ba6930c79e25 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7197,6 +7197,7 @@ static struct {
> > > atomic_t nr_cpus;
> > > int has_blocked; /* Idle CPUS has blocked load
> > */
> > > int needs_update; /* Newly idle CPUs need their
> > next_balance collated */
> > > + int last_cpu; /* Last cpu chosen to do nohz
> > idle balance */
> > > unsigned long next_balance; /* in jiffy units */
> > > unsigned long next_blocked; /* Next update of blocked load in
> > jiffies */
> > > } nohz ____cacheline_aligned;
> > > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> > >
> > > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> > >
> > > - for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > > + for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.last_cpu
> > > + + 1) {
> > >
> > > - if (ilb_cpu == smp_processor_id())
> > > + if (ilb_cpu == smp_processor_id() ||
> > > + !cpumask_test_cpu(ilb_cpu, hk_mask))
> > > continue;
> > >
> > > - if (idle_cpu(ilb_cpu))
> > > + if (idle_cpu(ilb_cpu)) {
> > > + nohz.last_cpu = ilb_cpu;
> > > return ilb_cpu;
> > > + }
> > > }
> > >
> > > return -1;
> > > --
> > > 2.43.0
> > >
> -----Original Message-----
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Sent: Wednesday, March 19, 2025 5:26 PM
> To: Jianyong Wu <wujianyong@hygon.cn>
> Cc: mingo@redhat.com; peterz@infradead.org; jianyong.wu@outlook.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
>
> On Wed, 19 Mar 2025 at 10:03, Jianyong Wu <wujianyong@hygon.cn> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Vincent Guittot <vincent.guittot@linaro.org>
> > > Sent: Wednesday, March 19, 2025 4:46 PM
> > > To: Jianyong Wu <wujianyong@hygon.cn>
> > > Cc: mingo@redhat.com; peterz@infradead.org; jianyong.wu@outlook.com;
> > > linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> > >
> > > On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@hygon.cn> wrote:
> > > >
> > > > Currently, cpu selection logic for nohz idle balance lacks history
> > > > info that leads to cpu0 is always chosen if it's in nohz cpu mask.
> > > > It's not fair fot the tasks reside in numa node0. It's worse in
> > > > the machine with large cpu number, nohz idle balance may be very heavy.
> > >
> > > Could you provide more details about why it's not fair for tasks
> > > that reside on numa node 0 ? cpu0 is idle so ilb doesn't steal time to other
> tasks.
> > >
> > > Do you have figures or use cases to highlight this unfairness ?
> > >
> > [Jianyong Wu]
> > Yeah, here is a test case.
> > In a system with a large number of CPUs (in my scenario, there are 256 CPUs),
> when the entire system is under a low load, if you try to bind two or more CPU -
> bound jobs to a single CPU other than CPU0, you'll notice that the softirq
> utilization for CPU0 can reach approximately 10%, while it remains negligible for
> other CPUs. By checking the /proc/softirqs file, it becomes evident that a
> significant number of SCHED softirqs are only executed on CPU0.
>
> yes, but this 10% of softirq time would have been idle time otherwise so why
> should we care ?
>
[Jianyong Wu]
However, this value is proportional to the number of CPUs, and when
tasks are scheduled onto CPU0, they will see delays.
> At the opposite, if all your cpus are busy, then cpu0 will do the busy load balance
> of parents sched domain level whereas other cpus will not and this time is stolen
> to the task running on CPU0 which could be unfair compared to task running on
> other CPUS
>
[Jianyong Wu]
Yeah. Do you think this patch can mitigate this?
>
> > > >
> > > > To address this issue, adding a member to "nohz" to indicate who
> > > > is chosen last time and choose next for this round of nohz idle balance.
> > > >
> > > > Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> > > > ---
> > > > kernel/sched/fair.c | 9 ++++++---
> > > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> > > > c798d2795243..ba6930c79e25 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7197,6 +7197,7 @@ static struct {
> > > > atomic_t nr_cpus;
> > > > int has_blocked; /* Idle CPUS has blocked
> load
> > > */
> > > > int needs_update; /* Newly idle CPUs need
> their
> > > next_balance collated */
> > > > + int last_cpu; /* Last cpu chosen to do
> nohz
> > > idle balance */
> > > > unsigned long next_balance; /* in jiffy units */
> > > > unsigned long next_blocked; /* Next update of blocked
> load in
> > > jiffies */
> > > > } nohz ____cacheline_aligned;
> > > > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> > > >
> > > > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> > > >
> > > > - for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > > > + for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask,
> > > > + nohz.last_cpu
> > > > + + 1) {
> > > >
> > > > - if (ilb_cpu == smp_processor_id())
> > > > + if (ilb_cpu == smp_processor_id() ||
> > > > + !cpumask_test_cpu(ilb_cpu, hk_mask))
> > > > continue;
> > > >
> > > > - if (idle_cpu(ilb_cpu))
> > > > + if (idle_cpu(ilb_cpu)) {
> > > > + nohz.last_cpu = ilb_cpu;
> > > > return ilb_cpu;
> > > > + }
> > > > }
> > > >
> > > > return -1;
> > > > --
> > > > 2.43.0
> > > >
On Wed, 19 Mar 2025 at 10:42, Jianyong Wu <wujianyong@hygon.cn> wrote:
>
>
>
> > -----Original Message-----
> > From: Vincent Guittot <vincent.guittot@linaro.org>
> > Sent: Wednesday, March 19, 2025 5:26 PM
> > To: Jianyong Wu <wujianyong@hygon.cn>
> > Cc: mingo@redhat.com; peterz@infradead.org; jianyong.wu@outlook.com;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> >
> > On Wed, 19 Mar 2025 at 10:03, Jianyong Wu <wujianyong@hygon.cn> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Vincent Guittot <vincent.guittot@linaro.org>
> > > > Sent: Wednesday, March 19, 2025 4:46 PM
> > > > To: Jianyong Wu <wujianyong@hygon.cn>
> > > > Cc: mingo@redhat.com; peterz@infradead.org; jianyong.wu@outlook.com;
> > > > linux-kernel@vger.kernel.org
> > > > Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
> > > >
> > > > On Tue, 18 Mar 2025 at 03:27, Jianyong Wu <wujianyong@hygon.cn> wrote:
> > > > >
> > > > > Currently, cpu selection logic for nohz idle balance lacks history
> > > > > info that leads to cpu0 is always chosen if it's in nohz cpu mask.
> > > > > It's not fair fot the tasks reside in numa node0. It's worse in
> > > > > the machine with large cpu number, nohz idle balance may be very heavy.
> > > >
> > > > Could you provide more details about why it's not fair for tasks
> > > > that reside on numa node 0 ? cpu0 is idle so ilb doesn't steal time to other
> > tasks.
> > > >
> > > > Do you have figures or use cases to highlight this unfairness ?
> > > >
> > > [Jianyong Wu]
> > > Yeah, here is a test case.
> > > In a system with a large number of CPUs (in my scenario, there are 256 CPUs),
> > when the entire system is under a low load, if you try to bind two or more CPU -
> > bound jobs to a single CPU other than CPU0, you'll notice that the softirq
> > utilization for CPU0 can reach approximately 10%, while it remains negligible for
> > other CPUs. By checking the /proc/softirqs file, it becomes evident that a
> > significant number of SCHED softirqs are only executed on CPU0.
> >
> > yes, but this 10% of softirq time would have been idle time otherwise so why
> > should we care ?
> >
> [Jianyong Wu]
> However, this value is proportional to the number of CPUs. In the event that tasks are scheduled to CPU0, delays will occur.
The ILB aborts when a task is enqueued on the CPU, so the 10% softirq
time does not reflect the delay of an enqueued task. It would be good
to get figures for the problem you are trying to solve.
>
> > At the opposite, if all your cpus are busy, then cpu0 will do the busy load balance
> > of parents sched domain level whereas other cpus will not and this time is stolen
> > to the task running on CPU0 which could be unfair compared to task running on
> > other CPUS
> >
> [Jianyong Wu]
> Yeah. Do you think this patch can mitigate this?
> >
> > > > >
> > > > > To address this issue, adding a member to "nohz" to indicate who
> > > > > is chosen last time and choose next for this round of nohz idle balance.
> > > > >
> > > > > Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> > > > > ---
> > > > > kernel/sched/fair.c | 9 ++++++---
> > > > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
> > > > > c798d2795243..ba6930c79e25 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -7197,6 +7197,7 @@ static struct {
> > > > > atomic_t nr_cpus;
> > > > > int has_blocked; /* Idle CPUS has blocked
> > load
> > > > */
> > > > > int needs_update; /* Newly idle CPUs need
> > their
> > > > next_balance collated */
> > > > > + int last_cpu; /* Last cpu chosen to do
> > nohz
> > > > idle balance */
> > > > > unsigned long next_balance; /* in jiffy units */
> > > > > unsigned long next_blocked; /* Next update of blocked
> > load in
> > > > jiffies */
> > > > > } nohz ____cacheline_aligned;
> > > > > @@ -12266,13 +12267,15 @@ static inline int find_new_ilb(void)
> > > > >
> > > > > hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
> > > > >
> > > > > - for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
> > > > > + for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask,
> > > > > + nohz.last_cpu
> > > > > + + 1) {
> > > > >
> > > > > - if (ilb_cpu == smp_processor_id())
> > > > > + if (ilb_cpu == smp_processor_id() ||
> > > > > + !cpumask_test_cpu(ilb_cpu, hk_mask))
> > > > > continue;
> > > > >
> > > > > - if (idle_cpu(ilb_cpu))
> > > > > + if (idle_cpu(ilb_cpu)) {
> > > > > + nohz.last_cpu = ilb_cpu;
> > > > > return ilb_cpu;
> > > > > + }
> > > > > }
> > > > >
> > > > > return -1;
> > > > > --
> > > > > 2.43.0
> > > > >
On Tue, Mar 18, 2025 at 02:23:58AM +0000, Jianyong Wu wrote:

Re subject; if you look at other patches for sched, you'll note that we
do not capitalize it.

Also, what you're doing is not scatter, it is rotation.

> Currently, cpu selection logic for nohz idle balance lacks history info
> that leads to cpu0 is always chosen if it's in nohz cpu mask. It's not
> fair fot the tasks reside in numa node0. It's worse in the machine with
> large cpu number, nohz idle balance may be very heavy.

Since you seem to care about ilb and numa; there is this _very_ old
patch set that never got finished:

https://lore.kernel.org/all/20091211013056.305998000@intel.com/

IIRC there was a problem where it would simply stop running the per-node
ilb when the node went idle, leading to node level imbalances. This
should be curable by picking one such idle node and keeping its ILB
active or somesuch.

Something to poke at if you're interested..
Hi Peter,

Thanks for the reply.

> -----Original Message-----
> From: Peter Zijlstra <peterz@infradead.org>
> Sent: Tuesday, March 18, 2025 2:39 PM
> To: Jianyong Wu <wujianyong@hygon.cn>
> Cc: mingo@redhat.com; vincent.guittot@linaro.org; jianyong.wu@outlook.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] SCHED: scatter nohz idle balance target cpus
>
> On Tue, Mar 18, 2025 at 02:23:58AM +0000, Jianyong Wu wrote:
>
> Re subject; if you look at other patches for sched, you'll note that we do not
> capitalize it.

[Jianyong Wu] OK, will correct it.

> Also, what you're doing is not scatter, it is rotation.

[Jianyong Wu] It seems so. However, as nohz idle balance occurs quite
frequently, in proportion to the value of "HZ" and the number of CPUs,
the selected CPU will soon spread across the whole system. Compared with
the current situation, where mostly only cpu0 is selected, the change in
this patch is more like a "scattering" effect. But it doesn't matter,
"rotation" is fine; I can change it according to your wish.

> > Currently, cpu selection logic for nohz idle balance lacks history
> > info that leads to cpu0 is always chosen if it's in nohz cpu mask.
> > It's not fair fot the tasks reside in numa node0. It's worse in the
> > machine with large cpu number, nohz idle balance may be very heavy.
>
> Since you seem to care about ilb and numa; there is this _very_ old patch set
> that never got finished:
>
> https://lore.kernel.org/all/20091211013056.305998000@intel.com/

[Jianyong Wu] Thanks for this. I'll look into it (may take some time).

> IIRC there was a problem where it would simply stop running the per-node ilb
> when the node went idle, leading to node level imbalances. This should be
> curable by picking one such idle node and keeping its ILB active or somesuch.

[Jianyong Wu] I think this patch is simple enough to achieve the
"fairness". WDYT?

> Something to poke at if you're interested..