[RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Kenan.Liu 2 years, 1 month ago
From: "Kenan.Liu" <Kenan.Liu@linux.alibaba.com>

Multithreaded workloads in a VM with QEMU may encounter an unexpected
phenomenon: one hyperthread of a physical core is busy while its sibling
is idle. For example:

%Cpu0  : 19.8 us,  3.8 sy,  0.0 ni, 76.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  6.4 us,  1.0 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 15.8 us,  4.5 sy,  0.0 ni, 79.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  4.7 us,  0.7 sy,  0.0 ni, 94.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 15.5 us,  4.5 sy,  0.0 ni, 80.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  4.7 us,  0.7 sy,  0.0 ni, 94.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 13.4 us,  3.4 sy,  0.0 ni, 83.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  2.7 us,  0.3 sy,  0.0 ni, 97.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 16.1 us,  4.8 sy,  0.0 ni, 79.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  2.0 us,  0.3 sy,  0.0 ni, 97.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 17.5 us,  5.2 sy,  0.0 ni, 77.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 : 17.6 us,  4.5 sy,  0.0 ni, 77.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 :  0.3 us,  0.0 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 : 16.1 us,  4.1 sy,  0.0 ni, 79.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

The main reason is that the hyperthread index is consecutive in the QEMU
native x86 CPU model, which differs from the physical topology. With the
current kernel scheduler implementation, a hyperthread with an even ID number
is picked with much higher probability during load balancing and task placement.

This RFC aims to solve the problem by adjusting the CFS loadbalance policy:
1. Explore the CPU topology and adjust the CFS loadbalance policy when a
machine with the QEMU native CPU topology is found.
2. Export a procfs interface to control the traverse length when selecting
an idle cpu.
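
For (1), the detection is conceptually along these lines (an illustrative,
untested sketch, not the patch itself; the helper name here is made up):
walk the online cpus and check whether each cpu's SMT sibling is the
adjacent cpu number.

    /* Sketch: return true if every SMT pair is enumerated consecutively,
     * i.e. (0,1),(2,3),..., as the QEMU native x86 CPU model does.
     * cpu_smt_mask() is the kernel's SMT sibling mask.
     */
    static bool smt_siblings_are_consecutive(void)
    {
        int cpu;

        for_each_online_cpu(cpu) {
            const struct cpumask *smt = cpu_smt_mask(cpu);

            /* with 2-way SMT, the sibling of cpu must be cpu ^ 1 */
            if (cpumask_weight(smt) == 2 &&
                !cpumask_test_cpu(cpu ^ 1, smt))
                return false;
        }
        return true;
    }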

Kenan.Liu (2):
  sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
    topology.
  sched/fair: Export a param to control the traverse len when select
    idle cpu.

 kernel/sched/fair.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 99 insertions(+), 4 deletions(-)

-- 
1.8.3.1
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Peter Zijlstra 2 years, 1 month ago
On Thu, Jul 20, 2023 at 04:34:11PM +0800, Kenan.Liu wrote:
> From: "Kenan.Liu" <Kenan.Liu@linux.alibaba.com>
> 
> Multithreaded workloads in a VM with QEMU may encounter an unexpected
> phenomenon: one hyperthread of a physical core is busy while its sibling
> is idle. For example:

Is this with vCPU pinning? Without that, guest topology makes no sense
whatsoever.

> The main reason is that the hyperthread index is consecutive in the QEMU
> native x86 CPU model, which differs from the physical topology.

I'm sorry, what? That doesn't make sense. SMT enumeration is all over
the place for Intel, but some actually do have (n,n+1) SMT. On AMD it's
always (n,n+1) IIRC.

> With the current kernel scheduler
> implementation, a hyperthread with an even ID number is picked with much
> higher probability during load balancing and task placement.

How so?

> This RFC aims to solve the problem by adjusting the CFS loadbalance policy:
> 1. Explore the CPU topology and adjust the CFS loadbalance policy when a
> machine with the QEMU native CPU topology is found.
> 2. Export a procfs interface to control the traverse length when selecting
> an idle cpu.
> 
> Kenan.Liu (2):
>   sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
>     topology.
>   sched/fair: Export a param to control the traverse len when select
>     idle cpu.

NAK, qemu can either provide a fake topology to the guest using normal
x86 means (MADT/CPUID) or do some paravirt topology setup, but this is
quite insane.
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Kenan.Liu 2 years, 1 month ago
Hi Peter, thanks for your attention.

Please see my answers to your questions inline:


On 2023/7/20 16:50, Peter Zijlstra wrote:
> On Thu, Jul 20, 2023 at 04:34:11PM +0800, Kenan.Liu wrote:
>> From: "Kenan.Liu" <Kenan.Liu@linux.alibaba.com>
>>
>> Multithreaded workloads in a VM with QEMU may encounter an unexpected
>> phenomenon: one hyperthread of a physical core is busy while its sibling
>> is idle. For example:
> Is this with vCPU pinning? Without that, guest topology makes no sense
> whatsoever.


vCPUs are pinned on the host, and the imbalance we observed is inside the
VM, not among the vCPU threads on the host.


>> The main reason is that the hyperthread index is consecutive in the QEMU
>> native x86 CPU model, which differs from the physical topology.
> I'm sorry, what? That doesn't make sense. SMT enumeration is all over
> the place for Intel, but some actually do have (n,n+1) SMT. On AMD it's
> always (n,n+1) IIRC.
>
>> With the current kernel scheduler
>> implementation, a hyperthread with an even ID number is picked with much
>> higher probability during load balancing and task placement.
> How so?


The SMT topology in the QEMU native x86 CPU model is (0,1),…,(n,n+1),…,
but the SMT topology normally seen on a physical machine is like
(0,n),(1,n+1),…, where n is the total number of cores in the machine.

The imbalance happens when the number of runnable threads is less
than the number of hyperthreads: select_idle_core() is called to
decide which cpu should run the woken-up task.

select_idle_core() returns the checked cpu number if the whole
core is idle. Conversely, if any HT of the core is busy,
select_idle_core() clears the whole core from the cpumask and
checks the next core.

select_idle_core():
     …
     if (idle)
         return core;

     cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
     return -1;

In this manner, except at the very beginning of the for_each_cpu_wrap()
loop, the HT with an even ID number is always checked first, and is
returned to the caller if the whole core is idle, so the odd-numbered HT
almost never gets selected.

select_idle_cpu():
     …
     for_each_cpu_wrap(cpu, cpus, target + 1) {
         if (has_idle_core) {
             i = select_idle_core(p, cpu, cpus, &idle_cpu);

And this will NOT happen when the SMT topology is (0,n),(1,n+1),…:
when the loop starts from the bottom half of the CPU numbers, HTs with
larger numbers will be checked first; when it starts from the top half,
their siblings with smaller numbers come first in the within-core search.
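
To make the effect concrete, below is a small stand-alone user-space
simulation of the scan order described above (our simplified model, not
kernel code: no nr limit, and placed tasks never go idle again). Tasks
are placed one by one, each scan starting at the previous pick + 1, for
both sibling layouts:

    #include <stdio.h>
    #include <stdbool.h>

    #define NCPUS  16
    #define NTASKS 6

    /* sibling under QEMU-native consecutive enumeration: (0,1),(2,3),... */
    static int sib_consec(int cpu) { return cpu ^ 1; }
    /* sibling under the split layout: (0,8),(1,9),... */
    static int sib_split(int cpu) { return (cpu + NCPUS / 2) % NCPUS; }

    /* Mimic the select_idle_core() scan: walk cpus from 'start' with
     * wrap-around, return the first cpu whose whole core is idle, and
     * drop both siblings of any core that has a busy hyperthread. */
    static int pick(int start, const bool *busy, int (*sib)(int))
    {
        bool dropped[NCPUS] = { false };
        int i, cpu;

        for (i = 0; i < NCPUS; i++) {
            cpu = (start + i) % NCPUS;
            if (dropped[cpu])
                continue;
            if (!busy[cpu] && !busy[sib(cpu)])
                return cpu;    /* whole core idle: take it */
            dropped[cpu] = dropped[sib(cpu)] = true;
        }
        return -1;
    }

    static void run(const char *name, int (*sib)(int))
    {
        bool busy[NCPUS] = { false };
        int target = 0, t, cpu;

        printf("%-17s:", name);
        for (t = 0; t < NTASKS; t++) {
            /* select_idle_cpu() starts its wrap scan at target + 1 */
            cpu = pick((target + 1) % NCPUS, busy, sib);
            if (cpu < 0)
                break;    /* no fully idle core left */
            busy[cpu] = true;
            target = cpu;
            printf(" %d", cpu);
        }
        printf("\n");
    }

    int main(void)
    {
        run("consecutive (0,1)", sib_consec);
        run("split (0,n)", sib_split);
        return 0;
    }

With 6 tasks on 16 cpus this picks 1 2 4 6 8 10 under the consecutive
layout (every pick after the first is even), versus 1 2 3 4 5 6 under
the split layout.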


>
>> This RFC aims to solve the problem by adjusting the CFS loadbalance policy:
>> 1. Explore the CPU topology and adjust the CFS loadbalance policy when a
>> machine with the QEMU native CPU topology is found.
>> 2. Export a procfs interface to control the traverse length when selecting
>> an idle cpu.
>>
>> Kenan.Liu (2):
>>    sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
>>      topology.
>>    sched/fair: Export a param to control the traverse len when select
>>      idle cpu.
> NAK, qemu can either provide a fake topology to the guest using normal
> x86 means (MADT/CPUID) or do some paravirt topology setup, but this is
> quite insane.
Thanks,

Kenan.Liu
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Peter Zijlstra 2 years, 1 month ago
On Fri, Jul 21, 2023 at 10:58:50AM +0800, Kenan.Liu wrote:

> > > With the current kernel scheduler
> > > implementation, a hyperthread with an even ID number is picked with much
> > > higher probability during load balancing and task placement.
> > How so?
> 
> 
> The SMT topology in the QEMU native x86 CPU model is (0,1),…,(n,n+1),…,
> but the SMT topology normally seen on a physical machine is like
> (0,n),(1,n+1),…, where n is the total number of cores in the machine.

That is only common on Intel hardware; AMD (and some Intel) will in fact
enumerate SMT like QEMU does.
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Vincent Guittot 2 years, 1 month ago
On Fri, 21 Jul 2023 at 04:59, Kenan.Liu <Kenan.Liu@linux.alibaba.com> wrote:
>
> Hi Peter, thanks for your attention.
>
> Please see my answers to your questions inline:
>
>
> On 2023/7/20 16:50, Peter Zijlstra wrote:
> > On Thu, Jul 20, 2023 at 04:34:11PM +0800, Kenan.Liu wrote:
> >> From: "Kenan.Liu" <Kenan.Liu@linux.alibaba.com>
> >>
> >> Multithreaded workloads in a VM with QEMU may encounter an unexpected
> >> phenomenon: one hyperthread of a physical core is busy while its sibling
> >> is idle. For example:
> > Is this with vCPU pinning? Without that, guest topology makes no sense
> > whatsoever.
>
>
> vCPUs are pinned on the host, and the imbalance we observed is inside the
> VM, not among the vCPU threads on the host.
>
>
> >> The main reason is that the hyperthread index is consecutive in the QEMU
> >> native x86 CPU model, which differs from the physical topology.
> > I'm sorry, what? That doesn't make sense. SMT enumeration is all over
> > the place for Intel, but some actually do have (n,n+1) SMT. On AMD it's
> > always (n,n+1) IIRC.
> >
> >> With the current kernel scheduler
> >> implementation, a hyperthread with an even ID number is picked with much
> >> higher probability during load balancing and task placement.
> > How so?
>
>
> The SMT topology in the QEMU native x86 CPU model is (0,1),…,(n,n+1),…,
> but the SMT topology normally seen on a physical machine is like
> (0,n),(1,n+1),…, where n is the total number of cores in the machine.
>
> The imbalance happens when the number of runnable threads is less
> than the number of hyperthreads: select_idle_core() is called to
> decide which cpu should run the woken-up task.
>
> select_idle_core() returns the checked cpu number if the whole
> core is idle. Conversely, if any HT of the core is busy,
> select_idle_core() clears the whole core from the cpumask and
> checks the next core.
>
> select_idle_core():
>      …
>      if (idle)
>          return core;
>
>      cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>      return -1;
>
> In this manner, except at the very beginning of the for_each_cpu_wrap()
> loop, the HT with an even ID number is always checked first, and is
> returned to the caller if the whole core is idle, so the odd-numbered HT
> almost never gets selected.
>
> select_idle_cpu():
>      …
>      for_each_cpu_wrap(cpu, cpus, target + 1) {
>          if (has_idle_core) {
>              i = select_idle_core(p, cpu, cpus, &idle_cpu);
>
> And this will NOT happen when the SMT topology is (0,n),(1,n+1),…:
> when the loop starts from the bottom half of the CPU numbers, HTs with
> larger numbers will be checked first; when it starts from the top half,
> their siblings with smaller numbers come first in the within-core search.

But why is it a problem? Your system is almost idle and 1 HT per core
is used. Who cares whether we select one HT or the other evenly, as long
as we select an idle core in priority?

This seems related to
https://lore.kernel.org/lkml/BYAPR21MB1688FE804787663C425C2202D753A@BYAPR21MB1688.namprd21.prod.outlook.com/
where we concluded that it was not a problem.

>
>
> >
> >> This RFC aims to solve the problem by adjusting the CFS loadbalance policy:
> >> 1. Explore the CPU topology and adjust the CFS loadbalance policy when a
> >> machine with the QEMU native CPU topology is found.
> >> 2. Export a procfs interface to control the traverse length when selecting
> >> an idle cpu.
> >>
> >> Kenan.Liu (2):
> >>    sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
> >>      topology.
> >>    sched/fair: Export a param to control the traverse len when select
> >>      idle cpu.
> > NAK, qemu can either provide a fake topology to the guest using normal
> > x86 means (MADT/CPUID) or do some paravirt topology setup, but this is
> > quite insane.
> Thanks,
>
> Kenan.Liu
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Peter Zijlstra 2 years, 1 month ago
On Fri, Jul 21, 2023 at 10:33:44AM +0200, Vincent Guittot wrote:
> On Fri, 21 Jul 2023 at 04:59, Kenan.Liu <Kenan.Liu@linux.alibaba.com> wrote:

> > The SMT topology in the QEMU native x86 CPU model is (0,1),…,(n,n+1),…,
> > but the SMT topology normally seen on a physical machine is like
> > (0,n),(1,n+1),…, where n is the total number of cores in the machine.
> >
> > The imbalance happens when the number of runnable threads is less
> > than the number of hyperthreads: select_idle_core() is called to
> > decide which cpu should run the woken-up task.
> >
> > select_idle_core() returns the checked cpu number if the whole
> > core is idle. Conversely, if any HT of the core is busy,
> > select_idle_core() clears the whole core from the cpumask and
> > checks the next core.
> >
> > select_idle_core():
> >      …
> >      if (idle)
> >          return core;
> >
> >      cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> >      return -1;
> >
> > In this manner, except at the very beginning of the for_each_cpu_wrap()
> > loop, the HT with an even ID number is always checked first, and is
> > returned to the caller if the whole core is idle, so the odd-numbered HT
> > almost never gets selected.
> >
> > select_idle_cpu():
> >      …
> >      for_each_cpu_wrap(cpu, cpus, target + 1) {
> >          if (has_idle_core) {
> >              i = select_idle_core(p, cpu, cpus, &idle_cpu);
> >
> > And this will NOT happen when the SMT topology is (0,n),(1,n+1),…:
> > when the loop starts from the bottom half of the CPU numbers, HTs with
> > larger numbers will be checked first; when it starts from the top half,
> > their siblings with smaller numbers come first in the within-core search.
> 
> But why is it a problem? Your system is almost idle and 1 HT per core
> is used. Who cares whether we select one HT or the other evenly, as long
> as we select an idle core in priority?

Right, why is this a problem? Hyperthreads are supposed to be symmetric;
it doesn't matter which of the two is active, the important thing is to
have only one active if we can.

(Unlike Power7, which has asymmetric SMT.)
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by luoben@linux.alibaba.com 2 years, 1 month ago
On 2023/7/21 17:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 21, 2023 at 10:33:44AM +0200, Vincent Guittot wrote:
> > On Fri, 21 Jul 2023 at 04:59, Kenan.Liu <Kenan.Liu@linux.alibaba.com> wrote:
> 
> >> The SMT topology in the QEMU native x86 CPU model is (0,1),…,(n,n+1),…,
> >> but the SMT topology normally seen on a physical machine is like
> >> (0,n),(1,n+1),…, where n is the total number of cores in the machine.
> >>
> >> The imbalance happens when the number of runnable threads is less
> >> than the number of hyperthreads: select_idle_core() is called to
> >> decide which cpu should run the woken-up task.
> >>
> >> select_idle_core() returns the checked cpu number if the whole
> >> core is idle. Conversely, if any HT of the core is busy,
> >> select_idle_core() clears the whole core from the cpumask and
> >> checks the next core.
> >>
> >> select_idle_core():
> >>       …
> >>       if (idle)
> >>           return core;
> >>
> >>       cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
> >>       return -1;
> >>
> >> In this manner, except at the very beginning of the for_each_cpu_wrap()
> >> loop, the HT with an even ID number is always checked first, and is
> >> returned to the caller if the whole core is idle, so the odd-numbered HT
> >> almost never gets selected.
> >>
> >> select_idle_cpu():
> >>       …
> >>       for_each_cpu_wrap(cpu, cpus, target + 1) {
> >>           if (has_idle_core) {
> >>               i = select_idle_core(p, cpu, cpus, &idle_cpu);
> >>
> >> And this will NOT happen when the SMT topology is (0,n),(1,n+1),…:
> >> when the loop starts from the bottom half of the CPU numbers, HTs with
> >> larger numbers will be checked first; when it starts from the top half,
> >> their siblings with smaller numbers come first in the within-core search.
> >
> > But why is it a problem? Your system is almost idle and 1 HT per core
> > is used. Who cares whether we select one HT or the other evenly, as long
> > as we select an idle core in priority?
> 
> Right, why is this a problem? Hyperthreads are supposed to be symmetric;
> it doesn't matter which of the two is active, the important thing is to
> have only one active if we can.
>
> (Unlike Power7, which has asymmetric SMT.)
> 

Hi Peter and Vincent,

Some upper-level monitoring logic may take CPU usage as a metric for
compute-resource scaling. Imbalanced scheduling can create the illusion
of CPU resource scarcity, leading the upper-level scheduling system to
trigger resource expansion more often. However, this is actually
a waste of resources. So we think this may be a problem.

Could you please take a further look at PATCH #2? We found that the default
'nr' value did not perform well in our scenario, and we believe an
adjustable variable would be more appropriate.

Our scenario is as follows:
16 processes are running in a 32-CPU VM, with 8 threads per process;
they are all running the same job.

The expected result is that CPU usage is evenly distributed, but
we found that the even-numbered cpus were favored by scheduling decisions
and consumed more CPU (5%~20% more), mainly because of the default
value of nr=4. In this scenario, we found that nr=2 is more suitable.
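
For context, the nr=4 above is the floor used by the SIS_PROP search-depth
heuristic in select_idle_cpu(); roughly, paraphrasing the kernel source of
that era rather than quoting it exactly:

    if (sched_feat(SIS_PROP) && !has_idle_core) {
        u64 avg_cost, avg_idle, span_avg;
        …
        span_avg = sd->span_weight * avg_idle;
        if (span_avg > 4 * avg_cost)
            nr = div_u64(span_avg, avg_cost);
        else
            nr = 4;
    }

PATCH #2 is about making that 'nr' tunable rather than hard-coded.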

Thanks,
Ben
Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU topology.
Posted by Peter Zijlstra 2 years, 1 month ago
On Mon, Jul 24, 2023 at 02:57:57PM +0800, luoben@linux.alibaba.com wrote:

> > Right, why is this a problem? Hyperthreads are supposed to be symmetric;
> > it doesn't matter which of the two is active, the important thing is to
> > have only one active if we can.
> >
> > (Unlike Power7, which has asymmetric SMT.)
> > 
> 
> hi Peter and Vincent,
> 
> Some upper-level monitoring logic may take CPU usage as a metric for
> compute-resource scaling. Imbalanced scheduling can create the illusion
> of CPU resource scarcity, leading the upper-level scheduling system to
> trigger resource expansion more often. However, this is actually
> a waste of resources. So we think this may be a problem.

This is a problem of your monitoring logic -- there is absolutely no
functional problem with the kernel AFAICT.

Teach the thing about SMT instead.
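
Something like this (completely untested, just to illustrate the idea:
fold per-HT utilization into per-core utilization using the standard sysfs
topology files, so that (busy, idle) and (idle, busy) sibling pairs look
identical to the monitoring):

    #include <stdio.h>

    #define NCPUS 16

    /* read a cpu's physical core id from sysfs */
    static int core_id(int cpu)
    {
        char path[96];
        int id = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
        f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%d", &id) != 1)
                id = -1;
            fclose(f);
        }
        return id;
    }

    /* a core is as busy as its busiest sibling; core_usage[] must be
     * sized for the largest core_id and zero-initialized by the caller */
    static void per_core_usage(const double *ht_usage, double *core_usage)
    {
        int cpu, core;

        for (cpu = 0; cpu < NCPUS; cpu++) {
            core = core_id(cpu);
            if (core >= 0 && ht_usage[cpu] > core_usage[core])
                core_usage[core] = ht_usage[cpu];
        }
    }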

> Could you please take a further look at PATCH #2? We found that the default
> 'nr' value did not perform well in our scenario, and we believe an
> adjustable variable would be more appropriate.

That patch is tweaking default-disabled code -- which we should be
removing sometime soon, I suppose. So no.