[Qemu-devel] [PATCH] i386: turn off l3-cache property by default

Denis Plotnikov posted 1 patch 6 years, 5 months ago
Failed in applying to current master
include/hw/i386/pc.h | 7 ++++++-
target/i386/cpu.c    | 2 +-
2 files changed, 7 insertions(+), 2 deletions(-)
[Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Denis Plotnikov 6 years, 5 months ago
Commit 14c985cffa ("target-i386: present virtual L3 cache info for vcpus")
introduced the l3-cache property and enabled it by default, exposing an L3
cache to the guest.

The motivation behind it was that in the Linux scheduler, when waking up
a task on a sibling CPU, the task was put onto the target CPU's runqueue
directly, without sending a reschedule IPI.  Reduction in the IPI count
led to performance gain.

However, this isn't the whole story.  Once the task is on the target
CPU's runqueue, it may have to preempt the current task on that CPU, be
it the idle task putting the CPU to sleep or just another running task.
For that, a reschedule IPI has to be issued, too.  Only when the other
CPU has been running a normal task for too short a time do the fairness
constraints prevent the preemption and thus the IPI.
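
For reference, a heavily abridged sketch of the wake-up path in question
(based on kernel/sched/core.c; locking, sched_feat() checks and exact
signatures are omitted, so treat it as an illustration only):

/* The advertised cache topology feeds cpus_share_cache(), which picks
 * between a remote wake-up (always a RES IPI) and a direct enqueue
 * (an IPI only if the wakee ends up preempting the current task). */
static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
	struct rq *rq = cpu_rq(cpu);

	if (!cpus_share_cache(smp_processor_id(), cpu)) {
		ttwu_queue_remote(p, cpu, wake_flags);	/* remote queue + IPI */
		return;
	}
	/* Shared LLC: enqueue locally; the preemption check decides
	 * whether a reschedule IPI is needed after all. */
	ttwu_do_activate(rq, p, wake_flags);
}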

This boils down to the improvement being only achievable in workloads
with many actively switching tasks.  We had no access to the
(proprietary?) SAP HANA benchmark the commit referred to, but the
pattern is also reproduced with "perf bench sched messaging -g 1"
on a 1-socket, 8-core vCPU topology, where we indeed see:

l3-cache	#res IPI /s	#time / 10000 loops
off		560K		1.8 sec
on		40K		0.9 sec

Now there's a downside: with an L3 cache the Linux scheduler is more eager
to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
interactions and therefore excessive halts and IPIs.  E.g. "perf bench
sched pipe -i 100000" gives

l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
off		200 (no K)	230		0.2 sec
on		400K		330K		0.5 sec
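
For reference, the reschedule IPI counts can be sampled from the "RES" line
of the guest's /proc/interrupts around each run.  A minimal helper sketch,
assuming a standard Linux guest (illustration only, not the actual
measurement harness used here):

/* Sum the per-CPU rescheduling-interrupt counters from /proc/interrupts.
 * Two samples taken around a benchmark run give the "#res IPI /s" rate. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned long long res_ipi_total(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[4096];
    unsigned long long total = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f)) {
        char *p = strstr(line, "RES:");
        if (!p)
            continue;
        p += 4;
        for (;;) {
            char *end;
            unsigned long long v = strtoull(p, &end, 10);
            if (end == p)   /* reached the textual description */
                break;
            total += v;
            p = end;
        }
        break;
    }
    fclose(f);
    return total;
}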

In a more realistic test, we observe a 15% degradation in VM density
(measured as the number of VMs, each running Drupal CMS serving 2 HTTP
requests per second to its main page, with 95th-percentile response
latency under 100 ms) with l3-cache=on.

We think that the mostly-idle scenario is more common in cloud and personal
usage and should be optimized for by default; users of highly loaded
VMs should be able to tune them up themselves.

So switch l3-cache off by default, and add a compat clause for the range
of machine types where it was on.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
---
 include/hw/i386/pc.h | 7 ++++++-
 target/i386/cpu.c    | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 087d184..1d2dcae 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -375,7 +375,12 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
         .driver   = TYPE_X86_CPU,\
         .property = "x-hv-max-vps",\
         .value    = "0x40",\
-    },
+    },\
+    {\
+        .driver   = TYPE_X86_CPU,\
+        .property = "l3-cache",\
+        .value    = "on",\
+    },\
 
 #define PC_COMPAT_2_9 \
     HW_COMPAT_2_9 \
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 1edcf29..95a51bd 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4154,7 +4154,7 @@ static Property x86_cpu_properties[] = {
     DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
     DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
     DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
-    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
+    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, false),
     DEFINE_PROP_BOOL("kvm-no-smi-migration", X86CPU, kvm_no_smi_migration,
                      false),
     DEFINE_PROP_BOOL("vmware-cpuid-freq", X86CPU, vmware_cpuid_freq, true),
-- 
2.7.4


Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Michael S. Tsirkin 6 years, 4 months ago
On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> introduced and set by default exposing l3 to the guest.
> 
> The motivation behind it was that in the Linux scheduler, when waking up
> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> directly, without sending a reschedule IPI.  Reduction in the IPI count
> led to performance gain.
> 
> However, this isn't the whole story.  Once the task is on the target
> CPU's runqueue, it may have to preempt the current task on that CPU, be
> it the idle task putting the CPU to sleep or just another running task.
> For that a reschedule IPI will have to be issued, too.  Only when that
> other CPU is running a normal task for too little time, the fairness
> constraints will prevent the preemption and thus the IPI.
> 
> This boils down to the improvement being only achievable in workloads
> with many actively switching tasks.  We had no access to the
> (proprietary?) SAP HANA benchmark the commit referred to, but the
> pattern is also reproduced with "perf bench sched messaging -g 1"
> on 1 socket, 8 cores vCPU topology, we see indeed:
> 
> l3-cache	#res IPI /s	#time / 10000 loops
> off		560K		1.8 sec
> on		40K		0.9 sec
> 
> Now there's a downside: with L3 cache the Linux scheduler is more eager
> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> sched pipe -i 100000" gives
> 
> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> off		200 (no K)	230		0.2 sec
> on		400K		330K		0.5 sec
> 
> In a more realistic test, we observe 15% degradation in VM density
> (measured as the number of VMs, each running Drupal CMS serving 2 http
> requests per second to its main page, with 95%-percentile response
> latency under 100 ms) with l3-cache=on.
> 
> We think that mostly-idle scenario is more common in cloud and personal
> usage, and should be optimized for by default; users of highly loaded
> VMs should be able to tune them up themselves.
> 
> So switch l3-cache off by default, and add a compat clause for the range
> of machine types where it was on.
> 
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>

Pls add a new machine type so 2.11 can keep it on by
default.
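
For illustration, assuming the usual compat-macro naming pattern for the
next release (the macro names here are an assumption, not code from this
patch), that would look something along the lines of:

    #define PC_COMPAT_2_11 \
        HW_COMPAT_2_11 \
        {\
            .driver   = TYPE_X86_CPU,\
            .property = "l3-cache",\
            .value    = "on",\
        },

so that pc-*-2.11 and older machine types keep l3-cache=on and only the
next machine type picks up the new default.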

> ---
>  include/hw/i386/pc.h | 7 ++++++-
>  target/i386/cpu.c    | 2 +-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 087d184..1d2dcae 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -375,7 +375,12 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>          .driver   = TYPE_X86_CPU,\
>          .property = "x-hv-max-vps",\
>          .value    = "0x40",\
> -    },
> +    },\
> +    {\
> +        .driver   = TYPE_X86_CPU,\
> +        .property = "l3-cache",\
> +        .value    = "on",\
> +    },\
>  
>  #define PC_COMPAT_2_9 \
>      HW_COMPAT_2_9 \
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 1edcf29..95a51bd 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -4154,7 +4154,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
> -    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
> +    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, false),
>      DEFINE_PROP_BOOL("kvm-no-smi-migration", X86CPU, kvm_no_smi_migration,
>                       false),
>      DEFINE_PROP_BOOL("vmware-cpuid-freq", X86CPU, vmware_cpuid_freq, true),
> -- 
> 2.7.4

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Paolo Bonzini 6 years, 4 months ago
On 28/11/2017 19:54, Michael S. Tsirkin wrote:
>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
>> sched pipe -i 100000" gives
>>
>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
>> off		200 (no K)	230		0.2 sec
>> on		400K		330K		0.5 sec
>>
>> In a more realistic test, we observe 15% degradation in VM density
>> (measured as the number of VMs, each running Drupal CMS serving 2 http
>> requests per second to its main page, with 95%-percentile response
>> latency under 100 ms) with l3-cache=on.
>>
>> We think that mostly-idle scenario is more common in cloud and personal
>> usage, and should be optimized for by default; users of highly loaded
>> VMs should be able to tune them up themselves.

Hi Denis,

thanks for the report.  I think there are two cases:

1) The dedicated pCPU case: do you still get the performance degradation
with dedicated pCPUs?

2) The non-dedicated pCPU case: do you still get the performance
degradation with threads=1?  If not, why do you have sibling vCPUs at
all, if you don't have a dedicated physical CPU for each vCPU?

Thanks,

Paolo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Eduardo Habkost 6 years, 4 months ago
On Tue, Nov 28, 2017 at 08:50:54PM +0100, Paolo Bonzini wrote:
> On 28/11/2017 19:54, Michael S. Tsirkin wrote:
> >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> >> sched pipe -i 100000" gives
> >>
> >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> >> off		200 (no K)	230		0.2 sec
> >> on		400K		330K		0.5 sec
> >>
> >> In a more realistic test, we observe 15% degradation in VM density
> >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> >> requests per second to its main page, with 95%-percentile response
> >> latency under 100 ms) with l3-cache=on.
> >>
> >> We think that mostly-idle scenario is more common in cloud and personal
> >> usage, and should be optimized for by default; users of highly loaded
> >> VMs should be able to tune them up themselves.
> 
> Hi Denis,
> 
> thanks for the report.  I think there are two cases:
> 
> 1) The dedicated pCPU case: do you still get the performance degradation
> with dedicated pCPUs?
> 
> 2) The non-dedicated pCPU case: do you still get the performance
> degradation with threads=1?  If not, why do you have sibling vCPUs at
> all, if you don't have a dedicated physical CPU for each vCPU?

I assume you mean cores=1,threads=1?

Even if the pCPUs are dedicated, I would still like to see a
comparison between cores=1,threads=1,l3-cache=off and
cores=1,threads=1,l3-cache=off.  Maybe configuring cores > 1
isn't really helpful in some use cases.

-- 
Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Roman Kagan 6 years, 4 months ago
On Tue, Nov 28, 2017 at 08:50:54PM +0100, Paolo Bonzini wrote:
> On 28/11/2017 19:54, Michael S. Tsirkin wrote:
> >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> >> sched pipe -i 100000" gives
> >>
> >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> >> off		200 (no K)	230		0.2 sec
> >> on		400K		330K		0.5 sec
> >>
> >> In a more realistic test, we observe 15% degradation in VM density
> >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> >> requests per second to its main page, with 95%-percentile response
> >> latency under 100 ms) with l3-cache=on.
> >>
> >> We think that mostly-idle scenario is more common in cloud and personal
> >> usage, and should be optimized for by default; users of highly loaded
> >> VMs should be able to tune them up themselves.
> 
> Hi Denis,
> 
> thanks for the report.  I think there are two cases:
> 
> 1) The dedicated pCPU case: do you still get the performance degradation
> with dedicated pCPUs?

I wonder why dedicated pCPU would matter at all?  The behavior change is
in the guest scheduler.

> 2) The non-dedicated pCPU case: do you still get the performance
> degradation with threads=1?  If not, why do you have sibling vCPUs at
> all, if you don't have a dedicated physical CPU for each vCPU?

We have sibling vCPUs in terms of cores, not threads.  I.e. the
configuration in the test was sockets=1,cores=8,threads=1.  Are you
suggesting that it shouldn't be used without pCPU binding?

Roman.

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Eduardo Habkost 6 years, 4 months ago
Hi,

On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> introduced and set by default exposing l3 to the guest.
> 
> The motivation behind it was that in the Linux scheduler, when waking up
> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> directly, without sending a reschedule IPI.  Reduction in the IPI count
> led to performance gain.
> 
> However, this isn't the whole story.  Once the task is on the target
> CPU's runqueue, it may have to preempt the current task on that CPU, be
> it the idle task putting the CPU to sleep or just another running task.
> For that a reschedule IPI will have to be issued, too.  Only when that
> other CPU is running a normal task for too little time, the fairness
> constraints will prevent the preemption and thus the IPI.
> 
> This boils down to the improvement being only achievable in workloads
> with many actively switching tasks.  We had no access to the
> (proprietary?) SAP HANA benchmark the commit referred to, but the
> pattern is also reproduced with "perf bench sched messaging -g 1"
> on 1 socket, 8 cores vCPU topology, we see indeed:
> 
> l3-cache	#res IPI /s	#time / 10000 loops
> off		560K		1.8 sec
> on		40K		0.9 sec
> 
> Now there's a downside: with L3 cache the Linux scheduler is more eager
> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> sched pipe -i 100000" gives
> 
> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> off		200 (no K)	230		0.2 sec
> on		400K		330K		0.5 sec
> 
> In a more realistic test, we observe 15% degradation in VM density
> (measured as the number of VMs, each running Drupal CMS serving 2 http
> requests per second to its main page, with 95%-percentile response
> latency under 100 ms) with l3-cache=on.
> 
> We think that mostly-idle scenario is more common in cloud and personal
> usage, and should be optimized for by default; users of highly loaded
> VMs should be able to tune them up themselves.
> 

There's one thing I don't understand in your test case: if you
just found out that Linux will behave worse if it assumes that
the VCPUs are sharing an L3 cache, why are you configuring an
8-core VCPU topology explicitly?

Do you still see a difference in the numbers if you use "-smp 8"
with no "cores" and "threads" options?


> So switch l3-cache off by default, and add a compat clause for the range
> of machine types where it was on.
> 
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> ---
>  include/hw/i386/pc.h | 7 ++++++-
>  target/i386/cpu.c    | 2 +-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 087d184..1d2dcae 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -375,7 +375,12 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
>          .driver   = TYPE_X86_CPU,\
>          .property = "x-hv-max-vps",\
>          .value    = "0x40",\
> -    },
> +    },\
> +    {\
> +        .driver   = TYPE_X86_CPU,\
> +        .property = "l3-cache",\
> +        .value    = "on",\
> +    },\
>  
>  #define PC_COMPAT_2_9 \
>      HW_COMPAT_2_9 \
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 1edcf29..95a51bd 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -4154,7 +4154,7 @@ static Property x86_cpu_properties[] = {
>      DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
>      DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
>      DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
> -    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
> +    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, false),
>      DEFINE_PROP_BOOL("kvm-no-smi-migration", X86CPU, kvm_no_smi_migration,
>                       false),
>      DEFINE_PROP_BOOL("vmware-cpuid-freq", X86CPU, vmware_cpuid_freq, true),
> -- 
> 2.7.4
> 
> 

-- 
Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Denis V. Lunev 6 years, 4 months ago
On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> Hi,
>
> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>> introduced and set by default exposing l3 to the guest.
>>
>> The motivation behind it was that in the Linux scheduler, when waking up
>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>> directly, without sending a reschedule IPI.  Reduction in the IPI count
>> led to performance gain.
>>
>> However, this isn't the whole story.  Once the task is on the target
>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>> it the idle task putting the CPU to sleep or just another running task.
>> For that a reschedule IPI will have to be issued, too.  Only when that
>> other CPU is running a normal task for too little time, the fairness
>> constraints will prevent the preemption and thus the IPI.
>>
>> This boils down to the improvement being only achievable in workloads
>> with many actively switching tasks.  We had no access to the
>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>> pattern is also reproduced with "perf bench sched messaging -g 1"
>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>
>> l3-cache	#res IPI /s	#time / 10000 loops
>> off		560K		1.8 sec
>> on		40K		0.9 sec
>>
>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
>> sched pipe -i 100000" gives
>>
>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
>> off		200 (no K)	230		0.2 sec
>> on		400K		330K		0.5 sec
>>
>> In a more realistic test, we observe 15% degradation in VM density
>> (measured as the number of VMs, each running Drupal CMS serving 2 http
>> requests per second to its main page, with 95%-percentile response
>> latency under 100 ms) with l3-cache=on.
>>
>> We think that mostly-idle scenario is more common in cloud and personal
>> usage, and should be optimized for by default; users of highly loaded
>> VMs should be able to tune them up themselves.
>>
> There's one thing I don't understand in your test case: if you
> just found out that Linux will behave worse if it assumes that
> the VCPUs are sharing a L3 cache, why are you configuring a
> 8-core VCPU topology explicitly?
>
> Do you still see a difference in the numbers if you use "-smp 8"
> with no "cores" and "threads" options?
>
This is quite simple. A lot of software licenses are bound to the number
of CPU __sockets__. Thus in a lot of cases it is mandatory to set the
topology to 1 socket/xx cores to reduce the amount of money that has to
be paid for the software.

Den

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Eduardo Habkost 6 years, 4 months ago
[CCing the people who were copied in the original patch that
enabled l3cache]

On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > Hi,
> >
> > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >> introduced and set by default exposing l3 to the guest.
> >>
> >> The motivation behind it was that in the Linux scheduler, when waking up
> >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> >> led to performance gain.
> >>
> >> However, this isn't the whole story.  Once the task is on the target
> >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >> it the idle task putting the CPU to sleep or just another running task.
> >> For that a reschedule IPI will have to be issued, too.  Only when that
> >> other CPU is running a normal task for too little time, the fairness
> >> constraints will prevent the preemption and thus the IPI.
> >>
> >> This boils down to the improvement being only achievable in workloads
> >> with many actively switching tasks.  We had no access to the
> >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >> pattern is also reproduced with "perf bench sched messaging -g 1"
> >> on 1 socket, 8 cores vCPU topology, we see indeed:
> >>
> >> l3-cache	#res IPI /s	#time / 10000 loops
> >> off		560K		1.8 sec
> >> on		40K		0.9 sec
> >>
> >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> >> sched pipe -i 100000" gives
> >>
> >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> >> off		200 (no K)	230		0.2 sec
> >> on		400K		330K		0.5 sec
> >>
> >> In a more realistic test, we observe 15% degradation in VM density
> >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> >> requests per second to its main page, with 95%-percentile response
> >> latency under 100 ms) with l3-cache=on.
> >>
> >> We think that mostly-idle scenario is more common in cloud and personal
> >> usage, and should be optimized for by default; users of highly loaded
> >> VMs should be able to tune them up themselves.
> >>
> > There's one thing I don't understand in your test case: if you
> > just found out that Linux will behave worse if it assumes that
> > the VCPUs are sharing a L3 cache, why are you configuring a
> > 8-core VCPU topology explicitly?
> >
> > Do you still see a difference in the numbers if you use "-smp 8"
> > with no "cores" and "threads" options?
> >
> This is quite simple. A lot of software licenses are bound to the amount
> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> with 1 socket/xx cores to reduce the amount of money necessary to
> be paid for the software.

In this case it looks like we're talking about the expected
meaning of "cores=N".  My first interpretation would be that the
user obviously want the guest to see the multiple cores sharing a
L3 cache, because that's how real CPUs normally work.  But I see
why you have different expectations.

Numbers on dedicated-pCPU scenarios would be helpful to guide the
decision.  I wouldn't like to cause a performance regression for
users that fine-tuned vCPU topology and set up CPU pinning.

-- 
Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Gonglei (Arei) 6 years, 4 months ago
> -----Original Message-----
> From: Eduardo Habkost [mailto:ehabkost@redhat.com]
> Sent: Wednesday, November 29, 2017 5:13 AM
> To: Denis V. Lunev; longpeng; Michael S. Tsirkin
> Cc: Denis Plotnikov; pbonzini@redhat.com; rth@twiddle.net;
> qemu-devel@nongnu.org; rkagan@virtuozzo.com; Gonglei (Arei); huangpeng;
> Zhaoshenglong; herongguang.he@huawei.com
> Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
> 
> [CCing the people who were copied in the original patch that
> enabled l3cache]
> 
Thanks for Ccing.

> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > Hi,
> > >
> > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> > >> introduced and set by default exposing l3 to the guest.
> > >>
> > >> The motivation behind it was that in the Linux scheduler, when waking up
> > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> > >> led to performance gain.
> > >>

Yes, that's one thing.

The other reason for enabling the L3 cache is memory access performance.
We tested it with the Stream benchmark; the performance is better with L3-cache=on.

> > >> However, this isn't the whole story.  Once the task is on the target
> > >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> > >> it the idle task putting the CPU to sleep or just another running task.
> > >> For that a reschedule IPI will have to be issued, too.  Only when that
> > >> other CPU is running a normal task for too little time, the fairness
> > >> constraints will prevent the preemption and thus the IPI.
> > >>
> > >> This boils down to the improvement being only achievable in workloads
> > >> with many actively switching tasks.  We had no access to the
> > >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> > >> pattern is also reproduced with "perf bench sched messaging -g 1"
> > >> on 1 socket, 8 cores vCPU topology, we see indeed:
> > >>
> > >> l3-cache	#res IPI /s	#time / 10000 loops
> > >> off		560K		1.8 sec
> > >> on		40K		0.9 sec
> > >>
> > >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> > >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> > >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> > >> sched pipe -i 100000" gives
> > >>
> > >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> > >> off		200 (no K)	230		0.2 sec
> > >> on		400K		330K		0.5 sec
> > >>
> > >> In a more realistic test, we observe 15% degradation in VM density
> > >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> > >> requests per second to its main page, with 95%-percentile response
> > >> latency under 100 ms) with l3-cache=on.
> > >>
> > >> We think that mostly-idle scenario is more common in cloud and personal
> > >> usage, and should be optimized for by default; users of highly loaded
> > >> VMs should be able to tune them up themselves.
> > >>

Current public cloud providers usually offer different instance types,
including shared instances and dedicated instances.

And public cloud tenants usually want the L3 cache; even bigger is better.

Basically, all performance tuning targets specific scenarios;
we only need to ensure a benefit in the most common ones.

Thanks,
-Gonglei

> > > There's one thing I don't understand in your test case: if you
> > > just found out that Linux will behave worse if it assumes that
> > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > 8-core VCPU topology explicitly?
> > >
> > > Do you still see a difference in the numbers if you use "-smp 8"
> > > with no "cores" and "threads" options?
> > >
> > This is quite simple. A lot of software licenses are bound to the amount
> > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > with 1 socket/xx cores to reduce the amount of money necessary to
> > be paid for the software.
> 
> In this case it looks like we're talking about the expected
> meaning of "cores=N".  My first interpretation would be that the
> user obviously want the guest to see the multiple cores sharing a
> L3 cache, because that's how real CPUs normally work.  But I see
> why you have different expectations.
> 
> Numbers on dedicated-pCPU scenarios would be helpful to guide the
> decision.  I wouldn't like to cause a performance regression for
> users that fine-tuned vCPU topology and set up CPU pinning.
> 
> --
> Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by rkagan@virtuozzo.com 6 years, 4 months ago
On Wed, Nov 29, 2017 at 01:57:14AM +0000, Gonglei (Arei) wrote:
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> > > >> introduced and set by default exposing l3 to the guest.
> > > >>
> > > >> The motivation behind it was that in the Linux scheduler, when waking up
> > > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > > >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> > > >> led to performance gain.
> > > >>
> 
> Yes, that's one thing.
> 
> The other reason for enabling L3 cache is the performance of accessing memory.

I guess you're talking about the super-smart buffer size tuning glibc
does in its memcpy and friends.  We try to control that with an atomic
test for memcpy, and we didn't notice a difference.  We'll need to
double-check...

> We tested it by Stream benchmark, the performance is better with L3-cache=on.

This one: https://www.cs.virginia.edu/stream/ ?  Thanks, we'll have a
look, too.

> > > >> However, this isn't the whole story.  Once the task is on the target
> > > >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> > > >> it the idle task putting the CPU to sleep or just another running task.
> > > >> For that a reschedule IPI will have to be issued, too.  Only when that
> > > >> other CPU is running a normal task for too little time, the fairness
> > > >> constraints will prevent the preemption and thus the IPI.
> > > >>
> > > >> This boils down to the improvement being only achievable in workloads
> > > >> with many actively switching tasks.  We had no access to the
> > > >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> > > >> pattern is also reproduced with "perf bench sched messaging -g 1"
> > > >> on 1 socket, 8 cores vCPU topology, we see indeed:
> > > >>
> > > >> l3-cache	#res IPI /s	#time / 10000 loops
> > > >> off		560K		1.8 sec
> > > >> on		40K		0.9 sec
> > > >>
> > > >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> > > >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> > > >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> > > >> sched pipe -i 100000" gives
> > > >>
> > > >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> > > >> off		200 (no K)	230		0.2 sec
> > > >> on		400K		330K		0.5 sec
> > > >>
> > > >> In a more realistic test, we observe 15% degradation in VM density
> > > >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> > > >> requests per second to its main page, with 95%-percentile response
> > > >> latency under 100 ms) with l3-cache=on.
> > > >>
> > > >> We think that mostly-idle scenario is more common in cloud and personal
> > > >> usage, and should be optimized for by default; users of highly loaded
> > > >> VMs should be able to tune them up themselves.
> > > >>
> 
> For currently public cloud providers, they usually provide different instances,
> Including sharing instances and dedicated instances. 
> 
> And the public cloud tenants usually want the L3 cache, even bigger is better.
> 
> Basically all performance tuning target to specific scenarios, 
> we only need to ensure benefit in most scenes.

There's no doubt the ability to configure l3-cache is useful.  The
question is what the default value should be.

Thanks,
Roman.

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Gonglei (Arei) 6 years, 4 months ago

> -----Original Message-----
> From: rkagan@virtuozzo.com [mailto:rkagan@virtuozzo.com]
> Sent: Wednesday, November 29, 2017 1:56 PM
> To: Gonglei (Arei)
> Cc: Eduardo Habkost; Denis V. Lunev; longpeng; Michael S. Tsirkin; Denis
> Plotnikov; pbonzini@redhat.com; rth@twiddle.net; qemu-devel@nongnu.org;
> huangpeng; Zhaoshenglong
> Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
> 
> On Wed, Nov 29, 2017 at 01:57:14AM +0000, Gonglei (Arei) wrote:
> > > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > > > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for
> vcpus"
> > > > >> introduced and set by default exposing l3 to the guest.
> > > > >>
> > > > >> The motivation behind it was that in the Linux scheduler, when waking
> up
> > > > >> a task on a sibling CPU, the task was put onto the target CPU's
> runqueue
> > > > >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> > > > >> led to performance gain.
> > > > >>
> >
> > Yes, that's one thing.
> >
> > The other reason for enabling L3 cache is the performance of accessing
> memory.
> 
> I guess you're talking about the super-smart buffer size tuning glibc
> does in its memcpy and friends.  We try to control that with an atomic
> test for memcpy, and we didn't notice a difference.  We'll need to
> double-check...
> 
> > We tested it by Stream benchmark, the performance is better with
> L3-cache=on.
> 
> This one: https://www.cs.virginia.edu/stream/ ?  Thanks, we'll have a
> look, too.
> 
Yes. :)

Thanks,
-Gonglei

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Longpeng (Mike) 6 years, 4 months ago

On 2017/11/29 5:13, Eduardo Habkost wrote:

> [CCing the people who were copied in the original patch that
> enabled l3cache]
> 
> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>> Hi,
>>>
>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>> introduced and set by default exposing l3 to the guest.
>>>>
>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
>>>> led to performance gain.
>>>>
>>>> However, this isn't the whole story.  Once the task is on the target
>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>> it the idle task putting the CPU to sleep or just another running task.
>>>> For that a reschedule IPI will have to be issued, too.  Only when that
>>>> other CPU is running a normal task for too little time, the fairness
>>>> constraints will prevent the preemption and thus the IPI.
>>>>

Agree. :)

Our testing VM was a SUSE 11 guest with idle=poll at that time, and now I realize
that SUSE 11 has a BUG in its scheduler.

For RHEL 7.3 or the upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
rq->idle is not polling:
'''
static void ttwu_queue_remote(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
		if (!set_nr_if_polling(rq->idle))
			smp_send_reschedule(cpu);
		else
			trace_sched_wake_idle_without_ipi(cpu);
	}
}
'''

But SUSE 11 does not do this check; it sends a RES IPI unconditionally.

>>>> This boils down to the improvement being only achievable in workloads
>>>> with many actively switching tasks.  We had no access to the
>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>>>
>>>> l3-cache	#res IPI /s	#time / 10000 loops
>>>> off		560K		1.8 sec
>>>> on		40K		0.9 sec
>>>>
>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
>>>> sched pipe -i 100000" gives
>>>>
>>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
>>>> off		200 (no K)	230		0.2 sec
>>>> on		400K		330K		0.5 sec
>>>>

I guess this issue could be resolved by disabling SD_WAKE_AFFINE.

As Gonglei said:
1. the L3 cache relates to the user experience.
2. glibc gets the cache info via CPUID directly, which relates to memory
performance (see the sketch below).

What's more, the L3 cache relates to the sched_domain, which is important to the
(load) balancer when the system is busy.
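
A minimal sketch of that CPUID path (assuming an Intel-compatible guest CPU
and GCC's <cpuid.h>; an illustration, not glibc's actual code):

'''
/* Walk CPUID leaf 4 (deterministic cache parameters) and print the size of
 * each cache level the (virtual) CPU advertises.  With l3-cache=off the L3
 * entry simply never shows up, which is what user space then tunes to. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	unsigned int i;

	for (i = 0; ; i++) {
		__cpuid_count(4, i, eax, ebx, ecx, edx);
		if ((eax & 0x1f) == 0)	/* cache type 0: no more caches */
			break;
		unsigned int level = (eax >> 5) & 0x7;
		unsigned long ways  = ((ebx >> 22) & 0x3ff) + 1;
		unsigned long parts = ((ebx >> 12) & 0x3ff) + 1;
		unsigned long line  = (ebx & 0xfff) + 1;
		unsigned long sets  = (unsigned long)ecx + 1;

		printf("L%u cache: %lu KiB\n", level,
		       ways * parts * line * sets / 1024);
	}
	return 0;
}
'''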

All this doesn't mean the patch is insignificant; I just think we should do more
research before we decide. I'll do some tests, thanks. :)

>>>> In a more realistic test, we observe 15% degradation in VM density
>>>> (measured as the number of VMs, each running Drupal CMS serving 2 http
>>>> requests per second to its main page, with 95%-percentile response
>>>> latency under 100 ms) with l3-cache=on.
>>>>
>>>> We think that mostly-idle scenario is more common in cloud and personal
>>>> usage, and should be optimized for by default; users of highly loaded
>>>> VMs should be able to tune them up themselves.
>>>>
>>> There's one thing I don't understand in your test case: if you
>>> just found out that Linux will behave worse if it assumes that
>>> the VCPUs are sharing a L3 cache, why are you configuring a
>>> 8-core VCPU topology explicitly?
>>>
>>> Do you still see a difference in the numbers if you use "-smp 8"
>>> with no "cores" and "threads" options?
>>>
>> This is quite simple. A lot of software licenses are bound to the amount
>> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
>> with 1 socket/xx cores to reduce the amount of money necessary to
>> be paid for the software.
> 
> In this case it looks like we're talking about the expected
> meaning of "cores=N".  My first interpretation would be that the
> user obviously want the guest to see the multiple cores sharing a
> L3 cache, because that's how real CPUs normally work.  But I see
> why you have different expectations.
> 
> Numbers on dedicated-pCPU scenarios would be helpful to guide the
> decision.  I wouldn't like to cause a performance regression for
> users that fine-tuned vCPU topology and set up CPU pinning.
> 


-- 
Regards,
Longpeng(Mike)


Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Roman Kagan 6 years, 4 months ago
On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
> On 2017/11/29 5:13, Eduardo Habkost wrote:
> 
> > [CCing the people who were copied in the original patch that
> > enabled l3cache]
> > 
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> >>> Hi,
> >>>
> >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >>>> introduced and set by default exposing l3 to the guest.
> >>>>
> >>>> The motivation behind it was that in the Linux scheduler, when waking up
> >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
> >>>> led to performance gain.
> >>>>
> >>>> However, this isn't the whole story.  Once the task is on the target
> >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >>>> it the idle task putting the CPU to sleep or just another running task.
> >>>> For that a reschedule IPI will have to be issued, too.  Only when that
> >>>> other CPU is running a normal task for too little time, the fairness
> >>>> constraints will prevent the preemption and thus the IPI.
> >>>>
> 
> Agree. :)
> 
> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
                                      ^^^^^^^^^
Oh, that makes a whole lot of difference!  I wish you had mentioned that in
that patch.

> that Suse11 has a BUG in its scheduler.
> 
> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
> rq->idle is not polling:
> '''
> static void ttwu_queue_remote(struct task_struct *p, int cpu)
> {
> 	struct rq *rq = cpu_rq(cpu);
> 
> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> 		if (!set_nr_if_polling(rq->idle))
> 			smp_send_reschedule(cpu);
> 		else
> 			trace_sched_wake_idle_without_ipi(cpu);
> 	}
> }
> '''
> 
> But for Suse11, it does not check, it send a RES IPI unconditionally.
> 
> >>>> This boils down to the improvement being only achievable in workloads
> >>>> with many actively switching tasks.  We had no access to the
> >>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >>>> pattern is also reproduced with "perf bench sched messaging -g 1"
> >>>> on 1 socket, 8 cores vCPU topology, we see indeed:
> >>>>
> >>>> l3-cache	#res IPI /s	#time / 10000 loops
> >>>> off		560K		1.8 sec
> >>>> on		40K		0.9 sec
> >>>>
> >>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> >>>> sched pipe -i 100000" gives
> >>>>
> >>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> >>>> off		200 (no K)	230		0.2 sec
> >>>> on		400K		330K		0.5 sec
> >>>>
> 
> I guess this issue could be resolved by disable the SD_WAKE_AFFINE.

But that requires extra tuning in the guest, which is even less likely to
happen in the cloud case where VM admin != host admin.

> As Gonglei said:
> 1. the L3 cache relates to the user experience.
> 2. the glibc would get the cache info by CPUID directly, and relates to the
> memory performance.
> 
> What's more, the L3 cache relates to the sched_domain which is important to the
> (load) balancer when system is busy.
> 
> All this doesn't mean the patch is insignificant, I just think we should do more
> research before decide. I'll do some tests, thanks. :)

Looking forward to it, thanks!
Roman.

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Longpeng (Mike) 6 years, 4 months ago

On 2017/11/29 14:01, Roman Kagan wrote:

> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>
>>> [CCing the people who were copied in the original patch that
>>> enabled l3cache]
>>>
>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>> introduced and set by default exposing l3 to the guest.
>>>>>>
>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
>>>>>> led to performance gain.
>>>>>>
>>>>>> However, this isn't the whole story.  Once the task is on the target
>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>> For that a reschedule IPI will have to be issued, too.  Only when that
>>>>>> other CPU is running a normal task for too little time, the fairness
>>>>>> constraints will prevent the preemption and thus the IPI.
>>>>>>
>>
>> Agree. :)
>>
>> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
>                                       ^^^^^^^^^
> Oh, that's a whole lot of a difference!  I wish you mentioned that in
> that patch.
> 

:( Sorry for missing that...

>> that Suse11 has a BUG in its scheduler.
>>
>> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
>> rq->idle is not polling:
>> '''
>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> {
>> 	struct rq *rq = cpu_rq(cpu);
>>
>> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>> 		if (!set_nr_if_polling(rq->idle))
>> 			smp_send_reschedule(cpu);
>> 		else
>> 			trace_sched_wake_idle_without_ipi(cpu);
>> 	}
>> }
>> '''
>>
>> But for Suse11, it does not check, it send a RES IPI unconditionally.
>>
>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>> with many actively switching tasks.  We had no access to the
>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>>>>>
>>>>>> l3-cache	#res IPI /s	#time / 10000 loops
>>>>>> off		560K		1.8 sec
>>>>>> on		40K		0.9 sec
>>>>>>
>>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
>>>>>> sched pipe -i 100000" gives
>>>>>>
>>>>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
>>>>>> off		200 (no K)	230		0.2 sec
>>>>>> on		400K		330K		0.5 sec
>>>>>>
>>
>> I guess this issue could be resolved by disable the SD_WAKE_AFFINE.
> 
> But that requires extra tuning in the guest which is even less likely to
> happen in the cloud case when VM admin != host admin.
> 

Ah, yep, that's a problem.

>> As Gonglei said:
>> 1. the L3 cache relates to the user experience.
>> 2. the glibc would get the cache info by CPUID directly, and relates to the
>> memory performance.
>>
>> What's more, the L3 cache relates to the sched_domain which is important to the
>> (load) balancer when system is busy.
>>
>> All this doesn't mean the patch is insignificant, I just think we should do more
>> research before decide. I'll do some tests, thanks. :)
> 
> Looking forward to it, thanks!

> Roman.
> 
> 


-- 
Regards,
Longpeng(Mike)


Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Eduardo Habkost 6 years, 4 months ago
On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
> On 2017/11/29 5:13, Eduardo Habkost wrote:
> > [CCing the people who were copied in the original patch that
> > enabled l3cache]
> > 
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> >>> Hi,
> >>>
> >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >>>> introduced and set by default exposing l3 to the guest.
> >>>>
> >>>> The motivation behind it was that in the Linux scheduler, when waking up
> >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
> >>>> led to performance gain.
> >>>>
> >>>> However, this isn't the whole story.  Once the task is on the target
> >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >>>> it the idle task putting the CPU to sleep or just another running task.
> >>>> For that a reschedule IPI will have to be issued, too.  Only when that
> >>>> other CPU is running a normal task for too little time, the fairness
> >>>> constraints will prevent the preemption and thus the IPI.
> >>>>
> 
> Agree. :)
> 
> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
> that Suse11 has a BUG in its scheduler.
> 
> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
> rq->idle is not polling:
> '''
> static void ttwu_queue_remote(struct task_struct *p, int cpu)
> {
> 	struct rq *rq = cpu_rq(cpu);
> 
> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> 		if (!set_nr_if_polling(rq->idle))
> 			smp_send_reschedule(cpu);
> 		else
> 			trace_sched_wake_idle_without_ipi(cpu);
> 	}
> }
> '''
> 
> But for Suse11, it does not check, it send a RES IPI unconditionally.

So, does that mean no Linux guest benefits from the l3-cache=on
default except SuSE 11 guests?


> 
> >>>> This boils down to the improvement being only achievable in workloads
> >>>> with many actively switching tasks.  We had no access to the
> >>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >>>> pattern is also reproduced with "perf bench sched messaging -g 1"
> >>>> on 1 socket, 8 cores vCPU topology, we see indeed:
> >>>>
> >>>> l3-cache	#res IPI /s	#time / 10000 loops
> >>>> off		560K		1.8 sec
> >>>> on		40K		0.9 sec
> >>>>
> >>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> >>>> sched pipe -i 100000" gives
> >>>>
> >>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> >>>> off		200 (no K)	230		0.2 sec
> >>>> on		400K		330K		0.5 sec
> >>>>
> 
> I guess this issue could be resolved by disable the SD_WAKE_AFFINE.
> 
> As Gonglei said:
> 1. the L3 cache relates to the user experience.

This is true, in a way: I have seen a fair share of user reports
where they incorrectly blame the L3 cache absence or the L3 cache
size for performance problems.

> 2. the glibc would get the cache info by CPUID directly, and relates to the
> memory performance.

I'm interested in numbers that demonstrate that.

> 
> What's more, the L3 cache relates to the sched_domain which is important to the
> (load) balancer when system is busy.
> 
> All this doesn't mean the patch is insignificant, I just think we should do more
> research before decide. I'll do some tests, thanks. :)

Yes, we need more data.  But if we find out that there are no
cases where the l3-cache=on default actually improves
performance, I will be willing to apply this patch.

IMO, the long term solution is to make Linux guests not misbehave
when we stop lying about the L3 cache.  Maybe we could provide a
"IPIs are expensive, please avoid them" hint in the KVM CPUID
leaf?

> 
> >>>> In a more realistic test, we observe 15% degradation in VM density
> >>>> (measured as the number of VMs, each running Drupal CMS serving 2 http
> >>>> requests per second to its main page, with 95%-percentile response
> >>>> latency under 100 ms) with l3-cache=on.
> >>>>
> >>>> We think that mostly-idle scenario is more common in cloud and personal
> >>>> usage, and should be optimized for by default; users of highly loaded
> >>>> VMs should be able to tune them up themselves.
> >>>>
> >>> There's one thing I don't understand in your test case: if you
> >>> just found out that Linux will behave worse if it assumes that
> >>> the VCPUs are sharing a L3 cache, why are you configuring a
> >>> 8-core VCPU topology explicitly?
> >>>
> >>> Do you still see a difference in the numbers if you use "-smp 8"
> >>> with no "cores" and "threads" options?
> >>>
> >> This is quite simple. A lot of software licenses are bound to the amount
> >> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> >> with 1 socket/xx cores to reduce the amount of money necessary to
> >> be paid for the software.
> > 
> > In this case it looks like we're talking about the expected
> > meaning of "cores=N".  My first interpretation would be that the
> > user obviously want the guest to see the multiple cores sharing a
> > L3 cache, because that's how real CPUs normally work.  But I see
> > why you have different expectations.
> > 
> > Numbers on dedicated-pCPU scenarios would be helpful to guide the
> > decision.  I wouldn't like to cause a performance regression for
> > users that fine-tuned vCPU topology and set up CPU pinning.
> > 
> 
> 
> -- 
> Regards,
> Longpeng(Mike)
> 

-- 
Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Longpeng (Mike) 6 years, 4 months ago

On 2017/11/29 18:41, Eduardo Habkost wrote:

> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>> [CCing the people who were copied in the original patch that
>>> enabled l3cache]
>>>
>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>> introduced and set by default exposing l3 to the guest.
>>>>>>
>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
>>>>>> led to performance gain.
>>>>>>
>>>>>> However, this isn't the whole story.  Once the task is on the target
>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>> For that a reschedule IPI will have to be issued, too.  Only when that
>>>>>> other CPU is running a normal task for too little time, the fairness
>>>>>> constraints will prevent the preemption and thus the IPI.
>>>>>>
>>
>> Agree. :)
>>
>> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
>> that Suse11 has a BUG in its scheduler.
>>
>> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
>> rq->idle is not polling:
>> '''
>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>> {
>> 	struct rq *rq = cpu_rq(cpu);
>>
>> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>> 		if (!set_nr_if_polling(rq->idle))
>> 			smp_send_reschedule(cpu);
>> 		else
>> 			trace_sched_wake_idle_without_ipi(cpu);
>> 	}
>> }
>> '''
>>
>> But for Suse11, it does not check, it send a RES IPI unconditionally.
> 
> So, does that mean no Linux guest benefits from the l3-cache=on
> default except SuSE 11 guests?
> 

Not only that, there is another scenario:

static void ttwu_queue(...)
{
	if (/* ...the two CPUs do NOT share an L3 cache... */) {
		...
		ttwu_queue_remote(p, cpu, wake_flags);
		return;
	}
	...
	ttwu_do_activate(rq, p, wake_flags, &rf); <--*Here*
	...
}

In ttwu_do_activate(), there are also some (low-probability) opportunities to
avoid sending a RES IPI even if the target CPU isn't in the idle polling state.

> 
>>
>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>> with many actively switching tasks.  We had no access to the
>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>>>>>
>>>>>> l3-cache	#res IPI /s	#time / 10000 loops
>>>>>> off		560K		1.8 sec
>>>>>> on		40K		0.9 sec
>>>>>>
>>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
>>>>>> sched pipe -i 100000" gives
>>>>>>
>>>>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
>>>>>> off		200 (no K)	230		0.2 sec
>>>>>> on		400K		330K		0.5 sec
>>>>>>
>>
>> I guess this issue could be resolved by disable the SD_WAKE_AFFINE.
>>
>> As Gonglei said:
>> 1. the L3 cache relates to the user experience.
> 
> This is true, in a way: I have seen a fair share of user reports
> where they incorrectly blame the L3 cache absence or the L3 cache
> size for performance problems.
> 
>> 2. the glibc would get the cache info by CPUID directly, and relates to the
>> memory performance.
> 
> I'm interested in numbers that demonstrate that.
> 

Sorry I have no numbers in hand currently :(

I'll do some tests these days, please give me some time.

>>
>> What's more, the L3 cache relates to the sched_domain which is important to the
>> (load) balancer when system is busy.
>>
>> All this doesn't mean the patch is insignificant, I just think we should do more
>> research before decide. I'll do some tests, thanks. :)
> 
> Yes, we need more data.  But if we find out that there are no
> cases where the l3-cache=on default actually improves
> performance, I will be willing to apply this patch.
> 

That's a good thing if we find the truth, it's free. :)

OTOH, I think we should note that Linux is designed for real hardware, so there
may be other problems if QEMU lacks related features. If we search for
'cpus_share_cache' in the Linux kernel, we can see that it's also used by the
block layer.
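
For instance, the blk-mq completion path does roughly the following (paraphrased
from memory of a 4.x kernel, not verbatim): if the cpu completing an I/O request
does not share the LLC with the cpu that submitted it, the completion is bounced
back via an IPI; otherwise it is handled locally.
'''
/* Paraphrased sketch, not verbatim upstream code. */
static void __blk_mq_complete_request(struct request *rq)
{
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        bool shared = false;
        int cpu = get_cpu();

        if (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags))
                shared = cpus_share_cache(cpu, ctx->cpu);

        if (cpu != ctx->cpu && !shared && cpu_online(ctx->cpu)) {
                /* Different LLC: complete on the submitting cpu via IPI. */
                rq->csd.func = __blk_mq_complete_request_remote;
                rq->csd.info = rq;
                rq->csd.flags = 0;
                smp_call_function_single_async(ctx->cpu, &rq->csd);
        } else {
                /* Same LLC (or forced): complete locally, no IPI. */
                rq->q->softirq_done_fn(rq);
        }
        put_cpu();
}
'''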

> IMO, the long term solution is to make Linux guests not misbehave
> when we stop lying about the L3 cache.  Maybe we could provide a
> "IPIs are expensive, please avoid them" hint in the KVM CPUID
> leaf?
> 

Good idea. :)

Maybe more PV features could be dug up.

>>
>>>>>> In a more realistic test, we observe 15% degradation in VM density
>>>>>> (measured as the number of VMs, each running Drupal CMS serving 2 http
>>>>>> requests per second to its main page, with 95%-percentile response
>>>>>> latency under 100 ms) with l3-cache=on.
>>>>>>
>>>>>> We think that mostly-idle scenario is more common in cloud and personal
>>>>>> usage, and should be optimized for by default; users of highly loaded
>>>>>> VMs should be able to tune them up themselves.
>>>>>>
>>>>> There's one thing I don't understand in your test case: if you
>>>>> just found out that Linux will behave worse if it assumes that
>>>>> the VCPUs are sharing a L3 cache, why are you configuring a
>>>>> 8-core VCPU topology explicitly?
>>>>>
>>>>> Do you still see a difference in the numbers if you use "-smp 8"
>>>>> with no "cores" and "threads" options?
>>>>>
>>>> This is quite simple. A lot of software licenses are bound to the amount
>>>> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
>>>> with 1 socket/xx cores to reduce the amount of money necessary to
>>>> be paid for the software.
>>>
>>> In this case it looks like we're talking about the expected
>>> meaning of "cores=N".  My first interpretation would be that the
>>> user obviously want the guest to see the multiple cores sharing a
>>> L3 cache, because that's how real CPUs normally work.  But I see
>>> why you have different expectations.
>>>
>>> Numbers on dedicated-pCPU scenarios would be helpful to guide the
>>> decision.  I wouldn't like to cause a performance regression for
>>> users that fine-tuned vCPU topology and set up CPU pinning.
>>>
>>
>>
>> -- 
>> Regards,
>> Longpeng(Mike)
>>
> 


-- 
Regards,
Longpeng(Mike)


Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Roman Kagan 6 years, 4 months ago
On Wed, Nov 29, 2017 at 07:58:19PM +0800, Longpeng (Mike) wrote:
> On 2017/11/29 18:41, Eduardo Habkost wrote:
> > On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
> >> On 2017/11/29 5:13, Eduardo Habkost wrote:
> >>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> >>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> >>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> >>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> >>>>>> introduced and set by default exposing l3 to the guest.
> >>>>>>
> >>>>>> The motivation behind it was that in the Linux scheduler, when waking up
> >>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> >>>>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
> >>>>>> led to performance gain.
> >>>>>>
> >>>>>> However, this isn't the whole story.  Once the task is on the target
> >>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
> >>>>>> it the idle task putting the CPU to sleep or just another running task.
> >>>>>> For that a reschedule IPI will have to be issued, too.  Only when that
> >>>>>> other CPU is running a normal task for too little time, the fairness
> >>>>>> constraints will prevent the preemption and thus the IPI.
> >>>>>>
> >>
> >> Agree. :)
> >>
> >> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
> >> that Suse11 has a BUG in its scheduler.
> >>
> >> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
> >> rq->idle is not polling:
> >> '''
> >> static void ttwu_queue_remote(struct task_struct *p, int cpu)
> >> {
> >> 	struct rq *rq = cpu_rq(cpu);
> >>
> >> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
> >> 		if (!set_nr_if_polling(rq->idle))
> >> 			smp_send_reschedule(cpu);
> >> 		else
> >> 			trace_sched_wake_idle_without_ipi(cpu);
> >> 	}
> >> }
> >> '''
> >>
> >> But for Suse11, it does not check, it send a RES IPI unconditionally.
> > 
> > So, does that mean no Linux guest benefits from the l3-cache=on
> > default except SuSE 11 guests?
> > 
> 
> Not only that, there is another scenario:
> 
> static void ttwu_queue(...)
> {
> 	if (...two cpus NOT sharing L3-cache) {
> 		...
> 		ttwu_queue_remote(p, cpu, wake_flags);
> 		return;
> 	}
> 	...
> 	ttwu_do_activate(rq, p, wake_flags, &rf); <--*Here*
> 	...
> }
> 
> In ttwu_do_activate(), there are also some opportunities with low probability to
> do not send RES IPI even if the target cpu isn't in IDLE polling state.

Well, it isn't so low actually: what you need is to keep the cpus busy
switching tasks.  In that case it's not uncommon that the task being
woken up on a remote cpu has accumulated more vruntime than the task
already running there; then the new task won't preempt the current task
and the IPI won't be issued (see the fairness check sketched below the
numbers).  E.g. on a RHEL 7.4 guest we saw:

> >>>>>> This boils down to the improvement being only achievable in workloads
> >>>>>> with many actively switching tasks.  We had no access to the
> >>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
> >>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
> >>>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
> >>>>>>
> >>>>>> l3-cache	#res IPI /s	#time / 10000 loops
> >>>>>> off		560K		1.8 sec
> >>>>>> on		40K		0.9 sec
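
(The fairness check mentioned above is wakeup_preempt_entity() in CFS;
roughly, from memory, and the wakeup_gran() arguments vary slightly between
kernel versions:)
'''
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
        s64 gran, vdiff = curr->vruntime - se->vruntime;

        /* The woken task already has at least as much vruntime as curr:
         * no preemption, hence no RES IPI. */
        if (vdiff <= 0)
                return -1;

        /* Preempt only if the difference exceeds the wakeup granularity. */
        gran = wakeup_gran(se);
        if (vdiff > gran)
                return 1;

        return 0;
}
'''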

The workload where it bites is a mostly-idle guest with chains of
dependent wakeups, i.e. with little parallelism:

> >>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
> >>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> >>>>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> >>>>>> sched pipe -i 100000" gives
> >>>>>>
> >>>>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> >>>>>> off		200 (no K)	230		0.2 sec
> >>>>>> on		400K		330K		0.5 sec
> >>>>>>
> >>
> >> I guess this issue could be resolved by disable the SD_WAKE_AFFINE.

Actually, it's SD_WAKE_AFFINE that's being effectively defeated by this
l3-cache, because the scheduler thinks that the cpus that share the
last-level cache are close enough that a dependent task can be woken up
on a sibling cpu.
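
(This is literally how the scheduler defines "sharing cache"; upstream it is
just a comparison of the per-cpu last-level-cache domain ids, roughly:)
'''
bool cpus_share_cache(int this_cpu, int that_cpu)
{
        /* sd_llc_id identifies the highest cache-sharing sched domain. */
        return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}
'''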

> >> As Gonglei said:
> >> 1. the L3 cache relates to the user experience.
> > 
> > This is true, in a way: I have seen a fair share of user reports
> > where they incorrectly blame the L3 cache absence or the L3 cache
> > size for performance problems.
> > 
> >> 2. the glibc would get the cache info by CPUID directly, and relates to the
> >> memory performance.
> > 
> > I'm interested in numbers that demonstrate that.

Me too.  I vaguely remember debugging a memcpy degradation in the guest
(on the Parallels proprietary hypervisor), which turned out to be due to a
combination of the l3 cache size and the cpu topology exposed to the guest,
causing glibc to choose an inadequate buffer size.

> Sorry I have no numbers in hand currently :(
> 
> I'll do some tests these days, please give me some time.

We'll try to get some data on this, too.

> >> What's more, the L3 cache relates to the sched_domain which is important to the
> >> (load) balancer when system is busy.
> >>
> >> All this doesn't mean the patch is insignificant, I just think we should do more
> >> research before decide. I'll do some tests, thanks. :)
> > 
> > Yes, we need more data.  But if we find out that there are no
> > cases where the l3-cache=on default actually improves
> > performance, I will be willing to apply this patch.
> > 
> 
> That's a good thing if we find the truth, it's free. :)
> 
> OTOH, I think we should notice that: Linux is designed on real hardware, maybe
> there're some other problems if QEMU lacks some related features. If we search
> 'cpus_share_cache' in the Linux kernel, we can see that it's also used by Block
> Layer.
> 
> > IMO, the long term solution is to make Linux guests not misbehave
> > when we stop lying about the L3 cache.  Maybe we could provide a
> > "IPIs are expensive, please avoid them" hint in the KVM CPUID
> > leaf?

We already have it, it's the hypervisor bit ;)  Seriously, I'm unaware
of hypervisors where IPIs aren't expensive.
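
(The bit in question is CPUID.01H:ECX[31], which Linux exposes as
X86_FEATURE_HYPERVISOR; just as an illustration, a minimal userspace check
could look like this:)
'''
#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>

/* CPUID leaf 1, ECX bit 31 is the "running under a hypervisor" bit. */
static bool running_under_hypervisor(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return false;
        return ecx & (1u << 31);
}

int main(void)
{
        printf("hypervisor bit: %d\n", running_under_hypervisor());
        return 0;
}
'''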

> Maybe more PV features could be digged.

One problem with this is that PV features are hard to get into other
guest OSes or existing Linux guests.

Roman.

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Eduardo Habkost 6 years, 4 months ago
On Wed, Nov 29, 2017 at 04:35:25PM +0300, Roman Kagan wrote:
> > On 2017/11/29 18:41, Eduardo Habkost wrote:
[...]
> > > IMO, the long term solution is to make Linux guests not misbehave
> > > when we stop lying about the L3 cache.  Maybe we could provide a
> > > "IPIs are expensive, please avoid them" hint in the KVM CPUID
> > > leaf?
> 
> We already have it, it's the hypervisor bit ;)  Seriously, I'm unaware
> of hypervisors where IPIs aren't expensive.

Sounds good enough to me, if we can convince the Linux kernel
maintainers that it should avoid IPIs under all hypervisors.

-- 
Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Paolo Bonzini 6 years, 4 months ago
On 29/11/2017 14:35, Roman Kagan wrote:
>>
>>> IMO, the long term solution is to make Linux guests not misbehave
>>> when we stop lying about the L3 cache.  Maybe we could provide a
>>> "IPIs are expensive, please avoid them" hint in the KVM CPUID
>>> leaf?
> We already have it, it's the hypervisor bit ;)  Seriously, I'm unaware
> of hypervisors where IPIs aren't expensive.
> 

In theory, AMD's AVIC should optimize IPIs to running vCPUs.  Amazon's
recently posted patches to disable HLT and MWAIT exits might tilt the
balance in favor of IPIs even for Intel APICv (where sending the IPI is
expensive, but receiving it isn't).

Being able to tie this to Amazon's other proposal, the "DEDICATED" CPUID
bit, would be nice.  My plan was to disable all three of MWAIT/HLT/PAUSE
when setting the dedicated bit.

Paolo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Roman Kagan 6 years, 4 months ago
On Wed, Nov 29, 2017 at 06:15:05PM +0100, Paolo Bonzini wrote:
> On 29/11/2017 14:35, Roman Kagan wrote:
> >>
> >>> IMO, the long term solution is to make Linux guests not misbehave
> >>> when we stop lying about the L3 cache.  Maybe we could provide a
> >>> "IPIs are expensive, please avoid them" hint in the KVM CPUID
> >>> leaf?
> > We already have it, it's the hypervisor bit ;)  Seriously, I'm unaware
> > of hypervisors where IPIs aren't expensive.
> > 
> 
> In theory, AMD's AVIC should optimize IPIs to running vCPUs.  Amazon's
> recently posted patches to disable HLT and MWAIT exits might tilt the
> balance in favor of IPIs even for Intel APICv (where sending the IPI is
> expensive, but receiving it isn't).
> 
> Being able to tie this to Amazon's other proposal, the "DEDICATED" CPUID
> bit, would be nice.  My plan was to disable all three of MWAIT/HLT/PAUSE
> when setting the dedicated bit.

Yes, the IPI cost can hopefully be mitigated in the case of dedicated and
busy vCPUs.

However, in the max density scenario this doesn't help.

Obviously, in the pipe benchmark, scheduling the two ends of the pipe on
different cores is detrimental to performance even on a physical
machine; however, IIUC it was a conscious decision by the scheduler
folks because it provides acceptable latency for mostly-idle systems and
decent performance in more loaded cases.

We wouldn't care about these pipe benchmark numbers per se, because the
latencies are still good for practical purposes.  However, in the case
of virtual machines, this extra overhead of remote scheduling in the
guest results in a slight increase (circa 15% in our Drupal-based test)
in the host cpu consumption by vcpu threads.  That, in turn, means the
host cpu overcommit limit is reached with 15% fewer VMs (and, once
overcommit is reached, the Drupal response latency goes through the
roof, so it's effectively a cut-off for density).

Roman.

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Longpeng (Mike) 6 years, 4 months ago

On 2017/11/29 21:35, Roman Kagan wrote:

> On Wed, Nov 29, 2017 at 07:58:19PM +0800, Longpeng (Mike) wrote:
>> On 2017/11/29 18:41, Eduardo Habkost wrote:
>>> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote:
>>>> On 2017/11/29 5:13, Eduardo Habkost wrote:
>>>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
>>>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
>>>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
>>>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
>>>>>>>> introduced and set by default exposing l3 to the guest.
>>>>>>>>
>>>>>>>> The motivation behind it was that in the Linux scheduler, when waking up
>>>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue
>>>>>>>> directly, without sending a reschedule IPI.  Reduction in the IPI count
>>>>>>>> led to performance gain.
>>>>>>>>
>>>>>>>> However, this isn't the whole story.  Once the task is on the target
>>>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be
>>>>>>>> it the idle task putting the CPU to sleep or just another running task.
>>>>>>>> For that a reschedule IPI will have to be issued, too.  Only when that
>>>>>>>> other CPU is running a normal task for too little time, the fairness
>>>>>>>> constraints will prevent the preemption and thus the IPI.
>>>>>>>>
>>>>
>>>> Agree. :)
>>>>
>>>> Our testing VM is Suse11 guest with idle=poll at that time and now I realize
>>>> that Suse11 has a BUG in its scheduler.
>>>>
>>>> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
>>>> rq->idle is not polling:
>>>> '''
>>>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>>>> {
>>>> 	struct rq *rq = cpu_rq(cpu);
>>>>
>>>> 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>>> 		if (!set_nr_if_polling(rq->idle))
>>>> 			smp_send_reschedule(cpu);
>>>> 		else
>>>> 			trace_sched_wake_idle_without_ipi(cpu);
>>>> 	}
>>>> }
>>>> '''
>>>>
>>>> But for Suse11, it does not check, it send a RES IPI unconditionally.
>>>
>>> So, does that mean no Linux guest benefits from the l3-cache=on
>>> default except SuSE 11 guests?
>>>
>>
>> Not only that, there is another scenario:
>>
>> static void ttwu_queue(...)
>> {
>> 	if (...two cpus NOT sharing L3-cache) {
>> 		...
>> 		ttwu_queue_remote(p, cpu, wake_flags);
>> 		return;
>> 	}
>> 	...
>> 	ttwu_do_activate(rq, p, wake_flags, &rf); <--*Here*
>> 	...
>> }
>>
>> In ttwu_do_activate(), there are also some opportunities with low probability to
>> do not send RES IPI even if the target cpu isn't in IDLE polling state.
> 
> Well it isn't so low actually, what you need is to keep the cpus busy
> switching tasks.  In that case it's not uncommon that the task being
> woken up on a remote cpu has accumulated more vruntime than the task
> already running on that cpu; in that case the new task won't preempt the
> current task and the IPI won't be issued.  E.g. on a RHEL 7.4 guest we
> saw:
> 

I get it, thanks.

>>>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>>>> with many actively switching tasks.  We had no access to the
>>>>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the
>>>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>>>>>>>
>>>>>>>> l3-cache	#res IPI /s	#time / 10000 loops
>>>>>>>> off		560K		1.8 sec
>>>>>>>> on		40K		0.9 sec
> 
> The workload where it bites is mostly idle guest, with chains of
> dependent wakeups, i.e. with little parallelism:
> 
>>>>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>>>>>> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
>>>>>>>> sched pipe -i 100000" gives
>>>>>>>>
>>>>>>>> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
>>>>>>>> off		200 (no K)	230		0.2 sec
>>>>>>>> on		400K		330K		0.5 sec
>>>>>>>>
>>>>
>>>> I guess this issue could be resolved by disable the SD_WAKE_AFFINE.
> 
> Actually, it's SD_WAKE_AFFINE that's being effectively defeated by this
> l3-cache, because the scheduler thinks that the cpus that share the
> last-level cache are close enough that a dependent task can be woken up
> on a sibling cpu.
> 

In this case (sched pipe), without L3-cache, a dependent task is mostly woken up on
the original cpu; if the two tasks run on the same cpu, the dependent task is woken
up without a RES IPI. The related code is:
'''
void resched_curr(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	int cpu = cpu_of(rq);
	...
	if (cpu == smp_processor_id()) {
		set_tsk_need_resched(curr);
		set_preempt_need_resched();
		return;
	}

	if (set_nr_and_not_polling(curr))
		smp_send_reschedule(cpu);
	else
		trace_sched_wake_idle_without_ipi(cpu);
}
'''

Do I understand correctly?  If not, I hope you could point out what's wrong :)

>>>> As Gonglei said:
>>>> 1. the L3 cache relates to the user experience.
>>>
>>> This is true, in a way: I have seen a fair share of user reports
>>> where they incorrectly blame the L3 cache absence or the L3 cache
>>> size for performance problems.
>>>
>>>> 2. the glibc would get the cache info by CPUID directly, and relates to the
>>>> memory performance.
>>>
>>> I'm interested in numbers that demonstrate that.
> 
> Me too.  I vaguely remember debugging a memcpy degradation in the guest
> (on the Parallels proprietary hypervisor), that turned out being due a
> combination of l3 cache size and the cpu topology exposed to the guest,
> which caused glibc to choose an inadequate buffuer size.
> 

We faced the same problem several months ago.


I did some simple tests at noon; it seems the numbers are better without
L3-cache, except for 'perf bench sched messaging'.

VM: 1 socket, 8 cores, 3.10.0 guest kernel
Hardware: Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz

Stream (100 runs):
l3    Copy    Scale    Add    Triad
------------------------------------
off  8025.8  8019.5  8363.1  8589.9
on   8016.7  7999.9  8344.2  8568.9

perf bench sched messaging (100 runs):
l3    Total-time
-----------------
off   0.0238
on    0.0178

perf bench sched pipe (100 runs):
l3    Total-time
-----------------
off   0.3190
on    1.2688

We are very busy at the end of each month, so my tests may be insufficient;
I'm sorry for that.
According to the numbers above, I think it's worth turning off L3-cache by default.


>> Sorry I have no numbers in hand currently :(
>>
>> I'll do some tests these days, please give me some time.
> 
> We'll try to get some data on this, too.
> 
>>>> What's more, the L3 cache relates to the sched_domain which is important to the
>>>> (load) balancer when system is busy.
>>>>
>>>> All this doesn't mean the patch is insignificant, I just think we should do more
>>>> research before decide. I'll do some tests, thanks. :)
>>>
>>> Yes, we need more data.  But if we find out that there are no
>>> cases where the l3-cache=on default actually improves
>>> performance, I will be willing to apply this patch.
>>>
>>
>> That's a good thing if we find the truth, it's free. :)
>>
>> OTOH, I think we should notice that: Linux is designed on real hardware, maybe
>> there're some other problems if QEMU lacks some related features. If we search
>> 'cpus_share_cache' in the Linux kernel, we can see that it's also used by Block
>> Layer.
>>
>>> IMO, the long term solution is to make Linux guests not misbehave
>>> when we stop lying about the L3 cache.  Maybe we could provide a
>>> "IPIs are expensive, please avoid them" hint in the KVM CPUID
>>> leaf?
> 
> We already have it, it's the hypervisor bit ;)  Seriously, I'm unaware
> of hypervisors where IPIs aren't expensive.
> 
>> Maybe more PV features could be digged.
> 
> One problem with this is that PV features are hard to get into other
> guest OSes or existing Linux guests.
> 

Some cloud providers (e.g. Amazon, Alibaba, ...) provide a customized guest kernel
which can include more PV features to get the best possible performance.

> Roman.
> 
> 


-- 
Regards,
Longpeng(Mike)


Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Roman Kagan 6 years, 4 months ago
On Tue, Nov 28, 2017 at 07:13:26PM -0200, Eduardo Habkost wrote:
> [CCing the people who were copied in the original patch that
> enabled l3cache]

Thanks, and sorry to have forgotten to do so in the patch!

> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> > >> introduced and set by default exposing l3 to the guest.
> > >>
> > >> The motivation behind it was that in the Linux scheduler, when waking up
> > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> > >> led to performance gain.
> > >>
> > >> However, this isn't the whole story.  Once the task is on the target
> > >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> > >> it the idle task putting the CPU to sleep or just another running task.
> > >> For that a reschedule IPI will have to be issued, too.  Only when that
> > >> other CPU is running a normal task for too little time, the fairness
> > >> constraints will prevent the preemption and thus the IPI.
> > >>
> > >> This boils down to the improvement being only achievable in workloads
> > >> with many actively switching tasks.  We had no access to the
> > >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> > >> pattern is also reproduced with "perf bench sched messaging -g 1"
> > >> on 1 socket, 8 cores vCPU topology, we see indeed:
> > >>
> > >> l3-cache	#res IPI /s	#time / 10000 loops
> > >> off		560K		1.8 sec
> > >> on		40K		0.9 sec
> > >>
> > >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> > >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> > >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> > >> sched pipe -i 100000" gives
> > >>
> > >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> > >> off		200 (no K)	230		0.2 sec
> > >> on		400K		330K		0.5 sec
> > >>
> > >> In a more realistic test, we observe 15% degradation in VM density
> > >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> > >> requests per second to its main page, with 95%-percentile response
> > >> latency under 100 ms) with l3-cache=on.
> > >>
> > >> We think that mostly-idle scenario is more common in cloud and personal
> > >> usage, and should be optimized for by default; users of highly loaded
> > >> VMs should be able to tune them up themselves.
> > >>
> > > There's one thing I don't understand in your test case: if you
> > > just found out that Linux will behave worse if it assumes that
> > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > 8-core VCPU topology explicitly?
> > >
> > > Do you still see a difference in the numbers if you use "-smp 8"
> > > with no "cores" and "threads" options?

No, we don't.  The guest scheduler makes no such optimizations (which turn out
to be pessimizations) in this case.

> > This is quite simple. A lot of software licenses are bound to the amount
> > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > with 1 socket/xx cores to reduce the amount of money necessary to
> > be paid for the software.
> 
> In this case it looks like we're talking about the expected
> meaning of "cores=N".  My first interpretation would be that the
> user obviously want the guest to see the multiple cores sharing a
> L3 cache, because that's how real CPUs normally work.  But I see
> why you have different expectations.
> 
> Numbers on dedicated-pCPU scenarios would be helpful to guide the
> decision.  I wouldn't like to cause a performance regression for
> users that fine-tuned vCPU topology and set up CPU pinning.

We're speaking about the default setting; those who do fine-tuning are
not going to lose the ability to configure l3-cache.

Dedicated pCPU scenarios are the opposite of density ones.  Besides, I'm
still struggling to see why dedicated pCPUs would change anything in the
guest scheduler behavior.  (I hope Denis (Plotnikov) will be able to
redo the measurements with dedicated pCPUs and provide the numbers
regardless of my opinion, though.)

Our main problem at the moment is that our users see a performance
degradation in our default (arguably suboptimal) configuration when
transitioning to a newer version of QEMU; that's why we're suggesting
setting the default to match the pre-l3-cache behavior.  (Well, they would
if we didn't apply this change to our version of QEMU.)

Roman.

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Eduardo Habkost 6 years, 4 months ago
On Wed, Nov 29, 2017 at 08:46:55AM +0300, Roman Kagan wrote:
> On Tue, Nov 28, 2017 at 07:13:26PM -0200, Eduardo Habkost wrote:
> > [CCing the people who were copied in the original patch that
> > enabled l3cache]
> 
> Thanks, and sorry to have forgotten to do so in the patch!
> 
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> > > >> introduced and set by default exposing l3 to the guest.
> > > >>
> > > >> The motivation behind it was that in the Linux scheduler, when waking up
> > > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > > >> directly, without sending a reschedule IPI.  Reduction in the IPI count
> > > >> led to performance gain.
> > > >>
> > > >> However, this isn't the whole story.  Once the task is on the target
> > > >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> > > >> it the idle task putting the CPU to sleep or just another running task.
> > > >> For that a reschedule IPI will have to be issued, too.  Only when that
> > > >> other CPU is running a normal task for too little time, the fairness
> > > >> constraints will prevent the preemption and thus the IPI.
> > > >>
> > > >> This boils down to the improvement being only achievable in workloads
> > > >> with many actively switching tasks.  We had no access to the
> > > >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> > > >> pattern is also reproduced with "perf bench sched messaging -g 1"
> > > >> on 1 socket, 8 cores vCPU topology, we see indeed:
> > > >>
> > > >> l3-cache	#res IPI /s	#time / 10000 loops
> > > >> off		560K		1.8 sec
> > > >> on		40K		0.9 sec
> > > >>
> > > >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> > > >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> > > >> interactions and therefore exessive halts and IPIs.  E.g. "perf bench
> > > >> sched pipe -i 100000" gives
> > > >>
> > > >> l3-cache	#res IPI /s	#HLT /s		#time /100000 loops
> > > >> off		200 (no K)	230		0.2 sec
> > > >> on		400K		330K		0.5 sec
> > > >>
> > > >> In a more realistic test, we observe 15% degradation in VM density
> > > >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> > > >> requests per second to its main page, with 95%-percentile response
> > > >> latency under 100 ms) with l3-cache=on.
> > > >>
> > > >> We think that mostly-idle scenario is more common in cloud and personal
> > > >> usage, and should be optimized for by default; users of highly loaded
> > > >> VMs should be able to tune them up themselves.
> > > >>
> > > > There's one thing I don't understand in your test case: if you
> > > > just found out that Linux will behave worse if it assumes that
> > > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > > 8-core VCPU topology explicitly?
> > > >
> > > > Do you still see a difference in the numbers if you use "-smp 8"
> > > > with no "cores" and "threads" options?
> 
> No we don't.  The guest scheduler makes no optimizations (which turn out
> pessimizations) in this case.
> 
> > > This is quite simple. A lot of software licenses are bound to the amount
> > > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > > with 1 socket/xx cores to reduce the amount of money necessary to
> > > be paid for the software.
> > 
> > In this case it looks like we're talking about the expected
> > meaning of "cores=N".  My first interpretation would be that the
> > user obviously want the guest to see the multiple cores sharing a
> > L3 cache, because that's how real CPUs normally work.  But I see
> > why you have different expectations.
> > 
> > Numbers on dedicated-pCPU scenarios would be helpful to guide the
> > decision.  I wouldn't like to cause a performance regression for
> > users that fine-tuned vCPU topology and set up CPU pinning.
> 
> We're speaking about the default setting; those who do fine-tuning are
> not going to lose the ability to configure l3-cache.
> 
> Dedicated pCPU scenarios are the opposite of density ones.  Besides I'm
> still struggling to see why dedicated pCPUs would change anything in the
> guest scheduler behavior.  (I hope Denis (Plotnikov) should be able to
> redo the measurements with dedicated pCPUs and provide the numbers
> regardless of my opinion though.)

I'm interested in confirming that even if we make the vCPUS
topology+placement (including cores/thread count and cache
topology) reflect the host topology very closely, Linux guests'
behavior is still not optimal.

If we confirm that things get worse on all cases where we present
L3 cache info the Linux guests, we have another argument to try
to make Linux guests behave better.

It might be impossible to fix this solely on the host side if the
L3 cache info triggers desirable guest-side optimizations
somewhere else.

> 
> Our main problem at the moment is that our users get a performance
> degradation in our default (arguably suboptimal) configuration when
> transitioning to a newer version of QEMU, that's why we're suggesting to
> set the default to match the pre-l3-cache behavior.  (Well, they would
> if we didn't apply this change to our version of QEMU.)
> 
> Roman.

-- 
Eduardo

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Michael S. Tsirkin 6 years, 4 months ago
On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > There's one thing I don't understand in your test case: if you
> > just found out that Linux will behave worse if it assumes that
> > the VCPUs are sharing a L3 cache, why are you configuring a
> > 8-core VCPU topology explicitly?
> >
> > Do you still see a difference in the numbers if you use "-smp 8"
> > with no "cores" and "threads" options?
> >
> This is quite simple. A lot of software licenses are bound to the amount
> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> with 1 socket/xx cores to reduce the amount of money necessary to
> be paid for the software.

This answers the first question but not the second one.

If the answer to the second one is negative, then I don't understand why
changing the default makes sense.


I would expect qemu by default to be as close to emulating a physical
system as possible.  If one has to deviate from that for some workloads,
that is fine, but probably not a good default.

-- 
MST

Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Posted by Roman Kagan 6 years, 4 months ago
On Wed, Nov 29, 2017 at 06:17:40AM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > There's one thing I don't understand in your test case: if you
> > > just found out that Linux will behave worse if it assumes that
> > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > 8-core VCPU topology explicitly?
> > >
> > > Do you still see a difference in the numbers if you use "-smp 8"
> > > with no "cores" and "threads" options?
> > >
> > This is quite simple. A lot of software licenses are bound to the amount
> > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > with 1 socket/xx cores to reduce the amount of money necessary to
> > be paid for the software.
> 
> This answers the first question but not the second one.

The answer to the second one is no, there's no difference (from memory,
I don't have the numbers at hand.)

> If the answer to the second one is negative, then I don't understand why
> changing the default makes sense.

Because the setting adversely affected performance in some
configurations?

That said, we may be well past the point where that would matter because
the new setting has been there for a few releases already...

> I would expect qemu by default to be as close to emulating a physical
> system as possible.  If one has to deviate from that for some workloads,
> that is fine, but probably not a good default.

Thanks,
Roman.