Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
introduced the l3-cache property and enabled it by default, exposing an
L3 cache to the guest.
The motivation behind it was that in the Linux scheduler, when waking up
a task on a sibling CPU, the task was put onto the target CPU's runqueue
directly, without sending a reschedule IPI. The reduction in the IPI
count led to a performance gain.
However, this isn't the whole story. Once the task is on the target
CPU's runqueue, it may have to preempt the current task on that CPU, be
it the idle task putting the CPU to sleep or just another running task.
For that, a reschedule IPI will have to be issued, too. Only when the
other CPU has been running a normal task for too short a time will the
fairness constraints prevent the preemption, and thus the IPI.
This boils down to the improvement being achievable only in workloads
with many actively switching tasks. We had no access to the
(proprietary?) SAP HANA benchmark the commit referred to, but the
pattern is also reproduced with "perf bench sched messaging -g 1" on a
1-socket, 8-core vCPU topology, where we indeed see:
l3-cache    #res IPI /s    #time / 10000 loops
off         560K           1.8 sec
on          40K            0.9 sec
Now there's a downside: with an L3 cache, the Linux scheduler is more
eager to wake up tasks on sibling CPUs, resulting in unnecessary
cross-vCPU interactions and therefore excessive halts and IPIs. E.g.
"perf bench sched pipe -i 100000" gives
l3-cache    #res IPI /s    #HLT /s    #time / 100000 loops
off         200 (no K)     230        0.2 sec
on          400K           330K       0.5 sec
In a more realistic test, we observe a 15% degradation in VM density
(measured as the number of VMs, each running Drupal CMS serving 2 HTTP
requests per second to its main page, with 95th-percentile response
latency under 100 ms) with l3-cache=on.
We think the mostly-idle scenario is more common in cloud and personal
usage, and should be optimized for by default; users of highly loaded
VMs should be able to tune them up themselves.
So switch l3-cache off by default, and add a compat clause for the range
of machine types where it was on.
Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
---
include/hw/i386/pc.h | 7 ++++++-
target/i386/cpu.c | 2 +-
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 087d184..1d2dcae 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -375,7 +375,12 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
.driver = TYPE_X86_CPU,\
.property = "x-hv-max-vps",\
.value = "0x40",\
- },
+ },\
+ {\
+ .driver = TYPE_X86_CPU,\
+ .property = "l3-cache",\
+ .value = "on",\
+ },\
#define PC_COMPAT_2_9 \
HW_COMPAT_2_9 \
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 1edcf29..95a51bd 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4154,7 +4154,7 @@ static Property x86_cpu_properties[] = {
DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
- DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
+ DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, false),
DEFINE_PROP_BOOL("kvm-no-smi-migration", X86CPU, kvm_no_smi_migration,
false),
DEFINE_PROP_BOOL("vmware-cpuid-freq", X86CPU, vmware_cpuid_freq, true),
--
2.7.4
On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> introduced and set by default exposing l3 to the guest.
[...]
> So switch l3-cache off by default, and add a compat clause for the range
> of machine types where it was on.
>
> Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>

Pls add a new machine type so 2.11 can keep it on by default.

[...]
On 28/11/2017 19:54, Michael S. Tsirkin wrote:
>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>> interactions and therefore exessive halts and IPIs.
[...]
>> We think that mostly-idle scenario is more common in cloud and personal
>> usage, and should be optimized for by default; users of highly loaded
>> VMs should be able to tune them up themselves.

Hi Denis,

thanks for the report. I think there are two cases:

1) The dedicated pCPU case: do you still get the performance degradation
with dedicated pCPUs?

2) The non-dedicated pCPU case: do you still get the performance
degradation with threads=1? If not, why do you have sibling vCPUs at
all, if you don't have a dedicated physical CPU for each vCPU?

Thanks,

Paolo
On Tue, Nov 28, 2017 at 08:50:54PM +0100, Paolo Bonzini wrote:
[...]
> 2) The non-dedicated pCPU case: do you still get the performance
> degradation with threads=1? If not, why do you have sibling vCPUs at
> all, if you don't have a dedicated physical CPU for each vCPU?

I assume you mean cores=1,threads=1?

Even if the pCPUs are dedicated, I would still like to see a comparison
between cores=1,threads=1,l3-cache=off and cores=1,threads=1,l3-cache=off.
Maybe configuring cores > 1 isn't really helpful in some use cases.

--
Eduardo
On Tue, Nov 28, 2017 at 08:50:54PM +0100, Paolo Bonzini wrote:
[...]
> 1) The dedicated pCPU case: do you still get the performance degradation
> with dedicated pCPUs?

I wonder why dedicated pCPU would matter at all? The behavior change is
in the guest scheduler.

> 2) The non-dedicated pCPU case: do you still get the performance
> degradation with threads=1? If not, why do you have sibling vCPUs at
> all, if you don't have a dedicated physical CPU for each vCPU?

We have sibling vCPUs in terms of cores, not threads. I.e. the
configuration in the test was sockets=1,cores=8,threads=1.

Are you suggesting that it shouldn't be used without pCPU binding?

Roman.
Hi,

On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
[...]
> We think that mostly-idle scenario is more common in cloud and personal
> usage, and should be optimized for by default; users of highly loaded
> VMs should be able to tune them up themselves.

There's one thing I don't understand in your test case: if you just
found out that Linux will behave worse if it assumes that the VCPUs are
sharing a L3 cache, why are you configuring a 8-core VCPU topology
explicitly?

Do you still see a difference in the numbers if you use "-smp 8" with no
"cores" and "threads" options?

[...]

--
Eduardo
On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
[...]
> There's one thing I don't understand in your test case: if you
> just found out that Linux will behave worse if it assumes that
> the VCPUs are sharing a L3 cache, why are you configuring a
> 8-core VCPU topology explicitly?
>
> Do you still see a difference in the numbers if you use "-smp 8"
> with no "cores" and "threads" options?

This is quite simple. A lot of software licenses are bound to the amount
of CPU __sockets__. Thus it is mandatory in a lot of cases to set
topology with 1 socket/xx cores to reduce the amount of money necessary
to be paid for the software.

Den
[CCing the people who were copied in the original patch that enabled
l3cache]

On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
[...]
> This is quite simple. A lot of software licenses are bound to the amount
> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> with 1 socket/xx cores to reduce the amount of money necessary to
> be paid for the software.

In this case it looks like we're talking about the expected meaning of
"cores=N". My first interpretation would be that the user obviously
wants the guest to see the multiple cores sharing a L3 cache, because
that's how real CPUs normally work. But I see why you have different
expectations.

Numbers on dedicated-pCPU scenarios would be helpful to guide the
decision. I wouldn't like to cause a performance regression for users
that fine-tuned vCPU topology and set up CPU pinning.

--
Eduardo
> -----Original Message-----
> From: Eduardo Habkost [mailto:ehabkost@redhat.com]
> Sent: Wednesday, November 29, 2017 5:13 AM
> Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
>
> [CCing the people who were copied in the original patch that
> enabled l3cache]

Thanks for Ccing.

> > > >> The motivation behind it was that in the Linux scheduler, when waking up
> > > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > > >> directly, without sending a reschedule IPI. Reduction in the IPI count
> > > >> led to performance gain.

Yes, that's one thing.

The other reason for enabling the L3 cache is the performance of
accessing memory. We tested it with the Stream benchmark; the
performance is better with l3-cache=on.

> > > >> We think that mostly-idle scenario is more common in cloud and personal
> > > >> usage, and should be optimized for by default; users of highly loaded
> > > >> VMs should be able to tune them up themselves.

For current public cloud providers, they usually provide different
instances, including shared instances and dedicated instances.

And the public cloud tenants usually want the L3 cache; even bigger is
better.

Basically, all performance tuning targets specific scenarios; we only
need to ensure a benefit in most of them.

Thanks,
-Gonglei

[...]
On Wed, Nov 29, 2017 at 01:57:14AM +0000, Gonglei (Arei) wrote:
[...]
> Yes, that's one thing.
>
> The other reason for enabling L3 cache is the performance of accessing memory.

I guess you're talking about the super-smart buffer size tuning glibc
does in its memcpy and friends. We try to control that with an atomic
test for memcpy, and we didn't notice a difference. We'll need to
double-check...

> We tested it by Stream benchmark, the performance is better with L3-cache=on.

This one: https://www.cs.virginia.edu/stream/ ? Thanks, we'll have a
look, too.

[...]

> For currently public cloud providers, they usually provide different instances,
> Including sharing instances and dedicated instances.
>
> And the public cloud tenants usually want the L3 cache, even bigger is better.
>
> Basically all performance tuning target to specific scenarios,
> we only need to ensure benefit in most scenes.

There's no doubt the ability to configure l3-cache is useful. The
question is what the default value should be.

Thanks,
Roman.
> -----Original Message-----
> From: rkagan@virtuozzo.com [mailto:rkagan@virtuozzo.com]
> Sent: Wednesday, November 29, 2017 1:56 PM
> Subject: Re: [Qemu-devel] [PATCH] i386: turn off l3-cache property by default
[...]
> > We tested it by Stream benchmark, the performance is better with
> > L3-cache=on.
>
> This one: https://www.cs.virginia.edu/stream/ ? Thanks, we'll have a
> look, too.

Yes. :)

Thanks,
-Gonglei
On 2017/11/29 5:13, Eduardo Habkost wrote: > [CCing the people who were copied in the original patch that > enabled l3cache] > > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: >>> Hi, >>> >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" >>>> introduced and set by default exposing l3 to the guest. >>>> >>>> The motivation behind it was that in the Linux scheduler, when waking up >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue >>>> directly, without sending a reschedule IPI. Reduction in the IPI count >>>> led to performance gain. >>>> >>>> However, this isn't the whole story. Once the task is on the target >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be >>>> it the idle task putting the CPU to sleep or just another running task. >>>> For that a reschedule IPI will have to be issued, too. Only when that >>>> other CPU is running a normal task for too little time, the fairness >>>> constraints will prevent the preemption and thus the IPI. >>>> Agree. :) Our testing VM is Suse11 guest with idle=poll at that time and now I realize that Suse11 has a BUG in its scheduler. For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if rq->idle is not polling: ''' static void ttwu_queue_remote(struct task_struct *p, int cpu) { struct rq *rq = cpu_rq(cpu); if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { if (!set_nr_if_polling(rq->idle)) smp_send_reschedule(cpu); else trace_sched_wake_idle_without_ipi(cpu); } } ''' But for Suse11, it does not check, it send a RES IPI unconditionally. >>>> This boils down to the improvement being only achievable in workloads >>>> with many actively switching tasks. We had no access to the >>>> (proprietary?) 
SAP HANA benchmark the commit referred to, but the >>>> pattern is also reproduced with "perf bench sched messaging -g 1" >>>> on 1 socket, 8 cores vCPU topology, we see indeed: >>>> >>>> l3-cache #res IPI /s #time / 10000 loops >>>> off 560K 1.8 sec >>>> on 40K 0.9 sec >>>> >>>> Now there's a downside: with L3 cache the Linux scheduler is more eager >>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU >>>> interactions and therefore exessive halts and IPIs. E.g. "perf bench >>>> sched pipe -i 100000" gives >>>> >>>> l3-cache #res IPI /s #HLT /s #time /100000 loops >>>> off 200 (no K) 230 0.2 sec >>>> on 400K 330K 0.5 sec >>>> I guess this issue could be resolved by disabling SD_WAKE_AFFINE. As Gonglei said: 1. the L3 cache relates to the user experience. 2. glibc gets the cache info by CPUID directly, which relates to memory performance. What's more, the L3 cache relates to the sched_domain which is important to the (load) balancer when the system is busy. All this doesn't mean the patch is insignificant, I just think we should do more research before deciding. I'll do some tests, thanks. :) >>>> In a more realistic test, we observe 15% degradation in VM density >>>> (measured as the number of VMs, each running Drupal CMS serving 2 http >>>> requests per second to its main page, with 95%-percentile response >>>> latency under 100 ms) with l3-cache=on. >>>> >>>> We think that mostly-idle scenario is more common in cloud and personal >>>> usage, and should be optimized for by default; users of highly loaded >>>> VMs should be able to tune them up themselves. >>>> >>> There's one thing I don't understand in your test case: if you >>> just found out that Linux will behave worse if it assumes that >>> the VCPUs are sharing a L3 cache, why are you configuring a >>> 8-core VCPU topology explicitly? >>> >>> Do you still see a difference in the numbers if you use "-smp 8" >>> with no "cores" and "threads" options? >>> >> This is quite simple. 
A lot of software licenses are bound to the amount >> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology >> with 1 socket/xx cores to reduce the amount of money necessary to >> be paid for the software. > > In this case it looks like we're talking about the expected > meaning of "cores=N". My first interpretation would be that the > user obviously want the guest to see the multiple cores sharing a > L3 cache, because that's how real CPUs normally work. But I see > why you have different expectations. > > Numbers on dedicated-pCPU scenarios would be helpful to guide the > decision. I wouldn't like to cause a performance regression for > users that fine-tuned vCPU topology and set up CPU pinning. > -- Regards, Longpeng(Mike)
On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote: > On 2017/11/29 5:13, Eduardo Habkost wrote: > > > [CCing the people who were copied in the original patch that > > enabled l3cache] > > > > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: > >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: > >>> Hi, > >>> > >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: > >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" > >>>> introduced and set by default exposing l3 to the guest. > >>>> > >>>> The motivation behind it was that in the Linux scheduler, when waking up > >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue > >>>> directly, without sending a reschedule IPI. Reduction in the IPI count > >>>> led to performance gain. > >>>> > >>>> However, this isn't the whole story. Once the task is on the target > >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be > >>>> it the idle task putting the CPU to sleep or just another running task. > >>>> For that a reschedule IPI will have to be issued, too. Only when that > >>>> other CPU is running a normal task for too little time, the fairness > >>>> constraints will prevent the preemption and thus the IPI. > >>>> > > Agree. :) > > Our testing VM is Suse11 guest with idle=poll at that time and now I realize ^^^^^^^^^ Oh, that's a whole lot of a difference! I wish you mentioned that in that patch. > that Suse11 has a BUG in its scheduler. 
> > For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if > rq->idle is not polling: > ''' > static void ttwu_queue_remote(struct task_struct *p, int cpu) > { > struct rq *rq = cpu_rq(cpu); > > if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { > if (!set_nr_if_polling(rq->idle)) > smp_send_reschedule(cpu); > else > trace_sched_wake_idle_without_ipi(cpu); > } > } > ''' > > But for Suse11, it does not check, it send a RES IPI unconditionally. > > >>>> This boils down to the improvement being only achievable in workloads > >>>> with many actively switching tasks. We had no access to the > >>>> (proprietary?) SAP HANA benchmark the commit referred to, but the > >>>> pattern is also reproduced with "perf bench sched messaging -g 1" > >>>> on 1 socket, 8 cores vCPU topology, we see indeed: > >>>> > >>>> l3-cache #res IPI /s #time / 10000 loops > >>>> off 560K 1.8 sec > >>>> on 40K 0.9 sec > >>>> > >>>> Now there's a downside: with L3 cache the Linux scheduler is more eager > >>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU > >>>> interactions and therefore exessive halts and IPIs. E.g. "perf bench > >>>> sched pipe -i 100000" gives > >>>> > >>>> l3-cache #res IPI /s #HLT /s #time /100000 loops > >>>> off 200 (no K) 230 0.2 sec > >>>> on 400K 330K 0.5 sec > >>>> > > I guess this issue could be resolved by disable the SD_WAKE_AFFINE. But that requires extra tuning in the guest which is even less likely to happen in the cloud case when VM admin != host admin. > As Gonglei said: > 1. the L3 cache relates to the user experience. > 2. the glibc would get the cache info by CPUID directly, and relates to the > memory performance. > > What's more, the L3 cache relates to the sched_domain which is important to the > (load) balancer when system is busy. > > All this doesn't mean the patch is insignificant, I just think we should do more > research before decide. I'll do some tests, thanks. :) Looking forward to it, thanks! 
Roman.
On 2017/11/29 14:01, Roman Kagan wrote: > On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote: >> On 2017/11/29 5:13, Eduardo Habkost wrote: >> >>> [CCing the people who were copied in the original patch that >>> enabled l3cache] >>> >>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: >>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: >>>>> Hi, >>>>> >>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: >>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" >>>>>> introduced and set by default exposing l3 to the guest. >>>>>> >>>>>> The motivation behind it was that in the Linux scheduler, when waking up >>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue >>>>>> directly, without sending a reschedule IPI. Reduction in the IPI count >>>>>> led to performance gain. >>>>>> >>>>>> However, this isn't the whole story. Once the task is on the target >>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be >>>>>> it the idle task putting the CPU to sleep or just another running task. >>>>>> For that a reschedule IPI will have to be issued, too. Only when that >>>>>> other CPU is running a normal task for too little time, the fairness >>>>>> constraints will prevent the preemption and thus the IPI. >>>>>> >> >> Agree. :) >> >> Our testing VM is Suse11 guest with idle=poll at that time and now I realize > ^^^^^^^^^ > Oh, that's a whole lot of a difference! I wish you mentioned that in > that patch. > :( Sorry for missing that... >> that Suse11 has a BUG in its scheduler. 
>> >> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if >> rq->idle is not polling: >> ''' >> static void ttwu_queue_remote(struct task_struct *p, int cpu) >> { >> struct rq *rq = cpu_rq(cpu); >> >> if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { >> if (!set_nr_if_polling(rq->idle)) >> smp_send_reschedule(cpu); >> else >> trace_sched_wake_idle_without_ipi(cpu); >> } >> } >> ''' >> >> But for Suse11, it does not check, it send a RES IPI unconditionally. >> >>>>>> This boils down to the improvement being only achievable in workloads >>>>>> with many actively switching tasks. We had no access to the >>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the >>>>>> pattern is also reproduced with "perf bench sched messaging -g 1" >>>>>> on 1 socket, 8 cores vCPU topology, we see indeed: >>>>>> >>>>>> l3-cache #res IPI /s #time / 10000 loops >>>>>> off 560K 1.8 sec >>>>>> on 40K 0.9 sec >>>>>> >>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager >>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU >>>>>> interactions and therefore exessive halts and IPIs. E.g. "perf bench >>>>>> sched pipe -i 100000" gives >>>>>> >>>>>> l3-cache #res IPI /s #HLT /s #time /100000 loops >>>>>> off 200 (no K) 230 0.2 sec >>>>>> on 400K 330K 0.5 sec >>>>>> >> >> I guess this issue could be resolved by disable the SD_WAKE_AFFINE. > > But that requires extra tuning in the guest which is even less likely to > happen in the cloud case when VM admin != host admin. > Ah, yep, that's a problem. >> As Gonglei said: >> 1. the L3 cache relates to the user experience. >> 2. the glibc would get the cache info by CPUID directly, and relates to the >> memory performance. >> >> What's more, the L3 cache relates to the sched_domain which is important to the >> (load) balancer when system is busy. 
>> >> All this doesn't mean the patch is insignificant, I just think we should do more >> research before decide. I'll do some tests, thanks. :) > > Looking forward to it, thanks! > Roman. > > -- Regards, Longpeng(Mike)
On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote: > On 2017/11/29 5:13, Eduardo Habkost wrote: > > [CCing the people who were copied in the original patch that > > enabled l3cache] > > > > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: > >> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: > >>> Hi, > >>> > >>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: > >>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" > >>>> introduced and set by default exposing l3 to the guest. > >>>> > >>>> The motivation behind it was that in the Linux scheduler, when waking up > >>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue > >>>> directly, without sending a reschedule IPI. Reduction in the IPI count > >>>> led to performance gain. > >>>> > >>>> However, this isn't the whole story. Once the task is on the target > >>>> CPU's runqueue, it may have to preempt the current task on that CPU, be > >>>> it the idle task putting the CPU to sleep or just another running task. > >>>> For that a reschedule IPI will have to be issued, too. Only when that > >>>> other CPU is running a normal task for too little time, the fairness > >>>> constraints will prevent the preemption and thus the IPI. > >>>> > > Agree. :) > > Our testing VM is Suse11 guest with idle=poll at that time and now I realize > that Suse11 has a BUG in its scheduler. > > For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if > rq->idle is not polling: > ''' > static void ttwu_queue_remote(struct task_struct *p, int cpu) > { > struct rq *rq = cpu_rq(cpu); > > if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { > if (!set_nr_if_polling(rq->idle)) > smp_send_reschedule(cpu); > else > trace_sched_wake_idle_without_ipi(cpu); > } > } > ''' > > But for Suse11, it does not check, it send a RES IPI unconditionally. 
So, does that mean no Linux guest benefits from the l3-cache=on default except SuSE 11 guests? > > >>>> This boils down to the improvement being only achievable in workloads > >>>> with many actively switching tasks. We had no access to the > >>>> (proprietary?) SAP HANA benchmark the commit referred to, but the > >>>> pattern is also reproduced with "perf bench sched messaging -g 1" > >>>> on 1 socket, 8 cores vCPU topology, we see indeed: > >>>> > >>>> l3-cache #res IPI /s #time / 10000 loops > >>>> off 560K 1.8 sec > >>>> on 40K 0.9 sec > >>>> > >>>> Now there's a downside: with L3 cache the Linux scheduler is more eager > >>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU > >>>> interactions and therefore exessive halts and IPIs. E.g. "perf bench > >>>> sched pipe -i 100000" gives > >>>> > >>>> l3-cache #res IPI /s #HLT /s #time /100000 loops > >>>> off 200 (no K) 230 0.2 sec > >>>> on 400K 330K 0.5 sec > >>>> > > I guess this issue could be resolved by disable the SD_WAKE_AFFINE. > > As Gonglei said: > 1. the L3 cache relates to the user experience. This is true, in a way: I have seen a fair share of user reports where they incorrectly blame the L3 cache absence or the L3 cache size for performance problems. > 2. the glibc would get the cache info by CPUID directly, and relates to the > memory performance. I'm interested in numbers that demonstrate that. > > What's more, the L3 cache relates to the sched_domain which is important to the > (load) balancer when system is busy. > > All this doesn't mean the patch is insignificant, I just think we should do more > research before decide. I'll do some tests, thanks. :) Yes, we need more data. But if we find out that there are no cases where the l3-cache=on default actually improves performance, I will be willing to apply this patch. IMO, the long term solution is to make Linux guests not misbehave when we stop lying about the L3 cache. 
Maybe we could provide a "IPIs are expensive, please avoid them" hint in the KVM CPUID leaf? > > >>>> In a more realistic test, we observe 15% degradation in VM density > >>>> (measured as the number of VMs, each running Drupal CMS serving 2 http > >>>> requests per second to its main page, with 95%-percentile response > >>>> latency under 100 ms) with l3-cache=on. > >>>> > >>>> We think that mostly-idle scenario is more common in cloud and personal > >>>> usage, and should be optimized for by default; users of highly loaded > >>>> VMs should be able to tune them up themselves. > >>>> > >>> There's one thing I don't understand in your test case: if you > >>> just found out that Linux will behave worse if it assumes that > >>> the VCPUs are sharing a L3 cache, why are you configuring a > >>> 8-core VCPU topology explicitly? > >>> > >>> Do you still see a difference in the numbers if you use "-smp 8" > >>> with no "cores" and "threads" options? > >>> > >> This is quite simple. A lot of software licenses are bound to the amount > >> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology > >> with 1 socket/xx cores to reduce the amount of money necessary to > >> be paid for the software. > > > > In this case it looks like we're talking about the expected > > meaning of "cores=N". My first interpretation would be that the > > user obviously want the guest to see the multiple cores sharing a > > L3 cache, because that's how real CPUs normally work. But I see > > why you have different expectations. > > > > Numbers on dedicated-pCPU scenarios would be helpful to guide the > > decision. I wouldn't like to cause a performance regression for > > users that fine-tuned vCPU topology and set up CPU pinning. > > > > > -- > Regards, > Longpeng(Mike) > -- Eduardo
On 2017/11/29 18:41, Eduardo Habkost wrote: > On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote: >> On 2017/11/29 5:13, Eduardo Habkost wrote: >>> [CCing the people who were copied in the original patch that >>> enabled l3cache] >>> >>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: >>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: >>>>> Hi, >>>>> >>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: >>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" >>>>>> introduced and set by default exposing l3 to the guest. >>>>>> >>>>>> The motivation behind it was that in the Linux scheduler, when waking up >>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue >>>>>> directly, without sending a reschedule IPI. Reduction in the IPI count >>>>>> led to performance gain. >>>>>> >>>>>> However, this isn't the whole story. Once the task is on the target >>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be >>>>>> it the idle task putting the CPU to sleep or just another running task. >>>>>> For that a reschedule IPI will have to be issued, too. Only when that >>>>>> other CPU is running a normal task for too little time, the fairness >>>>>> constraints will prevent the preemption and thus the IPI. >>>>>> >> >> Agree. :) >> >> Our testing VM is Suse11 guest with idle=poll at that time and now I realize >> that Suse11 has a BUG in its scheduler. >> >> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if >> rq->idle is not polling: >> ''' >> static void ttwu_queue_remote(struct task_struct *p, int cpu) >> { >> struct rq *rq = cpu_rq(cpu); >> >> if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { >> if (!set_nr_if_polling(rq->idle)) >> smp_send_reschedule(cpu); >> else >> trace_sched_wake_idle_without_ipi(cpu); >> } >> } >> ''' >> >> But for Suse11, it does not check, it send a RES IPI unconditionally. 
> > So, does that mean no Linux guest benefits from the l3-cache=on > default except SuSE 11 guests? > Not only that, there is another scenario: static void ttwu_queue(...) { if (...two cpus NOT sharing L3-cache) { ... ttwu_queue_remote(p, cpu, wake_flags); return; } ... ttwu_do_activate(rq, p, wake_flags, &rf); <--*Here* ... } In ttwu_do_activate(), there are also some low-probability opportunities to avoid sending a RES IPI even if the target cpu isn't in the IDLE polling state. > >> >>>>>> This boils down to the improvement being only achievable in workloads >>>>>> with many actively switching tasks. We had no access to the >>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the >>>>>> pattern is also reproduced with "perf bench sched messaging -g 1" >>>>>> on 1 socket, 8 cores vCPU topology, we see indeed: >>>>>> >>>>>> l3-cache #res IPI /s #time / 10000 loops >>>>>> off 560K 1.8 sec >>>>>> on 40K 0.9 sec >>>>>> >>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager >>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU >>>>>> interactions and therefore exessive halts and IPIs. E.g. "perf bench >>>>>> sched pipe -i 100000" gives >>>>>> >>>>>> l3-cache #res IPI /s #HLT /s #time /100000 loops >>>>>> off 200 (no K) 230 0.2 sec >>>>>> on 400K 330K 0.5 sec >>>>>> >> >> I guess this issue could be resolved by disable the SD_WAKE_AFFINE. >> >> As Gonglei said: >> 1. the L3 cache relates to the user experience. > > This is true, in a way: I have seen a fair share of user reports > where they incorrectly blame the L3 cache absence or the L3 cache > size for performance problems. > >> 2. the glibc would get the cache info by CPUID directly, and relates to the >> memory performance. > > I'm interested in numbers that demonstrate that. > Sorry I have no numbers in hand currently :( I'll do some tests these days, please give me some time. 
>> >> What's more, the L3 cache relates to the sched_domain which is important to the >> (load) balancer when system is busy. >> >> All this doesn't mean the patch is insignificant, I just think we should do more >> research before decide. I'll do some tests, thanks. :) > > Yes, we need more data. But if we find out that there are no > cases where the l3-cache=on default actually improves > performance, I will be willing to apply this patch. > That's a good thing if we find the truth, it's free. :) OTOH, I think we should notice that Linux is designed for real hardware; maybe there are some other problems if QEMU lacks some related features. If we search 'cpus_share_cache' in the Linux kernel, we can see that it's also used by the block layer. > IMO, the long term solution is to make Linux guests not misbehave > when we stop lying about the L3 cache. Maybe we could provide a > "IPIs are expensive, please avoid them" hint in the KVM CPUID > leaf? > Good idea. :) Maybe more PV features could be dug up. >> >>>>>> In a more realistic test, we observe 15% degradation in VM density >>>>>> (measured as the number of VMs, each running Drupal CMS serving 2 http >>>>>> requests per second to its main page, with 95%-percentile response >>>>>> latency under 100 ms) with l3-cache=on. >>>>>> >>>>>> We think that mostly-idle scenario is more common in cloud and personal >>>>>> usage, and should be optimized for by default; users of highly loaded >>>>>> VMs should be able to tune them up themselves. >>>>>> >>>>> There's one thing I don't understand in your test case: if you >>>>> just found out that Linux will behave worse if it assumes that >>>>> the VCPUs are sharing a L3 cache, why are you configuring a >>>>> 8-core VCPU topology explicitly? >>>>> >>>>> Do you still see a difference in the numbers if you use "-smp 8" >>>>> with no "cores" and "threads" options? >>>>> >>>> This is quite simple. A lot of software licenses are bound to the amount >>>> of CPU __sockets__. 
Thus it is mandatory in a lot of cases to set topology >>>> with 1 socket/xx cores to reduce the amount of money necessary to >>>> be paid for the software. >>> >>> In this case it looks like we're talking about the expected >>> meaning of "cores=N". My first interpretation would be that the >>> user obviously want the guest to see the multiple cores sharing a >>> L3 cache, because that's how real CPUs normally work. But I see >>> why you have different expectations. >>> >>> Numbers on dedicated-pCPU scenarios would be helpful to guide the >>> decision. I wouldn't like to cause a performance regression for >>> users that fine-tuned vCPU topology and set up CPU pinning. >>> >> >> >> -- >> Regards, >> Longpeng(Mike) >> > -- Regards, Longpeng(Mike)
On Wed, Nov 29, 2017 at 07:58:19PM +0800, Longpeng (Mike) wrote: > On 2017/11/29 18:41, Eduardo Habkost wrote: > > On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote: > >> On 2017/11/29 5:13, Eduardo Habkost wrote: > >>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: > >>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: > >>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: > >>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" > >>>>>> introduced and set by default exposing l3 to the guest. > >>>>>> > >>>>>> The motivation behind it was that in the Linux scheduler, when waking up > >>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue > >>>>>> directly, without sending a reschedule IPI. Reduction in the IPI count > >>>>>> led to performance gain. > >>>>>> > >>>>>> However, this isn't the whole story. Once the task is on the target > >>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be > >>>>>> it the idle task putting the CPU to sleep or just another running task. > >>>>>> For that a reschedule IPI will have to be issued, too. Only when that > >>>>>> other CPU is running a normal task for too little time, the fairness > >>>>>> constraints will prevent the preemption and thus the IPI. > >>>>>> > >> > >> Agree. :) > >> > >> Our testing VM is Suse11 guest with idle=poll at that time and now I realize > >> that Suse11 has a BUG in its scheduler. 
> >> > >> For REHL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if > >> rq->idle is not polling: > >> ''' > >> static void ttwu_queue_remote(struct task_struct *p, int cpu) > >> { > >> struct rq *rq = cpu_rq(cpu); > >> > >> if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) { > >> if (!set_nr_if_polling(rq->idle)) > >> smp_send_reschedule(cpu); > >> else > >> trace_sched_wake_idle_without_ipi(cpu); > >> } > >> } > >> ''' > >> > >> But for Suse11, it does not check, it send a RES IPI unconditionally. > > > > So, does that mean no Linux guest benefits from the l3-cache=on > > default except SuSE 11 guests? > > > > Not only that, there is another scenario: > > static void ttwu_queue(...) > { > if (...two cpus NOT sharing L3-cache) { > ... > ttwu_queue_remote(p, cpu, wake_flags); > return; > } > ... > ttwu_do_activate(rq, p, wake_flags, &rf); <--*Here* > ... > } > > In ttwu_do_activate(), there are also some opportunities with low probability to > do not send RES IPI even if the target cpu isn't in IDLE polling state. Well it isn't so low actually, what you need is to keep the cpus busy switching tasks. In that case it's not uncommon that the task being woken up on a remote cpu has accumulated more vruntime than the task already running on that cpu; in that case the new task won't preempt the current task and the IPI won't be issued. E.g. on a RHEL 7.4 guest we saw: > >>>>>> This boils down to the improvement being only achievable in workloads > >>>>>> with many actively switching tasks. We had no access to the > >>>>>> (proprietary?) SAP HANA benchmark the commit referred to, but the > >>>>>> pattern is also reproduced with "perf bench sched messaging -g 1" > >>>>>> on 1 socket, 8 cores vCPU topology, we see indeed: > >>>>>> > >>>>>> l3-cache #res IPI /s #time / 10000 loops > >>>>>> off 560K 1.8 sec > >>>>>> on 40K 0.9 sec The workload where it bites is mostly idle guest, with chains of dependent wakeups, i.e. 
with little parallelism: > >>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager > >>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU > >>>>>> interactions and therefore exessive halts and IPIs. E.g. "perf bench > >>>>>> sched pipe -i 100000" gives > >>>>>> > >>>>>> l3-cache #res IPI /s #HLT /s #time /100000 loops > >>>>>> off 200 (no K) 230 0.2 sec > >>>>>> on 400K 330K 0.5 sec > >>>>>> > >> > >> I guess this issue could be resolved by disable the SD_WAKE_AFFINE. Actually, it's SD_WAKE_AFFINE that's being effectively defeated by this l3-cache, because the scheduler thinks that the cpus that share the last-level cache are close enough that a dependent task can be woken up on a sibling cpu. > >> As Gonglei said: > >> 1. the L3 cache relates to the user experience. > > > > This is true, in a way: I have seen a fair share of user reports > > where they incorrectly blame the L3 cache absence or the L3 cache > > size for performance problems. > > > >> 2. the glibc would get the cache info by CPUID directly, and relates to the > >> memory performance. > > > > I'm interested in numbers that demonstrate that. Me too. I vaguely remember debugging a memcpy degradation in the guest (on the Parallels proprietary hypervisor) that turned out to be due to a combination of l3 cache size and the cpu topology exposed to the guest, which caused glibc to choose an inadequate buffer size. > Sorry I have no numbers in hand currently :( > > I'll do some tests these days, please give me some time. We'll try to get some data on this, too. > >> What's more, the L3 cache relates to the sched_domain which is important to the > >> (load) balancer when system is busy. > >> > >> All this doesn't mean the patch is insignificant, I just think we should do more > >> research before decide. I'll do some tests, thanks. :) > > > > Yes, we need more data. 
But if we find out that there are no > > cases where the l3-cache=on default actually improves > > performance, I will be willing to apply this patch. > > > > That's a good thing if we find the truth, it's free. :) > > OTOH, I think we should notice that: Linux is designed on real hardware, maybe > there're some other problems if QEMU lacks some related features. If we search > 'cpus_share_cache' in the Linux kernel, we can see that it's also used by Block > Layer. > > > IMO, the long term solution is to make Linux guests not misbehave > > when we stop lying about the L3 cache. Maybe we could provide a > > "IPIs are expensive, please avoid them" hint in the KVM CPUID > > leaf? We already have it, it's the hypervisor bit ;) Seriously, I'm unaware of hypervisors where IPIs aren't expensive. > Maybe more PV features could be digged. One problem with this is that PV features are hard to get into other guest OSes or existing Linux guests. Roman.
On Wed, Nov 29, 2017 at 04:35:25PM +0300, Roman Kagan wrote: > > On 2017/11/29 18:41, Eduardo Habkost wrote: [...] > > > IMO, the long term solution is to make Linux guests not misbehave > > > when we stop lying about the L3 cache. Maybe we could provide a > > > "IPIs are expensive, please avoid them" hint in the KVM CPUID > > > leaf? > > We already have it, it's the hypervisor bit ;) Seriously, I'm unaware > of hypervisors where IPIs aren't expensive. Sounds good enough to me, if we can convince the Linux kernel maintainers that it should avoid IPIs under all hypervisors. -- Eduardo
On 29/11/2017 14:35, Roman Kagan wrote: >> >>> IMO, the long term solution is to make Linux guests not misbehave >>> when we stop lying about the L3 cache. Maybe we could provide a >>> "IPIs are expensive, please avoid them" hint in the KVM CPUID >>> leaf? > We already have it, it's the hypervisor bit ;) Seriously, I'm unaware > of hypervisors where IPIs aren't expensive. > In theory, AMD's AVIC should optimize IPIs to running vCPUs. Amazon's recently posted patches to disable HLT and MWAIT exits might tilt the balance in favor of IPIs even for Intel APICv (where sending the IPI is expensive, but receiving it isn't). Being able to tie this to Amazon's other proposal, the "DEDICATED" CPUID bit, would be nice. My plan was to disable all three of MWAIT/HLT/PAUSE when setting the dedicated bit. Paolo
On Wed, Nov 29, 2017 at 06:15:05PM +0100, Paolo Bonzini wrote: > On 29/11/2017 14:35, Roman Kagan wrote: > >> > >>> IMO, the long term solution is to make Linux guests not misbehave > >>> when we stop lying about the L3 cache. Maybe we could provide a > >>> "IPIs are expensive, please avoid them" hint in the KVM CPUID > >>> leaf? > > We already have it, it's the hypervisor bit ;) Seriously, I'm unaware > > of hypervisors where IPIs aren't expensive. > > > > In theory, AMD's AVIC should optimize IPIs to running vCPUs. Amazon's > recently posted patches to disable HLT and MWAIT exits might tilt the > balance in favor of IPIs even for Intel APICv (where sending the IPI is > expensive, but receiving it isn't). > > Being able to tie this to Amazon's other proposal, the "DEDICATED" CPUID > bit, would be nice. My plan was to disable all three of MWAIT/HLT/PAUSE > when setting the dedicated bit. Yes, the IPI cost can hopefully be mitigated in the case of dedicated and busy vCPUs. However, in the max density scenario this doesn't help. Obviously, in the pipe benchmark scheduling the two ends of the pipe on different cores is detrimental for performance even on a physical machine; however, IIUC it was a conscious decision by the scheduler folks because it provides acceptable latency for mostly-idle systems and decent performance in more loaded cases. We wouldn't care about these pipe benchmark numbers per se, because the latencies are still good for practical purposes. However, in the case of virtual machines, this extra overhead of remote scheduling in the guest results in a slight -- circa 15% in our Drupal-based test -- increase of the host cpu consumption by vcpu threads. That, in turn, means the host cpu overcommit point is reached with 15% fewer VMs (and, once overcommit is reached, the drupal response latency goes through the roof, so it's effectively a cut-off for density). Roman.
On 2017/11/29 21:35, Roman Kagan wrote: > On Wed, Nov 29, 2017 at 07:58:19PM +0800, Longpeng (Mike) wrote: >> On 2017/11/29 18:41, Eduardo Habkost wrote: >>> On Wed, Nov 29, 2017 at 01:20:38PM +0800, Longpeng (Mike) wrote: >>>> On 2017/11/29 5:13, Eduardo Habkost wrote: >>>>> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote: >>>>>> On 11/28/2017 10:58 PM, Eduardo Habkost wrote: >>>>>>> On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote: >>>>>>>> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" >>>>>>>> introduced and set by default exposing l3 to the guest. >>>>>>>> >>>>>>>> The motivation behind it was that in the Linux scheduler, when waking up >>>>>>>> a task on a sibling CPU, the task was put onto the target CPU's runqueue >>>>>>>> directly, without sending a reschedule IPI. Reduction in the IPI count >>>>>>>> led to performance gain. >>>>>>>> >>>>>>>> However, this isn't the whole story. Once the task is on the target >>>>>>>> CPU's runqueue, it may have to preempt the current task on that CPU, be >>>>>>>> it the idle task putting the CPU to sleep or just another running task. >>>>>>>> For that a reschedule IPI will have to be issued, too. Only when that >>>>>>>> other CPU is running a normal task for too little time, the fairness >>>>>>>> constraints will prevent the preemption and thus the IPI. >>>>>>>> >>>> >>>> Agree. :) >>>> >>>> Our testing VM is Suse11 guest with idle=poll at that time and now I realize >>>> that Suse11 has a BUG in its scheduler. 
>>>>
>>>> For RHEL 7.3 or upstream kernel, in ttwu_queue_remote(), a RES IPI is issued if
>>>> rq->idle is not polling:
>>>> '''
>>>> static void ttwu_queue_remote(struct task_struct *p, int cpu)
>>>> {
>>>>         struct rq *rq = cpu_rq(cpu);
>>>>
>>>>         if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
>>>>                 if (!set_nr_if_polling(rq->idle))
>>>>                         smp_send_reschedule(cpu);
>>>>                 else
>>>>                         trace_sched_wake_idle_without_ipi(cpu);
>>>>         }
>>>> }
>>>> '''
>>>>
>>>> But Suse11 does not check; it sends a RES IPI unconditionally.
>>>
>>> So, does that mean no Linux guest benefits from the l3-cache=on
>>> default except SuSE 11 guests?
>>>
>>
>> Not only that, there is another scenario:
>>
>> static void ttwu_queue(...)
>> {
>>         if (...two cpus NOT sharing L3-cache) {
>>                 ...
>>                 ttwu_queue_remote(p, cpu, wake_flags);
>>                 return;
>>         }
>>         ...
>>         ttwu_do_activate(rq, p, wake_flags, &rf);  <--*Here*
>>         ...
>> }
>>
>> In ttwu_do_activate(), there are also some opportunities, with low probability,
>> to not send a RES IPI even if the target cpu isn't in IDLE polling state.
>
> Well it isn't so low actually, what you need is to keep the cpus busy
> switching tasks. In that case it's not uncommon that the task being
> woken up on a remote cpu has accumulated more vruntime than the task
> already running on that cpu; in that case the new task won't preempt the
> current task and the IPI won't be issued. E.g. on a RHEL 7.4 guest we
> saw:
>

I get it, thanks.

>>>>>>>> This boils down to the improvement being only achievable in workloads
>>>>>>>> with many actively switching tasks. We had no access to the
>>>>>>>> (proprietary?)
>>>>>>>> SAP HANA benchmark the commit referred to, but the
>>>>>>>> pattern is also reproduced with "perf bench sched messaging -g 1"
>>>>>>>> on 1 socket, 8 cores vCPU topology, we see indeed:
>>>>>>>>
>>>>>>>> l3-cache    #res IPI /s    #time / 10000 loops
>>>>>>>> off         560K           1.8 sec
>>>>>>>> on          40K            0.9 sec
>
> The workload where it bites is mostly idle guest, with chains of
> dependent wakeups, i.e. with little parallelism:
>
>>>>>>>> Now there's a downside: with L3 cache the Linux scheduler is more eager
>>>>>>>> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
>>>>>>>> interactions and therefore excessive halts and IPIs. E.g. "perf bench
>>>>>>>> sched pipe -i 100000" gives
>>>>>>>>
>>>>>>>> l3-cache    #res IPI /s    #HLT /s    #time /100000 loops
>>>>>>>> off         200 (no K)     230        0.2 sec
>>>>>>>> on          400K           330K       0.5 sec
>>>>>>>>
>>>>
>>>> I guess this issue could be resolved by disabling SD_WAKE_AFFINE.
>
> Actually, it's SD_WAKE_AFFINE that's being effectively defeated by this
> l3-cache, because the scheduler thinks that the cpus that share the
> last-level cache are close enough that a dependent task can be woken up
> on a sibling cpu.
>

In this case (sched pipe), without L3-cache, a dependent task is mostly woken
up on the original cpu; if these two tasks ran on the same cpu, then the
dependent task is woken up without a RES IPI. The related code is:

'''
void resched_curr(struct rq *rq)
{
        if (cpu == smp_processor_id()) {
                set_tsk_need_resched(curr);
                set_preempt_need_resched();
                return;
        }
}
'''

Do I understand correctly? If not, I hope you can point out what's wrong :)

>>>> As Gonglei said:
>>>> 1. the L3 cache relates to the user experience.
>>>
>>> This is true, in a way: I have seen a fair share of user reports
>>> where they incorrectly blame the L3 cache absence or the L3 cache
>>> size for performance problems.
>>>
>>>> 2. the glibc would get the cache info by CPUID directly, and relates to the
>>>> memory performance.
>>>
>>> I'm interested in numbers that demonstrate that.
>
> Me too. I vaguely remember debugging a memcpy degradation in the guest
> (on the Parallels proprietary hypervisor), that turned out being due to a
> combination of l3 cache size and the cpu topology exposed to the guest,
> which caused glibc to choose an inadequate buffer size.
>

We faced the same problem several months ago.

I did some simple tests at noon; it seems that the numbers are better
without L3-cache, except for 'perf bench sched messaging'.

VM: 1 socket, 8 cores, 3.10.0 guest
Hardware: Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz

Stream: (100 turns)
l3     Copy     Scale    Add      Triad
------------------------------------
off    8025.8   8019.5   8363.1   8589.9
on     8016.7   7999.9   8344.2   8568.9

perf bench sched messaging: (100 turns)
l3     Total-time
-----------------
off    0.0238
on     0.0178

perf bench sched pipe: (100 turns)
l3     Total-time
-----------------
off    0.3190
on     1.2688

We are so busy at the end of each month; maybe my tests are insufficient,
I'm sorry for that. According to the numbers above, I think it's worth
turning off L3-cache by default.

>> Sorry I have no numbers in hand currently :(
>>
>> I'll do some tests these days, please give me some time.
>
> We'll try to get some data on this, too.
>
>>>> What's more, the L3 cache relates to the sched_domain which is important to the
>>>> (load) balancer when system is busy.
>>>>
>>>> All this doesn't mean the patch is insignificant, I just think we should do more
>>>> research before we decide. I'll do some tests, thanks. :)
>>>
>>> Yes, we need more data. But if we find out that there are no
>>> cases where the l3-cache=on default actually improves
>>> performance, I will be willing to apply this patch.
>>>
>>
>> That's a good thing if we find the truth, it's free. :)
>>
>> OTOH, I think we should notice that: Linux is designed for real hardware; maybe
>> there are some other problems if QEMU lacks some related features.
>> If we search
>> 'cpus_share_cache' in the Linux kernel, we can see that it's also used by the
>> Block Layer.
>>
>>> IMO, the long term solution is to make Linux guests not misbehave
>>> when we stop lying about the L3 cache. Maybe we could provide a
>>> "IPIs are expensive, please avoid them" hint in the KVM CPUID
>>> leaf?
>
> We already have it, it's the hypervisor bit ;) Seriously, I'm unaware
> of hypervisors where IPIs aren't expensive.
>
>> Maybe more PV features could be dug up.
>
> One problem with this is that PV features are hard to get into other
> guest OSes or existing Linux guests.
>

Some cloud providers (e.g. Amazon, Alibaba, ...) provide customized guests
which can include more PV features to reach peak performance.

> Roman.
>

--
Regards,
Longpeng(Mike)
On Tue, Nov 28, 2017 at 07:13:26PM -0200, Eduardo Habkost wrote:
> [CCing the people who were copied in the original patch that
> enabled l3cache]

Thanks, and sorry to have forgotten to do so in the patch!

> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> > >> introduced and set by default exposing l3 to the guest.
> > >>
> > >> The motivation behind it was that in the Linux scheduler, when waking up
> > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > >> directly, without sending a reschedule IPI. Reduction in the IPI count
> > >> led to performance gain.
> > >>
> > >> However, this isn't the whole story. Once the task is on the target
> > >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> > >> it the idle task putting the CPU to sleep or just another running task.
> > >> For that a reschedule IPI will have to be issued, too. Only when that
> > >> other CPU is running a normal task for too little time, the fairness
> > >> constraints will prevent the preemption and thus the IPI.
> > >>
> > >> This boils down to the improvement being only achievable in workloads
> > >> with many actively switching tasks. We had no access to the
> > >> (proprietary?) SAP HANA benchmark the commit referred to, but the
> > >> pattern is also reproduced with "perf bench sched messaging -g 1"
> > >> on 1 socket, 8 cores vCPU topology, we see indeed:
> > >>
> > >> l3-cache    #res IPI /s    #time / 10000 loops
> > >> off         560K           1.8 sec
> > >> on          40K            0.9 sec
> > >>
> > >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> > >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> > >> interactions and therefore excessive halts and IPIs. E.g.
> > >> "perf bench
> > >> sched pipe -i 100000" gives
> > >>
> > >> l3-cache    #res IPI /s    #HLT /s    #time /100000 loops
> > >> off         200 (no K)     230        0.2 sec
> > >> on          400K           330K       0.5 sec
> > >>
> > >> In a more realistic test, we observe 15% degradation in VM density
> > >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> > >> requests per second to its main page, with 95%-percentile response
> > >> latency under 100 ms) with l3-cache=on.
> > >>
> > >> We think that mostly-idle scenario is more common in cloud and personal
> > >> usage, and should be optimized for by default; users of highly loaded
> > >> VMs should be able to tune them up themselves.
> > >>
> > > There's one thing I don't understand in your test case: if you
> > > just found out that Linux will behave worse if it assumes that
> > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > 8-core VCPU topology explicitly?
> > >
> > > Do you still see a difference in the numbers if you use "-smp 8"
> > > with no "cores" and "threads" options?

No, we don't. The guest scheduler makes no optimizations (which turn out to
be pessimizations) in this case.

> > This is quite simple. A lot of software licenses are bound to the amount
> > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > with 1 socket/xx cores to reduce the amount of money necessary to
> > be paid for the software.
>
> In this case it looks like we're talking about the expected
> meaning of "cores=N". My first interpretation would be that the
> user obviously wants the guest to see the multiple cores sharing a
> L3 cache, because that's how real CPUs normally work. But I see
> why you have different expectations.
>
> Numbers on dedicated-pCPU scenarios would be helpful to guide the
> decision. I wouldn't like to cause a performance regression for
> users that fine-tuned vCPU topology and set up CPU pinning.
We're speaking about the default setting; those who do fine-tuning are
not going to lose the ability to configure l3-cache.

Dedicated pCPU scenarios are the opposite of density ones. Besides, I'm
still struggling to see why dedicated pCPUs would change anything in the
guest scheduler behavior. (I hope Denis (Plotnikov) will be able to
redo the measurements with dedicated pCPUs and provide the numbers
regardless of my opinion, though.)

Our main problem at the moment is that our users get a performance
degradation in our default (arguably suboptimal) configuration when
transitioning to a newer version of QEMU; that's why we're suggesting
setting the default to match the pre-l3-cache behavior. (Well, they
would if we didn't apply this change to our version of QEMU.)

Roman.
On Wed, Nov 29, 2017 at 08:46:55AM +0300, Roman Kagan wrote:
> On Tue, Nov 28, 2017 at 07:13:26PM -0200, Eduardo Habkost wrote:
> > [CCing the people who were copied in the original patch that
> > enabled l3cache]
>
> Thanks, and sorry to have forgotten to do so in the patch!
>
> > On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > On 11/28/2017 10:58 PM, Eduardo Habkost wrote:
> > > > On Fri, Nov 24, 2017 at 04:26:50PM +0300, Denis Plotnikov wrote:
> > > >> Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
> > > >> introduced and set by default exposing l3 to the guest.
> > > >>
> > > >> The motivation behind it was that in the Linux scheduler, when waking up
> > > >> a task on a sibling CPU, the task was put onto the target CPU's runqueue
> > > >> directly, without sending a reschedule IPI. Reduction in the IPI count
> > > >> led to performance gain.
> > > >>
> > > >> However, this isn't the whole story. Once the task is on the target
> > > >> CPU's runqueue, it may have to preempt the current task on that CPU, be
> > > >> it the idle task putting the CPU to sleep or just another running task.
> > > >> For that a reschedule IPI will have to be issued, too. Only when that
> > > >> other CPU is running a normal task for too little time, the fairness
> > > >> constraints will prevent the preemption and thus the IPI.
> > > >>
> > > >> This boils down to the improvement being only achievable in workloads
> > > >> with many actively switching tasks. We had no access to the
> > > >> (proprietary?)
> > > >> SAP HANA benchmark the commit referred to, but the
> > > >> pattern is also reproduced with "perf bench sched messaging -g 1"
> > > >> on 1 socket, 8 cores vCPU topology, we see indeed:
> > > >>
> > > >> l3-cache    #res IPI /s    #time / 10000 loops
> > > >> off         560K           1.8 sec
> > > >> on          40K            0.9 sec
> > > >>
> > > >> Now there's a downside: with L3 cache the Linux scheduler is more eager
> > > >> to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
> > > >> interactions and therefore excessive halts and IPIs. E.g. "perf bench
> > > >> sched pipe -i 100000" gives
> > > >>
> > > >> l3-cache    #res IPI /s    #HLT /s    #time /100000 loops
> > > >> off         200 (no K)     230        0.2 sec
> > > >> on          400K           330K       0.5 sec
> > > >>
> > > >> In a more realistic test, we observe 15% degradation in VM density
> > > >> (measured as the number of VMs, each running Drupal CMS serving 2 http
> > > >> requests per second to its main page, with 95%-percentile response
> > > >> latency under 100 ms) with l3-cache=on.
> > > >>
> > > >> We think that mostly-idle scenario is more common in cloud and personal
> > > >> usage, and should be optimized for by default; users of highly loaded
> > > >> VMs should be able to tune them up themselves.
> > > >>
> > > > There's one thing I don't understand in your test case: if you
> > > > just found out that Linux will behave worse if it assumes that
> > > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > > 8-core VCPU topology explicitly?
> > > >
> > > > Do you still see a difference in the numbers if you use "-smp 8"
> > > > with no "cores" and "threads" options?
> >
> No, we don't. The guest scheduler makes no optimizations (which turn out to
> be pessimizations) in this case.
>
> > > This is quite simple. A lot of software licenses are bound to the amount
> > > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > > with 1 socket/xx cores to reduce the amount of money necessary to
> > > be paid for the software.
> >
> > In this case it looks like we're talking about the expected
> > meaning of "cores=N". My first interpretation would be that the
> > user obviously wants the guest to see the multiple cores sharing a
> > L3 cache, because that's how real CPUs normally work. But I see
> > why you have different expectations.
> >
> > Numbers on dedicated-pCPU scenarios would be helpful to guide the
> > decision. I wouldn't like to cause a performance regression for
> > users that fine-tuned vCPU topology and set up CPU pinning.
>
> We're speaking about the default setting; those who do fine-tuning are
> not going to lose the ability to configure l3-cache.
>
> Dedicated pCPU scenarios are the opposite of density ones. Besides I'm
> still struggling to see why dedicated pCPUs would change anything in the
> guest scheduler behavior. (I hope Denis (Plotnikov) should be able to
> redo the measurements with dedicated pCPUs and provide the numbers
> regardless of my opinion though.)

I'm interested in confirming that even if we make the vCPU
topology+placement (including cores/thread count and cache topology)
reflect the host topology very closely, Linux guests' behavior is still
not optimal. If we confirm that things get worse in all cases where we
present L3 cache info to Linux guests, we have another argument to try
to make Linux guests behave better.

It might be impossible to fix this solely on the host side if the L3
cache info triggers desirable guest-side optimizations somewhere else.

> Our main problem at the moment is that our users get a performance
> degradation in our default (arguably suboptimal) configuration when
> transitioning to a newer version of QEMU, that's why we're suggesting to
> set the default to match the pre-l3-cache behavior. (Well, they would
> if we didn't apply this change to our version of QEMU.)
>
> Roman.

--
Eduardo
On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > There's one thing I don't understand in your test case: if you
> > just found out that Linux will behave worse if it assumes that
> > the VCPUs are sharing a L3 cache, why are you configuring a
> > 8-core VCPU topology explicitly?
> >
> > Do you still see a difference in the numbers if you use "-smp 8"
> > with no "cores" and "threads" options?
> >
> This is quite simple. A lot of software licenses are bound to the amount
> of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> with 1 socket/xx cores to reduce the amount of money necessary to
> be paid for the software.

This answers the first question but not the second one. If the answer
to the second one is negative, then I don't understand why changing the
default makes sense.

I would expect qemu by default to be as close to emulating a physical
system as possible. If one has to deviate from that for some workloads,
that is fine, but probably not a good default.

--
MST
On Wed, Nov 29, 2017 at 06:17:40AM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 28, 2017 at 11:20:27PM +0300, Denis V. Lunev wrote:
> > > There's one thing I don't understand in your test case: if you
> > > just found out that Linux will behave worse if it assumes that
> > > the VCPUs are sharing a L3 cache, why are you configuring a
> > > 8-core VCPU topology explicitly?
> > >
> > > Do you still see a difference in the numbers if you use "-smp 8"
> > > with no "cores" and "threads" options?
> > >
> > This is quite simple. A lot of software licenses are bound to the amount
> > of CPU __sockets__. Thus it is mandatory in a lot of cases to set topology
> > with 1 socket/xx cores to reduce the amount of money necessary to
> > be paid for the software.
>
> This answers the first question but not the second one.

The answer to the second one is no, there's no difference (from memory,
I don't have the numbers at hand).

> If the answer to the second one is negative, then I don't understand why
> changing the default makes sense.

Because the setting adversely affected the performance in some
configurations? That said, we may be well past the point where that
would matter, because the new setting has been there for a few releases
already...

> I would expect qemu by default to be as close to emulating a physical
> system as possible. If one has to deviate from that for some workloads,
> that is fine, but probably not a good default.

Thanks,
Roman.