[Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use

Alexey Kardashevskiy posted 4 patches 6 years, 6 months ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20170907092010.3605-1-aik@ozlabs.ru
Test checkpatch passed
Test docker failed
Test s390x failed
There is a newer version of this series
[Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Alexey Kardashevskiy 6 years, 6 months ago
This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593

What happens there is that every virtio block device creates two address
spaces - one for the modern config space (called "virtio-pci-cfg-as") and
one for busmaster DMA (a common PCI thing, named after the device,
in my case "virtio-blk-pci").

Each address_space_init() updates the topology for every address space.
Every topology update (address_space_update_topology()) creates a new
dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
sections (48KB) - and destroys the old one.

However, the dispatch destructor is postponed via RCU, which does not
get a chance to execute until the machine is initialized; until we get
there, memory is not returned to the pool, and this is a lot of
memory, growing as n^2.
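
As a back-of-envelope model of where that n^2 comes from (a sketch, not
code from the patches, assuming ~49KB per tree as above and the ~1100
address spaces reported later in this thread):

/* Toy model: each new address space triggers a topology update for all
 * existing ones, every update queues the old dispatch tree for
 * RCU-deferred freeing, and the RCU thread does not run during init. */
#include <stdio.h>

int main(void)
{
    const unsigned long tree_kb = 49;   /* ~1KB nodes + ~48KB sections */
    unsigned long deferred_kb = 0;
    unsigned long n;

    for (n = 1; n <= 1100; n++) {       /* the nth AS rebuilds n trees */
        deferred_kb += n * tree_kb;
    }
    printf("peak deferred memory: ~%lu GB\n", deferred_kb / (1024 * 1024));
    return 0;
}

This prints ~28GB - the same order of magnitude as the peaks measured
in the replies below.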

These patches try to address the memory use and boot time issues,
but to be honest only the first one provides a visible outcome.

There are still things to polish, and the use of RCU needs
double-checking; I'd like to get feedback before proceeding - is this
going the right way, or is it way too ugly?


This is based on sha1
1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".

Please comment. Thanks.



Alexey Kardashevskiy (4):
  memory: Postpone flatview and dispatch tree building till all devices
    are added
  memory: Prepare for shared flat views
  memory: Share flat views and dispatch trees between address spaces
  memory: Add flat views to HMP "info mtree"

 include/exec/memory-internal.h |   6 +-
 include/exec/memory.h          |  93 +++++++++----
 exec.c                         | 242 +++++++++++++++++++--------------
 hw/alpha/typhoon.c             |   2 +-
 hw/dma/rc4030.c                |   4 +-
 hw/i386/amd_iommu.c            |   2 +-
 hw/i386/intel_iommu.c          |   9 +-
 hw/intc/openpic_kvm.c          |   2 +-
 hw/pci-host/apb.c              |   2 +-
 hw/pci/pci.c                   |   3 +-
 hw/ppc/spapr_iommu.c           |   4 +-
 hw/s390x/s390-pci-bus.c        |   2 +-
 hw/vfio/common.c               |   6 +-
 hw/virtio/vhost.c              |   6 +-
 memory.c                       | 299 +++++++++++++++++++++++++++--------------
 monitor.c                      |   3 +-
 vl.c                           |   4 +
 hmp-commands-info.hx           |   7 +-
 18 files changed, 448 insertions(+), 248 deletions(-)

-- 
2.11.0


Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Dr. David Alan Gilbert 6 years, 6 months ago
* Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
> 
> What happens there is that every virtio block device creates two address
> spaces - one for the modern config space (called "virtio-pci-cfg-as") and
> one for busmaster DMA (a common PCI thing, named after the device,
> in my case "virtio-blk-pci").
> 
> Each address_space_init() updates the topology for every address space.
> Every topology update (address_space_update_topology()) creates a new
> dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
> sections (48KB) - and destroys the old one.
> 
> However, the dispatch destructor is postponed via RCU, which does not
> get a chance to execute until the machine is initialized; until we get
> there, memory is not returned to the pool, and this is a lot of
> memory, growing as n^2.
> 
> These patches try to address the memory use and boot time issues,
> but to be honest only the first one provides a visible outcome.

Do you have a feel for how much memory is saved?

Dave

> There are still things to polish, and the use of RCU needs
> double-checking; I'd like to get feedback before proceeding - is this
> going the right way, or is it way too ugly?
> 
> 
> This is based on sha1
> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
> 
> Please comment. Thanks.
> 
> 
> 
> Alexey Kardashevskiy (4):
>   memory: Postpone flatview and dispatch tree building till all devices
>     are added
>   memory: Prepare for shared flat views
>   memory: Share flat views and dispatch trees between address spaces
>   memory: Add flat views to HMP "info mtree"
> 
>  include/exec/memory-internal.h |   6 +-
>  include/exec/memory.h          |  93 +++++++++----
>  exec.c                         | 242 +++++++++++++++++++--------------
>  hw/alpha/typhoon.c             |   2 +-
>  hw/dma/rc4030.c                |   4 +-
>  hw/i386/amd_iommu.c            |   2 +-
>  hw/i386/intel_iommu.c          |   9 +-
>  hw/intc/openpic_kvm.c          |   2 +-
>  hw/pci-host/apb.c              |   2 +-
>  hw/pci/pci.c                   |   3 +-
>  hw/ppc/spapr_iommu.c           |   4 +-
>  hw/s390x/s390-pci-bus.c        |   2 +-
>  hw/vfio/common.c               |   6 +-
>  hw/virtio/vhost.c              |   6 +-
>  memory.c                       | 299 +++++++++++++++++++++++++++--------------
>  monitor.c                      |   3 +-
>  vl.c                           |   4 +
>  hmp-commands-info.hx           |   7 +-
>  18 files changed, 448 insertions(+), 248 deletions(-)
> 
> -- 
> 2.11.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by David Gibson 6 years, 6 months ago
On Thu, Sep 07, 2017 at 10:51:42AM +0100, Dr. David Alan Gilbert wrote:
> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> > This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
> > 
> > What happens there is that every virtio block device creates two address
> > spaces - one for the modern config space (called "virtio-pci-cfg-as") and
> > one for busmaster DMA (a common PCI thing, named after the device,
> > in my case "virtio-blk-pci").
> > 
> > Each address_space_init() updates the topology for every address space.
> > Every topology update (address_space_update_topology()) creates a new
> > dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
> > sections (48KB) - and destroys the old one.
> > 
> > However, the dispatch destructor is postponed via RCU, which does not
> > get a chance to execute until the machine is initialized; until we get
> > there, memory is not returned to the pool, and this is a lot of
> > memory, growing as n^2.
> > 
> > These patches try to address the memory use and boot time issues,
> > but to be honest only the first one provides a visible outcome.
> 
> Do you have a feel for how much memory is saved?

I think that's a bit hard to answer.

As noted above, there's O(n^2) (or more) space complexity here - growth
which shouldn't be required by the data we actually have to track.
That means the amount of "excess" memory depends on how many devices
there are.

I haven't yet looked at these patches in detail to know if they truly
fix that O(n^2) or just pare down the constant.  If they do fix the
O(n^2), then the amount is going to vary from "probably not enough to
worry about" in normal use cases to hundreds of gigabytes in cases
with many devices.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Alexey Kardashevskiy 6 years, 6 months ago
On 07/09/17 19:51, Dr. David Alan Gilbert wrote:
> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
>> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
>>
>> What happens there is that every virtio block device creates two address
>> spaces - one for the modern config space (called "virtio-pci-cfg-as") and
>> one for busmaster DMA (a common PCI thing, named after the device,
>> in my case "virtio-blk-pci").
>>
>> Each address_space_init() updates the topology for every address space.
>> Every topology update (address_space_update_topology()) creates a new
>> dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
>> sections (48KB) - and destroys the old one.
>>
>> However, the dispatch destructor is postponed via RCU, which does not
>> get a chance to execute until the machine is initialized; until we get
>> there, memory is not returned to the pool, and this is a lot of
>> memory, growing as n^2.
>>
>> These patches try to address the memory use and boot time issues,
>> but to be honest only the first one provides a visible outcome.
> 
> Do you have a feel for how much memory is saved?


Patch 1/4 saves ~33GB (~44GB -> ~11GB) for a 2GB guest with 400 virtio-pci
devices. These GB figures are the peak values (though that does not matter
to the OOM killer); memory gets released in one go when RCU kicks in, it
just happens too late.

Patch 3/4 saves less, I'd say ~50KB per vCPU (more if you count peaks, but
not by much). Strangely, I do not see a difference in the valgrind output
when I run a guest with 1024 CPUs versus just 8; probably "massif" is not
the right tool to catch this.
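
(The exact command line isn't in this thread; the peak figures above come
from the usual massif workflow, sketched here with a made-up binary path:

valgrind --tool=massif ./ppc64-softmmu/qemu-system-ppc64 -S ...
ms_print massif.out.<pid>

ms_print marks the peak snapshot, which is where the numbers above are
read from.)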

> 
> Dave
> 
>> There are still things to polish, and the use of RCU needs
>> double-checking; I'd like to get feedback before proceeding - is this
>> going the right way, or is it way too ugly?
>>
>>
>> This is based on sha1
>> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
>>
>> Please comment. Thanks.
>>
>>
>>
>> Alexey Kardashevskiy (4):
>>   memory: Postpone flatview and dispatch tree building till all devices
>>     are added
>>   memory: Prepare for shared flat views
>>   memory: Share flat views and dispatch trees between address spaces
>>   memory: Add flat views to HMP "info mtree"
>>
>>  include/exec/memory-internal.h |   6 +-
>>  include/exec/memory.h          |  93 +++++++++----
>>  exec.c                         | 242 +++++++++++++++++++--------------
>>  hw/alpha/typhoon.c             |   2 +-
>>  hw/dma/rc4030.c                |   4 +-
>>  hw/i386/amd_iommu.c            |   2 +-
>>  hw/i386/intel_iommu.c          |   9 +-
>>  hw/intc/openpic_kvm.c          |   2 +-
>>  hw/pci-host/apb.c              |   2 +-
>>  hw/pci/pci.c                   |   3 +-
>>  hw/ppc/spapr_iommu.c           |   4 +-
>>  hw/s390x/s390-pci-bus.c        |   2 +-
>>  hw/vfio/common.c               |   6 +-
>>  hw/virtio/vhost.c              |   6 +-
>>  memory.c                       | 299 +++++++++++++++++++++++++++--------------
>>  monitor.c                      |   3 +-
>>  vl.c                           |   4 +
>>  hmp-commands-info.hx           |   7 +-
>>  18 files changed, 448 insertions(+), 248 deletions(-)
>>
>> -- 
>> 2.11.0
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


-- 
Alexey

Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Dr. David Alan Gilbert 6 years, 6 months ago
* Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> On 07/09/17 19:51, Dr. David Alan Gilbert wrote:
> > * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> >> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
> >>
> >> What happens there is that every virtio block device creates two address
> >> spaces - one for the modern config space (called "virtio-pci-cfg-as") and
> >> one for busmaster DMA (a common PCI thing, named after the device,
> >> in my case "virtio-blk-pci").
> >>
> >> Each address_space_init() updates the topology for every address space.
> >> Every topology update (address_space_update_topology()) creates a new
> >> dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
> >> sections (48KB) - and destroys the old one.
> >>
> >> However, the dispatch destructor is postponed via RCU, which does not
> >> get a chance to execute until the machine is initialized; until we get
> >> there, memory is not returned to the pool, and this is a lot of
> >> memory, growing as n^2.
> >>
> >> These patches try to address the memory use and boot time issues,
> >> but to be honest only the first one provides a visible outcome.
> > 
> > Do you have a feel for how much memory is saved?
> 
> 
> Patch 1/4 saves ~33GB (~44GB -> ~11GB) for a 2GB guest with 400 virtio-pci
> devices. These GB figures are the peak values (though that does not matter
> to the OOM killer); memory gets released in one go when RCU kicks in, it
> just happens too late.

Nice saving!  Still, why is it using 11GB?
What's it like for more sane configurations, say 2-3 virtio devices - is
there anything noticeable, or is it just the huge setups?

Dave


> Patch 3/4 saves less, I'd say ~50KB per vCPU (more if you count peaks, but
> not by much). Strangely, I do not see a difference in the valgrind output
> when I run a guest with 1024 CPUs versus just 8; probably "massif" is not
> the right tool to catch this.
> 
> > 
> > Dave
> > 
> >> There are still things to polish, and the use of RCU needs
> >> double-checking; I'd like to get feedback before proceeding - is this
> >> going the right way, or is it way too ugly?
> >>
> >>
> >> This is based on sha1
> >> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
> >>
> >> Please comment. Thanks.
> >>
> >>
> >>
> >> Alexey Kardashevskiy (4):
> >>   memory: Postpone flatview and dispatch tree building till all devices
> >>     are added
> >>   memory: Prepare for shared flat views
> >>   memory: Share flat views and dispatch trees between address spaces
> >>   memory: Add flat views to HMP "info mtree"
> >>
> >>  include/exec/memory-internal.h |   6 +-
> >>  include/exec/memory.h          |  93 +++++++++----
> >>  exec.c                         | 242 +++++++++++++++++++--------------
> >>  hw/alpha/typhoon.c             |   2 +-
> >>  hw/dma/rc4030.c                |   4 +-
> >>  hw/i386/amd_iommu.c            |   2 +-
> >>  hw/i386/intel_iommu.c          |   9 +-
> >>  hw/intc/openpic_kvm.c          |   2 +-
> >>  hw/pci-host/apb.c              |   2 +-
> >>  hw/pci/pci.c                   |   3 +-
> >>  hw/ppc/spapr_iommu.c           |   4 +-
> >>  hw/s390x/s390-pci-bus.c        |   2 +-
> >>  hw/vfio/common.c               |   6 +-
> >>  hw/virtio/vhost.c              |   6 +-
> >>  memory.c                       | 299 +++++++++++++++++++++++++++--------------
> >>  monitor.c                      |   3 +-
> >>  vl.c                           |   4 +
> >>  hmp-commands-info.hx           |   7 +-
> >>  18 files changed, 448 insertions(+), 248 deletions(-)
> >>
> >> -- 
> >> 2.11.0
> >>
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> 
> -- 
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Alexey Kardashevskiy 6 years, 6 months ago
On 08/09/17 00:54, Dr. David Alan Gilbert wrote:
> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
>> On 07/09/17 19:51, Dr. David Alan Gilbert wrote:
>>> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
>>>> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
>>>>
>>>> What happens there is that every virtio block device creates two address
>>>> spaces - one for the modern config space (called "virtio-pci-cfg-as") and
>>>> one for busmaster DMA (a common PCI thing, named after the device,
>>>> in my case "virtio-blk-pci").
>>>>
>>>> Each address_space_init() updates the topology for every address space.
>>>> Every topology update (address_space_update_topology()) creates a new
>>>> dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
>>>> sections (48KB) - and destroys the old one.
>>>>
>>>> However, the dispatch destructor is postponed via RCU, which does not
>>>> get a chance to execute until the machine is initialized; until we get
>>>> there, memory is not returned to the pool, and this is a lot of
>>>> memory, growing as n^2.
>>>>
>>>> These patches try to address the memory use and boot time issues,
>>>> but to be honest only the first one provides a visible outcome.
>>>
>>> Do you have a feel for how much memory is saved?
>>
>>
>> Patch 1/4 saves ~33GB (~44GB -> ~11GB) for a 2GB guest with 400 virtio-pci
>> devices. These GB figures are the peak values (though that does not matter
>> to the OOM killer); memory gets released in one go when RCU kicks in, it
>> just happens too late.
> 
> Nice saving!  Still, why is it using 11GB?

Yet to be discovered :) Not clear at the moment.


> What's it like for more sane configurations, say 2-3 virtio devices - is
> there anything noticeable, or is it just the huge setups?
> 
> Dave
> 
> 
>> Patch 3/4 saves less, I'd say ~50KB per vCPU (more if you count peaks, but
>> not by much). Strangely, I do not see a difference in the valgrind output
>> when I run a guest with 1024 CPUs versus just 8; probably "massif" is not
>> the right tool to catch this.

I did some more tests.

v2.10:
1024 CPUs, no virtio:     0:47   490.8MB   38/34
1 CPU, 500 virtio-block:  5:03   59.69GB   2354438/3

1/4 applied:
1024 CPUs, no virtio:     0:49   490.8MB   38/34
1 CPU, 500 virtio-block:  1:57   17.74GB   2186/3

3/4 applied:
1024 CPUs, no virtio:     0:53   491.1MB   20/17
1 CPU, 500 virtio-block:  2:01    17.7GB   2167/0


Time is what it takes to start QEMU with -S and then Q-Ax.

Memory amount is peak use from valgrind massif.

The last two numbers - "38/34" for example - are the number of
g_new(FlatView, 1) calls (38) and the number of g_free(view) calls (34);
the numbers are printed at
https://git.qemu.org/?p=qemu.git;a=blob;f=vl.c;h=8e247cc2a239ae8fb3d3cdf6d4ee78fd723d1053;hb=1ab5eb4efb91a3d4569b0df6e824cc08ab4bd8ec#l4666
before RCU kicks in.

500 virtio-block + bridges use around 1100 address spaces.
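
The counting itself is trivial - a hypothetical sketch of that
instrumentation (not the actual debug patch):

#include <stdio.h>

/* Bumped next to every g_new(FlatView, 1) / g_free(view) in memory.c. */
static unsigned flatview_allocs;
static unsigned flatview_frees;

/* Called once from vl.c, before the RCU thread has reclaimed anything;
 * prints e.g. "flatviews: 38/34". */
static void flatview_dump_counters(void)
{
    fprintf(stderr, "flatviews: %u/%u\n", flatview_allocs, flatview_frees);
}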





>>
>>>
>>> Dave
>>>
>>>> There are still things to polish, and the use of RCU needs
>>>> double-checking; I'd like to get feedback before proceeding - is this
>>>> going the right way, or is it way too ugly?
>>>>
>>>>
>>>> This is based on sha1
>>>> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
>>>>
>>>> Please comment. Thanks.
>>>>
>>>>
>>>>
>>>> Alexey Kardashevskiy (4):
>>>>   memory: Postpone flatview and dispatch tree building till all devices
>>>>     are added
>>>>   memory: Prepare for shared flat views
>>>>   memory: Share flat views and dispatch trees between address spaces
>>>>   memory: Add flat views to HMP "info mtree"
>>>>
>>>>  include/exec/memory-internal.h |   6 +-
>>>>  include/exec/memory.h          |  93 +++++++++----
>>>>  exec.c                         | 242 +++++++++++++++++++--------------
>>>>  hw/alpha/typhoon.c             |   2 +-
>>>>  hw/dma/rc4030.c                |   4 +-
>>>>  hw/i386/amd_iommu.c            |   2 +-
>>>>  hw/i386/intel_iommu.c          |   9 +-
>>>>  hw/intc/openpic_kvm.c          |   2 +-
>>>>  hw/pci-host/apb.c              |   2 +-
>>>>  hw/pci/pci.c                   |   3 +-
>>>>  hw/ppc/spapr_iommu.c           |   4 +-
>>>>  hw/s390x/s390-pci-bus.c        |   2 +-
>>>>  hw/vfio/common.c               |   6 +-
>>>>  hw/virtio/vhost.c              |   6 +-
>>>>  memory.c                       | 299 +++++++++++++++++++++++++++--------------
>>>>  monitor.c                      |   3 +-
>>>>  vl.c                           |   4 +
>>>>  hmp-commands-info.hx           |   7 +-
>>>>  18 files changed, 448 insertions(+), 248 deletions(-)
>>>>
>>>> -- 
>>>> 2.11.0
>>>>
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>
>>
>> -- 
>> Alexey
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


-- 
Alexey

Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Alexey Kardashevskiy 6 years, 6 months ago
On 08/09/17 12:08, Alexey Kardashevskiy wrote:
> On 08/09/17 00:54, Dr. David Alan Gilbert wrote:
>> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
>>> On 07/09/17 19:51, Dr. David Alan Gilbert wrote:
>>>> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
>>>>> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
>>>>>
>>>>> What happens there is that every virtio block device creates two address
>>>>> spaces - one for the modern config space (called "virtio-pci-cfg-as") and
>>>>> one for busmaster DMA (a common PCI thing, named after the device,
>>>>> in my case "virtio-blk-pci").
>>>>>
>>>>> Each address_space_init() updates the topology for every address space.
>>>>> Every topology update (address_space_update_topology()) creates a new
>>>>> dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
>>>>> sections (48KB) - and destroys the old one.
>>>>>
>>>>> However, the dispatch destructor is postponed via RCU, which does not
>>>>> get a chance to execute until the machine is initialized; until we get
>>>>> there, memory is not returned to the pool, and this is a lot of
>>>>> memory, growing as n^2.
>>>>>
>>>>> These patches try to address the memory use and boot time issues,
>>>>> but to be honest only the first one provides a visible outcome.
>>>>
>>>> Do you have a feel for how much memory is saved?
>>>
>>>
>>> Patch 1/4 saves ~33GB (~44GB -> ~11GB) for a 2GB guest with 400 virtio-pci
>>> devices. These GB figures are the peak values (though that does not matter
>>> to the OOM killer); memory gets released in one go when RCU kicks in, it
>>> just happens too late.
>>
>> Nice saving!  Still, why is it using 11GB?
> 
> Yet to be discovered :) Not clear at the moment.
> 
> 
>> What's it like for more sane configurations, say 2-3 virtio devices - is
>> there anything noticeable, or is it just the huge setups?
>>
>> Dave
>>
>>
>>> Patch 3/4 saves less, I'd say ~50KB per vCPU (more if you count peaks, but
>>> not by much). Strangely, I do not see a difference in the valgrind output
>>> when I run a guest with 1024 CPUs versus just 8; probably "massif" is not
>>> the right tool to catch this.
> 
> I did some more tests.
> 
> v2.10:
> 1024 CPUs, no virtio:     0:47   490.8MB   38/34
> 1 CPU, 500 virtio-block:  5:03   59.69GB   2354438/3
> 
> 1/4 applied:
> 1024 CPUs, no virtio:     0:49   490.8MB   38/34
> 1 CPU, 500 virtio-block:  1:57   17.74GB   2186/3
> 
> 3/4 applied:
> 1024 CPUs, no virtio:     0:53   491.1MB   20/17
> 1 CPU, 500 virtio-block:  2:01    17.7GB   2167/0
> 
> 
> Time is what it takes to start QEMU with -S and then Q-Ax.
> 
> Memory amount is peak use from valgrind massif.
> 
> The last two numbers - "38/34" for example - are the number of
> g_new(FlatView, 1) calls (38) and the number of g_free(view) calls (34);
> the numbers are printed at
> https://git.qemu.org/?p=qemu.git;a=blob;f=vl.c;h=8e247cc2a239ae8fb3d3cdf6d4ee78fd723d1053;hb=1ab5eb4efb91a3d4569b0df6e824cc08ab4bd8ec#l4666
> before RCU kicks in.
> 
> 500 virtio-block + bridges use around 1100 address spaces.


The numbers are quite wrong for the [3/4] patch, as
qemu_run_machine_init_done_notifiers() does the same dance all over again.

I can bring peak memory use down to 3.75GB (yay!) for "1 CPU, 500
virtio-block" with the fix below. It belongs in [1/4], and we might want to
just extend that begin+commit to cover both device creation and notifier
initialization.

My guess is that the general problem is that memory_region_transaction_begin()
and memory_region_transaction_commit() update all address spaces (or flat
views, after [3/4] is applied) while they do not really need to - they could
receive a hint (an AS? a flatview?) and update only the respective bits. I
just do not see an easy way to provide such a hint other than storing an AS
pointer in an MR.

In my test, I have 500 devices and 2 layers of PCI bridges. For every PCI
device probed by SLOF (our boot firmware), _all_ flatviews are rebuilt 8
times (at least; sometimes more, like 20, when the next bridge gets
activated, dunno):

- config space writes at 0x20 0x24 0x28 0x2c 0x1c - we end up in
pci_bridge_update_mappings();
- config space writes at 0x4 for a device and a bridge - pci_update_mappings();
- memory_region_set_enabled() on d->bus_master_enable_region to enable the
bridge.

Each of these operations re-renders all 1000-ish flatviews and dispatch
trees, making device probing really, really slow. But at least memory is not
sort of leaking or being held by RCU :) And I cannot wrap these config
writes into memory_region_transaction_begin/commit, as every config space
access comes independently from SLOF.



diff --git a/vl.c b/vl.c
index 89fb58c1de..4317ef01b4 100644
--- a/vl.c
+++ b/vl.c
@@ -4754,8 +4754,13 @@ int main(int argc, char **argv, char **envp)
     /* TODO: once all bus devices are qdevified, this should be done
      * when bus is created by qdev.c */
     qemu_register_reset(qbus_reset_all_fn, sysbus_get_default());
+
+    memory_region_transaction_begin();
+
     qemu_run_machine_init_done_notifiers();

+    memory_region_transaction_commit();
+
     if (rom_check_and_register_reset() != 0) {
         error_report("rom check and register reset failed");
         exit(1);
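
For what it's worth, the reason the wrap helps is the transaction nesting -
a minimal sketch, simplified from what memory.c does with its transaction
depth counter (the rebuild call is a placeholder, not the real API):

#include <assert.h>

void rebuild_all_flatviews(void);   /* placeholder for the real rebuild */

static unsigned transaction_depth;

void memory_region_transaction_begin(void)
{
    transaction_depth++;
}

void memory_region_transaction_commit(void)
{
    assert(transaction_depth);
    if (--transaction_depth == 0) {
        /* Only the outermost commit rebuilds the flat views and dispatch
         * trees - once for the whole batch, instead of once per update
         * made inside the transaction. */
        rebuild_all_flatviews();
    }
}

All the memory-map churn from the init-done notifiers then collapses into
a single rebuild at the final commit.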




> 
> 
> 
> 
> 
>>>
>>>>
>>>> Dave
>>>>
>>>>> There are still things to polish, and the use of RCU needs
>>>>> double-checking; I'd like to get feedback before proceeding - is this
>>>>> going the right way, or is it way too ugly?
>>>>>
>>>>>
>>>>> This is based on sha1
>>>>> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
>>>>>
>>>>> Please comment. Thanks.
>>>>>
>>>>>
>>>>>
>>>>> Alexey Kardashevskiy (4):
>>>>>   memory: Postpone flatview and dispatch tree building till all devices
>>>>>     are added
>>>>>   memory: Prepare for shared flat views
>>>>>   memory: Share flat views and dispatch trees between address spaces
>>>>>   memory: Add flat views to HMP "info mtree"
>>>>>
>>>>>  include/exec/memory-internal.h |   6 +-
>>>>>  include/exec/memory.h          |  93 +++++++++----
>>>>>  exec.c                         | 242 +++++++++++++++++++--------------
>>>>>  hw/alpha/typhoon.c             |   2 +-
>>>>>  hw/dma/rc4030.c                |   4 +-
>>>>>  hw/i386/amd_iommu.c            |   2 +-
>>>>>  hw/i386/intel_iommu.c          |   9 +-
>>>>>  hw/intc/openpic_kvm.c          |   2 +-
>>>>>  hw/pci-host/apb.c              |   2 +-
>>>>>  hw/pci/pci.c                   |   3 +-
>>>>>  hw/ppc/spapr_iommu.c           |   4 +-
>>>>>  hw/s390x/s390-pci-bus.c        |   2 +-
>>>>>  hw/vfio/common.c               |   6 +-
>>>>>  hw/virtio/vhost.c              |   6 +-
>>>>>  memory.c                       | 299 +++++++++++++++++++++++++++--------------
>>>>>  monitor.c                      |   3 +-
>>>>>  vl.c                           |   4 +
>>>>>  hmp-commands-info.hx           |   7 +-
>>>>>  18 files changed, 448 insertions(+), 248 deletions(-)
>>>>>
>>>>> -- 
>>>>> 2.11.0
>>>>>
>>>>>
>>>> --
>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>>
>>>
>>> -- 
>>> Alexey
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
> 
> 


-- 
Alexey

Re: [Qemu-devel] [RFC PATCH qemu 0/4] memory: Reduce memory use
Posted by Dr. David Alan Gilbert 6 years, 6 months ago
* Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> On 08/09/17 00:54, Dr. David Alan Gilbert wrote:
> > * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> >> On 07/09/17 19:51, Dr. David Alan Gilbert wrote:
> >>> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> >>>> This was inspired by https://bugzilla.redhat.com/show_bug.cgi?id=1481593
> >>>>
> >>>> What happens there is that every virtio block device creates two address
> >>>> spaces - one for the modern config space (called "virtio-pci-cfg-as") and
> >>>> one for busmaster DMA (a common PCI thing, named after the device,
> >>>> in my case "virtio-blk-pci").
> >>>>
> >>>> Each address_space_init() updates the topology for every address space.
> >>>> Every topology update (address_space_update_topology()) creates a new
> >>>> dispatch tree - an AddressSpaceDispatch with nodes (1KB) and
> >>>> sections (48KB) - and destroys the old one.
> >>>>
> >>>> However, the dispatch destructor is postponed via RCU, which does not
> >>>> get a chance to execute until the machine is initialized; until we get
> >>>> there, memory is not returned to the pool, and this is a lot of
> >>>> memory, growing as n^2.
> >>>>
> >>>> These patches try to address the memory use and boot time issues,
> >>>> but to be honest only the first one provides a visible outcome.
> >>>
> >>> Do you have a feel for how much memory is saved?
> >>
> >>
> >> Patch 1/4 saves ~33GB (~44GB -> ~11GB) for a 2GB guest with 400 virtio-pci
> >> devices. These GB figures are the peak values (though that does not matter
> >> to the OOM killer); memory gets released in one go when RCU kicks in, it
> >> just happens too late.
> > 
> > Nice saving!  Still, why is it using 11GB?
> 
> Yet to be discovered :) Not clear at the moment.
> 
> 
> > What's it like for more sane configurations, say 2-3 virtio devices - is
> > there anything noticeable, or is it just the huge setups?
> > 
> > Dave
> > 
> > 
> >> Patch 3/4 saves less, I'd say ~50KB per vCPU (more if you count peaks, but
> >> not by much). Strangely, I do not see a difference in the valgrind output
> >> when I run a guest with 1024 CPUs versus just 8; probably "massif" is not
> >> the right tool to catch this.
> 
> I did some more tests.
> 
> v2.10:
> 1024 CPUs, no virtio:     0:47   490.8MB   38/34
> 1 CPU, 500 virtio-block:  5:03   59.69GB   2354438/3
> 
> 1/4 applied:
> 1024 CPUs, no virtio:     0:49   490.8MB   38/34
> 1 CPU, 500 virtio-block:  1:57   17.74GB   2186/3
> 
> 3/4 applied:
> 1024 CPUs, no virtio:     0:53   491.1MB   20/17
> 1 CPU, 500 virtio-block:  2:01    17.7GB   2167/0
> 
> 
> Time is what it takes to start QEMU with -S and then Q-Ax.
> 
> Memory amount is peak use from valgrind massif.
> 
> The last two numbers - "38/34" for example - are the number of
> g_new(FlatView, 1) calls (38) and the number of g_free(view) calls (34);
> the numbers are printed at
> https://git.qemu.org/?p=qemu.git;a=blob;f=vl.c;h=8e247cc2a239ae8fb3d3cdf6d4ee78fd723d1053;hb=1ab5eb4efb91a3d4569b0df6e824cc08ab4bd8ec#l4666
> before RCU kicks in.
> 
> 500 virtio-block + bridges use around 1100 address spaces.

What I find interesting is the effect even on small VMs. I'm using
valgrind --tool=exp-dhat, as per your bz comment, on a QEMU close to
head:

valgrind --tool=exp-dhat ~/try/x86_64-softmmu/qemu-system-x86_64 -nographic -device sga -m 1G -M pc,accel=kvm -drive file=/home/vmimages/littlefed20.img,id=d1,if=none -device virtio-blk,drive=d1 -drive file=/home/vmimages/dummy1,id=d2,if=none -device virtio-blk,drive=d2 -drive file=/home/vmimages/dummy2,id=d3,if=none -device virtio-blk,drive=d3 -device virtio-serial -device virtio-serial -device virtio-serial -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0

==5945== guest_insns:  2,845,498,404
==5945== max_live:     73,745,261 in 45,395 blocks
==5945== tot_alloc:    615,696,752 in 515,110 blocks

with your 1-4 patches:
==14661== guest_insns:  2,626,826,254
==14661== max_live:     27,825,659 in 28,950 blocks
==14661== tot_alloc:    529,978,686 in 444,043 blocks

so that's a ~45MB saving on a simple VM - numbers like that add up for
people running lots of small VMs; they notice when the total QEMU RAM
overhead on their box goes up by a few GB.

Dave

> 
> 
> 
> 
> >>
> >>>
> >>> Dave
> >>>
> >>>> There are still things to polish and double check the use of RCU,
> >>>> I'd like to get any feedback before proceeding - is this going
> >>>> the right way or way too ugly?
> >>>>
> >>>>
> >>>> This is based on sha1
> >>>> 1ab5eb4efb Peter Maydell "Update version for v2.10.0 release".
> >>>>
> >>>> Please comment. Thanks.
> >>>>
> >>>>
> >>>>
> >>>> Alexey Kardashevskiy (4):
> >>>>   memory: Postpone flatview and dispatch tree building till all devices
> >>>>     are added
> >>>>   memory: Prepare for shared flat views
> >>>>   memory: Share flat views and dispatch trees between address spaces
> >>>>   memory: Add flat views to HMP "info mtree"
> >>>>
> >>>>  include/exec/memory-internal.h |   6 +-
> >>>>  include/exec/memory.h          |  93 +++++++++----
> >>>>  exec.c                         | 242 +++++++++++++++++++--------------
> >>>>  hw/alpha/typhoon.c             |   2 +-
> >>>>  hw/dma/rc4030.c                |   4 +-
> >>>>  hw/i386/amd_iommu.c            |   2 +-
> >>>>  hw/i386/intel_iommu.c          |   9 +-
> >>>>  hw/intc/openpic_kvm.c          |   2 +-
> >>>>  hw/pci-host/apb.c              |   2 +-
> >>>>  hw/pci/pci.c                   |   3 +-
> >>>>  hw/ppc/spapr_iommu.c           |   4 +-
> >>>>  hw/s390x/s390-pci-bus.c        |   2 +-
> >>>>  hw/vfio/common.c               |   6 +-
> >>>>  hw/virtio/vhost.c              |   6 +-
> >>>>  memory.c                       | 299 +++++++++++++++++++++++++++--------------
> >>>>  monitor.c                      |   3 +-
> >>>>  vl.c                           |   4 +
> >>>>  hmp-commands-info.hx           |   7 +-
> >>>>  18 files changed, 448 insertions(+), 248 deletions(-)
> >>>>
> >>>> -- 
> >>>> 2.11.0
> >>>>
> >>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>
> >>
> >>
> >> -- 
> >> Alexey
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> 
> -- 
> Alexey
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK