[PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX

David Hildenbrand posted 13 patches 5 years, 9 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20200203183125.164879-1-david@redhat.com
Maintainers: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, Paolo Bonzini <pbonzini@redhat.com>, Stefan Weil <sw@weilnetz.de>, Richard Henderson <rth@twiddle.net>, Eduardo Habkost <ehabkost@redhat.com>
There is a newer version of this series
[PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by David Hildenbrand 5 years, 9 months ago
We already allow resizable ram blocks for anonymous memory; however, they
are not actually resized. All memory is mmap()ed R/W, including the memory
exceeding the used_length, up to the max_length.

When resizing, effectively only the boundary is moved. Implement actually
resizable anonymous allocations and make use of them in resizable ram
blocks when possible. Memory exceeding the used_length will be
inaccessible. Ram block notifiers in particular require care.

Having actually resizable anonymous allocations (via mmap-hackery) allows
reserving a big region in virtual address space and growing the
accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
is set to "never" under Linux, huge reservations will succeed. If there is
not enough memory when resizing (to populate parts of the reserved region),
the resize will fail. Only the actually used size is reserved in the
OS.
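
The reserve-then-grow idea can be sketched roughly as follows. This is
not the code from the series: the helper names region_reserve(),
region_grow() and region_shrink() are made up for illustration, and the
sketch assumes all sizes are multiples of the host page size. It only
shows the general technique of reserving address space with PROT_NONE and
replacing parts of it with accessible memory via MAP_FIXED.

```c
#define _DEFAULT_SOURCE /* expose MAP_ANONYMOUS and MAP_NORESERVE on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_NORESERVE
#define MAP_NORESERVE 0 /* not available everywhere; treat as a no-op */
#endif

/* Hypothetical helper: reserve max_size bytes of inaccessible address space. */
void *region_reserve(size_t max_size)
{
    void *base = mmap(NULL, max_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? NULL : base;
}

/* Hypothetical helper: make [old_size, new_size) of the region R/W. */
int region_grow(void *base, size_t old_size, size_t new_size)
{
    assert(new_size > old_size);
    /* MAP_FIXED replaces the PROT_NONE part of the reservation in place. */
    void *p = mmap((char *)base + old_size, new_size - old_size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}

/* Hypothetical helper: turn [new_size, old_size) back into a reservation. */
int region_shrink(void *base, size_t old_size, size_t new_size)
{
    assert(new_size < old_size);
    /* The old contents of the tail are discarded by the new mapping. */
    void *p = mmap((char *)base + new_size, old_size - new_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE,
                   -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}
```

A caller would reserve max_length once, grow to used_length, and treat a
failing region_grow() as "not enough memory to resize".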

E.g., virtio-mem [1] wants to reserve big resizable memory regions and
grow the usable part on demand. I think this change is worth sending out
individually. It is accompanied by a bunch of minor fixes and cleanups.

[1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/

David Hildenbrand (13):
  util: vfio-helpers: Factor out and fix processing of existings ram
    blocks
  exec: Factor out setting ram settings (madvise ...) into
    qemu_ram_apply_settings()
  exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
  exec: Drop "shared" parameter from ram_block_add()
  util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
  util/mmap-alloc: Factor out reserving of a memory region to
    mmap_reserve()
  util/mmap-alloc: Factor out populating of memory to mmap_populate()
  util/mmap-alloc: Prepare for resizable mmaps
  util/mmap-alloc: Implement resizable mmaps
  numa: Introduce ram_block_notify_resized() and
    ram_block_notifiers_support_resize()
  util: vfio-helpers: Implement ram_block_resized()
  util: oslib: Resizable anonymous allocations under POSIX
  exec: Ram blocks with resizable anonymous allocations under POSIX

 exec.c                    |  99 ++++++++++++++++++----
 hw/core/numa.c            |  39 +++++++++
 include/exec/cpu-common.h |   3 +
 include/exec/memory.h     |   8 ++
 include/exec/ramlist.h    |   4 +
 include/qemu/mmap-alloc.h |  21 +++--
 include/qemu/osdep.h      |   6 +-
 stubs/ram-block.c         |  20 -----
 util/mmap-alloc.c         | 168 ++++++++++++++++++++++++--------------
 util/oslib-posix.c        |  37 ++++++++-
 util/oslib-win32.c        |  14 ++++
 util/trace-events         |   5 +-
 util/vfio-helpers.c       |  33 ++++----
 13 files changed, 331 insertions(+), 126 deletions(-)

-- 
2.24.1


Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by Dr. David Alan Gilbert 5 years, 9 months ago
* David Hildenbrand (david@redhat.com) wrote:
> We already allow resizable ram blocks for anonymous memory, however, they
> are not actually resized. All memory is mmaped() R/W, including the memory
> exceeding the used_length, up to the max_length.
> 
> When resizing, effectively only the boundary is moved. Implement actually
> resizable anonymous allocations and make use of them in resizable ram
> blocks when possible. Memory exceeding the used_length will be
> inaccessible. Especially ram block notifiers require care.
> 
> Having actually resizable anonymous allocations (via mmap-hackery) allows
> to reserve a big region in virtual address space and grow the
> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
> is set to "never" under Linux, huge reservations will succeed. If there is
> not enough memory when resizing (to populate parts of the reserved region),
> trying to resize will fail. Only the actually used size is reserved in the
> OS.
> 
> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
> grow the usable part on demand. I think this change is worth sending out
> individually. Accompanied by a bunch of minor fixes and cleanups.
> 
> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/

There's a few bits I've not understood from skimming the patches:

  a) Am I correct in thinking you PROT_NONE the extra space so you can
grow/shrink it?
  b) What does kvm see - does it have a slot for the whole space or for
only the used space?
     I ask because we found with virtiofs/DAX experiments that on Power,
kvm gets upset if you give it a mapping with PROT_NONE.
     (That may be less of an issue if you change the mapping after the
slot is created).

  c) It's interesting this is keyed off the RAMBlock notifiers - do
     memory listeners on the address space the block is mapped into get
     triggered?  I'm wondering how vhost (and vhost-user) in particular
     see this.

Dave

> 
> David Hildenbrand (13):
>   util: vfio-helpers: Factor out and fix processing of existings ram
>     blocks
>   exec: Factor out setting ram settings (madvise ...) into
>     qemu_ram_apply_settings()
>   exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
>   exec: Drop "shared" parameter from ram_block_add()
>   util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
>   util/mmap-alloc: Factor out reserving of a memory region to
>     mmap_reserve()
>   util/mmap-alloc: Factor out populating of memory to mmap_populate()
>   util/mmap-alloc: Prepare for resizable mmaps
>   util/mmap-alloc: Implement resizable mmaps
>   numa: Introduce ram_block_notify_resized() and
>     ram_block_notifiers_support_resize()
>   util: vfio-helpers: Implement ram_block_resized()
>   util: oslib: Resizable anonymous allocations under POSIX
>   exec: Ram blocks with resizable anonymous allocations under POSIX
> 
>  exec.c                    |  99 ++++++++++++++++++----
>  hw/core/numa.c            |  39 +++++++++
>  include/exec/cpu-common.h |   3 +
>  include/exec/memory.h     |   8 ++
>  include/exec/ramlist.h    |   4 +
>  include/qemu/mmap-alloc.h |  21 +++--
>  include/qemu/osdep.h      |   6 +-
>  stubs/ram-block.c         |  20 -----
>  util/mmap-alloc.c         | 168 ++++++++++++++++++++++++--------------
>  util/oslib-posix.c        |  37 ++++++++-
>  util/oslib-win32.c        |  14 ++++
>  util/trace-events         |   5 +-
>  util/vfio-helpers.c       |  33 ++++----
>  13 files changed, 331 insertions(+), 126 deletions(-)
> 
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by David Hildenbrand 5 years, 9 months ago

> On 06.02.2020 at 21:11, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * David Hildenbrand (david@redhat.com) wrote:
>> We already allow resizable ram blocks for anonymous memory, however, they
>> are not actually resized. All memory is mmaped() R/W, including the memory
>> exceeding the used_length, up to the max_length.
>> 
>> When resizing, effectively only the boundary is moved. Implement actually
>> resizable anonymous allocations and make use of them in resizable ram
>> blocks when possible. Memory exceeding the used_length will be
>> inaccessible. Especially ram block notifiers require care.
>> 
>> Having actually resizable anonymous allocations (via mmap-hackery) allows
>> to reserve a big region in virtual address space and grow the
>> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
>> is set to "never" under Linux, huge reservations will succeed. If there is
>> not enough memory when resizing (to populate parts of the reserved region),
>> trying to resize will fail. Only the actually used size is reserved in the
>> OS.
>> 
>> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
>> grow the usable part on demand. I think this change is worth sending out
>> individually. Accompanied by a bunch of minor fixes and cleanups.
>> 
>> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/
> 
> There's a few bits I've not understood from skimming the patches:
> 

Thanks for having a look!

>  a) Am I correct in thinking you PROT_NONE the extra space so you can
> grow/shrink it?

Yes!

>  b) What does kvm see - does it have a slot for the whole space or for
> only the used space?

Only the used space. Resizing triggers a resize of the memory region. That triggers memory notifiers, which remove the old kvm memslot and re-add the new kvm memslot. (That's existing handling, so nothing new).

So KVM will not see PROT_NONE when creating a slot.

>     I ask because we found with virtiofs/DAX experiments that on Power,
> kvm gets upset if you give it a mapping with PROT_NONE.
>     (That may be less of an issue if you change the mapping after the
> slot is created).

That should work as expected. Resizing *while* kvm is running is tricky, but that's not part of this series and a different story :) Right now, resizing is only valid on reboot/incoming migration.

> 
>  c) It's interesting this is keyed off the RAMBlock notifiers - do
>     memory_listener's on the address space the block is mapped into get
>    triggered?  I'm wondering how vhost (and vhost-user) in particular
>    see this.

Yes, memory listeners get triggered. Old region is removed, new one is added. Nothing changed on that front.

The issue with ram block notifiers is that they did not do a "remove old, add new" on resizes. They only added the full ram block. Bad. E.g., vfio wants to pin all memory - which would fail on PROT_NONE.

E.g., for HAX, there is no kernel ioctl to remove a ram block ... for SEV there is, but I am not sure about the implications when converting back and forth between encrypted/unencrypted. So SEV and HAX require legacy handling.
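
A toy model of that decision, to make the idea concrete. Everything here
is hypothetical (ToyNotifier and the toy_*() functions are not QEMU's
types or signatures); it only assumes that a notifier either provides a
resize callback or does not, and that the resizable allocation is used
only when every registered notifier provides one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a ram block notifier; all names are hypothetical. */
typedef struct ToyNotifier {
    size_t registered; /* bytes this notifier currently has registered */
    /* Optional callback: NULL means the notifier cannot handle resizes. */
    void (*resized)(struct ToyNotifier *n, size_t old_size, size_t new_size);
} ToyNotifier;

/* Example resize callback: swap the old registration for the new one. */
void toy_resized(ToyNotifier *n, size_t old_size, size_t new_size)
{
    assert(n->registered == old_size);
    n->registered = new_size;
}

/* Resizable allocations are only usable if every notifier can resize. */
bool toy_notifiers_support_resize(ToyNotifier **list, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!list[i]->resized) {
            return false;
        }
    }
    return true;
}

/*
 * Size announced when the block is added: only the accessible part with a
 * resizable allocation, the full block with legacy handling (where all of
 * max_length stays mapped R/W).
 */
size_t toy_block_added_size(bool resizable_alloc, size_t used_length,
                            size_t max_length)
{
    return resizable_alloc ? used_length : max_length;
}

/* Tell every notifier that the accessible part of the block changed. */
void toy_notify_resized(ToyNotifier **list, size_t count, size_t old_size,
                        size_t new_size)
{
    for (size_t i = 0; i < count; i++) {
        list[i]->resized(list[i], old_size, new_size);
    }
}
```

With one notifier lacking the callback (the HAX/SEV case in this toy
model), toy_notifiers_support_resize() returns false and the block keeps
the legacy behaviour.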

Cheers!


Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by Dr. David Alan Gilbert 5 years, 9 months ago
* David Hildenbrand (david@redhat.com) wrote:
> 
> 
> > On 06.02.2020 at 21:11, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> > 
> > * David Hildenbrand (david@redhat.com) wrote:
> >> We already allow resizable ram blocks for anonymous memory, however, they
> >> are not actually resized. All memory is mmaped() R/W, including the memory
> >> exceeding the used_length, up to the max_length.
> >> 
> >> When resizing, effectively only the boundary is moved. Implement actually
> >> resizable anonymous allocations and make use of them in resizable ram
> >> blocks when possible. Memory exceeding the used_length will be
> >> inaccessible. Especially ram block notifiers require care.
> >> 
> >> Having actually resizable anonymous allocations (via mmap-hackery) allows
> >> to reserve a big region in virtual address space and grow the
> >> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
> >> is set to "never" under Linux, huge reservations will succeed. If there is
> >> not enough memory when resizing (to populate parts of the reserved region),
> >> trying to resize will fail. Only the actually used size is reserved in the
> >> OS.
> >> 
> >> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
> >> grow the usable part on demand. I think this change is worth sending out
> >> individually. Accompanied by a bunch of minor fixes and cleanups.
> >> 
> >> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/
> > 
> > There's a few bits I've not understood from skimming the patches:
> > 
> 
> Thanks for having a look!
> 
> >  a) Am I correct in thinking you PROT_NONE the extra space so you can
> > grow/shrink it?
> 
> Yes!
> 
> >  b) What does kvm see - does it have a slot for the whole space or for
> > only the used space?
> 
> Only the used space. Resizing triggers a resize of the memory region. That triggers memory notifiers, which remove the old kvm memslot and re-add the new kvm memslot. (That's existing handling, so nothing new).
> 
> So KVM will not see PROT_NONE when creating a slot.

OK, that's easy then.

> >     I ask because we found with virtiofs/DAX experiments that on Power,
> > kvm gets upset if you give it a mapping with PROT_NONE.
> >     (That may be less of an issue if you change the mapping after the
> > slot is created).
> 
> That should work as expected. Resizing *while* kvm is running is tricky, but that's not part of this series and a different story :) Right now, resizing is only valid on reboot/incoming migration.

Hmm, 'when' during an incoming migration? I ask because of the userfaultfd
setup for postcopy.  Also note those things can combine - i.e. a reboot
that happens during a migration (we've already got a pile of related
bugs).

> > 
> >  c) It's interesting this is keyed off the RAMBlock notifiers - do
> >     memory_listener's on the address space the block is mapped into get
> >    triggered?  I'm wondering how vhost (and vhost-user) in particular
> >    see this.
> 
> Yes, memory listeners get triggered. Old region is removed, new one is added. Nothing changed on that front.
> 
> The issue with ram block notifiers is that they did not do a "remove old, add new" on resizes. They only added the full ram block. Bad. E.g., vfio wants to pin all memory - which would fail on PROT_NONE.
> 
> E.g., for HAX, there is no kernel ioctl to remove a ram block ... for SEV there is, but I am not sure about the implications when converting back and forth between encrypted/unencrypted. So SEV and HAX require legacy handling.

I guess for a memory listener it just sees a new layout after the commit
and then can figure out what changed.

Dave

> Cheers!
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by David Hildenbrand 5 years, 9 months ago
>> That should work as expected. Resizing *while* kvm is running is tricky, but that's not part of this series and a different story :) Right now, resizing is only valid on reboot/incoming migration.
> 
> Hmm 'when' during an incoming migration; I ask because of userfaultfd
> setup for postcopy.  Also note those things can combine - i.e. a reboot
> that happens during a migration (we've already got a pile of related
> bugs).

Yes, resizing while migration is in progress (source already sent the ram
block sizes, receiving side in postcopy), or while dumping is active, is bad.
In general, resizing while *somebody* thinks the used_length of a ram block
won't change is bad.

Incoming migration: migration/ram.c: ram_load_precopy(): RAM_SAVE_FLAG_MEM_SIZE

So in preload, before the guest is running and before postcopy has started.
At that point, performing the resize is fine.

> 
>>>
>>>  c) It's interesting this is keyed off the RAMBlock notifiers - do
>>>     memory_listener's on the address space the block is mapped into get
>>>    triggered?  I'm wondering how vhost (and vhost-user) in particular
>>>    see this.
>>
>> Yes, memory listeners get triggered. Old region is removed, new one is added. Nothing changed on that front.
>>
>> The issue with ram block notifiers is that they did not do a "remove old, add new" on resizes. They only added the full ram block. Bad. E.g., vfio wants to pin all memory - which would fail on PROT_NONE.
>>
>> E.g., for HAX, there is no kernel ioctl to remove a ram block ... for SEV there is, but I am not sure about the implications when converting back and forth between encrypted/unencrypted. So SEV and HAX require legacy handling.
> 
> I guess for a memory listener it just sees a new layout after the commit
> and then can figure out what changed.

Exactly.


-- 
Thanks,

David / dhildenb


Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by Michael S. Tsirkin 5 years, 9 months ago
On Mon, Feb 03, 2020 at 07:31:12PM +0100, David Hildenbrand wrote:
> We already allow resizable ram blocks for anonymous memory, however, they
> are not actually resized. All memory is mmaped() R/W, including the memory
> exceeding the used_length, up to the max_length.
> 
> When resizing, effectively only the boundary is moved. Implement actually
> resizable anonymous allocations and make use of them in resizable ram
> blocks when possible. Memory exceeding the used_length will be
> inaccessible. Especially ram block notifiers require care.
> 
> Having actually resizable anonymous allocations (via mmap-hackery) allows
> to reserve a big region in virtual address space and grow the
> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
> is set to "never" under Linux, huge reservations will succeed. If there is
> not enough memory when resizing (to populate parts of the reserved region),
> trying to resize will fail. Only the actually used size is reserved in the
> OS.
> 
> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
> grow the usable part on demand. I think this change is worth sending out
> individually. Accompanied by a bunch of minor fixes and cleanups.
> 
> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/

How does this interact with all the prealloc/mlock things designed
for realtime?

> David Hildenbrand (13):
>   util: vfio-helpers: Factor out and fix processing of existings ram
>     blocks
>   exec: Factor out setting ram settings (madvise ...) into
>     qemu_ram_apply_settings()
>   exec: Reuse qemu_ram_apply_settings() in qemu_ram_remap()
>   exec: Drop "shared" parameter from ram_block_add()
>   util/mmap-alloc: Factor out calculation of pagesize to mmap_pagesize()
>   util/mmap-alloc: Factor out reserving of a memory region to
>     mmap_reserve()
>   util/mmap-alloc: Factor out populating of memory to mmap_populate()
>   util/mmap-alloc: Prepare for resizable mmaps
>   util/mmap-alloc: Implement resizable mmaps
>   numa: Introduce ram_block_notify_resized() and
>     ram_block_notifiers_support_resize()
>   util: vfio-helpers: Implement ram_block_resized()
>   util: oslib: Resizable anonymous allocations under POSIX
>   exec: Ram blocks with resizable anonymous allocations under POSIX
> 
>  exec.c                    |  99 ++++++++++++++++++----
>  hw/core/numa.c            |  39 +++++++++
>  include/exec/cpu-common.h |   3 +
>  include/exec/memory.h     |   8 ++
>  include/exec/ramlist.h    |   4 +
>  include/qemu/mmap-alloc.h |  21 +++--
>  include/qemu/osdep.h      |   6 +-
>  stubs/ram-block.c         |  20 -----
>  util/mmap-alloc.c         | 168 ++++++++++++++++++++++++--------------
>  util/oslib-posix.c        |  37 ++++++++-
>  util/oslib-win32.c        |  14 ++++
>  util/trace-events         |   5 +-
>  util/vfio-helpers.c       |  33 ++++----
>  13 files changed, 331 insertions(+), 126 deletions(-)
> 
> -- 
> 2.24.1


Re: [PATCH v1 00/13] Ram blocks with resizable anonymous allocations under POSIX
Posted by David Hildenbrand 5 years, 9 months ago
On 06.02.20 10:27, Michael S. Tsirkin wrote:
> On Mon, Feb 03, 2020 at 07:31:12PM +0100, David Hildenbrand wrote:
>> We already allow resizable ram blocks for anonymous memory, however, they
>> are not actually resized. All memory is mmaped() R/W, including the memory
>> exceeding the used_length, up to the max_length.
>>
>> When resizing, effectively only the boundary is moved. Implement actually
>> resizable anonymous allocations and make use of them in resizable ram
>> blocks when possible. Memory exceeding the used_length will be
>> inaccessible. Especially ram block notifiers require care.
>>
>> Having actually resizable anonymous allocations (via mmap-hackery) allows
>> to reserve a big region in virtual address space and grow the
>> accessible/usable part on demand. Even if "/proc/sys/vm/overcommit_memory"
>> is set to "never" under Linux, huge reservations will succeed. If there is
>> not enough memory when resizing (to populate parts of the reserved region),
>> trying to resize will fail. Only the actually used size is reserved in the
>> OS.
>>
>> E.g., virtio-mem [1] wants to reserve big resizable memory regions and
>> grow the usable part on demand. I think this change is worth sending out
>> individually. Accompanied by a bunch of minor fixes and cleanups.
>>
>> [1] https://lore.kernel.org/kvm/20191212171137.13872-1-david@redhat.com/
> 
> How does this interact with all the prealloc/mlock things designed
> for realtime?

- Prealloc: we don't support resizable ram blocks with prealloc
-- qemu_ram_alloc_from_ptr() is the only way to get "real" preallocated
   ram blocks from a pointer. They are never resizable.
-- "prealloc" with memory backends (e.g., backends/hostmem.c): Memory
   backends don't support resizable ram blocks yet. I have patches to
   support that, and disallow prealloc for them.
-- file-based ram blocks are not resizable

- mlock
-- os_mlock() does a mlockall(MCL_CURRENT | MCL_FUTURE)
-- That should work just fine with ordinary mmap() invalidating old
   mmaps and creating new ones. Just like when hotplugging/unplugging
   a DIMM.

Resizing currently only happens during reboot/migration. If mlockall()
resulted in issues, we could fall back to the old handling (if
(enable_mlock) ...) - but I don't think this is necessary.

So I don't see any issues with that :) Thanks!

-- 
Thanks,

David / dhildenb