@MST, any comment on the vhost bits (mostly uncontroversial and only in
the memslot domain)?
I'm planning on queuing this myself (but will wait a bit more), unless
you want to take it.
On 08.09.23 16:21, David Hildenbrand wrote:
> Quoting from patch #14:
>
> Having large virtio-mem devices that only expose little memory to a VM
> is currently a problem: we map the whole sparse memory region into the
> guest using a single memslot, resulting in one gigantic memslot in KVM.
> KVM allocates metadata for the whole memslot, which can result in quite
> some memory waste.
>
> Assuming we have a 1 TiB virtio-mem device and only expose little (e.g.,
> 1 GiB) memory, we would create a single 1 TiB memslot and KVM would have
> to allocate metadata for that whole 1 TiB memslot: on x86, this implies
> allocating a significant amount of memory for metadata:
>
> (1) RMAP: 8 bytes per 4 KiB, 8 bytes per 2 MiB, 8 bytes per 1 GiB
> -> For 1 TiB: 2147483648 + 4194304 + 8192 = ~ 2 GiB (0.2 %)
>
> With the TDP MMU (cat /sys/module/kvm/parameters/tdp_mmu), this gets
> allocated lazily, only when required for nested VMs.
> (2) gfn_track: 2 bytes per 4 KiB
> -> For 1 TiB: 536870912 = ~512 MiB (0.05 %)
> (3) lpage_info: 4 bytes per 2 MiB, 4 bytes per 1 GiB
> -> For 1 TiB: 2097152 + 4096 = ~2 MiB (0.0002 %)
> (4) 2x dirty bitmaps for tracking: 2x 1 bit per 4 KiB page
> -> For 1 TiB: 536870912 = 64 MiB (0.006 %)
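>
> For reference, these numbers can be reproduced with a quick
> back-of-the-envelope calculation (a standalone sketch, not QEMU/KVM
> code; the per-page metadata sizes are the ones listed above):
>
>   #include <inttypes.h>
>   #include <stdint.h>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       const uint64_t size = 1ULL << 40;               /* 1 TiB memslot */
>       const uint64_t pages_4k = size / (4ULL << 10);  /* 268435456 */
>       const uint64_t pages_2m = size / (2ULL << 20);  /* 524288 */
>       const uint64_t pages_1g = size / (1ULL << 30);  /* 1024 */
>
>       /* (1) RMAP: 8 bytes per 4 KiB, 2 MiB and 1 GiB page */
>       uint64_t rmap = 8 * (pages_4k + pages_2m + pages_1g);
>       /* (2) gfn_track: 2 bytes per 4 KiB page */
>       uint64_t gfn_track = 2 * pages_4k;
>       /* (3) lpage_info: 4 bytes per 2 MiB and 1 GiB page */
>       uint64_t lpage_info = 4 * (pages_2m + pages_1g);
>       /* (4) two dirty bitmaps: 2 * 1 bit per 4 KiB page */
>       uint64_t bitmaps = 2 * pages_4k / 8;
>
>       printf("rmap:       %" PRIu64 " bytes\n", rmap);        /* ~2 GiB   */
>       printf("gfn_track:  %" PRIu64 " bytes\n", gfn_track);   /* ~512 MiB */
>       printf("lpage_info: %" PRIu64 " bytes\n", lpage_info);  /* ~2 MiB   */
>       printf("bitmaps:    %" PRIu64 " bytes\n", bitmaps);     /* 64 MiB   */
>       return 0;
>   }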
>
> So we primarily care about (1) and (2). The bad thing is that the
> memory consumption doubles once SMM is enabled, because we create the
> memslot once for !SMM and once for SMM.
>
> Having a 1 TiB memslot without the TDP MMU consumes around:
> * With SMM: 5 GiB
> * Without SMM: 2.5 GiB
> Having a 1 TiB memslot with the TDP MMU consumes around:
> * With SMM: 1 GiB
> * Without SMM: 512 MiB
>
> ... and that's really something we want to optimize, to be able to just
> start a VM with small boot memory (e.g., 4 GiB) and a virtio-mem device
> that can grow very large (e.g., 1 TiB).
>
> Consequently, using multiple memslots and only mapping the memslots we
> really need can significantly reduce memory waste and speed up
> memslot-related operations. Let's expose the sparse RAM memory region using
> multiple memslots, mapping only the memslots we currently need into our
> device memory region container.
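>
> As a rough sketch of that idea (illustrative only, with hypothetical
> names; the actual wiring is in patch #14): the device RAM region is
> carved into fixed-size windows, and each window is mapped into the
> device memory container via an alias only once it is actually needed:
>
>   #include "qemu/osdep.h"
>   #include "exec/memory.h"
>
>   /* Map memslot window @idx of @ram into @container (sketch only). */
>   static void sketch_map_memslot(MemoryRegion *container, MemoryRegion *ram,
>                                  MemoryRegion *alias, unsigned int idx,
>                                  uint64_t memslot_size)
>   {
>       const uint64_t offset = (uint64_t)idx * memslot_size;
>
>       /* One alias per memslot window into the backing RAM region ... */
>       memory_region_init_alias(alias, NULL, "memslot-window", ram,
>                                offset, memslot_size);
>       /* ... mapped into the container only when the window is needed. */
>       memory_region_add_subregion(container, offset, alias);
>   }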
>
> The hyper-v balloon driver has similar demands [1].
>
> For virtio-mem, this has to be turned on manually ("multiple-memslots=on"),
> due to the interaction with vhost (below).
>
> If we have less than 509 memslots available, we always default to a single
> memslot. Otherwise, we automatically decide how many memslots to use
> based on a simple heuristic (see patch #12), and try not to use more than
> 256 memslots across all memory devices: our historical DIMM limit.
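>
> The following is only a rough illustration of that kind of decision
> logic (the names and the size-based policy are made up here; the real
> heuristic is in patch #12):
>
>   #include <stdint.h>
>
>   static unsigned int sketch_auto_memslots(uint64_t region_size,
>                                            unsigned int free_memslots,
>                                            unsigned int used_by_memory_devices)
>   {
>       const uint64_t gib = 1ULL << 30;
>       unsigned int budget;
>       uint64_t wanted;
>
>       /* Fewer than 509 free memslots: always default to a single memslot. */
>       if (free_memslots < 509) {
>           return 1;
>       }
>
>       /* Try not to use more than 256 memslots across all memory devices. */
>       if (used_by_memory_devices >= 256) {
>           return 1;
>       }
>       budget = 256 - used_by_memory_devices;
>
>       /* Hypothetical size-based policy: roughly one memslot per GiB. */
>       wanted = region_size / gib ? region_size / gib : 1;
>       return wanted < budget ? (unsigned int)wanted : budget;
>   }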
>
> As soon as any memory device automatically decides to use more than
> one memslot, vhost devices that support fewer than 509 memslots (e.g.,
> currently most vhost-user devices, such as virtiofsd) can no longer be
> plugged, as a precaution.
>
> Quoting from patch #12:
>
> Plugging vhost devices that support fewer than 509 memslots while we
> have memory devices plugged that consume multiple memslots due to
> automatic decisions can be problematic. Most configurations might just
> fail due to "limit < used + reserved"; however, it can also happen that
> these memory devices would suddenly consume memslots that would actually
> be required by other memslot consumers (boot, PCI BARs) later. Note that
> this has always been sketchy with vhost devices that support only a small
> number of memslots; but we don't want to make it any worse. So let's keep
> it simple and simply reject plugging such vhost devices in such a
> configuration.
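>
> The plug-time check itself is conceptually simple; a simplified sketch
> (hypothetical names, not the actual code from this series):
>
>   #include <stdbool.h>
>
>   static bool sketch_vhost_device_can_plug(unsigned int vhost_memslot_limit,
>                                            unsigned int used_memslots,
>                                            unsigned int reserved_memslots,
>                                            bool auto_multi_memslots_in_use)
>   {
>       /* "limit < used + reserved": such configurations fail right away. */
>       if (vhost_memslot_limit < used_memslots + reserved_memslots) {
>           return false;
>       }
>
>       /*
>        * Precaution: once memory devices automatically decided to use
>        * multiple memslots, reject vhost devices with a small memslot
>        * limit, so those memory devices cannot eat up memslots that other
>        * consumers (boot memory, PCI BARs) will still need.
>        */
>       if (auto_multi_memslots_in_use && vhost_memslot_limit < 509) {
>           return false;
>       }
>       return true;
>   }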
>
> Eventually, all vhost devices that want to be fully compatible with such
> memory devices should support a decent number of memslots (>= 509).
>
>
> The recommendation is to plug such vhost devices before the virtio-mem
> device decides, or not to set "multiple-memslots=on". As soon as these
> devices support a reasonable number of memslots (>= 509), this will start
> working automatically.
>
> I ran some tests on x86_64, now also including vfio tests. Everything seems
> to work as expected, even when multiple memslots are used.
>
>
> Patches #1 -- #3 are from [2] and have not been picked up yet.
>
> Patches #4 -- #12 add handling of multiple memslots to memory devices.
>
> Patches #13 -- #14 add "multiple-memslots=on" support to virtio-mem.
>
> Patches #15 -- #16 make sure that virtio-mem memslots can be
> enabled/disabled atomically.
>
> v2 -> v3:
> * "kvm: Return number of free memslots"
> -> Return 0 in stub
> * "kvm: Add stub for kvm_get_max_memslots()"
> -> Return 0 in stub
> * Adjust other patches to check for kvm_enabled() before calling
> kvm_get_free_memslots()/kvm_get_max_memslots()
> * Add RBs
>
> v1 -> v2:
> * Include patches from [1]
> * A lot of code simplification and reorganization, too much to spell out
> * Don't add a general soft-limit on memslots, to avoid warnings in sane
> setups
> * Simplify handling of vhost devices with a small number of memslots:
> simply fail plugging them
> * "virtio-mem: Expose device memory via multiple memslots if enabled"
> -> Fix one "is this the last memslot" check
> * Much more testing
>
>
> [1] https://lkml.kernel.org/r/cover.1689786474.git.maciej.szmigiero@oracle.com
> [2] https://lkml.kernel.org/r/20230523185915.540373-1-david@redhat.com
>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: "Philippe Mathieu-Daudé" <philmd@linaro.org>
> Cc: Eduardo Habkost <eduardo@habkost.net>
> Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
> Cc: Yanan Wang <wangyanan55@huawei.com>
> Cc: Michal Privoznik <mprivozn@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Gavin Shan <gshan@redhat.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
> Cc: kvm@vger.kernel.org
>
> David Hildenbrand (16):
> vhost: Rework memslot filtering and fix "used_memslot" tracking
> vhost: Remove vhost_backend_can_merge() callback
> softmmu/physmem: Fixup qemu_ram_block_from_host() documentation
> kvm: Return number of free memslots
> vhost: Return number of free memslots
> memory-device: Support memory devices with multiple memslots
> stubs: Rename qmp_memory_device.c to memory_device.c
> memory-device: Track required and actually used memslots in
> DeviceMemoryState
> memory-device,vhost: Support memory devices that dynamically consume
> memslots
> kvm: Add stub for kvm_get_max_memslots()
> vhost: Add vhost_get_max_memslots()
> memory-device,vhost: Support automatic decision on the number of
> memslots
> memory: Clarify mapping requirements for RamDiscardManager
> virtio-mem: Expose device memory via multiple memslots if enabled
> memory,vhost: Allow for marking memory device memory regions
> unmergeable
> virtio-mem: Mark memslot alias memory regions unmergeable
>
> MAINTAINERS | 1 +
> accel/kvm/kvm-all.c | 35 ++-
> accel/stubs/kvm-stub.c | 9 +-
> hw/mem/memory-device.c | 196 ++++++++++++-
> hw/virtio/vhost-stub.c | 9 +-
> hw/virtio/vhost-user.c | 21 +-
> hw/virtio/vhost-vdpa.c | 1 -
> hw/virtio/vhost.c | 103 +++++--
> hw/virtio/virtio-mem-pci.c | 21 ++
> hw/virtio/virtio-mem.c | 272 +++++++++++++++++-
> include/exec/cpu-common.h | 15 +
> include/exec/memory.h | 27 +-
> include/hw/boards.h | 14 +-
> include/hw/mem/memory-device.h | 57 ++++
> include/hw/virtio/vhost-backend.h | 9 +-
> include/hw/virtio/vhost.h | 3 +-
> include/hw/virtio/virtio-mem.h | 23 +-
> include/sysemu/kvm.h | 4 +-
> include/sysemu/kvm_int.h | 1 +
> softmmu/memory.c | 35 ++-
> softmmu/physmem.c | 17 --
> .../{qmp_memory_device.c => memory_device.c} | 10 +
> stubs/meson.build | 2 +-
> 23 files changed, 779 insertions(+), 106 deletions(-)
> rename stubs/{qmp_memory_device.c => memory_device.c} (56%)
>
--
Cheers,
David / dhildenb