virtio-mem logically plugs/unplugs memory within a sparse memory region
and notifies via the RamDiscardManager interface when parts become
plugged (populated) or unplugged (discarded).
Currently, we end up (via the two users)
1) zeroing all logically unplugged/discarded memory during TPM resets.
2) reading all logically unplugged/discarded memory when dumping, to
figure out the content is zero.
1) is always bad, because we assume unplugged memory stays discarded
(and is already implicitly zero).
2) isn't that bad with anonymous memory: we end up reading the zero
page (slow and unnecessary, though). However, once we use some
file-backed memory (a future use case), even reading will populate memory.
Let's cut out all parts marked as not-populated (discarded) via the
RamDiscardManager. As virtio-mem is the single user, this now means that
logically unplugged memory ranges will no longer be included in the
dump, which results in smaller dump files and faster dumping.
virtio-mem has a minimum granularity of 1 MiB (and the default is usually
2 MiB). Theoretically, we can see quite some fragmentation; in practice,
we won't have it completely fragmented into 1 MiB pieces. Still, we might
end up with many physical ranges.
Both the ELF format and kdump seem to be ready to support many
individual ranges (e.g., for ELF the limit seems to be UINT32_MAX, and
kdump has a linear bitmap).
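
To make the mechanism concrete, here is a small self-contained toy model in
plain C (not QEMU code; every name in it is made up for illustration) of the
"replay populated parts via a callback" pattern this patch applies to dumping:
a plug bitmap at block granularity is walked and only populated extents are
handed to a callback, so discarded ranges are never read or zeroed.

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_BLOCKS   8
#define BLOCK_SIZE  (2 * 1024 * 1024ULL)   /* 2 MiB, virtio-mem's usual default */

typedef void (*populate_cb)(uint64_t offset, uint64_t size, void *opaque);

/* Walk the plug bitmap and invoke @cb once per maximal populated extent. */
static void replay_populated(const bool *plugged, populate_cb cb, void *opaque)
{
    for (int i = 0; i < NR_BLOCKS; ) {
        if (!plugged[i]) {
            i++;
            continue;
        }
        int start = i;
        while (i < NR_BLOCKS && plugged[i]) {
            i++;
        }
        cb((uint64_t)start * BLOCK_SIZE, (uint64_t)(i - start) * BLOCK_SIZE,
           opaque);
    }
}

static void add_range(uint64_t offset, uint64_t size, void *opaque)
{
    int *count = opaque;

    (*count)++;
    printf("range %d: offset=0x%" PRIx64 " size=0x%" PRIx64 "\n",
           *count, offset, size);
}

int main(void)
{
    /* Sparse layout: only blocks 0-1 and 4 are logically plugged. */
    bool plugged[NR_BLOCKS] = { true, true, false, false, true, false, false, false };
    int count = 0;

    replay_populated(plugged, add_range, &count);
    printf("dump would contain %d range(s) instead of the whole region\n", count);
    return 0;
}
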
Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Claudio Fontana <cfontana@suse.de>
Cc: Thomas Huth <thuth@redhat.com>
Cc: "Alex Bennée" <alex.bennee@linaro.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Laurent Vivier <lvivier@redhat.com>
Cc: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
softmmu/memory_mapping.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/softmmu/memory_mapping.c b/softmmu/memory_mapping.c
index b7e4f3f788..856778a109 100644
--- a/softmmu/memory_mapping.c
+++ b/softmmu/memory_mapping.c
@@ -246,6 +246,15 @@ static void guest_phys_block_add_section(GuestPhysListener *g,
#endif
}
+static int guest_phys_ram_populate_cb(MemoryRegionSection *section,
+ void *opaque)
+{
+ GuestPhysListener *g = opaque;
+
+ guest_phys_block_add_section(g, section);
+ return 0;
+}
+
static void guest_phys_blocks_region_add(MemoryListener *listener,
MemoryRegionSection *section)
{
@@ -257,6 +266,17 @@ static void guest_phys_blocks_region_add(MemoryListener *listener,
memory_region_is_nonvolatile(section->mr)) {
return;
}
+
+ /* for special sparse regions, only add populated parts */
+ if (memory_region_has_ram_discard_manager(section->mr)) {
+ RamDiscardManager *rdm;
+
+ rdm = memory_region_get_ram_discard_manager(section->mr);
+ ram_discard_manager_replay_populated(rdm, section,
+ guest_phys_ram_populate_cb, g);
+ return;
+ }
+
guest_phys_block_add_section(g, section);
}
--
2.31.1
On Tue, Jul 20, 2021 at 03:03:04PM +0200, David Hildenbrand wrote:
> virtio-mem logically plugs/unplugs memory within a sparse memory region
> and notifies via the RamDiscardManager interface when parts become
> plugged (populated) or unplugged (discarded).
>
> Currently, we end up (via the two users)
> 1) zeroing all logically unplugged/discarded memory during TPM resets.
> 2) reading all logically unplugged/discarded memory when dumping, to
> figure out the content is zero.
>
> 1) is always bad, because we assume unplugged memory stays discarded
> (and is already implicitly zero).
> 2) isn't that bad with anonymous memory: we end up reading the zero
> page (slow and unnecessary, though). However, once we use some
> file-backed memory (a future use case), even reading will populate memory.
>
> Let's cut out all parts marked as not-populated (discarded) via the
> RamDiscardManager. As virtio-mem is the single user, this now means that
> logically unplugged memory ranges will no longer be included in the
> dump, which results in smaller dump files and faster dumping.
>
> virtio-mem has a minimum granularity of 1 MiB (and the default is usually
> 2 MiB). Theoretically, we can see quite some fragmentation; in practice,
> we won't have it completely fragmented into 1 MiB pieces. Still, we might
> end up with many physical ranges.
>
> Both the ELF format and kdump seem to be ready to support many
> individual ranges (e.g., for ELF the limit seems to be UINT32_MAX, and
> kdump has a linear bitmap).
>
> Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Claudio Fontana <cfontana@suse.de>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: "Alex Bennée" <alex.bennee@linaro.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Laurent Vivier <lvivier@redhat.com>
> Cc: Stefan Berger <stefanb@linux.ibm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> softmmu/memory_mapping.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/softmmu/memory_mapping.c b/softmmu/memory_mapping.c
> index b7e4f3f788..856778a109 100644
> --- a/softmmu/memory_mapping.c
> +++ b/softmmu/memory_mapping.c
> @@ -246,6 +246,15 @@ static void guest_phys_block_add_section(GuestPhysListener *g,
> #endif
> }
>
> +static int guest_phys_ram_populate_cb(MemoryRegionSection *section,
> + void *opaque)
> +{
> + GuestPhysListener *g = opaque;
> +
> + guest_phys_block_add_section(g, section);
> + return 0;
> +}
> +
> static void guest_phys_blocks_region_add(MemoryListener *listener,
> MemoryRegionSection *section)
> {
> @@ -257,6 +266,17 @@ static void guest_phys_blocks_region_add(MemoryListener *listener,
> memory_region_is_nonvolatile(section->mr)) {
> return;
> }
> +
> + /* for special sparse regions, only add populated parts */
> + if (memory_region_has_ram_discard_manager(section->mr)) {
> + RamDiscardManager *rdm;
> +
> + rdm = memory_region_get_ram_discard_manager(section->mr);
> + ram_discard_manager_replay_populated(rdm, section,
> + guest_phys_ram_populate_cb, g);
> + return;
> + }
> +
> guest_phys_block_add_section(g, section);
> }
As I've asked this question previously elsewhere, it's more or less also
related to the design decision of having virtio-mem be able to sparsely
plug at such a small granularity, rather than keeping plug/unplug
contiguous within the GPA range (so we would move pages when unplugging).
There are definitely reasons for that, and I believe you're the expert on
them (as you mentioned once: some guest GUPed pages cannot migrate, so
those ranges cannot be offlined otherwise), but so far I'm still not sure
whether that's a kernel issue to solve on the GUP side, although I agree
it's a complicated one anyway!
Maybe it's a trade-off you made in the end; I don't have enough knowledge to tell.
The patch itself looks okay to me; there's just a slight worry about how
long the list could end up being if it's chopped into 1M/2M small chunks.
Thanks,
--
Peter Xu
>
> As I've asked this question previously elsewhere, it's more or less also
> related to the design decision of having virtio-mem be able to sparsely
> plug at such a small granularity, rather than keeping plug/unplug
> contiguous within the GPA range (so we would move pages when unplugging).
Yes, in an ideal world that would be the optimal solution. Unfortunately,
we're not living in an ideal world :)
virtio-mem in Linux guests will by default try unplugging from highest to
lowest addresses, and I have an item on my TODO list to shrink the usable
region (and later, the actual RAMBlock) once possible.
So virtio-mem is prepared for that, but it will only apply in some cases.
>
> There are definitely reasons for that, and I believe you're the expert on
> them (as you mentioned once: some guest GUPed pages cannot migrate, so
> those ranges cannot be offlined otherwise), but so far I'm still not sure
> whether that's a kernel issue to solve on the GUP side, although I agree
> it's a complicated one anyway!
To do something like that reliably, you have to manage hotplugged memory
in a special way, for example, in a movable zone.
We have at least 4 cases:
a) The guest OS supports the movable zone and uses it for all hotplugged
memory
b) The guest OS supports the movable zone and uses it for some
hotplugged memory
c) The guest OS supports the movable zone and uses it for no hotplugged
memory
d) The guest OS does not support the concept of movable zones
a) is the dream but only applies in some cases if Linux is properly
configured (e.g., never hotplug more than 3 times the boot memory)
b) will be possible under Linux soon (e.g., when hotplugging more than 3
times the boot memory)
c) is the default under Linux for most Linux distributions
d) is Windows
In addition, we can still have random unplug errors when using the
movable zone, for example, if someone references a page just a little
too long.
Maybe that helps.
>
> Maybe it's a trade-off you made in the end; I don't have enough knowledge to tell.
That's the precise description of what virtio-mem is. It's a trade-off
between which OSs we want to support, what the guest OS can actually do,
how we can manage memory in the hypervisor efficiently, ...
>
> The patch itself looks okay to me; there's just a slight worry about how
> long the list could end up being if it's chopped into 1M/2M small chunks.
I don't think that's really an issue: take a look at
qemu_get_guest_memory_mapping(), which will create as many entries as
necessary to express the guest physical mapping of the guest virtual (!)
address space with such chunks. That can be a lot :)
--
Thanks,
David / dhildenb
On Fri, Jul 23, 2021 at 08:56:54PM +0200, David Hildenbrand wrote:
> >
> > As I've asked this question previously elsewhere, it's more or less also
> > related to the design decision of having virtio-mem be able to sparsely
> > plug at such a small granularity, rather than keeping plug/unplug
> > contiguous within the GPA range (so we would move pages when unplugging).
>
> Yes, in an ideal world that would be the optimal solution. Unfortunately,
> we're not living in an ideal world :)
>
> virtio-mem in Linux guests will by default try unplugging from highest to
> lowest addresses, and I have an item on my TODO list to shrink the usable
> region (and later, the actual RAMBlock) once possible.
>
> So virtio-mem is prepared for that, but it will only apply in some cases.
>
> >
> > There are definitely reasons for that, and I believe you're the expert on
> > them (as you mentioned once: some guest GUPed pages cannot migrate, so
> > those ranges cannot be offlined otherwise), but so far I'm still not sure
> > whether that's a kernel issue to solve on the GUP side, although I agree
> > it's a complicated one anyway!
>
> To do something like that reliably, you have to manage hotplugged memory
> in a special way, for example, in a movable zone.
>
> We have at least 4 cases:
>
> a) The guest OS supports the movable zone and uses it for all hotplugged
> memory
> b) The guest OS supports the movable zone and uses it for some
> hotplugged memory
> c) The guest OS supports the movable zone and uses it for no hotplugged
> memory
> d) The guest OS does not support the concept of movable zones
>
> a) is the dream but only applies in some cases if Linux is properly
> configured (e.g., never hotplug more than 3 times the boot memory)
> b) will be possible under Linux soon (e.g., when hotplugging more than 3
> times the boot memory)
> c) is the default under Linux for most Linux distributions
> d) is Windows
>
> In addition, we can still have random unplug errors when using the
> movable zone, for example, if someone references a page just a little
> too long.
>
> Maybe that helps.

Yes, thanks.

> >
> > Maybe it's a trade-off you made in the end; I don't have enough knowledge
> > to tell.
>
> That's the precise description of what virtio-mem is. It's a trade-off
> between which OSs we want to support, what the guest OS can actually do,
> how we can manage memory in the hypervisor efficiently, ...
>
> >
> > The patch itself looks okay to me; there's just a slight worry about how
> > long the list could end up being if it's chopped into 1M/2M small chunks.
>
> I don't think that's really an issue: take a look at
> qemu_get_guest_memory_mapping(), which will create as many entries as
> necessary to express the guest physical mapping of the guest virtual (!)
> address space with such chunks. That can be a lot :)

I'm indeed a bit surprised by the "paging" parameter. I gave it a try, and
the list grows into tens of thousands.

One last question: will virtio-mem still make a best effort to move the
pages, so as to leave as few holes as possible?

Thanks,

--
Peter Xu
On 24.07.21 00:33, Peter Xu wrote:
> On Fri, Jul 23, 2021 at 08:56:54PM +0200, David Hildenbrand wrote:
>>>
>>> As I've asked this question previously elsewhere, it's more or less also
>>> related to the design decision of having virtio-mem be able to sparsely
>>> plug at such a small granularity, rather than keeping plug/unplug
>>> contiguous within the GPA range (so we would move pages when unplugging).
>>
>> Yes, in an ideal world that would be the optimal solution. Unfortunately,
>> we're not living in an ideal world :)
>>
>> virtio-mem in Linux guests will by default try unplugging from highest to
>> lowest addresses, and I have an item on my TODO list to shrink the usable
>> region (and later, the actual RAMBlock) once possible.
>>
>> So virtio-mem is prepared for that, but it will only apply in some cases.
>>
>>>
>>> There are definitely reasons for that, and I believe you're the expert on
>>> them (as you mentioned once: some guest GUPed pages cannot migrate, so
>>> those ranges cannot be offlined otherwise), but so far I'm still not sure
>>> whether that's a kernel issue to solve on the GUP side, although I agree
>>> it's a complicated one anyway!
>>
>> To do something like that reliably, you have to manage hotplugged memory
>> in a special way, for example, in a movable zone.
>>
>> We have at least 4 cases:
>>
>> a) The guest OS supports the movable zone and uses it for all hotplugged
>> memory
>> b) The guest OS supports the movable zone and uses it for some
>> hotplugged memory
>> c) The guest OS supports the movable zone and uses it for no hotplugged
>> memory
>> d) The guest OS does not support the concept of movable zones
>>
>> a) is the dream but only applies in some cases if Linux is properly
>> configured (e.g., never hotplug more than 3 times the boot memory)
>> b) will be possible under Linux soon (e.g., when hotplugging more than 3
>> times the boot memory)
>> c) is the default under Linux for most Linux distributions
>> d) is Windows
>>
>> In addition, we can still have random unplug errors when using the
>> movable zone, for example, if someone references a page just a little
>> too long.
>>
>> Maybe that helps.
>
> Yes, thanks.
>
>>
>>> Maybe it's a trade-off you made in the end; I don't have enough knowledge
>>> to tell.
>>
>> That's the precise description of what virtio-mem is. It's a trade-off
>> between which OSs we want to support, what the guest OS can actually do,
>> how we can manage memory in the hypervisor efficiently, ...
>>
>>>
>>> The patch itself looks okay to me; there's just a slight worry about how
>>> long the list could end up being if it's chopped into 1M/2M small chunks.
>>
>> I don't think that's really an issue: take a look at
>> qemu_get_guest_memory_mapping(), which will create as many entries as
>> necessary to express the guest physical mapping of the guest virtual (!)
>> address space with such chunks. That can be a lot :)
>
> I'm indeed a bit surprised by the "paging" parameter. I gave it a try, and
> the list grows into tens of thousands.

Yes, and the bigger the VM, the more entries you should get ... like with
virtio-mem.

>
> One last question: will virtio-mem still make a best effort to move the
> pages, so as to leave as few holes as possible?

That depends on the guest OS.

Linux guests will unplug highest-to-lowest addresses. They will try
migrating pages away (alloc_contig_range()) to minimize fragmentation.
Further, when (un)plugging, they will try to a) unplug within already
fragmented Linux memory blocks (e.g., 128 MiB) and b) plug within already
fragmented Linux memory blocks first, because the goal is to require as
few Linux memory blocks as possible, to reduce metadata (memmap) overhead.

I recall that the Windows prototype also tries to unplug highest-to-lowest
using the Windows range allocator; however, I have no idea what that range
allocator actually does (whether it only grabs free pages or whether it can
actually move busy pages around).

For Linux guests, there is a work item to continue defragmenting the layout
to free up complete Linux memory blocks over time.

With a 1 TiB virtio-mem device and a 2 MiB block size (the default), in the
worst case we would get 262144 individual blocks (every second one plugged).
While this is far from realistic, I assume we can get something comparable
when dumping a huge VM in paging mode. With 262144 entries, at ~48 bytes
(6 * 8 bytes) per element, we'd consume 12 MiB for the whole list. Not
perfect, but not too bad.

--
Thanks,

David / dhildenb
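
As a quick sanity check of the worst-case figures quoted above, here is a
rough sketch under the thread's assumptions (1 TiB device, 2 MiB block size,
every second block plugged, ~48 bytes per list element; the per-element size
is an estimate from the thread, not a measured value):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t device_size  = 1ULL << 40;               /* 1 TiB virtio-mem device */
    const uint64_t block_size   = 2ULL << 20;               /* 2 MiB default block size */
    const uint64_t blocks       = device_size / block_size; /* 524288 blocks */
    const uint64_t worst_ranges = blocks / 2;               /* every 2nd plugged: 262144 */
    const uint64_t entry_bytes  = 6 * 8;                    /* ~48 bytes per element (estimate) */

    printf("worst-case ranges: %" PRIu64 "\n", worst_ranges);
    printf("approx. list size: %" PRIu64 " MiB\n",
           worst_ranges * entry_bytes / (1024 * 1024));
    return 0;
}

This prints 262144 ranges and roughly 12 MiB, matching the numbers above.
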
On Tue, Jul 20, 2021 at 03:03:04PM +0200, David Hildenbrand wrote:
> virtio-mem logically plugs/unplugs memory within a sparse memory region
> and notifies via the RamDiscardManager interface when parts become
> plugged (populated) or unplugged (discarded).
>
> Currently, we end up (via the two users)
> 1) zeroing all logically unplugged/discarded memory during TPM resets.
> 2) reading all logically unplugged/discarded memory when dumping, to
> figure out the content is zero.
>
> 1) is always bad, because we assume unplugged memory stays discarded
> (and is already implicitly zero).
> 2) isn't that bad with anonymous memory: we end up reading the zero
> page (slow and unnecessary, though). However, once we use some
> file-backed memory (a future use case), even reading will populate memory.
>
> Let's cut out all parts marked as not-populated (discarded) via the
> RamDiscardManager. As virtio-mem is the single user, this now means that
> logically unplugged memory ranges will no longer be included in the
> dump, which results in smaller dump files and faster dumping.
>
> virtio-mem has a minimum granularity of 1 MiB (and the default is usually
> 2 MiB). Theoretically, we can see quite some fragmentation; in practice,
> we won't have it completely fragmented into 1 MiB pieces. Still, we might
> end up with many physical ranges.
>
> Both the ELF format and kdump seem to be ready to support many
> individual ranges (e.g., for ELF the limit seems to be UINT32_MAX, and
> kdump has a linear bitmap).
>
> Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Cc: Igor Mammedov <imammedo@redhat.com>
> Cc: Claudio Fontana <cfontana@suse.de>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: "Alex Bennée" <alex.bennee@linaro.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Laurent Vivier <lvivier@redhat.com>
> Cc: Stefan Berger <stefanb@linux.ibm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

--
Peter Xu