[PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model
Posted by Michal Privoznik 2 years, 10 months ago
v4 of:

https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html

diff to v3:
- Rebased code on top of master
- Tried to work in all of Peter's review suggestions
- Fixed a bug where adjusting <requested/> was viewed as hotplug of new
  <memory/> by XML validator and thus if <maxMemory/> was close enough to
  <currentMemory/> the validator reported an error (this was reported by
  David).

You can also find these patches on my branch:

https://gitlab.com/MichalPrivoznik/libvirt/-/tree/virtio_mem_v4

Michal Prívozník (14):
  virhostmem: Introduce virHostMemGetTHPSize()
  qemu_capabilities: Introduce QEMU_CAPS_DEVICE_VIRTIO_MEM_PCI
  conf: Introduce virtio-mem <memory/> model
  qemu: Build command line for virtio-mem
  qemu: Wire up <memory/> live update
  Introduce <actual/> property to virtio-mem
  conf: Introduce virDomainMemoryFindByDeviceAlias()
  qemu: Wire up MEMORY_DEVICE_SIZE_CHANGE event
  qemu: Refresh the actual size of virtio-mem on monitor reconnect
  qemu: Account for both memballoon and virtio-mem
  qemuDomainSetMemoryFlags: Take virtio-mem into consideration
  virsh: Introduce update-memory-device command
  news: document recent virtio memory addition
  kbase: Document virtio-mem

 NEWS.rst                                      |  12 +
 docs/formatdomain.rst                         |  45 +++-
 docs/kbase/index.rst                          |   4 +
 docs/kbase/memorydevices.rst                  | 150 +++++++++++
 docs/kbase/meson.build                        |   1 +
 docs/manpages/virsh.rst                       |  30 +++
 docs/schemas/domaincommon.rng                 |  16 ++
 examples/c/misc/event-test.c                  |  17 ++
 include/libvirt/libvirt-domain.h              |  23 ++
 src/conf/domain_conf.c                        | 118 ++++++++-
 src/conf/domain_conf.h                        |  15 ++
 src/conf/domain_event.c                       |  84 ++++++
 src/conf/domain_event.h                       |  10 +
 src/conf/domain_validate.c                    |  39 +++
 src/libvirt_private.syms                      |   5 +
 src/qemu/qemu_alias.c                         |  10 +-
 src/qemu/qemu_capabilities.c                  |   2 +
 src/qemu/qemu_capabilities.h                  |   1 +
 src/qemu/qemu_command.c                       |  13 +-
 src/qemu/qemu_domain.c                        |  33 ++-
 src/qemu/qemu_domain.h                        |   1 +
 src/qemu/qemu_domain_address.c                |  38 ++-
 src/qemu/qemu_driver.c                        | 240 +++++++++++++++++-
 src/qemu/qemu_hotplug.c                       |  18 ++
 src/qemu/qemu_hotplug.h                       |   5 +
 src/qemu/qemu_monitor.c                       |  37 +++
 src/qemu/qemu_monitor.h                       |  28 ++
 src/qemu/qemu_monitor_json.c                  |  97 +++++--
 src/qemu/qemu_monitor_json.h                  |   5 +
 src/qemu/qemu_process.c                       |  72 ++++++
 src/qemu/qemu_validate.c                      |   8 +
 src/remote/remote_daemon_dispatch.c           |  30 +++
 src/remote/remote_driver.c                    |  32 +++
 src/remote/remote_protocol.x                  |  14 +-
 src/remote_protocol-structs                   |   7 +
 src/security/security_apparmor.c              |   1 +
 src/security/security_dac.c                   |   2 +
 src/security/security_selinux.c               |   2 +
 src/util/virhostmem.c                         |  54 ++++
 src/util/virhostmem.h                         |   3 +
 tests/domaincapsmock.c                        |   9 +
 .../caps_5.1.0.x86_64.xml                     |   1 +
 .../caps_5.2.0.x86_64.xml                     |   1 +
 .../caps_6.0.0.x86_64.xml                     |   1 +
 .../caps_6.1.0.x86_64.xml                     |   1 +
 ...mory-hotplug-virtio-mem.x86_64-latest.args |  41 +++
 .../memory-hotplug-virtio-mem.xml             |  67 +++++
 tests/qemuxml2argvtest.c                      |   1 +
 ...emory-hotplug-virtio-mem.x86_64-latest.xml |   1 +
 tests/qemuxml2xmltest.c                       |   1 +
 tools/virsh-domain.c                          | 169 ++++++++++++
 51 files changed, 1562 insertions(+), 53 deletions(-)
 create mode 100644 docs/kbase/memorydevices.rst
 create mode 100644 tests/qemuxml2argvdata/memory-hotplug-virtio-mem.x86_64-latest.args
 create mode 100644 tests/qemuxml2argvdata/memory-hotplug-virtio-mem.xml
 create mode 120000 tests/qemuxml2xmloutdata/memory-hotplug-virtio-mem.x86_64-latest.xml

-- 
2.31.1

Re: [PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model
Posted by David Hildenbrand 2 years, 9 months ago
On 23.06.21 12:12, Michal Privoznik wrote:
> v4 of:
> 
> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
> 
> diff to v3:
> - Rebased code on top of master
> - Tried to work in all of Peter's review suggestions
> - Fixed a bug where adjusting <requested/> was viewed as hotplug of new
>    <memory/> by XML validator and thus if <maxMemory/> was close enough to
>    <currentMemory/> the validator reported an error (this was reported by
>    David).
> 

Hi Michal,

I just retested with this version and it mostly works as expected. I
tested quite a few memory configurations and have some comments / reports :)

I tested successfully:
- 1 node with one device
- 2 nodes with one device on each node
- 2 nodes with two devices on one node
- "virsh update-memory-device" on live domains -- works great
- huge pages and anonymous memory with access=private and access=shared.
   There is only one issue with hugepages and memfd (prealloc=on gets
   set).
- shared memory on memfd and anonymous memory (-> shared file) with
   access=shared

I only tested on a single host NUMA node so far, but don't expect
surprises with host NUMA policies.


1. "virsh update-memory-device" and stopped domains

Once I have more than one virtio-mem device defined for a VM, "virsh 
update-memory-device" cannot be used anymore as aliases don't seem to be 
available on stopped VMs. If I manually define an alias on a stopped VM, 
the alias silently gets dropped. Is there any way to identify a 
virtio-mem device on a stopped domain?


2. "virsh update-memory-device" with --config on a running domain

# virsh update-memory-device "Fedora34" --config --alias "virtiomem1" 
--requested-size 16G
error: no memory device found

I guess the issue is, again, that aliases don't apply to the "!live" XML.
So the "--config" option doesn't really work when there is more than one
virtio-mem device defined for a VM.


3. "virsh update-memory-device" and nodes

In addition to "--alias", something like "--node" would also be nice to 
have -- assuming there is only a single virtio-mem device per NUMA node, 
which is usually the case. For example:

"virsh update-memory-device "Fedora34" --node 1 --requested-size 16G" 
could come in handy. This would also work on "!live" domains.


4. "actual" vs. "current"

"<actual unit='KiB'>16777216</actual>" I wonder if "current" instead of 
"actual" would be more in line with "currentMemory". But no strong opinion.


5. Slot handling.

As already discussed, virtio-mem and virtio-pmem don't need slots. Yet,
the "slots" definition is required and libvirt reserves one slot for
each such device ("error: unsupported configuration: memory device count
'2' exceeds slots count '1'"). This is certainly future work, if we ever
want to change that.
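
As a minimal illustration (sizes hypothetical): with two virtio-mem
devices, the domain XML must still declare at least two slots, even
though neither device will ever occupy a DIMM slot:

  <maxMemory slots='2' unit='KiB'>33554432</maxMemory>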


6. 4k source results in an error

     <source>
       <pagesize unit='KiB'>4096</pagesize>
       <nodemask>0-1</nodemask>
     </source>

"error: internal error: Unable to find any usable hugetlbfs mount for 
4096 KiB"

This example is taken from https://libvirt.org/formatdomain.html for 
DIMMs. Not sure what the expected behavior is.


7. File source gets silently dropped

     <source>
       <path>/dev/shmem/vm0</path>
     </source>

The statement gets silently dropped, which is somewhat surprising. 
However, I did not test what happens with DIMMs, maybe it's the same.


8. Global preallocation of memory

With

<memoryBacking>
	<allocation mode="immediate"/>
</memoryBacking>

we also get "prealloc=on" set for the memory backends of the virtio-mem
devices, which is sub-optimal, because we end up preallocating all 
memory of the memory backend (which is unexpected for a virtio-mem 
device) and virtio-mem will then discard all memory immediately again. 
So it's essentially a dangerous NOP -- dangerous because we temporarily 
consume a lot of memory.

In an ideal world, we would not set this for the memory backend used for 
the virtio-mem devices, but for the virtio-mem devices themselves, such 
that preallocation happens when new memory blocks are actually exposed 
to the VM.

As virtio-mem does not support "prealloc=on" for virtio-mem devices yet, 
this is future work. We might want to error out, though, if <allocation 
mode="immediate"\> is used along with virtio-mem devices for now. I'm 
planning on implementing this in QEMU soon. Until then, it might also be 
good enough to simply document that this setup should be avoided.


9. Memfd and huge pages

<memoryBacking>
     <source type="memfd"/>
</memoryBacking>

and

<memory model='virtio-mem' access='shared'>
   <source>
     <pagesize unit='KiB'>2048</pagesize>
   </source>
   ...
</memory>


I get on the QEMU cmdline

"-object 
{"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}"

Dropping "the memfd" source I get on the QEMU cmdline:

-object^@{"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}

"prealloc":true should not have been added for virtio-mem in case of 
memfd. !memfd does what's expected.
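
For comparison, the expected memfd object would presumably be the same
line minus the prealloc flag, i.e.:

-object {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"size":17179869184}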


10. Memory locking

With

<memoryBacking>
	<locked/>
</memoryBacking>

virtio-mem fails with

"qemu-system-x86_64: -device 
virtio-mem-pci,node=0,block-size=2097152,requested-size=0,memdev=memvirtiomem0,id=virtiomem0,bus=pci.0,addr=0x2: 
Incompatible with mlock"

Unfortunately, for example, on shmem like:

<memoryBacking>
	<locked/>
	<access mode="shared"/>
	<source type="memfd"/>
</memoryBacking>

it seems to fail after essentially (temporarily) preallocating all
memory for the memory backend of the virtio-mem device. In the future,
virtio-mem might be able to support mlock; until then, this is
suboptimal but at least it fails at some point.


11. Reservation of memory

With new QEMU versions we'll want to pass "reserve=off" for the memory 
backend used, especially with hugepages and private mappings. While this 
change was merged into QEMU, it's not part of an official release yet. 
Future work.

https://lore.kernel.org/qemu-devel/20210510114328.21835-1-david@redhat.com/

Otherwise, when the requested "size" is not currently available as free
and "unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable to
map backing store for guest RAM: Cannot allocate memory". The same thing
can easily happen on anonymous memory when memory overcommit isn't disabled.

So this is future work, but at least the QEMU part is already upstream.
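
Once that change is in a release, the memory backend object would
presumably just gain the new property, along the lines of (a sketch
based on the patch linked above, not on a released QEMU):

-object {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","reserve":false,"share":true,"size":17179869184}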



I'm planning on adding some libvirt documentation to 
https://virtio-mem.gitlab.io/ soon, where I'll document some of this, 
including care that has to be taken with mlock and preallocation.

Thanks for all your work!

-- 
Thanks,

David / dhildenb

Re: [PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model
Posted by Michal Prívozník 2 years, 7 months ago
On 7/7/21 12:30 PM, David Hildenbrand wrote:
> On 23.06.21 12:12, Michal Privoznik wrote:
>> v4 of:
>>
>> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
>>
>> diff to v3:
>> - Rebased code on top of master
>> - Tried to work in all of Peter's review suggestions
>> - Fixed a bug where adjusting <requested/> was viewed as hotplug of new
>>    <memory/> by XML validator and thus if <maxMemory/> was close
>> enough to
>>    <currentMemory/> the validator reported an error (this was reported by
>>    David).
>>
> 
> Hi Michal,

Hi,

sorry for replying this late.

> 
> I just retested with this version and it mostly works as expected. I
> tested quite a few memory configurations and have some comments / reports :)
> 
> I tested successfully:
> - 1 node with one device
> - 2 nodes with one device on each node
> - 2 nodes with two devices on one node
> - "virsh update-memory-device" on live domains -- works great
> - huge pages and anonymous memory with access=private and access=shared.
>   There is only one issue with hugepages and memfd (prealloc=on gets
>   set).
> - shared memory on memfd and anonymous memory (-> shared file) with
>   access=shared
> 
> I only tested on a single host NUMA node so far, but don't expect
> surprises with host NUMA policies.
> 
> 
> 1. "virsh update-memory-device" and stopped domains
> 
> Once I have more than one virtio-mem device defined for a VM, "virsh
> update-memory-device" cannot be used anymore as aliases don't seem to be
> available on stopped VMs. If I manually define an alias on a stopped VM,
> the alias silently gets dropped. Is there any way to identify a
> virtio-mem device on a stopped domain?

Yes. You want what we call user aliases. They have to have a "ua-" prefix
so that they don't clash with whatever libvirt generates. Something like
this:

<memory model="virtio-mem">
  <source>
    <pagesize unit="KiB">2048</pagesize>
  </source>
  <target>
    <size unit="KiB">2097152</size>
    <node>0</node>
    <block unit="KiB">2048</block>
    <requested unit="KiB">1048576</requested>
  </target>
  <alias name="ua-virtiomem0"/>
  <address type="pci" domain="0x0000" bus="0x00" slot="0x06"
function="0x0"/>
</memory>
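
With such an alias in place, the device can be addressed explicitly,
e.g. (a hypothetical invocation, mirroring your example from point 2):

# virsh update-memory-device "Fedora34" --alias "ua-virtiomem0" --requested-size 16G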

But then you run into the fact that I haven't implemented update for
inactive domains yet. Will do in v5.

> 
> 
> 2. "virsh update-memory-device" with --config on a running domain
> 
> # virsh update-memory-device "Fedora34" --config --alias "virtiomem1"
> --requested-size 16G
> error: no memory device found
> 
> I guess the issue is, again, that aliases don't apply to the "!live" XML.
> So the "--config" option doesn't really work when there is more than one
> virtio-mem device defined for a VM.

Good point. I wonder what piece of input XML I can use to look up the
corresponding virtio-mem device. Since it has a PCI address, maybe I can
use that instead of the alias? But then we are back to the old problem -
in general, inactive and active XMLs can be different (due to
hot(un-)plug). So even if I found a device at the same PCI address, it
might actually be a different one. Therefore, I think the safest approach
is to use aliases. At any rate - this can be implemented afterwards.

> 
> 
> 3. "virsh update-memory-device" and nodes
> 
> In addition to "--alias", something like "--node" would also be nice to
> have -- assuming there is only a single virtio-mem device per NUMA node,
> which is usually the case. For example:
> 
> "virsh update-memory-device "Fedora34" --node 1 --requested-size 16G"
> could come in handy. This would also work on "!live" domains.

Yes, makes sense.

> 
> 
> 4. "actual" vs. "current"
> 
> "<actual unit='KiB'>16777216</actual>" I wonder if "current" instead of
> "actual" would be more in line with "currentMemory". But no strong opinion.

Yeah, I don't have a strong opinion either. I can change it.

> 
> 
> 5. Slot handling.
> 
> As already discussed, virtio-mem and virtio-pmem don't need slots. Yet,
> the "slots" definition is required and libvirt reserves once slot for
> each such device ("error: unsupported configuration: memory device count
> '2' exceeds slots count '1'"). This is certainly future work, if we ever
> want to change that.

I can look into this.

> 
> 
> 6. 4k source results in an error
> 
>     <source>
>       <pagesize unit='KiB'>4096</pagesize>
>       <nodemask>0-1</nodemask>
>     </source>
> 
> "error: internal error: Unable to find any usable hugetlbfs mount for
> 4096 KiB"
> 
> This example is taken from https://libvirt.org/formatdomain.html for
> DIMMs. Not sure what the expected behavior is.

Ouch, this is a clear bug. Let me investigate and fix it in the next version.

> 
> 
> 7. File source gets silently dropped
> 
>     <source>
>       <path>/dev/shmem/vm0</path>
>     </source>
> 
> The statement gets silently dropped, which is somewhat surprising.
> However, I did not test what happens with DIMMs, maybe it's the same.

Yeah, this is somewhat expected. I mean, the part that's expected is
that libvirt drops parts it doesn't parse. Sometimes that is pretty
obvious (<source someRandomAttribute='value'/>) and sometimes it's not
so obvious (like in your example, where <path/> makes sense for other
memory models like virtio-pmem). But just to be sure - since virtio-mem
can be backed by any memory-backend-* backend, does it make sense to have
<path/> there? So far my code would use memory-backend-file for
hugepages only.

> 
> 
> 8. Global preallocation of memory
> 
> With
> 
> <memoryBacking>
>     <allocation mode="immediate"/>
> </memoryBacking>
> 
> we also get "prealloc=on" set for the memory backends of the virtio-mem
> devices, which is sub-optimal, because we end up preallocating all
> memory of the memory backend (which is unexpected for a virtio-mem
> device) and virtio-mem will then discard all memory immediately again.
> So it's essentially a dangerous NOP -- dangerous because we temporarily
> consume a lot of memory.
> 
> In an ideal world, we would not set this for the memory backend used for
> the virtio-mem devices, but for the virtio-mem devices themselves, such
> that preallocation happens when new memory blocks are actually exposed
> to the VM.
> 
> As virtio-mem does not support "prealloc=on" for virtio-mem devices yet,
> this is future work. We might want to error out, though, if <allocation
> mode="immediate"/> is used along with virtio-mem devices for now. I'm
> planning on implementing this in QEMU soon. Until then, it might also be
> good enough to simply document that this setup should be avoided.

Right. Meanwhile this was implemented in QEMU and thus I can drop
prealloc=on. But then my question is: what happens when a user wants to
expose additional memory to the guest but doesn't have enough free
hugepages in the pool? Libvirt's using prealloc=on so that QEMU doesn't
get killed later, after the guest has booted up.

> 
> 
> 9. Memfd and huge pages
> 
> <memoryBacking>
>     <source type="memfd"/>
> </memoryBacking>
> 
> and
> 
> <memory model='virtio-mem' access='shared'>
>   <source>
>     <pagesize unit='KiB'>2048</pagesize>
>   </source>
>   ...
> </memory>
> 
> 
> I get on the QEMU cmdline
> 
> "-object
> {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}"
> 
> 
> Dropping the "memfd" source I get on the QEMU cmdline:
>
> -object {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}
> 
> 
> "prealloc":true should not have been added for virtio-mem in case of
> memfd. !memfd does what's expected.
> 


Okay, I will fix this. But can you shed more light here? I mean, why the
difference?

> 
> 10. Memory locking
> 
> With
> 
> <memoryBacking>
>     <locked/>
> </memoryBacking>
> 
> virtio-mem fails with
> 
> "qemu-system-x86_64: -device
> virtio-mem-pci,node=0,block-size=2097152,requested-size=0,memdev=memvirtiomem0,id=virtiomem0,bus=pci.0,addr=0x2:
> Incompatible with mlock"
> 
> Unfortunately, for example, on shmem like:
> 
> <memoryBacking>
>     <locked/>
>     <access mode="shared"/>
>     <source type="memfd"/>
> </memoryBacking>
> 
> it seems to fail after essentially (temporarily) preallocating all
> memory for the memory backend of the virtio-mem device. In the future,
> virtio-mem might be able to support mlock; until then, this is
> suboptimal but at least it fails at some point.
> 
> 
> 11. Reservation of memory
> 
> With new QEMU versions we'll want to pass "reserve=off" for the memory
> backend used, especially with hugepages and private mappings. While this
> change was merged into QEMU, it's not part of an official release yet.
> Future work.
> 
> https://lore.kernel.org/qemu-devel/20210510114328.21835-1-david@redhat.com/
> 
> Otherwise, when the requested "size" is not currently available as free
> and "unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable to
> map backing store for guest RAM: Cannot allocate memory". The same thing
> can easily happen on anonymous memory when memory overcommit isn't
> disabled.
> 
> So this is future work, but at least the QEMU part is already upstream.


So what's the difference between reserve and prealloc?

Michal

Re: [PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model
Posted by David Hildenbrand 2 years, 7 months ago
> 
> Hi,
> 
> sorry for replying this late.

Thanks for looking into this. It's a fairly long list, so it's 
understandable that it took a while. :)

>>
>>
>> 5. Slot handling.
>>
>> As already discussed, virtio-mem and virtio-pmem don't need slots. Yet,
>> the "slots" definition is required and libvirt reserves once slot for
>> each such device ("error: unsupported configuration: memory device count
>> '2' exceeds slots count '1'"). This is certainly future work, if we ever
>> want to change that.
> 
> I can look into this.

Yeah, but it can certainly be considered future work as well.

>>
>> 7. File source gets silently dropped
>>
>>      <source>
>>        <path>/dev/shmem/vm0</path>
>>      </source>
>>
>> The statement gets silently dropped, which is somewhat surprising.
>> However, I did not test what happens with DIMMs, maybe it's the same.
> 
> Yeah, this is somewhat expected. I mean, the part that's expected is
> that libvirt drops parts it doesn't parse. Sometimes that is pretty
> obvious (<source someRandomAttribute='value'/>) and sometimes it's not
> so obvious (like in your example, where <path/> makes sense for other
> memory models like virtio-pmem). But just to be sure - since virtio-mem
> can be backed by any memory-backend-* backend, does it make sense to have
> <path/> there? So far my code would use memory-backend-file for
> hugepages only.

So it could be backed by a file residing on a filesystem that supports 
sparse files (shmem, hugetlbfs, ext4, ...) -- IOW anything modern :)

It's supposed to work, but if it makes your life easier, we can consider
supporting other file sources as future work.

> 
>>
>>
>> 8. Global preallocation of memory
>>
>> With
>>
>> <memoryBacking>
>>      <allocation mode="immediate"/>
>> </memoryBacking>
>>
>> we also get "prealloc=on" set for the memory backends of the virtio-mem
>> devices, which is sub-optimal, because we end up preallocating all
>> memory of the memory backend (which is unexpected for a virtio-mem
>> device) and virtio-mem will then discard all memory immediately again.
>> So it's essentially a dangerous NOP -- dangerous because we temporarily
>> consume a lot of memory.
>>
>> In an ideal world, we would not set this for the memory backend used for
>> the virtio-mem devices, but for the virtio-mem devices themselves, such
>> that preallocation happens when new memory blocks are actually exposed
>> to the VM.
>>
>> As virtio-mem does not support "prealloc=on" for virtio-mem devices yet,
>> this is future work. We might want to error out, though, if <allocation
>> mode="immediate"/> is used along with virtio-mem devices for now. I'm
>> planning on implementing this in QEMU soon. Until then, it might also be
>> good enough to simply document that this setup should be avoided.
> 
> Right. Meanwhile this was implemented in QEMU and thus I can drop
> prealloc=on. But then my question is what happens when user wants to
> expose additional memory to the guest but doesn't have enough free
> hugepages in the pool? Libvirt's using prealloc=on so that QEMU doesn't
> get killed later, after the guest booted up.

So "prealloc=on" support for virtio-mem is unfortunately not part of 
QEMU yet (only "reserve=off" for memory backends). As you correctly 
state, until that is in place, huge pages cannot be used in a safe way 
with virtio-mem, which is why they are not officially supported yet by 
virtio-mem.

The idea is to specify "prealloc=on" on the virtio-mem device level once 
supported, instead of on the memory backend level. So virtio-mem will 
preallocate the relevant memory before actually giving new blocks to the
guest via virtio-mem, not when creating the memory backend. If 
preallocation fails at that point, no new blocks are given to the guest 
and we won't get killed.

Think of it like this: you defer preallocation to the point where you
actually use the memory, and you still handle preallocation errors in a safe way.

More details are below.

> 
>>
>>
>> 9. Memfd and huge pages
>>
>> <memoryBacking>
>>      <source type="memfd"/>
>> </memoryBacking>
>>
>> and
>>
>> <memory model='virtio-mem' access='shared'>
>>    <source>
>>      <pagesize unit='KiB'>2048</pagesize>
>>    </source>
>>    ...
>> </memory>
>>
>>
>> I get on the QEMU cmdline
>>
>> "-object
>> {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}"
>>
>>
>> Dropping the "memfd" source I get on the QEMU cmdline:
>>
>> -object {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}
>>
>>
>> "prealloc":true should not have been added for virtio-mem in case of
>> memfd. !memfd does what's expected.
>>
> 
> 
> Okay, I will fix this. But can you shed more light here? I mean, why the
> difference?

Assume you want a 1TB virtio-mem device backed by huge pages but 
initially only expose 1GB to the VM.

When setting prealloc=on on the memory backend, we will preallocate 1TB
of huge pages when starting QEMU, only to discard the memory immediately
again within the virtio-mem startup code (the first thing it does is make
sure there is no memory backing at all, meaning the memory backend is
completely "empty"). We end up with no preallocated memory after
temporarily having allocated 1 TB.

When setting "prealloc=on" (once supported) on the virtio-mem device 
instead, we'll preallocate memory dynamically as we hand it to the VM -- 
so initially only 1GB.

Assume we want to give the VM an additional 16GB via that virtio-mem
device. virtio-mem will dynamically try preallocating the memory before
giving the guest the 16GB. Assume only 8GB can be preallocated: then the
VM will only get an additional 8GB and we won't crash.

[...]

>> 11. Reservation of memory
>>
>> With new QEMU versions we'll want to pass "reserve=off" for the memory
>> backend used, especially with hugepages and private mappings. While this
>> change was merged into QEMU, it's not part of an official release yet.
>> Future work.
>>
>> https://lore.kernel.org/qemu-devel/20210510114328.21835-1-david@redhat.com/
>>
>> Otherwise, when the requested "size" is not currently available as free
>> and "unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable to
>> map backing store for guest RAM: Cannot allocate memory". The same thing
>> can easily happen on anonymous memory when memory overcommit isn't
>> disabled.
>>
>> So this is future work, but at least the QEMU part is already upstream.
> 
> 
> So what's the difference between reserve and prealloc?

It's complicated by the way huge pages in particular work.

Say you mmap() 1TB of huge pages. Linux will "reserve" 1TB of huge pages
and fail mmap() if it can't. BUT it will not preallocate huge pages yet;
they are only accounted for in the OS as reserved for this mapping.

The idea is that you cannot really overcommit huge pages in the 
traditional sense, so the reservation mechanism was implemented to make 
it harder for user space to do something stupid. BUT we still need 
preallocation because the whole "huge page reservation" code is broken 
and not NUMA aware!

This, however, breaks the idea of virtio-mem, where you want to
dynamically decide how much memory you actually give to a VM. If you
reserve all huge pages of the memory backend upfront, they cannot be
used for anything else in the meantime, and you might as well stop using
virtio-mem and use a large DIMM instead.

In the end, what we want in virtio-mem with huge pages in the future is:
* reserve=off for the memory backend: don't reserve any huge pages in
   the OS; we'll be preallocating instead.
* prealloc=on for the virtio-mem device: preallocate memory dynamically
   when it is really about to be used by the VM, and fail in a safe way
   if preallocation fails.
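
Sketched as a hypothetical future QEMU command line (the virtio-mem
device "prealloc" property does not exist in any released QEMU yet):

-object memory-backend-file,id=memvirtiomem0,mem-path=/dev/hugepages,share=on,size=1T,reserve=off
-device virtio-mem-pci,id=virtiomem0,memdev=memvirtiomem0,node=0,block-size=2M,requested-size=1G,prealloc=on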


In addition to that, "reserve=off" can also be useful with virtio-mem
when backed by ordinary system RAM where we don't use preallocation,
just due to the way some memory overcommit modes work. But that's also
stuff for the future to optimize, and you don't have to bother about it
just now. :)

-- 
Thanks,

David / dhildenb

Re: [PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model
Posted by Michal Prívozník 2 years, 7 months ago
On 7/7/21 12:30 PM, David Hildenbrand wrote:
> On 23.06.21 12:12, Michal Privoznik wrote:
>> v4 of:
>>
>> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
>>
>> diff to v3:
>> - Rebased code on top of master
>> - Tried to work in all of Peter's review suggestions
>> - Fixed a bug where adjusting <requested/> was viewed as hotplug of new
>>    <memory/> by XML validator and thus if <maxMemory/> was close
>> enough to
>>    <currentMemory/> the validator reported an error (this was reported by
>>    David).
>>
> 
> Hi Michal,
> 


> 6. 4k source results in an error
> 
>     <source>
>       <pagesize unit='KiB'>4096</pagesize>
>       <nodemask>0-1</nodemask>
>     </source>
> 
> "error: internal error: Unable to find any usable hugetlbfs mount for
> 4096 KiB"
> 
> This example is taken from https://libvirt.org/formatdomain.html for
> DIMMs. Not sure what the expected behavior is.

Just realized that this IS expected behavior. 4096 KiB pages (= 4 MiB)
are not regular system pages (4 KiB). Thus libvirt is trying to find a
hugetlbfs mount point that serves 4 MiB pages, unsuccessfully. I'll
post a patch that fixes the example though.
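
Presumably the fixed example will simply use a real huge page size,
e.g. on x86-64:

    <source>
      <pagesize unit='KiB'>2048</pagesize>
      <nodemask>0-1</nodemask>
    </source>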

Michal

Re: [PATCH v4 for v7.6.0 00/14] Introduce virtio-mem <memory/> model
Posted by David Hildenbrand 2 years, 7 months ago
On 13.09.21 08:53, Michal Prívozník wrote:
> On 7/7/21 12:30 PM, David Hildenbrand wrote:
>> On 23.06.21 12:12, Michal Privoznik wrote:
>>> v4 of:
>>>
>>> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
>>>
>>> diff to v3:
>>> - Rebased code on top of master
>>> - Tried to work in all of Peter's review suggestions
>>> - Fixed a bug where adjusting <requested/> was viewed as hotplug of new
>>>     <memory/> by XML validator and thus if <maxMemory/> was close
>>> enough to
>>>     <currentMemory/> the validator reported an error (this was reported by
>>>     David).
>>>
>>
>> Hi Michal,
>>
> 
> 
>> 6. 4k source results in an error
>>
>>      <source>
>>        <pagesize unit='KiB'>4096</pagesize>
>>        <nodemask>0-1</nodemask>
>>      </source>
>>
>> "error: internal error: Unable to find any usable hugetlbfs mount for
>> 4096 KiB"
>>
>> This example is taken from https://libvirt.org/formatdomain.html for
>> DIMMs. Not sure what the expected behavior is.
> 
> Just realized that this IS expected behavior. 4096 KiB pages (= 4 MiB)
> are not regular system pages (4 KiB). Thus libvirt is trying to find a
> hugetlbfs mount point that serves 4 MiB pages, unsuccessfully. I'll
> post a patch that fixes the example though.

Ah, very right. I blindly copied the example ... makes sense to me, and
the error message is correct.

-- 
Thanks,

David / dhildenb