[PATCH] qemu: Relax memory pre-allocation rules

Michal Privoznik posted 1 patch 3 years, 5 months ago
Test syntax-check failed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/866ab430dc2dd84a67f9b948c90879e2ad176db0.1606730774.git.mprivozn@redhat.com
src/qemu/qemu_command.c                               | 11 +++++++++--
.../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
2 files changed, 10 insertions(+), 3 deletions(-)
[PATCH] qemu: Relax memory pre-allocation rules
Posted by Michal Privoznik 3 years, 5 months ago
Currently, we configure QEMU to prealloc memory almost by
default. Well, by default for NVDIMMs, hugepages and if user
asked us to (via memoryBacking <allocation mode="immediate"/>).

However, there are two cases where this approach is not the best:

1) in case when guest's NVDIMM is backed by real life NVDIMM. In
   this case users should put <pmem/> into the <memory/> device
   <source/>, like this:

   <memory model='nvdimm' access='shared'>
     <source>
       <path>/dev/pmem0</path>
       <pmem/>
     </source>
   </memory>

   Instructing QEMU to do prealloc in this case means that each
   page of the NVDIMM is "touched" (the first byte is read and
   written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
   device wear.

2) if free-page-reporting is turned on. While the
   free-page-reporting feature might not have a catchy or obvious
   name, when enabled it instructs KVM and subsequently QEMU to
   free pages no longer used by guest resulting in smaller memory
   footprint. And preallocating whole memory goes against this.

The BZ comment 11 mentions another, third case 'virtio-mem' but
that is not implemented in libvirt, yet.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_command.c                               | 11 +++++++++--
 .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 479bcc0b0c..3df8b5ac76 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
     if (discard == VIR_TRISTATE_BOOL_ABSENT)
         discard = def->mem.discard;
 
-    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
+    /* The whole point of free_page_reporting is that as soon as guest frees
+     * any memory it is freed in the host too. Prealloc doesn't make much sense
+     * then. */
+    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
+        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
         prealloc = true;
 
     if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
@@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
 
         if (mem->nvdimmPath) {
             memPath = g_strdup(mem->nvdimmPath);
-            prealloc = true;
+            /* If the NVDIMM is a real device then there's nothing to prealloc.
+             * If anyhing, we would be only wearing off the device. */
+            if (!mem->nvdimmPmem)
+                prealloc = true;
         } else if (useHugepage) {
             if (qemuGetDomainHupageMemPath(priv->driver, def, pagesize, &memPath) < 0)
                 return -1;
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args
index cac02a6f6d..fb4ae4b518 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args
@@ -20,7 +20,7 @@ file=/tmp/lib/domain--1-QEMUGuest1/master-key.aes \
 -object memory-backend-ram,id=ram-node0,size=224395264 \
 -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
 -object memory-backend-file,id=memnvdimm0,mem-path=/tmp/nvdimm,share=no,\
-prealloc=yes,size=536870912,pmem=yes \
+size=536870912,pmem=yes \
 -device nvdimm,node=0,memdev=memnvdimm0,id=nvdimm0,slot=0 \
 -uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
 -display none \
-- 
2.26.2

Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Daniel P. Berrangé 3 years, 5 months ago
On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:
> Currently, we configure QEMU to prealloc memory almost by
> default. Well, by default for NVDIMMs, hugepages and if user
> asked us to (via memoryBacking <allocation mode="immediate"/>).
> 
> However, there are two cases where this approach is not the best:
> 
> 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
>    this case users should put <pmem/> into the <memory/> device
>    <source/>, like this:
> 
>    <memory model='nvdimm' access='shared'>
>      <source>
>        <path>/dev/pmem0</path>
>        <pmem/>
>      </source>
>    </memory>
> 
>    Instructing QEMU to do prealloc in this case means that each
>    page of the NVDIMM is "touched" (the first byte is read and
>    written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
>    device wear.
> 
> 2) if free-page-reporting is turned on. While the
>    free-page-reporting feature might not have a catchy or obvious
>    name, when enabled it instructs KVM and subsequently QEMU to
>    free pages no longer used by guest resulting in smaller memory
>    footprint. And preallocating whole memory goes against this.
> 
> The BZ comment 11 mentions another, third case 'virtio-mem' but
> that is not implemented in libvirt, yet.
> 
> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_command.c                               | 11 +++++++++--
>  .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> index 479bcc0b0c..3df8b5ac76 100644
> --- a/src/qemu/qemu_command.c
> +++ b/src/qemu/qemu_command.c
> @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
>      if (discard == VIR_TRISTATE_BOOL_ABSENT)
>          discard = def->mem.discard;
>  
> -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
> +    /* The whole point of free_page_reporting is that as soon as guest frees
> +     * any memory it is freed in the host too. Prealloc doesn't make much sense
> +     * then. */
> +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
> +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
>          prealloc = true;

If the user asked for allocation == immediate, we should not be
silently ignoring that request. Isn't the scenario described simply
a wierd user configuration scenario and if they don't want that, then
then they can set     <allocation mode="ondemand"/> instead.

>      if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
> @@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
>  
>          if (mem->nvdimmPath) {
>              memPath = g_strdup(mem->nvdimmPath);
> -            prealloc = true;



> +            /* If the NVDIMM is a real device then there's nothing to prealloc.
> +             * If anyhing, we would be only wearing off the device. */
> +            if (!mem->nvdimmPmem)
> +                prealloc = true;

I wonder if QEMU itself should take this optimization to skip its
allocation logic ? 

>          } else if (useHugepage) {
>              if (qemuGetDomainHupageMemPath(priv->driver, def, pagesize, &memPath) < 0)
>                  return -1;
> diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args
> index cac02a6f6d..fb4ae4b518 100644
> --- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args
> +++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.x86_64-latest.args
> @@ -20,7 +20,7 @@ file=/tmp/lib/domain--1-QEMUGuest1/master-key.aes \
>  -object memory-backend-ram,id=ram-node0,size=224395264 \
>  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
>  -object memory-backend-file,id=memnvdimm0,mem-path=/tmp/nvdimm,share=no,\
> -prealloc=yes,size=536870912,pmem=yes \
> +size=536870912,pmem=yes \
>  -device nvdimm,node=0,memdev=memnvdimm0,id=nvdimm0,slot=0 \
>  -uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
>  -display none \
> -- 
> 2.26.2
> 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Michal Privoznik 3 years, 5 months ago
On 11/30/20 11:16 AM, Daniel P. Berrangé wrote:
> On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:
>> Currently, we configure QEMU to prealloc memory almost by
>> default. Well, by default for NVDIMMs, hugepages and if user
>> asked us to (via memoryBacking <allocation mode="immediate"/>).
>>
>> However, there are two cases where this approach is not the best:
>>
>> 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
>>     this case users should put <pmem/> into the <memory/> device
>>     <source/>, like this:
>>
>>     <memory model='nvdimm' access='shared'>
>>       <source>
>>         <path>/dev/pmem0</path>
>>         <pmem/>
>>       </source>
>>     </memory>
>>
>>     Instructing QEMU to do prealloc in this case means that each
>>     page of the NVDIMM is "touched" (the first byte is read and
>>     written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
>>     device wear.
>>
>> 2) if free-page-reporting is turned on. While the
>>     free-page-reporting feature might not have a catchy or obvious
>>     name, when enabled it instructs KVM and subsequently QEMU to
>>     free pages no longer used by guest resulting in smaller memory
>>     footprint. And preallocating whole memory goes against this.
>>
>> The BZ comment 11 mentions another, third case 'virtio-mem' but
>> that is not implemented in libvirt, yet.
>>
>> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> ---
>>   src/qemu/qemu_command.c                               | 11 +++++++++--
>>   .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
>>   2 files changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
>> index 479bcc0b0c..3df8b5ac76 100644
>> --- a/src/qemu/qemu_command.c
>> +++ b/src/qemu/qemu_command.c
>> @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
>>       if (discard == VIR_TRISTATE_BOOL_ABSENT)
>>           discard = def->mem.discard;
>>   
>> -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
>> +    /* The whole point of free_page_reporting is that as soon as guest frees
>> +     * any memory it is freed in the host too. Prealloc doesn't make much sense
>> +     * then. */
>> +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
>> +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
>>           prealloc = true;
> 
> If the user asked for allocation == immediate, we should not be
> silently ignoring that request. Isn't the scenario described simply
> a wierd user configuration scenario and if they don't want that, then
> then they can set     <allocation mode="ondemand"/> instead.

Okay.

> 
>>       if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
>> @@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
>>   
>>           if (mem->nvdimmPath) {
>>               memPath = g_strdup(mem->nvdimmPath);
>> -            prealloc = true;
> 
> 
> 
>> +            /* If the NVDIMM is a real device then there's nothing to prealloc.
>> +             * If anyhing, we would be only wearing off the device. */
>> +            if (!mem->nvdimmPmem)
>> +                prealloc = true;
> 
> I wonder if QEMU itself should take this optimization to skip its
> allocation logic ?

Also would make sense. This is that kind of bug which lies in between 
libvirt and qemu. Although, since we are worried in silently ignoring 
user requests, then wouldn't this be exactly what QEMU would be doing? I 
mean, if an user/libvirt put both .prealloc=yes and .pmem=yes onto cmd 
line then these would cancel out, wouldn't they?

Michal

Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Daniel P. Berrangé 3 years, 5 months ago
On Mon, Nov 30, 2020 at 11:48:28AM +0100, Michal Privoznik wrote:
> On 11/30/20 11:16 AM, Daniel P. Berrangé wrote:
> > On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:
> > > Currently, we configure QEMU to prealloc memory almost by
> > > default. Well, by default for NVDIMMs, hugepages and if user
> > > asked us to (via memoryBacking <allocation mode="immediate"/>).
> > > 
> > > However, there are two cases where this approach is not the best:
> > > 
> > > 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
> > >     this case users should put <pmem/> into the <memory/> device
> > >     <source/>, like this:
> > > 
> > >     <memory model='nvdimm' access='shared'>
> > >       <source>
> > >         <path>/dev/pmem0</path>
> > >         <pmem/>
> > >       </source>
> > >     </memory>
> > > 
> > >     Instructing QEMU to do prealloc in this case means that each
> > >     page of the NVDIMM is "touched" (the first byte is read and
> > >     written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
> > >     device wear.
> > > 
> > > 2) if free-page-reporting is turned on. While the
> > >     free-page-reporting feature might not have a catchy or obvious
> > >     name, when enabled it instructs KVM and subsequently QEMU to
> > >     free pages no longer used by guest resulting in smaller memory
> > >     footprint. And preallocating whole memory goes against this.
> > > 
> > > The BZ comment 11 mentions another, third case 'virtio-mem' but
> > > that is not implemented in libvirt, yet.
> > > 
> > > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
> > > Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> > > ---
> > >   src/qemu/qemu_command.c                               | 11 +++++++++--
> > >   .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
> > >   2 files changed, 10 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> > > index 479bcc0b0c..3df8b5ac76 100644
> > > --- a/src/qemu/qemu_command.c
> > > +++ b/src/qemu/qemu_command.c
> > > @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
> > >       if (discard == VIR_TRISTATE_BOOL_ABSENT)
> > >           discard = def->mem.discard;
> > > -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
> > > +    /* The whole point of free_page_reporting is that as soon as guest frees
> > > +     * any memory it is freed in the host too. Prealloc doesn't make much sense
> > > +     * then. */
> > > +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
> > > +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
> > >           prealloc = true;
> > 
> > If the user asked for allocation == immediate, we should not be
> > silently ignoring that request. Isn't the scenario described simply
> > a wierd user configuration scenario and if they don't want that, then
> > then they can set     <allocation mode="ondemand"/> instead.
> 
> Okay.
> 
> > 
> > >       if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
> > > @@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
> > >           if (mem->nvdimmPath) {
> > >               memPath = g_strdup(mem->nvdimmPath);
> > > -            prealloc = true;
> > 
> > 
> > 
> > > +            /* If the NVDIMM is a real device then there's nothing to prealloc.
> > > +             * If anyhing, we would be only wearing off the device. */
> > > +            if (!mem->nvdimmPmem)
> > > +                prealloc = true;
> > 
> > I wonder if QEMU itself should take this optimization to skip its
> > allocation logic ?
> 
> Also would make sense. This is that kind of bug which lies in between
> libvirt and qemu. Although, since we are worried in silently ignoring user
> requests, then wouldn't this be exactly what QEMU would be doing? I mean, if
> an user/libvirt put both .prealloc=yes and .pmem=yes onto cmd line then
> these would cancel out, wouldn't they?

The difference is that an real NVDIMM is inherantly preallocated. QEMU
would not be ignoring the prealloc=yes arg - its implementation would
merely be a no-op.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Michal Privoznik 3 years, 5 months ago
On 11/30/20 12:14 PM, Daniel P. Berrangé wrote:
> On Mon, Nov 30, 2020 at 11:48:28AM +0100, Michal Privoznik wrote:
>> On 11/30/20 11:16 AM, Daniel P. Berrangé wrote:
>>> On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:

>>>>
>>>> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
>>>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

> The difference is that an real NVDIMM is inherantly preallocated. QEMU
> would not be ignoring the prealloc=yes arg - its implementation would
> merely be a no-op.

Okay, let me switch the bug back onto qemu.

Michal

Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Igor Mammedov 3 years, 4 months ago
On Mon, 30 Nov 2020 11:14:20 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Mon, Nov 30, 2020 at 11:48:28AM +0100, Michal Privoznik wrote:
> > On 11/30/20 11:16 AM, Daniel P. Berrangé wrote:  
> > > On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:  
> > > > Currently, we configure QEMU to prealloc memory almost by
> > > > default. Well, by default for NVDIMMs, hugepages and if user
> > > > asked us to (via memoryBacking <allocation mode="immediate"/>).
> > > > 
> > > > However, there are two cases where this approach is not the best:
> > > > 
> > > > 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
> > > >     this case users should put <pmem/> into the <memory/> device
> > > >     <source/>, like this:
> > > > 
> > > >     <memory model='nvdimm' access='shared'>
> > > >       <source>
> > > >         <path>/dev/pmem0</path>
> > > >         <pmem/>
> > > >       </source>
> > > >     </memory>
> > > > 
> > > >     Instructing QEMU to do prealloc in this case means that each
> > > >     page of the NVDIMM is "touched" (the first byte is read and
> > > >     written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
> > > >     device wear.
> > > > 
> > > > 2) if free-page-reporting is turned on. While the
> > > >     free-page-reporting feature might not have a catchy or obvious
> > > >     name, when enabled it instructs KVM and subsequently QEMU to
> > > >     free pages no longer used by guest resulting in smaller memory
> > > >     footprint. And preallocating whole memory goes against this.
> > > > 
> > > > The BZ comment 11 mentions another, third case 'virtio-mem' but
> > > > that is not implemented in libvirt, yet.
> > > > 
> > > > Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
> > > > Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> > > > ---
> > > >   src/qemu/qemu_command.c                               | 11 +++++++++--
> > > >   .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
> > > >   2 files changed, 10 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> > > > index 479bcc0b0c..3df8b5ac76 100644
> > > > --- a/src/qemu/qemu_command.c
> > > > +++ b/src/qemu/qemu_command.c
> > > > @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
> > > >       if (discard == VIR_TRISTATE_BOOL_ABSENT)
> > > >           discard = def->mem.discard;
> > > > -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
> > > > +    /* The whole point of free_page_reporting is that as soon as guest frees
> > > > +     * any memory it is freed in the host too. Prealloc doesn't make much sense
> > > > +     * then. */
> > > > +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
> > > > +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
> > > >           prealloc = true;  
> > > 
> > > If the user asked for allocation == immediate, we should not be
> > > silently ignoring that request. Isn't the scenario described simply
> > > a wierd user configuration scenario and if they don't want that, then
> > > then they can set     <allocation mode="ondemand"/> instead.  
> > 
> > Okay.
> >   
> > >   
> > > >       if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
> > > > @@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
> > > >           if (mem->nvdimmPath) {
> > > >               memPath = g_strdup(mem->nvdimmPath);
> > > > -            prealloc = true;  
> > > 
> > > 
> > >   
> > > > +            /* If the NVDIMM is a real device then there's nothing to prealloc.
> > > > +             * If anyhing, we would be only wearing off the device. */
> > > > +            if (!mem->nvdimmPmem)
> > > > +                prealloc = true;  
> > > 
> > > I wonder if QEMU itself should take this optimization to skip its
> > > allocation logic ?  

by default QEMU does not prealloc, and if users explicitly ask for prealloc,
they should get it.
So libvirt also shouldn't set prealloc by default when it comes to nvdimm
on a file that's allocated on pmem enabled storage.

> > Also would make sense. This is that kind of bug which lies in between
> > libvirt and qemu. Although, since we are worried in silently ignoring user
> > requests, then wouldn't this be exactly what QEMU would be doing? I mean, if
> > an user/libvirt put both .prealloc=yes and .pmem=yes onto cmd line then
> > these would cancel out, wouldn't they?  
> 
> The difference is that an real NVDIMM is inherantly preallocated. QEMU

that's assuming that used backend file is on NVDIMM
(pmem=on doesn't guarantee it though)

> would not be ignoring the prealloc=yes arg - its implementation would
> merely be a no-op.

As for ignoring user's input, I don't like it (it usually bites down' the road).
if we decide that "pmem=on + prealloc=on" is invalid combo,
I'd rather error out with "fix your CLI" kind of message or we can warn user
that options combination is not optimal.



> Regards,
> Daniel


Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Peter Krempa 3 years, 5 months ago
On Mon, Nov 30, 2020 at 11:06:14 +0100, Michal Privoznik wrote:
> Currently, we configure QEMU to prealloc memory almost by
> default. Well, by default for NVDIMMs, hugepages and if user
> asked us to (via memoryBacking <allocation mode="immediate"/>).
> 
> However, there are two cases where this approach is not the best:
> 
> 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
>    this case users should put <pmem/> into the <memory/> device
>    <source/>, like this:
> 
>    <memory model='nvdimm' access='shared'>
>      <source>
>        <path>/dev/pmem0</path>
>        <pmem/>
>      </source>
>    </memory>
> 
>    Instructing QEMU to do prealloc in this case means that each
>    page of the NVDIMM is "touched" (the first byte is read and
>    written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
>    device wear.
> 
> 2) if free-page-reporting is turned on. While the
>    free-page-reporting feature might not have a catchy or obvious
>    name, when enabled it instructs KVM and subsequently QEMU to
>    free pages no longer used by guest resulting in smaller memory
>    footprint. And preallocating whole memory goes against this.
> 
> The BZ comment 11 mentions another, third case 'virtio-mem' but
> that is not implemented in libvirt, yet.
> 
> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
> ---
>  src/qemu/qemu_command.c                               | 11 +++++++++--
>  .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
> index 479bcc0b0c..3df8b5ac76 100644
> --- a/src/qemu/qemu_command.c
> +++ b/src/qemu/qemu_command.c
> @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
>      if (discard == VIR_TRISTATE_BOOL_ABSENT)
>          discard = def->mem.discard;
>  
> -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
> +    /* The whole point of free_page_reporting is that as soon as guest frees
> +     * any memory it is freed in the host too. Prealloc doesn't make much sense
> +     * then. */
> +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
> +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
>          prealloc = true;

IIUC def->mem.allocation is an explicit user-specified configuration, in
such case we should not override the user wish based on a different
configuration.

If they don't make sense together technically, we should reject the
config rather than silently changing it. If it's just semantically wrong
and pre-existing we may leave it be.

Additionally the patch is missing a test case for this scenario.

Re: [PATCH] qemu: Relax memory pre-allocation rules
Posted by Michal Privoznik 3 years, 5 months ago
On 11/30/20 11:21 AM, Peter Krempa wrote:
> On Mon, Nov 30, 2020 at 11:06:14 +0100, Michal Privoznik wrote:
>> Currently, we configure QEMU to prealloc memory almost by
>> default. Well, by default for NVDIMMs, hugepages and if user
>> asked us to (via memoryBacking <allocation mode="immediate"/>).
>>
>> However, there are two cases where this approach is not the best:
>>
>> 1) in case when guest's NVDIMM is backed by real life NVDIMM. In
>>     this case users should put <pmem/> into the <memory/> device
>>     <source/>, like this:
>>
>>     <memory model='nvdimm' access='shared'>
>>       <source>
>>         <path>/dev/pmem0</path>
>>         <pmem/>
>>       </source>
>>     </memory>
>>
>>     Instructing QEMU to do prealloc in this case means that each
>>     page of the NVDIMM is "touched" (the first byte is read and
>>     written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
>>     device wear.
>>
>> 2) if free-page-reporting is turned on. While the
>>     free-page-reporting feature might not have a catchy or obvious
>>     name, when enabled it instructs KVM and subsequently QEMU to
>>     free pages no longer used by guest resulting in smaller memory
>>     footprint. And preallocating whole memory goes against this.
>>
>> The BZ comment 11 mentions another, third case 'virtio-mem' but
>> that is not implemented in libvirt, yet.
>>
>> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> ---
>>   src/qemu/qemu_command.c                               | 11 +++++++++--
>>   .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
>>   2 files changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
>> index 479bcc0b0c..3df8b5ac76 100644
>> --- a/src/qemu/qemu_command.c
>> +++ b/src/qemu/qemu_command.c
>> @@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
>>       if (discard == VIR_TRISTATE_BOOL_ABSENT)
>>           discard = def->mem.discard;
>>   
>> -    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
>> +    /* The whole point of free_page_reporting is that as soon as guest frees
>> +     * any memory it is freed in the host too. Prealloc doesn't make much sense
>> +     * then. */
>> +    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
>> +        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
>>           prealloc = true;
> 
> IIUC def->mem.allocation is an explicit user-specified configuration, in
> such case we should not override the user wish based on a different
> configuration.
> 
> If they don't make sense together technically, we should reject the
> config rather than silently changing it. If it's just semantically wrong
> and pre-existing we may leave it be.
> 
> Additionally the patch is missing a test case for this scenario.
> 

Yeah, Dan already pointed out that this is not really desired. So I will 
leave this hunk out. But to address your point - would it make sense to 
add a valiador check? I mean, something like:

   if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
       def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON) {
       virReportError("this combination doesn't make much sense");
       return -1;
   }


Technically, it is possible to fully allocate memory on domain startup 
and then leave QEMU to free pages (which happens as soon virtio_balloon 
module is loaded), but IMO it doesn't make much sense. Semantically, at 
least to me these two options are mutually exclusive.

Michal