[libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'

Michal Privoznik posted 2 patches 5 years ago
There is a newer version of this series
[libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Michal Privoznik 5 years ago
v2 of:

https://www.redhat.com/archives/libvir-list/2019-April/msg00658.html

diff to v1:
- Fixed the reported problem. Basically, even though emulator CGroup was
  created qemu was not running in it. Now qemu is moved into the CGroup
  even before exec()

Michal Prívozník (2):
  qemuSetupCpusetMems: Use VIR_AUTOFREE()
  qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu

 src/qemu/qemu_cgroup.c  |  5 ++---
 src/qemu/qemu_process.c | 12 ++++++++----
 2 files changed, 10 insertions(+), 7 deletions(-)

-- 
2.21.0

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Daniel Henrique Barboza 5 years ago
Hi,

I've tested these patches again, twice, in setups similar to the ones I used
for the first version (first on a Power8, then on a Power9 server).

Same results, though. Libvirt does not prevent the launch of a pseries guest
with numanode=strict, even if the NUMA node does not have enough available
RAM. If I stress test the memory of the guest to force the allocation,
QEMU exits with an error as soon as the memory of the host NUMA node
is exhausted.

If I change the numanode setting to 'preferred' and repeat the test, QEMU
doesn't exit with an error - the process starts to take memory from other
NUMA nodes. This indicates that the numanode policy is apparently being
enforced on the QEMU process - however, it is not enforced at VM boot.

I've debugged it a little and haven't found anything obviously wrong.
All functions that follow qemuSetupCpusetMems exit with
ret = 0. Unfortunately, I don't have access to an x86 server with more than
one NUMA node to compare results.

Since I can't say for sure whether what I'm seeing is pseries-specific
behavior, I see no problem in pushing this series upstream
if it makes sense for x86. We can debug/fix the Power side later.



Thanks,


DHB





On 4/10/19 1:10 PM, Michal Privoznik wrote:
> v2 of:
>
> https://www.redhat.com/archives/libvir-list/2019-April/msg00658.html
>
> diff to v1:
> - Fixed the reported problem. Basically, even though emulator CGroup was
>    created qemu was not running in it. Now qemu is moved into the CGroup
>    even before exec()
>
> Michal Prívozník (2):
>    qemuSetupCpusetMems: Use VIR_AUTOFREE()
>    qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu
>
>   src/qemu/qemu_cgroup.c  |  5 ++---
>   src/qemu/qemu_process.c | 12 ++++++++----
>   2 files changed, 10 insertions(+), 7 deletions(-)
>

Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Michal Privoznik 5 years ago
On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
> Hi,
> 
> I've tested these patches again, twice, in similar setups like I tested
> the first version (first in a Power8, then in a Power9 server).
> 
> Same results, though. Libvirt will not avoid the launch of a pseries guest,
> with numanode=strict, even if the numa node does not have available
> RAM. If I stress test the memory of the guest to force the allocation,
> QEMU exits with an error as soon as the memory of the host numa node
> is exhausted.

Yes, this is expected. I mean, by default qemu doesn't allocate memory 
for the guest fully. You'd have to force it:

<memoryBacking>
   <allocation mode='immediate'/>
</memoryBacking>
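[The lazy allocation described above is ordinary demand paging: an anonymous mmap() only reserves address space, and each page is allocated by the kernel when it is first touched. A minimal Python sketch, for illustration only (not libvirt or qemu code), showing that RSS grows only once the pages are dirtied:]

```python
import mmap
import resource

MiB = 1024 * 1024

def peak_rss_kib():
    # ru_maxrss is reported in KiB on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

buf = mmap.mmap(-1, 64 * MiB)        # anonymous mapping: address space only
before = peak_rss_kib()

# Touch one byte per page - the same trick -mem-prealloc uses -
# to force the kernel to actually allocate every page.
for off in range(0, len(buf), mmap.PAGESIZE):
    buf[off] = buf[off]              # read & write back the same value

grown_kib = peak_rss_kib() - before  # roughly 64 MiB worth of pages
```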

> 
> If I change the numanode setting to 'preferred' and repeats the test, QEMU
> doesn't exit with an error - the process starts to take memory from other
> numa nodes. This indicates that the numanode policy is apparently being
> forced in the QEMU process - however, it is not forced in VM boot.
> 
> I've debugged it a little and haven't found anything wrong that jumps the
> eye. All functions that succeeds qemuSetupCpusetMems exits out with
> ret = 0. Unfortunately, I don't have access to a x86 server with more than
> one NUMA node to compare results.
> 
> Since I can't say for sure if what I'm seeing is an exclusive pseries
> behavior, I see no problem into pushing this series upstream
> if it makes sense for x86. We can debug/fix the Power side later.

I bet that if you force the allocation then the domain will be unable to 
boot.

Thanks for the testing!

Michal

Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Daniel Henrique Barboza 5 years ago

On 4/11/19 11:56 AM, Michal Privoznik wrote:
> On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
>> Hi,
>>
>> I've tested these patches again, twice, in similar setups like I tested
>> the first version (first in a Power8, then in a Power9 server).
>>
>> Same results, though. Libvirt will not avoid the launch of a pseries 
>> guest,
>> with numanode=strict, even if the numa node does not have available
>> RAM. If I stress test the memory of the guest to force the allocation,
>> QEMU exits with an error as soon as the memory of the host numa node
>> is exhausted.
>
> Yes, this is expected. I mean, by default qemu doesn't allocate memory 
> for the guest fully. You'd have to force it:
>
> <memoryBacking>
>   <allocation mode='immediate'/>
> </memoryBacking>
>

Tried with this extra setting, still no good. The domain still boots, even
though there is not enough memory in the NUMA node I'm setting to hold
all of its RAM. For reference, this is the top of the guest XML:


   <name>vm1</name>
   <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
   <memory unit='KiB'>314572800</memory>
   <currentMemory unit='KiB'>314572800</currentMemory>
   <memoryBacking>
     <allocation mode='immediate'/>
   </memoryBacking>
   <vcpu placement='static'>16</vcpu>
   <numatune>
     <memory mode='strict' nodeset='0'/>
   </numatune>
   <os>
     <type arch='ppc64' machine='pseries'>hvm</type>
     <boot dev='hd'/>
   </os>
   <clock offset='utc'/>

While doing this test, I recalled that some of my IBM peers recently
mentioned that they were unable to do a pre-allocation of the RAM
of a pseries guest using Libvirt, but they were able to do it using QEMU
directly (using -realtime mlock=on). In fact, I just tried it out with
command line QEMU and the guest allocated all the memory at boot.

This means that the pseries guest is able to do mem pre-alloc. I'd say that
there might be something missing somewhere (XML, host setup, libvirt
config ...) or perhaps even a bug that is preventing Libvirt from doing
this pre-alloc. This explains why I can't verify this patch series. I'll
dig into it further to understand why when I have the time.


Thanks,


DHB


>>
>> If I change the numanode setting to 'preferred' and repeats the test, 
>> QEMU
>> doesn't exit with an error - the process starts to take memory from 
>> other
>> numa nodes. This indicates that the numanode policy is apparently being
>> forced in the QEMU process - however, it is not forced in VM boot.
>>
>> I've debugged it a little and haven't found anything wrong that jumps 
>> the
>> eye. All functions that succeeds qemuSetupCpusetMems exits out with
>> ret = 0. Unfortunately, I don't have access to a x86 server with more 
>> than
>> one NUMA node to compare results.
>>
>> Since I can't say for sure if what I'm seeing is an exclusive pseries
>> behavior, I see no problem into pushing this series upstream
>> if it makes sense for x86. We can debug/fix the Power side later.
>
> I bet that if you force the allocation then the domain will be unable 
> to boot.
>
> Thanks for the testing!
>
> Michal

Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Michal Privoznik 5 years ago
On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 4/11/19 11:56 AM, Michal Privoznik wrote:
>> On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
>>> Hi,
>>>
>>> I've tested these patches again, twice, in similar setups like I tested
>>> the first version (first in a Power8, then in a Power9 server).
>>>
>>> Same results, though. Libvirt will not avoid the launch of a pseries 
>>> guest,
>>> with numanode=strict, even if the numa node does not have available
>>> RAM. If I stress test the memory of the guest to force the allocation,
>>> QEMU exits with an error as soon as the memory of the host numa node
>>> is exhausted.
>>
>> Yes, this is expected. I mean, by default qemu doesn't allocate memory 
>> for the guest fully. You'd have to force it:
>>
>> <memoryBacking>
>>   <allocation mode='immediate'/>
>> </memoryBacking>
>>
> 
> Tried with this extra setting, still no good. Domain still boots, even if
> there is not enough memory to load up all its ram in the NUMA node
> I am setting. For reference, this is the top of the guest XML:
> 
> 
>    <name>vm1</name>
>    <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
>    <memory unit='KiB'>314572800</memory>
>    <currentMemory unit='KiB'>314572800</currentMemory>
>    <memoryBacking>
>      <allocation mode='immediate'/>
>    </memoryBacking>
>    <vcpu placement='static'>16</vcpu>
>    <numatune>
>      <memory mode='strict' nodeset='0'/>
>    </numatune>
>    <os>
>      <type arch='ppc64' machine='pseries'>hvm</type>
>      <boot dev='hd'/>
>    </os>
>    <clock offset='utc'/>
> 
> While doing this test, I recalled that some of my IBM peers recently
> mentioned that they were unable to do a pre-allocation of the RAM
> of a pseries guest using Libvirt, but they were able to do it using QEMU
> directly (using -realtime mlock=on). In fact, I just tried it out with 
> command
> line QEMU and the guest allocated all the memory at boot.

Ah, so it looks like -mem-prealloc doesn't work on Power? Can you please check:

1) that -mem-prealloc is on the qemu command line
2) how much memory qemu allocates right after it starts the guest? I mean,
before you run a memory stress test that causes it to allocate the
memory fully.
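[For check 2, the resident set size of the qemu process can be read from /proc. A hedged helper sketch (Linux-only; obtaining the qemu-system-ppc64 PID, e.g. via pidof, is left out):]

```python
import os

def vm_rss_kib(pid):
    """Return a process's resident set size in KiB, parsed from
    /proc/<pid>/status (Linux-only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])   # field is reported in kB
    raise RuntimeError("VmRSS not found for pid %d" % pid)

# Smoke test against our own PID; for the guest you would pass
# the qemu process's PID instead.
own_rss = vm_rss_kib(os.getpid())
```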

> 
> This means that the pseries guest is able to do mem pre-alloc. I'd say that
> there might be something missing somewhere (XML, host setup, libvirt
> config ...) or perhaps even a bug that is preventing Libvirt from doing
> this pre-alloc. This explains why I can't verify this patch series. I'll 
> see if
> I dig it further to understand why when I have the time.

Yeah, I don't know Power well enough to help you. Sorry.

Michal

Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Daniel Henrique Barboza 5 years ago

On 4/12/19 6:10 AM, Michal Privoznik wrote:
> On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
>>
>>
>> On 4/11/19 11:56 AM, Michal Privoznik wrote:
>>> On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
>>>> Hi,
>>>>
>>>> I've tested these patches again, twice, in similar setups like I 
>>>> tested
>>>> the first version (first in a Power8, then in a Power9 server).
>>>>
>>>> Same results, though. Libvirt will not avoid the launch of a 
>>>> pseries guest,
>>>> with numanode=strict, even if the numa node does not have available
>>>> RAM. If I stress test the memory of the guest to force the allocation,
>>>> QEMU exits with an error as soon as the memory of the host numa node
>>>> is exhausted.
>>>
>>> Yes, this is expected. I mean, by default qemu doesn't allocate 
>>> memory for the guest fully. You'd have to force it:
>>>
>>> <memoryBacking>
>>>   <allocation mode='immediate'/>
>>> </memoryBacking>
>>>
>>
>> Tried with this extra setting, still no good. Domain still boots, 
>> even if
>> there is not enough memory to load up all its ram in the NUMA node
>> I am setting. For reference, this is the top of the guest XML:
>>
>>
>>    <name>vm1</name>
>>    <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
>>    <memory unit='KiB'>314572800</memory>
>>    <currentMemory unit='KiB'>314572800</currentMemory>
>>    <memoryBacking>
>>      <allocation mode='immediate'/>
>>    </memoryBacking>
>>    <vcpu placement='static'>16</vcpu>
>>    <numatune>
>>      <memory mode='strict' nodeset='0'/>
>>    </numatune>
>>    <os>
>>      <type arch='ppc64' machine='pseries'>hvm</type>
>>      <boot dev='hd'/>
>>    </os>
>>    <clock offset='utc'/>
>>
>> While doing this test, I recalled that some of my IBM peers recently
>> mentioned that they were unable to do a pre-allocation of the RAM
>> of a pseries guest using Libvirt, but they were able to do it using QEMU
>> directly (using -realtime mlock=on). In fact, I just tried it out 
>> with command
>> line QEMU and the guest allocated all the memory at boot.
>
> Ah, so looks like -mem-prealloc doesn't work at Power? Can you please 
> check:
>
> 1) that -mem-prealloc is on the qemu command line

Yes. This is the cmd line generated:

/usr/bin/qemu-system-ppc64 \
-name guest=vm1,debug-threads=on \
-S \
-object 
secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes 
\
-machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
-bios /home/user/boot_rom.bin \
-m 307200 \
-mem-prealloc \
-realtime mlock=off \
-smp 16,sockets=16,cores=1,threads=1 \
-uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
-display none \
-no-user-config \
-nodefaults \
-chardev 
socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait 
\
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device spapr-pci-host-bridge,index=1,id=pci.1 \
-device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
-drive 
file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
\
-chardev pty,id=charserial0 \
-device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
-sandbox 
on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on




> 2) how much memory qemu allocates right after it started the guest? I 
> mean, before you start some mem stress test which causes it to 
> allocate the memory fully.

It starts with 300 GB. It depletes its assigned NUMA node (which has 256 GB),
then takes ~70 GB from another NUMA node to complete the 300 GB.


>
>>
>> This means that the pseries guest is able to do mem pre-alloc. I'd 
>> say that
>> there might be something missing somewhere (XML, host setup, libvirt
>> config ...) or perhaps even a bug that is preventing Libvirt from doing
>> this pre-alloc. This explains why I can't verify this patch series. 
>> I'll see if
>> I dig it further to understand why when I have the time.
>
> Yeah, I don't know Power well enough to help you. Sorry.


No problem. One question: Libvirt is supposed to let the VM do the full
allocation of its RAM using -mem-prealloc even with -realtime mlock=off,
is that correct?


Thanks,

DHB


>
> Michal

Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Michal Privoznik 5 years ago
On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
> 
> 
> On 4/12/19 6:10 AM, Michal Privoznik wrote:
>> On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
>>>
>>>
>>> On 4/11/19 11:56 AM, Michal Privoznik wrote:
>>>> On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
>>>>> Hi,
>>>>>
>>>>> I've tested these patches again, twice, in similar setups like I 
>>>>> tested
>>>>> the first version (first in a Power8, then in a Power9 server).
>>>>>
>>>>> Same results, though. Libvirt will not avoid the launch of a 
>>>>> pseries guest,
>>>>> with numanode=strict, even if the numa node does not have available
>>>>> RAM. If I stress test the memory of the guest to force the allocation,
>>>>> QEMU exits with an error as soon as the memory of the host numa node
>>>>> is exhausted.
>>>>
>>>> Yes, this is expected. I mean, by default qemu doesn't allocate 
>>>> memory for the guest fully. You'd have to force it:
>>>>
>>>> <memoryBacking>
>>>>   <allocation mode='immediate'/>
>>>> </memoryBacking>
>>>>
>>>
>>> Tried with this extra setting, still no good. Domain still boots, 
>>> even if
>>> there is not enough memory to load up all its ram in the NUMA node
>>> I am setting. For reference, this is the top of the guest XML:
>>>
>>>
>>>    <name>vm1</name>
>>>    <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
>>>    <memory unit='KiB'>314572800</memory>
>>>    <currentMemory unit='KiB'>314572800</currentMemory>
>>>    <memoryBacking>
>>>      <allocation mode='immediate'/>
>>>    </memoryBacking>
>>>    <vcpu placement='static'>16</vcpu>
>>>    <numatune>
>>>      <memory mode='strict' nodeset='0'/>
>>>    </numatune>
>>>    <os>
>>>      <type arch='ppc64' machine='pseries'>hvm</type>
>>>      <boot dev='hd'/>
>>>    </os>
>>>    <clock offset='utc'/>
>>>
>>> While doing this test, I recalled that some of my IBM peers recently
>>> mentioned that they were unable to do a pre-allocation of the RAM
>>> of a pseries guest using Libvirt, but they were able to do it using QEMU
>>> directly (using -realtime mlock=on). In fact, I just tried it out 
>>> with command
>>> line QEMU and the guest allocated all the memory at boot.
>>
>> Ah, so looks like -mem-prealloc doesn't work at Power? Can you please 
>> check:
>>
>> 1) that -mem-prealloc is on the qemu command line
> 
> Yes. This is the cmd line generated:
> 
> /usr/bin/qemu-system-ppc64 \
> -name guest=vm1,debug-threads=on \
> -S \
> -object 
> secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes 
> \
> -machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
> -bios /home/user/boot_rom.bin \
> -m 307200 \
> -mem-prealloc \
> -realtime mlock=off \

This looks correct.

> -smp 16,sockets=16,cores=1,threads=1 \
> -uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
> -display none \
> -no-user-config \
> -nodefaults \
> -chardev 
> socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait 
> \
> -mon chardev=charmonitor,id=monitor,mode=control \
> -rtc base=utc \
> -no-shutdown \
> -boot strict=on \
> -device spapr-pci-host-bridge,index=1,id=pci.1 \
> -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
> -drive 
> file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
> -device 
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 
> \
> -chardev pty,id=charserial0 \
> -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
> -sandbox 
> on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
> -msg timestamp=on
> 
> 
> 
> 
>> 2) how much memory qemu allocates right after it started the guest? I 
>> mean, before you start some mem stress test which causes it to 
>> allocate the memory fully.
> 
> It starts with 300Gb. It depletes its assigned NUMA node (that has 256Gb),
> then it takes ~70Gb from another NUMA node to complete the 300Gb.

Huh, then -mem-prealloc is working but something else is not. What 
strikes me is that once the guest starts using the memory, the host kernel 
kills the guest. So the host kernel knows about the limits we've set but 
doesn't enforce them when allocating the memory.
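[The limits in question live in the cgroup's cpuset.mems file, which holds a list-style nodeset such as "0" or "0,2-3". A small parser for that syntax - a hypothetical helper for inspection scripts, not libvirt code:]

```python
def parse_nodeset(s):
    """Parse a cpuset.mems-style nodeset string such as '0,2-3'
    into a set of NUMA node numbers."""
    nodes = set()
    for part in s.strip().split(","):
        if not part:
            continue          # tolerate an empty field / empty file
        if "-" in part:
            lo, hi = part.split("-", 1)
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return nodes
```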

> 
> 
>>
>>>
>>> This means that the pseries guest is able to do mem pre-alloc. I'd 
>>> say that
>>> there might be something missing somewhere (XML, host setup, libvirt
>>> config ...) or perhaps even a bug that is preventing Libvirt from doing
>>> this pre-alloc. This explains why I can't verify this patch series. 
>>> I'll see if
>>> I dig it further to understand why when I have the time.
>>
>> Yeah, I don't know Power well enough to help you. Sorry.
> 
> 
> No problem. One question: Libvirt is supposed to let the VM do the full
> allocation of its RAM using -mem-prealloc and with -realtime mlock=off,
> is that correct?

-mem-prealloc should be enough. -realtime mlock is there to lock the 
allocated memory so that it doesn't get swapped out. You can enable 
memory locking via:

   <memoryBacking>
     <locked/>
   </memoryBacking>

Michal

Re: [libvirt] [PATCH v2 0/2] qemu: Honor memory mode='strict'
Posted by Daniel P. Berrangé 5 years ago
On Fri, Apr 12, 2019 at 01:15:05PM +0200, Michal Privoznik wrote:
> On 4/12/19 12:11 PM, Daniel Henrique Barboza wrote:
> > 
> > 
> > On 4/12/19 6:10 AM, Michal Privoznik wrote:
> > > On 4/11/19 7:29 PM, Daniel Henrique Barboza wrote:
> > > > 
> > > > 
> > > > On 4/11/19 11:56 AM, Michal Privoznik wrote:
> > > > > On 4/11/19 4:23 PM, Daniel Henrique Barboza wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I've tested these patches again, twice, in similar
> > > > > > setups like I tested
> > > > > > the first version (first in a Power8, then in a Power9 server).
> > > > > > 
> > > > > > Same results, though. Libvirt will not avoid the launch
> > > > > > of a pseries guest,
> > > > > > with numanode=strict, even if the numa node does not have available
> > > > > > RAM. If I stress test the memory of the guest to force the allocation,
> > > > > > QEMU exits with an error as soon as the memory of the host numa node
> > > > > > is exhausted.
> > > > > 
> > > > > Yes, this is expected. I mean, by default qemu doesn't
> > > > > allocate memory for the guest fully. You'd have to force it:
> > > > > 
> > > > > <memoryBacking>
> > > > >   <allocation mode='immediate'/>
> > > > > </memoryBacking>
> > > > > 
> > > > 
> > > > Tried with this extra setting, still no good. Domain still
> > > > boots, even if
> > > > there is not enough memory to load up all its ram in the NUMA node
> > > > I am setting. For reference, this is the top of the guest XML:
> > > > 
> > > > 
> > > >    <name>vm1</name>
> > > >    <uuid>f48e9e35-8406-4784-875f-5185cb4d47d7</uuid>
> > > >    <memory unit='KiB'>314572800</memory>
> > > >    <currentMemory unit='KiB'>314572800</currentMemory>
> > > >    <memoryBacking>
> > > >      <allocation mode='immediate'/>
> > > >    </memoryBacking>
> > > >    <vcpu placement='static'>16</vcpu>
> > > >    <numatune>
> > > >      <memory mode='strict' nodeset='0'/>
> > > >    </numatune>
> > > >    <os>
> > > >      <type arch='ppc64' machine='pseries'>hvm</type>
> > > >      <boot dev='hd'/>
> > > >    </os>
> > > >    <clock offset='utc'/>
> > > > 
> > > > While doing this test, I recalled that some of my IBM peers recently
> > > > mentioned that they were unable to do a pre-allocation of the RAM
> > > > of a pseries guest using Libvirt, but they were able to do it using QEMU
> > > > directly (using -realtime mlock=on). In fact, I just tried it
> > > > out with command
> > > > line QEMU and the guest allocated all the memory at boot.
> > > 
> > > Ah, so looks like -mem-prealloc doesn't work at Power? Can you
> > > please check:
> > > 
> > > 1) that -mem-prealloc is on the qemu command line
> > 
> > Yes. This is the cmd line generated:
> > 
> > /usr/bin/qemu-system-ppc64 \
> > -name guest=vm1,debug-threads=on \
> > -S \
> > -object secret,id=masterKey0,format=raw,file=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/master-key.aes
> > \
> > -machine pseries-2.11,accel=kvm,usb=off,dump-guest-core=off \
> > -bios /home/user/boot_rom.bin \
> > -m 307200 \
> > -mem-prealloc \
> > -realtime mlock=off \
> 
> This looks correct.
> 
> > -smp 16,sockets=16,cores=1,threads=1 \
> > -uuid f48e9e35-8406-4784-875f-5185cb4d47d7 \
> > -display none \
> > -no-user-config \
> > -nodefaults \
> > -chardev socket,id=charmonitor,path=/home/user/usr/var/lib/libvirt/qemu/domain-2-vm1/monitor.sock,server,nowait
> > \
> > -mon chardev=charmonitor,id=monitor,mode=control \
> > -rtc base=utc \
> > -no-shutdown \
> > -boot strict=on \
> > -device spapr-pci-host-bridge,index=1,id=pci.1 \
> > -device qemu-xhci,id=usb,bus=pci.0,addr=0x3 \
> > -drive
> > file=/home/user/nv2-vm1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0
> > \
> > -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
> > \
> > -chardev pty,id=charserial0 \
> > -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
> > -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x1 \
> > -sandbox
> > on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
> > \
> > -msg timestamp=on
> > 
> > 
> > 
> > 
> > > 2) how much memory qemu allocates right after it started the guest?
> > > I mean, before you start some mem stress test which causes it to
> > > allocate the memory fully.
> > 
> > It starts with 300Gb. It depletes its assigned NUMA node (that has 256Gb),
> > then it takes ~70Gb from another NUMA node to complete the 300Gb.
> 
> Huh, than -mem-prealloc is working but something else is not. What strikes
> me is that once guest starts using the memory then host kernel kills the
> guest. So host kernel knows about the limits we've set but doesn't enforce
> them when allocating the memory.

The way QEMU implements -mem-prealloc is a bit of a hack.

Essentially it tries to write a single byte in each page of
memory, on the belief that this will cause the kernel to
allocate that page.

See do_touch_pages() in qemu's util/oslib-posix.c:


        for (i = 0; i < numpages; i++) {
            /*
             * Read & write back the same value, so we don't
             * corrupt existing user/app data that might be
             * stored.
             *
             * 'volatile' to stop compiler optimizing this away
             * to a no-op
             *
             * TODO: get a better solution from kernel so we
             * don't need to write at all so we don't cause
             * wear on the storage backing the region...
             */
            *(volatile char *)addr = *addr;
            addr += hpagesize;
        }

I wonder if the compiler on PPC is optimizing this in some
way that turns it into a no-op unexpectedly.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
