[libvirt] [PATCH v2 2/2] qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu

[libvirt] [PATCH v2 2/2] qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu
Posted by Michal Privoznik 6 years, 10 months ago
It's funny how this went unnoticed for such a long time. Long
story short, if a domain is configured with
VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour
that. This is because of 7e72ac787848, after which libvirt
allowed qemu to allocate memory just anywhere and only then
used some magic involving cpuset.memory_migrate and
cpuset.mems to move the memory to the desired NUMA nodes.
This was done to work around a KVM bug where KVM would fail
if there wasn't a DMA zone available on the NUMA node. Well,
while the workaround may have stopped libvirt from tickling
the KVM bug, it also introduced a bug on the libvirt side: if
there is not enough memory on the configured NUMA node(s),
any attempt to start a domain should fail. But because of the
way we play with guest memory, domains start just happily.
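
To illustrate, the old behaviour boils down to roughly the
following cgroup v1 writes (a minimal sketch in plain C, assuming
cgroup v1 is mounted under /sys/fs/cgroup/cpuset and using a
made-up emulator cgroup path; libvirt itself goes through its
virCgroup helpers instead):

  #include <stdio.h>

  /* Write a single value into a cgroup control file. */
  static int
  writeCgroupFile(const char *path, const char *value)
  {
      FILE *fp = fopen(path, "w");

      if (!fp)
          return -1;
      if (fputs(value, fp) == EOF) {
          fclose(fp);
          return -1;
      }
      return fclose(fp) == 0 ? 0 : -1;
  }

  int
  main(void)
  {
      /* Illustrative path; the real one is built by virCgroup code. */
      const char *base =
          "/sys/fs/cgroup/cpuset/machine.slice/vm-example/emulator";
      char path[256];

      /* 1) Ask the kernel to migrate already-allocated pages
       *    whenever the allowed node set changes... */
      snprintf(path, sizeof(path), "%s/cpuset.memory_migrate", base);
      if (writeCgroupFile(path, "1") < 0)
          return 1;

      /* 2) ...and only then narrow cpuset.mems, moving the guest
       *    memory to the desired node after the fact. */
      snprintf(path, sizeof(path), "%s/cpuset.mems", base);
      if (writeCgroupFile(path, "1") < 0)
          return 1;

      return 0;
  }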

The solution is to move the child we've just forked into the
emulator cgroup, set up cpuset.mems, and only then exec() qemu.
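
A minimal sketch of the resulting ordering, in plain POSIX C;
setupEmulatorCgroup() is a hypothetical stand-in for the real
qemuProcessSetupEmulator()/cpuset.mems plumbing, and the pipe
mirrors how the parent holds the child back until the cgroup is
ready:

  #include <unistd.h>
  #include <sys/types.h>

  /* Hypothetical stand-in: move 'child' into the emulator cgroup
   * and write cpuset.mems for it. */
  extern int setupEmulatorCgroup(pid_t child);

  static int
  launchQemu(char *const argv[])
  {
      int sync_pipe[2];
      pid_t child;
      char buf;

      if (pipe(sync_pipe) < 0)
          return -1;

      if ((child = fork()) < 0)
          return -1;

      if (child == 0) {
          /* Child: block until the parent signals that the cgroup
           * setup is done, then exec() qemu so every allocation it
           * makes is already subject to the strict NUMA mask. */
          close(sync_pipe[1]);
          if (read(sync_pipe[0], &buf, 1) != 1)
              _exit(1);
          close(sync_pipe[0]);
          execv("/usr/bin/qemu-system-x86_64", argv);
          _exit(127);
      }

      /* Parent: set up the cgroup *before* releasing the child. */
      close(sync_pipe[0]);
      if (setupEmulatorCgroup(child) < 0)
          return -1;
      if (write(sync_pipe[1], "x", 1) != 1)
          return -1;
      close(sync_pipe[1]);
      return 0;
  }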

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 47d8ca2ff1..076ec18e21 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -6653,6 +6653,14 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessInitCpuAffinity(vm) < 0)
         goto cleanup;
 
+    VIR_DEBUG("Setting emulator tuning/settings");
+    if (qemuProcessSetupEmulator(vm) < 0)
+        goto cleanup;
+
+    VIR_DEBUG("Setting up post-init cgroup restrictions");
+    if (qemuSetupCpusetMems(vm) < 0)
+        goto cleanup;
+
     VIR_DEBUG("Setting cgroup for external devices (if required)");
     if (qemuSetupCgroupForExtDevices(vm, driver) < 0)
         goto cleanup;
@@ -6744,10 +6752,6 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuProcessDetectIOThreadPIDs(driver, vm, asyncJob) < 0)
         goto cleanup;
 
-    VIR_DEBUG("Setting emulator tuning/settings");
-    if (qemuProcessSetupEmulator(vm) < 0)
-        goto cleanup;
-
     VIR_DEBUG("Setting global CPU cgroup (if required)");
     if (qemuSetupGlobalCpuCgroup(vm) < 0)
         goto cleanup;
-- 
2.21.0

Re: [libvirt] [PATCH v2 2/2] qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu
Posted by Martin Kletzander 6 years, 9 months ago
On Wed, Apr 10, 2019 at 06:10:44PM +0200, Michal Privoznik wrote:
>It's funny how this went unnoticed for such a long time. Long
>story short, if a domain is configured with
>VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour
>that. This is because of 7e72ac787848, after which libvirt
>allowed qemu to allocate memory just anywhere and only then
>used some magic involving cpuset.memory_migrate and
>cpuset.mems to move the memory to the desired NUMA nodes.
>This was done to work around a KVM bug where KVM would fail
>if there wasn't a DMA zone available on the NUMA node. Well,
>while the workaround may have stopped libvirt from tickling
>the KVM bug, it also introduced a bug on the libvirt side: if
>there is not enough memory on the configured NUMA node(s),
>any attempt to start a domain should fail. But because of the
>way we play with guest memory, domains start just happily.
>
>The solution is to move the child we've just forked into the
>emulator cgroup, set up cpuset.mems, and only then exec() qemu.
>

So you are saying this was a bug in KVM?  Is it fixed now?  I am not against
this patch, I hated that I had to do the workaround, but I just want to be sure
we won't start hitting that again.

>Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>---
> src/qemu/qemu_process.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
>diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>index 47d8ca2ff1..076ec18e21 100644
>--- a/src/qemu/qemu_process.c
>+++ b/src/qemu/qemu_process.c
>@@ -6653,6 +6653,14 @@ qemuProcessLaunch(virConnectPtr conn,
>     if (qemuProcessInitCpuAffinity(vm) < 0)
>         goto cleanup;
>
>+    VIR_DEBUG("Setting emulator tuning/settings");
>+    if (qemuProcessSetupEmulator(vm) < 0)
>+        goto cleanup;
>+
>+    VIR_DEBUG("Setting up post-init cgroup restrictions");

This is not post-init any more, but more importantly,

>+    if (qemuSetupCpusetMems(vm) < 0)

This function does a subset of what qemuProcessSetupEmulator(), called right
before, does, so I see no reason for it to be called here, or for it to keep
existing in the codebase, for that matter.

>+        goto cleanup;
>+
>     VIR_DEBUG("Setting cgroup for external devices (if required)");
>     if (qemuSetupCgroupForExtDevices(vm, driver) < 0)
>         goto cleanup;
>@@ -6744,10 +6752,6 @@ qemuProcessLaunch(virConnectPtr conn,
>     if (qemuProcessDetectIOThreadPIDs(driver, vm, asyncJob) < 0)
>         goto cleanup;
>
>-    VIR_DEBUG("Setting emulator tuning/settings");
>-    if (qemuProcessSetupEmulator(vm) < 0)
>-        goto cleanup;
>-
>     VIR_DEBUG("Setting global CPU cgroup (if required)");
>     if (qemuSetupGlobalCpuCgroup(vm) < 0)
>         goto cleanup;
>-- 
>2.21.0
>
Re: [libvirt] [PATCH v2 2/2] qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu
Posted by Michal Privoznik 6 years, 9 months ago
On 4/15/19 3:47 PM, Martin Kletzander wrote:
> On Wed, Apr 10, 2019 at 06:10:44PM +0200, Michal Privoznik wrote:
>> It's funny how this went unnoticed for such a long time. Long
>> story short, if a domain is configured with
>> VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour
>> that. This is because of 7e72ac787848, after which libvirt
>> allowed qemu to allocate memory just anywhere and only then
>> used some magic involving cpuset.memory_migrate and
>> cpuset.mems to move the memory to the desired NUMA nodes.
>> This was done to work around a KVM bug where KVM would fail
>> if there wasn't a DMA zone available on the NUMA node. Well,
>> while the workaround may have stopped libvirt from tickling
>> the KVM bug, it also introduced a bug on the libvirt side: if
>> there is not enough memory on the configured NUMA node(s),
>> any attempt to start a domain should fail. But because of the
>> way we play with guest memory, domains start just happily.
>>
>> The solution is to move the child we've just forked into the
>> emulator cgroup, set up cpuset.mems, and only then exec() qemu.
>>
> 
> So you are saying this was a bug in KVM?  Is it fixed now?  I am not
> against this patch, I hated that I had to do the workaround, but I just
> want to be sure we won't start hitting that again.

Yes, that's what I'm saying. Looks like the KVM bug is fixed now because
with Fedora 29 on a NUMA machine I can start domains just fine.

> 
>> Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>> ---
>> src/qemu/qemu_process.c | 12 ++++++++----
>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>> index 47d8ca2ff1..076ec18e21 100644
>> --- a/src/qemu/qemu_process.c
>> +++ b/src/qemu/qemu_process.c
>> @@ -6653,6 +6653,14 @@ qemuProcessLaunch(virConnectPtr conn,
>>     if (qemuProcessInitCpuAffinity(vm) < 0)
>>         goto cleanup;
>>
>> +    VIR_DEBUG("Setting emulator tuning/settings");
>> +    if (qemuProcessSetupEmulator(vm) < 0)
>> +        goto cleanup;
>> +
>> +    VIR_DEBUG("Setting up post-init cgroup restrictions");
> 
> This is not post-init any more, but more importantly,
> 
>> +    if (qemuSetupCpusetMems(vm) < 0)
> 
> This function does a subset of what qemuProcessSetupEmulator(), called
> right before, does, so I see no reason for it to be called here, or for
> it to keep existing in the codebase, for that matter.

Ah, good point. I'll send v3.

Michal

Re: [libvirt] [PATCH v2 2/2] qemu: Set up EMULATOR thread and cpuset.mems before exec()-ing qemu
Posted by Martin Kletzander 6 years, 9 months ago
On Mon, Apr 15, 2019 at 06:32:32PM +0200, Michal Privoznik wrote:
>On 4/15/19 3:47 PM, Martin Kletzander wrote:
>> On Wed, Apr 10, 2019 at 06:10:44PM +0200, Michal Privoznik wrote:
>>> It's funny how this went unnoticed for such a long time. Long
>>> story short, if a domain is configured with
>>> VIR_DOMAIN_NUMATUNE_MEM_STRICT libvirt doesn't really honour
>>> that. This is because of 7e72ac787848, after which libvirt
>>> allowed qemu to allocate memory just anywhere and only then
>>> used some magic involving cpuset.memory_migrate and
>>> cpuset.mems to move the memory to the desired NUMA nodes.
>>> This was done to work around a KVM bug where KVM would fail
>>> if there wasn't a DMA zone available on the NUMA node. Well,
>>> while the workaround may have stopped libvirt from tickling
>>> the KVM bug, it also introduced a bug on the libvirt side: if
>>> there is not enough memory on the configured NUMA node(s),
>>> any attempt to start a domain should fail. But because of the
>>> way we play with guest memory, domains start just happily.
>>>
>>> The solution is to move the child we've just forked into the
>>> emulator cgroup, set up cpuset.mems, and only then exec() qemu.
>>>
>>
>> So you are saying this was a bug in KVM?  Is it fixed now?  I am not
>> against this patch, I hated that I had to do the workaround, but I just
>> want to be sure we won't start hitting that again.
>
>Yes, that's what I'm saying. Looks like the KVM bug is fixed now because
>with Fedora 29 on a NUMA machine I can start domains just fine.
>

What I was saying was that it would be nice to have some proof for this instead
of guesswork.  I, however, acknowledge that this might not be easy, or even
possible (the first patch that introduced the need for the initial workaround
was not pinpointed, at least not to my knowledge).  Just make sure that when
checking for this, you strictly require all the allocations to be done from a
node not mentioned in the output of:

  cat /proc/zoneinfo | grep DMA

and also that you use multiple vCPUs.  If you can also hotplug an extra vCPU
later on, then the test is perfect enough for me to justify this change [1].
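
For reference, a sketch of how listing the nodes to avoid could
look (plain C, assuming the usual "Node N, zone NAME" header
format of /proc/zoneinfo):

  #include <stdio.h>
  #include <string.h>

  /* Print the NUMA nodes for which /proc/zoneinfo reports a DMA
   * or DMA32 zone; the guest must then be pinned strictly to the
   * remaining nodes. */
  int
  main(void)
  {
      FILE *fp = fopen("/proc/zoneinfo", "r");
      char line[256];
      char zone[32];
      int node;

      if (!fp) {
          perror("/proc/zoneinfo");
          return 1;
      }

      /* Zone headers look like: "Node 0, zone      DMA" */
      while (fgets(line, sizeof(line), fp)) {
          if (sscanf(line, "Node %d, zone %31s", &node, zone) == 2 &&
              strncmp(zone, "DMA", 3) == 0)
              printf("node %d has a %s zone\n", node, zone);
      }

      fclose(fp);
      return 0;
  }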

Martin

[1] If you feel like looking up (bisecting) the kernel commit that used this,
    I'm _not_ standing in your way ;)