[PATCH 2/4] qemu: Start emulator thread with more generous cpuset.mems

Michal Privoznik posted 4 patches 10 months, 3 weeks ago
[PATCH 2/4] qemu: Start emulator thread with more generous cpuset.mems
Posted by Michal Privoznik 10 months, 3 weeks ago
Consider a domain with two guest NUMA nodes and the following
<numatune/> setting :

  <numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
  </numatune>

What this means is the emulator thread is pinned onto host NUMA
node #0 (by setting corresponding cpuset.mems to "0"), and two
memory-backend-* objects are created:

  -object '{"qom-type":"memory-backend-ram","id":"ram-node0", .., "host-nodes":[1],"policy":"bind"}' \
  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
  -object '{"qom-type":"memory-backend-ram","id":"ram-node1", .., "host-nodes":[0],"policy":"bind"}' \
  -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \

Note, the emulator thread is pinned well before QEMU is even
exec()-ed.

Now, the way memory allocation works in QEMU is: the emulator
thread calls mmap() followed by mbind() (which is sane, that's
how everybody should do it). BUT, because the thread is already
restricted by CGroups to just NUMA node #0, calling:

  mbind(host-nodes:[1]); /* made up syntax (TM) */

fails. This is expected though. Kernel was instructed to place
the memory at NUMA node "0" and yet, process is trying to place
it elsewhere.

We used to solve this by not restricting emulator thread at all
initially, and only after it's done initializing (i.e. we got the
QMP greeting) we placed it onto desired nodes. But this had its
own problems (e.g. QEMU might have locked pieces of its memory
which were then unable to migrate onto different NUMA nodes).

Therefore, in v5.1.0-rc1~282 we've changed this and set CGropus
upfront (even before exec()-ing QEMU). And this used to work, but
something has changed (I can't really put my finger anywhere).

Therefore, for the initialization start the thread with union of
all configured host NUMA nodes ("0-1" in our example) and fix the
placement only after QEMU is started.

NB, the memory hotplug suffers the same problem, but that might
be fixed later.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_domain.c  | 26 +++++++++++++++++++++++++
 src/qemu/qemu_domain.h  |  5 +++++
 src/qemu/qemu_process.c | 42 ++++++++++++++++++++++++++++++-----------
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 35d03515c7..d97a0ce24e 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12674,3 +12674,29 @@ qemuDomainEvaluateCPUMask(const virDomainDef *def,
 
     return NULL;
 }
+
+
+void
+qemuDomainNumatuneMaybeFormatNodesetUnion(virDomainObj *vm,
+                                          virBitmap **nodeset,
+                                          char **nodesetStr)
+{
+    virDomainNuma *numatune = vm->def->numa;
+    qemuDomainObjPrivate *priv = vm->privateData;
+    g_autoptr(virBitmap) unionMask = virBitmapNew(0);
+    ssize_t i;
+
+    for (i = -1; i < (ssize_t)virDomainNumaGetNodeCount(numatune); i++) {
+        virBitmap *tmp;
+
+        tmp = virDomainNumatuneGetNodeset(numatune, priv->autoNodeset, i);
+        if (tmp)
+            virBitmapUnion(unionMask, tmp);
+    }
+
+    if (nodesetStr)
+        *nodesetStr = virBitmapFormat(unionMask);
+
+    if (nodeset)
+        *nodeset = g_steal_pointer(&unionMask);
+}
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index ec9ae75bce..999190e381 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -1140,3 +1140,8 @@ virBitmap *
 qemuDomainEvaluateCPUMask(const virDomainDef *def,
                           virBitmap *cpumask,
                           virBitmap *autoCpuset);
+
+void
+qemuDomainNumatuneMaybeFormatNodesetUnion(virDomainObj *vm,
+                                          virBitmap **nodeset,
+                                          char **nodesetStr);
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 57c3ea2dbf..6b85b7cee7 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -2550,7 +2550,8 @@ qemuProcessSetupPid(virDomainObj *vm,
                     virBitmap *cpumask,
                     unsigned long long period,
                     long long quota,
-                    virDomainThreadSchedParam *sched)
+                    virDomainThreadSchedParam *sched,
+                    bool unionMems)
 {
     qemuDomainObjPrivate *priv = vm->privateData;
     virDomainNuma *numatune = vm->def->numa;
@@ -2592,11 +2593,22 @@ qemuProcessSetupPid(virDomainObj *vm,
 
         if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 &&
             (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT ||
-             mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) &&
-            virDomainNumatuneMaybeFormatNodeset(numatune,
-                                                priv->autoNodeset,
-                                                &mem_mask, -1) < 0)
-            goto cleanup;
+             mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) {
+
+            /* QEMU allocates its memory from the emulator thread. Thus it
+             * needs to access union of all host nodes configured. This is
+             * going to be replaced with proper value later in the startup
+             * process. */
+            if (unionMems &&
+                nameval == VIR_CGROUP_THREAD_EMULATOR) {
+                qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask);
+            } else {
+                if (virDomainNumatuneMaybeFormatNodeset(numatune,
+                                                        priv->autoNodeset,
+                                                        &mem_mask, -1) < 0)
+                    goto cleanup;
+            }
+        }
 
         /* For restrictive numatune mode we need to set cpuset.mems for vCPU
          * threads based on the node they are in as there is nothing else uses
@@ -2689,13 +2701,15 @@ qemuProcessSetupPid(virDomainObj *vm,
 
 
 static int
-qemuProcessSetupEmulator(virDomainObj *vm)
+qemuProcessSetupEmulator(virDomainObj *vm,
+                         bool unionMems)
 {
     return qemuProcessSetupPid(vm, vm->pid, VIR_CGROUP_THREAD_EMULATOR,
                                0, vm->def->cputune.emulatorpin,
                                vm->def->cputune.emulator_period,
                                vm->def->cputune.emulator_quota,
-                               vm->def->cputune.emulatorsched);
+                               vm->def->cputune.emulatorsched,
+                               unionMems);
 }
 
 
@@ -5891,7 +5905,8 @@ qemuProcessSetupVcpu(virDomainObj *vm,
                             vcpuid, vcpu->cpumask,
                             vm->def->cputune.period,
                             vm->def->cputune.quota,
-                            &vcpu->sched) < 0)
+                            &vcpu->sched,
+                            false) < 0)
         return -1;
 
     if (schedCore &&
@@ -6046,7 +6061,8 @@ qemuProcessSetupIOThread(virDomainObj *vm,
                                iothread->cpumask,
                                vm->def->cputune.iothread_period,
                                vm->def->cputune.iothread_quota,
-                               &iothread->sched);
+                               &iothread->sched,
+                               false);
 }
 
 
@@ -7746,7 +7762,7 @@ qemuProcessLaunch(virConnectPtr conn,
         goto cleanup;
 
     VIR_DEBUG("Setting emulator tuning/settings");
-    if (qemuProcessSetupEmulator(vm) < 0)
+    if (qemuProcessSetupEmulator(vm, true) < 0)
         goto cleanup;
 
     VIR_DEBUG("Setting cgroup for external devices (if required)");
@@ -7809,6 +7825,10 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuConnectAgent(driver, vm) < 0)
         goto cleanup;
 
+    VIR_DEBUG("Fixing up emulator tuning/settings");
+    if (qemuProcessSetupEmulator(vm, false) < 0)
+        goto cleanup;
+
     VIR_DEBUG("setting up hotpluggable cpus");
     if (qemuDomainHasHotpluggableStartupVcpus(vm->def)) {
         if (qemuDomainRefreshVcpuInfo(vm, asyncJob, false) < 0)
-- 
2.39.3
Re: [PATCH 2/4] qemu: Start emulator thread with more generous cpuset.mems
Posted by Martin Kletzander 10 months, 3 weeks ago
On Tue, May 23, 2023 at 12:06:19PM +0200, Michal Privoznik wrote:
>Consider a domain with two guest NUMA nodes and the following
><numatune/> setting :
>
>  <numatune>
>    <memory mode="strict" nodeset="0"/>
>    <memnode cellid="0" mode="strict" nodeset="1"/>
>  </numatune>
>
>What this means is the emulator thread is pinned onto host NUMA
>node #0 (by setting corresponding cpuset.mems to "0"), and two
>memory-backend-* objects are created:
>
>  -object '{"qom-type":"memory-backend-ram","id":"ram-node0", .., "host-nodes":[1],"policy":"bind"}' \
>  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
>  -object '{"qom-type":"memory-backend-ram","id":"ram-node1", .., "host-nodes":[0],"policy":"bind"}' \
>  -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \
>
>Note, the emulator thread is pinned well before QEMU is even
>exec()-ed.
>
>Now, the way memory allocation works in QEMU is: the emulator
>thread calls mmap() followed by mbind() (which is sane, that's
>how everybody should do it). BUT, because the thread is already
>restricted by CGroups to just NUMA node #0, calling:
>
>  mbind(host-nodes:[1]); /* made up syntax (TM) */
>
>fails. This is expected though. Kernel was instructed to place
>the memory at NUMA node "0" and yet, process is trying to place
>it elsewhere.
>
>We used to solve this by not restricting emulator thread at all
>initially, and only after it's done initializing (i.e. we got the
>QMP greeting) we placed it onto desired nodes. But this had its
>own problems (e.g. QEMU might have locked pieces of its memory
>which were then unable to migrate onto different NUMA nodes).
>
>Therefore, in v5.1.0-rc1~282 we've changed this and set CGropus

*cgroups

>upfront (even before exec()-ing QEMU). And this used to work, but
>something has changed (I can't really put my finger anywhere).
>

I'm glad that's the case

>Therefore, for the initialization start the thread with union of
>all configured host NUMA nodes ("0-1" in our example) and fix the
>placement only after QEMU is started.
>
>NB, the memory hotplug suffers the same problem, but that might
>be fixed later.
>
>Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150
>Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
>---
> src/qemu/qemu_domain.c  | 26 +++++++++++++++++++++++++
> src/qemu/qemu_domain.h  |  5 +++++
> src/qemu/qemu_process.c | 42 ++++++++++++++++++++++++++++++-----------
> 3 files changed, 62 insertions(+), 11 deletions(-)
>
>diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>index 35d03515c7..d97a0ce24e 100644
>--- a/src/qemu/qemu_domain.c
>+++ b/src/qemu/qemu_domain.c
>@@ -12674,3 +12674,29 @@ qemuDomainEvaluateCPUMask(const virDomainDef *def,
>
>     return NULL;
> }
>+
>+
>+void
>+qemuDomainNumatuneMaybeFormatNodesetUnion(virDomainObj *vm,
>+                                          virBitmap **nodeset,
>+                                          char **nodesetStr)
>+{
>+    virDomainNuma *numatune = vm->def->numa;
>+    qemuDomainObjPrivate *priv = vm->privateData;
>+    g_autoptr(virBitmap) unionMask = virBitmapNew(0);
>+    ssize_t i;
>+
>+    for (i = -1; i < (ssize_t)virDomainNumaGetNodeCount(numatune); i++) {
>+        virBitmap *tmp;
>+
>+        tmp = virDomainNumatuneGetNodeset(numatune, priv->autoNodeset, i);
>+        if (tmp)
>+            virBitmapUnion(unionMask, tmp);
>+    }
>+
>+    if (nodesetStr)
>+        *nodesetStr = virBitmapFormat(unionMask);
>+
>+    if (nodeset)
>+        *nodeset = g_steal_pointer(&unionMask);
>+}
>diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
>index ec9ae75bce..999190e381 100644
>--- a/src/qemu/qemu_domain.h
>+++ b/src/qemu/qemu_domain.h
>@@ -1140,3 +1140,8 @@ virBitmap *
> qemuDomainEvaluateCPUMask(const virDomainDef *def,
>                           virBitmap *cpumask,
>                           virBitmap *autoCpuset);
>+
>+void
>+qemuDomainNumatuneMaybeFormatNodesetUnion(virDomainObj *vm,
>+                                          virBitmap **nodeset,
>+                                          char **nodesetStr);
>diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>index 57c3ea2dbf..6b85b7cee7 100644
>--- a/src/qemu/qemu_process.c
>+++ b/src/qemu/qemu_process.c
>@@ -2550,7 +2550,8 @@ qemuProcessSetupPid(virDomainObj *vm,
>                     virBitmap *cpumask,
>                     unsigned long long period,
>                     long long quota,
>-                    virDomainThreadSchedParam *sched)
>+                    virDomainThreadSchedParam *sched,
>+                    bool unionMems)
> {
>     qemuDomainObjPrivate *priv = vm->privateData;
>     virDomainNuma *numatune = vm->def->numa;
>@@ -2592,11 +2593,22 @@ qemuProcessSetupPid(virDomainObj *vm,
>
>         if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 &&
>             (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT ||
>-             mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) &&
>-            virDomainNumatuneMaybeFormatNodeset(numatune,
>-                                                priv->autoNodeset,
>-                                                &mem_mask, -1) < 0)
>-            goto cleanup;
>+             mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) {
>+
>+            /* QEMU allocates its memory from the emulator thread. Thus it
>+             * needs to access union of all host nodes configured. This is
>+             * going to be replaced with proper value later in the startup
>+             * process. */
>+            if (unionMems &&
>+                nameval == VIR_CGROUP_THREAD_EMULATOR) {
>+                qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask);

This should not be the case for VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE as
that will not result in host-nodes being set.

With those fixes

Reviewed-by: Martin Kletzander <mkletzan@redhat.com>