From: Michal Privoznik
To: libvir-list@redhat.com
Subject: [PATCH 2/4] qemu: Start emulator thread with more generous cpuset.mems
Date: Tue, 23 May 2023 12:06:19 +0200

Consider a domain with two guest NUMA nodes and the following setting:

What this means is that the emulator thread is pinned onto host NUMA
node #0 (by setting the corresponding cpuset.mems to "0"), and two
memory-backend-* objects are created:

  -object '{"qom-type":"memory-backend-ram","id":"ram-node0", ..,
           "host-nodes":[1],"policy":"bind"}' \
  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
  -object '{"qom-type":"memory-backend-ram","id":"ram-node1", ..,
           "host-nodes":[0],"policy":"bind"}' \
  -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \

Note, the emulator thread is pinned well before QEMU is even exec()-ed.

Now, the way memory allocation works in QEMU is: the emulator thread
calls mmap() followed by mbind() (which is sane; that's how everybody
should do it). BUT, because the thread is already restricted by CGroups
to just NUMA node #0, calling:

  mbind(host-nodes:[1]); /* made up syntax (TM) */

fails. This is expected, though: the kernel was instructed to place the
memory at NUMA node "0", and yet the process is trying to place it
elsewhere.
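To illustrate the conflict described above, here is a toy Python model
(not libvirt or kernel code; mbind_allowed() is a made-up helper, and
real mbind()/cpuset semantics are more subtle than a set comparison):

```python
# Toy model of the failure: the kernel only honors an mbind() whose
# requested nodes fall within the nodes the cgroup's cpuset.mems
# permits for this thread (simplified assumption).
def mbind_allowed(cpuset_mems: set, requested_nodes: set) -> bool:
    """Return True if a bind to requested_nodes can succeed while the
    thread is confined to cpuset_mems (simplified model)."""
    return requested_nodes <= cpuset_mems

# Emulator thread confined to host node 0 via cpuset.mems="0"
cpuset_mems = {0}

# ram-node0 binds to host node 1 -> outside the cgroup restriction
print(mbind_allowed(cpuset_mems, {1}))  # False: this mbind() fails

# ram-node1 binds to host node 0 -> permitted
print(mbind_allowed(cpuset_mems, {0}))  # True
```

With the union nodeset ("0-1") set up front, both binds fall inside the
permitted set and initialization can proceed.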
We used to solve this by not restricting the emulator thread at all
initially, and only after it was done initializing (i.e. we got the QMP
greeting) did we place it onto the desired nodes. But this had its own
problems (e.g. QEMU might have locked pieces of its memory which were
then unable to migrate onto different NUMA nodes). Therefore, in
v5.1.0-rc1~282 we changed this and set CGroups upfront (even before
exec()-ing QEMU). And this used to work, but something has changed (I
can't really put my finger on what).

Therefore, for the initialization phase start the thread with the union
of all configured host NUMA nodes ("0-1" in our example) and fix up the
placement only after QEMU has started.

NB, memory hotplug suffers from the same problem, but that might be
fixed later.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150
Signed-off-by: Michal Privoznik
Reviewed-by: Martin Kletzander
---
 src/qemu/qemu_domain.c  | 26 ++++++++++++++++++++++++++
 src/qemu/qemu_domain.h  |  5 +++++
 src/qemu/qemu_process.c | 42 ++++++++++++++++++++++++++++-----------
 3 files changed, 62 insertions(+), 11 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 35d03515c7..d97a0ce24e 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -12674,3 +12674,29 @@ qemuDomainEvaluateCPUMask(const virDomainDef *def,
 
     return NULL;
 }
+
+
+void
+qemuDomainNumatuneMaybeFormatNodesetUnion(virDomainObj *vm,
+                                          virBitmap **nodeset,
+                                          char **nodesetStr)
+{
+    virDomainNuma *numatune = vm->def->numa;
+    qemuDomainObjPrivate *priv = vm->privateData;
+    g_autoptr(virBitmap) unionMask = virBitmapNew(0);
+    ssize_t i;
+
+    for (i = -1; i < (ssize_t)virDomainNumaGetNodeCount(numatune); i++) {
+        virBitmap *tmp;
+
+        tmp = virDomainNumatuneGetNodeset(numatune, priv->autoNodeset, i);
+        if (tmp)
+            virBitmapUnion(unionMask, tmp);
+    }
+
+    if (nodesetStr)
+        *nodesetStr = virBitmapFormat(unionMask);
+
+    if (nodeset)
+        *nodeset = g_steal_pointer(&unionMask);
+}
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index ec9ae75bce..999190e381 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -1140,3 +1140,8 @@ virBitmap *
 qemuDomainEvaluateCPUMask(const virDomainDef *def,
                           virBitmap *cpumask,
                           virBitmap *autoCpuset);
+
+void
+qemuDomainNumatuneMaybeFormatNodesetUnion(virDomainObj *vm,
+                                          virBitmap **nodeset,
+                                          char **nodesetStr);
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 57c3ea2dbf..6b85b7cee7 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -2550,7 +2550,8 @@ qemuProcessSetupPid(virDomainObj *vm,
                     virBitmap *cpumask,
                     unsigned long long period,
                     long long quota,
-                    virDomainThreadSchedParam *sched)
+                    virDomainThreadSchedParam *sched,
+                    bool unionMems)
 {
     qemuDomainObjPrivate *priv = vm->privateData;
     virDomainNuma *numatune = vm->def->numa;
@@ -2592,11 +2593,22 @@ qemuProcessSetupPid(virDomainObj *vm,
 
     if (virDomainNumatuneGetMode(numatune, -1, &mem_mode) == 0 &&
         (mem_mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT ||
-         mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) &&
-        virDomainNumatuneMaybeFormatNodeset(numatune,
-                                            priv->autoNodeset,
-                                            &mem_mask, -1) < 0)
-        goto cleanup;
+         mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE)) {
+
+        /* QEMU allocates its memory from the emulator thread. Thus it
+         * needs to access union of all host nodes configured. This is
+         * going to be replaced with proper value later in the startup
+         * process. */
+        if (unionMems &&
+            nameval == VIR_CGROUP_THREAD_EMULATOR) {
+            qemuDomainNumatuneMaybeFormatNodesetUnion(vm, NULL, &mem_mask);
+        } else {
+            if (virDomainNumatuneMaybeFormatNodeset(numatune,
+                                                    priv->autoNodeset,
+                                                    &mem_mask, -1) < 0)
+                goto cleanup;
+        }
+    }
 
     /* For restrictive numatune mode we need to set cpuset.mems for vCPU
      * threads based on the node they are in as there is nothing else uses
@@ -2689,13 +2701,15 @@ qemuProcessSetupPid(virDomainObj *vm,
 
 
 static int
-qemuProcessSetupEmulator(virDomainObj *vm)
+qemuProcessSetupEmulator(virDomainObj *vm,
+                         bool unionMems)
 {
     return qemuProcessSetupPid(vm, vm->pid, VIR_CGROUP_THREAD_EMULATOR,
                                0, vm->def->cputune.emulatorpin,
                                vm->def->cputune.emulator_period,
                                vm->def->cputune.emulator_quota,
-                               vm->def->cputune.emulatorsched);
+                               vm->def->cputune.emulatorsched,
+                               unionMems);
 }
 
 
@@ -5891,7 +5905,8 @@ qemuProcessSetupVcpu(virDomainObj *vm,
                             vcpuid, vcpu->cpumask,
                             vm->def->cputune.period,
                             vm->def->cputune.quota,
-                            &vcpu->sched) < 0)
+                            &vcpu->sched,
+                            false) < 0)
         return -1;
 
     if (schedCore &&
@@ -6046,7 +6061,8 @@ qemuProcessSetupIOThread(virDomainObj *vm,
                                iothread->cpumask,
                                vm->def->cputune.iothread_period,
                                vm->def->cputune.iothread_quota,
-                               &iothread->sched);
+                               &iothread->sched,
+                               false);
 }
 
 
@@ -7746,7 +7762,7 @@ qemuProcessLaunch(virConnectPtr conn,
         goto cleanup;
 
     VIR_DEBUG("Setting emulator tuning/settings");
-    if (qemuProcessSetupEmulator(vm) < 0)
+    if (qemuProcessSetupEmulator(vm, true) < 0)
         goto cleanup;
 
     VIR_DEBUG("Setting cgroup for external devices (if required)");
@@ -7809,6 +7825,10 @@ qemuProcessLaunch(virConnectPtr conn,
     if (qemuConnectAgent(driver, vm) < 0)
         goto cleanup;
 
+    VIR_DEBUG("Fixing up emulator tuning/settings");
+    if (qemuProcessSetupEmulator(vm, false) < 0)
+        goto cleanup;
+
     VIR_DEBUG("setting up hotpluggable cpus");
     if (qemuDomainHasHotpluggableStartupVcpus(vm->def)) {
         if (qemuDomainRefreshVcpuInfo(vm, asyncJob, false) < 0)
-- 
2.39.3
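As an aside for reviewers: the nodeset-union logic that
qemuDomainNumatuneMaybeFormatNodesetUnion() implements can be sketched
in Python (hypothetical helper names; this is a conceptual model of the
C code above, not part of libvirt; nodesets are modeled as plain sets):

```python
# Model of the patch's union step: gather the domain-wide nodeset
# (libvirt's cell index -1) plus each guest cell's host nodeset, and
# take their union for the emulator thread's initial cpuset.mems.
def nodeset_union(nodesets):
    """nodesets: iterable of sets of host NUMA node ids (None allowed,
    mirroring virDomainNumatuneGetNodeset() returning NULL)."""
    union = set()
    for ns in nodesets:
        if ns:
            union |= ns
    return union

def format_nodeset(nodes):
    """Format {0, 1, 3} as "0-1,3", in the style of virBitmapFormat()."""
    runs = []
    for n in sorted(nodes):
        if runs and runs[-1][1] == n - 1:
            runs[-1][1] = n        # extend the current run
        else:
            runs.append([n, n])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)

# Example from the commit message: emulator pinned to node 0, guest
# cells bound to host nodes 1 and 0 -> union is "0-1"
print(format_nodeset(nodeset_union([{0}, {1}, {0}])))  # 0-1
```

Once QEMU is up, the second qemuProcessSetupEmulator(vm, false) call
narrows cpuset.mems back down to the configured placement.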