From nobody Tue May 7 23:45:09 2024 Delivered-To: importer@patchew.org Received-SPF: pass (zoho.com: domain of redhat.com designates 209.132.183.28 as permitted sender) client-ip=209.132.183.28; envelope-from=libvir-list-bounces@redhat.com; helo=mx1.redhat.com; Authentication-Results: mx.zohomail.com; dkim=fail; spf=pass (zoho.com: domain of redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=libvir-list-bounces@redhat.com; dmarc=fail(p=none dis=none) header.from=gmail.com Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by mx.zohomail.com with SMTPS id 1554385269614518.1901451336149; Thu, 4 Apr 2019 06:41:09 -0700 (PDT) Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 5CA05C079C49; Thu, 4 Apr 2019 13:41:08 +0000 (UTC) Received: from colo-mx.corp.redhat.com (colo-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.20]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 199ED5D9CA; Thu, 4 Apr 2019 13:41:08 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by colo-mx.corp.redhat.com (Postfix) with ESMTP id 623E4181AC44; Thu, 4 Apr 2019 13:41:07 +0000 (UTC) Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id x34Df3Y7023226 for ; Thu, 4 Apr 2019 09:41:03 -0400 Received: by smtp.corp.redhat.com (Postfix) id 0ECFE85736; Thu, 4 Apr 2019 13:41:03 +0000 (UTC) Received: from mx1.redhat.com (ext-mx20.extmail.prod.ext.phx2.redhat.com [10.5.110.49]) by smtp.corp.redhat.com (Postfix) with ESMTPS id DC5628572C; Thu, 4 Apr 2019 13:40:59 +0000 (UTC) Received: from mail-qt1-f196.google.com (mail-qt1-f196.google.com [209.85.160.196]) (using TLSv1.2 with cipher 
ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 3ADDF305B16F; Thu, 4 Apr 2019 13:40:53 +0000 (UTC) Received: by mail-qt1-f196.google.com with SMTP id d13so3230105qth.5; Thu, 04 Apr 2019 06:40:53 -0700 (PDT) Received: from rekt.ibmmodules.com ([2804:431:f700:6702:8bc5:7364:e3c:ea55]) by smtp.gmail.com with ESMTPSA id p52sm13998355qta.37.2019.04.04.06.40.49 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 04 Apr 2019 06:40:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=jg3rFsK4p3qU7hPJc30BoDu73O15dUeVq1AOdvuKDI0=; b=dygguFPQmU5DcG10jhdvfm36wV9CH29TxqzplBSrYfOGOHlT+LBbco6AlloMiNXaC8 3PcXgUFuq/rTsEAf/TMBr+G2bVXdHYxwwdJX2/fnof0GWpog+zfSuvGzGTj2jOMn+Npb bknrnU6ivkV7u8yS7FvSNt+H68JpsWCq3Yh3Ioy5J+I4ARtGAwOG+ioN9gPmQUgny6sp WthKFcxx9HcT9rv0iZvLoPz5RmV6snUZOW/UI5dr9bn+sYfrAnF1wufTXXSFZLORQbRU fm1notS1zmLXQ4/xD6BETsxBhggp+hsLtsACPrZV3KPNaAFTXZTUmMwhdUTqLpNh6fa0 VicA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=jg3rFsK4p3qU7hPJc30BoDu73O15dUeVq1AOdvuKDI0=; b=YjFc1kktLROFTbv4ExOtD3G96OYpWeV2q4aO7BUvlR8PvUNSOhlHcGK/iJhSK9MirI 6FyZTFNAT4GbL4ornVQFaVeWtxkV/gbxVp4kTYQk9TNkm56Q+yfldAcc7Q803mQUCaPx 7rU8mTyM/nyR4vyHeReObmeBC5XEDe4IabtWDGZNIvCU+GMtM7z1mQC2f6crGypZyHMo +/4DhQ5zi8DhxZBzO5W4u5nksasi/z9a51v8n8W8vVLr6mXyH3fd+8WFpmLIdoGMvmUW RmwYpWDFzU5Hgxs5kBIAWpObFVlWO6Tm0PZC+UuPUbNmsp8Ekmf5oJpL/aWrojmKuqQg Ru1w== X-Gm-Message-State: APjAAAXh6dRlanXmPXPQ7tG7c5BrXbZ3Z8nATKdWPf3zuPAnVkRueFwl IiOk75G5vCDSDb04b1iMRj2FTrDP X-Google-Smtp-Source: APXvYqzQfcCI4VnBFekWgwRJQfC7y9xc58cmNmWvAzT+VFnm9QkYL7SCut90yp3iQrFekxXFTU4PYA== X-Received: by 2002:a0c:8b4a:: with SMTP id 
d10mr4799586qvc.29.1554385252346; Thu, 04 Apr 2019 06:40:52 -0700 (PDT) From: Daniel Henrique Barboza To: libvir-list@redhat.com Date: Thu, 4 Apr 2019 10:40:38 -0300 Message-Id: <20190404134039.25849-2-danielhb413@gmail.com> In-Reply-To: <20190404134039.25849-1-danielhb413@gmail.com> References: <20190404134039.25849-1-danielhb413@gmail.com> MIME-Version: 1.0 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.49]); Thu, 04 Apr 2019 13:40:53 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.49]); Thu, 04 Apr 2019 13:40:53 +0000 (UTC) for IP:'209.85.160.196' DOMAIN:'mail-qt1-f196.google.com' HELO:'mail-qt1-f196.google.com' FROM:'danielhb413@gmail.com' RCPT:'' X-RedHat-Spam-Score: 0.142 (DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_PASS) 209.85.160.196 mail-qt1-f196.google.com 209.85.160.196 mail-qt1-f196.google.com X-Scanned-By: MIMEDefang 2.84 on 10.5.110.49 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.11 X-loop: libvir-list@redhat.com Cc: pkrempa@redhat.com, jtomko@redhat.com, eskultet@redhat.com, aik@ozlabs.ru, Daniel Henrique Barboza , pjaroszynski@nvidia.com, lagarcia@br.ibm.com Subject: [libvirt] [PATCH v5 1/2] qemu_domain: NVLink2 bridge detection function for PPC64 X-BeenThere: libvir-list@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk List-Id: Development discussions about the libvirt library & tools List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Transfer-Encoding: quoted-printable Sender: libvir-list-bounces@redhat.com Errors-To: libvir-list-bounces@redhat.com X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.32]); Thu, 04 Apr 2019 13:41:09 +0000 (UTC) X-ZohoMail-DKIM: fail (Header signature does not verify) Content-Type: text/plain; 
charset="utf-8"

The NVLink2 support in QEMU implements the detection of NVLink2
capable devices by verifying the attributes of the VFIO mem region
QEMU allocates for the NVIDIA GPUs. To properly allocate an
adequate amount of memLock, Libvirt needs this information before
a QEMU instance is even created, thus querying QEMU is not
possible and opening a VFIO window is too much.

An alternative is presented in this patch. It makes the following
assumptions:

- if we want GPU RAM to be available in the guest, an NVLink2 bridge
must be passed through;

- an unknown PCI device can be classified as an NVLink2 bridge
if its device tree node has 'ibm,gpu', 'ibm,nvlink',
'ibm,nvlink-speed' and 'memory-region'.

This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge
that checks the device tree node of a given PCI device to
determine whether it meets the criteria to be an NVLink2 bridge.
This new function will be used in a follow-up patch that, using the
first assumption, will set up the rlimits of the guest
accordingly.

Signed-off-by: Daniel Henrique Barboza
Reviewed-by: Erik Skultety
---
 src/qemu/qemu_domain.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index c6188b38ce..b0f301e634 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10405,6 +10405,36 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
 }
 
 
+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static ATTRIBUTE_UNUSED bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
+    size_t i;
+
+    for (i = 0; i < ARRAY_CARDINALITY(nvlink2Files); i++) {
+        VIR_AUTOFREE(char *) file = NULL;
+
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;
+    }
+
+    return true;
+}
+
+
 /**
  * getPPC64MemLockLimitBytes:
  * @def: domain definition
-- 
2.20.1

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

From nobody Tue May 7 23:45:09 2024 Delivered-To: importer@patchew.org Received-SPF: pass (zoho.com: domain of redhat.com designates 209.132.183.28 as permitted sender) client-ip=209.132.183.28; envelope-from=libvir-list-bounces@redhat.com; helo=mx1.redhat.com; Authentication-Results: mx.zohomail.com; dkim=fail; spf=pass (zoho.com: domain of redhat.com designates 209.132.183.28 as permitted sender) smtp.mailfrom=libvir-list-bounces@redhat.com; dmarc=fail(p=none dis=none) header.from=gmail.com Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by mx.zohomail.com with SMTPS id 155438527690113.795940250712988; Thu, 4 Apr 2019 06:41:16 -0700 (PDT) Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id B8C2C30A2207; Thu, 4 Apr 2019 13:41:15 +0000 (UTC) Received: from colo-mx.corp.redhat.com (colo-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.21]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 8A3785C8BD; Thu, 4 Apr 2019 13:41:15 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by colo-mx.corp.redhat.com (Postfix) with ESMTP id 409573FB11; Thu, 4 
Apr 2019 13:41:15 +0000 (UTC) Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id x34Df6Lo023289 for ; Thu, 4 Apr 2019 09:41:06 -0400 Received: by smtp.corp.redhat.com (Postfix) id 00B1F8680C; Thu, 4 Apr 2019 13:41:06 +0000 (UTC) Received: from mx1.redhat.com (ext-mx13.extmail.prod.ext.phx2.redhat.com [10.5.110.42]) by smtp.corp.redhat.com (Postfix) with ESMTPS id EF5A4174B6; Thu, 4 Apr 2019 13:41:02 +0000 (UTC) Received: from mail-qt1-f193.google.com (mail-qt1-f193.google.com [209.85.160.193]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 96E54307E062; Thu, 4 Apr 2019 13:40:55 +0000 (UTC) Received: by mail-qt1-f193.google.com with SMTP id w5so3177173qtb.11; Thu, 04 Apr 2019 06:40:55 -0700 (PDT) Received: from rekt.ibmmodules.com ([2804:431:f700:6702:8bc5:7364:e3c:ea55]) by smtp.gmail.com with ESMTPSA id p52sm13998355qta.37.2019.04.04.06.40.52 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 04 Apr 2019 06:40:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=pBDoStcLjXW2TLBJ3ehqV6tUIQpW1VQ0kI/CvtZwDY8=; b=OkWIVWpsEx5wIrd0CEU8G0uK3jOjrg3FA6jldJ2vY+DhMYe8S4S0KVHapiGMjrkq1y kSnK32hLU9oikeJAd/o/4aqrP87rz1233rWLdDXDuYK+RPhFGc7uvBRKkdjx4HlU4cQr XbJs7+6uoB0tNuc7yBz02jWSjmq31AtQ0MsPHBeFtp/xev/o5b6msI+kez6bKGfPGqJd 4UO+VEF1J3U/HE3LIvClNcdtJ6zB5hXAQaSpJoOQe76/sDPl02Wo5DZdAehpHMCSALHe pN8vtB5nQMB/jzhM2M5wZi1LIoVNVajq3RGRSYnIsEbXAs4cGHH7UmhAq/68ZM40GOie RgAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; 
bh=pBDoStcLjXW2TLBJ3ehqV6tUIQpW1VQ0kI/CvtZwDY8=; b=F+UWvNUoRl0HGPn5bo3gyp1MmBIt808biY5DXeealurc+KkMyutnPszVyiKKvl110k 2ZTl/FcOhZiUjwe6h9w8AesDM12t1vd/G3kSsXWxcg/WFZE6GPvHkdV78Bb7FXEh1DYZ 1kPgJYMRiiSYzLbzGqIxsFacHdvKABlnZIX34zo0lavMMjeKmO1eGM5ite3UTCPiWMQ7 TOAIc4fyJSCOXLkSEt9vKBU180RZpxp5q05KCBPklJ5yolNSzoIwCgS4eTOmFAgzHc9W kkPMKgoko/r4RMObdrBS6ZCaw9UpVFvMnZHXUBv2QLErumhJHZOFF8/6Lt83VQsGauWg ZnGw== X-Gm-Message-State: APjAAAXikEv3pYQe3hajJTb3QokUBgXzw2R6T73myC60vRWYSWuGOlHI HC2rkfH7+7bNDV6GZcs9P6kCTaJ6 X-Google-Smtp-Source: APXvYqwst06kQp14Ddo6cHvbNoI8mtxAjBGPpOktpyBmfFhg6CQMLTB8W3KIIyGoOC4ZREMqxL9XgA== X-Received: by 2002:a0c:d92d:: with SMTP id p42mr4869470qvj.79.1554385254575; Thu, 04 Apr 2019 06:40:54 -0700 (PDT) From: Daniel Henrique Barboza To: libvir-list@redhat.com Date: Thu, 4 Apr 2019 10:40:39 -0300 Message-Id: <20190404134039.25849-3-danielhb413@gmail.com> In-Reply-To: <20190404134039.25849-1-danielhb413@gmail.com> References: <20190404134039.25849-1-danielhb413@gmail.com> MIME-Version: 1.0 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.42]); Thu, 04 Apr 2019 13:40:55 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.42]); Thu, 04 Apr 2019 13:40:55 +0000 (UTC) for IP:'209.85.160.193' DOMAIN:'mail-qt1-f193.google.com' HELO:'mail-qt1-f193.google.com' FROM:'danielhb413@gmail.com' RCPT:'' X-RedHat-Spam-Score: 0.14 (DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, FREEMAIL_ENVFROM_END_DIGIT, FREEMAIL_FROM, RCVD_IN_DNSWL_NONE, SPF_PASS) 209.85.160.193 mail-qt1-f193.google.com 209.85.160.193 mail-qt1-f193.google.com X-Scanned-By: MIMEDefang 2.84 on 10.5.110.42 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-loop: libvir-list@redhat.com Cc: pkrempa@redhat.com, jtomko@redhat.com, eskultet@redhat.com, aik@ozlabs.ru, Daniel Henrique Barboza , pjaroszynski@nvidia.com, lagarcia@br.ibm.com Subject: [libvirt] [PATCH v5 2/2] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough X-BeenThere: 
libvir-list@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk List-Id: Development discussions about the libvirt library & tools List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Transfer-Encoding: quoted-printable Sender: libvir-list-bounces@redhat.com Errors-To: libvir-list-bounces@redhat.com X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.43]); Thu, 04 Apr 2019 13:41:16 +0000 (UTC) X-ZohoMail-DKIM: fail (Header signature does not verify) Content-Type: text/plain; charset="utf-8"

The NVIDIA V100 GPU has onboard RAM that is mapped into the
host memory and accessible as normal RAM via an NVLink2 bridge. When
passed through to a guest, QEMU puts the NVIDIA RAM window in a
non-contiguous area, above the PCI MMIO area that starts at 32TiB.
This means that the NVIDIA RAM window starts at 64TiB and goes all the
way to 128TiB.

Consequently, the guest might request a 64-bit window, for each PCI
Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
window isn't counted as regular RAM, thus this window is considered
only for the allocation of the Translation and Control Entry (TCE).
For more information about how NVLink2 support works in QEMU,
refer to the accepted implementation [1].

This memory layout differs from the existing VFIO case, requiring its
own formula. This patch changes the PPC64 code of
@qemuDomainGetMemLockLimitBytes to:

- detect if we have an NVLink2 bridge being passed through to the
guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
added in the previous patch. The existence of the NVLink2 bridge in
the guest means that we are dealing with the NVLink2 memory layout;

- if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
different way to account for the extra memory the TCE table can allocate.
The 64TiB..128TiB window is more than enough to fit all possible
GPUs, thus the memLimit is the same regardless of passing through 1
or multiple V100 GPUs.

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg03700.html

Signed-off-by: Daniel Henrique Barboza
Reviewed-by: Erik Skultety
---
 src/qemu/qemu_domain.c | 80 ++++++++++++++++++++++++++++++++----------
 1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index b0f301e634..13e54eafea 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10413,7 +10413,7 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
  * such as '0004:04:00.0', and tells if the device is a NVLink2
  * bridge.
  */
-static ATTRIBUTE_UNUSED bool
+static bool
 ppc64VFIODeviceIsNV2Bridge(const char *device)
 {
     const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
@@ -10451,7 +10451,9 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
     unsigned long long maxMemory = 0;
     unsigned long long passthroughLimit = 0;
     size_t i, nPCIHostBridges = 0;
+    virPCIDeviceAddressPtr pciAddr;
     bool usesVFIO = false;
+    bool nvlink2Capable = false;
 
     for (i = 0; i < def->ncontrollers; i++) {
         virDomainControllerDefPtr cont = def->controllers[i];
@@ -10469,7 +10471,17 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
             dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
             dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
             usesVFIO = true;
-            break;
+
+            pciAddr = &dev->source.subsys.u.pci.addr;
+            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
+                VIR_AUTOFREE(char *) pciAddrStr = NULL;
+
+                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
+                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
+                    nvlink2Capable = true;
+                    break;
+                }
+            }
         }
     }
 
@@ -10496,29 +10508,59 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
                 4096 * nPCIHostBridges +
                 8192;
 
-    /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
-     *                          memory                               (d)
-     *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+    /* NVLink2 support in QEMU is a special case of the passthrough
+     * mechanics explained in the usesVFIO case below. The GPU RAM
+     * is placed with a gap after maxMemory. The current QEMU
+     * implementation puts the NVIDIA RAM above the PCI MMIO, which
+     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
      *
-     * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2 GiB
-     * rather than 1 GiB
+     * This window ends at 64TiB, and this is where the GPUs are being
+     * placed. The next available window size is at 128TiB, and
+     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
      *
-     * (d) is the with-DDW (and memory pre-registration and related
-     * features) DMA window accounting - assuming that we only account RAM
-     * once, even if mapped to multiple PHBs
+     * The same assumption as the most common case applies here:
+     * the guest will request a 64-bit DMA window, per PHB, that is
+     * big enough to map all its RAM, which is now at 128TiB due
+     * to the GPUs.
      *
-     * (e) is the with-DDW userspace view and overhead for the 64-bit DMA
-     * window. This is based a bit on expected guest behaviour, but there
-     * really isn't a way to completely avoid that. We assume the guest
-     * requests a 64-bit DMA window (per PHB) just big enough to map all
-     * its RAM. 4 kiB page size gives the 1/512; it will be less with 64
-     * kiB pages, less still if the guest is mapped with hugepages (unlike
-     * the default 32-bit DMA window, DDW windows can use large IOMMU
-     * pages). 8 MiB is for second and further level overheads, like (b) */
-    if (usesVFIO)
+     * Note that the NVIDIA RAM window must be accounted for the TCE
+     * table size, but *not* for the main RAM (maxMemory). This gives
+     * us the following passthroughLimit for the NVLink2 case:
+     *
+     * passthroughLimit = maxMemory +
+     *                    128TiB/512KiB * #PHBs + 8 MiB */
+    if (nvlink2Capable) {
+        passthroughLimit = maxMemory +
+                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
+                           8192;
+    } else if (usesVFIO) {
+        /* For regular (non-NVLink2 present) VFIO passthrough, the value
+         * of passthroughLimit is:
+         *
+         * passthroughLimit := max( 2 GiB * #PHBs,                       (c)
+         *                          memory                               (d)
+         *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
+         *
+         * (c) is the pre-DDW VFIO DMA window accounting. We're allowing 2
+         * GiB rather than 1 GiB
+         *
+         * (d) is the with-DDW (and memory pre-registration and related
+         * features) DMA window accounting - assuming that we only account
+         * RAM once, even if mapped to multiple PHBs
+         *
+         * (e) is the with-DDW userspace view and overhead for the 64-bit
+         * DMA window. This is based a bit on expected guest behaviour, but
+         * there really isn't a way to completely avoid that. We assume the
+         * guest requests a 64-bit DMA window (per PHB) just big enough to
+         * map all its RAM. 4 kiB page size gives the 1/512; it will be
+         * less with 64 kiB pages, less still if the guest is mapped with
+         * hugepages (unlike the default 32-bit DMA window, DDW windows
+         * can use large IOMMU pages). 8 MiB is for second and further level
+         * overheads, like (b) */
         passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                                memory +
                                memory / 512 * nPCIHostBridges + 8192);
+    }
 
     memKB = baseLimit + passthroughLimit;
 
-- 
2.20.1

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list