From: yaozhenguo
To: yaozhenguo@jd.com, dwmw2@infradead.org, baolu.lu@linux.intel.com,
    joro@8bytes.org, will@kernel.org, robin.murphy@arm.com,
    alex.williamson@redhat.com
Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
    Zhenguo Yao, Wenchao Yao, ZiHan Zhou
Subject: [PATCH V1] vfio: add attach_group_by_node to control behavior of attaching group to domain
Date: Wed, 15 Nov 2023 10:02:09 +0800
Message-Id: <20231115020209.4665-1-yaozhenguo@jd.com>
From: Zhenguo Yao

Groups are attached to an existing iommu_domain when the domain ops and
enforce_cache_coherency values match, so by default all of the IOMMU
hardware shares one page table. This causes a performance problem in
some scenarios. For example, take the following host hardware topology:

node0 + PCIe RP0 ---+ GPU A100
      |             |---+ GPU A100
      |             |---+ NIC Mellanox CX6
      |             |---+ NIC Mellanox CX6
      + PCIe RP1 ---+ GPU A100
                    |---+ GPU A100
                    |---+ NIC Mellanox CX6
                    |---+ NIC Mellanox CX6

node1 + PCIe RP0 ---+ GPU A100
      |             |---+ GPU A100
      |             |---+ NIC Mellanox CX6
      |             |---+ NIC Mellanox CX6
      + PCIe RP1 ---+ GPU A100
                    |---+ GPU A100
                    |---+ NIC Mellanox CX6
                    |---+ NIC Mellanox CX6

We pass through all of the NICs and GPUs to a VM and emulate the host
hardware topology there. The Mellanox CX6 ATS feature and GPU direct
RDMA are enabled. We ran the NCCL allreduce test in the VM for three
cases:

Case1: allreduce test using 4 NICs and 4 GPUs in numa0.
Case2: allreduce test using 4 NICs and 4 GPUs in numa1.
Case3: allreduce test using 8 NICs and 8 GPUs.

The results are below:

|        | algbw (GB/s) |
| ------ | ------------ |
| case1  | 24           |
| case2  | 32           |
| case3  | 45           |

We found that the IOMMU page table is allocated in numa1 when the VM
boots up. So, on an IOTLB miss, the IOMMU hardware in numa0 has to walk
the remote page table in numa1, which hurts performance.

After applying this patch and setting attach_group_by_node to 1, only
groups in the same node are attached to one domain, so each IOMMU walks
its node-local page table. Performance improves:

|        | algbw (GB/s) |
| ------ | ------------ |
| case1  | 32           |
| case2  | 32           |
| case3  | 63           |

Signed-off-by: Zhenguo Yao
Co-developed-by: Wenchao Yao
Signed-off-by: Wenchao Yao
Co-developed-by: ZiHan Zhou
Signed-off-by: ZiHan Zhou
---
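For reviewers: the branch changed in vfio_iommu_type1_attach_group()
only runs when a second group joins a container that already holds a
compatible domain. Below is a minimal userspace sketch of the standard
VFIO type1 ioctl sequence that reaches it. The group numbers
/dev/vfio/42 and /dev/vfio/43 are made up for illustration, error
handling is omitted, and the module is assumed to be loaded with the
new parameter enabled, e.g. "modprobe vfio_iommu_type1
attach_group_by_node=1" (the default of 0 keeps today's behavior):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	/* Hypothetical groups: one device per NUMA node. */
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group_a = open("/dev/vfio/42", O_RDWR);	/* node0 device */
	int group_b = open("/dev/vfio/43", O_RDWR);	/* node1 device */

	/* The first group creates the container's first iommu_domain. */
	ioctl(group_a, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

	/*
	 * The second group makes vfio_iommu_type1_attach_group() walk
	 * domain_list looking for a reusable domain.  With
	 * attach_group_by_node=1, the new nid comparison keeps the node1
	 * device out of the node0 domain, so each IOMMU ends up walking a
	 * node-local page table on IOTLB miss.
	 */
	ioctl(group_b, VFIO_GROUP_SET_CONTAINER, &container);
	return 0;
}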
 drivers/iommu/intel/iommu.c     |  8 +++++++-
 drivers/vfio/vfio_iommu_type1.c | 33 +++++++++++++++++++++------------
 include/linux/iommu.h           |  1 +
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 3531b95..2c6d8f0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -569,8 +569,10 @@ void domain_update_iommu_cap(struct dmar_domain *domain)
	 * If RHSA is missing, we should default to the device numa domain
	 * as fall back.
	 */
-	if (domain->nid == NUMA_NO_NODE)
+	if (domain->nid == NUMA_NO_NODE) {
 		domain->nid = domain_update_device_node(domain);
+		domain->domain.nid = domain->nid;
+	}
 
 	/*
	 * First-level translation restricts the input-address to a
@@ -1767,6 +1769,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
 		return NULL;
 
 	domain->nid = NUMA_NO_NODE;
+	domain->domain.nid = NUMA_NO_NODE;
 	if (first_level_by_default(type))
 		domain->use_first_level = true;
 	domain->has_iotlb_device = false;
@@ -1808,6 +1811,8 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 	info->refcnt	= 1;
 	info->did	= num;
 	info->iommu	= iommu;
+	domain->nid = iommu->node;
+	domain->domain.nid = iommu->node;
 	curr = xa_cmpxchg(&domain->iommu_array, iommu->seq_id,
 			  NULL, info, GFP_ATOMIC);
 	if (curr) {
@@ -1837,6 +1842,7 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 		clear_bit(info->did, iommu->domain_ids);
 		xa_erase(&domain->iommu_array, iommu->seq_id);
 		domain->nid = NUMA_NO_NODE;
+		domain->domain.nid = NUMA_NO_NODE;
 		domain_update_iommu_cap(domain);
 		kfree(info);
 	}
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index eacd6ec..6a5641e 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -59,6 +59,11 @@
 module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
 MODULE_PARM_DESC(dma_entry_limit,
		 "Maximum number of user DMA mappings per container (65535).");
+static uint attach_group_by_node;
+module_param_named(attach_group_by_node,
+		   attach_group_by_node, uint, 0644);
+MODULE_PARM_DESC(attach_group_by_node,
+		 "Attach group to domain when it's in same node");
 
 struct vfio_iommu {
 	struct list_head	domain_list;
@@ -2287,19 +2292,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		if (d->domain->ops == domain->domain->ops &&
 		    d->enforce_cache_coherency ==
			    domain->enforce_cache_coherency) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			if (!iommu_attach_group(d->domain,
-						group->iommu_group)) {
-				list_add(&group->next, &d->group_list);
-				iommu_domain_free(domain->domain);
-				kfree(domain);
-				goto done;
-			}
+			if ((attach_group_by_node == 1 &&
+			     d->domain->nid == domain->domain->nid) ||
+			    attach_group_by_node == 0) {
+				iommu_detach_group(domain->domain, group->iommu_group);
+				if (!iommu_attach_group(d->domain,
+							group->iommu_group)) {
+					list_add(&group->next, &d->group_list);
+					iommu_domain_free(domain->domain);
+					kfree(domain);
+					goto done;
+				}
 
-			ret = iommu_attach_group(domain->domain,
-						 group->iommu_group);
-			if (ret)
-				goto out_domain;
+				ret = iommu_attach_group(domain->domain,
+							 group->iommu_group);
+				if (ret)
+					goto out_domain;
+			}
 		}
 	}
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ec289c1..c1330ed 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -123,6 +123,7 @@ struct iommu_domain {
 			int users;
 		};
 	};
+	int nid;
 };
 
 static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
-- 
1.8.3.1