From nobody Mon Jun 8 18:55:44 2026
From: Matt Evans
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro, Christian König, Bjorn Helgaas, Logan Gunthorpe
CC: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy
Subject: [PATCH v2 1/9] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE
Date: Wed, 27 May 2026 03:23:04 -0700
Message-ID: <20260527102319.100128-2-mattev@meta.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260527102319.100128-1-mattev@meta.com>
References: <20260527102319.100128-1-mattev@meta.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

The P2PDMA code currently provides two features under the same
CONFIG_PCI_P2PDMA option:

 1. Locate providers via pcim_p2pdma_provider()
 2. Manage actual P2P DMA

Other code (such as vfio-pci) depends on 1, without having a hard
dependency on 2.

A future commit expands the use of DMABUF in vfio-pci for non-P2P
scenarios, relying on pcim_p2pdma_provider() always being present. If
that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only
available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even
when P2P is not needed.

To resolve this, introduce CONFIG_PCI_P2PDMA_CORE which contains the
basic provider functionality to make it available even if the
CONFIG_PCI_P2PDMA feature is disabled or unavailable due to
!CONFIG_ZONE_DEVICE.

Users such as vfio-pci can enable their own P2P features based off the
original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
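
As a rough illustration of the split described above, here is a userspace
mock-up (not kernel code) of the header pattern: a real lookup when the
core option is set, and an inline stub returning NULL otherwise. The
struct definitions, MOCK_NUM_BARS, consumer_setup_bar() and
consumer_self_check() are hypothetical stand-ins invented for this
sketch; only the name and signature of pcim_p2pdma_provider() come from
the patch. Mapping a NULL provider to -EINVAL is an assumption chosen to
match how the later mmap helper in this series treats it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct pci_dev { int unused; };
struct p2pdma_provider { int bar; };

#define MOCK_NUM_BARS 6	/* same value as the kernel's PCI_STD_NUM_BARS */

/* Remove this define to model a build without PCI_P2PDMA_CORE. */
#define CONFIG_PCI_P2PDMA_CORE 1

#ifdef CONFIG_PCI_P2PDMA_CORE
/* Mock of the core lookup: one provider slot per standard BAR. */
static struct p2pdma_provider mock_providers[MOCK_NUM_BARS];

struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
{
	if (!pdev || bar < 0 || bar >= MOCK_NUM_BARS)
		return NULL;
	mock_providers[bar].bar = bar;
	return &mock_providers[bar];
}
#else
/* Same shape as the stub the patch places in pci-p2pdma.h. */
static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
							    int bar)
{
	(void)pdev;
	(void)bar;
	return NULL;
}
#endif

/* Hypothetical consumer: needs a provider, never touches full P2PDMA. */
int consumer_setup_bar(struct pci_dev *pdev, int bar)
{
	struct p2pdma_provider *provider = pcim_p2pdma_provider(pdev, bar);

	return provider ? 0 : -EINVAL;
}

/* Self-check of the mock's behaviour with the core option defined. */
void consumer_self_check(void)
{
	struct pci_dev dev = { 0 };

	assert(consumer_setup_bar(&dev, 0) == 0);
	assert(consumer_setup_bar(&dev, MOCK_NUM_BARS) == -EINVAL);
}
```

The point of the sketch is that the consumer compiles and behaves
sensibly in both configurations without any #ifdef of its own, which is
what lets a user such as vfio-pci depend only on the core option.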
Signed-off-by: Matt Evans Reviewed-by: Logan Gunthorpe --- drivers/pci/Kconfig | 10 +++++----- drivers/pci/Makefile | 2 +- drivers/pci/p2pdma.c | 16 ++++++++++++++++ include/linux/pci-p2pdma.h | 24 ++++++++++++++---------- include/linux/pci.h | 2 +- 5 files changed, 37 insertions(+), 17 deletions(-) diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 33c88432b728..59d70bc84cc9 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH config PCI_P2PDMA bool "PCI peer-to-peer transfer support" depends on ZONE_DEVICE - # - # The need for the scatterlist DMA bus address flag means PCI P2PDMA - # requires 64bit - # - depends on 64BIT + select PCI_P2PDMA_CORE select GENERIC_ALLOCATOR select NEED_SG_DMA_FLAGS help @@ -226,6 +222,10 @@ config PCI_P2PDMA =20 If unsure, say N. =20 +config PCI_P2PDMA_CORE + default n + bool + config PCI_LABEL def_bool y if (DMI || ACPI) select NLS diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 41ebc3b9a518..419b646a301d 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -30,7 +30,7 @@ obj-$(CONFIG_PCI_SYSCALL) +=3D syscall.o obj-$(CONFIG_PCI_STUB) +=3D pci-stub.o obj-$(CONFIG_PCI_PF_STUB) +=3D pci-pf-stub.o obj-$(CONFIG_PCI_ECAM) +=3D ecam.o -obj-$(CONFIG_PCI_P2PDMA) +=3D p2pdma.o +obj-$(CONFIG_PCI_P2PDMA_CORE) +=3D p2pdma.o obj-$(CONFIG_XEN_PCIDEV_FRONTEND) +=3D xen-pcifront.o obj-$(CONFIG_VGA_ARB) +=3D vgaarb.o obj-$(CONFIG_PCI_DOE) +=3D doe.o diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 7c898542af8d..619d46c652b8 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -28,6 +28,14 @@ struct pci_p2pdma { struct p2pdma_provider mem[PCI_STD_NUM_BARS]; }; =20 +/* + * CONFIG_PCI_P2PDMA_CORE provides just a bare-bones init and + * pcim_p2pdma_provider() interface (used by things like VFIO even if + * full P2PDMA isn't present). The full P2PDMA feature is under the + * CONFIG_PCI_P2PDMA option. 
+ */ +#ifdef CONFIG_PCI_P2PDMA + struct pci_p2pdma_pagemap { struct dev_pagemap pgmap; struct p2pdma_provider *mem; @@ -226,6 +234,8 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = =3D { .folio_free =3D p2pdma_folio_free, }; =20 +#endif /* CONFIG_PCI_P2PDMA */ + static void pci_p2pdma_release(void *data) { struct pci_dev *pdev =3D data; @@ -241,11 +251,13 @@ static void pci_p2pdma_release(void *data) synchronize_rcu(); xa_destroy(&p2pdma->map_types); =20 +#ifdef CONFIG_PCI_P2PDMA if (!p2pdma->pool) return; =20 gen_pool_destroy(p2pdma->pool); sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); +#endif } =20 /** @@ -330,6 +342,8 @@ struct p2pdma_provider *pcim_p2pdma_provider(struct pci= _dev *pdev, int bar) } EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); =20 +#ifdef CONFIG_PCI_P2PDMA + static int pci_p2pdma_setup_pool(struct pci_dev *pdev) { struct pci_p2pdma *p2pdma; @@ -1207,3 +1221,5 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci= _dev *p2p_dev, return sprintf(page, "%s\n", pci_name(p2p_dev)); } EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show); + +#endif diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 873de20a2247..4c42a7b2ee85 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -67,9 +67,22 @@ enum pci_p2pdma_map_type { PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, }; =20 -#ifdef CONFIG_PCI_P2PDMA +#ifdef CONFIG_PCI_P2PDMA_CORE int pcim_p2pdma_init(struct pci_dev *pdev); struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar= ); +#else +static inline int pcim_p2pdma_init(struct pci_dev *pdev) +{ + return -EOPNOTSUPP; +} +static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev = *pdev, + int bar) +{ + return NULL; +} +#endif + +#ifdef CONFIG_PCI_P2PDMA int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **cli= ents, @@ -89,15 +102,6 @@ ssize_t pci_p2pdma_enable_show(char *page, 
struct pci_d= ev *p2p_dev, enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provi= der, struct device *dev); #else /* CONFIG_PCI_P2PDMA */ -static inline int pcim_p2pdma_init(struct pci_dev *pdev) -{ - return -EOPNOTSUPP; -} -static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev = *pdev, - int bar) -{ - return NULL; -} static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { diff --git a/include/linux/pci.h b/include/linux/pci.h index 2c4454583c11..531aec355686 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -557,7 +557,7 @@ struct pci_dev { u16 pasid_cap; /* PASID Capability offset */ u16 pasid_features; #endif -#ifdef CONFIG_PCI_P2PDMA +#ifdef CONFIG_PCI_P2PDMA_CORE struct pci_p2pdma __rcu *p2pdma; #endif #ifdef CONFIG_PCI_DOE --=20 2.47.3

From nobody Mon Jun 8 18:55:44 2026
From: Matt Evans
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro, Christian König, Bjorn Helgaas, Logan Gunthorpe
CC: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy
Subject: [PATCH v2 2/9] vfio/pci: Add a helper to look up PFNs for DMABUFs
Date: Wed, 27 May 2026 03:23:05 -0700
Message-ID: <20260527102319.100128-3-mattev@meta.com>
In-Reply-To: <20260527102319.100128-1-mattev@meta.com>
References: <20260527102319.100128-1-mattev@meta.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to find a PFN.
This supports multi-range DMABUFs, which typically would be used to represent scattered spans but might even represent overlapping or aliasing spans of PFNs. Because this is intended to be used in vfio_pci_core.c, we also need to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header. Signed-off-by: Matt Evans --- drivers/vfio/pci/vfio_pci_dmabuf.c | 142 ++++++++++++++++++++++++++--- drivers/vfio/pci/vfio_pci_priv.h | 20 ++++ 2 files changed, 149 insertions(+), 13 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci= _dmabuf.c index c16f460c01d6..0d132c4ca95f 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -9,19 +9,6 @@ =20 MODULE_IMPORT_NS("DMA_BUF"); =20 -struct vfio_pci_dma_buf { - struct dma_buf *dmabuf; - struct vfio_pci_core_device *vdev; - struct list_head dmabufs_elm; - size_t size; - struct phys_vec *phys_vec; - struct p2pdma_provider *provider; - u32 nr_ranges; - struct kref kref; - struct completion comp; - u8 revoked : 1; -}; - static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -106,6 +93,135 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = =3D { .release =3D vfio_pci_dma_buf_release, }; =20 +int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf, + struct vm_area_struct *vma, + unsigned long address, + unsigned int order, + unsigned long *out_pfn) +{ + /* + * Given a VMA (start, end, pgoffs) and a fault address, + * search the corresponding DMABUF's phys_vec[] to find the + * range representing the address's offset into the VMA, and + * its PFN. + * + * The phys_vec[] ranges represent contiguous spans of VAs + * upwards from the buffer offset 0; the actual PFNs might be + * in any order, overlap/alias, etc. Calculate an offset of + * the desired page given VMA start/pgoff and address, then + * search upwards from 0 to find which span contains it. 
+ * + * On success, a valid PFN for a page sized by 'order' is + * returned into out_pfn. + * + * Failure occurs if: + * - The page would cross the edge of the VMA + * - The page isn't entirely contained within a range + * - We find a range, but the final PFN isn't aligned to the + * requested order. + * + * (Upon failure, the caller is expected to try again with a + * smaller order; the tests above will always succeed for + * order=3D0 as the limit case.) + * + * It's suboptimal if DMABUFs are created with neighbouring + * ranges that are physically contiguous, since hugepages + * can't straddle range boundaries. (The construction of the + * ranges vector should merge such ranges.) + * + * Finally, vma_pgoff_adjust is used for a DMABUF representing + * a VFIO BAR mmap, which is created from the start of the + * offset region. It should be zero, or equal vm_pgoff. + */ + + const unsigned long pagesize =3D PAGE_SIZE << order; + unsigned long vma_off =3D ((vma->vm_pgoff - vpdmabuf->vma_pgoff_adjust) << + PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK; + unsigned long rounded_page_addr =3D ALIGN_DOWN(address, pagesize); + unsigned long rounded_page_end =3D rounded_page_addr + pagesize; + unsigned long page_buf_offset; + unsigned long range_buf_offset =3D 0; + unsigned int i; + + if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) { + if (order > 0) + return -EAGAIN; + + /* A fault address outside of the VMA is absurd. 
*/ + WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n", + address, vma->vm_start, vma->vm_end); + return -EFAULT; + } + + if (vpdmabuf->vma_pgoff_adjust !=3D 0 && + vpdmabuf->vma_pgoff_adjust !=3D (vma->vm_pgoff & + (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT))) { + WARN(1, "Unexpected vma_pgoff_adjust 0x%lx (vm_pgoff 0x%lx)\n", + vpdmabuf->vma_pgoff_adjust, vma->vm_pgoff); + return -EFAULT; + } + + if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start, + vma_off, &page_buf_offset))) + return -EFAULT; + + for (i =3D 0; i < vpdmabuf->nr_ranges; i++) { + unsigned long page_buf_offset_end; + size_t range_len =3D vpdmabuf->phys_vec[i].len; + phys_addr_t range_start =3D vpdmabuf->phys_vec[i].paddr; + + if (unlikely(check_add_overflow(page_buf_offset, pagesize, + &page_buf_offset_end))) + return -EFAULT; + /* + * If the current range starts after the page's span, + * this and any future range won't match. Bail early. + */ + if (page_buf_offset_end <=3D range_buf_offset) + break; + + if (page_buf_offset >=3D range_buf_offset && + page_buf_offset_end <=3D range_buf_offset + range_len) { + /* + * The faulting page is wholly contained + * within the span represented by the range. + * Validate PFN alignment for the order: + */ + unsigned long pfn =3D (range_start + page_buf_offset - + range_buf_offset) / PAGE_SIZE; + + if (IS_ALIGNED(pfn, 1 << order)) { + *out_pfn =3D pfn; + return 0; + } + /* Retry with smaller order */ + return -EAGAIN; + } + range_buf_offset +=3D range_len; + } + + /* + * A hugepage straddling a range boundary will fail to match a + * range, but the address will (eventually) match when retried + * with a smaller page. + */ + if (order > 0) + return -EAGAIN; + + /* + * If we get here, the address fell outside of the span + * represented by the (concatenated) ranges. Setup of a + * mapping must ensure that the VMA is <=3D the total size of + * the ranges, so this should never happen. But, if it does, + * force SIGBUS for the access and warn. 
+ */ + WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x= %lx, %u ranges, size 0x%zx\n", + address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff, + vpdmabuf->nr_ranges, vpdmabuf->size); + + return -EFAULT; +} + /* * This is a temporary "private interconnect" between VFIO DMABUF and iomm= ufd. * It allows the two co-operating drivers to exchange the physical address= of diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_p= riv.h index fca9d0dfac90..c8f6f959056a 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -23,6 +23,20 @@ struct vfio_pci_ioeventfd { bool test_mem; }; =20 +struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + size_t size; + struct phys_vec *phys_vec; + struct p2pdma_provider *provider; + u32 nr_ranges; + struct kref kref; + struct completion comp; + unsigned long vma_pgoff_adjust; + u8 revoked : 1; +}; + bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev); =20 @@ -114,6 +128,12 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pde= v) return (pdev->class >> 8) =3D=3D PCI_CLASS_DISPLAY_VGA; } =20 +int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf, + struct vm_area_struct *vma, + unsigned long address, + unsigned int order, + unsigned long *out_pfn); + #ifdef CONFIG_VFIO_PCI_DMABUF int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 f= lags, struct vfio_device_feature_dma_buf __user *arg, --=20 2.47.3

From nobody Mon Jun 8 18:55:44 2026
From: Matt Evans
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro, Christian König, Bjorn Helgaas, Logan Gunthorpe
CC: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy
Subject: [PATCH v2 3/9] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
Date: Wed, 27 May 2026 03:23:06 -0700
Message-ID: <20260527102319.100128-4-mattev@meta.com>
In-Reply-To: <20260527102319.100128-1-mattev@meta.com>
References: <20260527102319.100128-1-mattev@meta.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range
DMABUF for the purpose of mapping a PCI BAR. This is used in a future
commit by VFIO's ordinary mmap() path.

This function transfers ownership of the VFIO device fd to the DMABUF,
which fput()s it when the DMABUF is released.

Refactor the existing vfio_pci_core_feature_dma_buf() to split out
export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and
this new VFIO_BAR mmap().

Signed-off-by: Matt Evans
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 140 ++++++++++++++++++++++-------
 drivers/vfio/pci/vfio_pci_priv.h   |   5 ++
 2 files changed, 115 insertions(+), 30 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci= _dmabuf.c index 0d132c4ca95f..782408c08a5e 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -82,6 +82,8 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmab= uf) up_write(&priv->vdev->memory_lock); vfio_device_put_registration(&priv->vdev->vdev); } + if (priv->vfile) + fput(priv->vfile); kfree(priv->phys_vec); kfree(priv); } @@ -222,6 +224,45 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf = *vpdmabuf, return -EFAULT; } =20 +/* + * Create a DMABUF corresponding to priv, add it to vdev->dmabufs list + * for tracking (meaning cleanup or revocation will zap it), and take + * a vfio_device registration. 
+ */ +static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev, + struct vfio_pci_dma_buf *priv, uint32_t flags) +{ + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + + if (!vfio_device_try_get_registration(&vdev->vdev)) + return -ENODEV; + + exp_info.ops =3D &vfio_pci_dmabuf_ops; + exp_info.size =3D priv->size; + exp_info.flags =3D flags; + exp_info.priv =3D priv; + + priv->dmabuf =3D dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + vfio_device_put_registration(&vdev->vdev); + return PTR_ERR(priv->dmabuf); + } + + kref_init(&priv->kref); + init_completion(&priv->comp); + + /* dma_buf_put() now frees priv */ + INIT_LIST_HEAD(&priv->dmabufs_elm); + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked =3D !__vfio_pci_memory_enabled(vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + up_write(&vdev->memory_lock); + + return 0; +} + /* * This is a temporary "private interconnect" between VFIO DMABUF and iomm= ufd. 
* It allows the two co-operating drivers to exchange the physical address= of @@ -340,7 +381,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_= device *vdev, u32 flags, { struct vfio_device_feature_dma_buf get_dma_buf =3D {}; struct vfio_region_dma_range *dma_ranges; - DEFINE_DMA_BUF_EXPORT_INFO(exp_info); struct vfio_pci_dma_buf *priv; size_t length; int ret; @@ -400,34 +440,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core= _device *vdev, u32 flags, kfree(dma_ranges); dma_ranges =3D NULL; =20 - if (!vfio_device_try_get_registration(&vdev->vdev)) { - ret =3D -ENODEV; + ret =3D vfio_pci_dmabuf_export(vdev, priv, get_dma_buf.open_flags); + if (ret) goto err_free_phys; - } - - exp_info.ops =3D &vfio_pci_dmabuf_ops; - exp_info.size =3D priv->size; - exp_info.flags =3D get_dma_buf.open_flags; - exp_info.priv =3D priv; - - priv->dmabuf =3D dma_buf_export(&exp_info); - if (IS_ERR(priv->dmabuf)) { - ret =3D PTR_ERR(priv->dmabuf); - goto err_dev_put; - } - - kref_init(&priv->kref); - init_completion(&priv->comp); - - /* dma_buf_put() now frees priv */ - INIT_LIST_HEAD(&priv->dmabufs_elm); - down_write(&vdev->memory_lock); - dma_resv_lock(priv->dmabuf->resv, NULL); - priv->revoked =3D !__vfio_pci_memory_enabled(vdev); - list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); - dma_resv_unlock(priv->dmabuf->resv); - up_write(&vdev->memory_lock); - /* * dma_buf_fd() consumes the reference, when the file closes the dmabuf * will be released. 
@@ -438,8 +453,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_= device *vdev, u32 flags, =20 return ret; =20 -err_dev_put: - vfio_device_put_registration(&vdev->vdev); err_free_phys: kfree(priv->phys_vec); err_free_priv: @@ -449,6 +462,73 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core= _device *vdev, u32 flags, return ret; } =20 +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, + struct vm_area_struct *vma, + u64 phys_start, u64 req_len, + unsigned int res_index) +{ + struct vfio_pci_dma_buf *priv; + const unsigned int nr_ranges =3D 1; + unsigned long vma_pgoff =3D vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE= _SHIFT); + int ret; + + priv =3D kzalloc_obj(*priv); + if (!priv) + return -ENOMEM; + + priv->phys_vec =3D kzalloc_obj(*priv->phys_vec); + if (!priv->phys_vec) { + ret =3D -ENOMEM; + goto err_free_priv; + } + + /* + * The DMABUF begins from the mmap()'s BAR offset, i.e. the + * start of the VMA corresponds to byte 0 of the DMABUF and + * byte (vma_pgoff << PAGE_SHIFT) of the BAR. + * + * vfio_pci_dma_buf_find_pfn() reverses this offset using + * vma_pgoff_adjust, so that ultimately a fault's offset from + * the start of the _VMA_ has a consistent usage whether the + * VMA originates from an mmap() of the VFIO device here or a + * direct DMABUF mmap(). + */ + priv->vdev =3D vdev; + priv->size =3D req_len; + priv->nr_ranges =3D nr_ranges; + priv->vma_pgoff_adjust =3D vma_pgoff; + priv->provider =3D pcim_p2pdma_provider(vdev->pdev, res_index); + if (!priv->provider) { + ret =3D -EINVAL; + goto err_free_phys; + } + + priv->phys_vec[0].paddr =3D phys_start + ((u64)vma_pgoff << PAGE_SHIFT); + priv->phys_vec[0].len =3D priv->size; + + ret =3D vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR); + if (ret) + goto err_free_phys; + + /* + * The VMA gets the DMABUF file so that other users can locate + * the DMABUF via a VA. 
Ownership of the original VFIO device + * file being mmap()ed transfers to priv, and is put when the + * DMABUF is released. + */ + priv->vfile =3D vma->vm_file; + vma->vm_file =3D priv->dmabuf->file; + vma->vm_private_data =3D priv; + + return 0; + +err_free_phys: + kfree(priv->phys_vec); +err_free_priv: + kfree(priv); + return ret; +} + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) { struct vfio_pci_dma_buf *priv; diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_p= riv.h index c8f6f959056a..06dc0fd3e230 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -30,6 +30,7 @@ struct vfio_pci_dma_buf { size_t size; struct phys_vec *phys_vec; struct p2pdma_provider *provider; + struct file *vfile; u32 nr_ranges; struct kref kref; struct completion comp; @@ -133,6 +134,10 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf = *vpdmabuf, unsigned long address, unsigned int order, unsigned long *out_pfn); +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, + struct vm_area_struct *vma, + u64 phys_start, u64 req_len, + unsigned int res_index); =20 #ifdef CONFIG_VFIO_PCI_DMABUF int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 f= lags, --=20 2.47.3 From nobody Mon Jun 8 18:55:44 2026 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2851B3F076F; Wed, 27 May 2026 10:23:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=67.231.153.30 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779877444; cv=none; b=SeOs6eZkT6ie8QKoQq+Wj8bZAtPhGHMPEBtmV2Y23hWhqJADs9tXfXVK9mjichTQx+4V0a0GXSbb5AmRg4Omku85strMtSUFrvWBiCoHoniIUmWFK9OqcilBDPurk6dFy5CGTeP1K7xsGVpDf8LvHgcGTWEQRN0NKvq/hk/54/M= 
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779877444; c=relaxed/simple; bh=8AjP2E7wycMVT+GE5fljOgHl/nBaQS8K3L+HuYXbb68=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=A+O/xHSs/ph8bRfzCKL3vvncuE85bITyeUavX3/63snQrnMDRHBz5APPV8gTZ/S/NR4tE+gyW1Z/7Em+WUTkm4zQfTUoZyngu5mBY2Z2mBcN9goZdjgCsR3paEwnkFeyI5muBQutir8QgaqYuQV7t68XbqeSfwOCkq4Vd9P6w+Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com; spf=pass smtp.mailfrom=meta.com; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b=PIdg/5bL; arc=none smtp.client-ip=67.231.153.30 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=meta.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b="PIdg/5bL" Received: from pps.filterd (m0109331.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.18.1.11/8.18.1.11) with ESMTP id 64R6bWC62158015; Wed, 27 May 2026 03:23:40 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=meta.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=s2048-2025-q2; bh=QNLFtcg7hT09UnqX95tE0M3PN6DHPnumMgCTq5SDFgM=; b=PIdg/5bLUZOa zVNCL4GMx/YzWFXEpSIKzoDgeL4epPfFIUUD/f7tllNxMvXyA+MMdpFtDYdPm2+g Qy6CfveZCEu1G8JMV92NrbzDqPiYDRv+HfeGYOhaFW4AMNwKEUR2Mv0a25+jA9j9 RoM5+g8m2arrSonXAHAEAv8cG5sqiKBfg8mh+cKCoO+wgWOb57uuSqtW77Y3AkJq a+Vkt2B2xXFA0Q41eh3ydlzVKmgBKHbn5CtiBvTRr5ijbAwd72is5P4u2jW/FEvO 0Z5ZS3+dpIRdf0sA0OseGrDUvh3FbCxpIct6Qb9l6w2QJYnjbsNXj+SScqsi2lPn os7T6+hgVw== Received: from mail.thefacebook.com ([163.114.134.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 4edugs0x7n-10 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT); Wed, 27 May 2026 03:23:39 
-0700 (PDT) Received: from localhost (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c08b:78::2ac9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.2.2562.41; Wed, 27 May 2026 10:23:38 +0000 From: Matt Evans To: Alex Williamson , Leon Romanovsky , Jason Gunthorpe , Alex Mastro , =?UTF-8?q?Christian=20K=C3=B6nig?= , Bjorn Helgaas , Logan Gunthorpe CC: Mahmoud Adam , David Matlack , =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= , Sumit Semwal , Kevin Tian , Ankit Agrawal , Pranjal Shrivastava , Alistair Popple , Vivek Kasireddy , , , , , , Subject: [PATCH v2 4/9] vfio/pci: Convert BAR mmap() to use a DMABUF Date: Wed, 27 May 2026 03:23:07 -0700 Message-ID: <20260527102319.100128-5-mattev@meta.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260527102319.100128-1-mattev@meta.com> References: <20260527102319.100128-1-mattev@meta.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Authority-Analysis: v=2.4 cv=D4537PRj c=1 sm=1 tr=0 ts=6a16c62b cx=c_pps a=CB4LiSf2rd0gKozIdrpkBw==:117 a=CB4LiSf2rd0gKozIdrpkBw==:17 a=NGcC8JguVDcA:10 a=VkNPw1HP01LnGYTKEx00:22 a=7x6HtfJdh03M6CCDgxCd:22 a=wpfVPzegXHpEFt3DAXn9:22 a=VabnemYjAAAA:8 a=ZpS8L8WqeLmTq60q1noA:9 a=gKebqoRLp9LExxC7YDUY:22 X-Proofpoint-ORIG-GUID: XMnj2BK63uwisMvdcqlVAoQ2WyLOSQa1 X-Proofpoint-GUID: XMnj2BK63uwisMvdcqlVAoQ2WyLOSQa1 X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTI3MDEwMCBTYWx0ZWRfX/tYAWhsicTlN x4IDezpQl+871Fa5zsd/1hOGfdIih4CWIV5nz7eqw8/17MtgxO+uzUy8c9VcCX5/Oo9hpwVyyY0 J7ET1asZ5pxSDf02KjM20MyFUWFiufRyu7gHcb51zi0b4wMBqQFnwGDUkjHfeDF5SV92o0uovmJ h+BxanAIWv9jUpuUg67ikAGs5kTSU2iE/Afd8YlDoC2Xn1y8bcbCT3JSYhSUBxngBEofv6ICU8k XuyxNrdSgcoJ+q7DS5pjiqBMy3SKHXuAik4v8RTUwo2OwPflW4xUrICXtnoF7EBWAcflZxeVVOF O7wdZo0ZlgcX78kwXm+kOpT4Of8FbpptV6X91qcJVrYupmdtjLL3U1Opeej74oz81Z7aRg9YZI0 
ZmzgcjexydoCKStJ4GKB15gNcid8STKEvkeD156yGw93Wx9l2R5QBuhKlKLQwTHhpm+EchRJ06j NrH28nMky4KNqLb0RVQ== X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1143,Hydra:6.1.125,FMLib:17.12.100.49 definitions=2026-05-27_01,2026-05-26_03,2025-10-01_01 Content-Type: text/plain; charset="utf-8" Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P. First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address. Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths). CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA. 
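The fault path described above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the `sketch_*` names, the struct layout and the 4 KiB page size are hypothetical, and the alignment and bounds checks are a guess at what a PFN lookup for a single-range buffer needs, not the actual body of vfio_pci_dma_buf_find_pfn(). It shows the three outcomes the handler distinguishes: SIGBUS for a revoked buffer, fallback to a smaller order when the requested order cannot be satisfied, and a PFN otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Userspace stand-ins for VM_FAULT_SIGBUS, VM_FAULT_FALLBACK and a successful insert. */
enum sketch_result {
	SKETCH_SIGBUS,
	SKETCH_FALLBACK,
	SKETCH_MAPPED,
};

/* Hypothetical model of a single-range BAR DMABUF; not the kernel's struct. */
struct sketch_buf {
	uint64_t paddr;	/* physical address backing byte 0 of the buffer */
	uint64_t size;	/* length in bytes */
	bool revoked;	/* set while device memory is disabled */
};

static enum sketch_result sketch_handle_fault(const struct sketch_buf *buf,
					      uint64_t vm_start, uint64_t addr,
					      unsigned int order, uint64_t *out_pfn)
{
	uint64_t span = (uint64_t)1 << (SKETCH_PAGE_SHIFT + order);
	uint64_t va = addr & ~(span - 1);	/* round the fault address down to the order */
	uint64_t off, pa;

	/* A revoked buffer never maps anything. */
	if (buf->revoked)
		return SKETCH_SIGBUS;

	/* The rounded-down block would begin before the VMA: retry at a smaller order. */
	if (va < vm_start)
		return SKETCH_FALLBACK;

	/* The start of the VMA corresponds to byte 0 of the buffer. */
	off = va - vm_start;
	if (off >= buf->size)
		return SKETCH_SIGBUS;
	if (buf->size - off < span)
		return SKETCH_FALLBACK;

	/* The physical address must be aligned to the requested order too. */
	pa = buf->paddr + off;
	if (pa & (span - 1))
		return SKETCH_FALLBACK;

	*out_pfn = pa >> SKETCH_PAGE_SHIFT;
	return SKETCH_MAPPED;
}
```

In this model a fault one page into a mapping whose buffer starts at physical address 0x80000000 resolves to PFN 0x80001 at order 0, while the same fault on a revoked buffer yields the SIGBUS result.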
Signed-off-by: Matt Evans --- drivers/vfio/pci/Kconfig | 4 +- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_core.c | 79 +++++++++++++++++++----------- drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++ drivers/vfio/pci/vfio_pci_priv.h | 11 +---- 5 files changed, 68 insertions(+), 41 deletions(-) diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 296bf01e185e..9197343a7301 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -6,6 +6,8 @@ config VFIO_PCI_CORE tristate select VFIO_VIRQFD select IRQ_BYPASS_MANAGER + select PCI_P2PDMA_CORE + select DMA_SHARED_BUFFER =20 config VFIO_PCI_INTX def_bool y if !S390 @@ -56,7 +58,7 @@ config VFIO_PCI_ZDEV_KVM To enable s390x KVM vfio-pci extensions, say Y. =20 config VFIO_PCI_DMABUF - def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER + def_bool y if PCI_P2PDMA =20 source "drivers/vfio/pci/mlx5/Kconfig" =20 diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 6138f1bf241d..881452ea89be 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,8 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only =20 -vfio-pci-core-y :=3D vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio= _pci_config.o +vfio-pci-core-y :=3D vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio= _pci_config.o vfio_pci_dmabuf.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) +=3D vfio_pci_zdev.o -vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) +=3D vfio_pci_dmabuf.o obj-$(CONFIG_VFIO_PCI_CORE) +=3D vfio-pci-core.o =20 vfio-pci-y :=3D vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_c= ore.c index 041243a84d81..c5f934905ce0 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1683,18 +1683,6 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_= pci_core_device *vdev, u16 c up_write(&vdev->memory_lock); } =20 -static unsigned long vma_to_pfn(struct vm_area_struct *vma) -{ - struct vfio_pci_core_device *vdev 
=3D vma->vm_private_data; - int index =3D vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT); - u64 pgoff; - - pgoff =3D vma->vm_pgoff & - ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); - - return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; -} - vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev, struct vm_fault *vmf, unsigned long pfn, @@ -1722,23 +1710,42 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct v= m_fault *vmf, unsigned int order) { struct vm_area_struct *vma =3D vmf->vma; - struct vfio_pci_core_device *vdev =3D vma->vm_private_data; - unsigned long addr =3D vmf->address & ~((PAGE_SIZE << order) - 1); - unsigned long pgoff =3D (addr - vma->vm_start) >> PAGE_SHIFT; - unsigned long pfn =3D vma_to_pfn(vma) + pgoff; - vm_fault_t ret =3D VM_FAULT_FALLBACK; - - if (is_aligned_for_order(vma, addr, pfn, order)) { - scoped_guard(rwsem_read, &vdev->memory_lock) - ret =3D vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order); - } + struct vfio_pci_dma_buf *priv =3D vma->vm_private_data; + struct vfio_pci_core_device *vdev; + unsigned long pfn =3D 0; + vm_fault_t ret =3D VM_FAULT_SIGBUS; =20 - dev_dbg_ratelimited(&vdev->pdev->dev, - "%s(,order =3D %d) BAR %ld page offset 0x%lx: 0x%x\n", - __func__, order, - vma->vm_pgoff >> - (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT), - pgoff, (unsigned int)ret); + /* + * We can rely on the existence of both a DMABUF (priv) and + * the VFIO device it was exported from (vdev). This fault's + * VMA was established using vfio_pci_core_mmap_prep_dmabuf() + * which transfers ownership of the VFIO device fd to the + * DMABUF, and so the VFIO device is held open because the + * VMA's vm_file (DMABUF) is open. + * + * Since vfio_pci_dma_buf_cleanup() cannot have happened, + * vdev must be valid; we can take memory_lock. 
+ */ + vdev =3D READ_ONCE(priv->vdev); + + scoped_guard(rwsem_read, &vdev->memory_lock) { + if (!priv->revoked) { + int pres =3D vfio_pci_dma_buf_find_pfn(priv, vma, + vmf->address, + order, &pfn); + + if (pres =3D=3D 0) + ret =3D vfio_pci_vmf_insert_pfn(vdev, vmf, + pfn, order); + else if (pres =3D=3D -EAGAIN) + ret =3D VM_FAULT_FALLBACK; + } + + dev_dbg_ratelimited(&vdev->pdev->dev, + "%s(order =3D %d) PFN 0x%lx, VA 0x%lx, pgoff 0x%lx: 0x%x\n", + __func__, order, pfn, vmf->address, + vma->vm_pgoff, (unsigned int)ret); + } =20 return ret; } @@ -1763,6 +1770,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev,= struct vm_area_struct *vma unsigned int index; u64 phys_len, req_len, pgoff, req_start; void __iomem *bar_io; + int ret; =20 index =3D vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT); =20 @@ -1802,7 +1810,20 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev= , struct vm_area_struct *vma if (IS_ERR(bar_io)) return PTR_ERR(bar_io); =20 - vma->vm_private_data =3D vdev; + /* + * Create a DMABUF with a single range corresponding to this + * mapping, and wire it into vma->vm_private_data. The VMA's + * vm_file becomes that of the DMABUF, and the DMABUF takes + * ownership of the VFIO device file (put upon DMABUF + * release). This maintains the behaviour of a live VMA + * mapping holding the VFIO device file open. 
+ */ + ret =3D vfio_pci_core_mmap_prep_dmabuf(vdev, vma, + pci_resource_start(pdev, index), + req_len, index); + if (ret) + return ret; + vma->vm_page_prot =3D pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot =3D pgprot_decrypted(vma->vm_page_prot); =20 diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci= _dmabuf.c index 782408c08a5e..f7797f58d44b 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -9,6 +9,7 @@ =20 MODULE_IMPORT_NS("DMA_BUF"); =20 +#ifdef CONFIG_VFIO_PCI_DMABUF static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -25,6 +26,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, =20 return 0; } +#endif /* CONFIG_VFIO_PCI_DMABUF */ =20 static void vfio_pci_dma_buf_done(struct kref *kref) { @@ -89,7 +91,9 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmab= uf) } =20 static const struct dma_buf_ops vfio_pci_dmabuf_ops =3D { +#ifdef CONFIG_VFIO_PCI_DMABUF .attach =3D vfio_pci_dma_buf_attach, +#endif .map_dma_buf =3D vfio_pci_dma_buf_map, .unmap_dma_buf =3D vfio_pci_dma_buf_unmap, .release =3D vfio_pci_dma_buf_release, @@ -263,6 +267,7 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_= device *vdev, return 0; } =20 +#ifdef CONFIG_VFIO_PCI_DMABUF /* * This is a temporary "private interconnect" between VFIO DMABUF and iomm= ufd. 
* It allows the two co-operating drivers to exchange the physical address= of @@ -461,6 +466,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_= device *vdev, u32 flags, kfree(dma_ranges); return ret; } +#endif /* CONFIG_VFIO_PCI_DMABUF */ =20 int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, struct vm_area_struct *vma, @@ -535,6 +541,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device= *vdev, bool revoked) struct vfio_pci_dma_buf *tmp; =20 lockdep_assert_held_write(&vdev->memory_lock); + /* + * Holding memory_lock ensures a racing VMA fault observes + * priv->revoked properly. + */ =20 list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { if (!get_file_active(&priv->dmabuf->file)) @@ -552,6 +562,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device = *vdev, bool revoked) if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done); wait_for_completion(&priv->comp); + unmap_mapping_range(priv->dmabuf->file->f_mapping, + 0, priv->size, 1); /* * Re-arm the registered kref reference and the * completion so the post-revoke state matches the diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_p= riv.h index 06dc0fd3e230..d38e1b98b2e9 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -138,13 +138,13 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_co= re_device *vdev, struct vm_area_struct *vma, u64 phys_start, u64 req_len, unsigned int res_index); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked= ); =20 #ifdef CONFIG_VFIO_PCI_DMABUF int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 f= lags, struct vfio_device_feature_dma_buf __user *arg, size_t argsz); -void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); -void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked= ); #else static inline int 
vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, @@ -153,13 +153,6 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_dev= ice *vdev, u32 flags, { return -ENOTTY; } -static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *v= dev) -{ -} -static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, - bool revoked) -{ -} #endif =20 #endif --=20 2.47.3 From nobody Mon Jun 8 18:55:44 2026 Received: from mx0b-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3A89F3F20E7; Wed, 27 May 2026 10:24:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=67.231.153.30 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779877459; cv=none; b=De/0S4vsKjtUOkyRcrmAMgFSb/pit2nUOFsDMw2WGpk+EDL+n71W2jnLugcblcpdaVgQ1rvRkEpYzyVrgjlLpfEMI6FB6w+KH3Un54qoqeZR8eFPpSGInT/PLxS0OOI1dHHaNS1jirxXjqstyzsP2dl4Lip7JPfa/B4mNlCTDJc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779877459; c=relaxed/simple; bh=jZJEHcDN9aywnSQ+6qzMnDVsbWUz43L8EWNYjY086Ik=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=pNW0cVNyxD0hE8F1Q9EIktvuwFf+DVWrng2OiSxj3v/CIB/4pnwsSQSbUcaKmhyeyONjaGpQIEtpn8jqzrQOwLMBNrqMoDZOO5p4hbycZy+Rbq4NEH9K9mwHSQYI0mU9/OjBlT4+FG7a16eNhOVFvAwqY+IJTFbzgo1FlHlW7pk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com; spf=pass smtp.mailfrom=meta.com; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b=Dia3mcFj; arc=none smtp.client-ip=67.231.153.30 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=meta.com Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b="Dia3mcFj" Received: from pps.filterd (m0148460.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.18.1.11/8.18.1.11) with ESMTP id 64R6hXB11411431; Wed, 27 May 2026 03:23:49 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=meta.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=s2048-2025-q2; bh=HWCayHTmA6fkxyKnxbcnh2gOZ+Q9k0y58BUHYwbE5n0=; b=Dia3mcFjV163 OcGNR2GryCFOoihIVfKz75oVUnzouc2XdsWRMFq5GiJlInIwjXhWUUvHDL+JrmJH Fd2LTzy/ZxPENThZe/zL7h/6C8Z2BOU4faMNxBgxMtaytxQnKJxt20dmDN83NnWp /f8esMtfhZ5lpiHUwGQXiJGZ8MwdhscC1eHV6Jtp+g3IIy5ThcNedDfSPfwtfcek rZiesF9MNSTwlgWHhKfLaeqhE7GeMMqQTOWoYiJMNmrLhc1HZDxkYzw4Oe8hUT6D ZsxYG7e7NwFecJAApeCH6HwkKFDLSroC6IJkVPMRQMUTddYeP6KlmApNpmz4JfUe DznSY5Hb+g== Received: from mail.thefacebook.com ([163.114.134.16]) by mx0a-00082601.pphosted.com (PPS) with ESMTPS id 4edujrgwd1-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT); Wed, 27 May 2026 03:23:49 -0700 (PDT) Received: from localhost (2620:10d:c085:108::150d) by mail.thefacebook.com (2620:10d:c08b:78::2ac9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.2.2562.41; Wed, 27 May 2026 10:23:47 +0000 From: Matt Evans To: Alex Williamson , Leon Romanovsky , Jason Gunthorpe , Alex Mastro , =?UTF-8?q?Christian=20K=C3=B6nig?= , Bjorn Helgaas , Logan Gunthorpe CC: Mahmoud Adam , David Matlack , =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= , Sumit Semwal , Kevin Tian , Ankit Agrawal , Pranjal Shrivastava , Alistair Popple , Vivek Kasireddy , , , , , , Subject: [PATCH v2 5/9] vfio/pci: Provide a user-facing name for BAR mappings Date: Wed, 27 May 2026 03:23:08 -0700 Message-ID: <20260527102319.100128-6-mattev@meta.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260527102319.100128-1-mattev@meta.com> References: 
<20260527102319.100128-1-mattev@meta.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Authority-Analysis: v=2.4 cv=J5OaKgnS c=1 sm=1 tr=0 ts=6a16c635 cx=c_pps a=CB4LiSf2rd0gKozIdrpkBw==:117 a=CB4LiSf2rd0gKozIdrpkBw==:17 a=NGcC8JguVDcA:10 a=VkNPw1HP01LnGYTKEx00:22 a=7x6HtfJdh03M6CCDgxCd:22 a=JnKecZnUtZousrUlYMGU:22 a=VabnemYjAAAA:8 a=3OuInECnMAfxEXBjkJUA:9 a=gKebqoRLp9LExxC7YDUY:22 X-Proofpoint-GUID: 6pUddGDdSoEMjZBf4ZeaFcYBx7g3t3Dh X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTI3MDEwMCBTYWx0ZWRfX0loea65+K6Sw GKEBO2dOXgp0/ZDJ72MpNPgU8ws8ZnweQEaUk+Cqb7QXjFrL1ZK+6h2uaGjG9Y2DlgEsdnsie+p hLszdgNqvw0/GEzy6q7grOHZpMP14U7EseQlpPNyjmcLVrPLB9wA2UBeoCigZX4eVE7Pe1flAUH phkitdAu/fT1NV1Scsy9rkNpt1yGMVoEO95uHaVb3ZKkmSnIOnbKdTuzuXQQDIGDZaf3mIi84sZ JQZqP1s63DEcedkD1lEURfj9Bs/tyF8C//gg39Cml9s8HWhpweNYGb0DzIve/5JC0ORvBvlx+sm LYALuPPMb80RCFaqytg7HOZc2z5hhSZm3UFcIdN22FPY/b9WiO8Ap19+GJ7tpmtItin3h08IF0N zzjIzYmX24H0UBUYiUxrPb9D1Hj1F605WL98iaMwO9IE/ux74zVaSHh9isw0j1a+nAMjQfXKRZu KWyrtkw8AdlM5fx2ZCg== X-Proofpoint-ORIG-GUID: 6pUddGDdSoEMjZBf4ZeaFcYBx7g3t3Dh X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1143,Hydra:6.1.125,FMLib:17.12.100.49 definitions=2026-05-27_01,2026-05-26_03,2025-10-01_01 Content-Type: text/plain; charset="utf-8"

Since BAR mmap()s were converted to use DMABUFs, the original device path no longer appears in /proc/<pid>/maps, lsof, etc. Generate a debug-oriented synthetic 'filename' based on the cdev name, the BDF, and the resource index.

This applies only to BAR mappings made via the VFIO device fd, as explicitly-exported DMABUFs are named by userspace via the DMA_BUF_SET_NAME ioctl.
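The naming scheme can be tried out in userspace with a small sketch like the one below. It assumes the "%s:%s/%x" format used by this patch (cdev name, PCI BDF, resource index in hex) and a name buffer of 32 bytes, which is what DMA_BUF_NAME_LEN is defined as in the kernel's UAPI header as far as I know. The `sketch_*` and `SKETCH_*` names are hypothetical and exist only so the example compiles on its own.

```c
#include <stdio.h>
#include <string.h>

/* Assumed to mirror the kernel's DMA_BUF_NAME_LEN; redefined for this sketch only. */
#define SKETCH_DMA_BUF_NAME_LEN 32

/*
 * Hypothetical helper: formats "<cdev name>:<PCI BDF>/<resource index>"
 * into buf. Returns what snprintf() returns, i.e. the length the full
 * name needs, excluding the terminating NUL, so a caller can detect
 * truncation by comparing the result against len.
 */
static int sketch_bar_name(char *buf, size_t len, const char *cdev,
			   const char *bdf, unsigned int res_index)
{
	return snprintf(buf, len, "%s:%s/%x", cdev, bdf, res_index);
}
```

With the worst-case inputs quoted in the patch comment ("vfio1234567890", "ffff:ff:3f.7", resource 5) the name is 29 characters, or 30 bytes with the terminator, so it fits in a 32-byte buffer without truncation.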
Signed-off-by: Matt Evans --- drivers/vfio/pci/vfio_pci_dmabuf.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci= _dmabuf.c index f7797f58d44b..733607371082 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -4,6 +4,7 @@ #include #include #include +#include =20 #include "vfio_pci_priv.h" =20 @@ -476,6 +477,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core= _device *vdev, struct vfio_pci_dma_buf *priv; const unsigned int nr_ranges =3D 1; unsigned long vma_pgoff =3D vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE= _SHIFT); + char *bufname; int ret; =20 priv =3D kzalloc_obj(*priv); @@ -488,6 +490,20 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_cor= e_device *vdev, goto err_free_priv; } =20 + bufname =3D kzalloc(DMA_BUF_NAME_LEN, GFP_KERNEL); + if (!bufname) { + ret =3D -ENOMEM; + goto err_free_phys; + } + + /* + * Maximum size of the friendly debug name is + * vfio1234567890:ffff:ff:3f.7/5 =3D 30, which fits within + * DMA_BUF_NAME_LEN. + */ + snprintf(bufname, DMA_BUF_NAME_LEN, "%s:%s/%x", + dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index); + /* * The DMABUF begins from the mmap()'s BAR offset, i.e. 
the * start of the VMA corresponds to byte 0 of the DMABUF and @@ -506,7 +522,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core= _device *vdev, priv->provider =3D pcim_p2pdma_provider(vdev->pdev, res_index); if (!priv->provider) { ret =3D -EINVAL; - goto err_free_phys; + goto err_free_name; } =20 priv->phys_vec[0].paddr =3D phys_start + ((u64)vma_pgoff << PAGE_SHIFT); @@ -514,7 +530,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core= _device *vdev, =20 ret =3D vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR); if (ret) - goto err_free_phys; + goto err_free_name; =20 /* * The VMA gets the DMABUF file so that other users can locate @@ -526,8 +542,15 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_cor= e_device *vdev, vma->vm_file =3D priv->dmabuf->file; vma->vm_private_data =3D priv; =20 + spin_lock(&priv->dmabuf->name_lock); + kfree(priv->dmabuf->name); + priv->dmabuf->name =3D bufname; + spin_unlock(&priv->dmabuf->name_lock); + return 0; =20 +err_free_name: + kfree(bufname); err_free_phys: kfree(priv->phys_vec); err_free_priv: --=20 2.47.3 From nobody Mon Jun 8 18:55:44 2026 Received: from mx0a-00082601.pphosted.com (mx0b-00082601.pphosted.com [67.231.153.30]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 647623E51CE; Wed, 27 May 2026 10:24:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=67.231.153.30 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779877459; cv=none; b=LCQQZaUrXcSrvR5PhO9OOf9BSTQPrb8ywrXIEaI5eMJR8VRDuOUrMk40cA8d5xSuf4rpJFE4FWgPaehGqaXvO+iUpPtNA+q36rcqmcoT1U/VAIX8xRGy9JGRhqRa1M6x5sWiXxVhfgGIRF3WrgtsMy5EnV4YEWmcIfCYxIfi8BA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779877459; c=relaxed/simple; bh=o8/Uos2+gfnyfcJrdePV6OdJL2QRuEmEiET7CKM4BE4=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: 
MIME-Version:Content-Type; b=okTPjUsikXflfWLT1IJ9OJQDwDGGmCIEaU4VU+TgeJItp0FFj4JGeXm/6VV9IQZm7e+Yp/kk9uWYSHQxm+D8ibXtjR4QBw1JF6DL+a0387FwyH/iU7N9tYz56NKMMmgm7QLCHCFOIOfnCP2Rrz6C603QJeH+qws7iNDcFhx/arU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com; spf=pass smtp.mailfrom=meta.com; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b=oDklvi3p; arc=none smtp.client-ip=67.231.153.30 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=meta.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=meta.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=meta.com header.i=@meta.com header.b="oDklvi3p" Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.18.1.11/8.18.1.11) with ESMTP id 64R3BPu42788695; Wed, 27 May 2026 03:23:53 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=meta.com; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to; s=s2048-2025-q2; bh=lKKu0h9afTECSWfDN1dTmPgNktwiSaCV9lSdHiVErV0=; b=oDklvi3pd80N +lmPlK6nTAPp9nhZm6j4opcvHJUBQvl8CQDODqDOjWrIcZ+8z+2MNh1XnSsnxV9+ I2gFHrauLsUBiPC8f0AA30BAU7xTXqlqIwVBxEHSFJVQrVrvn8puIdZCvH5bN3cL yXGagb3nSvdYbQ2Wk9hVVLjoIJjLd4huVhfKXS8Lje7DgfhUfrXG6UN5cw+oMHsD IWZ82xgrpII2LXr87oLRASidhCxgyXYeer77nbqbyChYg6oVdYWnrSBKimpJ3Lox ZgWyfrL6uj/42kiTl7sXcAaxdiZ/WD4SQQRKEdQ39VRWkpFLJpxxoqkMjMgCCRxl 8gBFy/83Yw== Received: from mail.thefacebook.com ([163.114.134.16]) by m0089730.ppops.net (PPS) with ESMTPS id 4edrg4hm2m-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT); Wed, 27 May 2026 03:23:52 -0700 (PDT) Received: from localhost (2620:10d:c085:208::f) by mail.thefacebook.com (2620:10d:c08b:78::2ac9) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.2.2562.41; Wed, 27 May 2026 
10:23:51 +0000 From: Matt Evans To: Alex Williamson , Leon Romanovsky , Jason Gunthorpe , Alex Mastro , =?UTF-8?q?Christian=20K=C3=B6nig?= , Bjorn Helgaas , Logan Gunthorpe CC: Mahmoud Adam , David Matlack , =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= , Sumit Semwal , Kevin Tian , Ankit Agrawal , Pranjal Shrivastava , Alistair Popple , Vivek Kasireddy , , , , , , Subject: [PATCH v2 6/9] vfio/pci: Clean up BAR zap and revocation Date: Wed, 27 May 2026 03:23:09 -0700 Message-ID: <20260527102319.100128-7-mattev@meta.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260527102319.100128-1-mattev@meta.com> References: <20260527102319.100128-1-mattev@meta.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Proofpoint-GUID: DAzYb5ML5xxxyld5fCoI4GYgI3HX5aCE X-Proofpoint-Spam-Details-Enc: AW1haW4tMjYwNTI3MDEwMCBTYWx0ZWRfXzdm/Kj9qVG7W eD3SLCh3qDd22593lUuBd8Mo/hJo5xzogJ4OAs7akDkrhuxgUwg6EFAZ+w6UU16SXQzf5PAwQg6 E/J8qCtuoRe1EBCOJqm8q3BPd8/yJmBNMJIyWkJE2qSDsGyyacXux6dSD+I2Hky4Rdzi1F8CykO 9jWJY1aY5cus1fsCCjNcxUU0PvJfyb8ted2ATpku35mfop/tAXvBrvV208+KUZ+UNjreFlxEalp I5PEUA3Zjvlhx31x9k5VkW068ZYWFfReYXU/xmzhPgSX9IhwpmjMGLAVJzTFdheP/UOxACHztA0 CcdbQ3WcXKSquRjRtsjZfJDe6/O0oYXxlnV6bRG1Dg06CogXxlVFA6UijuJDMTWpMrAgv1q+cA5 oiI9chJVfcxfM5dt2azouMGDv6woYR4gHwpa1vZPRklgaWLCnS/1Ai4OYIa/UT7nyzj5NcUj1kv ALG86fVGkVtNAqQAtwQ== X-Proofpoint-ORIG-GUID: DAzYb5ML5xxxyld5fCoI4GYgI3HX5aCE X-Authority-Analysis: v=2.4 cv=Ov1/DS/t c=1 sm=1 tr=0 ts=6a16c639 cx=c_pps a=CB4LiSf2rd0gKozIdrpkBw==:117 a=CB4LiSf2rd0gKozIdrpkBw==:17 a=NGcC8JguVDcA:10 a=VkNPw1HP01LnGYTKEx00:22 a=7x6HtfJdh03M6CCDgxCd:22 a=855S8uPTkML1Oy45N9_h:22 a=VabnemYjAAAA:8 a=c4k9Xcs0TiChc187yiQA:9 a=O8hF6Hzn-FEA:10 a=gKebqoRLp9LExxC7YDUY:22 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.293,Aquarius:18.0.1143,Hydra:6.1.125,FMLib:17.12.100.49 definitions=2026-05-27_01,2026-05-26_03,2025-10-01_01 Content-Type: 
text/plain; charset="utf-8"

Previously, calls to vfio_pci_zap_bars() (and its wrapper vfio_pci_zap_and_down_write_memory_lock()) were paired with calls to vfio_pci_dma_buf_move(). This commit replaces them with a unified new function, vfio_pci_zap_revoke_bars(), containing both the vfio_pci_dma_buf_move() and the unmap_mapping_range(), making it harder for callers to omit one. It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes memory_lock for write before zapping, and adds a new vfio_pci_unrevoke_bars() for the re-enable path.

However, as of "vfio/pci: Convert BAR mmap() to use a DMABUF", the unmap_mapping_range() zap is entirely redundant for plain vfio-pci, since the DMABUFs used for BAR mappings already zap PTEs when vfio_pci_dma_buf_move() occurs.

One exception remains as a FIXME: in nvgrace-gpu, some BAR VMAs conditionally use custom vm_ops, which have not moved to be backed by DMABUFs. If these BARs are mmap()ed, the vdev enables the existing behaviour of unmap_mapping_range() for the device fd address space.
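The pairing described above can be pictured with a toy userspace model. This is only a sketch: the `sketch_*` names and the struct are hypothetical, plain flags stand in for memory_lock, vfio_pci_dma_buf_move() and unmap_mapping_range(), and the order of the two steps inside the helper is an assumption rather than something taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant vfio_pci_core_device state. */
struct sketch_vdev {
	bool write_locked;	/* models memory_lock held for write */
	bool revoked;		/* models the DMABUFs being revoked */
	bool bar_needs_zap;	/* set when a non-DMABUF BAR mapping exists */
	unsigned int fd_zaps;	/* counts modelled unmap_mapping_range() calls */
};

/* Caller is expected to hold the write lock already. */
static void sketch_zap_revoke_bars(struct sketch_vdev *vdev)
{
	/* Models vfio_pci_dma_buf_move(vdev, true), which also zaps DMABUF-backed PTEs. */
	vdev->revoked = true;

	/* Only mappings that are not DMABUF-backed need the extra device fd zap. */
	if (vdev->bar_needs_zap)
		vdev->fd_zaps++;
}

/* Wrapper: take the lock first, then zap and revoke in one step. */
static void sketch_lock_zap_revoke_bars(struct sketch_vdev *vdev)
{
	vdev->write_locked = true;
	sketch_zap_revoke_bars(vdev);
}

/* Re-enable path, models vfio_pci_dma_buf_move(vdev, false). */
static void sketch_unrevoke_bars(struct sketch_vdev *vdev)
{
	vdev->revoked = false;
}
```

Because a caller reaches revocation only through the combined helper, it cannot perform one half of the operation and forget the other, which is the property the commit message is after.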
Signed-off-by: Matt Evans --- drivers/vfio/pci/nvgrace-gpu/main.c | 5 +++ drivers/vfio/pci/vfio_pci_config.c | 30 ++++++------- drivers/vfio/pci/vfio_pci_core.c | 66 ++++++++++++++++++++--------- drivers/vfio/pci/vfio_pci_priv.h | 3 +- include/linux/vfio_pci_core.h | 1 + 5 files changed, 66 insertions(+), 39 deletions(-) diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace= -gpu/main.c index 15e2f03c6cd4..cfa649200a7f 100644 --- a/drivers/vfio/pci/nvgrace-gpu/main.c +++ b/drivers/vfio/pci/nvgrace-gpu/main.c @@ -364,6 +364,8 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vd= ev, struct nvgrace_gpu_pci_core_device *nvdev =3D container_of(core_vdev, struct nvgrace_gpu_pci_core_device, core_device.vdev); + struct vfio_pci_core_device *vdev =3D + container_of(core_vdev, struct vfio_pci_core_device, vdev); struct mem_region *memregion; u64 req_len, pgoff, end; unsigned int index; @@ -374,6 +376,9 @@ static int nvgrace_gpu_mmap(struct vfio_device *core_vd= ev, if (!memregion) return vfio_pci_core_mmap(core_vdev, vma); =20 + /* Non-DMABUF BAR mappings need an extra zap */ + vdev->bar_needs_zap =3D true; + /* * Request to mmap the BAR. 
Map to the CPU accessible memory on the * GPU using the memory information gathered from the system ACPI diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci= _config.c index a10ed733f0e3..8bfab0da481c 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -590,12 +590,10 @@ static int vfio_basic_config_write(struct vfio_pci_co= re_device *vdev, int pos, virt_mem =3D !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem =3D !!(new_cmd & PCI_COMMAND_MEMORY); =20 - if (!new_mem) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); - } else { + if (!new_mem) + vfio_pci_lock_zap_revoke_bars(vdev); + else down_write(&vdev->memory_lock); - } =20 /* * If the user is writing mem/io enable (new_mem/io) and we @@ -631,7 +629,7 @@ static int vfio_basic_config_write(struct vfio_pci_core= _device *vdev, int pos, *virt_cmd |=3D cpu_to_le16(new_cmd & mask); =20 if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); } =20 @@ -712,16 +710,14 @@ static int __init init_pci_cap_basic_perm(struct perm= _bits *perm) static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vde= v, pci_power_t state) { - if (state >=3D PCI_D3hot) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); - } else { + if (state >=3D PCI_D3hot) + vfio_pci_lock_zap_revoke_bars(vdev); + else down_write(&vdev->memory_lock); - } =20 vfio_pci_set_power_state(vdev, state); if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); } =20 @@ -908,11 +904,10 @@ static int vfio_exp_config_write(struct vfio_pci_core= _device *vdev, int pos, &cap); =20 if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); + vfio_pci_lock_zap_revoke_bars(vdev); 
 			pci_try_reset_function(vdev->pdev);
 			if (__vfio_pci_memory_enabled(vdev))
-				vfio_pci_dma_buf_move(vdev, false);
+				vfio_pci_unrevoke_bars(vdev);
 			up_write(&vdev->memory_lock);
 		}
 	}
@@ -993,11 +988,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
 						&cap);
 
 		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
-			vfio_pci_zap_and_down_write_memory_lock(vdev);
-			vfio_pci_dma_buf_move(vdev, true);
+			vfio_pci_lock_zap_revoke_bars(vdev);
 			pci_try_reset_function(vdev->pdev);
 			if (__vfio_pci_memory_enabled(vdev))
-				vfio_pci_dma_buf_move(vdev, false);
+				vfio_pci_unrevoke_bars(vdev);
 			up_write(&vdev->memory_lock);
 		}
 	}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c5f934905ce0..cfea59806a4f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -319,8 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
 	 * The vdev power related flags are protected with 'memory_lock'
 	 * semaphore.
 	 */
-	vfio_pci_zap_and_down_write_memory_lock(vdev);
-	vfio_pci_dma_buf_move(vdev, true);
+	vfio_pci_lock_zap_revoke_bars(vdev);
 
 	if (vdev->pm_runtime_engaged) {
 		up_write(&vdev->memory_lock);
@@ -406,7 +405,7 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 	down_write(&vdev->memory_lock);
 	__vfio_pci_runtime_pm_exit(vdev);
 	if (__vfio_pci_memory_enabled(vdev))
-		vfio_pci_dma_buf_move(vdev, false);
+		vfio_pci_unrevoke_bars(vdev);
 	up_write(&vdev->memory_lock);
 }
 
@@ -1256,6 +1255,8 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static void vfio_pci_zap_revoke_bars(struct vfio_pci_core_device *vdev);
+
 static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 				void __user *arg)
 {
@@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	if (!vdev->reset_works)
 		return -EINVAL;
 
-	vfio_pci_zap_and_down_write_memory_lock(vdev);
+	down_write(&vdev->memory_lock);
 
 	/*
 	 * This function can be invoked while the power state is non-D0. If
@@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
-	vfio_pci_dma_buf_move(vdev, true);
+	vfio_pci_zap_revoke_bars(vdev);
+
 	ret = pci_try_reset_function(vdev->pdev);
 	if (__vfio_pci_memory_enabled(vdev))
-		vfio_pci_dma_buf_move(vdev, false);
+		vfio_pci_unrevoke_bars(vdev);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1648,20 +1650,44 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_write);
 
-static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
+static void vfio_pci_zap_revoke_bars(struct vfio_pci_core_device *vdev)
 {
-	struct vfio_device *core_vdev = &vdev->vdev;
-	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
-	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
-	loff_t len = end - start;
+	lockdep_assert_held_write(&vdev->memory_lock);
+	vfio_pci_dma_buf_move(vdev, true);
 
-	unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
+	/*
+	 * All VFIO PCI BARs are backed by DMABUFs, with the current
+	 * exception of the nvgrace-gpu device which uses its own
+	 * vm_ops for a subset of BARs. For this, BAR mappings are
+	 * still made in the vdev's address_space, and a zap is
+	 * required. The tracking is crude, and will (harmlessly)
+	 * continue to zap if the special BAR is unmapped, but that
+	 * behaviour isn't the common case.
+	 *
+	 * FIXME: This can go away if the special nvgrace-gpu mapping
+	 * is converted to use DMABUF.
+	 */
+	if (vdev->bar_needs_zap) {
+		struct vfio_device *core_vdev = &vdev->vdev;
+		loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
+		loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
+		loff_t len = end - start;
+
+		unmap_mapping_range(core_vdev->inode->i_mapping,
+				    start, len, true);
+	}
 }
 
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
+void vfio_pci_lock_zap_revoke_bars(struct vfio_pci_core_device *vdev)
 {
 	down_write(&vdev->memory_lock);
-	vfio_pci_zap_bars(vdev);
+	vfio_pci_zap_revoke_bars(vdev);
+}
+
+void vfio_pci_unrevoke_bars(struct vfio_pci_core_device *vdev)
+{
+	lockdep_assert_held_write(&vdev->memory_lock);
+	vfio_pci_dma_buf_move(vdev, false);
 }
 
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
@@ -2517,9 +2543,10 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		}
 
 		/*
-		 * Take the memory write lock for each device and zap BAR
-		 * mappings to prevent the user accessing the device while in
-		 * reset. Locking multiple devices is prone to deadlock,
+		 * Take the memory write lock for each device and
+		 * zap/revoke BAR mappings to prevent the user (or
+		 * peers) accessing the device while in reset.
+		 * Locking multiple devices is prone to deadlock,
 		 * runaway and unwind if we hit contention.
 		 */
 		if (!down_write_trylock(&vdev->memory_lock)) {
@@ -2527,8 +2554,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 			break;
 		}
 
-		vfio_pci_dma_buf_move(vdev, true);
-		vfio_pci_zap_bars(vdev);
+		vfio_pci_zap_revoke_bars(vdev);
 	}
 
 	if (!list_entry_is_head(vdev,
@@ -2558,7 +2584,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
 					 vdev.dev_set_list) {
 		if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
-			vfio_pci_dma_buf_move(vdev, false);
+			vfio_pci_unrevoke_bars(vdev);
 		up_write(&vdev->memory_lock);
 	}
 
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d38e1b98b2e9..10833aabd7fb 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -83,7 +83,8 @@ void vfio_config_free(struct vfio_pci_core_device *vdev);
 int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
 			     pci_power_t state);
 
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev);
+void vfio_pci_lock_zap_revoke_bars(struct vfio_pci_core_device *vdev);
+void vfio_pci_unrevoke_bars(struct vfio_pci_core_device *vdev);
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
 void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev,
 					u16 cmd);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 7accd0eac457..e35e82c24c8c 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -127,6 +127,7 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore:1;
 	bool			pm_intx_masked:1;
 	bool			pm_runtime_engaged:1;
+	bool			bar_needs_zap:1;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.47.3

From nobody Mon Jun 8 18:55:44 2026
From: Matt Evans
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
 Christian König, Bjorn Helgaas, Logan Gunthorpe
CC: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian,
 Ankit Agrawal, Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy
Subject: [PATCH v2 7/9] vfio/pci: Support mmap() of a VFIO DMABUF
Date: Wed, 27 May 2026 03:23:10 -0700
Message-ID: <20260527102319.100128-8-mattev@meta.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260527102319.100128-1-mattev@meta.com>
References: <20260527102319.100128-1-mattev@meta.com>
X-Mailing-List: linux-kernel@vger.kernel.org

A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
support for mmap() of this fd. This provides another route for a
process to map BARs, but one where the process can map only the
specific subset of a BAR represented by the exported DMABUF.

mmap() support enables userspace driver designs that safely delegate
access to BAR sub-ranges to other client processes by sharing a DMABUF
fd, without having to share the (omnipotent) VFIO device fd with them.

Since the main VFIO BAR mmap() is now DMABUF-aware, this path reuses
the existing vm_ops. However, because the lifecycle of an exported
DMABUF is still decoupled from that of the device fd it came from, the
device fd might now be closed concurrently with a VMA fault. Extra
synchronisation is added to deal with the possibility of a fault
racing with the DMABUF cleanup path. (Note that this differs from a
DMABUF implicitly created on the mmap() path, which holds ownership of
the device fd and so prevents close-during-fault scenarios in order to
maintain the same user-facing behaviour on close.)
The fault handler deals with the race by temporarily taking a VFIO
device registration, which ensures vdev remains valid, so that
vdev->memory_lock can then be taken.

Signed-off-by: Matt Evans
---
 drivers/vfio/pci/vfio_pci_core.c   | 79 ++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_dmabuf.c | 27 ++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  2 +
 3 files changed, 99 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index cfea59806a4f..41e049fa9a8a 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -12,6 +12,8 @@
 
 #include
 #include
+#include
+#include
 #include
 #include
 #include
@@ -1742,19 +1744,77 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 
 	/*
-	 * We can rely on the existence of both a DMABUF (priv) and
-	 * the VFIO device it was exported from (vdev). This fault's
-	 * VMA was established using vfio_pci_core_mmap_prep_dmabuf()
-	 * which transfers ownership of the VFIO device fd to the
-	 * DMABUF, and so the VFIO device is held open because the
-	 * VMA's vm_file (DMABUF) is open.
+	 * The only thing this can rely on is that the DMABUF relating
+	 * to the VMA's vm_file exists (priv).
 	 *
-	 * Since vfio_pci_dma_buf_cleanup() cannot have happened,
-	 * vdev must be valid; we can take memory_lock.
+	 * A DMABUF for a VFIO device fd mmap() holds a reference to
+	 * the original VFIO device fd, but an explicitly-exported
+	 * DMABUF does not. The original fd might have closed,
+	 * meaning this fault can race with
+	 * vfio_pci_dma_buf_cleanup(), meaning priv->vdev might be
+	 * NULL, and the VFIO device registration might have been
+	 * dropped.
+	 *
+	 * With the goal of taking vdev->memory_lock in a world where
+	 * vdev might not still exist:
+	 *
+	 * 1. Take the resv lock on the DMABUF:
+	 *    - If racing cleanup got in first, the buffer is revoked;
+	 *      stop/exit if so.
+	 *    - If we got in first, the buffer is not revoked so vdev is
+	 *      non-NULL, accessible, and cleanup _has not yet put the
+	 *      VFIO device registration_. So, the device refcount must
+	 *      be >0.
+	 *
+	 * 2. Take vfio_device registration (refcount guaranteed >0
+	 *    hereafter).
+	 *
+	 * 3. Unlock the DMABUF's resv lock:
+	 *    - A racing cleanup can now complete.
+	 *    - But, the device refcount >0, meaning the vfio_device
+	 *      (and vfio_pcie_core device vdev) have not yet been
+	 *      freed. vdev is accessible, even if the DMABUF has been
+	 *      revoked or cleanup has happened, because
+	 *      vfio_unregister_group_dev() can't complete.
+	 *
+	 * 4. Take the vdev->memory_lock
+	 *    - Either the DMABUF is usable, or has been cleaned up.
+	 *      Whichever, it can no longer change under us.
+	 *    - Test the DMABUF revocation status again: if it was
+	 *      revoked between 1 and 4 return a SIGBUS. Otherwise,
+	 *      return a PFN.
+	 *    - It's not necessary to also take the resv lock, because
+	 *      the status/vdev can't change while memory_lock is held.
+	 *
+	 * 5. Unlock, done.
 	 */
+
+	dma_resv_lock(priv->dmabuf->resv, NULL);
 	vdev = READ_ONCE(priv->vdev);
 
+	if (priv->revoked || !vdev) {
+		pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",
+				     __func__, vmf->address, vma->vm_pgoff);
+		dma_resv_unlock(priv->dmabuf->resv);
+		return VM_FAULT_SIGBUS;
+	}
+	/* vdev is usable */
+
+	if (!vfio_device_try_get_registration(&vdev->vdev)) {
+		/*
+		 * If vdev != NULL (above), the registration should
+		 * already be >0 and so this try_get should never
+		 * fail.
+		 */
+		dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n",
+			 __func__);
+		dma_resv_unlock(priv->dmabuf->resv);
+		return VM_FAULT_SIGBUS;
+	}
+	dma_resv_unlock(priv->dmabuf->resv);
+
 	scoped_guard(rwsem_read, &vdev->memory_lock) {
+		/* Revocation status must be re-read, under memory_lock */
 		if (!priv->revoked) {
 			int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
 							     vmf->address,
@@ -1773,6 +1833,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 			vma->vm_pgoff, (unsigned int)ret);
 	}
 
+	vfio_device_put_registration(&vdev->vdev);
 	return ret;
 }
 
@@ -1781,7 +1842,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
 	return vfio_pci_mmap_huge_fault(vmf, 0);
 }
 
-static const struct vm_operations_struct vfio_pci_mmap_ops = {
+const struct vm_operations_struct vfio_pci_mmap_ops = {
 	.fault = vfio_pci_mmap_page_fault,
 #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
 	.huge_fault = vfio_pci_mmap_huge_fault,
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 733607371082..4b3b15655f1d 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -27,6 +27,32 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 
 	return 0;
 }
+
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	if (priv->revoked)
+		return -ENODEV;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+
+	/*
+	 * dma_buf_mmap_internal() has asserted that the VMA is
+	 * contained within the DMABUF size before calling this.
+	 */
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+	/* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
+	vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+		     VM_DONTEXPAND | VM_DONTDUMP);
+	vma->vm_private_data = priv;
+	vma->vm_ops = &vfio_pci_mmap_ops;
+
+	return 0;
+}
 #endif /* CONFIG_VFIO_PCI_DMABUF */
 
 static void vfio_pci_dma_buf_done(struct kref *kref)
@@ -94,6 +120,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
 #ifdef CONFIG_VFIO_PCI_DMABUF
 	.attach = vfio_pci_dma_buf_attach,
+	.mmap = vfio_pci_dma_buf_mmap,
 #endif
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 10833aabd7fb..db2e2aeae88f 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -38,6 +38,8 @@ struct vfio_pci_dma_buf {
 	u8 revoked : 1;
 };
 
+extern const struct vm_operations_struct vfio_pci_mmap_ops;
+
 bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
-- 
2.47.3

From nobody Mon Jun 8 18:55:44 2026
From: Matt Evans
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
 Christian König, Bjorn Helgaas, Logan Gunthorpe
CC: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian,
 Ankit Agrawal, Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy
Subject: [PATCH v2 8/9] vfio/pci: Permanently revoke a DMABUF on request
Date: Wed, 27 May 2026 03:23:11 -0700
Message-ID: <20260527102319.100128-9-mattev@meta.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260527102319.100128-1-mattev@meta.com>
References: <20260527102319.100128-1-mattev@meta.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Expand the VFIO DMABUF revocation state to three states: not revoked,
temporarily revoked, and permanently revoked. The first two cover the
existing transient revocation, e.g. across a function reset. A DMABUF
enters the third state in response to a new
ioctl(VFIO_DEVICE_PCI_DMABUF_REVOKE) request.

This VFIO device fd ioctl() passes a DMABUF by fd and requests that
the DMABUF be permanently revoked. On success, it is guaranteed that
the buffer can never be imported/attached/mmap()ed in future, that
dynamic imports have been cleanly detached, and that all mappings
have been made inaccessible (PTEs zapped).

This is useful for lifecycle management, to reclaim VFIO PCI BAR
ranges previously delegated to a subordinate client process: the
driver process can ensure that the loaned resources are revoked when
the client is deemed "done", so that exported ranges can be safely
re-used elsewhere.

Refactor the revocation code out of vfio_pci_dma_buf_move() into a
function common to the move path and the new ioctl path.
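As a rough illustration of the three-state model described above, the
status transitions can be sketched as a small userspace C function.
This is only a model under stated assumptions: the names dmabuf_status,
next_status() and check_transitions() are hypothetical and not part of
the patch, and the real __vfio_pci_dma_buf_revoke() additionally
invalidates mappings and waits for importers under the reservation
lock, none of which is modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for enum vfio_pci_dma_buf_status. */
enum dmabuf_status {
	DMABUF_OK = 0,
	DMABUF_TEMP_REVOKED = 1,
	DMABUF_PERM_REVOKED = 2,
};

/*
 * Model of the status transition only (assumption: no locking and no
 * mapping invalidation). PERM_REVOKED is terminal; an un-revoke
 * returns any other state to OK; a revoke moves to TEMP or PERM.
 */
enum dmabuf_status next_status(enum dmabuf_status cur, bool revoke,
			       bool permanent)
{
	if (cur == DMABUF_PERM_REVOKED)
		return cur;
	if (!revoke)
		return DMABUF_OK;
	return permanent ? DMABUF_PERM_REVOKED : DMABUF_TEMP_REVOKED;
}

/* Walk one plausible lifecycle: reset, restore, then revoke ioctl. */
void check_transitions(void)
{
	enum dmabuf_status s = DMABUF_OK;

	s = next_status(s, true, false);	/* e.g. function reset */
	assert(s == DMABUF_TEMP_REVOKED);
	s = next_status(s, false, false);	/* memory re-enabled */
	assert(s == DMABUF_OK);
	s = next_status(s, true, true);		/* revoke ioctl */
	assert(s == DMABUF_PERM_REVOKED);
	s = next_status(s, false, false);	/* un-revoke is ignored */
	assert(s == DMABUF_PERM_REVOKED);
}
```

Checking the terminal state first means a later transient un-revoke
(for example once a reset completes) cannot resurrect a buffer that
the driver process has already reclaimed.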
Signed-off-by: Matt Evans
---
 drivers/vfio/pci/vfio_pci_core.c   |  21 ++++-
 drivers/vfio/pci/vfio_pci_dmabuf.c | 146 +++++++++++++++++++++--------
 drivers/vfio/pci/vfio_pci_priv.h   |  14 ++-
 include/uapi/linux/vfio.h          |  30 ++++++
 4 files changed, 170 insertions(+), 41 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 41e049fa9a8a..5184b3cac160 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1500,6 +1500,21 @@ static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				  ioeventfd.fd);
 }
 
+static int vfio_pci_ioctl_dmabuf_revoke(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_dmabuf_revoke __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_dmabuf_revoke, dmabuf_fd);
+	struct vfio_pci_dmabuf_revoke revoke;
+
+	if (copy_from_user(&revoke, arg, minsz))
+		return -EFAULT;
+
+	if (revoke.argsz < minsz)
+		return -EINVAL;
+
+	return vfio_pci_dma_buf_revoke(vdev, revoke.dmabuf_fd);
+}
+
 long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 			 unsigned long arg)
 {
@@ -1522,6 +1537,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		return vfio_pci_ioctl_reset(vdev, uarg);
 	case VFIO_DEVICE_SET_IRQS:
 		return vfio_pci_ioctl_set_irqs(vdev, uarg);
+	case VFIO_DEVICE_PCI_DMABUF_REVOKE:
+		return vfio_pci_ioctl_dmabuf_revoke(vdev, uarg);
 	default:
 		return -ENOTTY;
 	}
@@ -1792,7 +1809,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 	dma_resv_lock(priv->dmabuf->resv, NULL);
 	vdev = READ_ONCE(priv->vdev);
 
-	if (priv->revoked || !vdev) {
+	if (priv->status != VFIO_PCI_DMABUF_OK || !vdev) {
 		pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",
 				     __func__, vmf->address, vma->vm_pgoff);
 		dma_resv_unlock(priv->dmabuf->resv);
@@ -1815,7 +1832,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 
 	scoped_guard(rwsem_read, &vdev->memory_lock) {
 		/* Revocation status must be re-read, under memory_lock */
-		if (!priv->revoked) {
+		if (priv->status == VFIO_PCI_DMABUF_OK) {
 			int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
 							     vmf->address,
 							     order, &pfn);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 4b3b15655f1d..3fa14760898f 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -19,7 +19,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	if (!attachment->peer2peer)
 		return -EOPNOTSUPP;
 
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return -ENODEV;
 
 	if (!dma_buf_attach_revocable(attachment))
@@ -32,7 +32,7 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *
 {
 	struct vfio_pci_dma_buf *priv = dmabuf->priv;
 
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return -ENODEV;
 	if ((vma->vm_flags & VM_SHARED) == 0)
 		return -EINVAL;
@@ -72,7 +72,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 
 	dma_resv_assert_held(priv->dmabuf->resv);
 
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return ERR_PTR(-ENODEV);
 
 	ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
@@ -287,7 +287,8 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
 	INIT_LIST_HEAD(&priv->dmabufs_elm);
 	down_write(&vdev->memory_lock);
 	dma_resv_lock(priv->dmabuf->resv, NULL);
-	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	priv->status = __vfio_pci_memory_enabled(vdev) ? VFIO_PCI_DMABUF_OK :
+		VFIO_PCI_DMABUF_TEMP_REVOKED;
 	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
 	dma_resv_unlock(priv->dmabuf->resv);
 	up_write(&vdev->memory_lock);
@@ -318,7 +319,7 @@ int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
 		return -EOPNOTSUPP;
 
 	priv = attachment->dmabuf->priv;
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return -ENODEV;
 
 	/* More than one range to iommufd will require proper DMABUF support */
@@ -585,6 +586,63 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static void __vfio_pci_dma_buf_revoke(struct vfio_pci_dma_buf *priv, bool revoked,
+				      bool permanently)
+{
+	bool was_revoked;
+
+	lockdep_assert_held_write(&priv->vdev->memory_lock);
+
+	if ((priv->status == VFIO_PCI_DMABUF_PERM_REVOKED) ||
+	    (priv->status == VFIO_PCI_DMABUF_OK && !revoked) ||
+	    (priv->status == VFIO_PCI_DMABUF_TEMP_REVOKED && revoked && !permanently)) {
+		return;
+	}
+
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	was_revoked = priv->status != VFIO_PCI_DMABUF_OK;
+
+	if (revoked)
+		priv->status = permanently ? VFIO_PCI_DMABUF_PERM_REVOKED :
+			VFIO_PCI_DMABUF_TEMP_REVOKED;
+
+	/*
+	 * If TEMP_REVOKED is being upgraded to PERM_REVOKED, the
+	 * buffer is already gone. Don't wait on it again.
+	 */
+	if (was_revoked && revoked) {
+		dma_resv_unlock(priv->dmabuf->resv);
+		return;
+	}
+
+	dma_buf_invalidate_mappings(priv->dmabuf);
+	dma_resv_wait_timeout(priv->dmabuf->resv,
+			      DMA_RESV_USAGE_BOOKKEEP, false,
+			      MAX_SCHEDULE_TIMEOUT);
+	dma_resv_unlock(priv->dmabuf->resv);
+	if (revoked) {
+		kref_put(&priv->kref, vfio_pci_dma_buf_done);
+		wait_for_completion(&priv->comp);
+		unmap_mapping_range(priv->dmabuf->file->f_mapping,
+				    0, priv->size, 1);
+		/*
+		 * Re-arm the registered kref reference and the
+		 * completion so the post-revoke state matches the
+		 * post-creation state. An un-revoke followed by a
+		 * new mapping needs the kref to be non-zero before
+		 * kref_get(), and vfio_pci_dma_buf_cleanup()
+		 * delegates its drain back through this revoke
+		 * path on a possibly-already-revoked dma-buf.
+		 */
+		kref_init(&priv->kref);
+		reinit_completion(&priv->comp);
+	} else {
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		priv->status = VFIO_PCI_DMABUF_OK;
+		dma_resv_unlock(priv->dmabuf->resv);
+	}
+}
+
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 {
 	struct vfio_pci_dma_buf *priv;
@@ -593,44 +651,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 	lockdep_assert_held_write(&vdev->memory_lock);
 	/*
 	 * Holding memory_lock ensures a racing VMA fault observes
-	 * priv->revoked properly.
+	 * priv->status properly.
 	 */
 
 	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
 		if (!get_file_active(&priv->dmabuf->file))
 			continue;
-
-		if (priv->revoked != revoked) {
-			dma_resv_lock(priv->dmabuf->resv, NULL);
-			if (revoked)
-				priv->revoked = true;
-			dma_buf_invalidate_mappings(priv->dmabuf);
-			dma_resv_wait_timeout(priv->dmabuf->resv,
-					      DMA_RESV_USAGE_BOOKKEEP, false,
-					      MAX_SCHEDULE_TIMEOUT);
-			dma_resv_unlock(priv->dmabuf->resv);
-			if (revoked) {
-				kref_put(&priv->kref, vfio_pci_dma_buf_done);
-				wait_for_completion(&priv->comp);
-				unmap_mapping_range(priv->dmabuf->file->f_mapping,
-						    0, priv->size, 1);
-				/*
-				 * Re-arm the registered kref reference and the
-				 * completion so the post-revoke state matches the
-				 * post-creation state. An un-revoke followed by a
-				 * new mapping needs the kref to be non-zero before
-				 * kref_get(), and vfio_pci_dma_buf_cleanup()
-				 * delegates its drain back through this revoke
-				 * path on a possibly-already-revoked dma-buf.
- */ - kref_init(&priv->kref); - reinit_completion(&priv->comp); - } else { - dma_resv_lock(priv->dmabuf->resv, NULL); - priv->revoked = false; - dma_resv_unlock(priv->dmabuf->resv); - } - } + __vfio_pci_dma_buf_revoke(priv, revoked, false); fput(priv->dmabuf->file); } } @@ -662,3 +689,46 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) } up_write(&vdev->memory_lock); } + +#ifdef CONFIG_VFIO_PCI_DMABUF +int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd) +{ + struct vfio_pci_dma_buf *priv; + struct dma_buf *dmabuf; + int ret = 0; + + dmabuf = dma_buf_get(dmabuf_fd); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + + priv = dmabuf->priv; + /* + * Sanity-check the DMABUF is really a vfio_pci_dma_buf _and_ + * relates to the VFIO device it was provided with. + * + * If the DMABUF relates to this vdev then priv->vdev is + * stable because this open fd prevents cleanup. + * + * If it relates to a different vdev, reading priv->vdev might + * race with a concurrent cleanup on that device. But if so, + * it points to a non-matching vdev or NULL and is unusable + * either way.
+ */ + if (dmabuf->ops != &vfio_pci_dmabuf_ops || priv->vdev != vdev) { + ret = -ENODEV; + goto out_put_buf; + } + + scoped_guard(rwsem_write, &vdev->memory_lock) { + if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED) + ret = -EBADFD; + else + __vfio_pci_dma_buf_revoke(priv, true, true); + } + + out_put_buf: + dma_buf_put(dmabuf); + + return ret; +} +#endif /* CONFIG_VFIO_PCI_DMABUF */ diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index db2e2aeae88f..a1e0f4fcb1dc 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -23,6 +23,12 @@ struct vfio_pci_ioeventfd { bool test_mem; }; +enum vfio_pci_dma_buf_status { + VFIO_PCI_DMABUF_OK = 0, + VFIO_PCI_DMABUF_TEMP_REVOKED = 1, + VFIO_PCI_DMABUF_PERM_REVOKED = 2, +}; + struct vfio_pci_dma_buf { struct dma_buf *dmabuf; struct vfio_pci_core_device *vdev; @@ -35,7 +41,7 @@ struct vfio_pci_dma_buf { struct kref kref; struct completion comp; unsigned long vma_pgoff_adjust; - u8 revoked : 1; + enum vfio_pci_dma_buf_status status; }; extern const struct vm_operations_struct vfio_pci_mmap_ops; @@ -148,6 +154,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf __user *arg, size_t argsz); +int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd); #else static inline int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, @@ -156,6 +163,11 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, { return -ENOTTY; } +static inline int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, + int dmabuf_fd) +{ + return -ENODEV; +} #endif #endif diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 5de618a3a5ee..02366e9f8e16 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@
-1321,6 +1321,36 @@ struct vfio_precopy_info { #define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21) +/** + * VFIO_DEVICE_PCI_DMABUF_REVOKE - _IO(VFIO_TYPE, VFIO_BASE + 22) + * + * This ioctl is used on the device FD, and requests that access to + * the buffer corresponding to the DMABUF FD parameter is immediately + * and permanently revoked. On successful return, the buffer is not + * accessible through any mmap() or dma-buf import. The request fails + * if the buffer is pinned; otherwise, the exporter marks the buffer + * as inaccessible and uses the move_notify callback to inform + * importers of the change. The buffer is permanently disabled, and + * VFIO refuses all map, mmap, attach, etc. requests. + * + * Returns: + * + * Return: 0 on success, -1 and errno set on failure: + * + * ENODEV if the associated dmabuf FD no longer exists/is closed, + * or is not a DMABUF created for this device. + * EINVAL if the dmabuf_fd parameter isn't a DMABUF. + * EBADF if the dmabuf_fd parameter isn't a valid file number. + * EBADFD if the buffer has already been revoked. + * + */ +struct vfio_pci_dmabuf_revoke { + __u32 argsz; + __s32 dmabuf_fd; +}; + +#define VFIO_DEVICE_PCI_DMABUF_REVOKE _IO(VFIO_TYPE, VFIO_BASE + 22) + /* * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power * state with the platform-based power management.
Device use of lower power -- 2.47.3
From nobody Mon Jun 8 18:55:44 2026 From: Matt Evans To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro, Christian König, Bjorn Helgaas, Logan Gunthorpe CC: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy Subject: [PATCH v2 9/9] vfio/pci: Add mmap() attributes to DMABUF feature Date: Wed, 27 May 2026 03:23:12 -0700 Message-ID: <20260527102319.100128-10-mattev@meta.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260527102319.100128-1-mattev@meta.com> References: <20260527102319.100128-1-mattev@meta.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to set (and get) the CPU-facing memory type attribute of a DMABUF exported from vfio-pci. The attribute is applied to subsequent mmap()s of the buffer. Two attributes are supported:
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC, the default, which results in non-cached PTEs (pgprot_noncached).
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC PTEs (pgprot_writecombine) for the DMABUF's BAR region.
Signed-off-by: Matt Evans --- drivers/vfio/pci/vfio_pci_core.c | 2 + drivers/vfio/pci/vfio_pci_dmabuf.c | 70 +++++++++++++++++++++++++++++- drivers/vfio/pci/vfio_pci_priv.h | 12 +++++ include/uapi/linux/vfio.h | 27 ++++++++++++ 4 files changed, 110 insertions(+), 1 deletion(-) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 5184b3cac160..e256a925e7ce 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1590,6 +1590,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, return vfio_pci_core_feature_token(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_DMA_BUF: return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR: + return vfio_pci_core_feature_dma_buf_memattr(vdev, flags, arg, argsz); default: return -ENOTTY; } diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 3fa14760898f..db8b95ddbe18 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -42,7 +42,10 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct * * contained within the DMABUF size before calling this. */ - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + if (READ_ONCE(priv->memattr) == VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC) + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot); + else + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED.
*/ @@ -464,6 +467,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, priv->vdev = vdev; priv->nr_ranges = get_dma_buf.nr_ranges; priv->size = length; + priv->memattr = VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC; ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider, get_dma_buf.region_index, priv->phys_vec, dma_ranges, @@ -731,4 +735,68 @@ int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd) return ret; } + +int vfio_pci_core_feature_dma_buf_memattr( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_memattr __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf_memattr db_attr; + struct vfio_pci_dma_buf *priv; + struct dma_buf *dmabuf; + int ret; + + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys) + return -EOPNOTSUPP; + + ret = vfio_check_feature(flags, argsz, + VFIO_DEVICE_FEATURE_GET | + VFIO_DEVICE_FEATURE_SET, + sizeof(db_attr)); + if (ret != 1) + return ret; + + if (copy_from_user(&db_attr, arg, sizeof(db_attr))) + return -EFAULT; + + dmabuf = dma_buf_get(db_attr.dmabuf_fd); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + + /* Verify DMABUF: see comments in vfio_pci_dma_buf_revoke() */ + priv = dmabuf->priv; + if (dmabuf->ops != &vfio_pci_dmabuf_ops || priv->vdev != vdev) { + ret = -ENODEV; + goto out_put_buf; + } + + ret = 0; + scoped_guard(rwsem_write, &vdev->memory_lock) { + uint32_t old_attr = priv->memattr; + + if (flags & VFIO_DEVICE_FEATURE_SET) { + switch(db_attr.memattr) { + case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC: + case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC: + priv->memattr = db_attr.memattr; + break; + + default: + ret = -ENOTSUPP; + } + } + db_attr.memattr = old_attr; + } + + if (!ret && (flags & VFIO_DEVICE_FEATURE_GET)) { + if (copy_to_user(arg, &db_attr, sizeof(db_attr))) + ret = -EFAULT; + } + + out_put_buf: + dma_buf_put(dmabuf); + + return ret; + +} #endif /*
CONFIG_VFIO_PCI_DMABUF */ diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index a1e0f4fcb1dc..8067be45beb0 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -41,6 +41,7 @@ struct vfio_pci_dma_buf { struct kref kref; struct completion comp; unsigned long vma_pgoff_adjust; + u32 memattr; enum vfio_pci_dma_buf_status status; }; @@ -154,6 +155,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf __user *arg, size_t argsz); +int vfio_pci_core_feature_dma_buf_memattr( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_memattr __user *arg, + size_t argsz); int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd); #else static inline int @@ -163,6 +168,13 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, { return -ENOTTY; } +static inline int vfio_pci_core_feature_dma_buf_memattr( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_memattr __user *arg, + size_t argsz) +{ + return -ENODEV; +} static inline int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd) { diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 02366e9f8e16..9b0b68f8a1ef 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1564,6 +1564,33 @@ struct vfio_device_feature_dma_buf { */ #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12 +/** + * Given a dma_buf fd previously created by + * VFIO_DEVICE_FEATURE_DMA_BUF, GET or SET the memory attribute that + * will be used by future mmap()s of that fd. SETting a new attribute + * does not affect existing VMAs. + * + * The default, if no previous SET has been performed, is NC.
+ * + * Return: 0 on success, -1 and errno is set on failure: + * + * ENOTSUPP: The given memattr is not supported. + * EBADF, EINVAL: dmabuf_fd is not a DMABUF fd. + * ENODEV: The dmabuf_fd does not match this VFIO device. + */ +#define VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR 13 + +/* Valid memory attributes for the memattr field */ +enum vfio_device_dma_buf_memattr { + VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC = 0, /* pgprot_noncached */ + VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC = 1, /* pgprot_writecombine */ +}; + +struct vfio_device_feature_dma_buf_memattr { + __s32 dmabuf_fd; + __u32 memattr; +}; + /* -------- API for Type1 VFIO IOMMU -------- */ /** -- 2.47.3
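For illustration only, here is a rough userspace sketch of how the MEMATTR feature proposed in patch 9/9 might be driven through the existing VFIO_DEVICE_FEATURE ioctl. It is not part of the series. The sk_* names are hypothetical local mirrors: the header layout, the SET flag and the ioctl number follow the existing struct vfio_device_feature UAPI, while the feature index 13, the attribute values and the payload layout are taken from the patch above and could change before merge. The sketch also assumes the payload immediately follows the 8-byte feature header and that argsz covers header plus payload.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical local mirrors of UAPI values, so the sketch builds
 * without headers from a patched kernel tree. */
#define SK_VFIO_DEVICE_FEATURE_IOCTL _IO(';', 100 + 17) /* VFIO_DEVICE_FEATURE */
#define SK_FEATURE_SET               (1u << 17)         /* VFIO_DEVICE_FEATURE_SET */
#define SK_FEATURE_DMA_BUF_MEMATTR   13u                /* proposed in this patch */
#define SK_MEMATTR_NC                0u                 /* proposed: pgprot_noncached */
#define SK_MEMATTR_WC                1u                 /* proposed: pgprot_writecombine */

/* Fixed part of struct vfio_device_feature (without data[]). */
struct sk_feature_hdr {
	uint32_t argsz;
	uint32_t flags;
};

/* Mirrors struct vfio_device_feature_dma_buf_memattr from the patch. */
struct sk_memattr {
	int32_t  dmabuf_fd;
	uint32_t memattr;
};

/* Header followed directly by the feature payload. */
struct sk_memattr_req {
	struct sk_feature_hdr hdr;
	struct sk_memattr     attr;
};

static_assert(sizeof(struct sk_memattr_req) == 16, "unexpected padding");

/* Fill a request for the given operation (GET or SET flag) and attribute. */
void sk_fill_req(struct sk_memattr_req *req, int dmabuf_fd,
		 uint32_t op, uint32_t memattr)
{
	memset(req, 0, sizeof(*req));
	req->hdr.argsz = sizeof(*req);
	req->hdr.flags = op | SK_FEATURE_DMA_BUF_MEMATTR;
	req->attr.dmabuf_fd = dmabuf_fd;
	req->attr.memattr = memattr;
}

/* Ask the device fd to apply memattr to future mmap()s of dmabuf_fd.
 * Returns the ioctl() result: 0 on success, -1 with errno set. */
int sk_set_memattr(int device_fd, int dmabuf_fd, uint32_t memattr)
{
	struct sk_memattr_req req;

	sk_fill_req(&req, dmabuf_fd, SK_FEATURE_SET, memattr);
	return ioctl(device_fd, SK_VFIO_DEVICE_FEATURE_IOCTL, &req);
}
```

Per the UAPI comment in the patch, a SET only affects future mmap()s, so a caller wanting WC would presumably call sk_set_memattr(device_fd, dmabuf_fd, SK_MEMATTR_WC) before mapping the DMABUF fd.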