[RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64

Shivaprasad G Bhat posted 1 patch 1 week, 3 days ago
arch/powerpc/include/asm/iommu.h              |    2
arch/powerpc/kernel/iommu.c                   |  181 +++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda-tce.c |    4 -
arch/powerpc/platforms/powernv/pci-ioda.c     |    4 -
arch/powerpc/platforms/powernv/pci.h          |    2
arch/powerpc/platforms/pseries/iommu.c        |    6 -
drivers/vfio/Kconfig                          |    4 -
7 files changed, 190 insertions(+), 13 deletions(-)
Posted by Shivaprasad G Bhat 1 week, 3 days ago
This RFC attempts to implement IOMMUFD support on PPC64 by
adding new iommu_ops for a paging domain. The existing platform
domain continues to be the default domain for in-kernel use.
The domain ownership transfer ensures that the IOMMU state is
reset for both the new paging domain and in-kernel usage.

On PPC64, IOVA ranges depend on the type of the DMA window
and its properties. Currently, there is no way to expose the
attributes of the non-default 64-bit DMA window, which the platform
supports. The platform allows the operating system to select the
starting offset (either 4GiB or the default 512PiB offset), page
size and window size for the non-default 64-bit DMA window. With
VFIO, for example, this is handled via the VFIO_IOMMU_SPAPR_TCE_GET_INFO
and VFIO_IOMMU_SPAPR_TCE_CREATE|REMOVE ioctls. I am still exploring
ways to expose and configure these DMA window attributes based on
user input; any suggestions in this regard would be very helpful.
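
To make the window attributes concrete, here is a minimal userspace
sketch (not kernel code) of how the starting offset, page size and
window size determine the IOVA range and the number of TCEs. The
struct and function names are hypothetical, and the 8-byte TCE size
is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one DMA window; names are illustrative only. */
struct dma_window {
	uint64_t start;      /* bus address where the window begins */
	uint64_t size;       /* window size in bytes */
	unsigned page_shift; /* log2 of the IOMMU page size */
};

/* Number of TCEs (one per IOMMU page) needed to cover the window. */
static uint64_t window_tce_count(const struct dma_window *w)
{
	assert(w->size != 0);
	return w->size >> w->page_shift;
}

/* Last valid IOVA inside the window (inclusive). */
static uint64_t window_last_iova(const struct dma_window *w)
{
	assert(w->size != 0);
	return w->start + w->size - 1;
}

/* Bytes needed for a flat, single-level table, assuming 8-byte TCEs. */
static uint64_t window_table_bytes(const struct dma_window *w)
{
	return window_tce_count(w) * 8;
}
```

With the default window hardcoded in this RFC (offset 0, 1GB size,
4K pages, i.e. page shift 0xc), this gives 262144 TCEs, a last valid
IOVA of 0x3fffffff and a 2MiB flat table.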

The existing vfio-compat driver is specific to vfio type1 and
will not work on PPC64 even with this patch. I believe we need a
separate "vfio-spapr-compat" driver to make it work.

A brief list of current open problems and ongoing rework:
 - Second DMA window support as mentioned above.
 - KVM support.
 - EEH support.
 - The vfio compat driver for the spapr tce iommu.
 - Multiple devices (multifunction, same/different iommu group checks,
   SRIOV VF assignment) support.
 - Race conditions, device plug/unplug.
 - Selftests.

The patch currently works for a single device and exposes only the
default 1GB DMA window to the user. It has been tested with both
the PowerNV and pSeries TCE IOMMU backends. Testing was done using
QEMU [1] with a TCG guest and NVMe device passthrough. A command
like the one below can be used to try it:

qemu-system-ppc64 -machine pseries -accel tcg \
-device spapr-pci-host-bridge,index=1,id=pci.1,ddw=off \
-device vfio-pci,host=<hostdev>,id=hostdev0,\
bus=pci.1.0,addr=0x1,iommufd=iommufd0 \
-object iommufd,id=iommufd0 <...>
...
root:localhost# mount /dev/nvme0n1 /mnt
root:localhost# ls /mnt
...

The current patch is based on the Linux kernel 6.19-rc6 tree.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>

References:
[1] https://github.com/shivaprasadbhat/qemu/tree/iommufd-wip
---
 arch/powerpc/include/asm/iommu.h              |    2 
 arch/powerpc/kernel/iommu.c                   |  181 +++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |    4 -
 arch/powerpc/platforms/powernv/pci-ioda.c     |    4 -
 arch/powerpc/platforms/powernv/pci.h          |    2 
 arch/powerpc/platforms/pseries/iommu.c        |    6 -
 drivers/vfio/Kconfig                          |    4 -
 7 files changed, 190 insertions(+), 13 deletions(-)
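
For readers following the map_pages and iova_to_phys callbacks in the
diff, here is a minimal userspace sketch of the arithmetic they rely
on: the TCE index is the IOVA shifted by the IOMMU page shift, and a
TCE holds the page-aligned physical address plus permission bits in
the low bits. The constants mirror what I believe the kernel values
to be, but treat them and all helper names as assumptions of the
sketch. It also assumes the window starts at IOVA 0, as in this RFC.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's values; illustrative only. */
#define SK_IOMMU_READ     (1 << 0) /* device may read memory */
#define SK_IOMMU_WRITE    (1 << 1) /* device may write memory */
#define SK_TCE_PCI_READ   0x1ULL
#define SK_TCE_PCI_WRITE  0x2ULL

/* Index of the TCE translating iova, for a window that starts at 0. */
static uint64_t sk_tce_index(uint64_t iova, unsigned page_shift)
{
	return iova >> page_shift;
}

/* Translate IOMMU API prot flags into TCE permission bits. */
static uint64_t sk_prot_to_tce_perm(int prot)
{
	uint64_t perm = 0;

	if (prot & SK_IOMMU_READ)
		perm |= SK_TCE_PCI_READ;
	if (prot & SK_IOMMU_WRITE)
		perm |= SK_TCE_PCI_WRITE;
	return perm;
}

/* Build a TCE from a physical address and permission bits. */
static uint64_t sk_tce_encode(uint64_t paddr, unsigned page_shift, uint64_t perm)
{
	uint64_t rpn = paddr >> page_shift; /* real page number */

	assert(page_shift >= 2); /* the two permission bits must fit below the RPN */
	return (rpn << page_shift) | perm;
}

/* Recover the physical address, dropping the permission bits. */
static uint64_t sk_tce_to_phys(uint64_t tce, unsigned page_shift)
{
	return (tce >> page_shift) << page_shift;
}
```

Shifting right and then left by the page shift is what lets
iova_to_phys ignore the direction bits without naming them.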

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index eafdd63cd6c4..1dc72fbb89e7 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -46,7 +46,7 @@ struct iommu_table_ops {
 			long index, long npages,
 			unsigned long uaddr,
 			enum dma_data_direction direction,
-			unsigned long attrs);
+			unsigned long attrs, bool is_phys);
 #ifdef CONFIG_IOMMU_API
 	/*
 	 * Exchanges existing TCE with new TCE plus direction bits;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0ce71310b7d9..e6543480c461 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -365,7 +365,7 @@ static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 	/* Put the TCEs in the HW table */
 	build_fail = tbl->it_ops->set(tbl, entry, npages,
 				      (unsigned long)page &
-				      IOMMU_PAGE_MASK(tbl), direction, attrs);
+				      IOMMU_PAGE_MASK(tbl), direction, attrs, false);
 
 	/* tbl->it_ops->set() only returns non-zero for transient errors.
 	 * Clean up the table bitmap in this case and return
@@ -539,7 +539,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table *tbl,
 		/* Insert into HW table */
 		build_fail = tbl->it_ops->set(tbl, entry, npages,
 					      vaddr & IOMMU_PAGE_MASK(tbl),
-					      direction, attrs);
+					      direction, attrs, false);
 		if(unlikely(build_fail))
 			goto failure;
 
@@ -1201,7 +1201,15 @@ spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
 	 * also sets the dma_api ops
 	 */
 	table_group = iommu_group_get_iommudata(grp);
+
+	if (old && old->type == IOMMU_DOMAIN_DMA) {
+		ret = table_group->ops->unset_window(table_group, 0);
+		if (ret)
+			goto exit;
+	}
+
 	ret = table_group->ops->take_ownership(table_group, dev);
+exit:
 	iommu_group_put(grp);
 
 	return ret;
@@ -1260,6 +1268,167 @@ static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
 	return hose->controller_ops.device_group(hose, pdev);
 }
 
+struct ppc64_domain {
+	struct iommu_domain  domain;
+	struct device        *device; /* Make it a list */
+	struct iommu_table   *table;
+	spinlock_t           list_lock;
+	struct rcu_head      rcu;
+};
+
+static struct ppc64_domain *to_ppc64_domain(struct iommu_domain *dom)
+{
+	return container_of(dom, struct ppc64_domain, domain);
+}
+
+static void spapr_tce_domain_free(struct iommu_domain *domain)
+{
+	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
+
+	kfree(ppc64_domain);
+}
+
+static const struct iommu_ops spapr_tce_iommu_ops;
+static struct iommu_domain *spapr_tce_domain_alloc_paging(struct device *dev)
+{
+	struct iommu_group *grp = iommu_group_get(dev);
+	struct iommu_table_group *table_group;
+	struct ppc64_domain *ppc64_domain;
+	struct iommu_table *ptbl;
+	int ret = -1;
+
+	table_group = iommu_group_get_iommudata(grp);
+	ppc64_domain = kzalloc(sizeof(*ppc64_domain), GFP_KERNEL);
+	if (!ppc64_domain)
+		return NULL;
+
+	/* Just the default window hardcode for now */
+	ret = table_group->ops->create_table(table_group, 0, 0xc, 0x40000000, 1, &ptbl);
+	iommu_tce_table_get(ptbl);
+	ppc64_domain->table = ptbl; /* REVISIT: Single device for now */
+	if (!ppc64_domain->table) {
+		kfree(ppc64_domain);
+		iommu_tce_table_put(ptbl);
+		iommu_group_put(grp);
+		return NULL;
+	}
+
+	table_group->ops->set_window(table_group, 0, ptbl);
+	iommu_group_put(grp);
+
+	ppc64_domain->domain.pgsize_bitmap = SZ_4K;
+	ppc64_domain->domain.geometry.force_aperture = true;
+	ppc64_domain->domain.geometry.aperture_start = 0;
+	ppc64_domain->domain.geometry.aperture_end = 0x40000000 - 1; /* default window, inclusive end */
+	ppc64_domain->domain.ops = spapr_tce_iommu_ops.default_domain_ops;
+
+	spin_lock_init(&ppc64_domain->list_lock);
+
+	return &ppc64_domain->domain;
+}
+
+static size_t spapr_tce_iommu_unmap_pages(struct iommu_domain *domain,
+				unsigned long iova,
+				size_t pgsize, size_t pgcount,
+				struct iommu_iotlb_gather *gather)
+{
+	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
+	struct iommu_table *tbl = ppc64_domain->table;
+	unsigned long pgshift = __ffs(pgsize);
+	size_t size = pgcount << pgshift;
+	size_t mapped = 0;
+	unsigned int tcenum;
+	int  mask;
+
+	if (pgsize != SZ_4K)
+		return 0; /* size_t return: report that nothing was unmapped */
+
+	size = PAGE_ALIGN(size);
+
+	mask = IOMMU_PAGE_MASK(tbl);
+	tcenum = iova >> tbl->it_page_shift;
+
+	tbl->it_ops->clear(tbl, tcenum, pgcount);
+
+	mapped = pgsize * pgcount;
+
+	return mapped;
+}
+
+static phys_addr_t spapr_tce_iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
+{
+	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
+	struct iommu_table *tbl = ppc64_domain->table;
+	phys_addr_t paddr, rpn, tceval;
+	unsigned int tcenum;
+
+	tcenum = iova >> tbl->it_page_shift;
+	tceval = tbl->it_ops->get(tbl, tcenum);
+
+	/* Ignore the direction bits */
+	rpn = tceval >> tbl->it_page_shift;
+	paddr = rpn << tbl->it_page_shift;
+
+	return paddr;
+}
+
+static int spapr_tce_iommu_map_pages(struct iommu_domain *domain,
+				unsigned long iova, phys_addr_t paddr,
+				size_t pgsize, size_t pgcount,
+				int prot, gfp_t gfp, size_t *mapped)
+{
+	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
+	enum dma_data_direction direction = DMA_BIDIRECTIONAL;
+	struct iommu_table *tbl = ppc64_domain->table;
+	unsigned long pgshift = __ffs(pgsize);
+	size_t size = pgcount << pgshift;
+	unsigned int tcenum;
+	int ret;
+
+	if (pgsize != SZ_4K)
+		return -EINVAL;
+
+	if (iova < ppc64_domain->domain.geometry.aperture_start ||
+	    (iova + size - 1) > ppc64_domain->domain.geometry.aperture_end)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(iova | paddr, pgsize))
+		return -EINVAL;
+
+	if (!(prot & IOMMU_WRITE))
+		direction = DMA_TO_DEVICE;
+
+	if (!(prot & IOMMU_READ))
+		direction = DMA_FROM_DEVICE;
+
+	size = PAGE_ALIGN(size);
+	tcenum = iova >> tbl->it_page_shift;
+
+	/* Put the TCEs in the HW table */
+	ret = tbl->it_ops->set(tbl, tcenum, pgcount,
+				paddr, direction, 0, true);
+	if (!ret && mapped)
+		*mapped = pgsize * pgcount;
+
+	return ret;
+}
+
+static int spapr_tce_iommu_attach_device(struct iommu_domain *domain,
+				    struct device *dev, struct iommu_domain *old)
+{
+	struct ppc64_domain *ppc64_domain = to_ppc64_domain(domain);
+
+	/* REVISIT */
+	if (!domain)
+		return 0;
+
+	/* REVISIT: Check table group, list handling */
+	ppc64_domain->device = dev;
+
+	return 0;
+}
+
+
 static const struct iommu_ops spapr_tce_iommu_ops = {
 	.default_domain = &spapr_tce_platform_domain,
 	.blocked_domain = &spapr_tce_blocked_domain,
@@ -1267,6 +1436,14 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
 	.probe_device = spapr_tce_iommu_probe_device,
 	.release_device = spapr_tce_iommu_release_device,
 	.device_group = spapr_tce_iommu_device_group,
+	.domain_alloc_paging = spapr_tce_domain_alloc_paging,
+	.default_domain_ops = &(const struct iommu_domain_ops) {
+		.attach_dev     = spapr_tce_iommu_attach_device,
+		.map_pages      = spapr_tce_iommu_map_pages,
+		.unmap_pages    = spapr_tce_iommu_unmap_pages,
+		.iova_to_phys   = spapr_tce_iommu_iova_to_phys,
+		.free           = spapr_tce_domain_free,
+	}
 };
 
 static struct attribute *spapr_tce_iommu_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index e96324502db0..8800bf86d17a 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -123,10 +123,10 @@ static __be64 *pnv_tce(struct iommu_table *tbl, bool user, long idx, bool alloc)
 
 int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction,
-		unsigned long attrs)
+		unsigned long attrs, bool is_phys)
 {
 	u64 proto_tce = iommu_direction_to_tce_perm(direction);
-	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;
+	u64 rpn = !is_phys ? __pa(uaddr) >> tbl->it_page_shift : uaddr >> tbl->it_page_shift;
 	long i;
 
 	if (proto_tce & TCE_PCI_WRITE)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index b0c1d9d16fb5..610146a63e3b 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1241,10 +1241,10 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl,
 static int pnv_ioda2_tce_build(struct iommu_table *tbl, long index,
 		long npages, unsigned long uaddr,
 		enum dma_data_direction direction,
-		unsigned long attrs)
+		unsigned long attrs, bool is_phys)
 {
 	int ret = pnv_tce_build(tbl, index, npages, uaddr, direction,
-			attrs);
+			attrs, is_phys);
 
 	if (!ret)
 		pnv_pci_ioda2_tce_invalidate(tbl, index, npages);
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 42075501663b..3579ecd55d00 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -300,7 +300,7 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
 
 extern int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction,
-		unsigned long attrs);
+		unsigned long attrs, bool is_phys);
 extern void pnv_tce_free(struct iommu_table *tbl, long index, long npages);
 extern int pnv_tce_xchg(struct iommu_table *tbl, long index,
 		unsigned long *hpa, enum dma_data_direction *direction);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index eec333dd2e59..8c6f9f18e462 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -122,7 +122,7 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group,
 static int tce_build_pSeries(struct iommu_table *tbl, long index,
 			      long npages, unsigned long uaddr,
 			      enum dma_data_direction direction,
-			      unsigned long attrs)
+			      unsigned long attrs, bool is_phys)
 {
 	u64 proto_tce;
 	__be64 *tcep;
@@ -250,7 +250,7 @@ static DEFINE_PER_CPU(__be64 *, tce_page);
 static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
 				     enum dma_data_direction direction,
-				     unsigned long attrs)
+				     unsigned long attrs, bool is_phys)
 {
 	u64 rc = 0;
 	u64 proto_tce;
@@ -287,7 +287,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		__this_cpu_write(tce_page, tcep);
 	}
 
-	rpn = __pa(uaddr) >> tceshift;
+	rpn = !is_phys ? __pa(uaddr) >> tceshift : uaddr >> tceshift;
 	proto_tce = TCE_PCI_READ;
 	if (direction != DMA_TO_DEVICE)
 		proto_tce |= TCE_PCI_WRITE;
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index ceae52fd7586..9929aa78a5da 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -4,7 +4,7 @@ menuconfig VFIO
 	select IOMMU_API
 	depends on IOMMUFD || !IOMMUFD
 	select INTERVAL_TREE
-	select VFIO_GROUP if SPAPR_TCE_IOMMU || IOMMUFD=n
+	select VFIO_GROUP if IOMMUFD=n
 	select VFIO_DEVICE_CDEV if !VFIO_GROUP
 	select VFIO_CONTAINER if IOMMUFD=n
 	help
@@ -16,7 +16,7 @@ menuconfig VFIO
 if VFIO
 config VFIO_DEVICE_CDEV
 	bool "Support for the VFIO cdev /dev/vfio/devices/vfioX"
-	depends on IOMMUFD && !SPAPR_TCE_IOMMU
+	depends on IOMMUFD
 	default !VFIO_GROUP
 	help
 	  The VFIO device cdev is another way for userspace to get device
Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Posted by Christophe Leroy (CS GROUP) 1 week, 3 days ago

Le 27/01/2026 à 19:35, Shivaprasad G Bhat a écrit :
> The RFC attempts to implement the IOMMUFD support on PPC64 by
> adding new iommu_ops for paging domain. The existing platform
> domain continues to be the default domain for in-kernel use.
> The domain ownership transfer ensures the reset of iommu states
> for the new paging domain and in-kernel usage.
> 
> On PPC64, IOVA ranges are based on the type of the DMA window
> and their properties. Currently, there is no way to expose the
> attributes of the non-default 64-bit DMA window, which the platform
> supports. The platform allows the operating system to select the
> starting offset(at 4GiB or 512PiB default offset), pagesize and
> window size for the non-default 64-bit DMA window. For example,
> with VFIO, this is handled via VFIO_IOMMU_SPAPR_TCE_GET_INFO
> and VFIO_IOMMU_SPAPR_TCE_CREATE|REMOVE ioctls. While I am exploring
> the ways to expose and configure these DMA window attributes as
> per user input, any suggestions in this regard will be very helpful.
> 
> Currently existing vfio type1 specific vfio-compat driver even
> with this patch will not work for PPC64. I believe we need to have
> a separate "vfio-spapr-compat" driver to make it work.
> 
> So brief list of current open problems and ongoing reworks:
>   - Second DMA window support as mentioned above.
>   - KVM support.
>   - EEH support.
>   - The vfio compat driver for the spapr tce iommu.
>   - Multiple devices (multifunction, same/different iommu group checks,
>     SRIOV VF assignment) support.
>   - Race conditions, device plug/unplug.
>   - self|tests.
> 
> The patch currently works for single device and exposes only the
> default DMA window of 1GB to the user. It has been tested for
> both PowerNV and pSeries machine tce iommu backends. The testing
> was done using a Qemu[1] and TCG guest having a NVME device
> passthrough. One can use the command like below to try:
> 
> qemu-system-ppc64 -machine pseries -accel tcg \
> -device spapr-pci-host-bridge,index=1,id=pci.1,ddw=off \
> -device vfio-pci,host=<hostdev>,id=hostdev0,\
> bus=pci.1.0,addr=0x1,iommufd=iommufd0 \
> -object iommufd,id=iommufd0 <...>
> ...
> root:localhost# mount /dev/nvme0n1 /mnt
> root:localhost# ls /mnt
> ...
> 
> The current patch is based on linux kernel 6.19-rc6 tree.

I get the following build failure on linuxppc-dev patchwork with
g5_defconfig or ppc64_defconfig:

Error: /linux/arch/powerpc/sysdev/dart_iommu.c:325:9: error: 
initialization of 'int (*)(struct iommu_table *, long int,  long int, 
long unsigned int,  enum dma_data_direction,  long unsigned int,  bool)' 
{aka 'int (*)(struct iommu_table *, long int,  long int,  long unsigned 
int,  enum dma_data_direction,  long unsigned int,  _Bool)'} from 
incompatible pointer type 'int (*)(struct iommu_table *, long int,  long 
int,  long unsigned int,  enum dma_data_direction,  long unsigned int)' 
[-Werror=incompatible-pointer-types]
   .set = dart_build,
          ^~~~~~~~~~
/linux/arch/powerpc/sysdev/dart_iommu.c:325:9: note: (near 
initialization for 'iommu_dart_ops.set')
cc1: all warnings being treated as errors
make[5]: *** [/linux/scripts/Makefile.build:287: 
arch/powerpc/sysdev/dart_iommu.o] Error 1
make[4]: *** [/linux/scripts/Makefile.build:544: arch/powerpc/sysdev] 
Error 2

Christophe


Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Posted by Shivaprasad G Bhat 3 days, 23 hours ago
Hi Christophe,

Thanks for trying the patch.

On 1/28/26 4:23 PM, Christophe Leroy (CS GROUP) wrote:
>
>
> Le 27/01/2026 à 19:35, Shivaprasad G Bhat a écrit :
>> The RFC attempts to implement the IOMMUFD support on PPC64 by
>> adding new iommu_ops for paging domain. The existing platform
>> domain continues to be the default domain for in-kernel use.
>> The domain ownership transfer ensures the reset of iommu states
>> for the new paging domain and in-kernel usage.


<snip/>

>> ...
>> root:localhost# mount /dev/nvme0n1 /mnt
>> root:localhost# ls /mnt
>> ...
>>
>> The current patch is based on linux kernel 6.19-rc6 tree.
>
> Getting the following build failure on linuxppc-dev patchwork with 
> g5_defconfig or ppc64_defconfig:
>
> Error: /linux/arch/powerpc/sysdev/dart_iommu.c:325:9: error: 
> initialization of 'int (*)(struct iommu_table *, long int,  long int, 
> long unsigned int,  enum dma_data_direction,  long unsigned int,  
> bool)' {aka 'int (*)(struct iommu_table *, long int,  long int,  long 
> unsigned int,  enum dma_data_direction,  long unsigned int,  _Bool)'} 
> from incompatible pointer type 'int (*)(struct iommu_table *, long 
> int,  long int,  long unsigned int,  enum dma_data_direction,  long 
> unsigned int)' [-Werror=incompatible-pointer-types]
>   .set = dart_build,
>          ^~~~~~~~~~
> /linux/arch/powerpc/sysdev/dart_iommu.c:325:9: note: (near 
> initialization for 'iommu_dart_ops.set')
> cc1: all warnings being treated as errors
> make[5]: *** [/linux/scripts/Makefile.build:287: 
> arch/powerpc/sysdev/dart_iommu.o] Error 1
> make[4]: *** [/linux/scripts/Makefile.build:544: arch/powerpc/sysdev] 
> Error 2


I was only testing the pseries and powernv configs. I see that the changes
would break the pasemi and dart IOMMUs.

The diff below should get it building:

===========================

diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
index 375487cba874..75b526a560b8 100644
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -77,7 +77,7 @@ static int iommu_table_iobmap_inited;
  static int iobmap_build(struct iommu_table *tbl, long index,
                          long npages, unsigned long uaddr,
                          enum dma_data_direction direction,
-                        unsigned long attrs)
+                        unsigned long attrs, bool is_phys)
  {
         u32 *ip;
         u32 rpn;
diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
index c0d10c149661..b424a602d07a 100644
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -171,7 +171,7 @@ static void dart_flush(struct iommu_table *tbl)
  static int dart_build(struct iommu_table *tbl, long index,
                        long npages, unsigned long uaddr,
                        enum dma_data_direction direction,
-                      unsigned long attrs)
+                      unsigned long attrs, bool is_phys)
  {
         unsigned int *dp, *orig_dp;
         unsigned int rpn;

===========================


I will take care of it in the next version.
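
For anyone following along, the failure comes from the shared .set callback
type: once the prototype in the ops structure gains a parameter, every backend
assigned to it has to match. Below is a minimal userspace sketch of the same
situation. The names demo_table_ops, demo_build and demo_rpn are made up for
illustration, and DEMO_LINEAR_OFFSET is only an assumed stand-in for the
offset that __pa() removes; none of this is the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for the linear-map offset that __pa() strips. */
#define DEMO_LINEAR_OFFSET 0xc000000000000000ULL

/* Hypothetical ops table; the callback type carries the extra flag. */
struct demo_table_ops {
	int (*set)(uint64_t *slot, uint64_t uaddr, unsigned int shift,
		   bool is_phys);
};

/* Page number from either a physical or a linear-map virtual address. */
static uint64_t demo_rpn(uint64_t uaddr, unsigned int shift, bool is_phys)
{
	uint64_t pa = is_phys ? uaddr : uaddr - DEMO_LINEAR_OFFSET;

	return pa >> shift;
}

/* Every backend must accept the new parameter to match the ops type. */
static int demo_build(uint64_t *slot, uint64_t uaddr, unsigned int shift,
		      bool is_phys)
{
	assert(slot != NULL);
	*slot = demo_rpn(uaddr, shift, is_phys);
	return 0;
}

static const struct demo_table_ops demo_ops = {
	/* Diagnosed as an incompatible pointer if demo_build lacks the bool. */
	.set = demo_build,
};
```

A backend that has no use for the flag still needs it in its prototype, which
is all the pasemi and dart hunks above do.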


Thank you!


Regards,

Shivaprasad

>
> Christophe
>
>
<snip/>
Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Posted by Jason Gunthorpe 1 week, 3 days ago
On Tue, Jan 27, 2026 at 06:35:56PM +0000, Shivaprasad G Bhat wrote:
> The RFC attempts to implement the IOMMUFD support on PPC64 by
> adding new iommu_ops for paging domain. The existing platform
> domain continues to be the default domain for in-kernel use.

It would be nice to see the platform domain go away and ppc use the
normal dma-iommu.c stuff, but I don't think it is critical to making
it work with iommufd.

> On PPC64, IOVA ranges are based on the type of the DMA window
> and their properties. Currently, there is no way to expose the
> attributes of the non-default 64-bit DMA window, which the platform
> supports. The platform allows the operating system to select the
> starting offset(at 4GiB or 512PiB default offset), pagesize and
> window size for the non-default 64-bit DMA window. For example,
> with VFIO, this is handled via VFIO_IOMMU_SPAPR_TCE_GET_INFO
> and VFIO_IOMMU_SPAPR_TCE_CREATE|REMOVE ioctls. While I am exploring
> the ways to expose and configure these DMA window attributes as
> per user input, any suggestions in this regard will be very helpful.

You can pass in driver specific information during HWPT creation, so
any properties you need can be specified there.

Then you'd want to introduce a new domain op to get the apertures
instead of the single range hard coded into the domain struct. The new
op would be able to return a list. We can use this op to return
apertures for sign extension page tables too.

Update iommufd to calculate the reserved regions by evaluating the
whole list.

I think you'll find this pretty straightforward; I'd do it as a
followup patch to this one.
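
A rough sketch of the "evaluate the whole list" step, in plain C outside the
kernel. struct demo_aperture, struct demo_range and
demo_reserved_from_apertures() are made-up names, not existing iommufd or
IOMMU core API. The sketch assumes the apertures are sorted by start address,
do not overlap, and use inclusive end addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_aperture { uint64_t start, last; };	/* usable IOVA window */
struct demo_range { uint64_t start, last; };	/* reserved IOVA range */

/*
 * Everything in [0, UINT64_MAX] that no aperture covers becomes reserved.
 * Returns the number of ranges written, or -1 if "out" is too small.
 */
static int demo_reserved_from_apertures(const struct demo_aperture *ap,
					size_t nr_ap,
					struct demo_range *out,
					size_t max_out)
{
	uint64_t next = 0;	/* first IOVA not yet accounted for */
	int covered_to_top = 0;
	size_t n = 0;

	assert(out != NULL);
	for (size_t i = 0; i < nr_ap; i++) {
		if (ap[i].start > next) {
			/* Gap below this aperture is reserved. */
			if (n == max_out)
				return -1;
			out[n].start = next;
			out[n].last = ap[i].start - 1;
			n++;
		}
		if (ap[i].last == UINT64_MAX) {
			covered_to_top = 1;
			break;
		}
		next = ap[i].last + 1;
	}
	if (!covered_to_top) {
		/* Tail above the last aperture is reserved too. */
		if (n == max_out)
			return -1;
		out[n].start = next;
		out[n].last = UINT64_MAX;
		n++;
	}
	return (int)n;
}
```

With, say, a 2GiB window at 0 and a second window starting at 512PiB (sizes
picked only for illustration), this yields two reserved ranges: the hole
between the windows and everything above the second one.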

> Currently existing vfio type1 specific vfio-compat driver even
> with this patch will not work for PPC64. I believe we need to have
> a separate "vfio-spapr-compat" driver to make it work.

Yes, vfio-compat doesn't support the special spapr ioctls.

I don't think you need a new driver, just implement whatever they do
with the existing interfaces, probably in its own .c file though.

However, I have no idea what is required to implement those ops, or if
it is even possible. It may be easier to just leave the old vfio
stuff around instead of trying to compat it. The purpose of compat was
to be able to build kernels without type1 at all. It isn't necessary
to start using iommufd in new apps with the new interfaces.

Given you are mainly looking at a VMM that already will have iommufd
support it may not be worthwhile.

> @@ -1201,7 +1201,15 @@ spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
>  	 * also sets the dma_api ops
>  	 */
>  	table_group = iommu_group_get_iommudata(grp);
> +
> +	if (old && old->type == IOMMU_DOMAIN_DMA) {

I'm trying to delete IOMMU_DOMAIN_DMA, please don't use it in
drivers.

>  static const struct iommu_ops spapr_tce_iommu_ops = {
>  	.default_domain = &spapr_tce_platform_domain,
>  	.blocked_domain = &spapr_tce_blocked_domain,
> @@ -1267,6 +1436,14 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
>  	.probe_device = spapr_tce_iommu_probe_device,
>  	.release_device = spapr_tce_iommu_release_device,
>  	.device_group = spapr_tce_iommu_device_group,
> +	.domain_alloc_paging = spapr_tce_domain_alloc_paging,
> +	.default_domain_ops = &(const struct iommu_domain_ops) {
> +		.attach_dev     = spapr_tce_iommu_attach_device,
> +		.map_pages      = spapr_tce_iommu_map_pages,
> +		.unmap_pages    = spapr_tce_iommu_unmap_pages,
> +		.iova_to_phys   = spapr_tce_iommu_iova_to_phys,
> +		.free           = spapr_tce_domain_free,
> +	}

Please don't use default_domain_ops in a driver that is supporting
multiple domain types and platform, it becomes confusing to guess
which domain type those ops are linked to.
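
As a sketch of the alternative, each domain can point at its own ops table,
so it is obvious which callbacks belong to which domain type. This is a
userspace model only: the demo_* types and functions are hypothetical and
merely mirror the shape of a domain carrying an ops pointer that the
allocation path fills in.

```c
#include <assert.h>
#include <stdlib.h>

enum demo_domain_type { DEMO_PLATFORM, DEMO_PAGING };

struct demo_domain;

struct demo_domain_ops {
	int (*attach_dev)(struct demo_domain *domain);
	void (*free)(struct demo_domain *domain);
};

struct demo_domain {
	enum demo_domain_type type;
	const struct demo_domain_ops *ops;	/* set per domain, no default */
};

static int demo_platform_attach(struct demo_domain *d) { (void)d; return 0; }
static int demo_paging_attach(struct demo_domain *d) { (void)d; return 0; }

static void demo_paging_free(struct demo_domain *d)
{
	/* Only paging domains are ever allocated, so only they are freed. */
	assert(d->type == DEMO_PAGING);
	free(d);
}

/* Static platform domain with its own, clearly named ops. */
static const struct demo_domain_ops demo_platform_ops = {
	.attach_dev = demo_platform_attach,
};

static struct demo_domain demo_platform_domain = {
	.type = DEMO_PLATFORM,
	.ops = &demo_platform_ops,
};

/* Paging ops live in their own table, assigned at allocation time. */
static const struct demo_domain_ops demo_paging_ops = {
	.attach_dev = demo_paging_attach,
	.free = demo_paging_free,
};

static struct demo_domain *demo_alloc_paging(void)
{
	struct demo_domain *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->type = DEMO_PAGING;
	d->ops = &demo_paging_ops;
	return d;
}
```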

You should also implement the BLOCKING domain type to make VFIO work
better.

I wouldn't try to guess if this is right or not, but it looks pretty
reasonable as a first start.

Jason
Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Posted by Shivaprasad G Bhat 3 days, 23 hours ago
Hi Jason,


Thanks for reviewing this patch.


On 1/28/26 12:46 AM, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 06:35:56PM +0000, Shivaprasad G Bhat wrote:
>> The RFC attempts to implement the IOMMUFD support on PPC64 by
>> adding new iommu_ops for paging domain. The existing platform
>> domain continues to be the default domain for in-kernel use.
> It would be nice to see the platform domain go away and ppc use the
> normal dma-iommu.c stuff, but I don't think it is critical to making
> it work with iommufd.


I agree. I have started on this. I will send incremental changes
as a follow-up after this.


>> On PPC64, IOVA ranges are based on the type of the DMA window
>> and their properties. Currently, there is no way to expose the
>> attributes of the non-default 64-bit DMA window, which the platform
>> supports. The platform allows the operating system to select the
>> starting offset(at 4GiB or 512PiB default offset), pagesize and
>> window size for the non-default 64-bit DMA window. For example,
>> with VFIO, this is handled via VFIO_IOMMU_SPAPR_TCE_GET_INFO
>> and VFIO_IOMMU_SPAPR_TCE_CREATE|REMOVE ioctls. While I am exploring
>> the ways to expose and configure these DMA window attributes as
>> per user input, any suggestions in this regard will be very helpful.
> You can pass in driver specific information during HWPT creation, so
> any properties you need can be specified there.


Sure. I think IOMMU_GET_HW_INFO would be useful for getting the
platform-supported configuration in this case.


> Then you'd want to introduce a new domain op to get the apertures
> instead of the single range hard coded into the domain struct. The new
> op would be able to return a list. We can use this op to return
> apertures for sign extension page tables too.
>
> Update iommufd to calculate the reserved regions by evaluating the
> whole list.
>
> I think you'll find this pretty straight forward, I'd do it as a
> followup patch to this one.


Thanks. I will wait for that patch.


>
>> Currently existing vfio type1 specific vfio-compat driver even
>> with this patch will not work for PPC64. I believe we need to have
>> a separate "vfio-spapr-compat" driver to make it work.
> Yes, vfio-compat doesn't support the special spapr ioctls.
>
> I don't think you need a new driver, just implement whatever they do
> with the existing interfaces, probably in its own .c file though.

There are ioctl number conflicts, for example:

# grep -n "VFIO_BASE + 1[89]" include/uapi/linux/vfio.h | grep define
940:#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 18)
976:#define VFIO_DEVICE_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 19)
1833:#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
1856:#define VFIO_IOMMU_SPAPR_TCE_CREATE _IO(VFIO_TYPE, VFIO_BASE + 19)
# grep -n "VFIO_BASE + 20" include/uapi/linux/vfio.h | grep define
999:#define VFIO_DEVICE_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 20)
1870:#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)

> However, I have no idea what is required to implement those ops, or if
> it is even possible.. It may be easier to just leave the old vfio
> stuff around instead of trying to compat it. The purpose of compat was
> to be able to build kernels without type1 at all. It isn't necessary
> to start using iommufd in new apps with the new interfaces.
>
> Given you are mainly looking at a VMM that already will have iommufd
> support it may not be worthwhile.


You are right. We do have some use cases beyond VMMs; I will consider a
compat driver only if it is helpful there.


>> @@ -1201,7 +1201,15 @@ spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
>>   	 * also sets the dma_api ops
>>   	 */
>>   	table_group = iommu_group_get_iommudata(grp);
>> +
>> +	if (old && old->type == IOMMU_DOMAIN_DMA) {
> I'm trying to delete IOMMU_DOMAIN_DMA please don't use it in
> drivers.
Sure.
>>   static const struct iommu_ops spapr_tce_iommu_ops = {
>>   	.default_domain = &spapr_tce_platform_domain,
>>   	.blocked_domain = &spapr_tce_blocked_domain,
>> @@ -1267,6 +1436,14 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
>>   	.probe_device = spapr_tce_iommu_probe_device,
>>   	.release_device = spapr_tce_iommu_release_device,
>>   	.device_group = spapr_tce_iommu_device_group,
>> +	.domain_alloc_paging = spapr_tce_domain_alloc_paging,
>> +	.default_domain_ops = &(const struct iommu_domain_ops) {
>> +		.attach_dev     = spapr_tce_iommu_attach_device,
>> +		.map_pages      = spapr_tce_iommu_map_pages,
>> +		.unmap_pages    = spapr_tce_iommu_unmap_pages,
>> +		.iova_to_phys   = spapr_tce_iommu_iova_to_phys,
>> +		.free           = spapr_tce_domain_free,
>> +	}
> Please don't use default_domain_ops in a driver that is supporting
> multiple domain types and platform, it becomes confusing to guess
> which domain type those ops are linked to.


Sure.


> You should also implement the BLOCKING domain type to make VFIO work
> better


I am not sure how this would make VFIO work better. Maybe I am not able
to see the advantages with the current platform domain approach
in place. Could you please elaborate on this?


> I wouldn't try to guess if this is right or not, but it looks pretty
> reasonable as a first start.


Thanks, I will keep iterating on this as an RFC until it gets into reasonable shape.


Regards,

Shivaprasad

> Jason
Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Posted by Jason Gunthorpe 3 days, 21 hours ago
On Tue, Feb 03, 2026 at 09:22:13PM +0530, Shivaprasad G Bhat wrote:
> > Then you'd want to introduce a new domain op to get the apertures
> > instead of the single range hard coded into the domain struct. The new
> > op would be able to return a list. We can use this op to return
> > apertures for sign extension page tables too.
> > 
> > Update iommufd to calculate the reserved regions by evaluating the
> > whole list.
> > 
> > I think you'll find this pretty straight forward, I'd do it as a
> > followup patch to this one.
> 
> 
> Thanks. I will wait for that patch.

I think you will have to make it :)

> There are ioctl number conflicts like
> 
> # grep -n "VFIO_BASE + 1[89]" include/uapi/linux/vfio.h | grep define
> 940:#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 18)
> 976:#define VFIO_DEVICE_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 19)
> 1833:#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
> 1856:#define VFIO_IOMMU_SPAPR_TCE_CREATE _IO(VFIO_TYPE, VFIO_BASE + 19)
> # grep -n "VFIO_BASE + 20" include/uapi/linux/vfio.h | grep define
> 999:#define VFIO_DEVICE_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 20)
> 1870:#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)

It's OK, the compat code will know what type it is operating in before
it decodes any of those.
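
One way to picture this, as a userspace sketch under assumptions: the handler
first looks at which kind of context it is serving and only then interprets
the command number, so the same number can mean different things without
ambiguity. enum demo_ctx_kind and demo_decode() are hypothetical names, and
DEMO_BASE just mirrors the "VFIO_BASE + n" numbering from the grep output.
This is not the actual compat code.

```c
#include <assert.h>
#include <string.h>

#define DEMO_BASE 100

/* Which interface the file descriptor was set up for (hypothetical). */
enum demo_ctx_kind { DEMO_CTX_DEVICE_CDEV, DEMO_CTX_SPAPR_CONTAINER };

/* Resolve a command number to a name, given the already-known context. */
static const char *demo_decode(enum demo_ctx_kind kind, unsigned int nr)
{
	switch (kind) {
	case DEMO_CTX_DEVICE_CDEV:
		switch (nr) {
		case DEMO_BASE + 18: return "VFIO_DEVICE_BIND_IOMMUFD";
		case DEMO_BASE + 19: return "VFIO_DEVICE_ATTACH_IOMMUFD_PT";
		case DEMO_BASE + 20: return "VFIO_DEVICE_DETACH_IOMMUFD_PT";
		}
		break;
	case DEMO_CTX_SPAPR_CONTAINER:
		switch (nr) {
		case DEMO_BASE + 18: return "VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY";
		case DEMO_BASE + 19: return "VFIO_IOMMU_SPAPR_TCE_CREATE";
		case DEMO_BASE + 20: return "VFIO_IOMMU_SPAPR_TCE_REMOVE";
		}
		break;
	}
	return "unknown";
}
```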

> You are right. We do have some use cases beyond VMM, I will consider compat
> driver only if it is helpful there.

You can also use the type1 compat mode which will magically start
working with PPC.

> > You should also implement the BLOCKING domain type to make VFIO work
> > better 
 
> I am not sure how this could help making VFIO better. May be, I am not able
> to imagine the advantages with the current platform domain approach
> in place. Could you please elaborate more on this?

VFIO always uses a BLOCKED domain when it opens the device, then it
changes to a paging domain. If the driver doesn't support a native
BLOCKED domain then it allocates an empty page table and uses that.

A proper native BLOCKED domain has better error handling
characteristics since it is not allowed to fail attach and it doesn't
require allocation.
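
To make the difference concrete, here is a small userspace model of the two
paths. Everything in it is hypothetical (demo_domain, demo_get_blocking() and
the simulate_oom knob are invented for illustration): the native case hands
back a static object, while the fallback has to allocate an empty paging
domain and can therefore fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_domain {
	bool blocks_dma;
	bool is_static;
};

/* Native BLOCKED domain: a single static object, nothing to allocate. */
static struct demo_domain demo_blocked_domain = {
	.blocks_dma = true,
	.is_static = true,
};

/* Fallback: an empty paging domain blocks DMA too, but needs memory. */
static struct demo_domain *demo_alloc_empty_paging(bool simulate_oom)
{
	struct demo_domain *d;

	if (simulate_oom)
		return NULL;
	d = calloc(1, sizeof(*d));
	if (!d)
		return NULL;
	d->blocks_dma = true;	/* no mappings installed, so all DMA faults */
	return d;
}

/* Pick the blocking domain; only the fallback path can return NULL. */
static struct demo_domain *demo_get_blocking(bool have_native,
					     bool simulate_oom)
{
	if (have_native)
		return &demo_blocked_domain;
	return demo_alloc_empty_paging(simulate_oom);
}

static void demo_put_blocking(struct demo_domain *d)
{
	if (d && !d->is_static)
		free(d);
}
```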

I think you will also find what you are doing easier if you push the
iommu_domain down through the PPC iommu ops instead of retaining these
unnecessary historical layers.

Jason
Re: [RFC PATCH] powerpc: iommu: Initial IOMMUFD support for PPC64
Posted by Shivaprasad G Bhat 2 days, 23 hours ago
On 2/3/26 11:37 PM, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 09:22:13PM +0530, Shivaprasad G Bhat wrote:
>>> Then you'd want to introduce a new domain op to get the apertures
>>> instead of the single range hard coded into the domain struct. The new
>>> op would be able to return a list. We can use this op to return
>>> apertures for sign extension page tables too.
>>>
>>> Update iommufd to calculate the reserved regions by evaluating the
>>> whole list.
>>>
>>> I think you'll find this pretty straight forward, I'd do it as a
>>> followup patch to this one.
>>
>> Thanks. I will wait for that patch.
> I think you will have to make it :)


Sure. Thanks!


>
>> There are ioctl number conflicts like
>>
>> # grep -n "VFIO_BASE + 1[89]" include/uapi/linux/vfio.h | grep define
>> 940:#define VFIO_DEVICE_BIND_IOMMUFD _IO(VFIO_TYPE, VFIO_BASE + 18)
>> 976:#define VFIO_DEVICE_ATTACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 19)
>> 1833:#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY _IO(VFIO_TYPE, VFIO_BASE + 18)
>> 1856:#define VFIO_IOMMU_SPAPR_TCE_CREATE _IO(VFIO_TYPE, VFIO_BASE + 19)
>> # grep -n "VFIO_BASE + 20" include/uapi/linux/vfio.h | grep define
>> 999:#define VFIO_DEVICE_DETACH_IOMMUFD_PT _IO(VFIO_TYPE, VFIO_BASE + 20)
>> 1870:#define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20)
> It's Ok the compat codes will know what type it is operating in before
> it decodes any of those.


I agree, the additional PPC-specific ioctls can be handled selectively based
on arch.


>
>> You are right. We do have some use cases beyond VMM, I will consider compat
>> driver only if it is helpful there.
> You can also use the type1 compat mode which will magically start
> working with PPC..
>
>>> You should also implement the BLOCKING domain type to make VFIO work
>>> better
>   
>> I am not sure how this could help making VFIO better. May be, I am not able
>> to imagine the advantages with the current platform domain approach
>> in place. Could you please elaborate more on this?
> VFIO always uses a BLOCKED domain when it opens the device, then it
> changes to a paging domain. If the driver doesn't support a native
> BLOCKED domain then it allocates an empty page table and uses that.
>
> A proper native BLOCKED domain has better error handling
> characteristics since it is not allowed to fail attach and it doesn't
> require allocation.
>
> I think you will also find what you are doing easier if you push the
> iommu_domain down through the PPC iommu ops instead of retaining these
> unnecessary historical layers.


I see there is more history around this, discussed earlier on the patch here:

https://lore.kernel.org/all/20220701061751.1955857-1-aik@ozlabs.ru/

I will dig into it and address this in the future.

Thanks,

Shivaprasad

> Jason