[PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine

Leon Romanovsky posted 11 patches 2 months, 4 weeks ago
There is a newer version of this series
Posted by Leon Romanovsky 2 months, 4 weeks ago
From: Leon Romanovsky <leonro@nvidia.com>

Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
MMIO physical address ranges into scatter-gather tables with proper
DMA mapping.

These common functions are a starting point and support any PCI
drivers creating mappings from their BAR's MMIO addresses. VFIO is one
case, and RDMA will shortly be another. Existing DRM drivers can be
reviewed and refactored separately. We hope this will evolve into
routines that help common DRM code handle mixed CPU and MMIO mappings.

Compared to the dma_map_resource() abuse this implementation handles
the complicated PCI P2P scenarios properly, especially when an IOMMU
is enabled:

 - Direct bus address mapping without IOVA allocation for
   PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
   happens if the IOMMU is enabled but the PCIe switch ACS flags allow
   transactions to avoid the host bridge.

   Further, this handles the slightly obscure case of MMIO with a
   phys_addr_t that differs from the physical BAR programming (bus
   offset). The phys_addr_t is converted to a dma_addr_t that
   accommodates this offset. This enables certain real systems to
   work, especially on ARM platforms.

 - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
   attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
   This happens when the IOMMU is enabled and the ACS flags are forcing
   all traffic to the IOMMU - ie for virtualization systems.

 - Cases where P2P is not supported through the host bridge/CPU. The
   P2P subsystem is the proper place to detect this and block it.

Helper functions fill_sg_entry() and calc_sg_nents() handle the
scatter-gather table construction, splitting large regions into
UINT_MAX-sized chunks to fit within sg->length field limits.

Since the physical-address based DMA API forbids use of the
scatterlist's CPU list, this produces a mangled scatterlist with a
fully zero-length, NULL'd CPU list: all struct page pointers are NULL
and every CPU-side length is zero. This is stronger and more robust
than the existing mangle_sg_table() technique. Migrating DMABUF as a
subsystem away from scatterlist for this data structure is a future
project.

Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dma-buf.h   |  18 ++++
 2 files changed, 253 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 2bcf9ceca997..cb55dff1dad5 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
 
+static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
+					 dma_addr_t addr)
+{
+	unsigned int len, nents;
+	int i;
+
+	nents = DIV_ROUND_UP(length, UINT_MAX);
+	for (i = 0; i < nents; i++) {
+		len = min_t(size_t, length, UINT_MAX);
+		length -= len;
+		/*
+		 * DMABUF abuses scatterlist to create a scatterlist
+		 * that does not have any CPU list, only the DMA list.
+		 * Always set the page related values to NULL to ensure
+		 * importers can't use it. The phys_addr based DMA API
+		 * does not require the CPU list for mapping or unmapping.
+		 */
+		sg_set_page(sgl, NULL, 0, 0);
+		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
+		sg_dma_len(sgl) = len;
+		sgl = sg_next(sgl);
+	}
+
+	return sgl;
+}
+
+static unsigned int calc_sg_nents(struct dma_iova_state *state,
+				  struct dma_buf_phys_vec *phys_vec,
+				  size_t nr_ranges, size_t size)
+{
+	unsigned int nents = 0;
+	size_t i;
+
+	if (!state || !dma_use_iova(state)) {
+		for (i = 0; i < nr_ranges; i++)
+			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+	} else {
+		/*
+		 * In the IOVA case a single SG entry spans the whole
+		 * IOVA range, but each entry must fit in sg->length
+		 * (an unsigned int), so more entries may be needed.
+		 */
+		nents = DIV_ROUND_UP(size, UINT_MAX);
+	}
+
+	return nents;
+}
+
+/**
+ * struct dma_buf_dma - holds DMA mapping information
+ * @sgt:    Scatter-gather table
+ * @state:  DMA IOVA state relevant in IOMMU-based DMA
+ * @size:   Total size of DMA transfer
+ */
+struct dma_buf_dma {
+	struct sg_table sgt;
+	struct dma_iova_state *state;
+	size_t size;
+};
+
+/**
+ * dma_buf_map - DMA-map an array of physical vectors and return the
+ * resulting scatterlist table of the attachment. This function is
+ * intended for MMIO memory only.
+ * @attach:	[in]	attachment whose scatterlist is to be returned
+ * @provider:	[in]	p2pdma provider
+ * @phys_vec:	[in]	array of physical vectors
+ * @nr_ranges:	[in]	number of entries in phys_vec array
+ * @size:	[in]	total size of all ranges in @phys_vec
+ * @dir:	[in]	direction of DMA transfer
+ *
+ * Returns an sg_table containing the DMA-mapped scatterlist on success;
+ * returns ERR_PTR on error. May return -EINTR if interrupted by a signal.
+ *
+ * On success, the DMA addresses and lengths in the returned scatterlist are
+ * PAGE_SIZE aligned.
+ *
+ * A mapping must be unmapped by using dma_buf_unmap().
+ */
+struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
+			     struct p2pdma_provider *provider,
+			     struct dma_buf_phys_vec *phys_vec,
+			     size_t nr_ranges, size_t size,
+			     enum dma_data_direction dir)
+{
+	unsigned int nents, mapped_len = 0;
+	struct dma_buf_dma *dma;
+	struct scatterlist *sgl;
+	dma_addr_t addr;
+	size_t i;
+	int ret;
+
+	if (WARN_ON(!attach || !attach->dmabuf || !provider))
+		/* This function is supposed to work on MMIO memory only */
+		return ERR_PTR(-EINVAL);
+
+	dma_resv_assert_held(attach->dmabuf->resv);
+
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma)
+		return ERR_PTR(-ENOMEM);
+
+	switch (pci_p2pdma_map_type(provider, attach->dev)) {
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		/*
+		 * No IOVA is needed at all for this flow.
+		 */
+		break;
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
+		if (!dma->state) {
+			ret = -ENOMEM;
+			goto err_free_dma;
+		}
+
+		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
+		break;
+	default:
+		ret = -EINVAL;
+		goto err_free_dma;
+	}
+
+	nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
+	ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
+	if (ret)
+		goto err_free_state;
+
+	sgl = dma->sgt.sgl;
+
+	for (i = 0; i < nr_ranges; i++) {
+		if (!dma->state) {
+			addr = pci_p2pdma_bus_addr_map(provider,
+						       phys_vec[i].paddr);
+		} else if (dma_use_iova(dma->state)) {
+			ret = dma_iova_link(attach->dev, dma->state,
+					    phys_vec[i].paddr, 0,
+					    phys_vec[i].len, dir,
+					    DMA_ATTR_MMIO);
+			if (ret)
+				goto err_unmap_dma;
+
+			mapped_len += phys_vec[i].len;
+		} else {
+			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
+					    phys_vec[i].len, dir,
+					    DMA_ATTR_MMIO);
+			ret = dma_mapping_error(attach->dev, addr);
+			if (ret)
+				goto err_unmap_dma;
+		}
+
+		if (!dma->state || !dma_use_iova(dma->state))
+			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
+	}
+
+	if (dma->state && dma_use_iova(dma->state)) {
+		WARN_ON_ONCE(mapped_len != size);
+		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
+		if (ret)
+			goto err_unmap_dma;
+
+		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
+	}
+
+	dma->size = size;
+
+	/*
+	 * No CPU list is included; set orig_nents = 0 so importers can
+	 * detect this via the SG table (use nents only).
+	 */
+	dma->sgt.orig_nents = 0;
+
+	/*
+	 * sgl must be NULL here, proving the last entry was consumed and
+	 * that sg_alloc_table() allocated the correct number of entries.
+	 */
+	WARN_ON_ONCE(sgl);
+	return &dma->sgt;
+
+err_unmap_dma:
+	if (!i || !dma->state) {
+		; /* Do nothing */
+	} else if (dma_use_iova(dma->state)) {
+		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
+				 DMA_ATTR_MMIO);
+	} else {
+		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
+			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+	}
+	sg_free_table(&dma->sgt);
+err_free_state:
+	kfree(dma->state);
+err_free_dma:
+	kfree(dma);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_map, "DMA_BUF");
+
+/**
+ * dma_buf_unmap - unmaps the buffer
+ * @attach:	[in]	attachment to unmap buffer from
+ * @sgt:	[in]	scatterlist info of the buffer to unmap
+ * @dir:	[in]	direction of DMA transfer
+ *
+ * This unmaps a DMA mapping for @attach obtained by dma_buf_map().
+ */
+void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
+		   enum dma_data_direction dir)
+{
+	struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
+	int i;
+
+	dma_resv_assert_held(attach->dmabuf->resv);
+
+	if (!dma->state) {
+		; /* Do nothing */
+	} else if (dma_use_iova(dma->state)) {
+		dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
+				 DMA_ATTR_MMIO);
+	} else {
+		struct scatterlist *sgl;
+
+		for_each_sgtable_dma_sg(sgt, sgl, i)
+			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+	}
+
+	sg_free_table(sgt);
+	kfree(dma->state);
+	kfree(dma);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_unmap, "DMA_BUF");
+
 /**
  * dma_buf_move_notify - notify attachments that DMA-buf is moving
  *
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d58e329ac0e7..545ba27a5040 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -22,6 +22,7 @@
 #include <linux/fs.h>
 #include <linux/dma-fence.h>
 #include <linux/wait.h>
+#include <linux/pci-p2pdma.h>
 
 struct device;
 struct dma_buf;
@@ -530,6 +531,16 @@ struct dma_buf_export_info {
 	void *priv;
 };
 
+/**
+ * struct dma_buf_phys_vec - describes a contiguous chunk of memory
+ * @paddr:   physical address of the chunk
+ * @len:     length of the chunk
+ */
+struct dma_buf_phys_vec {
+	phys_addr_t paddr;
+	size_t len;
+};
+
 /**
  * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
  * @name: export-info name
@@ -609,4 +620,11 @@ int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
 struct dma_buf *dma_buf_iter_begin(void);
 struct dma_buf *dma_buf_iter_next(struct dma_buf *dmbuf);
+struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
+			     struct p2pdma_provider *provider,
+			     struct dma_buf_phys_vec *phys_vec,
+			     size_t nr_ranges, size_t size,
+			     enum dma_data_direction dir);
+void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
+		   enum dma_data_direction dir);
 #endif /* __DMA_BUF_H__ */

-- 
2.51.1

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 3 weeks ago

On 11/11/25 10:57, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> MMIO physical address ranges into scatter-gather tables with proper
> DMA mapping.
> 
> These common functions are a starting point and support any PCI
> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> case, as shortly will be RDMA. We can review existing DRM drivers to
> refactor them separately. We hope this will evolve into routines to
> help common DRM that include mixed CPU and MMIO mappings.
> 
> Compared to the dma_map_resource() abuse this implementation handles
> the complicated PCI P2P scenarios properly, especially when an IOMMU
> is enabled:
> 
>  - Direct bus address mapping without IOVA allocation for
>    PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
>    happens if the IOMMU is enabled but the PCIe switch ACS flags allow
>    transactions to avoid the host bridge.
> 
>    Further, this handles the slightly obscure, case of MMIO with a
>    phys_addr_t that is different from the physical BAR programming
>    (bus offset). The phys_addr_t is converted to a dma_addr_t and
>    accommodates this effect. This enables certain real systems to
>    work, especially on ARM platforms.
> 
>  - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
>    attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
>    This happens when the IOMMU is enabled and the ACS flags are forcing
>    all traffic to the IOMMU - ie for virtualization systems.
> 
>  - Cases where P2P is not supported through the host bridge/CPU. The
>    P2P subsystem is the proper place to detect this and block it.
> 
> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> scatter-gather table construction, splitting large regions into
> UINT_MAX-sized chunks to fit within sg->length field limits.
> 
> Since the physical address based DMA API forbids use of the CPU list
> of the scatterlist this will produce a mangled scatterlist that has
> a fully zero-length and NULL'd CPU list. The list is 0 length,
> all the struct page pointers are NULL and zero sized. This is stronger
> and more robust than the existing mangle_sg_table() technique. It is
> a future project to migrate DMABUF as a subsystem away from using
> scatterlist for this data structure.
> 
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma-buf.h   |  18 ++++
>  2 files changed, 253 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 2bcf9ceca997..cb55dff1dad5 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>  }
>  EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>  
> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> +					 dma_addr_t addr)
> +{
> +	unsigned int len, nents;
> +	int i;
> +
> +	nents = DIV_ROUND_UP(length, UINT_MAX);
> +	for (i = 0; i < nents; i++) {
> +		len = min_t(size_t, length, UINT_MAX);
> +		length -= len;
> +		/*
> +		 * DMABUF abuses scatterlist to create a scatterlist
> +		 * that does not have any CPU list, only the DMA list.
> +		 * Always set the page related values to NULL to ensure
> +		 * importers can't use it. The phys_addr based DMA API
> +		 * does not require the CPU list for mapping or unmapping.
> +		 */
> +		sg_set_page(sgl, NULL, 0, 0);
> +		sg_dma_address(sgl) = addr + i * UINT_MAX;
> +		sg_dma_len(sgl) = len;
> +		sgl = sg_next(sgl);
> +	}
> +
> +	return sgl;
> +}
> +
> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
> +				  struct dma_buf_phys_vec *phys_vec,
> +				  size_t nr_ranges, size_t size)
> +{
> +	unsigned int nents = 0;
> +	size_t i;
> +
> +	if (!state || !dma_use_iova(state)) {
> +		for (i = 0; i < nr_ranges; i++)
> +			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> +	} else {
> +		/*
> +		 * In IOVA case, there is only one SG entry which spans
> +		 * for whole IOVA address space, but we need to make sure
> +		 * that it fits sg->length, maybe we need more.
> +		 */
> +		nents = DIV_ROUND_UP(size, UINT_MAX);
> +	}
> +
> +	return nents;
> +}
> +
> +/**
> + * struct dma_buf_dma - holds DMA mapping information
> + * @sgt:    Scatter-gather table
> + * @state:  DMA IOVA state relevant in IOMMU-based DMA
> + * @size:   Total size of DMA transfer
> + */
> +struct dma_buf_dma {
> +	struct sg_table sgt;
> +	struct dma_iova_state *state;
> +	size_t size;
> +};
> +
> +/**
> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> + * of physical vectors. This funciton is intended for MMIO memory only.
> + * @attach:	[in]	attachment whose scatterlist is to be returned
> + * @provider:	[in]	p2pdma provider
> + * @phys_vec:	[in]	array of physical vectors
> + * @nr_ranges:	[in]	number of entries in phys_vec array
> + * @size:	[in]	total size of phys_vec
> + * @dir:	[in]	direction of DMA transfer
> + *
> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> + * on error. May return -EINTR if it is interrupted by a signal.
> + *
> + * On success, the DMA addresses and lengths in the returned scatterlist are
> + * PAGE_SIZE aligned.
> + *
> + * A mapping must be unmapped by using dma_buf_unmap().
> + */
> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,

That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.

> +			     struct p2pdma_provider *provider,
> +			     struct dma_buf_phys_vec *phys_vec,
> +			     size_t nr_ranges, size_t size,
> +			     enum dma_data_direction dir)
> +{
> +	unsigned int nents, mapped_len = 0;
> +	struct dma_buf_dma *dma;
> +	struct scatterlist *sgl;
> +	dma_addr_t addr;
> +	size_t i;
> +	int ret;
> +
> +	dma_resv_assert_held(attach->dmabuf->resv);
> +
> +	if (WARN_ON(!attach || !attach->dmabuf || !provider))
> +		/* This function is supposed to work on MMIO memory only */
> +		return ERR_PTR(-EINVAL);
> +
> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> +	if (!dma)
> +		return ERR_PTR(-ENOMEM);
> +
> +	switch (pci_p2pdma_map_type(provider, attach->dev)) {
> +	case PCI_P2PDMA_MAP_BUS_ADDR:
> +		/*
> +		 * There is no need in IOVA at all for this flow.
> +		 */
> +		break;
> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> +		if (!dma->state) {
> +			ret = -ENOMEM;
> +			goto err_free_dma;
> +		}
> +
> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);

Oh, that is a clear no-go for the core DMA-buf code.

It's intentionally up to the exporter how to create the DMA addresses the importer can work with.

We could add something like a dma_buf_sg_helper.c or similar and put it in there.

Regards,
Christian.


> +		break;
> +	default:
> +		ret = -EINVAL;
> +		goto err_free_dma;
> +	}
> +
> +	nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
> +	ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
> +	if (ret)
> +		goto err_free_state;
> +
> +	sgl = dma->sgt.sgl;
> +
> +	for (i = 0; i < nr_ranges; i++) {
> +		if (!dma->state) {
> +			addr = pci_p2pdma_bus_addr_map(provider,
> +						       phys_vec[i].paddr);
> +		} else if (dma_use_iova(dma->state)) {
> +			ret = dma_iova_link(attach->dev, dma->state,
> +					    phys_vec[i].paddr, 0,
> +					    phys_vec[i].len, dir,
> +					    DMA_ATTR_MMIO);
> +			if (ret)
> +				goto err_unmap_dma;
> +
> +			mapped_len += phys_vec[i].len;
> +		} else {
> +			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
> +					    phys_vec[i].len, dir,
> +					    DMA_ATTR_MMIO);
> +			ret = dma_mapping_error(attach->dev, addr);
> +			if (ret)
> +				goto err_unmap_dma;
> +		}
> +
> +		if (!dma->state || !dma_use_iova(dma->state))
> +			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
> +	}
> +
> +	if (dma->state && dma_use_iova(dma->state)) {
> +		WARN_ON_ONCE(mapped_len != size);
> +		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
> +		if (ret)
> +			goto err_unmap_dma;
> +
> +		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
> +	}
> +
> +	dma->size = size;
> +
> +	/*
> +	 * No CPU list included — set orig_nents = 0 so others can detect
> +	 * this via SG table (use nents only).
> +	 */
> +	dma->sgt.orig_nents = 0;
> +
> +
> +	/*
> +	 * SGL must be NULL to indicate that SGL is the last one
> +	 * and we allocated correct number of entries in sg_alloc_table()
> +	 */
> +	WARN_ON_ONCE(sgl);
> +	return &dma->sgt;
> +
> +err_unmap_dma:
> +	if (!i || !dma->state) {
> +		; /* Do nothing */
> +	} else if (dma_use_iova(dma->state)) {
> +		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> +				 DMA_ATTR_MMIO);
> +	} else {
> +		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> +			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> +				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
> +	}
> +	sg_free_table(&dma->sgt);
> +err_free_state:
> +	kfree(dma->state);
> +err_free_dma:
> +	kfree(dma);
> +	return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_map, "DMA_BUF");
> +
> +/**
> + * dma_buf_unmap - unmaps the buffer
> + * @attach:	[in]	attachment to unmap buffer from
> + * @sgt:	[in]	scatterlist info of the buffer to unmap
> + * @direction:	[in]	direction of DMA transfer
> + *
> + * This unmaps a DMA mapping for @attached obtained by dma_buf_map().
> + */
> +void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
> +		   enum dma_data_direction dir)
> +{
> +	struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
> +	int i;
> +
> +	dma_resv_assert_held(attach->dmabuf->resv);
> +
> +	if (!dma->state) {
> +		; /* Do nothing */
> +	} else if (dma_use_iova(dma->state)) {
> +		dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
> +				 DMA_ATTR_MMIO);
> +	} else {
> +		struct scatterlist *sgl;
> +
> +		for_each_sgtable_dma_sg(sgt, sgl, i)
> +			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> +				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
> +	}
> +
> +	sg_free_table(sgt);
> +	kfree(dma->state);
> +	kfree(dma);
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_unmap, "DMA_BUF");
> +
>  /**
>   * dma_buf_move_notify - notify attachments that DMA-buf is moving
>   *
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d58e329ac0e7..545ba27a5040 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -22,6 +22,7 @@
>  #include <linux/fs.h>
>  #include <linux/dma-fence.h>
>  #include <linux/wait.h>
> +#include <linux/pci-p2pdma.h>
>  
>  struct device;
>  struct dma_buf;
> @@ -530,6 +531,16 @@ struct dma_buf_export_info {
>  	void *priv;
>  };
>  
> +/**
> + * struct dma_buf_phys_vec - describe continuous chunk of memory
> + * @paddr:   physical address of that chunk
> + * @len:     Length of this chunk
> + */
> +struct dma_buf_phys_vec {
> +	phys_addr_t paddr;
> +	size_t len;
> +};
> +
>  /**
>   * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
>   * @name: export-info name
> @@ -609,4 +620,11 @@ int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
>  void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
>  struct dma_buf *dma_buf_iter_begin(void);
>  struct dma_buf *dma_buf_iter_next(struct dma_buf *dmbuf);
> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> +			     struct p2pdma_provider *provider,
> +			     struct dma_buf_phys_vec *phys_vec,
> +			     size_t nr_ranges, size_t size,
> +			     enum dma_data_direction dir);
> +void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
> +		   enum dma_data_direction dir);
>  #endif /* __DMA_BUF_H__ */
> 

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> 
> 
> On 11/11/25 10:57, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> > MMIO physical address ranges into scatter-gather tables with proper
> > DMA mapping.
> > 
> > These common functions are a starting point and support any PCI
> > drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> > case, as shortly will be RDMA. We can review existing DRM drivers to
> > refactor them separately. We hope this will evolve into routines to
> > help common DRM that include mixed CPU and MMIO mappings.
> > 
> > Compared to the dma_map_resource() abuse this implementation handles
> > the complicated PCI P2P scenarios properly, especially when an IOMMU
> > is enabled:
> > 
> >  - Direct bus address mapping without IOVA allocation for
> >    PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> >    happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> >    transactions to avoid the host bridge.
> > 
> >    Further, this handles the slightly obscure, case of MMIO with a
> >    phys_addr_t that is different from the physical BAR programming
> >    (bus offset). The phys_addr_t is converted to a dma_addr_t and
> >    accommodates this effect. This enables certain real systems to
> >    work, especially on ARM platforms.
> > 
> >  - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> >    attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> >    This happens when the IOMMU is enabled and the ACS flags are forcing
> >    all traffic to the IOMMU - ie for virtualization systems.
> > 
> >  - Cases where P2P is not supported through the host bridge/CPU. The
> >    P2P subsystem is the proper place to detect this and block it.
> > 
> > Helper functions fill_sg_entry() and calc_sg_nents() handle the
> > scatter-gather table construction, splitting large regions into
> > UINT_MAX-sized chunks to fit within sg->length field limits.
> > 
> > Since the physical address based DMA API forbids use of the CPU list
> > of the scatterlist this will produce a mangled scatterlist that has
> > a fully zero-length and NULL'd CPU list. The list is 0 length,
> > all the struct page pointers are NULL and zero sized. This is stronger
> > and more robust than the existing mangle_sg_table() technique. It is
> > a future project to migrate DMABUF as a subsystem away from using
> > scatterlist for this data structure.
> > 
> > Tested-by: Alex Mastro <amastro@fb.com>
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/dma-buf.h   |  18 ++++
> >  2 files changed, 253 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 2bcf9ceca997..cb55dff1dad5 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> >  }
> >  EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
> >  
> > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > +					 dma_addr_t addr)
> > +{
> > +	unsigned int len, nents;
> > +	int i;
> > +
> > +	nents = DIV_ROUND_UP(length, UINT_MAX);
> > +	for (i = 0; i < nents; i++) {
> > +		len = min_t(size_t, length, UINT_MAX);
> > +		length -= len;
> > +		/*
> > +		 * DMABUF abuses scatterlist to create a scatterlist
> > +		 * that does not have any CPU list, only the DMA list.
> > +		 * Always set the page related values to NULL to ensure
> > +		 * importers can't use it. The phys_addr based DMA API
> > +		 * does not require the CPU list for mapping or unmapping.
> > +		 */
> > +		sg_set_page(sgl, NULL, 0, 0);
> > +		sg_dma_address(sgl) = addr + i * UINT_MAX;
> > +		sg_dma_len(sgl) = len;
> > +		sgl = sg_next(sgl);
> > +	}
> > +
> > +	return sgl;
> > +}
> > +
> > +static unsigned int calc_sg_nents(struct dma_iova_state *state,
> > +				  struct dma_buf_phys_vec *phys_vec,
> > +				  size_t nr_ranges, size_t size)
> > +{
> > +	unsigned int nents = 0;
> > +	size_t i;
> > +
> > +	if (!state || !dma_use_iova(state)) {
> > +		for (i = 0; i < nr_ranges; i++)
> > +			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > +	} else {
> > +		/*
> > +		 * In IOVA case, there is only one SG entry which spans
> > +		 * for whole IOVA address space, but we need to make sure
> > +		 * that it fits sg->length, maybe we need more.
> > +		 */
> > +		nents = DIV_ROUND_UP(size, UINT_MAX);
> > +	}
> > +
> > +	return nents;
> > +}
> > +
> > +/**
> > + * struct dma_buf_dma - holds DMA mapping information
> > + * @sgt:    Scatter-gather table
> > + * @state:  DMA IOVA state relevant in IOMMU-based DMA
> > + * @size:   Total size of DMA transfer
> > + */
> > +struct dma_buf_dma {
> > +	struct sg_table sgt;
> > +	struct dma_iova_state *state;
> > +	size_t size;
> > +};
> > +
> > +/**
> > + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> > + * of physical vectors. This funciton is intended for MMIO memory only.
> > + * @attach:	[in]	attachment whose scatterlist is to be returned
> > + * @provider:	[in]	p2pdma provider
> > + * @phys_vec:	[in]	array of physical vectors
> > + * @nr_ranges:	[in]	number of entries in phys_vec array
> > + * @size:	[in]	total size of phys_vec
> > + * @dir:	[in]	direction of DMA transfer
> > + *
> > + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> > + * on error. May return -EINTR if it is interrupted by a signal.
> > + *
> > + * On success, the DMA addresses and lengths in the returned scatterlist are
> > + * PAGE_SIZE aligned.
> > + *
> > + * A mapping must be unmapped by using dma_buf_unmap().
> > + */
> > +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> 
> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.

This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?

> 
> > +			     struct p2pdma_provider *provider,
> > +			     struct dma_buf_phys_vec *phys_vec,
> > +			     size_t nr_ranges, size_t size,
> > +			     enum dma_data_direction dir)
> > +{
> > +	unsigned int nents, mapped_len = 0;
> > +	struct dma_buf_dma *dma;
> > +	struct scatterlist *sgl;
> > +	dma_addr_t addr;
> > +	size_t i;
> > +	int ret;
> > +
> > +	dma_resv_assert_held(attach->dmabuf->resv);
> > +
> > +	if (WARN_ON(!attach || !attach->dmabuf || !provider))
> > +		/* This function is supposed to work on MMIO memory only */
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > +	if (!dma)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	switch (pci_p2pdma_map_type(provider, attach->dev)) {
> > +	case PCI_P2PDMA_MAP_BUS_ADDR:
> > +		/*
> > +		 * There is no need in IOVA at all for this flow.
> > +		 */
> > +		break;
> > +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> > +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> > +		if (!dma->state) {
> > +			ret = -ENOMEM;
> > +			goto err_free_dma;
> > +		}
> > +
> > +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> 
> Oh, that is a clear no-go for the core DMA-buf code.
> 
> It's intentionally up to the exporter how to create the DMA addresses the importer can work with.

I didn't fully understand the email either. The importer needs to
configure DMA and it supports only MMIO addresses. Exporter controls it
by asking for peer2peer.

Thanks
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 3 weeks ago
On 11/19/25 14:42, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>>
>>
>> On 11/11/25 10:57, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>
>>> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
>>> MMIO physical address ranges into scatter-gather tables with proper
>>> DMA mapping.
>>>
>>> These common functions are a starting point and support any PCI
>>> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
>>> case, as shortly will be RDMA. We can review existing DRM drivers to
>>> refactor them separately. We hope this will evolve into routines to
>>> help common DRM that include mixed CPU and MMIO mappings.
>>>
>>> Compared to the dma_map_resource() abuse this implementation handles
>>> the complicated PCI P2P scenarios properly, especially when an IOMMU
>>> is enabled:
>>>
>>>  - Direct bus address mapping without IOVA allocation for
>>>    PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
>>>    happens if the IOMMU is enabled but the PCIe switch ACS flags allow
>>>    transactions to avoid the host bridge.
>>>
>>>    Further, this handles the slightly obscure, case of MMIO with a
>>>    phys_addr_t that is different from the physical BAR programming
>>>    (bus offset). The phys_addr_t is converted to a dma_addr_t and
>>>    accommodates this effect. This enables certain real systems to
>>>    work, especially on ARM platforms.
>>>
>>>  - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
>>>    attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
>>>    This happens when the IOMMU is enabled and the ACS flags are forcing
>>>    all traffic to the IOMMU - ie for virtualization systems.
>>>
>>>  - Cases where P2P is not supported through the host bridge/CPU. The
>>>    P2P subsystem is the proper place to detect this and block it.
>>>
>>> Helper functions fill_sg_entry() and calc_sg_nents() handle the
>>> scatter-gather table construction, splitting large regions into
>>> UINT_MAX-sized chunks to fit within sg->length field limits.
>>>
>>> Since the physical address based DMA API forbids use of the CPU list
>>> of the scatterlist this will produce a mangled scatterlist that has
>>> a fully zero-length and NULL'd CPU list. The list is 0 length,
>>> all the struct page pointers are NULL and zero sized. This is stronger
>>> and more robust than the existing mangle_sg_table() technique. It is
>>> a future project to migrate DMABUF as a subsystem away from using
>>> scatterlist for this data structure.
>>>
>>> Tested-by: Alex Mastro <amastro@fb.com>
>>> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>> ---
>>>  drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  include/linux/dma-buf.h   |  18 ++++
>>>  2 files changed, 253 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 2bcf9ceca997..cb55dff1dad5 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>>>  }
>>>  EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>>>  
>>> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>>> +					 dma_addr_t addr)
>>> +{
>>> +	unsigned int len, nents;
>>> +	int i;
>>> +
>>> +	nents = DIV_ROUND_UP(length, UINT_MAX);
>>> +	for (i = 0; i < nents; i++) {
>>> +		len = min_t(size_t, length, UINT_MAX);
>>> +		length -= len;
>>> +		/*
>>> +		 * DMABUF abuses scatterlist to create a scatterlist
>>> +		 * that does not have any CPU list, only the DMA list.
>>> +		 * Always set the page related values to NULL to ensure
>>> +		 * importers can't use it. The phys_addr based DMA API
>>> +		 * does not require the CPU list for mapping or unmapping.
>>> +		 */
>>> +		sg_set_page(sgl, NULL, 0, 0);
>>> +		sg_dma_address(sgl) = addr + i * UINT_MAX;
>>> +		sg_dma_len(sgl) = len;
>>> +		sgl = sg_next(sgl);
>>> +	}
>>> +
>>> +	return sgl;
>>> +}
>>> +
>>> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
>>> +				  struct dma_buf_phys_vec *phys_vec,
>>> +				  size_t nr_ranges, size_t size)
>>> +{
>>> +	unsigned int nents = 0;
>>> +	size_t i;
>>> +
>>> +	if (!state || !dma_use_iova(state)) {
>>> +		for (i = 0; i < nr_ranges; i++)
>>> +			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
>>> +	} else {
>>> +		/*
>>> +		 * In IOVA case, there is only one SG entry which spans
>>> +		 * for whole IOVA address space, but we need to make sure
>>> +		 * that it fits sg->length, maybe we need more.
>>> +		 */
>>> +		nents = DIV_ROUND_UP(size, UINT_MAX);
>>> +	}
>>> +
>>> +	return nents;
>>> +}
>>> +
>>> +/**
>>> + * struct dma_buf_dma - holds DMA mapping information
>>> + * @sgt:    Scatter-gather table
>>> + * @state:  DMA IOVA state relevant in IOMMU-based DMA
>>> + * @size:   Total size of DMA transfer
>>> + */
>>> +struct dma_buf_dma {
>>> +	struct sg_table sgt;
>>> +	struct dma_iova_state *state;
>>> +	size_t size;
>>> +};
>>> +
>>> +/**
>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
>>> + * of physical vectors. This function is intended for MMIO memory only.
>>> + * @attach:	[in]	attachment whose scatterlist is to be returned
>>> + * @provider:	[in]	p2pdma provider
>>> + * @phys_vec:	[in]	array of physical vectors
>>> + * @nr_ranges:	[in]	number of entries in phys_vec array
>>> + * @size:	[in]	total size of phys_vec
>>> + * @dir:	[in]	direction of DMA transfer
>>> + *
>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
>>> + * on error. May return -EINTR if it is interrupted by a signal.
>>> + *
>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
>>> + * PAGE_SIZE aligned.
>>> + *
>>> + * A mapping must be unmapped by using dma_buf_unmap().
>>> + */
>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>
>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> 
> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?

Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.

> 
>>
>>> +			     struct p2pdma_provider *provider,
>>> +			     struct dma_buf_phys_vec *phys_vec,
>>> +			     size_t nr_ranges, size_t size,
>>> +			     enum dma_data_direction dir)
>>> +{
>>> +	unsigned int nents, mapped_len = 0;
>>> +	struct dma_buf_dma *dma;
>>> +	struct scatterlist *sgl;
>>> +	dma_addr_t addr;
>>> +	size_t i;
>>> +	int ret;
>>> +
>>> +	dma_resv_assert_held(attach->dmabuf->resv);
>>> +
>>> +	if (WARN_ON(!attach || !attach->dmabuf || !provider))
>>> +		/* This function is supposed to work on MMIO memory only */
>>> +		return ERR_PTR(-EINVAL);
>>> +
>>> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
>>> +	if (!dma)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	switch (pci_p2pdma_map_type(provider, attach->dev)) {
>>> +	case PCI_P2PDMA_MAP_BUS_ADDR:
>>> +		/*
>>> +		 * There is no need in IOVA at all for this flow.
>>> +		 */
>>> +		break;
>>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>> +		if (!dma->state) {
>>> +			ret = -ENOMEM;
>>> +			goto err_free_dma;
>>> +		}
>>> +
>>> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>
>> Oh, that is a clear no-go for the core DMA-buf code.
>>
>> It's intentionally up to the exporter how to create the DMA addresses the importer can work with.
> 
> I didn't fully understand the email either. The importer needs to
> configure DMA and it supports only MMIO addresses. Exporter controls it
> by asking for peer2peer.

I misinterpreted the call to pci_p2pdma_map_type() here, in that now the DMA-buf code decides whether transactions go over the root complex or not.

But the exporter can call pci_p2pdma_map_type() even before calling this function, so that looks fine to me.

Regards,
Christian.

> 
> Thanks

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:

> I misinterpreted the call to pci_p2pdma_map_type() here in that
> now the DMA-buf code decides if transactions go over the root
> complex or not.

Oh, that's not it at all. I think you get it, but just to be really
clear:

This code is taking a physical address from the exporter and
determining how it MUST route inside the fabric. There is only one
single choice with no optionality.

The exporter already decided if it will go over the host bridge by
providing an address that must use a host bridge path.

> But the exporter can call pci_p2pdma_map_type() even before calling
> this function, so that looks fine to me.

Yes, the exporter needs to decide where the data is placed before it
tries to map it into the SGT.

Jason
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
> On 11/19/25 14:42, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> >>
> >>
> >> On 11/11/25 10:57, Leon Romanovsky wrote:
> >>> <...>
> >>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>
> >> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> > 
> > This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> 
> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.

Can I call it simply dma_buf_mapping()? I plan to put that function in the dma_buf_mapping.c
file per your request.

Regarding SG, the long-term plan is to remove the SG table completely, so at
least external users of DMABUF shouldn't be exposed to internal implementation
details (the SG table).

Thanks
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 3 weeks ago

On 11/19/25 15:50, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
>> On 11/19/25 14:42, Leon Romanovsky wrote:
>>> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>>>>
>>>>
>>>> On 11/11/25 10:57, Leon Romanovsky wrote:
>>>>> <...>
>>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>>>
>>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>>>
>>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
>>
>> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> 
> Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> file per-your request.

No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.

That we create an IOVA mapping when the access needs to go through the root complex is an implementation detail.

> 
> Regarding SG, the long term plan is to remove SG table completely, so at
> least external users of DMABUF shouldn't be exposed to internal implementation
> details (SG table).

Hui? Well, I suggested removing the sg_table, but that doesn't mean that implementations shouldn't be aware of it.

Regards,
Christian.

> 
> Thanks

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:

<...>

> >>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>>>
> >>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>>
> >>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> >>
> >> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> > 
> > Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> > file per-your request.
> 
> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.

In order to progress, I renamed these functions to
dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt(), and put everything in the dma_buf_mapping.c file.

Thanks
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 2 weeks ago
On 11/19/25 17:33, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
> 
> <...>
> 
>>>>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>>>>>
>>>>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>>>>>
>>>>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
>>>>
>>>> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
>>>
>>> Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
>>> file per-your request.
>>
>> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
> 
> In order to progress, I renamed these functions to be
> dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt(), and put everything in dma_buf_mapping.c file.

Yeah, the problem is that I thought about it some more and came to the conclusion that this is still not sufficient for an Rb or an Acked-by.

A core concept of DMA-buf is that the exporter takes care of all the mappings and not the framework.

Calling pci_p2pdma_bus_addr_map() or dma_map_phys() from DMA-buf code is extremely questionable.

That should really be inside VFIO and not DMA-buf code, so to move forward I strongly suggest either moving that into VFIO or into the DMA API directly.

Regards,
Christian.

> 
> Thanks

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 08:03:09AM +0100, Christian König wrote:
> On 11/19/25 17:33, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
> > 
> > <...>
> > 
> >>>>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>>>>>
> >>>>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>>>>
> >>>>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> >>>>
> >>>> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> >>>
> >>> Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> >>> file per-your request.
> >>
> >> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
> > 
> > In order to progress, I renamed these functions to be
> > dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt(), and put everything in dma_buf_mapping.c file.
> 
> Yeah, the problem is I even thought more about it and came to the conclusion that this is still not sufficient for an rb or an Ack-by.
> 
> A core concept of DMA-buf is that the exporter takes care of all the mappings and not the framework.
> 
> Calling pci_p2pdma_bus_addr_map() or dma_map_phys() from DMA-buf code is extremely questionable.
> 
> That should really be inside VFIO and not DMA-buf code, so to move forward I strongly suggest to either move that into VFIO or the DMA API directly.

We got the request to move this to DMABUF, and agreement on it, a long time ago, in v5.
https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/

Thanks

> 
> Regards,
> Christian.
> 
> > 
> > Thanks
> 
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
> 
> 
> On 11/19/25 15:50, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
> >> On 11/19/25 14:42, Leon Romanovsky wrote:
> >>> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> >>>>
> >>>>
> >>>> On 11/11/25 10:57, Leon Romanovsky wrote:
> >>>>> <...>
> >>>>> +/**
> >>>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> >>>>> + * of physical vectors. This function is intended for MMIO memory only.
> >>>>> + * @attach:	[in]	attachment whose scatterlist is to be returned
> >>>>> + * @provider:	[in]	p2pdma provider
> >>>>> + * @phys_vec:	[in]	array of physical vectors
> >>>>> + * @nr_ranges:	[in]	number of entries in phys_vec array
> >>>>> + * @size:	[in]	total size of phys_vec
> >>>>> + * @dir:	[in]	direction of DMA transfer
> >>>>> + *
> >>>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> >>>>> + * on error. May return -EINTR if it is interrupted by a signal.
> >>>>> + *
> >>>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
> >>>>> + * PAGE_SIZE aligned.
> >>>>> + *
> >>>>> + * A mapping must be unmapped by using dma_buf_unmap().
> >>>>> + */
> >>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>>>
> >>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>>
> >>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> >>
> >> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> > 
> > Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> > file per-your request.
> 
> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.

We have both map and unmap, so a dma_buf_*_to_*() name would apply to dma_buf_map() only.
And it is not a simple conversion; most of the logic actually handles mapping:

  137         for (i = 0; i < nr_ranges; i++) {
  138                 if (!dma->state) {
  139                         addr = pci_p2pdma_bus_addr_map(provider,
  140                                                        phys_vec[i].paddr);
  141                 } else if (dma_use_iova(dma->state)) {
  142                         ret = dma_iova_link(attach->dev, dma->state,
  143                                             phys_vec[i].paddr, 0,
  144                                             phys_vec[i].len, dir,
  145                                             DMA_ATTR_MMIO);
  146                         if (ret)
  147                                 goto err_unmap_dma;
  148
  149                         mapped_len += phys_vec[i].len;
  150                 } else {
  151                         addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
  152                                             phys_vec[i].len, dir,
  153                                             DMA_ATTR_MMIO);
  154                         ret = dma_mapping_error(attach->dev, addr);
  155                         if (ret)
  156                                 goto err_unmap_dma;
  157                 }
  158
  159                 if (!dma->state || !dma_use_iova(dma->state))
  160                         sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
  161         }
  162
  163         if (dma->state && dma_use_iova(dma->state)) {
  164                 WARN_ON_ONCE(mapped_len != size);
  165                 ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
  166                 if (ret)
  167                         goto err_unmap_dma;
  168
  169                 sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
  170         }

The SG table conversion is only two lines (160 and 169), which are here
because of DMABUF's dependency on SG.
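As an aside, the UINT_MAX chunking those two lines rely on (from fill_sg_entry()/calc_sg_nents()) can be modeled in plain userspace C. This is only a sketch of the splitting arithmetic, not the kernel API; the helper names here are hypothetical, and the offset is accumulated in 64 bits:

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Number of SG entries needed for one range, given the 32-bit
 * sg->length limit (mirrors the non-IOVA case of calc_sg_nents()). */
static unsigned int nents_for_len(size_t length)
{
	return (unsigned int)((length + UINT_MAX - 1) / (size_t)UINT_MAX);
}

/* Split one DMA range into (addr, len) chunks of at most UINT_MAX
 * bytes each, accumulating a 64-bit offset so large ranges cannot
 * overflow 32-bit arithmetic. Returns the number of chunks written. */
static unsigned int split_range(uint64_t addr, size_t length,
				uint64_t *chunk_addr, uint32_t *chunk_len)
{
	unsigned int i = 0;

	while (length) {
		size_t len = length < UINT_MAX ? length : UINT_MAX;

		chunk_addr[i] = addr;
		chunk_len[i] = (uint32_t)len;
		addr += len;
		length -= len;
		i++;
	}
	return i;
}
```

E.g. a 5 GiB range needs two entries: one of UINT_MAX bytes and one covering the remainder.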

What about dma_buf_phys_vec_mapping()/dma_buf_phys_vec_unmapping()?

> 
> That we create an IOVA mapping when the access needs to go through the root complex is an implementation detail.
> 
> > 
> > Regarding SG, the long term plan is to remove SG table completely, so at
> > least external users of DMABUF shouldn't be exposed to internal implementation
> > details (SG table).
> 
> Hui? Well I suggested to remove the sg_table, but that doesn't mean that implementations shouldn't be aware of that.

VFIO is the first user of this interface. It doesn't care how
DMABUF internally handles the array of phys_vecs. Today it is an sg_table;
tomorrow it will be something else.

Thanks

> 
> Regards,
> Christian.
> 
> > 
> > Thanks
> 
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> > +/**
> > + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> > + * of physical vectors. This function is intended for MMIO memory only.
> > + * @attach:	[in]	attachment whose scatterlist is to be returned
> > + * @provider:	[in]	p2pdma provider
> > + * @phys_vec:	[in]	array of physical vectors
> > + * @nr_ranges:	[in]	number of entries in phys_vec array
> > + * @size:	[in]	total size of phys_vec
> > + * @dir:	[in]	direction of DMA transfer
> > + *
> > + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> > + * on error. May return -EINTR if it is interrupted by a signal.
> > + *
> > + * On success, the DMA addresses and lengths in the returned scatterlist are
> > + * PAGE_SIZE aligned.
> > + *
> > + * A mapping must be unmapped by using dma_buf_unmap().
> > + */
> > +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> 
> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> 
> > +			     struct p2pdma_provider *provider,
> > +			     struct dma_buf_phys_vec *phys_vec,
> > +			     size_t nr_ranges, size_t size,
> > +			     enum dma_data_direction dir)
> > +{
> > +	unsigned int nents, mapped_len = 0;
> > +	struct dma_buf_dma *dma;
> > +	struct scatterlist *sgl;
> > +	dma_addr_t addr;
> > +	size_t i;
> > +	int ret;
> > +
> > +	dma_resv_assert_held(attach->dmabuf->resv);
> > +
> > +	if (WARN_ON(!attach || !attach->dmabuf || !provider))
> > +		/* This function is supposed to work on MMIO memory only */
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > +	if (!dma)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	switch (pci_p2pdma_map_type(provider, attach->dev)) {
> > +	case PCI_P2PDMA_MAP_BUS_ADDR:
> > +		/*
> > +		 * There is no need in IOVA at all for this flow.
> > +		 */
> > +		break;
> > +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> > +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> > +		if (!dma->state) {
> > +			ret = -ENOMEM;
> > +			goto err_free_dma;
> > +		}
> > +
> > +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> 
> Oh, that is a clear no-go for the core DMA-buf code.
> 
> It's intentionally up to the exporter how to create the DMA
> addresses the importer can work with.

I can't fully understand this remark?

> We could add something like a dma_buf_sg_helper.c or similar and put it in there.

Yes, the intention is this function is an "exporter helper" that an
exporter can call if it wants to help generate the scatterlist.

So your "no-go" is just about what file it is in, not anything about
how it works?

Thanks,
Jason
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 3 weeks ago
On 11/19/25 14:25, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>>> +/**
>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
>>> + * of physical vectors. This function is intended for MMIO memory only.
>>> + * @attach:	[in]	attachment whose scatterlist is to be returned
>>> + * @provider:	[in]	p2pdma provider
>>> + * @phys_vec:	[in]	array of physical vectors
>>> + * @nr_ranges:	[in]	number of entries in phys_vec array
>>> + * @size:	[in]	total size of phys_vec
>>> + * @dir:	[in]	direction of DMA transfer
>>> + *
>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
>>> + * on error. May return -EINTR if it is interrupted by a signal.
>>> + *
>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
>>> + * PAGE_SIZE aligned.
>>> + *
>>> + * A mapping must be unmapped by using dma_buf_unmap().
>>> + */
>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>
>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>>
>>> +			     struct p2pdma_provider *provider,
>>> +			     struct dma_buf_phys_vec *phys_vec,
>>> +			     size_t nr_ranges, size_t size,
>>> +			     enum dma_data_direction dir)
>>> +{
>>> +	unsigned int nents, mapped_len = 0;
>>> +	struct dma_buf_dma *dma;
>>> +	struct scatterlist *sgl;
>>> +	dma_addr_t addr;
>>> +	size_t i;
>>> +	int ret;
>>> +
>>> +	dma_resv_assert_held(attach->dmabuf->resv);
>>> +
>>> +	if (WARN_ON(!attach || !attach->dmabuf || !provider))
>>> +		/* This function is supposed to work on MMIO memory only */
>>> +		return ERR_PTR(-EINVAL);
>>> +
>>> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
>>> +	if (!dma)
>>> +		return ERR_PTR(-ENOMEM);
>>> +
>>> +	switch (pci_p2pdma_map_type(provider, attach->dev)) {
>>> +	case PCI_P2PDMA_MAP_BUS_ADDR:
>>> +		/*
>>> +		 * There is no need in IOVA at all for this flow.
>>> +		 */
>>> +		break;
>>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>> +		if (!dma->state) {
>>> +			ret = -ENOMEM;
>>> +			goto err_free_dma;
>>> +		}
>>> +
>>> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>
>> Oh, that is a clear no-go for the core DMA-buf code.
>>
>> It's intentionally up to the exporter how to create the DMA
>> addresses the importer can work with.
> 
> I can't fully understand this remark?

The exporter should be able to decide if it actually wants to use P2P when the transfer has to go through the host bridge (e.g. when IOMMU/bridge routing bits are enabled).

Thinking more about it, exporters can now probably call pci_p2pdma_map_type(provider, attach->dev) before calling this function, so that is probably ok.
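A sketch of that flow as a plain userspace model (the map-type names mirror pci_p2pdma_map_type()'s return values, but the exporter policy shown here — keeping the BAR only when traffic stays below the host bridge — is purely illustrative, not what this patch implements):

```c
/* Userspace model only: the map-type names mirror the kernel's
 * enum pci_p2pdma_map_type, but this policy function is hypothetical. */
enum p2pdma_map_type {
	P2PDMA_MAP_NOT_SUPPORTED,
	P2PDMA_MAP_BUS_ADDR,         /* ACS allows traffic below the host bridge */
	P2PDMA_MAP_THRU_HOST_BRIDGE, /* traffic is routed through the host bridge */
};

enum backing_store {
	BACKING_BAR,    /* export the MMIO BAR directly */
	BACKING_SYSTEM, /* migrate the backing store to system memory first */
};

/* Sketch of an exporter that prefers direct P2P and falls back to
 * system memory when the transfer would cross the host bridge.
 * Returns -1 when P2P is not possible at all. */
static int choose_backing(enum p2pdma_map_type type)
{
	switch (type) {
	case P2PDMA_MAP_BUS_ADDR:
		return BACKING_BAR;
	case P2PDMA_MAP_THRU_HOST_BRIDGE:
		return BACKING_SYSTEM;
	default:
		return -1;
	}
}
```

The point being that the decision stays with the exporter; the helper only does the mapping once the exporter has committed to MMIO.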

>> We could add something like a dma_buf_sg_helper.c or similar and put it in there.
> 
> Yes, the intention is this function is an "exporter helper" that an
> exporter can call if it wants to help generate the scatterlist.
> 
> So your "no-go" is just about what file it is in, not anything about
> how it works?

Yes, exactly that. Just move it into a separate file somewhere and it's probably good to go as far as I can see.

But only take that as Acked-by, I would need at least a day (or week) of free time to wrap my head around all the technical details again. And that is something I won't have before January or even later.

Regards,
Christian.

> 
> Thanks,
> Jason

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:

> >>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>> +		if (!dma->state) {
> >>> +			ret = -ENOMEM;
> >>> +			goto err_free_dma;
> >>> +		}
> >>> +
> >>> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>
> >> Oh, that is a clear no-go for the core DMA-buf code.
> >>
> >> It's intentionally up to the exporter how to create the DMA
> >> addresses the importer can work with.
> > 
> > I can't fully understand this remark?
> 
> The exporter should be able to decide if it actually wants to use
> P2P when the transfer has to go through the host bridge (e.g. when
> IOMMU/bridge routing bits are enabled).

Sure, but this is a simplified helper for exporters that don't have
a choice about where the memory comes from.

I fully expect to see changes to this to support more use cases,
including the one above. We should do those changes along with users
making use of them so we can evaluate what works best.

> But only take that as Acked-by, I would need at least a day (or
> week) of free time to wrap my head around all the technical details
> again. And that is something I won't have before January or even
> later.

Sure, it is a lot, and I think the DRM community in general should come up
to speed on the new DMA API and how we are pushing to see P2P work
within Linux.

So thanks, we can take the Acked-by and progress here. Interested
parties can pick it up from this point when time allows.

We can also have a mini-community call to give a summary/etc on these
topics.

Thanks,
Jason
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 2 weeks ago
On 11/19/25 20:31, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> 
>>>>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>>>> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>>>> +		if (!dma->state) {
>>>>> +			ret = -ENOMEM;
>>>>> +			goto err_free_dma;
>>>>> +		}
>>>>> +
>>>>> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>>>
>>>> Oh, that is a clear no-go for the core DMA-buf code.
>>>>
>>>> It's intentionally up to the exporter how to create the DMA
>>>> addresses the importer can work with.
>>>
>>> I can't fully understand this remark?
>>
>> The exporter should be able to decide if it actually wants to use
>> P2P when the transfer has to go through the host bridge (e.g. when
>> IOMMU/bridge routing bits are enabled).
> 
> Sure, but this is a simplified helper for exporters that don't have
> choices where the memory comes from.

That is extremely questionable as justification to put that in common DMA-buf code.

> I fully expect to see changes to this to support more use cases,
> including the one above. We should do those changes along with users
> making use of them so we can evaluate what works best.

Yeah, exactly that's my concern.

>> But only take that as Acked-by, I would need at least a day (or
>> week) of free time to wrap my head around all the technical details
>> again. And that is something I won't have before January or even
>> later.
> 
> Sure, it is a lot, and I think the DRM community in general should come up
> to speed on the new DMA API and how we are pushing to see P2P work
> within Linux.
> 
> So thanks, we can take the Acked-by and progress here. Interested
> parties can pick it up from this point when time allows.

Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.

This is the use case for VFIO, but I absolutely want to avoid other drivers re-using this code until we have more experience with that.

So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.

Regards,
Christian.

> 
> We can also have a mini-community call to give a summary/etc on these
> topics.
> 
> Thanks,
> Jason

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Jason Gunthorpe 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> >> The exporter should be able to decide if it actually wants to use
> >> P2P when the transfer has to go through the host bridge (e.g. when
> >> IOMMU/bridge routing bits are enabled).
> > 
> > Sure, but this is a simplified helper for exporters that don't have
> > choices where the memory comes from.
> 
> That is extremely questionable as justification to put that in common DMA-buf code.

FWIW we already have patches for a RDMA exporter lined up to use it as
well. That's two users already...

Jason
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
> 
> This is the use case for VFIO, but I absolutely want to avoid other drivers re-using this code until we have more experience with that.
> 
> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.

It was put in VFIO at the beginning, but Christoph objected to it,
because that would require exporting the symbol for pci_p2pdma_map_type(),
which was universally agreed to be a bad idea.

https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/

Thanks

> 
> Regards,
> Christian.
> 
> > 
> > We can also have a mini-community call to give a summary/etc on these
> > topics.
> > 
> > Thanks,
> > Jason
> 
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 2 weeks ago
On 11/20/25 08:41, Leon Romanovsky wrote:
> It was put in VFIO at the beginning, but Christoph objected to it,
> because that would require exporting the symbol for pci_p2pdma_map_type(),
> which was universally agreed to be a bad idea.

Yeah, that is exactly what I object here :)

We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.

The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.

The background is that in a lot of use cases you would rather move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.

Currently drivers decide that based on whether the IOMMU is enabled or not (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't swap the BAR for system memory.

To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.

Regards,
Christian.

> 
> https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
> 
> Thanks
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> We can also have a mini-community call to give a summary/etc on these
>>> topics.
>>>
>>> Thanks,
>>> Jason
>>

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 08:54:37AM +0100, Christian König wrote:
> > 
> > It was put in VFIO at the beginning, but Christoph objected to it,
> > because that would require exporting the symbol for pci_p2pdma_map_type(),
> > which was universally agreed to be a bad idea.
> 
> Yeah, that is exactly what I object here :)
> 
> We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.
> 
> The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.
> 
> Background is that in a lot of use cases you rather want to move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
> 
> Currently drivers decide that based on if IOMMU is enabled or not (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't switch the BAR for system memory.
> 
> To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.

Thanks,

What do you think about it?

diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index a69bb73db86d..05ec84a0157b 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -84,6 +84,11 @@ struct dma_buf_dma {
  * PAGE_SIZE aligned.
  *
  * A mapping must be unmapped by using dma_buf_free_sgt().
+ *
+ * NOTE: While this function is intended for DMA-buf importers, it is critical
+ * that the DMA-buf exporter is capable of performing peer-to-peer (P2P) DMA
+ * directly between PCI devices, without routing transactions through the host
+ * bridge.
  */
 struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
                                         struct p2pdma_provider *provider,


> 
> Regards,
> Christian.
> 
> > 
> > https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
> > 
> > Thanks
> > 
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> We can also have a mini-community call to give a summary/etc on these
> >>> topics.
> >>>
> >>> Thanks,
> >>> Jason
> >>
> 
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Christian König 2 months, 2 weeks ago
On 11/20/25 09:06, Leon Romanovsky wrote:
> On Thu, Nov 20, 2025 at 08:54:37AM +0100, Christian König wrote:
>>
>> Yeah, that is exactly what I object here :)
>>
>> We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.
>>
>> The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.
>>
>> Background is that in a lot of use cases you would rather move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
>>
>> Currently drivers decide that based on whether the IOMMU is enabled (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't switch the BAR for system memory.
>>
>> To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.
> 
> Thanks,
> 
> What do you think about it?
> 
> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index a69bb73db86d..05ec84a0157b 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -84,6 +84,11 @@ struct dma_buf_dma {
>   * PAGE_SIZE aligned.
>   *
>   * A mapping must be unmapped by using dma_buf_free_sgt().
> + *
> + * NOTE: While this function is intended for DMA-buf importers, it is critical
> + * that the DMA-buf exporter is capable of performing peer-to-peer (P2P) DMA
> + * directly between PCI devices, without routing transactions through the host
> + * bridge.

Well, first of all, this function is intended for exporters, not importers.

Maybe write something like "This function is intended for exporters. If direct traffic routing is mandatory, the exporter should call pci_p2pdma_map_type() before calling this function.".

Regards,
Christian.

>   */
>  struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
>                                          struct p2pdma_provider *provider,
> (END)
> 
> 
>>
>> Regards,
>> Christian.
>>
>>>
>>> https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
>>>
>>> Thanks
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> We can also have a mini-community call to give a summary/etc on these
>>>>> topics.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>
>>

Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 09:32:22AM +0100, Christian König wrote:
> On 11/20/25 09:06, Leon Romanovsky wrote:
> > On Thu, Nov 20, 2025 at 08:54:37AM +0100, Christian König wrote:
> >> On 11/20/25 08:41, Leon Romanovsky wrote:
> >>> On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> >>>> On 11/19/25 20:31, Jason Gunthorpe wrote:
> >>>>> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> >>>>>
> >>>>>>>>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>>>>>>>> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>>>>>>>> +		if (!dma->state) {
> >>>>>>>>> +			ret = -ENOMEM;
> >>>>>>>>> +			goto err_free_dma;
> >>>>>>>>> +		}
> >>>>>>>>> +
> >>>>>>>>> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>>>>>>>
> >>>>>>>> Oh, that is a clear no-go for the core DMA-buf code.
> >>>>>>>>
> >>>>>>>> It's intentionally up to the exporter how to create the DMA
> >>>>>>>> addresses the importer can work with.
> >>>>>>>
> >>>>>>> I can't fully understand this remark?
> >>>>>>
> >>>>>> The exporter should be able to decide if it actually wants to use
> >>>>>> P2P when the transfer has to go through the host bridge (e.g. when
> >>>>>> IOMMU/bridge routing bits are enabled).
> >>>>>
> >>>>> Sure, but this is a simplified helper for exporters that don't have
> >>>>> choices where the memory comes from.
> >>>>
> >>>> That is extremely questionable as justification to put that in common DMA-buf code.
> >>>>
> >>>>> I fully expect to see changes to this to support more use cases,
> >>>>> including the one above. We should do those changes along with users
> >>>>> making use of them so we can evaluate what works best.
> >>>>
> >>>> Yeah, exactly that's my concern.
> >>>>
> >>>>>> But only take that as Acked-by, I would need at least a day (or
> >>>>>> week) of free time to wrap my head around all the technical details
> >>>>>> again. And that is something I won't have before January or even
> >>>>>> later.
> >>>>>
> >>>>> Sure, it is a lot, and I think the DRM community in general should come up
> >>>>> to speed on the new DMA API and how we are pushing to see P2P work
> >>>>> within Linux.
> >>>>>
> >>>>> So thanks, we can take the Acked-by and progress here. Interested
> >>>>> parties can pick it up from this point when time allows.
> >>>>
> >>>> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
> >>>>
> >>>> This is the use case for VFIO, but I absolutely want to avoid other drivers from re-using this code until we have more experience with that.
> >>>>
> >>>> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
> >>>
> >>> It was put in VFIO at the beginning, but Christoph objected to it,
> >>> because that would require exporting a symbol for pci_p2pdma_map_type(),
> >>> which was universally agreed to be a bad idea.
> >>
> >> Yeah, that is exactly what I object here :)
> >>
> >> We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.
> >>
> >> The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.
> >>
> >> Background is that in a lot of use cases you would rather move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
> >>
> >> Currently drivers decide that based on whether the IOMMU is enabled (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't switch the BAR for system memory.
> >>
> >> To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.
> > 
> > Thanks,
> > 
> > What do you think about it?
> > 
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> > index a69bb73db86d..05ec84a0157b 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -84,6 +84,11 @@ struct dma_buf_dma {
> >   * PAGE_SIZE aligned.
> >   *
> >   * A mapping must be unmapped by using dma_buf_free_sgt().
> > + *
> > + * NOTE: While this function is intended for DMA-buf importers, it is critical
> > + * that the DMA-buf exporter is capable of performing peer-to-peer (P2P) DMA
> > + * directly between PCI devices, without routing transactions through the host
> > + * bridge.
> 
> Well, first of all, this function is intended for exporters, not importers.
> 
> Maybe write something like "This function is intended for exporters. If direct traffic routing is mandatory, the exporter should call pci_p2pdma_map_type() before calling this function.".

Sure, no problem.

Thanks

> 
> Regards,
> Christian.
> 
> >   */
> >  struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
> >                                          struct p2pdma_provider *provider,
> > (END)
> > 
> > 
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>
> >>>>> We can also have a mini-community call to give a summary/etc on these
> >>>>> topics.
> >>>>>
> >>>>> Thanks,
> >>>>> Jason
> >>>>
> >>
> 
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 03:31:14PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:

<...>

> So thanks, we can take the Acked-by and progress here. Interested
> parties can pick it up from this point when time allows.

Christian,

Can you please provide explicit Acked-by?

Thanks
Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> On 11/19/25 14:25, Jason Gunthorpe wrote:
> > On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> >>> +/**
> >>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> >>> + * of physical vectors. This function is intended for MMIO memory only.
> >>> + * @attach:	[in]	attachment whose scatterlist is to be returned
> >>> + * @provider:	[in]	p2pdma provider
> >>> + * @phys_vec:	[in]	array of physical vectors
> >>> + * @nr_ranges:	[in]	number of entries in phys_vec array
> >>> + * @size:	[in]	total size of phys_vec
> >>> + * @dir:	[in]	direction of DMA transfer
> >>> + *
> >>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> >>> + * on error. May return -EINTR if it is interrupted by a signal.
> >>> + *
> >>> + * On success, the DMA addresses and lengths in the returned scatterlist are
> >>> + * PAGE_SIZE aligned.
> >>> + *
> >>> + * A mapping must be unmapped by using dma_buf_unmap().
> >>> + */
> >>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>
> >> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>
> >>> +			     struct p2pdma_provider *provider,
> >>> +			     struct dma_buf_phys_vec *phys_vec,
> >>> +			     size_t nr_ranges, size_t size,
> >>> +			     enum dma_data_direction dir)
> >>> +{
> >>> +	unsigned int nents, mapped_len = 0;
> >>> +	struct dma_buf_dma *dma;
> >>> +	struct scatterlist *sgl;
> >>> +	dma_addr_t addr;
> >>> +	size_t i;
> >>> +	int ret;
> >>> +
> >>> +	dma_resv_assert_held(attach->dmabuf->resv);
> >>> +
> >>> +	if (WARN_ON(!attach || !attach->dmabuf || !provider))
> >>> +		/* This function is supposed to work on MMIO memory only */
> >>> +		return ERR_PTR(-EINVAL);
> >>> +
> >>> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> >>> +	if (!dma)
> >>> +		return ERR_PTR(-ENOMEM);
> >>> +
> >>> +	switch (pci_p2pdma_map_type(provider, attach->dev)) {
> >>> +	case PCI_P2PDMA_MAP_BUS_ADDR:
> >>> +		/*
> >>> +		 * There is no need in IOVA at all for this flow.
> >>> +		 */
> >>> +		break;
> >>> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>> +		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>> +		if (!dma->state) {
> >>> +			ret = -ENOMEM;
> >>> +			goto err_free_dma;
> >>> +		}
> >>> +
> >>> +		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>
> >> Oh, that is a clear no-go for the core DMA-buf code.
> >>
> >> It's intentionally up to the exporter how to create the DMA
> >> addresses the importer can work with.
> > 
> > I can't fully understand this remark?
> 
> The exporter should be able to decide if it actually wants to use P2P when the transfer has to go through the host bridge (e.g. when IOMMU/bridge routing bits are enabled).
> 
> Thinking more about it exporters can now probably call pci_p2pdma_map_type(provider, attach->dev) before calling this function so that is probably ok.
> 
> >> We could add something like a dma_buf_sg_helper.c or similar and put it in there.
> > 
> > Yes, the intention is this function is an "exporter helper" that an
> > exporter can call if it wants to help generate the scatterlist.
> > 
> > So your "no-go" is just about what file it is in, not anything about
> > how it works?
> 
> Yes, exactly that. Just move it into a separate file somewhere and it's probably good to go as far as I can see.
> 
> But only take that as Acked-by, I would need at least a day (or week) of free time to wrap my head around all the technical details again. And that is something I won't have before January or even later.

If it helps, we can meet at LPC. Jason and/or I will be happy to assist.

Thanks

> 
> Regards,
> Christian.
> 
> > 
> > Thanks,
> > Jason
> 
RE: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Tian, Kevin 2 months, 3 weeks ago
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
> +
> +	if (dma->state && dma_use_iova(dma->state)) {
> +		WARN_ON_ONCE(mapped_len != size);

then "goto err_unmap_dma".

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, November 11, 2025 5:58 PM
> > +
> > +	if (dma->state && dma_use_iova(dma->state)) {
> > +		WARN_ON_ONCE(mapped_len != size);
> 
> then "goto err_unmap_dma".

It should never happen; there is no need to provide an error unwind for
something that you won't get.

> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks
Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 03:30:00PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Tuesday, November 11, 2025 5:58 PM
> > > +
> > > +	if (dma->state && dma_use_iova(dma->state)) {
> > > +		WARN_ON_ONCE(mapped_len != size);
> > 
> > then "goto err_unmap_dma".
> 
> It never should happen, there is no need to provide error unwind to
> something that you won't get.

It is expected that WARN_ON has recovery code, if it is possible and
not burdensome.

Jason
Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Wed, Nov 19, 2025 at 09:37:08AM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 03:30:00PM +0200, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > > > From: Leon Romanovsky <leon@kernel.org>
> > > > Sent: Tuesday, November 11, 2025 5:58 PM
> > > > +
> > > > +	if (dma->state && dma_use_iova(dma->state)) {
> > > > +		WARN_ON_ONCE(mapped_len != size);
> > > 
> > > then "goto err_unmap_dma".
> > 
> > It never should happen, there is no need to provide error unwind to
> > something that you won't get.
> 
> It is expected that WARN_ON has recovery code, if it is possible and
> not burdensome.

It's not necessary, but since I'm calculating mapped_len again, it's natural
(and completely harmless) to double-check the arithmetic.

Thanks

> 
> Jason
Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Nicolin Chen 2 months, 3 weeks ago
On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> MMIO physical address ranges into scatter-gather tables with proper
> DMA mapping.
> 
> These common functions are a starting point and support any PCI
> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> case, as shortly will be RDMA. We can review existing DRM drivers to
> refactor them separately. We hope this will evolve into routines to
> help common DRM that include mixed CPU and MMIO mappings.
> 
> Compared to the dma_map_resource() abuse this implementation handles
> the complicated PCI P2P scenarios properly, especially when an IOMMU
> is enabled:
> 
>  - Direct bus address mapping without IOVA allocation for
>    PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
>    happens if the IOMMU is enabled but the PCIe switch ACS flags allow
>    transactions to avoid the host bridge.
> 
>    Further, this handles the slightly obscure, case of MMIO with a
>    phys_addr_t that is different from the physical BAR programming
>    (bus offset). The phys_addr_t is converted to a dma_addr_t and
>    accommodates this effect. This enables certain real systems to
>    work, especially on ARM platforms.
> 
>  - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
>    attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
>    This happens when the IOMMU is enabled and the ACS flags are forcing
>    all traffic to the IOMMU - ie for virtualization systems.
> 
>  - Cases where P2P is not supported through the host bridge/CPU. The
>    P2P subsystem is the proper place to detect this and block it.
> 
> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> scatter-gather table construction, splitting large regions into
> UINT_MAX-sized chunks to fit within sg->length field limits.
> 
> Since the physical address based DMA API forbids use of the CPU list
> of the scatterlist this will produce a mangled scatterlist that has
> a fully zero-length and NULL'd CPU list. The list is 0 length,
> all the struct page pointers are NULL and zero sized. This is stronger
> and more robust than the existing mangle_sg_table() technique. It is
> a future project to migrate DMABUF as a subsystem away from using
> scatterlist for this data structure.
> 
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

With a nit:

> +err_unmap_dma:
> +	if (!i || !dma->state) {
> +		; /* Do nothing */
> +	} else if (dma_use_iova(dma->state)) {
> +		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> +				 DMA_ATTR_MMIO);
> +	} else {
> +		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> +			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> +				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);

Would it be safer to skip dma_unmap_phys() the range [i, nents)?
Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Leon Romanovsky 2 months, 3 weeks ago
On Tue, Nov 18, 2025 at 04:06:11PM -0800, Nicolin Chen wrote:
> On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> > MMIO physical address ranges into scatter-gather tables with proper
> > DMA mapping.
> > 
> > These common functions are a starting point and support any PCI
> > drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> > case, as shortly will be RDMA. We can review existing DRM drivers to
> > refactor them separately. We hope this will evolve into routines to
> > help common DRM that include mixed CPU and MMIO mappings.
> > 
> > Compared to the dma_map_resource() abuse this implementation handles
> > the complicated PCI P2P scenarios properly, especially when an IOMMU
> > is enabled:
> > 
> >  - Direct bus address mapping without IOVA allocation for
> >    PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> >    happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> >    transactions to avoid the host bridge.
> > 
> >    Further, this handles the slightly obscure, case of MMIO with a
> >    phys_addr_t that is different from the physical BAR programming
> >    (bus offset). The phys_addr_t is converted to a dma_addr_t and
> >    accommodates this effect. This enables certain real systems to
> >    work, especially on ARM platforms.
> > 
> >  - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> >    attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> >    This happens when the IOMMU is enabled and the ACS flags are forcing
> >    all traffic to the IOMMU - ie for virtualization systems.
> > 
> >  - Cases where P2P is not supported through the host bridge/CPU. The
> >    P2P subsystem is the proper place to detect this and block it.
> > 
> > Helper functions fill_sg_entry() and calc_sg_nents() handle the
> > scatter-gather table construction, splitting large regions into
> > UINT_MAX-sized chunks to fit within sg->length field limits.
> > 
> > Since the physical address based DMA API forbids use of the CPU list
> > of the scatterlist this will produce a mangled scatterlist that has
> > a fully zero-length and NULL'd CPU list. The list is 0 length,
> > all the struct page pointers are NULL and zero sized. This is stronger
> > and more robust than the existing mangle_sg_table() technique. It is
> > a future project to migrate DMABUF as a subsystem away from using
> > scatterlist for this data structure.
> > 
> > Tested-by: Alex Mastro <amastro@fb.com>
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> With a nit:
> 
> > +err_unmap_dma:
> > +	if (!i || !dma->state) {
> > +		; /* Do nothing */
> > +	} else if (dma_use_iova(dma->state)) {
> > +		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> > +				 DMA_ATTR_MMIO);
> > +	} else {
> > +		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> > +			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> > +				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
> 
> Would it be safer to skip dma_unmap_phys() the range [i, nents)?

[i, nents) is not supposed to be in the SG list which we are iterating.

Thanks
Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
Posted by Jason Gunthorpe 2 months, 3 weeks ago
On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> MMIO physical address ranges into scatter-gather tables with proper
> DMA mapping.
> 
> These common functions are a starting point and support any PCI
> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> case, as shortly will be RDMA. We can review existing DRM drivers to
> refactor them separately. We hope this will evolve into routines to
> help common DRM that include mixed CPU and MMIO mappings.
> 
> Compared to the dma_map_resource() abuse this implementation handles
> the complicated PCI P2P scenarios properly, especially when an IOMMU
> is enabled:
> 
>  - Direct bus address mapping without IOVA allocation for
>    PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
>    happens if the IOMMU is enabled but the PCIe switch ACS flags allow
>    transactions to avoid the host bridge.
> 
>    Further, this handles the slightly obscure, case of MMIO with a
>    phys_addr_t that is different from the physical BAR programming
>    (bus offset). The phys_addr_t is converted to a dma_addr_t and
>    accommodates this effect. This enables certain real systems to
>    work, especially on ARM platforms.
> 
>  - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
>    attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
>    This happens when the IOMMU is enabled and the ACS flags are forcing
>    all traffic to the IOMMU - ie for virtualization systems.
> 
>  - Cases where P2P is not supported through the host bridge/CPU. The
>    P2P subsystem is the proper place to detect this and block it.
> 
> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> scatter-gather table construction, splitting large regions into
> UINT_MAX-sized chunks to fit within sg->length field limits.
> 
> Since the physical address based DMA API forbids use of the CPU list
> of the scatterlist this will produce a mangled scatterlist that has
> a fully zero-length and NULL'd CPU list. The list is 0 length,
> all the struct page pointers are NULL and zero sized. This is stronger
> and more robust than the existing mangle_sg_table() technique. It is
> a future project to migrate DMABUF as a subsystem away from using
> scatterlist for this data structure.
> 
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma-buf.h   |  18 ++++
>  2 files changed, 253 insertions(+)

I've looked at this enough times now, the logic for DMA mapping and
the construction of the scatterlist is good:

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason