[PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region

Shameer Kolothum posted 3 patches 1 month, 2 weeks ago
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, Alex Williamson <alex@shazbot.org>, "Cédric Le Goater" <clg@redhat.com>, Cornelia Huck <cohuck@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>
There is a newer version of this series
[PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 1 month, 2 weeks ago
From: Nicolin Chen <nicolinc@nvidia.com>

Linux now provides a VFIO dmabuf exporter to expose PCI BAR memory for P2P
use cases. Create a dmabuf for each mapped BAR region after the mmap is set
up, and store the returned fd in the region’s RAMBlock. This allows QEMU to
pass the fd to dma_map_file(), enabling iommufd to import the dmabuf and map
the BAR correctly in the host IOMMU page table.
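
As a sketch of the consumer side (illustrative only: the dma_map_file()
plumbing is added elsewhere in this series, and this assumes the Linux
iommufd IOMMU_IOAS_MAP_FILE uapi from <linux/iommufd.h>; ioas_id,
dmabuf_fd, bar_size and iova are placeholders):

    /* Map the dmabuf fd into an IOAS; error handling omitted. */
    struct iommu_ioas_map_file map = {
        .size = sizeof(map),
        .flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
                 IOMMU_IOAS_MAP_WRITEABLE,
        .ioas_id = ioas_id,
        .fd = dmabuf_fd,    /* the fd stored in the RAMBlock */
        .start = 0,         /* offset within the dmabuf */
        .length = bar_size,
        .iova = iova,
    };
    ret = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);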

If the kernel lacks support or dmabuf setup fails, QEMU skips the setup
and continues with normal mmap handling.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/vfio/region.c     | 57 +++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/trace-events |  1 +
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index b165ab0b93..6949f6779c 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -29,6 +29,7 @@
 #include "qemu/error-report.h"
 #include "qemu/units.h"
 #include "monitor/monitor.h"
+#include "system/ramblock.h"
 #include "vfio-helpers.h"
 
 /*
@@ -238,13 +239,52 @@ static void vfio_subregion_unmap(VFIORegion *region, int index)
     region->mmaps[index].mmap = NULL;
 }
 
+static int vfio_region_create_dma_buf(VFIORegion *region)
+{
+    g_autofree struct vfio_device_feature *feature = NULL;
+    VFIODevice *vbasedev = region->vbasedev;
+    struct vfio_device_feature_dma_buf *dma_buf;
+    size_t total_size;
+    int i, ret;
+
+    g_assert(region->nr_mmaps);
+
+    total_size = sizeof(*feature) + sizeof(*dma_buf) +
+                 sizeof(struct vfio_region_dma_range) * region->nr_mmaps;
+    feature = g_malloc0(total_size);
+    *feature = (struct vfio_device_feature) {
+        .argsz = total_size,
+        .flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF,
+    };
+
+    dma_buf = (void *)feature->data;
+    *dma_buf = (struct vfio_device_feature_dma_buf) {
+        .region_index = region->nr,
+        .open_flags = O_RDWR,
+        .nr_ranges = region->nr_mmaps,
+    };
+
+    for (i = 0; i < region->nr_mmaps; i++) {
+        dma_buf->dma_ranges[i].offset = region->mmaps[i].offset;
+        dma_buf->dma_ranges[i].length = region->mmaps[i].size;
+    }
+
+    ret = vbasedev->io_ops->device_feature(vbasedev, feature);
+    for (i = 0; i < region->nr_mmaps; i++) {
+        trace_vfio_region_dmabuf(region->vbasedev->name, ret, region->nr,
+                                 region->mem->name, region->mmaps[i].offset,
+                                 region->mmaps[i].size);
+    }
+    return ret;
+}
+
 int vfio_region_mmap(VFIORegion *region)
 {
     int i, ret, prot = 0;
     char *name;
     int fd;
 
-    if (!region->mem) {
+    if (!region->mem || !region->nr_mmaps) {
         return 0;
     }
 
@@ -305,6 +345,21 @@ int vfio_region_mmap(VFIORegion *region)
                                region->mmaps[i].size - 1);
     }
 
+    ret = vfio_region_create_dma_buf(region);
+    if (ret < 0) {
+        if (ret == -ENOTTY) {
+            warn_report_once("VFIO dmabuf not supported in kernel");
+        } else {
+            error_report("%s: failed to create dmabuf: %s",
+                         memory_region_name(region->mem), strerror(errno));
+        }
+    } else {
+        MemoryRegion *mr = &region->mmaps[0].mem;
+        RAMBlock *ram_block = mr->ram_block;
+
+        ram_block->fd = ret;
+    }
+
     return 0;
 
 no_mmap:
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 1e895448cd..592a0349d4 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -117,6 +117,7 @@ vfio_device_put(int fd) "close vdev->fd=%d"
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
 vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
+vfio_region_dmabuf(const char *dev, int fd, int index,  const char *name, unsigned long offset, unsigned long size) "Device %s, dmabuf fd %d region %d \"%s\", offset: 0x%lx, size: 0x%lx"
 vfio_region_mmap_fault(const char *name, int index, unsigned long offset, unsigned long size, int fault) "Region %s mmaps[%d], [0x%lx - 0x%lx], fault: %d"
 vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Region %s [0x%lx - 0x%lx]"
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
-- 
2.43.0


Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Eric Auger 4 weeks ago
Hi Shameer,

On 12/22/25 2:53 PM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Linux now provides a VFIO dmabuf exporter to expose PCI BAR memory for P2P
> use cases. Create a dmabuf for each mapped BAR region after the mmap is set
> up, and store the returned fd in the region’s RAMBlock. This allows QEMU to
> pass the fd to dma_map_file(), enabling iommufd to import the dmabuf and map
> the BAR correctly in the host IOMMU page table.

I tested the series with an upstream kernel and your
master-smmuv3-accel-v6-veventq-v2-vcmdq-rfcv1-dmabuf-v1 branch.

It works fine with Grace Hopper GPU passthrough without cmdqv. However,
with cmdqv, I get:

qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?
qemu-system-aarch64: warning: vfio_container_dma_map(0xaaaea68471b0,
0xc090000, 0x10000, 0xffffb5c90000) = -14 (Bad address)
qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

Maybe this is unrelated to this series and rather relates to the cmdqv one.

I wonder if you get the same warning and, if so, whether the root cause
is understood and perhaps fixed elsewhere.

Thank you in advance

Eric




RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 3 weeks, 6 days ago
Hi Eric,

> Hi Shameer,
> 
> 
> I tested the series with an upstream kernel and your
> master-smmuv3-accel-v6-veventq-v2-vcmdq-rfcv1-dmabuf-v1 branch.
> 
> It works fine with Grace Hopper GPU passthrough without cmdqv. However,
> with cmdqv, I get:
> 
> qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI
> BAR?
> qemu-system-aarch64: warning: vfio_container_dma_map(0xaaaea68471b0,
> 0xc090000, 0x10000, 0xffffb5c90000) = -14 (Bad address)
> qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI
> BAR?
> 
> Maybe this is unrelated to this series and rather relates to the cmdqv one.
> 
> I wonder if you get the same warning and, if so, whether the root cause
> is understood and perhaps fixed elsewhere.

Yes, I can reproduce this with cmdqv enabled. It is triggered
during tegra241_cmdqv_init_vcmdq_page0(), when QEMU sets up
the VINTF/vCMDQ page0 mappings.

IIUC, this is not guest RAM and there is also no dmabuf support for this
memory in the kernel, which leads to the IOMMU_IOAS_MAP failure.
AFAICS, this memory region does not participate in device DMA and does
not require any IOMMU mappings. Hence, we can ignore this warning for now.

I will look into this in more detail during the next vCMDQ respin to see
whether we can avoid triggering this warning.

Thanks,
Shameer


Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Duan, Zhenzhong 1 month ago
On 12/22/2025 9:53 PM, Shameer Kolothum wrote:
> +static int vfio_region_create_dma_buf(VFIORegion *region)
> +{
> +    g_autofree struct vfio_device_feature *feature = NULL;
> +    VFIODevice *vbasedev = region->vbasedev;
> +    struct vfio_device_feature_dma_buf *dma_buf;
> +    size_t total_size;
> +    int i, ret;
> +
> +    g_assert(region->nr_mmaps);
> +
> +    total_size = sizeof(*feature) + sizeof(*dma_buf) +
> +                 sizeof(struct vfio_region_dma_range) * region->nr_mmaps;
> +    feature = g_malloc0(total_size);
> +    *feature = (struct vfio_device_feature) {
> +        .argsz = total_size,
> +        .flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF,
> +    };
> +
> +    dma_buf = (void *)feature->data;
> +    *dma_buf = (struct vfio_device_feature_dma_buf) {
> +        .region_index = region->nr,
> +        .open_flags = O_RDWR,
> +        .nr_ranges = region->nr_mmaps,
> +    };
> +
> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        dma_buf->dma_ranges[i].offset = region->mmaps[i].offset;
> +        dma_buf->dma_ranges[i].length = region->mmaps[i].size;
> +    }
> +
> +    ret = vbasedev->io_ops->device_feature(vbasedev, feature);

vbasedev->io_ops->device_feature may be NULL for other backends like vfio-user.

> +    for (i = 0; i < region->nr_mmaps; i++) {
> +        trace_vfio_region_dmabuf(region->vbasedev->name, ret, region->nr,
> +                                 region->mem->name, region->mmaps[i].offset,
> +                                 region->mmaps[i].size);
> +    }
> +    return ret;
> +}
> +
>   int vfio_region_mmap(VFIORegion *region)
>   {
>       int i, ret, prot = 0;
>       char *name;
>       int fd;
>   
> -    if (!region->mem) {
> +    if (!region->mem || !region->nr_mmaps) {

Just curious, when will the above check return true?

>           return 0;
>       }
>   
> @@ -305,6 +345,21 @@ int vfio_region_mmap(VFIORegion *region)
>                                  region->mmaps[i].size - 1);
>       }
>   
> +    ret = vfio_region_create_dma_buf(region);
> +    if (ret < 0) {
> +        if (ret == -ENOTTY) {
> +            warn_report_once("VFIO dmabuf not supported in kernel");
> +        } else {
> +            error_report("%s: failed to create dmabuf: %s",
> +                         memory_region_name(region->mem), strerror(errno));
> +        }
> +    } else {
> +        MemoryRegion *mr = &region->mmaps[0].mem;

Do we need to support region->mmaps[1]?

Thanks

Zhenzhong


RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 1 month ago

> On 12/22/2025 9:53 PM, Shameer Kolothum wrote:
> > +    ret = vbasedev->io_ops->device_feature(vbasedev, feature);
> 
> vbasedev->io_ops->device_feature may be NULL for other backends like
> vfio-user.

Ah, OK. I will add a check.

> > -    if (!region->mem) {
> > +    if (!region->mem || !region->nr_mmaps) {
> 
> Just curious, when will the above check return true?
I think `!region->mem` covers cases where no MemoryRegion was created
(e.g. zero-sized regions). And the nr_mmaps check covers regions without
mmap support (VFIO_REGION_INFO_FLAG_MMAP / VFIO_REGION_INFO_CAP_SPARSE_MMAP).
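
For reference, these are the uapi structs (linux/vfio.h) behind the sparse
mmap capability that populates region->mmaps[]:

    struct vfio_region_sparse_mmap_area {
        __u64 offset; /* offset of mmap'able area within region */
        __u64 size;   /* size of mmap'able area */
    };

    struct vfio_region_info_cap_sparse_mmap {
        struct vfio_info_cap_header header;
        __u32 nr_areas;
        __u32 reserved;
        struct vfio_region_sparse_mmap_area areas[];
    };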

> > +    } else {
> > +        MemoryRegion *mr = &region->mmaps[0].mem;
> 
> Do we need to support region->mmaps[1]?

My understanding is that all region->mmaps[] entries for a VFIO region share
the same RAMBlock, and the kernel returns a single dmabuf fd per region,
not per subrange.

Thanks,
Shameer
Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Cédric Le Goater 1 month ago
On 1/8/26 12:04, Shameer Kolothum wrote:
>> On 12/22/2025 9:53 PM, Shameer Kolothum wrote:
>>> +    ret = vbasedev->io_ops->device_feature(vbasedev, feature);
>>
>> vbasedev->io_ops->device_feature may be NULL for other backends like
>> vfio-user.
> 
> Ah, OK. I will add a check.

Could you please add a global routine:

   int vfio_device_get_feature(VFIODevice *vbasedev, struct vfio_device_feature *feature)
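
Something along these lines, perhaps (a minimal sketch; returning -ENOTTY
when the backend does not implement the op is an assumption, chosen so that
vfio-user and friends fall into the existing "dmabuf not supported" path):

    int vfio_device_get_feature(VFIODevice *vbasedev,
                                struct vfio_device_feature *feature)
    {
        /* Some backends (e.g. vfio-user) may not implement device_feature */
        if (!vbasedev->io_ops->device_feature) {
            return -ENOTTY;
        }

        return vbasedev->io_ops->device_feature(vbasedev, feature);
    }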



>>> +    ret = vfio_region_create_dma_buf(region);
>>> +    if (ret < 0) {
>>> +        if (ret == -ENOTTY) {
>>> +            warn_report_once("VFIO dmabuf not supported in kernel");
>>> +        } else {
>>> +            error_report("%s: failed to create dmabuf: %s",
>>> +                         memory_region_name(region->mem), strerror(errno));

Shouldn't we return 'ret' in this case?

Thanks,

C.



RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 4 weeks ago
Hi Cédric,

> Could you please add a global routine:
> 
>    int vfio_device_get_feature(VFIODevice *vbasedev,
>                                struct vfio_device_feature *feature)

Ok.

> >>> +    ret = vfio_region_create_dma_buf(region);
> >>> +    if (ret < 0) {
> >>> +        if (ret == -ENOTTY) {
> >>> +            warn_report_once("VFIO dmabuf not supported in kernel");
> >>> +        } else {
> >>> +            error_report("%s: failed to create dmabuf: %s",
> >>> +                         memory_region_name(region->mem),
> >>> + strerror(errno));
> 
> >> Shouldn't we return 'ret' in this case?

That would result in:

Failed to mmap 0018:06:00.0 BAR 0. Performance may be slow

Not sure that error message is correct in this context. If we don't return
'ret' here, vfio_container_dma_map() will eventually report the warning:

qemu-system-aarch64: warning: vfio_container_dma_map(0xaaaaff67ad40, 0x58000000000, 0xb90000, 0xffff64000000) = -14 (Bad address)
0018:06:00.0: PCI peer-to-peer transactions on BARs are not supported.
qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

I think the above is good enough for this. Please let me know.

Thanks,
Shameer



Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Cédric Le Goater 4 weeks ago
On 1/12/26 17:16, Shameer Kolothum wrote:
>>
>> Shouldn't we return 'ret' in this case?
> 
> That would result in:
> 
> Failed to mmap 0018:06:00.0 BAR 0. Performance may be slow
> 
> Not sure that error message is correct in this context.

Agree. It would be a step backwards from the current situation.

> If we don't return 'ret' here,
> vfio_container_dma_map() will eventually report the warning:
> 
> qemu-system-aarch64: warning: vfio_container_dma_map(0xaaaaff67ad40, 0x58000000000, 0xb90000, 0xffff64000000) = -14 (Bad address)
> 0018:06:00.0: PCI peer-to-peer transactions on BARs are not supported.
> qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?

Yes, that's what we have today.
  
> I think the above is good enough for this. Please let me know.

Given that, with this change, QEMU will now always request a
dmabuf fd, the question is to what extent a failure should
be considered critical.

Would there be reasons to fail to realize the vfio-pci device
and stop the machine?

Thanks,

C.


RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 3 weeks, 6 days ago

> -----Original Message-----
> From: Cédric Le Goater <clg@redhat.com>
> Sent: 12 January 2026 18:32
> To: Shameer Kolothum <skolothumtho@nvidia.com>; Duan, Zhenzhong
> <zhenzhong.duan@intel.com>; qemu-arm@nongnu.org; qemu-
> devel@nongnu.org
> Cc: eric.auger@redhat.com; alex@shazbot.org; cohuck@redhat.com;
> mst@redhat.com; Nicolin Chen <nicolinc@nvidia.com>; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Jason Gunthorpe
> <jgg@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per
> region
> 
> External email: Use caution opening links or attachments
> 
> 
> On 1/12/26 17:16, Shameer Kolothum wrote:
> > Hi Cédric,
> >
> >> -----Original Message-----
> >> From: Cédric Le Goater <clg@redhat.com>
> >> Sent: 09 January 2026 17:05
> >> To: Shameer Kolothum <skolothumtho@nvidia.com>; Duan, Zhenzhong
> >> <zhenzhong.duan@intel.com>; qemu-arm@nongnu.org; qemu-
> >> devel@nongnu.org
> >> Cc: eric.auger@redhat.com; alex@shazbot.org; cohuck@redhat.com;
> >> mst@redhat.com; Nicolin Chen <nicolinc@nvidia.com>; Nathan Chen
> >> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Jason
> Gunthorpe
> >> <jgg@nvidia.com>; Krishnakant Jaju <kjaju@nvidia.com>
> >> Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per
> >> region
> >>
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On 1/8/26 12:04, Shameer Kolothum wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> >>>> Sent: 08 January 2026 09:41
> >>>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> >> arm@nongnu.org;
> >>>> qemu-devel@nongnu.org
> >>>> Cc: eric.auger@redhat.com; alex@shazbot.org; clg@redhat.com;
> >>>> cohuck@redhat.com; mst@redhat.com; Nicolin Chen
> >>>> <nicolinc@nvidia.com>; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> >>>> <mochs@nvidia.com>; Jason Gunthorpe <jgg@nvidia.com>; Krishnakant
> >>>> Jaju <kjaju@nvidia.com>
> >>>> Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR
> >>>> per region
> >>>>
> >>>> External email: Use caution opening links or attachments
> >>>>
> >>>>
> >>>> On 12/22/2025 9:53 PM, Shameer Kolothum wrote:
> >>>>> From: Nicolin Chen <nicolinc@nvidia.com>
> >>>>>
> >>>>> Linux now provides a VFIO dmabuf exporter to expose PCI BAR memory
> >>>>> for P2P use cases. Create a dmabuf for each mapped BAR region after
> >>>>> the mmap is set up, and store the returned fd in the region’s RAMBlock.
> >>>>> This allows QEMU to pass the fd to dma_map_file(), enabling iommufd
> >>>>> to import the dmabuf and map the BAR correctly in the host IOMMU
> >>>>> page
> >>>> table.
> >>>>>
> >>>>> If the kernel lacks support or dmabuf setup fails, QEMU skips the
> >>>>> setup and continues with normal mmap handling.
> >>>>>
> >>>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> >>>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> >>>>> ---
> >>>>>     hw/vfio/region.c     | 57
> >>>> +++++++++++++++++++++++++++++++++++++++++++-
> >>>>>     hw/vfio/trace-events |  1 +
> >>>>>     2 files changed, 57 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/hw/vfio/region.c b/hw/vfio/region.c index
> >>>>> b165ab0b93..6949f6779c 100644
> >>>>> --- a/hw/vfio/region.c
> >>>>> +++ b/hw/vfio/region.c
> >>>>> @@ -29,6 +29,7 @@
> >>>>>     #include "qemu/error-report.h"
> >>>>>     #include "qemu/units.h"
> >>>>>     #include "monitor/monitor.h"
> >>>>> +#include "system/ramblock.h"
> >>>>>     #include "vfio-helpers.h"
> >>>>>
> >>>>>     /*
> >>>>> @@ -238,13 +239,52 @@ static void
> vfio_subregion_unmap(VFIORegion
> >>>> *region, int index)
> >>>>>         region->mmaps[index].mmap = NULL;
> >>>>>     }
> >>>>>
> >>>>> +static int vfio_region_create_dma_buf(VFIORegion *region) {
> >>>>> +    g_autofree struct vfio_device_feature *feature = NULL;
> >>>>> +    VFIODevice *vbasedev = region->vbasedev;
> >>>>> +    struct vfio_device_feature_dma_buf *dma_buf;
> >>>>> +    size_t total_size;
> >>>>> +    int i, ret;
> >>>>> +
> >>>>> +    g_assert(region->nr_mmaps);
> >>>>> +
> >>>>> +    total_size = sizeof(*feature) + sizeof(*dma_buf) +
> >>>>> +                 sizeof(struct vfio_region_dma_range) * region->nr_mmaps;
> >>>>> +    feature = g_malloc0(total_size);
> >>>>> +    *feature = (struct vfio_device_feature) {
> >>>>> +        .argsz = total_size,
> >>>>> +        .flags = VFIO_DEVICE_FEATURE_GET |
> >>>> VFIO_DEVICE_FEATURE_DMA_BUF,
> >>>>> +    };
> >>>>> +
> >>>>> +    dma_buf = (void *)feature->data;
> >>>>> +    *dma_buf = (struct vfio_device_feature_dma_buf) {
> >>>>> +        .region_index = region->nr,
> >>>>> +        .open_flags = O_RDWR,
> >>>>> +        .nr_ranges = region->nr_mmaps,
> >>>>> +    };
> >>>>> +
> >>>>> +    for (i = 0; i < region->nr_mmaps; i++) {
> >>>>> +        dma_buf->dma_ranges[i].offset = region->mmaps[i].offset;
> >>>>> +        dma_buf->dma_ranges[i].length = region->mmaps[i].size;
> >>>>> +    }
> >>>>> +
> >>>>> +    ret = vbasedev->io_ops->device_feature(vbasedev, feature);
> >>>>
> >>>> vbasedev->io_ops->device_feature may be NULL for other backend like
> >>>> vbasedev->vfio-
> >>>> user.
> >>>
> >>> Ah..Ok. I will add a check.
> >>
> >> Could you please add a global routine :
> >>
> >>     int vfio_device_get_feature(VFIODevice *vbasedev, struct
> >> vfio_device_feature *feature)
> >
> > Ok.
> >
> >>
> >>
> >>>
> >>>>
> >>>>> +    for (i = 0; i < region->nr_mmaps; i++) {
> >>>>> +        trace_vfio_region_dmabuf(region->vbasedev->name, ret, region-
> >nr,
> >>>>> +                                 region->mem->name, region->mmaps[i].offset,
> >>>>> +                                 region->mmaps[i].size);
> >>>>> +    }
> >>>>> +    return ret;
> >>>>> +}
> >>>>> +
> >>>>>     int vfio_region_mmap(VFIORegion *region)
> >>>>>     {
> >>>>>         int i, ret, prot = 0;
> >>>>>         char *name;
> >>>>>         int fd;
> >>>>>
> >>>>> -    if (!region->mem) {
> >>>>> +    if (!region->mem || !region->nr_mmaps) {
> >>>>
> >>>> Just curious, when will above check return true?
> >>> I think `!region->mem` covers cases where no MemoryRegion was created
> >>> (e.g. zero sized regions).  And nr_mmaps checks regions with mmap
> >>> support exists (VFIO_REGION_INFO_FLAG_MMAP/
> _CAP_SPARSE_MMAP).
> >>>
> >>>>
> >>>>>             return 0;
> >>>>>         }
> >>>>>
> >>>>> @@ -305,6 +345,21 @@ int vfio_region_mmap(VFIORegion *region)
> >>>>>                                    region->mmaps[i].size - 1);
> >>>>>         }
> >>>>>
> >>>>> +    ret = vfio_region_create_dma_buf(region);
> >>>>> +    if (ret < 0) {
> >>>>> +        if (ret == -ENOTTY) {
> >>>>> +            warn_report_once("VFIO dmabuf not supported in kernel");
> >>>>> +        } else {
> >>>>> +            error_report("%s: failed to create dmabuf: %s",
> >>>>> +                         memory_region_name(region->mem), strerror(errno));
> >>
> >> Shouldn't we return 'ret' in this case?
> >
> > That would result in:
> >
> > Failed to mmap 0018:06:00.0 BAR 0. Performance may be slow
> >
> > Not sure that error msg is correct in this context.
> 
> Agree. It would be a step backwards from the current situation.
> 
> > If we don't return 'ret' here
> > vfio_container_dma_map() will eventually report the warning:
> >
> > qemu-system-aarch64: warning: vfio_container_dma_map(0xaaaaff67ad40, 0x58000000000, 0xb90000, 0xffff64000000) = -14 (Bad address)
> > 0018:06:00.0: PCI peer-to-peer transactions on BARs are not supported.
> > qemu-system-aarch64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?
> 
> Yes that's what we have today.
> 
> > I think the above is good enough for this. Please let me know.
> 
> Given that, with this change, QEMU will now always request a
> DMABUF fd, the question is to what extent a failure should
> be considered critical.
> 
> Would there be reasons to fail to realize the vfio-pci device
> and stop the machine?

Hmm...I don’t think so. This mainly matters for devices that make use of P2P DMA
or expose device memory, such as Grace GPUs. Also, QEMU currently does not
treat vfio_container_dma_map() failures as fatal either.

For v2, I will keep the existing behaviour. We can consider tightening this 
later if there is a concrete requirement to fail device realization in such cases.

Hope that is reasonable for now.


Thanks,
Shameer 
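
For reference, a minimal sketch of the global helper requested above,
assuming the VFIODevice/VFIODeviceIOOps layout used in hw/vfio (the name
and placement are illustrative, not the final v2 code). Returning
-ENOTTY when the backend has no device_feature op (the vfio-user case
noted above) lets callers fall back exactly as if the kernel lacked
dmabuf support:

    int vfio_device_get_feature(VFIODevice *vbasedev,
                                struct vfio_device_feature *feature)
    {
        /* vfio-user and other backends may not implement this op */
        if (!vbasedev->io_ops || !vbasedev->io_ops->device_feature) {
            return -ENOTTY;
        }

        return vbasedev->io_ops->device_feature(vbasedev, feature);
    }
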
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Duan, Zhenzhong 1 month ago

>-----Original Message-----
>From: Shameer Kolothum <skolothumtho@nvidia.com>
>Subject: RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
>
>
>
>> -----Original Message-----
>> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>> Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
>>
>> On 12/22/2025 9:53 PM, Shameer Kolothum wrote:
>> > From: Nicolin Chen <nicolinc@nvidia.com>
>> > [...]
>> >
>> > +    ret = vbasedev->io_ops->device_feature(vbasedev, feature);
>>
>> vbasedev->io_ops->device_feature may be NULL for other backends,
>> like vfio-user.
>
>Ah..Ok. I will add a check.
>
>>
>> > [...]
>> >
>> >   int vfio_region_mmap(VFIORegion *region)
>> >   {
>> >       int i, ret, prot = 0;
>> >       char *name;
>> >       int fd;
>> >
>> > -    if (!region->mem) {
>> > +    if (!region->mem || !region->nr_mmaps) {
>>
>> Just curious, when will above check return true?
>I think `!region->mem` covers cases where no MemoryRegion was created
>(e.g. zero sized regions). And the nr_mmaps check covers regions where
>no mmap support exists (VFIO_REGION_INFO_FLAG_MMAP/_CAP_SPARSE_MMAP).

Understood, thanks.

>
>>
>> > [...]
>> >
>> > +    ret = vfio_region_create_dma_buf(region);
>> > +    if (ret < 0) {
>> > +        if (ret == -ENOTTY) {
>> > +            warn_report_once("VFIO dmabuf not supported in
>kernel");
>> > +        } else {
>> > +            error_report("%s: failed to create dmabuf: %s",
>> > +                         memory_region_name(region->mem), strerror(errno));
>> > +        }
>> > +    } else {
>> > +        MemoryRegion *mr = &region->mmaps[0].mem;
>>
>> Do we need to support region->mmaps[1]?
>
>My understanding is all region->mmaps[] entries for a VFIO region share
>the same RAMBlock. And the kernel returns a single dmabuf fd per region,
>not per subrange.

I don't get it, can a RAMBlock have holes?

>
>Thanks,
>Shameer
>>
>> Thanks
>>
>> Zhenzhong
>>
>> > +        RAMBlock *ram_block = mr->ram_block;
>> > +
>> > +        ram_block->fd = ret;
>> > +    }
>> > +
>> > [...]
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 1 month ago

> -----Original Message-----
> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> Subject: RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
>
> [...]
>
> >> > +    } else {
> >> > +        MemoryRegion *mr = &region->mmaps[0].mem;
> >>
> >> Do we need to support region->mmaps[1]?
> >
> >My understanding is all region->mmaps[] entries for a VFIO region share
> >the same RAMBlock. And the kernel returns a single dmabuf fd per
> >region, not per subrange.
> 
> I don't get it, can a RAMBlock have holes?

Yes, a RAMBlock can effectively have holes, but in this context
that is not what is happening.

IIUC, for a VFIO PCI BAR region, all region->mmaps[] entries
correspond to subranges of the same BAR and are backed by the
same MemoryRegion and therefore the same RAMBlock. The sparse
mmap layout (nr_mmaps > 1) exists to describe which parts of the
BAR are mappable, not to represent distinct backing memory objects.

So while sparse regions may look like "holes" at the mmap level, there
are no holes in the RAMBlock abstraction itself. All region->mmaps[]
entries share the same RAMBlock, which is why attaching the returned
dmabuf fd to region->mmaps[0].mem.ram_block is sufficient, I think.

However, it is possible I am missing the case you are concerned about here.
Please let me know.

Thanks,
Shameer
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Duan, Zhenzhong 4 weeks ago
>> >> > [...]
>> >> > +    } else {
>> >> > +        MemoryRegion *mr = &region->mmaps[0].mem;
>> >>
>> >> Do we need to support region->mmaps[1]?
>> >
>> >My understanding is all region->mmaps[] entries for a VFIO region share
>> >the same RAMBlock. And the kernel returns a single dmabuf fd per
>> >region, not per subrange.
>>
>> I don't get it, can a RAMBlock have holes?
>
>[...]
>
>So while sparse regions may look like "holes" at the mmap level, there
>are no holes in the RAMBlock abstraction itself. All region->mmaps[]
>entries share the same RAMBlock, which is why attaching the returned
>dmabuf fd to region->mmaps[0].mem.ram_block is sufficient, I think.
>
>However, it is possible I am missing the case you are concerned about here.
>Please let me know.

I see memory_region_init_ram_device_ptr() is called for each region->mmaps[x].mem,
and a RAMBlock is allocated in each call.
IIUC, we should set fd and fd_offset in each RAMBlock.

Thanks
Zhenzhong
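
Following Zhenzhong's observation, a minimal sketch of the per-mmap
update planned for v2. The loop shape follows the patch; attaching the
fd to each mmaps[i].mem.ram_block reflects that every entry owns its
own RAMBlock. The fd_offset computation is an assumption (that the
kernel packs the requested ranges back to back inside the dmabuf) and
would need to be checked against the uAPI:

    /* Success-path sketch: i and ret as in vfio_region_mmap() */
    uint64_t fd_offset = 0;

    for (i = 0; i < region->nr_mmaps; i++) {
        RAMBlock *ram_block = region->mmaps[i].mem.ram_block;

        ram_block->fd = ret;               /* dmabuf fd from the feature call */
        ram_block->fd_offset = fd_offset;  /* assumption: ranges are contiguous */
        fd_offset += region->mmaps[i].size;
    }
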
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 4 weeks ago

> -----Original Message-----
> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> Subject: RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
>
> [...]
> 
> I see memory_region_init_ram_device_ptr() is called for each
> region->mmaps[x].mem, and a RAMBlock is allocated in each call.

Ah.. I see. It does allocate a RAMBlock per mmaps[i].

> IIUC, we should set fd and fd_offset in each RAMBlock.

Ok. Will update in v2.

Thanks,
Shameer
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 3 weeks, 5 days ago

> -----Original Message-----
> From: Shameer Kolothum <skolothumtho@nvidia.com>
> Subject: RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region

[...]
 
> Ah.. I see. It does allocate a RAMBlock per mmaps[i].
> 
> > IIUC, we should set fd and fd_offset in each RAMBlock.
> 
> Ok. Will update in v2.

I have sent out a v2 but missed CCing a few people. Sorry. Please find it here:
https://lore.kernel.org/qemu-devel/20260113113754.1189-1-skolothumtho@nvidia.com/

Thanks,
Shameer
Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Cédric Le Goater 4 weeks ago
On 1/12/26 09:45, Shameer Kolothum wrote:
> 
> [...]
> 
> Ah.. I see. It does allocate a RAMBlock per mmaps[i].
> 
>> IIUC, we should set fd and fd_offset in each RAMBlock.
> 
> Ok. Will update in v2.
I'd like to send a vfio PR soon and this v2 looks like a good
candidate for it.

Thanks,

C.
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Shameer Kolothum 4 weeks ago

> -----Original Message-----
> From: Cédric Le Goater <clg@redhat.com>
> Subject: Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
>
> On 1/12/26 09:45, Shameer Kolothum wrote:
> [...]
> > Ah.. I see. It does allocate a RAMBlock per mmaps[i].
> >
> >> IIUC, we should set fd and fd_offset in each RAMBlock.
> >
> > Ok. Will update in v2.
> I'd like to send a vfio PR soon and this v2 looks like a good candidate for it.

Sure. Will send out the v2 soon and thanks for sending those update-linux-headers
and the hyperv patches.

Thanks,
Shameer
 
RE: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Kasireddy, Vivek 1 month, 2 weeks ago
Hi Shameer, Cedric,

> Subject: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> [...]
>
> +static int vfio_region_create_dma_buf(VFIORegion *region)
Would it make sense to consolidate this implementation with the one from my
series: https://lore.kernel.org/qemu-devel/20251122064936.2948632-7-vivek.kasireddy@intel.com/
so that it is a bit more generic and can also be invoked from outside of VFIO?

Or, is it ok to have two dmabuf implementations: one that is internal to VFIO
and takes a VFIORegion as input like this one and another one that takes a
VFIODevice and iovec as input and can be invoked externally?

Thanks,
Vivek

> [...]
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 1e895448cd..592a0349d4 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -117,6 +117,7 @@ vfio_device_put(int fd) "close vdev->fd=%d"
>  vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
>  vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
>  vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
> +vfio_region_dmabuf(const char *dev, int fd, int index, const char *name, unsigned long offset, unsigned long size) "Device %s, dmabuf fd %d region %d \"%s\", offset: 0x%lx, size: 0x%lx"
>  vfio_region_mmap_fault(const char *name, int index, unsigned long offset, unsigned long size, int fault) "Region %s mmaps[%d], [0x%lx - 0x%lx], fault: %d"
>  vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Region %s [0x%lx - 0x%lx]"
>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
> --
> 2.43.0
> 
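
One possible shape for the consolidation Vivek suggests: a hypothetical
helper (the name and signature are illustrative, taken from neither
series) that accepts a VFIODevice plus an explicit range list, so it
has no VFIORegion dependency and could be invoked from outside VFIO.
It builds on the vfio_device_get_feature() wrapper sketched earlier in
the thread:

    int vfio_device_create_dma_buf(VFIODevice *vbasedev, uint32_t region_index,
                                   struct vfio_region_dma_range *ranges,
                                   uint32_t nr_ranges)
    {
        g_autofree struct vfio_device_feature *feature = NULL;
        struct vfio_device_feature_dma_buf *dma_buf;
        size_t total_size = sizeof(*feature) + sizeof(*dma_buf) +
                            sizeof(*ranges) * nr_ranges;

        feature = g_malloc0(total_size);
        *feature = (struct vfio_device_feature) {
            .argsz = total_size,
            .flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF,
        };

        dma_buf = (void *)feature->data;
        *dma_buf = (struct vfio_device_feature_dma_buf) {
            .region_index = region_index,
            .open_flags = O_RDWR,
            .nr_ranges = nr_ranges,
        };
        /* Callers describe the ranges themselves */
        memcpy(dma_buf->dma_ranges, ranges, sizeof(*ranges) * nr_ranges);

        return vfio_device_get_feature(vbasedev, feature);
    }
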
Re: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
Posted by Cédric Le Goater 1 month, 2 weeks ago
Hello Vivek,

On 12/23/25 21:43, Kasireddy, Vivek wrote:
> Hi Shameer, Cedric,
> 
>> Subject: [PATCH 3/3] hw/vfio/region: Create dmabuf for PCI BAR per region
>>
>> [...]
>>
>> +static int vfio_region_create_dma_buf(VFIORegion *region)
> Would it make sense to consolidate this implementation with the one from my
> series: https://lore.kernel.org/qemu-devel/20251122064936.2948632-7-vivek.kasireddy@intel.com/
> so that it is a bit more generic and can also be invoked from outside of VFIO?

I would prefer to start with the generic dmabuf support in VFIO, as
it covers multiple use cases. Then we can look at virtio-gpu.

> Or, is it ok to have two dmabuf implementations: one that is internal to VFIO
> and takes a VFIORegion as input like this one and another one that takes a
> VFIODevice and iovec as input and can be invoked externally?
Maybe this approach can be revisited a bit? I haven't looked yet.

Thanks,

C.