This patch series aims to fix vfio_iommu_type1.c to support
VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
ranges which abut the addressable limit, i.e. ranges where
iova_start + iova_size would overflow to exactly zero.
Today, the VFIO UAPI has an inconsistency: the
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability of VFIO_IOMMU_GET_INFO
reports that ranges extending to the end of the address space are
available for use, but due to bugs in handling these boundary
conditions, they are not actually usable.
For example:
vfio_find_dma_first_node is called to find the first dma node to unmap
given an unmap range of [iova..iova+size). The check at the end of the
function intends to test if the dma result lies beyond the end of the
unmap range. The condition is incorrectly satisfied when iova+size
overflows to zero, causing the function to return NULL.
The same issue happens inside vfio_dma_do_unmap's while loop.
This bug was also reported by Alejandro Jimenez in [1][2].
Of primary concern are locations in the current code which perform
comparisons against (iova + size) expressions, where overflow to zero
is possible.
The initial list of candidate locations to audit was taken from the
following:
$ rg 'iova.*\+.*size' -n drivers/vfio/vfio_iommu_type1.c | rg -v '\- 1'
173: else if (start >= dma->iova + dma->size)
192: if (start < dma->iova + dma->size) {
216: if (new->iova + new->size <= dma->iova)
1060: dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
1233: if (dma && dma->iova + dma->size != iova + size)
1380: if (dma && dma->iova + dma->size != iova + size)
1501: ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
1504: vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
1721: while (iova < dma->iova + dma->size) {
1743: i = iova + size;
1744: while (i < dma->iova + dma->size &&
1754: size_t n = dma->iova + dma->size - iova;
1785: iova += size;
1810: while (iova < dma->iova + dma->size) {
1823: i = iova + size;
1824: while (i < dma->iova + dma->size &&
2919: if (range.iova + range.size < range.iova)
This series spends the first couple of commits on mechanical
preparations before the fix lands in the last commit.
[1] https://lore.kernel.org/qemu-devel/20250919213515.917111-1-alejandro.j.jimenez@oracle.com/
[2] https://lore.kernel.org/all/68e18f2c-79ad-45ec-99b9-99ff68ba5438@oracle.com/
Signed-off-by: Alex Mastro <amastro@fb.com>
---
Changes in v4:
- Fix type assigned to iova_end
- Clarify overflow checking, add checks to vfio_iommu_type1_dirty_pages
- Consider npage==0 an error for vfio_iommu_type1_pin_pages
- Link to v3: https://lore.kernel.org/r/20251010-fix-unmap-v3-0-306c724d6998@fb.com
Changes in v3:
- Fix handling of unmap_all in vfio_dma_do_unmap
- Fix !range.size to return -EINVAL for VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP
- Dedup !range.size checking
- Return -EOVERFLOW on check_*_overflow
- Link to v2: https://lore.kernel.org/r/20251007-fix-unmap-v2-0-759bceb9792e@fb.com
Changes in v2:
- Change to patch series rather than single commit
- Expand scope to fix more than just the unmap discovery path
- Link to v1: https://lore.kernel.org/r/20251005-fix-unmap-v1-1-6687732ed44e@fb.com
---
Alex Mastro (3):
vfio/type1: sanitize for overflow using check_*_overflow
vfio/type1: move iova increment to unmap_unpin_* caller
vfio/type1: handle DMA map/unmap up to the addressable limit
drivers/vfio/vfio_iommu_type1.c | 173 +++++++++++++++++++++++++---------------
1 file changed, 110 insertions(+), 63 deletions(-)
---
base-commit: 407aa63018d15c35a34938633868e61174d2ef6e
change-id: 20251005-fix-unmap-c3f3e87dabfa
Best regards,
--
Alex Mastro <amastro@fb.com>
On Sun, 12 Oct 2025 22:32:23 -0700
Alex Mastro <amastro@fb.com> wrote:

> This patch series aims to fix vfio_iommu_type.c to support
> VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
> ranges which lie against the addressable limit. i.e. ranges where
> iova_start + iova_size would overflow to exactly zero.

The series looks good to me and passes my testing. Any further reviews
from anyone? I think we should make this v6.18-rc material. Thanks,

Alex
On 10/15/25 3:24 PM, Alex Williamson wrote:
> On Sun, 12 Oct 2025 22:32:23 -0700
> Alex Mastro <amastro@fb.com> wrote:
>
>> This patch series aims to fix vfio_iommu_type.c to support
>> VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
>> ranges which lie against the addressable limit. i.e. ranges where
>> iova_start + iova_size would overflow to exactly zero.
>
> The series looks good to me and passes my testing. Any further reviews
> from anyone? I think we should make this v6.18-rc material. Thanks,
>

I haven't had a chance yet to closely review the latest patchset
versions, but I did test this v4 and confirmed that it solves the issue
of not being able to unmap an IOVA range extending up to the address
space boundary. I verified both with the simplified test case at:
https://gist.github.com/aljimenezb/f3338c9c2eda9b0a7bf5f76b40354db8

plus using QEMU's amd-iommu and a guest with iommu.passthrough=0
iommu.forcedac=1 (which is how I first found the problem).

So Alex Mastro, please feel free to add:

Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>

for the series. I'll try to find time to review the patches in detail.

Thank you,
Alejandro

> Alex
On Wed, Oct 15, 2025 at 05:25:14PM -0400, Alejandro Jimenez wrote:
>
>
> On 10/15/25 3:24 PM, Alex Williamson wrote:
> > On Sun, 12 Oct 2025 22:32:23 -0700
> > Alex Mastro <amastro@fb.com> wrote:
> >
> > > This patch series aims to fix vfio_iommu_type.c to support
> > > VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
> > > ranges which lie against the addressable limit. i.e. ranges where
> > > iova_start + iova_size would overflow to exactly zero.
> >
> > The series looks good to me and passes my testing. Any further reviews
> > from anyone? I think we should make this v6.18-rc material. Thanks,
> >
>
> I haven't had a chance yet to closely review the latest patchset versions,
> but I did test this v4 and confirmed that it solves the issue of not being
> able to unmap an IOVA range extending up to the address space boundary. I
> verified both with the simplified test case at:
> https://gist.github.com/aljimenezb/f3338c9c2eda9b0a7bf5f76b40354db8
>
> plus using QEMU's amd-iommu and a guest with iommu.passthrough=0
> iommu.forcedac=1 (which is how I first found the problem).
>
> So Alex Mastro, please feel free to add:
>
> Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
>
> for the series. I'll try to find time to review the patches in detail.
>
> Thank you,
> Alejandro
>
> > Alex
>

Thanks all. I would like additional scrutiny around vfio_iommu_replay. It was
the one block of affected code I have not been able to test, since I don't have
/ am not sure how to simulate a setup which can cause mappings to be replayed
on a newly added IOMMU domain. My confidence in that code is from close review
only.

I explicitly tested various combinations of the following with mappings up to
the addressable limit:
- VFIO_IOMMU_MAP_DMA
- VFIO_IOMMU_UNMAP_DMA range-based, and VFIO_DMA_UNMAP_FLAG_ALL
- VFIO_IOMMU_DIRTY_PAGES with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP

My understanding is that my changes to vfio_iommu_replay would be traversed
when binding a group to a container with existing mappings where the group's
IOMMU domain is not already one of the domains in the container.

Thanks,
Alex
On Thu, 16 Oct 2025 14:19:53 -0700
Alex Mastro <amastro@fb.com> wrote:
> On Wed, Oct 15, 2025 at 05:25:14PM -0400, Alejandro Jimenez wrote:
> >
> >
> > On 10/15/25 3:24 PM, Alex Williamson wrote:
> > > On Sun, 12 Oct 2025 22:32:23 -0700
> > > Alex Mastro <amastro@fb.com> wrote:
> > >
> > > > This patch series aims to fix vfio_iommu_type.c to support
> > > > VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
> > > > ranges which lie against the addressable limit. i.e. ranges where
> > > > iova_start + iova_size would overflow to exactly zero.
> > >
> > > The series looks good to me and passes my testing. Any further reviews
> > > from anyone? I think we should make this v6.18-rc material. Thanks,
> > >
> >
> > I haven't had a chance yet to closely review the latest patchset versions,
> > but I did test this v4 and confirmed that it solves the issue of not being
> > able to unmap an IOVA range extending up to the address space boundary. I
> > verified both with the simplified test case at:
> > https://gist.github.com/aljimenezb/f3338c9c2eda9b0a7bf5f76b40354db8
> >
> > plus using QEMU's amd-iommu and a guest with iommu.passthrough=0
> > iommu.forcedac=1 (which is how I first found the problem).
> >
> > So Alex Mastro, please feel free to add:
> >
> > Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
> >
> > for the series. I'll try to find time to review the patches in detail.
> >
> > Thank you,
> > Alejandro
> >
> > > Alex
> >
>
> Thanks all. I would like additional scrutiny around vfio_iommu_replay. It was
> the one block of affected code I have not been able to test, since I don't have
> / am not sure how to simulate a setup which can cause mappings to be replayed
> on a newly added IOMMU domain. My confidence in that code is from close review
> only.
>
> I explicitly tested various combinations of the following with mappings up to
> the addressable limit:
> - VFIO_IOMMU_MAP_DMA
> - VFIO_IOMMU_UNMAP_DMA range-based, and VFIO_DMA_UNMAP_FLAG_ALL
> - VFIO_IOMMU_DIRTY_PAGES with VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP
>
> My understanding is that my changes to vfio_iommu_replay would be traversed
> when binding a group to a container with existing mappings where the group's
> IOMMU domain is not already one of the domains in the container.
I really appreciate you making note of it if you're uneasy about the
change. I'll take a closer look there and hopefully others can as
well.
The legacy vfio container represents a single IOMMU context, which is
typically managed by a single domain. The replay comes into play when
groups are under IOMMUs with different properties that prevent us from
re-using the domain. The case that most comes to mind for this is
Intel platforms with integrated graphics where there's a separate IOMMU
for the GPU, which iirc has different coherency settings.
That mechanism for triggering replay requires a specific hardware
configuration, but we can easily trigger it through code
instrumentation, ex:
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5167bec14e36..2cb19ddbb524 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2368,7 +2368,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
d->enforce_cache_coherency ==
domain->enforce_cache_coherency) {
iommu_detach_group(domain->domain, group->iommu_group);
- if (!iommu_attach_group(d->domain,
+ if (0 && !iommu_attach_group(d->domain,
group->iommu_group)) {
list_add(&group->next, &d->group_list);
iommu_domain_free(domain->domain);
We might consider whether it's useful for testing purposes to expose a
mechanism to toggle this. For a unit test, if we create a container,
add a group, and build up some suspect mappings, if we then add another
group to the container with the above bypass we should trigger the
replay.
In general though the replay shouldn't have a mechanism to trigger
overflows, we're simply iterating the current set of mappings that have
already been validated and applying them to a new domain.
In any case, we can all take a second look at the changes there.
Thanks,
Alex
On Thu, Oct 16, 2025 at 04:01:38PM -0600, Alex Williamson wrote:
> The legacy vfio container represents a single IOMMU context, which is
> typically managed by a single domain. The replay comes into play when
> groups are under IOMMUs with different properties that prevent us from
> re-using the domain. The case that most comes to mind for this is
> Intel platforms with integrated graphics where there's a separate IOMMU
> for the GPU, which iirc has different coherency settings.
Thanks, this context is helpful and makes sense.
> That mechanism for triggering replay requires a specific hardware
> configuration, but we can easily trigger it through code
> instrumentation, ex:
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5167bec14e36..2cb19ddbb524 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2368,7 +2368,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> d->enforce_cache_coherency ==
> domain->enforce_cache_coherency) {
> iommu_detach_group(domain->domain, group->iommu_group);
> - if (!iommu_attach_group(d->domain,
> + if (0 && !iommu_attach_group(d->domain,
> group->iommu_group)) {
> list_add(&group->next, &d->group_list);
> iommu_domain_free(domain->domain);
>
> We might consider whether it's useful for testing purposes to expose a
> mechanism to toggle this. For a unit test, if we create a container,
> add a group, and build up some suspect mappings, if we then add another
> group to the container with the above bypass we should trigger the
> replay.
Thanks for the tip. I did this, and validated via bpftrace-ing iommu_map that
the container's mappings (one of which lies at the end of address space) are
replayed correctly. Without the fix, the loop body
while (iova < dma->iova + dma->size) { ... iommu_map() ... }
would never be entered for the end of address space mapping due to
dma->iova + dma->size == 0
$ sudo bpftrace -e 'kprobe:iommu_map { printf("pid=%d comm=%s domain=%p iova=%p paddr=%p size=%p prot=%p gfp=%p\n", pid, comm, (void*)arg0, (void*)arg1, (void*)arg2, (void*)arg3, (void*)arg4, (void*)arg5); }'
Attached 1 probe
# original mappings
pid=616477 comm=test_dma_map_un domain=0xff11012805dac210 iova=0x10000000000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
pid=616477 comm=test_dma_map_un domain=0xff11012805dac210 iova=0x10000001000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
pid=616477 comm=test_dma_map_un domain=0xff11012805dac210 iova=0xfffffffffffff000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
# replayed mapping
pid=616477 comm=test_dma_map_un domain=0xff11012805dab610 iova=0x10000000000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
pid=616477 comm=test_dma_map_un domain=0xff11012805dab610 iova=0x10000001000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
pid=616477 comm=test_dma_map_un domain=0xff11012805dab610 iova=0xfffffffffffff000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
> In general though the replay shouldn't have a mechanism to trigger
> overflows, we're simply iterating the current set of mappings that have
> already been validated and applying them to a new domain.
Agree. Overflow means that some other invariant has broken, and nonsensical
vfio_dma have infiltrated iommu->dma_list. The combination of iommu->lock
serialization + overflow checks elsewhere should have prevented that.
> In any case, we can all take a second look at the changes there.
Thanks!
Alex
On Fri, 17 Oct 2025 09:29:26 -0700
Alex Mastro <amastro@fb.com> wrote:
> On Thu, Oct 16, 2025 at 04:01:38PM -0600, Alex Williamson wrote:
> > That mechanism for triggering replay requires a specific hardware
> > configuration, but we can easily trigger it through code
> > instrumentation, ex:
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 5167bec14e36..2cb19ddbb524 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2368,7 +2368,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> > d->enforce_cache_coherency ==
> > domain->enforce_cache_coherency) {
> > iommu_detach_group(domain->domain, group->iommu_group);
> > - if (!iommu_attach_group(d->domain,
> > + if (0 && !iommu_attach_group(d->domain,
> > group->iommu_group)) {
> > list_add(&group->next, &d->group_list);
> > iommu_domain_free(domain->domain);
> >
> > We might consider whether it's useful for testing purposes to expose a
> > mechanism to toggle this. For a unit test, if we create a container,
> > add a group, and build up some suspect mappings, if we then add another
> > group to the container with the above bypass we should trigger the
> > replay.
>
> Thanks for the tip. I did this, and validated via bpftrace-ing iommu_map that
> the container's mappings (one of which lies at the end of address space) are
> replayed correctly. Without the fix, the loop body
>
> while (iova < dma->iova + dma->size) { ... iommu_map() ... }
>
> would never be entered for the end of address space mapping due to
>
> dma->iova + dma->size == 0
>
> $ sudo bpftrace -e 'kprobe:iommu_map { printf("pid=%d comm=%s domain=%p iova=%p paddr=%p size=%p prot=%p gfp=%p\n", pid, comm, (void*)arg0, (void*)arg1, (void*)arg2, (void*)arg3, (void*)arg4, (void*)arg5); }'
> Attached 1 probe
> # original mappings
> pid=616477 comm=test_dma_map_un domain=0xff11012805dac210 iova=0x10000000000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
> pid=616477 comm=test_dma_map_un domain=0xff11012805dac210 iova=0x10000001000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
> pid=616477 comm=test_dma_map_un domain=0xff11012805dac210 iova=0xfffffffffffff000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
> # replayed mapping
> pid=616477 comm=test_dma_map_un domain=0xff11012805dab610 iova=0x10000000000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
> pid=616477 comm=test_dma_map_un domain=0xff11012805dab610 iova=0x10000001000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
> pid=616477 comm=test_dma_map_un domain=0xff11012805dab610 iova=0xfffffffffffff000 paddr=0x12ecfdd0000 size=0x1000 prot=0x7 gfp=0x400cc0
>
> > In general though the replay shouldn't have a mechanism to trigger
> > overflows, we're simply iterating the current set of mappings that have
> > already been validated and applying them to a new domain.
>
> Agree. Overflow means that some other invariant has broken, and nonsensical
> vfio_dma have infiltrated iommu->dma_list. The combination of iommu->lock
> serialization + overflow checks elsewhere should have prevented that.
>
> > In any case, we can all take a second look at the changes there.
> Thanks!
Thanks for the further testing. Looking again at the changes, it still
looks good to me.
I do note that we're missing a Fixes: tag. I think we've had hints of
this issue all the way back to the original implementation, so perhaps
the last commit should include:
Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Unless you've identified a more specific target.
Along with the tag, it would probably be useful in that same commit to
expand on the scope of the issue in the commit log. I believe we allow
mappings to be created at the top of the address space that cannot be
removed via ioctl, but such inconsistency should result in an
application error due to the failed ioctl and does not affect cleanup
on release. Should we also therefore expand the DMA mapping tests in
tools/testing/selftests/vfio to include an end of address space test?
Thanks,
Alex
On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> I do note that we're missing a Fixes: tag. I think we've had hints of
> this issue all the way back to the original implementation, so perhaps
> the last commit should include:
>
> Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
SGTM
> Unless you've identified a more specific target.
I have not.
> Along with the tag, it would probably be useful in that same commit to
> expand on the scope of the issue in the commit log. I believe we allow
> mappings to be created at the top of the address space that cannot be
> removed via ioctl, but such inconsistency should result in an
> application error due to the failed ioctl and does not affect cleanup
> on release.
Makes sense. I will clarify the commit msg in v5 to be more specific about what
this series changes relative to existing functionality.
> Should we also therefore expand the DMA mapping tests in
> tools/testing/selftests/vfio to include an end of address space test?
Yes. I will append such a commit to the end of the series in v5. Our VFIO tests
are built on top of a hermetic rust wrapper library over VFIO ioctls, but they
aren't quite ready to be open sourced yet.
On Tue, Oct 21, 2025 at 09:25:55AM -0700, Alex Mastro wrote:
> On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > I do note that we're missing a Fixes: tag. I think we've had hints of
> > this issue all the way back to the original implementation, so perhaps
> > the last commit should include:
> >
> > Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
>
> SGTM
>
> > Unless you've identified a more specific target.
>
> I have not.
>
> > Along with the tag, it would probably be useful in that same commit to
> > expand on the scope of the issue in the commit log. I believe we allow
> > mappings to be created at the top of the address space that cannot be
> > removed via ioctl, but such inconsistency should result in an
> > application error due to the failed ioctl and does not affect cleanup
> > on release.
I want to make sure I understand the cleanup on release path. Is my supposition
below correct?
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 916cad80941c..7f8d23b06680 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1127,6 +1127,7 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
bool do_accounting)
{
+ // end == 0 due to overflow
dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
struct vfio_domain *domain, *d;
LIST_HEAD(unmapped_region_list);
@@ -1156,6 +1157,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
}
iommu_iotlb_gather_init(&iotlb_gather);
+ // doesn't enter the loop, never calls iommu_unmap
while (iova < end) {
size_t unmapped, len;
phys_addr_t phys, next;
@@ -1210,6 +1212,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
+ // go here
vfio_unmap_unpin(iommu, dma, true);
vfio_unlink_dma(iommu, dma);
put_task_struct(dma->task);
@@ -2394,6 +2397,8 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
struct rb_node *node;
while ((node = rb_first(&iommu->dma_list)))
+ // eventually, we attempt to remove the end of address space
+ // mapping
vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
}
@@ -2628,6 +2633,8 @@ static void vfio_release_domain(struct vfio_domain *domain)
kfree(group);
}
+ // Is this backstop what saves us? Even though we didn't do individual
+ // unmaps, the "leaked" end of address space mappings get freed here?
iommu_domain_free(domain->domain);
}
@@ -2643,10 +2650,12 @@ static void vfio_iommu_type1_release(void *iommu_data)
kfree(group);
}
+ // start here
vfio_iommu_unmap_unpin_all(iommu);
list_for_each_entry_safe(domain, domain_tmp,
&iommu->domain_list, next) {
+ // eventually...
vfio_release_domain(domain);
list_del(&domain->next);
kfree(domain);
On Mon, 27 Oct 2025 09:02:55 -0700
Alex Mastro <amastro@fb.com> wrote:
> On Tue, Oct 21, 2025 at 09:25:55AM -0700, Alex Mastro wrote:
> > On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > > Along with the tag, it would probably be useful in that same commit to
> > > expand on the scope of the issue in the commit log. I believe we allow
> > > mappings to be created at the top of the address space that cannot be
> > > removed via ioctl, but such inconsistency should result in an
> > > application error due to the failed ioctl and does not affect cleanup
> > > on release.
>
> I want to make sure I understand the cleanup on release path. Is my supposition
> below correct?
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 916cad80941c..7f8d23b06680 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1127,6 +1127,7 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
> static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> bool do_accounting)
> {
> + // end == 0 due to overflow
> dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> struct vfio_domain *domain, *d;
> LIST_HEAD(unmapped_region_list);
> @@ -1156,6 +1157,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> }
>
> iommu_iotlb_gather_init(&iotlb_gather);
> + // doesn't enter the loop, never calls iommu_unmap
If it were only that, I think the iommu_domain_free() would be
sufficient, but it looks like we're also missing the unpin. Freeing
the IOMMU domain isn't going to resolve that. So it actually appears
we're leaking those pinned pages and this isn't as self-resolving as I
had thought. I imagine if you ran your new unit test to the point where
we'd pinned and failed to unpin the majority of memory you'd start to
see system-wide problems. Thanks,
Alex
> while (iova < end) {
> size_t unmapped, len;
> phys_addr_t phys, next;
> @@ -1210,6 +1212,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> {
> WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
> + // go here
> vfio_unmap_unpin(iommu, dma, true);
> vfio_unlink_dma(iommu, dma);
> put_task_struct(dma->task);
> @@ -2394,6 +2397,8 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> struct rb_node *node;
>
> while ((node = rb_first(&iommu->dma_list)))
> + // eventually, we attempt to remove the end of address space
> + // mapping
> vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> }
>
> @@ -2628,6 +2633,8 @@ static void vfio_release_domain(struct vfio_domain *domain)
> kfree(group);
> }
>
> + // Is this backstop what saves us? Even though we didn't do individual
> + // unmaps, the "leaked" end of address space mappings get freed here?
> iommu_domain_free(domain->domain);
> }
>
> @@ -2643,10 +2650,12 @@ static void vfio_iommu_type1_release(void *iommu_data)
> kfree(group);
> }
>
> + // start here
> vfio_iommu_unmap_unpin_all(iommu);
>
> list_for_each_entry_safe(domain, domain_tmp,
> &iommu->domain_list, next) {
> + // eventually...
> vfio_release_domain(domain);
> list_del(&domain->next);
> kfree(domain);
On Mon, Oct 27, 2025 at 07:57:32PM -0600, Alex Williamson wrote:
> On Mon, 27 Oct 2025 09:02:55 -0700
> Alex Mastro <amastro@fb.com> wrote:
>
> > On Tue, Oct 21, 2025 at 09:25:55AM -0700, Alex Mastro wrote:
> > > On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > > > Along with the tag, it would probably be useful in that same commit to
> > > > expand on the scope of the issue in the commit log. I believe we allow
> > > > mappings to be created at the top of the address space that cannot be
> > > > removed via ioctl, but such inconsistency should result in an
> > > > application error due to the failed ioctl and does not affect cleanup
> > > > on release.
> >
> > I want to make sure I understand the cleanup on release path. Is my supposition
> > below correct?
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 916cad80941c..7f8d23b06680 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -1127,6 +1127,7 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
> > static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > bool do_accounting)
> > {
> > + // end == 0 due to overflow
> > dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> > struct vfio_domain *domain, *d;
> > LIST_HEAD(unmapped_region_list);
> > @@ -1156,6 +1157,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > }
> >
> > iommu_iotlb_gather_init(&iotlb_gather);
> > + // doesn't enter the loop, never calls iommu_unmap
>
> If it were only that, I think the iommu_domain_free() would be
> sufficient, but it looks like we're also missing the unpin. Freeing
Oh, right.
> the IOMMU domain isn't going to resolve that. So it actually appears
> we're leaking those pinned pages and this isn't as self-resolving as I
> had thought. I imagine if you ran your new unit test to the point where
> we'd pinned and failed to unpin the majority of memory you'd start to
> see system-wide problems. Thanks,
Makes sense.
> Alex
>
> > while (iova < end) {
> > size_t unmapped, len;
> > phys_addr_t phys, next;
> > @@ -1210,6 +1212,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> > {
> > WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
> > + // go here
> > vfio_unmap_unpin(iommu, dma, true);
> > vfio_unlink_dma(iommu, dma);
> > put_task_struct(dma->task);
> > @@ -2394,6 +2397,8 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> > struct rb_node *node;
> >
> > while ((node = rb_first(&iommu->dma_list)))
> > + // eventually, we attempt to remove the end of address space
> > + // mapping
> > vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> > }
> >
> > @@ -2628,6 +2633,8 @@ static void vfio_release_domain(struct vfio_domain *domain)
> > kfree(group);
> > }
> >
> > + // Is this backstop what saves us? Even though we didn't do individual
> > + // unmaps, the "leaked" end of address space mappings get freed here?
> > iommu_domain_free(domain->domain);
> > }
> >
> > @@ -2643,10 +2650,12 @@ static void vfio_iommu_type1_release(void *iommu_data)
> > kfree(group);
> > }
> >
> > + // start here
> > vfio_iommu_unmap_unpin_all(iommu);
> >
> > list_for_each_entry_safe(domain, domain_tmp,
> > &iommu->domain_list, next) {
> > + // eventually...
> > vfio_release_domain(domain);
> > list_del(&domain->next);
> > kfree(domain);
>
One point to clarify
> On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > Along with the tag, it would probably be useful in that same commit to
> > expand on the scope of the issue in the commit log. I believe we allow
> > mappings to be created at the top of the address space that cannot be
> > removed via ioctl,
True
> > but such inconsistency should result in an application error due to the
> > failed ioctl
Actually, the ioctl does not fail in the sense that the caller gets an errno.
Attempting to unmap a range ending at the end of address space succeeds (returns
zero), but unmaps zero bytes. An application would only detect this if it
explicitly checked the out size field of vfio_iommu_type1_dma_unmap. Or
attempted to create another overlapping mapping on top of the ranges it expected
to be unmapped.
Annotated below:
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 916cad80941c..039cba5a38ca 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -165,19 +165,27 @@ vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
dma_addr_t start, size_t size)
{
+ // start == ~(dma_addr_t)0
+ // size == 0
struct rb_node *node = iommu->dma_list.rb_node;
while (node) {
struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
+ // never true because all dma->iova < ~(dma_addr_t)0
if (start + size <= dma->iova)
node = node->rb_left;
+ // traverses right until we get to the end of address space
+ // dma, then we walk off the end because
+ // ~(dma_addr_t)0 >= 0 == true
+ // node = NULL
else if (start >= dma->iova + dma->size)
node = node->rb_right;
else
return dma;
}
+ // This happens
return NULL;
}
@@ -201,6 +209,8 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
node = node->rb_right;
}
}
+ // iova >= start + size == true, due to overflow to zero => no first
+ // node found
if (res && size && dma_res->iova >= start + size)
res = NULL;
return res;
@@ -1397,6 +1407,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
if (iova || size)
goto unlock;
size = U64_MAX;
+ // end of address space unmap passes these checks
} else if (!size || size & (pgsize - 1) ||
iova + size - 1 < iova || size > SIZE_MAX) {
goto unlock;
@@ -1442,18 +1453,23 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
* mappings within the range.
*/
if (iommu->v2 && !unmap_all) {
+ // passes this check as long as the unmap start doesn't split an
+ // existing dma
dma = vfio_find_dma(iommu, iova, 1);
if (dma && dma->iova != iova)
goto unlock;
+ // dma = NULL, we pass this check
dma = vfio_find_dma(iommu, iova + size - 1, 0);
if (dma && dma->iova + dma->size != iova + size)
goto unlock;
}
ret = 0;
+ // n = NULL
n = first_n = vfio_find_dma_first_node(iommu, iova, size);
+ // loop body never entered
while (n) {
dma = rb_entry(n, struct vfio_dma, node);
if (dma->iova >= iova + size)
@@ -1513,6 +1529,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
/* Report how much was unmapped */
unmap->size = unmapped;
+ // return 0
return ret;
}
On Thu, 23 Oct 2025 13:52:04 -0700
Alex Mastro <amastro@fb.com> wrote:
> One point to clarify
>
> > On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > > Along with the tag, it would probably be useful in that same commit to
> > > expand on the scope of the issue in the commit log. I believe we allow
> > > mappings to be created at the top of the address space that cannot be
> > > removed via ioctl,
>
> True
>
> > > but such inconsistency should result in an application error due to the
> > > failed ioctl
>
> Actually, the ioctl does not fail in the sense that the caller gets an errno.
> Attempting to unmap a range ending at the end of address space succeeds (returns
> zero), but unmaps zero bytes. An application would only detect this if it
> explicitly checked the out size field of vfio_iommu_type1_dma_unmap. Or
> attempted to create another overlapping mapping on top of the ranges it expected
> to be unmapped.
Yes, the ioctl doesn't fail but the returned unmap size exposes the
inconsistency. This is because we don't require a 1:1 unmap, an unmap
can span multiple mappings, or none. Prior to TYPE1v2 we even allowed
splitting previous mappings. I agree that it obfuscates the
application detecting the error, but I'm not sure there's much we can
do about it given the uAPI. Let's document the scenario as best we
can. Thanks,
Alex
> Annotated below:
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 916cad80941c..039cba5a38ca 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -165,19 +165,27 @@ vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> dma_addr_t start, size_t size)
> {
> + // start == ~(dma_addr_t)0
> + // size == 0
> struct rb_node *node = iommu->dma_list.rb_node;
>
> while (node) {
> struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>
> + // never true because all dma->iova < ~(dma_addr_t)0
> if (start + size <= dma->iova)
> node = node->rb_left;
> + // traverses right until we get to the end of address space
> + // dma, then we walk off the end because
> + // ~(dma_addr_t)0 >= 0 == true
> + // node = NULL
> else if (start >= dma->iova + dma->size)
> node = node->rb_right;
> else
> return dma;
> }
>
> + // This happens
> return NULL;
> }
>
> @@ -201,6 +209,8 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
> node = node->rb_right;
> }
> }
> + // iova >= start + size == true, due to overflow to zero => no first
> + // node found
> if (res && size && dma_res->iova >= start + size)
> res = NULL;
> return res;
> @@ -1397,6 +1407,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> if (iova || size)
> goto unlock;
> size = U64_MAX;
> + // end of address space unmap passes these checks
> } else if (!size || size & (pgsize - 1) ||
> iova + size - 1 < iova || size > SIZE_MAX) {
> goto unlock;
> @@ -1442,18 +1453,23 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> * mappings within the range.
> */
> if (iommu->v2 && !unmap_all) {
> + // passes this check as long as the unmap start doesn't split an
> + // existing dma
> dma = vfio_find_dma(iommu, iova, 1);
> if (dma && dma->iova != iova)
> goto unlock;
>
> + // dma = NULL, we pass this check
> dma = vfio_find_dma(iommu, iova + size - 1, 0);
> if (dma && dma->iova + dma->size != iova + size)
> goto unlock;
> }
>
> ret = 0;
> + // n = NULL
> n = first_n = vfio_find_dma_first_node(iommu, iova, size);
>
> + // loop body never entered
> while (n) {
> dma = rb_entry(n, struct vfio_dma, node);
> if (dma->iova >= iova + size)
> @@ -1513,6 +1529,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> /* Report how much was unmapped */
> unmap->size = unmapped;
>
> + // return 0
> return ret;
> }
>
>
On Tue, Oct 21, 2025 at 9:26 AM Alex Mastro <amastro@fb.com> wrote:
> On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > Should we also therefore expand the DMA mapping tests in
> > tools/testing/selftests/vfio to include an end of address space test?
>
> Yes. I will append such a commit to the end of the series in v5. Our VFIO tests
> are built on top of a hermetic rust wrapper library over VFIO ioctls, but they
> aren't quite ready to be open sourced yet.

Feel free to reach out if you have any questions about writing or
running the VFIO selftests.
On Tue, Oct 21, 2025 at 09:31:59AM -0700, David Matlack wrote:
> On Tue, Oct 21, 2025 at 9:26 AM Alex Mastro <amastro@fb.com> wrote:
> > On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > > Should we also therefore expand the DMA mapping tests in
> > > tools/testing/selftests/vfio to include an end of address space test?
> >
> > Yes. I will append such a commit to the end of the series in v5. Our VFIO tests
> > are built on top of a hermetic rust wrapper library over VFIO ioctls, but they
> > aren't quite ready to be open sourced yet.
>
> Feel free to reach out if you have any questions about writing or
> running the VFIO selftests.
Thanks David. I built and ran using below. I am not too familiar with
kselftests, so open to tips.
$ make LLVM=1 -j kselftest-install INSTALL_PATH=/tmp/kst TARGETS="vfio"
$ VFIO_SELFTESTS_BDF=0000:05:00.0 /tmp/kst/run_kselftest.sh
I added the following. Is this the right direction? Is multiple fixtures per
file OK? Seems related enough to vfio_dma_mapping_test.c to keep together.
I updated the *_unmap function signatures to return the count of bytes unmapped,
since that is part of the test pass criteria. Also added unmap_all flavors,
since those exercise different code paths than range-based unmap.
Relevant test output:
# # RUN vfio_dma_map_limit_test.vfio_type1_iommu.end_of_address_space ...
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# # OK vfio_dma_map_limit_test.vfio_type1_iommu.end_of_address_space
# ok 16 vfio_dma_map_limit_test.vfio_type1_iommu.end_of_address_space
# # RUN vfio_dma_map_limit_test.vfio_type1v2_iommu.end_of_address_space ...
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# # OK vfio_dma_map_limit_test.vfio_type1v2_iommu.end_of_address_space
# ok 17 vfio_dma_map_limit_test.vfio_type1v2_iommu.end_of_address_space
# # RUN vfio_dma_map_limit_test.iommufd_compat_type1.end_of_address_space ...
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# # OK vfio_dma_map_limit_test.iommufd_compat_type1.end_of_address_space
# ok 18 vfio_dma_map_limit_test.iommufd_compat_type1.end_of_address_space
# # RUN vfio_dma_map_limit_test.iommufd_compat_type1v2.end_of_address_space ...
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# # OK vfio_dma_map_limit_test.iommufd_compat_type1v2.end_of_address_space
# ok 19 vfio_dma_map_limit_test.iommufd_compat_type1v2.end_of_address_space
# # RUN vfio_dma_map_limit_test.iommufd.end_of_address_space ...
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
# Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
diff --git a/tools/testing/selftests/vfio/lib/include/vfio_util.h b/tools/testing/selftests/vfio/lib/include/vfio_util.h
index ed31606e01b7..8e9d40845ccc 100644
--- a/tools/testing/selftests/vfio/lib/include/vfio_util.h
+++ b/tools/testing/selftests/vfio/lib/include/vfio_util.h
@@ -208,8 +208,9 @@ void vfio_pci_device_reset(struct vfio_pci_device *device);
void vfio_pci_dma_map(struct vfio_pci_device *device,
struct vfio_dma_region *region);
-void vfio_pci_dma_unmap(struct vfio_pci_device *device,
- struct vfio_dma_region *region);
+u64 vfio_pci_dma_unmap(struct vfio_pci_device *device,
+ struct vfio_dma_region *region);
+u64 vfio_pci_dma_unmap_all(struct vfio_pci_device *device);
void vfio_pci_config_access(struct vfio_pci_device *device, bool write,
size_t config, size_t size, void *data);
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index 0921b2451ba5..f5ae68a7df9c 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -183,7 +183,7 @@ void vfio_pci_dma_map(struct vfio_pci_device *device,
list_add(&region->link, &device->dma_regions);
}
-static void vfio_iommu_dma_unmap(struct vfio_pci_device *device,
+static u64 vfio_iommu_dma_unmap(struct vfio_pci_device *device,
struct vfio_dma_region *region)
{
struct vfio_iommu_type1_dma_unmap args = {
@@ -193,9 +193,25 @@ static void vfio_iommu_dma_unmap(struct vfio_pci_device *device,
};
ioctl_assert(device->container_fd, VFIO_IOMMU_UNMAP_DMA, &args);
+
+ return args.size;
+}
+
+static u64 vfio_iommu_dma_unmap_all(struct vfio_pci_device *device)
+{
+ struct vfio_iommu_type1_dma_unmap args = {
+ .argsz = sizeof(args),
+ .iova = 0,
+ .size = 0,
+ .flags = VFIO_DMA_UNMAP_FLAG_ALL,
+ };
+
+ ioctl_assert(device->container_fd, VFIO_IOMMU_UNMAP_DMA, &args);
+
+ return args.size;
}
-static void iommufd_dma_unmap(struct vfio_pci_device *device,
+static u64 iommufd_dma_unmap(struct vfio_pci_device *device,
struct vfio_dma_region *region)
{
struct iommu_ioas_unmap args = {
@@ -206,17 +222,54 @@ static void iommufd_dma_unmap(struct vfio_pci_device *device,
};
ioctl_assert(device->iommufd, IOMMU_IOAS_UNMAP, &args);
+
+ return args.length;
+}
+
+static u64 iommufd_dma_unmap_all(struct vfio_pci_device *device)
+{
+ struct iommu_ioas_unmap args = {
+ .size = sizeof(args),
+ .iova = 0,
+ .length = UINT64_MAX,
+ .ioas_id = device->ioas_id,
+ };
+
+ ioctl_assert(device->iommufd, IOMMU_IOAS_UNMAP, &args);
+
+ return args.length;
}
-void vfio_pci_dma_unmap(struct vfio_pci_device *device,
+u64 vfio_pci_dma_unmap(struct vfio_pci_device *device,
struct vfio_dma_region *region)
{
+ u64 unmapped;
+
if (device->iommufd)
- iommufd_dma_unmap(device, region);
+ unmapped = iommufd_dma_unmap(device, region);
else
- vfio_iommu_dma_unmap(device, region);
+ unmapped = vfio_iommu_dma_unmap(device, region);
list_del(&region->link);
+
+ return unmapped;
+}
+
+u64 vfio_pci_dma_unmap_all(struct vfio_pci_device *device)
+{
+ u64 unmapped;
+ struct vfio_dma_region *curr, *next;
+
+ if (device->iommufd)
+ unmapped = iommufd_dma_unmap_all(device);
+ else
+ unmapped = vfio_iommu_dma_unmap_all(device);
+
+ list_for_each_entry_safe(curr, next, &device->dma_regions, link) {
+ list_del(&curr->link);
+ }
+
+ return unmapped;
}
static void vfio_pci_region_get(struct vfio_pci_device *device, int index,
diff --git a/tools/testing/selftests/vfio/vfio_dma_mapping_test.c b/tools/testing/selftests/vfio/vfio_dma_mapping_test.c
index ab19c54a774d..e908c1fe7103 100644
--- a/tools/testing/selftests/vfio/vfio_dma_mapping_test.c
+++ b/tools/testing/selftests/vfio/vfio_dma_mapping_test.c
@@ -122,6 +122,8 @@ FIXTURE_TEARDOWN(vfio_dma_mapping_test)
vfio_pci_device_cleanup(self->device);
}
+#undef FIXTURE_VARIANT_ADD_IOMMU_MODE
+
TEST_F(vfio_dma_mapping_test, dma_map_unmap)
{
const u64 size = variant->size ?: getpagesize();
@@ -192,6 +194,61 @@ TEST_F(vfio_dma_mapping_test, dma_map_unmap)
ASSERT_TRUE(!munmap(region.vaddr, size));
}
+FIXTURE(vfio_dma_map_limit_test) {
+ struct vfio_pci_device *device;
+};
+
+FIXTURE_VARIANT(vfio_dma_map_limit_test) {
+ const char *iommu_mode;
+};
+
+#define FIXTURE_VARIANT_ADD_IOMMU_MODE(_iommu_mode) \
+FIXTURE_VARIANT_ADD(vfio_dma_map_limit_test, _iommu_mode) { \
+ .iommu_mode = #_iommu_mode, \
+}
+
+FIXTURE_VARIANT_ADD_ALL_IOMMU_MODES();
+
+FIXTURE_SETUP(vfio_dma_map_limit_test)
+{
+ self->device = vfio_pci_device_init(device_bdf, variant->iommu_mode);
+}
+
+FIXTURE_TEARDOWN(vfio_dma_map_limit_test)
+{
+ vfio_pci_device_cleanup(self->device);
+}
+
+#undef FIXTURE_VARIANT_ADD_IOMMU_MODE
+
+TEST_F(vfio_dma_map_limit_test, end_of_address_space)
+{
+ struct vfio_dma_region region = {};
+ u64 size = getpagesize();
+ u64 unmapped;
+
+ region.vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ ASSERT_NE(region.vaddr, MAP_FAILED);
+
+ region.iova = ~(iova_t)0 & ~(size - 1);
+ region.size = size;
+
+ vfio_pci_dma_map(self->device, &region);
+ printf("Mapped HVA %p (size 0x%lx) at IOVA 0x%lx\n", region.vaddr, size, region.iova);
+ ASSERT_EQ(region.iova, to_iova(self->device, region.vaddr));
+
+ unmapped = vfio_pci_dma_unmap(self->device, &region);
+ ASSERT_EQ(unmapped, size);
+
+ vfio_pci_dma_map(self->device, &region);
+ printf("Mapped HVA %p (size 0x%lx) at IOVA 0x%lx\n", region.vaddr, size, region.iova);
+ ASSERT_EQ(region.iova, to_iova(self->device, region.vaddr));
+
+ unmapped = vfio_pci_dma_unmap_all(self->device);
+ ASSERT_EQ(unmapped, size);
+}
+
int main(int argc, char *argv[])
{
device_bdf = vfio_selftests_get_bdf(&argc, argv);
On Tue, Oct 21, 2025 at 12:13 PM Alex Mastro <amastro@fb.com> wrote:
>
> On Tue, Oct 21, 2025 at 09:31:59AM -0700, David Matlack wrote:
> > On Tue, Oct 21, 2025 at 9:26 AM Alex Mastro <amastro@fb.com> wrote:
> > > On Mon, Oct 20, 2025 at 03:36:33PM -0600, Alex Williamson wrote:
> > > > Should we also therefore expand the DMA mapping tests in
> > > > tools/testing/selftests/vfio to include an end of address space test?
> > >
> > > Yes. I will append such a commit to the end of the series in v5. Our VFIO tests
> > > are built on top of a hermetic rust wrapper library over VFIO ioctls, but they
> > > aren't quite ready to be open sourced yet.
> >
> > Feel free to reach out if you have any questions about writing or
> > running the VFIO selftests.
>
> Thanks David. I built and ran using below. I am not too familiar with
> kselftests, so open to tips.
>
> $ make LLVM=1 -j kselftest-install INSTALL_PATH=/tmp/kst TARGETS="vfio"
> $ VFIO_SELFTESTS_BDF=0000:05:00.0 /tmp/kst/run_kselftest.sh
>
> I added the following. Is this the right direction? Is multiple fixtures per
> file OK? Seems related enough to vfio_dma_mapping_test.c to keep together.
Adding a fixture to vfio_dma_mapping_test.c is what I was imagining as well.
Overall looks good, we can hash out the specifics in the patches if
you prefer. But I added some thoughts below.
>
> I updated the *_unmap function signatures to return the count of bytes unmapped,
> since that is part of the test pass criteria. Also added unmap_all flavors,
> since those exercise different code paths than range-based unmap.
When you send, can you introduce these in a separate commit and update
the existing test function in vfio_dma_mapping_test.c to assert on it?
>
> Relevant test output:
>
> # # RUN vfio_dma_map_limit_test.vfio_type1_iommu.end_of_address_space ...
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # # OK vfio_dma_map_limit_test.vfio_type1_iommu.end_of_address_space
> # ok 16 vfio_dma_map_limit_test.vfio_type1_iommu.end_of_address_space
> # # RUN vfio_dma_map_limit_test.vfio_type1v2_iommu.end_of_address_space ...
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # # OK vfio_dma_map_limit_test.vfio_type1v2_iommu.end_of_address_space
> # ok 17 vfio_dma_map_limit_test.vfio_type1v2_iommu.end_of_address_space
> # # RUN vfio_dma_map_limit_test.iommufd_compat_type1.end_of_address_space ...
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # # OK vfio_dma_map_limit_test.iommufd_compat_type1.end_of_address_space
> # ok 18 vfio_dma_map_limit_test.iommufd_compat_type1.end_of_address_space
> # # RUN vfio_dma_map_limit_test.iommufd_compat_type1v2.end_of_address_space ...
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # # OK vfio_dma_map_limit_test.iommufd_compat_type1v2.end_of_address_space
> # ok 19 vfio_dma_map_limit_test.iommufd_compat_type1v2.end_of_address_space
> # # RUN vfio_dma_map_limit_test.iommufd.end_of_address_space ...
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
> # Mapped HVA 0x7f6638222000 (size 0x1000) at IOVA 0xfffffffffffff000
>
> diff --git a/tools/testing/selftests/vfio/lib/include/vfio_util.h b/tools/testing/selftests/vfio/lib/include/vfio_util.h
> index ed31606e01b7..8e9d40845ccc 100644
> --- a/tools/testing/selftests/vfio/lib/include/vfio_util.h
> +++ b/tools/testing/selftests/vfio/lib/include/vfio_util.h
> @@ -208,8 +208,9 @@ void vfio_pci_device_reset(struct vfio_pci_device *device);
>
> void vfio_pci_dma_map(struct vfio_pci_device *device,
> struct vfio_dma_region *region);
> -void vfio_pci_dma_unmap(struct vfio_pci_device *device,
> - struct vfio_dma_region *region);
> +u64 vfio_pci_dma_unmap(struct vfio_pci_device *device,
> + struct vfio_dma_region *region);
> +u64 vfio_pci_dma_unmap_all(struct vfio_pci_device *device);
>
> void vfio_pci_config_access(struct vfio_pci_device *device, bool write,
> size_t config, size_t size, void *data);
> diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> index 0921b2451ba5..f5ae68a7df9c 100644
> --- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> +++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
> @@ -183,7 +183,7 @@ void vfio_pci_dma_map(struct vfio_pci_device *device,
> list_add(&region->link, &device->dma_regions);
> }
>
> -static void vfio_iommu_dma_unmap(struct vfio_pci_device *device,
> +static u64 vfio_iommu_dma_unmap(struct vfio_pci_device *device,
> struct vfio_dma_region *region)
> {
> struct vfio_iommu_type1_dma_unmap args = {
> @@ -193,9 +193,25 @@ static void vfio_iommu_dma_unmap(struct vfio_pci_device *device,
> };
>
> ioctl_assert(device->container_fd, VFIO_IOMMU_UNMAP_DMA, &args);
> +
> + return args.size;
> +}
> +
> +static u64 vfio_iommu_dma_unmap_all(struct vfio_pci_device *device)
> +{
> + struct vfio_iommu_type1_dma_unmap args = {
> + .argsz = sizeof(args),
> + .iova = 0,
> + .size = 0,
> + .flags = VFIO_DMA_UNMAP_FLAG_ALL,
> + };
> +
> + ioctl_assert(device->container_fd, VFIO_IOMMU_UNMAP_DMA, &args);
> +
> + return args.size;
> }
>
> -static void iommufd_dma_unmap(struct vfio_pci_device *device,
> +static u64 iommufd_dma_unmap(struct vfio_pci_device *device,
> struct vfio_dma_region *region)
> {
> struct iommu_ioas_unmap args = {
> @@ -206,17 +222,54 @@ static void iommufd_dma_unmap(struct vfio_pci_device *device,
> };
>
> ioctl_assert(device->iommufd, IOMMU_IOAS_UNMAP, &args);
> +
> + return args.length;
> +}
> +
> +static u64 iommufd_dma_unmap_all(struct vfio_pci_device *device)
> +{
> + struct iommu_ioas_unmap args = {
> + .size = sizeof(args),
> + .iova = 0,
> + .length = UINT64_MAX,
> + .ioas_id = device->ioas_id,
> + };
> +
> + ioctl_assert(device->iommufd, IOMMU_IOAS_UNMAP, &args);
> +
> + return args.length;
> }
>
> -void vfio_pci_dma_unmap(struct vfio_pci_device *device,
> +u64 vfio_pci_dma_unmap(struct vfio_pci_device *device,
> struct vfio_dma_region *region)
> {
> + u64 unmapped;
> +
> if (device->iommufd)
> - iommufd_dma_unmap(device, region);
> + unmapped = iommufd_dma_unmap(device, region);
> else
> - vfio_iommu_dma_unmap(device, region);
> + unmapped = vfio_iommu_dma_unmap(device, region);
>
> list_del(&region->link);
> +
> + return unmapped;
> +}
> +
> +u64 vfio_pci_dma_unmap_all(struct vfio_pci_device *device)
> +{
> + u64 unmapped;
> + struct vfio_dma_region *curr, *next;
> +
> + if (device->iommufd)
> + unmapped = iommufd_dma_unmap_all(device);
> + else
> + unmapped = vfio_iommu_dma_unmap_all(device);
> +
> + list_for_each_entry_safe(curr, next, &device->dma_regions, link) {
> + list_del(&curr->link);
> + }
> +
> + return unmapped;
> }
>
> static void vfio_pci_region_get(struct vfio_pci_device *device, int index,
> diff --git a/tools/testing/selftests/vfio/vfio_dma_mapping_test.c b/tools/testing/selftests/vfio/vfio_dma_mapping_test.c
> index ab19c54a774d..e908c1fe7103 100644
> --- a/tools/testing/selftests/vfio/vfio_dma_mapping_test.c
> +++ b/tools/testing/selftests/vfio/vfio_dma_mapping_test.c
> @@ -122,6 +122,8 @@ FIXTURE_TEARDOWN(vfio_dma_mapping_test)
> vfio_pci_device_cleanup(self->device);
> }
>
> +#undef FIXTURE_VARIANT_ADD_IOMMU_MODE
I think this can/should go just after the
FIXTURE_VARIANT_ADD_ALL_IOMMU_MODES(); statement. The same below.
> +
> TEST_F(vfio_dma_mapping_test, dma_map_unmap)
> {
> const u64 size = variant->size ?: getpagesize();
> @@ -192,6 +194,61 @@ TEST_F(vfio_dma_mapping_test, dma_map_unmap)
> ASSERT_TRUE(!munmap(region.vaddr, size));
> }
>
> +FIXTURE(vfio_dma_map_limit_test) {
> + struct vfio_pci_device *device;
> +};
> +
> +FIXTURE_VARIANT(vfio_dma_map_limit_test) {
> + const char *iommu_mode;
> +};
> +
> +#define FIXTURE_VARIANT_ADD_IOMMU_MODE(_iommu_mode) \
> +FIXTURE_VARIANT_ADD(vfio_dma_map_limit_test, _iommu_mode) { \
> + .iommu_mode = #_iommu_mode, \
> +}
> +
> +FIXTURE_VARIANT_ADD_ALL_IOMMU_MODES();
> +
> +FIXTURE_SETUP(vfio_dma_map_limit_test)
> +{
> + self->device = vfio_pci_device_init(device_bdf, variant->iommu_mode);
> +}
> +
> +FIXTURE_TEARDOWN(vfio_dma_map_limit_test)
> +{
> + vfio_pci_device_cleanup(self->device);
> +}
> +
> +#undef FIXTURE_VARIANT_ADD_IOMMU_MODE
> +
> +TEST_F(vfio_dma_map_limit_test, end_of_address_space)
> +{
> + struct vfio_dma_region region = {};
> + u64 size = getpagesize();
> + u64 unmapped;
> +
> + region.vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> + MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> + ASSERT_NE(region.vaddr, MAP_FAILED);
> +
> + region.iova = ~(iova_t)0 & ~(size - 1);
> + region.size = size;
> +
> + vfio_pci_dma_map(self->device, &region);
> + printf("Mapped HVA %p (size 0x%lx) at IOVA 0x%lx\n", region.vaddr, size, region.iova);
> + ASSERT_EQ(region.iova, to_iova(self->device, region.vaddr));
> +
> + unmapped = vfio_pci_dma_unmap(self->device, &region);
> + ASSERT_EQ(unmapped, size);
> +
> + vfio_pci_dma_map(self->device, &region);
> + printf("Mapped HVA %p (size 0x%lx) at IOVA 0x%lx\n", region.vaddr, size, region.iova);
> + ASSERT_EQ(region.iova, to_iova(self->device, region.vaddr));
> +
> + unmapped = vfio_pci_dma_unmap_all(self->device);
> + ASSERT_EQ(unmapped, size);
The unmap_all test should probably be in a separate TEST_F. You can
put the struct vfio_dma_region in the FIXTURE and initialize it in the
FIXTURE_SETUP() to reduce code duplication.
> +}
Would it be useful to add negative map/unmap tests as well? If so we'd
need a way to plumb the return value of the ioctl up to the caller so
you can assert that it failed, which will conflict with returning the
amount of unmapped bytes.
Maybe we should make unmapped an output parameter like so?
int __vfio_pci_dma_map(struct vfio_pci_device *device,
                       struct vfio_dma_region *region);

void vfio_pci_dma_map(struct vfio_pci_device *device,
                      struct vfio_dma_region *region);

int __vfio_pci_dma_unmap(struct vfio_pci_device *device,
                         struct vfio_dma_region *region, u64 *unmapped);

void vfio_pci_dma_unmap(struct vfio_pci_device *device,
                        struct vfio_dma_region *region, u64 *unmapped);

int __vfio_pci_dma_unmap_all(struct vfio_pci_device *device, u64 *unmapped);
void vfio_pci_dma_unmap_all(struct vfio_pci_device *device, u64 *unmapped);
unmapped can be optional and callers that don't care can pass in NULL.
It'll be a little gross though to see NULL on all the unmap calls
though... Maybe unmapped can be restricted to __vfio_pci_dma_unmap().
So something like this:
int __vfio_pci_dma_unmap(struct vfio_pci_device *device,
                         struct vfio_dma_region *region, u64 *unmapped);

void vfio_pci_dma_unmap(struct vfio_pci_device *device,
                        struct vfio_dma_region *region);
Thanks David -- this is good feedback. Will roll these suggestions into v5.

On Tue, Oct 21, 2025 at 05:38:31PM -0700, David Matlack wrote:
> On Tue, Oct 21, 2025 at 12:13 PM Alex Mastro <amastro@fb.com> wrote:
> > I updated the *_unmap function signatures to return the count of bytes unmapped,
> > since that is part of the test pass criteria. Also added unmap_all flavors,
> > since those exercise different code paths than range-based unmap.
>
> When you send, can you introduce these in a separate commit and update
> the existing test function in vfio_dma_mapping_test.c to assert on it?

SGTM

> > +#undef FIXTURE_VARIANT_ADD_IOMMU_MODE
>
> I think this can/should go just after the
> FIXTURE_VARIANT_ADD_ALL_IOMMU_MODES(); statement. The same below.

Ack.

> > +	unmapped = vfio_pci_dma_unmap_all(self->device);
> > +	ASSERT_EQ(unmapped, size);
>
> The unmap_all test should probably be in a separate TEST_F. You can
> put the struct vfio_dma_region in the FIXTURE and initialize it in the
> FIXTURE_SETUP() to reduce code duplication.
>
> > +}

Makes sense.

> Would it be useful to add negative map/unmap tests as well? If so we'd
> need a way to plumb the return value of the ioctl up to the caller so
> you can assert that it failed, which will conflict with returning the
> amount of unmapped bytes.

Testing negative cases would be useful. Not sure about the mechanics yet.

> Maybe we should make unmapped an output parameter like so?
>
> int __vfio_pci_dma_map(struct vfio_pci_device *device,
>                        struct vfio_dma_region *region);
>
> void vfio_pci_dma_map(struct vfio_pci_device *device,
>                       struct vfio_dma_region *region);
>
> int __vfio_pci_dma_unmap(struct vfio_pci_device *device,
>                          struct vfio_dma_region *region, u64 *unmapped);
>
> void vfio_pci_dma_unmap(struct vfio_pci_device *device,
>                         struct vfio_dma_region *region, u64 *unmapped);
>
> int __vfio_pci_dma_unmap_all(struct vfio_pci_device *device, u64 *unmapped);
> void vfio_pci_dma_unmap_all(struct vfio_pci_device *device, u64 *unmapped);
>
> unmapped can be optional and callers that don't care can pass in NULL.
> It'll be a little gross though to see NULL on all the unmap calls
> though... Maybe unmapped can be restricted to __vfio_pci_dma_unmap().
> So something like this:
>
> int __vfio_pci_dma_unmap(struct vfio_pci_device *device,
>                          struct vfio_dma_region *region, u64 *unmapped);
>
> void vfio_pci_dma_unmap(struct vfio_pci_device *device,
>                         struct vfio_dma_region *region);

I'll put some thought into this and propose something in v5.
On 10/13/25 1:32 AM, Alex Mastro wrote:
> This patch series aims to fix vfio_iommu_type.c to support
> VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
> ranges which lie against the addressable limit. i.e. ranges where
> iova_start + iova_size would overflow to exactly zero.

I sent a reply to Patch 3 with a small nit, but regardless of whether you
choose to make a change or not, the current changes look good to me.

Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>

Thank you,
Alejandro
On Sun, Oct 12, 2025 at 10:32:23PM -0700, Alex Mastro wrote:
> This patch series aims to fix vfio_iommu_type.c to support
> VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA operations targeting IOVA
> ranges which lie against the addressable limit. i.e. ranges where
> iova_start + iova_size would overflow to exactly zero.

I didn't try to look through all the math changes but it looks like the
right thing to do to me

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason
Alex and Jason, during my testing, I found that the behavior of range-based
(!VFIO_DMA_UNMAP_FLAG_ALL) VFIO_IOMMU_UNMAP_DMA differs slightly when using
/dev/iommu as the container.

iommufd treats range-based unmap where there are no hits in the range as an
error, and the ioctl fails with ENOENT.

vfio_iommu_type1.c treats this as a success and reports zero bytes unmapped in
vfio_iommu_type1_dma_unmap.size.

It implies to me that we may need some more shimming in
drivers/iommu/iommufd/vfio_compat.c to iron out observable differences in
UAPI usage. Addressing it would be outside the scope of this patch series,
so this mail is just to make note of my findings.
On Sat, Oct 25, 2025 at 11:11:49AM -0700, Alex Mastro wrote:
> Alex and Jason, during my testing, I found that the behavior of range-based
> (!VFIO_DMA_UNMAP_FLAG_ALL) VFIO_IOMMU_UNMAP_DMA differs slightly when using
> /dev/iommu as the container.
>
> iommufd treats range-based unmap where there are no hits in the range as an
> error, and the ioctl fails with ENOENT.
>
> vfio_iommu_type1.c treats this as a success and reports zero bytes unmapped in
> vfio_iommu_type1_dma_unmap.size.

Oh, weird...

What do you think about this:

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index c0360c450880b8..1124f68ec9020d 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -707,7 +707,8 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 	struct iopt_area *area;
 	unsigned long unmapped_bytes = 0;
 	unsigned int tries = 0;
-	int rc = -ENOENT;
+	/* If there are no mapped entries then success */
+	int rc = 0;
 
 	/*
 	 * The domains_rwsem must be held in read mode any time any area->pages
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index 1542c5fd10a85c..ef5e56672dea56 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -367,6 +367,8 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
 				     &unmapped);
 		if (rc)
 			goto out_put;
+		if (!unmapped)
+			rc = -ENOENT;
 	}
 
 	cmd->length = unmapped;

Thanks,
Jason
On Mon, Oct 27, 2025 at 10:39:04AM -0300, Jason Gunthorpe wrote:
> On Sat, Oct 25, 2025 at 11:11:49AM -0700, Alex Mastro wrote:
> > Alex and Jason, during my testing, I found that the behavior of range-based
> > (!VFIO_DMA_UNMAP_FLAG_ALL) VFIO_IOMMU_UNMAP_DMA differs slightly when using
> > /dev/iommu as the container.
> >
> > iommufd treats range-based unmap where there are no hits in the range as an
> > error, and the ioctl fails with ENOENT.
> >
> > vfio_iommu_type1.c treats this as a success and reports zero bytes unmapped in
> > vfio_iommu_type1_dma_unmap.size.
>
> Oh, weird...
>
> What do you think about this:
>
> diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
> index c0360c450880b8..1124f68ec9020d 100644
> --- a/drivers/iommu/iommufd/io_pagetable.c
> +++ b/drivers/iommu/iommufd/io_pagetable.c
> @@ -707,7 +707,8 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
>  	struct iopt_area *area;
>  	unsigned long unmapped_bytes = 0;
>  	unsigned int tries = 0;
> -	int rc = -ENOENT;
> +	/* If there are no mapped entries then success */
> +	int rc = 0;
>  
>  	/*
>  	 * The domains_rwsem must be held in read mode any time any area->pages
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 1542c5fd10a85c..ef5e56672dea56 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -367,6 +367,8 @@ int iommufd_ioas_unmap(struct iommufd_ucmd *ucmd)
>  				     &unmapped);
>  		if (rc)
>  			goto out_put;
> +		if (!unmapped)
> +			rc = -ENOENT;
>  	}
>  
>  	cmd->length = unmapped;

Seems reasonable to me. The only affected callers are:

drivers/iommu/iommufd/ioas.c
366:		rc = iopt_unmap_iova(&ioas->iopt, cmd->iova, cmd->length,

drivers/iommu/iommufd/vfio_compat.c
244:	rc = iopt_unmap_iova(&ioas->iopt, unmap.iova, unmap.size,

So your proposal should get vfio_compat.c into good shape.
I think these locations need more scrutiny after your change:

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index c0360c450880..e271696f726f 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -777,6 +777,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 		down_write(&iopt->iova_rwsem);
 	}
 
+	// redundant?
 	if (unmapped_bytes)
 		rc = 0;
 
@@ -818,6 +819,7 @@ int iopt_unmap_all(struct io_pagetable *iopt, unsigned long *unmapped)
 	int rc;
 
 	rc = iopt_unmap_iova_range(iopt, 0, ULONG_MAX, unmapped);
+	// intent still holds?
 	/* If the IOVAs are empty then unmap all succeeds */
 	if (rc == -ENOENT)
 		return 0;