[PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alex Mastro 2 months, 1 week ago
vfio_find_dma_first_node is called to find the first dma node to unmap
given an unmap range of [iova..iova+size). The check at the end of the
function intends to test if the dma result lies beyond the end of the
unmap range. The condition is incorrectly satisfied when iova+size
overflows to zero, causing the function to return NULL.

The same issue happens inside vfio_dma_do_unmap's while loop.

Fix by comparing to the inclusive range end, which can be expressed
by u64.

This bug was discovered after querying for vfio_iova_range's via
VFIO_IOMMU_GET_INFO, making a VFIO_IOMMU_MAP_DMA inside the last range,
and then attempting to unmap the entirety of the last range i.e.
VFIO_IOMMU_UNMAP_DMA(iova=r.start, size=r.end-r.start+1).
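
For illustration, a minimal userspace sketch of that sequence (only the
unmap call is shown; the container fd and the queried last range r are
assumed to have been set up as described above):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Unmap the entirety of the last reported iova range. When r.end is the
 * top of the u64 space, size becomes r.end - r.start + 1 and iova + size
 * wraps to zero inside vfio_iommu_type1, so the buggy ">= iova + size"
 * comparisons are trivially true for every node and nothing is unmapped.
 */
static int unmap_last_range(int container_fd, uint64_t start, uint64_t end)
{
        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = start,           /* r.start */
                .size  = end - start + 1, /* r.end - r.start + 1 */
        };

        return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}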

---
I don't think iommufd is susceptible to the same issue since
iopt_unmap_iova computes the inclusive end using checked addition, and
iopt_unmap_iova_range acts on an inclusive range.

Signed-off-by: Alex Mastro <amastro@fb.com>
---
 drivers/vfio/vfio_iommu_type1.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f8d68fe77b41..08242d8ce2ca 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -199,7 +199,7 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
 			node = node->rb_right;
 		}
 	}
-	if (res && size && dma_res->iova >= start + size)
+	if (res && size && dma_res->iova > start + size - 1)
 		res = NULL;
 	return res;
 }
@@ -1386,7 +1386,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 	while (n) {
 		dma = rb_entry(n, struct vfio_dma, node);
-		if (dma->iova >= iova + size)
+		if (dma->iova > iova + size - 1)
 			break;
 
 		if (!iommu->v2 && iova > dma->iova)

---
base-commit: 407aa63018d15c35a34938633868e61174d2ef6e
change-id: 20251005-fix-unmap-c3f3e87dabfa

Best regards,
-- 
Alex Mastro <amastro@fb.com>
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Jason Gunthorpe 2 months, 1 week ago
On Sun, Oct 05, 2025 at 08:38:42PM -0700, Alex Mastro wrote:
> vfio_find_dma_first_node is called to find the first dma node to unmap
> given an unmap range of [iova..iova+size). The check at the end of the
> function intends to test if the dma result lies beyond the end of the
> unmap range. The condition is incorrectly satisfied when iova+size
> overflows to zero, causing the function to return NULL.
> 
> The same issue happens inside vfio_dma_do_unmap's while loop.
> 
> Fix by comparing to the inclusive range end, which can be expressed
> by u64.
> 
> This bug was discovered after querying for vfio_iova_range's via
> VFIO_IOMMU_GET_INFO, making a VFIO_IOMMU_MAP_DMA inside the last range,
> and then attempting to unmap the entirety of the last range i.e.
> VFIO_IOMMU_UNMAP_DMA(iova=r.start, size=r.end-r.start+1).
> 
> ---
> I don't think iommufd is susceptible to the same issue since
> iopt_unmap_iova computes the inclusive end using checked addition, and
> iopt_unmap_iova_range acts on an inclusive range.

Yeah, iommufd was careful to use inclusive ranges so that ULONG_MAX
can be a valid IOVA.

This doesn't seem complete though, if the range ends at the ULONG_MAX
then these are not working either:

		if (start < dma->iova + dma->size) {

?

And I see a few more instances like that eg in
vfio_iova_dirty_bitmap(), vfio_dma_do_unmap(), vfio_iommu_replay()

Jason
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alex Mastro 2 months, 1 week ago
On Mon, Oct 06, 2025 at 09:16:18AM -0300, Jason Gunthorpe wrote:
> This doesn't seem complete though, if the range ends at the ULONG_MAX
> then these are not working either:
> 
> 		if (start < dma->iova + dma->size) {
> 
> ?
> 
> And I see a few more instances like that eg in
> vfio_iova_dirty_bitmap(), vfio_dma_do_unmap(), vfio_iommu_replay()

You are right. There are several places which would need to be fixed to handle
mappings which lie against the end of the addressable range. At least these
would need to be vetted:

$ rg 'iova.*\+.*size' -n drivers/vfio/vfio_iommu_type1.c | rg -v '\- 1'
173:            else if (start >= dma->iova + dma->size)
192:            if (start < dma->iova + dma->size) {
216:            if (new->iova + new->size <= dma->iova)
1060:   dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
1233:   if (dma && dma->iova + dma->size != iova + size)
1380:           if (dma && dma->iova + dma->size != iova + size)
1501:           ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
1504:                   vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
1721:           while (iova < dma->iova + dma->size) {
1743:                           i = iova + size;
1744:                           while (i < dma->iova + dma->size &&
1754:                           size_t n = dma->iova + dma->size - iova;
1785:                   iova += size;
1810:           while (iova < dma->iova + dma->size) {
1823:                   i = iova + size;
1824:                   while (i < dma->iova + dma->size &&
2919:           if (range.iova + range.size < range.iova)

I could take a stab at improving this, but am not confident I could test all the
affected codepaths.
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Jason Gunthorpe 2 months, 1 week ago
On Mon, Oct 06, 2025 at 09:29:07AM -0700, Alex Mastro wrote:
> On Mon, Oct 06, 2025 at 09:16:18AM -0300, Jason Gunthorpe wrote:
> > This doesn't seem complete though, if the range ends at the ULONG_MAX
> > then these are not working either:
> > 
> > 		if (start < dma->iova + dma->size) {
> > 
> > ?
> > 
> > And I see a few more instances like that eg in
> > vfio_iova_dirty_bitmap(), vfio_dma_do_unmap(), vfio_iommu_replay()
> 
> You are right. There are several places which would need to be fixed to handle
> mappings which lie against the end of the addressable range. At least these
> would need to be vetted:

Could we block right at the ioctl inputs that end at ULONG_MAX? Maybe
that is a good enough fix?

Jason
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alex Mastro 2 months, 1 week ago
On Mon, Oct 06, 2025 at 07:50:39PM -0300, Jason Gunthorpe wrote:
> Could we block right at the ioctl inputs that end at ULONG_MAX? Maybe
> that is a good enough fix?

That would be a simpler, and perhaps less risky fix than surgically fixing every
affected place.

If we were to do that, I think VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE should
report coherent limits -- i.e. it should "lie" and say the last page (or
whatever) of u64 addressable space is inaccessible so that the UAPI is coherent
with itself. It should be legal to map/unmap iova ranges up to the limits it
claims to support.

I have doubts that anyone actually relies on MAP_DMA-ing such
end-of-u64-mappings in practice, so perhaps it's OK?
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alejandro Jimenez 2 months, 1 week ago
Hi Alex, Jason,

On 10/6/25 8:39 PM, Alex Mastro wrote:
> On Mon, Oct 06, 2025 at 07:50:39PM -0300, Jason Gunthorpe wrote:
>> Could we block right at the ioctl inputs that end at ULONG_MAX? Maybe
>> that is a good enough fix?
> 
> That would be a simpler, and perhaps less risky fix than surgically fixing every
> affected place.
> 

If going this way, we'd also have to deny the MAP requests. Right now we 
have this problem because you can map a range ending at the top of the 
address space, but not unmap it. The map operation doesn't overflow 
because vfio_dma_do_map() calls vfio_link_dma() with dma->size == 0, and 
it grows the size later in vfio_pin_map_dma()...

https://elixir.bootlin.com/linux/v6.17.1/source/drivers/vfio/vfio_iommu_type1.c#L1676-L1677

> If we were to do that, I think VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE should
> report coherent limits -- i.e. it should "lie" and say the last page (or
> whatever) of u64 addressable space is inaccessible so that the UAPI is coherent
> with itself. It should be legal to map/unmap iova ranges up to the limits it
> claims to support.
> 
Agree.

> I have doubts that anyone actually relies on MAP_DMA-ing such
> end-of-u64-mappings in practice, so perhaps it's OK?
> 

The AMD IOMMU supports a 64-bit IOVA, so when using the AMD vIOMMU with 
DMA remapping enabled + VF passthrough + a Linux guest with 
iommu.forcedac=1 we hit this issue since the driver (mlx5) starts 
requesting mappings for IOVAs right at the top of the address space.

I mentioned this issue on the cover letter for:
https://lore.kernel.org/qemu-devel/20250919213515.917111-1-alejandro.j.jimenez@oracle.com/

and linked to my candidate fix:

https://github.com/aljimenezb/linux/commit/014be8cafe7464d278729583a2dd5d94514e2e2a

The reason why I hadn't sent it to the list yet is because I noticed the 
additional locations Jason mentioned earlier in the thread (e.g. 
vfio_find_dma(), vfio_iommu_replay()) and wanted to put together a 
reproducer that also triggered those paths.

I mentioned in the notes for the patch above why I chose a slightly more 
complex method than the '- 1' approach, since there is a chance that 
iova+size could also go beyond the end of the address space and actually 
wrap around.

My goal was not to trust the inputs at all if possible. We could also 
use check_add_overflow() if we want to add explicit error reporting in 
vfio_iommu_type1 when an overflow is detected. i.e. catch bad callers 
that don't sanitize their inputs as opposed to letting them handle the 
error on the return path...
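
For example, a minimal sketch of that kind of explicit check (purely
illustrative; the helper name and where it would live are not from any
actual patch):

#include <linux/errno.h>
#include <linux/overflow.h>
#include <linux/types.h>

/*
 * Illustrative only: reject an empty range and any iova/size pair whose
 * inclusive end does not fit in a u64.  check_add_overflow() returns
 * true when the addition wraps, so a range ending exactly at U64_MAX is
 * still accepted and *last holds its inclusive end.
 */
static int vfio_validate_iova_range(u64 iova, u64 size, u64 *last)
{
        if (!size)
                return -EINVAL;

        if (check_add_overflow(iova, size - 1, last))
                return -EINVAL;

        return 0;
}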

Thank you,
Alejandro
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Jason Gunthorpe 2 months, 1 week ago
On Mon, Oct 06, 2025 at 09:23:56PM -0400, Alejandro Jimenez wrote:

> I mentioned this issue on the cover letter for:
> https://lore.kernel.org/qemu-devel/20250919213515.917111-1-alejandro.j.jimenez@oracle.com/

Use iommufd for this kind of work?

> I mentioned in the notes for the patch above why I chose a slightly more
> complex method than the '- 1' approach, since there is a chance that
> iova+size could also go beyond the end of the address space and actually
> wrap around.

At the uapi boundary it should check that size != 0 and
!check_add_overflow(iova+size). It is much easier to understand if the
input from userspace is validated immediately at the uapi boundary.

Then the rest of the code safely computes the last with 'iova+size-1'
and does range logic based on last not end.

Jason
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alejandro Jimenez 2 months, 1 week ago

On 10/7/25 7:48 AM, Jason Gunthorpe wrote:
> On Mon, Oct 06, 2025 at 09:23:56PM -0400, Alejandro Jimenez wrote:
> 
>> I mentioned this issue on the cover letter for:
>> https://lore.kernel.org/qemu-devel/20250919213515.917111-1-alejandro.j.jimenez@oracle.com/
> 
> Use iommufd for this kind of work?
> 

ACK, need to test with iommufd as well.

>> I mentioned in the notes for the patch above why I chose a slightly more
>> complex method than the '- 1' approach, since there is a chance that
>> iova+size could also go beyond the end of the address space and actually
>> wrap around.
> 
> At the uapi boundary it should check that size != 0 and
> !check_add_overflow(iova+size). It is much easier to understand if the
> input from userspace is validated immediately at the uapi boundary.
> 
> Then the rest of the code safely computes the last with 'iova+size-1'
> and does range logic based on last not end.
> 

Makes sense, I just thought adding overflow detection/resiliency in the 
rest of the code is not that complex and provides additional safety.

Thank you,
Alejandro

> Jason
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alex Mastro 2 months, 1 week ago
On Mon, Oct 06, 2025 at 09:23:56PM -0400, Alejandro Jimenez wrote:
> If going this way, we'd also have to deny the MAP requests. Right now we

Agree.

> > I have doubts that anyone actually relies on MAP_DMA-ing such
> > end-of-u64-mappings in practice, so perhaps it's OK?
> > 
> 
> The AMD IOMMU supports a 64-bit IOVA, so when using the AMD vIOMMU with DMA
> remapping enabled + VF passthrough + a Linux guest with iommu.forcedac=1 we
> hit this issue since the driver (mlx5) starts requesting mappings for IOVAs
> right at the top of the address space.

Interesting!

> The reason why I hadn't sent it to the list yet is because I noticed the
> additional locations Jason mentioned earlier in the thread (e.g.
> vfio_find_dma(), vfio_iommu_replay()) and wanted to put together a
> reproducer that also triggered those paths.

I am not well equipped to test some of these paths, so if you have a recipe, I'd
be interested.
 
> I mentioned in the notes for the patch above why I chose a slightly more
> complex method than the '- 1' approach, since there is a chance that
> iova+size could also go beyond the end of the address space and actually
> wrap around.

I think certain invariants have broken if that is possible. The current checks
in the unmap path should prevent that (iova + size - 1 < iova).

1330          } else if (!size || size & (pgsize - 1) ||
1331                     iova + size - 1 < iova || size > SIZE_MAX) {
1332                  goto unlock;

> My goal was not to trust the inputs at all if possible. We could also use
> check_add_overflow() if we want to add explicit error reporting in
> vfio_iommu_type1 when an overflow is detected. i.e. catch bad callers that

I do like the explicitness of the check_* functions over the older style wrap
checks.

Below is my partially validated attempt at a more comprehensive fix if we were
to try and make end of address space mappings work, rather than blanking out
the last page.

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 08242d8ce2ca..66a25de35446 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -28,6 +28,7 @@
 #include <linux/iommu.h>
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/overflow.h>
 #include <linux/kthread.h>
 #include <linux/rbtree.h>
 #include <linux/sched/signal.h>
@@ -165,12 +166,14 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 {
 	struct rb_node *node = iommu->dma_list.rb_node;
 
+	BUG_ON(size == 0);
+
 	while (node) {
 		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
 
-		if (start + size <= dma->iova)
+		if (start + size - 1 < dma->iova)
 			node = node->rb_left;
-		else if (start >= dma->iova + dma->size)
+		else if (start > dma->iova + dma->size - 1)
 			node = node->rb_right;
 		else
 			return dma;
@@ -186,10 +189,12 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
 	struct rb_node *node = iommu->dma_list.rb_node;
 	struct vfio_dma *dma_res = NULL;
 
+	BUG_ON(size == 0);
+
 	while (node) {
 		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
 
-		if (start < dma->iova + dma->size) {
+		if (start <= dma->iova + dma->size - 1) {
 			res = node;
 			dma_res = dma;
 			if (start >= dma->iova)
@@ -199,7 +204,7 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
 			node = node->rb_right;
 		}
 	}
-	if (res && size && dma_res->iova > start + size - 1)
+	if (res && dma_res->iova > start + size - 1)
 		res = NULL;
 	return res;
 }
@@ -213,7 +218,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 		parent = *link;
 		dma = rb_entry(parent, struct vfio_dma, node);
 
-		if (new->iova + new->size <= dma->iova)
+		if (new->iova + new->size - 1 < dma->iova)
 			link = &(*link)->rb_left;
 		else
 			link = &(*link)->rb_right;
@@ -825,14 +830,24 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 	unsigned long remote_vaddr;
 	struct vfio_dma *dma;
 	bool do_accounting;
+	u64 end, to_pin;
 
-	if (!iommu || !pages)
+	if (!iommu || !pages || npage < 0)
 		return -EINVAL;
 
 	/* Supported for v2 version only */
 	if (!iommu->v2)
 		return -EACCES;
 
+	if (npage == 0)
+		return 0;
+
+	if (check_mul_overflow(npage, PAGE_SIZE, &to_pin))
+		return -EINVAL;
+
+	if (check_add_overflow(user_iova, to_pin - 1, &end))
+		return -EINVAL;
+
 	mutex_lock(&iommu->lock);
 
 	if (WARN_ONCE(iommu->vaddr_invalid_count,
@@ -997,7 +1012,7 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
 #define VFIO_IOMMU_TLB_SYNC_MAX		512
 
 static size_t unmap_unpin_fast(struct vfio_domain *domain,
-			       struct vfio_dma *dma, dma_addr_t *iova,
+			       struct vfio_dma *dma, dma_addr_t iova,
 			       size_t len, phys_addr_t phys, long *unlocked,
 			       struct list_head *unmapped_list,
 			       int *unmapped_cnt,
@@ -1007,18 +1022,16 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
 	struct vfio_regions *entry = kzalloc(sizeof(*entry), GFP_KERNEL);
 
 	if (entry) {
-		unmapped = iommu_unmap_fast(domain->domain, *iova, len,
+		unmapped = iommu_unmap_fast(domain->domain, iova, len,
 					    iotlb_gather);
 
 		if (!unmapped) {
 			kfree(entry);
 		} else {
-			entry->iova = *iova;
+			entry->iova = iova;
 			entry->phys = phys;
 			entry->len  = unmapped;
 			list_add_tail(&entry->list, unmapped_list);
-
-			*iova += unmapped;
 			(*unmapped_cnt)++;
 		}
 	}
@@ -1037,18 +1050,17 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
 }
 
 static size_t unmap_unpin_slow(struct vfio_domain *domain,
-			       struct vfio_dma *dma, dma_addr_t *iova,
+			       struct vfio_dma *dma, dma_addr_t iova,
 			       size_t len, phys_addr_t phys,
 			       long *unlocked)
 {
-	size_t unmapped = iommu_unmap(domain->domain, *iova, len);
+	size_t unmapped = iommu_unmap(domain->domain, iova, len);
 
 	if (unmapped) {
-		*unlocked += vfio_unpin_pages_remote(dma, *iova,
+		*unlocked += vfio_unpin_pages_remote(dma, iova,
 						     phys >> PAGE_SHIFT,
 						     unmapped >> PAGE_SHIFT,
 						     false);
-		*iova += unmapped;
 		cond_resched();
 	}
 	return unmapped;
@@ -1057,12 +1069,12 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
 static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			     bool do_accounting)
 {
-	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
 	struct vfio_domain *domain, *d;
 	LIST_HEAD(unmapped_region_list);
 	struct iommu_iotlb_gather iotlb_gather;
 	int unmapped_region_cnt = 0;
 	long unlocked = 0;
+	size_t pos = 0;
 
 	if (!dma->size)
 		return 0;
@@ -1086,13 +1098,14 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	}
 
 	iommu_iotlb_gather_init(&iotlb_gather);
-	while (iova < end) {
+	while (pos < dma->size) {
 		size_t unmapped, len;
 		phys_addr_t phys, next;
+		dma_addr_t iova = dma->iova + pos;
 
 		phys = iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
-			iova += PAGE_SIZE;
+			pos += PAGE_SIZE;
 			continue;
 		}
 
@@ -1101,7 +1114,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 		 * may require hardware cache flushing, try to find the
 		 * largest contiguous physical memory chunk to unmap.
 		 */
-		for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
+		for (len = PAGE_SIZE; len + pos < dma->size; len += PAGE_SIZE) {
 			next = iommu_iova_to_phys(domain->domain, iova + len);
 			if (next != phys + len)
 				break;
@@ -1111,16 +1124,18 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 		 * First, try to use fast unmap/unpin. In case of failure,
 		 * switch to slow unmap/unpin path.
 		 */
-		unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
+		unmapped = unmap_unpin_fast(domain, dma, iova, len, phys,
 					    &unlocked, &unmapped_region_list,
 					    &unmapped_region_cnt,
 					    &iotlb_gather);
 		if (!unmapped) {
-			unmapped = unmap_unpin_slow(domain, dma, &iova, len,
+			unmapped = unmap_unpin_slow(domain, dma, iova, len,
 						    phys, &unlocked);
 			if (WARN_ON(!unmapped))
 				break;
 		}
+
+		pos += unmapped;
 	}
 
 	dma->iommu_mapped = false;
@@ -1212,7 +1227,7 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 }
 
 static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
-				  dma_addr_t iova, size_t size, size_t pgsize)
+				  dma_addr_t iova, u64 end, size_t pgsize)
 {
 	struct vfio_dma *dma;
 	struct rb_node *n;
@@ -1229,8 +1244,8 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 	if (dma && dma->iova != iova)
 		return -EINVAL;
 
-	dma = vfio_find_dma(iommu, iova + size - 1, 0);
-	if (dma && dma->iova + dma->size != iova + size)
+	dma = vfio_find_dma(iommu, end, 1);
+	if (dma && dma->iova + dma->size - 1 != end)
 		return -EINVAL;
 
 	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
@@ -1239,7 +1254,7 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
 		if (dma->iova < iova)
 			continue;
 
-		if (dma->iova > iova + size - 1)
+		if (dma->iova > end)
 			break;
 
 		ret = update_user_bitmap(bitmap, iommu, dma, iova, pgsize);
@@ -1305,6 +1320,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	unsigned long pgshift;
 	dma_addr_t iova = unmap->iova;
 	u64 size = unmap->size;
+	u64 unmap_end;
 	bool unmap_all = unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL;
 	bool invalidate_vaddr = unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR;
 	struct rb_node *n, *first_n;
@@ -1327,11 +1343,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (iova || size)
 			goto unlock;
 		size = U64_MAX;
-	} else if (!size || size & (pgsize - 1) ||
-		   iova + size - 1 < iova || size > SIZE_MAX) {
+	} else if (!size || size & (pgsize - 1) || size > SIZE_MAX) {
 		goto unlock;
 	}
 
+	if (check_add_overflow(iova, size - 1, &unmap_end))
+		goto unlock;
+
 	/* When dirty tracking is enabled, allow only min supported pgsize */
 	if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
 	    (!iommu->dirty_page_tracking || (bitmap->pgsize != pgsize))) {
@@ -1376,8 +1394,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma && dma->iova != iova)
 			goto unlock;
 
-		dma = vfio_find_dma(iommu, iova + size - 1, 0);
-		if (dma && dma->iova + dma->size != iova + size)
+		dma = vfio_find_dma(iommu, unmap_end, 1);
+		if (dma && dma->iova + dma->size - 1 != unmap_end)
 			goto unlock;
 	}
 
@@ -1386,7 +1404,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 	while (n) {
 		dma = rb_entry(n, struct vfio_dma, node);
-		if (dma->iova > iova + size - 1)
+		if (dma->iova > unmap_end)
 			break;
 
 		if (!iommu->v2 && iova > dma->iova)
@@ -1713,12 +1731,12 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
-		dma_addr_t iova;
+		size_t pos = 0;
 
 		dma = rb_entry(n, struct vfio_dma, node);
-		iova = dma->iova;
 
-		while (iova < dma->iova + dma->size) {
+		while (pos < dma->size) {
+			dma_addr_t iova = dma->iova + pos;
 			phys_addr_t phys;
 			size_t size;
 
@@ -1734,14 +1752,15 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				phys = iommu_iova_to_phys(d->domain, iova);
 
 				if (WARN_ON(!phys)) {
-					iova += PAGE_SIZE;
+					pos += PAGE_SIZE;
 					continue;
 				}
 
+
 				size = PAGE_SIZE;
 				p = phys + size;
 				i = iova + size;
-				while (i < dma->iova + dma->size &&
+				while (size + pos < dma->size &&
 				       p == iommu_iova_to_phys(d->domain, i)) {
 					size += PAGE_SIZE;
 					p += PAGE_SIZE;
@@ -1782,7 +1801,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				goto unwind;
 			}
 
-			iova += size;
+			pos += size;
 		}
 	}
 
@@ -1799,29 +1818,29 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 unwind:
 	for (; n; n = rb_prev(n)) {
 		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
-		dma_addr_t iova;
+		size_t pos = 0;
 
 		if (dma->iommu_mapped) {
 			iommu_unmap(domain->domain, dma->iova, dma->size);
 			continue;
 		}
 
-		iova = dma->iova;
-		while (iova < dma->iova + dma->size) {
+		while (pos < dma->size) {
+			dma_addr_t iova = dma->iova + pos;
 			phys_addr_t phys, p;
 			size_t size;
 			dma_addr_t i;
 
 			phys = iommu_iova_to_phys(domain->domain, iova);
 			if (!phys) {
-				iova += PAGE_SIZE;
+				pos += PAGE_SIZE;
 				continue;
 			}
 
 			size = PAGE_SIZE;
 			p = phys + size;
 			i = iova + size;
-			while (i < dma->iova + dma->size &&
+			while (pos + size < dma->size &&
 			       p == iommu_iova_to_phys(domain->domain, i)) {
 				size += PAGE_SIZE;
 				p += PAGE_SIZE;
@@ -2908,6 +2927,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 		unsigned long pgshift;
 		size_t data_size = dirty.argsz - minsz;
 		size_t iommu_pgsize;
+		u64 end;
 
 		if (!data_size || data_size < sizeof(range))
 			return -EINVAL;
@@ -2916,8 +2936,12 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 				   sizeof(range)))
 			return -EFAULT;
 
-		if (range.iova + range.size < range.iova)
+		if (range.size == 0)
+			return 0;
+
+		if (check_add_overflow(range.iova, range.size - 1, &end))
 			return -EINVAL;
+
 		if (!access_ok((void __user *)range.bitmap.data,
 			       range.bitmap.size))
 			return -EINVAL;
@@ -2949,7 +2973,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 		if (iommu->dirty_page_tracking)
 			ret = vfio_iova_dirty_bitmap(range.bitmap.data,
 						     iommu, range.iova,
-						     range.size,
+						     end,
 						     range.bitmap.pgsize);
 		else
 			ret = -EINVAL;
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alex Williamson 2 months, 1 week ago
On Mon, 6 Oct 2025 21:24:35 -0700
Alex Mastro <amastro@fb.com> wrote:
> 
> I do like the explicitness of the check_* functions over the older style wrap
> checks.
> 
> Below is my partially validated attempt at a more comprehensive fix if we were
> to try and make end of address space mappings work, rather than blanking out
> the last page.

I prefer this approach, thanks for tackling it.  Consider splitting
into a few patches for easier review, ex. discrete input sanitizing with
proper overflow checking, refactoring the fast/slow handlers to
increment iova in the caller, remainder to tie it all together.  A few
comments inline below. 

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 08242d8ce2ca..66a25de35446 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -28,6 +28,7 @@
>  #include <linux/iommu.h>
>  #include <linux/module.h>
>  #include <linux/mm.h>
> +#include <linux/overflow.h>
>  #include <linux/kthread.h>
>  #include <linux/rbtree.h>
>  #include <linux/sched/signal.h>
> @@ -165,12 +166,14 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>  {
>  	struct rb_node *node = iommu->dma_list.rb_node;
>  
> +	BUG_ON(size == 0);
> +
>  	while (node) {
>  		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>  
> -		if (start + size <= dma->iova)
> +		if (start + size - 1 < dma->iova)
>  			node = node->rb_left;
> -		else if (start >= dma->iova + dma->size)
> +		else if (start > dma->iova + dma->size - 1)
>  			node = node->rb_right;
>  		else
>  			return dma;
> @@ -186,10 +189,12 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
>  	struct rb_node *node = iommu->dma_list.rb_node;
>  	struct vfio_dma *dma_res = NULL;
>  
> +	BUG_ON(size == 0);
> +
>  	while (node) {
>  		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>  
> -		if (start < dma->iova + dma->size) {
> +		if (start <= dma->iova + dma->size - 1) {
>  			res = node;
>  			dma_res = dma;
>  			if (start >= dma->iova)
> @@ -199,7 +204,7 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
>  			node = node->rb_right;
>  		}
>  	}
> -	if (res && size && dma_res->iova > start + size - 1)
> +	if (res && dma_res->iova > start + size - 1)
>  		res = NULL;
>  	return res;
>  }
> @@ -213,7 +218,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>  		parent = *link;
>  		dma = rb_entry(parent, struct vfio_dma, node);
>  
> -		if (new->iova + new->size <= dma->iova)
> +		if (new->iova + new->size - 1 < dma->iova)
>  			link = &(*link)->rb_left;
>  		else
>  			link = &(*link)->rb_right;
> @@ -825,14 +830,24 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  	unsigned long remote_vaddr;
>  	struct vfio_dma *dma;
>  	bool do_accounting;
> +	u64 end, to_pin;

end looks like a dma_addr_t and to_pin ought to be a size_t, right?
Maybe iova_end and iova_size?

>  
> -	if (!iommu || !pages)
> +	if (!iommu || !pages || npage < 0)
>  		return -EINVAL;
>  
>  	/* Supported for v2 version only */
>  	if (!iommu->v2)
>  		return -EACCES;
>  
> +	if (npage == 0)
> +		return 0;
> +
> +	if (check_mul_overflow(npage, PAGE_SIZE, &to_pin))
> +		return -EINVAL;
> +
> +	if (check_add_overflow(user_iova, to_pin - 1, &end))
> +		return -EINVAL;
> +

Why not the same checks on vfio_iommu_type1_unpin_pages()?

>  	mutex_lock(&iommu->lock);
>  
>  	if (WARN_ONCE(iommu->vaddr_invalid_count,
> @@ -997,7 +1012,7 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
>  #define VFIO_IOMMU_TLB_SYNC_MAX		512
>  
>  static size_t unmap_unpin_fast(struct vfio_domain *domain,
> -			       struct vfio_dma *dma, dma_addr_t *iova,
> +			       struct vfio_dma *dma, dma_addr_t iova,
>  			       size_t len, phys_addr_t phys, long *unlocked,
>  			       struct list_head *unmapped_list,
>  			       int *unmapped_cnt,
> @@ -1007,18 +1022,16 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
>  	struct vfio_regions *entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>  
>  	if (entry) {
> -		unmapped = iommu_unmap_fast(domain->domain, *iova, len,
> +		unmapped = iommu_unmap_fast(domain->domain, iova, len,
>  					    iotlb_gather);
>  
>  		if (!unmapped) {
>  			kfree(entry);
>  		} else {
> -			entry->iova = *iova;
> +			entry->iova = iova;
>  			entry->phys = phys;
>  			entry->len  = unmapped;
>  			list_add_tail(&entry->list, unmapped_list);
> -
> -			*iova += unmapped;
>  			(*unmapped_cnt)++;
>  		}
>  	}
> @@ -1037,18 +1050,17 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
>  }
>  
>  static size_t unmap_unpin_slow(struct vfio_domain *domain,
> -			       struct vfio_dma *dma, dma_addr_t *iova,
> +			       struct vfio_dma *dma, dma_addr_t iova,
>  			       size_t len, phys_addr_t phys,
>  			       long *unlocked)
>  {
> -	size_t unmapped = iommu_unmap(domain->domain, *iova, len);
> +	size_t unmapped = iommu_unmap(domain->domain, iova, len);
>  
>  	if (unmapped) {
> -		*unlocked += vfio_unpin_pages_remote(dma, *iova,
> +		*unlocked += vfio_unpin_pages_remote(dma, iova,
>  						     phys >> PAGE_SHIFT,
>  						     unmapped >> PAGE_SHIFT,
>  						     false);
> -		*iova += unmapped;
>  		cond_resched();
>  	}
>  	return unmapped;
> @@ -1057,12 +1069,12 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
>  static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  			     bool do_accounting)
>  {
> -	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
>  	struct vfio_domain *domain, *d;
>  	LIST_HEAD(unmapped_region_list);
>  	struct iommu_iotlb_gather iotlb_gather;
>  	int unmapped_region_cnt = 0;
>  	long unlocked = 0;
> +	size_t pos = 0;
>  
>  	if (!dma->size)
>  		return 0;
> @@ -1086,13 +1098,14 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  	}
>  
>  	iommu_iotlb_gather_init(&iotlb_gather);
> -	while (iova < end) {
> +	while (pos < dma->size) {
>  		size_t unmapped, len;
>  		phys_addr_t phys, next;
> +		dma_addr_t iova = dma->iova + pos;
>  
>  		phys = iommu_iova_to_phys(domain->domain, iova);
>  		if (WARN_ON(!phys)) {
> -			iova += PAGE_SIZE;
> +			pos += PAGE_SIZE;
>  			continue;
>  		}
>  
> @@ -1101,7 +1114,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  		 * may require hardware cache flushing, try to find the
>  		 * largest contiguous physical memory chunk to unmap.
>  		 */
> -		for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
> +		for (len = PAGE_SIZE; len + pos < dma->size; len += PAGE_SIZE) {
>  			next = iommu_iova_to_phys(domain->domain, iova + len);
>  			if (next != phys + len)
>  				break;
> @@ -1111,16 +1124,18 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  		 * First, try to use fast unmap/unpin. In case of failure,
>  		 * switch to slow unmap/unpin path.
>  		 */
> -		unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
> +		unmapped = unmap_unpin_fast(domain, dma, iova, len, phys,
>  					    &unlocked, &unmapped_region_list,
>  					    &unmapped_region_cnt,
>  					    &iotlb_gather);
>  		if (!unmapped) {
> -			unmapped = unmap_unpin_slow(domain, dma, &iova, len,
> +			unmapped = unmap_unpin_slow(domain, dma, iova, len,
>  						    phys, &unlocked);
>  			if (WARN_ON(!unmapped))
>  				break;
>  		}
> +
> +		pos += unmapped;
>  	}
>  
>  	dma->iommu_mapped = false;
> @@ -1212,7 +1227,7 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>  }
>  
>  static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> -				  dma_addr_t iova, size_t size, size_t pgsize)
> +				  dma_addr_t iova, u64 end, size_t pgsize)
>  {
>  	struct vfio_dma *dma;
>  	struct rb_node *n;
> @@ -1229,8 +1244,8 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>  	if (dma && dma->iova != iova)
>  		return -EINVAL;
>  
> -	dma = vfio_find_dma(iommu, iova + size - 1, 0);
> -	if (dma && dma->iova + dma->size != iova + size)
> +	dma = vfio_find_dma(iommu, end, 1);
> +	if (dma && dma->iova + dma->size - 1 != end)
>  		return -EINVAL;
>  
>  	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> @@ -1239,7 +1254,7 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>  		if (dma->iova < iova)
>  			continue;
>  
> -		if (dma->iova > iova + size - 1)
> +		if (dma->iova > end)
>  			break;
>  
>  		ret = update_user_bitmap(bitmap, iommu, dma, iova, pgsize);
> @@ -1305,6 +1320,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	unsigned long pgshift;
>  	dma_addr_t iova = unmap->iova;
>  	u64 size = unmap->size;
> +	u64 unmap_end;
>  	bool unmap_all = unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL;
>  	bool invalidate_vaddr = unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR;
>  	struct rb_node *n, *first_n;
> @@ -1327,11 +1343,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		if (iova || size)
>  			goto unlock;
>  		size = U64_MAX;
> -	} else if (!size || size & (pgsize - 1) ||
> -		   iova + size - 1 < iova || size > SIZE_MAX) {
> +	} else if (!size || size & (pgsize - 1) || size > SIZE_MAX) {
>  		goto unlock;
>  	}
>  
> +	if (check_add_overflow(iova, size - 1, &unmap_end))
> +		goto unlock;
> +
>  	/* When dirty tracking is enabled, allow only min supported pgsize */
>  	if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
>  	    (!iommu->dirty_page_tracking || (bitmap->pgsize != pgsize))) {
> @@ -1376,8 +1394,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		if (dma && dma->iova != iova)
>  			goto unlock;
>  
> -		dma = vfio_find_dma(iommu, iova + size - 1, 0);
> -		if (dma && dma->iova + dma->size != iova + size)
> +		dma = vfio_find_dma(iommu, unmap_end, 1);
> +		if (dma && dma->iova + dma->size - 1 != unmap_end)
>  			goto unlock;
>  	}
>  
> @@ -1386,7 +1404,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  
>  	while (n) {
>  		dma = rb_entry(n, struct vfio_dma, node);
> -		if (dma->iova > iova + size - 1)
> +		if (dma->iova > unmap_end)
>  			break;
>  
>  		if (!iommu->v2 && iova > dma->iova)
> @@ -1713,12 +1731,12 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  
>  	for (; n; n = rb_next(n)) {
>  		struct vfio_dma *dma;
> -		dma_addr_t iova;
> +		size_t pos = 0;
>  
>  		dma = rb_entry(n, struct vfio_dma, node);
> -		iova = dma->iova;
>  
> -		while (iova < dma->iova + dma->size) {
> +		while (pos < dma->size) {
> +			dma_addr_t iova = dma->iova + pos;
>  			phys_addr_t phys;
>  			size_t size;
>  
> @@ -1734,14 +1752,15 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  				phys = iommu_iova_to_phys(d->domain, iova);
>  
>  				if (WARN_ON(!phys)) {
> -					iova += PAGE_SIZE;
> +					pos += PAGE_SIZE;
>  					continue;
>  				}
>  
> +

Extra white space

>  				size = PAGE_SIZE;
>  				p = phys + size;
>  				i = iova + size;
> -				while (i < dma->iova + dma->size &&
> +				while (size + pos < dma->size &&
>  				       p == iommu_iova_to_phys(d->domain, i)) {
>  					size += PAGE_SIZE;
>  					p += PAGE_SIZE;

I think the else branch after this has some use cases too: (iova -
dma->iova) just becomes 'pos' in calculating vaddr, and 'n' should be
calculated as (dma->size - pos).

> @@ -1782,7 +1801,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  				goto unwind;
>  			}
>  
> -			iova += size;
> +			pos += size;
>  		}
>  	}
>  
> @@ -1799,29 +1818,29 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  unwind:
>  	for (; n; n = rb_prev(n)) {
>  		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> -		dma_addr_t iova;
> +		size_t pos = 0;
>  
>  		if (dma->iommu_mapped) {
>  			iommu_unmap(domain->domain, dma->iova, dma->size);
>  			continue;
>  		}
>  
> -		iova = dma->iova;
> -		while (iova < dma->iova + dma->size) {
> +		while (pos < dma->size) {
> +			dma_addr_t iova = dma->iova + pos;
>  			phys_addr_t phys, p;
>  			size_t size;
>  			dma_addr_t i;
>  
>  			phys = iommu_iova_to_phys(domain->domain, iova);
>  			if (!phys) {
> -				iova += PAGE_SIZE;
> +				pos += PAGE_SIZE;
>  				continue;
>  			}
>  
>  			size = PAGE_SIZE;
>  			p = phys + size;
>  			i = iova + size;
> -			while (i < dma->iova + dma->size &&
> +			while (pos + size < dma->size &&
>  			       p == iommu_iova_to_phys(domain->domain, i)) {
>  				size += PAGE_SIZE;
>  				p += PAGE_SIZE;
> @@ -2908,6 +2927,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  		unsigned long pgshift;
>  		size_t data_size = dirty.argsz - minsz;
>  		size_t iommu_pgsize;
> +		u64 end;

Seems like a dma_addr_t.  range_end?  Thanks,

Alex

>  
>  		if (!data_size || data_size < sizeof(range))
>  			return -EINVAL;
> @@ -2916,8 +2936,12 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  				   sizeof(range)))
>  			return -EFAULT;
>  
> -		if (range.iova + range.size < range.iova)
> +		if (range.size == 0)
> +			return 0;
> +
> +		if (check_add_overflow(range.iova, range.size - 1, &end))
>  			return -EINVAL;
> +
>  		if (!access_ok((void __user *)range.bitmap.data,
>  			       range.bitmap.size))
>  			return -EINVAL;
> @@ -2949,7 +2973,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  		if (iommu->dirty_page_tracking)
>  			ret = vfio_iova_dirty_bitmap(range.bitmap.data,
>  						     iommu, range.iova,
> -						     range.size,
> +						     end,
>  						     range.bitmap.pgsize);
>  		else
>  			ret = -EINVAL;
>
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alex Mastro 2 months, 1 week ago
On Tue, Oct 07, 2025 at 02:43:28PM -0600, Alex Williamson wrote:
> I prefer this approach, thanks for tackling it.  Consider splitting
> into a few patches for easier review, ex. discrete input sanitizing with
> proper overflow checking, refactoring the fast/slow handlers to
> increment iova in the caller, remainder to tie it all together.  A few
> comments inline below. 

Alright -- I'll try to stage incrementally. The proposed sequencing sgtm.

> > +	u64 end, to_pin;
> 
> end looks like a dma_addr_t and to_pin ought to be a size_t, right?
> Maybe iova_end and iova_size?

Yes, I think I've been sloppy with the types. Am too 64-bit oriented.

> > -	if (!iommu || !pages)
> > +	if (!iommu || !pages || npage < 0)
> >  		return -EINVAL;
> >  
> >  	/* Supported for v2 version only */
> >  	if (!iommu->v2)
> >  		return -EACCES;
> >  
> > +	if (npage == 0)
> > +		return 0;
> > +
> > +	if (check_mul_overflow(npage, PAGE_SIZE, &to_pin))
> > +		return -EINVAL;
> > +
> > +	if (check_add_overflow(user_iova, to_pin - 1, &end))
> > +		return -EINVAL;
> > +
> 
> Why not the same checks on vfio_iommu_type1_unpin_pages()?

Will see if there's opportunity to stay more consistent.

> >  				if (WARN_ON(!phys)) {
> > -					iova += PAGE_SIZE;
> > +					pos += PAGE_SIZE;
> >  					continue;
> >  				}
> >  
> > +
> 
> Extra white space

Ack.

> >  				size = PAGE_SIZE;
> >  				p = phys + size;
> >  				i = iova + size;
> > -				while (i < dma->iova + dma->size &&
> > +				while (size + pos < dma->size &&
> >  				       p == iommu_iova_to_phys(d->domain, i)) {
> >  					size += PAGE_SIZE;
> >  					p += PAGE_SIZE;
> 
> I think the else branch after this has some use cases too, (iova -
> dma->iova) just becomes 'pos' in calculating vaddr, 'n' should be
> calculated as (dma->size - pos).

Missed this simplification - thanks.

> > +		u64 end;
> 
> Seems like a dma_addr_t.  range_end?  Thanks,

Ack
Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64
Posted by Alejandro Jimenez 2 months, 1 week ago

On 10/7/25 12:24 AM, Alex Mastro wrote:
> On Mon, Oct 06, 2025 at 09:23:56PM -0400, Alejandro Jimenez wrote:
>> If going this way, we'd also have to deny the MAP requests. Right now we
> 
> Agree.
> 
>>> I have doubts that anyone actually relies on MAP_DMA-ing such
>>> end-of-u64-mappings in practice, so perhaps it's OK?
>>>
>>
>> The AMD IOMMU supports a 64-bit IOVA, so when using the AMD vIOMMU with DMA
>> remapping enabled + VF passthrough + a Linux guest with iommu.forcedac=1 we
>> hit this issue since the driver (mlx5) starts requesting mappings for IOVAs
>> right at the top of the address space.
> 
> Interesting!
> 
>> The reason why I hadn't sent it to the list yet is because I noticed the
>> additional locations Jason mentioned earlier in the thread (e.g.
>> vfio_find_dma(), vfio_iommu_replay()) and wanted to put together a
>> reproducer that also triggered those paths.
> 
> I am not well equipped to test some of these paths, so if you have a recipe, I'd
> be interested.
>   

Unfortunately I haven't found a good recipe yet, which is why I delayed 
sending my patch with the fix.
I tried a few things to trigger the code paths, but the "easy way" 
methods are not enough. For example, vfio_iommu_replay() is called when 
enabling the IOMMU model on the container, i.e. when issuing the 
VFIO_SET_IOMMU ioctl, but at that time there are no mappings yet for the 
domain, so there is no dma_list to traverse.
I also tried attaching a second group to a container that already 
contained a group with mappings, but in vfio_iommu_type1_attach_group() 
the new group is attached to the "existing compatible domain" that was 
already created for the first group, and since mappings are per-domain, 
there is no need for replay.
Just mentioning this to give an idea; I didn't have time to try 
triggering the paths involved with dirty tracking...

The basic reproducer I linked on my original patch:

https://gist.github.com/aljimenezb/f3338c9c2eda9b0a7bf5f76b40354db8

is only exercising vfio_find_dma_first_node() and vfio_dma_do_unmap(), 
which is enough to verify the fix for the issue that was causing the AMD 
vIOMMU failure.


>> I mentioned in the notes for the patch above why I chose a slightly more
>> complex method than the '- 1' approach, since there is a chance that
>> iova+size could also go beyond the end of the address space and actually
>> wrap around.
> 
> I think certain invariants have broken if that is possible. The current checks
> in the unmap path should prevent that (iova + size - 1 < iova).
> 
> 1330          } else if (!size || size & (pgsize - 1) ||
> 1331                     iova + size - 1 < iova || size > SIZE_MAX) {
> 1332                  goto unlock;
> 

Yes, that check properly accounts for overflow, but later 
vfio_find_dma_first_node() doesn't, so it won't find the node in the 
dma_list. I agree with the approach of sanitizing the inputs at the uapi 
boundary as Jason mentions, but the code should be robust enough to 
handle overflow anyway, I'd think.

>> My goal was not to trust the inputs at all if possible. We could also use
>> check_add_overflow() if we want to add explicit error reporting in
>> vfio_iommu_type1 when an overflow is detected. i.e. catch bad callers that
> 
> I do like the explicitness of the check_* functions over the older style wrap
> checks.
> 
> Below is my partially validated attempt at a more comprehensive fix if we were
> to try and make end of address space mappings work, rather than blanking out
> the last page.
> 

In my opinion, the latter is not optional. Ranges ending at the 
64-bit boundary are valid, and not accepting them will break drivers 
that request them.

I'll test this change when I have a chance, though I'd really like to 
have solid test cases so I am hoping there is some advice on how to 
trigger all the paths.

Thank you,
Alejandro

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 08242d8ce2ca..66a25de35446 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -28,6 +28,7 @@
>   #include <linux/iommu.h>
>   #include <linux/module.h>
>   #include <linux/mm.h>
> +#include <linux/overflow.h>
>   #include <linux/kthread.h>
>   #include <linux/rbtree.h>
>   #include <linux/sched/signal.h>
> @@ -165,12 +166,14 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>   {
>   	struct rb_node *node = iommu->dma_list.rb_node;
>   
> +	BUG_ON(size == 0);
> +
>   	while (node) {
>   		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>   
> -		if (start + size <= dma->iova)
> +		if (start + size - 1 < dma->iova)
>   			node = node->rb_left;
> -		else if (start >= dma->iova + dma->size)
> +		else if (start > dma->iova + dma->size - 1)
>   			node = node->rb_right;
>   		else
>   			return dma;
> @@ -186,10 +189,12 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
>   	struct rb_node *node = iommu->dma_list.rb_node;
>   	struct vfio_dma *dma_res = NULL;
>   
> +	BUG_ON(size == 0);
> +
>   	while (node) {
>   		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>   
> -		if (start < dma->iova + dma->size) {
> +		if (start <= dma->iova + dma->size - 1) {
>   			res = node;
>   			dma_res = dma;
>   			if (start >= dma->iova)
> @@ -199,7 +204,7 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
>   			node = node->rb_right;
>   		}
>   	}
> -	if (res && size && dma_res->iova > start + size - 1)
> +	if (res && dma_res->iova > start + size - 1)
>   		res = NULL;
>   	return res;
>   }
> @@ -213,7 +218,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>   		parent = *link;
>   		dma = rb_entry(parent, struct vfio_dma, node);
>   
> -		if (new->iova + new->size <= dma->iova)
> +		if (new->iova + new->size - 1 < dma->iova)
>   			link = &(*link)->rb_left;
>   		else
>   			link = &(*link)->rb_right;
> @@ -825,14 +830,24 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>   	unsigned long remote_vaddr;
>   	struct vfio_dma *dma;
>   	bool do_accounting;
> +	u64 end, to_pin;
>   
> -	if (!iommu || !pages)
> +	if (!iommu || !pages || npage < 0)
>   		return -EINVAL;
>   
>   	/* Supported for v2 version only */
>   	if (!iommu->v2)
>   		return -EACCES;
>   
> +	if (npage == 0)
> +		return 0;
> +
> +	if (check_mul_overflow(npage, PAGE_SIZE, &to_pin))
> +		return -EINVAL;
> +
> +	if (check_add_overflow(user_iova, to_pin - 1, &end))
> +		return -EINVAL;
> +
>   	mutex_lock(&iommu->lock);
>   
>   	if (WARN_ONCE(iommu->vaddr_invalid_count,
> @@ -997,7 +1012,7 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
>   #define VFIO_IOMMU_TLB_SYNC_MAX		512
>   
>   static size_t unmap_unpin_fast(struct vfio_domain *domain,
> -			       struct vfio_dma *dma, dma_addr_t *iova,
> +			       struct vfio_dma *dma, dma_addr_t iova,
>   			       size_t len, phys_addr_t phys, long *unlocked,
>   			       struct list_head *unmapped_list,
>   			       int *unmapped_cnt,
> @@ -1007,18 +1022,16 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
>   	struct vfio_regions *entry = kzalloc(sizeof(*entry), GFP_KERNEL);
>   
>   	if (entry) {
> -		unmapped = iommu_unmap_fast(domain->domain, *iova, len,
> +		unmapped = iommu_unmap_fast(domain->domain, iova, len,
>   					    iotlb_gather);
>   
>   		if (!unmapped) {
>   			kfree(entry);
>   		} else {
> -			entry->iova = *iova;
> +			entry->iova = iova;
>   			entry->phys = phys;
>   			entry->len  = unmapped;
>   			list_add_tail(&entry->list, unmapped_list);
> -
> -			*iova += unmapped;
>   			(*unmapped_cnt)++;
>   		}
>   	}
> @@ -1037,18 +1050,17 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
>   }
>   
>   static size_t unmap_unpin_slow(struct vfio_domain *domain,
> -			       struct vfio_dma *dma, dma_addr_t *iova,
> +			       struct vfio_dma *dma, dma_addr_t iova,
>   			       size_t len, phys_addr_t phys,
>   			       long *unlocked)
>   {
> -	size_t unmapped = iommu_unmap(domain->domain, *iova, len);
> +	size_t unmapped = iommu_unmap(domain->domain, iova, len);
>   
>   	if (unmapped) {
> -		*unlocked += vfio_unpin_pages_remote(dma, *iova,
> +		*unlocked += vfio_unpin_pages_remote(dma, iova,
>   						     phys >> PAGE_SHIFT,
>   						     unmapped >> PAGE_SHIFT,
>   						     false);
> -		*iova += unmapped;
>   		cond_resched();
>   	}
>   	return unmapped;
> @@ -1057,12 +1069,12 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
>   static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>   			     bool do_accounting)
>   {
> -	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
>   	struct vfio_domain *domain, *d;
>   	LIST_HEAD(unmapped_region_list);
>   	struct iommu_iotlb_gather iotlb_gather;
>   	int unmapped_region_cnt = 0;
>   	long unlocked = 0;
> +	size_t pos = 0;
>   
>   	if (!dma->size)
>   		return 0;
> @@ -1086,13 +1098,14 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>   	}
>   
>   	iommu_iotlb_gather_init(&iotlb_gather);
> -	while (iova < end) {
> +	while (pos < dma->size) {
>   		size_t unmapped, len;
>   		phys_addr_t phys, next;
> +		dma_addr_t iova = dma->iova + pos;
>   
>   		phys = iommu_iova_to_phys(domain->domain, iova);
>   		if (WARN_ON(!phys)) {
> -			iova += PAGE_SIZE;
> +			pos += PAGE_SIZE;
>   			continue;
>   		}
>   
> @@ -1101,7 +1114,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>   		 * may require hardware cache flushing, try to find the
>   		 * largest contiguous physical memory chunk to unmap.
>   		 */
> -		for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
> +		for (len = PAGE_SIZE; len + pos < dma->size; len += PAGE_SIZE) {
>   			next = iommu_iova_to_phys(domain->domain, iova + len);
>   			if (next != phys + len)
>   				break;
> @@ -1111,16 +1124,18 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>   		 * First, try to use fast unmap/unpin. In case of failure,
>   		 * switch to slow unmap/unpin path.
>   		 */
> -		unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
> +		unmapped = unmap_unpin_fast(domain, dma, iova, len, phys,
>   					    &unlocked, &unmapped_region_list,
>   					    &unmapped_region_cnt,
>   					    &iotlb_gather);
>   		if (!unmapped) {
> -			unmapped = unmap_unpin_slow(domain, dma, &iova, len,
> +			unmapped = unmap_unpin_slow(domain, dma, iova, len,
>   						    phys, &unlocked);
>   			if (WARN_ON(!unmapped))
>   				break;
>   		}
> +
> +		pos += unmapped;
>   	}
>   
>   	dma->iommu_mapped = false;
> @@ -1212,7 +1227,7 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>   }
>   
>   static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
> -				  dma_addr_t iova, size_t size, size_t pgsize)
> +				  dma_addr_t iova, u64 end, size_t pgsize)
>   {
>   	struct vfio_dma *dma;
>   	struct rb_node *n;
> @@ -1229,8 +1244,8 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>   	if (dma && dma->iova != iova)
>   		return -EINVAL;
>   
> -	dma = vfio_find_dma(iommu, iova + size - 1, 0);
> -	if (dma && dma->iova + dma->size != iova + size)
> +	dma = vfio_find_dma(iommu, end, 1);
> +	if (dma && dma->iova + dma->size - 1 != end)
>   		return -EINVAL;
>   
>   	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> @@ -1239,7 +1254,7 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
>   		if (dma->iova < iova)
>   			continue;
>   
> -		if (dma->iova > iova + size - 1)
> +		if (dma->iova > end)
>   			break;
>   
>   		ret = update_user_bitmap(bitmap, iommu, dma, iova, pgsize);
> @@ -1305,6 +1320,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>   	unsigned long pgshift;
>   	dma_addr_t iova = unmap->iova;
>   	u64 size = unmap->size;
> +	u64 unmap_end;
>   	bool unmap_all = unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL;
>   	bool invalidate_vaddr = unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR;
>   	struct rb_node *n, *first_n;
> @@ -1327,11 +1343,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>   		if (iova || size)
>   			goto unlock;
>   		size = U64_MAX;
> -	} else if (!size || size & (pgsize - 1) ||
> -		   iova + size - 1 < iova || size > SIZE_MAX) {
> +	} else if (!size || size & (pgsize - 1) || size > SIZE_MAX) {
>   		goto unlock;
>   	}
>   
> +	if (check_add_overflow(iova, size - 1, &unmap_end))
> +		goto unlock;
> +
>   	/* When dirty tracking is enabled, allow only min supported pgsize */
>   	if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
>   	    (!iommu->dirty_page_tracking || (bitmap->pgsize != pgsize))) {
> @@ -1376,8 +1394,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>   		if (dma && dma->iova != iova)
>   			goto unlock;
>   
> -		dma = vfio_find_dma(iommu, iova + size - 1, 0);
> -		if (dma && dma->iova + dma->size != iova + size)
> +		dma = vfio_find_dma(iommu, unmap_end, 1);
> +		if (dma && dma->iova + dma->size - 1 != unmap_end)
>   			goto unlock;
>   	}
>   
> @@ -1386,7 +1404,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>   
>   	while (n) {
>   		dma = rb_entry(n, struct vfio_dma, node);
> -		if (dma->iova > iova + size - 1)
> +		if (dma->iova > unmap_end)
>   			break;
>   
>   		if (!iommu->v2 && iova > dma->iova)
> @@ -1713,12 +1731,12 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>   
>   	for (; n; n = rb_next(n)) {
>   		struct vfio_dma *dma;
> -		dma_addr_t iova;
> +		size_t pos = 0;
>   
>   		dma = rb_entry(n, struct vfio_dma, node);
> -		iova = dma->iova;
>   
> -		while (iova < dma->iova + dma->size) {
> +		while (pos < dma->size) {
> +			dma_addr_t iova = dma->iova + pos;
>   			phys_addr_t phys;
>   			size_t size;
>   
> @@ -1734,14 +1752,15 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>   				phys = iommu_iova_to_phys(d->domain, iova);
>   
>   				if (WARN_ON(!phys)) {
> -					iova += PAGE_SIZE;
> +					pos += PAGE_SIZE;
>   					continue;
>   				}
>   
> +
>   				size = PAGE_SIZE;
>   				p = phys + size;
>   				i = iova + size;
> -				while (i < dma->iova + dma->size &&
> +				while (size + pos < dma->size &&
>   				       p == iommu_iova_to_phys(d->domain, i)) {
>   					size += PAGE_SIZE;
>   					p += PAGE_SIZE;
> @@ -1782,7 +1801,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>   				goto unwind;
>   			}
>   
> -			iova += size;
> +			pos += size;
>   		}
>   	}
>   
> @@ -1799,29 +1818,29 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>   unwind:
>   	for (; n; n = rb_prev(n)) {
>   		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> -		dma_addr_t iova;
> +		size_t pos = 0;
>   
>   		if (dma->iommu_mapped) {
>   			iommu_unmap(domain->domain, dma->iova, dma->size);
>   			continue;
>   		}
>   
> -		iova = dma->iova;
> -		while (iova < dma->iova + dma->size) {
> +		while (pos < dma->size) {
> +			dma_addr_t iova = dma->iova + pos;
>   			phys_addr_t phys, p;
>   			size_t size;
>   			dma_addr_t i;
>   
>   			phys = iommu_iova_to_phys(domain->domain, iova);
>   			if (!phys) {
> -				iova += PAGE_SIZE;
> +				pos += PAGE_SIZE;
>   				continue;
>   			}
>   
>   			size = PAGE_SIZE;
>   			p = phys + size;
>   			i = iova + size;
> -			while (i < dma->iova + dma->size &&
> +			while (pos + size < dma->size &&
>   			       p == iommu_iova_to_phys(domain->domain, i)) {
>   				size += PAGE_SIZE;
>   				p += PAGE_SIZE;
> @@ -2908,6 +2927,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>   		unsigned long pgshift;
>   		size_t data_size = dirty.argsz - minsz;
>   		size_t iommu_pgsize;
> +		u64 end;
>   
>   		if (!data_size || data_size < sizeof(range))
>   			return -EINVAL;
> @@ -2916,8 +2936,12 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>   				   sizeof(range)))
>   			return -EFAULT;
>   
> -		if (range.iova + range.size < range.iova)
> +		if (range.size == 0)
> +			return 0;
> +
> +		if (check_add_overflow(range.iova, range.size - 1, &end))
>   			return -EINVAL;
> +
>   		if (!access_ok((void __user *)range.bitmap.data,
>   			       range.bitmap.size))
>   			return -EINVAL;
> @@ -2949,7 +2973,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>   		if (iommu->dirty_page_tracking)
>   			ret = vfio_iova_dirty_bitmap(range.bitmap.data,
>   						     iommu, range.iova,
> -						     range.size,
> +						     end,
>   						     range.bitmap.pgsize);
>   		else
>   			ret = -EINVAL;