From nobody Mon Jun  8 06:36:14 2026
Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9C8723F44C9;
	Wed,  3 Jun 2026 03:37:10 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=96.67.55.147
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780457843; cv=none;
 b=e00IwdQ1eXYqxwiD3DxBRlN7Bomusm9//3rLLDq97FwASco3dvRPBZgK2qPfB2mZ+IuIP1nJmj2Mx3vs0GeMGmbNZctPsHezYWZWvHb/nNAqdo/beMJlYx0beC7RUvM9Dp5wUTiIzBCzfo5HrTS7Aixcyv6Ypri87BjKfRIIDxo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780457843; c=relaxed/simple;
	bh=63XzII3LCeVEY+m0pHJPQWm5qZ/ocCGKWItJji8DBvY=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version;
 b=QjHM4noV0070Tfu3/UM2DLu4JANFtBJbfcb5ZIsZb3QbGQKC+coUWJ9FuJQtZGH3SPRkA0CcZHxTUPUEJp/NRBX1GNM1BazmIkt6EvfS7p1/RqrQ1ZJjIX43uh7djvJiayn2+jiVSg2aMVsvNWnEfMg+za3EorXeCtDBgkK/1DA=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=surriel.com;
 spf=pass smtp.mailfrom=surriel.com;
 dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com
 header.b=Cp9P1963; arc=none smtp.client-ip=96.67.55.147
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=surriel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=surriel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com
 header.b="Cp9P1963"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com
	; s=mail; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To:
	Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID:
	Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
	:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe:
	List-Post:List-Owner:List-Archive;
	bh=IF7PuSZJhNuFa0ukToH4UJzhPgK3j12pbrmz73mA6Ms=; b=Cp9P19630Kt76USosrAAWIjSY6
	p9uMhEQfV1zDUsL8v78Hm+51kVXS2EM7NiOBq+ah8YJJ3IZsZy7Ix5sybIM0MZEd7E7VZ9xeKobX+
	vsI9/8DouTa0x7jwp6WkUzRxn5rZY3wWQtBz8wbx/7PtEgY+wr8mG7fbbsADUWXEhkoZLHfuuv4w7
	US2r3i77luXg87KSFm2wDppIi3ndMO4ZPemP9lRrESvDcvSk3ySKwMdEi+SmKvF2Yavpqya3GYWT6
	qsU5QwC8aT7cFdxyXmchTfgIvNDNZO6xDBRElya1VNOSz4iqyC0kJbQo5FyBuqu77XK9uCD+k5qwP
	QdLbeurw==;
Received: from fangorn.home.surriel.com ([10.0.13.7])
	by shelob.surriel.com with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.97.1)
	(envelope-from <riel@surriel.com>)
	id 1wUcPg-000000002VG-1G5B;
	Tue, 02 Jun 2026 23:37:04 -0400
From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com,
	robin.murphy@arm.com,
	joro@8bytes.org,
	will@kernel.org,
	iommu@lists.linux.dev,
	jgg@ziepe.ca,
	kyle@mcmartin.ca,
	Rik van Riel <riel@meta.com>,
	Rik van Riel <riel@surriel.com>
Subject: [PATCH v3 1/3] iova: convert from rbtree to maple tree
Date: Tue,  2 Jun 2026 23:35:46 -0400
Message-ID: <20260603033653.4144138-2-riel@surriel.com>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260603033653.4144138-1-riel@surriel.com>
References: <20260603033653.4144138-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Rik van Riel <riel@meta.com>

Replace the hand-rolled rbtree in the IOVA allocator with a maple tree.
The maple tree is a B-tree variant designed for range tracking with
built-in gap searching, making it a natural fit for IOVA address space
management -- the same data structure is used by the kernel VMA
subsystem for the analogous problem.

Key changes:

  - struct iova shrinks from 40 bytes (rb_node 24 + pfn_hi 8 + pfn_lo 8)
    to just pfn_hi + pfn_lo (16 bytes). SLAB_HWCACHE_ALIGN is dropped
    from the iova slab cache since struct iova no longer contains an
    embedded rb_node touched during tree rebalancing -- tree traversal
    now touches maple nodes exclusively. This lets the slab allocator
    pack 16-byte objects tightly instead of rounding to 64 bytes.

    The maple tree replaces the embedded rb_node with external B-tree
    nodes: each maple_arange_64 node is 256 bytes and holds up to 10
    entries, plus internal nodes add ~11% overhead. At typical
    utilization (~70-90% full), this works out to ~28-41 bytes of maple
    tree node memory per IOVA entry. The net per-entry cost is ~44-57
    bytes (16 bytes slab + ~28-41 bytes maple), compared to ~64 bytes
    with the old rbtree (64 bytes HWCACHE_ALIGN slab + 0 bytes
    embedded rbtree). Combined with the maple tree's O(log_10 n)
    search depth and better cache locality from B-tree fan-out,
    this improves both memory efficiency and search performance.

  - struct iova_domain replaces rb_root + cached_node + cached32_node
    + anchor with a single struct maple_tree. The iova_rbtree_lock
    spinlock is renamed to iova_lock. The maple tree is initialized
    with MT_FLAGS_ALLOC_RANGE (enables gap tracking for
    mas_empty_area_rev) and MT_FLAGS_LOCK_EXTERN (uses the existing
    iova_lock spinlock instead of the maple tree internal lock).

  - Allocation via __alloc_and_insert_iova_range() uses
    mas_empty_area_rev() to find the highest-addressed gap of
    sufficient size below limit_pfn in O(log n) with B-tree fan-out.
    Alignment is handled by over-requesting (size + alignment - 1)
    to guarantee room after rounding, eliminating the need for a
    retry loop. The result is stored with mas_store_gfp().

  - Lookup via private_find_iova() uses mas_walk() for O(log n)
    point-in-range lookup.

  - Deletion via remove_iova() uses mas_erase(). No successor gap
    fixup needed -- the maple tree handles it internally.

  - reserve_iova() walks the requested range for existing entries,
    computes the merged range, collects old entries for freeing, then
    stores a single merged entry. If the request is fully covered by
    an existing entry, it returns that entry without allocating.

  - The IOVA_ANCHOR sentinel node is eliminated. The maple tree
    tracks gaps implicitly, including the space above the highest
    allocation.

  - The cached_node / cached32_node fields and all their helpers
    are eliminated. The maple tree B-tree structure provides
    equivalent or better cache behaviour.

The rcache (magazine cache) layer is unchanged -- it operates on raw
pfn values and is orthogonal to the tree backing store.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 338 ++++++++++++-------------------------------
 include/linux/iova.h |  10 +-
 2 files changed, 98 insertions(+), 250 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 021daf6528de..523d1e8315f9 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -13,9 +13,7 @@
 #include <linux/bitops.h>
 #include <linux/cpu.h>
 #include <linux/workqueue.h>
-
-/* The anchor node sits above the top of the usable address space */
-#define IOVA_ANCHOR	~0UL
+#include <linux/maple_tree.h>
=20
 #define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (=
in pages) */
=20
@@ -29,11 +27,6 @@ static void free_iova_rcaches(struct iova_domain *iovad);
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *io=
vad);
 static void free_global_cached_iovas(struct iova_domain *iovad);
=20
-static struct iova *to_iova(struct rb_node *node)
-{
-	return rb_entry(node, struct iova, node);
-}
-
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	unsigned long start_pfn)
@@ -45,180 +38,63 @@ init_iova_domain(struct iova_domain *iovad, unsigned l=
ong granule,
 	 */
 	BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
=20
-	spin_lock_init(&iovad->iova_rbtree_lock);
-	iovad->rbroot =3D RB_ROOT;
-	iovad->cached_node =3D &iovad->anchor.node;
-	iovad->cached32_node =3D &iovad->anchor.node;
+	spin_lock_init(&iovad->iova_lock);
+	mt_init_flags(&iovad->mtree,
+		      MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
+	mt_set_external_lock(&iovad->mtree, &iovad->iova_lock);
 	iovad->granule =3D granule;
 	iovad->start_pfn =3D start_pfn;
 	iovad->dma_32bit_pfn =3D 1UL << (32 - iova_shift(iovad));
 	iovad->max32_alloc_size =3D iovad->dma_32bit_pfn;
-	iovad->anchor.pfn_lo =3D iovad->anchor.pfn_hi =3D IOVA_ANCHOR;
-	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
-	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
=20
-static struct rb_node *
-__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn)
-{
-	if (limit_pfn <=3D iovad->dma_32bit_pfn)
-		return iovad->cached32_node;
-
-	return iovad->cached_node;
-}
-
-static void
-__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
-{
-	if (new->pfn_hi < iovad->dma_32bit_pfn)
-		iovad->cached32_node =3D &new->node;
-	else
-		iovad->cached_node =3D &new->node;
-}
-
-static void
-__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
-{
-	struct iova *cached_iova;
-
-	cached_iova =3D to_iova(iovad->cached32_node);
-	if (free =3D=3D cached_iova ||
-	    (free->pfn_hi < iovad->dma_32bit_pfn &&
-	     free->pfn_lo >=3D cached_iova->pfn_lo))
-		iovad->cached32_node =3D rb_next(&free->node);
-
-	if (free->pfn_lo < iovad->dma_32bit_pfn)
-		iovad->max32_alloc_size =3D iovad->dma_32bit_pfn;
-
-	cached_iova =3D to_iova(iovad->cached_node);
-	if (free->pfn_lo >=3D cached_iova->pfn_lo)
-		iovad->cached_node =3D rb_next(&free->node);
-}
-
-static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned=
 long limit_pfn)
-{
-	struct rb_node *node, *next;
-	/*
-	 * Ideally what we'd like to judge here is whether limit_pfn is close
-	 * enough to the highest-allocated IOVA that starting the allocation
-	 * walk from the anchor node will be quicker than this initial work to
-	 * find an exact starting point (especially if that ends up being the
-	 * anchor node anyway). This is an incredibly crude approximation which
-	 * only really helps the most likely case, but is at least trivially easy.
-	 */
-	if (limit_pfn > iovad->dma_32bit_pfn)
-		return &iovad->anchor.node;
-
-	node =3D iovad->rbroot.rb_node;
-	while (to_iova(node)->pfn_hi < limit_pfn)
-		node =3D node->rb_right;
-
-search_left:
-	while (node->rb_left && to_iova(node->rb_left)->pfn_lo >=3D limit_pfn)
-		node =3D node->rb_left;
-
-	if (!node->rb_left)
-		return node;
-
-	next =3D node->rb_left;
-	while (next->rb_right) {
-		next =3D next->rb_right;
-		if (to_iova(next)->pfn_lo >=3D limit_pfn) {
-			node =3D next;
-			goto search_left;
-		}
-	}
-
-	return node;
-}
-
-/* Insert the iova into domain rbtree by holding writer lock */
-static void
-iova_insert_rbtree(struct rb_root *root, struct iova *iova,
-		   struct rb_node *start)
-{
-	struct rb_node **new, *parent =3D NULL;
-
-	new =3D (start) ? &start : &(root->rb_node);
-	/* Figure out where to put new node */
-	while (*new) {
-		struct iova *this =3D to_iova(*new);
-
-		parent =3D *new;
-
-		if (iova->pfn_lo < this->pfn_lo)
-			new =3D &((*new)->rb_left);
-		else if (iova->pfn_lo > this->pfn_lo)
-			new =3D &((*new)->rb_right);
-		else {
-			WARN_ON(1); /* this should not happen */
-			return;
-		}
-	}
-	/* Add new node and rebalance tree. */
-	rb_link_node(&iova->node, parent, new);
-	rb_insert_color(&iova->node, root);
-}
-
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
 			struct iova *new, bool size_aligned)
 {
-	struct rb_node *curr, *prev;
-	struct iova *curr_iova;
 	unsigned long flags;
-	unsigned long new_pfn, retry_pfn;
+	unsigned long new_pfn;
 	unsigned long align_mask =3D ~0UL;
-	unsigned long high_pfn =3D limit_pfn, low_pfn =3D iovad->start_pfn;
+	unsigned long search_size =3D size;
+	MA_STATE(mas, &iovad->mtree, 0, 0);
+
+	if (size_aligned) {
+		unsigned long align =3D 1UL << fls_long(size - 1);
=20
-	if (size_aligned)
 		align_mask <<=3D fls_long(size - 1);
+		search_size =3D size + align - 1;
+	}
=20
-	/* Walk the tree backwards */
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	if (limit_pfn <=3D iovad->dma_32bit_pfn &&
 			size >=3D iovad->max32_alloc_size)
 		goto iova32_full;
=20
-	curr =3D __get_cached_rbnode(iovad, limit_pfn);
-	curr_iova =3D to_iova(curr);
-	retry_pfn =3D curr_iova->pfn_hi;
+	if (mas_empty_area_rev(&mas, iovad->start_pfn,
+				 limit_pfn - 1, search_size))
+		goto alloc_fail;
=20
-retry:
-	do {
-		high_pfn =3D min(high_pfn, curr_iova->pfn_lo);
-		new_pfn =3D (high_pfn - size) & align_mask;
-		prev =3D curr;
-		curr =3D rb_prev(curr);
-		curr_iova =3D to_iova(curr);
-	} while (curr && new_pfn <=3D curr_iova->pfn_hi && new_pfn >=3D low_pfn);
-
-	if (high_pfn < size || new_pfn < low_pfn) {
-		if (low_pfn =3D=3D iovad->start_pfn && retry_pfn < limit_pfn) {
-			high_pfn =3D limit_pfn;
-			low_pfn =3D retry_pfn + 1;
-			curr =3D iova_find_limit(iovad, limit_pfn);
-			curr_iova =3D to_iova(curr);
-			goto retry;
-		}
-		iovad->max32_alloc_size =3D size;
-		goto iova32_full;
-	}
+	new_pfn =3D (mas.last - size + 1) & align_mask;
+	if (new_pfn < mas.index || new_pfn < iovad->start_pfn)
+		goto alloc_fail;
=20
-	/* pfn_lo will point to size aligned address if size_aligned is set */
 	new->pfn_lo =3D new_pfn;
-	new->pfn_hi =3D new->pfn_lo + size - 1;
+	new->pfn_hi =3D new_pfn + size - 1;
=20
-	/* If we have 'prev', it's a valid place to start the insertion. */
-	iova_insert_rbtree(&iovad->rbroot, new, prev);
-	__cached_rbnode_insert_update(iovad, new);
+	mas.index =3D new->pfn_lo;
+	mas.last =3D new->pfn_hi;
+	if (mas_store_gfp(&mas, new, GFP_ATOMIC))
+		goto alloc_fail;
=20
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return 0;
=20
+alloc_fail:
+	if (limit_pfn <=3D iovad->dma_32bit_pfn)
+		iovad->max32_alloc_size =3D size;
 iova32_full:
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return -ENOMEM;
 }
=20
@@ -233,8 +109,7 @@ static struct iova *alloc_iova_mem(void)
=20
 static void free_iova_mem(struct iova *iova)
 {
-	if (iova->pfn_lo !=3D IOVA_ANCHOR)
-		kmem_cache_free(iova_cache, iova);
+	kmem_cache_free(iova_cache, iova);
 }
=20
 /**
@@ -275,29 +150,22 @@ EXPORT_SYMBOL_GPL(alloc_iova);
 static struct iova *
 private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 {
-	struct rb_node *node =3D iovad->rbroot.rb_node;
+	MA_STATE(mas, &iovad->mtree, pfn, pfn);
=20
-	assert_spin_locked(&iovad->iova_rbtree_lock);
-
-	while (node) {
-		struct iova *iova =3D to_iova(node);
-
-		if (pfn < iova->pfn_lo)
-			node =3D node->rb_left;
-		else if (pfn > iova->pfn_hi)
-			node =3D node->rb_right;
-		else
-			return iova;	/* pfn falls within iova's range */
-	}
-
-	return NULL;
+	assert_spin_locked(&iovad->iova_lock);
+	return mas_walk(&mas);
 }
=20
 static void remove_iova(struct iova_domain *iovad, struct iova *iova)
 {
-	assert_spin_locked(&iovad->iova_rbtree_lock);
-	__cached_rbnode_delete_update(iovad, iova);
-	rb_erase(&iova->node, &iovad->rbroot);
+	MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi);
+
+	assert_spin_locked(&iovad->iova_lock);
+
+	if (iova->pfn_lo < iovad->dma_32bit_pfn)
+		iovad->max32_alloc_size =3D iovad->dma_32bit_pfn;
+
+	mas_store_gfp(&mas, NULL, GFP_ATOMIC);
 }
=20
 /**
@@ -312,10 +180,9 @@ struct iova *find_iova(struct iova_domain *iovad, unsi=
gned long pfn)
 	unsigned long flags;
 	struct iova *iova;
=20
-	/* Take the lock so that no other thread is manipulating the rbtree */
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	iova =3D private_find_iova(iovad, pfn);
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return iova;
 }
 EXPORT_SYMBOL_GPL(find_iova);
@@ -331,9 +198,9 @@ __free_iova(struct iova_domain *iovad, struct iova *iov=
a)
 {
 	unsigned long flags;
=20
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	remove_iova(iovad, iova);
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(__free_iova);
@@ -351,14 +218,14 @@ free_iova(struct iova_domain *iovad, unsigned long pf=
n)
 	unsigned long flags;
 	struct iova *iova;
=20
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	iova =3D private_find_iova(iovad, pfn);
 	if (!iova) {
-		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		spin_unlock_irqrestore(&iovad->iova_lock, flags);
 		return;
 	}
 	remove_iova(iovad, iova);
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(free_iova);
@@ -445,27 +312,18 @@ static void iova_domain_free_rcaches(struct iova_doma=
in *iovad)
  */
 void put_iova_domain(struct iova_domain *iovad)
 {
-	struct iova *iova, *tmp;
+	struct iova *iova;
+	MA_STATE(mas, &iovad->mtree, 0, 0);
=20
 	if (iovad->rcaches)
 		iova_domain_free_rcaches(iovad);
=20
-	rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node)
+	mas_for_each(&mas, iova, ULONG_MAX)
 		free_iova_mem(iova);
+	__mt_destroy(&iovad->mtree);
 }
 EXPORT_SYMBOL_GPL(put_iova_domain);
=20
-static int
-__is_range_overlap(struct rb_node *node,
-	unsigned long pfn_lo, unsigned long pfn_hi)
-{
-	struct iova *iova =3D to_iova(node);
-
-	if ((pfn_lo <=3D iova->pfn_hi) && (pfn_hi >=3D iova->pfn_lo))
-		return 1;
-	return 0;
-}
-
 static inline struct iova *
 alloc_and_init_iova(unsigned long pfn_lo, unsigned long pfn_hi)
 {
@@ -480,29 +338,6 @@ alloc_and_init_iova(unsigned long pfn_lo, unsigned lon=
g pfn_hi)
 	return iova;
 }
=20
-static struct iova *
-__insert_new_range(struct iova_domain *iovad,
-	unsigned long pfn_lo, unsigned long pfn_hi)
-{
-	struct iova *iova;
-
-	iova =3D alloc_and_init_iova(pfn_lo, pfn_hi);
-	if (iova)
-		iova_insert_rbtree(&iovad->rbroot, iova, NULL);
-
-	return iova;
-}
-
-static void
-__adjust_overlap_range(struct iova *iova,
-	unsigned long *pfn_lo, unsigned long *pfn_hi)
-{
-	if (*pfn_lo < iova->pfn_lo)
-		iova->pfn_lo =3D *pfn_lo;
-	if (*pfn_hi > iova->pfn_hi)
-		*pfn_lo =3D iova->pfn_hi + 1;
-}
-
 /**
  * reserve_iova - reserves an iova in the given range
  * @iovad: - iova domain pointer
@@ -510,41 +345,58 @@ __adjust_overlap_range(struct iova *iova,
  * @pfn_hi:- higher pfn address
  * This function allocates reserves the address range from pfn_lo to pfn_h=
i so
  * that this address is not dished out as part of alloc_iova.
+ *
+ * If the requested range overlaps existing reservations, ranges are merge=
d.
+ * If the requested range is fully covered by an existing reservation, the
+ * existing entry is returned without allocating.
  */
 struct iova *
 reserve_iova(struct iova_domain *iovad,
 	unsigned long pfn_lo, unsigned long pfn_hi)
 {
-	struct rb_node *node;
 	unsigned long flags;
-	struct iova *iova;
-	unsigned int overlap =3D 0;
+	struct iova *iova, *overlap;
+	unsigned long merged_lo =3D pfn_lo, merged_hi =3D pfn_hi;
=20
 	/* Don't allow nonsensical pfns */
 	if (WARN_ON((pfn_hi | pfn_lo) > (ULLONG_MAX >> iova_shift(iovad))))
 		return NULL;
=20
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	for (node =3D rb_first(&iovad->rbroot); node; node =3D rb_next(node)) {
-		if (__is_range_overlap(node, pfn_lo, pfn_hi)) {
-			iova =3D to_iova(node);
-			__adjust_overlap_range(iova, &pfn_lo, &pfn_hi);
-			if ((pfn_lo >=3D iova->pfn_lo) &&
-				(pfn_hi <=3D iova->pfn_hi))
-				goto finish;
-			overlap =3D 1;
-
-		} else if (overlap)
-				break;
+	spin_lock_irqsave(&iovad->iova_lock, flags);
+	{
+		MA_STATE(mas, &iovad->mtree, pfn_lo, pfn_hi);
+
+		mas_for_each(&mas, overlap, pfn_hi) {
+			if (pfn_lo >=3D overlap->pfn_lo &&
+			    pfn_hi <=3D overlap->pfn_hi) {
+				spin_unlock_irqrestore(&iovad->iova_lock,
+						       flags);
+				return overlap;
+			}
+			if (overlap->pfn_lo < merged_lo)
+				merged_lo =3D overlap->pfn_lo;
+			if (overlap->pfn_hi > merged_hi)
+				merged_hi =3D overlap->pfn_hi;
+			free_iova_mem(overlap);
+		}
 	}
=20
-	/* We are here either because this is the first reserver node
-	 * or need to insert remaining non overlap addr range
-	 */
-	iova =3D __insert_new_range(iovad, pfn_lo, pfn_hi);
-finish:
+	iova =3D alloc_and_init_iova(merged_lo, merged_hi);
+	if (!iova) {
+		spin_unlock_irqrestore(&iovad->iova_lock, flags);
+		return NULL;
+	}
+
+	{
+		MA_STATE(mas, &iovad->mtree, merged_lo, merged_hi);
=20
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		if (mas_store_gfp(&mas, iova, GFP_ATOMIC)) {
+			spin_unlock_irqrestore(&iovad->iova_lock, flags);
+			free_iova_mem(iova);
+			return NULL;
+		}
+	}
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return iova;
 }
 EXPORT_SYMBOL_GPL(reserve_iova);
@@ -621,7 +473,7 @@ iova_magazine_free_pfns(struct iova_magazine *mag, stru=
ct iova_domain *iovad)
 	unsigned long flags;
 	int i;
=20
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
=20
 	for (i =3D 0 ; i < mag->size; ++i) {
 		struct iova *iova =3D private_find_iova(iovad, mag->pfns[i]);
@@ -633,7 +485,7 @@ iova_magazine_free_pfns(struct iova_magazine *mag, stru=
ct iova_domain *iovad)
 		free_iova_mem(iova);
 	}
=20
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
=20
 	mag->size =3D 0;
 }
@@ -956,8 +808,8 @@ int iova_cache_get(void)
=20
 	mutex_lock(&iova_cache_mutex);
 	if (!iova_cache_users) {
-		iova_cache =3D kmem_cache_create("iommu_iova", sizeof(struct iova), 0,
-					       SLAB_HWCACHE_ALIGN, NULL);
+		iova_cache =3D kmem_cache_create("iommu_iova", sizeof(struct iova),
+					       0, 0, NULL);
 		if (!iova_cache)
 			goto out_err;
=20
diff --git a/include/linux/iova.h b/include/linux/iova.h
index d2c4fd923efa..eb4f9ead5451 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -11,12 +11,11 @@
=20
 #include <linux/types.h>
 #include <linux/kernel.h>
-#include <linux/rbtree.h>
+#include <linux/maple_tree.h>
 #include <linux/dma-mapping.h>
=20
 /* iova structure */
 struct iova {
-	struct rb_node	node;
 	unsigned long	pfn_hi; /* Highest allocated pfn */
 	unsigned long	pfn_lo; /* Lowest allocated pfn */
 };
@@ -26,15 +25,12 @@ struct iova_rcache;
=20
 /* holds all the iova translations for a domain */
 struct iova_domain {
-	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
-	struct rb_root	rbroot;		/* iova domain rbtree root */
-	struct rb_node	*cached_node;	/* Save last alloced node */
-	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
+	spinlock_t	iova_lock;	/* Lock to protect update of maple tree */
+	struct maple_tree	mtree;
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
 	unsigned long	max32_alloc_size; /* Size of last failed allocation */
-	struct iova	anchor;		/* rbtree lookup anchor */
=20
 	struct iova_rcache	*rcaches;
 	struct hlist_node	cpuhp_dead;
--=20
2.54.0
From nobody Mon Jun  8 06:36:14 2026
Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F04E3F7896;
	Wed,  3 Jun 2026 03:37:10 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=96.67.55.147
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780457843; cv=none;
 b=U4sCn8lct/lt/NQobjv1S+JOuT6SWJ8MgYRghncsTTd7+9WNsKhanvmfFbyySt6zt/XTgc6wYtzEGfurLWpk9LPV9iEBN8vCHJEnSGXkG8ohxuSGNZEVPmyk/dUvNTi5C48qqT0RAyx09vIXtOz4ohlCNP0+INwCzPSfjrlKw+U=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780457843; c=relaxed/simple;
	bh=1I9QSRXOnE20EOYrfPreA57f8AU8LbqZeBwqNbM1omE=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type;
 b=b9apu/djMOE9rKlOg0CuyyH9IP12p5ptqyE/PURrCiqnXmeqGvLZYOBr7rn6uQKADhREqIJbESZxJ+Zv4Q0AwUVCU5X2t8hG1MJ/ShjNE7/MTKmO8q4RECBuGrFisGjq/wFZGQRB5UjypzfZuJd7YewsVwVsu5V1dvianxbcQuY=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=surriel.com;
 spf=pass smtp.mailfrom=surriel.com;
 dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com
 header.b=Yl8T9Jog; arc=none smtp.client-ip=96.67.55.147
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=surriel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=surriel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com
 header.b="Yl8T9Jog"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com
	; s=mail; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References:
	In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID:
	Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
	:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe:
	List-Post:List-Owner:List-Archive;
	bh=B+5nPku2Js5sEHixJ+jko8QqcDMr4c3iPdyaq/eYcqg=; b=Yl8T9JogNyf1iyDYgRJ9QBlrN2
	5tq9BQxGyW6B9xVnxyUsZzNoG/CnQsUu68Vg3F5gdO7zqU7XNiI8Bv9D8WDWFHQerMCrMbNXnQNFu
	f3jw+Ks/+LiqbZII9i52gi3g4jO1skpb1XK80i/AKQJx5pVsPP5axuXL56sP7v0dlcSInicjaUdEJ
	WPxkDayX8liti3rihbTjSRQ/JwtDc5WEPPCPF0OLIpxpZWeFpGPjYcS0LJibaXztwHrfJn6JNbw8F
	blzz/TaK305CcL4HodNHL4WYnKKG8+OV5qXnKXmtmacGIdSubFBQsfEOXgsCEuT4az1NGhbMlUDTt
	FHGKIpPg==;
Received: from fangorn.home.surriel.com ([10.0.13.7])
	by shelob.surriel.com with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.97.1)
	(envelope-from <riel@surriel.com>)
	id 1wUcPg-000000002VG-1OxO;
	Tue, 02 Jun 2026 23:37:04 -0400
From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com,
	robin.murphy@arm.com,
	joro@8bytes.org,
	will@kernel.org,
	iommu@lists.linux.dev,
	jgg@ziepe.ca,
	kyle@mcmartin.ca,
	Rik van Riel <riel@meta.com>,
	Rik van Riel <riel@surriel.com>
Subject: [PATCH v3 2/3] iova: add KUnit test suite
Date: Tue,  2 Jun 2026 23:35:47 -0400
Message-ID: <20260603033653.4144138-3-riel@surriel.com>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260603033653.4144138-1-riel@surriel.com>
References: <20260603033653.4144138-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

From: Rik van Riel <riel@meta.com>

Add a kunit suite for the maple-tree-based IOVA allocator, plus an
iova_domain_verify_invariants() helper (compiled only when the test
config is enabled) that walks the maple tree and confirms every entry's
pfn_lo/pfn_hi match the maple tree index range and that no entries
overlap.

Test cases:
  - test_size_aligned: alignment of size_aligned allocs across orders 0..7.
  - test_top_down_preference: sequential allocs decrease in pfn_lo.
  - test_reserve_iova: allocs avoid the reserved range.
  - test_32bit_in_64bit_domain: 1000 64-bit allocs followed by a 32-bit
    alloc must still find a slot below DMA_BIT_MASK(32).
  - test_arbitrary_dma_limits: after filling 64-bit space, verify that
    bounded allocations at 33-bit and 56-bit limits still find slots
    within their respective ranges, confirming that mas_empty_area_rev
    generalizes to arbitrary limit_pfn values.
  - test_aligned_in_fragmented: pack size-2 size_aligned allocs, free
    every other to leave size-2 holes; a fresh size-2 aligned alloc
    must still succeed and return a 2-aligned pfn.
  - test_pci_32bit_workaround_pattern: alternate 32-bit-first allocation
    attempts with 64-bit fallback, mirroring dma-iommu.c.
  - test_stress_random: 2048 random alloc/free operations with mixed
    sizes, alignments, and DMA limits (32/33/56/64-bit), checking
    invariants after every operation. Uses a deterministic PRNG so
    failures reproduce across boots.
  - test_full_space_search_time: fill a 16K-pfn range completely, then
    verify that a failed alloc returns in bounded time (O(log n) via
    mas_empty_area_rev, not O(n)).
  - test_fragmented_32bit_search: pack the 32-bit IOVA space, then
    verify bounded search time for both 32-bit failure and 64-bit
    fallback success paths.

Run with:
  tools/testing/kunit/kunit.py run --kunitconfig=3Ddrivers/iommu

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 drivers/iommu/.kunitconfig |   4 +
 drivers/iommu/Kconfig      |  16 ++
 drivers/iommu/Makefile     |   1 +
 drivers/iommu/iova-kunit.c | 432 +++++++++++++++++++++++++++++++++++++
 drivers/iommu/iova.c       |  34 +++
 include/linux/iova.h       |   3 +
 6 files changed, 490 insertions(+)
 create mode 100644 drivers/iommu/.kunitconfig
 create mode 100644 drivers/iommu/iova-kunit.c

diff --git a/drivers/iommu/.kunitconfig b/drivers/iommu/.kunitconfig
new file mode 100644
index 000000000000..56b481ecf993
--- /dev/null
+++ b/drivers/iommu/.kunitconfig
@@ -0,0 +1,4 @@
+CONFIG_KUNIT=3Dy
+CONFIG_IOMMU_SUPPORT=3Dy
+CONFIG_IOMMU_IOVA=3Dy
+CONFIG_IOMMU_IOVA_KUNIT_TEST=3Dy
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index f86262b11416..f09046e238fd 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -3,6 +3,22 @@
 config IOMMU_IOVA
 	tristate
=20
+config IOMMU_IOVA_KUNIT_TEST
+	tristate "KUnit tests for the IOVA allocator" if !KUNIT_ALL_TESTS
+	depends on IOMMU_IOVA && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable kunit tests for the IOVA allocator. The tests exercise
+	  basic allocation and free, size-aligned allocation, top-down
+	  ordering, bounded allocations with various DMA limits (32-bit,
+	  33-bit, 56-bit), aligned allocations in fragmented domains,
+	  and randomly-fragmented stress scenarios.
+
+	  Run with:
+	    tools/testing/kunit/kunit.py run --kunitconfig=3Ddrivers/iommu
+
+	  If unsure, say N here.
+
 # IOMMU_API always gets selected by whoever wants it.
 config IOMMU_API
 	bool
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 0275821f4ef9..6bd7da1cbebd 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) +=3D io-pgtable-arm.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) +=3D io-pgtable-arm-selftes=
ts.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) +=3D io-pgtable-dart.o
 obj-$(CONFIG_IOMMU_IOVA) +=3D iova.o
+obj-$(CONFIG_IOMMU_IOVA_KUNIT_TEST) +=3D iova-kunit.o
 obj-$(CONFIG_OF_IOMMU)	+=3D of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) +=3D msm_iommu.o
 obj-$(CONFIG_IPMMU_VMSA) +=3D ipmmu-vmsa.o
diff --git a/drivers/iommu/iova-kunit.c b/drivers/iommu/iova-kunit.c
new file mode 100644
index 000000000000..fffeab8552cd
--- /dev/null
+++ b/drivers/iommu/iova-kunit.c
@@ -0,0 +1,432 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KUnit tests for the IOVA allocator.
+ *
+ * Exercises the maple-tree-based allocator: basic alloc/free,
+ * size-aligned allocations, top-down ordering, bounded allocations
+ * with various DMA limits (32-bit, 33-bit, 56-bit), aligned
+ * allocations in fragmented domains, and randomly fragmented stress.
+ *
+ * Each test verifies that the maple tree invariants remain consistent
+ * after every batch of operations.
+ */
+#include <kunit/test.h>
+#include <linux/dma-mapping.h>
+#include <linux/iova.h>
+
+#define TEST_GRANULE PAGE_SIZE
+/* Highest pfn that fits in 32 bits =E2=80=94 triggers the bounded alloc p=
ath. */
+#define TEST_LIMIT_32BIT (DMA_BIT_MASK(32) >> PAGE_SHIFT)
+/* 33-bit limit =E2=80=94 exercises non-power-of-two DMA boundaries. */
+#define TEST_LIMIT_33BIT (DMA_BIT_MASK(33) >> PAGE_SHIFT)
+/* 56-bit limit =E2=80=94 typical server IOMMU address width. */
+#define TEST_LIMIT_56BIT (DMA_BIT_MASK(56) >> PAGE_SHIFT)
+/* A 64-bit-ish limit well above dma_32bit_pfn. 1ULL avoids UB on ILP32. */
+#define TEST_LIMIT_64BIT ((1ULL << 36) >> PAGE_SHIFT)
+/*
+ * A small <=3D32-bit limit used by tests that want to actually exhaust the
+ * restricted region within a tractable number of allocations.
+ */
+#define TEST_LIMIT_32BIT_RESTRICTED (TEST_LIMIT_32BIT / 2)
+
+struct iova_test_ctx {
+	struct iova_domain iovad;
+	bool initialized;
+};
+
+static int iova_test_init(struct kunit *test)
+{
+	struct iova_test_ctx *ctx;
+	int ret;
+
+	ctx =3D kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	test->priv =3D ctx;
+
+	ret =3D iova_cache_get();
+	if (ret)
+		return ret;
+
+	init_iova_domain(&ctx->iovad, TEST_GRANULE, 1);
+	ret =3D iova_domain_init_rcaches(&ctx->iovad);
+	if (ret) {
+		put_iova_domain(&ctx->iovad);
+		iova_cache_put();
+		return ret;
+	}
+	ctx->initialized =3D true;
+
+	KUNIT_ASSERT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+	return 0;
+}
+
+static void iova_test_exit(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+
+	if (ctx && ctx->initialized) {
+		put_iova_domain(&ctx->iovad);
+		ctx->initialized =3D false;
+		iova_cache_put();
+	}
+}
+
+static void test_size_aligned(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	int order;
+
+	for (order =3D 0; order < 8; ++order) {
+		unsigned long size =3D 1UL << order;
+		struct iova *iova =3D alloc_iova(&ctx->iovad, size,
+					       TEST_LIMIT_32BIT, true);
+
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+		KUNIT_EXPECT_EQ(test, iova->pfn_lo & (size - 1), 0);
+		KUNIT_EXPECT_EQ(test, iova->pfn_hi - iova->pfn_lo + 1, size);
+		__free_iova(&ctx->iovad, iova);
+		KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+	}
+}
+
+static void test_top_down_preference(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	struct iova *iovas[16];
+	int i;
+
+	for (i =3D 0; i < ARRAY_SIZE(iovas); ++i) {
+		iovas[i] =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+		KUNIT_ASSERT_NOT_NULL(test, iovas[i]);
+		if (i > 0)
+			KUNIT_EXPECT_LT(test, iovas[i]->pfn_lo,
+					iovas[i - 1]->pfn_lo);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	for (i =3D 0; i < ARRAY_SIZE(iovas); ++i)
+		__free_iova(&ctx->iovad, iovas[i]);
+}
+
+static void test_reserve_iova(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	const unsigned long reserve_lo =3D TEST_LIMIT_32BIT / 2;
+	struct iova *r, *iova;
+	int i;
+
+	/* Reserve the entire top half through the limit_pfn, inclusive. */
+	r =3D reserve_iova(&ctx->iovad, reserve_lo, TEST_LIMIT_32BIT);
+	KUNIT_ASSERT_NOT_NULL(test, r);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	/* All allocs must land below the reserved range. */
+	for (i =3D 0; i < 100; ++i) {
+		iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+		KUNIT_EXPECT_LT(test, iova->pfn_hi, reserve_lo);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+}
+
+/*
+ * The pci_32bit_workaround scenario: every PCI device's first IOVA
+ * allocation hits the 32-bit-restricted path before falling back to
+ * 64-bit. Fill the 64-bit space, then verify a 32-bit alloc still
+ * finds a slot below DMA_BIT_MASK(32).
+ */
+static void test_32bit_in_64bit_domain(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	struct iova *iova;
+	int i;
+
+	for (i =3D 0; i < 1000; ++i) {
+		iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true);
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_32BIT);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	__free_iova(&ctx->iovad, iova);
+}
+
+/*
+ * Exercise non-power-of-two DMA limits: fill the 64-bit space, then
+ * verify that bounded allocations at 33-bit and 56-bit limits still
+ * find slots within their respective ranges. This confirms the
+ * navigate-to-limit_pfn search generalizes beyond the 32-bit case.
+ */
+static void test_arbitrary_dma_limits(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	struct iova *iova;
+	int i;
+
+	for (i =3D 0; i < 1000; ++i) {
+		iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true);
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	/* 33-bit bounded allocation */
+	iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_33BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_33BIT);
+	__free_iova(&ctx->iovad, iova);
+
+	/* 56-bit bounded allocation */
+	iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_56BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_56BIT);
+	__free_iova(&ctx->iovad, iova);
+
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+}
+
+/*
+ * Aligned allocation in a fragmented domain: pack size-2 size_aligned
+ * allocations at the top, free every other one to leave size-2 holes,
+ * then verify a fresh size-2 aligned alloc still succeeds and returns
+ * a 2-aligned pfn.
+ */
+static void test_aligned_in_fragmented(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	const int N =3D 64;
+	struct iova **iovas;
+	struct iova *iova;
+	int i;
+
+	iovas =3D kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, iovas);
+
+	for (i =3D 0; i < N; ++i) {
+		iovas[i] =3D alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true);
+		KUNIT_ASSERT_NOT_NULL(test, iovas[i]);
+		KUNIT_EXPECT_EQ(test, iovas[i]->pfn_lo & 1, 0);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	for (i =3D 0; i < N; i +=3D 2) {
+		__free_iova(&ctx->iovad, iovas[i]);
+		iovas[i] =3D NULL;
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	iova =3D alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_EQ(test, iova->pfn_lo & 1, 0);
+	__free_iova(&ctx->iovad, iova);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	for (i =3D 0; i < N; ++i)
+		if (iovas[i])
+			__free_iova(&ctx->iovad, iovas[i]);
+}
+
+/*
+ * Mimic dma-iommu's pci_32bit_workaround pattern: every alloc first
+ * tries a small restricted limit; if that fails, retry with the 64-bit
+ * limit. Verifies that the navigate-to-limit search survives rapid
+ * switching between different limit_pfn values.
+ */
+static void test_pci_32bit_workaround_pattern(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	int fallback_count =3D 0;
+	int i;
+
+	for (i =3D 0; i < 500; ++i) {
+		unsigned long size =3D (i % 4) + 1;
+		struct iova *iova =3D alloc_iova(&ctx->iovad, size,
+					       TEST_LIMIT_32BIT_RESTRICTED,
+					       true);
+
+		if (!iova) {
+			iova =3D alloc_iova(&ctx->iovad, size,
+					  TEST_LIMIT_64BIT, true);
+			fallback_count++;
+		}
+		if (!iova)
+			break;
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+	KUNIT_EXPECT_GT(test, i, 0);
+}
+
+/*
+ * Random alloc/free over many iterations, verifying invariants after
+ * every operation. Uses a deterministic PRNG so failures reproduce
+ * across boots. Exercises mixed DMA limits (32, 33, 56, 64-bit).
+ */
+static void test_stress_random(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	const int N =3D 512;
+	const int iters =3D 4 * N;
+	const unsigned long limits[] =3D {
+		TEST_LIMIT_32BIT, TEST_LIMIT_33BIT,
+		TEST_LIMIT_56BIT, TEST_LIMIT_64BIT,
+	};
+	struct iova **iovas;
+	u32 rng =3D 0xDEADBEEF;
+	int i;
+
+	iovas =3D kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, iovas);
+
+	for (i =3D 0; i < iters; ++i) {
+		int slot;
+		unsigned long limit;
+		const char *op;
+
+		rng =3D rng * 1103515245 + 12345;
+		slot =3D (rng >> 8) % N;
+		rng =3D rng * 1103515245 + 12345;
+		limit =3D limits[(rng >> 8) % ARRAY_SIZE(limits)];
+
+		if (iovas[slot]) {
+			op =3D "free";
+			__free_iova(&ctx->iovad, iovas[slot]);
+			iovas[slot] =3D NULL;
+		} else {
+			unsigned long size;
+			bool aligned;
+
+			rng =3D rng * 1103515245 + 12345;
+			size =3D 1UL << ((rng >> 8) % 4);
+			rng =3D rng * 1103515245 + 12345;
+			aligned =3D (rng >> 8) & 1;
+
+			op =3D "alloc";
+			iovas[slot] =3D alloc_iova(&ctx->iovad, size, limit,
+						 aligned);
+		}
+		if (!iova_domain_verify_invariants(&ctx->iovad)) {
+			kunit_info(test, "iter %d slot %d: invariant broken after %s\n",
+				   i, slot, op);
+			KUNIT_FAIL(test, "verify failed");
+			break;
+		}
+	}
+
+	for (i =3D 0; i < N; ++i)
+		if (iovas[i])
+			__free_iova(&ctx->iovad, iovas[i]);
+}
+
+/*
+ * Verify that alloc_iova fails in bounded time when the IOVA space is
+ * fully packed. Fill a 16K-pfn range with size-1 allocations (leaving
+ * no gaps), then attempt a size-2 aligned alloc. The maple tree's
+ * mas_empty_area_rev must determine there is no suitable gap in
+ * O(log n) time rather than walking every entry. The 10ms threshold
+ * is generous =E2=80=94 real hardware watchdogs fire at ~10s.
+ */
+static void test_full_space_search_time(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	const unsigned long fill_limit =3D 16384;
+	const int fill_count =3D fill_limit;
+	struct iova *iova;
+	ktime_t start, elapsed;
+	int i, allocated =3D 0;
+
+	for (i =3D 0; i < fill_count; ++i) {
+		iova =3D alloc_iova(&ctx->iovad, 1, fill_limit, false);
+		if (!iova)
+			break;
+		allocated++;
+	}
+	kunit_info(test, "allocated %d iovas in [1, %lu]\n",
+		   allocated, fill_limit);
+	KUNIT_ASSERT_GT(test, allocated, 1000);
+
+	start =3D ktime_get();
+	iova =3D alloc_iova(&ctx->iovad, 2, fill_limit, true);
+	elapsed =3D ktime_sub(ktime_get(), start);
+
+	KUNIT_EXPECT_NULL(test, iova);
+	kunit_info(test, "failed alloc took %lld ns\n",
+		   ktime_to_ns(elapsed));
+	KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 10000000LL);
+
+	if (iova)
+		__free_iova(&ctx->iovad, iova);
+}
+
+/*
+ * Verify bounded search time with a fragmented 32-bit IOVA space.
+ * Pack the 32-bit range with size-1 allocs, then attempt a large
+ * aligned alloc that must either succeed from a remaining gap or
+ * fail fast. The 64-bit fallback must always succeed promptly.
+ */
+static void test_fragmented_32bit_search(struct kunit *test)
+{
+	struct iova_test_ctx *ctx =3D test->priv;
+	struct iova *iova;
+	ktime_t start, elapsed;
+	int i, allocated =3D 0;
+
+	for (i =3D 0; i < 8000; ++i) {
+		iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+		if (!iova)
+			break;
+		allocated++;
+	}
+	kunit_info(test, "filled 32-bit space with %d allocs\n", allocated);
+	KUNIT_ASSERT_GT(test, allocated, 1000);
+
+	start =3D ktime_get();
+	iova =3D alloc_iova(&ctx->iovad, 32, TEST_LIMIT_32BIT, true);
+	elapsed =3D ktime_sub(ktime_get(), start);
+
+	kunit_info(test, "32-bit alloc (size 32) took %lld ns, result=3D%px\n",
+		   ktime_to_ns(elapsed), iova);
+	KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 10000000LL);
+
+	if (iova)
+		__free_iova(&ctx->iovad, iova);
+
+	start =3D ktime_get();
+	iova =3D alloc_iova(&ctx->iovad, 32, TEST_LIMIT_64BIT, true);
+	elapsed =3D ktime_sub(ktime_get(), start);
+
+	kunit_info(test, "64-bit fallback (size 32) took %lld ns\n",
+		   ktime_to_ns(elapsed));
+	KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 10000000LL);
+
+	if (iova)
+		__free_iova(&ctx->iovad, iova);
+}
+
+static struct kunit_case iova_test_cases[] =3D {
+	KUNIT_CASE(test_size_aligned),
+	KUNIT_CASE(test_top_down_preference),
+	KUNIT_CASE(test_reserve_iova),
+	KUNIT_CASE(test_32bit_in_64bit_domain),
+	KUNIT_CASE(test_arbitrary_dma_limits),
+	KUNIT_CASE(test_aligned_in_fragmented),
+	KUNIT_CASE(test_pci_32bit_workaround_pattern),
+	KUNIT_CASE(test_stress_random),
+	KUNIT_CASE(test_full_space_search_time),
+	KUNIT_CASE(test_fragmented_32bit_search),
+	{}
+};
+
+static struct kunit_suite iova_test_suite =3D {
+	.name =3D "iova",
+	.init =3D iova_test_init,
+	.exit =3D iova_test_exit,
+	.test_cases =3D iova_test_cases,
+};
+kunit_test_suite(iova_test_suite);
+
+MODULE_DESCRIPTION("KUnit tests for the IOVA allocator");
+MODULE_LICENSE("GPL");
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 523d1e8315f9..1ceab6cbefc2 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -857,6 +857,40 @@ void iova_cache_put(void)
 }
 EXPORT_SYMBOL_GPL(iova_cache_put);
=20
+#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST)
+bool iova_domain_verify_invariants(struct iova_domain *iovad)
+{
+	struct iova *iova, *prev =3D NULL;
+	unsigned long flags;
+	bool ok =3D true;
+	MA_STATE(mas, &iovad->mtree, 0, 0);
+
+	spin_lock_irqsave(&iovad->iova_lock, flags);
+	mas_for_each(&mas, iova, ULONG_MAX) {
+		if (mas.index !=3D iova->pfn_lo || mas.last !=3D iova->pfn_hi) {
+			pr_err("iova_verify: maple index [%lu,%lu] !=3D iova [%lu,%lu]\n",
+			       mas.index, mas.last, iova->pfn_lo, iova->pfn_hi);
+			ok =3D false;
+		}
+		if (iova->pfn_lo > iova->pfn_hi) {
+			pr_err("iova_verify: pfn_lo=3D%lu > pfn_hi=3D%lu\n",
+			       iova->pfn_lo, iova->pfn_hi);
+			ok =3D false;
+		}
+		if (prev && prev->pfn_hi >=3D iova->pfn_lo) {
+			pr_err("iova_verify: overlap prev=3D[%lu,%lu] curr=3D[%lu,%lu]\n",
+			       prev->pfn_lo, prev->pfn_hi,
+			       iova->pfn_lo, iova->pfn_hi);
+			ok =3D false;
+		}
+		prev =3D iova;
+	}
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
+	return ok;
+}
+EXPORT_SYMBOL_GPL(iova_domain_verify_invariants);
+#endif /* CONFIG_IOMMU_IOVA_KUNIT_TEST */
+
 MODULE_AUTHOR("Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>");
 MODULE_DESCRIPTION("IOMMU I/O Virtual Address management");
 MODULE_LICENSE("GPL");
diff --git a/include/linux/iova.h b/include/linux/iova.h
index eb4f9ead5451..6fc070a4f58e 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -98,6 +98,9 @@ void init_iova_domain(struct iova_domain *iovad, unsigned=
 long granule,
 int iova_domain_init_rcaches(struct iova_domain *iovad);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
+#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST)
+bool iova_domain_verify_invariants(struct iova_domain *iovad);
+#endif
 #else
 static inline int iova_cache_get(void)
 {
--=20
2.54.0

From nobody Mon Jun  8 06:36:14 2026
Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id A6DEF3F7897;
	Wed,  3 Jun 2026 03:37:10 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 arc=none smtp.client-ip=96.67.55.147
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1780457841; cv=none;
 b=KTW9w2KHzXsj0/WbKIZ75VELc27/yOyYhrU9MjIZi0p5PkAQVwclGNfonESb2sOkyhIt1XQli5FQ7cWz9pVXs13botS7MF8VcrgIRL7Y89l1Ivuybz/ee2ZVyM+pCO+ADe9rHgZ/CUhaMAkiG9vs4pBYmt7p8g0qgp+D8zHdN4U=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1780457841; c=relaxed/simple;
	bh=RTXNSl9tzV2JhUxrhyViDaR1SVISk4XO1r4OZXm0vRA=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:
	 MIME-Version:Content-Type;
 b=J6lVn07dzHPjCPaCZ079StpqPek6MIVgXR+OP5/Pr596/5lx7T+hALC80/WuOP/o7l0/cWRWyslroE+ct2yNnC8ubmWqce+dtwrNBUJ5G3FrkAn7ot/Ole5h4JehT/VIxZfaaUv7xCjgK3ROiwtl1RcH8yYLQoVpdLBFRTq4ZjQ=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=surriel.com;
 spf=pass smtp.mailfrom=surriel.com;
 dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com
 header.b=QUdqRXsz; arc=none smtp.client-ip=96.67.55.147
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=none (p=none dis=none) header.from=surriel.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=surriel.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com
 header.b="QUdqRXsz"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com
	; s=mail; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References:
	In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID:
	Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
	:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe:
	List-Post:List-Owner:List-Archive;
	bh=UzbOAxtdeLtO5GN7ci5TBmbpzOG8jujIM+3oC4Zy8NI=; b=QUdqRXsz3eQ1XsyASt3Y7eAhB+
	KWPL2A6jkzLbb3nNVSTynC9HGwrHPO2b0U/+675d4b+iDYYzYQqkH5cr7DaxqFN3MyE40ZV1Z0yJ4
	15yvnASSVUV5kXBbOu+11ruyTHp/4N49v/GzajMlf4dpSF1CAPDhG9eEbWrOZJMYaFevcsGJDzadJ
	koFdMPop7hcUrblYuL7x46wiRZLLZYMzq65AlCaV8pQjs5Kw4oZEWa9kDanNv/A7OTtqBDxR7JMK7
	OQYv9bk/egjYaZXA/S608b0pjZSGl2u8pkSFbSW9JLaL3DRZPsePIbo3A0w1yDHxRaRFOuq7RXLLX
	esnQz2CQ==;
Received: from fangorn.home.surriel.com ([10.0.13.7])
	by shelob.surriel.com with esmtpsa  (TLS1.2) tls
 TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.97.1)
	(envelope-from <riel@surriel.com>)
	id 1wUcPg-000000002VG-1WYZ;
	Tue, 02 Jun 2026 23:37:04 -0400
From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com,
	robin.murphy@arm.com,
	joro@8bytes.org,
	will@kernel.org,
	iommu@lists.linux.dev,
	jgg@ziepe.ca,
	kyle@mcmartin.ca,
	Rik van Riel <riel@meta.com>,
	Rik van Riel <riel@surriel.com>
Subject: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
Date: Tue,  2 Jun 2026 23:35:48 -0400
Message-ID: <20260603033653.4144138-4-riel@surriel.com>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260603033653.4144138-1-riel@surriel.com>
References: <20260603033653.4144138-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

From: Rik van Riel <riel@meta.com>

The maple tree may need to allocate nodes during erase operations
for tree rebalancing. Unlike the old rbtree where rb_erase() never
allocated, mas_store_gfp(NULL, GFP_ATOMIC) can fail under memory
pressure. Since the IOVA allocator runs in atomic context (DMA
map/unmap can be called from hardirq, softirq, or with spinlocks
held), GFP_KERNEL allocation is not possible.

Add a deferred free mechanism: when mas_store_gfp(NULL, GFP_ATOMIC)
fails, the iova entry remains in the maple tree (preventing address
reuse and keeping the pointer valid) and is added to a lockless
per-domain deferred free list. A delayed workqueue retries the erase
with GFP_ATOMIC after a 10ms delay -- by the time the workqueue runs,
transient memory pressure has typically subsided and the allocation
succeeds.

The deferred free path temporarily reduces available IOVA address
space until the workqueue processes the backlog, but causes no
corruption -- the entry stays in the tree and the struct iova is not
freed until the erase succeeds.

put_iova_domain() cancels the delayed work and discards the deferred
list before destroying the tree. Since deferred entries remain in
the maple tree, the mas_for_each teardown loop frees them along with
all other entries, avoiding a double-free.

In practice, GFP_ATOMIC erase failures are quite rare: the slab
allocator maintains emergency reserves for GFP_ATOMIC, and the common
erase case (exact_fit, slot_store) needs zero node allocations. This
mechanism is a safety net for the exceptional case.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 drivers/iommu/iova.c | 84 +++++++++++++++++++++++++++++++++++++++-----
 include/linux/iova.h |  3 ++
 2 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 1ceab6cbefc2..ae89d780fce5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -7,6 +7,7 @@
=20
 #include <linux/iova.h>
 #include <linux/kmemleak.h>
+#include <linux/llist.h>
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/smp.h>
@@ -26,6 +27,7 @@ static unsigned long iova_rcache_get(struct iova_domain *=
iovad,
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *io=
vad);
 static void free_global_cached_iovas(struct iova_domain *iovad);
+static void iova_deferred_free_work(struct work_struct *work);
=20
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -46,6 +48,8 @@ init_iova_domain(struct iova_domain *iovad, unsigned long=
 granule,
 	iovad->start_pfn =3D start_pfn;
 	iovad->dma_32bit_pfn =3D 1UL << (32 - iova_shift(iovad));
 	iovad->max32_alloc_size =3D iovad->dma_32bit_pfn;
+	init_llist_head(&iovad->deferred_frees);
+	INIT_DELAYED_WORK(&iovad->deferred_free_work, iova_deferred_free_work);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
=20
@@ -156,7 +160,13 @@ private_find_iova(struct iova_domain *iovad, unsigned =
long pfn)
 	return mas_walk(&mas);
 }
=20
-static void remove_iova(struct iova_domain *iovad, struct iova *iova)
+/*
+ * Remove an IOVA entry from the maple tree. Returns true on success.
+ * On failure (maple tree node allocation under GFP_ATOMIC failed),
+ * returns false =E2=80=94 the entry remains in the tree and the caller mu=
st
+ * not free the struct iova.
+ */
+static bool remove_iova(struct iova_domain *iovad, struct iova *iova)
 {
 	MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi);
=20
@@ -165,7 +175,36 @@ static void remove_iova(struct iova_domain *iovad, str=
uct iova *iova)
 	if (iova->pfn_lo < iovad->dma_32bit_pfn)
 		iovad->max32_alloc_size =3D iovad->dma_32bit_pfn;
=20
-	mas_store_gfp(&mas, NULL, GFP_ATOMIC);
+	if (mas_store_gfp(&mas, NULL, GFP_ATOMIC))
+		return false;
+	return true;
+}
+
+static void iova_deferred_free_work(struct work_struct *work)
+{
+	struct delayed_work *dwork =3D to_delayed_work(work);
+	struct iova_domain *iovad =3D container_of(dwork, struct iova_domain,
+						 deferred_free_work);
+	struct llist_node *list =3D llist_del_all(&iovad->deferred_frees);
+	struct llist_node *node, *next;
+
+	llist_for_each_safe(node, next, list) {
+		struct iova *iova =3D container_of(node, struct iova,
+						 deferred_free);
+		unsigned long flags;
+
+		spin_lock_irqsave(&iovad->iova_lock, flags);
+		if (remove_iova(iovad, iova))
+			free_iova_mem(iova);
+		else
+			llist_add(&iova->deferred_free,
+				  &iovad->deferred_frees);
+		spin_unlock_irqrestore(&iovad->iova_lock, flags);
+	}
+
+	if (!llist_empty(&iovad->deferred_frees))
+		schedule_delayed_work(&iovad->deferred_free_work,
+				      msecs_to_jiffies(10));
 }
=20
 /**
@@ -199,9 +238,15 @@ __free_iova(struct iova_domain *iovad, struct iova *io=
va)
 	unsigned long flags;
=20
 	spin_lock_irqsave(&iovad->iova_lock, flags);
-	remove_iova(iovad, iova);
+	if (remove_iova(iovad, iova)) {
+		spin_unlock_irqrestore(&iovad->iova_lock, flags);
+		free_iova_mem(iova);
+		return;
+	}
 	spin_unlock_irqrestore(&iovad->iova_lock, flags);
-	free_iova_mem(iova);
+	llist_add(&iova->deferred_free, &iovad->deferred_frees);
+	schedule_delayed_work(&iovad->deferred_free_work,
+			      msecs_to_jiffies(10));
 }
 EXPORT_SYMBOL_GPL(__free_iova);
=20
@@ -224,9 +269,15 @@ free_iova(struct iova_domain *iovad, unsigned long pfn)
 		spin_unlock_irqrestore(&iovad->iova_lock, flags);
 		return;
 	}
-	remove_iova(iovad, iova);
+	if (remove_iova(iovad, iova)) {
+		spin_unlock_irqrestore(&iovad->iova_lock, flags);
+		free_iova_mem(iova);
+		return;
+	}
 	spin_unlock_irqrestore(&iovad->iova_lock, flags);
-	free_iova_mem(iova);
+	llist_add(&iova->deferred_free, &iovad->deferred_frees);
+	schedule_delayed_work(&iovad->deferred_free_work,
+			      msecs_to_jiffies(10));
 }
 EXPORT_SYMBOL_GPL(free_iova);
=20
@@ -318,6 +369,15 @@ void put_iova_domain(struct iova_domain *iovad)
 	if (iovad->rcaches)
 		iova_domain_free_rcaches(iovad);
=20
+	cancel_delayed_work_sync(&iovad->deferred_free_work);
+
+	/*
+	 * Deferred entries are still in the maple tree, so the
+	 * mas_for_each loop below frees them along with everything else.
+	 * Just discard the deferred list without double-freeing.
+	 */
+	llist_del_all(&iovad->deferred_frees);
+
 	mas_for_each(&mas, iova, ULONG_MAX)
 		free_iova_mem(iova);
 	__mt_destroy(&iovad->mtree);
@@ -481,12 +541,20 @@ iova_magazine_free_pfns(struct iova_magazine *mag, st=
ruct iova_domain *iovad)
 		if (WARN_ON(!iova))
 			continue;
=20
-		remove_iova(iovad, iova);
-		free_iova_mem(iova);
+		if (remove_iova(iovad, iova)) {
+			free_iova_mem(iova);
+		} else {
+			llist_add(&iova->deferred_free,
+				  &iovad->deferred_frees);
+		}
 	}
=20
 	spin_unlock_irqrestore(&iovad->iova_lock, flags);
=20
+	if (!llist_empty(&iovad->deferred_frees))
+		schedule_delayed_work(&iovad->deferred_free_work,
+				      msecs_to_jiffies(10));
+
 	mag->size =3D 0;
 }
=20
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 6fc070a4f58e..cc1b5441a058 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -16,6 +16,7 @@
=20
 /* iova structure */
 struct iova {
+	struct llist_node	deferred_free;
 	unsigned long	pfn_hi; /* Highest allocated pfn */
 	unsigned long	pfn_lo; /* Lowest allocated pfn */
 };
@@ -31,6 +32,8 @@ struct iova_domain {
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
 	unsigned long	max32_alloc_size; /* Size of last failed allocation */
+	struct llist_head deferred_frees;
+	struct delayed_work deferred_free_work;
=20
 	struct iova_rcache	*rcaches;
 	struct hlist_node	cpuhp_dead;
--=20
2.54.0