From nobody Mon Jun 8 06:36:14 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9C8723F44C9; Wed, 3 Jun 2026 03:37:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457843; cv=none; b=e00IwdQ1eXYqxwiD3DxBRlN7Bomusm9//3rLLDq97FwASco3dvRPBZgK2qPfB2mZ+IuIP1nJmj2Mx3vs0GeMGmbNZctPsHezYWZWvHb/nNAqdo/beMJlYx0beC7RUvM9Dp5wUTiIzBCzfo5HrTS7Aixcyv6Ypri87BjKfRIIDxo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457843; c=relaxed/simple; bh=63XzII3LCeVEY+m0pHJPQWm5qZ/ocCGKWItJji8DBvY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=QjHM4noV0070Tfu3/UM2DLu4JANFtBJbfcb5ZIsZb3QbGQKC+coUWJ9FuJQtZGH3SPRkA0CcZHxTUPUEJp/NRBX1GNM1BazmIkt6EvfS7p1/RqrQ1ZJjIX43uh7djvJiayn2+jiVSg2aMVsvNWnEfMg+za3EorXeCtDBgkK/1DA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=Cp9P1963; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="Cp9P1963" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=IF7PuSZJhNuFa0ukToH4UJzhPgK3j12pbrmz73mA6Ms=; b=Cp9P19630Kt76USosrAAWIjSY6 p9uMhEQfV1zDUsL8v78Hm+51kVXS2EM7NiOBq+ah8YJJ3IZsZy7Ix5sybIM0MZEd7E7VZ9xeKobX+ vsI9/8DouTa0x7jwp6WkUzRxn5rZY3wWQtBz8wbx/7PtEgY+wr8mG7fbbsADUWXEhkoZLHfuuv4w7 US2r3i77luXg87KSFm2wDppIi3ndMO4ZPemP9lRrESvDcvSk3ySKwMdEi+SmKvF2Yavpqya3GYWT6 qsU5QwC8aT7cFdxyXmchTfgIvNDNZO6xDBRElya1VNOSz4iqyC0kJbQo5FyBuqu77XK9uCD+k5qwP QdLbeurw==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wUcPg-000000002VG-1G5B; Tue, 02 Jun 2026 23:37:04 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: kernel-team@meta.com, robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, jgg@ziepe.ca, kyle@mcmartin.ca, Rik van Riel , Rik van Riel Subject: [PATCH v3 1/3] iova: convert from rbtree to maple tree Date: Tue, 2 Jun 2026 23:35:46 -0400 Message-ID: <20260603033653.4144138-2-riel@surriel.com> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260603033653.4144138-1-riel@surriel.com> References: <20260603033653.4144138-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Rik van Riel Replace the hand-rolled rbtree in the IOVA allocator with a maple tree. The maple tree is a B-tree variant designed for range tracking with built-in gap searching, making it a natural fit for IOVA address space management -- the same data structure is used by the kernel VMA subsystem for the analogous problem. Key changes: - struct iova shrinks from 40 bytes (rb_node 24 + pfn_hi 8 + pfn_lo 8) to just pfn_hi + pfn_lo (16 bytes). SLAB_HWCACHE_ALIGN is dropped from the iova slab cache since struct iova no longer contains an embedded rb_node touched during tree rebalancing -- tree traversal now touches maple nodes exclusively. This lets the slab allocator pack 16-byte objects tightly instead of rounding to 64 bytes. The maple tree replaces the embedded rb_node with external B-tree nodes: each maple_arange_64 node is 256 bytes and holds up to 10 entries, plus internal nodes add ~11% overhead. At typical utilization (~70-90% full), this works out to ~28-41 bytes of maple tree node memory per IOVA entry. The net per-entry cost is ~44-57 bytes (16 bytes slab + ~28-41 bytes maple), compared to ~64 bytes with the old rbtree (64 bytes HWCACHE_ALIGN slab + 0 bytes embedded rbtree). Combined with the maple tree's O(log_10 n) search depth and better cache locality from B-tree fan-out, this improves both memory efficiency and search performance. - struct iova_domain replaces rb_root + cached_node + cached32_node + anchor with a single struct maple_tree. The iova_rbtree_lock spinlock is renamed to iova_lock. The maple tree is initialized with MT_FLAGS_ALLOC_RANGE (enables gap tracking for mas_empty_area_rev) and MT_FLAGS_LOCK_EXTERN (uses the existing iova_lock spinlock instead of the maple tree internal lock). - Allocation via __alloc_and_insert_iova_range() uses mas_empty_area_rev() to find the highest-addressed gap of sufficient size below limit_pfn in O(log n) with B-tree fan-out. Alignment is handled by over-requesting (size + alignment - 1) to guarantee room after rounding, eliminating the need for a retry loop. The result is stored with mas_store_gfp(). - Lookup via private_find_iova() uses mas_walk() for O(log n) point-in-range lookup. - Deletion via remove_iova() uses mas_erase(). No successor gap fixup needed -- the maple tree handles it internally. - reserve_iova() walks the requested range for existing entries, computes the merged range, collects old entries for freeing, then stores a single merged entry. If the request is fully covered by an existing entry, it returns that entry without allocating. - The IOVA_ANCHOR sentinel node is eliminated. The maple tree tracks gaps implicitly, including the space above the highest allocation. - The cached_node / cached32_node fields and all their helpers are eliminated. The maple tree B-tree structure provides equivalent or better cache behaviour. The rcache (magazine cache) layer is unchanged -- it operates on raw pfn values and is orthogonal to the tree backing store. Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Rik van Riel Suggested-by: Robin Murphy --- drivers/iommu/iova.c | 338 ++++++++++++------------------------------- include/linux/iova.h | 10 +- 2 files changed, 98 insertions(+), 250 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 021daf6528de..523d1e8315f9 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -13,9 +13,7 @@ #include #include #include - -/* The anchor node sits above the top of the usable address space */ -#define IOVA_ANCHOR ~0UL +#include =20 #define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (= in pages) */ =20 @@ -29,11 +27,6 @@ static void free_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *io= vad); static void free_global_cached_iovas(struct iova_domain *iovad); =20 -static struct iova *to_iova(struct rb_node *node) -{ - return rb_entry(node, struct iova, node); -} - void init_iova_domain(struct iova_domain *iovad, unsigned long granule, unsigned long start_pfn) @@ -45,180 +38,63 @@ init_iova_domain(struct iova_domain *iovad, unsigned l= ong granule, */ BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule)); =20 - spin_lock_init(&iovad->iova_rbtree_lock); - iovad->rbroot =3D RB_ROOT; - iovad->cached_node =3D &iovad->anchor.node; - iovad->cached32_node =3D &iovad->anchor.node; + spin_lock_init(&iovad->iova_lock); + mt_init_flags(&iovad->mtree, + MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN); + mt_set_external_lock(&iovad->mtree, &iovad->iova_lock); iovad->granule =3D granule; iovad->start_pfn =3D start_pfn; iovad->dma_32bit_pfn =3D 1UL << (32 - iova_shift(iovad)); iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; - iovad->anchor.pfn_lo =3D iovad->anchor.pfn_hi =3D IOVA_ANCHOR; - rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); - rb_insert_color(&iovad->anchor.node, &iovad->rbroot); } EXPORT_SYMBOL_GPL(init_iova_domain); =20 -static struct rb_node * -__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn) -{ - if (limit_pfn <=3D iovad->dma_32bit_pfn) - return iovad->cached32_node; - - return iovad->cached_node; -} - -static void -__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) -{ - if (new->pfn_hi < iovad->dma_32bit_pfn) - iovad->cached32_node =3D &new->node; - else - iovad->cached_node =3D &new->node; -} - -static void -__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) -{ - struct iova *cached_iova; - - cached_iova =3D to_iova(iovad->cached32_node); - if (free =3D=3D cached_iova || - (free->pfn_hi < iovad->dma_32bit_pfn && - free->pfn_lo >=3D cached_iova->pfn_lo)) - iovad->cached32_node =3D rb_next(&free->node); - - if (free->pfn_lo < iovad->dma_32bit_pfn) - iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; - - cached_iova =3D to_iova(iovad->cached_node); - if (free->pfn_lo >=3D cached_iova->pfn_lo) - iovad->cached_node =3D rb_next(&free->node); -} - -static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned= long limit_pfn) -{ - struct rb_node *node, *next; - /* - * Ideally what we'd like to judge here is whether limit_pfn is close - * enough to the highest-allocated IOVA that starting the allocation - * walk from the anchor node will be quicker than this initial work to - * find an exact starting point (especially if that ends up being the - * anchor node anyway). This is an incredibly crude approximation which - * only really helps the most likely case, but is at least trivially easy. - */ - if (limit_pfn > iovad->dma_32bit_pfn) - return &iovad->anchor.node; - - node =3D iovad->rbroot.rb_node; - while (to_iova(node)->pfn_hi < limit_pfn) - node =3D node->rb_right; - -search_left: - while (node->rb_left && to_iova(node->rb_left)->pfn_lo >=3D limit_pfn) - node =3D node->rb_left; - - if (!node->rb_left) - return node; - - next =3D node->rb_left; - while (next->rb_right) { - next =3D next->rb_right; - if (to_iova(next)->pfn_lo >=3D limit_pfn) { - node =3D next; - goto search_left; - } - } - - return node; -} - -/* Insert the iova into domain rbtree by holding writer lock */ -static void -iova_insert_rbtree(struct rb_root *root, struct iova *iova, - struct rb_node *start) -{ - struct rb_node **new, *parent =3D NULL; - - new =3D (start) ? &start : &(root->rb_node); - /* Figure out where to put new node */ - while (*new) { - struct iova *this =3D to_iova(*new); - - parent =3D *new; - - if (iova->pfn_lo < this->pfn_lo) - new =3D &((*new)->rb_left); - else if (iova->pfn_lo > this->pfn_lo) - new =3D &((*new)->rb_right); - else { - WARN_ON(1); /* this should not happen */ - return; - } - } - /* Add new node and rebalance tree. */ - rb_link_node(&iova->node, parent, new); - rb_insert_color(&iova->node, root); -} - static int __alloc_and_insert_iova_range(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn, struct iova *new, bool size_aligned) { - struct rb_node *curr, *prev; - struct iova *curr_iova; unsigned long flags; - unsigned long new_pfn, retry_pfn; + unsigned long new_pfn; unsigned long align_mask =3D ~0UL; - unsigned long high_pfn =3D limit_pfn, low_pfn =3D iovad->start_pfn; + unsigned long search_size =3D size; + MA_STATE(mas, &iovad->mtree, 0, 0); + + if (size_aligned) { + unsigned long align =3D 1UL << fls_long(size - 1); =20 - if (size_aligned) align_mask <<=3D fls_long(size - 1); + search_size =3D size + align - 1; + } =20 - /* Walk the tree backwards */ - spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + spin_lock_irqsave(&iovad->iova_lock, flags); if (limit_pfn <=3D iovad->dma_32bit_pfn && size >=3D iovad->max32_alloc_size) goto iova32_full; =20 - curr =3D __get_cached_rbnode(iovad, limit_pfn); - curr_iova =3D to_iova(curr); - retry_pfn =3D curr_iova->pfn_hi; + if (mas_empty_area_rev(&mas, iovad->start_pfn, + limit_pfn - 1, search_size)) + goto alloc_fail; =20 -retry: - do { - high_pfn =3D min(high_pfn, curr_iova->pfn_lo); - new_pfn =3D (high_pfn - size) & align_mask; - prev =3D curr; - curr =3D rb_prev(curr); - curr_iova =3D to_iova(curr); - } while (curr && new_pfn <=3D curr_iova->pfn_hi && new_pfn >=3D low_pfn); - - if (high_pfn < size || new_pfn < low_pfn) { - if (low_pfn =3D=3D iovad->start_pfn && retry_pfn < limit_pfn) { - high_pfn =3D limit_pfn; - low_pfn =3D retry_pfn + 1; - curr =3D iova_find_limit(iovad, limit_pfn); - curr_iova =3D to_iova(curr); - goto retry; - } - iovad->max32_alloc_size =3D size; - goto iova32_full; - } + new_pfn =3D (mas.last - size + 1) & align_mask; + if (new_pfn < mas.index || new_pfn < iovad->start_pfn) + goto alloc_fail; =20 - /* pfn_lo will point to size aligned address if size_aligned is set */ new->pfn_lo =3D new_pfn; - new->pfn_hi =3D new->pfn_lo + size - 1; + new->pfn_hi =3D new_pfn + size - 1; =20 - /* If we have 'prev', it's a valid place to start the insertion. */ - iova_insert_rbtree(&iovad->rbroot, new, prev); - __cached_rbnode_insert_update(iovad, new); + mas.index =3D new->pfn_lo; + mas.last =3D new->pfn_hi; + if (mas_store_gfp(&mas, new, GFP_ATOMIC)) + goto alloc_fail; =20 - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); return 0; =20 +alloc_fail: + if (limit_pfn <=3D iovad->dma_32bit_pfn) + iovad->max32_alloc_size =3D size; iova32_full: - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); return -ENOMEM; } =20 @@ -233,8 +109,7 @@ static struct iova *alloc_iova_mem(void) =20 static void free_iova_mem(struct iova *iova) { - if (iova->pfn_lo !=3D IOVA_ANCHOR) - kmem_cache_free(iova_cache, iova); + kmem_cache_free(iova_cache, iova); } =20 /** @@ -275,29 +150,22 @@ EXPORT_SYMBOL_GPL(alloc_iova); static struct iova * private_find_iova(struct iova_domain *iovad, unsigned long pfn) { - struct rb_node *node =3D iovad->rbroot.rb_node; + MA_STATE(mas, &iovad->mtree, pfn, pfn); =20 - assert_spin_locked(&iovad->iova_rbtree_lock); - - while (node) { - struct iova *iova =3D to_iova(node); - - if (pfn < iova->pfn_lo) - node =3D node->rb_left; - else if (pfn > iova->pfn_hi) - node =3D node->rb_right; - else - return iova; /* pfn falls within iova's range */ - } - - return NULL; + assert_spin_locked(&iovad->iova_lock); + return mas_walk(&mas); } =20 static void remove_iova(struct iova_domain *iovad, struct iova *iova) { - assert_spin_locked(&iovad->iova_rbtree_lock); - __cached_rbnode_delete_update(iovad, iova); - rb_erase(&iova->node, &iovad->rbroot); + MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi); + + assert_spin_locked(&iovad->iova_lock); + + if (iova->pfn_lo < iovad->dma_32bit_pfn) + iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; + + mas_store_gfp(&mas, NULL, GFP_ATOMIC); } =20 /** @@ -312,10 +180,9 @@ struct iova *find_iova(struct iova_domain *iovad, unsi= gned long pfn) unsigned long flags; struct iova *iova; =20 - /* Take the lock so that no other thread is manipulating the rbtree */ - spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + spin_lock_irqsave(&iovad->iova_lock, flags); iova =3D private_find_iova(iovad, pfn); - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); return iova; } EXPORT_SYMBOL_GPL(find_iova); @@ -331,9 +198,9 @@ __free_iova(struct iova_domain *iovad, struct iova *iov= a) { unsigned long flags; =20 - spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + spin_lock_irqsave(&iovad->iova_lock, flags); remove_iova(iovad, iova); - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); free_iova_mem(iova); } EXPORT_SYMBOL_GPL(__free_iova); @@ -351,14 +218,14 @@ free_iova(struct iova_domain *iovad, unsigned long pf= n) unsigned long flags; struct iova *iova; =20 - spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + spin_lock_irqsave(&iovad->iova_lock, flags); iova =3D private_find_iova(iovad, pfn); if (!iova) { - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); return; } remove_iova(iovad, iova); - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); free_iova_mem(iova); } EXPORT_SYMBOL_GPL(free_iova); @@ -445,27 +312,18 @@ static void iova_domain_free_rcaches(struct iova_doma= in *iovad) */ void put_iova_domain(struct iova_domain *iovad) { - struct iova *iova, *tmp; + struct iova *iova; + MA_STATE(mas, &iovad->mtree, 0, 0); =20 if (iovad->rcaches) iova_domain_free_rcaches(iovad); =20 - rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node) + mas_for_each(&mas, iova, ULONG_MAX) free_iova_mem(iova); + __mt_destroy(&iovad->mtree); } EXPORT_SYMBOL_GPL(put_iova_domain); =20 -static int -__is_range_overlap(struct rb_node *node, - unsigned long pfn_lo, unsigned long pfn_hi) -{ - struct iova *iova =3D to_iova(node); - - if ((pfn_lo <=3D iova->pfn_hi) && (pfn_hi >=3D iova->pfn_lo)) - return 1; - return 0; -} - static inline struct iova * alloc_and_init_iova(unsigned long pfn_lo, unsigned long pfn_hi) { @@ -480,29 +338,6 @@ alloc_and_init_iova(unsigned long pfn_lo, unsigned lon= g pfn_hi) return iova; } =20 -static struct iova * -__insert_new_range(struct iova_domain *iovad, - unsigned long pfn_lo, unsigned long pfn_hi) -{ - struct iova *iova; - - iova =3D alloc_and_init_iova(pfn_lo, pfn_hi); - if (iova) - iova_insert_rbtree(&iovad->rbroot, iova, NULL); - - return iova; -} - -static void -__adjust_overlap_range(struct iova *iova, - unsigned long *pfn_lo, unsigned long *pfn_hi) -{ - if (*pfn_lo < iova->pfn_lo) - iova->pfn_lo =3D *pfn_lo; - if (*pfn_hi > iova->pfn_hi) - *pfn_lo =3D iova->pfn_hi + 1; -} - /** * reserve_iova - reserves an iova in the given range * @iovad: - iova domain pointer @@ -510,41 +345,58 @@ __adjust_overlap_range(struct iova *iova, * @pfn_hi:- higher pfn address * This function allocates reserves the address range from pfn_lo to pfn_h= i so * that this address is not dished out as part of alloc_iova. + * + * If the requested range overlaps existing reservations, ranges are merge= d. + * If the requested range is fully covered by an existing reservation, the + * existing entry is returned without allocating. */ struct iova * reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo, unsigned long pfn_hi) { - struct rb_node *node; unsigned long flags; - struct iova *iova; - unsigned int overlap =3D 0; + struct iova *iova, *overlap; + unsigned long merged_lo =3D pfn_lo, merged_hi =3D pfn_hi; =20 /* Don't allow nonsensical pfns */ if (WARN_ON((pfn_hi | pfn_lo) > (ULLONG_MAX >> iova_shift(iovad)))) return NULL; =20 - spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); - for (node =3D rb_first(&iovad->rbroot); node; node =3D rb_next(node)) { - if (__is_range_overlap(node, pfn_lo, pfn_hi)) { - iova =3D to_iova(node); - __adjust_overlap_range(iova, &pfn_lo, &pfn_hi); - if ((pfn_lo >=3D iova->pfn_lo) && - (pfn_hi <=3D iova->pfn_hi)) - goto finish; - overlap =3D 1; - - } else if (overlap) - break; + spin_lock_irqsave(&iovad->iova_lock, flags); + { + MA_STATE(mas, &iovad->mtree, pfn_lo, pfn_hi); + + mas_for_each(&mas, overlap, pfn_hi) { + if (pfn_lo >=3D overlap->pfn_lo && + pfn_hi <=3D overlap->pfn_hi) { + spin_unlock_irqrestore(&iovad->iova_lock, + flags); + return overlap; + } + if (overlap->pfn_lo < merged_lo) + merged_lo =3D overlap->pfn_lo; + if (overlap->pfn_hi > merged_hi) + merged_hi =3D overlap->pfn_hi; + free_iova_mem(overlap); + } } =20 - /* We are here either because this is the first reserver node - * or need to insert remaining non overlap addr range - */ - iova =3D __insert_new_range(iovad, pfn_lo, pfn_hi); -finish: + iova =3D alloc_and_init_iova(merged_lo, merged_hi); + if (!iova) { + spin_unlock_irqrestore(&iovad->iova_lock, flags); + return NULL; + } + + { + MA_STATE(mas, &iovad->mtree, merged_lo, merged_hi); =20 - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + if (mas_store_gfp(&mas, iova, GFP_ATOMIC)) { + spin_unlock_irqrestore(&iovad->iova_lock, flags); + free_iova_mem(iova); + return NULL; + } + } + spin_unlock_irqrestore(&iovad->iova_lock, flags); return iova; } EXPORT_SYMBOL_GPL(reserve_iova); @@ -621,7 +473,7 @@ iova_magazine_free_pfns(struct iova_magazine *mag, stru= ct iova_domain *iovad) unsigned long flags; int i; =20 - spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + spin_lock_irqsave(&iovad->iova_lock, flags); =20 for (i =3D 0 ; i < mag->size; ++i) { struct iova *iova =3D private_find_iova(iovad, mag->pfns[i]); @@ -633,7 +485,7 @@ iova_magazine_free_pfns(struct iova_magazine *mag, stru= ct iova_domain *iovad) free_iova_mem(iova); } =20 - spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + spin_unlock_irqrestore(&iovad->iova_lock, flags); =20 mag->size =3D 0; } @@ -956,8 +808,8 @@ int iova_cache_get(void) =20 mutex_lock(&iova_cache_mutex); if (!iova_cache_users) { - iova_cache =3D kmem_cache_create("iommu_iova", sizeof(struct iova), 0, - SLAB_HWCACHE_ALIGN, NULL); + iova_cache =3D kmem_cache_create("iommu_iova", sizeof(struct iova), + 0, 0, NULL); if (!iova_cache) goto out_err; =20 diff --git a/include/linux/iova.h b/include/linux/iova.h index d2c4fd923efa..eb4f9ead5451 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -11,12 +11,11 @@ =20 #include #include -#include +#include #include =20 /* iova structure */ struct iova { - struct rb_node node; unsigned long pfn_hi; /* Highest allocated pfn */ unsigned long pfn_lo; /* Lowest allocated pfn */ }; @@ -26,15 +25,12 @@ struct iova_rcache; =20 /* holds all the iova translations for a domain */ struct iova_domain { - spinlock_t iova_rbtree_lock; /* Lock to protect update of rbtree */ - struct rb_root rbroot; /* iova domain rbtree root */ - struct rb_node *cached_node; /* Save last alloced node */ - struct rb_node *cached32_node; /* Save last 32-bit alloced node */ + spinlock_t iova_lock; /* Lock to protect update of maple tree */ + struct maple_tree mtree; unsigned long granule; /* pfn granularity for this domain */ unsigned long start_pfn; /* Lower limit for this domain */ unsigned long dma_32bit_pfn; unsigned long max32_alloc_size; /* Size of last failed allocation */ - struct iova anchor; /* rbtree lookup anchor */ =20 struct iova_rcache *rcaches; struct hlist_node cpuhp_dead; --=20 2.54.0 From nobody Mon Jun 8 06:36:14 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F04E3F7896; Wed, 3 Jun 2026 03:37:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457843; cv=none; b=U4sCn8lct/lt/NQobjv1S+JOuT6SWJ8MgYRghncsTTd7+9WNsKhanvmfFbyySt6zt/XTgc6wYtzEGfurLWpk9LPV9iEBN8vCHJEnSGXkG8ohxuSGNZEVPmyk/dUvNTi5C48qqT0RAyx09vIXtOz4ohlCNP0+INwCzPSfjrlKw+U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457843; c=relaxed/simple; bh=1I9QSRXOnE20EOYrfPreA57f8AU8LbqZeBwqNbM1omE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=b9apu/djMOE9rKlOg0CuyyH9IP12p5ptqyE/PURrCiqnXmeqGvLZYOBr7rn6uQKADhREqIJbESZxJ+Zv4Q0AwUVCU5X2t8hG1MJ/ShjNE7/MTKmO8q4RECBuGrFisGjq/wFZGQRB5UjypzfZuJd7YewsVwVsu5V1dvianxbcQuY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=Yl8T9Jog; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="Yl8T9Jog" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=B+5nPku2Js5sEHixJ+jko8QqcDMr4c3iPdyaq/eYcqg=; b=Yl8T9JogNyf1iyDYgRJ9QBlrN2 5tq9BQxGyW6B9xVnxyUsZzNoG/CnQsUu68Vg3F5gdO7zqU7XNiI8Bv9D8WDWFHQerMCrMbNXnQNFu f3jw+Ks/+LiqbZII9i52gi3g4jO1skpb1XK80i/AKQJx5pVsPP5axuXL56sP7v0dlcSInicjaUdEJ WPxkDayX8liti3rihbTjSRQ/JwtDc5WEPPCPF0OLIpxpZWeFpGPjYcS0LJibaXztwHrfJn6JNbw8F blzz/TaK305CcL4HodNHL4WYnKKG8+OV5qXnKXmtmacGIdSubFBQsfEOXgsCEuT4az1NGhbMlUDTt FHGKIpPg==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wUcPg-000000002VG-1OxO; Tue, 02 Jun 2026 23:37:04 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: kernel-team@meta.com, robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, jgg@ziepe.ca, kyle@mcmartin.ca, Rik van Riel , Rik van Riel Subject: [PATCH v3 2/3] iova: add KUnit test suite Date: Tue, 2 Jun 2026 23:35:47 -0400 Message-ID: <20260603033653.4144138-3-riel@surriel.com> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260603033653.4144138-1-riel@surriel.com> References: <20260603033653.4144138-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Rik van Riel Add a kunit suite for the maple-tree-based IOVA allocator, plus an iova_domain_verify_invariants() helper (compiled only when the test config is enabled) that walks the maple tree and confirms every entry's pfn_lo/pfn_hi match the maple tree index range and that no entries overlap. Test cases: - test_size_aligned: alignment of size_aligned allocs across orders 0..7. - test_top_down_preference: sequential allocs decrease in pfn_lo. - test_reserve_iova: allocs avoid the reserved range. - test_32bit_in_64bit_domain: 1000 64-bit allocs followed by a 32-bit alloc must still find a slot below DMA_BIT_MASK(32). - test_arbitrary_dma_limits: after filling 64-bit space, verify that bounded allocations at 33-bit and 56-bit limits still find slots within their respective ranges, confirming that mas_empty_area_rev generalizes to arbitrary limit_pfn values. - test_aligned_in_fragmented: pack size-2 size_aligned allocs, free every other to leave size-2 holes; a fresh size-2 aligned alloc must still succeed and return a 2-aligned pfn. - test_pci_32bit_workaround_pattern: alternate 32-bit-first allocation attempts with 64-bit fallback, mirroring dma-iommu.c. - test_stress_random: 2048 random alloc/free operations with mixed sizes, alignments, and DMA limits (32/33/56/64-bit), checking invariants after every operation. Uses a deterministic PRNG so failures reproduce across boots. - test_full_space_search_time: fill a 16K-pfn range completely, then verify that a failed alloc returns in bounded time (O(log n) via mas_empty_area_rev, not O(n)). - test_fragmented_32bit_search: pack the 32-bit IOVA space, then verify bounded search time for both 32-bit failure and 64-bit fallback success paths. Run with: tools/testing/kunit/kunit.py run --kunitconfig=3Ddrivers/iommu Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Rik van Riel --- drivers/iommu/.kunitconfig | 4 + drivers/iommu/Kconfig | 16 ++ drivers/iommu/Makefile | 1 + drivers/iommu/iova-kunit.c | 432 +++++++++++++++++++++++++++++++++++++ drivers/iommu/iova.c | 34 +++ include/linux/iova.h | 3 + 6 files changed, 490 insertions(+) create mode 100644 drivers/iommu/.kunitconfig create mode 100644 drivers/iommu/iova-kunit.c diff --git a/drivers/iommu/.kunitconfig b/drivers/iommu/.kunitconfig new file mode 100644 index 000000000000..56b481ecf993 --- /dev/null +++ b/drivers/iommu/.kunitconfig @@ -0,0 +1,4 @@ +CONFIG_KUNIT=3Dy +CONFIG_IOMMU_SUPPORT=3Dy +CONFIG_IOMMU_IOVA=3Dy +CONFIG_IOMMU_IOVA_KUNIT_TEST=3Dy diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index f86262b11416..f09046e238fd 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -3,6 +3,22 @@ config IOMMU_IOVA tristate =20 +config IOMMU_IOVA_KUNIT_TEST + tristate "KUnit tests for the IOVA allocator" if !KUNIT_ALL_TESTS + depends on IOMMU_IOVA && KUNIT + default KUNIT_ALL_TESTS + help + Enable kunit tests for the IOVA allocator. The tests exercise + basic allocation and free, size-aligned allocation, top-down + ordering, bounded allocations with various DMA limits (32-bit, + 33-bit, 56-bit), aligned allocations in fragmented domains, + and randomly-fragmented stress scenarios. + + Run with: + tools/testing/kunit/kunit.py run --kunitconfig=3Ddrivers/iommu + + If unsure, say N here. + # IOMMU_API always gets selected by whoever wants it. config IOMMU_API bool diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 0275821f4ef9..6bd7da1cbebd 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) +=3D io-pgtable-arm.o obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) +=3D io-pgtable-arm-selftes= ts.o obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) +=3D io-pgtable-dart.o obj-$(CONFIG_IOMMU_IOVA) +=3D iova.o +obj-$(CONFIG_IOMMU_IOVA_KUNIT_TEST) +=3D iova-kunit.o obj-$(CONFIG_OF_IOMMU) +=3D of_iommu.o obj-$(CONFIG_MSM_IOMMU) +=3D msm_iommu.o obj-$(CONFIG_IPMMU_VMSA) +=3D ipmmu-vmsa.o diff --git a/drivers/iommu/iova-kunit.c b/drivers/iommu/iova-kunit.c new file mode 100644 index 000000000000..fffeab8552cd --- /dev/null +++ b/drivers/iommu/iova-kunit.c @@ -0,0 +1,432 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * KUnit tests for the IOVA allocator. + * + * Exercises the maple-tree-based allocator: basic alloc/free, + * size-aligned allocations, top-down ordering, bounded allocations + * with various DMA limits (32-bit, 33-bit, 56-bit), aligned + * allocations in fragmented domains, and randomly fragmented stress. + * + * Each test verifies that the maple tree invariants remain consistent + * after every batch of operations. + */ +#include +#include +#include + +#define TEST_GRANULE PAGE_SIZE +/* Highest pfn that fits in 32 bits =E2=80=94 triggers the bounded alloc p= ath. */ +#define TEST_LIMIT_32BIT (DMA_BIT_MASK(32) >> PAGE_SHIFT) +/* 33-bit limit =E2=80=94 exercises non-power-of-two DMA boundaries. */ +#define TEST_LIMIT_33BIT (DMA_BIT_MASK(33) >> PAGE_SHIFT) +/* 56-bit limit =E2=80=94 typical server IOMMU address width. */ +#define TEST_LIMIT_56BIT (DMA_BIT_MASK(56) >> PAGE_SHIFT) +/* A 64-bit-ish limit well above dma_32bit_pfn. 1ULL avoids UB on ILP32. */ +#define TEST_LIMIT_64BIT ((1ULL << 36) >> PAGE_SHIFT) +/* + * A small <=3D32-bit limit used by tests that want to actually exhaust the + * restricted region within a tractable number of allocations. + */ +#define TEST_LIMIT_32BIT_RESTRICTED (TEST_LIMIT_32BIT / 2) + +struct iova_test_ctx { + struct iova_domain iovad; + bool initialized; +}; + +static int iova_test_init(struct kunit *test) +{ + struct iova_test_ctx *ctx; + int ret; + + ctx =3D kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + test->priv =3D ctx; + + ret =3D iova_cache_get(); + if (ret) + return ret; + + init_iova_domain(&ctx->iovad, TEST_GRANULE, 1); + ret =3D iova_domain_init_rcaches(&ctx->iovad); + if (ret) { + put_iova_domain(&ctx->iovad); + iova_cache_put(); + return ret; + } + ctx->initialized =3D true; + + KUNIT_ASSERT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + return 0; +} + +static void iova_test_exit(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + + if (ctx && ctx->initialized) { + put_iova_domain(&ctx->iovad); + ctx->initialized =3D false; + iova_cache_put(); + } +} + +static void test_size_aligned(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + int order; + + for (order =3D 0; order < 8; ++order) { + unsigned long size =3D 1UL << order; + struct iova *iova =3D alloc_iova(&ctx->iovad, size, + TEST_LIMIT_32BIT, true); + + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_EQ(test, iova->pfn_lo & (size - 1), 0); + KUNIT_EXPECT_EQ(test, iova->pfn_hi - iova->pfn_lo + 1, size); + __free_iova(&ctx->iovad, iova); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + } +} + +static void test_top_down_preference(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + struct iova *iovas[16]; + int i; + + for (i =3D 0; i < ARRAY_SIZE(iovas); ++i) { + iovas[i] =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iovas[i]); + if (i > 0) + KUNIT_EXPECT_LT(test, iovas[i]->pfn_lo, + iovas[i - 1]->pfn_lo); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + for (i =3D 0; i < ARRAY_SIZE(iovas); ++i) + __free_iova(&ctx->iovad, iovas[i]); +} + +static void test_reserve_iova(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + const unsigned long reserve_lo =3D TEST_LIMIT_32BIT / 2; + struct iova *r, *iova; + int i; + + /* Reserve the entire top half through the limit_pfn, inclusive. */ + r =3D reserve_iova(&ctx->iovad, reserve_lo, TEST_LIMIT_32BIT); + KUNIT_ASSERT_NOT_NULL(test, r); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* All allocs must land below the reserved range. */ + for (i =3D 0; i < 100; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LT(test, iova->pfn_hi, reserve_lo); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); +} + +/* + * The pci_32bit_workaround scenario: every PCI device's first IOVA + * allocation hits the 32-bit-restricted path before falling back to + * 64-bit. Fill the 64-bit space, then verify a 32-bit alloc still + * finds a slot below DMA_BIT_MASK(32). + */ +static void test_32bit_in_64bit_domain(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + struct iova *iova; + int i; + + for (i =3D 0; i < 1000; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_32BIT); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + __free_iova(&ctx->iovad, iova); +} + +/* + * Exercise non-power-of-two DMA limits: fill the 64-bit space, then + * verify that bounded allocations at 33-bit and 56-bit limits still + * find slots within their respective ranges. This confirms the + * navigate-to-limit_pfn search generalizes beyond the 32-bit case. + */ +static void test_arbitrary_dma_limits(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + struct iova *iova; + int i; + + for (i =3D 0; i < 1000; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* 33-bit bounded allocation */ + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_33BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_33BIT); + __free_iova(&ctx->iovad, iova); + + /* 56-bit bounded allocation */ + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_56BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_56BIT); + __free_iova(&ctx->iovad, iova); + + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); +} + +/* + * Aligned allocation in a fragmented domain: pack size-2 size_aligned + * allocations at the top, free every other one to leave size-2 holes, + * then verify a fresh size-2 aligned alloc still succeeds and returns + * a 2-aligned pfn. + */ +static void test_aligned_in_fragmented(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + const int N =3D 64; + struct iova **iovas; + struct iova *iova; + int i; + + iovas =3D kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL); + KUNIT_ASSERT_NOT_NULL(test, iovas); + + for (i =3D 0; i < N; ++i) { + iovas[i] =3D alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iovas[i]); + KUNIT_EXPECT_EQ(test, iovas[i]->pfn_lo & 1, 0); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + for (i =3D 0; i < N; i +=3D 2) { + __free_iova(&ctx->iovad, iovas[i]); + iovas[i] =3D NULL; + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + iova =3D alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_EQ(test, iova->pfn_lo & 1, 0); + __free_iova(&ctx->iovad, iova); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + for (i =3D 0; i < N; ++i) + if (iovas[i]) + __free_iova(&ctx->iovad, iovas[i]); +} + +/* + * Mimic dma-iommu's pci_32bit_workaround pattern: every alloc first + * tries a small restricted limit; if that fails, retry with the 64-bit + * limit. Verifies that the navigate-to-limit search survives rapid + * switching between different limit_pfn values. + */ +static void test_pci_32bit_workaround_pattern(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + int fallback_count =3D 0; + int i; + + for (i =3D 0; i < 500; ++i) { + unsigned long size =3D (i % 4) + 1; + struct iova *iova =3D alloc_iova(&ctx->iovad, size, + TEST_LIMIT_32BIT_RESTRICTED, + true); + + if (!iova) { + iova =3D alloc_iova(&ctx->iovad, size, + TEST_LIMIT_64BIT, true); + fallback_count++; + } + if (!iova) + break; + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + KUNIT_EXPECT_GT(test, i, 0); +} + +/* + * Random alloc/free over many iterations, verifying invariants after + * every operation. Uses a deterministic PRNG so failures reproduce + * across boots. Exercises mixed DMA limits (32, 33, 56, 64-bit). + */ +static void test_stress_random(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + const int N =3D 512; + const int iters =3D 4 * N; + const unsigned long limits[] =3D { + TEST_LIMIT_32BIT, TEST_LIMIT_33BIT, + TEST_LIMIT_56BIT, TEST_LIMIT_64BIT, + }; + struct iova **iovas; + u32 rng =3D 0xDEADBEEF; + int i; + + iovas =3D kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL); + KUNIT_ASSERT_NOT_NULL(test, iovas); + + for (i =3D 0; i < iters; ++i) { + int slot; + unsigned long limit; + const char *op; + + rng =3D rng * 1103515245 + 12345; + slot =3D (rng >> 8) % N; + rng =3D rng * 1103515245 + 12345; + limit =3D limits[(rng >> 8) % ARRAY_SIZE(limits)]; + + if (iovas[slot]) { + op =3D "free"; + __free_iova(&ctx->iovad, iovas[slot]); + iovas[slot] =3D NULL; + } else { + unsigned long size; + bool aligned; + + rng =3D rng * 1103515245 + 12345; + size =3D 1UL << ((rng >> 8) % 4); + rng =3D rng * 1103515245 + 12345; + aligned =3D (rng >> 8) & 1; + + op =3D "alloc"; + iovas[slot] =3D alloc_iova(&ctx->iovad, size, limit, + aligned); + } + if (!iova_domain_verify_invariants(&ctx->iovad)) { + kunit_info(test, "iter %d slot %d: invariant broken after %s\n", + i, slot, op); + KUNIT_FAIL(test, "verify failed"); + break; + } + } + + for (i =3D 0; i < N; ++i) + if (iovas[i]) + __free_iova(&ctx->iovad, iovas[i]); +} + +/* + * Verify that alloc_iova fails in bounded time when the IOVA space is + * fully packed. Fill a 16K-pfn range with size-1 allocations (leaving + * no gaps), then attempt a size-2 aligned alloc. The maple tree's + * mas_empty_area_rev must determine there is no suitable gap in + * O(log n) time rather than walking every entry. The 10ms threshold + * is generous =E2=80=94 real hardware watchdogs fire at ~10s. + */ +static void test_full_space_search_time(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + const unsigned long fill_limit =3D 16384; + const int fill_count =3D fill_limit; + struct iova *iova; + ktime_t start, elapsed; + int i, allocated =3D 0; + + for (i =3D 0; i < fill_count; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, fill_limit, false); + if (!iova) + break; + allocated++; + } + kunit_info(test, "allocated %d iovas in [1, %lu]\n", + allocated, fill_limit); + KUNIT_ASSERT_GT(test, allocated, 1000); + + start =3D ktime_get(); + iova =3D alloc_iova(&ctx->iovad, 2, fill_limit, true); + elapsed =3D ktime_sub(ktime_get(), start); + + KUNIT_EXPECT_NULL(test, iova); + kunit_info(test, "failed alloc took %lld ns\n", + ktime_to_ns(elapsed)); + KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 10000000LL); + + if (iova) + __free_iova(&ctx->iovad, iova); +} + +/* + * Verify bounded search time with a fragmented 32-bit IOVA space. + * Pack the 32-bit range with size-1 allocs, then attempt a large + * aligned alloc that must either succeed from a remaining gap or + * fail fast. The 64-bit fallback must always succeed promptly. + */ +static void test_fragmented_32bit_search(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D test->priv; + struct iova *iova; + ktime_t start, elapsed; + int i, allocated =3D 0; + + for (i =3D 0; i < 8000; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + if (!iova) + break; + allocated++; + } + kunit_info(test, "filled 32-bit space with %d allocs\n", allocated); + KUNIT_ASSERT_GT(test, allocated, 1000); + + start =3D ktime_get(); + iova =3D alloc_iova(&ctx->iovad, 32, TEST_LIMIT_32BIT, true); + elapsed =3D ktime_sub(ktime_get(), start); + + kunit_info(test, "32-bit alloc (size 32) took %lld ns, result=3D%px\n", + ktime_to_ns(elapsed), iova); + KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 10000000LL); + + if (iova) + __free_iova(&ctx->iovad, iova); + + start =3D ktime_get(); + iova =3D alloc_iova(&ctx->iovad, 32, TEST_LIMIT_64BIT, true); + elapsed =3D ktime_sub(ktime_get(), start); + + kunit_info(test, "64-bit fallback (size 32) took %lld ns\n", + ktime_to_ns(elapsed)); + KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 10000000LL); + + if (iova) + __free_iova(&ctx->iovad, iova); +} + +static struct kunit_case iova_test_cases[] =3D { + KUNIT_CASE(test_size_aligned), + KUNIT_CASE(test_top_down_preference), + KUNIT_CASE(test_reserve_iova), + KUNIT_CASE(test_32bit_in_64bit_domain), + KUNIT_CASE(test_arbitrary_dma_limits), + KUNIT_CASE(test_aligned_in_fragmented), + KUNIT_CASE(test_pci_32bit_workaround_pattern), + KUNIT_CASE(test_stress_random), + KUNIT_CASE(test_full_space_search_time), + KUNIT_CASE(test_fragmented_32bit_search), + {} +}; + +static struct kunit_suite iova_test_suite =3D { + .name =3D "iova", + .init =3D iova_test_init, + .exit =3D iova_test_exit, + .test_cases =3D iova_test_cases, +}; +kunit_test_suite(iova_test_suite); + +MODULE_DESCRIPTION("KUnit tests for the IOVA allocator"); +MODULE_LICENSE("GPL"); diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 523d1e8315f9..1ceab6cbefc2 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -857,6 +857,40 @@ void iova_cache_put(void) } EXPORT_SYMBOL_GPL(iova_cache_put); =20 +#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST) +bool iova_domain_verify_invariants(struct iova_domain *iovad) +{ + struct iova *iova, *prev =3D NULL; + unsigned long flags; + bool ok =3D true; + MA_STATE(mas, &iovad->mtree, 0, 0); + + spin_lock_irqsave(&iovad->iova_lock, flags); + mas_for_each(&mas, iova, ULONG_MAX) { + if (mas.index !=3D iova->pfn_lo || mas.last !=3D iova->pfn_hi) { + pr_err("iova_verify: maple index [%lu,%lu] !=3D iova [%lu,%lu]\n", + mas.index, mas.last, iova->pfn_lo, iova->pfn_hi); + ok =3D false; + } + if (iova->pfn_lo > iova->pfn_hi) { + pr_err("iova_verify: pfn_lo=3D%lu > pfn_hi=3D%lu\n", + iova->pfn_lo, iova->pfn_hi); + ok =3D false; + } + if (prev && prev->pfn_hi >=3D iova->pfn_lo) { + pr_err("iova_verify: overlap prev=3D[%lu,%lu] curr=3D[%lu,%lu]\n", + prev->pfn_lo, prev->pfn_hi, + iova->pfn_lo, iova->pfn_hi); + ok =3D false; + } + prev =3D iova; + } + spin_unlock_irqrestore(&iovad->iova_lock, flags); + return ok; +} +EXPORT_SYMBOL_GPL(iova_domain_verify_invariants); +#endif /* CONFIG_IOMMU_IOVA_KUNIT_TEST */ + MODULE_AUTHOR("Anil S Keshavamurthy "); MODULE_DESCRIPTION("IOMMU I/O Virtual Address management"); MODULE_LICENSE("GPL"); diff --git a/include/linux/iova.h b/include/linux/iova.h index eb4f9ead5451..6fc070a4f58e 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -98,6 +98,9 @@ void init_iova_domain(struct iova_domain *iovad, unsigned= long granule, int iova_domain_init_rcaches(struct iova_domain *iovad); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iova_domain(struct iova_domain *iovad); +#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST) +bool iova_domain_verify_invariants(struct iova_domain *iovad); +#endif #else static inline int iova_cache_get(void) { --=20 2.54.0 From nobody Mon Jun 8 06:36:14 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A6DEF3F7897; Wed, 3 Jun 2026 03:37:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457841; cv=none; b=KTW9w2KHzXsj0/WbKIZ75VELc27/yOyYhrU9MjIZi0p5PkAQVwclGNfonESb2sOkyhIt1XQli5FQ7cWz9pVXs13botS7MF8VcrgIRL7Y89l1Ivuybz/ee2ZVyM+pCO+ADe9rHgZ/CUhaMAkiG9vs4pBYmt7p8g0qgp+D8zHdN4U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780457841; c=relaxed/simple; bh=RTXNSl9tzV2JhUxrhyViDaR1SVISk4XO1r4OZXm0vRA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=J6lVn07dzHPjCPaCZ079StpqPek6MIVgXR+OP5/Pr596/5lx7T+hALC80/WuOP/o7l0/cWRWyslroE+ct2yNnC8ubmWqce+dtwrNBUJ5G3FrkAn7ot/Ole5h4JehT/VIxZfaaUv7xCjgK3ROiwtl1RcH8yYLQoVpdLBFRTq4ZjQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=QUdqRXsz; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="QUdqRXsz" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=UzbOAxtdeLtO5GN7ci5TBmbpzOG8jujIM+3oC4Zy8NI=; b=QUdqRXsz3eQ1XsyASt3Y7eAhB+ KWPL2A6jkzLbb3nNVSTynC9HGwrHPO2b0U/+675d4b+iDYYzYQqkH5cr7DaxqFN3MyE40ZV1Z0yJ4 15yvnASSVUV5kXBbOu+11ruyTHp/4N49v/GzajMlf4dpSF1CAPDhG9eEbWrOZJMYaFevcsGJDzadJ koFdMPop7hcUrblYuL7x46wiRZLLZYMzq65AlCaV8pQjs5Kw4oZEWa9kDanNv/A7OTtqBDxR7JMK7 OQYv9bk/egjYaZXA/S608b0pjZSGl2u8pkSFbSW9JLaL3DRZPsePIbo3A0w1yDHxRaRFOuq7RXLLX esnQz2CQ==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wUcPg-000000002VG-1WYZ; Tue, 02 Jun 2026 23:37:04 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: kernel-team@meta.com, robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, jgg@ziepe.ca, kyle@mcmartin.ca, Rik van Riel , Rik van Riel Subject: [PATCH v3 3/3] iova: defer maple tree erase on GFP_ATOMIC failure Date: Tue, 2 Jun 2026 23:35:48 -0400 Message-ID: <20260603033653.4144138-4-riel@surriel.com> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260603033653.4144138-1-riel@surriel.com> References: <20260603033653.4144138-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Rik van Riel The maple tree may need to allocate nodes during erase operations for tree rebalancing. Unlike the old rbtree where rb_erase() never allocated, mas_store_gfp(NULL, GFP_ATOMIC) can fail under memory pressure. Since the IOVA allocator runs in atomic context (DMA map/unmap can be called from hardirq, softirq, or with spinlocks held), GFP_KERNEL allocation is not possible. Add a deferred free mechanism: when mas_store_gfp(NULL, GFP_ATOMIC) fails, the iova entry remains in the maple tree (preventing address reuse and keeping the pointer valid) and is added to a lockless per-domain deferred free list. A delayed workqueue retries the erase with GFP_ATOMIC after a 10ms delay -- by the time the workqueue runs, transient memory pressure has typically subsided and the allocation succeeds. The deferred free path temporarily reduces available IOVA address space until the workqueue processes the backlog, but causes no corruption -- the entry stays in the tree and the struct iova is not freed until the erase succeeds. put_iova_domain() cancels the delayed work and discards the deferred list before destroying the tree. Since deferred entries remain in the maple tree, the mas_for_each teardown loop frees them along with all other entries, avoiding a double-free. In practice, GFP_ATOMIC erase failures are quite rare: the slab allocator maintains emergency reserves for GFP_ATOMIC, and the common erase case (exact_fit, slot_store) needs zero node allocations. This mechanism is a safety net for the exceptional case. Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Rik van Riel --- drivers/iommu/iova.c | 84 +++++++++++++++++++++++++++++++++++++++----- include/linux/iova.h | 3 ++ 2 files changed, 79 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 1ceab6cbefc2..ae89d780fce5 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -7,6 +7,7 @@ =20 #include #include +#include #include #include #include @@ -26,6 +27,7 @@ static unsigned long iova_rcache_get(struct iova_domain *= iovad, static void free_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *io= vad); static void free_global_cached_iovas(struct iova_domain *iovad); +static void iova_deferred_free_work(struct work_struct *work); =20 void init_iova_domain(struct iova_domain *iovad, unsigned long granule, @@ -46,6 +48,8 @@ init_iova_domain(struct iova_domain *iovad, unsigned long= granule, iovad->start_pfn =3D start_pfn; iovad->dma_32bit_pfn =3D 1UL << (32 - iova_shift(iovad)); iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; + init_llist_head(&iovad->deferred_frees); + INIT_DELAYED_WORK(&iovad->deferred_free_work, iova_deferred_free_work); } EXPORT_SYMBOL_GPL(init_iova_domain); =20 @@ -156,7 +160,13 @@ private_find_iova(struct iova_domain *iovad, unsigned = long pfn) return mas_walk(&mas); } =20 -static void remove_iova(struct iova_domain *iovad, struct iova *iova) +/* + * Remove an IOVA entry from the maple tree. Returns true on success. + * On failure (maple tree node allocation under GFP_ATOMIC failed), + * returns false =E2=80=94 the entry remains in the tree and the caller mu= st + * not free the struct iova. + */ +static bool remove_iova(struct iova_domain *iovad, struct iova *iova) { MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi); =20 @@ -165,7 +175,36 @@ static void remove_iova(struct iova_domain *iovad, str= uct iova *iova) if (iova->pfn_lo < iovad->dma_32bit_pfn) iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; =20 - mas_store_gfp(&mas, NULL, GFP_ATOMIC); + if (mas_store_gfp(&mas, NULL, GFP_ATOMIC)) + return false; + return true; +} + +static void iova_deferred_free_work(struct work_struct *work) +{ + struct delayed_work *dwork =3D to_delayed_work(work); + struct iova_domain *iovad =3D container_of(dwork, struct iova_domain, + deferred_free_work); + struct llist_node *list =3D llist_del_all(&iovad->deferred_frees); + struct llist_node *node, *next; + + llist_for_each_safe(node, next, list) { + struct iova *iova =3D container_of(node, struct iova, + deferred_free); + unsigned long flags; + + spin_lock_irqsave(&iovad->iova_lock, flags); + if (remove_iova(iovad, iova)) + free_iova_mem(iova); + else + llist_add(&iova->deferred_free, + &iovad->deferred_frees); + spin_unlock_irqrestore(&iovad->iova_lock, flags); + } + + if (!llist_empty(&iovad->deferred_frees)) + schedule_delayed_work(&iovad->deferred_free_work, + msecs_to_jiffies(10)); } =20 /** @@ -199,9 +238,15 @@ __free_iova(struct iova_domain *iovad, struct iova *io= va) unsigned long flags; =20 spin_lock_irqsave(&iovad->iova_lock, flags); - remove_iova(iovad, iova); + if (remove_iova(iovad, iova)) { + spin_unlock_irqrestore(&iovad->iova_lock, flags); + free_iova_mem(iova); + return; + } spin_unlock_irqrestore(&iovad->iova_lock, flags); - free_iova_mem(iova); + llist_add(&iova->deferred_free, &iovad->deferred_frees); + schedule_delayed_work(&iovad->deferred_free_work, + msecs_to_jiffies(10)); } EXPORT_SYMBOL_GPL(__free_iova); =20 @@ -224,9 +269,15 @@ free_iova(struct iova_domain *iovad, unsigned long pfn) spin_unlock_irqrestore(&iovad->iova_lock, flags); return; } - remove_iova(iovad, iova); + if (remove_iova(iovad, iova)) { + spin_unlock_irqrestore(&iovad->iova_lock, flags); + free_iova_mem(iova); + return; + } spin_unlock_irqrestore(&iovad->iova_lock, flags); - free_iova_mem(iova); + llist_add(&iova->deferred_free, &iovad->deferred_frees); + schedule_delayed_work(&iovad->deferred_free_work, + msecs_to_jiffies(10)); } EXPORT_SYMBOL_GPL(free_iova); =20 @@ -318,6 +369,15 @@ void put_iova_domain(struct iova_domain *iovad) if (iovad->rcaches) iova_domain_free_rcaches(iovad); =20 + cancel_delayed_work_sync(&iovad->deferred_free_work); + + /* + * Deferred entries are still in the maple tree, so the + * mas_for_each loop below frees them along with everything else. + * Just discard the deferred list without double-freeing. + */ + llist_del_all(&iovad->deferred_frees); + mas_for_each(&mas, iova, ULONG_MAX) free_iova_mem(iova); __mt_destroy(&iovad->mtree); @@ -481,12 +541,20 @@ iova_magazine_free_pfns(struct iova_magazine *mag, st= ruct iova_domain *iovad) if (WARN_ON(!iova)) continue; =20 - remove_iova(iovad, iova); - free_iova_mem(iova); + if (remove_iova(iovad, iova)) { + free_iova_mem(iova); + } else { + llist_add(&iova->deferred_free, + &iovad->deferred_frees); + } } =20 spin_unlock_irqrestore(&iovad->iova_lock, flags); =20 + if (!llist_empty(&iovad->deferred_frees)) + schedule_delayed_work(&iovad->deferred_free_work, + msecs_to_jiffies(10)); + mag->size =3D 0; } =20 diff --git a/include/linux/iova.h b/include/linux/iova.h index 6fc070a4f58e..cc1b5441a058 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -16,6 +16,7 @@ =20 /* iova structure */ struct iova { + struct llist_node deferred_free; unsigned long pfn_hi; /* Highest allocated pfn */ unsigned long pfn_lo; /* Lowest allocated pfn */ }; @@ -31,6 +32,8 @@ struct iova_domain { unsigned long start_pfn; /* Lower limit for this domain */ unsigned long dma_32bit_pfn; unsigned long max32_alloc_size; /* Size of last failed allocation */ + struct llist_head deferred_frees; + struct delayed_work deferred_free_work; =20 struct iova_rcache *rcaches; struct hlist_node cpuhp_dead; --=20 2.54.0