From nobody Fri Jun 12 19:47:12 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9583337BE6B; Wed, 13 May 2026 02:03:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; cv=none; b=fvhfYmzOh0t+YLOjbigbW9YWZXogDgyDBWuWWHwP7rx3yfSV7V4LMzcNkWwYIeglwOiQqKPe6BYYqsBG7dhJLGbc4UgxpeJc3VfG/e5pYt5xcfclPaH3fbd9kNzsDGhug3a0DNikGp+gv8pJm4fg3OtUyyV7Gcj/K4L8pcflYdw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; c=relaxed/simple; bh=/UNGNv99SYYyCkyQIROj0r03jmz125QkHkA/cto5zVY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BpF6LKZm6C3VILNNZta/vjSJYvQFC25PpORg8gwBgNqDa5MoIw7EnWOcCSlLcoSM7kTCFV2SvWda9LgsHltrxWDSUEaw8bRYGVjPO3eb41FsDaRY4+8q+krpwQkZqo9GNmBf2dPxTZRq6cyohHpObh/02egyiXKF+BR3F6AhbHY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=dvkt/qcg; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="dvkt/qcg" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=/bacrmtyorJAQIwe0qXADFoM1LPph2PmgFMHDS08SM0=; b=dvkt/qcg89NhJD5TjeRL0o3W6C QxUTd5HT5KdifuqmjA8fqbA04NoYfFZEW08s8zn7FwQ3Z+62VI1Usp9B4wnZQqhUdrOrKjSxpZhyy UZV7XmgH6xMUdKF6CbBXfY67ewbr3q8HzcAmghX92JGQh0iiBy5YghtPOS4SNTIwnXwxziwVjKW3E rJuNY7r1mIuIyqeqL9TjXzw13lAgK3Y4vqapsQcl33OHaHv+V6OZGimnmWKCE8ukcxJhknwB+IXMY lwT4Wv+oa1jl2JoAmk3+47OgRjnN735IHVuVXN19YOtFQPvFWse0zCXPe9eJB4yJqiRgjTAwFQG02 gNtjIhSQ==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wMywG-0000000022e-3ijl; Tue, 12 May 2026 22:03:08 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com, Rik van Riel , Rik van Riel Subject: [PATCH 1/5] iova: switch to augmented rbtree for log(n) allocation Date: Tue, 12 May 2026 22:00:18 -0400 Message-ID: <20260513020304.1528751-2-riel@surriel.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260513020304.1528751-1-riel@surriel.com> References: <20260513020304.1528751-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Rik van Riel The IOVA allocator's __alloc_and_insert_iova_range() walks the rbtree backwards via rb_prev() to find a free gap, which is O(n) on the number of allocated ranges. Under heavy fragmentation (e.g. iommu_iova caches holding hundreds of thousands of ranges) this can take 20+ seconds and trigger softlockups. Switch to an augmented rbtree that tracks two new fields per iova node: gap_to_prev - the free gap (in pfns) between this node and its in-order predecessor. __subtree_max_gap - max gap_to_prev over this node's subtree. The augmented invariant is maintained via RB_DECLARE_CALLBACKS_MAX from rbtree_augmented.h, plus explicit propagate() calls on insertion (where both the new node and its successor get a fresh gap_to_prev) and on deletion (where the successor's gap_to_prev grows). The new __iova_search_free_gap() walks right-first to prefer high addresses (preserving the existing top-down allocation behaviour) and prunes any subtree whose __subtree_max_gap is smaller than the requested size. For the typical IOMMU workload (uniform DMA mask per domain, so limit_pfn never binds against a candidate gap) and unaligned or small-alignment allocations, the search is O(log n). The alloc_iova_fast() size_aligned path can still degrade to O(n) when alignment is large relative to the available gaps; a follow-up patch in this series addresses that case. The cached_node / cached32_node fields and their update helpers are left in place but no longer consulted on the alloc path; a follow-up can strip them. The fast-fail max32_alloc_size optimization is preserved. Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Rik van Riel --- drivers/iommu/iova.c | 206 ++++++++++++++++++++++++++----------------- include/linux/iova.h | 2 + 2 files changed, 128 insertions(+), 80 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 021daf6528de..953188e296f0 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -13,6 +13,7 @@ #include #include #include +#include =20 /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR ~0UL @@ -34,6 +35,15 @@ static struct iova *to_iova(struct rb_node *node) return rb_entry(node, struct iova, node); } =20 +static inline unsigned long iova_gap_value(struct iova *iova) +{ + return iova->gap_to_prev; +} + +RB_DECLARE_CALLBACKS_MAX(static, iova_gap_callbacks, + struct iova, node, unsigned long, __subtree_max_gap, + iova_gap_value) + void init_iova_domain(struct iova_domain *iovad, unsigned long granule, unsigned long start_pfn) @@ -54,20 +64,13 @@ init_iova_domain(struct iova_domain *iovad, unsigned lo= ng granule, iovad->dma_32bit_pfn =3D 1UL << (32 - iova_shift(iovad)); iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; iovad->anchor.pfn_lo =3D iovad->anchor.pfn_hi =3D IOVA_ANCHOR; + iovad->anchor.gap_to_prev =3D IOVA_ANCHOR; + iovad->anchor.__subtree_max_gap =3D IOVA_ANCHOR; rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); rb_insert_color(&iovad->anchor.node, &iovad->rbroot); } EXPORT_SYMBOL_GPL(init_iova_domain); =20 -static struct rb_node * -__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn) -{ - if (limit_pfn <=3D iovad->dma_32bit_pfn) - return iovad->cached32_node; - - return iovad->cached_node; -} - static void __cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) { @@ -96,49 +99,13 @@ __cached_rbnode_delete_update(struct iova_domain *iovad= , struct iova *free) iovad->cached_node =3D rb_next(&free->node); } =20 -static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned= long limit_pfn) -{ - struct rb_node *node, *next; - /* - * Ideally what we'd like to judge here is whether limit_pfn is close - * enough to the highest-allocated IOVA that starting the allocation - * walk from the anchor node will be quicker than this initial work to - * find an exact starting point (especially if that ends up being the - * anchor node anyway). This is an incredibly crude approximation which - * only really helps the most likely case, but is at least trivially easy. - */ - if (limit_pfn > iovad->dma_32bit_pfn) - return &iovad->anchor.node; - - node =3D iovad->rbroot.rb_node; - while (to_iova(node)->pfn_hi < limit_pfn) - node =3D node->rb_right; - -search_left: - while (node->rb_left && to_iova(node->rb_left)->pfn_lo >=3D limit_pfn) - node =3D node->rb_left; - - if (!node->rb_left) - return node; - - next =3D node->rb_left; - while (next->rb_right) { - next =3D next->rb_right; - if (to_iova(next)->pfn_lo >=3D limit_pfn) { - node =3D next; - goto search_left; - } - } - - return node; -} - /* Insert the iova into domain rbtree by holding writer lock */ static void iova_insert_rbtree(struct rb_root *root, struct iova *iova, struct rb_node *start) { struct rb_node **new, *parent =3D NULL; + struct rb_node *prev_node, *next_node; =20 new =3D (start) ? &start : &(root->rb_node); /* Figure out where to put new node */ @@ -156,62 +123,104 @@ iova_insert_rbtree(struct rb_root *root, struct iova= *iova, return; } } - /* Add new node and rebalance tree. */ + rb_link_node(&iova->node, parent, new); - rb_insert_color(&iova->node, root); + + prev_node =3D rb_prev(&iova->node); + if (prev_node) + iova->gap_to_prev =3D iova->pfn_lo - to_iova(prev_node)->pfn_hi - 1; + else + iova->gap_to_prev =3D iova->pfn_lo; + iova->__subtree_max_gap =3D iova->gap_to_prev; + + next_node =3D rb_next(&iova->node); + if (next_node) + to_iova(next_node)->gap_to_prev =3D + to_iova(next_node)->pfn_lo - iova->pfn_hi - 1; + + if (parent) + iova_gap_callbacks.propagate(parent, NULL); + if (next_node) + iova_gap_callbacks.propagate(next_node, NULL); + + rb_insert_augmented(&iova->node, root, &iova_gap_callbacks); +} + +/* + * Search the augmented rbtree for the highest-addressed free gap of at le= ast + * @size pages, with the allocation fitting below @limit_pfn and at or abo= ve + * @start_pfn. Returns the node whose gap_to_prev is used, or NULL. + */ +static struct rb_node * +__iova_search_free_gap(struct rb_node *node, unsigned long size, + unsigned long limit_pfn, unsigned long start_pfn, + unsigned long align_mask, unsigned long *new_pfn) +{ + struct iova *iova; + struct rb_node *result; + + if (!node) + return NULL; + + iova =3D to_iova(node); + if (iova->__subtree_max_gap < size) + return NULL; + + result =3D __iova_search_free_gap(node->rb_right, size, limit_pfn, + start_pfn, align_mask, new_pfn); + if (result) + return result; + + if (iova->gap_to_prev >=3D size) { + unsigned long gap_lo =3D iova->pfn_lo - iova->gap_to_prev; + unsigned long gap_hi =3D iova->pfn_lo - 1; + unsigned long pfn; + + if (gap_hi >=3D limit_pfn) + gap_hi =3D limit_pfn - 1; + if (gap_hi >=3D gap_lo && gap_hi - gap_lo + 1 >=3D size) { + pfn =3D (gap_hi - size + 1) & align_mask; + if (pfn >=3D gap_lo && pfn >=3D start_pfn) { + *new_pfn =3D pfn; + return node; + } + } + } + + return __iova_search_free_gap(node->rb_left, size, limit_pfn, + start_pfn, align_mask, new_pfn); } =20 static int __alloc_and_insert_iova_range(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn, struct iova *new, bool size_aligned) { - struct rb_node *curr, *prev; - struct iova *curr_iova; unsigned long flags; - unsigned long new_pfn, retry_pfn; + unsigned long new_pfn; unsigned long align_mask =3D ~0UL; - unsigned long high_pfn =3D limit_pfn, low_pfn =3D iovad->start_pfn; + struct rb_node *gap_node; =20 if (size_aligned) align_mask <<=3D fls_long(size - 1); =20 - /* Walk the tree backwards */ spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); if (limit_pfn <=3D iovad->dma_32bit_pfn && size >=3D iovad->max32_alloc_size) goto iova32_full; =20 - curr =3D __get_cached_rbnode(iovad, limit_pfn); - curr_iova =3D to_iova(curr); - retry_pfn =3D curr_iova->pfn_hi; - -retry: - do { - high_pfn =3D min(high_pfn, curr_iova->pfn_lo); - new_pfn =3D (high_pfn - size) & align_mask; - prev =3D curr; - curr =3D rb_prev(curr); - curr_iova =3D to_iova(curr); - } while (curr && new_pfn <=3D curr_iova->pfn_hi && new_pfn >=3D low_pfn); - - if (high_pfn < size || new_pfn < low_pfn) { - if (low_pfn =3D=3D iovad->start_pfn && retry_pfn < limit_pfn) { - high_pfn =3D limit_pfn; - low_pfn =3D retry_pfn + 1; - curr =3D iova_find_limit(iovad, limit_pfn); - curr_iova =3D to_iova(curr); - goto retry; - } - iovad->max32_alloc_size =3D size; + gap_node =3D __iova_search_free_gap(iovad->rbroot.rb_node, size, + limit_pfn, iovad->start_pfn, + align_mask, &new_pfn); + if (!gap_node) { + if (limit_pfn <=3D iovad->dma_32bit_pfn) + iovad->max32_alloc_size =3D size; goto iova32_full; } =20 - /* pfn_lo will point to size aligned address if size_aligned is set */ new->pfn_lo =3D new_pfn; - new->pfn_hi =3D new->pfn_lo + size - 1; + new->pfn_hi =3D new_pfn + size - 1; =20 - /* If we have 'prev', it's a valid place to start the insertion. */ - iova_insert_rbtree(&iovad->rbroot, new, prev); + iova_insert_rbtree(&iovad->rbroot, new, gap_node); __cached_rbnode_insert_update(iovad, new); =20 spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); @@ -295,9 +304,34 @@ private_find_iova(struct iova_domain *iovad, unsigned = long pfn) =20 static void remove_iova(struct iova_domain *iovad, struct iova *iova) { + struct rb_node *next_node; + struct iova *next_iova =3D NULL; + assert_spin_locked(&iovad->iova_rbtree_lock); __cached_rbnode_delete_update(iovad, iova); - rb_erase(&iova->node, &iovad->rbroot); + + next_node =3D rb_next(&iova->node); + if (next_node) { + struct rb_node *prev_node =3D rb_prev(&iova->node); + + next_iova =3D to_iova(next_node); + if (prev_node) + next_iova->gap_to_prev =3D + next_iova->pfn_lo - to_iova(prev_node)->pfn_hi - 1; + else + next_iova->gap_to_prev =3D next_iova->pfn_lo; + /* + * Propagate next_iova's new augmented values to the root BEFORE + * the erase. Otherwise rotations inside rb_erase_augmented may + * copy a stale __subtree_max_gap from next_iova to other nodes, + * leaving ancestors in an inconsistent state that the post-erase + * propagate cannot fully repair (early-termination at matching + * intermediate values). + */ + iova_gap_callbacks.propagate(&next_iova->node, NULL); + } + + rb_erase_augmented(&iova->node, &iovad->rbroot, &iova_gap_callbacks); } =20 /** @@ -527,8 +561,20 @@ reserve_iova(struct iova_domain *iovad, spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); for (node =3D rb_first(&iovad->rbroot); node; node =3D rb_next(node)) { if (__is_range_overlap(node, pfn_lo, pfn_hi)) { + struct rb_node *p; + unsigned long gap; + iova =3D to_iova(node); __adjust_overlap_range(iova, &pfn_lo, &pfn_hi); + p =3D rb_prev(node); + if (p) + gap =3D iova->pfn_lo - to_iova(p)->pfn_hi - 1; + else + gap =3D iova->pfn_lo; + if (iova->gap_to_prev !=3D gap) { + iova->gap_to_prev =3D gap; + iova_gap_callbacks.propagate(node, NULL); + } if ((pfn_lo >=3D iova->pfn_lo) && (pfn_hi <=3D iova->pfn_hi)) goto finish; diff --git a/include/linux/iova.h b/include/linux/iova.h index d2c4fd923efa..52635a72c5c5 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -19,6 +19,8 @@ struct iova { struct rb_node node; unsigned long pfn_hi; /* Highest allocated pfn */ unsigned long pfn_lo; /* Lowest allocated pfn */ + unsigned long gap_to_prev; /* Gap (in pfns) to predecessor */ + unsigned long __subtree_max_gap; /* Max gap_to_prev in subtree */ }; =20 =20 --=20 2.52.0 From nobody Fri Jun 12 19:47:12 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 95943383325; Wed, 13 May 2026 02:03:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; cv=none; b=RtPsyfxKlTj+lnMO/C3o1QXiH+MYSqdmXukvALDvKFrIG6mFMBdOBwZOM4/WZvazxwB9WEbI0Mk4MPp6OGGUK2+JIMCDpcaH3KhYZc2y1SuznpN7+LBHC4Nw/55OCg7b3EOBa14hfKJSKiaKUTMkuYqytngzv3Mq6vW9vS/hyXY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; c=relaxed/simple; bh=AwA6ZIGb32+jwjecyQAjZJvrow/SEkhb22WWBbSMOtE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=SpMgq6shqRCLm3MthqpfRpEOr9FmTPTZausVMIa/ZndJIy6kSaP6ie2tZvFZWIn71LDLsVfxCkuF9hFiL6X0iISNZnhW3n1U7GHizFP5tJj7dHDNRNgGvLXv2CBYGWFez86bB0qZwOd/vArXggnK4UnZRVSyOLZddnNvpS4zInw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=Uu4lwIZJ; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="Uu4lwIZJ" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=N+F425vwfV1dlvrSyZE575xpot+t9UUXSJNgUZ2Q2kE=; b=Uu4lwIZJzWeADDEZ7NnkN9vxVg R1y05zyn2aYErvuS/02eoikUA6FWinl41DRsU/KUJ3Z6wySv3t/5ppfHJ9fQuegX1UbkdOgpJlcTE jLvdhdZ4R6PEvYdSRH0F3kn8EeZ9i9w4cMDH+oDg2Aksj+vf/usVqfAg6bNzdapzmo4GPXKxt4Hfx CsBCyZWFHu9afqQTZTpad+CnV7qttCOM+oiUx5cgL5wVB4dlV3EZ7XJpXZVnm41h8z1AL7ESZ6Tnw Dhtz+a4mXZkkNLDjyySqogO9hQwxwKQl80H9R6EDNrV0YwljJw5s09zwktemUlQCTyhdJGadRZmLh F70Q4Mxg==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wMywG-0000000022e-3oGc; Tue, 12 May 2026 22:03:08 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com, Rik van Riel , Rik van Riel Subject: [PATCH 2/5] iova: drop dead cached_node / cached32_node infrastructure Date: Tue, 12 May 2026 22:00:19 -0400 Message-ID: <20260513020304.1528751-3-riel@surriel.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260513020304.1528751-1-riel@surriel.com> References: <20260513020304.1528751-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Rik van Riel After the augmented-rbtree port, cached_node and cached32_node are still maintained on every insert and delete but are never consulted on the allocation path. Drop the fields and their helpers. The one piece of useful work in __cached_rbnode_delete_update() was resetting iovad->max32_alloc_size when an iova in the 32-bit range was freed (so the next 32-bit alloc can retry). That logic is preserved by moving it inline into remove_iova(). No external consumers reference the cached_node fields. Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Rik van Riel --- drivers/iommu/iova.c | 35 +++-------------------------------- include/linux/iova.h | 2 -- 2 files changed, 3 insertions(+), 34 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 953188e296f0..c358ce981cae 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -57,8 +57,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long= granule, =20 spin_lock_init(&iovad->iova_rbtree_lock); iovad->rbroot =3D RB_ROOT; - iovad->cached_node =3D &iovad->anchor.node; - iovad->cached32_node =3D &iovad->anchor.node; iovad->granule =3D granule; iovad->start_pfn =3D start_pfn; iovad->dma_32bit_pfn =3D 1UL << (32 - iova_shift(iovad)); @@ -71,34 +69,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned lon= g granule, } EXPORT_SYMBOL_GPL(init_iova_domain); =20 -static void -__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) -{ - if (new->pfn_hi < iovad->dma_32bit_pfn) - iovad->cached32_node =3D &new->node; - else - iovad->cached_node =3D &new->node; -} - -static void -__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) -{ - struct iova *cached_iova; - - cached_iova =3D to_iova(iovad->cached32_node); - if (free =3D=3D cached_iova || - (free->pfn_hi < iovad->dma_32bit_pfn && - free->pfn_lo >=3D cached_iova->pfn_lo)) - iovad->cached32_node =3D rb_next(&free->node); - - if (free->pfn_lo < iovad->dma_32bit_pfn) - iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; - - cached_iova =3D to_iova(iovad->cached_node); - if (free->pfn_lo >=3D cached_iova->pfn_lo) - iovad->cached_node =3D rb_next(&free->node); -} - /* Insert the iova into domain rbtree by holding writer lock */ static void iova_insert_rbtree(struct rb_root *root, struct iova *iova, @@ -221,7 +191,6 @@ static int __alloc_and_insert_iova_range(struct iova_do= main *iovad, new->pfn_hi =3D new_pfn + size - 1; =20 iova_insert_rbtree(&iovad->rbroot, new, gap_node); - __cached_rbnode_insert_update(iovad, new); =20 spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); return 0; @@ -308,7 +277,9 @@ static void remove_iova(struct iova_domain *iovad, stru= ct iova *iova) struct iova *next_iova =3D NULL; =20 assert_spin_locked(&iovad->iova_rbtree_lock); - __cached_rbnode_delete_update(iovad, iova); + + if (iova->pfn_lo < iovad->dma_32bit_pfn) + iovad->max32_alloc_size =3D iovad->dma_32bit_pfn; =20 next_node =3D rb_next(&iova->node); if (next_node) { diff --git a/include/linux/iova.h b/include/linux/iova.h index 52635a72c5c5..3c4cc81e5182 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -30,8 +30,6 @@ struct iova_rcache; struct iova_domain { spinlock_t iova_rbtree_lock; /* Lock to protect update of rbtree */ struct rb_root rbroot; /* iova domain rbtree root */ - struct rb_node *cached_node; /* Save last alloced node */ - struct rb_node *cached32_node; /* Save last 32-bit alloced node */ unsigned long granule; /* pfn granularity for this domain */ unsigned long start_pfn; /* Lower limit for this domain */ unsigned long dma_32bit_pfn; --=20 2.52.0 From nobody Fri Jun 12 19:47:12 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 959C5383327; Wed, 13 May 2026 02:03:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; cv=none; b=tn8b8Wt+lugFwMsaSYFVr471b4Kjdj079Kf+jJTcrCRJycdjxV8SzNpAm3UXvwSxgSbMZT8EuEZLwQDpcjfPnP4orz+1MFdcUlzGzIJjw9RK2QF0DxTGqlpGeixwS1Uk8kEvG84T7Yi1B0QF7M3RnhElJfl+E3+t9FvrDleio24= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; c=relaxed/simple; bh=yr0uyDeU+CPABX1lp0VGou8MsU3xmqsVX+jMkSUmCj8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=S8LmQ/4zW014DNM4Ryw3fYqDBgX6xtW3J5eW4GcrQOnbf8eCDHnJNRk3a4z05A1hFMa5eiSHz5uN0vTUxozf90s+KWbv2PYrocVWgkv3VkVBQ2snpvFMFh1S1vFh2/MIYna3ZdCdO+2bAfr6LOgxEvdr8Pqc0pwIWipzukKYU2Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=XaX7p0oP; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="XaX7p0oP" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=5+2od+2GeT0w7P1FoWyNLqYLXCzvpgp9ZU4fCrRUjcI=; b=XaX7p0oPJ5CdE38vlg2JjNVTyg Q/BvyeTmwNqI54bjwE6MCoVzb5u/bomde/loR/BVb8sjRAEsbZbh01onjcmeT7FaReEw+md/myKC5 XJosQt8qqSWWD2c7d8eDqzRCS2QUarOYw/8D8wx93tCDAQeKX5YF+MHTQARHfwwByJ/KxbfrrWxUh nJhPTCBK4jHKcFk+sTQmNrZBATnKGaRA6CQbBWFtwuXY6fVoWwlVO6DRXrOtkrOPv61PbBpskr8Yw MxGvnwberzADYRoqDVJN9YmSTicn2skTpZAeq+FipwQQ3IUKLoBNFloQkntEJyYLoIHHM8/QgmdkS 5VQxV4IA==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wMywG-0000000022e-3tSv; Tue, 12 May 2026 22:03:08 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com, Rik van Riel , Rik van Riel Subject: [PATCH 3/5] iova: limit_pfn-aware augmentation for log(n) 32-bit alloc Date: Tue, 12 May 2026 22:00:20 -0400 Message-ID: <20260513020304.1528751-4-riel@surriel.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260513020304.1528751-1-riel@surriel.com> References: <20260513020304.1528751-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Rik van Riel The augmented-rbtree port made __iova_search_free_gap O(log n) when limit_pfn doesn't bind any candidate gap, but degrades to O(n) when limit_pfn is small relative to the gaps in the tree. The classic case is a 32-bit DMA allocation on a domain dominated by 64-bit allocations: the augmented invariant __subtree_max_gap is satisfied everywhere (the huge gap between the highest 64-bit allocation and the IOVA_ANCHOR dominates), so pruning never fires, and every node above limit_pfn must be visited and rejected before falling through to the 32-bit region. Add a second augmented field that bounds the search by 32-bit-clamped gap size: clamped_gap32(node) =3D 0 if gap_to_prev =3D=3D 0 0 if gap_lo >=3D dma_32bit_pfn min(gap_hi, dma_32bit_pfn-1) - gap_lo + 1 otherwise __subtree_max_gap32 =3D max(clamped_gap32) over subtree clamped_gap32 is precomputed and stored on each iova whenever gap_to_prev changes (insert, remove, reserve overlap fix-up), so the augmented callbacks remain pure functions of the node and don't need domain context. The compute helper iova_compute_max() updates both __subtree_max_gap and __subtree_max_gap32 in one pass; the hand-rolled propagate / copy / rotate callbacks (replacing the single-field RB_DECLARE_CALLBACKS_MAX expansion) maintain both fields per insert/erase. Pruning in __iova_search_free_gap selects between them based on whether the caller is doing a 32-bit allocation. For 64-bit allocations the behaviour is unchanged. For 32-bit allocations on mixed-DMA-mask domains the search is now O(log n). Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Rik van Riel --- drivers/iommu/iova.c | 153 ++++++++++++++++++++++++++++++++++++------- include/linux/iova.h | 6 +- 2 files changed, 133 insertions(+), 26 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index c358ce981cae..7787da8b2dad 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -35,14 +35,97 @@ static struct iova *to_iova(struct rb_node *node) return rb_entry(node, struct iova, node); } =20 -static inline unsigned long iova_gap_value(struct iova *iova) +/* + * Portion of @iova->gap_to_prev that lies strictly below @dma_32bit_pfn, + * i.e. the largest contiguous sub-gap that a 32-bit-restricted allocation + * could possibly use. Maintained alongside gap_to_prev so the augmented + * callbacks can compare against it without needing per-iova access to the + * domain's dma_32bit_pfn. + */ +static unsigned long +iova_compute_clamped_gap32(struct iova *iova, unsigned long dma_32bit_pfn) +{ + unsigned long gap_lo, gap_hi; + + if (iova->gap_to_prev =3D=3D 0) + return 0; + gap_lo =3D iova->pfn_lo - iova->gap_to_prev; + if (gap_lo >=3D dma_32bit_pfn) + return 0; + gap_hi =3D iova->pfn_lo - 1; + if (gap_hi >=3D dma_32bit_pfn) + gap_hi =3D dma_32bit_pfn - 1; + return gap_hi - gap_lo + 1; +} + +/* + * Recompute @node's __subtree_max_gap and __subtree_max_gap32 from its + * own gap fields and its children's subtree maxes. Returns true (for + * propagate's early-termination) if neither value would change. + */ +static bool iova_compute_max(struct iova *node, bool exit) { - return iova->gap_to_prev; + unsigned long max_gap =3D node->gap_to_prev; + unsigned long max_gap32 =3D node->clamped_gap32; + struct iova *child; + + if (node->node.rb_left) { + child =3D to_iova(node->node.rb_left); + if (child->__subtree_max_gap > max_gap) + max_gap =3D child->__subtree_max_gap; + if (child->__subtree_max_gap32 > max_gap32) + max_gap32 =3D child->__subtree_max_gap32; + } + if (node->node.rb_right) { + child =3D to_iova(node->node.rb_right); + if (child->__subtree_max_gap > max_gap) + max_gap =3D child->__subtree_max_gap; + if (child->__subtree_max_gap32 > max_gap32) + max_gap32 =3D child->__subtree_max_gap32; + } + if (exit && node->__subtree_max_gap =3D=3D max_gap && + node->__subtree_max_gap32 =3D=3D max_gap32) + return true; + node->__subtree_max_gap =3D max_gap; + node->__subtree_max_gap32 =3D max_gap32; + return false; } =20 -RB_DECLARE_CALLBACKS_MAX(static, iova_gap_callbacks, - struct iova, node, unsigned long, __subtree_max_gap, - iova_gap_value) +static void iova_gap_propagate(struct rb_node *rb, struct rb_node *stop) +{ + while (rb !=3D stop) { + struct iova *node =3D to_iova(rb); + + if (iova_compute_max(node, true)) + break; + rb =3D rb_parent(&node->node); + } +} + +static void iova_gap_copy(struct rb_node *rb_old, struct rb_node *rb_new) +{ + struct iova *old =3D to_iova(rb_old); + struct iova *new =3D to_iova(rb_new); + + new->__subtree_max_gap =3D old->__subtree_max_gap; + new->__subtree_max_gap32 =3D old->__subtree_max_gap32; +} + +static void iova_gap_rotate(struct rb_node *rb_old, struct rb_node *rb_new) +{ + struct iova *old =3D to_iova(rb_old); + struct iova *new =3D to_iova(rb_new); + + new->__subtree_max_gap =3D old->__subtree_max_gap; + new->__subtree_max_gap32 =3D old->__subtree_max_gap32; + iova_compute_max(old, false); +} + +static const struct rb_augment_callbacks iova_gap_callbacks =3D { + .propagate =3D iova_gap_propagate, + .copy =3D iova_gap_copy, + .rotate =3D iova_gap_rotate, +}; =20 void init_iova_domain(struct iova_domain *iovad, unsigned long granule, @@ -64,6 +147,9 @@ init_iova_domain(struct iova_domain *iovad, unsigned lon= g granule, iovad->anchor.pfn_lo =3D iovad->anchor.pfn_hi =3D IOVA_ANCHOR; iovad->anchor.gap_to_prev =3D IOVA_ANCHOR; iovad->anchor.__subtree_max_gap =3D IOVA_ANCHOR; + iovad->anchor.clamped_gap32 =3D iova_compute_clamped_gap32(&iovad->anchor, + iovad->dma_32bit_pfn); + iovad->anchor.__subtree_max_gap32 =3D iovad->anchor.clamped_gap32; rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); rb_insert_color(&iovad->anchor.node, &iovad->rbroot); } @@ -71,9 +157,10 @@ EXPORT_SYMBOL_GPL(init_iova_domain); =20 /* Insert the iova into domain rbtree by holding writer lock */ static void -iova_insert_rbtree(struct rb_root *root, struct iova *iova, +iova_insert_rbtree(struct iova_domain *iovad, struct iova *iova, struct rb_node *start) { + struct rb_root *root =3D &iovad->rbroot; struct rb_node **new, *parent =3D NULL; struct rb_node *prev_node, *next_node; =20 @@ -101,12 +188,18 @@ iova_insert_rbtree(struct rb_root *root, struct iova = *iova, iova->gap_to_prev =3D iova->pfn_lo - to_iova(prev_node)->pfn_hi - 1; else iova->gap_to_prev =3D iova->pfn_lo; + iova->clamped_gap32 =3D iova_compute_clamped_gap32(iova, iovad->dma_32bit= _pfn); iova->__subtree_max_gap =3D iova->gap_to_prev; + iova->__subtree_max_gap32 =3D iova->clamped_gap32; =20 next_node =3D rb_next(&iova->node); - if (next_node) - to_iova(next_node)->gap_to_prev =3D - to_iova(next_node)->pfn_lo - iova->pfn_hi - 1; + if (next_node) { + struct iova *next_iova =3D to_iova(next_node); + + next_iova->gap_to_prev =3D next_iova->pfn_lo - iova->pfn_hi - 1; + next_iova->clamped_gap32 =3D iova_compute_clamped_gap32(next_iova, + iovad->dma_32bit_pfn); + } =20 if (parent) iova_gap_callbacks.propagate(parent, NULL); @@ -119,25 +212,32 @@ iova_insert_rbtree(struct rb_root *root, struct iova = *iova, /* * Search the augmented rbtree for the highest-addressed free gap of at le= ast * @size pages, with the allocation fitting below @limit_pfn and at or abo= ve - * @start_pfn. Returns the node whose gap_to_prev is used, or NULL. + * @start_pfn. When @is_32bit, prune by the 32-bit-clamped subtree max so + * that 32-bit-restricted allocations on a domain dominated by high-pfn + * allocations stay O(log n) instead of degrading to O(n). Returns the node + * whose gap_to_prev is used, or NULL. */ static struct rb_node * __iova_search_free_gap(struct rb_node *node, unsigned long size, unsigned long limit_pfn, unsigned long start_pfn, - unsigned long align_mask, unsigned long *new_pfn) + unsigned long align_mask, bool is_32bit, + unsigned long *new_pfn) { struct iova *iova; struct rb_node *result; + unsigned long subtree_max; =20 if (!node) return NULL; =20 iova =3D to_iova(node); - if (iova->__subtree_max_gap < size) + subtree_max =3D is_32bit ? iova->__subtree_max_gap32 + : iova->__subtree_max_gap; + if (subtree_max < size) return NULL; =20 result =3D __iova_search_free_gap(node->rb_right, size, limit_pfn, - start_pfn, align_mask, new_pfn); + start_pfn, align_mask, is_32bit, new_pfn); if (result) return result; =20 @@ -158,7 +258,7 @@ __iova_search_free_gap(struct rb_node *node, unsigned l= ong size, } =20 return __iova_search_free_gap(node->rb_left, size, limit_pfn, - start_pfn, align_mask, new_pfn); + start_pfn, align_mask, is_32bit, new_pfn); } =20 static int __alloc_and_insert_iova_range(struct iova_domain *iovad, @@ -169,20 +269,21 @@ static int __alloc_and_insert_iova_range(struct iova_= domain *iovad, unsigned long new_pfn; unsigned long align_mask =3D ~0UL; struct rb_node *gap_node; + bool is_32bit; =20 if (size_aligned) align_mask <<=3D fls_long(size - 1); =20 spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); - if (limit_pfn <=3D iovad->dma_32bit_pfn && - size >=3D iovad->max32_alloc_size) + is_32bit =3D limit_pfn <=3D iovad->dma_32bit_pfn; + if (is_32bit && size >=3D iovad->max32_alloc_size) goto iova32_full; =20 gap_node =3D __iova_search_free_gap(iovad->rbroot.rb_node, size, limit_pfn, iovad->start_pfn, - align_mask, &new_pfn); + align_mask, is_32bit, &new_pfn); if (!gap_node) { - if (limit_pfn <=3D iovad->dma_32bit_pfn) + if (is_32bit) iovad->max32_alloc_size =3D size; goto iova32_full; } @@ -190,7 +291,7 @@ static int __alloc_and_insert_iova_range(struct iova_do= main *iovad, new->pfn_lo =3D new_pfn; new->pfn_hi =3D new_pfn + size - 1; =20 - iova_insert_rbtree(&iovad->rbroot, new, gap_node); + iova_insert_rbtree(iovad, new, gap_node); =20 spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); return 0; @@ -291,13 +392,15 @@ static void remove_iova(struct iova_domain *iovad, st= ruct iova *iova) next_iova->pfn_lo - to_iova(prev_node)->pfn_hi - 1; else next_iova->gap_to_prev =3D next_iova->pfn_lo; + next_iova->clamped_gap32 =3D iova_compute_clamped_gap32(next_iova, + iovad->dma_32bit_pfn); /* * Propagate next_iova's new augmented values to the root BEFORE * the erase. Otherwise rotations inside rb_erase_augmented may - * copy a stale __subtree_max_gap from next_iova to other nodes, - * leaving ancestors in an inconsistent state that the post-erase - * propagate cannot fully repair (early-termination at matching - * intermediate values). + * copy a stale __subtree_max_gap or __subtree_max_gap32 from + * next_iova to other nodes, leaving ancestors in an inconsistent + * state that the post-erase propagate cannot fully repair + * (early-termination at matching intermediate values). */ iova_gap_callbacks.propagate(&next_iova->node, NULL); } @@ -493,7 +596,7 @@ __insert_new_range(struct iova_domain *iovad, =20 iova =3D alloc_and_init_iova(pfn_lo, pfn_hi); if (iova) - iova_insert_rbtree(&iovad->rbroot, iova, NULL); + iova_insert_rbtree(iovad, iova, NULL); =20 return iova; } @@ -544,6 +647,8 @@ reserve_iova(struct iova_domain *iovad, gap =3D iova->pfn_lo; if (iova->gap_to_prev !=3D gap) { iova->gap_to_prev =3D gap; + iova->clamped_gap32 =3D iova_compute_clamped_gap32(iova, + iovad->dma_32bit_pfn); iova_gap_callbacks.propagate(node, NULL); } if ((pfn_lo >=3D iova->pfn_lo) && diff --git a/include/linux/iova.h b/include/linux/iova.h index 3c4cc81e5182..d262c6d88d6c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -19,8 +19,10 @@ struct iova { struct rb_node node; unsigned long pfn_hi; /* Highest allocated pfn */ unsigned long pfn_lo; /* Lowest allocated pfn */ - unsigned long gap_to_prev; /* Gap (in pfns) to predecessor */ - unsigned long __subtree_max_gap; /* Max gap_to_prev in subtree */ + unsigned long gap_to_prev; /* Gap (in pfns) to predecessor */ + unsigned long clamped_gap32; /* gap_to_prev portion below dma_32bit= _pfn */ + unsigned long __subtree_max_gap; /* Max gap_to_prev in subtree */ + unsigned long __subtree_max_gap32; /* Max clamped_gap32 in subtree */ }; =20 =20 --=20 2.52.0 From nobody Fri Jun 12 19:47:12 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9579F37A494; Wed, 13 May 2026 02:03:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; cv=none; b=ZaA6Z00ZD7EK1qa1ciis1BY04sFeYcPs86CdiENvtfaTbELfHCEBDu9MV3XHJwWwi+L5ZigAwYU+Kq99Oh4eVlEP0Z2fRFIbOr9MjFa20jDUTrwV2jCGULxb8Ubrmz/gP14VHiZkss2ECJMUWrCeuZPZammOSEFkSBDTXSvZdOM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; c=relaxed/simple; bh=EkeUf+jwV0orsuixqtfCqEcLphGVVp9g5ROQvcnY+Jg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=LaIrd9PjbKfdRTvSXNrXlAQQJmgT3xF36IL9sZty8wV+QBa+DiDB9cYGxQQXttVjy9mOqQtM/iLq9bxnm/ygfylVMcrA/ItJbkqVoHYbp4KyiS6mWTq8/VLS3DEhMI9cgVSobI5BYVAl5LfqdZlgcwHCZGZhjFO0f9ZemiBQm+k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=np8tuy7U; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="np8tuy7U" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To: Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=STKTIaA0zUdKDPKtHxdM0Rg8/7A21LDJqEbHedbnIhc=; b=np8tuy7U6mFAuab7mKO7XYfU+R /CD/N35l3lBiPSi2It/lfYNVJ1M73z3HlnSIofHe5dJhNs6I+ZPrv+QLcSgrwM7Sv71i/g/KtQOFc PGoGWWQ3/bmHY9SDU86YF8VT4T5PF86lfJ7grqrw+5+AVO876+wneGhVo9U/gGze2ZL5pU1zSGxhu 1AWrufrX8B/6bqy+WoyFm6VzQQt9A/C0xWHFSVJTNU5rs4SGWkr64U2Y66rOg5foAh0TNdUjRLMGS LVuIpSn8gWVPPD7Mykp0VVynMaNqOCDzqEjAG57tz2hwGBjrOkPNVkvrzwSv1PKu/rHZDXVQ+l3kC UPLo0qpQ==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wMywG-0000000022e-3zX6; Tue, 12 May 2026 22:03:08 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com, Rik van Riel , Rik van Riel Subject: [PATCH 4/5] iova: alignment-aware two-phase search for log(n) aligned alloc Date: Tue, 12 May 2026 22:00:21 -0400 Message-ID: <20260513020304.1528751-5-riel@surriel.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260513020304.1528751-1-riel@surriel.com> References: <20260513020304.1528751-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Rik van Riel After the limit_pfn-aware augmentation, __iova_search_free_gap is O(log n) for both 64-bit and 32-bit alloc paths -- but only when ignoring alignment. The augmented invariant prunes by gap size; it doesn't know about the per-call align_mask. A subtree whose __subtree_max_gap >=3D size can still contain only borderline-sized gaps that fail the alignment check, forcing the search to visit every node in the subtree. For a request of S pages with alignment A (a power of two), a gap of size G is GUARANTEED to contain an aligned-fitting position iff G >=3D S + A - 1. (Worst case: the highest aligned position roundDOWN(gap_hi - S + 1, A) loses up to A-1 pfns to rounding.) Two-phase search: Phase 1: prune by S + A - 1. Every visited node either prunes its subtree or descends into one whose subtree_max guarantees a fitting candidate. Strict O(log n). Phase 2: only if phase 1 returned NULL. Prune by S, the existing behaviour. Catches fortuitously-aligned borderline gaps. O(n) worst case but only triggers when no comfortable gap exists anywhere in the tree. For typical IOMMU workloads (alloc_iova_fast rounds size up to a power of two, so A =3D=3D S, and most allocations are 1-16 pages) phase 1 succeeds and the search is one O(log n) descent. The borderline case is the network/storage workload that fragments small power-of-two allocations and statistically misaligns half (size 2), 75% (size 4), or more (size 16) of the marginal gaps -- exactly where phase 2 is the fallback. Implementation: __iova_search_free_gap takes a separate prune_size distinct from the size used for fitting and pfn computation, so phase 1 can prune more aggressively while still returning the correct top-aligned pfn within whichever gap it finds. The non-aligned alloc path (size_aligned =3D=3D false, A =3D=3D 1) collapses to a single phase si= nce strict_size =3D=3D size. Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Rik van Riel --- drivers/iommu/iova.c | 65 +++++++++++++++++++++++++++++++++----------- 1 file changed, 49 insertions(+), 16 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7787da8b2dad..4f3465d8ee16 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -210,18 +210,29 @@ iova_insert_rbtree(struct iova_domain *iovad, struct = iova *iova, } =20 /* - * Search the augmented rbtree for the highest-addressed free gap of at le= ast - * @size pages, with the allocation fitting below @limit_pfn and at or abo= ve - * @start_pfn. When @is_32bit, prune by the 32-bit-clamped subtree max so - * that 32-bit-restricted allocations on a domain dominated by high-pfn + * Search the augmented rbtree for the highest-addressed free gap that fits + * @size pages with @align_mask alignment, below @limit_pfn and at or above + * @start_pfn. + * + * @prune_size is the threshold compared against the augmented subtree max: + * - If @prune_size =3D=3D @size, every subtree containing any size-fitt= ing + * gap is descended (lax search; can visit O(n) nodes when alignment or + * limit_pfn rejects most candidates). + * - If @prune_size =3D=3D @size + alignment - 1, only subtrees containi= ng a + * gap big enough to GUARANTEE an aligned fit are descended (strict + * search; O(log n) but may miss borderline-sized fortuitously aligned + * gaps). + * + * When @is_32bit, prune by the 32-bit-clamped subtree max so that + * 32-bit-restricted allocations on a domain dominated by high-pfn * allocations stay O(log n) instead of degrading to O(n). Returns the node * whose gap_to_prev is used, or NULL. */ static struct rb_node * __iova_search_free_gap(struct rb_node *node, unsigned long size, - unsigned long limit_pfn, unsigned long start_pfn, - unsigned long align_mask, bool is_32bit, - unsigned long *new_pfn) + unsigned long prune_size, unsigned long limit_pfn, + unsigned long start_pfn, unsigned long align_mask, + bool is_32bit, unsigned long *new_pfn) { struct iova *iova; struct rb_node *result; @@ -233,11 +244,12 @@ __iova_search_free_gap(struct rb_node *node, unsigned= long size, iova =3D to_iova(node); subtree_max =3D is_32bit ? iova->__subtree_max_gap32 : iova->__subtree_max_gap; - if (subtree_max < size) + if (subtree_max < prune_size) return NULL; =20 - result =3D __iova_search_free_gap(node->rb_right, size, limit_pfn, - start_pfn, align_mask, is_32bit, new_pfn); + result =3D __iova_search_free_gap(node->rb_right, size, prune_size, + limit_pfn, start_pfn, align_mask, + is_32bit, new_pfn); if (result) return result; =20 @@ -257,8 +269,9 @@ __iova_search_free_gap(struct rb_node *node, unsigned l= ong size, } } =20 - return __iova_search_free_gap(node->rb_left, size, limit_pfn, - start_pfn, align_mask, is_32bit, new_pfn); + return __iova_search_free_gap(node->rb_left, size, prune_size, + limit_pfn, start_pfn, align_mask, + is_32bit, new_pfn); } =20 static int __alloc_and_insert_iova_range(struct iova_domain *iovad, @@ -268,11 +281,24 @@ static int __alloc_and_insert_iova_range(struct iova_= domain *iovad, unsigned long flags; unsigned long new_pfn; unsigned long align_mask =3D ~0UL; + unsigned long strict_size =3D size; struct rb_node *gap_node; bool is_32bit; =20 - if (size_aligned) - align_mask <<=3D fls_long(size - 1); + if (size_aligned) { + unsigned long align_shift =3D fls_long(size - 1); + + align_mask <<=3D align_shift; + /* + * For an A-aligned allocation of S pages, a gap of size G is + * guaranteed to contain a fitting aligned position iff + * G >=3D S + A - 1. Use that as the strict pruning threshold for + * a fast O(log n) first pass; if it fails, fall back to a + * pruning threshold of just S to catch fortuitously-aligned + * borderline gaps. + */ + strict_size =3D size + (1UL << align_shift) - 1; + } =20 spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); is_32bit =3D limit_pfn <=3D iovad->dma_32bit_pfn; @@ -280,8 +306,15 @@ static int __alloc_and_insert_iova_range(struct iova_d= omain *iovad, goto iova32_full; =20 gap_node =3D __iova_search_free_gap(iovad->rbroot.rb_node, size, - limit_pfn, iovad->start_pfn, - align_mask, is_32bit, &new_pfn); + strict_size, limit_pfn, + iovad->start_pfn, align_mask, + is_32bit, &new_pfn); + if (!gap_node && strict_size > size) { + gap_node =3D __iova_search_free_gap(iovad->rbroot.rb_node, size, + size, limit_pfn, + iovad->start_pfn, align_mask, + is_32bit, &new_pfn); + } if (!gap_node) { if (is_32bit) iovad->max32_alloc_size =3D size; --=20 2.52.0 From nobody Fri Jun 12 19:47:12 2026 Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 958C0381AFB; Wed, 13 May 2026 02:03:21 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; cv=none; b=UrROo3WAw2u1UgU7vj6fD7YqdoPEXsWuRgDQbmwT81/sTdVqQh3ksifDLIO07yQsaOg7mf7kgF2pz356Xj4wcNiYcbw5GLqT07L06q5ZzLUvk3YnDAUkRfQvHj2DHOiXQlZONenjiif4mW6dbmLIm2S8rEYgtJiFU/7ruCh0xeI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778637803; c=relaxed/simple; bh=41hrHwfchr34G6dv1i37TpBlXqtBPLFr13A8Q3i1U5A=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=AxO5vb1BOh1jqlUcNxhIFLJmQfjIBperjvvH8A94Ckl09P4CTS22hET3w7Rmq89CGUjKaL6X6AmdyH26xo2i1u1vnOWRfOs3/ZYPf4yi984q5khZx+9+5AdeVf7dXHmiNyGqqJO3Tu8PbnuybJMp3e1Vz2xQISkU6FDP7X5qT40= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com; spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b=otRgMr7t; arc=none smtp.client-ip=96.67.55.147 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="otRgMr7t" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com ; s=mail; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References: In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=2E57BXtOvv/R8hb2DBqVXtfeBA65Ud9+LXTlXyCtHoo=; b=otRgMr7tnIylHKROGlOtn9n9Aw YGV10qs7/86dH4wgwKAleQAOEJAEIKWoDH4ggk1EnPm+9kP2KcZV+irKWwQSgMay4WT5M64653QNz DFGi3sxc7hlpsti/1+L4VslZYKudx32yebcG2gWiabdkq2ZgJIbf1Qavk7hqT/Webl+GFPPw9pXeE vN2koVyOV08dvGvi8NsCGOmWtBxOrryypHzIoSzQQ7Tnb+Gr0KpH2GEIx6aP/ceAZRMFoh8VQ+m2E Nii7YWRbxV0Gz4+0hLv9qW4CZFBVltUlIEs3dDTL3BZ6YpkPhwPkIyTOJUHu8lew2tpAEbhKD6xYI pM3G4n8A==; Received: from fangorn.home.surriel.com ([10.0.13.7]) by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.97.1) (envelope-from ) id 1wMywG-0000000022e-44xY; Tue, 12 May 2026 22:03:08 -0400 From: Rik van Riel To: linux-kernel@vger.kernel.org Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com, Rik van Riel , Rik van Riel Subject: [PATCH 5/5] iova: add KUnit test suite Date: Tue, 12 May 2026 22:00:22 -0400 Message-ID: <20260513020304.1528751-6-riel@surriel.com> X-Mailer: git-send-email 2.52.0 In-Reply-To: <20260513020304.1528751-1-riel@surriel.com> References: <20260513020304.1528751-1-riel@surriel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Rik van Riel Add a kunit suite for the augmented-rbtree IOVA allocator, plus an iova_domain_verify_invariants() helper (compiled only when the test config is enabled) that walks the tree and confirms every node's gap_to_prev, clamped_gap32, __subtree_max_gap, and __subtree_max_gap32 match what recomputation from scratch yields. Test cases: - test_init_destroy: domain lifecycle, no leaks. - test_basic_alloc_free: single alloc/free roundtrip, top-down reuse. - test_size_aligned: alignment of size_aligned allocs across orders 0..7. - test_top_down_preference: sequential allocs decrease in pfn_lo. - test_limit_pfn_respected: 100 allocs all stay <=3D limit_pfn. - test_reserve_iova: allocs avoid the reserved range. - test_find_iova: lookup by pfn returns the right iova. - test_32bit_in_64bit_domain: 1000 64-bit allocs followed by a 32-bit alloc must still find a slot below DMA_BIT_MASK(32) -- exercises the __subtree_max_gap32 augmentation. - test_two_phase_alignment: pack size-2 size_aligned allocs, free every other; subsequent size-2 alloc must succeed via the phase-2 fallback search since phase-1's S+A-1 threshold prunes the size-2 gaps. - test_pci_32bit_workaround_pattern: alternate 32-bit-first allocation attempts with 64-bit fallback, mirroring dma-iommu.c. - test_stress_random: 2048 random alloc/free operations with mixed sizes, alignments, and 32/64-bit limits, with periodic invariant checks. Each test verifies the augmented invariants both during and after the test run so that any sequencing bug in insert / erase / rotate / propagate is caught at the operation that introduced it. Tested by: building drivers/iommu/iova.o and drivers/iommu/iova-kunit.o (no warnings); runtime execution requires booting a kernel with CONFIG_IOMMU_IOVA_KUNIT_TEST=3Dy under qemu-system-x86_64 (not available on this devvm). Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Rik van Riel --- drivers/iommu/Kconfig | 14 ++ drivers/iommu/Makefile | 1 + drivers/iommu/iova-kunit.c | 396 +++++++++++++++++++++++++++++++++++++ drivers/iommu/iova.c | 90 +++++++++ include/linux/iova.h | 3 + 5 files changed, 504 insertions(+) create mode 100644 drivers/iommu/iova-kunit.c diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index f86262b11416..61906a5664a6 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -3,6 +3,20 @@ config IOMMU_IOVA tristate =20 +config IOMMU_IOVA_KUNIT_TEST + tristate "KUnit tests for the IOVA allocator" if !KUNIT_ALL_TESTS + depends on IOMMU_IOVA && KUNIT + default KUNIT_ALL_TESTS + help + Enable kunit tests for the IOVA allocator. The tests exercise + basic allocation and free, size-aligned allocation, top-down + ordering, the limit_pfn-aware 32-bit augmentation, the + alignment-aware two-phase search, and randomly-fragmented + stress scenarios. Each test verifies that the augmented + rbtree invariants remain consistent throughout. + + If unsure, say N here. + # IOMMU_API always gets selected by whoever wants it. config IOMMU_API bool diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 0275821f4ef9..6bd7da1cbebd 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -16,6 +16,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) +=3D io-pgtable-arm.o obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) +=3D io-pgtable-arm-selftes= ts.o obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) +=3D io-pgtable-dart.o obj-$(CONFIG_IOMMU_IOVA) +=3D iova.o +obj-$(CONFIG_IOMMU_IOVA_KUNIT_TEST) +=3D iova-kunit.o obj-$(CONFIG_OF_IOMMU) +=3D of_iommu.o obj-$(CONFIG_MSM_IOMMU) +=3D msm_iommu.o obj-$(CONFIG_IPMMU_VMSA) +=3D ipmmu-vmsa.o diff --git a/drivers/iommu/iova-kunit.c b/drivers/iommu/iova-kunit.c new file mode 100644 index 000000000000..e921c8543668 --- /dev/null +++ b/drivers/iommu/iova-kunit.c @@ -0,0 +1,396 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * KUnit tests for the IOVA allocator. + * + * Exercises the augmented-rbtree based allocator: basic alloc/free, + * size-aligned allocations, top-down ordering, the limit_pfn-aware + * 32-bit augmentation (relevant for the dma-iommu pci_32bit_workaround + * pattern), the alignment-aware two-phase search, and randomly + * fragmented stress. + * + * Each test verifies that the augmented invariants + * (__subtree_max_gap, __subtree_max_gap32, gap_to_prev, clamped_gap32) + * remain consistent after every batch of operations. + */ +#include +#include +#include +#include + +#define TEST_GRANULE PAGE_SIZE +/* Highest pfn that fits in 32 bits =E2=80=94 triggers the is_32bit alloc = path. */ +#define TEST_LIMIT_32BIT (DMA_BIT_MASK(32) >> PAGE_SHIFT) +/* A 64-bit-ish limit well above dma_32bit_pfn. */ +#define TEST_LIMIT_64BIT ((1UL << 36) >> PAGE_SHIFT) + +struct iova_test_ctx { + struct iova_domain iovad; + bool initialized; +}; + +static struct iova_test_ctx *iova_test_init_ctx(struct kunit *test) +{ + struct iova_test_ctx *ctx; + int ret; + + ctx =3D kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL); + KUNIT_ASSERT_NOT_NULL(test, ctx); + + ret =3D iova_cache_get(); + KUNIT_ASSERT_EQ(test, ret, 0); + + init_iova_domain(&ctx->iovad, TEST_GRANULE, 1); + ret =3D iova_domain_init_rcaches(&ctx->iovad); + KUNIT_ASSERT_EQ(test, ret, 0); + ctx->initialized =3D true; + KUNIT_ASSERT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + return ctx; +} + +static void iova_test_cleanup(struct kunit *test, struct iova_test_ctx *ct= x) +{ + if (ctx->initialized) { + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + put_iova_domain(&ctx->iovad); + iova_cache_put(); + ctx->initialized =3D false; + } +} + +static void test_init_destroy(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + + iova_test_cleanup(test, ctx); +} + +static void test_basic_alloc_free(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + struct iova *iova; + unsigned long pfn_lo; + + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_32BIT); + KUNIT_EXPECT_EQ(test, iova->pfn_hi - iova->pfn_lo + 1, 1); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + pfn_lo =3D iova->pfn_lo; + __free_iova(&ctx->iovad, iova); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* Top-down policy: subsequent alloc should reuse the same pfn. */ + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_EQ(test, iova->pfn_lo, pfn_lo); + __free_iova(&ctx->iovad, iova); + + iova_test_cleanup(test, ctx); +} + +static void test_size_aligned(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + int order; + + for (order =3D 0; order < 8; ++order) { + unsigned long size =3D 1UL << order; + struct iova *iova =3D alloc_iova(&ctx->iovad, size, + TEST_LIMIT_32BIT, true); + + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_EQ(test, iova->pfn_lo & (size - 1), 0); + KUNIT_EXPECT_EQ(test, iova->pfn_hi - iova->pfn_lo + 1, size); + __free_iova(&ctx->iovad, iova); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + } + + iova_test_cleanup(test, ctx); +} + +static void test_top_down_preference(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + struct iova *iovas[16]; + int i; + + for (i =3D 0; i < ARRAY_SIZE(iovas); ++i) { + iovas[i] =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iovas[i]); + if (i > 0) + KUNIT_EXPECT_LT(test, iovas[i]->pfn_lo, + iovas[i - 1]->pfn_lo); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + for (i =3D 0; i < ARRAY_SIZE(iovas); ++i) + __free_iova(&ctx->iovad, iovas[i]); + + iova_test_cleanup(test, ctx); +} + +static void test_limit_pfn_respected(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + struct iova *iova; + int i; + + for (i =3D 0; i < 100; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_32BIT); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + iova_test_cleanup(test, ctx); +} + +static void test_reserve_iova(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + const unsigned long reserve_lo =3D TEST_LIMIT_32BIT / 2; + struct iova *r, *iova; + int i; + + /* Reserve the entire top half through the limit_pfn, inclusive. */ + r =3D reserve_iova(&ctx->iovad, reserve_lo, TEST_LIMIT_32BIT); + KUNIT_ASSERT_NOT_NULL(test, r); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* All allocs must land below the reserved range. */ + for (i =3D 0; i < 100; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LT(test, iova->pfn_hi, reserve_lo); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + iova_test_cleanup(test, ctx); +} + +static void test_find_iova(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + struct iova *iova, *found; + + iova =3D alloc_iova(&ctx->iovad, 4, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + + found =3D find_iova(&ctx->iovad, iova->pfn_lo); + KUNIT_EXPECT_PTR_EQ(test, found, iova); + found =3D find_iova(&ctx->iovad, iova->pfn_hi); + KUNIT_EXPECT_PTR_EQ(test, found, iova); + /* Outside the range should not find. */ + if (iova->pfn_hi + 1 <=3D TEST_LIMIT_32BIT) { + found =3D find_iova(&ctx->iovad, iova->pfn_hi + 1); + KUNIT_EXPECT_PTR_NE(test, found, iova); + } + + __free_iova(&ctx->iovad, iova); + iova_test_cleanup(test, ctx); +} + +/* + * The pci_32bit_workaround scenario: every PCI device's first IOVA + * allocation hits the 32-bit-restricted path before falling back to + * 64-bit. Mix the two and verify the limit_pfn-aware augmentation + * keeps both correct. + */ +static void test_32bit_in_64bit_domain(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + struct iova *iova; + int i; + + /* Fill the high 64-bit space. */ + for (i =3D 0; i < 1000; ++i) { + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* A 32-bit alloc must still find a slot below DMA_BIT_MASK(32). */ + iova =3D alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_32BIT); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + __free_iova(&ctx->iovad, iova); + iova_test_cleanup(test, ctx); +} + +/* + * Force the alignment-aware two-phase search through phase 2: pack + * size-2 size_aligned allocations, free every other one to leave gaps + * of size 2 (which the strict phase-1 threshold of size + align - 1 =3D 3 + * will prune away), and verify a fresh size-2 alloc still succeeds via + * the phase-2 fallback. + */ +static void test_two_phase_alignment(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + const int N =3D 64; + struct iova **iovas; + struct iova *iova; + int i; + bool found_phase2_candidate =3D false; + + iovas =3D kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL); + KUNIT_ASSERT_NOT_NULL(test, iovas); + + for (i =3D 0; i < N; ++i) { + iovas[i] =3D alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iovas[i]); + KUNIT_EXPECT_EQ(test, iovas[i]->pfn_lo & 1, 0); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* Free every other to create size-2 gaps interleaved with allocs. */ + for (i =3D 0; i < N; i +=3D 2) { + __free_iova(&ctx->iovad, iovas[i]); + iovas[i] =3D NULL; + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + /* + * Allocate size-2 =E2=80=94 phase 1 (threshold 3) should miss the size-2 + * gaps; phase 2 (threshold 2) should still find one. The result + * may also land in the unfragmented gap below the lowest packed + * iova; either way, alloc must succeed and the result must be + * 2-aligned. + */ + iova =3D alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + KUNIT_EXPECT_EQ(test, iova->pfn_lo & 1, 0); + /* Was it placed in one of the freed slots (=3D phase 2 hit)? */ + for (i =3D 1; i < N; i +=3D 2) { + struct iova *neighbor =3D iovas[i]; + + if (!neighbor) + continue; + if (iova->pfn_hi + 1 =3D=3D neighbor->pfn_lo || + iova->pfn_lo =3D=3D neighbor->pfn_hi + 1) { + found_phase2_candidate =3D true; + break; + } + } + kunit_info(test, "alloc landed at pfn 0x%lx, phase2-slot=3D%s\n", + iova->pfn_lo, found_phase2_candidate ? "yes" : "no"); + __free_iova(&ctx->iovad, iova); + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + for (i =3D 0; i < N; ++i) + if (iovas[i]) + __free_iova(&ctx->iovad, iovas[i]); + iova_test_cleanup(test, ctx); +} + +/* + * Mimic dma-iommu's pci_32bit_workaround pattern: every alloc first + * tries the 32-bit limit; if that fails, retry with the 64-bit limit. + * Verifies that the dual-augmented invariant survives the rapid + * switching between is_32bit=3Dtrue and is_32bit=3Dfalse. + */ +static void test_pci_32bit_workaround_pattern(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + int i; + + for (i =3D 0; i < 500; ++i) { + unsigned long size =3D (i % 4) + 1; + struct iova *iova =3D alloc_iova(&ctx->iovad, size, + TEST_LIMIT_32BIT, true); + + if (!iova) + iova =3D alloc_iova(&ctx->iovad, size, + TEST_LIMIT_64BIT, true); + KUNIT_ASSERT_NOT_NULL(test, iova); + } + KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad)); + + iova_test_cleanup(test, ctx); +} + +/* + * Random alloc/free over many iterations, verifying invariants after + * every operation. Uses a deterministic PRNG so failures reproduce + * across boots. + */ +static void test_stress_random(struct kunit *test) +{ + struct iova_test_ctx *ctx =3D iova_test_init_ctx(test); + const int N =3D 512; + const int iters =3D 4 * N; + struct iova **iovas; + u32 rng =3D 0xDEADBEEF; + int i; + + iovas =3D kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL); + KUNIT_ASSERT_NOT_NULL(test, iovas); + + for (i =3D 0; i < iters; ++i) { + int slot; + bool use_32bit; + unsigned long limit; + const char *op; + + rng =3D rng * 1103515245 + 12345; + slot =3D (rng >> 8) % N; + rng =3D rng * 1103515245 + 12345; + use_32bit =3D rng & 1; + limit =3D use_32bit ? TEST_LIMIT_32BIT : TEST_LIMIT_64BIT; + + if (iovas[slot]) { + op =3D "free"; + __free_iova(&ctx->iovad, iovas[slot]); + iovas[slot] =3D NULL; + } else { + unsigned long size; + bool aligned; + + rng =3D rng * 1103515245 + 12345; + size =3D 1UL << ((rng >> 8) % 4); + rng =3D rng * 1103515245 + 12345; + aligned =3D rng & 1; + + op =3D "alloc"; + iovas[slot] =3D alloc_iova(&ctx->iovad, size, limit, + aligned); + } + if (!iova_domain_verify_invariants(&ctx->iovad)) { + kunit_info(test, "iter %d slot %d: invariant broken after %s\n", + i, slot, op); + KUNIT_FAIL(test, "verify failed"); + break; + } + } + + for (i =3D 0; i < N; ++i) + if (iovas[i]) + __free_iova(&ctx->iovad, iovas[i]); + iova_test_cleanup(test, ctx); +} + +static struct kunit_case iova_test_cases[] =3D { + KUNIT_CASE(test_init_destroy), + KUNIT_CASE(test_basic_alloc_free), + KUNIT_CASE(test_size_aligned), + KUNIT_CASE(test_top_down_preference), + KUNIT_CASE(test_limit_pfn_respected), + KUNIT_CASE(test_reserve_iova), + KUNIT_CASE(test_find_iova), + KUNIT_CASE(test_32bit_in_64bit_domain), + KUNIT_CASE(test_two_phase_alignment), + KUNIT_CASE(test_pci_32bit_workaround_pattern), + KUNIT_CASE(test_stress_random), + {} +}; + +static struct kunit_suite iova_test_suite =3D { + .name =3D "iova", + .test_cases =3D iova_test_cases, +}; +kunit_test_suite(iova_test_suite); + +MODULE_DESCRIPTION("KUnit tests for the IOVA allocator"); +MODULE_LICENSE("GPL"); diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 4f3465d8ee16..dfde90fef1f5 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -1160,6 +1160,96 @@ void iova_cache_put(void) } EXPORT_SYMBOL_GPL(iova_cache_put); =20 +#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST) +/* + * Walk the iova rbtree and verify that every node's gap_to_prev, + * clamped_gap32, __subtree_max_gap, and __subtree_max_gap32 match what + * recomputation from scratch would yield. Returns true on success. + * + * Intended for use by kunit tests to catch invariant corruption from + * insertion / deletion / rotation paths. + */ +struct iova_subtree_maxes { + unsigned long max_gap; + unsigned long max_gap32; +}; + +static struct iova_subtree_maxes +iova_walk_verify(struct rb_node *node, struct iova_domain *iovad, bool *ok) +{ + struct iova_subtree_maxes m =3D { 0, 0 }; + struct iova_subtree_maxes left, right; + struct rb_node *prev_node; + struct iova *iova; + unsigned long expected_gap; + unsigned long expected_clamped; + + if (!node) + return m; + + left =3D iova_walk_verify(node->rb_left, iovad, ok); + right =3D iova_walk_verify(node->rb_right, iovad, ok); + iova =3D to_iova(node); + + prev_node =3D rb_prev(node); + if (prev_node) + expected_gap =3D iova->pfn_lo - to_iova(prev_node)->pfn_hi - 1; + else + expected_gap =3D iova->pfn_lo; + if (iova->gap_to_prev !=3D expected_gap) { + pr_err("iova_verify: pfn_lo=3D0x%lx gap_to_prev=3D%lu expected=3D%lu\n", + iova->pfn_lo, iova->gap_to_prev, expected_gap); + *ok =3D false; + } + + expected_clamped =3D iova_compute_clamped_gap32(iova, iovad->dma_32bit_pf= n); + if (iova->clamped_gap32 !=3D expected_clamped) { + pr_err("iova_verify: pfn_lo=3D0x%lx clamped_gap32=3D%lu expected=3D%lu\n= ", + iova->pfn_lo, iova->clamped_gap32, expected_clamped); + *ok =3D false; + } + + m.max_gap =3D iova->gap_to_prev; + if (left.max_gap > m.max_gap) + m.max_gap =3D left.max_gap; + if (right.max_gap > m.max_gap) + m.max_gap =3D right.max_gap; + + m.max_gap32 =3D iova->clamped_gap32; + if (left.max_gap32 > m.max_gap32) + m.max_gap32 =3D left.max_gap32; + if (right.max_gap32 > m.max_gap32) + m.max_gap32 =3D right.max_gap32; + + if (iova->__subtree_max_gap !=3D m.max_gap) { + pr_err("iova_verify: pfn_lo=3D0x%lx __subtree_max_gap=3D%lu expected=3D%= lu (own=3D%lu left=3D%lu right=3D%lu)\n", + iova->pfn_lo, iova->__subtree_max_gap, m.max_gap, + iova->gap_to_prev, left.max_gap, right.max_gap); + *ok =3D false; + } + if (iova->__subtree_max_gap32 !=3D m.max_gap32) { + pr_err("iova_verify: pfn_lo=3D0x%lx __subtree_max_gap32=3D%lu expected= =3D%lu (own=3D%lu left=3D%lu right=3D%lu)\n", + iova->pfn_lo, iova->__subtree_max_gap32, m.max_gap32, + iova->clamped_gap32, left.max_gap32, right.max_gap32); + *ok =3D false; + } + + return m; +} + +bool iova_domain_verify_invariants(struct iova_domain *iovad) +{ + bool ok =3D true; + unsigned long flags; + + spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); + iova_walk_verify(iovad->rbroot.rb_node, iovad, &ok); + spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags); + return ok; +} +EXPORT_SYMBOL_GPL(iova_domain_verify_invariants); +#endif /* CONFIG_IOMMU_IOVA_KUNIT_TEST */ + MODULE_AUTHOR("Anil S Keshavamurthy "); MODULE_DESCRIPTION("IOMMU I/O Virtual Address management"); MODULE_LICENSE("GPL"); diff --git a/include/linux/iova.h b/include/linux/iova.h index d262c6d88d6c..108676d8ad69 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -104,6 +104,9 @@ void init_iova_domain(struct iova_domain *iovad, unsign= ed long granule, int iova_domain_init_rcaches(struct iova_domain *iovad); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iova_domain(struct iova_domain *iovad); +#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST) +bool iova_domain_verify_invariants(struct iova_domain *iovad); +#endif #else static inline int iova_cache_get(void) { --=20 2.52.0