[PATCH 0/3] convert iova to maple tree

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] convert iova to maple tree
@ 2026-06-24  3:07 Rik van Riel
  2026-06-24  3:07 ` [PATCH 1/3] iova: convert from rbtree " Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Rik van Riel @ 2026-06-24  3:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, robin.murphy, joro, will, iommu, jgg, kyle,
	ashok.raj

Occasionally production workloads at Meta run into the linear search
in alloc_iova() in ways that cause real issues. For example, when
enough CPUs at a time fall into the linear search trap, systems
have been known to get stuck for so long that it causes soft lockups.

With the old code, free_iova, find_iova, reserve_iova, iova_insert_rbtree,
and remove_iova were all O(log n) already. They stay that way with these
patches.

This patch series uses a maple tree to index the iova ranges.
This allows alloc_iova() to have O(log n) complexity, while
memory use stays about the same as before.

It also adds some self tests for the iova code.

The code was written by Claude, and nitpicked by myself.
Don't be shy if there are more nitpicks remaining.

It was tested both in a VM (running the selftests), and on an
AMD Bergamo system with IOMMU enabled.

Unfortunately I do not know of any way to reproduce the linear
search soft lockups at will, so I have not been able to verify
that scenary in practice.

Based on 5d6919055dec Linux 7.1-rc3

v4:
 - reduce the size of struct iova to 16 bytes
 - simplify the (hopefully rare) remove_iova GFP_ATOMIC failure path
 - test case for the deferred free code
v3:
 - switch to maple tree (suggested by Robin Murphy)
v2:
 - clean up selftests (thanks Jason Gunthorpe)
 - address Sashiko concerns
 - drop the search-with-alignment, since most iova requests
   should be of similar sizes, so the worst case behavior
   is unlikely to hit once ranges are excluded by the augmented
   rbtree

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/3] iova: convert from rbtree to maple tree
  2026-06-24  3:07 [PATCH 0/3] convert iova to maple tree Rik van Riel
@ 2026-06-24  3:07 ` Rik van Riel
  2026-06-24  3:07 ` [PATCH 2/3] iova: add KUnit test suite Rik van Riel
  2026-06-24  3:07 ` [PATCH 3/3] iova: defer maple tree erase on GFP_ATOMIC failure Rik van Riel
  2 siblings, 0 replies; 4+ messages in thread
From: Rik van Riel @ 2026-06-24  3:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, robin.murphy, joro, will, iommu, jgg, kyle,
	ashok.raj, Rik van Riel

Replace the hand-rolled rbtree in the IOVA allocator with a maple tree.
The maple tree is a B-tree variant designed for range tracking with
built-in gap searching, making it a natural fit for IOVA address space
management -- the same data structure is used by the kernel VMA
subsystem for the analogous problem.

Key changes:

  - struct iova shrinks from 40 bytes (rb_node 24 + pfn_hi 8 + pfn_lo 8)
    to just pfn_hi + pfn_lo (16 bytes). SLAB_HWCACHE_ALIGN is dropped
    from the iova slab cache since struct iova no longer contains an
    embedded rb_node touched during tree rebalancing -- tree traversal
    now touches maple nodes exclusively. This lets the slab allocator
    pack 16-byte objects tightly instead of rounding to 64 bytes.

    The maple tree replaces the embedded rb_node with external B-tree
    nodes: each maple_arange_64 node is 256 bytes and holds up to 10
    entries, plus internal nodes add ~11% overhead. At typical
    utilization (~70-90% full), this works out to ~28-41 bytes of maple
    tree node memory per IOVA entry. The net per-entry cost is ~44-57
    bytes (16 bytes slab + ~28-41 bytes maple), compared to ~64 bytes
    with the old rbtree (64 bytes HWCACHE_ALIGN slab + 0 bytes
    embedded rbtree). Combined with the maple tree's O(log_10 n)
    search depth and better cache locality from B-tree fan-out,
    this improves both memory efficiency and search performance.

  - struct iova_domain replaces rb_root + cached_node + cached32_node
    + anchor with a single struct maple_tree. The iova_rbtree_lock
    spinlock is renamed to iova_lock. The maple tree is initialized
    with MT_FLAGS_ALLOC_RANGE (enables gap tracking for
    mas_empty_area_rev) and MT_FLAGS_LOCK_EXTERN (uses the existing
    iova_lock spinlock instead of the maple tree internal lock).

  - Allocation via __alloc_and_insert_iova_range() uses
    mas_empty_area_rev() to find the highest-addressed gap of
    sufficient size below limit_pfn in O(log n) with B-tree fan-out.
    Alignment is handled by over-requesting (size + alignment - 1)
    to guarantee room after rounding, eliminating the need for a
    retry loop. The result is stored with mas_store_gfp().

  - Lookup via private_find_iova() uses mas_walk() for O(log n)
    point-in-range lookup.

  - Deletion via remove_iova() uses mas_erase(). No successor gap
    fixup needed -- the maple tree handles it internally.

  - reserve_iova() walks the requested range for existing entries,
    computes the merged range, collects old entries for freeing, then
    stores a single merged entry. If the request is fully covered by
    an existing entry, it returns that entry without allocating.

  - The IOVA_ANCHOR sentinel node is eliminated. The maple tree
    tracks gaps implicitly, including the space above the highest
    allocation.

  - The cached_node / cached32_node fields and all their helpers
    are eliminated. The maple tree B-tree structure provides
    equivalent or better cache behaviour.

The rcache (magazine cache) layer is unchanged -- it operates on raw
pfn values and is orthogonal to the tree backing store.

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
---
 drivers/iommu/iova.c | 375 +++++++++++++++----------------------------
 include/linux/iova.h |  10 +-
 2 files changed, 133 insertions(+), 252 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 021daf6528de..fd512b33ba32 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -13,9 +13,7 @@
 #include <linux/bitops.h>
 #include <linux/cpu.h>
 #include <linux/workqueue.h>
-
-/* The anchor node sits above the top of the usable address space */
-#define IOVA_ANCHOR	~0UL
+#include <linux/maple_tree.h>
 
 #define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
 
@@ -29,11 +27,6 @@ static void free_iova_rcaches(struct iova_domain *iovad);
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_global_cached_iovas(struct iova_domain *iovad);
 
-static struct iova *to_iova(struct rb_node *node)
-{
-	return rb_entry(node, struct iova, node);
-}
-
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	unsigned long start_pfn)
@@ -45,180 +38,87 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	 */
 	BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule));
 
-	spin_lock_init(&iovad->iova_rbtree_lock);
-	iovad->rbroot = RB_ROOT;
-	iovad->cached_node = &iovad->anchor.node;
-	iovad->cached32_node = &iovad->anchor.node;
+	spin_lock_init(&iovad->iova_lock);
+	/*
+	 * Drive the maple tree with iova_lock rather than its internal lock:
+	 * the same lock already serialises the rest of the domain's mutable
+	 * state (e.g. max32_alloc_size), and IOVA frees can run in hardirq
+	 * context, so the tree lock must be irq-safe.
+	 */
+	mt_init_flags(&iovad->mtree,
+		      MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
+	mt_set_external_lock(&iovad->mtree, &iovad->iova_lock);
 	iovad->granule = granule;
 	iovad->start_pfn = start_pfn;
 	iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
 	iovad->max32_alloc_size = iovad->dma_32bit_pfn;
-	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
-	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
-	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
 
-static struct rb_node *
-__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn)
-{
-	if (limit_pfn <= iovad->dma_32bit_pfn)
-		return iovad->cached32_node;
-
-	return iovad->cached_node;
-}
-
-static void
-__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
-{
-	if (new->pfn_hi < iovad->dma_32bit_pfn)
-		iovad->cached32_node = &new->node;
-	else
-		iovad->cached_node = &new->node;
-}
-
-static void
-__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
-{
-	struct iova *cached_iova;
-
-	cached_iova = to_iova(iovad->cached32_node);
-	if (free == cached_iova ||
-	    (free->pfn_hi < iovad->dma_32bit_pfn &&
-	     free->pfn_lo >= cached_iova->pfn_lo))
-		iovad->cached32_node = rb_next(&free->node);
-
-	if (free->pfn_lo < iovad->dma_32bit_pfn)
-		iovad->max32_alloc_size = iovad->dma_32bit_pfn;
-
-	cached_iova = to_iova(iovad->cached_node);
-	if (free->pfn_lo >= cached_iova->pfn_lo)
-		iovad->cached_node = rb_next(&free->node);
-}
-
-static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn)
-{
-	struct rb_node *node, *next;
-	/*
-	 * Ideally what we'd like to judge here is whether limit_pfn is close
-	 * enough to the highest-allocated IOVA that starting the allocation
-	 * walk from the anchor node will be quicker than this initial work to
-	 * find an exact starting point (especially if that ends up being the
-	 * anchor node anyway). This is an incredibly crude approximation which
-	 * only really helps the most likely case, but is at least trivially easy.
-	 */
-	if (limit_pfn > iovad->dma_32bit_pfn)
-		return &iovad->anchor.node;
-
-	node = iovad->rbroot.rb_node;
-	while (to_iova(node)->pfn_hi < limit_pfn)
-		node = node->rb_right;
-
-search_left:
-	while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= limit_pfn)
-		node = node->rb_left;
-
-	if (!node->rb_left)
-		return node;
-
-	next = node->rb_left;
-	while (next->rb_right) {
-		next = next->rb_right;
-		if (to_iova(next)->pfn_lo >= limit_pfn) {
-			node = next;
-			goto search_left;
-		}
-	}
-
-	return node;
-}
-
-/* Insert the iova into domain rbtree by holding writer lock */
-static void
-iova_insert_rbtree(struct rb_root *root, struct iova *iova,
-		   struct rb_node *start)
-{
-	struct rb_node **new, *parent = NULL;
-
-	new = (start) ? &start : &(root->rb_node);
-	/* Figure out where to put new node */
-	while (*new) {
-		struct iova *this = to_iova(*new);
-
-		parent = *new;
-
-		if (iova->pfn_lo < this->pfn_lo)
-			new = &((*new)->rb_left);
-		else if (iova->pfn_lo > this->pfn_lo)
-			new = &((*new)->rb_right);
-		else {
-			WARN_ON(1); /* this should not happen */
-			return;
-		}
-	}
-	/* Add new node and rebalance tree. */
-	rb_link_node(&iova->node, parent, new);
-	rb_insert_color(&iova->node, root);
-}
-
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		unsigned long size, unsigned long limit_pfn,
 			struct iova *new, bool size_aligned)
 {
-	struct rb_node *curr, *prev;
-	struct iova *curr_iova;
 	unsigned long flags;
-	unsigned long new_pfn, retry_pfn;
+	unsigned long new_pfn;
 	unsigned long align_mask = ~0UL;
-	unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
+	unsigned long search_size = size;
+	MA_STATE(mas, &iovad->mtree, 0, 0);
+
+	if (size_aligned) {
+		unsigned long align = 1UL << fls_long(size - 1);
 
-	if (size_aligned)
 		align_mask <<= fls_long(size - 1);
+		search_size = size + align - 1;
+	}
 
-	/* Walk the tree backwards */
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
+	/*
+	 * Fast-fail a 32-bit request once one of the same-or-larger size has
+	 * already failed, without searching. The hint is read under the lock
+	 * that guards it: a lockless read could race an update and spuriously
+	 * return -ENOMEM, and the cost is hidden by the tree walk that follows.
+	 */
 	if (limit_pfn <= iovad->dma_32bit_pfn &&
 			size >= iovad->max32_alloc_size)
-		goto iova32_full;
-
-	curr = __get_cached_rbnode(iovad, limit_pfn);
-	curr_iova = to_iova(curr);
-	retry_pfn = curr_iova->pfn_hi;
-
-retry:
-	do {
-		high_pfn = min(high_pfn, curr_iova->pfn_lo);
-		new_pfn = (high_pfn - size) & align_mask;
-		prev = curr;
-		curr = rb_prev(curr);
-		curr_iova = to_iova(curr);
-	} while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= low_pfn);
-
-	if (high_pfn < size || new_pfn < low_pfn) {
-		if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
-			high_pfn = limit_pfn;
-			low_pfn = retry_pfn + 1;
-			curr = iova_find_limit(iovad, limit_pfn);
-			curr_iova = to_iova(curr);
-			goto retry;
-		}
-		iovad->max32_alloc_size = size;
-		goto iova32_full;
+		goto alloc_fail;
+
+	if (mas_empty_area_rev(&mas, iovad->start_pfn,
+				 limit_pfn - 1, search_size)) {
+		/*
+		 * No gap of the requested size exists below limit_pfn. For a
+		 * 32-bit allocation record the failing size, so later same-or-
+		 * larger 32-bit allocations fail fast without searching until a
+		 * free in the 32-bit range clears the hint. Only genuine space
+		 * exhaustion updates the hint: the transient store failure
+		 * below must not, or it would spuriously fail-fast valid
+		 * allocations under memory pressure.
+		 */
+		if (limit_pfn <= iovad->dma_32bit_pfn)
+			iovad->max32_alloc_size = size;
+		goto alloc_fail;
 	}
 
-	/* pfn_lo will point to size aligned address if size_aligned is set */
+	/*
+	 * mas_empty_area_rev() guarantees a gap of at least search_size
+	 * within [start_pfn, limit_pfn - 1], so the size-aligned PFN carved
+	 * from the top of that gap is always >= mas.index >= start_pfn.
+	 */
+	new_pfn = (mas.last - size + 1) & align_mask;
+
 	new->pfn_lo = new_pfn;
-	new->pfn_hi = new->pfn_lo + size - 1;
+	new->pfn_hi = new_pfn + size - 1;
 
-	/* If we have 'prev', it's a valid place to start the insertion. */
-	iova_insert_rbtree(&iovad->rbroot, new, prev);
-	__cached_rbnode_insert_update(iovad, new);
+	mas.index = new->pfn_lo;
+	mas.last = new->pfn_hi;
+	if (mas_store_gfp(&mas, new, GFP_ATOMIC))
+		goto alloc_fail;
 
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return 0;
 
-iova32_full:
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+alloc_fail:
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return -ENOMEM;
 }
 
@@ -233,8 +133,7 @@ static struct iova *alloc_iova_mem(void)
 
 static void free_iova_mem(struct iova *iova)
 {
-	if (iova->pfn_lo != IOVA_ANCHOR)
-		kmem_cache_free(iova_cache, iova);
+	kmem_cache_free(iova_cache, iova);
 }
 
 /**
@@ -275,29 +174,22 @@ EXPORT_SYMBOL_GPL(alloc_iova);
 static struct iova *
 private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 {
-	struct rb_node *node = iovad->rbroot.rb_node;
-
-	assert_spin_locked(&iovad->iova_rbtree_lock);
+	MA_STATE(mas, &iovad->mtree, pfn, pfn);
 
-	while (node) {
-		struct iova *iova = to_iova(node);
-
-		if (pfn < iova->pfn_lo)
-			node = node->rb_left;
-		else if (pfn > iova->pfn_hi)
-			node = node->rb_right;
-		else
-			return iova;	/* pfn falls within iova's range */
-	}
-
-	return NULL;
+	assert_spin_locked(&iovad->iova_lock);
+	return mas_walk(&mas);
 }
 
 static void remove_iova(struct iova_domain *iovad, struct iova *iova)
 {
-	assert_spin_locked(&iovad->iova_rbtree_lock);
-	__cached_rbnode_delete_update(iovad, iova);
-	rb_erase(&iova->node, &iovad->rbroot);
+	MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi);
+
+	assert_spin_locked(&iovad->iova_lock);
+
+	if (iova->pfn_lo < iovad->dma_32bit_pfn)
+		iovad->max32_alloc_size = iovad->dma_32bit_pfn;
+
+	mas_store_gfp(&mas, NULL, GFP_ATOMIC);
 }
 
 /**
@@ -312,10 +204,9 @@ struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
 	unsigned long flags;
 	struct iova *iova;
 
-	/* Take the lock so that no other thread is manipulating the rbtree */
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	iova = private_find_iova(iovad, pfn);
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return iova;
 }
 EXPORT_SYMBOL_GPL(find_iova);
@@ -331,9 +222,9 @@ __free_iova(struct iova_domain *iovad, struct iova *iova)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	remove_iova(iovad, iova);
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(__free_iova);
@@ -351,14 +242,14 @@ free_iova(struct iova_domain *iovad, unsigned long pfn)
 	unsigned long flags;
 	struct iova *iova;
 
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 	iova = private_find_iova(iovad, pfn);
 	if (!iova) {
-		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		spin_unlock_irqrestore(&iovad->iova_lock, flags);
 		return;
 	}
 	remove_iova(iovad, iova);
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(free_iova);
@@ -445,27 +336,18 @@ static void iova_domain_free_rcaches(struct iova_domain *iovad)
  */
 void put_iova_domain(struct iova_domain *iovad)
 {
-	struct iova *iova, *tmp;
+	struct iova *iova;
+	MA_STATE(mas, &iovad->mtree, 0, 0);
 
 	if (iovad->rcaches)
 		iova_domain_free_rcaches(iovad);
 
-	rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node)
+	mas_for_each(&mas, iova, ULONG_MAX)
 		free_iova_mem(iova);
+	__mt_destroy(&iovad->mtree);
 }
 EXPORT_SYMBOL_GPL(put_iova_domain);
 
-static int
-__is_range_overlap(struct rb_node *node,
-	unsigned long pfn_lo, unsigned long pfn_hi)
-{
-	struct iova *iova = to_iova(node);
-
-	if ((pfn_lo <= iova->pfn_hi) && (pfn_hi >= iova->pfn_lo))
-		return 1;
-	return 0;
-}
-
 static inline struct iova *
 alloc_and_init_iova(unsigned long pfn_lo, unsigned long pfn_hi)
 {
@@ -480,29 +362,6 @@ alloc_and_init_iova(unsigned long pfn_lo, unsigned long pfn_hi)
 	return iova;
 }
 
-static struct iova *
-__insert_new_range(struct iova_domain *iovad,
-	unsigned long pfn_lo, unsigned long pfn_hi)
-{
-	struct iova *iova;
-
-	iova = alloc_and_init_iova(pfn_lo, pfn_hi);
-	if (iova)
-		iova_insert_rbtree(&iovad->rbroot, iova, NULL);
-
-	return iova;
-}
-
-static void
-__adjust_overlap_range(struct iova *iova,
-	unsigned long *pfn_lo, unsigned long *pfn_hi)
-{
-	if (*pfn_lo < iova->pfn_lo)
-		iova->pfn_lo = *pfn_lo;
-	if (*pfn_hi > iova->pfn_hi)
-		*pfn_lo = iova->pfn_hi + 1;
-}
-
 /**
  * reserve_iova - reserves an iova in the given range
  * @iovad: - iova domain pointer
@@ -510,41 +369,67 @@ __adjust_overlap_range(struct iova *iova,
  * @pfn_hi:- higher pfn address
  * This function allocates reserves the address range from pfn_lo to pfn_hi so
  * that this address is not dished out as part of alloc_iova.
+ *
+ * If the requested range overlaps existing reservations, ranges are merged.
+ * If the requested range is fully covered by an existing reservation, the
+ * existing entry is returned without allocating.
  */
 struct iova *
 reserve_iova(struct iova_domain *iovad,
 	unsigned long pfn_lo, unsigned long pfn_hi)
 {
-	struct rb_node *node;
+	struct iova *iova = NULL, *overlap;
+	unsigned long merged_lo = pfn_lo, merged_hi = pfn_hi;
 	unsigned long flags;
-	struct iova *iova;
-	unsigned int overlap = 0;
+
+	MA_STATE(mas, &iovad->mtree, pfn_lo, pfn_hi);
+	MA_STATE(fmas, &iovad->mtree, 0, 0);
 
 	/* Don't allow nonsensical pfns */
 	if (WARN_ON((pfn_hi | pfn_lo) > (ULLONG_MAX >> iova_shift(iovad))))
 		return NULL;
 
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-	for (node = rb_first(&iovad->rbroot); node; node = rb_next(node)) {
-		if (__is_range_overlap(node, pfn_lo, pfn_hi)) {
-			iova = to_iova(node);
-			__adjust_overlap_range(iova, &pfn_lo, &pfn_hi);
-			if ((pfn_lo >= iova->pfn_lo) &&
-				(pfn_hi <= iova->pfn_hi))
-				goto finish;
-			overlap = 1;
-
-		} else if (overlap)
-				break;
+	spin_lock_irqsave(&iovad->iova_lock, flags);
+
+	/*
+	 * Compute the range covering all overlaps, but do not free the
+	 * overlapping iovas yet: the merged store below can fail, and freeing
+	 * before a failed store would leave dangling pointers in the tree.
+	 */
+	mas_for_each(&mas, overlap, pfn_hi) {
+		/* Fully covered by an existing reservation: hand it back. */
+		if (pfn_lo >= overlap->pfn_lo && pfn_hi <= overlap->pfn_hi) {
+			iova = overlap;
+			goto out;
+		}
+		if (overlap->pfn_lo < merged_lo)
+			merged_lo = overlap->pfn_lo;
+		if (overlap->pfn_hi > merged_hi)
+			merged_hi = overlap->pfn_hi;
 	}
 
-	/* We are here either because this is the first reserver node
-	 * or need to insert remaining non overlap addr range
+	iova = alloc_and_init_iova(merged_lo, merged_hi);
+	if (!iova)
+		goto out;
+
+	/*
+	 * Preallocate the nodes for the merged store so it cannot fail; only
+	 * then free the now-superseded overlaps and store the merged range.
 	 */
-	iova = __insert_new_range(iovad, pfn_lo, pfn_hi);
-finish:
+	mas_set_range(&mas, merged_lo, merged_hi);
+	if (mas_preallocate(&mas, iova, GFP_ATOMIC)) {
+		free_iova_mem(iova);
+		iova = NULL;
+		goto out;
+	}
+
+	mas_set_range(&fmas, merged_lo, merged_hi);
+	mas_for_each(&fmas, overlap, merged_hi)
+		free_iova_mem(overlap);
 
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	mas_store_prealloc(&mas, iova);
+out:
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 	return iova;
 }
 EXPORT_SYMBOL_GPL(reserve_iova);
@@ -621,7 +506,7 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	spin_lock_irqsave(&iovad->iova_lock, flags);
 
 	for (i = 0 ; i < mag->size; ++i) {
 		struct iova *iova = private_find_iova(iovad, mag->pfns[i]);
@@ -633,7 +518,7 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 		free_iova_mem(iova);
 	}
 
-	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
 
 	mag->size = 0;
 }
@@ -956,8 +841,8 @@ int iova_cache_get(void)
 
 	mutex_lock(&iova_cache_mutex);
 	if (!iova_cache_users) {
-		iova_cache = kmem_cache_create("iommu_iova", sizeof(struct iova), 0,
-					       SLAB_HWCACHE_ALIGN, NULL);
+		iova_cache = kmem_cache_create("iommu_iova", sizeof(struct iova),
+					       0, 0, NULL);
 		if (!iova_cache)
 			goto out_err;
 
diff --git a/include/linux/iova.h b/include/linux/iova.h
index d2c4fd923efa..eb4f9ead5451 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -11,12 +11,11 @@
 
 #include <linux/types.h>
 #include <linux/kernel.h>
-#include <linux/rbtree.h>
+#include <linux/maple_tree.h>
 #include <linux/dma-mapping.h>
 
 /* iova structure */
 struct iova {
-	struct rb_node	node;
 	unsigned long	pfn_hi; /* Highest allocated pfn */
 	unsigned long	pfn_lo; /* Lowest allocated pfn */
 };
@@ -26,15 +25,12 @@ struct iova_rcache;
 
 /* holds all the iova translations for a domain */
 struct iova_domain {
-	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
-	struct rb_root	rbroot;		/* iova domain rbtree root */
-	struct rb_node	*cached_node;	/* Save last alloced node */
-	struct rb_node	*cached32_node; /* Save last 32-bit alloced node */
+	spinlock_t	iova_lock;	/* Lock to protect update of maple tree */
+	struct maple_tree	mtree;
 	unsigned long	granule;	/* pfn granularity for this domain */
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
 	unsigned long	max32_alloc_size; /* Size of last failed allocation */
-	struct iova	anchor;		/* rbtree lookup anchor */
 
 	struct iova_rcache	*rcaches;
 	struct hlist_node	cpuhp_dead;
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 2/3] iova: add KUnit test suite
  2026-06-24  3:07 [PATCH 0/3] convert iova to maple tree Rik van Riel
  2026-06-24  3:07 ` [PATCH 1/3] iova: convert from rbtree " Rik van Riel
@ 2026-06-24  3:07 ` Rik van Riel
  2026-06-24  3:07 ` [PATCH 3/3] iova: defer maple tree erase on GFP_ATOMIC failure Rik van Riel
  2 siblings, 0 replies; 4+ messages in thread
From: Rik van Riel @ 2026-06-24  3:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, robin.murphy, joro, will, iommu, jgg, kyle,
	ashok.raj, Rik van Riel

Add a kunit suite for the maple-tree-based IOVA allocator, plus an
iova_domain_verify_invariants() helper (compiled only when the test
config is enabled) that walks the maple tree and confirms every entry's
pfn_lo/pfn_hi match the maple tree index range and that no entries
overlap.

Test cases:
  - test_size_aligned: alignment of size_aligned allocs across orders 0..7.
  - test_top_down_preference: sequential allocs decrease in pfn_lo.
  - test_reserve_iova: allocs avoid the reserved range.
  - test_32bit_in_64bit_domain: 1000 64-bit allocs followed by a 32-bit
    alloc must still find a slot below DMA_BIT_MASK(32).
  - test_arbitrary_dma_limits: after filling 64-bit space, verify that
    bounded allocations at 33-bit and 56-bit limits still find slots
    within their respective ranges, confirming that mas_empty_area_rev
    generalizes to arbitrary limit_pfn values.
  - test_aligned_in_fragmented: pack size-2 size_aligned allocs, free
    every other to leave size-2 holes; a fresh size-2 aligned alloc
    must still succeed and return a 2-aligned pfn.
  - test_pci_32bit_workaround_pattern: alternate 32-bit-first allocation
    attempts with 64-bit fallback, mirroring dma-iommu.c.
  - test_stress_random: 2048 random alloc/free operations with mixed
    sizes, alignments, and DMA limits (32/33/56/64-bit), checking
    invariants after every operation. Uses a deterministic PRNG so
    failures reproduce across boots.
  - test_full_space_search_time: fill a 16K-pfn range completely, then
    verify that a failed alloc returns in bounded time (O(log n) via
    mas_empty_area_rev, not O(n)).
  - test_fragmented_32bit_search: pack the 32-bit IOVA space, then
    verify bounded search time for both 32-bit failure and 64-bit
    fallback success paths.

Run with:
  tools/testing/kunit/kunit.py run --kunitconfig=drivers/iommu

Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 drivers/iommu/.kunitconfig |   6 +
 drivers/iommu/Kconfig      |  16 ++
 drivers/iommu/Makefile     |   1 +
 drivers/iommu/iova-kunit.c | 439 +++++++++++++++++++++++++++++++++++++
 drivers/iommu/iova.c       |  34 +++
 include/linux/iova.h       |   3 +
 6 files changed, 499 insertions(+)
 create mode 100644 drivers/iommu/.kunitconfig
 create mode 100644 drivers/iommu/iova-kunit.c

diff --git a/drivers/iommu/.kunitconfig b/drivers/iommu/.kunitconfig
new file mode 100644
index 000000000000..d2bb924b1883
--- /dev/null
+++ b/drivers/iommu/.kunitconfig
@@ -0,0 +1,6 @@
+CONFIG_KUNIT=y
+CONFIG_IOMMU_SUPPORT=y
+CONFIG_IOMMU_IOVA=y
+CONFIG_IOMMU_IOVA_KUNIT_TEST=y
+CONFIG_KASAN=y
+CONFIG_KASAN_GENERIC=y
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index f86262b11416..f09046e238fd 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -3,6 +3,22 @@
 config IOMMU_IOVA
 	tristate
 
+config IOMMU_IOVA_KUNIT_TEST
+	tristate "KUnit tests for the IOVA allocator" if !KUNIT_ALL_TESTS
+	depends on IOMMU_IOVA && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable kunit tests for the IOVA allocator. The tests exercise
+	  basic allocation and free, size-aligned allocation, top-down
+	  ordering, bounded allocations with various DMA limits (32-bit,
+	  33-bit, 56-bit), aligned allocations in fragmented domains,
+	  and randomly-fragmented stress scenarios.
+
+	  Run with:
+	    tools/testing/kunit/kunit.py run --kunitconfig=drivers/iommu
+
+	  If unsure, say N here.
+
 # IOMMU_API always gets selected by whoever wants it.
 config IOMMU_API
 	bool
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 0275821f4ef9..6bd7da1cbebd 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST) += io-pgtable-arm-selftests.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_DART) += io-pgtable-dart.o
 obj-$(CONFIG_IOMMU_IOVA) += iova.o
+obj-$(CONFIG_IOMMU_IOVA_KUNIT_TEST) += iova-kunit.o
 obj-$(CONFIG_OF_IOMMU)	+= of_iommu.o
 obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
 obj-$(CONFIG_IPMMU_VMSA) += ipmmu-vmsa.o
diff --git a/drivers/iommu/iova-kunit.c b/drivers/iommu/iova-kunit.c
new file mode 100644
index 000000000000..bf37c5102e6e
--- /dev/null
+++ b/drivers/iommu/iova-kunit.c
@@ -0,0 +1,439 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KUnit tests for the IOVA allocator.
+ *
+ * Exercises the maple-tree-based allocator: basic alloc/free,
+ * size-aligned allocations, top-down ordering, bounded allocations
+ * with various DMA limits (32-bit, 33-bit, 56-bit), aligned
+ * allocations in fragmented domains, and randomly fragmented stress.
+ *
+ * Each test verifies that the maple tree invariants remain consistent
+ * after every batch of operations.
+ */
+#include <kunit/test.h>
+#include <linux/dma-mapping.h>
+#include <linux/iova.h>
+
+#define TEST_GRANULE PAGE_SIZE
+/* Highest pfn that fits in 32 bits — triggers the bounded alloc path. */
+#define TEST_LIMIT_32BIT (DMA_BIT_MASK(32) >> PAGE_SHIFT)
+/* 33-bit limit — exercises non-power-of-two DMA boundaries. */
+#define TEST_LIMIT_33BIT (DMA_BIT_MASK(33) >> PAGE_SHIFT)
+/* 56-bit limit — typical server IOMMU address width. */
+#define TEST_LIMIT_56BIT (DMA_BIT_MASK(56) >> PAGE_SHIFT)
+/* A 64-bit-ish limit well above dma_32bit_pfn. 1ULL avoids UB on ILP32. */
+#define TEST_LIMIT_64BIT ((1ULL << 36) >> PAGE_SHIFT)
+/*
+ * A small <=32-bit limit used by tests that want to actually exhaust the
+ * restricted region within a tractable number of allocations.
+ */
+#define TEST_LIMIT_32BIT_RESTRICTED 256UL
+
+struct iova_test_ctx {
+	struct iova_domain iovad;
+	bool initialized;
+};
+
+static int iova_test_init(struct kunit *test)
+{
+	struct iova_test_ctx *ctx;
+	int ret;
+
+	ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return -ENOMEM;
+	test->priv = ctx;
+
+	ret = iova_cache_get();
+	if (ret)
+		return ret;
+
+	init_iova_domain(&ctx->iovad, TEST_GRANULE, 1);
+	ret = iova_domain_init_rcaches(&ctx->iovad);
+	if (ret) {
+		put_iova_domain(&ctx->iovad);
+		iova_cache_put();
+		return ret;
+	}
+	ctx->initialized = true;
+
+	KUNIT_ASSERT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+	return 0;
+}
+
+static void iova_test_exit(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+
+	if (ctx && ctx->initialized) {
+		put_iova_domain(&ctx->iovad);
+		ctx->initialized = false;
+		iova_cache_put();
+	}
+}
+
+static void test_size_aligned(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	int order;
+
+	for (order = 0; order < 8; ++order) {
+		unsigned long size = 1UL << order;
+		struct iova *iova = alloc_iova(&ctx->iovad, size,
+					       TEST_LIMIT_32BIT, true);
+
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+		KUNIT_EXPECT_EQ(test, iova->pfn_lo & (size - 1), 0);
+		KUNIT_EXPECT_EQ(test, iova->pfn_hi - iova->pfn_lo + 1, size);
+		__free_iova(&ctx->iovad, iova);
+		KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+	}
+}
+
+static void test_top_down_preference(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	struct iova *iovas[16];
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(iovas); ++i) {
+		iovas[i] = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+		KUNIT_ASSERT_NOT_NULL(test, iovas[i]);
+		if (i > 0)
+			KUNIT_EXPECT_LT(test, iovas[i]->pfn_lo,
+					iovas[i - 1]->pfn_lo);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	for (i = 0; i < ARRAY_SIZE(iovas); ++i)
+		__free_iova(&ctx->iovad, iovas[i]);
+}
+
+static void test_reserve_iova(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	const unsigned long reserve_lo = TEST_LIMIT_32BIT / 2;
+	struct iova *r, *iova;
+	int i;
+
+	/* Reserve the entire top half through the limit_pfn, inclusive. */
+	r = reserve_iova(&ctx->iovad, reserve_lo, TEST_LIMIT_32BIT);
+	KUNIT_ASSERT_NOT_NULL(test, r);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	/* All allocs must land below the reserved range. */
+	for (i = 0; i < 100; ++i) {
+		iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+		KUNIT_EXPECT_LT(test, iova->pfn_hi, reserve_lo);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+}
+
+/*
+ * The pci_32bit_workaround scenario: every PCI device's first IOVA
+ * allocation hits the 32-bit-restricted path before falling back to
+ * 64-bit. Fill the 64-bit space, then verify a 32-bit alloc still
+ * finds a slot below DMA_BIT_MASK(32).
+ */
+static void test_32bit_in_64bit_domain(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	struct iova *iova;
+	int i;
+
+	for (i = 0; i < 1000; ++i) {
+		iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true);
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_32BIT);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	__free_iova(&ctx->iovad, iova);
+}
+
+/*
+ * Exercise non-power-of-two DMA limits: fill the 64-bit space, then
+ * verify that bounded allocations at 33-bit and 56-bit limits still
+ * find slots within their respective ranges. This confirms the
+ * navigate-to-limit_pfn search generalizes beyond the 32-bit case.
+ */
+static void test_arbitrary_dma_limits(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	struct iova *iova;
+	int i;
+
+	for (i = 0; i < 1000; ++i) {
+		iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_64BIT, true);
+		KUNIT_ASSERT_NOT_NULL(test, iova);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	/* 33-bit bounded allocation */
+	iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_33BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_33BIT);
+	__free_iova(&ctx->iovad, iova);
+
+	/* 56-bit bounded allocation */
+	iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_56BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_LE(test, iova->pfn_hi, TEST_LIMIT_56BIT);
+	__free_iova(&ctx->iovad, iova);
+
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+}
+
+/*
+ * Aligned allocation in a fragmented domain: pack size-2 size_aligned
+ * allocations at the top, free every other one to leave size-2 holes,
+ * then verify a fresh size-2 aligned alloc still succeeds and returns
+ * a 2-aligned pfn.
+ */
+static void test_aligned_in_fragmented(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	const int N = 64;
+	struct iova **iovas;
+	struct iova *iova;
+	int i;
+
+	iovas = kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, iovas);
+
+	for (i = 0; i < N; ++i) {
+		iovas[i] = alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true);
+		KUNIT_ASSERT_NOT_NULL(test, iovas[i]);
+		KUNIT_EXPECT_EQ(test, iovas[i]->pfn_lo & 1, 0);
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	for (i = 0; i < N; i += 2) {
+		__free_iova(&ctx->iovad, iovas[i]);
+		iovas[i] = NULL;
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	iova = alloc_iova(&ctx->iovad, 2, TEST_LIMIT_32BIT, true);
+	KUNIT_ASSERT_NOT_NULL(test, iova);
+	KUNIT_EXPECT_EQ(test, iova->pfn_lo & 1, 0);
+	__free_iova(&ctx->iovad, iova);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	for (i = 0; i < N; ++i)
+		if (iovas[i])
+			__free_iova(&ctx->iovad, iovas[i]);
+}
+
+/*
+ * Mimic dma-iommu's pci_32bit_workaround pattern: every alloc first
+ * tries a small restricted limit; if that fails, retry with the 64-bit
+ * limit. Verifies that the navigate-to-limit search survives rapid
+ * switching between different limit_pfn values.
+ */
+static void test_pci_32bit_workaround_pattern(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	int fallback_count = 0;
+	int i;
+
+	for (i = 0; i < 500; ++i) {
+		unsigned long size = (i % 4) + 1;
+		struct iova *iova = alloc_iova(&ctx->iovad, size,
+					       TEST_LIMIT_32BIT_RESTRICTED,
+					       true);
+
+		if (!iova) {
+			iova = alloc_iova(&ctx->iovad, size,
+					  TEST_LIMIT_64BIT, true);
+			fallback_count++;
+		}
+		if (!iova)
+			break;
+	}
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+	/* Every alloc must succeed (via fallback once the restricted region fills). */
+	KUNIT_EXPECT_EQ(test, i, 500);
+	/* The restricted region is small, so the 64-bit fallback must engage. */
+	KUNIT_EXPECT_GT(test, fallback_count, 0);
+}
+
+/*
+ * Random alloc/free over many iterations, verifying invariants after
+ * every operation. Uses a deterministic PRNG so failures reproduce
+ * across boots. Exercises mixed DMA limits (32, 33, 56, 64-bit).
+ */
+static void test_stress_random(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	const int N = 512;
+	const int iters = 4 * N;
+	const unsigned long limits[] = {
+		TEST_LIMIT_32BIT, TEST_LIMIT_33BIT,
+		TEST_LIMIT_56BIT, TEST_LIMIT_64BIT,
+	};
+	struct iova **iovas;
+	u32 rng = 0xDEADBEEF;
+	int i;
+
+	iovas = kunit_kcalloc(test, N, sizeof(*iovas), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test, iovas);
+
+	for (i = 0; i < iters; ++i) {
+		int slot;
+		unsigned long limit;
+		const char *op;
+
+		rng = rng * 1103515245 + 12345;
+		slot = (rng >> 8) % N;
+		rng = rng * 1103515245 + 12345;
+		limit = limits[(rng >> 8) % ARRAY_SIZE(limits)];
+
+		if (iovas[slot]) {
+			op = "free";
+			__free_iova(&ctx->iovad, iovas[slot]);
+			iovas[slot] = NULL;
+		} else {
+			unsigned long size;
+			bool aligned;
+
+			rng = rng * 1103515245 + 12345;
+			size = 1UL << ((rng >> 8) % 4);
+			rng = rng * 1103515245 + 12345;
+			aligned = (rng >> 8) & 1;
+
+			op = "alloc";
+			iovas[slot] = alloc_iova(&ctx->iovad, size, limit,
+						 aligned);
+		}
+		if (!iova_domain_verify_invariants(&ctx->iovad)) {
+			kunit_info(test, "iter %d slot %d: invariant broken after %s\n",
+				   i, slot, op);
+			KUNIT_FAIL(test, "verify failed");
+			break;
+		}
+	}
+
+	for (i = 0; i < N; ++i)
+		if (iovas[i])
+			__free_iova(&ctx->iovad, iovas[i]);
+}
+
+/*
+ * Verify that alloc_iova fails in bounded time when the IOVA space is
+ * fully packed. Fill a 16K-pfn range with size-1 allocations (leaving
+ * no gaps), then attempt a size-2 aligned alloc. The maple tree's
+ * mas_empty_area_rev must determine there is no suitable gap in
+ * O(log n) time rather than walking every entry. The wall-clock check
+ * is a loose hang detector only (CI under KASAN/lockdep/virt is slow);
+ * the real signal is the reported time and that the alloc fails.
+ */
+static void test_full_space_search_time(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	const unsigned long fill_limit = 16384;
+	const int fill_count = fill_limit;
+	struct iova *iova;
+	ktime_t start, elapsed;
+	int i, allocated = 0;
+
+	for (i = 0; i < fill_count; ++i) {
+		iova = alloc_iova(&ctx->iovad, 1, fill_limit, false);
+		if (!iova)
+			break;
+		allocated++;
+	}
+	kunit_info(test, "allocated %d iovas in [1, %lu]\n",
+		   allocated, fill_limit);
+	KUNIT_ASSERT_GT(test, allocated, 1000);
+
+	start = ktime_get();
+	iova = alloc_iova(&ctx->iovad, 2, fill_limit, true);
+	elapsed = ktime_sub(ktime_get(), start);
+
+	KUNIT_EXPECT_NULL(test, iova);
+	kunit_info(test, "failed alloc took %lld ns\n",
+		   ktime_to_ns(elapsed));
+	/* Loose hang detector, not a perf gate (CI under KASAN/lockdep is slow). */
+	KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 1000000000LL);
+
+	if (iova)
+		__free_iova(&ctx->iovad, iova);
+}
+
+/*
+ * Verify bounded search time with a fragmented 32-bit IOVA space.
+ * Pack the 32-bit range with size-1 allocs, then attempt a large
+ * aligned alloc that must either succeed from a remaining gap or
+ * fail fast. The 64-bit fallback must always succeed promptly.
+ */
+static void test_fragmented_32bit_search(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	struct iova *iova;
+	ktime_t start, elapsed;
+	int i, allocated = 0;
+
+	for (i = 0; i < 8000; ++i) {
+		iova = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+		if (!iova)
+			break;
+		allocated++;
+	}
+	kunit_info(test, "filled 32-bit space with %d allocs\n", allocated);
+	KUNIT_ASSERT_GT(test, allocated, 1000);
+
+	start = ktime_get();
+	iova = alloc_iova(&ctx->iovad, 32, TEST_LIMIT_32BIT, true);
+	elapsed = ktime_sub(ktime_get(), start);
+
+	kunit_info(test, "32-bit alloc (size 32) took %lld ns, result=%s\n",
+		   ktime_to_ns(elapsed), iova ? "alloc" : "fail");
+	/* Loose hang detector, not a perf gate (CI under KASAN/lockdep is slow). */
+	KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 1000000000LL);
+
+	if (iova)
+		__free_iova(&ctx->iovad, iova);
+
+	start = ktime_get();
+	iova = alloc_iova(&ctx->iovad, 32, TEST_LIMIT_64BIT, true);
+	elapsed = ktime_sub(ktime_get(), start);
+
+	kunit_info(test, "64-bit fallback (size 32) took %lld ns\n",
+		   ktime_to_ns(elapsed));
+	/* Loose hang detector, not a perf gate (CI under KASAN/lockdep is slow). */
+	KUNIT_EXPECT_LT(test, ktime_to_ns(elapsed), 1000000000LL);
+
+	if (iova)
+		__free_iova(&ctx->iovad, iova);
+}
+
+static struct kunit_case iova_test_cases[] = {
+	KUNIT_CASE(test_size_aligned),
+	KUNIT_CASE(test_top_down_preference),
+	KUNIT_CASE(test_reserve_iova),
+	KUNIT_CASE(test_32bit_in_64bit_domain),
+	KUNIT_CASE(test_arbitrary_dma_limits),
+	KUNIT_CASE(test_aligned_in_fragmented),
+	KUNIT_CASE(test_pci_32bit_workaround_pattern),
+	KUNIT_CASE(test_stress_random),
+	KUNIT_CASE(test_full_space_search_time),
+	KUNIT_CASE(test_fragmented_32bit_search),
+	{}
+};
+
+static struct kunit_suite iova_test_suite = {
+	.name = "iova",
+	.init = iova_test_init,
+	.exit = iova_test_exit,
+	.test_cases = iova_test_cases,
+};
+kunit_test_suite(iova_test_suite);
+
+MODULE_DESCRIPTION("KUnit tests for the IOVA allocator");
+MODULE_LICENSE("GPL");
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index fd512b33ba32..ca8d82417699 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -890,6 +890,40 @@ void iova_cache_put(void)
 }
 EXPORT_SYMBOL_GPL(iova_cache_put);
 
+#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST)
+bool iova_domain_verify_invariants(struct iova_domain *iovad)
+{
+	struct iova *iova, *prev = NULL;
+	unsigned long flags;
+	bool ok = true;
+	MA_STATE(mas, &iovad->mtree, 0, 0);
+
+	spin_lock_irqsave(&iovad->iova_lock, flags);
+	mas_for_each(&mas, iova, ULONG_MAX) {
+		if (mas.index != iova->pfn_lo || mas.last != iova->pfn_hi) {
+			pr_err("iova_verify: maple index [%lu,%lu] != iova [%lu,%lu]\n",
+			       mas.index, mas.last, iova->pfn_lo, iova->pfn_hi);
+			ok = false;
+		}
+		if (iova->pfn_lo > iova->pfn_hi) {
+			pr_err("iova_verify: pfn_lo=%lu > pfn_hi=%lu\n",
+			       iova->pfn_lo, iova->pfn_hi);
+			ok = false;
+		}
+		if (prev && prev->pfn_hi >= iova->pfn_lo) {
+			pr_err("iova_verify: overlap prev=[%lu,%lu] curr=[%lu,%lu]\n",
+			       prev->pfn_lo, prev->pfn_hi,
+			       iova->pfn_lo, iova->pfn_hi);
+			ok = false;
+		}
+		prev = iova;
+	}
+	spin_unlock_irqrestore(&iovad->iova_lock, flags);
+	return ok;
+}
+EXPORT_SYMBOL_GPL(iova_domain_verify_invariants);
+#endif /* CONFIG_IOMMU_IOVA_KUNIT_TEST */
+
 MODULE_AUTHOR("Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>");
 MODULE_DESCRIPTION("IOMMU I/O Virtual Address management");
 MODULE_LICENSE("GPL");
diff --git a/include/linux/iova.h b/include/linux/iova.h
index eb4f9ead5451..6fc070a4f58e 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -98,6 +98,9 @@ void init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 int iova_domain_init_rcaches(struct iova_domain *iovad);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
+#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST)
+bool iova_domain_verify_invariants(struct iova_domain *iovad);
+#endif
 #else
 static inline int iova_cache_get(void)
 {
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH 3/3] iova: defer maple tree erase on GFP_ATOMIC failure
  2026-06-24  3:07 [PATCH 0/3] convert iova to maple tree Rik van Riel
  2026-06-24  3:07 ` [PATCH 1/3] iova: convert from rbtree " Rik van Riel
  2026-06-24  3:07 ` [PATCH 2/3] iova: add KUnit test suite Rik van Riel
@ 2026-06-24  3:07 ` Rik van Riel
  2 siblings, 0 replies; 4+ messages in thread
From: Rik van Riel @ 2026-06-24  3:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, robin.murphy, joro, will, iommu, jgg, kyle,
	ashok.raj, Rik van Riel

Unlike the old rbtree, where rb_erase() never allocates, removing an
entry from the maple tree can require a node for rebalancing, and
mas_store_gfp(NULL, GFP_ATOMIC) can fail under memory pressure. The
IOVA allocator frees in atomic context (DMA map/unmap may run from
hardirq, softirq, or with spinlocks held), where the erase must not
fail and GFP_KERNEL is not available.

When the GFP_ATOMIC erase fails, overwrite the slot in place with a
marker (IOVA_DEFERRED) and free the struct iova immediately. Unlike
erasing, an in-place store over the entry's existing range changes no
range boundaries, so it needs no node for rebalancing and cannot fail:
storing NULL first runs mas_wr_extend_null(), which can widen the range
and escape the allocation-free wr_exact_fit store type, whereas storing
a non-NULL value over the same range stays wr_exact_fit. The marker is
a non-NULL, non-internal pointer, so the allocator's gap search keeps
the range reserved, while lookups, teardown and the invariant checker
recognise and skip it.

The slots that need a later erase are tracked by a [deferred_lo,
deferred_hi] bounding range in the domain. The deferred erase is
retried, returning the address space for reuse, by the next allocation
in the domain (before it searches for free space) or by the next
successful free, via iova_drain_deferred(), which walks only that
bounding range.

This keeps struct iova at 16 bytes: the rare-path state lives in the
maple tree (the marker) and two unsigned longs in the domain, rather
than a per-iova list node paid by every live mapping. No timer or
workqueue is needed; the erase that can fail is moved off the free path
and onto the allocation path, where running out of memory can simply be
reported to the caller.

In practice GFP_ATOMIC erase failures are rare: the slab allocator
keeps emergency reserves, the common erase case needs no node
allocation at all, and the first allocation attempt is backed by
per-CPU caches. This mechanism is a safety net for the exceptional
case.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel <riel@surriel.com>
---
 drivers/iommu/iova-kunit.c |  70 ++++++++++++++++
 drivers/iommu/iova.c       | 168 +++++++++++++++++++++++++++++++++----
 include/linux/iova.h       |   9 ++
 3 files changed, 233 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/iova-kunit.c b/drivers/iommu/iova-kunit.c
index bf37c5102e6e..317b46a45e5e 100644
--- a/drivers/iommu/iova-kunit.c
+++ b/drivers/iommu/iova-kunit.c
@@ -413,6 +413,74 @@ static void test_fragmented_32bit_search(struct kunit *test)
 		__free_iova(&ctx->iovad, iova);
 }
 
+/*
+ * Exercise the deferred-erase path: remove_iova() failing to erase under
+ * GFP_ATOMIC leaves an IOVA_DEFERRED marker in the tree and frees the struct
+ * iova immediately. iova_kunit_defer_erase makes that failure deterministic.
+ * Verify that while marked the range looks free to lookups yet stays reserved,
+ * that invariants hold, and that the next allocation drains the marker and
+ * reuses the space.
+ */
+static void test_deferred_erase(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	struct iova *a, *b;
+	unsigned long pfn;
+
+	a = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+	KUNIT_ASSERT_NOT_NULL(test, a);
+	pfn = a->pfn_lo;
+
+	/* Free 'a', forcing the erase to be deferred (marker left behind). */
+	iova_kunit_defer_erase = true;
+	__free_iova(&ctx->iovad, a);
+	iova_kunit_defer_erase = false;
+
+	/*
+	 * The erase was deferred, not performed: a marker now occupies the slot,
+	 * so the backlog records the deferral and the pfn looks absent to lookups,
+	 * while the tree stays consistent with the marker present.
+	 */
+	KUNIT_EXPECT_TRUE(test, iova_domain_has_deferred(&ctx->iovad));
+	KUNIT_EXPECT_NULL(test, find_iova(&ctx->iovad, pfn));
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	/*
+	 * The next allocation drains deferred markers before searching, so the
+	 * backlog clears and the marked range is reclaimed; a top-down size-1
+	 * alloc reuses exactly the pfn that was freed.
+	 */
+	b = alloc_iova(&ctx->iovad, 1, TEST_LIMIT_32BIT, false);
+	KUNIT_ASSERT_NOT_NULL(test, b);
+	KUNIT_EXPECT_FALSE(test, iova_domain_has_deferred(&ctx->iovad));
+	KUNIT_EXPECT_EQ(test, b->pfn_lo, pfn);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+
+	__free_iova(&ctx->iovad, b);
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+}
+
+/*
+ * Tearing down a domain that still holds an undrained IOVA_DEFERRED marker must
+ * skip the marker (it is static storage, not a heap iova) and not crash or
+ * double-free. Leave a marker live for iova_test_exit()'s put_iova_domain().
+ */
+static void test_deferred_erase_teardown(struct kunit *test)
+{
+	struct iova_test_ctx *ctx = test->priv;
+	struct iova *a;
+
+	a = alloc_iova(&ctx->iovad, 4, TEST_LIMIT_32BIT, false);
+	KUNIT_ASSERT_NOT_NULL(test, a);
+
+	iova_kunit_defer_erase = true;
+	__free_iova(&ctx->iovad, a);
+	iova_kunit_defer_erase = false;
+
+	/* Marker left live; the suite's exit -> put_iova_domain must cope. */
+	KUNIT_EXPECT_TRUE(test, iova_domain_verify_invariants(&ctx->iovad));
+}
+
 static struct kunit_case iova_test_cases[] = {
 	KUNIT_CASE(test_size_aligned),
 	KUNIT_CASE(test_top_down_preference),
@@ -424,6 +492,8 @@ static struct kunit_case iova_test_cases[] = {
 	KUNIT_CASE(test_stress_random),
 	KUNIT_CASE(test_full_space_search_time),
 	KUNIT_CASE(test_fragmented_32bit_search),
+	KUNIT_CASE(test_deferred_erase),
+	KUNIT_CASE(test_deferred_erase_teardown),
 	{}
 };
 
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index ca8d82417699..e0680de5311f 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,24 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_global_cached_iovas(struct iova_domain *iovad);
+static void iova_drain_deferred(struct iova_domain *iovad);
+
+/*
+ * Placed in a maple-tree slot when an erase had to be deferred because a node
+ * allocation failed under GFP_ATOMIC (see remove_iova()). Overwriting a slot in
+ * place needs no allocation, so this marker can be stored even when the erase
+ * itself could not. It is a non-NULL, 8-byte-aligned (hence non-internal to the
+ * maple tree) pointer, so the allocator's gap search keeps the range reserved,
+ * while lookups, teardown and the invariant checker recognise and skip it.
+ * iova_drain_deferred() erases it once a node allocation succeeds.
+ */
+static const unsigned long iova_deferred_marker;
+#define IOVA_DEFERRED		((struct iova *)&iova_deferred_marker)
+
+static inline bool iova_has_deferred(const struct iova_domain *iovad)
+{
+	return iovad->deferred_lo <= iovad->deferred_hi;
+}
 
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
@@ -52,6 +70,8 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->start_pfn = start_pfn;
 	iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
 	iovad->max32_alloc_size = iovad->dma_32bit_pfn;
+	iovad->deferred_lo = ULONG_MAX;
+	iovad->deferred_hi = 0;
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
 
@@ -73,6 +93,9 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	}
 
 	spin_lock_irqsave(&iovad->iova_lock, flags);
+	/* Reclaim any deferred frees so their address space becomes available. */
+	if (unlikely(iova_has_deferred(iovad)))
+		iova_drain_deferred(iovad);
 	/*
 	 * Fast-fail a 32-bit request once one of the same-or-larger size has
 	 * already failed, without searching. The hint is read under the lock
@@ -175,21 +198,117 @@ static struct iova *
 private_find_iova(struct iova_domain *iovad, unsigned long pfn)
 {
 	MA_STATE(mas, &iovad->mtree, pfn, pfn);
+	struct iova *iova;
 
 	assert_spin_locked(&iovad->iova_lock);
-	return mas_walk(&mas);
+	iova = mas_walk(&mas);
+	/* A deferred-erase marker is not a live iova; treat it as absent. */
+	if (iova == IOVA_DEFERRED)
+		return NULL;
+	return iova;
 }
 
+/*
+ * Erase the deferred-removal markers left by remove_iova() when a GFP_ATOMIC
+ * node allocation was unavailable. The markers lie within the bounding range
+ * [deferred_lo, deferred_hi]; walk it erasing each one. A failed erase means no
+ * atomic memory is available right now, and no reclaim can happen under the
+ * lock, so the remaining erases would fail too: stop and keep everything from
+ * that point up still deferred, to be retried by a later allocation or free.
+ *
+ * Caller must hold iova_lock.
+ */
+static void iova_drain_deferred(struct iova_domain *iovad)
+{
+	unsigned long hi = iovad->deferred_hi;
+	void *entry;
+
+	MA_STATE(mas, &iovad->mtree, iovad->deferred_lo, hi);
+
+	assert_spin_locked(&iovad->iova_lock);
+
+	while ((entry = mas_find(&mas, hi)) != NULL) {
+		unsigned long lo = mas.index, last = mas.last;
+
+		if (entry == IOVA_DEFERRED) {
+			mas_set_range(&mas, lo, last);
+			if (mas_store_gfp(&mas, NULL, GFP_ATOMIC)) {
+				/*
+				 * No atomic memory now; further erases under
+				 * this lock would fail too (no reclaim happens
+				 * here). Everything from here up is still
+				 * deferred; retry on a later allocation or free.
+				 */
+				iovad->deferred_lo = lo;
+				iovad->deferred_hi = hi;
+				return;
+			}
+			/* The store may move the iterator; re-anchor past it. */
+			mas_set(&mas, last + 1);
+		}
+
+		if (last >= hi)
+			break;
+	}
+
+	/* All markers erased: the backlog is empty. */
+	iovad->deferred_lo = ULONG_MAX;
+	iovad->deferred_hi = 0;
+}
+
+#if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST)
+/*
+ * Test hook: when set, force remove_iova() down the deferred-erase path as if
+ * the GFP_ATOMIC node allocation for the erase had failed. Lets the KUnit suite
+ * exercise the IOVA_DEFERRED marker and iova_drain_deferred() deterministically.
+ */
+bool iova_kunit_defer_erase;
+EXPORT_SYMBOL_GPL(iova_kunit_defer_erase);
+#else
+#define iova_kunit_defer_erase false
+#endif
+
+/*
+ * Remove an IOVA entry from the maple tree and free it.
+ *
+ * This runs in atomic context (DMA map/unmap can be called from hardirq,
+ * softirq, or with spinlocks held) and must not fail. Erasing an entry can
+ * require a maple tree node for rebalancing, and mas_store_gfp(NULL,
+ * GFP_ATOMIC) can fail under memory pressure. When it does, overwrite the slot
+ * in place with IOVA_DEFERRED -- an in-place store needs no node allocation and
+ * so cannot fail -- which keeps the address range reserved (the allocator's gap
+ * search treats the non-NULL slot as occupied) and lets the struct iova be
+ * freed now. The marker is erased later by iova_drain_deferred().
+ */
 static void remove_iova(struct iova_domain *iovad, struct iova *iova)
 {
-	MA_STATE(mas, &iovad->mtree, iova->pfn_lo, iova->pfn_hi);
+	unsigned long pfn_lo = iova->pfn_lo, pfn_hi = iova->pfn_hi;
+
+	MA_STATE(mas, &iovad->mtree, pfn_lo, pfn_hi);
 
 	assert_spin_locked(&iovad->iova_lock);
 
-	if (iova->pfn_lo < iovad->dma_32bit_pfn)
+	if (pfn_lo < iovad->dma_32bit_pfn)
 		iovad->max32_alloc_size = iovad->dma_32bit_pfn;
 
-	mas_store_gfp(&mas, NULL, GFP_ATOMIC);
+	if (iova_kunit_defer_erase || mas_store_gfp(&mas, NULL, GFP_ATOMIC)) {
+		/* Erase failed: mark the slot in place and defer removal. */
+		mas_set_range(&mas, pfn_lo, pfn_hi);
+		if (WARN_ON_ONCE(mas_store_gfp(&mas, IOVA_DEFERRED, GFP_ATOMIC)))
+			return;	/* in-place store cannot fail; entry stays put */
+		if (pfn_lo < iovad->deferred_lo)
+			iovad->deferred_lo = pfn_lo;
+		if (pfn_hi > iovad->deferred_hi)
+			iovad->deferred_hi = pfn_hi;
+		free_iova_mem(iova);
+		return;
+	}
+
+	free_iova_mem(iova);
+
+	/* A successful erase means memory is available; clear any backlog. */
+	if (unlikely(iova_has_deferred(iovad)))
+		iova_drain_deferred(iovad);
 }
 
 /**
@@ -225,7 +344,6 @@ __free_iova(struct iova_domain *iovad, struct iova *iova)
 	spin_lock_irqsave(&iovad->iova_lock, flags);
 	remove_iova(iovad, iova);
 	spin_unlock_irqrestore(&iovad->iova_lock, flags);
-	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(__free_iova);
 
@@ -244,13 +362,9 @@ free_iova(struct iova_domain *iovad, unsigned long pfn)
 
 	spin_lock_irqsave(&iovad->iova_lock, flags);
 	iova = private_find_iova(iovad, pfn);
-	if (!iova) {
-		spin_unlock_irqrestore(&iovad->iova_lock, flags);
-		return;
-	}
-	remove_iova(iovad, iova);
+	if (iova)
+		remove_iova(iovad, iova);
 	spin_unlock_irqrestore(&iovad->iova_lock, flags);
-	free_iova_mem(iova);
 }
 EXPORT_SYMBOL_GPL(free_iova);
 
@@ -342,8 +456,14 @@ void put_iova_domain(struct iova_domain *iovad)
 	if (iovad->rcaches)
 		iova_domain_free_rcaches(iovad);
 
+	/*
+	 * Free the iovas. We can skip the IOVA_DEFERRED markers, because
+	 * __mt_destroy() will tear down the maple tree nodes without regard
+	 * for special leaf node values.
+	 */
 	mas_for_each(&mas, iova, ULONG_MAX)
-		free_iova_mem(iova);
+		if (iova != IOVA_DEFERRED)
+			free_iova_mem(iova);
 	__mt_destroy(&iovad->mtree);
 }
 EXPORT_SYMBOL_GPL(put_iova_domain);
@@ -391,12 +511,22 @@ reserve_iova(struct iova_domain *iovad,
 
 	spin_lock_irqsave(&iovad->iova_lock, flags);
 
+	/* Resolve any deferred erases so the walk below sees only live iovas. */
+	if (unlikely(iova_has_deferred(iovad)))
+		iova_drain_deferred(iovad);
+
 	/*
 	 * Compute the range covering all overlaps, but do not free the
 	 * overlapping iovas yet: the merged store below can fail, and freeing
 	 * before a failed store would leave dangling pointers in the tree.
 	 */
 	mas_for_each(&mas, overlap, pfn_hi) {
+		/*
+		 * A deferred-erase marker isn't a real iova; the merged-range
+		 * store below spans and overwrites it, so skip it here.
+		 */
+		if (overlap == IOVA_DEFERRED)
+			continue;
 		/* Fully covered by an existing reservation: hand it back. */
 		if (pfn_lo >= overlap->pfn_lo && pfn_hi <= overlap->pfn_hi) {
 			iova = overlap;
@@ -425,7 +555,8 @@ reserve_iova(struct iova_domain *iovad,
 
 	mas_set_range(&fmas, merged_lo, merged_hi);
 	mas_for_each(&fmas, overlap, merged_hi)
-		free_iova_mem(overlap);
+		if (overlap != IOVA_DEFERRED)
+			free_iova_mem(overlap);
 
 	mas_store_prealloc(&mas, iova);
 out:
@@ -515,7 +646,6 @@ iova_magazine_free_pfns(struct iova_magazine *mag, struct iova_domain *iovad)
 			continue;
 
 		remove_iova(iovad, iova);
-		free_iova_mem(iova);
 	}
 
 	spin_unlock_irqrestore(&iovad->iova_lock, flags);
@@ -900,6 +1030,9 @@ bool iova_domain_verify_invariants(struct iova_domain *iovad)
 
 	spin_lock_irqsave(&iovad->iova_lock, flags);
 	mas_for_each(&mas, iova, ULONG_MAX) {
+		/* Deferred-erase markers occupy a slot but aren't real iovas. */
+		if (iova == IOVA_DEFERRED)
+			continue;
 		if (mas.index != iova->pfn_lo || mas.last != iova->pfn_hi) {
 			pr_err("iova_verify: maple index [%lu,%lu] != iova [%lu,%lu]\n",
 			       mas.index, mas.last, iova->pfn_lo, iova->pfn_hi);
@@ -922,6 +1055,13 @@ bool iova_domain_verify_invariants(struct iova_domain *iovad)
 	return ok;
 }
 EXPORT_SYMBOL_GPL(iova_domain_verify_invariants);
+
+/* Test accessor: is there an outstanding deferred-erase backlog? */
+bool iova_domain_has_deferred(struct iova_domain *iovad)
+{
+	return iova_has_deferred(iovad);
+}
+EXPORT_SYMBOL_GPL(iova_domain_has_deferred);
 #endif /* CONFIG_IOMMU_IOVA_KUNIT_TEST */
 
 MODULE_AUTHOR("Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>");
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 6fc070a4f58e..1f54ff3e2fb4 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -31,6 +31,13 @@ struct iova_domain {
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
 	unsigned long	max32_alloc_size; /* Size of last failed allocation */
+	/*
+	 * Bounding range of slots whose erase was deferred because a node
+	 * allocation failed under GFP_ATOMIC; deferred_lo > deferred_hi when
+	 * there are none. See remove_iova()/iova_drain_deferred().
+	 */
+	unsigned long	deferred_lo;
+	unsigned long	deferred_hi;
 
 	struct iova_rcache	*rcaches;
 	struct hlist_node	cpuhp_dead;
@@ -100,6 +107,8 @@ struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
 #if IS_ENABLED(CONFIG_IOMMU_IOVA_KUNIT_TEST)
 bool iova_domain_verify_invariants(struct iova_domain *iovad);
+bool iova_domain_has_deferred(struct iova_domain *iovad);
+extern bool iova_kunit_defer_erase;
 #endif
 #else
 static inline int iova_cache_get(void)
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-24  3:09 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-24  3:07 [PATCH 0/3] convert iova to maple tree Rik van Riel
2026-06-24  3:07 ` [PATCH 1/3] iova: convert from rbtree " Rik van Riel
2026-06-24  3:07 ` [PATCH 2/3] iova: add KUnit test suite Rik van Riel
2026-06-24  3:07 ` [PATCH 3/3] iova: defer maple tree erase on GFP_ATOMIC failure Rik van Riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.