From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org,
iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com,
Rik van Riel <riel@meta.com>, Rik van Riel <riel@surriel.com>
Subject: [PATCH 4/5] iova: alignment-aware two-phase search for log(n) aligned alloc
Date: Tue, 12 May 2026 22:00:21 -0400
Message-ID: <20260513020304.1528751-5-riel@surriel.com>
In-Reply-To: <20260513020304.1528751-1-riel@surriel.com>
From: Rik van Riel <riel@meta.com>
After the limit_pfn-aware augmentation, __iova_search_free_gap is
O(log n) for both 64-bit and 32-bit alloc paths -- but only when
ignoring alignment. The augmented invariant prunes by gap size; it
doesn't know about the per-call align_mask. A subtree whose
__subtree_max_gap >= size can still contain only borderline-sized gaps
that fail the alignment check, forcing the search to visit every node
in the subtree.
For a request of S pages with alignment A (a power of two), a gap of
size G is GUARANTEED to contain an aligned-fitting position iff
G >= S + A - 1. (Worst case: the highest aligned position,
round_down(gap_hi - S + 1, A), loses up to A-1 pfns to rounding.)
Two-phase search:
Phase 1: prune by S + A - 1. Every visited node either prunes its
subtree or descends into one whose subtree_max guarantees a
fitting candidate. Strict O(log n).
Phase 2: only if phase 1 returned NULL. Prune by S, the existing
behaviour. Catches fortuitously-aligned borderline gaps.
O(n) worst case but only triggers when no comfortable gap
exists anywhere in the tree.
For typical IOMMU workloads (alloc_iova_fast rounds size up to a power
of two, so A == S, and most allocations are 1-16 pages) phase 1
succeeds and the search is a single O(log n) descent. The borderline
case is a network/storage workload that fragments the space into small
power-of-two gaps: of the marginal (exactly size-S) gaps, half are
misaligned at size 2, 75% at size 4, and ~94% at size 16 -- exactly
the case phase 2 exists to catch.
Implementation: __iova_search_free_gap takes a separate prune_size
distinct from the size used for fitting and pfn computation, so phase
1 can prune more aggressively while still returning the correct
top-aligned pfn within whichever gap it finds. The non-aligned alloc
path (size_aligned == false, A == 1) collapses to a single phase since
strict_size == size.
Assisted-by: Claude:claude-opus-4.7
Signed-off-by: Rik van Riel <riel@surriel.com>
---
drivers/iommu/iova.c | 65 +++++++++++++++++++++++++++++++++-----------
1 file changed, 49 insertions(+), 16 deletions(-)
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 7787da8b2dad..4f3465d8ee16 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -210,18 +210,29 @@ iova_insert_rbtree(struct iova_domain *iovad, struct iova *iova,
}
/*
- * Search the augmented rbtree for the highest-addressed free gap of at least
- * @size pages, with the allocation fitting below @limit_pfn and at or above
- * @start_pfn. When @is_32bit, prune by the 32-bit-clamped subtree max so
- * that 32-bit-restricted allocations on a domain dominated by high-pfn
+ * Search the augmented rbtree for the highest-addressed free gap that fits
+ * @size pages with @align_mask alignment, below @limit_pfn and at or above
+ * @start_pfn.
+ *
+ * @prune_size is the threshold compared against the augmented subtree max:
+ * - If @prune_size == @size, every subtree containing any size-fitting
+ * gap is descended (lax search; can visit O(n) nodes when alignment or
+ * limit_pfn rejects most candidates).
+ * - If @prune_size == @size + alignment - 1, only subtrees containing a
+ * gap big enough to GUARANTEE an aligned fit are descended (strict
+ * search; O(log n) but may miss borderline-sized fortuitously aligned
+ * gaps).
+ *
+ * When @is_32bit, prune by the 32-bit-clamped subtree max so that
+ * 32-bit-restricted allocations on a domain dominated by high-pfn
* allocations stay O(log n) instead of degrading to O(n). Returns the node
* whose gap_to_prev is used, or NULL.
*/
static struct rb_node *
__iova_search_free_gap(struct rb_node *node, unsigned long size,
- unsigned long limit_pfn, unsigned long start_pfn,
- unsigned long align_mask, bool is_32bit,
- unsigned long *new_pfn)
+ unsigned long prune_size, unsigned long limit_pfn,
+ unsigned long start_pfn, unsigned long align_mask,
+ bool is_32bit, unsigned long *new_pfn)
{
struct iova *iova;
struct rb_node *result;
@@ -233,11 +244,12 @@ __iova_search_free_gap(struct rb_node *node, unsigned long size,
iova = to_iova(node);
subtree_max = is_32bit ? iova->__subtree_max_gap32
: iova->__subtree_max_gap;
- if (subtree_max < size)
+ if (subtree_max < prune_size)
return NULL;
- result = __iova_search_free_gap(node->rb_right, size, limit_pfn,
- start_pfn, align_mask, is_32bit, new_pfn);
+ result = __iova_search_free_gap(node->rb_right, size, prune_size,
+ limit_pfn, start_pfn, align_mask,
+ is_32bit, new_pfn);
if (result)
return result;
@@ -257,8 +269,9 @@ __iova_search_free_gap(struct rb_node *node, unsigned long size,
}
}
- return __iova_search_free_gap(node->rb_left, size, limit_pfn,
- start_pfn, align_mask, is_32bit, new_pfn);
+ return __iova_search_free_gap(node->rb_left, size, prune_size,
+ limit_pfn, start_pfn, align_mask,
+ is_32bit, new_pfn);
}
static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
@@ -268,11 +281,24 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
unsigned long flags;
unsigned long new_pfn;
unsigned long align_mask = ~0UL;
+ unsigned long strict_size = size;
struct rb_node *gap_node;
bool is_32bit;
- if (size_aligned)
- align_mask <<= fls_long(size - 1);
+ if (size_aligned) {
+ unsigned long align_shift = fls_long(size - 1);
+
+ align_mask <<= align_shift;
+ /*
+ * For an A-aligned allocation of S pages, a gap of size G is
+ * guaranteed to contain a fitting aligned position iff
+ * G >= S + A - 1. Use that as the strict pruning threshold for
+ * a fast O(log n) first pass; if it fails, fall back to a
+ * pruning threshold of just S to catch fortuitously-aligned
+ * borderline gaps.
+ */
+ strict_size = size + (1UL << align_shift) - 1;
+ }
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
is_32bit = limit_pfn <= iovad->dma_32bit_pfn;
@@ -280,8 +306,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
goto iova32_full;
gap_node = __iova_search_free_gap(iovad->rbroot.rb_node, size,
- limit_pfn, iovad->start_pfn,
- align_mask, is_32bit, &new_pfn);
+ strict_size, limit_pfn,
+ iovad->start_pfn, align_mask,
+ is_32bit, &new_pfn);
+ if (!gap_node && strict_size > size) {
+ gap_node = __iova_search_free_gap(iovad->rbroot.rb_node, size,
+ size, limit_pfn,
+ iovad->start_pfn, align_mask,
+ is_32bit, &new_pfn);
+ }
if (!gap_node) {
if (is_32bit)
iovad->max32_alloc_size = size;
--
2.52.0