From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: robin.murphy@arm.com, joro@8bytes.org, will@kernel.org,
iommu@lists.linux.dev, kyle@mcmartin.ca, kernel-team@meta.com,
Rik van Riel <riel@meta.com>, Rik van Riel <riel@surriel.com>
Subject: [PATCH 4/5] iova: alignment-aware two-phase search for log(n) aligned alloc
Date: Tue, 12 May 2026 22:00:21 -0400
Message-ID: <20260513020304.1528751-5-riel@surriel.com>
In-Reply-To: <20260513020304.1528751-1-riel@surriel.com>

From: Rik van Riel <riel@meta.com>

After the limit_pfn-aware augmentation, __iova_search_free_gap is
O(log n) for both 64-bit and 32-bit alloc paths -- but only when
ignoring alignment. The augmented invariant prunes by gap size; it
doesn't know about the per-call align_mask. A subtree whose
__subtree_max_gap >= size can still contain only borderline-sized gaps
that fail the alignment check, forcing the search to visit every node
in the subtree.

For a request of S pages with alignment A (a power of two), a gap of
size G is GUARANTEED to contain an aligned-fitting position iff
G >= S + A - 1. (Worst case: the highest aligned position
round_down(gap_hi - S + 1, A) loses up to A-1 pfns to rounding.)
Two-phase search:

Phase 1: prune by S + A - 1. Every visited node either prunes its
  subtree or descends into one whose subtree_max guarantees a
  fitting candidate. Strictly O(log n).

Phase 2: only if phase 1 returned NULL. Prune by S, the existing
  behaviour. Catches fortuitously-aligned borderline gaps. O(n)
  worst case, but it only triggers when no comfortable gap exists
  anywhere in the tree.
For typical IOMMU workloads (alloc_iova_fast() rounds size up to a
power of two, so A == S, and most allocations are 1-16 pages), phase 1
succeeds and the search is a single O(log n) descent. The borderline
case is a network/storage workload that fragments the space into
small power-of-two gaps, of which half (size 2), 75% (size 4), or
more (size 16) are statistically misaligned -- exactly the case
phase 2 catches.

Implementation: __iova_search_free_gap takes a separate prune_size
distinct from the size used for fitting and pfn computation, so phase
1 can prune more aggressively while still returning the correct
top-aligned pfn within whichever gap it finds. The non-aligned alloc
path (size_aligned == false, A == 1) collapses to a single phase since
strict_size == size.

Assisted-by: Claude:claude-opus-4.7
Signed-off-by: Rik van Riel <riel@surriel.com>
---
drivers/iommu/iova.c | 65 +++++++++++++++++++++++++++++++++-----------
1 file changed, 49 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 7787da8b2dad..4f3465d8ee16 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -210,18 +210,29 @@ iova_insert_rbtree(struct iova_domain *iovad, struct iova *iova,
}
/*
- * Search the augmented rbtree for the highest-addressed free gap of at least
- * @size pages, with the allocation fitting below @limit_pfn and at or above
- * @start_pfn. When @is_32bit, prune by the 32-bit-clamped subtree max so
- * that 32-bit-restricted allocations on a domain dominated by high-pfn
+ * Search the augmented rbtree for the highest-addressed free gap that fits
+ * @size pages with @align_mask alignment, below @limit_pfn and at or above
+ * @start_pfn.
+ *
+ * @prune_size is the threshold compared against the augmented subtree max:
+ * - If @prune_size == @size, every subtree containing any size-fitting
+ * gap is descended (lax search; can visit O(n) nodes when alignment or
+ * limit_pfn rejects most candidates).
+ * - If @prune_size == @size + alignment - 1, only subtrees containing a
+ * gap big enough to GUARANTEE an aligned fit are descended (strict
+ * search; O(log n) but may miss borderline-sized fortuitously aligned
+ * gaps).
+ *
+ * When @is_32bit, prune by the 32-bit-clamped subtree max so that
+ * 32-bit-restricted allocations on a domain dominated by high-pfn
* allocations stay O(log n) instead of degrading to O(n). Returns the node
* whose gap_to_prev is used, or NULL.
*/
static struct rb_node *
__iova_search_free_gap(struct rb_node *node, unsigned long size,
- unsigned long limit_pfn, unsigned long start_pfn,
- unsigned long align_mask, bool is_32bit,
- unsigned long *new_pfn)
+ unsigned long prune_size, unsigned long limit_pfn,
+ unsigned long start_pfn, unsigned long align_mask,
+ bool is_32bit, unsigned long *new_pfn)
{
struct iova *iova;
struct rb_node *result;
@@ -233,11 +244,12 @@ __iova_search_free_gap(struct rb_node *node, unsigned long size,
iova = to_iova(node);
subtree_max = is_32bit ? iova->__subtree_max_gap32
: iova->__subtree_max_gap;
- if (subtree_max < size)
+ if (subtree_max < prune_size)
return NULL;
- result = __iova_search_free_gap(node->rb_right, size, limit_pfn,
- start_pfn, align_mask, is_32bit, new_pfn);
+ result = __iova_search_free_gap(node->rb_right, size, prune_size,
+ limit_pfn, start_pfn, align_mask,
+ is_32bit, new_pfn);
if (result)
return result;
@@ -257,8 +269,9 @@ __iova_search_free_gap(struct rb_node *node, unsigned long size,
}
}
- return __iova_search_free_gap(node->rb_left, size, limit_pfn,
- start_pfn, align_mask, is_32bit, new_pfn);
+ return __iova_search_free_gap(node->rb_left, size, prune_size,
+ limit_pfn, start_pfn, align_mask,
+ is_32bit, new_pfn);
}
static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
@@ -268,11 +281,24 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
unsigned long flags;
unsigned long new_pfn;
unsigned long align_mask = ~0UL;
+ unsigned long strict_size = size;
struct rb_node *gap_node;
bool is_32bit;
- if (size_aligned)
- align_mask <<= fls_long(size - 1);
+ if (size_aligned) {
+ unsigned long align_shift = fls_long(size - 1);
+
+ align_mask <<= align_shift;
+ /*
+ * For an A-aligned allocation of S pages, a gap of size G is
+ * guaranteed to contain a fitting aligned position iff
+ * G >= S + A - 1. Use that as the strict pruning threshold for
+ * a fast O(log n) first pass; if it fails, fall back to a
+ * pruning threshold of just S to catch fortuitously-aligned
+ * borderline gaps.
+ */
+ strict_size = size + (1UL << align_shift) - 1;
+ }
spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
is_32bit = limit_pfn <= iovad->dma_32bit_pfn;
@@ -280,8 +306,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
goto iova32_full;
gap_node = __iova_search_free_gap(iovad->rbroot.rb_node, size,
- limit_pfn, iovad->start_pfn,
- align_mask, is_32bit, &new_pfn);
+ strict_size, limit_pfn,
+ iovad->start_pfn, align_mask,
+ is_32bit, &new_pfn);
+ if (!gap_node && strict_size > size) {
+ gap_node = __iova_search_free_gap(iovad->rbroot.rb_node, size,
+ size, limit_pfn,
+ iovad->start_pfn, align_mask,
+ is_32bit, &new_pfn);
+ }
if (!gap_node) {
if (is_32bit)
iovad->max32_alloc_size = size;
--
2.52.0
Thread overview:
2026-05-13 2:00 [PATCH 0/5] iova augmented rbtree O(log n) alloc_iova Rik van Riel
2026-05-13 2:00 ` [PATCH 1/5] iova: switch to augmented rbtree for log(n) allocation Rik van Riel
2026-05-13 2:00 ` [PATCH 2/5] iova: drop dead cached_node / cached32_node infrastructure Rik van Riel
2026-05-13 2:00 ` [PATCH 3/5] iova: limit_pfn-aware augmentation for log(n) 32-bit alloc Rik van Riel
2026-05-13 2:00 ` [PATCH 4/5] iova: alignment-aware two-phase search for log(n) aligned alloc Rik van Riel [this message]
2026-05-13 2:00 ` [PATCH 5/5] iova: add KUnit test suite Rik van Riel