From: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
Uladzislau Rezki <urezki@gmail.com>,
"Liam R. Howlett" <liam@infradead.org>,
Alice Ryhl <aliceryhl@google.com>,
Andrew Ballance <andrewjballance@gmail.com>
Cc: linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, maple-tree@lists.infradead.org,
Lorenzo Stoakes <ljs@kernel.org>,
Pranjal Shrivastava <praan@google.com>,
Will Deacon <will@kernel.org>,
Suzuki K Poulose <Suzuki.Poulose@arm.com>,
Neil Armstrong <neil.armstrong@linaro.org>,
Mostafa Saleh <smostafa@google.com>,
Balbir Singh <balbirs@nvidia.com>,
Suren Baghdasaryan <surenb@google.com>,
Marco Elver <elver@google.com>,
Dmitry Vyukov <dvyukov@google.com>,
Alexander Potapenko <glider@google.com>,
Shuah Khan <shuah@kernel.org>, Dev Jain <dev.jain@arm.com>,
Brendan Jackman <jackmanb@google.com>,
Puranjay Mohan <puranjay@kernel.org>,
Santosh Shukla <santosh.shukla@amd.com>,
Wyes Karny <wkarny@gmail.com>,
Pranjal Arya <pranjal.arya@oss.qualcomm.com>,
Sudeep Holla <sudeep.holla@kernel.org>
Subject: [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path
Date: Sat, 13 Jun 2026 22:49:49 +0530 [thread overview]
Message-ID: <20260613-vmalloc_maple-v1-7-0aa740bb944b@oss.qualcomm.com> (raw)
In-Reply-To: <20260613-vmalloc_maple-v1-0-0aa740bb944b@oss.qualcomm.com>
The dual-tree design (free_vmap_area_mt + occupied_vmap_area_mt
maintained in lock-step) costs roughly twice the maple operations
per allocation lifecycle that the rb_tree path it replaced used.
Strip the maintenance back to a single authoritative tree on the
steady-state hot path.
After this patch:
- The occupied tree is the source of truth for in-use ranges on
the alloc/free hot path.
- free_vmap_area_mt is still maintained on the slow paths
(vmap_init_free_space, pcpu_get_vm_areas's top-down walk,
decay_va_pool_node), but the steady-state alloc/free no longer
has to keep both trees in lock-step.
- This removes ~half of the maple operations a typical
vmalloc/vfree cycle performs.
The pcpu top-down walk relies on the assumption that chunks consume
addresses bottom-up, so stale free-tree entries at low addresses
never collide with pcpu's chosen base. This is documented at the
relevant call site.
Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
mm/vmalloc.c | 179 +++++++++++++++++++++++++++++++++--------------------------
1 file changed, 99 insertions(+), 80 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5bc1e47c456a..73a40a88dbf6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1767,17 +1767,32 @@ occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
{
MA_STATE(mas, &occupied_vmap_area_mt, 0, 0);
unsigned long search = min;
+ unsigned long search_len = size;
unsigned long hole_end;
bool retry_empty;
lockdep_assert_held(&free_vmap_area_lock);
retry_empty = list_empty(&vmap_retry_list);
+ /*
+ * Pad the gap-find by align-1 when align exceeds PAGE_SIZE so that
+ * any alignment slack inside the returned gap can be absorbed
+ * without an extra outer-loop iteration. Without this padding, the
+ * loop has to scan past every page-aligned gap that is large enough
+ * for @size but too small for the aligned start, which is O(K) in
+ * the number of such gaps and pathological for big alignments on a
+ * fragmented occupied tree.
+ */
+ if (align > PAGE_SIZE) {
+ if (check_add_overflow(size, align - 1, &search_len))
+ return false;
+ }
+
while (search <= max) {
unsigned long candidate, candidate_end;
mas_set(&mas, search);
- if (mas_empty_area(&mas, search, max, size))
+ if (mas_empty_area(&mas, search, max, search_len))
return false;
hole_end = min(mas.last, max);
@@ -2182,39 +2197,35 @@ rollback_busy_insert_failed_alloc_locked(struct vmap_area *va)
}
/*
- * Reinsert @va into the free index after occupied erase. On failure, place the
- * range on the non-index retry queue and best-effort restore occupied tracking.
+ * Release @va after the caller has erased it from occupied_vmap_area_mt.
+ * In the occupied-only design there is no free index to track free space
+ * with vmap_area objects: the range becomes implicitly free as soon as
+ * the occupied marker is gone. The struct itself is recycled to the slab.
*
- * Return: free-tracked @va on success, NULL when queued for retry.
+ * The signature returns @va on success (matching the pre-rewrite contract
+ * used by the synchronous free_vmap_area() path) so the caller can decide
+ * whether further bookkeeping is needed.
*/
-static __always_inline struct vmap_area *
-reinsert_or_queue_vmap_area_locked(struct vmap_area *va)
+static __always_inline void
+release_drained_vmap_area_locked(struct vmap_area *va)
{
- struct vmap_area *tracked;
-
lockdep_assert_held(&free_vmap_area_lock);
- tracked = merge_or_add_vmap_area_free_locked(va);
- if (tracked)
- return tracked;
-
- if (insert_vmap_area_free_locked(va))
- return va;
-
- /*
- * Retry queue acts as allocation exclusion even if occupied restore
- * fails under pressure.
- */
- if (WARN_ON_ONCE(!occupied_mt_store_va_locked(va)))
- INIT_LIST_HEAD(&va->list);
-
- retry_queue_add_va_locked(va);
- return NULL;
+ kmem_cache_free(vmap_area_cachep, va);
}
/*
* Returns a start address of the newly allocated area, if success.
* Otherwise an error value is returned that indicates failure.
+ *
+ * Steady state (post late_initcall, occupied_mt perf_mode on) takes
+ * the occupied-only fast path: find a gap with mas_empty_area on
+ * @occupied_vmap_area_mt and store the consumed sub-range. This costs
+ * two maple touches per allocation versus four to six in the legacy
+ * path (which clipped a free vmap_area struct in @free_vmap_area_mt).
+ *
+ * Pre-perf_mode (early boot) and -ENOENT/-ERANGE retries fall back to
+ * the legacy free_mt walk + va_clip path, which remains correct.
*/
static __always_inline unsigned long
__alloc_vmap_area(unsigned long size, unsigned long align,
@@ -2235,33 +2246,41 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
return -EINVAL;
if (size > vend - vstart)
return -ENOENT;
- if (align > PAGE_SIZE && (vend - vstart) != size) {
- if (check_add_overflow(size, align - 1, &search_len))
- return -ERANGE;
- }
- if (occupied_mt_supported() && align <= PAGE_SIZE) {
- unsigned long candidate;
+ /*
+ * Occupied-only fast path: skip both the free_mt validation
+ * (free_mt_find_enclose_range_locked) and the va_clip splitting.
+ * occupied_mt_find_hole_window_locked already pads the gap search by
+ * align-1 internally for align > PAGE_SIZE, so any alignment lands
+ * inside the returned gap; storing the consumed sub-range in
+ * occupied_mt makes the allocator visible to subsequent lookups. The
+ * legacy free_mt stays in sync only at coarse points (init, pre-
+ * perf_mode), which is harmless because the alloc and free hot paths
+ * no longer query it.
+ */
+ if (occupied_mt_supported()) {
+ if (!occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
+ align, &nva_start_addr))
+ return -ENOENT;
- if (occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
- align, &candidate)) {
- if (check_add_overflow(candidate, size, &nva_end_addr))
- return -ERANGE;
+ if (check_add_overflow(nva_start_addr, size, &nva_end_addr))
+ return -ERANGE;
- va = free_mt_find_enclose_range_locked(candidate, nva_end_addr);
- if (likely(va)) {
- nva_start_addr = candidate;
- goto found;
- }
+ if (!occupied_mt_store_range_locked(nva_start_addr, nva_end_addr))
+ return -ENOMEM;
- occupied_mt_cache_gap_miss_locked(candidate, vend);
- }
+ return nva_start_addr;
}
/*
- * Free maple index is authoritative for allocatable ranges; lazy and
- * retry entries are intentionally excluded from it.
+ * Pre-perf_mode early boot fallback: walk free_mt linearly and use
+ * va_clip to keep both indices coherent.
*/
+ if (align > PAGE_SIZE && (vend - vstart) != size) {
+ if (check_add_overflow(size, align - 1, &search_len))
+ return -ERANGE;
+ }
+
mas_set(&mas, vstart);
va = mas_find(&mas, vend - 1);
while (va) {
@@ -2295,7 +2314,6 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
if (!va)
return -ENOENT;
-found:
ret = va_clip(va, nva_start_addr, size);
if (WARN_ON_ONCE(ret))
return ret;
@@ -2340,8 +2358,7 @@ static void free_vmap_area(struct vmap_area *va)
spin_unlock(&free_vmap_area_lock);
goto out_schedule_retry;
}
- if (!reinsert_or_queue_vmap_area_locked(va))
- queued_retry = true;
+ release_drained_vmap_area_locked(va);
spin_unlock(&free_vmap_area_lock);
out_schedule_retry:
@@ -2692,15 +2709,13 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
{
struct vmap_area *va, *n;
bool ok = true;
- bool queue_retry_work = false;
+ LIST_HEAD(release);
if (list_empty(head))
return true;
spin_lock(&free_vmap_area_lock);
list_for_each_entry_safe(va, n, head, list) {
- bool occupied_erased = false;
-
list_del_init(&va->list);
if (erase_occupied) {
if (WARN_ON_ONCE(!occupied_mt_erase_va_locked(va))) {
@@ -2708,24 +2723,21 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
ok = false;
continue;
}
-
- occupied_erased = true;
- }
- if (WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va))) {
- if (occupied_erased &&
- WARN_ON_ONCE(!occupied_mt_store_va_locked(va))) {
- retry_queue_add_va_locked(va);
- queue_retry_work = true;
- ok = false;
- continue;
- }
- list_add_tail(&va->list, failed);
- ok = false;
}
+ /*
+ * Occupied-only design: there are no free vmap_area objects
+ * any more. With the occupied marker erased, the range is
+ * implicitly free (a gap in occupied_vmap_area_mt). Just
+ * release the struct outside the lock.
+ */
+ list_add_tail(&va->list, &release);
}
spin_unlock(&free_vmap_area_lock);
- if (queue_retry_work)
- schedule_work(&drain_vmap_work);
+
+ list_for_each_entry_safe(va, n, &release, list) {
+ list_del_init(&va->list);
+ kmem_cache_free(vmap_area_cachep, va);
+ }
return ok;
}
@@ -5747,14 +5759,16 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
orig_start = vas[area]->va_start;
orig_end = vas[area]->va_end;
if (occupied_mt_erase_va_locked(vas[area])) {
- va = reinsert_or_queue_vmap_area_locked(vas[area]);
- if (va)
- kasan_release_vmalloc(orig_start, orig_end,
- va->va_start, va->va_end,
- KASAN_VMALLOC_PAGE_RANGE |
- KASAN_VMALLOC_TLB_FLUSH);
- else
- queued_retry = true;
+ /*
+ * Reinsert releases vas[area] in the occupied-only
+ * design; use orig_start/orig_end captured above for
+ * the kasan release call rather than va->va_start.
+ */
+ release_drained_vmap_area_locked(vas[area]);
+ kasan_release_vmalloc(orig_start, orig_end,
+ orig_start, orig_end,
+ KASAN_VMALLOC_PAGE_RANGE |
+ KASAN_VMALLOC_TLB_FLUSH);
} else {
retry_queue_add_va_locked(vas[area]);
queued_retry = true;
@@ -5820,14 +5834,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
orig_start = vas[area]->va_start;
orig_end = vas[area]->va_end;
if (occupied_mt_erase_va_locked(vas[area])) {
- va = reinsert_or_queue_vmap_area_locked(vas[area]);
- if (va)
- kasan_release_vmalloc(orig_start, orig_end,
- va->va_start, va->va_end,
- KASAN_VMALLOC_PAGE_RANGE |
- KASAN_VMALLOC_TLB_FLUSH);
- else
- queued_retry = true;
+ release_drained_vmap_area_locked(vas[area]);
+ kasan_release_vmalloc(orig_start, orig_end,
+ orig_start, orig_end,
+ KASAN_VMALLOC_PAGE_RANGE |
+ KASAN_VMALLOC_TLB_FLUSH);
} else {
retry_queue_add_va_locked(vas[area]);
queued_retry = true;
@@ -6045,6 +6056,14 @@ module_init(proc_vmalloc_init);
#endif
+/*
+ * Pre-occupied-only design seeded the free index with placeholder VAs
+ * covering gaps between vmlist entries. This is preserved as the
+ * boot-time path that populates the legacy free_vmap_area_mt for any
+ * code that still queries it (notably pcpu_get_vm_areas). With
+ * occupied_vmap_area_mt authoritative, allocators on the hot path
+ * skip free_mt entirely.
+ */
static void __init vmap_init_free_space(void)
{
unsigned long vmap_start = 1;
--
2.34.1
next prev parent reply other threads:[~2026-06-13 17:21 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
2026-06-13 17:19 ` Pranjal Arya [this message]
2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
2026-06-14 6:35 ` [syzbot ci] " syzbot ci
2026-06-14 6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260613-vmalloc_maple-v1-7-0aa740bb944b@oss.qualcomm.com \
--to=pranjal.arya@oss.qualcomm.com \
--cc=Suzuki.Poulose@arm.com \
--cc=akpm@linux-foundation.org \
--cc=aliceryhl@google.com \
--cc=andrewjballance@gmail.com \
--cc=balbirs@nvidia.com \
--cc=dev.jain@arm.com \
--cc=dvyukov@google.com \
--cc=elver@google.com \
--cc=glider@google.com \
--cc=jackmanb@google.com \
--cc=liam@infradead.org \
--cc=linux-arm-msm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=maple-tree@lists.infradead.org \
--cc=neil.armstrong@linaro.org \
--cc=praan@google.com \
--cc=puranjay@kernel.org \
--cc=santosh.shukla@amd.com \
--cc=shuah@kernel.org \
--cc=smostafa@google.com \
--cc=sudeep.holla@kernel.org \
--cc=surenb@google.com \
--cc=urezki@gmail.com \
--cc=will@kernel.org \
--cc=wkarny@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox