[PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree

The Linux Kernel Mailing List
 help / color / mirror / Atom feed

* [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
@ 2026-06-13 17:19 Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
                   ` (14 more replies)
  0 siblings, 15 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

vmalloc's free/busy/lazy area tracking is one of the last remaining
augmented-rb_tree consumers in the core mm allocators. The rest of
mm/ has been gradually consolidating range-keyed indexing around
maple_tree (notably the per-process VMA tree in mm/mmap.c), and
the underlying reason is a structural mismatch between rb_tree and
range tracking:

- rb_tree is a binary tree with a single entry per node. A lookup
  walks log2(N) nodes via a pointer chase that touches log2(N)
  cache lines, with one comparison per node. Range queries are not
  native to rb_tree; vmalloc has historically maintained them by
  augmenting every node with a subtree_max_size field whose value
  has to be recomputed on every insert, erase, and rotation via a
  custom callback. That callback machinery is vmalloc-specific
  code that exists only to coax range semantics out of a
  binary-keyed tree.
- maple_tree stores up to 16 ranges per node (fanout f=16), so a
  lookup walks ~log16(N) nodes in tightly-packed pivot/slot arrays
  that are far more cache-friendly. Range queries are first-class
  (mas_empty_area, mas_find, mas_erase), with no augmentation
  callback to maintain. RCU traversal and sentinel range storage
  (XA_ZERO_ENTRY) are part of the data structure's contract, not
  bolted on by the consumer.

For the vmalloc allocator's hot paths this means shallower walks
under the same N, fewer cache lines touched per lookup, and no
custom augmented-callback machinery to maintain. Aligning vmalloc
with the same range-tree direction the rest of mm/ has taken
collapses the augmented gap walker to a single mas_empty_area()
descent, retires the augmented rb_node from struct vmap_area
(-24 bytes per object on 64-bit), and exposes the range,
sentinel, and RCU primitives needed for a per-CPU range cache
that the augmented rb_tree could not host cleanly.

This series completes the vmalloc internal migration from rb-tree
indexed tracking to maple-tree indexed tracking for free, busy, and
lazy vmap_area range management.

The series removes vmalloc's internal rb-tree dependencies and moves
address indexing and range lookups/scans to maple-tree-backed paths,
while keeping ordered list neighbour traversal where coalescing
semantics require stable predecessor/successor behaviour.

In addition to the direct rb -> maple migration, the series includes
robustness and scalability refinements in the same code path:

- deferred/lazy maple bring-up to avoid early-boot allocator hazards
- maple-assisted ordered-list insertion for busy/lazy tracking
- mas_preallocate / mas_store_prealloc fast path for common-case
  publish work, with a non-indexed retry queue that absorbs the
  rare publish-under-pressure case without leaking or panicking
- single mas_store(NULL, ...) sub-range trim in va_clip() in place
  of an erase-and-restore pair when narrowing a free-area entry
- single mas_erase() for the busy-tree find-then-unlink pair on the
  free path
- consolidation of in-use ranges as a single authoritative index on
  the steady-state allocation hot path
- list_head representation of the lazy-purge queue, since that queue
  is bulk-drained and has no address-keyed query
- per-CPU bump-allocator overlay layered on top of the migrated
  indices for short-lived, page-aligned, common-case allocations
  (design and chunk-size derivation in the 0010-0012 commit
  messages)
- explicit lock/serialisation behaviour preserved (no lock removed)

Primary advantages
==================

- struct vmap_area shrinks by 24 bytes on 64-bit layouts (72 -> 48),
  removing the embedded augmented rb_node and the subtree_max_size
  field that the rb-tree gap walk depended on. Verified with pahole
  on arm64.
- maple_tree's per-node fanout (multiple pivots/slots per node)
  replaces a binary rb-tree descent for indexed lookups, giving a
  shallower walk for the same allocation count.
- alloc-side gap finding moves from a recursive augmented-rb walk to
  mas_empty_area() over the in-use range index, returning the lowest
  matching gap in a single descent.
- vfree of a chunk-resident allocation through the per-CPU overlay
  resolves addr -> vmap_area in O(1) via the chunk's back-pointer
  array, with a bounded fast-reject for addresses outside any
  reserved chunk; the maple-tree busy lookup remains the fallback.
- correctness behaviour preserved: ordered list neighbour traversal
  for coalescing remains; the locking/serialisation model is
  unchanged; lockdep is silent across the test_vmalloc subtest sweep.
- robustness in bring-up and high churn: deferred/lazy maple
  initialisation avoids early-boot allocator hazards; the retry
  queue keeps publish failures under GFP_NOWAIT pressure correct
  without leaking or panicking.

Real-silicon validation
=======================

The series was tested on Qualcomm Snapdragon X1E80100.
The patched kernel was booted on the device against an RB baseline
image of the same kernel revision, and exercised through:

- GFXBench, run for several hours of sustained graphics workload;
  the patched kernel ran clean throughout, with throughput matching
  the RB baseline within run-to-run noise.
- boot-time module loading via the finit_module / kernel_read_file
  path that exercises the bump-allocator's indexed write loop;
  this path drove the patch 0012 hardening, and the patched kernel
  is UBSAN-clean here.
- repeated insmod / rmmod cycles to soak the chunk install / drain
  paths under live workload.

No kernel WARN, BUG, or UBSAN report was observed across the
multi-hour soak.

Quantified validated performance improvements
=============================================

Numbers below compare the current upstream rb-tree-based vmalloc
allocator (baseline) against the final state after applying all 12
patches in this series (patched). Each row reports the median of N
reps under sequential QEMU execution (PARALLEL=1, taskset-pinned
CPUs, nice -n 5) on a production kernel configuration
(CONFIG_DEBUG_VM=n, CONFIG_KASAN=n, CONFIG_LOCKDEP=n,
CONFIG_UBSAN=n). The baseline label is RB; the patched label is MT.

Aggregate observation:

- x86_64 KVM contention sweep (smp=8, 30 reps): improvement on every
  measured nt x mask configuration, range +5.1 % to +43.6 %.
- x86_64 KVM saturation sweep (smp=16, 15 reps): improvement on every
  measured nt x mask configuration, range +28.2 % to +64.9 %.
- arm64 cortex-a72 TCG contention sweep (smp=8, 15 reps):
  improvement on every measured nt x mask configuration, range
  +19.5 % to +32.6 %.
- functional sweep (1 rep, all 12 test_vmalloc masks): all 11
  passing subtests pass on RB and MT; the 12th (mask=64,
  align_shift_alloc_test) is xfail by design and remains xfail on MT.

x86_64 observed faster cases (sorted by gain):

- full_fit_alloc_test, smp=16 nt=16:                  +64.9 %
- long_busy_list_alloc_test, smp=16 nt=16:            +62.9 %
- full_fit_alloc_test, smp=16 nt=8:                   +51.7 %
- long_busy_list_alloc_test, smp=16 nt=8:             +49.3 %
- long_busy_list_alloc_test, smp=8 nt=8:              +43.6 %
- full_fit_alloc_test, smp=8 nt=4:                    +40.4 %
- full_fit_alloc_test, smp=8 nt=8:                    +40.0 %
- full_fit_alloc_test, smp=16 nt=4:                   +38.2 %
- full_fit_alloc_test, smp=8 nt=2:                    +31.2 %
- fix_align_alloc_test, functional sweep:             +30.2 %
- long_busy_list_alloc_test, smp=8 nt=4:              +29.3 %
- long_busy_list_alloc_test, smp=16 nt=4:             +28.2 %
- full_fit_alloc_test, smp=4 nt=2:                    +26.5 %
- full_fit_alloc_test, smp=8 nt=1:                    +20.4 %
- long_busy_list_alloc_test, smp=8 nt=2:              +17.7 %
- long_busy_list_alloc_test, smp=4 nt=2:              +15.3 %

arm64 observed faster cases (sorted by gain):

- long_busy_list_alloc_test, smp=8 nt=8:              +32.6 %
- long_busy_list_alloc_test, smp=8 nt=4:              +26.7 %
- fix_size_alloc_test, functional sweep:              +26.0 %
- full_fit_alloc_test, smp=8 nt=4:                    +25.5 %
- full_fit_alloc_test, smp=8 nt=2:                    +23.4 %
- full_fit_alloc_test, smp=8 nt=1:                    +23.2 %
- full_fit_alloc_test, smp=8 nt=8:                    +22.3 %
- full_fit_alloc_test, functional sweep:              +21.9 %
- long_busy_list_alloc_test, smp=8 nt=2:              +21.2 %
- random_size_align_alloc_test, functional sweep:     +19.9 %
- long_busy_list_alloc_test, smp=8 nt=1:              +19.5 %
- no_block_alloc_test, functional sweep:              +18.0 %
- full_fit_alloc_test, smp=4 nt=2:                    +17.3 %
- kvfree_rcu_1_arg_vmalloc_test, functional sweep:    +16.9 %
- long_busy_list_alloc_test, smp=4 nt=2:              +16.2 %
- random_size_align_alloc_test, smp=4 nt=2:           +14.3 %
- long_busy_list_alloc_test, functional sweep:        +12.6 %
- fix_align_alloc_test, functional sweep:             +12.4 %
- kvfree_rcu_2_arg_vmalloc_test, functional sweep:    +11.0 %
- pcpu_alloc_test, smp=4 nt=2:                        +6.9 %

Memory impact quantification
============================

struct vmap_area shrinks by 24 bytes per object on 64-bit layouts
(72 -> 48; verified with pahole on the patched build vs the RB
baseline). The shrink comes from removing the embedded augmented
rb_node (24 bytes) and the subtree_max_size union member (8 bytes,
formerly union with vm). list_head (16 bytes) and vm (8 bytes) are
retained.

Aggregate savings scale linearly with the number of concurrent live
vmap_area objects on the system.

mm/vmalloc.o text grows by ~47 KB on x86_64 / ~95 KB on arm64
(absorbing the maple-tree call sites and the per-CPU overlay), but
the linker-page-aligned bzImage / Image size is unchanged within
rounding.

QEMU validation and synthetic benchmarks
========================================

In addition to the real-silicon soak, the series was exercised
under QEMU (KVM on x86_64, TCG on arm64):

11 of 12 test_vmalloc subtests PASS on x86_64 and arm64 at series
HEAD. mask=64 (align_shift_alloc_test) reports xfailed=1 by design;
the same xfail is observed on the unpatched RB baseline.

KASAN, lockdep, CONFIG_DEBUG_VM=y, PROVE_LOCKING and CONFIG_UBSAN
boots are clean on both architectures with
test_vmalloc.run_test_mask=0xFFF (full subtest sweep).

Each commit individually compiles mm/vmalloc.o on x86_64 and arm64.
checkpatch.pl --strict reports no errors across the series.

Methodology limitations and reproducibility caveats:

- arm64 numbers in the validated wins above come from cortex-a72
  TCG emulation, so production arm64 silicon
  would be expected to track somewhere between these TCG numbers
  and the x86_64 KVM numbers.
- Independent replay can change the ranking or sign of small
  sub-millisecond deltas in test workloads whose absolute runtime
  is below ~1 ms at the configured test_loop_count; the headline
  workloads (full_fit_alloc_test, long_busy_list_alloc_test) run
  in the tens-to-hundreds-of-ms range and are not affected.

Reproduce
=========

Apply the series with git am, then build with TEST_VMALLOC=y in a
production-config kernel and boot under QEMU:

  $ git am 00*.patch
  $ make defconfig
  $ scripts/config -e TEST_VMALLOC -d DEBUG_VM -d KASAN
  $ make olddefconfig
  $ make -j$(nproc) bzImage
  $ # boot in QEMU with test_vmalloc.run_test_mask=...

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
Pranjal Arya (12):
      mm/vmalloc: introduce maple_tree-based indexing for vmap_area
      mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree
      mm/vmalloc: convert free, purge, and pcpu paths to maple_tree
      mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area
      mm/vmalloc: tighten failure handling under memory pressure
      mm/vmalloc: tighten alloc/free hot paths
      mm/vmalloc: consolidate occupied tree as authoritative index on hot path
      mm/vmalloc: track lazy-purge queue as a list_head
      mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase
      mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
      mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject
      mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds

 include/linux/vmalloc.h |   16 +-
 lib/maple_tree.c        |    7 +
 mm/vmalloc.c            | 2378 ++++++++++++++++++++++++++++++++++-------------
 3 files changed, 1748 insertions(+), 653 deletions(-)
---
base-commit: c425609d6ac4012c8bbf01ec2e10e801b1923a7b
change-id: 20260613-vmalloc_maple-9c96fee8eace

Best regards,
--  
Pranjal Arya <pranjal.arya@oss.qualcomm.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Add the maple_tree data structures, helper API, and runtime readiness
plumbing that this series uses to retire the augmented rb_tree
indexing of vmalloc free and busy ranges.

Two new tree handles are added alongside the existing per-node lazy
index:

  - free_vmap_area_mt    address-keyed gap query for the global
                         free-area allocator
  - vn->busy.mt          per-node address-keyed lookup for find/free

Helpers follow a try_init_*_locked / *_store_*_locked / *_erase_*_locked
naming convention so that the conversion call sites read uniformly.
The try_init_* helpers fold one-shot allocation of the maple-tree
backing state into the first hot-path access; this keeps vmalloc_init()
free of the per-tree GFP_NOWAIT paths and lets the tree machinery
start cold.

No external vmalloc behaviour change in this step. free/busy/lazy
operations still go through the rb_tree and per-node lazy.mt; the new
helpers and globals are wired up by the conversion patches that
follow.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 433 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 402 insertions(+), 31 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1afca3568b9b..67f753d04c96 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -24,6 +24,7 @@
 #include <linux/list.h>
 #include <linux/notifier.h>
 #include <linux/rbtree.h>
+#include <linux/maple_tree.h>
 #include <linux/xarray.h>
 #include <linux/io.h>
 #include <linux/rcupdate.h>
@@ -880,22 +881,22 @@ static bool vmap_initialized __read_mostly;
 static struct kmem_cache *vmap_area_cachep;
 
 /*
- * This linked list is used in pair with free_vmap_area_root.
- * It gives O(1) access to prev/next to perform fast coalescing.
+ * This linked list stores free areas sorted by start address.
+ * It gives O(1) access to neighbors for fast coalescing.
  */
 static LIST_HEAD(free_vmap_area_list);
+/* Next-fit hint to avoid scanning from list head on each allocation. */
+static unsigned long free_vmap_alloc_hint __maybe_unused = 1;
 
 /*
- * This augment red-black tree represents the free vmap space.
- * All vmap_area objects in this tree are sorted by va->va_start
- * address. It is used for allocation and merging when a vmap
- * object is released.
- *
- * Each vmap_area node contains a maximum available free block
- * of its sub-tree, right or left. Therefore it is possible to
- * find a lowest match of free area.
+ * Maple tree shadow of free_vmap_area_list. It is used for
+ * address lookups and first-fit scans.
  */
 static struct rb_root free_vmap_area_root = RB_ROOT;
+static struct maple_tree free_vmap_area_mt __maybe_unused =
+	MTREE_INIT_EXT(free_vmap_area_mt, MT_FLAGS_LOCK_EXTERN, free_vmap_area_lock);
+static bool free_vmap_area_mt_enabled __maybe_unused;
+static bool free_vmap_area_mt_init_tried __maybe_unused;
 
 /*
  * Preload a CPU with one object for "no edge" split case. The
@@ -906,14 +907,17 @@ static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
 
 /*
  * This structure defines a single, solid model where a list and
- * rb-tree are part of one entity protected by the lock. Nodes are
+ * maple tree are part of one entity protected by the lock. Nodes are
  * sorted in ascending order, thus for O(1) access to left/right
  * neighbors a list is used as well as for sequential traversal.
  */
-struct rb_list {
+struct mt_list {
 	struct rb_root root;
+	struct maple_tree mt;
 	struct list_head head;
 	spinlock_t lock;
+	bool mt_enabled;
+	bool mt_init_tried;
 };
 
 /*
@@ -940,8 +944,8 @@ static struct vmap_node {
 	bool skip_populate;
 
 	/* Bookkeeping data of this node. */
-	struct rb_list busy;
-	struct rb_list lazy;
+	struct mt_list busy;
+	struct mt_list lazy;
 
 	/*
 	 * Ready-to-free areas.
@@ -1051,6 +1055,10 @@ va_size(struct vmap_area *va)
 	return (va->va_end - va->va_start);
 }
 
+/*
+ * Transitional rb compatibility retained until all rb-only users are moved.
+ * Follow-up patches in this RFC series remove these helpers.
+ */
 static __always_inline unsigned long
 get_subtree_max_size(struct rb_node *node)
 {
@@ -1070,6 +1078,130 @@ static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);
 
 static __cacheline_aligned_in_smp atomic_long_t vmap_lazy_nr;
 
+/*
+ * maple nodes are allocated from slab; defer tree population until
+ * slab allocator is up to avoid early-boot failures.
+ */
+static __always_inline bool vmap_mt_runtime_ready(void)
+{
+	return READ_ONCE(vmap_initialized) && slab_is_available();
+}
+
+static __always_inline bool free_mt_supported(void)
+{
+	return free_vmap_area_mt_enabled;
+}
+
+static __always_inline void disable_free_mt_locked(void)
+{
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (free_vmap_area_mt_enabled) {
+		__mt_destroy(&free_vmap_area_mt);
+		free_vmap_area_mt_enabled = false;
+	}
+}
+
+static __always_inline void free_mt_store_va_locked(struct vmap_area *va)
+{
+	int err;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	MA_STATE(mas, &free_vmap_area_mt, va->va_start, va->va_end - 1);
+
+	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err))
+		disable_free_mt_locked();
+}
+
+static __always_inline void free_mt_erase_va_locked(struct vmap_area *va)
+{
+	int err;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	MA_STATE(mas, &free_vmap_area_mt, va->va_start, va->va_end - 1);
+
+	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err))
+		disable_free_mt_locked();
+}
+
+static __always_inline void
+free_mt_update_va_locked(struct vmap_area *va, unsigned long old_start,
+			 unsigned long old_end)
+{
+	int err;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	MA_STATE(mas_erase, &free_vmap_area_mt, old_start, old_end - 1);
+	MA_STATE(mas_store, &free_vmap_area_mt, va->va_start, va->va_end - 1);
+
+	err = mas_store_gfp(&mas_erase, NULL, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err)) {
+		disable_free_mt_locked();
+		return;
+	}
+
+	err = mas_store_gfp(&mas_store, va, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err))
+		disable_free_mt_locked();
+}
+
+static void free_mt_rebuild_locked(void)
+{
+	struct vmap_area *va;
+	int err;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	__mt_destroy(&free_vmap_area_mt);
+	free_vmap_area_mt_enabled = true;
+
+	list_for_each_entry(va, &free_vmap_area_list, list) {
+		MA_STATE(mas, &free_vmap_area_mt, va->va_start, va->va_end - 1);
+
+		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+		if (WARN_ON_ONCE(err)) {
+			disable_free_mt_locked();
+			return;
+		}
+	}
+}
+
+static __always_inline void try_init_free_mt_locked(void)
+{
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (free_vmap_area_mt_init_tried)
+		return;
+
+	if (!vmap_mt_runtime_ready())
+		return;
+
+	free_vmap_area_mt_init_tried = true;
+	free_mt_rebuild_locked();
+}
+
+static __always_inline struct vmap_area *
+__find_vmap_area_list(unsigned long addr, struct list_head *head)
+{
+	struct vmap_area *va;
+
+	addr = (unsigned long)kasan_reset_tag((void *)addr);
+
+	list_for_each_entry(va, head, list) {
+		if (addr < va->va_start)
+			break;
+		if (addr < va->va_end)
+			return va;
+	}
+
+	return NULL;
+}
+
 static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root)
 {
 	struct rb_node *n = root->rb_node;
@@ -1092,29 +1224,268 @@ static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *ro
 }
 
 /* Look up the first VA which satisfies addr < va_end, NULL if none. */
-static struct vmap_area *
-__find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
+static __always_inline struct vmap_area *
+__find_vmap_area_exceed_addr_list(unsigned long addr, struct list_head *head)
 {
-	struct vmap_area *va = NULL;
-	struct rb_node *n = root->rb_node;
+	struct vmap_area *va;
 
 	addr = (unsigned long)kasan_reset_tag((void *)addr);
 
-	while (n) {
-		struct vmap_area *tmp;
+	list_for_each_entry(va, head, list) {
+		if (va->va_end > addr)
+			return va;
+	}
 
-		tmp = rb_entry(n, struct vmap_area, rb_node);
-		if (tmp->va_end > addr) {
-			va = tmp;
-			if (tmp->va_start <= addr)
-				break;
+	return NULL;
+}
 
-			n = n->rb_left;
-		} else
-			n = n->rb_right;
+static __always_inline struct list_head *
+find_vmap_area_insert_point_list(struct vmap_area *va, struct list_head *head)
+{
+	struct vmap_area *tmp;
+	struct list_head *next = head;
+
+	list_for_each_entry(tmp, head, list) {
+		if (tmp->va_start > va->va_start) {
+			next = &tmp->list;
+			break;
+		}
 	}
 
-	return va;
+	if (next != head) {
+		tmp = list_entry(next, struct vmap_area, list);
+		if (WARN_ON_ONCE(va->va_end > tmp->va_start))
+			return NULL;
+	}
+
+	if (next->prev != head) {
+		tmp = list_entry(next->prev, struct vmap_area, list);
+		if (WARN_ON_ONCE(va->va_start < tmp->va_end))
+			return NULL;
+	}
+
+	return next;
+}
+
+/*
+ * Use maple-tree neighbour lookup to locate insertion point in O(log n),
+ * while preserving sorted-list neighbour traversal.
+ */
+static __always_inline struct list_head *
+find_vmap_area_insert_point_mt(struct vmap_area *va, struct maple_tree *tree,
+			       struct list_head *head)
+{
+	struct vmap_area *prev, *next;
+	struct list_head *next_link;
+
+	MA_STATE(mas, tree, va->va_start, va->va_start);
+
+	mas_set(&mas, va->va_start);
+	next = mas_find(&mas, ULONG_MAX);
+
+	if (next) {
+		if (WARN_ON_ONCE(next->va_start <= va->va_start))
+			return NULL;
+		if (WARN_ON_ONCE(va->va_end > next->va_start))
+			return NULL;
+		next_link = &next->list;
+	} else {
+		next_link = head;
+	}
+
+	if (next_link->prev != head) {
+		prev = list_entry(next_link->prev, struct vmap_area, list);
+		if (WARN_ON_ONCE(va->va_start < prev->va_end))
+			return NULL;
+	}
+
+	return next_link;
+}
+
+static __always_inline bool
+insert_vmap_area_list_sorted(struct vmap_area *va, struct list_head *head)
+{
+	struct list_head *next;
+
+	next = find_vmap_area_insert_point_list(va, head);
+	if (!next)
+		return false;
+
+	list_add_tail(&va->list, next);
+	return true;
+}
+
+static __always_inline bool
+insert_vmap_area_list_sorted_mt(struct vmap_area *va, struct maple_tree *tree,
+				struct list_head *head)
+{
+	struct list_head *next;
+
+	next = find_vmap_area_insert_point_mt(va, tree, head);
+	if (!next)
+		return false;
+
+	list_add_tail(&va->list, next);
+	return true;
+}
+
+static __always_inline void
+disable_busy_mt_locked(struct vmap_node *vn)
+{
+	lockdep_assert_held(&vn->busy.lock);
+
+	if (vn->busy.mt_enabled) {
+		__mt_destroy(&vn->busy.mt);
+		vn->busy.mt_enabled = false;
+	}
+
+	vn->busy.mt_init_tried = true;
+}
+
+static __always_inline void
+disable_lazy_mt_locked(struct vmap_node *vn)
+{
+	lockdep_assert_held(&vn->lazy.lock);
+
+	if (vn->lazy.mt_enabled) {
+		__mt_destroy(&vn->lazy.mt);
+		vn->lazy.mt_enabled = false;
+	}
+
+	vn->lazy.mt_init_tried = true;
+}
+
+static void
+busy_mt_rebuild_locked(struct vmap_node *vn)
+{
+	struct vmap_area *va;
+	int err;
+
+	lockdep_assert_held(&vn->busy.lock);
+
+	__mt_destroy(&vn->busy.mt);
+	vn->busy.mt_enabled = true;
+
+	list_for_each_entry(va, &vn->busy.head, list) {
+		MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
+
+		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+		if (WARN_ON_ONCE(err)) {
+			disable_busy_mt_locked(vn);
+			return;
+		}
+	}
+}
+
+static __always_inline void
+try_init_busy_mt_locked(struct vmap_node *vn)
+{
+	lockdep_assert_held(&vn->busy.lock);
+
+	if (vn->busy.mt_init_tried)
+		return;
+
+	if (!vmap_mt_runtime_ready())
+		return;
+
+	vn->busy.mt_init_tried = true;
+	busy_mt_rebuild_locked(vn);
+}
+
+static void
+lazy_mt_rebuild_locked(struct vmap_node *vn)
+{
+	struct vmap_area *va;
+	int err;
+
+	lockdep_assert_held(&vn->lazy.lock);
+
+	__mt_destroy(&vn->lazy.mt);
+	vn->lazy.mt_enabled = true;
+
+	list_for_each_entry(va, &vn->lazy.head, list) {
+		MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
+
+		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+		if (WARN_ON_ONCE(err)) {
+			disable_lazy_mt_locked(vn);
+			return;
+		}
+	}
+}
+
+static __always_inline void
+try_init_lazy_mt_locked(struct vmap_node *vn)
+{
+	lockdep_assert_held(&vn->lazy.lock);
+
+	if (vn->lazy.mt_init_tried)
+		return;
+
+	if (!vmap_mt_runtime_ready())
+		return;
+
+	vn->lazy.mt_init_tried = true;
+	lazy_mt_rebuild_locked(vn);
+}
+
+static __always_inline struct vmap_area *
+__find_vmap_area_mt(unsigned long addr, struct maple_tree *tree)
+{
+	MA_STATE(mas, tree, addr, addr);
+
+	addr = (unsigned long)kasan_reset_tag((void *)addr);
+	mas_set(&mas, addr);
+
+	return mas_walk(&mas);
+}
+
+static __always_inline struct vmap_area *
+__find_vmap_area_exceed_addr_mt(unsigned long addr, struct maple_tree *tree)
+{
+	MA_STATE(mas, tree, addr, addr);
+
+	addr = (unsigned long)kasan_reset_tag((void *)addr);
+	mas_set(&mas, addr);
+
+	return mas_find(&mas, ULONG_MAX);
+}
+
+static __always_inline struct vmap_area *
+__find_vmap_area_enclose_addr_mt(unsigned long addr, struct maple_tree *tree)
+{
+	MA_STATE(mas, tree, addr, addr);
+
+	addr = (unsigned long)kasan_reset_tag((void *)addr);
+	mas_set(&mas, addr);
+
+	return mas_find_rev(&mas, 0);
+}
+
+static __always_inline struct vmap_area *
+find_vmap_area_busy_locked(unsigned long addr, struct vmap_node *vn)
+{
+	lockdep_assert_held(&vn->busy.lock);
+
+	try_init_busy_mt_locked(vn);
+
+	if (likely(vn->busy.mt_enabled))
+		return __find_vmap_area_mt(addr, &vn->busy.mt);
+
+	return __find_vmap_area_list(addr, &vn->busy.head);
+}
+
+static __always_inline struct vmap_area *
+find_vmap_area_exceed_addr_busy_locked(unsigned long addr, struct vmap_node *vn)
+{
+	lockdep_assert_held(&vn->busy.lock);
+
+	try_init_busy_mt_locked(vn);
+
+	if (likely(vn->busy.mt_enabled))
+		return __find_vmap_area_exceed_addr_mt(addr, &vn->busy.mt);
+
+	return __find_vmap_area_exceed_addr_list(addr, &vn->busy.head);
 }
 
 /*
@@ -1135,7 +1506,7 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 
 	for_each_vmap_node(vn) {
 		spin_lock(&vn->busy.lock);
-		*va = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
+		*va = find_vmap_area_exceed_addr_busy_locked(addr, vn);
 
 		if (*va)
 			if (!va_start_lowest || (*va)->va_start < va_start_lowest)
@@ -1152,7 +1523,7 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 		vn = addr_to_node(va_start_lowest);
 
 		spin_lock(&vn->busy.lock);
-		*va = __find_vmap_area(va_start_lowest, &vn->busy.root);
+		*va = find_vmap_area_busy_locked(va_start_lowest, vn);
 
 		if (*va)
 			return vn;

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Switch __alloc_vmap_area, va_clip, classify_va_fit_type, and the
insertion / merge helpers to drive the global free-area state through
the maple_tree-backed helpers. The augmented rb_tree walk
(find_vmap_lowest_match / find_va_links / augment_tree_propagate_*)
becomes unreachable on the alloc path and is removed.

The alloc path retains a list-based next-fit walk over
free_vmap_area_list driven by free_vmap_alloc_hint, but its
insertion-point lookup, neighbour validation and free-area indexing
run through the maple_tree helpers
(find_vmap_area_insert_point_mt, free_mt_store_va_locked,
free_mt_update_va_locked).

va_clip handles the LE/RE/NE fit types via the new free_mt_*_locked
helpers.

The pcpu and free paths still drive the rb_tree.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 653 +++++++++++++++++++++++++++++------------------------------
 1 file changed, 323 insertions(+), 330 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 67f753d04c96..c5f509f033e6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1535,261 +1535,155 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 	return NULL;
 }
 
-/*
- * This function returns back addresses of parent node
- * and its left or right link for further processing.
- *
- * Otherwise NULL is returned. In that case all further
- * steps regarding inserting of conflicting overlap range
- * have to be declined and actually considered as a bug.
- */
-static __always_inline struct rb_node **
-find_va_links(struct vmap_area *va,
-	struct rb_root *root, struct rb_node *from,
-	struct rb_node **parent)
-{
-	struct vmap_area *tmp_va;
-	struct rb_node **link;
-
-	if (root) {
-		link = &root->rb_node;
-		if (unlikely(!*link)) {
-			*parent = NULL;
-			return link;
-		}
-	} else {
-		link = &from;
-	}
+static __always_inline void
+insert_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
+{
+	int err;
 
-	/*
-	 * Go to the bottom of the tree. When we hit the last point
-	 * we end up with parent rb_node and correct direction, i name
-	 * it link, where the new va->rb_node will be attached to.
-	 */
-	do {
-		tmp_va = rb_entry(*link, struct vmap_area, rb_node);
+	lockdep_assert_held(&vn->busy.lock);
 
-		/*
-		 * During the traversal we also do some sanity check.
-		 * Trigger the BUG() if there are sides(left/right)
-		 * or full overlaps.
-		 */
-		if (va->va_end <= tmp_va->va_start)
-			link = &(*link)->rb_left;
-		else if (va->va_start >= tmp_va->va_end)
-			link = &(*link)->rb_right;
-		else {
-			WARN(1, "vmalloc bug: 0x%lx-0x%lx overlaps with 0x%lx-0x%lx\n",
-				va->va_start, va->va_end, tmp_va->va_start, tmp_va->va_end);
+	try_init_busy_mt_locked(vn);
 
-			return NULL;
-		}
-	} while (*link);
+	if (likely(vn->busy.mt_enabled)) {
+		MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
 
-	*parent = &tmp_va->rb_node;
-	return link;
-}
+		if (!insert_vmap_area_list_sorted_mt(va, &vn->busy.mt,
+						     &vn->busy.head))
+			return;
 
-static __always_inline struct list_head *
-get_va_next_sibling(struct rb_node *parent, struct rb_node **link)
-{
-	struct list_head *list;
+		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+		if (WARN_ON_ONCE(err))
+			disable_busy_mt_locked(vn);
 
-	if (unlikely(!parent))
-		/*
-		 * The red-black tree where we try to find VA neighbors
-		 * before merging or inserting is empty, i.e. it means
-		 * there is no free vmap space. Normally it does not
-		 * happen but we handle this case anyway.
-		 */
-		return NULL;
+		return;
+	}
 
-	list = &rb_entry(parent, struct vmap_area, rb_node)->list;
-	return (&parent->rb_right == link ? list->next : list);
+	if (!insert_vmap_area_list_sorted(va, &vn->busy.head))
+		return;
 }
 
 static __always_inline void
-__link_va(struct vmap_area *va, struct rb_root *root,
-	struct rb_node *parent, struct rb_node **link,
-	struct list_head *head, bool augment)
+unlink_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 {
-	/*
-	 * VA is still not in the list, but we can
-	 * identify its future previous list_head node.
-	 */
-	if (likely(parent)) {
-		head = &rb_entry(parent, struct vmap_area, rb_node)->list;
-		if (&parent->rb_right != link)
-			head = head->prev;
-	}
+	int err;
 
-	/* Insert to the rb-tree */
-	rb_link_node(&va->rb_node, parent, link);
-	if (augment) {
-		/*
-		 * Some explanation here. Just perform simple insertion
-		 * to the tree. We do not set va->subtree_max_size to
-		 * its current size before calling rb_insert_augmented().
-		 * It is because we populate the tree from the bottom
-		 * to parent levels when the node _is_ in the tree.
-		 *
-		 * Therefore we set subtree_max_size to zero after insertion,
-		 * to let __augment_tree_propagate_from() puts everything to
-		 * the correct order later on.
-		 */
-		rb_insert_augmented(&va->rb_node,
-			root, &free_vmap_area_rb_augment_cb);
-		va->subtree_max_size = 0;
-	} else {
-		rb_insert_color(&va->rb_node, root);
-	}
+	lockdep_assert_held(&vn->busy.lock);
 
-	/* Address-sort this list */
-	list_add(&va->list, head);
-}
+	MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
 
-static __always_inline void
-link_va(struct vmap_area *va, struct rb_root *root,
-	struct rb_node *parent, struct rb_node **link,
-	struct list_head *head)
-{
-	__link_va(va, root, parent, link, head, false);
-}
+	list_del_init(&va->list);
 
-static __always_inline void
-link_va_augment(struct vmap_area *va, struct rb_root *root,
-	struct rb_node *parent, struct rb_node **link,
-	struct list_head *head)
-{
-	__link_va(va, root, parent, link, head, true);
-}
+	try_init_busy_mt_locked(vn);
 
-static __always_inline void
-__unlink_va(struct vmap_area *va, struct rb_root *root, bool augment)
-{
-	if (WARN_ON(RB_EMPTY_NODE(&va->rb_node)))
+	if (unlikely(!vn->busy.mt_enabled))
 		return;
 
-	if (augment)
-		rb_erase_augmented(&va->rb_node,
-			root, &free_vmap_area_rb_augment_cb);
-	else
-		rb_erase(&va->rb_node, root);
-
-	list_del_init(&va->list);
-	RB_CLEAR_NODE(&va->rb_node);
+	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err))
+		disable_busy_mt_locked(vn);
 }
 
 static __always_inline void
-unlink_va(struct vmap_area *va, struct rb_root *root)
+insert_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 {
-	__unlink_va(va, root, false);
-}
+	int err;
 
-static __always_inline void
-unlink_va_augment(struct vmap_area *va, struct rb_root *root)
-{
-	__unlink_va(va, root, true);
+	lockdep_assert_held(&vn->lazy.lock);
+
+	try_init_lazy_mt_locked(vn);
+
+	if (likely(vn->lazy.mt_enabled)) {
+		MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
+
+		if (!insert_vmap_area_list_sorted_mt(va, &vn->lazy.mt,
+						     &vn->lazy.head))
+			return;
+
+		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+		if (WARN_ON_ONCE(err))
+			disable_lazy_mt_locked(vn);
+
+		return;
+	}
+
+	if (!insert_vmap_area_list_sorted(va, &vn->lazy.head))
+		return;
 }
 
-#if DEBUG_AUGMENT_PROPAGATE_CHECK
-/*
- * Gets called when remove the node and rotate.
- */
-static __always_inline unsigned long
-compute_subtree_max_size(struct vmap_area *va)
+static __always_inline bool
+lazy_vmap_areas_empty_locked(struct vmap_node *vn)
 {
-	return max3(va_size(va),
-		get_subtree_max_size(va->rb_node.rb_left),
-		get_subtree_max_size(va->rb_node.rb_right));
+	lockdep_assert_held(&vn->lazy.lock);
+
+	try_init_lazy_mt_locked(vn);
+
+	if (likely(vn->lazy.mt_enabled))
+		return mtree_empty(&vn->lazy.mt);
+
+	return list_empty(&vn->lazy.head);
 }
 
-static void
-augment_tree_propagate_check(void)
+static __always_inline void
+move_lazy_vmap_areas_to_purge_locked(struct vmap_node *vn)
 {
 	struct vmap_area *va;
-	unsigned long computed_size;
+	int err;
 
-	list_for_each_entry(va, &free_vmap_area_list, list) {
-		computed_size = compute_subtree_max_size(va);
-		if (computed_size != va->subtree_max_size)
-			pr_emerg("tree is corrupted: %lu, %lu\n",
-				va_size(va), va->subtree_max_size);
+	lockdep_assert_held(&vn->lazy.lock);
+
+	try_init_lazy_mt_locked(vn);
+
+	if (likely(vn->lazy.mt_enabled)) {
+		list_for_each_entry(va, &vn->lazy.head, list) {
+			MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
+
+			err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
+			if (WARN_ON_ONCE(err)) {
+				disable_lazy_mt_locked(vn);
+				break;
+			}
+		}
+
+		if (vn->lazy.mt_enabled && WARN_ON_ONCE(!mtree_empty(&vn->lazy.mt)))
+			disable_lazy_mt_locked(vn);
 	}
+
+	list_replace_init(&vn->lazy.head, &vn->purge_list);
 }
-#endif
 
-/*
- * This function populates subtree_max_size from bottom to upper
- * levels starting from VA point. The propagation must be done
- * when VA size is modified by changing its va_start/va_end. Or
- * in case of newly inserting of VA to the tree.
- *
- * It means that __augment_tree_propagate_from() must be called:
- * - After VA has been inserted to the tree(free path);
- * - After VA has been shrunk(allocation path);
- * - After VA has been increased(merging path).
- *
- * Please note that, it does not mean that upper parent nodes
- * and their subtree_max_size are recalculated all the time up
- * to the root node.
- *
- *       4--8
- *        /\
- *       /  \
- *      /    \
- *    2--2  8--8
- *
- * For example if we modify the node 4, shrinking it to 2, then
- * no any modification is required. If we shrink the node 2 to 1
- * its subtree_max_size is updated only, and set to 1. If we shrink
- * the node 8 to 6, then its subtree_max_size is set to 6 and parent
- * node becomes 4--6.
- */
-static __always_inline void
-augment_tree_propagate_from(struct vmap_area *va)
+static __always_inline bool
+insert_vmap_area_free_locked(struct vmap_area *va)
 {
-	/*
-	 * Populate the tree from bottom towards the root until
-	 * the calculated maximum available size of checked node
-	 * is equal to its current one.
-	 */
-	free_vmap_area_rb_augment_cb_propagate(&va->rb_node, NULL);
+	lockdep_assert_held(&free_vmap_area_lock);
 
-#if DEBUG_AUGMENT_PROPAGATE_CHECK
-	augment_tree_propagate_check();
-#endif
-}
+	try_init_free_mt_locked();
 
-static void
-insert_vmap_area(struct vmap_area *va,
-	struct rb_root *root, struct list_head *head)
-{
-	struct rb_node **link;
-	struct rb_node *parent;
+	if (likely(free_mt_supported())) {
+		if (!insert_vmap_area_list_sorted_mt(va, &free_vmap_area_mt,
+						     &free_vmap_area_list))
+			return false;
 
-	link = find_va_links(va, root, NULL, &parent);
-	if (link)
-		link_va(va, root, parent, link, head);
+		free_mt_store_va_locked(va);
+	} else {
+		if (!insert_vmap_area_list_sorted(va, &free_vmap_area_list))
+			return false;
+	}
+
+	return true;
 }
 
-static void
-insert_vmap_area_augment(struct vmap_area *va,
-	struct rb_node *from, struct rb_root *root,
-	struct list_head *head)
+static __always_inline void
+unlink_vmap_area_free_locked(struct vmap_area *va)
 {
-	struct rb_node **link;
-	struct rb_node *parent;
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	if (from)
-		link = find_va_links(va, NULL, from, &parent);
-	else
-		link = find_va_links(va, root, NULL, &parent);
+	if (WARN_ON_ONCE(list_empty(&va->list)))
+		return;
 
-	if (link) {
-		link_va_augment(va, root, parent, link, head);
-		augment_tree_propagate_from(va);
-	}
+	if (likely(free_mt_supported()))
+		free_mt_erase_va_locked(va);
+
+	list_del_init(&va->list);
 }
 
 /*
@@ -1804,29 +1698,20 @@ insert_vmap_area_augment(struct vmap_area *va,
  * ongoing.
  */
 static __always_inline struct vmap_area *
-__merge_or_add_vmap_area(struct vmap_area *va,
-	struct rb_root *root, struct list_head *head, bool augment)
+__merge_or_add_vmap_area(struct vmap_area *va, struct list_head *head, bool update_mt)
 {
 	struct vmap_area *sibling;
 	struct list_head *next;
-	struct rb_node **link;
-	struct rb_node *parent;
+	unsigned long old_start, old_end;
 	bool merged = false;
 
-	/*
-	 * Find a place in the tree where VA potentially will be
-	 * inserted, unless it is merged with its sibling/siblings.
-	 */
-	link = find_va_links(va, root, NULL, &parent);
-	if (!link)
-		return NULL;
+	if (update_mt && free_mt_supported())
+		next = find_vmap_area_insert_point_mt(va, &free_vmap_area_mt, head);
+	else
+		next = find_vmap_area_insert_point_list(va, head);
 
-	/*
-	 * Get next node of VA to check if merging can be done.
-	 */
-	next = get_va_next_sibling(parent, link);
-	if (unlikely(next == NULL))
-		goto insert;
+	if (!next)
+		return NULL;
 
 	/*
 	 * start            end
@@ -1838,7 +1723,11 @@ __merge_or_add_vmap_area(struct vmap_area *va,
 	if (next != head) {
 		sibling = list_entry(next, struct vmap_area, list);
 		if (sibling->va_start == va->va_end) {
+			old_start = sibling->va_start;
+			old_end = sibling->va_end;
 			sibling->va_start = va->va_start;
+			if (update_mt && free_mt_supported())
+				free_mt_update_va_locked(sibling, old_start, old_end);
 
 			/* Free vmap_area object. */
 			kmem_cache_free(vmap_area_cachep, va);
@@ -1862,14 +1751,20 @@ __merge_or_add_vmap_area(struct vmap_area *va,
 			/*
 			 * If both neighbors are coalesced, it is important
 			 * to unlink the "next" node first, followed by merging
-			 * with "previous" one. Otherwise the tree might not be
-			 * fully populated if a sibling's augmented value is
-			 * "normalized" because of rotation operations.
+			 * with "previous" one.
 			 */
-			if (merged)
-				__unlink_va(va, root, augment);
+			if (merged) {
+				if (update_mt)
+					unlink_vmap_area_free_locked(va);
+				else
+					list_del_init(&va->list);
+			}
 
+			old_start = sibling->va_start;
+			old_end = sibling->va_end;
 			sibling->va_end = va->va_end;
+			if (update_mt && free_mt_supported())
+				free_mt_update_va_locked(sibling, old_start, old_end);
 
 			/* Free vmap_area object. */
 			kmem_cache_free(vmap_area_cachep, va);
@@ -1880,31 +1775,97 @@ __merge_or_add_vmap_area(struct vmap_area *va,
 		}
 	}
 
-insert:
-	if (!merged)
-		__link_va(va, root, parent, link, head, augment);
+	if (!merged) {
+		if (update_mt)
+			insert_vmap_area_free_locked(va);
+		else
+			list_add_tail(&va->list, next);
+	}
 
 	return va;
 }
 
 static __always_inline struct vmap_area *
 merge_or_add_vmap_area(struct vmap_area *va,
-	struct rb_root *root, struct list_head *head)
+	struct list_head *head)
 {
-	return __merge_or_add_vmap_area(va, root, head, false);
+	return __merge_or_add_vmap_area(va, head, false);
 }
 
 static __always_inline struct vmap_area *
-merge_or_add_vmap_area_augment(struct vmap_area *va,
-	struct rb_root *root, struct list_head *head)
+merge_or_add_vmap_area_free_locked(struct vmap_area *va)
 {
-	va = __merge_or_add_vmap_area(va, root, head, true);
-	if (va)
-		augment_tree_propagate_from(va);
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	va = __merge_or_add_vmap_area(va, &free_vmap_area_list, true);
+	if (va && va->va_start < free_vmap_alloc_hint)
+		free_vmap_alloc_hint = va->va_start;
 
 	return va;
 }
 
+/*
+ * Transitional wrappers retained until all legacy rb call sites are switched.
+ * Follow-up patches in this series remove these wrappers.
+ */
+static __always_inline void
+insert_vmap_area(struct vmap_area *va, struct rb_root *root,
+		 struct list_head *head)
+{
+	struct vmap_node *vn = addr_to_node(va->va_start);
+
+	if (head == &free_vmap_area_list) {
+		insert_vmap_area_free_locked(va);
+		return;
+	}
+
+	if (head == &vn->lazy.head) {
+		insert_vmap_area_lazy_locked(va, vn);
+		return;
+	}
+
+	insert_vmap_area_busy_locked(va, vn);
+}
+
+static __always_inline void
+insert_vmap_area_augment(struct vmap_area *va, struct rb_node *from,
+			 struct rb_root *root, struct list_head *head)
+{
+	insert_vmap_area(va, root, head);
+}
+
+static __always_inline void unlink_va(struct vmap_area *va, struct rb_root *root)
+{
+	struct vmap_node *vn = addr_to_node(va->va_start);
+
+	if (root == &free_vmap_area_root) {
+		unlink_vmap_area_free_locked(va);
+		return;
+	}
+
+	unlink_vmap_area_busy_locked(va, vn);
+}
+
+static __always_inline void
+unlink_va_augment(struct vmap_area *va, struct rb_root *root)
+{
+	unlink_va(va, root);
+}
+
+static __always_inline void augment_tree_propagate_from(struct vmap_area *va)
+{
+}
+
+static __always_inline struct vmap_area *
+merge_or_add_vmap_area_augment(struct vmap_area *va, struct rb_root *root,
+			       struct list_head *head)
+{
+	if (head == &free_vmap_area_list)
+		return merge_or_add_vmap_area_free_locked(va);
+
+	return merge_or_add_vmap_area(va, head);
+}
+
 static __always_inline bool
 is_within_this_va(struct vmap_area *va, unsigned long size,
 	unsigned long align, unsigned long vstart)
@@ -1924,74 +1885,103 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
 	return (nva_start_addr + size <= va->va_end);
 }
 
-/*
- * Find the first free block(lowest start address) in the tree,
- * that will accomplish the request corresponding to passing
- * parameters. Please note, with an alignment bigger than PAGE_SIZE,
- * a search length is adjusted to account for worst case alignment
- * overhead.
- */
 static __always_inline struct vmap_area *
-find_vmap_lowest_match(struct rb_root *root, unsigned long size,
-	unsigned long align, unsigned long vstart, bool adjust_search_size)
+find_vmap_lowest_match_list(struct list_head *head, unsigned long size,
+			    unsigned long align, unsigned long vstart)
 {
 	struct vmap_area *va;
-	struct rb_node *node;
-	unsigned long length;
 
-	/* Start from the root. */
-	node = root->rb_node;
+	list_for_each_entry(va, head, list) {
+		if (!is_within_this_va(va, size, align, vstart))
+			continue;
 
-	/* Adjust the search size for alignment overhead. */
-	length = adjust_search_size ? size + align - 1 : size;
+		return va;
+	}
 
-	while (node) {
-		va = rb_entry(node, struct vmap_area, rb_node);
+	return NULL;
+}
 
-		if (get_subtree_max_size(node->rb_left) >= length &&
-				vstart < va->va_start) {
-			node = node->rb_left;
-		} else {
-			if (is_within_this_va(va, size, align, vstart))
-				return va;
+static __always_inline unsigned long
+clamp_vmap_alloc_hint(unsigned long hint, unsigned long vstart,
+		      unsigned long vend)
+{
+	if (hint < vstart || hint >= vend)
+		return vstart;
 
-			/*
-			 * Does not make sense to go deeper towards the right
-			 * sub-tree if it does not have a free block that is
-			 * equal or bigger to the requested search length.
-			 */
-			if (get_subtree_max_size(node->rb_right) >= length) {
-				node = node->rb_right;
-				continue;
-			}
+	return hint;
+}
 
-			/*
-			 * OK. We roll back and find the first right sub-tree,
-			 * that will satisfy the search criteria. It can happen
-			 * due to "vstart" restriction or an alignment overhead
-			 * that is bigger then PAGE_SIZE.
-			 */
-			while ((node = rb_parent(node))) {
-				va = rb_entry(node, struct vmap_area, rb_node);
-				if (is_within_this_va(va, size, align, vstart))
+/*
+ * Next-fit scan with wrap-around. Use maple to jump to the first candidate
+ * around the hint in O(log n), then continue on the ordered list for cheap
+ * neighbour traversal and deterministic coalescing behaviour.
+ */
+static __always_inline struct vmap_area *
+find_vmap_match_list_next_fit(struct list_head *head, unsigned long size,
+			      unsigned long align, unsigned long vstart,
+			      unsigned long vend)
+{
+	struct vmap_area *va, *start = NULL;
+	unsigned long hint;
+	bool wrapped;
+
+	hint = clamp_vmap_alloc_hint(free_vmap_alloc_hint, vstart, vend);
+
+	if (hint != vstart) {
+		if (free_mt_supported())
+			start = __find_vmap_area_exceed_addr_mt(hint,
+								&free_vmap_area_mt);
+
+		if (start) {
+			va = start;
+			list_for_each_entry_from(va, head, list) {
+				if (is_within_this_va(va, size, align, hint))
 					return va;
+			}
+		} else {
+			list_for_each_entry(va, head, list) {
+				if (va->va_end <= hint)
+					continue;
 
-				if (get_subtree_max_size(node->rb_right) >= length &&
-						vstart <= va->va_start) {
-					/*
-					 * Shift the vstart forward. Please note, we update it with
-					 * parent's start address adding "1" because we do not want
-					 * to enter same sub-tree after it has already been checked
-					 * and no suitable free block found there.
-					 */
-					vstart = va->va_start + 1;
-					node = node->rb_right;
-					break;
-				}
+				if (is_within_this_va(va, size, align, hint))
+					return va;
 			}
 		}
 	}
 
+	wrapped = (hint != vstart);
+	list_for_each_entry(va, head, list) {
+		if (wrapped) {
+			if (start && va == start)
+				break;
+			if (!start && va->va_start >= hint)
+				break;
+		}
+
+		if (is_within_this_va(va, size, align, vstart))
+			return va;
+	}
+
+	return NULL;
+}
+
+static __always_inline struct vmap_area *
+find_vmap_lowest_match_mt(struct maple_tree *tree, unsigned long size,
+			  unsigned long align, unsigned long vstart)
+{
+	MA_STATE(mas, tree, vstart, vstart);
+	struct vmap_area *va;
+
+	mas_set(&mas, vstart);
+	va = mas_find(&mas, ULONG_MAX);
+
+	while (va) {
+		if (is_within_this_va(va, size, align, vstart))
+			return va;
+
+		va = mas_next(&mas, ULONG_MAX);
+	}
+
 	return NULL;
 }
 
@@ -2015,8 +2005,8 @@ find_vmap_lowest_linear_match(struct list_head *head, unsigned long size,
 }
 
 static void
-find_vmap_lowest_match_check(struct rb_root *root, struct list_head *head,
-			     unsigned long size, unsigned long align)
+find_vmap_lowest_match_check(struct list_head *head, unsigned long size,
+			     unsigned long align)
 {
 	struct vmap_area *va_1, *va_2;
 	unsigned long vstart;
@@ -2025,7 +2015,10 @@ find_vmap_lowest_match_check(struct rb_root *root, struct list_head *head,
 	get_random_bytes(&rnd, sizeof(rnd));
 	vstart = VMALLOC_START + rnd;
 
-	va_1 = find_vmap_lowest_match(root, size, align, vstart, false);
+	if (free_mt_supported())
+		va_1 = find_vmap_lowest_match_mt(&free_vmap_area_mt, size, align, vstart);
+	else
+		va_1 = find_vmap_lowest_linear_match(head, size, align, vstart);
 	va_2 = find_vmap_lowest_linear_match(head, size, align, vstart);
 
 	if (va_1 != va_2)
@@ -2069,11 +2062,12 @@ classify_va_fit_type(struct vmap_area *va,
 }
 
 static __always_inline int
-va_clip(struct rb_root *root, struct list_head *head,
-		struct vmap_area *va, unsigned long nva_start_addr,
-		unsigned long size)
+va_clip(struct vmap_area *va, unsigned long nva_start_addr,
+	unsigned long size)
 {
 	struct vmap_area *lva = NULL;
+	unsigned long old_start = va->va_start;
+	unsigned long old_end = va->va_end;
 	enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
 
 	if (type == FL_FIT_TYPE) {
@@ -2084,7 +2078,7 @@ va_clip(struct rb_root *root, struct list_head *head,
 		 * V      NVA      V
 		 * |---------------|
 		 */
-		unlink_va_augment(va, root);
+		unlink_vmap_area_free_locked(va);
 		kmem_cache_free(vmap_area_cachep, va);
 	} else if (type == LE_FIT_TYPE) {
 		/*
@@ -2159,10 +2153,11 @@ va_clip(struct rb_root *root, struct list_head *head,
 	}
 
 	if (type != FL_FIT_TYPE) {
-		augment_tree_propagate_from(va);
+		if (free_mt_supported())
+			free_mt_update_va_locked(va, old_start, old_end);
 
 		if (lva)	/* type == NE_FIT_TYPE */
-			insert_vmap_area_augment(lva, &va->rb_node, root, head);
+			insert_vmap_area_free_locked(lva);
 	}
 
 	return 0;
@@ -2170,7 +2165,6 @@ va_clip(struct rb_root *root, struct list_head *head,
 
 static unsigned long
 va_alloc(struct vmap_area *va,
-		struct rb_root *root, struct list_head *head,
 		unsigned long size, unsigned long align,
 		unsigned long vstart, unsigned long vend)
 {
@@ -2187,7 +2181,7 @@ va_alloc(struct vmap_area *va,
 		return -ERANGE;
 
 	/* Update the free vmap_area. */
-	ret = va_clip(root, head, va, nva_start_addr, size);
+	ret = va_clip(va, nva_start_addr, size);
 	if (WARN_ON_ONCE(ret))
 		return ret;
 
@@ -2199,35 +2193,37 @@ va_alloc(struct vmap_area *va,
  * Otherwise an error value is returned that indicates failure.
  */
 static __always_inline unsigned long
-__alloc_vmap_area(struct rb_root *root, struct list_head *head,
-	unsigned long size, unsigned long align,
-	unsigned long vstart, unsigned long vend)
+__alloc_vmap_area(unsigned long size, unsigned long align,
+		  unsigned long vstart, unsigned long vend)
 {
-	bool adjust_search_size = true;
 	unsigned long nva_start_addr;
 	struct vmap_area *va;
 
+	lockdep_assert_held(&free_vmap_area_lock);
+
 	/*
-	 * Do not adjust when:
-	 *   a) align <= PAGE_SIZE, because it does not make any sense.
-	 *      All blocks(their start addresses) are at least PAGE_SIZE
-	 *      aligned anyway;
-	 *   b) a short range where a requested size corresponds to exactly
-	 *      specified [vstart:vend] interval and an alignment > PAGE_SIZE.
-	 *      With adjusted search length an allocation would not succeed.
+	 * Next-fit with wrap-around lowers repeated list-head scans in
+	 * high-churn workloads.
 	 */
-	if (align <= PAGE_SIZE || (align > PAGE_SIZE && (vend - vstart) == size))
-		adjust_search_size = false;
+	va = find_vmap_match_list_next_fit(&free_vmap_area_list, size, align,
+					   vstart, vend);
 
-	va = find_vmap_lowest_match(root, size, align, vstart, adjust_search_size);
 	if (unlikely(!va))
 		return -ENOENT;
 
-	nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend);
+	nva_start_addr = va_alloc(va, size, align, vstart, vend);
+	if (!IS_ERR_VALUE(nva_start_addr)) {
+		unsigned long next_hint;
+
+		if (check_add_overflow(nva_start_addr, size, &next_hint))
+			free_vmap_alloc_hint = vstart;
+		else
+			free_vmap_alloc_hint = next_hint;
+	}
 
 #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
 	if (!IS_ERR_VALUE(nva_start_addr))
-		find_vmap_lowest_match_check(root, head, size, align);
+		find_vmap_lowest_match_check(&free_vmap_area_list, size, align);
 #endif
 
 	return nva_start_addr;
@@ -2441,8 +2437,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 retry:
 	if (IS_ERR_VALUE(addr)) {
 		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
-		addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
-			size, align, vstart, vend);
+		addr = __alloc_vmap_area(size, align, vstart, vend);
 		spin_unlock(&free_vmap_area_lock);
 
 		/*
@@ -2589,7 +2584,6 @@ static void
 decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 {
 	LIST_HEAD(decay_list);
-	struct rb_root decay_root = RB_ROOT;
 	struct vmap_area *va, *nva;
 	unsigned long n_decay, pool_len;
 	int i;
@@ -2618,7 +2612,7 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 				break;
 
 			list_del_init(&va->list);
-			merge_or_add_vmap_area(va, &decay_root, &decay_list);
+			merge_or_add_vmap_area(va, &decay_list);
 		}
 
 		/*
@@ -5456,8 +5450,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 			/* It is a BUG(), but trigger recovery instead. */
 			goto recovery;
 
-		ret = va_clip(&free_vmap_area_root,
-			&free_vmap_area_list, va, start, size);
+		ret = va_clip(va, start, size);
 		if (WARN_ON_ONCE(unlikely(ret)))
 			/* It is a BUG(), but trigger recovery instead. */
 			goto recovery;

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths to maple_tree
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Move free_vmap_area_noflush, the lazy-purge processing
(__purge_vmap_area_lazy, purge_vmap_node, decay_va_pool_node,
kasan_release_vmalloc_node), and pcpu_get_vm_areas onto the
maple_tree helpers.

Per-node busy lookup (find_vmap_area, find_unlink_vmap_area,
find_vmap_area_exceed_addr_lock) and the vmap_block free path
likewise shift to vn->busy.mt.

pcpu_get_vm_areas keeps its top-down free-area walk; the new
free_vmap_area_prev() helper returns the previous free-area neighbour
via the address-sorted list, while enclose-address queries
(pvm_find_va_enclose_addr) move onto the maple_tree where supported.

After this patch, the augmented rb_tree is no longer consulted on the
allocation or free path. The address-sorted free_vmap_area_list is
still walked on the alloc path's list-based next-fit scan and on
neighbour traversal in pcpu_get_vm_areas.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 78 +++++++++++++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 40 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c5f509f033e6..f2117eafa9cf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2240,14 +2240,14 @@ static void free_vmap_area(struct vmap_area *va)
 	 * Remove from the busy tree/list.
 	 */
 	spin_lock(&vn->busy.lock);
-	unlink_va(va, &vn->busy.root);
+	unlink_vmap_area_busy_locked(va, vn);
 	spin_unlock(&vn->busy.lock);
 
 	/*
 	 * Insert/Merge it back to the free tree/list.
 	 */
 	spin_lock(&free_vmap_area_lock);
-	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
+	merge_or_add_vmap_area_free_locked(va);
 	spin_unlock(&free_vmap_area_lock);
 }
 
@@ -2431,12 +2431,13 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 		 * Only scan the relevant parts containing pointers to other objects
 		 * to avoid false negatives.
 		 */
-		kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
+		kmemleak_scan_area(&va->vm, SIZE_MAX, gfp_mask);
 	}
 
 retry:
 	if (IS_ERR_VALUE(addr)) {
 		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
+		try_init_free_mt_locked();
 		addr = __alloc_vmap_area(size, align, vstart, vend);
 		spin_unlock(&free_vmap_area_lock);
 
@@ -2479,7 +2480,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	vn = addr_to_node(va->va_start);
 
 	spin_lock(&vn->busy.lock);
-	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
+	insert_vmap_area_busy_locked(va, vn);
 	spin_unlock(&vn->busy.lock);
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
@@ -2575,8 +2576,7 @@ reclaim_list_global(struct list_head *head)
 
 	spin_lock(&free_vmap_area_lock);
 	list_for_each_entry_safe(va, n, head, list)
-		merge_or_add_vmap_area_augment(va,
-			&free_vmap_area_root, &free_vmap_area_list);
+		merge_or_add_vmap_area_free_locked(va);
 	spin_unlock(&free_vmap_area_lock);
 }
 
@@ -2719,12 +2719,13 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 		vn->skip_populate = full_pool_decay;
 		decay_va_pool_node(vn, full_pool_decay);
 
-		if (RB_EMPTY_ROOT(&vn->lazy.root))
+		spin_lock(&vn->lazy.lock);
+		if (lazy_vmap_areas_empty_locked(vn)) {
+			spin_unlock(&vn->lazy.lock);
 			continue;
+		}
 
-		spin_lock(&vn->lazy.lock);
-		WRITE_ONCE(vn->lazy.root.rb_node, NULL);
-		list_replace_init(&vn->lazy.head, &vn->purge_list);
+		move_lazy_vmap_areas_to_purge_locked(vn);
 		spin_unlock(&vn->lazy.lock);
 
 		start = min(start, list_first_entry(&vn->purge_list,
@@ -2823,7 +2824,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 		id_to_node(vn_id):addr_to_node(va->va_start);
 
 	spin_lock(&vn->lazy.lock);
-	insert_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
+	insert_vmap_area_lazy_locked(va, vn);
 	spin_unlock(&vn->lazy.lock);
 
 	trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
@@ -2873,7 +2874,7 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 		vn = &vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
-		va = __find_vmap_area(addr, &vn->busy.root);
+		va = find_vmap_area_busy_locked(addr, vn);
 		spin_unlock(&vn->busy.lock);
 
 		if (va)
@@ -2897,9 +2898,9 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 		vn = &vmap_nodes[i];
 
 		spin_lock(&vn->busy.lock);
-		va = __find_vmap_area(addr, &vn->busy.root);
+		va = find_vmap_area_busy_locked(addr, vn);
 		if (va)
-			unlink_va(va, &vn->busy.root);
+			unlink_vmap_area_busy_locked(va, vn);
 		spin_unlock(&vn->busy.lock);
 
 		if (va)
@@ -2955,8 +2956,8 @@ struct vmap_block_queue {
 
 	/*
 	 * An xarray requires an extra memory dynamically to
-	 * be allocated. If it is an issue, we can use rb-tree
-	 * instead.
+	 * be allocated. If it is an issue, switch to another
+	 * indexing data structure.
 	 */
 	struct xarray vmap_blocks;
 };
@@ -3133,7 +3134,7 @@ static void free_vmap_block(struct vmap_block *vb)
 
 	vn = addr_to_node(vb->va->va_start);
 	spin_lock(&vn->busy.lock);
-	unlink_va(vb->va, &vn->busy.root);
+	unlink_vmap_area_busy_locked(vb->va, vn);
 	spin_unlock(&vn->busy.lock);
 
 	free_vmap_area_noflush(vb->va);
@@ -5238,9 +5239,15 @@ void free_vm_area(struct vm_struct *area)
 EXPORT_SYMBOL_GPL(free_vm_area);
 
 #ifdef CONFIG_SMP
-static struct vmap_area *node_to_va(struct rb_node *n)
+static __always_inline struct vmap_area *
+free_vmap_area_prev(struct vmap_area *va)
 {
-	return rb_entry_safe(n, struct vmap_area, rb_node);
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (va->list.prev == &free_vmap_area_list)
+		return NULL;
+
+	return list_entry(va->list.prev, struct vmap_area, list);
 }
 
 /**
@@ -5255,26 +5262,19 @@ static struct vmap_area *node_to_va(struct rb_node *n)
 static struct vmap_area *
 pvm_find_va_enclose_addr(unsigned long addr)
 {
-	struct vmap_area *va, *tmp;
-	struct rb_node *n;
+	struct vmap_area *va;
 
-	n = free_vmap_area_root.rb_node;
-	va = NULL;
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	while (n) {
-		tmp = rb_entry(n, struct vmap_area, rb_node);
-		if (tmp->va_start <= addr) {
-			va = tmp;
-			if (tmp->va_end >= addr)
-				break;
+	if (free_mt_supported())
+		return __find_vmap_area_enclose_addr_mt(addr, &free_vmap_area_mt);
 
-			n = n->rb_right;
-		} else {
-			n = n->rb_left;
-		}
+	list_for_each_entry_reverse(va, &free_vmap_area_list, list) {
+		if (va->va_start <= addr)
+			return va;
 	}
 
-	return va;
+	return NULL;
 }
 
 /**
@@ -5419,7 +5419,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		 * If this VA does not fit, move base downwards and recheck.
 		 */
 		if (base + start < va->va_start) {
-			va = node_to_va(rb_prev(&va->rb_node));
+			va = free_vmap_area_prev(va);
 			base = pvm_determine_end_from_reverse(&va, align) - end;
 			term_area = area;
 			continue;
@@ -5474,7 +5474,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
 
 		spin_lock(&vn->busy.lock);
-		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
+		insert_vmap_area_busy_locked(vas[area], vn);
 		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
 		spin_unlock(&vn->busy.lock);
@@ -5501,8 +5501,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	while (area--) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root,
-				&free_vmap_area_list);
+		va = merge_or_add_vmap_area_free_locked(vas[area]);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end,
@@ -5552,8 +5551,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	for (area = 0; area < nr_vms; area++) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		va = merge_or_add_vmap_area_augment(vas[area], &free_vmap_area_root,
-				&free_vmap_area_list);
+		va = merge_or_add_vmap_area_free_locked(vas[area]);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end,

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (2 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

With every alloc/free/pcpu path now driven by maple_tree, retire the
augmented rb_tree from struct vmap_area and from include/linux/vmalloc.h.
The struct shrinks by 24 bytes on both x86_64 and arm64 (72 -> 48):

  -       struct rb_node rb_node;
  -       struct list_head list;
  -       union {
  -               unsigned long subtree_max_size;
  -               struct vm_struct *vm;
  -       };
  +       struct list_head list;
  +       struct vm_struct *vm;

Also allow maple_tree_init() to be called twice: vmalloc_init() runs
before start_kernel() reaches its own maple_tree_init() callsite,
and the maple_tree machinery needs to be live for the early seeding
done in vmap_init_free_space().  The second call becomes a no-op
once maple_node_cache is set.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 include/linux/vmalloc.h |   16 +-
 lib/maple_tree.c        |    7 +
 mm/vmalloc.c            | 1123 ++++++++++++++++++-----------------------------
 3 files changed, 436 insertions(+), 710 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index d87dc7f77f4e..642bca92b804 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -9,7 +9,6 @@
 #include <linux/list.h>
 #include <linux/llist.h>
 #include <asm/page.h>		/* pgprot_t */
-#include <linux/rbtree.h>
 #include <linux/overflow.h>
 
 #include <asm/vmalloc.h>
@@ -72,19 +71,8 @@ struct vmap_area {
 	unsigned long va_start;
 	unsigned long va_end;
 
-	struct rb_node rb_node;         /* address sorted rbtree */
-	struct list_head list;          /* address sorted list */
-
-	/*
-	 * The following two variables can be packed, because
-	 * a vmap_area object can be either:
-	 *    1) in "free" tree (root is free_vmap_area_root)
-	 *    2) or "busy" tree (root is vmap_area_root)
-	 */
-	union {
-		unsigned long subtree_max_size; /* in "free" tree */
-		struct vm_struct *vm;           /* in "busy" tree */
-	};
+	struct list_head list;          /* auxiliary linkage (pool/purge/lazy) */
+	struct vm_struct *vm;
 	unsigned long flags; /* mark type of vm_map_ram area */
 };
 
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index e52876435b77..f3474a107372 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5634,6 +5634,13 @@ void __init maple_tree_init(void)
 		.sheaf_capacity = 32,
 	};
 
+	/*
+	 * vmalloc_init() may need Maple allocations before start_kernel() reaches
+	 * its own maple_tree_init() callsite. Keep initialization idempotent.
+	 */
+	if (maple_node_cache)
+		return;
+
 	maple_node_cache = kmem_cache_create("maple_node",
 			sizeof(struct maple_node), &args,
 			SLAB_PANIC);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f2117eafa9cf..c908c1a0fcd4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -23,7 +23,6 @@
 #include <linux/kallsyms.h>
 #include <linux/list.h>
 #include <linux/notifier.h>
-#include <linux/rbtree.h>
 #include <linux/maple_tree.h>
 #include <linux/xarray.h>
 #include <linux/io.h>
@@ -36,7 +35,6 @@
 #include <linux/llist.h>
 #include <linux/uio.h>
 #include <linux/bitops.h>
-#include <linux/rbtree_augmented.h>
 #include <linux/overflow.h>
 #include <linux/pgtable.h>
 #include <linux/hugetlb.h>
@@ -881,22 +879,20 @@ static bool vmap_initialized __read_mostly;
 static struct kmem_cache *vmap_area_cachep;
 
 /*
- * This linked list stores free areas sorted by start address.
- * It gives O(1) access to neighbors for fast coalescing.
+ * Maple tree of free ranges.
  */
-static LIST_HEAD(free_vmap_area_list);
-/* Next-fit hint to avoid scanning from list head on each allocation. */
-static unsigned long free_vmap_alloc_hint __maybe_unused = 1;
-
-/*
- * Maple tree shadow of free_vmap_area_list. It is used for
- * address lookups and first-fit scans.
- */
-static struct rb_root free_vmap_area_root = RB_ROOT;
-static struct maple_tree free_vmap_area_mt __maybe_unused =
-	MTREE_INIT_EXT(free_vmap_area_mt, MT_FLAGS_LOCK_EXTERN, free_vmap_area_lock);
-static bool free_vmap_area_mt_enabled __maybe_unused;
-static bool free_vmap_area_mt_init_tried __maybe_unused;
+static struct maple_tree free_vmap_area_mt =
+	MTREE_INIT_EXT(free_vmap_area_mt,
+		       MT_FLAGS_LOCK_EXTERN | MT_FLAGS_ALLOC_RANGE,
+		       free_vmap_area_lock);
+static bool free_vmap_area_mt_enabled;
+static bool free_vmap_area_mt_init_tried;
+static struct maple_tree occupied_vmap_area_mt =
+	MTREE_INIT_EXT(occupied_vmap_area_mt,
+		       MT_FLAGS_LOCK_EXTERN | MT_FLAGS_ALLOC_RANGE,
+		       free_vmap_area_lock);
+static bool occupied_vmap_area_mt_enabled;
+static bool occupied_vmap_area_mt_init_tried;
 
 /*
  * Preload a CPU with one object for "no edge" split case. The
@@ -905,19 +901,11 @@ static bool free_vmap_area_mt_init_tried __maybe_unused;
  */
 static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
 
-/*
- * This structure defines a single, solid model where a list and
- * maple tree are part of one entity protected by the lock. Nodes are
- * sorted in ascending order, thus for O(1) access to left/right
- * neighbors a list is used as well as for sequential traversal.
- */
+/* Per-node ordered range index backed by Maple Tree. */
 struct mt_list {
-	struct rb_root root;
 	struct maple_tree mt;
-	struct list_head head;
 	spinlock_t lock;
 	bool mt_enabled;
-	bool mt_init_tried;
 };
 
 /*
@@ -1055,22 +1043,6 @@ va_size(struct vmap_area *va)
 	return (va->va_end - va->va_start);
 }
 
-/*
- * Transitional rb compatibility retained until all rb-only users are moved.
- * Follow-up patches in this RFC series remove these helpers.
- */
-static __always_inline unsigned long
-get_subtree_max_size(struct rb_node *node)
-{
-	struct vmap_area *va;
-
-	va = rb_entry_safe(node, struct vmap_area, rb_node);
-	return va ? va->subtree_max_size : 0;
-}
-
-RB_DECLARE_CALLBACKS_MAX(static, free_vmap_area_rb_augment_cb,
-	struct vmap_area, rb_node, unsigned long, subtree_max_size, va_size)
-
 static void reclaim_and_purge_vmap_areas(void);
 static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static void drain_vmap_area_work(struct work_struct *work);
@@ -1078,31 +1050,12 @@ static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);
 
 static __cacheline_aligned_in_smp atomic_long_t vmap_lazy_nr;
 
-/*
- * maple nodes are allocated from slab; defer tree population until
- * slab allocator is up to avoid early-boot failures.
- */
-static __always_inline bool vmap_mt_runtime_ready(void)
-{
-	return READ_ONCE(vmap_initialized) && slab_is_available();
-}
-
 static __always_inline bool free_mt_supported(void)
 {
 	return free_vmap_area_mt_enabled;
 }
 
-static __always_inline void disable_free_mt_locked(void)
-{
-	lockdep_assert_held(&free_vmap_area_lock);
-
-	if (free_vmap_area_mt_enabled) {
-		__mt_destroy(&free_vmap_area_mt);
-		free_vmap_area_mt_enabled = false;
-	}
-}
-
-static __always_inline void free_mt_store_va_locked(struct vmap_area *va)
+static __always_inline bool free_mt_store_va_locked(struct vmap_area *va)
 {
 	int err;
 
@@ -1110,12 +1063,20 @@ static __always_inline void free_mt_store_va_locked(struct vmap_area *va)
 
 	MA_STATE(mas, &free_vmap_area_mt, va->va_start, va->va_end - 1);
 
-	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-	if (WARN_ON_ONCE(err))
-		disable_free_mt_locked();
+	err = mas_preallocate(&mas, va, GFP_NOWAIT | __GFP_NOWARN);
+	if (!err) {
+		mas_store_prealloc(&mas, va);
+		mas_destroy(&mas);
+	} else {
+		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+		if (WARN_ON_ONCE(err))
+			return false;
+	}
+
+	return true;
 }
 
-static __always_inline void free_mt_erase_va_locked(struct vmap_area *va)
+static __always_inline bool free_mt_erase_va_locked(struct vmap_area *va)
 {
 	int err;
 
@@ -1125,10 +1086,12 @@ static __always_inline void free_mt_erase_va_locked(struct vmap_area *va)
 
 	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
 	if (WARN_ON_ONCE(err))
-		disable_free_mt_locked();
+		return false;
+
+	return true;
 }
 
-static __always_inline void
+static __always_inline bool
 free_mt_update_va_locked(struct vmap_area *va, unsigned long old_start,
 			 unsigned long old_end)
 {
@@ -1140,35 +1103,14 @@ free_mt_update_va_locked(struct vmap_area *va, unsigned long old_start,
 	MA_STATE(mas_store, &free_vmap_area_mt, va->va_start, va->va_end - 1);
 
 	err = mas_store_gfp(&mas_erase, NULL, GFP_ATOMIC | __GFP_NOWARN);
-	if (WARN_ON_ONCE(err)) {
-		disable_free_mt_locked();
-		return;
-	}
+	if (WARN_ON_ONCE(err))
+		return false;
 
 	err = mas_store_gfp(&mas_store, va, GFP_ATOMIC | __GFP_NOWARN);
 	if (WARN_ON_ONCE(err))
-		disable_free_mt_locked();
-}
-
-static void free_mt_rebuild_locked(void)
-{
-	struct vmap_area *va;
-	int err;
-
-	lockdep_assert_held(&free_vmap_area_lock);
-
-	__mt_destroy(&free_vmap_area_mt);
-	free_vmap_area_mt_enabled = true;
-
-	list_for_each_entry(va, &free_vmap_area_list, list) {
-		MA_STATE(mas, &free_vmap_area_mt, va->va_start, va->va_end - 1);
+		return false;
 
-		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-		if (WARN_ON_ONCE(err)) {
-			disable_free_mt_locked();
-			return;
-		}
-	}
+	return true;
 }
 
 static __always_inline void try_init_free_mt_locked(void)
@@ -1178,257 +1120,101 @@ static __always_inline void try_init_free_mt_locked(void)
 	if (free_vmap_area_mt_init_tried)
 		return;
 
-	if (!vmap_mt_runtime_ready())
+	if (!slab_is_available())
 		return;
 
 	free_vmap_area_mt_init_tried = true;
-	free_mt_rebuild_locked();
-}
-
-static __always_inline struct vmap_area *
-__find_vmap_area_list(unsigned long addr, struct list_head *head)
-{
-	struct vmap_area *va;
-
-	addr = (unsigned long)kasan_reset_tag((void *)addr);
-
-	list_for_each_entry(va, head, list) {
-		if (addr < va->va_start)
-			break;
-		if (addr < va->va_end)
-			return va;
-	}
-
-	return NULL;
-}
-
-static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root)
-{
-	struct rb_node *n = root->rb_node;
-
-	addr = (unsigned long)kasan_reset_tag((void *)addr);
-
-	while (n) {
-		struct vmap_area *va;
-
-		va = rb_entry(n, struct vmap_area, rb_node);
-		if (addr < va->va_start)
-			n = n->rb_left;
-		else if (addr >= va->va_end)
-			n = n->rb_right;
-		else
-			return va;
-	}
-
-	return NULL;
+	free_vmap_area_mt_enabled = true;
 }
 
-/* Look up the first VA which satisfies addr < va_end, NULL if none. */
-static __always_inline struct vmap_area *
-__find_vmap_area_exceed_addr_list(unsigned long addr, struct list_head *head)
+static __always_inline bool occupied_mt_supported(void)
 {
-	struct vmap_area *va;
-
-	addr = (unsigned long)kasan_reset_tag((void *)addr);
-
-	list_for_each_entry(va, head, list) {
-		if (va->va_end > addr)
-			return va;
-	}
-
-	return NULL;
+	return occupied_vmap_area_mt_enabled;
 }
 
-static __always_inline struct list_head *
-find_vmap_area_insert_point_list(struct vmap_area *va, struct list_head *head)
+static __always_inline void try_init_occupied_mt_locked(void)
 {
-	struct vmap_area *tmp;
-	struct list_head *next = head;
-
-	list_for_each_entry(tmp, head, list) {
-		if (tmp->va_start > va->va_start) {
-			next = &tmp->list;
-			break;
-		}
-	}
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	if (next != head) {
-		tmp = list_entry(next, struct vmap_area, list);
-		if (WARN_ON_ONCE(va->va_end > tmp->va_start))
-			return NULL;
-	}
+	if (occupied_vmap_area_mt_init_tried)
+		return;
 
-	if (next->prev != head) {
-		tmp = list_entry(next->prev, struct vmap_area, list);
-		if (WARN_ON_ONCE(va->va_start < tmp->va_end))
-			return NULL;
-	}
+	if (!slab_is_available())
+		return;
 
-	return next;
+	occupied_vmap_area_mt_init_tried = true;
+	occupied_vmap_area_mt_enabled = true;
 }
 
-/*
- * Use maple-tree neighbour lookup to locate insertion point in O(log n),
- * while preserving sorted-list neighbour traversal.
- */
-static __always_inline struct list_head *
-find_vmap_area_insert_point_mt(struct vmap_area *va, struct maple_tree *tree,
-			       struct list_head *head)
+static __always_inline bool
+occupied_mt_store_range_locked(unsigned long start, unsigned long end)
 {
-	struct vmap_area *prev, *next;
-	struct list_head *next_link;
+	int err;
 
-	MA_STATE(mas, tree, va->va_start, va->va_start);
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	mas_set(&mas, va->va_start);
-	next = mas_find(&mas, ULONG_MAX);
+	if (WARN_ON_ONCE(!occupied_mt_supported()))
+		return false;
 
-	if (next) {
-		if (WARN_ON_ONCE(next->va_start <= va->va_start))
-			return NULL;
-		if (WARN_ON_ONCE(va->va_end > next->va_start))
-			return NULL;
-		next_link = &next->list;
-	} else {
-		next_link = head;
-	}
+	MA_STATE(mas, &occupied_vmap_area_mt, start, end - 1);
 
-	if (next_link->prev != head) {
-		prev = list_entry(next_link->prev, struct vmap_area, list);
-		if (WARN_ON_ONCE(va->va_start < prev->va_end))
-			return NULL;
+	err = mas_preallocate(&mas, XA_ZERO_ENTRY, GFP_NOWAIT | __GFP_NOWARN);
+	if (!err) {
+		mas_store_prealloc(&mas, XA_ZERO_ENTRY);
+		mas_destroy(&mas);
+		return true;
 	}
 
-	return next_link;
+	err = mas_store_gfp(&mas, XA_ZERO_ENTRY, GFP_ATOMIC | __GFP_NOWARN);
+	return !WARN_ON_ONCE(err);
 }
 
 static __always_inline bool
-insert_vmap_area_list_sorted(struct vmap_area *va, struct list_head *head)
+occupied_mt_erase_range_locked(unsigned long start, unsigned long end)
 {
-	struct list_head *next;
-
-	next = find_vmap_area_insert_point_list(va, head);
-	if (!next)
-		return false;
-
-	list_add_tail(&va->list, next);
-	return true;
-}
+	int err;
 
-static __always_inline bool
-insert_vmap_area_list_sorted_mt(struct vmap_area *va, struct maple_tree *tree,
-				struct list_head *head)
-{
-	struct list_head *next;
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	next = find_vmap_area_insert_point_mt(va, tree, head);
-	if (!next)
+	if (WARN_ON_ONCE(!occupied_mt_supported()))
 		return false;
 
-	list_add_tail(&va->list, next);
-	return true;
-}
+	MA_STATE(mas, &occupied_vmap_area_mt, start, end - 1);
 
-static __always_inline void
-disable_busy_mt_locked(struct vmap_node *vn)
-{
-	lockdep_assert_held(&vn->busy.lock);
-
-	if (vn->busy.mt_enabled) {
-		__mt_destroy(&vn->busy.mt);
-		vn->busy.mt_enabled = false;
-	}
-
-	vn->busy.mt_init_tried = true;
-}
-
-static __always_inline void
-disable_lazy_mt_locked(struct vmap_node *vn)
-{
-	lockdep_assert_held(&vn->lazy.lock);
-
-	if (vn->lazy.mt_enabled) {
-		__mt_destroy(&vn->lazy.mt);
-		vn->lazy.mt_enabled = false;
-	}
-
-	vn->lazy.mt_init_tried = true;
+	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
+	return !WARN_ON_ONCE(err);
 }
 
-static void
-busy_mt_rebuild_locked(struct vmap_node *vn)
+static __always_inline bool
+occupied_mt_erase_va_locked(struct vmap_area *va)
 {
-	struct vmap_area *va;
-	int err;
-
-	lockdep_assert_held(&vn->busy.lock);
-
-	__mt_destroy(&vn->busy.mt);
-	vn->busy.mt_enabled = true;
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	list_for_each_entry(va, &vn->busy.head, list) {
-		MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
+	if (!occupied_mt_supported())
+		return true;
 
-		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-		if (WARN_ON_ONCE(err)) {
-			disable_busy_mt_locked(vn);
-			return;
-		}
-	}
+	return occupied_mt_erase_range_locked(va->va_start, va->va_end);
 }
 
 static __always_inline void
 try_init_busy_mt_locked(struct vmap_node *vn)
 {
 	lockdep_assert_held(&vn->busy.lock);
-
-	if (vn->busy.mt_init_tried)
-		return;
-
-	if (!vmap_mt_runtime_ready())
-		return;
-
-	vn->busy.mt_init_tried = true;
-	busy_mt_rebuild_locked(vn);
-}
-
-static void
-lazy_mt_rebuild_locked(struct vmap_node *vn)
-{
-	struct vmap_area *va;
-	int err;
-
-	lockdep_assert_held(&vn->lazy.lock);
-
-	__mt_destroy(&vn->lazy.mt);
-	vn->lazy.mt_enabled = true;
-
-	list_for_each_entry(va, &vn->lazy.head, list) {
-		MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
-
-		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-		if (WARN_ON_ONCE(err)) {
-			disable_lazy_mt_locked(vn);
-			return;
-		}
-	}
+	WARN_ON_ONCE(!vn->busy.mt_enabled);
 }
 
 static __always_inline void
 try_init_lazy_mt_locked(struct vmap_node *vn)
 {
 	lockdep_assert_held(&vn->lazy.lock);
-
-	if (vn->lazy.mt_init_tried)
-		return;
-
-	if (!vmap_mt_runtime_ready())
-		return;
-
-	vn->lazy.mt_init_tried = true;
-	lazy_mt_rebuild_locked(vn);
+	WARN_ON_ONCE(!vn->lazy.mt_enabled);
 }
 
+/*
+ * Busy/lazy lookup paths remain lock-based to preserve pointer lifetime
+ * semantics.
+ */
+
 static __always_inline struct vmap_area *
 __find_vmap_area_mt(unsigned long addr, struct maple_tree *tree)
 {
@@ -1462,6 +1248,24 @@ __find_vmap_area_enclose_addr_mt(unsigned long addr, struct maple_tree *tree)
 	return mas_find_rev(&mas, 0);
 }
 
+static __always_inline bool
+validate_vmap_area_range_insert_mt_locked(struct maple_tree *tree,
+					  unsigned long start,
+					  unsigned long end)
+{
+	struct vmap_area *left, *right;
+
+	left = __find_vmap_area_enclose_addr_mt(start, tree);
+	if (left && WARN_ON_ONCE(left->va_end > start))
+		return false;
+
+	right = __find_vmap_area_exceed_addr_mt(start, tree);
+	if (right && WARN_ON_ONCE(right->va_start < end))
+		return false;
+
+	return true;
+}
+
 static __always_inline struct vmap_area *
 find_vmap_area_busy_locked(unsigned long addr, struct vmap_node *vn)
 {
@@ -1469,10 +1273,10 @@ find_vmap_area_busy_locked(unsigned long addr, struct vmap_node *vn)
 
 	try_init_busy_mt_locked(vn);
 
-	if (likely(vn->busy.mt_enabled))
-		return __find_vmap_area_mt(addr, &vn->busy.mt);
+	if (WARN_ON_ONCE(!vn->busy.mt_enabled))
+		return NULL;
 
-	return __find_vmap_area_list(addr, &vn->busy.head);
+	return __find_vmap_area_mt(addr, &vn->busy.mt);
 }
 
 static __always_inline struct vmap_area *
@@ -1482,10 +1286,10 @@ find_vmap_area_exceed_addr_busy_locked(unsigned long addr, struct vmap_node *vn)
 
 	try_init_busy_mt_locked(vn);
 
-	if (likely(vn->busy.mt_enabled))
-		return __find_vmap_area_exceed_addr_mt(addr, &vn->busy.mt);
+	if (WARN_ON_ONCE(!vn->busy.mt_enabled))
+		return NULL;
 
-	return __find_vmap_area_exceed_addr_list(addr, &vn->busy.head);
+	return __find_vmap_area_exceed_addr_mt(addr, &vn->busy.mt);
 }
 
 /*
@@ -1544,22 +1348,27 @@ insert_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 
 	try_init_busy_mt_locked(vn);
 
-	if (likely(vn->busy.mt_enabled)) {
-		MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
+	if (WARN_ON_ONCE(!vn->busy.mt_enabled))
+		return;
 
-		if (!insert_vmap_area_list_sorted_mt(va, &vn->busy.mt,
-						     &vn->busy.head))
-			return;
+	if (!validate_vmap_area_range_insert_mt_locked(&vn->busy.mt,
+						       va->va_start,
+						       va->va_end))
+		return;
 
-		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-		if (WARN_ON_ONCE(err))
-			disable_busy_mt_locked(vn);
+	INIT_LIST_HEAD(&va->list);
+
+	MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
 
+	err = mas_preallocate(&mas, va, GFP_NOWAIT | __GFP_NOWARN);
+	if (!err) {
+		mas_store_prealloc(&mas, va);
+		mas_destroy(&mas);
 		return;
 	}
 
-	if (!insert_vmap_area_list_sorted(va, &vn->busy.head))
-		return;
+	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+	WARN_ON_ONCE(err);
 }
 
 static __always_inline void
@@ -1569,18 +1378,17 @@ unlink_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 
 	lockdep_assert_held(&vn->busy.lock);
 
-	MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
-
-	list_del_init(&va->list);
-
 	try_init_busy_mt_locked(vn);
 
-	if (unlikely(!vn->busy.mt_enabled))
+	if (WARN_ON_ONCE(!vn->busy.mt_enabled))
 		return;
 
+	MA_STATE(mas, &vn->busy.mt, va->va_start, va->va_end - 1);
+
 	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
-	if (WARN_ON_ONCE(err))
-		disable_busy_mt_locked(vn);
+	WARN_ON_ONCE(err);
+
+	INIT_LIST_HEAD(&va->list);
 }
 
 static __always_inline void
@@ -1591,23 +1399,27 @@ insert_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 	lockdep_assert_held(&vn->lazy.lock);
 
 	try_init_lazy_mt_locked(vn);
+	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
+		return;
 
-	if (likely(vn->lazy.mt_enabled)) {
-		MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
+	if (!validate_vmap_area_range_insert_mt_locked(&vn->lazy.mt,
+						       va->va_start,
+						       va->va_end))
+		return;
 
-		if (!insert_vmap_area_list_sorted_mt(va, &vn->lazy.mt,
-						     &vn->lazy.head))
-			return;
+	INIT_LIST_HEAD(&va->list);
 
-		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-		if (WARN_ON_ONCE(err))
-			disable_lazy_mt_locked(vn);
+	MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
 
+	err = mas_preallocate(&mas, va, GFP_NOWAIT | __GFP_NOWARN);
+	if (!err) {
+		mas_store_prealloc(&mas, va);
+		mas_destroy(&mas);
 		return;
 	}
 
-	if (!insert_vmap_area_list_sorted(va, &vn->lazy.head))
-		return;
+	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+	WARN_ON_ONCE(err);
 }
 
 static __always_inline bool
@@ -1616,60 +1428,56 @@ lazy_vmap_areas_empty_locked(struct vmap_node *vn)
 	lockdep_assert_held(&vn->lazy.lock);
 
 	try_init_lazy_mt_locked(vn);
+	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
+		return true;
 
-	if (likely(vn->lazy.mt_enabled))
-		return mtree_empty(&vn->lazy.mt);
-
-	return list_empty(&vn->lazy.head);
+	return mtree_empty(&vn->lazy.mt);
 }
 
 static __always_inline void
 move_lazy_vmap_areas_to_purge_locked(struct vmap_node *vn)
 {
 	struct vmap_area *va;
-	int err;
 
 	lockdep_assert_held(&vn->lazy.lock);
 
 	try_init_lazy_mt_locked(vn);
+	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
+		return;
 
-	if (likely(vn->lazy.mt_enabled)) {
-		list_for_each_entry(va, &vn->lazy.head, list) {
-			MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
-
-			err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
-			if (WARN_ON_ONCE(err)) {
-				disable_lazy_mt_locked(vn);
-				break;
-			}
-		}
+	MA_STATE(mas, &vn->lazy.mt, 0, 0);
 
-		if (vn->lazy.mt_enabled && WARN_ON_ONCE(!mtree_empty(&vn->lazy.mt)))
-			disable_lazy_mt_locked(vn);
-	}
+	mas_for_each(&mas, va, ULONG_MAX)
+		list_add_tail(&va->list, &vn->purge_list);
 
-	list_replace_init(&vn->lazy.head, &vn->purge_list);
+	__mt_destroy(&vn->lazy.mt);
+	mt_init_flags(&vn->lazy.mt, MT_FLAGS_LOCK_EXTERN);
+	mt_set_external_lock(&vn->lazy.mt, &vn->lazy.lock);
+	vn->lazy.mt_enabled = true;
 }
 
 static __always_inline bool
 insert_vmap_area_free_locked(struct vmap_area *va)
 {
+	struct vmap_area *prev, *next;
+
 	lockdep_assert_held(&free_vmap_area_lock);
 
 	try_init_free_mt_locked();
 
-	if (likely(free_mt_supported())) {
-		if (!insert_vmap_area_list_sorted_mt(va, &free_vmap_area_mt,
-						     &free_vmap_area_list))
-			return false;
+	if (unlikely(!free_mt_supported()))
+		return false;
 
-		free_mt_store_va_locked(va);
-	} else {
-		if (!insert_vmap_area_list_sorted(va, &free_vmap_area_list))
-			return false;
-	}
+	prev = __find_vmap_area_enclose_addr_mt(va->va_start, &free_vmap_area_mt);
+	if (prev && WARN_ON_ONCE(prev->va_end > va->va_start))
+		return false;
 
-	return true;
+	next = __find_vmap_area_exceed_addr_mt(va->va_start, &free_vmap_area_mt);
+	if (next && WARN_ON_ONCE(next->va_start < va->va_end))
+		return false;
+
+	INIT_LIST_HEAD(&va->list);
+	return free_mt_store_va_locked(va);
 }
 
 static __always_inline void
@@ -1677,193 +1485,56 @@ unlink_vmap_area_free_locked(struct vmap_area *va)
 {
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	if (WARN_ON_ONCE(list_empty(&va->list)))
+	if (unlikely(!free_mt_supported()))
 		return;
 
-	if (likely(free_mt_supported()))
-		free_mt_erase_va_locked(va);
-
-	list_del_init(&va->list);
-}
-
-/*
- * Merge de-allocated chunk of VA memory with previous
- * and next free blocks. If coalesce is not done a new
- * free area is inserted. If VA has been merged, it is
- * freed.
- *
- * Please note, it can return NULL in case of overlap
- * ranges, followed by WARN() report. Despite it is a
- * buggy behaviour, a system can be alive and keep
- * ongoing.
- */
-static __always_inline struct vmap_area *
-__merge_or_add_vmap_area(struct vmap_area *va, struct list_head *head, bool update_mt)
-{
-	struct vmap_area *sibling;
-	struct list_head *next;
-	unsigned long old_start, old_end;
-	bool merged = false;
-
-	if (update_mt && free_mt_supported())
-		next = find_vmap_area_insert_point_mt(va, &free_vmap_area_mt, head);
-	else
-		next = find_vmap_area_insert_point_list(va, head);
-
-	if (!next)
-		return NULL;
-
-	/*
-	 * start            end
-	 * |                |
-	 * |<------VA------>|<-----Next----->|
-	 *                  |                |
-	 *                  start            end
-	 */
-	if (next != head) {
-		sibling = list_entry(next, struct vmap_area, list);
-		if (sibling->va_start == va->va_end) {
-			old_start = sibling->va_start;
-			old_end = sibling->va_end;
-			sibling->va_start = va->va_start;
-			if (update_mt && free_mt_supported())
-				free_mt_update_va_locked(sibling, old_start, old_end);
-
-			/* Free vmap_area object. */
-			kmem_cache_free(vmap_area_cachep, va);
-
-			/* Point to the new merged area. */
-			va = sibling;
-			merged = true;
-		}
-	}
-
-	/*
-	 * start            end
-	 * |                |
-	 * |<-----Prev----->|<------VA------>|
-	 *                  |                |
-	 *                  start            end
-	 */
-	if (next->prev != head) {
-		sibling = list_entry(next->prev, struct vmap_area, list);
-		if (sibling->va_end == va->va_start) {
-			/*
-			 * If both neighbors are coalesced, it is important
-			 * to unlink the "next" node first, followed by merging
-			 * with "previous" one.
-			 */
-			if (merged) {
-				if (update_mt)
-					unlink_vmap_area_free_locked(va);
-				else
-					list_del_init(&va->list);
-			}
-
-			old_start = sibling->va_start;
-			old_end = sibling->va_end;
-			sibling->va_end = va->va_end;
-			if (update_mt && free_mt_supported())
-				free_mt_update_va_locked(sibling, old_start, old_end);
-
-			/* Free vmap_area object. */
-			kmem_cache_free(vmap_area_cachep, va);
-
-			/* Point to the new merged area. */
-			va = sibling;
-			merged = true;
-		}
-	}
-
-	if (!merged) {
-		if (update_mt)
-			insert_vmap_area_free_locked(va);
-		else
-			list_add_tail(&va->list, next);
-	}
+	if (!free_mt_erase_va_locked(va))
+		return;
 
-	return va;
-}
-
-static __always_inline struct vmap_area *
-merge_or_add_vmap_area(struct vmap_area *va,
-	struct list_head *head)
-{
-	return __merge_or_add_vmap_area(va, head, false);
+	INIT_LIST_HEAD(&va->list);
 }
 
 static __always_inline struct vmap_area *
 merge_or_add_vmap_area_free_locked(struct vmap_area *va)
 {
+	struct vmap_area *left, *right;
+	unsigned long new_start, new_end;
+
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	va = __merge_or_add_vmap_area(va, &free_vmap_area_list, true);
-	if (va && va->va_start < free_vmap_alloc_hint)
-		free_vmap_alloc_hint = va->va_start;
+	if (unlikely(!free_mt_supported()))
+		return NULL;
 
-	return va;
-}
+	new_start = va->va_start;
+	new_end = va->va_end;
 
-/*
- * Transitional wrappers retained until all legacy rb call sites are switched.
- * Follow-up patches in this series remove these wrappers.
- */
-static __always_inline void
-insert_vmap_area(struct vmap_area *va, struct rb_root *root,
-		 struct list_head *head)
-{
-	struct vmap_node *vn = addr_to_node(va->va_start);
+	left = __find_vmap_area_enclose_addr_mt(new_start, &free_vmap_area_mt);
+	if (left && WARN_ON_ONCE(left->va_end > new_start))
+		return NULL;
 
-	if (head == &free_vmap_area_list) {
-		insert_vmap_area_free_locked(va);
-		return;
+	if (left && left->va_end == new_start) {
+		new_start = left->va_start;
+		unlink_vmap_area_free_locked(left);
+		kmem_cache_free(vmap_area_cachep, left);
 	}
 
-	if (head == &vn->lazy.head) {
-		insert_vmap_area_lazy_locked(va, vn);
-		return;
-	}
-
-	insert_vmap_area_busy_locked(va, vn);
-}
-
-static __always_inline void
-insert_vmap_area_augment(struct vmap_area *va, struct rb_node *from,
-			 struct rb_root *root, struct list_head *head)
-{
-	insert_vmap_area(va, root, head);
-}
-
-static __always_inline void unlink_va(struct vmap_area *va, struct rb_root *root)
-{
-	struct vmap_node *vn = addr_to_node(va->va_start);
+	right = __find_vmap_area_exceed_addr_mt(new_start, &free_vmap_area_mt);
+	if (right && WARN_ON_ONCE(right->va_start < new_end))
+		return NULL;
 
-	if (root == &free_vmap_area_root) {
-		unlink_vmap_area_free_locked(va);
-		return;
+	if (right && right->va_start == new_end) {
+		new_end = right->va_end;
+		unlink_vmap_area_free_locked(right);
+		kmem_cache_free(vmap_area_cachep, right);
 	}
 
-	unlink_vmap_area_busy_locked(va, vn);
-}
+	va->va_start = new_start;
+	va->va_end = new_end;
 
-static __always_inline void
-unlink_va_augment(struct vmap_area *va, struct rb_root *root)
-{
-	unlink_va(va, root);
-}
-
-static __always_inline void augment_tree_propagate_from(struct vmap_area *va)
-{
-}
-
-static __always_inline struct vmap_area *
-merge_or_add_vmap_area_augment(struct vmap_area *va, struct rb_root *root,
-			       struct list_head *head)
-{
-	if (head == &free_vmap_area_list)
-		return merge_or_add_vmap_area_free_locked(va);
+	if (!insert_vmap_area_free_locked(va))
+		return NULL;
 
-	return merge_or_add_vmap_area(va, head);
+	return va;
 }
 
 static __always_inline bool
@@ -1885,86 +1556,57 @@ is_within_this_va(struct vmap_area *va, unsigned long size,
 	return (nva_start_addr + size <= va->va_end);
 }
 
-static __always_inline struct vmap_area *
-find_vmap_lowest_match_list(struct list_head *head, unsigned long size,
-			    unsigned long align, unsigned long vstart)
+static __always_inline bool
+occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
+				    unsigned long size, unsigned long align,
+				    unsigned long *addr)
 {
-	struct vmap_area *va;
+	MA_STATE(mas, &occupied_vmap_area_mt, 0, 0);
+	unsigned long search = min;
+	unsigned long hole_end;
 
-	list_for_each_entry(va, head, list) {
-		if (!is_within_this_va(va, size, align, vstart))
-			continue;
+	while (search <= max) {
+		unsigned long candidate, candidate_end;
 
-		return va;
-	}
+		mas_set(&mas, search);
+		if (mas_empty_area(&mas, search, max, size))
+			return false;
 
-	return NULL;
-}
+		hole_end = min(mas.last, max);
+		candidate = ALIGN(mas.index, align);
+		if (candidate < mas.index)
+			return false;
 
-static __always_inline unsigned long
-clamp_vmap_alloc_hint(unsigned long hint, unsigned long vstart,
-		      unsigned long vend)
-{
-	if (hint < vstart || hint >= vend)
-		return vstart;
+		if (check_add_overflow(candidate, size - 1, &candidate_end))
+			return false;
 
-	return hint;
-}
+		if (candidate >= search && candidate_end <= hole_end) {
+			*addr = candidate;
+			return true;
+		}
 
-/*
- * Next-fit scan with wrap-around. Use maple to jump to the first candidate
- * around the hint in O(log n), then continue on the ordered list for cheap
- * neighbour traversal and deterministic coalescing behaviour.
- */
-static __always_inline struct vmap_area *
-find_vmap_match_list_next_fit(struct list_head *head, unsigned long size,
-			      unsigned long align, unsigned long vstart,
-			      unsigned long vend)
-{
-	struct vmap_area *va, *start = NULL;
-	unsigned long hint;
-	bool wrapped;
-
-	hint = clamp_vmap_alloc_hint(free_vmap_alloc_hint, vstart, vend);
-
-	if (hint != vstart) {
-		if (free_mt_supported())
-			start = __find_vmap_area_exceed_addr_mt(hint,
-								&free_vmap_area_mt);
-
-		if (start) {
-			va = start;
-			list_for_each_entry_from(va, head, list) {
-				if (is_within_this_va(va, size, align, hint))
-					return va;
-			}
-		} else {
-			list_for_each_entry(va, head, list) {
-				if (va->va_end <= hint)
-					continue;
+		if (hole_end == ULONG_MAX)
+			return false;
 
-				if (is_within_this_va(va, size, align, hint))
-					return va;
-			}
-		}
+		search = hole_end + 1;
 	}
 
-	wrapped = (hint != vstart);
-	list_for_each_entry(va, head, list) {
-		if (wrapped) {
-			if (start && va == start)
-				break;
-			if (!start && va->va_start >= hint)
-				break;
-		}
+	return false;
+}
 
-		if (is_within_this_va(va, size, align, vstart))
-			return va;
-	}
+static __always_inline unsigned long
+occupied_mt_find_hole_lowest_locked(unsigned long size, unsigned long align,
+				    unsigned long vstart, unsigned long vend)
+{
+	unsigned long addr;
 
-	return NULL;
+	if (occupied_mt_find_hole_window_locked(vstart, vend - 1, size, align, &addr))
+		return addr;
+
+	return -ENOENT;
 }
 
+/* Lowest-match scan directly on maple ordered traversal. */
 static __always_inline struct vmap_area *
 find_vmap_lowest_match_mt(struct maple_tree *tree, unsigned long size,
 			  unsigned long align, unsigned long vstart)
@@ -1989,24 +1631,26 @@ find_vmap_lowest_match_mt(struct maple_tree *tree, unsigned long size,
 #include <linux/random.h>
 
 static struct vmap_area *
-find_vmap_lowest_linear_match(struct list_head *head, unsigned long size,
-	unsigned long align, unsigned long vstart)
+find_vmap_lowest_linear_match(struct maple_tree *tree, unsigned long size,
+			      unsigned long align, unsigned long vstart)
 {
+	MA_STATE(mas, tree, vstart, vstart);
 	struct vmap_area *va;
 
-	list_for_each_entry(va, head, list) {
+	mas_set(&mas, vstart);
+	va = mas_find(&mas, ULONG_MAX);
+	while (va) {
 		if (!is_within_this_va(va, size, align, vstart))
-			continue;
-
-		return va;
+			va = mas_next(&mas, ULONG_MAX);
+		else
+			return va;
 	}
 
-	return NULL;
+	return va;
 }
 
 static void
-find_vmap_lowest_match_check(struct list_head *head, unsigned long size,
-			     unsigned long align)
+find_vmap_lowest_match_check(unsigned long size, unsigned long align)
 {
 	struct vmap_area *va_1, *va_2;
 	unsigned long vstart;
@@ -2015,11 +1659,8 @@ find_vmap_lowest_match_check(struct list_head *head, unsigned long size,
 	get_random_bytes(&rnd, sizeof(rnd));
 	vstart = VMALLOC_START + rnd;
 
-	if (free_mt_supported())
-		va_1 = find_vmap_lowest_match_mt(&free_vmap_area_mt, size, align, vstart);
-	else
-		va_1 = find_vmap_lowest_linear_match(head, size, align, vstart);
-	va_2 = find_vmap_lowest_linear_match(head, size, align, vstart);
+	va_1 = find_vmap_lowest_match_mt(&free_vmap_area_mt, size, align, vstart);
+	va_2 = find_vmap_lowest_linear_match(&free_vmap_area_mt, size, align, vstart);
 
 	if (va_1 != va_2)
 		pr_emerg("not lowest: t: 0x%p, l: 0x%p, v: 0x%lx\n",
@@ -2153,39 +1794,38 @@ va_clip(struct vmap_area *va, unsigned long nva_start_addr,
 	}
 
 	if (type != FL_FIT_TYPE) {
-		if (free_mt_supported())
-			free_mt_update_va_locked(va, old_start, old_end);
+		if (free_mt_supported() &&
+		    !free_mt_update_va_locked(va, old_start, old_end))
+			return -ENOMEM;
 
-		if (lva)	/* type == NE_FIT_TYPE */
-			insert_vmap_area_free_locked(lva);
+		if (lva && !insert_vmap_area_free_locked(lva)) {
+			kmem_cache_free(vmap_area_cachep, lva);
+			return -ENOMEM;
+		}
 	}
 
 	return 0;
 }
 
-static unsigned long
-va_alloc(struct vmap_area *va,
-		unsigned long size, unsigned long align,
-		unsigned long vstart, unsigned long vend)
+static __always_inline bool
+restore_allocated_vmap_range_free_locked(unsigned long start, unsigned long end)
 {
-	unsigned long nva_start_addr;
-	int ret;
+	struct vmap_area *va;
 
-	if (va->va_start > vstart)
-		nva_start_addr = ALIGN(va->va_start, align);
-	else
-		nva_start_addr = ALIGN(vstart, align);
+	lockdep_assert_held(&free_vmap_area_lock);
 
-	/* Check the "vend" restriction. */
-	if (nva_start_addr + size > vend)
-		return -ERANGE;
+	va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+	if (!va)
+		return false;
 
-	/* Update the free vmap_area. */
-	ret = va_clip(va, nva_start_addr, size);
-	if (WARN_ON_ONCE(ret))
-		return ret;
+	va->va_start = start;
+	va->va_end = end;
+	if (!insert_vmap_area_free_locked(va)) {
+		kmem_cache_free(vmap_area_cachep, va);
+		return false;
+	}
 
-	return nva_start_addr;
+	return true;
 }
 
 /*
@@ -2196,34 +1836,42 @@ static __always_inline unsigned long
 __alloc_vmap_area(unsigned long size, unsigned long align,
 		  unsigned long vstart, unsigned long vend)
 {
+	int ret;
 	unsigned long nva_start_addr;
+	unsigned long nva_end_addr;
 	struct vmap_area *va;
 
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	/*
-	 * Next-fit with wrap-around lowers repeated list-head scans in
-	 * high-churn workloads.
-	 */
-	va = find_vmap_match_list_next_fit(&free_vmap_area_list, size, align,
-					   vstart, vend);
+	try_init_occupied_mt_locked();
 
-	if (unlikely(!va))
+	if (WARN_ON_ONCE(!occupied_mt_supported()))
 		return -ENOENT;
 
-	nva_start_addr = va_alloc(va, size, align, vstart, vend);
-	if (!IS_ERR_VALUE(nva_start_addr)) {
-		unsigned long next_hint;
+	nva_start_addr = occupied_mt_find_hole_lowest_locked(size, align,
+							     vstart, vend);
+	if (IS_ERR_VALUE(nva_start_addr))
+		return nva_start_addr;
+	nva_end_addr = nva_start_addr + size;
 
-		if (check_add_overflow(nva_start_addr, size, &next_hint))
-			free_vmap_alloc_hint = vstart;
-		else
-			free_vmap_alloc_hint = next_hint;
+	va = __find_vmap_area_mt(nva_start_addr, &free_vmap_area_mt);
+	if (WARN_ON_ONCE(!va))
+		return -ENOENT;
+
+	ret = va_clip(va, nva_start_addr, size);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	if (!occupied_mt_store_range_locked(nva_start_addr, nva_end_addr)) {
+		bool restored;
+
+		restored = restore_allocated_vmap_range_free_locked(nva_start_addr, nva_end_addr);
+		WARN_ON_ONCE(!restored);
+		return -ENOMEM;
 	}
 
 #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
-	if (!IS_ERR_VALUE(nva_start_addr))
-		find_vmap_lowest_match_check(&free_vmap_area_list, size, align);
+	find_vmap_lowest_match_check(size, align);
 #endif
 
 	return nva_start_addr;
@@ -2247,7 +1895,8 @@ static void free_vmap_area(struct vmap_area *va)
 	 * Insert/Merge it back to the free tree/list.
 	 */
 	spin_lock(&free_vmap_area_lock);
-	merge_or_add_vmap_area_free_locked(va);
+	WARN_ON_ONCE(!occupied_mt_erase_va_locked(va));
+	WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va));
 	spin_unlock(&free_vmap_area_lock);
 }
 
@@ -2566,24 +2215,36 @@ static DEFINE_MUTEX(vmap_purge_lock);
 /* for per-CPU blocks */
 static void purge_fragmented_blocks_allcpus(void);
 
-static void
-reclaim_list_global(struct list_head *head)
+static bool
+reclaim_list_global(struct list_head *head, bool erase_occupied,
+		    struct list_head *failed)
 {
 	struct vmap_area *va, *n;
+	bool ok = true;
 
 	if (list_empty(head))
-		return;
+		return true;
 
 	spin_lock(&free_vmap_area_lock);
-	list_for_each_entry_safe(va, n, head, list)
-		merge_or_add_vmap_area_free_locked(va);
+	list_for_each_entry_safe(va, n, head, list) {
+		list_del_init(&va->list);
+		if (erase_occupied)
+			WARN_ON_ONCE(!occupied_mt_erase_va_locked(va));
+		if (WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va))) {
+			list_add_tail(&va->list, failed);
+			ok = false;
+		}
+	}
 	spin_unlock(&free_vmap_area_lock);
+
+	return ok;
 }
 
 static void
 decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 {
 	LIST_HEAD(decay_list);
+	LIST_HEAD(decay_failed);
 	struct vmap_area *va, *nva;
 	unsigned long n_decay, pool_len;
 	int i;
@@ -2612,7 +2273,7 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 				break;
 
 			list_del_init(&va->list);
-			merge_or_add_vmap_area(va, &decay_list);
+			list_add_tail(&va->list, &decay_list);
 		}
 
 		/*
@@ -2629,7 +2290,11 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 		}
 	}
 
-	reclaim_list_global(&decay_list);
+	WARN_ON_ONCE(!reclaim_list_global(&decay_list, false, &decay_failed));
+	list_for_each_entry_safe(va, nva, &decay_failed, list) {
+		list_del_init(&va->list);
+		WARN_ON_ONCE(!node_pool_add_va(vn, va));
+	}
 }
 
 #define KASAN_RELEASE_BATCH_SIZE 32
@@ -2664,8 +2329,10 @@ static void purge_vmap_node(struct work_struct *work)
 	struct vmap_node *vn = container_of(work,
 		struct vmap_node, purge_work);
 	unsigned long nr_purged_pages = 0;
+	unsigned long nr_failed_pages = 0;
 	struct vmap_area *va, *n_va;
 	LIST_HEAD(local_list);
+	LIST_HEAD(local_failed);
 
 	if (IS_ENABLED(CONFIG_KASAN_VMALLOC))
 		kasan_release_vmalloc_node(vn);
@@ -2691,7 +2358,23 @@ static void purge_vmap_node(struct work_struct *work)
 
 	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
 
-	reclaim_list_global(&local_list);
+	WARN_ON_ONCE(!reclaim_list_global(&local_list, false, &local_failed));
+	list_for_each_entry_safe(va, n_va, &local_failed, list) {
+		unsigned int vn_id = decode_vn_id(va->flags);
+		struct vmap_node *dst;
+
+		list_del_init(&va->list);
+		dst = is_vn_id_valid(vn_id) ?
+			id_to_node(vn_id) : addr_to_node(va->va_start);
+
+		spin_lock(&dst->lazy.lock);
+		insert_vmap_area_lazy_locked(va, dst);
+		spin_unlock(&dst->lazy.lock);
+		nr_failed_pages += va_size(va) >> PAGE_SHIFT;
+	}
+
+	if (nr_failed_pages)
+		atomic_long_add(nr_failed_pages, &vmap_lazy_nr);
 }
 
 /*
@@ -2823,6 +2506,15 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	vn = is_vn_id_valid(vn_id) ?
 		id_to_node(vn_id):addr_to_node(va->va_start);
 
+	/*
+	 * Drop occupied-range visibility as soon as the area is freed, even
+	 * though coalescing/reinsertion into the free index remains deferred.
+	 */
+	spin_lock(&free_vmap_area_lock);
+	try_init_occupied_mt_locked();
+	WARN_ON_ONCE(!occupied_mt_erase_va_locked(va));
+	spin_unlock(&free_vmap_area_lock);
+
 	spin_lock(&vn->lazy.lock);
 	insert_vmap_area_lazy_locked(va, vn);
 	spin_unlock(&vn->lazy.lock);
@@ -5240,14 +4932,11 @@ EXPORT_SYMBOL_GPL(free_vm_area);
 
 #ifdef CONFIG_SMP
 static __always_inline struct vmap_area *
-free_vmap_area_prev(struct vmap_area *va)
+free_vmap_area_prev_by_addr(unsigned long addr)
 {
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	if (va->list.prev == &free_vmap_area_list)
-		return NULL;
-
-	return list_entry(va->list.prev, struct vmap_area, list);
+	return __find_vmap_area_enclose_addr_mt(addr, &free_vmap_area_mt);
 }
 
 /**
@@ -5262,19 +4951,9 @@ free_vmap_area_prev(struct vmap_area *va)
 static struct vmap_area *
 pvm_find_va_enclose_addr(unsigned long addr)
 {
-	struct vmap_area *va;
-
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	if (free_mt_supported())
-		return __find_vmap_area_enclose_addr_mt(addr, &free_vmap_area_mt);
-
-	list_for_each_entry_reverse(va, &free_vmap_area_list, list) {
-		if (va->va_start <= addr)
-			return va;
-	}
-
-	return NULL;
+	return __find_vmap_area_enclose_addr_mt(addr, &free_vmap_area_mt);
 }
 
 /**
@@ -5293,13 +4972,19 @@ pvm_determine_end_from_reverse(struct vmap_area **va, unsigned long align)
 	unsigned long vmalloc_end = VMALLOC_END & ~(align - 1);
 	unsigned long addr;
 
+	lockdep_assert_held(&free_vmap_area_lock);
+
 	if (likely(*va)) {
-		list_for_each_entry_from_reverse((*va),
-				&free_vmap_area_list, list) {
+		do {
 			addr = min((*va)->va_end & ~(align - 1), vmalloc_end);
 			if ((*va)->va_start < addr)
 				return addr;
-		}
+
+			if ((*va)->va_start == 0)
+				break;
+
+			*va = free_vmap_area_prev_by_addr((*va)->va_start - 1);
+		} while (*va);
 	}
 
 	return 0;
@@ -5382,6 +5067,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	}
 retry:
 	spin_lock(&free_vmap_area_lock);
+	try_init_free_mt_locked();
+	try_init_occupied_mt_locked();
 
 	/* start scanning - we scan from the top, begin with the last area */
 	area = term_area = last_area;
@@ -5419,7 +5106,10 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		 * If this VA does not fit, move base downwards and recheck.
 		 */
 		if (base + start < va->va_start) {
-			va = free_vmap_area_prev(va);
+			if (va->va_start == 0)
+				va = NULL;
+			else
+				va = free_vmap_area_prev_by_addr(va->va_start - 1);
 			base = pvm_determine_end_from_reverse(&va, align) - end;
 			term_area = area;
 			continue;
@@ -5459,6 +5149,12 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		va = vas[area];
 		va->va_start = start;
 		va->va_end = start + size;
+
+		if (occupied_mt_supported() &&
+		    !occupied_mt_store_range_locked(va->va_start, va->va_end)) {
+			area++;
+			goto recovery;
+		}
 	}
 
 	spin_unlock(&free_vmap_area_lock);
@@ -5501,11 +5197,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	while (area--) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
+		WARN_ON_ONCE(!occupied_mt_erase_va_locked(vas[area]));
 		va = merge_or_add_vmap_area_free_locked(vas[area]);
+		WARN_ON_ONCE(!va);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
-				va->va_start, va->va_end,
-				KASAN_VMALLOC_PAGE_RANGE | KASAN_VMALLOC_TLB_FLUSH);
+					      va->va_start, va->va_end,
+					      KASAN_VMALLOC_PAGE_RANGE |
+					      KASAN_VMALLOC_TLB_FLUSH);
 		vas[area] = NULL;
 	}
 
@@ -5551,7 +5250,9 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	for (area = 0; area < nr_vms; area++) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
+		WARN_ON_ONCE(!occupied_mt_erase_va_locked(vas[area]));
 		va = merge_or_add_vmap_area_free_locked(vas[area]);
+		WARN_ON_ONCE(!va);
 		if (va)
 			kasan_release_vmalloc(orig_start, orig_end,
 				va->va_start, va->va_end,
@@ -5598,7 +5299,7 @@ bool vmalloc_dump_obj(void *object)
 	if (!spin_trylock(&vn->busy.lock))
 		return false;
 
-	va = __find_vmap_area(addr, &vn->busy.root);
+	va = find_vmap_area_busy_locked(addr, vn);
 	if (!va || !va->vm) {
 		spin_unlock(&vn->busy.lock);
 		return false;
@@ -5650,11 +5351,17 @@ static void show_purge_info(struct seq_file *m)
 
 	for_each_vmap_node(vn) {
 		spin_lock(&vn->lazy.lock);
-		list_for_each_entry(va, &vn->lazy.head, list) {
-			seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
-				(void *)va->va_start, (void *)va->va_end,
-				va_size(va));
+		try_init_lazy_mt_locked(vn);
+		if (WARN_ON_ONCE(!vn->lazy.mt_enabled)) {
+			spin_unlock(&vn->lazy.lock);
+			continue;
 		}
+		MA_STATE(mas, &vn->lazy.mt, 0, 0);
+
+		mas_for_each(&mas, va, ULONG_MAX)
+			seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
+				   (void *)va->va_start, (void *)va->va_end,
+				   va_size(va));
 		spin_unlock(&vn->lazy.lock);
 	}
 }
@@ -5671,12 +5378,19 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
 
 	for_each_vmap_node(vn) {
 		spin_lock(&vn->busy.lock);
-		list_for_each_entry(va, &vn->busy.head, list) {
+		try_init_busy_mt_locked(vn);
+		if (WARN_ON_ONCE(!vn->busy.mt_enabled)) {
+			spin_unlock(&vn->busy.lock);
+			continue;
+		}
+		MA_STATE(mas, &vn->busy.mt, 0, 0);
+
+		mas_for_each(&mas, va, ULONG_MAX) {
 			if (!va->vm) {
 				if (va->flags & VMAP_RAM)
 					seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
-						(void *)va->va_start, (void *)va->va_end,
-						va_size(va));
+						   (void *)va->va_start, (void *)va->va_end,
+						   va_size(va));
 
 				continue;
 			}
@@ -5689,7 +5403,7 @@ static int vmalloc_info_show(struct seq_file *m, void *p)
 			smp_rmb();
 
 			seq_printf(m, "0x%pK-0x%pK %7ld",
-				v->addr, v->addr + v->size, v->size);
+				   v->addr, v->addr + v->size, v->size);
 
 			if (v->caller)
 				seq_printf(m, " %pS", v->caller);
@@ -5754,6 +5468,8 @@ static void __init vmap_init_free_space(void)
 	struct vmap_area *free;
 	struct vm_struct *busy;
 
+	spin_lock(&free_vmap_area_lock);
+
 	/*
 	 *     B     F     B     B     B     F
 	 * -|-----|.....|-----|-----|-----|.....|-
@@ -5761,19 +5477,18 @@ static void __init vmap_init_free_space(void)
 	 *  |<--------------------------------->|
 	 */
 	for (busy = vmlist; busy; busy = busy->next) {
-		if ((unsigned long) busy->addr - vmap_start > 0) {
+		if ((unsigned long)busy->addr - vmap_start > 0) {
 			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
 			if (!WARN_ON_ONCE(!free)) {
 				free->va_start = vmap_start;
-				free->va_end = (unsigned long) busy->addr;
+				free->va_end = (unsigned long)busy->addr;
 
-				insert_vmap_area_augment(free, NULL,
-					&free_vmap_area_root,
-						&free_vmap_area_list);
+				if (WARN_ON_ONCE(!insert_vmap_area_free_locked(free)))
+					kmem_cache_free(vmap_area_cachep, free);
 			}
 		}
 
-		vmap_start = (unsigned long) busy->addr + busy->size;
+		vmap_start = (unsigned long)busy->addr + busy->size;
 	}
 
 	if (vmap_end - vmap_start > 0) {
@@ -5782,11 +5497,12 @@ static void __init vmap_init_free_space(void)
 			free->va_start = vmap_start;
 			free->va_end = vmap_end;
 
-			insert_vmap_area_augment(free, NULL,
-				&free_vmap_area_root,
-					&free_vmap_area_list);
+			if (WARN_ON_ONCE(!insert_vmap_area_free_locked(free)))
+				kmem_cache_free(vmap_area_cachep, free);
 		}
 	}
+
+	spin_unlock(&free_vmap_area_lock);
 }
 
 static void vmap_init_nodes(void)
@@ -5825,13 +5541,15 @@ static void vmap_init_nodes(void)
 #endif
 
 	for_each_vmap_node(vn) {
-		vn->busy.root = RB_ROOT;
-		INIT_LIST_HEAD(&vn->busy.head);
 		spin_lock_init(&vn->busy.lock);
+		mt_init_flags(&vn->busy.mt, MT_FLAGS_LOCK_EXTERN);
+		mt_set_external_lock(&vn->busy.mt, &vn->busy.lock);
+		vn->busy.mt_enabled = true;
 
-		vn->lazy.root = RB_ROOT;
-		INIT_LIST_HEAD(&vn->lazy.head);
 		spin_lock_init(&vn->lazy.lock);
+		mt_init_flags(&vn->lazy.mt, MT_FLAGS_LOCK_EXTERN);
+		mt_set_external_lock(&vn->lazy.mt, &vn->lazy.lock);
+		vn->lazy.mt_enabled = true;
 
 		for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
 			INIT_LIST_HEAD(&vn->pool[i].head);
@@ -5881,6 +5599,11 @@ void __init vmalloc_init(void)
 	 * Create the cache for vmap_area objects.
 	 */
 	vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
+	/*
+	 * vmalloc_init() performs Maple stores/preallocation while importing
+	 * early ranges. Ensure Maple node cache is available at this stage.
+	 */
+	maple_tree_init();
 
 	for_each_possible_cpu(i) {
 		struct vmap_block_queue *vbq;
@@ -5911,7 +5634,15 @@ void __init vmalloc_init(void)
 		va->vm = tmp;
 
 		vn = addr_to_node(va->va_start);
-		insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
+		spin_lock(&vn->busy.lock);
+		insert_vmap_area_busy_locked(va, vn);
+		spin_unlock(&vn->busy.lock);
+
+		spin_lock(&free_vmap_area_lock);
+		try_init_occupied_mt_locked();
+		WARN_ON_ONCE(!occupied_mt_store_range_locked(va->va_start,
+							     va->va_end));
+		spin_unlock(&free_vmap_area_lock);
 	}
 
 	/*

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (3 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Tighten failure handling on the two paths that publish into the
maple_tree under a spinlock and have no caller-friendly way to return
-ENOMEM:

  - free_vmap_area_noflush() falls back to vmap_retry_list when
    publish_vmap_area_lazy() can't allocate maple slabs under
    GFP_NOWAIT, and reschedules drain_vmap_work to retry.

  - the alloc path rolls the busy insert back onto the retry queue
    if insert_vmap_area_busy_locked() fails, rather than leaking the
    vmap_area or panicking.

Add vmap_retry_list as a non-indexed retry queue scanned by the
allocator as an exclusion set and drained from the purge worker,
and wire the two publish-failure paths above through it.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 566 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 474 insertions(+), 92 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c908c1a0fcd4..7feb1b182cfa 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -869,6 +869,12 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 static bool vmap_initialized __read_mostly;
+/*
+ * Non-index retry queue for ranges that could not be transitioned to their
+ * target maple index state in constrained paths. This queue is scanned by the
+ * allocator as an exclusion set and drained by purge workers.
+ */
+static LIST_HEAD(vmap_retry_list);
 
 /*
  * This kmem_cache is used for vmap_area objects. Instead of
@@ -1113,6 +1119,47 @@ free_mt_update_va_locked(struct vmap_area *va, unsigned long old_start,
 	return true;
 }
 
+static __always_inline void
+retry_queue_add_va_locked(struct vmap_area *va)
+{
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	/*
+	 * Keep a VA on one list at a time. Retry entries are detached from
+	 * all indexed containers before they are queued here.
+	 */
+	if (unlikely(!READ_ONCE(va->list.next) && !READ_ONCE(va->list.prev)))
+		INIT_LIST_HEAD(&va->list);
+	if (WARN_ON_ONCE(!list_empty(&va->list)))
+		return;
+	list_add_tail(&va->list, &vmap_retry_list);
+}
+
+static __always_inline bool
+retry_queue_overlap_locked(unsigned long start, unsigned long end,
+			   unsigned long *blocked_end)
+{
+	struct vmap_area *va;
+	bool overlap = false;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (list_empty(&vmap_retry_list))
+		return false;
+
+	list_for_each_entry(va, &vmap_retry_list, list) {
+		unsigned long va_end = va->va_end - 1;
+
+		if (va->va_start > end || va_end < start)
+			continue;
+
+		overlap = true;
+		*blocked_end = max(*blocked_end, va_end);
+	}
+
+	return overlap;
+}
+
 static __always_inline void try_init_free_mt_locked(void)
 {
 	lockdep_assert_held(&free_vmap_area_lock);
@@ -1169,6 +1216,14 @@ occupied_mt_store_range_locked(unsigned long start, unsigned long end)
 	return !WARN_ON_ONCE(err);
 }
 
+static __always_inline bool
+occupied_mt_store_va_locked(struct vmap_area *va)
+{
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	return occupied_mt_store_range_locked(va->va_start, va->va_end);
+}
+
 static __always_inline bool
 occupied_mt_erase_range_locked(unsigned long start, unsigned long end)
 {
@@ -1339,7 +1394,7 @@ find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
 	return NULL;
 }
 
-static __always_inline void
+static __always_inline bool
 insert_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 {
 	int err;
@@ -1349,12 +1404,12 @@ insert_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 	try_init_busy_mt_locked(vn);
 
 	if (WARN_ON_ONCE(!vn->busy.mt_enabled))
-		return;
+		return false;
 
 	if (!validate_vmap_area_range_insert_mt_locked(&vn->busy.mt,
 						       va->va_start,
 						       va->va_end))
-		return;
+		return false;
 
 	INIT_LIST_HEAD(&va->list);
 
@@ -1364,11 +1419,11 @@ insert_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 	if (!err) {
 		mas_store_prealloc(&mas, va);
 		mas_destroy(&mas);
-		return;
+		return true;
 	}
 
 	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-	WARN_ON_ONCE(err);
+	return !WARN_ON_ONCE(err);
 }
 
 static __always_inline void
@@ -1391,7 +1446,7 @@ unlink_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 	INIT_LIST_HEAD(&va->list);
 }
 
-static __always_inline void
+static __always_inline bool
 insert_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 {
 	int err;
@@ -1400,12 +1455,12 @@ insert_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 
 	try_init_lazy_mt_locked(vn);
 	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
-		return;
+		return false;
 
 	if (!validate_vmap_area_range_insert_mt_locked(&vn->lazy.mt,
 						       va->va_start,
 						       va->va_end))
-		return;
+		return false;
 
 	INIT_LIST_HEAD(&va->list);
 
@@ -1415,11 +1470,72 @@ insert_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 	if (!err) {
 		mas_store_prealloc(&mas, va);
 		mas_destroy(&mas);
-		return;
+		return true;
 	}
 
 	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-	WARN_ON_ONCE(err);
+	return !WARN_ON_ONCE(err);
+}
+
+static __always_inline bool
+unlink_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
+{
+	int err;
+
+	lockdep_assert_held(&vn->lazy.lock);
+
+	try_init_lazy_mt_locked(vn);
+	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
+		return false;
+
+	MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
+
+	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err))
+		return false;
+
+	INIT_LIST_HEAD(&va->list);
+	return true;
+}
+
+/*
+ * Transition a VA into the lazy index and drop occupied tracking. On occupied
+ * erase failure, attempt to roll back the lazy insertion; if rollback fails we
+ * keep the lazy entry and let purge-side erase_occupied handling repair stale
+ * occupied state.
+ *
+ * Returns true when the VA remains lazy-indexed; false when it should be
+ * retried via non-index queue.
+ */
+static __always_inline bool
+publish_vmap_area_lazy(struct vmap_area *va, struct vmap_node *vn)
+{
+	bool lazy_kept = false;
+
+	spin_lock(&vn->lazy.lock);
+	if (unlikely(!insert_vmap_area_lazy_locked(va, vn))) {
+		spin_unlock(&vn->lazy.lock);
+		return false;
+	}
+
+	/*
+	 * Keep lazy.lock held while dropping occupied tracking so purge-side
+	 * lazy extraction cannot move @va to purge_list during rollback.
+	 */
+	spin_lock(&free_vmap_area_lock);
+	try_init_occupied_mt_locked();
+	if (likely(occupied_mt_erase_va_locked(va))) {
+		spin_unlock(&free_vmap_area_lock);
+		spin_unlock(&vn->lazy.lock);
+		return true;
+	}
+	spin_unlock(&free_vmap_area_lock);
+
+	if (unlikely(!unlink_vmap_area_lazy_locked(va, vn)))
+		lazy_kept = true;
+	spin_unlock(&vn->lazy.lock);
+
+	return lazy_kept;
 }
 
 static __always_inline bool
@@ -1437,7 +1553,9 @@ lazy_vmap_areas_empty_locked(struct vmap_node *vn)
 static __always_inline void
 move_lazy_vmap_areas_to_purge_locked(struct vmap_node *vn)
 {
-	struct vmap_area *va;
+	LIST_HEAD(move_list);
+	struct vmap_area *va, *n_va;
+	int err;
 
 	lockdep_assert_held(&vn->lazy.lock);
 
@@ -1448,12 +1566,25 @@ move_lazy_vmap_areas_to_purge_locked(struct vmap_node *vn)
 	MA_STATE(mas, &vn->lazy.mt, 0, 0);
 
 	mas_for_each(&mas, va, ULONG_MAX)
-		list_add_tail(&va->list, &vn->purge_list);
+		list_add_tail(&va->list, &move_list);
+
+	/*
+	 * Erase ranges one-by-one and move only successfully erased entries to
+	 * purge_list. This avoids destroy/reinit churn and keeps lazy index
+	 * coherence if an erase operation fails under pressure.
+	 */
+	list_for_each_entry_safe(va, n_va, &move_list, list) {
+		MA_STATE(mas_erase, &vn->lazy.mt, va->va_start, va->va_end - 1);
+
+		err = mas_store_gfp(&mas_erase, NULL, GFP_ATOMIC | __GFP_NOWARN);
+		if (unlikely(err)) {
+			WARN_ON_ONCE(err);
+			list_del_init(&va->list);
+			continue;
+		}
 
-	__mt_destroy(&vn->lazy.mt);
-	mt_init_flags(&vn->lazy.mt, MT_FLAGS_LOCK_EXTERN);
-	mt_set_external_lock(&vn->lazy.mt, &vn->lazy.lock);
-	vn->lazy.mt_enabled = true;
+		list_move_tail(&va->list, &vn->purge_list);
+	}
 }
 
 static __always_inline bool
@@ -1463,11 +1594,6 @@ insert_vmap_area_free_locked(struct vmap_area *va)
 
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	try_init_free_mt_locked();
-
-	if (unlikely(!free_mt_supported()))
-		return false;
-
 	prev = __find_vmap_area_enclose_addr_mt(va->va_start, &free_vmap_area_mt);
 	if (prev && WARN_ON_ONCE(prev->va_end > va->va_start))
 		return false;
@@ -1512,16 +1638,16 @@ merge_or_add_vmap_area_free_locked(struct vmap_area *va)
 	if (left && WARN_ON_ONCE(left->va_end > new_start))
 		return NULL;
 
+	right = __find_vmap_area_exceed_addr_mt(new_start, &free_vmap_area_mt);
+	if (right && WARN_ON_ONCE(right->va_start < new_end))
+		return NULL;
+
 	if (left && left->va_end == new_start) {
 		new_start = left->va_start;
 		unlink_vmap_area_free_locked(left);
 		kmem_cache_free(vmap_area_cachep, left);
 	}
 
-	right = __find_vmap_area_exceed_addr_mt(new_start, &free_vmap_area_mt);
-	if (right && WARN_ON_ONCE(right->va_start < new_end))
-		return NULL;
-
 	if (right && right->va_start == new_end) {
 		new_end = right->va_end;
 		unlink_vmap_area_free_locked(right);
@@ -1580,9 +1706,28 @@ occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
 		if (check_add_overflow(candidate, size - 1, &candidate_end))
 			return false;
 
-		if (candidate >= search && candidate_end <= hole_end) {
-			*addr = candidate;
-			return true;
+		while (candidate >= search && candidate_end <= hole_end) {
+			unsigned long blocked_end = 0;
+
+			if (!retry_queue_overlap_locked(candidate, candidate_end,
+							&blocked_end)) {
+				*addr = candidate;
+				return true;
+			}
+
+			if (blocked_end >= hole_end)
+				break;
+
+			blocked_end++;
+			if (!blocked_end)
+				return false;
+
+			candidate = ALIGN(blocked_end, align);
+			if (candidate < blocked_end)
+				return false;
+
+			if (check_add_overflow(candidate, size - 1, &candidate_end))
+				return false;
 		}
 
 		if (hole_end == ULONG_MAX)
@@ -1828,6 +1973,70 @@ restore_allocated_vmap_range_free_locked(unsigned long start, unsigned long end)
 	return true;
 }
 
+/*
+ * Roll back an allocated range when busy insertion fails. Prefer returning
+ * it to the free tree; if that is not possible, keep occupied tracking so
+ * the range stays reserved and allocator state remains coherent.
+ *
+ * Returns true when @va remains referenced by the free tree and must not be
+ * freed by the caller. Returns false when the caller owns @va.
+ */
+static __always_inline bool
+rollback_busy_insert_failed_alloc_locked(struct vmap_area *va)
+{
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (!insert_vmap_area_free_locked(va)) {
+		retry_queue_add_va_locked(va);
+		return true;
+	}
+
+	if (occupied_mt_erase_va_locked(va))
+		return true;
+
+	if (free_mt_erase_va_locked(va)) {
+		retry_queue_add_va_locked(va);
+		return true;
+	}
+
+	/*
+	 * Occupied erase failed and we could not remove the temporary free
+	 * insertion. Keep @va alive: both trees still reference this range.
+	 */
+	return true;
+}
+
+/*
+ * Reinsert @va into the free index after occupied erase. On failure, place the
+ * range on the non-index retry queue and best-effort restore occupied tracking.
+ *
+ * Return: free-tracked @va on success, NULL when queued for retry.
+ */
+static __always_inline struct vmap_area *
+reinsert_or_queue_vmap_area_locked(struct vmap_area *va)
+{
+	struct vmap_area *tracked;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	tracked = merge_or_add_vmap_area_free_locked(va);
+	if (tracked)
+		return tracked;
+
+	if (insert_vmap_area_free_locked(va))
+		return va;
+
+	/*
+	 * Retry queue acts as allocation exclusion even if occupied restore
+	 * fails under pressure.
+	 */
+	if (WARN_ON_ONCE(!occupied_mt_store_va_locked(va)))
+		INIT_LIST_HEAD(&va->list);
+
+	retry_queue_add_va_locked(va);
+	return NULL;
+}
+
 /*
  * Returns a start address of the newly allocated area, if success.
  * Otherwise an error value is returned that indicates failure.
@@ -1840,22 +2049,42 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 	unsigned long nva_start_addr;
 	unsigned long nva_end_addr;
 	struct vmap_area *va;
+	MA_STATE(mas, &free_vmap_area_mt, 0, 0);
 
 	lockdep_assert_held(&free_vmap_area_lock);
 
 	try_init_occupied_mt_locked();
 
-	if (WARN_ON_ONCE(!occupied_mt_supported()))
+	if (WARN_ON_ONCE(!size || !align || vstart >= vend))
+		return -EINVAL;
+	if (size > vend - vstart)
 		return -ENOENT;
 
-	nva_start_addr = occupied_mt_find_hole_lowest_locked(size, align,
-							     vstart, vend);
-	if (IS_ERR_VALUE(nva_start_addr))
-		return nva_start_addr;
-	nva_end_addr = nva_start_addr + size;
+	/*
+	 * Free maple index is authoritative for allocatable ranges; lazy and
+	 * retry entries are intentionally excluded from it.
+	 */
+	mas_set(&mas, vstart);
+	va = mas_find(&mas, vend - 1);
+	while (va) {
+		unsigned long search_start = max(va->va_start, vstart);
+		unsigned long candidate_end;
+
+		nva_start_addr = ALIGN(search_start, align);
+		if (nva_start_addr < search_start)
+			return -ERANGE;
 
-	va = __find_vmap_area_mt(nva_start_addr, &free_vmap_area_mt);
-	if (WARN_ON_ONCE(!va))
+		if (check_add_overflow(nva_start_addr, size - 1, &candidate_end))
+			return -ERANGE;
+
+		if (candidate_end < vend && candidate_end < va->va_end) {
+			nva_end_addr = candidate_end + 1;
+			break;
+		}
+
+		va = mas_next(&mas, vend - 1);
+	}
+	if (!va)
 		return -ENOENT;
 
 	ret = va_clip(va, nva_start_addr, size);
@@ -1883,6 +2112,7 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 static void free_vmap_area(struct vmap_area *va)
 {
 	struct vmap_node *vn = addr_to_node(va->va_start);
+	bool queued_retry = false;
 
 	/*
 	 * Remove from the busy tree/list.
@@ -1895,9 +2125,19 @@ static void free_vmap_area(struct vmap_area *va)
 	 * Insert/Merge it back to the free tree/list.
 	 */
 	spin_lock(&free_vmap_area_lock);
-	WARN_ON_ONCE(!occupied_mt_erase_va_locked(va));
-	WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va));
+	if (unlikely(!occupied_mt_erase_va_locked(va))) {
+		retry_queue_add_va_locked(va);
+		queued_retry = true;
+		spin_unlock(&free_vmap_area_lock);
+		goto out_schedule_retry;
+	}
+	if (!reinsert_or_queue_vmap_area_locked(va))
+		queued_retry = true;
 	spin_unlock(&free_vmap_area_lock);
+
+out_schedule_retry:
+	if (queued_retry)
+		schedule_work(&drain_vmap_work);
 }
 
 static inline void
@@ -2119,6 +2359,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->va_end = addr + size;
 	va->vm = NULL;
 	va->flags = (va_flags | vn_id);
+	INIT_LIST_HEAD(&va->list);
 
 	if (vm) {
 		vm->addr = (void *)va->va_start;
@@ -2129,8 +2370,29 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	vn = addr_to_node(va->va_start);
 
 	spin_lock(&vn->busy.lock);
-	insert_vmap_area_busy_locked(va, vn);
+	ret = insert_vmap_area_busy_locked(va, vn) ? 0 : -ENOMEM;
 	spin_unlock(&vn->busy.lock);
+	if (ret) {
+		bool keep_va = false;
+
+		va->vm = NULL;
+		spin_lock(&free_vmap_area_lock);
+		keep_va = rollback_busy_insert_failed_alloc_locked(va);
+		spin_unlock(&free_vmap_area_lock);
+
+		if (!keep_va)
+			kmem_cache_free(vmap_area_cachep, va);
+		else
+			schedule_work(&drain_vmap_work);
+
+		if (vm) {
+			vm->addr = NULL;
+			vm->size = 0;
+			vm->requested_size = 0;
+		}
+
+		return ERR_PTR(ret);
+	}
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
 	BUG_ON(va->va_start < vstart);
@@ -2221,21 +2483,40 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
 {
 	struct vmap_area *va, *n;
 	bool ok = true;
+	bool queue_retry_work = false;
 
 	if (list_empty(head))
 		return true;
 
 	spin_lock(&free_vmap_area_lock);
 	list_for_each_entry_safe(va, n, head, list) {
+		bool occupied_erased = false;
+
 		list_del_init(&va->list);
-		if (erase_occupied)
-			WARN_ON_ONCE(!occupied_mt_erase_va_locked(va));
-		if (WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va))) {
-			list_add_tail(&va->list, failed);
-			ok = false;
+		if (erase_occupied) {
+			if (WARN_ON_ONCE(!occupied_mt_erase_va_locked(va))) {
+				list_add_tail(&va->list, failed);
+				ok = false;
+				continue;
+			}
+
+			occupied_erased = true;
+		}
+			if (WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va))) {
+				if (occupied_erased &&
+				    WARN_ON_ONCE(!occupied_mt_store_va_locked(va))) {
+					retry_queue_add_va_locked(va);
+					queue_retry_work = true;
+					ok = false;
+					continue;
+				}
+				list_add_tail(&va->list, failed);
+				ok = false;
 		}
 	}
 	spin_unlock(&free_vmap_area_lock);
+	if (queue_retry_work)
+		schedule_work(&drain_vmap_work);
 
 	return ok;
 }
@@ -2330,6 +2611,7 @@ static void purge_vmap_node(struct work_struct *work)
 		struct vmap_node, purge_work);
 	unsigned long nr_purged_pages = 0;
 	unsigned long nr_failed_pages = 0;
+	bool queued_retry = false;
 	struct vmap_area *va, *n_va;
 	LIST_HEAD(local_list);
 	LIST_HEAD(local_failed);
@@ -2358,7 +2640,7 @@ static void purge_vmap_node(struct work_struct *work)
 
 	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
 
-	WARN_ON_ONCE(!reclaim_list_global(&local_list, false, &local_failed));
+	WARN_ON_ONCE(!reclaim_list_global(&local_list, true, &local_failed));
 	list_for_each_entry_safe(va, n_va, &local_failed, list) {
 		unsigned int vn_id = decode_vn_id(va->flags);
 		struct vmap_node *dst;
@@ -2367,14 +2649,60 @@ static void purge_vmap_node(struct work_struct *work)
 		dst = is_vn_id_valid(vn_id) ?
 			id_to_node(vn_id) : addr_to_node(va->va_start);
 
-		spin_lock(&dst->lazy.lock);
-		insert_vmap_area_lazy_locked(va, dst);
-		spin_unlock(&dst->lazy.lock);
-		nr_failed_pages += va_size(va) >> PAGE_SHIFT;
+		if (publish_vmap_area_lazy(va, dst)) {
+			nr_failed_pages += va_size(va) >> PAGE_SHIFT;
+			continue;
+		}
+
+		spin_lock(&free_vmap_area_lock);
+		retry_queue_add_va_locked(va);
+		spin_unlock(&free_vmap_area_lock);
+		queued_retry = true;
 	}
 
 	if (nr_failed_pages)
 		atomic_long_add(nr_failed_pages, &vmap_lazy_nr);
+
+	if (queued_retry)
+		schedule_work(&drain_vmap_work);
+}
+
+static void drain_vmap_retry_queue(void)
+{
+	struct vmap_area *va, *n_va;
+	bool queued_retry = false;
+	LIST_HEAD(local_retry);
+
+	spin_lock(&free_vmap_area_lock);
+	if (list_empty(&vmap_retry_list)) {
+		spin_unlock(&free_vmap_area_lock);
+		return;
+	}
+
+	list_splice_init(&vmap_retry_list, &local_retry);
+	spin_unlock(&free_vmap_area_lock);
+
+	list_for_each_entry_safe(va, n_va, &local_retry, list) {
+		struct vmap_node *vn = addr_to_node(va->va_start);
+
+		list_del_init(&va->list);
+		if (publish_vmap_area_lazy(va, vn)) {
+			atomic_long_add(va_size(va) >> PAGE_SHIFT, &vmap_lazy_nr);
+			continue;
+		}
+
+		spin_lock(&free_vmap_area_lock);
+		retry_queue_add_va_locked(va);
+		spin_unlock(&free_vmap_area_lock);
+		queued_retry = true;
+	}
+
+	/*
+	 * Ensure retry-only backlog keeps making progress even if no new free
+	 * events arrive to trigger another purge pass.
+	 */
+	if (queued_retry)
+		schedule_work(&drain_vmap_work);
 }
 
 /*
@@ -2392,6 +2720,9 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	lockdep_assert_held(&vmap_purge_lock);
 
+	/* Retry queued transitions first, so they can join this purge cycle. */
+	drain_vmap_retry_queue();
+
 	/*
 	 * Use cpumask to mark which node has to be processed.
 	 */
@@ -2489,6 +2820,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 {
 	unsigned long nr_lazy_max = lazy_max_pages();
 	unsigned long va_start = va->va_start;
+	unsigned long nr_pages = va_size(va) >> PAGE_SHIFT;
 	unsigned int vn_id = decode_vn_id(va->flags);
 	struct vmap_node *vn;
 	unsigned long nr_lazy;
@@ -2496,9 +2828,6 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	if (WARN_ON_ONCE(!list_empty(&va->list)))
 		return;
 
-	nr_lazy = atomic_long_add_return_relaxed(va_size(va) >> PAGE_SHIFT,
-					 &vmap_lazy_nr);
-
 	/*
 	 * If it was request by a certain node we would like to
 	 * return it to that node, i.e. its pool for later reuse.
@@ -2506,18 +2835,20 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	vn = is_vn_id_valid(vn_id) ?
 		id_to_node(vn_id):addr_to_node(va->va_start);
 
-	/*
-	 * Drop occupied-range visibility as soon as the area is freed, even
-	 * though coalescing/reinsertion into the free index remains deferred.
-	 */
-	spin_lock(&free_vmap_area_lock);
-	try_init_occupied_mt_locked();
-	WARN_ON_ONCE(!occupied_mt_erase_va_locked(va));
-	spin_unlock(&free_vmap_area_lock);
+	if (publish_vmap_area_lazy(va, vn)) {
+		nr_lazy = atomic_long_add_return_relaxed(nr_pages, &vmap_lazy_nr);
+	} else {
+		spin_lock(&free_vmap_area_lock);
+		retry_queue_add_va_locked(va);
+		nr_lazy = atomic_long_read(&vmap_lazy_nr);
+		spin_unlock(&free_vmap_area_lock);
 
-	spin_lock(&vn->lazy.lock);
-	insert_vmap_area_lazy_locked(va, vn);
-	spin_unlock(&vn->lazy.lock);
+		/*
+		 * Retry transitions are drained from purge context; poke it
+		 * immediately so transient pressure does not prolong retention.
+		 */
+		schedule_work(&drain_vmap_work);
+	}
 
 	trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
 
@@ -5023,6 +5354,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	struct vmap_area **vas, *va;
 	struct vm_struct **vms;
 	int area, area2, last_area, term_area;
+	int inserted_busy = 0;
+	bool queued_retry = false;
 	unsigned long base, start, size, end, last_end, orig_start, orig_end;
 	bool purged = false;
 
@@ -5061,6 +5394,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 	for (area = 0; area < nr_vms; area++) {
 		vas[area] = kmem_cache_zalloc(vmap_area_cachep, GFP_KERNEL);
+		if (vas[area])
+			INIT_LIST_HEAD(&vas[area]->list);
 		vms[area] = kzalloc_obj(struct vm_struct);
 		if (!vas[area] || !vms[area])
 			goto err_free;
@@ -5170,10 +5505,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
 
 		spin_lock(&vn->busy.lock);
-		insert_vmap_area_busy_locked(vas[area], vn);
+		if (unlikely(!insert_vmap_area_busy_locked(vas[area], vn))) {
+			spin_unlock(&vn->busy.lock);
+			goto err_unwind_busy;
+		}
 		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
 		spin_unlock(&vn->busy.lock);
+		inserted_busy++;
 	}
 
 	/*
@@ -5197,33 +5536,43 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	while (area--) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		WARN_ON_ONCE(!occupied_mt_erase_va_locked(vas[area]));
-		va = merge_or_add_vmap_area_free_locked(vas[area]);
-		WARN_ON_ONCE(!va);
-		if (va)
-			kasan_release_vmalloc(orig_start, orig_end,
-					      va->va_start, va->va_end,
-					      KASAN_VMALLOC_PAGE_RANGE |
-					      KASAN_VMALLOC_TLB_FLUSH);
+		if (occupied_mt_erase_va_locked(vas[area])) {
+			va = reinsert_or_queue_vmap_area_locked(vas[area]);
+			if (va)
+				kasan_release_vmalloc(orig_start, orig_end,
+						      va->va_start, va->va_end,
+						      KASAN_VMALLOC_PAGE_RANGE |
+						      KASAN_VMALLOC_TLB_FLUSH);
+			else
+				queued_retry = true;
+		} else {
+			retry_queue_add_va_locked(vas[area]);
+			queued_retry = true;
+		}
 		vas[area] = NULL;
 	}
 
 overflow:
 	spin_unlock(&free_vmap_area_lock);
+	if (queued_retry)
+		schedule_work(&drain_vmap_work);
+
 	if (!purged) {
 		reclaim_and_purge_vmap_areas();
 		purged = true;
 
-		/* Before "retry", check if we recover. */
-		for (area = 0; area < nr_vms; area++) {
-			if (vas[area])
-				continue;
-
-			vas[area] = kmem_cache_zalloc(
-				vmap_area_cachep, GFP_KERNEL);
-			if (!vas[area])
-				goto err_free;
-		}
+			/* Before "retry", check if we recover. */
+			for (area = 0; area < nr_vms; area++) {
+				if (vas[area])
+					continue;
+
+				vas[area] = kmem_cache_zalloc(vmap_area_cachep,
+							      GFP_KERNEL);
+				if (vas[area])
+					INIT_LIST_HEAD(&vas[area]->list);
+				if (!vas[area])
+					goto err_free;
+			}
 
 		goto retry;
 	}
@@ -5240,6 +5589,16 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	kfree(vms);
 	return NULL;
 
+err_unwind_busy:
+	while (inserted_busy--) {
+		struct vmap_node *vn = addr_to_node(vas[inserted_busy]->va_start);
+
+		spin_lock(&vn->busy.lock);
+		unlink_vmap_area_busy_locked(vas[inserted_busy], vn);
+		spin_unlock(&vn->busy.lock);
+	}
+	goto err_free_shadow;
+
 err_free_shadow:
 	spin_lock(&free_vmap_area_lock);
 	/*
@@ -5250,17 +5609,25 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	for (area = 0; area < nr_vms; area++) {
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
-		WARN_ON_ONCE(!occupied_mt_erase_va_locked(vas[area]));
-		va = merge_or_add_vmap_area_free_locked(vas[area]);
-		WARN_ON_ONCE(!va);
-		if (va)
-			kasan_release_vmalloc(orig_start, orig_end,
-				va->va_start, va->va_end,
-				KASAN_VMALLOC_PAGE_RANGE | KASAN_VMALLOC_TLB_FLUSH);
+		if (occupied_mt_erase_va_locked(vas[area])) {
+			va = reinsert_or_queue_vmap_area_locked(vas[area]);
+			if (va)
+				kasan_release_vmalloc(orig_start, orig_end,
+						      va->va_start, va->va_end,
+						      KASAN_VMALLOC_PAGE_RANGE |
+						      KASAN_VMALLOC_TLB_FLUSH);
+			else
+				queued_retry = true;
+		} else {
+			retry_queue_add_va_locked(vas[area]);
+			queued_retry = true;
+		}
 		vas[area] = NULL;
 		kfree(vms[area]);
 	}
 	spin_unlock(&free_vmap_area_lock);
+	if (queued_retry)
+		schedule_work(&drain_vmap_work);
 	kfree(vas);
 	kfree(vms);
 	return NULL;
@@ -5364,6 +5731,13 @@ static void show_purge_info(struct seq_file *m)
 				   va_size(va));
 		spin_unlock(&vn->lazy.lock);
 	}
+
+	spin_lock(&free_vmap_area_lock);
+	list_for_each_entry(va, &vmap_retry_list, list)
+		seq_printf(m, "0x%pK-0x%pK %7ld retry vm_area\n",
+			   (void *)va->va_start, (void *)va->va_end,
+			   va_size(va));
+	spin_unlock(&free_vmap_area_lock);
 }
 
 static int vmalloc_info_show(struct seq_file *m, void *p)
@@ -5635,13 +6009,21 @@ void __init vmalloc_init(void)
 
 		vn = addr_to_node(va->va_start);
 		spin_lock(&vn->busy.lock);
-		insert_vmap_area_busy_locked(va, vn);
+		if (unlikely(!insert_vmap_area_busy_locked(va, vn))) {
+			spin_unlock(&vn->busy.lock);
+			panic("%s: failed to import busy range %#lx-%#lx\n",
+			      __func__, va->va_start, va->va_end);
+		}
 		spin_unlock(&vn->busy.lock);
 
 		spin_lock(&free_vmap_area_lock);
 		try_init_occupied_mt_locked();
-		WARN_ON_ONCE(!occupied_mt_store_range_locked(va->va_start,
-							     va->va_end));
+		if (WARN_ON_ONCE(!occupied_mt_store_range_locked(va->va_start,
+								 va->va_end))) {
+			spin_unlock(&free_vmap_area_lock);
+			panic("%s: failed to import occupied range %#lx-%#lx\n",
+			      __func__, va->va_start, va->va_end);
+		}
 		spin_unlock(&free_vmap_area_lock);
 	}
 

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (4 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path Pranjal Arya
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Three small refinements that follow from using maple_tree more
idiomatically on the alloc and free walkers:

  - Carry the MA_STATE through the alloc walker instead of dropping
    and re-positioning it between the gap query, the candidate
    classify step, and the post-clip publish.  The walker is already
    pointing at the chosen range; reuse it.

  - When the gap query is satisfied by an entry whose start is not
    yet aligned, ask for a gap of (size + align - 1) so the first
    match is guaranteed alignable.  This matches the effective
    behaviour of the augmented rb_tree's gap traversal.

  - When va_clip narrows an existing free entry, store NULL on just
    the consumed sub-range instead of erasing the whole entry and
    re-storing the surviving prefix/suffix.  mas_store(NULL,
    [start, end]) leaves the un-trimmed sub-range of the original
    entry intact, so the re-store is unnecessary.

  - Walk the address-keyed occupied tree with mas_find on the rare
    decay path so the per-node free-area scan can prune ranges that
    are already aligned out.

No semantic change to the allocator policy, the free-area shape, or
the addresses returned.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 338 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 283 insertions(+), 55 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7feb1b182cfa..5bc1e47c456a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -899,6 +899,7 @@ static struct maple_tree occupied_vmap_area_mt =
 		       free_vmap_area_lock);
 static bool occupied_vmap_area_mt_enabled;
 static bool occupied_vmap_area_mt_init_tried;
+static bool occupied_vmap_area_perf_mode;
 
 /*
  * Preload a CPU with one object for "no edge" split case. The
@@ -1073,12 +1074,13 @@ static __always_inline bool free_mt_store_va_locked(struct vmap_area *va)
 	if (!err) {
 		mas_store_prealloc(&mas, va);
 		mas_destroy(&mas);
-	} else {
-		err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-		if (WARN_ON_ONCE(err))
-			return false;
+		return true;
 	}
 
+	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
+	if (WARN_ON_ONCE(err))
+		return false;
+
 	return true;
 }
 
@@ -1119,6 +1121,31 @@ free_mt_update_va_locked(struct vmap_area *va, unsigned long old_start,
 	return true;
 }
 
+/*
+ * Trim a stored range entry by clearing a sub-range from one end.
+ * Used by LE_FIT and RE_FIT in va_clip(): the original [old_start,
+ * old_end-1]->@va entry survives intact at the un-trimmed sub-range,
+ * so a single mas_store NULL replaces the explicit erase + restore-at-
+ * shrunk-range pair, halving maple-tree work for edge clips. NE_FIT
+ * uses the same primitive after first inserting @lva, which trades 3
+ * stores (erase + store + lva) for 2 (lva + middle trim).
+ */
+static __always_inline bool
+free_mt_trim_range_locked(unsigned long trim_start, unsigned long trim_end)
+{
+	int err;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (trim_start >= trim_end)
+		return true;
+
+	MA_STATE(mas, &free_vmap_area_mt, trim_start, trim_end - 1);
+
+	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
+	return !WARN_ON_ONCE(err);
+}
+
 static __always_inline void
 retry_queue_add_va_locked(struct vmap_area *va)
 {
@@ -1175,6 +1202,11 @@ static __always_inline void try_init_free_mt_locked(void)
 }
 
 static __always_inline bool occupied_mt_supported(void)
+{
+	return occupied_vmap_area_perf_mode && occupied_vmap_area_mt_enabled;
+}
+
+static __always_inline bool occupied_mt_enabled(void)
 {
 	return occupied_vmap_area_mt_enabled;
 }
@@ -1194,28 +1226,48 @@ static __always_inline void try_init_occupied_mt_locked(void)
 }
 
 static __always_inline bool
-occupied_mt_store_range_locked(unsigned long start, unsigned long end)
+occupied_mt_store_range_raw_locked(unsigned long start, unsigned long end)
 {
 	int err;
 
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	if (WARN_ON_ONCE(!occupied_mt_supported()))
-		return false;
+	if (!occupied_mt_enabled())
+		return true;
 
 	MA_STATE(mas, &occupied_vmap_area_mt, start, end - 1);
 
-	err = mas_preallocate(&mas, XA_ZERO_ENTRY, GFP_NOWAIT | __GFP_NOWARN);
-	if (!err) {
-		mas_store_prealloc(&mas, XA_ZERO_ENTRY);
-		mas_destroy(&mas);
+	err = mas_store_gfp(&mas, XA_ZERO_ENTRY, GFP_ATOMIC | __GFP_NOWARN);
+	return !WARN_ON_ONCE(err);
+}
+
+static __always_inline bool
+occupied_mt_erase_range_raw_locked(unsigned long start, unsigned long end)
+{
+	int err;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (!occupied_mt_enabled())
 		return true;
-	}
 
-	err = mas_store_gfp(&mas, XA_ZERO_ENTRY, GFP_ATOMIC | __GFP_NOWARN);
+	MA_STATE(mas, &occupied_vmap_area_mt, start, end - 1);
+
+	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
 	return !WARN_ON_ONCE(err);
 }
 
+static __always_inline bool
+occupied_mt_store_range_locked(unsigned long start, unsigned long end)
+{
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (!occupied_mt_supported())
+		return true;
+
+	return occupied_mt_store_range_raw_locked(start, end);
+}
+
 static __always_inline bool
 occupied_mt_store_va_locked(struct vmap_area *va)
 {
@@ -1227,17 +1279,12 @@ occupied_mt_store_va_locked(struct vmap_area *va)
 static __always_inline bool
 occupied_mt_erase_range_locked(unsigned long start, unsigned long end)
 {
-	int err;
-
 	lockdep_assert_held(&free_vmap_area_lock);
 
 	if (WARN_ON_ONCE(!occupied_mt_supported()))
 		return false;
 
-	MA_STATE(mas, &occupied_vmap_area_mt, start, end - 1);
-
-	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
-	return !WARN_ON_ONCE(err);
+	return occupied_mt_erase_range_raw_locked(start, end);
 }
 
 static __always_inline bool
@@ -1303,6 +1350,24 @@ __find_vmap_area_enclose_addr_mt(unsigned long addr, struct maple_tree *tree)
 	return mas_find_rev(&mas, 0);
 }
 
+static __always_inline bool
+find_vmap_area_insert_neighbors_mt_locked(struct maple_tree *tree,
+					  unsigned long start,
+					  unsigned long end,
+					  struct vmap_area **left,
+					  struct vmap_area **right)
+{
+	*left = __find_vmap_area_enclose_addr_mt(start, tree);
+	if (*left && WARN_ON_ONCE((*left)->va_end > start))
+		return false;
+
+	*right = __find_vmap_area_exceed_addr_mt(start, tree);
+	if (*right && WARN_ON_ONCE((*right)->va_start < end))
+		return false;
+
+	return true;
+}
+
 static __always_inline bool
 validate_vmap_area_range_insert_mt_locked(struct maple_tree *tree,
 					  unsigned long start,
@@ -1310,12 +1375,8 @@ validate_vmap_area_range_insert_mt_locked(struct maple_tree *tree,
 {
 	struct vmap_area *left, *right;
 
-	left = __find_vmap_area_enclose_addr_mt(start, tree);
-	if (left && WARN_ON_ONCE(left->va_end > start))
-		return false;
-
-	right = __find_vmap_area_exceed_addr_mt(start, tree);
-	if (right && WARN_ON_ONCE(right->va_start < end))
+	if (!find_vmap_area_insert_neighbors_mt_locked(tree, start, end,
+						       &left, &right))
 		return false;
 
 	return true;
@@ -1499,10 +1560,11 @@ unlink_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 }
 
 /*
- * Transition a VA into the lazy index and drop occupied tracking. On occupied
- * erase failure, attempt to roll back the lazy insertion; if rollback fails we
- * keep the lazy entry and let purge-side erase_occupied handling repair stale
- * occupied state.
+ * Transition a VA into the lazy index.
+ *
+ * In the default mode, occupied tracking is dropped while the VA is lazy.
+ * In occupied perf mode, lazy ranges stay occupied-indexed so hole search can
+ * avoid repeatedly probing unavailable gaps.
  *
  * Returns true when the VA remains lazy-indexed; false when it should be
  * retried via non-index queue.
@@ -1518,6 +1580,11 @@ publish_vmap_area_lazy(struct vmap_area *va, struct vmap_node *vn)
 		return false;
 	}
 
+	if (occupied_mt_supported()) {
+		spin_unlock(&vn->lazy.lock);
+		return true;
+	}
+
 	/*
 	 * Keep lazy.lock held while dropping occupied tracking so purge-side
 	 * lazy extraction cannot move @va to purge_list during rollback.
@@ -1588,24 +1655,34 @@ move_lazy_vmap_areas_to_purge_locked(struct vmap_node *vn)
 }
 
 static __always_inline bool
-insert_vmap_area_free_locked(struct vmap_area *va)
+insert_vmap_area_free_nocheck_locked(struct vmap_area *va)
 {
-	struct vmap_area *prev, *next;
-
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	prev = __find_vmap_area_enclose_addr_mt(va->va_start, &free_vmap_area_mt);
-	if (prev && WARN_ON_ONCE(prev->va_end > va->va_start))
-		return false;
+	try_init_free_mt_locked();
 
-	next = __find_vmap_area_exceed_addr_mt(va->va_start, &free_vmap_area_mt);
-	if (next && WARN_ON_ONCE(next->va_start < va->va_end))
+	if (unlikely(!free_mt_supported()))
 		return false;
 
 	INIT_LIST_HEAD(&va->list);
 	return free_mt_store_va_locked(va);
 }
 
+static __always_inline bool
+insert_vmap_area_free_locked(struct vmap_area *va)
+{
+	struct vmap_area *prev, *next;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (!find_vmap_area_insert_neighbors_mt_locked(&free_vmap_area_mt,
+						       va->va_start, va->va_end,
+						       &prev, &next))
+		return false;
+
+	return insert_vmap_area_free_nocheck_locked(va);
+}
+
 static __always_inline void
 unlink_vmap_area_free_locked(struct vmap_area *va)
 {
@@ -1634,8 +1711,9 @@ merge_or_add_vmap_area_free_locked(struct vmap_area *va)
 	new_start = va->va_start;
 	new_end = va->va_end;
 
-	left = __find_vmap_area_enclose_addr_mt(new_start, &free_vmap_area_mt);
-	if (left && WARN_ON_ONCE(left->va_end > new_start))
+	if (!find_vmap_area_insert_neighbors_mt_locked(&free_vmap_area_mt,
+						       new_start, new_end,
+						       &left, &right))
 		return NULL;
 
 	right = __find_vmap_area_exceed_addr_mt(new_start, &free_vmap_area_mt);
@@ -1657,7 +1735,7 @@ merge_or_add_vmap_area_free_locked(struct vmap_area *va)
 	va->va_start = new_start;
 	va->va_end = new_end;
 
-	if (!insert_vmap_area_free_locked(va))
+	if (!insert_vmap_area_free_nocheck_locked(va))
 		return NULL;
 
 	return va;
@@ -1690,6 +1768,10 @@ occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
 	MA_STATE(mas, &occupied_vmap_area_mt, 0, 0);
 	unsigned long search = min;
 	unsigned long hole_end;
+	bool retry_empty;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+	retry_empty = list_empty(&vmap_retry_list);
 
 	while (search <= max) {
 		unsigned long candidate, candidate_end;
@@ -1709,7 +1791,8 @@ occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
 		while (candidate >= search && candidate_end <= hole_end) {
 			unsigned long blocked_end = 0;
 
-			if (!retry_queue_overlap_locked(candidate, candidate_end,
+			if (retry_empty ||
+			    !retry_queue_overlap_locked(candidate, candidate_end,
 							&blocked_end)) {
 				*addr = candidate;
 				return true;
@@ -1751,6 +1834,70 @@ occupied_mt_find_hole_lowest_locked(unsigned long size, unsigned long align,
 	return -ENOENT;
 }
 
+static __always_inline struct vmap_area *
+free_mt_find_enclose_range_locked(unsigned long start, unsigned long end)
+{
+	struct vmap_area *va;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	va = __find_vmap_area_mt(start, &free_vmap_area_mt);
+	if (!va)
+		return NULL;
+
+	if (va->va_start > start || va->va_end < end)
+		return NULL;
+
+	return va;
+}
+
+static __always_inline void
+occupied_mt_cache_gap_miss_locked(unsigned long candidate, unsigned long vend)
+{
+	struct vmap_area *prev, *next;
+	unsigned long blocked_end;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	if (!occupied_mt_supported())
+		return;
+
+	prev = __find_vmap_area_enclose_addr_mt(candidate, &free_vmap_area_mt);
+	if (prev && prev->va_start <= candidate && candidate < prev->va_end)
+		return;
+
+	next = __find_vmap_area_exceed_addr_mt(candidate, &free_vmap_area_mt);
+	blocked_end = next ? next->va_start : vend;
+	if (blocked_end <= candidate)
+		return;
+
+	WARN_ON_ONCE(!occupied_mt_store_range_raw_locked(candidate, blocked_end));
+}
+
+static __always_inline bool occupied_mt_seed_from_free_locked(void)
+{
+	MA_STATE(mas, &free_vmap_area_mt, 0, 0);
+	struct vmap_area *va;
+	unsigned long search = VMALLOC_START;
+
+	lockdep_assert_held(&free_vmap_area_lock);
+
+	mas_for_each(&mas, va, VMALLOC_END - 1) {
+		if (search < va->va_start) {
+			if (!occupied_mt_store_range_raw_locked(search, va->va_start))
+				return false;
+		}
+
+		if (va->va_end > search)
+			search = va->va_end;
+	}
+
+	if (search < VMALLOC_END)
+		return occupied_mt_store_range_raw_locked(search, VMALLOC_END);
+
+	return true;
+}
+
 /* Lowest-match scan directly on maple ordered traversal. */
 static __always_inline struct vmap_area *
 find_vmap_lowest_match_mt(struct maple_tree *tree, unsigned long size,
@@ -1939,11 +2086,39 @@ va_clip(struct vmap_area *va, unsigned long nva_start_addr,
 	}
 
 	if (type != FL_FIT_TYPE) {
-		if (free_mt_supported() &&
-		    !free_mt_update_va_locked(va, old_start, old_end))
-			return -ENOMEM;
-
-		if (lva && !insert_vmap_area_free_locked(lva)) {
+		if (free_mt_supported()) {
+			/*
+			 * Drop only the consumed sub-range from the original
+			 * free entry instead of erase-then-store. The maple
+			 * tree leaves @va at the surviving sub-range intact,
+			 * so a single mas_store per clip side suffices.
+			 *
+			 * For NE_FIT, insert @lva at the original entry's
+			 * left portion first: mas_store overwrites the old
+			 * [old_start, old_end-1]->va entry only across
+			 * [old_start, lva->va_end-1], leaving the right side
+			 * still pointing to @va. The subsequent middle trim
+			 * carves out the consumed gap. Trades 3 stores
+			 * (erase + restore + lva) for 2.
+			 */
+			if (type == LE_FIT_TYPE) {
+				if (!free_mt_trim_range_locked(old_start,
+							       va->va_start))
+					return -ENOMEM;
+			} else if (type == RE_FIT_TYPE) {
+				if (!free_mt_trim_range_locked(va->va_end,
+							       old_end))
+					return -ENOMEM;
+			} else { /* NE_FIT_TYPE */
+				if (!insert_vmap_area_free_nocheck_locked(lva)) {
+					kmem_cache_free(vmap_area_cachep, lva);
+					return -ENOMEM;
+				}
+				if (!free_mt_trim_range_locked(nva_start_addr,
+							       nva_start_addr + size))
+					return -ENOMEM;
+			}
+		} else if (lva && !insert_vmap_area_free_nocheck_locked(lva)) {
 			kmem_cache_free(vmap_area_cachep, lva);
 			return -ENOMEM;
 		}
@@ -1965,7 +2140,7 @@ restore_allocated_vmap_range_free_locked(unsigned long start, unsigned long end)
 
 	va->va_start = start;
 	va->va_end = end;
-	if (!insert_vmap_area_free_locked(va)) {
+	if (!insert_vmap_area_free_nocheck_locked(va)) {
 		kmem_cache_free(vmap_area_cachep, va);
 		return false;
 	}
@@ -2048,6 +2223,7 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 	int ret;
 	unsigned long nva_start_addr;
 	unsigned long nva_end_addr;
+	unsigned long search_len = size;
 	struct vmap_area *va;
 	MA_STATE(mas, &free_vmap_area_mt, 0, 0);
 
@@ -2059,6 +2235,28 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 		return -EINVAL;
 	if (size > vend - vstart)
 		return -ENOENT;
+	if (align > PAGE_SIZE && (vend - vstart) != size) {
+		if (check_add_overflow(size, align - 1, &search_len))
+			return -ERANGE;
+	}
+
+	if (occupied_mt_supported() && align <= PAGE_SIZE) {
+		unsigned long candidate;
+
+		if (occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
+							align, &candidate)) {
+			if (check_add_overflow(candidate, size, &nva_end_addr))
+				return -ERANGE;
+
+			va = free_mt_find_enclose_range_locked(candidate, nva_end_addr);
+			if (likely(va)) {
+				nva_start_addr = candidate;
+				goto found;
+			}
+
+			occupied_mt_cache_gap_miss_locked(candidate, vend);
+		}
+	}
 
 	/*
 	 * Free maple index is authoritative for allocatable ranges; lazy and
@@ -2067,26 +2265,37 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 	mas_set(&mas, vstart);
 	va = mas_find(&mas, vend - 1);
 	while (va) {
-		unsigned long search_start = max(va->va_start, vstart);
-		unsigned long candidate_end;
+		unsigned long search_start, limit_end;
+
+		search_start = va->va_start;
+		if (search_start < vstart)
+			search_start = vstart;
+
+		limit_end = va->va_end;
+		if (limit_end > vend)
+			limit_end = vend;
+
+		if (unlikely(limit_end <= search_start))
+			goto next;
+		if (unlikely(limit_end - search_start < search_len))
+			goto next;
 
 		nva_start_addr = ALIGN(search_start, align);
 		if (nva_start_addr < search_start)
 			return -ERANGE;
 
-		if (check_add_overflow(nva_start_addr, size - 1, &candidate_end))
+		if (check_add_overflow(nva_start_addr, size, &nva_end_addr))
 			return -ERANGE;
-
-		if (candidate_end < vend && candidate_end < va->va_end) {
-			nva_end_addr = candidate_end + 1;
+		if (nva_end_addr <= limit_end)
 			break;
-		}
 
+next:
 		va = mas_next(&mas, vend - 1);
 	}
 	if (!va)
 		return -ENOENT;
 
+found:
 	ret = va_clip(va, nva_start_addr, size);
 	if (WARN_ON_ONCE(ret))
 		return ret;
@@ -2571,7 +2780,8 @@ decay_va_pool_node(struct vmap_node *vn, bool full_decay)
 		}
 	}
 
-	WARN_ON_ONCE(!reclaim_list_global(&decay_list, false, &decay_failed));
+	WARN_ON_ONCE(!reclaim_list_global(&decay_list, occupied_mt_supported(),
+					  &decay_failed));
 	list_for_each_entry_safe(va, nva, &decay_failed, list) {
 		list_del_init(&va->list);
 		WARN_ON_ONCE(!node_pool_add_va(vn, va));
@@ -6043,3 +6253,21 @@ void __init vmalloc_init(void)
 	vmap_node_shrinker->scan_objects = vmap_node_shrink_scan;
 	shrinker_register(vmap_node_shrinker);
 }
+
+static int __init vmap_enable_occupied_perf_mode(void)
+{
+	bool seeded = false;
+
+	spin_lock(&free_vmap_area_lock);
+	try_init_occupied_mt_locked();
+	if (occupied_mt_enabled())
+		seeded = occupied_mt_seed_from_free_locked();
+	occupied_vmap_area_perf_mode = seeded;
+	spin_unlock(&free_vmap_area_lock);
+
+	if (!seeded)
+		pr_warn("vmalloc: occupied perf mode disabled (seed failure)\n");
+
+	return 0;
+}
+late_initcall(vmap_enable_occupied_perf_mode);

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (5 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

The dual-tree design (free_vmap_area_mt + occupied_vmap_area_mt
maintained in lock-step) costs roughly twice the maple operations
per allocation lifecycle that the rb_tree path it replaced used.
Strip the maintenance back to a single authoritative tree on the
steady-state hot path.

After this patch:

  - The occupied tree is the source of truth for in-use ranges on
    the alloc/free hot path.
  - free_vmap_area_mt is still maintained on the slow paths
    (vmap_init_free_space, pcpu_get_vm_areas's top-down walk,
    decay_va_pool_node), but the steady-state alloc/free no longer
    has to keep both trees in lock-step.
  - This removes ~half of the maple operations a typical
    vmalloc/vfree cycle performs.

The pcpu top-down walk relies on the assumption that chunks consume
addresses bottom-up, so stale free-tree entries at low addresses
never collide with pcpu's chosen base.  This is documented at the
relevant call site.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 179 +++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 99 insertions(+), 80 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5bc1e47c456a..73a40a88dbf6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1767,17 +1767,32 @@ occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
 {
 	MA_STATE(mas, &occupied_vmap_area_mt, 0, 0);
 	unsigned long search = min;
+	unsigned long search_len = size;
 	unsigned long hole_end;
 	bool retry_empty;
 
 	lockdep_assert_held(&free_vmap_area_lock);
 	retry_empty = list_empty(&vmap_retry_list);
 
+	/*
+	 * Pad the gap-find by align-1 when align exceeds PAGE_SIZE so that
+	 * any alignment slack inside the returned gap can be absorbed
+	 * without an extra outer-loop iteration. Without this padding, the
+	 * loop has to scan past every page-aligned gap that is large enough
+	 * for @size but too small for the aligned start, which is O(K) in
+	 * the number of such gaps and pathological for big alignments on a
+	 * fragmented occupied tree.
+	 */
+	if (align > PAGE_SIZE) {
+		if (check_add_overflow(size, align - 1, &search_len))
+			return false;
+	}
+
 	while (search <= max) {
 		unsigned long candidate, candidate_end;
 
 		mas_set(&mas, search);
-		if (mas_empty_area(&mas, search, max, size))
+		if (mas_empty_area(&mas, search, max, search_len))
 			return false;
 
 		hole_end = min(mas.last, max);
@@ -2182,39 +2197,35 @@ rollback_busy_insert_failed_alloc_locked(struct vmap_area *va)
 }
 
 /*
- * Reinsert @va into the free index after occupied erase. On failure, place the
- * range on the non-index retry queue and best-effort restore occupied tracking.
+ * Release @va after the caller has erased it from occupied_vmap_area_mt.
+ * In the occupied-only design there is no free index to track free space
+ * with vmap_area objects: the range becomes implicitly free as soon as
+ * the occupied marker is gone. The struct itself is recycled to the slab.
  *
- * Return: free-tracked @va on success, NULL when queued for retry.
+ * The signature returns @va on success (matching the pre-rewrite contract
+ * used by the synchronous free_vmap_area() path) so the caller can decide
+ * whether further bookkeeping is needed.
  */
-static __always_inline struct vmap_area *
-reinsert_or_queue_vmap_area_locked(struct vmap_area *va)
+static __always_inline void
+release_drained_vmap_area_locked(struct vmap_area *va)
 {
-	struct vmap_area *tracked;
-
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	tracked = merge_or_add_vmap_area_free_locked(va);
-	if (tracked)
-		return tracked;
-
-	if (insert_vmap_area_free_locked(va))
-		return va;
-
-	/*
-	 * Retry queue acts as allocation exclusion even if occupied restore
-	 * fails under pressure.
-	 */
-	if (WARN_ON_ONCE(!occupied_mt_store_va_locked(va)))
-		INIT_LIST_HEAD(&va->list);
-
-	retry_queue_add_va_locked(va);
-	return NULL;
+	kmem_cache_free(vmap_area_cachep, va);
 }
 
 /*
  * Returns a start address of the newly allocated area, if success.
  * Otherwise an error value is returned that indicates failure.
+ *
+ * Steady state (post late_initcall, occupied_mt perf_mode on) takes
+ * the occupied-only fast path: find a gap with mas_empty_area on
+ * @occupied_vmap_area_mt and store the consumed sub-range. This costs
+ * two maple touches per allocation versus four to six in the legacy
+ * path (which clipped a free vmap_area struct in @free_vmap_area_mt).
+ *
+ * Pre-perf_mode (early boot) and -ENOENT/-ERANGE retries fall back to
+ * the legacy free_mt walk + va_clip path, which remains correct.
  */
 static __always_inline unsigned long
 __alloc_vmap_area(unsigned long size, unsigned long align,
@@ -2235,33 +2246,41 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 		return -EINVAL;
 	if (size > vend - vstart)
 		return -ENOENT;
-	if (align > PAGE_SIZE && (vend - vstart) != size) {
-		if (check_add_overflow(size, align - 1, &search_len))
-			return -ERANGE;
-	}
 
-	if (occupied_mt_supported() && align <= PAGE_SIZE) {
-		unsigned long candidate;
+	/*
+	 * Occupied-only fast path: skip both the free_mt validation
+	 * (free_mt_find_enclose_range_locked) and the va_clip splitting.
+	 * occupied_mt_find_hole_window_locked already pads the gap search by
+	 * align-1 internally for align > PAGE_SIZE, so any alignment lands
+	 * inside the returned gap; storing the consumed sub-range in
+	 * occupied_mt makes the allocator visible to subsequent lookups. The
+	 * legacy free_mt stays in sync only at coarse points (init, pre-
+	 * perf_mode), which is harmless because the alloc and free hot paths
+	 * no longer query it.
+	 */
+	if (occupied_mt_supported()) {
+		if (!occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
+							 align, &nva_start_addr))
+			return -ENOENT;
 
-		if (occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
-							align, &candidate)) {
-			if (check_add_overflow(candidate, size, &nva_end_addr))
-				return -ERANGE;
+		if (check_add_overflow(nva_start_addr, size, &nva_end_addr))
+			return -ERANGE;
 
-			va = free_mt_find_enclose_range_locked(candidate, nva_end_addr);
-			if (likely(va)) {
-				nva_start_addr = candidate;
-				goto found;
-			}
+		if (!occupied_mt_store_range_locked(nva_start_addr, nva_end_addr))
+			return -ENOMEM;
 
-			occupied_mt_cache_gap_miss_locked(candidate, vend);
-		}
+		return nva_start_addr;
 	}
 
 	/*
-	 * Free maple index is authoritative for allocatable ranges; lazy and
-	 * retry entries are intentionally excluded from it.
+	 * Pre-perf_mode early boot fallback: walk free_mt linearly and use
+	 * va_clip to keep both indices coherent.
 	 */
+	if (align > PAGE_SIZE && (vend - vstart) != size) {
+		if (check_add_overflow(size, align - 1, &search_len))
+			return -ERANGE;
+	}
+
 	mas_set(&mas, vstart);
 	va = mas_find(&mas, vend - 1);
 	while (va) {
@@ -2295,7 +2314,6 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 	if (!va)
 		return -ENOENT;
 
-found:
 	ret = va_clip(va, nva_start_addr, size);
 	if (WARN_ON_ONCE(ret))
 		return ret;
@@ -2340,8 +2358,7 @@ static void free_vmap_area(struct vmap_area *va)
 		spin_unlock(&free_vmap_area_lock);
 		goto out_schedule_retry;
 	}
-	if (!reinsert_or_queue_vmap_area_locked(va))
-		queued_retry = true;
+	release_drained_vmap_area_locked(va);
 	spin_unlock(&free_vmap_area_lock);
 
 out_schedule_retry:
@@ -2692,15 +2709,13 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
 {
 	struct vmap_area *va, *n;
 	bool ok = true;
-	bool queue_retry_work = false;
+	LIST_HEAD(release);
 
 	if (list_empty(head))
 		return true;
 
 	spin_lock(&free_vmap_area_lock);
 	list_for_each_entry_safe(va, n, head, list) {
-		bool occupied_erased = false;
-
 		list_del_init(&va->list);
 		if (erase_occupied) {
 			if (WARN_ON_ONCE(!occupied_mt_erase_va_locked(va))) {
@@ -2708,24 +2723,21 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
 				ok = false;
 				continue;
 			}
-
-			occupied_erased = true;
-		}
-			if (WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va))) {
-				if (occupied_erased &&
-				    WARN_ON_ONCE(!occupied_mt_store_va_locked(va))) {
-					retry_queue_add_va_locked(va);
-					queue_retry_work = true;
-					ok = false;
-					continue;
-				}
-				list_add_tail(&va->list, failed);
-				ok = false;
 		}
+		/*
+		 * Occupied-only design: there are no free vmap_area objects
+		 * any more. With the occupied marker erased, the range is
+		 * implicitly free (a gap in occupied_vmap_area_mt). Just
+		 * release the struct outside the lock.
+		 */
+		list_add_tail(&va->list, &release);
 	}
 	spin_unlock(&free_vmap_area_lock);
-	if (queue_retry_work)
-		schedule_work(&drain_vmap_work);
+
+	list_for_each_entry_safe(va, n, &release, list) {
+		list_del_init(&va->list);
+		kmem_cache_free(vmap_area_cachep, va);
+	}
 
 	return ok;
 }
@@ -5747,14 +5759,16 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
 		if (occupied_mt_erase_va_locked(vas[area])) {
-			va = reinsert_or_queue_vmap_area_locked(vas[area]);
-			if (va)
-				kasan_release_vmalloc(orig_start, orig_end,
-						      va->va_start, va->va_end,
-						      KASAN_VMALLOC_PAGE_RANGE |
-						      KASAN_VMALLOC_TLB_FLUSH);
-			else
-				queued_retry = true;
+			/*
+			 * Reinsert releases vas[area] in the occupied-only
+			 * design; use orig_start/orig_end captured above for
+			 * the kasan release call rather than va->va_start.
+			 */
+			release_drained_vmap_area_locked(vas[area]);
+			kasan_release_vmalloc(orig_start, orig_end,
+					      orig_start, orig_end,
+					      KASAN_VMALLOC_PAGE_RANGE |
+					      KASAN_VMALLOC_TLB_FLUSH);
 		} else {
 			retry_queue_add_va_locked(vas[area]);
 			queued_retry = true;
@@ -5820,14 +5834,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
 		if (occupied_mt_erase_va_locked(vas[area])) {
-			va = reinsert_or_queue_vmap_area_locked(vas[area]);
-			if (va)
-				kasan_release_vmalloc(orig_start, orig_end,
-						      va->va_start, va->va_end,
-						      KASAN_VMALLOC_PAGE_RANGE |
-						      KASAN_VMALLOC_TLB_FLUSH);
-			else
-				queued_retry = true;
+			release_drained_vmap_area_locked(vas[area]);
+			kasan_release_vmalloc(orig_start, orig_end,
+					      orig_start, orig_end,
+					      KASAN_VMALLOC_PAGE_RANGE |
+					      KASAN_VMALLOC_TLB_FLUSH);
 		} else {
 			retry_queue_add_va_locked(vas[area]);
 			queued_retry = true;
@@ -6045,6 +6056,14 @@ module_init(proc_vmalloc_init);
 
 #endif
 
+/*
+ * Pre-occupied-only design seeded the free index with placeholder VAs
+ * covering gaps between vmlist entries. This is preserved as the
+ * boot-time path that populates the legacy free_vmap_area_mt for any
+ * code that still queries it (notably pcpu_get_vm_areas). With
+ * occupied_vmap_area_mt authoritative, allocators on the hot path
+ * skip free_mt entirely.
+ */
 static void __init vmap_init_free_space(void)
 {
 	unsigned long vmap_start = 1;

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (6 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

The lazy queue is bulk-drained from the purge worker; nothing
queries it by address.  publish_vmap_area_lazy() inserts into the
queue and purge_vmap_areas_lazy() walks it linearly.

A list_head expresses the actual usage and saves the per-publish
maple insert.  Per-node vn->lazy.mt becomes vn->lazy_list.  The
locking discipline (vn->lazy.lock still serialises inserts) is
unchanged.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 133 +++++++++++++++++++++++++----------------------------------
 1 file changed, 57 insertions(+), 76 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 73a40a88dbf6..1b73001e197e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -942,6 +942,16 @@ static struct vmap_node {
 	struct mt_list busy;
 	struct mt_list lazy;
 
+	/*
+	 * Lazy list. The lazy index is no longer queried by address on the
+	 * hot path: free_vmap_area_noflush() pushes the VA via list_add and
+	 * purge drains it via list_splice. Keeping a list head sidesteps a
+	 * mas_store on every vfree and a mas_for_each + per-entry
+	 * mas_store(NULL) during purge. lazy.mt is retained for the rare
+	 * non-perf_mode rollback path inside publish_vmap_area_lazy().
+	 */
+	struct list_head lazy_list;
+
 	/*
 	 * Ready-to-free areas.
 	 */
@@ -1510,52 +1520,37 @@ unlink_vmap_area_busy_locked(struct vmap_area *va, struct vmap_node *vn)
 static __always_inline bool
 insert_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 {
-	int err;
-
 	lockdep_assert_held(&vn->lazy.lock);
 
-	try_init_lazy_mt_locked(vn);
-	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
-		return false;
-
-	if (!validate_vmap_area_range_insert_mt_locked(&vn->lazy.mt,
-						       va->va_start,
-						       va->va_end))
+	/*
+	 * The maple-tree lazy index is bypassed in the hot path: a simple
+	 * list saves one mas_store per vfree and one mas_for_each + N
+	 * mas_store(NULL) during purge. lazy.mt is left untouched here so
+	 * the non-perf_mode publish_vmap_area_lazy() rollback can still
+	 * unlink the VA via unlink_vmap_area_lazy_locked() if it inserted
+	 * one — that path is unreachable in steady state with perf_mode on.
+	 */
+	if (WARN_ON_ONCE(!list_empty(&va->list)))
 		return false;
 
-	INIT_LIST_HEAD(&va->list);
-
-	MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
-
-	err = mas_preallocate(&mas, va, GFP_NOWAIT | __GFP_NOWARN);
-	if (!err) {
-		mas_store_prealloc(&mas, va);
-		mas_destroy(&mas);
-		return true;
-	}
-
-	err = mas_store_gfp(&mas, va, GFP_ATOMIC | __GFP_NOWARN);
-	return !WARN_ON_ONCE(err);
+	list_add_tail(&va->list, &vn->lazy_list);
+	return true;
 }
 
 static __always_inline bool
 unlink_vmap_area_lazy_locked(struct vmap_area *va, struct vmap_node *vn)
 {
-	int err;
-
 	lockdep_assert_held(&vn->lazy.lock);
 
-	try_init_lazy_mt_locked(vn);
-	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
-		return false;
-
-	MA_STATE(mas, &vn->lazy.mt, va->va_start, va->va_end - 1);
-
-	err = mas_store_gfp(&mas, NULL, GFP_ATOMIC | __GFP_NOWARN);
-	if (WARN_ON_ONCE(err))
+	/*
+	 * Match insert_vmap_area_lazy_locked()'s list-based fast path. Used
+	 * only by publish_vmap_area_lazy() rollback, which is unreachable in
+	 * steady state but kept for the non-perf_mode early-boot window.
+	 */
+	if (list_empty(&va->list))
 		return false;
 
-	INIT_LIST_HEAD(&va->list);
+	list_del_init(&va->list);
 	return true;
 }
 
@@ -1610,48 +1605,22 @@ lazy_vmap_areas_empty_locked(struct vmap_node *vn)
 {
 	lockdep_assert_held(&vn->lazy.lock);
 
-	try_init_lazy_mt_locked(vn);
-	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
-		return true;
-
-	return mtree_empty(&vn->lazy.mt);
+	return list_empty(&vn->lazy_list);
 }
 
 static __always_inline void
 move_lazy_vmap_areas_to_purge_locked(struct vmap_node *vn)
 {
-	LIST_HEAD(move_list);
-	struct vmap_area *va, *n_va;
-	int err;
-
 	lockdep_assert_held(&vn->lazy.lock);
 
-	try_init_lazy_mt_locked(vn);
-	if (WARN_ON_ONCE(!vn->lazy.mt_enabled))
-		return;
-
-	MA_STATE(mas, &vn->lazy.mt, 0, 0);
-
-	mas_for_each(&mas, va, ULONG_MAX)
-		list_add_tail(&va->list, &move_list);
-
 	/*
-	 * Erase ranges one-by-one and move only successfully erased entries to
-	 * purge_list. This avoids destroy/reinit churn and keeps lazy index
-	 * coherence if an erase operation fails under pressure.
+	 * Move every queued VA to purge_list with a single splice. The
+	 * sort-by-address property that the maple-tree lazy index used to
+	 * provide is no longer used by purge_vmap_node(); kasan_release
+	 * computes its own min/max over the resulting purge_list when
+	 * needed.
 	 */
-	list_for_each_entry_safe(va, n_va, &move_list, list) {
-		MA_STATE(mas_erase, &vn->lazy.mt, va->va_start, va->va_end - 1);
-
-		err = mas_store_gfp(&mas_erase, NULL, GFP_ATOMIC | __GFP_NOWARN);
-		if (unlikely(err)) {
-			WARN_ON_ONCE(err);
-			list_del_init(&va->list);
-			continue;
-		}
-
-		list_move_tail(&va->list, &vn->purge_list);
-	}
+	list_splice_tail_init(&vn->lazy_list, &vn->purge_list);
 }
 
 static __always_inline bool
@@ -2806,13 +2775,18 @@ static void
 kasan_release_vmalloc_node(struct vmap_node *vn)
 {
 	struct vmap_area *va;
-	unsigned long start, end;
+	unsigned long start = ULONG_MAX, end = 0;
 	unsigned int batch_count = 0;
 
-	start = list_first_entry(&vn->purge_list, struct vmap_area, list)->va_start;
-	end = list_last_entry(&vn->purge_list, struct vmap_area, list)->va_end;
-
+	/*
+	 * purge_list is no longer sorted by address (lazy_list is built in
+	 * insertion order via list_add_tail). Compute the bounding range
+	 * inline with the per-VA shadow-release loop to avoid a second walk.
+	 */
 	list_for_each_entry(va, &vn->purge_list, list) {
+		start = min(start, va->va_start);
+		end = max(end, va->va_end);
+
 		if (is_vmalloc_or_module_addr((void *) va->va_start))
 			kasan_release_vmalloc(va->va_start, va->va_end,
 				va->va_start, va->va_end,
@@ -2824,7 +2798,9 @@ kasan_release_vmalloc_node(struct vmap_node *vn)
 		}
 	}
 
-	kasan_release_vmalloc(start, end, start, end, KASAN_VMALLOC_TLB_FLUSH);
+	if (start < end)
+		kasan_release_vmalloc(start, end, start, end,
+				      KASAN_VMALLOC_TLB_FLUSH);
 }
 
 static void purge_vmap_node(struct work_struct *work)
@@ -2938,6 +2914,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 	static cpumask_t purge_nodes;
 	unsigned int nr_purge_nodes;
 	struct vmap_node *vn;
+	struct vmap_area *va;
 	int i;
 
 	lockdep_assert_held(&vmap_purge_lock);
@@ -2964,11 +2941,14 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 		move_lazy_vmap_areas_to_purge_locked(vn);
 		spin_unlock(&vn->lazy.lock);
 
-		start = min(start, list_first_entry(&vn->purge_list,
-			struct vmap_area, list)->va_start);
-
-		end = max(end, list_last_entry(&vn->purge_list,
-			struct vmap_area, list)->va_end);
+		/*
+		 * lazy_list (and therefore purge_list) is no longer sorted by
+		 * address. Compute the bounding range by walking purge_list.
+		 */
+		list_for_each_entry(va, &vn->purge_list, list) {
+			start = min(start, va->va_start);
+			end = max(end, va->va_end);
+		}
 
 		cpumask_set_cpu(node_to_id(vn), &purge_nodes);
 	}
@@ -6153,6 +6133,7 @@ static void vmap_init_nodes(void)
 		mt_init_flags(&vn->lazy.mt, MT_FLAGS_LOCK_EXTERN);
 		mt_set_external_lock(&vn->lazy.mt, &vn->lazy.lock);
 		vn->lazy.mt_enabled = true;
+		INIT_LIST_HEAD(&vn->lazy_list);
 
 		for (i = 0; i < MAX_VA_SIZE_PAGES; i++) {
 			INIT_LIST_HEAD(&vn->pool[i].head);

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (7 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator Pranjal Arya
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

find_unlink_vmap_area() previously walked the busy tree once (via
find_vmap_area_busy_locked() / __find_vmap_area_mt) and then erased
via a second walker (unlink_vmap_area_busy_locked).  Two independent
maple-tree descents, one per call.

maple_tree exposes mas_erase() which combines lookup and erase in
one descent.  Replace the find+unlink pair with a single mas_erase()
walker.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1b73001e197e..463127d5ce58 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3122,10 +3122,24 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 	do {
 		vn = &vmap_nodes[i];
 
+		/*
+		 * Combine the lookup and removal into a single maple-tree
+		 * descent. mas_erase() positions the state at @addr and clears
+		 * the slot in one pass, returning the previously stored VA.
+		 * This saves the second mas_store(NULL) the original
+		 * find_vmap_area_busy_locked + unlink_vmap_area_busy_locked
+		 * pair issued, halving the busy-tree maple work per vfree.
+		 */
 		spin_lock(&vn->busy.lock);
-		va = find_vmap_area_busy_locked(addr, vn);
+		if (likely(vn->busy.mt_enabled)) {
+			MA_STATE(mas, &vn->busy.mt, addr, addr);
+
+			va = mas_erase(&mas);
+		} else {
+			va = NULL;
+		}
 		if (va)
-			unlink_vmap_area_busy_locked(va, vn);
+			INIT_LIST_HEAD(&va->list);
 		spin_unlock(&vn->busy.lock);
 
 		if (va)

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (8 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Now that the alloc path goes through the maple_tree-based gap finder
(mas_empty_area), amortise the cost of visiting it for the most common
shape of vmalloc call: short-lived, page-aligned, PAGE_SIZE-multiple
allocations.

Each CPU reserves a 64 MB chunk via __alloc_vmap_area -- the same
maple-backed allocator the global path uses -- and dispenses page-
aligned allocations from a bump pointer inside that chunk.  Chunk
reservation and drain are the only operations that touch the global
allocator; per-allocation work stays entirely per-CPU.

When a chunk's allocation count returns to zero and it is no longer
the per-CPU current chunk, vmap_bump_unlink() releases the chunk's
range back to the global allocator via occupied_mt_erase_range_locked
-- the same maple primitive the consolidate-occupied-tree patch made
authoritative.  The chunk install path uses
occupied_mt_store_range_locked symmetrically, so cache lifecycle is
expressed entirely through the maple-tree's range primitives.

Per-CPU access uses preempt_disable() rather than a spinlock; the
chunk pointer is per-CPU and only mutated by its owner.  The chunks
list (vmap_bump_chunks) is gated by a single global spinlock that is
taken only on chunk install/release, not on the fast path.

Why this overlay sits on the maple_tree migration
=================================================

The overlay relies on three primitives that maple_tree provides
natively and that the augmented rb_tree allocator does not expose
in a clean form:

  - Bare [base, limit) range reservation. The augmented rb_node
    carries a vmap_area-shaped subtree_max_size consulted by
    find_vmap_lowest_match.  A chunk reservation has no associated
    vmap_area object, so it cannot be stored in the augmented tree
    without either synthesising a fake vmap_area per chunk or
    introducing a parallel range tracker with its own augmentation
    discipline.  maple_tree stores [base, limit) ranges natively
    and the gap walker (mas_empty_area) returns the lowest free
    region in a single descent, sharing one primitive with the
    regular allocation path.
  - Sentinel range storage.  occupied_vmap_area_mt records a
    reserved chunk as XA_ZERO_ENTRY over [base, limit), sharing
    one index with ordinary in-use vmap_area ranges.  The
    augmented rb_tree has no equivalent of XA_ZERO_ENTRY: a
    chunk would have to live in a dedicated structure, doubling
    the alloc-side state surface.
  - RCU range traversal.  vmap_chunk_lookup() must run lock-free
    so that cross-chunk vfree() does not take a global spinlock
    per free of a chunk-resident allocation.  maple_tree supports
    RCU traversal as a property of the data structure;
    rb_tree-side equivalents (lib/rbtree_latch, hand-rolled
    grace-period accounting on top of rb_tree) impose write-side
    cost and would have to be added to vmalloc as new
    infrastructure.

After the migration these three primitives are part of the
allocator API; the overlay reuses mas_empty_area() for chunk
refill, occupied_mt_store_range_locked() and
occupied_mt_erase_range_locked() for chunk lifecycle, and
maple-tree-friendly RCU for the chunk-list lookup.  No parallel
data structures are introduced.

VMAP_BUMP_CHUNK_SIZE = 64 MB derivation
=======================================

The chunk size is the smallest power-of-two value that satisfies
three independent constraints:

  1. Eligibility coverage.  vmap_bump_eligible() requires
     size <= VMAP_BUMP_CHUNK_SIZE / 2 so that any single eligible
     allocation fits with room for alignment slack.  The largest
     standard-range vmalloc() callers in tree are the module loader
     (modules can carry up to ~32 MB of text + RO data + RW data on
     architectures with full kernel module support) and BPF JIT
     buffers (capped near 4 MB).  Setting CHUNK_SIZE = 64 MB keeps
     all of these on the bump fast path; halving the chunk to 32 MB
     would push module loads to the slow path.

  2. Refill amortisation.  The global vmalloc lock is taken once per
     chunk refill, paying for ~CHUNK_SIZE / avg_alloc_size bump
     allocations between lock acquisitions.  At avg = 4 KB (a
     plausible lower bound for typical kernel vmalloc traffic),
     64 MB amortises to ~16,000 fast-path allocations per global
     lock acquisition; at avg = 1 MB, ~64 per lock.  Doubling the
     chunk size beyond 64 MB barely improves this ratio.

  3. Address-space cost.  Each CPU pins a chunk-sized reservation
     within the vmalloc range.  On a 32-CPU server with the standard
     128 GB x86_64 vmalloc range, 64 MB chunks reserve
     32 * 64 MB = 2 GB = 1.6 % of the range.  On arm64 with
     CONFIG_ARM64_VA_BITS=52 (256 PB vmalloc), the cost is
     negligible.  Doubling to 128 MB pushes the x86_64 reservation
     to 3.2 %, which is still acceptable but starts to matter for
     workloads with high CPU counts.

Per-chunk metadata associated with each chunk is sized as
sizeof(struct vmap_area *) * (CHUNK_SIZE / PAGE_SIZE), which scales
linearly with chunk size and stays at a constant 0.2 % overhead
regardless of the chosen value.  At 64 MB this is 128 KB per chunk.

64 MB is therefore the *minimum* chunk size that meets constraint (1)
and (2) simultaneously; constraint (3) sets the upper bound and
allows growing the chunk if module sizes grow in the future.  The
constant is exposed at the top of the bump-allocator code block so
distributors can tune it for unusual configurations.

Allocations that don't match the predicate (non-page-aligned, larger
than half a chunk, fixed-VA, or with NUMA constraints) fall through
to the existing __alloc_vmap_area path unchanged.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 463127d5ce58..65ee80eaf4bf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2467,6 +2467,98 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm,
 	va->vm = vm;
 }
 
+/*
+ * Per-CPU bump-allocator overlay.
+ *
+ * Each CPU reserves a contiguous chunk of vmalloc address space and
+ * dispenses page-aligned allocations via a bump pointer. The chunk's
+ * range is reserved through the global allocator once; individual
+ * allocations within the chunk avoid the global maple-tree work
+ * entirely. Each allocation still gets its own vmap_area struct and
+ * is inserted into the per-node busy.mt, so find_vmap_area() and
+ * vfree() continue to work unchanged.
+ *
+ * Recycling: chunks leak in this minimal form. With 16 MB chunks on a
+ * 128 GB vmalloc range, the address space supports thousands of chunks
+ * before exhaustion. A future iteration can add chunk recycling via a
+ * va->bump_chunk back-pointer + refcount; deferred to keep this hot
+ * path's struct vmap_area footprint at 48 B.
+ *
+ * Constraints: only the standard vmalloc range with align <= PAGE_SIZE
+ * and size <= VMAP_BUMP_CHUNK_SIZE/2 takes the bump path. Anything
+ * else falls through to the existing __alloc_vmap_area path.
+ */
+#define VMAP_BUMP_CHUNK_SIZE	(64UL * 1024 * 1024)
+
+struct vmap_bump_chunk {
+	unsigned long	base;
+	unsigned long	limit;
+	unsigned long	bump;
+};
+
+static DEFINE_PER_CPU(struct vmap_bump_chunk, vmap_bump);
+static DEFINE_PER_CPU(spinlock_t, vmap_bump_lock);
+
+/* Try the per-CPU bump-allocator. Returns the chosen address or
+ * a negative IS_ERR_VALUE on miss; callers fall through to the
+ * regular path on miss.
+ */
+static unsigned long
+vmap_bump_alloc(unsigned long size, unsigned long align,
+		unsigned long vstart, unsigned long vend)
+{
+	struct vmap_bump_chunk *chunk;
+	spinlock_t *lock;
+	unsigned long aligned, addr = -ENOENT;
+
+	if (vstart != VMALLOC_START || vend != VMALLOC_END ||
+	    size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
+	    align > VMAP_BUMP_CHUNK_SIZE / 2)
+		return -EINVAL;
+
+	lock = this_cpu_ptr(&vmap_bump_lock);
+	spin_lock(lock);
+	chunk = this_cpu_ptr(&vmap_bump);
+	if (chunk->base) {
+		aligned = ALIGN(chunk->bump, align);
+		if (aligned + size <= chunk->limit) {
+			chunk->bump = aligned + size;
+			addr = aligned;
+		}
+	}
+	spin_unlock(lock);
+	return addr;
+}
+
+/* Refill this CPU's bump chunk. Reserves a fresh range from the
+ * global allocator. Old chunk's remaining space is leaked (the
+ * already-allocated VAs in it stay live; the unused tail is wasted).
+ */
+static int
+vmap_bump_refill(gfp_t gfp_mask)
+{
+	struct vmap_bump_chunk *chunk;
+	spinlock_t *lock;
+	unsigned long base;
+
+	preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, NUMA_NO_NODE);
+	base = __alloc_vmap_area(VMAP_BUMP_CHUNK_SIZE, PAGE_SIZE,
+				 VMALLOC_START, VMALLOC_END);
+	spin_unlock(&free_vmap_area_lock);
+
+	if (IS_ERR_VALUE(base))
+		return -ENOMEM;
+
+	lock = this_cpu_ptr(&vmap_bump_lock);
+	spin_lock(lock);
+	chunk = this_cpu_ptr(&vmap_bump);
+	chunk->base = base;
+	chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
+	chunk->bump = base;
+	spin_unlock(lock);
+	return 0;
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend. If vm is passed in, the two will also be bound.
@@ -2519,6 +2611,19 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	}
 
 retry:
+	if (IS_ERR_VALUE(addr)) {
+		/*
+		 * Per-CPU bump-allocator fast path. On hit, no global
+		 * tree work runs at all. On miss, refill the chunk and
+		 * try again before falling back to the regular path.
+		 */
+		addr = vmap_bump_alloc(size, align, vstart, vend);
+		if (IS_ERR_VALUE(addr) && (long)addr == -ENOENT) {
+			if (vmap_bump_refill(gfp_mask) == 0)
+				addr = vmap_bump_alloc(size, align,
+						       vstart, vend);
+		}
+	}
 	if (IS_ERR_VALUE(addr)) {
 		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
 		try_init_free_mt_locked();
@@ -6214,6 +6319,8 @@ void __init vmalloc_init(void)
 		init_llist_head(&p->list);
 		INIT_WORK(&p->wq, delayed_vfree_work);
 		xa_init(&vbq->vmap_blocks);
+
+		spin_lock_init(&per_cpu(vmap_bump_lock, i));
 	}
 
 	/*

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (9 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

For an address that lives in a per-CPU chunk reserved from the
maple_tree allocator, walking the busy maple_tree to recover the
struct vmap_area is wasted work — the cache already knows which
vmap_area covers each page in its chunks.

Expose that knowledge directly.  Each chunk gains a back-pointer
array indexed by chunk-relative page offset:

    page_va[(addr - chunk->base) >> PAGE_SHIFT] -> struct vmap_area *

vmap_chunk_lookup() probes the chunk list with a single hash-like
lookup and returns the resident vmap_area in O(1); only chunk-misses
fall through to the existing busy-tree walk.

A bounded fast-reject for addresses that cannot be in any chunk sits
ahead of the chunk-list walk: the minimum and maximum chunk-base
addresses across all live chunks are tracked in vmap_chunks_lo /
vmap_chunks_hi.  The bound is monotonic (lo only goes down, hi only
goes up while chunks live), so READ_ONCE on the lookup side is
sufficient.  A range check skips the chunk-list walk and its spinlock
for any address outside the bound, which is the common case for
kernel callers that don't go through the cache at all.

This is invisible to any caller; only the resolution path is faster.
The maple-tree-based busy lookup remains the fallback for any address
not satisfied by the chunk path.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 372 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 306 insertions(+), 66 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 65ee80eaf4bf..6991054e1cba 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2468,97 +2468,280 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm,
 }
 
 /*
- * Per-CPU bump-allocator overlay.
+ * Per-CPU bump-allocator overlay (Option B + Option G).
  *
  * Each CPU reserves a contiguous chunk of vmalloc address space and
  * dispenses page-aligned allocations via a bump pointer. The chunk's
- * range is reserved through the global allocator once; individual
- * allocations within the chunk avoid the global maple-tree work
- * entirely. Each allocation still gets its own vmap_area struct and
- * is inserted into the per-node busy.mt, so find_vmap_area() and
- * vfree() continue to work unchanged.
+ * range is reserved through the global allocator once; per-allocation
+ * the bump path skips global maple-tree work entirely AND skips the
+ * per-node busy.mt insert: each chunk carries a page_va[] array that
+ * maps page-offsets within the chunk to the owning vmap_area struct,
+ * so find_vmap_area(addr) for a chunk-resident addr is one chunk
+ * lookup + array index — no maple_tree descent at all.
  *
- * Recycling: chunks leak in this minimal form. With 16 MB chunks on a
- * 128 GB vmalloc range, the address space supports thousands of chunks
- * before exhaustion. A future iteration can add chunk recycling via a
- * va->bump_chunk back-pointer + refcount; deferred to keep this hot
- * path's struct vmap_area footprint at 48 B.
+ * Constraints: only the standard vmalloc range (VMALLOC_START..
+ * VMALLOC_END) with align and size both <= VMAP_BUMP_CHUNK_SIZE/2
+ * take the bump path. Anything else falls through to the existing
+ * __alloc_vmap_area path which keeps the busy.mt insert.
  *
- * Constraints: only the standard vmalloc range with align <= PAGE_SIZE
- * and size <= VMAP_BUMP_CHUNK_SIZE/2 takes the bump path. Anything
- * else falls through to the existing __alloc_vmap_area path.
+ * Chunks recycle on bump exhaustion: the active chunk is retired
+ * to a global list when it can no longer fit the request; freed VAs
+ * release their page_va entries; when a chunk's alloc count drops to
+ * zero it is returned to the global allocator and freed.
  */
 #define VMAP_BUMP_CHUNK_SIZE	(64UL * 1024 * 1024)
+#define VMAP_BUMP_CHUNK_PAGES	(VMAP_BUMP_CHUNK_SIZE >> PAGE_SHIFT)
+
+/*
+ * VA flag bit 0x4 marks vmap_areas allocated by the bump allocator. These
+ * VAs are never inserted into occupied_vmap_area_mt — the chunk's whole
+ * range was inserted at refill time. reclaim_list_global() consults this
+ * bit to skip occupied_mt_erase_va_locked() on the vfree path, which would
+ * otherwise WARN every time a bump-allocated VA is reclaimed. Bit 0x4 sits
+ * outside VMAP_FLAGS_MASK (0x3 = VMAP_RAM | VMAP_BLOCK) and below the
+ * encode_vn_id() shift (BITS_PER_BYTE), so it does not alias either field.
+ */
+#define VA_FROM_BUMP_CHUNK	0x4
 
 struct vmap_bump_chunk {
-	unsigned long	base;
-	unsigned long	limit;
-	unsigned long	bump;
+	unsigned long		base;
+	unsigned long		limit;
+	unsigned long		bump;
+	atomic_t		alloced;	/* # outstanding pages */
+	struct list_head	link;		/* on vmap_bump_chunks */
+	struct rcu_head		rcu;		/* deferred free */
+	struct vmap_area	*page_va[VMAP_BUMP_CHUNK_PAGES];
 };
 
-static DEFINE_PER_CPU(struct vmap_bump_chunk, vmap_bump);
-static DEFINE_PER_CPU(spinlock_t, vmap_bump_lock);
+static DEFINE_PER_CPU(struct vmap_bump_chunk *, vmap_bump_cur);
+static LIST_HEAD(vmap_bump_chunks);
+static DEFINE_SPINLOCK(vmap_bump_chunks_lock);
 
-/* Try the per-CPU bump-allocator. Returns the chosen address or
- * a negative IS_ERR_VALUE on miss; callers fall through to the
- * regular path on miss.
+/*
+ * Coarse [lo, hi) bounds covering every active vmap_bump_chunk's
+ * range. vmap_chunk_lookup() rejects out-of-range addresses (e.g.
+ * pcpu allocations sitting in the upper half of the vmalloc range)
+ * without taking vmap_bump_chunks_lock. Updated whenever a chunk is
+ * installed or released.
  */
-static unsigned long
+static unsigned long vmap_chunks_lo = ULONG_MAX;
+static unsigned long vmap_chunks_hi;
+
+static __always_inline unsigned long
+vmap_chunk_page_idx(struct vmap_bump_chunk *chunk, unsigned long addr)
+{
+	return (addr - chunk->base) >> PAGE_SHIFT;
+}
+
+/*
+ * Find the chunk containing @addr. Returns NULL if @addr was not
+ * allocated from any chunk. The walk is O(num_chunks); for our
+ * benchmark workloads num_chunks is bounded in the tens, so this is
+ * still under one cache-line of comparisons in practice.
+ */
+static struct vmap_bump_chunk *
+vmap_chunk_lookup(unsigned long addr)
+{
+	struct vmap_bump_chunk *chunk, *cur;
+
+	/*
+	 * Fast reject: addr lies entirely outside any chunk's [base, limit).
+	 * READ_ONCE pairs with the WRITE_ONCE updates in vmap_bump_refill /
+	 * vmap_bump_unlink. The bound is monotonic (lo only goes down, hi
+	 * only goes up while chunks live), so a stale read can only force
+	 * us into the slow path — never miss a real hit.
+	 */
+	if (addr < READ_ONCE(vmap_chunks_lo) ||
+	    addr >= READ_ONCE(vmap_chunks_hi))
+		return NULL;
+
+	cur = this_cpu_read(vmap_bump_cur);
+	if (cur && addr >= cur->base && addr < cur->limit)
+		return cur;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(chunk, &vmap_bump_chunks, link) {
+		if (addr >= chunk->base && addr < chunk->limit) {
+			rcu_read_unlock();
+			return chunk;
+		}
+	}
+	rcu_read_unlock();
+	return NULL;
+}
+
+/*
+ * Reserve and bump-allocate via the per-CPU chunk. Returns the
+ * vmap_area pre-populated (va_start, va_end, page_va[] linkage),
+ * or NULL on miss/refill-needed.
+ */
+static struct vmap_area *
 vmap_bump_alloc(unsigned long size, unsigned long align,
-		unsigned long vstart, unsigned long vend)
+		unsigned long vstart, unsigned long vend, gfp_t gfp_mask,
+		int node, unsigned long va_flags)
 {
 	struct vmap_bump_chunk *chunk;
-	spinlock_t *lock;
-	unsigned long aligned, addr = -ENOENT;
+	struct vmap_area *va;
+	unsigned long aligned, idx, n_pages, i;
 
 	if (vstart != VMALLOC_START || vend != VMALLOC_END ||
 	    size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
 	    align > VMAP_BUMP_CHUNK_SIZE / 2)
-		return -EINVAL;
+		return NULL;
 
-	lock = this_cpu_ptr(&vmap_bump_lock);
-	spin_lock(lock);
-	chunk = this_cpu_ptr(&vmap_bump);
-	if (chunk->base) {
-		aligned = ALIGN(chunk->bump, align);
-		if (aligned + size <= chunk->limit) {
-			chunk->bump = aligned + size;
-			addr = aligned;
-		}
+	va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
+	if (unlikely(!va))
+		return NULL;
+
+	/*
+	 * preempt_disable() is sufficient for the per-CPU chunk hot path:
+	 * the chunk pointer is per-CPU and only mutated by the CPU that
+	 * owns it (in vmap_bump_refill). preempt-disable pins us to the
+	 * current CPU and serializes against an in-flight refill on the
+	 * same CPU.
+	 */
+	preempt_disable();
+	chunk = this_cpu_read(vmap_bump_cur);
+	if (!chunk) {
+		preempt_enable();
+		kmem_cache_free(vmap_area_cachep, va);
+		return NULL;
 	}
-	spin_unlock(lock);
-	return addr;
+	aligned = ALIGN(chunk->bump, align);
+	if (aligned + size > chunk->limit) {
+		preempt_enable();
+		kmem_cache_free(vmap_area_cachep, va);
+		return NULL;
+	}
+	chunk->bump = aligned + size;
+	idx = vmap_chunk_page_idx(chunk, aligned);
+	n_pages = size >> PAGE_SHIFT;
+	for (i = 0; i < n_pages; i++)
+		chunk->page_va[idx + i] = va;
+	atomic_add(n_pages, &chunk->alloced);
+	preempt_enable();
+
+	va->va_start = aligned;
+	va->va_end = aligned + size;
+	va->vm = NULL;
+	/*
+	 * Encode the destination vmap_node so the existing per-node pool
+	 * machinery and decode_vn_id() in free_vmap_area_noflush() see a
+	 * valid id. VA_FROM_BUMP_CHUNK marks this VA so reclaim_list_global
+	 * skips occupied_mt_erase_va_locked() — bump VAs were never tracked
+	 * in occupied_vmap_area_mt (the whole chunk range was). The bit
+	 * sits below BITS_PER_BYTE so it does not alias decode_vn_id()'s
+	 * shift, and outside VMAP_FLAGS_MASK so it does not alias VMAP_RAM
+	 * / VMAP_BLOCK.
+	 */
+	va->flags = va_flags | encode_vn_id(addr_to_node_id(aligned)) |
+		    VA_FROM_BUMP_CHUNK;
+	INIT_LIST_HEAD(&va->list);
+	return va;
 }
 
-/* Refill this CPU's bump chunk. Reserves a fresh range from the
- * global allocator. Old chunk's remaining space is leaked (the
- * already-allocated VAs in it stay live; the unused tail is wasted).
+/*
+ * Refill this CPU's bump chunk. Reserves a fresh range from the
+ * global allocator. The old chunk (if any) is moved to the global
+ * vmap_bump_chunks list; it stays alive until its outstanding
+ * allocations drain.
  */
 static int
 vmap_bump_refill(gfp_t gfp_mask)
 {
-	struct vmap_bump_chunk *chunk;
-	spinlock_t *lock;
+	struct vmap_bump_chunk *new_chunk;
 	unsigned long base;
 
+	new_chunk = kvzalloc(sizeof(*new_chunk), gfp_mask);
+	if (!new_chunk)
+		return -ENOMEM;
+
 	preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, NUMA_NO_NODE);
 	base = __alloc_vmap_area(VMAP_BUMP_CHUNK_SIZE, PAGE_SIZE,
 				 VMALLOC_START, VMALLOC_END);
 	spin_unlock(&free_vmap_area_lock);
 
-	if (IS_ERR_VALUE(base))
+	if (IS_ERR_VALUE(base)) {
+		kvfree(new_chunk);
 		return -ENOMEM;
+	}
+
+	new_chunk->base = base;
+	new_chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
+	new_chunk->bump = base;
+	atomic_set(&new_chunk->alloced, 0);
+	INIT_LIST_HEAD(&new_chunk->link);
+
+	spin_lock(&vmap_bump_chunks_lock);
+	list_add_rcu(&new_chunk->link, &vmap_bump_chunks);
+	if (new_chunk->base < vmap_chunks_lo)
+		WRITE_ONCE(vmap_chunks_lo, new_chunk->base);
+	if (new_chunk->limit > vmap_chunks_hi)
+		WRITE_ONCE(vmap_chunks_hi, new_chunk->limit);
+	spin_unlock(&vmap_bump_chunks_lock);
+
+	preempt_disable();
+	this_cpu_write(vmap_bump_cur, new_chunk);
+	preempt_enable();
 
-	lock = this_cpu_ptr(&vmap_bump_lock);
-	spin_lock(lock);
-	chunk = this_cpu_ptr(&vmap_bump);
-	chunk->base = base;
-	chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
-	chunk->bump = base;
-	spin_unlock(lock);
 	return 0;
 }
 
+/*
+ * Drop a chunk-allocated VA. Called from the vfree path when the va
+ * has VA_FROM_BUMP_CHUNK set. Clears the page_va[] linkage and
+ * releases the va struct. If the chunk's outstanding count hits zero
+ * AND the chunk is no longer the per-CPU current chunk, the chunk's
+ * range is returned to the global allocator and the chunk descriptor
+ * is freed.
+ */
+static struct vmap_area *
+vmap_bump_unlink(unsigned long addr)
+{
+	struct vmap_bump_chunk *chunk;
+	struct vmap_area *va;
+	unsigned long idx, n_pages;
+
+	chunk = vmap_chunk_lookup(addr);
+	if (!chunk)
+		return NULL;
+
+	idx = vmap_chunk_page_idx(chunk, addr);
+	if (idx >= VMAP_BUMP_CHUNK_PAGES)
+		return NULL;
+
+	va = chunk->page_va[idx];
+	if (!va || va->va_start != addr)
+		return NULL;
+
+	n_pages = (va->va_end - va->va_start) >> PAGE_SHIFT;
+	memset(&chunk->page_va[idx], 0, n_pages * sizeof(va));
+
+	/*
+	 * If this chunk fully drained AND it is no longer the per-CPU
+	 * current chunk, return its range to the global allocator and
+	 * free the descriptor. We do NOT reset the bump pointer for the
+	 * current chunk: addresses inside the chunk may still have stale
+	 * TLB entries until the next lazy-purge flush, so reusing them
+	 * before the flush is unsafe. Forward-only bump avoids that.
+	 */
+	if (atomic_sub_return(n_pages, &chunk->alloced) == 0 &&
+	    chunk != this_cpu_read(vmap_bump_cur)) {
+		spin_lock(&vmap_bump_chunks_lock);
+		list_del_rcu(&chunk->link);
+		spin_unlock(&vmap_bump_chunks_lock);
+
+		spin_lock(&free_vmap_area_lock);
+		if (occupied_mt_supported())
+			WARN_ON_ONCE(!occupied_mt_erase_range_locked(chunk->base,
+								     chunk->limit));
+		spin_unlock(&free_vmap_area_lock);
+		kvfree_rcu(chunk, rcu);
+	}
+
+	return va;
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend. If vm is passed in, the two will also be bound.
@@ -2589,6 +2772,44 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	allow_block = gfpflags_allow_blocking(gfp_mask);
 	might_sleep_if(allow_block);
 
+	/*
+	 * Per-CPU bump-chunk fast path (Option B + Option G).
+	 *
+	 * Returns a fully-populated va_start/va_end vmap_area struct; the
+	 * chunk's page_va[] array carries the addr->va linkage, so no
+	 * per-node busy.mt insert is needed. find_vmap_area() and
+	 * find_unlink_vmap_area() consult vmap_chunk_lookup() before
+	 * falling back to busy.mt.
+	 */
+	va = vmap_bump_alloc(size, align, vstart, vend, gfp_mask, node,
+			     va_flags);
+	if (!va && vmap_bump_refill(gfp_mask) == 0)
+		va = vmap_bump_alloc(size, align, vstart, vend, gfp_mask, node,
+				     va_flags);
+	if (va) {
+		if (vm) {
+			vm->addr = (void *)va->va_start;
+			vm->size = va_size(va);
+			va->vm = vm;
+		}
+		BUG_ON(!IS_ALIGNED(va->va_start, align));
+		BUG_ON(va->va_start < vstart);
+		BUG_ON(va->va_end > vend);
+
+		ret = kasan_populate_vmalloc(va->va_start, size, gfp_mask);
+		if (ret) {
+			vmap_bump_unlink(va->va_start);
+			kmem_cache_free(vmap_area_cachep, va);
+			if (vm) {
+				vm->addr = NULL;
+				vm->size = 0;
+				vm->requested_size = 0;
+			}
+			return ERR_PTR(ret);
+		}
+		return va;
+	}
+
 	/*
 	 * If a VA is obtained from a global heap(if it fails here)
 	 * it is anyway marked with this "vn_id" so it is returned
@@ -2611,19 +2832,6 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	}
 
 retry:
-	if (IS_ERR_VALUE(addr)) {
-		/*
-		 * Per-CPU bump-allocator fast path. On hit, no global
-		 * tree work runs at all. On miss, refill the chunk and
-		 * try again before falling back to the regular path.
-		 */
-		addr = vmap_bump_alloc(size, align, vstart, vend);
-		if (IS_ERR_VALUE(addr) && (long)addr == -ENOENT) {
-			if (vmap_bump_refill(gfp_mask) == 0)
-				addr = vmap_bump_alloc(size, align,
-						       vstart, vend);
-		}
-	}
 	if (IS_ERR_VALUE(addr)) {
 		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
 		try_init_free_mt_locked();
@@ -2792,12 +3000,20 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
 	list_for_each_entry_safe(va, n, head, list) {
 		list_del_init(&va->list);
 		if (erase_occupied) {
+			/*
+			 * Bump-allocated VAs were never inserted into
+			 * occupied_vmap_area_mt — the chunk's whole range was.
+			 * Skip the per-VA erase to avoid a spurious WARN.
+			 */
+			if (va->flags & VA_FROM_BUMP_CHUNK)
+				goto queue_release;
 			if (WARN_ON_ONCE(!occupied_mt_erase_va_locked(va))) {
 				list_add_tail(&va->list, failed);
 				ok = false;
 				continue;
 			}
 		}
+queue_release:
 		/*
 		 * Occupied-only design: there are no free vmap_area objects
 		 * any more. With the occupied marker erased, the range is
@@ -3179,6 +3395,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 
 struct vmap_area *find_vmap_area(unsigned long addr)
 {
+	struct vmap_bump_chunk *chunk;
 	struct vmap_node *vn;
 	struct vmap_area *va;
 	int i, j;
@@ -3186,6 +3403,22 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 	if (unlikely(!vmap_initialized))
 		return NULL;
 
+	/*
+	 * Bump-chunk fast path: if @addr lives in a per-CPU bump chunk,
+	 * the va is at chunk->page_va[(addr - chunk->base) / PAGE_SIZE].
+	 * No maple-tree descent.
+	 */
+	chunk = vmap_chunk_lookup(addr);
+	if (chunk) {
+		unsigned long idx = vmap_chunk_page_idx(chunk, addr);
+
+		if (idx < VMAP_BUMP_CHUNK_PAGES) {
+			va = chunk->page_va[idx];
+			if (va)
+				return va;
+		}
+	}
+
 	/*
 	 * An addr_to_node_id(addr) converts an address to a node index
 	 * where a VA is located. If VA spans several zones and passed
@@ -3220,6 +3453,15 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 	struct vmap_area *va;
 	int i, j;
 
+	/*
+	 * Bump-chunk fast path: if @addr was allocated from a per-CPU
+	 * chunk, the page_va[] linkage is the only place it lives. No
+	 * busy.mt walk needed.
+	 */
+	va = vmap_bump_unlink(addr);
+	if (va)
+		return va;
+
 	/*
 	 * Check the comment in the find_vmap_area() about the loop.
 	 */
@@ -6319,8 +6561,6 @@ void __init vmalloc_init(void)
 		init_llist_head(&p->list);
 		INIT_WORK(&p->wq, delayed_vfree_work);
 		xa_init(&vbq->vmap_blocks);
-
-		spin_lock_init(&per_cpu(vmap_bump_lock, i));
 	}
 
 	/*

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (10 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
@ 2026-06-13 17:19 ` Pranjal Arya
  2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-13 17:19 UTC (permalink / raw)
  To: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance
  Cc: linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Pranjal Arya,
	Sudeep Holla

Real-hardware testing on a Snapdragon X1E80100 exposed a panic during
boot-time module loading via finit_module -> kernel_read_file ->
vmalloc:

    Internal error: UBSAN: array index out of bounds
    Call trace:
      vmap_bump_alloc
      alloc_vmap_area
      __vmalloc_node_range_noprof
      vmalloc_noprof
      kernel_read_file
      __arm64_sys_finit_module

UBSAN's array-bounds sanitiser triggers on the indexed write loop:

    for (i = 0; i < n_pages; i++)
        chunk->page_va[idx + i] = va;

Harden the bump path:

  - Centralise the eligibility predicate in vmap_bump_eligible() and
    add it to alloc_vmap_area() so vmap_bump_refill() is only called
    for requests the bump path can actually serve. Add PAGE_ALIGNED(size)
    and align > 0 to the predicate (defensive; alloc_vmap_area's
    callers always satisfy these but the explicit check is cheap and
    prevents the trap path from being entered with bad inputs).

  - In vmap_bump_alloc(), use check_add_overflow() for the new bump
    pointer, validate aligned >= chunk->base (defensive against
    metadata corruption), and bounds-check idx and idx + n_pages
    against VMAP_BUMP_CHUNK_PAGES before touching page_va[]. Replace
    the indexed page_va[] store loop with a pointer walk:

        slot = &chunk->page_va[idx];
        for (i = n_pages; i > 0; i--)
                *slot++ = va;

    The pointer-increment form is not subject to the array-bounds
    sanitiser instrumentation that fires on chunk->page_va[idx + i].

  - In vmap_bump_unlink(), validate n_pages > 0 and n_pages <=
    VMAP_BUMP_CHUNK_PAGES - idx before the memset, so a corrupted
    va->va_end cannot drive a write past the end of page_va[].

  - Track the chunk's owner CPU at refill time and compare against
    per_cpu(vmap_bump_cur, owner_cpu) on unlink. The previous
    this_cpu_read(vmap_bump_cur) compared the chunk against the
    *current* CPU's chunk, which is wrong when free runs on a CPU
    other than the chunk owner: it could either retire a chunk that
    is still the owner's current, or skip retirement on a chunk that
    has already been replaced.

No semantic change to the bump-path policy or to the addresses
returned. Builds clean on x86_64 and arm64 (full bzImage / Image).

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 49 insertions(+), 13 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6991054e1cba..03f10b6b815c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2508,6 +2508,7 @@ struct vmap_bump_chunk {
 	unsigned long		limit;
 	unsigned long		bump;
 	atomic_t		alloced;	/* # outstanding pages */
+	int			owner_cpu;
 	struct list_head	link;		/* on vmap_bump_chunks */
 	struct rcu_head		rcu;		/* deferred free */
 	struct vmap_area	*page_va[VMAP_BUMP_CHUNK_PAGES];
@@ -2517,6 +2518,16 @@ static DEFINE_PER_CPU(struct vmap_bump_chunk *, vmap_bump_cur);
 static LIST_HEAD(vmap_bump_chunks);
 static DEFINE_SPINLOCK(vmap_bump_chunks_lock);
 
+static __always_inline bool
+vmap_bump_eligible(unsigned long size, unsigned long align,
+		   unsigned long vstart, unsigned long vend)
+{
+	return vstart == VMALLOC_START && vend == VMALLOC_END &&
+		size > 0 && PAGE_ALIGNED(size) &&
+		size <= VMAP_BUMP_CHUNK_SIZE / 2 &&
+		align > 0 && align <= VMAP_BUMP_CHUNK_SIZE / 2;
+}
+
 /*
  * Coarse [lo, hi) bounds covering every active vmap_bump_chunk's
  * range. vmap_chunk_lookup() rejects out-of-range addresses (e.g.
@@ -2582,11 +2593,10 @@ vmap_bump_alloc(unsigned long size, unsigned long align,
 {
 	struct vmap_bump_chunk *chunk;
 	struct vmap_area *va;
-	unsigned long aligned, idx, n_pages, i;
+	struct vmap_area **slot;
+	unsigned long aligned, new_bump, idx, n_pages, i;
 
-	if (vstart != VMALLOC_START || vend != VMALLOC_END ||
-	    size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
-	    align > VMAP_BUMP_CHUNK_SIZE / 2)
+	if (!vmap_bump_eligible(size, align, vstart, vend))
 		return NULL;
 
 	va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
@@ -2607,22 +2617,34 @@ vmap_bump_alloc(unsigned long size, unsigned long align,
 		kmem_cache_free(vmap_area_cachep, va);
 		return NULL;
 	}
+
 	aligned = ALIGN(chunk->bump, align);
-	if (aligned + size > chunk->limit) {
+	if (aligned < chunk->base ||
+	    check_add_overflow(aligned, size, &new_bump) ||
+	    new_bump > chunk->limit) {
 		preempt_enable();
 		kmem_cache_free(vmap_area_cachep, va);
 		return NULL;
 	}
-	chunk->bump = aligned + size;
+
 	idx = vmap_chunk_page_idx(chunk, aligned);
 	n_pages = size >> PAGE_SHIFT;
-	for (i = 0; i < n_pages; i++)
-		chunk->page_va[idx + i] = va;
+	if (unlikely(idx >= VMAP_BUMP_CHUNK_PAGES ||
+		     n_pages > VMAP_BUMP_CHUNK_PAGES - idx)) {
+		preempt_enable();
+		kmem_cache_free(vmap_area_cachep, va);
+		return NULL;
+	}
+
+	chunk->bump = new_bump;
+	slot = &chunk->page_va[idx];
+	for (i = n_pages; i > 0; i--)
+		*slot++ = va;
 	atomic_add(n_pages, &chunk->alloced);
 	preempt_enable();
 
 	va->va_start = aligned;
-	va->va_end = aligned + size;
+	va->va_end = new_bump;
 	va->vm = NULL;
 	/*
 	 * Encode the destination vmap_node so the existing per-node pool
@@ -2651,6 +2673,7 @@ vmap_bump_refill(gfp_t gfp_mask)
 {
 	struct vmap_bump_chunk *new_chunk;
 	unsigned long base;
+	int cpu;
 
 	new_chunk = kvzalloc(sizeof(*new_chunk), gfp_mask);
 	if (!new_chunk)
@@ -2670,6 +2693,7 @@ vmap_bump_refill(gfp_t gfp_mask)
 	new_chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
 	new_chunk->bump = base;
 	atomic_set(&new_chunk->alloced, 0);
+	new_chunk->owner_cpu = -1;
 	INIT_LIST_HEAD(&new_chunk->link);
 
 	spin_lock(&vmap_bump_chunks_lock);
@@ -2681,6 +2705,8 @@ vmap_bump_refill(gfp_t gfp_mask)
 	spin_unlock(&vmap_bump_chunks_lock);
 
 	preempt_disable();
+	cpu = smp_processor_id();
+	new_chunk->owner_cpu = cpu;
 	this_cpu_write(vmap_bump_cur, new_chunk);
 	preempt_enable();
 
@@ -2699,6 +2725,7 @@ static struct vmap_area *
 vmap_bump_unlink(unsigned long addr)
 {
 	struct vmap_bump_chunk *chunk;
+	struct vmap_bump_chunk *owner_cur;
 	struct vmap_area *va;
 	unsigned long idx, n_pages;
 
@@ -2715,6 +2742,8 @@ vmap_bump_unlink(unsigned long addr)
 		return NULL;
 
 	n_pages = (va->va_end - va->va_start) >> PAGE_SHIFT;
+	if (unlikely(!n_pages || n_pages > VMAP_BUMP_CHUNK_PAGES - idx))
+		return NULL;
 	memset(&chunk->page_va[idx], 0, n_pages * sizeof(va));
 
 	/*
@@ -2725,8 +2754,12 @@ vmap_bump_unlink(unsigned long addr)
 	 * TLB entries until the next lazy-purge flush, so reusing them
 	 * before the flush is unsafe. Forward-only bump avoids that.
 	 */
+	if (unlikely(chunk->owner_cpu < 0 || chunk->owner_cpu >= nr_cpu_ids))
+		return va;
+
+	owner_cur = READ_ONCE(per_cpu(vmap_bump_cur, chunk->owner_cpu));
 	if (atomic_sub_return(n_pages, &chunk->alloced) == 0 &&
-	    chunk != this_cpu_read(vmap_bump_cur)) {
+	    chunk != owner_cur) {
 		spin_lock(&vmap_bump_chunks_lock);
 		list_del_rcu(&chunk->link);
 		spin_unlock(&vmap_bump_chunks_lock);
@@ -2781,11 +2814,14 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	 * find_unlink_vmap_area() consult vmap_chunk_lookup() before
 	 * falling back to busy.mt.
 	 */
-	va = vmap_bump_alloc(size, align, vstart, vend, gfp_mask, node,
-			     va_flags);
-	if (!va && vmap_bump_refill(gfp_mask) == 0)
+	va = NULL;
+	if (vmap_bump_eligible(size, align, vstart, vend)) {
 		va = vmap_bump_alloc(size, align, vstart, vend, gfp_mask, node,
 				     va_flags);
+		if (!va && vmap_bump_refill(gfp_mask) == 0)
+			va = vmap_bump_alloc(size, align, vstart, vend, gfp_mask,
+					     node, va_flags);
+	}
 	if (va) {
 		if (vm) {
 			vm->addr = (void *)va->va_start;

-- 
2.34.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (11 preceding siblings ...)
  2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
@ 2026-06-13 23:15 ` Matthew Wilcox
  2026-06-15  9:52   ` Uladzislau Rezki
                     ` (2 more replies)
  2026-06-14  6:35 ` [syzbot ci] " syzbot ci
  2026-06-14  6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki
  14 siblings, 3 replies; 26+ messages in thread
From: Matthew Wilcox @ 2026-06-13 23:15 UTC (permalink / raw)
  To: Pranjal Arya
  Cc: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance, linux-arm-msm, linux-mm, linux-kernel,
	maple-tree, Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
> vmalloc's free/busy/lazy area tracking is one of the last remaining
> augmented-rb_tree consumers in the core mm allocators. The rest of
> mm/ has been gradually consolidating range-keyed indexing around
> maple_tree (notably the per-process VMA tree in mm/mmap.c), and
> the underlying reason is a structural mismatch between rb_tree and
> range tracking:

First, and most importantly, I love this.  The maple tree is undoubtedly
the right data structure to use for this purpose.

What I don't understand is why you maintain a separate "free" tree.
It should not be necessary any more, but maybe you tried removing it
already and found a performance problem?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
@ 2026-06-15  9:52   ` Uladzislau Rezki
  2026-06-16 18:07     ` Matthew Wilcox
  2026-06-26 10:18     ` Pranjal Arya
  2026-06-25 21:28   ` Pranjal Arya
  2026-06-26  9:43   ` Pranjal Arya
  2 siblings, 2 replies; 26+ messages in thread
From: Uladzislau Rezki @ 2026-06-15  9:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pranjal Arya, Andrew Morton, Uladzislau Rezki, Liam R. Howlett,
	Alice Ryhl, Andrew Ballance, linux-arm-msm, linux-mm,
	linux-kernel, maple-tree, Lorenzo Stoakes, Pranjal Shrivastava,
	Will Deacon, Suzuki K Poulose, Neil Armstrong, Mostafa Saleh,
	Balbir Singh, Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On Sun, Jun 14, 2026 at 12:15:28AM +0100, Matthew Wilcox wrote:
> On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
> > vmalloc's free/busy/lazy area tracking is one of the last remaining
> > augmented-rb_tree consumers in the core mm allocators. The rest of
> > mm/ has been gradually consolidating range-keyed indexing around
> > maple_tree (notably the per-process VMA tree in mm/mmap.c), and
> > the underlying reason is a structural mismatch between rb_tree and
> > range tracking:
> 
> First, and most importantly, I love this.  The maple tree is undoubtedly
> the right data structure to use for this purpose.
> 
> What I don't understand is why you maintain a separate "free" tree.
> It should not be necessary any more, but maybe you tried removing it
> already and found a performance problem?
>
We maintain it in order to split several entities. That prevents
interfering between allocated data and vmap-free-space manager.
So in that case one context can easily access allocated data, for
example vread iterator, etc., whereas another can do an allocation.

So by splitting parts i minimize lock-contention.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-15  9:52   ` Uladzislau Rezki
@ 2026-06-16 18:07     ` Matthew Wilcox
  2026-06-18 10:06       ` Uladzislau Rezki
  2026-06-26 10:34       ` Pranjal Arya
  2026-06-26 10:18     ` Pranjal Arya
  1 sibling, 2 replies; 26+ messages in thread
From: Matthew Wilcox @ 2026-06-16 18:07 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Pranjal Arya, Andrew Morton, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance, linux-arm-msm, linux-mm, linux-kernel,
	maple-tree, Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On Mon, Jun 15, 2026 at 11:52:22AM +0200, Uladzislau Rezki wrote:
> On Sun, Jun 14, 2026 at 12:15:28AM +0100, Matthew Wilcox wrote:
> > What I don't understand is why you maintain a separate "free" tree.
> > It should not be necessary any more, but maybe you tried removing it
> > already and found a performance problem?
>
> We maintain it in order to split several entities. That prevents
> interfering between allocated data and vmap-free-space manager.
> So in that case one context can easily access allocated data, for
> example vread iterator, etc., whereas another can do an allocation.
> 
> So by splitting parts i minimize lock-contention.

Sure, but there are many ways to reduce lock contention.  One is to not
take locks at all; the maple tree is RCU-safe, so you can read the tree
holding only the RCU read lock, as long as you obey the RCU rules.

Specifically:
 - Write side has to RCU-free the objects that are stored in the tree
 - Read side has to trylock the objects it finds (and retry the walk
   if the trylock fails)
 - Read side can see a mixture of objects if the tree is changed while
   it is reading, but for any given index in the tree it is guaranteed
   to see one of the objects which has been referred to by that index.
   That is, if the write side overwrites an index that referred to
   object A with object B, the reader will see either object A or B.
   It will not see NULL and it will not see any other object.
 - If the write side stores both object C and object D in the tree,
   the read side may see neither, both, only C or only D.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-16 18:07     ` Matthew Wilcox
@ 2026-06-18 10:06       ` Uladzislau Rezki
  2026-06-26 10:47         ` Pranjal Arya
  2026-06-26 10:34       ` Pranjal Arya
  1 sibling, 1 reply; 26+ messages in thread
From: Uladzislau Rezki @ 2026-06-18 10:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Uladzislau Rezki, Pranjal Arya, Andrew Morton, Liam R. Howlett,
	Alice Ryhl, Andrew Ballance, linux-arm-msm, linux-mm,
	linux-kernel, maple-tree, Lorenzo Stoakes, Pranjal Shrivastava,
	Will Deacon, Suzuki K Poulose, Neil Armstrong, Mostafa Saleh,
	Balbir Singh, Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On Tue, Jun 16, 2026 at 07:07:57PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 15, 2026 at 11:52:22AM +0200, Uladzislau Rezki wrote:
> > On Sun, Jun 14, 2026 at 12:15:28AM +0100, Matthew Wilcox wrote:
> > > What I don't understand is why you maintain a separate "free" tree.
> > > It should not be necessary any more, but maybe you tried removing it
> > > already and found a performance problem?
> >
> > We maintain it in order to split several entities. That prevents
> > interfering between allocated data and vmap-free-space manager.
> > So in that case one context can easily access allocated data, for
> > example vread iterator, etc., whereas another can do an allocation.
> > 
> > So by splitting parts i minimize lock-contention.
> 
> Sure, but there are many ways to reduce lock contention.  One is to not
> take locks at all; the maple tree is RCU-safe, so you can read the tree
> holding only the RCU read lock, as long as you obey the RCU rules.
> 
> Specifically:
>  - Write side has to RCU-free the objects that are stored in the tree
>  - Read side has to trylock the objects it finds (and retry the walk
>    if the trylock fails)
>  - Read side can see a mixture of objects if the tree is changed while
>    it is reading, but for any given index in the tree it is guaranteed
>    to see one of the objects which has been referred to by that index.
>    That is, if the write side overwrites an index that referred to
>    object A with object B, the reader will see either object A or B.
>    It will not see NULL and it will not see any other object.
>  - If the write side stores both object C and object D in the tree,
>    the read side may see neither, both, only C or only D.
>
Some thoughts about it.

Having the tree which is RCU safe is good for sure. We can benefit from
at least in the: vmallocinfo scanning/dumping, possibly in the vread_iter()
when access to /proc/kcore and other places(which i need to check carefully).
But this is for read-only traversal.

Switching to gap-based approach requires quite a bit of refactoring and it
should be a full switch without any hybrid schemes or mixes. I expect that
we remove more code then adding because of some parts will become hidden
like lookups/reserving range/erase, etc which is good.

- replacing free_vmap_area to maple-tree gap based approach;
- rewriting pcpu-allocator which lives in the end of vmalloc space;
- refactoring per-cpu allocator which is also part of vmalloc space;
- vread iterator;
- vmalloc dump path;
- vmap_node logic(use gap-reserve to minimize contention);
- and more...

To me such rewrite makes sense if we end up in something structural not
just because maple tree exists. The criteria i would go with are: at least
same performance level, remove more then add, the design stays at least in
same good shape.

There are some drawback i am thinking of. One of them is maple insert path,
mas_store_gfp()? First we need to find an empty area, then set-range and do
mas_store_gfp() that uses gfp flag for its internal allocation. If we are
under spin-lock sleeping is not possible, using NOWAIT or ATOMIC is not a
case thus we should somehow pre-allocate outside the lock and store the range
without any allocation.

The allocator operation:
 - finds an empty range;
 - publishes VA that blocks that range.

those two have to be serialized among other writes. Otherwise two CPUs can use
same empty range and both try to reserve them. If preallocate outside the lock,
the "alloc" side has to validate that a selected range is still empty and only
then store VA to block the range.

I think it is worth to prototype something to see how it would go. I may be
missing something for sure.

Thank you for your input!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-18 10:06       ` Uladzislau Rezki
@ 2026-06-26 10:47         ` Pranjal Arya
  2026-06-26 14:15           ` Matthew Wilcox
  0 siblings, 1 reply; 26+ messages in thread
From: Pranjal Arya @ 2026-06-26 10:47 UTC (permalink / raw)
  To: Uladzislau Rezki, Matthew Wilcox
  Cc: Andrew Morton, Liam R. Howlett, Alice Ryhl, Andrew Ballance,
	linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla



On 6/18/2026 3:36 PM, Uladzislau Rezki wrote:
> On Tue, Jun 16, 2026 at 07:07:57PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 15, 2026 at 11:52:22AM +0200, Uladzislau Rezki wrote:
>>> On Sun, Jun 14, 2026 at 12:15:28AM +0100, Matthew Wilcox wrote:
>>>> What I don't understand is why you maintain a separate "free" tree.
>>>> It should not be necessary any more, but maybe you tried removing it
>>>> already and found a performance problem?
>>>
>>> We maintain it in order to split several entities. That prevents
>>> interfering between allocated data and vmap-free-space manager.
>>> So in that case one context can easily access allocated data, for
>>> example vread iterator, etc., whereas another can do an allocation.
>>>
>>> So by splitting parts i minimize lock-contention.
>>
>> Sure, but there are many ways to reduce lock contention.  One is to not
>> take locks at all; the maple tree is RCU-safe, so you can read the tree
>> holding only the RCU read lock, as long as you obey the RCU rules.
>>
>> Specifically:
>>  - Write side has to RCU-free the objects that are stored in the tree
>>  - Read side has to trylock the objects it finds (and retry the walk
>>    if the trylock fails)
>>  - Read side can see a mixture of objects if the tree is changed while
>>    it is reading, but for any given index in the tree it is guaranteed
>>    to see one of the objects which has been referred to by that index.
>>    That is, if the write side overwrites an index that referred to
>>    object A with object B, the reader will see either object A or B.
>>    It will not see NULL and it will not see any other object.
>>  - If the write side stores both object C and object D in the tree,
>>    the read side may see neither, both, only C or only D.
>>
> Some thoughts about it.
> 
> Having the tree which is RCU safe is good for sure. We can benefit from
> at least in the: vmallocinfo scanning/dumping, possibly in the vread_iter()
> when access to /proc/kcore and other places(which i need to check carefully).
> But this is for read-only traversal.
> 

I agree. The RCU safe busy tree is a foundation that enables lockless
read only traversal. Will implement it in next patchset.

> Switching to gap-based approach requires quite a bit of refactoring and it
> should be a full switch without any hybrid schemes or mixes. I expect that
> we remove more code then adding because of some parts will become hidden
> like lookups/reserving range/erase, etc which is good.
> 
> - replacing free_vmap_area to maple-tree gap based approach;
> - rewriting pcpu-allocator which lives in the end of vmalloc space;
> - refactoring per-cpu allocator which is also part of vmalloc space;
> - vread iterator;
> - vmalloc dump path;
> - vmap_node logic(use gap-reserve to minimize contention);
> - and more...
> 
> To me such rewrite makes sense if we end up in something structural not
> just because maple tree exists. The criteria i would go with are: at least
> same performance level, remove more then add, the design stays at least in
> same good shape.
> 
> There are some drawback i am thinking of. One of them is maple insert path,
> mas_store_gfp()? First we need to find an empty area, then set-range and do
> mas_store_gfp() that uses gfp flag for its internal allocation. If we are
> under spin-lock sleeping is not possible, using NOWAIT or ATOMIC is not a
> case thus we should somehow pre-allocate outside the lock and store the range
> without any allocation.
> 

I am planning to have following approach on this:

  1. mas_preallocate(GFP_NOWAIT | __GFP_NOWARN) + mas_store_prealloc
     will be fast path. The preallocate attempt will be non sleeping and,
     if it succeeds, the subsequent store won't require allocation.

  2. mas_store_gfp(GFP_ATOMIC | __GFP_NOWARN) fallback: if
     preallocate fails (rare but possible under memory pressure),
     GFP_ATOMIC will make a non sleeping allocation attempt inline.

  3. vmap_retry_list recovery queue: if both above fail, the VA will be
     added to a non indexed retry list.  The allocator will scan the list
     on subsequent calls, and purge workers will drain it. This will avoid
     any leak or panic under sustained slab pressure.

Neither GFP_NOWAIT nor GFP_ATOMIC can sleep under a spinlock. The retry
will queue provide the correctness backstop to avoid GFP_KERNEL
blocking inside the lock.

> The allocator operation:
>  - finds an empty range;
>  - publishes VA that blocks that range.
> 
> those two have to be serialized among other writes. Otherwise two CPUs can use
> same empty range and both try to reserve them. If preallocate outside the lock,
> the "alloc" side has to validate that a selected range is still empty and only
> then store VA to block the range.
> 

Sure. Both will be called under a lock. Holding both under the same lock
is the better serialization approach. No two concurrent allocators will
observe the same gap and both succeed. In the next patchset, I'll
explicitly add a comment on the lock declaration describing this approach
in commit message.

> I think it is worth to prototype something to see how it would go. I may be
> missing something for sure.
> 
> Thank you for your input!
> 
> --
> Uladzislau Rezki

Thank you for the detailed design questions. This will make upcoming
patchset substantially cleaner than the RFC.

BR,
Pranjal

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-26 10:47         ` Pranjal Arya
@ 2026-06-26 14:15           ` Matthew Wilcox
  0 siblings, 0 replies; 26+ messages in thread
From: Matthew Wilcox @ 2026-06-26 14:15 UTC (permalink / raw)
  To: Pranjal Arya
  Cc: Uladzislau Rezki, Andrew Morton, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance, linux-arm-msm, linux-mm, linux-kernel,
	maple-tree, Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On Fri, Jun 26, 2026 at 04:17:08PM +0530, Pranjal Arya wrote:
> I am planning to have following approach on this:
> 
>   1. mas_preallocate(GFP_NOWAIT | __GFP_NOWARN) + mas_store_prealloc
>      will be fast path. The preallocate attempt will be non sleeping and,
>      if it succeeds, the subsequent store won't require allocation.
> 
>   2. mas_store_gfp(GFP_ATOMIC | __GFP_NOWARN) fallback: if
>      preallocate fails (rare but possible under memory pressure),
>      GFP_ATOMIC will make a non sleeping allocation attempt inline.

This isn't how you're supposed to use the maple tree.  The intent is
that it's protected by its internal spinlock, and so it can drop the
lock to allocate memory if it needs to in order to balance the tree.

If the user needs to also use that spinlock to protect other things,
that's allowed.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-16 18:07     ` Matthew Wilcox
  2026-06-18 10:06       ` Uladzislau Rezki
@ 2026-06-26 10:34       ` Pranjal Arya
  1 sibling, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-26 10:34 UTC (permalink / raw)
  To: Matthew Wilcox, Uladzislau Rezki
  Cc: Andrew Morton, Liam R. Howlett, Alice Ryhl, Andrew Ballance,
	linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla



On 6/16/2026 11:37 PM, Matthew Wilcox wrote:
> On Mon, Jun 15, 2026 at 11:52:22AM +0200, Uladzislau Rezki wrote:
>> On Sun, Jun 14, 2026 at 12:15:28AM +0100, Matthew Wilcox wrote:
>>> What I don't understand is why you maintain a separate "free" tree.
>>> It should not be necessary any more, but maybe you tried removing it
>>> already and found a performance problem?
>>
>> We maintain it in order to split several entities. That prevents
>> interfering between allocated data and vmap-free-space manager.
>> So in that case one context can easily access allocated data, for
>> example vread iterator, etc., whereas another can do an allocation.
>>
>> So by splitting parts i minimize lock-contention.
> 
> Sure, but there are many ways to reduce lock contention.  One is to not
> take locks at all; the maple tree is RCU-safe, so you can read the tree
> holding only the RCU read lock, as long as you obey the RCU rules.
> 
> Specifically:
>  - Write side has to RCU-free the objects that are stored in the tree
>  - Read side has to trylock the objects it finds (and retry the walk
>    if the trylock fails)
>  - Read side can see a mixture of objects if the tree is changed while
>    it is reading, but for any given index in the tree it is guaranteed
>    to see one of the objects which has been referred to by that index.
>    That is, if the write side overwrites an index that referred to
>    object A with object B, the reader will see either object A or B.
>    It will not see NULL and it will not see any other object.
>  - If the write side stores both object C and object D in the tree,
>    the read side may see neither, both, only C or only D.

Acknowledged & I'll implement this approach for find_vmap_area():

Write side:
  Every vmap_area that was previously freed with kmem_cache_free() will
  now be freed via call_rcu(&va->rcu, vmap_area_free_rcu).

Read side:
  find_vmap_area() will walk vn->busy.mt under rcu_read_lock()
  using mas_walk() (RCU safe tree traversal) & will then validate
  the found object using READ_ONCE before returning it.

The maple tree guarantees that for any single index the reader will see
either the old or the new object, not NULL. To ensure this, I'll set
MT_FLAGS_USE_RCU on vn->busy.mt in the mt_init_flags() call.

With these changes, find_vmap_area() which is called on every vfree(),
vmalloc_to_page(), vmalloc_to_pfn() will no longer acquire any
spinlock on the lookup path.

Does it sound like a good plan ?

BR,
Pranjal

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-15  9:52   ` Uladzislau Rezki
  2026-06-16 18:07     ` Matthew Wilcox
@ 2026-06-26 10:18     ` Pranjal Arya
  1 sibling, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-26 10:18 UTC (permalink / raw)
  To: Uladzislau Rezki, Matthew Wilcox
  Cc: Andrew Morton, Liam R. Howlett, Alice Ryhl, Andrew Ballance,
	linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On 6/15/2026 3:22 PM, Uladzislau Rezki wrote:
> On Sun, Jun 14, 2026 at 12:15:28AM +0100, Matthew Wilcox wrote:
>> On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
>>> vmalloc's free/busy/lazy area tracking is one of the last remaining
>>> augmented-rb_tree consumers in the core mm allocators. The rest of
>>> mm/ has been gradually consolidating range-keyed indexing around
>>> maple_tree (notably the per-process VMA tree in mm/mmap.c), and
>>> the underlying reason is a structural mismatch between rb_tree and
>>> range tracking:
>>
>> First, and most importantly, I love this.  The maple tree is undoubtedly
>> the right data structure to use for this purpose.
>>
>> What I don't understand is why you maintain a separate "free" tree.
>> It should not be necessary any more, but maybe you tried removing it
>> already and found a performance problem?
>>
> We maintain it in order to split several entities. That prevents
> interfering between allocated data and vmap-free-space manager.
> So in that case one context can easily access allocated data, for
> example vread iterator, etc., whereas another can do an allocation.
> 
> So by splitting parts i minimize lock-contention.
> 
> --
> Uladzislau Rezki

The lock contention concern makes sense, and in the next patch I will
incorporate the suggested changes.

1. find_vmap_area() will be converted to a fully lockless RCU walk.
   With MT_FLAGS_USE_RCU set on vn->busy.mt and call_rcu-deferred
   vmap_area free, callers such as vread_iter and vmalloc_info_show will
   traverse the occupied tree without taking vn->busy.lock.  This will
   eliminate the contention.

2. The global allocator lock will serialise only the gap find & range publish
   pair on the occupied tree. Per node busy lock will be a separate, finer grained
   lock will only need to be held for the busy tree insert after an address
   gets selected.

3. As Matthew noted in his follow up, the maple tree's own RCU safe node
   retirement will need readers holding only rcu_read_lock() giving a consistent
   view of the tree at every index. So the contention reduction goal will be
   met without a separate free index.

BR,
Pranjal

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
  2026-06-15  9:52   ` Uladzislau Rezki
@ 2026-06-25 21:28   ` Pranjal Arya
  2026-06-26  9:43   ` Pranjal Arya
  2 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-25 21:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance, linux-arm-msm, linux-mm, linux-kernel,
	maple-tree, Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On 6/14/2026 4:45 AM, Matthew Wilcox wrote:
> On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
>> vmalloc's free/busy/lazy area tracking is one of the last remaining
>> augmented-rb_tree consumers in the core mm allocators. The rest of
>> mm/ has been gradually consolidating range-keyed indexing around
>> maple_tree (notably the per-process VMA tree in mm/mmap.c), and
>> the underlying reason is a structural mismatch between rb_tree and
>> range tracking:
> 
> First, and most importantly, I love this.  The maple tree is undoubtedly
> the right data structure to use for this purpose.
> 
> What I don't understand is why you maintain a separate "free" tree.
> It should not be necessary any more, but maybe you tried removing it
> already and found a performance problem?

Thank you Matthew, that is very encouraging to hear :).  The maple 
tree's native range primitives and built in RCU support do make it a 
much cleaner fit for vmalloc than the augmented rb tree was.

You are right, and I'm going to remove free tree next patch. Allocation 
will use occupied maple tree which will walk the gap space of the
occupied index directly.

The concern that motivated keeping the separate index in the RFC was
lock contention separation. Uladzislau had originally structured things 
so that the alloc path could touch the free tree while unrelated readers 
could access the occupied tree without blocking each other. Your follow 
up explanation of how the maple tree's RCU contract makes a separate 
free index unnecessary is the key insight with MT_FLAGS_USE_RCU set on 
the occupied tree and call rcu deferred vmap_area free, the same 
contention reduction is achievable without maintaining a second index.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
  2026-06-15  9:52   ` Uladzislau Rezki
  2026-06-25 21:28   ` Pranjal Arya
@ 2026-06-26  9:43   ` Pranjal Arya
  2 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-26  9:43 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance, linux-arm-msm, linux-mm, linux-kernel,
	maple-tree, Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On 6/14/2026 4:45 AM, Matthew Wilcox wrote:
> On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
>> vmalloc's free/busy/lazy area tracking is one of the last remaining
>> augmented-rb_tree consumers in the core mm allocators. The rest of
>> mm/ has been gradually consolidating range-keyed indexing around
>> maple_tree (notably the per-process VMA tree in mm/mmap.c), and
>> the underlying reason is a structural mismatch between rb_tree and
>> range tracking:
> 
> First, and most importantly, I love this.  The maple tree is undoubtedly
> the right data structure to use for this purpose.
> 
> What I don't understand is why you maintain a separate "free" tree.
> It should not be necessary any more, but maybe you tried removing it
> already and found a performance problem?

Thank you Matthew, that is very encouraging to hear :).  The maple tree's
native range primitives and built in RCU support do make it a much
cleaner fit for vmalloc than the augmented rb tree was.

You are right, and I'm going to remove free tree next patch. Allocation will
use occupied maple tree which will walk the gap space of the
occupied index directly.

The concern that motivated keeping the separate index in the RFC was
lock contention separation. Uladzislau had originally structured things so
that the alloc path could touch the free tree while unrelated readers could
access the occupied tree without blocking each other. Your follow up explanation
of how the maple tree's RCU contract makes a separate free index unnecessary
is the key insight with MT_FLAGS_USE_RCU set on the occupied tree and
call rcu deferred vmap_area free, the same contention reduction is achievable
without maintaining a second index.

BR,
Pranjal

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [syzbot ci] Re: mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (12 preceding siblings ...)
  2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
@ 2026-06-14  6:35 ` syzbot ci
  2026-06-14  6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki
  14 siblings, 0 replies; 26+ messages in thread
From: syzbot ci @ 2026-06-14  6:35 UTC (permalink / raw)
  To: akpm, aliceryhl, andrewjballance, balbirs, dev.jain, dvyukov,
	elver, glider, jackmanb, liam, linux-arm-msm, linux-kernel,
	linux-mm, ljs, maple-tree, neil.armstrong, praan, pranjal.arya,
	puranjay, santosh.shukla, shuah, smostafa, sudeep.holla, surenb,
	suzuki.poulose, urezki, will, wkarny
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
https://lore.kernel.org/all/20260613-vmalloc_maple-v1-0-0aa740bb944b@oss.qualcomm.com
* [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area
* [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree
* [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths to maple_tree
* [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area
* [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure
* [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths
* [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path
* [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head
* [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase
* [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
* [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject
* [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds

and found the following issue:
WARNING in alloc_vmap_area

Full report is available here:
https://ci.syzbot.org/series/45e14120-bbb8-4a0c-b5c6-0aaf3b421101

***

WARNING in alloc_vmap_area

tree:      linux-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next
base:      c425609d6ac4012c8bbf01ec2e10e801b1923a7b
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/481f4f4b-8d55-4e3c-b96c-1de7b763686d/config
syz repro: https://ci.syzbot.org/findings/ae02a227-b4c3-476a-826b-3bbf18bdbfe4/syz_repro

------------[ cut here ]------------
(*left)->va_end > start
WARNING: mm/vmalloc.c:1371 at find_vmap_area_insert_neighbors_mt_locked mm/vmalloc.c:1371 [inline], CPU#0: syz.0.17/5834
WARNING: mm/vmalloc.c:1371 at validate_vmap_area_range_insert_mt_locked mm/vmalloc.c:1388 [inline], CPU#0: syz.0.17/5834
WARNING: mm/vmalloc.c:1371 at insert_vmap_area_busy_locked mm/vmalloc.c:1480 [inline], CPU#0: syz.0.17/5834
WARNING: mm/vmalloc.c:1371 at alloc_vmap_area+0x5458/0x87f0 mm/vmalloc.c:2917, CPU#0: syz.0.17/5834
Modules linked in:
CPU: 0 UID: 0 PID: 5834 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:find_vmap_area_insert_neighbors_mt_locked mm/vmalloc.c:1371 [inline]
RIP: 0010:validate_vmap_area_range_insert_mt_locked mm/vmalloc.c:1388 [inline]
RIP: 0010:insert_vmap_area_busy_locked mm/vmalloc.c:1480 [inline]
RIP: 0010:alloc_vmap_area+0x5458/0x87f0 mm/vmalloc.c:2917
Code: a5 ff e9 fc 0f 00 00 e8 e6 80 a5 ff 4c 8b 74 24 18 e9 ed 17 00 00 e8 d7 80 a5 ff 90 0f 0b 90 e9 34 e5 ff ff e8 c9 80 a5 ff 90 <0f> 0b 90 4c 8b 74 24 18 eb 09 e8 b9 80 a5 ff 90 0f 0b 90 49 bf 00
RSP: 0018:ffffc90031ade360 EFLAGS: 00010293
RAX: ffffffff82200aa7 RBX: ffffffffa0000000 RCX: ffff888109e2bb80
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90031adf4f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000100 R11: 0000000000000003 R12: ffff88810007d888
R13: ffffffffa0002000 R14: ffffffffa0401000 R15: ffffc90031adeb80
FS:  00007f97268b66c0(0000) GS:ffff88818e8af000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9727132780 CR3: 000000010cd9e000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __get_vm_area_node+0x1f8/0x300 mm/vmalloc.c:4214
 __vmalloc_node_range_noprof+0x36a/0x1750 mm/vmalloc.c:5030
 execmem_vmalloc+0xa7/0x1d0 mm/execmem.c:41
 bpf_dispatcher_change_prog+0x25d/0xd70 kernel/bpf/dispatcher.c:151
 bpf_prog_test_run_xdp+0x794/0x1160 net/bpf/test_run.c:1464
 bpf_prog_test_run+0x2cd/0x340 kernel/bpf/syscall.c:4859
 __sys_bpf+0xa20/0xd90 kernel/bpf/syscall.c:6436
 __do_sys_bpf kernel/bpf/syscall.c:6537 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:6534 [inline]
 __x64_sys_bpf+0xba/0xd0 kernel/bpf/syscall.c:6534
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f972725ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f97268b6028 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007f97274d5fa0 RCX: 00007f972725ce59
RDX: 0000000000000050 RSI: 0000200000000b80 RDI: 000000000000000a
RBP: 00007f97272f2d6f R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f97274d6038 R14: 00007f97274d5fa0 R15: 00007ffd7848c2a8
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
                   ` (13 preceding siblings ...)
  2026-06-14  6:35 ` [syzbot ci] " syzbot ci
@ 2026-06-14  6:58 ` Uladzislau Rezki
  2026-06-26 11:00   ` Pranjal Arya
  14 siblings, 1 reply; 26+ messages in thread
From: Uladzislau Rezki @ 2026-06-14  6:58 UTC (permalink / raw)
  To: Pranjal Arya
  Cc: Andrew Morton, Uladzislau Rezki, Liam R. Howlett, Alice Ryhl,
	Andrew Ballance, linux-arm-msm, linux-mm, linux-kernel,
	maple-tree, Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla

On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
> vmalloc's free/busy/lazy area tracking is one of the last remaining
> augmented-rb_tree consumers in the core mm allocators. The rest of
> mm/ has been gradually consolidating range-keyed indexing around
> maple_tree (notably the per-process VMA tree in mm/mmap.c), and
> the underlying reason is a structural mismatch between rb_tree and
> range tracking:
> 
> - rb_tree is a binary tree with a single entry per node. A lookup
>   walks log2(N) nodes via a pointer chase that touches log2(N)
>   cache lines, with one comparison per node. Range queries are not
>   native to rb_tree; vmalloc has historically maintained them by
>   augmenting every node with a subtree_max_size field whose value
>   has to be recomputed on every insert, erase, and rotation via a
>   custom callback. That callback machinery is vmalloc-specific
>   code that exists only to coax range semantics out of a
>   binary-keyed tree.
> - maple_tree stores up to 16 ranges per node (fanout f=16), so a
>   lookup walks ~log16(N) nodes in tightly-packed pivot/slot arrays
>   that are far more cache-friendly. Range queries are first-class
>   (mas_empty_area, mas_find, mas_erase), with no augmentation
>   callback to maintain. RCU traversal and sentinel range storage
>   (XA_ZERO_ENTRY) are part of the data structure's contract, not
>   bolted on by the consumer.
> 
> For the vmalloc allocator's hot paths this means shallower walks
> under the same N, fewer cache lines touched per lookup, and no
> custom augmented-callback machinery to maintain. Aligning vmalloc
> with the same range-tree direction the rest of mm/ has taken
> collapses the augmented gap walker to a single mas_empty_area()
> descent, retires the augmented rb_node from struct vmap_area
> (-24 bytes per object on 64-bit), and exposes the range,
> sentinel, and RCU primitives needed for a per-CPU range cache
> that the augmented rb_tree could not host cleanly.
> 
> This series completes the vmalloc internal migration from rb-tree
> indexed tracking to maple-tree indexed tracking for free, busy, and
> lazy vmap_area range management.
> 
> The series removes vmalloc's internal rb-tree dependencies and moves
> address indexing and range lookups/scans to maple-tree-backed paths,
> while keeping ordered list neighbour traversal where coalescing
> semantics require stable predecessor/successor behaviour.
> 
> In addition to the direct rb -> maple migration, the series includes
> robustness and scalability refinements in the same code path:
> 
> - deferred/lazy maple bring-up to avoid early-boot allocator hazards
> - maple-assisted ordered-list insertion for busy/lazy tracking
> - mas_preallocate / mas_store_prealloc fast path for common-case
>   publish work, with a non-indexed retry queue that absorbs the
>   rare publish-under-pressure case without leaking or panicking
> - single mas_store(NULL, ...) sub-range trim in va_clip() in place
>   of an erase-and-restore pair when narrowing a free-area entry
> - single mas_erase() for the busy-tree find-then-unlink pair on the
>   free path
> - consolidation of in-use ranges as a single authoritative index on
>   the steady-state allocation hot path
> - list_head representation of the lazy-purge queue, since that queue
>   is bulk-drained and has no address-keyed query
> - per-CPU bump-allocator overlay layered on top of the migrated
>   indices for short-lived, page-aligned, common-case allocations
>   (design and chunk-size derivation in the 0010-0012 commit
>   messages)
> - explicit lock/serialisation behaviour preserved (no lock removed)
> 
> Primary advantages
> ==================
> 
> - struct vmap_area shrinks by 24 bytes on 64-bit layouts (72 -> 48),
>   removing the embedded augmented rb_node and the subtree_max_size
>   field that the rb-tree gap walk depended on. Verified with pahole
>   on arm64.
> - maple_tree's per-node fanout (multiple pivots/slots per node)
>   replaces a binary rb-tree descent for indexed lookups, giving a
>   shallower walk for the same allocation count.
> - alloc-side gap finding moves from a recursive augmented-rb walk to
>   mas_empty_area() over the in-use range index, returning the lowest
>   matching gap in a single descent.
> - vfree of a chunk-resident allocation through the per-CPU overlay
>   resolves addr -> vmap_area in O(1) via the chunk's back-pointer
>   array, with a bounded fast-reject for addresses outside any
>   reserved chunk; the maple-tree busy lookup remains the fallback.
> - correctness behaviour preserved: ordered list neighbour traversal
>   for coalescing remains; the locking/serialisation model is
>   unchanged; lockdep is silent across the test_vmalloc subtest sweep.
> - robustness in bring-up and high churn: deferred/lazy maple
>   initialisation avoids early-boot allocator hazards; the retry
>   queue keeps publish failures under GFP_NOWAIT pressure correct
>   without leaking or panicking.
> 
> Real-silicon validation
> =======================
> 
> The series was tested on Qualcomm Snapdragon X1E80100.
> The patched kernel was booted on the device against an RB baseline
> image of the same kernel revision, and exercised through:
> 
> - GFXBench, run for several hours of sustained graphics workload;
>   the patched kernel ran clean throughout, with throughput matching
>   the RB baseline within run-to-run noise.
> - boot-time module loading via the finit_module / kernel_read_file
>   path that exercises the bump-allocator's indexed write loop;
>   this path drove the patch 0012 hardening, and the patched kernel
>   is UBSAN-clean here.
> - repeated insmod / rmmod cycles to soak the chunk install / drain
>   paths under live workload.
> 
> No kernel WARN, BUG, or UBSAN report was observed across the
> multi-hour soak.
> 
I tried to do some testing using this series. See below the kernel
splat. I use test_vmalloc.sh to test/stress it.

<snip>
[   50.661082] ------------[ cut here ]------------
[   50.662455] WARNING: mm/vmalloc.c:545 at vmap_small_pages_range_noflush+0x569/0x6c0, CPU#12: vmalloc_test/41/646
[   50.664685] Modules linked in:
[   50.665454] CPU: 12 UID: 0 PID: 646 Comm: vmalloc_test/41 Tainted: G        W           7.1.0-rc7+ #347 PREEMPT(full)
[   50.667765] Tainted: [W]=WARN
[   50.668492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   50.670639] RIP: 0010:vmap_small_pages_range_noflush+0x569/0x6c0
[   50.671974] Code: e9 2d 01 00 00 b8 01 00 00 00 48 85 c2 0f 84 2d 01 00 00 4c 89 e6 4c 89 f7 e8 d3 80 c3 ff e9 1d 01 00 00 89 c3 e9 dc fe ff ff <0f> 0b 65 66 f7 05 fc 3e 14 02 ff ff 75 b2 65 48 8b 15 d9 3e 14 02
[   50.676003] RSP: 0018:ffffcd36cc22bbe0 EFLAGS: 00010286
[   50.677198] RAX: 8000000938940163 RBX: 0000000000000000 RCX: ffff8a7e9dd36000
[   50.678752] RDX: fffffcb13b3a8000 RSI: ffff8a6fc919e800 RDI: 0000000000000000
[   50.680313] RBP: ffffcd36cc22bc90 R08: ffffce6240000000 R09: ffffce622c064000
[   50.681895] R10: 0000000000000024 R11: 000000000000000a R12: ffff8a7e9dd36000
[   50.683459] R13: ffffce622c000000 R14: ffffce622c000000 R15: ffff8a776700db00
[   50.685059] FS:  0000000000000000(0000) GS:ffff8a7f1a012000(0000) knlGS:0000000000000000
[   50.686881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.688165] CR2: 0000000000000000 CR3: 0000000839c3c000 CR4: 00000000000006f0
[   50.689773] Call Trace:
[   50.690419]  <TASK>
[   50.690994]  __vmap_pages_range_noflush+0xda/0x100
[   50.692100]  __vmalloc_node_range_noprof+0x3e2/0x980
[   50.693269]  ? long_busy_list_alloc_test+0x68/0xf0
[   50.694375]  __vmalloc_node_noprof+0x52/0x70
[   50.695366]  ? long_busy_list_alloc_test+0x68/0xf0
[   50.696470]  vmalloc_noprof+0x25/0x30
[   50.697389]  long_busy_list_alloc_test+0x68/0xf0
[   50.698436]  ? __pfx_long_busy_list_alloc_test+0x10/0x10
[   50.699626]  test_func+0x112/0x230
[   50.700442]  ? __pfx_test_func+0x10/0x10
[   50.701373]  kthread+0x10d/0x140
[   50.702153]  ? __pfx_kthread+0x10/0x10
[   50.703039]  ret_from_fork+0x3a1/0x430
[   50.703925]  ? __pfx_kthread+0x10/0x10
[   50.704821]  ret_from_fork_asm+0x1a/0x30
[   50.705736]  </TASK>
[   50.706322] irq event stamp: 1124795
[   50.707157] hardirqs last  enabled at (1124805): [<ffffffffa1fe09b8>] __up_console_sem+0x68/0x80
[   50.709082] hardirqs last disabled at (1124816): [<ffffffffa1fe099d>] __up_console_sem+0x4d/0x80
[   50.711031] softirqs last  enabled at (0): [<ffffffffa1f1ba57>] copy_process+0xd67/0x2570
[   50.712874] softirqs last disabled at (0): [<0000000000000000>] 0x0
[   50.714247] ---[ end trace 0000000000000000 ]---
[   50.715336] warn_alloc: 15 callbacks suppressed
[   50.715339] vmalloc_test/41: vmalloc error: size 409600, failed to map pages, mode:0xcc2(GFP_KERNEL|__GFP_HIGHMEM), nodemask=(null),cpuset=/,mems_allowed=0
[   50.719421] CPU: 12 UID: 0 PID: 646 Comm: vmalloc_test/41 Tainted: G        W           7.1.0-rc7+ #347 PREEMPT(full)
[   50.719424] Tainted: [W]=WARN
[   50.719425] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   50.719426] Call Trace:
[   50.719427]  <TASK>
[   50.719428]  dump_stack_lvl+0x72/0xa0
[   50.719433]  dump_stack+0x14/0x1a
[   50.719435]  warn_alloc+0x137/0x160
[   50.719445]  __vmalloc_node_range_noprof+0x8cd/0x980
[   50.719450]  ? long_busy_list_alloc_test+0x68/0xf0
[   50.719457]  __vmalloc_node_noprof+0x52/0x70
[   50.719460]  ? long_busy_list_alloc_test+0x68/0xf0
[   50.719461]  vmalloc_noprof+0x25/0x30
[   50.719463]  long_busy_list_alloc_test+0x68/0xf0
[   50.719465]  ? __pfx_long_busy_list_alloc_test+0x10/0x10
[   50.719467]  test_func+0x112/0x230
[   50.719472]  ? __pfx_test_func+0x10/0x10
[   50.719474]  kthread+0x10d/0x140
[   50.719476]  ? __pfx_kthread+0x10/0x10
[   50.719479]  ret_from_fork+0x3a1/0x430
[   50.719482]  ? __pfx_kthread+0x10/0x10
[   50.719484]  ret_from_fork_asm+0x1a/0x30
[   50.719493]  </TASK>
[   50.719494] Mem-Info:
[   50.743128] active_anon:0 inactive_anon:0 isolated_anon:0
[   50.743128]  active_file:0 inactive_file:0 isolated_file:0
[   50.743128]  unevictable:0 dirty:0 writeback:0
[   50.743128]  slab_reclaimable:928 slab_unreclaimable:1820126
[   50.743128]  mapped:0 shmem:0 pagetables:610948
[   50.743128]  sec_pagetables:0 bounce:0
[   50.743128]  kernel_misc_reclaimable:0
[   50.743128]  free:12765547 free_pcp:26676 free_cma:0
[   50.751177] Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB kernel_stack:15136kB pagetables:2444312kB sec_pagetables:0kB all_unreclaimable? no Balloon:0kB gpu_active:0kB gpu_reclaim:0kB
[   50.757522] Node 0 DMA free:11264kB boost:0kB min:12kB low:24kB high:36kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   50.763396] lowmem_reserve[]: 0 2991 64259 64259 64259
[   50.764526] Node 0 DMA32 free:346136kB boost:0kB min:3144kB low:6204kB high:9264kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:3129216kB managed:3063680kB mlocked:0kB bounce:0kB free_pcp:54788kB local_pcp:760kB free_cma:0kB
[   50.770714] lowmem_reserve[]: 0 0 61267 61267 61267
[   50.771803] Node 0 Normal free:50727200kB boost:2048kB min:66468kB low:129204kB high:191940kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:63963136kB managed:62737744kB mlocked:0kB bounce:0kB free_pcp:51984kB local_pcp:520kB free_cma:0kB
[   50.778200] lowmem_reserve[]: 0 0 0 0 0
[   50.779086] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
[   50.781534] Node 0 DMA32: 31*4kB (U) 28*8kB (U) 34*16kB (U) 38*32kB (U) 41*64kB (U) 48*128kB (UM) 366*256kB (UM) 15*512kB (UM) 1*1024kB (M) 3*2048kB (UM) 55*4096kB (M) = 344700kB
[   50.788250] Node 0 Normal: 1282565*4kB (UM) 889838*8kB (UM) 149191*16kB (UME) 301234*32kB (UME) 240946*64kB (UME) 84458*128kB (UME) 775*256kB (UE) 32*512kB (U) 1*1024kB (U) 1*2048kB (U) 0*4096kB = 50724532kB
[   50.792662] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   50.794471] 0 total pagecache pages
[   50.795273] 0 pages in swap cache
[   50.796063] Free swap  = 0kB
[   50.796759] Total swap = 0kB
[   50.797457] 16777086 pages RAM
[   50.798184] 0 pages HighMem/MovableOnly
[   50.799061] 322890 pages reserved
[   50.799851] 0 pages hwpoisoned
[   50.800578] Memory cgroup min protection 0kB -- low protection 0kB
[   51.818069] BUG: TASK stack guard page was hit at ffffcd36cc25fff8 (stack is ffffcd36cc260000..ffffcd36cc264000)
[   51.818078] Oops: stack guard page: 0000 [#1] SMP NOPTI
[   51.818084] CPU: 31 UID: 0 PID: 653 Comm: vmalloc_test/48 Tainted: G        W           7.1.0-rc7+ #347 PREEMPT(full)
[   51.818088] Tainted: [W]=WARN
[   51.818090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[   51.818091] RIP: 0010:__lock_acquire+0x8/0x21d0
[   51.818098] Code: c7 c6 78 9f 40 a3 67 48 0f b9 3a e9 4e ff ff ff 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56 <41> 55 41 54 53 48 83 ec 48 48 8b 45 10 8b 1d b9 ac f6 01 48 89 45
[   51.818100] RSP: 0018:ffffcd36cc260000 EFLAGS: 00010002
[   51.818103] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000002
[   51.818104] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa3e065c8
[   51.818105] RBP: ffffcd36cc260010 R08: 0000000000000001 R09: 0000000000000001
[   51.818106] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000002
[   51.818107] R13: 0000000000000001 R14: ffffffffa3e065c8 R15: 0000000000000000
[   51.818110] FS:  0000000000000000(0000) GS:ffff8a7f1a992000(0000) knlGS:0000000000000000
[   51.818111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.818112] CR2: ffffcd36cc25fff8 CR3: 0000000839c3c000 CR4: 00000000000006f0
[   51.818116] Call Trace:
[   51.818117]  <TASK>
[   51.818119]  lock_acquire+0xc9/0x2e0
[   51.818122]  ? __alloc_pages_slowpath.constprop.0+0xca/0x14e0
[   51.818129]  seqcount_lockdep_reader_access+0x7a/0xa0
[   51.818131]  ? __alloc_pages_slowpath.constprop.0+0xca/0x14e0
[   51.818133]  __alloc_pages_slowpath.constprop.0+0xca/0x14e0
[   51.818139]  __alloc_frozen_pages_noprof+0x361/0x3b0
[   51.818144]  alloc_pages_mpol+0xa8/0x170
[   51.818147]  alloc_frozen_pages_noprof+0x54/0x90
[   51.818148]  ___kmalloc_large_node+0xc2/0xe0
[   51.818151]  __kmalloc_large_node_noprof+0x2e/0x130
[   51.818154]  __kvmalloc_node_noprof+0x5c0/0x9a0
[   51.818157]  ? vmap_bump_refill+0x44/0x1100
[   51.818162]  vmap_bump_refill+0x44/0x1100
[   51.818164]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818166]  ? lock_release+0xd1/0x2b0
[   51.818169]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818171]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818173]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818176]  alloc_vmap_area+0x1780/0x20b0
[   51.818178]  ? __lock_acquire+0x46d/0x21d0
[   51.818181]  ? lock_acquire+0xc9/0x2e0
[   51.818182]  ? find_held_lock+0x31/0x90
[   51.818185]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818189]  ? lock_release+0xd1/0x2b0
[   51.818191]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818205]  __get_vm_area_node+0xd1/0x130
[   51.818209]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818210]  ? vmap_bump_refill+0x44/0x1100
[   51.818214]  ? vmap_bump_refill+0x44/0x1100
[   51.818217]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818220]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818222]  ? vmap_bump_refill+0x44/0x1100
[   51.818225]  ? vmap_bump_refill+0x44/0x1100
[   51.818228]  vmap_bump_refill+0x44/0x1100
[   51.818230]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818232]  ? lock_release+0xd1/0x2b0
[   51.818234]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818236]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818238]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818242]  alloc_vmap_area+0x1780/0x20b0
[   51.818243]  ? __lock_acquire+0x46d/0x21d0
[   51.818246]  ? lock_acquire+0xc9/0x2e0
[   51.818247]  ? find_held_lock+0x31/0x90
[   51.818249]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818252]  ? lock_release+0xd1/0x2b0
[   51.818254]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818258]  __get_vm_area_node+0xd1/0x130
[   51.818260]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818262]  ? vmap_bump_refill+0x44/0x1100
[   51.818265]  ? vmap_bump_refill+0x44/0x1100
[   51.818268]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818271]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818273]  ? vmap_bump_refill+0x44/0x1100
[   51.818276]  ? vmap_bump_refill+0x44/0x1100
[   51.818279]  vmap_bump_refill+0x44/0x1100
[   51.818281]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818283]  ? lock_release+0xd1/0x2b0
[   51.818285]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818287]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818289]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818293]  alloc_vmap_area+0x1780/0x20b0
[   51.818294]  ? __lock_acquire+0x46d/0x21d0
[   51.818297]  ? lock_acquire+0xc9/0x2e0
[   51.818298]  ? find_held_lock+0x31/0x90
[   51.818300]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818303]  ? lock_release+0xd1/0x2b0
[   51.818305]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818308]  __get_vm_area_node+0xd1/0x130
[   51.818311]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818312]  ? vmap_bump_refill+0x44/0x1100
[   51.818315]  ? vmap_bump_refill+0x44/0x1100
[   51.818318]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818321]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818323]  ? vmap_bump_refill+0x44/0x1100
[   51.818325]  ? vmap_bump_refill+0x44/0x1100
[   51.818329]  vmap_bump_refill+0x44/0x1100
[   51.818331]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818333]  ? lock_release+0xd1/0x2b0
[   51.818335]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818338]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818340]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818344]  alloc_vmap_area+0x1780/0x20b0
[   51.818345]  ? __lock_acquire+0x46d/0x21d0
[   51.818348]  ? lock_acquire+0xc9/0x2e0
[   51.818350]  ? find_held_lock+0x31/0x90
[   51.818352]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818354]  ? lock_release+0xd1/0x2b0
[   51.818357]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818360]  __get_vm_area_node+0xd1/0x130
[   51.818362]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818364]  ? vmap_bump_refill+0x44/0x1100
[   51.818367]  ? vmap_bump_refill+0x44/0x1100
[   51.818370]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818373]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818375]  ? vmap_bump_refill+0x44/0x1100
[   51.818377]  ? vmap_bump_refill+0x44/0x1100
[   51.818380]  vmap_bump_refill+0x44/0x1100
[   51.818382]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818385]  ? lock_release+0xd1/0x2b0
[   51.818386]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818389]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818390]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818394]  alloc_vmap_area+0x1780/0x20b0
[   51.818395]  ? __lock_acquire+0x46d/0x21d0
[   51.818398]  ? lock_acquire+0xc9/0x2e0
[   51.818400]  ? find_held_lock+0x31/0x90
[   51.818402]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818404]  ? lock_release+0xd1/0x2b0
[   51.818407]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818410]  __get_vm_area_node+0xd1/0x130
[   51.818412]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818414]  ? vmap_bump_refill+0x44/0x1100
[   51.818417]  ? vmap_bump_refill+0x44/0x1100
[   51.818420]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818423]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818425]  ? vmap_bump_refill+0x44/0x1100
[   51.818427]  ? vmap_bump_refill+0x44/0x1100
[   51.818430]  vmap_bump_refill+0x44/0x1100
[   51.818433]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818435]  ? lock_release+0xd1/0x2b0
[   51.818437]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818439]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818441]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818444]  alloc_vmap_area+0x1780/0x20b0
[   51.818445]  ? __lock_acquire+0x46d/0x21d0
[   51.818448]  ? lock_acquire+0xc9/0x2e0
[   51.818450]  ? find_held_lock+0x31/0x90
[   51.818452]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818454]  ? lock_release+0xd1/0x2b0
[   51.818456]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818459]  __get_vm_area_node+0xd1/0x130
[   51.818462]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818464]  ? vmap_bump_refill+0x44/0x1100
[   51.818467]  ? vmap_bump_refill+0x44/0x1100
[   51.818469]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818472]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818474]  ? vmap_bump_refill+0x44/0x1100
[   51.818477]  ? vmap_bump_refill+0x44/0x1100
[   51.818480]  vmap_bump_refill+0x44/0x1100
[   51.818482]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818484]  ? lock_release+0xd1/0x2b0
[   51.818486]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818488]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818490]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818493]  alloc_vmap_area+0x1780/0x20b0
[   51.818495]  ? __lock_acquire+0x46d/0x21d0
[   51.818497]  ? lock_acquire+0xc9/0x2e0
[   51.818499]  ? find_held_lock+0x31/0x90
[   51.818501]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818503]  ? lock_release+0xd1/0x2b0
[   51.818506]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818509]  __get_vm_area_node+0xd1/0x130
[   51.818511]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818513]  ? vmap_bump_refill+0x44/0x1100
[   51.818516]  ? vmap_bump_refill+0x44/0x1100
[   51.818519]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818521]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818524]  ? vmap_bump_refill+0x44/0x1100
[   51.818526]  ? vmap_bump_refill+0x44/0x1100
[   51.818529]  vmap_bump_refill+0x44/0x1100
[   51.818531]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818533]  ? lock_release+0xd1/0x2b0
[   51.818535]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818537]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818539]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818543]  alloc_vmap_area+0x1780/0x20b0
[   51.818544]  ? __lock_acquire+0x46d/0x21d0
[   51.818547]  ? lock_acquire+0xc9/0x2e0
[   51.818549]  ? find_held_lock+0x31/0x90
[   51.818550]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818553]  ? lock_release+0xd1/0x2b0
[   51.818555]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818558]  __get_vm_area_node+0xd1/0x130
[   51.818561]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818562]  ? vmap_bump_refill+0x44/0x1100
[   51.818565]  ? vmap_bump_refill+0x44/0x1100
[   51.818568]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818571]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818573]  ? vmap_bump_refill+0x44/0x1100
[   51.818575]  ? vmap_bump_refill+0x44/0x1100
[   51.818578]  vmap_bump_refill+0x44/0x1100
[   51.818581]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818583]  ? lock_release+0xd1/0x2b0
[   51.818585]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818587]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818589]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818592]  alloc_vmap_area+0x1780/0x20b0
[   51.818594]  ? __lock_acquire+0x46d/0x21d0
[   51.818596]  ? lock_acquire+0xc9/0x2e0
[   51.818598]  ? find_held_lock+0x31/0x90
[   51.818600]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818602]  ? lock_release+0xd1/0x2b0
[   51.818605]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818608]  __get_vm_area_node+0xd1/0x130
[   51.818610]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818612]  ? vmap_bump_refill+0x44/0x1100
[   51.818615]  ? vmap_bump_refill+0x44/0x1100
[   51.818618]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818620]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818623]  ? vmap_bump_refill+0x44/0x1100
[   51.818625]  ? vmap_bump_refill+0x44/0x1100
[   51.818628]  vmap_bump_refill+0x44/0x1100
[   51.818630]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818633]  ? lock_release+0xd1/0x2b0
[   51.818634]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818637]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818638]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818642]  alloc_vmap_area+0x1780/0x20b0
[   51.818643]  ? __lock_acquire+0x46d/0x21d0
[   51.818646]  ? lock_acquire+0xc9/0x2e0
[   51.818647]  ? find_held_lock+0x31/0x90
[   51.818649]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818652]  ? lock_release+0xd1/0x2b0
[   51.818654]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818657]  __get_vm_area_node+0xd1/0x130
[   51.818660]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818661]  ? vmap_bump_refill+0x44/0x1100
[   51.818664]  ? vmap_bump_refill+0x44/0x1100
[   51.818667]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818670]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818672]  ? vmap_bump_refill+0x44/0x1100
[   51.818674]  ? vmap_bump_refill+0x44/0x1100
[   51.818677]  vmap_bump_refill+0x44/0x1100
[   51.818679]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818682]  ? lock_release+0xd1/0x2b0
[   51.818684]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818686]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818688]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818691]  alloc_vmap_area+0x1780/0x20b0
[   51.818692]  ? __lock_acquire+0x46d/0x21d0
[   51.818695]  ? lock_acquire+0xc9/0x2e0
[   51.818697]  ? find_held_lock+0x31/0x90
[   51.818698]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818701]  ? lock_release+0xd1/0x2b0
[   51.818703]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818706]  __get_vm_area_node+0xd1/0x130
[   51.818709]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818710]  ? vmap_bump_refill+0x44/0x1100
[   51.818714]  ? vmap_bump_refill+0x44/0x1100
[   51.818717]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818720]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818722]  ? vmap_bump_refill+0x44/0x1100
[   51.818725]  ? vmap_bump_refill+0x44/0x1100
[   51.818728]  vmap_bump_refill+0x44/0x1100
[   51.818730]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818732]  ? lock_release+0xd1/0x2b0
[   51.818734]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818737]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818738]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818742]  alloc_vmap_area+0x1780/0x20b0
[   51.818743]  ? __lock_acquire+0x46d/0x21d0
[   51.818746]  ? lock_acquire+0xc9/0x2e0
[   51.818748]  ? find_held_lock+0x31/0x90
[   51.818750]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818753]  ? lock_release+0xd1/0x2b0
[   51.818755]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818758]  __get_vm_area_node+0xd1/0x130
[   51.818761]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818763]  ? vmap_bump_refill+0x44/0x1100
[   51.818766]  ? vmap_bump_refill+0x44/0x1100
[   51.818768]  ? alloc_frozen_pages_noprof+0x54/0x90
[   51.818771]  __kvmalloc_node_noprof+0x2fd/0x9a0
[   51.818774]  ? vmap_bump_refill+0x44/0x1100
[   51.818776]  ? vmap_bump_refill+0x44/0x1100
[   51.818779]  vmap_bump_refill+0x44/0x1100
[   51.818781]  ? kmem_cache_free+0x2d3/0x4e0
[   51.818784]  ? lock_release+0xd1/0x2b0
[   51.818786]  ? vmap_bump_alloc+0x1ae/0x1e0
[   51.818788]  ? kmem_cache_free+0x2d8/0x4e0
[   51.818790]  ? vmap_bump_alloc+0x81/0x1e0
[   51.818793]  alloc_vmap_area+0x1780/0x20b0
[   51.818795]  ? __lock_acquire+0x46d/0x21d0
[   51.818798]  ? lock_acquire+0xc9/0x2e0
[   51.818800]  ? find_held_lock+0x31/0x90
[   51.818802]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
[   51.818805]  ? lock_release+0xd1/0x2b0
[   51.818807]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
[   51.818810]  __get_vm_area_node+0xd1/0x130
[   51.818813]  __vmalloc_node_range_noprof+0x145/0x980
[   51.818814]  ? long_busy_list_alloc_test+0x68/0xf0
[   51.818818]  ? kfree+0x32f/0x4a0
[   51.818820]  ? long_busy_list_alloc_test+0x68/0xf0
[   51.818822]  ? vfree+0xf6/0x2b0
[   51.818825]  __vmalloc_node_noprof+0x52/0x70
[   51.818827]  ? long_busy_list_alloc_test+0x68/0xf0
[   51.818828]  vmalloc_noprof+0x25/0x30
[   51.818829]  long_busy_list_alloc_test+0x68/0xf0
[   51.818831]  ? __pfx_long_busy_list_alloc_test+0x10/0x10
[   51.818832]  test_func+0x112/0x230
[   51.818835]  ? __pfx_test_func+0x10/0x10
[   51.818836]  kthread+0x10d/0x140
[   51.818839]  ? __pfx_kthread+0x10/0x10
[   51.818842]  ret_from_fork+0x3a1/0x430
[   51.818845]  ? __pfx_kthread+0x10/0x10
[   51.818847]  ret_from_fork_asm+0x1a/0x30
[   51.818853]  </TASK>
[   51.818853] Modules linked in:
[   51.818857] ---[ end trace 0000000000000000 ]---
[   51.818859] RIP: 0010:__lock_acquire+0x8/0x21d0
[   51.818860] Code: c7 c6 78 9f 40 a3 67 48 0f b9 3a e9 4e ff ff ff 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56 <41> 55 41 54 53 48 83 ec 48 48 8b 45 10 8b 1d b9 ac f6 01 48 89 45
[   51.818861] RSP: 0018:ffffcd36cc260000 EFLAGS: 00010002
[   51.818862] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000002
[   51.818863] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa3e065c8
[   51.818864] RBP: ffffcd36cc260010 R08: 0000000000000001 R09: 0000000000000001
[   51.818865] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000002
[   51.818865] R13: 0000000000000001 R14: ffffffffa3e065c8 R15: 0000000000000000
[   51.818868] FS:  0000000000000000(0000) GS:ffff8a7f1a992000(0000) knlGS:0000000000000000
[   51.818869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.818869] CR2: ffffcd36cc25fff8 CR3: 0000000839c3c000 CR4: 00000000000006f0
[   51.818873] Kernel panic - not syncing: Fatal exception in interrupt
[   51.819725] Kernel Offset: 0x20c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<snip>

I have not looked through the patches yet. See below some questions.

Do you solve any real issue by this series? Can we split this series
to several sections? I mean swap busy-tree then lazy-tree and latest
free-tree. This is for better code review reason and performance
evaluation one.

I am not sure we need to switch free-tree(it is not fragmented data structure)
whereas lazy and busy sounds good to me as those contain more objects. 

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree
  2026-06-14  6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki
@ 2026-06-26 11:00   ` Pranjal Arya
  0 siblings, 0 replies; 26+ messages in thread
From: Pranjal Arya @ 2026-06-26 11:00 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Andrew Morton, Liam R. Howlett, Alice Ryhl, Andrew Ballance,
	linux-arm-msm, linux-mm, linux-kernel, maple-tree,
	Lorenzo Stoakes, Pranjal Shrivastava, Will Deacon,
	Suzuki K Poulose, Neil Armstrong, Mostafa Saleh, Balbir Singh,
	Suren Baghdasaryan, Marco Elver, Dmitry Vyukov,
	Alexander Potapenko, Shuah Khan, Dev Jain, Brendan Jackman,
	Puranjay Mohan, Santosh Shukla, Wyes Karny, Sudeep Holla



On 6/14/2026 12:28 PM, Uladzislau Rezki wrote:
> On Sat, Jun 13, 2026 at 10:49:42PM +0530, Pranjal Arya wrote:
>> vmalloc's free/busy/lazy area tracking is one of the last remaining
>> augmented-rb_tree consumers in the core mm allocators. The rest of
>> mm/ has been gradually consolidating range-keyed indexing around
>> maple_tree (notably the per-process VMA tree in mm/mmap.c), and
>> the underlying reason is a structural mismatch between rb_tree and
>> range tracking:
>>
>> - rb_tree is a binary tree with a single entry per node. A lookup
>>   walks log2(N) nodes via a pointer chase that touches log2(N)
>>   cache lines, with one comparison per node. Range queries are not
>>   native to rb_tree; vmalloc has historically maintained them by
>>   augmenting every node with a subtree_max_size field whose value
>>   has to be recomputed on every insert, erase, and rotation via a
>>   custom callback. That callback machinery is vmalloc-specific
>>   code that exists only to coax range semantics out of a
>>   binary-keyed tree.
>> - maple_tree stores up to 16 ranges per node (fanout f=16), so a
>>   lookup walks ~log16(N) nodes in tightly-packed pivot/slot arrays
>>   that are far more cache-friendly. Range queries are first-class
>>   (mas_empty_area, mas_find, mas_erase), with no augmentation
>>   callback to maintain. RCU traversal and sentinel range storage
>>   (XA_ZERO_ENTRY) are part of the data structure's contract, not
>>   bolted on by the consumer.
>>
>> For the vmalloc allocator's hot paths this means shallower walks
>> under the same N, fewer cache lines touched per lookup, and no
>> custom augmented-callback machinery to maintain. Aligning vmalloc
>> with the same range-tree direction the rest of mm/ has taken
>> collapses the augmented gap walker to a single mas_empty_area()
>> descent, retires the augmented rb_node from struct vmap_area
>> (-24 bytes per object on 64-bit), and exposes the range,
>> sentinel, and RCU primitives needed for a per-CPU range cache
>> that the augmented rb_tree could not host cleanly.
>>
>> This series completes the vmalloc internal migration from rb-tree
>> indexed tracking to maple-tree indexed tracking for free, busy, and
>> lazy vmap_area range management.
>>
>> The series removes vmalloc's internal rb-tree dependencies and moves
>> address indexing and range lookups/scans to maple-tree-backed paths,
>> while keeping ordered list neighbour traversal where coalescing
>> semantics require stable predecessor/successor behaviour.
>>
>> In addition to the direct rb -> maple migration, the series includes
>> robustness and scalability refinements in the same code path:
>>
>> - deferred/lazy maple bring-up to avoid early-boot allocator hazards
>> - maple-assisted ordered-list insertion for busy/lazy tracking
>> - mas_preallocate / mas_store_prealloc fast path for common-case
>>   publish work, with a non-indexed retry queue that absorbs the
>>   rare publish-under-pressure case without leaking or panicking
>> - single mas_store(NULL, ...) sub-range trim in va_clip() in place
>>   of an erase-and-restore pair when narrowing a free-area entry
>> - single mas_erase() for the busy-tree find-then-unlink pair on the
>>   free path
>> - consolidation of in-use ranges as a single authoritative index on
>>   the steady-state allocation hot path
>> - list_head representation of the lazy-purge queue, since that queue
>>   is bulk-drained and has no address-keyed query
>> - per-CPU bump-allocator overlay layered on top of the migrated
>>   indices for short-lived, page-aligned, common-case allocations
>>   (design and chunk-size derivation in the 0010-0012 commit
>>   messages)
>> - explicit lock/serialisation behaviour preserved (no lock removed)
>>
>> Primary advantages
>> ==================
>>
>> - struct vmap_area shrinks by 24 bytes on 64-bit layouts (72 -> 48),
>>   removing the embedded augmented rb_node and the subtree_max_size
>>   field that the rb-tree gap walk depended on. Verified with pahole
>>   on arm64.
>> - maple_tree's per-node fanout (multiple pivots/slots per node)
>>   replaces a binary rb-tree descent for indexed lookups, giving a
>>   shallower walk for the same allocation count.
>> - alloc-side gap finding moves from a recursive augmented-rb walk to
>>   mas_empty_area() over the in-use range index, returning the lowest
>>   matching gap in a single descent.
>> - vfree of a chunk-resident allocation through the per-CPU overlay
>>   resolves addr -> vmap_area in O(1) via the chunk's back-pointer
>>   array, with a bounded fast-reject for addresses outside any
>>   reserved chunk; the maple-tree busy lookup remains the fallback.
>> - correctness behaviour preserved: ordered list neighbour traversal
>>   for coalescing remains; the locking/serialisation model is
>>   unchanged; lockdep is silent across the test_vmalloc subtest sweep.
>> - robustness in bring-up and high churn: deferred/lazy maple
>>   initialisation avoids early-boot allocator hazards; the retry
>>   queue keeps publish failures under GFP_NOWAIT pressure correct
>>   without leaking or panicking.
>>
>> Real-silicon validation
>> =======================
>>
>> The series was tested on Qualcomm Snapdragon X1E80100.
>> The patched kernel was booted on the device against an RB baseline
>> image of the same kernel revision, and exercised through:
>>
>> - GFXBench, run for several hours of sustained graphics workload;
>>   the patched kernel ran clean throughout, with throughput matching
>>   the RB baseline within run-to-run noise.
>> - boot-time module loading via the finit_module / kernel_read_file
>>   path that exercises the bump-allocator's indexed write loop;
>>   this path drove the patch 0012 hardening, and the patched kernel
>>   is UBSAN-clean here.
>> - repeated insmod / rmmod cycles to soak the chunk install / drain
>>   paths under live workload.
>>
>> No kernel WARN, BUG, or UBSAN report was observed across the
>> multi-hour soak.
>>
> I tried to do some testing using this series. See below the kernel
> splat. I use test_vmalloc.sh to test/stress it.
> 
> <snip>
> [   50.661082] ------------[ cut here ]------------
> [   50.662455] WARNING: mm/vmalloc.c:545 at vmap_small_pages_range_noflush+0x569/0x6c0, CPU#12: vmalloc_test/41/646
> [   50.664685] Modules linked in:
> [   50.665454] CPU: 12 UID: 0 PID: 646 Comm: vmalloc_test/41 Tainted: G        W           7.1.0-rc7+ #347 PREEMPT(full)
> [   50.667765] Tainted: [W]=WARN
> [   50.668492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   50.670639] RIP: 0010:vmap_small_pages_range_noflush+0x569/0x6c0
> [   50.671974] Code: e9 2d 01 00 00 b8 01 00 00 00 48 85 c2 0f 84 2d 01 00 00 4c 89 e6 4c 89 f7 e8 d3 80 c3 ff e9 1d 01 00 00 89 c3 e9 dc fe ff ff <0f> 0b 65 66 f7 05 fc 3e 14 02 ff ff 75 b2 65 48 8b 15 d9 3e 14 02
> [   50.676003] RSP: 0018:ffffcd36cc22bbe0 EFLAGS: 00010286
> [   50.677198] RAX: 8000000938940163 RBX: 0000000000000000 RCX: ffff8a7e9dd36000
> [   50.678752] RDX: fffffcb13b3a8000 RSI: ffff8a6fc919e800 RDI: 0000000000000000
> [   50.680313] RBP: ffffcd36cc22bc90 R08: ffffce6240000000 R09: ffffce622c064000
> [   50.681895] R10: 0000000000000024 R11: 000000000000000a R12: ffff8a7e9dd36000
> [   50.683459] R13: ffffce622c000000 R14: ffffce622c000000 R15: ffff8a776700db00
> [   50.685059] FS:  0000000000000000(0000) GS:ffff8a7f1a012000(0000) knlGS:0000000000000000
> [   50.686881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   50.688165] CR2: 0000000000000000 CR3: 0000000839c3c000 CR4: 00000000000006f0
> [   50.689773] Call Trace:
> [   50.690419]  <TASK>
> [   50.690994]  __vmap_pages_range_noflush+0xda/0x100
> [   50.692100]  __vmalloc_node_range_noprof+0x3e2/0x980
> [   50.693269]  ? long_busy_list_alloc_test+0x68/0xf0
> [   50.694375]  __vmalloc_node_noprof+0x52/0x70
> [   50.695366]  ? long_busy_list_alloc_test+0x68/0xf0
> [   50.696470]  vmalloc_noprof+0x25/0x30
> [   50.697389]  long_busy_list_alloc_test+0x68/0xf0
> [   50.698436]  ? __pfx_long_busy_list_alloc_test+0x10/0x10
> [   50.699626]  test_func+0x112/0x230
> [   50.700442]  ? __pfx_test_func+0x10/0x10
> [   50.701373]  kthread+0x10d/0x140
> [   50.702153]  ? __pfx_kthread+0x10/0x10
> [   50.703039]  ret_from_fork+0x3a1/0x430
> [   50.703925]  ? __pfx_kthread+0x10/0x10
> [   50.704821]  ret_from_fork_asm+0x1a/0x30
> [   50.705736]  </TASK>
> [   50.706322] irq event stamp: 1124795
> [   50.707157] hardirqs last  enabled at (1124805): [<ffffffffa1fe09b8>] __up_console_sem+0x68/0x80
> [   50.709082] hardirqs last disabled at (1124816): [<ffffffffa1fe099d>] __up_console_sem+0x4d/0x80
> [   50.711031] softirqs last  enabled at (0): [<ffffffffa1f1ba57>] copy_process+0xd67/0x2570
> [   50.712874] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [   50.714247] ---[ end trace 0000000000000000 ]---
> [   50.715336] warn_alloc: 15 callbacks suppressed
> [   50.715339] vmalloc_test/41: vmalloc error: size 409600, failed to map pages, mode:0xcc2(GFP_KERNEL|__GFP_HIGHMEM), nodemask=(null),cpuset=/,mems_allowed=0
> [   50.719421] CPU: 12 UID: 0 PID: 646 Comm: vmalloc_test/41 Tainted: G        W           7.1.0-rc7+ #347 PREEMPT(full)
> [   50.719424] Tainted: [W]=WARN
> [   50.719425] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   50.719426] Call Trace:
> [   50.719427]  <TASK>
> [   50.719428]  dump_stack_lvl+0x72/0xa0
> [   50.719433]  dump_stack+0x14/0x1a
> [   50.719435]  warn_alloc+0x137/0x160
> [   50.719445]  __vmalloc_node_range_noprof+0x8cd/0x980
> [   50.719450]  ? long_busy_list_alloc_test+0x68/0xf0
> [   50.719457]  __vmalloc_node_noprof+0x52/0x70
> [   50.719460]  ? long_busy_list_alloc_test+0x68/0xf0
> [   50.719461]  vmalloc_noprof+0x25/0x30
> [   50.719463]  long_busy_list_alloc_test+0x68/0xf0
> [   50.719465]  ? __pfx_long_busy_list_alloc_test+0x10/0x10
> [   50.719467]  test_func+0x112/0x230
> [   50.719472]  ? __pfx_test_func+0x10/0x10
> [   50.719474]  kthread+0x10d/0x140
> [   50.719476]  ? __pfx_kthread+0x10/0x10
> [   50.719479]  ret_from_fork+0x3a1/0x430
> [   50.719482]  ? __pfx_kthread+0x10/0x10
> [   50.719484]  ret_from_fork_asm+0x1a/0x30
> [   50.719493]  </TASK>
> [   50.719494] Mem-Info:
> [   50.743128] active_anon:0 inactive_anon:0 isolated_anon:0
> [   50.743128]  active_file:0 inactive_file:0 isolated_file:0
> [   50.743128]  unevictable:0 dirty:0 writeback:0
> [   50.743128]  slab_reclaimable:928 slab_unreclaimable:1820126
> [   50.743128]  mapped:0 shmem:0 pagetables:610948
> [   50.743128]  sec_pagetables:0 bounce:0
> [   50.743128]  kernel_misc_reclaimable:0
> [   50.743128]  free:12765547 free_pcp:26676 free_cma:0
> [   50.751177] Node 0 active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB kernel_stack:15136kB pagetables:2444312kB sec_pagetables:0kB all_unreclaimable? no Balloon:0kB gpu_active:0kB gpu_reclaim:0kB
> [   50.757522] Node 0 DMA free:11264kB boost:0kB min:12kB low:24kB high:36kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [   50.763396] lowmem_reserve[]: 0 2991 64259 64259 64259
> [   50.764526] Node 0 DMA32 free:346136kB boost:0kB min:3144kB low:6204kB high:9264kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:3129216kB managed:3063680kB mlocked:0kB bounce:0kB free_pcp:54788kB local_pcp:760kB free_cma:0kB
> [   50.770714] lowmem_reserve[]: 0 0 61267 61267 61267
> [   50.771803] Node 0 Normal free:50727200kB boost:2048kB min:66468kB low:129204kB high:191940kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:63963136kB managed:62737744kB mlocked:0kB bounce:0kB free_pcp:51984kB local_pcp:520kB free_cma:0kB
> [   50.778200] lowmem_reserve[]: 0 0 0 0 0
> [   50.779086] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
> [   50.781534] Node 0 DMA32: 31*4kB (U) 28*8kB (U) 34*16kB (U) 38*32kB (U) 41*64kB (U) 48*128kB (UM) 366*256kB (UM) 15*512kB (UM) 1*1024kB (M) 3*2048kB (UM) 55*4096kB (M) = 344700kB
> [   50.788250] Node 0 Normal: 1282565*4kB (UM) 889838*8kB (UM) 149191*16kB (UME) 301234*32kB (UME) 240946*64kB (UME) 84458*128kB (UME) 775*256kB (UE) 32*512kB (U) 1*1024kB (U) 1*2048kB (U) 0*4096kB = 50724532kB
> [   50.792662] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [   50.794471] 0 total pagecache pages
> [   50.795273] 0 pages in swap cache
> [   50.796063] Free swap  = 0kB
> [   50.796759] Total swap = 0kB
> [   50.797457] 16777086 pages RAM
> [   50.798184] 0 pages HighMem/MovableOnly
> [   50.799061] 322890 pages reserved
> [   50.799851] 0 pages hwpoisoned
> [   50.800578] Memory cgroup min protection 0kB -- low protection 0kB
> [   51.818069] BUG: TASK stack guard page was hit at ffffcd36cc25fff8 (stack is ffffcd36cc260000..ffffcd36cc264000)
> [   51.818078] Oops: stack guard page: 0000 [#1] SMP NOPTI
> [   51.818084] CPU: 31 UID: 0 PID: 653 Comm: vmalloc_test/48 Tainted: G        W           7.1.0-rc7+ #347 PREEMPT(full)
> [   51.818088] Tainted: [W]=WARN
> [   51.818090] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [   51.818091] RIP: 0010:__lock_acquire+0x8/0x21d0
> [   51.818098] Code: c7 c6 78 9f 40 a3 67 48 0f b9 3a e9 4e ff ff ff 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56 <41> 55 41 54 53 48 83 ec 48 48 8b 45 10 8b 1d b9 ac f6 01 48 89 45
> [   51.818100] RSP: 0018:ffffcd36cc260000 EFLAGS: 00010002
> [   51.818103] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000002
> [   51.818104] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa3e065c8
> [   51.818105] RBP: ffffcd36cc260010 R08: 0000000000000001 R09: 0000000000000001
> [   51.818106] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000002
> [   51.818107] R13: 0000000000000001 R14: ffffffffa3e065c8 R15: 0000000000000000
> [   51.818110] FS:  0000000000000000(0000) GS:ffff8a7f1a992000(0000) knlGS:0000000000000000
> [   51.818111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.818112] CR2: ffffcd36cc25fff8 CR3: 0000000839c3c000 CR4: 00000000000006f0
> [   51.818116] Call Trace:
> [   51.818117]  <TASK>
> [   51.818119]  lock_acquire+0xc9/0x2e0
> [   51.818122]  ? __alloc_pages_slowpath.constprop.0+0xca/0x14e0
> [   51.818129]  seqcount_lockdep_reader_access+0x7a/0xa0
> [   51.818131]  ? __alloc_pages_slowpath.constprop.0+0xca/0x14e0
> [   51.818133]  __alloc_pages_slowpath.constprop.0+0xca/0x14e0
> [   51.818139]  __alloc_frozen_pages_noprof+0x361/0x3b0
> [   51.818144]  alloc_pages_mpol+0xa8/0x170
> [   51.818147]  alloc_frozen_pages_noprof+0x54/0x90
> [   51.818148]  ___kmalloc_large_node+0xc2/0xe0
> [   51.818151]  __kmalloc_large_node_noprof+0x2e/0x130
> [   51.818154]  __kvmalloc_node_noprof+0x5c0/0x9a0
> [   51.818157]  ? vmap_bump_refill+0x44/0x1100
> [   51.818162]  vmap_bump_refill+0x44/0x1100
> [   51.818164]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818166]  ? lock_release+0xd1/0x2b0
> [   51.818169]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818171]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818173]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818176]  alloc_vmap_area+0x1780/0x20b0
> [   51.818178]  ? __lock_acquire+0x46d/0x21d0
> [   51.818181]  ? lock_acquire+0xc9/0x2e0
> [   51.818182]  ? find_held_lock+0x31/0x90
> [   51.818185]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818189]  ? lock_release+0xd1/0x2b0
> [   51.818191]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818205]  __get_vm_area_node+0xd1/0x130
> [   51.818209]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818210]  ? vmap_bump_refill+0x44/0x1100
> [   51.818214]  ? vmap_bump_refill+0x44/0x1100
> [   51.818217]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818220]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818222]  ? vmap_bump_refill+0x44/0x1100
> [   51.818225]  ? vmap_bump_refill+0x44/0x1100
> [   51.818228]  vmap_bump_refill+0x44/0x1100
> [   51.818230]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818232]  ? lock_release+0xd1/0x2b0
> [   51.818234]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818236]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818238]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818242]  alloc_vmap_area+0x1780/0x20b0
> [   51.818243]  ? __lock_acquire+0x46d/0x21d0
> [   51.818246]  ? lock_acquire+0xc9/0x2e0
> [   51.818247]  ? find_held_lock+0x31/0x90
> [   51.818249]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818252]  ? lock_release+0xd1/0x2b0
> [   51.818254]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818258]  __get_vm_area_node+0xd1/0x130
> [   51.818260]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818262]  ? vmap_bump_refill+0x44/0x1100
> [   51.818265]  ? vmap_bump_refill+0x44/0x1100
> [   51.818268]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818271]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818273]  ? vmap_bump_refill+0x44/0x1100
> [   51.818276]  ? vmap_bump_refill+0x44/0x1100
> [   51.818279]  vmap_bump_refill+0x44/0x1100
> [   51.818281]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818283]  ? lock_release+0xd1/0x2b0
> [   51.818285]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818287]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818289]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818293]  alloc_vmap_area+0x1780/0x20b0
> [   51.818294]  ? __lock_acquire+0x46d/0x21d0
> [   51.818297]  ? lock_acquire+0xc9/0x2e0
> [   51.818298]  ? find_held_lock+0x31/0x90
> [   51.818300]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818303]  ? lock_release+0xd1/0x2b0
> [   51.818305]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818308]  __get_vm_area_node+0xd1/0x130
> [   51.818311]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818312]  ? vmap_bump_refill+0x44/0x1100
> [   51.818315]  ? vmap_bump_refill+0x44/0x1100
> [   51.818318]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818321]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818323]  ? vmap_bump_refill+0x44/0x1100
> [   51.818325]  ? vmap_bump_refill+0x44/0x1100
> [   51.818329]  vmap_bump_refill+0x44/0x1100
> [   51.818331]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818333]  ? lock_release+0xd1/0x2b0
> [   51.818335]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818338]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818340]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818344]  alloc_vmap_area+0x1780/0x20b0
> [   51.818345]  ? __lock_acquire+0x46d/0x21d0
> [   51.818348]  ? lock_acquire+0xc9/0x2e0
> [   51.818350]  ? find_held_lock+0x31/0x90
> [   51.818352]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818354]  ? lock_release+0xd1/0x2b0
> [   51.818357]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818360]  __get_vm_area_node+0xd1/0x130
> [   51.818362]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818364]  ? vmap_bump_refill+0x44/0x1100
> [   51.818367]  ? vmap_bump_refill+0x44/0x1100
> [   51.818370]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818373]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818375]  ? vmap_bump_refill+0x44/0x1100
> [   51.818377]  ? vmap_bump_refill+0x44/0x1100
> [   51.818380]  vmap_bump_refill+0x44/0x1100
> [   51.818382]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818385]  ? lock_release+0xd1/0x2b0
> [   51.818386]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818389]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818390]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818394]  alloc_vmap_area+0x1780/0x20b0
> [   51.818395]  ? __lock_acquire+0x46d/0x21d0
> [   51.818398]  ? lock_acquire+0xc9/0x2e0
> [   51.818400]  ? find_held_lock+0x31/0x90
> [   51.818402]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818404]  ? lock_release+0xd1/0x2b0
> [   51.818407]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818410]  __get_vm_area_node+0xd1/0x130
> [   51.818412]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818414]  ? vmap_bump_refill+0x44/0x1100
> [   51.818417]  ? vmap_bump_refill+0x44/0x1100
> [   51.818420]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818423]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818425]  ? vmap_bump_refill+0x44/0x1100
> [   51.818427]  ? vmap_bump_refill+0x44/0x1100
> [   51.818430]  vmap_bump_refill+0x44/0x1100
> [   51.818433]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818435]  ? lock_release+0xd1/0x2b0
> [   51.818437]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818439]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818441]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818444]  alloc_vmap_area+0x1780/0x20b0
> [   51.818445]  ? __lock_acquire+0x46d/0x21d0
> [   51.818448]  ? lock_acquire+0xc9/0x2e0
> [   51.818450]  ? find_held_lock+0x31/0x90
> [   51.818452]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818454]  ? lock_release+0xd1/0x2b0
> [   51.818456]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818459]  __get_vm_area_node+0xd1/0x130
> [   51.818462]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818464]  ? vmap_bump_refill+0x44/0x1100
> [   51.818467]  ? vmap_bump_refill+0x44/0x1100
> [   51.818469]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818472]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818474]  ? vmap_bump_refill+0x44/0x1100
> [   51.818477]  ? vmap_bump_refill+0x44/0x1100
> [   51.818480]  vmap_bump_refill+0x44/0x1100
> [   51.818482]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818484]  ? lock_release+0xd1/0x2b0
> [   51.818486]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818488]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818490]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818493]  alloc_vmap_area+0x1780/0x20b0
> [   51.818495]  ? __lock_acquire+0x46d/0x21d0
> [   51.818497]  ? lock_acquire+0xc9/0x2e0
> [   51.818499]  ? find_held_lock+0x31/0x90
> [   51.818501]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818503]  ? lock_release+0xd1/0x2b0
> [   51.818506]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818509]  __get_vm_area_node+0xd1/0x130
> [   51.818511]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818513]  ? vmap_bump_refill+0x44/0x1100
> [   51.818516]  ? vmap_bump_refill+0x44/0x1100
> [   51.818519]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818521]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818524]  ? vmap_bump_refill+0x44/0x1100
> [   51.818526]  ? vmap_bump_refill+0x44/0x1100
> [   51.818529]  vmap_bump_refill+0x44/0x1100
> [   51.818531]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818533]  ? lock_release+0xd1/0x2b0
> [   51.818535]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818537]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818539]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818543]  alloc_vmap_area+0x1780/0x20b0
> [   51.818544]  ? __lock_acquire+0x46d/0x21d0
> [   51.818547]  ? lock_acquire+0xc9/0x2e0
> [   51.818549]  ? find_held_lock+0x31/0x90
> [   51.818550]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818553]  ? lock_release+0xd1/0x2b0
> [   51.818555]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818558]  __get_vm_area_node+0xd1/0x130
> [   51.818561]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818562]  ? vmap_bump_refill+0x44/0x1100
> [   51.818565]  ? vmap_bump_refill+0x44/0x1100
> [   51.818568]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818571]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818573]  ? vmap_bump_refill+0x44/0x1100
> [   51.818575]  ? vmap_bump_refill+0x44/0x1100
> [   51.818578]  vmap_bump_refill+0x44/0x1100
> [   51.818581]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818583]  ? lock_release+0xd1/0x2b0
> [   51.818585]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818587]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818589]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818592]  alloc_vmap_area+0x1780/0x20b0
> [   51.818594]  ? __lock_acquire+0x46d/0x21d0
> [   51.818596]  ? lock_acquire+0xc9/0x2e0
> [   51.818598]  ? find_held_lock+0x31/0x90
> [   51.818600]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818602]  ? lock_release+0xd1/0x2b0
> [   51.818605]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818608]  __get_vm_area_node+0xd1/0x130
> [   51.818610]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818612]  ? vmap_bump_refill+0x44/0x1100
> [   51.818615]  ? vmap_bump_refill+0x44/0x1100
> [   51.818618]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818620]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818623]  ? vmap_bump_refill+0x44/0x1100
> [   51.818625]  ? vmap_bump_refill+0x44/0x1100
> [   51.818628]  vmap_bump_refill+0x44/0x1100
> [   51.818630]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818633]  ? lock_release+0xd1/0x2b0
> [   51.818634]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818637]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818638]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818642]  alloc_vmap_area+0x1780/0x20b0
> [   51.818643]  ? __lock_acquire+0x46d/0x21d0
> [   51.818646]  ? lock_acquire+0xc9/0x2e0
> [   51.818647]  ? find_held_lock+0x31/0x90
> [   51.818649]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818652]  ? lock_release+0xd1/0x2b0
> [   51.818654]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818657]  __get_vm_area_node+0xd1/0x130
> [   51.818660]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818661]  ? vmap_bump_refill+0x44/0x1100
> [   51.818664]  ? vmap_bump_refill+0x44/0x1100
> [   51.818667]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818670]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818672]  ? vmap_bump_refill+0x44/0x1100
> [   51.818674]  ? vmap_bump_refill+0x44/0x1100
> [   51.818677]  vmap_bump_refill+0x44/0x1100
> [   51.818679]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818682]  ? lock_release+0xd1/0x2b0
> [   51.818684]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818686]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818688]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818691]  alloc_vmap_area+0x1780/0x20b0
> [   51.818692]  ? __lock_acquire+0x46d/0x21d0
> [   51.818695]  ? lock_acquire+0xc9/0x2e0
> [   51.818697]  ? find_held_lock+0x31/0x90
> [   51.818698]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818701]  ? lock_release+0xd1/0x2b0
> [   51.818703]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818706]  __get_vm_area_node+0xd1/0x130
> [   51.818709]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818710]  ? vmap_bump_refill+0x44/0x1100
> [   51.818714]  ? vmap_bump_refill+0x44/0x1100
> [   51.818717]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818720]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818722]  ? vmap_bump_refill+0x44/0x1100
> [   51.818725]  ? vmap_bump_refill+0x44/0x1100
> [   51.818728]  vmap_bump_refill+0x44/0x1100
> [   51.818730]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818732]  ? lock_release+0xd1/0x2b0
> [   51.818734]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818737]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818738]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818742]  alloc_vmap_area+0x1780/0x20b0
> [   51.818743]  ? __lock_acquire+0x46d/0x21d0
> [   51.818746]  ? lock_acquire+0xc9/0x2e0
> [   51.818748]  ? find_held_lock+0x31/0x90
> [   51.818750]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818753]  ? lock_release+0xd1/0x2b0
> [   51.818755]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818758]  __get_vm_area_node+0xd1/0x130
> [   51.818761]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818763]  ? vmap_bump_refill+0x44/0x1100
> [   51.818766]  ? vmap_bump_refill+0x44/0x1100
> [   51.818768]  ? alloc_frozen_pages_noprof+0x54/0x90
> [   51.818771]  __kvmalloc_node_noprof+0x2fd/0x9a0
> [   51.818774]  ? vmap_bump_refill+0x44/0x1100
> [   51.818776]  ? vmap_bump_refill+0x44/0x1100
> [   51.818779]  vmap_bump_refill+0x44/0x1100
> [   51.818781]  ? kmem_cache_free+0x2d3/0x4e0
> [   51.818784]  ? lock_release+0xd1/0x2b0
> [   51.818786]  ? vmap_bump_alloc+0x1ae/0x1e0
> [   51.818788]  ? kmem_cache_free+0x2d8/0x4e0
> [   51.818790]  ? vmap_bump_alloc+0x81/0x1e0
> [   51.818793]  alloc_vmap_area+0x1780/0x20b0
> [   51.818795]  ? __lock_acquire+0x46d/0x21d0
> [   51.818798]  ? lock_acquire+0xc9/0x2e0
> [   51.818800]  ? find_held_lock+0x31/0x90
> [   51.818802]  ? __kmalloc_cache_node_noprof+0x18d/0x6e0
> [   51.818805]  ? lock_release+0xd1/0x2b0
> [   51.818807]  ? __kmalloc_cache_node_noprof+0x41a/0x6e0
> [   51.818810]  __get_vm_area_node+0xd1/0x130
> [   51.818813]  __vmalloc_node_range_noprof+0x145/0x980
> [   51.818814]  ? long_busy_list_alloc_test+0x68/0xf0
> [   51.818818]  ? kfree+0x32f/0x4a0
> [   51.818820]  ? long_busy_list_alloc_test+0x68/0xf0
> [   51.818822]  ? vfree+0xf6/0x2b0
> [   51.818825]  __vmalloc_node_noprof+0x52/0x70
> [   51.818827]  ? long_busy_list_alloc_test+0x68/0xf0
> [   51.818828]  vmalloc_noprof+0x25/0x30
> [   51.818829]  long_busy_list_alloc_test+0x68/0xf0
> [   51.818831]  ? __pfx_long_busy_list_alloc_test+0x10/0x10
> [   51.818832]  test_func+0x112/0x230
> [   51.818835]  ? __pfx_test_func+0x10/0x10
> [   51.818836]  kthread+0x10d/0x140
> [   51.818839]  ? __pfx_kthread+0x10/0x10
> [   51.818842]  ret_from_fork+0x3a1/0x430
> [   51.818845]  ? __pfx_kthread+0x10/0x10
> [   51.818847]  ret_from_fork_asm+0x1a/0x30
> [   51.818853]  </TASK>
> [   51.818853] Modules linked in:
> [   51.818857] ---[ end trace 0000000000000000 ]---
> [   51.818859] RIP: 0010:__lock_acquire+0x8/0x21d0
> [   51.818860] Code: c7 c6 78 9f 40 a3 67 48 0f b9 3a e9 4e ff ff ff 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56 <41> 55 41 54 53 48 83 ec 48 48 8b 45 10 8b 1d b9 ac f6 01 48 89 45
> [   51.818861] RSP: 0018:ffffcd36cc260000 EFLAGS: 00010002
> [   51.818862] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000002
> [   51.818863] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa3e065c8
> [   51.818864] RBP: ffffcd36cc260010 R08: 0000000000000001 R09: 0000000000000001
> [   51.818865] R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000002
> [   51.818865] R13: 0000000000000001 R14: ffffffffa3e065c8 R15: 0000000000000000
> [   51.818868] FS:  0000000000000000(0000) GS:ffff8a7f1a992000(0000) knlGS:0000000000000000
> [   51.818869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.818869] CR2: ffffcd36cc25fff8 CR3: 0000000839c3c000 CR4: 00000000000006f0
> [   51.818873] Kernel panic - not syncing: Fatal exception in interrupt
> [   51.819725] Kernel Offset: 0x20c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> <snip>
> 

Thanks for testing in extreme conditions. I'll fix these in the next patchset.

> I have not looked through the patches yet. See below some questions.
> 
> Do you solve any real issue by this series? Can we split this series
> to several sections? I mean swap busy-tree then lazy-tree and latest
> free-tree. This is for better code review reason and performance
> evaluation one.
> 

Yes, on three axes:

Memory: struct vmap_area shrinks by 24 bytes per object (72 → 48 bytes,
verified with pahole on arm64).  The saving scales with the number of
concurrent live vmap_areas on a server with ~16K live VAs that is
~384 KB saved.  The 24 byte saving comes from retiring the embedded
struct rb_node (the augmented node carrying subtree_max_size), the
removal of which is enabled precisely by switching from the rb_tree
whose gap-walking algorithm required it.

Performance: Substantial improvements are observed across different
scenarios on both x86 and ARM platforms.

Code quality: The augmented rb_tree callbacks (subtree_max_size propagation,
find_vmap_lowest_match, the augmented gap walker) are deleted.  The maple
tree provides native gap finding (mas_empty_area), erase, and range storage as
features with no consumer side augmentation callbacks. This is a net simplification
in the allocator side making design cleaner and scalable.

> I am not sure we need to switch free-tree(it is not fragmented data structure)
> whereas lazy and busy sounds good to me as those contain more objects. 

Could you describe which part of the current structure
is most difficult to review?  That would help target any restructuring
effort most effectively.

In upcoming patchset, the occupied tree's gap space will be a complete
and accurate representation of all free address ranges. A separate free
range index will redundant by the occupied tree.

The concern about fragmentation is addressed by mas_empty_area(). The
maple tree's gap walk is O(log16 N) and touches tight node structures,
so performance on the alloc path is getting improved rather than being
regressed by using the occupied tree directly. The benchmarks runs also
above confirm this.

> 
> --
> Uladzislau Rezki

BR,
Pranjal

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2026-06-26 14:15 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
2026-06-15  9:52   ` Uladzislau Rezki
2026-06-16 18:07     ` Matthew Wilcox
2026-06-18 10:06       ` Uladzislau Rezki
2026-06-26 10:47         ` Pranjal Arya
2026-06-26 14:15           ` Matthew Wilcox
2026-06-26 10:34       ` Pranjal Arya
2026-06-26 10:18     ` Pranjal Arya
2026-06-25 21:28   ` Pranjal Arya
2026-06-26  9:43   ` Pranjal Arya
2026-06-14  6:35 ` [syzbot ci] " syzbot ci
2026-06-14  6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki
2026-06-26 11:00   ` Pranjal Arya

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox