From: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
Uladzislau Rezki <urezki@gmail.com>,
"Liam R. Howlett" <liam@infradead.org>,
Alice Ryhl <aliceryhl@google.com>,
Andrew Ballance <andrewjballance@gmail.com>
Cc: linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, maple-tree@lists.infradead.org,
Lorenzo Stoakes <ljs@kernel.org>,
Pranjal Shrivastava <praan@google.com>,
Will Deacon <will@kernel.org>,
Suzuki K Poulose <Suzuki.Poulose@arm.com>,
Neil Armstrong <neil.armstrong@linaro.org>,
Mostafa Saleh <smostafa@google.com>,
Balbir Singh <balbirs@nvidia.com>,
Suren Baghdasaryan <surenb@google.com>,
Marco Elver <elver@google.com>,
Dmitry Vyukov <dvyukov@google.com>,
Alexander Potapenko <glider@google.com>,
Shuah Khan <shuah@kernel.org>, Dev Jain <dev.jain@arm.com>,
Brendan Jackman <jackmanb@google.com>,
Puranjay Mohan <puranjay@kernel.org>,
Santosh Shukla <santosh.shukla@amd.com>,
Wyes Karny <wkarny@gmail.com>,
Pranjal Arya <pranjal.arya@oss.qualcomm.com>,
Sudeep Holla <sudeep.holla@kernel.org>
Subject: [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
Date: Sat, 13 Jun 2026 22:49:52 +0530
Message-ID: <20260613-vmalloc_maple-v1-10-0aa740bb944b@oss.qualcomm.com>
In-Reply-To: <20260613-vmalloc_maple-v1-0-0aa740bb944b@oss.qualcomm.com>
Now that the alloc path goes through the maple_tree-based gap finder
(mas_empty_area), amortise the cost of visiting it for the most common
shape of vmalloc call: short-lived, page-aligned, PAGE_SIZE-multiple
allocations.
Each CPU reserves a 64 MB chunk via __alloc_vmap_area -- the same
maple-backed allocator the global path uses -- and dispenses page-
aligned allocations from a bump pointer inside that chunk. Chunk
reservation and drain are the only operations that touch the global
allocator; per-allocation work stays entirely per-CPU.
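As a minimal userspace sketch of the bump-pointer idea described above
(not the kernel code): struct bump_chunk, bump_alloc() and PAGE_SZ are
hypothetical stand-ins, a 4 KB page size is assumed, and locking is
omitted.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE (64UL * 1024 * 1024) /* mirrors VMAP_BUMP_CHUNK_SIZE */
#define PAGE_SZ    4096UL               /* assumed page size */

/* Hypothetical stand-in for one CPU's reserved chunk. */
struct bump_chunk {
	uint64_t base;  /* first address of the reserved range */
	uint64_t limit; /* one past the last address of the range */
	uint64_t bump;  /* next unallocated address */
};

/* Round x up to a power-of-two alignment. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Carve size bytes out of the chunk. Returns the start address, or 0
 * when the remaining tail is too small (the caller would then refill
 * or fall back to the global allocator).
 */
static uint64_t bump_alloc(struct bump_chunk *c, uint64_t size, uint64_t align)
{
	uint64_t start = align_up(c->bump, align);

	if (start + size > c->limit)
		return 0;
	c->bump = start + size;
	return start;
}
```

Each successful call only advances c->bump, which is why the per-
allocation cost does not depend on the state of the global tree.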
When a chunk's allocation count returns to zero and it is no longer
the per-CPU current chunk, vmap_bump_unlink() releases the chunk's
range back to the global allocator via occupied_mt_erase_range_locked
-- the same maple primitive the consolidate-occupied-tree patch made
authoritative. The chunk install path uses
occupied_mt_store_range_locked symmetrically, so cache lifecycle is
expressed entirely through the maple-tree's range primitives.
Per-CPU access is serialised by a per-CPU spinlock (vmap_bump_lock)
rather than a global lock; the chunk descriptor is only mutated under
its CPU's lock. The chunks
list (vmap_bump_chunks) is gated by a single global spinlock that is
taken only on chunk install/release, not on the fast path.
Why this overlay sits on the maple_tree migration
=================================================
The overlay relies on three primitives that maple_tree provides
natively and that the augmented rb_tree allocator does not expose
in a clean form:
- Bare [base, limit) range reservation. The augmented rb_node
carries a vmap_area-shaped subtree_max_size consulted by
find_vmap_lowest_match. A chunk reservation has no associated
vmap_area object, so it cannot be stored in the augmented tree
without either synthesising a fake vmap_area per chunk or
introducing a parallel range tracker with its own augmentation
discipline. maple_tree stores [base, limit) ranges natively
and the gap walker (mas_empty_area) returns the lowest free
region in a single descent, sharing one primitive with the
regular allocation path.
- Sentinel range storage. occupied_vmap_area_mt records a
reserved chunk as XA_ZERO_ENTRY over [base, limit), sharing
one index with ordinary in-use vmap_area ranges. The
augmented rb_tree has no equivalent of XA_ZERO_ENTRY: a
chunk would have to live in a dedicated structure, doubling
the alloc-side state surface.
- RCU range traversal. vmap_chunk_lookup() must run lock-free
so that cross-chunk vfree() does not take a global spinlock
per free of a chunk-resident allocation. maple_tree supports
RCU traversal as a property of the data structure;
rb_tree-side equivalents (lib/rbtree_latch, hand-rolled
grace-period accounting on top of rb_tree) impose write-side
cost and would have to be added to vmalloc as new
infrastructure.
After the migration these three primitives are part of the
allocator API; the overlay reuses mas_empty_area() for chunk
refill, occupied_mt_store_range_locked() and
occupied_mt_erase_range_locked() for chunk lifecycle, and
maple-tree-friendly RCU for the chunk-list lookup. No parallel
data structures are introduced.
VMAP_BUMP_CHUNK_SIZE = 64 MB derivation
=======================================
The chunk size is the smallest power-of-two value that satisfies
three independent constraints:
1. Eligibility coverage. vmap_bump_eligible() requires
size <= VMAP_BUMP_CHUNK_SIZE / 2 so that any single eligible
allocation fits with room for alignment slack. The largest
standard-range vmalloc() callers in tree are the module loader
(modules can carry up to ~32 MB of text + RO data + RW data on
architectures with full kernel module support) and BPF JIT
buffers (capped near 4 MB). Setting CHUNK_SIZE = 64 MB keeps
all of these on the bump fast path; halving the chunk to 32 MB
would push module loads to the slow path.
2. Refill amortisation. The global vmalloc lock is taken once per
chunk refill, paying for ~CHUNK_SIZE / avg_alloc_size bump
allocations between lock acquisitions. At avg = 4 KB (a
plausible lower bound for typical kernel vmalloc traffic),
64 MB amortises to ~16,000 fast-path allocations per global
lock acquisition; at avg = 1 MB, ~64 per lock. Doubling the
chunk size beyond 64 MB barely improves this ratio.
3. Address-space cost. Each CPU pins a chunk-sized reservation
within the vmalloc range. On a 32-CPU server with the standard
128 GB x86_64 vmalloc range, 64 MB chunks reserve
32 * 64 MB = 2 GB = 1.6 % of the range. On arm64 with
CONFIG_ARM64_VA_BITS=52 (256 PB vmalloc), the cost is
negligible. Doubling to 128 MB pushes the x86_64 reservation
to 3.1 %, which is still acceptable but starts to matter for
workloads with high CPU counts.
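The arithmetic behind constraints (2) and (3) can be reproduced with a
small userspace sketch. The helper names are hypothetical; the inputs
(4 KB and 1 MB average sizes, 32 CPUs, a 128 GB range) are the figures
quoted above.

```c
#include <stdint.h>

#define KB 1024ULL
#define MB (1024ULL * KB)
#define GB (1024ULL * MB)

/* Constraint (2): bump allocations served per global-lock acquisition. */
static uint64_t allocs_per_refill(uint64_t chunk_size, uint64_t avg_alloc)
{
	return chunk_size / avg_alloc;
}

/* Constraint (3): share of the vmalloc range pinned by per-CPU chunks, in percent. */
static double reserved_percent(uint64_t chunk_size, unsigned int cpus,
			       uint64_t vmalloc_range)
{
	return 100.0 * (double)chunk_size * cpus / (double)vmalloc_range;
}
```

With these inputs, allocs_per_refill(64 * MB, 4 * KB) gives 16384 and
allocs_per_refill(64 * MB, 1 * MB) gives 64, while
reserved_percent(64 * MB, 32, 128 * GB) gives 1.5625 and the same call
with 128 MB chunks gives 3.125.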
Per-chunk metadata associated with each chunk is sized as
sizeof(struct vmap_area *) * (CHUNK_SIZE / PAGE_SIZE), which scales
linearly with chunk size and stays at a constant 0.2 % overhead
regardless of the chosen value. At 64 MB this is 128 KB per chunk.
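A short sketch of the metadata sizing, assuming 4 KB pages and 8-byte
pointers (both assumptions, as is the helper name):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One pointer-sized slot per page in the chunk, mirroring
 * sizeof(struct vmap_area *) * (CHUNK_SIZE / PAGE_SIZE).
 */
static uint64_t chunk_metadata_bytes(uint64_t chunk_size, uint64_t page_size,
				     size_t ptr_size)
{
	return (chunk_size / page_size) * ptr_size;
}
```

For a 64 MB chunk this gives 16384 slots * 8 bytes = 131072 bytes
(128 KB). The ratio is ptr_size / page_size = 8 / 4096, roughly 0.2 %,
independent of the chunk size.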
64 MB is therefore the *minimum* chunk size that meets constraints (1)
and (2) simultaneously; constraint (3) sets the upper bound and
allows growing the chunk if module sizes grow in the future. The
constant is exposed at the top of the bump-allocator code block so
distributors can tune it for unusual configurations.
Allocations that don't match the predicate (non-page-aligned, larger
than half a chunk, fixed-VA, or with NUMA constraints) fall through
to the existing __alloc_vmap_area path unchanged.
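The predicate can be sketched in userspace as follows. bump_eligible()
and its standard_range flag are hypothetical; the alignment bound
follows the in-code comment (align <= PAGE_SIZE), a 4 KB page size is
assumed, and the NUMA check is left out because its form is not shown
in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SIZE (64UL * 1024 * 1024) /* mirrors VMAP_BUMP_CHUNK_SIZE */
#define PAGE_SZ    4096UL               /* assumed page size */

/*
 * Returns true when a request may take the bump fast path.
 * standard_range stands in for the check that the caller asked for
 * the whole [VMALLOC_START, VMALLOC_END) window (i.e. not fixed-VA).
 */
static bool bump_eligible(uint64_t size, uint64_t align, bool standard_range)
{
	if (!standard_range)
		return false;
	if (size == 0 || size % PAGE_SZ != 0)
		return false; /* not a PAGE_SIZE multiple */
	if (size > CHUNK_SIZE / 2)
		return false; /* larger than half a chunk */
	if (align > PAGE_SZ)
		return false; /* stricter than page alignment */
	return true;
}
```

A 32 MB request (exactly half a chunk) passes this check; one page
more than that does not and would go to the regular path.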
Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
mm/vmalloc.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 463127d5ce58..65ee80eaf4bf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2467,6 +2467,98 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm,
va->vm = vm;
}
+/*
+ * Per-CPU bump-allocator overlay.
+ *
+ * Each CPU reserves a contiguous chunk of vmalloc address space and
+ * dispenses page-aligned allocations via a bump pointer. The chunk's
+ * range is reserved through the global allocator once; individual
+ * allocations within the chunk avoid the global maple-tree work
+ * entirely. Each allocation still gets its own vmap_area struct and
+ * is inserted into the per-node busy.mt, so find_vmap_area() and
+ * vfree() continue to work unchanged.
+ *
+ * Recycling: chunks leak in this minimal form. With 64 MB chunks on a
+ * 128 GB vmalloc range, the address space supports about 2,000 chunks
+ * before exhaustion. A future iteration can add chunk recycling via a
+ * va->bump_chunk back-pointer + refcount; deferred to keep this hot
+ * path's struct vmap_area footprint at 48 B.
+ *
+ * Constraints: only the standard vmalloc range with align <= PAGE_SIZE
+ * and size <= VMAP_BUMP_CHUNK_SIZE/2 takes the bump path. Anything
+ * else falls through to the existing __alloc_vmap_area path.
+ */
+#define VMAP_BUMP_CHUNK_SIZE (64UL * 1024 * 1024)
+
+struct vmap_bump_chunk {
+ unsigned long base;
+ unsigned long limit;
+ unsigned long bump;
+};
+
+static DEFINE_PER_CPU(struct vmap_bump_chunk, vmap_bump);
+static DEFINE_PER_CPU(spinlock_t, vmap_bump_lock);
+
+/* Try the per-CPU bump-allocator. Returns the chosen address on a hit,
+ * -ENOENT if the current chunk is missing or full (a refill may help),
+ * or -EINVAL if the request is ineligible; test with IS_ERR_VALUE().
+ */
+static unsigned long
+vmap_bump_alloc(unsigned long size, unsigned long align,
+ unsigned long vstart, unsigned long vend)
+{
+ struct vmap_bump_chunk *chunk;
+ spinlock_t *lock;
+ unsigned long aligned, addr = -ENOENT;
+
+ if (vstart != VMALLOC_START || vend != VMALLOC_END ||
+ size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
+ align > VMAP_BUMP_CHUNK_SIZE / 2)
+ return -EINVAL;
+
+ lock = this_cpu_ptr(&vmap_bump_lock);
+ spin_lock(lock);
+ chunk = this_cpu_ptr(&vmap_bump);
+ if (chunk->base) {
+ aligned = ALIGN(chunk->bump, align);
+ if (aligned + size <= chunk->limit) {
+ chunk->bump = aligned + size;
+ addr = aligned;
+ }
+ }
+ spin_unlock(lock);
+ return addr;
+}
+
+/* Refill this CPU's bump chunk. Reserves a fresh range from the
+ * global allocator. Old chunk's remaining space is leaked (the
+ * already-allocated VAs in it stay live; the unused tail is wasted).
+ */
+static int
+vmap_bump_refill(gfp_t gfp_mask)
+{
+ struct vmap_bump_chunk *chunk;
+ spinlock_t *lock;
+ unsigned long base;
+
+ preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, NUMA_NO_NODE);
+ base = __alloc_vmap_area(VMAP_BUMP_CHUNK_SIZE, PAGE_SIZE,
+ VMALLOC_START, VMALLOC_END);
+ spin_unlock(&free_vmap_area_lock);
+
+ if (IS_ERR_VALUE(base))
+ return -ENOMEM;
+
+ lock = this_cpu_ptr(&vmap_bump_lock);
+ spin_lock(lock);
+ chunk = this_cpu_ptr(&vmap_bump);
+ chunk->base = base;
+ chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
+ chunk->bump = base;
+ spin_unlock(lock);
+ return 0;
+}
+
/*
* Allocate a region of KVA of the specified size and alignment, within the
* vstart and vend. If vm is passed in, the two will also be bound.
@@ -2519,6 +2611,19 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
}
retry:
+ if (IS_ERR_VALUE(addr)) {
+ /*
+ * Per-CPU bump-allocator fast path. On hit, no global
+ * tree work runs at all. On miss, refill the chunk and
+ * try again before falling back to the regular path.
+ */
+ addr = vmap_bump_alloc(size, align, vstart, vend);
+ if (IS_ERR_VALUE(addr) && (long)addr == -ENOENT) {
+ if (vmap_bump_refill(gfp_mask) == 0)
+ addr = vmap_bump_alloc(size, align,
+ vstart, vend);
+ }
+ }
if (IS_ERR_VALUE(addr)) {
preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
try_init_free_mt_locked();
@@ -6214,6 +6319,8 @@ void __init vmalloc_init(void)
init_llist_head(&p->list);
INIT_WORK(&p->wq, delayed_vfree_work);
xa_init(&vbq->vmap_blocks);
+
+ spin_lock_init(&per_cpu(vmap_bump_lock, i));
}
/*
--
2.34.1
Thread overview: 16+ messages
2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
2026-06-13 17:19 ` Pranjal Arya [this message]
2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
2026-06-14 6:35 ` [syzbot ci] " syzbot ci
2026-06-14 6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki