From: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
Uladzislau Rezki <urezki@gmail.com>,
"Liam R. Howlett" <liam@infradead.org>,
Alice Ryhl <aliceryhl@google.com>,
Andrew Ballance <andrewjballance@gmail.com>
Cc: linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, maple-tree@lists.infradead.org,
Lorenzo Stoakes <ljs@kernel.org>,
Pranjal Shrivastava <praan@google.com>,
Will Deacon <will@kernel.org>,
Suzuki K Poulose <Suzuki.Poulose@arm.com>,
Neil Armstrong <neil.armstrong@linaro.org>,
Mostafa Saleh <smostafa@google.com>,
Balbir Singh <balbirs@nvidia.com>,
Suren Baghdasaryan <surenb@google.com>,
Marco Elver <elver@google.com>,
Dmitry Vyukov <dvyukov@google.com>,
Alexander Potapenko <glider@google.com>,
Shuah Khan <shuah@kernel.org>, Dev Jain <dev.jain@arm.com>,
Brendan Jackman <jackmanb@google.com>,
Puranjay Mohan <puranjay@kernel.org>,
Santosh Shukla <santosh.shukla@amd.com>,
Wyes Karny <wkarny@gmail.com>,
Pranjal Arya <pranjal.arya@oss.qualcomm.com>,
Sudeep Holla <sudeep.holla@kernel.org>
Subject: [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
Date: Sat, 13 Jun 2026 22:49:52 +0530
Message-ID: <20260613-vmalloc_maple-v1-10-0aa740bb944b@oss.qualcomm.com>
In-Reply-To: <20260613-vmalloc_maple-v1-0-0aa740bb944b@oss.qualcomm.com>
Now that the alloc path goes through the maple_tree-based gap finder
(mas_empty_area), amortise the cost of visiting it for the most common
shape of vmalloc call: short-lived, page-aligned, PAGE_SIZE-multiple
allocations.
Each CPU reserves a 64 MB chunk via __alloc_vmap_area -- the same
maple-backed allocator the global path uses -- and dispenses page-
aligned allocations from a bump pointer inside that chunk. Chunk
reservation and drain are the only operations that touch the global
allocator; per-allocation work stays entirely per-CPU.
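
As a rough userspace illustration of the bump-pointer scheme described above (not the kernel code), the sketch below models one CPU's chunk as a [base, limit) range with a cursor. The names model_chunk, model_bump_alloc and MODEL_MISS are hypothetical, and 4 KB pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE  4096UL                 /* assumption: 4 KB pages */
#define MODEL_CHUNK_SIZE (64UL * 1024 * 1024)   /* 64 MB, as in the patch */
#define MODEL_MISS       ((uintptr_t)-1)        /* hypothetical miss marker */

/* Userspace stand-in for one CPU's chunk: [base, limit) plus a cursor. */
struct model_chunk {
	uintptr_t base;
	uintptr_t limit;
	uintptr_t bump;
};

/* Round x up to the next multiple of a; a must be a power of two. */
static uintptr_t model_align_up(uintptr_t x, uintptr_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Point the chunk at a freshly reserved range starting at base. */
static void model_chunk_init(struct model_chunk *c, uintptr_t base)
{
	c->base = base;
	c->limit = base + MODEL_CHUNK_SIZE;
	c->bump = base;
}

/* Carve size bytes at the given alignment; MODEL_MISS if it does not fit. */
static uintptr_t model_bump_alloc(struct model_chunk *c, uintptr_t size,
				  uintptr_t align)
{
	uintptr_t start;

	assert(align != 0 && (align & (align - 1)) == 0);
	start = model_align_up(c->bump, align);
	if (start > c->limit || size > c->limit - start)
		return MODEL_MISS;	/* caller would refill or fall back */
	c->bump = start + size;		/* the only state touched on a hit */
	return start;
}
```

In this model a hit only advances the cursor, which is why no shared structure is involved; a miss leaves the chunk untouched so the caller can refill or fall back.
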
When a chunk's allocation count returns to zero and it is no longer
the per-CPU current chunk, vmap_bump_unlink() releases the chunk's
range back to the global allocator via occupied_mt_erase_range_locked
-- the same maple primitive the consolidate-occupied-tree patch made
authoritative. The chunk install path uses
occupied_mt_store_range_locked symmetrically, so cache lifecycle is
expressed entirely through the maple-tree's range primitives.
Per-CPU access is serialised by a per-CPU spinlock (vmap_bump_lock); the
chunk state is per-CPU and only mutated under its owner's lock. The chunks
list (vmap_bump_chunks) is gated by a single global spinlock that is
taken only on chunk install/release, not on the fast path.
Why this overlay sits on the maple_tree migration
=================================================
The overlay relies on three primitives that maple_tree provides
natively and that the augmented rb_tree allocator does not expose
in a clean form:
- Bare [base, limit) range reservation. The augmented rb_node
carries a vmap_area-shaped subtree_max_size consulted by
find_vmap_lowest_match. A chunk reservation has no associated
vmap_area object, so it cannot be stored in the augmented tree
without either synthesising a fake vmap_area per chunk or
introducing a parallel range tracker with its own augmentation
discipline. maple_tree stores [base, limit) ranges natively
and the gap walker (mas_empty_area) returns the lowest free
region in a single descent, sharing one primitive with the
regular allocation path.
- Sentinel range storage. occupied_vmap_area_mt records a
reserved chunk as XA_ZERO_ENTRY over [base, limit), sharing
one index with ordinary in-use vmap_area ranges. The
augmented rb_tree has no equivalent of XA_ZERO_ENTRY: a
chunk would have to live in a dedicated structure, doubling
the alloc-side state surface.
- RCU range traversal. vmap_chunk_lookup() must run lock-free
so that cross-chunk vfree() does not take a global spinlock
per free of a chunk-resident allocation. maple_tree supports
RCU traversal as a property of the data structure;
rb_tree-side equivalents (lib/rbtree_latch, hand-rolled
grace-period accounting on top of rb_tree) impose write-side
cost and would have to be added to vmalloc as new
infrastructure.
After the migration these three primitives are part of the
allocator API; the overlay reuses mas_empty_area() for chunk
refill, occupied_mt_store_range_locked() and
occupied_mt_erase_range_locked() for chunk lifecycle, and
maple-tree-friendly RCU for the chunk-list lookup. No parallel
data structures are introduced.
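
To make the "lowest free region" idea concrete, here is a toy userspace model, assuming occupied ranges are kept in a sorted, non-overlapping array. It is a linear scan, not the maple tree and not the mas_empty_area() implementation; model_range and model_lowest_gap are hypothetical names. The point it illustrates is that a reserved chunk and an ordinary in-use range can sit in the same index and be skipped by the same search.

```c
#include <assert.h>
#include <stddef.h>

/* One occupied range, half-open: [start, end). */
struct model_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Find the lowest gap of at least size bytes inside [lo, hi), given
 * occupied ranges sorted by start and non-overlapping. Returns 0 and
 * stores the gap start in *out on success, -1 if nothing fits.
 */
static int model_lowest_gap(const struct model_range *occ, size_t n,
			    unsigned long lo, unsigned long hi,
			    unsigned long size, unsigned long *out)
{
	unsigned long cursor = lo;
	size_t i;

	assert(lo <= hi && out != NULL);
	for (i = 0; i < n; i++) {
		if (occ[i].end <= cursor)
			continue;	/* entirely below the cursor */
		if (occ[i].start >= hi)
			break;		/* beyond the search window */
		if (occ[i].start > cursor && occ[i].start - cursor >= size) {
			*out = cursor;	/* gap before this range fits */
			return 0;
		}
		cursor = occ[i].end;	/* skip past the occupied range */
	}
	if (cursor < hi && hi - cursor >= size) {
		*out = cursor;		/* tail gap up to hi fits */
		return 0;
	}
	return -1;
}
```

In this model, reserving a chunk would amount to adding one more entry to the array, after which both chunk refills and ordinary allocations skip it with the same search.
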
VMAP_BUMP_CHUNK_SIZE = 64 MB derivation
=======================================
The chunk size is the smallest power-of-two value that satisfies
three independent constraints:
1. Eligibility coverage. vmap_bump_eligible() requires
size <= VMAP_BUMP_CHUNK_SIZE / 2 so that any single eligible
allocation fits with room for alignment slack. The largest
standard-range vmalloc() callers in tree are the module loader
(modules can carry up to ~32 MB of text + RO data + RW data on
architectures with full kernel module support) and BPF JIT
buffers (capped near 4 MB). Setting CHUNK_SIZE = 64 MB keeps
all of these on the bump fast path; halving the chunk to 32 MB
would push module loads to the slow path.
2. Refill amortisation. The global vmalloc lock is taken once per
chunk refill, paying for ~CHUNK_SIZE / avg_alloc_size bump
allocations between lock acquisitions. At avg = 4 KB (a
plausible lower bound for typical kernel vmalloc traffic),
64 MB amortises to ~16,000 fast-path allocations per global
lock acquisition; at avg = 1 MB, ~64 per lock. Doubling the
chunk size beyond 64 MB barely improves this ratio.
3. Address-space cost. Each CPU pins a chunk-sized reservation
within the vmalloc range. On a 32-CPU server with the standard
128 GB x86_64 vmalloc range, 64 MB chunks reserve
32 * 64 MB = 2 GB = 1.6 % of the range. On arm64 with
CONFIG_ARM64_VA_BITS=52 (256 PB vmalloc), the cost is
negligible. Doubling to 128 MB pushes the x86_64 reservation
to about 3.1 %, which is still acceptable but starts to matter for
workloads with high CPU counts.
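
The arithmetic behind constraints (2) and (3) can be checked with a small helper. This is only a sketch: allocs_per_refill and pinned_basis_points are hypothetical names, and the 128 GB range and 32-CPU count are the figures quoted above, taken as given.

```c
#include <assert.h>
#include <stdint.h>

#define KB (1024ULL)
#define MB (1024ULL * KB)
#define GB (1024ULL * MB)

/* Bump allocations served per chunk refill at a given average size. */
static uint64_t allocs_per_refill(uint64_t chunk_size, uint64_t avg_alloc)
{
	assert(avg_alloc != 0);
	return chunk_size / avg_alloc;
}

/*
 * Share of the vmalloc range pinned by per-CPU chunks, in basis points
 * (hundredths of a percent), truncated toward zero.
 */
static uint64_t pinned_basis_points(uint64_t ncpus, uint64_t chunk_size,
				    uint64_t range_size)
{
	assert(range_size != 0);
	return ncpus * chunk_size * 10000 / range_size;
}
```

With these helpers, allocs_per_refill(64 * MB, 4 * KB) is 16384 and allocs_per_refill(64 * MB, 1 * MB) is 64, while pinned_basis_points(32, 64 * MB, 128 * GB) is 156 (about 1.56 %) and the 128 MB variant gives 312 (about 3.12 %).
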
Per-chunk metadata associated with each chunk is sized as
sizeof(struct vmap_area *) * (CHUNK_SIZE / PAGE_SIZE), which scales
linearly with chunk size and stays at a constant 0.2 % overhead
regardless of the chosen value. At 64 MB this is 128 KB per chunk.
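
The metadata figure follows directly from the formula above. The sketch below assumes 8-byte pointers and 4 KB pages, which is where the 0.2 % ratio (8 / 4096) comes from; chunk_metadata_bytes is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of per-chunk metadata: one pointer-sized slot per page in the chunk. */
static size_t chunk_metadata_bytes(size_t chunk_size, size_t page_size,
				   size_t slot_size)
{
	assert(page_size != 0 && chunk_size % page_size == 0);
	return slot_size * (chunk_size / page_size);
}
```

For a 64 MB chunk this gives 128 KB, and for 128 MB it gives 256 KB; in both cases the metadata is 1/512 of the chunk, so the ratio does not depend on the chunk size. A different page size or pointer width would change the ratio.
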
64 MB is therefore the *minimum* chunk size that meets constraints (1)
and (2) simultaneously; constraint (3) sets the upper bound and
allows growing the chunk if module sizes grow in the future. The
constant is exposed at the top of the bump-allocator code block so
distributors can tune it for unusual configurations.
Allocations that don't match the predicate (non-page-aligned, larger
than half a chunk, fixed-VA, or with NUMA constraints) fall through
to the existing __alloc_vmap_area path unchanged.
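
The predicate described in this paragraph could look roughly like the userspace sketch below. It is an assumption-laden model, not the patch's vmap_bump_eligible(): the MODEL_* constants are placeholders, the node parameter is hypothetical, and the check in the diff that follows is narrower (it tests the range and the size and alignment bounds only).

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE     4096UL                 /* assumption: 4 KB pages */
#define MODEL_CHUNK_SIZE    (64UL * 1024 * 1024)   /* 64 MB, as in the patch */
#define MODEL_VMALLOC_START 0x40000000UL           /* placeholder range */
#define MODEL_VMALLOC_END   0x80000000UL
#define MODEL_NO_NODE       (-1)                   /* "no NUMA preference" */

static_assert(MODEL_CHUNK_SIZE % MODEL_PAGE_SIZE == 0,
	      "chunk must be a whole number of pages");

/* True if a request may be served from the per-CPU chunk in this model. */
static bool model_bump_eligible(unsigned long size, unsigned long align,
				unsigned long vstart, unsigned long vend,
				int node)
{
	if (vstart != MODEL_VMALLOC_START || vend != MODEL_VMALLOC_END)
		return false;	/* restricted or fixed-VA request */
	if (node != MODEL_NO_NODE)
		return false;	/* NUMA-constrained request */
	if (size == 0 || size % MODEL_PAGE_SIZE != 0)
		return false;	/* not a whole number of pages */
	if (size > MODEL_CHUNK_SIZE / 2)
		return false;	/* larger than half a chunk */
	if (align == 0 || (align & (align - 1)) != 0 ||
	    align > MODEL_PAGE_SIZE)
		return false;	/* alignment stricter than a page */
	return true;
}
```

Any request rejected here would simply take the existing global path, so the predicate can be conservative without affecting correctness.
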
Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
mm/vmalloc.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 463127d5ce58..65ee80eaf4bf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2467,6 +2467,98 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm,
va->vm = vm;
}
+/*
+ * Per-CPU bump-allocator overlay.
+ *
+ * Each CPU reserves a contiguous chunk of vmalloc address space and
+ * dispenses page-aligned allocations via a bump pointer. The chunk's
+ * range is reserved through the global allocator once; individual
+ * allocations within the chunk avoid the global maple-tree work
+ * entirely. Each allocation still gets its own vmap_area struct and
+ * is inserted into the per-node busy.mt, so find_vmap_area() and
+ * vfree() continue to work unchanged.
+ *
+ * Recycling: chunks leak in this minimal form. With 64 MB chunks on a
+ * 128 GB vmalloc range, the address space holds 2048 chunks before
+ * exhaustion. A future iteration can add chunk recycling via a
+ * va->bump_chunk back-pointer + refcount; deferred to keep this hot
+ * path's struct vmap_area footprint at 48 B.
+ *
+ * Constraints: only the standard vmalloc range with align <= PAGE_SIZE
+ * and size <= VMAP_BUMP_CHUNK_SIZE/2 takes the bump path. Anything
+ * else falls through to the existing __alloc_vmap_area path.
+ */
+#define VMAP_BUMP_CHUNK_SIZE (64UL * 1024 * 1024)
+
+struct vmap_bump_chunk {
+ unsigned long base;
+ unsigned long limit;
+ unsigned long bump;
+};
+
+static DEFINE_PER_CPU(struct vmap_bump_chunk, vmap_bump);
+static DEFINE_PER_CPU(spinlock_t, vmap_bump_lock);
+
+/* Try the per-CPU bump-allocator. Returns the address on a hit, -EINVAL
+ * for an ineligible request, or -ENOENT when this CPU's chunk is missing
+ * or full; callers refill on -ENOENT and otherwise use the regular path.
+ */
+static unsigned long
+vmap_bump_alloc(unsigned long size, unsigned long align,
+ unsigned long vstart, unsigned long vend)
+{
+ struct vmap_bump_chunk *chunk;
+ spinlock_t *lock;
+ unsigned long aligned, addr = -ENOENT;
+
+ if (vstart != VMALLOC_START || vend != VMALLOC_END ||
+ size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
+ align > VMAP_BUMP_CHUNK_SIZE / 2)
+ return -EINVAL;
+
+ lock = this_cpu_ptr(&vmap_bump_lock);
+ spin_lock(lock);
+ chunk = this_cpu_ptr(&vmap_bump);
+ if (chunk->base) {
+ aligned = ALIGN(chunk->bump, align);
+ if (aligned + size <= chunk->limit) {
+ chunk->bump = aligned + size;
+ addr = aligned;
+ }
+ }
+ spin_unlock(lock);
+ return addr;
+}
+
+/* Refill this CPU's bump chunk. Reserves a fresh range from the
+ * global allocator. Old chunk's remaining space is leaked (the
+ * already-allocated VAs in it stay live; the unused tail is wasted).
+ */
+static int
+vmap_bump_refill(gfp_t gfp_mask)
+{
+ struct vmap_bump_chunk *chunk;
+ spinlock_t *lock;
+ unsigned long base;
+
+ preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, NUMA_NO_NODE);
+ base = __alloc_vmap_area(VMAP_BUMP_CHUNK_SIZE, PAGE_SIZE,
+ VMALLOC_START, VMALLOC_END);
+ spin_unlock(&free_vmap_area_lock);
+
+ if (IS_ERR_VALUE(base))
+ return -ENOMEM;
+
+ lock = this_cpu_ptr(&vmap_bump_lock);
+ spin_lock(lock);
+ chunk = this_cpu_ptr(&vmap_bump);
+ chunk->base = base;
+ chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
+ chunk->bump = base;
+ spin_unlock(lock);
+ return 0;
+}
+
/*
* Allocate a region of KVA of the specified size and alignment, within the
* vstart and vend. If vm is passed in, the two will also be bound.
@@ -2519,6 +2611,19 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
}
retry:
+ if (IS_ERR_VALUE(addr)) {
+ /*
+ * Per-CPU bump-allocator fast path. On hit, no global
+ * tree work runs at all. On miss, refill the chunk and
+ * try again before falling back to the regular path.
+ */
+ addr = vmap_bump_alloc(size, align, vstart, vend);
+ if (IS_ERR_VALUE(addr) && (long)addr == -ENOENT) {
+ if (vmap_bump_refill(gfp_mask) == 0)
+ addr = vmap_bump_alloc(size, align,
+ vstart, vend);
+ }
+ }
if (IS_ERR_VALUE(addr)) {
preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
try_init_free_mt_locked();
@@ -6214,6 +6319,8 @@ void __init vmalloc_init(void)
init_llist_head(&p->list);
INIT_WORK(&p->wq, delayed_vfree_work);
xa_init(&vbq->vmap_blocks);
+
+ spin_lock_init(&per_cpu(vmap_bump_lock, i));
}
/*
--
2.34.1
2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
2026-06-13 17:19 ` Pranjal Arya [this message]
2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
2026-06-14 6:35 ` [syzbot ci] " syzbot ci
2026-06-14 6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki