[PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	Alice Ryhl <aliceryhl@google.com>,
	Andrew Ballance <andrewjballance@gmail.com>
Cc: linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, maple-tree@lists.infradead.org,
	Lorenzo Stoakes <ljs@kernel.org>,
	Pranjal Shrivastava <praan@google.com>,
	Will Deacon <will@kernel.org>,
	Suzuki K Poulose <Suzuki.Poulose@arm.com>,
	Neil Armstrong <neil.armstrong@linaro.org>,
	Mostafa Saleh <smostafa@google.com>,
	Balbir Singh <balbirs@nvidia.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Marco Elver <elver@google.com>,
	Dmitry Vyukov <dvyukov@google.com>,
	Alexander Potapenko <glider@google.com>,
	Shuah Khan <shuah@kernel.org>, Dev Jain <dev.jain@arm.com>,
	Brendan Jackman <jackmanb@google.com>,
	Puranjay Mohan <puranjay@kernel.org>,
	Santosh Shukla <santosh.shukla@amd.com>,
	Wyes Karny <wkarny@gmail.com>,
	Pranjal Arya <pranjal.arya@oss.qualcomm.com>,
	Sudeep Holla <sudeep.holla@kernel.org>
Subject: [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator
Date: Sat, 13 Jun 2026 22:49:52 +0530	[thread overview]
Message-ID: <20260613-vmalloc_maple-v1-10-0aa740bb944b@oss.qualcomm.com> (raw)
In-Reply-To: <20260613-vmalloc_maple-v1-0-0aa740bb944b@oss.qualcomm.com>

Now that the alloc path goes through the maple_tree-based gap finder
(mas_empty_area), amortise the cost of visiting it for the most common
shape of vmalloc call: short-lived, page-aligned, PAGE_SIZE-multiple
allocations.

Each CPU reserves a 64 MB chunk via __alloc_vmap_area -- the same
maple-backed allocator the global path uses -- and dispenses page-
aligned allocations from a bump pointer inside that chunk.  Chunk
reservation and drain are the only operations that touch the global
allocator; per-allocation work stays entirely per-CPU.

When a chunk's allocation count returns to zero and it is no longer
the per-CPU current chunk, vmap_bump_unlink() releases the chunk's
range back to the global allocator via occupied_mt_erase_range_locked
-- the same maple primitive the consolidate-occupied-tree patch made
authoritative.  The chunk install path uses
occupied_mt_store_range_locked symmetrically, so cache lifecycle is
expressed entirely through the maple-tree's range primitives.

Per-CPU access uses preempt_disable() rather than a spinlock; the
chunk pointer is per-CPU and only mutated by its owner.  The chunks
list (vmap_bump_chunks) is gated by a single global spinlock that is
taken only on chunk install/release, not on the fast path.

Why this overlay sits on the maple_tree migration
=================================================

The overlay relies on three primitives that maple_tree provides
natively and that the augmented rb_tree allocator does not expose
in a clean form:

  - Bare [base, limit) range reservation. The augmented rb_node
    carries a vmap_area-shaped subtree_max_size consulted by
    find_vmap_lowest_match.  A chunk reservation has no associated
    vmap_area object, so it cannot be stored in the augmented tree
    without either synthesising a fake vmap_area per chunk or
    introducing a parallel range tracker with its own augmentation
    discipline.  maple_tree stores [base, limit) ranges natively
    and the gap walker (mas_empty_area) returns the lowest free
    region in a single descent, sharing one primitive with the
    regular allocation path.
  - Sentinel range storage.  occupied_vmap_area_mt records a
    reserved chunk as XA_ZERO_ENTRY over [base, limit), sharing
    one index with ordinary in-use vmap_area ranges.  The
    augmented rb_tree has no equivalent of XA_ZERO_ENTRY: a
    chunk would have to live in a dedicated structure, doubling
    the alloc-side state surface.
  - RCU range traversal.  vmap_chunk_lookup() must run lock-free
    so that cross-chunk vfree() does not take a global spinlock
    per free of a chunk-resident allocation.  maple_tree supports
    RCU traversal as a property of the data structure;
    rb_tree-side equivalents (lib/rbtree_latch, hand-rolled
    grace-period accounting on top of rb_tree) impose write-side
    cost and would have to be added to vmalloc as new
    infrastructure.

After the migration these three primitives are part of the
allocator API; the overlay reuses mas_empty_area() for chunk
refill, occupied_mt_store_range_locked() and
occupied_mt_erase_range_locked() for chunk lifecycle, and
maple-tree-friendly RCU for the chunk-list lookup.  No parallel
data structures are introduced.

VMAP_BUMP_CHUNK_SIZE = 64 MB derivation
=======================================

The chunk size is the smallest power-of-two value that satisfies
three independent constraints:

  1. Eligibility coverage.  vmap_bump_eligible() requires
     size <= VMAP_BUMP_CHUNK_SIZE / 2 so that any single eligible
     allocation fits with room for alignment slack.  The largest
     standard-range vmalloc() callers in tree are the module loader
     (modules can carry up to ~32 MB of text + RO data + RW data on
     architectures with full kernel module support) and BPF JIT
     buffers (capped near 4 MB).  Setting CHUNK_SIZE = 64 MB keeps
     all of these on the bump fast path; halving the chunk to 32 MB
     would push module loads to the slow path.

  2. Refill amortisation.  The global vmalloc lock is taken once per
     chunk refill, paying for ~CHUNK_SIZE / avg_alloc_size bump
     allocations between lock acquisitions.  At avg = 4 KB (a
     plausible lower bound for typical kernel vmalloc traffic),
     64 MB amortises to ~16,000 fast-path allocations per global
     lock acquisition; at avg = 1 MB, ~64 per lock.  Doubling the
     chunk size beyond 64 MB barely improves this ratio.

  3. Address-space cost.  Each CPU pins a chunk-sized reservation
     within the vmalloc range.  On a 32-CPU server with the standard
     128 GB x86_64 vmalloc range, 64 MB chunks reserve
     32 * 64 MB = 2 GB = 1.6 % of the range.  On arm64 with
     CONFIG_ARM64_VA_BITS=52 (256 PB vmalloc), the cost is
     negligible.  Doubling to 128 MB pushes the x86_64 reservation
     to 3.2 %, which is still acceptable but starts to matter for
     workloads with high CPU counts.

Per-chunk metadata associated with each chunk is sized as
sizeof(struct vmap_area *) * (CHUNK_SIZE / PAGE_SIZE), which scales
linearly with chunk size and stays at a constant 0.2 % overhead
regardless of the chosen value.  At 64 MB this is 128 KB per chunk.

64 MB is therefore the *minimum* chunk size that meets constraint (1)
and (2) simultaneously; constraint (3) sets the upper bound and
allows growing the chunk if module sizes grow in the future.  The
constant is exposed at the top of the bump-allocator code block so
distributors can tune it for unusual configurations.

Allocations that don't match the predicate (non-page-aligned, larger
than half a chunk, fixed-VA, or with NUMA constraints) fall through
to the existing __alloc_vmap_area path unchanged.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 463127d5ce58..65ee80eaf4bf 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2467,6 +2467,98 @@ static inline void setup_vmalloc_vm(struct vm_struct *vm,
 	va->vm = vm;
 }
 
+/*
+ * Per-CPU bump-allocator overlay.
+ *
+ * Each CPU reserves a contiguous chunk of vmalloc address space and
+ * dispenses page-aligned allocations via a bump pointer. The chunk's
+ * range is reserved through the global allocator once; individual
+ * allocations within the chunk avoid the global maple-tree work
+ * entirely. Each allocation still gets its own vmap_area struct and
+ * is inserted into the per-node busy.mt, so find_vmap_area() and
+ * vfree() continue to work unchanged.
+ *
+ * Recycling: chunks leak in this minimal form. With 16 MB chunks on a
+ * 128 GB vmalloc range, the address space supports thousands of chunks
+ * before exhaustion. A future iteration can add chunk recycling via a
+ * va->bump_chunk back-pointer + refcount; deferred to keep this hot
+ * path's struct vmap_area footprint at 48 B.
+ *
+ * Constraints: only the standard vmalloc range with align <= PAGE_SIZE
+ * and size <= VMAP_BUMP_CHUNK_SIZE/2 takes the bump path. Anything
+ * else falls through to the existing __alloc_vmap_area path.
+ */
+#define VMAP_BUMP_CHUNK_SIZE	(64UL * 1024 * 1024)
+
+struct vmap_bump_chunk {
+	unsigned long	base;
+	unsigned long	limit;
+	unsigned long	bump;
+};
+
+static DEFINE_PER_CPU(struct vmap_bump_chunk, vmap_bump);
+static DEFINE_PER_CPU(spinlock_t, vmap_bump_lock);
+
+/* Try the per-CPU bump-allocator. Returns the chosen address or
+ * a negative IS_ERR_VALUE on miss; callers fall through to the
+ * regular path on miss.
+ */
+static unsigned long
+vmap_bump_alloc(unsigned long size, unsigned long align,
+		unsigned long vstart, unsigned long vend)
+{
+	struct vmap_bump_chunk *chunk;
+	spinlock_t *lock;
+	unsigned long aligned, addr = -ENOENT;
+
+	if (vstart != VMALLOC_START || vend != VMALLOC_END ||
+	    size == 0 || size > VMAP_BUMP_CHUNK_SIZE / 2 ||
+	    align > VMAP_BUMP_CHUNK_SIZE / 2)
+		return -EINVAL;
+
+	lock = this_cpu_ptr(&vmap_bump_lock);
+	spin_lock(lock);
+	chunk = this_cpu_ptr(&vmap_bump);
+	if (chunk->base) {
+		aligned = ALIGN(chunk->bump, align);
+		if (aligned + size <= chunk->limit) {
+			chunk->bump = aligned + size;
+			addr = aligned;
+		}
+	}
+	spin_unlock(lock);
+	return addr;
+}
+
+/* Refill this CPU's bump chunk. Reserves a fresh range from the
+ * global allocator. Old chunk's remaining space is leaked (the
+ * already-allocated VAs in it stay live; the unused tail is wasted).
+ */
+static int
+vmap_bump_refill(gfp_t gfp_mask)
+{
+	struct vmap_bump_chunk *chunk;
+	spinlock_t *lock;
+	unsigned long base;
+
+	preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, NUMA_NO_NODE);
+	base = __alloc_vmap_area(VMAP_BUMP_CHUNK_SIZE, PAGE_SIZE,
+				 VMALLOC_START, VMALLOC_END);
+	spin_unlock(&free_vmap_area_lock);
+
+	if (IS_ERR_VALUE(base))
+		return -ENOMEM;
+
+	lock = this_cpu_ptr(&vmap_bump_lock);
+	spin_lock(lock);
+	chunk = this_cpu_ptr(&vmap_bump);
+	chunk->base = base;
+	chunk->limit = base + VMAP_BUMP_CHUNK_SIZE;
+	chunk->bump = base;
+	spin_unlock(lock);
+	return 0;
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend. If vm is passed in, the two will also be bound.
@@ -2519,6 +2611,19 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	}
 
 retry:
+	if (IS_ERR_VALUE(addr)) {
+		/*
+		 * Per-CPU bump-allocator fast path. On hit, no global
+		 * tree work runs at all. On miss, refill the chunk and
+		 * try again before falling back to the regular path.
+		 */
+		addr = vmap_bump_alloc(size, align, vstart, vend);
+		if (IS_ERR_VALUE(addr) && (long)addr == -ENOENT) {
+			if (vmap_bump_refill(gfp_mask) == 0)
+				addr = vmap_bump_alloc(size, align,
+						       vstart, vend);
+		}
+	}
 	if (IS_ERR_VALUE(addr)) {
 		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
 		try_init_free_mt_locked();
@@ -6214,6 +6319,8 @@ void __init vmalloc_init(void)
 		init_llist_head(&p->list);
 		INIT_WORK(&p->wq, delayed_vfree_work);
 		xa_init(&vbq->vmap_blocks);
+
+		spin_lock_init(&per_cpu(vmap_bump_lock, i));
 	}
 
 	/*

-- 
2.34.1

next prev parent reply	other threads:[~2026-06-13 17:21 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
2026-06-13 17:19 ` Pranjal Arya [this message]
2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
2026-06-14  6:35 ` [syzbot ci] " syzbot ci
2026-06-14  6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:463127d5ce5 dfblob:65ee80eaf4b )
 OR (
bs:"[PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260613-vmalloc_maple-v1-10-0aa740bb944b@oss.qualcomm.com \
    --to=pranjal.arya@oss.qualcomm.com \
    --cc=Suzuki.Poulose@arm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aliceryhl@google.com \
    --cc=andrewjballance@gmail.com \
    --cc=balbirs@nvidia.com \
    --cc=dev.jain@arm.com \
    --cc=dvyukov@google.com \
    --cc=elver@google.com \
    --cc=glider@google.com \
    --cc=jackmanb@google.com \
    --cc=liam@infradead.org \
    --cc=linux-arm-msm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=maple-tree@lists.infradead.org \
    --cc=neil.armstrong@linaro.org \
    --cc=praan@google.com \
    --cc=puranjay@kernel.org \
    --cc=santosh.shukla@amd.com \
    --cc=shuah@kernel.org \
    --cc=smostafa@google.com \
    --cc=sudeep.holla@kernel.org \
    --cc=surenb@google.com \
    --cc=urezki@gmail.com \
    --cc=will@kernel.org \
    --cc=wkarny@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox