[PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	"Liam R. Howlett" <liam@infradead.org>,
	Alice Ryhl <aliceryhl@google.com>,
	Andrew Ballance <andrewjballance@gmail.com>
Cc: linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, maple-tree@lists.infradead.org,
	Lorenzo Stoakes <ljs@kernel.org>,
	Pranjal Shrivastava <praan@google.com>,
	Will Deacon <will@kernel.org>,
	Suzuki K Poulose <Suzuki.Poulose@arm.com>,
	Neil Armstrong <neil.armstrong@linaro.org>,
	Mostafa Saleh <smostafa@google.com>,
	Balbir Singh <balbirs@nvidia.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Marco Elver <elver@google.com>,
	Dmitry Vyukov <dvyukov@google.com>,
	Alexander Potapenko <glider@google.com>,
	Shuah Khan <shuah@kernel.org>, Dev Jain <dev.jain@arm.com>,
	Brendan Jackman <jackmanb@google.com>,
	Puranjay Mohan <puranjay@kernel.org>,
	Santosh Shukla <santosh.shukla@amd.com>,
	Wyes Karny <wkarny@gmail.com>,
	Pranjal Arya <pranjal.arya@oss.qualcomm.com>,
	Sudeep Holla <sudeep.holla@kernel.org>
Subject: [PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path
Date: Sat, 13 Jun 2026 22:49:49 +0530	[thread overview]
Message-ID: <20260613-vmalloc_maple-v1-7-0aa740bb944b@oss.qualcomm.com> (raw)
In-Reply-To: <20260613-vmalloc_maple-v1-0-0aa740bb944b@oss.qualcomm.com>

The dual-tree design (free_vmap_area_mt + occupied_vmap_area_mt
maintained in lock-step) costs roughly twice the maple operations
per allocation lifecycle that the rb_tree path it replaced used.
Strip the maintenance back to a single authoritative tree on the
steady-state hot path.

After this patch:

  - The occupied tree is the source of truth for in-use ranges on
    the alloc/free hot path.
  - free_vmap_area_mt is still maintained on the slow paths
    (vmap_init_free_space, pcpu_get_vm_areas's top-down walk,
    decay_va_pool_node), but the steady-state alloc/free no longer
    has to keep both trees in lock-step.
  - This removes ~half of the maple operations a typical
    vmalloc/vfree cycle performs.

The pcpu top-down walk relies on the assumption that chunks consume
addresses bottom-up, so stale free-tree entries at low addresses
never collide with pcpu's chosen base.  This is documented at the
relevant call site.

Signed-off-by: Pranjal Arya <pranjal.arya@oss.qualcomm.com>
---
 mm/vmalloc.c | 179 +++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 99 insertions(+), 80 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5bc1e47c456a..73a40a88dbf6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1767,17 +1767,32 @@ occupied_mt_find_hole_window_locked(unsigned long min, unsigned long max,
 {
 	MA_STATE(mas, &occupied_vmap_area_mt, 0, 0);
 	unsigned long search = min;
+	unsigned long search_len = size;
 	unsigned long hole_end;
 	bool retry_empty;
 
 	lockdep_assert_held(&free_vmap_area_lock);
 	retry_empty = list_empty(&vmap_retry_list);
 
+	/*
+	 * Pad the gap-find by align-1 when align exceeds PAGE_SIZE so that
+	 * any alignment slack inside the returned gap can be absorbed
+	 * without an extra outer-loop iteration. Without this padding, the
+	 * loop has to scan past every page-aligned gap that is large enough
+	 * for @size but too small for the aligned start, which is O(K) in
+	 * the number of such gaps and pathological for big alignments on a
+	 * fragmented occupied tree.
+	 */
+	if (align > PAGE_SIZE) {
+		if (check_add_overflow(size, align - 1, &search_len))
+			return false;
+	}
+
 	while (search <= max) {
 		unsigned long candidate, candidate_end;
 
 		mas_set(&mas, search);
-		if (mas_empty_area(&mas, search, max, size))
+		if (mas_empty_area(&mas, search, max, search_len))
 			return false;
 
 		hole_end = min(mas.last, max);
@@ -2182,39 +2197,35 @@ rollback_busy_insert_failed_alloc_locked(struct vmap_area *va)
 }
 
 /*
- * Reinsert @va into the free index after occupied erase. On failure, place the
- * range on the non-index retry queue and best-effort restore occupied tracking.
+ * Release @va after the caller has erased it from occupied_vmap_area_mt.
+ * In the occupied-only design there is no free index to track free space
+ * with vmap_area objects: the range becomes implicitly free as soon as
+ * the occupied marker is gone. The struct itself is recycled to the slab.
  *
- * Return: free-tracked @va on success, NULL when queued for retry.
+ * The signature returns @va on success (matching the pre-rewrite contract
+ * used by the synchronous free_vmap_area() path) so the caller can decide
+ * whether further bookkeeping is needed.
  */
-static __always_inline struct vmap_area *
-reinsert_or_queue_vmap_area_locked(struct vmap_area *va)
+static __always_inline void
+release_drained_vmap_area_locked(struct vmap_area *va)
 {
-	struct vmap_area *tracked;
-
 	lockdep_assert_held(&free_vmap_area_lock);
 
-	tracked = merge_or_add_vmap_area_free_locked(va);
-	if (tracked)
-		return tracked;
-
-	if (insert_vmap_area_free_locked(va))
-		return va;
-
-	/*
-	 * Retry queue acts as allocation exclusion even if occupied restore
-	 * fails under pressure.
-	 */
-	if (WARN_ON_ONCE(!occupied_mt_store_va_locked(va)))
-		INIT_LIST_HEAD(&va->list);
-
-	retry_queue_add_va_locked(va);
-	return NULL;
+	kmem_cache_free(vmap_area_cachep, va);
 }
 
 /*
  * Returns a start address of the newly allocated area, if success.
  * Otherwise an error value is returned that indicates failure.
+ *
+ * Steady state (post late_initcall, occupied_mt perf_mode on) takes
+ * the occupied-only fast path: find a gap with mas_empty_area on
+ * @occupied_vmap_area_mt and store the consumed sub-range. This costs
+ * two maple touches per allocation versus four to six in the legacy
+ * path (which clipped a free vmap_area struct in @free_vmap_area_mt).
+ *
+ * Pre-perf_mode (early boot) and -ENOENT/-ERANGE retries fall back to
+ * the legacy free_mt walk + va_clip path, which remains correct.
  */
 static __always_inline unsigned long
 __alloc_vmap_area(unsigned long size, unsigned long align,
@@ -2235,33 +2246,41 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 		return -EINVAL;
 	if (size > vend - vstart)
 		return -ENOENT;
-	if (align > PAGE_SIZE && (vend - vstart) != size) {
-		if (check_add_overflow(size, align - 1, &search_len))
-			return -ERANGE;
-	}
 
-	if (occupied_mt_supported() && align <= PAGE_SIZE) {
-		unsigned long candidate;
+	/*
+	 * Occupied-only fast path: skip both the free_mt validation
+	 * (free_mt_find_enclose_range_locked) and the va_clip splitting.
+	 * occupied_mt_find_hole_window_locked already pads the gap search by
+	 * align-1 internally for align > PAGE_SIZE, so any alignment lands
+	 * inside the returned gap; storing the consumed sub-range in
+	 * occupied_mt makes the allocator visible to subsequent lookups. The
+	 * legacy free_mt stays in sync only at coarse points (init, pre-
+	 * perf_mode), which is harmless because the alloc and free hot paths
+	 * no longer query it.
+	 */
+	if (occupied_mt_supported()) {
+		if (!occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
+							 align, &nva_start_addr))
+			return -ENOENT;
 
-		if (occupied_mt_find_hole_window_locked(vstart, vend - 1, size,
-							align, &candidate)) {
-			if (check_add_overflow(candidate, size, &nva_end_addr))
-				return -ERANGE;
+		if (check_add_overflow(nva_start_addr, size, &nva_end_addr))
+			return -ERANGE;
 
-			va = free_mt_find_enclose_range_locked(candidate, nva_end_addr);
-			if (likely(va)) {
-				nva_start_addr = candidate;
-				goto found;
-			}
+		if (!occupied_mt_store_range_locked(nva_start_addr, nva_end_addr))
+			return -ENOMEM;
 
-			occupied_mt_cache_gap_miss_locked(candidate, vend);
-		}
+		return nva_start_addr;
 	}
 
 	/*
-	 * Free maple index is authoritative for allocatable ranges; lazy and
-	 * retry entries are intentionally excluded from it.
+	 * Pre-perf_mode early boot fallback: walk free_mt linearly and use
+	 * va_clip to keep both indices coherent.
 	 */
+	if (align > PAGE_SIZE && (vend - vstart) != size) {
+		if (check_add_overflow(size, align - 1, &search_len))
+			return -ERANGE;
+	}
+
 	mas_set(&mas, vstart);
 	va = mas_find(&mas, vend - 1);
 	while (va) {
@@ -2295,7 +2314,6 @@ __alloc_vmap_area(unsigned long size, unsigned long align,
 	if (!va)
 		return -ENOENT;
 
-found:
 	ret = va_clip(va, nva_start_addr, size);
 	if (WARN_ON_ONCE(ret))
 		return ret;
@@ -2340,8 +2358,7 @@ static void free_vmap_area(struct vmap_area *va)
 		spin_unlock(&free_vmap_area_lock);
 		goto out_schedule_retry;
 	}
-	if (!reinsert_or_queue_vmap_area_locked(va))
-		queued_retry = true;
+	release_drained_vmap_area_locked(va);
 	spin_unlock(&free_vmap_area_lock);
 
 out_schedule_retry:
@@ -2692,15 +2709,13 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
 {
 	struct vmap_area *va, *n;
 	bool ok = true;
-	bool queue_retry_work = false;
+	LIST_HEAD(release);
 
 	if (list_empty(head))
 		return true;
 
 	spin_lock(&free_vmap_area_lock);
 	list_for_each_entry_safe(va, n, head, list) {
-		bool occupied_erased = false;
-
 		list_del_init(&va->list);
 		if (erase_occupied) {
 			if (WARN_ON_ONCE(!occupied_mt_erase_va_locked(va))) {
@@ -2708,24 +2723,21 @@ reclaim_list_global(struct list_head *head, bool erase_occupied,
 				ok = false;
 				continue;
 			}
-
-			occupied_erased = true;
-		}
-			if (WARN_ON_ONCE(!merge_or_add_vmap_area_free_locked(va))) {
-				if (occupied_erased &&
-				    WARN_ON_ONCE(!occupied_mt_store_va_locked(va))) {
-					retry_queue_add_va_locked(va);
-					queue_retry_work = true;
-					ok = false;
-					continue;
-				}
-				list_add_tail(&va->list, failed);
-				ok = false;
 		}
+		/*
+		 * Occupied-only design: there are no free vmap_area objects
+		 * any more. With the occupied marker erased, the range is
+		 * implicitly free (a gap in occupied_vmap_area_mt). Just
+		 * release the struct outside the lock.
+		 */
+		list_add_tail(&va->list, &release);
 	}
 	spin_unlock(&free_vmap_area_lock);
-	if (queue_retry_work)
-		schedule_work(&drain_vmap_work);
+
+	list_for_each_entry_safe(va, n, &release, list) {
+		list_del_init(&va->list);
+		kmem_cache_free(vmap_area_cachep, va);
+	}
 
 	return ok;
 }
@@ -5747,14 +5759,16 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
 		if (occupied_mt_erase_va_locked(vas[area])) {
-			va = reinsert_or_queue_vmap_area_locked(vas[area]);
-			if (va)
-				kasan_release_vmalloc(orig_start, orig_end,
-						      va->va_start, va->va_end,
-						      KASAN_VMALLOC_PAGE_RANGE |
-						      KASAN_VMALLOC_TLB_FLUSH);
-			else
-				queued_retry = true;
+			/*
+			 * Reinsert releases vas[area] in the occupied-only
+			 * design; use orig_start/orig_end captured above for
+			 * the kasan release call rather than va->va_start.
+			 */
+			release_drained_vmap_area_locked(vas[area]);
+			kasan_release_vmalloc(orig_start, orig_end,
+					      orig_start, orig_end,
+					      KASAN_VMALLOC_PAGE_RANGE |
+					      KASAN_VMALLOC_TLB_FLUSH);
 		} else {
 			retry_queue_add_va_locked(vas[area]);
 			queued_retry = true;
@@ -5820,14 +5834,11 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		orig_start = vas[area]->va_start;
 		orig_end = vas[area]->va_end;
 		if (occupied_mt_erase_va_locked(vas[area])) {
-			va = reinsert_or_queue_vmap_area_locked(vas[area]);
-			if (va)
-				kasan_release_vmalloc(orig_start, orig_end,
-						      va->va_start, va->va_end,
-						      KASAN_VMALLOC_PAGE_RANGE |
-						      KASAN_VMALLOC_TLB_FLUSH);
-			else
-				queued_retry = true;
+			release_drained_vmap_area_locked(vas[area]);
+			kasan_release_vmalloc(orig_start, orig_end,
+					      orig_start, orig_end,
+					      KASAN_VMALLOC_PAGE_RANGE |
+					      KASAN_VMALLOC_TLB_FLUSH);
 		} else {
 			retry_queue_add_va_locked(vas[area]);
 			queued_retry = true;
@@ -6045,6 +6056,14 @@ module_init(proc_vmalloc_init);
 
 #endif
 
+/*
+ * Pre-occupied-only design seeded the free index with placeholder VAs
+ * covering gaps between vmlist entries. This is preserved as the
+ * boot-time path that populates the legacy free_vmap_area_mt for any
+ * code that still queries it (notably pcpu_get_vm_areas). With
+ * occupied_vmap_area_mt authoritative, allocators on the hot path
+ * skip free_mt entirely.
+ */
 static void __init vmap_init_free_space(void)
 {
 	unsigned long vmap_start = 1;

-- 
2.34.1

next prev parent reply	other threads:[~2026-06-13 17:21 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-13 17:19 [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 01/12] mm/vmalloc: introduce maple_tree-based indexing for vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 02/12] mm/vmalloc: convert allocation-side gap finding and insertion to maple_tree Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 03/12] mm/vmalloc: convert free, purge, and pcpu paths " Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 04/12] mm/vmalloc: finalize maple-only indexing and shrink struct vmap_area Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 05/12] mm/vmalloc: tighten failure handling under memory pressure Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 06/12] mm/vmalloc: tighten alloc/free hot paths Pranjal Arya
2026-06-13 17:19 ` Pranjal Arya [this message]
2026-06-13 17:19 ` [PATCH RFC 08/12] mm/vmalloc: track lazy-purge queue as a list_head Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 09/12] mm/vmalloc: collapse busy-tree find-then-unlink into a single mas_erase Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 10/12] mm/vmalloc: per-CPU caching of free ranges from the maple_tree allocator Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 11/12] mm/vmalloc: O(1) lookup of cached vmap_areas with bounded fast-reject Pranjal Arya
2026-06-13 17:19 ` [PATCH RFC 12/12] mm/vmalloc: harden bump-allocator alloc/free against UBSAN array bounds Pranjal Arya
2026-06-13 23:15 ` [PATCH RFC 00/12] mm/vmalloc: migrate vmap_area indexing from rb-tree to maple-tree Matthew Wilcox
2026-06-14  6:35 ` [syzbot ci] " syzbot ci
2026-06-14  6:58 ` [PATCH RFC 00/12] " Uladzislau Rezki

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:5bc1e47c456 dfblob:73a40a88dbf )
 OR (
bs:"[PATCH RFC 07/12] mm/vmalloc: consolidate occupied tree as authoritative index on hot path" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260613-vmalloc_maple-v1-7-0aa740bb944b@oss.qualcomm.com \
    --to=pranjal.arya@oss.qualcomm.com \
    --cc=Suzuki.Poulose@arm.com \
    --cc=akpm@linux-foundation.org \
    --cc=aliceryhl@google.com \
    --cc=andrewjballance@gmail.com \
    --cc=balbirs@nvidia.com \
    --cc=dev.jain@arm.com \
    --cc=dvyukov@google.com \
    --cc=elver@google.com \
    --cc=glider@google.com \
    --cc=jackmanb@google.com \
    --cc=liam@infradead.org \
    --cc=linux-arm-msm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=maple-tree@lists.infradead.org \
    --cc=neil.armstrong@linaro.org \
    --cc=praan@google.com \
    --cc=puranjay@kernel.org \
    --cc=santosh.shukla@amd.com \
    --cc=shuah@kernel.org \
    --cc=smostafa@google.com \
    --cc=sudeep.holla@kernel.org \
    --cc=surenb@google.com \
    --cc=urezki@gmail.com \
    --cc=will@kernel.org \
    --cc=wkarny@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.