From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
    willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
    ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
    fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 31/40] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache
Date: Wed, 20 May 2026 10:59:37 -0400
Message-ID: <20260520150018.2491267-32-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

PASS_1 of __rmqueue_smallest walks &zone->spb_lists[cat][full]
linearly. Under steady workload on a 250 GB test system, the median
walk depth was ~50 SPBs and 20-57% of allocations visited 100+ SPBs.

Cache the SPB that last satisfied a PASS_1 alloc for each
(zone, order, migratetype) tuple, in two layers:

  - per-zone hint (zone->sb_hint[order][mt]) -- visible to all CPUs,
    serialized by zone->lock.
  - per-CPU hint indexed by zone_idx -- cache-hot, contention-free.
    Each slot stores (zone *, sb *) because zone_idx is per-pgdat
    (not globally unique on NUMA); the zone-pointer check on read
    prevents a cross-node SPB from being handed back to the wrong
    zone's accounting.
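
The zone-pointer check on the per-CPU slot can be modelled in userspace
roughly as below. This is only a sketch, not the code in this patch:
the toy struct layouts, the TOY_* sizes and the hint_store()/hint_load()
helpers are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; only pointer identity matters. */
struct zone { int node; int idx; };	/* idx models zone_idx() */
struct superpageblock { struct zone *owner; };

#define TOY_NR_ZONES	4	/* assumed, stands in for MAX_NR_ZONES */
#define TOY_NR_ORDERS	11	/* assumed, stands in for NR_PAGE_ORDERS */
#define TOY_NR_MT	3	/* assumed, stands in for MIGRATE_PCPTYPES */

struct hint_slot {
	struct zone *zone;
	struct superpageblock *sb;
};

/* One CPU's hint table, indexed by (zone idx, order, migratetype). */
struct cpu_hints {
	struct hint_slot slot[TOY_NR_ZONES][TOY_NR_ORDERS][TOY_NR_MT];
};

/* Record the SPB that just satisfied an allocation for zone z. */
static void hint_store(struct cpu_hints *h, struct zone *z,
		       unsigned int order, int mt,
		       struct superpageblock *sb)
{
	assert(order < TOY_NR_ORDERS && mt >= 0 && mt < TOY_NR_MT);
	h->slot[z->idx][order][mt].zone = z;
	h->slot[z->idx][order][mt].sb = sb;
}

/*
 * Return the cached SPB only if it was stored for this exact zone.
 * Zones of the same type on different nodes share one slot index,
 * so a zone mismatch means the hint belongs to another node.
 */
static struct superpageblock *hint_load(struct cpu_hints *h, struct zone *z,
					unsigned int order, int mt)
{
	struct hint_slot *s;

	assert(order < TOY_NR_ORDERS && mt >= 0 && mt < TOY_NR_MT);
	s = &h->slot[z->idx][order][mt];
	return s->zone == z ? s->sb : NULL;
}
```

In this model, a hint stored for node 0's zone is returned for node 0
and rejected for node 1's zone of the same idx, which then falls back
to whatever slower path the caller has.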

Stale hints are harmless: try_alloc_from_sb_pass1() returns NULL and
the standard list walk runs as before. On PASS_1 success both hints
are refreshed. spb_invalidate_warm_hints() clears both arrays from
resize_zone_superpageblocks() under zone->lock to prevent UAF across
memory hotplug-add.
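
A simplified userspace model of the invalidation step might look like
the following. It is a sketch under stated assumptions: locking is
omitted (the patch relies on zone->lock), (order, mt) is flattened into
a single key, and the TOY_* sizes and invalidate_hints() are
hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_CPUS	4	/* assumed CPU count */
#define TOY_NR_ZONES	4	/* assumed, stands in for MAX_NR_ZONES */
#define TOY_NR_KEYS	33	/* assumed number of (order, mt) pairs */

struct zone;
struct superpageblock { int id; };

struct hint_slot {
	struct zone *zone;
	struct superpageblock *sb;
};

struct zone {
	int idx;					/* models zone_idx() */
	struct superpageblock *sb_hint[TOY_NR_KEYS];	/* per-zone layer */
};

/* Per-CPU layer, modelled as a plain array indexed by CPU number. */
static struct hint_slot cpu_hints[TOY_NR_CPUS][TOY_NR_ZONES][TOY_NR_KEYS];

/* Drop every hint that points into zone z, on every CPU. */
static void invalidate_hints(struct zone *z)
{
	int cpu, key;

	assert(z->idx >= 0 && z->idx < TOY_NR_ZONES);

	/* Clear the per-zone layer in one go. */
	memset(z->sb_hint, 0, sizeof(z->sb_hint));

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		for (key = 0; key < TOY_NR_KEYS; key++) {
			struct hint_slot *s = &cpu_hints[cpu][z->idx][key];

			/* Same idx, different node: not ours, keep it. */
			if (s->zone != z)
				continue;
			s->zone = NULL;
			s->sb = NULL;
		}
	}
}
```

Slots owned by another node's zone of the same idx are left alone, so
resizing one zone does not discard warm hints for its NUMA siblings.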

Hint hits show up in tracepoint:kmem:spb_alloc_walk as the [0, 5)
bucket because n_spbs_visited stays 0; no new tracepoint needed.

Skipped for migratetype >= MIGRATE_PCPTYPES (HIGHATOMIC/CMA/ISOLATE
are already cheap or rare).

Measurement on the same test system with this commit applied:

  median walk depth:        ~50 SPBs -> ~5
  tail (>=100 SPB visits):  20-57%   -> 0.4%
  hint hit rate (n=0):               -> 99%

Memory cost: ~320 B per zone + ~2.6 KB per CPU
(MAX_NR_ZONES * NR_PAGE_ORDERS * MIGRATE_PCPTYPES * sizeof(slot)).
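
The size formula above can be checked with a small helper like the one
below. The helper name hint_table_bytes() is hypothetical, and the
input values used in the note that follows are assumptions picked to
reproduce the quoted totals, not values read from the test config.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one hint table: zones * orders * migratetypes * slot size. */
static size_t hint_table_bytes(size_t zones, size_t orders,
			       size_t migratetypes, size_t slot_size)
{
	assert(slot_size > 0);
	return zones * orders * migratetypes * slot_size;
}
```

Assuming 8-byte pointers, a per-zone entry is one pointer (8 B) and a
per-CPU slot is two (16 B). The quoted totals are then reproduced by,
for example, 4 zone types and 40 (order, migratetype) pairs:
hint_table_bytes(1, 10, 4, 8) is 320 and hint_table_bytes(4, 10, 4, 16)
is 2560. The split of 40 into 10 * 4 is a guess for illustration only.
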
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 11 +++
mm/internal.h | 2 +
mm/mm_init.c | 8 ++
mm/page_alloc.c | 173 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 194 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46eb5012d18b..c9c248d5b14e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1111,6 +1111,17 @@ struct zone {
struct list_head spb_isolated; /* fully isolated (1GB contig alloc) */
struct list_head spb_lists[__NR_SB_CATEGORIES][__NR_SB_FULLNESS];
+ /*
+ * PASS_1 fast-path hint: most-recent SPB that satisfied a
+ * (order, mt) PASS_1 allocation. Stale hints are harmless -- the hint
+ * try-alloc just falls through to the standard list walk on miss.
+ * Sized for [0..NR_PAGE_ORDERS) x PCPTYPES; HIGHATOMIC/CMA/ISOLATE
+ * skip the hint (already cheap or rare). Invalidated by
+ * spb_invalidate_warm_hints() when the SPB array is resized
+ * (memory hotplug add).
+ */
+ struct superpageblock *sb_hint[NR_PAGE_ORDERS][MIGRATE_PCPTYPES];
+
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
diff --git a/mm/internal.h b/mm/internal.h
index 9854d76ebf36..3a847dcfb03f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1119,6 +1119,8 @@ static inline void superpageblock_set_has_movable(struct zone *zone,
void resize_zone_superpageblocks(struct zone *zone);
#endif
+void spb_invalidate_warm_hints(struct zone *zone);
+
struct cma;
#ifdef CONFIG_CMA
diff --git a/mm/mm_init.c b/mm/mm_init.c
index af71ef8393c6..19a338ed1bdf 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1837,6 +1837,14 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
zone->superpageblock_base_pfn = new_sb_base;
zone->spb_kvmalloced = true;
+ /*
+ * Invalidate PASS_1 hints under zone->lock so that no
+ * concurrent allocator (also entering __rmqueue_smallest under
+ * zone->lock) can dereference an old SPB pointer that is about
+ * to be freed below.
+ */
+ spb_invalidate_warm_hints(zone);
+
spin_unlock_irqrestore(&zone->lock, flags);
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6dadfe9d59d9..116d9cc0a493 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2854,6 +2854,109 @@ struct spb_tainted_walk {
bool saw_below_reserve; /* tainted SPB has nr_free <= spb_tainted_reserve */
};
+/*
+ * PASS_1 fast-path hint: most-recent SPB this CPU successfully
+ * allocated from for a given (zone, order, migratetype). Combined with
+ * the per-zone zone->sb_hint[][], this lets PASS_1 skip the linear walk
+ * of spb_lists[cat][full] in the common case. Stale hints are
+ * harmless -- the try-alloc just falls through to the standard list walk
+ * on miss.
+ *
+ * The slot stores both the zone pointer and the SPB pointer because
+ * zone_idx(zone) is per-pgdat (not globally unique on NUMA), so two
+ * nodes' ZONE_NORMAL share the same array index. The zone-pointer check
+ * on read prevents a cross-node SPB from being handed back to the wrong
+ * zone (which would corrupt per-zone NR_FREE_PAGES accounting).
+ */
+struct spb_warm_hint_slot {
+ struct zone *zone;
+ struct superpageblock *sb;
+};
+struct spb_warm_hints {
+ struct spb_warm_hint_slot slot[MAX_NR_ZONES][NR_PAGE_ORDERS][MIGRATE_PCPTYPES];
+};
+static DEFINE_PER_CPU(struct spb_warm_hints, spb_warm_hints);
+
+/**
+ * spb_invalidate_warm_hints - drop all cached hints into @zone
+ * @zone: zone whose SPB array is about to change
+ *
+ * Called from memory hotplug paths that resize zone->superpageblocks
+ * (and therefore invalidate every SPB pointer for @zone). Must be
+ * called with zone->lock held; the lock serializes against any CPU
+ * doing a hint read inside __rmqueue_smallest (also under zone->lock),
+ * so callers see either pre-invalidation state (old SPB pointers,
+ * still-valid old array) or post-invalidation state (NULL slots) --
+ * never a half-state with stale pointers into a freed array.
+ */
+void spb_invalidate_warm_hints(struct zone *zone)
+{
+ enum zone_type zidx = zone_idx(zone);
+ int cpu, order, mt;
+
+ lockdep_assert_held(&zone->lock);
+
+ memset(zone->sb_hint, 0, sizeof(zone->sb_hint));
+
+ for_each_possible_cpu(cpu) {
+ struct spb_warm_hints *h = per_cpu_ptr(&spb_warm_hints, cpu);
+
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ if (h->slot[zidx][order][mt].zone != zone)
+ continue;
+ h->slot[zidx][order][mt].zone = NULL;
+ h->slot[zidx][order][mt].sb = NULL;
+ }
+ }
+ }
+}
+
+/*
+ * Try to allocate from a single SPB using PASS_1 semantics:
+ * whole pageblock first (PCP-buddy friendly), then sub-pageblock.
+ * Returns the page on success, NULL on miss. Caller is responsible
+ * for hint updates and shrinker queueing.
+ */
+static struct page *try_alloc_from_sb_pass1(struct zone *zone,
+ struct superpageblock *sb,
+ unsigned int order,
+ int migratetype)
+{
+ unsigned int current_order;
+ struct free_area *area;
+ struct page *page;
+
+ if (!sb->nr_free_pages)
+ return NULL;
+
+ for (current_order = max(order, pageblock_order);
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page, order,
+ current_order, migratetype);
+ return page;
+ }
+ if (order < pageblock_order) {
+ for (current_order = order;
+ current_order < pageblock_order;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page, order,
+ current_order, migratetype);
+ return page;
+ }
+ }
+ return NULL;
+}
+
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype, struct spb_tainted_walk *walk)
@@ -2875,6 +2978,58 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
};
int movable = (migratetype == MIGRATE_MOVABLE) ? 1 : 0;
+ /*
+ * PASS_1 fast-path: try per-CPU then per-zone hint SPB before the
+ * linear list walk. The hint stores the SPB that last satisfied a
+ * PASS_1 alloc for this (zone, order, migratetype). On hit, we
+ * skip the entire spb_lists walk. Skip for HIGHATOMIC/CMA/ISOLATE
+ * -- those paths are already cheap (atomic-NORETRY skip) or rare.
+ */
+ if (migratetype < MIGRATE_PCPTYPES) {
+ enum zone_type zidx = zone_idx(zone);
+ struct superpageblock *cpu_hint = NULL, *zone_hint;
+ struct spb_warm_hint_slot *slot;
+
+ slot = this_cpu_ptr(
+ &spb_warm_hints.slot[zidx][order][migratetype]);
+ /*
+ * Validate slot->zone == zone: zone_idx is per-pgdat, so
+ * on NUMA the same slot index is shared by every node's
+ * zone of this type. Without this check, a hint written
+ * from one node would be returned to allocations on
+ * another node and corrupt the wrong zone's accounting.
+ */
+ if (slot->zone == zone)
+ cpu_hint = slot->sb;
+ if (cpu_hint) {
+ page = try_alloc_from_sb_pass1(zone, cpu_hint,
+ order, migratetype);
+ if (page) {
+ spb_react_to_tainted_alloc(cpu_hint, zone);
+ trace_mm_page_alloc_zone_locked(page, order,
+ migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ zone_hint = zone->sb_hint[order][migratetype];
+ if (zone_hint && zone_hint != cpu_hint) {
+ page = try_alloc_from_sb_pass1(zone, zone_hint,
+ order, migratetype);
+ if (page) {
+ spb_react_to_tainted_alloc(zone_hint, zone);
+ slot->zone = zone;
+ slot->sb = zone_hint;
+ trace_mm_page_alloc_zone_locked(page, order,
+ migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+
/*
* Search per-superpageblock free lists for pages of the requested
* migratetype, walking superpageblocks from fullest to emptiest
@@ -2940,6 +3095,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ if (migratetype < MIGRATE_PCPTYPES) {
+ struct spb_warm_hint_slot *slot;
+
+ zone->sb_hint[order][migratetype] = sb;
+ slot = this_cpu_ptr(&spb_warm_hints.slot
+ [zone_idx(zone)][order][migratetype]);
+ slot->zone = zone;
+ slot->sb = sb;
+ }
return page;
}
/* Then try sub-pageblock (no PCP buddy) */
@@ -2961,6 +3125,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ if (migratetype < MIGRATE_PCPTYPES) {
+ struct spb_warm_hint_slot *slot;
+
+ zone->sb_hint[order][migratetype] = sb;
+ slot = this_cpu_ptr(&spb_warm_hints.slot
+ [zone_idx(zone)][order][migratetype]);
+ slot->zone = zone;
+ slot->sb = sb;
+ }
return page;
}
}
--
2.54.0