From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 26/40] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback
Date: Wed, 20 May 2026 10:59:32 -0400
Message-ID: <20260520150018.2491267-27-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>
A coarse bail-out gate in the slowpath retry of get_page_from_freelist(),
returning NULL to keep atomic-shape allocations from tainting clean
SPBs, would break early boot in QEMU. cred_init's slab cache creation
reaches the slowpath with gfp = __GFP_COMP (during boot,
gfp_allowed_mask = GFP_BOOT_MASK strips __GFP_RECLAIM from GFP_KERNEL),
has no fallback path, and panics when a coarse gate refuses the
allocation.
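
The boot-time masking described above can be illustrated with a small
userspace sketch. It assumes the request starts out as GFP_KERNEL plus
__GFP_COMP; the SK_* names and bit values are hypothetical stand-ins
for the real flags in include/linux/gfp_types.h, chosen only so the
example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's real ones. */
typedef unsigned int gfp_t;
#define SK_GFP_IO             0x01u
#define SK_GFP_FS             0x02u
#define SK_GFP_DIRECT_RECLAIM 0x04u
#define SK_GFP_KSWAPD_RECLAIM 0x08u
#define SK_GFP_COMP           0x10u
#define SK_GFP_BITS_MASK      0x1fu

#define SK_GFP_RECLAIM   (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_KERNEL    (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)
/* Boot mask: everything except reclaim, IO and FS. */
#define SK_GFP_BOOT_MASK \
	(SK_GFP_BITS_MASK & ~(SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS))

/* Apply the currently allowed mask to a request, as the allocator does. */
gfp_t sk_apply_allowed_mask(gfp_t gfp, gfp_t allowed)
{
	return gfp & allowed;
}

/* True when the request may still enter direct reclaim. */
bool sk_can_direct_reclaim(gfp_t gfp)
{
	return (gfp & SK_GFP_DIRECT_RECLAIM) != 0;
}

/* A GFP_KERNEL-style request during boot ends up with only the COMP bit. */
void sk_boot_mask_demo(void)
{
	gfp_t gfp = sk_apply_allowed_mask(SK_GFP_KERNEL | SK_GFP_COMP,
					  SK_GFP_BOOT_MASK);

	assert(gfp == SK_GFP_COMP);
	assert(!sk_can_direct_reclaim(gfp));
}
```

With the reclaim bits gone, the request looks like an atomic-shape
allocation to any gate that keys only on the absence of direct reclaim,
even though the caller cannot tolerate failure.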
Add a finer-grained refusal anchored in __rmqueue, where the SPB-aware
free-list walk already runs:

- Add ALLOC_HIGHORDER_OPTIONAL, set in gfp_to_alloc_flags() for two
  shapes:

  1. Explicit fallback declaration: __GFP_NORETRY without
     __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
     skb_page_frag_refill on full sockets, etc.

  2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no
     __GFP_NOMEMALLOC, no __GFP_NOFAIL. Catches GFP_ATOMIC and
     GFP_NOWAIT, including ALLOC_HIGHATOMIC consumers (which still
     get a second crack at the dedicated MIGRATE_HIGHATOMIC reserve
     in rmqueue_buddy after __rmqueue returns NULL).

  __GFP_MEMALLOC and __GFP_NOFAIL never get the flag -- they must
  succeed even at the cost of fresh-SPB taint.

- Add struct spb_tainted_walk to record what Pass 1 of
  __rmqueue_smallest saw on the SB_TAINTED list (any free pages, any
  free pageblock, any SPB at or below its reserve of free
  pageblocks). Thread it through the function's new fourth argument;
  non-walking call sites pass NULL.

- In __rmqueue, enable the on-stack walk for callers with
  ALLOC_HIGHORDER_OPTIONAL set on a non-movable, non-CMA migratetype.
  For those callers, force *mode back to RMQUEUE_NORMAL on every call
  so rmqueue_bulk Phase 3 can't reuse a memoised RMQUEUE_CLAIM/STEAL
  state to skip the gate across iterations.

- After __rmqueue_smallest returns NULL, check the walk: if a tainted
  SPB has free pages or a free pageblock that could absorb this
  allocation after evacuation, return NULL and bump
  SPB_HIGHORDER_REFUSED. Skip RMQUEUE_CLAIM and RMQUEUE_STEAL
  entirely (both can taint clean SPBs). The slowpath will eventually
  drop NOFRAGMENT and let the allocation proceed only for the
  callers that lack ALLOC_HIGHORDER_OPTIONAL -- i.e. the truly
  must-not-fail consumers.

- Before falling through to Pass 3 (empty SPBs) inside
  __rmqueue_smallest, kick queue_spb_evacuate() when the walk saw a
  tainted SPB at or below its reserve threshold, so that future
  allocations find space, freed by evicting movable pages, in an
  already-tainted SPB.

- Add the SPB_HIGHORDER_REFUSED vm event counter. It counts refusal
  events, not refused allocations: a single high-level allocation
  that retries can be counted multiple times across per-zone
  attempts.
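
As a rough illustration of the two shapes, the classification can be
modelled in userspace as below. The sketch mirrors the test as it is
written in the gfp_to_alloc_flags() hunk of this patch, including its
check of __GFP_NOMEMALLOC. The SK_* names and bit values are
hypothetical stand-ins for the kernel flags, and sk_highorder_optional()
is an illustrative helper, not a function added by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; not the kernel's real ones. */
typedef unsigned int gfp_t;
#define SK_GFP_DIRECT_RECLAIM 0x01u
#define SK_GFP_KSWAPD_RECLAIM 0x02u
#define SK_GFP_NORETRY        0x04u
#define SK_GFP_RETRY_MAYFAIL  0x08u
#define SK_GFP_NOFAIL         0x10u
#define SK_GFP_NOMEMALLOC     0x20u
#define SK_GFP_HIGH           0x40u

#define SK_GFP_KERNEL (SK_GFP_DIRECT_RECLAIM | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_ATOMIC (SK_GFP_HIGH | SK_GFP_KSWAPD_RECLAIM)
#define SK_GFP_NOWAIT (SK_GFP_KSWAPD_RECLAIM)

/* Would this request be marked as having a cheap fallback? */
bool sk_highorder_optional(gfp_t gfp)
{
	/* Shape 1: explicit fallback declaration. */
	if ((gfp & SK_GFP_NORETRY) && !(gfp & SK_GFP_RETRY_MAYFAIL))
		return true;

	/* Shape 2: atomic-context shape, same bits as the patch tests. */
	if (!(gfp & (SK_GFP_DIRECT_RECLAIM | SK_GFP_NOMEMALLOC |
		     SK_GFP_NOFAIL)))
		return true;

	return false;
}

/* A few representative requests and how the sketch classifies them. */
void sk_classify_demo(void)
{
	assert(sk_highorder_optional(SK_GFP_ATOMIC));
	assert(sk_highorder_optional(SK_GFP_NOWAIT));
	assert(sk_highorder_optional(SK_GFP_KERNEL | SK_GFP_NORETRY));
	assert(!sk_highorder_optional(SK_GFP_KERNEL));
	assert(!sk_highorder_optional(SK_GFP_KERNEL | SK_GFP_NOFAIL));
	assert(!sk_highorder_optional(SK_GFP_ATOMIC | SK_GFP_NOFAIL));
}
```

A plain GFP_KERNEL-style request falls into neither shape, so it keeps
the existing claim/steal behaviour.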
The early-boot SB_TAINTED list is empty, so the walk records nothing,
the refusal does not engage, and __rmqueue falls through to
RMQUEUE_CLAIM which taints the first SPB normally (the first taint is
unavoidable). cred_init's slab create succeeds, boot succeeds.
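
The walk-then-refuse decision, including the empty-list case at boot,
can be sketched in userspace as follows. struct spb_tainted_walk has
the same three fields as in the patch; struct sk_spb, sk_walk_tainted()
and sk_should_refuse() are hypothetical simplifications of the
superpageblock, the Pass 1 snapshot, and the check in __rmqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same fields as the struct added by the patch. */
struct spb_tainted_walk {
	bool saw_free_pages;
	bool saw_free_pb;
	bool saw_below_reserve;
};

/* Hypothetical, simplified stand-in for one tainted superpageblock. */
struct sk_spb {
	unsigned long nr_free_pages;	/* free pages of any order */
	unsigned long nr_free;		/* free whole pageblocks */
	unsigned long reserve;		/* reserve threshold, in pageblocks */
};

/* Record what a walk over the tainted list observes. */
void sk_walk_tainted(const struct sk_spb *list, size_t n,
		     struct spb_tainted_walk *walk)
{
	for (size_t i = 0; i < n; i++) {
		if (list[i].nr_free_pages)
			walk->saw_free_pages = true;
		if (list[i].nr_free)
			walk->saw_free_pb = true;
		if (list[i].nr_free <= list[i].reserve)
			walk->saw_below_reserve = true;
	}
}

/* Refuse only when a walk ran and some tainted SPB had capacity. */
bool sk_should_refuse(const struct spb_tainted_walk *walk)
{
	return walk && (walk->saw_free_pages || walk->saw_free_pb);
}

void sk_refusal_demo(void)
{
	struct spb_tainted_walk boot = { 0 };
	struct spb_tainted_walk busy = { 0 };
	struct sk_spb tainted[] = { { 10, 0, 2 } };

	/* Early boot: empty tainted list, nothing recorded, no refusal. */
	sk_walk_tainted(NULL, 0, &boot);
	assert(!sk_should_refuse(&boot));

	/* One tainted SPB with free pages: refuse and flag for evacuation. */
	sk_walk_tainted(tainted, 1, &busy);
	assert(sk_should_refuse(&busy));
	assert(busy.saw_below_reserve);

	/* Callers without the flag pass no walk and are never refused. */
	assert(!sk_should_refuse(NULL));
}
```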
Tested in a 16 GB QEMU VM under a combined sb-stress, UDP-loopback and
fork/mmap storm workload (~480 s). Result: 2 of 13 Normal SPBs tainted
(boot baseline 1, +1 during stress); the movable load was distributed
across the 11 clean SPBs; no kernel BUG, oops, hang, or panic.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/vm_event_item.h | 5 ++
mm/internal.h | 1 +
mm/page_alloc.c | 115 ++++++++++++++++++++++++++++++++--
mm/vmstat.c | 1 +
4 files changed, 116 insertions(+), 6 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..4a8513d5fc3e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -76,6 +76,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
CMA_ALLOC_SUCCESS,
CMA_ALLOC_FAIL,
#endif
+ SPB_HIGHORDER_REFUSED, /*
+ * refused fragmenting fallback to keep
+ * a clean SPB clean when a tainted SPB
+ * still has free pageblocks
+ */
UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
UNEVICTABLE_PGRESCUED, /* rescued from noreclaim list */
diff --git a/mm/internal.h b/mm/internal.h
index e6d61dbc18d9..f52575202a96 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1512,6 +1512,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_TRYLOCK 0x400 /* Only use spin_trylock in allocation path */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
#define ALLOC_NOFRAG_TAINTED_OK 0x1000 /* NOFRAGMENT, but allow steal from tainted SPBs */
+#define ALLOC_HIGHORDER_OPTIONAL 0x2000 /* caller can fall back to a lower order */
/* Flags that allow allocations below the min watermark. */
#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfbfed056bbb..e4ecddb428c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2799,9 +2799,21 @@ static struct page *try_to_claim_block(struct zone *zone, struct page *page,
int block_type, unsigned int alloc_flags,
bool from_tainted_spb);
+/*
+ * Snapshot of tainted-SPB state observed while __rmqueue_smallest walks the
+ * free lists. Lets the caller (currently __rmqueue) decide whether to refuse
+ * a fragmenting fallback when an existing tainted SPB could absorb the demand
+ * once it is evacuated.
+ */
+struct spb_tainted_walk {
+	bool saw_free_pages;	/* tainted SPB has any free pages, any order */
+	bool saw_free_pb;	/* tainted SPB has at least one free pageblock */
+	bool saw_below_reserve;	/* tainted SPB has nr_free <= spb_tainted_reserve */
+};
+
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
- int migratetype)
+ int migratetype, struct spb_tainted_walk *walk)
{
unsigned int current_order;
struct free_area *area;
@@ -2850,6 +2862,20 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
+ /*
+ * Snapshot tainted-SPB capacity before the
+ * nr_free_pages skip: an SPB with a free pageblock
+ * but nothing on the requested-MT freelist still
+ * counts as "could absorb this allocation after evac".
+ */
+ if (walk && cat == SB_TAINTED) {
+ if (sb->nr_free_pages)
+ walk->saw_free_pages = true;
+ if (sb->nr_free)
+ walk->saw_free_pb = true;
+ if (sb->nr_free <= spb_tainted_reserve(sb))
+ walk->saw_below_reserve = true;
+ }
if (!sb->nr_free_pages)
continue;
/* Try whole pageblock (or larger) first for PCP buddy */
@@ -2975,6 +3001,16 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
+ /*
+ * About to fall through to Pass 3 (empty SPBs) or Pass 4 fallback,
+ * which risks tainting a clean SPB. If the tainted-SPB walk above
+ * showed that some tainted SPB is below its reserve threshold of
+ * free pageblocks, kick deferred evacuation so future allocations
+ * have a movable-evicted home in an already-tainted SPB.
+ */
+ if (walk && walk->saw_below_reserve)
+ queue_spb_evacuate(zone, order, migratetype);
+
/* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
if (!sb->nr_free_pages)
@@ -3098,7 +3134,7 @@ static inline bool noncompatible_cross_type(int start_type, int fallback_type)
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order)
{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
+ return __rmqueue_smallest(zone, order, MIGRATE_CMA, NULL);
}
#else
static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
@@ -3573,7 +3609,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
if (sb)
spb_update_list(sb);
#endif
- return __rmqueue_smallest(zone, order, start_type);
+ return __rmqueue_smallest(zone, order, start_type, NULL);
}
/*
@@ -3920,8 +3956,29 @@ static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
unsigned int alloc_flags, enum rmqueue_mode *mode)
{
+	struct spb_tainted_walk walk = { };
+	struct spb_tainted_walk *walkp = NULL;
 	struct page *page;
+	/*
+	 * Track tainted-SPB state for non-movable, non-CMA callers that
+	 * signaled they have a cheap fallback (atomic shape or explicit
+	 * NORETRY). We use that to refuse a fragmenting CLAIM/STEAL when a
+	 * tainted SPB still has free pageblocks waiting to be evacuated.
+	 *
+	 * Force *mode back to RMQUEUE_NORMAL so the walk + refusal check
+	 * runs on every call. rmqueue_bulk Phase 3 chains many __rmqueue
+	 * calls reusing *mode; without this reset, a single successful
+	 * RMQUEUE_CLAIM/STEAL on the first iteration would let every
+	 * subsequent iteration skip the case RMQUEUE_NORMAL block and taint
+	 * additional clean SPBs unchecked.
+	 */
+	if (migratetype != MIGRATE_MOVABLE && !is_migrate_cma(migratetype) &&
+	    (alloc_flags & ALLOC_HIGHORDER_OPTIONAL)) {
+		walkp = &walk;
+		*mode = RMQUEUE_NORMAL;
+	}
+
 	if (IS_ENABLED(CONFIG_CMA)) {
 		/*
 		 * Balance movable allocations between regular and CMA areas by
@@ -3948,9 +4005,22 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 	 */
 	switch (*mode) {
 	case RMQUEUE_NORMAL:
-		page = __rmqueue_smallest(zone, order, migratetype);
+		page = __rmqueue_smallest(zone, order, migratetype, walkp);
 		if (page)
 			return page;
+		/*
+		 * Refuse to fragment a clean SPB when a tainted SPB already
+		 * holds free pages or a free pageblock that could absorb
+		 * this allocation after evacuation. The caller has a cheap
+		 * fallback (lower-order retry, vmalloc, single-page fragment,
+		 * drop the packet, etc.) -- better that than tainting fresh
+		 * capacity. Pre-Pass-3 evac trigger in __rmqueue_smallest
+		 * already kicked deferred eviction.
+		 */
+		if (walkp && (walk.saw_free_pages || walk.saw_free_pb)) {
+			count_vm_event(SPB_HIGHORDER_REFUSED);
+			return NULL;
+		}
 		fallthrough;
 	case RMQUEUE_CMA:
 		if (alloc_flags & ALLOC_CMA) {
@@ -5073,7 +5143,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
spin_lock_irqsave(&zone->lock, flags);
}
if (alloc_flags & ALLOC_HIGHATOMIC)
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ page = __rmqueue_smallest(zone, order,
+ MIGRATE_HIGHATOMIC, NULL);
if (!page) {
enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
@@ -5086,7 +5157,9 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
* high-order atomic allocation in the future.
*/
if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ page = __rmqueue_smallest(zone, order,
+ MIGRATE_HIGHATOMIC,
+ NULL);
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
@@ -6435,6 +6508,36 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
 	if (defrag_mode)
 		alloc_flags |= ALLOC_NOFRAGMENT;
+	/*
+	 * Mark callers that have a cheap fallback if the page allocator returns
+	 * NULL, so __rmqueue can refuse to taint a clean SPB when an existing
+	 * tainted SPB still has free pageblocks waiting to be evacuated.
+	 *
+	 * Two shapes qualify:
+	 *
+	 * 1. Explicit fallback declaration: __GFP_NORETRY without
+	 *    __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
+	 *    skb_page_frag_refill on full sockets, etc.
+	 *
+	 * 2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no __GFP_NOMEMALLOC,
+	 *    no __GFP_NOFAIL. These callers (GFP_ATOMIC, GFP_NOWAIT, including
+	 *    ALLOC_HIGHATOMIC consumers) have implicit fallbacks: drop the
+	 *    packet, demote the slab order, return ENOMEM up the slowpath,
+	 *    retry from process context with GFP_KERNEL, etc. ALLOC_HIGHATOMIC
+	 *    callers also get a second crack at the dedicated MIGRATE_HIGHATOMIC
+	 *    reserve in rmqueue_buddy after __rmqueue returns NULL.
+	 *    Tainting a 1 GiB SPB to satisfy any of them is a long-lived
+	 *    fragmentation event for short-lived data.
+	 *
+	 * __GFP_MEMALLOC (reclaim recursion) and __GFP_NOFAIL (declared cannot
+	 * fail) are excluded -- they must succeed even at the cost of taint.
+	 */
+	if ((gfp_mask & __GFP_NORETRY) && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+		alloc_flags |= ALLOC_HIGHORDER_OPTIONAL;
+	else if (!(gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_NOMEMALLOC |
+			       __GFP_NOFAIL)))
+		alloc_flags |= ALLOC_HIGHORDER_OPTIONAL;
+
 	return alloc_flags;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9133254b6b87..0be1b969f493 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1388,6 +1388,7 @@ const char * const vmstat_text[] = {
[I(CMA_ALLOC_SUCCESS)] = "cma_alloc_success",
[I(CMA_ALLOC_FAIL)] = "cma_alloc_fail",
#endif
+ [I(SPB_HIGHORDER_REFUSED)] = "spb_highorder_refused",
[I(UNEVICTABLE_PGCULLED)] = "unevictable_pgs_culled",
[I(UNEVICTABLE_PGSCANNED)] = "unevictable_pgs_scanned",
[I(UNEVICTABLE_PGRESCUED)] = "unevictable_pgs_rescued",
--
2.54.0