From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 26/40] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback
Date: Wed, 20 May 2026 10:59:32 -0400
Message-ID: <20260520150018.2491267-27-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

A coarse bail-out gate in get_page_from_freelist's slowpath retry,
returning NULL to keep atomic-shape allocations from tainting clean
SPBs, would break early boot in QEMU: cred_init's slab cache creation
reaches the slowpath with gfp = __GFP_COMP (gfp_allowed_mask =
GFP_BOOT_MASK strips __GFP_RECLAIM from GFP_KERNEL during boot), has
no fallback path, and panics when a coarse gate refuses the
allocation.

Add a finer-grained refusal anchored in __rmqueue, where the SPB-aware
free-list walk already runs:

- Add ALLOC_HIGHORDER_OPTIONAL, set in gfp_to_alloc_flags() for two
  shapes:

    1. Explicit fallback declaration: __GFP_NORETRY without
       __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
       skb_page_frag_refill on full sockets, etc.

    2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no
       __GFP_NOMEMALLOC, no __GFP_NOFAIL. Catches GFP_ATOMIC,
       GFP_NOWAIT, including ALLOC_HIGHATOMIC consumers (which still
       get a second crack at the dedicated MIGRATE_HIGHATOMIC reserve
       in rmqueue_buddy after __rmqueue returns NULL).

  __GFP_MEMALLOC and __GFP_NOFAIL never get the flag -- they must
  succeed even at the cost of fresh-SPB taint.

- Add struct spb_tainted_walk to record what __rmqueue_smallest's
  Pass 1 saw on the SB_TAINTED list (any free pages, any free PB,
  below-reserve pageblock count). Thread it through the function's
  new fourth argument; non-walking call sites pass NULL.

- In __rmqueue, allocate the walk on the stack for callers with
  ALLOC_HIGHORDER_OPTIONAL set on a non-movable, non-CMA migratetype.
  Force *mode back to RMQUEUE_NORMAL on every call so rmqueue_bulk
  Phase 3 can't reuse a memoised RMQUEUE_CLAIM/STEAL state to skip
  the gate across iterations.

- After __rmqueue_smallest returns NULL, check the walk: if a tainted
  SPB has free pages or a free pageblock that could absorb this
  allocation after evacuation, return NULL and bump
  SPB_HIGHORDER_REFUSED. Skip RMQUEUE_CLAIM and RMQUEUE_STEAL
  entirely (both can taint clean SPBs). The slowpath will eventually
  drop NOFRAGMENT and let the allocation proceed only for the
  callers that lack ALLOC_HIGHORDER_OPTIONAL -- i.e. the truly
  must-not-fail consumers.

- Before falling through to Pass 3 (empty SPBs) inside
  __rmqueue_smallest, kick queue_spb_evacuate() when the walk saw a
  tainted SPB below its reserve threshold, so future allocations
  have a movable-evicted home in an already-tainted SPB.

- Add SPB_HIGHORDER_REFUSED vm event counter (events, not refused
  allocations: a single high-level alloc that retries can be counted
  multiple times across per-zone attempts).
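
For readers who want to check which gfp masks end up flagged, the
two-shape test from gfp_to_alloc_flags() can be modelled in user space.
This is only a sketch: the X_GFP_* bit values and the function name
highorder_optional() are stand-ins invented for illustration (the real
__GFP_* bits live in the kernel headers and differ), and the logic
mirrors the posted hunk as written, including its __GFP_NOMEMALLOC test.

```c
#include <stdbool.h>

/*
 * Stand-in bit values, for illustration only. They are NOT the kernel's
 * __GFP_* values; only the relationships between the flags matter here.
 */
#define X_GFP_DIRECT_RECLAIM	0x01u
#define X_GFP_NORETRY		0x02u
#define X_GFP_RETRY_MAYFAIL	0x04u
#define X_GFP_NOFAIL		0x08u
#define X_GFP_NOMEMALLOC	0x10u

/*
 * Hypothetical helper: returns true when the mask would receive
 * ALLOC_HIGHORDER_OPTIONAL under the two shapes described above.
 */
static bool highorder_optional(unsigned int gfp)
{
	/* Shape 1: explicit fallback, NORETRY without RETRY_MAYFAIL. */
	if ((gfp & X_GFP_NORETRY) && !(gfp & X_GFP_RETRY_MAYFAIL))
		return true;

	/* Shape 2: atomic shape, none of these three bits set. */
	return !(gfp & (X_GFP_DIRECT_RECLAIM | X_GFP_NOMEMALLOC |
			X_GFP_NOFAIL));
}
```

Under this model a mask with no reclaim bits at all (the early-boot
__GFP_COMP case, where __GFP_COMP is not one of the tested bits) is
flagged through shape 2, while a blocking mask with only
X_GFP_DIRECT_RECLAIM set is not.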

The early-boot SB_TAINTED list is empty, so the walk records nothing,
the refusal does not engage, and __rmqueue falls through to
RMQUEUE_CLAIM, which taints the first SPB normally (the first taint is
unavoidable). cred_init's slab cache creation succeeds and boot
completes.
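
The early-boot behaviour can be traced with a small user-space model of
the refusal check. The struct layout is copied from the patch; the
helper name refuse_fragmenting_fallback() and its boolean parameters are
hypothetical, introduced here only to make the decision testable outside
the kernel. A sketch, not the kernel code:

```c
#include <stdbool.h>

/* Same fields as the struct added to mm/page_alloc.c by this patch. */
struct spb_tainted_walk {
	bool saw_free_pages;	/* tainted SPB has any free pages */
	bool saw_free_pb;	/* tainted SPB has a free pageblock */
	bool saw_below_reserve;	/* tainted SPB at or below its reserve */
};

/*
 * Hypothetical helper modelling the check made in __rmqueue after
 * __rmqueue_smallest returned NULL. "optional" stands for
 * ALLOC_HIGHORDER_OPTIONAL being set; "movable_or_cma" stands for the
 * migratetype being MIGRATE_MOVABLE or CMA, where no walk is recorded.
 */
static bool refuse_fragmenting_fallback(bool optional, bool movable_or_cma,
					const struct spb_tainted_walk *walk)
{
	if (!optional || movable_or_cma)
		return false;	/* no walk is tracked, so never refuse */

	/* saw_below_reserve only drives the evacuation kick, not refusal. */
	return walk->saw_free_pages || walk->saw_free_pb;
}
```

With a zero-initialised walk, as at early boot when no SPB is tainted
yet, the helper returns false even for an optional caller, so the
allocation proceeds to the claim path. Once any tainted SPB reports free
pages or a free pageblock, the same caller is refused.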

Tested in a 16 GB QEMU VM under combined sb-stress + UDP-loopback +
fork/mmap storms (~480s); 2 tainted Normal SPBs out of 13 (boot
baseline 1, +1 during stress); 11 clean SPBs distributed movable load;
no kernel BUG, oops, hang, or panic.

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/vm_event_item.h | 5 ++
mm/internal.h | 1 +
mm/page_alloc.c | 115 ++++++++++++++++++++++++++++++++--
mm/vmstat.c | 1 +
4 files changed, 116 insertions(+), 6 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 03fe95f5a020..4a8513d5fc3e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -76,6 +76,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
CMA_ALLOC_SUCCESS,
CMA_ALLOC_FAIL,
#endif
+ SPB_HIGHORDER_REFUSED, /*
+ * refused fragmenting fallback to keep
+ * a clean SPB clean when a tainted SPB
+ * still has free pageblocks
+ */
UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
UNEVICTABLE_PGRESCUED, /* rescued from noreclaim list */
diff --git a/mm/internal.h b/mm/internal.h
index e6d61dbc18d9..f52575202a96 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1512,6 +1512,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_TRYLOCK 0x400 /* Only use spin_trylock in allocation path */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
#define ALLOC_NOFRAG_TAINTED_OK 0x1000 /* NOFRAGMENT, but allow steal from tainted SPBs */
+#define ALLOC_HIGHORDER_OPTIONAL 0x2000 /* caller can fall back to a lower order */
/* Flags that allow allocations below the min watermark. */
#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfbfed056bbb..e4ecddb428c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2799,9 +2799,21 @@ static struct page *try_to_claim_block(struct zone *zone, struct page *page,
int block_type, unsigned int alloc_flags,
bool from_tainted_spb);
+/*
+ * Snapshot of tainted-SPB state observed while __rmqueue_smallest walks the
+ * free lists. Lets the caller (currently __rmqueue) decide whether to refuse
+ * a fragmenting fallback when an existing tainted SPB could absorb the demand
+ * once it is evacuated.
+ */
+struct spb_tainted_walk {
+ bool saw_free_pages; /* tainted SPB has any free pages, any order */
+ bool saw_free_pb; /* tainted SPB has at least one free pageblock */
+ bool saw_below_reserve; /* tainted SPB has nr_free <= spb_tainted_reserve */
+};
+
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
- int migratetype)
+ int migratetype, struct spb_tainted_walk *walk)
{
unsigned int current_order;
struct free_area *area;
@@ -2850,6 +2862,20 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
+ /*
+ * Snapshot tainted-SPB capacity before the
+ * nr_free_pages skip: an SPB with a free pageblock
+ * but nothing on the requested-MT freelist still
+ * counts as "could absorb this allocation after evac".
+ */
+ if (walk && cat == SB_TAINTED) {
+ if (sb->nr_free_pages)
+ walk->saw_free_pages = true;
+ if (sb->nr_free)
+ walk->saw_free_pb = true;
+ if (sb->nr_free <= spb_tainted_reserve(sb))
+ walk->saw_below_reserve = true;
+ }
if (!sb->nr_free_pages)
continue;
/* Try whole pageblock (or larger) first for PCP buddy */
@@ -2975,6 +3001,16 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
+ /*
+ * About to fall through to Pass 3 (empty SPBs) or Pass 4 fallback,
+ * which risks tainting a clean SPB. If the tainted-SPB walk above
+ * showed that some tainted SPB is below its reserve threshold of
+ * free pageblocks, kick deferred evacuation so future allocations
+ * have a movable-evicted home in an already-tainted SPB.
+ */
+ if (walk && walk->saw_below_reserve)
+ queue_spb_evacuate(zone, order, migratetype);
+
/* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
if (!sb->nr_free_pages)
@@ -3098,7 +3134,7 @@ static inline bool noncompatible_cross_type(int start_type, int fallback_type)
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order)
{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
+ return __rmqueue_smallest(zone, order, MIGRATE_CMA, NULL);
}
#else
static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
@@ -3573,7 +3609,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
if (sb)
spb_update_list(sb);
#endif
- return __rmqueue_smallest(zone, order, start_type);
+ return __rmqueue_smallest(zone, order, start_type, NULL);
}
/*
@@ -3920,8 +3956,29 @@ static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
unsigned int alloc_flags, enum rmqueue_mode *mode)
{
+ struct spb_tainted_walk walk = { };
+ struct spb_tainted_walk *walkp = NULL;
struct page *page;
+ /*
+ * Track tainted-SPB state for non-movable, non-CMA callers that
+ * signaled they have a cheap fallback (atomic shape or explicit
+ * NORETRY). We use that to refuse a fragmenting CLAIM/STEAL when a
+ * tainted SPB still has free pageblocks waiting to be evacuated.
+ *
+ * Force *mode back to RMQUEUE_NORMAL so the walk + refusal check
+ * runs on every call. rmqueue_bulk Phase 3 chains many __rmqueue
+ * calls reusing *mode; without this reset, a single successful
+ * RMQUEUE_CLAIM/STEAL on the first iteration would let every
+ * subsequent iteration skip the case RMQUEUE_NORMAL block and taint
+ * additional clean SPBs unchecked.
+ */
+ if (migratetype != MIGRATE_MOVABLE && !is_migrate_cma(migratetype) &&
+ (alloc_flags & ALLOC_HIGHORDER_OPTIONAL)) {
+ walkp = &walk;
+ *mode = RMQUEUE_NORMAL;
+ }
+
if (IS_ENABLED(CONFIG_CMA)) {
/*
* Balance movable allocations between regular and CMA areas by
@@ -3948,9 +4005,22 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
*/
switch (*mode) {
case RMQUEUE_NORMAL:
- page = __rmqueue_smallest(zone, order, migratetype);
+ page = __rmqueue_smallest(zone, order, migratetype, walkp);
if (page)
return page;
+ /*
+ * Refuse to fragment a clean SPB when a tainted SPB already
+ * holds free pages or a free pageblock that could absorb
+ * this allocation after evacuation. The caller has a cheap
+ * fallback (lower-order retry, vmalloc, single-page fragment,
+ * drop the packet, etc.) -- better that than tainting fresh
+ * capacity. Pre-Pass-3 evac trigger in __rmqueue_smallest
+ * already kicked deferred eviction.
+ */
+ if (walkp && (walk.saw_free_pages || walk.saw_free_pb)) {
+ count_vm_event(SPB_HIGHORDER_REFUSED);
+ return NULL;
+ }
fallthrough;
case RMQUEUE_CMA:
if (alloc_flags & ALLOC_CMA) {
@@ -5073,7 +5143,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
spin_lock_irqsave(&zone->lock, flags);
}
if (alloc_flags & ALLOC_HIGHATOMIC)
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ page = __rmqueue_smallest(zone, order,
+ MIGRATE_HIGHATOMIC, NULL);
if (!page) {
enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
@@ -5086,7 +5157,9 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
* high-order atomic allocation in the future.
*/
if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ page = __rmqueue_smallest(zone, order,
+ MIGRATE_HIGHATOMIC,
+ NULL);
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
@@ -6435,6 +6508,36 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
if (defrag_mode)
alloc_flags |= ALLOC_NOFRAGMENT;
+ /*
+ * Mark callers that have a cheap fallback if the page allocator returns
+ * NULL, so __rmqueue can refuse to taint a clean SPB when an existing
+ * tainted SPB still has free pageblocks waiting to be evacuated.
+ *
+ * Two shapes qualify:
+ *
+ * 1. Explicit fallback declaration: __GFP_NORETRY without
+ * __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
+ * skb_page_frag_refill on full sockets, etc.
+ *
+ * 2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no __GFP_NOMEMALLOC,
+ * no __GFP_NOFAIL. These callers (GFP_ATOMIC, GFP_NOWAIT, including
+ * ALLOC_HIGHATOMIC consumers) have implicit fallbacks: drop the
+ * packet, demote the slab order, return ENOMEM up the slowpath,
+ * retry from process context with GFP_KERNEL, etc. ALLOC_HIGHATOMIC
+ * callers also get a second crack at the dedicated MIGRATE_HIGHATOMIC
+ * reserve in rmqueue_buddy after __rmqueue returns NULL.
+ * Tainting a 1 GiB SPB to satisfy any of them is a long-lived
+ * fragmentation event for short-lived data.
+ *
+ * __GFP_MEMALLOC (reclaim recursion) and __GFP_NOFAIL (declared cannot
+ * fail) are excluded -- they must succeed even at the cost of taint.
+ */
+ if ((gfp_mask & __GFP_NORETRY) && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+ alloc_flags |= ALLOC_HIGHORDER_OPTIONAL;
+ else if (!(gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_NOMEMALLOC |
+ __GFP_NOFAIL)))
+ alloc_flags |= ALLOC_HIGHORDER_OPTIONAL;
+
return alloc_flags;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9133254b6b87..0be1b969f493 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1388,6 +1388,7 @@ const char * const vmstat_text[] = {
[I(CMA_ALLOC_SUCCESS)] = "cma_alloc_success",
[I(CMA_ALLOC_FAIL)] = "cma_alloc_fail",
#endif
+ [I(SPB_HIGHORDER_REFUSED)] = "spb_highorder_refused",
[I(UNEVICTABLE_PGCULLED)] = "unevictable_pgs_culled",
[I(UNEVICTABLE_PGSCANNED)] = "unevictable_pgs_scanned",
[I(UNEVICTABLE_PGRESCUED)] = "unevictable_pgs_rescued",
--
2.54.0