* [RFC PATCH 01/45] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data
From: Rik van Riel @ 2026-04-30 20:20 UTC
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Johannes Weiner, Rik van Riel
From: Johannes Weiner <jweiner@meta.com>
Replace the packed pageblock_flags bitmap with a per-pageblock struct
containing its own flags word. This changes the storage from
NR_PAGEBLOCK_BITS bits per pageblock packed into shared unsigned longs,
to a dedicated unsigned long per pageblock.
The free path looks up migratetype (from pageblock flags) immediately
followed by looking up pageblock ownership. Colocating them in a struct
means this hot path touches one cache line instead of two.
The per-pageblock struct also eliminates all the bit-packing indexing
(pfn_to_bitidx, word selection, intra-word shifts), simplifying the
accessor code.
Memory overhead: 8 bytes per pageblock (one unsigned long). With 2MB
pageblocks on x86_64, that's 4KB per GB -- up from ~0.5-1 bytes per
pageblock with the packed bitmap, but still negligible in absolute terms.
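As a quick sanity check of those numbers (a back-of-the-envelope sketch, not taken from the patch itself): with 2MB pageblocks there are 512 pageblocks per GB, so

	512 blocks/GB * 8 bytes = 4 KB/GB      (struct pageblock_data)
	512 blocks/GB * 4 bits  = 256 bytes/GB (packed bitmap, no memory isolation)
	512 blocks/GB * 8 bits  = 512 bytes/GB (packed bitmap, CONFIG_MEMORY_ISOLATION)

which is where the ~0.5-1 bytes per pageblock figure above comes from.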
No functional change.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 15 ++++----
mm/internal.h | 17 +++++++++
mm/mm_init.c | 25 +++++--------
mm/page_alloc.c | 84 +++++++-----------------------------------
mm/sparse.c | 3 +-
5 files changed, 50 insertions(+), 94 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3e51190a55e4..2f202bda5ec6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -916,7 +916,7 @@ struct zone {
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
- unsigned long *pageblock_flags;
+ struct pageblock_data *pageblock_data;
#endif /* CONFIG_SPARSEMEM */
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
@@ -1866,9 +1866,6 @@ static inline bool movable_only_nodes(nodemask_t *nodes)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION-1))
-#define SECTION_BLOCKFLAGS_BITS \
- ((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS)
-
#if (MAX_PAGE_ORDER + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_PAGE_ORDER exceeds SECTION_SIZE
#endif
@@ -1901,13 +1898,17 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
#define SUBSECTION_ALIGN_UP(pfn) ALIGN((pfn), PAGES_PER_SUBSECTION)
#define SUBSECTION_ALIGN_DOWN(pfn) ((pfn) & PAGE_SUBSECTION_MASK)
+struct pageblock_data {
+ unsigned long flags;
+};
+
struct mem_section_usage {
struct rcu_head rcu;
#ifdef CONFIG_SPARSEMEM_VMEMMAP
DECLARE_BITMAP(subsection_map, SUBSECTIONS_PER_SECTION);
#endif
/* See declaration of similar field in struct zone */
- unsigned long pageblock_flags[0];
+ struct pageblock_data pageblock_data[];
};
void subsection_map_init(unsigned long pfn, unsigned long nr_pages);
@@ -1960,9 +1961,9 @@ extern struct mem_section **mem_section;
extern struct mem_section mem_section[NR_SECTION_ROOTS][SECTIONS_PER_ROOT];
#endif
-static inline unsigned long *section_to_usemap(struct mem_section *ms)
+static inline struct pageblock_data *section_to_usemap(struct mem_section *ms)
{
- return ms->usage->pageblock_flags;
+ return ms->usage->pageblock_data;
}
static inline struct mem_section *__nr_to_section(unsigned long nr)
diff --git a/mm/internal.h b/mm/internal.h
index cb0af847d7d9..bb0e0b8a4495 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -787,6 +787,23 @@ static inline struct page *find_buddy_page_pfn(struct page *page,
return NULL;
}
+static inline struct pageblock_data *pfn_to_pageblock(const struct page *page,
+ unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+ struct mem_section *ms = __pfn_to_section(pfn);
+ unsigned long idx = (pfn & (PAGES_PER_SECTION - 1)) >> pageblock_order;
+
+ return &section_to_usemap(ms)[idx];
+#else
+ struct zone *zone = page_zone(page);
+ unsigned long idx;
+
+ idx = (pfn - pageblock_start_pfn(zone->zone_start_pfn)) >> pageblock_order;
+ return &zone->pageblock_data[idx];
+#endif
+}
+
extern struct page *__pageblock_pfn_to_page(unsigned long start_pfn,
unsigned long end_pfn, struct zone *zone);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index df34797691bd..f3751fe6e5c3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1467,36 +1467,31 @@ void __meminit init_currently_empty_zone(struct zone *zone,
#ifndef CONFIG_SPARSEMEM
/*
- * Calculate the size of the zone->pageblock_flags rounded to an unsigned long
- * Start by making sure zonesize is a multiple of pageblock_order by rounding
- * up. Then use 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally
- * round what is now in bits to nearest long in bits, then return it in
- * bytes.
+ * Calculate the size of the zone->pageblock_data array.
+ * Round up the zone size to a pageblock boundary to get the
+ * number of pageblocks, then multiply by the struct size.
*/
static unsigned long __init usemap_size(unsigned long zone_start_pfn, unsigned long zonesize)
{
- unsigned long usemapsize;
+ unsigned long nr_pageblocks;
zonesize += zone_start_pfn & (pageblock_nr_pages-1);
- usemapsize = round_up(zonesize, pageblock_nr_pages);
- usemapsize = usemapsize >> pageblock_order;
- usemapsize *= NR_PAGEBLOCK_BITS;
- usemapsize = round_up(usemapsize, BITS_PER_LONG);
+ nr_pageblocks = round_up(zonesize, pageblock_nr_pages) >> pageblock_order;
- return usemapsize / BITS_PER_BYTE;
+ return nr_pageblocks * sizeof(struct pageblock_data);
}
static void __ref setup_usemap(struct zone *zone)
{
unsigned long usemapsize = usemap_size(zone->zone_start_pfn,
zone->spanned_pages);
- zone->pageblock_flags = NULL;
+ zone->pageblock_data = NULL;
if (usemapsize) {
- zone->pageblock_flags =
+ zone->pageblock_data =
memblock_alloc_node(usemapsize, SMP_CACHE_BYTES,
zone_to_nid(zone));
- if (!zone->pageblock_flags)
- panic("Failed to allocate %ld bytes for zone %s pageblock flags on node %d\n",
+ if (!zone->pageblock_data)
+ panic("Failed to allocate %ld bytes for zone %s pageblock data on node %d\n",
usemapsize, zone->name, zone_to_nid(zone));
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d4b6f1a554e..45519be08c9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -359,52 +359,18 @@ static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order)
}
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
-/* Return a pointer to the bitmap storing bits affecting a block of pages */
-static inline unsigned long *get_pageblock_bitmap(const struct page *page,
- unsigned long pfn)
-{
-#ifdef CONFIG_SPARSEMEM
- return section_to_usemap(__pfn_to_section(pfn));
-#else
- return page_zone(page)->pageblock_flags;
-#endif /* CONFIG_SPARSEMEM */
-}
-
-static inline int pfn_to_bitidx(const struct page *page, unsigned long pfn)
-{
-#ifdef CONFIG_SPARSEMEM
- pfn &= (PAGES_PER_SECTION-1);
-#else
- pfn = pfn - pageblock_start_pfn(page_zone(page)->zone_start_pfn);
-#endif /* CONFIG_SPARSEMEM */
- return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
-}
-
static __always_inline bool is_standalone_pb_bit(enum pageblock_bits pb_bit)
{
return pb_bit >= PB_compact_skip && pb_bit < __NR_PAGEBLOCK_BITS;
}
-static __always_inline void
-get_pfnblock_bitmap_bitidx(const struct page *page, unsigned long pfn,
- unsigned long **bitmap_word, unsigned long *bitidx)
+static __always_inline unsigned long *
+get_pfnblock_flags_word(const struct page *page, unsigned long pfn)
{
- unsigned long *bitmap;
- unsigned long word_bitidx;
-
-#ifdef CONFIG_MEMORY_ISOLATION
- BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 8);
-#else
- BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
-#endif
BUILD_BUG_ON(__MIGRATE_TYPE_END > MIGRATETYPE_MASK);
VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
- bitmap = get_pageblock_bitmap(page, pfn);
- *bitidx = pfn_to_bitidx(page, pfn);
- word_bitidx = *bitidx / BITS_PER_LONG;
- *bitidx &= (BITS_PER_LONG - 1);
- *bitmap_word = &bitmap[word_bitidx];
+ return &pfn_to_pageblock(page, pfn)->flags;
}
@@ -421,18 +387,14 @@ static unsigned long __get_pfnblock_flags_mask(const struct page *page,
unsigned long pfn,
unsigned long mask)
{
- unsigned long *bitmap_word;
- unsigned long bitidx;
- unsigned long word;
+ unsigned long *flags_word = get_pfnblock_flags_word(page, pfn);
- get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx);
/*
* This races, without locks, with set_pfnblock_migratetype(). Ensure
* a consistent read of the memory array, so that results, even though
* racy, are not corrupted.
*/
- word = READ_ONCE(*bitmap_word);
- return (word >> bitidx) & mask;
+ return READ_ONCE(*flags_word) & mask;
}
/**
@@ -446,15 +408,10 @@ static unsigned long __get_pfnblock_flags_mask(const struct page *page,
bool get_pfnblock_bit(const struct page *page, unsigned long pfn,
enum pageblock_bits pb_bit)
{
- unsigned long *bitmap_word;
- unsigned long bitidx;
-
if (WARN_ON_ONCE(!is_standalone_pb_bit(pb_bit)))
return false;
- get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx);
-
- return test_bit(bitidx + pb_bit, bitmap_word);
+ return test_bit(pb_bit, get_pfnblock_flags_word(page, pfn));
}
/**
@@ -493,18 +450,13 @@ get_pfnblock_migratetype(const struct page *page, unsigned long pfn)
static void __set_pfnblock_flags_mask(struct page *page, unsigned long pfn,
unsigned long flags, unsigned long mask)
{
- unsigned long *bitmap_word;
- unsigned long bitidx;
- unsigned long word;
-
- get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx);
+ unsigned long *flags_word = get_pfnblock_flags_word(page, pfn);
+ unsigned long word, new_word;
- mask <<= bitidx;
- flags <<= bitidx;
-
- word = READ_ONCE(*bitmap_word);
+ word = READ_ONCE(*flags_word);
do {
- } while (!try_cmpxchg(bitmap_word, &word, (word & ~mask) | flags));
+ new_word = (word & ~mask) | flags;
+ } while (!try_cmpxchg(flags_word, &word, new_word));
}
/**
@@ -516,15 +468,10 @@ static void __set_pfnblock_flags_mask(struct page *page, unsigned long pfn,
void set_pfnblock_bit(const struct page *page, unsigned long pfn,
enum pageblock_bits pb_bit)
{
- unsigned long *bitmap_word;
- unsigned long bitidx;
-
if (WARN_ON_ONCE(!is_standalone_pb_bit(pb_bit)))
return;
- get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx);
-
- set_bit(bitidx + pb_bit, bitmap_word);
+ set_bit(pb_bit, get_pfnblock_flags_word(page, pfn));
}
/**
@@ -536,15 +483,10 @@ void set_pfnblock_bit(const struct page *page, unsigned long pfn,
void clear_pfnblock_bit(const struct page *page, unsigned long pfn,
enum pageblock_bits pb_bit)
{
- unsigned long *bitmap_word;
- unsigned long bitidx;
-
if (WARN_ON_ONCE(!is_standalone_pb_bit(pb_bit)))
return;
- get_pfnblock_bitmap_bitidx(page, pfn, &bitmap_word, &bitidx);
-
- clear_bit(bitidx + pb_bit, bitmap_word);
+ clear_bit(pb_bit, get_pfnblock_flags_word(page, pfn));
}
/**
diff --git a/mm/sparse.c b/mm/sparse.c
index b5b2b6f7041b..c9473b9a5c24 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -298,7 +298,8 @@ static void __meminit sparse_init_one_section(struct mem_section *ms,
static unsigned long usemap_size(void)
{
- return BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS) * sizeof(unsigned long);
+ return (1UL << (PFN_SECTION_SHIFT - pageblock_order)) *
+ sizeof(struct pageblock_data);
}
size_t mem_section_usage_size(void)
--
2.52.0
* [RFC PATCH 02/45] mm: page_alloc: per-cpu pageblock buddy allocator
From: Rik van Riel @ 2026-04-30 20:20 UTC
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel
From: Johannes Weiner <hannes@cmpxchg.org>
On large machines, zone->lock is a scaling bottleneck for page
allocation. Two common patterns drive contention:
1. Affinity violations: pages are allocated on one CPU but freed on
another (jemalloc, exit, reclaim). The freeing CPU's PCP drains to
zone buddy, and the allocating CPU refills from zone buddy -- both
under zone->lock, defeating PCP batching entirely.
2. Concurrent exits: processes tearing down large address spaces
simultaneously overwhelm per-CPU PCP capacity, serializing on
zone->lock for overflow.
Solution
Extend the PCP to operate on whole pageblocks with ownership tracking.
Each CPU claims pageblocks from the zone buddy and splits them
locally. Pages are tagged with their owning CPU, so frees route back
to the owner's PCP regardless of which CPU frees. This eliminates
affinity violations: the owner CPU's PCP absorbs both allocations and
frees for its blocks without touching zone->lock.
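Conceptually, the free-side routing reduces to the following sketch (condensed from the __free_frozen_pages() changes in the diff below; locking, isolation handling and the PCPF_CPU_DEAD fallback are omitted here):

	/* Sketch: route a freed page to the PCP of the CPU that owns its block. */
	struct pageblock_data *pbd = pfn_to_pageblock(page, pfn);
	int owner_cpu = pbd->cpu - 1;	/* pbd->cpu == 0 means zone-owned */
	int cache_cpu = owner_cpu >= 0 ? owner_cpu : raw_smp_processor_id();
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);

	/* Only pages landing on their home (owner) PCP are marked mergeable. */
	free_frozen_page_commit(zone, pcp, page, migratetype, order,
				fpi_flags, cache_cpu == owner_cpu);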
It also shortens zone->lock hold time during drain and refill
cycles. Whole blocks are acquired under zone->lock and then split
outside of it. Affinity routing to the owning PCP on free enables
buddy merging outside the zone->lock as well; a bottom-up merge pass
runs under pcp->lock on drain, freeing larger chunks under zone->lock.
PCP refill uses a four-phase approach:
Phase 0: recover owned fragments previously drained to zone buddy.
Phase 1: claim whole pageblocks from zone buddy.
Phase 2: grab sub-pageblock chunks without migratetype stealing.
Phase 3: traditional __rmqueue() with migratetype fallback.
Phase 0/1 pages are owned and marked PagePCPBuddy, making them
eligible for PCP-level merging. Phase 2/3 pages are cached on PCP for
batching only -- no ownership, no merging. However, Phase 2 still
benefits from chunky zone transactions: it pulls higher-order entries
from zone free lists under zone->lock and splits them on the PCP
outside of it, rather than acquiring zone->lock per page.
When PCP batch sizes are small (small machines with few CPUs) or the
zone is fragmented and no whole pageblocks are available, refill falls
through to Phase 2/3 naturally. The allocator degrades gracefully to
the original page-at-a-time behavior.
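In pseudo-C, the refill control flow looks roughly like this (a condensed sketch of the rmqueue_bulk() rewrite in the diff below; recover_owned_fragments() and grab_same_mt_chunks() are placeholder names for the corresponding inline loops, and locking/accounting are elided):

	/* All phases run under a single zone->lock acquisition. */
	refilled = recover_owned_fragments(zone, pcp, migratetype);	/* Phase 0 */

	while (refilled + pageblock_nr_pages <= needed) {		/* Phase 1 */
		page = __rmqueue(zone, pageblock_order, migratetype,
				 alloc_flags, &rmqm);
		if (!page)
			break;
		set_pcpblock_owner(page, cpu);
		__SetPagePCPBuddy(page);
		pcp_enqueue_tail(pcp, page, migratetype, pageblock_order);
		refilled += pageblock_nr_pages;
	}

	if (refilled < needed)						/* Phase 2 */
		refilled += grab_same_mt_chunks(zone, pcp, migratetype,
						needed - refilled);

	while (refilled < needed) {					/* Phase 3 */
		page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
		if (!page)
			break;
		pcp_enqueue_tail(pcp, page, migratetype, order);
		refilled += 1 << order;
	}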
When owned blocks accumulate long-lived allocations (e.g. a mix of
anonymous and file cache pages), partial block drains send the free
fragments to zone buddy and remember the block, so Phase 0 can recover
them on the next refill. This allows the allocator to pack new
allocations next to existing ones in already-committed blocks rather
than consuming fresh pageblocks, keeping fragmentation contained.
Data structures:
- per_cpu_pages: +owned_blocks list head, +PCPF_CPU_DEAD flag to gate
enqueuing on offline CPUs.
- pageblock_data: +cpu (owner), +block_pfn, +cpu_node (recovery list
linkage). 32 bytes per pageblock, ~16KB per GB with 2MB pageblocks.
- PagePCPBuddy page type marks pages eligible for PCP-level merging.
[riel@surriel.com: fix ownership clearing on direct block frees]
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 23 +-
include/linux/page-flags.h | 9 +
mm/debug.c | 1 +
mm/page_alloc.c | 705 +++++++++++++++++++++++++++++--------
4 files changed, 575 insertions(+), 163 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2f202bda5ec6..a59260487ab4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -714,17 +714,10 @@ enum zone_watermarks {
};
/*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. Two additional lists
- * are added for THP. One PCP list is used by GPF_MOVABLE, and the other PCP list
- * is used by GFP_UNMOVABLE and GFP_RECLAIMABLE.
+ * One per migratetype for page orders up to and including PAGE_BLOCK_MAX_ORDER.
*/
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 2
-#else
-#define NR_PCP_THP 0
-#endif
-#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
-#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)
+#define NR_PCP_ORDERS (PAGE_BLOCK_MAX_ORDER + 1)
+#define NR_PCP_LISTS (MIGRATE_PCPTYPES * NR_PCP_ORDERS)
/*
* Flags used in pcp->flags field.
@@ -737,9 +730,13 @@ enum zone_watermarks {
* draining PCP for consecutive high-order pages freeing without
* allocation if data cache slice of CPU is large enough. To reduce
* zone lock contention and keep cache-hot pages reusing.
+ *
+ * PCPF_CPU_DEAD: CPU is offline. Don't enqueue freed pages; fall
+ * back to zone buddy instead.
*/
#define PCPF_PREV_FREE_HIGH_ORDER BIT(0)
#define PCPF_FREE_HIGH_BATCH BIT(1)
+#define PCPF_CPU_DEAD BIT(2)
struct per_cpu_pages {
spinlock_t lock; /* Protects lists field */
@@ -755,6 +752,9 @@ struct per_cpu_pages {
#endif
short free_count; /* consecutive free count */
+ /* Pageblocks owned by this CPU, for fragment recovery */
+ struct list_head owned_blocks;
+
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
} ____cacheline_aligned_in_smp;
@@ -1900,6 +1900,9 @@ static inline unsigned long section_nr_to_pfn(unsigned long sec)
struct pageblock_data {
unsigned long flags;
+ int cpu; /* PCP ownership: owning cpu + 1, or 0 for zone-owned */
+ unsigned long block_pfn; /* first PFN of pageblock */
+ struct list_head cpu_node; /* per-CPU owned-blocks list */
};
struct mem_section_usage {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c73..6798f78ef677 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -934,6 +934,7 @@ enum pagetype {
PGTY_zsmalloc = 0xf6,
PGTY_unaccepted = 0xf7,
PGTY_large_kmalloc = 0xf8,
+ PGTY_pcp_buddy = 0xf9,
PGTY_mapcount_underflow = 0xff
};
@@ -1002,6 +1003,14 @@ static __always_inline void __ClearPage##uname(struct page *page) \
*/
PAGE_TYPE_OPS(Buddy, buddy, buddy)
+/*
+ * PagePCPBuddy() indicates that the page is free and in a per-cpu
+ * buddy allocator (see mm/page_alloc.c). Unlike PageBuddy() pages,
+ * these are not on zone free lists and must not be isolated by
+ * compaction or other zone-level code.
+ */
+PAGE_TYPE_OPS(PCPBuddy, pcp_buddy, pcp_buddy)
+
/*
* PageOffline() indicates that the page is logically offline although the
* containing section is online. (e.g. inflated in a balloon driver or
diff --git a/mm/debug.c b/mm/debug.c
index 77fa8fe1d641..d4542d5d202b 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -56,6 +56,7 @@ static const char *page_type_names[] = {
DEF_PAGETYPE_NAME(table),
DEF_PAGETYPE_NAME(buddy),
DEF_PAGETYPE_NAME(unaccepted),
+ DEF_PAGETYPE_NAME(pcp_buddy),
};
static const char *page_type_name(unsigned int page_type)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45519be08c9b..c0aa39fa2f61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -414,6 +414,22 @@ bool get_pfnblock_bit(const struct page *page, unsigned long pfn,
return test_bit(pb_bit, get_pfnblock_flags_word(page, pfn));
}
+/*
+ * Extract migratetype from a pageblock_data pointer. Callers that
+ * already have the pbd can avoid a redundant pfn_to_pageblock().
+ */
+static __always_inline enum migratetype
+pbd_migratetype(const struct pageblock_data *pbd)
+{
+ unsigned long flags = READ_ONCE(pbd->flags) & MIGRATETYPE_AND_ISO_MASK;
+
+#ifdef CONFIG_MEMORY_ISOLATION
+ if (flags & BIT(PB_migrate_isolate))
+ return MIGRATE_ISOLATE;
+#endif
+ return flags & MIGRATETYPE_MASK;
+}
+
/**
* get_pfnblock_migratetype - Return the migratetype of a pageblock
* @page: The page within the block of interest
@@ -427,16 +443,7 @@ bool get_pfnblock_bit(const struct page *page, unsigned long pfn,
__always_inline enum migratetype
get_pfnblock_migratetype(const struct page *page, unsigned long pfn)
{
- unsigned long mask = MIGRATETYPE_AND_ISO_MASK;
- unsigned long flags;
-
- flags = __get_pfnblock_flags_mask(page, pfn, mask);
-
-#ifdef CONFIG_MEMORY_ISOLATION
- if (flags & BIT(PB_migrate_isolate))
- return MIGRATE_ISOLATE;
-#endif
- return flags & MIGRATETYPE_MASK;
+ return pbd_migratetype(pfn_to_pageblock(page, pfn));
}
/**
@@ -520,6 +527,8 @@ void __meminit init_pageblock_migratetype(struct page *page,
enum migratetype migratetype,
bool isolate)
{
+ unsigned long pfn = page_to_pfn(page);
+ struct pageblock_data *pbd;
unsigned long flags;
if (unlikely(page_group_by_mobility_disabled &&
@@ -538,8 +547,11 @@ void __meminit init_pageblock_migratetype(struct page *page,
if (isolate)
flags |= BIT(PB_migrate_isolate);
#endif
- __set_pfnblock_flags_mask(page, page_to_pfn(page), flags,
- MIGRATETYPE_AND_ISO_MASK);
+ __set_pfnblock_flags_mask(page, pfn, flags, MIGRATETYPE_AND_ISO_MASK);
+
+ pbd = pfn_to_pageblock(page, pfn);
+ pbd->block_pfn = pfn;
+ INIT_LIST_HEAD(&pbd->cpu_node);
}
#ifdef CONFIG_DEBUG_VM
@@ -625,19 +637,7 @@ static void bad_page(struct page *page, const char *reason)
static inline unsigned int order_to_pindex(int migratetype, int order)
{
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- bool movable;
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- VM_BUG_ON(order != HPAGE_PMD_ORDER);
-
- movable = migratetype == MIGRATE_MOVABLE;
-
- return NR_LOWORDER_PCP_LISTS + movable;
- }
-#else
- VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-#endif
+ VM_BUG_ON(order > PAGE_BLOCK_MAX_ORDER);
return (MIGRATE_PCPTYPES * order) + migratetype;
}
@@ -646,25 +646,14 @@ static inline int pindex_to_order(unsigned int pindex)
{
int order = pindex / MIGRATE_PCPTYPES;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (pindex >= NR_LOWORDER_PCP_LISTS)
- order = HPAGE_PMD_ORDER;
-#else
- VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
-#endif
+ VM_BUG_ON(order > PAGE_BLOCK_MAX_ORDER);
return order;
}
static inline bool pcp_allowed_order(unsigned int order)
{
- if (order <= PAGE_ALLOC_COSTLY_ORDER)
- return true;
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (order == HPAGE_PMD_ORDER)
- return true;
-#endif
- return false;
+ return order <= pageblock_order;
}
/*
@@ -697,6 +686,91 @@ static inline void set_buddy_order(struct page *page, unsigned int order)
__SetPageBuddy(page);
}
+/*
+ * PCP pageblock ownership tracking.
+ *
+ * Ownership rules:
+ * - Whole pageblocks acquired by rmqueue_bulk() Phase 1 are owned, meaning
+ * all frees will be routed to that PCP.
+ * - Draining a whole pageblock back to the zone clears PCP ownership.
+ * - Draining a partial block (due to PCP thresholds or memory pressure) puts
+ * the block on the pcp->owned_blocks list. A later refill will attempt to
+ * recover it in Phase 0.
+ * - Whole pageblocks can assemble on the zone buddy due to PCP bypasses,
+ * e.g. during lock contention. __free_one_page() clears stale ownership.
+ * - Phases 2/3 refill with fragments for pure caching - if there are not
+ * enough blocks or pcp->high restrictions. They do not participate
+ * in ownership, affinity enforcement, or on-PCP merging.
+ *
+ * PagePCPBuddy means "mergeable buddy on home PCP":
+ * - Set when Phase 0/1 restore or acquire whole pageblocks.
+ * - Propagated to split remainders in pcp_rmqueue_smallest().
+ * - Set on freed pages from owned blocks routed to the owner PCP.
+ * - NOT set for Phase 2/3 fragments or zone-owned frees.
+ * - The merge pass in free_pcppages_bulk() only processes
+ * PagePCPBuddy pages, ensuring it never touches pages on
+ * another CPU's PCP list.
+ *
+ * We store the owning CPU + 1, so the default value of 0 in those
+ * arrays means no owner / zone owner (and not CPU 0).
+ */
+
+static inline void clear_pcpblock_owner(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct pageblock_data *pbd = pfn_to_pageblock(page, pfn);
+
+ pbd->cpu = 0;
+ list_del_init(&pbd->cpu_node);
+}
+
+static inline void set_pcpblock_owner(struct page *page, int cpu)
+{
+ pfn_to_pageblock(page, page_to_pfn(page))->cpu = cpu + 1;
+}
+
+static inline int get_pcpblock_owner(struct page *page)
+{
+ return pfn_to_pageblock(page, page_to_pfn(page))->cpu - 1;
+}
+
+static inline void set_pcp_order(struct page *page, unsigned int order)
+{
+ set_page_private(page, order);
+}
+
+static inline unsigned int pcp_buddy_order(struct page *page)
+{
+ return page_private(page);
+}
+
+static void pcp_enqueue(struct per_cpu_pages *pcp, struct page *page,
+ int migratetype, unsigned int order)
+{
+ set_pcp_order(page, order);
+ list_add(&page->pcp_list,
+ &pcp->lists[order_to_pindex(migratetype, order)]);
+ pcp->count += 1 << order;
+}
+
+static void pcp_enqueue_tail(struct per_cpu_pages *pcp, struct page *page,
+ int migratetype, unsigned int order)
+{
+ set_pcp_order(page, order);
+ list_add_tail(&page->pcp_list,
+ &pcp->lists[order_to_pindex(migratetype, order)]);
+ pcp->count += 1 << order;
+}
+
+static void pcp_dequeue(struct per_cpu_pages *pcp, struct page *page,
+ unsigned int order)
+{
+ list_del(&page->pcp_list);
+ __ClearPagePCPBuddy(page);
+ set_page_private(page, 0);
+ pcp->count -= 1 << order;
+}
+
#ifdef CONFIG_COMPACTION
static inline struct capture_control *task_capc(struct zone *zone)
{
@@ -937,6 +1011,21 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
+ /*
+ * For whole blocks, ownership returns to the zone. There are
+ * no more outstanding frees to route through that CPU's PCP,
+ * and we don't want to confuse any future users of the pages
+ * in this block. E.g. rmqueue_buddy().
+ *
+ * Check here if a whole block came in directly: pre-merged in
+ * the PCP, or PCP contended and bypassed.
+ *
+ * There is another check in the loop below if a block merges
+ * up with pages already on the zone buddy.
+ */
+ if (order == pageblock_order)
+ clear_pcpblock_owner(page);
+
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
@@ -986,6 +1075,10 @@ static inline void __free_one_page(struct page *page,
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
order++;
+
+ /* Clear owner also when we merge up. See above */
+ if (order == pageblock_order)
+ clear_pcpblock_owner(page);
}
done_merging:
@@ -1421,17 +1514,24 @@ bool free_pages_prepare(struct page *page, unsigned int order)
}
/*
- * Frees a number of pages from the PCP lists
- * Assumes all pages on list are in same zone.
- * count is the number of pages to free.
+ * Free PCP pages to zone buddy. First does a bottom-up merge pass
+ * over PagePCPBuddy entries under pcp->lock only (already held by
+ * caller). Only pages marked PagePCPBuddy (owned-block pages on
+ * their home PCP) participate in merging; non-owned pages (Phase
+ * 2/3 fragments) are skipped and drain individually.
+ *
+ * Then drains pages to zone under zone->lock, starting with
+ * fully-merged pageblocks via round-robin. When those are exhausted,
+ * falls through to smaller orders. Draining a pageblock-order page
+ * disowns the block.
*/
static void free_pcppages_bulk(struct zone *zone, int count,
- struct per_cpu_pages *pcp,
- int pindex)
+ struct per_cpu_pages *pcp)
{
unsigned long flags;
unsigned int order;
struct page *page;
+ int mt, pindex;
/*
* Ensure proper count is passed which otherwise would stuck in the
@@ -1439,8 +1539,45 @@ static void free_pcppages_bulk(struct zone *zone, int count,
*/
count = min(pcp->count, count);
- /* Ensure requested pindex is drained first. */
- pindex = pindex - 1;
+ /* PCP merge pass */
+ for (order = 0; order < pageblock_order; order++) {
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ struct list_head *list;
+ struct page *page, *tmp;
+
+ list = &pcp->lists[order_to_pindex(mt, order)];
+ list_for_each_entry_safe(page, tmp, list, pcp_list) {
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long buddy_pfn = __find_buddy_pfn(pfn, order);
+ struct page *buddy = page + (buddy_pfn - pfn);
+ unsigned long combined_pfn;
+ struct page *combined;
+
+ if (!PagePCPBuddy(page))
+ continue;
+ if (!PagePCPBuddy(buddy))
+ continue;
+ if (pcp_buddy_order(buddy) != order)
+ continue;
+
+ /* Don't corrupt the safe iterator! */
+ if (buddy == tmp)
+ tmp = list_next_entry(tmp, pcp_list);
+
+ pcp_dequeue(pcp, page, order);
+ pcp_dequeue(pcp, buddy, order);
+
+ combined_pfn = buddy_pfn & pfn;
+ combined = page + (combined_pfn - pfn);
+
+ __SetPagePCPBuddy(combined);
+ pcp_enqueue_tail(pcp, combined, mt, order + 1);
+ }
+ }
+ }
+
+ /* Ensure pageblock orders are drained first. */
+ pindex = order_to_pindex(0, pageblock_order) - 1;
spin_lock_irqsave(&zone->lock, flags);
@@ -1458,19 +1595,31 @@ static void free_pcppages_bulk(struct zone *zone, int count,
order = pindex_to_order(pindex);
nr_pages = 1 << order;
do {
+ fpi_t fpi = FPI_NONE;
unsigned long pfn;
- int mt;
page = list_last_entry(list, struct page, pcp_list);
pfn = page_to_pfn(page);
mt = get_pfnblock_migratetype(page, pfn);
- /* must delete to avoid corrupting pcp list */
- list_del(&page->pcp_list);
+ /*
+ * Owned fragment going to zone buddy: queue
+ * block for recovery during the next refill,
+ * and keep it away from other CPUs (tail).
+ */
+ if (PagePCPBuddy(page) && order < pageblock_order) {
+ struct pageblock_data *pbd;
+
+ pbd = pfn_to_pageblock(page, pfn);
+ if (list_empty(&pbd->cpu_node))
+ list_add(&pbd->cpu_node, &pcp->owned_blocks);
+ fpi = FPI_TO_TAIL;
+ }
+
+ pcp_dequeue(pcp, page, order);
count -= nr_pages;
- pcp->count -= nr_pages;
- __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
+ __free_one_page(page, pfn, zone, order, mt, fpi);
trace_mm_page_pcpu_drain(page, order, mt);
} while (count > 0 && !list_empty(list));
}
@@ -1478,6 +1627,45 @@ static void free_pcppages_bulk(struct zone *zone, int count,
spin_unlock_irqrestore(&zone->lock, flags);
}
+/*
+ * Search PCP free lists for a page of at least the requested order.
+ * If found at a higher order, split and place remainders on PCP lists.
+ * Returns NULL if nothing available on the PCP.
+ */
+static struct page *pcp_rmqueue_smallest(struct per_cpu_pages *pcp,
+ int migratetype, unsigned int order)
+{
+ unsigned int high;
+
+ for (high = order; high <= pageblock_order; high++) {
+ struct list_head *list;
+ unsigned long size;
+ struct page *page;
+ bool owned;
+
+ list = &pcp->lists[order_to_pindex(migratetype, high)];
+ if (list_empty(list))
+ continue;
+
+ page = list_first_entry(list, struct page, pcp_list);
+ /* Save before pcp_dequeue() clears it */
+ owned = PagePCPBuddy(page);
+ pcp_dequeue(pcp, page, high);
+
+ size = 1 << high;
+ while (high > order) {
+ high--;
+ size >>= 1;
+ if (owned)
+ __SetPagePCPBuddy(&page[size]);
+ pcp_enqueue(pcp, &page[size], migratetype, high);
+ }
+
+ return page;
+ }
+ return NULL;
+}
+
/* Split a multi-block free page into its individual pageblocks. */
static void split_large_buddy(struct zone *zone, struct page *page,
unsigned long pfn, int order, fpi_t fpi)
@@ -1487,6 +1675,7 @@ static void split_large_buddy(struct zone *zone, struct page *page,
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order));
/* Caller removed page from freelist, buddy info cleared! */
VM_WARN_ON_ONCE(PageBuddy(page));
+ VM_WARN_ON_ONCE(PagePCPBuddy(page));
if (order > pageblock_order)
order = pageblock_order;
@@ -2482,28 +2671,162 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
}
/*
- * Obtain a specified number of elements from the buddy allocator, all under
- * a single hold of the lock, for efficiency. Add them to the supplied list.
- * Returns the number of new pages which were placed at *list.
+ * Obtain a specified number of elements from the buddy allocator, all
+ * under a single hold of the lock, for efficiency. Add them to the
+ * freelist of @pcp.
+ *
+ * When @pcp is non-NULL and @count > 1 (normal pageset), uses a four-phase
+ * approach:
+ * Phase 0: Recover previously owned, partially drained blocks.
+ * Phase 1: Acquire whole pageblocks, claim ownership, set PagePCPBuddy.
+ * These pages are eligible for PCP-level buddy merging.
+ * Phase 2: Grab sub-pageblock fragments of the same migratetype.
+ * Phase 3: Fall back to __rmqueue() with migratetype fallback.
+ * Phase 2/3 pages are cached for batching only -- no ownership claim,
+ * no PagePCPBuddy, no PCP-level merging.
+ *
+ * When @pcp is NULL or @count <= 1 (boot pageset), acquires individual
+ * pages of the requested order directly.
+ *
+ * Returns %true if at least some pages were acquired.
*/
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list,
- int migratetype, unsigned int alloc_flags)
+static bool rmqueue_bulk(struct zone *zone, unsigned int order,
+ unsigned long count,
+ int migratetype, unsigned int alloc_flags,
+ struct per_cpu_pages *pcp)
{
+ unsigned long pages_needed = count << order;
enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+ struct pageblock_data *pbd, *tmp;
+ int cpu = smp_processor_id();
+ unsigned long refilled = 0;
unsigned long flags;
- int i;
+ int o;
if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
if (!spin_trylock_irqsave(&zone->lock, flags))
- return 0;
+ return false;
} else {
spin_lock_irqsave(&zone->lock, flags);
}
- for (i = 0; i < count; ++i) {
+
+ if (!pcp || count <= 1)
+ goto phase3;
+
+ /*
+ * Phase 0: Recover fragments from owned blocks.
+ *
+ * The owned_blocks list tracks blocks that have fragments
+ * sitting in zone buddy (put there by drains). Pull matching
+ * fragments back to PCP with PagePCPBuddy so they participate
+ * in merging, instead of claiming fresh blocks and spreading
+ * fragmentation further.
+ *
+ * Only recover blocks matching the requested migratetype.
+ * After recovery, remove the block from the list -- the drain
+ * path re-adds it if new fragments arrive.
+ */
+ list_for_each_entry_safe(pbd, tmp, &pcp->owned_blocks, cpu_node) {
+ unsigned long base_pfn, pfn;
+ int block_mt;
+
+ base_pfn = pbd->block_pfn;
+ block_mt = pbd_migratetype(pbd);
+ if (block_mt != migratetype)
+ continue;
+
+ for (pfn = base_pfn; pfn < base_pfn + pageblock_nr_pages;) {
+ struct page *page = pfn_to_page(pfn);
+
+ if (!PageBuddy(page)) {
+ pfn++;
+ continue;
+ }
+
+ o = buddy_order(page);
+ del_page_from_free_list(page, zone, o, block_mt);
+ __SetPagePCPBuddy(page);
+ pcp_enqueue_tail(pcp, page, block_mt, o);
+ refilled += 1 << o;
+ pfn += 1 << o;
+ }
+
+ list_del_init(&pbd->cpu_node);
+
+ if (refilled >= pages_needed)
+ goto out;
+ }
+
+ /*
+ * Phase 1: Try whole pageblocks. Fast path for unfragmented
+ * zones. Claim ownership and set PagePCPBuddy so these pages
+ * are eligible for PCP-level merging.
+ *
+ * Only grab blocks that fit within the refill budget. On
+ * small zones, pages_needed can be less than a whole
+ * pageblock; skip to smaller blocks or individual pages to
+ * avoid overshooting the PCP high watermark.
+ */
+ while (refilled + pageblock_nr_pages <= pages_needed) {
+ struct page *page;
+
+ page = __rmqueue(zone, pageblock_order,
+ migratetype, alloc_flags, &rmqm);
+ if (!page)
+ break;
+
+ set_pcpblock_owner(page, cpu);
+ __SetPagePCPBuddy(page);
+ pcp_enqueue_tail(pcp, page, migratetype, pageblock_order);
+ refilled += 1 << pageblock_order;
+ }
+ if (refilled >= pages_needed)
+ goto out;
+
+ /*
+ * Phase 2: Zone too fragmented for whole pageblocks.
+ * Sweep zone free lists top-down for same-migratetype
+ * chunks. Avoids cross-type stealing and keeps PCP
+ * functional under fragmentation.
+ *
+ * No ownership claim or PagePCPBuddy - these are
+ * sub-pageblock fragments cached for batching only.
+ *
+ * Stop above the requested order -- at that point,
+ * phase 3's __rmqueue() does the same lookup but with
+ * migratetype fallback.
+ */
+ for (o = pageblock_order - 1;
+ o > (int)order && refilled < pages_needed; o--) {
+ struct free_area *area = &zone->free_area[o];
+ struct page *page;
+
+ while (refilled + (1 << o) <= pages_needed) {
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ break;
+
+ del_page_from_free_list(page, zone, o, migratetype);
+ pcp_enqueue_tail(pcp, page, migratetype, o);
+ refilled += 1 << o;
+ }
+ }
+
+ /*
+ * Phase 3: Last resort. Use __rmqueue() which does
+ * migratetype fallback. Cache the pages on PCP to still
+ * amortize future zone lock acquisitions.
+ *
+ * No ownership claim or PagePCPBuddy - these fragments
+ * drain individually to zone buddy.
+ *
+ * Boot pagesets (count <= 1) jump here directly.
+ */
+phase3:
+ while (refilled < pages_needed) {
struct page *page = __rmqueue(zone, order, migratetype,
alloc_flags, &rmqm);
- if (unlikely(page == NULL))
+ if (!page)
break;
/*
@@ -2516,11 +2839,13 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* for IO devices that can merge IO requests if the physical
* pages are ordered properly.
*/
- list_add_tail(&page->pcp_list, list);
+ pcp_enqueue_tail(pcp, page, migratetype, order);
+ refilled += 1 << order;
}
- spin_unlock_irqrestore(&zone->lock, flags);
- return i;
+out:
+ spin_unlock_irqrestore(&zone->lock, flags);
+ return refilled;
}
/*
@@ -2551,7 +2876,7 @@ bool decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
while (to_drain > 0) {
to_drain_batched = min(to_drain, batch);
pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
- free_pcppages_bulk(zone, to_drain_batched, pcp, 0);
+ free_pcppages_bulk(zone, to_drain_batched, pcp);
pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
todo = true;
@@ -2576,7 +2901,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
pcp_spin_lock_maybe_irqsave(pcp, UP_flags);
- free_pcppages_bulk(zone, to_drain, pcp, 0);
+ free_pcppages_bulk(zone, to_drain, pcp);
pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
}
}
@@ -2598,7 +2923,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
int to_drain = min(count,
pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
- free_pcppages_bulk(zone, to_drain, pcp, 0);
+ free_pcppages_bulk(zone, to_drain, pcp);
count -= to_drain;
}
pcp_spin_unlock_maybe_irqrestore(pcp, UP_flags);
@@ -2792,21 +3117,15 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
}
/*
- * Tune pcp alloc factor and adjust count & free_count. Free pages to bring the
- * pcp's watermarks below high.
- *
- * May return a freed pcp, if during page freeing the pcp spinlock cannot be
- * reacquired. Return true if pcp is locked, false otherwise.
+ * Free a page to the PCP and flush excess pages if necessary.
+ * Works for both local and remote PCP - caller handles locking.
+ * @owned: page is from a PCP-owned block (eligible for merging).
*/
-static bool free_frozen_page_commit(struct zone *zone,
+static void free_frozen_page_commit(struct zone *zone,
struct per_cpu_pages *pcp, struct page *page, int migratetype,
- unsigned int order, fpi_t fpi_flags, unsigned long *UP_flags)
+ unsigned int order, fpi_t fpi_flags, bool owned)
{
- int high, batch;
- int to_free, to_free_batched;
- int pindex;
- int cpu = smp_processor_id();
- int ret = true;
+ int high, batch, to_free;
bool free_high = false;
/*
@@ -2816,9 +3135,15 @@ static bool free_frozen_page_commit(struct zone *zone,
*/
pcp->alloc_factor >>= 1;
__count_vm_events(PGFREE, 1 << order);
- pindex = order_to_pindex(migratetype, order);
- list_add(&page->pcp_list, &pcp->lists[pindex]);
- pcp->count += 1 << order;
+ /*
+ * Only set PagePCPBuddy for pages from owned blocks -- those
+ * are on their home PCP and eligible for buddy merging.
+ * Zone-owned pages are cached on the local PCP for batching
+ * only; the merge pass skips them harmlessly.
+ */
+ if (owned)
+ __SetPagePCPBuddy(page);
+ pcp_enqueue(pcp, page, migratetype, order);
batch = READ_ONCE(pcp->batch);
/*
@@ -2844,41 +3169,15 @@ static bool free_frozen_page_commit(struct zone *zone,
* Do not attempt to take a zone lock. Let pcp->count get
* over high mark temporarily.
*/
- return true;
+ return;
}
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count < high)
- return true;
+ return;
to_free = nr_pcp_free(pcp, batch, high, free_high);
- while (to_free > 0 && pcp->count > 0) {
- to_free_batched = min(to_free, batch);
- free_pcppages_bulk(zone, to_free_batched, pcp, pindex);
- to_free -= to_free_batched;
-
- if (to_free == 0 || pcp->count == 0)
- break;
-
- pcp_spin_unlock(pcp, *UP_flags);
-
- pcp = pcp_spin_trylock(zone->per_cpu_pageset, *UP_flags);
- if (!pcp) {
- ret = false;
- break;
- }
-
- /*
- * Check if this thread has been migrated to a different CPU.
- * If that is the case, give up and indicate that the pcp is
- * returned in an unlocked state.
- */
- if (smp_processor_id() != cpu) {
- pcp_spin_unlock(pcp, *UP_flags);
- ret = false;
- break;
- }
- }
+ free_pcppages_bulk(zone, to_free, pcp);
if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
zone_watermark_ok(zone, 0, high_wmark_pages(zone),
@@ -2897,7 +3196,6 @@ static bool free_frozen_page_commit(struct zone *zone,
next_memory_node(pgdat->node_id) < MAX_NUMNODES)
kswapd_clear_hopeless(pgdat, KSWAPD_CLEAR_HOPELESS_PCP);
}
- return ret;
}
/*
@@ -2908,9 +3206,11 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
{
unsigned long UP_flags;
struct per_cpu_pages *pcp;
+ struct pageblock_data *pbd;
struct zone *zone;
unsigned long pfn = page_to_pfn(page);
int migratetype;
+ int owner_cpu, cache_cpu;
if (!pcp_allowed_order(order)) {
__free_pages_ok(page, order, fpi_flags);
@@ -2928,7 +3228,8 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
* excessively into the page allocator
*/
zone = page_zone(page);
- migratetype = get_pfnblock_migratetype(page, pfn);
+ pbd = pfn_to_pageblock(page, pfn);
+ migratetype = pbd_migratetype(pbd);
if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
if (unlikely(is_migrate_isolate(migratetype))) {
free_one_page(zone, page, pfn, order, fpi_flags);
@@ -2942,15 +3243,45 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
add_page_to_zone_llist(zone, page, order);
return;
}
- pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags);
- if (pcp) {
- if (!free_frozen_page_commit(zone, pcp, page, migratetype,
- order, fpi_flags, &UP_flags))
+
+ /*
+ * Route page to the owning CPU's PCP for merging, or to
+ * the local PCP for batching (zone-owned pages). Zone-owned
+ * pages are cached without PagePCPBuddy -- the merge pass
+ * skips them, so they're inert on any PCP list and drain
+ * individually to zone buddy.
+ *
+ * Ownership is stable here: it can only change when the
+ * pageblock is complete -- either fully free in zone buddy
+ * (Phase 1 claims) or fully merged on PCP (drain disowns).
+ * Since we hold this page, neither can happen.
+ */
+ owner_cpu = pbd->cpu - 1;
+ cache_cpu = owner_cpu;
+ if (cache_cpu < 0)
+ cache_cpu = raw_smp_processor_id();
+
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
+ if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
+ if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
+ free_one_page(zone, page, pfn, order, fpi_flags);
return;
- pcp_spin_unlock(pcp, UP_flags);
+ }
} else {
+ spin_lock_irqsave(&pcp->lock, UP_flags);
+ }
+
+ if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
+ spin_unlock_irqrestore(&pcp->lock, UP_flags);
free_one_page(zone, page, pfn, order, fpi_flags);
+ return;
}
+
+ free_frozen_page_commit(zone, pcp, page,
+ migratetype, order, fpi_flags,
+ cache_cpu == owner_cpu);
+
+ spin_unlock_irqrestore(&pcp->lock, UP_flags);
}
void free_frozen_pages(struct page *page, unsigned int order)
@@ -2971,6 +3302,7 @@ void free_unref_folios(struct folio_batch *folios)
unsigned long UP_flags;
struct per_cpu_pages *pcp = NULL;
struct zone *locked_zone = NULL;
+ int locked_cpu = -1;
int i, j;
/* Prepare folios for freeing */
@@ -3002,17 +3334,29 @@ void free_unref_folios(struct folio_batch *folios)
struct zone *zone = folio_zone(folio);
unsigned long pfn = folio_pfn(folio);
unsigned int order = (unsigned long)folio->private;
+ struct pageblock_data *pbd;
int migratetype;
+ int owner_cpu, cache_cpu;
folio->private = NULL;
- migratetype = get_pfnblock_migratetype(&folio->page, pfn);
+ pbd = pfn_to_pageblock(&folio->page, pfn);
+ migratetype = pbd_migratetype(pbd);
+ owner_cpu = pbd->cpu - 1;
+ cache_cpu = owner_cpu;
+ if (cache_cpu < 0)
+ cache_cpu = raw_smp_processor_id();
- /* Different zone requires a different pcp lock */
+ /*
+ * Re-lock needed if zone changed, page is isolate,
+ * or target CPU changed.
+ */
if (zone != locked_zone ||
- is_migrate_isolate(migratetype)) {
+ is_migrate_isolate(migratetype) ||
+ cache_cpu != locked_cpu) {
if (pcp) {
- pcp_spin_unlock(pcp, UP_flags);
+ spin_unlock_irqrestore(&pcp->lock, UP_flags);
locked_zone = NULL;
+ locked_cpu = -1;
pcp = NULL;
}
@@ -3026,17 +3370,35 @@ void free_unref_folios(struct folio_batch *folios)
continue;
}
+ pcp = per_cpu_ptr(zone->per_cpu_pageset,
+ cache_cpu);
/*
- * trylock is necessary as folios may be getting freed
- * from IRQ or SoftIRQ context after an IO completion.
+ * Use trylock when not in task context (IRQ,
+ * softirq) to avoid spinning with IRQs
+ * disabled. In task context, spin -- brief
+ * contention on a per-CPU lock beats the
+ * unbatched zone->lock fallback.
*/
- pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags);
- if (unlikely(!pcp)) {
+ if (!in_task()) {
+ if (unlikely(!spin_trylock_irqsave(
+ &pcp->lock, UP_flags))) {
+ pcp = NULL;
+ free_one_page(zone, &folio->page, pfn,
+ order, FPI_NONE);
+ continue;
+ }
+ } else {
+ spin_lock_irqsave(&pcp->lock, UP_flags);
+ }
+ if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
+ spin_unlock_irqrestore(&pcp->lock, UP_flags);
+ pcp = NULL;
free_one_page(zone, &folio->page, pfn,
order, FPI_NONE);
continue;
}
locked_zone = zone;
+ locked_cpu = cache_cpu;
}
/*
@@ -3047,15 +3409,13 @@ void free_unref_folios(struct folio_batch *folios)
migratetype = MIGRATE_MOVABLE;
trace_mm_page_free_batched(&folio->page);
- if (!free_frozen_page_commit(zone, pcp, &folio->page,
- migratetype, order, FPI_NONE, &UP_flags)) {
- pcp = NULL;
- locked_zone = NULL;
- }
+ free_frozen_page_commit(zone, pcp, &folio->page,
+ migratetype, order, FPI_NONE,
+ cache_cpu == owner_cpu);
}
if (pcp)
- pcp_spin_unlock(pcp, UP_flags);
+ spin_unlock_irqrestore(&pcp->lock, UP_flags);
folio_batch_reinit(folios);
}
@@ -3278,28 +3638,24 @@ static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
int migratetype,
unsigned int alloc_flags,
- struct per_cpu_pages *pcp,
- struct list_head *list)
+ struct per_cpu_pages *pcp)
{
struct page *page;
do {
- if (list_empty(list)) {
+ /* Try to find/split from existing PCP stock */
+ page = pcp_rmqueue_smallest(pcp, migratetype, order);
+ if (!page) {
int batch = nr_pcp_alloc(pcp, zone, order);
- int alloced;
- alloced = rmqueue_bulk(zone, order,
- batch, list,
- migratetype, alloc_flags);
+ if (!rmqueue_bulk(zone, order, batch, migratetype,
+ alloc_flags, pcp))
+ return NULL;
- pcp->count += alloced << order;
- if (unlikely(list_empty(list)))
+ page = pcp_rmqueue_smallest(pcp, migratetype, order);
+ if (unlikely(!page))
return NULL;
}
-
- page = list_first_entry(list, struct page, pcp_list);
- list_del(&page->pcp_list);
- pcp->count -= 1 << order;
} while (check_new_pages(page, order));
return page;
@@ -3311,7 +3667,6 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
int migratetype, unsigned int alloc_flags)
{
struct per_cpu_pages *pcp;
- struct list_head *list;
struct page *page;
unsigned long UP_flags;
@@ -3326,8 +3681,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
* frees.
*/
pcp->free_count >>= 1;
- list = &pcp->lists[order_to_pindex(migratetype, order)];
- page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+ page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp);
pcp_spin_unlock(pcp, UP_flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
@@ -5013,7 +5367,6 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
struct zone *zone;
struct zoneref *z;
struct per_cpu_pages *pcp;
- struct list_head *pcp_list;
struct alloc_context ac;
gfp_t alloc_gfp;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -5108,7 +5461,6 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
goto failed;
/* Attempt the batch allocation */
- pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
while (nr_populated < nr_pages) {
/* Skip existing pages */
@@ -5117,8 +5469,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
continue;
}
- page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
- pcp, pcp_list);
+ page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags, pcp);
if (unlikely(!page)) {
/* Try and allocate at least one page */
if (!nr_account) {
@@ -5993,6 +6344,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
spin_lock_init(&pcp->lock);
for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
INIT_LIST_HEAD(&pcp->lists[pindex]);
+ INIT_LIST_HEAD(&pcp->owned_blocks);
/*
* Set batch and high values safe for a boot pageset. A true percpu
@@ -6228,7 +6580,45 @@ static int page_alloc_cpu_dead(unsigned int cpu)
lru_add_drain_cpu(cpu);
mlock_drain_remote(cpu);
- drain_pages(cpu);
+
+ /*
+ * Mark the dead CPU's PCPs so concurrent frees don't
+ * enqueue pages on them after the drain. Set the flag
+ * under pcp->lock to serialize with trylock in the free
+ * path. Stale ownership entries in pageblock_data are
+ * harmless: frees check PCPF_CPU_DEAD and fall back to zone,
+ * and rmqueue_bulk will reclaim the blocks for live CPUs.
+ */
+ for_each_populated_zone(zone) {
+ unsigned long flags, zflags;
+ struct per_cpu_pages *pcp;
+
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+
+ pcp_spin_lock_maybe_irqsave(pcp, flags);
+ pcp->flags |= PCPF_CPU_DEAD;
+ pcp_spin_unlock_maybe_irqrestore(pcp, flags);
+
+ drain_pages_zone(cpu, zone);
+
+ /*
+ * Drain released all pages. Reinitialize the
+ * owned-blocks list -- any remaining entries are
+ * stale (fragments that merged in zone buddy and
+ * cleared ownership, but weren't removed from
+ * the list because __free_one_page doesn't hold
+ * pcp->lock).
+ *
+ * Hold zone lock to prevent racing with other
+ * CPUs doing list_del_init on stale entries
+ * from this list during their Phase 1.
+ */
+ pcp_spin_lock_maybe_irqsave(pcp, flags);
+ spin_lock_irqsave(&zone->lock, zflags);
+ INIT_LIST_HEAD(&pcp->owned_blocks);
+ spin_unlock_irqrestore(&zone->lock, zflags);
+ pcp_spin_unlock_maybe_irqrestore(pcp, flags);
+ }
/*
* Spill the event counters of the dead processor
@@ -6257,8 +6647,17 @@ static int page_alloc_cpu_online(unsigned int cpu)
{
struct zone *zone;
- for_each_populated_zone(zone)
+ for_each_populated_zone(zone) {
+ struct per_cpu_pages *pcp;
+ unsigned long flags;
+
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ pcp_spin_lock_maybe_irqsave(pcp, flags);
+ pcp->flags &= ~PCPF_CPU_DEAD;
+ pcp_spin_unlock_maybe_irqrestore(pcp, flags);
+
zone_pcp_update(zone, 1);
+ }
return 0;
}
--
2.52.0
* [RFC PATCH 03/45] mm: page_alloc: use trylock for PCP lock in free path to avoid lock inversion
From: Rik van Riel @ 2026-04-30 20:20 UTC
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The per-cpu pageblock buddy allocator changed __free_frozen_pages() and
free_unref_folios() to use a blocking spin_lock_irqsave() for the PCP
lock when in_task(), rather than mainline's unconditional trylock via
pcp_spin_trylock().
This breaks a mainline invariant: the allocation path in rmqueue_pcplist()
acquires pcp->lock via pcp_spin_trylock(), which on SMP does
preempt_disable() + spin_trylock() without disabling IRQs. This means
the alloc path holds pcp->lock with interrupts enabled.
The resulting ABBA deadlock scenario:
CPU0 (alloc path): pcp_spin_trylock() acquires pcp->lock (IRQs ON)
-> hardirq fires while lock is held
-> IRQ handler takes xa_lock
(e.g. __folio_end_writeback -> xa_lock)
CPU1 (free path): xa_lock held (e.g. slab -> stack_depot_free)
-> __free_frozen_pages()
-> spin_lock_irqsave(&pcp->lock) BLOCKS
-> waits for CPU0
CPU0 cannot release pcp->lock because it is stuck in hardirq
waiting for xa_lock held by CPU1. Deadlock.
The key insight is that pcp_trylock_prepare() is a no-op on SMP, so
pcp_spin_trylock() does not save/restore IRQs. Any lock taken in
hardirq context that is also held across __free_frozen_pages() creates
this ABBA potential.
Fix by always using spin_trylock_irqsave() for the PCP lock, falling
back to free_one_page() (zone buddy) when the trylock fails. This
restores the mainline invariant of never blocking on PCP lock acquisition
in the free path.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0aa39fa2f61..d98eab3e288e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3262,13 +3262,15 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
cache_cpu = raw_smp_processor_id();
pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
- if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
- if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
- free_one_page(zone, page, pfn, order, fpi_flags);
- return;
- }
- } else {
- spin_lock_irqsave(&pcp->lock, UP_flags);
+ /*
+ * Always use trylock: callers may hold locks (e.g. xa_lock via
+ * slab/stack_depot) that are also taken in hardirq context, and
+ * pcp->lock is acquired with IRQs enabled on the allocation side.
+ * A blocking lock here would create an ABBA deadlock potential.
+ */
+ if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
+ free_one_page(zone, page, pfn, order, fpi_flags);
+ return;
}
if (unlikely(pcp->flags & PCPF_CPU_DEAD)) {
--
2.52.0
* [RFC PATCH 04/45] mm: mm_init: fix zone assignment for pages in unavailable ranges
From: Rik van Riel @ 2026-04-30 20:20 UTC
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
init_unavailable_range() initializes struct pages for memory holes between
memblock regions. It receives a zone ID from its caller, but that zone ID
is simply the last zone processed in memmap_init()'s iteration — it does
not necessarily match the zone that actually spans each PFN in the hole.
When an unavailable range straddles a zone boundary (e.g. a hole between
DMA32 and Normal), all pages in the hole get tagged with the wrong zone in
page->flags. Any later page_zone() call on such a page returns the wrong
zone, which can cause accounting confusion or crashes when code assumes the
returned zone is valid for that page.
Fix by looking up the correct zone for each PFN in the hole. This is
init-only code running once at boot, so the per-page zone lookup has no
performance impact.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/mm_init.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f3751fe6e5c3..b3f83452de72 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -848,9 +848,27 @@ static void __init init_unavailable_range(unsigned long spfn,
{
unsigned long pfn;
u64 pgcnt = 0;
+ pg_data_t *pgdat = NODE_DATA(node);
+ int zid = zone;
for_each_valid_pfn(pfn, spfn, epfn) {
- __init_single_page(pfn_to_page(pfn), pfn, zone, node);
+ /*
+ * The caller's zone may not match the PFN when unavailable
+ * ranges straddle zone boundaries. Look up the correct zone
+ * so page->flags encodes the right zone for page_zone().
+ */
+ if (!zone_spans_pfn(&pgdat->node_zones[zid], pfn)) {
+ int z;
+
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ if (zone_spans_pfn(&pgdat->node_zones[z], pfn)) {
+ zid = z;
+ break;
+ }
+ }
+ }
+
+ __init_single_page(pfn_to_page(pfn), pfn, zid, node);
__SetPageReserved(pfn_to_page(pfn));
pgcnt++;
}
--
2.52.0
* [RFC PATCH 05/45] mm: vmstat: restore per-migratetype free counts in /proc/pagetypeinfo
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (3 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 04/45] mm: mm_init: fix zone assignment for pages in unavailable ranges Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 06/45] mm: page_alloc: remove watermark boost mechanism Rik van Riel
` (40 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The SPB rework moved free pages off zone->free_area[] and onto the
per-superpageblock free lists at zone->superpageblocks[i].free_area[].
pagetypeinfo_showfree_print() was still walking the now-empty zone-level
free lists, so /proc/pagetypeinfo's "Free pages count per migrate type
at order" table read as all zeros.
Walk every SPB in the zone, accumulating counts per (migratetype, order)
into stack-allocated 2-D arrays, then emit one line per migratetype.
zone->lock is dropped between SPBs (matching the original printer's
unlock/cond_resched/lock pattern) to bound time under the lock. The
100000-per-cell cap is retained -- it is now cumulative across all SPBs
in the zone, which is effectively the same semantics as before, since the
old free_area was already per-zone.
Concurrent memory hotplug can swap zone->superpageblocks under us during
a lock drop; the counts may then be inconsistent, but no UAF is possible
since sb is re-derefed each iteration. Acceptable for a debug-only
interface.
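For scale, the on-stack accumulation arrays are small; assuming
MIGRATE_TYPES == 6 and NR_PAGE_ORDERS == 11 (both depend on config):

	unsigned long counts[6][11];	/* 6 * 11 * 8 bytes = 528 bytes */
	bool overflow[6][11];		/* 6 * 11 * 1 byte  =  66 bytes */

roughly 600 bytes on the seq_file read path.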
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/vmstat.c | 66 ++++++++++++++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 28 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 86b14b0f77b5..7de08ab61b9d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1572,41 +1572,51 @@ static int frag_show(struct seq_file *m, void *arg)
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
+ unsigned long counts[MIGRATE_TYPES][NR_PAGE_ORDERS] = { };
+ bool overflow[MIGRATE_TYPES][NR_PAGE_ORDERS] = { };
+ unsigned long sb_idx, nr_sbs = zone->nr_superpageblocks;
int order, mtype;
+ /*
+ * Free pages live on per-superpageblock free lists. Walk the SPBs,
+ * accumulating per (migratetype, order) counts. The 100000 cap per
+ * cell limits time under zone->lock; this is a debugging interface,
+ * knowing there is "a lot" of one size is sufficient. zone->lock is
+ * dropped between SPBs, so concurrent memory hotplug may produce
+ * inconsistent counts -- acceptable for a debug-only interface.
+ */
+ for (sb_idx = 0; sb_idx < nr_sbs; sb_idx++) {
+ struct superpageblock *sb = &zone->superpageblocks[sb_idx];
+
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ struct free_area *area = &sb->free_area[order];
+ struct list_head *curr;
+
+ for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+ if (overflow[mtype][order])
+ continue;
+ list_for_each(curr, &area->free_list[mtype]) {
+ if (++counts[mtype][order] >= 100000) {
+ overflow[mtype][order] = true;
+ break;
+ }
+ }
+ }
+ }
+ spin_unlock_irq(&zone->lock);
+ cond_resched();
+ spin_lock_irq(&zone->lock);
+ }
+
for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
seq_printf(m, "Node %4d, zone %8s, type %12s ",
pgdat->node_id,
zone->name,
migratetype_names[mtype]);
- for (order = 0; order < NR_PAGE_ORDERS; ++order) {
- unsigned long freecount = 0;
- struct free_area *area;
- struct list_head *curr;
- bool overflow = false;
-
- area = &(zone->free_area[order]);
-
- list_for_each(curr, &area->free_list[mtype]) {
- /*
- * Cap the free_list iteration because it might
- * be really large and we are under a spinlock
- * so a long time spent here could trigger a
- * hard lockup detector. Anyway this is a
- * debugging tool so knowing there is a handful
- * of pages of this order should be more than
- * sufficient.
- */
- if (++freecount >= 100000) {
- overflow = true;
- break;
- }
- }
- seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
- spin_unlock_irq(&zone->lock);
- cond_resched();
- spin_lock_irq(&zone->lock);
- }
+ for (order = 0; order < NR_PAGE_ORDERS; order++)
+ seq_printf(m, "%s%6lu ",
+ overflow[mtype][order] ? ">" : "",
+ counts[mtype][order]);
seq_putc(m, '\n');
}
}
--
2.52.0
* [RFC PATCH 06/45] mm: page_alloc: remove watermark boost mechanism
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (4 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 05/45] mm: vmstat: restore per-migratetype free counts in /proc/pagetypeinfo Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 07/45] mm: page_alloc: async evacuation of stolen movable pageblocks Rik van Riel
` (39 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
watermark_boost was introduced to react to fragmentation events at the
pageblock granularity: a sub-pageblock cross-type fallback would raise
the zone watermark and wake kswapd, on the theory that reclaiming some
order-0 pages would reduce future fallbacks.
With superpageblocks, anti-fragmentation is enforced at 1 GiB SPB
granularity, and the meaningful signals (CLEAN->TAINT events, empty SPB
count) live there. Sub-pageblock fallbacks inside an already-tainted
SPB do not change the fragmentation picture, and order-0 reclaim does
not unmix a pageblock or surface a fresh clean SPB.
Worse, the boost is applied in try_to_claim_block() before the success
path is decided. When option 1 (no UNMOVABLE/RECLAIMABLE pageblock
mixing) rejects a cross-type relabel, the boost has already been
applied and the next rmqueue() will wake kswapd to drain memory back
to high+boost - even when free pages are tens of times the high
watermark. Real workloads showed bursts of >150 wakeup_kswapd/min,
all order-0, with stack traces consistently arriving from rmqueue()
through the boost-cleanup path. Free memory at the time was 38x the
high watermark.
Drop the mechanism entirely:
- boost_watermark() and its callsite in try_to_claim_block()
- the ZONE_BOOSTED_WATERMARK flag and its set/clear in rmqueue()
- zone->watermark_boost and the boost addend in wmark_pages()
- the __GFP_HIGH boost-bypass path in zone_watermark_fast()
- the watermark_boost_factor sysctl
- boost-aware logic in balance_pgdat() (nr_boost_reclaim,
zone_boosts[], pgdat_watermark_boosted, the boost-restart goto,
no-writeback for boost reclaim, the boost-only kcompactd wakeup)
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
Documentation/admin-guide/sysctl/vm.rst | 21 -----
Documentation/mm/physical_memory.rst | 13 +--
include/linux/mmzone.h | 6 +-
mm/page_alloc.c | 82 +----------------
mm/show_mem.c | 2 -
mm/vmscan.c | 115 ++----------------------
mm/vmstat.c | 2 -
7 files changed, 14 insertions(+), 227 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c..3ddc6115c89a 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -76,7 +76,6 @@ files can be found in mm/swap.c.
- user_reserve_kbytes
- vfs_cache_pressure
- vfs_cache_pressure_denom
-- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode
@@ -1073,26 +1072,6 @@ vfs_cache_pressure_denom
Defaults to 100 (minimum allowed value). Requires corresponding
vfs_cache_pressure setting to take effect.
-watermark_boost_factor
-======================
-
-This factor controls the level of reclaim when memory is being fragmented.
-It defines the percentage of the high watermark of a zone that will be
-reclaimed if pages of different mobility are being mixed within pageblocks.
-The intent is that compaction has less work to do in the future and to
-increase the success rate of future high-order allocations such as SLUB
-allocations, THP and hugetlbfs pages.
-
-To make it sensible with respect to the watermark_scale_factor
-parameter, the unit is in fractions of 10,000. The default value of
-15,000 means that up to 150% of the high watermark will be reclaimed in the
-event of a pageblock being mixed due to fragmentation. The level of reclaim
-is determined by the number of fragmentation events that occurred in the
-recent past. If this value is smaller than a pageblock then a pageblocks
-worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
-of 0 will disable the feature.
-
-
watermark_scale_factor
======================
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index b76183545e5b..c4968db6e77c 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -394,11 +394,6 @@ General
to the distance between two watermarks. The distance itself is calculated
taking ``vm.watermark_scale_factor`` sysctl into account.
-``watermark_boost``
- The number of pages which are used to boost watermarks to increase reclaim
- pressure to reduce the likelihood of future fallbacks and wake kswapd now
- as the node may be balanced overall and kswapd will not wake naturally.
-
``nr_reserved_highatomic``
The number of pages which are reserved for high-order atomic allocations.
@@ -527,11 +522,9 @@ General
Defined only when ``CONFIG_UNACCEPTED_MEMORY`` is enabled.
``flags``
- The zone flags. The least three bits are used and defined by
- ``enum zone_flags``. ``ZONE_BOOSTED_WATERMARK`` (bit 0): zone recently boosted
- watermarks. Cleared when kswapd is woken. ``ZONE_RECLAIM_ACTIVE`` (bit 1):
- kswapd may be scanning the zone. ``ZONE_BELOW_HIGH`` (bit 2): zone is below
- high watermark.
+ The zone flags. The bits are defined by ``enum zone_flags``.
+ ``ZONE_RECLAIM_ACTIVE`` (bit 0): kswapd may be scanning the zone.
+ ``ZONE_BELOW_HIGH`` (bit 1): zone is below high watermark.
``lock``
The main lock that protects the internal data structures of the page allocator
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a59260487ab4..5d1869fd2708 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -881,7 +881,6 @@ struct zone {
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK];
- unsigned long watermark_boost;
unsigned long nr_reserved_highatomic;
unsigned long nr_free_highatomic;
@@ -1067,9 +1066,6 @@ enum pgdat_flags {
};
enum zone_flags {
- ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
- * Cleared when kswapd is woken.
- */
ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
ZONE_BELOW_HIGH, /* zone is below high watermark. */
};
@@ -1077,7 +1073,7 @@ enum zone_flags {
static inline unsigned long wmark_pages(const struct zone *z,
enum zone_watermarks w)
{
- return z->_watermark[w] + z->watermark_boost;
+ return z->_watermark[w];
}
static inline unsigned long min_wmark_pages(const struct zone *z)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d98eab3e288e..5cc5edaf8111 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -301,7 +301,6 @@ const char * const migratetype_names[MIGRATE_TYPES] = {
int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
-static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
int defrag_mode;
@@ -2321,43 +2320,6 @@ bool pageblock_unisolate_and_move_free_pages(struct zone *zone, struct page *pag
#endif /* CONFIG_MEMORY_ISOLATION */
-static inline bool boost_watermark(struct zone *zone)
-{
- unsigned long max_boost;
-
- if (!watermark_boost_factor)
- return false;
- /*
- * Don't bother in zones that are unlikely to produce results.
- * On small machines, including kdump capture kernels running
- * in a small area, boosting the watermark can cause an out of
- * memory situation immediately.
- */
- if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
- return false;
-
- max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
- watermark_boost_factor, 10000);
-
- /*
- * high watermark may be uninitialised if fragmentation occurs
- * very early in boot so do not boost. We do not fall
- * through and boost by pageblock_nr_pages as failing
- * allocations that early means that reclaim is not going
- * to help and it may even be impossible to reclaim the
- * boosted watermark resulting in a hang.
- */
- if (!max_boost)
- return false;
-
- max_boost = max(pageblock_nr_pages, max_boost);
-
- zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
- max_boost);
-
- return true;
-}
-
/*
* When we are falling back to another migratetype during allocation, should we
* try to claim an entire block to satisfy further allocations, instead of
@@ -2458,14 +2420,6 @@ try_to_claim_block(struct zone *zone, struct page *page,
return page;
}
- /*
- * Boost watermarks to increase reclaim pressure to reduce the
- * likelihood of future fallbacks. Wake kswapd now as the node
- * may be balanced overall and kswapd will not wake naturally.
- */
- if (boost_watermark(zone) && (alloc_flags & ALLOC_KSWAPD))
- set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
-
/* moving whole block can fail due to zone boundary conditions */
if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages,
&movable_pages))
@@ -3723,13 +3677,6 @@ struct page *rmqueue(struct zone *preferred_zone,
migratetype);
out:
- /* Separate test+clear to avoid unnecessary atomics */
- if ((alloc_flags & ALLOC_KSWAPD) &&
- unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
- clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
- wakeup_kswapd(zone, 0, 0, zone_idx(zone));
- }
-
VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
return page;
}
@@ -4007,24 +3954,8 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
return true;
}
- if (__zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
- free_pages))
- return true;
-
- /*
- * Ignore watermark boosting for __GFP_HIGH order-0 allocations
- * when checking the min watermark. The min watermark is the
- * point where boosting is ignored so that kswapd is woken up
- * when below the low watermark.
- */
- if (unlikely(!order && (alloc_flags & ALLOC_MIN_RESERVE) && z->watermark_boost
- && ((alloc_flags & ALLOC_WMARK_MASK) == WMARK_MIN))) {
- mark = z->_watermark[WMARK_MIN];
- return __zone_watermark_ok(z, order, mark, highest_zoneidx,
- alloc_flags, free_pages);
- }
-
- return false;
+ return __zone_watermark_ok(z, order, mark, highest_zoneidx, alloc_flags,
+ free_pages);
}
#ifdef CONFIG_NUMA
@@ -6824,7 +6755,6 @@ static void __setup_per_zone_wmarks(void)
mult_frac(zone_managed_pages(zone),
watermark_scale_factor, 10000));
- zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = low_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_PROMO] = high_wmark_pages(zone) + tmp;
@@ -7092,14 +7022,6 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
.proc_handler = min_free_kbytes_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
- {
- .procname = "watermark_boost_factor",
- .data = &watermark_boost_factor,
- .maxlen = sizeof(watermark_boost_factor),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = SYSCTL_ZERO,
- },
{
.procname = "watermark_scale_factor",
.data = &watermark_scale_factor,
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 24078ac3e6bc..bbbbef5baed7 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -298,7 +298,6 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
printk(KERN_CONT
"%s"
" free:%lukB"
- " boost:%lukB"
" min:%lukB"
" low:%lukB"
" high:%lukB"
@@ -321,7 +320,6 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
- K(zone->watermark_boost),
K(min_wmark_pages(zone)),
K(low_wmark_pages(zone)),
K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..879cea20dd57 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6725,30 +6725,6 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
} while (memcg);
}
-static bool pgdat_watermark_boosted(pg_data_t *pgdat, int highest_zoneidx)
-{
- int i;
- struct zone *zone;
-
- /*
- * Check for watermark boosts top-down as the higher zones
- * are more likely to be boosted. Both watermarks and boosts
- * should not be checked at the same time as reclaim would
- * start prematurely when there is no boosting and a lower
- * zone is balanced.
- */
- for (i = highest_zoneidx; i >= 0; i--) {
- zone = pgdat->node_zones + i;
- if (!managed_zone(zone))
- continue;
-
- if (zone->watermark_boost)
- return true;
- }
-
- return false;
-}
-
/*
* Returns true if there is an eligible zone balanced for the request order
* and highest_zoneidx
@@ -6953,14 +6929,13 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
unsigned long pflags;
- unsigned long nr_boost_reclaim;
- unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
- bool boosted;
struct zone *zone;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.order = order,
.may_unmap = 1,
+ .may_writepage = 1,
+ .may_swap = 1,
};
set_task_reclaim_state(current, &sc.reclaim_state);
@@ -6969,18 +6944,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
count_vm_event(PAGEOUTRUN);
- /*
- * Account for the reclaim boost. Note that the zone boost is left in
- * place so that parallel allocations that are near the watermark will
- * stall or direct reclaim until kswapd is finished.
- */
- nr_boost_reclaim = 0;
- for_each_managed_zone_pgdat(zone, pgdat, i, highest_zoneidx) {
- nr_boost_reclaim += zone->watermark_boost;
- zone_boosts[i] = zone->watermark_boost;
- }
- boosted = nr_boost_reclaim;
-
restart:
set_reclaim_active(pgdat, highest_zoneidx);
sc.priority = DEF_PRIORITY;
@@ -7015,39 +6978,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
}
/*
- * If the pgdat is imbalanced then ignore boosting and preserve
- * the watermarks for a later time and restart. Note that the
- * zone watermarks will be still reset at the end of balancing
- * on the grounds that the normal reclaim should be enough to
- * re-evaluate if boosting is required when kswapd next wakes.
+ * If there are no eligible zones, no work to do. Note that
+ * sc.reclaim_idx is not used as buffer_heads_over_limit may
+ * have adjusted it.
*/
balanced = pgdat_balanced(pgdat, sc.order, highest_zoneidx);
- if (!balanced && nr_boost_reclaim) {
- nr_boost_reclaim = 0;
- goto restart;
- }
-
- /*
- * If boosting is not active then only reclaim if there are no
- * eligible zones. Note that sc.reclaim_idx is not used as
- * buffer_heads_over_limit may have adjusted it.
- */
- if (!nr_boost_reclaim && balanced)
+ if (balanced)
goto out;
- /* Limit the priority of boosting to avoid reclaim writeback */
- if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
- raise_priority = false;
-
- /*
- * Do not writeback or swap pages for boosted reclaim. The
- * intent is to relieve pressure not issue sub-optimal IO
- * from reclaim context. If no pages are reclaimed, the
- * reclaim will be aborted.
- */
- sc.may_writepage = !nr_boost_reclaim;
- sc.may_swap = !nr_boost_reclaim;
-
/*
* Do some background aging, to give pages a chance to be
* referenced before reclaiming. All pages are rotated
@@ -7091,15 +7029,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
* progress in reclaiming pages
*/
nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
- nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
-
- /*
- * If reclaim made no progress for a boost, stop reclaim as
- * IO cannot be queued and it could be an infinite loop in
- * extreme circumstances.
- */
- if (nr_boost_reclaim && !nr_reclaimed)
- break;
if (raise_priority || !nr_reclaimed)
sc.priority--;
@@ -7115,12 +7044,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
goto restart;
}
- /*
- * If the reclaim was boosted, we might still be far from the
- * watermark_high at this point. We need to avoid increasing the
- * failure count to prevent the kswapd thread from stopping.
- */
- if (!sc.nr_reclaimed && !boosted) {
+ if (!sc.nr_reclaimed) {
int fail_cnt = atomic_inc_return(&pgdat->kswapd_failures);
/* kswapd context, low overhead to trace every failure */
trace_mm_vmscan_kswapd_reclaim_fail(pgdat->node_id, fail_cnt);
@@ -7129,28 +7053,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
out:
clear_reclaim_active(pgdat, highest_zoneidx);
- /* If reclaim was boosted, account for the reclaim done in this pass */
- if (boosted) {
- unsigned long flags;
-
- for (i = 0; i <= highest_zoneidx; i++) {
- if (!zone_boosts[i])
- continue;
-
- /* Increments are under the zone lock */
- zone = pgdat->node_zones + i;
- spin_lock_irqsave(&zone->lock, flags);
- zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
- spin_unlock_irqrestore(&zone->lock, flags);
- }
-
- /*
- * As there is now likely space, wakeup kcompact to defragment
- * pageblocks.
- */
- wakeup_kcompactd(pgdat, pageblock_order, highest_zoneidx);
- }
-
snapshot_refaults(NULL, pgdat);
__fs_reclaim_release(_THIS_IP_);
psi_memstall_leave(&pflags);
@@ -7384,8 +7286,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
/* Hopeless node, leave it to direct reclaim if possible */
if (kswapd_test_hopeless(pgdat) ||
- (pgdat_balanced(pgdat, order, highest_zoneidx) &&
- !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
+ pgdat_balanced(pgdat, order, highest_zoneidx)) {
/*
* There may be plenty of free memory available, but it's too
* fragmented for high-order allocations. Wake up kcompactd
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7de08ab61b9d..32027b8c0526 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1776,7 +1776,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
}
seq_printf(m,
"\n pages free %lu"
- "\n boost %lu"
"\n min %lu"
"\n low %lu"
"\n high %lu"
@@ -1786,7 +1785,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n managed %lu"
"\n cma %lu",
zone_page_state(zone, NR_FREE_PAGES),
- zone->watermark_boost,
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
--
2.52.0
* [RFC PATCH 07/45] mm: page_alloc: async evacuation of stolen movable pageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (5 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 06/45] mm: page_alloc: remove watermark boost mechanism Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 08/45] mm: page_alloc: track actual page contents in pageblock flags Rik van Riel
` (38 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
When the page allocator steals a movable pageblock for unmovable or
reclaimable allocations (via try_to_claim_block), the remaining movable
pages in that block can prevent future unmovable/reclaimable allocations
from being concentrated in fewer pageblocks, leading to long-term memory
fragmentation.
Add a lightweight asynchronous evacuation mechanism: when a movable
pageblock is claimed for unmovable/reclaimable use, queue a work item to
migrate the remaining movable pages out. This allows future
unmovable/reclaimable allocations to be satisfied from the now-evacuated
block, keeping those allocation types concentrated and reducing
fragmentation.
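The handoff from allocator context to the workqueue happens in two stages,
because queue_work() is not safe to call with zone->lock held and IRQs
disabled. A condensed sketch of the flow implemented below (the real code
also handles pool exhaustion and a missing workqueue):

	/*
	 * Allocator context, zone->lock held, IRQs off: park the request
	 * on a lockless list and poke an irq_work; both are safe here.
	 */
	llist_add(&item->free_node, &pgdat->evacuate_pending);
	irq_work_queue(&pgdat->evacuate_irq_work);

	/*
	 * irq_work callback, safe context: drain the list and hand each
	 * parked item to the per-node workqueue for actual migration.
	 */
	pending = llist_del_all(&pgdat->evacuate_pending);
	llist_for_each_entry_safe(item, next, pending, free_node) {
		INIT_WORK(&item->work, evacuate_work_fn);
		queue_work(pgdat->evacuate_wq, &item->work);
	}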
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 4 +
mm/page_alloc.c | 223 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 227 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5d1869fd2708..2ab45d1133d9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -22,6 +22,7 @@
#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/local_lock.h>
+#include <linux/irq_work_types.h>
#include <linux/zswap.h>
#include <asm/page.h>
@@ -1440,6 +1441,9 @@ typedef struct pglist_data {
wait_queue_head_t kcompactd_wait;
struct task_struct *kcompactd;
bool proactive_compact_trigger;
+ struct workqueue_struct *evacuate_wq;
+ struct llist_head evacuate_pending;
+ struct irq_work evacuate_irq_work;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5cc5edaf8111..45c25c4fc7c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/interrupt.h>
+#include <linux/irq_work.h>
#include <linux/jiffies.h>
#include <linux/compiler.h>
#include <linux/kernel.h>
@@ -51,6 +52,7 @@
#include <linux/lockdep.h>
#include <linux/psi.h>
#include <linux/khugepaged.h>
+#include <linux/workqueue.h>
#include <linux/delayacct.h>
#include <linux/cacheinfo.h>
#include <linux/pgalloc_tag.h>
@@ -59,6 +61,10 @@
#include "shuffle.h"
#include "page_reporting.h"
+#ifdef CONFIG_COMPACTION
+static void queue_pageblock_evacuate(struct zone *zone, unsigned long pfn);
+#endif
+
/* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
typedef int __bitwise fpi_t;
@@ -2409,6 +2415,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
int free_pages, movable_pages, alike_pages;
unsigned long start_pfn;
+ /*
+ * Don't steal from pageblocks that are isolated for
+ * evacuation — that would undo the work in progress.
+ */
+ if (get_pageblock_isolate(page))
+ return NULL;
+
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
unsigned int nr_added;
@@ -2454,6 +2467,18 @@ try_to_claim_block(struct zone *zone, struct page *page,
page_group_by_mobility_disabled) {
__move_freepages_block(zone, start_pfn, block_type, start_type);
set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
+#ifdef CONFIG_COMPACTION
+ /*
+ * A movable pageblock was just claimed for unmovable or
+ * reclaimable use. Queue async evacuation of the remaining
+ * movable pages so future unmovable/reclaimable allocations
+ * can stay concentrated in fewer pageblocks.
+ */
+ if (block_type == MIGRATE_MOVABLE &&
+ (start_type == MIGRATE_UNMOVABLE ||
+ start_type == MIGRATE_RECLAIMABLE))
+ queue_pageblock_evacuate(zone, start_pfn);
+#endif
return __rmqueue_smallest(zone, order, start_type);
}
@@ -7089,6 +7114,204 @@ void __init page_alloc_sysctl_init(void)
register_sysctl_init("vm", page_alloc_sysctl_table);
}
+#ifdef CONFIG_COMPACTION
+/*
+ * Pageblock evacuation: asynchronously migrate movable pages out of
+ * pageblocks that were stolen for unmovable/reclaimable allocations.
+ * This keeps unmovable/reclaimable allocations concentrated in fewer
+ * pageblocks, reducing long-term fragmentation.
+ *
+ * Uses a global pool of 64 pre-allocated work items (~3.5KB total)
+ * and a per-pgdat workqueue to keep migration node-local.
+ */
+
+struct evacuate_item {
+ struct work_struct work;
+ struct zone *zone;
+ unsigned long start_pfn;
+ struct llist_node free_node;
+};
+
+#define NR_EVACUATE_ITEMS 64
+static struct evacuate_item evacuate_pool[NR_EVACUATE_ITEMS];
+static struct llist_head evacuate_freelist;
+
+static struct evacuate_item *evacuate_item_alloc(void)
+{
+ struct llist_node *node;
+
+ node = llist_del_first(&evacuate_freelist);
+ if (!node)
+ return NULL;
+ return container_of(node, struct evacuate_item, free_node);
+}
+
+static void evacuate_item_free(struct evacuate_item *item)
+{
+ llist_add(&item->free_node, &evacuate_freelist);
+}
+
+static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
+{
+ unsigned long end_pfn = start_pfn + pageblock_nr_pages;
+ unsigned long pfn = start_pfn;
+ int nr_reclaimed;
+ int ret = 0;
+ struct compact_control cc = {
+ .nr_migratepages = 0,
+ .order = -1,
+ .zone = zone,
+ .mode = MIGRATE_ASYNC,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ };
+ struct migration_target_control mtc = {
+ .nid = zone_to_nid(zone),
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ };
+
+ /* Verify this pageblock is still worth evacuating */
+ if (get_pageblock_migratetype(pfn_to_page(start_pfn)) == MIGRATE_MOVABLE)
+ return;
+
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ /*
+ * Loop through the entire pageblock, isolating and migrating
+ * in batches. isolate_migratepages_range stops at
+ * COMPACT_CLUSTER_MAX, so we must loop to cover the full block.
+ */
+ while (pfn < end_pfn || !list_empty(&cc.migratepages)) {
+ if (list_empty(&cc.migratepages)) {
+ cc.nr_migratepages = 0;
+ cc.migrate_pfn = pfn;
+ ret = isolate_migratepages_range(&cc, pfn, end_pfn);
+ if (ret && ret != -EAGAIN)
+ break;
+ pfn = cc.migrate_pfn;
+ if (list_empty(&cc.migratepages))
+ break;
+ }
+
+ nr_reclaimed = reclaim_clean_pages_from_list(zone,
+ &cc.migratepages);
+ cc.nr_migratepages -= nr_reclaimed;
+
+ if (!list_empty(&cc.migratepages)) {
+ ret = migrate_pages(&cc.migratepages,
+ alloc_migration_target, NULL,
+ (unsigned long)&mtc, cc.mode,
+ MR_COMPACTION, NULL);
+ if (ret) {
+ putback_movable_pages(&cc.migratepages);
+ break;
+ }
+ }
+
+ cond_resched();
+ }
+
+ if (!list_empty(&cc.migratepages))
+ putback_movable_pages(&cc.migratepages);
+}
+
+static void evacuate_work_fn(struct work_struct *work)
+{
+ struct evacuate_item *item = container_of(work, struct evacuate_item,
+ work);
+ evacuate_pageblock(item->zone, item->start_pfn);
+ evacuate_item_free(item);
+}
+
+/**
+ * evacuate_irq_work_fn - IRQ work callback to drain pending evacuations
+ * @work: the irq_work embedded in pg_data_t
+ *
+ * queue_work() can deadlock when called from inside the page allocator
+ * because it may try to allocate memory with locks already held.
+ * Use irq_work to defer the queue_work() calls to a safe context.
+ */
+static void evacuate_irq_work_fn(struct irq_work *work)
+{
+ pg_data_t *pgdat = container_of(work, pg_data_t,
+ evacuate_irq_work);
+ struct llist_node *pending;
+ struct evacuate_item *item, *next;
+
+ if (!pgdat->evacuate_wq)
+ return;
+
+ /*
+ * Collect all pending items first, then queue them. Use _safe
+ * because evacuate_work_fn() may run immediately on another
+ * CPU and free the item before we follow the next pointer.
+ */
+ pending = llist_del_all(&pgdat->evacuate_pending);
+ llist_for_each_entry_safe(item, next, pending, free_node) {
+ INIT_WORK(&item->work, evacuate_work_fn);
+ queue_work(pgdat->evacuate_wq, &item->work);
+ }
+}
+
+/**
+ * queue_pageblock_evacuate - schedule async evacuation of movable pages
+ * @zone: the zone containing the pageblock
+ * @pfn: start PFN of the pageblock (must be pageblock-aligned)
+ *
+ * Called from the page allocator when a movable pageblock is claimed
+ * for unmovable or reclaimable allocations. Queues the pageblock for
+ * background migration of its remaining movable pages. Uses irq_work
+ * to defer the actual queue_work() call outside the allocator's lock
+ * context.
+ */
+static void queue_pageblock_evacuate(struct zone *zone, unsigned long pfn)
+{
+ struct evacuate_item *item;
+ pg_data_t *pgdat = zone->zone_pgdat;
+
+ if (!pgdat->evacuate_irq_work.func)
+ return;
+
+ item = evacuate_item_alloc();
+ if (!item)
+ return;
+
+ item->zone = zone;
+ item->start_pfn = pfn;
+ llist_add(&item->free_node, &pgdat->evacuate_pending);
+ irq_work_queue(&pgdat->evacuate_irq_work);
+}
+
+static int __init pageblock_evacuate_init(void)
+{
+ int nid, i;
+
+ /* Initialize the global freelist of work items */
+ init_llist_head(&evacuate_freelist);
+ for (i = 0; i < NR_EVACUATE_ITEMS; i++)
+ llist_add(&evacuate_pool[i].free_node, &evacuate_freelist);
+
+ /* Create a per-pgdat workqueue */
+ for_each_online_node(nid) {
+ pg_data_t *pgdat = NODE_DATA(nid);
+ char name[32];
+
+ snprintf(name, sizeof(name), "kevacuate/%d", nid);
+ pgdat->evacuate_wq = alloc_workqueue(name, WQ_MEM_RECLAIM, 1);
+ if (!pgdat->evacuate_wq) {
+ pr_warn("Failed to create evacuate workqueue for node %d\n", nid);
+ continue;
+ }
+
+ init_llist_head(&pgdat->evacuate_pending);
+ init_irq_work(&pgdat->evacuate_irq_work,
+ evacuate_irq_work_fn);
+ }
+
+ return 0;
+}
+late_initcall(pageblock_evacuate_init);
+#endif /* CONFIG_COMPACTION */
+
#ifdef CONFIG_CONTIG_ALLOC
/* Usage: See admin-guide/dynamic-debug-howto.rst */
static void alloc_contig_dump_pages(struct list_head *page_list)
--
2.52.0
* [RFC PATCH 08/45] mm: page_alloc: track actual page contents in pageblock flags
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (6 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 07/45] mm: page_alloc: async evacuation of stolen movable pageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 09/45] mm: page_alloc: introduce superpageblock metadata for 1GB anti-fragmentation Rik van Riel
` (37 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Extend pageblock_data flags with PB_has_unmovable, PB_has_reclaimable, and
PB_has_movable bits to track the actual types of pages allocated within a
pageblock, independent of its intended migratetype.
The PB_has_* flags are maintained at three points; setting happens only at
steal time in try_to_claim_block(), which avoids adding overhead to every
allocation in __rmqueue_smallest():
1. Allocation / steal time: when try_to_claim_block() claims a pageblock,
set the PB_has_* flag corresponding to the allocation's migratetype. If
unmovable or reclaimable pages are being placed into a pageblock that
already has PB_has_movable set, queue async evacuation of the remaining
movable pages.
2. Full pageblock free: when buddy merging reconstructs a complete
pageblock in __free_one_page(), clear all PB_has_* flags since the block is
now empty.
3. Migration scan: when isolate_migratepages_block() completes a full
pageblock scan and finds no movable pages to isolate, clear PB_has_movable.
This consolidates the clearing for all callers: evacuate_pageblock(),
compaction, and alloc_contig_range().
This provides the foundation for superpageblock-level steering decisions:
knowing which pageblocks actually contain unmovable/reclaimable pages
allows directing future allocations to already-tainted regions, keeping
clean regions available for large contiguous allocations.
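As a rough sketch of one bit's lifecycle, using the pfnblock accessors this
series already relies on (PB_has_movable shown; the other two bits follow
the same pattern):

	/* 1. Steal time, try_to_claim_block(): record what is being placed. */
	set_pfnblock_bit(page, pfn, PB_has_movable);

	/* 2. Whole pageblock freed, __free_one_page(): nothing is allocated
	 *    any more, so all PB_has_* bits are cleared.
	 */
	clear_pfnblock_bit(page, pfn, PB_has_movable);

	/* 3. Migration scan, isolate_migratepages_block(): a full scan that
	 *    saw no movable pages proves the bit is stale and clears it.
	 */
	if (get_pfnblock_bit(page, pfn, PB_has_movable))
		clear_pfnblock_bit(page, pfn, PB_has_movable);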
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/pageblock-flags.h | 9 ++++
mm/compaction.c | 17 ++++++
mm/page_alloc.c | 93 +++++++++++++++++++++++++--------
3 files changed, 98 insertions(+), 21 deletions(-)
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index e046278a01fa..21bfcdf80b2e 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -20,6 +20,15 @@ enum pageblock_bits {
PB_migrate_2,
PB_compact_skip,/* If set the block is skipped by compaction */
+ /*
+ * Track actual page contents independent of the intended migratetype.
+ * Set at allocation time; cleared on full pageblock free or when
+ * migration confirms no pages of that type remain.
+ */
+ PB_has_unmovable,
+ PB_has_reclaimable,
+ PB_has_movable,
+
#ifdef CONFIG_MEMORY_ISOLATION
/*
* Pageblock isolation is represented with a separate bit, so that
diff --git a/mm/compaction.c b/mm/compaction.c
index 1e8f8eca318c..cf2a5074c473 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -849,6 +849,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
bool skip_updated = false;
+ bool movable_skipped = false;
int ret = 0;
cc->migrate_pfn = low_pfn;
@@ -1061,6 +1062,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
folio = page_folio(page);
goto isolate_success;
}
+ movable_skipped = true;
}
goto isolate_fail;
@@ -1229,6 +1231,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unlock_page_lruvec_irqrestore(locked, flags);
locked = NULL;
}
+ movable_skipped = true;
folio_put(folio);
isolate_fail:
@@ -1292,6 +1295,20 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!cc->no_set_skip_hint && valid_page && !skip_updated)
set_pageblock_skip(valid_page);
update_cached_migrate(cc, low_pfn);
+
+ /*
+ * Full pageblock scanned with no movable pages isolated.
+ * Only clear PB_has_movable if no movable pages were
+ * seen at all. If movable pages exist but could not be
+ * isolated (pinned, writeback, dirty, etc.), leave the
+ * flag set so a future migration attempt can try again.
+ */
+ if (!nr_isolated && !movable_skipped && valid_page &&
+ get_pfnblock_bit(valid_page, pageblock_start_pfn(start_pfn),
+ PB_has_movable))
+ clear_pfnblock_bit(valid_page,
+ pageblock_start_pfn(start_pfn),
+ PB_has_movable);
}
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45c25c4fc7c0..d0a4de435842 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -972,6 +972,30 @@ static void change_pageblock_range(struct page *pageblock_page,
}
}
+/*
+ * mark_pageblock_free - handle a pageblock becoming fully free
+ * @page: page at the start of the pageblock
+ * @pfn: page frame number
+ *
+ * Clear stale PCP ownership and actual-contents tracking flags when
+ * buddy merging reconstructs a full pageblock or a whole pageblock is
+ * freed directly. No PCP can still hold pages from this block (otherwise
+ * the buddy merge couldn't have completed), so the ownership entry would
+ * just cause misrouted frees.
+ */
+static void mark_pageblock_free(struct page *page, unsigned long pfn)
+{
+ clear_pcpblock_owner(page);
+
+ /*
+ * The entire block is now free — clear actual-contents tracking
+ * flags since no allocated pages remain.
+ */
+ clear_pfnblock_bit(page, pfn, PB_has_unmovable);
+ clear_pfnblock_bit(page, pfn, PB_has_reclaimable);
+ clear_pfnblock_bit(page, pfn, PB_has_movable);
+}
+
/*
* Freeing function for a buddy system allocator.
*
@@ -1017,19 +1041,14 @@ static inline void __free_one_page(struct page *page,
account_freepages(zone, 1 << order, migratetype);
/*
- * For whole blocks, ownership returns to the zone. There are
- * no more outstanding frees to route through that CPU's PCP,
- * and we don't want to confuse any future users of the pages
- * in this block. E.g. rmqueue_buddy().
- *
- * Check here if a whole block came in directly: pre-merged in
- * the PCP, or PCP contended and bypassed.
- *
- * There is another check in the loop below if a block merges
- * up with pages already on the zone buddy.
+ * When freeing a whole pageblock, clear stale PCP ownership
+ * and actual-contents tracking flags up front. The in-loop
+ * check only fires when sub-pageblock pages merge *up to*
+ * pageblock_order, not when entering at pageblock_order
+ * directly.
*/
if (order == pageblock_order)
- clear_pcpblock_owner(page);
+ mark_pageblock_free(page, pfn);
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
@@ -1081,9 +1100,13 @@ static inline void __free_one_page(struct page *page,
pfn = combined_pfn;
order++;
- /* Clear owner also when we merge up. See above */
+ /*
+ * If merging has reconstructed a full pageblock,
+ * clear any stale PCP ownership and actual-contents
+ * tracking flags.
+ */
if (order == pageblock_order)
- clear_pcpblock_owner(page);
+ mark_pageblock_free(page, pfn);
}
done_merging:
@@ -2469,15 +2492,32 @@ try_to_claim_block(struct zone *zone, struct page *page,
set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
#ifdef CONFIG_COMPACTION
/*
- * A movable pageblock was just claimed for unmovable or
- * reclaimable use. Queue async evacuation of the remaining
- * movable pages so future unmovable/reclaimable allocations
- * can stay concentrated in fewer pageblocks.
+ * Track actual page contents in pageblock flags.
+ * Mark the pageblock with the type being allocated, and
+ * if unmovable/reclaimable pages are being placed into a
+ * pageblock that already has movable pages, queue async
+ * evacuation of the movable pages.
*/
- if (block_type == MIGRATE_MOVABLE &&
- (start_type == MIGRATE_UNMOVABLE ||
- start_type == MIGRATE_RECLAIMABLE))
- queue_pageblock_evacuate(zone, start_pfn);
+ {
+ struct page *start_page = pfn_to_page(start_pfn);
+
+ if (start_type == MIGRATE_UNMOVABLE) {
+ set_pfnblock_bit(start_page, start_pfn,
+ PB_has_unmovable);
+ if (get_pfnblock_bit(start_page, start_pfn,
+ PB_has_movable))
+ queue_pageblock_evacuate(zone, start_pfn);
+ } else if (start_type == MIGRATE_RECLAIMABLE) {
+ set_pfnblock_bit(start_page, start_pfn,
+ PB_has_reclaimable);
+ if (get_pfnblock_bit(start_page, start_pfn,
+ PB_has_movable))
+ queue_pageblock_evacuate(zone, start_pfn);
+ } else if (start_type == MIGRATE_MOVABLE) {
+ set_pfnblock_bit(start_page, start_pfn,
+ PB_has_movable);
+ }
+ }
#endif
return __rmqueue_smallest(zone, order, start_type);
}
@@ -7212,6 +7252,17 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
if (!list_empty(&cc.migratepages))
putback_movable_pages(&cc.migratepages);
+
+ /*
+ * Re-scan to let isolate_migratepages_block clear PB_has_movable
+ * if no movable pages remain after evacuation.
+ */
+ cc.migrate_pfn = start_pfn;
+ cc.nr_migratepages = 0;
+ INIT_LIST_HEAD(&cc.migratepages);
+ isolate_migratepages_range(&cc, start_pfn, end_pfn);
+ if (!list_empty(&cc.migratepages))
+ putback_movable_pages(&cc.migratepages);
}
static void evacuate_work_fn(struct work_struct *work)
--
2.52.0
* [RFC PATCH 09/45] mm: page_alloc: introduce superpageblock metadata for 1GB anti-fragmentation
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (7 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 08/45] mm: page_alloc: track actual page contents in pageblock flags Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 10/45] mm: page_alloc: support superpageblock resize for memory hotplug Rik van Riel
` (36 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel
Introduce a 1GB (PUD-sized) "superpageblock" data structure to track
pageblock composition at a coarser granularity, enabling future steering of
unmovable/reclaimable allocations into already-tainted superpageblocks and
preserving clean superpageblocks for 1GB hugepage allocation.
Each superpageblock groups SUPERPAGEBLOCK_NR_PAGEBLOCKS pageblocks (512 on
x86_64 with 2MB pageblocks) and maintains:
- Counts of pageblocks by migratetype (nr_free, nr_unmovable,
nr_reclaimable, nr_movable, nr_reserved)
- A list_head for future organization by fullness category
- Identity (start_pfn, zone pointer)
Superpageblock counters are maintained by hooking into
init_pageblock_migratetype(). Memory holes and firmware-reserved regions
are tracked as reserved pageblocks by initializing all slots as reserved
during setup and decrementing as init_pageblock_migratetype() claims them.
The superpageblock array is allocated per-zone during boot via memblock. At
~48 bytes per superpageblock (~12KB for a 256GB system), the overhead is
negligible.
This is pure bookkeeping with no allocation behavior change.
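The overhead figures above work out as follows on x86_64, assuming default
structure padding:

	/*
	 * sizeof(struct superpageblock):
	 *   5 x u16 counters          10 bytes, padded to 16
	 *   struct list_head list     16 bytes
	 *   unsigned long start_pfn    8 bytes
	 *   struct zone *zone          8 bytes
	 *                             --------
	 *                             48 bytes
	 *
	 * A zone spanning 256GB covers 256 superpageblocks:
	 *   256 * 48 bytes = 12288 bytes, i.e. ~12KB.
	 */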
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 57 ++++++++++++++++++++++++++
mm/mm_init.c | 90 ++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 65 ++++++++++++++++++++++++++++++
3 files changed, 212 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2ab45d1133d9..a0e8ce4b7b79 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -877,6 +877,43 @@ enum zone_type {
#define ASYNC_AND_SYNC 2
+/*
+ * Superpageblock: 1GB (PUD-sized) region for anti-fragmentation tracking.
+ *
+ * Groups pageblocks to steer unmovable/reclaimable allocations into
+ * already-tainted superpageblocks, preserving clean superpageblocks for 1GB
+ * hugepage allocation.
+ *
+ * SUPERPAGEBLOCK_ORDER derived from PUD geometry:
+ * x86_64: PUD_SHIFT=30, PAGE_SHIFT=12 → order 18 → 1GB
+ * Each superpageblock contains SUPERPAGEBLOCK_NR_PAGEBLOCKS pageblocks
+ * (512 on x86_64 with 2MB pageblocks).
+ */
+#define SUPERPAGEBLOCK_ORDER (PUD_SHIFT - PAGE_SHIFT)
+#define SUPERPAGEBLOCK_NR_PAGES (1UL << SUPERPAGEBLOCK_ORDER)
+
+/*
+ * SUPERPAGEBLOCK_NR_PAGEBLOCKS depends on pageblock_order which may be
+ * variable (CONFIG_HUGETLB_PAGE_SIZE_VARIABLE).
+ */
+#define SUPERPAGEBLOCK_NR_PAGEBLOCKS (1UL << (SUPERPAGEBLOCK_ORDER - pageblock_order))
+
+struct superpageblock {
+ /* Pageblock counts by current migratetype */
+ u16 nr_free;
+ u16 nr_unmovable;
+ u16 nr_reclaimable;
+ u16 nr_movable;
+ u16 nr_reserved; /* holes, firmware, etc. */
+
+ /* For organizing superpageblocks by fullness category */
+ struct list_head list;
+
+ /* Identity */
+ unsigned long start_pfn;
+ struct zone *zone;
+};
+
struct zone {
/* Read-mostly fields */
@@ -919,6 +956,11 @@ struct zone {
struct pageblock_data *pageblock_data;
#endif /* CONFIG_SPARSEMEM */
+ /* Superpageblock array for 1GB anti-fragmentation tracking */
+ struct superpageblock *superpageblocks;
+ unsigned long nr_superpageblocks;
+ unsigned long superpageblock_base_pfn; /* 1GB-aligned base */
+
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
@@ -1059,6 +1101,21 @@ struct zone {
atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;
+static inline struct superpageblock *pfn_to_superpageblock(struct zone *zone,
+ unsigned long pfn)
+{
+ unsigned long idx;
+
+ if (!zone->superpageblocks)
+ return NULL;
+
+ idx = (pfn - zone->superpageblock_base_pfn) >> SUPERPAGEBLOCK_ORDER;
+ if (idx >= zone->nr_superpageblocks)
+ return NULL;
+
+ return &zone->superpageblocks[idx];
+}
+
enum pgdat_flags {
PGDAT_WRITEBACK, /* reclaim scanning has recently found
* many pages under writeback
diff --git a/mm/mm_init.c b/mm/mm_init.c
index b3f83452de72..1fb62342d1c6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1517,6 +1517,95 @@ static void __ref setup_usemap(struct zone *zone)
static inline void setup_usemap(struct zone *zone) {}
#endif /* CONFIG_SPARSEMEM */
+/**
+ * init_one_superpageblock - initialize a single superpageblock
+ * @sb: superpageblock to initialize
+ * @zone: owning zone
+ * @start_pfn: start PFN for this superpageblock
+ * @zone_start: zone start PFN (for clipping)
+ * @zone_end: zone end PFN (for clipping)
+ *
+ * Zero counters, compute the zone-clipped pageblock count.
+ * Used by both boot-time setup and memory hotplug resize.
+ */
+static void __meminit init_one_superpageblock(struct superpageblock *sb,
+ struct zone *zone,
+ unsigned long start_pfn,
+ unsigned long zone_start,
+ unsigned long zone_end)
+{
+ unsigned long sb_end = start_pfn + SUPERPAGEBLOCK_NR_PAGES;
+ unsigned long pb_start = max(start_pfn, zone_start);
+ unsigned long pb_end = min(sb_end, zone_end);
+ u16 actual_pbs;
+
+ sb->nr_unmovable = 0;
+ sb->nr_reclaimable = 0;
+ sb->nr_movable = 0;
+ sb->nr_free = 0;
+ INIT_LIST_HEAD(&sb->list);
+ sb->start_pfn = start_pfn;
+ sb->zone = zone;
+
+ /*
+ * Start with all pageblock slots as reserved.
+ * init_pageblock_migratetype() will decrement nr_reserved and
+ * increment the appropriate counter for each real pageblock.
+ * Holes and firmware-reserved regions stay counted as reserved.
+ *
+ * Only count pageblocks that fall within the zone's span.
+ * The first and last superpageblocks may extend beyond the
+ * zone boundaries. Use round-up division because a partial
+ * pageblock at the zone boundary still gets initialized by
+ * init_pageblock_migratetype().
+ */
+ actual_pbs = (pb_end > pb_start) ?
+ ((pb_end - pb_start + pageblock_nr_pages - 1) >>
+ pageblock_order) : 0;
+ sb->nr_reserved = actual_pbs;
+}
+
+static void __init setup_superpageblocks(struct zone *zone)
+{
+ unsigned long zone_start = zone->zone_start_pfn;
+ unsigned long zone_end = zone_start + zone->spanned_pages;
+ unsigned long sb_base, nr_superpageblocks;
+ size_t alloc_size;
+ unsigned long i;
+
+ zone->superpageblocks = NULL;
+ zone->nr_superpageblocks = 0;
+ zone->superpageblock_base_pfn = 0;
+
+ if (!zone->spanned_pages)
+ return;
+
+ /*
+ * Superpageblocks must be 1GB (PUD) aligned. Align the base down
+ * and the end up to cover all 1GB regions the zone spans.
+ */
+ sb_base = ALIGN_DOWN(zone_start, SUPERPAGEBLOCK_NR_PAGES);
+ nr_superpageblocks = (ALIGN(zone_end, SUPERPAGEBLOCK_NR_PAGES) - sb_base) >>
+ SUPERPAGEBLOCK_ORDER;
+
+ alloc_size = nr_superpageblocks * sizeof(struct superpageblock);
+ zone->superpageblocks = memblock_alloc_node(alloc_size, SMP_CACHE_BYTES,
+ zone_to_nid(zone));
+ if (!zone->superpageblocks) {
+ pr_warn("Failed to allocate %zu bytes for zone %s superpageblocks\n",
+ alloc_size, zone->name);
+ return;
+ }
+
+ zone->nr_superpageblocks = nr_superpageblocks;
+ zone->superpageblock_base_pfn = sb_base;
+
+ for (i = 0; i < nr_superpageblocks; i++)
+ init_one_superpageblock(&zone->superpageblocks[i], zone,
+ sb_base + (i << SUPERPAGEBLOCK_ORDER),
+ zone_start, zone_end);
+}
+
#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
@@ -1625,6 +1714,7 @@ static void __init free_area_init_core(struct pglist_data *pgdat)
continue;
setup_usemap(zone);
+ setup_superpageblocks(zone);
init_currently_empty_zone(zone, zone->zone_start_pfn, size);
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0a4de435842..a3837a30a7eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -501,6 +501,62 @@ void clear_pfnblock_bit(const struct page *page, unsigned long pfn,
clear_bit(pb_bit, get_pfnblock_flags_word(page, pfn));
}
+/*
+ * Map migratetype to PB_has_* bit index. Returns -1 for types that
+ * don't have a tracking bit (e.g. MIGRATE_ISOLATE).
+ */
+static inline int migratetype_to_has_bit(int migratetype)
+{
+ switch (migratetype) {
+ case MIGRATE_UNMOVABLE:
+ case MIGRATE_HIGHATOMIC:
+ return PB_has_unmovable;
+ case MIGRATE_RECLAIMABLE:
+ return PB_has_reclaimable;
+ case MIGRATE_MOVABLE:
+#ifdef CONFIG_CMA
+ case MIGRATE_CMA:
+#endif
+ return PB_has_movable;
+ default:
+ return -1;
+ }
+}
+
+/*
+ * __spb_set_has_type - set PB_has_* and increment type counter
+ *
+ * Idempotent: only increments the counter on the 0→1 bit transition.
+ */
+static void __spb_set_has_type(struct page *page, int migratetype)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct superpageblock *sb = pfn_to_superpageblock(page_zone(page), pfn);
+ int bit;
+
+ if (!sb)
+ return;
+
+ bit = migratetype_to_has_bit(migratetype);
+ if (bit < 0)
+ return;
+
+ if (!get_pfnblock_bit(page, pfn, bit)) {
+ set_pfnblock_bit(page, pfn, bit);
+ switch (bit) {
+ case PB_has_unmovable:
+ sb->nr_unmovable++;
+ break;
+ case PB_has_reclaimable:
+ sb->nr_reclaimable++;
+ break;
+ case PB_has_movable:
+ sb->nr_movable++;
+ break;
+ }
+ }
+}
+
/**
* set_pageblock_migratetype - Set the migratetype of a pageblock
* @page: The page within the block of interest
@@ -534,6 +590,7 @@ void __meminit init_pageblock_migratetype(struct page *page,
{
unsigned long pfn = page_to_pfn(page);
struct pageblock_data *pbd;
+ struct superpageblock *sb;
unsigned long flags;
if (unlikely(page_group_by_mobility_disabled &&
@@ -557,6 +614,14 @@ void __meminit init_pageblock_migratetype(struct page *page,
pbd = pfn_to_pageblock(page, pfn);
pbd->block_pfn = pfn;
INIT_LIST_HEAD(&pbd->cpu_node);
+
+ /* Transition from reserved (boot default) to initial migratetype */
+ sb = pfn_to_superpageblock(page_zone(page), pfn);
+ if (sb) {
+ if (sb->nr_reserved)
+ sb->nr_reserved--;
+ __spb_set_has_type(page, migratetype);
+ }
}
#ifdef CONFIG_DEBUG_VM
--
2.52.0
* [RFC PATCH 10/45] mm: page_alloc: support superpageblock resize for memory hotplug
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (8 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 09/45] mm: page_alloc: introduce superpageblock metadata for 1GB anti-fragmentation Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 11/45] mm: page_alloc: add superpageblock fullness lists for allocation steering Rik van Riel
` (35 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
setup_superpageblocks() is __init-only and uses memblock_alloc_node(), so
hotplugged memory that extends a zone's span has no superpageblock
coverage. Pages in those regions would bypass superpageblock steering
entirely.
Add resize_zone_superpageblocks() which is called from
move_pfn_range_to_zone() after the zone span has been updated. It allocates
a new superpageblock array with kvmalloc_node() covering the full zone
span, copies existing superpageblocks (fixing up list head pointers), and
initializes new superpageblocks for the added range.
Use round-up division for partial pageblock counting to match
init_one_superpageblock().
ZONE_DEVICE is excluded since device pages should not participate in anti-
fragmentation steering.
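A hypothetical worked example of the resize math (PFN values chosen only
for illustration): a zone originally spanning [4GB, 68GB) has
superpageblock_base_pfn = 0x100000 and 64 entries. If hotplug extends the
span to [4GB, 100GB), the new array holds 96 entries with the same base, so
old_offset = (old_base - new_base) >> SUPERPAGEBLOCK_ORDER = 0; the 64
existing entries are copied into slots [0, 64) and slots [64, 96) are
initialized fresh, with every pageblock counted as reserved until
init_pageblock_migratetype() claims it.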
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 1 +
mm/internal.h | 4 ++
mm/memory_hotplug.c | 4 ++
mm/mm_init.c | 138 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 147 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a0e8ce4b7b79..c17ea237fe13 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -960,6 +960,7 @@ struct zone {
struct superpageblock *superpageblocks;
unsigned long nr_superpageblocks;
unsigned long superpageblock_base_pfn; /* 1GB-aligned base */
+ bool spb_kvmalloced; /* true if from kvmalloc (hotplug) */
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
diff --git a/mm/internal.h b/mm/internal.h
index bb0e0b8a4495..163ef96fa777 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1025,6 +1025,10 @@ void init_cma_reserved_pageblock(struct page *page);
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
+#ifdef CONFIG_MEMORY_HOTPLUG
+void resize_zone_superpageblocks(struct zone *zone);
+#endif
+
struct cma;
#ifdef CONFIG_CMA
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bc805029da51..e21fdb4f27db 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -760,6 +760,10 @@ void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
resize_zone_range(zone, start_pfn, nr_pages);
resize_pgdat_range(pgdat, start_pfn, nr_pages);
+ /* Grow superpageblock array to cover the new zone span */
+ if (!zone_is_zone_device(zone))
+ resize_zone_superpageblocks(zone);
+
/*
* Subsection population requires care in pfn_to_online_page().
* Set the taint to enable the slow path detection of
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1fb62342d1c6..c5cf90de4d62 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1606,6 +1606,144 @@ static void __init setup_superpageblocks(struct zone *zone)
zone_start, zone_end);
}
+#ifdef CONFIG_MEMORY_HOTPLUG
+/**
+ * resize_zone_superpageblocks - grow superpageblock array for memory hotplug
+ * @zone: zone whose span has been extended by hotplug
+ *
+ * Called from move_pfn_range_to_zone() after resize_zone_range() has
+ * updated the zone's span. Allocates a new superpageblock array covering
+ * the full zone span, copies existing superpageblocks (fixing up list heads),
+ * and initializes new superpageblocks for the added range.
+ *
+ * Must be called under mem_hotplug_lock (write). No concurrent
+ * allocations can occur since the hotplugged pages are not yet online.
+ */
+void __meminit resize_zone_superpageblocks(struct zone *zone)
+{
+ unsigned long zone_start = zone->zone_start_pfn;
+ unsigned long zone_end = zone_start + zone->spanned_pages;
+ unsigned long new_sb_base, new_nr_sbs;
+ unsigned long old_offset;
+ struct superpageblock *old_sbs;
+ struct superpageblock *new_sbs;
+ bool old_kvmalloced;
+ size_t alloc_size;
+ unsigned long i;
+ int nid = zone_to_nid(zone);
+
+ if (!zone->spanned_pages)
+ return;
+
+ new_sb_base = ALIGN_DOWN(zone_start, SUPERPAGEBLOCK_NR_PAGES);
+ new_nr_sbs = (ALIGN(zone_end, SUPERPAGEBLOCK_NR_PAGES) - new_sb_base) >>
+ SUPERPAGEBLOCK_ORDER;
+
+ /* Already covered? */
+ if (zone->superpageblocks &&
+ new_sb_base == zone->superpageblock_base_pfn &&
+ new_nr_sbs == zone->nr_superpageblocks)
+ return;
+
+ alloc_size = new_nr_sbs * sizeof(struct superpageblock);
+ new_sbs = kvmalloc_node(alloc_size, GFP_KERNEL | __GFP_ZERO, nid);
+ if (!new_sbs) {
+ pr_warn("Failed to allocate %zu bytes for zone %s superpageblocks\n",
+ alloc_size, zone->name);
+ return;
+ }
+
+ /*
+ * Copy existing superpageblocks to their new position.
+ * The old array covers [old_base, old_base + old_nr * SB_SIZE).
+ * The new array covers [new_base, new_base + new_nr * SB_SIZE).
+ * old_base >= new_base always (zone can only grow).
+ */
+ if (zone->superpageblocks) {
+ old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
+ SUPERPAGEBLOCK_ORDER;
+ memcpy(&new_sbs[old_offset], zone->superpageblocks,
+ zone->nr_superpageblocks * sizeof(struct superpageblock));
+
+ /*
+ * Fix up list_head pointers that were self-referencing
+ * (empty lists) or pointing into the old array.
+ */
+ for (i = old_offset; i < old_offset + zone->nr_superpageblocks; i++) {
+ struct superpageblock *sb = &new_sbs[i];
+
+ if (list_empty(&zone->superpageblocks[i - old_offset].list))
+ INIT_LIST_HEAD(&sb->list);
+ else
+ list_replace(&zone->superpageblocks[i - old_offset].list,
+ &sb->list);
+ }
+ }
+
+ /* Initialize new superpageblocks (slots not covered by old array) */
+ for (i = 0; i < new_nr_sbs; i++) {
+ struct superpageblock *sb = &new_sbs[i];
+ bool is_old = false;
+
+ if (zone->superpageblocks) {
+ old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
+ SUPERPAGEBLOCK_ORDER;
+ if (i >= old_offset &&
+ i < old_offset + zone->nr_superpageblocks)
+ is_old = true;
+ }
+
+ if (is_old)
+ continue;
+
+ init_one_superpageblock(sb, zone,
+ new_sb_base + (i << SUPERPAGEBLOCK_ORDER),
+ zone_start, zone_end);
+ }
+
+ /*
+ * Update existing superpageblocks whose nr_reserved may have
+ * increased due to the zone span growing into them.
+ */
+ if (zone->superpageblocks) {
+ old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
+ SUPERPAGEBLOCK_ORDER;
+ for (i = old_offset; i < old_offset + zone->nr_superpageblocks; i++) {
+ struct superpageblock *sb = &new_sbs[i];
+ unsigned long sb_start = sb->start_pfn;
+ unsigned long sb_end = sb_start + SUPERPAGEBLOCK_NR_PAGES;
+ unsigned long pb_start = max(sb_start, zone_start);
+ unsigned long pb_end = min(sb_end, zone_end);
+ u16 new_pbs = (pb_end > pb_start) ?
+ ((pb_end - pb_start + pageblock_nr_pages - 1) >>
+ pageblock_order) : 0;
+ u16 old_pbs = sb->nr_free + sb->nr_unmovable +
+ sb->nr_reclaimable + sb->nr_movable +
+ sb->nr_reserved;
+
+ if (new_pbs > old_pbs)
+ sb->nr_reserved += new_pbs - old_pbs;
+ }
+ }
+
+ /* Swap in the new array */
+ old_sbs = zone->superpageblocks;
+ old_kvmalloced = zone->spb_kvmalloced;
+ zone->superpageblocks = new_sbs;
+ zone->nr_superpageblocks = new_nr_sbs;
+ zone->superpageblock_base_pfn = new_sb_base;
+ zone->spb_kvmalloced = true;
+
+ /*
+ * The boot-time array was allocated with memblock_alloc, which
+ * is not individually freeable after boot. Only kvfree arrays
+ * from previous hotplug resizes.
+ */
+ if (old_sbs && old_kvmalloced)
+ kvfree(old_sbs);
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
+
#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 11/45] mm: page_alloc: add superpageblock fullness lists for allocation steering
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (9 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 10/45] mm: page_alloc: support superpageblock resize for memory hotplug Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 12/45] mm: page_alloc: steer pageblock stealing to tainted superpageblocks Rik van Riel
` (34 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Organize superpageblocks into bucketed lists by fullness level and taint
status to enable efficient allocation steering without sorting.
Five fullness buckets (FULL, 75%, 50%, 25%, ALMOST_EMPTY) track what
fraction of a superpageblock's pageblocks are in use. Two categories (CLEAN
vs TAINTED) distinguish superpageblocks that contain only free and movable
pageblocks from those contaminated with unmovable, reclaimable, or reserved
pageblocks. A separate spb_empty list tracks completely free
superpageblocks.
Track fully-free pageblocks with a PB_all_free pageblock flag. When buddy
coalescing reconstructs a fully free pageblock, the owning superpageblock's
nr_free count is incremented. Type counters are driven by PB_has_* bit
transitions, not by migratetype label changes.
For tainted superpageblocks, fullness is based on unmovable + reclaimable
pageblock counts rather than total usage, correctly reflecting how full
they are with the content types we're trying to concentrate.
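As a worked example (assuming x86_64 defaults of 2MB pageblocks and 1GB
superpageblocks, so SUPERPAGEBLOCK_NR_PAGEBLOCKS = 512 and a quarter is 128
pageblocks):

    tainted SPB: used = nr_unmovable + nr_reclaimable = 100 + 40 = 140
                 128 <= 140 < 256                      -> SB_FULL_25
    clean SPB:   used = total_pageblocks - nr_free     = 512 - 80 = 432
                 432 >= 3 * 128                        -> SB_FULL_75

The tainted superpageblock is ranked by how much unmovable/reclaimable content
it already holds rather than by overall occupancy, so a mostly-movable but
lightly tainted superpageblock still looks relatively empty to the steering
code.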
Add a debugfs interface at /sys/kernel/debug/superpageblocks.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 22 +++
include/linux/pageblock-flags.h | 1 +
mm/mm_init.c | 26 ++-
mm/page_alloc.c | 295 +++++++++++++++++++++++++++++++-
4 files changed, 339 insertions(+), 5 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c17ea237fe13..f03800f5028b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -898,6 +898,23 @@ enum zone_type {
*/
#define SUPERPAGEBLOCK_NR_PAGEBLOCKS (1UL << (SUPERPAGEBLOCK_ORDER - pageblock_order))
+/* Superpageblock fullness buckets (by % of pageblocks in use) */
+enum sb_fullness {
+ SB_FULL, /* 100% full, 0 free pageblocks */
+ SB_FULL_75, /* 75-99% full */
+ SB_FULL_50, /* 50-74% full */
+ SB_FULL_25, /* 25-49% full */
+ SB_ALMOST_EMPTY, /* 1-24% full */
+ __NR_SB_FULLNESS,
+};
+
+/* Superpageblock taint categories */
+enum sb_category {
+ SB_CLEAN, /* only free + movable pageblocks */
+ SB_TAINTED, /* has unmovable/reclaimable/reserved */
+ __NR_SB_CATEGORIES,
+};
+
struct superpageblock {
/* Pageblock counts by current migratetype */
u16 nr_free;
@@ -905,6 +922,7 @@ struct superpageblock {
u16 nr_reclaimable;
u16 nr_movable;
u16 nr_reserved; /* holes, firmware, etc. */
+ u16 total_pageblocks; /* zone-clipped total */
/* For organizing superpageblocks by fullness category */
struct list_head list;
@@ -962,6 +980,10 @@ struct zone {
unsigned long superpageblock_base_pfn; /* 1GB-aligned base */
bool spb_kvmalloced; /* true if from kvmalloc (hotplug) */
+ /* Superpageblock fullness lists for allocation steering */
+ struct list_head spb_empty; /* completely free superpageblocks */
+ struct list_head spb_lists[__NR_SB_CATEGORIES][__NR_SB_FULLNESS];
+
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 21bfcdf80b2e..4dce39d054a9 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -28,6 +28,7 @@ enum pageblock_bits {
PB_has_unmovable,
PB_has_reclaimable,
PB_has_movable,
+ PB_all_free, /* All pages in pageblock are free in buddy */
#ifdef CONFIG_MEMORY_ISOLATION
/*
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c5cf90de4d62..6af34c1a8cc4 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1562,7 +1562,17 @@ static void __meminit init_one_superpageblock(struct superpageblock *sb,
actual_pbs = (pb_end > pb_start) ?
((pb_end - pb_start + pageblock_nr_pages - 1) >>
pageblock_order) : 0;
+ sb->total_pageblocks = actual_pbs;
sb->nr_reserved = actual_pbs;
+ if (actual_pbs) {
+ /*
+ * All superpageblocks start as reserved (tainted+full).
+ * They move to the correct category when the pages
+ * inside are freed during boot.
+ */
+ list_add_tail(&sb->list,
+ &zone->spb_lists[SB_TAINTED][SB_FULL]);
+ }
}
static void __init setup_superpageblocks(struct zone *zone)
@@ -1572,11 +1582,18 @@ static void __init setup_superpageblocks(struct zone *zone)
unsigned long sb_base, nr_superpageblocks;
size_t alloc_size;
unsigned long i;
+ int cat, full;
zone->superpageblocks = NULL;
zone->nr_superpageblocks = 0;
zone->superpageblock_base_pfn = 0;
+ /* Fullness lists steer allocations to preferred superpageblocks */
+ INIT_LIST_HEAD(&zone->spb_empty);
+ for (cat = 0; cat < __NR_SB_CATEGORIES; cat++)
+ for (full = 0; full < __NR_SB_FULLNESS; full++)
+ INIT_LIST_HEAD(&zone->spb_lists[cat][full]);
+
if (!zone->spanned_pages)
return;
@@ -1702,8 +1719,9 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
}
/*
- * Update existing superpageblocks whose nr_reserved may have
- * increased due to the zone span growing into them.
+ * Update existing superpageblocks whose nr_reserved and
+ * total_pageblocks may have increased due to the zone
+ * span growing into them.
*/
if (zone->superpageblocks) {
old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
@@ -1721,8 +1739,10 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
sb->nr_reclaimable + sb->nr_movable +
sb->nr_reserved;
- if (new_pbs > old_pbs)
+ if (new_pbs > old_pbs) {
sb->nr_reserved += new_pbs - old_pbs;
+ sb->total_pageblocks = new_pbs;
+ }
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3837a30a7eb..ed0919280dd6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -56,6 +56,8 @@
#include <linux/delayacct.h>
#include <linux/cacheinfo.h>
#include <linux/pgalloc_tag.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
@@ -558,7 +560,157 @@ static void __spb_set_has_type(struct page *page, int migratetype)
}
/**
- * set_pageblock_migratetype - Set the migratetype of a pageblock
+ * spb_get_category - Determine if a superpageblock is clean or tainted
+ * @sb: superpageblock to classify
+ *
+ * A superpageblock is clean if it contains only free and movable pageblocks.
+ * Any unmovable, reclaimable, or reserved pageblocks make it tainted.
+ * Reserved pageblocks (memory holes) taint the superpageblock because it
+ * can never be used for 1GB hugepages, making it a better home for
+ * unmovable/reclaimable allocations.
+ */
+static inline enum sb_category spb_get_category(struct superpageblock *sb)
+{
+ if (sb->nr_unmovable || sb->nr_reclaimable || sb->nr_reserved)
+ return SB_TAINTED;
+ return SB_CLEAN;
+}
+
+/**
+ * sb_get_fullness - Determine the fullness bucket for a superpageblock
+ * @sb: superpageblock to classify
+ * @cat: the category (CLEAN or TAINTED) of this superpageblock
+ *
+ * For clean SPBs, fullness is based on total usage (total - nr_free).
+ * For tainted SPBs, fullness is based only on unmovable + reclaimable
+ * pageblocks, since those are what we're trying to concentrate.
+ * Uses SUPERPAGEBLOCK_NR_PAGEBLOCKS as divisor so that partial
+ * superpageblocks at zone boundaries are preferred over whole ones.
+ */
+static inline enum sb_fullness sb_get_fullness(struct superpageblock *sb,
+ enum sb_category cat)
+{
+ unsigned int used, total = sb->total_pageblocks;
+ unsigned int quarter = SUPERPAGEBLOCK_NR_PAGEBLOCKS / 4;
+
+ if (!total)
+ return SB_FULL;
+
+ if (cat == SB_TAINTED)
+ used = sb->nr_unmovable + sb->nr_reclaimable;
+ else
+ used = total - sb->nr_free;
+
+ if (used >= total)
+ return SB_FULL;
+
+ if (used >= 3 * quarter)
+ return SB_FULL_75;
+ if (used >= 2 * quarter)
+ return SB_FULL_50;
+ if (used >= quarter)
+ return SB_FULL_25;
+ return SB_ALMOST_EMPTY;
+}
+
+/**
+ * spb_update_list - Move a superpageblock to the correct fullness list
+ * @sb: superpageblock to reclassify
+ *
+ * Called after counters change. Removes from current list (if any)
+ * and adds to the appropriate list based on current fullness and
+ * taint status.
+ */
+static void spb_update_list(struct superpageblock *sb)
+{
+ struct zone *zone = sb->zone;
+ enum sb_category cat;
+ enum sb_fullness full;
+
+ list_del_init(&sb->list);
+
+ if (sb->nr_free == SUPERPAGEBLOCK_NR_PAGEBLOCKS) {
+ list_add_tail(&sb->list, &zone->spb_empty);
+ return;
+ }
+
+ cat = spb_get_category(sb);
+ full = sb_get_fullness(sb, cat);
+ list_add_tail(&sb->list, &zone->spb_lists[cat][full]);
+}
+
+/**
+ * superpageblock_pb_now_free - A pageblock just became fully free in buddy
+ * @page: page in the pageblock
+ *
+ * When buddy coalescing reconstructs a complete pageblock-order page,
+ * increment nr_free. Type counters are handled separately by
+ * __spb_clear_has_type() in mark_pageblock_free().
+ */
+static void superpageblock_pb_now_free(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct superpageblock *sb = pfn_to_superpageblock(page_zone(page), pfn);
+
+ if (!sb)
+ return;
+
+ sb->nr_free++;
+
+ spb_update_list(sb);
+}
+
+/**
+ * superpageblock_pb_now_used - A fully-free pageblock just got its first allocation
+ * @page: page in the pageblock
+ *
+ * When allocating from an order >= pageblock_order free page, decrement
+ * nr_free. Type counters are handled separately by __spb_set_has_type()
+ * at allocation time.
+ */
+static void superpageblock_pb_now_used(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct superpageblock *sb = pfn_to_superpageblock(page_zone(page), pfn);
+
+ if (!sb)
+ return;
+
+ if (sb->nr_free)
+ sb->nr_free--;
+
+ spb_update_list(sb);
+}
+
+/**
+ * superpageblock_range_now_used - Mark a multi-pageblock free range as no longer free
+ * @page: first page of the range (must be pageblock-aligned)
+ * @order: order of the range (must be >= pageblock_order)
+ *
+ * When a free page of order >= pageblock_order is removed from buddy outside
+ * the normal allocation path (e.g. __isolate_free_page, memory hotplug,
+ * HW poison takeoff), every constituent pageblock leaves its PB_all_free
+ * state. Walk the range, clear PB_all_free, and decrement nr_free for each
+ * affected pageblock. PB_has_* bits are not touched: the pages are not being
+ * allocated to a specific migratetype here. They will be re-established by
+ * mark_pageblock_free() if the pages later return to buddy and coalesce.
+ */
+static void superpageblock_range_now_used(struct page *page, unsigned int order)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long end_pfn = pfn + (1UL << order);
+
+ for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ struct page *pb_page = pfn_to_page(pfn);
+
+ if (get_pfnblock_bit(pb_page, pfn, PB_all_free)) {
+ clear_pfnblock_bit(pb_page, pfn, PB_all_free);
+ superpageblock_pb_now_used(pb_page);
+ }
+ }
+}
+
+/**
+ * set_pageblock_migratetype - Set the migratetype of a pageblock
* @page: The page within the block of interest
* @migratetype: migratetype to set
*/
@@ -621,6 +773,7 @@ void __meminit init_pageblock_migratetype(struct page *page,
if (sb->nr_reserved)
sb->nr_reserved--;
__spb_set_has_type(page, migratetype);
+ spb_update_list(sb);
}
}
@@ -1059,6 +1212,11 @@ static void mark_pageblock_free(struct page *page, unsigned long pfn)
clear_pfnblock_bit(page, pfn, PB_has_unmovable);
clear_pfnblock_bit(page, pfn, PB_has_reclaimable);
clear_pfnblock_bit(page, pfn, PB_has_movable);
+
+ if (!get_pfnblock_bit(page, pfn, PB_all_free)) {
+ set_pfnblock_bit(page, pfn, PB_all_free);
+ superpageblock_pb_now_free(page);
+ }
}
/*
@@ -1107,7 +1265,8 @@ static inline void __free_one_page(struct page *page,
/*
* When freeing a whole pageblock, clear stale PCP ownership
- * and actual-contents tracking flags up front. The in-loop
+ * and actual-contents tracking flags up front, and mark it
+ * as fully free for superpageblock accounting. The in-loop
* check only fires when sub-pageblock pages merge *up to*
* pageblock_order, not when entering at pageblock_order
* directly.
@@ -1987,6 +2146,20 @@ static __always_inline void page_del_and_expand(struct zone *zone,
{
int nr_pages = 1 << high;
+ /*
+ * If we're splitting a page that spans at least a full pageblock,
+ * the allocated pageblock transitions from fully-free to in-use.
+ * Clear PB_all_free and update superpageblock accounting.
+ */
+ if (high >= pageblock_order) {
+ unsigned long pfn = page_to_pfn(page);
+
+ if (get_pfnblock_bit(page, pfn, PB_all_free)) {
+ clear_pfnblock_bit(page, pfn, PB_all_free);
+ superpageblock_pb_now_used(page);
+ }
+ }
+
__del_page_from_free_list(page, zone, high, migratetype);
nr_pages -= expand(zone, page, low, high, migratetype);
account_freepages(zone, -nr_pages, migratetype);
@@ -2513,6 +2686,25 @@ try_to_claim_block(struct zone *zone, struct page *page,
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order) {
unsigned int nr_added;
+ unsigned long pb_pfn;
+
+ /*
+ * Clear PB_all_free for pageblocks being claimed.
+ * This path bypasses page_del_and_expand(), so we
+ * must handle the free→used transition here.
+ * Use block_type (the original migratetype) because
+ * that's what was decremented when PB_all_free was set.
+ */
+ for (pb_pfn = page_to_pfn(page);
+ pb_pfn < page_to_pfn(page) + (1 << current_order);
+ pb_pfn += pageblock_nr_pages) {
+ struct page *pb_page = pfn_to_page(pb_pfn);
+
+ if (get_pfnblock_bit(pb_page, pb_pfn, PB_all_free)) {
+ clear_pfnblock_bit(pb_page, pb_pfn, PB_all_free);
+ superpageblock_pb_now_used(pb_page);
+ }
+ }
del_page_from_free_list(page, zone, current_order, block_type);
change_pageblock_range(page, current_order, start_type);
@@ -3555,6 +3747,14 @@ int __isolate_free_page(struct page *page, unsigned int order)
del_page_from_free_list(page, zone, order, mt);
+ /*
+ * The free page is leaving buddy. For order >= pageblock_order, every
+ * constituent pageblock had PB_all_free set; clear those bits and
+ * decrement nr_free so the SPB pageblock-level counter stays in sync.
+ */
+ if (order >= pageblock_order)
+ superpageblock_range_now_used(page, order);
+
/*
* Set the pageblock if the isolated page is at least half of a
* pageblock
@@ -8068,6 +8268,8 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
BUG_ON(!PageBuddy(page));
VM_WARN_ON(get_pageblock_migratetype(page) != MIGRATE_ISOLATE);
order = buddy_order(page);
+ if (order >= pageblock_order)
+ superpageblock_range_now_used(page, order);
del_page_from_free_list(page, zone, order, MIGRATE_ISOLATE);
pfn += (1 << order);
}
@@ -8159,6 +8361,25 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order,
migratetype);
+ /*
+ * break_down_buddy_pages() re-adds every non-target
+ * pageblock to buddy at order >= pageblock_order, so
+ * those keep their PB_all_free state. Only the target's
+ * pageblock loses its fully-free status — clear that
+ * one bit and decrement the SPB nr_free counter.
+ */
+ if (page_order >= pageblock_order) {
+ unsigned long pfn_pb = ALIGN_DOWN(pfn,
+ pageblock_nr_pages);
+ struct page *pb_page = pfn_to_page(pfn_pb);
+
+ if (get_pfnblock_bit(pb_page, pfn_pb,
+ PB_all_free)) {
+ clear_pfnblock_bit(pb_page, pfn_pb,
+ PB_all_free);
+ superpageblock_pb_now_used(pb_page);
+ }
+ }
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
SetPageHWPoisonTakenOff(page);
@@ -8458,3 +8679,73 @@ struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int or
return page;
}
EXPORT_SYMBOL_GPL(alloc_pages_nolock_noprof);
+
+#ifdef CONFIG_DEBUG_FS
+static const char * const sb_fullness_names[] = {
+ "full", "75pct", "50pct", "25pct", "almost_empty"
+};
+
+static const char * const sb_category_names[] = {
+ "clean", "tainted"
+};
+
+static int superpageblock_debugfs_show(struct seq_file *m, void *v)
+{
+ struct zone *zone;
+ int cat, full;
+
+ for_each_populated_zone(zone) {
+ unsigned long i;
+ int empty_count = 0;
+ struct superpageblock *sb;
+
+ if (!zone->superpageblocks)
+ continue;
+
+ seq_printf(m, "Node %d, zone %8s: %lu superpageblocks, base_pfn=0x%lx\n",
+ zone->zone_pgdat->node_id, zone->name,
+ zone->nr_superpageblocks, zone->superpageblock_base_pfn);
+
+ list_for_each_entry(sb, &zone->spb_empty, list)
+ empty_count++;
+ if (empty_count)
+ seq_printf(m, " empty: %d\n", empty_count);
+
+ for (cat = 0; cat < __NR_SB_CATEGORIES; cat++) {
+ for (full = 0; full < __NR_SB_FULLNESS; full++) {
+ int count = 0;
+
+ list_for_each_entry(sb,
+ &zone->spb_lists[cat][full], list)
+ count++;
+ if (count)
+ seq_printf(m, " %s/%s: %d\n",
+ sb_category_names[cat],
+ sb_fullness_names[full],
+ count);
+ }
+ }
+
+ /* Per-superpageblock detail */
+ for (i = 0; i < zone->nr_superpageblocks; i++) {
+ sb = &zone->superpageblocks[i];
+ seq_printf(m, " sb[%lu] pfn=0x%lx: unmov=%u recl=%u mov=%u rsv=%u free=%u total=%u\n",
+ i, sb->start_pfn,
+ sb->nr_unmovable, sb->nr_reclaimable,
+ sb->nr_movable, sb->nr_reserved,
+ sb->nr_free, sb->total_pageblocks);
+ }
+ }
+ return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(superpageblock_debugfs);
+
+static int __init superpageblock_debugfs_init(void)
+{
+ debugfs_create_file("superpageblocks", 0444, NULL, NULL,
+ &superpageblock_debugfs_fops);
+ return 0;
+}
+late_initcall(superpageblock_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 12/45] mm: page_alloc: steer pageblock stealing to tainted superpageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (10 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 11/45] mm: page_alloc: add superpageblock fullness lists for allocation steering Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 13/45] mm: page_alloc: steer movable allocations to fullest clean superpageblocks Rik van Riel
` (33 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
When the allocator needs to steal a movable pageblock for unmovable or
reclaimable allocations, prefer pages from already-tainted superpageblocks.
This concentrates contamination in superpageblocks that are already impure,
preserving clean superpageblocks for future 1GB hugepage allocations.
In __rmqueue_claim, after finding a candidate page on the free list, check
if it belongs to a clean superpageblock. If so, do a bounded scan
(SPB_SCAN_LIMIT=8) of the same free list looking for a page from a
tainted superpageblock instead. This is a best-effort optimization:
if no tainted alternative is found, the original page is used.
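The added cost is bounded: at most SPB_SCAN_LIMIT list entries are examined,
and each check is a pfn_to_superpageblock() array lookup plus a category test.
The predicate applied to each alternative page boils down to the following
sketch (hypothetical helper for illustration only; the patch open-codes this
inside __rmqueue_claim()):

    static bool spb_steal_prefers(struct zone *zone, struct page *alt)
    {
        struct superpageblock *asb;

        asb = pfn_to_superpageblock(zone, page_to_pfn(alt));
        return asb && spb_get_category(asb) == SB_TAINTED;
    }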
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 103 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 82 insertions(+), 21 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed0919280dd6..d795f41975c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2308,6 +2308,9 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
clear_page_pfmemalloc(page);
}
+/* Bounded scan limit when searching free lists for tainted superpageblock pages */
+#define SPB_SCAN_LIMIT 8
+
/*
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
@@ -2704,6 +2707,14 @@ try_to_claim_block(struct zone *zone, struct page *page,
clear_pfnblock_bit(pb_page, pb_pfn, PB_all_free);
superpageblock_pb_now_used(pb_page);
}
+ __spb_set_has_type(pb_page, start_type);
+ }
+ /* Single list update after all pageblocks processed */
+ {
+ struct superpageblock *sb =
+ pfn_to_superpageblock(zone, page_to_pfn(page));
+ if (sb)
+ spb_update_list(sb);
}
del_page_from_free_list(page, zone, current_order, block_type);
@@ -2749,31 +2760,27 @@ try_to_claim_block(struct zone *zone, struct page *page,
set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
#ifdef CONFIG_COMPACTION
/*
- * Track actual page contents in pageblock flags.
- * Mark the pageblock with the type being allocated, and
- * if unmovable/reclaimable pages are being placed into a
- * pageblock that already has movable pages, queue async
- * evacuation of the movable pages.
+ * Track actual page contents in pageblock flags and
+ * update superpageblock counters so the SPB moves to
+ * the correct fullness list for steering.
*/
{
struct page *start_page = pfn_to_page(start_pfn);
+ struct superpageblock *sb;
- if (start_type == MIGRATE_UNMOVABLE) {
- set_pfnblock_bit(start_page, start_pfn,
- PB_has_unmovable);
- if (get_pfnblock_bit(start_page, start_pfn,
- PB_has_movable))
- queue_pageblock_evacuate(zone, start_pfn);
- } else if (start_type == MIGRATE_RECLAIMABLE) {
- set_pfnblock_bit(start_page, start_pfn,
- PB_has_reclaimable);
- if (get_pfnblock_bit(start_page, start_pfn,
- PB_has_movable))
- queue_pageblock_evacuate(zone, start_pfn);
- } else if (start_type == MIGRATE_MOVABLE) {
- set_pfnblock_bit(start_page, start_pfn,
- PB_has_movable);
- }
+ __spb_set_has_type(start_page, start_type);
+ if (block_type != start_type)
+ __spb_set_has_type(start_page, block_type);
+
+ sb = pfn_to_superpageblock(zone, start_pfn);
+ if (sb)
+ spb_update_list(sb);
+
+ if ((start_type == MIGRATE_UNMOVABLE ||
+ start_type == MIGRATE_RECLAIMABLE) &&
+ get_pfnblock_bit(start_page, start_pfn,
+ PB_has_movable))
+ queue_pageblock_evacuate(zone, start_pfn);
}
#endif
return __rmqueue_smallest(zone, order, start_type);
@@ -2828,6 +2835,38 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
break;
page = get_page_from_free_area(area, fallback_mt);
+
+ /*
+ * For unmovable/reclaimable stealing, prefer pages from
+ * tainted superpageblocks (already contaminated) to keep clean
+ * superpageblocks clean for future 1GB allocations.
+ */
+ if (start_migratetype != MIGRATE_MOVABLE &&
+ zone->superpageblocks && page) {
+ struct superpageblock *sb;
+ struct page *alt;
+ int scanned = 0;
+
+ sb = pfn_to_superpageblock(zone, page_to_pfn(page));
+ if (sb && spb_get_category(sb) == SB_CLEAN) {
+ list_for_each_entry(alt,
+ &area->free_list[fallback_mt],
+ buddy_list) {
+ struct superpageblock *asb;
+
+ if (++scanned > SPB_SCAN_LIMIT)
+ break;
+ asb = pfn_to_superpageblock(zone,
+ page_to_pfn(alt));
+ if (asb && spb_get_category(asb) ==
+ SB_TAINTED) {
+ page = alt;
+ break;
+ }
+ }
+ }
+ }
+
page = try_to_claim_block(zone, page, current_order, order,
start_migratetype, fallback_mt,
alloc_flags);
@@ -2848,6 +2887,7 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
static __always_inline struct page *
__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
{
+ struct superpageblock *sb;
struct free_area *area;
int current_order;
struct page *page;
@@ -2862,6 +2902,27 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
page = get_page_from_free_area(area, fallback_mt);
page_del_and_expand(zone, page, order, current_order, fallback_mt);
+
+ /*
+ * page_del_and_expand recorded PB_has_<fallback_mt> for the
+ * source free list type. Also record the actual allocation
+ * type so evacuation and defrag can find these pages.
+ *
+ * For example, a MOVABLE allocation stealing from an
+ * UNMOVABLE free list must set PB_has_movable so the
+ * pageblock is visible to evacuate_pageblock() and
+ * spb_defrag_tainted(). __spb_set_has_type is idempotent:
+ * it only increments the SPB counter on the 0->1 bit
+ * transition.
+ */
+ if (fallback_mt != start_migratetype) {
+ __spb_set_has_type(page, start_migratetype);
+ sb = pfn_to_superpageblock(zone,
+ page_to_pfn(page));
+ if (sb)
+ spb_update_list(sb);
+ }
+
trace_mm_page_alloc_extfrag(page, order, current_order,
start_migratetype, fallback_mt);
return page;
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 13/45] mm: page_alloc: steer movable allocations to fullest clean superpageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (11 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 12/45] mm: page_alloc: steer pageblock stealing to tainted superpageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 14/45] mm: page_alloc: extract claim_whole_block from try_to_claim_block Rik van Riel
` (32 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
When refilling PCP with whole pageblocks for movable allocations, prefer
pageblocks from the fullest clean (only free + movable) superpageblock.
This packs movable allocations into already-partial superpageblocks,
preserving empty superpageblocks for potential 1GB hugepage allocation.
Add sb_preferred_for_movable() which walks the clean superpageblock lists
from SB_FULL_75 toward SB_ALMOST_EMPTY to find the fullest clean
superpageblock with available free pageblocks (SB_FULL is skipped, since
fully used superpageblocks have nothing left to hand out). Add
__rmqueue_from_sb() which scans the buddy free list for a page within a
specific superpageblock's PFN range, with a bounded scan limit
(SPB_SCAN_LIMIT, 8 entries) to avoid excessive latency.
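Written out, the walk order sb_preferred_for_movable() uses is (illustration
only, mirroring the loop in the diff below):

    static const enum sb_fullness movable_pick_order[] = {
        SB_FULL_75, SB_FULL_50, SB_FULL_25, SB_ALMOST_EMPTY,
    };

SB_FULL is skipped because fully used superpageblocks have no free pageblocks,
and the spb_empty list is deliberately not consulted here, leaving completely
free superpageblocks available for 1GB hugepage allocation.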
Hook into rmqueue_bulk() phase 1 (whole pageblock grab for PCP refill) to
try the preferred superpageblock before falling back to the normal
__rmqueue() path. This is the primary steering point for movable
allocations without per-superpageblock free lists.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 86 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d795f41975c1..8b10322d5221 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2311,6 +2311,73 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
/* Bounded scan limit when searching free lists for tainted superpageblock pages */
#define SPB_SCAN_LIMIT 8
+/**
+ * sb_preferred_for_movable - Find the fullest clean superpageblock for movable
+ * @zone: zone to search
+ *
+ * Walk spb_lists[CLEAN] from nearly full toward emptiest — pack movable
+ * allocations into already-partial superpageblocks before starting new ones.
+ * Skip SB_FULL since those have no free pageblocks.
+ * Returns NULL if no suitable superpageblock found.
+ */
+static struct superpageblock *sb_preferred_for_movable(struct zone *zone)
+{
+ int full;
+ struct superpageblock *sb;
+
+ for (full = SB_FULL_75; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb, &zone->spb_lists[SB_CLEAN][full], list) {
+ if (sb->nr_free)
+ return sb;
+ }
+ }
+ /* No clean partial superpageblock available; caller falls back to __rmqueue() */
+ return NULL;
+}
+
+/**
+ * __rmqueue_from_sb - Try to allocate a page from a specific superpageblock
+ * @zone: zone to allocate from
+ * @order: allocation order
+ * @migratetype: type to allocate
+ * @sb: preferred superpageblock
+ *
+ * Scan the free list at the given order for a page within the superpageblock's
+ * PFN range. Bounded scan to avoid excessive latency. Returns NULL if
+ * no suitable page found.
+ */
+static struct page *__rmqueue_from_sb(struct zone *zone, unsigned int order,
+ int migratetype, struct superpageblock *sb)
+{
+ unsigned int current_order;
+ unsigned long sb_start = sb->start_pfn;
+ unsigned long sb_end = sb_start + (1UL << SUPERPAGEBLOCK_ORDER);
+ struct free_area *area;
+ struct page *page;
+ int scanned;
+
+ for (current_order = order; current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &zone->free_area[current_order];
+ scanned = 0;
+
+ list_for_each_entry(page, &area->free_list[migratetype],
+ buddy_list) {
+ unsigned long pfn = page_to_pfn(page);
+
+ if (pfn >= sb_start && pfn < sb_end) {
+ page_del_and_expand(zone, page, order,
+ current_order,
+ migratetype);
+ return page;
+ }
+ if (++scanned >= SPB_SCAN_LIMIT)
+ break;
+ }
+ }
+ return NULL;
+}
+
/*
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
@@ -3103,12 +3170,26 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
* small zones, pages_needed can be less than a whole
* pageblock; skip to smaller blocks or individual pages to
* avoid overshooting the PCP high watermark.
+ *
+ * For movable allocations, prefer pageblocks from the
+ * fullest clean superpageblock to pack allocations and
+ * preserve empty superpageblocks for 1GB hugepages.
*/
while (refilled + pageblock_nr_pages <= pages_needed) {
- struct page *page;
+ struct page *page = NULL;
- page = __rmqueue(zone, pageblock_order,
- migratetype, alloc_flags, &rmqm);
+ if (migratetype == MIGRATE_MOVABLE) {
+ struct superpageblock *sb;
+
+ sb = sb_preferred_for_movable(zone);
+ if (sb)
+ page = __rmqueue_from_sb(zone, pageblock_order,
+ migratetype, sb);
+ }
+ if (!page)
+ page = __rmqueue(zone, pageblock_order,
+ migratetype,
+ alloc_flags, &rmqm);
if (!page)
break;
@@ -5738,6 +5819,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
goto out;
gfp = alloc_gfp;
+ alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
+
/* Find an allowed local zone that meets the low watermark. */
z = ac.preferred_zoneref;
for_next_zone_zonelist_nodemask(zone, z, ac.highest_zoneidx, ac.nodemask) {
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 14/45] mm: page_alloc: extract claim_whole_block from try_to_claim_block
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (12 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 13/45] mm: page_alloc: steer movable allocations to fullest clean superpageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 15/45] mm: page_alloc: add per-superpageblock free lists Rik van Riel
` (31 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Extract the whole-pageblock claiming logic from try_to_claim_block()
into a standalone claim_whole_block() function. This handles the
PB_all_free → used transition, pageblock migratetype change, and
block splitting for orders >= pageblock_order.
Pure refactoring, no functional change. Prepares for reuse of this
logic in the per-superpageblock free lists patch.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 90 +++++++++++++++++++++++++++++--------------------
1 file changed, 54 insertions(+), 36 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8b10322d5221..907ce46c060f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2731,6 +2731,57 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
return -1;
}
+/*
+ * claim_whole_block - claim a free block (>= pageblock_order) for a new type
+ * @zone: zone containing the page
+ * @page: free page to claim
+ * @current_order: order of the free page
+ * @order: requested allocation order
+ * @new_type: migratetype to assign
+ * @old_type: current migratetype of the block (for free list removal)
+ *
+ * Handle the PB_all_free → used transition, change the pageblock
+ * migratetype, split the block down to @order, and return the page.
+ */
+static struct page *
+claim_whole_block(struct zone *zone, struct page *page,
+ int current_order, int order, int new_type, int old_type)
+{
+ struct superpageblock *sb;
+ unsigned int nr_added;
+ unsigned long pb_pfn;
+
+ VM_WARN_ON_ONCE(current_order < order);
+
+ /*
+ * Clear PB_all_free for pageblocks being claimed.
+ * This path bypasses page_del_and_expand(), so we
+ * must handle the free→used transition here.
+ */
+ for (pb_pfn = page_to_pfn(page);
+ pb_pfn < page_to_pfn(page) + (1 << current_order);
+ pb_pfn += pageblock_nr_pages) {
+ struct page *pb_page = pfn_to_page(pb_pfn);
+
+ if (get_pfnblock_bit(pb_page, pb_pfn, PB_all_free)) {
+ clear_pfnblock_bit(pb_page, pb_pfn, PB_all_free);
+ superpageblock_pb_now_used(pb_page);
+ }
+ __spb_set_has_type(pb_page, new_type);
+ }
+
+ del_page_from_free_list(page, zone, current_order, old_type);
+ change_pageblock_range(page, current_order, new_type);
+ nr_added = expand(zone, page, order, current_order, new_type);
+ account_freepages(zone, nr_added, new_type);
+
+ /* Single list update after all pageblocks processed */
+ sb = pfn_to_superpageblock(zone, page_to_pfn(page));
+ if (sb)
+ spb_update_list(sb);
+ return page;
+}
+
/*
* This function implements actual block claiming behaviour. If order is large
* enough, we can claim the whole pageblock for the requested migratetype. If
@@ -2754,42 +2805,9 @@ try_to_claim_block(struct zone *zone, struct page *page,
return NULL;
/* Take ownership for orders >= pageblock_order */
- if (current_order >= pageblock_order) {
- unsigned int nr_added;
- unsigned long pb_pfn;
-
- /*
- * Clear PB_all_free for pageblocks being claimed.
- * This path bypasses page_del_and_expand(), so we
- * must handle the free→used transition here.
- * Use block_type (the original migratetype) because
- * that's what was decremented when PB_all_free was set.
- */
- for (pb_pfn = page_to_pfn(page);
- pb_pfn < page_to_pfn(page) + (1 << current_order);
- pb_pfn += pageblock_nr_pages) {
- struct page *pb_page = pfn_to_page(pb_pfn);
-
- if (get_pfnblock_bit(pb_page, pb_pfn, PB_all_free)) {
- clear_pfnblock_bit(pb_page, pb_pfn, PB_all_free);
- superpageblock_pb_now_used(pb_page);
- }
- __spb_set_has_type(pb_page, start_type);
- }
- /* Single list update after all pageblocks processed */
- {
- struct superpageblock *sb =
- pfn_to_superpageblock(zone, page_to_pfn(page));
- if (sb)
- spb_update_list(sb);
- }
-
- del_page_from_free_list(page, zone, current_order, block_type);
- change_pageblock_range(page, current_order, start_type);
- nr_added = expand(zone, page, order, current_order, start_type);
- account_freepages(zone, nr_added, start_type);
- return page;
- }
+ if (current_order >= pageblock_order)
+ return claim_whole_block(zone, page, current_order, order,
+ start_type, block_type);
/* moving whole block can fail due to zone boundary conditions */
if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages,
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 15/45] mm: page_alloc: add per-superpageblock free lists
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (13 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 14/45] mm: page_alloc: extract claim_whole_block from try_to_claim_block Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 16/45] mm: page_alloc: add background superpageblock defragmentation worker Rik van Riel
` (30 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Add per-superpageblock free lists for all buddy orders, enabling precise
allocation steering — pick the right superpageblock first, then allocate
from its local free list in O(1).
Each superpageblock contains free_area[NR_PAGE_ORDERS] with per-migratetype
free lists. Pages belonging to a superpageblock are placed on the owning
superpageblock's free list at every order. Pages not belonging to any
superpageblock remain on zone free lists.
The core free list operations (__add_to_free_list,
__del_page_from_free_list, move_to_free_list) use pfn_sb_free_area() to
route to the correct free_area based on the page's PFN. The expand() path
inherits this automatically. Zone-level nr_free counters are shadowed for
watermark checks.
__rmqueue_smallest() searches per-superpageblock free lists when
superpageblocks are enabled, walking superpageblocks from fullest to
emptiest to concentrate allocations. For empty superpageblocks, it searches
from the highest order down to find the largest available chunk. Zone free
lists serve as a fallback for pages not in any superpageblock.
The fallback allocation paths (__rmqueue_claim and __rmqueue_steal) are
made superpageblock-aware via __rmqueue_sb_find_fallback(), which searches
per-superpageblock free lists for fallback-type pages. The category search
order is migratetype-aware: movable allocations prefer clean
superpageblocks to keep movable pages consolidated, while
unmovable/reclaimable prefer tainted superpageblocks to avoid contaminating
clean ones.
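Purely as a hypothetical illustration of that migratetype-aware order (the
actual __rmqueue_sb_find_fallback() in the patch may structure this
differently):

    static const enum sb_category fallback_cat_order[MIGRATE_PCPTYPES][__NR_SB_CATEGORIES] = {
        [MIGRATE_UNMOVABLE]   = { SB_TAINTED, SB_CLEAN },
        [MIGRATE_RECLAIMABLE] = { SB_TAINTED, SB_CLEAN },
        [MIGRATE_MOVABLE]     = { SB_CLEAN, SB_TAINTED },
    };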
Buddy coalescing in __free_one_page() works correctly because
__del_page_from_free_list() uses pfn_sb_free_area() to find the buddy's
free_area, and both the freed page and its buddy are always in the same
superpageblock for orders below pageblock_order.
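The same-superpageblock property follows directly from the buddy relation (a
worked sketch using the standard __find_buddy_pfn() identity):

    buddy_pfn = page_pfn ^ (1UL << order);

For any order below SUPERPAGEBLOCK_ORDER, and pageblock_order in particular,
the XOR flips a single bit beneath the superpageblock's 1GB alignment, so
page_pfn and buddy_pfn share the same SUPERPAGEBLOCK_ORDER-aligned prefix,
resolve to the same superpageblock, and therefore use the same free_area.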
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 10 +
mm/compaction.c | 36 +-
mm/internal.h | 10 +
mm/mm_init.c | 20 +
mm/page_alloc.c | 855 +++++++++++++++++++++++++++++++++--------
5 files changed, 756 insertions(+), 175 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f03800f5028b..f226dfdd1e99 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -924,9 +924,19 @@ struct superpageblock {
u16 nr_reserved; /* holes, firmware, etc. */
u16 total_pageblocks; /* zone-clipped total */
+ /* Total free pages across all per-superpageblock free lists */
+ unsigned long nr_free_pages;
+
/* For organizing superpageblocks by fullness category */
struct list_head list;
+ /*
+ * Per-superpageblock free lists for all buddy orders.
+ * All pages belonging to this superpageblock are tracked here,
+ * keeping allocation steering effective at every order.
+ */
+ struct free_area free_area[NR_PAGE_ORDERS];
+
/* Identity */
unsigned long start_pfn;
struct zone *zone;
diff --git a/mm/compaction.c b/mm/compaction.c
index cf2a5074c473..88ba88340f3b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -961,6 +961,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
low_pfn += (1UL << order) - 1;
nr_scanned += (1UL << order) - 1;
}
+ /*
+ * Skipped a movable page; clearing
+ * PB_has_movable here would orphan SPB type
+ * counters (debugfs invariant 1).
+ */
+ movable_skipped = true;
goto isolate_fail;
}
/* for alloc_contig case */
@@ -1040,6 +1046,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
low_pfn += (1UL << order) - 1;
nr_scanned += (1UL << order) - 1;
}
+ /*
+ * Skipped a movable compound page; clearing
+ * PB_has_movable here would orphan SPB type
+ * counters (debugfs invariant 1).
+ */
+ movable_skipped = true;
goto isolate_fail;
}
}
@@ -1065,6 +1077,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
movable_skipped = true;
}
+ /*
+ * Non-LRU non-movable_ops page: still occupies the
+ * pageblock, so clearing PB_has_movable here would
+ * orphan SPB type counters (debugfs invariant 1).
+ */
+ movable_skipped = true;
goto isolate_fail;
}
@@ -1303,12 +1321,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* isolated (pinned, writeback, dirty, etc.), leave the
* flag set so a future migration attempt can try again.
*/
- if (!nr_isolated && !movable_skipped && valid_page &&
- get_pfnblock_bit(valid_page, pageblock_start_pfn(start_pfn),
- PB_has_movable))
- clear_pfnblock_bit(valid_page,
- pageblock_start_pfn(start_pfn),
- PB_has_movable);
+ if (!nr_isolated && !movable_skipped && valid_page)
+ superpageblock_clear_has_movable(cc->zone,
+ valid_page);
}
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
@@ -1856,6 +1871,15 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
prep_compound_page(&dst->page, order);
cc->nr_freepages -= 1 << order;
cc->nr_migratepages -= 1 << order;
+
+ /*
+ * Compaction isolates free pages via __isolate_free_page, which
+ * bypasses page_del_and_expand and its PB_has_* tracking. The
+ * destination will hold movable pages after migration, so mark
+ * PB_has_movable on the destination pageblock now.
+ */
+ superpageblock_set_has_movable(cc->zone, &dst->page);
+
return page_rmappable_folio(&dst->page);
}
diff --git a/mm/internal.h b/mm/internal.h
index 163ef96fa777..7ee73f9bb76c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1025,6 +1025,16 @@ void init_cma_reserved_pageblock(struct page *page);
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
+#ifdef CONFIG_COMPACTION
+void superpageblock_clear_has_movable(struct zone *zone, struct page *page);
+void superpageblock_set_has_movable(struct zone *zone, struct page *page);
+#else
+static inline void superpageblock_clear_has_movable(struct zone *zone,
+ struct page *page) {}
+static inline void superpageblock_set_has_movable(struct zone *zone,
+ struct page *page) {}
+#endif
+
#ifdef CONFIG_MEMORY_HOTPLUG
void resize_zone_superpageblocks(struct zone *zone);
#endif
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6af34c1a8cc4..80cfc7c4de98 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1537,16 +1537,27 @@ static void __meminit init_one_superpageblock(struct superpageblock *sb,
unsigned long sb_end = start_pfn + SUPERPAGEBLOCK_NR_PAGES;
unsigned long pb_start = max(start_pfn, zone_start);
unsigned long pb_end = min(sb_end, zone_end);
+ int order, t;
u16 actual_pbs;
sb->nr_unmovable = 0;
sb->nr_reclaimable = 0;
sb->nr_movable = 0;
sb->nr_free = 0;
+ sb->nr_free_pages = 0;
INIT_LIST_HEAD(&sb->list);
sb->start_pfn = start_pfn;
sb->zone = zone;
+ /* Initialize per-superpageblock free areas */
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ struct free_area *area = &sb->free_area[order];
+
+ for (t = 0; t < MIGRATE_TYPES; t++)
+ INIT_LIST_HEAD(&area->free_list[t]);
+ area->nr_free = 0;
+ }
+
/*
* Start with all pageblock slots as reserved.
* init_pageblock_migratetype() will decrement nr_reserved and
@@ -1594,6 +1605,15 @@ static void __init setup_superpageblocks(struct zone *zone)
for (full = 0; full < __NR_SB_FULLNESS; full++)
INIT_LIST_HEAD(&zone->spb_lists[cat][full]);
+ /*
+ * Warn if pages have already been freed into this zone's
+ * free_area before superpageblocks are set up — those pages
+ * would become stranded because __rmqueue_smallest only
+ * searches per-superpageblock free lists.
+ */
+ for (i = 0; i < NR_PAGE_ORDERS; i++)
+ WARN_ON_ONCE(zone->free_area[i].nr_free);
+
if (!zone->spanned_pages)
return;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 907ce46c060f..cbf5f48d377e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -559,6 +559,140 @@ static void __spb_set_has_type(struct page *page, int migratetype)
}
}
+/*
+ * __spb_clear_has_type - clear PB_has_* and decrement type counter
+ *
+ * Idempotent: only decrements the counter on the 1→0 bit transition.
+ */
+static void __spb_clear_has_type(struct page *page, int migratetype)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct superpageblock *sb = pfn_to_superpageblock(page_zone(page), pfn);
+ int bit;
+
+ if (!sb)
+ return;
+
+ bit = migratetype_to_has_bit(migratetype);
+ if (bit < 0)
+ return;
+
+ if (get_pfnblock_bit(page, pfn, bit)) {
+ clear_pfnblock_bit(page, pfn, bit);
+ switch (bit) {
+ case PB_has_unmovable:
+ if (sb->nr_unmovable)
+ sb->nr_unmovable--;
+ break;
+ case PB_has_reclaimable:
+ if (sb->nr_reclaimable)
+ sb->nr_reclaimable--;
+ break;
+ case PB_has_movable:
+ if (sb->nr_movable)
+ sb->nr_movable--;
+ break;
+ }
+ }
+}
+
+#ifdef CONFIG_COMPACTION
+/*
+ * spb_pageblock_has_free_movable_fragments - probe SPB free lists for movable
+ * @zone: zone containing @page
+ * @page: any page within the target pageblock
+ *
+ * Returns true if the SPB containing @page has any free MOVABLE pages on its
+ * per-order free lists at orders below pageblock_order whose PFN falls within
+ * the target pageblock. The compaction migrate scanner only sees in-use pages,
+ * so a pageblock can look "empty of movable" to the scanner while the SPB
+ * still owns small-order MOVABLE fragments inside it. Clearing PB_has_movable
+ * in that case would orphan those fragments from the SPB type accounting and
+ * trigger debugfs invariant 1 (sum_types undercount).
+ *
+ * Returns false (no fragments found) when the SPB lookup fails, which
+ * preserves the legacy clear-on-empty behavior for edge cases.
+ *
+ * Caller must hold zone->lock.
+ */
+static bool spb_pageblock_has_free_movable_fragments(struct zone *zone,
+ struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long pb_start = pageblock_start_pfn(pfn);
+ unsigned long pb_end = pb_start + pageblock_nr_pages;
+ unsigned long frag_pfn;
+ struct superpageblock *sb;
+ struct list_head *list;
+ struct page *frag;
+ unsigned int order;
+
+ sb = pfn_to_superpageblock(zone, pfn);
+ if (!sb)
+ return false;
+
+ for (order = 0; order < pageblock_order; order++) {
+ list = &sb->free_area[order].free_list[MIGRATE_MOVABLE];
+ list_for_each_entry(frag, list, buddy_list) {
+ frag_pfn = page_to_pfn(frag);
+ if (frag_pfn >= pb_start && frag_pfn < pb_end)
+ return true;
+ }
+ }
+
+ return false;
+}
+
+/**
+ * superpageblock_clear_has_movable - clear PB_has_movable with SPB counter update
+ * @zone: zone containing the page
+ * @page: page within the pageblock
+ *
+ * Called from compaction when a full pageblock scan determines no movable
+ * pages remain. Clears PB_has_movable and decrements the superpageblock's
+ * nr_movable counter atomically (under zone->lock).
+ *
+ * Without this, clearing PB_has_movable directly via clear_pfnblock_bit()
+ * would leave the SPB counter stale, causing nr_movable to grow unbounded
+ * as subsequent movable allocations re-set the bit and re-increment.
+ *
+ * The migrate scanner only inspects in-use pages, so it is blind to MOVABLE
+ * fragments below pageblock_order sitting on the SPB free lists. Probe those
+ * lists first; if any fragment of @page's pageblock is still tracked by the
+ * SPB, leave PB_has_movable set so the SPB type accounting stays consistent
+ * (debugfs invariant 1: unmov + recl + mov + free >= total - rsv).
+ */
+void superpageblock_clear_has_movable(struct zone *zone, struct page *page)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ if (!spb_pageblock_has_free_movable_fragments(zone, page))
+ __spb_clear_has_type(page, MIGRATE_MOVABLE);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/**
+ * superpageblock_set_has_movable - set PB_has_movable with SPB counter update
+ * @zone: zone containing the page
+ * @page: page within the pageblock
+ *
+ * Called from compaction when a movable page is migrated into a pageblock.
+ * Compaction bypasses page_del_and_expand (which normally sets PB_has_*)
+ * by using __isolate_free_page + direct migration, so PB_has_movable must
+ * be set explicitly for the destination pageblock.
+ *
+ * Idempotent: only increments the counter on the 0→1 bit transition.
+ */
+void superpageblock_set_has_movable(struct zone *zone, struct page *page)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ __spb_set_has_type(page, MIGRATE_MOVABLE);
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+#endif /* CONFIG_COMPACTION */
+
/**
* spb_get_category - Determine if a superpageblock is clean or tainted
* @sb: superpageblock to classify
@@ -629,7 +763,7 @@ static void spb_update_list(struct superpageblock *sb)
list_del_init(&sb->list);
- if (sb->nr_free == SUPERPAGEBLOCK_NR_PAGEBLOCKS) {
+ if (sb->nr_free == sb->total_pageblocks) {
list_add_tail(&sb->list, &zone->spb_empty);
return;
}
@@ -1067,12 +1201,41 @@ static inline void account_freepages(struct zone *zone, int nr_pages,
zone->nr_free_highatomic + nr_pages);
}
+/**
+ * pfn_sb_free_area - Get the correct free_area for a page at given order
+ * @zone: the zone
+ * @pfn: page frame number
+ * @order: buddy order
+ *
+ * Returns the per-superpageblock free_area if the page belongs to a valid
+ * superpageblock. Otherwise returns the zone free_area (for zones where the
+ * superpageblock setup failed).
+ */
+static inline struct free_area *pfn_sb_free_area(struct zone *zone,
+ unsigned long pfn,
+ unsigned int order,
+ struct superpageblock **sbp)
+{
+ struct superpageblock *sb = pfn_to_superpageblock(zone, pfn);
+
+ if (sb) {
+ if (sbp)
+ *sbp = sb;
+ return &sb->free_area[order];
+ }
+ if (sbp)
+ *sbp = NULL;
+ return &zone->free_area[order];
+}
+
/* Used for pages not on another list */
static inline void __add_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype,
bool tail)
{
- struct free_area *area = &zone->free_area[order];
+ unsigned long pfn = page_to_pfn(page);
+ struct superpageblock *sb;
+ struct free_area *area = pfn_sb_free_area(zone, pfn, order, &sb);
int nr_pages = 1 << order;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
@@ -1085,6 +1248,13 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
list_add(&page->buddy_list, &area->free_list[migratetype]);
area->nr_free++;
+ if (sb) {
+ /* Keep zone-level nr_free accurate for watermark checks */
+ zone->free_area[order].nr_free++;
+ /* Track total free pages per superpageblock */
+ sb->nr_free_pages += nr_pages;
+ }
+
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
}
@@ -1097,7 +1267,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
static inline void move_to_free_list(struct page *page, struct zone *zone,
unsigned int order, int old_mt, int new_mt)
{
- struct free_area *area = &zone->free_area[order];
+ unsigned long pfn = page_to_pfn(page);
+ struct free_area *area = pfn_sb_free_area(zone, pfn, order, NULL);
int nr_pages = 1 << order;
/* Free page moving can fail, so it happens before the type update */
@@ -1121,6 +1292,9 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
{
+ unsigned long pfn = page_to_pfn(page);
+ struct superpageblock *sb;
+ struct free_area *area = pfn_sb_free_area(zone, pfn, order, &sb);
int nr_pages = 1 << order;
VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
@@ -1134,7 +1308,14 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
list_del(&page->buddy_list);
__ClearPageBuddy(page);
set_page_private(page, 0);
- zone->free_area[order].nr_free--;
+ area->nr_free--;
+
+ if (sb) {
+ /* Keep zone-level nr_free accurate for watermark checks */
+ zone->free_area[order].nr_free--;
+ /* Track total free pages per superpageblock */
+ sb->nr_free_pages -= nr_pages;
+ }
if (order >= pageblock_order && !is_migrate_isolate(migratetype))
__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
@@ -1190,33 +1371,44 @@ static void change_pageblock_range(struct page *pageblock_page,
}
}
-/*
+/**
* mark_pageblock_free - handle a pageblock becoming fully free
* @page: page at the start of the pageblock
* @pfn: page frame number
+ * @migratetype: pointer to the caller's migratetype variable (may be updated)
*
- * Clear stale PCP ownership and actual-contents tracking flags when
- * buddy merging reconstructs a full pageblock or a whole pageblock is
- * freed directly. No PCP can still hold pages from this block (otherwise
- * the buddy merge couldn't have completed), so the ownership entry would
- * just cause misrouted frees.
+ * Clear stale PCP ownership and actual-contents tracking flags, mark the
+ * pageblock as fully free for superpageblock accounting, and reset the
+ * migratetype to MOVABLE so the page lands on free_list[MIGRATE_MOVABLE].
+ * Non-movable allocations must go through RMQUEUE_CLAIM to reuse it,
+ * which properly handles PB_all_free and superpageblock accounting.
*/
-static void mark_pageblock_free(struct page *page, unsigned long pfn)
+static void mark_pageblock_free(struct page *page, unsigned long pfn,
+ int *migratetype)
{
clear_pcpblock_owner(page);
/*
- * The entire block is now free — clear actual-contents tracking
- * flags since no allocated pages remain.
+ * Clear PB_has_* bits and decrement corresponding SPB type
+ * counters. Use __spb_clear_has_type (no list update) to avoid
+ * bouncing the SPB between lists; pb_now_free's spb_update_list
+ * handles the final reclassification.
*/
- clear_pfnblock_bit(page, pfn, PB_has_unmovable);
- clear_pfnblock_bit(page, pfn, PB_has_reclaimable);
- clear_pfnblock_bit(page, pfn, PB_has_movable);
+ __spb_clear_has_type(page, MIGRATE_UNMOVABLE);
+ __spb_clear_has_type(page, MIGRATE_RECLAIMABLE);
+ __spb_clear_has_type(page, MIGRATE_MOVABLE);
if (!get_pfnblock_bit(page, pfn, PB_all_free)) {
set_pfnblock_bit(page, pfn, PB_all_free);
superpageblock_pb_now_free(page);
}
+
+ if (*migratetype == MIGRATE_UNMOVABLE ||
+ *migratetype == MIGRATE_RECLAIMABLE ||
+ *migratetype == MIGRATE_HIGHATOMIC) {
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ *migratetype = MIGRATE_MOVABLE;
+ }
}
/*
@@ -1249,6 +1441,7 @@ static inline void __free_one_page(struct page *page,
int migratetype, fpi_t fpi_flags)
{
struct capture_control *capc = task_capc(zone);
+ unsigned int orig_order = order;
unsigned long buddy_pfn = 0;
unsigned long combined_pfn;
struct page *buddy;
@@ -1261,18 +1454,31 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
VM_BUG_ON_PAGE(bad_range(zone, page), page);
- account_freepages(zone, 1 << order, migratetype);
+ if (order >= pageblock_order) {
+ int i, nr_pbs = 1 << (order - pageblock_order);
- /*
- * When freeing a whole pageblock, clear stale PCP ownership
- * and actual-contents tracking flags up front, and mark it
- * as fully free for superpageblock accounting. The in-loop
- * check only fires when sub-pageblock pages merge *up to*
- * pageblock_order, not when entering at pageblock_order
- * directly.
- */
- if (order == pageblock_order)
- mark_pageblock_free(page, pfn);
+ for (i = 0; i < nr_pbs; i++) {
+ int pb_mt = get_pfnblock_migratetype(
+ page + i * pageblock_nr_pages,
+ pfn + i * pageblock_nr_pages);
+ mark_pageblock_free(page + i * pageblock_nr_pages,
+ pfn + i * pageblock_nr_pages,
+ &pb_mt);
+ }
+ /*
+ * After mark_pageblock_free, non-CMA sub-pageblocks are
+ * MOVABLE. CMA pageblocks retain their CMA type so pages
+ * land on the correct free list for CMA allocations.
+ * ISOLATE pageblocks must stay ISOLATE so that
+ * account_freepages() correctly skips them — otherwise
+ * NR_FREE_PAGES gets incremented for isolated pages.
+ */
+ if (!is_migrate_cma(migratetype) &&
+ !is_migrate_isolate(migratetype))
+ migratetype = MIGRATE_MOVABLE;
+ }
+
+ account_freepages(zone, 1 << order, migratetype);
while (order < MAX_PAGE_ORDER) {
int buddy_mt = migratetype;
@@ -1329,8 +1535,29 @@ static inline void __free_one_page(struct page *page,
* clear any stale PCP ownership and actual-contents
* tracking flags.
*/
- if (order == pageblock_order)
- mark_pageblock_free(page, pfn);
+ if (order == pageblock_order) {
+ int old_mt = migratetype;
+
+ mark_pageblock_free(page, pfn, &migratetype);
+ /*
+ * mark_pageblock_free may convert migratetype to
+ * MOVABLE. Transfer the accounting done earlier so
+ * nr_free_highatomic doesn't leak.
+ *
+ * We transfer 1 << orig_order pages — the amount
+ * credited by this __free_one_page call. Buddies
+ * consumed during merging may also have HIGHATOMIC
+ * credits from their own frees; those are not tracked
+ * here. In practice HIGHATOMIC reserves are small and
+ * short-lived, so any residual drift is minor.
+ */
+ if (old_mt != migratetype) {
+ account_freepages(zone, -(1 << orig_order),
+ old_mt);
+ account_freepages(zone, 1 << orig_order,
+ migratetype);
+ }
+ }
}
done_merging:
@@ -2148,15 +2375,44 @@ static __always_inline void page_del_and_expand(struct zone *zone,
/*
* If we're splitting a page that spans at least a full pageblock,
- * the allocated pageblock transitions from fully-free to in-use.
- * Clear PB_all_free and update superpageblock accounting.
+ * each constituent pageblock transitions from fully-free to in-use.
+ * Clear PB_all_free and update superpageblock accounting for ALL
+ * pageblocks in the range, not just the first one.
*/
if (high >= pageblock_order) {
unsigned long pfn = page_to_pfn(page);
+ unsigned long end_pfn = pfn + (1 << high);
+
+ for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ struct page *pb_page = pfn_to_page(pfn);
- if (get_pfnblock_bit(page, pfn, PB_all_free)) {
- clear_pfnblock_bit(page, pfn, PB_all_free);
- superpageblock_pb_now_used(page);
+ if (get_pfnblock_bit(pb_page, pfn, PB_all_free)) {
+ clear_pfnblock_bit(pb_page, pfn, PB_all_free);
+ superpageblock_pb_now_used(pb_page);
+ }
+ __spb_set_has_type(pb_page, migratetype);
+ }
+ /* Single list update after all pageblocks processed */
+ {
+ struct superpageblock *sb =
+ pfn_to_superpageblock(zone,
+ page_to_pfn(page));
+ if (sb)
+ spb_update_list(sb);
+ }
+ } else {
+ /*
+ * Sub-pageblock allocation: set PB_has_<migratetype> for
+ * the containing pageblock. Idempotent — only increments
+ * the counter on the first allocation of this type.
+ */
+ __spb_set_has_type(page, migratetype);
+ {
+ struct superpageblock *sb =
+ pfn_to_superpageblock(zone,
+ page_to_pfn(page));
+ if (sb)
+ spb_update_list(sb);
}
}
@@ -2311,6 +2567,15 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
/* Bounded scan limit when searching free lists for tainted superpageblock pages */
#define SPB_SCAN_LIMIT 8
+/*
+ * Reserve free pageblocks in tainted superpageblocks for unmovable/reclaimable
+ * allocations. Movable allocations skip tainted superpageblocks that have
+ * fewer than this many free pageblocks, ensuring that unmovable claims
+ * always find room in existing tainted superpageblocks instead of spilling
+ * into clean ones.
+ */
+#define SPB_TAINTED_RESERVE 4
+
/**
* sb_preferred_for_movable - Find the fullest clean superpageblock for movable
* @zone: zone to search
@@ -2350,38 +2615,38 @@ static struct page *__rmqueue_from_sb(struct zone *zone, unsigned int order,
int migratetype, struct superpageblock *sb)
{
unsigned int current_order;
- unsigned long sb_start = sb->start_pfn;
- unsigned long sb_end = sb_start + (1UL << SUPERPAGEBLOCK_ORDER);
struct free_area *area;
struct page *page;
- int scanned;
- for (current_order = order; current_order < NR_PAGE_ORDERS;
+ /*
+ * Search the superpageblock's own free lists for all orders.
+ */
+ for (current_order = order;
+ current_order < NR_PAGE_ORDERS;
++current_order) {
- area = &zone->free_area[current_order];
- scanned = 0;
-
- list_for_each_entry(page, &area->free_list[migratetype],
- buddy_list) {
- unsigned long pfn = page_to_pfn(page);
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ continue;
- if (pfn >= sb_start && pfn < sb_end) {
- page_del_and_expand(zone, page, order,
- current_order,
- migratetype);
- return page;
- }
- if (++scanned >= SPB_SCAN_LIMIT)
- break;
- }
+ page_del_and_expand(zone, page, order, current_order,
+ migratetype);
+ return page;
}
+
return NULL;
}
/*
* Go through the free lists for the given migratetype and remove
- * the smallest available page from the freelists
+ * the smallest available page from the freelists.
+ *
+ * When superpageblocks are enabled, search per-superpageblock free lists first,
+ * falling back to zone free lists for pages not in any superpageblock.
*/
+static struct page *claim_whole_block(struct zone *zone, struct page *page,
+ int current_order, int order, int new_type, int old_type);
+
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype)
@@ -2389,14 +2654,179 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
unsigned int current_order;
struct free_area *area;
struct page *page;
+ int full;
+ struct superpageblock *sb;
+ /*
+ * Category search order: 2 passes.
+ * Movable: clean first, then tainted (pack into clean SBs).
+ * Others: tainted first, then clean (concentrate in tainted SBs).
+ */
+ static const enum sb_category cat_order[2][2] = {
+ [0] = { SB_TAINTED, SB_CLEAN }, /* unmovable/reclaimable */
+ [1] = { SB_CLEAN, SB_TAINTED }, /* movable */
+ };
+ int movable = (migratetype == MIGRATE_MOVABLE) ? 1 : 0;
+
+ /*
+ * Search per-superpageblock free lists for pages of the requested
+ * migratetype, walking superpageblocks from fullest to emptiest
+ * to pack allocations.
+ *
+ * For unmovable/reclaimable, prefer tainted superpageblocks to
+ * concentrate non-movable allocations into fewer superpageblocks.
+ * For movable, prefer clean superpageblocks to keep them homogeneous.
+ *
+ * Search empty superpageblocks between the preferred and fallback
+ * category passes to avoid movable allocations consuming free
+ * pageblocks in tainted superpageblocks (which unmovable needs for
+ * future CLAIMs), and vice versa.
+ */
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ enum sb_category cat = cat_order[movable][0];
+
+ list_for_each_entry(sb,
+ &zone->spb_lists[cat][full], list) {
+ if (!sb->nr_free_pages)
+ continue;
+ for (current_order = order;
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+
+ /*
+ * For non-movable allocations, try to reclaim free pageblocks
+ * from tainted superpageblocks before looking at empty or clean
+ * ones. Free pageblocks in tainted SBs have pages on the MOVABLE
+ * free list (reset by mark_pageblock_free), so the search above
+ * misses them. Claim them inline to keep non-movable allocations
+ * concentrated in already-tainted superpageblocks.
+ */
+ if (!movable && !is_migrate_cma(migratetype)) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ if (!sb->nr_free)
+ continue;
+ for (current_order = max_t(unsigned int,
+ order, pageblock_order);
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, MIGRATE_MOVABLE);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page = claim_whole_block(zone, page,
+ current_order, order,
+ migratetype, MIGRATE_MOVABLE);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
+
+ /* Empty superpageblocks: try before falling back to non-preferred category */
+ list_for_each_entry(sb, &zone->spb_empty, list) {
+ if (!sb->nr_free_pages)
+ continue;
+ for (current_order = max(order, pageblock_order);
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page, order,
+ current_order, migratetype);
+ trace_mm_page_alloc_zone_locked(page, order,
+ migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+
+ /*
+ * Pass 4: movable allocations fall back to tainted SPBs.
+ * Non-movable allocations must NOT search clean SPBs here;
+ * stale migratetype labels create phantom non-movable free
+ * pages in clean SPBs that would cause unnecessary tainting.
+ * Let __rmqueue_claim and __rmqueue_steal handle non-movable
+ * fallback with proper ALLOC_NOFRAGMENT protection.
+ */
+ if (movable) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ enum sb_category cat = cat_order[movable][1];
+
+ list_for_each_entry(sb,
+ &zone->spb_lists[cat][full], list) {
+ if (!sb->nr_free_pages)
+ continue;
+ /*
+ * Movable falling back to tainted: skip SBs
+ * with few free pageblocks to reserve space
+ * for future unmovable/reclaimable claims.
+ */
+ if (sb->nr_free <= SPB_TAINTED_RESERVE)
+ continue;
+ for (current_order = order;
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
- /* Find a page of the appropriate size in the preferred list */
- for (current_order = order; current_order < NR_PAGE_ORDERS; ++current_order) {
+ /*
+ * Zone free lists: all pages should be on superpageblock lists.
+ * Finding a page here means memory was hotplugged without
+ * setting up superpageblocks for the new range.
+ */
+ for (current_order = order;
+ current_order < NR_PAGE_ORDERS; ++current_order) {
area = &(zone->free_area[current_order]);
page = get_page_from_free_area(area, migratetype);
if (!page)
continue;
+ WARN_ON_ONCE(zone->superpageblocks);
page_del_and_expand(zone, page, order, current_order,
migratetype);
trace_mm_page_alloc_zone_locked(page, order, migratetype,
@@ -2742,6 +3172,8 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
*
* Handle the PB_all_free → used transition, change the pageblock
* migratetype, split the block down to @order, and return the page.
+ * Used by both the claim fallback path and __rmqueue_smallest when
+ * reclaiming free pageblocks from tainted superpageblocks.
*/
static struct page *
claim_whole_block(struct zone *zone, struct page *page,
@@ -2753,11 +3185,6 @@ claim_whole_block(struct zone *zone, struct page *page,
VM_WARN_ON_ONCE(current_order < order);
- /*
- * Clear PB_all_free for pageblocks being claimed.
- * This path bypasses page_del_and_expand(), so we
- * must handle the free→used transition here.
- */
for (pb_pfn = page_to_pfn(page);
pb_pfn < page_to_pfn(page) + (1 << current_order);
pb_pfn += pageblock_nr_pages) {
@@ -2804,6 +3231,16 @@ try_to_claim_block(struct zone *zone, struct page *page,
if (get_pageblock_isolate(page))
return NULL;
+ /*
+ * Never steal from CMA pageblocks. CMA pages freed through
+ * PCP may land on the MOVABLE free list (PCP caches the
+ * allocation-time migratetype), making them visible to the
+ * fallback search. Stealing would corrupt CMA by changing
+ * the pageblock type away from MIGRATE_CMA.
+ */
+ if (is_migrate_cma(get_pageblock_migratetype(page)))
+ return NULL;
+
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order)
return claim_whole_block(zone, page, current_order, order,
@@ -2874,10 +3311,134 @@ try_to_claim_block(struct zone *zone, struct page *page,
return NULL;
}
+/*
+ * Search per-superpageblock free lists for a page of a fallback migratetype.
+ * Sub-pageblock-order free pages live on superpageblock free lists, not zone
+ * free lists, so __rmqueue_claim and __rmqueue_steal need this helper to
+ * find fallback pages at those orders.
+ *
+ * For unmovable/reclaimable allocations, prefer tainted superpageblocks to
+ * keep clean ones clean for future large contiguous allocations.
+ * For movable allocations, prefer clean superpageblocks to keep movable
+ * pages consolidated and superpageblocks homogeneous.
+ *
+ * @search_cats: bitmask controlling which categories to search.
+ * bit 0: search the preferred category (tainted for unmov, clean for mov)
+ * bit 1: search empty superpageblocks
+ * bit 2: search the fallback category (clean for unmov, tainted for mov)
+ * All bits set (0x7) gives the original behavior.
+ */
+#define SB_SEARCH_PREFERRED (1 << 0)
+#define SB_SEARCH_EMPTY (1 << 1)
+#define SB_SEARCH_FALLBACK (1 << 2)
+#define SB_SEARCH_ALL (SB_SEARCH_PREFERRED | SB_SEARCH_EMPTY | SB_SEARCH_FALLBACK)
+
+static struct page *
+__rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
+ int start_migratetype, int *fallback_mt,
+ unsigned int search_cats)
+{
+ int full, i;
+ struct superpageblock *sb;
+ /*
+ * Category search order: 2 passes.
+ * Movable: clean, tainted. Others: tainted, clean.
+ */
+ static const enum sb_category cat_order[2][2] = {
+ [0] = { SB_TAINTED, SB_CLEAN }, /* unmovable/reclaimable */
+ [1] = { SB_CLEAN, SB_TAINTED }, /* movable */
+ };
+ int movable = (start_migratetype == MIGRATE_MOVABLE) ? 1 : 0;
+
+ /* Pass 0: preferred category */
+ if (search_cats & SB_SEARCH_PREFERRED) {
+ enum sb_category cat = cat_order[movable][0];
+
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[cat][full], list) {
+ struct free_area *area =
+ &sb->free_area[order];
+
+ if (movable && cat == SB_TAINTED &&
+ sb->nr_free <= SPB_TAINTED_RESERVE)
+ continue;
+
+ for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
+ int fmt = fallbacks[start_migratetype][i];
+ struct page *page;
+
+ page = get_page_from_free_area(area,
+ fmt);
+ if (page) {
+ *fallback_mt = fmt;
+ return page;
+ }
+ }
+ }
+ }
+ }
+
+ /* Empty superpageblocks: between preferred and fallback */
+ if (search_cats & SB_SEARCH_EMPTY) {
+ list_for_each_entry(sb, &zone->spb_empty, list) {
+ struct free_area *area =
+ &sb->free_area[order];
+
+ for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
+ int fmt = fallbacks[start_migratetype][i];
+ struct page *page;
+
+ page = get_page_from_free_area(area,
+ fmt);
+ if (page) {
+ *fallback_mt = fmt;
+ return page;
+ }
+ }
+ }
+ }
+
+ /* Pass 1: fallback category */
+ if (search_cats & SB_SEARCH_FALLBACK) {
+ enum sb_category cat = cat_order[movable][1];
+
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[cat][full], list) {
+ struct free_area *area =
+ &sb->free_area[order];
+
+ if (movable && cat == SB_TAINTED &&
+ sb->nr_free <= SPB_TAINTED_RESERVE)
+ continue;
+
+ for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
+ int fmt = fallbacks[start_migratetype][i];
+ struct page *page;
+
+ page = get_page_from_free_area(area,
+ fmt);
+ if (page) {
+ *fallback_mt = fmt;
+ return page;
+ }
+ }
+ }
+ }
+ }
+
+ return NULL;
+}
+
/*
* Try to allocate from some fallback migratetype by claiming the entire block,
* i.e. converting it to the allocation's start migratetype.
*
+ * Search by category first, then by order within each category, to avoid
+ * claiming clean/empty superpageblocks when tainted ones still have space
+ * at smaller orders.
+ *
* The use of signed ints for order and current_order is a deliberate
* deviation from the rest of this file, to make the for loop
* condition simpler.
@@ -2886,11 +3447,16 @@ static __always_inline struct page *
__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
unsigned int alloc_flags)
{
- struct free_area *area;
int current_order;
int min_order = order;
struct page *page;
int fallback_mt;
+ static const unsigned int cat_search[] = {
+ SB_SEARCH_PREFERRED,
+ SB_SEARCH_EMPTY,
+ SB_SEARCH_FALLBACK,
+ };
+ int c;
/*
* Do not steal pages from freelists belonging to other pageblocks
@@ -2901,65 +3467,34 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
min_order = pageblock_order;
/*
- * Find the largest available free page in the other list. This roughly
- * approximates finding the pageblock with the most free pages, which
- * would be too costly to do exactly.
+ * Find the largest available free page in a fallback migratetype.
+ * Search each superpageblock category across all orders before
+ * moving to the next category, so that smaller blocks in tainted
+ * superpageblocks are preferred over larger blocks in empty/clean
+ * ones.
*/
- for (current_order = MAX_PAGE_ORDER; current_order >= min_order;
- --current_order) {
- area = &(zone->free_area[current_order]);
- fallback_mt = find_suitable_fallback(area, current_order,
- start_migratetype, true);
-
- /* No block in that order */
- if (fallback_mt == -1)
- continue;
-
- /* Advanced into orders too low to claim, abort */
- if (fallback_mt == -2)
- break;
-
- page = get_page_from_free_area(area, fallback_mt);
+ for (c = 0; c < ARRAY_SIZE(cat_search); c++) {
+ for (current_order = MAX_PAGE_ORDER;
+ current_order >= min_order; --current_order) {
+ if (!should_try_claim_block(current_order,
+ start_migratetype))
+ break;
+ page = __rmqueue_sb_find_fallback(zone, current_order,
+ start_migratetype,
+ &fallback_mt, cat_search[c]);
+ if (!page)
+ continue;
- /*
- * For unmovable/reclaimable stealing, prefer pages from
- * tainted superpageblocks (already contaminated) to keep clean
- * superpageblocks clean for future 1GB allocations.
- */
- if (start_migratetype != MIGRATE_MOVABLE &&
- zone->superpageblocks && page) {
- struct superpageblock *sb;
- struct page *alt;
- int scanned = 0;
-
- sb = pfn_to_superpageblock(zone, page_to_pfn(page));
- if (sb && spb_get_category(sb) == SB_CLEAN) {
- list_for_each_entry(alt,
- &area->free_list[fallback_mt],
- buddy_list) {
- struct superpageblock *asb;
-
- if (++scanned > SPB_SCAN_LIMIT)
- break;
- asb = pfn_to_superpageblock(zone,
- page_to_pfn(alt));
- if (asb && spb_get_category(asb) ==
- SB_TAINTED) {
- page = alt;
- break;
- }
- }
+ page = try_to_claim_block(zone, page, current_order,
+ order, start_migratetype,
+ fallback_mt, alloc_flags);
+ if (page) {
+ trace_mm_page_alloc_extfrag(page, order,
+ current_order, start_migratetype,
+ fallback_mt);
+ return page;
}
}
-
- page = try_to_claim_block(zone, page, current_order, order,
- start_migratetype, fallback_mt,
- alloc_flags);
- if (page) {
- trace_mm_page_alloc_extfrag(page, order, current_order,
- start_migratetype, fallback_mt);
- return page;
- }
}
return NULL;
@@ -2973,19 +3508,23 @@ static __always_inline struct page *
__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
{
struct superpageblock *sb;
- struct free_area *area;
int current_order;
struct page *page;
int fallback_mt;
+ /*
+ * Search per-superpageblock free lists for fallback migratetypes.
+ * Superpageblocks are always enabled for populated zones.
+ */
for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
- area = &(zone->free_area[current_order]);
- fallback_mt = find_suitable_fallback(area, current_order,
- start_migratetype, false);
- if (fallback_mt == -1)
+ page = __rmqueue_sb_find_fallback(zone, current_order,
+ start_migratetype,
+ &fallback_mt,
+ SB_SEARCH_PREFERRED | SB_SEARCH_FALLBACK);
+
+ if (!page)
continue;
- page = get_page_from_free_area(area, fallback_mt);
page_del_and_expand(zone, page, order, current_order, fallback_mt);
/*
@@ -3220,33 +3759,11 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
goto out;
/*
- * Phase 2: Zone too fragmented for whole pageblocks.
- * Sweep zone free lists top-down for same-migratetype
- * chunks. Avoids cross-type stealing and keeps PCP
- * functional under fragmentation.
- *
- * No ownership claim or PagePCPBuddy - these are
- * sub-pageblock fragments cached for batching only.
- *
- * Stop above the requested order -- at that point,
- * phase 3's __rmqueue() does the same lookup but with
- * migratetype fallback.
+ * Phase 2 was removed: it swept zone free lists for sub-pageblock
+ * fragments, which are always empty when superpageblocks are enabled.
+ * Phase 3's __rmqueue() -> __rmqueue_smallest() properly searches
+ * per-superpageblock free lists at all orders.
*/
- for (o = pageblock_order - 1;
- o > (int)order && refilled < pages_needed; o--) {
- struct free_area *area = &zone->free_area[o];
- struct page *page;
-
- while (refilled + (1 << o) <= pages_needed) {
- page = get_page_from_free_area(area, migratetype);
- if (!page)
- break;
-
- del_page_from_free_list(page, zone, o, migratetype);
- pcp_enqueue_tail(pcp, page, migratetype, o);
- refilled += 1 << o;
- }
- }
/*
* Phase 3: Last resort. Use __rmqueue() which does
@@ -4251,10 +4768,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
- struct free_area *area = &(zone->free_area[order]);
+ struct free_area *area;
+ struct superpageblock *sb;
unsigned long size;
-
- page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC);
+ unsigned long i;
+
+ page = NULL;
+ /* Search per-superpageblock free lists */
+ for (i = 0; i < zone->nr_superpageblocks && !page; i++) {
+ sb = &zone->superpageblocks[i];
+ area = &sb->free_area[order];
+ page = get_page_from_free_area(area,
+ MIGRATE_HIGHATOMIC);
+ }
if (!page)
continue;
@@ -4385,29 +4911,20 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
if (!order)
return true;
- /* For a high-order request, check at least one suitable page is free */
+ /*
+ * For a high-order request, check at least one suitable page is free.
+ * The zone-level free_area nr_free counters also include pages that
+ * sit on per-superpageblock free lists. A non-zero nr_free therefore
+ * means the allocator will find pages on superpageblock lists even
+ * when the zone-level list heads are empty.
+ */
for (o = order; o < NR_PAGE_ORDERS; o++) {
struct free_area *area = &z->free_area[o];
- int mt;
if (!area->nr_free)
continue;
- for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
- if (!free_area_empty(area, mt))
- return true;
- }
-
-#ifdef CONFIG_CMA
- if ((alloc_flags & ALLOC_CMA) &&
- !free_area_empty(area, MIGRATE_CMA)) {
- return true;
- }
-#endif
- if ((alloc_flags & (ALLOC_HIGHATOMIC|ALLOC_OOM)) &&
- !free_area_empty(area, MIGRATE_HIGHATOMIC)) {
- return true;
- }
+ return true;
}
return false;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 16/45] mm: page_alloc: add background superpageblock defragmentation worker
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (14 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 15/45] mm: page_alloc: add per-superpageblock free lists Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 17/45] mm: page_alloc: add within-superpageblock compaction for clean superpageblocks Rik van Riel
` (29 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Add an event-driven background worker that evacuates movable pages from
tainted superpageblocks when free space runs low. Each superpageblock has
its own work_struct, so defrag targets the specific superpageblock that
needs it rather than scanning the entire system.
Defrag is triggered from spb_update_list() when a tainted superpageblock
that still holds movable pages drops below the tainted reserve of free
pageblocks (SPB_TAINTED_RESERVE). The worker then evacuates movable
pageblocks until the reserve is restored or no movable pages remain.
Clean superpageblocks (only free + movable) are never defragged since they
don't need it. Superpageblocks with no movable pages are skipped since there is
nothing to evacuate.
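Condensed, the trigger and stop checks added below amount to the
following (illustrative only; the helper names here are made up, and
the real spb_needs_defrag() additionally applies a no-progress
cooldown):

  /* Trigger: tainted, still has movable pages, reserve depleted */
  static bool spb_wants_defrag(struct superpageblock *sb)
  {
          return sb && spb_get_category(sb) == SB_TAINTED &&
                 sb->nr_movable > 0 &&
                 sb->nr_free < SPB_TAINTED_RESERVE;
  }

  /* Stop: nothing left to evacuate, or the reserve is restored */
  static bool spb_defrag_target_reached(struct superpageblock *sb)
  {
          return !sb->nr_movable || sb->nr_free >= SPB_TAINTED_RESERVE;
  }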
[v19 fold] Drop the now-dead per-pageblock evacuate plumbing
(queue_pageblock_evacuate, evacuate_item, evacuate_pool, evacuate_freelist,
evacuate_item_alloc/free, evacuate_work_fn, evacuate_irq_work_fn, plus
pgdat->evacuate_pending and pgdat->evacuate_irq_work). The new
background superpageblock defragmentation worker introduced here calls
evacuate_pageblock() directly from within its own work_struct, so the
async per-pageblock work-item pool, the irq_work indirection, and their
per-pgdat init in pageblock_evacuate_init() are no longer used.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 19 ++-
mm/internal.h | 2 +
mm/mm_init.c | 87 +++++++----
mm/page_alloc.c | 317 ++++++++++++++++++++++++++++-------------
4 files changed, 301 insertions(+), 124 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f226dfdd1e99..61fe939e7c0f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -937,6 +937,23 @@ struct superpageblock {
*/
struct free_area free_area[NR_PAGE_ORDERS];
+#ifdef CONFIG_COMPACTION
+ /* Background defragmentation work for this superpageblock */
+ struct work_struct defrag_work;
+ struct irq_work defrag_irq_work;
+ bool defrag_active;
+ /*
+ * Back-off state after a no-op defrag pass: defer the next attempt
+ * until either nr_free_pages has grown by at least pageblock_nr_pages
+ * or a cooldown elapses, so allocator hot paths cannot re-arm
+ * defrag_work many times per second on an SB that cannot make progress.
+ * defrag_last_no_progress_jiffies == 0 means the previous pass made
+ * progress (or no pass has run yet).
+ */
+ unsigned long defrag_last_no_progress_jiffies;
+ unsigned long defrag_last_no_progress_pages;
+#endif
+
/* Identity */
unsigned long start_pfn;
struct zone *zone;
@@ -1532,8 +1549,6 @@ typedef struct pglist_data {
struct task_struct *kcompactd;
bool proactive_compact_trigger;
struct workqueue_struct *evacuate_wq;
- struct llist_head evacuate_pending;
- struct irq_work evacuate_irq_work;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/mm/internal.h b/mm/internal.h
index 7ee73f9bb76c..02f1c7d36b85 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1026,9 +1026,11 @@ void init_cma_reserved_pageblock(struct page *page);
#endif /* CONFIG_COMPACTION || CONFIG_CMA */
#ifdef CONFIG_COMPACTION
+void init_superpageblock_defrag(struct superpageblock *sb);
void superpageblock_clear_has_movable(struct zone *zone, struct page *page);
void superpageblock_set_has_movable(struct zone *zone, struct page *page);
#else
+static inline void init_superpageblock_defrag(struct superpageblock *sb) {}
static inline void superpageblock_clear_has_movable(struct zone *zone,
struct page *page) {}
static inline void superpageblock_set_has_movable(struct zone *zone,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 80cfc7c4de98..1f55ff3126a2 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1668,6 +1668,7 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
size_t alloc_size;
unsigned long i;
int nid = zone_to_nid(zone);
+ unsigned long flags;
if (!zone->spanned_pages)
return;
@@ -1690,6 +1691,37 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
return;
}
+ /* Initialize new superpageblocks (not from old array) first, outside lock */
+ if (zone->superpageblocks) {
+ old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
+ SUPERPAGEBLOCK_ORDER;
+ } else {
+ old_offset = 0;
+ }
+
+ for (i = 0; i < new_nr_sbs; i++) {
+ struct superpageblock *sb = &new_sbs[i];
+ bool is_old = false;
+
+ if (zone->superpageblocks &&
+ i >= old_offset &&
+ i < old_offset + zone->nr_superpageblocks)
+ is_old = true;
+
+ if (is_old)
+ continue;
+
+ init_one_superpageblock(sb, zone,
+ new_sb_base + (i << SUPERPAGEBLOCK_ORDER),
+ zone_start, zone_end);
+ }
+
+ /*
+ * Take zone->lock for the copy+fixup+swap to prevent concurrent
+ * allocations from traversing free lists while we relocate them.
+ */
+ spin_lock_irqsave(&zone->lock, flags);
+
/*
* Copy existing superpageblocks to their new position.
* The old array covers [old_base, old_base + old_nr * SB_SIZE).
@@ -1703,39 +1735,42 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
zone->nr_superpageblocks * sizeof(struct superpageblock));
/*
- * Fix up list_head pointers that were self-referencing
- * (empty lists) or pointing into the old array.
+ * Fix up all list_head pointers: both the SPB category list
+ * and every free_area[order].free_list[migratetype]. Pages on
+ * buddy free lists have buddy_list.prev/next pointing at the
+ * old array's list heads — those must be updated to point at
+ * the new array.
*/
for (i = old_offset; i < old_offset + zone->nr_superpageblocks; i++) {
struct superpageblock *sb = &new_sbs[i];
+ struct superpageblock *old_sb =
+ &zone->superpageblocks[i - old_offset];
+ int order, mt;
- if (list_empty(&sb->list))
+ /* Fix up sb->list (zone category/fullness list) */
+ if (list_empty(&old_sb->list))
INIT_LIST_HEAD(&sb->list);
else
- list_replace(&zone->superpageblocks[i - old_offset].list,
- &sb->list);
- }
- }
-
- /* Initialize new superpageblocks (slots not covered by old array) */
- for (i = 0; i < new_nr_sbs; i++) {
- struct superpageblock *sb = &new_sbs[i];
- bool is_old = false;
+ list_replace(&old_sb->list, &sb->list);
+
+ /* Fix up all free_area list heads */
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+ struct list_head *old_list =
+ &old_sb->free_area[order].free_list[mt];
+ struct list_head *new_list =
+ &sb->free_area[order].free_list[mt];
+
+ if (list_empty(old_list))
+ INIT_LIST_HEAD(new_list);
+ else
+ list_replace(old_list, new_list);
+ }
+ }
- if (zone->superpageblocks) {
- old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
- SUPERPAGEBLOCK_ORDER;
- if (i >= old_offset &&
- i < old_offset + zone->nr_superpageblocks)
- is_old = true;
+ /* Reinitialize defrag work structs (contain stale pointers) */
+ init_superpageblock_defrag(sb);
}
-
- if (is_old)
- continue;
-
- init_one_superpageblock(sb, zone,
- new_sb_base + (i << SUPERPAGEBLOCK_ORDER),
- zone_start, zone_end);
}
/*
@@ -1774,6 +1809,8 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
zone->superpageblock_base_pfn = new_sb_base;
zone->spb_kvmalloced = true;
+ spin_unlock_irqrestore(&zone->lock, flags);
+
/*
* The boot-time array was allocated with memblock_alloc, which
* is not individually freeable after boot. Only kvfree arrays
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cbf5f48d377e..07d2926ffb3d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -63,10 +63,6 @@
#include "shuffle.h"
#include "page_reporting.h"
-#ifdef CONFIG_COMPACTION
-static void queue_pageblock_evacuate(struct zone *zone, unsigned long pfn);
-#endif
-
/* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
typedef int __bitwise fpi_t;
@@ -753,8 +749,15 @@ static inline enum sb_fullness sb_get_fullness(struct superpageblock *sb,
*
* Called after counters change. Removes from current list (if any)
* and adds to the appropriate list based on current fullness and
- * taint status.
+ * taint status. Also triggers background defragmentation if the
+ * superpageblock is tainted and running low on free space.
*/
+#ifdef CONFIG_COMPACTION
+static void spb_maybe_start_defrag(struct superpageblock *sb);
+#else
+static inline void spb_maybe_start_defrag(struct superpageblock *sb) {}
+#endif
+
static void spb_update_list(struct superpageblock *sb)
{
struct zone *zone = sb->zone;
@@ -771,6 +774,8 @@ static void spb_update_list(struct superpageblock *sb)
cat = spb_get_category(sb);
full = sb_get_fullness(sb, cat);
list_add_tail(&sb->list, &zone->spb_lists[cat][full]);
+
+ spb_maybe_start_defrag(sb);
}
/**
@@ -3297,12 +3302,6 @@ try_to_claim_block(struct zone *zone, struct page *page,
sb = pfn_to_superpageblock(zone, start_pfn);
if (sb)
spb_update_list(sb);
-
- if ((start_type == MIGRATE_UNMOVABLE ||
- start_type == MIGRATE_RECLAIMABLE) &&
- get_pfnblock_bit(start_page, start_pfn,
- PB_has_movable))
- queue_pageblock_evacuate(zone, start_pfn);
}
#endif
return __rmqueue_smallest(zone, order, start_type);
@@ -8100,42 +8099,14 @@ void __init page_alloc_sysctl_init(void)
#ifdef CONFIG_COMPACTION
/*
- * Pageblock evacuation: asynchronously migrate movable pages out of
- * pageblocks that were stolen for unmovable/reclaimable allocations.
- * This keeps unmovable/reclaimable allocations concentrated in fewer
- * pageblocks, reducing long-term fragmentation.
- *
- * Uses a global pool of 64 pre-allocated work items (~3.5KB total)
- * and a per-pgdat workqueue to keep migration node-local.
+ * Pageblock evacuation: synchronously migrate movable pages out of a
+ * pageblock to consolidate fragmentation. Driven by the background
+ * superpageblock defragmentation worker (see below); has no per-pageblock
+ * scheduling infrastructure of its own.
*/
-struct evacuate_item {
- struct work_struct work;
- struct zone *zone;
- unsigned long start_pfn;
- struct llist_node free_node;
-};
-
-#define NR_EVACUATE_ITEMS 64
-static struct evacuate_item evacuate_pool[NR_EVACUATE_ITEMS];
-static struct llist_head evacuate_freelist;
-
-static struct evacuate_item *evacuate_item_alloc(void)
-{
- struct llist_node *node;
-
- node = llist_del_first(&evacuate_freelist);
- if (!node)
- return NULL;
- return container_of(node, struct evacuate_item, free_node);
-}
-
-static void evacuate_item_free(struct evacuate_item *item)
-{
- llist_add(&item->free_node, &evacuate_freelist);
-}
-
-static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
+static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn,
+ bool force)
{
unsigned long end_pfn = start_pfn + pageblock_nr_pages;
unsigned long pfn = start_pfn;
@@ -8153,8 +8124,14 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
.gfp_mask = GFP_HIGHUSER_MOVABLE,
};
- /* Verify this pageblock is still worth evacuating */
- if (get_pageblock_migratetype(pfn_to_page(start_pfn)) == MIGRATE_MOVABLE)
+ /*
+ * Verify this pageblock is still worth evacuating.
+ * Skip if it reverted to MOVABLE (steal was undone) — unless
+ * force is set (background defrag wants to clear movable pages
+ * out of tainted superpageblocks regardless of pageblock type).
+ */
+ if (!force &&
+ get_pageblock_migratetype(pfn_to_page(start_pfn)) == MIGRATE_MOVABLE)
return;
INIT_LIST_HEAD(&cc.migratepages);
@@ -8209,86 +8186,206 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
putback_movable_pages(&cc.migratepages);
}
-static void evacuate_work_fn(struct work_struct *work)
+/*
+ * Background superpageblock defragmentation.
+ *
+ * Evacuate movable pageblocks from tainted superpageblocks to consolidate
+ * contamination. Triggered on-demand when a tainted superpageblock runs
+ * low on free space, rather than running on a fixed timer.
+ *
+ * Goals for tainted superpageblocks:
+ * - Restore the reserve of free pageblocks (SPB_TAINTED_RESERVE) while
+ *   movable pageblocks remain to evacuate
+ * - Skip superpageblocks with no movable pages (nothing to evacuate)
+ */
+
+/* Target free space: 3 pageblocks worth of free pages */
+#define SPB_DEFRAG_FREE_PAGES_TARGET (3UL * pageblock_nr_pages)
+
+/**
+ * spb_needs_defrag - Check if a superpageblock needs defragmentation
+ * @sb: superpageblock to check (may be NULL)
+ *
+ * Returns false for NULL or non-tainted superpageblocks.
+ * A tainted superpageblock needs defrag if it has movable pages that can
+ * be evacuated AND fewer free pageblocks than the tainted reserve
+ * (SPB_TAINTED_RESERVE).
+ */
+/*
+ * Cooldown between defrag attempts that made no progress, in seconds.
+ * Long enough to keep the allocator hot path quiet on saturated SBs;
+ * short enough that a freshly-freed pageblock isn't ignored for long.
+ */
+#define SPB_DEFRAG_NOOP_COOLDOWN_SECS 5
+
+static bool spb_needs_defrag(struct superpageblock *sb)
{
- struct evacuate_item *item = container_of(work, struct evacuate_item,
- work);
- evacuate_pageblock(item->zone, item->start_pfn);
- evacuate_item_free(item);
+ if (!sb)
+ return false;
+
+ if (spb_get_category(sb) != SB_TAINTED)
+ return false;
+
+ /*
+ * Back off if the previous pass made no progress: do not retry until
+ * either the cooldown elapses or free pages have grown by at least a
+ * pageblock's worth (a hint that there might be new material to
+ * consolidate or evacuate).
+ */
+ if (sb->defrag_last_no_progress_jiffies &&
+ time_before(jiffies, sb->defrag_last_no_progress_jiffies +
+ SPB_DEFRAG_NOOP_COOLDOWN_SECS * HZ) &&
+ sb->nr_free_pages < sb->defrag_last_no_progress_pages +
+ pageblock_nr_pages)
+ return false;
+
+ /*
+ * Tainted superpageblocks: evacuate movable pages to concentrate
+ * unmovable/reclaimable allocations. Migration targets are
+ * allocated system-wide, so no internal free space is needed.
+ * Maintain the tainted reserve so unmovable claims always
+ * find room in existing tainted superpageblocks.
+ */
+ return sb->nr_movable > 0 &&
+ sb->nr_free < SPB_TAINTED_RESERVE;
}
/**
- * evacuate_irq_work_fn - IRQ work callback to drain pending evacuations
- * @work: the irq_work embedded in pg_data_t
+ * spb_defrag_done - Check if defrag target has been reached
+ * @sb: superpageblock being defragmented
*
- * queue_work() can deadlock when called from inside the page allocator
- * because it may try to allocate memory with locks already held.
- * Use irq_work to defer the queue_work() calls to a safe context.
+ * Stop defragmenting when the superpageblock has enough free space
+ * or there are no more movable pages to evacuate.
*/
-static void evacuate_irq_work_fn(struct irq_work *work)
+static bool spb_defrag_done(struct superpageblock *sb)
{
- pg_data_t *pgdat = container_of(work, pg_data_t,
- evacuate_irq_work);
- struct llist_node *pending;
- struct evacuate_item *item, *next;
+ /*
+ * Tainted superpageblocks: keep evacuating movable pages until
+ * the reserve of free pageblocks is restored, or until there
+ * are no more movable pages to evacuate.
+ */
+ return !sb->nr_movable ||
+ sb->nr_free >= SPB_TAINTED_RESERVE;
+}
- if (!pgdat->evacuate_wq)
+/**
+ * spb_defrag_superpageblock - evacuate movable pages from a tainted superpageblock
+ * @sb: the tainted superpageblock to defragment
+ *
+ * Find any pageblock with movable pages (PB_has_movable) and evacuate
+ * them, leaving only unmovable, reclaimable, and free pages behind.
+ * Stop when the free space target is reached.
+ */
+static void spb_defrag_superpageblock(struct superpageblock *sb)
+{
+ unsigned long pfn, end_pfn;
+ struct zone *zone = sb->zone;
+
+ if (!sb->nr_movable)
return;
+ end_pfn = sb->start_pfn + SUPERPAGEBLOCK_NR_PAGES;
+
+ for (pfn = sb->start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ struct page *page;
+
+ if (spb_defrag_done(sb))
+ return;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ /* Skip pageblocks without movable pages */
+ if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+ continue;
+
+ /* Skip if fully free — nothing to evacuate */
+ if (get_pfnblock_bit(page, pfn, PB_all_free))
+ continue;
+
+ evacuate_pageblock(zone, pfn, true);
+ }
+}
+
+static void spb_defrag_work_fn(struct work_struct *work)
+{
+ struct superpageblock *sb = container_of(work, struct superpageblock,
+ defrag_work);
+ u16 nr_free_before = sb->nr_free;
+
+ spb_defrag_superpageblock(sb);
+
/*
- * Collect all pending items first, then queue them. Use _safe
- * because evacuate_work_fn() may run immediately on another
- * CPU and free the item before we follow the next pointer.
+ * If this pass produced no new free pageblocks, arm the no-progress
+ * cooldown so spb_needs_defrag() rejects re-arms until either time
+ * passes or nr_free_pages grows enough to suggest new material to
+ * work on. Use jiffies | 1 so the field is never accidentally zero.
*/
- pending = llist_del_all(&pgdat->evacuate_pending);
- llist_for_each_entry_safe(item, next, pending, free_node) {
- INIT_WORK(&item->work, evacuate_work_fn);
- queue_work(pgdat->evacuate_wq, &item->work);
+ if (sb->nr_free == nr_free_before) {
+ sb->defrag_last_no_progress_jiffies = jiffies | 1;
+ sb->defrag_last_no_progress_pages = sb->nr_free_pages;
+ } else {
+ sb->defrag_last_no_progress_jiffies = 0;
}
+
+ /* Allow new defrag requests for this superpageblock */
+ sb->defrag_active = false;
}
/**
- * queue_pageblock_evacuate - schedule async evacuation of movable pages
- * @zone: the zone containing the pageblock
- * @pfn: start PFN of the pageblock (must be pageblock-aligned)
+ * spb_defrag_irq_work_fn - IRQ work callback to safely queue defrag work
+ * @work: the irq_work embedded in struct superpageblock
*
- * Called from the page allocator when a movable pageblock is claimed
- * for unmovable or reclaimable allocations. Queues the pageblock for
- * background migration of its remaining movable pages. Uses irq_work
- * to defer the actual queue_work() call outside the allocator's lock
- * context.
+ * queue_work() can deadlock when called from inside the page allocator
+ * because it may try to allocate memory with locks already held.
+ * Use irq_work to defer the queue_work() call to a safe context.
*/
-static void queue_pageblock_evacuate(struct zone *zone, unsigned long pfn)
+static void spb_defrag_irq_work_fn(struct irq_work *work)
{
- struct evacuate_item *item;
- pg_data_t *pgdat = zone->zone_pgdat;
+ struct superpageblock *sb = container_of(work, struct superpageblock,
+ defrag_irq_work);
+ pg_data_t *pgdat = sb->zone->zone_pgdat;
+
+ if (pgdat->evacuate_wq)
+ queue_work(pgdat->evacuate_wq, &sb->defrag_work);
+}
- if (!pgdat->evacuate_irq_work.func)
+/**
+ * spb_maybe_start_defrag - Trigger defrag if a superpageblock needs it
+ * @sb: superpageblock whose counters just changed
+ *
+ * Called from counter update paths (under zone->lock). If the
+ * superpageblock is tainted and running low on free space, schedule
+ * irq_work to queue defrag work outside the allocator's lock context.
+ * The irq_work handler is set up by pageblock_evacuate_init();
+ * before that runs, defrag_irq_work.func is NULL and we skip.
+ */
+static void spb_maybe_start_defrag(struct superpageblock *sb)
+{
+ if (!spb_needs_defrag(sb))
return;
- item = evacuate_item_alloc();
- if (!item)
+ /* Don't pile up work items; one defrag pass per superpageblock at a time */
+ if (sb->defrag_active)
return;
- item->zone = zone;
- item->start_pfn = pfn;
- llist_add(&item->free_node, &pgdat->evacuate_pending);
- irq_work_queue(&pgdat->evacuate_irq_work);
+ if (sb->defrag_irq_work.func) {
+ sb->defrag_active = true;
+ irq_work_queue(&sb->defrag_irq_work);
+ }
}
static int __init pageblock_evacuate_init(void)
{
- int nid, i;
-
- /* Initialize the global freelist of work items */
- init_llist_head(&evacuate_freelist);
- for (i = 0; i < NR_EVACUATE_ITEMS; i++)
- llist_add(&evacuate_pool[i].free_node, &evacuate_freelist);
+ int nid;
/* Create a per-pgdat workqueue */
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
char name[32];
+ int z;
snprintf(name, sizeof(name), "kevacuate/%d", nid);
pgdat->evacuate_wq = alloc_workqueue(name, WQ_MEM_RECLAIM, 1);
@@ -8297,14 +8394,40 @@ static int __init pageblock_evacuate_init(void)
continue;
}
- init_llist_head(&pgdat->evacuate_pending);
- init_irq_work(&pgdat->evacuate_irq_work,
- evacuate_irq_work_fn);
+ /* Initialize per-superpageblock defrag work structs */
+ for (z = 0; z < MAX_NR_ZONES; z++) {
+ struct zone *zone = &pgdat->node_zones[z];
+ unsigned long j;
+
+ if (!zone->superpageblocks)
+ continue;
+
+ for (j = 0; j < zone->nr_superpageblocks; j++) {
+ INIT_WORK(&zone->superpageblocks[j].defrag_work,
+ spb_defrag_work_fn);
+ init_irq_work(&zone->superpageblocks[j].defrag_irq_work,
+ spb_defrag_irq_work_fn);
+ }
+ }
}
return 0;
}
late_initcall(pageblock_evacuate_init);
+
+/**
+ * init_superpageblock_defrag - initialize defrag work structs for a superpageblock
+ * @sb: superpageblock to initialize
+ *
+ * Called during boot from pageblock_evacuate_init() and during memory
+ * hotplug from resize_zone_superpageblocks(). Safe to call multiple times
+ * on the same superpageblock (reinitializes work structs).
+ */
+void init_superpageblock_defrag(struct superpageblock *sb)
+{
+ INIT_WORK(&sb->defrag_work, spb_defrag_work_fn);
+ init_irq_work(&sb->defrag_irq_work, spb_defrag_irq_work_fn);
+}
#endif /* CONFIG_COMPACTION */
#ifdef CONFIG_CONTIG_ALLOC
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [RFC PATCH 17/45] mm: page_alloc: add within-superpageblock compaction for clean superpageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (15 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 16/45] mm: page_alloc: add background superpageblock defragmentation worker Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 18/45] mm: page_alloc: superpageblock-aware contiguous and higher order allocation Rik van Riel
` (28 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Extend the superpageblock defragmentation framework to handle clean
superpageblocks in addition to tainted ones. While tainted superpageblock
defrag evacuates movable pages out to free up pageblocks, clean
superpageblock compaction migrates pages *within* the same superpageblock
to consolidate scattered free pages into whole free pageblocks.
The key components:
- spb_needs_defrag() and spb_defrag_done() now handle both categories: both
use the same nr_free < 2 and nr_free_pages thresholds, with tainted SBs
additionally checking nr_movable.
- spb_defrag_superpageblock() becomes a dispatcher that calls either
spb_defrag_tainted() (existing evacuation logic) or
spb_defrag_clean() (new compaction logic).
- spb_defrag_clean() scans pageblocks in the superpageblock,
skipping fully-free (PB_all_free) and PCP-owned pageblocks, and calls
compact_pageblock_in_spb() on candidates.
- compact_pageblock_in_spb() uses the same isolate/migrate loop pattern as
evacuate_pageblock(), but with a custom migration target allocator
(alloc_spb_compaction_target) that allocates pages exclusively from the
superpageblock's own free lists.
Also make the compaction code superpageblock-aware:
- Search per-superpageblock free lists instead of zone free lists for
migration targets, since with SPBs enabled all pages live on per-
 superpageblock free lists (the loop shape is sketched after this list).
- Fix PB_has_movable check for zones with non-aligned start PFNs by using
zone_start_pfn for pageblock boundary checks.
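The compaction hunks below repeatedly use the same per-superpageblock
iteration shape; stripped of scan limits and free-list reordering it is
roughly:

  unsigned long si;

  for (si = 0; ; si++) {
          struct free_area *area;

          if (zone->nr_superpageblocks) {
                  if (si >= zone->nr_superpageblocks)
                          break;
                  area = &zone->superpageblocks[si].free_area[order];
          } else {
                  if (si > 0)
                          break;
                  /* No SPBs set up: fall back to the zone free_area */
                  area = &zone->free_area[order];
          }
          /* ... examine area->free_list[MIGRATE_MOVABLE] ... */
  }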
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 272 ++++++++++++++++++++++----------
mm/page_alloc.c | 343 +++++++++++++++++++++++++++++++++++++----
3 files changed, 501 insertions(+), 115 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 61fe939e7c0f..ba6f08295ff9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -942,6 +942,7 @@ struct superpageblock {
struct work_struct defrag_work;
struct irq_work defrag_irq_work;
bool defrag_active;
+ unsigned long defrag_cursor;
/*
* Back-off state after a no-op defrag pass: defer the next attempt
* until either nr_free_pages has grown by at least pageblock_nr_pages
diff --git a/mm/compaction.c b/mm/compaction.c
index 88ba88340f3b..0e9b4b3ca59b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1321,9 +1321,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
* isolated (pinned, writeback, dirty, etc.), leave the
* flag set so a future migration attempt can try again.
*/
- if (!nr_isolated && !movable_skipped && valid_page)
- superpageblock_clear_has_movable(cc->zone,
- valid_page);
+ if (!nr_isolated && !movable_skipped && valid_page) {
+ unsigned long pb_pfn = pageblock_start_pfn(start_pfn);
+
+ /*
+ * start_pfn may not be pageblock-aligned when the
+ * zone start is not aligned (e.g. DMA zone at PFN 1).
+ * Skip the PB_has_movable update if the pageblock
+ * start falls below the zone.
+ */
+ if (pb_pfn >= cc->zone->zone_start_pfn)
+ superpageblock_clear_has_movable(cc->zone,
+ valid_page);
+ }
}
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
@@ -1577,45 +1587,70 @@ static void fast_isolate_freepages(struct compact_control *cc)
for (order = cc->search_order;
!page && order >= 0;
order = next_search_order(cc, order)) {
- struct free_area *area = &cc->zone->free_area[order];
- struct list_head *freelist;
- struct page *freepage;
+ struct list_head *freelist = NULL;
+ struct page *freepage = NULL;
unsigned long flags;
unsigned int order_scanned = 0;
unsigned long high_pfn = 0;
- if (!area->nr_free)
+ if (!cc->zone->free_area[order].nr_free)
continue;
spin_lock_irqsave(&cc->zone->lock, flags);
- freelist = &area->free_list[MIGRATE_MOVABLE];
- list_for_each_entry_reverse(freepage, freelist, buddy_list) {
- unsigned long pfn;
-
- order_scanned++;
- nr_scanned++;
- pfn = page_to_pfn(freepage);
-
- if (pfn >= highest)
- highest = max(pageblock_start_pfn(pfn),
- cc->zone->zone_start_pfn);
-
- if (pfn >= low_pfn) {
- cc->fast_search_fail = 0;
- cc->search_order = order;
- page = freepage;
- break;
- }
- if (pfn >= min_pfn && pfn > high_pfn) {
- high_pfn = pfn;
+ /*
+ * With superpageblocks, free pages live on per-SPB free
+ * lists rather than zone-level free lists. Iterate all
+ * SPBs to find candidate pages.
+ */
+ {
+ struct zone *zone = cc->zone;
+ unsigned long si, nr_spb = zone->nr_superpageblocks;
+
+ for (si = 0; !page && order_scanned < limit; si++) {
+ struct free_area *area;
+
+ if (nr_spb) {
+ if (si >= nr_spb)
+ break;
+ area = &zone->superpageblocks[si].free_area[order];
+ } else {
+ if (si > 0)
+ break;
+ area = &zone->free_area[order];
+ }
- /* Shorten the scan if a candidate is found */
- limit >>= 1;
+ freelist = &area->free_list[MIGRATE_MOVABLE];
+ list_for_each_entry_reverse(freepage,
+ freelist,
+ buddy_list) {
+ unsigned long pfn;
+
+ order_scanned++;
+ nr_scanned++;
+ pfn = page_to_pfn(freepage);
+
+ if (pfn >= highest)
+ highest = max(
+ pageblock_start_pfn(pfn),
+ zone->zone_start_pfn);
+
+ if (pfn >= low_pfn) {
+ cc->fast_search_fail = 0;
+ cc->search_order = order;
+ page = freepage;
+ break;
+ }
+
+ if (pfn >= min_pfn && pfn > high_pfn) {
+ high_pfn = pfn;
+ limit >>= 1;
+ }
+
+ if (order_scanned >= limit)
+ break;
+ }
}
-
- if (order_scanned >= limit)
- break;
}
/* Use a maximum candidate pfn if a preferred one was not found */
@@ -1624,10 +1659,24 @@ static void fast_isolate_freepages(struct compact_control *cc)
/* Update freepage for the list reorder below */
freepage = page;
+
+ /*
+ * high_pfn page may be on a different SPB's list
+ * than the last one scanned; fix up freelist.
+ */
+ if (cc->zone->nr_superpageblocks) {
+ struct superpageblock *sb;
+
+ sb = pfn_to_superpageblock(cc->zone,
+ high_pfn);
+ if (sb)
+ freelist = &sb->free_area[order].free_list[MIGRATE_MOVABLE];
+ }
}
/* Reorder to so a future search skips recent pages */
- move_freelist_head(freelist, freepage);
+ if (freelist && freepage)
+ move_freelist_head(freelist, freepage);
/* Isolate the page if available */
if (page) {
@@ -2021,47 +2070,77 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
for (order = cc->order - 1;
order >= PAGE_ALLOC_COSTLY_ORDER && !found_block && nr_scanned < limit;
order--) {
- struct free_area *area = &cc->zone->free_area[order];
- struct list_head *freelist;
unsigned long flags;
struct page *freepage;
- if (!area->nr_free)
+ if (!cc->zone->free_area[order].nr_free)
continue;
spin_lock_irqsave(&cc->zone->lock, flags);
- freelist = &area->free_list[MIGRATE_MOVABLE];
- list_for_each_entry(freepage, freelist, buddy_list) {
- unsigned long free_pfn;
- if (nr_scanned++ >= limit) {
- move_freelist_tail(freelist, freepage);
- break;
- }
+ /*
+ * With superpageblocks, free pages live on per-SPB free
+ * lists. Iterate all SPBs to find candidates.
+ */
+ {
+ struct zone *zone = cc->zone;
+ unsigned long si, nr_spb = zone->nr_superpageblocks;
+
+ for (si = 0; !found_block && nr_scanned < limit; si++) {
+ struct free_area *area;
+ struct list_head *freelist;
+
+ if (nr_spb) {
+ if (si >= nr_spb)
+ break;
+ area = &zone->superpageblocks[si].free_area[order];
+ } else {
+ if (si > 0)
+ break;
+ area = &zone->free_area[order];
+ }
- free_pfn = page_to_pfn(freepage);
- if (free_pfn < high_pfn) {
- /*
- * Avoid if skipped recently. Ideally it would
- * move to the tail but even safe iteration of
- * the list assumes an entry is deleted, not
- * reordered.
- */
- if (get_pageblock_skip(freepage))
- continue;
-
- /* Reorder to so a future search skips recent pages */
- move_freelist_tail(freelist, freepage);
-
- update_fast_start_pfn(cc, free_pfn);
- pfn = pageblock_start_pfn(free_pfn);
- if (pfn < cc->zone->zone_start_pfn)
- pfn = cc->zone->zone_start_pfn;
- cc->fast_search_fail = 0;
- found_block = true;
- break;
+ freelist = &area->free_list[MIGRATE_MOVABLE];
+ list_for_each_entry(freepage, freelist,
+ buddy_list) {
+ unsigned long free_pfn;
+
+ if (nr_scanned++ >= limit) {
+ move_freelist_tail(freelist,
+ freepage);
+ break;
+ }
+
+ free_pfn = page_to_pfn(freepage);
+ if (free_pfn < high_pfn) {
+ /*
+ * Avoid if skipped recently.
+ * Ideally it would move to
+ * the tail but even safe
+ * iteration of the list
+ * assumes an entry is deleted,
+ * not reordered.
+ */
+ if (get_pageblock_skip(freepage))
+ continue;
+
+ move_freelist_tail(freelist,
+ freepage);
+
+ update_fast_start_pfn(cc,
+ free_pfn);
+ pfn = pageblock_start_pfn(
+ free_pfn);
+ if (pfn < zone->zone_start_pfn)
+ pfn = zone->zone_start_pfn;
+ cc->fast_search_fail = 0;
+ found_block = true;
+ break;
+ }
+ }
}
}
+
spin_unlock_irqrestore(&cc->zone->lock, flags);
}
@@ -2348,32 +2427,57 @@ static enum compact_result __compact_finished(struct compact_control *cc)
/* Direct compactor: Is a suitable page free? */
ret = COMPACT_NO_SUITABLE_PAGE;
for (order = cc->order; order < NR_PAGE_ORDERS; order++) {
- struct free_area *area = &cc->zone->free_area[order];
+ struct zone *zone = cc->zone;
+ unsigned long si, nr_spb = zone->nr_superpageblocks;
- /* Job done if page is free of the right migratetype */
- if (!free_area_empty(area, migratetype))
- return COMPACT_SUCCESS;
+ /* Zone-level nr_free is maintained even with SPBs */
+ if (!zone->free_area[order].nr_free)
+ continue;
-#ifdef CONFIG_CMA
- /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
- if (migratetype == MIGRATE_MOVABLE &&
- !free_area_empty(area, MIGRATE_CMA))
- return COMPACT_SUCCESS;
-#endif
/*
- * Job done if allocation would steal freepages from
- * other migratetype buddy lists.
+ * With superpageblocks, free pages live on per-SPB free
+ * lists. Check all SPBs for a suitable page.
*/
- if (find_suitable_fallback(area, order, migratetype, true) >= 0)
+ for (si = 0; ; si++) {
+ struct free_area *area;
+
+ if (nr_spb) {
+ if (si >= nr_spb)
+ break;
+ area = &zone->superpageblocks[si].free_area[order];
+ } else {
+ if (si > 0)
+ break;
+ area = &zone->free_area[order];
+ }
+
+ /* Job done if page is free of the right migratetype */
+ if (!free_area_empty(area, migratetype))
+ return COMPACT_SUCCESS;
+
+#ifdef CONFIG_CMA
+ /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
+ if (migratetype == MIGRATE_MOVABLE &&
+ !free_area_empty(area, MIGRATE_CMA))
+ return COMPACT_SUCCESS;
+#endif
/*
- * Movable pages are OK in any pageblock. If we are
- * stealing for a non-movable allocation, make sure
- * we finish compacting the current pageblock first
- * (which is assured by the above migrate_pfn align
- * check) so it is as free as possible and we won't
- * have to steal another one soon.
+ * Job done if allocation would steal freepages from
+ * other migratetype buddy lists.
*/
- return COMPACT_SUCCESS;
+ if (find_suitable_fallback(area, order, migratetype,
+ true) >= 0)
+ /*
+ * Movable pages are OK in any pageblock. If we
+ * are stealing for a non-movable allocation,
+ * make sure we finish compacting the current
+ * pageblock first (which is assured by the
+ * above migrate_pfn align check) so it is as
+ * free as possible and we won't have to steal
+ * another one soon.
+ */
+ return COMPACT_SUCCESS;
+ }
}
out:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07d2926ffb3d..54b9a69bda10 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8199,17 +8199,23 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn,
* - Skip superpageblocks with no movable pages (nothing to evacuate)
*/
-/* Target free space: 3 pageblocks worth of free pages */
-#define SPB_DEFRAG_FREE_PAGES_TARGET (3UL * pageblock_nr_pages)
+/*
+ * Target free space for clean SPB internal compaction: at least a quarter
+ * of the superpageblock must be free before we attempt to consolidate
+ * scattered free pages into whole free pageblocks. Below this threshold
+ * the work-to-payoff ratio is poor — we walk the whole SPB and migrate
+ * a handful of pages without producing a usable free pageblock.
+ */
+#define SPB_DEFRAG_FREE_PAGES_TARGET (SUPERPAGEBLOCK_NR_PAGES / 4)
/**
* spb_needs_defrag - Check if a superpageblock needs defragmentation
* @sb: superpageblock to check (may be NULL)
*
- * Returns false for NULL, non-tainted, or clean superpageblocks.
- * A tainted superpageblock needs defrag if it has movable pages that can
- * be evacuated AND free space is running low (1 or fewer free
- * pageblocks, or less than 2 pageblocks worth of free pages).
+ * For tainted superpageblocks: defrag is needed when there are movable
+ * pageblocks that can be evacuated AND free space is running low.
+ * For clean superpageblocks: compaction is needed when free pages are
+ * scattered (plenty of free pages but few whole free pageblocks).
*/
/*
* Cooldown between defrag attempts that made no progress, in seconds.
@@ -8223,9 +8229,6 @@ static bool spb_needs_defrag(struct superpageblock *sb)
if (!sb)
return false;
- if (spb_get_category(sb) != SB_TAINTED)
- return false;
-
/*
* Back off if the previous pass made no progress: do not retry until
* either the cooldown elapses or free pages have grown by at least a
@@ -8246,16 +8249,30 @@ static bool spb_needs_defrag(struct superpageblock *sb)
* Maintain the tainted reserve so unmovable claims always
* find room in existing tainted superpageblocks.
*/
- return sb->nr_movable > 0 &&
- sb->nr_free < SPB_TAINTED_RESERVE;
+ if (spb_get_category(sb) == SB_TAINTED)
+ return sb->nr_movable > 0 &&
+ sb->nr_free < SPB_TAINTED_RESERVE;
+
+ /*
+ * Clean superpageblocks: compact scattered free pages into whole
+ * free pageblocks. Needs internal free space as destination.
+ */
+ if (sb->nr_free >= 2)
+ return false;
+
+ if (sb->nr_free_pages < SPB_DEFRAG_FREE_PAGES_TARGET)
+ return false;
+
+ return true;
}
/**
- * spb_defrag_done - Check if defrag target has been reached
+ * spb_defrag_done - Check if defrag/compaction should stop
* @sb: superpageblock being defragmented
*
- * Stop defragmenting when the superpageblock has enough free space
- * or there are no more movable pages to evacuate.
+ * Stop when the superpageblock has enough free pageblocks, when free
+ * pages drop too low to be worth continuing, or (for tainted
+ * superpageblocks) when there are no more movable pages to evacuate.
*/
static bool spb_defrag_done(struct superpageblock *sb)
{
@@ -8264,49 +8281,311 @@ static bool spb_defrag_done(struct superpageblock *sb)
* the reserve of free pageblocks is restored, or until there
* are no more movable pages to evacuate.
*/
- return !sb->nr_movable ||
- sb->nr_free >= SPB_TAINTED_RESERVE;
+ if (spb_get_category(sb) == SB_TAINTED)
+ return !sb->nr_movable ||
+ sb->nr_free >= SPB_TAINTED_RESERVE;
+
+ /* Clean superpageblocks: stop when enough free pageblocks exist */
+ if (sb->nr_free >= 2)
+ return true;
+
+ if (sb->nr_free_pages < SPB_DEFRAG_FREE_PAGES_TARGET)
+ return true;
+
+ return false;
+}
+
+static void spb_clear_skip_bits(struct superpageblock *sb)
+{
+ unsigned long pfn, end_pfn;
+ struct zone *zone = sb->zone;
+
+ end_pfn = sb->start_pfn + SUPERPAGEBLOCK_NR_PAGES;
+
+ for (pfn = sb->start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+ if (!zone_spans_pfn(zone, pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+ clear_pageblock_skip(page);
+ }
}
/**
- * spb_defrag_superpageblock - evacuate movable pages from a tainted superpageblock
+ * spb_defrag_tainted - evacuate movable pages from a tainted superpageblock
* @sb: the tainted superpageblock to defragment
*
* Find any pageblock with movable pages (PB_has_movable) and evacuate
* them, leaving only unmovable, reclaimable, and free pages behind.
* Stop when the free space target is reached.
*/
-static void spb_defrag_superpageblock(struct superpageblock *sb)
+static void spb_defrag_tainted(struct superpageblock *sb)
{
- unsigned long pfn, end_pfn;
+ unsigned long pfn, end_pfn, start_pfn, cursor;
struct zone *zone = sb->zone;
+ bool wrapped = false;
if (!sb->nr_movable)
return;
- end_pfn = sb->start_pfn + SUPERPAGEBLOCK_NR_PAGES;
+ start_pfn = sb->start_pfn;
+ end_pfn = start_pfn + SUPERPAGEBLOCK_NR_PAGES;
- for (pfn = sb->start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ cursor = sb->defrag_cursor;
+ if (cursor < start_pfn || cursor >= end_pfn) {
+ cursor = start_pfn;
+ spb_clear_skip_bits(sb);
+ }
+
+ pfn = cursor;
+
+ while (pfn < end_pfn) {
struct page *page;
if (spb_defrag_done(sb))
- return;
+ goto out;
if (!pfn_valid(pfn))
- continue;
+ goto next;
+
+ if (!zone_spans_pfn(zone, pfn))
+ goto next;
page = pfn_to_page(pfn);
- /* Skip pageblocks without movable pages */
if (!get_pfnblock_bit(page, pfn, PB_has_movable))
- continue;
+ goto next;
- /* Skip if fully free — nothing to evacuate */
if (get_pfnblock_bit(page, pfn, PB_all_free))
- continue;
+ goto next;
+
+ if (get_pageblock_skip(page))
+ goto next;
evacuate_pageblock(zone, pfn, true);
+next:
+ pfn += pageblock_nr_pages;
+ if (pfn >= end_pfn && !wrapped) {
+ spb_clear_skip_bits(sb);
+ pfn = start_pfn;
+ wrapped = true;
+ }
+ if (wrapped && pfn > cursor)
+ break;
+ }
+out:
+ sb->defrag_cursor = pfn;
+}
+
+/*
+ * Within-superpageblock compaction: migrate pages from partially-used
+ * pageblocks into free space within the same superpageblock, consolidating
+ * scattered free pages into whole free pageblocks.
+ */
+
+struct spb_compaction_control {
+ struct superpageblock *sb;
+ struct zone *zone;
+};
+
+/*
+ * alloc_spb_compaction_target - allocate a migration target page from
+ * within the same superpageblock's free lists.
+ *
+ * This is a custom migration target allocator that restricts allocations
+ * to the superpageblock being compacted, ensuring pages stay within the SB.
+ */
+static struct folio *alloc_spb_compaction_target(struct folio *src,
+ unsigned long private)
+{
+ struct spb_compaction_control *scc =
+ (struct spb_compaction_control *)private;
+ struct superpageblock *sb = scc->sb;
+ struct zone *zone = scc->zone;
+ int src_order = folio_order(src);
+ int order = src_order;
+ int migratetype = MIGRATE_MOVABLE;
+ struct free_area *area;
+ struct page *target;
+
+ spin_lock_irq(&zone->lock);
+
+ area = &sb->free_area[order];
+ target = get_page_from_free_area(area, migratetype);
+ if (!target) {
+ /* Try to split a higher-order block within this SB */
+ for (order = src_order + 1; order < NR_PAGE_ORDERS; order++) {
+ area = &sb->free_area[order];
+ target = get_page_from_free_area(area, migratetype);
+ if (target)
+ break;
+ }
+ }
+
+ if (target)
+ page_del_and_expand(zone, target, src_order, order, migratetype);
+
+ spin_unlock_irq(&zone->lock);
+
+ if (!target)
+ return NULL;
+
+ prep_new_page(target, src_order, __GFP_MOVABLE | __GFP_COMP, 0);
+ set_page_refcounted(target);
+ return page_rmappable_folio(target);
+}
+
+static void free_spb_compaction_target(struct folio *folio,
+ unsigned long private)
+{
+ folio_put(folio);
+}
+
+/*
+ * compact_pageblock_in_spb - migrate pages from a partially-used pageblock
+ * into free space within the same superpageblock.
+ *
+ * Similar to evacuate_pageblock() but uses the within-SB allocator
+ * so pages stay inside the superpageblock being compacted.
+ */
+static void compact_pageblock_in_spb(struct superpageblock *sb,
+ struct zone *zone,
+ unsigned long start_pfn)
+{
+ unsigned long end_pfn = start_pfn + pageblock_nr_pages;
+ unsigned long pfn = start_pfn;
+ int nr_reclaimed;
+ int ret = 0;
+ struct compact_control cc = {
+ .nr_migratepages = 0,
+ .order = -1,
+ .zone = zone,
+ .mode = MIGRATE_SYNC_LIGHT,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE,
+ };
+ struct spb_compaction_control scc = {
+ .sb = sb,
+ .zone = zone,
+ };
+
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ while (pfn < end_pfn || !list_empty(&cc.migratepages)) {
+ if (list_empty(&cc.migratepages)) {
+ cc.nr_migratepages = 0;
+ cc.migrate_pfn = pfn;
+ ret = isolate_migratepages_range(&cc, pfn, end_pfn);
+ if (ret && ret != -EAGAIN)
+ break;
+ pfn = cc.migrate_pfn;
+ if (list_empty(&cc.migratepages))
+ break;
+ }
+
+ nr_reclaimed = reclaim_clean_pages_from_list(zone,
+ &cc.migratepages);
+ cc.nr_migratepages -= nr_reclaimed;
+
+ if (!list_empty(&cc.migratepages)) {
+ ret = migrate_pages(&cc.migratepages,
+ alloc_spb_compaction_target,
+ free_spb_compaction_target,
+ (unsigned long)&scc, cc.mode,
+ MR_COMPACTION, NULL);
+ if (ret) {
+ putback_movable_pages(&cc.migratepages);
+ break;
+ }
+ }
+
+ cond_resched();
+ }
+
+ if (!list_empty(&cc.migratepages))
+ putback_movable_pages(&cc.migratepages);
+}
+
+/**
+ * spb_defrag_clean - compact a clean superpageblock internally
+ * @sb: the clean superpageblock to compact
+ *
+ * Scan pageblocks in the superpageblock looking for partially-used ones.
+ * Skip fully free pageblocks and pageblocks recently marked unsuitable
+ * by the pageblock_skip bit; PCPBuddy-cached pages within an otherwise
+ * compactable pageblock are skipped per-page by isolate_migratepages_block().
+ * Migrate pages from the best candidate into free space within the same
+ * superpageblock.
+ */
+static void spb_defrag_clean(struct superpageblock *sb)
+{
+ unsigned long pfn, end_pfn, start_pfn, cursor;
+ struct zone *zone = sb->zone;
+ bool wrapped = false;
+
+ start_pfn = sb->start_pfn;
+ end_pfn = start_pfn + SUPERPAGEBLOCK_NR_PAGES;
+
+ cursor = sb->defrag_cursor;
+ if (cursor < start_pfn || cursor >= end_pfn) {
+ cursor = start_pfn;
+ spb_clear_skip_bits(sb);
+ }
+
+ pfn = cursor;
+
+ while (pfn < end_pfn) {
+ struct page *page;
+
+ if (spb_defrag_done(sb))
+ goto out;
+
+ if (!pfn_valid(pfn))
+ goto next;
+
+ if (!zone_spans_pfn(zone, pfn))
+ goto next;
+
+ page = pfn_to_page(pfn);
+
+ if (get_pfnblock_bit(page, pfn, PB_all_free))
+ goto next;
+
+ if (get_pageblock_skip(page))
+ goto next;
+
+ compact_pageblock_in_spb(sb, zone, pfn);
+next:
+ pfn += pageblock_nr_pages;
+ if (pfn >= end_pfn && !wrapped) {
+ spb_clear_skip_bits(sb);
+ pfn = start_pfn;
+ wrapped = true;
+ }
+ if (wrapped && pfn > cursor)
+ break;
}
+out:
+ sb->defrag_cursor = pfn;
+}
+
+/**
+ * spb_defrag_superpageblock - defragment a superpageblock
+ * @sb: the superpageblock to defragment
+ *
+ * Dispatch to the appropriate defrag strategy based on superpageblock
+ * category: evacuate movable pages from tainted superpageblocks, or
+ * compact scattered free pages within clean superpageblocks.
+ */
+static void spb_defrag_superpageblock(struct superpageblock *sb)
+{
+ if (spb_get_category(sb) == SB_TAINTED)
+ spb_defrag_tainted(sb);
+ else
+ spb_defrag_clean(sb);
}
static void spb_defrag_work_fn(struct work_struct *work)
@@ -8357,10 +8636,12 @@ static void spb_defrag_irq_work_fn(struct irq_work *work)
* @sb: superpageblock whose counters just changed
*
* Called from counter update paths (under zone->lock). If the
- * superpageblock is tainted and running low on free space, schedule
- * irq_work to queue defrag work outside the allocator's lock context.
- * The irq_work handler is set up by pageblock_evacuate_init();
- * before that runs, defrag_irq_work.func is NULL and we skip.
+ * superpageblock needs defragmentation — either evacuation of movable
+ * pages from a tainted superpageblock, or internal compaction of a
+ * clean superpageblock — schedule irq_work to queue defrag work outside
+ * the allocator's lock context. The irq_work handler is set up by
+ * pageblock_evacuate_init(); before that runs, defrag_irq_work.func
+ * is NULL and we skip.
*/
static void spb_maybe_start_defrag(struct superpageblock *sb)
{
--
2.52.0
* [RFC PATCH 18/45] mm: page_alloc: superpageblock-aware contiguous and higher order allocation
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (16 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 17/45] mm: page_alloc: add within-superpageblock compaction for clean superpageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 19/45] mm: page_alloc: prevent atomic allocations from tainting clean SPBs Rik van Riel
` (27 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Add superpageblock-aware contiguous page allocation that leverages SPB
metadata to find ranges of empty (all-free) and clean superpageblocks,
instead of scanning all memory with alloc_contig_range(). The SPB metadata
identifies exactly which 1GB regions contain only free or movable pages,
making CMA and large contiguous allocations more targeted.
Track contiguous allocations in superpageblock metadata by marking
fully-covered SPBs with contig_allocated and moving them to the
spb_isolated list so they don't participate in allocation steering. The
marking loop uses ALIGN(start, spb_pages) so that only fully-covered SPBs
are marked when the allocation boundaries are not SPB-aligned.
Hook superpageblock-aware allocation into __alloc_pages_direct_compact()
for THP/mTHP and high-order unmovable/reclaimable allocations. For movable
allocations at pageblock_order or above, try spb_try_alloc_contig() first.
For unmovable/reclaimable, evacuate movable pages from tainted
superpageblocks to create buddy coalescing opportunities. Both paths fall
through to traditional compaction if the SPB approach fails.
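For illustration only (not part of the patch), here is a minimal user-space
model of the coverage rule used by superpageblock_contig_mark(); the SPB
size, helper name and printout are stand-ins:

#include <stdio.h>

#define SPB_PAGES 262144UL /* illustrative: 1GB of 4KB pages */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Mark only the superpageblocks fully covered by [start, end). */
static void mark_covered_spbs(unsigned long start, unsigned long end)
{
        unsigned long pfn;

        if (end - start < SPB_PAGES)
                return; /* only full-SPB allocations are tracked */

        for (pfn = ALIGN_UP(start, SPB_PAGES); pfn + SPB_PAGES <= end;
             pfn += SPB_PAGES)
                printf("mark SPB at pfn 0x%lx\n", pfn);
}

int main(void)
{
        /* an unaligned range spanning two SPB lengths covers one whole SPB */
        mark_covered_spbs(100000, 100000 + 2 * SPB_PAGES);
        return 0;
}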
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 2 +
mm/mm_init.c | 1 +
mm/page_alloc.c | 452 ++++++++++++++++++++++++++++++++++++++++-
3 files changed, 450 insertions(+), 5 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ba6f08295ff9..765e1c5dc365 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -923,6 +923,7 @@ struct superpageblock {
u16 nr_movable;
u16 nr_reserved; /* holes, firmware, etc. */
u16 total_pageblocks; /* zone-clipped total */
+ bool contig_allocated; /* all pages held by contig alloc */
/* Total free pages across all per-superpageblock free lists */
unsigned long nr_free_pages;
@@ -1010,6 +1011,7 @@ struct zone {
/* Superpageblock fullness lists for allocation steering */
struct list_head spb_empty; /* completely free superpageblocks */
+ struct list_head spb_isolated; /* fully isolated (1GB contig alloc) */
struct list_head spb_lists[__NR_SB_CATEGORIES][__NR_SB_FULLNESS];
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1f55ff3126a2..8e3c64d37254 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1601,6 +1601,7 @@ static void __init setup_superpageblocks(struct zone *zone)
/* Fullness lists steer allocations to preferred superpageblocks */
INIT_LIST_HEAD(&zone->spb_empty);
+ INIT_LIST_HEAD(&zone->spb_isolated);
for (cat = 0; cat < __NR_SB_CATEGORIES; cat++)
for (full = 0; full < __NR_SB_FULLNESS; full++)
INIT_LIST_HEAD(&zone->spb_lists[cat][full]);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 54b9a69bda10..8ce96db50c2f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -754,8 +754,26 @@ static inline enum sb_fullness sb_get_fullness(struct superpageblock *sb,
*/
#ifdef CONFIG_COMPACTION
static void spb_maybe_start_defrag(struct superpageblock *sb);
+static bool spb_needs_defrag(struct superpageblock *sb);
+static struct page *spb_try_alloc_contig(struct zone *zone,
+ unsigned long nr_pages,
+ gfp_t gfp_mask);
+static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
+ int migratetype);
#else
static inline void spb_maybe_start_defrag(struct superpageblock *sb) {}
+static inline bool spb_needs_defrag(struct superpageblock *sb) { return false; }
+static inline struct page *spb_try_alloc_contig(struct zone *zone,
+ unsigned long nr_pages,
+ gfp_t gfp_mask)
+{
+ return NULL;
+}
+static inline bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ return false;
+}
#endif
static void spb_update_list(struct superpageblock *sb)
@@ -766,6 +784,11 @@ static void spb_update_list(struct superpageblock *sb)
list_del_init(&sb->list);
+ if (sb->contig_allocated) {
+ list_add_tail(&sb->list, &zone->spb_isolated);
+ return;
+ }
+
if (sb->nr_free == sb->total_pageblocks) {
list_add_tail(&sb->list, &zone->spb_empty);
return;
@@ -916,6 +939,45 @@ void __meminit init_pageblock_migratetype(struct page *page,
}
}
+#ifdef CONFIG_CONTIG_ALLOC
+/**
+ * superpageblock_contig_mark - Mark/unmark SPBs for contiguous allocation
+ * @start: start PFN of the contiguous range
+ * @end: end PFN (exclusive) of the contiguous range
+ * @allocated: true when allocated, false when freed
+ *
+ * Called after a successful contiguous allocation (or before freeing) to
+ * mark fully-covered superpageblocks as contig_allocated. This moves them
+ * to the spb_isolated list so they don't participate in allocation steering,
+ * and makes them visible in debugfs.
+ */
+static void superpageblock_contig_mark(unsigned long start, unsigned long end,
+ bool allocated)
+{
+ struct zone *zone = page_zone(pfn_to_page(start));
+ unsigned long spb_pages = SUPERPAGEBLOCK_NR_PAGES;
+ unsigned long pfn;
+ unsigned long flags;
+
+ /* Only track full-SPB contiguous allocations */
+ if (end - start < spb_pages)
+ return;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ for (pfn = ALIGN(start, spb_pages); pfn + spb_pages <= end;
+ pfn += spb_pages) {
+ struct superpageblock *sb = pfn_to_superpageblock(zone, pfn);
+
+ if (!sb)
+ continue;
+
+ sb->contig_allocated = allocated;
+ spb_update_list(sb);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+#endif /* CONFIG_CONTIG_ALLOC */
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -4240,6 +4302,17 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
void free_frozen_pages(struct page *page, unsigned int order)
{
+#ifdef CONFIG_CONTIG_ALLOC
+ /*
+ * If freeing a superpageblock-sized (or larger) range, clear the
+ * contig_allocated flag so the SPB returns to normal allocation.
+ */
+ if (order >= SUPERPAGEBLOCK_ORDER) {
+ unsigned long pfn = page_to_pfn(page);
+
+ superpageblock_contig_mark(pfn, pfn + (1UL << order), false);
+ }
+#endif
__free_frozen_pages(page, order, FPI_NONE);
}
@@ -5408,6 +5481,60 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
if (!order)
return NULL;
+ /*
+ * Superpageblock-aware contiguous allocation for movable high-order
+ * allocations. Use superpageblock metadata to find clean ranges and
+ * evacuate them via alloc_contig_frozen_range, bypassing the
+ * blind compaction scanner entirely.
+ */
+ if (order >= pageblock_order &&
+ ac->migratetype == MIGRATE_MOVABLE) {
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
+ ac->highest_zoneidx,
+ ac->nodemask) {
+ page = spb_try_alloc_contig(zone, 1UL << order,
+ gfp_mask);
+ if (page) {
+ prep_new_page(page, order, gfp_mask,
+ alloc_flags);
+ *compact_result = COMPACT_SUCCESS;
+ count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+ }
+ }
+
+ /*
+ * Superpageblock-aware targeted evacuation for unmovable/reclaimable
+ * high-order allocations. Instead of blind compaction, find
+ * pageblocks of the right migratetype in tainted superpageblocks
+ * and evacuate their movable pages to create buddy coalescing
+ * opportunities.
+ */
+ if (ac->migratetype == MIGRATE_UNMOVABLE ||
+ ac->migratetype == MIGRATE_RECLAIMABLE) {
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
+ ac->highest_zoneidx,
+ ac->nodemask) {
+ if (spb_evacuate_for_order(zone, order,
+ ac->migratetype)) {
+ page = get_page_from_freelist(gfp_mask, order,
+ alloc_flags, ac);
+ if (page) {
+ *compact_result = COMPACT_SUCCESS;
+ count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+ }
+ }
+ }
+
psi_memstall_enter(&pflags);
delayacct_compact_start();
noreclaim_flag = memalloc_noreclaim_save();
@@ -9011,6 +9138,8 @@ int alloc_contig_frozen_range_noprof(unsigned long start, unsigned long end,
}
done:
undo_isolate_page_range(start, end);
+ if (!ret)
+ superpageblock_contig_mark(start, end, true);
return ret;
}
EXPORT_SYMBOL(alloc_contig_frozen_range_noprof);
@@ -9105,6 +9234,279 @@ static bool zone_spans_last_pfn(const struct zone *zone,
return zone_spans_pfn(zone, last_pfn);
}
+/*
+ * Maximum superpageblock candidates to collect for contiguous allocation.
+ * Collected under zone->lock, then tried without it.
+ */
+#define SPB_CONTIG_MAX_CANDIDATES 4
+
+#ifdef CONFIG_COMPACTION
+/**
+ * sb_collect_contig_candidates - Find superpageblock ranges for contiguous alloc
+ * @zone: zone to search (must hold zone->lock)
+ * @nr_pages: number of contiguous pages needed
+ * @pfns: output array of candidate start PFNs
+ * @max: maximum candidates to collect
+ *
+ * For superpageblock-sized (1GB) allocations:
+ * 1. Empty superpageblocks first — no evacuation needed
+ * 2. Clean superpageblocks from almost-empty to full — less evacuation work
+ *
+ * For pageblock-sized (2MB+) sub-superpageblock allocations:
+ * 1. Clean superpageblocks from fullest to almost-empty — pack allocations
+ * to preserve empty superpageblocks for 1GB
+ * 2. Empty superpageblocks as last resort
+ *
+ * Returns number of candidates found.
+ */
+static int sb_collect_contig_candidates(struct zone *zone,
+ unsigned long nr_pages,
+ unsigned long *pfns, int max)
+{
+ struct superpageblock *sb;
+ int full, n = 0;
+
+ lockdep_assert_held(&zone->lock);
+
+ if (nr_pages >= SUPERPAGEBLOCK_NR_PAGES) {
+ /* 1GB+: empty superpageblocks first (no evacuation needed) */
+ list_for_each_entry(sb, &zone->spb_empty, list) {
+ if (sb->total_pageblocks < SUPERPAGEBLOCK_NR_PAGEBLOCKS)
+ continue;
+ pfns[n++] = sb->start_pfn;
+ if (n >= max)
+ return n;
+ }
+ /* Then clean superpageblocks, almost-empty first (less work) */
+ for (full = __NR_SB_FULLNESS - 1; full >= 0; full--) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_CLEAN][full],
+ list) {
+ if (sb->total_pageblocks <
+ SUPERPAGEBLOCK_NR_PAGEBLOCKS)
+ continue;
+ pfns[n++] = sb->start_pfn;
+ if (n >= max)
+ return n;
+ }
+ }
+ return n;
+ }
+
+ /*
+ * 2MB+ sub-superpageblock allocations.
+ * Walk clean superpageblocks fullest-first — pack allocations into
+ * partial superpageblocks to preserve empty ones for 1GB use.
+ * Pick one candidate per superpageblock for diversity.
+ */
+ for (full = SB_FULL_75; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb, &zone->spb_lists[SB_CLEAN][full], list) {
+ unsigned long pfn, sb_end;
+
+ sb_end = sb->start_pfn +
+ (unsigned long)sb->total_pageblocks *
+ pageblock_nr_pages;
+ pfn = ALIGN(sb->start_pfn, nr_pages);
+
+ if (pfn + nr_pages <= sb_end) {
+ pfns[n++] = pfn;
+ if (n >= max)
+ return n;
+ }
+ }
+ }
+ /* Empty superpageblocks as last resort for 2MB */
+ list_for_each_entry(sb, &zone->spb_empty, list) {
+ unsigned long pfn = ALIGN(sb->start_pfn, nr_pages);
+ unsigned long sb_end = sb->start_pfn +
+ (unsigned long)sb->total_pageblocks *
+ pageblock_nr_pages;
+
+ if (pfn + nr_pages <= sb_end) {
+ pfns[n++] = pfn;
+ if (n >= max)
+ return n;
+ }
+ }
+ return n;
+}
+
+/**
+ * spb_try_alloc_contig - Superpageblock-aware contiguous page allocation
+ * @zone: zone to allocate from
+ * @nr_pages: number of contiguous pages needed (>= pageblock_nr_pages)
+ * @gfp_mask: GFP mask for allocation
+ *
+ * Use superpageblock metadata to quickly find suitable ranges for contiguous
+ * allocation, avoiding the brute-force PFN scan. Each candidate is tried
+ * twice to handle transient failures (e.g., temporary page pins, racing
+ * allocations) before moving on to the next candidate.
+ *
+ * Returns: page pointer on success, NULL on failure.
+ */
+static struct page *spb_try_alloc_contig(struct zone *zone,
+ unsigned long nr_pages,
+ gfp_t gfp_mask)
+{
+ unsigned long pfns[SPB_CONTIG_MAX_CANDIDATES];
+ unsigned long flags;
+ int nr_candidates, i;
+
+ if (nr_pages < pageblock_nr_pages)
+ return NULL;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ nr_candidates = sb_collect_contig_candidates(zone, nr_pages,
+ pfns,
+ SPB_CONTIG_MAX_CANDIDATES);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ for (i = 0; i < nr_candidates; i++) {
+ int attempts;
+
+ for (attempts = 0; attempts < 2; attempts++) {
+ int ret;
+
+ ret = alloc_contig_frozen_range_noprof(pfns[i],
+ pfns[i] + nr_pages,
+ ACR_FLAGS_NONE, gfp_mask);
+ if (!ret)
+ return pfn_to_page(pfns[i]);
+ }
+
+ /*
+ * Failed on this candidate — rotate its superpageblock to the
+ * tail of its list so the next call tries fresh candidates.
+ */
+ spin_lock_irqsave(&zone->lock, flags);
+ {
+ struct superpageblock *sb =
+ pfn_to_superpageblock(zone, pfns[i]);
+ if (sb) {
+ struct list_head *head;
+
+ if (sb->nr_free == sb->total_pageblocks)
+ head = &zone->spb_empty;
+ else
+ head = &zone->spb_lists
+ [spb_get_category(sb)]
+ [sb_get_fullness(sb, spb_get_category(sb))];
+ list_move_tail(&sb->list, head);
+ }
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+ return NULL;
+}
+
+/**
+ * sb_collect_evacuate_candidates - Find pageblocks for targeted evacuation
+ * @zone: zone to search (must hold zone->lock)
+ * @migratetype: desired migratetype (MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE)
+ * @sb_pfns: output array of tainted superpageblock start PFNs
+ * @max: maximum candidates to collect
+ *
+ * Find tainted superpageblocks containing pageblocks of the desired migratetype
+ * that also have movable pages to evacuate. Evacuating movable pages from
+ * these pageblocks creates buddy coalescing opportunities for high-order
+ * allocations of the desired migratetype.
+ *
+ * Returns number of candidate superpageblock PFNs found.
+ */
+static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
+ unsigned long *sb_pfns, int max)
+{
+ struct superpageblock *sb;
+ int full, n = 0;
+
+ lockdep_assert_held(&zone->lock);
+
+ for (full = 0; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb, &zone->spb_lists[SB_TAINTED][full],
+ list) {
+ bool has_matching;
+
+ if (!sb->nr_movable)
+ continue;
+
+ if (migratetype == MIGRATE_UNMOVABLE)
+ has_matching = sb->nr_unmovable > 0;
+ else if (migratetype == MIGRATE_RECLAIMABLE)
+ has_matching = sb->nr_reclaimable > 0;
+ else
+ continue;
+
+ if (!has_matching)
+ continue;
+
+ sb_pfns[n++] = sb->start_pfn;
+ if (n >= max)
+ return n;
+ }
+ }
+ return n;
+}
+
+/**
+ * spb_evacuate_for_order - Targeted evacuation of movable pages from
+ * unmovable/reclaimable pageblocks
+ * @zone: zone to work on
+ * @order: allocation order that failed
+ * @migratetype: desired migratetype (MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE)
+ *
+ * Instead of blind compaction, use superpageblock metadata to find pageblocks
+ * of the right migratetype in tainted superpageblocks and evacuate their
+ * movable pages. This creates buddy coalescing opportunities within
+ * the pageblock, enabling higher-order allocations.
+ *
+ * Returns true if evacuation was performed (caller should retry allocation).
+ */
+static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ unsigned long sb_pfns[SPB_CONTIG_MAX_CANDIDATES];
+ unsigned long flags;
+ int nr_sbs, i;
+ bool did_evacuate = false;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ nr_sbs = sb_collect_evacuate_candidates(zone, migratetype,
+ sb_pfns,
+ SPB_CONTIG_MAX_CANDIDATES);
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ for (i = 0; i < nr_sbs && !did_evacuate; i++) {
+ unsigned long pfn, end_pfn;
+
+ end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
+ for (pfn = sb_pfns[i]; pfn < end_pfn;
+ pfn += pageblock_nr_pages) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ /* Superpageblocks can straddle zone boundaries. */
+ if (!zone_spans_pfn(zone, pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ if (get_pfnblock_migratetype(page, pfn) != migratetype)
+ continue;
+
+ if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+ continue;
+
+ evacuate_pageblock(zone, pfn, true);
+ did_evacuate = true;
+ break;
+ }
+ }
+ return did_evacuate;
+}
+#endif /* CONFIG_COMPACTION */
+
/**
* alloc_contig_frozen_pages() -- tries to find and allocate contiguous range of frozen pages
* @nr_pages: Number of contiguous pages to allocate
@@ -9138,9 +9540,29 @@ struct page *alloc_contig_frozen_pages_noprof(unsigned long nr_pages,
struct zonelist *zonelist;
struct zone *zone;
struct zoneref *z;
+ struct page *page;
bool skip_hugetlb = true;
bool skipped_hugetlb = false;
+ /*
+ * First pass: superpageblock-aware search. Use superpageblock metadata
+ * to quickly find suitable ranges, avoiding the brute-force PFN
+ * scan. For 1GB allocations this walks spb_empty then
+ * spb_lists[SB_CLEAN]; for 2MB+ it finds evacuatable pageblocks
+ * in clean superpageblocks.
+ */
+ if (nr_pages >= pageblock_nr_pages) {
+ zonelist = node_zonelist(nid, gfp_mask);
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_zone(gfp_mask), nodemask) {
+ page = spb_try_alloc_contig(zone, nr_pages, gfp_mask);
+ if (page)
+ return page;
+ }
+ }
+
+ /* Second pass: brute-force PFN scan (existing fallback) */
+
retry:
zonelist = node_zonelist(nid, gfp_mask);
for_each_zone_zonelist_nodemask(zone, z, zonelist,
@@ -9235,6 +9657,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages)
if (WARN_ON_ONCE(first_page != compound_head(first_page)))
return;
+ superpageblock_contig_mark(pfn, pfn + nr_pages, false);
+
if (PageHead(first_page)) {
WARN_ON_ONCE(order != compound_order(first_page));
free_frozen_pages(first_page, order);
@@ -9254,9 +9678,13 @@ EXPORT_SYMBOL(free_contig_frozen_range);
*/
void free_contig_range(unsigned long pfn, unsigned long nr_pages)
{
+ unsigned long end = pfn + nr_pages;
+
if (WARN_ON_ONCE(PageHead(pfn_to_page(pfn))))
return;
+ superpageblock_contig_mark(pfn, end, false);
+
for (; nr_pages--; pfn++)
__free_page(pfn_to_page(pfn));
}
@@ -9794,6 +10222,15 @@ static int superpageblock_debugfs_show(struct seq_file *m, void *v)
if (empty_count)
seq_printf(m, " empty: %d\n", empty_count);
+ {
+ int isolated_count = 0;
+
+ list_for_each_entry(sb, &zone->spb_isolated, list)
+ isolated_count++;
+ if (isolated_count)
+ seq_printf(m, " contig_alloc: %d\n", isolated_count);
+ }
+
for (cat = 0; cat < __NR_SB_CATEGORIES; cat++) {
for (full = 0; full < __NR_SB_FULLNESS; full++) {
int count = 0;
@@ -9812,11 +10249,16 @@ static int superpageblock_debugfs_show(struct seq_file *m, void *v)
/* Per-superpageblock detail */
for (i = 0; i < zone->nr_superpageblocks; i++) {
sb = &zone->superpageblocks[i];
- seq_printf(m, " sb[%lu] pfn=0x%lx: unmov=%u recl=%u mov=%u rsv=%u free=%u total=%u\n",
- i, sb->start_pfn,
- sb->nr_unmovable, sb->nr_reclaimable,
- sb->nr_movable, sb->nr_reserved,
- sb->nr_free, sb->total_pageblocks);
+ if (sb->contig_allocated)
+ seq_printf(m, " sb[%lu] pfn=0x%lx: contig_allocated total=%u\n",
+ i, sb->start_pfn,
+ sb->total_pageblocks);
+ else
+ seq_printf(m, " sb[%lu] pfn=0x%lx: unmov=%u recl=%u mov=%u rsv=%u free=%u total=%u\n",
+ i, sb->start_pfn,
+ sb->nr_unmovable, sb->nr_reclaimable,
+ sb->nr_movable, sb->nr_reserved,
+ sb->nr_free, sb->total_pageblocks);
}
}
return 0;
--
2.52.0
* [RFC PATCH 19/45] mm: page_alloc: prevent atomic allocations from tainting clean SPBs
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (17 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 18/45] mm: page_alloc: superpageblock-aware contiguous and higher order allocation Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 20/45] mm: page_alloc: aggressively pack non-movable allocations in tainted SPBs on large systems Rik van Riel
` (26 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Non-DIRECT_RECLAIM (atomic) allocations that fail with ALLOC_NOFRAGMENT
previously dropped the flag entirely and retried, allowing them to taint
clean superpageblocks. This was the primary source of taint spreading
observed on production systems.
Two changes to keep atomic allocations within tainted SPBs:
1. Extend Pass 2 in __rmqueue_smallest with a sub-pageblock phase (Pass
2b). The original Pass 2 only finds whole free pageblocks (>= pageblock
order) in tainted SPBs. Pass 2b searches for sub-pageblock-order free
blocks and uses try_to_claim_block to claim the pageblock if it has
enough compatible pages. This finds pages in tainted SPBs that have
fragmented free space but no whole free pageblocks.
2. Add ALLOC_NOFRAG_TAINTED_OK intermediate flag. Instead of going
directly from ALLOC_NOFRAGMENT to no protection, atomic allocations
first try with ALLOC_NOFRAG_TAINTED_OK which allows __rmqueue_steal
to search tainted SPBs only. Clean/empty SPBs remain protected. Only
if steal from tainted SPBs also fails is ALLOC_NOFRAGMENT fully
dropped as a last resort.
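As an illustration of the intended escalation order for atomic allocations,
a small user-space model (not the kernel code; try_zones() stands in for the
zonelist scan):

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_NOFRAGMENT        0x1
#define ALLOC_NOFRAG_TAINTED_OK 0x2

/* stand-in for the zonelist scan; pretend it keeps failing */
static bool try_zones(unsigned int alloc_flags)
{
        return false;
}

static bool atomic_alloc(void)
{
        unsigned int flags = ALLOC_NOFRAGMENT;

        if (try_zones(flags))
                return true;
        /* step 1: keep protecting clean SPBs, allow stealing from tainted */
        if (try_zones(flags | ALLOC_NOFRAG_TAINTED_OK))
                return true;
        /* step 2: last resort, drop the protection entirely */
        return try_zones(flags & ~ALLOC_NOFRAGMENT);
}

int main(void)
{
        printf("allocated: %d\n", atomic_alloc());
        return 0;
}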
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/internal.h | 1 +
mm/page_alloc.c | 87 +++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 81 insertions(+), 7 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 02f1c7d36b85..f641795688af 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1413,6 +1413,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_HIGHATOMIC 0x200 /* Allows access to MIGRATE_HIGHATOMIC */
#define ALLOC_TRYLOCK 0x400 /* Only use spin_trylock in allocation path */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_NOFRAG_TAINTED_OK 0x1000 /* NOFRAGMENT, but allow steal from tainted SPBs */
/* Flags that allow allocations below the min watermark. */
#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ce96db50c2f..13bc57592cd5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2713,6 +2713,9 @@ static struct page *__rmqueue_from_sb(struct zone *zone, unsigned int order,
*/
static struct page *claim_whole_block(struct zone *zone, struct page *page,
int current_order, int order, int new_type, int old_type);
+static struct page *try_to_claim_block(struct zone *zone, struct page *page,
+ int current_order, int order, int start_type,
+ int block_type, unsigned int alloc_flags);
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
@@ -2782,6 +2785,11 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* free list (reset by mark_pageblock_free), so the search above
* misses them. Claim them inline to keep non-movable allocations
* concentrated in already-tainted superpageblocks.
+ *
+ * Try whole pageblock orders first (preferred for PCP buddy optimization),
+ * then fall back to sub-pageblock orders. Sub-pageblock claiming uses
+ * try_to_claim_block which checks whether the pageblock has enough
+ * compatible pages to justify claiming it.
*/
if (!movable && !is_migrate_cma(migratetype)) {
for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
@@ -2814,6 +2822,43 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
}
+ /* Pass 2b: sub-pageblock orders in tainted SPBs */
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ int co;
+
+ if (!sb->nr_free_pages)
+ continue;
+ for (co = min_t(int, pageblock_order - 1,
+ NR_PAGE_ORDERS - 1);
+ co >= (int)order;
+ --co) {
+ current_order = co;
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, MIGRATE_MOVABLE);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page = try_to_claim_block(zone, page,
+ current_order, order,
+ migratetype, MIGRATE_MOVABLE,
+ 0);
+ if (!page)
+ continue;
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
}
/* Empty superpageblocks: try before falling back to non-preferred category */
@@ -3566,12 +3611,23 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
* the block as its current migratetype, potentially causing fragmentation.
*/
static __always_inline struct page *
-__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype,
+ unsigned int alloc_flags)
{
struct superpageblock *sb;
int current_order;
struct page *page;
int fallback_mt;
+ unsigned int search_cats;
+
+ /*
+ * When ALLOC_NOFRAG_TAINTED_OK is set, only steal from tainted
+ * SPBs to avoid tainting clean ones. Otherwise search all categories.
+ */
+ if (alloc_flags & ALLOC_NOFRAG_TAINTED_OK)
+ search_cats = SB_SEARCH_PREFERRED;
+ else
+ search_cats = SB_SEARCH_PREFERRED | SB_SEARCH_FALLBACK;
/*
* Search per-superpageblock free lists for fallback migratetypes.
@@ -3581,7 +3637,7 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
page = __rmqueue_sb_find_fallback(zone, current_order,
start_migratetype,
&fallback_mt,
- SB_SEARCH_PREFERRED | SB_SEARCH_FALLBACK);
+ search_cats);
if (!page)
continue;
@@ -3681,8 +3737,10 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
}
fallthrough;
case RMQUEUE_STEAL:
- if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
- page = __rmqueue_steal(zone, order, migratetype);
+ if (!(alloc_flags & ALLOC_NOFRAGMENT) ||
+ (alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
+ page = __rmqueue_steal(zone, order, migratetype,
+ alloc_flags);
if (page) {
*mode = RMQUEUE_STEAL;
return page;
@@ -5301,9 +5359,24 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
/*
* It's possible on a UMA machine to get through all zones that are
* fragmented. If avoiding fragmentation, reset and try again.
- */
- if (no_fallback && !defrag_mode) {
- alloc_flags &= ~ALLOC_NOFRAGMENT;
+ *
+ * For allocations that can do direct reclaim, keep NOFRAGMENT set
+ * and let the slowpath try reclaim and compaction to free pages in
+ * already-tainted superpageblocks before allowing clean SPBs to be
+ * tainted.
+ *
+ * Atomic allocations cannot reclaim, but try an intermediate step
+ * first: allow steal/claim from tainted SPBs only. This avoids
+ * tainting clean SPBs while still finding pages in tainted ones.
+ * Only drop NOFRAGMENT entirely if that also fails.
+ */
+ if (no_fallback && !defrag_mode &&
+ !(gfp_mask & __GFP_DIRECT_RECLAIM)) {
+ if (!(alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
+ alloc_flags |= ALLOC_NOFRAG_TAINTED_OK;
+ goto retry;
+ }
+ alloc_flags &= ~(ALLOC_NOFRAGMENT | ALLOC_NOFRAG_TAINTED_OK);
goto retry;
}
--
2.52.0
* [RFC PATCH 20/45] mm: page_alloc: aggressively pack non-movable allocations in tainted SPBs on large systems
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (18 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 19/45] mm: page_alloc: prevent atomic allocations from tainting clean SPBs Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 21/45] mm: page_alloc: prefer reclaim over tainting clean superpageblocks Rik van Riel
` (25 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
On systems with many superpageblocks, sub-pageblock MOVABLE fragments
within already-tainted SPBs were being skipped by __rmqueue_claim()
due to the ALLOC_NOFRAGMENT pageblock_order floor. This caused the
allocator to fall through to clean SPBs, tainting them unnecessarily.
Introduce SPB_AGGRESSIVE_THRESHOLD: on systems with more than 8
superpageblocks, relax the min_order floor for the preferred category
(tainted SPBs) so non-movable allocations consume free space there at
any granularity. On small systems, preserve the pageblock_order floor
to protect MOVABLE capacity within tainted SPBs.
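Roughly, the floor selection reduces to the sketch below (illustrative
helper, not the patch's code; the real logic also depends on
ALLOC_NOFRAGMENT and on which superpageblock category is being searched):

#include <stdio.h>

#define SPB_AGGRESSIVE_THRESHOLD 8

/* Lowest order a claim may take from a category, assuming ALLOC_NOFRAGMENT. */
static int claim_min_order(int req_order, int pageblock_order,
                           unsigned long nr_superpageblocks,
                           int preferred_category)
{
        /* large systems: consume tainted-SPB free space at any granularity */
        if (preferred_category &&
            nr_superpageblocks > SPB_AGGRESSIVE_THRESHOLD)
                return req_order;
        /* small systems or non-preferred categories: whole pageblocks only */
        return pageblock_order;
}

int main(void)
{
        /* order-0 request with 2MB (order-9) pageblocks */
        printf("large system floor: %d\n", claim_min_order(0, 9, 64, 1));
        printf("small system floor: %d\n", claim_min_order(0, 9, 4, 1));
        return 0;
}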
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 68 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13bc57592cd5..215b7d6b95d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2643,6 +2643,24 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
*/
#define SPB_TAINTED_RESERVE 4
+/*
+ * On systems with many superpageblocks, we can afford to "write off"
+ * tainted superpageblocks by aggressively packing unmovable/reclaimable
+ * allocations into them — even sub-pageblock fragments — to keep clean
+ * superpageblocks clean for future 1GB hugepage and contiguous allocations.
+ *
+ * On small systems (few superpageblocks), each SPB represents a large
+ * fraction of total memory. Aggressively claiming sub-pageblock movable
+ * fragments from tainted SPBs would destroy MOVABLE capacity that the
+ * system can't afford to lose, with little benefit since there are too
+ * few SPBs to meaningfully separate movable from unmovable anyway.
+ *
+ * This threshold controls the crossover: above it, prefer concentrating
+ * non-movable allocations in tainted SPBs at any granularity; below it,
+ * only claim whole free pageblocks from tainted SPBs.
+ */
+#define SPB_AGGRESSIVE_THRESHOLD 8
+
/**
* sb_preferred_for_movable - Find the fullest clean superpageblock for movable
* @zone: zone to search
@@ -3555,6 +3573,7 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
{
int current_order;
int min_order = order;
+ int nofrag_min_order = order;
struct page *page;
int fallback_mt;
static const unsigned int cat_search[] = {
@@ -3568,9 +3587,18 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
* Do not steal pages from freelists belonging to other pageblocks
* i.e. orders < pageblock_order. If there are no local zones free,
* the zonelists will be reiterated without ALLOC_NOFRAGMENT.
+ *
+ * Only apply this restriction to empty and clean superpageblocks.
+ * Claiming within already-tainted superpageblocks does not cause
+ * new fragmentation, and skipping them wastes free space that
+ * could prevent tainting clean superpageblocks.
+ *
+ * When ALLOC_NOFRAGMENT is set, skip empty and clean superpageblocks
+ * entirely to avoid tainting them. The slowpath will try reclaim and
+ * compaction first, and only drop ALLOC_NOFRAGMENT as a last resort.
*/
if (order < pageblock_order && alloc_flags & ALLOC_NOFRAGMENT)
- min_order = pageblock_order;
+ nofrag_min_order = pageblock_order;
/*
* Find the largest available free page in a fallback migratetype.
@@ -3580,6 +3608,31 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
* ones.
*/
for (c = 0; c < ARRAY_SIZE(cat_search); c++) {
+ /*
+ * When avoiding fragmentation, do not search clean/empty
+ * superpageblocks for fallback pages. Tainting a clean SPB
+ * is the worst outcome — better to fail and let the slowpath
+ * try reclaim and compaction in already-tainted SPBs first.
+ */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) &&
+ cat_search[c] != SB_SEARCH_PREFERRED)
+ continue;
+
+ /*
+ * For the preferred category (tainted SPBs for non-movable),
+ * search all orders down to the allocation order on systems
+ * with enough superpageblocks that we can afford to write off
+ * tainted ones. These SPBs are already tainted, so sub-pageblock
+ * stealing doesn't cause additional fragmentation.
+ *
+ * On small systems, keep the pageblock_order floor to preserve
+ * MOVABLE capacity within tainted SPBs — see comment at
+ * SPB_AGGRESSIVE_THRESHOLD.
+ */
+ min_order = (cat_search[c] == SB_SEARCH_PREFERRED &&
+ zone->nr_superpageblocks > SPB_AGGRESSIVE_THRESHOLD) ?
+ order : nofrag_min_order;
+
for (current_order = MAX_PAGE_ORDER;
current_order >= min_order; --current_order) {
if (!should_try_claim_block(current_order,
@@ -3850,8 +3903,18 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
* For movable allocations, prefer pageblocks from the
* fullest clean superpageblock to pack allocations and
* preserve empty superpageblocks for 1GB hugepages.
+ *
+ * For non-movable allocations, force ALLOC_NOFRAGMENT so
+ * __rmqueue cannot steal a whole pageblock out of a clean
+ * SPB. Stealing is the worst possible outcome for a bulk
+ * refill: a single network or slab burst can taint dozens
+ * of clean pageblocks. Phase 2 will adopt sub-pageblock
+ * fragments from tainted SPBs before Phase 3 falls back to
+ * the original alloc_flags (which may eventually steal at
+ * the requested order, a much smaller fragmentation event).
*/
while (refilled + pageblock_nr_pages <= pages_needed) {
+ unsigned int p1_alloc_flags = alloc_flags;
struct page *page = NULL;
if (migratetype == MIGRATE_MOVABLE) {
@@ -3861,11 +3924,14 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
if (sb)
page = __rmqueue_from_sb(zone, pageblock_order,
migratetype, sb);
+ } else if (!is_migrate_cma(migratetype)) {
+ p1_alloc_flags = (p1_alloc_flags | ALLOC_NOFRAGMENT) &
+ ~ALLOC_NOFRAG_TAINTED_OK;
}
if (!page)
page = __rmqueue(zone, pageblock_order,
migratetype,
- alloc_flags, &rmqm);
+ p1_alloc_flags, &rmqm);
if (!page)
break;
--
2.52.0
* [RFC PATCH 21/45] mm: page_alloc: prefer reclaim over tainting clean superpageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (19 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 20/45] mm: page_alloc: aggressively pack non-movable allocations in tainted SPBs on large systems Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 22/45] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks Rik van Riel
` (24 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
When the allocator needs pages for unmovable or reclaimable allocations
and tainted superpageblocks are exhausted, it currently falls through to
clean superpageblocks immediately, permanently tainting them. This
defeats the purpose of superpageblock anti-fragmentation.
Restructure the allocation fallback cascade to try reclaim and compaction
before tainting clean superpageblocks:
1. Reorder __rmqueue_smallest to search each preferred SPB completely
before moving to the next source. Within each preferred SPB, try
whole-pageblock allocations first (for PCP buddy optimization),
then fall back to sub-pageblock allocations. This ensures that
sub-pageblock free pages in existing tainted SPBs are used before
tainting empty or clean SPBs. The pass order is:
- Preferred SPBs: whole pageblock first, then sub-pageblock
- Whole pageblock inline claim from tainted SPBs (non-movable only)
- Whole pageblock from empty SPBs
- Fallback to non-preferred SPBs
2. In get_page_from_freelist(), only drop ALLOC_NOFRAGMENT immediately
for allocations that cannot do direct reclaim (atomic). Allocations
that can reclaim keep ALLOC_NOFRAGMENT set and enter the slowpath,
where reclaim and compaction can free pages in already-tainted SPBs.
3. Preserve ALLOC_NOFRAGMENT through the slowpath by calling
alloc_flags_nofragment() after gfp_to_alloc_flags(). Previously
the slowpath only set NOFRAGMENT for defrag_mode, losing the SPB
protection that the fastpath established.
4. After reclaim and compaction have both been tried and failed, drop
ALLOC_NOFRAGMENT unconditionally as a last resort before OOM.
Previously this was gated on defrag_mode.
Testing shows that with this change, clean superpageblocks maintain
unmov=0 throughout a heavy mixed workload (swap pressure, filesystem
metadata, anonymous memory cycling, compaction, hugepage allocation),
where previously 2-3 additional SPBs would become tainted with 7-8
unmovable pageblocks each.
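The resulting search order in __rmqueue_smallest() compresses to the sketch
below (user-space model; the helpers stand in for the per-SPB freelist
walks and are not real kernel functions):

#include <stddef.h>
#include <stdio.h>

struct page;    /* opaque stand-in */

/* stand-ins for the per-superpageblock freelist searches */
static struct page *search_preferred(int whole_pageblock) { return NULL; }
static struct page *claim_from_tainted(void) { return NULL; }
static struct page *search_empty(void) { return NULL; }
static struct page *search_fallback(void) { return NULL; }

static struct page *rmqueue_passes(int movable)
{
        struct page *page;

        /* Pass 1: preferred SPBs, whole pageblock first, then sub-pageblock */
        if ((page = search_preferred(1)) || (page = search_preferred(0)))
                return page;
        /* Pass 2: inline claim of whole free pageblocks in tainted SPBs */
        if (!movable && (page = claim_from_tainted()))
                return page;
        /* Pass 3: whole pageblocks from empty SPBs */
        if ((page = search_empty()))
                return page;
        /* Pass 4: fall back to the non-preferred category */
        return search_fallback();
}

int main(void)
{
        printf("%p\n", (void *)rmqueue_passes(0));
        return 0;
}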
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 74 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 61 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 215b7d6b95d2..8f925b5a2e5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2764,11 +2764,23 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* concentrate non-movable allocations into fewer superpageblocks.
* For movable, prefer clean superpageblocks to keep them homogeneous.
*
- * Search empty superpageblocks between the preferred and fallback
- * category passes to avoid movable allocations consuming free
- * pageblocks in tainted superpageblocks (which unmovable needs for
- * future CLAIMs), and vice versa.
+ * Prefer whole pageblock allocations (>= pageblock_order) over
+ * sub-pageblock allocations because whole pageblocks enable the
+ * PCP buddy optimization for fast subsequent allocations.
+ *
+ * Search order:
+ * 1. Preferred SPBs: whole pageblock first, then sub-pageblock
+ * 2. Whole pageblock inline claim from tainted SPBs (non-movable only)
+ * 3. Whole pageblock from empty SPBs
+ * 4. Fallback to non-preferred SPBs
+ *
+ * Pass 1 tries whole pageblock first for PCP buddy optimization,
+ * then falls back to sub-pageblock within the same preferred SPBs.
+ * This ensures we never taint empty/clean SPBs while preferred
+ * SPBs still have free pages at any order.
*/
+
+ /* Pass 1: preferred SPBs — whole pageblock first, then sub-pageblock */
for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
enum sb_category cat = cat_order[movable][0];
@@ -2776,7 +2788,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
&zone->spb_lists[cat][full], list) {
if (!sb->nr_free_pages)
continue;
- for (current_order = order;
+ /* Try whole pageblock (or larger) first for PCP buddy */
+ for (current_order = max(order, pageblock_order);
current_order < NR_PAGE_ORDERS;
++current_order) {
area = &sb->free_area[current_order];
@@ -2793,15 +2806,34 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
migratetype < MIGRATE_PCPTYPES);
return page;
}
+ /* Then try sub-pageblock (no PCP buddy) */
+ if (order < pageblock_order) {
+ for (current_order = order;
+ current_order < pageblock_order;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
}
}
/*
- * For non-movable allocations, try to reclaim free pageblocks
- * from tainted superpageblocks before looking at empty or clean
- * ones. Free pageblocks in tainted SBs have pages on the MOVABLE
- * free list (reset by mark_pageblock_free), so the search above
- * misses them. Claim them inline to keep non-movable allocations
+ * Pass 2: for non-movable allocations, try to claim free pageblocks
+ * from tainted superpageblocks. Free pageblocks in tainted SBs have
+ * pages on the MOVABLE free list (reset by mark_pageblock_free), so
+ * pass 1 misses them. Claim them inline to keep non-movable allocations
* concentrated in already-tainted superpageblocks.
*
* Try whole pageblock orders first (preferred for PCP buddy optimization),
@@ -2879,7 +2911,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
- /* Empty superpageblocks: try before falling back to non-preferred category */
+ /* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
if (!sb->nr_free_pages)
continue;
@@ -6281,6 +6313,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (!zonelist_zone(ac->preferred_zoneref))
goto nopage;
+ /*
+ * Preserve ALLOC_NOFRAGMENT through the slowpath so that reclaim
+ * and compaction are tried before allowing clean superpageblocks
+ * to be tainted. The fast path sets this via alloc_flags_nofragment()
+ * but gfp_to_alloc_flags() only sets it for defrag_mode. Re-add it
+ * here so the slowpath retries with NOFRAGMENT still protecting
+ * clean SPBs until the last-resort drop below.
+ */
+ alloc_flags |= alloc_flags_nofragment(
+ zonelist_zone(ac->preferred_zoneref), gfp_mask);
+
/*
* Check for insane configurations where the cpuset doesn't contain
* any suitable zone to satisfy the request - e.g. non-movable
@@ -6420,8 +6463,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
&compaction_retries))
goto retry;
- /* Reclaim/compaction failed to prevent the fallback */
- if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+ /*
+ * Reclaim and compaction have been tried but could not free enough
+ * pages in already-tainted superpageblocks. Drop NOFRAGMENT as a
+ * last resort to allow claiming from clean/empty SPBs and stealing
+ * across migratetype boundaries. This is better than OOM-killing.
+ */
+ if (alloc_flags & ALLOC_NOFRAGMENT) {
alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 22/45] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (20 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 21/45] mm: page_alloc: prefer reclaim over tainting clean superpageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 23/45] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters Rik van Riel
` (23 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Add Phase 2 to rmqueue_bulk: when refilling PCP for unmovable or
reclaimable allocations, search tainted superpageblocks for partially-free
pageblocks with sub-pageblock buddy entries of the requested migratetype.
Claim ownership of the pageblock and move the found entry to PCP with
PCPBuddy marking. Pass 0 (the existing owned-block recovery phase)
picks up remaining buddy entries on subsequent refills, so there is no
need to sweep the entire pageblock eagerly.
This concentrates non-movable allocations into already-tainted
superpageblocks, reducing fragmentation spread to clean superpageblocks.
Before claiming ownership, verify the pageblock is not already owned by
another CPU (pbd->cpu == 0). Without this check, two CPUs could have
PCPBuddy pages from the same pageblock on separate PCP lists protected
by different locks, and the PCP merge pass could corrupt the other
CPU's list.
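As a rough sketch of where the new phase sits in the refill flow
(illustrative pseudocode only, not the actual hunk; the helpers
refill_from_owned_blocks, acquire_whole_pageblocks,
find_sub_pageblock_chunk and claim_and_enqueue are placeholder names
for readability and do not exist in the patch):

        /* rmqueue_bulk() refill, simplified */
        refill_from_owned_blocks(pcp);          /* Phase 0 */
        acquire_whole_pageblocks(zone, pcp);    /* Phase 1: claim + PCPBuddy */
        if (migratetype != MIGRATE_MOVABLE) {   /* and not CMA */
                /* Phase 2 (this patch): adopt a partial pageblock from a
                 * tainted SPB, but only if no CPU owns it yet. */
                page = find_sub_pageblock_chunk(zone, migratetype);
                if (page && pfn_to_pageblock(page, page_to_pfn(page))->cpu == 0)
                        claim_and_enqueue(pcp, page);   /* owner + PCPBuddy */
        }
        __rmqueue(zone, ...);                   /* Phase 3: fallback */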
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 114 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 101 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8f925b5a2e5f..4f8105b89e47 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1130,7 +1130,7 @@ static inline void set_buddy_order(struct page *page, unsigned int order)
* - Set when Phase 0/1 restore or acquire whole pageblocks.
* - Propagated to split remainders in pcp_rmqueue_smallest().
* - Set on freed pages from owned blocks routed to the owner PCP.
- * - NOT set for Phase 2/3 fragments or zone-owned frees.
+ * - NOT set for Phase 3 fragments or zone-owned frees.
* - The merge pass in free_pcppages_bulk() only processes
* PagePCPBuddy pages, ensuring it never touches pages on
* another CPU's PCP list.
@@ -3840,15 +3840,15 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
* under a single hold of the lock, for efficiency. Add them to the
* freelist of @pcp.
*
- * When @pcp is non-NULL and @count > 1 (normal pageset), uses a four-phase
+ * When @pcp is non-NULL and @count > 1 (normal pageset), uses a multi-phase
* approach:
- * Phase 0: Recover previously owned, partially drained blocks.
- * Phase 1: Acquire whole pageblocks, claim ownership, set PagePCPBuddy.
- * These pages are eligible for PCP-level buddy merging.
- * Phase 2: Grab sub-pageblock fragments of the same migratetype.
- * Phase 3: Fall back to __rmqueue() with migratetype fallback.
- * Phase 2/3 pages are cached for batching only -- no ownership claim,
- * no PagePCPBuddy, no PCP-level merging.
+ * Phase 0: Recover previously owned, partially drained blocks.
+ * Phase 1: Acquire whole pageblocks, claim ownership, set PagePCPBuddy.
+ * These pages are eligible for PCP-level buddy merging.
+ * Phase 2: Adopt partial pageblocks from tainted SPBs (non-movable only).
+ * Claims ownership so Phase 0 can recover buddy entries later.
+ * Phase 3: Fall back to __rmqueue() with migratetype fallback.
+ * No ownership claim, no PagePCPBuddy, no PCP-level merging.
*
* When @pcp is NULL or @count <= 1 (boot pageset), acquires individual
* pages of the requested order directly.
@@ -3976,11 +3976,99 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
goto out;
/*
- * Phase 2 was removed: it swept zone free lists for sub-pageblock
- * fragments, which are always empty when superpageblocks are enabled.
- * Phase 3's __rmqueue() -> __rmqueue_smallest() properly searches
- * per-superpageblock free lists at all orders.
+ * Phase 2: Adopt partial pageblocks from tainted SPBs.
+ *
+ * Phase 1 only grabs whole free pageblocks. When a tainted SPB
+ * has partially-used pageblocks with free sub-pageblock buddy
+ * entries, Phase 1 can't use them. Phase 3 can find them via
+ * __rmqueue_smallest, but without ownership or PCPBuddy marking,
+ * so they fragment further on drain.
+ *
+ * This phase bridges the gap: find a sub-pageblock free entry
+ * in a tainted SPB and claim ownership of its pageblock. Phase 0
+ * will pick up remaining buddy entries on subsequent refills.
+ *
+ * Only for unmovable/reclaimable — movable should use clean SPBs.
*/
+ if (migratetype != MIGRATE_MOVABLE &&
+ !is_migrate_cma(migratetype)) {
+ enum sb_fullness full;
+
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ struct superpageblock *sb;
+
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ struct page *page;
+ int found_order = -1;
+
+ if (sb->nr_free_pages < pageblock_nr_pages / 4)
+ continue;
+
+ /*
+ * Find a sub-pageblock free entry for our
+ * migratetype, starting from the largest order.
+ */
+ for (o = pageblock_order - 1; o >= order; o--) {
+ struct free_area *area;
+
+ area = &sb->free_area[o];
+ page = get_page_from_free_area(
+ area, migratetype);
+ if (page) {
+ found_order = o;
+ break;
+ }
+ }
+ if (found_order < 0)
+ continue;
+
+ /*
+ * Check that this pageblock isn't already
+ * owned by another CPU. If it is, two CPUs
+ * would have PCPBuddy pages from the same
+ * pageblock, and the PCP merge pass could
+ * corrupt the other CPU's PCP list.
+ */
+ pbd = pfn_to_pageblock(page,
+ page_to_pfn(page));
+ if (pbd->cpu != 0)
+ continue;
+
+ /*
+ * Found a free chunk in an unowned pageblock.
+ * Take it from buddy, claim ownership, and
+ * set PCPBuddy. Phase 0 will grab remaining
+ * buddy entries on future refills.
+ *
+ * Set PB_has_<migratetype> since we bypass
+ * page_del_and_expand (which normally does
+ * PB_has tracking).
+ */
+ del_page_from_free_list(page, zone,
+ found_order,
+ migratetype);
+ __spb_set_has_type(page, migratetype);
+ set_pcpblock_owner(page, cpu);
+ __SetPagePCPBuddy(page);
+ pcp_enqueue_tail(pcp, page, migratetype,
+ found_order);
+ refilled += 1 << found_order;
+
+ /*
+ * Register for Phase 0 recovery so future
+ * drains from this pageblock can be swept
+ * back efficiently.
+ */
+ if (list_empty(&pbd->cpu_node))
+ list_add(&pbd->cpu_node,
+ &pcp->owned_blocks);
+
+ if (refilled >= pages_needed)
+ goto out;
+ }
+ }
+ }
/*
* Phase 3: Last resort. Use __rmqueue() which does
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 23/45] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (21 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 22/45] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 24/45] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs Rik van Riel
` (22 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Add spb_debug_check() and call it after every site that mutates the
per-superpageblock type counters (nr_free / nr_unmovable / nr_reclaimable
/ nr_movable). Each counter must be <= total_pageblocks; a violation
indicates that a PB_has_<mt> bit transition was missed by one of the
allocation, free, claim, or evacuation paths and the counter has drifted
out of sync with the bits.
The checks are compiled out without CONFIG_DEBUG_VM, so the production
cost is zero, while giving us a single place to catch counter drift
early during stress testing instead of debugging it from a much later
misaccounting symptom.
Several real bugs in the SPB stack were caught by this check during
development; keeping it permanently makes future churn safer.
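The invariant being enforced is a simple per-type upper bound: each
counter tracks how many pageblocks in the SPB have the corresponding
PB_has_<mt> bit set, so for every type

        nr_<type> <= total_pageblocks

and a counter exceeding the total means a set/clear transition was
missed somewhere (an increment without the matching decrement).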
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 41 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f8105b89e47..9f4298fc2727 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -521,6 +521,32 @@ static inline int migratetype_to_has_bit(int migratetype)
}
}
+#ifdef CONFIG_DEBUG_VM
+static void spb_debug_check(struct superpageblock *sb, const char *caller)
+{
+ u16 total = sb->total_pageblocks;
+
+ VM_WARN_ONCE(sb->nr_free > total,
+ "%s: nr_free %u > total %u (zone=%s sb=%lu)\n",
+ caller, sb->nr_free, total, sb->zone->name,
+ (unsigned long)(sb - sb->zone->superpageblocks));
+ VM_WARN_ONCE(sb->nr_unmovable > total,
+ "%s: nr_unmovable %u > total %u (zone=%s sb=%lu)\n",
+ caller, sb->nr_unmovable, total, sb->zone->name,
+ (unsigned long)(sb - sb->zone->superpageblocks));
+ VM_WARN_ONCE(sb->nr_reclaimable > total,
+ "%s: nr_reclaimable %u > total %u (zone=%s sb=%lu)\n",
+ caller, sb->nr_reclaimable, total, sb->zone->name,
+ (unsigned long)(sb - sb->zone->superpageblocks));
+ VM_WARN_ONCE(sb->nr_movable > total,
+ "%s: nr_movable %u > total %u (zone=%s sb=%lu)\n",
+ caller, sb->nr_movable, total, sb->zone->name,
+ (unsigned long)(sb - sb->zone->superpageblocks));
+}
+#else
+static inline void spb_debug_check(struct superpageblock *sb, const char *caller) {}
+#endif
+
/*
* __spb_set_has_type - set PB_has_* and increment type counter
*
@@ -552,6 +578,7 @@ static void __spb_set_has_type(struct page *page, int migratetype)
sb->nr_movable++;
break;
}
+ spb_debug_check(sb, "__spb_set_has_type");
}
}
@@ -589,6 +616,7 @@ static void __spb_clear_has_type(struct page *page, int migratetype)
sb->nr_movable--;
break;
}
+ spb_debug_check(sb, "__spb_clear_has_type");
}
}
@@ -818,6 +846,7 @@ static void superpageblock_pb_now_free(struct page *page)
return;
sb->nr_free++;
+ spb_debug_check(sb, "pb_now_free");
spb_update_list(sb);
}
@@ -840,6 +869,7 @@ static void superpageblock_pb_now_used(struct page *page)
if (sb->nr_free)
sb->nr_free--;
+ spb_debug_check(sb, "pb_now_used");
spb_update_list(sb);
}
@@ -1305,7 +1335,9 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
struct free_area *area = pfn_sb_free_area(zone, pfn, order, &sb);
int nr_pages = 1 << order;
- VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype &&
+ !is_migrate_isolate(get_pageblock_migratetype(page)) &&
+ !is_migrate_cma(get_pageblock_migratetype(page)),
"page type is %d, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, nr_pages);
@@ -1339,7 +1371,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
int nr_pages = 1 << order;
/* Free page moving can fail, so it happens before the type update */
- VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt,
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != old_mt &&
+ !is_migrate_cma(get_pageblock_migratetype(page)),
"page type is %d, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), old_mt, nr_pages);
@@ -1364,7 +1397,9 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
struct free_area *area = pfn_sb_free_area(zone, pfn, order, &sb);
int nr_pages = 1 << order;
- VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
+ VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype &&
+ !is_migrate_isolate(get_pageblock_migratetype(page)) &&
+ !is_migrate_cma(get_pageblock_migratetype(page)),
"page type is %d, passed migratetype is %d (nr=%d)\n",
get_pageblock_migratetype(page), migratetype, nr_pages);
@@ -10529,11 +10564,12 @@ static int superpageblock_debugfs_show(struct seq_file *m, void *v)
i, sb->start_pfn,
sb->total_pageblocks);
else
- seq_printf(m, " sb[%lu] pfn=0x%lx: unmov=%u recl=%u mov=%u rsv=%u free=%u total=%u\n",
+ seq_printf(m, " sb[%lu] pfn=0x%lx: unmov=%u recl=%u mov=%u rsv=%u free=%u total=%u free_pages=%lu\n",
i, sb->start_pfn,
sb->nr_unmovable, sb->nr_reclaimable,
sb->nr_movable, sb->nr_reserved,
- sb->nr_free, sb->total_pageblocks);
+ sb->nr_free, sb->total_pageblocks,
+ sb->nr_free_pages);
}
}
return 0;
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 24/45] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (22 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 23/45] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 25/45] mm: page_alloc: skip pageblock compatibility threshold in " Rik van Riel
` (21 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Reduce tainted superpageblock proliferation with two changes:
1. Dynamic SPB_TAINTED_RESERVE: scale the movable steering reserve with
SPB size (~3% of pageblocks, minimum 4). For a 512-pageblock SPB this
gives 16 reserved pageblocks instead of the previous flat 4, triggering
async defrag 4x earlier and keeping more headroom for unmovable claims.
2. Two-phase targeted evacuation before NOFRAGMENT drop: when the slowpath
is about to drop ALLOC_NOFRAGMENT for unmovable/reclaimable allocations,
first try evacuating movable pages from tainted SPBs to create free
pageblocks. Phase 1 evacuates movable pages from pageblocks already
labeled as the desired migratetype (buddy coalescing). Phase 2 evacuates
entire MOVABLE pageblocks to create free whole pageblocks that Pass 2
can claim for the desired migratetype. This avoids tainting clean SPBs
in many cases where existing tainted SPBs have reclaimable capacity.
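For change 1 the scaling rule is a max() of the flat minimum and a
fixed fraction of the SPB's pageblock count; a minimal illustration
(the real helper is spb_tainted_reserve() in the diff below):

        /* ~3% of pageblocks (1/32), but never fewer than 4 */
        reserve = max_t(u16, 4, sb->total_pageblocks / 32);

        /* total_pageblocks = 512  ->  reserve = 16
         * total_pageblocks =  64  ->  reserve =  4  (minimum applies) */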
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 176 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 125 insertions(+), 51 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f4298fc2727..493db531b869 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2675,8 +2675,16 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
* fewer than this many free pageblocks, ensuring that unmovable claims
* always find room in existing tainted superpageblocks instead of spilling
* into clean ones.
+ *
+ * Scale with SPB size: reserve ~3% of pageblocks (minimum 4).
+ * For a 512-pageblock SPB this gives 16 reserved pageblocks.
*/
-#define SPB_TAINTED_RESERVE 4
+#define SPB_TAINTED_RESERVE_MIN 4
+
+static inline u16 spb_tainted_reserve(const struct superpageblock *sb)
+{
+ return max_t(u16, SPB_TAINTED_RESERVE_MIN, sb->total_pageblocks / 32);
+}
/*
* On systems with many superpageblocks, we can afford to "write off"
@@ -2988,7 +2996,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* with few free pageblocks to reserve space
* for future unmovable/reclaimable claims.
*/
- if (sb->nr_free <= SPB_TAINTED_RESERVE)
+ if (sb->nr_free <= spb_tainted_reserve(sb))
continue;
for (current_order = order;
current_order < NR_PAGE_ORDERS;
@@ -3552,7 +3560,7 @@ __rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
&sb->free_area[order];
if (movable && cat == SB_TAINTED &&
- sb->nr_free <= SPB_TAINTED_RESERVE)
+ sb->nr_free <= spb_tainted_reserve(sb))
continue;
for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
@@ -3601,7 +3609,7 @@ __rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
&sb->free_area[order];
if (movable && cat == SB_TAINTED &&
- sb->nr_free <= SPB_TAINTED_RESERVE)
+ sb->nr_free <= spb_tainted_reserve(sb))
continue;
for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
@@ -6588,9 +6596,33 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
/*
* Reclaim and compaction have been tried but could not free enough
- * pages in already-tainted superpageblocks. Drop NOFRAGMENT as a
- * last resort to allow claiming from clean/empty SPBs and stealing
- * across migratetype boundaries. This is better than OOM-killing.
+ * pages in already-tainted superpageblocks. Before dropping
+ * NOFRAGMENT, try targeted evacuation of movable pages from
+ * tainted SPBs to create free pageblocks for unmovable claims.
+ */
+ if ((alloc_flags & ALLOC_NOFRAGMENT) &&
+ (ac->migratetype == MIGRATE_UNMOVABLE ||
+ ac->migratetype == MIGRATE_RECLAIMABLE)) {
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
+ ac->highest_zoneidx,
+ ac->nodemask) {
+ if (spb_evacuate_for_order(zone, order,
+ ac->migratetype)) {
+ page = get_page_from_freelist(gfp_mask, order,
+ alloc_flags, ac);
+ if (page)
+ goto got_pg;
+ }
+ }
+ }
+
+ /*
+ * Targeted evacuation could not free enough either. Drop
+ * NOFRAGMENT as a last resort to allow claiming from clean/empty
+ * SPBs. This is better than OOM-killing.
*/
if (alloc_flags & ALLOC_NOFRAGMENT) {
alloc_flags &= ~ALLOC_NOFRAGMENT;
@@ -8688,7 +8720,7 @@ static bool spb_needs_defrag(struct superpageblock *sb)
*/
if (spb_get_category(sb) == SB_TAINTED)
return sb->nr_movable > 0 &&
- sb->nr_free < SPB_TAINTED_RESERVE;
+ sb->nr_free < spb_tainted_reserve(sb);
/*
* Clean superpageblocks: compact scattered free pages into whole
@@ -8720,7 +8752,7 @@ static bool spb_defrag_done(struct superpageblock *sb)
*/
if (spb_get_category(sb) == SB_TAINTED)
return !sb->nr_movable ||
- sb->nr_free >= SPB_TAINTED_RESERVE;
+ sb->nr_free >= spb_tainted_reserve(sb);
/* Clean superpageblocks: stop when enough free pageblocks exist */
if (sb->nr_free >= 2)
@@ -9710,16 +9742,18 @@ static struct page *spb_try_alloc_contig(struct zone *zone,
}
/**
- * sb_collect_evacuate_candidates - Find pageblocks for targeted evacuation
+ * sb_collect_evacuate_candidates - Find tainted SPBs for targeted evacuation
* @zone: zone to search (must hold zone->lock)
- * @migratetype: desired migratetype (MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE)
+ * @migratetype: desired migratetype (MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE),
+ * or -1 to find any tainted SPB with movable pages
* @sb_pfns: output array of tainted superpageblock start PFNs
* @max: maximum candidates to collect
*
- * Find tainted superpageblocks containing pageblocks of the desired migratetype
- * that also have movable pages to evacuate. Evacuating movable pages from
- * these pageblocks creates buddy coalescing opportunities for high-order
- * allocations of the desired migratetype.
+ * Find tainted superpageblocks with movable pages to evacuate. When
+ * @migratetype is specified, only return SPBs that also contain pageblocks
+ * of that type (for coalescing within existing non-movable pageblocks).
+ * When @migratetype is -1, return any tainted SPB with movable pages
+ * (for freeing whole pageblocks via movable evacuation).
*
* Returns number of candidate superpageblock PFNs found.
*/
@@ -9734,20 +9768,22 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
for (full = 0; full < __NR_SB_FULLNESS; full++) {
list_for_each_entry(sb, &zone->spb_lists[SB_TAINTED][full],
list) {
- bool has_matching;
-
if (!sb->nr_movable)
continue;
- if (migratetype == MIGRATE_UNMOVABLE)
- has_matching = sb->nr_unmovable > 0;
- else if (migratetype == MIGRATE_RECLAIMABLE)
- has_matching = sb->nr_reclaimable > 0;
- else
- continue;
+ if (migratetype >= 0) {
+ bool has_matching;
- if (!has_matching)
- continue;
+ if (migratetype == MIGRATE_UNMOVABLE)
+ has_matching = sb->nr_unmovable > 0;
+ else if (migratetype == MIGRATE_RECLAIMABLE)
+ has_matching = sb->nr_reclaimable > 0;
+ else
+ continue;
+
+ if (!has_matching)
+ continue;
+ }
sb_pfns[n++] = sb->start_pfn;
if (n >= max)
@@ -9757,17 +9793,56 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
return n;
}
+/*
+ * Evacuate pageblocks of the given migratetype within a range.
+ * Returns number of pageblocks evacuated.
+ */
+static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn, int migratetype, int max)
+{
+ unsigned long pfn;
+ int nr_evacuated = 0;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+ struct page *page;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ if (!zone_spans_pfn(zone, pfn))
+ continue;
+
+ page = pfn_to_page(pfn);
+
+ if (get_pfnblock_migratetype(page, pfn) != migratetype)
+ continue;
+
+ if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+ continue;
+
+ evacuate_pageblock(zone, pfn, true);
+ if (++nr_evacuated >= max)
+ break;
+ }
+ return nr_evacuated;
+}
+
/**
* spb_evacuate_for_order - Targeted evacuation of movable pages from
- * unmovable/reclaimable pageblocks
+ * tainted superpageblocks
* @zone: zone to work on
* @order: allocation order that failed
* @migratetype: desired migratetype (MIGRATE_UNMOVABLE or MIGRATE_RECLAIMABLE)
*
- * Instead of blind compaction, use superpageblock metadata to find pageblocks
- * of the right migratetype in tainted superpageblocks and evacuate their
- * movable pages. This creates buddy coalescing opportunities within
- * the pageblock, enabling higher-order allocations.
+ * Two-phase evacuation to create free space in tainted superpageblocks:
+ *
+ * Phase 1: Evacuate movable pages from pageblocks already labeled as
+ * @migratetype. This creates buddy coalescing opportunities within
+ * existing non-movable pageblocks.
+ *
+ * Phase 2: Evacuate entire MOVABLE pageblocks from tainted SPBs.
+ * When fully evacuated, these become free whole pageblocks that
+ * __rmqueue_smallest Pass 2 can claim for the desired migratetype.
*
* Returns true if evacuation was performed (caller should retry allocation).
*/
@@ -9779,40 +9854,39 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int nr_sbs, i;
bool did_evacuate = false;
+ /* Phase 1: coalesce within existing non-movable pageblocks */
spin_lock_irqsave(&zone->lock, flags);
nr_sbs = sb_collect_evacuate_candidates(zone, migratetype,
sb_pfns,
SPB_CONTIG_MAX_CANDIDATES);
spin_unlock_irqrestore(&zone->lock, flags);
- for (i = 0; i < nr_sbs && !did_evacuate; i++) {
- unsigned long pfn, end_pfn;
-
- end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
- for (pfn = sb_pfns[i]; pfn < end_pfn;
- pfn += pageblock_nr_pages) {
- struct page *page;
+ for (i = 0; i < nr_sbs; i++) {
+ unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
- if (!pfn_valid(pfn))
- continue;
-
- /* Superpageblocks can straddle zone boundaries. */
- if (!zone_spans_pfn(zone, pfn))
- continue;
+ if (evacuate_pb_range(zone, sb_pfns[i], end_pfn,
+ migratetype, 3))
+ did_evacuate = true;
+ }
- page = pfn_to_page(pfn);
+ if (did_evacuate)
+ return true;
- if (get_pfnblock_migratetype(page, pfn) != migratetype)
- continue;
+ /* Phase 2: evacuate MOVABLE pageblocks to create free whole pageblocks */
+ spin_lock_irqsave(&zone->lock, flags);
+ nr_sbs = sb_collect_evacuate_candidates(zone, -1,
+ sb_pfns,
+ SPB_CONTIG_MAX_CANDIDATES);
+ spin_unlock_irqrestore(&zone->lock, flags);
- if (!get_pfnblock_bit(page, pfn, PB_has_movable))
- continue;
+ for (i = 0; i < nr_sbs; i++) {
+ unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
- evacuate_pageblock(zone, pfn, true);
+ if (evacuate_pb_range(zone, sb_pfns[i], end_pfn,
+ MIGRATE_MOVABLE, 3))
did_evacuate = true;
- break;
- }
}
+
return did_evacuate;
}
#endif /* CONFIG_COMPACTION */
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 25/45] mm: page_alloc: skip pageblock compatibility threshold in tainted SPBs
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (23 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 24/45] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 26/45] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks Rik van Riel
` (20 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
Summary:
__rmqueue_smallest Pass 2b is the last resort before tainting a fresh
clean superpageblock: it walks MOVABLE sub-pageblock free chunks inside
already-tainted SPBs, calling try_to_claim_block() to relabel a movable
pageblock as the requested non-movable type. If Pass 2b fails, the
allocator falls through to Pass 3 and taints a clean SPB.
try_to_claim_block() guards the relabel with a 50% compatibility check:
free_pages + alike_pages must be at least pageblock_nr_pages/2. The
guard exists to protect a generic clean MOVABLE pageblock from being
relabeled when most of its pages are still in-use movable allocations.
Inside a tainted SPB the guard is harmful, not protective. The SPB has
already accepted fragmentation, and stranding a few in-use movable
pages inside a relabeled pageblock is dramatically cheaper than
tainting an entire clean SPB. bpftrace on a devvm under realistic load
caught the pathology directly: at the moment a clean SPB was tainted,
all 8 existing tainted SPBs had nr_free=0 (no whole free pageblocks),
collectively held ~21k movable free pages distributed across MOVABLE
pageblocks, and try_to_claim_block() had failed 29182 of 29228 calls
(99.84%) over the prior few minutes. Pass 2b was effectively unable
to absorb non-movable demand into the tainted pool.
Add a from_tainted_spb parameter to try_to_claim_block() and skip the
50% threshold when set. Pass 2b passes true (it walks SB_TAINTED lists
exclusively); __rmqueue_claim() passes false to preserve its existing
fragmentation-protection semantics.
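To see why the guard fails almost every time in this situation, some
illustrative arithmetic (assuming the x86_64 defaults of 4 KiB pages
and 2 MiB pageblocks, i.e. 512 pages per pageblock and a threshold of
256; the per-pageblock spread below is a made-up round number, only
the ~21k total comes from the trace above):

        ~21000 movable free pages spread over ~150 MOVABLE pageblocks
           -> ~140 free pages per pageblock on average
        140 free + 0 alike < 256   ->  try_to_claim_block() refuses

so nearly every Pass 2b attempt is rejected even though the tainted
SPBs collectively hold plenty of free memory.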
Test Plan:
Devvm bpftrace setup at ~/spb-monitors/spb-taint-walk.bt watches
clean->tainted transitions in zone Normal and tracks
try_to_claim_block call/ok/fail counters. Before the change the fail
rate was 99.84% with periodic clean SPB taints under load. After the
change, expect the fail rate to drop sharply and the count of tainted
SPBs to plateau at the boot-recruited set.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 493db531b869..67cc8165ab1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2776,7 +2776,8 @@ static struct page *claim_whole_block(struct zone *zone, struct page *page,
int current_order, int order, int new_type, int old_type);
static struct page *try_to_claim_block(struct zone *zone, struct page *page,
int current_order, int order, int start_type,
- int block_type, unsigned int alloc_flags);
+ int block_type, unsigned int alloc_flags,
+ bool from_tainted_spb);
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
@@ -2941,7 +2942,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = try_to_claim_block(zone, page,
current_order, order,
migratetype, MIGRATE_MOVABLE,
- 0);
+ 0, true);
if (!page)
continue;
trace_mm_page_alloc_zone_locked(
@@ -3420,11 +3421,17 @@ claim_whole_block(struct zone *zone, struct page *page,
* not, we check the pageblock for constituent pages; if at least half of the
* pages are free or compatible, we can still claim the whole block, so pages
* freed in the future will be put on the correct free list.
+ *
+ * @from_tainted_spb: caller has already verified the block lives in a tainted
+ * superpageblock, where SPB-level fragmentation has already been accepted.
+ * Skip the per-pageblock compatibility threshold so we can absorb non-movable
+ * demand into the existing tainted SPB instead of tainting a fresh clean one.
*/
static struct page *
try_to_claim_block(struct zone *zone, struct page *page,
int current_order, int order, int start_type,
- int block_type, unsigned int alloc_flags)
+ int block_type, unsigned int alloc_flags,
+ bool from_tainted_spb)
{
int free_pages, movable_pages, alike_pages;
unsigned long start_pfn;
@@ -3480,8 +3487,14 @@ try_to_claim_block(struct zone *zone, struct page *page,
/*
* If a sufficient number of pages in the block are either free or of
* compatible migratability as our allocation, claim the whole block.
- */
- if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
+ * The compatibility threshold protects clean MOVABLE pageblocks from
+ * being relabeled when most of their pages are still in-use movable
+ * allocations. Inside a tainted SPB the protection is unnecessary:
+ * fragmentation has already been accepted at the SPB level, and
+ * relabeling is much cheaper than tainting a fresh clean SPB.
+ */
+ if (from_tainted_spb ||
+ free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
page_group_by_mobility_disabled) {
__move_freepages_block(zone, start_pfn, block_type, start_type);
set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
@@ -3721,7 +3734,8 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
page = try_to_claim_block(zone, page, current_order,
order, start_migratetype,
- fallback_mt, alloc_flags);
+ fallback_mt, alloc_flags,
+ false);
if (page) {
trace_mm_page_alloc_extfrag(page, order,
current_order, start_migratetype,
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 26/45] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (24 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 25/45] mm: page_alloc: skip pageblock compatibility threshold in " Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 27/45] mm: trigger deferred SPB evacuation when atomic allocs would taint a clean SPB Rik van Riel
` (19 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Summary:
Inside a tainted SPB, free pages of UNMOVABLE and RECLAIMABLE
allocations cannot be told apart by the buddy allocator's
compatibility heuristic (alike_pages == 0 between the two non-movable
types in try_to_claim_block). Once a pageblock holds in-use pages of
both, any sticky UNMOVABLE pinhole prevents the RECLAIMABLE pages
from coalescing into useful higher-order chunks when they drain back
to the buddy. The PB's free capacity is permanently capped at
order-1 dust regardless of how much of it actually returns. Sticky
recl pages (active dentries, locked btrfs eb folios, NOFS slab) are
unavoidable; the cost is paid in internal fragmentation.
Two paths in the page allocator create UNMOVABLE<->RECLAIMABLE
mixing today:
1. try_to_claim_block() relabels a partial PB whenever the 50%
threshold "free_pages + alike_pages >= pageblock_nr_pages/2"
passes. For UNMOV<->RECL, alike_pages == 0, so the rule
degenerates to free_pages >= 256. A PB with 256 in-use UNMOV
pages plus 256 free pages passes and is relabeled RECL. Both
PB_has_unmovable and PB_has_reclaimable are then set.
2. __rmqueue_steal() takes a single foreign-type page out of a
PB without relabeling the PB. A UNMOVABLE allocation stealing
from a RECLAIMABLE-labeled PB sets PB_has_unmovable on top of
the existing PB_has_reclaimable.
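To make the degeneracy in path 1 concrete (assuming the x86_64
defaults of 4 KiB pages and 2 MiB pageblocks, so pageblock_nr_pages
== 512 and the threshold is 256):

        free_pages  = 256       /* half the pageblock is free */
        alike_pages = 0         /* UNMOV vs. RECL never count as alike */
        256 + 0 >= 256          /* relabel passes, leaving 256 in-use
                                   UNMOVABLE pages inside a PB now
                                   labeled RECLAIMABLE */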
Tighten both paths:
- Add noncompatible_cross_type() helper that detects the
UNMOV<->RECL pair (MOVABLE may still mix with either since
movable pages can be migrated out).
- In try_to_claim_block(), require a fully-free PB
(free_pages == pageblock_nr_pages) for any cross-type relabel,
regardless of from_tainted_spb. The other-type bit inherited
from the prior label is stale on a fully-free PB (no in-use
pages of either type) so clear it during the relabel rather
than leaving the PB visibly mixed in PB_has_* state.
- In __rmqueue_steal(), pass a new SB_SKIP_CROSS_TYPE flag to
__rmqueue_sb_find_fallback() so the cross-type fallback entry
in fallbacks[] is skipped. Steal then falls through to the
MIGRATE_MOVABLE second fallback instead of single-page-stealing
into a foreign non-movable PB.
The from_tainted_spb=true caller of try_to_claim_block() is
unaffected because it hardcodes block_type=MIGRATE_MOVABLE. The
claim_whole_block() branch (current_order >= pageblock_order) is
also unaffected: it requires PB_all_free, so the PB is fully free
of any prior type.
Test Plan:
Bare-metal devvm with the existing 4 stuck tainted SPBs (sb[2,15,
36,51] in Normal). Build and reboot. Compare per-order free
distribution in newly tainted SPBs against pre-patch baseline:
today order-0/order-1 dominate; the target is meaningful (>10%) free
memory at order >= 3 in pure-RECLAIMABLE SPBs created post-patch.
Watch for tainted SPB count
growth past ~12 (3x current baseline) — the fully-free constraint
on cross-type claims will taint fresh SPBs more often, and a
runaway count means the cost was misjudged. Watch dmesg for
allocation failures, and check that kswapd CPU usage stays under 2
cores. Existing
mixed SPBs from before this change won't unmix; the win is for
SPBs created after.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 111 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 85 insertions(+), 26 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 67cc8165ab1f..ceb1284a63ed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3057,6 +3057,23 @@ static int fallbacks[MIGRATE_PCPTYPES][MIGRATE_PCPTYPES - 1] = {
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE },
};
+/*
+ * UNMOVABLE and RECLAIMABLE allocations should not share the same
+ * pageblock. Their free pages are interchangeable on the buddy free
+ * lists (alike_pages == 0 between them), so once a PB holds both
+ * types the buddy can no longer tell them apart and any sticky
+ * UNMOVABLE pinhole prevents the RECLAIMABLE pages from coalescing
+ * into useful higher-order chunks when they drain back. MOVABLE may
+ * mix with either, since MOVABLE pages can be migrated out.
+ */
+static inline bool noncompatible_cross_type(int start_type, int fallback_type)
+{
+ return (start_type == MIGRATE_UNMOVABLE &&
+ fallback_type == MIGRATE_RECLAIMABLE) ||
+ (start_type == MIGRATE_RECLAIMABLE &&
+ fallback_type == MIGRATE_UNMOVABLE);
+}
+
#ifdef CONFIG_CMA
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order)
@@ -3434,6 +3451,9 @@ try_to_claim_block(struct zone *zone, struct page *page,
bool from_tainted_spb)
{
int free_pages, movable_pages, alike_pages;
+#ifdef CONFIG_COMPACTION
+ struct superpageblock *sb;
+#endif
unsigned long start_pfn;
/*
@@ -3492,35 +3512,48 @@ try_to_claim_block(struct zone *zone, struct page *page,
* allocations. Inside a tainted SPB the protection is unnecessary:
* fragmentation has already been accepted at the SPB level, and
* relabeling is much cheaper than tainting a fresh clean SPB.
- */
- if (from_tainted_spb ||
- free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
- page_group_by_mobility_disabled) {
- __move_freepages_block(zone, start_pfn, block_type, start_type);
- set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
-#ifdef CONFIG_COMPACTION
- /*
- * Track actual page contents in pageblock flags and
- * update superpageblock counters so the SPB moves to
- * the correct fullness list for steering.
- */
- {
- struct page *start_page = pfn_to_page(start_pfn);
- struct superpageblock *sb;
-
- __spb_set_has_type(start_page, start_type);
- if (block_type != start_type)
- __spb_set_has_type(start_page, block_type);
+ *
+ * UNMOVABLE<->RECLAIMABLE cross-type claims override these rules:
+ * once mixed, sticky pinholes of one type prevent the other from
+ * coalescing into useful higher-order free chunks even after drain.
+ * Only relabel a fully-free PB in that case, regardless of whether
+ * the SPB is tainted.
+ */
+ if (noncompatible_cross_type(start_type, block_type)) {
+ if (free_pages != pageblock_nr_pages)
+ return NULL;
+ } else if (!from_tainted_spb &&
+ free_pages + alike_pages < (1 << (pageblock_order-1)) &&
+ !page_group_by_mobility_disabled) {
+ return NULL;
+ }
- sb = pfn_to_superpageblock(zone, start_pfn);
- if (sb)
- spb_update_list(sb);
- }
-#endif
- return __rmqueue_smallest(zone, order, start_type);
+ __move_freepages_block(zone, start_pfn, block_type, start_type);
+ set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
+#ifdef CONFIG_COMPACTION
+ /*
+ * Track actual page contents in pageblock flags and update
+ * superpageblock counters so the SPB moves to the correct
+ * fullness list for steering.
+ *
+ * For cross-type UNMOVABLE<->RECLAIMABLE relabel (which by the
+ * predicate above only fires on a fully-free PB), the inherited
+ * PB_has_<block_type> bit is stale — there are no in-use pages
+ * of that type. Clear it so the resulting PB is unmixed.
+ */
+ __spb_set_has_type(pfn_to_page(start_pfn), start_type);
+ if (block_type != start_type) {
+ if (noncompatible_cross_type(start_type, block_type))
+ __spb_clear_has_type(pfn_to_page(start_pfn), block_type);
+ else
+ __spb_set_has_type(pfn_to_page(start_pfn), block_type);
}
- return NULL;
+ sb = pfn_to_superpageblock(zone, start_pfn);
+ if (sb)
+ spb_update_list(sb);
+#endif
+ return __rmqueue_smallest(zone, order, start_type);
}
/*
@@ -3544,6 +3577,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
#define SB_SEARCH_EMPTY (1 << 1)
#define SB_SEARCH_FALLBACK (1 << 2)
#define SB_SEARCH_ALL (SB_SEARCH_PREFERRED | SB_SEARCH_EMPTY | SB_SEARCH_FALLBACK)
+/*
+ * Skip UNMOVABLE<->RECLAIMABLE cross-type fallback. Used by the steal
+ * path to prevent landing single foreign-type pages into a PB labeled
+ * with the other non-movable type — a steal does not relabel the PB
+ * so cross-type stealing creates permanent mixing.
+ */
+#define SB_SKIP_CROSS_TYPE (1 << 3)
static struct page *
__rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
@@ -3580,6 +3620,10 @@ __rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
int fmt = fallbacks[start_migratetype][i];
struct page *page;
+ if ((search_cats & SB_SKIP_CROSS_TYPE) &&
+ noncompatible_cross_type(start_migratetype, fmt))
+ continue;
+
page = get_page_from_free_area(area,
fmt);
if (page) {
@@ -3601,6 +3645,10 @@ __rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
int fmt = fallbacks[start_migratetype][i];
struct page *page;
+ if ((search_cats & SB_SKIP_CROSS_TYPE) &&
+ noncompatible_cross_type(start_migratetype, fmt))
+ continue;
+
page = get_page_from_free_area(area,
fmt);
if (page) {
@@ -3629,6 +3677,10 @@ __rmqueue_sb_find_fallback(struct zone *zone, unsigned int order,
int fmt = fallbacks[start_migratetype][i];
struct page *page;
+ if ((search_cats & SB_SKIP_CROSS_TYPE) &&
+ noncompatible_cross_type(start_migratetype, fmt))
+ continue;
+
page = get_page_from_free_area(area,
fmt);
if (page) {
@@ -3765,11 +3817,18 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype,
/*
* When ALLOC_NOFRAG_TAINTED_OK is set, only steal from tainted
* SPBs to avoid tainting clean ones. Otherwise search all categories.
+ *
+ * Always skip UNMOVABLE<->RECLAIMABLE cross-type fallback. The steal
+ * path takes a single page without relabeling its PB, so a cross-type
+ * steal would land an UNMOVABLE page in a RECLAIMABLE-labeled PB
+ * (or vice versa) and create permanent mixing. Falling through to
+ * MIGRATE_MOVABLE (the second fallback) is preferable.
*/
if (alloc_flags & ALLOC_NOFRAG_TAINTED_OK)
search_cats = SB_SEARCH_PREFERRED;
else
search_cats = SB_SEARCH_PREFERRED | SB_SEARCH_FALLBACK;
+ search_cats |= SB_SKIP_CROSS_TYPE;
/*
* Search per-superpageblock free lists for fallback migratetypes.
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 27/45] mm: trigger deferred SPB evacuation when atomic allocs would taint a clean SPB
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (25 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 26/45] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 28/45] mm: page_alloc: keep PCP refill in tainted SPBs across owned pageblocks Rik van Riel
` (18 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Hook queue_spb_evacuate() into __rmqueue_claim() so that whenever a
non-movable allocation is about to claim a pageblock from an empty or
clean superpageblock as a fallback (i.e. cat_search[c] is not
SB_SEARCH_PREFERRED), a deferred spb_evacuate_for_order() is scheduled
on the zone's pgdat workqueue.
The current allocation still proceeds and taints the clean SPB this
time, but the deferred evacuation creates free pageblocks inside
existing tainted SPBs so the next caller hitting the same trigger can
claim from the tainted pool instead of tainting another clean SPB.
Movable allocations are excluded because their preferred category is
SB_CLEAN; falling back from clean to tainted does not taint anything
new and so does not need the hint.
The trigger is gated by single-flight, throttle, and tainted-pool
precheck inside queue_spb_evacuate(), so it is safe to fire from this
hot path without storming the workqueue.
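End to end, the deferral pipeline described above looks like this
(illustrative call-flow summary of the diff below, not extra code):

        __rmqueue_claim()                    /* zone->lock held, possibly IRQ */
          -> queue_spb_evacuate()            /* throttle, single-flight bit,
                                                tainted-pool precheck */
            -> llist_add() + irq_work_queue()
              -> spb_evac_irq_work_fn()      /* out of allocator lock context */
                -> queue_work(pgdat->evacuate_wq)
                  -> spb_evac_work_fn()
                    -> spb_evacuate_for_order()  /* process context, can sleep */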
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 18 ++++
mm/page_alloc.c | 189 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 206 insertions(+), 1 deletion(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 765e1c5dc365..195a80e2f0ee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1139,6 +1139,22 @@ struct zone {
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
+
+ /*
+ * Atomic-context SPB evacuation deferral state.
+ *
+ * spb_evac_in_flight: bitmap indexed by
+ * migratetype * NR_PAGE_ORDERS + order, set on enqueue and
+ * cleared by the worker after spb_evacuate_for_order returns.
+ * Provides single-flight gating per (migratetype, order).
+ *
+ * spb_evac_last: jiffies of the last enqueue per migratetype,
+ * used as a 10ms throttle to prevent wakeup storms from
+ * concurrent atomic allocations.
+ */
+ DECLARE_BITMAP(spb_evac_in_flight,
+ MIGRATE_PCPTYPES * NR_PAGE_ORDERS);
+ unsigned long spb_evac_last[MIGRATE_PCPTYPES];
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
@@ -1552,6 +1568,8 @@ typedef struct pglist_data {
struct task_struct *kcompactd;
bool proactive_compact_trigger;
struct workqueue_struct *evacuate_wq;
+ struct llist_head spb_evac_pending;
+ struct irq_work spb_evac_irq_work;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ceb1284a63ed..f0fdfe8c9a45 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -788,6 +788,8 @@ static struct page *spb_try_alloc_contig(struct zone *zone,
gfp_t gfp_mask);
static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int migratetype);
+static void queue_spb_evacuate(struct zone *zone, unsigned int order,
+ int migratetype);
#else
static inline void spb_maybe_start_defrag(struct superpageblock *sb) {}
static inline bool spb_needs_defrag(struct superpageblock *sb) { return false; }
@@ -802,6 +804,8 @@ static inline bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
{
return false;
}
+static inline void queue_spb_evacuate(struct zone *zone, unsigned int order,
+ int migratetype) {}
#endif
static void spb_update_list(struct superpageblock *sb)
@@ -3784,6 +3788,18 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
if (!page)
continue;
+ /*
+ * About to claim from an empty or clean superpageblock
+ * for a non-movable allocation -- this taints a fresh
+ * SPB. Defer an evacuation pass over the tainted pool
+ * so subsequent allocations can reclaim freed
+ * pageblocks instead of repeating this fallback.
+ */
+ if (cat_search[c] != SB_SEARCH_PREFERRED &&
+ start_migratetype != MIGRATE_MOVABLE)
+ queue_spb_evacuate(zone, order,
+ start_migratetype);
+
page = try_to_claim_block(zone, page, current_order,
order, start_migratetype,
fallback_mt, alloc_flags,
@@ -8728,6 +8744,168 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn,
putback_movable_pages(&cc.migratepages);
}
+/*
+ * Atomic-context SPB evacuation deferral.
+ *
+ * When an atomic allocation in __rmqueue_claim is about to taint a
+ * clean superpageblock because the tainted pool has no free page at
+ * the requested (order, migratetype), schedule a deferred call to
+ * spb_evacuate_for_order. That frees pageblocks inside tainted SPBs so
+ * subsequent allocations can claim them instead of tainting more clean
+ * SPBs.
+ *
+ * Two-step deferral mirrors the pageblock-evacuate path: irq_work to
+ * leave allocator lock context, then queue_work to reach process
+ * context where spb_evacuate_for_order can sleep in migrate_pages.
+ */
+
+struct spb_evac_request {
+ struct work_struct work;
+ struct zone *zone;
+ unsigned int order;
+ int migratetype;
+ struct llist_node free_node;
+};
+
+#define NR_SPB_EVAC_REQUESTS 64
+static struct spb_evac_request spb_evac_pool[NR_SPB_EVAC_REQUESTS];
+static struct llist_head spb_evac_freelist;
+
+static struct spb_evac_request *spb_evac_request_alloc(void)
+{
+ struct llist_node *node;
+
+ node = llist_del_first(&spb_evac_freelist);
+ if (!node)
+ return NULL;
+ return container_of(node, struct spb_evac_request, free_node);
+}
+
+static void spb_evac_request_free(struct spb_evac_request *req)
+{
+ llist_add(&req->free_node, &spb_evac_freelist);
+}
+
+static void spb_evac_work_fn(struct work_struct *work)
+{
+ struct spb_evac_request *req = container_of(work,
+ struct spb_evac_request,
+ work);
+ struct zone *zone = req->zone;
+ unsigned int order = req->order;
+ int mt = req->migratetype;
+
+ spb_evacuate_for_order(zone, order, mt);
+
+ /*
+ * Clearing the in-flight bit lets a future caller hitting the
+ * same (mt, order) re-enqueue evacuation. Ordering between this
+ * worker's SPB state changes and the future caller's
+ * tainted_pool_has_free walk is provided by zone->lock taken
+ * inside spb_evacuate_for_order and by the future caller.
+ */
+ clear_bit(mt * NR_PAGE_ORDERS + order, zone->spb_evac_in_flight);
+ spb_evac_request_free(req);
+}
+
+static void spb_evac_irq_work_fn(struct irq_work *work)
+{
+ pg_data_t *pgdat = container_of(work, pg_data_t,
+ spb_evac_irq_work);
+ struct llist_node *pending;
+ struct spb_evac_request *req, *next;
+
+ if (!pgdat->evacuate_wq)
+ return;
+
+ pending = llist_del_all(&pgdat->spb_evac_pending);
+ llist_for_each_entry_safe(req, next, pending, free_node) {
+ INIT_WORK(&req->work, spb_evac_work_fn);
+ queue_work(pgdat->evacuate_wq, &req->work);
+ }
+}
+
+/*
+ * Walk tainted SPBs to check whether any has a free page at the given
+ * order and migratetype. When this returns true, a clean-SPB claim is
+ * not pool depletion but a try_to_claim_block over-rejection: skip the
+ * deferred evacuation since it cannot help.
+ */
+static bool tainted_pool_has_free(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ struct superpageblock *sb;
+ int full;
+
+ lockdep_assert_held(&zone->lock);
+
+ for (full = 0; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb, &zone->spb_lists[SB_TAINTED][full],
+ list) {
+ struct free_area *fa = &sb->free_area[order];
+
+ if (fa->nr_free &&
+ !list_empty(&fa->free_list[migratetype]))
+ return true;
+ }
+ }
+ return false;
+}
+
+/**
+ * queue_spb_evacuate - schedule deferred SPB evacuation from atomic context
+ * @zone: zone that just failed to find a free page in the tainted pool
+ * @order: requested allocation order
+ * @migratetype: requested migratetype (UNMOVABLE or RECLAIMABLE only)
+ *
+ * Caller must hold zone->lock; the tainted-pool walk asserts it.
+ *
+ * Single-flight gated per (zone, migratetype, order) and throttled to
+ * one enqueue per 10ms per (zone, migratetype). Pool exhaustion
+ * silently drops the request; the next caller hitting the same trigger
+ * will retry.
+ */
+static void queue_spb_evacuate(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ pg_data_t *pgdat = zone->zone_pgdat;
+ struct spb_evac_request *req;
+ unsigned int bit;
+
+ lockdep_assert_held(&zone->lock);
+
+ if (!pgdat->spb_evac_irq_work.func)
+ return;
+ if (order >= NR_PAGE_ORDERS || migratetype >= MIGRATE_PCPTYPES)
+ return;
+
+ if (time_before(jiffies,
+ zone->spb_evac_last[migratetype] + HZ / 100))
+ return;
+
+ bit = migratetype * NR_PAGE_ORDERS + order;
+ if (test_and_set_bit(bit, zone->spb_evac_in_flight))
+ return;
+
+ if (tainted_pool_has_free(zone, order, migratetype)) {
+ clear_bit(bit, zone->spb_evac_in_flight);
+ return;
+ }
+
+ req = spb_evac_request_alloc();
+ if (!req) {
+ clear_bit(bit, zone->spb_evac_in_flight);
+ return;
+ }
+
+ zone->spb_evac_last[migratetype] = jiffies;
+ req->zone = zone;
+ req->order = order;
+ req->migratetype = migratetype;
+ llist_add(&req->free_node, &pgdat->spb_evac_pending);
+ irq_work_queue(&pgdat->spb_evac_irq_work);
+}
+
/*
* Background superpageblock defragmentation.
*
@@ -9202,7 +9380,12 @@ static void spb_maybe_start_defrag(struct superpageblock *sb)
static int __init pageblock_evacuate_init(void)
{
- int nid;
+ int nid, i;
+
+ /* Initialize the global freelist of SPB evacuate requests */
+ init_llist_head(&spb_evac_freelist);
+ for (i = 0; i < NR_SPB_EVAC_REQUESTS; i++)
+ llist_add(&spb_evac_pool[i].free_node, &spb_evac_freelist);
/* Create a per-pgdat workqueue */
for_each_online_node(nid) {
@@ -9217,6 +9400,10 @@ static int __init pageblock_evacuate_init(void)
continue;
}
+ init_llist_head(&pgdat->spb_evac_pending);
+ init_irq_work(&pgdat->spb_evac_irq_work,
+ spb_evac_irq_work_fn);
+
/* Initialize per-superpageblock defrag work structs */
for (z = 0; z < MAX_NR_ZONES; z++) {
struct zone *zone = &pgdat->node_zones[z];
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 28/45] mm: page_alloc: keep PCP refill in tainted SPBs across owned pageblocks
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (26 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 27/45] mm: trigger deferred SPB evacuation when atomic allocs would taint a clean SPB Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 29/45] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback Rik van Riel
` (17 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
rmqueue_bulk Phase 2 walks SB_TAINTED superpageblocks looking for
sub-pageblock free fragments, so PCP refill can be satisfied without
tainting a clean SPB. The original Phase 2 abandons a candidate
pageblock entirely if pbd->cpu != 0 (already owned by some CPU), to
avoid two CPUs holding PCPBuddy pages from the same pageblock — which
would let the PCP merge pass corrupt the other CPU's PCP list.
On systems with many CPUs (88+) and many tainted SPBs (~50% on a 16
GiB devvm under stress), nearly every free fragment in a tainted SPB
lives in a pageblock already PCPBuddy-owned by some CPU. Phase 2 skips
through the entire SPB without finding anything usable, the atomic
alloc falls through to the slowpath, and clean SPBs get tainted.
Take the page anyway when the source pageblock is owned, but skip the
ownership claim and PCPBuddy marking. Phase 3 / __rmqueue_smallest
already pull plain non-PCPBuddy pages from owned pageblocks the same
way; the hazard is specifically about two CPUs holding PCPBuddy pages
from the same pageblock, not about a plain non-PCPBuddy page coexisting
with another CPU's PCPBuddy entries. Phase 0 (owned-block recovery) is
only meaningful when we actually claimed ownership, so register the
block on owned_blocks only when it was unowned and we claimed it
(claim_pb).
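Schematically, the behavioural delta in Phase 2 (simplified; the real
hunk follows):

        /* before */
        if (pbd->cpu != 0)
                continue;               /* skip: pageblock already owned */

        /* after */
        claim_pb = (pbd->cpu == 0);     /* take the page either way... */
        ...
        if (claim_pb) {                 /* ...but only claim ownership and
                                           mark PCPBuddy when the block
                                           was previously unowned */
                set_pcpblock_owner(page, cpu);
                __SetPagePCPBuddy(page);
        }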
Fixes: 266461cd5442 ("mm: page_alloc: adopt partial pageblocks from tainted superpageblocks")
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 50 ++++++++++++++++++++++++++++---------------------
1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f0fdfe8c9a45..a09660a06ed3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4133,6 +4133,7 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
&zone->spb_lists[SB_TAINTED][full], list) {
struct page *page;
int found_order = -1;
+ bool claim_pb;
if (sb->nr_free_pages < pageblock_nr_pages / 4)
continue;
@@ -4156,33 +4157,39 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
continue;
/*
- * Check that this pageblock isn't already
- * owned by another CPU. If it is, two CPUs
- * would have PCPBuddy pages from the same
- * pageblock, and the PCP merge pass could
- * corrupt the other CPU's PCP list.
+ * Found a free fragment in a tainted SPB. Take
+ * it from the buddy.
+ *
+ * If the source pageblock is unowned, claim it:
+ * mark our pages PagePCPBuddy and register the
+ * block on owned_blocks so Phase 0 can recover
+ * remaining fragments on future refills.
+ *
+ * If the source pageblock is already owned by
+ * some CPU (us or another), take the page as a
+ * plain non-PCPBuddy fragment — the same way
+ * Phase 3 / __rmqueue_smallest would. Setting
+ * PagePCPBuddy here would let two CPUs hold
+ * PCPBuddy pages from the same pageblock, and
+ * the PCP merge pass could then corrupt the
+ * other CPU's PCP list.
+ *
+ * Set PB_has_<migratetype> either way (bypasses
+ * page_del_and_expand which normally does the
+ * PB_has tracking); idempotent if already set.
*/
pbd = pfn_to_pageblock(page,
page_to_pfn(page));
- if (pbd->cpu != 0)
- continue;
+ claim_pb = (pbd->cpu == 0);
- /*
- * Found a free chunk in an unowned pageblock.
- * Take it from buddy, claim ownership, and
- * set PCPBuddy. Pass 0 will grab remaining
- * buddy entries on future refills.
- *
- * Set PB_has_<migratetype> since we bypass
- * page_del_and_expand (which normally does
- * PB_has tracking).
- */
del_page_from_free_list(page, zone,
found_order,
migratetype);
__spb_set_has_type(page, migratetype);
- set_pcpblock_owner(page, cpu);
- __SetPagePCPBuddy(page);
+ if (claim_pb) {
+ set_pcpblock_owner(page, cpu);
+ __SetPagePCPBuddy(page);
+ }
pcp_enqueue_tail(pcp, page, migratetype,
found_order);
refilled += 1 << found_order;
@@ -4190,9 +4197,10 @@ static bool rmqueue_bulk(struct zone *zone, unsigned int order,
/*
* Register for Phase 0 recovery so future
* drains from this pageblock can be swept
- * back efficiently.
+ * back efficiently. Only meaningful when we
+ * actually claimed ownership above.
*/
- if (list_empty(&pbd->cpu_node))
+ if (claim_pb && list_empty(&pbd->cpu_node))
list_add(&pbd->cpu_node,
&pcp->owned_blocks);
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread* [RFC PATCH 29/45] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (27 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 28/45] mm: page_alloc: keep PCP refill in tainted SPBs across owned pageblocks Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 30/45] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure Rik van Riel
` (16 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The two reverted bail-out gates from commit 2550578e3408 ("mm:
page_alloc: refuse to taint clean SPBs for best-effort allocations") and
commit 3d5f94a1bbe2 ("mm: page_alloc: gate clean-SPB taint bail-out on
ALLOC_HIGHATOMIC")
returned NULL from get_page_from_freelist's slowpath retry to keep
atomic-shape allocations from tainting clean SPBs. That gate broke
early boot in QEMU: cred_init's slab cache creation reaches the slowpath
with gfp = __GFP_COMP (gfp_allowed_mask = GFP_BOOT_MASK strips
__GFP_RECLAIM from GFP_KERNEL during boot), has no fallback path, and
panics when the gate refuses the allocation.
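For reference on why the mask collapses like that (using the mainline
definitions, assuming this series leaves them untouched): GFP_KERNEL is
__GFP_RECLAIM | __GFP_IO | __GFP_FS, and GFP_BOOT_MASK is
__GFP_BITS_MASK & ~(__GFP_RECLAIM | __GFP_IO | __GFP_FS), so masking
with gfp_allowed_mask during early boot clears every GFP_KERNEL bit and
only the slab-added __GFP_COMP survives: an allocation with no reclaim
rights and, with the reverted gates in place, no permission to taint
either.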
Replace the gate with a finer-grained refusal anchored in __rmqueue,
where the SPB-aware free-list walk already runs:
- Add ALLOC_HIGHORDER_OPTIONAL, set in gfp_to_alloc_flags() for two
shapes:
1. Explicit fallback declaration: __GFP_NORETRY without
__GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
skb_page_frag_refill on full sockets, etc.
2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no
__GFP_NOMEMALLOC, no __GFP_NOFAIL. Catches GFP_ATOMIC,
GFP_NOWAIT, including ALLOC_HIGHATOMIC consumers (which still
get a second crack at the dedicated MIGRATE_HIGHATOMIC reserve
in rmqueue_buddy after __rmqueue returns NULL).
__GFP_MEMALLOC and __GFP_NOFAIL never get the flag — they must
succeed even at the cost of fresh-SPB taint (a short classification
sketch follows this list).
- Add struct spb_tainted_walk to record what __rmqueue_smallest's
Pass 1 saw on the SB_TAINTED list (any free pages, any free PB,
below-reserve pageblock count). Thread it through the function's
new fourth argument; non-walking call sites pass NULL.
- In __rmqueue, allocate the walk on the stack for callers with
ALLOC_HIGHORDER_OPTIONAL set on a non-movable, non-CMA migratetype.
Force *mode back to RMQUEUE_NORMAL on every call so rmqueue_bulk
Phase 3 can't reuse a memoised RMQUEUE_CLAIM/STEAL state to skip
the gate across iterations.
- After __rmqueue_smallest returns NULL, check the walk: if a tainted
SPB has free pages or a free pageblock that could absorb this
allocation after evacuation, return NULL and bump
SPB_HIGHORDER_REFUSED. Skip RMQUEUE_CLAIM and RMQUEUE_STEAL
entirely (both can taint clean SPBs). The slowpath will eventually
drop NOFRAGMENT and let the allocation proceed only for the
callers that lack ALLOC_HIGHORDER_OPTIONAL — i.e. the truly
must-not-fail consumers.
- Before falling through to Pass 3 (empty SPBs) inside
__rmqueue_smallest, kick queue_spb_evacuate() when the walk saw a
tainted SPB below its reserve threshold, so future allocations
have a movable-evicted home in an already-tainted SPB.
- Add SPB_HIGHORDER_REFUSED vm event counter (events, not refused
allocations: a single high-level alloc that retries can be counted
multiple times across per-zone attempts).
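To make the two ALLOC_HIGHORDER_OPTIONAL shapes concrete, here is the
classification written as a standalone predicate (a restatement of the
gfp_to_alloc_flags() hunk further down, not an additional change; the
helper name is made up for the example), with a few common masks worked
out:

	static bool gfp_highorder_optional(gfp_t gfp_mask)
	{
		/* Shape 1: explicit fallback declaration. */
		if ((gfp_mask & __GFP_NORETRY) && !(gfp_mask & __GFP_RETRY_MAYFAIL))
			return true;
		/* Shape 2: atomic shape. */
		return !(gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_NOMEMALLOC |
				     __GFP_NOFAIL));
	}

	GFP_KERNEL                  -> no  (has __GFP_DIRECT_RECLAIM)
	GFP_KERNEL | __GFP_NORETRY  -> yes (shape 1: THP, slab high-order refill)
	GFP_ATOMIC, GFP_NOWAIT      -> yes (shape 2: caller can drop or degrade)
	GFP_NOFS | __GFP_NOFAIL     -> no  (declared cannot fail)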
The early-boot SB_TAINTED list is empty, so the walk records nothing,
the refusal does not engage, and __rmqueue falls through to
RMQUEUE_CLAIM which taints the first SPB normally (the first taint is
unavoidable). cred_init's slab create succeeds, boot succeeds.
Tested in a 16 GB QEMU VM under combined sb-stress + UDP-loopback +
fork/mmap storms (~480s); 2 tainted Normal SPBs out of 13 (boot
baseline 1, +1 during stress); 11 clean SPBs distributed movable load;
no kernel BUG, oops, hang, or panic.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/vm_event_item.h | 5 ++
mm/internal.h | 1 +
mm/page_alloc.c | 115 ++++++++++++++++++++++++++++++++--
mm/vmstat.c | 1 +
4 files changed, 116 insertions(+), 6 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 22a139f82d75..3de6ca1e9c56 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -89,6 +89,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
CMA_ALLOC_SUCCESS,
CMA_ALLOC_FAIL,
#endif
+ SPB_HIGHORDER_REFUSED, /*
+ * refused fragmenting fallback to keep
+ * a clean SPB clean when a tainted SPB
+ * still has free pageblocks
+ */
UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
UNEVICTABLE_PGRESCUED, /* rescued from noreclaim list */
diff --git a/mm/internal.h b/mm/internal.h
index f641795688af..71e39414645f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1414,6 +1414,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
#define ALLOC_TRYLOCK 0x400 /* Only use spin_trylock in allocation path */
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
#define ALLOC_NOFRAG_TAINTED_OK 0x1000 /* NOFRAGMENT, but allow steal from tainted SPBs */
+#define ALLOC_HIGHORDER_OPTIONAL 0x2000 /* caller can fall back to a lower order */
/* Flags that allow allocations below the min watermark. */
#define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a09660a06ed3..9305b36f52a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2783,9 +2783,21 @@ static struct page *try_to_claim_block(struct zone *zone, struct page *page,
int block_type, unsigned int alloc_flags,
bool from_tainted_spb);
+/*
+ * Snapshot of tainted-SPB state observed while __rmqueue_smallest walks the
+ * free lists. Lets the caller (currently __rmqueue) decide whether to refuse
+ * a fragmenting fallback when an existing tainted SPB could absorb the demand
+ * once it is evacuated.
+ */
+struct spb_tainted_walk {
+ bool saw_free_pages; /* tainted SPB has any free pages, any order */
+ bool saw_free_pb; /* tainted SPB has at least one free pageblock */
+ bool saw_below_reserve; /* tainted SPB has nr_free <= spb_tainted_reserve */
+};
+
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
- int migratetype)
+ int migratetype, struct spb_tainted_walk *walk)
{
unsigned int current_order;
struct free_area *area;
@@ -2834,6 +2846,20 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
+ /*
+ * Snapshot tainted-SPB capacity before the
+ * nr_free_pages skip: an SPB with a free pageblock
+ * but nothing on the requested-MT freelist still
+ * counts as "could absorb this allocation after evac".
+ */
+ if (walk && cat == SB_TAINTED) {
+ if (sb->nr_free_pages)
+ walk->saw_free_pages = true;
+ if (sb->nr_free)
+ walk->saw_free_pb = true;
+ if (sb->nr_free <= spb_tainted_reserve(sb))
+ walk->saw_below_reserve = true;
+ }
if (!sb->nr_free_pages)
continue;
/* Try whole pageblock (or larger) first for PCP buddy */
@@ -2959,6 +2985,16 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
+ /*
+ * About to fall through to Pass 3 (empty SPBs) or Pass 4 fallback,
+ * which risks tainting a clean SPB. If the tainted-SPB walk above
+ * showed that some tainted SPB is below its reserve threshold of
+ * free pageblocks, kick deferred evacuation so future allocations
+ * have a movable-evicted home in an already-tainted SPB.
+ */
+ if (walk && walk->saw_below_reserve)
+ queue_spb_evacuate(zone, order, migratetype);
+
/* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
if (!sb->nr_free_pages)
@@ -3082,7 +3118,7 @@ static inline bool noncompatible_cross_type(int start_type, int fallback_type)
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order)
{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
+ return __rmqueue_smallest(zone, order, MIGRATE_CMA, NULL);
}
#else
static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
@@ -3557,7 +3593,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
if (sb)
spb_update_list(sb);
#endif
- return __rmqueue_smallest(zone, order, start_type);
+ return __rmqueue_smallest(zone, order, start_type, NULL);
}
/*
@@ -3904,8 +3940,29 @@ static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
unsigned int alloc_flags, enum rmqueue_mode *mode)
{
+ struct spb_tainted_walk walk = { };
+ struct spb_tainted_walk *walkp = NULL;
struct page *page;
+ /*
+ * Track tainted-SPB state for non-movable, non-CMA callers that
+ * signaled they have a cheap fallback (atomic shape or explicit
+ * NORETRY). We use that to refuse a fragmenting CLAIM/STEAL when a
+ * tainted SPB still has free pageblocks waiting to be evacuated.
+ *
+ * Force *mode back to RMQUEUE_NORMAL so the walk + refusal check
+ * runs on every call. rmqueue_bulk Phase 3 chains many __rmqueue
+ * calls reusing *mode; without this reset, a single successful
+ * RMQUEUE_CLAIM/STEAL on the first iteration would let every
+ * subsequent iteration skip the case RMQUEUE_NORMAL block and taint
+ * additional clean SPBs unchecked.
+ */
+ if (migratetype != MIGRATE_MOVABLE && !is_migrate_cma(migratetype) &&
+ (alloc_flags & ALLOC_HIGHORDER_OPTIONAL)) {
+ walkp = &walk;
+ *mode = RMQUEUE_NORMAL;
+ }
+
if (IS_ENABLED(CONFIG_CMA)) {
/*
* Balance movable allocations between regular and CMA areas by
@@ -3932,9 +3989,22 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
*/
switch (*mode) {
case RMQUEUE_NORMAL:
- page = __rmqueue_smallest(zone, order, migratetype);
+ page = __rmqueue_smallest(zone, order, migratetype, walkp);
if (page)
return page;
+ /*
+ * Refuse to fragment a clean SPB when a tainted SPB already
+ * holds free pages or a free pageblock that could absorb
+ * this allocation after evacuation. The caller has a cheap
+ * fallback (lower-order retry, vmalloc, single-page fragment,
+ * drop the packet, etc.) — better that than tainting fresh
+ * capacity. Pre-Pass-3 evac trigger in __rmqueue_smallest
+ * already kicked deferred eviction.
+ */
+ if (walkp && (walk.saw_free_pages || walk.saw_free_pb)) {
+ count_vm_event(SPB_HIGHORDER_REFUSED);
+ return NULL;
+ }
fallthrough;
case RMQUEUE_CMA:
if (alloc_flags & ALLOC_CMA) {
@@ -4973,7 +5043,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
spin_lock_irqsave(&zone->lock, flags);
}
if (alloc_flags & ALLOC_HIGHATOMIC)
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ page = __rmqueue_smallest(zone, order,
+ MIGRATE_HIGHATOMIC, NULL);
if (!page) {
enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
@@ -4986,7 +5057,9 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
* high-order atomic allocation in the future.
*/
if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
- page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+ page = __rmqueue_smallest(zone, order,
+ MIGRATE_HIGHATOMIC,
+ NULL);
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
@@ -6302,6 +6375,36 @@ gfp_to_alloc_flags(gfp_t gfp_mask, unsigned int order)
if (defrag_mode)
alloc_flags |= ALLOC_NOFRAGMENT;
+ /*
+ * Mark callers that have a cheap fallback if the page allocator returns
+ * NULL, so __rmqueue can refuse to taint a clean SPB when an existing
+ * tainted SPB still has free pageblocks waiting to be evacuated.
+ *
+ * Two shapes qualify:
+ *
+ * 1. Explicit fallback declaration: __GFP_NORETRY without
+ * __GFP_RETRY_MAYFAIL. Used by THP, slab high-order refill,
+ * skb_page_frag_refill on full sockets, etc.
+ *
+ * 2. Atomic-context shape: no __GFP_DIRECT_RECLAIM, no __GFP_NOMEMALLOC,
+ * no __GFP_NOFAIL. These callers (GFP_ATOMIC, GFP_NOWAIT, including
+ * ALLOC_HIGHATOMIC consumers) have implicit fallbacks: drop the
+ * packet, demote the slab order, return ENOMEM up the slowpath,
+ * retry from process context with GFP_KERNEL, etc. ALLOC_HIGHATOMIC
+ * callers also get a second crack at the dedicated MIGRATE_HIGHATOMIC
+ * reserve in rmqueue_buddy after __rmqueue returns NULL.
+ * Tainting a 1 GiB SPB to satisfy any of them is a long-lived
+ * fragmentation event for short-lived data.
+ *
+ * __GFP_MEMALLOC (reclaim recursion) and __GFP_NOFAIL (declared cannot
+ * fail) are excluded — they must succeed even at the cost of taint.
+ */
+ if ((gfp_mask & __GFP_NORETRY) && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+ alloc_flags |= ALLOC_HIGHORDER_OPTIONAL;
+ else if (!(gfp_mask & (__GFP_DIRECT_RECLAIM | __GFP_NOMEMALLOC |
+ __GFP_NOFAIL)))
+ alloc_flags |= ALLOC_HIGHORDER_OPTIONAL;
+
return alloc_flags;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 32027b8c0526..8a6c9120d325 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1385,6 +1385,7 @@ const char * const vmstat_text[] = {
[I(CMA_ALLOC_SUCCESS)] = "cma_alloc_success",
[I(CMA_ALLOC_FAIL)] = "cma_alloc_fail",
#endif
+ [I(SPB_HIGHORDER_REFUSED)] = "spb_highorder_refused",
[I(UNEVICTABLE_PGCULLED)] = "unevictable_pgs_culled",
[I(UNEVICTABLE_PGSCANNED)] = "unevictable_pgs_scanned",
[I(UNEVICTABLE_PGRESCUED)] = "unevictable_pgs_rescued",
--
2.52.0
* [RFC PATCH 30/45] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (28 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 29/45] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback Rik van Riel
@ 2026-04-30 20:20 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 31/45] mm: page_alloc: cross-non-movable buddy borrow within tainted SPBs Rik van Riel
` (15 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:20 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The ALLOC_HIGHORDER_OPTIONAL refusal gate from
commit 96f17c6b8398 ("mm: page_alloc: refuse fragmenting fallback for
callers with cheap fallback") prevents
fragmenting fallbacks for atomic-shape callers, but it can only refuse
allocations that have a cheap fallback. GFP_KERNEL slab callers
(dentry/inode/page-table caches) have no such fallback and reach
__rmqueue_claim/_steal whenever the tainted-SPB pool runs out of
headroom. Without an external pressure release valve, sustained slab
growth eventually drains the tainted pool, clean SPBs start absorbing
taints one by one, and fragmentation grows until it reaches equilibrium
at a much higher tainted-SPB count than the workload's memory footprint
warrants.
Live experiment on a 247 GB devvm under the syz-VM + edenfs workload
showed the failure mode clearly: tainted Normal SPBs climbed from the
boot baseline of 8 to 85 during an 8-minute burst as 18 syzkaller VMs
spun up and btrfs_inode/dentry caches grew past the existing tainted
pool capacity. Once at 85 (with about 25 GB of cached slab) the system
plateaued: existing tainted SPBs had absorbed enough demand that no
more taints occurred — but the equilibrium was over 2x what packing
35 GB of slab into 1 GB tainted SPBs ought to need.
The pageblock-evacuation worker
(spb_evacuate_for_order/queue_spb_evacuate) already runs from these
pressure points, but it can only consolidate movable pages out of
tainted SPBs. Slab content stranded in tainted SPBs blocks free
pageblocks from re-coalescing and forces new taints when movable
supply runs out.
Add a parallel slab-shrink mechanism alongside the evacuation
infrastructure: a single per-pgdat work item on the evacuation
workqueue, queued directly with queue_work() (which takes only the
workqueue's pool lock, so it is safe under zone->lock), gated by
single-flight semantics and a 100ms throttle.
The worker calls shrink_slab() with the zone's nid, walking
node-local shrinkers from DEF_PRIORITY toward 0 until either no
shrinker reports progress or a pageblock-sized batch of objects has
been freed.
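For a sense of per-run scale (rough numbers: 2 MB pageblocks, so
pageblock_nr_pages = 512, and mainline's do_shrink_slab() scan scaling
of roughly freeable >> priority): the first iteration at DEF_PRIORITY
(12) asks each shrinker to scan about 1/4096 of its freeable objects,
and the loop stops once shrinkers stop making progress or about
4 * 512 = 2048 objects have been freed, which for typical dentry/inode
object sizes is on the order of one pageblock of slab memory per
worker run.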
Wire three trigger sites:
1. __rmqueue_smallest pre-Pass-3 — alongside the existing
queue_spb_evacuate trigger when the spb_tainted_walk reports
saw_below_reserve. Demand-side signal: an allocation just couldn't
find space in tainted, and tainted is below its reserve.
2. __rmqueue_claim — alongside the existing queue_spb_evacuate when
a non-movable claim is about to taint a clean SPB. Same demand
signal as (1) but caught one layer down.
3. End of spb_evacuate_for_order — fired unconditionally, even when
the movable evacuation pass succeeded. Supply-side trigger: keeps
headroom available for the next burst, when the movable supply
may have run out and movable evac alone would have nothing to do.
shrink_slab is location-agnostic — it doesn't know about SPBs — but
since most slab pages live in already-tainted SPBs (that is where they
were allocated), the freed pages naturally land back in the tainted
pool, restoring headroom without spreading the taint to clean SPBs.
Speed control is implicit: trigger frequency tracks evacuation
frequency, so reclaim rate matches allocation rate. Per-invocation
aggressiveness ramps via decreasing priority. No new sysctls or
watermarks are introduced; the 100ms throttle is the only tunable.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 9 +++
include/linux/vm_event_item.h | 5 ++
mm/page_alloc.c | 138 +++++++++++++++++++++++++++++++++-
mm/vmstat.c | 2 +
4 files changed, 151 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 195a80e2f0ee..acaff292140f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1570,6 +1570,15 @@ typedef struct pglist_data {
struct workqueue_struct *evacuate_wq;
struct llist_head spb_evac_pending;
struct irq_work spb_evac_irq_work;
+
+ /*
+ * SPB-driven slab reclaim: single work item per pgdat (shrink_slab
+ * is node-scoped, so one work in-flight per node is the max), with
+ * a 100ms throttle. queue_work() gives us single-flight semantics
+ * for free.
+ */
+ struct work_struct spb_slab_shrink_work;
+ unsigned long spb_slab_shrink_last;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3de6ca1e9c56..5a560014ab49 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -94,6 +94,11 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
* a clean SPB clean when a tainted SPB
* still has free pageblocks
*/
+ SPB_SLAB_SHRINK_QUEUED, /*
+ * queued a deferred slab shrink to
+ * reclaim space inside tainted SPBs
+ */
+ SPB_SLAB_SHRINK_RAN, /* slab shrink worker ran a pass */
UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
UNEVICTABLE_PGRESCUED, /* rescued from noreclaim list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9305b36f52a6..a72cb2da606d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -790,6 +790,7 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int migratetype);
static void queue_spb_evacuate(struct zone *zone, unsigned int order,
int migratetype);
+static void queue_spb_slab_shrink(struct zone *zone);
#else
static inline void spb_maybe_start_defrag(struct superpageblock *sb) {}
static inline bool spb_needs_defrag(struct superpageblock *sb) { return false; }
@@ -806,6 +807,7 @@ static inline bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
}
static inline void queue_spb_evacuate(struct zone *zone, unsigned int order,
int migratetype) {}
+static inline void queue_spb_slab_shrink(struct zone *zone) {}
#endif
static void spb_update_list(struct superpageblock *sb)
@@ -2991,9 +2993,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* showed that some tainted SPB is below its reserve threshold of
* free pageblocks, kick deferred evacuation so future allocations
* have a movable-evicted home in an already-tainted SPB.
+ *
+ * Queue slab shrink alongside evacuation: even when movable evac
+ * succeeds, shrinking slab in parallel keeps headroom available
+ * for the next burst, when the movable supply may have run out.
*/
- if (walk && walk->saw_below_reserve)
+ if (walk && walk->saw_below_reserve) {
queue_spb_evacuate(zone, order, migratetype);
+ queue_spb_slab_shrink(zone);
+ }
/* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
@@ -3829,12 +3837,17 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
* for a non-movable allocation -- this taints a fresh
* SPB. Defer an evacuation pass over the tainted pool
* so subsequent allocations can reclaim freed
- * pageblocks instead of repeating this fallback.
+ * pageblocks instead of repeating this fallback. Also
+ * kick a slab shrink so the tainted pool gets fresh
+ * headroom (movable evac alone can't free pages held
+ * by slab).
*/
if (cat_search[c] != SB_SEARCH_PREFERRED &&
- start_migratetype != MIGRATE_MOVABLE)
+ start_migratetype != MIGRATE_MOVABLE) {
queue_spb_evacuate(zone, order,
start_migratetype);
+ queue_spb_slab_shrink(zone);
+ }
page = try_to_claim_block(zone, page, current_order,
order, start_migratetype,
@@ -9017,6 +9030,111 @@ static void queue_spb_evacuate(struct zone *zone, unsigned int order,
irq_work_queue(&pgdat->spb_evac_irq_work);
}
+/*
+ * SPB-driven slab reclaim.
+ *
+ * When tainted SPBs run low on free pageblocks under sustained
+ * non-movable pressure (slab inode/dentry/page-table caches), the
+ * pageblock-evacuation worker can only consolidate *movable* pages out
+ * of tainted SPBs. Non-movable slab content stays put, so once the
+ * movable supply is drained the only way to recover headroom in a
+ * tainted SPB is to shrink the slab caches whose pages live there.
+ *
+ * shrink_slab() is node-scoped, so one work item per pgdat is enough:
+ * a single embedded work_struct, gated by a 100ms throttle.
+ * queue_work() returns false if the work is already queued/running, so
+ * we get single-flight for free.
+ *
+ * shrink_slab() itself is location-agnostic — it walks all registered
+ * shrinkers and frees objects whose backing pages may live in any
+ * zone or SPB. That is fine here because any slab page reclaimed
+ * frees space the next allocation can reuse without tainting a fresh
+ * SPB. We pass the pgdat's nid so node-aware shrinkers prefer caches
+ * local to the pressured node.
+ */
+
+/*
+ * Per-invocation budget: walk shrinkers from DEF_PRIORITY (scan 1/4096
+ * of each cache) down toward 0 (full scan), stopping when shrinkers
+ * report no more progress or we have freed a pageblock-sized chunk.
+ * The trigger frequency is what controls overall reclaim rate; this
+ * loop just bounds latency per worker run.
+ */
+#define SPB_SLAB_SHRINK_TARGET_OBJS (pageblock_nr_pages * 4UL)
+
+static void spb_slab_shrink_work_fn(struct work_struct *work)
+{
+ pg_data_t *pgdat = container_of(work, pg_data_t,
+ spb_slab_shrink_work);
+ int nid = pgdat->node_id;
+ unsigned long freed = 0;
+ int prio = DEF_PRIORITY;
+
+ count_vm_event(SPB_SLAB_SHRINK_RAN);
+
+ while (freed < SPB_SLAB_SHRINK_TARGET_OBJS && prio >= 0) {
+ unsigned long delta = 0;
+ struct mem_cgroup *memcg;
+
+ /*
+ * Walk the memcg hierarchy starting at the root, the same
+ * pattern shrink_one_node uses for global slab reclaim.
+ * Some cgroups may not be present on the node that is
+ * being shrunk, but many allocators will use any memory.
+ */
+ memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ do {
+ delta += shrink_slab(GFP_KERNEL, nid, memcg, prio);
+ } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
+
+ if (!delta)
+ break;
+ freed += delta;
+ /*
+ * Increase aggressiveness each round; DEF_PRIORITY scans
+ * a small slice of each cache, prio 0 scans the whole
+ * thing. Most workloads find enough at one or two
+ * iterations below DEF_PRIORITY.
+ */
+ prio--;
+ }
+}
+
+/**
+ * queue_spb_slab_shrink - schedule deferred slab shrink for SPB pressure
+ * @zone: zone whose tainted-SPB pool is running low
+ *
+ * Throttled to one enqueue per 100ms per pgdat. queue_work() handles
+ * single-flight: if the work is already queued or running, it returns
+ * false and the throttle stamp still gets bumped (next call will be
+ * no-op until the throttle elapses).
+ *
+ * Callable from any context: page allocator paths hold zone->lock,
+ * the SPB evacuate worker does not. queue_work() takes only the
+ * workqueue's pool lock — no zone->lock dependency.
+ *
+ * Pairs with queue_spb_evacuate: evacuation moves movable pages out
+ * of tainted SPBs to free up whole pageblocks; this shrinks slab to
+ * free up the remaining (non-movable) pages. We queue both because
+ * even when movable evacuation succeeds, shrinking slab in parallel
+ * keeps headroom available for the next burst, when movable supply
+ * may have run out.
+ */
+static void queue_spb_slab_shrink(struct zone *zone)
+{
+ pg_data_t *pgdat = zone->zone_pgdat;
+
+ if (!pgdat->evacuate_wq)
+ return;
+
+ if (time_before(jiffies, pgdat->spb_slab_shrink_last + HZ / 10))
+ return;
+
+ pgdat->spb_slab_shrink_last = jiffies;
+ if (queue_work(pgdat->evacuate_wq, &pgdat->spb_slab_shrink_work))
+ count_vm_event(SPB_SLAB_SHRINK_QUEUED);
+}
+
/*
* Background superpageblock defragmentation.
*
@@ -9498,6 +9616,7 @@ static int __init pageblock_evacuate_init(void)
for (i = 0; i < NR_SPB_EVAC_REQUESTS; i++)
llist_add(&spb_evac_pool[i].free_node, &spb_evac_freelist);
+
/* Create a per-pgdat workqueue */
for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
@@ -9515,6 +9634,9 @@ static int __init pageblock_evacuate_init(void)
init_irq_work(&pgdat->spb_evac_irq_work,
spb_evac_irq_work_fn);
+ INIT_WORK(&pgdat->spb_slab_shrink_work,
+ spb_slab_shrink_work_fn);
+
/* Initialize per-superpageblock defrag work structs */
for (z = 0; z < MAX_NR_ZONES; z++) {
struct zone *zone = &pgdat->node_zones[z];
@@ -10258,6 +10380,16 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
did_evacuate = true;
}
+ /*
+ * Always kick a slab shrink after an evacuation pass — even when
+ * movable evacuation succeeded. Slab content stranded inside
+ * tainted SPBs can only be freed by shrinking the cache; doing
+ * it now keeps headroom available for the next burst, when the
+ * movable supply may have run out and movable evac alone would
+ * have nothing to do.
+ */
+ queue_spb_slab_shrink(zone);
+
return did_evacuate;
}
#endif /* CONFIG_COMPACTION */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8a6c9120d325..8ffad06a39ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1386,6 +1386,8 @@ const char * const vmstat_text[] = {
[I(CMA_ALLOC_FAIL)] = "cma_alloc_fail",
#endif
[I(SPB_HIGHORDER_REFUSED)] = "spb_highorder_refused",
+ [I(SPB_SLAB_SHRINK_QUEUED)] = "spb_slab_shrink_queued",
+ [I(SPB_SLAB_SHRINK_RAN)] = "spb_slab_shrink_ran",
[I(UNEVICTABLE_PGCULLED)] = "unevictable_pgs_culled",
[I(UNEVICTABLE_PGSCANNED)] = "unevictable_pgs_scanned",
[I(UNEVICTABLE_PGRESCUED)] = "unevictable_pgs_rescued",
--
2.52.0
* [RFC PATCH 31/45] mm: page_alloc: cross-non-movable buddy borrow within tainted SPBs
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (29 preceding siblings ...)
2026-04-30 20:20 ` [RFC PATCH 30/45] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 32/45] mm: page_alloc: proactive high-water trigger for SPB slab shrink Rik van Riel
` (14 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
When pages get freed via __free_one_page, they're placed on the per-SPB
free_list determined by their pageblock's migratetype, not the original
allocation's migratetype. Slab-heavy workloads expose a structural
mismatch:
- RECLAIMABLE pageblocks fill up densely with live slab objects (e.g.
btrfs_inode caches), leaving very few sub-pageblock free fragments
on the RECL free list.
- UNMOVABLE pageblocks accumulate sparse free space from vmalloc and
raw-alloc churn — tens of thousands of free pages, all on the UNMOV
free list.
Net effect: a tainted SPB can show 87,000+ free pages in metadata while
having ZERO free buddies on the RECL list. A new RECL allocation walking
__rmqueue_smallest's preferred-SB Pass 1 finds nothing, falls through
Pass 2 (claim_whole_block on MOVABLE — but mov=0 in tainted SBs),
Pass 2b (sub-PB MOVABLE — same), and reaches Pass 3, which taints a
fresh clean SPB. Repeat per RECL burst.
Add a Pass 2c between 2b and 3: for non-movable allocations that
couldn't find their own migratetype, try borrowing a sub-pageblock buddy
from the *opposite* non-movable migratetype's free list within tainted
SPBs. UNMOV alloc → check RECL free list; RECL alloc → check UNMOV
free list. The pageblock tag is NOT changed — page_del_and_expand uses
the source migratetype for both delete and re-list, so the splits stay
on the source list, and when our borrowed page is later freed
__free_one_page returns it to the source list (based on pageblock tag).
The "borrow" is purely transient: physical page goes to a foreign-type
caller, returns to its native list on free.
PB_has_<requested_type> is set via __spb_set_has_type so spb_defrag
accounting reflects that the pageblock now hosts our type's content.
PB_has_<source_type> stays set since other buddies of that type remain.
Restricted to UNMOV ↔ RECL within SB_TAINTED — movable allocations have
their own Pass 4 fallback, and clean SPBs must not be polluted with
cross-type mixing (that's what the existing migratetype-isolation
machinery exists to prevent).
Live measurement on a 247 GB devvm with btrfs root, kernel 397 (Stage 1
+ simplified Stage 2a) at boot+7min: tainted Normal-zone SPBs had grown
from the baseline of 4 to 12, even though the existing 11 each had
between 825 and 87,062 free pages, all of them on the UNMOV list, while
the workload kept allocating RECL btrfs_inode slab pages. Pass 2c lets
those allocs absorb into the existing UNMOV-listed free pool rather
than creating fresh tainted SPBs.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a72cb2da606d..f2db3dd86a84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2806,6 +2806,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
struct page *page;
int full;
struct superpageblock *sb;
+ int opposite_mt;
/*
* Category search order: 2 passes.
* Movable: clean first, then tainted (pack into clean SBs).
@@ -2985,6 +2986,90 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
}
+
+ /*
+ * Pass 2c: cross-non-movable borrow within tainted SPBs.
+ *
+ * If we're a non-movable alloc and Pass 1/2/2b couldn't find a
+ * buddy on our migratetype's free list anywhere, but tainted
+ * SPBs have free buddies on the *opposite* non-movable type's
+ * free list, take one of those.
+ *
+ * Why this happens: when pages are freed, __free_one_page puts
+ * them on the free_list determined by their pageblock's tag,
+ * not the original allocation's migratetype. Slab caches tend
+ * to be dense (RECL pageblocks fill up; few sub-PB fragments),
+ * while UNMOV pageblocks accumulate sparse free space from
+ * vmalloc/raw alloc churn. Net effect: tainted SPBs frequently
+ * have tens of thousands of free pages all on the UNMOV list,
+ * invisible to RECL allocs (or vice versa). Without this pass,
+ * the alloc falls through to Pass 3 and taints a fresh clean
+ * SPB even though the existing tainted ones have plenty of
+ * unused space.
+ *
+ * We do NOT relabel the source pageblock. The buddy is taken
+ * from @opposite_mt's free list and the splits go back on
+ * @opposite_mt's list (page_del_and_expand uses the same mt
+ * for delete and expand). The pageblock tag is unchanged, so
+ * the page returns to @opposite_mt's list when freed via
+ * __free_one_page. Effectively a borrow: the alloc takes a
+ * physical page from a UNMOV-tagged pageblock for a RECL
+ * use, and the page cycles back to UNMOV's list on free.
+ *
+ * We do set PB_has_<migratetype> via __spb_set_has_type so
+ * spb_defrag accounting reflects that this pageblock now hosts
+ * our migratetype's content too. PB_has_<opposite_mt> stays
+ * set since other buddies of that type remain.
+ *
+ * Restricted to UNMOV ↔ RECL. Movable allocations don't
+ * participate (they have their own Pass 4 fallback path).
+ *
+ * Restricted to SB_TAINTED to avoid spreading mixing into
+ * clean SPBs.
+ */
+ opposite_mt = -1;
+ if (migratetype == MIGRATE_UNMOVABLE)
+ opposite_mt = MIGRATE_RECLAIMABLE;
+ else if (migratetype == MIGRATE_RECLAIMABLE)
+ opposite_mt = MIGRATE_UNMOVABLE;
+
+ if (opposite_mt >= 0) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ int co;
+
+ if (!sb->nr_free_pages)
+ continue;
+ for (co = min_t(int, pageblock_order - 1,
+ NR_PAGE_ORDERS - 1);
+ co >= (int)order;
+ --co) {
+ current_order = co;
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, opposite_mt);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ opposite_mt);
+ __spb_set_has_type(page,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
}
/*
--
2.52.0
* [RFC PATCH 32/45] mm: page_alloc: proactive high-water trigger for SPB slab shrink
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (30 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 31/45] mm: page_alloc: cross-non-movable buddy borrow within tainted SPBs Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 33/45] mm: page_alloc: refuse to taint clean SPBs for atomic NORETRY callers Rik van Riel
` (13 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The SPB slab shrinker introduced earlier in the series only fires when
__rmqueue_smallest falls all the way through to Pass 3 (about to taint
a clean SPB) or when __rmqueue_claim is about to taint one. Bare-metal
testing on a 247 GB devvm with btrfs root (rev 398, with Pass 2c) shows
this is too late: at boot+16min only 15 shrinks had fired in 6 minutes
while slab grew from 1.7 GB to 11.7 GB and tainted Normal-zone SPBs
climbed from 4 baseline to 16. The 100ms throttle (max 10 shrinks/sec
per pgdat) further capped the response rate, and the trigger placement
meant slab pressure could keep absorbing into already-tainted SPBs
without ever firing the shrinker until those SPBs were exhausted — at
which point the only remaining option is to taint a fresh clean SPB.
Two changes:
1. Add a proactive high-water trigger on the success paths of
__rmqueue_smallest's tainted-SPB passes (Pass 1 SB_TAINTED, Pass 2,
Pass 2b, Pass 2c). When a non-movable allocation consumes from a
tainted SPB whose nr_free_pages has fallen below spb_tainted_reserve
worth of pages (reserve_pageblocks * pageblock_nr_pages), queue a
slab shrink. The predicate compares total free pages rather than
whole free pageblocks (nr_free): sub-pageblock allocations and
fragmented free space don't move the pageblock count but do consume
the SPB's freeable capacity, and we can't assume slab reclaim will
produce whole pageblocks either. This makes the trigger frequency
proportional to the rate of non-movable consumption from contended
tainted SPBs, instead of firing only at the cliff edge (a worked
example of the threshold follows this list).
2. Remove the 100ms time-based throttle from queue_spb_slab_shrink.
The throttle was redundant with queue_work()'s built-in single-flight
semantics (returns false if the work is already queued/running) and
was actively harmful: with the new high-water trigger firing per
allocation, the natural rate-limiter is the worker's runtime. The
previously-allocated spb_slab_shrink_last field is removed from
pglist_data.
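As a ballpark for where the new trigger sits (assuming 2 MB pageblocks,
1 GiB SPBs, and that SPB_TAINTED_RESERVE_MIN does not exceed the scaled
value): a 1 GiB SPB has total_pageblocks = 512, so spb_tainted_reserve()
returns 512 / 32 = 16, and spb_below_shrink_high_water() starts firing
once nr_free_pages drops below 16 * 512 = 8192 pages, i.e. once less
than 32 MB of the SPB is still free.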
queue_work() absorbs the resulting per-alloc burst at near-zero cost
(test-and-set on WORK_STRUCT_PENDING_BIT) when a pass is already in
flight, so unconditional firing on every qualifying allocation is
cheap.
Pass 4 (movable falling back to tainted) does not get the trigger:
movable consumption does not contribute to the slab pressure that taints
fresh SPBs, and Pass 4 already filters out SBs at or below reserve.
Clean-SPB success paths in Pass 1 are also untouched (clean SPBs are
not the source of the pressure).
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 7 +++---
mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++----------
2 files changed, 40 insertions(+), 15 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acaff292140f..68892e40cd4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1573,12 +1573,11 @@ typedef struct pglist_data {
/*
* SPB-driven slab reclaim: single work item per pgdat (shrink_slab
- * is node-scoped, so one work in-flight per node is the max), with
- * a 100ms throttle. queue_work() gives us single-flight semantics
- * for free.
+ * is node-scoped, so one work in-flight per node is the max).
+ * queue_work() gives us single-flight semantics for free — fresh
+ * triggers no-op while a pass is in progress.
*/
struct work_struct spb_slab_shrink_work;
- unsigned long spb_slab_shrink_last;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2db3dd86a84..ff7755ef2b79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2692,6 +2692,23 @@ static inline u16 spb_tainted_reserve(const struct superpageblock *sb)
return max_t(u16, SPB_TAINTED_RESERVE_MIN, sb->total_pageblocks / 32);
}
+/*
+ * High-water threshold for proactively kicking the slab shrinker. When a
+ * non-movable allocation consumes from a tainted SPB whose total free
+ * pages have fallen below spb_tainted_reserve worth of pages, queue a
+ * shrink so we start freeing slab memory before the SPB is exhausted.
+ *
+ * Compared against nr_free_pages rather than nr_free (whole pageblocks):
+ * sub-pageblock allocations and fragmented free space don't move the
+ * pageblock count, but they do consume the SPB's freeable capacity, and
+ * we can't assume slab reclaim will produce whole pageblocks either.
+ */
+static inline bool spb_below_shrink_high_water(const struct superpageblock *sb)
+{
+ return sb->nr_free_pages <
+ (unsigned long)spb_tainted_reserve(sb) * pageblock_nr_pages;
+}
+
/*
* On systems with many superpageblocks, we can afford to "write off"
* tainted superpageblocks by aggressively packing unmovable/reclaimable
@@ -2877,6 +2894,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
+ if (cat == SB_TAINTED &&
+ spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2896,6 +2916,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
+ if (cat == SB_TAINTED &&
+ spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2941,6 +2964,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = claim_whole_block(zone, page,
current_order, order,
migratetype, MIGRATE_MOVABLE);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2978,6 +3003,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
0, true);
if (!page)
continue;
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3061,6 +3088,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
opposite_mt);
__spb_set_has_type(page,
migratetype);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -9126,9 +9155,9 @@ static void queue_spb_evacuate(struct zone *zone, unsigned int order,
* tainted SPB is to shrink the slab caches whose pages live there.
*
* shrink_slab() is node-scoped, so one work item per pgdat is enough:
- * a single embedded work_struct, gated by a 100ms throttle.
- * queue_work() returns false if the work is already queued/running, so
- * we get single-flight for free.
+ * a single embedded work_struct. queue_work() returns false if the work
+ * is already queued/running, so we get single-flight for free — fresh
+ * triggers no-op until the in-flight pass completes.
*
* shrink_slab() itself is location-agnostic — it walks all registered
* shrinkers and frees objects whose backing pages may live in any
@@ -9189,10 +9218,11 @@ static void spb_slab_shrink_work_fn(struct work_struct *work)
* queue_spb_slab_shrink - schedule deferred slab shrink for SPB pressure
* @zone: zone whose tainted-SPB pool is running low
*
- * Throttled to one enqueue per 100ms per pgdat. queue_work() handles
- * single-flight: if the work is already queued or running, it returns
- * false and the throttle stamp still gets bumped (next call will be
- * no-op until the throttle elapses).
+ * Single-flight via queue_work(): if the work is already queued or
+ * running, it returns false and we no-op. There is no time-based
+ * throttle — the rate at which fresh shrink runs can fire is bounded
+ * by how fast the worker completes (one full pass freeing up to
+ * SPB_SLAB_SHRINK_TARGET_OBJS objects).
*
* Callable from any context: page allocator paths hold zone->lock,
* the SPB evacuate worker does not. queue_work() takes only the
@@ -9212,10 +9242,6 @@ static void queue_spb_slab_shrink(struct zone *zone)
if (!pgdat->evacuate_wq)
return;
- if (time_before(jiffies, pgdat->spb_slab_shrink_last + HZ / 10))
- return;
-
- pgdat->spb_slab_shrink_last = jiffies;
if (queue_work(pgdat->evacuate_wq, &pgdat->spb_slab_shrink_work))
count_vm_event(SPB_SLAB_SHRINK_QUEUED);
}
--
2.52.0
* [RFC PATCH 33/45] mm: page_alloc: refuse to taint clean SPBs for atomic NORETRY callers
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (31 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 32/45] mm: page_alloc: proactive high-water trigger for SPB slab shrink Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 34/45] mm: page_reporting: walk per-superpageblock free lists Rik van Riel
` (12 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
get_page_from_freelist's atomic-allocation retry logic progressively
relaxes ALLOC_NOFRAGMENT to give atomic allocs every chance: first add
ALLOC_NOFRAG_TAINTED_OK (allow steal from tainted), then drop
ALLOC_NOFRAGMENT entirely (allow tainting clean SPBs). The intent is
that atomic allocations have no slowpath escape and need extra room to
succeed.
For callers that pass __GFP_NORETRY, this tradeoff is wrong. The
NORETRY contract is "I have a fallback; don't go to extreme lengths."
Network skb_page_frag_refill, slab high-order allocations, and similar
hot-path callers all use NORETRY exactly so the allocator can return
NULL and let the caller's own fallback (smaller frag, lower-order
slab, etc.) take over. Tainting a clean superpageblock to satisfy
such a request is a lasting cost that outlives the single allocation
that triggered it: the SPB stays tainted for the remainder of the
workload's lifetime, blocking 1 GiB hugepage allocation from that
region.
Skip the relaxation steps for NORETRY callers and return NULL
immediately. Their fallback path absorbs the failure cleanly.
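The fallback contract being honoured looks roughly like this at a
typical call site (a loose sketch of the skb_page_frag_refill() shape,
not an excerpt from net/core/sock.c; HIGH_ORDER stands in for the
caller's preferred order):

	/* Best effort: high order, no direct reclaim, caller handles NULL. */
	page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
			   __GFP_NOWARN | __GFP_NORETRY, HIGH_ORDER);
	if (!page)
		page = alloc_page(gfp);	/* the cheap fallback: one order-0 page */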
Observed on a 247 GB devvm running the superpageblock v18 series: an
atomic order-3 alloc from
swapper context (PCP refill, gfp=0x152820 = __GFP_HIGH |
__GFP_KSWAPD_RECLAIM | __GFP_NOWARN | __GFP_NORETRY | __GFP_COMP |
__GFP_HARDWALL) tainted a fresh clean SPB at boot+~90 min despite
ALLOC_NOFRAGMENT being set, because the atomic-retry path stripped
the flag. The caller had a NORETRY-fallback ready; the taint was
gratuitous.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff7755ef2b79..e8d6d5b47f63 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5895,9 +5895,20 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* first: allow steal/claim from tainted SPBs only. This avoids
* tainting clean SPBs while still finding pages in tainted ones.
* Only drop NOFRAGMENT entirely if that also fails.
+ *
+ * Exception: callers that explicitly opted into failure with
+ * __GFP_NORETRY have a fallback path of their own (a smaller
+ * order, a different cache, returning NULL from a best-effort
+ * cache refill, etc.). Tainting a clean superpageblock is a
+ * lasting cost that outlives this allocation; it is not justified
+ * to absorb it just to satisfy a caller that already has a
+ * cheaper escape hatch. Return NULL and let the caller's fallback
+ * run instead.
*/
if (no_fallback && !defrag_mode &&
!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
+ if (gfp_mask & __GFP_NORETRY)
+ return NULL;
if (!(alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
alloc_flags |= ALLOC_NOFRAG_TAINTED_OK;
goto retry;
--
2.52.0
* [RFC PATCH 34/45] mm: page_reporting: walk per-superpageblock free lists
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (32 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 33/45] mm: page_alloc: refuse to taint clean SPBs for atomic NORETRY callers Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 35/45] mm: show_mem: collect migratetype letters from per-superpageblock lists Rik van Riel
` (11 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
After the SPB rework, free pages live on per-superpageblock free lists
(zone->superpageblocks[i].free_area[order].free_list[mt]) rather than
on a single zone-level list. page_reporting_cycle() was still walking
the now-empty zone-level list, so virtio-balloon free page reporting
silently became a no-op on systems with superpageblocks: no pages were
ever isolated, no MADV_DONTNEED hints reached the host, and any guest
memory backing balloon-eligible pages stayed resident on the host.
Refactor the per-list walk into page_reporting_cycle_list() taking an
explicit list_head and a pointer to the shared budget, then have
page_reporting_cycle() iterate every SPB in the zone for the requested
(order, mt). The budget is shared across the whole walk so a fragmented
zone does not multiply the rate-limit. The zone-level shadow nr_free
(maintained by __add_to_free_list / __del_page_from_free_list) is used
both for the early-out and for the budget total; that shadow already
sums all SPBs.
Hold the memory hotplug read lock around the SPB walk.
resize_zone_superpageblocks() swaps zone->superpageblocks under
zone->lock and immediately kvfree()s the old array with no RCU grace
period. The helper drops zone->lock during prdev->report() (which can
sleep) and resumes operating on a list_head pointer that lives inside
an SPB; without get_online_mems(), that pointer can become a dangling
reference if hotplug runs in the unlock window.
The zone-level fallback path is retained for zones whose SPB array has
not yet been allocated (e.g. unpopulated hotplug zones).
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_reporting.c | 149 ++++++++++++++++++++++++++------------------
1 file changed, 90 insertions(+), 59 deletions(-)
diff --git a/mm/page_reporting.c b/mm/page_reporting.c
index f0042d5743af..81f903caec22 100644
--- a/mm/page_reporting.c
+++ b/mm/page_reporting.c
@@ -6,6 +6,7 @@
#include <linux/export.h>
#include <linux/module.h>
#include <linux/delay.h>
+#include <linux/memory_hotplug.h>
#include <linux/scatterlist.h>
#include "page_reporting.h"
@@ -138,116 +139,68 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
}
/*
- * The page reporting cycle consists of 4 stages, fill, report, drain, and
- * idle. We will cycle through the first 3 stages until we cannot obtain a
- * full scatterlist of pages, in that case we will switch to idle.
+ * Walk a single free_list (zone-level or per-superpageblock), pulling
+ * unreported pages into the scatterlist and calling prdev->report() each
+ * time the scatterlist fills. Updates *budget and *offset across calls so
+ * the caller can spread one budget across multiple lists (e.g. one per SPB).
*/
static int
-page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
- unsigned int order, unsigned int mt,
- struct scatterlist *sgl, unsigned int *offset)
+page_reporting_cycle_list(struct page_reporting_dev_info *prdev,
+ struct zone *zone, struct list_head *list,
+ unsigned int order, struct scatterlist *sgl,
+ unsigned int *offset, long *budget)
{
- struct free_area *area = &zone->free_area[order];
- struct list_head *list = &area->free_list[mt];
unsigned int page_len = PAGE_SIZE << order;
struct page *page, *next;
- long budget;
int err = 0;
- /*
- * Perform early check, if free area is empty there is
- * nothing to process so we can skip this free_list.
- */
if (list_empty(list))
- return err;
+ return 0;
spin_lock_irq(&zone->lock);
- /*
- * Limit how many calls we will be making to the page reporting
- * device for this list. By doing this we avoid processing any
- * given list for too long.
- *
- * The current value used allows us enough calls to process over a
- * sixteenth of the current list plus one additional call to handle
- * any pages that may have already been present from the previous
- * list processed. This should result in us reporting all pages on
- * an idle system in about 30 seconds.
- *
- * The division here should be cheap since PAGE_REPORTING_CAPACITY
- * should always be a power of 2.
- */
- budget = DIV_ROUND_UP(area->nr_free, PAGE_REPORTING_CAPACITY * 16);
-
- /* loop through free list adding unreported pages to sg list */
list_for_each_entry_safe(page, next, list, lru) {
- /* We are going to skip over the reported pages. */
if (PageReported(page))
continue;
- /*
- * If we fully consumed our budget then update our
- * state to indicate that we are requesting additional
- * processing and exit this list.
- */
- if (budget < 0) {
+ if (*budget < 0) {
atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
next = page;
break;
}
- /* Attempt to pull page from list and place in scatterlist */
if (*offset) {
if (!__isolate_free_page(page, order)) {
next = page;
break;
}
- /* Add page to scatter list */
--(*offset);
sg_set_page(&sgl[*offset], page, page_len, 0);
continue;
}
- /*
- * Make the first non-reported page in the free list
- * the new head of the free list before we release the
- * zone lock.
- */
if (!list_is_first(&page->lru, list))
list_rotate_to_front(&page->lru, list);
- /* release lock before waiting on report processing */
spin_unlock_irq(&zone->lock);
- /* begin processing pages in local list */
err = prdev->report(prdev, sgl, PAGE_REPORTING_CAPACITY);
- /* reset offset since the full list was reported */
*offset = PAGE_REPORTING_CAPACITY;
+ (*budget)--;
- /* update budget to reflect call to report function */
- budget--;
-
- /* reacquire zone lock and resume processing */
spin_lock_irq(&zone->lock);
- /* flush reported pages from the sg list */
page_reporting_drain(prdev, sgl, PAGE_REPORTING_CAPACITY, !err);
- /*
- * Reset next to first entry, the old next isn't valid
- * since we dropped the lock to report the pages
- */
next = list_first_entry(list, struct page, lru);
- /* exit on error */
if (err)
break;
}
- /* Rotate any leftover pages to the head of the freelist */
if (!list_entry_is_head(next, list, lru) && !list_is_first(&next->lru, list))
list_rotate_to_front(&next->lru, list);
@@ -256,6 +209,84 @@ page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
return err;
}
+/*
+ * The page reporting cycle consists of 4 stages, fill, report, drain, and
+ * idle. We will cycle through the first 3 stages until we cannot obtain a
+ * full scatterlist of pages, in that case we will switch to idle.
+ *
+ * With superpageblocks, free pages live on per-SPB free_lists rather than a
+ * single zone-level list, so the cycle iterates every SPB for the requested
+ * (order, mt). The budget is shared across the entire walk so that
+ * fragmented zones do not produce a budget multiplier.
+ */
+static int
+page_reporting_cycle(struct page_reporting_dev_info *prdev, struct zone *zone,
+ unsigned int order, unsigned int mt,
+ struct scatterlist *sgl, unsigned int *offset)
+{
+ long budget;
+ int err = 0;
+
+ /*
+ * Early exit if the per-zone shadow says there is nothing free at
+ * this order in any SPB. Avoids touching every SPB's list head.
+ */
+ if (!data_race(zone->free_area[order].nr_free))
+ return 0;
+
+ /*
+ * Limit how many calls we will be making to the page reporting
+ * device. By doing this we avoid processing any given (order, mt)
+ * for too long.
+ *
+ * The current value used allows us enough calls to process over a
+ * sixteenth of the current free pool plus one additional call to
+ * handle any pages that may have already been present from the
+ * previous list processed. This should result in us reporting all
+ * pages on an idle system in about 30 seconds.
+ *
+ * The division here should be cheap since PAGE_REPORTING_CAPACITY
+ * should always be a power of 2.
+ */
+ budget = DIV_ROUND_UP(data_race(zone->free_area[order].nr_free),
+ PAGE_REPORTING_CAPACITY * 16);
+
+ /*
+ * Block memory hotplug for the SPB walk. resize_zone_superpageblocks()
+ * swaps zone->superpageblocks under zone->lock and immediately
+ * kvfree()s the old array, with no RCU grace period. The helper drops
+ * zone->lock during prdev->report() and resumes using a list_head
+ * pointer into an SPB; without holding mem_hotplug_lock for read,
+ * that pointer can become a dangling reference into freed memory.
+ */
+ get_online_mems();
+
+ if (zone->nr_superpageblocks) {
+ unsigned long sb_idx, nr_sbs = zone->nr_superpageblocks;
+
+ for (sb_idx = 0; sb_idx < nr_sbs; sb_idx++) {
+ struct list_head *list =
+ &zone->superpageblocks[sb_idx].free_area[order].free_list[mt];
+
+ err = page_reporting_cycle_list(prdev, zone, list,
+ order, sgl, offset,
+ &budget);
+ if (err || budget < 0)
+ break;
+ }
+ } else {
+ /* No SPBs (e.g. unpopulated zone); fall back to zone-level list. */
+ struct list_head *list = &zone->free_area[order].free_list[mt];
+
+ err = page_reporting_cycle_list(prdev, zone, list, order,
+ sgl, offset, &budget);
+ }
+
+ put_online_mems();
+
+ return err;
+}
+
static int
page_reporting_process_zone(struct page_reporting_dev_info *prdev,
struct scatterlist *sgl, struct zone *zone)
--
2.52.0
* [RFC PATCH 35/45] mm: show_mem: collect migratetype letters from per-superpageblock lists
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (33 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 34/45] mm: page_reporting: walk per-superpageblock free lists Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 36/45] mm: page_alloc: add alloc_flags parameter to __rmqueue_smallest Rik van Riel
` (10 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
show_mem()'s per-order line includes a parenthesized set of letters
(UME, etc.) indicating which migratetypes have free pages at that
order. This was computed by checking free_area_empty() on
zone->free_area[order].free_list[type]. After the SPB rework, those
zone-level list heads are always empty -- free pages live on per-
superpageblock lists -- so the migratetype letters never appeared.
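For reference, the affected per-order output looks roughly like the
line below (counts illustrative; U=unmovable, M=movable,
E=reclaimable):
  Node 0 Normal: 3981*4kB (UME) 1550*8kB (UME) 723*16kB (UE) ... = 131072kB
With the zone-level lists always empty, the per-order counts stayed
correct (they come from the shadow nr_free), but every parenthesized
letter set came out empty.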
Iterate every SPB in the zone for each order, OR'ing in a bit for each
non-empty migratetype list, with an early exit once all migratetypes
have been seen. The shadow nr_free count remains correct:
zone->free_area[].nr_free is still updated by __add_to_free_list /
__del_page_from_free_list, so it sums across all SPBs.
Falls back to the zone-level free_area for zones whose SPB array has
not yet been allocated.
The whole loop runs under spin_lock_irqsave(&zone->lock) without
drops, so no hotplug race. Worst case work is bounded
(NR_PAGE_ORDERS * MIGRATE_TYPES * nr_superpageblocks list_empty
pointer compares per zone) and acceptable for a diagnostic path.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/show_mem.c | 25 ++++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/mm/show_mem.c b/mm/show_mem.c
index bbbbef5baed7..a80076fe6165 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -363,16 +363,31 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z
spin_lock_irqsave(&zone->lock, flags);
for (order = 0; order < NR_PAGE_ORDERS; order++) {
- struct free_area *area = &zone->free_area[order];
+ unsigned long sb_idx;
+ unsigned long nr_lists = zone->nr_superpageblocks ? : 1;
int type;
- nr[order] = area->nr_free;
+ nr[order] = zone->free_area[order].nr_free;
total += nr[order] << order;
+ /*
+ * Collect the migratetypes present at this order. After
+ * the SPB rework, free pages live on per-superpageblock
+ * free lists, so check each SPB. Stop early once all
+ * migratetypes have been observed.
+ */
types[order] = 0;
- for (type = 0; type < MIGRATE_TYPES; type++) {
- if (!free_area_empty(area, type))
- types[order] |= 1 << type;
+ for (sb_idx = 0; sb_idx < nr_lists; sb_idx++) {
+ struct free_area *area = zone->nr_superpageblocks ?
+ &zone->superpageblocks[sb_idx].free_area[order] :
+ &zone->free_area[order];
+
+ for (type = 0; type < MIGRATE_TYPES; type++) {
+ if (!free_area_empty(area, type))
+ types[order] |= 1 << type;
+ }
+ if (types[order] == (1 << MIGRATE_TYPES) - 1)
+ break;
}
}
spin_unlock_irqrestore(&zone->lock, flags);
--
2.52.0
* [RFC PATCH 36/45] mm: page_alloc: add alloc_flags parameter to __rmqueue_smallest
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (34 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 35/45] mm: show_mem: collect migratetype letters from per-superpageblock lists Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 37/45] mm/slub: kvmalloc — add __GFP_NORETRY to large-kmalloc attempt Rik van Riel
` (9 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Plumb alloc_flags through __rmqueue_smallest so that subsequent
diagnostic tracepoints (and any future logic that needs to react to
allocation flags) can observe and use it. No behavioural change: the
parameter is added to the signature and threaded through every caller,
but nothing inside __rmqueue_smallest acts on it yet.
Callers passing 0 for alloc_flags are paths that synthesise an
allocation outside of the normal alloc_flags-tracking flow
(__rmqueue_cma_fallback, the fallback in try_to_claim_block).
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8d6d5b47f63..d621e84bf664 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2816,7 +2816,8 @@ struct spb_tainted_walk {
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
- int migratetype, struct spb_tainted_walk *walk)
+ int migratetype, unsigned int alloc_flags,
+ struct spb_tainted_walk *walk)
{
unsigned int current_order;
struct free_area *area;
@@ -3240,7 +3241,7 @@ static inline bool noncompatible_cross_type(int start_type, int fallback_type)
static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
unsigned int order)
{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA, NULL);
+ return __rmqueue_smallest(zone, order, MIGRATE_CMA, 0, NULL);
}
#else
static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
@@ -3715,7 +3716,7 @@ try_to_claim_block(struct zone *zone, struct page *page,
if (sb)
spb_update_list(sb);
#endif
- return __rmqueue_smallest(zone, order, start_type, NULL);
+ return __rmqueue_smallest(zone, order, start_type, 0, NULL);
}
/*
@@ -4116,7 +4117,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
*/
switch (*mode) {
case RMQUEUE_NORMAL:
- page = __rmqueue_smallest(zone, order, migratetype, walkp);
+ page = __rmqueue_smallest(zone, order, migratetype,
+ alloc_flags, walkp);
if (page)
return page;
/*
@@ -5171,7 +5173,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
}
if (alloc_flags & ALLOC_HIGHATOMIC)
page = __rmqueue_smallest(zone, order,
- MIGRATE_HIGHATOMIC, NULL);
+ MIGRATE_HIGHATOMIC,
+ alloc_flags, NULL);
if (!page) {
enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
@@ -5186,7 +5189,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
if (!page && (alloc_flags & (ALLOC_OOM|ALLOC_NON_BLOCK)))
page = __rmqueue_smallest(zone, order,
MIGRATE_HIGHATOMIC,
- NULL);
+ alloc_flags, NULL);
if (!page) {
spin_unlock_irqrestore(&zone->lock, flags);
--
2.52.0
* [RFC PATCH 37/45] mm/slub: kvmalloc — add __GFP_NORETRY to large-kmalloc attempt
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (35 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 36/45] mm: page_alloc: add alloc_flags parameter to __rmqueue_smallest Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 38/45] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache Rik van Riel
` (8 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
kvmalloc's contract is "try contiguous physical memory first; fall
back to vmalloc on failure." For size > PAGE_SIZE, kmalloc_gfp_adjust
already strips __GFP_DIRECT_RECLAIM and adds __GFP_NOWARN to make
the kmalloc attempt non-disruptive. But the page allocator's atomic-
allocation retry chain in get_page_from_freelist (no __GFP_DIRECT_RECLAIM
path) progressively relaxes ALLOC_NOFRAGMENT — first adding
ALLOC_NOFRAG_TAINTED_OK, then dropping ALLOC_NOFRAGMENT entirely —
because atomic allocations have no slowpath escape and need every
chance to succeed.
For kvmalloc-large, this is wrong: there IS a slowpath escape (the
vmalloc fallback). Tainting a previously-clean superpageblock to
satisfy the kmalloc attempt costs more than letting it fail and
calling vmalloc — the SPB stays tainted for the rest of the workload's
lifetime, blocking 1 GiB hugepage allocation from that region.
Add __GFP_NORETRY in the same conditional that strips __GFP_DIRECT_RECLAIM.
The page allocator's NORETRY-skip exit (mm/page_alloc.c) treats this
as the documented "caller has a fallback" signal and returns NULL
immediately instead of relaxing ALLOC_NOFRAGMENT. kvmalloc then runs
its existing vmalloc fallback as designed.
kvmalloc's documented contract already disallows callers passing
__GFP_NORETRY directly (see the comment block above
__kvmalloc_node_noprof), so adding it internally cannot surprise
any existing caller.
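For context, a minimal sketch of the kvmalloc-large flow this change
relies on (heavily simplified: the real code is
__kvmalloc_node_noprof(), kmalloc_gfp_adjust() is a local helper in
mm/slub.c, and the actual vmalloc fallback propagates gfp/node and
only runs when the flags permit blocking):
	static void *kvmalloc_large_sketch(size_t size, gfp_t flags)
	{
		/*
		 * One non-disruptive physically-contiguous attempt:
		 * kmalloc_gfp_adjust() adds __GFP_NOWARN, drops
		 * __GFP_DIRECT_RECLAIM and, with this patch, adds
		 * __GFP_NORETRY unless __GFP_RETRY_MAYFAIL was requested.
		 */
		void *p = kmalloc(size, kmalloc_gfp_adjust(flags, size));
		if (p)
			return p;
		/* The real retry path: fall back to virtually contiguous memory. */
		return vmalloc(size);
	}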
Observed on a 247 GB devvm running the page-superblock v18
series: a `below` process reading a /proc/sys file via
kvmalloc(buf, GFP_USER) tainted a fresh clean SPB at
boot+~47 min via __kmalloc_large_node → alloc_pages_mpol. A
tls-cert-validator did the same a minute later. Both were "best
effort" allocations with vmalloc as their existing fallback — they
should not have been tainting clean SPBs.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/slub.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 2b2d33cc735c..fa422d245a53 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6703,13 +6703,24 @@ static gfp_t kmalloc_gfp_adjust(gfp_t flags, size_t size)
* However make sure that larger requests are not too disruptive - i.e.
* do not direct reclaim unless physically continuous memory is preferred
* (__GFP_RETRY_MAYFAIL mode). We still kick in kswapd/kcompactd to
- * start working in the background
+ * start working in the background.
+ *
+ * Also signal __GFP_NORETRY: the vmalloc fallback IS our retry path,
+ * so the page allocator should not go to extreme lengths (e.g.
+ * tainting a previously-clean superpageblock from the page-superblock
+ * series) just to satisfy the kmalloc attempt. The atomic-allocation
+ * relaxation logic in get_page_from_freelist treats __GFP_NORETRY as
+ * "caller has a fallback" and returns NULL early instead of dropping
+ * ALLOC_NOFRAGMENT. kvmalloc's documented contract already disallows
+ * callers passing __GFP_NORETRY directly, so adding it here is safe.
*/
if (size > PAGE_SIZE) {
flags |= __GFP_NOWARN;
- if (!(flags & __GFP_RETRY_MAYFAIL))
+ if (!(flags & __GFP_RETRY_MAYFAIL)) {
flags &= ~__GFP_DIRECT_RECLAIM;
+ flags |= __GFP_NORETRY;
+ }
/* nofail semantic is implemented by the vmalloc fallback */
flags &= ~__GFP_NOFAIL;
--
2.52.0
* [RFC PATCH 38/45] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (36 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 37/45] mm/slub: kvmalloc — add __GFP_NORETRY to large-kmalloc attempt Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 39/45] mm: debug: prevent infinite recursion in dump_page() with CMA Rik van Riel
` (7 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
PASS_1 of __rmqueue_smallest walks &zone->spb_lists[cat][full]
linearly. Under steady workload on a 247 GB devvm, the median walk
depth was ~50 SPBs and 20-57% of allocations visited 100+ SPBs.
Cache the SPB that last satisfied a PASS_1 alloc for each
(zone, order, migratetype) tuple, in two layers:
- per-zone hint (zone->sb_hint[order][mt]) — visible to all CPUs,
serialized by zone->lock.
- per-CPU hint indexed by zone_idx — cache-hot, contention-free.
Each slot stores (zone *, sb *) because zone_idx is per-pgdat
(not globally unique on NUMA); the zone-pointer check on read
prevents a cross-node SPB from being handed back to the wrong
zone's accounting.
Stale hints are harmless: try_alloc_from_sb_pass1() returns NULL and
the standard list walk runs as before. On PASS_1 success both hints
are refreshed. spb_invalidate_warm_hints() clears both arrays from
resize_zone_superpageblocks() under zone->lock to prevent UAF across
memory hotplug-add.
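Condensed from the __rmqueue_smallest hunks below (tracepoint, shrink
queueing and hint-refresh bookkeeping omitted), the added lookup order
is:
	if (migratetype < MIGRATE_PCPTYPES) {
		struct spb_warm_hint_slot *slot = this_cpu_ptr(
			&spb_warm_hints.slot[zone_idx(zone)][order][migratetype]);
		/* per-CPU hint; the zone check guards against zone_idx
		 * aliasing across NUMA nodes */
		if (slot->zone == zone && slot->sb)
			page = try_alloc_from_sb_pass1(zone, slot->sb,
						       order, migratetype);
		/* per-zone hint, shared by all CPUs under zone->lock */
		if (!page && zone->sb_hint[order][migratetype])
			page = try_alloc_from_sb_pass1(zone,
					zone->sb_hint[order][migratetype],
					order, migratetype);
		if (page)
			return page;	/* linear spb_lists walk skipped */
	}
	/* miss: fall through to the fullest-to-emptiest spb_lists walk */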
Hint hits show up in tracepoint:kmem:spb_alloc_walk as the [0, 5)
bucket because n_spbs_visited stays 0; no new tracepoint needed.
Skipped for migratetype >= MIGRATE_PCPTYPES (HIGHATOMIC/CMA/ISOLATE
are already cheap or rare).
Measurement on the same devvm with this commit applied:
median walk depth: ~50 SPBs -> ~5
tail (>=100 SPB visits): 20-57% -> 0.4%
hint hit rate (n=0): -> 99%
Memory cost: ~320 B per zone + ~2.6 KB per CPU
(MAX_NR_ZONES * NR_PAGE_ORDERS * MIGRATE_PCPTYPES * sizeof(slot)).
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 11 +++
mm/internal.h | 2 +
mm/mm_init.c | 8 ++
mm/page_alloc.c | 180 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 201 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68892e40cd4e..298cff01160c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1014,6 +1014,17 @@ struct zone {
struct list_head spb_isolated; /* fully isolated (1GB contig alloc) */
struct list_head spb_lists[__NR_SB_CATEGORIES][__NR_SB_FULLNESS];
+ /*
+ * Stage 5 PASS_1 fast-path hint: most-recent SPB that satisfied a
+ * (order, mt) PASS_1 allocation. Stale hints are harmless — the hint
+ * try-alloc just falls through to the standard list walk on miss.
+ * Sized for [0..NR_PAGE_ORDERS) x PCPTYPES; HIGHATOMIC/CMA/ISOLATE
+ * skip the hint (already cheap or rare). Invalidated by
+ * spb_invalidate_warm_hints() when the SPB array is resized
+ * (memory hotplug add).
+ */
+ struct superpageblock *sb_hint[NR_PAGE_ORDERS][MIGRATE_PCPTYPES];
+
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
diff --git a/mm/internal.h b/mm/internal.h
index 71e39414645f..c84d7acb9342 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1041,6 +1041,8 @@ static inline void superpageblock_set_has_movable(struct zone *zone,
void resize_zone_superpageblocks(struct zone *zone);
#endif
+void spb_invalidate_warm_hints(struct zone *zone);
+
struct cma;
#ifdef CONFIG_CMA
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 8e3c64d37254..3a57cc4f3b48 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1810,6 +1810,14 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
zone->superpageblock_base_pfn = new_sb_base;
zone->spb_kvmalloced = true;
+ /*
+ * Invalidate Stage 5 PASS_1 hints under zone->lock so that no
+ * concurrent allocator (also entering __rmqueue_smallest under
+ * zone->lock) can dereference an old SPB pointer that is about
+ * to be freed below.
+ */
+ spb_invalidate_warm_hints(zone);
+
spin_unlock_irqrestore(&zone->lock, flags);
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d621e84bf664..2f5d3ba1c0ef 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2814,6 +2814,110 @@ struct spb_tainted_walk {
bool saw_below_reserve; /* tainted SPB has nr_free <= spb_tainted_reserve */
};
+/*
+ * Stage 5 PASS_1 fast-path hint: most-recent SPB this CPU successfully
+ * allocated from for a given (zone, order, migratetype). Combined with
+ * the per-zone zone->sb_hint[][], this lets PASS_1 skip the linear walk
+ * of spb_lists[cat][full] in the common case (~78 SPBs visited per
+ * order-0 MOVABLE alloc on the build-403 baseline). Stale hints are
+ * harmless — the try-alloc just falls through to the standard list walk
+ * on miss.
+ *
+ * The slot stores both the zone pointer and the SPB pointer because
+ * zone_idx(zone) is per-pgdat (not globally unique on NUMA), so two
+ * nodes' ZONE_NORMAL share the same array index. The zone-pointer check
+ * on read prevents a cross-node SPB from being handed back to the wrong
+ * zone (which would corrupt per-zone NR_FREE_PAGES accounting).
+ */
+struct spb_warm_hint_slot {
+ struct zone *zone;
+ struct superpageblock *sb;
+};
+struct spb_warm_hints {
+ struct spb_warm_hint_slot slot[MAX_NR_ZONES][NR_PAGE_ORDERS][MIGRATE_PCPTYPES];
+};
+static DEFINE_PER_CPU(struct spb_warm_hints, spb_warm_hints);
+
+/**
+ * spb_invalidate_warm_hints - drop all cached hints into @zone
+ * @zone: zone whose SPB array is about to change
+ *
+ * Called from memory hotplug paths that resize zone->superpageblocks
+ * (and therefore invalidate every SPB pointer for @zone). Must be
+ * called with zone->lock held; the lock serializes against any CPU
+ * doing a hint read inside __rmqueue_smallest (also under zone->lock),
+ * so callers see either pre-invalidation state (old SPB pointers,
+ * still-valid old array) or post-invalidation state (NULL slots) —
+ * never a half-state with stale pointers into a freed array.
+ */
+void spb_invalidate_warm_hints(struct zone *zone)
+{
+ enum zone_type zidx = zone_idx(zone);
+ int cpu, order, mt;
+
+ lockdep_assert_held(&zone->lock);
+
+ memset(zone->sb_hint, 0, sizeof(zone->sb_hint));
+
+ for_each_possible_cpu(cpu) {
+ struct spb_warm_hints *h = per_cpu_ptr(&spb_warm_hints, cpu);
+
+ for (order = 0; order < NR_PAGE_ORDERS; order++) {
+ for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+ if (h->slot[zidx][order][mt].zone != zone)
+ continue;
+ h->slot[zidx][order][mt].zone = NULL;
+ h->slot[zidx][order][mt].sb = NULL;
+ }
+ }
+ }
+}
+
+/*
+ * Try to allocate from a single SPB using PASS_1 semantics:
+ * whole pageblock first (PCP-buddy friendly), then sub-pageblock.
+ * Returns the page on success, NULL on miss. Caller is responsible
+ * for tracepoints, hint updates, and shrinker queueing.
+ */
+static struct page *try_alloc_from_sb_pass1(struct zone *zone,
+ struct superpageblock *sb,
+ unsigned int order,
+ int migratetype)
+{
+ unsigned int current_order;
+ struct free_area *area;
+ struct page *page;
+
+ if (!sb->nr_free_pages)
+ return NULL;
+
+ for (current_order = max(order, pageblock_order);
+ current_order < NR_PAGE_ORDERS;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page, order,
+ current_order, migratetype);
+ return page;
+ }
+ if (order < pageblock_order) {
+ for (current_order = order;
+ current_order < pageblock_order;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page, order,
+ current_order, migratetype);
+ return page;
+ }
+ }
+ return NULL;
+}
+
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype, unsigned int alloc_flags,
@@ -2836,6 +2940,64 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
};
int movable = (migratetype == MIGRATE_MOVABLE) ? 1 : 0;
+ /*
+ * Stage 5 PASS_1 fast-path: try per-CPU then per-zone hint SPB
+ * before the linear list walk. The hint stores the SPB that last
+ * satisfied a PASS_1 alloc for this (zone, order, migratetype).
+ * On hit, we skip the entire spb_lists walk (n_spbs_visited stays
+ * 0, which shows up as the [0,5) bucket in the spb_alloc_walk
+ * tracepoint histogram). Skip for HIGHATOMIC/CMA/ISOLATE — those
+ * paths are already cheap (atomic-NORETRY skip) or rare.
+ */
+ if (migratetype < MIGRATE_PCPTYPES) {
+ enum zone_type zidx = zone_idx(zone);
+ struct superpageblock *cpu_hint = NULL, *zone_hint;
+ struct spb_warm_hint_slot *slot;
+
+ slot = this_cpu_ptr(
+ &spb_warm_hints.slot[zidx][order][migratetype]);
+ /*
+ * Validate slot->zone == zone: zone_idx is per-pgdat, so
+ * on NUMA the same slot index is shared by every node's
+ * zone of this type. Without this check, a hint written
+ * from one node would be returned to allocations on
+ * another node and corrupt the wrong zone's accounting.
+ */
+ if (slot->zone == zone)
+ cpu_hint = slot->sb;
+ if (cpu_hint) {
+ page = try_alloc_from_sb_pass1(zone, cpu_hint,
+ order, migratetype);
+ if (page) {
+ if (spb_get_category(cpu_hint) == SB_TAINTED &&
+ spb_below_shrink_high_water(cpu_hint))
+ queue_spb_slab_shrink(zone);
+ trace_mm_page_alloc_zone_locked(page, order,
+ migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ zone_hint = zone->sb_hint[order][migratetype];
+ if (zone_hint && zone_hint != cpu_hint) {
+ page = try_alloc_from_sb_pass1(zone, zone_hint,
+ order, migratetype);
+ if (page) {
+ if (spb_get_category(zone_hint) == SB_TAINTED &&
+ spb_below_shrink_high_water(zone_hint))
+ queue_spb_slab_shrink(zone);
+ slot->zone = zone;
+ slot->sb = zone_hint;
+ trace_mm_page_alloc_zone_locked(page, order,
+ migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+
/*
* Search per-superpageblock free lists for pages of the requested
* migratetype, walking superpageblocks from fullest to emptiest
@@ -2902,6 +3064,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ if (migratetype < MIGRATE_PCPTYPES) {
+ struct spb_warm_hint_slot *slot;
+
+ zone->sb_hint[order][migratetype] = sb;
+ slot = this_cpu_ptr(&spb_warm_hints.slot
+ [zone_idx(zone)][order][migratetype]);
+ slot->zone = zone;
+ slot->sb = sb;
+ }
return page;
}
/* Then try sub-pageblock (no PCP buddy) */
@@ -2924,6 +3095,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ if (migratetype < MIGRATE_PCPTYPES) {
+ struct spb_warm_hint_slot *slot;
+
+ zone->sb_hint[order][migratetype] = sb;
+ slot = this_cpu_ptr(&spb_warm_hints.slot
+ [zone_idx(zone)][order][migratetype]);
+ slot->zone = zone;
+ slot->sb = sb;
+ }
return page;
}
}
--
2.52.0
* [RFC PATCH 39/45] mm: debug: prevent infinite recursion in dump_page() with CMA
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (37 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 38/45] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 40/45] PM: hibernate: walk per-superpageblock free lists in mark_free_pages Rik van Riel
` (6 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
dump_page() calls is_migrate_cma_folio(), which calls
get_pfnblock_migratetype(), which in turn calls
get_pfnblock_flags_word(); that function has a VM_BUG_ON_PAGE for
!zone_spans_pfn(zone, pfn).
If that VM_BUG_ON_PAGE fires (e.g. dumping a page in an unavailable
range, or a page that hasn't yet been initialized), the BUG handler
itself calls dump_page() — which calls is_migrate_cma_folio() —
which fires the same VM_BUG_ON_PAGE. Infinite recursion until the
kernel runs out of stack.
Guard the CMA check with pfn_valid() and zone_spans_pfn() so
dump_page() can safely report on pages that don't have a meaningful
zone. The "CMA" suffix is only printed if the page is genuinely in a
CMA pageblock.
Found by: dump_page() called from a VM_BUG_ON_PAGE in early boot
hitting a page in an unavailable range, recursing until stack
exhaustion.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/debug.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/mm/debug.c b/mm/debug.c
index d4542d5d202b..e233520b009c 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -73,6 +73,7 @@ static void __dump_folio(const struct folio *folio, const struct page *page,
{
struct address_space *mapping = folio_mapping(folio);
int mapcount = atomic_read(&page->_mapcount) + 1;
+ bool cma = false;
char *type = "";
if (page_mapcount_is_type(mapcount))
@@ -112,9 +113,24 @@ static void __dump_folio(const struct folio *folio, const struct page *page,
* "isolate" again in the meantime, but since we are just dumping the
* state for debugging, it should be fine to accept a bit of
* inaccuracy here due to racing.
+ *
+ * Guard the is_migrate_cma_folio() call with pfn_valid() and
+ * zone_spans_pfn(). The macro calls get_pfnblock_migratetype()
+ * which calls get_pfnblock_flags_word() which has a VM_BUG_ON_PAGE
+ * for !zone_spans_pfn(). If that fires, dump_page() recurses
+ * infinitely. Call page_zone() only after pfn_valid() to avoid
+ * dereferencing uninitialized zone data during early boot.
*/
+#ifdef CONFIG_CMA
+ if (pfn_valid(pfn)) {
+ struct zone *zone = page_zone(page);
+
+ if (zone_spans_pfn(zone, pfn))
+ cma = is_migrate_cma_folio(folio, pfn);
+ }
+#endif
pr_warn("%sflags: %pGp%s\n", type, &folio->flags,
- is_migrate_cma_folio(folio, pfn) ? " CMA" : "");
+ cma ? " CMA" : "");
if (page_has_type(&folio->page))
pr_warn("page_type: %x(%s)\n", folio->page.page_type >> 24,
page_type_name(folio->page.page_type));
--
2.52.0
* [RFC PATCH 40/45] PM: hibernate: walk per-superpageblock free lists in mark_free_pages
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (38 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 39/45] mm: debug: prevent infinite recursion in dump_page() with CMA Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 41/45] btrfs: allocate eb-attached btree pages as movable Rik van Riel
` (5 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rafael J. Wysocki, Len Brown,
Pavel Machek, linux-pm, Rik van Riel
From: Rik van Riel <riel@meta.com>
mark_free_pages() walks the buddy allocator's free lists and calls
swsusp_set_page_free() on each free page so it is omitted from the
hibernation image. After the SPB rework, free pages live on per-
superpageblock free lists rather than the zone-level free list, so
the existing list_for_each_entry() walk over zone->free_area[order]
.free_list[t] found nothing. The hibernation snapshot then treated
every free page as needing to be saved, wasting image space and risking
OOM during the snapshot.
Wrap the existing per-page walk in an SPB iteration loop. When the
zone has no SPBs (e.g. an unpopulated hotplug zone), fall back to the
zone-level free list. The whole function still runs under
spin_lock_irqsave(&zone->lock) without drops, so there are no lock-
order or hotplug concerns.
Note: kernel/power/snapshot.o currently fails to build with
-Werror=return-type due to an unrelated pre-existing warning in
enough_free_mem(). This patch was reviewed but not build-tested in
isolation; the change itself is mechanical.
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Machek <pavel@kernel.org>
Cc: linux-pm@vger.kernel.org
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
kernel/power/snapshot.c | 35 +++++++++++++++++++++++++----------
1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 6e1321837c66..682c3bf2ba8b 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1270,17 +1270,32 @@ static void mark_free_pages(struct zone *zone)
}
for_each_migratetype_order(order, t) {
- list_for_each_entry(page,
- &zone->free_area[order].free_list[t], buddy_list) {
- unsigned long i;
-
- pfn = page_to_pfn(page);
- for (i = 0; i < (1UL << order); i++) {
- if (!--page_count) {
- touch_nmi_watchdog();
- page_count = WD_PAGE_COUNT;
+ unsigned long sb_idx;
+ unsigned long nr_lists = zone->nr_superpageblocks ? : 1;
+
+ /*
+ * After the SPB rework, free pages live on per-superpageblock
+ * free lists. Walk every SPB's list for this (order, mt) cell.
+ * If the zone has no SPBs (unpopulated zone), fall back to the
+ * zone-level list head so that any pre-SPB pages are still
+ * marked.
+ */
+ for (sb_idx = 0; sb_idx < nr_lists; sb_idx++) {
+ struct list_head *list = zone->nr_superpageblocks ?
+ &zone->superpageblocks[sb_idx].free_area[order].free_list[t] :
+ &zone->free_area[order].free_list[t];
+
+ list_for_each_entry(page, list, buddy_list) {
+ unsigned long i;
+
+ pfn = page_to_pfn(page);
+ for (i = 0; i < (1UL << order); i++) {
+ if (!--page_count) {
+ touch_nmi_watchdog();
+ page_count = WD_PAGE_COUNT;
+ }
+ swsusp_set_page_free(pfn_to_page(pfn + i));
}
- swsusp_set_page_free(pfn_to_page(pfn + i));
}
}
}
--
2.52.0
* [RFC PATCH 41/45] btrfs: allocate eb-attached btree pages as movable
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (39 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 40/45] PM: hibernate: walk per-superpageblock free lists in mark_free_pages Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 42/45] mm: page_alloc: cross-MOV borrow within tainted SPBs Rik van Riel
` (4 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Chris Mason, David Sterba, linux-btrfs,
Rik van Riel
From: Rik van Riel <riel@meta.com>
Extent buffer pages allocated by alloc_extent_buffer() are attached to
btree_inode->i_mapping (the buffer_tree path), reach the LRU, and are
served by the btree_migrate_folio aops in fs/btrfs/disk-io.c. They are
migratable in practice once their owning extent buffer hits refs == 1,
which happens naturally as tree roots rotate. The buddy allocator
classifies them by GFP, however, and bare GFP_NOFS lands them in
MIGRATE_UNMOVABLE pageblocks. The result: every btree_inode page we
read in pins an unmovable pageblock from the page-superblock allocator's
perspective, even though the page itself can be moved.
Add __GFP_MOVABLE to that one allocation site (alloc_extent_buffer's
call to alloc_eb_folio_array). Plumb the flag through
alloc_eb_folio_array → btrfs_alloc_page_array as a `gfp_t extra_gfp`
parameter. All other call sites pass 0.
Three categories of caller stay on bare GFP_NOFS, deliberately:
- alloc_dummy_extent_buffer / btrfs_clone_extent_buffer: the
resulting eb is EXTENT_BUFFER_UNMAPPED, folio->mapping stays NULL,
the folios never enter LRU, never get migrate_folio aops. Tagging
them __GFP_MOVABLE would violate the page allocator's migrability
contract and they would defeat compaction in MOVABLE pageblocks
where isolate_migratepages_block skips non-LRU non-movable_ops
pages outright.
- btrfs_alloc_page_array callers in fs/btrfs/raid56.c (stripe
pages), fs/btrfs/inode.c (encoded reads), fs/btrfs/ioctl.c (uring
encoded reads), fs/btrfs/relocation.c (relocation buffers): same
contract violation. raid56 stripe_pages additionally persist in
the stripe cache (RBIO_CACHE_SIZE=1024) well beyond a single I/O,
so they are not transient enough to hand-wave the contract.
- btrfs_alloc_folio_array caller in fs/btrfs/scrub.c (stripe
folios): same — stripe->folios[] are private buffers freed via
folio_put in release_scrub_stripe.
This change targets the dominant fragmentation source observed on the
page-superblock v18 series: ~28 GB of btree_inode pages parked across
many tainted superpageblocks on a 247 GB devvm with btrfs root,
preventing 1 GiB hugepage allocation from those regions. With the
movable hint, those pages now land in MOVABLE pageblocks where the
existing background defragger drains them through the standard
PB_has_movable gate, no LRU-sample fallback needed.
Suggested-by: Rik van Riel <riel@meta.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
fs/btrfs/extent_io.c | 69 ++++++++++++++++++++++++++++++-------------
fs/btrfs/extent_io.h | 4 +--
fs/btrfs/inode.c | 2 +-
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/raid56.c | 6 ++--
fs/btrfs/relocation.c | 2 +-
fs/btrfs/scrub.c | 3 +-
7 files changed, 59 insertions(+), 29 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5f97a3d2a8d7..7e28e4a876a0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -626,24 +626,33 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
}
/*
- * Populate every free slot in a provided array with folios using GFP_NOFS.
+ * Populate every free slot in a provided array with folios using
+ * GFP_NOFS plus optional caller-supplied flags.
*
- * @nr_folios: number of folios to allocate
- * @order: the order of the folios to be allocated
- * @folio_array: the array to fill with folios; any existing non-NULL entries in
- * the array will be skipped
+ * @nr_folios: number of folios to allocate
+ * @order: folio order
+ * @folio_array: array to fill with folios; non-NULL entries are skipped
+ * @extra_gfp: extra GFP flags OR'd into GFP_NOFS. The only value used
+ * today is __GFP_MOVABLE, which the extent-buffer real-mapping
+ * path (alloc_extent_buffer) passes when the resulting folios
+ * will be attached to btree_inode->i_mapping (added to LRU,
+ * served by the btree_migrate_folio aops). Pass 0 for
+ * everything else; folios allocated by other callers stay in
+ * driver-owned arrays, never reach LRU and never register
+ * movable_ops, so they cannot satisfy the __GFP_MOVABLE
+ * migrability contract.
*
* Return: 0 if all folios were able to be allocated;
* -ENOMEM otherwise, the partially allocated folios would be freed and
* the array slots zeroed
*/
int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
- struct folio **folio_array)
+ struct folio **folio_array, gfp_t extra_gfp)
{
for (int i = 0; i < nr_folios; i++) {
if (folio_array[i])
continue;
- folio_array[i] = folio_alloc(GFP_NOFS, order);
+ folio_array[i] = folio_alloc(GFP_NOFS | extra_gfp, order);
if (!folio_array[i])
goto error;
}
@@ -658,21 +667,27 @@ int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
}
/*
- * Populate every free slot in a provided array with pages, using GFP_NOFS.
+ * Populate every free slot in a provided array with pages, using GFP_NOFS
+ * plus optional caller-supplied flags.
*
- * @nr_pages: number of pages to allocate
- * @page_array: the array to fill with pages; any existing non-null entries in
- * the array will be skipped
- * @nofail: whether using __GFP_NOFAIL flag
+ * @nr_pages: number of pages to allocate
+ * @page_array: array to fill; non-NULL entries are skipped
+ * @nofail: whether to use __GFP_NOFAIL
+ * @extra_gfp: extra GFP flags OR'd into the base mask. The only value used
+ * today is __GFP_MOVABLE, which the extent-buffer real-mapping
+ * path passes when the resulting pages will be attached to
+ * btree_inode->i_mapping. See btrfs_alloc_folio_array() for
+ * the full migrability rationale.
*
* Return: 0 if all pages were able to be allocated;
* -ENOMEM otherwise, the partially allocated pages would be freed and
* the array slots zeroed
*/
int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
- bool nofail)
+ bool nofail, gfp_t extra_gfp)
{
- const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
+ const gfp_t gfp = (nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS) |
+ extra_gfp;
unsigned int allocated;
for (allocated = 0; allocated < nr_pages;) {
@@ -695,14 +710,23 @@ int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
* Populate needed folios for the extent buffer.
*
* For now, the folios populated are always in order 0 (aka, single page).
+ *
+ * @movable: pass true only when the resulting pages will be attached to
+ * btree_inode->i_mapping (the alloc_extent_buffer real path).
+ * Cloned/dummy extent buffers (EXTENT_BUFFER_UNMAPPED) leave
+ * folio->mapping NULL, never enter the LRU, and never get the
+ * btree_migrate_folio aops, so __GFP_MOVABLE would violate the
+ * page-allocator's migrability contract for them.
*/
-static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
+static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail,
+ bool movable)
{
struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
int num_pages = num_extent_pages(eb);
int ret;
- ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
+ ret = btrfs_alloc_page_array(num_pages, page_array, nofail,
+ movable ? __GFP_MOVABLE : 0);
if (ret < 0)
return ret;
@@ -3067,7 +3091,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
*/
set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
- ret = alloc_eb_folio_array(new, false);
+ ret = alloc_eb_folio_array(new, false, false);
if (ret)
goto release_eb;
@@ -3108,7 +3132,7 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
if (!eb)
return NULL;
- ret = alloc_eb_folio_array(eb, false);
+ ret = alloc_eb_folio_array(eb, false, false);
if (ret)
goto release_eb;
@@ -3461,8 +3485,13 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
}
reallocate:
- /* Allocate all pages first. */
- ret = alloc_eb_folio_array(eb, true);
+ /*
+ * Allocate all pages first. These will be attached to
+ * btree_inode->i_mapping below (added to LRU, served by
+ * btree_migrate_folio), so request __GFP_MOVABLE so the
+ * page allocator places them in MOVABLE pageblocks.
+ */
+ ret = alloc_eb_folio_array(eb, true, true);
if (ret < 0) {
btrfs_free_folio_state(prealloc);
goto out;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 8d05f1a58b7c..1a3631abe989 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -363,9 +363,9 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
struct extent_buffer *buf);
int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
- bool nofail);
+ bool nofail, gfp_t extra_gfp);
int btrfs_alloc_folio_array(unsigned int nr_folios, unsigned int order,
- struct folio **folio_array);
+ struct folio **folio_array, gfp_t extra_gfp);
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
bool find_lock_delalloc_range(struct inode *inode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a6da98435ef7..fcd662203608 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9645,7 +9645,7 @@ ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, struct iov_iter *iter,
pages = kzalloc_objs(struct page *, nr_pages, GFP_NOFS);
if (!pages)
return -ENOMEM;
- ret = btrfs_alloc_page_array(nr_pages, pages, false);
+ ret = btrfs_alloc_page_array(nr_pages, pages, false, 0);
if (ret) {
ret = -ENOMEM;
goto out;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b805dd9227ef..eaf9508fcf1f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4614,7 +4614,7 @@ static int btrfs_uring_read_extent(struct kiocb *iocb, struct iov_iter *iter,
pages = kzalloc_objs(struct page *, nr_pages, GFP_NOFS);
if (!pages)
return -ENOMEM;
- ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+ ret = btrfs_alloc_page_array(nr_pages, pages, 0, 0);
if (ret) {
ret = -ENOMEM;
goto out_fail;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b4511f560e92..d8531a901535 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1143,7 +1143,7 @@ static int alloc_rbio_pages(struct btrfs_raid_bio *rbio)
{
int ret;
- ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false);
+ ret = btrfs_alloc_page_array(rbio->nr_pages, rbio->stripe_pages, false, 0);
if (ret < 0)
return ret;
/* Mapping all sectors */
@@ -1158,7 +1158,7 @@ static int alloc_rbio_parity_pages(struct btrfs_raid_bio *rbio)
int ret;
ret = btrfs_alloc_page_array(rbio->nr_pages - data_pages,
- rbio->stripe_pages + data_pages, false);
+ rbio->stripe_pages + data_pages, false, 0);
if (ret < 0)
return ret;
@@ -1756,7 +1756,7 @@ static int alloc_rbio_data_pages(struct btrfs_raid_bio *rbio)
const int data_pages = rbio->nr_data * rbio->stripe_npages;
int ret;
- ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false);
+ ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages, false, 0);
if (ret < 0)
return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b2343aed7a5d..814e48003015 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4046,7 +4046,7 @@ static int copy_remapped_data(struct btrfs_fs_info *fs_info, u64 old_addr,
if (!pages)
return -ENOMEM;
- ret = btrfs_alloc_page_array(nr_pages, pages, 0);
+ ret = btrfs_alloc_page_array(nr_pages, pages, 0, 0);
if (ret) {
ret = -ENOMEM;
goto end;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index bc94bbc00772..23f6f780eab6 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -369,7 +369,8 @@ static int init_scrub_stripe(struct btrfs_fs_info *fs_info,
ASSERT(BTRFS_STRIPE_LEN >> min_folio_shift <= SCRUB_STRIPE_MAX_FOLIOS);
ret = btrfs_alloc_folio_array(BTRFS_STRIPE_LEN >> min_folio_shift,
- fs_info->block_min_order, stripe->folios);
+ fs_info->block_min_order, stripe->folios,
+ 0);
if (ret < 0)
goto error;
--
2.52.0
* [RFC PATCH 42/45] mm: page_alloc: cross-MOV borrow within tainted SPBs
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (40 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 41/45] btrfs: allocate eb-attached btree pages as movable Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 43/45] mm: page_alloc: trigger defrag from allocator hot path on tainted-SPB pressure Rik van Riel
` (3 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Pass 2c (cross-non-movable borrow) is restricted to UNMOV<->RECL: it
borrows individual buddies from the opposite non-movable migratetype's
free list within a tainted SPB without relabeling the source pageblock.
Movable free pages within tainted SPBs are deliberately excluded
because long-lived non-movable content in a MOV-tagged pageblock
blocks compaction of that pageblock.
Under workloads that mostly free MOVABLE-tagged content into tainted
SPBs (page-cache reclaim, anon LRU shrink), the result is a tainted
SPB with tens to hundreds of thousands of free pages all on the MOV
free list — invisible to non-movable demand. Pass 1 doesn't see them
(they're not on the requesting mt's list), Pass 2/2b can't claim a
whole pageblock when sb->nr_free == 0 (no contiguous free PB to
relabel), and Pass 2c skips MOV. The non-movable alloc falls through
to Pass 3 and taints a fresh clean SPB even though the existing
tainted ones have plenty of unused space.
Add Pass 2d, mirroring Pass 2c semantics but borrowing from the
MOVABLE free list within already-tainted SPBs. The borrowed page is
used for the requesting non-movable mt for the lifetime of the
allocation, then on free returns to the MOVABLE list (no pageblock
relabel; same "borrow" mechanism as 2c).
Tradeoff: the borrowed UNMOV/RECL content blocks compaction of its
source pageblock until the alloc is freed. Restricted to SB_TAINTED
so contamination is bounded to one pageblock inside an already-
tainted SPB. The alternative — Pass 3 tainting a fresh clean SPB —
removes a 1 GiB region from the clean pool, which is strictly worse
for the anti-fragmentation invariant the series is built around.
Skipped for movable allocs (they use Pass 4) and CMA allocs.
Observable as the new SPB_ALLOC_OUTCOME_PASS_2D outcome on the
spb_alloc_walk tracepoint. Expected effect on the live workload:
tainted SPB count growth slows substantially; allocations that were
previously taking the PASS_3 escape now succeed in PASS_2D.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2f5d3ba1c0ef..af499f0a1a48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3280,6 +3280,79 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
}
+
+ /*
+ * Pass 2d: cross-MOV borrow within tainted SPBs.
+ *
+ * If Pass 1/2/2b/2c all failed, the next step is Pass 3
+ * which would taint a fresh clean SPB. Before that, try
+ * to borrow an individual buddy from a tainted SPB's
+ * MIGRATE_MOVABLE free list.
+ *
+ * Tainted SPBs accumulate large amounts of free space on
+ * the MOV free list (e.g. reclaimed page-cache pages
+ * whose pageblock tag is MOVABLE). Pass 1 cannot see
+ * those for non-movable allocs, Pass 2/2b cannot claim a
+ * whole pageblock when sb->nr_free == 0, and Pass 2c is
+ * restricted to UNMOV<->RECL. The result is a tainted
+ * SPB with tens to hundreds of thousands of free pages
+ * all unreachable from non-movable demand.
+ *
+ * Borrow semantics mirror Pass 2c: take a buddy from the
+ * MOVABLE free list without relabeling the source
+ * pageblock. The page is used for the requesting non-
+ * movable mt for the lifetime of the allocation, then on
+ * free returns to the MOVABLE list.
+ *
+ * Cost: the borrowed UNMOV/RECL content blocks
+ * compaction of its source pageblock until freed.
+ * Restricted to SB_TAINTED so the contamination is
+ * bounded to an already-tainted SPB; the alternative
+ * (Pass 3) taints a fresh clean SPB and removes a 1 GiB
+ * region from the clean pool, which is strictly worse.
+ *
+ * Skipped for movable allocs (they have Pass 4) and for
+ * CMA allocs.
+ */
+ if (!movable && !is_migrate_cma(migratetype)) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ int co;
+
+ if (!sb->nr_free_pages)
+ continue;
+ for (co = min_t(int, pageblock_order - 1,
+ NR_PAGE_ORDERS - 1);
+ co >= (int)order;
+ --co) {
+ current_order = co;
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, MIGRATE_MOVABLE);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ MIGRATE_MOVABLE);
+ __spb_set_has_type(page,
+ migratetype);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
}
/*
--
2.52.0
* [RFC PATCH 43/45] mm: page_alloc: trigger defrag from allocator hot path on tainted-SPB pressure
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (41 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 42/45] mm: page_alloc: cross-MOV borrow within tainted SPBs Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 44/45] mm: page_alloc: SPB tracepoint instrumentation [DROP-FOR-UPSTREAM] Rik van Riel
` (2 subsequent siblings)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
The per-SPB background defrag worker is currently triggered only from
spb_update_list(), which itself only fires when the SPB's category or
fullness bucket changes. Sub-bucket allocations (decrementing free
counters within the same bucket) do not re-evaluate.
drgn dump on a saturated devvm showed several tainted SPBs with
defrag_last_no_progress_jiffies set hundreds-to-thousands of seconds
ago — long after their 5-second SPB_DEFRAG_NOOP_COOLDOWN expired —
yet defrag had never been re-triggered on them. The shape of the
failure: a tainted SPB hits free=0, the worker tried once and made no
progress (movable pages mostly in mixed pageblocks, evacuating them
left the source PB still occupied by unmov/recl content), no-progress
cooldown stamped, no later allocator event crossed a fullness bucket
on that SPB so spb_update_list never re-fired the trigger. The SPB
sat stuck while subsequent non-movable allocs ended up tainting fresh
clean SPBs via PASS_3.
Add two complementary triggers in __rmqueue_smallest:
(1) On every PASS_1/2/2B/2C/2D success that already evaluates
spb_below_shrink_high_water(sb) (i.e. the same threshold at
which queue_spb_slab_shrink is fired), additionally call
spb_maybe_start_defrag(sb). Catches actively-pressured tainted
SPBs immediately, no extra hot-path predicate evaluation.
(2) Just before the PASS_3 fall-through that risks tainting a fresh
clean SPB, walk the tainted-SPB list and call
spb_maybe_start_defrag() on each. Catches SPBs that are stuck
with no allocator activity to drive (1). Bounded by
nr_tainted_spbs and only runs on the slow path that is about to
fragment the clean pool — appropriate to spend a list walk
here. The cooldown gate inside spb_needs_defrag() no-ops cheaply
for SPBs not yet eligible.
The cooldown still gates spb_needs_defrag() so neither trigger
storms the worker.
The existing spb_maybe_start_defrag() call inside spb_update_list()
is retained: it remains the trigger for the clean-SPB
within-superpageblock compaction path (spb_defrag_clean), which the
new alloc-path triggers do not cover (they only fire on
SB_TAINTED). Replacing the spb_update_list call entirely would
require a separate clean-SPB-specific trigger in the allocator and
is left for a follow-up.
Also factor out the now-repeated tainted-alloc reaction into a helper
spb_react_to_tainted_alloc(sb, zone) and call it from all 8
PASS_1/2/2B/2C/2D success sites in __rmqueue_smallest. Centralizes the
gate (cat == SB_TAINTED && spb_below_shrink_high_water(sb)) and the
shrink+defrag kick in one place, removing duplication and reducing
the per-success-site noise.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++++++--------------
1 file changed, 53 insertions(+), 20 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af499f0a1a48..e15e71d5ac99 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2709,6 +2709,30 @@ static inline bool spb_below_shrink_high_water(const struct superpageblock *sb)
(unsigned long)spb_tainted_reserve(sb) * pageblock_nr_pages;
}
+/*
+ * spb_react_to_tainted_alloc - kick reclaim machinery on a tainted-SPB alloc.
+ *
+ * Called from each PASS_1/2/2B/2C/2D success path after a successful
+ * allocation against a tainted SPB. If the SPB is below its shrink
+ * high-water mark, queue the SPB-driven slab shrink and try to start
+ * the per-SPB defrag worker. Both have their own cooldown gates inside,
+ * so this is cheap to call on every such allocation.
+ *
+ * Skips quickly when the SPB is not tainted (e.g. movable allocation
+ * landing on a clean SPB) or when the high-water mark hasn't been
+ * crossed.
+ */
+static inline void spb_react_to_tainted_alloc(struct superpageblock *sb,
+ struct zone *zone)
+{
+ if (spb_get_category(sb) != SB_TAINTED)
+ return;
+ if (!spb_below_shrink_high_water(sb))
+ return;
+ queue_spb_slab_shrink(zone);
+ spb_maybe_start_defrag(sb);
+}
+
/*
* On systems with many superpageblocks, we can afford to "write off"
* tainted superpageblocks by aggressively packing unmovable/reclaimable
@@ -2969,9 +2993,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = try_alloc_from_sb_pass1(zone, cpu_hint,
order, migratetype);
if (page) {
- if (spb_get_category(cpu_hint) == SB_TAINTED &&
- spb_below_shrink_high_water(cpu_hint))
- queue_spb_slab_shrink(zone);
+ spb_react_to_tainted_alloc(cpu_hint, zone);
trace_mm_page_alloc_zone_locked(page, order,
migratetype,
pcp_allowed_order(order) &&
@@ -2984,9 +3006,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = try_alloc_from_sb_pass1(zone, zone_hint,
order, migratetype);
if (page) {
- if (spb_get_category(zone_hint) == SB_TAINTED &&
- spb_below_shrink_high_water(zone_hint))
- queue_spb_slab_shrink(zone);
+ spb_react_to_tainted_alloc(zone_hint, zone);
slot->zone = zone;
slot->sb = zone_hint;
trace_mm_page_alloc_zone_locked(page, order,
@@ -3057,9 +3077,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
- if (cat == SB_TAINTED &&
- spb_below_shrink_high_water(sb))
- queue_spb_slab_shrink(zone);
+ if (cat == SB_TAINTED)
+ spb_react_to_tainted_alloc(sb, zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3088,9 +3107,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
- if (cat == SB_TAINTED &&
- spb_below_shrink_high_water(sb))
- queue_spb_slab_shrink(zone);
+ if (cat == SB_TAINTED)
+ spb_react_to_tainted_alloc(sb, zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3145,8 +3163,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = claim_whole_block(zone, page,
current_order, order,
migratetype, MIGRATE_MOVABLE);
- if (spb_below_shrink_high_water(sb))
- queue_spb_slab_shrink(zone);
+ spb_react_to_tainted_alloc(sb, zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3184,8 +3201,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
0, true);
if (!page)
continue;
- if (spb_below_shrink_high_water(sb))
- queue_spb_slab_shrink(zone);
+ spb_react_to_tainted_alloc(sb, zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3269,8 +3285,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
opposite_mt);
__spb_set_has_type(page,
migratetype);
- if (spb_below_shrink_high_water(sb))
- queue_spb_slab_shrink(zone);
+ spb_react_to_tainted_alloc(sb, zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3342,8 +3357,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
MIGRATE_MOVABLE);
__spb_set_has_type(page,
migratetype);
- if (spb_below_shrink_high_water(sb))
- queue_spb_slab_shrink(zone);
+ spb_react_to_tainted_alloc(sb, zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3371,6 +3385,25 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
queue_spb_slab_shrink(zone);
}
+ /*
+ * Last-chance defrag trigger before tainting a fresh clean SPB.
+ * Walk the tainted-SPB list and try to wake the per-SPB defrag
+ * worker on each. Catches SPBs that are stuck in expired-cooldown
+ * state because no allocator activity has touched them recently
+ * (the routine event-driven trigger from spb_update_list only
+ * fires on bucket transitions, not on every alloc). Once the
+ * cooldown has expired, spb_maybe_start_defrag() will requeue
+ * work; otherwise the gate inside spb_needs_defrag() no-ops
+ * cheaply. Bounded by nr_tainted_spbs and only runs when we are
+ * already on the slow path of fragmenting the clean pool.
+ */
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ spb_maybe_start_defrag(sb);
+ }
+ }
+
/* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
if (!sb->nr_free_pages)
--
2.52.0
* [RFC PATCH 44/45] mm: page_alloc: SPB tracepoint instrumentation [DROP-FOR-UPSTREAM]
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (42 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 43/45] mm: page_alloc: trigger defrag from allocator hot path on tainted-SPB pressure Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 45/45] mm: page_alloc: enlarge and unify spb_evacuate_for_order Rik van Riel
2026-05-01 7:14 ` [00/45 RFC PATCH] 1GB superpageblock memory allocation David Hildenbrand (Arm)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@meta.com>
Bundle all SPB anti-fragmentation diagnostic tracepoints into a single
top-of-stack commit so the entire instrumentation can be dropped before
upstream submission.
Tracepoint definitions (include/trace/events/kmem.h):
- spb_alloc_walk — exit point of every __rmqueue_smallest
call with outcome and SPB visit count
- spb_alloc_fall_through — fires when PASS 1/2/2b/2c all failed
and the allocator is about to taint
a fresh clean SPB (PASS 3 / steal)
- spb_pb_taint — every PB_has_<mt> bit transition
- spb_claim_block_refused — try_to_claim_block exits with reason
- spb_evacuate_for_order_done — evac phase completion summary
- spb_alloc_atomic_relax — atomic NORETRY relaxation events
Plus the SPB_ALLOC_OUTCOME_PASS_2D = 8 enum value (extending the
spb_alloc_walk outcome set for the cross-MOV borrow path).
Tracepoint emission scaffolding and call sites (mm/page_alloc.c):
- n_spbs_visited counter + SPB_WALK_DONE macro in __rmqueue_smallest
- bool first/last in __spb_set_has_type / __spb_clear_has_type
- if-stmt brace + trace_spb_claim_block_refused in try_to_claim_block
early-return paths (isolate, CMA, zone-boundary, noncompat-cross,
insufficient-compat)
- struct zone *pref + trace_spb_alloc_atomic_relax in slowpath
NORETRY/NOFRAG-tainted relaxation
- phase1_attempts/phase2_attempts counters + trace_spb_evacuate_for_order_done
- trace_printk("SB first unmovable/reclaimable") on first-of-type
transitions per SPB
Designed for diagnostics only; drop this commit before upstream
submission. The behavioral commits earlier in the series provide the SPB
anti-fragmentation machinery; this commit is purely instrumentation.
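As a usage note (a sketch only, not part of the series: it assumes
bpftrace is available and this instrumentation commit is applied; the
field names come from the TP_STRUCT__entry definitions below), the
high-volume walk tracepoint can be consumed with a filter so only the
interesting events are recorded, e.g. a walk-depth histogram keyed by
outcome for walks that visited more than 5 SPBs:

  bpftrace -e 'tracepoint:kmem:spb_alloc_walk
      /args->n_spbs_visited > 5/
      { @depth[args->outcome] = hist(args->n_spbs_visited); }'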
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/trace/events/kmem.h | 371 ++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 149 ++++++++++++++-
2 files changed, 510 insertions(+), 10 deletions(-)
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index cd7920c81f85..67fda214edc9 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -266,6 +266,377 @@ TRACE_EVENT(mm_page_pcpu_drain,
__entry->order, __entry->migratetype)
);
+/*
+ * spb_pb_taint action encoding.
+ */
+#define SPB_PB_TAINT_ACTION_SET 0 /* set PB_has_<mt> */
+#define SPB_PB_TAINT_ACTION_CLEAR 1 /* clear PB_has_<mt> */
+
+#define show_spb_pb_taint_action(a) \
+ __print_symbolic(a, \
+ { SPB_PB_TAINT_ACTION_SET, "SET" }, \
+ { SPB_PB_TAINT_ACTION_CLEAR, "CLEAR" })
+
+/*
+ * Per-call tracepoint at every PB_has_<migratetype> bit transition.
+ * Distinct from the existing trace_printk lines (which only fire on
+ * the FIRST 0->1 transition per (SPB, migratetype)) — this fires on
+ * EVERY successful set/clear, and includes a flag for whether this
+ * call also caused a 0<->1 transition at the SPB-level counter
+ * (i.e., is_first_or_last for this (SPB, mt) combination).
+ *
+ * Use to answer "who is painting/clearing PB_has bits and at what
+ * rate?" — most useful when investigating runaway tainting or when
+ * Stage 1 / sync evac should be clearing bits but isn't.
+ *
+ * High volume: bounded by the rate of PB_has_* bit changes, which
+ * is typically per-allocation. Static-key gated to zero overhead
+ * when detached.
+ */
+TRACE_EVENT(spb_pb_taint,
+
+ TP_PROTO(struct page *page, int migratetype, int action,
+ bool is_first_or_last),
+
+ TP_ARGS(page, migratetype, action, is_first_or_last),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, pfn )
+ __field( int, migratetype )
+ __field( int, action )
+ __field( bool, is_first_or_last )
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = page_to_pfn(page);
+ __entry->migratetype = migratetype;
+ __entry->action = action;
+ __entry->is_first_or_last = is_first_or_last;
+ ),
+
+ TP_printk("pfn=0x%lx mt=%d action=%s first_or_last=%d",
+ __entry->pfn,
+ __entry->migratetype,
+ show_spb_pb_taint_action(__entry->action),
+ __entry->is_first_or_last)
+);
+
+/*
+ * spb_claim_block_refused reason encoding.
+ */
+#define SPB_CLAIM_REFUSED_ISOLATE 0
+#define SPB_CLAIM_REFUSED_CMA 1
+#define SPB_CLAIM_REFUSED_ZONE_BOUNDARY 2
+#define SPB_CLAIM_REFUSED_CROSS_TYPE_NOT_FREE 3
+#define SPB_CLAIM_REFUSED_INSUFFICIENT_COMPAT 4
+
+#define show_spb_claim_refused_reason(r) \
+ __print_symbolic(r, \
+ { SPB_CLAIM_REFUSED_ISOLATE, "ISOLATE" }, \
+ { SPB_CLAIM_REFUSED_CMA, "CMA" }, \
+ { SPB_CLAIM_REFUSED_ZONE_BOUNDARY, "ZONE_BOUNDARY" }, \
+ { SPB_CLAIM_REFUSED_CROSS_TYPE_NOT_FREE, "CROSS_TYPE_NOT_FREE" }, \
+ { SPB_CLAIM_REFUSED_INSUFFICIENT_COMPAT, "INSUFFICIENT_COMPAT" })
+
+/*
+ * Per-refusal tracepoint inside try_to_claim_block. The function can
+ * fail for several reasons: pageblock isolated for evacuation, CMA
+ * pageblock, zone boundary straddle, cross-type relabel that requires
+ * a fully-free PB, or the heuristic threshold that says too few pages
+ * in the block are compatible. Visibility into WHICH reason fires how
+ * often informs Stage 4 design (e.g., is the heuristic gate the
+ * dominant cause of allocations spilling to clean SPBs?).
+ *
+ * Volume: bounded by the rate of fallback attempts, which is rare
+ * compared to total allocations.
+ */
+TRACE_EVENT(spb_claim_block_refused,
+
+ TP_PROTO(struct page *page, int start_type, int block_type,
+ int reason),
+
+ TP_ARGS(page, start_type, block_type, reason),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, pfn )
+ __field( int, start_type )
+ __field( int, block_type )
+ __field( int, reason )
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = page_to_pfn(page);
+ __entry->start_type = start_type;
+ __entry->block_type = block_type;
+ __entry->reason = reason;
+ ),
+
+ TP_printk("pfn=0x%lx start_mt=%d block_mt=%d reason=%s",
+ __entry->pfn,
+ __entry->start_type,
+ __entry->block_type,
+ show_spb_claim_refused_reason(__entry->reason))
+);
+
+/*
+ * Per-call tracepoint at the exit of spb_evacuate_for_order, the
+ * synchronous slowpath evacuator called from
+ * __alloc_pages_direct_compact. Captures how many evacuate_pageblock
+ * calls were attempted in each phase:
+ * - Phase 1: coalesce within existing same-mt pageblocks
+ * - Phase 2: evacuate whole movable pageblocks to create free PBs
+ *
+ * Together with pgmigrate_success/pgmigrate_fail counter deltas, this
+ * lets us answer "is slowpath sync evacuation actually creating
+ * useful free pageblocks, or are the migrations EAGAINing on busy
+ * ebs?" — directly informs whether the per-call budget caps need
+ * tuning.
+ *
+ * Low volume: ~one event per direct-compact slowpath visit.
+ */
+TRACE_EVENT(spb_evacuate_for_order_done,
+
+ TP_PROTO(struct zone *zone, unsigned int order, int migratetype,
+ unsigned int phase1_attempts, unsigned int phase2_attempts,
+ bool did_evacuate),
+
+ TP_ARGS(zone, order, migratetype, phase1_attempts,
+ phase2_attempts, did_evacuate),
+
+ TP_STRUCT__entry(
+ __string( name, zone->name )
+ __field( unsigned int, order )
+ __field( int, migratetype )
+ __field( unsigned int, phase1_attempts )
+ __field( unsigned int, phase2_attempts )
+ __field( bool, did_evacuate )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ __entry->phase1_attempts = phase1_attempts;
+ __entry->phase2_attempts = phase2_attempts;
+ __entry->did_evacuate = did_evacuate;
+ ),
+
+ TP_printk("zone=%s order=%u mt=%d p1=%u p2=%u did_evac=%d",
+ __get_str(name),
+ __entry->order,
+ __entry->migratetype,
+ __entry->phase1_attempts,
+ __entry->phase2_attempts,
+ __entry->did_evacuate)
+);
+
+/*
+ * spb_alloc_atomic_relax step encoding.
+ */
+#define SPB_ATOMIC_RELAX_NORETRY_SKIP 0 /* NORETRY caller — return NULL */
+#define SPB_ATOMIC_RELAX_ADD_TAINTED_OK 1 /* add ALLOC_NOFRAG_TAINTED_OK retry */
+#define SPB_ATOMIC_RELAX_DROP_NOFRAGMENT 2 /* drop ALLOC_NOFRAGMENT retry */
+
+#define show_spb_atomic_relax_step(s) \
+ __print_symbolic(s, \
+ { SPB_ATOMIC_RELAX_NORETRY_SKIP, "NORETRY_SKIP" }, \
+ { SPB_ATOMIC_RELAX_ADD_TAINTED_OK, "ADD_TAINTED_OK" }, \
+ { SPB_ATOMIC_RELAX_DROP_NOFRAGMENT, "DROP_NOFRAGMENT" })
+
+/*
+ * Per-event tracepoint at each atomic-allocation NOFRAGMENT-relaxation
+ * step in get_page_from_freelist. Captures NORETRY-skip exits (caller
+ * had a fallback so we returned NULL), and the two relaxation retries
+ * (add NOFRAG_TAINTED_OK; drop NOFRAGMENT entirely).
+ *
+ * Use to quantify how often each step fires under the workload.
+ * Validates the NORETRY-skip change is paying off.
+ *
+ * Volume: only on atomic allocs that exhaust the tainted pool —
+ * typically rare on a healthy system.
+ */
+TRACE_EVENT(spb_alloc_atomic_relax,
+
+ TP_PROTO(struct zone *zone, unsigned int order, int migratetype,
+ gfp_t gfp_mask, int step),
+
+ TP_ARGS(zone, order, migratetype, gfp_mask, step),
+
+ TP_STRUCT__entry(
+ __string( name, zone->name )
+ __field( unsigned int, order )
+ __field( int, migratetype )
+ __field( unsigned long, gfp_mask )
+ __field( int, step )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ __entry->gfp_mask = (__force unsigned long)gfp_mask;
+ __entry->step = step;
+ ),
+
+ TP_printk("zone=%s order=%u mt=%d gfp=%s step=%s",
+ __get_str(name),
+ __entry->order,
+ __entry->migratetype,
+ show_gfp_flags(__entry->gfp_mask),
+ show_spb_atomic_relax_step(__entry->step))
+);
+
+/*
+ * spb_alloc_walk outcome encoding. SUCCESS_* values name which Pass
+ * inside __rmqueue_smallest produced the page. NO_PAGE means the
+ * function returned NULL (all passes failed).
+ */
+#define SPB_ALLOC_OUTCOME_NO_PAGE 0
+#define SPB_ALLOC_OUTCOME_PASS_1 1 /* preferred SPBs */
+#define SPB_ALLOC_OUTCOME_PASS_2 2 /* claim_whole_block from tainted */
+#define SPB_ALLOC_OUTCOME_PASS_2B 3 /* sub-PB claim from tainted */
+#define SPB_ALLOC_OUTCOME_PASS_2C 4 /* cross-non-movable borrow */
+#define SPB_ALLOC_OUTCOME_PASS_3 5 /* empty SPB (taints fresh SPB) */
+#define SPB_ALLOC_OUTCOME_PASS_4 6 /* movable falls back to tainted */
+#define SPB_ALLOC_OUTCOME_ZONE_FALLBACK 7 /* zone-level free_area (hotplug edge) */
+#define SPB_ALLOC_OUTCOME_PASS_2D 8 /* cross-MOV borrow within tainted */
+
+#define show_spb_alloc_outcome(o) \
+ __print_symbolic(o, \
+ { SPB_ALLOC_OUTCOME_NO_PAGE, "NO_PAGE" }, \
+ { SPB_ALLOC_OUTCOME_PASS_1, "PASS_1" }, \
+ { SPB_ALLOC_OUTCOME_PASS_2, "PASS_2" }, \
+ { SPB_ALLOC_OUTCOME_PASS_2B, "PASS_2B" }, \
+ { SPB_ALLOC_OUTCOME_PASS_2C, "PASS_2C" }, \
+ { SPB_ALLOC_OUTCOME_PASS_2D, "PASS_2D" }, \
+ { SPB_ALLOC_OUTCOME_PASS_3, "PASS_3" }, \
+ { SPB_ALLOC_OUTCOME_PASS_4, "PASS_4" }, \
+ { SPB_ALLOC_OUTCOME_ZONE_FALLBACK, "ZONE_FB" })
+
+/*
+ * Per-allocation tracepoint at every exit of __rmqueue_smallest.
+ * Captures how many SPBs were walked before the allocation was
+ * satisfied (or determined unsatisfiable).
+ *
+ * Use this to characterize the cost of the linear spb_lists walk:
+ * - typical walk depth per allocation
+ * - per-(order, migratetype) walk-depth distribution
+ * - whether some workloads see pathologically long walks
+ *
+ * High-volume tracepoint (~1 emission per allocation, ~hundreds of
+ * thousands per second on busy systems). The static-key gating in
+ * the caller keeps cost at ~1 ns when the tracepoint is detached.
+ * When attached, expect ~100 ns/event (~10% CPU on a saturated
+ * allocator). Filter (e.g. on walk depth) to reduce volume:
+ * tracepoint:kmem:spb_alloc_walk /args->n_spbs_visited > 5/ { ... }
+ */
+TRACE_EVENT(spb_alloc_walk,
+
+ TP_PROTO(struct zone *zone, unsigned int order, int migratetype,
+ unsigned int alloc_flags, int outcome,
+ unsigned int n_spbs_visited),
+
+ TP_ARGS(zone, order, migratetype, alloc_flags, outcome,
+ n_spbs_visited),
+
+ TP_STRUCT__entry(
+ __string( name, zone->name )
+ __field( unsigned int, order )
+ __field( int, migratetype )
+ __field( unsigned int, alloc_flags )
+ __field( int, outcome )
+ __field( unsigned int, n_spbs_visited )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ __entry->alloc_flags = alloc_flags;
+ __entry->outcome = outcome;
+ __entry->n_spbs_visited = n_spbs_visited;
+ ),
+
+ TP_printk("zone=%s order=%u mt=%d alloc_flags=0x%x outcome=%s n_spbs_visited=%u",
+ __get_str(name),
+ __entry->order,
+ __entry->migratetype,
+ __entry->alloc_flags,
+ show_spb_alloc_outcome(__entry->outcome),
+ __entry->n_spbs_visited)
+);
+
+/*
+ * Diagnostic tracepoint fired when __rmqueue_smallest's tainted-SPB
+ * passes (Pass 1/2/2b/2c) all failed and the allocator is about to
+ * fall through to Pass 3 (which may taint a clean SPB) or to the
+ * fallback paths in __rmqueue_claim/__rmqueue_steal.
+ *
+ * Captures enough state to answer "why didn't an existing tainted SPB
+ * absorb this allocation?":
+ * - n_tainted_with_buddy: count of tainted SPBs whose free_area at
+ * the requested order has a non-empty free_list of the requested
+ * migratetype. >0 means buddies WERE available — Pass 1 missed
+ * them somehow. 0 means the tainted pool genuinely had nothing at
+ * the right (order, mt).
+ * - walk flags: snapshot of struct spb_tainted_walk gathered during
+ * Pass 1's walk. saw_free_pages = any tainted SPB had any free
+ * pages anywhere; saw_free_pb = any tainted SPB had a wholly-free
+ * pageblock; saw_below_reserve = any tainted SPB was at or below
+ * its reserve threshold.
+ *
+ * Fires once per fall-through event, so volume scales with the rate
+ * at which clean-SPB tainting becomes a possibility — typically rare
+ * once the workload reaches steady state.
+ */
+TRACE_EVENT(spb_alloc_fall_through,
+
+ TP_PROTO(struct zone *zone, unsigned int order, int migratetype,
+ unsigned int alloc_flags,
+ unsigned int n_tainted, unsigned int n_tainted_with_buddy,
+ bool saw_free_pages, bool saw_free_pb,
+ bool saw_below_reserve),
+
+ TP_ARGS(zone, order, migratetype, alloc_flags,
+ n_tainted, n_tainted_with_buddy,
+ saw_free_pages, saw_free_pb, saw_below_reserve),
+
+ TP_STRUCT__entry(
+ __string( name, zone->name )
+ __field( unsigned int, order )
+ __field( int, migratetype )
+ __field( unsigned int, alloc_flags )
+ __field( unsigned int, n_tainted )
+ __field( unsigned int, n_tainted_with_buddy )
+ __field( bool, saw_free_pages )
+ __field( bool, saw_free_pb )
+ __field( bool, saw_below_reserve )
+ ),
+
+ TP_fast_assign(
+ __assign_str(name);
+ __entry->order = order;
+ __entry->migratetype = migratetype;
+ __entry->alloc_flags = alloc_flags;
+ __entry->n_tainted = n_tainted;
+ __entry->n_tainted_with_buddy = n_tainted_with_buddy;
+ __entry->saw_free_pages = saw_free_pages;
+ __entry->saw_free_pb = saw_free_pb;
+ __entry->saw_below_reserve = saw_below_reserve;
+ ),
+
+ TP_printk("zone=%s order=%u mt=%d alloc_flags=0x%x n_tainted=%u n_tainted_with_buddy=%u walk=[fp=%d fpb=%d below=%d]",
+ __get_str(name),
+ __entry->order,
+ __entry->migratetype,
+ __entry->alloc_flags,
+ __entry->n_tainted,
+ __entry->n_tainted_with_buddy,
+ __entry->saw_free_pages,
+ __entry->saw_free_pb,
+ __entry->saw_below_reserve)
+);
+
TRACE_EVENT(mm_page_alloc_extfrag,
TP_PROTO(struct page *page,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e15e71d5ac99..815cee325ec0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -566,18 +566,39 @@ static void __spb_set_has_type(struct page *page, int migratetype)
return;
if (!get_pfnblock_bit(page, pfn, bit)) {
+ bool first = false;
+
set_pfnblock_bit(page, pfn, bit);
switch (bit) {
case PB_has_unmovable:
sb->nr_unmovable++;
+ first = (sb->nr_unmovable == 1);
+ if (first)
+ trace_printk("SB first unmovable: zone=%s sb=%lu pfn=%lu mt=%d rsv=%u mov=%u recl=%u free=%u\n",
+ sb->zone->name,
+ (unsigned long)(sb - sb->zone->superpageblocks),
+ pfn, migratetype,
+ sb->nr_reserved, sb->nr_movable,
+ sb->nr_reclaimable, sb->nr_free);
break;
case PB_has_reclaimable:
sb->nr_reclaimable++;
+ first = (sb->nr_reclaimable == 1);
+ if (first)
+ trace_printk("SB first reclaimable: zone=%s sb=%lu pfn=%lu mt=%d rsv=%u mov=%u unmov=%u free=%u\n",
+ sb->zone->name,
+ (unsigned long)(sb - sb->zone->superpageblocks),
+ pfn, migratetype,
+ sb->nr_reserved, sb->nr_movable,
+ sb->nr_unmovable, sb->nr_free);
break;
case PB_has_movable:
sb->nr_movable++;
+ first = (sb->nr_movable == 1);
break;
}
+ trace_spb_pb_taint(page, migratetype,
+ SPB_PB_TAINT_ACTION_SET, first);
spb_debug_check(sb, "__spb_set_has_type");
}
}
@@ -601,21 +622,28 @@ static void __spb_clear_has_type(struct page *page, int migratetype)
return;
if (get_pfnblock_bit(page, pfn, bit)) {
+ bool last = false;
+
clear_pfnblock_bit(page, pfn, bit);
switch (bit) {
case PB_has_unmovable:
if (sb->nr_unmovable)
sb->nr_unmovable--;
+ last = (sb->nr_unmovable == 0);
break;
case PB_has_reclaimable:
if (sb->nr_reclaimable)
sb->nr_reclaimable--;
+ last = (sb->nr_reclaimable == 0);
break;
case PB_has_movable:
if (sb->nr_movable)
sb->nr_movable--;
+ last = (sb->nr_movable == 0);
break;
}
+ trace_spb_pb_taint(page, migratetype,
+ SPB_PB_TAINT_ACTION_CLEAR, last);
spb_debug_check(sb, "__spb_clear_has_type");
}
}
@@ -2953,6 +2981,17 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int full;
struct superpageblock *sb;
int opposite_mt;
+ /*
+ * Diagnostic counter for the spb_alloc_walk tracepoint. Counts how
+ * many SPBs were visited (across all Passes) before this allocation
+ * succeeded or fell through. Used to characterize the cost of the
+ * linear spb_lists walk and identify pathological cases.
+ */
+ unsigned int n_spbs_visited = 0;
+
+#define SPB_WALK_DONE(_outcome) \
+ trace_spb_alloc_walk(zone, order, migratetype, alloc_flags, \
+ (_outcome), n_spbs_visited)
/*
* Category search order: 2 passes.
* Movable: clean first, then tainted (pack into clean SBs).
@@ -2998,6 +3037,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_1);
return page;
}
}
@@ -3013,6 +3053,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_1);
return page;
}
}
@@ -3049,6 +3090,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
+ n_spbs_visited++;
/*
* Snapshot tainted-SPB capacity before the
* nr_free_pages skip: an SPB with a free pageblock
@@ -3083,6 +3125,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_1);
if (migratetype < MIGRATE_PCPTYPES) {
struct spb_warm_hint_slot *slot;
@@ -3113,6 +3156,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_1);
if (migratetype < MIGRATE_PCPTYPES) {
struct spb_warm_hint_slot *slot;
@@ -3144,6 +3188,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
list_for_each_entry(sb,
&zone->spb_lists[SB_TAINTED][full], list) {
+ n_spbs_visited++;
if (!sb->nr_free)
continue;
for (current_order = max_t(unsigned int,
@@ -3168,6 +3213,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_2);
return page;
}
}
@@ -3178,6 +3224,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
&zone->spb_lists[SB_TAINTED][full], list) {
int co;
+ n_spbs_visited++;
if (!sb->nr_free_pages)
continue;
for (co = min_t(int, pageblock_order - 1,
@@ -3206,6 +3253,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_2B);
return page;
}
}
@@ -3263,6 +3311,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
&zone->spb_lists[SB_TAINTED][full], list) {
int co;
+ n_spbs_visited++;
if (!sb->nr_free_pages)
continue;
for (co = min_t(int, pageblock_order - 1,
@@ -3290,6 +3339,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_2C);
return page;
}
}
@@ -3335,6 +3385,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
&zone->spb_lists[SB_TAINTED][full], list) {
int co;
+ n_spbs_visited++;
if (!sb->nr_free_pages)
continue;
for (co = min_t(int, pageblock_order - 1,
@@ -3362,6 +3413,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_2D);
return page;
}
}
@@ -3404,8 +3456,40 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
+ /*
+ * Diagnostic: capture per-fall-through state so we can answer
+ * "why didn't an existing tainted SPB absorb this allocation?".
+ * The count loop walks the tainted-SPB lists looking for any SPB
+ * with a free buddy at the requested (order, migratetype). >0
+ * means buddies were available — Pass 1 missed them. 0 means
+ * the tainted pool genuinely had nothing usable. Loop is bounded
+ * by the number of tainted SPBs and runs only on the slow path
+ * (this is the fall-through to Pass 3/Pass 4). Skipped if the
+ * tracepoint is not active so there is zero cost in production.
+ */
+ if (walk && trace_spb_alloc_fall_through_enabled()) {
+ unsigned int n_tainted = 0, n_with_buddy = 0;
+
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ n_tainted++;
+ if (!list_empty(
+ &sb->free_area[order].free_list[migratetype]))
+ n_with_buddy++;
+ }
+ }
+ trace_spb_alloc_fall_through(zone, order, migratetype,
+ alloc_flags,
+ n_tainted, n_with_buddy,
+ walk->saw_free_pages,
+ walk->saw_free_pb,
+ walk->saw_below_reserve);
+ }
+
/* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
+ n_spbs_visited++;
if (!sb->nr_free_pages)
continue;
for (current_order = max(order, pageblock_order);
@@ -3421,6 +3505,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_3);
return page;
}
}
@@ -3439,6 +3524,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_for_each_entry(sb,
&zone->spb_lists[cat][full], list) {
+ n_spbs_visited++;
if (!sb->nr_free_pages)
continue;
/*
@@ -3463,6 +3549,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_PASS_4);
return page;
}
}
@@ -3487,10 +3574,13 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
trace_mm_page_alloc_zone_locked(page, order, migratetype,
pcp_allowed_order(order) &&
migratetype < MIGRATE_PCPTYPES);
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_ZONE_FALLBACK);
return page;
}
+ SPB_WALK_DONE(SPB_ALLOC_OUTCOME_NO_PAGE);
return NULL;
+#undef SPB_WALK_DONE
}
@@ -3909,8 +3999,11 @@ try_to_claim_block(struct zone *zone, struct page *page,
* Don't steal from pageblocks that are isolated for
* evacuation — that would undo the work in progress.
*/
- if (get_pageblock_isolate(page))
+ if (get_pageblock_isolate(page)) {
+ trace_spb_claim_block_refused(page, start_type, block_type,
+ SPB_CLAIM_REFUSED_ISOLATE);
return NULL;
+ }
/*
* Never steal from CMA pageblocks. CMA pages freed through
@@ -3919,8 +4012,11 @@ try_to_claim_block(struct zone *zone, struct page *page,
* fallback search. Stealing would corrupt CMA by changing
* the pageblock type away from MIGRATE_CMA.
*/
- if (is_migrate_cma(get_pageblock_migratetype(page)))
+ if (is_migrate_cma(get_pageblock_migratetype(page))) {
+ trace_spb_claim_block_refused(page, start_type, block_type,
+ SPB_CLAIM_REFUSED_CMA);
return NULL;
+ }
/* Take ownership for orders >= pageblock_order */
if (current_order >= pageblock_order)
@@ -3929,8 +4025,11 @@ try_to_claim_block(struct zone *zone, struct page *page,
/* moving whole block can fail due to zone boundary conditions */
if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages,
- &movable_pages))
+ &movable_pages)) {
+ trace_spb_claim_block_refused(page, start_type, block_type,
+ SPB_CLAIM_REFUSED_ZONE_BOUNDARY);
return NULL;
+ }
/*
* Determine how many pages are compatible with our allocation.
@@ -3969,11 +4068,17 @@ try_to_claim_block(struct zone *zone, struct page *page,
* the SPB is tainted.
*/
if (noncompatible_cross_type(start_type, block_type)) {
- if (free_pages != pageblock_nr_pages)
+ if (free_pages != pageblock_nr_pages) {
+ trace_spb_claim_block_refused(page, start_type,
+ block_type,
+ SPB_CLAIM_REFUSED_CROSS_TYPE_NOT_FREE);
return NULL;
+ }
} else if (!from_tainted_spb &&
free_pages + alike_pages < (1 << (pageblock_order-1)) &&
!page_group_by_mobility_disabled) {
+ trace_spb_claim_block_refused(page, start_type, block_type,
+ SPB_CLAIM_REFUSED_INSUFFICIENT_COMPAT);
return NULL;
}
@@ -6196,12 +6301,24 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
*/
if (no_fallback && !defrag_mode &&
!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
- if (gfp_mask & __GFP_NORETRY)
+ struct zone *pref = zonelist_zone(ac->preferred_zoneref);
+
+ if (gfp_mask & __GFP_NORETRY) {
+ trace_spb_alloc_atomic_relax(pref, order,
+ ac->migratetype, gfp_mask,
+ SPB_ATOMIC_RELAX_NORETRY_SKIP);
return NULL;
+ }
if (!(alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
+ trace_spb_alloc_atomic_relax(pref, order,
+ ac->migratetype, gfp_mask,
+ SPB_ATOMIC_RELAX_ADD_TAINTED_OK);
alloc_flags |= ALLOC_NOFRAG_TAINTED_OK;
goto retry;
}
+ trace_spb_alloc_atomic_relax(pref, order,
+ ac->migratetype, gfp_mask,
+ SPB_ATOMIC_RELAX_DROP_NOFRAGMENT);
alloc_flags &= ~(ALLOC_NOFRAGMENT | ALLOC_NOFRAG_TAINTED_OK);
goto retry;
}
@@ -10756,6 +10873,7 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
unsigned long sb_pfns[SPB_CONTIG_MAX_CANDIDATES];
unsigned long flags;
int nr_sbs, i;
+ unsigned int phase1_attempts = 0, phase2_attempts = 0;
bool did_evacuate = false;
/* Phase 1: coalesce within existing non-movable pageblocks */
@@ -10767,14 +10885,20 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
for (i = 0; i < nr_sbs; i++) {
unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
+ int n;
- if (evacuate_pb_range(zone, sb_pfns[i], end_pfn,
- migratetype, 3))
+ n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
+ migratetype, 3);
+ phase1_attempts += n;
+ if (n)
did_evacuate = true;
}
- if (did_evacuate)
+ if (did_evacuate) {
+ trace_spb_evacuate_for_order_done(zone, order, migratetype,
+ phase1_attempts, phase2_attempts, true);
return true;
+ }
/* Phase 2: evacuate MOVABLE pageblocks to create free whole pageblocks */
spin_lock_irqsave(&zone->lock, flags);
@@ -10785,9 +10909,12 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
for (i = 0; i < nr_sbs; i++) {
unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
+ int n;
- if (evacuate_pb_range(zone, sb_pfns[i], end_pfn,
- MIGRATE_MOVABLE, 3))
+ n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
+ MIGRATE_MOVABLE, 3);
+ phase2_attempts += n;
+ if (n)
did_evacuate = true;
}
@@ -10801,6 +10928,8 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
*/
queue_spb_slab_shrink(zone);
+ trace_spb_evacuate_for_order_done(zone, order, migratetype,
+ phase1_attempts, phase2_attempts, did_evacuate);
return did_evacuate;
}
#endif /* CONFIG_COMPACTION */
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread

* [RFC PATCH 45/45] mm: page_alloc: enlarge and unify spb_evacuate_for_order
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (43 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 44/45] mm: page_alloc: SPB tracepoint instrumentation [DROP-FOR-UPSTREAM] Rik van Riel
@ 2026-04-30 20:21 ` Rik van Riel
2026-05-01 7:14 ` [00/45 RFC PATCH] 1GB superpageblock memory allocation David Hildenbrand (Arm)
45 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-04-30 20:21 UTC (permalink / raw)
To: linux-kernel
Cc: kernel-team, linux-mm, david, willy, surenb, hannes, ljs, ziy,
usama.arif, Rik van Riel, Rik van Riel
From: Rik van Riel <riel@fb.com>
The slowpath in __alloc_pages_slowpath calls spb_evacuate_for_order
just before dropping ALLOC_NOFRAGMENT. Each successful evacuation
frees a MOV pageblock inside a tainted SPB so the retry can satisfy
a non-movable allocation via Pass 2 (claim_whole_block) without
having to drop NOFRAGMENT and let __rmqueue_claim taint a clean SPB.
Two problems with the existing implementation:
1. Per-call budget too small.
SPB_CONTIG_MAX_CANDIDATES = 4 (also used by contig allocation)
per-SPB pageblocks = 3 (hard-coded literal in evac calls)
→ up to 12 pageblocks scanned/migrated per call.
Production traces show 13 of 21 tainted SPBs typically have MOV
content (~2500 PB-units = ~5 GiB MOV-having pageblocks across the
tainted pool). The 4-candidate cap leaves the bulk of evacuatable
capacity untouched per call, so the slowpath frequently gives up
and drops NOFRAGMENT even though plenty of MOV content was there
to free.
2. Source-pageblock migratetype filter creates blind spots.
evacuate_pb_range(..., migratetype, ...) skipped any pageblock
whose underlying tag did not match the @migratetype argument.
Phase 1 used the requesting type; Phase 2 used MIGRATE_MOVABLE.
But MOV content can live in a pageblock of any tag:
- PASS_2C / PASS_2D borrows set PB_has_<requesting_mt> on a
MOV-tagged PB without changing the tag, then borrowed pages
return to the MOV free list when freed.
- __spb_set_has_type adds a non-MOV bit on a PB without
re-evaluating the PB tag. PBs accumulate has-bits over time.
Result: MOV stragglers in PBs whose tag does not match either
phase's filter are permanently invisible to evacuation.
Fix both:
* Introduce dedicated SPB_EVACUATE_MAX_CANDIDATES = 16 and
SPB_EVACUATE_MAX_PB_PER_SB = 8 so the evacuation budget can be
sized independently of the contig-allocation candidate cap.
Combined cap: 128 pageblocks (256 MiB) per call.
* Drop the migratetype tag filter from evacuate_pb_range. Accept
any pageblock with PB_has_movable set, skipping only the cases
whose semantics forbid touching them here (ISOLATE, CMA,
HIGHATOMIC).
* Collapse the two-phase structure in spb_evacuate_for_order into
a single pass. The two phases were doing the same evacuation
action with different filters; once the filter is relaxed the
distinction collapses naturally:
- PBs that are pure MOV become empty -> free MOV pageblock,
claimable by Pass 2 / claim_whole_block on the retry.
- PBs that are mixed lose their MOV stragglers, so future
allocations of the dominant type can use the PB without
competing with MOV residue.
* sb_collect_evacuate_candidates loses its migratetype parameter:
after the unified pass, the only candidate filter is nr_movable
> 0.
The contig-allocation path (spb_try_alloc_contig) is unchanged;
SPB_CONTIG_MAX_CANDIDATES remains at 4. 1 GiB allocations have a
different latency profile and broad evac-style scanning is not
appropriate there.
The trace_spb_evacuate_for_order_done signature is preserved for ABI
continuity with existing observers; the merged attempt count is
reported in phase1_attempts and phase2_attempts is reported as 0.
Stack impact: sb_pfns[] grows from 32 bytes to 128 bytes — trivial
for an 8K/16K kernel stack.
Per Rik's stated priority hierarchy:
P1 protect clean SPBs from being tainted (highest)
P2 protect MOV pageblocks inside tainted SPBs
P3 allocation latency (lowest)
trading a few hundred ms of evacuation latency to keep clean SPBs
clean is the desired direction.
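As a rough way to check whether the enlarged budget is actually being
consumed (a sketch only; it assumes the diagnostic tracepoints from
patch 44 are still applied and bpftrace is available, and relies on the
field names defined there), the per-call attempt distribution and the
success/failure split can be watched with:

  bpftrace -e 'tracepoint:kmem:spb_evacuate_for_order_done
      { @attempts = hist(args->phase1_attempts);
        @evacuated[args->did_evacuate] = count(); }'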
Signed-off-by: Rik van Riel <riel@surriel.com>
---
mm/page_alloc.c | 141 ++++++++++++++++++++++++++++--------------------
1 file changed, 84 insertions(+), 57 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 815cee325ec0..9cc8b9bbc1fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -10603,6 +10603,27 @@ static bool zone_spans_last_pfn(const struct zone *zone,
*/
#define SPB_CONTIG_MAX_CANDIDATES 4
+/*
+ * Maximum tainted superpageblock candidates per spb_evacuate_for_order call.
+ * Collected under zone->lock, then evacuated without it. Larger than the
+ * contig-allocation candidate cap because evacuation runs from the slowpath
+ * after reclaim/compaction failed: we need a meaningful chance of freeing a
+ * non-MOV-claimable pageblock before the slowpath escalates to dropping
+ * ALLOC_NOFRAGMENT (which lets __rmqueue_claim taint clean SPBs). Sized to
+ * scan a meaningful fraction of a typical tainted-pool population.
+ */
+#define SPB_EVACUATE_MAX_CANDIDATES 16
+
+/*
+ * Maximum pageblocks to evacuate per candidate SPB inside
+ * spb_evacuate_for_order. Each evacuation triggers page migration which is
+ * O(pages_per_pageblock) wall-clock cost, so this caps per-call latency.
+ * Bumped from 3 to 8 to free more capacity per slowpath escalation pass.
+ * Combined cap: SPB_EVACUATE_MAX_CANDIDATES * SPB_EVACUATE_MAX_PB_PER_SB
+ * pageblocks per call (16 * 8 = 128, i.e. up to 256 MiB migrated on x86).
+ */
+#define SPB_EVACUATE_MAX_PB_PER_SB 8
+
#ifdef CONFIG_COMPACTION
/**
* sb_collect_contig_candidates - Find superpageblock ranges for contiguous alloc
@@ -10778,7 +10799,7 @@ static struct page *spb_try_alloc_contig(struct zone *zone,
*
* Returns number of candidate superpageblock PFNs found.
*/
-static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
+static int sb_collect_evacuate_candidates(struct zone *zone,
unsigned long *sb_pfns, int max)
{
struct superpageblock *sb;
@@ -10792,20 +10813,6 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
if (!sb->nr_movable)
continue;
- if (migratetype >= 0) {
- bool has_matching;
-
- if (migratetype == MIGRATE_UNMOVABLE)
- has_matching = sb->nr_unmovable > 0;
- else if (migratetype == MIGRATE_RECLAIMABLE)
- has_matching = sb->nr_reclaimable > 0;
- else
- continue;
-
- if (!has_matching)
- continue;
- }
-
sb_pfns[n++] = sb->start_pfn;
if (n >= max)
return n;
@@ -10815,17 +10822,38 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
}
/*
- * Evacuate pageblocks of the given migratetype within a range.
+ * Evacuate MOV content out of any pageblock in the given range that has it.
+ *
+ * The previous version filtered on the source pageblock's migratetype tag,
+ * which made evacuation blind to MOV stragglers living in PBs whose tag did
+ * not match the current allocation's requesting type:
+ *
+ * - PASS_2C / PASS_2D borrows set PB_has_<requesting_mt> on a MOV-tagged
+ * PB without changing the tag. The borrowed pages return to the MOV
+ * free list when freed, so a MOV-tagged PB can host non-MOV PB_has bits
+ * and MOV content simultaneously.
+ *
+ * - When __spb_set_has_type adds a non-MOV bit on a PB, the PB tag is not
+ * re-evaluated. PBs accumulate has-bits over time without their tag
+ * necessarily reflecting current content.
+ *
+ * Drop the migratetype tag filter and accept any PB with PB_has_movable set.
+ * Skip only the cases whose semantics forbid touching them here:
+ * - MIGRATE_ISOLATE under quarantine
+ * - CMA own allocator
+ * - MIGRATE_HIGHATOMIC reserve, evac would race the reservation logic
+ *
* Returns number of pageblocks evacuated.
*/
static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
- unsigned long end_pfn, int migratetype, int max)
+ unsigned long end_pfn, int max)
{
unsigned long pfn;
int nr_evacuated = 0;
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
struct page *page;
+ int pb_mt;
if (!pfn_valid(pfn))
continue;
@@ -10835,10 +10863,13 @@ static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
page = pfn_to_page(pfn);
- if (get_pfnblock_migratetype(page, pfn) != migratetype)
+ if (!get_pfnblock_bit(page, pfn, PB_has_movable))
continue;
- if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+ pb_mt = get_pfnblock_migratetype(page, pfn);
+ if (pb_mt == MIGRATE_ISOLATE ||
+ is_migrate_cma(pb_mt) ||
+ pb_mt == MIGRATE_HIGHATOMIC)
continue;
evacuate_pageblock(zone, pfn, true);
@@ -10870,41 +10901,33 @@ static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int migratetype)
{
- unsigned long sb_pfns[SPB_CONTIG_MAX_CANDIDATES];
+ unsigned long sb_pfns[SPB_EVACUATE_MAX_CANDIDATES];
unsigned long flags;
int nr_sbs, i;
- unsigned int phase1_attempts = 0, phase2_attempts = 0;
+ unsigned int attempts = 0;
bool did_evacuate = false;
- /* Phase 1: coalesce within existing non-movable pageblocks */
- spin_lock_irqsave(&zone->lock, flags);
- nr_sbs = sb_collect_evacuate_candidates(zone, migratetype,
- sb_pfns,
- SPB_CONTIG_MAX_CANDIDATES);
- spin_unlock_irqrestore(&zone->lock, flags);
-
- for (i = 0; i < nr_sbs; i++) {
- unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
- int n;
-
- n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
- migratetype, 3);
- phase1_attempts += n;
- if (n)
- did_evacuate = true;
- }
-
- if (did_evacuate) {
- trace_spb_evacuate_for_order_done(zone, order, migratetype,
- phase1_attempts, phase2_attempts, true);
- return true;
- }
-
- /* Phase 2: evacuate MOVABLE pageblocks to create free whole pageblocks */
+ /*
+ * Single-pass evacuation: collect candidate tainted SPBs (anything
+ * with MOV content), then walk each one's pageblocks evacuating MOV
+ * content from any non-special PB. evacuate_pb_range filters by
+ * PB_has_movable, so this is a no-op on PBs that have no MOV content.
+ *
+ * Two effects accumulate:
+ * - PBs that are pure MOV become empty -> free MOV pageblock,
+ * claimable by Pass 2 / claim_whole_block on the retry.
+ * - PBs that are mixed (e.g., UNMOV + MOV stragglers) lose the MOV
+ * stragglers, so future allocations of the dominant type can use
+ * the PB without competing with the MOV residue.
+ *
+ * The previous two-phase design tried to do these separately and
+ * filtered evacuation by source PB tag. That left MOV content
+ * stranded in PBs whose tag did not match either phase, and gave up
+ * after one phase even though the other phase could have helped.
+ */
spin_lock_irqsave(&zone->lock, flags);
- nr_sbs = sb_collect_evacuate_candidates(zone, -1,
- sb_pfns,
- SPB_CONTIG_MAX_CANDIDATES);
+ nr_sbs = sb_collect_evacuate_candidates(zone, sb_pfns,
+ SPB_EVACUATE_MAX_CANDIDATES);
spin_unlock_irqrestore(&zone->lock, flags);
for (i = 0; i < nr_sbs; i++) {
@@ -10912,24 +10935,28 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
int n;
n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
- MIGRATE_MOVABLE, 3);
- phase2_attempts += n;
+ SPB_EVACUATE_MAX_PB_PER_SB);
+ attempts += n;
if (n)
did_evacuate = true;
}
/*
* Always kick a slab shrink after an evacuation pass — even when
- * movable evacuation succeeded. Slab content stranded inside
- * tainted SPBs can only be freed by shrinking the cache; doing
- * it now keeps headroom available for the next burst, when the
- * movable supply may have run out and movable evac alone would
- * have nothing to do.
+ * MOV evacuation succeeded. Slab content stranded inside tainted
+ * SPBs can only be freed by shrinking the cache; doing it now keeps
+ * headroom available for the next burst, when the MOV supply may
+ * have run out and evac alone would have nothing to do.
*/
queue_spb_slab_shrink(zone);
+ /*
+ * The tracepoint signature retains phase1_attempts / phase2_attempts
+ * for ABI continuity with existing observers; report the merged total
+ * in phase1_attempts and 0 in phase2_attempts.
+ */
trace_spb_evacuate_for_order_done(zone, order, migratetype,
- phase1_attempts, phase2_attempts, did_evacuate);
+ attempts, 0, did_evacuate);
return did_evacuate;
}
#endif /* CONFIG_COMPACTION */
--
2.52.0
^ permalink raw reply related [flat|nested] 48+ messages in thread

* Re: [00/45 RFC PATCH] 1GB superpageblock memory allocation
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
` (44 preceding siblings ...)
2026-04-30 20:21 ` [RFC PATCH 45/45] mm: page_alloc: enlarge and unify spb_evacuate_for_order Rik van Riel
@ 2026-05-01 7:14 ` David Hildenbrand (Arm)
2026-05-01 11:58 ` Rik van Riel
45 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-01 7:14 UTC (permalink / raw)
To: Rik van Riel, linux-kernel
Cc: kernel-team, linux-mm, willy, surenb, hannes, ljs, ziy,
usama.arif
On 4/30/26 22:20, Rik van Riel wrote:
Is there some text missing here, given that the first paragraph starts with
"Neither of those" and I am not sure which solutions you have in mind?
> Neither of those are great solutions, given that modern servers
> tend to be large, often run multiple workloads simultaneously,
> and each workload wants something else.
>
> To address that issue, this patch series divides memory not just
> into 2MB page blocks, but into PUD sized superpageblocks, and
> aggressively tries to steer unmovable, reclaimable, and highatomic
> allocations into those superpageblocks that have already been
> "tainted" by such allocations.
--
Cheers,
David
^ permalink raw reply [flat|nested] 48+ messages in thread

* Re: [00/45 RFC PATCH] 1GB superpageblock memory allocation
2026-05-01 7:14 ` [00/45 RFC PATCH] 1GB superpageblock memory allocation David Hildenbrand (Arm)
@ 2026-05-01 11:58 ` Rik van Riel
0 siblings, 0 replies; 48+ messages in thread
From: Rik van Riel @ 2026-05-01 11:58 UTC (permalink / raw)
To: David Hildenbrand (Arm), linux-kernel
Cc: kernel-team, linux-mm, willy, surenb, hannes, ljs, ziy,
usama.arif
On Fri, 2026-05-01 at 09:14 +0200, David Hildenbrand (Arm) wrote:
> On 4/30/26 22:20, Rik van Riel wrote:
>
> Is there some text missing here, given that the first paragraph
> starts with
> "Neither of those" and I am not sure which solutions you have in
> mind?
Yeah, sorry. I suspect what may have happened is I
failed to leave a blank line between the email
headers and the first paragraph when I inserted the
file into the send-email compose.
Here's the first paragraph:
Some workloads see real performance benefits from using 1GB pages,
but allocating 1GB pages has often been limited to hugetlb pages
that were set aside at boot time, or using CMA to keep a fixed
amount of system memory off limits to the kernel.
>
> > Neither of those are great solutions, given that modern servers
> > tend to be large, often run multiple workloads simultaneously,
> > and each workload wants something else.
> >
> > To address that issue, this patch series divides memory not just
> > into 2MB page blocks, but into PUD sized superpageblocks, and
> > aggressively tries to steer unmovable, reclaimable, and highatomic
> > allocations into those superpageblocks that have already been
> > "tainted" by such allocations.
>
>
Also, I know this series still needs a lot of
work. I will try to get many of those things out
of the way over the next few weeks.
Right now this mostly serves as an illustration
that reliable 1GB page allocation can be done.
Hopefully the next version will be more worthy
of reviewer time.
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 48+ messages in thread