[RFC PATCH 18/40] mm: page_alloc: prevent atomic allocations from tainting clean SPBs

From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 18/40] mm: page_alloc: prevent atomic allocations from tainting clean SPBs
Date: Wed, 20 May 2026 10:59:24 -0400
Message-ID: <20260520150018.2491267-19-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

Non-DIRECT_RECLAIM (atomic) allocations that fail with
ALLOC_NOFRAGMENT previously dropped the flag entirely and retried,
allowing them to taint clean superpageblocks (SPBs).  This was the
primary source of taint spreading observed on production systems.

Stage the relaxation in three steps that keep atomic allocations
inside tainted SPBs as long as possible:

1. Extend Pass 2 in __rmqueue_smallest with a sub-pageblock phase
   (Pass 2b).  Pass 2 only finds whole free pageblocks (>= pageblock
   order) in tainted SPBs.  Pass 2b searches for sub-pageblock-order
   free blocks and uses try_to_claim_block() to claim a pageblock
   that has enough compatible pages.  This finds pages in tainted
   SPBs that have fragmented free space but no whole free pageblocks.

2. Add an ALLOC_NOFRAG_TAINTED_OK intermediate flag.  Instead of
   going directly from ALLOC_NOFRAGMENT to no protection, atomic
   allocations first retry with ALLOC_NOFRAG_TAINTED_OK, which
   allows __rmqueue_steal to search tainted SPBs only.  Clean and
   empty SPBs remain protected.  Only if stealing from tainted SPBs
   also fails is ALLOC_NOFRAGMENT dropped entirely, as a last resort.

3. Bypass the pageblock compatibility threshold inside
   try_to_claim_block() when the call originates from the
   tainted-SPB walk in Pass 2b.  The
   free_pages + alike_pages >= 1 << (pageblock_order - 1) gate was
   designed to prevent the fragmenting fallback path from spreading
   migratetype mixing into clean SPBs.  Inside an already-tainted
   SPB the fragmentation has already been accepted, and the
   threshold rejects the typical fragmented MOVABLE pageblock that
   Pass 2b is meant to reclaim.  Without the bypass, Pass 2b would
   be largely a no-op.
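
As a worked illustration of the gate in step 3, here is a minimal
user-space model.  It is a sketch, not the kernel code: the
model_can_claim() name and the MODEL_PAGEBLOCK_ORDER value of 9 are
assumptions made for the example, and the
page_group_by_mobility_disabled case is left out.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the try_to_claim_block() gate; not kernel code.
 * A pageblock order of 9 (512 pages) is an assumption for illustration.
 */
#define MODEL_PAGEBLOCK_ORDER 9

static bool model_can_claim(int free_pages, int alike_pages,
			    bool from_tainted_spb)
{
	/* Half of the pageblock: 1 << (9 - 1) = 256 pages. */
	int threshold = 1 << (MODEL_PAGEBLOCK_ORDER - 1);

	/* Step 3: the tainted-SPB walk skips the threshold entirely. */
	if (from_tainted_spb)
		return true;

	/* All other callers need at least half the block free or alike. */
	return free_pages + alike_pages >= threshold;
}
```

With these assumed numbers, a MOVABLE pageblock holding 40 free pages
and no compatible pages is rejected on the normal path (40 < 256) but
accepted when the caller is the Pass 2b walk.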

For callers that pass __GFP_NORETRY, the relaxation sequence is
wrong in principle.  The NORETRY contract is "I have a fallback;
don't go to extreme lengths."  Network skb_page_frag_refill, slab
high-order allocations, and similar hot-path callers use NORETRY
exactly so the allocator can return NULL and let their own fallback
(smaller frag, lower-order slab, etc.) take over.  Tainting a clean
superpageblock to satisfy such a request is a lasting cost that
outlives the single allocation that triggered it: the SPB stays
tainted for the remainder of the workload's lifetime, blocking
1 GiB hugepage allocation from that region.  Skip the relaxation
steps for NORETRY callers and return NULL immediately; their
fallback path absorbs the failure cleanly.
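
The staged relaxation and the NORETRY exit can be modelled as a small
state machine over alloc_flags.  The sketch below is a user-space
approximation of the decision made after a failed zonelist pass; the
MODEL_* names and the model_after_failed_pass() helper are
hypothetical, the MODEL_ALLOC_NOFRAGMENT value is assumed, and only
the 0x1000 value is taken from the mm/internal.h hunk.

```c
#include <stdbool.h>

/* Assumed value for illustration; the kernel's may differ by config. */
#define MODEL_ALLOC_NOFRAGMENT		0x100u
/* Value from the mm/internal.h hunk in this patch. */
#define MODEL_ALLOC_NOFRAG_TAINTED_OK	0x1000u

enum model_action {
	MODEL_CONTINUE,		/* no relaxation; fall out of the block */
	MODEL_RETURN_NULL,	/* NORETRY caller: fail immediately */
	MODEL_RETRY,		/* walk the zonelist again */
};

/* Hypothetical helper: what to do after one failed pass. */
static enum model_action
model_after_failed_pass(bool defrag_mode, bool direct_reclaim, bool noretry,
			unsigned int *alloc_flags)
{
	bool no_fallback = *alloc_flags & MODEL_ALLOC_NOFRAGMENT;

	/* Only atomic allocations with NOFRAGMENT set are relaxed here. */
	if (!no_fallback || defrag_mode || direct_reclaim)
		return MODEL_CONTINUE;

	/* The caller has its own fallback; do not taint a clean SPB. */
	if (noretry)
		return MODEL_RETURN_NULL;

	/* Stage 1: allow stealing, but from tainted SPBs only. */
	if (!(*alloc_flags & MODEL_ALLOC_NOFRAG_TAINTED_OK)) {
		*alloc_flags |= MODEL_ALLOC_NOFRAG_TAINTED_OK;
		return MODEL_RETRY;
	}

	/* Stage 2: last resort, drop all fragmentation protection. */
	*alloc_flags &= ~(MODEL_ALLOC_NOFRAGMENT |
			  MODEL_ALLOC_NOFRAG_TAINTED_OK);
	return MODEL_RETRY;
}
```

In this model an atomic allocation without NORETRY makes at most two
extra passes: one with both flags set, then one with neither.  A
third failure falls out of the block because NOFRAGMENT is no longer
set.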

Observed on a 250 GB system running this superpageblock series:
an atomic order-3 allocation from swapper context (PCP refill,
gfp=0x152820 = __GFP_HIGH | __GFP_KSWAPD_RECLAIM | __GFP_NOWARN |
__GFP_NORETRY | __GFP_COMP | __GFP_HARDWALL) tainted a fresh clean
SPB roughly 90 minutes after boot despite ALLOC_NOFRAGMENT being
set, because the atomic-retry path stripped the flag.  The caller
had a NORETRY fallback ready; the taint was gratuitous.
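
For readers who want to check the gfp decode above, a minimal sketch
follows.  The bit values are assumptions based on the flag layout in
recent mainline kernels and can differ between versions and configs;
the MODEL_* names and model_takes_noretry_exit() are hypothetical.

```c
#include <stdbool.h>

/* Assumed bit values; verify against the kernel tree being debugged. */
#define MODEL_GFP_HIGH			0x000020u
#define MODEL_GFP_DIRECT_RECLAIM	0x000400u
#define MODEL_GFP_KSWAPD_RECLAIM	0x000800u
#define MODEL_GFP_NOWARN		0x002000u
#define MODEL_GFP_NORETRY		0x010000u
#define MODEL_GFP_COMP			0x040000u
#define MODEL_GFP_HARDWALL		0x100000u

/* The mask reported in the observation above. */
#define MODEL_OBSERVED_GFP		0x152820u

/*
 * Hypothetical predicate: true when a mask describes an atomic
 * (no direct reclaim) allocation that also passed NORETRY, which is
 * the case this patch makes return NULL instead of relaxing.
 */
static bool model_takes_noretry_exit(unsigned int gfp)
{
	return !(gfp & MODEL_GFP_DIRECT_RECLAIM) &&
	       (gfp & MODEL_GFP_NORETRY);
}
```

Under these assumed values the six listed flags OR together to
0x152820, and the mask has __GFP_NORETRY set without
__GFP_DIRECT_RECLAIM, so the observed allocation would take the new
early-return path.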

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/internal.h   |   1 +
 mm/page_alloc.c | 120 +++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 110 insertions(+), 11 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index c0dbc2e4b7f0..e6d61dbc18d9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1511,6 +1511,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
 #define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
+#define ALLOC_NOFRAG_TAINTED_OK	0x1000 /* NOFRAGMENT, but allow steal from tainted SPBs */
 
 /* Flags that allow allocations below the min watermark. */
 #define ALLOC_RESERVES (ALLOC_NON_BLOCK|ALLOC_MIN_RESERVE|ALLOC_HIGHATOMIC|ALLOC_OOM)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6a07bd72c0b..6884f638a97c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2729,6 +2729,10 @@ static struct page *__rmqueue_from_sb(struct zone *zone, unsigned int order,
  */
 static struct page *claim_whole_block(struct zone *zone, struct page *page,
 		  int current_order, int order, int new_type, int old_type);
+static struct page *try_to_claim_block(struct zone *zone, struct page *page,
+		  int current_order, int order, int start_type,
+		  int block_type, unsigned int alloc_flags,
+		  bool from_tainted_spb);
 
 static __always_inline
 struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
@@ -2798,6 +2802,11 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	 * free list (reset by mark_pageblock_free), so the search above
 	 * misses them. Claim them inline to keep non-movable allocations
 	 * concentrated in already-tainted superpageblocks.
+	 *
+	 * Try whole pageblock orders first (preferred for PCP buddy
+	 * optimization), then fall back to sub-pageblock orders. Sub-pageblock
+	 * claiming goes through try_to_claim_block() with from_tainted_spb set,
+	 * so the per-pageblock compatibility threshold is skipped.
 	 */
 	if (!movable && !is_migrate_cma(migratetype)) {
 		for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
@@ -2830,6 +2839,43 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 				}
 			}
 		}
+		/* Pass 2b: sub-pageblock orders in tainted SPBs */
+		for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+			list_for_each_entry(sb,
+				&zone->spb_lists[SB_TAINTED][full], list) {
+				int co;
+
+				if (!sb->nr_free_pages)
+					continue;
+				for (co = min_t(int, pageblock_order - 1,
+						NR_PAGE_ORDERS - 1);
+				     co >= (int)order;
+				     --co) {
+					current_order = co;
+					area = &sb->free_area[current_order];
+					page = get_page_from_free_area(
+						area, MIGRATE_MOVABLE);
+					if (!page)
+						continue;
+					if (get_pageblock_isolate(page))
+						continue;
+					if (is_migrate_cma(
+					    get_pageblock_migratetype(page)))
+						continue;
+					page = try_to_claim_block(zone, page,
+						current_order, order,
+						migratetype, MIGRATE_MOVABLE,
+						0, true);
+					if (!page)
+						continue;
+					trace_mm_page_alloc_zone_locked(
+						page, order, migratetype,
+						pcp_allowed_order(order) &&
+						migratetype < MIGRATE_PCPTYPES);
+					return page;
+				}
+			}
+		}
 	}
 
 	/* Empty superpageblocks: try before falling back to non-preferred category */
@@ -3298,11 +3344,17 @@ claim_whole_block(struct zone *zone, struct page *page,
  * not, we check the pageblock for constituent pages; if at least half of the
  * pages are free or compatible, we can still claim the whole block, so pages
  * freed in the future will be put on the correct free list.
+ *
+ * @from_tainted_spb: caller has already verified the block lives in a tainted
+ * superpageblock, where SPB-level fragmentation has already been accepted.
+ * Skip the per-pageblock compatibility threshold so we can absorb non-movable
+ * demand into the existing tainted SPB instead of tainting a fresh clean one.
  */
 static struct page *
 try_to_claim_block(struct zone *zone, struct page *page,
 		   int current_order, int order, int start_type,
-		   int block_type, unsigned int alloc_flags)
+		   int block_type, unsigned int alloc_flags,
+		   bool from_tainted_spb)
 {
 	int free_pages, movable_pages, alike_pages;
 	unsigned long start_pfn;
@@ -3362,8 +3414,14 @@ try_to_claim_block(struct zone *zone, struct page *page,
 	/*
 	 * If a sufficient number of pages in the block are either free or of
 	 * compatible migratability as our allocation, claim the whole block.
-	 */
-	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
+	 * The compatibility threshold protects clean MOVABLE pageblocks from
+	 * being relabeled when most of their pages are still in-use movable
+	 * allocations. Inside a tainted SPB the protection is unnecessary:
+	 * fragmentation has already been accepted at the SPB level, and
+	 * relabeling is much cheaper than tainting a fresh clean SPB.
+	 */
+	if (from_tainted_spb ||
+	    free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
 		__move_freepages_block(zone, start_pfn, block_type, start_type);
 		set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);
@@ -3565,7 +3623,8 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
 
 			page = try_to_claim_block(zone, page, current_order,
 						  order, start_migratetype,
-						  fallback_mt, alloc_flags);
+						  fallback_mt, alloc_flags,
+						  false);
 			if (page) {
 				trace_mm_page_alloc_extfrag(page, order,
 					current_order, start_migratetype,
@@ -3583,12 +3642,23 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
  * the block as its current migratetype, potentially causing fragmentation.
  */
 static __always_inline struct page *
-__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype,
+		unsigned int alloc_flags)
 {
 	struct superpageblock *sb;
 	int current_order;
 	struct page *page;
 	int fallback_mt;
+	unsigned int search_cats;
+
+	/*
+	 * When ALLOC_NOFRAG_TAINTED_OK is set, only steal from tainted
+	 * SPBs to avoid tainting clean ones. Otherwise search all categories.
+	 */
+	if (alloc_flags & ALLOC_NOFRAG_TAINTED_OK)
+		search_cats = SB_SEARCH_PREFERRED;
+	else
+		search_cats = SB_SEARCH_PREFERRED | SB_SEARCH_FALLBACK;
 
 	/*
 	 * Search per-superpageblock free lists for fallback migratetypes.
@@ -3598,7 +3668,7 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
 		page = __rmqueue_sb_find_fallback(zone, current_order,
 					start_migratetype,
 					&fallback_mt,
-					SB_SEARCH_PREFERRED | SB_SEARCH_FALLBACK);
+					search_cats);
 
 		if (!page)
 			continue;
@@ -3698,8 +3768,10 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 		}
 		fallthrough;
 	case RMQUEUE_STEAL:
-		if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
-			page = __rmqueue_steal(zone, order, migratetype);
+		if (!(alloc_flags & ALLOC_NOFRAGMENT) ||
+		    (alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
+			page = __rmqueue_steal(zone, order, migratetype,
+					       alloc_flags);
 			if (page) {
 				*mode = RMQUEUE_STEAL;
 				return page;
@@ -5408,9 +5480,35 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	/*
 	 * It's possible on a UMA machine to get through all zones that are
 	 * fragmented. If avoiding fragmentation, reset and try again.
-	 */
-	if (no_fallback && !defrag_mode) {
-		alloc_flags &= ~ALLOC_NOFRAGMENT;
+	 *
+	 * For allocations that can do direct reclaim, keep NOFRAGMENT set
+	 * and let the slowpath try reclaim and compaction to free pages in
+	 * already-tainted superpageblocks before allowing clean SPBs to be
+	 * tainted.
+	 *
+	 * Atomic allocations cannot reclaim, but try an intermediate step
+	 * first: allow steal/claim from tainted SPBs only. This avoids
+	 * tainting clean SPBs while still finding pages in tainted ones.
+	 * Only drop NOFRAGMENT entirely if that also fails.
+	 *
+	 * Exception: callers that explicitly opted into failure with
+	 * __GFP_NORETRY have a fallback path of their own (a smaller
+	 * order, a different cache, returning NULL from a best-effort
+	 * cache refill, etc.). Tainting a clean superpageblock is a
+	 * lasting cost that outlives this allocation; it is not justified
+	 * to absorb it just to satisfy a caller that already has a
+	 * cheaper escape hatch. Return NULL and let the caller's fallback
+	 * run instead.
+	 */
+	if (no_fallback && !defrag_mode &&
+	    !(gfp_mask & __GFP_DIRECT_RECLAIM)) {
+		if (gfp_mask & __GFP_NORETRY)
+			return NULL;
+		if (!(alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
+			alloc_flags |= ALLOC_NOFRAG_TAINTED_OK;
+			goto retry;
+		}
+		alloc_flags &= ~(ALLOC_NOFRAGMENT | ALLOC_NOFRAG_TAINTED_OK);
 		goto retry;
 	}
 
-- 
2.54.0

