From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 35/40] mm: page_alloc: refuse best-effort high-order allocs servable at lower orders
Date: Wed, 20 May 2026 10:59:41 -0400
Message-ID: <20260520150018.2491267-36-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

Several common high-order allocation patterns are best-effort: the caller
prefers a single large page for performance but has an order-decrement
fallback (or equivalent retry path) and is happy to accept failure of the
high-order attempt. Examples:

  kvmalloc()              kmalloc attempt has a vmalloc fallback
  vmalloc()               vm_area_alloc_pages decrements order on NULL
  alloc_skb_with_frags()  decrements order on NULL per fragment

The convention these callers share is to strip __GFP_DIRECT_RECLAIM and
set __GFP_NOWARN on the high-order attempt, signaling 'I don't want this
to block on direct reclaim and I'm fine with failure being silent'.
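
As an illustration of this convention, here is a minimal userspace sketch
of an order-decrement caller. All demo_* names and the flag bit values are
hypothetical and chosen for the example; they are not the kernel's gfp_t
definitions, and demo_alloc() merely stands in for the page allocator.

```c
#include <stdbool.h>

/* Illustrative flag bits; NOT the kernel's real gfp_t values. */
#define DEMO_GFP_DIRECT_RECLAIM	(1u << 0)
#define DEMO_GFP_NOWARN		(1u << 1)

/* Hypothetical stand-in allocator: succeeds only up to max_ok_order. */
struct demo_allocator {
	unsigned int max_ok_order;
	unsigned int attempts;
};

static bool demo_alloc(struct demo_allocator *a, unsigned int gfp,
		       unsigned int order)
{
	(void)gfp;
	a->attempts++;
	return order <= a->max_ok_order;
}

/* Build the mask for the opportunistic high-order attempt. */
static unsigned int demo_best_effort_gfp(unsigned int gfp)
{
	gfp &= ~DEMO_GFP_DIRECT_RECLAIM;	/* do not block on reclaim */
	gfp |= DEMO_GFP_NOWARN;			/* failure may be silent */
	return gfp;
}

/*
 * Try the requested order first and step down on failure. Only the
 * final order-0 attempt uses the caller's original mask.
 * Returns the order that succeeded, or -1 if every attempt failed.
 */
static int demo_alloc_with_fallback(struct demo_allocator *a,
				    unsigned int gfp, unsigned int order)
{
	unsigned int o;

	for (o = order; o > 0; o--) {
		if (demo_alloc(a, demo_best_effort_gfp(gfp), o))
			return (int)o;
	}
	return demo_alloc(a, gfp, 0) ? 0 : -1;
}
```

The point of the sketch is that a refused high-order attempt costs such a
caller one more loop iteration, not a failed operation.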

Without further hints, get_page_from_freelist's relax sequence treats
these as atomic allocs that must succeed and escalates: it adds
ALLOC_NOFRAG_TAINTED_OK (allowing PASS_2/2B claim_whole_block to relabel
a MOV pageblock inside a tainted SPB) and then drops ALLOC_NOFRAGMENT
entirely (allowing __rmqueue_claim/_steal to taint a clean SPB). The
caller's order-decrement fallback never runs because the high-order
attempt 'succeeds' by tainting.

The per-call-site fix is to add __GFP_NORETRY (kmalloc_gfp_adjust
already does this for kvmalloc). Generalize that: in the relax sequence,
before dropping NOFRAGMENT, detect the 'best-effort high-order with
fallback' pattern by checking all of:

   order > 0
   __GFP_NOWARN set
   __GFP_NOFAIL not set
   __GFP_DIRECT_RECLAIM already cleared (the relax-sequence gate above)
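
The four conditions above, and how they slot in next to the existing
__GFP_NORETRY early return, can be sketched in plain C as follows. This
is a simplified model, not the kernel code: the demo_* names and flag
bits are hypothetical, and the result of the tainted-pool check is
passed in as a plain boolean.

```c
#include <stdbool.h>

/* Illustrative flag bits; NOT the kernel's real gfp_t values. */
#define DEMO_GFP_DIRECT_RECLAIM	(1u << 0)
#define DEMO_GFP_NOWARN		(1u << 1)
#define DEMO_GFP_NOFAIL		(1u << 2)
#define DEMO_GFP_NORETRY	(1u << 3)

enum demo_action { DEMO_REFUSE, DEMO_ESCALATE };

/* True when the request matches the best-effort high-order pattern. */
static bool demo_is_best_effort_high_order(unsigned int gfp,
					   unsigned int order)
{
	return order > 0 &&
	       (gfp & DEMO_GFP_NOWARN) &&
	       !(gfp & DEMO_GFP_NOFAIL) &&
	       !(gfp & DEMO_GFP_DIRECT_RECLAIM);
}

/*
 * Models the decision taken inside the relax-sequence gate. The caller
 * is assumed to have checked the gate already (in particular that
 * direct reclaim is not allowed), so this is only meaningful for such
 * requests.
 */
static enum demo_action demo_relax_decision(unsigned int gfp,
					    unsigned int order,
					    bool tainted_can_serve)
{
	if (gfp & DEMO_GFP_NORETRY)
		return DEMO_REFUSE;
	if (demo_is_best_effort_high_order(gfp, order) && tainted_can_serve)
		return DEMO_REFUSE;
	return DEMO_ESCALATE;
}
```

With these definitions, a NOWARN order-3 request is refused only when the
tainted pool can serve the retry; otherwise escalation proceeds as before.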

If the tainted pool can plausibly serve a smaller (or same) order alloc
on the caller's retry, refuse the current attempt instead of escalating.
'Plausibly serve' means some tainted SPB has at least one of:

  - nr_movable > 0  (MOV content exists; reclaim/migration can free
                     pageblocks at the order the caller's retry needs,
                     including orders >= the requested order -- e.g.
                     four THPs in the SPB can yield an order-7 buddy
                     for an order-7 unmovable alloc once the THPs are
                     migrated), OR
  - a free buddy on the requesting migratetype's own list at an order
    < requested (a smaller PASS_1 retry would succeed directly), OR
  - a free buddy on the opposite non-MOV list at an order < requested
    (PASS_2C borrow at the smaller order would succeed) -- only relevant
    for UNMOV/RECL allocs.

The MOV-content check alone covers the common case cheaply (one counter
read per tainted SPB) and works even when the movable memory exists at
orders larger than the alloc -- which is exactly when the per-order
free_list walk would miss it.
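
For readers who want the 'plausibly serve' test in isolation, here is a
simplified userspace model. It is a sketch under stated assumptions: the
demo_* types are hypothetical, each SPB's per-order free lists are
reduced to a bitmask (bit o set means a non-empty list at order o), the
SPBs sit in a flat array, and the zone->lock handling and nr_free_pages
shortcut of the real function are left out.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_mt { DEMO_UNMOVABLE, DEMO_MOVABLE, DEMO_RECLAIMABLE, DEMO_NR_MT };

/* Hypothetical, heavily reduced model of a tainted superpageblock. */
struct demo_spb {
	unsigned int nr_movable;		/* pageblocks with MOV content */
	unsigned int free_orders[DEMO_NR_MT];	/* bit o: free buddy at order o */
};

/* UNMOV and RECL may borrow from each other; other types have no peer. */
static int demo_opposite_mt(enum demo_mt mt)
{
	if (mt == DEMO_UNMOVABLE)
		return DEMO_RECLAIMABLE;
	if (mt == DEMO_RECLAIMABLE)
		return DEMO_UNMOVABLE;
	return -1;
}

static bool demo_can_serve_smaller(const struct demo_spb *spbs, size_t n,
				   unsigned int order, enum demo_mt mt)
{
	int opp = demo_opposite_mt(mt);
	unsigned int below;
	size_t i;

	if (order == 0)
		return false;

	/* Mask selecting every order strictly below the requested one. */
	if (order >= sizeof(below) * CHAR_BIT)
		below = ~0u;
	else
		below = (1u << order) - 1;

	for (i = 0; i < n; i++) {
		/* Cheap counter check first: MOV content can be evacuated. */
		if (spbs[i].nr_movable > 0)
			return true;
		/* Own-list buddy at a smaller order. */
		if (spbs[i].free_orders[mt] & below)
			return true;
		/* Opposite non-MOV list, only for UNMOV/RECL requests. */
		if (opp >= 0 && (spbs[i].free_orders[opp] & below))
			return true;
	}
	return false;
}
```

Note how the counter check answers before any free-list state is
consulted, which is why movable memory held at orders above the request
is still noticed.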

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/page_alloc.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 122 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 116d9cc0a493..2791a52b61da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2708,6 +2708,101 @@ static inline u16 spb_tainted_reserve(const struct superpageblock *sb)
 	return max_t(u16, SPB_TAINTED_RESERVE_MIN, sb->total_pageblocks / 32);
 }
 
+/*
+ * spb_tainted_can_serve_smaller - could a smaller-order @migratetype alloc
+ * be satisfied from any tainted SPB of @zone (now or after evac/reclaim)?
+ *
+ * "Yes" if any of:
+ *   - some tainted SPB has nr_movable > 0 (MOV content exists; reclaim or
+ *     compaction/evac can free pageblocks at any order the caller's
+ *     order-decrement fallback might want, including orders >= the original
+ *     requested order -- e.g. four THPs in the SPB can yield an order-7
+ *     buddy for an order-7 unmovable alloc once the THPs are migrated), OR
+ *   - some tainted SPB has a free buddy on the requesting migratetype's
+ *     own list at an order < @order (a smaller PASS_1 retry would
+ *     succeed directly), OR
+ *   - some tainted SPB has a free buddy on the opposite non-MOV list at
+ *     an order < @order (PASS_2C borrow at the smaller order would
+ *     succeed) -- only meaningful for UNMOV/RECL allocs.
+ *
+ * Used by the get_page_from_freelist relax sequence to discriminate
+ * "the caller has an order-decrement fallback that the tainted pool can
+ * eventually serve" from "the alloc must escalate to dropping
+ * ALLOC_NOFRAGMENT and tainting a clean SPB".
+ *
+ * Walks zone->spb_lists[SB_TAINTED][*] under zone->lock: spb_update_list()
+ * mutates these same lists under zone->lock, so a lockless walk would race
+ * with list-cursor reassignment (list_move from a concurrent allocator
+ * caller could splice the cursor onto a different list and turn the walk
+ * into an infinite loop or crash on a corrupted list_head). Sister function
+ * tainted_pool_has_free() takes zone->lock for the same reason; match its
+ * lock discipline. Bounded by the tainted SPB count plus a constant amount
+ * of work per SPB.
+ */
+static bool spb_tainted_can_serve_smaller(struct zone *zone,
+					  unsigned int order,
+					  int migratetype)
+{
+	struct superpageblock *sb;
+	unsigned long flags;
+	bool found = false;
+	int full;
+	unsigned int o;
+	int opposite_mt = -1;
+
+	if (order == 0)
+		return false;
+
+	if (migratetype == MIGRATE_UNMOVABLE)
+		opposite_mt = MIGRATE_RECLAIMABLE;
+	else if (migratetype == MIGRATE_RECLAIMABLE)
+		opposite_mt = MIGRATE_UNMOVABLE;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	for (full = 0; full < __NR_SB_FULLNESS && !found; full++) {
+		list_for_each_entry(sb, &zone->spb_lists[SB_TAINTED][full],
+				    list) {
+			/*
+			 * MOV content can be reclaimed (LRU folios) or
+			 * migrated (compaction / spb_evacuate_for_order),
+			 * making the SPB able to host a smaller (or even
+			 * same-order) non-MOV alloc on the retry. Cheap
+			 * counter check, covers most real cases.
+			 */
+			if (sb->nr_movable > 0) {
+				found = true;
+				break;
+			}
+
+			if (!sb->nr_free_pages)
+				continue;
+
+			/*
+			 * No MOV content but there might be a same-mt or
+			 * opposite-non-MOV buddy at a smaller order that a
+			 * PASS_1 retry / PASS_2C borrow could serve.
+			 */
+			for (o = 0; o < order; o++) {
+				struct free_area *area = &sb->free_area[o];
+
+				if (!list_empty(&area->free_list[migratetype])) {
+					found = true;
+					break;
+				}
+				if (opposite_mt >= 0 &&
+				    !list_empty(&area->free_list[opposite_mt])) {
+					found = true;
+					break;
+				}
+			}
+			if (found)
+				break;
+		}
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return found;
+}
+
 /*
  * High-water threshold for proactively kicking the slab shrinker. When a
  * non-movable allocation consumes from a tainted SPB whose total free
@@ -6303,8 +6398,35 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 */
 	if (no_fallback && !defrag_mode &&
 	    !(gfp_mask & __GFP_DIRECT_RECLAIM)) {
+		struct zone *pref = zonelist_zone(ac->preferred_zoneref);
+
 		if (gfp_mask & __GFP_NORETRY)
 			return NULL;
+
+		/*
+		 * Best-effort high-order caller convention: stripping
+		 * __GFP_DIRECT_RECLAIM, setting __GFP_NOWARN, omitting
+		 * __GFP_NOFAIL, and asking for a high order indicates the
+		 * caller has an order-decrement fallback (kvmalloc's
+		 * vmalloc fallback, vmalloc's order-decrement loop,
+		 * alloc_skb_with_frags's order-decrement loop, ...).
+		 *
+		 * If the tainted-SPB pool has movable content or a free
+		 * buddy at a lower order that a smaller retry could use,
+		 * refuse this attempt so the caller's order-decrement
+		 * fallback uses that space instead of forcing us to
+		 * drop ALLOC_NOFRAGMENT and taint a clean SPB.
+		 *
+		 * Same intent as adding __GFP_NORETRY at every such
+		 * caller, but applied centrally so we cover both existing
+		 * and future callers without per-call-site fixes.
+		 */
+		if (order > 0 && (gfp_mask & __GFP_NOWARN) &&
+		    !(gfp_mask & __GFP_NOFAIL) &&
+		    spb_tainted_can_serve_smaller(pref, order,
+						  ac->migratetype))
+			return NULL;
+
 		if (!(alloc_flags & ALLOC_NOFRAG_TAINTED_OK)) {
 			alloc_flags |= ALLOC_NOFRAG_TAINTED_OK;
 			goto retry;
-- 
2.54.0


