From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
Rik van Riel <riel@meta.com>, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 21/45] mm: page_alloc: prefer reclaim over tainting clean superpageblocks
Date: Thu, 30 Apr 2026 16:20:50 -0400
Message-ID: <20260430202233.111010-22-riel@surriel.com>
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
From: Rik van Riel <riel@meta.com>
When the allocator needs pages for unmovable or reclaimable allocations
and tainted superpageblocks are exhausted, it currently falls through to
clean superpageblocks immediately, permanently tainting them. This
defeats the purpose of superpageblock anti-fragmentation.
Restructure the allocation fallback cascade to try reclaim and compaction
before tainting clean superpageblocks:
1. Reorder __rmqueue_smallest to search each preferred superpageblock (SPB)
completely before moving to the next source. Within each preferred SPB, try
whole-pageblock allocations first (for PCP buddy optimization),
then fall back to sub-pageblock allocations. This ensures that
sub-pageblock free pages in existing tainted SPBs are used before
tainting empty or clean SPBs. The pass order is:
- Preferred SPBs: whole pageblock first, then sub-pageblock
- Whole pageblock inline claim from tainted SPBs (non-movable only)
- Whole pageblock from empty SPBs
- Fallback to non-preferred SPBs
2. In get_page_from_freelist(), only drop ALLOC_NOFRAGMENT immediately
for allocations that cannot do direct reclaim (atomic). Allocations
that can reclaim keep ALLOC_NOFRAGMENT set and enter the slowpath,
where reclaim and compaction can free pages in already-tainted SPBs.
3. Preserve ALLOC_NOFRAGMENT through the slowpath by calling
alloc_flags_nofragment() after gfp_to_alloc_flags(). Previously
the slowpath only set NOFRAGMENT for defrag_mode, losing the SPB
protection that the fastpath established.
4. After reclaim and compaction have both been tried and failed, drop
ALLOC_NOFRAGMENT unconditionally as a last resort before OOM.
Previously this was gated on defrag_mode.
Testing shows that with this change, clean superpageblocks maintain
unmov=0 throughout a heavy mixed workload (swap pressure, filesystem
metadata, anonymous memory cycling, compaction, hugepage allocation),
where previously 2-3 additional SPBs would become tainted with 7-8
unmovable pageblocks each.
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 74 ++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 61 insertions(+), 13 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 215b7d6b95d2..8f925b5a2e5f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2764,11 +2764,23 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
* concentrate non-movable allocations into fewer superpageblocks.
* For movable, prefer clean superpageblocks to keep them homogeneous.
*
- * Search empty superpageblocks between the preferred and fallback
- * category passes to avoid movable allocations consuming free
- * pageblocks in tainted superpageblocks (which unmovable needs for
- * future CLAIMs), and vice versa.
+ * Prefer whole pageblock allocations (>= pageblock_order) over
+ * sub-pageblock allocations because whole pageblocks enable the
+ * PCP buddy optimization for fast subsequent allocations.
+ *
+ * Search order:
+ * 1. Preferred SPBs: whole pageblock first, then sub-pageblock
+ * 2. Whole pageblock inline claim from tainted SPBs (non-movable only)
+ * 3. Whole pageblock from empty SPBs
+ * 4. Fallback to non-preferred SPBs
+ *
+ * Pass 1 tries whole pageblock first for PCP buddy optimization,
+ * then falls back to sub-pageblock within the same preferred SPBs.
+ * This ensures we never taint empty/clean SPBs while preferred
+ * SPBs still have free pages at any order.
*/
+
+ /* Pass 1: preferred SPBs — whole pageblock first, then sub-pageblock */
for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
enum sb_category cat = cat_order[movable][0];
@@ -2776,7 +2788,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
&zone->spb_lists[cat][full], list) {
if (!sb->nr_free_pages)
continue;
- for (current_order = order;
+ /* Try whole pageblock (or larger) first for PCP buddy */
+ for (current_order = max(order, pageblock_order);
current_order < NR_PAGE_ORDERS;
++current_order) {
area = &sb->free_area[current_order];
@@ -2793,15 +2806,34 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
migratetype < MIGRATE_PCPTYPES);
return page;
}
+ /* Then try sub-pageblock (no PCP buddy) */
+ if (order < pageblock_order) {
+ for (current_order = order;
+ current_order < pageblock_order;
+ ++current_order) {
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, migratetype);
+ if (!page)
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
}
}
/*
- * For non-movable allocations, try to reclaim free pageblocks
- * from tainted superpageblocks before looking at empty or clean
- * ones. Free pageblocks in tainted SBs have pages on the MOVABLE
- * free list (reset by mark_pageblock_free), so the search above
- * misses them. Claim them inline to keep non-movable allocations
+ * Pass 2: for non-movable allocations, try to claim free pageblocks
+ * from tainted superpageblocks. Free pageblocks in tainted SBs have
+ * pages on the MOVABLE free list (reset by mark_pageblock_free), so
+ * pass 1 misses them. Claim them inline to keep non-movable allocations
* concentrated in already-tainted superpageblocks.
*
* Try whole pageblock orders first (preferred for PCP buddy optimization),
@@ -2879,7 +2911,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
- /* Empty superpageblocks: try before falling back to non-preferred category */
+ /* Pass 3: whole pageblock from empty superpageblocks */
list_for_each_entry(sb, &zone->spb_empty, list) {
if (!sb->nr_free_pages)
continue;
@@ -6281,6 +6313,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (!zonelist_zone(ac->preferred_zoneref))
goto nopage;
+ /*
+ * Preserve ALLOC_NOFRAGMENT through the slowpath so that reclaim
+ * and compaction are tried before allowing clean superpageblocks
+ * to be tainted. The fast path sets this via alloc_flags_nofragment()
+ * but gfp_to_alloc_flags() only sets it for defrag_mode. Re-add it
+ * here so the slowpath retries with NOFRAGMENT still protecting
+ * clean SPBs until the last-resort drop below.
+ */
+ alloc_flags |= alloc_flags_nofragment(
+ zonelist_zone(ac->preferred_zoneref), gfp_mask);
+
/*
* Check for insane configurations where the cpuset doesn't contain
* any suitable zone to satisfy the request - e.g. non-movable
@@ -6420,8 +6463,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
&compaction_retries))
goto retry;
- /* Reclaim/compaction failed to prevent the fallback */
- if (defrag_mode && (alloc_flags & ALLOC_NOFRAGMENT)) {
+ /*
+ * Reclaim and compaction have been tried but could not free enough
+ * pages in already-tainted superpageblocks. Drop NOFRAGMENT as a
+ * last resort to allow claiming from clean/empty SPBs and stealing
+ * across migratetype boundaries. This is better than OOM-killing.
+ */
+ if (alloc_flags & ALLOC_NOFRAGMENT) {
alloc_flags &= ~ALLOC_NOFRAGMENT;
goto retry;
}
--
2.52.0