From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
fvdl@google.com, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 27/40] mm: page_alloc: cross-migratetype buddy borrow within tainted SPBs
Date: Wed, 20 May 2026 10:59:33 -0400
Message-ID: <20260520150018.2491267-28-riel@surriel.com>
In-Reply-To: <20260520150018.2491267-1-riel@surriel.com>

When pages are freed via __free_one_page they're placed on the
per-SPB free_list determined by their pageblock's migratetype, not
the original allocation's migratetype. Slab-heavy and cache-heavy
workloads both expose structural mismatches that leave non-movable
allocations stranded:

 - RECLAIMABLE pageblocks fill up densely with live slab objects
   (e.g. btrfs_inode caches), leaving very few sub-pageblock free
   fragments on the RECL free list.

 - UNMOVABLE pageblocks accumulate sparse free space from vmalloc
   and raw-alloc churn -- tens of thousands of free pages, all
   on the UNMOV free list.

 - MOVABLE-tagged pageblocks in tainted SPBs absorb freed
   page-cache and anon-LRU pages, accumulating large pools all on
   the MOVABLE free list -- invisible to non-movable demand even
   though the tainted SPB has plenty of unused space.

Add two new passes between Pass 2b and Pass 3 of __rmqueue_smallest,
both restricted to SB_TAINTED (clean SPBs must not be polluted with
cross-type mixing) and both purely transient borrows (no pageblock
relabel; the borrowed page returns to its source list when freed):

  Pass 2c -- cross-non-movable borrow. UNMOV alloc tries the
    RECL free list; RECL alloc tries the UNMOV free list. Restricted
    to UNMOV <-> RECL.

  Pass 2d -- cross-MOV borrow. Non-movable alloc tries the
    MOVABLE free list of a tainted SPB. Tradeoff: the borrowed
    UNMOV/RECL content blocks compaction of its source pageblock
    until freed; restricted to SB_TAINTED so contamination is bounded
    to one pageblock inside an already-tainted SPB. The alternative
    -- Pass 3 tainting a fresh clean SPB -- removes a 1 GiB region
    from the clean pool, which is strictly worse for the
    anti-fragmentation invariant the series is built around.

In both passes, PB_has_<requested_type> is set via __spb_set_has_type
so spb_defrag accounting reflects that the pageblock now hosts content
of the requesting type. PB_has_<source_type> stays set since other
buddies of that type remain.

Movable allocations don't participate (they have Pass 4), and CMA
allocations and CMA pageblocks are skipped. The new passes are
observable as SPB_ALLOC_OUTCOME_PASS_2C and SPB_ALLOC_OUTCOME_PASS_2D
on the spb_alloc_walk tracepoint.

Live measurement on a 250 GB system with a btrfs root
(Stage 1 + simplified Stage 2a), 7 minutes after boot: the number
of tainted Normal-zone SPBs grew from a baseline of 4 to 12, even
though the existing 11 each had between 825 and 87,062 free pages,
ALL on the UNMOV list, while the workload kept allocating RECL
btrfs_inode slab pages. Pass 2c lets those allocations be served
from the existing UNMOV-listed free pool rather than tainting fresh
SPBs; Pass 2d extends the same idea to the MOV-listed free pool
that page-cache reclaim leaves behind.

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
mm/page_alloc.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4ecddb428c3..ce8cd99dd283 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2820,6 +2820,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
struct page *page;
int full;
struct superpageblock *sb;
+ int opposite_mt;
/*
* Category search order: 2 passes.
* Movable: clean first, then tainted (pack into clean SBs).
@@ -2999,6 +3000,161 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
}
}
}
+
+ /*
+ * Pass 2c: cross-non-movable borrow within tainted SPBs.
+ *
+ * If we're a non-movable alloc and Pass 1/2/2b couldn't find a
+ * buddy on our migratetype's free list anywhere, but tainted
+ * SPBs have free buddies on the *opposite* non-movable type's
+ * free list, take one of those.
+ *
+ * Why this happens: when pages are freed, __free_one_page puts
+ * them on the free_list determined by their pageblock's tag,
+ * not the original allocation's migratetype. Slab caches tend
+ * to be dense (RECL pageblocks fill up; few sub-PB fragments),
+ * while UNMOV pageblocks accumulate sparse free space from
+ * vmalloc/raw alloc churn. Net effect: tainted SPBs frequently
+ * have tens of thousands of free pages all on the UNMOV list,
+ * invisible to RECL allocs (or vice versa). Without this pass,
+ * the alloc falls through to Pass 3 and taints a fresh clean
+ * SPB even though the existing tainted ones have plenty of
+ * unused space.
+ *
+ * We do NOT relabel the source pageblock. The buddy is taken
+ * from @opposite_mt's free list and the splits go back on
+ * @opposite_mt's list (page_del_and_expand uses the same mt
+ * for delete and expand). The pageblock tag is unchanged, so
+ * the page returns to @opposite_mt's list when freed via
+ * __free_one_page. Effectively a borrow: the alloc takes a
+ * physical page from a UNMOV-tagged pageblock for a RECL
+ * use, and the page cycles back to UNMOV's list on free.
+ *
+ * We do set PB_has_<migratetype> via __spb_set_has_type so
+ * spb_defrag accounting reflects that this pageblock now hosts
+ * our migratetype's content too. PB_has_<opposite_mt> stays
+ * set since other buddies of that type remain.
+ *
+ * Restricted to UNMOV ↔ RECL. Movable allocations don't
+ * participate (they have their own Pass 4 fallback path).
+ *
+ * Restricted to SB_TAINTED to avoid spreading mixing into
+ * clean SPBs.
+ */
+ opposite_mt = -1;
+ if (migratetype == MIGRATE_UNMOVABLE)
+ opposite_mt = MIGRATE_RECLAIMABLE;
+ else if (migratetype == MIGRATE_RECLAIMABLE)
+ opposite_mt = MIGRATE_UNMOVABLE;
+
+ if (opposite_mt >= 0) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ int co;
+
+ if (!sb->nr_free_pages)
+ continue;
+ for (co = min_t(int, pageblock_order - 1,
+ NR_PAGE_ORDERS - 1);
+ co >= (int)order;
+ --co) {
+ current_order = co;
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, opposite_mt);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ opposite_mt);
+ __spb_set_has_type(page,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
+
+ /*
+ * Pass 2d: cross-MOV borrow within tainted SPBs.
+ *
+ * If Pass 1/2/2b/2c all failed, the next step is Pass 3
+ * which would taint a fresh clean SPB. Before that, try
+ * to borrow an individual buddy from a tainted SPB's
+ * MIGRATE_MOVABLE free list.
+ *
+ * Tainted SPBs accumulate large amounts of free space on
+ * the MOV free list (e.g. reclaimed page-cache pages
+ * whose pageblock tag is MOVABLE). Pass 1 cannot see
+ * those for non-movable allocs, Pass 2/2b cannot claim a
+ * whole pageblock when sb->nr_free == 0, and Pass 2c is
+ * restricted to UNMOV<->RECL. The result is a tainted
+ * SPB with tens to hundreds of thousands of free pages
+ * all unreachable from non-movable demand.
+ *
+ * Borrow semantics mirror Pass 2c: take a buddy from the
+ * MOVABLE free list without relabeling the source
+ * pageblock. The page is used for the requesting non-
+ * movable mt for the lifetime of the allocation, then on
+ * free returns to the MOVABLE list.
+ *
+ * Cost: the borrowed UNMOV/RECL content blocks
+ * compaction of its source pageblock until freed.
+ * Restricted to SB_TAINTED so the contamination is
+ * bounded to an already-tainted SPB; the alternative
+ * (Pass 3) taints a fresh clean SPB and removes a 1 GiB
+ * region from the clean pool, which is strictly worse.
+ *
+ * Skipped for movable allocs (they have Pass 4) and for
+ * CMA allocs.
+ */
+ if (!movable && !is_migrate_cma(migratetype)) {
+ for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+ list_for_each_entry(sb,
+ &zone->spb_lists[SB_TAINTED][full], list) {
+ int co;
+
+ if (!sb->nr_free_pages)
+ continue;
+ for (co = min_t(int, pageblock_order - 1,
+ NR_PAGE_ORDERS - 1);
+ co >= (int)order;
+ --co) {
+ current_order = co;
+ area = &sb->free_area[current_order];
+ page = get_page_from_free_area(
+ area, MIGRATE_MOVABLE);
+ if (!page)
+ continue;
+ if (get_pageblock_isolate(page))
+ continue;
+ if (is_migrate_cma(
+ get_pageblock_migratetype(page)))
+ continue;
+ page_del_and_expand(zone, page,
+ order, current_order,
+ MIGRATE_MOVABLE);
+ __spb_set_has_type(page,
+ migratetype);
+ trace_mm_page_alloc_zone_locked(
+ page, order, migratetype,
+ pcp_allowed_order(order) &&
+ migratetype < MIGRATE_PCPTYPES);
+ return page;
+ }
+ }
+ }
+ }
}
/*
--
2.54.0