From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
Rik van Riel <riel@meta.com>, Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 32/45] mm: page_alloc: proactive high-water trigger for SPB slab shrink
Date: Thu, 30 Apr 2026 16:21:01 -0400
Message-ID: <20260430202233.111010-33-riel@surriel.com>
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
From: Rik van Riel <riel@meta.com>
The SPB slab shrinker introduced earlier in the series only fires when
__rmqueue_smallest falls all the way through to Pass 3 (about to taint
a clean SPB) or when __rmqueue_claim is about to taint one. Bare-metal
testing on a 247 GB devvm with btrfs root (rev 398, with Pass 2c) shows
this is too late: at boot+16min only 15 shrinks had fired in 6 minutes,
while slab grew from 1.7 GB to 11.7 GB and tainted Normal-zone SPBs
climbed from a baseline of 4 to 16. The 100ms throttle (max 10 shrinks
per second per pgdat) further capped the response rate, and the trigger
placement meant slab pressure could keep being absorbed into
already-tainted SPBs without ever firing the shrinker until those SPBs
were exhausted, at which point the only remaining option was to taint a
fresh clean SPB.
Two changes:
1. Add a proactive high-water trigger on the success paths of
__rmqueue_smallest's tainted-SPB passes (Pass 1 SB_TAINTED, Pass 2,
Pass 2b, Pass 2c). When a non-movable allocation consumes from a
tainted SPB whose nr_free_pages has fallen below spb_tainted_reserve
worth of pages (reserve_pageblocks * pageblock_nr_pages), queue a
slab shrink; rough numbers for the threshold follow the list below.
The predicate compares total free pages rather than whole free
pageblocks (nr_free): sub-pageblock allocations and fragmented free
space don't move the pageblock count but do consume the SPB's
freeable capacity, and we can't assume slab reclaim will produce
whole pageblocks either. This makes the trigger frequency
proportional to the rate of non-movable consumption from contended
tainted SPBs, instead of firing only at the cliff edge.
2. Remove the 100ms time-based throttle from queue_spb_slab_shrink.
The throttle was redundant with queue_work()'s built-in single-flight
semantics (it returns false while an earlier request is still pending)
and was actively harmful: with the new high-water trigger firing per
allocation, the natural rate limiter is the worker's runtime. The
now-unused spb_slab_shrink_last field is removed from pglist_data.
queue_work() absorbs the resulting per-alloc burst at near-zero cost
(test-and-set on WORK_STRUCT_PENDING_BIT) when a pass is already in
flight, so unconditional firing on every qualifying allocation is
cheap.
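To put rough numbers on the threshold in change 1 (a sketch, not a
measurement from the test setup above): on a typical x86-64
configuration a 1 GB SPB spans 512 pageblocks of 512 pages each, so
spb_tainted_reserve() evaluates to max(SPB_TAINTED_RESERVE_MIN,
512 / 32) = 16 pageblocks, assuming SPB_TAINTED_RESERVE_MIN (whose
value is not visible in this patch) does not exceed 16. The trigger
then fires whenever a non-movable allocation is satisfied from a
tainted SPB whose free pages have dropped below

	16 * 512 = 8192 pages, i.e. 32 MB of remaining freeable capacity.

The single-flight behaviour change 2 leans on is queue_work()'s
pending-bit fast path. Roughly, paraphrased from kernel/workqueue.c
(not verbatim, minor details elided):

	/*
	 * queue_work(wq, work) is a thin wrapper that calls this with
	 * cpu == WORK_CPU_UNBOUND.
	 */
	bool queue_work_on(int cpu, struct workqueue_struct *wq,
			   struct work_struct *work)
	{
		bool ret = false;
		unsigned long flags;

		local_irq_save(flags);

		/*
		 * Atomic test-and-set of the PENDING bit: if it was
		 * already set, an earlier request has been queued but has
		 * not started executing, so do nothing and return false.
		 */
		if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
				      work_data_bits(work))) {
			__queue_work(cpu, wq, work);
			ret = true;
		}

		local_irq_restore(flags);
		return ret;
	}

In the common already-pending case a trigger is therefore a single
atomic test-and-set, which is what keeps per-allocation firing cheap.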
Pass 4 (movable falling back to tainted) does not get the trigger:
movable consumption does not contribute to the slab pressure that taints
fresh SPBs, and Pass 4 already filters out SPBs at or below reserve.
Clean-SPB success paths in Pass 1 are also untouched (clean SPBs are
not the source of the pressure).
Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
include/linux/mmzone.h | 7 +++---
mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++----------
2 files changed, 40 insertions(+), 15 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acaff292140f..68892e40cd4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1573,12 +1573,11 @@ typedef struct pglist_data {
/*
* SPB-driven slab reclaim: single work item per pgdat (shrink_slab
- * is node-scoped, so one work in-flight per node is the max), with
- * a 100ms throttle. queue_work() gives us single-flight semantics
- * for free.
+ * is node-scoped, so one work in-flight per node is the max).
+ * queue_work() gives us single-flight semantics for free — fresh
+ * triggers no-op while a pass is in progress.
*/
struct work_struct spb_slab_shrink_work;
- unsigned long spb_slab_shrink_last;
#endif
/*
* This is a per-node reserve of pages that are not available
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f2db3dd86a84..ff7755ef2b79 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2692,6 +2692,23 @@ static inline u16 spb_tainted_reserve(const struct superpageblock *sb)
return max_t(u16, SPB_TAINTED_RESERVE_MIN, sb->total_pageblocks / 32);
}
+/*
+ * High-water threshold for proactively kicking the slab shrinker. When a
+ * non-movable allocation consumes from a tainted SPB whose total free
+ * pages have fallen below spb_tainted_reserve worth of pages, queue a
+ * shrink so we start freeing slab memory before the SPB is exhausted.
+ *
+ * Compared against nr_free_pages rather than nr_free (whole pageblocks):
+ * sub-pageblock allocations and fragmented free space don't move the
+ * pageblock count, but they do consume the SPB's freeable capacity, and
+ * we can't assume slab reclaim will produce whole pageblocks either.
+ */
+static inline bool spb_below_shrink_high_water(const struct superpageblock *sb)
+{
+ return sb->nr_free_pages <
+ (unsigned long)spb_tainted_reserve(sb) * pageblock_nr_pages;
+}
+
/*
* On systems with many superpageblocks, we can afford to "write off"
* tainted superpageblocks by aggressively packing unmovable/reclaimable
@@ -2877,6 +2894,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
+ if (cat == SB_TAINTED &&
+ spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2896,6 +2916,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page_del_and_expand(zone, page,
order, current_order,
migratetype);
+ if (cat == SB_TAINTED &&
+ spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2941,6 +2964,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = claim_whole_block(zone, page,
current_order, order,
migratetype, MIGRATE_MOVABLE);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -2978,6 +3003,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
0, true);
if (!page)
continue;
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -3061,6 +3088,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
opposite_mt);
__spb_set_has_type(page,
migratetype);
+ if (spb_below_shrink_high_water(sb))
+ queue_spb_slab_shrink(zone);
trace_mm_page_alloc_zone_locked(
page, order, migratetype,
pcp_allowed_order(order) &&
@@ -9126,9 +9155,9 @@ static void queue_spb_evacuate(struct zone *zone, unsigned int order,
* tainted SPB is to shrink the slab caches whose pages live there.
*
* shrink_slab() is node-scoped, so one work item per pgdat is enough:
- * a single embedded work_struct, gated by a 100ms throttle.
- * queue_work() returns false if the work is already queued/running, so
- * we get single-flight for free.
+ * a single embedded work_struct. queue_work() returns false if the work
+ * is already queued/running, so we get single-flight for free — fresh
+ * triggers no-op until the in-flight pass completes.
*
* shrink_slab() itself is location-agnostic — it walks all registered
* shrinkers and frees objects whose backing pages may live in any
@@ -9189,10 +9218,11 @@ static void spb_slab_shrink_work_fn(struct work_struct *work)
* queue_spb_slab_shrink - schedule deferred slab shrink for SPB pressure
* @zone: zone whose tainted-SPB pool is running low
*
- * Throttled to one enqueue per 100ms per pgdat. queue_work() handles
- * single-flight: if the work is already queued or running, it returns
- * false and the throttle stamp still gets bumped (next call will be
- * no-op until the throttle elapses).
+ * Single-flight via queue_work(): if the work is already queued or
+ * running, it returns false and we no-op. There is no time-based
+ * throttle — the rate at which fresh shrink runs can fire is bounded
+ * by how fast the worker completes (one full pass freeing up to
+ * SPB_SLAB_SHRINK_TARGET_OBJS objects).
*
* Callable from any context: page allocator paths hold zone->lock,
* the SPB evacuate worker does not. queue_work() takes only the
@@ -9212,10 +9242,6 @@ static void queue_spb_slab_shrink(struct zone *zone)
if (!pgdat->evacuate_wq)
return;
- if (time_before(jiffies, pgdat->spb_slab_shrink_last + HZ / 10))
- return;
-
- pgdat->spb_slab_shrink_last = jiffies;
if (queue_work(pgdat->evacuate_wq, &pgdat->spb_slab_shrink_work))
count_vm_event(SPB_SLAB_SHRINK_QUEUED);
}
--
2.52.0
Thread overview: 48+ messages
2026-04-30 20:20 [00/45 RFC PATCH] 1GB superpageblock memory allocation Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 01/45] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 02/45] mm: page_alloc: per-cpu pageblock buddy allocator Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 03/45] mm: page_alloc: use trylock for PCP lock in free path to avoid lock inversion Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 04/45] mm: mm_init: fix zone assignment for pages in unavailable ranges Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 05/45] mm: vmstat: restore per-migratetype free counts in /proc/pagetypeinfo Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 06/45] mm: page_alloc: remove watermark boost mechanism Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 07/45] mm: page_alloc: async evacuation of stolen movable pageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 08/45] mm: page_alloc: track actual page contents in pageblock flags Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 09/45] mm: page_alloc: introduce superpageblock metadata for 1GB anti-fragmentation Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 10/45] mm: page_alloc: support superpageblock resize for memory hotplug Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 11/45] mm: page_alloc: add superpageblock fullness lists for allocation steering Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 12/45] mm: page_alloc: steer pageblock stealing to tainted superpageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 13/45] mm: page_alloc: steer movable allocations to fullest clean superpageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 14/45] mm: page_alloc: extract claim_whole_block from try_to_claim_block Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 15/45] mm: page_alloc: add per-superpageblock free lists Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 16/45] mm: page_alloc: add background superpageblock defragmentation worker Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 17/45] mm: page_alloc: add within-superpageblock compaction for clean superpageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 18/45] mm: page_alloc: superpageblock-aware contiguous and higher order allocation Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 19/45] mm: page_alloc: prevent atomic allocations from tainting clean SPBs Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 20/45] mm: page_alloc: aggressively pack non-movable allocations in tainted SPBs on large systems Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 21/45] mm: page_alloc: prefer reclaim over tainting clean superpageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 22/45] mm: page_alloc: adopt partial pageblocks from tainted superpageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 23/45] mm: page_alloc: add CONFIG_DEBUG_VM sanity checks for SPB counters Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 24/45] mm: page_alloc: targeted evacuation and dynamic reserves for tainted SPBs Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 25/45] mm: page_alloc: skip pageblock compatibility threshold in " Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 26/45] mm: page_alloc: prevent UNMOVABLE/RECLAIMABLE mixing in pageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 27/45] mm: trigger deferred SPB evacuation when atomic allocs would taint a clean SPB Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 28/45] mm: page_alloc: keep PCP refill in tainted SPBs across owned pageblocks Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 29/45] mm: page_alloc: refuse fragmenting fallback for callers with cheap fallback Rik van Riel
2026-04-30 20:20 ` [RFC PATCH 30/45] mm: page_alloc: drive slab shrink from SPB anti-fragmentation pressure Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 31/45] mm: page_alloc: cross-non-movable buddy borrow within tainted SPBs Rik van Riel
2026-04-30 20:21 ` Rik van Riel [this message]
2026-04-30 20:21 ` [RFC PATCH 33/45] mm: page_alloc: refuse to taint clean SPBs for atomic NORETRY callers Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 34/45] mm: page_reporting: walk per-superpageblock free lists Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 35/45] mm: show_mem: collect migratetype letters from per-superpageblock lists Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 36/45] mm: page_alloc: add alloc_flags parameter to __rmqueue_smallest Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 37/45] mm/slub: kvmalloc — add __GFP_NORETRY to large-kmalloc attempt Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 38/45] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 39/45] mm: debug: prevent infinite recursion in dump_page() with CMA Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 40/45] PM: hibernate: walk per-superpageblock free lists in mark_free_pages Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 41/45] btrfs: allocate eb-attached btree pages as movable Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 42/45] mm: page_alloc: cross-MOV borrow within tainted SPBs Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 43/45] mm: page_alloc: trigger defrag from allocator hot path on tainted-SPB pressure Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 44/45] mm: page_alloc: SPB tracepoint instrumentation [DROP-FOR-UPSTREAM] Rik van Riel
2026-04-30 20:21 ` [RFC PATCH 45/45] mm: page_alloc: enlarge and unify spb_evacuate_for_order Rik van Riel
2026-05-01 7:14 ` [00/45 RFC PATCH] 1GB superpageblock memory allocation David Hildenbrand (Arm)
2026-05-01 11:58 ` Rik van Riel