From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	Rik van Riel, Rik van Riel
Subject: [RFC PATCH 43/45] mm: page_alloc: trigger defrag from allocator hot path on tainted-SPB pressure
Date: Thu, 30 Apr 2026 16:21:12 -0400
Message-ID: <20260430202233.111010-44-riel@surriel.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Rik van Riel

The per-SPB background defrag worker is currently triggered only from
spb_update_list(), which itself only fires when the SPB's category or
fullness bucket changes. Sub-bucket allocations (decrementing free
counters within the same bucket) never re-evaluate the trigger.

A drgn dump on a saturated devvm showed several tainted SPBs whose
defrag_last_no_progress_jiffies had been stamped hundreds to thousands
of seconds earlier, long after their 5-second SPB_DEFRAG_NOOP_COOLDOWN
had expired, yet defrag had never been re-triggered on them.

The shape of the failure: a tainted SPB hits free=0, the worker runs
once and makes no progress (its movable pages sit mostly in mixed
pageblocks, and evacuating them leaves the source pageblock still
occupied by unmovable/reclaimable content), the no-progress cooldown
is stamped, and no later allocator event crosses a fullness bucket on
that SPB, so spb_update_list() never re-fires the trigger. The SPB
sits stuck while subsequent non-movable allocations end up tainting
fresh clean SPBs via PASS_3.

Add two complementary triggers in __rmqueue_smallest():

(1) On every PASS_1/2/2B/2C/2D success that already evaluates
spb_below_shrink_high_water(sb) (i.e. the same threshold at which
queue_spb_slab_shrink() is fired), additionally call
spb_maybe_start_defrag(sb). This catches actively pressured tainted
SPBs immediately, with no extra hot-path predicate evaluation.

(2) Just before the PASS_3 fall-through that risks tainting a fresh
clean SPB, walk the tainted-SPB lists and call spb_maybe_start_defrag()
on each entry. This catches SPBs that are stuck with no allocator
activity to drive (1). The walk is bounded by nr_tainted_spbs and only
runs on the slow path that is about to fragment the clean pool, which
is an appropriate place to spend a list walk.

The cooldown gate inside spb_needs_defrag() still no-ops cheaply for
SPBs that are not yet eligible, so neither trigger can storm the
worker.
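
For context, the cooldown gating relied on above looks roughly like the
sketch below. Only defrag_last_no_progress_jiffies and
SPB_DEFRAG_NOOP_COOLDOWN are real names from earlier patches in this
series; the work item, the helper bodies and everything else in the
sketch are illustrative assumptions rather than the actual
implementation:

  /* Sketch only; the real helpers are defined earlier in the series. */
  #define SPB_DEFRAG_NOOP_COOLDOWN	(5 * HZ)	/* 5s, per the text above */

  static bool spb_needs_defrag(struct superpageblock *sb)
  {
  	if (spb_get_category(sb) != SB_TAINTED)
  		return false;
  	/* no-progress cooldown: skip SPBs where a recent run achieved nothing */
  	if (time_before(jiffies, sb->defrag_last_no_progress_jiffies +
  				 SPB_DEFRAG_NOOP_COOLDOWN))
  		return false;
  	return true;
  }

  static void spb_maybe_start_defrag(struct superpageblock *sb)
  {
  	if (!spb_needs_defrag(sb))
  		return;
  	/* assumed: one work item per SPB, serviced by the background worker */
  	queue_work(system_unbound_wq, &sb->defrag_work);
  }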

The existing spb_maybe_start_defrag() call inside spb_update_list() is
retained: it remains the trigger for the clean-SPB
within-superpageblock compaction path (spb_defrag_clean), which the
new alloc-path triggers do not cover (they only fire on SB_TAINTED).
Replacing the spb_update_list() call entirely would require a separate
clean-SPB-specific trigger in the allocator and is left for a
follow-up.

Also factor out the now-repeated tainted-alloc reaction into a helper,
spb_react_to_tainted_alloc(sb, zone), and call it from all 8
PASS_1/2/2B/2C/2D success sites in __rmqueue_smallest(). This
centralizes the gate (cat == SB_TAINTED &&
spb_below_shrink_high_water(sb)) and the shrink+defrag kick in one
place, removing duplication and reducing per-success-site noise.

Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
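A rough sketch of the per-zone SPB bookkeeping that the new last-chance
walk iterates, for readers without the rest of the series in front of
them. Only SB_TAINTED, SB_FULL, __NR_SB_FULLNESS, spb_lists, spb_empty
and nr_tainted_spbs appear in this patch or its changelog; the
remaining enumerators and the grouping into a separate struct are
illustrative guesses (in the series these fields would live in struct
zone itself):

  /* Sketch; names not used by this patch may differ in the series. */
  enum spb_category { SB_CLEAN, SB_TAINTED, __NR_SB_CATEGORIES };
  enum spb_fullness { SB_FULL, SB_PARTIAL, SB_EMPTY, __NR_SB_FULLNESS };

  struct zone_spb_state {
  	/* clean/tainted SPBs, bucketed by fullness; walked by the new trigger */
  	struct list_head spb_lists[__NR_SB_CATEGORIES][__NR_SB_FULLNESS];
  	/* fully free SPBs, handed out by PASS_3 */
  	struct list_head spb_empty;
  	/* bounds the tainted-list walk */
  	unsigned long nr_tainted_spbs;
  };
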
 mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af499f0a1a48..e15e71d5ac99 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2709,6 +2709,30 @@ static inline bool spb_below_shrink_high_water(const struct superpageblock *sb)
 		(unsigned long)spb_tainted_reserve(sb) * pageblock_nr_pages;
 }
 
+/*
+ * spb_react_to_tainted_alloc - kick reclaim machinery on a tainted-SPB alloc.
+ *
+ * Called from each PASS_1/2/2B/2C/2D success path after a successful
+ * allocation against a tainted SPB. If the SPB is below its shrink
+ * high-water mark, queue the SPB-driven slab shrink and try to start
+ * the per-SPB defrag worker. Both have their own cooldown gates inside,
+ * so this is cheap to call on every such allocation.
+ *
+ * Skips quickly when the SPB is not tainted (e.g. movable allocation
+ * landing on a clean SPB) or when the high-water mark hasn't been
+ * crossed.
+ */
+static inline void spb_react_to_tainted_alloc(struct superpageblock *sb,
+					      struct zone *zone)
+{
+	if (spb_get_category(sb) != SB_TAINTED)
+		return;
+	if (!spb_below_shrink_high_water(sb))
+		return;
+	queue_spb_slab_shrink(zone);
+	spb_maybe_start_defrag(sb);
+}
+
 /*
  * On systems with many superpageblocks, we can afford to "write off"
  * tainted superpageblocks by aggressively packing unmovable/reclaimable
@@ -2969,9 +2993,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	page = try_alloc_from_sb_pass1(zone, cpu_hint, order, migratetype);
 	if (page) {
-		if (spb_get_category(cpu_hint) == SB_TAINTED &&
-		    spb_below_shrink_high_water(cpu_hint))
-			queue_spb_slab_shrink(zone);
+		spb_react_to_tainted_alloc(cpu_hint, zone);
 		trace_mm_page_alloc_zone_locked(page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -2984,9 +3006,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	page = try_alloc_from_sb_pass1(zone, zone_hint, order, migratetype);
 	if (page) {
-		if (spb_get_category(zone_hint) == SB_TAINTED &&
-		    spb_below_shrink_high_water(zone_hint))
-			queue_spb_slab_shrink(zone);
+		spb_react_to_tainted_alloc(zone_hint, zone);
 		slot->zone = zone;
 		slot->sb = zone_hint;
 		trace_mm_page_alloc_zone_locked(page, order,
@@ -3057,9 +3077,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			page_del_and_expand(zone, page, order, current_order,
 					    migratetype);
-			if (cat == SB_TAINTED &&
-			    spb_below_shrink_high_water(sb))
-				queue_spb_slab_shrink(zone);
+			if (cat == SB_TAINTED)
+				spb_react_to_tainted_alloc(sb, zone);
 			trace_mm_page_alloc_zone_locked(
 					page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -3088,9 +3107,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			page_del_and_expand(zone, page, order, current_order,
 					    migratetype);
-			if (cat == SB_TAINTED &&
-			    spb_below_shrink_high_water(sb))
-				queue_spb_slab_shrink(zone);
+			if (cat == SB_TAINTED)
+				spb_react_to_tainted_alloc(sb, zone);
 			trace_mm_page_alloc_zone_locked(
 					page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -3145,8 +3163,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			page = claim_whole_block(zone, page, current_order,
 						 order, migratetype,
 						 MIGRATE_MOVABLE);
-			if (spb_below_shrink_high_water(sb))
-				queue_spb_slab_shrink(zone);
+			spb_react_to_tainted_alloc(sb, zone);
 			trace_mm_page_alloc_zone_locked(
 					page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -3184,8 +3201,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						   0, true);
 			if (!page)
 				continue;
-			if (spb_below_shrink_high_water(sb))
-				queue_spb_slab_shrink(zone);
+			spb_react_to_tainted_alloc(sb, zone);
 			trace_mm_page_alloc_zone_locked(
 					page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -3269,8 +3285,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						 opposite_mt);
 			__spb_set_has_type(page, migratetype);
-			if (spb_below_shrink_high_water(sb))
-				queue_spb_slab_shrink(zone);
+			spb_react_to_tainted_alloc(sb, zone);
 			trace_mm_page_alloc_zone_locked(
 					page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -3342,8 +3357,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 						 MIGRATE_MOVABLE);
 			__spb_set_has_type(page, migratetype);
-			if (spb_below_shrink_high_water(sb))
-				queue_spb_slab_shrink(zone);
+			spb_react_to_tainted_alloc(sb, zone);
 			trace_mm_page_alloc_zone_locked(
 					page, order, migratetype,
 					pcp_allowed_order(order) &&
@@ -3371,6 +3385,25 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			queue_spb_slab_shrink(zone);
 	}
 
+	/*
+	 * Last-chance defrag trigger before tainting a fresh clean SPB.
+	 * Walk the tainted-SPB list and try to wake the per-SPB defrag
+	 * worker on each. Catches SPBs that are stuck in expired-cooldown
+	 * state because no allocator activity has touched them recently
+	 * (the routine event-driven trigger from spb_update_list only
+	 * fires on bucket transitions, not on every alloc). Once the
+	 * cooldown has expired, spb_maybe_start_defrag() will requeue
+	 * work; otherwise the gate inside spb_needs_defrag() no-ops
+	 * cheaply. Bounded by nr_tainted_spbs and only runs when we are
+	 * already on the slow path of fragmenting the clean pool.
+	 */
+	for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+		list_for_each_entry(sb,
+				&zone->spb_lists[SB_TAINTED][full], list) {
+			spb_maybe_start_defrag(sb);
+		}
+	}
+
 	/* Pass 3: whole pageblock from empty superpageblocks */
 	list_for_each_entry(sb, &zone->spb_empty, list) {
 		if (!sb->nr_free_pages)
-- 
2.52.0