From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from shelob.surriel.com (shelob.surriel.com [96.67.55.147])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0A3123A5459
	for ; Thu, 30 Apr 2026 20:22:54 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=96.67.55.147
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777580583; cv=none;
	b=AWLiKi8oqjUhRjx6fGquM9JLZjn4Ipeo0f1BrTUwup2n4R4C71RZXJ0QdUpcMpaSniV4k1fDKaJIuaoBJoialufNemMCcj2vW79ts5VhXPf67HmsJmtHGTM6aptvNTKZfxiIAF2UEo4pI9YPWHVtMbG6BT8GbXhMbDTKqT+w5zw=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1777580583; c=relaxed/simple;
	bh=3sa3qrJkBGAMWfkbCkSD0RI1hTm57+70MYe6GNob038=;
	h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References:MIME-Version:Content-Type;
	b=h1ZXssLkarC+LU48mcgsFHH6Gsn27RoNviLRrWJKH1/BemSJUA4B4DXZm3NUJDlvE5TS6Vjci+ZqxKQbgSzH3fUfeLBiDygH6t9RJHVmegFS4Cc5scsYAxRi7SPqpbsRHTcmQdhUct0Z0jvSFoW6b8ZlZSuVHnKlxkojlJcp4uI=
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com;
	spf=pass smtp.mailfrom=surriel.com; dkim=pass (2048-bit key) header.d=surriel.com
	header.i=@surriel.com header.b=giQtU7hl; arc=none smtp.client-ip=96.67.55.147
Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=surriel.com
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=surriel.com
Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=surriel.com header.i=@surriel.com header.b="giQtU7hl"
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=surriel.com; s=mail;
	h=Content-Transfer-Encoding:Content-Type:MIME-Version:References:
	In-Reply-To:Message-ID:Date:Subject:Cc:To:From:Sender:Reply-To:Content-ID:
	Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc
	:Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe:
	List-Post:List-Owner:List-Archive;
	bh=s3PRSg0Ba4XvYhY1BnO4oXxGKXF8eZ/cIx3+tOOwEHU=; b=giQtU7hlg6UFDZNwxEC7uf5qR5
	47A9YcA4MxErkcSLPnpiyJ2sBAQeYKo/Dtl2A9sjg0L6VUOF2Np9U3j0+m11HFGYVmu5tBZHLEFUF
	pD+a1wUCVMZBJY5Qbi19FNiI8rs+/zh55Wfc1DvB5zGMImTJkE8VDLVixsDZUrSYXlvQkh+JY7KgQ
	FWTZNFR1e8wUWMIruQto4Sl47i8uFl+Nw9RBPXL5OIsDBi1zhwhdjWnGvc58Hu430ZuG7KvJOqF00
	Du9tPFphTlIhnOkARY8NxFlhPoBQmXcj77i0vOVSwdrqzk+IhCC2jpgTnB3phOgWetciVaFq9Vf3q
	mrIkbCaQ==;
Received: from fangorn.home.surriel.com ([10.0.13.7])
	by shelob.surriel.com with esmtpsa (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
	(Exim 4.97.1) (envelope-from )
	id 1wIXuD-000000001R0-2QEu; Thu, 30 Apr 2026 16:22:41 -0400
From: Rik van Riel 
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	Rik van Riel , Rik van Riel 
Subject: [RFC PATCH 45/45] mm: page_alloc: enlarge and unify spb_evacuate_for_order
Date: Thu, 30 Apr 2026 16:21:14 -0400
Message-ID: <20260430202233.111010-46-riel@surriel.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: 
List-Subscribe: 
List-Unsubscribe: 
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Rik van Riel 

The slowpath in __alloc_pages_slowpath calls spb_evacuate_for_order
just before dropping ALLOC_NOFRAGMENT. Each successful evacuation
frees a MOV pageblock inside a tainted SPB, so the retry can satisfy
a non-movable allocation via Pass 2 (claim_whole_block) without having
to drop NOFRAGMENT and let __rmqueue_claim taint a clean SPB.

Two problems with the existing implementation:

1. Per-call budget too small.

   SPB_CONTIG_MAX_CANDIDATES = 4 (also used by contig allocation)
   per-SPB pageblocks        = 3 (hard-coded literal in evac calls)
   → up to 12 pageblocks scanned/migrated per call.

   Production traces show 13 of 21 tainted SPBs typically have MOV
   content (~2500 PB-units = ~5 GiB of pageblocks with MOV content
   across the tainted pool). The 4-candidate cap leaves the bulk of
   the evacuatable capacity untouched per call, so the slowpath
   frequently gives up and drops NOFRAGMENT even though plenty of MOV
   content was there to free.

2. Source-pageblock migratetype filter creates blind spots.

   evacuate_pb_range(..., migratetype, ...) skipped any pageblock
   whose underlying tag did not match the @migratetype argument.
   Phase 1 used the requesting type; Phase 2 used MIGRATE_MOVABLE.
   But MOV content can live in a pageblock of any tag:

   - PASS_2C / PASS_2D borrows set PB_has_ on a MOV-tagged PB without
     changing the tag, then borrowed pages return to the MOV free list
     when freed.

   - __spb_set_has_type adds a non-MOV bit on a PB without
     re-evaluating the PB tag. PBs accumulate has-bits over time.

   Result: MOV stragglers in PBs whose tag does not match either
   phase's filter are permanently invisible to evacuation.

Fix both:

* Introduce dedicated SPB_EVACUATE_MAX_CANDIDATES = 16 and
  SPB_EVACUATE_MAX_PB_PER_SB = 8 so the evacuation budget can be
  sized independently of the contig-allocation candidate cap.
  Combined cap: 128 pageblocks (256 MiB) per call.

* Drop the migratetype tag filter from evacuate_pb_range. Accept any
  pageblock with PB_has_movable set, skipping only the cases whose
  semantics forbid touching them here (ISOLATE, CMA, HIGHATOMIC).

* Collapse the two-phase structure in spb_evacuate_for_order into a
  single pass. The two phases were doing the same evacuation action
  with different filters; once the filter is relaxed the distinction
  collapses naturally:

  - PBs that are pure MOV become empty -> free MOV pageblock,
    claimable by Pass 2 / claim_whole_block on the retry.
  - PBs that are mixed lose their MOV stragglers, so future
    allocations of the dominant type can use the PB without competing
    with MOV residue.

* sb_collect_evacuate_candidates loses its migratetype parameter:
  after the unified pass, the only candidate filter is nr_movable > 0.

The contig-allocation path (spb_try_alloc_contig) is unchanged;
SPB_CONTIG_MAX_CANDIDATES remains at 4. 1 GiB allocations have a
different latency profile and broad evac-style scanning is not
appropriate there.

The trace_spb_evacuate_for_order_done signature is preserved for ABI
continuity with existing observers; the merged attempt count is
reported in phase1_attempts and phase2_attempts is reported as 0.

Stack impact: sb_pfns[] grows from 32 bytes to 128 bytes, trivial for
an 8K/16K kernel stack.

Per Rik's stated priority hierarchy:

  P1 protect clean SPBs from being tainted (highest)
  P2 protect MOV pageblocks inside tainted SPBs
  P3 allocation latency (lowest)

trading a few hundred ms of evacuation latency to keep clean SPBs
clean is the desired direction.

Signed-off-by: Rik van Riel 
---
 mm/page_alloc.c | 141 ++++++++++++++++++++++++++++--------------------
 1 file changed, 84 insertions(+), 57 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 815cee325ec0..9cc8b9bbc1fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -10603,6 +10603,27 @@ static bool zone_spans_last_pfn(const struct zone *zone,
  */
 #define SPB_CONTIG_MAX_CANDIDATES 4
 
+/*
+ * Maximum tainted superpageblock candidates per spb_evacuate_for_order call.
+ * Collected under zone->lock, then evacuated without it. Larger than the
+ * contig-allocation candidate cap because evacuation runs from the slowpath
+ * after reclaim/compaction failed: we need a meaningful chance of freeing a
+ * non-MOV-claimable pageblock before the slowpath escalates to dropping
+ * ALLOC_NOFRAGMENT (which lets __rmqueue_claim taint clean SPBs). Sized to
+ * scan a meaningful fraction of a typical tainted-pool population.
+ */
+#define SPB_EVACUATE_MAX_CANDIDATES 16
+
+/*
+ * Maximum pageblocks to evacuate per candidate SPB inside
+ * spb_evacuate_for_order. Each evacuation triggers page migration which is
+ * O(pages_per_pageblock) wall-clock cost, so this caps per-call latency.
+ * Bumped from 3 to 8 to free more capacity per slowpath escalation pass.
+ * Combined cap: SPB_EVACUATE_MAX_CANDIDATES * SPB_EVACUATE_MAX_PB_PER_SB
+ * pageblocks per call (16 * 8 = 128 pageblocks = 256 MiB on x86).
+ */
+#define SPB_EVACUATE_MAX_PB_PER_SB 8
+
 #ifdef CONFIG_COMPACTION
 /**
  * sb_collect_contig_candidates - Find superpageblock ranges for contiguous alloc
@@ -10778,7 +10799,7 @@ static struct page *spb_try_alloc_contig(struct zone *zone,
  *
  * Returns number of candidate superpageblock PFNs found.
  */
-static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
+static int sb_collect_evacuate_candidates(struct zone *zone,
					  unsigned long *sb_pfns, int max)
 {
	struct superpageblock *sb;
@@ -10792,20 +10813,6 @@ static int sb_collect_evacuate_candidates(struct zone *zone, int migratetype,
		if (!sb->nr_movable)
			continue;
 
-		if (migratetype >= 0) {
-			bool has_matching;
-
-			if (migratetype == MIGRATE_UNMOVABLE)
-				has_matching = sb->nr_unmovable > 0;
-			else if (migratetype == MIGRATE_RECLAIMABLE)
-				has_matching = sb->nr_reclaimable > 0;
-			else
-				continue;
-
-			if (!has_matching)
-				continue;
-		}
-
		sb_pfns[n++] = sb->start_pfn;
		if (n >= max)
			return n;
@@ -10815,17 +10822,38 @@
 }
 
 /*
- * Evacuate pageblocks of the given migratetype within a range.
+ * Evacuate MOV content out of any pageblock in the given range that has it.
+ *
+ * The previous version filtered on the source pageblock's migratetype tag,
+ * which made evacuation blind to MOV stragglers living in PBs whose tag did
+ * not match the current allocation's requesting type:
+ *
+ * - PASS_2C / PASS_2D borrows set PB_has_ on a MOV-tagged
+ *   PB without changing the tag. The borrowed pages return to the MOV
+ *   free list when freed, so a MOV-tagged PB can host non-MOV PB_has bits
+ *   and MOV content simultaneously.
+ *
+ * - When __spb_set_has_type adds a non-MOV bit on a PB, the PB tag is not
+ *   re-evaluated. PBs accumulate has-bits over time without their tag
+ *   necessarily reflecting current content.
+ *
+ * Drop the migratetype tag filter and accept any PB with PB_has_movable set.
+ * Skip only the cases whose semantics forbid touching them here:
+ * - MIGRATE_ISOLATE under quarantine
+ * - CMA own allocator
+ * - MIGRATE_HIGHATOMIC reserve, evac would race the reservation logic
+ *
  * Returns number of pageblocks evacuated.
  */
 static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
-			     unsigned long end_pfn, int migratetype, int max)
+			     unsigned long end_pfn, int max)
 {
	unsigned long pfn;
	int nr_evacuated = 0;
 
	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		struct page *page;
+		int pb_mt;
 
		if (!pfn_valid(pfn))
			continue;
@@ -10835,10 +10863,13 @@ static int evacuate_pb_range(struct zone *zone, unsigned long start_pfn,
 
		page = pfn_to_page(pfn);
 
-		if (get_pfnblock_migratetype(page, pfn) != migratetype)
+		if (!get_pfnblock_bit(page, pfn, PB_has_movable))
			continue;
 
-		if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+		pb_mt = get_pfnblock_migratetype(page, pfn);
+		if (pb_mt == MIGRATE_ISOLATE ||
+		    is_migrate_cma(pb_mt) ||
+		    pb_mt == MIGRATE_HIGHATOMIC)
			continue;
 
		evacuate_pageblock(zone, pfn, true);
@@ -10870,41 +10901,33 @@
 static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
				   int migratetype)
 {
-	unsigned long sb_pfns[SPB_CONTIG_MAX_CANDIDATES];
+	unsigned long sb_pfns[SPB_EVACUATE_MAX_CANDIDATES];
	unsigned long flags;
	int nr_sbs, i;
-	unsigned int phase1_attempts = 0, phase2_attempts = 0;
+	unsigned int attempts = 0;
	bool did_evacuate = false;
 
-	/* Phase 1: coalesce within existing non-movable pageblocks */
-	spin_lock_irqsave(&zone->lock, flags);
-	nr_sbs = sb_collect_evacuate_candidates(zone, migratetype,
-						sb_pfns,
-						SPB_CONTIG_MAX_CANDIDATES);
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	for (i = 0; i < nr_sbs; i++) {
-		unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
-		int n;
-
-		n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
-				      migratetype, 3);
-		phase1_attempts += n;
-		if (n)
-			did_evacuate = true;
-	}
-
-	if (did_evacuate) {
-		trace_spb_evacuate_for_order_done(zone, order, migratetype,
-				phase1_attempts, phase2_attempts, true);
-		return true;
-	}
-
-	/* Phase 2: evacuate MOVABLE pageblocks to create free whole pageblocks */
+	/*
+	 * Single-pass evacuation: collect candidate tainted SPBs (anything
+	 * with MOV content), then walk each one's pageblocks evacuating MOV
+	 * content from any non-special PB. evacuate_pb_range filters by
+	 * PB_has_movable, so this is a no-op on PBs that have no MOV content.
+	 *
+	 * Two effects accumulate:
+	 * - PBs that are pure MOV become empty -> free MOV pageblock,
+	 *   claimable by Pass 2 / claim_whole_block on the retry.
+	 * - PBs that are mixed (e.g., UNMOV + MOV stragglers) lose the MOV
+	 *   stragglers, so future allocations of the dominant type can use
+	 *   the PB without competing with the MOV residue.
+	 *
+	 * The previous two-phase design tried to do these separately and
+	 * filtered evacuation by source PB tag. That left MOV content
+	 * stranded in PBs whose tag did not match either phase, and gave up
+	 * after one phase even though the other phase could have helped.
+	 */
	spin_lock_irqsave(&zone->lock, flags);
-	nr_sbs = sb_collect_evacuate_candidates(zone, -1,
-						sb_pfns,
-						SPB_CONTIG_MAX_CANDIDATES);
+	nr_sbs = sb_collect_evacuate_candidates(zone, sb_pfns,
+						SPB_EVACUATE_MAX_CANDIDATES);
	spin_unlock_irqrestore(&zone->lock, flags);
 
	for (i = 0; i < nr_sbs; i++) {
@@ -10912,24 +10935,28 @@ static bool spb_evacuate_for_order(struct zone *zone, unsigned int order,
		unsigned long end_pfn = sb_pfns[i] + SUPERPAGEBLOCK_NR_PAGES;
		int n;
 
		n = evacuate_pb_range(zone, sb_pfns[i], end_pfn,
-				      MIGRATE_MOVABLE, 3);
-		phase2_attempts += n;
+				      SPB_EVACUATE_MAX_PB_PER_SB);
+		attempts += n;
		if (n)
			did_evacuate = true;
	}
 
	/*
	 * Always kick a slab shrink after an evacuation pass — even when
-	 * movable evacuation succeeded. Slab content stranded inside
-	 * tainted SPBs can only be freed by shrinking the cache; doing
-	 * it now keeps headroom available for the next burst, when the
-	 * movable supply may have run out and movable evac alone would
-	 * have nothing to do.
+	 * MOV evacuation succeeded. Slab content stranded inside tainted
+	 * SPBs can only be freed by shrinking the cache; doing it now keeps
+	 * headroom available for the next burst, when the MOV supply may
+	 * have run out and evac alone would have nothing to do.
	 */
	queue_spb_slab_shrink(zone);
+	/*
+	 * The tracepoint signature retains phase1_attempts / phase2_attempts
+	 * for ABI continuity with existing observers; report the merged total
+	 * in phase1_attempts and 0 in phase2_attempts.
+	 */
	trace_spb_evacuate_for_order_done(zone, order, migratetype,
-			phase1_attempts, phase2_attempts, did_evacuate);
+			attempts, 0, did_evacuate);
	return did_evacuate;
 }
 #endif /* CONFIG_COMPACTION */
-- 
2.52.0