From mboxrd@z Thu Jan  1 00:00:00 1970
From: Rik van Riel <riel@surriel.com>
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	Rik van Riel <riel@surriel.com>
Subject: [RFC PATCH 38/45] mm: page_alloc: per-(zone, order, mt) PASS_1 hint cache
Date: Thu, 30 Apr 2026 16:21:07 -0400
Message-ID: <20260430202233.111010-39-riel@surriel.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Rik van Riel <riel@surriel.com>

PASS_1 of __rmqueue_smallest walks &zone->spb_lists[cat][full]
linearly. Under a steady workload on a 247 GB devvm, the median walk
depth was ~50 SPBs, and 20-57% of allocations visited 100+ SPBs.

Cache the SPB that last satisfied a PASS_1 allocation for each
(zone, order, migratetype) tuple, in two layers:

- a per-zone hint (zone->sb_hint[order][mt]), visible to all CPUs and
  serialized by zone->lock.
- a per-CPU hint indexed by zone_idx, which is cache-hot and
  contention-free.

Each per-CPU slot stores (zone *, sb *) because zone_idx is per-pgdat
(not globally unique on NUMA); the zone-pointer check on read prevents
a cross-node SPB from being handed back to the wrong zone's
accounting.

Stale hints are harmless: try_alloc_from_sb_pass1() returns NULL and
the standard list walk runs as before. On a successful PASS_1 list
walk, both hints are refreshed.

spb_invalidate_warm_hints() clears both arrays from
resize_zone_superpageblocks() under zone->lock, to prevent a
use-after-free across memory hotplug add.

Hint hits show up in tracepoint:kmem:spb_alloc_walk as the [0, 5)
bucket because n_spbs_visited stays 0; no new tracepoint is needed.

The hint is skipped for migratetype >= MIGRATE_PCPTYPES
(HIGHATOMIC/CMA/ISOLATE are already cheap or rare).

Measurement on the same devvm with this commit applied:

    median walk depth:        ~50 SPBs -> ~5
    tail (>=100 SPB visits):  20-57%   -> 0.4%
    hint hit rate (n=0):      -> 99%

Memory cost: ~320 B per zone plus ~2.6 KB per CPU
(MAX_NR_ZONES * NR_PAGE_ORDERS * MIGRATE_PCPTYPES * sizeof(slot)).

Signed-off-by: Rik van Riel <riel@surriel.com>
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
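For review convenience, the fast path this adds to __rmqueue_smallest()
boils down to roughly the sketch below. It restates the diff rather
than adding code: the SB_TAINTED/shrinker and tracepoint bookkeeping
are elided, and "mt" abbreviates migratetype.

	/* Runs under zone->lock, like the rest of __rmqueue_smallest(). */
	struct superpageblock *cpu_hint = NULL, *zone_hint;
	struct spb_warm_hint_slot *slot;

	slot = this_cpu_ptr(&spb_warm_hints.slot[zone_idx(zone)][order][mt]);

	/* 1) Per-CPU hint, honored only if it was recorded for this zone. */
	if (slot->zone == zone)
		cpu_hint = slot->sb;
	if (cpu_hint && (page = try_alloc_from_sb_pass1(zone, cpu_hint,
							order, mt)))
		return page;		/* n_spbs_visited stays 0 */

	/* 2) Per-zone hint, shared by all CPUs under zone->lock. */
	zone_hint = zone->sb_hint[order][mt];
	if (zone_hint && zone_hint != cpu_hint &&
	    (page = try_alloc_from_sb_pass1(zone, zone_hint, order, mt))) {
		slot->zone = zone;	/* promote into the per-CPU layer */
		slot->sb = zone_hint;
		return page;
	}

	/* 3) Miss on both: fall through to the spb_lists[cat][full] walk. */
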
 include/linux/mmzone.h |  11 +++
 mm/internal.h          |   2 +
 mm/mm_init.c           |   8 ++
 mm/page_alloc.c        | 180 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 201 insertions(+)
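
As a sanity check on the quoted memory cost, under one common x86_64
configuration (MAX_NR_ZONES = 5, NR_PAGE_ORDERS = 11,
MIGRATE_PCPTYPES = 3, and 16-byte slots holding two 8-byte pointers;
all of these constants are config-dependent, so the numbers are
illustrative):

	per CPU:   5 zones * 11 orders * 3 types * 16 B = 2640 B  (~2.6 KB)
	per zone:            11 orders * 3 types *  8 B =  264 B

The per-zone array scales with NR_PAGE_ORDERS, so configs with a larger
maximum order land closer to the ~320 B quoted in the changelog.
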
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68892e40cd4e..298cff01160c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1014,6 +1014,17 @@ struct zone {
 	struct list_head spb_isolated;	/* fully isolated (1GB contig alloc) */
 	struct list_head spb_lists[__NR_SB_CATEGORIES][__NR_SB_FULLNESS];
 
+	/*
+	 * Stage 5 PASS_1 fast-path hint: most-recent SPB that satisfied a
+	 * (order, mt) PASS_1 allocation. Stale hints are harmless — the hint
+	 * try-alloc just falls through to the standard list walk on miss.
+	 * Sized for [0..NR_PAGE_ORDERS) x PCPTYPES; HIGHATOMIC/CMA/ISOLATE
+	 * skip the hint (already cheap or rare). Invalidated by
+	 * spb_invalidate_warm_hints() when the SPB array is resized
+	 * (memory hotplug add).
+	 */
+	struct superpageblock *sb_hint[NR_PAGE_ORDERS][MIGRATE_PCPTYPES];
+
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
diff --git a/mm/internal.h b/mm/internal.h
index 71e39414645f..c84d7acb9342 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1041,6 +1041,8 @@ static inline void superpageblock_set_has_movable(struct zone *zone,
 void resize_zone_superpageblocks(struct zone *zone);
 #endif
 
+void spb_invalidate_warm_hints(struct zone *zone);
+
 struct cma;
 
 #ifdef CONFIG_CMA
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 8e3c64d37254..3a57cc4f3b48 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1810,6 +1810,14 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
 	zone->superpageblock_base_pfn = new_sb_base;
 	zone->spb_kvmalloced = true;
 
+	/*
+	 * Invalidate Stage 5 PASS_1 hints under zone->lock so that no
+	 * concurrent allocator (also entering __rmqueue_smallest under
+	 * zone->lock) can dereference an old SPB pointer that is about
+	 * to be freed below.
+	 */
+	spb_invalidate_warm_hints(zone);
+
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d621e84bf664..2f5d3ba1c0ef 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2814,6 +2814,110 @@ struct spb_tainted_walk {
 	bool saw_below_reserve;	/* tainted SPB has nr_free <= spb_tainted_reserve */
 };
 
+/*
+ * Stage 5 PASS_1 fast-path hint: most-recent SPB this CPU successfully
+ * allocated from for a given (zone, order, migratetype). Combined with
+ * the per-zone zone->sb_hint[][], this lets PASS_1 skip the linear walk
+ * of spb_lists[cat][full] in the common case (~78 SPBs visited per
+ * order-0 MOVABLE alloc on the build-403 baseline). Stale hints are
+ * harmless — the try-alloc just falls through to the standard list walk
+ * on miss.
+ *
+ * The slot stores both the zone pointer and the SPB pointer because
+ * zone_idx(zone) is per-pgdat (not globally unique on NUMA), so two
+ * nodes' ZONE_NORMAL share the same array index. The zone-pointer check
+ * on read prevents a cross-node SPB from being handed back to the wrong
+ * zone (which would corrupt per-zone NR_FREE_PAGES accounting).
+ */
+struct spb_warm_hint_slot {
+	struct zone *zone;
+	struct superpageblock *sb;
+};
+struct spb_warm_hints {
+	struct spb_warm_hint_slot slot[MAX_NR_ZONES][NR_PAGE_ORDERS][MIGRATE_PCPTYPES];
+};
+static DEFINE_PER_CPU(struct spb_warm_hints, spb_warm_hints);
+
+/**
+ * spb_invalidate_warm_hints - drop all cached hints into @zone
+ * @zone: zone whose SPB array is about to change
+ *
+ * Called from memory hotplug paths that resize zone->superpageblocks
+ * (and therefore invalidate every SPB pointer for @zone). Must be
+ * called with zone->lock held; the lock serializes against any CPU
+ * doing a hint read inside __rmqueue_smallest (also under zone->lock),
+ * so callers see either pre-invalidation state (old SPB pointers,
+ * still-valid old array) or post-invalidation state (NULL slots) —
+ * never a half-state with stale pointers into a freed array.
+ */
+void spb_invalidate_warm_hints(struct zone *zone)
+{
+	enum zone_type zidx = zone_idx(zone);
+	int cpu, order, mt;
+
+	lockdep_assert_held(&zone->lock);
+
+	memset(zone->sb_hint, 0, sizeof(zone->sb_hint));
+
+	for_each_possible_cpu(cpu) {
+		struct spb_warm_hints *h = per_cpu_ptr(&spb_warm_hints, cpu);
+
+		for (order = 0; order < NR_PAGE_ORDERS; order++) {
+			for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+				if (h->slot[zidx][order][mt].zone != zone)
+					continue;
+				h->slot[zidx][order][mt].zone = NULL;
+				h->slot[zidx][order][mt].sb = NULL;
+			}
+		}
+	}
+}
+
+/*
+ * Try to allocate from a single SPB using PASS_1 semantics:
+ * whole pageblock first (PCP-buddy friendly), then sub-pageblock.
+ * Returns the page on success, NULL on miss. Caller is responsible
+ * for tracepoints, hint updates, and shrinker queueing.
+ */
+static struct page *try_alloc_from_sb_pass1(struct zone *zone,
+					    struct superpageblock *sb,
+					    unsigned int order,
+					    int migratetype)
+{
+	unsigned int current_order;
+	struct free_area *area;
+	struct page *page;
+
+	if (!sb->nr_free_pages)
+		return NULL;
+
+	for (current_order = max(order, pageblock_order);
+	     current_order < NR_PAGE_ORDERS;
+	     ++current_order) {
+		area = &sb->free_area[current_order];
+		page = get_page_from_free_area(area, migratetype);
+		if (!page)
+			continue;
+		page_del_and_expand(zone, page, order,
+				    current_order, migratetype);
+		return page;
+	}
+	if (order < pageblock_order) {
+		for (current_order = order;
+		     current_order < pageblock_order;
+		     ++current_order) {
+			area = &sb->free_area[current_order];
+			page = get_page_from_free_area(area, migratetype);
+			if (!page)
+				continue;
+			page_del_and_expand(zone, page, order,
+					    current_order, migratetype);
+			return page;
+		}
+	}
+	return NULL;
+}
+
 static __always_inline
 struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 				int migratetype, unsigned int alloc_flags,
@@ -2836,6 +2940,64 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	};
 	int movable = (migratetype == MIGRATE_MOVABLE) ? 1 : 0;
 
+	/*
+	 * Stage 5 PASS_1 fast-path: try per-CPU then per-zone hint SPB
+	 * before the linear list walk. The hint stores the SPB that last
+	 * satisfied a PASS_1 alloc for this (zone, order, migratetype).
+	 * On hit, we skip the entire spb_lists walk (n_spbs_visited stays
+	 * 0, which shows up as the [0,5) bucket in the spb_alloc_walk
+	 * tracepoint histogram). Skip for HIGHATOMIC/CMA/ISOLATE — those
+	 * paths are already cheap (atomic-NORETRY skip) or rare.
+	 */
+	if (migratetype < MIGRATE_PCPTYPES) {
+		enum zone_type zidx = zone_idx(zone);
+		struct superpageblock *cpu_hint = NULL, *zone_hint;
+		struct spb_warm_hint_slot *slot;
+
+		slot = this_cpu_ptr(
+			&spb_warm_hints.slot[zidx][order][migratetype]);
+		/*
+		 * Validate slot->zone == zone: zone_idx is per-pgdat, so
+		 * on NUMA the same slot index is shared by every node's
+		 * zone of this type. Without this check, a hint written
+		 * from one node would be returned to allocations on
+		 * another node and corrupt the wrong zone's accounting.
+		 */
+		if (slot->zone == zone)
+			cpu_hint = slot->sb;
+		if (cpu_hint) {
+			page = try_alloc_from_sb_pass1(zone, cpu_hint,
+						       order, migratetype);
+			if (page) {
+				if (spb_get_category(cpu_hint) == SB_TAINTED &&
+				    spb_below_shrink_high_water(cpu_hint))
+					queue_spb_slab_shrink(zone);
+				trace_mm_page_alloc_zone_locked(page, order,
+					migratetype,
+					pcp_allowed_order(order) &&
+					migratetype < MIGRATE_PCPTYPES);
+				return page;
+			}
+		}
+		zone_hint = zone->sb_hint[order][migratetype];
+		if (zone_hint && zone_hint != cpu_hint) {
+			page = try_alloc_from_sb_pass1(zone, zone_hint,
+						       order, migratetype);
+			if (page) {
+				if (spb_get_category(zone_hint) == SB_TAINTED &&
+				    spb_below_shrink_high_water(zone_hint))
+					queue_spb_slab_shrink(zone);
+				slot->zone = zone;
+				slot->sb = zone_hint;
+				trace_mm_page_alloc_zone_locked(page, order,
+					migratetype,
+					pcp_allowed_order(order) &&
+					migratetype < MIGRATE_PCPTYPES);
+				return page;
+			}
+		}
+	}
+
 	/*
 	 * Search per-superpageblock free lists for pages of the requested
 	 * migratetype, walking superpageblocks from fullest to emptiest
@@ -2902,6 +3064,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 				page, order, migratetype,
 				pcp_allowed_order(order) &&
 				migratetype < MIGRATE_PCPTYPES);
+			if (migratetype < MIGRATE_PCPTYPES) {
+				struct spb_warm_hint_slot *slot;
+
+				zone->sb_hint[order][migratetype] = sb;
+				slot = this_cpu_ptr(&spb_warm_hints.slot
+					[zone_idx(zone)][order][migratetype]);
+				slot->zone = zone;
+				slot->sb = sb;
+			}
 			return page;
 		}
 		/* Then try sub-pageblock (no PCP buddy) */
@@ -2924,6 +3095,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 				page, order, migratetype,
 				pcp_allowed_order(order) &&
 				migratetype < MIGRATE_PCPTYPES);
+			if (migratetype < MIGRATE_PCPTYPES) {
+				struct spb_warm_hint_slot *slot;
+
+				zone->sb_hint[order][migratetype] = sb;
+				slot = this_cpu_ptr(&spb_warm_hints.slot
+					[zone_idx(zone)][order][migratetype]);
+				slot->zone = zone;
+				slot->sb = sb;
+			}
 			return page;
 		}
 	}
-- 
2.52.0