From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org,
	willy@infradead.org, surenb@google.com, hannes@cmpxchg.org,
	ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev,
	Rik van Riel, Rik van Riel
Subject: [RFC PATCH 31/45] mm: page_alloc: cross-non-movable buddy borrow within tainted SPBs
Date: Thu, 30 Apr 2026 16:21:00 -0400
Message-ID: <20260430202233.111010-32-riel@surriel.com>
X-Mailer: git-send-email 2.52.0
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From: Rik van Riel

When pages get freed via __free_one_page, they're placed on the per-SPB
free_list determined by their pageblock's migratetype, not the original
allocation's migratetype. Slab-heavy workloads expose a structural
mismatch:

- RECLAIMABLE pageblocks fill up densely with live slab objects (e.g.
  btrfs_inode caches), leaving very few sub-pageblock free fragments on
  the RECL free list.
- UNMOVABLE pageblocks accumulate sparse free space from vmalloc and
  raw-alloc churn — tens of thousands of free pages, all on the UNMOV
  free list.

Net effect: a tainted SPB can show 87,000+ free pages in metadata while
having ZERO free buddies on the RECL list. A new RECL allocation walking
__rmqueue_smallest's preferred-SB Pass 1 finds nothing, falls through
Pass 2 (claim_whole_block on MOVABLE — but mov=0 in tainted SBs) and
Pass 2b (sub-PB MOVABLE — same), and reaches Pass 3, which taints a
fresh clean SPB. Repeat for every RECL burst.

Add a Pass 2c between 2b and 3: for non-movable allocations that could
not find their own migratetype, try borrowing a sub-pageblock buddy from
the *opposite* non-movable migratetype's free list within tainted SPBs.
UNMOV alloc → check the RECL free list; RECL alloc → check the UNMOV
free list.

The pageblock tag is NOT changed — page_del_and_expand uses the source
migratetype for both delete and re-list, so the splits stay on the
source list, and when our borrowed page is later freed __free_one_page
returns it to the source list (based on the pageblock tag). The "borrow"
is purely transient: the physical page goes to a foreign-type caller and
returns to its native list on free.

PB_has_ is set via __spb_set_has_type so spb_defrag accounting reflects
that the pageblock now hosts our type's content. PB_has_ stays set since
other buddies of that type remain.

Restricted to UNMOV ↔ RECL within SB_TAINTED — movable allocations have
their own Pass 4 fallback, and clean SPBs must not be polluted with
cross-type mixing (that is what the existing migratetype-isolation
machinery exists to prevent).

Live measurement on a 247 GB devvm with btrfs root, kernel 397 (Stage 1
+ simplified Stage 2a), at boot+7min: tainted Normal-zone SPBs grew from
a baseline of 4 to 12, despite the existing 11 having between 825 and
87,062 free pages each, ALL on the UNMOV list, while the workload kept
allocating RECL btrfs_inode slab pages.
Pass 2c lets those allocs absorb into the existing UNMOV-listed free
pool rather than creating fresh tainted SPBs.

Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/page_alloc.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a72cb2da606d..f2db3dd86a84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2806,6 +2806,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	struct page *page;
 	int full;
 	struct superpageblock *sb;
+	int opposite_mt;
 	/*
 	 * Category search order: 2 passes.
 	 * Movable: clean first, then tainted (pack into clean SBs).
@@ -2985,6 +2986,90 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			}
 		}
 	}
+
+	/*
+	 * Pass 2c: cross-non-movable borrow within tainted SPBs.
+	 *
+	 * If we're a non-movable alloc and Pass 1/2/2b couldn't find a
+	 * buddy on our migratetype's free list anywhere, but tainted
+	 * SPBs have free buddies on the *opposite* non-movable type's
+	 * free list, take one of those.
+	 *
+	 * Why this happens: when pages are freed, __free_one_page puts
+	 * them on the free_list determined by their pageblock's tag,
+	 * not the original allocation's migratetype. Slab caches tend
+	 * to be dense (RECL pageblocks fill up; few sub-PB fragments),
+	 * while UNMOV pageblocks accumulate sparse free space from
+	 * vmalloc/raw alloc churn. Net effect: tainted SPBs frequently
+	 * have tens of thousands of free pages all on the UNMOV list,
+	 * invisible to RECL allocs (or vice versa). Without this pass,
+	 * the alloc falls through to Pass 3 and taints a fresh clean
+	 * SPB even though the existing tainted ones have plenty of
+	 * unused space.
+	 *
+	 * We do NOT relabel the source pageblock. The buddy is taken
+	 * from @opposite_mt's free list and the splits go back on
+	 * @opposite_mt's list (page_del_and_expand uses the same mt
+	 * for delete and expand). The pageblock tag is unchanged, so
+	 * the page returns to @opposite_mt's list when freed via
+	 * __free_one_page. Effectively a borrow: the alloc takes a
+	 * physical page from a UNMOV-tagged pageblock for a RECL
+	 * use, and the page cycles back to UNMOV's list on free.
+	 *
+	 * We do set PB_has_ via __spb_set_has_type so
+	 * spb_defrag accounting reflects that this pageblock now hosts
+	 * our migratetype's content too. PB_has_ stays
+	 * set since other buddies of that type remain.
+	 *
+	 * Restricted to UNMOV ↔ RECL. Movable allocations don't
+	 * participate (they have their own Pass 4 fallback path).
+	 *
+	 * Restricted to SB_TAINTED to avoid spreading mixing into
+	 * clean SPBs.
+	 */
+	opposite_mt = -1;
+	if (migratetype == MIGRATE_UNMOVABLE)
+		opposite_mt = MIGRATE_RECLAIMABLE;
+	else if (migratetype == MIGRATE_RECLAIMABLE)
+		opposite_mt = MIGRATE_UNMOVABLE;
+
+	if (opposite_mt >= 0) {
+		for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+			list_for_each_entry(sb,
+				&zone->spb_lists[SB_TAINTED][full], list) {
+				int co;
+
+				if (!sb->nr_free_pages)
+					continue;
+				for (co = min_t(int, pageblock_order - 1,
+						NR_PAGE_ORDERS - 1);
+				     co >= (int)order; --co) {
+					current_order = co;
+					area = &sb->free_area[current_order];
+					page = get_page_from_free_area(
+							area, opposite_mt);
+					if (!page)
+						continue;
+					if (get_pageblock_isolate(page))
+						continue;
+					if (is_migrate_cma(
+					    get_pageblock_migratetype(page)))
+						continue;
+					page_del_and_expand(zone, page,
+							order, current_order,
+							opposite_mt);
+					__spb_set_has_type(page,
+							migratetype);
+					trace_mm_page_alloc_zone_locked(
+						page, order, migratetype,
+						pcp_allowed_order(order) &&
+						migratetype < MIGRATE_PCPTYPES);
+					return page;
+				}
+			}
+		}
+	}
 }

 /*
-- 
2.52.0