Date: Fri, 10 Apr 2026 15:12:20 -0400
From: Johannes Weiner
To: "Vlastimil Babka (SUSE)"
Cc: linux-mm@kvack.org, Vlastimil Babka, Zi Yan, David Hildenbrand,
	Lorenzo Stoakes, "Liam R. Howlett", Rik van Riel,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator
References: <20260403194526.477775-1-hannes@cmpxchg.org>
	<20260403194526.477775-3-hannes@cmpxchg.org>
	<45f3a5ba-9f61-4ee7-bc9a-af50057c0865@kernel.org>
In-Reply-To: <45f3a5ba-9f61-4ee7-bc9a-af50057c0865@kernel.org>

Hey Vlastimil,

On Fri, Apr 10, 2026 at 11:48:21AM +0200, Vlastimil Babka (SUSE) wrote:
> On 4/3/26 21:40, Johannes Weiner wrote:
> > On large machines, zone->lock is a scaling bottleneck for page
> > allocation. Two common patterns drive contention:
> >
> > 1. Affinity violations: pages are allocated on one CPU but freed
> >    on another (jemalloc, exit, reclaim). The freeing CPU's PCP
> >    drains to the zone buddy, and the allocating CPU refills from
> >    the zone buddy -- both under zone->lock, defeating PCP batching
> >    entirely.
> >
> > 2. Concurrent exits: processes tearing down large address spaces
> >    simultaneously overwhelm per-CPU PCP capacity, serializing on
> >    zone->lock for the overflow.
> >
> > Solution
> >
> > Extend the PCP to operate on whole pageblocks with ownership
> > tracking.
>
> Hi Johannes,
>
> interesting ideas, as usual from you :) I'll try to point out some
> things that immediately came to mind, although it's not a thorough
> review.

Thanks for taking a look!

> > Each CPU claims pageblocks from the zone buddy and splits them
> > locally. Pages are tagged with their owning CPU, so frees route
> > back to the owner's PCP regardless of which CPU frees. This
> > eliminates affinity violations: the owner CPU's PCP absorbs both
> > allocations and frees for its blocks without touching zone->lock.
>
> Details differ a lot of course (e.g. slab has no buddy merging), but
> I can see some parallel between SLUB's cpu slabs and these "cpu
> owned pageblocks". However, SLUB moved in the direction of today's
> pcplists by replacing cpu slabs with sheaves, and this is moving in
> the opposite direction :)

Let me think about this more. I haven't attempted a deeper comparison
with SLUB, but rather followed the data on pain points in the current
buddy / PCP allocator dynamics.
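For comparison purposes, the free-side routing boils down to
something like the below. This is a simplified sketch, not the actual
patch code: the owner+1 encoding and the per_cpu_ptr() step match the
__free_frozen_pages() hunk quoted further down, but page_pbd() is an
invented stand-in for the pageblock_data lookup.

	static struct per_cpu_pages *pcp_for_free(struct zone *zone,
						  struct page *page)
	{
		struct pageblock_data *pbd = page_pbd(page); /* stand-in */
		int cpu = pbd->cpu - 1;	/* pbd->cpu == 0: zone-owned */

		if (cpu < 0) {
			/* no owner: batch on the local PCP, no merging */
			cpu = raw_smp_processor_id();
		}

		return per_cpu_ptr(zone->per_cpu_pageset, cpu);
	}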
> > It also shortens zone->lock hold time during drain and refill
> > cycles. Whole blocks are acquired under zone->lock and then split
> > outside of it. Affinity routing to the owning PCP on free enables
> > buddy merging outside the zone->lock as well; a bottom-up merge
> > pass runs under pcp->lock on drain, freeing larger chunks under
> > zone->lock.
> >
> > PCP refill uses a four-phase approach:
> >
> > Phase 0: recover owned fragments previously drained to zone buddy.
>
> Note this is done using PFN scanning under the zone lock. Is there a
> risk of defeating the short lock hold time goal?

This is part of a larger question; let me take a stab at it below.

> > Phase 1: claim whole pageblocks from zone buddy.
> > Phase 2: grab sub-pageblock chunks without migratetype stealing.
> > Phase 3: traditional __rmqueue() with migratetype fallback.
> >
> > Phase 0/1 pages are owned and marked PagePCPBuddy, making them
> > eligible for PCP-level merging. Phase 2/3 pages are cached on the
> > PCP for batching only -- no ownership, no merging. However, Phase 2
> > still benefits from chunky zone transactions: it pulls higher-order
> > entries from the zone free lists under zone->lock and splits them
> > on the PCP outside of it, rather than acquiring zone->lock per
> > page.
>
> I think this particular benefit could be had even today, without the
> other changes. Should we try that first?

I'll experiment with that in isolation. I would expect it to help in
the allocation paths.

That said, the worst congestion cases we've seen were all triggered
by the freeing paths. Faults and allocations tend to be more spread
out over time. It's the frees that happen in CPU-bound avalanches of
order-0 pages.

> > When PCP batch sizes are small (small machines with few CPUs) or
> > the zone is fragmented and no whole pageblocks are available,
> > refill falls through to Phase 2/3 naturally. The allocator
> > degrades gracefully to the original page-at-a-time behavior.
> >
> > When owned blocks accumulate long-lived allocations (e.g. a mix of
> > anonymous and file cache pages), partial block drains send the
> > free fragments to zone buddy and remember the block, so Phase 0
> > can recover them on the next refill. This allows the allocator to
> > pack new allocations next to existing ones in already-committed
> > blocks rather than consuming fresh pageblocks, keeping
> > fragmentation contained.
>
> So this reads like there could be multiple owned blocks (is there
> any limit?) with only a handful of free pages each, increasing my
> concern about PFN scanning under the zone lock.

Yes, it's a concern right now, and needs more work.

The list is built from PCP fragments on drain, and fully consumed on
refill (migratetype mismatches aside -- that needs fixing). This caps
the list at pcp->high_max - pcp->batch blocks in the worst case, when
there is only one free page in each block.

Those last free pages can get stolen by another CPU before the next
refill, resulting in a worst-case PFN walk of (pcp->high_max -
pcp->batch) * pageblock_nr_pages.

Some ideas I need to try out:

- First, I think we can bound the zone->lock hold period easily by
  cycling the lock after each block.

- We could maintain a free counter in pageblock_data to terminate the
  scan early if no buddies remain.

- We might be able to hard-limit the PFN scans to 2x or 4x
  pages_needed, then fall through to grabbing individual pages.

We shouldn't take new blocks while there are unrecovered partial
blocks, or it would defeat the point of recovery (runaway consumption
of new blocks pinned by a couple of long-lived allocations).

The dynamics this would add are a bit harder to reason about.
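Putting those three bounds together, the Phase 0 recovery pass would
look roughly like the sketch below. To be clear, this is illustrative
only, not patch code: partial_blocks, start_pfn, nr_free and
take_page_for_pcp() are invented names.

	static int recover_owned_fragments(struct zone *zone,
					   struct per_cpu_pages *pcp,
					   int pages_needed)
	{
		struct pageblock_data *pbd;
		int recovered = 0, scanned = 0;
		unsigned long flags;

		list_for_each_entry(pbd, &pcp->partial_blocks, list) {
			unsigned long pfn = pbd->start_pfn;
			unsigned long end = pfn + pageblock_nr_pages;

			/* free counter: skip blocks with no buddies left */
			if (!READ_ONCE(pbd->nr_free))
				continue;

			/* cycle the lock: hold it for one block at most */
			spin_lock_irqsave(&zone->lock, flags);
			while (pfn < end) {
				struct page *page = pfn_to_page(pfn);
				unsigned int order;

				if (!PageBuddy(page)) {
					pfn++;
					continue;
				}
				order = buddy_order(page);
				/* unlink from zone freelist, tag owned */
				take_page_for_pcp(zone, pcp, page, order);
				recovered += 1 << order;
				pfn += 1 << order;
			}
			spin_unlock_irqrestore(&zone->lock, flags);

			/* hard cap on total scanning effort */
			scanned += pageblock_nr_pages;
			if (recovered >= pages_needed ||
			    scanned >= 4 * pages_needed)
				break;
		}
		return recovered;
	}

The free counter would have to tolerate remote frees racing with the
scan, hence the READ_ONCE(); a stale value only costs one wasted or
deferred block scan until the next refill.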
> > @@ -2941,15 +3242,45 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> >  		add_page_to_zone_llist(zone, page, order);
> >  		return;
> >  	}
> > -	pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags);
> > -	if (pcp) {
> > -		if (!free_frozen_page_commit(zone, pcp, page, migratetype,
> > -					     order, fpi_flags, &UP_flags))
> > +
> > +	/*
> > +	 * Route page to the owning CPU's PCP for merging, or to
> > +	 * the local PCP for batching (zone-owned pages). Zone-owned
> > +	 * pages are cached without PagePCPBuddy -- the merge pass
> > +	 * skips them, so they're inert on any PCP list and drain
> > +	 * individually to zone buddy.
> > +	 *
> > +	 * Ownership is stable here: it can only change when the
> > +	 * pageblock is complete -- either fully free in zone buddy
> > +	 * (Phase 1 claims) or fully merged on PCP (drain disowns).
> > +	 * Since we hold this page, neither can happen.
> > +	 */
> > +	owner_cpu = pbd->cpu - 1;
> > +	cache_cpu = owner_cpu;
> > +	if (cache_cpu < 0)
> > +		cache_cpu = raw_smp_processor_id();
> > +
> > +	pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
> > +	if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
> > +		if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
> > +			free_one_page(zone, page, pfn, order, fpi_flags);
> >  			return;
> > -		pcp_spin_unlock(pcp, UP_flags);
> > +		}
> >  	} else {
> > +		spin_lock_irqsave(&pcp->lock, UP_flags);
>
> Hm, was it necessary to replace the pcp trylock scheme with
> spin_lock_irqsave() here?

It's beneficial. Before, this lock would only be contended under
preemption; the trylock was needed to avoid a deadlock. After, we can
see contention when freeing to a remote PCP that's busy. But giving
the page back to its owning PCP is still the best destination for it.

A trylock with a zone buddy fallback means a zone->lock cycle now AND
for a single page AND increased odds of a zone->locked refill later
(since that page leaves the PCP economy).

What *might* work is an llist on that PCP, leaving it to the next
successful PCP lock holder to drain. But that means more work under
pcp->lock on the allocation and drain side, so it could be a wash.
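For illustration, the llist idea would be shaped roughly like the
below. This is an untested sketch with invented names (deferred_frees,
free_to_pcp_locked()), and it assumes the llist_node in struct page
that the zone llist path already rides on is available here as well.

	/* fold in pages that remote freers parked while we were busy */
	static void drain_deferred_frees(struct per_cpu_pages *pcp)
	{
		struct llist_node *list = llist_del_all(&pcp->deferred_frees);
		struct page *page, *next;

		llist_for_each_entry_safe(page, next, list, pcp_llist)
			free_to_pcp_locked(pcp, page);	/* stand-in */
	}

	static void free_to_owner_pcp(struct zone *zone, struct page *page,
				      int owner_cpu)
	{
		struct per_cpu_pages *pcp;
		unsigned long flags;

		pcp = per_cpu_ptr(zone->per_cpu_pageset, owner_cpu);

		if (spin_trylock_irqsave(&pcp->lock, flags)) {
			drain_deferred_frees(pcp);
			free_to_pcp_locked(pcp, page);	/* stand-in */
			spin_unlock_irqrestore(&pcp->lock, flags);
			return;
		}

		/*
		 * Owner is busy. Park the page on its llist instead of
		 * falling back to zone->lock; whoever takes pcp->lock
		 * next folds it in via drain_deferred_frees().
		 */
		llist_add(&page->pcp_llist, &pcp->deferred_frees);
	}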