public inbox for linux-mm@kvack.org
From: Johannes Weiner <hannes@cmpxchg.org>
To: "Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Cc: linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.cz>,
	Zi Yan <ziy@nvidia.com>, David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Rik van Riel <riel@surriel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator
Date: Fri, 10 Apr 2026 15:12:20 -0400	[thread overview]
Message-ID: <adlLlARgYNL_T5hq@cmpxchg.org> (raw)
In-Reply-To: <45f3a5ba-9f61-4ee7-bc9a-af50057c0865@kernel.org>

Hey Vlastimil,

On Fri, Apr 10, 2026 at 11:48:21AM +0200, Vlastimil Babka (SUSE) wrote:
> On 4/3/26 21:40, Johannes Weiner wrote:
> > On large machines, zone->lock is a scaling bottleneck for page
> > allocation. Two common patterns drive contention:
> > 
> > 1. Affinity violations: pages are allocated on one CPU but freed on
> > another (jemalloc, exit, reclaim). The freeing CPU's PCP drains to
> > zone buddy, and the allocating CPU refills from zone buddy -- both
> > under zone->lock, defeating PCP batching entirely.
> > 
> > 2. Concurrent exits: processes tearing down large address spaces
> > simultaneously overwhelm per-CPU PCP capacity, serializing on
> > zone->lock for overflow.
> > 
> > Solution
> > 
> > Extend the PCP to operate on whole pageblocks with ownership tracking.
> 
> Hi Johannes,
> 
> interesting ideas, as usual from you :) I'll try to point out some things
> that immediately came to mind, although it's not a thorough review.

Thanks for taking a look!

> > Each CPU claims pageblocks from the zone buddy and splits them
> > locally. Pages are tagged with their owning CPU, so frees route back
> > to the owner's PCP regardless of which CPU frees. This eliminates
> > affinity violations: the owner CPU's PCP absorbs both allocations and
> > frees for its blocks without touching zone->lock.
> 
> Details differ a lot of course (i.e. slab has no buddy merging) but I can
> see some parallel with SLUB's cpu slabs and these "cpu owned pageblocks".
> However SLUB moved into the direction of today's pcplists with replacing
> that with sheaves, and this is moving in the opposite direction :)

Let me think about this more. I haven't attempted a deeper comparison
with slub, but rather followed the data on pain points in the current
buddy / pcp allocator dynamics.

> > It also shortens zone->lock hold time during drain and refill
> > cycles. Whole blocks are acquired under zone->lock and then split
> > outside of it. Affinity routing to the owning PCP on free enables
> > buddy merging outside the zone->lock as well; a bottom-up merge pass
> > runs under pcp->lock on drain, freeing larger chunks under zone->lock.
> > 
> > PCP refill uses a four-phase approach:
> > 
> > Phase 0: recover owned fragments previously drained to zone buddy.
> 
> Note this is done using pfn scanning under zone lock. Is there a risk of
> defeating the short lock hold time goal?

This is part of a larger question, let me take a stab below.

> > Phase 1: claim whole pageblocks from zone buddy.
> > Phase 2: grab sub-pageblock chunks without migratetype stealing.
> > Phase 3: traditional __rmqueue() with migratetype fallback.
> > 
> > Phase 0/1 pages are owned and marked PagePCPBuddy, making them
> > eligible for PCP-level merging. Phase 2/3 pages are cached on PCP for
> > batching only -- no ownership, no merging.
> 
> > However, Phase 2 still
> > benefits from chunky zone transactions: it pulls higher-order entries
> > from zone free lists under zone->lock and splits them on the PCP
> > outside of it, rather than acquiring zone->lock per page.
> 
> I think this particular benefit could be possible to do even today without
> the other changes. Should we try it first?

I'll experiment with that in isolation. I would expect it to help in
the allocation paths.

That said, the worst congestion cases we've seen were all triggered by
the freeing paths. Faults and allocations tend to be more spread out
over time. It's the frees that arrive in CPU-bound avalanches of
order-0 pages.

> > When PCP batch sizes are small (small machines with few CPUs) or the
> > zone is fragmented and no whole pageblocks are available, refill falls
> > through to Phase 2/3 naturally. The allocator degrades gracefully to
> > the original page-at-a-time behavior.
> > 
> > When owned blocks accumulate long-lived allocations (e.g. a mix of
> > anonymous and file cache pages), partial block drains send the free
> > fragments to zone buddy and remember the block, so Phase 0 can recover
> > them on the next refill. This allows the allocator to pack new
> > allocations next to existing ones in already-committed blocks rather
> > than consuming fresh pageblocks, keeping fragmentation contained.
> 
> So this reads like there could be multiple owned blocks (is there any
> limit?) with only a bunch of free pages each, increasing my concern about
> pfn scanning under zone lock.

Yes, it's a concern right now, and needs more work.

The list is built from PCP fragments on drain, and fully consumed on
refill (migratetype mismatch aside, need to fix). That caps the list
at pcp->high_max - pcp->batch blocks in the worst case (when there is
only one free page in each block).

Those last free pages can get stolen by another CPU before the next
refill, resulting in a PFN walk worst-case of (pcp->high_max -
pcp->batch) * pageblock_nr_pages.

Some ideas I need to try out:

- First I think we can bound the zone->lock hold period easily by
  cycling the lock after each block.

- We could maintain a free counter in pageblock_data to terminate the
  scan early if no buddies remain.

- We might be able to hard limit the PFN scans to 2x or 4x
  pages_needed, then fall through to grabbing individual pages. We
  shouldn't get new blocks while there are unrecovered partial blocks,
  or it would defeat the point of recovery (runaway consumption of new
  blocks pinned by a couple of long-lived allocations). The dynamics
  this would add are a bit harder to reason about.

> > @@ -2941,15 +3242,45 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> >  		add_page_to_zone_llist(zone, page, order);
> >  		return;
> >  	}
> > -	pcp = pcp_spin_trylock(zone->per_cpu_pageset, UP_flags);
> > -	if (pcp) {
> > -		if (!free_frozen_page_commit(zone, pcp, page, migratetype,
> > -						order, fpi_flags, &UP_flags))
> > +
> > +	/*
> > +	 * Route page to the owning CPU's PCP for merging, or to
> > +	 * the local PCP for batching (zone-owned pages). Zone-owned
> > +	 * pages are cached without PagePCPBuddy -- the merge pass
> > +	 * skips them, so they're inert on any PCP list and drain
> > +	 * individually to zone buddy.
> > +	 *
> > +	 * Ownership is stable here: it can only change when the
> > +	 * pageblock is complete -- either fully free in zone buddy
> > +	 * (Phase 1 claims) or fully merged on PCP (drain disowns).
> > +	 * Since we hold this page, neither can happen.
> > +	 */
> > +	owner_cpu = pbd->cpu - 1;
> > +	cache_cpu = owner_cpu;
> > +	if (cache_cpu < 0)
> > +		cache_cpu = raw_smp_processor_id();
> > +
> > +	pcp = per_cpu_ptr(zone->per_cpu_pageset, cache_cpu);
> > +	if (unlikely(fpi_flags & FPI_TRYLOCK) || !in_task()) {
> > +		if (!spin_trylock_irqsave(&pcp->lock, UP_flags)) {
> > +			free_one_page(zone, page, pfn, order, fpi_flags);
> >  			return;
> > -		pcp_spin_unlock(pcp, UP_flags);
> > +		}
> >  	} else {
> > +		spin_lock_irqsave(&pcp->lock, UP_flags);
> 
> Hm was it necessary to replace the pcp trylock scheme with
> spin_lock_irqsave() here?

It's beneficial.

Before, this would only be contended with preemption; the trylock was
needed to avoid a deadlock.

After, we can have contention when freeing to a remote PCP that's
busy. But giving the page back to that owning PCP is still the best
destination for it. A trylock with a zone buddy fallback means a
zone->lock cycle now AND for a single page AND increases the odds of a
zone->locked refill later (since that page leaves the PCP economy).

What *might* work is an llist on that PCP, leaving it to the next
successful PCP holder to drain. But that means more work under the
pcp->lock on the allocation and drain side, so it could be a wash.



Thread overview: 13+ messages
2026-04-03 19:40 [RFC 0/2] mm: page_alloc: pcp buddy allocator Johannes Weiner
2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
2026-04-04  1:43   ` Rik van Riel
2026-04-03 19:40 ` [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator Johannes Weiner
2026-04-04  1:42   ` Rik van Riel
2026-04-06 16:12     ` Johannes Weiner
2026-04-06 17:31   ` Frank van der Linden
2026-04-06 21:58     ` Johannes Weiner
2026-04-10  9:48   ` Vlastimil Babka (SUSE)
2026-04-10 19:12     ` Johannes Weiner [this message]
2026-04-04  2:27 ` [RFC 0/2] mm: page_alloc: pcp " Zi Yan
2026-04-06 15:24   ` Johannes Weiner
2026-04-07  2:42     ` Zi Yan
