From: Johannes Weiner <hannes@cmpxchg.org>
To: Zi Yan <ziy@nvidia.com>
Cc: linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.cz>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Rik van Riel <riel@surriel.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 0/2] mm: page_alloc: pcp buddy allocator
Date: Mon, 6 Apr 2026 11:24:21 -0400
Message-ID: <adPQJfmbpYr3-uzX@cmpxchg.org>
In-Reply-To: <1C961B84-522F-43AB-ADCB-014B3A4ACD21@nvidia.com>
On Fri, Apr 03, 2026 at 10:27:36PM -0400, Zi Yan wrote:
> On 3 Apr 2026, at 15:40, Johannes Weiner wrote:
> > this is an RFC for making the page allocator scale better with higher
> > thread counts and larger memory quantities.
> >
> > In Meta production, we're seeing increasing zone->lock contention that
> > was traced back to a few different paths. A prominent one is the
> > userspace allocator, jemalloc. Allocations happen from page faults on
> > all CPUs running the workload. Frees are cached for reuse, but the
> > caches are periodically purged back to the kernel from a handful of
> > purger threads. This breaks affinity between allocations and frees:
> > Both sides use their own PCPs - one side depletes them, the other one
> > overfills them. Both sides routinely hit the zone->lock slowpath.
> >
> > My understanding is that tcmalloc has a similar architecture.
> >
> > Another contributor to contention is process exits, where large
> > numbers of pages are freed at once. The current PCP can only reduce
> > lock time when pages are reused. Reuse is unlikely because it's an
> > avalanche of free pages on a CPU busy walking page tables. Every time
> > the PCP overflows, the drain acquires the zone->lock and frees pages
> > one by one, trying to merge buddies together.
>
> IIUC, zone->lock held time is mostly spent on free page merging.
> Have you tried to let PCP do the free page merging before holding
> zone->lock and returning free pages to buddy? That is a much smaller
> change than what you proposed. This method might not work if
> physically contiguous free pages are allocated by separate CPUs,
> so that PCP merging cannot be done. But this might be rare?
On my 32G system, pcp->high_min for zone Normal is 988. That's one
block and a half. The rmqueue_smallest policy means the next CPU will
prefer the remainder of that partial block. So if there is
concurrency, every other block is shared. Not exactly uncommon. The
effect lessens the larger the machine is, of course.
But let's assume it's not an issue. How do you know you can safely
merge with a buddy pfn? You need to establish that it's on that same
PCP's list. Short of *scanning* the list, it seems something like
PagePCPBuddy() and page->pcp_cpu are inevitably needed. But of course a
per-page cpu field is tough to come by.
So the block ownership is more natural, and then you might as well use
that for affinity routing to increase the odds of merges.
IOW, I'm having a hard time seeing what could be taken away and still
have it work.
> > The idea proposed here is this: instead of single pages, make the PCP
> > grab entire pageblocks, split them outside the zone->lock. That CPU
> > then takes ownership of the block, and all frees route back to that
> > PCP instead of the freeing CPU's local one.
>
> This is basically distributed buddy allocators, right? Instead of
> relying on a single zone->lock, PCP locks are used. The worst case
> it can face is that physically contiguous free pages are allocated
> across all CPUs, so that all CPUs are competing for a single PCP lock.
The worst case is one CPU allocating for everybody else in the system,
so that all freers route to that PCP.
I've played with microbenchmarks to provoke this, but it looks mostly
neutral over baseline, at least at the scale of this machine.
In this scenario, baseline will have the affinity mismatch problem:
the allocating CPU routinely hits zone->lock to refill, and the
freeing CPUs routinely hit zone->lock to drain and merge.
In the new scheme, they would hit the pcp->lock instead of the
zone->lock. So not necessarily an improvement in lock breaking. BUT
because freers refill the allocator's cache, merging is deferred;
that's a net reduction of work performed under the contended lock.
> It seems that you have not hit this. So I wonder if what I proposed
> above might work as a simpler approach. Let me know if I miss anything.
>
> I wonder how this distributed buddy allocator would work if anyone
> wants to allocate >pageblock free pages, like alloc_contig_range().
> Multiple PCP locks need to be taken one by one. Maybe it is better
> than taking and dropping zone->lock repeatedly. Have you benchmarked
> alloc_contig_range(), like hugetlb allocation?
I didn't change that aspect.
The PCPs are still the same size, and PCP pages are still skipped by
the isolation code.
IOW it's not a purely distributed buddy allocator. It's still just a
per-cpu cache of limited size. The only thing I'm doing is providing a
mechanism for splitting and pre-merging at the cache level, and
setting up affinity/routing rules to increase the chances of
success. But the impact on alloc_contig should be the same.
> > This has several benefits:
> >
> > 1. It's right away coarser/fewer allocation transactions under the
> > zone->lock.
> >
> > 1a. Even if no full free blocks are available (memory pressure or
> > small zone), splitting at the PCP level means the PCP can still
> > grab chunks larger than the requested order from the zone->lock
> > freelists, and dole them out on its own time.
> >
> > 2. The pages free back to where the allocations happen, increasing the
> > odds of reuse and reducing the chances of zone->lock slowpaths.
> >
> > 3. The page buddies come back into one place, allowing upfront merging
> > under the local pcp->lock. This makes for coarser/fewer freeing
> > transactions under the zone->lock.
>
> I wonder if we could go more radical by moving buddy allocator out of
> zone->lock completely to PCP lock. If one PCP runs out of free pages,
> it can steal another PCP's whole pageblock. I probably should do some
> literature investigation on this. Some research must have been done
> on this.
This is an interesting idea. Make the zone buddy a pure block economy
and remove all buddy code from it. Slowpath allocs and frees would
always be in whole blocks.
You'd have to come up with a natural stealing order. If one CPU needs
something it doesn't have, which CPUs, and in which order, do you
look at for stealing?
I think you'd still have to route back frees to the nominal owner of
the block, or stealing could scatter pages all over the place and we'd
never be able to merge them back up.
I think you'd also need to pull accounting (NR_FREE_PAGES) to the
per-cpu level, and teach compaction/isolation to deal with these
pages, since the bulk of free memory would then be distributed.
But the scenario where one CPU needs what another one has is an
interesting one. For now, I didn't invent anything new here, but
rather rely on how we have been handling this through the zone
freelists. But I do think it's a little silly: right now, if a CPU
needs something another CPU might have, we ask EVERY CPU in the system
to drain their cache into the shared pool - simultaneously - running
the full buddy merge algorithm on everything that comes in. The CPU
grabs a small handful of these pages, most likely having to split
again. All other CPUs are now cache cold on the next request.
Thread overview: 11+ messages
2026-04-03 19:40 [RFC 0/2] mm: page_alloc: pcp buddy allocator Johannes Weiner
2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
2026-04-04 1:43 ` Rik van Riel
2026-04-03 19:40 ` [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator Johannes Weiner
2026-04-04 1:42 ` Rik van Riel
2026-04-06 16:12 ` Johannes Weiner
2026-04-06 17:31 ` Frank van der Linden
2026-04-06 21:58 ` Johannes Weiner
2026-04-04 2:27 ` [RFC 0/2] mm: page_alloc: pcp " Zi Yan
2026-04-06 15:24 ` Johannes Weiner [this message]
2026-04-07 2:42 ` Zi Yan