From: Johannes Weiner <hannes@cmpxchg.org>
To: Frank van der Linden <fvdl@google.com>
Cc: linux-mm@kvack.org, Vlastimil Babka <vbabka@suse.cz>,
Zi Yan <ziy@nvidia.com>, David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Rik van Riel <riel@surriel.com>,
linux-kernel@vger.kernel.org
Subject: Re: [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator
Date: Mon, 6 Apr 2026 17:58:51 -0400 [thread overview]
Message-ID: <adQsm0NX_46ai6tk@cmpxchg.org> (raw)
In-Reply-To: <CAPTztWYXT0jHKfMqmUJR7Cu1vU8YPXLkkVY2dPpiEtRQEvdY5A@mail.gmail.com>
On Mon, Apr 06, 2026 at 10:31:02AM -0700, Frank van der Linden wrote:
> On Fri, Apr 3, 2026 at 12:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On large machines, zone->lock is a scaling bottleneck for page
> > allocation. Two common patterns drive contention:
> >
> > 1. Affinity violations: pages are allocated on one CPU but freed on
> > another (jemalloc, exit, reclaim). The freeing CPU's PCP drains to
> > zone buddy, and the allocating CPU refills from zone buddy -- both
> > under zone->lock, defeating PCP batching entirely.
> >
> > 2. Concurrent exits: processes tearing down large address spaces
> > simultaneously overwhelm per-CPU PCP capacity, serializing on
> > zone->lock for overflow.
> >
> > Solution
> >
> > Extend the PCP to operate on whole pageblocks with ownership tracking.
> >
> > Each CPU claims pageblocks from the zone buddy and splits them
> > locally. Pages are tagged with their owning CPU, so frees route back
> > to the owner's PCP regardless of which CPU frees. This eliminates
> > affinity violations: the owner CPU's PCP absorbs both allocations and
> > frees for its blocks without touching zone->lock.
> >
> > It also shortens zone->lock hold time during drain and refill
> > cycles. Whole blocks are acquired under zone->lock and then split
> > outside of it. Affinity routing to the owning PCP on free enables
> > buddy merging outside the zone->lock as well; a bottom-up merge pass
> > runs under pcp->lock on drain, freeing larger chunks under zone->lock.
> >
> > PCP refill uses a four-phase approach:
> >
> > Phase 0: recover owned fragments previously drained to zone buddy.
> > Phase 1: claim whole pageblocks from zone buddy.
> > Phase 2: grab sub-pageblock chunks without migratetype stealing.
> > Phase 3: traditional __rmqueue() with migratetype fallback.
> >
>
> Since the migrate type passed to rmqueue_bulk, where these changes
> are, is the PCP migratetype, this will prefer MIGRATE_MOVABLE more
> than before in the presence of MIGRATE_CMA pageblocks, right?
>
> Currently, the CMA fallback is done when > 50% of free zone memory is
> MIGRATE_CMA. For a PCP list, this isn't strictly true of course, since
> grabbing a page off the PCP list doesn't do this check, and MIGRATE_CMA
> doesn't have its own PCP list. But since rmqueue_bulk does do it, I'm
> guessing the fallback still mostly adheres to that 50%.
>
> With this change to rmqueue_bulk, it feels like it would prefer
> MIGRATE_MOVABLE more, since that is the mt passed to it (never
> MIGRATE_CMA), and the fallback is only done if the final phase is
> needed.
>
> Have you tested this with a zone that has a large amount of CMA in it
> and checked the percentages?
Good catch. Yes, I think there are problems here wrt CMA:
Phase 0 does not recover CMA blocks when movable is requested. That
looks buggy. It should restore both block types.
Phase 1 grabbing whole new blocks actually does use __rmqueue(), so it
gets the CMA fallback.
Phase 2 scans freelists based on the requested type. This looks buggy as
well. It should use the logic from the top of __rmqueue() to decide
whether to grab CMA chunks instead.
Phase 3 is the regular __rmqueue() path again, which honors the CMA fallback.
It doesn't look hard to fix, but I'll be sure to test that.
Thread overview:
2026-04-03 19:40 [RFC 0/2] mm: page_alloc: pcp buddy allocator Johannes Weiner
2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
2026-04-04 1:43 ` Rik van Riel
2026-04-20 1:40 ` Zi Yan
2026-04-03 19:40 ` [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator Johannes Weiner
2026-04-04 1:42 ` Rik van Riel
2026-04-06 16:12 ` Johannes Weiner
2026-04-06 17:31 ` Frank van der Linden
2026-04-06 21:58 ` Johannes Weiner [this message]
2026-04-08 6:30 ` kernel test robot
2026-04-10 9:48 ` Vlastimil Babka (SUSE)
2026-04-10 19:12 ` Johannes Weiner
2026-04-04 2:27 ` [RFC 0/2] mm: page_alloc: pcp " Zi Yan
2026-04-06 15:24 ` Johannes Weiner
2026-04-07 2:42 ` Zi Yan
-- strict thread matches above, loose matches on Subject: below --
2026-04-08 2:22 [RFC 2/2] mm: page_alloc: per-cpu pageblock " kernel test robot