* [RFC 0/2] mm: page_alloc: pcp buddy allocator
@ 2026-04-03 19:40 Johannes Weiner
  2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Johannes Weiner @ 2026-04-03 19:40 UTC (permalink / raw)
  To: linux-mm
  Cc: Vlastimil Babka, Zi Yan, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Rik van Riel, linux-kernel

Hi,

this is an RFC for making the page allocator scale better with higher
thread counts and larger memory quantities.

In Meta production, we're seeing increasing zone->lock contention that
was traced back to a few different paths. A prominent one is the
userspace allocator, jemalloc. Allocations happen from page faults on
all CPUs running the workload. Frees are cached for reuse, but the
caches are periodically purged back to the kernel from a handful of
purger threads. This breaks affinity between allocations and frees:
Both sides use their own PCPs - one side depletes them, the other one
overfills them. Both sides routinely hit the zone->lock slowpath.

My understanding is that tcmalloc has a similar architecture.

Another contributor to contention is process exits, where large
numbers of pages are freed at once. The current PCP can only reduce
lock time when pages are reused. Reuse is unlikely because it's an
avalanche of free pages on a CPU busy walking page tables. Every time
the PCP overflows, the drain acquires the zone->lock and frees pages
one by one, trying to merge buddies together.

The idea proposed here is this: instead of single pages, make the PCP
grab entire pageblocks and split them outside the zone->lock. That CPU
then takes ownership of the block, and all frees route back to that
PCP instead of the freeing CPU's local one.
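The ownership routing described above can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: every
name in it (toy_pageblock, toy_pcp, toy_claim_block, toy_free_page, the
TOY_* constants) is hypothetical and not taken from the patches, and
locking and the actual block splitting are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS   4		/* hypothetical toy value */
#define TOY_NR_BLOCKS 8		/* hypothetical toy value */
#define TOY_NO_OWNER  (-1)

/* Toy per-pageblock metadata: which CPU's PCP owns the block, if any. */
struct toy_pageblock {
	int owner_cpu;
};

/* Toy per-CPU page cache: only counts the pages it holds. */
struct toy_pcp {
	unsigned long count;
};

static struct toy_pageblock blocks[TOY_NR_BLOCKS];
static struct toy_pcp pcps[TOY_NR_CPUS];

static void toy_init(void)
{
	for (size_t i = 0; i < TOY_NR_BLOCKS; i++)
		blocks[i].owner_cpu = TOY_NO_OWNER;
	for (size_t i = 0; i < TOY_NR_CPUS; i++)
		pcps[i].count = 0;
}

/*
 * Refill side: a CPU claims a whole block. In the proposal the block
 * would be taken under the zone->lock and split afterwards; the toy
 * only records who owns it.
 */
static void toy_claim_block(int cpu, size_t block)
{
	assert(blocks[block].owner_cpu == TOY_NO_OWNER);
	blocks[block].owner_cpu = cpu;
}

/*
 * Free side: a page from an owned block goes to the owner's PCP, a page
 * from an unowned block goes to the freeing CPU's PCP.
 * Returns the CPU whose PCP received the page.
 */
static int toy_free_page(int freeing_cpu, size_t block)
{
	int target = blocks[block].owner_cpu;

	if (target == TOY_NO_OWNER)
		target = freeing_cpu;
	pcps[target].count++;
	return target;
}
```

In this model, a purger thread running on CPU 2 that frees a page from a
block claimed by CPU 1 grows CPU 1's cache rather than its own, which is
the affinity the cover letter is after.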

This has several benefits:

1. It immediately makes for coarser/fewer allocation transactions
   under the zone->lock.

1a. Even if no full free blocks are available (memory pressure or
    small zone), having splitting available at the PCP level means the
    PCP can still grab chunks larger than the requested order from the
    zone->lock freelists, and dole them out on its own time.

2. Pages are freed back to where the allocations happen, increasing
   the odds of reuse and reducing the chances of zone->lock slowpaths.

3. The page buddies come back into one place, allowing upfront merging
   under the local pcp->lock. This makes coarser/fewer freeing
   transactions under the zone->lock.
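The upfront merging mentioned in point 3 is ordinary buddy coalescing,
just performed on a block-local structure. The sketch below is a toy
model under assumptions, not the patch's implementation: it uses a
hypothetical 8-page block (TOY_BLOCK_ORDER) and a plain array in place of
free lists. The only part borrowed from the buddy scheme itself is that a
chunk's buddy differs from it in exactly one index bit.

```c
#include <assert.h>

#define TOY_BLOCK_ORDER 3			/* hypothetical: 8-page block */
#define TOY_BLOCK_PAGES (1UL << TOY_BLOCK_ORDER)

/* free_order[i] is the order of a free chunk starting at page i, or -1. */
struct toy_block {
	int free_order[TOY_BLOCK_PAGES];
};

static void toy_block_init(struct toy_block *b)
{
	for (unsigned long i = 0; i < TOY_BLOCK_PAGES; i++)
		b->free_order[i] = -1;
}

/* The buddy of a 2^order chunk differs only in bit 'order' of its index. */
static unsigned long toy_buddy(unsigned long idx, int order)
{
	return idx ^ (1UL << order);
}

/*
 * Free a chunk and merge it with free buddies as far as possible.
 * Returns the order of the resulting free chunk.
 */
static int toy_free(struct toy_block *b, unsigned long idx, int order)
{
	while (order < TOY_BLOCK_ORDER) {
		unsigned long buddy = toy_buddy(idx, order);

		if (b->free_order[buddy] != order)
			break;			/* buddy in use or split */
		b->free_order[buddy] = -1;	/* absorb the buddy */
		idx &= buddy;			/* merged chunk starts at the lower index */
		order++;
	}
	b->free_order[idx] = order;
	return order;
}
```

When toy_free() returns TOY_BLOCK_ORDER, the whole block has been
reassembled locally; in the scheme described above, that is the point at
which one block-sized free under the zone->lock could replace many
single-page frees.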

The big concern is fragmentation. Movable allocations tend to be a mix
of short-lived anon and long-lived file cache pages. By the time the
PCP needs to drain due to thresholds or pressure, the blocks might not
be fully re-assembled yet. To prevent gobbling up and fragmenting ever
more blocks, partial blocks are remembered on drain and their pages
queued last on the zone freelist. When a PCP refills, it first tries
to recover any such fragment blocks.

On small or pressured machines, the PCP degrades to its previous
behavior. If a whole block doesn't fit the pcp->high limit, or a whole
block isn't available, the refill grabs smaller chunks that aren't
marked for ownership. The free side will use the local PCP as before.
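The fallback rule in the paragraph above can be sketched as a small
decision function. This is a guess at the shape of the logic, not code
from the series: toy_pick_refill, struct toy_refill and the bitmask
standing in for the zone free lists are all hypothetical, and comparing
chunk sizes directly against pcp_high is an assumption about what "fits
the pcp->high limit" means.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_refill {
	unsigned int order;	/* order of the chunk to take from the zone */
	bool owned;		/* whether the chunk gets marked as PCP-owned */
};

/*
 * zone_free_mask: bit N set means the zone has a free chunk of order N
 * (a toy stand-in for the per-order free lists).
 */
static struct toy_refill toy_pick_refill(unsigned int request_order,
					 unsigned int block_order,
					 unsigned long pcp_high,
					 unsigned long zone_free_mask)
{
	struct toy_refill r = { .order = request_order, .owned = false };

	/* Preferred case: a whole block is available and fits the limit. */
	if ((zone_free_mask & (1UL << block_order)) &&
	    (1UL << block_order) <= pcp_high) {
		r.order = block_order;
		r.owned = true;
		return r;
	}

	/* Fallback: largest sub-block chunk that is available and fits. */
	for (unsigned int o = block_order; o-- > request_order; ) {
		if ((zone_free_mask & (1UL << o)) && (1UL << o) <= pcp_high) {
			r.order = o;
			break;
		}
	}
	return r;	/* never owned on the fallback path */
}
```

With block_order 9 and a pcp_high of 256 pages, for example, the function
never returns an owned block, which matches the description of small
machines degrading to the previous behavior.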

I still need to run broader benchmarks, but I've been consistently
seeing a 3-4% reduction in %sys time for simple kernel builds on my
32-way, 32G RAM test machine.

A synthetic test on the same machine that allocates on many CPUs and
frees on just a few sees a consistent 1% increase in throughput.

I would expect those numbers to increase with higher concurrency and
larger memory volumes, but verifying that is TBD.

Sending an RFC to get an early gauge on direction.

Based on 0257f64bdac7fdca30fa3cae0df8b9ecbec7733a.

 include/linux/mmzone.h     |  38 ++-
 include/linux/page-flags.h |   9 +
 mm/debug.c                 |   1 +
 mm/internal.h              |  17 +
 mm/mm_init.c               |  25 +-
 mm/page_alloc.c            | 784 +++++++++++++++++++++++++++++++------------
 mm/sparse.c                |   3 +-
 7 files changed, 622 insertions(+), 255 deletions(-)



* Re: [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator
@ 2026-04-08  2:22 kernel test robot
  0 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2026-04-08  2:22 UTC (permalink / raw)
  To: oe-kbuild; +Cc: lkp

:::::: 
:::::: Manual check reason: "low confidence bisect report"
:::::: 

BCC: lkp@intel.com
CC: oe-kbuild-all@lists.linux.dev
In-Reply-To: <20260403194526.477775-3-hannes@cmpxchg.org>
References: <20260403194526.477775-3-hannes@cmpxchg.org>
TO: Johannes Weiner <hannes@cmpxchg.org>

Hi Johannes,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on rppt-memblock/for-next]
[also build test ERROR on rppt-memblock/fixes linus/master v7.0-rc7 next-20260407]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Johannes-Weiner/mm-page_alloc-replace-pageblock_flags-bitmap-with-struct-pageblock_data/20260407-193348
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock.git for-next
patch link:    https://lore.kernel.org/r/20260403194526.477775-3-hannes%40cmpxchg.org
patch subject: [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator
:::::: branch date: 15 hours ago
:::::: commit date: 15 hours ago
config: i386-allnoconfig-bpf (https://download.01.org/0day-ci/archive/20260408/202604080453.g3eQBKxN-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260408/202604080453.g3eQBKxN-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/r/202604080453.g3eQBKxN-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from kernel/fork.c:118:
   kernel/../mm/internal.h: In function 'pfn_to_pageblock':
>> kernel/../mm/internal.h:803:37: error: invalid use of undefined type 'struct pageblock_data'
     803 |         return &zone->pageblock_data[idx];
         |                                     ^
--
   In file included from mm/filemap.c:54:
   mm/internal.h: In function 'pfn_to_pageblock':
>> mm/internal.h:803:37: error: invalid use of undefined type 'struct pageblock_data'
     803 |         return &zone->pageblock_data[idx];
         |                                     ^
--
   In file included from mm/mm_init.c:35:
   mm/internal.h: In function 'pfn_to_pageblock':
>> mm/internal.h:803:37: error: invalid use of undefined type 'struct pageblock_data'
     803 |         return &zone->pageblock_data[idx];
         |                                     ^
   mm/mm_init.c: In function 'usemap_size':
>> mm/mm_init.c:1456:39: error: invalid application of 'sizeof' to incomplete type 'struct pageblock_data'
    1456 |         return nr_pageblocks * sizeof(struct pageblock_data);
         |                                       ^~~~~~
>> mm/mm_init.c:1457:1: warning: control reaches end of non-void function [-Wreturn-type]
    1457 | }
         | ^
--
   In file included from mm/page_alloc.c:58:
   mm/internal.h: In function 'pfn_to_pageblock':
>> mm/internal.h:803:37: error: invalid use of undefined type 'struct pageblock_data'
     803 |         return &zone->pageblock_data[idx];
         |                                     ^
   mm/page_alloc.c: In function 'get_pfnblock_flags_word':
>> mm/page_alloc.c:363:44: error: invalid use of undefined type 'struct pageblock_data'
     363 |         return &pfn_to_pageblock(page, pfn)->flags;
         |                                            ^~
>> mm/page_alloc.c:364:1: warning: control reaches end of non-void function [-Wreturn-type]
     364 | }
         | ^
--
   In file included from fs/exec.c:82:
   fs/../mm/internal.h: In function 'pfn_to_pageblock':
>> fs/../mm/internal.h:803:37: error: invalid use of undefined type 'struct pageblock_data'
     803 |         return &zone->pageblock_data[idx];
         |                                     ^
--
   In file included from lib/vsprintf.c:51:
   lib/../mm/internal.h: In function 'pfn_to_pageblock':
>> lib/../mm/internal.h:803:37: error: invalid use of undefined type 'struct pageblock_data'
     803 |         return &zone->pageblock_data[idx];
         |                                     ^


vim +803 kernel/../mm/internal.h

8170ac4700d26f Zi Yan          2022-04-28  789  
d88d3563065850 Johannes Weiner 2026-04-03  790  static inline struct pageblock_data *pfn_to_pageblock(const struct page *page,
d88d3563065850 Johannes Weiner 2026-04-03  791  						      unsigned long pfn)
d88d3563065850 Johannes Weiner 2026-04-03  792  {
d88d3563065850 Johannes Weiner 2026-04-03  793  #ifdef CONFIG_SPARSEMEM
d88d3563065850 Johannes Weiner 2026-04-03  794  	struct mem_section *ms = __pfn_to_section(pfn);
d88d3563065850 Johannes Weiner 2026-04-03  795  	unsigned long idx = (pfn & (PAGES_PER_SECTION - 1)) >> pageblock_order;
d88d3563065850 Johannes Weiner 2026-04-03  796  
d88d3563065850 Johannes Weiner 2026-04-03  797  	return &section_to_usemap(ms)[idx];
d88d3563065850 Johannes Weiner 2026-04-03  798  #else
d88d3563065850 Johannes Weiner 2026-04-03  799  	struct zone *zone = page_zone(page);
d88d3563065850 Johannes Weiner 2026-04-03  800  	unsigned long idx;
d88d3563065850 Johannes Weiner 2026-04-03  801  
d88d3563065850 Johannes Weiner 2026-04-03  802  	idx = (pfn - pageblock_start_pfn(zone->zone_start_pfn)) >> pageblock_order;
d88d3563065850 Johannes Weiner 2026-04-03 @803  	return &zone->pageblock_data[idx];
d88d3563065850 Johannes Weiner 2026-04-03  804  #endif
d88d3563065850 Johannes Weiner 2026-04-03  805  }
d88d3563065850 Johannes Weiner 2026-04-03  806  
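The errors above are all of one class: code indexes into, or takes the
sizeof, a struct that the compiler has only seen as an incomplete type in
this configuration. The report does not show where struct pageblock_data
is defined, so the following userspace sketch only illustrates the
language rule and the non-SPARSEMEM index calculation. All toy_* names
and the TOY_PAGEBLOCK_ORDER value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A forward declaration is enough to declare pointers to the type ... */
struct toy_pageblock_data;

struct toy_zone {
	struct toy_pageblock_data *pageblock_data;	/* fine while incomplete */
	unsigned long zone_start_pfn;
};

/*
 * ... but indexing (&p[idx]), member access (p->flags) and sizeof all
 * need the full definition to be visible first. If this definition were
 * compiled out, the functions below would fail with the same
 * "incomplete type" diagnostics quoted in the report.
 */
struct toy_pageblock_data {
	unsigned long flags;
};

#define TOY_PAGEBLOCK_ORDER 9	/* hypothetical: 512 pages per block */

/* Mirrors the shape of the non-SPARSEMEM branch quoted above. */
static struct toy_pageblock_data *toy_pfn_to_pageblock(struct toy_zone *zone,
							unsigned long pfn)
{
	/* Round the zone start down to a pageblock boundary. */
	unsigned long start = zone->zone_start_pfn &
			      ~((1UL << TOY_PAGEBLOCK_ORDER) - 1);
	unsigned long idx = (pfn - start) >> TOY_PAGEBLOCK_ORDER;

	return &zone->pageblock_data[idx];
}

static size_t toy_usemap_size(unsigned long nr_pageblocks)
{
	return nr_pageblocks * sizeof(struct toy_pageblock_data);
}
```

A likely direction for a fix, assuming the definition is currently
guarded by a config option, is to make it visible in every configuration
that compiles these helpers.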

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


end of thread, other threads:[~2026-04-20  1:40 UTC | newest]

Thread overview: 16+ messages
2026-04-03 19:40 [RFC 0/2] mm: page_alloc: pcp buddy allocator Johannes Weiner
2026-04-03 19:40 ` [RFC 1/2] mm: page_alloc: replace pageblock_flags bitmap with struct pageblock_data Johannes Weiner
2026-04-04  1:43   ` Rik van Riel
2026-04-20  1:40   ` Zi Yan
2026-04-03 19:40 ` [RFC 2/2] mm: page_alloc: per-cpu pageblock buddy allocator Johannes Weiner
2026-04-04  1:42   ` Rik van Riel
2026-04-06 16:12     ` Johannes Weiner
2026-04-06 17:31   ` Frank van der Linden
2026-04-06 21:58     ` Johannes Weiner
2026-04-08  6:30   ` kernel test robot
2026-04-10  9:48   ` Vlastimil Babka (SUSE)
2026-04-10 19:12     ` Johannes Weiner
2026-04-04  2:27 ` [RFC 0/2] mm: page_alloc: pcp " Zi Yan
2026-04-06 15:24   ` Johannes Weiner
2026-04-07  2:42     ` Zi Yan
  -- strict thread matches above, loose matches on Subject: below --
2026-04-08  2:22 [RFC 2/2] mm: page_alloc: per-cpu pageblock " kernel test robot
