Linux-mm Archive on lore.kernel.org
* [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting
@ 2026-06-26 10:20 Alexandre Ghiti
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

This is version 2 of per-memcg-per-node kmem accounting.

As asked by Joshua, I ran some microbenchmarks to check the impact of
this fine-grained accounting.

TL;DR: The impact is substantial (up to +337%) on a benchmark that loops over
small percpu allocations. On the other hand, the cost is not visible in a
userspace program that creates a bpf percpu map.

I followed Joshua's advice and this version now batches the memcg accounting,
which reduces the overhead from +417% (v1) to +337% on a 176-core single-node
machine and from +206% (v1) to +153% on an 80-core 2-node machine.

The overhead of this version scales linearly with the number of CPUs (the
number of nodes being small) and comes mainly from vmalloc_to_page(). I have
another variant (b) that reduces the impact even further (+131% vs +337% on
176 cores and +86% vs +153% on 80 cores), but I'm not sure the added
complexity is warranted, so I did not send it. Let me know what you think.

Performance
===========

All benchmarks run in a memcg with __GFP_ACCOUNT.

1) BPF percpu map create/destroy, full series vs baseline kernel (two
   boots, 176-CPU AMD EPYC, 1 NUMA node): the per-node accounting cost is
   lost in the BPF syscall overhead and the delta is within noise (us/op):

     size (B):    64     256    1024   4096   8192
     delta:     -5.5%  -5.1%  -1.8%  -5.1%  -4.1%

2) In-kernel microbench that isolates the accounting cost: a tight
   __alloc_percpu_gfp()/free_percpu() loop, __GFP_ACCOUNT on vs off on the
   same boot (ACCT COST = on - off). The dominant cost on a many-CPU box
   is discovering each backing page's real node (vmalloc_to_page() per
   possible CPU). ACCT COST by value size:

   176-CPU EPYC, 1 node
     size (B):              64       256      1024     4096     8192
      baseline (upstream)  +5.3%    +5.4%    +0.1%    -1.8%    -0.5%
      v1 credit (per-page) +417.3%  +182.5%  +68.5%   +21.4%   +16.1%
   a) per-node accounting  +337.8%  +141.8%  +36.1%   +11.9%   +6.8%
   b) per-page nid cache   +131.3%  +53.7%   +10.5%   +0.9%    +2.0%
   c) single-node fast     +12.6%   +12.1%   +3.5%    +6.6%    +0.7%

   80-CPU Xeon Gold 6138, 2 nodes (fast path inactive)
     size (B):              64       256      1024     4096     8192
      baseline (upstream)  +1.2%    -3.8%    +12.4%   +1.2%    +0.5%   (noise)
      v1 credit (per-page) +206.1%  +134.0%  +44.5%   +11.6%   +11.5%
   a) per-node accounting  +153.2%  +64.7%   +19.4%   +4.2%    +5.9%
   b) per-page nid cache   +86.5%   +45.5%   +14.7%   +1.8%    +1.6%

   (a) is this patchset without the single-node fast path
   (b) is an alternative version, not in this series, that caches each backing
       page's node in the chunk so the walk is paid once per page instead of
       once per allocation
   (c) is this patchset with the single-node fast path

Changes in v2
=============

- objcg lifetime: Shakeel's patch 1 now guarantees the lifetime of every
  per-node objcg
- dropped patches 5 and 6 since Shakeel's patch 2 replaces them
- fixed the number of precharged pages (the v1 formula under-precharged)
- per-node batching (Joshua's suggestion): accumulate the per-node bytes
  first, then issue one account_kmem()/uncharge() per touched node =>
  O(nodes) memcg ops instead of O(num_possible_cpus)
- single-node fast path: skip the per-cpu node walk on single node machines
- obj_exts metadata is now accounted per-node (walk its vmalloc pages)
  rather than charged whole to one memcg (Shakeel's main v1 objection).
- renamed obj_cgroup_get_nid() -> obj_cgroup_nid() (returns a borrowed RCU
  pointer, no ref taken).
- zswap: fixed the missing locking around the per-node objcg lookup (now
  done under RCU + obj_cgroup_tryget()).

This series continues the work initiated by Joshua [1]. Kernel memory
needs to be accounted on a per-node basis so that we can know the
memcg <-> physical memory association.

This series takes advantage of the recently introduced per-node
obj_cgroup [2] and ties each obj_cgroup to its NUMA node.

The bulk of the series is percpu per-node accounting: percpu
"precharges" the memcg before we know the actual location of the pages
it uses, so charging and accounting had to be split. All other kmem
users (slab, __memcg_kmem_charge_page) are now handled directly by
Shakeel's per-node obj_cgroup infrastructure this series sits on, so
only percpu and zswap need explicit per-node work here (zswap support
is limited because Joshua is working on it in parallel [3]).

Thanks Joshua and Shakeel for the early feedback!

[1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com/
[3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/

Functional Testing
==================

- Tested with a percpu kmem self-test in an 8-node VM (2 nodes with CPUs,
  6 memory-only). For each allocation it checks that every node is charged
  and later uncharged the same number of bytes -- including a CPU-less node
  that ends up holding the obj_exts metadata -- and that nothing is left
  charged after teardown. All checks pass. (The self-test module is not
  part of this series.)

Alexandre Ghiti (7):
  mm: percpu: fix obj_exts metadata charge size
  mm: percpu: Split memcg charging and kmem accounting
  mm: memcontrol: track MEMCG_KMEM per NUMA node
  mm: percpu: per-node kmem accounting
  mm: percpu: per-node kmem accounting for obj_exts metadata
  mm: percpu: skip the per-cpu node walk on single-node systems
  mm: zswap: per-node kmem accounting for zswap/zsmalloc

Shakeel Butt (2):
  memcg: convert task->objcg to a per-node objcgs array
  memcg: charge kmem pages and slab objects against per-node objcg

 include/linux/memcontrol.h |  23 ++-
 include/linux/mmzone.h     |   1 +
 include/linux/sched.h      |   7 +-
 include/linux/zsmalloc.h   |   2 +
 mm/memcontrol.c            | 286 ++++++++++++++++++++++++++-----------
 mm/percpu-internal.h       |   2 +-
 mm/percpu.c                | 108 +++++++++++++-
 mm/vmstat.c                |   1 +
 mm/zsmalloc.c              |  11 ++
 mm/zswap.c                 |  19 ++-
 10 files changed, 361 insertions(+), 99 deletions(-)

-- 
2.54.0




Thread overview: 12+ messages
2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 1/9] memcg: convert task->objcg to a per-node objcgs array Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 2/9] memcg: charge kmem pages and slab objects against per-node objcg Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 3/9] mm: percpu: fix obj_exts metadata charge size Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 4/9] mm: percpu: Split memcg charging and kmem accounting Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 5/9] mm: memcontrol: track MEMCG_KMEM per NUMA node Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 6/9] mm: percpu: per-node kmem accounting Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 7/9] mm: percpu: per-node kmem accounting for obj_exts metadata Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 8/9] mm: percpu: skip the per-cpu node walk on single-node systems Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc Alexandre Ghiti
2026-06-26 14:32   ` Usama Arif
2026-06-26 18:36     ` Nhat Pham
