[RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting

From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

This is version 2 of per-memcg-per-node kmem accounting.

As asked by Joshua, I ran some microbenchmarks to check the impact of
this fine-grained accounting.

TL;DR: The impact is substantial (up to +337%) on a benchmark that loops
over small percpu allocations. On the other hand, on a userspace program
that creates a bpf percpu map, this cost is not visible.

I followed Joshua's advice and this version now batches the memcg accounting.
This reduces the overhead to +337% (vs +417% in v1) on a 176-core single-node
machine and to +153% (vs +206% in v1) on an 80-core 2-node machine.

The overhead of this version scales linearly with the number of CPUs (the
number of nodes being small). It comes mainly from vmalloc_to_page(), so I
have another variant (b) that reduces the impact even further (+131% vs +337%
on 176 cores and +86% vs +153% on 80 cores). I'm not sure the added
complexity is needed, so I did not send that variant. Let me know what you
think.

Performance
===========

All benchmarks run in a memcg with __GFP_ACCOUNT.

1) BPF percpu map create/destroy, full series vs baseline kernel (two
   boots, 176-CPU AMD EPYC, 1 NUMA node): the per-node accounting is lost
   in the BPF syscall overhead and the delta (measured in us/op) is within
   noise:

     size (B):    64     256    1024   4096   8192
     delta:     -5.5%  -5.1%  -1.8%  -5.1%  -4.1%

2) In-kernel microbench that isolates the accounting cost: a tight
   __alloc_percpu_gfp()/free_percpu() loop, __GFP_ACCOUNT on vs off on the
   same boot (ACCT COST = on - off). The dominant cost on a many-CPU box
   is discovering each backing page's real node (vmalloc_to_page() per
   possible CPU). ACCT COST by value size:

   176-CPU EPYC, 1 node
     size (B):              64       256      1024     4096     8192
      baseline (upstream)  +5.3%    +5.4%    +0.1%    -1.8%    -0.5%
      v1 credit (per-page) +417.3%  +182.5%  +68.5%   +21.4%   +16.1%
   a) per-node accounting  +337.8%  +141.8%  +36.1%   +11.9%   +6.8%
   b) per-page nid cache   +131.3%  +53.7%   +10.5%   +0.9%    +2.0%
   c) single-node fast     +12.6%   +12.1%   +3.5%    +6.6%    +0.7%

   80-CPU Xeon Gold 6138, 2 nodes (fast path inactive)
     size (B):              64       256      1024     4096     8192
      baseline (upstream)  +1.2%    -3.8%    +12.4%   +1.2%    +0.5%   (noise)
      v1 credit (per-page) +206.1%  +134.0%  +44.5%   +11.6%   +11.5%
   a) per-node accounting  +153.2%  +64.7%   +19.4%   +4.2%    +5.9%
   b) per-page nid cache   +86.5%   +45.5%   +14.7%   +1.8%    +1.6%

   (a) is this patchset without the single-node fast path
   (b) is an alternative version, not in this series, that caches each backing
       page's node in the chunk so the walk is paid once per page instead of
       once per allocation
   (c) is this patchset with the single-node fast path

Changes in v2
=============

- objcg lifetime: Shakeel's patch 1 now guarantees the lifetime of every
  per-node objcg
- dropped patches 5 and 6 since Shakeel's patch 2 replaces them
- fixed the number of precharged pages (the v1 formula under-precharged)
- per-node batching (Joshua's suggestion): accumulate the per-node bytes
  first, then issue one account_kmem()/uncharge() per touched node =>
  O(nodes) memcg ops instead of O(num_possible_cpus)
- single-node fast path: skip the per-cpu node walk on single node machines
- obj_exts metadata is now accounted per-node (walk its vmalloc pages)
  rather than charged whole to one memcg (Shakeel's main v1 objection)
- renamed obj_cgroup_get_nid() -> obj_cgroup_nid() (returns a borrowed RCU
  pointer, no ref taken)
- zswap: fixed the missing locking around the per-node objcg lookup (now
  done under RCU + obj_cgroup_tryget())
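
To illustrate the per-node batching item above, here is a minimal userspace
sketch. It is not the code in this series. MAX_NODES, struct node_batch,
batch_accumulate(), batch_flush() and the cpu_to_nid[] table are all
hypothetical names; the table stands in for the lookup of each backing page's
node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 4	/* hypothetical small node count */

/* Bytes accumulated per node for a single percpu allocation. */
struct node_batch {
	size_t bytes[MAX_NODES];
};

/*
 * Walk every CPU once and add its share to the node that backs it.
 * cpu_to_nid[] is a hypothetical stand-in for the real node lookup.
 */
static void batch_accumulate(struct node_batch *b, const int *cpu_to_nid,
			     int nr_cpus, size_t bytes_per_cpu)
{
	memset(b, 0, sizeof(*b));
	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		int nid = cpu_to_nid[cpu];

		assert(nid >= 0 && nid < MAX_NODES);
		b->bytes[nid] += bytes_per_cpu;
	}
}

/*
 * Issue one accounting call per touched node and return how many calls
 * were made. The count is bounded by the number of nodes, not CPUs.
 */
static int batch_flush(const struct node_batch *b,
		       void (*account)(int nid, size_t bytes))
{
	int ops = 0;

	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!b->bytes[nid])
			continue;
		if (account)
			account(nid, b->bytes[nid]);
		ops++;
	}
	return ops;
}
```

With 6 CPUs split evenly over 2 nodes, the accumulate step still visits 6
CPUs, but the flush step makes only 2 accounting calls.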

This series continues the work initiated by Joshua [1]. We need kernel
memory to be accounted on a per-node basis to know the memcg <-> physical
memory association.

This series takes advantage of the recently introduced per-node
obj_cgroup and ties each obj_cgroup to its NUMA node.

The bulk of the series is percpu per-node accounting: percpu
"precharges" the memcg before we know the actual location of the pages
it uses, so charging and accounting had to be split. All other kmem
users (slab, __memcg_kmem_charge_page) are now handled directly by
Shakeel's per-node obj_cgroup infrastructure this series sits on, so
only percpu and zswap need explicit per-node work here (zswap support
is limited because Joshua is working on it in parallel [3]).

Thanks Joshua and Shakeel for the early feedback!

[1] https://lore.kernel.org/linux-mm/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/linux-mm/56c04b1c5d54f75ccdc12896df6c1ca35403ecc3.1772711148.git.zhengqi.arch@bytedance.com/
[3] https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/

Functional Testing
==================

- Tested with a percpu kmem self-test in an 8-node VM (2 nodes with CPUs,
  6 memory-only). For each allocation it checks that every node is charged
  and later uncharged the same number of bytes -- including a CPU-less node
  that ends up holding the obj_exts metadata -- and that nothing is left
  charged after teardown. All checks pass. (The self-test module is not
  part of this series.)

Alexandre Ghiti (7):
  mm: percpu: fix obj_exts metadata charge size
  mm: percpu: Split memcg charging and kmem accounting
  mm: memcontrol: track MEMCG_KMEM per NUMA node
  mm: percpu: per-node kmem accounting
  mm: percpu: per-node kmem accounting for obj_exts metadata
  mm: percpu: skip the per-cpu node walk on single-node systems
  mm: zswap: per-node kmem accounting for zswap/zsmalloc

Shakeel Butt (2):
  memcg: convert task->objcg to a per-node objcgs array
  memcg: charge kmem pages and slab objects against per-node objcg

 include/linux/memcontrol.h |  23 ++-
 include/linux/mmzone.h     |   1 +
 include/linux/sched.h      |   7 +-
 include/linux/zsmalloc.h   |   2 +
 mm/memcontrol.c            | 286 ++++++++++++++++++++++++++-----------
 mm/percpu-internal.h       |   2 +-
 mm/percpu.c                | 108 +++++++++++++-
 mm/vmstat.c                |   1 +
 mm/zsmalloc.c              |  11 ++
 mm/zswap.c                 |  19 ++-
 10 files changed, 361 insertions(+), 99 deletions(-)

-- 
2.54.0




* [PATCH v2 1/9] memcg: convert task->objcg to a per-node objcgs array
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)

From: Shakeel Butt <shakeel.butt@linux.dev>

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node, but task_struct still cached only one objcg. On every cross-node
allocation current_obj_cgroup() returned an objcg whose nid did not
match the current CPU's node, so the stock's per-node vmstat batching
and (separately) the per-node accounting hierarchy were defeated for
multi-node workloads.

Replace task->objcg with task->objcgs: a tagged pointer to an
nr_node_ids-sized array of per-node obj_cgroup pointers. Bit 0 keeps
its meaning as CURRENT_OBJCG_UPDATE_FLAG so mem_cgroup_kmem_attach()
can still atomically mark the cache stale from another task's context
with a single set_bit().
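
As a rough userspace illustration of the tagged-pointer idea described above
(not the kernel code), the sketch below stores a "stale" flag in bit 0 of an
array pointer. UPDATE_FLAG stands in for CURRENT_OBJCG_UPDATE_FLAG, and
tag_stale(), is_stale() and untag() are hypothetical helper names. It assumes
the array is at least 2-byte aligned, which holds for calloc() results.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define UPDATE_FLAG ((uintptr_t)1)	/* stands in for CURRENT_OBJCG_UPDATE_FLAG */

/* Mark an aligned array pointer as stale by setting bit 0. */
static void **tag_stale(void **array)
{
	/* Bit 0 must be free, i.e. the pointer is at least 2-byte aligned. */
	assert(((uintptr_t)array & UPDATE_FLAG) == 0);
	return (void **)((uintptr_t)array | UPDATE_FLAG);
}

/* Report whether the stale bit is set on a possibly tagged pointer. */
static int is_stale(void **tagged)
{
	return ((uintptr_t)tagged & UPDATE_FLAG) != 0;
}

/* Strip the flag to recover the real array address. */
static void **untag(void **tagged)
{
	return (void **)((uintptr_t)tagged & ~UPDATE_FLAG);
}
```

The pointer must always go through untag() before it is dereferenced, since
the tagged value is not a valid address.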

current_obj_cgroup() now indexes the array by numa_node_id() and falls
back to root_mem_cgroup on a NULL array (kthread or fork-time alloc
failure) or NULL entry (transient drain window).

current_objcg_update() refreshes every entry under one rcu_read_lock,
xchg'ing fresh per-node objcgs in and dropping the stale references.
The outer cmpxchg loop on the tagged array pointer preserves the
existing race-with-kmem_attach semantics: if the update bit is re-set
mid-refresh, the whole refresh is retried.

The array is eagerly allocated in mem_cgroup_fork() for non-kthread
tasks. This keeps current_objcg_update() off the allocation path, which
matters because it runs from kmem allocation contexts that may be
atomic. Kthreads and tasks whose fork-time kcalloc() fails simply leave
task->objcgs as NULL and route kmem allocations to root_mem_cgroup, as
before. The array is freed in mem_cgroup_exit() after dropping the
per-node references.

__get_obj_cgroup_from_memcg() takes nid as an explicit parameter so it
can be reused for both folio charging (numa_node_id()) and the per-node
refresh loop.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 include/linux/sched.h |   7 +-
 mm/memcontrol.c       | 148 +++++++++++++++++++++++++-----------------
 2 files changed, 95 insertions(+), 60 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee06cba5c6f5..d7ea9fe38d01 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1538,8 +1538,11 @@ struct task_struct {
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup		*active_memcg;
 
-	/* Cache for current->cgroups->memcg->nodeinfo[nid]->objcg lookups: */
-	struct obj_cgroup		*objcg;
+	/*
+	 * Per-node cache for current->cgroups->memcg->nodeinfo[nid]->objcg
+	 * lookups. Tagged pointer: bit 0 = CURRENT_OBJCG_UPDATE_FLAG.
+	 */
+	struct obj_cgroup		**objcgs;
 #endif
 
 #ifdef CONFIG_BLK_CGROUP
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..ee47427de9e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2907,10 +2907,9 @@ struct mem_cgroup *mem_cgroup_from_virt(void *p)
 	return folio_memcg_check(virt_to_folio(p));
 }
 
-static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg,
+						      int nid)
 {
-	int nid = numa_node_id();
-
 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
 		struct obj_cgroup *objcg = rcu_dereference(memcg->nodeinfo[nid]->objcg);
 
@@ -2926,67 +2925,73 @@ static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *me
 	struct obj_cgroup *objcg;
 
 	rcu_read_lock();
-	objcg = __get_obj_cgroup_from_memcg(memcg);
+	objcg = __get_obj_cgroup_from_memcg(memcg, numa_node_id());
 	rcu_read_unlock();
 
 	return objcg;
 }
 
-static struct obj_cgroup *current_objcg_update(void)
+static struct obj_cgroup **current_objcg_update(void)
 {
 	struct mem_cgroup *memcg;
-	struct obj_cgroup *old, *objcg = NULL;
+	struct obj_cgroup **objcgs;
+	unsigned long old_tagged;
+	int nid;
 
 	do {
-		/* Atomically drop the update bit. */
-		old = xchg(&current->objcg, NULL);
-		if (old) {
-			old = (struct obj_cgroup *)
-				((unsigned long)old & ~CURRENT_OBJCG_UPDATE_FLAG);
-			obj_cgroup_put(old);
-
-			old = NULL;
-		}
-
-		/* If new objcg is NULL, no reason for the second atomic update. */
-		if (!current->mm || (current->flags & PF_KTHREAD))
-			return NULL;
+		old_tagged = (unsigned long)READ_ONCE(current->objcgs);
+		objcgs = (struct obj_cgroup **)
+			(old_tagged & ~CURRENT_OBJCG_UPDATE_FLAG);
 
 		/*
-		 * Release the objcg pointer from the previous iteration,
-		 * if try_cmpxcg() below fails.
+		 * If there is no per-node cache (kthread or fork-time
+		 * allocation failure), there is nothing to refresh. The
+		 * cmpxchg below still clears the update bit so we do not
+		 * keep re-entering this slow path.
 		 */
-		if (unlikely(objcg)) {
-			obj_cgroup_put(objcg);
-			objcg = NULL;
+		if (objcgs) {
+			if (!current->mm || (current->flags & PF_KTHREAD)) {
+				/*
+				 * The task lost its mm: drop the cached
+				 * per-node references; future allocations will
+				 * fall back to root_mem_cgroup.
+				 */
+				for_each_node(nid)
+					obj_cgroup_put(xchg(&objcgs[nid], NULL));
+			} else {
+				/*
+				 * Re-read the memcg under rcu since the task
+				 * may have been asynchronously moved and the
+				 * previous memcg can be offlined.
+				 */
+				rcu_read_lock();
+				memcg = mem_cgroup_from_task(current);
+				for_each_node(nid) {
+					struct obj_cgroup *fresh, *stale;
+
+					fresh = __get_obj_cgroup_from_memcg(memcg, nid);
+					stale = xchg(&objcgs[nid], fresh);
+					obj_cgroup_put(stale);
+				}
+				rcu_read_unlock();
+			}
 		}
 
 		/*
-		 * Obtain the new objcg pointer. The current task can be
-		 * asynchronously moved to another memcg and the previous
-		 * memcg can be offlined. So let's get the memcg pointer
-		 * and try get a reference to objcg under a rcu read lock.
-		 */
-
-		rcu_read_lock();
-		memcg = mem_cgroup_from_task(current);
-		objcg = __get_obj_cgroup_from_memcg(memcg);
-		rcu_read_unlock();
-
-		/*
-		 * Try set up a new objcg pointer atomically. If it
-		 * fails, it means the update flag was set concurrently, so
-		 * the whole procedure should be repeated.
+		 * Publish the cleared-flag pointer. If kmem_attach raced and
+		 * re-set the update bit, retry the whole refresh.
 		 */
-	} while (!try_cmpxchg(&current->objcg, &old, objcg));
+	} while (!try_cmpxchg((unsigned long *)&current->objcgs,
+			      &old_tagged, (unsigned long)objcgs));
 
-	return objcg;
+	return objcgs;
 }
 
 __always_inline struct obj_cgroup *current_obj_cgroup(void)
 {
 	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
+	struct obj_cgroup **objcgs;
 	int nid = numa_node_id();
 
 	if (IS_ENABLED(CONFIG_MEMCG_NMI_UNSAFE) && in_nmi())
@@ -2997,14 +3002,16 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
 		if (unlikely(memcg))
 			goto from_memcg;
 
-		objcg = READ_ONCE(current->objcg);
-		if (unlikely((unsigned long)objcg & CURRENT_OBJCG_UPDATE_FLAG))
-			objcg = current_objcg_update();
+		objcgs = READ_ONCE(current->objcgs);
+		if (unlikely((unsigned long)objcgs & CURRENT_OBJCG_UPDATE_FLAG))
+			objcgs = current_objcg_update();
 		/*
-		 * Objcg reference is kept by the task, so it's safe
-		 * to use the objcg by the current task.
+		 * Per-node objcg references are kept by the task, so it's
+		 * safe to use them by the current task.
 		 */
-		return objcg ? : rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
+		if (objcgs && (objcg = objcgs[nid]))
+			return objcg;
+		return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
 	}
 
 	memcg = this_cpu_read(int_active_memcg);
@@ -4544,22 +4551,47 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 
 static void mem_cgroup_fork(struct task_struct *task)
 {
+	struct obj_cgroup **objcgs;
+
 	/*
-	 * Set the update flag to cause task->objcg to be initialized lazily
-	 * on the first allocation. It can be done without any synchronization
-	 * because it's always performed on the current task, so does
-	 * current_objcg_update().
+	 * Kthreads do not need a per-node cache; their kmem allocations fall
+	 * back to root_mem_cgroup via current_obj_cgroup().
 	 */
-	task->objcg = (struct obj_cgroup *)CURRENT_OBJCG_UPDATE_FLAG;
+	if (task->flags & PF_KTHREAD) {
+		task->objcgs = NULL;
+		return;
+	}
+
+	/*
+	 * Eagerly allocate the per-node cache so that current_objcg_update()
+	 * never has to allocate from potentially-atomic kmem allocation
+	 * paths. On allocation failure this task will use root_mem_cgroup
+	 * for kmem accounting.
+	 *
+	 * Tag with the update flag so the first kmem allocation populates
+	 * the entries via current_objcg_update().
+	 */
+	objcgs = kcalloc(nr_node_ids, sizeof(*objcgs), GFP_KERNEL);
+	if (objcgs)
+		task->objcgs = (struct obj_cgroup **)
+			((unsigned long)objcgs | CURRENT_OBJCG_UPDATE_FLAG);
+	else
+		task->objcgs = NULL;
 }
 
 static void mem_cgroup_exit(struct task_struct *task)
 {
-	struct obj_cgroup *objcg = task->objcg;
+	struct obj_cgroup **objcgs;
+	int nid;
 
-	objcg = (struct obj_cgroup *)
-		((unsigned long)objcg & ~CURRENT_OBJCG_UPDATE_FLAG);
-	obj_cgroup_put(objcg);
+	objcgs = (struct obj_cgroup **)
+		((unsigned long)task->objcgs & ~CURRENT_OBJCG_UPDATE_FLAG);
+
+	if (objcgs) {
+		for_each_node(nid)
+			obj_cgroup_put(objcgs[nid]);
+		kfree(objcgs);
+	}
 
 	/*
 	 * Some kernel allocations can happen after this point,
@@ -4567,7 +4599,7 @@ static void mem_cgroup_exit(struct task_struct *task)
 	 * because it's always performed on the current task, so does
 	 * current_objcg_update().
 	 */
-	task->objcg = NULL;
+	task->objcgs = NULL;
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -4599,7 +4631,7 @@ static void mem_cgroup_kmem_attach(struct cgroup_taskset *tset)
 
 	cgroup_taskset_for_each(task, css, tset) {
 		/* atomically set the update bit */
-		set_bit(CURRENT_OBJCG_UPDATE_BIT, (unsigned long *)&task->objcg);
+		set_bit(CURRENT_OBJCG_UPDATE_BIT, (unsigned long *)&task->objcgs);
 	}
 }
 
-- 
2.54.0




* [PATCH v2 2/9] memcg: charge kmem pages and slab objects against per-node objcg
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)

From: Shakeel Butt <shakeel.butt@linux.dev>

After 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") current_obj_cgroup() returns the per-node objcg of the
task's memcg for numa_node_id(). Two callers in the kmem accounting
path always know the actual target node — the page being charged in
__memcg_kmem_charge_page() and each slab being charged in
__memcg_slab_post_alloc_hook() — but were using current_obj_cgroup()
and so charged against an objcg whose nid did not match the
allocation's physical node. The per-objcg vmstat batching (keyed by
objcg->nid) and per-node charge attribution were both routed to the
wrong sibling of the same memcg whenever the allocating CPU's node
differed from the allocation's node.

Factor the per-node objcg lookup into __current_obj_cgroup(int nid)
and keep current_obj_cgroup() as a one-line wrapper that passes
numa_node_id(), preserving all other callers. Use the new helper in:

  - __memcg_kmem_charge_page(): pass page_to_nid(page).
  - __memcg_slab_post_alloc_hook(): re-fetch inside the loop using
    slab_nid(slab) so each slab in a bulk allocation is charged
    against its own node's objcg. The early per-task root/NULL check
    above the loop remains (all per-node objcgs of a memcg share the
    same root-ness, so it is still a valid fast path); the in-loop
    check guards the transient drain window where one node's entry
    may be NULL.
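
The per-slab re-fetch in the list above can be pictured with the userspace
sketch below. It is an illustration only: struct sk_objcg, NR_NODES,
charge_bulk() and the obj_nid[] table are hypothetical, with obj_nid[]
standing in for slab_nid(slab) of each object's slab.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical node count */

/* Hypothetical stand-in for a per-node objcg. */
struct sk_objcg {
	int is_root;
	size_t charged;
};

/*
 * Charge each object of a bulk allocation against the objcg of the node
 * its slab lives on. A NULL or root entry is skipped, as in the in-loop
 * check described above. Returns the number of objects charged.
 */
static int charge_bulk(struct sk_objcg *per_node[NR_NODES],
		       const int *obj_nid, int nr_objs, size_t obj_size)
{
	int charged = 0;

	for (int i = 0; i < nr_objs; i++) {
		int nid = obj_nid[i];
		struct sk_objcg *objcg;

		assert(nid >= 0 && nid < NR_NODES);
		objcg = per_node[nid];	/* re-fetched for every object */
		if (!objcg || objcg->is_root)
			continue;
		objcg->charged += obj_size;
		charged++;
	}
	return charged;
}
```

Objects whose slabs sit on different nodes end up charged to different
entries of the same memcg, which is the behaviour the commit message
describes.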

Update the stale slab_pgdat(slab) reference in the TODO comment to
slab_nid(slab); slab_pgdat is no longer relevant after the
obj_stock_pcp cached_pgdat removal.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 mm/memcontrol.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee47427de9e2..3bcc20e72914 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2987,12 +2987,11 @@ static struct obj_cgroup **current_objcg_update(void)
 	return objcgs;
 }
 
-__always_inline struct obj_cgroup *current_obj_cgroup(void)
+__always_inline static struct obj_cgroup *__current_obj_cgroup(int nid)
 {
 	struct mem_cgroup *memcg;
 	struct obj_cgroup *objcg;
 	struct obj_cgroup **objcgs;
-	int nid = numa_node_id();
 
 	if (IS_ENABLED(CONFIG_MEMCG_NMI_UNSAFE) && in_nmi())
 		return NULL;
@@ -3036,6 +3035,11 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
 	return rcu_dereference_check(root_mem_cgroup->nodeinfo[nid]->objcg, 1);
 }
 
+__always_inline struct obj_cgroup *current_obj_cgroup(void)
+{
+	return __current_obj_cgroup(numa_node_id());
+}
+
 struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 {
 	struct obj_cgroup *objcg;
@@ -3143,7 +3147,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
 	struct obj_cgroup *objcg;
 	int ret = 0;
 
-	objcg = current_obj_cgroup();
+	objcg = __current_obj_cgroup(page_to_nid(page));
 	if (objcg && !obj_cgroup_is_root(objcg)) {
 		ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
 		if (!ret) {
@@ -3536,7 +3540,9 @@ static inline size_t obj_full_size(struct kmem_cache *s)
 {
 	/*
 	 * For each accounted object there is an extra space which is used
-	 * to store obj_cgroup membership. Charge it too.
+	 * to store obj_cgroup membership. Charge it too. In addition, we
+	 * allocate obj_exts array on the same node as slab_nid(), so per-node
+	 * kmem accounting is fine.
 	 */
 	return s->size + sizeof(struct obj_cgroup *);
 }
@@ -3594,6 +3600,16 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 			continue;
 		}
 
+		/*
+		 * Charge against the per-node objcg matching the slab's node
+		 * so the stock's per-objcg vmstat batch (keyed by objcg->nid)
+		 * aligns with the physical slab. May transiently fall back to
+		 * root if the per-node entry is being drained.
+		 */
+		objcg = __current_obj_cgroup(slab_nid(slab));
+		if (!objcg || obj_cgroup_is_root(objcg))
+			continue;
+
 		/*
 		 * if we fail and size is 1, memcg_alloc_abort_single() will
 		 * just free the object, which is ok as we have not assigned
@@ -3602,7 +3618,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 		 * for larger sizes, kmem_cache_free_bulk() will uncharge
 		 * any objects that were already charged and obj_ext assigned
 		 *
-		 * TODO: we could batch this until slab_pgdat(slab) changes
+		 * TODO: we could batch this until slab_nid(slab) changes
 		 * between iterations, with a more complicated undo
 		 */
 		stock = trylock_stock();
-- 
2.54.0




* [PATCH v2 3/9] mm: percpu: fix obj_exts metadata charge size
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)

pcpu_obj_full_size() uses sizeof(struct obj_cgroup *) to charge the size
of the percpu obj_exts metadata. But obj_exts is actually a vector of
struct pcpuobj_ext, whose size is 16B when CONFIG_MEM_ALLOC_PROFILING is
enabled, not the 8B that sizeof(struct obj_cgroup *) returns.

Fix that by using sizeof(struct pcpuobj_ext) instead.
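
As a worked example of the size difference, the sketch below reproduces the
metadata term of the formula in userspace. It assumes PCPU_MIN_ALLOC_SIZE is
4 bytes (MIN_ALLOC_SIZE here); ext_bytes() is a hypothetical helper, not a
kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_ALLOC_SIZE 4	/* assumed value of PCPU_MIN_ALLOC_SIZE */

/*
 * Metadata bytes charged for `size` bytes of percpu data: one entry per
 * minimum-size allocation unit, each entry being entry_size bytes.
 */
static size_t ext_bytes(size_t size, size_t entry_size)
{
	assert(entry_size > 0);
	return size / MIN_ALLOC_SIZE * entry_size;
}
```

Under that assumption a 64-byte allocation has 16 entries, so
ext_bytes(64, 8) gives 128 bytes while ext_bytes(64, 16) gives 256 bytes:
the old formula charges half of the metadata in the 16B case.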

Fixes: 8f30d2660a38 ("mm: percpu: introduce pcpuobj_ext")
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 mm/percpu-internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 8cbe039bf847..d1c0a508710e 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -156,7 +156,7 @@ static inline size_t pcpu_obj_full_size(size_t size)
 
 #ifdef CONFIG_MEMCG
 	if (!mem_cgroup_kmem_disabled())
-		extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
+		extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct pcpuobj_ext);
 #endif
 
 	return size * num_possible_cpus() + extra_size;
-- 
2.54.0




* [PATCH v2 4/9] mm: percpu: Split memcg charging and kmem accounting
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)

This is a preparatory patch for the upcoming per-memcg-per-node kmem
accounting.

Percpu allocations charge memory before knowing which NUMA nodes the
pages will land on. So we need to decouple the memcg charging from the
kmem accounting:

1. In the pre-alloc hook, obj_cgroup_precharge() reserves pages for
   memcg limit enforcement without updating kmem stats.
2. In the post-alloc hook, obj_cgroup_account_kmem() accounts kmem
   and places the sub-page remainder into the obj stock after the
   allocation succeeds.

Because of that decoupling, the precharge function must not rely on the
stock: it always charges all the pages that will be accounted once the
allocation has happened. That means we may temporarily overcharge the
memcg, but the obj_stock draining will get things back to normal.
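
The page rounding behind the precharge and the sub-page remainder can be
checked with a small userspace sketch. It assumes 4 KiB pages
(SK_PAGE_SHIFT = 12); struct precharge_plan, page_align() and
plan_precharge() are hypothetical names that mirror the
PAGE_ALIGN(total) >> PAGE_SHIFT computation.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12	/* assumed 4 KiB pages */
#define SK_PAGE_SIZE ((size_t)1 << SK_PAGE_SHIFT)

/* Round a byte count up to the next page boundary. */
static size_t page_align(size_t bytes)
{
	return (bytes + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
}

struct precharge_plan {
	size_t nr_pages;	/* whole pages reserved up front */
	size_t remainder;	/* bytes handed back after the allocation */
};

/* Split a byte total into whole pages to precharge plus the leftover. */
static struct precharge_plan plan_precharge(size_t total)
{
	struct precharge_plan p;
	size_t aligned = page_align(total);

	assert(aligned >= total);
	p.nr_pages = aligned >> SK_PAGE_SHIFT;
	p.remainder = aligned - total;
	return p;
}
```

For a 5000-byte total this reserves 2 pages and later returns 3192 bytes,
which is the temporary overcharge mentioned above.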

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 include/linux/memcontrol.h |  4 +++
 mm/memcontrol.c            | 50 ++++++++++++++++++++++++++++++++++++++
 mm/percpu.c                | 15 +++++++++---
 3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1f46a0016fc..8f419ee54510 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1704,6 +1704,10 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+int obj_cgroup_precharge(struct obj_cgroup *objcg, gfp_t gfp,
+			 unsigned int nr_pages);
+void obj_cgroup_unprecharge(struct obj_cgroup *objcg, unsigned int nr_pages);
+void obj_cgroup_account_kmem(struct obj_cgroup *objcg, unsigned int nr_pages);
 
 extern struct static_key_false memcg_bpf_enabled_key;
 static inline bool memcg_bpf_enabled(void)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3bcc20e72914..480fba12a217 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3536,6 +3536,56 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
 	refill_obj_stock(objcg, size, true);
 }
 
+/*
+ * obj_cgroup_account_kmem - account KMEM for nr_pages
+ *
+ * Called after obj_cgroup_precharge() when the allocation succeeds.
+ * Accounts KMEM for nr_pages on the objcg's node.
+ */
+void obj_cgroup_account_kmem(struct obj_cgroup *objcg, unsigned int nr_pages)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = obj_cgroup_memcg(objcg);
+	account_kmem_nmi_safe(memcg, nr_pages);
+	memcg1_account_kmem(memcg, nr_pages);
+	rcu_read_unlock();
+}
+
+/*
+ * obj_cgroup_precharge - reserve pages without KMEM accounting
+ *
+ * Reserves page counter credits for limit enforcement. Does not update
+ * KMEM stats or the per-CPU obj stock, because precharge decouples
+ * the page counter charge from KMEM accounting (which happens later
+ * per-node via obj_cgroup_account_kmem).
+ *
+ * On failure, use obj_cgroup_unprecharge() to release the reservation.
+ */
+int obj_cgroup_precharge(struct obj_cgroup *objcg, gfp_t gfp,
+			 unsigned int nr_pages)
+{
+	struct mem_cgroup *memcg;
+	int ret;
+
+	memcg = get_mem_cgroup_from_objcg(objcg);
+	ret = try_charge_memcg(memcg, gfp, nr_pages);
+	css_put(&memcg->css);
+
+	return ret;
+}
+
+void obj_cgroup_unprecharge(struct obj_cgroup *objcg, unsigned int nr_pages)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = get_mem_cgroup_from_objcg(objcg);
+	if (!mem_cgroup_is_root(memcg))
+		refill_stock(memcg, nr_pages);
+	css_put(&memcg->css);
+}
+
 static inline size_t obj_full_size(struct kmem_cache *s)
 {
 	/*
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..01c87e39d366 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1625,7 +1625,8 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;
 
-	if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
+	if (obj_cgroup_precharge(objcg, gfp,
+				 PAGE_ALIGN(pcpu_obj_full_size(size)) >> PAGE_SHIFT))
 		return false;
 
 	*objcgp = objcg;
@@ -1640,15 +1641,23 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 		return;
 
 	if (likely(chunk && chunk->obj_exts)) {
+		size_t total = pcpu_obj_full_size(size);
+		size_t remainder = PAGE_ALIGN(total) - total;
+
 		obj_cgroup_get(objcg);
 		chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
 
 		rcu_read_lock();
 		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-				pcpu_obj_full_size(size));
+				total);
 		rcu_read_unlock();
+
+		obj_cgroup_account_kmem(objcg, PAGE_ALIGN(total) >> PAGE_SHIFT);
+		if (remainder)
+			obj_cgroup_uncharge(objcg, remainder);
 	} else {
-		obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
+		obj_cgroup_unprecharge(objcg,
+				       PAGE_ALIGN(pcpu_obj_full_size(size)) >> PAGE_SHIFT);
 	}
 }
 
-- 
2.54.0




* [PATCH v2 5/9] mm: memcontrol: track MEMCG_KMEM per NUMA node
  2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
                   ` (3 preceding siblings ...)
  2026-06-26 10:20 ` [PATCH v2 4/9] mm: percpu: Split memcg charging and kmem accounting Alexandre Ghiti
@ 2026-06-26 10:20 ` Alexandre Ghiti
  2026-06-26 10:20 ` [PATCH v2 6/9] mm: percpu: per-node kmem accounting Alexandre Ghiti
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

This patch replaces the memcg-wide MEMCG_KMEM stat with a per-node NR_KMEM
stat and wires all the "generic" charge/uncharge functions to it, using the
node id now stored in each per-node obj_cgroup.

It does not yet convert the kmem users to proper per-memcg-per-node
accounting; that is done in the upcoming patches.
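
As a rough userspace illustration of the accounting shape in the diff below
(not the kernel code itself): updates from normal context go straight to the
per-node stat, while updates from NMI context are parked in a per-node atomic
and folded in later by a flush. All sketch_* names, SKETCH_MAX_NODES and the
in_nmi parameter are hypothetical stand-ins, and C11 atomics stand in for the
kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 4	/* hypothetical; stands in for MAX_NUMNODES */

/* Hypothetical stand-in for the per-node part of a memcg. */
struct sketch_per_node {
	long kmem;		/* stat updated from normal context */
	atomic_int kmem_nmi;	/* deltas parked by NMI-context updates */
};

struct sketch_memcg {
	struct sketch_per_node node[SKETCH_MAX_NODES];
};

/* Add val pages to node nid; in_nmi selects the deferred atomic path. */
static void sketch_account_kmem(struct sketch_memcg *memcg, int nid, int val,
				bool in_nmi)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);

	if (!in_nmi)
		memcg->node[nid].kmem += val;
	else
		atomic_fetch_add(&memcg->node[nid].kmem_nmi, val);
}

/* Fold every parked NMI delta into the regular per-node stat. */
static void sketch_flush_nmi(struct sketch_memcg *memcg)
{
	for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
		int pending = atomic_exchange(&memcg->node[nid].kmem_nmi, 0);

		memcg->node[nid].kmem += pending;
	}
}

/* Sketch-only model of a memcg-wide value: the sum of the per-node stats. */
static long sketch_kmem_total(const struct sketch_memcg *memcg)
{
	long total = 0;

	for (int nid = 0; nid < SKETCH_MAX_NODES; nid++)
		total += memcg->node[nid].kmem;
	return total;
}
```

In this model, charging 3 pages normally and 2 pages "from NMI" on the same
node leaves the stat at 3 until sketch_flush_nmi() runs, after which it reads
5. The real flush also propagates the delta to the parent and to the node's
global counter, which the sketch leaves out.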

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 include/linux/memcontrol.h | 19 +++++++----
 include/linux/mmzone.h     |  1 +
 mm/memcontrol.c            | 64 ++++++++++++++++++++++++--------------
 mm/vmstat.c                |  1 +
 4 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8f419ee54510..a60fa197a973 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,7 +36,6 @@ enum memcg_stat_item {
 	MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
-	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
@@ -127,9 +126,10 @@ struct mem_cgroup_per_node {
 	struct list_head objcg_list;
 
 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
-	/* slab stats for nmi context */
+	/* slab and kmem stats for nmi context */
 	atomic_t		slab_reclaimable;
 	atomic_t		slab_unreclaimable;
+	atomic_t		kmem;
 #endif
 };
 
@@ -191,6 +191,7 @@ struct obj_cgroup {
 		struct rcu_head rcu;
 	};
 	bool is_root;
+	int nid;
 };
 
 /*
@@ -255,10 +256,6 @@ struct mem_cgroup {
 	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
 	atomic_long_t		memory_events_local[MEMCG_NR_MEMORY_EVENTS];
 
-#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
-	/* MEMCG_KMEM for nmi context */
-	atomic_t		kmem_stat;
-#endif
 	/*
 	 * Hint of reclaim pressure for socket memroy management. Note
 	 * that this indicator should NOT be used in legacy cgroup mode
@@ -773,6 +770,16 @@ static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 		percpu_ref_put(&objcg->refcnt);
 }
 
+static inline struct obj_cgroup *obj_cgroup_nid(struct obj_cgroup *objcg,
+						int nid)
+{
+	struct mem_cgroup *memcg = obj_cgroup_memcg(objcg);
+
+	/* Borrowed RCU lookup: takes no reference, caller must hold RCU. */
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return rcu_dereference(memcg->nodeinfo[nid]->objcg);
+}
+
 static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg)
 {
 	return !memcg || css_tryget(&memcg->css);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca2712187147..753fdf9dc80f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -327,6 +327,7 @@ enum node_stat_item {
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
+	NR_KMEM,
 	NR_BALLOON_PAGES,
 	NR_KERNEL_FILE_PAGES,
 	NR_GPU_ACTIVE,	/* Pages assigned to GPU objects */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 480fba12a217..c6a0d8463400 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,6 +136,7 @@ bool mem_cgroup_kmem_disabled(void)
 }
 
 static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
+static void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val);
 
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
@@ -170,9 +171,11 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 
 	if (nr_pages) {
 		struct mem_cgroup *memcg;
+		struct lruvec *lruvec;
 
 		memcg = get_mem_cgroup_from_objcg(objcg);
-		mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
+		lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(objcg->nid));
+		mod_lruvec_state(lruvec, NR_KMEM, -nr_pages);
 		memcg1_account_kmem(memcg, -nr_pages);
 		if (!mem_cgroup_is_root(memcg))
 			memcg_uncharge(memcg, nr_pages);
@@ -423,13 +426,13 @@ static const unsigned int memcg_node_stat_items[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
+	NR_KMEM,
 };
 
 static const unsigned int memcg_stat_items[] = {
 	MEMCG_SWAP,
 	MEMCG_SOCK,
 	MEMCG_PERCPU_B,
-	MEMCG_KMEM,
 	MEMCG_ZSWAP_B,
 	MEMCG_ZSWAPPED,
 	MEMCG_ZSWAP_INCOMP,
@@ -1546,7 +1549,7 @@ struct memory_stat {
 static const struct memory_stat memory_stats[] = {
 	{ "anon",			NR_ANON_MAPPED			},
 	{ "file",			NR_FILE_PAGES			},
-	{ "kernel",			MEMCG_KMEM			},
+	{ "kernel",			NR_KMEM				},
 	{ "kernel_stack",		NR_KERNEL_STACK_KB		},
 	{ "pagetables",			NR_PAGETABLE			},
 	{ "sec_pagetables",		NR_SECONDARY_PAGETABLE		},
@@ -3052,20 +3055,26 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
 }
 
 #ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
-static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
+static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int nid, int val)
 {
 	if (likely(!in_nmi())) {
-		mod_memcg_state(memcg, MEMCG_KMEM, val);
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+		mod_lruvec_state(lruvec, NR_KMEM, val);
 	} else {
+		struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
+
 		/* preemption is disabled in_nmi(). */
 		__css_rstat_updated(&memcg->css, smp_processor_id());
-		atomic_add(val, &memcg->kmem_stat);
+		atomic_add(val, &pn->kmem);
 	}
 }
 #else
-static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
+static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int nid, int val)
 {
-	mod_memcg_state(memcg, MEMCG_KMEM, val);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+	mod_lruvec_state(lruvec, NR_KMEM, val);
 }
 #endif
 
@@ -3081,7 +3090,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 
 	memcg = get_mem_cgroup_from_objcg(objcg);
 
-	account_kmem_nmi_safe(memcg, -nr_pages);
+	account_kmem_nmi_safe(memcg, objcg->nid, -nr_pages);
 	memcg1_account_kmem(memcg, -nr_pages);
 	if (!mem_cgroup_is_root(memcg))
 		refill_stock(memcg, nr_pages);
@@ -3109,7 +3118,7 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
 	if (ret)
 		goto out;
 
-	account_kmem_nmi_safe(memcg, nr_pages);
+	account_kmem_nmi_safe(memcg, objcg->nid, nr_pages);
 	memcg1_account_kmem(memcg, nr_pages);
 out:
 	css_put(&memcg->css);
@@ -3337,10 +3346,11 @@ static void drain_obj_stock_slot(struct obj_stock_pcp *stock, int i)
 
 		if (nr_pages) {
 			struct mem_cgroup *memcg;
+			struct lruvec *lruvec;
 
 			memcg = get_mem_cgroup_from_objcg(old);
-
-			mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
+			lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(old->nid));
+			mod_lruvec_state(lruvec, NR_KMEM, -nr_pages);
 			memcg1_account_kmem(memcg, -nr_pages);
 			if (!mem_cgroup_is_root(memcg))
 				memcg_uncharge(memcg, nr_pages);
@@ -3349,7 +3359,7 @@ static void drain_obj_stock_slot(struct obj_stock_pcp *stock, int i)
 		}
 
 		/*
-		 * The leftover is flushed to the centralized per-memcg value.
+		 * The leftover is flushed to the per-node per-memcg value.
 		 * On the next attempt to refill obj stock it will be moved
 		 * to a per-cpu stock (probably, on an other CPU), see
 		 * refill_obj_stock().
@@ -3548,7 +3558,7 @@ void obj_cgroup_account_kmem(struct obj_cgroup *objcg, unsigned int nr_pages)
 
 	rcu_read_lock();
 	memcg = obj_cgroup_memcg(objcg);
-	account_kmem_nmi_safe(memcg, nr_pages);
+	account_kmem_nmi_safe(memcg, objcg->nid, nr_pages);
 	memcg1_account_kmem(memcg, nr_pages);
 	rcu_read_unlock();
 }
@@ -4302,6 +4312,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		if (unlikely(mem_cgroup_is_root(memcg)))
 			objcg->is_root = true;
 
+		objcg->nid = nid;
 		objcg->memcg = memcg;
 		rcu_assign_pointer(memcg->nodeinfo[nid]->objcg, objcg);
 		obj_cgroup_get(objcg);
@@ -4505,15 +4516,6 @@ static void flush_nmi_stats(struct mem_cgroup *memcg, struct mem_cgroup *parent,
 {
 	int nid;
 
-	if (atomic_read(&memcg->kmem_stat)) {
-		int kmem = atomic_xchg(&memcg->kmem_stat, 0);
-		int index = memcg_stats_index(MEMCG_KMEM);
-
-		memcg->vmstats->state[index] += kmem;
-		if (parent)
-			parent->vmstats->state_pending[index] += kmem;
-	}
-
 	for_each_node_state(nid, N_MEMORY) {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
 		struct lruvec_stats *lstats = pn->lruvec_stats;
@@ -4544,6 +4546,18 @@ static void flush_nmi_stats(struct mem_cgroup *memcg, struct mem_cgroup *parent,
 			if (parent)
 				parent->vmstats->state_pending[index] += slab;
 		}
+		if (atomic_read(&pn->kmem)) {
+			int kmem = atomic_xchg(&pn->kmem, 0);
+			int index = memcg_stats_index(NR_KMEM);
+
+			mod_node_page_state(NODE_DATA(nid), NR_KMEM, kmem);
+			lstats->state[index] += kmem;
+			memcg->vmstats->state[index] += kmem;
+			if (plstats)
+				plstats->state_pending[index] += kmem;
+			if (parent)
+				parent->vmstats->state_pending[index] += kmem;
+		}
 	}
 }
 #else
@@ -5332,7 +5346,9 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	if (ug->nr_memory) {
 		memcg_uncharge(memcg, ug->nr_memory);
 		if (ug->nr_kmem) {
-			mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
+			struct lruvec *lruvec =
+				mem_cgroup_lruvec(memcg, NODE_DATA(ug->objcg->nid));
+			mod_lruvec_state(lruvec, NR_KMEM, -ug->nr_kmem);
 			memcg1_account_kmem(memcg, -ug->nr_kmem);
 		}
 		memcg1_oom_recover(memcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..d55437d1852e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1293,6 +1293,7 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_HUGETLB_PAGE
 	[I(NR_HUGETLB)]				= "nr_hugetlb",
 #endif
+	[I(NR_KMEM)]				= "nr_kmem",
 	[I(NR_BALLOON_PAGES)]			= "nr_balloon_pages",
 	[I(NR_KERNEL_FILE_PAGES)]		= "nr_kernel_file_pages",
 	[I(NR_GPU_ACTIVE)]			= "nr_gpu_active",
-- 
2.54.0




* [PATCH v2 6/9] mm: percpu: per-node kmem accounting
  2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
                   ` (4 preceding siblings ...)
  2026-06-26 10:20 ` [PATCH v2 5/9] mm: memcontrol: track MEMCG_KMEM per NUMA node Alexandre Ghiti
@ 2026-06-26 10:20 ` Alexandre Ghiti
  2026-06-26 10:20 ` [PATCH v2 7/9] mm: percpu: per-node kmem accounting for obj_exts metadata Alexandre Ghiti
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

Now that the memcg charging is decoupled from the kmem accounting, the
post-alloc hook knows the actual node each backing page landed on and can
account it per node. The backing pages of a percpu allocation are only
best-effort placed on cpu_to_node() by pcpu_alloc_pages(), so some may
have fallen back to other nodes; the hook therefore reads each page's real
node via page_to_nid() and accumulates the bytes per node.

The accounting cannot go through the obj_stock: a concurrent stock drain
could take the pages the post-alloc hook relies on, and the hook cannot
afford a failing re-charge. Instead, accumulate the per-node bytes first,
then for each touched node issue a single obj_cgroup_account_kmem() of
ceil(bytes_on_node / PAGE_SIZE) pages and hand the sub-page remainder back
to the stock. The free hook mirrors this with one uncharge of the exact
bytes per node. Batching per node (instead of per page) keeps the memcg
work proportional to the number of nodes rather than num_possible_cpus().

We have to precharge enough pages to cover the worst-case scenario where
the allocation is spread across all nodes. In
pcpu_memcg_post_alloc_hook(), we charge PAGE_ALIGN(size_on_node_X) for each
node, which rounds up by strictly less than one page per node. So for N
nodes, we waste strictly less than N pages, and precharging
(PAGE_ALIGN(total size) >> PAGE_SHIFT) + num_possible_nodes() pages is
always enough.
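
The rounding argument can be checked with a small userspace sketch. It is
only an illustration under assumptions: a 4096-byte page, and hypothetical
sketch_* helpers that mirror the per-node rounding and the precharge count
described above rather than reproducing the kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Pages needed to hold `bytes`, rounded up. */
static unsigned int sketch_pages(size_t bytes)
{
	return (unsigned int)((bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);
}

/* Pages to reserve up front: ceil(total) plus one page per possible node. */
static unsigned int sketch_precharge_pages(size_t total, unsigned int nr_nodes)
{
	return sketch_pages(total) + nr_nodes;
}

/*
 * Charge each touched node ceil(bytes_on_node / PAGE_SIZE) pages and
 * return the total number of pages charged. *slack_bytes receives the
 * sum of the sub-page remainders that would be handed back to the stock.
 */
static unsigned int sketch_charge_nodes(const size_t *node_bytes,
					unsigned int nr_nodes,
					size_t *slack_bytes)
{
	unsigned int pages_used = 0;
	size_t slack = 0;

	assert(node_bytes && slack_bytes);

	for (unsigned int nid = 0; nid < nr_nodes; nid++) {
		unsigned int pages;

		if (!node_bytes[nid])
			continue;	/* untouched node: nothing to charge */
		pages = sketch_pages(node_bytes[nid]);
		pages_used += pages;
		slack += (size_t)pages * SKETCH_PAGE_SIZE - node_bytes[nid];
	}
	*slack_bytes = slack;
	return pages_used;
}
```

For example, 12289 bytes split as { 4097, 1, 0, 8191 } over four nodes charge
2 + 1 + 2 = 5 pages, while the precharge is ceil(12289 / 4096) + 4 = 8 pages,
so 3 precharged pages go back unused and 8191 bytes of sub-page remainder
return to the stock.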

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 mm/percpu.c | 88 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 76 insertions(+), 12 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 01c87e39d366..e9d2d3716b99 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1613,6 +1613,22 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
 }
 
 #ifdef CONFIG_MEMCG
+static unsigned int pcpu_memcg_nr_precharge_pages(size_t size)
+{
+	size_t total = pcpu_obj_full_size(size);
+	unsigned int ceil = PAGE_ALIGN(total) >> PAGE_SHIFT;
+
+	/*
+	 * pcpu_memcg_post_alloc_hook() charges ceil(bytes_on_node / PAGE_SIZE)
+	 * pages per node. Summed over the K <= num_possible_nodes() nodes the
+	 * allocation touches that is at most ceil + (K - 1): each node rounds
+	 * its share up by strictly less than a page. Precharge
+	 * ceil + num_possible_nodes(), which covers that worst case with a
+	 * page of headroom, so the per-node credit never runs short.
+	 */
+	return ceil + num_possible_nodes();
+}
+
 static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 				      struct obj_cgroup **objcgp)
 {
@@ -1625,14 +1641,37 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
 	if (!objcg || obj_cgroup_is_root(objcg))
 		return true;
 
-	if (obj_cgroup_precharge(objcg, gfp,
-				 PAGE_ALIGN(pcpu_obj_full_size(size)) >> PAGE_SHIFT))
+	if (obj_cgroup_precharge(objcg, gfp, pcpu_memcg_nr_precharge_pages(size)))
 		return false;
 
 	*objcgp = objcg;
 	return true;
 }
 
+/*
+ * Accumulate the per-cpu payload bytes of this allocation onto the node that
+ * actually backs each page. pcpu_alloc_pages() only places a CPU's backing
+ * page on cpu_to_node() as a best effort, so the page may have fallen back to
+ * another node; use the page's real node. node_bytes[nid] accumulates the
+ * bytes seen on each node, to be charged in one batch per node by the caller.
+ */
+static void pcpu_memcg_accumulate_pages(struct pcpu_chunk *chunk, int off,
+				   size_t size, unsigned int *node_bytes)
+{
+	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	unsigned int cpu, i;
+
+	for_each_possible_cpu(cpu) {
+		for (i = 0; i < nr_pages; i++) {
+			void *addr = (void *)pcpu_chunk_addr(chunk, cpu, PFN_DOWN(off) + i);
+			size_t page_sz = i < nr_pages - 1 ?
+				PAGE_SIZE : size - (nr_pages - 1) * PAGE_SIZE;
+
+			node_bytes[page_to_nid(pcpu_addr_to_page(addr))] += page_sz;
+		}
+	}
+}
+
 static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 				       struct pcpu_chunk *chunk, int off,
 				       size_t size)
@@ -1641,29 +1680,47 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 		return;
 
 	if (likely(chunk && chunk->obj_exts)) {
-		size_t total = pcpu_obj_full_size(size);
-		size_t remainder = PAGE_ALIGN(total) - total;
+		unsigned int precharge_pages = pcpu_memcg_nr_precharge_pages(size);
+		unsigned int node_bytes[MAX_NUMNODES] = { 0 };
+		unsigned int pages_used = 0;
+		int nid;
 
 		obj_cgroup_get(objcg);
 		chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
 
+		pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
+
 		rcu_read_lock();
 		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-				total);
-		rcu_read_unlock();
+				pcpu_obj_full_size(size));
+
+		for_each_online_node(nid) {
+			unsigned int pages;
 
-		obj_cgroup_account_kmem(objcg, PAGE_ALIGN(total) >> PAGE_SHIFT);
-		if (remainder)
-			obj_cgroup_uncharge(objcg, remainder);
+			if (!node_bytes[nid])
+				continue;
+			pages = DIV_ROUND_UP(node_bytes[nid], PAGE_SIZE);
+			obj_cgroup_account_kmem(obj_cgroup_nid(objcg, nid), pages);
+			pages_used += pages;
+			if (pages * PAGE_SIZE > node_bytes[nid])
+				obj_cgroup_uncharge(obj_cgroup_nid(objcg, nid),
+						    pages * PAGE_SIZE - node_bytes[nid]);
+		}
+
+		/* Return the precharged pages we did not use. */
+		if (pages_used < precharge_pages)
+			obj_cgroup_unprecharge(objcg, precharge_pages - pages_used);
+		rcu_read_unlock();
 	} else {
-		obj_cgroup_unprecharge(objcg,
-				       PAGE_ALIGN(pcpu_obj_full_size(size)) >> PAGE_SHIFT);
+		obj_cgroup_unprecharge(objcg, pcpu_memcg_nr_precharge_pages(size));
 	}
 }
 
 static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 {
+	unsigned int node_bytes[MAX_NUMNODES] = { 0 };
 	struct obj_cgroup *objcg;
+	int nid;
 
 	if (unlikely(!chunk->obj_exts))
 		return;
@@ -1673,11 +1730,18 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 		return;
 	chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
 
-	obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
+	pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
 
 	rcu_read_lock();
 	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
 			-pcpu_obj_full_size(size));
+
+	/* Uncharge each node the exact bytes it was charged at alloc. */
+	for_each_online_node(nid) {
+		if (node_bytes[nid])
+			obj_cgroup_uncharge(obj_cgroup_nid(objcg, nid),
+					    node_bytes[nid]);
+	}
 	rcu_read_unlock();
 
 	obj_cgroup_put(objcg);
-- 
2.54.0




* [PATCH v2 7/9] mm: percpu: per-node kmem accounting for obj_exts metadata
  2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
                   ` (5 preceding siblings ...)
  2026-06-26 10:20 ` [PATCH v2 6/9] mm: percpu: per-node kmem accounting Alexandre Ghiti
@ 2026-06-26 10:20 ` Alexandre Ghiti
  2026-06-26 10:20 ` [PATCH v2 8/9] mm: percpu: skip the per-cpu node walk on single-node systems Alexandre Ghiti
  2026-06-26 10:20 ` [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc Alexandre Ghiti
  8 siblings, 0 replies; 12+ messages in thread
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

Account the percpu obj_exts metadata to the correct NUMA node. The
obj_exts array is vmalloc'd and its pages may reside on different nodes,
so walk the vmalloc pages via vmalloc_to_page() + page_to_nid() and add
each page's slice to the same per-node byte accumulation as the per-cpu
payload. The post-alloc and free hooks then charge / uncharge the
combined accumulation per node, so the metadata rides the same batched
account_kmem() and the same precharged page pool as the payload.

Folding the metadata into the shared node_bytes[] accumulation means each
node is rounded up to whole pages only once, across payload and metadata
combined.

There is no need to bump the number of pages to precharge here:
pcpu_obj_full_size() already includes the obj_exts metadata, and the
reasoning of the previous commit still applies, so we waste strictly less
than N pages.
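
The page-by-page walk over the metadata range can be sketched in userspace as
follows. This is an illustration only: the page size is assumed to be 4096
bytes, and sketch_addr_to_nid_fn is a hypothetical callback standing in for
the vmalloc_to_page() + page_to_nid() lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Hypothetical lookup: which node backs the page containing addr. */
typedef int (*sketch_addr_to_nid_fn)(uintptr_t addr);

/*
 * Split [start, start + len) at page boundaries and add each slice to the
 * byte counter of the node that backs the corresponding page.
 */
static void sketch_accumulate_range(uintptr_t start, size_t len,
				    sketch_addr_to_nid_fn addr_to_nid,
				    size_t *node_bytes)
{
	uintptr_t end = start + len;
	uintptr_t addr = start;

	assert(addr_to_nid && node_bytes);

	while (addr < end) {
		/* Bytes left in the current page, starting at addr. */
		size_t in_page = SKETCH_PAGE_SIZE - (addr % SKETCH_PAGE_SIZE);
		size_t slice = end - addr < in_page ? end - addr : in_page;

		node_bytes[addr_to_nid(addr)] += slice;
		addr += slice;	/* lands on the next page boundary or on end */
	}
}

/* Example policy for the sketch: even pages on node 0, odd pages on node 1. */
static int sketch_alternating_nid(uintptr_t addr)
{
	return (int)((addr / SKETCH_PAGE_SIZE) & 1);
}
```

With the alternating policy, a 200-byte range starting 4000 bytes into page 2
contributes 96 bytes to node 0 and 104 bytes to node 1. Because the slices
land in the same node_bytes[] array as the payload, each node is still
rounded up to whole pages only once.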

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 mm/percpu.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index e9d2d3716b99..9224344d4b8e 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1672,6 +1672,23 @@ static void pcpu_memcg_accumulate_pages(struct pcpu_chunk *chunk, int off,
 	}
 }
 
+static void pcpu_memcg_accumulate_obj_exts(struct pcpu_chunk *chunk, int off,
+				      size_t size, unsigned int *node_bytes)
+{
+	size_t ext_bytes = size / PCPU_MIN_ALLOC_SIZE * sizeof(struct pcpuobj_ext);
+	unsigned long ext_start = (unsigned long)&chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT];
+	unsigned long ext_end = ext_start + ext_bytes;
+	unsigned long addr;
+
+	for (addr = ext_start; addr < ext_end; addr = ALIGN(addr + 1, PAGE_SIZE)) {
+		struct page *page = vmalloc_to_page((void *)addr);
+		size_t page_sz = min_t(size_t, ext_end - addr,
+				       PAGE_SIZE - offset_in_page(addr));
+
+		node_bytes[page_to_nid(page)] += page_sz;
+	}
+}
+
 static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 				       struct pcpu_chunk *chunk, int off,
 				       size_t size)
@@ -1689,6 +1706,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 		chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
 
 		pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
+		pcpu_memcg_accumulate_obj_exts(chunk, off, size, node_bytes);
 
 		rcu_read_lock();
 		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
@@ -1731,6 +1749,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 	chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
 
 	pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
+	pcpu_memcg_accumulate_obj_exts(chunk, off, size, node_bytes);
 
 	rcu_read_lock();
 	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-- 
2.54.0




* [PATCH v2 8/9] mm: percpu: skip the per-cpu node walk on single-node systems
  2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
                   ` (6 preceding siblings ...)
  2026-06-26 10:20 ` [PATCH v2 7/9] mm: percpu: per-node kmem accounting for obj_exts metadata Alexandre Ghiti
@ 2026-06-26 10:20 ` Alexandre Ghiti
  2026-06-26 10:20 ` [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc Alexandre Ghiti
  8 siblings, 0 replies; 12+ messages in thread
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

pcpu_memcg_{post_alloc,free}_hook() determine each backing page's NUMA
node by walking the chunk with vmalloc_to_page() once per possible CPU
(plus the obj_exts vmalloc pages). On a single-node system that walk is
pure overhead: with only one online node, page_to_nid() can only return
that node, so the whole allocation footprint necessarily lives there.

Introduce pcpu_memcg_accumulate(), which wraps both accumulation helpers and
adds a fast path: when nr_online_nodes == 1, attribute pcpu_obj_full_size()
(payload + obj_exts metadata) to first_online_node and return, skipping the
O(num_possible_cpus()) vmalloc_to_page() walk entirely. The result is
identical to walking the pages, since every page is on that node.

Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 mm/percpu.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 9224344d4b8e..9a735d01b23a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1689,6 +1689,18 @@ static void pcpu_memcg_accumulate_obj_exts(struct pcpu_chunk *chunk, int off,
 	}
 }
 
+static void pcpu_memcg_accumulate(struct pcpu_chunk *chunk, int off, size_t size,
+			     unsigned int *node_bytes)
+{
+	if (nr_online_nodes == 1) {
+		node_bytes[first_online_node] = pcpu_obj_full_size(size);
+		return;
+	}
+
+	pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
+	pcpu_memcg_accumulate_obj_exts(chunk, off, size, node_bytes);
+}
+
 static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 				       struct pcpu_chunk *chunk, int off,
 				       size_t size)
@@ -1705,8 +1717,7 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
 		obj_cgroup_get(objcg);
 		chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
 
-		pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
-		pcpu_memcg_accumulate_obj_exts(chunk, off, size, node_bytes);
+		pcpu_memcg_accumulate(chunk, off, size, node_bytes);
 
 		rcu_read_lock();
 		mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
@@ -1748,8 +1759,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 		return;
 	chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
 
-	pcpu_memcg_accumulate_pages(chunk, off, size, node_bytes);
-	pcpu_memcg_accumulate_obj_exts(chunk, off, size, node_bytes);
+	pcpu_memcg_accumulate(chunk, off, size, node_bytes);
 
 	rcu_read_lock();
 	mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-- 
2.54.0




* [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc
  2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
                   ` (7 preceding siblings ...)
  2026-06-26 10:20 ` [PATCH v2 8/9] mm: percpu: skip the per-cpu node walk on single-node systems Alexandre Ghiti
@ 2026-06-26 10:20 ` Alexandre Ghiti
  2026-06-26 14:32   ` Usama Arif
  8 siblings, 1 reply; 12+ messages in thread
From: Alexandre Ghiti @ 2026-06-26 10:20 UTC (permalink / raw)
  To: alexandre, Andrew Morton
  Cc: Axel Rasmussen, Barry Song, Ben Segall, cgroups, Chengming Zhou,
	Christoph Lameter, David Hildenbrand, Dennis Zhou,
	Dietmar Eggemann, Ingo Molnar, Johannes Weiner, Juri Lelli,
	Kairui Song, Kent Overstreet, K Prateek Nayak, Liam R. Howlett,
	linux-kernel, linux-mm, Lorenzo Stoakes, Mel Gorman, Michal Hocko,
	Mike Rapoport, Minchan Kim, Muchun Song, Nhat Pham,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie, Alexandre Ghiti

Update zswap and zsmalloc to use per-node obj_cgroup for kmem
accounting, attributing compressed page charges to the correct
NUMA node.

However, this is still incomplete: it does not correctly account for
entries that straddle two pages, which may be on different nodes.

Joshua will handle this properly in a separate series [1].

Link: https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/ [1]
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
 include/linux/zsmalloc.h |  2 ++
 mm/zsmalloc.c            | 11 +++++++++++
 mm/zswap.c               | 19 ++++++++++++++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 478410c880b1..30427f3fe232 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -50,6 +50,8 @@ void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle);
 void zs_obj_write(struct zs_pool *pool, unsigned long handle,
 		  void *handle_mem, size_t mem_len);
 
+int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle);
+
 extern const struct movable_operations zsmalloc_mops;
 
 #endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 83f5820c45f9..17f7403ebe77 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1380,6 +1380,17 @@ static void obj_free(int class_size, unsigned long obj)
 	mod_zspage_inuse(zspage, -1);
 }
 
+int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle)
+{
+	unsigned long obj;
+	struct zpdesc *zpdesc;
+
+	obj = handle_to_obj(handle);
+	obj_to_zpdesc(obj, &zpdesc);
+	return page_to_nid(zpdesc_page(zpdesc));
+}
+EXPORT_SYMBOL(zs_handle_to_nid);
+
 void zs_free(struct zs_pool *pool, unsigned long handle)
 {
 	struct zspage *zspage;
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..466c6a3f4ef3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1438,7 +1438,24 @@ static bool zswap_store_page(struct page *page,
 	 */
 	zswap_pool_get(pool);
 	if (objcg) {
-		obj_cgroup_get(objcg);
+		struct obj_cgroup *nid_objcg;
+		int nid = zs_handle_to_nid(pool->zs_pool, entry->handle);
+
+		/*
+		 * obj_cgroup_nid() returns a borrowed RCU pointer (no
+		 * reference), so the returned per-node objcg may be freed
+		 * (kfree_rcu) before we use it. Pin it with a tryget inside a
+		 * single rcu section; if it is already dying, fall back to the
+		 * folio objcg (held by the caller) so the charge still lands on
+		 * the right memcg, just without per-node attribution.
+		 */
+		rcu_read_lock();
+		nid_objcg = obj_cgroup_nid(objcg, nid);
+		if (nid_objcg && obj_cgroup_tryget(nid_objcg))
+			objcg = nid_objcg;
+		else
+			obj_cgroup_get(objcg);
+		rcu_read_unlock();
 		obj_cgroup_charge_zswap(objcg, entry->length);
 	}
 	atomic_long_inc(&zswap_stored_pages);
-- 
2.54.0




* Re: [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc
  2026-06-26 10:20 ` [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc Alexandre Ghiti
@ 2026-06-26 14:32   ` Usama Arif
  2026-06-26 18:36     ` Nhat Pham
  0 siblings, 1 reply; 12+ messages in thread
From: Usama Arif @ 2026-06-26 14:32 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: Usama Arif, alexandre, Andrew Morton, Barry Song, Ben Segall,
	cgroups, Chengming Zhou, Christoph Lameter, David Hildenbrand,
	Dennis Zhou, Dietmar Eggemann, Ingo Molnar, Johannes Weiner,
	Juri Lelli, Kairui Song, Kent Overstreet, K Prateek Nayak,
	Liam R. Howlett, linux-kernel, linux-mm, Lorenzo Stoakes,
	Mel Gorman, Michal Hocko, Mike Rapoport, Minchan Kim, Muchun Song,
	Nhat Pham, Peter Zijlstra, Qi Zheng, Roman Gushchin,
	Sergey Senozhatsky, Shakeel Butt, Steven Rostedt,
	Suren Baghdasaryan, Tejun Heo, Valentin Schneider,
	Vincent Guittot, Vlastimil Babka, Wei Xu, Yosry Ahmed,
	Yuanchu Xie

On Fri, 26 Jun 2026 12:20:58 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:

> Update zswap and zsmalloc to use per-node obj_cgroup for kmem
> accounting, attributing compressed page charges to the correct
> NUMA node.
> 
> But actually, this is incomplete because it does not correctly account
> for entries that straddle pages, those pages being possibly on 2 different
> nodes.
> 
> This will be correctly handled by Joshua in a different series [1].
> 
> Link: https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/ [1]
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
>  include/linux/zsmalloc.h |  2 ++
>  mm/zsmalloc.c            | 11 +++++++++++
>  mm/zswap.c               | 19 ++++++++++++++++++-
>  3 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index 478410c880b1..30427f3fe232 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -50,6 +50,8 @@ void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle);
>  void zs_obj_write(struct zs_pool *pool, unsigned long handle,
>  		  void *handle_mem, size_t mem_len);
>  
> +int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle);
> +
>  extern const struct movable_operations zsmalloc_mops;
>  
>  #endif
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 83f5820c45f9..17f7403ebe77 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -1380,6 +1380,17 @@ static void obj_free(int class_size, unsigned long obj)
>  	mod_zspage_inuse(zspage, -1);
>  }
>  
> +int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle)
> +{
> +	unsigned long obj;
> +	struct zpdesc *zpdesc;
> +
> +	obj = handle_to_obj(handle);
> +	obj_to_zpdesc(obj, &zpdesc);
> +	return page_to_nid(zpdesc_page(zpdesc));
> +}
> +EXPORT_SYMBOL(zs_handle_to_nid);

Does this need the same locking as the other handle-to-zspage paths?
zs_free() takes pool->lock before handle_to_obj() because zspage migration can
update or move the object behind the handle. This helper does the same decode
without the lock, so zswap's uncharge path can race migration and charge or
uncharge the wrong node, or observe transient zspage state.


> +
>  void zs_free(struct zs_pool *pool, unsigned long handle)
>  {
>  	struct zspage *zspage;
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 761cd699e0a3..466c6a3f4ef3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1438,7 +1438,24 @@ static bool zswap_store_page(struct page *page,
>  	 */
>  	zswap_pool_get(pool);
>  	if (objcg) {
> -		obj_cgroup_get(objcg);
> +		struct obj_cgroup *nid_objcg;
> +		int nid = zs_handle_to_nid(pool->zs_pool, entry->handle);
> +
> +		/*
> +		 * obj_cgroup_nid() returns a borrowed RCU pointer (no
> +		 * reference), so the returned per-node objcg may be freed
> +		 * (kfree_rcu) before we use it. Pin it with a tryget inside a
> +		 * single rcu section; if it is already dying, fall back to the
> +		 * folio objcg (held by the caller) so the charge still lands on
> +		 * the right memcg, just without per-node attribution.
> +		 */
> +		rcu_read_lock();
> +		nid_objcg = obj_cgroup_nid(objcg, nid);
> +		if (nid_objcg && obj_cgroup_tryget(nid_objcg))
> +			objcg = nid_objcg;
> +		else
> +			obj_cgroup_get(objcg);
> +		rcu_read_unlock();
>  		obj_cgroup_charge_zswap(objcg, entry->length);
>  	}
>  	atomic_long_inc(&zswap_stored_pages);
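
For readers following the tryget-or-fall-back logic in the hunk above, a
minimal userspace model of it might look like the sketch below. All names
(model_objcg, model_tryget, model_get, model_pick) are hypothetical, and a
C11 atomic counter stands in for the objcg reference count; the kernel
uses a percpu_ref and runs the lookup inside an RCU read-side section,
neither of which is modelled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an obj_cgroup; not the kernel type. */
struct model_objcg {
	atomic_long refcnt;	/* 0 means the object is already dying */
};

/* Take a reference only if the count is still non-zero. */
static bool model_tryget(struct model_objcg *o)
{
	long old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Unconditional get: only valid when the caller already holds a reference. */
static void model_get(struct model_objcg *o)
{
	atomic_fetch_add(&o->refcnt, 1);
}

/*
 * Return the object to charge, with one new reference taken on it.
 * Prefer the per-node object; fall back to the caller-held one when the
 * per-node object is missing or dying.
 */
static struct model_objcg *model_pick(struct model_objcg *folio_objcg,
				      struct model_objcg *nid_objcg)
{
	assert(folio_objcg != NULL);

	if (nid_objcg && model_tryget(nid_objcg))
		return nid_objcg;

	model_get(folio_objcg);
	return folio_objcg;
}
```

Either way the caller ends up holding exactly one extra reference on the
object it charges, which is what the later put has to pair with.
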
> -- 
> 2.54.0
> 
> 



* Re: [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc
  2026-06-26 14:32   ` Usama Arif
@ 2026-06-26 18:36     ` Nhat Pham
  0 siblings, 0 replies; 12+ messages in thread
From: Nhat Pham @ 2026-06-26 18:36 UTC (permalink / raw)
  To: Usama Arif
  Cc: Alexandre Ghiti, alexandre, Andrew Morton, Barry Song, Ben Segall,
	cgroups, Chengming Zhou, Christoph Lameter, David Hildenbrand,
	Dennis Zhou, Dietmar Eggemann, Ingo Molnar, Johannes Weiner,
	Juri Lelli, Kairui Song, Kent Overstreet, K Prateek Nayak,
	Liam R. Howlett, linux-kernel, linux-mm, Lorenzo Stoakes,
	Mel Gorman, Michal Hocko, Mike Rapoport, Minchan Kim, Muchun Song,
	Peter Zijlstra, Qi Zheng, Roman Gushchin, Sergey Senozhatsky,
	Shakeel Butt, Steven Rostedt, Suren Baghdasaryan, Tejun Heo,
	Valentin Schneider, Vincent Guittot, Vlastimil Babka, Wei Xu,
	Yosry Ahmed, Yuanchu Xie

On Fri, Jun 26, 2026 at 7:32 AM Usama Arif <usama.arif@linux.dev> wrote:
>
> On Fri, 26 Jun 2026 12:20:58 +0200 Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> > Update zswap and zsmalloc to use per-node obj_cgroup for kmem
> > accounting, attributing compressed page charges to the correct
> > NUMA node.
> >
> > But actually, this is incomplete because it does not correctly account
> > for entries that straddle pages, those pages being possibly on 2 different
> > nodes.
> >
> > This will be correctly handled by Joshua in a different series [1].
> >
> > Link: https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/ [1]
> > Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> > ---
> >  include/linux/zsmalloc.h |  2 ++
> >  mm/zsmalloc.c            | 11 +++++++++++
> >  mm/zswap.c               | 19 ++++++++++++++++++-
> >  3 files changed, 31 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> > index 478410c880b1..30427f3fe232 100644
> > --- a/include/linux/zsmalloc.h
> > +++ b/include/linux/zsmalloc.h
> > @@ -50,6 +50,8 @@ void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle);
> >  void zs_obj_write(struct zs_pool *pool, unsigned long handle,
> >                 void *handle_mem, size_t mem_len);
> >
> > +int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle);
> > +
> >  extern const struct movable_operations zsmalloc_mops;
> >
> >  #endif
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 83f5820c45f9..17f7403ebe77 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -1380,6 +1380,17 @@ static void obj_free(int class_size, unsigned long obj)
> >       mod_zspage_inuse(zspage, -1);
> >  }
> >
> > +int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle)
> > +{
> > +     unsigned long obj;
> > +     struct zpdesc *zpdesc;
> > +
> > +     obj = handle_to_obj(handle);
> > +     obj_to_zpdesc(obj, &zpdesc);
> > +     return page_to_nid(zpdesc_page(zpdesc));
> > +}
> > +EXPORT_SYMBOL(zs_handle_to_nid);
>
> Does this need the same locking as the other handle-to-zspage paths?
> zs_free() takes pool->lock before handle_to_obj() because zspage migration can
> update or move the object behind the handle. This helper does the same decode
> without the lock, so zswap's uncharge path can race migration and charge or
> uncharge the wrong node, or observe transient zspage state.
>

Can we just charge it to the page's node for now? Once Joshua's patch
series is in, we can correctly charge the node owning the data :)

FWIW, this is how these zswap entries are organized in the LRU too -
they follow the original page's node.



end of thread, other threads:[~2026-06-26 18:37 UTC | newest]

Thread overview: 12+ messages
2026-06-26 10:20 [RESEND PATCH v2 0/9] per-memcg-per-node kmem accounting Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 1/9] memcg: convert task->objcg to a per-node objcgs array Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 2/9] memcg: charge kmem pages and slab objects against per-node objcg Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 3/9] mm: percpu: fix obj_exts metadata charge size Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 4/9] mm: percpu: Split memcg charging and kmem accounting Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 5/9] mm: memcontrol: track MEMCG_KMEM per NUMA node Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 6/9] mm: percpu: per-node kmem accounting Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 7/9] mm: percpu: per-node kmem accounting for obj_exts metadata Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 8/9] mm: percpu: skip the per-cpu node walk on single-node systems Alexandre Ghiti
2026-06-26 10:20 ` [PATCH v2 9/9] mm: zswap: per-node kmem accounting for zswap/zsmalloc Alexandre Ghiti
2026-06-26 14:32   ` Usama Arif
2026-06-26 18:36     ` Nhat Pham
