* [PATCH 1/8] mm: memcontrol: propagate NMI slab stats to memcg vmstats
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
flush_nmi_stats() drains per-node NMI slab atomics into the per-node
lruvec_stats, but does not propagate them to the memcg-level vmstats.
This is inconsistent with account_slab_nmi_safe() which updates both,
so fix this by propagating the NMI slab stats to the memcg-level vmstats.
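As a minimal userspace model of the invariant the fix restores (the names
and the stats layout are illustrative, not the kernel's): an NMI-deferred
delta must land in the per-node lruvec stats *and* the memcg-level
vmstats, plus the two matching parent "pending" slots, mirroring what
mod_memcg_lruvec_state() does in the non-NMI path.
#include <stdio.h>
struct stats { long state; long pending; };
static void flush_delta(long delta,
			struct stats *lstats, struct stats *plstats,
			struct stats *vmstats, struct stats *pvmstats)
{
	lstats->state += delta;		/* per-node, this memcg */
	vmstats->state += delta;	/* memcg level (was missing) */
	if (plstats)
		plstats->pending += delta;	/* per-node, parent */
	if (pvmstats)
		pvmstats->pending += delta;	/* memcg level, parent */
}
int main(void)
{
	struct stats l = {0}, pl = {0}, v = {0}, pv = {0};
	flush_delta(4096, &l, &pl, &v, &pv);
	printf("lruvec=%ld vmstats=%ld pending=%ld/%ld\n",
	       l.state, v.state, pl.pending, pv.pending);
	return 0;
}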
Fixes: 940b01fc8dc1 ("memcg: nmi safe memcg stats for specific archs")
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
mm/memcontrol.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3d98ab41f1f..d81a76654b2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4341,16 +4341,22 @@ static void flush_nmi_stats(struct mem_cgroup *memcg, struct mem_cgroup *parent,
int index = memcg_stats_index(NR_SLAB_RECLAIMABLE_B);
lstats->state[index] += slab;
+ memcg->vmstats->state[index] += slab;
if (plstats)
plstats->state_pending[index] += slab;
+ if (parent)
+ parent->vmstats->state_pending[index] += slab;
}
if (atomic_read(&pn->slab_unreclaimable)) {
int slab = atomic_xchg(&pn->slab_unreclaimable, 0);
int index = memcg_stats_index(NR_SLAB_UNRECLAIMABLE_B);
lstats->state[index] += slab;
+ memcg->vmstats->state[index] += slab;
if (plstats)
plstats->state_pending[index] += slab;
+ if (parent)
+ parent->vmstats->state_pending[index] += slab;
}
}
}
--
2.54.0
* Re: [PATCH 1/8] mm: memcontrol: propagate NMI slab stats to memcg vmstats
From: Shakeel Butt @ 2026-05-11 22:49 UTC
To: Alexandre Ghiti
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups
On Mon, May 11, 2026 at 10:20:36PM +0200, Alexandre Ghiti wrote:
> flush_nmi_stats() drains per-node NMI slab atomics into the per-node
> lruvec_stats, but does not propagate them to the memcg-level vmstats.
>
> This is inconsistent with account_slab_nmi_safe() which updates both,
I think the above sentence needs clarification. Something like "For the
non-NMI case, account_slab_nmi_safe() calls mod_memcg_lruvec_state(), which
updates both the per-node lruvec_stats and the memcg-level vmstats, so
flush_nmi_stats() needs to flush to the per-node lruvec_stats as well as the
memcg-level vmstats, but at the moment the memcg-level vmstats flushing is
missing. Fix that".
> so fix this by propagating the NMI slab stats to the memcg-level vmstats.
>
> Fixes: 940b01fc8dc1 ("memcg: nmi safe memcg stats for specific archs")
> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
> ---
> mm/memcontrol.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c3d98ab41f1f..d81a76654b2c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4341,16 +4341,22 @@ static void flush_nmi_stats(struct mem_cgroup *memcg, struct mem_cgroup *parent,
> int index = memcg_stats_index(NR_SLAB_RECLAIMABLE_B);
>
> lstats->state[index] += slab;
> + memcg->vmstats->state[index] += slab;
> if (plstats)
> plstats->state_pending[index] += slab;
> + if (parent)
> + parent->vmstats->state_pending[index] += slab;
Nit: please keep all three added code lines together.
> }
> if (atomic_read(&pn->slab_unreclaimable)) {
> int slab = atomic_xchg(&pn->slab_unreclaimable, 0);
> int index = memcg_stats_index(NR_SLAB_UNRECLAIMABLE_B);
>
> lstats->state[index] += slab;
> + memcg->vmstats->state[index] += slab;
> if (plstats)
> plstats->state_pending[index] += slab;
> + if (parent)
> + parent->vmstats->state_pending[index] += slab;
Same here.
> }
> }
> }
With the commit message fixed and nits addressed, you can add:
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
and thanks for catching this issue.
* [PATCH 2/8] mm: percpu: charge obj_exts allocation with __GFP_ACCOUNT
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
This is a preparatory patch for upcoming per-memcg-per-node kmem
accounting.
pcpu allocations are always fully charged at once using
pcpu_obj_full_size(), which returns the size of the pcpu "metadata" plus
the pcpu "payload". But the metadata and the payload may not be allocated
on the same NUMA node, so charge the metadata independently from the
payload.
Do this by explicitly passing __GFP_ACCOUNT to the obj_exts allocation
and remove its accounting in pcpu_memcg_pre_alloc_hook().
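For illustration, a small userspace sketch of the size split
(PCPU_MIN_ALLOC_SIZE and the CPU count are assumed values, and
sizeof(void *) stands in for sizeof(struct obj_cgroup *)):
#include <stdio.h>
#define PCPU_MIN_ALLOC_SIZE	4	/* assumed */
#define NUM_POSSIBLE_CPUS	8	/* assumed */
int main(void)
{
	size_t size = 64;	/* requested per-cpu area, in bytes */
	/* Before: payload for every CPU plus the obj_cgroup metadata,
	 * all charged in one lump. */
	size_t meta = size / PCPU_MIN_ALLOC_SIZE * sizeof(void *);
	size_t full = size * NUM_POSSIBLE_CPUS + meta;
	/* After: payload only; the metadata is charged on its own
	 * node via __GFP_ACCOUNT when obj_exts is allocated. */
	size_t total = size * NUM_POSSIBLE_CPUS;
	printf("full=%zu total=%zu metadata=%zu\n", full, total, meta);
	return 0;
}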
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
mm/percpu-internal.h | 16 +++-------------
mm/percpu.c | 15 ++++++++-------
2 files changed, 11 insertions(+), 20 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 4b3d6ec43703..f01db026d213 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -144,22 +144,12 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk)
}
/**
- * pcpu_obj_full_size - helper to calculate size of each accounted object
+ * pcpu_obj_total_size - helper to calculate size of each accounted object
* @size: size of area to allocate in bytes
- *
- * For each accounted object there is an extra space which is used to store
- * obj_cgroup membership if kmemcg is not disabled. Charge it too.
*/
-static inline size_t pcpu_obj_full_size(size_t size)
+static inline size_t pcpu_obj_total_size(size_t size)
{
- size_t extra_size = 0;
-
-#ifdef CONFIG_MEMCG
- if (!mem_cgroup_kmem_disabled())
- extra_size += size / PCPU_MIN_ALLOC_SIZE * sizeof(struct obj_cgroup *);
-#endif
-
- return size * num_possible_cpus() + extra_size;
+ return size * num_possible_cpus();
}
#ifdef CONFIG_PERCPU_STATS
diff --git a/mm/percpu.c b/mm/percpu.c
index b0676b8054ed..13de6e099d96 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1460,7 +1460,8 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
if (need_pcpuobj_ext()) {
chunk->obj_exts =
pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) *
- sizeof(struct pcpuobj_ext), gfp);
+ sizeof(struct pcpuobj_ext),
+ gfp | __GFP_ACCOUNT);
if (!chunk->obj_exts)
goto objcg_fail;
}
@@ -1625,7 +1626,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
if (!objcg || obj_cgroup_is_root(objcg))
return true;
- if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
+ if (obj_cgroup_charge(objcg, gfp, pcpu_obj_total_size(size)))
return false;
*objcgp = objcg;
@@ -1645,10 +1646,10 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- pcpu_obj_full_size(size));
+ pcpu_obj_total_size(size));
rcu_read_unlock();
} else {
- obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
+ obj_cgroup_uncharge(objcg, pcpu_obj_total_size(size));
}
}
@@ -1664,11 +1665,11 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
return;
chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
- obj_cgroup_uncharge(objcg, pcpu_obj_full_size(size));
+ obj_cgroup_uncharge(objcg, pcpu_obj_total_size(size));
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- -pcpu_obj_full_size(size));
+ -pcpu_obj_total_size(size));
rcu_read_unlock();
obj_cgroup_put(objcg);
@@ -1897,7 +1898,7 @@ void __percpu *pcpu_alloc_noprof(size_t size, size_t align, bool reserved,
trace_percpu_alloc_percpu(_RET_IP_, reserved, is_atomic, size, align,
chunk->base_addr, off, ptr,
- pcpu_obj_full_size(size), gfp);
+ pcpu_obj_total_size(size), gfp);
pcpu_memcg_post_alloc_hook(objcg, chunk, off, size);
--
2.54.0
* [PATCH 3/8] mm: percpu: Split memcg charging and kmem accounting
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
This is a preparatory patch for the upcoming per-memcg-per-node kmem
accounting.
Percpu allocations charge memory before knowing which NUMA nodes the
pages will land on. So we need to decouple the memcg charging from the
kmem accounting:
1. In the pre-alloc hook, obj_cgroup_precharge() reserves pages for
memcg limit enforcement without updating kmem stats.
2. In the post-alloc hook, obj_cgroup_account_kmem() accounts kmem
and places the sub-page remainder into the obj stock after the
allocation succeeds.
Because of that decoupling, the precharge function must not rely on the
stock and must always charge all the pages that will be accounted once
the allocation has happened. That means we may temporarily overcharge
the memcg, but draining the obj_stock will get things back to normal.
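A rough userspace model of the new flow (the page size and sizes are
illustrative): reserve whole pages up front, account them as kmem once
the allocation has landed, and push the sub-page remainder back into the
(modeled) obj stock.
#include <stdio.h>
#define PAGE_SIZE 4096UL
int main(void)
{
	size_t total = 3 * PAGE_SIZE + 1234;	/* bytes the allocation uses */
	size_t aligned = (total + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
	unsigned int nr_pages = aligned / PAGE_SIZE;
	/* pre-alloc hook: obj_cgroup_precharge(objcg, gfp, nr_pages) */
	printf("precharge %u pages\n", nr_pages);
	/* post-alloc hook, on success:
	 * obj_cgroup_account_kmem(objcg, nr_pages), then the remainder
	 * goes into the obj stock via obj_cgroup_uncharge(). */
	printf("account %u pages, refill stock with %zu bytes\n",
	       nr_pages, aligned - total);
	/* on failure: obj_cgroup_unprecharge(objcg, nr_pages) */
	return 0;
}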
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
include/linux/memcontrol.h | 4 +++
mm/memcontrol.c | 50 ++++++++++++++++++++++++++++++++++++++
mm/percpu.c | 15 +++++++++---
3 files changed, 66 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..568ab08f42af 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1707,6 +1707,10 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+int obj_cgroup_precharge(struct obj_cgroup *objcg, gfp_t gfp,
+ unsigned int nr_pages);
+void obj_cgroup_unprecharge(struct obj_cgroup *objcg, unsigned int nr_pages);
+void obj_cgroup_account_kmem(struct obj_cgroup *objcg, unsigned int nr_pages);
extern struct static_key_false memcg_bpf_enabled_key;
static inline bool memcg_bpf_enabled(void)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d81a76654b2c..aaaa6a8b9f15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3405,6 +3405,56 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
refill_obj_stock(objcg, size, true);
}
+/*
+ * obj_cgroup_account_kmem - account KMEM for nr_pages
+ *
+ * Called after obj_cgroup_precharge() when the allocation succeeds.
+ * Accounts KMEM for nr_pages on the objcg's node.
+ */
+void obj_cgroup_account_kmem(struct obj_cgroup *objcg, unsigned int nr_pages)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ account_kmem_nmi_safe(memcg, nr_pages);
+ memcg1_account_kmem(memcg, nr_pages);
+ rcu_read_unlock();
+}
+
+/*
+ * obj_cgroup_precharge - reserve pages without KMEM accounting
+ *
+ * Reserves page counter credits for limit enforcement. Does not update
+ * KMEM stats or the per-CPU obj stock, because precharge decouples
+ * the page counter charge from KMEM accounting (which happens later
+ * per-node via obj_cgroup_account_kmem).
+ *
+ * On failure, use obj_cgroup_unprecharge() to release the reservation.
+ */
+int obj_cgroup_precharge(struct obj_cgroup *objcg, gfp_t gfp,
+ unsigned int nr_pages)
+{
+ struct mem_cgroup *memcg;
+ int ret;
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ ret = try_charge_memcg(memcg, gfp, nr_pages);
+ css_put(&memcg->css);
+
+ return ret;
+}
+
+void obj_cgroup_unprecharge(struct obj_cgroup *objcg, unsigned int nr_pages)
+{
+ struct mem_cgroup *memcg;
+
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ if (!mem_cgroup_is_root(memcg))
+ refill_stock(memcg, nr_pages);
+ css_put(&memcg->css);
+}
+
static inline size_t obj_full_size(struct kmem_cache *s)
{
/*
diff --git a/mm/percpu.c b/mm/percpu.c
index 13de6e099d96..7c67dc2e4878 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1626,7 +1626,8 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
if (!objcg || obj_cgroup_is_root(objcg))
return true;
- if (obj_cgroup_charge(objcg, gfp, pcpu_obj_total_size(size)))
+ if (obj_cgroup_precharge(objcg, gfp,
+ PAGE_ALIGN(pcpu_obj_total_size(size)) >> PAGE_SHIFT))
return false;
*objcgp = objcg;
@@ -1641,15 +1642,23 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
return;
if (likely(chunk && chunk->obj_exts)) {
+ size_t total = pcpu_obj_total_size(size);
+ size_t remainder = PAGE_ALIGN(total) - total;
+
obj_cgroup_get(objcg);
chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- pcpu_obj_total_size(size));
+ total);
rcu_read_unlock();
+
+ obj_cgroup_account_kmem(objcg, PAGE_ALIGN(total) >> PAGE_SHIFT);
+ if (remainder)
+ obj_cgroup_uncharge(objcg, remainder);
} else {
- obj_cgroup_uncharge(objcg, pcpu_obj_total_size(size));
+ obj_cgroup_unprecharge(objcg,
+ PAGE_ALIGN(pcpu_obj_total_size(size)) >> PAGE_SHIFT);
}
}
--
2.54.0
* [PATCH 4/8] mm: memcontrol: track MEMCG_KMEM per NUMA node
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
This patch gets rid of MEMCG_KMEM and wires up all the "generic" functions
by introducing per-node obj_cgroup objects.
Note that it does not convert the kmem users to proper per-memcg-per-node
accounting yet; this is done in upcoming patches.
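A tiny userspace model of the routing this enables (two nodes, fields
trimmed down; it mirrors memcg->nodeinfo[nid]->objcg only in spirit):
#include <stdio.h>
#define MAX_NODES 2
struct obj_cgroup { int nid; long nr_kmem; };
struct mem_cgroup {
	/* stands in for memcg->nodeinfo[nid]->objcg */
	struct obj_cgroup node_objcg[MAX_NODES];
};
static struct obj_cgroup *objcg_for_nid(struct mem_cgroup *memcg, int nid)
{
	return &memcg->node_objcg[nid];
}
int main(void)
{
	struct mem_cgroup memcg = {
		.node_objcg = { { .nid = 0 }, { .nid = 1 } },
	};
	/* a 4-page kmem charge that landed on node 1 */
	objcg_for_nid(&memcg, 1)->nr_kmem += 4;
	for (int nid = 0; nid < MAX_NODES; nid++)
		printf("node %d kmem: %ld pages\n", nid,
		       memcg.node_objcg[nid].nr_kmem);
	return 0;
}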
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
include/linux/memcontrol.h | 23 ++++++++++----
include/linux/mmzone.h | 1 +
mm/memcontrol.c | 64 ++++++++++++++++++++++++--------------
mm/vmstat.c | 1 +
4 files changed, 59 insertions(+), 30 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 568ab08f42af..17cf823160e4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,7 +35,6 @@ enum memcg_stat_item {
MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
MEMCG_PERCPU_B,
- MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
MEMCG_ZSWAP_INCOMP,
@@ -126,9 +125,10 @@ struct mem_cgroup_per_node {
struct list_head objcg_list;
#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
- /* slab stats for nmi context */
+ /* slab and kmem stats for nmi context */
atomic_t slab_reclaimable;
atomic_t slab_unreclaimable;
+ atomic_t kmem;
#endif
};
@@ -190,6 +190,7 @@ struct obj_cgroup {
struct rcu_head rcu;
};
bool is_root;
+ int nid;
};
/*
@@ -254,10 +255,6 @@ struct mem_cgroup {
atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS];
-#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
- /* MEMCG_KMEM for nmi context */
- atomic_t kmem_stat;
-#endif
/*
* Hint of reclaim pressure for socket memroy management. Note
* that this indicator should NOT be used in legacy cgroup mode
@@ -776,6 +773,20 @@ static inline void obj_cgroup_put(struct obj_cgroup *objcg)
percpu_ref_put(&objcg->refcnt);
}
+static inline struct obj_cgroup *obj_cgroup_get_nid(struct obj_cgroup *objcg,
+ int nid)
+{
+ struct obj_cgroup *nid_objcg;
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ nid_objcg = rcu_dereference(memcg->nodeinfo[nid]->objcg);
+ rcu_read_unlock();
+
+ return nid_objcg;
+}
+
static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg)
{
return !memcg || css_tryget(&memcg->css);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da5..97eb168fd7f3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -326,6 +326,7 @@ enum node_stat_item {
#ifdef CONFIG_HUGETLB_PAGE
NR_HUGETLB,
#endif
+ NR_KMEM,
NR_BALLOON_PAGES,
NR_KERNEL_FILE_PAGES,
NR_GPU_ACTIVE, /* Pages assigned to GPU objects */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aaaa6a8b9f15..979a847e542a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,6 +136,7 @@ bool mem_cgroup_kmem_disabled(void)
}
static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages);
+static void mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, int val);
static void obj_cgroup_release(struct percpu_ref *ref)
{
@@ -170,9 +171,11 @@ static void obj_cgroup_release(struct percpu_ref *ref)
if (nr_pages) {
struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
memcg = get_mem_cgroup_from_objcg(objcg);
- mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(objcg->nid));
+ mod_lruvec_state(lruvec, NR_KMEM, -nr_pages);
memcg1_account_kmem(memcg, -nr_pages);
if (!mem_cgroup_is_root(memcg))
memcg_uncharge(memcg, nr_pages);
@@ -423,13 +426,13 @@ static const unsigned int memcg_node_stat_items[] = {
#ifdef CONFIG_HUGETLB_PAGE
NR_HUGETLB,
#endif
+ NR_KMEM,
};
static const unsigned int memcg_stat_items[] = {
MEMCG_SWAP,
MEMCG_SOCK,
MEMCG_PERCPU_B,
- MEMCG_KMEM,
MEMCG_ZSWAP_B,
MEMCG_ZSWAPPED,
MEMCG_ZSWAP_INCOMP,
@@ -1537,7 +1540,7 @@ struct memory_stat {
static const struct memory_stat memory_stats[] = {
{ "anon", NR_ANON_MAPPED },
{ "file", NR_FILE_PAGES },
- { "kernel", MEMCG_KMEM },
+ { "kernel", NR_KMEM },
{ "kernel_stack", NR_KERNEL_STACK_KB },
{ "pagetables", NR_PAGETABLE },
{ "sec_pagetables", NR_SECONDARY_PAGETABLE },
@@ -3004,20 +3007,26 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
}
#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
-static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
+static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int nid, int val)
{
if (likely(!in_nmi())) {
- mod_memcg_state(memcg, MEMCG_KMEM, val);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+ mod_lruvec_state(lruvec, NR_KMEM, val);
} else {
+ struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
+
/* preemption is disabled in_nmi(). */
css_rstat_updated(&memcg->css, smp_processor_id());
- atomic_add(val, &memcg->kmem_stat);
+ atomic_add(val, &pn->kmem);
}
}
#else
-static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int val)
+static inline void account_kmem_nmi_safe(struct mem_cgroup *memcg, int nid, int val)
{
- mod_memcg_state(memcg, MEMCG_KMEM, val);
+ struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+
+ mod_lruvec_state(lruvec, NR_KMEM, val);
}
#endif
@@ -3033,7 +3042,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
memcg = get_mem_cgroup_from_objcg(objcg);
- account_kmem_nmi_safe(memcg, -nr_pages);
+ account_kmem_nmi_safe(memcg, objcg->nid, -nr_pages);
memcg1_account_kmem(memcg, -nr_pages);
if (!mem_cgroup_is_root(memcg))
refill_stock(memcg, nr_pages);
@@ -3061,7 +3070,7 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
if (ret)
goto out;
- account_kmem_nmi_safe(memcg, nr_pages);
+ account_kmem_nmi_safe(memcg, objcg->nid, nr_pages);
memcg1_account_kmem(memcg, nr_pages);
out:
css_put(&memcg->css);
@@ -3238,10 +3247,11 @@ static void drain_obj_stock(struct obj_stock_pcp *stock)
if (nr_pages) {
struct mem_cgroup *memcg;
+ struct lruvec *lruvec;
memcg = get_mem_cgroup_from_objcg(old);
-
- mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
+ lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(old->nid));
+ mod_lruvec_state(lruvec, NR_KMEM, -nr_pages);
memcg1_account_kmem(memcg, -nr_pages);
if (!mem_cgroup_is_root(memcg))
memcg_uncharge(memcg, nr_pages);
@@ -3250,7 +3260,7 @@ static void drain_obj_stock(struct obj_stock_pcp *stock)
}
/*
- * The leftover is flushed to the centralized per-memcg value.
+ * The leftover is flushed to the per-node per-memcg value.
* On the next attempt to refill obj stock it will be moved
* to a per-cpu stock (probably, on an other CPU), see
* refill_obj_stock().
@@ -3417,7 +3427,7 @@ void obj_cgroup_account_kmem(struct obj_cgroup *objcg, unsigned int nr_pages)
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
- account_kmem_nmi_safe(memcg, nr_pages);
+ account_kmem_nmi_safe(memcg, objcg->nid, nr_pages);
memcg1_account_kmem(memcg, nr_pages);
rcu_read_unlock();
}
@@ -4165,6 +4175,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (unlikely(mem_cgroup_is_root(memcg)))
objcg->is_root = true;
+ objcg->nid = nid;
objcg->memcg = memcg;
rcu_assign_pointer(memcg->nodeinfo[nid]->objcg, objcg);
obj_cgroup_get(objcg);
@@ -4369,15 +4380,6 @@ static void flush_nmi_stats(struct mem_cgroup *memcg, struct mem_cgroup *parent,
{
int nid;
- if (atomic_read(&memcg->kmem_stat)) {
- int kmem = atomic_xchg(&memcg->kmem_stat, 0);
- int index = memcg_stats_index(MEMCG_KMEM);
-
- memcg->vmstats->state[index] += kmem;
- if (parent)
- parent->vmstats->state_pending[index] += kmem;
- }
-
for_each_node_state(nid, N_MEMORY) {
struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
struct lruvec_stats *lstats = pn->lruvec_stats;
@@ -4408,6 +4410,18 @@ static void flush_nmi_stats(struct mem_cgroup *memcg, struct mem_cgroup *parent,
if (parent)
parent->vmstats->state_pending[index] += slab;
}
+ if (atomic_read(&pn->kmem)) {
+ int kmem = atomic_xchg(&pn->kmem, 0);
+ int index = memcg_stats_index(NR_KMEM);
+
+ mod_node_page_state(NODE_DATA(nid), NR_KMEM, kmem);
+ lstats->state[index] += kmem;
+ memcg->vmstats->state[index] += kmem;
+ if (plstats)
+ plstats->state_pending[index] += kmem;
+ if (parent)
+ parent->vmstats->state_pending[index] += kmem;
+ }
}
}
#else
@@ -5173,7 +5187,9 @@ static void uncharge_batch(const struct uncharge_gather *ug)
if (ug->nr_memory) {
memcg_uncharge(memcg, ug->nr_memory);
if (ug->nr_kmem) {
- mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
+ struct lruvec *lruvec =
+ mem_cgroup_lruvec(memcg, NODE_DATA(ug->objcg->nid));
+ mod_lruvec_state(lruvec, NR_KMEM, -ug->nr_kmem);
memcg1_account_kmem(memcg, -ug->nr_kmem);
}
memcg1_oom_recover(memcg);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..d55437d1852e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1293,6 +1293,7 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_HUGETLB_PAGE
[I(NR_HUGETLB)] = "nr_hugetlb",
#endif
+ [I(NR_KMEM)] = "nr_kmem",
[I(NR_BALLOON_PAGES)] = "nr_balloon_pages",
[I(NR_KERNEL_FILE_PAGES)] = "nr_kernel_file_pages",
[I(NR_GPU_ACTIVE)] = "nr_gpu_active",
--
2.54.0
* [PATCH 5/8] mm: memcontrol: per-node kmem accounting for page charges
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
Update __memcg_kmem_charge_page() to use per-node obj_cgroup for
correct NUMA attribution of NR_KMEM.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
mm/memcontrol.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 979a847e542a..66d2beb1c974 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3105,17 +3105,19 @@ static void page_set_objcg(struct page *page, const struct obj_cgroup *objcg)
*/
int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
{
- struct obj_cgroup *objcg;
+ struct obj_cgroup *objcg, *nid_objcg;
+ int nid = page_to_nid(page);
int ret = 0;
objcg = current_obj_cgroup();
if (objcg && !obj_cgroup_is_root(objcg)) {
- ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
- if (!ret) {
- obj_cgroup_get(objcg);
- page_set_objcg(page, objcg);
- return 0;
- }
+ nid_objcg = obj_cgroup_get_nid(objcg, nid);
+ ret = obj_cgroup_charge_pages(nid_objcg, gfp, 1 << order);
+ if (ret)
+ return ret;
+ obj_cgroup_get(nid_objcg);
+ page_set_objcg(page, nid_objcg);
+ return 0;
}
return ret;
}
--
2.54.0
* [PATCH 6/8] mm: slab: per-node kmem accounting for slab
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
Update slab hook to use per-node obj_cgroup for correct NUMA
attribution of NR_KMEM.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
mm/memcontrol.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 66d2beb1c974..cefb335c990e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3521,14 +3521,18 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
unsigned long obj_exts;
struct slabobj_ext *obj_ext;
struct obj_stock_pcp *stock;
+ struct obj_cgroup *nid_objcg;
+ int nid;
slab = virt_to_slab(p[i]);
+ nid = slab_pgdat(slab)->node_id;
if (!slab_obj_exts(slab) &&
alloc_slab_obj_exts(slab, s, flags, false)) {
continue;
}
+ nid_objcg = obj_cgroup_get_nid(objcg, nid);
/*
* if we fail and size is 1, memcg_alloc_abort_single() will
* just free the object, which is ok as we have not assigned
@@ -3541,17 +3545,17 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
* between iterations, with a more complicated undo
*/
stock = trylock_stock();
- if (!stock || !__consume_obj_stock(objcg, stock, obj_size)) {
+ if (!stock || !__consume_obj_stock(nid_objcg, stock, obj_size)) {
size_t remainder;
unlock_stock(stock);
- if (__obj_cgroup_charge(objcg, flags, obj_size, &remainder))
+ if (__obj_cgroup_charge(nid_objcg, flags, obj_size, &remainder))
return false;
stock = trylock_stock();
if (remainder)
- __refill_obj_stock(objcg, stock, remainder, false);
+ __refill_obj_stock(nid_objcg, stock, remainder, false);
}
- __account_obj_stock(objcg, stock, obj_size,
+ __account_obj_stock(nid_objcg, stock, obj_size,
slab_pgdat(slab), cache_vmstat_idx(s));
unlock_stock(stock);
@@ -3559,8 +3563,8 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
get_slab_obj_exts(obj_exts);
off = obj_to_index(s, slab, p[i]);
obj_ext = slab_obj_ext(slab, obj_exts, off);
- obj_cgroup_get(objcg);
- obj_ext->objcg = objcg;
+ obj_cgroup_get(nid_objcg);
+ obj_ext->objcg = nid_objcg;
put_slab_obj_exts(obj_exts);
}
--
2.54.0
* [PATCH 7/8] mm: percpu: per-node kmem accounting using local credit
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
Now that the memcg charging is decoupled from the kmem accounting, we
can't use the obj_stock to handle the percpu accounting, because our
precharged pages may get drained. That's a problem:
pcpu_memcg_post_alloc_hook() assumes that enough pages are already
charged, and it cannot charge more pages there since that may fail, which
would defeat the purpose of the precharge.
So instead of using the obj_stock, use a local per-node credit that serves
the same purpose and whose surplus eventually gets refilled into the
stock.
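A userspace model of the credit walk (the node layout, page size and
sizes are made up): a kmem page is accounted to a node only when that
node's remaining credit cannot cover the next chunk, unused precharge is
returned, and leftover credit is refilled into the stock.
#include <stdio.h>
#define PAGE_SIZE 4096UL
int main(void)
{
	size_t size = 1000;		/* per-cpu area, fits in one page */
	int ncpus = 2, nnodes = 2;
	int area_nid[2] = { 0, 0 };	/* node of each CPU's area */
	/* total < PAGE_SIZE: one page per possible node is precharged */
	unsigned int precharged = nnodes;
	size_t credit[2] = { 0, 0 };
	unsigned int pages_used = 0;
	for (int cpu = 0; cpu < ncpus; cpu++) {
		int nid = area_nid[cpu];
		if (credit[nid] < size) {
			credit[nid] += PAGE_SIZE;	/* account one page */
			pages_used++;
		}
		credit[nid] -= size;
	}
	printf("accounted %u page(s), unprecharge %u, refill %zu bytes\n",
	       pages_used, precharged - pages_used, credit[0] + credit[1]);
	return 0;
}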
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
mm/percpu.c | 88 +++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 76 insertions(+), 12 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 7c67dc2e4878..64b327fe3c26 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1614,6 +1614,16 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr)
}
#ifdef CONFIG_MEMCG
+static unsigned int pcpu_memcg_nr_precharge_pages(size_t size)
+{
+ size_t total = pcpu_obj_total_size(size);
+
+ if (total < PAGE_SIZE)
+ return num_possible_nodes();
+
+ return PAGE_ALIGN(total) >> PAGE_SHIFT;
+}
+
static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
struct obj_cgroup **objcgp)
{
@@ -1626,8 +1636,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
if (!objcg || obj_cgroup_is_root(objcg))
return true;
- if (obj_cgroup_precharge(objcg, gfp,
- PAGE_ALIGN(pcpu_obj_total_size(size)) >> PAGE_SHIFT))
+ if (obj_cgroup_precharge(objcg, gfp, pcpu_memcg_nr_precharge_pages(size)))
return false;
*objcgp = objcg;
@@ -1642,29 +1651,68 @@ static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg,
return;
if (likely(chunk && chunk->obj_exts)) {
- size_t total = pcpu_obj_total_size(size);
- size_t remainder = PAGE_ALIGN(total) - total;
+ unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+ unsigned int precharge_pages = pcpu_memcg_nr_precharge_pages(size);
+ unsigned int pages_used = 0;
+ unsigned int node_credit[MAX_NUMNODES] = { 0 };
+ unsigned int cpu;
+ int nid;
obj_cgroup_get(objcg);
chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = objcg;
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
- total);
+ pcpu_obj_total_size(size));
rcu_read_unlock();
- obj_cgroup_account_kmem(objcg, PAGE_ALIGN(total) >> PAGE_SHIFT);
- if (remainder)
- obj_cgroup_uncharge(objcg, remainder);
+ for_each_possible_cpu(cpu) {
+ unsigned int i;
+
+ for (i = 0; i < nr_pages; i++) {
+ void *addr = (void *)pcpu_chunk_addr(chunk, cpu,
+ PFN_DOWN(off) + i);
+ size_t page_sz = i < nr_pages - 1 ?
+ PAGE_SIZE : size - (nr_pages - 1) * PAGE_SIZE;
+
+ nid = page_to_nid(pcpu_addr_to_page(addr));
+
+ if (node_credit[nid] < page_sz) {
+ struct obj_cgroup *nid_objcg;
+
+ nid_objcg = obj_cgroup_get_nid(objcg, nid);
+ obj_cgroup_account_kmem(nid_objcg, 1);
+ node_credit[nid] += PAGE_SIZE;
+ pages_used++;
+ }
+
+ node_credit[nid] -= page_sz;
+ }
+ }
+
+ /* Return unused precharged pages */
+ if (pages_used < precharge_pages)
+ obj_cgroup_unprecharge(objcg, precharge_pages - pages_used);
+
+ /* Put leftover per-node credit into stock */
+ for_each_online_node(nid) {
+ if (node_credit[nid] > 0) {
+ struct obj_cgroup *nid_objcg;
+
+ nid_objcg = obj_cgroup_get_nid(objcg, nid);
+ obj_cgroup_uncharge(nid_objcg, node_credit[nid]);
+ }
+ }
} else {
- obj_cgroup_unprecharge(objcg,
- PAGE_ALIGN(pcpu_obj_total_size(size)) >> PAGE_SHIFT);
+ obj_cgroup_unprecharge(objcg, pcpu_memcg_nr_precharge_pages(size));
}
}
static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
{
+ unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
struct obj_cgroup *objcg;
+ unsigned int cpu;
if (unlikely(!chunk->obj_exts))
return;
@@ -1674,13 +1722,29 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
return;
chunk->obj_exts[off >> PCPU_MIN_ALLOC_SHIFT].cgroup = NULL;
- obj_cgroup_uncharge(objcg, pcpu_obj_total_size(size));
-
rcu_read_lock();
mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B,
-pcpu_obj_total_size(size));
rcu_read_unlock();
+ for_each_possible_cpu(cpu) {
+ unsigned int i;
+
+ for (i = 0; i < nr_pages; i++) {
+ void *addr = (void *)pcpu_chunk_addr(chunk, cpu,
+ PFN_DOWN(off) + i);
+ struct obj_cgroup *nid_objcg;
+ int nid;
+ size_t unc;
+
+ nid = page_to_nid(pcpu_addr_to_page(addr));
+ nid_objcg = obj_cgroup_get_nid(objcg, nid);
+ unc = i < nr_pages - 1 ?
+ PAGE_SIZE : size - (nr_pages - 1) * PAGE_SIZE;
+ obj_cgroup_uncharge(nid_objcg, unc);
+ }
+ }
+
obj_cgroup_put(objcg);
}
--
2.54.0
* [PATCH 8/8] mm: zswap: per-node kmem accounting for zswap/zsmalloc
From: Alexandre Ghiti @ 2026-05-11 20:20 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, Dennis Zhou, Tejun Heo, Christoph Lameter,
Vlastimil Babka, Yosry Ahmed, Nhat Pham, Sergey Senozhatsky,
Chengming Zhou, Suren Baghdasaryan, Qi Zheng, David Hildenbrand,
Lorenzo Stoakes, Minchan Kim, Mike Rapoport, Axel Rasmussen,
Barry Song, Kairui Song, Wei Xu, Yuanchu Xie, Liam R . Howlett,
Joshua Hahn, linux-mm, linux-kernel, cgroups, Alexandre Ghiti
Update zswap and zsmalloc to use per-node obj_cgroup for kmem
accounting, attributing compressed page charges to the correct
NUMA node.
Note that this is incomplete: it does not correctly account for entries
that straddle two pages, since those pages may sit on different nodes.
This will be handled properly by Joshua in a separate series [1].
Link: https://lore.kernel.org/linux-mm/20260311195153.4013476-1-joshua.hahnjy@gmail.com/ [1]
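To make the caveat concrete, a userspace sketch (the offsets, lengths and
node placement are invented): the whole entry is charged to the node of
the page its handle resolves to, even when the compressed bytes spill
onto a page on another node.
#include <stdio.h>
#define PAGE_SIZE 4096UL
int main(void)
{
	size_t offset = 3500, len = 1200;	/* compressed entry */
	int first_nid = 0, next_nid = 1;	/* nodes of the two pages */
	size_t on_first = PAGE_SIZE - offset < len ?
			  PAGE_SIZE - offset : len;
	/* the handle-based lookup sees the first page only */
	printf("charged to node %d: %zu bytes\n", first_nid, len);
	printf("actually on node %d: %zu, spilled to node %d: %zu\n",
	       first_nid, on_first, next_nid, len - on_first);
	return 0;
}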
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
---
include/linux/zsmalloc.h | 2 ++
mm/zsmalloc.c | 11 +++++++++++
mm/zswap.c | 9 +++++++--
3 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 478410c880b1..30427f3fe232 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -50,6 +50,8 @@ void zs_obj_read_sg_end(struct zs_pool *pool, unsigned long handle);
void zs_obj_write(struct zs_pool *pool, unsigned long handle,
void *handle_mem, size_t mem_len);
+int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle);
+
extern const struct movable_operations zsmalloc_mops;
#endif
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 63128ddb7959..7b45c67e51f1 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1381,6 +1381,17 @@ static void obj_free(int class_size, unsigned long obj)
mod_zspage_inuse(zspage, -1);
}
+int zs_handle_to_nid(struct zs_pool *pool, unsigned long handle)
+{
+ unsigned long obj;
+ struct zpdesc *zpdesc;
+
+ obj = handle_to_obj(handle);
+ obj_to_zpdesc(obj, &zpdesc);
+ return page_to_nid(zpdesc_page(zpdesc));
+}
+EXPORT_SYMBOL(zs_handle_to_nid);
+
void zs_free(struct zs_pool *pool, unsigned long handle)
{
struct zspage *zspage;
diff --git a/mm/zswap.c b/mm/zswap.c
index 4b5149173b0e..e3a9a294f14b 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1443,8 +1443,13 @@ static bool zswap_store_page(struct page *page,
*/
zswap_pool_get(pool);
if (objcg) {
- obj_cgroup_get(objcg);
- obj_cgroup_charge_zswap(objcg, entry->length);
+ struct obj_cgroup *nid_objcg;
+ int nid = zs_handle_to_nid(pool->zs_pool, entry->handle);
+
+ nid_objcg = obj_cgroup_get_nid(objcg, nid);
+ obj_cgroup_charge_zswap(nid_objcg, entry->length);
+ obj_cgroup_get(nid_objcg);
+ objcg = nid_objcg;
}
atomic_long_inc(&zswap_stored_pages);
if (entry->length == PAGE_SIZE)
--
2.54.0