[PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs
@ 2026-05-20  5:29 Shakeel Butt
  0 siblings, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-20  5:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node so that reparenting LRU folios can take per-node lru locks. As a
side effect, the per-CPU obj_stock_pcp -- which caches a single
cached_objcg pointer -- thrashes on workloads where threads of the
same memcg run on different NUMA nodes. The kernel test robot reported
a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern.

Commit d0211878ce06 ("memcg: cache obj_stock by memcg, not by objcg
pointer") landed as a temporary fix by treating sibling per-node
objcgs as equivalent for the cache lookup, intended to be reverted
once per-node kmem accounting is introduced. This series takes a more
general approach: cache multiple objcgs per CPU using the multi-slot
pattern memcg_stock_pcp already uses, so the per-node objcg variants
of one memcg can all coexist in the stock without ever forcing a
drain. The temporary fix can then be reverted.

To avoid increasing the per-CPU cache footprint, the first three
patches shrink the existing single-slot obj_stock_pcp fields.
The final patch converts cached_objcg and nr_bytes into
NR_OBJ_STOCK=5 slot arrays and reorders the struct so the entire
consume/refill/account hot path fits within a single 64-byte cache
line on non-debug 64-bit builds (verified with pahole).

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Tested-by: kernel test robot <oliver.sang@intel.com>

Shakeel Butt (4):
  memcg: store node_id instead of pglist_data pointer
  memcg: uint16_t for nr_bytes in obj_stock_pcp
  memcg: int16_t for cached slab stats
  memcg: multi objcg charge support

 mm/memcontrol.c | 214 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 157 insertions(+), 57 deletions(-)

--
2.53.0-Meta

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs
@ 2026-05-20  5:31 Shakeel Butt
  2026-05-20  5:31 ` [PATCH 1/4] memcg: store node_id instead of pglist_data pointer Shakeel Butt
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-20  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node so that reparenting LRU folios can take per-node lru locks. As a
side effect, the per-CPU obj_stock_pcp -- which caches a single
cached_objcg pointer -- thrashes on workloads where threads of the
same memcg run on different NUMA nodes. The kernel test robot reported
a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern.

Commit d0211878ce06 ("memcg: cache obj_stock by memcg, not by objcg
pointer") landed as a temporary fix by treating sibling per-node
objcgs as equivalent for the cache lookup, intended to be reverted
once per-node kmem accounting is introduced. This series takes a more
general approach: cache multiple objcgs per CPU using the multi-slot
pattern memcg_stock_pcp already uses, so the per-node objcg variants
of one memcg can all coexist in the stock without ever forcing a
drain. The temporary fix can then be reverted.

To avoid increasing the per-CPU cache footprint, the first three
patches shrink the existing single-slot obj_stock_pcp fields.
The final patch converts cached_objcg and nr_bytes into
NR_OBJ_STOCK=5 slot arrays and reorders the struct so the entire
consume/refill/account hot path fits within a single 64-byte cache
line on non-debug 64-bit builds (verified with pahole).

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Tested-by: kernel test robot <oliver.sang@intel.com>

Shakeel Butt (4):
  memcg: store node_id instead of pglist_data pointer
  memcg: uint16_t for nr_bytes in obj_stock_pcp
  memcg: int16_t for cached slab stats
  memcg: multi objcg charge support

 mm/memcontrol.c | 214 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 157 insertions(+), 57 deletions(-)

--
2.53.0-Meta

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] memcg: store node_id instead of pglist_data pointer
  2026-05-20  5:31 [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs Shakeel Butt
@ 2026-05-20  5:31 ` Shakeel Butt
  2026-05-20  6:01   ` Harry Yoo
  2026-05-20  6:13   ` Muchun Song
  2026-05-20  5:31 ` [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp Shakeel Butt
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-20  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

The struct obj_stock_pcp stores a pointer to pglist_data for the slab
stats cached on the cpu. On 64-bit machines, this costs 8 bytes. The
pointer is not strictly required: NODE_DATA() can recover it from the
node id. Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE
as the "no stats cached" sentinel.

At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is
plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
---
 mm/memcontrol.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8caeb7ccaa3..d7c162946719 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2021,7 +2021,7 @@ struct obj_stock_pcp {
 	local_trylock_t lock;
 	unsigned int nr_bytes;
 	struct obj_cgroup *cached_objcg;
-	struct pglist_data *cached_pgdat;
+	int16_t node_id;
 	int nr_slab_reclaimable_b;
 	int nr_slab_unreclaimable_b;
 
@@ -2031,6 +2031,7 @@ struct obj_stock_pcp {
 
 static DEFINE_PER_CPU_ALIGNED(struct obj_stock_pcp, obj_stock) = {
 	.lock = INIT_LOCAL_TRYLOCK(lock),
+	.node_id = NUMA_NO_NODE,
 };
 
 static DEFINE_MUTEX(percpu_charge_mutex);
@@ -3159,6 +3160,13 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 {
 	int *bytes;
 
+	/*
+	 * Though at the moment MAX_NUMNODES <= 1024 in all archs but let's make
+	 * sure it does not exceed S16_MAX otherwise we need to fix node_id type
+	 * in struct obj_stock_pcp.
+	 */
+	BUILD_BUG_ON(MAX_NUMNODES >= S16_MAX);
+
 	if (!stock || READ_ONCE(stock->cached_objcg) != objcg)
 		goto direct;
 
@@ -3166,9 +3174,11 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 	 * Save vmstat data in stock and skip vmstat array update unless
 	 * accumulating over a page of vmstat data or when pgdat changes.
 	 */
-	if (stock->cached_pgdat != pgdat) {
+	if (stock->node_id == NUMA_NO_NODE) {
+		stock->node_id = pgdat->node_id;
+	} else if (stock->node_id != pgdat->node_id) {
 		/* Flush the existing cached vmstat data */
-		struct pglist_data *oldpg = stock->cached_pgdat;
+		struct pglist_data *oldpg = NODE_DATA(stock->node_id);
 
 		if (stock->nr_slab_reclaimable_b) {
 			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B,
@@ -3180,7 +3190,7 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
 		}
-		stock->cached_pgdat = pgdat;
+		stock->node_id = pgdat->node_id;
 	}
 
 	bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? &stock->nr_slab_reclaimable_b
@@ -3276,19 +3286,21 @@ static void drain_obj_stock(struct obj_stock_pcp *stock)
 	 * Flush the vmstat data in current stock
 	 */
 	if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) {
+		struct pglist_data *oldpg = NODE_DATA(stock->node_id);
+
 		if (stock->nr_slab_reclaimable_b) {
-			mod_objcg_mlstate(old, stock->cached_pgdat,
+			mod_objcg_mlstate(old, oldpg,
 					  NR_SLAB_RECLAIMABLE_B,
 					  stock->nr_slab_reclaimable_b);
 			stock->nr_slab_reclaimable_b = 0;
 		}
 		if (stock->nr_slab_unreclaimable_b) {
-			mod_objcg_mlstate(old, stock->cached_pgdat,
+			mod_objcg_mlstate(old, oldpg,
 					  NR_SLAB_UNRECLAIMABLE_B,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
 		}
-		stock->cached_pgdat = NULL;
+		stock->node_id = NUMA_NO_NODE;
 	}
 
 	WRITE_ONCE(stock->cached_objcg, NULL);
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp
  2026-05-20  5:31 [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs Shakeel Butt
  2026-05-20  5:31 ` [PATCH 1/4] memcg: store node_id instead of pglist_data pointer Shakeel Butt
@ 2026-05-20  5:31 ` Shakeel Butt
  2026-05-20  6:41   ` Harry Yoo
                     ` (2 more replies)
  2026-05-20  5:31 ` [PATCH 3/4] memcg: int16_t for cached slab stats Shakeel Butt
  2026-05-20  5:31 ` [PATCH 4/4] memcg: multi objcg charge support Shakeel Butt
  3 siblings, 3 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-20  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int'
which is 4 bytes on 64-bit machines. Switch the field to uint16_t to
shrink the per-CPU cache.

The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and
_256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the
PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page
remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1
== U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds
U16_MAX. The accumulator also needs to stay within uint16_t between
page-aligned flushes on 64KiB pages where PAGE_SIZE itself is
U16_MAX + 1.

Accumulate the new total in an 'unsigned int' local, then:

  1. Flush whenever the accumulator would hit U16_MAX. Together with
     the existing allow_uncharge flush at PAGE_SIZE, this keeps the
     uint16_t safe on PAGE_SIZE <= 64KiB.

  2. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and
     powerpc 44x), push any sub-page remainder above U16_MAX into
     objcg->nr_charged_bytes via atomic_add before storing back, so
     the store cannot silently truncate. The PAGE_SHIFT > 16 guard
     folds the branch out at compile time on smaller page sizes.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
---
 mm/memcontrol.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7c162946719..b3d63d9f267c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2019,7 +2019,7 @@ static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
 
 struct obj_stock_pcp {
 	local_trylock_t lock;
-	unsigned int nr_bytes;
+	uint16_t nr_bytes;
 	struct obj_cgroup *cached_objcg;
 	int16_t node_id;
 	int nr_slab_reclaimable_b;
@@ -3331,6 +3331,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 			       bool allow_uncharge)
 {
 	unsigned int nr_pages = 0;
+	unsigned int stock_nr_bytes;
 
 	if (!stock) {
 		nr_pages = nr_bytes >> PAGE_SHIFT;
@@ -3339,21 +3340,41 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 		goto out;
 	}
 
+	stock_nr_bytes = stock->nr_bytes;
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
-		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
+		stock_nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
 		WRITE_ONCE(stock->cached_objcg, objcg);
 
 		allow_uncharge = true;	/* Allow uncharge when objcg changes */
 	}
-	stock->nr_bytes += nr_bytes;
+	stock_nr_bytes += nr_bytes;
+
+	/* Since stock->nr_bytes is uint16_t, don't refill >= U16_MAX */
+	if ((allow_uncharge && (stock_nr_bytes > PAGE_SIZE)) ||
+	    stock_nr_bytes >= U16_MAX) {
+		nr_pages = stock_nr_bytes >> PAGE_SHIFT;
+		stock_nr_bytes &= (PAGE_SIZE - 1);
+
+		/*
+		 * On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on
+		 * hexagon and powerpc 44x), the sub-page remainder can
+		 * still exceed U16_MAX. Push the excess back to
+		 * objcg->nr_charged_bytes so the store into uint16_t
+		 * cannot silently truncate; folded out at compile time
+		 * on smaller page sizes.
+		 */
+		if (PAGE_SHIFT > 16 && stock_nr_bytes > U16_MAX) {
+			unsigned int kept = stock_nr_bytes & U16_MAX;
 
-	if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) {
-		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
-		stock->nr_bytes &= (PAGE_SIZE - 1);
+			atomic_add(stock_nr_bytes - kept,
+				   &objcg->nr_charged_bytes);
+			stock_nr_bytes = kept;
+		}
 	}
+	stock->nr_bytes = stock_nr_bytes;
 
 out:
 	if (nr_pages)
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] memcg: int16_t for cached slab stats
  2026-05-20  5:31 [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs Shakeel Butt
  2026-05-20  5:31 ` [PATCH 1/4] memcg: store node_id instead of pglist_data pointer Shakeel Butt
  2026-05-20  5:31 ` [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp Shakeel Butt
@ 2026-05-20  5:31 ` Shakeel Butt
  2026-05-20  7:25   ` Harry Yoo
  2026-05-20  5:31 ` [PATCH 4/4] memcg: multi objcg charge support Shakeel Butt
  3 siblings, 1 reply; 19+ messages in thread
From: Shakeel Butt @ 2026-05-20  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

Currently struct obj_stock_pcp stores cached slab stats in 'int' which
is 4 bytes per counter on 64-bit machines. Switch them to int16_t to
shrink the cached metadata.

The existing PAGE_SIZE flush in __account_obj_stock() bounds *bytes at
PAGE_SIZE on 4KiB and 16KiB page archs, well within int16_t. On 64KiB
pages PAGE_SIZE is well above S16_MAX so that flush never fires, and a
sufficiently long run of accumulations would overflow the cache. Add
an explicit S16_MAX guard before each add: when the next add would
push abs(*bytes) past S16_MAX, fold the cached value into @nr and
flush directly via mod_objcg_mlstate() before the accumulation.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
---
 mm/memcontrol.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b3d63d9f267c..1ed27fd06850 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2022,8 +2022,8 @@ struct obj_stock_pcp {
 	uint16_t nr_bytes;
 	struct obj_cgroup *cached_objcg;
 	int16_t node_id;
-	int nr_slab_reclaimable_b;
-	int nr_slab_unreclaimable_b;
+	int16_t nr_slab_reclaimable_b;
+	int16_t nr_slab_unreclaimable_b;
 
 	struct work_struct work;
 	unsigned long flags;
@@ -3158,7 +3158,7 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock, int nr,
 				struct pglist_data *pgdat, enum node_stat_item idx)
 {
-	int *bytes;
+	int16_t *bytes;
 
 	/*
 	 * Though at the moment MAX_NUMNODES <= 1024 in all archs but let's make
@@ -3195,6 +3195,16 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 
 	bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? &stock->nr_slab_reclaimable_b
 					       : &stock->nr_slab_unreclaimable_b;
+	/*
+	 * To avoid overflow or underflow, flush directly if accumulating @nr
+	 * would push the cached value past S16_MAX.
+	 */
+	if (abs(nr + *bytes) >= S16_MAX) {
+		nr += *bytes;
+		*bytes = 0;
+		goto direct;
+	}
+
 	/*
 	 * Even for large object >= PAGE_SIZE, the vmstat data will still be
 	 * cached locally at least once before pushing it out.
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] memcg: multi objcg charge support
  2026-05-20  5:31 [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs Shakeel Butt
                   ` (2 preceding siblings ...)
  2026-05-20  5:31 ` [PATCH 3/4] memcg: int16_t for cached slab stats Shakeel Butt
@ 2026-05-20  5:31 ` Shakeel Butt
  2026-05-20  9:35   ` Harry Yoo
  3 siblings, 1 reply; 19+ messages in thread
From: Shakeel Butt @ 2026-05-20  5:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node so that reparenting LRU folios can take per-node lru locks. As a
side effect, the per-CPU obj_stock_pcp -- which caches exactly one
cached_objcg -- thrashes on workloads where threads of the same memcg
run on different NUMA nodes. The kernel test robot reported a 67.7%
regression on stress-ng.switch.ops_per_sec from this pattern.

Mirror the multi-slot pattern already used by memcg_stock_pcp: turn
nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all
slots on consume/refill/account, prefer empty slots when inserting,
and evict a random slot only when full. With multiple slots a CPU can
hold the per-node objcg variants of one memcg plus a few siblings
without ever forcing a drain.

A single int8_t index records which slot the cached slab stats belong
to; the stats are flushed on slot or pgdat change. With NR_OBJ_STOCK
= 5 the layout (verified with pahole) is:

  offset 0  : lock(1) + index(1) + node_id(2) + slab stats(4) = 8B
  offset 8  : nr_bytes[5]                                     = 10B
  offset 18 : padding                                         = 6B
  offset 24 : cached[5]                                       = 40B
  offset 64 : (line 2) work_struct + flags (cold)

so consume_obj_stock, refill_obj_stock and the slab account path each
touch exactly one 64-byte cache line on non-debug 64-bit builds.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
---
 mm/memcontrol.c | 185 ++++++++++++++++++++++++++++++++++--------------
 1 file changed, 133 insertions(+), 52 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ed27fd06850..52104cbb8e7c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -150,14 +150,14 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	 * However, it can be PAGE_SIZE or (x * PAGE_SIZE).
 	 *
 	 * The following sequence can lead to it:
-	 * 1) CPU0: objcg == stock->cached_objcg
+	 * 1) CPU0: objcg cached in one of stock->cached[i]
 	 * 2) CPU1: we do a small allocation (e.g. 92 bytes),
 	 *          PAGE_SIZE bytes are charged
 	 * 3) CPU1: a process from another memcg is allocating something,
 	 *          the stock if flushed,
 	 *          objcg->nr_charged_bytes = PAGE_SIZE - 92
 	 * 5) CPU0: we do release this object,
-	 *          92 bytes are added to stock->nr_bytes
+	 *          92 bytes are added to stock->nr_bytes[i]
 	 * 6) CPU0: stock is flushed,
 	 *          92 bytes are added to objcg->nr_charged_bytes
 	 *
@@ -2017,13 +2017,25 @@ static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
 	.lock = INIT_LOCAL_TRYLOCK(lock),
 };
 
+/*
+ * NR_OBJ_STOCK is sized so the entire hot path of obj_stock_pcp
+ * (lock, accounting metadata, nr_bytes[] and cached[]) fits within a
+ * single 64-byte cache line on non-debug 64-bit builds. With 5 slots:
+ *   lock(1) + index(1) + node_id(2) + slab stats(4) + nr_bytes(10)
+ *   + pad(6) + cached(40) == 64 bytes.
+ * A CPU can thus consume/refill/account against five different objcgs
+ * (typically per-node variants of the same memcg) while incurring at
+ * most one cache miss on the stock.
+ */
+#define NR_OBJ_STOCK 5
 struct obj_stock_pcp {
 	local_trylock_t lock;
-	uint16_t nr_bytes;
-	struct obj_cgroup *cached_objcg;
+	int8_t index;
 	int16_t node_id;
 	int16_t nr_slab_reclaimable_b;
 	int16_t nr_slab_unreclaimable_b;
+	uint16_t nr_bytes[NR_OBJ_STOCK];
+	struct obj_cgroup *cached[NR_OBJ_STOCK];
 
 	struct work_struct work;
 	unsigned long flags;
@@ -2031,11 +2043,13 @@ struct obj_stock_pcp {
 
 static DEFINE_PER_CPU_ALIGNED(struct obj_stock_pcp, obj_stock) = {
 	.lock = INIT_LOCAL_TRYLOCK(lock),
+	.index = -1,
 	.node_id = NUMA_NO_NODE,
 };
 
 static DEFINE_MUTEX(percpu_charge_mutex);
 
+static void drain_obj_stock_slot(struct obj_stock_pcp *stock, int i);
 static void drain_obj_stock(struct obj_stock_pcp *stock);
 static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
@@ -3153,12 +3167,13 @@ static void unlock_stock(struct obj_stock_pcp *stock)
 		local_unlock(&obj_stock.lock);
 }
 
-/* Call after __refill_obj_stock() to ensure stock->cached_objg == objcg */
+/* Call after __refill_obj_stock() so a slot for objcg exists in the stock */
 static void __account_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock, int nr,
 				struct pglist_data *pgdat, enum node_stat_item idx)
 {
 	int16_t *bytes;
+	int i;
 
 	/*
 	 * Though at the moment MAX_NUMNODES <= 1024 in all archs but let's make
@@ -3167,29 +3182,39 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
 	 */
 	BUILD_BUG_ON(MAX_NUMNODES >= S16_MAX);
 
-	if (!stock || READ_ONCE(stock->cached_objcg) != objcg)
+	if (!stock)
+		goto direct;
+
+	for (i = 0; i < NR_OBJ_STOCK; ++i) {
+		if (READ_ONCE(stock->cached[i]) == objcg)
+			break;
+	}
+	if (i == NR_OBJ_STOCK)
 		goto direct;
 
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
-	 * accumulating over a page of vmstat data or when pgdat changes.
+	 * accumulating over a page of vmstat data or when the objcg slot or
+	 * pgdat the stats belong to changes.
 	 */
-	if (stock->node_id == NUMA_NO_NODE) {
+	if (stock->index < 0) {
+		stock->index = i;
 		stock->node_id = pgdat->node_id;
-	} else if (stock->node_id != pgdat->node_id) {
-		/* Flush the existing cached vmstat data */
+	} else if (stock->index != i || stock->node_id != pgdat->node_id) {
+		struct obj_cgroup *old = READ_ONCE(stock->cached[stock->index]);
 		struct pglist_data *oldpg = NODE_DATA(stock->node_id);
 
 		if (stock->nr_slab_reclaimable_b) {
-			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_RECLAIMABLE_B,
+			mod_objcg_mlstate(old, oldpg, NR_SLAB_RECLAIMABLE_B,
 					  stock->nr_slab_reclaimable_b);
 			stock->nr_slab_reclaimable_b = 0;
 		}
 		if (stock->nr_slab_unreclaimable_b) {
-			mod_objcg_mlstate(objcg, oldpg, NR_SLAB_UNRECLAIMABLE_B,
+			mod_objcg_mlstate(old, oldpg, NR_SLAB_UNRECLAIMABLE_B,
 					  stock->nr_slab_unreclaimable_b);
 			stock->nr_slab_unreclaimable_b = 0;
 		}
+		stock->index = i;
 		stock->node_id = pgdat->node_id;
 	}
 
@@ -3230,10 +3255,16 @@ static bool __consume_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock,
 				unsigned int nr_bytes)
 {
-	if (objcg == READ_ONCE(stock->cached_objcg) &&
-	    stock->nr_bytes >= nr_bytes) {
-		stock->nr_bytes -= nr_bytes;
-		return true;
+	int i;
+
+	for (i = 0; i < NR_OBJ_STOCK; ++i) {
+		if (READ_ONCE(stock->cached[i]) != objcg)
+			continue;
+		if (stock->nr_bytes[i] >= nr_bytes) {
+			stock->nr_bytes[i] -= nr_bytes;
+			return true;
+		}
+		return false;
 	}
 
 	return false;
@@ -3254,16 +3285,42 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	return ret;
 }
 
-static void drain_obj_stock(struct obj_stock_pcp *stock)
+/* Flush the cached slab stats (if any) back to their owning objcg/pgdat. */
+static void drain_obj_stock_stats(struct obj_stock_pcp *stock)
 {
-	struct obj_cgroup *old = READ_ONCE(stock->cached_objcg);
+	struct obj_cgroup *old;
+	struct pglist_data *oldpg;
+
+	if (stock->index < 0)
+		return;
+
+	old = READ_ONCE(stock->cached[stock->index]);
+	oldpg = NODE_DATA(stock->node_id);
+
+	if (stock->nr_slab_reclaimable_b) {
+		mod_objcg_mlstate(old, oldpg, NR_SLAB_RECLAIMABLE_B,
+				  stock->nr_slab_reclaimable_b);
+		stock->nr_slab_reclaimable_b = 0;
+	}
+	if (stock->nr_slab_unreclaimable_b) {
+		mod_objcg_mlstate(old, oldpg, NR_SLAB_UNRECLAIMABLE_B,
+				  stock->nr_slab_unreclaimable_b);
+		stock->nr_slab_unreclaimable_b = 0;
+	}
+	stock->index = -1;
+	stock->node_id = NUMA_NO_NODE;
+}
+
+static void drain_obj_stock_slot(struct obj_stock_pcp *stock, int i)
+{
+	struct obj_cgroup *old = READ_ONCE(stock->cached[i]);
 
 	if (!old)
 		return;
 
-	if (stock->nr_bytes) {
-		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
-		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
+	if (stock->nr_bytes[i]) {
+		unsigned int nr_pages = stock->nr_bytes[i] >> PAGE_SHIFT;
+		unsigned int nr_bytes = stock->nr_bytes[i] & (PAGE_SIZE - 1);
 
 		if (nr_pages) {
 			struct mem_cgroup *memcg;
@@ -3289,46 +3346,43 @@ static void drain_obj_stock(struct obj_stock_pcp *stock)
 		 * so it might be changed in the future.
 		 */
 		atomic_add(nr_bytes, &old->nr_charged_bytes);
-		stock->nr_bytes = 0;
+		stock->nr_bytes[i] = 0;
 	}
 
-	/*
-	 * Flush the vmstat data in current stock
-	 */
-	if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) {
-		struct pglist_data *oldpg = NODE_DATA(stock->node_id);
-
-		if (stock->nr_slab_reclaimable_b) {
-			mod_objcg_mlstate(old, oldpg,
-					  NR_SLAB_RECLAIMABLE_B,
-					  stock->nr_slab_reclaimable_b);
-			stock->nr_slab_reclaimable_b = 0;
-		}
-		if (stock->nr_slab_unreclaimable_b) {
-			mod_objcg_mlstate(old, oldpg,
-					  NR_SLAB_UNRECLAIMABLE_B,
-					  stock->nr_slab_unreclaimable_b);
-			stock->nr_slab_unreclaimable_b = 0;
-		}
-		stock->node_id = NUMA_NO_NODE;
-	}
+	/* Flush vmstat data when its owning slot is being drained. */
+	if (stock->index == i)
+		drain_obj_stock_stats(stock);
 
-	WRITE_ONCE(stock->cached_objcg, NULL);
+	WRITE_ONCE(stock->cached[i], NULL);
 	obj_cgroup_put(old);
 }
 
+static void drain_obj_stock(struct obj_stock_pcp *stock)
+{
+	int i;
+
+	for (i = 0; i < NR_OBJ_STOCK; ++i)
+		drain_obj_stock_slot(stock, i);
+}
+
 static bool obj_stock_flush_required(struct obj_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg)
 {
-	struct obj_cgroup *objcg = READ_ONCE(stock->cached_objcg);
+	struct obj_cgroup *objcg;
 	struct mem_cgroup *memcg;
 	bool flush = false;
+	int i;
 
 	rcu_read_lock();
-	if (objcg) {
+	for (i = 0; i < NR_OBJ_STOCK; ++i) {
+		objcg = READ_ONCE(stock->cached[i]);
+		if (!objcg)
+			continue;
 		memcg = obj_cgroup_memcg(objcg);
-		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
+		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) {
 			flush = true;
+			break;
+		}
 	}
 	rcu_read_unlock();
 
@@ -3342,6 +3396,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 {
 	unsigned int nr_pages = 0;
 	unsigned int stock_nr_bytes;
+	int i, slot = -1, empty_slot = -1;
 
 	if (!stock) {
 		nr_pages = nr_bytes >> PAGE_SHIFT;
@@ -3350,19 +3405,45 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 		goto out;
 	}
 
-	stock_nr_bytes = stock->nr_bytes;
-	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+	for (i = 0; i < NR_OBJ_STOCK; ++i) {
+		struct obj_cgroup *cached = READ_ONCE(stock->cached[i]);
+
+		if (!cached) {
+			if (empty_slot == -1)
+				empty_slot = i;
+			continue;
+		}
+		if (cached == objcg) {
+			slot = i;
+			break;
+		}
+	}
+
+	if (slot == -1) {
+		slot = empty_slot;
+		if (slot == -1) {
+			slot = get_random_u32_below(NR_OBJ_STOCK);
+			drain_obj_stock_slot(stock, slot);
+		}
 		obj_cgroup_get(objcg);
+		/*
+		 * Keep the xchg result in the unsigned int local; storing
+		 * it directly into stock->nr_bytes[slot] (uint16_t) would
+		 * silently truncate values >= U16_MAX and bypass the flush
+		 * guard below, leaking page-counter charges.
+		 */
 		stock_nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
-		WRITE_ONCE(stock->cached_objcg, objcg);
+		WRITE_ONCE(stock->cached[slot], objcg);
 
 		allow_uncharge = true;	/* Allow uncharge when objcg changes */
+	} else {
+		stock_nr_bytes = stock->nr_bytes[slot];
 	}
+
 	stock_nr_bytes += nr_bytes;
 
-	/* Since stock->nr_bytes is uint16_t, don't refill >= U16_MAX */
+	/* nr_bytes[] is uint16_t; flush if we would refill >= U16_MAX. */
 	if ((allow_uncharge && (stock_nr_bytes > PAGE_SIZE)) ||
 	    stock_nr_bytes >= U16_MAX) {
 		nr_pages = stock_nr_bytes >> PAGE_SHIFT;
@@ -3384,7 +3465,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 			stock_nr_bytes = kept;
 		}
 	}
-	stock->nr_bytes = stock_nr_bytes;
+	stock->nr_bytes[slot] = stock_nr_bytes;
 
 out:
 	if (nr_pages)
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] memcg: store node_id instead of pglist_data pointer
  2026-05-20  5:31 ` [PATCH 1/4] memcg: store node_id instead of pglist_data pointer Shakeel Butt
@ 2026-05-20  6:01   ` Harry Yoo
  2026-05-20  6:13   ` Muchun Song
  1 sibling, 0 replies; 19+ messages in thread
From: Harry Yoo @ 2026-05-20  6:01 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot



On 5/20/26 2:31 PM, Shakeel Butt wrote:
> The struct obj_stock_pcp stores a pointer to pglist_data for the slab
> stats cached on the cpu. On 64-bit machines, this costs 8 bytes. The
> pointer is not strictly required: NODE_DATA() can recover it from the
> node id. Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE
> as the "no stats cached" sentinel.
> 
> At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is
> plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] memcg: store node_id instead of pglist_data pointer
  2026-05-20  5:31 ` [PATCH 1/4] memcg: store node_id instead of pglist_data pointer Shakeel Butt
  2026-05-20  6:01   ` Harry Yoo
@ 2026-05-20  6:13   ` Muchun Song
  1 sibling, 0 replies; 19+ messages in thread
From: Muchun Song @ 2026-05-20  6:13 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot



> On May 20, 2026, at 13:31, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> The struct obj_stock_pcp stores a pointer to pglist_data for the slab
> stats cached on the cpu. On 64-bit machines, this costs 8 bytes. The
> pointer is not strictly required: NODE_DATA() can recover it from the
> node id. Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE
> as the "no stats cached" sentinel.
> 
> At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is
> plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>

Acked-by: Muchun Song <muchun.song@linux.dev>

Thanks.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp
  2026-05-20  5:31 ` [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp Shakeel Butt
@ 2026-05-20  6:41   ` Harry Yoo
  2026-05-20  7:01   ` Harry Yoo
  2026-05-20 13:20   ` David Laight
  2 siblings, 0 replies; 19+ messages in thread
From: Harry Yoo @ 2026-05-20  6:41 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot



On 5/20/26 2:31 PM, Shakeel Butt wrote:
> Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int'
> which is 4 bytes on 64-bit machines. Switch the field to uint16_t to
> shrink the per-CPU cache.
> 
> The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and
> _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the
> PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page
> remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1
> == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds
> U16_MAX. The accumulator also needs to stay within uint16_t between
> page-aligned flushes on 64KiB pages where PAGE_SIZE itself is
> U16_MAX + 1.
> 
> Accumulate the new total in an 'unsigned int' local, then:
> 
>    1. Flush whenever the accumulator would hit U16_MAX. Together with
>       the existing allow_uncharge flush at PAGE_SIZE, this keeps the
>       uint16_t safe on PAGE_SIZE <= 64KiB.
> 
>    2. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and
>       powerpc 44x), push any sub-page remainder above U16_MAX into
>       objcg->nr_charged_bytes via atomic_add before storing back, so
>       the store cannot silently truncate. The PAGE_SHIFT > 16 guard
>       folds the branch out at compile time on smaller page sizes.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> ---

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp
  2026-05-20  5:31 ` [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp Shakeel Butt
  2026-05-20  6:41   ` Harry Yoo
@ 2026-05-20  7:01   ` Harry Yoo
  2026-05-21  1:01     ` Shakeel Butt
  2026-05-20 13:20   ` David Laight
  2 siblings, 1 reply; 19+ messages in thread
From: Harry Yoo @ 2026-05-20  7:01 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot



On 5/20/26 2:31 PM, Shakeel Butt wrote:
> Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int'
> which is 4 bytes on 64-bit machines. Switch the field to uint16_t to
> shrink the per-CPU cache.
> 
> The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and
> _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the
> PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page
> remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1
> == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds
> U16_MAX. The accumulator also needs to stay within uint16_t between
> page-aligned flushes on 64KiB pages where PAGE_SIZE itself is
> U16_MAX + 1.
> 
> Accumulate the new total in an 'unsigned int' local, then:
> 
>    1. Flush whenever the accumulator would hit U16_MAX. Together with
>       the existing allow_uncharge flush at PAGE_SIZE, this keeps the
>       uint16_t safe on PAGE_SIZE <= 64KiB.
> 
>    2. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and
>       powerpc 44x), push any sub-page remainder above U16_MAX into
>       objcg->nr_charged_bytes via atomic_add before storing back, so
>       the store cannot silently truncate. The PAGE_SHIFT > 16 guard
>       folds the branch out at compile time on smaller page sizes.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> ---
>   mm/memcontrol.c | 33 +++++++++++++++++++++++++++------
>   1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7c162946719..b3d63d9f267c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3339,21 +3340,41 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
>   		goto out;
>   	}
>   
> +	stock_nr_bytes = stock->nr_bytes;
>   	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
>   		drain_obj_stock(stock);
>   		obj_cgroup_get(objcg);
> -		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
> +		stock_nr_bytes = atomic_read(&objcg->nr_charged_bytes)
>   				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
>   		WRITE_ONCE(stock->cached_objcg, objcg);
>   
>   		allow_uncharge = true;	/* Allow uncharge when objcg changes */
>   	}
> -	stock->nr_bytes += nr_bytes;
> +	stock_nr_bytes += nr_bytes;
> +
> +	/* Since stock->nr_bytes is uint16_t, don't refill >= U16_MAX */
> +	if ((allow_uncharge && (stock_nr_bytes > PAGE_SIZE)) ||
> +	    stock_nr_bytes >= U16_MAX) {

nit: This should be > U16_MAX?

> +		nr_pages = stock_nr_bytes >> PAGE_SHIFT;
> +		stock_nr_bytes &= (PAGE_SIZE - 1);
> +
> +		/*
> +		 * On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on
> +		 * hexagon and powerpc 44x), the sub-page remainder can
> +		 * still exceed U16_MAX. Push the excess back to
> +		 * objcg->nr_charged_bytes so the store into uint16_t
> +		 * cannot silently truncate; folded out at compile time
> +		 * on smaller page sizes.
> +		 */
> +		if (PAGE_SHIFT > 16 && stock_nr_bytes > U16_MAX) {
> +			unsigned int kept = stock_nr_bytes & U16_MAX;
>   
> -	if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) {
> -		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> -		stock->nr_bytes &= (PAGE_SIZE - 1);
> +			atomic_add(stock_nr_bytes - kept,
> +				   &objcg->nr_charged_bytes);
> +			stock_nr_bytes = kept;
> +		}
>   	}
> +	stock->nr_bytes = stock_nr_bytes;
>   
>   out:
>   	if (nr_pages)

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] memcg: int16_t for cached slab stats
  2026-05-20  5:31 ` [PATCH 3/4] memcg: int16_t for cached slab stats Shakeel Butt
@ 2026-05-20  7:25   ` Harry Yoo
  0 siblings, 0 replies; 19+ messages in thread
From: Harry Yoo @ 2026-05-20  7:25 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot



On 5/20/26 2:31 PM, Shakeel Butt wrote:
> Currently struct obj_stock_pcp stores cached slab stats in 'int' which
> is 4 bytes per counter on 64-bit machines. Switch them to int16_t to
> shrink the cached metadata.
> 
> The existing PAGE_SIZE flush in __account_obj_stock() bounds *bytes at
> PAGE_SIZE on 4KiB and 16KiB page archs, well within int16_t. On 64KiB
> pages PAGE_SIZE is well above S16_MAX so that flush never fires, and a
> sufficiently long run of accumulations would overflow the cache. Add
> an explicit S16_MAX guard before each add: when the next add would
> push abs(*bytes) past S16_MAX, fold the cached value into @nr and
> flush directly via mod_objcg_mlstate() before the accumulation.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> ---

So arches with 64KiB sizes won't benefit from

"Even for large object >= PAGE_SIZE, the vmstat data will still be 
cached locally at least once before pushing it out" case.

But its benefit is questionable (to me), might be ok.

>   mm/memcontrol.c | 16 +++++++++++++---
>   1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b3d63d9f267c..1ed27fd06850 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3158,7 +3158,7 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
>   				struct obj_stock_pcp *stock, int nr,
>   				struct pglist_data *pgdat, enum node_stat_item idx)
>   {
> -	int *bytes;
> +	int16_t *bytes;
>   
>   	/*
>   	 * Though at the moment MAX_NUMNODES <= 1024 in all archs but let's make
> @@ -3195,6 +3195,16 @@ static void __account_obj_stock(struct obj_cgroup *objcg,
>   
>   	bytes = (idx == NR_SLAB_RECLAIMABLE_B) ? &stock->nr_slab_reclaimable_b
>   					       : &stock->nr_slab_unreclaimable_b;
> +	/*
> +	 * To avoid overflow or underflow, flush directly if accumulating @nr
> +	 * would push the cached value past S16_MAX.
> +	 */
> +	if (abs(nr + *bytes) >= S16_MAX) {
nit: should be > S16_MAX?

> +		nr += *bytes;
> +		*bytes = 0;
> +		goto direct;
> +	}
> +
>   	/*
>   	 * Even for large object >= PAGE_SIZE, the vmstat data will still be
>   	 * cached locally at least once before pushing it out.

FWIW:
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] memcg: multi objcg charge support
  2026-05-20  5:31 ` [PATCH 4/4] memcg: multi objcg charge support Shakeel Butt
@ 2026-05-20  9:35   ` Harry Yoo
  2026-05-21  1:05     ` Shakeel Butt
  0 siblings, 1 reply; 19+ messages in thread
From: Harry Yoo @ 2026-05-20  9:35 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot



On 5/20/26 2:31 PM, Shakeel Butt wrote:
> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> per-node type") split a memcg's single obj_cgroup into one per NUMA
> node so that reparenting LRU folios can take per-node lru locks. As a
> side effect, the per-CPU obj_stock_pcp -- which caches exactly one
> cached_objcg -- thrashes on workloads where threads of the same memcg
> run on different NUMA nodes. The kernel test robot reported a 67.7%
> regression on stress-ng.switch.ops_per_sec from this pattern.
> 
> Mirror the multi-slot pattern already used by memcg_stock_pcp: turn
> nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all
> slots on consume/refill/account, prefer empty slots when inserting,
> and evict a random slot only when full. With multiple slots a CPU can
> hold the per-node objcg variants of one memcg plus a few siblings
> without ever forcing a drain.
> 
> A single int8_t index records which slot the cached slab stats belong
> to; the stats are flushed on slot or pgdat change. With NR_OBJ_STOCK
> = 5 the layout (verified with pahole) is:
> 
>    offset 0  : lock(1) + index(1) + node_id(2) + slab stats(4) = 8B
>    offset 8  : nr_bytes[5]                                     = 10B
>    offset 18 : padding                                         = 6B
>    offset 24 : cached[5]                                       = 40B
>    offset 64 : (line 2) work_struct + flags (cold)
> 
> so consume_obj_stock, refill_obj_stock and the slab account path each
> touch exactly one 64-byte cache line on non-debug 64-bit builds.
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> ---
> @@ -3350,19 +3405,45 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
>   		goto out;
>   	}
>   
> -	stock_nr_bytes = stock->nr_bytes;
> -	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> -		drain_obj_stock(stock);
> +	for (i = 0; i < NR_OBJ_STOCK; ++i) {
> +		struct obj_cgroup *cached = READ_ONCE(stock->cached[i]);
> +
> +		if (!cached) {
> +			if (empty_slot == -1)
> +				empty_slot = i;
> +			continue;
> +		}
> +		if (cached == objcg) {
> +			slot = i;
> +			break;
> +		}
> +	}
> +
> +	if (slot == -1) {
> +		slot = empty_slot;
> +		if (slot == -1) {
> +			slot = get_random_u32_below(NR_OBJ_STOCK);

It would break kmalloc_nolock() because _get_random_bytes() uses a 
spinlock. perhaps prandom_u32_state() should be sufficient in this case.

Is there a reason why it uses random eviction, unlike multi-memcg percpu 
charge cache?

Otherwise LGTM!

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp
  2026-05-20  5:31 ` [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp Shakeel Butt
  2026-05-20  6:41   ` Harry Yoo
  2026-05-20  7:01   ` Harry Yoo
@ 2026-05-20 13:20   ` David Laight
  2026-05-21  1:03     ` Shakeel Butt
  2 siblings, 1 reply; 19+ messages in thread
From: David Laight @ 2026-05-20 13:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Tue, 19 May 2026 22:31:20 -0700
Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int'
> which is 4 bytes on 64-bit machines. Switch the field to uint16_t to
> shrink the per-CPU cache.
> 
> The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and
> _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the
> PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page
> remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1
> == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds
> U16_MAX. The accumulator also needs to stay within uint16_t between
> page-aligned flushes on 64KiB pages where PAGE_SIZE itself is
> U16_MAX + 1.
> 
> Accumulate the new total in an 'unsigned int' local, then:
> 
>   1. Flush whenever the accumulator would hit U16_MAX. Together with
>      the existing allow_uncharge flush at PAGE_SIZE, this keeps the
>      uint16_t safe on PAGE_SIZE <= 64KiB.
> 
>   2. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and
>      powerpc 44x), push any sub-page remainder above U16_MAX into
>      objcg->nr_charged_bytes via atomic_add before storing back, so
>      the store cannot silently truncate. The PAGE_SHIFT > 16 guard
>      folds the branch out at compile time on smaller page sizes.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> ---
>  mm/memcontrol.c | 33 +++++++++++++++++++++++++++------
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7c162946719..b3d63d9f267c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2019,7 +2019,7 @@ static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
>  
>  struct obj_stock_pcp {
>  	local_trylock_t lock;
> -	unsigned int nr_bytes;
> +	uint16_t nr_bytes;
>  	struct obj_cgroup *cached_objcg;
>  	int16_t node_id;

You might want to move it to this hole.
The size of 'lock' depends on kernel build options.

-- David

>  	int nr_slab_reclaimable_b;
> @@ -3331,6 +3331,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
>  			       bool allow_uncharge)
>  {
>  	unsigned int nr_pages = 0;
> +	unsigned int stock_nr_bytes;
>  
>  	if (!stock) {
>  		nr_pages = nr_bytes >> PAGE_SHIFT;
> @@ -3339,21 +3340,41 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
>  		goto out;
>  	}
>  
> +	stock_nr_bytes = stock->nr_bytes;
>  	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
>  		drain_obj_stock(stock);
>  		obj_cgroup_get(objcg);
> -		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
> +		stock_nr_bytes = atomic_read(&objcg->nr_charged_bytes)
>  				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
>  		WRITE_ONCE(stock->cached_objcg, objcg);
>  
>  		allow_uncharge = true;	/* Allow uncharge when objcg changes */
>  	}
> -	stock->nr_bytes += nr_bytes;
> +	stock_nr_bytes += nr_bytes;
> +
> +	/* Since stock->nr_bytes is uint16_t, don't refill >= U16_MAX */
> +	if ((allow_uncharge && (stock_nr_bytes > PAGE_SIZE)) ||
> +	    stock_nr_bytes >= U16_MAX) {
> +		nr_pages = stock_nr_bytes >> PAGE_SHIFT;
> +		stock_nr_bytes &= (PAGE_SIZE - 1);
> +
> +		/*
> +		 * On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on
> +		 * hexagon and powerpc 44x), the sub-page remainder can
> +		 * still exceed U16_MAX. Push the excess back to
> +		 * objcg->nr_charged_bytes so the store into uint16_t
> +		 * cannot silently truncate; folded out at compile time
> +		 * on smaller page sizes.
> +		 */
> +		if (PAGE_SHIFT > 16 && stock_nr_bytes > U16_MAX) {
> +			unsigned int kept = stock_nr_bytes & U16_MAX;
>  
> -	if (allow_uncharge && (stock->nr_bytes > PAGE_SIZE)) {
> -		nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> -		stock->nr_bytes &= (PAGE_SIZE - 1);
> +			atomic_add(stock_nr_bytes - kept,
> +				   &objcg->nr_charged_bytes);
> +			stock_nr_bytes = kept;
> +		}
>  	}
> +	stock->nr_bytes = stock_nr_bytes;
>  
>  out:
>  	if (nr_pages)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp
  2026-05-20  7:01   ` Harry Yoo
@ 2026-05-21  1:01     ` Shakeel Butt
  0 siblings, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-21  1:01 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Wed, May 20, 2026 at 04:01:45PM +0900, Harry Yoo wrote:
> 
> 
> On 5/20/26 2:31 PM, Shakeel Butt wrote:
> > Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int'
> > which is 4 bytes on 64-bit machines. Switch the field to uint16_t to
> > shrink the per-CPU cache.
> > 
> > The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and
> > _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the
> > PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page
> > remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1
> > == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds
> > U16_MAX. The accumulator also needs to stay within uint16_t between
> > page-aligned flushes on 64KiB pages where PAGE_SIZE itself is
> > U16_MAX + 1.
> > 
> > Accumulate the new total in an 'unsigned int' local, then:
> > 
> >    1. Flush whenever the accumulator would hit U16_MAX. Together with
> >       the existing allow_uncharge flush at PAGE_SIZE, this keeps the
> >       uint16_t safe on PAGE_SIZE <= 64KiB.
> > 
> >    2. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and
> >       powerpc 44x), push any sub-page remainder above U16_MAX into
> >       objcg->nr_charged_bytes via atomic_add before storing back, so
> >       the store cannot silently truncate. The PAGE_SHIFT > 16 guard
> >       folds the branch out at compile time on smaller page sizes.
> > 
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Tested-by: kernel test robot <oliver.sang@intel.com>
> > ---
> >   mm/memcontrol.c | 33 +++++++++++++++++++++++++++------
> >   1 file changed, 27 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d7c162946719..b3d63d9f267c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3339,21 +3340,41 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> >   		goto out;
> >   	}
> > +	stock_nr_bytes = stock->nr_bytes;
> >   	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> >   		drain_obj_stock(stock);
> >   		obj_cgroup_get(objcg);
> > -		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
> > +		stock_nr_bytes = atomic_read(&objcg->nr_charged_bytes)
> >   				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
> >   		WRITE_ONCE(stock->cached_objcg, objcg);
> >   		allow_uncharge = true;	/* Allow uncharge when objcg changes */
> >   	}
> > -	stock->nr_bytes += nr_bytes;
> > +	stock_nr_bytes += nr_bytes;
> > +
> > +	/* Since stock->nr_bytes is uint16_t, don't refill >= U16_MAX */
> > +	if ((allow_uncharge && (stock_nr_bytes > PAGE_SIZE)) ||
> > +	    stock_nr_bytes >= U16_MAX) {
> 
> nit: This should be > U16_MAX?

Ack.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp
  2026-05-20 13:20   ` David Laight
@ 2026-05-21  1:03     ` Shakeel Butt
  0 siblings, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-21  1:03 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn, Harry Yoo,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Wed, May 20, 2026 at 02:20:23PM +0100, David Laight wrote:
> On Tue, 19 May 2026 22:31:20 -0700
> Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int'
> > which is 4 bytes on 64-bit machines. Switch the field to uint16_t to
> > shrink the per-CPU cache.
> > 
> > The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and
> > _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the
> > PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page
> > remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1
> > == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds
> > U16_MAX. The accumulator also needs to stay within uint16_t between
> > page-aligned flushes on 64KiB pages where PAGE_SIZE itself is
> > U16_MAX + 1.
> > 
> > Accumulate the new total in an 'unsigned int' local, then:
> > 
> >   1. Flush whenever the accumulator would hit U16_MAX. Together with
> >      the existing allow_uncharge flush at PAGE_SIZE, this keeps the
> >      uint16_t safe on PAGE_SIZE <= 64KiB.
> > 
> >   2. On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and
> >      powerpc 44x), push any sub-page remainder above U16_MAX into
> >      objcg->nr_charged_bytes via atomic_add before storing back, so
> >      the store cannot silently truncate. The PAGE_SHIFT > 16 guard
> >      folds the branch out at compile time on smaller page sizes.
> > 
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Tested-by: kernel test robot <oliver.sang@intel.com>
> > ---
> >  mm/memcontrol.c | 33 +++++++++++++++++++++++++++------
> >  1 file changed, 27 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d7c162946719..b3d63d9f267c 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2019,7 +2019,7 @@ static DEFINE_PER_CPU_ALIGNED(struct memcg_stock_pcp, memcg_stock) = {
> >  
> >  struct obj_stock_pcp {
> >  	local_trylock_t lock;
> > -	unsigned int nr_bytes;
> > +	uint16_t nr_bytes;
> >  	struct obj_cgroup *cached_objcg;
> >  	int16_t node_id;
> 
> You might want to move it to this hole.
> The size of 'lock' depends on kernel build options.

Thanks. In the final patch, I am rearranging the fields for better packing.
Please take a look at 4th patch and see if it still need fixing.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] memcg: multi objcg charge support
  2026-05-20  9:35   ` Harry Yoo
@ 2026-05-21  1:05     ` Shakeel Butt
  2026-05-21  1:43       ` Harry Yoo
  2026-05-21  3:22       ` Joshua Hahn
  0 siblings, 2 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-21  1:05 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Wed, May 20, 2026 at 06:35:30PM +0900, Harry Yoo wrote:
> 
> 
> On 5/20/26 2:31 PM, Shakeel Butt wrote:
> > Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> > per-node type") split a memcg's single obj_cgroup into one per NUMA
> > node so that reparenting LRU folios can take per-node lru locks. As a
> > side effect, the per-CPU obj_stock_pcp -- which caches exactly one
> > cached_objcg -- thrashes on workloads where threads of the same memcg
> > run on different NUMA nodes. The kernel test robot reported a 67.7%
> > regression on stress-ng.switch.ops_per_sec from this pattern.
> > 
> > Mirror the multi-slot pattern already used by memcg_stock_pcp: turn
> > nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all
> > slots on consume/refill/account, prefer empty slots when inserting,
> > and evict a random slot only when full. With multiple slots a CPU can
> > hold the per-node objcg variants of one memcg plus a few siblings
> > without ever forcing a drain.
> > 
> > A single int8_t index records which slot the cached slab stats belong
> > to; the stats are flushed on slot or pgdat change. With NR_OBJ_STOCK
> > = 5 the layout (verified with pahole) is:
> > 
> >    offset 0  : lock(1) + index(1) + node_id(2) + slab stats(4) = 8B
> >    offset 8  : nr_bytes[5]                                     = 10B
> >    offset 18 : padding                                         = 6B
> >    offset 24 : cached[5]                                       = 40B
> >    offset 64 : (line 2) work_struct + flags (cold)
> > 
> > so consume_obj_stock, refill_obj_stock and the slab account path each
> > touch exactly one 64-byte cache line on non-debug 64-bit builds.
> > 
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> > Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Tested-by: kernel test robot <oliver.sang@intel.com>
> > ---
> > @@ -3350,19 +3405,45 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> >   		goto out;
> >   	}
> > -	stock_nr_bytes = stock->nr_bytes;
> > -	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > -		drain_obj_stock(stock);
> > +	for (i = 0; i < NR_OBJ_STOCK; ++i) {
> > +		struct obj_cgroup *cached = READ_ONCE(stock->cached[i]);
> > +
> > +		if (!cached) {
> > +			if (empty_slot == -1)
> > +				empty_slot = i;
> > +			continue;
> > +		}
> > +		if (cached == objcg) {
> > +			slot = i;
> > +			break;
> > +		}
> > +	}
> > +
> > +	if (slot == -1) {
> > +		slot = empty_slot;
> > +		if (slot == -1) {
> > +			slot = get_random_u32_below(NR_OBJ_STOCK);
> 
> It would break kmalloc_nolock() because _get_random_bytes() uses a spinlock.
> perhaps prandom_u32_state() should be sufficient in this case.
> 
> Is there a reason why it uses random eviction, unlike multi-memcg percpu
> charge cache?

Oh I didn't know and actually we are already using get_random_u32_below() in
refill_stock(). So, it need fixing as well. That would be a separate patch.

I will explore prandom_u32_state().


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] memcg: multi objcg charge support
  2026-05-21  1:05     ` Shakeel Butt
@ 2026-05-21  1:43       ` Harry Yoo
  2026-05-21 20:19         ` Shakeel Butt
  2026-05-21  3:22       ` Joshua Hahn
  1 sibling, 1 reply; 19+ messages in thread
From: Harry Yoo @ 2026-05-21  1:43 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot



On 5/21/26 10:05 AM, Shakeel Butt wrote:
> On Wed, May 20, 2026 at 06:35:30PM +0900, Harry Yoo wrote:
>>> @@ -3350,19 +3405,45 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
>>>    		goto out;
>>>    	}
>>> -	stock_nr_bytes = stock->nr_bytes;
>>> -	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
>>> -		drain_obj_stock(stock);
>>> +	for (i = 0; i < NR_OBJ_STOCK; ++i) {
>>> +		struct obj_cgroup *cached = READ_ONCE(stock->cached[i]);
>>> +
>>> +		if (!cached) {
>>> +			if (empty_slot == -1)
>>> +				empty_slot = i;
>>> +			continue;
>>> +		}
>>> +		if (cached == objcg) {
>>> +			slot = i;
>>> +			break;
>>> +		}
>>> +	}
>>> +
>>> +	if (slot == -1) {
>>> +		slot = empty_slot;
>>> +		if (slot == -1) {
>>> +			slot = get_random_u32_below(NR_OBJ_STOCK);
>>
>> It would break kmalloc_nolock() because _get_random_bytes() uses a spinlock.
>> perhaps prandom_u32_state() should be sufficient in this case.

s/spinlock/local_lock/

>> Is there a reason why it uses random eviction, unlike multi-memcg percpu
>> charge cache?
> 
> Oh I didn't know and actually we are already using get_random_u32_below() in
> refill_stock(). So, it need fixing as well. That would be a separate patch.

Ouch, I see.
> I will explore prandom_u32_state().

Thanks!

FYI, SLUB had a similar issue that was recently fixed:
commit a1e244a9f1778 ("mm/slab: use prandom if !allow_spin").

It uses prandom if spinning is not allowed when shuffling slab freelist.

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] memcg: multi objcg charge support
  2026-05-21  1:05     ` Shakeel Butt
  2026-05-21  1:43       ` Harry Yoo
@ 2026-05-21  3:22       ` Joshua Hahn
  1 sibling, 0 replies; 19+ messages in thread
From: Joshua Hahn @ 2026-05-21  3:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Harry Yoo, Andrew Morton, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Qi Zheng, Alexandre Ghiti,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Wed, 20 May 2026 18:05:23 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> On Wed, May 20, 2026 at 06:35:30PM +0900, Harry Yoo wrote:
> > 
> > 
> > On 5/20/26 2:31 PM, Shakeel Butt wrote:
> > > Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> > > per-node type") split a memcg's single obj_cgroup into one per NUMA
> > > node so that reparenting LRU folios can take per-node lru locks. As a
> > > side effect, the per-CPU obj_stock_pcp -- which caches exactly one
> > > cached_objcg -- thrashes on workloads where threads of the same memcg
> > > run on different NUMA nodes. The kernel test robot reported a 67.7%
> > > regression on stress-ng.switch.ops_per_sec from this pattern.
> > > 
> > > Mirror the multi-slot pattern already used by memcg_stock_pcp: turn
> > > nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all
> > > slots on consume/refill/account, prefer empty slots when inserting,
> > > and evict a random slot only when full. With multiple slots a CPU can
> > > hold the per-node objcg variants of one memcg plus a few siblings
> > > without ever forcing a drain.
> > > 
> > > A single int8_t index records which slot the cached slab stats belong
> > > to; the stats are flushed on slot or pgdat change. With NR_OBJ_STOCK
> > > = 5 the layout (verified with pahole) is:
> > > 
> > >    offset 0  : lock(1) + index(1) + node_id(2) + slab stats(4) = 8B
> > >    offset 8  : nr_bytes[5]                                     = 10B
> > >    offset 18 : padding                                         = 6B
> > >    offset 24 : cached[5]                                       = 40B
> > >    offset 64 : (line 2) work_struct + flags (cold)
> > > 
> > > so consume_obj_stock, refill_obj_stock and the slab account path each
> > > touch exactly one 64-byte cache line on non-debug 64-bit builds.
> > > 
> > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> > > Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > Tested-by: kernel test robot <oliver.sang@intel.com>
> > > ---
> > > @@ -3350,19 +3405,45 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> > >   		goto out;
> > >   	}
> > > -	stock_nr_bytes = stock->nr_bytes;
> > > -	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > > -		drain_obj_stock(stock);
> > > +	for (i = 0; i < NR_OBJ_STOCK; ++i) {
> > > +		struct obj_cgroup *cached = READ_ONCE(stock->cached[i]);
> > > +
> > > +		if (!cached) {
> > > +			if (empty_slot == -1)
> > > +				empty_slot = i;
> > > +			continue;
> > > +		}
> > > +		if (cached == objcg) {
> > > +			slot = i;
> > > +			break;
> > > +		}
> > > +	}
> > > +
> > > +	if (slot == -1) {
> > > +		slot = empty_slot;
> > > +		if (slot == -1) {
> > > +			slot = get_random_u32_below(NR_OBJ_STOCK);
> > 
> > It would break kmalloc_nolock() because _get_random_bytes() uses a spinlock.
> > perhaps prandom_u32_state() should be sufficient in this case.
> 
> > Is there a reason why it uses random eviction, unlike multi-memcg percpu
> > charge cache?
> 
> Oh I didn't know and actually we are already using get_random_u32_below() in
> refill_stock(). So, it need fixing as well. That would be a separate patch.

Hello Harry and Shakeel!

I just wanted to note that my work to push the memcg stock to page_counter [1]
actually gets rid of the random eviction and also the get_random_u32_below.
So the problem will no longer exist, at least for memcg/page_counter stock.

I was also thinking whether there was any way we can abstract or push down the
objcg stock to a different per_X slot (with finer granularity than per-node)
so we don't do random evictions, but it doesn't seem like there's a natural
point to do this from a quick glance.

The fix for get_random_32_below to use prandom_u32_state seems quite simple
though, so if either of you would like to send that in as a small hotfix
it shouldn't be a big deal to rebase on top of that instead too.

(Sorry for joining the conversation late too. I'll take a closer look at this
entire series tomorrow as well).

Thank you both. Have a great day!
Joshua

[1] https://lore.kernel.org/all/20260410210742.550489-1-joshua.hahnjy@gmail.com/


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] memcg: multi objcg charge support
  2026-05-21  1:43       ` Harry Yoo
@ 2026-05-21 20:19         ` Shakeel Butt
  0 siblings, 0 replies; 19+ messages in thread
From: Shakeel Butt @ 2026-05-21 20:19 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Thu, May 21, 2026 at 10:43:11AM +0900, Harry Yoo wrote:
> 
> 
> On 5/21/26 10:05 AM, Shakeel Butt wrote:
> > On Wed, May 20, 2026 at 06:35:30PM +0900, Harry Yoo wrote:
> > > > @@ -3350,19 +3405,45 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> > > >    		goto out;
> > > >    	}
> > > > -	stock_nr_bytes = stock->nr_bytes;
> > > > -	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > > > -		drain_obj_stock(stock);
> > > > +	for (i = 0; i < NR_OBJ_STOCK; ++i) {
> > > > +		struct obj_cgroup *cached = READ_ONCE(stock->cached[i]);
> > > > +
> > > > +		if (!cached) {
> > > > +			if (empty_slot == -1)
> > > > +				empty_slot = i;
> > > > +			continue;
> > > > +		}
> > > > +		if (cached == objcg) {
> > > > +			slot = i;
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	if (slot == -1) {
> > > > +		slot = empty_slot;
> > > > +		if (slot == -1) {
> > > > +			slot = get_random_u32_below(NR_OBJ_STOCK);
> > > 
> > > It would break kmalloc_nolock() because _get_random_bytes() uses a spinlock.
> > > perhaps prandom_u32_state() should be sufficient in this case.
> 
> s/spinlock/local_lock/

I do see spinlock in crng_make_state() for some code paths.

> 
> > > Is there a reason why it uses random eviction, unlike multi-memcg percpu
> > > charge cache?
> > 
> > Oh I didn't know and actually we are already using get_random_u32_below() in
> > refill_stock(). So, it need fixing as well. That would be a separate patch.
> 
> Ouch, I see.
> > I will explore prandom_u32_state().
> 
> Thanks!
> 
> FYI, SLUB had a similar issue that was recently fixed:
> commit a1e244a9f1778 ("mm/slab: use prandom if !allow_spin").
> 
> It uses prandom if spinning is not allowed when shuffling slab freelist.

The drain does not really need a random number. Fixing an index like 0 makes it
much simpler but it might expose some corner cases. Round robin might be enough
for this though. I will see what would be the easiest way forward.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-05-21 20:20 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-20  5:31 [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs Shakeel Butt
2026-05-20  5:31 ` [PATCH 1/4] memcg: store node_id instead of pglist_data pointer Shakeel Butt
2026-05-20  6:01   ` Harry Yoo
2026-05-20  6:13   ` Muchun Song
2026-05-20  5:31 ` [PATCH 2/4] memcg: uint16_t for nr_bytes in obj_stock_pcp Shakeel Butt
2026-05-20  6:41   ` Harry Yoo
2026-05-20  7:01   ` Harry Yoo
2026-05-21  1:01     ` Shakeel Butt
2026-05-20 13:20   ` David Laight
2026-05-21  1:03     ` Shakeel Butt
2026-05-20  5:31 ` [PATCH 3/4] memcg: int16_t for cached slab stats Shakeel Butt
2026-05-20  7:25   ` Harry Yoo
2026-05-20  5:31 ` [PATCH 4/4] memcg: multi objcg charge support Shakeel Butt
2026-05-20  9:35   ` Harry Yoo
2026-05-21  1:05     ` Shakeel Butt
2026-05-21  1:43       ` Harry Yoo
2026-05-21 20:19         ` Shakeel Butt
2026-05-21  3:22       ` Joshua Hahn
  -- strict thread matches above, loose matches on Subject: below --
2026-05-20  5:29 [PATCH 0/4] memcg: shrink obj_stock_pcp and cache multiple objcgs Shakeel Butt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.