[PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
@ 2026-05-18 22:28 Shakeel Butt
  2026-05-18 23:41 ` Shakeel Butt
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2026-05-18 22:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node, but the per-CPU obj_stock_pcp still keys cached_objcg by
pointer. Cross-NUMA workloads now see a drain on every refill and a
miss on every consume that targets a sibling per-node objcg of the
same memcg, producing the 67.7% stress-ng switch-mq regression
reported by LKP.

stock->nr_bytes are fungible across per-node objcgs of one memcg.
Treat the cache as keyed by memcg in __consume_obj_stock() and
__refill_obj_stock() so siblings share the reserve. Compare via
READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
the rcu_read_lock contract on obj_cgroup_memcg() does not apply.

Sharing the reserve without re-caching means bytes funded by one
per-node objcg's slow path can be consumed/freed under a different
sibling, leaving sub-page residue on whichever sibling was cached at
drain time. The pre-existing obj_cgroup_release() path would WARN and
silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
bytes per memcg lifecycle from the page_counter. Forward the residue
into a per-node objcg of the same (post-reparent) memcg at release time
instead, so it can be reconciled later via a refill atomic_xchg or
another release; the chain terminates at root_mem_cgroup, whose
page_counter has no enforced limit.

Please note that this is temporary fix and will be reverted when
per-node kmem accounting is introduced.

Update the stale invariant comment on __account_obj_stock().

Qi Zheng built a specialized reproducer [1] for the corner case and
confirmed the fix.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Debugged-by: Qi Zheng <qi.zheng@linux.dev>
---

Changes since v2:
https://lore.kernel.org/20260517194308.952655-1-shakeel.butt@linux.dev/
- Instead of handling sub-page charged residue at refill time, let's handle it
  at the obj_cgroup_release time.

Changes since v1:
https://lore.kernel.org/20260515171953.2224503-1-shakeel.butt@linux.dev/
- Fix the rcu warning (Sashiko).
- Fix the page counter possible underflow warning (Sashiko).

 mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 51 insertions(+), 18 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d978e18b9b2d..a547ec7c42d1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -142,14 +142,24 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
 	unsigned int nr_bytes;
 	unsigned int nr_pages;
+	unsigned int sub_bytes;
 	unsigned long flags;
 
 	/*
-	 * At this point all allocated objects are freed, and
-	 * objcg->nr_charged_bytes can't have an arbitrary byte value.
-	 * However, it can be PAGE_SIZE or (x * PAGE_SIZE).
+	 * At this point all allocated objects are freed, but
+	 * objcg->nr_charged_bytes can still hold either
+	 *   - (x * PAGE_SIZE)  if a small-alloc/drain race left whole pages
+	 *     stranded (see the historical sequence below), or
+	 *   - any sub-page residue, now that the stock is keyed by memcg and
+	 *     sibling per-node objcgs share its reserve: bytes consumed by
+	 *     one sibling can spill into another sibling's nr_charged_bytes
+	 *     when the stock is drained.
 	 *
-	 * The following sequence can lead to it:
+	 * Uncharge the page-aligned portion from this objcg's (post-reparent)
+	 * memcg, and forward any sub-page residue into a per-node objcg of
+	 * the same memcg so it can be reconciled later instead of being lost.
+	 *
+	 * Historical race producing the (x * PAGE_SIZE) case:
 	 * 1) CPU0: objcg == stock->cached_objcg
 	 * 2) CPU1: we do a small allocation (e.g. 92 bytes),
 	 *          PAGE_SIZE bytes are charged
@@ -160,23 +170,33 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	 *          92 bytes are added to stock->nr_bytes
 	 * 6) CPU0: stock is flushed,
 	 *          92 bytes are added to objcg->nr_charged_bytes
-	 *
-	 * In the result, nr_charged_bytes == PAGE_SIZE.
-	 * This page will be uncharged in obj_cgroup_release().
 	 */
 	nr_bytes = atomic_read(&objcg->nr_charged_bytes);
-	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
 	nr_pages = nr_bytes >> PAGE_SHIFT;
+	sub_bytes = nr_bytes & (PAGE_SIZE - 1);
 
-	if (nr_pages) {
+	if (nr_pages || sub_bytes) {
 		struct mem_cgroup *memcg;
 
-		memcg = get_mem_cgroup_from_objcg(objcg);
-		mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
-		memcg1_account_kmem(memcg, -nr_pages);
-		if (!mem_cgroup_is_root(memcg))
-			memcg_uncharge(memcg, nr_pages);
-		mem_cgroup_put(memcg);
+		rcu_read_lock();
+		memcg = obj_cgroup_memcg(objcg);
+
+		if (nr_pages) {
+			mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
+			memcg1_account_kmem(memcg, -nr_pages);
+			if (!mem_cgroup_is_root(memcg))
+				memcg_uncharge(memcg, nr_pages);
+		}
+
+		if (sub_bytes && !mem_cgroup_is_root(memcg)) {
+			struct obj_cgroup *fwd;
+
+			fwd = rcu_dereference(
+				memcg->nodeinfo[numa_node_id()]->objcg);
+			if (fwd)
+				atomic_add(sub_bytes, &fwd->nr_charged_bytes);
+		}
+		rcu_read_unlock();
 	}
 
 	spin_lock_irqsave(&objcg_lock, flags);
@@ -3152,7 +3172,12 @@ static void unlock_stock(struct obj_stock_pcp *stock)
 		local_unlock(&obj_stock.lock);
 }
 
-/* Call after __refill_obj_stock() to ensure stock->cached_objg == objcg */
+/*
+ * Call after __consume_obj_stock() / __refill_obj_stock(). The stock may be
+ * cached for a sibling per-node objcg of the same memcg; in that case the
+ * vmstat batching slot does not match objcg and we fallthrough to the
+ * direct path.
+ */
 static void __account_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock, int nr,
 				struct pglist_data *pgdat, enum node_stat_item idx)
@@ -3210,7 +3235,11 @@ static bool __consume_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock,
 				unsigned int nr_bytes)
 {
-	if (objcg == READ_ONCE(stock->cached_objcg) &&
+	struct obj_cgroup *cached = READ_ONCE(stock->cached_objcg);
+
+	/* Sibling per-node objcgs share the reserve. */
+	if ((cached == objcg ||
+	     (cached && READ_ONCE(cached->memcg) == READ_ONCE(objcg->memcg))) &&
 	    stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		return true;
@@ -3318,6 +3347,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 			       unsigned int nr_bytes,
 			       bool allow_uncharge)
 {
+	struct obj_cgroup *cached;
 	unsigned int nr_pages = 0;
 
 	if (!stock) {
@@ -3327,7 +3357,10 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 		goto out;
 	}
 
-	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
+	cached = READ_ONCE(stock->cached_objcg);
+	/* Direct READ_ONCE due to just pointer comparison. */
+	if (cached != objcg &&
+	    (!cached || READ_ONCE(cached->memcg) != READ_ONCE(objcg->memcg))) {
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-18 22:28 [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer Shakeel Butt
@ 2026-05-18 23:41 ` Shakeel Butt
  2026-05-19  3:35   ` Qi Zheng
  2026-05-19  6:46   ` Harry Yoo
  0 siblings, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2026-05-18 23:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot

On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> per-node type") split a memcg's single obj_cgroup into one per NUMA
> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
> pointer. Cross-NUMA workloads now see a drain on every refill and a
> miss on every consume that targets a sibling per-node objcg of the
> same memcg, producing the 67.7% stress-ng switch-mq regression
> reported by LKP.
> 
> stock->nr_bytes are fungible across per-node objcgs of one memcg.
> Treat the cache as keyed by memcg in __consume_obj_stock() and
> __refill_obj_stock() so siblings share the reserve. Compare via
> READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
> the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
> 
> Sharing the reserve without re-caching means bytes funded by one
> per-node objcg's slow path can be consumed/freed under a different
> sibling, leaving sub-page residue on whichever sibling was cached at
> drain time. The pre-existing obj_cgroup_release() path would WARN and
> silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
> bytes per memcg lifecycle from the page_counter. Forward the residue
> into a per-node objcg of the same (post-reparent) memcg at release time
> instead, so it can be reconciled later via a refill atomic_xchg or
> another release; the chain terminates at root_mem_cgroup, whose
> page_counter has no enforced limit.
> 
> Please note that this is temporary fix and will be reverted when
> per-node kmem accounting is introduced.
> 
> Update the stale invariant comment on __account_obj_stock().
> 
> Qi Zheng built a specialized reproducer [1] for the corner case and
> confirmed the fix.
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Debugged-by: Qi Zheng <qi.zheng@linux.dev>

Sashiko's review below and my response.


> 
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s)
> to consider:
> - [High] Race condition during memcg offlining causes a permanent page counter
> leak when forwarding `sub_bytes`.
> - [High] Unbounded accumulation of `sub_bytes` in a non-allocating parent
> memcg causes a permanent page counter leak and integer overflow.
> --
> 
> commit e90ccbfce3bb9538bf1cdf185a1c10a7269e0939
> Author: Shakeel Butt <shakeel.butt@linux.dev>
> 
> memcg: cache obj_stock by memcg, not by objcg pointer
> 
> This commit changes obj_stock_pcp caching from pointer-based to memcg-based
> so sibling per-node objcgs share the reserve. It also forwards sub-page
> residue into a per-node objcg of the same memcg during release so it can be
> reconciled later instead of being lost.
> 
> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/
> [1]
> 
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >  index d978e18b9b2d0..a547ec7c42d16 100644
> >  --- a/mm/memcontrol.c
> >  +++ b/mm/memcontrol.c
> > 
> [ ... ]
> 
> > 
> > @@ -160,23 +170,33 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> >  */
> >  nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> >  - WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> >  nr_pages = nr_bytes >> PAGE_SHIFT;
> >  + sub_bytes = nr_bytes & (PAGE_SIZE - 1);
> >  
> >  - if (nr_pages) {
> >  + if (nr_pages || sub_bytes) {
> >  struct mem_cgroup *memcg;
> >  
> >  - memcg = get_mem_cgroup_from_objcg(objcg);
> >  - mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
> >  - memcg1_account_kmem(memcg, -nr_pages);
> >  - if (!mem_cgroup_is_root(memcg))
> >  - memcg_uncharge(memcg, nr_pages);
> >  - mem_cgroup_put(memcg);
> >  + rcu_read_lock();
> >  + memcg = obj_cgroup_memcg(objcg);
> >  +
> >  + if (nr_pages) {
> >  + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
> >  + memcg1_account_kmem(memcg, -nr_pages);
> >  + if (!mem_cgroup_is_root(memcg))
> >  + memcg_uncharge(memcg, nr_pages);
> >  + }
> >  +
> >  + if (sub_bytes && !mem_cgroup_is_root(memcg)) {
> >  + struct obj_cgroup *fwd;
> >  +
> >  + fwd = rcu_dereference(
> >  + memcg->nodeinfo[numa_node_id()]->objcg);
> >  + if (fwd)
> > 
> Could there be a race condition during memcg offlining here that leads to a
> page counter leak?
> 
> If this executes concurrently with the offlining of the memcg,
> memcg_reparent_objcgs() might have already replaced the nodeinfo objcg
> pointer with NULL. In this race window, fwd evaluates to NULL, and the
> sub_bytes would be dropped.
> 
> Since they are dropped without being uncharged or forwarded, it seems the
> page counter would permanently leak these charges.

This is not much a concern as it is bounded (i.e. nr_nodes * PAGE_SIZE - 1) and
previously for the warning case, we were already dropping the sub-page charges.

> 
> > 
> > + atomic_add(sub_bytes, &fwd->nr_charged_bytes);
> > 
> Does this unbounded addition create a risk of integer overflow and page
> counter leaks if the parent memcg never allocates?
> 
> If the parent memcg is used strictly as a hierarchical container (like
> system.slice) and performs no kernel allocations, its objcg is never cached
> in any CPU's obj_stock_pcp.
> 
> Because __refill_obj_stock() relies on the cache to periodically drain
> nr_charged_bytes, it would never be invoked for the parent. Could child
> residues continuously accumulate in the parent's nr_charged_bytes without
> ever being uncharged, eventually overflowing the 32-bit atomic_t?
> 

This can be a concern for a long running cgroup. However fixing this would add
complexity not worth it. This is a temporary fix and will be reverted in newer
kernels.

> > 
> > + }
> >  + rcu_read_unlock();
> >  }
> > 
> -- 
> Sashiko AI review ·
> https://sashiko.dev/#/patchset/20260518222827.110696-1-shakeel.butt@linux.dev?part=1
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-18 23:41 ` Shakeel Butt
@ 2026-05-19  3:35   ` Qi Zheng
  2026-05-19  6:46   ` Harry Yoo
  1 sibling, 0 replies; 10+ messages in thread
From: Qi Zheng @ 2026-05-19  3:35 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Alexandre Ghiti, Joshua Hahn, Meta kernel team, linux-mm, cgroups,
	linux-kernel, kernel test robot



On 5/19/26 7:41 AM, Shakeel Butt wrote:
> On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
>> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
>> per-node type") split a memcg's single obj_cgroup into one per NUMA
>> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
>> pointer. Cross-NUMA workloads now see a drain on every refill and a
>> miss on every consume that targets a sibling per-node objcg of the
>> same memcg, producing the 67.7% stress-ng switch-mq regression
>> reported by LKP.
>>
>> stock->nr_bytes are fungible across per-node objcgs of one memcg.
>> Treat the cache as keyed by memcg in __consume_obj_stock() and
>> __refill_obj_stock() so siblings share the reserve. Compare via
>> READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
>> the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
>>
>> Sharing the reserve without re-caching means bytes funded by one
>> per-node objcg's slow path can be consumed/freed under a different
>> sibling, leaving sub-page residue on whichever sibling was cached at
>> drain time. The pre-existing obj_cgroup_release() path would WARN and
>> silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
>> bytes per memcg lifecycle from the page_counter. Forward the residue
>> into a per-node objcg of the same (post-reparent) memcg at release time
>> instead, so it can be reconciled later via a refill atomic_xchg or
>> another release; the chain terminates at root_mem_cgroup, whose
>> page_counter has no enforced limit.
>>
>> Please note that this is temporary fix and will be reverted when
>> per-node kmem accounting is introduced.
>>
>> Update the stale invariant comment on __account_obj_stock().
>>
>> Qi Zheng built a specialized reproducer [1] for the corner case and
>> confirmed the fix.
>>
>> Reported-by: kernel test robot <oliver.sang@intel.com>
>> Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
>> Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
>> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
>> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>> Debugged-by: Qi Zheng <qi.zheng@linux.dev>
> 
> Sashiko's review below and my response.
> 
> 
>>
>> Thank you for your contribution! Sashiko AI review found 2 potential issue(s)
>> to consider:
>> - [High] Race condition during memcg offlining causes a permanent page counter
>> leak when forwarding `sub_bytes`.
>> - [High] Unbounded accumulation of `sub_bytes` in a non-allocating parent
>> memcg causes a permanent page counter leak and integer overflow.
>> --
>>
>> commit e90ccbfce3bb9538bf1cdf185a1c10a7269e0939
>> Author: Shakeel Butt <shakeel.butt@linux.dev>
>>
>> memcg: cache obj_stock by memcg, not by objcg pointer
>>
>> This commit changes obj_stock_pcp caching from pointer-based to memcg-based
>> so sibling per-node objcgs share the reserve. It also forwards sub-page
>> residue into a per-node objcg of the same memcg during release so it can be
>> reconciled later instead of being lost.
>>
>> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/
>> [1]
>>
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>   index d978e18b9b2d0..a547ec7c42d16 100644
>>>   --- a/mm/memcontrol.c
>>>   +++ b/mm/memcontrol.c
>>>
>> [ ... ]
>>
>>>
>>> @@ -160,23 +170,33 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>>>   */
>>>   nr_bytes = atomic_read(&objcg->nr_charged_bytes);
>>>   - WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
>>>   nr_pages = nr_bytes >> PAGE_SHIFT;
>>>   + sub_bytes = nr_bytes & (PAGE_SIZE - 1);
>>>   
>>>   - if (nr_pages) {
>>>   + if (nr_pages || sub_bytes) {
>>>   struct mem_cgroup *memcg;
>>>   
>>>   - memcg = get_mem_cgroup_from_objcg(objcg);
>>>   - mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
>>>   - memcg1_account_kmem(memcg, -nr_pages);
>>>   - if (!mem_cgroup_is_root(memcg))
>>>   - memcg_uncharge(memcg, nr_pages);
>>>   - mem_cgroup_put(memcg);
>>>   + rcu_read_lock();
>>>   + memcg = obj_cgroup_memcg(objcg);
>>>   +
>>>   + if (nr_pages) {
>>>   + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
>>>   + memcg1_account_kmem(memcg, -nr_pages);
>>>   + if (!mem_cgroup_is_root(memcg))
>>>   + memcg_uncharge(memcg, nr_pages);
>>>   + }
>>>   +
>>>   + if (sub_bytes && !mem_cgroup_is_root(memcg)) {
>>>   + struct obj_cgroup *fwd;
>>>   +
>>>   + fwd = rcu_dereference(
>>>   + memcg->nodeinfo[numa_node_id()]->objcg);
>>>   + if (fwd)
>>>
>> Could there be a race condition during memcg offlining here that leads to a
>> page counter leak?
>>
>> If this executes concurrently with the offlining of the memcg,
>> memcg_reparent_objcgs() might have already replaced the nodeinfo objcg
>> pointer with NULL. In this race window, fwd evaluates to NULL, and the
>> sub_bytes would be dropped.
>>
>> Since they are dropped without being uncharged or forwarded, it seems the
>> page counter would permanently leak these charges.
> 
> This is not much a concern as it is bounded (i.e. nr_nodes * PAGE_SIZE - 1) and
> previously for the warning case, we were already dropping the sub-page charges.
> 
>>
>>>
>>> + atomic_add(sub_bytes, &fwd->nr_charged_bytes);
>>>
>> Does this unbounded addition create a risk of integer overflow and page
>> counter leaks if the parent memcg never allocates?
>>
>> If the parent memcg is used strictly as a hierarchical container (like
>> system.slice) and performs no kernel allocations, its objcg is never cached
>> in any CPU's obj_stock_pcp.
>>
>> Because __refill_obj_stock() relies on the cache to periodically drain
>> nr_charged_bytes, it would never be invoked for the parent. Could child
>> residues continuously accumulate in the parent's nr_charged_bytes without
>> ever being uncharged, eventually overflowing the 32-bit atomic_t?
>>
> 
> This can be a concern for a long running cgroup. However fixing this would add
> complexity not worth it. This is a temporary fix and will be reverted in newer
> kernels.

I think this is fine as a temporary fix:

Acked-by: Qi Zheng <qi.zheng@linux.dev>

Thanks!

> 
>>>
>>> + }
>>>   + rcu_read_unlock();
>>>   }
>>>
>> -- 
>> Sashiko AI review ·
>> https://sashiko.dev/#/patchset/20260518222827.110696-1-shakeel.butt@linux.dev?part=1
>>



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-18 23:41 ` Shakeel Butt
  2026-05-19  3:35   ` Qi Zheng
@ 2026-05-19  6:46   ` Harry Yoo
  2026-05-19 14:02     ` Shakeel Butt
  1 sibling, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2026-05-19  6:46 UTC (permalink / raw)
  To: Shakeel Butt, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Alexandre Ghiti, Joshua Hahn, Meta kernel team,
	linux-mm, cgroups, linux-kernel, kernel test robot



On 5/19/26 8:41 AM, Shakeel Butt wrote:
> On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
>> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
>> per-node type") split a memcg's single obj_cgroup into one per NUMA
>> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
>> pointer. Cross-NUMA workloads now see a drain on every refill and a
>> miss on every consume that targets a sibling per-node objcg of the
>> same memcg, producing the 67.7% stress-ng switch-mq regression
>> reported by LKP.
>>
>> stock->nr_bytes are fungible across per-node objcgs of one memcg.
>> Treat the cache as keyed by memcg in __consume_obj_stock() and
>> __refill_obj_stock() so siblings share the reserve. Compare via
>> READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
>> the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
>>
>> Sharing the reserve without re-caching means bytes funded by one
>> per-node objcg's slow path can be consumed/freed under a different
>> sibling, leaving sub-page residue on whichever sibling was cached at
>> drain time. The pre-existing obj_cgroup_release() path would WARN and
>> silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
>> bytes per memcg lifecycle from the page_counter. Forward the residue
>> into a per-node objcg of the same (post-reparent) memcg at release time
>> instead, so it can be reconciled later via a refill atomic_xchg or
>> another release; the chain terminates at root_mem_cgroup, whose
>> page_counter has no enforced limit.
>>
>> Please note that this is temporary fix and will be reverted when
>> per-node kmem accounting is introduced.

... because once per-node kmem accounting is introduced,
"stock->nr_bytes are fungible across per-node objcgs of one memcg"
no longer holds?

And the follow-up plain is to revert this and address it with a 
multi-objcg percpu stock [1], similar to a multi-memcg percpu charge 
cache we have now, right? (regardless of per-node kmem accounting's 
progress)

If this temporary fix imposes other potential correctness issues, would 
it make sense to land [1] in mainline before the next LTS release and 
skip this temporary fix?

[1] https://lore.kernel.org/oe-lkp/agtPMpQK2jXdQAY4@linux.dev

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-19  6:46   ` Harry Yoo
@ 2026-05-19 14:02     ` Shakeel Butt
  2026-05-19 15:00       ` Harry Yoo
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2026-05-19 14:02 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Tue, May 19, 2026 at 03:46:51PM +0900, Harry Yoo wrote:
> 
> 
> On 5/19/26 8:41 AM, Shakeel Butt wrote:
> > On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
> > > Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> > > per-node type") split a memcg's single obj_cgroup into one per NUMA
> > > node, but the per-CPU obj_stock_pcp still keys cached_objcg by
> > > pointer. Cross-NUMA workloads now see a drain on every refill and a
> > > miss on every consume that targets a sibling per-node objcg of the
> > > same memcg, producing the 67.7% stress-ng switch-mq regression
> > > reported by LKP.
> > > 
> > > stock->nr_bytes are fungible across per-node objcgs of one memcg.
> > > Treat the cache as keyed by memcg in __consume_obj_stock() and
> > > __refill_obj_stock() so siblings share the reserve. Compare via
> > > READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
> > > the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
> > > 
> > > Sharing the reserve without re-caching means bytes funded by one
> > > per-node objcg's slow path can be consumed/freed under a different
> > > sibling, leaving sub-page residue on whichever sibling was cached at
> > > drain time. The pre-existing obj_cgroup_release() path would WARN and
> > > silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
> > > bytes per memcg lifecycle from the page_counter. Forward the residue
> > > into a per-node objcg of the same (post-reparent) memcg at release time
> > > instead, so it can be reconciled later via a refill atomic_xchg or
> > > another release; the chain terminates at root_mem_cgroup, whose
> > > page_counter has no enforced limit.
> > > 
> > > Please note that this is temporary fix and will be reverted when
> > > per-node kmem accounting is introduced.
> 
> ... because once per-node kmem accounting is introduced,
> "stock->nr_bytes are fungible across per-node objcgs of one memcg"
> no longer holds?

Yes

> 
> And the follow-up plain is to revert this and address it with a multi-objcg
> percpu stock [1], similar to a multi-memcg percpu charge cache we have now,
> right? (regardless of per-node kmem accounting's progress)
> 

Yes

> If this temporary fix imposes other potential correctness issues, would it
> make sense to land [1] in mainline before the next LTS release and skip this
> temporary fix?
> 
> [1] https://lore.kernel.org/oe-lkp/agtPMpQK2jXdQAY4@linux.dev
> 

The full clean solution might take one more cycle and I think we can not just
ignore 67% regression on 7.1.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-19 14:02     ` Shakeel Butt
@ 2026-05-19 15:00       ` Harry Yoo
  2026-05-19 20:11         ` Shakeel Butt
  0 siblings, 1 reply; 10+ messages in thread
From: Harry Yoo @ 2026-05-19 15:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot



On 5/19/26 11:02 PM, Shakeel Butt wrote:
> On Tue, May 19, 2026 at 03:46:51PM +0900, Harry Yoo wrote:
>>
>>
>> On 5/19/26 8:41 AM, Shakeel Butt wrote:
>>> On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
>>>> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
>>>> per-node type") split a memcg's single obj_cgroup into one per NUMA
>>>> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
>>>> pointer. Cross-NUMA workloads now see a drain on every refill and a
>>>> miss on every consume that targets a sibling per-node objcg of the
>>>> same memcg, producing the 67.7% stress-ng switch-mq regression
>>>> reported by LKP.
>>>>
>>>> stock->nr_bytes are fungible across per-node objcgs of one memcg.
>>>> Treat the cache as keyed by memcg in __consume_obj_stock() and
>>>> __refill_obj_stock() so siblings share the reserve. Compare via
>>>> READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
>>>> the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
>>>>
>>>> Sharing the reserve without re-caching means bytes funded by one
>>>> per-node objcg's slow path can be consumed/freed under a different
>>>> sibling, leaving sub-page residue on whichever sibling was cached at
>>>> drain time. The pre-existing obj_cgroup_release() path would WARN and
>>>> silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
>>>> bytes per memcg lifecycle from the page_counter. Forward the residue
>>>> into a per-node objcg of the same (post-reparent) memcg at release time
>>>> instead, so it can be reconciled later via a refill atomic_xchg or
>>>> another release; the chain terminates at root_mem_cgroup, whose
>>>> page_counter has no enforced limit.
>>>>
>>>> Please note that this is temporary fix and will be reverted when
>>>> per-node kmem accounting is introduced.
>>
>> ... because once per-node kmem accounting is introduced,
>> "stock->nr_bytes are fungible across per-node objcgs of one memcg"
>> no longer holds?
> 
> Yes
> 
>>
>> And the follow-up plain is to revert this and address it with a multi-objcg
>> percpu stock [1], similar to a multi-memcg percpu charge cache we have now,
>> right? (regardless of per-node kmem accounting's progress)
>>
> 
> Yes

Thanks for confirming!

>> If this temporary fix imposes other potential correctness issues, would it
>> make sense to land [1] in mainline before the next LTS release and skip this
>> temporary fix?
>>
>> [1] https://lore.kernel.org/oe-lkp/agtPMpQK2jXdQAY4@linux.dev
>>
> 
> The full clean solution might take one more cycle and I think we can not just
> ignore 67% regression on 7.1.

That is valid point, unfortunately.

One more thing I have to ask... for v7.1, wouldn't it be a safer option 
to revert the per-node object change and re-introduce it once we have a 
cleaner solution?

This change was introduced in v5, but the implementation before v4 had 
been exposed in -next for a while, and I think we don't have enough 
justification to keep the per-node objcgs change, at least for v7.1, 
given that we have an unexpected last-minute regression and
correctness concerns (albeit slight).

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-19 15:00       ` Harry Yoo
@ 2026-05-19 20:11         ` Shakeel Butt
  2026-05-19 20:49           ` Andrew Morton
  2026-05-19 23:39           ` Harry Yoo
  0 siblings, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2026-05-19 20:11 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Wed, May 20, 2026 at 12:00:16AM +0900, Harry Yoo wrote:
> 
> 
[...]
> > 
> > The full clean solution might take one more cycle and I think we can not just
> > ignore 67% regression on 7.1.
> 
> That is valid point, unfortunately.
> 
> One more thing I have to ask... for v7.1, wouldn't it be a safer option to
> revert the per-node object change and re-introduce it once we have a cleaner
> solution?

The issue with that revert is that we reintroduce all node lru locking in the
objcg reparenting path.

> 
> This change was introduced in v5, but the implementation before v4 had been
> exposed in -next for a while, and I think we don't have enough justification
> to keep the per-node objcgs change, at least for v7.1, given that we have an
> unexpected last-minute regression and
> correctness concerns (albeit slight).

I am waiting for Oliver to test the multi-objcg patch I sent. If that also
resolves the regression then we have one more option i.e. backport that to 7.1
to fix the regression.

So to summarize, for future kernels we will be having multi-objcg in some form.
For 7.1, we have to decide between (1) do nothing (2) this patch or (3) backport
the multi-objcg path to 7.1.

Andrew, please don't send this patch to Linus until we decide on the option.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-19 20:11         ` Shakeel Butt
@ 2026-05-19 20:49           ` Andrew Morton
  2026-05-22 16:16             ` Shakeel Butt
  2026-05-19 23:39           ` Harry Yoo
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-05-19 20:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Harry Yoo, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Tue, 19 May 2026 13:11:13 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Andrew, please don't send this patch to Linus until we decide on the option.

No probs, I added a note-to-self.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-19 20:49           ` Andrew Morton
@ 2026-05-22 16:16             ` Shakeel Butt
  0 siblings, 0 replies; 10+ messages in thread
From: Shakeel Butt @ 2026-05-22 16:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Harry Yoo, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Tue, May 19, 2026 at 01:49:27PM -0700, Andrew Morton wrote:
> On Tue, 19 May 2026 13:11:13 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Andrew, please don't send this patch to Linus until we decide on the option.
> 
> No probs, I added a note-to-self.

Here is my suggestion on how to proceed.

1. Let's drop this patch (memcg: cache obj_stock by memcg, not by objcg pointer)
   from mm-tree.

2. Let's not add anything to 7.1.

3. 7.2+ will have the multi-objcg series [1] and the patches will have fixes
   tag. If someone uses non-LTS 7.1, they can easily find those patches and can
   backport them.

Please let me know if there are any concerns.

[1] https://lore.kernel.org/20260522011908.1669332-1-shakeel.butt@linux.dev/

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-19 20:11         ` Shakeel Butt
  2026-05-19 20:49           ` Andrew Morton
@ 2026-05-19 23:39           ` Harry Yoo
  1 sibling, 0 replies; 10+ messages in thread
From: Harry Yoo @ 2026-05-19 23:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Qi Zheng, Alexandre Ghiti, Joshua Hahn,
	Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot



On 5/20/26 5:11 AM, Shakeel Butt wrote:
> On Wed, May 20, 2026 at 12:00:16AM +0900, Harry Yoo wrote:
>>
>>
> [...]
>>>
>>> The full clean solution might take one more cycle and I think we can not just
>>> ignore 67% regression on 7.1.
>>
>> That is valid point, unfortunately.
>>
>> One more thing I have to ask... for v7.1, wouldn't it be a safer option to
>> revert the per-node object change and re-introduce it once we have a cleaner
>> solution?
> 
> The issue with that revert is that we reintroduce all node lru locking in the
> objcg reparenting path.

I'm not sure the problems with all-node locking are serious enough to 
rule it out as an option for 7.1.

It is not ideal, but given that the critical section for reparenting is 
independent of folio count, would this actually be a significant problem 
in practice? (even large servers rarely go beyond 8 NUMA nodes...)

>> This change was introduced in v5, but the implementation before v4 had been
>> exposed in -next for a while, and I think we don't have enough justification
>> to keep the per-node objcgs change, at least for v7.1, given that we have an
>> unexpected last-minute regression and
>> correctness concerns (albeit slight).
> 
> I am waiting for Oliver to test the multi-objcg patch I sent. If that also
> resolves the regression then we have one more option i.e. backport that to 7.1
> to fix the regression.

Yeah, If per-node objcg is required in future kernels anyway, this 
option would be more ideal (if available).

> So to summarize, for future kernels we will be having multi-objcg in some form.
> For 7.1, we have to decide between (1) do nothing (2) this patch or (3) backport
> the multi-objcg path to 7.1.

Ack.

Thanks!

-- 
Cheers,
Harry / Hyeonggon



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-05-22 16:16 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-18 22:28 [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer Shakeel Butt
2026-05-18 23:41 ` Shakeel Butt
2026-05-19  3:35   ` Qi Zheng
2026-05-19  6:46   ` Harry Yoo
2026-05-19 14:02     ` Shakeel Butt
2026-05-19 15:00       ` Harry Yoo
2026-05-19 20:11         ` Shakeel Butt
2026-05-19 20:49           ` Andrew Morton
2026-05-22 16:16             ` Shakeel Butt
2026-05-19 23:39           ` Harry Yoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox