Linux cgroups development
* [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer
@ 2026-05-15 17:19 Shakeel Butt
  2026-05-15 18:42 ` Shakeel Butt
  2026-05-16  6:58 ` [syzbot ci] " syzbot ci
  0 siblings, 2 replies; 3+ messages in thread
From: Shakeel Butt @ 2026-05-15 17:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node, but the per-CPU obj_stock_pcp still keys cached_objcg by
pointer. Cross-NUMA workloads now see a drain on every refill and a
miss on every consume that targets a sibling per-node objcg of the
same memcg, producing the 67.7% stress-ng switch-mq regression
reported by LKP.

stock->nr_bytes are fungible across per-node objcgs of one memcg:
drain_obj_stock() and obj_cgroup_uncharge_pages() both account via
obj_cgroup_memcg(). Treat the cache as keyed by memcg in both
__consume_obj_stock() and __refill_obj_stock() so siblings share the
reserve -- eliminating the drain on free and keeping the alloc fast
path in consume.

The kernel test robot reported the regression, but it was not easy to
reproduce locally. Qi implemented [1] a specialized reproducer that
exercises the corner case causing the regression, then tested the patch
and confirmed the corner case is eliminated.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Debugged-by: Qi Zheng <qi.zheng@linux.dev>
Tested-by: Qi Zheng <qi.zheng@linux.dev>
---
 mm/memcontrol.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d978e18b9b2d..66448f428531 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3210,7 +3210,11 @@ static bool __consume_obj_stock(struct obj_cgroup *objcg,
 				struct obj_stock_pcp *stock,
 				unsigned int nr_bytes)
 {
-	if (objcg == READ_ONCE(stock->cached_objcg) &&
+	struct obj_cgroup *cached = READ_ONCE(stock->cached_objcg);
+
+	/* Cache is keyed by memcg; sibling per-node objcgs share the reserve. */
+	if ((cached == objcg ||
+	     (cached && obj_cgroup_memcg(cached) == obj_cgroup_memcg(objcg))) &&
 	    stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		return true;
@@ -3318,6 +3322,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 			       unsigned int nr_bytes,
 			       bool allow_uncharge)
 {
+	struct obj_cgroup *cached;
 	unsigned int nr_pages = 0;
 
 	if (!stock) {
@@ -3327,7 +3332,10 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
 		goto out;
 	}
 
-	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
+	cached = READ_ONCE(stock->cached_objcg);
+	/* Same memcg: bytes are fungible, no drain needed. */
+	if (cached != objcg &&
+	    (!cached || obj_cgroup_memcg(cached) != obj_cgroup_memcg(objcg))) {
 		drain_obj_stock(stock);
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
-- 
2.53.0-Meta



* Re: [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-15 17:19 [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer Shakeel Butt
@ 2026-05-15 18:42 ` Shakeel Butt
  2026-05-16  6:58 ` [syzbot ci] " syzbot ci
  1 sibling, 0 replies; 3+ messages in thread
From: Shakeel Butt @ 2026-05-15 18:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Qi Zheng, Meta kernel team, linux-mm, cgroups, linux-kernel,
	kernel test robot

On Fri, May 15, 2026 at 10:19:53AM -0700, Shakeel Butt wrote:
> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> per-node type") split a memcg's single obj_cgroup into one per NUMA
> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
> pointer. Cross-NUMA workloads now see a drain on every refill and a
> miss on every consume that targets a sibling per-node objcg of the
> same memcg, producing the 67.7% stress-ng switch-mq regression
> reported by LKP.
> 
> stock->nr_bytes are fungible across per-node objcgs of one memcg:
> drain_obj_stock() and obj_cgroup_uncharge_pages() both account via
> obj_cgroup_memcg(). Treat the cache as keyed by memcg in both
> __consume_obj_stock() and __refill_obj_stock() so siblings share the
> reserve -- eliminating the drain on free and keeping the alloc fast
> path in consume.
> 
> The kernel test robot reported the regression, but it was not easy to
> reproduce locally. Qi implemented [1] a specialized reproducer that
> exercises the corner case causing the regression, then tested the patch
> and confirmed the corner case is eliminated.
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Debugged-by: Qi Zheng <qi.zheng@linux.dev>
> Tested-by: Qi Zheng <qi.zheng@linux.dev>

Pasting the Sashiko review here as it raises some interesting points:

May 15, 2026 at 10:52 AM, sashiko-bot@kernel.org wrote:


> 
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [High] Fungible sharing of stock reserve bytes between per-node objcgs breaks the obj_cgroup_release invariant, causing permanent kmem charge leaks and WARN_ON_ONCE.
> - [High] Missing update to `__account_obj_stock()` bypasses vmstat batching for sibling objcgs, causing severe lock and atomic contention.
> - [High] Calling `obj_cgroup_memcg()` without required RCU or cgroup locks triggers lockdep splats on fast paths.
> --
> 
> commit e77fe5b6af93b69e06c1ccd13f83e26b59234483
> Author: Shakeel Butt <shakeel.butt@linux.dev>
> 
> memcg: cache obj_stock by memcg, not by objcg pointer
> 
> This commit modifies the per-CPU object stock caching to share the reserve
> among per-node sibling obj_cgroups of the same memcg, aiming to reduce
> drains and refills on cross-NUMA workloads.
> 
> Link: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/
> 
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >  index d978e18b9b2d0..66448f4285316 100644
> >  --- a/mm/memcontrol.c
> >  +++ b/mm/memcontrol.c
> >  @@ -3210,7 +3210,11 @@ static bool __consume_obj_stock(struct obj_cgroup *objcg,
> >  struct obj_stock_pcp *stock,
> >  unsigned int nr_bytes)
> >  {
> >  - if (objcg == READ_ONCE(stock->cached_objcg) &&
> >  + struct obj_cgroup *cached = READ_ONCE(stock->cached_objcg);
> >  +
> >  + /* Cache is keyed by memcg; sibling per-node objcgs share the reserve. */
> >  + if ((cached == objcg ||
> >  + (cached && obj_cgroup_memcg(cached) == obj_cgroup_memcg(objcg))) &&
> > 
> Does calling obj_cgroup_memcg() here trigger lockdep warnings on slab fast
> paths?
> 
> Looking at obj_cgroup_memcg(), it expects rcu_read_lock_held() or
> lockdep_is_held(&cgroup_mutex) to be true.
> 
> Since this function is called from slab fast paths where only the local
> CPU stock lock is held, the RCU read lock may not be held.
> 
> Could this check directly compare READ_ONCE(cached->memcg) and
> READ_ONCE(objcg->memcg) to avoid the lockdep assertion?

Yes, this makes sense as we are not really dereferencing the memcg here.

> 
> > 
> > stock->nr_bytes >= nr_bytes) {
> >  stock->nr_bytes -= nr_bytes;
> >  return true;
> > 
> Will sharing the reserve between per-node sibling objcgs without updating
> stock->cached_objcg break the page multiple invariant in
> obj_cgroup_release()?
> 
> If an allocation for objcg_B consumes bytes originally funded by objcg_A,
> and the stock is later drained, those borrowed bytes are flushed into
> objcg_A->nr_charged_bytes.
> 
> When obj_cgroup_release() is invoked, nr_charged_bytes will not be an
> exact multiple of PAGE_SIZE. Will this trigger
> WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)) and truncate the remainder,
> permanently leaking the page charge from the memcg?

This is actually a very good point and needs more thought.

> 
> [ ... ]
> 
> > 
> > @@ -3318,6 +3322,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> >  unsigned int nr_bytes,
> >  bool allow_uncharge)
> >  {
> > 
> [ ... ]
> 
> > 
> > - if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> >  + cached = READ_ONCE(stock->cached_objcg);
> >  + /* Same memcg: bytes are fungible, no drain needed. */
> >  + if (cached != objcg &&
> >  + (!cached || obj_cgroup_memcg(cached) != obj_cgroup_memcg(objcg))) {
> >  drain_obj_stock(stock);
> > 
> Since the stock->cached_objcg pointer is not updated when a sibling is
> already cached, does this bypass the vmstat batching in
> __account_obj_stock()?
> 
> If __account_obj_stock() still requires strict pointer equality
> (READ_ONCE(stock->cached_objcg) == objcg), alternating cross-node
> allocations will continually fail that match.
> 
> Will this force a fallback to direct mod_objcg_mlstate() updates on every
> object allocation and free, causing severe cacheline contention?

This is not that worrying. Yes, the stats update will take the slower path.
This can be fixed when we add support for multiple objcgs in the per-cpu stock.

> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260515171953.2224503-1-shakeel.butt@linux.dev?part=1
>



* [syzbot ci] Re: memcg: cache obj_stock by memcg, not by objcg pointer
  2026-05-15 17:19 [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer Shakeel Butt
  2026-05-15 18:42 ` Shakeel Butt
@ 2026-05-16  6:58 ` syzbot ci
  1 sibling, 0 replies; 3+ messages in thread
From: syzbot ci @ 2026-05-16  6:58 UTC (permalink / raw)
  To: akpm, cgroups, hannes, kernel-team, linux-kernel, linux-mm,
	mhocko, muchun.song, oliver.sang, qi.zheng, roman.gushchin,
	shakeel.butt
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] memcg: cache obj_stock by memcg, not by objcg pointer
https://lore.kernel.org/all/20260515171953.2224503-1-shakeel.butt@linux.dev
* [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer

and found the following issue:
WARNING in __refill_obj_stock

Full report is available here:
https://ci.syzbot.org/series/8efc6e46-4b2e-43ab-90a0-62552bdc14a6

***

WARNING in __refill_obj_stock

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      0cec77cfd5314c0b3b03530abe1a4b32e991f639
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/fda0a69d-af56-4b9f-8b30-b63ea4756923/config

------------[ cut here ]------------
debug_locks && !(rcu_read_lock_held() || lock_is_held(&(&cgroup_mutex)->dep_map))
WARNING: ./include/linux/memcontrol.h:380 at __refill_obj_stock+0x4fd/0x610, CPU#0: syz.1.48/5712
Modules linked in:

CPU: 0 UID: 0 PID: 5712 Comm: syz.1.48 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__refill_obj_stock+0x4fd/0x610
Code: 89 e7 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f 5d e9 d8 ba 00 00 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 90 <0f> 0b 90 e9 a8 fb ff ff 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c 60
RSP: 0018:ffffc9000483f4b0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88810beb0f80 RCX: 0000000080000001
RDX: 0000000000000110 RSI: ffffffff8e21e93e RDI: ffffffff8c28b8e0
RBP: 0000000000000001 R08: ffffffff8239a83c R09: ffff88812103c600
R10: dffffc0000000000 R11: ffffed102d89bae9 R12: 1ffff110242078c8
R13: ffff88810beb0d80 R14: dffffc0000000000 R15: ffff88812103c600
FS:  0000000000000000(0000) GS:ffff88818dc89000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f80ad948060 CR3: 000000000e74a000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __memcg_slab_free_hook+0x2ed/0x4b0
 kmem_cache_free+0x381/0x650
 __put_anon_vma+0x12b/0x2d0
 unlink_anon_vmas+0x58b/0x730
 free_pgtables+0x802/0xb40
 exit_mmap+0x490/0x9e0
 __mmput+0x118/0x430
 exit_mm+0x18e/0x250
 do_exit+0x6a2/0x22c0
 do_group_exit+0x21b/0x2d0
 get_signal+0x1284/0x1330
 arch_do_signal_or_restart+0xbc/0x840
 exit_to_user_mode_loop+0x8c/0x4d0
 do_syscall_64+0x33e/0xf80
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f6eea99ce59
Code: Unable to access opcode bytes at 0x7f6eea99ce2f.
RSP: 002b:00007f6eeb8240e8 EFLAGS: 00000246
 ORIG_RAX: 00000000000000ca
RAX: 0000000000000001 RBX: 00007f6eeac15fa8 RCX: 00007f6eea99ce59
RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00007f6eeac15fac
RBP: 00007f6eeac15fa0 R08: 3fffffffffffffff R09: 0000000000000000
R10: 0000200001000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f6eeac16038 R14: 00007fffdc60cea0 R15: 00007fffdc60cf88
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).

The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.

