* [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer
@ 2026-05-15 17:19 Shakeel Butt
2026-05-15 18:42 ` Shakeel Butt
From: Shakeel Butt @ 2026-05-15 17:19 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Qi Zheng, Meta kernel team, linux-mm, cgroups, linux-kernel,
kernel test robot
Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
per-node type") split a memcg's single obj_cgroup into one per NUMA
node, but the per-CPU obj_stock_pcp still keys cached_objcg by
pointer. Cross-NUMA workloads now see a drain on every refill and a
miss on every consume that targets a sibling per-node objcg of the
same memcg, producing the 67.7% stress-ng switch-mq regression
reported by LKP.
stock->nr_bytes are fungible across per-node objcgs of one memcg:
drain_obj_stock() and obj_cgroup_uncharge_pages() both account via
obj_cgroup_memcg(). Treat the cache as keyed by memcg in both
__consume_obj_stock() and __refill_obj_stock() so siblings share the
reserve -- eliminating the drain on free and keeping the alloc fast
path in consume.
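A minimal userspace model of the keying change (names here are
illustrative, not the kernel's):

	#include <stdbool.h>

	struct memcg;					/* opaque owner */
	struct objcg { struct memcg *memcg; };		/* per memcg per node */
	struct stock { struct objcg *cached; unsigned int nr_bytes; };

	/* Before: pointer equality only, so sibling per-node objcgs miss. */
	static bool hit_by_pointer(struct stock *s, struct objcg *o)
	{
		return s->cached == o;
	}

	/* After: any objcg of the same memcg hits the cached reserve. */
	static bool hit_by_memcg(struct stock *s, struct objcg *o)
	{
		return s->cached == o ||
		       (s->cached && s->cached->memcg == o->memcg);
	}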
The kernel test robot reported the regression, but it was not easy to
reproduce locally. Qi implemented [1] a specialized reproducer to
demonstrate the corner case that causes the regression, then tested the
patch and confirmed the corner case is eliminated.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Debugged-by: Qi Zheng <qi.zheng@linux.dev>
Tested-by: Qi Zheng <qi.zheng@linux.dev>
---
mm/memcontrol.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d978e18b9b2d..66448f428531 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3210,7 +3210,11 @@ static bool __consume_obj_stock(struct obj_cgroup *objcg,
struct obj_stock_pcp *stock,
unsigned int nr_bytes)
{
- if (objcg == READ_ONCE(stock->cached_objcg) &&
+ struct obj_cgroup *cached = READ_ONCE(stock->cached_objcg);
+
+ /* Cache is keyed by memcg; sibling per-node objcgs share the reserve. */
+ if ((cached == objcg ||
+ (cached && obj_cgroup_memcg(cached) == obj_cgroup_memcg(objcg))) &&
stock->nr_bytes >= nr_bytes) {
stock->nr_bytes -= nr_bytes;
return true;
@@ -3318,6 +3322,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
unsigned int nr_bytes,
bool allow_uncharge)
{
+ struct obj_cgroup *cached;
unsigned int nr_pages = 0;
if (!stock) {
@@ -3327,7 +3332,10 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
goto out;
}
- if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
+ cached = READ_ONCE(stock->cached_objcg);
+ /* Same memcg: bytes are fungible, no drain needed. */
+ if (cached != objcg &&
+ (!cached || obj_cgroup_memcg(cached) != obj_cgroup_memcg(objcg))) {
drain_obj_stock(stock);
obj_cgroup_get(objcg);
stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
--
2.53.0-Meta
* Re: [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer
2026-05-15 17:19 [PATCH] memcg: cache obj_stock by memcg, not by objcg pointer Shakeel Butt
@ 2026-05-15 18:42 ` Shakeel Butt
From: Shakeel Butt @ 2026-05-15 18:42 UTC
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Qi Zheng, Meta kernel team, linux-mm, cgroups, linux-kernel,
kernel test robot
On Fri, May 15, 2026 at 10:19:53AM -0700, Shakeel Butt wrote:
> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
> per-node type") split a memcg's single obj_cgroup into one per NUMA
> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
> pointer. Cross-NUMA workloads now see a drain on every refill and a
> miss on every consume that targets a sibling per-node objcg of the
> same memcg, producing the 67.7% stress-ng switch-mq regression
> reported by LKP.
>
> stock->nr_bytes are fungible across per-node objcgs of one memcg:
> drain_obj_stock() and obj_cgroup_uncharge_pages() both account via
> obj_cgroup_memcg(). Treat the cache as keyed by memcg in both
> __consume_obj_stock() and __refill_obj_stock() so siblings share the
> reserve -- eliminating the drain on free and keeping the alloc fast
> path in consume.
>
> The kernel test robot reported the regression, but it was not easy to
> reproduce locally. Qi implemented [1] a specialized reproducer to
> demonstrate the corner case that causes the regression, then tested the
> patch and confirmed the corner case is eliminated.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Debugged-by: Qi Zheng <qi.zheng@linux.dev>
> Tested-by: Qi Zheng <qi.zheng@linux.dev>
Pasting the Sashiko review here as it raises some interesting points:
May 15, 2026 at 10:52 AM, sashiko-bot@kernel.org wrote:
>
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [High] Fungible sharing of stock reserve bytes between per-node objcgs breaks the obj_cgroup_release invariant, causing permanent kmem charge leaks and WARN_ON_ONCE.
> - [High] Missing update to `__account_obj_stock()` bypasses vmstat batching for sibling objcgs, causing severe lock and atomic contention.
> - [High] Calling `obj_cgroup_memcg()` without required RCU or cgroup locks triggers lockdep splats on fast paths.
> --
>
> commit e77fe5b6af93b69e06c1ccd13f83e26b59234483
> Author: Shakeel Butt <shakeel.butt@linux.dev>
>
> memcg: cache obj_stock by memcg, not by objcg pointer
>
> This commit modifies the per-CPU object stock caching to share the reserve
> among per-node sibling obj_cgroups of the same memcg, aiming to reduce
> drains and refills on cross-NUMA workloads.
>
> Link: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/
>
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d978e18b9b2d0..66448f4285316 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3210,7 +3210,11 @@ static bool __consume_obj_stock(struct obj_cgroup *objcg,
> > struct obj_stock_pcp *stock,
> > unsigned int nr_bytes)
> > {
> > - if (objcg == READ_ONCE(stock->cached_objcg) &&
> > + struct obj_cgroup *cached = READ_ONCE(stock->cached_objcg);
> > +
> > + /* Cache is keyed by memcg; sibling per-node objcgs share the reserve. */
> > + if ((cached == objcg ||
> > + (cached && obj_cgroup_memcg(cached) == obj_cgroup_memcg(objcg))) &&
> >
> Does calling obj_cgroup_memcg() here trigger lockdep warnings on slab fast
> paths?
>
> Looking at obj_cgroup_memcg(), it expects rcu_read_lock_held() or
> lockdep_is_held(&cgroup_mutex) to be true.
>
> Since this function is called from slab fast paths where only the local
> CPU stock lock is held, the RCU read lock may not be held.
>
> Could this check directly compare READ_ONCE(cached->memcg) and
> READ_ONCE(objcg->memcg) to avoid the lockdep assertion?
Yes, this makes sense as we are not really dereferencing the memcg here.
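I.e. something like this (untested sketch, using the direct
objcg->memcg loads the review suggests instead of obj_cgroup_memcg()):

	if ((cached == objcg ||
	     (cached && READ_ONCE(cached->memcg) == READ_ONCE(objcg->memcg))) &&
	    stock->nr_bytes >= nr_bytes) {
		stock->nr_bytes -= nr_bytes;
		return true;
	}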
>
> >
> > stock->nr_bytes >= nr_bytes) {
> > stock->nr_bytes -= nr_bytes;
> > return true;
> >
> Will sharing the reserve between per-node sibling objcgs without updating
> stock->cached_objcg break the page multiple invariant in
> obj_cgroup_release()?
>
> If an allocation for objcg_B consumes bytes originally funded by objcg_A,
> and the stock is later drained, those borrowed bytes are flushed into
> objcg_A->nr_charged_bytes.
>
> When obj_cgroup_release() is invoked, nr_charged_bytes will not be an
> exact multiple of PAGE_SIZE. Will this trigger
> WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)) and truncate the remainder,
> permanently leaking the page charge from the memcg?
This is actually a very good point and needs more thought.
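To make the scenario concrete (numbers illustrative, PAGE_SIZE = 4096):

	refill:  objcg_A funds the stock with 4096 bytes
	consume: objcg_B (same memcg, other node) takes 512 of them
	drain:   the remaining 3584 bytes are flushed into
		 objcg_A->nr_charged_bytes
	release(objcg_A): 3584 & (PAGE_SIZE - 1) != 0, so the
		 WARN_ON_ONCE fires and the sub-page remainder is
		 truncated, leaking that charge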
>
> [ ... ]
>
> >
> > @@ -3318,6 +3322,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> > unsigned int nr_bytes,
> > bool allow_uncharge)
> > {
> >
> [ ... ]
>
> >
> > - if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > + cached = READ_ONCE(stock->cached_objcg);
> > + /* Same memcg: bytes are fungible, no drain needed. */
> > + if (cached != objcg &&
> > + (!cached || obj_cgroup_memcg(cached) != obj_cgroup_memcg(objcg))) {
> > drain_obj_stock(stock);
> >
> Since the stock->cached_objcg pointer is not updated when a sibling is
> already cached, does this bypass the vmstat batching in
> __account_obj_stock()?
>
> If __account_obj_stock() still requires strict pointer equality
> (READ_ONCE(stock->cached_objcg) == objcg), alternating cross-node
> allocations will continually fail that match.
>
> Will this force a fallback to direct mod_objcg_mlstate() updates on every
> object allocation and free, causing severe cacheline contention?
This is not that worrying. Yes, the stats update will take the slower
path. This can be fixed when we add support for multiple objcgs in the
per-cpu stock.
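Roughly, the per-CPU stock would grow a small set of slots so each hot
objcg keeps its own cached pointer for __account_obj_stock() to match
(all names and sizes below are invented, just to sketch the direction):

	#define NR_OBJCG_SLOTS	4

	struct obj_stock_slot {
		struct obj_cgroup	*objcg;
		unsigned int		nr_bytes;
	};

	/* one per CPU, replacing the single cached_objcg/nr_bytes pair */
	struct obj_stock_pcp_multi {
		struct obj_stock_slot	slot[NR_OBJCG_SLOTS];
	};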
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260515171953.2224503-1-shakeel.butt@linux.dev?part=1
>