Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Qi Zheng <qi.zheng@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Alexandre Ghiti <alex@ghiti.fr>,
	Joshua Hahn <joshua.hahnjy@gmail.com>,
	Meta kernel team <kernel-team@meta.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	kernel test robot <oliver.sang@intel.com>
Subject: Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer
Date: Tue, 19 May 2026 11:35:42 +0800	[thread overview]
Message-ID: <06b25f78-8e2d-4075-9b80-133bbd691c71@linux.dev> (raw)
In-Reply-To: <aguiSnY3ie1y4nEl@linux.dev>



On 5/19/26 7:41 AM, Shakeel Butt wrote:
> On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote:
>> Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg
>> per-node type") split a memcg's single obj_cgroup into one per NUMA
>> node, but the per-CPU obj_stock_pcp still keys cached_objcg by
>> pointer. Cross-NUMA workloads now see a drain on every refill and a
>> miss on every consume that targets a sibling per-node objcg of the
>> same memcg, producing the 67.7% stress-ng switch-mq regression
>> reported by LKP.
>>
>> stock->nr_bytes are fungible across per-node objcgs of one memcg.
>> Treat the cache as keyed by memcg in __consume_obj_stock() and
>> __refill_obj_stock() so siblings share the reserve. Compare via
>> READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so
>> the rcu_read_lock contract on obj_cgroup_memcg() does not apply.
>>
>> Sharing the reserve without re-caching means bytes funded by one
>> per-node objcg's slow path can be consumed/freed under a different
>> sibling, leaving sub-page residue on whichever sibling was cached at
>> drain time. The pre-existing obj_cgroup_release() path would WARN and
>> silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1)
>> bytes per memcg lifecycle from the page_counter. Forward the residue
>> into a per-node objcg of the same (post-reparent) memcg at release time
>> instead, so it can be reconciled later via a refill atomic_xchg or
>> another release; the chain terminates at root_mem_cgroup, whose
>> page_counter has no enforced limit.
>>
>> Please note that this is temporary fix and will be reverted when
>> per-node kmem accounting is introduced.
>>
>> Update the stale invariant comment on __account_obj_stock().
>>
>> Qi Zheng built a specialized reproducer [1] for the corner case and
>> confirmed the fix.
>>
>> Reported-by: kernel test robot <oliver.sang@intel.com>
>> Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
>> Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
>> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1]
>> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
>> Debugged-by: Qi Zheng <qi.zheng@linux.dev>
> 
> Sashiko's review below and my response.
> 
> 
>>
>> Thank you for your contribution! Sashiko AI review found 2 potential issue(s)
>> to consider:
>> - [High] Race condition during memcg offlining causes a permanent page counter
>> leak when forwarding `sub_bytes`.
>> - [High] Unbounded accumulation of `sub_bytes` in a non-allocating parent
>> memcg causes a permanent page counter leak and integer overflow.
>> --
>>
>> commit e90ccbfce3bb9538bf1cdf185a1c10a7269e0939
>> Author: Shakeel Butt <shakeel.butt@linux.dev>
>>
>> memcg: cache obj_stock by memcg, not by objcg pointer
>>
>> This commit changes obj_stock_pcp caching from pointer-based to memcg-based
>> so sibling per-node objcgs share the reserve. It also forwards sub-page
>> residue into a per-node objcg of the same memcg during release so it can be
>> reconciled later instead of being lost.
>>
>> Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/
>> [1]
>>
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>   index d978e18b9b2d0..a547ec7c42d16 100644
>>>   --- a/mm/memcontrol.c
>>>   +++ b/mm/memcontrol.c
>>>
>> [ ... ]
>>
>>>
>>> @@ -160,23 +170,33 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>>>   */
>>>   nr_bytes = atomic_read(&objcg->nr_charged_bytes);
>>>   - WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
>>>   nr_pages = nr_bytes >> PAGE_SHIFT;
>>>   + sub_bytes = nr_bytes & (PAGE_SIZE - 1);
>>>   
>>>   - if (nr_pages) {
>>>   + if (nr_pages || sub_bytes) {
>>>   struct mem_cgroup *memcg;
>>>   
>>>   - memcg = get_mem_cgroup_from_objcg(objcg);
>>>   - mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
>>>   - memcg1_account_kmem(memcg, -nr_pages);
>>>   - if (!mem_cgroup_is_root(memcg))
>>>   - memcg_uncharge(memcg, nr_pages);
>>>   - mem_cgroup_put(memcg);
>>>   + rcu_read_lock();
>>>   + memcg = obj_cgroup_memcg(objcg);
>>>   +
>>>   + if (nr_pages) {
>>>   + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages);
>>>   + memcg1_account_kmem(memcg, -nr_pages);
>>>   + if (!mem_cgroup_is_root(memcg))
>>>   + memcg_uncharge(memcg, nr_pages);
>>>   + }
>>>   +
>>>   + if (sub_bytes && !mem_cgroup_is_root(memcg)) {
>>>   + struct obj_cgroup *fwd;
>>>   +
>>>   + fwd = rcu_dereference(
>>>   + memcg->nodeinfo[numa_node_id()]->objcg);
>>>   + if (fwd)
>>>
>> Could there be a race condition during memcg offlining here that leads to a
>> page counter leak?
>>
>> If this executes concurrently with the offlining of the memcg,
>> memcg_reparent_objcgs() might have already replaced the nodeinfo objcg
>> pointer with NULL. In this race window, fwd evaluates to NULL, and the
>> sub_bytes would be dropped.
>>
>> Since they are dropped without being uncharged or forwarded, it seems the
>> page counter would permanently leak these charges.
> 
> This is not much a concern as it is bounded (i.e. nr_nodes * PAGE_SIZE - 1) and
> previously for the warning case, we were already dropping the sub-page charges.
> 
>>
>>>
>>> + atomic_add(sub_bytes, &fwd->nr_charged_bytes);
>>>
>> Does this unbounded addition create a risk of integer overflow and page
>> counter leaks if the parent memcg never allocates?
>>
>> If the parent memcg is used strictly as a hierarchical container (like
>> system.slice) and performs no kernel allocations, its objcg is never cached
>> in any CPU's obj_stock_pcp.
>>
>> Because __refill_obj_stock() relies on the cache to periodically drain
>> nr_charged_bytes, it would never be invoked for the parent. Could child
>> residues continuously accumulate in the parent's nr_charged_bytes without
>> ever being uncharged, eventually overflowing the 32-bit atomic_t?
>>
> 
> This can be a concern for a long running cgroup. However fixing this would add
> complexity not worth it. This is a temporary fix and will be reverted in newer
> kernels.

I think this is fine as a temporary fix:

Acked-by: Qi Zheng <qi.zheng@linux.dev>

Thanks!

> 
>>>
>>> + }
>>>   + rcu_read_unlock();
>>>   }
>>>
>> -- 
>> Sashiko AI review ·
>> https://sashiko.dev/#/patchset/20260518222827.110696-1-shakeel.butt@linux.dev?part=1
>>

next prev parent reply	other threads:[~2026-05-19  3:36 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-18 22:28 [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer Shakeel Butt
2026-05-18 23:41 ` Shakeel Butt
2026-05-19  3:35   ` Qi Zheng [this message]
2026-05-19  6:46   ` Harry Yoo
2026-05-19 14:02     ` Shakeel Butt
2026-05-19 15:00       ` Harry Yoo
2026-05-19 20:11         ` Shakeel Butt
2026-05-19 20:49           ` Andrew Morton
2026-05-22 16:16             ` Shakeel Butt
2026-05-26  4:09               ` Harry Yoo
2026-05-19 23:39           ` Harry Yoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=06b25f78-8e2d-4075-9b80-133bbd691c71@linux.dev \
    --to=qi.zheng@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=alex@ghiti.fr \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=joshua.hahnjy@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=oliver.sang@intel.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.