From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-181.mta0.migadu.com (out-181.mta0.migadu.com [91.218.175.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5075F338925 for ; Mon, 18 May 2026 23:42:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.181 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779147733; cv=none; b=B126gHtSnR/xUehFGDaMPaej+NvolkBqGuiXGCoajpi0ZMJMJGHacO3jHI0IuZKyCcBDF2HQUZH3hj1Frz8/Uys2Y4/+njSNUBfZYGFBnrj7e4JzoUSw4wVz6gh+LQKozer/T6YBYFyeeiK/OkWbh4kYy56bAzpoXLeX2I+U4/4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779147733; c=relaxed/simple; bh=DKAhZBOJgMVQ0Xkz3R7SYHeIFYTpInBm9CFf+t08UhI=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=QBDio/3CsyTAF4erwC1nKIGoLJijN0sgmkeOQh8ih08zB6NSrldnkdChpnaIfqvdL3YNi8DQtl5dMA1dBqqkFTy4f49dUY3TviUk9bfnsbDGbN6xTD1v1Lp/PZNeM8wo+wcZ2X2sTEdqW0sh8BMWiq91KndMOz6u0G+iKTVH/bw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=puQ78vfx; arc=none smtp.client-ip=91.218.175.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="puQ78vfx" Date: Mon, 18 May 2026 16:41:58 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1779147728; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=+QCtoP4NRudV33pIm9zRDQzZGMwbm5GmlU6HPmH5vp0=; b=puQ78vfxlavN/DowBTm+ommhrlzC3UKWVldJ5ENuibRKIrQ1q1IbR/KdkiqcpmkggUEn+V o7AIm6kvbDP7Zt3TGA18ExumKlQkRoRfOQJxJzWMAs+igS0iUlEIsSrEAMezb+rg/puubX p7/Q39B6ZaDa1AOoF8sYlB180hq/vl0= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Shakeel Butt To: Andrew Morton Cc: Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Qi Zheng , Alexandre Ghiti , Joshua Hahn , Meta kernel team , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel test robot Subject: Re: [PATCH v3] memcg: cache obj_stock by memcg, not by objcg pointer Message-ID: References: <20260518222827.110696-1-shakeel.butt@linux.dev> Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20260518222827.110696-1-shakeel.butt@linux.dev> X-Migadu-Flow: FLOW_OUT On Mon, May 18, 2026 at 03:28:27PM -0700, Shakeel Butt wrote: > Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg > per-node type") split a memcg's single obj_cgroup into one per NUMA > node, but the per-CPU obj_stock_pcp still keys cached_objcg by > pointer. Cross-NUMA workloads now see a drain on every refill and a > miss on every consume that targets a sibling per-node objcg of the > same memcg, producing the 67.7% stress-ng switch-mq regression > reported by LKP. > > stock->nr_bytes are fungible across per-node objcgs of one memcg. > Treat the cache as keyed by memcg in __consume_obj_stock() and > __refill_obj_stock() so siblings share the reserve. Compare via > READ_ONCE(objcg->memcg) directly: pointer-compare only, no deref, so > the rcu_read_lock contract on obj_cgroup_memcg() does not apply. > > Sharing the reserve without re-caching means bytes funded by one > per-node objcg's slow path can be consumed/freed under a different > sibling, leaving sub-page residue on whichever sibling was cached at > drain time. The pre-existing obj_cgroup_release() path would WARN and > silently drop that residue, leaking up to nr_node_ids * (PAGE_SIZE - 1) > bytes per memcg lifecycle from the page_counter. Forward the residue > into a per-node objcg of the same (post-reparent) memcg at release time > instead, so it can be reconciled later via a refill atomic_xchg or > another release; the chain terminates at root_mem_cgroup, whose > page_counter has no enforced limit. > > Please note that this is temporary fix and will be reverted when > per-node kmem accounting is introduced. > > Update the stale invariant comment on __account_obj_stock(). > > Qi Zheng built a specialized reproducer [1] for the corner case and > confirmed the fix. > > Reported-by: kernel test robot > Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com > Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") > Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ [1] > Signed-off-by: Shakeel Butt > Debugged-by: Qi Zheng Sashiko's review below and my response. > > Thank you for your contribution! Sashiko AI review found 2 potential issue(s) > to consider: > - [High] Race condition during memcg offlining causes a permanent page counter > leak when forwarding `sub_bytes`. > - [High] Unbounded accumulation of `sub_bytes` in a non-allocating parent > memcg causes a permanent page counter leak and integer overflow. > -- > > commit e90ccbfce3bb9538bf1cdf185a1c10a7269e0939 > Author: Shakeel Butt > > memcg: cache obj_stock by memcg, not by objcg pointer > > This commit changes obj_stock_pcp caching from pointer-based to memcg-based > so sibling per-node objcgs share the reserve. It also forwards sub-page > residue into a per-node objcg of the same memcg during release so it can be > reconciled later instead of being lost. > > Link: https://lore.kernel.org/19693be6-7132-446e-b3fc-b7e9f56e5949@linux.dev/ > [1] > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index d978e18b9b2d0..a547ec7c42d16 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > > [ ... ] > > > > > @@ -160,23 +170,33 @@ static void obj_cgroup_release(struct percpu_ref *ref) > > */ > > nr_bytes = atomic_read(&objcg->nr_charged_bytes); > > - WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1)); > > nr_pages = nr_bytes >> PAGE_SHIFT; > > + sub_bytes = nr_bytes & (PAGE_SIZE - 1); > > > > - if (nr_pages) { > > + if (nr_pages || sub_bytes) { > > struct mem_cgroup *memcg; > > > > - memcg = get_mem_cgroup_from_objcg(objcg); > > - mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); > > - memcg1_account_kmem(memcg, -nr_pages); > > - if (!mem_cgroup_is_root(memcg)) > > - memcg_uncharge(memcg, nr_pages); > > - mem_cgroup_put(memcg); > > + rcu_read_lock(); > > + memcg = obj_cgroup_memcg(objcg); > > + > > + if (nr_pages) { > > + mod_memcg_state(memcg, MEMCG_KMEM, -nr_pages); > > + memcg1_account_kmem(memcg, -nr_pages); > > + if (!mem_cgroup_is_root(memcg)) > > + memcg_uncharge(memcg, nr_pages); > > + } > > + > > + if (sub_bytes && !mem_cgroup_is_root(memcg)) { > > + struct obj_cgroup *fwd; > > + > > + fwd = rcu_dereference( > > + memcg->nodeinfo[numa_node_id()]->objcg); > > + if (fwd) > > > Could there be a race condition during memcg offlining here that leads to a > page counter leak? > > If this executes concurrently with the offlining of the memcg, > memcg_reparent_objcgs() might have already replaced the nodeinfo objcg > pointer with NULL. In this race window, fwd evaluates to NULL, and the > sub_bytes would be dropped. > > Since they are dropped without being uncharged or forwarded, it seems the > page counter would permanently leak these charges. This is not much a concern as it is bounded (i.e. nr_nodes * PAGE_SIZE - 1) and previously for the warning case, we were already dropping the sub-page charges. > > > > > + atomic_add(sub_bytes, &fwd->nr_charged_bytes); > > > Does this unbounded addition create a risk of integer overflow and page > counter leaks if the parent memcg never allocates? > > If the parent memcg is used strictly as a hierarchical container (like > system.slice) and performs no kernel allocations, its objcg is never cached > in any CPU's obj_stock_pcp. > > Because __refill_obj_stock() relies on the cache to periodically drain > nr_charged_bytes, it would never be invoked for the parent. Could child > residues continuously accumulate in the parent's nr_charged_bytes without > ever being uncharged, eventually overflowing the 32-bit atomic_t? > This can be a concern for a long running cgroup. However fixing this would add complexity not worth it. This is a temporary fix and will be reverted in newer kernels. > > > > + } > > + rcu_read_unlock(); > > } > > > -- > Sashiko AI review ยท > https://sashiko.dev/#/patchset/20260518222827.110696-1-shakeel.butt@linux.dev?part=1 >