Linux-mm Archive on lore.kernel.org
From: Shakeel Butt <shakeel.butt@linux.dev>
To: Oliver Sang <oliver.sang@intel.com>
Cc: "Qi Zheng" <qi.zheng@linux.dev>,
	oe-lkp@lists.linux.dev, lkp@intel.com,
	linux-kernel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"David Carlier" <devnexen@gmail.com>,
	"Allen Pais" <apais@linux.microsoft.com>,
	"Axel Rasmussen" <axelrasmussen@google.com>,
	"Baoquan He" <bhe@redhat.com>,
	"Chengming Zhou" <chengming.zhou@linux.dev>,
	"Chen Ridong" <chenridong@huawei.com>,
	"David Hildenbrand" <david@kernel.org>,
	"Hamza Mahfooz" <hamzamahfooz@linux.microsoft.com>,
	"Harry Yoo" <harry.yoo@oracle.com>,
	"Hugh Dickins" <hughd@google.com>,
	"Imran Khan" <imran.f.khan@oracle.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Kamalesh Babulal" <kamalesh.babulal@oracle.com>,
	"Lance Yang" <lance.yang@linux.dev>,
	"Liam Howlett" <Liam.Howlett@oracle.com>,
	"Lorenzo Stoakes" <ljs@kernel.org>,
	"Michal Hocko" <mhocko@suse.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Mike Rapoport" <rppt@kernel.org>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Muchun Song" <songmuchun@bytedance.com>,
	"Nhat Pham" <nphamcs@gmail.com>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"Usama Arif" <usamaarif642@gmail.com>,
	"Vlastimil Babka" <vbabka@kernel.org>,
	"Wei Xu" <weixugc@google.com>, "Yosry Ahmed" <yosry@kernel.org>,
	"Yuanchu Xie" <yuanchu@google.com>, "Zi Yan" <ziy@nvidia.com>,
	"Usama Arif" <usama.arif@linux.dev>,
	cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [linus:master] [mm] 01b9da291c: stress-ng.switch.ops_per_sec 67.7% regression
Date: Sun, 17 May 2026 12:38:48 -0700
Message-ID: <agoYp1zW9afZ6uQz@linux.dev>
In-Reply-To: <agm61hMv08XnV8sI@xsang-OptiPlex-9020>

On Sun, May 17, 2026 at 08:55:50PM +0800, Oliver Sang wrote:
> hi, Shakeel, hi, Qi,
> 
> On Fri, May 15, 2026 at 10:09:06AM -0700, Shakeel Butt wrote:
> > On Fri, May 15, 2026 at 03:37:22PM +0800, Qi Zheng wrote:
> > > Hi Shakeel,
> > > 
> > > On 5/14/26 9:40 PM, Shakeel Butt wrote:
> > > > > May 14, 2026 at 12:46 AM, "Qi Zheng" <qi.zheng@linux.dev> wrote:
> > > > 
> > > > 
> > > > > 
> > > > > On 5/13/26 10:27 PM, Shakeel Butt wrote:
> > > > > 
> > > > > > 
> > > > > > On Wed, May 13, 2026 at 06:49:45AM -0700, Shakeel Butt wrote:
> > > > > > 
> > > > > > > 
> > > > > > > On Wed, May 13, 2026 at 10:10:34AM +0800, Qi Zheng wrote:
> > > > > > > 
> > > > > >   On 5/13/26 12:03 AM, Shakeel Butt wrote:
> > > > > >   On Tue, May 12, 2026 at 08:56:52PM +0800, kernel test robot wrote:
> > > > > > 
> > > > > >   Hello,
> > > > > > 
> > > > > >   kernel test robot noticed a 67.7% regression of stress-ng.switch.ops_per_sec on:
> > > > > > 
> > > > > >   commit: 01b9da291c4969354807b52956f4aae1f41b4924 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
> > > > > >   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > > 
> > > > > >   This is most probably due to shuffling of struct mem_cgroup and struct
> > > > > >   mem_cgroup_per_node members.
> > > > > > 
> > > > > >   Another possibility is that after objcg was split into per-node, the
> > > > > >   slab accounting fast path is still designed assuming only one current
> > > > > >   objcg per CPU:
> > > > > > 
> > > > > >   struct obj_stock_pcp {
> > > > > >   struct obj_cgroup *cached_objcg;
> > > > > >   };
> > > > > > 
> > > > > > >   So it may cause the following thrashing:
> > > > > > 
> > > > > >   CPU stock cached = memcg/node0 objcg
> > > > > >   free object tagged = memcg/node1 objcg
> > > > > >   => __refill_obj_stock --> objcg mismatch
> > > > > >   => drain_obj_stock()
> > > > > >   => cache switches to node1 objcg
> > > > > > 
> > > > > >   next local allocation tagged = node0 objcg
> > > > > >   => mismatch again
> > > > > >   => drain_obj_stock()
> > > > > > 
> > > > > > > 
> > > > > > > > Actually I think this is the issue: we have ping-pong threads running on
> > > > > > > >   different nodes. Though they are in the same cgroup, their current->objcg is
> > > > > > > >   for the local node, and thus this ping-pong is thrashing the per-cpu objcg stock.
> > > > > > > 
> > > > > > >   The easier fix would be to compare objcg->memcg instead of just objcg during
> > > > > > >   draining and caching. In addition we can add support for multiple objcg per-cpu
> > > > > > >   stock caching.
> > > > > > > 
> > > > > >   Something like the following:
> > > > > >   From d756abe831a905d6fe32bad9a984fc619dafb7e0 Mon Sep 17 00:00:00 2001
> > > > > >   From: Shakeel Butt <shakeel.butt@linux.dev>
> > > > > >   Date: Wed, 13 May 2026 07:24:55 -0700
> > > > > >   Subject: [PATCH] mm/memcontrol: skip obj_stock drain when refilled objcg
> > > > > >   shares memcg
> > > > > >   Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > > > >   ---
> > > > > >   mm/memcontrol.c | 14 +++++++++++++-
> > > > > >   1 file changed, 13 insertions(+), 1 deletion(-)
> > > > > >   diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > >   index d978e18b9b2d..01ed7a8e18ac 100644
> > > > > >   --- a/mm/memcontrol.c
> > > > > >   +++ b/mm/memcontrol.c
> > > > > >   @@ -3318,6 +3318,7 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> > > > > >   unsigned int nr_bytes,
> > > > > >   bool allow_uncharge)
> > > > > >   {
> > > > > >   + struct obj_cgroup *cached;
> > > > > >   unsigned int nr_pages = 0;
> > > > > >   > if (!stock) {
> > > > > >   @@ -3327,7 +3328,18 @@ static void __refill_obj_stock(struct obj_cgroup *objcg,
> > > > > >   goto out;
> > > > > >   }
> > > > > >   > - if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > > > > >   + cached = READ_ONCE(stock->cached_objcg);
> > > > > >   + if (cached != objcg &&
> > > > > >   + (!cached || obj_cgroup_memcg(cached) != obj_cgroup_memcg(objcg))) {
> > > > > >   drain_obj_stock(stock);
> > > > > >   obj_cgroup_get(objcg);
> > > > > >   stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
> > > > > > 
> > > > > This change looks like it should be able to fix the ping-pong issue, but
> > > > > > I still haven't reproduced the performance regression locally. I'll
> > > > > continue testing it.
> > > > 
> > > > > Same here, couldn't reproduce locally. It seems we need to craft a scenario
> > > > > where the paired ping-pong threads get their current->objcg from different
> > > > > nodes. I will try that.
> > > 
> > > I still haven't been able to reproduce the LKP results locally, but I
> > > used an AI bot to generate a pingpong test case (pasted at the end) and
> > > automatically ran the test on a physical machine. The results are as
> > > follows:
> > > 
> > >   parent: 8285917d6f
> > >   bad:    01b9da291c
> > >   fix:    01b9da291c + stock patch
> > > 
> > >   | kernel | mq_ops/sec mean | vs parent | drain_obj_stock / round |
> > >   |--------|-----------------|-----------|-------------------------|
> > >   | parent |     9.743M      |  baseline |          ~0             |
> > >   | bad    |     7.821M      |  -19.73%  |          ~11.16M        |
> > >   | fix    |     9.274M      |  -4.81%   |          ~0             |
> > > 
> > > Probing the drain_obj_stock() calls confirms that the fix restores the
> > > frequency to the parent's baseline.
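The "vs parent" column in the table above is a plain relative change against the parent kernel's mean. A minimal sketch of that arithmetic follows; `pct_change()` is a hypothetical helper, and the inputs are the mq_ops/sec means copied from the table.

```c
#include <assert.h>

/* Relative change of value against baseline, in percent. */
static double pct_change(double baseline, double value)
{
	assert(baseline != 0.0);

	return (value - baseline) / baseline * 100.0;
}
```

With the table's means in millions of ops/sec, `pct_change(9.743, 7.821)` comes out at about -19.73 and `pct_change(9.743, 9.274)` at about -4.81, matching the reported column.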
> > > 
> > > And it seems that besides __refill_obj_stock(), we should also modify
> > > __consume_obj_stock()?
> > > 
> > 
> > Thanks a lot Qi. I will send the formal patch and will add your Debugged-by if
> > you don't mind.
> > 
> 
> Tested-by: kernel test robot <oliver.sang@intel.com>
> 
> we tested the above patch, and it recovers the regression:
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/method/nr_threads/rootfs/tbox_group/test/testcase/testtime:
>   gcc-14/performance/x86_64-rhel-9.4/mq/100%/debian-13-x86_64-20250902.cgz/lkp-spr-r02/switch/stress-ng/60s
> 
> commit:
>   8285917d6f ("mm: memcontrol: prepare for reparenting non-hierarchical stats")
>   01b9da291c ("mm: memcontrol: convert objcg to be per-memcg per-node type")
>   682fd4e9ff  <--- above patch from Shakeel
> 
> 8285917d6f383aef 01b9da291c4969354807b52956f 682fd4e9ffd4009805f81dd25ed
> ---------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \
>       5849          +210.2%      18145 ±  3%      +1.5%       5935        stress-ng.switch.nanosecs_per_context_switch_mq_method
>  2.296e+09           -67.7%  7.408e+08 ±  3%      -1.4%  2.263e+09        stress-ng.switch.ops
>   38288993           -67.7%   12355813 ±  3%      -1.4%   37739220        stress-ng.switch.ops_per_sec
> 
> 
> full comparison is as below [3]
> 
> but there are two notes.
> 
> #1 is that we noticed there is a formal patch later from Shakeel in [1] which has
> more changes. not sure if this test is enough? do you want us to test [1]
> further?

Thanks Oliver, I will send a v2 soon, please test v2.

> 
> #2: when we tested the above patch, we found the server crashed easily while
> running tests. we tried up to 20 runs, and only 2 of them completed successfully
> (the 37739220 above is just the average of these 2 runs; since the data is
> stable, we think maybe it's ok to report to you with this data).
> we also noticed that for [1] there is a [syzbot ci] report in [2]. since we don't
> have serial output for our test server in this report, which is for performance
> tests, we cannot say whether the other 18 runs failed for a similar reason. just FYI.
> 

The syzbot report is simply an RCU warning, which will be fixed in v2. Do you
have more details on the crash you are seeing? Is it a page counter underflow
warning?

Thanks again for the help.


Thread overview: 13+ messages
2026-05-12 12:56 [linus:master] [mm] 01b9da291c: stress-ng.switch.ops_per_sec 67.7% regression kernel test robot
2026-05-12 16:03 ` Shakeel Butt
2026-05-13  2:10   ` Qi Zheng
2026-05-13 13:49     ` Shakeel Butt
2026-05-13 14:27       ` Shakeel Butt
2026-05-14  7:46         ` Qi Zheng
2026-05-14 13:40           ` Shakeel Butt
2026-05-15  7:37             ` Qi Zheng
2026-05-15 17:09               ` Shakeel Butt
2026-05-17 12:55                 ` Oliver Sang
2026-05-17 19:38                   ` Shakeel Butt [this message]
     [not found]                     ` <agtATZG9mIlYzMUl@linux.dev>
     [not found]                       ` <agtPMpQK2jXdQAY4@linux.dev>
2026-05-19  5:04                         ` Oliver Sang
2026-05-19 14:22                           ` Shakeel Butt
