From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <35272f7a-4c1e-48f8-8e99-82bf3baffab3@linux.dev>
Date: Tue, 10 Mar 2026 10:01:53 -0700
X-Mailing-List: virtualization@lists.linux.dev
Subject: Re: [PATCH v2] mm/mempolicy: track page allocations per mempolicy
To: Shakeel Butt
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, mhocko@suse.com,
 vbabka@suse.cz, apopple@nvidia.com, axelrasmussen@google.com,
 byungchul@sk.com, cgroups@vger.kernel.org, david@kernel.org,
 eperezma@redhat.com, gourry@gourry.net, jasowang@redhat.com,
 hannes@cmpxchg.org, joshua.hahnjy@gmail.com, Liam.Howlett@oracle.com,
 linux-kernel@vger.kernel.org, lorenzo.stoakes@oracle.com,
 matthew.brost@intel.com, mst@redhat.com, rppt@kernel.org,
 muchun.song@linux.dev, zhengqi.arch@bytedance.com, rakie.kim@sk.com,
 roman.gushchin@linux.dev, surenb@google.com, virtualization@lists.linux.dev,
 weixugc@google.com, xuanzhuo@linux.alibaba.com, ying.huang@linux.alibaba.com,
 yuanchu@google.com, ziy@nvidia.com, kernel-team@meta.com
References: <20260307045520.247998-1-jp.kobryn@linux.dev>
From: "JP Kobryn (Meta)"

On 3/10/26 7:53 AM, Shakeel Butt wrote:
> On Mon, Mar 09, 2026 at 09:17:43PM -0700, JP Kobryn (Meta) wrote:
>> On 3/9/26 4:43 PM, Shakeel Butt wrote:
>>> On Fri, Mar 06, 2026 at 08:55:20PM -0800, JP Kobryn (Meta) wrote:
> [...]
>>>
>>> These seem like monotonically increasing metrics, and I think you
>>> don't care about their absolute value but rather the rate of change.
>>> Any reason this cannot be achieved through a tracepoints and BPF
>>> combination?
>>
>> We have the per-node reclaim stats (pg{steal,scan,refill}) in
>> nodeN/vmstat and memory.numa_stat now. The new stats in this patch
>> would be collected from the same source. They were meant to be used
>> together, so it seemed like a reasonable location. I think the
>> advantage over tracepoints is that we get the observability from the
>> start, and it would be simple to extend existing programs that
>> already read stats from the cgroup dir files.
>
> Convenience does not really justify the cost of adding 18 counters,
> particularly in memcg. We can argue about adding these as system-level
> metrics only, but not for memcg.
>
> counter_cost = nr_cpus * nr_nodes * nr_memcg * 16 (struct lruvec_stats_percpu)
>
> On a typical prod machine, we can see 1000s of memcgs, 100s of cpus,
> and a couple of NUMA nodes. So a single counter's cost can range from
> 200KiB to MiBs. This does not seem like a cost we should force
> everyone to pay.
>
> If you really want these per-memcg, and assuming these metrics are
> updated in a non-performance-critical path, we can try to decouple
> these and other reclaim-related stats from the rstat infra. That would
> at least reduce the nr_cpus factor in the above equation to 1. Though
> we will need to actually evaluate the performance of the change before
> committing to it.

I could trade off the per-cgroup granularity and change these stats to
become global per-node stats.
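[Editorial note: the counter_cost arithmetic quoted above can be sanity-checked with a short back-of-the-envelope sketch. The 16-byte entry size is the struct lruvec_stats_percpu figure cited in the thread; the two machine shapes below are illustrative assumptions chosen to bracket the "200KiB to MiBs" range, not measurements.]

```python
# Back-of-the-envelope cost of ONE per-memcg, per-node, per-cpu counter.
# BYTES_PER_ENTRY is the 16-byte struct lruvec_stats_percpu entry size
# cited in the thread; the host shapes are illustrative assumptions.

BYTES_PER_ENTRY = 16

def counter_cost(nr_cpus: int, nr_nodes: int, nr_memcg: int) -> int:
    """Bytes consumed by a single counter across all cpus/nodes/memcgs."""
    return nr_cpus * nr_nodes * nr_memcg * BYTES_PER_ENTRY

# A smaller host: 64 cpus, 1 NUMA node, 200 memcgs -> exactly 200 KiB.
small = counter_cost(64, 1, 200)

# A larger host: 256 cpus, 2 NUMA nodes, 2000 memcgs -> 15.625 MiB.
large = counter_cost(256, 2, 2000)

print(f"small host: {small / 1024:.0f} KiB per counter")
print(f"large host: {large / (1024 * 1024):.3f} MiB per counter")
```

Multiplying either figure by the 18 proposed counters shows why the per-memcg cost was the sticking point, and why dropping the nr_memcg (global per-node stats) or nr_cpus (decoupling from rstat) factor changes the picture.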