From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 11 Mar 2026 10:31:48 -0700
X-Mailing-List: virtualization@lists.linux.dev
Subject: Re: [PATCH v2] mm/mempolicy: track page allocations per mempolicy
To: "Huang, Ying"
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, mhocko@suse.com,
 vbabka@suse.cz, apopple@nvidia.com, axelrasmussen@google.com,
 byungchul@sk.com, cgroups@vger.kernel.org, david@kernel.org,
 eperezma@redhat.com, gourry@gourry.net, jasowang@redhat.com,
 hannes@cmpxchg.org, joshua.hahnjy@gmail.com, Liam.Howlett@oracle.com,
 linux-kernel@vger.kernel.org, lorenzo.stoakes@oracle.com,
 matthew.brost@intel.com, mst@redhat.com, rppt@kernel.org,
 muchun.song@linux.dev, zhengqi.arch@bytedance.com, rakie.kim@sk.com,
 roman.gushchin@linux.dev, shakeel.butt@linux.dev, surenb@google.com,
 virtualization@lists.linux.dev, weixugc@google.com,
 xuanzhuo@linux.alibaba.com, yuanchu@google.com, ziy@nvidia.com,
 kernel-team@meta.com
References: <20260307045520.247998-1-jp.kobryn@linux.dev>
 <87seabu8np.fsf@DESKTOP-5N7EMDA>
 <977dc43d-622c-411d-99a6-4204fa26c21e@linux.dev>
 <87cy1boyzd.fsf@DESKTOP-5N7EMDA>
From: "JP Kobryn (Meta)"
In-Reply-To: <87cy1boyzd.fsf@DESKTOP-5N7EMDA>

On 3/10/26 7:56 PM, Huang, Ying wrote:
> "JP Kobryn (Meta)" writes:
>
>> On 3/7/26 4:27 AM, Huang, Ying wrote:
>>> "JP Kobryn (Meta)" writes:
>>>
>>>> When investigating pressure on a NUMA node, there is no straightforward way
>>>> to determine which policies are driving allocations to it.
>>>>
>>>> Add per-policy page allocation counters as new node stat items. These
>>>> counters track allocations to nodes and also whether the allocations were
>>>> intentional or fallbacks.
>>>>
>>>> The new stats follow the existing numa hit/miss/foreign style and have the
>>>> following meanings:
>>>>
>>>> hit
>>>> - for BIND and PREFERRED_MANY, allocation succeeded on node in nodemask
>>>> - for other policies, allocation succeeded on intended node
>>>> - counted on the node of the allocation
>>>> miss
>>>> - allocation intended for other node, but happened on this one
>>>> - counted on other node
>>>> foreign
>>>> - allocation intended on this node, but happened on other node
>>>> - counted on this node
>>>>
>>>> Counters are exposed per-memcg, per-node in memory.numa_stat and globally
>>>> in /proc/vmstat.
>>>
>>> IMHO, it may be better to describe your workflow as an example to use
>>> the newly added statistics. That can describe why we need them. For
>>> example, what you have described in
>>> https://lore.kernel.org/linux-mm/9ae80317-f005-474c-9da1-95462138f3c6@gmail.com/
>>>
>>>> 1) Pressure/OOMs reported while system-wide memory is free.
>>>> 2) Check per-node pgscan/pgsteal stats (provided by patch 2) to narrow
>>>>    down node(s) under pressure. They become available in
>>>>    /sys/devices/system/node/nodeN/vmstat.
>>>> 3) Check per-policy allocation counters (this patch) on that node to
>>>>    find what policy was driving it. Same readout at nodeN/vmstat.
>>>> 4) Now use /proc/*/numa_maps to identify tasks using the policy.
>>>
>>
>> Good call. I'll add a workflow adapted for the current approach in the
>> next revision. I included it in another response in this thread, but
>> I'll repeat it here because it will make it easier to answer your
>> question below.
>>
>> 1) Pressure/OOMs reported while system-wide memory is free.
>> 2) Check /proc/zoneinfo or per-node stats in .../nodeN/vmstat to narrow
>>    down node(s) under pressure.
>> 3) Check per-policy hit/miss/foreign counters (added by this patch) on
>>    the node(s) to see what policy is driving allocations there
>>    (intentional vs. fallback).
>> 4) Use /proc/*/numa_maps to identify tasks using the policy.
>>
>>> One question. If we have to search /proc/*/numa_maps, why can't we
>>> find all necessary information via /proc/*/numa_maps? For example,
>>> which VMA uses the most pages on the node? Which policy is used in the
>>> VMA? ...
>>>
>>
>> There's a gap in the flow of information if we go straight from a node
>> in question to numa_maps. Without step 3 above, we can't distinguish
>> whether pages landed there intentionally, as a fallback, or were
>> migrated sometime after the allocation. These new counters track the
>> results of allocations at the time they happen, preserving that
>> information regardless of what may happen later on.
>
> Sorry for late reply.
>
> IMHO, step 3) doesn't add much to the flow. It only counts allocation,
> not migration, freeing, etc.

This logic would undermine other existing stats.

> I'm afraid that it may be misleading. For example, if a lot of pages
> have been allocated with a mempolicy, then these pages are freed.
> /proc/*/numa_maps are more useful stats for the goal.

numa_maps only shows live snapshots, with no attribution. Even if we
tracked them over time, there would be no way to determine whether the
allocations exist as a result of a policy decision.

> To get all necessary information, I think that more thorough tracing
> is necessary.
Tracking other sources of pages on a node (migration, freeing, etc.) is
beyond the goal of this patch.
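For what it's worth, the readout in step 3 is easy to script once the
counters land in nodeN/vmstat. A rough sketch of that step -- note the
stat names below (nr_mpol_<policy>_hit / _miss) are hypothetical
placeholders, the actual names are whatever the patch defines:

```python
def parse_vmstat(text):
    """Parse 'name value' lines (nodeN/vmstat format) into a dict,
    skipping anything that doesn't look like a counter line."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def fallback_ratio(stats, policy):
    """Fraction of this node's allocations for `policy` that were
    misses, i.e. intended for another node but satisfied here.
    A high ratio suggests the policy is spilling onto this node."""
    hit = stats.get(f"nr_mpol_{policy}_hit", 0)
    miss = stats.get(f"nr_mpol_{policy}_miss", 0)
    total = hit + miss
    return miss / total if total else 0.0

# Stand-in for open("/sys/devices/system/node/node0/vmstat").read();
# the numbers are made up for illustration.
sample = """\
nr_mpol_bind_hit 120000
nr_mpol_bind_miss 3000
nr_mpol_interleave_hit 500
nr_mpol_interleave_miss 45000
"""

stats = parse_vmstat(sample)
for policy in ("bind", "interleave"):
    print(policy, round(fallback_ratio(stats, policy), 3))
```

In this made-up sample the interleave allocations on the node are almost
all fallbacks, which is exactly the "intentional vs. fallback" signal
that a plain numa_maps walk can't provide.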