public inbox for linux-mm@kvack.org
* [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations
@ 2026-04-27 15:15 JP Kobryn (Meta)
  2026-04-27 21:11 ` Andrew Morton
  0 siblings, 1 reply; 2+ messages in thread
From: JP Kobryn (Meta) @ 2026-04-27 15:15 UTC (permalink / raw)
  To: linux-mm, akpm, vbabka, mhocko, ying.huang, hannes, shakeel.butt,
	gourry
  Cc: kasong, qi.zheng, baohua, axelrasmussen, yuanchu, weixugc, david,
	ljs, liam, rppt, surenb, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, apopple, linux-kernel, kernel-team

When investigating pressure on a NUMA node, there is no straightforward way
to determine which user-defined policies are driving allocations to it.

Add NUMA mempolicy allocation counters as new node stat items. These
counters track allocations to nodes and also whether the allocations were
intentional or fallbacks.

The new stats follow the existing numa hit/miss/foreign style and have the
following meanings (a worked example follows the list):

  hit
    - for nodemask-based policies, allocation succeeded within nodemask
    - for other policies, allocation succeeded on intended node
    - counted on actual_nid
  miss
    - allocation landed on actual_nid instead of intended_nid
    - counted on actual_nid
  foreign
    - allocation intended for intended_nid, but landed on actual_nid
    - counted on intended_nid
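
For example, with MPOL_PREFERRED targeting node 1, an order-9 (512-page)
allocation that lands on node 1 increments numa_mpol_hit on node 1 by 512.
If it falls back to node 0 instead, numa_mpol_miss on node 0 and
numa_mpol_foreign on node 1 each increase by 512 (the node numbers are
illustrative).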

The existing numa_* counters cannot be adjusted to fill this role because
they are incremented in zone_statistics(), which also covers non-policy
allocations such as alloc_pages_node(). The mempolicy context is not
applicable at that level since in-kernel callers may make their own node
decisions independent of any task policy.

Allocations using the system default policies (default_policy and
preferred_node_policy[]) are excluded since they do not reflect a user
designation and are already accounted for in the existing numa_* stats.
Checking the policy pointer against the defaults (rather than against
current->mempolicy) handles task-level defaults while ensuring that
VMA-level policies set via mbind() are still counted. The logic for this
check was already present in mpol_to_str(), so it is factored out into a
shared helper.
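
As an illustration, a minimal userspace sketch of an allocation that would
be counted, since it runs under a user-defined policy rather than a default
one (assumes libnuma's <numaif.h>; the node number is arbitrary):

  #include <numaif.h>	/* set_mempolicy(), MPOL_PREFERRED; link with -lnuma */
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
  	unsigned long nodemask = 1UL << 0;	/* prefer node 0 */

  	/*
  	 * User-defined policy: each page faulted in below counts as
  	 * numa_mpol_hit on node 0, or as numa_mpol_miss (fallback node)
  	 * plus numa_mpol_foreign (node 0) if the allocation falls back.
  	 */
  	if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8))
  		return 1;

  	size_t len = 64UL << 20;
  	char *buf = malloc(len);

  	if (!buf)
  		return 1;
  	memset(buf, 0, len);	/* fault the pages in */
  	return 0;
  }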

Counters are exposed per-node in nodeN/vmstat and globally in /proc/vmstat.
They provide the information needed in step 3 of the investigation workflow
below (a shell sketch of steps 2-4 appears after the list):

1) Pressure/OOMs reported while system-wide memory is free.
2) Check /proc/zoneinfo or per-node stats in .../nodeN/vmstat to narrow
   down node(s) under pressure.
3) Check the numa_mpol_{hit,miss,foreign} counters (added by this patch) on
   the node(s) to see whether a user-defined policy is driving allocations
   there (and whether they are intentional vs fallback).
     - If active: a user-defined policy is driving allocations to the
       node. Proceed to step 4.
     - If inactive: pressure is from allocations without a user-defined
       policy. Stop and investigate task placement or node capacity
       instead.
4) Use /proc/*/numa_maps to identify tasks using the policy.
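
As a rough sketch of steps 2-4 (node1 and the policy-name patterns are
illustrative; the numa_mpol_* names assume this patch is applied):

  # step 2: narrow down the node(s) under pressure
  grep -H nr_free_pages /sys/devices/system/node/node*/vmstat

  # step 3: check the new counters on a suspect node
  grep numa_mpol /sys/devices/system/node/node1/vmstat

  # step 4: find tasks with a user-defined policy in their numa_maps
  grep -lE 'bind|interleave|prefer' /proc/[0-9]*/numa_maps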

Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
The main feedback addressed so far concerns the semantics of the stats:
they are only counted for user-specified policies.

v4:
  - Factor out default policy check from mpol_to_str()
  - Check against default policies instead of using current->mempolicy
  - Change wording for hit/miss/foreign meaning to use {actual,intended}_nid

v3: https://lore.kernel.org/linux-mm/20260317050657.47494-1-jp.kobryn@linux.dev/
  - Moved stats off of memcg
  - Switched from per-policy to aggregated counters (18 -> 3)
  - Filter allocations with no user-specified policy

v2: https://lore.kernel.org/linux-mm/20260307045520.247998-1-jp.kobryn@linux.dev/
  - Replaced single per-policy total counter (PGALLOC_MPOL_*) with
    hit/miss/foreign triplet per policy
  - Changed from global node stats to per-memcg per-node tracking

v1: https://lore.kernel.org/linux-mm/20260212045109.255391-2-inwardvessel@gmail.com/

 include/linux/mmzone.h |  5 ++++
 mm/mempolicy.c         | 65 +++++++++++++++++++++++++++++++++++++-----
 mm/vmstat.c            |  5 ++++
 3 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da59..602060b4da4b3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -323,6 +323,11 @@ enum node_stat_item {
 	PGSCAN_ANON,
 	PGSCAN_FILE,
 	PGREFILL,
+#ifdef CONFIG_NUMA
+	NUMA_MPOL_HIT,
+	NUMA_MPOL_MISS,
+	NUMA_MPOL_FOREIGN,
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 	NR_HUGETLB,
 #endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59f..28e82753317b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2432,6 +2432,54 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
 	return page;
 }
 
+static bool is_default_policy(struct mempolicy *pol)
+{
+	return pol == &default_policy ||
+	       (pol >= &preferred_node_policy[0] &&
+		pol <= &preferred_node_policy[ARRAY_SIZE(preferred_node_policy) - 1]);
+}
+
+/*
+ * Count a user-defined mempolicy allocation. Stats are tracked per-node.
+ * The following numa_mpol_{hit/miss/foreign} pattern is used:
+ *
+ *   hit
+ *     - for nodemask-based policies, allocation succeeded within nodemask
+ *     - for other policies, allocation succeeded on intended node
+ *     - counted on actual_nid
+ *   miss
+ *     - allocation landed on actual_nid instead of intended_nid
+ *     - counted on actual_nid
+ *   foreign
+ *     - allocation intended for intended_nid, but landed on actual_nid
+ *     - counted on intended_nid
+ */
+static void mpol_count_numa_alloc(struct mempolicy *pol, int intended_nid,
+				  struct page *page, unsigned int order)
+{
+	int actual_nid = page_to_nid(page);
+	long nr_pages = 1L << order;
+	bool is_hit;
+
+	if (is_default_policy(pol))
+		return;
+
+	if (pol->mode == MPOL_BIND || pol->mode == MPOL_PREFERRED_MANY)
+		is_hit = node_isset(actual_nid, pol->nodes);
+	else
+		is_hit = (actual_nid == intended_nid);
+
+	if (is_hit) {
+		mod_node_page_state(NODE_DATA(actual_nid), NUMA_MPOL_HIT, nr_pages);
+	} else {
+		/* account for miss on the fallback node */
+		mod_node_page_state(NODE_DATA(actual_nid), NUMA_MPOL_MISS, nr_pages);
+
+		/* account for foreign on the intended node */
+		mod_node_page_state(NODE_DATA(intended_nid), NUMA_MPOL_FOREIGN, nr_pages);
+	}
+}
+
 /**
  * alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
  * @gfp: GFP flags.
@@ -2450,8 +2498,10 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 
 	nodemask = policy_nodemask(gfp, pol, ilx, &nid);
 
-	if (pol->mode == MPOL_PREFERRED_MANY)
-		return alloc_pages_preferred_many(gfp, order, nid, nodemask);
+	if (pol->mode == MPOL_PREFERRED_MANY) {
+		page = alloc_pages_preferred_many(gfp, order, nid, nodemask);
+		goto out;
+	}
 
 	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
 	    /* filter "hugepage" allocation, unless from alloc_pages() */
@@ -2477,7 +2527,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 				gfp | __GFP_THISNODE | __GFP_NORETRY, order,
 				nid, NULL);
 			if (page || !(gfp & __GFP_DIRECT_RECLAIM))
-				return page;
+				goto out;
 			/*
 			 * If hugepage allocations are configured to always
 			 * synchronous compact or the vma has been madvised
@@ -2500,6 +2550,10 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		}
 	}
 
+out:
+	if (page)
+		mpol_count_numa_alloc(pol, nid, page, order);
+
 	return page;
 }
 
@@ -3559,10 +3613,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	unsigned short mode = MPOL_DEFAULT;
 	unsigned short flags = 0;
 
-	if (pol &&
-	    pol != &default_policy &&
-	    !(pol >= &preferred_node_policy[0] &&
-	      pol <= &preferred_node_policy[ARRAY_SIZE(preferred_node_policy) - 1])) {
+	if (pol && !is_default_policy(pol)) {
 		mode = pol->mode;
 		flags = pol->flags;
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 761718eea2827..a5e7a0cb4678d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1290,6 +1290,11 @@ const char * const vmstat_text[] = {
 	[I(PGSCAN_ANON)]			= "pgscan_anon",
 	[I(PGSCAN_FILE)]			= "pgscan_file",
 	[I(PGREFILL)]				= "pgrefill",
+#ifdef CONFIG_NUMA
+	[I(NUMA_MPOL_HIT)]			= "numa_mpol_hit",
+	[I(NUMA_MPOL_MISS)]			= "numa_mpol_miss",
+	[I(NUMA_MPOL_FOREIGN)]			= "numa_mpol_foreign",
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 	[I(NR_HUGETLB)]				= "nr_hugetlb",
 #endif
-- 
2.52.0




* Re: [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations
  2026-04-27 15:15 [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations JP Kobryn (Meta)
@ 2026-04-27 21:11 ` Andrew Morton
  0 siblings, 0 replies; 2+ messages in thread
From: Andrew Morton @ 2026-04-27 21:11 UTC (permalink / raw)
  To: JP Kobryn (Meta)
  Cc: linux-mm, vbabka, mhocko, ying.huang, hannes, shakeel.butt,
	gourry, kasong, qi.zheng, baohua, axelrasmussen, yuanchu, weixugc,
	david, ljs, liam, rppt, surenb, ziy, matthew.brost, joshua.hahnjy,
	rakie.kim, byungchul, apopple, linux-kernel, kernel-team

On Mon, 27 Apr 2026 08:15:20 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:

> When investigating pressure on a NUMA node, there is no straightforward way
> to determine which user-defined policies are driving allocations to it.
> 
> Add NUMA mempolicy allocation counters as new node stat items. These
> counters track allocations to nodes and also whether the allocations were
> intentional or fallbacks.

AI review:
	https://sashiko.dev/#/patchset/20260427151520.137341-1-jp.kobryn@linux.dev


