* [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations
@ 2026-04-27 15:15 JP Kobryn (Meta)
From: JP Kobryn (Meta) @ 2026-04-27 15:15 UTC
To: linux-mm, akpm, vbabka, mhocko, ying.huang, hannes, shakeel.butt,
gourry
Cc: kasong, qi.zheng, baohua, axelrasmussen, yuanchu, weixugc, david,
ljs, liam, rppt, surenb, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, apopple, linux-kernel, kernel-team
When investigating pressure on a NUMA node, there is no straightforward way
to determine which user-defined policies are driving allocations to it.
Add NUMA mempolicy allocation counters as new node stat items. These
counters track allocations to nodes and also whether the allocations were
intentional or fallbacks.
The new stats follow the existing numa hit/miss/foreign style and have the
following meanings:
hit
- for nodemask-based policies, allocation succeeded within nodemask
- for other policies, allocation succeeded on intended node
- counted on actual_nid
miss
- allocation landed on actual_nid instead of intended_nid
- counted on actual_nid
foreign
- allocation intended for intended_nid, but landed on actual_nid
- counted on intended_nid
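As a hypothetical illustration (node numbers are made up): a task running
on node 0 with an MPOL_PREFERRED_MANY policy over nodes 0-1, so that
intended_nid resolves to the local node 0, makes an order-2 (4-page)
allocation. If the pages land on node 1, inside the nodemask:

	numa_mpol_hit     += 4   on node 1 (actual_nid)

If the preferred nodes are exhausted and the pages land on node 2 instead:

	numa_mpol_miss    += 4   on node 2 (actual_nid)
	numa_mpol_foreign += 4   on node 0 (intended_nid)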
The existing numa_* counters cannot be adjusted to fill this role because
they are incremented in zone_statistics(), which also covers non-policy
allocations such as alloc_pages_node(). The mempolicy context is not
applicable at that level since in-kernel callers may make their own node
decisions independent of any task policy.
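For reference, the existing accounting looks roughly like the following
(paraphrased from zone_statistics() in mm/page_alloc.c; exact code varies
by kernel version):

	if (zone_to_nid(z) == zone_to_nid(preferred_zone))
		__count_numa_events(z, NUMA_HIT, nr_account);
	else {
		__count_numa_events(z, NUMA_MISS, nr_account);
		__count_numa_events(preferred_zone, NUMA_FOREIGN, nr_account);
	}

It runs for every successful allocation and has no visibility into whether
a user-defined mempolicy chose preferred_zone.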
Allocations using the system default policies (default_policy and
preferred_node_policy[]) are excluded since they do not reflect a user
designation and are already accounted for in the existing numa_* stats.
Checking against the default policies, rather than current->mempolicy,
covers task-level defaults while ensuring VMA-level policies set via
mbind() are still counted. The logic for this check was already present
in mpol_to_str(), so it has been factored out into a shared helper.
Counters are exposed per-node in nodeN/vmstat and globally in /proc/vmstat.
They provide the information needed in step 3 of the investigation workflow
below:
1) Pressure/OOMs reported while system-wide memory is free.
2) Check /proc/zoneinfo or per-node stats in .../nodeN/vmstat to narrow
down node(s) under pressure.
3) Check the numa_mpol_{hit,miss,foreign} counters (added by this patch)
on the node(s) to see whether a user-defined policy is driving
allocations there, and whether those allocations are intentional vs
fallback (example output follows after this list).
- If active: a user-defined policy is driving allocations to the
node. Proceed to step 4.
- If inactive: pressure is from allocations without a user-defined
policy. Stop and investigate task placement or node capacity
instead.
4) Use /proc/*/numa_maps to identify tasks using the policy.
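Hypothetical example output for step 3 (values are illustrative only):

	$ grep numa_mpol /sys/devices/system/node/node2/vmstat
	numa_mpol_hit 1843200
	numa_mpol_miss 51200
	numa_mpol_foreign 0

Here a user-defined policy is actively driving allocations to node 2,
mostly intentionally, so step 4 applies.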
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
---
The main feedback addressed up to this point has been the semantics behind the
stats. The stats are only counted for user-specified policies.
v4:
- Factor out default policy check from mpol_to_str()
- Check against default policies instead of using current->mempolicy
- Change wording for hit/miss/foreign meaning to use {actual,intended}_nid
v3: https://lore.kernel.org/linux-mm/20260317050657.47494-1-jp.kobryn@linux.dev/
- Moved stats off of memcg
- Switched from per-policy to aggregated counters (18 -> 3)
- Filter allocations with no user-specified policy
v2: https://lore.kernel.org/linux-mm/20260307045520.247998-1-jp.kobryn@linux.dev/
- Replaced single per-policy total counter (PGALLOC_MPOL_*) with
hit/miss/foreign triplet per policy
- Changed from global node stats to per-memcg per-node tracking
v1: https://lore.kernel.org/linux-mm/20260212045109.255391-2-inwardvessel@gmail.com/
include/linux/mmzone.h | 5 ++++
mm/mempolicy.c | 65 +++++++++++++++++++++++++++++++++++++-----
mm/vmstat.c | 5 ++++
3 files changed, 68 insertions(+), 7 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9adb2ad21da59..602060b4da4b3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -323,6 +323,11 @@ enum node_stat_item {
PGSCAN_ANON,
PGSCAN_FILE,
PGREFILL,
+#ifdef CONFIG_NUMA
+ NUMA_MPOL_HIT,
+ NUMA_MPOL_MISS,
+ NUMA_MPOL_FOREIGN,
+#endif
#ifdef CONFIG_HUGETLB_PAGE
NR_HUGETLB,
#endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59f..28e82753317b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2432,6 +2432,54 @@ static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
return page;
}
+static bool is_default_policy(struct mempolicy *pol)
+{
+ return pol == &default_policy ||
+ (pol >= &preferred_node_policy[0] &&
+ pol <= &preferred_node_policy[ARRAY_SIZE(preferred_node_policy) - 1]);
+}
+
+/*
+ * Count a user-defined mempolicy allocation. Stats are tracked per-node.
+ * The following numa_mpol_{hit/miss/foreign} pattern is used:
+ *
+ * hit
+ * - for nodemask-based policies, allocation succeeded within nodemask
+ * - for other policies, allocation succeeded on intended node
+ * - counted on actual_nid
+ * miss
+ * - allocation landed on actual_nid instead of intended_nid
+ * - counted on actual_nid
+ * foreign
+ * - allocation intended for intended_nid, but landed on actual_nid
+ * - counted on intended_nid
+ */
+static void mpol_count_numa_alloc(struct mempolicy *pol, int intended_nid,
+ struct page *page, unsigned int order)
+{
+ int actual_nid = page_to_nid(page);
+ long nr_pages = 1L << order;
+ bool is_hit;
+
+ if (is_default_policy(pol))
+ return;
+
+ if (pol->mode == MPOL_BIND || pol->mode == MPOL_PREFERRED_MANY)
+ is_hit = node_isset(actual_nid, pol->nodes);
+ else
+ is_hit = (actual_nid == intended_nid);
+
+ if (is_hit) {
+ mod_node_page_state(NODE_DATA(actual_nid), NUMA_MPOL_HIT, nr_pages);
+ } else {
+ /* account for miss on the fallback node */
+ mod_node_page_state(NODE_DATA(actual_nid), NUMA_MPOL_MISS, nr_pages);
+
+ /* account for foreign on the intended node */
+ mod_node_page_state(NODE_DATA(intended_nid), NUMA_MPOL_FOREIGN, nr_pages);
+ }
+}
+
/**
* alloc_pages_mpol - Allocate pages according to NUMA mempolicy.
* @gfp: GFP flags.
@@ -2450,8 +2498,10 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
nodemask = policy_nodemask(gfp, pol, ilx, &nid);
- if (pol->mode == MPOL_PREFERRED_MANY)
- return alloc_pages_preferred_many(gfp, order, nid, nodemask);
+ if (pol->mode == MPOL_PREFERRED_MANY) {
+ page = alloc_pages_preferred_many(gfp, order, nid, nodemask);
+ goto out;
+ }
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
/* filter "hugepage" allocation, unless from alloc_pages() */
@@ -2477,7 +2527,7 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
gfp | __GFP_THISNODE | __GFP_NORETRY, order,
nid, NULL);
if (page || !(gfp & __GFP_DIRECT_RECLAIM))
- return page;
+ goto out;
/*
* If hugepage allocations are configured to always
* synchronous compact or the vma has been madvised
@@ -2500,6 +2550,10 @@ static struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
}
}
+out:
+ if (page)
+ mpol_count_numa_alloc(pol, nid, page, order);
+
return page;
}
@@ -3559,10 +3613,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
unsigned short mode = MPOL_DEFAULT;
unsigned short flags = 0;
- if (pol &&
- pol != &default_policy &&
- !(pol >= &preferred_node_policy[0] &&
- pol <= &preferred_node_policy[ARRAY_SIZE(preferred_node_policy) - 1])) {
+ if (pol && !is_default_policy(pol)) {
mode = pol->mode;
flags = pol->flags;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 761718eea2827..a5e7a0cb4678d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1290,6 +1290,11 @@ const char * const vmstat_text[] = {
[I(PGSCAN_ANON)] = "pgscan_anon",
[I(PGSCAN_FILE)] = "pgscan_file",
[I(PGREFILL)] = "pgrefill",
+#ifdef CONFIG_NUMA
+ [I(NUMA_MPOL_HIT)] = "numa_mpol_hit",
+ [I(NUMA_MPOL_MISS)] = "numa_mpol_miss",
+ [I(NUMA_MPOL_FOREIGN)] = "numa_mpol_foreign",
+#endif
#ifdef CONFIG_HUGETLB_PAGE
[I(NR_HUGETLB)] = "nr_hugetlb",
#endif
--
2.52.0
* Re: [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations
2026-04-27 15:15 [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations JP Kobryn (Meta)
@ 2026-04-27 21:11 ` Andrew Morton
From: Andrew Morton @ 2026-04-27 21:11 UTC
To: JP Kobryn (Meta)
Cc: linux-mm, vbabka, mhocko, ying.huang, hannes, shakeel.butt,
gourry, kasong, qi.zheng, baohua, axelrasmussen, yuanchu, weixugc,
david, ljs, liam, rppt, surenb, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, apopple, linux-kernel, kernel-team
On Mon, 27 Apr 2026 08:15:20 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:
> When investigating pressure on a NUMA node, there is no straightforward way
> to determine which user-defined policies are driving allocations to it.
>
> Add NUMA mempolicy allocation counters as new node stat items. These
> counters track allocations to nodes and also whether the allocations were
> intentional or fallbacks.
AI review:
https://sashiko.dev/#/patchset/20260427151520.137341-1-jp.kobryn@linux.dev
* Re: [PATCH v4] mm/mempolicy: track user-defined mempolicy allocations
2026-04-27 21:11 ` Andrew Morton
@ 2026-05-05 6:22 ` JP Kobryn (Meta)
From: JP Kobryn (Meta) @ 2026-05-05 6:22 UTC
To: Andrew Morton
Cc: linux-mm, vbabka, mhocko, ying.huang, hannes, shakeel.butt,
gourry, kasong, qi.zheng, baohua, axelrasmussen, yuanchu, weixugc,
david, ljs, liam, rppt, surenb, ziy, matthew.brost, joshua.hahnjy,
rakie.kim, byungchul, apopple, linux-kernel, kernel-team
On 4/27/26 2:11 PM, Andrew Morton wrote:
> On Mon, 27 Apr 2026 08:15:20 -0700 "JP Kobryn (Meta)" <jp.kobryn@linux.dev> wrote:
>
>> When investigating pressure on a NUMA node, there is no straightforward way
>> to determine which user-defined policies are driving allocations to it.
>>
>> Add NUMA mempolicy allocation counters as new node stat items. These
>> counters track allocations to nodes and also whether the allocations were
>> intentional or fallbacks.
>
> AI review:
> https://sashiko.dev/#/patchset/20260427151520.137341-1-jp.kobryn@linux.dev
This was helpful. I quoted the review points and answered them inline.
: For MPOL_PREFERRED_MANY and MPOL_BIND policies, policy_nodemask() does
: not modify the nid parameter unless home_node is set, so intended_nid
: defaults to the caller's local node.
Yes, this patch is not intended to change that behavior.
: If an allocation falls back outside the preferred nodemask, will the
: FOREIGN stat incorrectly penalize the local node, which was never the
: intended target?
If the allocation lands inside the mask it counts as a hit; if it lands
outside the mask, the intended node is incremented with foreign. Assuming
no home node is set, the local node dictates the search path for the
fallback, so in that regard foreign can apply to it. The alternative would
be to increment foreign on every node in the mask after a miss, but that
would imbalance miss vs foreign and skew the data. So foreign may not make
sense for mask-based policies.
: Furthermore, if the fallback allocation lands on the local node, will
: it simultaneously count as both a 'miss' and a 'foreign' on the exact
: same node?
Yes. For example, a task running on node 0 with MPOL_PREFERRED_MANY over
nodes 1-2 that falls back to node 0 would count both a miss (on
actual_nid) and a foreign (intended_nid resolves to the local node) on
node 0. That is more support that foreign does not map well to a
mask-based policy. Having data on just hit and miss would be sufficient
for the investigative purpose of this patch.
: mod_node_page_state() unconditionally executes local_irq_save() and
: local_irq_restore().
: Since mpol_count_numa_alloc() is invoked on every
: successful page allocation governed by a user-defined mempolicy, does
: this introduce severe IRQ-disabling overhead into the highly optimized
: page allocation fast path?
: Established NUMA counters (like NUMA_HIT) avoid this lock contention
: by using lockless per-cpu operations via __count_numa_event() and
: raw_cpu_add().
For reasons explained further below, I plan to change from counters to
tracepoints.
: The patch tracks user-defined mempolicy allocations by instrumenting
: alloc_pages_mpol().
: Bulk memory allocations under a mempolicy are routed through
: alloc_pages_bulk_mempolicy_noprof(), which dispatches to specialized
: bulk allocators and bypasses alloc_pages_mpol() entirely. Will this
: lead to silent undercounting of mempolicy allocations for workloads
: utilizing bulk allocation?
: Similarly, Hugetlbfs allocations resolve their mempolicies
: independently via huge_node() and allocate pages through
: alloc_buddy_hugetlb_folio_with_mpol(), which directly invokes the
: buddy allocator. Will Hugetlbfs allocations also be completely
: excluded from the new NUMA_MPOL_* counters?
It seems the existing NUMA_INTERLEAVE_HIT misses this as well (only
counted in alloc_pages_mpol()). But closing this gap with the new stats
looks like it will become messy since every individual allocation of the
bulk request would have to be accounted for. I think using tracepoints
would be cleaner, addressing not only this bulk issue but the other
concerns as well.
I know some other reviewers favored tracepoints over adding new stats
altogether. Originally I saw that as a convenience trade-off because of
the instrumentation needed from a userspace consumer. But given the
challenges with the foreign mapping, the irq overhead concern, and the
bulk counting complexity, I'll go the tracepoint direction in v5 and
hopefully build more consensus.
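As a first rough sketch of that direction (event and field names are
illustrative only, not the final v5 form, and the usual
include/trace/events/ header boilerplate is omitted):

	TRACE_EVENT(mpol_numa_alloc,
		TP_PROTO(unsigned short mode, int intended_nid,
			 int actual_nid, unsigned int order),
		TP_ARGS(mode, intended_nid, actual_nid, order),
		TP_STRUCT__entry(
			__field(unsigned short, mode)
			__field(int, intended_nid)
			__field(int, actual_nid)
			__field(unsigned int, order)
		),
		TP_fast_assign(
			__entry->mode = mode;
			__entry->intended_nid = intended_nid;
			__entry->actual_nid = actual_nid;
			__entry->order = order;
		),
		TP_printk("mode=%u intended_nid=%d actual_nid=%d order=%u",
			  __entry->mode, __entry->intended_nid,
			  __entry->actual_nid, __entry->order)
	);

Userspace could then aggregate hit/miss/foreign however it sees fit, and
the bulk and hugetlb paths would only need a single trace call each.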