[PATCH] mm: mempolicy: fix automatic numa balancing for shmem

The Linux Kernel Mailing List

* [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
@ 2026-06-29 16:33 Johannes Weiner
  2026-06-29 17:59 ` Gregory Price
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Johannes Weiner @ 2026-06-29 16:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	linux-mm, linux-kernel, Neha Gholkar

Neha reports that mapped shmem isn't considered for NUMA balancing,
noting convergence problems and bandwidth bottlenecking for
cachelib-based workloads on tiered memory systems.

Looking at the code and going through the git history, this doesn't
actually seem intentional:

Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
motivation was a real usecase: Oracle was pinning shared segments with
mbind(MPOL_BIND) so trapping faults was both expensive and pointless.

The handling of NULL from vm_ops->get_policy, however, treated "user
explicitly opted out" the same as "user never specified anything." For
VMAs whose shared policy is absent - the common case for shmem - the
scan was disabled too.

This issue is old. It probably hurts less in conventional NUMA. But it's
very noticeable on tiered systems, where entire tmpfs workingsets can get
stuck on lower-bandwidth memory.

Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
thereby handle the fallback to task policy (-> preferred_node_policy()
has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
already handles it this way, the scan-eligibility check was the outlier.

This preserves Mel's intended fix: don't scan stuff the user explicitly
pinned. But allow default policy vmas to participate in balancing.

Reported-by: Neha Gholkar <nehagholkar@gmail.com>
Tested-by: Neha Gholkar <nehagholkar@gmail.com>
Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/mempolicy.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..bba65898aee1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 bool vma_policy_mof(struct vm_area_struct *vma)
 {
 	struct mempolicy *pol;
+	pgoff_t ilx;
+	bool mof;

-	if (vma->vm_ops && vma->vm_ops->get_policy) {
-		bool ret = false;
-		pgoff_t ilx;		/* ignored here */
-
-		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
-		if (pol && (pol->flags & MPOL_F_MOF))
-			ret = true;
-		mpol_cond_put(pol);
-
-		return ret;
-	}
-
-	pol = vma->vm_policy;
+	pol = __get_vma_policy(vma, vma->vm_start, &ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-
-	return pol->flags & MPOL_F_MOF;
+	mof = pol->flags & MPOL_F_MOF;
+	mpol_cond_put(pol);
+	return mof;
 }

 bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
-- 
2.54.0
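
The behavioural difference described in the changelog can be sketched as a small user-space model. This is only an illustration under stated assumptions: the `model_*` types, `MODEL_F_MOF`, `mof_before()` and `mof_after()` are hypothetical stand-ins and not kernel code; `model_lookup()` mimics what the thread says __get_vma_policy() does (use vm_ops->get_policy when present, otherwise vma->vm_policy).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_F_MOF 0x1u /* hypothetical stand-in for MPOL_F_MOF */

struct model_policy {
	unsigned int flags;
};

struct model_vma {
	bool has_get_policy;               /* models vm_ops->get_policy being set */
	const struct model_policy *shared; /* what get_policy would return, may be NULL */
	const struct model_policy *vm_policy;
};

/* Models __get_vma_policy(): shared policy if the hook exists, else vm_policy. */
const struct model_policy *model_lookup(const struct model_vma *vma)
{
	return vma->has_get_policy ? vma->shared : vma->vm_policy;
}

/* Old logic: a NULL result from get_policy meant "do not scan". */
bool mof_before(const struct model_vma *vma, const struct model_policy *task)
{
	assert(vma && task);
	if (vma->has_get_policy) {
		const struct model_policy *pol = vma->shared;

		return pol && (pol->flags & MODEL_F_MOF);
	}
	const struct model_policy *pol = vma->vm_policy ? vma->vm_policy : task;

	return (pol->flags & MODEL_F_MOF) != 0;
}

/* New logic: any NULL lookup result falls back to the task policy. */
bool mof_after(const struct model_vma *vma, const struct model_policy *task)
{
	assert(vma && task);
	const struct model_policy *pol = model_lookup(vma);

	if (!pol)
		pol = task;
	return (pol->flags & MODEL_F_MOF) != 0;
}
```

In this model, a shmem-like VMA with a get_policy hook but no shared policy is skipped by `mof_before()` and scanned by `mof_after()`, while a VMA whose policy lacks the flag is skipped by both.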

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
@ 2026-06-29 17:59 ` Gregory Price
  2026-06-29 18:22   ` Johannes Weiner
  2026-06-30 11:20   ` Huang, Ying
  2026-06-29 18:33 ` David Hildenbrand (Arm)
  2026-06-30 23:40 ` Balbir Singh
  2 siblings, 2 replies; 13+ messages in thread
From: Gregory Price @ 2026-06-29 17:59 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> Neha reports that mapped shmem aren't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
> 
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
> 
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> 
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
> 
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.
> 

Eugh.

Demotions don't care about mempolicy, so opting shmem out of NUMA
balancing and mbind'ing on a tiered system is just full sadness.

This is all just more evidence that demotion needs to be completely
redone, it's creating a mess of undefined behavior for memory placement.

> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
> 
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
> 
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Gregory Price <gourry@gourry.net>


* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 17:59 ` Gregory Price
@ 2026-06-29 18:22   ` Johannes Weiner
  2026-06-30 11:20   ` Huang, Ying
  1 sibling, 0 replies; 13+ messages in thread
From: Johannes Weiner @ 2026-06-29 18:22 UTC (permalink / raw)
  To: Gregory Price
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 01:59:41PM -0400, Gregory Price wrote:
> On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> > Neha reports that mapped shmem aren't considered for NUMA balancing,
> > noting convergence problems and bandwidth bottlenecking for cachelib
> > based workloads on tiered memory systems.
> > 
> > Looking at the code and going through the git history, this doesn't
> > actually seem intentional:
> > 
> > Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> > VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> > policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> > motivation was a real usecase: Oracle was pinning shared segments with
> > mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> > 
> > The handling of NULL from vm_ops->get_policy, however, treated "user
> > explicitly opted out" the same as "user never specified anything." For
> > VMAs whose shared policy is absent - the common case for shmem - the
> > scan was disabled too.
> > 
> > This issue is old. It probably hurts less in conventional NUMA. But it's
> > very noticeable on tiered systems, where entire tmpfs workingsets can get
> > stuck on lower-bandwidth memory.
> > 
> 
> Eugh.
> 
> Demotions don't care about mempolicy, so opting shmem out of NUMA
> balancing and mbind'ing on a tiered system is just full sadness.

Right, mbinding in tiered mode is a whole other ball of wax. I'm just
trying to make the default case work ;-)

> This is all just more evidence that demotion needs to be completely
> redone, it's creating a mess of undefined behavior for memory placement.

No argument from me.

> > Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> > thereby handle the fallback to task policy (-> preferred_node_policy()
> > has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> > already handles it this way, the scan-eligibility check was the outlier.
> > 
> > This preserves Mel's intended fix: don't scan stuff the user explicitly
> > pinned. But allow default policy vmas to participate in balancing.
> > 
> > Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> > Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> > Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Gregory Price <gourry@gourry.net>

Thanks! Sorry for making you feel bad.

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 17:59 ` Gregory Price
  2026-06-29 18:22   ` Johannes Weiner
@ 2026-06-30 11:20   ` Huang, Ying
  2026-06-30 15:29     ` Gregory Price
  1 sibling, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2026-06-30 11:20 UTC (permalink / raw)
  To: Gregory Price
  Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

Gregory Price <gourry@gourry.net> writes:

[snip]

> Demotions don't care about mempolicy, so opting shmem out of NUMA
> balancing and mbind'ing on a tiered system is just full sadness.
>
> This is all just more evidence that demotion needs to be completely
> redone, it's creating a mess of undefined behavior for memory placement.

It's hard to respect mempolicy during demotion in the current
implementation.  Do you have any ideas on how to improve this?

---
Best Regards,
Huang, Ying

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-30 11:20   ` Huang, Ying
@ 2026-06-30 15:29     ` Gregory Price
  2026-07-01 11:03       ` Huang, Ying
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Price @ 2026-06-30 15:29 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> [snip]
> 
> > Demotions don't care about mempolicy, so opting shmem out of NUMA
> > balancing and mbind'ing on a tiered system is just full sadness.
> >
> > This is all just more evidence that demotion needs to be completely
> > redone, it's creating a mess of undefined behavior for memory placement.
> 
> It's hard to respect mempolicy during demotion in the current
> implementation.  Do you have any ideas on how to improve this?
> 

I think it's feasible we could respect per-vma mempolicies, but not
per-task.  That would at least make this particular interaction less
painful and mbind() would do what you'd expect.  It is a bit racy,
but with MPOL_MF_MOVE_ALL the user can get what they actually want.

I think task-wide mempolicy is problematic and generally a bad idea
on tiered systems, maybe it's ok if we simply document task policies
are not respected on tiered systems?

~Gregory

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-30 15:29     ` Gregory Price
@ 2026-07-01 11:03       ` Huang, Ying
  2026-07-01 15:33         ` Gregory Price
  0 siblings, 1 reply; 13+ messages in thread
From: Huang, Ying @ 2026-07-01 11:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

Gregory Price <gourry@gourry.net> writes:

> On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> [snip]
>> 
>> > Demotions don't care about mempolicy, so opting shmem out of NUMA
>> > balancing and mbind'ing on a tiered system is just full sadness.
>> >
>> > This is all just more evidence that demotion needs to be completely
>> > redone, it's creating a mess of undefined behavior for memory placement.
>> 
>> It's hard to respect mempolicy during demotion in the current
>> implementation.  Do you have any ideas on how to improve this?
>> 
>
> I think it's feasible we could respect per-vma mempolicies, but not
> per-task.  That would at least make this particular interaction less
> painful and mbind() would do what you'd expect.  It is a bit racy,
> but with MPOL_MF_MOVE_ALL the user can get what they actually want.

Yes.  Per-vma mempolicy support is possible.

> I think task-wide mempolicy is problematic and generally a bad idea
> on tiered systems, maybe it's ok if we simply document task policies
> are not respected on tiered systems?

Anyway, it's convenient to use numactl to manage mempolicy.

Is it possible to enable NUMA_BALANCING_MEMORY_TIERING for non-default
VMAs?  If we don't enable NUMA_BALANCING_NORMAL, the overhead should be
OK because the page table entries are changed to PROT_NONE only for
pages on the slow tier.

Additionally, we may need to consider cpusets.

---
Best Regards,
Huang, Ying

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-07-01 11:03       ` Huang, Ying
@ 2026-07-01 15:33         ` Gregory Price
  2026-07-01 15:49           ` Johannes Weiner
  0 siblings, 1 reply; 13+ messages in thread
From: Gregory Price @ 2026-07-01 15:33 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Wed, Jul 01, 2026 at 07:03:32PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> > On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >> 
> >> [snip]
> >> 
> >> > Demotions don't care about mempolicy, so opting shmem out of NUMA
> >> > balancing and mbind'ing on a tiered system is just full sadness.
> >> >
> >> > This is all just more evidence that demotion needs to be completely
> >> > redone, it's creating a mess of undefined behavior for memory placement.
> >> 
> >> It's hard to respect mempolicy during demotion in the current
> >> implementation.  Do you have any ideas on how to improve this?
> >> 
> >
> > I think it's feasible we could respect per-vma mempolicies, but not
> > per-task.  That would at least make this particular interaction less
> > painful and mbind() would do what you'd expect.  It is a bit racy,
> > but with MPOL_MF_MOVE_ALL the user can get what they actually want.
> 
> Yes.  Per-vma mempolicy support is possible.
> 
> > I think task-wide mempolicy is problematic and generally a bad idea
> > on tiered systems, maybe it's ok if we simply document task policies
> > are not respected on tiered systems?
> 
> Anyway, it's convenient to use numactl to manage mempolicy.
>

It can be, but there are also many footguns with task-wide policy.

Something i found while seeing if i could make ZONE_NORMAL nodes more
reliably hotpluggable:

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index dd2717ff94bf..9ceeb56574ef 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -682,7 +682,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	 * we won't be able to do that under the lock.
 	 */
 	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
-		page = alloc_pages(gfp_nested_mask(alloc_flags),
+		/*
+		 * The stack depot pool is a global, never-freed allocation.
+		 * Use alloc_pages_node() on the CPU-local node instead of
+		 * alloc_pages() so the pool does not inherit a transient task's
+		 * NUMA mempolicy (e.g. MPOL_BIND to a CPU-less/bound node), which
+		 * would strand this long-lived page on that node forever.
+		 */
+		page = alloc_pages_node(numa_node_id(),
+				   gfp_nested_mask(alloc_flags),
 				   DEPOT_POOL_ORDER);
 		if (page)
 			prealloc = page_address(page);

This is a global, permanently allocated, resource that inherits a task
mempolicy's placement because that task *happened* to be the first one
to touch it.
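
A rough user-space model of that "first toucher" effect, as a sketch only: `model_pool`, `MODEL_NO_NODE` and both init helpers are hypothetical names, with one helper standing in for the alloc_pages() behaviour (placement follows the calling task's policy) and the other for alloc_pages_node(numa_node_id(), ...) as used in the diff above.

```c
#include <assert.h>

#define MODEL_NO_NODE (-1) /* hypothetical marker: pool not allocated yet */

struct model_pool {
	int node; /* node the never-freed pool page ended up on */
};

/* Models the alloc_pages() path: the first caller's task policy picks the node. */
int model_pool_init_task_policy(struct model_pool *pool, int task_policy_node,
				int cpu_local_node)
{
	assert(pool);
	(void)cpu_local_node; /* ignored on this path */
	if (pool->node == MODEL_NO_NODE)
		pool->node = task_policy_node;
	return pool->node;
}

/* Models the alloc_pages_node() path: the first caller's CPU-local node is used. */
int model_pool_init_local(struct model_pool *pool, int task_policy_node,
			  int cpu_local_node)
{
	assert(pool);
	(void)task_policy_node; /* the task policy is deliberately ignored */
	if (pool->node == MODEL_NO_NODE)
		pool->node = cpu_local_node;
	return pool->node;
}
```

In both variants the placement is decided once and later callers see the same node; the difference is only whether a transient task's policy gets to decide it.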

There are many alloc_pages() calls (155 instances kernel-wide) that
inherit a task mempolicy when that's probably not what we want.

alloc_pages() is called in: net/, lib/, kexec_core/, drivers/, arch/

you can imagine a task setting `set_mempolicy(INTERLEAVE, ALL)` and the
result is a bunch of random driver memory gets spread all over the place
along with the task's heap.  Is that really what the caller wanted, or
did they just want userland data spread about?

But at this point it's a 20 year old interface, not much we can do about
it without making *someone* sad :[

I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
memory only" - and then slowly trying to migrate numactl to make this
the default.

> Is it possible to enable NUMA_BALANCING_MEMORY_TIERING for non-default
> VMAs?  If we don't enable NUMA_BALANCING_NORMAL, the overhead should be
> OK because the page table entries are changed to PROT_NONE only for
> pages on the slow tier.
> 

hmmm, will have to give this some thought.

> Additionally, we may need to consider cpusets.
> 

Direct reclaim considers cpusets for the reclaiming task (added
recently), kswapd sits in its own cgroup.

Cross-cpuset checks - i'm not sure how tractable that is.  We ignore it
for now, recognizing that if something is cross-cpuset it's some
definition of shared/global object (e.g. pagecache mappings).

~Gregory

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-07-01 15:33         ` Gregory Price
@ 2026-07-01 15:49           ` Johannes Weiner
  2026-07-01 16:22             ` Gregory Price
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2026-07-01 15:49 UTC (permalink / raw)
  To: Gregory Price
  Cc: Huang, Ying, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Wed, Jul 01, 2026 at 11:33:38AM -0400, Gregory Price wrote:
> Something i found while seeing if i could make ZONE_NORMAL nodes more
> reliably hotpluggable:
> 
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index dd2717ff94bf..9ceeb56574ef 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -682,7 +682,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>  	 * we won't be able to do that under the lock.
>  	 */
>  	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
> -		page = alloc_pages(gfp_nested_mask(alloc_flags),
> +		/*
> +		 * The stack depot pool is a global, never-freed allocation.
> +		 * Use alloc_pages_node() on the CPU-local node instead of
> +		 * alloc_pages() so the pool does not inherit a transient task's
> +		 * NUMA mempolicy (e.g. MPOL_BIND to a CPU-less/bound node), which
> +		 * would strand this long-lived page on that node forever.
> +		 */
> +		page = alloc_pages_node(numa_node_id(),
> +				   gfp_nested_mask(alloc_flags),
>  				   DEPOT_POOL_ORDER);
>  		if (page)
>  			prealloc = page_address(page);
> 
> This is a global, permanently allocated, resource that inherits a task
> mempolicy's placement because that task *happened* to be the first one
> to touch it.
> 
> There are many alloc_pages() calls (155 instances kernel-wide) that
> inherit a task mempolicy when that's probably not what we want.
> 
> alloc_pages() is called in: net/, lib/, kexec_core/, drivers/, arch/
> 
> you can imagine a task setting `set_mempolicy(INTERLEAVE, ALL)` and the
> result is a bunch of random driver memory gets spread all over the place
> along with the task's heap.  Is that really what the caller wanted, or
> did they just want userland data spread about?
> 
> But at this point it's a 20 year old interface, not much we can do about
> it without making *someone* sad :[
> 
> I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
> memory only" - and then slowly trying to migrate numactl to make this
> the default.

Hm. Kernel allocations that are totally incidental like the stackdepot
example above should not follow task policy. But there are kernel
allocations (kernel stack, inodes, pipes) that are directly allocated
on behalf of a process, and so probably SHOULD follow task policy.

That's an annotation problem that I think we have solved already,
because cgroups need the same distinction for what allocations to
charge to the current task's cgroup context.

We could rename __GFP_ACCOUNT / SLAB_ACCOUNT to __GFP_TASKPOLICY /
SLAB_TASKPOLICY or something, and have mempolicy follow it too.

There is still the whole "changing 20 year old behavior" aspect, but I
think the polarity works in our favor: big important allocations have
already been following the policy correctly. The behavior changes
primarily for smaller, random allocations.

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-07-01 15:49           ` Johannes Weiner
@ 2026-07-01 16:22             ` Gregory Price
  0 siblings, 0 replies; 13+ messages in thread
From: Gregory Price @ 2026-07-01 16:22 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Huang, Ying, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Wed, Jul 01, 2026 at 11:49:16AM -0400, Johannes Weiner wrote:
> On Wed, Jul 01, 2026 at 11:33:38AM -0400, Gregory Price wrote:
> > 
> > I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
> > memory only" - and then slowly trying to migrate numactl to make this
> > the default.
> 
> Hm. Kernel allocations that are totally incidental like the stackdepot
> example above should not follow task policy. But there are kernel
> allocations (kernel stack, inodes, pipes) that are directly allocated
> on behalf of a process, and so probably SHOULD follow task policy.
> 
> That's an annotation problem that I think we have solved already,
> because cgroups need the same distinction for what allocations to
> charge to the current task's cgroup context.
> 
> We could rename __GFP_ACCOUNT / SLAB_ACCOUNT to __GFP_TASKPOLICY /
> SLAB_TASKPOLICY or something, and have mempolicy follow it too.
> 
> There is still the whole "changing 20 year old behavior" aspect, but I
> think the polarity works in our favor: big important allocations have
> already been following the policy correctly. The behavior changes
> primarily for smaller, random allocations.

This seems like it would be a pretty trivial change.

something like:

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
				   pgoff_t ilx, int *nid)
{
	nodemask_t *nodemask = NULL;
	
	if (!(gfp & __GFP_TASKPOLICY))
		/* don't do the task policy, numa_node_id()? */
... snip ...
}
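
The gating idea can also be modelled outside the kernel. The following is a hedged sketch, not the proposed implementation: `MODEL_GFP_TASKPOLICY` mirrors the __GFP_TASKPOLICY name floated in this thread (no such flag exists today), and `model_task_policy` and `model_pick_node()` are hypothetical. Falling back to the local node when the flag is absent is an assumption; the snippet above leaves that choice open.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_GFP_TASKPOLICY 0x1u /* hypothetical, mirrors the proposed flag name */

struct model_task_policy {
	int preferred_node; /* node the task policy would choose */
};

/*
 * Pick a node for an allocation. Only allocations explicitly marked as
 * being made on behalf of the task follow the task policy; everything
 * else (and tasks without a policy) uses the CPU-local node.
 */
int model_pick_node(unsigned int gfp, const struct model_task_policy *pol,
		    int local_node)
{
	assert(local_node >= 0);
	if (!(gfp & MODEL_GFP_TASKPOLICY) || pol == NULL)
		return local_node;
	return pol->preferred_node;
}
```

With a task policy preferring node 3 and a local node of 0, an unmarked allocation lands on node 0 in this model and a marked one on node 3.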

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
  2026-06-29 17:59 ` Gregory Price
@ 2026-06-29 18:33 ` David Hildenbrand (Arm)
  2026-06-29 18:47   ` Johannes Weiner
  2026-06-30 23:40 ` Balbir Singh
  2 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 18:33 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, linux-mm,
	linux-kernel, Neha Gholkar

On 6/29/26 18:33, Johannes Weiner wrote:
> Neha reports that mapped shmem aren't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
> 
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
> 
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> 
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
> 
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.

Sounds bad enough to warrant CC: stable?

> 
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
> 
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
> 
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")



> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/mempolicy.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..bba65898aee1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  bool vma_policy_mof(struct vm_area_struct *vma)
>  {
>  	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	bool mof;
>  
> -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> -		bool ret = false;
> -		pgoff_t ilx;		/* ignored here */
> -
> -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> -		if (pol && (pol->flags & MPOL_F_MOF))
> -			ret = true;
> -		mpol_cond_put(pol);
> -
> -		return ret;
> -	}

Okay, we used the fallback of vma->vm_policy before (if vma->vm_ops->get_policy
was not available), which is what __get_vma_policy() does as well.

But if vma->vm_ops->get_policy now returns NULL, we fall back to get_task_policy().


Makes sense to me although this is a source of confusion for me.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 18:33 ` David Hildenbrand (Arm)
@ 2026-06-29 18:47   ` Johannes Weiner
  2026-06-30 11:26     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 13+ messages in thread
From: Johannes Weiner @ 2026-06-29 18:47 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 08:33:32PM +0200, David Hildenbrand (Arm) wrote:
> On 6/29/26 18:33, Johannes Weiner wrote:
> > Neha reports that mapped shmem aren't considered for NUMA balancing,
> > noting convergence problems and bandwidth bottlenecking for cachelib
> > based workloads on tiered memory systems.
> > 
> > Looking at the code and going through the git history, this doesn't
> > actually seem intentional:
> > 
> > Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> > VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> > policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> > motivation was a real usecase: Oracle was pinning shared segments with
> > mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> > 
> > The handling of NULL from vm_ops->get_policy, however, treated "user
> > explicitly opted out" the same as "user never specified anything." For
> > VMAs whose shared policy is absent - the common case for shmem - the
> > scan was disabled too.
> > 
> > This issue is old. It probably hurts less in conventional NUMA. But it's
> > very noticeable on tiered systems, where entire tmpfs workingsets can get
> > stuck on lower-bandwidth memory.
> 
> Sounds bad enough to warrant CC: stable?

No objection from me. I was hesitant because it's old, and while these
are real workloads that see it they are hardware/kernel validation
runs. OTOH it's a straight-forward bug and should backport easily.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/mempolicy.c | 21 ++++++---------------
> >  1 file changed, 6 insertions(+), 15 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 36699fabd3c2..bba65898aee1 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  bool vma_policy_mof(struct vm_area_struct *vma)
> >  {
> >  	struct mempolicy *pol;
> > +	pgoff_t ilx;
> > +	bool mof;
> >  
> > -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> > -		bool ret = false;
> > -		pgoff_t ilx;		/* ignored here */
> > -
> > -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> > -		if (pol && (pol->flags & MPOL_F_MOF))
> > -			ret = true;
> > -		mpol_cond_put(pol);
> > -
> > -		return ret;
> > -	}
> 
> Okay, we used the fallback of vma->vm_policy before (if vma->vm_ops->get_policy
> was not available), which is what __get_vma_policy() does as well.
> 
> But if vma->vm_ops->get_policy now returns NULL, we fall back to get_task_policy().

Yep.

> Makes sense to me although this is a source of confusion for me.

How so? Is there anything I can improve in the changelog?

> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thanks David!

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 18:47   ` Johannes Weiner
@ 2026-06-30 11:26     ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-30 11:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	linux-mm, linux-kernel, Neha Gholkar


> 
>> Makes sense to me although this is a source of confusion for me.
> 
> How so? Is there anything I can improve in the changelog?


Oh, it was just a comment in general around NUMA policies :)

-- 
Cheers,

David

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
  2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
  2026-06-29 17:59 ` Gregory Price
  2026-06-29 18:33 ` David Hildenbrand (Arm)
@ 2026-06-30 23:40 ` Balbir Singh
  2 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2026-06-30 23:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> Neha reports that mapped shmem isn't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
> 
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
> 
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real use case: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> 
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
> 
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs working sets can get
> stuck on lower-bandwidth memory.
> 
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way; the scan-eligibility check was the outlier.
> 
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
> 
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/mempolicy.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..bba65898aee1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  bool vma_policy_mof(struct vm_area_struct *vma)
>  {
>  	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	bool mof;
>  
> -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> -		bool ret = false;
> -		pgoff_t ilx;		/* ignored here */
> -
> -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> -		if (pol && (pol->flags & MPOL_F_MOF))
> -			ret = true;
> -		mpol_cond_put(pol);
> -
> -		return ret;
> -	}
> -
> -	pol = vma->vm_policy;
> +	pol = __get_vma_policy(vma, vma->vm_start, &ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -
> -	return pol->flags & MPOL_F_MOF;
> +	mof = pol->flags & MPOL_F_MOF;
> +	mpol_cond_put(pol);
> +	return mof;
>  }
>  
>  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> -- 
> 

The change to use the fallback seems reasonable.

Acked-by: Balbir Singh <balbirs@nvidia.com>


Thread overview: 13+ messages
2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
2026-06-29 17:59 ` Gregory Price
2026-06-29 18:22   ` Johannes Weiner
2026-06-30 11:20   ` Huang, Ying
2026-06-30 15:29     ` Gregory Price
2026-07-01 11:03       ` Huang, Ying
2026-07-01 15:33         ` Gregory Price
2026-07-01 15:49           ` Johannes Weiner
2026-07-01 16:22             ` Gregory Price
2026-06-29 18:33 ` David Hildenbrand (Arm)
2026-06-29 18:47   ` Johannes Weiner
2026-06-30 11:26     ` David Hildenbrand (Arm)
2026-06-30 23:40 ` Balbir Singh
