* [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Johannes Weiner @ 2026-06-29 16:33 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar

Neha reports that mapped shmem aren't considered for NUMA balancing,
noting convergence problems and bandwidth bottlenecking for cachelib
based workloads on tiered memory systems.

Looking at the code and going through the git history, this doesn't
actually seem intentional:

Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
motivation was a real usecase: Oracle was pinning shared segments with
mbind(MPOL_BIND) so trapping faults was both expensive and pointless.

The handling of NULL from vm_ops->get_policy, however, treated "user
explicitly opted out" the same as "user never specified anything." For
VMAs whose shared policy is absent - the common case for shmem - the
scan was disabled too.

This issue is old. It probably hurts less in conventional NUMA. But it's
very noticeable on tiered systems, where entire tmpfs workingsets can get
stuck on lower-bandwidth memory.

Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
thereby handle the fallback to task policy (-> preferred_node_policy()
has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
already handles it this way, the scan-eligibility check was the outlier.

This preserves Mel's intended fix: don't scan stuff the user explicitly
pinned. But allow default policy vmas to participate in balancing.

Reported-by: Neha Gholkar <nehagholkar@gmail.com>
Tested-by: Neha Gholkar <nehagholkar@gmail.com>
Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/mempolicy.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..bba65898aee1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 bool vma_policy_mof(struct vm_area_struct *vma)
 {
 	struct mempolicy *pol;
+	pgoff_t ilx;
+	bool mof;
 
-	if (vma->vm_ops && vma->vm_ops->get_policy) {
-		bool ret = false;
-		pgoff_t ilx;		/* ignored here */
-
-		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
-		if (pol && (pol->flags & MPOL_F_MOF))
-			ret = true;
-		mpol_cond_put(pol);
-
-		return ret;
-	}
-
-	pol = vma->vm_policy;
+	pol = __get_vma_policy(vma, vma->vm_start, &ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-
-	return pol->flags & MPOL_F_MOF;
+	mof = pol->flags & MPOL_F_MOF;
+	mpol_cond_put(pol);
+	return mof;
 }
 
 bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
-- 
2.54.0
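
For readers without the kernel tree at hand, the behavior change in the patch above can be modeled in plain userspace C. This is only a sketch under stated assumptions: every `sketch_` name, the `shared_policy` field and the flag value are hypothetical stand-ins rather than kernel definitions, and reference counting (mpol_cond_put()) is left out. It contrasts the old rule, where a NULL result from the get_policy hook meant "do not scan", with the new rule, where NULL falls back to the task policy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bit for this sketch; only "set or clear" matters here. */
#define SKETCH_MPOL_F_MOF (1u << 3)

struct sketch_policy {
	unsigned int flags;
};

struct sketch_vma {
	/* Models vm_ops->get_policy; NULL when the VMA has no such hook. */
	const struct sketch_policy *(*get_policy)(const struct sketch_vma *vma);
	/* Models vma->vm_policy for VMAs without a get_policy hook. */
	const struct sketch_policy *vm_policy;
	/* Hypothetical field: what the hook returns (NULL = nothing set). */
	const struct sketch_policy *shared_policy;
};

/* Models the task default, which carries MPOL_F_MOF per the changelog. */
static const struct sketch_policy sketch_task_policy = {
	.flags = SKETCH_MPOL_F_MOF,
};

/* Models a shmem-style hook that may find no policy at all. */
static const struct sketch_policy *
sketch_shared_lookup(const struct sketch_vma *vma)
{
	return vma->shared_policy;
}

/* Before: a NULL result from the hook is reported as "not MOF". */
static bool sketch_mof_before(const struct sketch_vma *vma)
{
	const struct sketch_policy *pol;

	if (vma->get_policy) {
		pol = vma->get_policy(vma);
		return pol && (pol->flags & SKETCH_MPOL_F_MOF);
	}
	pol = vma->vm_policy;
	if (!pol)
		pol = &sketch_task_policy;
	return pol->flags & SKETCH_MPOL_F_MOF;
}

/* After: a NULL result falls back to the task policy on every path. */
static bool sketch_mof_after(const struct sketch_vma *vma)
{
	const struct sketch_policy *pol;

	pol = vma->get_policy ? vma->get_policy(vma) : vma->vm_policy;
	if (!pol)
		pol = &sketch_task_policy;
	return pol->flags & SKETCH_MPOL_F_MOF;
}
```

In this model a VMA whose hook returns NULL is skipped by sketch_mof_before() but scanned by sketch_mof_after(), while a VMA with an explicit policy that lacks the flag is skipped by both, which is the distinction the changelog draws.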

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Gregory Price @ 2026-06-29 17:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> Neha reports that mapped shmem aren't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
>
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
>
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
>
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
>
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.
>

Eugh.

Demotions don't care about mempolicy, so opting shmem out of NUMA
balancing and mbind'ing on a tiered system is just full sadness.

This is all just more evidence that demotion needs to be completely
redone, it's creating a mess of undefined behavior for memory placement.

> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
>
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
>
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Gregory Price <gourry@gourry.net>

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Johannes Weiner @ 2026-06-29 18:22 UTC (permalink / raw)
To: Gregory Price
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 01:59:41PM -0400, Gregory Price wrote:
> On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> > Neha reports that mapped shmem aren't considered for NUMA balancing,
> > noting convergence problems and bandwidth bottlenecking for cachelib
> > based workloads on tiered memory systems.
> >
> > Looking at the code and going through the git history, this doesn't
> > actually seem intentional:
> >
> > Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> > VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> > policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> > motivation was a real usecase: Oracle was pinning shared segments with
> > mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> >
> > The handling of NULL from vm_ops->get_policy, however, treated "user
> > explicitly opted out" the same as "user never specified anything." For
> > VMAs whose shared policy is absent - the common case for shmem - the
> > scan was disabled too.
> >
> > This issue is old. It probably hurts less in conventional NUMA. But it's
> > very noticeable on tiered systems, where entire tmpfs workingsets can get
> > stuck on lower-bandwidth memory.
> >
>
> Eugh.
>
> Demotions don't care about mempolicy, so opting shmem out of NUMA
> balancing and mbind'ing on a tiered system is just full sadness.

Right, mbinding in tiered mode is a whole other ball of wax. I'm just
trying to make the default case work ;-)

> This is all just more evidence that demotion needs to be completely
> redone, it's creating a mess of undefined behavior for memory placement.

No argument from me.

> > Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> > thereby handle the fallback to task policy (-> preferred_node_policy()
> > has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> > already handles it this way, the scan-eligibility check was the outlier.
> >
> > This preserves Mel's intended fix: don't scan stuff the user explicitly
> > pinned. But allow default policy vmas to participate in balancing.
> >
> > Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> > Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> > Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Reviewed-by: Gregory Price <gourry@gourry.net>

Thanks! Sorry for making you feel bad.

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Huang, Ying @ 2026-06-30 11:20 UTC (permalink / raw)
To: Gregory Price
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

Gregory Price <gourry@gourry.net> writes:

[snip]

> Demotions don't care about mempolicy, so opting shmem out of NUMA
> balancing and mbind'ing on a tiered system is just full sadness.
>
> This is all just more evidence that demotion needs to be completely
> redone, it's creating a mess of undefined behavior for memory placement.

It's hard to respect mempolicy during demotion in the current
implementation. Do you have any ideas on how to improve this?

---
Best Regards,
Huang, Ying

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Gregory Price @ 2026-06-30 15:29 UTC (permalink / raw)
To: Huang, Ying
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> [snip]
>
> > Demotions don't care about mempolicy, so opting shmem out of NUMA
> > balancing and mbind'ing on a tiered system is just full sadness.
> >
> > This is all just more evidence that demotion needs to be completely
> > redone, it's creating a mess of undefined behavior for memory placement.
>
> It's hard to respect mempolicy during demotion in the current
> implementation. Do you have any ideas on how to improve this?
>

I think it's feasible we could respect per-vma mempolicies, but not
per-task. That would at least make this particular interaction less
painful and mbind() would do what you'd expect. It is a bit racy,
but with MPOL_MF_MOVE_ALL the user can get what they actually want.

I think task-wide mempolicy is problematic and generally a bad idea
on tiered systems, maybe it's ok if we simply document task policies
are not respected on tiered systems?

~Gregory

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Huang, Ying @ 2026-07-01 11:03 UTC (permalink / raw)
To: Gregory Price
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

Gregory Price <gourry@gourry.net> writes:

> On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> [snip]
>>
>> > Demotions don't care about mempolicy, so opting shmem out of NUMA
>> > balancing and mbind'ing on a tiered system is just full sadness.
>> >
>> > This is all just more evidence that demotion needs to be completely
>> > redone, it's creating a mess of undefined behavior for memory placement.
>>
>> It's hard to respect mempolicy during demotion in the current
>> implementation. Do you have any ideas on how to improve this?
>>
>
> I think it's feasible we could respect per-vma mempolicies, but not
> per-task. That would at least make this particular interaction less
> painful and mbind() would do what you'd expect. It is a bit racy,
> but with MPOL_MF_MOVE_ALL the user can get what they actually want.

Yes. Per-vma mempolicy support is possible.

> I think task-wide mempolicy is problematic and generally a bad idea
> on tiered systems, maybe it's ok if we simply document task policies
> are not respected on tiered systems?

Anyway, it's convenient to use numactl to manage mempolicy.

Is it possible to enable NUMA_BALANCING_MEMORY_TIERING for non-default
VMAs? If we don't enable NUMA_BALANCING_NORMAL, the overhead should be
OK because the page table entries are changed to PROT_NONE only for
pages on the slow tier.

Additionally, we may need to consider cpusets.

---
Best Regards,
Huang, Ying

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Gregory Price @ 2026-07-01 15:33 UTC (permalink / raw)
To: Huang, Ying
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Wed, Jul 01, 2026 at 07:03:32PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> > On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >>
> >> [snip]
> >>
> >> > Demotions don't care about mempolicy, so opting shmem out of NUMA
> >> > balancing and mbind'ing on a tiered system is just full sadness.
> >> >
> >> > This is all just more evidence that demotion needs to be completely
> >> > redone, it's creating a mess of undefined behavior for memory placement.
> >>
> >> It's hard to respect mempolicy during demotion in the current
> >> implementation. Do you have any ideas on how to improve this?
> >>
> >
> > I think it's feasible we could respect per-vma mempolicies, but not
> > per-task. That would at least make this particular interaction less
> > painful and mbind() would do what you'd expect. It is a bit racy,
> > but with MPOL_MF_MOVE_ALL the user can get what they actually want.
>
> Yes. Per-vma mempolicy support is possible.
>
> > I think task-wide mempolicy is problematic and generally a bad idea
> > on tiered systems, maybe it's ok if we simply document task policies
> > are not respected on tiered systems?
>
> Anyway, it's convenient to use numactl to manage mempolicy.
>

It can be, but there's also many footguns with task-wide policy.

Something i found while seeing if i could make ZONE_NORMAL nodes more
reliably hotpluggable:

diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index dd2717ff94bf..9ceeb56574ef 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -682,7 +682,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	 * we won't be able to do that under the lock.
 	 */
 	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
-		page = alloc_pages(gfp_nested_mask(alloc_flags),
+		/*
+		 * The stack depot pool is a global, never-freed allocation.
+		 * Use alloc_pages_node() on the CPU-local node instead of
+		 * alloc_pages() so the pool does not inherit a transient task's
+		 * NUMA mempolicy (e.g. MPOL_BIND to a CPU-less/bound node), which
+		 * would strand this long-lived page on that node forever.
+		 */
+		page = alloc_pages_node(numa_node_id(),
+					gfp_nested_mask(alloc_flags),
 				   DEPOT_POOL_ORDER);
 		if (page)
 			prealloc = page_address(page);

This is a global, permanently allocated, resource that inherits a task
mempolicy's placement because that task *happened* to be the first one
to touch it.

There are many alloc_pages() calls (155 instances kernel-wide) that
inherit a task mempolicy when that's probably not what we want.

alloc_pages() is called in: net/, lib/, kexec_core/, drivers/, arch/

you can imagine a task setting `set_mempolicy(INTERLEAVE, ALL)` and the
result is a bunch of random driver memory gets spread all over the place
along with the task's heap. Is that really what the caller wanted, or
did they just want userland data spread about?

But at this point it's a 20 year old interface, not much we can do about
it without making *someone* sad :[

I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
memory only" - and then slowly trying to migrate numactl to make this
the default.

> Is it possible to enable NUMA_BALANCING_MEMORY_TIERING for non-default
> VMAs? If we don't enable NUMA_BALANCING_NORMAL, the overhead should be
> OK because the page table entries are changed to PROT_NONE only for
> pages on the slow tier.
>

hmmm, will have to give this some thought.

> Additionally, we may need to consider cpusets.
>

Direct reclaim considers cpusets for the reclaiming task (added
recently), kswapd sits in its own cgroup. Cross-cpuset checks - i'm not
sure how tractable that is. We ignore it for now, recognizing that if
something is cross-cpuset it's some definition of shared/global object
(e.g. pagecache mappings).

~Gregory
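
The first-touch hazard described in the message above can be illustrated with a small userspace model. It is a sketch only: the `sketch_` names and the integer node ids are hypothetical, and no kernel allocator is involved. A lazily created global pool takes whatever node its first caller selects, so a policy-following allocation strands it on the first task's bound node, while a CPU-local allocation does not.

```c
struct sketch_task {
	int cpu_node;    /* node of the CPU the task runs on */
	int bound_node;  /* node forced by task policy, or -1 for none */
};

/* Node of the lazily allocated global pool; -1 means not allocated yet. */
static int sketch_pool_node = -1;

/* Models an allocation that follows the calling task's policy. */
static int sketch_node_by_task_policy(const struct sketch_task *t)
{
	return t->bound_node >= 0 ? t->bound_node : t->cpu_node;
}

/* Models an allocation placed on the CPU-local node, ignoring policy. */
static int sketch_node_local(const struct sketch_task *t)
{
	return t->cpu_node;
}

/* The first caller decides placement; later callers reuse the pool. */
static int sketch_pool_get(const struct sketch_task *t,
			   int (*pick_node)(const struct sketch_task *))
{
	if (sketch_pool_node < 0)
		sketch_pool_node = pick_node(t);
	return sketch_pool_node;
}

/* Test helper: forget the pool so another scenario can be replayed. */
static void sketch_pool_reset(void)
{
	sketch_pool_node = -1;
}
```

With sketch_node_by_task_policy() as the picker, a task bound to node 2 that touches the pool first leaves it on node 2 for every later caller; with sketch_node_local() the pool lands on the caller's CPU node regardless of policy.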

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Johannes Weiner @ 2026-07-01 15:49 UTC (permalink / raw)
To: Gregory Price
Cc: Huang, Ying, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Wed, Jul 01, 2026 at 11:33:38AM -0400, Gregory Price wrote:
> Something i found while seeing if i could make ZONE_NORMAL nodes more
> reliably hotpluggable:
>
> diff --git a/lib/stackdepot.c b/lib/stackdepot.c
> index dd2717ff94bf..9ceeb56574ef 100644
> --- a/lib/stackdepot.c
> +++ b/lib/stackdepot.c
> @@ -682,7 +682,15 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
>  	 * we won't be able to do that under the lock.
>  	 */
>  	if (unlikely(can_alloc && !READ_ONCE(new_pool))) {
> -		page = alloc_pages(gfp_nested_mask(alloc_flags),
> +		/*
> +		 * The stack depot pool is a global, never-freed allocation.
> +		 * Use alloc_pages_node() on the CPU-local node instead of
> +		 * alloc_pages() so the pool does not inherit a transient task's
> +		 * NUMA mempolicy (e.g. MPOL_BIND to a CPU-less/bound node), which
> +		 * would strand this long-lived page on that node forever.
> +		 */
> +		page = alloc_pages_node(numa_node_id(),
> +					gfp_nested_mask(alloc_flags),
>  				   DEPOT_POOL_ORDER);
>  		if (page)
>  			prealloc = page_address(page);
>
> This is a global, permanently allocated, resource that inherits a task
> mempolicy's placement because that task *happened* to be the first one
> to touch it.
>
> There are many alloc_pages() calls (155 instances kernel-wide) that
> inherit a task mempolicy when that's probably not what we want.
>
> alloc_pages() is called in: net/, lib/, kexec_core/, drivers/, arch/
>
> you can imagine a task setting `set_mempolicy(INTERLEAVE, ALL)` and the
> result is a bunch of random driver memory gets spread all over the place
> along with the task's heap. Is that really what the caller wanted, or
> did they just want userland data spread about?
>
> But at this point it's a 20 year old interface, not much we can do about
> it without making *someone* sad :[
>
> I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
> memory only" - and then slowly trying to migrate numactl to make this
> the default.

Hm. Kernel allocations that are totally incidental like the stackdepot
example above should not follow task policy. But there are kernel
allocations (kernel stack, inodes, pipes) that are directly allocated
on behalf of a process, and so probably SHOULD follow task policy.

That's an annotation problem that I think we have solved already,
because cgroups need the same distinction for what allocations to
charge to the current task's cgroup context.

We could rename __GFP_ACCOUNT / SLAB_ACCOUNT to __GFP_TASKPOLICY /
SLAB_TASKPOLICY or something, and have mempolicy follow it too.

There is still the whole "changing 20 year old behavior" aspect, but I
think the polarity works in our favor: big important allocations have
already been following the policy correctly. The behavior changes
primarily for smaller, random allocations.

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Gregory Price @ 2026-07-01 16:22 UTC (permalink / raw)
To: Johannes Weiner
Cc: Huang, Ying, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar

On Wed, Jul 01, 2026 at 11:49:16AM -0400, Johannes Weiner wrote:
> On Wed, Jul 01, 2026 at 11:33:38AM -0400, Gregory Price wrote:
> >
> > I considered proposing MPOL_F_MOVABLE_ONLY to mean (roughly) "userland
> > memory only" - and then slowly trying to migrate numactl to make this
> > the default.
>
> Hm. Kernel allocations that are totally incidental like the stackdepot
> example above should not follow task policy. But there are kernel
> allocations (kernel stack, inodes, pipes) that are directly allocated
> on behalf of a process, and so probably SHOULD follow task policy.
>
> That's an annotation problem that I think we have solved already,
> because cgroups need the same distinction for what allocations to
> charge to the current task's cgroup context.
>
> We could rename __GFP_ACCOUNT / SLAB_ACCOUNT to __GFP_TASKPOLICY /
> SLAB_TASKPOLICY or something, and have mempolicy follow it too.
>
> There is still the whole "changing 20 year old behavior" aspect, but I
> think the polarity works in our favor: big important allocations have
> already been following the policy correctly. The behavior changes
> primarily for smaller, random allocations.

This seems like it would be a pretty trivial change.

something like:

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
				   pgoff_t ilx, int *nid)
{
	nodemask_t *nodemask = NULL;

	if (!(gfp & __GFP_TASKPOLICY))
		/* don't do the task policy, numa_node_id()? */

	... snip ...
}
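
The opt-in idea floated in the two messages above can be sketched as a tiny decision function in userspace C. Everything here is hypothetical: `SKETCH_GFP_TASKPOLICY` merely stands in for the flag name proposed in the thread, and the node ids are plain integers. The point is the polarity: only allocations that carry the flag consult the task policy, and everything else stays on the local node.

```c
#include <stddef.h>

typedef unsigned int sketch_gfp_t;

/* Hypothetical opt-in bit, standing in for the flag proposed above. */
#define SKETCH_GFP_TASKPOLICY (1u << 0)

struct sketch_task_policy {
	int preferred_node;
};

/*
 * Pick a node for an allocation: follow the task policy only when the
 * caller opted in and a policy exists, otherwise use the local node.
 */
static int sketch_pick_node(sketch_gfp_t gfp,
			    const struct sketch_task_policy *pol,
			    int local_node)
{
	if (pol == NULL || !(gfp & SKETCH_GFP_TASKPOLICY))
		return local_node;
	return pol->preferred_node;
}
```

Under this model an incidental allocation without the flag lands on the local node even when the calling task has a policy, while an allocation made on behalf of the task and marked with the flag follows it.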

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: David Hildenbrand (Arm) @ 2026-06-29 18:33 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, linux-mm,
linux-kernel, Neha Gholkar

On 6/29/26 18:33, Johannes Weiner wrote:
> Neha reports that mapped shmem aren't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
>
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
>
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
>
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
>
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.

Sounds bad enough to warrant CC: stable?

>
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
>
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
>
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/mempolicy.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..bba65898aee1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  bool vma_policy_mof(struct vm_area_struct *vma)
>  {
>  	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	bool mof;
>
> -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> -		bool ret = false;
> -		pgoff_t ilx;		/* ignored here */
> -
> -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> -		if (pol && (pol->flags & MPOL_F_MOF))
> -			ret = true;
> -		mpol_cond_put(pol);
> -
> -		return ret;
> -	}

Okay, we used the fallback of vma->vm_policy before (if vma->vm_ops->get_policy
was not available), which is what __get_vma_policy() does as well.

But if vma->vm_ops->get_policy now returns NULL, we fall back to get_task_policy().

Makes sense to me although this is a source of confusion for me.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Johannes Weiner @ 2026-06-29 18:47 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 08:33:32PM +0200, David Hildenbrand (Arm) wrote:
> On 6/29/26 18:33, Johannes Weiner wrote:
> > Neha reports that mapped shmem aren't considered for NUMA balancing,
> > noting convergence problems and bandwidth bottlenecking for cachelib
> > based workloads on tiered memory systems.
> >
> > Looking at the code and going through the git history, this doesn't
> > actually seem intentional:
> >
> > Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> > VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> > policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> > motivation was a real usecase: Oracle was pinning shared segments with
> > mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> >
> > The handling of NULL from vm_ops->get_policy, however, treated "user
> > explicitly opted out" the same as "user never specified anything." For
> > VMAs whose shared policy is absent - the common case for shmem - the
> > scan was disabled too.
> >
> > This issue is old. It probably hurts less in conventional NUMA. But it's
> > very noticeable on tiered systems, where entire tmpfs workingsets can get
> > stuck on lower-bandwidth memory.
>
> Sounds bad enough to warrant CC: stable?

No objection from me. I was hesitant because it's old, and while these
are real workloads that see it they are hardware/kernel validation
runs. OTOH it's a straight-forward bug and should backport easily.

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/mempolicy.c | 21 ++++++---------------
> >  1 file changed, 6 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 36699fabd3c2..bba65898aee1 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  bool vma_policy_mof(struct vm_area_struct *vma)
> >  {
> >  	struct mempolicy *pol;
> > +	pgoff_t ilx;
> > +	bool mof;
> >
> > -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> > -		bool ret = false;
> > -		pgoff_t ilx;		/* ignored here */
> > -
> > -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> > -		if (pol && (pol->flags & MPOL_F_MOF))
> > -			ret = true;
> > -		mpol_cond_put(pol);
> > -
> > -		return ret;
> > -	}
>
> Okay, we used the fallback of vma->vm_policy before (if vma->vm_ops->get_policy
> was not available), which is what __get_vma_policy() does as well.
>
> But if vma->vm_ops->get_policy now returns NULL, we fall back to get_task_policy().

Yep.

> Makes sense to me although this is a source of confusion for me.

How so? Is there anything I can improve in the changelog?

> Acked-by: David Hildenbrand (Arm) <david@kernel.org>

Thanks David!

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: David Hildenbrand (Arm) @ 2026-06-30 11:26 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar

>
>> Makes sense to me although this is a source of confusion for me.
>
> How so? Is there anything I can improve in the changelog?

Oh, it was just a comment in general around NUMA policies :)

-- 
Cheers,

David

* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
From: Balbir Singh @ 2026-06-30 23:40 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar

On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> Neha reports that mapped shmem aren't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
>
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
>
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
>
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
>
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.
>
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
>
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
>
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/mempolicy.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..bba65898aee1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  bool vma_policy_mof(struct vm_area_struct *vma)
>  {
>  	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	bool mof;
>
> -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> -		bool ret = false;
> -		pgoff_t ilx;		/* ignored here */
> -
> -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> -		if (pol && (pol->flags & MPOL_F_MOF))
> -			ret = true;
> -		mpol_cond_put(pol);
> -
> -		return ret;
> -	}
> -
> -	pol = vma->vm_policy;
> +	pol = __get_vma_policy(vma, vma->vm_start, &ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -
> -	return pol->flags & MPOL_F_MOF;
> +	mof = pol->flags & MPOL_F_MOF;
> +	mpol_cond_put(pol);
> +	return mof;
>  }
>
>  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> -- 
>

The change to use the fallback seems reasonable

Acked-by: Balbir Singh <balbirs@nvidia.com>