* [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
@ 2026-06-29 16:33 Johannes Weiner
2026-06-29 17:59 ` Gregory Price
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Johannes Weiner @ 2026-06-29 16:33 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar
Neha reports that mapped shmem isn't considered for NUMA balancing,
noting convergence problems and bandwidth bottlenecking for cachelib
based workloads on tiered memory systems.
Looking at the code and going through the git history, this doesn't
actually seem intentional:
Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
motivation was a real usecase: Oracle was pinning shared segments with
mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
The handling of NULL from vm_ops->get_policy, however, treated "user
explicitly opted out" the same as "user never specified anything." For
VMAs whose shared policy is absent - the common case for shmem - the
scan was disabled too.
This issue is old. It probably hurts less in conventional NUMA. But it's
very noticeable on tiered systems, where entire tmpfs workingsets can get
stuck on lower-bandwidth memory.
Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
thereby handle the fallback to task policy (-> preferred_node_policy()
has MPOL_F_MOF by default). Every other consumer of vm_ops->get_policy
already handles it this way; the scan-eligibility check was the outlier.
This preserves Mel's intended fix: don't scan stuff the user explicitly
pinned. But allow default policy vmas to participate in balancing.
Reported-by: Neha Gholkar <nehagholkar@gmail.com>
Tested-by: Neha Gholkar <nehagholkar@gmail.com>
Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/mempolicy.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36699fabd3c2..bba65898aee1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 bool vma_policy_mof(struct vm_area_struct *vma)
 {
 	struct mempolicy *pol;
+	pgoff_t ilx;
+	bool mof;

-	if (vma->vm_ops && vma->vm_ops->get_policy) {
-		bool ret = false;
-		pgoff_t ilx; /* ignored here */
-
-		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
-		if (pol && (pol->flags & MPOL_F_MOF))
-			ret = true;
-		mpol_cond_put(pol);
-
-		return ret;
-	}
-
-	pol = vma->vm_policy;
+	pol = __get_vma_policy(vma, vma->vm_start, &ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-
-	return pol->flags & MPOL_F_MOF;
+	mof = pol->flags & MPOL_F_MOF;
+	mpol_cond_put(pol);
+	return mof;
 }

 bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
--
2.54.0
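As a rough userspace illustration of the behavior change described in the
changelog (this is not kernel code), the sketch below models
vma_policy_mof() before and after the patch. The model_* names, the flag
value, and the simplified get_policy signature are assumptions made for
illustration only; reference counting via mpol_cond_put() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for MPOL_F_MOF; the numeric value is an assumption. */
#define MODEL_F_MOF 0x1u

struct model_policy {
	unsigned int flags;
};

struct model_vma {
	/* Non-NULL models a VMA that has vm_ops->get_policy, e.g. shmem. */
	const struct model_policy *(*get_policy)(const struct model_vma *vma);
	/* Shared policy handed back by get_policy; NULL if none was set. */
	const struct model_policy *shared_policy;
	/* Models vma->vm_policy for VMAs without get_policy. */
	const struct model_policy *vm_policy;
};

static const struct model_policy *
model_shared_get_policy(const struct model_vma *vma)
{
	return vma->shared_policy;
}

/* Pre-patch logic: a NULL result from get_policy means "not eligible". */
static bool model_mof_old(const struct model_vma *vma,
			  const struct model_policy *task_pol)
{
	const struct model_policy *pol;

	if (vma->get_policy) {
		pol = vma->get_policy(vma);
		return pol && (pol->flags & MODEL_F_MOF);
	}
	pol = vma->vm_policy;
	if (!pol)
		pol = task_pol;
	return pol->flags & MODEL_F_MOF;
}

/* Post-patch logic: NULL from either source falls back to the task policy. */
static bool model_mof_new(const struct model_vma *vma,
			  const struct model_policy *task_pol)
{
	const struct model_policy *pol;

	pol = vma->get_policy ? vma->get_policy(vma) : vma->vm_policy;
	if (!pol)
		pol = task_pol;
	return pol->flags & MODEL_F_MOF;
}

/* Walks the three cases discussed in the changelog. */
static void model_self_check(void)
{
	const struct model_policy task_default = { .flags = MODEL_F_MOF };
	const struct model_policy bound = { .flags = 0 };
	struct model_vma shmem_default = {
		.get_policy = model_shared_get_policy,
	};
	struct model_vma shmem_bound = {
		.get_policy = model_shared_get_policy,
		.shared_policy = &bound,
	};
	struct model_vma anon = { .get_policy = NULL };

	/* shmem without a shared policy: skipped before, scanned now. */
	assert(!model_mof_old(&shmem_default, &task_default));
	assert(model_mof_new(&shmem_default, &task_default));

	/* Explicitly pinned shmem: skipped in both versions. */
	assert(!model_mof_old(&shmem_bound, &task_default));
	assert(!model_mof_new(&shmem_bound, &task_default));

	/* VMA without get_policy: unchanged, follows the task policy. */
	assert(model_mof_old(&anon, &task_default));
	assert(model_mof_new(&anon, &task_default));
}
```

Calling model_self_check() from a main() exercises all three cases; only
the first one (shmem with no shared policy set) differs between the old
and the new logic.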
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
@ 2026-06-29 17:59 ` Gregory Price
2026-06-29 18:22 ` Johannes Weiner
2026-06-30 11:20 ` Huang, Ying
2026-06-29 18:33 ` David Hildenbrand (Arm)
2026-06-30 23:40 ` Balbir Singh
2 siblings, 2 replies; 10+ messages in thread
From: Gregory Price @ 2026-06-29 17:59 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar
On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> Neha reports that mapped shmem isn't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
>
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
>
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
>
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
>
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.
>
Eugh.
Demotions don't care about mempolicy, so opting shmem out of NUMA
balancing and mbind'ing on a tiered system is just full sadness.
This is all just more evidence that demotion needs to be completely
redone, it's creating a mess of undefined behavior for memory placement.
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
>
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
>
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Gregory Price <gourry@gourry.net>
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 17:59 ` Gregory Price
@ 2026-06-29 18:22 ` Johannes Weiner
2026-06-30 11:20 ` Huang, Ying
1 sibling, 0 replies; 10+ messages in thread
From: Johannes Weiner @ 2026-06-29 18:22 UTC (permalink / raw)
To: Gregory Price
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar
On Mon, Jun 29, 2026 at 01:59:41PM -0400, Gregory Price wrote:
> On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> > Neha reports that mapped shmem isn't considered for NUMA balancing,
> > noting convergence problems and bandwidth bottlenecking for cachelib
> > based workloads on tiered memory systems.
> >
> > Looking at the code and going through the git history, this doesn't
> > actually seem intentional:
> >
> > Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> > VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> > policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> > motivation was a real usecase: Oracle was pinning shared segments with
> > mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> >
> > The handling of NULL from vm_ops->get_policy, however, treated "user
> > explicitly opted out" the same as "user never specified anything." For
> > VMAs whose shared policy is absent - the common case for shmem - the
> > scan was disabled too.
> >
> > This issue is old. It probably hurts less in conventional NUMA. But it's
> > very noticeable on tiered systems, where entire tmpfs workingsets can get
> > stuck on lower-bandwidth memory.
> >
>
> Eugh.
>
> Demotions don't care about mempolicy, so opting shmem out of NUMA
> balancing and mbind'ing on a tiered system is just full sadness.
Right, mbinding in tiered mode is a whole other ball of wax. I'm just
trying to make the default case work ;-)
> This is all just more evidence that demotion needs to be completely
> redone, it's creating a mess of undefined behavior for memory placement.
No argument from me.
> > Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> > thereby handle the fallback to task policy (-> preferred_node_policy()
> > has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> > already handles it this way, the scan-eligibility check was the outlier.
> >
> > This preserves Mel's intended fix: don't scan stuff the user explicitly
> > pinned. But allow default policy vmas to participate in balancing.
> >
> > Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> > Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> > Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Reviewed-by: Gregory Price <gourry@gourry.net>
Thanks! Sorry for making you feel bad.
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
2026-06-29 17:59 ` Gregory Price
@ 2026-06-29 18:33 ` David Hildenbrand (Arm)
2026-06-29 18:47 ` Johannes Weiner
2026-06-30 23:40 ` Balbir Singh
2 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-29 18:33 UTC (permalink / raw)
To: Johannes Weiner, Andrew Morton
Cc: Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Gregory Price, Ying Huang, Alistair Popple, linux-mm,
linux-kernel, Neha Gholkar
On 6/29/26 18:33, Johannes Weiner wrote:
> Neha reports that mapped shmem isn't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
>
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
>
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
>
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
>
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.
Sounds bad enough to warrant CC: stable?
>
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
>
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
>
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/mempolicy.c | 21 ++++++---------------
> 1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..bba65898aee1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  bool vma_policy_mof(struct vm_area_struct *vma)
>  {
>  	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	bool mof;
>
> -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> -		bool ret = false;
> -		pgoff_t ilx; /* ignored here */
> -
> -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> -		if (pol && (pol->flags & MPOL_F_MOF))
> -			ret = true;
> -		mpol_cond_put(pol);
> -
> -		return ret;
> -	}
Okay, we used the fallback of vma->vm_policy before (if vma->vm_ops->get_policy
was not available), which is what __get_vma_policy() does as well.
But if vma->vm_ops->get_policy now returns NULL, we fall back to get_task_policy().
Makes sense to me although this is a source of confusion for me.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 18:33 ` David Hildenbrand (Arm)
@ 2026-06-29 18:47 ` Johannes Weiner
2026-06-30 11:26 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 10+ messages in thread
From: Johannes Weiner @ 2026-06-29 18:47 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar
On Mon, Jun 29, 2026 at 08:33:32PM +0200, David Hildenbrand (Arm) wrote:
> On 6/29/26 18:33, Johannes Weiner wrote:
> > Neha reports that mapped shmem isn't considered for NUMA balancing,
> > noting convergence problems and bandwidth bottlenecking for cachelib
> > based workloads on tiered memory systems.
> >
> > Looking at the code and going through the git history, this doesn't
> > actually seem intentional:
> >
> > Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> > VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> > policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> > motivation was a real usecase: Oracle was pinning shared segments with
> > mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
> >
> > The handling of NULL from vm_ops->get_policy, however, treated "user
> > explicitly opted out" the same as "user never specified anything." For
> > VMAs whose shared policy is absent - the common case for shmem - the
> > scan was disabled too.
> >
> > This issue is old. It probably hurts less in conventional NUMA. But it's
> > very noticeable on tiered systems, where entire tmpfs workingsets can get
> > stuck on lower-bandwidth memory.
>
> Sounds bad enough to warrant CC: stable?
No objection from me. I was hesitant because it's old, and while these
are real workloads that see it, they are hardware/kernel validation
runs. OTOH it's a straightforward bug and should backport easily.
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > mm/mempolicy.c | 21 ++++++---------------
> > 1 file changed, 6 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 36699fabd3c2..bba65898aee1 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
> >  bool vma_policy_mof(struct vm_area_struct *vma)
> >  {
> >  	struct mempolicy *pol;
> > +	pgoff_t ilx;
> > +	bool mof;
> >
> > -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> > -		bool ret = false;
> > -		pgoff_t ilx; /* ignored here */
> > -
> > -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> > -		if (pol && (pol->flags & MPOL_F_MOF))
> > -			ret = true;
> > -		mpol_cond_put(pol);
> > -
> > -		return ret;
> > -	}
>
> Okay, we used the fallback of vma->vm_policy before (if vma->vm_ops->get_policy
> was not available), which is what __get_vma_policy() does as well.
>
> But if vma->vm_ops->get_policy now returns NULL, we fall back to get_task_policy().
Yep.
> Makes sense to me although this is a source of confusion for me.
How so? Is there anything I can improve in the changelog?
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Thanks David!
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 17:59 ` Gregory Price
2026-06-29 18:22 ` Johannes Weiner
@ 2026-06-30 11:20 ` Huang, Ying
2026-06-30 15:29 ` Gregory Price
1 sibling, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2026-06-30 11:20 UTC (permalink / raw)
To: Gregory Price
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar
Gregory Price <gourry@gourry.net> writes:
[snip]
> Demotions don't care about mempolicy, so opting shmem out of NUMA
> balancing and mbind'ing on a tiered system is just full sadness.
>
> This is all just more evidence that demotion needs to be completely
> redone, it's creating a mess of undefined behavior for memory placement.
It's hard to respect mempolicy during demotion in the current
implementation. Do you have any ideas on how to improve this?
---
Best Regards,
Huang, Ying
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 18:47 ` Johannes Weiner
@ 2026-06-30 11:26 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-30 11:26 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
linux-mm, linux-kernel, Neha Gholkar
>
>> Makes sense to me although this is a source of confusion for me.
>
> How so? Is there anything I can improve in the changelog?
Oh, it was just a comment in general around NUMA policies :)
--
Cheers,
David
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-30 11:20 ` Huang, Ying
@ 2026-06-30 15:29 ` Gregory Price
2026-07-01 11:03 ` Huang, Ying
0 siblings, 1 reply; 10+ messages in thread
From: Gregory Price @ 2026-06-30 15:29 UTC (permalink / raw)
To: Huang, Ying
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar
On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> [snip]
>
> > Demotions don't care about mempolicy, so opting shmem out of NUMA
> > balancing and mbind'ing on a tiered system is just full sadness.
> >
> > This is all just more evidence that demotion needs to be completely
> > redone, it's creating a mess of undefined behavior for memory placement.
>
> It's hard to respect mempolicy during demotion in the current
> implementation. Do you have any ideas on how to improve this?
>
I think it's feasible we could respect per-vma mempolicies, but not
per-task. That would at least make this particular interaction less
painful and mbind() would do what you'd expect. It is a bit racy,
but with MPOL_MF_MOVE_ALL the user can get what they actually want.
I think task-wide mempolicy is problematic and generally a bad idea
on tiered systems, maybe it's ok if we simply document task policies
are not respected on tiered systems?
~Gregory
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-29 16:33 [PATCH] mm: mempolicy: fix automatic numa balancing for shmem Johannes Weiner
2026-06-29 17:59 ` Gregory Price
2026-06-29 18:33 ` David Hildenbrand (Arm)
@ 2026-06-30 23:40 ` Balbir Singh
2 siblings, 0 replies; 10+ messages in thread
From: Balbir Singh @ 2026-06-30 23:40 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar
On Mon, Jun 29, 2026 at 12:33:37PM -0400, Johannes Weiner wrote:
> Neha reports that mapped shmem isn't considered for NUMA balancing,
> noting convergence problems and bandwidth bottlenecking for cachelib
> based workloads on tiered memory systems.
>
> Looking at the code and going through the git history, this doesn't
> actually seem intentional:
>
> Commit fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault
> VMAs") added a vma_policy_mof() gate to task_numa_work() so VMAs whose
> policy lacks MPOL_F_MOF are skipped from NUMA balancing scans. The
> motivation was a real usecase: Oracle was pinning shared segments with
> mbind(MPOL_BIND) so trapping faults was both expensive and pointless.
>
> The handling of NULL from vm_ops->get_policy, however, treated "user
> explicitly opted out" the same as "user never specified anything." For
> VMAs whose shared policy is absent - the common case for shmem - the
> scan was disabled too.
>
> This issue is old. It probably hurts less in conventional NUMA. But it's
> very noticeable on tiered systems, where entire tmpfs workingsets can get
> stuck on lower-bandwidth memory.
>
> Fix this by having vma_policy_mof() use __get_vma_policy() directly, and
> thereby handle the fallback to task policy (-> preferred_node_policy()
> has MPOL_F_MOF per default). Every other consumer of vm_ops->get_policy
> already handles it this way, the scan-eligibility check was the outlier.
>
> This preserves Mel's intended fix: don't scan stuff the user explicitly
> pinned. But allow default policy vmas to participate in balancing.
>
> Reported-by: Neha Gholkar <nehagholkar@gmail.com>
> Tested-by: Neha Gholkar <nehagholkar@gmail.com>
> Fixes: fc3147245d19 ("mm: numa: Limit NUMA scanning to migrate-on-fault VMAs")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/mempolicy.c | 21 ++++++---------------
> 1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36699fabd3c2..bba65898aee1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2057,24 +2057,15 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  bool vma_policy_mof(struct vm_area_struct *vma)
>  {
>  	struct mempolicy *pol;
> +	pgoff_t ilx;
> +	bool mof;
>
> -	if (vma->vm_ops && vma->vm_ops->get_policy) {
> -		bool ret = false;
> -		pgoff_t ilx; /* ignored here */
> -
> -		pol = vma->vm_ops->get_policy(vma, vma->vm_start, &ilx);
> -		if (pol && (pol->flags & MPOL_F_MOF))
> -			ret = true;
> -		mpol_cond_put(pol);
> -
> -		return ret;
> -	}
> -
> -	pol = vma->vm_policy;
> +	pol = __get_vma_policy(vma, vma->vm_start, &ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -
> -	return pol->flags & MPOL_F_MOF;
> +	mof = pol->flags & MPOL_F_MOF;
> +	mpol_cond_put(pol);
> +	return mof;
>  }
>
>  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> --
>
The change to use the fallback seems reasonable
Acked-by: Balbir Singh <balbirs@nvidia.com>
* Re: [PATCH] mm: mempolicy: fix automatic numa balancing for shmem
2026-06-30 15:29 ` Gregory Price
@ 2026-07-01 11:03 ` Huang, Ying
0 siblings, 0 replies; 10+ messages in thread
From: Huang, Ying @ 2026-07-01 11:03 UTC (permalink / raw)
To: Gregory Price
Cc: Johannes Weiner, Andrew Morton, David Hildenbrand, Zi Yan,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
Alistair Popple, linux-mm, linux-kernel, Neha Gholkar
Gregory Price <gourry@gourry.net> writes:
> On Tue, Jun 30, 2026 at 07:20:50PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> [snip]
>>
>> > Demotions don't care about mempolicy, so opting shmem out of NUMA
>> > balancing and mbind'ing on a tiered system is just full sadness.
>> >
>> > This is all just more evidence that demotion needs to be completely
>> > redone, it's creating a mess of undefined behavior for memory placement.
>>
>> It's hard to respect mempolicy during demotion in the current
>> implementation. Do you have any ideas on how to improve this?
>>
>
> I think it's feasible we could respect per-vma mempolicies, but not
> per-task. That would at least make this particular interaction less
> painful and mbind() would do what you'd expect. It is a bit racy,
> but with MPOL_MF_MOVE_ALL the user can get what they actually want.
Yes. Per-vma mempolicy support is possible.
> I think task-wide mempolicy is problematic and generally a bad idea
> on tiered systems, maybe it's ok if we simply document task policies
> are not respected on tiered systems?
Anyway, it's convenient to use numactl to manage mempolicy.
Is it possible to enable NUMA_BALANCING_MEMORY_TIERING for non-default
VMAs? If we don't enable NUMA_BALANCING_NORMAL, the overhead should be
OK because the page table entries are changed to PROT_NONE only for
pages on the slow tier.
Additionally, we may need to consider cpusets.
---
Best Regards,
Huang, Ying
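For context on the mode bits mentioned above: kernel.numa_balancing is a
bitmask in which NUMA_BALANCING_NORMAL is 1 and
NUMA_BALANCING_MEMORY_TIERING is 2. A minimal userspace sketch for
checking whether a system runs in tiering-only mode might look like the
following. The helper names are hypothetical, and reading
/proc/sys/kernel/numa_balancing assumes a kernel built with NUMA
balancing support.

```c
#include <stdbool.h>
#include <stdio.h>

/* Values mirror the kernel's numa_balancing sysctl bit definitions. */
#define NUMA_BALANCING_DISABLED        0x0
#define NUMA_BALANCING_NORMAL          0x1
#define NUMA_BALANCING_MEMORY_TIERING  0x2

/* Hypothetical helper: true if only the memory-tiering bit is set. */
static bool tiering_only(int mode)
{
	return (mode & NUMA_BALANCING_MEMORY_TIERING) &&
	       !(mode & NUMA_BALANCING_NORMAL);
}

/* Hypothetical helper: returns the mode, or -1 if it cannot be read. */
static int read_numa_balancing_mode(const char *path)
{
	FILE *f = fopen(path, "r");
	int mode;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &mode) != 1)
		mode = -1;
	fclose(f);
	return mode;
}
```

A caller would pass "/proc/sys/kernel/numa_balancing" to
read_numa_balancing_mode() and feed a non-negative result to
tiering_only(); a return of -1 means the sysctl was not readable.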