[PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
@ 2026-06-03 15:57 Michael S. Tsirkin
  2026-06-03 16:15 ` sashiko-bot
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Michael S. Tsirkin @ 2026-06-03 15:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Sean Christopherson, Paolo Bonzini, David Hildenbrand,
	Vlastimil Babka, Shivank Garg, kvm

kvm_gmem_get_policy() sets *ilx to the full page offset
(vm_pgoff + vma offset).  But get_vma_policy() adds the page
offset on top of *ilx, so the offset is counted twice.  This
causes NUMA interleaving to skip nodes: for order-0 pages the
effective index jumps by 2 for each consecutive page.

The get_policy vm_op should return only a per-file bias in *ilx
(like shmem_get_policy does with inode->i_ino), letting
get_vma_policy() add the page-offset component.

Fix by setting *ilx to inode->i_ino instead of the full page
offset.  The page offset is computed by get_vma_policy() in
mm/mempolicy.c. The full offset is still computed
in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
shmem_get_policy() follows the same pattern.

Found by Sashiko (sashiko.dev) AI code review.

Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 virt/kvm/guest_memfd.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 69c9d6d546b2..0bcf6fc08e2d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
 }
 
 static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
-					     unsigned long addr, pgoff_t *pgoff)
+					     unsigned long addr, pgoff_t *ilx)
 {
 	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
 
-	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
+	*ilx = inode->i_ino;
 
 	/*
 	 * Return the memory policy for this index, or NULL if none is set.
@@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
 	 * can then replace NULL with the default memory policy instead of the
 	 * current task's memory policy.
 	 */
-	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
+	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
 }
 #endif /* CONFIG_NUMA */
 
-- 
MST


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-03 15:57 [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting Michael S. Tsirkin
@ 2026-06-03 16:15 ` sashiko-bot
  2026-06-04  7:53   ` Michael S. Tsirkin
  2026-06-03 18:51 ` Garg, Shivank
  2026-06-05  9:26 ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-06-03 16:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] guest_memfd memory allocation ignores page offset for NUMA interleaving, falling back to the task's global interleave counter.
--

commit ce4f65d7ee89d8194749cc615f30619f101248db
Author: Michael S. Tsirkin <mst@redhat.com>

KVM: guest_memfd: fix NUMA interleave index double-counting

This commit fixes a bug where the NUMA interleave index was being
double-counted by setting *ilx to the inode number instead of the full page
offset in kvm_gmem_get_policy(). This allows get_vma_policy() to correctly
add the page offset without duplicating it.

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index bf9659a7b0f6d..65daef23c8390 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
>  }
>  
>  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> -					     unsigned long addr, pgoff_t *pgoff)
> +					     unsigned long addr, pgoff_t *ilx)
>  {
>  	struct inode *inode = file_inode(vma->vm_file);
> +	pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
>  
> -	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> +	*ilx = inode->i_ino;
>  
>  	/*
>  	 * Return the memory policy for this index, or NULL if none is set.
> @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>  	 * can then replace NULL with the default memory policy instead of the
>  	 * current task's memory policy.
>  	 */
> -	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> +	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
>  }

[Severity: High]
This isn't a bug introduced by this patch, but does guest_memfd memory
allocation ignore the page offset for NUMA interleaving and fall back to
the task's global interleave counter instead?

While this patch fixes the index calculation for the policy lookup, it seems
the index is completely dropped later during the actual memory allocation.

In virt/kvm/guest_memfd.c kvm_gmem_get_folio(), KVM calls into the generic
filemap allocator:

folio = __filemap_get_folio_mpol(inode->i_mapping, index,
                                 FGP_LOCK | FGP_CREAT,
                                 mapping_gfp_mask(inode->i_mapping), policy);

But the generic allocator hardcodes NO_INTERLEAVE_INDEX in mm/filemap.c
filemap_alloc_folio_noprof():

if (policy)
    return folio_alloc_mpol_noprof(gfp, order, policy,
                                   NO_INTERLEAVE_INDEX, numa_node_id());

For policies like MPOL_INTERLEAVE, using NO_INTERLEAVE_INDEX causes the
allocation to use the faulting task's global il_prev counter rather than the
file index-based deterministic interleaving.

Doesn't this defeat the purpose of the shared policy, causing interleaving
to be randomized by the chronological order of vCPU page faults rather than
deterministically spread based on the guest physical address?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com?part=1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-03 15:57 [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting Michael S. Tsirkin
  2026-06-03 16:15 ` sashiko-bot
@ 2026-06-03 18:51 ` Garg, Shivank
  2026-06-04 23:46   ` Michael S. Tsirkin
  2026-06-05  9:26 ` David Hildenbrand (Arm)
  2 siblings, 1 reply; 8+ messages in thread
From: Garg, Shivank @ 2026-06-03 18:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Sean Christopherson, Paolo Bonzini, David Hildenbrand,
	Vlastimil Babka, kvm



On 6/3/2026 9:27 PM, Michael S. Tsirkin wrote:
> kvm_gmem_get_policy() sets *ilx to the full page offset
> (vm_pgoff + vma offset).  But get_vma_policy() adds the page
> offset on top of *ilx, so the offset is counted twice.  This
> causes NUMA interleaving to skip nodes: for order-0 pages the
> effective index jumps by 2 for each consecutive page.
> 
> The get_policy vm_op should return only a per-file bias in *ilx
> (like shmem_get_policy does with inode->i_ino), letting
> get_vma_policy() add the page-offset component.
> 
> Fix by setting *ilx to inode->i_ino instead of the full page
> offset.  The page offset is computed by get_vma_policy() in
> mm/mempolicy.c. The full offset is still computed
> in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
> shmem_get_policy() follows the same pattern.
> 
> Found by Sashiko (sashiko.dev) AI code review.
> 
> Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  virt/kvm/guest_memfd.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b2..0bcf6fc08e2d 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
>  }
> 
>  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> -                                            unsigned long addr, pgoff_t *pgoff)
> +                                            unsigned long addr, pgoff_t *ilx)
>  {
>         struct inode *inode = file_inode(vma->vm_file);
> +       pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> 
> -       *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> +       *ilx = inode->i_ino;
> 
>         /*
>          * Return the memory policy for this index, or NULL if none is set.
> @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>          * can then replace NULL with the default memory policy instead of the
>          * current task's memory policy.
>          */
> -       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> +       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
>  }
>  #endif /* CONFIG_NUMA */
> 
> --
> MST
> 

Thanks for fixing this. LGTM!

Reviewed-by: Shivank Garg <shivankg@amd.com>

Best regards,
Shivank




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-03 16:15 ` sashiko-bot
@ 2026-06-04  7:53   ` Michael S. Tsirkin
  0 siblings, 0 replies; 8+ messages in thread
From: Michael S. Tsirkin @ 2026-06-04  7:53 UTC (permalink / raw)
  To: sashiko-reviews
  Cc: kvm, Sean Christopherson, Paolo Bonzini, David Hildenbrand,
	Vlastimil Babka

On Wed, Jun 03, 2026 at 04:15:25PM +0000, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> 
> Pre-existing issues:
> - [High] guest_memfd memory allocation ignores page offset for NUMA interleaving, falling back to the task's global interleave counter.
> --
> 
> commit ce4f65d7ee89d8194749cc615f30619f101248db
> Author: Michael S. Tsirkin <mst@redhat.com>
> 
> KVM: guest_memfd: fix NUMA interleave index double-counting
> 
> This commit fixes a bug where the NUMA interleave index was being
> double-counted by setting *ilx to the inode number instead of the full page
> offset in kvm_gmem_get_policy(). This allows get_vma_policy() to correctly
> add the page offset without duplicating it.
> 
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index bf9659a7b0f6d..65daef23c8390 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
> >  }
> >  
> >  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> > -					     unsigned long addr, pgoff_t *pgoff)
> > +					     unsigned long addr, pgoff_t *ilx)
> >  {
> >  	struct inode *inode = file_inode(vma->vm_file);
> > +	pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> >  
> > -	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > +	*ilx = inode->i_ino;
> >  
> >  	/*
> >  	 * Return the memory policy for this index, or NULL if none is set.
> > @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> >  	 * can then replace NULL with the default memory policy instead of the
> >  	 * current task's memory policy.
> >  	 */
> > -	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> > +	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
> >  }
> 
> [Severity: High]
> This isn't a bug introduced by this patch, but does guest_memfd memory
> allocation ignore the page offset for NUMA interleaving and fall back to
> the task's global interleave counter instead?
> 
> While this patch fixes the index calculation for the policy lookup, it seems
> the index is completely dropped later during the actual memory allocation.
> 
> In virt/kvm/guest_memfd.c kvm_gmem_get_folio(), KVM calls into the generic
> filemap allocator:
> 
> folio = __filemap_get_folio_mpol(inode->i_mapping, index,
>                                  FGP_LOCK | FGP_CREAT,
>                                  mapping_gfp_mask(inode->i_mapping), policy);
> 
> But the generic allocator hardcodes NO_INTERLEAVE_INDEX in mm/filemap.c
> filemap_alloc_folio_noprof():
> 
> if (policy)
>     return folio_alloc_mpol_noprof(gfp, order, policy,
>                                    NO_INTERLEAVE_INDEX, numa_node_id());
> 
> For policies like MPOL_INTERLEAVE, using NO_INTERLEAVE_INDEX causes the
> allocation to use the faulting task's global il_prev counter rather than the
> file index-based deterministic interleaving.
> 
> Doesn't this defeat the purpose of the shared policy, causing interleaving
> to be randomized by the chronological order of vCPU page faults rather than
> deterministically spread based on the guest physical address?

Does not seem to be caused by this patch.
And I've no idea whether changing this will break anything
users rely on - might be prudent to have userspace opt-in,
if we are changing this?

At least kvm is the only user passing a non NULL policy, so
fixing filemap_alloc_folio_noprof should not break anyone else.
Anyway, I'll leave this one to the kvm maintainers.

> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/0eff0a90667b900bee837d06b5db5025e1f304b5.1780501924.git.mst@redhat.com?part=1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-03 18:51 ` Garg, Shivank
@ 2026-06-04 23:46   ` Michael S. Tsirkin
  2026-06-05 13:01     ` Garg, Shivank
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2026-06-04 23:46 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: linux-kernel, Sean Christopherson, Paolo Bonzini,
	David Hildenbrand, Vlastimil Babka, kvm

On Thu, Jun 04, 2026 at 12:21:15AM +0530, Garg, Shivank wrote:
> 
> 
> On 6/3/2026 9:27 PM, Michael S. Tsirkin wrote:
> > kvm_gmem_get_policy() sets *ilx to the full page offset
> > (vm_pgoff + vma offset).  But get_vma_policy() adds the page
> > offset on top of *ilx, so the offset is counted twice.  This
> > causes NUMA interleaving to skip nodes: for order-0 pages the
> > effective index jumps by 2 for each consecutive page.
> > 
> > The get_policy vm_op should return only a per-file bias in *ilx
> > (like shmem_get_policy does with inode->i_ino), letting
> > get_vma_policy() add the page-offset component.
> > 
> > Fix by setting *ilx to inode->i_ino instead of the full page
> > offset.  The page offset is computed by get_vma_policy() in
> > mm/mempolicy.c. The full offset is still computed
> > in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
> > shmem_get_policy() follows the same pattern.
> > 
> > Found by Sashiko (sashiko.dev) AI code review.
> > 
> > Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
> > Cc: Sean Christopherson <seanjc@google.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  virt/kvm/guest_memfd.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 69c9d6d546b2..0bcf6fc08e2d 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
> >  }
> > 
> >  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> > -                                            unsigned long addr, pgoff_t *pgoff)
> > +                                            unsigned long addr, pgoff_t *ilx)
> >  {
> >         struct inode *inode = file_inode(vma->vm_file);
> > +       pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > 
> > -       *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > +       *ilx = inode->i_ino;
> > 
> >         /*
> >          * Return the memory policy for this index, or NULL if none is set.
> > @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> >          * can then replace NULL with the default memory policy instead of the
> >          * current task's memory policy.
> >          */
> > -       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> > +       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
> >  }
> >  #endif /* CONFIG_NUMA */
> > 
> > --
> > MST
> > 
> 
> Thanks for fixing this. LGTM!
> 
> Reviewed-by: Shivank Garg <shivankg@amd.com>


Can u actually test it though pls?
Because I think another patch I sent in response so Sashiko
is also needed.

> Best regards,
> Shivank
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-03 15:57 [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting Michael S. Tsirkin
  2026-06-03 16:15 ` sashiko-bot
  2026-06-03 18:51 ` Garg, Shivank
@ 2026-06-05  9:26 ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-05  9:26 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel
  Cc: Sean Christopherson, Paolo Bonzini, Vlastimil Babka, Shivank Garg,
	kvm

On 6/3/26 17:57, Michael S. Tsirkin wrote:
> kvm_gmem_get_policy() sets *ilx to the full page offset
> (vm_pgoff + vma offset).  But get_vma_policy() adds the page
> offset on top of *ilx, so the offset is counted twice.  This
> causes NUMA interleaving to skip nodes: for order-0 pages the
> effective index jumps by 2 for each consecutive page.
> 
> The get_policy vm_op should return only a per-file bias in *ilx
> (like shmem_get_policy does with inode->i_ino), letting
> get_vma_policy() add the page-offset component.
> 
> Fix by setting *ilx to inode->i_ino instead of the full page
> offset.  The page offset is computed by get_vma_policy() in
> mm/mempolicy.c. The full offset is still computed
> in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
> shmem_get_policy() follows the same pattern.
> 
> Found by Sashiko (sashiko.dev) AI code review.
> 
> Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  virt/kvm/guest_memfd.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b2..0bcf6fc08e2d 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
>  }
>  
>  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> -					     unsigned long addr, pgoff_t *pgoff)
> +					     unsigned long addr, pgoff_t *ilx)

That now matches the definition in struct vm_operations_struct.

>  {
>  	struct inode *inode = file_inode(vma->vm_file);
> +	pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
>  
> -	*pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> +	*ilx = inode->i_ino;
>  
>  	/*
>  	 * Return the memory policy for this index, or NULL if none is set.
> @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>  	 * can then replace NULL with the default memory policy instead of the
>  	 * current task's memory policy.
>  	 */
> -	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> +	return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
>  }
>  #endif /* CONFIG_NUMA */
>  

That now matches what shmem_get_policy() does logically.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-04 23:46   ` Michael S. Tsirkin
@ 2026-06-05 13:01     ` Garg, Shivank
  2026-06-05 14:55       ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Garg, Shivank @ 2026-06-05 13:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Sean Christopherson, Paolo Bonzini,
	David Hildenbrand, Vlastimil Babka, kvm



On 6/5/2026 5:16 AM, Michael S. Tsirkin wrote:
> [You don't often get email from mst@redhat.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> On Thu, Jun 04, 2026 at 12:21:15AM +0530, Garg, Shivank wrote:
>>
>>
>> On 6/3/2026 9:27 PM, Michael S. Tsirkin wrote:
>>> kvm_gmem_get_policy() sets *ilx to the full page offset
>>> (vm_pgoff + vma offset).  But get_vma_policy() adds the page
>>> offset on top of *ilx, so the offset is counted twice.  This
>>> causes NUMA interleaving to skip nodes: for order-0 pages the
>>> effective index jumps by 2 for each consecutive page.
>>>
>>> The get_policy vm_op should return only a per-file bias in *ilx
>>> (like shmem_get_policy does with inode->i_ino), letting
>>> get_vma_policy() add the page-offset component.
>>>
>>> Fix by setting *ilx to inode->i_ino instead of the full page
>>> offset.  The page offset is computed by get_vma_policy() in
>>> mm/mempolicy.c. The full offset is still computed
>>> in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
>>> shmem_get_policy() follows the same pattern.
>>>
>>> Found by Sashiko (sashiko.dev) AI code review.
>>>
>>> Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
>>> Cc: Sean Christopherson <seanjc@google.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Assisted-by: Claude:claude-opus-4-6
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>  virt/kvm/guest_memfd.c | 7 ++++---
>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>>> index 69c9d6d546b2..0bcf6fc08e2d 100644
>>> --- a/virt/kvm/guest_memfd.c
>>> +++ b/virt/kvm/guest_memfd.c
>>> @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
>>>  }
>>>
>>>  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>>> -                                            unsigned long addr, pgoff_t *pgoff)
>>> +                                            unsigned long addr, pgoff_t *ilx)
>>>  {
>>>         struct inode *inode = file_inode(vma->vm_file);
>>> +       pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
>>>
>>> -       *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
>>> +       *ilx = inode->i_ino;
>>>
>>>         /*
>>>          * Return the memory policy for this index, or NULL if none is set.
>>> @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
>>>          * can then replace NULL with the default memory policy instead of the
>>>          * current task's memory policy.
>>>          */
>>> -       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
>>> +       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
>>>  }
>>>  #endif /* CONFIG_NUMA */
>>>
>>> --
>>> MST
>>>
>>
>> Thanks for fixing this. LGTM!
>>
>> Reviewed-by: Shivank Garg <shivankg@amd.com>
> 
> 
> Can u actually test it though pls?
> Because I think another patch I sent in response so Sashiko
> is also needed.

Hi Michael,

Yes, I tested this.

I used kretprobes to read *ilx on each kvm_gmem_get_policy(), while calling
get_mempolicy(MPOL_F_ADDR) on consecutive offsets(0..7) of guest_memfd mapping:

BEFORE:
page offset:  0   1   2   3   4   5   6   7
*ilx:         0   1   2   3   4   5   6   7
  
get_vma_policy() again add the page offset on top. so, it will increase by stride 2.

AFTER Fix:
page offset:  0       1       2       3      ...  7
*ilx:         128376  128376  128376  128376 ...  128376

It store i_no, so after get_vma_policy(), it will increase by just 1.

It's hard to show any wrong allocation with the bug because this index value is not
used by allocation path, which uses NO_INTERLEAVE_INDEX.

Tested-by: Shivank Garg <shivankg@amd.com>

Thanks,
Shivank



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
  2026-06-05 13:01     ` Garg, Shivank
@ 2026-06-05 14:55       ` Michael S. Tsirkin
  0 siblings, 0 replies; 8+ messages in thread
From: Michael S. Tsirkin @ 2026-06-05 14:55 UTC (permalink / raw)
  To: Garg, Shivank
  Cc: linux-kernel, Sean Christopherson, Paolo Bonzini,
	David Hildenbrand, Vlastimil Babka, kvm

On Fri, Jun 05, 2026 at 06:31:51PM +0530, Garg, Shivank wrote:
> 
> 
> On 6/5/2026 5:16 AM, Michael S. Tsirkin wrote:
> > [You don't often get email from mst@redhat.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> > 
> > On Thu, Jun 04, 2026 at 12:21:15AM +0530, Garg, Shivank wrote:
> >>
> >>
> >> On 6/3/2026 9:27 PM, Michael S. Tsirkin wrote:
> >>> kvm_gmem_get_policy() sets *ilx to the full page offset
> >>> (vm_pgoff + vma offset).  But get_vma_policy() adds the page
> >>> offset on top of *ilx, so the offset is counted twice.  This
> >>> causes NUMA interleaving to skip nodes: for order-0 pages the
> >>> effective index jumps by 2 for each consecutive page.
> >>>
> >>> The get_policy vm_op should return only a per-file bias in *ilx
> >>> (like shmem_get_policy does with inode->i_ino), letting
> >>> get_vma_policy() add the page-offset component.
> >>>
> >>> Fix by setting *ilx to inode->i_ino instead of the full page
> >>> offset.  The page offset is computed by get_vma_policy() in
> >>> mm/mempolicy.c. The full offset is still computed
> >>> in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
> >>> shmem_get_policy() follows the same pattern.
> >>>
> >>> Found by Sashiko (sashiko.dev) AI code review.
> >>>
> >>> Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
> >>> Cc: Sean Christopherson <seanjc@google.com>
> >>> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>> Assisted-by: Claude:claude-opus-4-6
> >>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>> ---
> >>>  virt/kvm/guest_memfd.c | 7 ++++---
> >>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >>> index 69c9d6d546b2..0bcf6fc08e2d 100644
> >>> --- a/virt/kvm/guest_memfd.c
> >>> +++ b/virt/kvm/guest_memfd.c
> >>> @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
> >>>  }
> >>>
> >>>  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> >>> -                                            unsigned long addr, pgoff_t *pgoff)
> >>> +                                            unsigned long addr, pgoff_t *ilx)
> >>>  {
> >>>         struct inode *inode = file_inode(vma->vm_file);
> >>> +       pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> >>>
> >>> -       *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> >>> +       *ilx = inode->i_ino;
> >>>
> >>>         /*
> >>>          * Return the memory policy for this index, or NULL if none is set.
> >>> @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> >>>          * can then replace NULL with the default memory policy instead of the
> >>>          * current task's memory policy.
> >>>          */
> >>> -       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> >>> +       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
> >>>  }
> >>>  #endif /* CONFIG_NUMA */
> >>>
> >>> --
> >>> MST
> >>>
> >>
> >> Thanks for fixing this. LGTM!
> >>
> >> Reviewed-by: Shivank Garg <shivankg@amd.com>
> > 
> > 
> > Can u actually test it though pls?
> > Because I think another patch I sent in response so Sashiko
> > is also needed.
> 
> Hi Michael,
> 
> Yes, I tested this.
> 
> I used kretprobes to read *ilx on each kvm_gmem_get_policy(), while calling
> get_mempolicy(MPOL_F_ADDR) on consecutive offsets(0..7) of guest_memfd mapping:
> 
> BEFORE:
> page offset:  0   1   2   3   4   5   6   7
> *ilx:         0   1   2   3   4   5   6   7
>   
> get_vma_policy() again add the page offset on top. so, it will increase by stride 2.
> 
> AFTER Fix:
> page offset:  0       1       2       3      ...  7
> *ilx:         128376  128376  128376  128376 ...  128376
> 
> It store i_no, so after get_vma_policy(), it will increase by just 1.
> 
> It's hard to show any wrong allocation with the bug because this index value is not
> used by allocation path, which uses NO_INTERLEAVE_INDEX.
> 
> Tested-by: Shivank Garg <shivankg@amd.com>
> 
> Thanks,
> Shivank
>


So for this to be useful at all
we do need the patch I sent in response to sashiko, right?
Mind trying out that one?

-- 
MST 


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-06-05 14:55 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-03 15:57 [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting Michael S. Tsirkin
2026-06-03 16:15 ` sashiko-bot
2026-06-04  7:53   ` Michael S. Tsirkin
2026-06-03 18:51 ` Garg, Shivank
2026-06-04 23:46   ` Michael S. Tsirkin
2026-06-05 13:01     ` Garg, Shivank
2026-06-05 14:55       ` Michael S. Tsirkin
2026-06-05  9:26 ` David Hildenbrand (Arm)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.