Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting

From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Garg, Shivank" <shivankg@amd.com>
Cc: linux-kernel@vger.kernel.org,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	David Hildenbrand <david@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>,
	kvm@vger.kernel.org
Subject: Re: [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting
Date: Sat, 6 Jun 2026 09:12:39 -0400
Message-ID: <20260606091121-mutt-send-email-mst@kernel.org>
In-Reply-To: <b680f308-2dfa-47a2-9ff4-05669259a09c@amd.com>

On Sat, Jun 06, 2026 at 06:32:04PM +0530, Garg, Shivank wrote:
> 
> 
> On 6/5/2026 8:25 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 05, 2026 at 06:31:51PM +0530, Garg, Shivank wrote:
> >>
> >>
> >> On 6/5/2026 5:16 AM, Michael S. Tsirkin wrote:
> >>>
> >>> On Thu, Jun 04, 2026 at 12:21:15AM +0530, Garg, Shivank wrote:
> >>>>
> >>>>
> >>>> On 6/3/2026 9:27 PM, Michael S. Tsirkin wrote:
> >>>>> kvm_gmem_get_policy() sets *ilx to the full page offset
> >>>>> (vm_pgoff + vma offset).  But get_vma_policy() adds the page
> >>>>> offset on top of *ilx, so the offset is counted twice.  This
> >>>>> causes NUMA interleaving to skip nodes: for order-0 pages the
> >>>>> effective index jumps by 2 for each consecutive page.
> >>>>>
> >>>>> The get_policy vm_op should return only a per-file bias in *ilx
> >>>>> (like shmem_get_policy does with inode->i_ino), letting
> >>>>> get_vma_policy() add the page-offset component.
> >>>>>
> >>>>> Fix by setting *ilx to inode->i_ino instead of the full page
> >>>>> offset.  The page offset is computed by get_vma_policy() in
> >>>>> mm/mempolicy.c. The full offset is still computed
> >>>>> in kvm_gmem_get_policy() for mpol_shared_policy_lookup().
> >>>>> shmem_get_policy() follows the same pattern.
> >>>>>
> >>>>> Found by Sashiko (sashiko.dev) AI code review.
> >>>>>
> >>>>> Fixes: ed1ffa810bd6 ("KVM: guest_memfd: Enforce NUMA mempolicy using shared policy")
> >>>>> Cc: Sean Christopherson <seanjc@google.com>
> >>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >>>>> Assisted-by: Claude:claude-opus-4-6
> >>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >>>>> ---
> >>>>>  virt/kvm/guest_memfd.c | 7 ++++---
> >>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >>>>> index 69c9d6d546b2..0bcf6fc08e2d 100644
> >>>>> --- a/virt/kvm/guest_memfd.c
> >>>>> +++ b/virt/kvm/guest_memfd.c
> >>>>> @@ -438,11 +438,12 @@ static int kvm_gmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpo
> >>>>>  }
> >>>>>
> >>>>>  static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> >>>>> -                                            unsigned long addr, pgoff_t *pgoff)
> >>>>> +                                            unsigned long addr, pgoff_t *ilx)
> >>>>>  {
> >>>>>         struct inode *inode = file_inode(vma->vm_file);
> >>>>> +       pgoff_t pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> >>>>>
> >>>>> -       *pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> >>>>> +       *ilx = inode->i_ino;
> >>>>>
> >>>>>         /*
> >>>>>          * Return the memory policy for this index, or NULL if none is set.
> >>>>> @@ -453,7 +454,7 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
> >>>>>          * can then replace NULL with the default memory policy instead of the
> >>>>>          * current task's memory policy.
> >>>>>          */
> >>>>> -       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, *pgoff);
> >>>>> +       return mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
> >>>>>  }
> >>>>>  #endif /* CONFIG_NUMA */
> >>>>>
> >>>>> --
> >>>>> MST
> >>>>>
> >>>>
> >>>> Thanks for fixing this. LGTM!
> >>>>
> >>>> Reviewed-by: Shivank Garg <shivankg@amd.com>
> >>>
> >>>
> >>> Can you actually test it, though?
> >>> I think another patch I sent in response to Sashiko
> >>> is also needed.
> >>
> >> Hi Michael,
> >>
> >> Yes, I tested this.
> >>
> >> I used kretprobes to read *ilx on each kvm_gmem_get_policy() call, while calling
> >> get_mempolicy(MPOL_F_ADDR) on consecutive offsets (0..7) of a guest_memfd mapping:
> >>
> >> BEFORE:
> >> page offset:  0   1   2   3   4   5   6   7
> >> *ilx:         0   1   2   3   4   5   6   7
> >>
> >> get_vma_policy() then adds the page offset on top again, so the effective index advances with a stride of 2.
> >>
> >> AFTER Fix:
> >> page offset:  0       1       2       3      ...  7
> >> *ilx:         128376  128376  128376  128376 ...  128376
> >>
> >> It now stores i_ino, so after get_vma_policy() the effective index advances by just 1 per page.
> >>
> >> It is hard to demonstrate a wrong allocation caused by the bug, because the
> >> allocation path does not use this index value; it uses NO_INTERLEAVE_INDEX.
> >>
> >> Tested-by: Shivank Garg <shivankg@amd.com>
> >>
> >> Thanks,
> >> Shivank
> >>
> > 
> > 
> > So for this to be useful at all
> > we do need the patch I sent in response to sashiko, right?
> > Mind trying out that one?
> > 
> 
> I could not find the other patch from you.
> Are you talking about this response
> https://lore.kernel.org/all/20260604034539-mutt-send-email-mst@kernel.org?
> 
> If you sent a separate patch to test elsewhere, please point me to it.

I could swear I sent it, but you are right. Inline, because untested:
-->

mm: filemap: pass interleave index through filemap_alloc_folio

filemap_alloc_folio_noprof() hardcodes NO_INTERLEAVE_INDEX when
calling folio_alloc_mpol_noprof() for NUMA policy-based allocations.
This causes MPOL_INTERLEAVE to fall back to the task's global
il_prev counter instead of using the file offset for deterministic
page placement.

The only current user passing a non-NULL policy is
__filemap_get_folio_mpol(), called by KVM guest_memfd.  The page
index is already available at that call site but was never threaded
down to the allocator.

Add a pgoff_t ilx parameter to filemap_alloc_folio_noprof() and
pass it through to folio_alloc_mpol_noprof().  Update
__filemap_get_folio_mpol() to forward its index argument, and all
other callers (which pass NULL policy and never hit the mpol path)
to pass 0.

Fixes: 7f3779a3ac3e ("mm/filemap: Add NUMA mempolicy support to filemap_alloc_folio()")
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index a02b62e0a8f3..efdec0ac1482 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -452,7 +452,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		masked_constraint_gfp = mapping_gfp_constraint(mapping, constraint_gfp);
 		masked_constraint_gfp |= __GFP_NOWARN;
 
-		folio = filemap_alloc_folio(masked_constraint_gfp, 0, NULL);
+		folio = filemap_alloc_folio(masked_constraint_gfp, 0, NULL, 0);
 		if (!folio)
 			break;
 
diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
index 0062b3a55781..148fa0bcc974 100644
--- a/fs/btrfs/verity.c
+++ b/fs/btrfs/verity.c
@@ -731,7 +731,7 @@ static struct page *btrfs_read_merkle_tree_page(struct inode *inode,
 	}
 
 	folio = filemap_alloc_folio(mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS),
-				    0, NULL);
+				    0, NULL, 0);
 	if (!folio)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 27ab7bd844ec..f4416b57f480 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -563,7 +563,7 @@ static void z_erofs_bind_cache(struct z_erofs_frontend *fe)
 			 * Allocate a managed folio for cached I/O, or it may be
 			 * then filled with a file-backed folio for in-place I/O
 			 */
-			newfolio = filemap_alloc_folio(gfp, 0, NULL);
+			newfolio = filemap_alloc_folio(gfp, 0, NULL, 0);
 			if (!newfolio)
 				continue;
 			newfolio->private = Z_EROFS_PREALLOCATED_FOLIO;
diff --git a/fs/f2fs/compress.c b/fs/f2fs/compress.c
index 881e76158b96..7494a94338e4 100644
--- a/fs/f2fs/compress.c
+++ b/fs/f2fs/compress.c
@@ -1954,7 +1954,7 @@ static void f2fs_cache_compressed_page(struct f2fs_sb_info *sbi,
 		return;
 	}
 
-	cfolio = filemap_alloc_folio(__GFP_NOWARN | __GFP_IO, 0, NULL);
+	cfolio = filemap_alloc_folio(__GFP_NOWARN | __GFP_IO, 0, NULL, 0);
 	if (!cfolio)
 		return;
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 31a848485ad9..e2aea0800815 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -652,10 +652,10 @@ static inline void *detach_page_private(struct page *page)
 
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order,
-		struct mempolicy *policy);
+		struct mempolicy *policy, pgoff_t ilx);
 #else
 static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order,
-		struct mempolicy *policy)
+		struct mempolicy *policy, pgoff_t ilx)
 {
 	return folio_alloc_noprof(gfp, order);
 }
@@ -666,7 +666,7 @@ static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int o
 
 static inline struct page *__page_cache_alloc(gfp_t gfp)
 {
-	return &filemap_alloc_folio(gfp, 0, NULL)->page;
+	return &filemap_alloc_folio(gfp, 0, NULL, 0)->page;
 }
 
 static inline gfp_t readahead_gfp_mask(struct address_space *x)
diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c..2fccd9afa4d4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -992,14 +992,14 @@ EXPORT_SYMBOL_GPL(filemap_add_folio);
 
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order,
-		struct mempolicy *policy)
+		struct mempolicy *policy, pgoff_t ilx)
 {
 	int n;
 	struct folio *folio;
 
 	if (policy)
 		return folio_alloc_mpol_noprof(gfp, order, policy,
-				NO_INTERLEAVE_INDEX, numa_node_id());
+				ilx, numa_node_id());
 
 	if (cpuset_do_page_mem_spread()) {
 		unsigned int cpuset_mems_cookie;
@@ -2009,7 +2009,7 @@ struct folio *__filemap_get_folio_mpol(struct address_space *mapping,
 			err = -ENOMEM;
 			if (order > min_order)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
-			folio = filemap_alloc_folio(alloc_gfp, order, policy);
+			folio = filemap_alloc_folio(alloc_gfp, order, policy, index);
 			if (!folio)
 				continue;
 
@@ -2609,7 +2609,7 @@ static int filemap_create_folio(struct kiocb *iocb, struct folio_batch *fbatch)
 	if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 		return -EAGAIN;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), min_order, NULL, 0);
 	if (!folio)
 		return -ENOMEM;
 	if (iocb->ki_flags & IOCB_DONTCACHE)
@@ -4067,7 +4067,7 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping), NULL);
+		folio = filemap_alloc_folio(gfp, mapping_min_folio_order(mapping), NULL, 0);
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		index = mapping_align_index(mapping, index);
diff --git a/mm/readahead.c b/mm/readahead.c
index 7b05082c89ea..c435aee43e07 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -186,7 +186,7 @@ static struct folio *ractl_alloc_folio(struct readahead_control *ractl,
 {
 	struct folio *folio;
 
-	folio = filemap_alloc_folio(gfp_mask, order, NULL);
+	folio = filemap_alloc_folio(gfp_mask, order, NULL, 0);
 	if (folio && ractl->dropbehind)
 		__folio_set_dropbehind(folio);

Thread overview: 10+ messages
2026-06-03 15:57 [PATCH] KVM: guest_memfd: fix NUMA interleave index double-counting Michael S. Tsirkin
2026-06-03 16:15 ` sashiko-bot
2026-06-04  7:53   ` Michael S. Tsirkin
2026-06-03 18:51 ` Garg, Shivank
2026-06-04 23:46   ` Michael S. Tsirkin
2026-06-05 13:01     ` Garg, Shivank
2026-06-05 14:55       ` Michael S. Tsirkin
2026-06-06 13:02         ` Garg, Shivank
2026-06-06 13:12           ` Michael S. Tsirkin [this message]
2026-06-05  9:26 ` David Hildenbrand (Arm)
