From: Michael Roth <michael.roth@amd.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: <kvm@vger.kernel.org>, <linux-coco@lists.linux.dev>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<thomas.lendacky@amd.com>, <pbonzini@redhat.com>,
<seanjc@google.com>, <vbabka@suse.cz>, <ashish.kalra@amd.com>,
<liam.merwick@oracle.com>, <david@redhat.com>,
<vannapurve@google.com>, <ackerleytng@google.com>, <aik@amd.com>,
<ira.weiny@intel.com>
Subject: Re: [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory
Date: Fri, 21 Nov 2025 07:01:44 -0600 [thread overview]
Message-ID: <20251121130144.u7eeaafonhcqf2bd@amd.com> (raw)
In-Reply-To: <aR7bVKzM7rH/FSVh@yzhao56-desk.sh.intel.com>
On Thu, Nov 20, 2025 at 05:11:48PM +0800, Yan Zhao wrote:
> On Thu, Nov 13, 2025 at 05:07:59PM -0600, Michael Roth wrote:
> > Currently the post-populate callbacks handle copying source pages into
> > private GPA ranges backed by guest_memfd, where kvm_gmem_populate()
> > acquires the filemap invalidate lock, then calls a post-populate
> > callback which may issue a get_user_pages() on the source pages prior to
> > copying them into the private GPA (e.g. TDX).
> >
> > This will not be compatible with in-place conversion, where the
> > userspace page fault path will attempt to acquire filemap invalidate
> > lock while holding the mm->mmap_lock, leading to a potential ABBA
> > deadlock[1].
> >
> > Address this by hoisting the GUP above the filemap invalidate lock so
> > that these page faults path can be taken early, prior to acquiring the
> > filemap invalidate lock.
> >
> > It's not currently clear whether this issue is reachable with the
> > current implementation of guest_memfd, which doesn't support in-place
> > conversion, however it does provide a consistent mechanism to provide
> > stable source/target PFNs to callbacks rather than punting to
> > vendor-specific code, which allows for more commonality across
> > architectures, which may be worthwhile even without in-place conversion.
> >
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Michael Roth <michael.roth@amd.com>
> > ---
> > arch/x86/kvm/svm/sev.c | 40 ++++++++++++++++++++++++++------------
> > arch/x86/kvm/vmx/tdx.c | 21 +++++---------------
> > include/linux/kvm_host.h | 3 ++-
> > virt/kvm/guest_memfd.c | 42 ++++++++++++++++++++++++++++++++++------
> > 4 files changed, 71 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 0835c664fbfd..d0ac710697a2 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -2260,7 +2260,8 @@ struct sev_gmem_populate_args {
> > };
> >
> > static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn,
> > - void __user *src, int order, void *opaque)
> > + struct page **src_pages, loff_t src_offset,
> > + int order, void *opaque)
> > {
> > struct sev_gmem_populate_args *sev_populate_args = opaque;
> > struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
> > @@ -2268,7 +2269,7 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> > int npages = (1 << order);
> > gfn_t gfn;
> >
> > - if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src))
> > + if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src_pages))
> > return -EINVAL;
> >
> > for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) {
> > @@ -2284,14 +2285,21 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> > goto err;
> > }
> >
> > - if (src) {
> > - void *vaddr = kmap_local_pfn(pfn + i);
> > + if (src_pages) {
> > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i]));
> > + void *dst_vaddr = kmap_local_pfn(pfn + i);
> >
> > - if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) {
> > - ret = -EFAULT;
> > - goto err;
> > + memcpy(dst_vaddr, src_vaddr + src_offset, PAGE_SIZE - src_offset);
> > + kunmap_local(src_vaddr);
> > +
> > + if (src_offset) {
> > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1]));
> > +
> > + memcpy(dst_vaddr + PAGE_SIZE - src_offset, src_vaddr, src_offset);
> > + kunmap_local(src_vaddr);
> IIUC, src_offset is the src's offset from the first page. e.g.,
> src could be 0x7fea82684100, with src_offset=0x100, while npages could be 512.
>
> Then it looks like the two memcpy() calls here only work when npages == 1 ?
src_offset ends up being the offset into the pair of src pages that we
are using to fully populate a single dest page with each iteration. So
if we start at src_offset, read a page worth of data, then we are now at
src_offset in the next src page and the loop continues that way even if
npages > 1.
If src_offset is 0 we never have to bother with straddling 2 src pages so
the 2nd memcpy is skipped on every iteration.
That's the intent at least. Is there a flaw in the code/reasoning that I
missed?
>
> > }
> > - kunmap_local(vaddr);
> > +
> > + kunmap_local(dst_vaddr);
> > }
> >
> > ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K,
> > @@ -2331,12 +2339,20 @@ static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pf
> > if (!snp_page_reclaim(kvm, pfn + i) &&
> > sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
> > sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
> > - void *vaddr = kmap_local_pfn(pfn + i);
> > + void *src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i]));
> > + void *dst_vaddr = kmap_local_pfn(pfn + i);
> >
> > - if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE))
> > - pr_debug("Failed to write CPUID page back to userspace\n");
> > + memcpy(src_vaddr + src_offset, dst_vaddr, PAGE_SIZE - src_offset);
> > + kunmap_local(src_vaddr);
> > +
> > + if (src_offset) {
> > + src_vaddr = kmap_local_pfn(page_to_pfn(src_pages[i + 1]));
> > +
> > + memcpy(src_vaddr, dst_vaddr + PAGE_SIZE - src_offset, src_offset);
> > + kunmap_local(src_vaddr);
> > + }
> >
> > - kunmap_local(vaddr);
> > + kunmap_local(dst_vaddr);
> > }
> >
> > /* pfn + i is hypervisor-owned now, so skip below cleanup for it. */
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 57ed101a1181..dd5439ec1473 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -3115,37 +3115,26 @@ struct tdx_gmem_post_populate_arg {
> > };
> >
> > static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > - void __user *src, int order, void *_arg)
> > + struct page **src_pages, loff_t src_offset,
> > + int order, void *_arg)
> > {
> > struct tdx_gmem_post_populate_arg *arg = _arg;
> > struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > u64 err, entry, level_state;
> > gpa_t gpa = gfn_to_gpa(gfn);
> > - struct page *src_page;
> > int ret, i;
> >
> > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> > return -EIO;
> >
> > - if (KVM_BUG_ON(!PAGE_ALIGNED(src), kvm))
> > + /* Source should be page-aligned, in which case src_offset will be 0. */
> > + if (KVM_BUG_ON(src_offset))
> if (KVM_BUG_ON(src_offset, kvm))
>
> > return -EINVAL;
> >
> > - /*
> > - * Get the source page if it has been faulted in. Return failure if the
> > - * source page has been swapped out or unmapped in primary memory.
> > - */
> > - ret = get_user_pages_fast((unsigned long)src, 1, 0, &src_page);
> > - if (ret < 0)
> > - return ret;
> > - if (ret != 1)
> > - return -ENOMEM;
> > -
> > - kvm_tdx->page_add_src = src_page;
> > + kvm_tdx->page_add_src = src_pages[i];
> src_pages[0] ? i is not initialized.
Sorry, I switched on TDX options for compile testing but I must have done a
sloppy job confirming it actually built. I'll re-test push these and squash
in the fixes in the github tree.
>
> Should there also be a KVM_BUG_ON(order > 0, kvm) ?
Seems reasonable, but I'm not sure this is the right patch. Maybe I
could squash it into the preceeding documentation patch so as to not
give the impression this patch changes those expectations in any way.
>
> > ret = kvm_tdp_mmu_map_private_pfn(arg->vcpu, gfn, pfn);
> > kvm_tdx->page_add_src = NULL;
> >
> > - put_page(src_page);
> > -
> > if (ret || !(arg->flags & KVM_TDX_MEASURE_MEMORY_REGION))
> > return ret;
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d93f75b05ae2..7e9d2403c61f 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2581,7 +2581,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> > * Returns the number of pages that were populated.
> > */
> > typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> > - void __user *src, int order, void *opaque);
> > + struct page **src_pages, loff_t src_offset,
> > + int order, void *opaque);
> >
> > long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> > kvm_gmem_populate_cb post_populate, void *opaque);
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index 9160379df378..e9ac3fd4fd8f 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -814,14 +814,17 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> >
> > #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
> > +
> > +#define GMEM_GUP_NPAGES (1UL << PMD_ORDER)
> Limiting GMEM_GUP_NPAGES to PMD_ORDER may only work when the max_order of a huge
> folio is 2MB. What if the max_order returned from __kvm_gmem_get_pfn() is 1GB
> when src_pages[] can only hold up to 512 pages?
This was necessarily chosen in prep for hugepages, but more about my
unease at letting userspace GUP arbitrarilly large ranges. PMD_ORDER
happens to align with 2MB hugepages while seeming like a reasonable
batching value so that's why I chose it.
Even with 1GB support, I wasn't really planning to increase it. SNP
doesn't really make use of RMP sizes >2MB, and it sounds like TDX
handles promotion in a completely different path. So atm I'm leaning
toward just letting GMEM_GUP_NPAGES be the cap for the max page size we
support for kvm_gmem_populate() path and not bothering to change it
until a solid use-case arises.
>
> Increasing GMEM_GUP_NPAGES to (1UL << PUD_ORDER) is probabaly not a good idea.
>
> Given both TDX/SNP map at 4KB granularity, why not just invoke post_populate()
> per 4KB while removing the max_order from post_populate() parameters, as done
> in Sean's sketch patch [1]?
That's an option too, but SNP can make use of 2MB pages in the
post-populate callback so I don't want to shut the door on that option
just yet if it's not too much of a pain to work in. Given the guest BIOS
lives primarily in 1 or 2 of these 2MB regions the benefits might be
worthwhile, and SNP doesn't have a post-post-populate promotion path
like TDX (at least, not one that would help much for guest boot times)
Thanks,
Mike
>
> Then the WARN_ON() in kvm_gmem_populate() can be removed, which would be easily
> triggered by TDX when max_order > 0 && npages == 1:
>
> WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
> (npages - i) < (1 << max_order));
>
>
> [1] https://lore.kernel.org/all/aHEwT4X0RcfZzHlt@google.com/
>
> > long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> > kvm_gmem_populate_cb post_populate, void *opaque)
> > {
> > struct kvm_memory_slot *slot;
> > - void __user *p;
> > -
> > + struct page **src_pages;
> > int ret = 0, max_order;
> > - long i;
> > + loff_t src_offset = 0;
> > + long i, src_npages;
> >
> > lockdep_assert_held(&kvm->slots_lock);
> >
> > @@ -836,9 +839,28 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > if (!file)
> > return -EFAULT;
> >
> > + npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> > + npages = min_t(ulong, npages, GMEM_GUP_NPAGES);
> > +
> > + if (src) {
> > + src_npages = IS_ALIGNED((unsigned long)src, PAGE_SIZE) ? npages : npages + 1;
> > +
> > + src_pages = kmalloc_array(src_npages, sizeof(struct page *), GFP_KERNEL);
> > + if (!src_pages)
> > + return -ENOMEM;
> > +
> > + ret = get_user_pages_fast((unsigned long)src, src_npages, 0, src_pages);
> > + if (ret < 0)
> > + return ret;
> > +
> > + if (ret != src_npages)
> > + return -ENOMEM;
> > +
> > + src_offset = (loff_t)(src - PTR_ALIGN_DOWN(src, PAGE_SIZE));
> > + }
> > +
> > filemap_invalidate_lock(file->f_mapping);
> >
> > - npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
> > for (i = 0; i < npages; i += (1 << max_order)) {
> > struct folio *folio;
> > gfn_t gfn = start_gfn + i;
> > @@ -869,8 +891,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > max_order--;
> > }
> >
> > - p = src ? src + i * PAGE_SIZE : NULL;
> > - ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
> > + ret = post_populate(kvm, gfn, pfn, src ? &src_pages[i] : NULL,
> > + src_offset, max_order, opaque);
> Why src_offset is not 0 starting from the 2nd page?
>
> > if (!ret)
> > folio_mark_uptodate(folio);
> >
> > @@ -882,6 +904,14 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> >
> > filemap_invalidate_unlock(file->f_mapping);
> >
> > + if (src) {
> > + long j;
> > +
> > + for (j = 0; j < src_npages; j++)
> > + put_page(src_pages[j]);
> > + kfree(src_pages);
> > + }
> > +
> > return ret && !i ? ret : i;
> > }
> > EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_populate);
> > --
> > 2.25.1
> >
next prev parent reply other threads:[~2025-11-21 13:02 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-13 23:07 [PATCH RFC 0/3] KVM: guest_memfd: Rework preparation/population flows in prep for in-place conversion Michael Roth
2025-11-13 23:07 ` [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking Michael Roth
2025-11-17 23:58 ` Ackerley Tng
2025-11-19 0:18 ` Michael Roth
2025-11-20 9:12 ` Yan Zhao
2025-11-21 12:43 ` Michael Roth
2025-11-25 3:13 ` Yan Zhao
2025-12-01 1:35 ` Vishal Annapurve
2025-12-01 2:51 ` Yan Zhao
2025-12-01 19:33 ` Vishal Annapurve
2025-12-02 9:16 ` Yan Zhao
2025-12-01 23:44 ` Michael Roth
2025-12-02 9:17 ` Yan Zhao
2025-12-03 13:47 ` Michael Roth
2025-12-05 3:54 ` Yan Zhao
2025-11-13 23:07 ` [PATCH 2/3] KVM: TDX: Document alignment requirements for KVM_TDX_INIT_MEM_REGION Michael Roth
2025-11-13 23:07 ` [PATCH 3/3] KVM: guest_memfd: GUP source pages prior to populating guest memory Michael Roth
2025-11-20 9:11 ` Yan Zhao
2025-11-21 13:01 ` Michael Roth [this message]
2025-11-24 9:31 ` Yan Zhao
2025-11-24 15:53 ` Ira Weiny
2025-11-25 3:12 ` Yan Zhao
2025-12-01 1:47 ` Vishal Annapurve
2025-12-01 21:03 ` Michael Roth
2025-12-01 22:13 ` Michael Roth
2025-12-03 2:46 ` Yan Zhao
2025-12-03 14:26 ` Michael Roth
2025-12-03 20:59 ` FirstName LastName
2025-12-03 23:12 ` Michael Roth
2025-12-08 11:07 ` Vlastimil Babka
2025-12-10 1:30 ` Sean Christopherson
2025-12-03 21:01 ` Ira Weiny
2025-12-03 23:07 ` Michael Roth
2025-12-05 3:38 ` Yan Zhao
2025-12-01 1:44 ` Vishal Annapurve
2025-12-03 23:48 ` Michael Roth
2025-11-20 19:34 ` Ira Weiny
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251121130144.u7eeaafonhcqf2bd@amd.com \
--to=michael.roth@amd.com \
--cc=ackerleytng@google.com \
--cc=aik@amd.com \
--cc=ashish.kalra@amd.com \
--cc=david@redhat.com \
--cc=ira.weiny@intel.com \
--cc=kvm@vger.kernel.org \
--cc=liam.merwick@oracle.com \
--cc=linux-coco@lists.linux.dev \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=thomas.lendacky@amd.com \
--cc=vannapurve@google.com \
--cc=vbabka@suse.cz \
--cc=yan.y.zhao@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.