Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host

From: Sean Christopherson <seanjc@google.com>
To: Hyunwoo Kim <imv4bel@gmail.com>
Cc: Michael Roth <michael.roth@amd.com>,
	pbonzini@redhat.com, tglx@kernel.org,  mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	 hpa@zytor.com, kvm@vger.kernel.org
Subject: Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
Date: Thu, 11 Jun 2026 08:23:00 -0700
Message-ID: <airS1G23atiuWdTl@google.com>
In-Reply-To: <airAthcKDZeDfzXS@v4bel>

On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> On Thu, Jun 11, 2026 at 05:47:52AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > > On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> > > > On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > > > > ---
> > > > >  arch/x86/kvm/svm/sev.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > > > index 6c6a6d663e29..8fee6ec529f9 100644
> > > > > --- a/arch/x86/kvm/svm/sev.c
> > > > > +++ b/arch/x86/kvm/svm/sev.c
> > > > > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > > > > 
> > > > >                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> > > > >                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > > > > -                             pfn, rc))
> > > > > +                             pfn, rc)) {
> > > > > +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> > > > > +                       folio_get(page_folio(pfn_to_page(pfn)));
> > > > > +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> > > > >                         goto next_pfn;
> > > > > +               }
> > > > 
> > > > This roughly aligns with what would happen if snp_page_reclaim() fails
> > > > in sev_gmem_post_populate(), while the guest is being initialized via
> > > > KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> > > > 
> > > > However, in kvm_gmem_populate(), we still free the page. Maybe, to
> > > > address both cases, we should just add a parameter to snp_leak_pages()
> > > > to tell it to take an extra ref and use that in both of these paths.
> > > > 
> > > > Or we can just do the direct folio_get() in both cases, the above
> > > > formalizes the handling convention a little better though IMO.
> > > 
> > > If I understand correctly, an extra ref alone still seems to leave the
> > > LRU corruption that sashiko flagged:
> > > 
> > > https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/
> > > 
> > > A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
> > > on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
> > > same union as folio->lru, so leaking the page overwrites the folio's LRU
> > > pointers. Both paths deal with a gmem folio, so the same applies.
> > > 
> > > To handle this properly, the folio would need to be taken off the LRU
> > > before leaking, with something like folio_isolate_lru(), but that is
> > > mm-internal and does not look usable from KVM. How should we proceed?
> > > Please let me know if I am missing something.
> > 
> > I'm inclined to do nothing.  rmp_make_shared() should only fail in this case if
> > there's a fatal bug somewhere, no?  Either that or do BUG_ON(), because at some
> > point these types of errors are simply unrecoverable.
> 
> A guest can make a gmem page a VMSA via AP creation,

Ugh, the bane of my existence.  Can we kill off that feature yet?  I'm only half
joking.  Not even half.

> and if that gfn is then hole-punched, a page that is still assigned to the
> guest is returned to the host in sev_gmem_invalidate(), which looked like it
> could lead to a host RMP PF, so I sent the patch.

Yeah, I suspect you're right.  But leaking the page doesn't fix the underlying
problem, which is that it's possible to free a page that's being used as a VMSA.

We can't simply pin the page, because IIUC ->free_folio() is called when the page
is removed from the filemap, not when the folio/page is freed back to the allocator.

I.e. we can't do this.

diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
index 74fb15551e83..ea70c7ade152 100644
--- arch/x86/kvm/svm/sev.c
+++ arch/x86/kvm/svm/sev.c
@@ -3510,6 +3510,9 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
 
        svm = to_svm(vcpu);
 
+       if (svm->sev_es.snp_has_guest_vmsa)
+               kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
+
        /*
         * If it's an SNP guest, then the VMSA was marked in the RMP table as
         * a guest-owned page. Transition the page to hypervisor state before
@@ -4027,6 +4030,13 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
        if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, &page, NULL))
                return;
 
+       /*
+        * Drop the reference to the previous VMSA page (acquired above) if the
+        * guest is changing its VMSA.
+        */
+       if (svm->sev_es.snp_has_guest_vmsa)
+               kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
+
        /*
         * From this point forward, the VMSA will always be a guest-mapped page
         * rather than the initial one allocated by KVM in svm->sev_es.vmsa. In
@@ -4043,13 +4053,6 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
 
        /* Mark the vCPU as runnable */
        kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
-
-       /*
-        * gmem pages aren't currently migratable, but if this ever changes
-        * then care should be taken to ensure svm->sev_es.vmsa is pinned
-        * through some other means.
-        */
-       kvm_release_page_clean(page);
 }
 
 static int sev_snp_ap_creation(struct vcpu_svm *svm)
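
To make the ->free_folio() timing concrete, here is a toy userspace model, not
kernel code.  Every name in it (toy_folio, toy_remove_from_filemap(), etc.) is
made up for illustration, and it assumes, per the IIUC above, that the callback
runs when the folio leaves the filemap rather than when the last reference is
dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a folio: a refcount plus a few state flags. */
struct toy_folio {
	int refcount;
	bool in_filemap;
	bool returned_to_host;	/* set by the free_folio-style callback */
	bool freed;		/* set when the last reference is dropped */
};

/* Models the arch hook that hands the page back to the host. */
static void toy_free_folio_cb(struct toy_folio *f)
{
	f->returned_to_host = true;
}

/* Drop one reference; the folio is only "freed" when the count hits zero. */
static void toy_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->freed = true;
}

/*
 * Models truncation/hole punch: the callback runs on removal from the
 * filemap, regardless of how many other references (pins) exist.
 */
static void toy_remove_from_filemap(struct toy_folio *f)
{
	f->in_filemap = false;
	toy_free_folio_cb(f);
	toy_put(f);		/* drop the filemap's own reference */
}

/* Scenario: pin the folio (the hypothetical "VMSA pin"), then punch a hole. */
static struct toy_folio toy_hole_punch_with_pin(void)
{
	struct toy_folio f = { .refcount = 1, .in_filemap = true };

	f.refcount++;			/* extra reference held for the VMSA */
	toy_remove_from_filemap(&f);
	return f;
}
```

toy_hole_punch_with_pin() returns a folio with returned_to_host set while
refcount is still 1 and freed is still false.  I.e. in this model the pin keeps
the memory alive, but does nothing to stop the arch hook from running.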

The right way to handle this is to treat the VMSA "mapping" like an MMU mapping.
I.e. this is fundamentally the same mess we have to solve in order to not pin
pages that are mapped into L2 via vmcs01/vmcb02 for nVMX/nSVM.

Something like this, sans the actual handling of the request.  The simple way
to handle the request would be to invalidate control.vmsa_pa if snp_has_guest_vmsa
is true, and then redo the mapping part of sev_snp_init_protected_guest_state().
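
A rough userspace sketch of that request flow, purely as a model.  All names
(toy_vm, toy_vcpu, toy_service_request(), etc.) are hypothetical, the gfn=>pa
array stands in for guest_memfd, and what a vCPU should do when the re-lookup
finds no page is an assumption here: the model just reports the vCPU as not
runnable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_PA	UINT64_MAX
#define TOY_NR_GFNS	8

struct toy_vm {
	uint64_t gfn_to_pa[TOY_NR_GFNS];	/* TOY_INVALID_PA == no backing page */
};

struct toy_vcpu {
	bool has_guest_vmsa;	/* models snp_has_guest_vmsa */
	uint64_t vmsa_gfn;
	uint64_t vmsa_pa;	/* models control.vmsa_pa */
	bool reload_pending;	/* models a pending reload request */
};

/* Back every gfn with a fake page at (gfn << 12). */
static void toy_vm_init(struct toy_vm *vm)
{
	for (uint64_t gfn = 0; gfn < TOY_NR_GFNS; gfn++)
		vm->gfn_to_pa[gfn] = gfn << 12;
}

/* The "mapping part": resolve the VMSA gfn to a physical address. */
static void toy_map_vmsa(const struct toy_vm *vm, struct toy_vcpu *vcpu)
{
	assert(vcpu->vmsa_gfn < TOY_NR_GFNS);
	vcpu->vmsa_pa = vm->gfn_to_pa[vcpu->vmsa_gfn];
}

/* Invalidate phase: flag every vCPU, then drop the backing for [start, end). */
static void toy_invalidate_range(struct toy_vm *vm, struct toy_vcpu *vcpus,
				 int nr_vcpus, uint64_t start, uint64_t end)
{
	assert(start <= end && end <= TOY_NR_GFNS);

	for (int i = 0; i < nr_vcpus; i++)
		vcpus[i].reload_pending = true;

	for (uint64_t gfn = start; gfn < end; gfn++)
		vm->gfn_to_pa[gfn] = TOY_INVALID_PA;
}

/*
 * Service the request before "entering the guest": invalidate the cached
 * address and redo the lookup.  Returns true if the vCPU may run.
 */
static bool toy_service_request(const struct toy_vm *vm, struct toy_vcpu *vcpu)
{
	if (vcpu->reload_pending) {
		vcpu->reload_pending = false;
		if (vcpu->has_guest_vmsa) {
			vcpu->vmsa_pa = TOY_INVALID_PA;
			toy_map_vmsa(vm, vcpu);
		}
	}
	return !vcpu->has_guest_vmsa || vcpu->vmsa_pa != TOY_INVALID_PA;
}
```

The point of the model is only that the vCPU never keeps using a stale address
across an invalidation: a vCPU whose VMSA gfn was punched ends up with an
invalid address, while one whose gfn is untouched simply re-resolves to the
same page.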

We could put the kvm_x86_call(gmem_invalidate_range)(kvm, range) call in
kvm_unmap_gfn_range() to avoid the extra arch hook, but I would prefer to have an
explicit connection between the invalidate() phase and the free() phase.

Have I mentioned how much I hate AP creation?

---
 arch/x86/include/asm/kvm-x86-ops.h |  3 ++-
 arch/x86/include/asm/kvm_host.h    |  3 ++-
 arch/x86/kvm/svm/sev.c             | 11 ++++++++++-
 arch/x86/kvm/svm/svm.c             |  5 ++++-
 arch/x86/kvm/svm/svm.h             | 13 ++-----------
 arch/x86/kvm/x86.c                 |  9 +++++++--
 include/linux/kvm_host.h           |  3 ++-
 virt/kvm/guest_memfd.c             | 17 +++++------------
 8 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 83dc5086138b..ea308250ef47 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -147,7 +147,8 @@ KVM_X86_OP_OPTIONAL(get_untagged_addr)
 KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 KVM_X86_OP_OPTIONAL_RET0(gmem_max_mapping_level)
-KVM_X86_OP_OPTIONAL(gmem_invalidate)
+KVM_X86_OP_OPTIONAL(gmem_invalidate_range)
+KVM_X86_OP_OPTIONAL(gmem_free_folio)
 #endif
 
 #undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3886b536c8a5..0a5aec1701bd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2009,7 +2009,8 @@ struct kvm_x86_ops {
 	gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
-	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
+	void (*gmem_invalidate_range)(struct kvm *kvm, struct kvm_gfn_range *range);
+	void (*gmem_free_folio)(struct folio *folio);
 	int (*gmem_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn, bool is_private);
 };
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 74fb15551e83..411480a90e36 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -5115,9 +5115,18 @@ int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
 
 	return 0;
 }
+void sev_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	if (!sev_snp_guest(kvm))
+		return;
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_VMSA_PAGE_RELOAD);
+}
 
-void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+void sev_gmem_free_folio(struct folio *folio)
 {
+	kvm_pfn_t start = page_to_pfn(folio_page(folio, 0));
+	kvm_pfn_t end = start + (1ul << folio_order(folio));
 	kvm_pfn_t pfn;
 
 	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 526e0fdcd16b..907be66d6f2a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5460,9 +5460,12 @@ struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 
+#ifdef CONFIG_KVM_AMD_SEV
 	.gmem_prepare = sev_gmem_prepare,
-	.gmem_invalidate = sev_gmem_invalidate,
+	.gmem_invalidate_range = sev_gmem_invalidate_range,
+	.gmem_free_folio = sev_gmem_free_folio,
 	.gmem_max_mapping_level = sev_gmem_max_mapping_level,
+#endif
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 716be21fba33..5129ab1a84d7 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -1008,7 +1008,8 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
 void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
-void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+void sev_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range);
+void sev_gmem_free_folio(struct folio *folio);
 int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private);
 struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu);
 void sev_free_decrypted_vmsa(struct kvm_vcpu *vcpu, struct vmcb_save_area *vmsa);
@@ -1034,16 +1035,6 @@ static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
 static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
-static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
-{
-	return 0;
-}
-static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
-static inline int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
-{
-	return 0;
-}
-
 static inline struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu)
 {
 	return NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf122b8c3210..a734ddf59cf9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14134,9 +14134,14 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
 #endif
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+void kvm_arch_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	kvm_x86_call(gmem_invalidate)(start, end);
+	kvm_x86_call(gmem_invalidate_range)(kvm, range);
+}
+
+void kvm_arch_gmem_free_folio(struct folio *folio)
+{
+	kvm_x86_call(gmem_free_folio)(folio);
 }
 #endif
 #endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..7de9b07870b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2602,7 +2602,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
 #endif
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+void kvm_arch_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range);
+void kvm_arch_gmem_free_folio(struct folio *folio);
 #endif
 
 #ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 86690683b2fe..8ec5041934db 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -185,6 +185,10 @@ static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
 		}
 
 		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
+		kvm_arch_gmem_invalidate_range(kvm, &gfn_range);
+#endif
 	}
 
 	if (flush)
@@ -523,23 +527,12 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }
 
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-static void kvm_gmem_free_folio(struct folio *folio)
-{
-	struct page *page = folio_page(folio, 0);
-	kvm_pfn_t pfn = page_to_pfn(page);
-	int order = folio_order(folio);
-
-	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
-}
-#endif
-
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-	.free_folio = kvm_gmem_free_folio,
+	.free_folio = kvm_arch_gmem_free_folio,
 #endif
 };
 

base-commit: de3a35be92d2391ece4bf3143ef2887192625fd0
-- 

> Other sites such as snp_page_reclaim() leak the page on this failure rather
> than freeing it.
> 
> If you think this isn't a real problem, leaving it as is seems fine to me.
> I don't see a good place to put a BUG_ON.
