From: Hyunwoo Kim <imv4bel@gmail.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Michael Roth <michael.roth@amd.com>,
pbonzini@redhat.com, tglx@kernel.org, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, kvm@vger.kernel.org, imv4bel@gmail.com
Subject: Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
Date: Fri, 12 Jun 2026 02:07:56 +0900
Message-ID: <airrbMQafcuxoVkg@v4bel>
In-Reply-To: <airS1G23atiuWdTl@google.com>
On Thu, Jun 11, 2026 at 08:23:00AM -0700, Sean Christopherson wrote:
> On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > On Thu, Jun 11, 2026 at 05:47:52AM -0700, Sean Christopherson wrote:
> > > On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > > > On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> > > > > On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > > > > > ---
> > > > > > arch/x86/kvm/svm/sev.c | 6 +++++-
> > > > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > > > > index 6c6a6d663e29..8fee6ec529f9 100644
> > > > > > --- a/arch/x86/kvm/svm/sev.c
> > > > > > +++ b/arch/x86/kvm/svm/sev.c
> > > > > > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > > > > >
> > > > > > rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> > > > > > if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > > > > > - pfn, rc))
> > > > > > + pfn, rc)) {
> > > > > > + /* Still assigned to the guest; pin and leak rather than freeing. */
> > > > > > + folio_get(page_folio(pfn_to_page(pfn)));
> > > > > > + snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> > > > > > goto next_pfn;
> > > > > > + }
> > > > >
> > > > > This roughly aligns with what would happen if snp_page_reclaim() fails
> > > > > in sev_gmem_post_populate(), while the guest is being initialized via
> > > > > KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> > > > >
> > > > > However, in kvm_gmem_populate(), we still free the page. Maybe, to
> > > > > address both cases, we should just add a parameter to snp_leak_pages()
> > > > > to tell it to take an extra ref and use that in both of these paths.
> > > > >
> > > > > Or we can just do the direct folio_get() in both cases, the above
> > > > > formalizes the handling convention a little better though IMO.
> > > >
> > > > If I understand correctly, an extra ref alone still seems to leave the
> > > > LRU corruption that sashiko flagged:
> > > >
> > > > https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/
> > > >
> > > > A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
> > > > on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
> > > > same union as folio->lru, so leaking the page overwrites the folio's LRU
> > > > pointers. Both paths deal with a gmem folio, so the same applies.
> > > >
> > > > To handle this properly, the folio would need to be taken off the LRU
> > > > before leaking, with something like folio_isolate_lru(), but that is
> > > > mm-internal and does not look usable from KVM. How should we proceed?
> > > > Please let me know if I am missing something.
> > >
> > > I'm inclined to do nothing. rmp_make_shared() should only fail in this case if
> > > there's a fatal bug somewhere, no? Either that or do BUG_ON(), because at some
> > > point these types of errors are simply unrecoverable.
> >
> > A guest can make a gmem page a VMSA via AP creation,
>
> Ugh, the bane of my existence. Can we kill off that feature yet? I'm only half
> joking. Not even half.
>
> > and if that gfn is then hole-punched, a page that is still assigned to the
> > guest is returned to the host in sev_gmem_invalidate(), which looked like it
> > could lead to a host RMP PF, so I sent the patch.
>
> Yeah, I suspect you're right. But leaking the page doesn't fix the underlying
> problem, which is that it's possible to free a page that's being used as a VMSA.
>
> We can't simply pin the page, because IIUC ->free_folio() is called when the page
> is removed from the filemap, not when the folio/page is freed back to the allocator.
>
> I.e. we can't do this.
>
> diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
> index 74fb15551e83..ea70c7ade152 100644
> --- arch/x86/kvm/svm/sev.c
> +++ arch/x86/kvm/svm/sev.c
> @@ -3510,6 +3510,9 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>
> svm = to_svm(vcpu);
>
> + if (svm->sev_es.snp_has_guest_vmsa)
> + kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
> +
> /*
> * If it's an SNP guest, then the VMSA was marked in the RMP table as
> * a guest-owned page. Transition the page to hypervisor state before
> @@ -4027,6 +4030,13 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
> if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, &page, NULL))
> return;
>
> + /*
> + * Drop the reference to the previous VMSA page (acquired above) if the
> + * guest is changing its VMSA.
> + */
> + if (svm->sev_es.snp_has_guest_vmsa)
> + kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
> +
> /*
> * From this point forward, the VMSA will always be a guest-mapped page
> * rather than the initial one allocated by KVM in svm->sev_es.vmsa. In
> @@ -4043,13 +4053,6 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
>
> /* Mark the vCPU as runnable */
> kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
> -
> - /*
> - * gmem pages aren't currently migratable, but if this ever changes
> - * then care should be taken to ensure svm->sev_es.vmsa is pinned
> - * through some other means.
> - */
> - kvm_release_page_clean(page);
> }
>
> static int sev_snp_ap_creation(struct vcpu_svm *svm)
>
> The right way to handle this is to treat the VMSA "mapping" like an MMU mapping.
> I.e. this is fundamentally the same mess we have to solve in order to not pin
> pages that are mapped into L2 via vmcs01/vmcb02 for nVMX/nSVM.
>
> Something like this, sans the actual handling of the request. The simple way
> to handle the request would be to invalidate control.vmsa_pa if snp_has_guest_vmsa
> is true, and then redo the mapping part of sev_snp_init_protected_guest_state().

Understood. This looks like a fairly large change, at least for me, so
would you be able to handle the patch?
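
To make sure the request-based approach described above is being read
correctly, here is a rough user-space model of it. It is only a sketch, not
kernel code: every model_* name, the backing table, and MODEL_INVALID_PA are
made up for illustration. MODEL_REQ_VMSA_RELOAD stands in for
KVM_REQ_VMSA_PAGE_RELOAD, and vmsa_pa / has_guest_vmsa stand in for
control.vmsa_pa / snp_has_guest_vmsa. The assumption is that the request is
posted in the invalidate phase, before the page is freed, and that the vCPU
drops its cached VMSA address and redoes the lookup before it runs again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_GFNS          8
#define MODEL_INVALID_PA       UINT64_MAX
#define MODEL_REQ_VMSA_RELOAD  (1u << 0) /* stands in for KVM_REQ_VMSA_PAGE_RELOAD */

/* Hypothetical gfn -> physical address table; 0 means "no backing page". */
static uint64_t model_backing[MODEL_NR_GFNS];

struct model_vcpu {
	unsigned int requests;
	bool has_guest_vmsa;	/* stands in for snp_has_guest_vmsa */
	uint64_t vmsa_gfn;
	uint64_t vmsa_pa;	/* stands in for control.vmsa_pa */
};

struct model_vm {
	bool is_snp;
	struct model_vcpu *vcpus;
	size_t nr_vcpus;
};

/* Resolve the VMSA gfn again; fails if the gfn no longer has a backing page. */
static bool model_map_vmsa(struct model_vcpu *vcpu)
{
	uint64_t pa = 0;

	if (vcpu->vmsa_gfn < MODEL_NR_GFNS)
		pa = model_backing[vcpu->vmsa_gfn];

	vcpu->vmsa_pa = pa ? pa : MODEL_INVALID_PA;
	return pa != 0;
}

/* Invalidate phase first (post the request), then the free phase (drop the backing). */
static void model_punch_hole(struct model_vm *vm, uint64_t start, uint64_t end)
{
	assert(start <= end && end <= MODEL_NR_GFNS);

	if (vm->is_snp)
		for (size_t i = 0; i < vm->nr_vcpus; i++)
			vm->vcpus[i].requests |= MODEL_REQ_VMSA_RELOAD;

	for (uint64_t gfn = start; gfn < end; gfn++)
		model_backing[gfn] = 0;
}

/* Service a pending reload before "entering the guest"; true if the vCPU may run. */
static bool model_vcpu_pre_run(struct model_vcpu *vcpu)
{
	if (vcpu->requests & MODEL_REQ_VMSA_RELOAD) {
		vcpu->requests &= ~MODEL_REQ_VMSA_RELOAD;
		if (vcpu->has_guest_vmsa) {
			vcpu->vmsa_pa = MODEL_INVALID_PA;
			model_map_vmsa(vcpu);
		}
	}
	return vcpu->vmsa_pa != MODEL_INVALID_PA;
}
```

In this model, punching a hole over the VMSA gfn leaves the cached address
stale only until the next model_vcpu_pre_run(), which then refuses to run the
vCPU because the lookup fails. What the real handler should do in that case
is not modelled here.
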
>
> We could put the kvm_x86_call(gmem_invalidate_range)(kvm, range) call in
> kvm_unmap_gfn_range() to avoid the extra arch hook, but I would prefer to have an
> explicit connection between the invalidate() phase and the free() phase.
>
> Have I mentioned how much I hate AP creation?
>
> ---
> arch/x86/include/asm/kvm-x86-ops.h | 3 ++-
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/svm/sev.c | 11 ++++++++++-
> arch/x86/kvm/svm/svm.c | 5 ++++-
> arch/x86/kvm/svm/svm.h | 13 ++-----------
> arch/x86/kvm/x86.c | 9 +++++++--
> include/linux/kvm_host.h | 3 ++-
> virt/kvm/guest_memfd.c | 17 +++++------------
> 8 files changed, 34 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 83dc5086138b..ea308250ef47 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -147,7 +147,8 @@ KVM_X86_OP_OPTIONAL(get_untagged_addr)
> KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
> KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> KVM_X86_OP_OPTIONAL_RET0(gmem_max_mapping_level)
> -KVM_X86_OP_OPTIONAL(gmem_invalidate)
> +KVM_X86_OP_OPTIONAL(gmem_invalidate_range)
> +KVM_X86_OP_OPTIONAL(gmem_free_folio)
> #endif
>
> #undef KVM_X86_OP
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3886b536c8a5..0a5aec1701bd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2009,7 +2009,8 @@ struct kvm_x86_ops {
> gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
> void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> - void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
> + void (*gmem_invalidate_range)(struct kvm *kvm, struct kvm_gfn_range *range);
> + void (*gmem_free_folio)(struct folio *folio);
> int (*gmem_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn, bool is_private);
> };
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 74fb15551e83..411480a90e36 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -5115,9 +5115,18 @@ int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
>
> return 0;
> }
> +void sev_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range)
> +{
> + if (!sev_snp_guest(kvm))
> + return;
> +
> + kvm_make_all_cpus_request(kvm, KVM_REQ_VMSA_PAGE_RELOAD);
> +}
>
> -void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> +void sev_gmem_free_folio(struct folio *folio)
> {
> + kvm_pfn_t start = page_to_pfn(folio_page(folio, 0));
> + kvm_pfn_t end = start + (1ul << folio_order(folio));
> kvm_pfn_t pfn;
>
> if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 526e0fdcd16b..907be66d6f2a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5460,9 +5460,12 @@ struct kvm_x86_ops svm_x86_ops __initdata = {
> .vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
> .alloc_apic_backing_page = svm_alloc_apic_backing_page,
>
> +#ifdef CONFIG_KVM_AMD_SEV
> .gmem_prepare = sev_gmem_prepare,
> - .gmem_invalidate = sev_gmem_invalidate,
> + .gmem_invalidate_range = sev_gmem_invalidate_range,
> + .gmem_free_folio = sev_gmem_free_folio,
> .gmem_max_mapping_level = sev_gmem_max_mapping_level,
> +#endif
> };
>
> /*
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 716be21fba33..5129ab1a84d7 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -1008,7 +1008,8 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
> extern unsigned int max_sev_asid;
> void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
> int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> -void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> +void sev_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +void sev_gmem_free_folio(struct folio *folio);
> int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private);
> struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu);
> void sev_free_decrypted_vmsa(struct kvm_vcpu *vcpu, struct vmcb_save_area *vmsa);
> @@ -1034,16 +1035,6 @@ static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
> static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
> #define max_sev_asid 0
> static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
> -static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
> -{
> - return 0;
> -}
> -static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
> -static inline int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
> -{
> - return 0;
> -}
> -
> static inline struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu)
> {
> return NULL;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cf122b8c3210..a734ddf59cf9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14134,9 +14134,14 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
> #endif
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> +void kvm_arch_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> - kvm_x86_call(gmem_invalidate)(start, end);
> + kvm_x86_call(gmem_invalidate_range)(kvm, range);
> +}
> +
> +void kvm_arch_gmem_free_folio(struct folio *folio)
> +{
> + kvm_x86_call(gmem_free_folio)(folio);
> }
> #endif
> #endif
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 27498e990dff..7de9b07870b7 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2602,7 +2602,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> #endif
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> +void kvm_arch_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range);
> +void kvm_arch_gmem_free_folio(struct folio *folio);
> #endif
>
> #ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 86690683b2fe..8ec5041934db 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -185,6 +185,10 @@ static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
> }
>
> flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
> +
> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> + kvm_arch_gmem_invalidate_range(kvm, &gfn_range);
> +#endif
> }
>
> if (flush)
> @@ -523,23 +527,12 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
> return MF_DELAYED;
> }
>
> -#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> -static void kvm_gmem_free_folio(struct folio *folio)
> -{
> - struct page *page = folio_page(folio, 0);
> - kvm_pfn_t pfn = page_to_pfn(page);
> - int order = folio_order(folio);
> -
> - kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> -}
> -#endif
> -
> static const struct address_space_operations kvm_gmem_aops = {
> .dirty_folio = noop_dirty_folio,
> .migrate_folio = kvm_gmem_migrate_folio,
> .error_remove_folio = kvm_gmem_error_folio,
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
> - .free_folio = kvm_gmem_free_folio,
> + .free_folio = kvm_arch_gmem_free_folio,
> #endif
> };
>
>
> base-commit: de3a35be92d2391ece4bf3143ef2887192625fd0
> --
>
> > Other sites such as snp_page_reclaim() leak the page on this failure rather
> > than freeing it.
> >
> > If you think this isn't a real problem, leaving it as is seems fine to me.
> > I don't see a good place to put a BUG_ON.
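
As an aside on the LRU concern quoted earlier in this thread (buddy_list
sharing a union with lru), the aliasing can be illustrated in user space.
This is a sketch under stated assumptions: struct node and struct fake_page
are simplified stand-ins, not the kernel's list_head or struct page, and
node_add() only mimics list_add(). It shows that linking an entry onto a
second list through the aliased field, without first taking it off the first
list, leaves the first list unable to walk back to its head.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration only; not the kernel's definitions. */
struct node { struct node *next, *prev; };

struct fake_page {
	union {
		struct node lru;	/* linkage while the folio sits on an LRU */
		struct node buddy_list;	/* same storage, used by a different list */
	};
};

static void node_init(struct node *head)
{
	head->next = head->prev = head;
}

/* Insert n right after head, mimicking list_add(). */
static void node_add(struct node *n, struct node *head)
{
	assert(n && head);
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Walk at most max_steps entries; return the count, or -1 if head is never reached again. */
static int walk_len(const struct node *head, int max_steps)
{
	const struct node *pos = head->next;
	int n = 0;

	while (pos != head) {
		if (++n > max_steps)
			return -1;
		pos = pos->next;
	}
	return n;
}

/*
 * Put one page on an "LRU", then add it to a "leaked" list through the
 * aliased field without isolating it first. Returns the LRU walk result.
 */
static int leak_without_isolating(void)
{
	struct node lru_head, leaked_head;
	struct fake_page page;

	node_init(&lru_head);
	node_init(&leaked_head);

	node_add(&page.lru, &lru_head);
	assert(walk_len(&lru_head, 8) == 1);

	/* Overwrites page.lru.next/prev because both fields share storage. */
	node_add(&page.buddy_list, &leaked_head);
	return walk_len(&lru_head, 8);
}
```

After the second node_add(), the LRU head still points at the page, but the
page's next pointer now leads into the leaked list, so the walk from the LRU
head never terminates on its own. The bounded walk reports that as -1.
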