Kernel KVM virtualization development
* [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
@ 2026-06-10 16:10 Hyunwoo Kim
  2026-06-10 16:26 ` sashiko-bot
  2026-06-10 22:16 ` Michael Roth
  0 siblings, 2 replies; 10+ messages in thread
From: Hyunwoo Kim @ 2026-06-10 16:10 UTC (permalink / raw)
  To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	michael.roth
  Cc: kvm, imv4bel

sev_gmem_invalidate() is called when guest_memfd frees a gmem page.
For each PFN that is still assigned to the guest in the RMP table, it
transitions the page back to hypervisor-owned via rmp_make_shared()
before the page is returned to the host.

A guest-assigned page can reach this path while still private,
because the free path does not transition it beforehand and
sev_gmem_invalidate() is the only place that does. A gmem page used
as a vCPU's VMSA after SEV-SNP AP creation is one such case. When
rmp_make_shared() fails, the RMP entry remains guest-owned and the
host cannot use the page because of RMP protection, so it must not be
returned to the host. The existing code only issues WARN_ONCE() and
continues to the next PFN, returning the page to the host allocator.

Leak the page instead of freeing it, as kvm_rmp_make_shared(),
snp_page_reclaim() and sev_free_vcpu() already do when a transition
back to shared fails. snp_leak_pages() does not take a reference of
its own, and on this path the page is freed right after the hook
returns, so take a reference with folio_get() first to keep the page
from being freed.

Fixes: 8eb01900b018 ("KVM: SEV: Implement gmem hook for invalidating private pages")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
---
 arch/x86/kvm/svm/sev.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 6c6a6d663e29..8fee6ec529f9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
 
 		rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
 		if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
-			      pfn, rc))
+			      pfn, rc)) {
+			/* Still assigned to the guest; pin and leak rather than freeing. */
+			folio_get(page_folio(pfn_to_page(pfn)));
+			snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
 			goto next_pfn;
+		}
 
 		/*
 		 * SEV-ES avoids host/guest cache coherency issues through
-- 
2.43.0


* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-10 16:10 [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host Hyunwoo Kim
@ 2026-06-10 16:26 ` sashiko-bot
  2026-06-10 18:25   ` Sean Christopherson
  2026-06-10 22:16 ` Michael Roth
  1 sibling, 1 reply; 10+ messages in thread
From: sashiko-bot @ 2026-06-10 16:26 UTC (permalink / raw)
  To: Hyunwoo Kim; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Calling snp_leak_pages() on a pinned guest_memfd folio corrupts the system-wide LRU list by overwriting the folio's LRU pointers.
- [High] Unconditional PAGE_SIZE-stepping PFN iteration on large folios results in duplicate folio_get() increments on the same folio and failure to track leaked pages.
--

commit 5d574de6d0d362139fa648b3657e2f24c06fd5ee
Author: Hyunwoo Kim <imv4bel@gmail.com>

KVM: SEV: Don't return a still-assigned gmem page to the host

This commit addresses an issue where guest_memfd frees a page that is
still assigned to the guest. To prevent returning it to the host, the
page is pinned and leaked.

> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index f59c65abe3cfa..dbde6dce79b76 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
[ ... ]
> @@ -5036,8 +5036,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
>  
>  		rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
>  		if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> -			      pfn, rc))
> +			      pfn, rc)) {
> +			/* Still assigned to the guest; pin and leak rather than freeing. */
> +			folio_get(page_folio(pfn_to_page(pfn)));
> +			snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);

[Severity: Critical]
Does calling snp_leak_pages() on a pinned guest_memfd folio corrupt the
system-wide LRU list?

Memory allocated by guest_memfd is added to the page cache, which
unconditionally places the memory on the global unevictable LRU list. When
guest_memfd memory is freed, the folio has been removed from the page cache
but is still actively linked in the global LRU list until its reference
count drops to zero.

By calling folio_get() here, the reference count is elevated, keeping the
folio on the LRU list. Then, snp_leak_pages() unconditionally links the
page into a tracking list via list_add_tail(&page->buddy_list, ...).

Because buddy_list and lru share the exact same union inside struct page
and struct folio, modifying buddy_list silently overwrites the lru.next and
lru.prev pointers of the folio, severing the global LRU linked list.
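
The aliasing described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: struct toy_page, list_is_consistent() and lru_consistent_after_leak() are hypothetical names, and the list helpers are simplified stand-ins for the kernel's list API. The one property taken from the review is that lru and buddy_list occupy the same storage.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical page model: both list nodes share one piece of storage. */
struct toy_page {
	union {
		struct list_head lru;
		struct list_head buddy_list;
	};
};

/*
 * Walk at most max_nodes entries and verify next/prev symmetry. The walk
 * is bounded because a corrupted list may never lead back to head.
 */
static bool list_is_consistent(const struct list_head *head, int max_nodes)
{
	const struct list_head *pos = head;
	int i;

	for (i = 0; i <= max_nodes; i++) {
		if (pos->next->prev != pos)
			return false;
		pos = pos->next;
		if (pos == head)
			return true;
	}
	return false;
}

/* Build a three-entry "LRU", optionally "leak" the middle page, re-check. */
bool lru_consistent_after_leak(bool do_leak)
{
	struct list_head lru_list, leaked_list;
	struct toy_page a, b, c;

	list_init(&lru_list);
	list_init(&leaked_list);
	list_add_tail(&a.lru, &lru_list);
	list_add_tail(&b.lru, &lru_list);
	list_add_tail(&c.lru, &lru_list);
	assert(list_is_consistent(&lru_list, 3));

	/* Linking b through buddy_list rewrites b.lru.next and b.lru.prev. */
	if (do_leak)
		list_add_tail(&b.buddy_list, &leaked_list);

	return list_is_consistent(&lru_list, 3);
}
```

In this model, lru_consistent_after_leak(false) reports an intact list, while lru_consistent_after_leak(true) reports a broken one, because page a still points at b but b now points into the leak list.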

[Severity: High]
Does this code also double-account references and fail to track the leaked
pages when processing large folios?

When guest_memfd allocates a large 2MB PMD-size folio, but the RMP table
has it mapped as 4K entries, use_2m_update will be false and the loop
unconditionally steps by 1 PFN.

Because page_folio(pfn_to_page(pfn)) returns the head folio for any subpage,
folio_get() will be redundantly called up to 512 times on the exact same
head folio.

Furthermore, passing use_2m_update ? PTRS_PER_PMD : 1 (which is 1) to
snp_leak_pages() results in the internal check compound_nr(page) <= npages
evaluating to 512 <= 1 (false). This causes the entire compound page to be
completely skipped and omitted from the snp_leaked_pages_list.
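
The arithmetic behind this concern can be sketched in userspace. The model below is hypothetical (toy_invalidate_2m_folio() and struct toy_result are not kernel names) and rests on stated assumptions: one 2MB folio of 512 pages, rmp_make_shared() failing for every PFN visited, and a leak-list check that behaves as described above, linking a compound head only when its page count is <= npages.

```c
#include <stdbool.h>

#define TOY_PTRS_PER_PMD 512u	/* 4K pages per 2MB folio on x86-64 */

struct toy_result {
	unsigned int head_refs;	/* extra references taken on the head folio */
	unsigned int tracked;	/* times the folio was linked into the leak list */
};

/*
 * Model the failure path of the loop for a single 2MB folio, assuming
 * every RMP update fails. The step size mirrors the patch: one 2M update,
 * or 512 separate 4K updates.
 */
struct toy_result toy_invalidate_2m_folio(bool use_2m_update)
{
	unsigned int npages = use_2m_update ? TOY_PTRS_PER_PMD : 1;
	struct toy_result res = { 0, 0 };
	unsigned int pfn;

	for (pfn = 0; pfn < TOY_PTRS_PER_PMD; pfn += npages) {
		/* page_folio() resolves every subpage to the same head folio. */
		res.head_refs++;

		/* Assumed check: a compound page is linked only if it fits in npages. */
		if (TOY_PTRS_PER_PMD <= npages)
			res.tracked++;
	}
	return res;
}
```

Under these assumptions the 4K case takes 512 references on the head folio and never links it into the leak list, while the 2M case takes one reference and links it once.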

>  			goto next_pfn;
> +		}
>  
>  		/*
>  		 * SEV-ES avoids host/guest cache coherency issues through

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/aimMWzAf5b3luM0b@v4bel?part=1

* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-10 16:26 ` sashiko-bot
@ 2026-06-10 18:25   ` Sean Christopherson
  0 siblings, 0 replies; 10+ messages in thread
From: Sean Christopherson @ 2026-06-10 18:25 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Hyunwoo Kim, kvm

On Wed, Jun 10, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index f59c65abe3cfa..dbde6dce79b76 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> [ ... ]
> > @@ -5036,8 +5036,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> >  
> >  		rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> >  		if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > -			      pfn, rc))
> > +			      pfn, rc)) {
> > +			/* Still assigned to the guest; pin and leak rather than freeing. */
> > +			folio_get(page_folio(pfn_to_page(pfn)));
> > +			snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> 
> [Severity: Critical]
> Does calling snp_leak_pages() on a pinned guest_memfd folio corrupt the
> system-wide LRU list?
> 
> Memory allocated by guest_memfd is added to the page cache, which
> unconditionally places the memory on the global unevictable LRU list. When
> guest_memfd memory is freed, the folio has been removed from the page cache
> but is still actively linked in the global LRU list until its reference
> count drops to zero.
> 
> By calling folio_get() here, the reference count is elevated, keeping the
> folio on the LRU list. Then, snp_leak_pages() unconditionally links the
> page into a tracking list via list_add_tail(&page->buddy_list, ...).
> 
> Because buddy_list and lru share the exact same union inside struct page
> and struct folio, modifying buddy_list silently overwrites the lru.next and
> lru.prev pointers of the folio, severing the global LRU linked list.

Yeah, trying to prevent a folio/page from being freed from within ->free_folio()
seems like a fool's errand.

* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-10 16:10 [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host Hyunwoo Kim
  2026-06-10 16:26 ` sashiko-bot
@ 2026-06-10 22:16 ` Michael Roth
  2026-06-11 10:26   ` Hyunwoo Kim
  1 sibling, 1 reply; 10+ messages in thread
From: Michael Roth @ 2026-06-10 22:16 UTC (permalink / raw)
  To: Hyunwoo Kim; +Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm

On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> sev_gmem_invalidate() is called when guest_memfd frees a gmem page.
> For each PFN that is still assigned to the guest in the RMP table, it
> transitions the page back to hypervisor-owned via rmp_make_shared()
> before the page is returned to the host.
> 
> A guest-assigned page can reach this path while still private,
> because the free path does not transition it beforehand and
> sev_gmem_invalidate() is the only place that does. A gmem page used
> as a vCPU's VMSA after SEV-SNP AP creation is one such case. When
> rmp_make_shared() fails, the RMP entry remains guest-owned and the
> host cannot use the page because of RMP protection, so it must not be
> returned to the host. The existing code only issues WARN_ONCE() and
> continues to the next PFN, returning the page to the host allocator.
> 
> Leak the page instead of freeing it, as kvm_rmp_make_shared(),
> snp_page_reclaim() and sev_free_vcpu() already do when a transition
> back to shared fails. snp_leak_pages() does not take a reference of
> its own, and on this path the page is freed right after the hook
> returns, so take a reference with folio_get() first to keep the page
> from being freed.
> 
> Fixes: 8eb01900b018 ("KVM: SEV: Implement gmem hook for invalidating private pages")
> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
> ---
>  arch/x86/kvm/svm/sev.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 6c6a6d663e29..8fee6ec529f9 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> 
>                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
>                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> -                             pfn, rc))
> +                             pfn, rc)) {
> +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> +                       folio_get(page_folio(pfn_to_page(pfn)));
> +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
>                         goto next_pfn;
> +               }

This roughly aligns with what would happen if snp_page_reclaim() fails
in sev_gmem_post_populate(), while the guest is being initialized via
KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().

However, in kvm_gmem_populate(), we still free the page. Maybe, to
address both cases, we should just add a parameter to snp_leak_pages()
to tell it to take an extra ref and use that in both of these paths.

Or we can just do the direct folio_get() in both cases, the above
formalizes the handling convention a little better though IMO.

Thanks,

Mike

> 
>                 /*
>                  * SEV-ES avoids host/guest cache coherency issues through
> --
> 2.43.0
> 

* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-10 22:16 ` Michael Roth
@ 2026-06-11 10:26   ` Hyunwoo Kim
  2026-06-11 12:47     ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Hyunwoo Kim @ 2026-06-11 10:26 UTC (permalink / raw)
  To: Michael Roth
  Cc: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	imv4bel

On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > sev_gmem_invalidate() is called when guest_memfd frees a gmem page.
> > For each PFN that is still assigned to the guest in the RMP table, it
> > transitions the page back to hypervisor-owned via rmp_make_shared()
> > before the page is returned to the host.
> > 
> > A guest-assigned page can reach this path while still private,
> > because the free path does not transition it beforehand and
> > sev_gmem_invalidate() is the only place that does. A gmem page used
> > as a vCPU's VMSA after SEV-SNP AP creation is one such case. When
> > rmp_make_shared() fails, the RMP entry remains guest-owned and the
> > host cannot use the page because of RMP protection, so it must not be
> > returned to the host. The existing code only issues WARN_ONCE() and
> > continues to the next PFN, returning the page to the host allocator.
> > 
> > Leak the page instead of freeing it, as kvm_rmp_make_shared(),
> > snp_page_reclaim() and sev_free_vcpu() already do when a transition
> > back to shared fails. snp_leak_pages() does not take a reference of
> > its own, and on this path the page is freed right after the hook
> > returns, so take a reference with folio_get() first to keep the page
> > from being freed.
> > 
> > Fixes: 8eb01900b018 ("KVM: SEV: Implement gmem hook for invalidating private pages")
> > Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
> > ---
> >  arch/x86/kvm/svm/sev.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > index 6c6a6d663e29..8fee6ec529f9 100644
> > --- a/arch/x86/kvm/svm/sev.c
> > +++ b/arch/x86/kvm/svm/sev.c
> > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > 
> >                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> >                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > -                             pfn, rc))
> > +                             pfn, rc)) {
> > +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> > +                       folio_get(page_folio(pfn_to_page(pfn)));
> > +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> >                         goto next_pfn;
> > +               }
> 
> This roughly aligns with what would happen if snp_page_reclaim() fails
> in sev_gmem_post_populate(), while the guest is being initialized via
> KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> 
> However, in kvm_gmem_populate(), we still free the page. Maybe, to
> address both cases, we should just add a parameter to snp_leak_pages()
> to tell it to take an extra ref and use that in both of these paths.
> 
> Or we can just do the direct folio_get() in both cases, the above
> formalizes the handling convention a little better though IMO.

If I understand correctly, an extra ref alone still seems to leave the
LRU corruption that sashiko flagged:

https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/

A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
same union as folio->lru, so leaking the page overwrites the folio's LRU
pointers. Both paths deal with a gmem folio, so the same applies.

To handle this properly, the folio would need to be taken off the LRU
before leaking, with something like folio_isolate_lru(), but that is
mm-internal and does not look usable from KVM. How should we proceed?
Please let me know if I am missing something.


Best regards,
Hyunwoo Kim

* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-11 10:26   ` Hyunwoo Kim
@ 2026-06-11 12:47     ` Sean Christopherson
  2026-06-11 14:05       ` Hyunwoo Kim
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-06-11 12:47 UTC (permalink / raw)
  To: Hyunwoo Kim
  Cc: Michael Roth, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm

On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> > On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > > ---
> > >  arch/x86/kvm/svm/sev.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > index 6c6a6d663e29..8fee6ec529f9 100644
> > > --- a/arch/x86/kvm/svm/sev.c
> > > +++ b/arch/x86/kvm/svm/sev.c
> > > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > > 
> > >                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> > >                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > > -                             pfn, rc))
> > > +                             pfn, rc)) {
> > > +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> > > +                       folio_get(page_folio(pfn_to_page(pfn)));
> > > +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> > >                         goto next_pfn;
> > > +               }
> > 
> > This roughly aligns with what would happen if snp_page_reclaim() fails
> > in sev_gmem_post_populate(), while the guest is being initialized via
> > KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> > 
> > However, in kvm_gmem_populate(), we still free the page. Maybe, to
> > address both cases, we should just add a parameter to snp_leak_pages()
> > to tell it to take an extra ref and use that in both of these paths.
> > 
> > Or we can just do the direct folio_get() in both cases, the above
> > formalizes the handling convention a little better though IMO.
> 
> If I understand correctly, an extra ref alone still seems to leave the
> LRU corruption that sashiko flagged:
> 
> https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/
> 
> A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
> on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
> same union as folio->lru, so leaking the page overwrites the folio's LRU
> pointers. Both paths deal with a gmem folio, so the same applies.
> 
> To handle this properly, the folio would need to be taken off the LRU
> before leaking, with something like folio_isolate_lru(), but that is
> mm-internal and does not look usable from KVM. How should we proceed?
> Please let me know if I am missing something.

I'm inclined to do nothing.  rmp_make_shared() should only fail in this case if
there's a fatal bug somewhere, no?  Either that or do BUG_ON(), because at some
point these types of errors are simply unrecoverable.

* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-11 12:47     ` Sean Christopherson
@ 2026-06-11 14:05       ` Hyunwoo Kim
  2026-06-11 15:23         ` Sean Christopherson
  0 siblings, 1 reply; 10+ messages in thread
From: Hyunwoo Kim @ 2026-06-11 14:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm, imv4bel

On Thu, Jun 11, 2026 at 05:47:52AM -0700, Sean Christopherson wrote:
> On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> > > On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > > > ---
> > > >  arch/x86/kvm/svm/sev.c | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > > index 6c6a6d663e29..8fee6ec529f9 100644
> > > > --- a/arch/x86/kvm/svm/sev.c
> > > > +++ b/arch/x86/kvm/svm/sev.c
> > > > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > > > 
> > > >                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> > > >                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > > > -                             pfn, rc))
> > > > +                             pfn, rc)) {
> > > > +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> > > > +                       folio_get(page_folio(pfn_to_page(pfn)));
> > > > +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> > > >                         goto next_pfn;
> > > > +               }
> > > 
> > > This roughly aligns with what would happen if snp_page_reclaim() fails
> > > in sev_gmem_post_populate(), while the guest is being initialized via
> > > KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> > > 
> > > However, in kvm_gmem_populate(), we still free the page. Maybe, to
> > > address both cases, we should just add a parameter to snp_leak_pages()
> > > to tell it to take an extra ref and use that in both of these paths.
> > > 
> > > Or we can just do the direct folio_get() in both cases, the above
> > > formalizes the handling convention a little better though IMO.
> > 
> > If I understand correctly, an extra ref alone still seems to leave the
> > LRU corruption that sashiko flagged:
> > 
> > https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/
> > 
> > A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
> > on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
> > same union as folio->lru, so leaking the page overwrites the folio's LRU
> > pointers. Both paths deal with a gmem folio, so the same applies.
> > 
> > To handle this properly, the folio would need to be taken off the LRU
> > before leaking, with something like folio_isolate_lru(), but that is
> > mm-internal and does not look usable from KVM. How should we proceed?
> > Please let me know if I am missing something.
> 
> I'm inclined to do nothing.  rmp_make_shared() should only fail in this case if
> there's a fatal bug somewhere, no?  Either that or do BUG_ON(), because at some
> point these types of errors are simply unrecoverable.

A guest can make a gmem page a VMSA via AP creation, and if that gfn is
then hole-punched, a page that is still assigned to the guest is
returned to the host in sev_gmem_invalidate(), which looked like it could
lead to a host RMP PF, so I sent the patch. Other sites such as
snp_page_reclaim() leak the page on this failure rather than freeing it.

If you think this isn't a real problem, leaving it as is seems fine to me.
I don't see a good place to put a BUG_ON.

* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-11 14:05       ` Hyunwoo Kim
@ 2026-06-11 15:23         ` Sean Christopherson
  2026-06-11 17:07           ` Hyunwoo Kim
  0 siblings, 1 reply; 10+ messages in thread
From: Sean Christopherson @ 2026-06-11 15:23 UTC (permalink / raw)
  To: Hyunwoo Kim
  Cc: Michael Roth, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm

On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> On Thu, Jun 11, 2026 at 05:47:52AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > > On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> > > > On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > > > > ---
> > > > >  arch/x86/kvm/svm/sev.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > > > index 6c6a6d663e29..8fee6ec529f9 100644
> > > > > --- a/arch/x86/kvm/svm/sev.c
> > > > > +++ b/arch/x86/kvm/svm/sev.c
> > > > > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > > > > 
> > > > >                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> > > > >                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > > > > -                             pfn, rc))
> > > > > +                             pfn, rc)) {
> > > > > +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> > > > > +                       folio_get(page_folio(pfn_to_page(pfn)));
> > > > > +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> > > > >                         goto next_pfn;
> > > > > +               }
> > > > 
> > > > This roughly aligns with what would happen if snp_page_reclaim() fails
> > > > in sev_gmem_post_populate(), while the guest is being initialized via
> > > > KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> > > > 
> > > > However, in kvm_gmem_populate(), we still free the page. Maybe, to
> > > > address both cases, we should just add a parameter to snp_leak_pages()
> > > > to tell it to take an extra ref and use that in both of these paths.
> > > > 
> > > > Or we can just do the direct folio_get() in both cases, the above
> > > > formalizes the handling convention a little better though IMO.
> > > 
> > > If I understand correctly, an extra ref alone still seems to leave the
> > > LRU corruption that sashiko flagged:
> > > 
> > > https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/
> > > 
> > > A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
> > > on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
> > > same union as folio->lru, so leaking the page overwrites the folio's LRU
> > > pointers. Both paths deal with a gmem folio, so the same applies.
> > > 
> > > To handle this properly, the folio would need to be taken off the LRU
> > > before leaking, with something like folio_isolate_lru(), but that is
> > > mm-internal and does not look usable from KVM. How should we proceed?
> > > Please let me know if I am missing something.
> > 
> > I'm inclined to do nothing.  rmp_make_shared() should only fail in this case if
> > there's a fatal bug somewhere, no?  Either that or do BUG_ON(), because at some
> > point these types of errors are simply unrecoverable.
> 
> A guest can make a gmem page a VMSA via AP creation,

Ugh, the bane of my existence.  Can we kill off that feature yet?  I'm only half
joking.  Not even half.

> and if that gfn is then hole-punched, a page that is still assigned to the
> guest is returned to the host in sev_gmem_invalidate(), which looked like it
> could lead to a host RMP PF, so I sent the patch.

Yeah, I suspect you're right.  But leaking the page doesn't fix the underlying
problem, which is that it's possible to free a page that's being used as a VMSA.

We can't simply pin the page, because IIUC ->free_folio() is called when the page
is removed from the filemap, not when the folio/page is freed back to the allocator.

I.e. we can't do this.

diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
index 74fb15551e83..ea70c7ade152 100644
--- arch/x86/kvm/svm/sev.c
+++ arch/x86/kvm/svm/sev.c
@@ -3510,6 +3510,9 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
 
        svm = to_svm(vcpu);
 
+       if (svm->sev_es.snp_has_guest_vmsa)
+               kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
+
        /*
         * If it's an SNP guest, then the VMSA was marked in the RMP table as
         * a guest-owned page. Transition the page to hypervisor state before
@@ -4027,6 +4030,13 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
        if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, &page, NULL))
                return;
 
+       /*
+        * Drop the reference to the previous VMSA page (acquired above) if the
+        * guest is changing its VMSA.
+        */
+       if (svm->sev_es.snp_has_guest_vmsa)
+               kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
+
        /*
         * From this point forward, the VMSA will always be a guest-mapped page
         * rather than the initial one allocated by KVM in svm->sev_es.vmsa. In
@@ -4043,13 +4053,6 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
 
        /* Mark the vCPU as runnable */
        kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
-
-       /*
-        * gmem pages aren't currently migratable, but if this ever changes
-        * then care should be taken to ensure svm->sev_es.vmsa is pinned
-        * through some other means.
-        */
-       kvm_release_page_clean(page);
 }
 
 static int sev_snp_ap_creation(struct vcpu_svm *svm)

The right way to handle this is to treat the VMSA "mapping" like an MMU mapping.
I.e. this is fundamentally the same mess we have to solve in order to not pin
pages that are mapped into L2 via vmcs01/vmcb02 for nVMX/nSVM.

Something like this, sans the actual handling of the request.  The simple way
to handle the request would be to invalidate control.vmsa_pa if snp_has_guest_vmsa
is true, and then redo the mapping part of sev_snp_init_protected_guest_state().

We could put the kvm_x86_call(gmem_invalidate_range)(kvm, range) call in
kvm_unmap_gfn_range() to avoid the extra arch hook, but I would prefer to have an
explicit connection between the invalidate() phase and the free() phase.

Have I mentioned how much I hate AP creation?

---
 arch/x86/include/asm/kvm-x86-ops.h |  3 ++-
 arch/x86/include/asm/kvm_host.h    |  3 ++-
 arch/x86/kvm/svm/sev.c             | 11 ++++++++++-
 arch/x86/kvm/svm/svm.c             |  5 ++++-
 arch/x86/kvm/svm/svm.h             | 13 ++-----------
 arch/x86/kvm/x86.c                 |  9 +++++++--
 include/linux/kvm_host.h           |  3 ++-
 virt/kvm/guest_memfd.c             | 17 +++++------------
 8 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 83dc5086138b..ea308250ef47 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -147,7 +147,8 @@ KVM_X86_OP_OPTIONAL(get_untagged_addr)
 KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 KVM_X86_OP_OPTIONAL_RET0(gmem_max_mapping_level)
-KVM_X86_OP_OPTIONAL(gmem_invalidate)
+KVM_X86_OP_OPTIONAL(gmem_invalidate_range)
+KVM_X86_OP_OPTIONAL(gmem_free_folio)
 #endif
 
 #undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3886b536c8a5..0a5aec1701bd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2009,7 +2009,8 @@ struct kvm_x86_ops {
 	gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
-	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
+	void (*gmem_invalidate_range)(struct kvm *kvm, struct kvm_gfn_range *range);
+	void (*gmem_free_folio)(struct folio *folio);
 	int (*gmem_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn, bool is_private);
 };
 
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 74fb15551e83..411480a90e36 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -5115,9 +5115,18 @@ int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
 
 	return 0;
 }
+void sev_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+	if (!sev_snp_guest(kvm))
+		return;
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_VMSA_PAGE_RELOAD);
+}
 
-void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+void sev_gmem_free_folio(struct folio *folio)
 {
+	kvm_pfn_t start = page_to_pfn(folio_page(folio, 0));
+	kvm_pfn_t end = start + (1ul << folio_order(folio));
 	kvm_pfn_t pfn;
 
 	if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 526e0fdcd16b..907be66d6f2a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5460,9 +5460,12 @@ struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
 
+#ifdef CONFIG_KVM_AMD_SEV
 	.gmem_prepare = sev_gmem_prepare,
-	.gmem_invalidate = sev_gmem_invalidate,
+	.gmem_invalidate_range = sev_gmem_invalidate_range,
+	.gmem_free_folio = sev_gmem_free_folio,
 	.gmem_max_mapping_level = sev_gmem_max_mapping_level,
+#endif
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 716be21fba33..5129ab1a84d7 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -1008,7 +1008,8 @@ int sev_dev_get_attr(u32 group, u64 attr, u64 *val);
 extern unsigned int max_sev_asid;
 void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
-void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+void sev_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range);
+void sev_gmem_free_folio(struct folio *folio);
 int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private);
 struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu);
 void sev_free_decrypted_vmsa(struct kvm_vcpu *vcpu, struct vmcb_save_area *vmsa);
@@ -1034,16 +1035,6 @@ static inline int sev_cpu_init(struct svm_cpu_data *sd) { return 0; }
 static inline int sev_dev_get_attr(u32 group, u64 attr, u64 *val) { return -ENXIO; }
 #define max_sev_asid 0
 static inline void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code) {}
-static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
-{
-	return 0;
-}
-static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
-static inline int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
-{
-	return 0;
-}
-
 static inline struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu)
 {
 	return NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf122b8c3210..a734ddf59cf9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14134,9 +14134,14 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
 #endif
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+void kvm_arch_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	kvm_x86_call(gmem_invalidate)(start, end);
+	kvm_x86_call(gmem_invalidate_range)(kvm, range);
+}
+
+void kvm_arch_gmem_free_folio(struct folio *folio)
+{
+	kvm_x86_call(gmem_free_folio)(folio);
 }
 #endif
 #endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 27498e990dff..7de9b07870b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2602,7 +2602,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
 #endif
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+void kvm_arch_gmem_invalidate_range(struct kvm *kvm, struct kvm_gfn_range *range);
+void kvm_arch_gmem_free_folio(struct folio *folio);
 #endif
 
 #ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 86690683b2fe..8ec5041934db 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -185,6 +185,10 @@ static void __kvm_gmem_invalidate_start(struct gmem_file *f, pgoff_t start,
 		}
 
 		flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+
+#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
+		kvm_arch_gmem_invalidate_range(kvm, &gfn_range);
+#endif
 	}
 
 	if (flush)
@@ -523,23 +527,12 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }
 
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-static void kvm_gmem_free_folio(struct folio *folio)
-{
-	struct page *page = folio_page(folio, 0);
-	kvm_pfn_t pfn = page_to_pfn(page);
-	int order = folio_order(folio);
-
-	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
-}
-#endif
-
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
-	.free_folio = kvm_gmem_free_folio,
+	.free_folio = kvm_arch_gmem_free_folio,
 #endif
 };
 

base-commit: de3a35be92d2391ece4bf3143ef2887192625fd0
-- 

> Other sites such as snp_page_reclaim() leak the page on this failure rather
> than freeing it.
> 
> If you think this isn't a real problem, leaving it as is seems fine to me.
> I don't see a good place to put a BUG_ON.


* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
  2026-06-11 15:23         ` Sean Christopherson
@ 2026-06-11 17:07           ` Hyunwoo Kim
       [not found]             ` <airxMoy44ZxkbioH@google.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Hyunwoo Kim @ 2026-06-11 17:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm, imv4bel

On Thu, Jun 11, 2026 at 08:23:00AM -0700, Sean Christopherson wrote:
> On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > On Thu, Jun 11, 2026 at 05:47:52AM -0700, Sean Christopherson wrote:
> > > On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > > > On Wed, Jun 10, 2026 at 05:16:57PM -0500, Michael Roth wrote:
> > > > > On Thu, Jun 11, 2026 at 01:10:03AM +0900, Hyunwoo Kim wrote:
> > > > > > ---
> > > > > >  arch/x86/kvm/svm/sev.c | 6 +++++-
> > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> > > > > > index 6c6a6d663e29..8fee6ec529f9 100644
> > > > > > --- a/arch/x86/kvm/svm/sev.c
> > > > > > +++ b/arch/x86/kvm/svm/sev.c
> > > > > > @@ -5178,8 +5178,12 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> > > > > > 
> > > > > >                 rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
> > > > > >                 if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
> > > > > > -                             pfn, rc))
> > > > > > +                             pfn, rc)) {
> > > > > > +                       /* Still assigned to the guest; pin and leak rather than freeing. */
> > > > > > +                       folio_get(page_folio(pfn_to_page(pfn)));
> > > > > > +                       snp_leak_pages(pfn, use_2m_update ? PTRS_PER_PMD : 1);
> > > > > >                         goto next_pfn;
> > > > > > +               }
> > > > > 
> > > > > This roughly aligns with what would happen if snp_page_reclaim() fails
> > > > > in sev_gmem_post_populate(), while the guest is being initialized via
> > > > > KVM_SEV_SNP_LAUNCH_UPDATE ioctl, which calls into kvm_gmem_populate().
> > > > > 
> > > > > However, in kvm_gmem_populate(), we still free the page. Maybe, to
> > > > > address both cases, we should just add a parameter to snp_leak_pages()
> > > > > to tell it to take an extra ref and use that in both of these paths.
> > > > > 
> > > > > Or we can just do the direct folio_get() in both cases, the above
> > > > > formalizes the handling convention a little better though IMO.
> > > > 
> > > > If I understand correctly, an extra ref alone still seems to leave the
> > > > LRU corruption that sashiko flagged:
> > > > 
> > > > https://lore.kernel.org/all/20260610162623.061BA1F00898@smtp.kernel.org/
> > > > 
> > > > A gmem folio is on the unevictable LRU, and taking a ref keeps the folio
> > > > on the LRU. page->buddy_list, which snp_leak_pages() uses, shares the
> > > > same union as folio->lru, so leaking the page overwrites the folio's LRU
> > > > pointers. Both paths deal with a gmem folio, so the same applies.
> > > > 
> > > > To handle this properly, the folio would need to be taken off the LRU
> > > > before leaking, with something like folio_isolate_lru(), but that is
> > > > mm-internal and does not look usable from KVM. How should we proceed?
> > > > Please let me know if I am missing something.
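
The aliasing described above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: struct model_page, the list helpers, and leak_clobbers_lru() are hypothetical stand-ins, not the kernel's definitions. It links one node through lru, then links the same storage through buddy_list without unlinking first, and checks that the first list no longer has a consistent back-pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for a doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical, heavily simplified model: both list nodes share storage. */
struct model_page {
	union {
		struct list_head lru;        /* linkage while on an LRU list */
		struct list_head buddy_list; /* same bytes, reused by another owner */
	};
};

/*
 * Returns 1 if linking the page through buddy_list clobbered the pointers
 * that the LRU list still relies on, 0 otherwise.
 */
static int leak_clobbers_lru(void)
{
	struct list_head lru_head, leaked_head;
	struct model_page page;

	list_init(&lru_head);
	list_init(&leaked_head);

	/* The page sits on the model LRU. */
	list_add_tail(&page.lru, &lru_head);
	assert(page.lru.next == &lru_head);

	/* "Leak" it without first taking it off the LRU. */
	list_add_tail(&page.buddy_list, &leaked_head);

	/* The LRU head still points at the page, but the page points elsewhere. */
	return lru_head.next == &page.lru && page.lru.next != &lru_head;
}
```

In this model an extra reference would change nothing, because the overwrite comes from the shared storage and not from the page's lifetime.
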
> > > 
> > > I'm inclined to do nothing.  rmp_make_shared() should only fail in this case if
> > > there's a fatal bug somewhere, no?  Either that or do BUG_ON(), because at some
> > > point these types of errors are simply unrecoverable.
> > 
> > A guest can make a gmem page a VMSA via AP creation,
> 
> Ugh, the bane of my existence.  Can we kill off that feature yet?  I'm only half
> joking.  Not even half.
> 
> > and if that gfn is then hole-punched, a page that is still assigned to the
> > guest is returned to the host in sev_gmem_invalidate(), which looked like it
> > could lead to a host RMP PF, so I sent the patch.
> 
> Yeah, I suspect you're right.  But leaking the page doesn't fix the underlying
> problem, which is that it's possible to free a page that's being used as a VMSA.
> 
> We can't simply pin the page, because IIUC ->free_folio() is called when the page
> is removed from the filemap, not when the folio/page is freed back to the allocator.
> 
> I.e. we can't do this.
> 
> diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
> index 74fb15551e83..ea70c7ade152 100644
> --- arch/x86/kvm/svm/sev.c
> +++ arch/x86/kvm/svm/sev.c
> @@ -3510,6 +3510,9 @@ void sev_free_vcpu(struct kvm_vcpu *vcpu)
>  
>         svm = to_svm(vcpu);
>  
> +       if (svm->sev_es.snp_has_guest_vmsa)
> +               kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
> +
>         /*
>          * If it's an SNP guest, then the VMSA was marked in the RMP table as
>          * a guest-owned page. Transition the page to hypervisor state before
> @@ -4027,6 +4030,13 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
>         if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, &page, NULL))
>                 return;
>  
> +       /*
> +        * Drop the reference to the previous VMSA page (acquired above) if the
> +        * guest is changing its VMSA.
> +        */
> +       if (svm->sev_es.snp_has_guest_vmsa)
> +               kvm_release_page_clean(phys_to_page(svm->vmcb->control.vmsa_pa));
> +
>         /*
>          * From this point forward, the VMSA will always be a guest-mapped page
>          * rather than the initial one allocated by KVM in svm->sev_es.vmsa. In
> @@ -4043,13 +4053,6 @@ static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
>  
>         /* Mark the vCPU as runnable */
>         kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
> -
> -       /*
> -        * gmem pages aren't currently migratable, but if this ever changes
> -        * then care should be taken to ensure svm->sev_es.vmsa is pinned
> -        * through some other means.
> -        */
> -       kvm_release_page_clean(page);
>  }
>  
>  static int sev_snp_ap_creation(struct vcpu_svm *svm)
> 
> The right way to handle this is to treat the VMSA "mapping" like an MMU mapping.
> I.e. this is fundamentally the same mess we have to solve in order to not pin
> pages that are mapped into L2 via vmcs01/vmcb02 for nVMX/nSVM.
> 
> Something like this, sans the actual handling of the request.  The simple way
> to handle the request would be to invalidate control.vmsa_pa if snp_has_guest_vmsa
> is true, and then redo the mapping part of sev_snp_init_protected_guest_state().

Understood. This looks like a fairly large change, at least for me, so
would you be able to handle the patch?
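
For reference, the request flow described above can be modeled in userspace roughly as follows. This is a sketch under assumptions, not the kernel implementation: every model_* name, the gfn_to_pa table, and MODEL_REQ_VMSA_RELOAD (a stand-in for KVM_REQ_VMSA_PAGE_RELOAD) are hypothetical. The invalidate phase only raises a request on every vCPU, and each vCPU drops its cached VMSA address and redoes the lookup before it would next enter the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_VCPUS        2
#define MODEL_NR_GFNS         8
#define MODEL_REQ_VMSA_RELOAD (1u << 0)  /* stand-in for KVM_REQ_VMSA_PAGE_RELOAD */
#define MODEL_INVALID_PA      UINT64_MAX

/* Hypothetical per-vCPU state; field names only loosely mirror the thread. */
struct model_vcpu {
	unsigned int requests;   /* pending request bits */
	bool has_guest_vmsa;     /* guest installed its own VMSA via AP creation */
	uint64_t vmsa_gfn;       /* guest frame the VMSA lives at */
	uint64_t vmsa_pa;        /* cached host physical address */
};

struct model_vm {
	struct model_vcpu vcpus[MODEL_NR_VCPUS];
	uint64_t gfn_to_pa[MODEL_NR_GFNS]; /* MODEL_INVALID_PA marks a punched hole */
};

/* Invalidate phase: leave cached state alone, just ask every vCPU to reload. */
static void model_invalidate_range(struct model_vm *vm)
{
	for (int i = 0; i < MODEL_NR_VCPUS; i++)
		vm->vcpus[i].requests |= MODEL_REQ_VMSA_RELOAD;
}

/*
 * Request handling before guest entry. Returns false if the vCPU must not
 * run because its VMSA no longer has backing memory.
 */
static bool model_handle_requests(struct model_vm *vm, struct model_vcpu *vcpu)
{
	if (!(vcpu->requests & MODEL_REQ_VMSA_RELOAD))
		return true;
	vcpu->requests &= ~MODEL_REQ_VMSA_RELOAD;

	if (!vcpu->has_guest_vmsa)
		return true;

	/* Drop the cached address, then redo the mapping step. */
	assert(vcpu->vmsa_gfn < MODEL_NR_GFNS);
	vcpu->vmsa_pa = MODEL_INVALID_PA;
	vcpu->vmsa_pa = vm->gfn_to_pa[vcpu->vmsa_gfn];
	return vcpu->vmsa_pa != MODEL_INVALID_PA;
}
```

The model leaves out what the real change would have to get right, such as ordering against the free phase and what to do with a vCPU whose VMSA has no backing page.
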

> 
> We could put the kvm_x86_call(gmem_invalidate_range)(kvm, range) call in
> kvm_unmap_gfn_range() to avoid the extra arch hook, but I would prefer to have an
> explicit connection between the invalidate() phase and the free() phase.
> 
> Have I mentioned how much I hate AP creation?
> 
> ...
> 
> > Other sites such as snp_page_reclaim() leak the page on this failure rather
> > than freeing it.
> > 
> > If you think this isn't a real problem, leaving it as is seems fine to me.
> > I don't see a good place to put a BUG_ON.


* Re: [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host
       [not found]             ` <airxMoy44ZxkbioH@google.com>
@ 2026-06-11 17:34               ` Hyunwoo Kim
  0 siblings, 0 replies; 10+ messages in thread
From: Hyunwoo Kim @ 2026-06-11 17:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael Roth, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm, imv4bel

On Thu, Jun 11, 2026 at 10:32:34AM -0700, Sean Christopherson wrote:
> On Fri, Jun 12, 2026, Hyunwoo Kim wrote:
> > On Thu, Jun 11, 2026 at 08:23:00AM -0700, Sean Christopherson wrote:
> > > On Thu, Jun 11, 2026, Hyunwoo Kim wrote:
> > > > > > To handle this properly, the folio would need to be taken off the LRU
> > > > > > before leaking, with something like folio_isolate_lru(), but that is
> > > > > > mm-internal and does not look usable from KVM. How should we proceed?
> > > > > > Please let me know if I am missing something.
> > > > > 
> > > > > I'm inclined to do nothing.  rmp_make_shared() should only fail in this case if
> > > > > there's a fatal bug somewhere, no?  Either that or do BUG_ON(), because at some
> > > > > point these types of errors are simply unrecoverable.
> > > > 
> > > > A guest can make a gmem page a VMSA via AP creation,
> > > 
> > > Ugh, the bane of my existence.  Can we kill off that feature yet?  I'm only half
> > > joking.  Not even half.
> > > 
> > > > and if that gfn is then hole-punched, a page that is still assigned to the
> > > > guest is returned to the host in sev_gmem_invalidate(), which looked like it
> > > > could lead to a host RMP PF, so I sent the patch.
> > > 
> > > Yeah, I suspect you're right.  But leaking the page doesn't fix the underlying
> > > problem, which is that it's possible to free a page that's being used as a VMSA.
> > > 
> > > We can't simply pin the page, because IIUC ->free_folio() is called when the page
> > > > is removed from the filemap, not when the folio/page is freed back to the allocator.
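
The timing point quoted above can be illustrated with a toy refcount model. It is a sketch only: struct model_folio and the model_* helpers are hypothetical and merely mimic the described behavior, in which the free hook fires when the folio leaves the filemap even though another holder still has a reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a folio's lifetime, not the kernel's struct folio. */
struct model_folio {
	int refcount;               /* filemap reference plus any pins */
	bool in_filemap;
	bool free_folio_called;     /* the ->free_folio()-style hook has run */
	bool returned_to_allocator; /* last reference dropped */
};

/* Drop one reference; the memory goes back only when the count hits zero. */
static void model_folio_put(struct model_folio *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0)
		f->returned_to_allocator = true;
}

/* Removal from the filemap: the hook runs here, whatever the refcount is. */
static void model_filemap_remove(struct model_folio *f)
{
	f->in_filemap = false;
	f->free_folio_called = true;
	model_folio_put(f); /* drop the filemap's own reference */
}
```

With a refcount of 2 (the filemap plus one pin), model_filemap_remove() runs the hook and leaves the folio allocated. The pin keeps the memory alive but does not delay the hook.
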
> 
> ...
> 
> > > The right way to handle this is to treat the VMSA "mapping" like an MMU mapping.
> > > I.e. this is fundamentally the same mess we have to solve in order to not pin
> > > pages that are mapped into L2 via vmcs01/vmcb02 for nVMX/nSVM.
> > > 
> > > > Something like this, sans the actual handling of the request.  The simple way
> > > to handle the request would be to invalidate control.vmsa_pa if snp_has_guest_vmsa
> > > is true, and then redo the mapping part of sev_snp_init_protected_guest_state().
> > 
> > Understood. This looks like a fairly large change, at least for me,
> 
> Heh, definitely not just for you, it'll be a large, painful change for anyone :-)
> 
> > so would you be able to handle the patch?
> 
> Yeah, I'll work with Mike and others to get a fix landed.  Thanks much for the
> bug report!

OK. Thank you for the review :)


Thread overview: 10+ messages
2026-06-10 16:10 [PATCH] KVM: SEV: Don't return a still-assigned gmem page to the host Hyunwoo Kim
2026-06-10 16:26 ` sashiko-bot
2026-06-10 18:25   ` Sean Christopherson
2026-06-10 22:16 ` Michael Roth
2026-06-11 10:26   ` Hyunwoo Kim
2026-06-11 12:47     ` Sean Christopherson
2026-06-11 14:05       ` Hyunwoo Kim
2026-06-11 15:23         ` Sean Christopherson
2026-06-11 17:07           ` Hyunwoo Kim
     [not found]             ` <airxMoy44ZxkbioH@google.com>
2026-06-11 17:34               ` Hyunwoo Kim
