From: Sean Christopherson <seanjc@google.com>
To: Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Ira Weiny <ira.weiny@intel.com>, Kai Huang <kai.huang@intel.com>,
	Michael Roth <michael.roth@amd.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	Vishal Annapurve <vannapurve@google.com>,
	Rick Edgecombe <rick.p.edgecombe@intel.com>,
	Ackerley Tng <ackerleytng@google.com>
Subject: [RFC PATCH v2 05/18] KVM: TDX: Drop superfluous page pinning in S-EPT management
Date: Thu, 28 Aug 2025 17:06:05 -0700
Message-ID: <20250829000618.351013-6-seanjc@google.com>
In-Reply-To: <20250829000618.351013-1-seanjc@google.com>

From: Yan Zhao <yan.y.zhao@intel.com>

Don't explicitly pin pages when mapping pages into the S-EPT, as guest_memfd
doesn't support page migration in any capacity, i.e. there are no migrate
callbacks because guest_memfd pages *can't* be migrated. See the WARN in
kvm_gmem_migrate_folio().

Eliminating TDX's explicit pinning will also enable guest_memfd to support
in-place conversion between shared and private memory[1][2]. Because KVM
cannot distinguish between speculative/transient refcounts and the
intentional refcount for TDX on private pages[3], failing to release the
private page refcount in TDX could cause guest_memfd to wait indefinitely
for the refcount to drop before splitting.

Under normal conditions, not holding an extra page refcount in TDX is safe
because guest_memfd ensures pages are retained until its invalidation
notification to KVM MMU is completed. However, if there are bugs in KVM or
the TDX module, not holding an extra refcount when a page is mapped in the
S-EPT could result in the page being released from guest_memfd while still
mapped in the S-EPT. But doing work to make a fatal error slightly less
fatal is a net negative when that extra work adds complexity and confusion.

Several approaches were considered to address the refcount issue, including:

 - Attempting to modify the KVM unmap operation to return a failure,
   which was deemed too complex and potentially incorrect[4].
 - Increasing the folio reference count only upon S-EPT zapping failure[5].
 - Using page flags or page_ext to indicate that a page is still used by
   TDX[6], which does not work for HVO (HugeTLB Vmemmap Optimization).
 - Setting the HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].

Due to the complexity or inappropriateness of these approaches, and the
fact that S-EPT zapping failure is currently only possible when there are
bugs in KVM or the TDX module, which is very rare in a production kernel,
a straightforward approach of simply not holding the page reference count
in TDX was chosen[8].

When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
vCPUs and mark the VM as dead. Although there is a potential window in
which a private page mapped in the S-EPT could be reallocated and used
outside the VM, the loud warning from KVM_BUG_ON() should provide sufficient
debug information. To be robust against bugs, the user can enable
panic_on_warn as normal.

Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
Link: https://youtu.be/UnBKahkAon4 [2]
Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Suggested-by: Ackerley Tng <ackerleytng@google.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
[sean: extract out of hugepage series, massage changelog accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/tdx.c | 28 ++++------------------------
1 file changed, 4 insertions(+), 24 deletions(-)
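
Illustration only, not part of the patch: the refcount trade-off described
in the changelog can be sketched as a small userspace toy model. Everything
below (struct toy_page, the toy_*() helpers, and the rule that conversion
requires refcount == 1) is a hypothetical simplification made up for this
sketch; none of it is kernel or guest_memfd API. It only models the one case
the changelog worries about, a zap that fails because of a KVM or TDX module
bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a guest_memfd page; NOT the kernel's struct page. */
struct toy_page {
	int refcount;	/* 1 == only the owner's (guest_memfd's) reference */
	bool in_sept;	/* true while "mapped" into the toy S-EPT */
};

/* Old scheme: take an extra reference when mapping into the S-EPT. */
static void toy_map_pinned(struct toy_page *p)
{
	p->refcount++;
	p->in_sept = true;
}

/* New scheme: map without touching the refcount. */
static void toy_map_unpinned(struct toy_page *p)
{
	p->in_sept = true;
}

/*
 * Model a zap. On success the mapping goes away and, in the old scheme,
 * the pin is dropped. On failure (assumed to be a bug) nothing is released.
 */
static void toy_zap(struct toy_page *p, bool pinned, bool zap_fails)
{
	if (zap_fails)
		return;

	p->in_sept = false;
	if (pinned)
		p->refcount--;
}

/* Assumed rule: conversion/splitting proceeds only once extra refs are gone. */
static bool toy_can_convert(const struct toy_page *p)
{
	return p->refcount == 1;
}

/* Returns true if a failed zap blocks conversion under the given scheme. */
bool toy_failed_zap_blocks_conversion(bool pinned)
{
	struct toy_page p = { .refcount = 1, .in_sept = false };

	if (pinned)
		toy_map_pinned(&p);
	else
		toy_map_unpinned(&p);

	toy_zap(&p, pinned, true);

	/* Either way, the failed zap leaves the page mapped in the toy S-EPT. */
	assert(p.in_sept);

	return !toy_can_convert(&p);
}
```

In this model the pinned scheme leaves a reference that is never dropped
after a failed zap, so the conversion check never passes, while the unpinned
scheme lets it pass even though the page is still marked as mapped. That is
the window the changelog accepts, relying on KVM_BUG_ON() to flag it.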
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c83e1ff02827..f24f8635b433 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1586,29 +1586,22 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa);
 }
 
-static void tdx_unpin(struct kvm *kvm, struct page *page)
-{
-	put_page(page);
-}
-
 static int tdx_mem_page_aug(struct kvm *kvm, gfn_t gfn,
-			    enum pg_level level, struct page *page)
+			    enum pg_level level, kvm_pfn_t pfn)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct page *page = pfn_to_page(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	u64 entry, level_state;
 	u64 err;
 
 	err = tdh_mem_page_aug(&kvm_tdx->td, gpa, tdx_level, page, &entry, &level_state);
-	if (unlikely(tdx_operand_busy(err))) {
-		tdx_unpin(kvm, page);
+	if (unlikely(tdx_operand_busy(err)))
 		return -EBUSY;
-	}
 
 	if (KVM_BUG_ON(err, kvm)) {
 		pr_tdx_error_2(TDH_MEM_PAGE_AUG, err, entry, level_state);
-		tdx_unpin(kvm, page);
 		return -EIO;
 	}
 
@@ -1642,29 +1635,18 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 				     enum pg_level level, kvm_pfn_t pfn)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-	struct page *page = pfn_to_page(pfn);
 
 	/* TODO: handle large pages. */
 	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
 		return -EINVAL;
 
-	/*
-	 * Because guest_memfd doesn't support page migration with
-	 * a_ops->migrate_folio (yet), no callback is triggered for KVM on page
-	 * migration.  Until guest_memfd supports page migration, prevent page
-	 * migration.
-	 * TODO: Once guest_memfd introduces callback on page migration,
-	 * implement it and remove get_page/put_page().
-	 */
-	get_page(page);
-
 	/*
 	 * Read 'pre_fault_allowed' before 'kvm_tdx->state'; see matching
 	 * barrier in tdx_td_finalize().
 	 */
 	smp_rmb();
 	if (likely(kvm_tdx->state == TD_STATE_RUNNABLE))
-		return tdx_mem_page_aug(kvm, gfn, level, page);
+		return tdx_mem_page_aug(kvm, gfn, level, pfn);
 
 	return tdx_mem_page_record_premap_cnt(kvm, gfn, level, pfn);
 }
@@ -1715,7 +1697,6 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 		return -EIO;
 	}
 	tdx_clear_page(page);
-	tdx_unpin(kvm, page);
 	return 0;
 }
 
@@ -1795,7 +1776,6 @@ static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
 	if (tdx_is_sept_zap_err_due_to_premap(kvm_tdx, err, entry, level) &&
 	    !KVM_BUG_ON(!atomic64_read(&kvm_tdx->nr_premapped), kvm)) {
 		atomic64_dec(&kvm_tdx->nr_premapped);
-		tdx_unpin(kvm, page);
 		return 0;
 	}
 
-- 
2.51.0.318.gd7df087d1a-goog