* [PATCH v3 0/3] KVM: x86: Dynamically allocate hashed page list
@ 2025-05-16 21:54 Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-05-16 21:54 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma
Third time's a charm! Right? Right!?!?
Allocate the hashed list of shadow pages dynamically (separate from
struct kvm), and on-demand. The hashed list is 32KiB, i.e. absolutely
belongs in a separate allocation, and is worth skipping if KVM isn't
shadowing guest PTEs for the VM.
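For the curious, that 32KiB is just the bucket array (back-of-the-envelope
math, assuming KVM_MMU_HASH_SHIFT is still 12 and 64-bit pointers):
  KVM_NUM_MMU_PAGES * sizeof(struct hlist_head) = (1 << 12) * 8 = 32KiB
i.e. an order-3 chunk all on its own when embedded in "struct kvm".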
Side topic #1, a bunch of my measurements from v2 and earlier were "bad",
because I was using a PROVE_LOCKING=y kernel, which significantly inflates
the size of "struct kvm" in particular.
Side topic #2, I have a patch to dynamically allocate the memslots hash
tables (they're very conveniently either 2KiB or 4KiB in size for 64-bit
kernels), but I couldn't convince myself that the complexity is in any way
justified. I did however account for the size of the hash tables in the
assertions, if only to document where a big chunk of the per-VM memory usage
is going.
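(Back-of-the-envelope math for the memslots hash tables, assuming id_hash is
still a 7-bit, i.e. 128-bucket, hashtable: 128 buckets * 8 bytes * 2 sets of
memslots * KVM_MAX_NR_ADDRESS_SPACES works out to 2KiB with CONFIG_KVM_SMM=n
or 4KiB with CONFIG_KVM_SMM=y on 64-bit kernels, which is where the numbers
in the assertions come from.)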
Side topic #3, AFAIK, DEBUG_KERNEL=n builds are quite rare, so I'm banking
on build bots tripping the assert (I'll also add a DEBUG_KERNEL=n config to
my own testing, probably).
v3:
- Add comments explaining the {READ,WRITE}_ONCE logic, and why it's safe
to set the list outside of mmu_lock. [Vipin]
- Make the assertions actually work. [Vipin]
- Refine the assertions so they (hopefully) won't fail on kernels with
a bunch of debug crud added.
v2:
- https://lore.kernel.org/all/20250401155714.838398-1-seanjc@google.com
- Actually defer allocation when using TDP MMU. [Vipin]
- Free allocation on MMU teardown. [Vipin]
v1: https://lore.kernel.org/all/20250315024010.2360884-1-seanjc@google.com
Sean Christopherson (3):
KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
KVM: x86: Use kvzalloc() to allocate VM struct
KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
arch/x86/include/asm/kvm_host.h | 6 +--
arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++---
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/kvm/vmx/main.c | 2 +
arch/x86/kvm/vmx/vmx.c | 2 +
arch/x86/kvm/x86.c | 5 ++-
arch/x86/kvm/x86.h | 22 ++++++++++
7 files changed, 102 insertions(+), 10 deletions(-)
base-commit: 7ef51a41466bc846ad794d505e2e34ff97157f7f
--
2.49.0.1112.g889b7c5bd8-goog
* [PATCH v3 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
2025-05-16 21:54 [PATCH v3 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
@ 2025-05-16 21:54 ` Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
2 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-05-16 21:54 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma
Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation. Dynamically
allocating the array will allow allocating "struct kvm" using kvmalloc(),
and will also allow deferring allocation of the array until it's actually
needed, i.e. until the first shadow root is allocated.
Opportunistically use kvmalloc() for the hashed lists, as an order-3
allocation is (stating the obvious) less likely to fail than an order-4
allocation, and the overhead of vmalloc() is undesirable given that the
size of the allocation is fixed.
Cc: Vipin Sharma <vipinsh@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kvm/mmu/mmu.c | 23 ++++++++++++++++++++++-
arch/x86/kvm/x86.c | 5 ++++-
3 files changed, 28 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 330cdcbed1a6..9667d6b929ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1343,7 +1343,7 @@ struct kvm_arch {
bool has_private_mem;
bool has_protected_state;
bool pre_fault_allowed;
- struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
+ struct hlist_head *mmu_page_hash;
struct list_head active_mmu_pages;
/*
* A list of kvm_mmu_page structs that, if zapped, could possibly be
@@ -2006,7 +2006,7 @@ void kvm_mmu_vendor_module_exit(void);
void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
int kvm_mmu_create(struct kvm_vcpu *vcpu);
-void kvm_mmu_init_vm(struct kvm *kvm);
+int kvm_mmu_init_vm(struct kvm *kvm);
void kvm_mmu_uninit_vm(struct kvm *kvm);
void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..41da2cb1e3f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3882,6 +3882,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
return r;
}
+static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
+{
+ typeof(kvm->arch.mmu_page_hash) h;
+
+ h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
+ if (!h)
+ return -ENOMEM;
+
+ kvm->arch.mmu_page_hash = h;
+ return 0;
+}
+
static int mmu_first_shadow_root_alloc(struct kvm *kvm)
{
struct kvm_memslots *slots;
@@ -6675,13 +6687,19 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
}
-void kvm_mmu_init_vm(struct kvm *kvm)
+int kvm_mmu_init_vm(struct kvm *kvm)
{
+ int r;
+
kvm->arch.shadow_mmio_value = shadow_mmio_value;
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
+ r = kvm_mmu_alloc_page_hash(kvm);
+ if (r)
+ return r;
+
if (tdp_mmu_enabled)
kvm_mmu_init_tdp_mmu(kvm);
@@ -6692,6 +6710,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+ return 0;
}
static void mmu_free_vm_memory_caches(struct kvm *kvm)
@@ -6703,6 +6722,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
void kvm_mmu_uninit_vm(struct kvm *kvm)
{
+ kvfree(kvm->arch.mmu_page_hash);
+
if (tdp_mmu_enabled)
kvm_mmu_uninit_tdp_mmu(kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f9f798f286ce..d204ba9368f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12787,7 +12787,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
if (ret)
goto out;
- kvm_mmu_init_vm(kvm);
+ ret = kvm_mmu_init_vm(kvm);
+ if (ret)
+ goto out_cleanup_page_track;
ret = kvm_x86_call(vm_init)(kvm);
if (ret)
@@ -12840,6 +12842,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
out_uninit_mmu:
kvm_mmu_uninit_vm(kvm);
+out_cleanup_page_track:
kvm_page_track_cleanup(kvm);
out:
return ret;
--
2.49.0.1112.g889b7c5bd8-goog
* [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-16 21:54 [PATCH v3 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
@ 2025-05-16 21:54 ` Sean Christopherson
2025-05-17 12:35 ` Paolo Bonzini
2025-05-20 16:15 ` Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
2 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-05-16 21:54 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
allocation before falling back to __vmalloc(), to avoid the overhead of
establishing the virtual mappings. For non-debug builds, the SVM and VMX
(and TDX) structures are now just below 7000 bytes in the worst case
scenario (see below), i.e. are order-1 allocations, and will likely remain
that way for quite some time.
Add compile-time assertions in vendor code to ensure the sizes of the
structures, sans the memslots hash tables, are order-0 allocations, i.e.
are less than 4KiB. There's nothing fundamentally wrong with a larger
kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
the memslots hash tables) is below 2KiB after 18+ years of existence,
more than doubling the size would be quite notable.
Add sanity checks on the memslot hash table sizes, partly to ensure they
aren't resized without accounting for the impact on VM structure size, and
partly to document that the majority of the size of VM structures comes
from the memslots.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/svm/svm.c | 2 ++
arch/x86/kvm/vmx/main.c | 2 ++
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.h | 22 ++++++++++++++++++++++
5 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9667d6b929ee..3a985825a945 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1961,7 +1961,7 @@ void kvm_x86_vendor_exit(void);
#define __KVM_HAVE_ARCH_VM_ALLOC
static inline struct kvm *kvm_arch_alloc_vm(void)
{
- return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ return kvzalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT);
}
#define __KVM_HAVE_ARCH_VM_FREE
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0ad1a6d4fb6d..d13e475c3407 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5675,6 +5675,8 @@ static int __init svm_init(void)
{
int r;
+ KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_svm);
+
__unused_size_checks();
if (!kvm_is_svm_supported())
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..e18dfada2e90 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -64,6 +64,8 @@ static __init int vt_hardware_setup(void)
vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
}
+ KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx);
+
return 0;
}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9ff00ae9f05a..ef58b727d6c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8643,6 +8643,8 @@ int __init vmx_init(void)
{
int r, cpu;
+ KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_vmx);
+
if (!kvm_is_vmx_supported())
return -EOPNOTSUPP;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 832f0faf4779..0f3046cccb79 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -55,6 +55,28 @@ struct kvm_host_values {
void kvm_spurious_fault(void);
+#define SIZE_OF_MEMSLOTS_HASHTABLE \
+ (sizeof(((struct kvm_memslots *)0)->id_hash) * 2 * KVM_MAX_NR_ADDRESS_SPACES)
+
+/* Sanity check the size of the memslot hash tables. */
+static_assert(SIZE_OF_MEMSLOTS_HASHTABLE ==
+ (1024 * (1 + IS_ENABLED(CONFIG_X86_64)) * (1 + IS_ENABLED(CONFIG_KVM_SMM))));
+
+/*
+ * Assert that "struct kvm_{svm,vmx,tdx}" is an order-0 or order-1 allocation.
+ * Spilling over to an order-2 allocation isn't fundamentally problematic, but
+ * isn't expected to happen in the foreseeable future (O(years)). Assert that
+ * the size is an order-0 allocation when ignoring the memslot hash tables, to
+ * help detect and debug unexpected size increases.
+ */
+#define KVM_SANITY_CHECK_VM_STRUCT_SIZE(x) \
+do { \
+ BUILD_BUG_ON(get_order(sizeof(struct x) - SIZE_OF_MEMSLOTS_HASHTABLE) && \
+ !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
+ BUILD_BUG_ON(get_order(sizeof(struct x)) < 2 && \
+ !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
+} while (0)
+
#define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
({ \
bool failed = (consistency_check); \
--
2.49.0.1112.g889b7c5bd8-goog
* [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
2025-05-16 21:54 [PATCH v3 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
@ 2025-05-16 21:54 ` Sean Christopherson
2025-05-17 12:43 ` Paolo Bonzini
2 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-05-16 21:54 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma
When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
nested TDP VM is run, defer allocation of the array of hashed lists used
to track shadow MMU pages until the first shadow root is allocated.
Setting the list outside of mmu_lock is safe, as concurrent readers must
hold mmu_lock in some capacity, shadow pages can only be added (or removed)
from the list when mmu_lock is held for write, and tasks that are creating
a shadow root are serialized by slots_arch_lock. I.e. it's impossible for
the list to become non-empty until all readers go away, and so readers are
guaranteed to see an empty list even if they make multiple calls to
kvm_get_mmu_page_hash() in a single mmu_lock critical section.
Use {WRITE/READ}_ONCE to set/get the list when mmu_lock isn't held for
write, out of an abundance of paranoia; no sane compiler should tear the
store or load, but it's technically possible.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/mmu/mmu.c | 60 +++++++++++++++++++++++++++++++++++-------
1 file changed, 50 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 41da2cb1e3f1..edb4ecff9917 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1983,14 +1983,33 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
return true;
}
+static __ro_after_init HLIST_HEAD(empty_page_hash);
+
+static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
+{
+ /*
+ * Load mmu_page_hash from memory exactly once, as it's set at runtime
+ * outside of mmu_lock when the TDP MMU is enabled, i.e. when the hash
+ * table of shadow pages isn't needed unless KVM needs to shadow L1's
+ * TDP for an L2 guest.
+ */
+ struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
+
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ if (!page_hash)
+ return &empty_page_hash;
+
+ return &page_hash[kvm_page_table_hashfn(gfn)];
+}
+
#define for_each_valid_sp(_kvm, _sp, _list) \
hlist_for_each_entry(_sp, _list, hash_link) \
if (is_obsolete_sp((_kvm), (_sp))) { \
} else
#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
- for_each_valid_sp(_kvm, _sp, \
- &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
+ for_each_valid_sp(_kvm, _sp, kvm_get_mmu_page_hash(_kvm, _gfn)) \
if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
@@ -2358,6 +2377,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
struct kvm_mmu_page *sp;
bool created = false;
+ /*
+ * No need for READ_ONCE(), unlike in kvm_get_mmu_page_hash(), because
+ * mmu_page_hash must be set prior to creating the first shadow root,
+ * i.e. reaching this point is fully serialized by slots_arch_lock.
+ */
+ BUG_ON(!kvm->arch.mmu_page_hash);
sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
@@ -3886,11 +3911,21 @@ static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
{
typeof(kvm->arch.mmu_page_hash) h;
+ if (kvm->arch.mmu_page_hash)
+ return 0;
+
h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
if (!h)
return -ENOMEM;
- kvm->arch.mmu_page_hash = h;
+ /*
+ * Write mmu_page_hash exactly once as there may be concurrent readers,
+ * e.g. to check for shadowed PTEs in mmu_try_to_unsync_pages(). Note,
+ * mmu_lock must be held for write to add (or remove) shadow pages, and
+ * so readers are guaranteed to see an empty list for their current
+ * mmu_lock critical section.
+ */
+ WRITE_ONCE(kvm->arch.mmu_page_hash, h);
return 0;
}
@@ -3913,9 +3948,13 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
if (kvm_shadow_root_allocated(kvm))
goto out_unlock;
+ r = kvm_mmu_alloc_page_hash(kvm);
+ if (r)
+ goto out_unlock;
+
/*
- * Check if anything actually needs to be allocated, e.g. all metadata
- * will be allocated upfront if TDP is disabled.
+ * Check if memslot metadata actually needs to be allocated, e.g. all
+ * metadata will be allocated upfront if TDP is disabled.
*/
if (kvm_memslots_have_rmaps(kvm) &&
kvm_page_track_write_tracking_enabled(kvm))
@@ -6696,12 +6735,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)
INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
- r = kvm_mmu_alloc_page_hash(kvm);
- if (r)
- return r;
-
- if (tdp_mmu_enabled)
+ if (tdp_mmu_enabled) {
kvm_mmu_init_tdp_mmu(kvm);
+ } else {
+ r = kvm_mmu_alloc_page_hash(kvm);
+ if (r)
+ return r;
+ }
kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
--
2.49.0.1112.g889b7c5bd8-goog
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-16 21:54 ` [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
@ 2025-05-17 12:35 ` Paolo Bonzini
2025-05-19 15:39 ` Sean Christopherson
2025-05-20 16:15 ` Sean Christopherson
1 sibling, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2025-05-17 12:35 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm, linux-kernel, Vipin Sharma
On 5/16/25 23:54, Sean Christopherson wrote:
> Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
> allocation before falling back to __vmalloc(), to avoid the overhead of
> establishing the virtual mappings. For non-debug builds, the SVM and VMX
> (and TDX) structures are now just below 7000 bytes in the worst case
> scenario (see below), i.e. are order-1 allocations, and will likely remain
> that way for quite some time.
>
> Add compile-time assertions in vendor code to ensure the sizes of the
> structures, sans the memslots hash tables, are order-0 allocations, i.e.
> are less than 4KiB. There's nothing fundamentally wrong with a larger
> kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
> the memslots hash tables) is below 2KiB after 18+ years of existence,
> more than doubling the size would be quite notable.
>
> Add sanity checks on the memslot hash table sizes, partly to ensure they
> aren't resized without accounting for the impact on VM structure size, and
> partly to document that the majority of the size of VM structures comes
> from the memslots.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/svm/svm.c | 2 ++
> arch/x86/kvm/vmx/main.c | 2 ++
> arch/x86/kvm/vmx/vmx.c | 2 ++
> arch/x86/kvm/x86.h | 22 ++++++++++++++++++++++
> 5 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9667d6b929ee..3a985825a945 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1961,7 +1961,7 @@ void kvm_x86_vendor_exit(void);
> #define __KVM_HAVE_ARCH_VM_ALLOC
> static inline struct kvm *kvm_arch_alloc_vm(void)
> {
> - return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> + return kvzalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT);
> }
>
> #define __KVM_HAVE_ARCH_VM_FREE
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 0ad1a6d4fb6d..d13e475c3407 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5675,6 +5675,8 @@ static int __init svm_init(void)
> {
> int r;
>
> + KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_svm);
> +
> __unused_size_checks();
>
> if (!kvm_is_svm_supported())
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index d1e02e567b57..e18dfada2e90 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -64,6 +64,8 @@ static __init int vt_hardware_setup(void)
> vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> }
>
> + KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx);
I would put either both or no checks in main.c.
Or if you use static_assert, you can also place the macro invocation
close to the struct definition.
Paolo
> + */
> +#define KVM_SANITY_CHECK_VM_STRUCT_SIZE(x) \
> +do { \
> + BUILD_BUG_ON(get_order(sizeof(struct x) - SIZE_OF_MEMSLOTS_HASHTABLE) && \
> + !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
> + BUILD_BUG_ON(get_order(sizeof(struct x)) < 2 && \
> + !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
> +} while (0)
> #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
> ({ \
> bool failed = (consistency_check); \
* Re: [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
2025-05-16 21:54 ` [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
@ 2025-05-17 12:43 ` Paolo Bonzini
2025-05-19 13:37 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Bonzini @ 2025-05-17 12:43 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm, linux-kernel, Vipin Sharma
On 5/16/25 23:54, Sean Christopherson wrote:
> When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
> nested TDP VM is run, defer allocation of the array of hashed lists used
> to track shadow MMU pages until the first shadow root is allocated.
>
> Setting the list outside of mmu_lock is safe, as concurrent readers must
> hold mmu_lock in some capacity, shadow pages can only be added (or removed)
> from the list when mmu_lock is held for write, and tasks that are creating
> a shadow root are serialized by slots_arch_lock. I.e. it's impossible for
> the list to become non-empty until all readers go away, and so readers are
> guaranteed to see an empty list even if they make multiple calls to
> kvm_get_mmu_page_hash() in a single mmu_lock critical section.
>
> Use {WRITE/READ}_ONCE to set/get the list when mmu_lock isn't held for
> write, out of an abundance of paranoia; no sane compiler should tear the
> store or load, but it's technically possible.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/mmu/mmu.c | 60 +++++++++++++++++++++++++++++++++++-------
> 1 file changed, 50 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 41da2cb1e3f1..edb4ecff9917 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1983,14 +1983,33 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
> return true;
> }
>
> +static __ro_after_init HLIST_HEAD(empty_page_hash);
> +
> +static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
> +{
> + /*
> + * Load mmu_page_hash from memory exactly once, as it's set at runtime
> + * outside of mmu_lock when the TDP MMU is enabled, i.e. when the hash
> + * table of shadow pages isn't needed unless KVM needs to shadow L1's
> + * TDP for an L2 guest.
Avoid double negations, for example "i.e. when shadow paging and the
mmu_page_hash is only needed for a L1 hypervisor's TDP".
> + */
> + struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
This READ_ONCE is more for the (now implicit) smp_read_barrier_depends()
than for "loading exactly once". Might as well use smp_load_acquire()
since it's free on x86.
> + lockdep_assert_held(&kvm->mmu_lock);
Move this comment here from kvm_mmu_alloc_page_hash():
/* mmu_lock must be held for write to add (or remove) shadow
* pages, and so readers are guaranteed to see an empty list for
* their current mmu_lock critical section.
*/
> + if (!page_hash)
> + return &empty_page_hash;
> +
> + return &page_hash[kvm_page_table_hashfn(gfn)];
> +}
> +
> #define for_each_valid_sp(_kvm, _sp, _list) \
> hlist_for_each_entry(_sp, _list, hash_link) \
> if (is_obsolete_sp((_kvm), (_sp))) { \
> } else
>
> #define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
> - for_each_valid_sp(_kvm, _sp, \
> - &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)]) \
> + for_each_valid_sp(_kvm, _sp, kvm_get_mmu_page_hash(_kvm, _gfn)) \
> if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
>
> static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> @@ -2358,6 +2377,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> struct kvm_mmu_page *sp;
> bool created = false;
>
> + /*
> + * No need for READ_ONCE(), unlike in kvm_get_mmu_page_hash(), because
> + * mmu_page_hash must be set prior to creating the first shadow root,
> + * i.e. reaching this point is fully serialized by slots_arch_lock.
> + */
> + BUG_ON(!kvm->arch.mmu_page_hash);
> sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
>
> sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
> @@ -3886,11 +3911,21 @@ static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
> {
> typeof(kvm->arch.mmu_page_hash) h;
>
> + if (kvm->arch.mmu_page_hash)
> + return 0;
> +
> h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
> if (!h)
> return -ENOMEM;
>
> - kvm->arch.mmu_page_hash = h;
> + /*
> + * Write mmu_page_hash exactly once as there may be concurrent readers,
> + * e.g. to check for shadowed PTEs in mmu_try_to_unsync_pages(). Note,
> + * mmu_lock must be held for write to add (or remove) shadow pages, and
> + * so readers are guaranteed to see an empty list for their current
> + * mmu_lock critical section.
> + */
> + WRITE_ONCE(kvm->arch.mmu_page_hash, h);
Use smp_store_release here (unlike READ_ONCE(), it's technically
incorrect to use WRITE_ONCE() here!), with a remark that it pairs with
kvm_get_mmu_page_hash(). That's both more accurate and leads to a
better comment than "write exactly once".
Paolo
* Re: [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
2025-05-17 12:43 ` Paolo Bonzini
@ 2025-05-19 13:37 ` Sean Christopherson
2025-05-19 15:29 ` James Houghton
0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-05-19 13:37 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma
On Sat, May 17, 2025, Paolo Bonzini wrote:
> On 5/16/25 23:54, Sean Christopherson wrote:
> > + /*
> > + * Write mmu_page_hash exactly once as there may be concurrent readers,
> > + * e.g. to check for shadowed PTEs in mmu_try_to_unsync_pages(). Note,
> > + * mmu_lock must be held for write to add (or remove) shadow pages, and
> > + * so readers are guaranteed to see an empty list for their current
> > + * mmu_lock critical section.
> > + */
> > + WRITE_ONCE(kvm->arch.mmu_page_hash, h);
>
> Use smp_store_release here (unlike READ_ONCE(), it's technically incorrect
> to use WRITE_ONCE() here!),
Can you elaborate why? Due to my x86-centric life, my memory ordering knowledge
is woefully inadequate.
> with a remark that it pairs with kvm_get_mmu_page_hash(). That's both more
> accurate and leads to a better comment than "write exactly once".
* Re: [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
2025-05-19 13:37 ` Sean Christopherson
@ 2025-05-19 15:29 ` James Houghton
2025-05-19 15:51 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: James Houghton @ 2025-05-19 15:29 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Vipin Sharma
On Mon, May 19, 2025 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Sat, May 17, 2025, Paolo Bonzini wrote:
> > On 5/16/25 23:54, Sean Christopherson wrote:
> > > + /*
> > > + * Write mmu_page_hash exactly once as there may be concurrent readers,
> > > + * e.g. to check for shadowed PTEs in mmu_try_to_unsync_pages(). Note,
> > > + * mmu_lock must be held for write to add (or remove) shadow pages, and
> > > + * so readers are guaranteed to see an empty list for their current
> > > + * mmu_lock critical section.
> > > + */
> > > + WRITE_ONCE(kvm->arch.mmu_page_hash, h);
> >
> > Use smp_store_release here (unlike READ_ONCE(), it's technically incorrect
> > to use WRITE_ONCE() here!),
>
> Can you elaborate why? Due to my x86-centric life, my memory ordering knowledge
> is woefully inadequate.
The compiler must be prohibited from reordering stores preceding this
WRITE_ONCE() to after it.
In reality, the only stores that matter will be from within
kvcalloc(), and the important bits of it will not be inlined, so it's
unlikely that the compiler would actually do such reordering. But it's
nonetheless allowed. :) barrier() is precisely what is needed to
prohibit this; smp_store_release() on x86 is merely barrier() +
WRITE_ONCE().
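For reference, the x86 definitions (going from memory of asm/barrier.h, so
modulo exact details) boil down to:
  #define __smp_store_release(p, v)			\
  do {							\
  	compiletime_assert_atomic_type(*p);		\
  	barrier();					\
  	WRITE_ONCE(*p, v);				\
  } while (0)
  #define __smp_load_acquire(p)				\
  ({							\
  	typeof(*p) ___p1 = READ_ONCE(*p);		\
  	compiletime_assert_atomic_type(*p);		\
  	barrier();					\
  	___p1;						\
  })
i.e. the only delta from plain WRITE_ONCE()/READ_ONCE() is the compiler
barrier.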
Semantically, smp_store_release() is what you mean to write, as Paolo
said. We're not really *only* preventing torn accesses, we also need
to ensure that any threads that read kvm->arch.mmu_page_hash can
actually use that result (i.e., that all the stores from the
kvcalloc() are visible). This sounds a little bit weird for x86 code,
but compiler reordering is still a possibility.
And I also agree with what Paolo said about smp_load_acquire(). :)
Thanks Paolo. Please correct me if I'm wrong above.
> > with a remark that it pairs with kvm_get_mmu_page_hash(). That's both more
> > accurate and leads to a better comment than "write exactly once".
>
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-17 12:35 ` Paolo Bonzini
@ 2025-05-19 15:39 ` Sean Christopherson
2025-05-20 14:42 ` Paolo Bonzini
2025-05-20 22:49 ` Huang, Kai
0 siblings, 2 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-05-19 15:39 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma
On Sat, May 17, 2025, Paolo Bonzini wrote:
> On 5/16/25 23:54, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 0ad1a6d4fb6d..d13e475c3407 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -5675,6 +5675,8 @@ static int __init svm_init(void)
> > {
> > int r;
> > + KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_svm);
> > +
> > __unused_size_checks();
> > if (!kvm_is_svm_supported())
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index d1e02e567b57..e18dfada2e90 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -64,6 +64,8 @@ static __init int vt_hardware_setup(void)
> > vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> > }
> > + KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx);
>
> I would put either both or no checks in main.c.
Yeah, I agree the current split is ugly. I originally had 'em both in main.c,
but then the assert effectively becomes dependent on CONFIG_KVM_INTEL_TDX=y.
Aha! If we add a proper tdx_hardware_setup(), then there's a convenient location
for the assert, IMO it's much easier to see/document the "TDX module not loaded"
behavior, and the TDX-specific kvm_x86_ops hooks don't need to be visible symbols.
I'll slot the below in, unless you've got a better idea.
> Or if you use static_assert, you can also place the macro invocation close
> to the struct definition.
Already tried that, get_order() doesn't play nice with static_assert :-/
--
From: Sean Christopherson <seanjc@google.com>
Date: Mon, 19 May 2025 07:15:27 -0700
Subject: [PATCH] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
Move TDX hardware setup to tdx.c, as the code is obviously TDX specific,
co-locating the setup with tdx_bringup() makes it easier to see and
document the success_disable_tdx "error" path, and configuring the TDX
specific hooks in tdx.c reduces the number of globally visible TDX symbols.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/vmx/main.c | 36 ++---------------------------
arch/x86/kvm/vmx/tdx.c | 46 +++++++++++++++++++++++++++-----------
arch/x86/kvm/vmx/tdx.h | 1 +
arch/x86/kvm/vmx/x86_ops.h | 10 ---------
4 files changed, 36 insertions(+), 57 deletions(-)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..d7178d15ac8f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -29,40 +29,8 @@ static __init int vt_hardware_setup(void)
if (ret)
return ret;
- /*
- * Update vt_x86_ops::vm_size here so it is ready before
- * kvm_ops_update() is called in kvm_x86_vendor_init().
- *
- * Note, the actual bringing up of TDX must be done after
- * kvm_ops_update() because enabling TDX requires enabling
- * hardware virtualization first, i.e., all online CPUs must
- * be in post-VMXON state. This means the @vm_size here
- * may be updated to TDX's size but TDX may fail to enable
- * at later time.
- *
- * The VMX/VT code could update kvm_x86_ops::vm_size again
- * after bringing up TDX, but this would require exporting
- * either kvm_x86_ops or kvm_ops_update() from the base KVM
- * module, which looks overkill. Anyway, the worst case here
- * is KVM may allocate couple of more bytes than needed for
- * each VM.
- */
- if (enable_tdx) {
- vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
- sizeof(struct kvm_tdx));
- /*
- * Note, TDX may fail to initialize in a later time in
- * vt_init(), in which case it is not necessary to setup
- * those callbacks. But making them valid here even
- * when TDX fails to init later is fine because those
- * callbacks won't be called if the VM isn't TDX guest.
- */
- vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
- vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
- vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
- vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
- vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
- }
+ if (enable_tdx)
+ tdx_hardware_setup();
return 0;
}
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..b4985a64501c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -738,7 +738,7 @@ bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
!to_tdx(vcpu)->vp_enter_args.r12;
}
-bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+static bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
u64 vcpu_state_details;
@@ -1543,8 +1543,8 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
return 0;
}
-int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn)
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
struct page *page = pfn_to_page(pfn);
@@ -1624,8 +1624,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
return 0;
}
-int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, void *private_spt)
+static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
{
int tdx_level = pg_level_to_tdx_sept_level(level);
gpa_t gpa = gfn_to_gpa(gfn);
@@ -1760,8 +1760,8 @@ static void tdx_track(struct kvm *kvm)
kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
}
-int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, void *private_spt)
+static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, void *private_spt)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1783,8 +1783,8 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
return tdx_reclaim_page(virt_to_page(private_spt));
}
-int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn)
+static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+ enum pg_level level, kvm_pfn_t pfn)
{
struct page *page = pfn_to_page(pfn);
int ret;
@@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
r = __tdx_bringup();
if (r) {
/*
- * Disable TDX only but don't fail to load module if
- * the TDX module could not be loaded. No need to print
- * message saying "module is not loaded" because it was
- * printed when the first SEAMCALL failed.
+ * Disable TDX only but don't fail to load module if the TDX
+ * module could not be loaded. No need to print message saying
+ * "module is not loaded" because it was printed when the first
+ * SEAMCALL failed. Don't bother unwinding the S-EPT hooks or
+ * vm_size, as kvm_x86_ops have already been finalized (and are
+ * intentionally not exported). The S-EPT code is unreachable,
+ * and allocating a few more bytes per VM in a should-be-rare
+ * failure scenario is a non-issue.
*/
if (r == -ENODEV)
goto success_disable_tdx;
@@ -3524,3 +3528,19 @@ int __init tdx_bringup(void)
enable_tdx = 0;
return 0;
}
+
+
+void __init tdx_hardware_setup(void)
+{
+ /*
+ * Note, if the TDX module can't be loaded, KVM TDX support will be
+ * disabled but KVM will continue loading (see tdx_bringup()).
+ */
+ vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
+
+ vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
+ vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
+ vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
+ vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+ vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
+}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 51f98443e8a2..ca39a9391db1 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -8,6 +8,7 @@
#ifdef CONFIG_KVM_INTEL_TDX
#include "common.h"
+void tdx_hardware_setup(void);
int tdx_bringup(void);
void tdx_cleanup(void);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..87e855276a88 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -136,7 +136,6 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
void tdx_vcpu_put(struct kvm_vcpu *vcpu);
-bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
int tdx_handle_exit(struct kvm_vcpu *vcpu,
enum exit_fastpath_completion fastpath);
@@ -151,15 +150,6 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
-int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, void *private_spt);
-int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, void *private_spt);
-int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn);
-int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
- enum pg_level level, kvm_pfn_t pfn);
-
void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
base-commit: 7ef51a41466bc846ad794d505e2e34ff97157f7f
--
* Re: [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
2025-05-19 15:29 ` James Houghton
@ 2025-05-19 15:51 ` Sean Christopherson
0 siblings, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-05-19 15:51 UTC (permalink / raw)
To: James Houghton; +Cc: Paolo Bonzini, kvm, linux-kernel, Vipin Sharma
On Mon, May 19, 2025, James Houghton wrote:
> On Mon, May 19, 2025 at 9:37 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Sat, May 17, 2025, Paolo Bonzini wrote:
> > > On 5/16/25 23:54, Sean Christopherson wrote:
> > > > + /*
> > > > + * Write mmu_page_hash exactly once as there may be concurrent readers,
> > > > + * e.g. to check for shadowed PTEs in mmu_try_to_unsync_pages(). Note,
> > > > + * mmu_lock must be held for write to add (or remove) shadow pages, and
> > > > + * so readers are guaranteed to see an empty list for their current
> > > > + * mmu_lock critical section.
> > > > + */
> > > > + WRITE_ONCE(kvm->arch.mmu_page_hash, h);
> > >
> > > Use smp_store_release here (unlike READ_ONCE(), it's technically incorrect
> > > to use WRITE_ONCE() here!),
> >
> > Can you elaborate why? Due to my x86-centric life, my memory ordering knowledge
> > is woefully inadequate.
>
> The compiler must be prohibited from reordering stores preceding this
> WRITE_ONCE() to after it.
>
> In reality, the only stores that matter will be from within
> kvcalloc(), and the important bits of it will not be inlined, so it's
> unlikely that the compiler would actually do such reordering. But it's
> nonetheless allowed. :) barrier() is precisely what is needed to
> prohibit this; smp_store_release() on x86 is merely barrier() +
> WRITE_ONCE().
>
> Semantically, smp_store_release() is what you mean to write, as Paolo
> said. We're not really *only* preventing torn accesses, we also need
> to ensure that any threads that read kvm->arch.mmu_page_hash can
> actually use that result (i.e., that all the stores from the
> kvcalloc() are visible).
Ah, that's what I was missing. It's not KVM's stores that are theoretically
problematic, it's the zeroing of kvm->arch.mmu_page_hash that needs protection.
Thanks!
> This sounds a little bit weird for x86 code, but compiler reordering is still
> a possibility.
>
> And I also agree with what Paolo said about smp_load_acquire(). :)
>
> Thanks Paolo. Please correct me if I'm wrong above.
>
> > > with a remark that it pairs with kvm_get_mmu_page_hash(). That's both more
> > > accurate and leads to a better comment than "write exactly once".
> >
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-19 15:39 ` Sean Christopherson
@ 2025-05-20 14:42 ` Paolo Bonzini
2025-05-20 22:49 ` Huang, Kai
1 sibling, 0 replies; 19+ messages in thread
From: Paolo Bonzini @ 2025-05-20 14:42 UTC (permalink / raw)
To: Sean Christopherson; +Cc: kvm, linux-kernel, Vipin Sharma
On Mon, May 19, 2025 at 5:39 PM Sean Christopherson <seanjc@google.com> wrote:
> I'll slot the below in, unless you've got a better idea.
Nope, that's a good idea. Should have thought about it a couple months ago.
Paolo
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-16 21:54 ` [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
2025-05-17 12:35 ` Paolo Bonzini
@ 2025-05-20 16:15 ` Sean Christopherson
1 sibling, 0 replies; 19+ messages in thread
From: Sean Christopherson @ 2025-05-20 16:15 UTC (permalink / raw)
To: Paolo Bonzini, kvm, linux-kernel, Vipin Sharma
On Fri, May 16, 2025, Sean Christopherson wrote:
> +/*
> + * Assert that "struct kvm_{svm,vmx,tdx}" is an order-0 or order-1 allocation.
> + * Spilling over to an order-2 allocation isn't fundamentally problematic, but
> + * isn't expected to happen in the foreseeable future (O(years)). Assert that
> + * the size is an order-0 allocation when ignoring the memslot hash tables, to
> + * help detect and debug unexpected size increases.
> + */
> +#define KVM_SANITY_CHECK_VM_STRUCT_SIZE(x) \
> +do { \
> + BUILD_BUG_ON(get_order(sizeof(struct x) - SIZE_OF_MEMSLOTS_HASHTABLE) && \
> + !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
> + BUILD_BUG_ON(get_order(sizeof(struct x)) < 2 && \
Ugh, I jinxed myself. My severe ineptitude and inability to test persists. I
inverted this check, it should be:
get_order(sizeof(struct x)) >= 2
> + !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN)); \
> +} while (0)
> +
> #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check) \
> ({ \
> bool failed = (consistency_check); \
> --
> 2.49.0.1112.g889b7c5bd8-goog
>
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-19 15:39 ` Sean Christopherson
2025-05-20 14:42 ` Paolo Bonzini
@ 2025-05-20 22:49 ` Huang, Kai
2025-05-20 23:11 ` Sean Christopherson
1 sibling, 1 reply; 19+ messages in thread
From: Huang, Kai @ 2025-05-20 22:49 UTC (permalink / raw)
To: pbonzini@redhat.com, seanjc@google.com
Cc: kvm@vger.kernel.org, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Mon, 2025-05-19 at 08:39 -0700, Sean Christopherson wrote:
> On Sat, May 17, 2025, Paolo Bonzini wrote:
> > On 5/16/25 23:54, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 0ad1a6d4fb6d..d13e475c3407 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -5675,6 +5675,8 @@ static int __init svm_init(void)
> > > {
> > > int r;
> > > + KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_svm);
> > > +
> > > __unused_size_checks();
> > > if (!kvm_is_svm_supported())
> > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > > index d1e02e567b57..e18dfada2e90 100644
> > > --- a/arch/x86/kvm/vmx/main.c
> > > +++ b/arch/x86/kvm/vmx/main.c
> > > @@ -64,6 +64,8 @@ static __init int vt_hardware_setup(void)
> > > vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> > > }
> > > + KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx);
> >
> > I would put either both or no checks in main.c.
>
> Yeah, I agree the current split is ugly. I originally had 'em both in main.c,
> but then the assert effectively becomes dependent on CONFIG_KVM_INTEL_TDX=y.
>
> Aha! If we add a proper tdx_hardware_setup(), then there's a convenient location
> for the assert, IMO it's much easier to see/document the "TDX module not loaded"
> behavior, and the TDX-specific kvm_x86_ops hooks don't need to be visible symbols.
>
> I'll slot the below in, unless you've got a better idea.
Looks good to me too. Minor things below.
[...]
> +static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> + enum pg_level level, kvm_pfn_t pfn)
> {
> struct page *page = pfn_to_page(pfn);
> int ret;
> @@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
> r = __tdx_bringup();
> if (r) {
> /*
> - * Disable TDX only but don't fail to load module if
> - * the TDX module could not be loaded. No need to print
> - * message saying "module is not loaded" because it was
> - * printed when the first SEAMCALL failed.
> + * Disable TDX only but don't fail to load module if the TDX
> + * module could not be loaded. No need to print message saying
> + * "module is not loaded" because it was printed when the first
> + * SEAMCALL failed. Don't bother unwinding the S-EPT hooks or
> + * vm_size, as kvm_x86_ops have already been finalized (and are
> + * intentionally not exported). The S-EPT code is unreachable,
> + * and allocating a few more bytes per VM in a should-be-rare
> + * failure scenario is a non-issue.
> */
> if (r == -ENODEV)
> goto success_disable_tdx;
> @@ -3524,3 +3528,19 @@ int __init tdx_bringup(void)
> enable_tdx = 0;
> return 0;
> }
> +
> +
> +void __init tdx_hardware_setup(void)
> +{
> + /*
> + * Note, if the TDX module can't be loaded, KVM TDX support will be
> + * disabled but KVM will continue loading (see tdx_bringup()).
> + */
This comment seems a little bit weird to me. I think what you meant here is
that @vm_size and those S-EPT ops are not unwound when TDX cannot be brought up
but KVM is still loaded.
> + vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
> +
> + vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
> + vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
> + vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
> + vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
> + vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> +}
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 51f98443e8a2..ca39a9391db1 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -8,6 +8,7 @@
> #ifdef CONFIG_KVM_INTEL_TDX
> #include "common.h"
>
> +void tdx_hardware_setup(void);
> int tdx_bringup(void);
> void tdx_cleanup(void);
>
There's a build error when CONFIG_KVM_INTEL_TDX is off:
vmx/main.c: In function ‘vt_hardware_setup’:
vmx/main.c:34:17: error: implicit declaration of function ‘tdx_hardware_setup’;
did you mean ‘vmx_hardware_setup’? [-Wimplicit-function-declaration]
34 | tdx_hardware_setup();
| ^~~~~~~~~~~~~~~~~~
| vmx_hardware_setup
.. for which you need a stub for tdx_hardware_setup() when CONFIG_KVM_INTEL_TDX
is off.
And one more thing:
With the above patch, we still have below code in vt_init():
/*
* TDX and VMX have different vCPU structures. Calculate the
* maximum size/align so that kvm_init() can use the larger
* values to create the kmem_vcpu_cache.
*/
vcpu_size = sizeof(struct vcpu_vmx);
vcpu_align = __alignof__(struct vcpu_vmx);
if (enable_tdx) {
vcpu_size = max_t(unsigned, vcpu_size,
sizeof(struct vcpu_tdx));
vcpu_align = max_t(unsigned, vcpu_align,
__alignof__(struct vcpu_tdx));
kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM);
}
It's kinda ugly too IMHO.
Since we already have @vm_size in kvm_x86_ops, how about also adding vcpu_size
and vcpu_align to it? Then they can be treated in the same way as vm_size for
TDX.
They are not needed for SVM, but it doesn't hurt that much?
Compile test only:
From b3a4b29cbd371189860b103813354813912019fa Mon Sep 17 00:00:00 2001
From: Kai Huang <kai.huang@intel.com>
Date: Wed, 21 May 2025 10:32:28 +1200
Subject: [PATCH] vcpu size/align for TDX
Signed-off-by: Kai Huang <kai.huang@intel.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/svm/svm.c | 6 ++++--
arch/x86/kvm/vmx/main.c | 20 +++-----------------
arch/x86/kvm/vmx/tdx.c | 7 +++++++
4 files changed, 16 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ed9b65785a24..7b96b6b30a5c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1684,6 +1684,8 @@ struct kvm_x86_ops {
void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
unsigned int vm_size;
+ unsigned int vcpu_size;
+ unsigned int vcpu_align;
int (*vm_init)(struct kvm *kvm);
void (*vm_destroy)(struct kvm *kvm);
void (*vm_pre_destroy)(struct kvm *kvm);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dc8e9af49f11..6a43d6402219 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5094,6 +5094,9 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.vcpu_reset = svm_vcpu_reset,
.vm_size = sizeof(struct kvm_svm),
+ .vcpu_size = sizeof(struct vcpu_svm),
+ .vcpu_align = __alignof__(struct vcpu_svm),
+
.vm_init = svm_vm_init,
.vm_destroy = svm_vm_destroy,
@@ -5543,8 +5546,7 @@ static int __init svm_init(void)
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
*/
- r = kvm_init(sizeof(struct vcpu_svm), __alignof__(struct vcpu_svm),
- THIS_MODULE);
+ r = kvm_init(svm_x86_ops.vcpu_size, svm_x86_ops.vcpu_align,
THIS_MODULE);
if (r)
goto err_kvm_init;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c064d79b7550..86fdcaca7061 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -879,6 +879,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.has_emulated_msr = vt_has_emulated_msr,
.vm_size = sizeof(struct kvm_vmx),
+ .vcpu_size = sizeof(struct vcpu_vmx),
+ .vcpu_align = __alignof__(struct vcpu_vmx),
.vm_init = vt_vm_init,
.vm_pre_destroy = vt_vm_pre_destroy,
@@ -1035,7 +1037,6 @@ module_exit(vt_exit);
static int __init vt_init(void)
{
- unsigned vcpu_size, vcpu_align;
int r;
r = vmx_init();
@@ -1047,26 +1048,11 @@ static int __init vt_init(void)
if (r)
goto err_tdx_bringup;
- /*
- * TDX and VMX have different vCPU structures. Calculate the
- * maximum size/align so that kvm_init() can use the larger
- * values to create the kmem_vcpu_cache.
- */
- vcpu_size = sizeof(struct vcpu_vmx);
- vcpu_align = __alignof__(struct vcpu_vmx);
- if (enable_tdx) {
- vcpu_size = max_t(unsigned, vcpu_size,
- sizeof(struct vcpu_tdx));
- vcpu_align = max_t(unsigned, vcpu_align,
- __alignof__(struct vcpu_tdx));
- kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM);
- }
-
/*
* Common KVM initialization _must_ come last, after this, /dev/kvm is
* exposed to userspace!
*/
- r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
+ r = kvm_init(vt_x86_ops.vcpu_size, vt_x86_ops.vcpu_align, THIS_MODULE);
if (r)
goto err_kvm_init;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b4985a64501c..dc44cabf89e6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3522,6 +3522,9 @@ int __init tdx_bringup(void)
enable_tdx = 0;
}
+ if (enable_tdx)
+ kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM);
+
return r;
success_disable_tdx:
@@ -3537,6 +3540,10 @@ void __init tdx_hardware_setup(void)
* disabled but KVM will continue loading (see tdx_bringup()).
*/
vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
sizeof(struct kvm_tdx));
+ vt_x86_ops.vcpu_size = max_t(unsigned int, vt_x86_ops.vcpu_size,
+ sizeof(struct vcpu_tdx));
+ vt_x86_ops.vcpu_align = max_t(unsigned int, vt_x86_ops.vcpu_align,
+ __alignof__(struct vcpu_tdx));
vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
--
2.49.0
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-20 22:49 ` Huang, Kai
@ 2025-05-20 23:11 ` Sean Christopherson
2025-05-20 23:57 ` Huang, Kai
0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-05-20 23:11 UTC (permalink / raw)
To: Kai Huang
Cc: pbonzini@redhat.com, kvm@vger.kernel.org, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Tue, May 20, 2025, Kai Huang wrote:
> On Mon, 2025-05-19 at 08:39 -0700, Sean Christopherson wrote:
> > +static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> > + enum pg_level level, kvm_pfn_t pfn)
> > {
> > struct page *page = pfn_to_page(pfn);
> > int ret;
> > @@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
> > r = __tdx_bringup();
> > if (r) {
> > /*
> > - * Disable TDX only but don't fail to load module if
> > - * the TDX module could not be loaded. No need to print
> > - * message saying "module is not loaded" because it was
> > - * printed when the first SEAMCALL failed.
> > + * Disable TDX only but don't fail to load module if the TDX
> > + * module could not be loaded. No need to print message saying
> > + * "module is not loaded" because it was printed when the first
> > + * SEAMCALL failed. Don't bother unwinding the S-EPT hooks or
> > + * vm_size, as kvm_x86_ops have already been finalized (and are
> > + * intentionally not exported). The S-EPT code is unreachable,
> > + * and allocating a few more bytes per VM in a should-be-rare
> > + * failure scenario is a non-issue.
> > */
> > if (r == -ENODEV)
> > goto success_disable_tdx;
> > @@ -3524,3 +3528,19 @@ int __init tdx_bringup(void)
> > enable_tdx = 0;
> > return 0;
> > }
> > +
> > +
> > +void __init tdx_hardware_setup(void)
> > +{
> > + /*
> > + * Note, if the TDX module can't be loaded, KVM TDX support will be
> > + * disabled but KVM will continue loading (see tdx_bringup()).
> > + */
>
> This comment seems a little bit weird to me. I think what you meant here is the
> @vm_size and those S-EPT ops are not unwound while TDX cannot be brought up but
> KVM is still loaded.
This comment is weird? Or the one in tdx_bringup() is weird? The sole intent
of _this_ comment is to clarify that KVM could still end up loading with TDX
disabled.
that's where the actual decision to not reject KVM load and to not undo the setup
lives.
> > +
> > + vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
> > + vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
> > + vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
> > + vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
> > + vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> > +}
> > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > index 51f98443e8a2..ca39a9391db1 100644
> > --- a/arch/x86/kvm/vmx/tdx.h
> > +++ b/arch/x86/kvm/vmx/tdx.h
> > @@ -8,6 +8,7 @@
> > #ifdef CONFIG_KVM_INTEL_TDX
> > #include "common.h"
> >
> > +void tdx_hardware_setup(void);
> > int tdx_bringup(void);
> > void tdx_cleanup(void);
> >
>
> There's a build error when CONFIG_KVM_INTEL_TDX is off:
>
> vmx/main.c: In function ‘vt_hardware_setup’:
> vmx/main.c:34:17: error: implicit declaration of function ‘tdx_hardware_setup’;
> did you mean ‘vmx_hardware_setup’? [-Wimplicit-function-declaration]
> 34 | tdx_hardware_setup();
> | ^~~~~~~~~~~~~~~~~~
> | vmx_hardware_setup
>
> .. for which you need a stub for tdx_hardware_setup() when CONFIG_KVM_INTEL_TDX
> is off.
Not in kvm-x86/next, commit 907092bf7cbd ("KVM: VMX: Clean up and macrofy x86_ops")
buried all of vt_hardware_setup() behind CONFIG_KVM_INTEL_TDX=y.
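
For trees where vt_hardware_setup() is still compiled unconditionally (e.g.
kvm-coco-queue, as noted below), the stub would follow the usual pattern in
tdx.h; a sketch:

#ifdef CONFIG_KVM_INTEL_TDX
void tdx_hardware_setup(void);
#else
/* Keep vt_hardware_setup() buildable when TDX support is compiled out. */
static inline void tdx_hardware_setup(void) {}
#endif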
> And one more thing:
>
> With the above patch, we still have below code in vt_init():
>
> /*
> * TDX and VMX have different vCPU structures. Calculate the
> * maximum size/align so that kvm_init() can use the larger
> * values to create the kmem_vcpu_cache.
> */
> vcpu_size = sizeof(struct vcpu_vmx);
> vcpu_align = __alignof__(struct vcpu_vmx);
> if (enable_tdx) {
> vcpu_size = max_t(unsigned, vcpu_size,
> sizeof(struct vcpu_tdx));
> vcpu_align = max_t(unsigned, vcpu_align,
> __alignof__(struct vcpu_tdx));
> kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM);
> }
>
> It's kinda ugly too IMHO.
>
> Since we already have @vm_size in kvm_x86_ops, how about also adding vcpu_size
> and vcpu_align to it? Then they can be treated in the same way as vm_size for
> TDX.
>
> They are not needed for SVM, but it doesn't hurt that much?
I'd rather not. vt_init() already needs to be aware of TDX, e.g. to call into
tdx_bringup() in the first place. Shoving state into kvm_x86_ops that is only
ever used in vt_init() (an __init function at that) isn't a net positive.
Putting the fields in kvm_x86_init_ops would be better, but I still don't think
the complexity and indirection is justified. Bleeding gory TDX details into the
common code is undesirable, but I don't see the size of vcpu_tdx or the fact that
enable_tdx==true means KVM_X86_TDX_VM is supported as being information worth
hiding.
kvm_x86_ops.vm_size is a means to an end. E.g. if kvm_vcpu_cache didn't exist,
then we'd probably want/need kvm_x86_ops.vcpu_size, but it does exist, so...
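
To make the "means to an end" concrete: the one consumer of vm_size is the
common VM allocation helper, which with patch 2/3 becomes roughly the
following (a sketch; the exact location and GFP flags may differ):

static inline struct kvm *kvm_arch_alloc_vm(void)
{
	/* The vendor-sized allocation is the sole reason vm_size lives in kvm_x86_ops. */
	return kvzalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT);
}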
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-20 23:11 ` Sean Christopherson
@ 2025-05-20 23:57 ` Huang, Kai
2025-05-21 17:12 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: Huang, Kai @ 2025-05-20 23:57 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Tue, 2025-05-20 at 16:11 -0700, Sean Christopherson wrote:
> On Tue, May 20, 2025, Kai Huang wrote:
> > On Mon, 2025-05-19 at 08:39 -0700, Sean Christopherson wrote:
> > > +static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> > > + enum pg_level level, kvm_pfn_t pfn)
> > > {
> > > struct page *page = pfn_to_page(pfn);
> > > int ret;
> > > @@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
> > > r = __tdx_bringup();
> > > if (r) {
> > > /*
> > > - * Disable TDX only but don't fail to load module if
> > > - * the TDX module could not be loaded. No need to print
> > > - * message saying "module is not loaded" because it was
> > > - * printed when the first SEAMCALL failed.
> > > + * Disable TDX only but don't fail to load module if the TDX
> > > + * module could not be loaded. No need to print message saying
> > > + * "module is not loaded" because it was printed when the first
> > > + * SEAMCALL failed. Don't bother unwinding the S-EPT hooks or
> > > + * vm_size, as kvm_x86_ops have already been finalized (and are
> > > + * intentionally not exported). The S-EPT code is unreachable,
> > > + * and allocating a few more bytes per VM in a should-be-rare
> > > + * failure scenario is a non-issue.
> > > */
> > > if (r == -ENODEV)
> > > goto success_disable_tdx;
> > > @@ -3524,3 +3528,19 @@ int __init tdx_bringup(void)
> > > enable_tdx = 0;
> > > return 0;
> > > }
> > > +
> > > +
> > > +void __init tdx_hardware_setup(void)
> > > +{
> > > + /*
> > > + * Note, if the TDX module can't be loaded, KVM TDX support will be
> > > + * disabled but KVM will continue loading (see tdx_bringup()).
> > > + */
> >
> > This comment seems a little bit weird to me. I think what you meant here is the
> > @vm_size and those S-EPT ops are not unwound while TDX cannot be brought up but
> > KVM is still loaded.
>
> This comment is weird? Or the one in tdx_bringup() is weird?
>
I definitely agree tdx_bringup() is weird :-)
> The sole intent
> of _this_ comment is to clarify that KVM could still end up running with TDX
> disabled.
>
But this behaviour itself doesn't mean anything, e.g., if we export kvm_x86_ops,
we could unwind them. So without mentioning "those are not unwound", it doesn't
seem useful to me.
But it does have "(see tdx_bringup())" at the end, so OK to me. I guess I just
wish it could be more verbose.
> The comment about not unwinding S-EPT resides in tdx_bringup(), because
> that's where the actual decision to not reject KVM load and to not undo the setup
> lives.
Right.
>
> > > +
> > > + vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
> > > + vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
> > > + vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
> > > + vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
> > > + vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
> > > +}
> > > diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> > > index 51f98443e8a2..ca39a9391db1 100644
> > > --- a/arch/x86/kvm/vmx/tdx.h
> > > +++ b/arch/x86/kvm/vmx/tdx.h
> > > @@ -8,6 +8,7 @@
> > > #ifdef CONFIG_KVM_INTEL_TDX
> > > #include "common.h"
> > >
> > > +void tdx_hardware_setup(void);
> > > int tdx_bringup(void);
> > > void tdx_cleanup(void);
> > >
> >
> > There's a build error when CONFIG_KVM_INTEL_TDX is off:
> >
> > vmx/main.c: In function ‘vt_hardware_setup’:
> > vmx/main.c:34:17: error: implicit declaration of function ‘tdx_hardware_setup’;
> > did you mean ‘vmx_hardware_setup’? [-Wimplicit-function-declaration]
> > 34 | tdx_hardware_setup();
> > | ^~~~~~~~~~~~~~~~~~
> > | vmx_hardware_setup
> >
> > .. for which you need a stub for tdx_hardware_setup() when CONFIG_KVM_INTEL_TDX
> > is off.
>
> Not in kvm-x86/next, commit 907092bf7cbd ("KVM: VMX: Clean up and macrofy x86_ops")
> buried all of vt_hardware_setup() behind CONFIG_KVM_INTEL_TDX=y.
Oh I was using kvm-coco-queue. Thanks for pointing out.
>
> > And one more thing:
> >
> > With the above patch, we still have below code in vt_init():
> >
> > /*
> > * TDX and VMX have different vCPU structures. Calculate the
> > * maximum size/align so that kvm_init() can use the larger
> > * values to create the kmem_vcpu_cache.
> > */
> > vcpu_size = sizeof(struct vcpu_vmx);
> > vcpu_align = __alignof__(struct vcpu_vmx);
> > if (enable_tdx) {
> > vcpu_size = max_t(unsigned, vcpu_size,
> > sizeof(struct vcpu_tdx));
> > vcpu_align = max_t(unsigned, vcpu_align,
> > __alignof__(struct vcpu_tdx));
> > kvm_caps.supported_vm_types |= BIT(KVM_X86_TDX_VM);
> > }
> >
> > It's kinda ugly too IMHO.
> >
> > Since we already have @vm_size in kvm_x86_ops, how about also adding vcpu_size
> > and vcpu_align to it? Then they can be treated in the same way as vm_size for
> > TDX.
> >
> > They are not needed for SVM, but it doesn't hurt that much?
>
> I'd rather not. vt_init() already needs to be aware of TDX, e.g. to call into
> tdx_bringup() in the first place. Shoving state into kvm_x86_ops that is only
> ever used in vt_init() (an __init function at that) isn't a net positive.
>
> Putting the fields in kvm_x86_init_ops would be better, but I still don't think
> the complexity and indirection is justified. Bleeding gory TDX details into the
> common code is undesirable, but I don't see the size of vcpu_tdx or the fact that
> enable_tdx==true means KVM_X86_TDX_VM is supported as being information worth
> hiding.
>
> kvm_x86_ops.vm_size is a means to an end. E.g. if kvm_vcpu_cache didn't exist,
> then we'd probably want/need kvm_x86_ops.vcpu_size, but it does exist, so...
Sure. Thanks for clarification.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-20 23:57 ` Huang, Kai
@ 2025-05-21 17:12 ` Sean Christopherson
2025-05-21 22:43 ` Huang, Kai
0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-05-21 17:12 UTC (permalink / raw)
To: Kai Huang
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Tue, May 20, 2025, Kai Huang wrote:
> On Tue, 2025-05-20 at 16:11 -0700, Sean Christopherson wrote:
> > On Tue, May 20, 2025, Kai Huang wrote:
> > > On Mon, 2025-05-19 at 08:39 -0700, Sean Christopherson wrote:
> > > > +static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > + enum pg_level level, kvm_pfn_t pfn)
> > > > {
> > > > struct page *page = pfn_to_page(pfn);
> > > > int ret;
> > > > @@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
> > > > r = __tdx_bringup();
> > > > if (r) {
> > > > /*
> > > > - * Disable TDX only but don't fail to load module if
> > > > - * the TDX module could not be loaded. No need to print
> > > > - * message saying "module is not loaded" because it was
> > > > - * printed when the first SEAMCALL failed.
> > > > + * Disable TDX only but don't fail to load module if the TDX
> > > > + * module could not be loaded. No need to print message saying
> > > > + * "module is not loaded" because it was printed when the first
> > > > + * SEAMCALL failed. Don't bother unwinding the S-EPT hooks or
> > > > + * vm_size, as kvm_x86_ops have already been finalized (and are
> > > > + * intentionally not exported). The S-EPT code is unreachable,
> > > > + * and allocating a few more bytes per VM in a should-be-rare
> > > > + * failure scenario is a non-issue.
> > > > */
> > > > if (r == -ENODEV)
> > > > goto success_disable_tdx;
> > > > @@ -3524,3 +3528,19 @@ int __init tdx_bringup(void)
> > > > enable_tdx = 0;
> > > > return 0;
> > > > }
> > > > +
> > > > +
> > > > +void __init tdx_hardware_setup(void)
> > > > +{
> > > > + /*
> > > > + * Note, if the TDX module can't be loaded, KVM TDX support will be
> > > > + * disabled but KVM will continue loading (see tdx_bringup()).
> > > > + */
> > >
> > > This comment seems a little bit weird to me. I think what you meant here is the
> > > @vm_size and those S-EPT ops are not unwound while TDX cannot be brought up but
> > > KVM is still loaded.
> >
> > This comment is weird? Or the one in tdx_bringup() is weird?
> >
>
> I definitely agree tdx_bringup() is weird :-)
>
> > The sole intent of _this_ comment is to clarify that KVM could still end up
> > running with TDX disabled.
> >
>
> But this behaviour itself doesn't mean anything,
I disagree. The overwhelming majority of code in KVM expects that either the
associated feature will be fully enabled, or KVM will abort the overall flow,
e.g. refuse to load, fail vCPU/VM creation, etc.
Continuing on is very exceptional IMO, and warrants a comment.
> e.g., if we export kvm_x86_ops, we could unwind them.
Maaaybe. I mean, yes, we could fully unwind kvm_x86_ops, but doing so would make
the overall code far more brittle. E.g. simply updating kvm_x86_ops won't suffice,
as the static_calls also need to be patched, and we would have to be very careful
not to touch anything in kvm_x86_ops that might have been consumed between here
and the call to tdx_bringup().
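
(A rough illustration, op name picked arbitrarily for the example: clearing
the pointer alone wouldn't suffice, the corresponding static call would need
repatching as well, e.g. something along these lines:)

	/* Clearing the op in kvm_x86_ops leaves the patched call sites behind... */
	kvm_x86_ops.link_external_spt = NULL;
	/* ...so the matching static call must be updated too. */
	static_call_update(kvm_x86_link_external_spt, NULL);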
> So without mentioning "those are not unwound", it doesn't seem useful to me.
>
> But it does have "(see tdx_bringup())" at the end, so OK to me. I guess I just
> wish it could be more verbose.
Yeah, redirecting to another comment isn't a great experience for readers, but I
don't want to duplicate the explanation and details because that risks creating
stale and/or contradicting comments in the future, and in general increases the
maintenance cost (small though it should be in this case).
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-21 17:12 ` Sean Christopherson
@ 2025-05-21 22:43 ` Huang, Kai
2025-05-22 13:40 ` Sean Christopherson
0 siblings, 1 reply; 19+ messages in thread
From: Huang, Kai @ 2025-05-21 22:43 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Wed, 2025-05-21 at 10:12 -0700, Sean Christopherson wrote:
> On Tue, May 20, 2025, Kai Huang wrote:
> > On Tue, 2025-05-20 at 16:11 -0700, Sean Christopherson wrote:
> > > On Tue, May 20, 2025, Kai Huang wrote:
> > > > On Mon, 2025-05-19 at 08:39 -0700, Sean Christopherson wrote:
> > > > > +static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> > > > > + enum pg_level level, kvm_pfn_t pfn)
> > > > > {
> > > > > struct page *page = pfn_to_page(pfn);
> > > > > int ret;
> > > > > @@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
> > > > > r = __tdx_bringup();
> > > > > if (r) {
> > > > > /*
> > > > > - * Disable TDX only but don't fail to load module if
> > > > > - * the TDX module could not be loaded. No need to print
> > > > > - * message saying "module is not loaded" because it was
> > > > > - * printed when the first SEAMCALL failed.
> > > > > + * Disable TDX only but don't fail to load module if the TDX
> > > > > + * module could not be loaded. No need to print message saying
> > > > > + * "module is not loaded" because it was printed when the first
> > > > > + * SEAMCALL failed. Don't bother unwinding the S-EPT hooks or
> > > > > + * vm_size, as kvm_x86_ops have already been finalized (and are
> > > > > + * intentionally not exported). The S-EPT code is unreachable,
> > > > > + * and allocating a few more bytes per VM in a should-be-rare
> > > > > + * failure scenario is a non-issue.
> > > > > */
> > > > > if (r == -ENODEV)
> > > > > goto success_disable_tdx;
> > > > > @@ -3524,3 +3528,19 @@ int __init tdx_bringup(void)
> > > > > enable_tdx = 0;
> > > > > return 0;
> > > > > }
> > > > > +
> > > > > +
> > > > > +void __init tdx_hardware_setup(void)
> > > > > +{
> > > > > + /*
> > > > > + * Note, if the TDX module can't be loaded, KVM TDX support will be
> > > > > + * disabled but KVM will continue loading (see tdx_bringup()).
> > > > > + */
> > > >
> > > > This comment seems a little bit weird to me. I think what you meant here is the
> > > > @vm_size and those S-EPT ops are not unwound while TDX cannot be brought up but
> > > > KVM is still loaded.
> > >
> > > This comment is weird? Or the one in tdx_bringup() is weird?
> > >
> >
> > I definitely agree tdx_bringup() is weird :-)
> >
> > > The sole intent of _this_ comment is to clarify that KVM could still end up
> > > running with TDX disabled.
> > >
> >
> > But this behaviour itself doesn't mean anything,
>
> I disagree. The overwhelming majority of code in KVM expects that either the
> associated feature will be fully enabled, or KVM will abort the overall flow,
> e.g. refuse to load, fail vCPU/VM creation, etc.
>
> Continuing on is very exceptional IMO, and warrants a comment.
I see.
>
> > e.g., if we export kvm_x86_ops, we could unwind them.
>
> Maaaybe. I mean, yes, we could fully unwind kvm_x86_ops, but doing so would make
> the overall code far more brittle. E.g. simply updating kvm_x86_ops won't suffice,
> as the static_calls also need to be patched, and we would have to be very careful
> not to touch anything in kvm_x86_ops that might have been consumed between here
> and the call to tdx_bringup().
Right. Maybe exporting kvm_ops_update() is better.
>
> > So without mentioning "those are not unwound", it doesn't seem useful to me.
> >
> > But it does have "(see tdx_bringup())" at the end, so OK to me. I guess I just
> > wish it could be more verbose.
>
> Yeah, redirecting to another comment isn't a great experience for readers, but I
> don't want to duplicate the explanation and details because that risks creating
> stale and/or contradicting comments in the future, and in general increases the
> maintenance cost (small though it should be in this case).
Sure :-)
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-21 22:43 ` Huang, Kai
@ 2025-05-22 13:40 ` Sean Christopherson
2025-05-23 11:31 ` Huang, Kai
0 siblings, 1 reply; 19+ messages in thread
From: Sean Christopherson @ 2025-05-22 13:40 UTC (permalink / raw)
To: Kai Huang
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Wed, May 21, 2025, Kai Huang wrote:
> On Wed, 2025-05-21 at 10:12 -0700, Sean Christopherson wrote:
> > > e.g., if we export kvm_x86_ops, we could unwind them.
> >
> > Maaaybe. I mean, yes, we could fully unwind kvm_x86_ops, but doing so would make
> > the overall code far more brittle. E.g. simply updating kvm_x86_ops won't suffice,
> > as the static_calls also need to be patched, and we would have to be very careful
> > not to touch anything in kvm_x86_ops that might have been consumed between here
> > and the call to tdx_bringup().
>
> Right. Maybe exporting kvm_ops_update() is better.
A bit, but KVM would still need to be careful not to modify the parts of
vt_x86_ops that have already been consumed.
While I agree that leaving TDX breadcrumbs in kvm_x86_ops when TDX is disabled is
undesirable, the behavior is known, i.e. we know exactly what TDX state is being
left behind. And failure to load the TDX Module should be very, very rare for
any setup that is actually trying to enable TDX.
Whereas providing a way to modify kvm_x86_ops creates the possibility for "bad"
updates. KVM's initialization code is a lot like the kernel's boot code (and
probably most bootstrapping code): it's inherently fragile because avoiding
dependencies is practically impossible.
E.g. I ran into a relevant ordering problem[*] just a few days ago, where checking
for VMX capabilities during PMU initialization always failed because the VMCS
config hadn't yet been parsed. Those types of bugs are especially dangerous
because they're very easy to overlook when modifying existing code, e.g. the
only sign that anything is broken is an optional feature being missing.
[*] https://lore.kernel.org/all/aCU2YEpU0dOk7RTk@google.com
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct
2025-05-22 13:40 ` Sean Christopherson
@ 2025-05-23 11:31 ` Huang, Kai
0 siblings, 0 replies; 19+ messages in thread
From: Huang, Kai @ 2025-05-23 11:31 UTC (permalink / raw)
To: seanjc@google.com
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, vipinsh@google.com,
linux-kernel@vger.kernel.org
On Thu, 2025-05-22 at 06:40 -0700, Sean Christopherson wrote:
> On Wed, May 21, 2025, Kai Huang wrote:
> > On Wed, 2025-05-21 at 10:12 -0700, Sean Christopherson wrote:
> > > > e.g., if we export kvm_x86_ops, we could unwind them.
> > >
> > > Maaaybe. I mean, yes, we could fully unwind kvm_x86_ops, but doing so would make
> > > the overall code far more brittle. E.g. simply updating kvm_x86_ops won't suffice,
> > > as the static_calls also need to be patched, and we would have to be very careful
> > > not to touch anything in kvm_x86_ops that might have been consumed between here
> > > and the call to tdx_bringup().
> >
> > Right. Maybe exporting kvm_ops_update() is better.
>
> A bit, but KVM would still need to be careful not to modify the parts of
> vt_x86_ops that have already been consumed.
>
> While I agree that leaving TDX breadcrumbs in kvm_x86_ops when TDX is disabled is
> undesirable, the behavior is known, i.e. we know exactly what TDX state is being
> left behind. And failure to load the TDX Module should be very, very rare for
> any setup that is actually trying to enable TDX.
This is true. Agreed.
>
> Whereas providing a way to modify kvm_x86_ops creates the possibility for "bad"
> updates. KVM's initialization code is a lot like the kernel's boot code (and
> probably most bootstrapping code): it's inherently fragile because avoiding
> dependencies is practically impossible.
>
> E.g. I ran into a relevant ordering problem[*] just a few days ago, where checking
> for VMX capabilities during PMU initialization always failed because the VMCS
> config hadn't yet been parsed. Those types of bugs are especially dangerous
> because they're very easy to overlook when modifying existing code, e.g. the
> only sign that anything is broken is an optional feature being missing.
>
> [*] https://lore.kernel.org/all/aCU2YEpU0dOk7RTk@google.com
Right. No argument there. I agree that if multiple features want to update
kvm_x86_ops then it could be dangerous. Thanks for the insight :-)
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread
Thread overview: 19+ messages
2025-05-16 21:54 [PATCH v3 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 2/3] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
2025-05-17 12:35 ` Paolo Bonzini
2025-05-19 15:39 ` Sean Christopherson
2025-05-20 14:42 ` Paolo Bonzini
2025-05-20 22:49 ` Huang, Kai
2025-05-20 23:11 ` Sean Christopherson
2025-05-20 23:57 ` Huang, Kai
2025-05-21 17:12 ` Sean Christopherson
2025-05-21 22:43 ` Huang, Kai
2025-05-22 13:40 ` Sean Christopherson
2025-05-23 11:31 ` Huang, Kai
2025-05-20 16:15 ` Sean Christopherson
2025-05-16 21:54 ` [PATCH v3 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
2025-05-17 12:43 ` Paolo Bonzini
2025-05-19 13:37 ` Sean Christopherson
2025-05-19 15:29 ` James Houghton
2025-05-19 15:51 ` Sean Christopherson