* [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list
@ 2025-05-23  0:11 Sean Christopherson
  2025-05-23  0:11 ` [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c Sean Christopherson
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-05-23  0:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

Allocate the hashed list of shadow pages dynamically (separate from
struct kvm), and on-demand.  The hashed list is 32KiB, i.e. absolutely
belongs in a separate allocation, and is worth skipping if KVM isn't
shadowing guest PTEs for the VM.
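
For reference, a standalone sketch of where the 32KiB comes from (the bucket
count assumes KVM_NUM_MMU_PAGES is 1 << 12 and a 64-bit build; this is
illustrative math, not code from the series):

  /* Not kernel code: local stand-in for struct hlist_head (one pointer). */
  struct hlist_node;
  struct hlist_head { struct hlist_node *first; };

  #define EXAMPLE_NUM_MMU_PAGES (1 << 12) /* 4096 hash buckets */

  _Static_assert(EXAMPLE_NUM_MMU_PAGES * sizeof(struct hlist_head) == 32 * 1024,
                 "4096 buckets * 8 bytes per bucket == 32KiB, an order-3 chunk");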

I double checked that padding kvm_arch with a 4KiB array trips the assert,
but padding with 2KiB does not.  So knock on wood, I finally got the assert
right.  Maybe.

v4:
 - Use smp_store_release() and smp_load_acquire() instead of {READ,WRITE}_ONCE,
   and update the comments accordingly. [Paolo, James]
 - Move the kvm_tdx assert to tdx.c. [Paolo]
 - Fix the assertion, again.  [Vipin, in spirit if not in reality]
 - Add a patch to move TDX hardware setup to tdx.c.

v3:
 -  https://lore.kernel.org/all/20250516215422.2550669-1-seanjc@google.com
 - Add comments explaining the {READ,WRITE}_ONCE logic, and why it's safe
   to set the list outside of mmu_lock. [Vipin]
 - Make the assertions actually work. [Vipin]
 - Refine the assertions so they (hopefully) won't fail on kernels with
   a bunch of debug crud added.

v2:
 - https://lore.kernel.org/all/20250401155714.838398-1-seanjc@google.com
 - Actually defer allocation when using TDP MMU. [Vipin]
 - Free allocation on MMU teardown. [Vipin]

v1: https://lore.kernel.org/all/20250315024010.2360884-1-seanjc@google.com

Sean Christopherson (4):
  KVM: TDX: Move TDX hardware setup from main.c to tdx.c
  KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  KVM: x86: Use kvzalloc() to allocate VM struct
  KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list

 arch/x86/include/asm/kvm_host.h |  6 +--
 arch/x86/kvm/mmu/mmu.c          | 75 ++++++++++++++++++++++++++++++---
 arch/x86/kvm/svm/svm.c          |  2 +
 arch/x86/kvm/vmx/main.c         | 36 +---------------
 arch/x86/kvm/vmx/tdx.c          | 47 +++++++++++++++------
 arch/x86/kvm/vmx/tdx.h          |  1 +
 arch/x86/kvm/vmx/vmx.c          |  2 +
 arch/x86/kvm/vmx/x86_ops.h      | 10 -----
 arch/x86/kvm/x86.c              |  5 ++-
 arch/x86/kvm/x86.h              | 22 ++++++++++
 10 files changed, 139 insertions(+), 67 deletions(-)


base-commit: 3f7b307757ecffc1c18ede9ee3cf9ce8101f3cc9
-- 
2.49.0.1151.ga128411c76-goog



* [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
  2025-05-23  0:11 [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
@ 2025-05-23  0:11 ` Sean Christopherson
  2025-05-23  2:27   ` Xiaoyao Li
  2025-05-23 11:31   ` Huang, Kai
  2025-05-23  0:11 ` [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list Sean Christopherson
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-05-23  0:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

Move TDX hardware setup to tdx.c, as the code is obviously TDX specific,
co-locating the setup with tdx_bringup() makes it easier to see and
document the success_disable_tdx "error" path, and configuring the TDX
specific hooks in tdx.c reduces the number of globally visible TDX symbols.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/main.c    | 36 ++----------------------------
 arch/x86/kvm/vmx/tdx.c     | 45 +++++++++++++++++++++++++++-----------
 arch/x86/kvm/vmx/tdx.h     |  1 +
 arch/x86/kvm/vmx/x86_ops.h | 10 ---------
 4 files changed, 35 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..d7178d15ac8f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -29,40 +29,8 @@ static __init int vt_hardware_setup(void)
 	if (ret)
 		return ret;
 
-	/*
-	 * Update vt_x86_ops::vm_size here so it is ready before
-	 * kvm_ops_update() is called in kvm_x86_vendor_init().
-	 *
-	 * Note, the actual bringing up of TDX must be done after
-	 * kvm_ops_update() because enabling TDX requires enabling
-	 * hardware virtualization first, i.e., all online CPUs must
-	 * be in post-VMXON state.  This means the @vm_size here
-	 * may be updated to TDX's size but TDX may fail to enable
-	 * at later time.
-	 *
-	 * The VMX/VT code could update kvm_x86_ops::vm_size again
-	 * after bringing up TDX, but this would require exporting
-	 * either kvm_x86_ops or kvm_ops_update() from the base KVM
-	 * module, which looks overkill.  Anyway, the worst case here
-	 * is KVM may allocate couple of more bytes than needed for
-	 * each VM.
-	 */
-	if (enable_tdx) {
-		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
-				sizeof(struct kvm_tdx));
-		/*
-		 * Note, TDX may fail to initialize in a later time in
-		 * vt_init(), in which case it is not necessary to setup
-		 * those callbacks.  But making them valid here even
-		 * when TDX fails to init later is fine because those
-		 * callbacks won't be called if the VM isn't TDX guest.
-		 */
-		vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
-		vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
-		vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
-		vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
-		vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
-	}
+	if (enable_tdx)
+		tdx_hardware_setup();
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..1790f6dee870 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -738,7 +738,7 @@ bool tdx_interrupt_allowed(struct kvm_vcpu *vcpu)
 	       !to_tdx(vcpu)->vp_enter_args.r12;
 }
 
-bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+static bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
 	u64 vcpu_state_details;
 
@@ -1543,8 +1543,8 @@ static int tdx_mem_page_record_premap_cnt(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
-			      enum pg_level level, kvm_pfn_t pfn)
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, kvm_pfn_t pfn)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct page *page = pfn_to_page(pfn);
@@ -1624,8 +1624,8 @@ static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
 	return 0;
 }
 
-int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
-			      enum pg_level level, void *private_spt)
+static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, void *private_spt)
 {
 	int tdx_level = pg_level_to_tdx_sept_level(level);
 	gpa_t gpa = gfn_to_gpa(gfn);
@@ -1760,8 +1760,8 @@ static void tdx_track(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
 }
 
-int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
-			      enum pg_level level, void *private_spt)
+static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, void *private_spt)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 
@@ -1783,8 +1783,8 @@ int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
 	return tdx_reclaim_page(virt_to_page(private_spt));
 }
 
-int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-				 enum pg_level level, kvm_pfn_t pfn)
+static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level, kvm_pfn_t pfn)
 {
 	struct page *page = pfn_to_page(pfn);
 	int ret;
@@ -3507,10 +3507,14 @@ int __init tdx_bringup(void)
 	r = __tdx_bringup();
 	if (r) {
 		/*
-		 * Disable TDX only but don't fail to load module if
-		 * the TDX module could not be loaded.  No need to print
-		 * message saying "module is not loaded" because it was
-		 * printed when the first SEAMCALL failed.
+		 * Disable TDX only but don't fail to load module if the TDX
+		 * module could not be loaded.  No need to print message saying
+		 * "module is not loaded" because it was printed when the first
+		 * SEAMCALL failed.  Don't bother unwinding the S-EPT hooks or
+		 * vm_size, as kvm_x86_ops have already been finalized (and are
+		 * intentionally not exported).  The S-EPT code is unreachable,
+		 * and allocating a few more bytes per VM in a should-be-rare
+		 * failure scenario is a non-issue.
 		 */
 		if (r == -ENODEV)
 			goto success_disable_tdx;
@@ -3524,3 +3528,18 @@ int __init tdx_bringup(void)
 	enable_tdx = 0;
 	return 0;
 }
+
+void __init tdx_hardware_setup(void)
+{
+	/*
+	 * Note, if the TDX module can't be loaded, KVM TDX support will be
+	 * disabled but KVM will continue loading (see tdx_bringup()).
+	 */
+	vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size, sizeof(struct kvm_tdx));
+
+	vt_x86_ops.link_external_spt = tdx_sept_link_private_spt;
+	vt_x86_ops.set_external_spte = tdx_sept_set_private_spte;
+	vt_x86_ops.free_external_spt = tdx_sept_free_private_spt;
+	vt_x86_ops.remove_external_spte = tdx_sept_remove_private_spte;
+	vt_x86_ops.protected_apic_has_interrupt = tdx_protected_apic_has_interrupt;
+}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 51f98443e8a2..ca39a9391db1 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -8,6 +8,7 @@
 #ifdef CONFIG_KVM_INTEL_TDX
 #include "common.h"
 
+void tdx_hardware_setup(void);
 int tdx_bringup(void);
 void tdx_cleanup(void);
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..87e855276a88 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -136,7 +136,6 @@ int tdx_vcpu_pre_run(struct kvm_vcpu *vcpu);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, bool force_immediate_exit);
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
-bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
 int tdx_handle_exit(struct kvm_vcpu *vcpu,
 		enum exit_fastpath_completion fastpath);
 
@@ -151,15 +150,6 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
-int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
-			      enum pg_level level, void *private_spt);
-int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
-			      enum pg_level level, void *private_spt);
-int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
-			      enum pg_level level, kvm_pfn_t pfn);
-int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
-				 enum pg_level level, kvm_pfn_t pfn);
-
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
-- 
2.49.0.1151.ga128411c76-goog



* [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  2025-05-23  0:11 [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
  2025-05-23  0:11 ` [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c Sean Christopherson
@ 2025-05-23  0:11 ` Sean Christopherson
  2025-05-28  8:04   ` Xiaoyao Li
  2025-05-23  0:11 ` [PATCH v4 3/4] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-05-23  0:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation.  Dynamically
allocating the array will allow allocating "struct kvm" using kvmalloc(),
and will also allow deferring allocation of the array until it's actually
needed, i.e. until the first shadow root is allocated.

Opportunistically use kvmalloc() for the hashed lists, as an order-3
allocation is (stating the obvious) less likely to fail than an order-4
allocation, and the overhead of vmalloc() is undesirable given that the
size of the allocation is fixed.
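
To make the order math concrete, here is a userspace sketch (the ~7000 byte
worst-case struct size is the figure quoted in the kvzalloc() patch; the
local order_of() mimics get_order() and is not the kernel helper):

  #include <stdio.h>

  /* Smallest order such that (4096 << order) covers @size, a la get_order(). */
  static int order_of(size_t size)
  {
          int order = 0;

          while (((size_t)4096 << order) < size)
                  order++;
          return order;
  }

  int main(void)
  {
          size_t hash = 4096 * sizeof(void *);  /* 32KiB hashed page list */
          size_t vm   = 7000;                   /* worst-case kvm_{svm,vmx,tdx} */

          printf("hash list alone : order-%d\n", order_of(hash));      /* 3 */
          printf("struct kvm alone: order-%d\n", order_of(vm));        /* 1 */
          printf("kvm + hash list : order-%d\n", order_of(vm + hash)); /* 4 */
          return 0;
  }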

Cc: Vipin Sharma <vipinsh@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/mmu/mmu.c          | 23 ++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  5 ++++-
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 330cdcbed1a6..9667d6b929ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1343,7 +1343,7 @@ struct kvm_arch {
 	bool has_private_mem;
 	bool has_protected_state;
 	bool pre_fault_allowed;
-	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
+	struct hlist_head *mmu_page_hash;
 	struct list_head active_mmu_pages;
 	/*
 	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
@@ -2006,7 +2006,7 @@ void kvm_mmu_vendor_module_exit(void);
 
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
-void kvm_mmu_init_vm(struct kvm *kvm);
+int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..41da2cb1e3f1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3882,6 +3882,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	return r;
 }
 
+static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
+{
+	typeof(kvm->arch.mmu_page_hash) h;
+
+	h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
+	if (!h)
+		return -ENOMEM;
+
+	kvm->arch.mmu_page_hash = h;
+	return 0;
+}
+
 static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
@@ -6675,13 +6687,19 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 		kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
 }
 
-void kvm_mmu_init_vm(struct kvm *kvm)
+int kvm_mmu_init_vm(struct kvm *kvm)
 {
+	int r;
+
 	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
+	r = kvm_mmu_alloc_page_hash(kvm);
+	if (r)
+		return r;
+
 	if (tdp_mmu_enabled)
 		kvm_mmu_init_tdp_mmu(kvm);
 
@@ -6692,6 +6710,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 
 	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
 	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+	return 0;
 }
 
 static void mmu_free_vm_memory_caches(struct kvm *kvm)
@@ -6703,6 +6722,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
 {
+	kvfree(kvm->arch.mmu_page_hash);
+
 	if (tdp_mmu_enabled)
 		kvm_mmu_uninit_tdp_mmu(kvm);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f9f798f286ce..d204ba9368f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12787,7 +12787,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (ret)
 		goto out;
 
-	kvm_mmu_init_vm(kvm);
+	ret = kvm_mmu_init_vm(kvm);
+	if (ret)
+		goto out_cleanup_page_track;
 
 	ret = kvm_x86_call(vm_init)(kvm);
 	if (ret)
@@ -12840,6 +12842,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 out_uninit_mmu:
 	kvm_mmu_uninit_vm(kvm);
+out_cleanup_page_track:
 	kvm_page_track_cleanup(kvm);
 out:
 	return ret;
-- 
2.49.0.1151.ga128411c76-goog



* [PATCH v4 3/4] KVM: x86: Use kvzalloc() to allocate VM struct
  2025-05-23  0:11 [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
  2025-05-23  0:11 ` [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c Sean Christopherson
  2025-05-23  0:11 ` [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list Sean Christopherson
@ 2025-05-23  0:11 ` Sean Christopherson
  2025-05-28  8:35   ` Xiaoyao Li
  2025-05-23  0:11 ` [PATCH v4 4/4] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
  2025-06-24 19:38 ` [PATCH v4 0/4] KVM: x86: Dynamically allocate " Sean Christopherson
  4 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-05-23  0:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
allocation before falling back to __vmalloc(), to avoid the overhead of
establishing the virtual mappings.  For non-debug builds, the SVM and VMX
(and TDX) structures are now just below 7000 bytes in the worst case
scenario (see below), i.e. are order-1 allocations, and will likely remain
that way for quite some time.

Add compile-time assertions in vendor code to ensure the size of the
structures, sans the memslos hash tables, are order-0 allocations, i.e.
are less than 4KiB.  There's nothing fundamentally wrong with a larger
kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
the memslots hash tables) is below 2KiB after 18+ years of existence,
more than doubling the size would be quite notable.

Add sanity checks on the memslot hash table sizes, partly to ensure they
aren't resized without accounting for the impact on VM structure size, and
partly to document that the majority of the size of VM structures comes
from the memslots.
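
As a rough illustration of the numbers behind those sanity checks (this
assumes a 64-bit build with CONFIG_KVM_SMM=y, and that kvm_memslots::id_hash
is a 128-bucket table, which is my inference from the 1024-byte per-table
figure in the static_assert below; treat the bucket count as an assumption):

  /* One id_hash table: 128 buckets of one pointer each. */
  _Static_assert(128 * sizeof(void *) == 1024, "one memslot id_hash table");

  /* Two memslot generations per address space, two address spaces (SMM). */
  _Static_assert(1024 * 2 * 2 == 4096,
                 "memslot hash tables account for ~4KiB of the VM struct");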

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/svm/svm.c          |  2 ++
 arch/x86/kvm/vmx/tdx.c          |  2 ++
 arch/x86/kvm/vmx/vmx.c          |  2 ++
 arch/x86/kvm/x86.h              | 22 ++++++++++++++++++++++
 5 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9667d6b929ee..3a985825a945 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1961,7 +1961,7 @@ void kvm_x86_vendor_exit(void);
 #define __KVM_HAVE_ARCH_VM_ALLOC
 static inline struct kvm *kvm_arch_alloc_vm(void)
 {
-	return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	return kvzalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT);
 }
 
 #define __KVM_HAVE_ARCH_VM_FREE
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0ad1a6d4fb6d..d13e475c3407 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5675,6 +5675,8 @@ static int __init svm_init(void)
 {
 	int r;
 
+	KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_svm);
+
 	__unused_size_checks();
 
 	if (!kvm_is_svm_supported())
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1790f6dee870..559fb18ff9fb 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3531,6 +3531,8 @@ int __init tdx_bringup(void)
 
 void __init tdx_hardware_setup(void)
 {
+	KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_tdx);
+
 	/*
 	 * Note, if the TDX module can't be loaded, KVM TDX support will be
 	 * disabled but KVM will continue loading (see tdx_bringup()).
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9ff00ae9f05a..ef58b727d6c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8643,6 +8643,8 @@ int __init vmx_init(void)
 {
 	int r, cpu;
 
+	KVM_SANITY_CHECK_VM_STRUCT_SIZE(kvm_vmx);
+
 	if (!kvm_is_vmx_supported())
 		return -EOPNOTSUPP;
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 832f0faf4779..db4e6a90e83d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -55,6 +55,28 @@ struct kvm_host_values {
 
 void kvm_spurious_fault(void);
 
+#define SIZE_OF_MEMSLOTS_HASHTABLE \
+	(sizeof(((struct kvm_memslots *)0)->id_hash) * 2 * KVM_MAX_NR_ADDRESS_SPACES)
+
+/* Sanity check the size of the memslot hash tables. */
+static_assert(SIZE_OF_MEMSLOTS_HASHTABLE ==
+	      (1024 * (1 + IS_ENABLED(CONFIG_X86_64)) * (1 + IS_ENABLED(CONFIG_KVM_SMM))));
+
+/*
+ * Assert that "struct kvm_{svm,vmx,tdx}" is an order-0 or order-1 allocation.
+ * Spilling over to an order-2 allocation isn't fundamentally problematic, but
+ * isn't expected to happen in the foreseeable future (O(years)).  Assert that
+ * the size is an order-0 allocation when ignoring the memslot hash tables, to
+ * help detect and debug unexpected size increases.
+ */
+#define KVM_SANITY_CHECK_VM_STRUCT_SIZE(x)						\
+do {											\
+	BUILD_BUG_ON(get_order(sizeof(struct x) - SIZE_OF_MEMSLOTS_HASHTABLE) &&	\
+		     !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN));	\
+	BUILD_BUG_ON(get_order(sizeof(struct x)) > 1 &&					\
+		     !IS_ENABLED(CONFIG_DEBUG_KERNEL) && !IS_ENABLED(CONFIG_KASAN));	\
+} while (0)
+
 #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check)		\
 ({									\
 	bool failed = (consistency_check);				\
-- 
2.49.0.1151.ga128411c76-goog



* [PATCH v4 4/4] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
  2025-05-23  0:11 [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-05-23  0:11 ` [PATCH v4 3/4] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
@ 2025-05-23  0:11 ` Sean Christopherson
  2025-06-24 19:38 ` [PATCH v4 0/4] KVM: x86: Dynamically allocate " Sean Christopherson
  4 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-05-23  0:11 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
nested TDP VM is run, defer allocation of the array of hashed lists used
to track shadow MMU pages until the first shadow root is allocated.

Setting the list outside of mmu_lock is safe, as concurrent readers must
hold mmu_lock in some capacity, shadow pages can only be added (or removed)
from the list when mmu_lock is held for write, and tasks that are creating
a shadow root are serialized by slots_arch_lock.  I.e. it's impossible for
the list to become non-empty until all readers go away, and so readers are
guaranteed to see an empty list even if they make multiple calls to
kvm_get_mmu_page_hash() in a single mmu_lock critical section.

Use smp_store_release() and smp_load_acquire() to access the hash table
pointer to ensure the stores to zero the lists are retired before readers
start to walk the list.  E.g. if the compiler hoisted the store before the
zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale
kernel data.
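
For readers less familiar with the pattern, below is a minimal userspace
analogue of the publish/consume ordering described above, using C11 atomics
in place of smp_store_release()/smp_load_acquire() (the names and table type
are illustrative, not the actual KVM code):

  #include <stdatomic.h>
  #include <stdlib.h>

  #define NR_BUCKETS 4096

  static long *_Atomic table;   /* stand-in for kvm->arch.mmu_page_hash */

  /* Writer: zero the table, then publish the pointer with release semantics. */
  static int alloc_table(void)
  {
          long *t = calloc(NR_BUCKETS, sizeof(*t));

          if (!t)
                  return -1;

          /* calloc()'s zeroing is ordered before the pointer becomes visible. */
          atomic_store_explicit(&table, t, memory_order_release);
          return 0;
  }

  /* Reader: acquire-load the pointer; non-NULL implies fully zeroed buckets. */
  static long *get_bucket(unsigned int idx)
  {
          long *t = atomic_load_explicit(&table, memory_order_acquire);

          return t ? &t[idx % NR_BUCKETS] : NULL;
  }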

Cc: James Houghton <jthoughton@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 62 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 52 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 41da2cb1e3f1..173f7fdfba21 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1983,14 +1983,35 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
 	return true;
 }
 
+static __ro_after_init HLIST_HEAD(empty_page_hash);
+
+static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
+{
+	/*
+	 * Ensure the load of the hash table pointer itself is ordered before
+	 * loads to walk the table.  The pointer is set at runtime outside of
+	 * mmu_lock when the TDP MMU is enabled, i.e. when the hash table of
+	 * shadow pages becomes necessary only when KVM needs to shadow L1's
+	 * TDP for an L2 guest.  Pairs with the smp_store_release() in
+	 * kvm_mmu_alloc_page_hash().
+	 */
+	struct hlist_head *page_hash = smp_load_acquire(&kvm->arch.mmu_page_hash);
+
+	lockdep_assert_held(&kvm->mmu_lock);
+
+	if (!page_hash)
+		return &empty_page_hash;
+
+	return &page_hash[kvm_page_table_hashfn(gfn)];
+}
+
 #define for_each_valid_sp(_kvm, _sp, _list)				\
 	hlist_for_each_entry(_sp, _list, hash_link)			\
 		if (is_obsolete_sp((_kvm), (_sp))) {			\
 		} else
 
 #define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
-	for_each_valid_sp(_kvm, _sp,					\
-	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
+	for_each_valid_sp(_kvm, _sp, kvm_get_mmu_page_hash(_kvm, _gfn))	\
 		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
 
 static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
@@ -2358,6 +2379,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
+	/*
+	 * No need for memory barriers, unlike in kvm_get_mmu_page_hash(), as
+	 * mmu_page_hash must be set prior to creating the first shadow root,
+	 * i.e. reaching this point is fully serialized by slots_arch_lock.
+	 */
+	BUG_ON(!kvm->arch.mmu_page_hash);
 	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 
 	sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
@@ -3886,11 +3913,21 @@ static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
 {
 	typeof(kvm->arch.mmu_page_hash) h;
 
+	if (kvm->arch.mmu_page_hash)
+		return 0;
+
 	h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
 	if (!h)
 		return -ENOMEM;
 
-	kvm->arch.mmu_page_hash = h;
+	/*
+	 * Ensure the hash table pointer is set only after all stores to zero
+	 * the memory are retired.  Pairs with the smp_load_acquire() in
+	 * kvm_get_mmu_page_hash().  Note, mmu_lock must be held for write to
+	 * add (or remove) shadow pages, and so readers are guaranteed to see
+	 * an empty list for their current mmu_lock critical section.
+	 */
+	smp_store_release(&kvm->arch.mmu_page_hash, h);
 	return 0;
 }
 
@@ -3913,9 +3950,13 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 	if (kvm_shadow_root_allocated(kvm))
 		goto out_unlock;
 
+	r = kvm_mmu_alloc_page_hash(kvm);
+	if (r)
+		goto out_unlock;
+
 	/*
-	 * Check if anything actually needs to be allocated, e.g. all metadata
-	 * will be allocated upfront if TDP is disabled.
+	 * Check if memslot metadata actually needs to be allocated, e.g. all
+	 * metadata will be allocated upfront if TDP is disabled.
 	 */
 	if (kvm_memslots_have_rmaps(kvm) &&
 	    kvm_page_track_write_tracking_enabled(kvm))
@@ -6696,12 +6737,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
-	r = kvm_mmu_alloc_page_hash(kvm);
-	if (r)
-		return r;
-
-	if (tdp_mmu_enabled)
+	if (tdp_mmu_enabled) {
 		kvm_mmu_init_tdp_mmu(kvm);
+	} else {
+		r = kvm_mmu_alloc_page_hash(kvm);
+		if (r)
+			return r;
+	}
 
 	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
 	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
-- 
2.49.0.1151.ga128411c76-goog



* Re: [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
  2025-05-23  0:11 ` [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c Sean Christopherson
@ 2025-05-23  2:27   ` Xiaoyao Li
  2025-05-23  6:40     ` Xiaoyao Li
  2025-05-23 11:31   ` Huang, Kai
  1 sibling, 1 reply; 14+ messages in thread
From: Xiaoyao Li @ 2025-05-23  2:27 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

On 5/23/2025 8:11 AM, Sean Christopherson wrote:
> Move TDX hardware setup to tdx.c, as the code is obviously TDX specific,
> co-locating the setup with tdx_bringup() makes it easier to see and
> document the success_disable_tdx "error" path, and configuring the TDX
> specific hooks in tdx.c reduces the number of globally visible TDX symbols.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

...

> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 51f98443e8a2..ca39a9391db1 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -8,6 +8,7 @@
>   #ifdef CONFIG_KVM_INTEL_TDX
>   #include "common.h"
>   
> +void tdx_hardware_setup(void);

we need to define a stub function for the case of !CONFIG_KVM_INTEL_TDX

with it fixed,

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>



* Re: [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
  2025-05-23  2:27   ` Xiaoyao Li
@ 2025-05-23  6:40     ` Xiaoyao Li
  0 siblings, 0 replies; 14+ messages in thread
From: Xiaoyao Li @ 2025-05-23  6:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

On 5/23/2025 10:27 AM, Xiaoyao Li wrote:
> On 5/23/2025 8:11 AM, Sean Christopherson wrote:
>> Move TDX hardware setup to tdx.c, as the code is obviously TDX specific,
>> co-locating the setup with tdx_bringup() makes it easier to see and
>> document the success_disable_tdx "error" path, and configuring the TDX
>> specific hooks in tdx.c reduces the number of globally visible TDX 
>> symbols.
>>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> ...
> 
>> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
>> index 51f98443e8a2..ca39a9391db1 100644
>> --- a/arch/x86/kvm/vmx/tdx.h
>> +++ b/arch/x86/kvm/vmx/tdx.h
>> @@ -8,6 +8,7 @@
>>   #ifdef CONFIG_KVM_INTEL_TDX
>>   #include "common.h"
>> +void tdx_hardware_setup(void);
> 
> we need to define a stub function for the case of !CONFIG_KVM_INTEL_TDX

Sorry, I missed the discussion on v3 [*].
Based on kvm-x86/next, there is no issue.

[*] https://lore.kernel.org/all/aC0MIUOTQbb9-a7k@google.com/

> with it fixed,
> 
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> 



* Re: [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
  2025-05-23  0:11 ` [PATCH v4 1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c Sean Christopherson
  2025-05-23  2:27   ` Xiaoyao Li
@ 2025-05-23 11:31   ` Huang, Kai
  1 sibling, 0 replies; 14+ messages in thread
From: Huang, Kai @ 2025-05-23 11:31 UTC (permalink / raw)
  To: pbonzini@redhat.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, vipinsh@google.com, jthoughton@google.com,
	linux-kernel@vger.kernel.org

On Thu, 2025-05-22 at 17:11 -0700, Sean Christopherson wrote:
> Move TDX hardware setup to tdx.c, as the code is obviously TDX specific,
> co-locating the setup with tdx_bringup() makes it easier to see and
> document the success_disable_tdx "error" path, and configuring the TDX
> specific hooks in tdx.c reduces the number of globally visible TDX symbols.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> 

Reviewed-by: Kai Huang <kai.huang@intel.com>


* Re: [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  2025-05-23  0:11 ` [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list Sean Christopherson
@ 2025-05-28  8:04   ` Xiaoyao Li
  2025-05-28 12:54     ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Xiaoyao Li @ 2025-05-28  8:04 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

On 5/23/2025 8:11 AM, Sean Christopherson wrote:
> Dynamically allocate the (massive) array of hashed lists used to track
> shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
> all on its own, and is *exactly* an order-3 allocation.  Dynamically
> allocating the array will allow allocating "struct kvm" using kvmalloc(),
> and will also allow deferring allocation of the array until it's actually
> needed, i.e. until the first shadow root is allocated.
> 
> Opportunistically use kvmalloc() for the hashed lists, as an order-3
> allocation is (stating the obvious) less likely to fail than an order-4
> allocation, and the overhead of vmalloc() is undesirable given that the
> size of the allocation is fixed.
> 
> Cc: Vipin Sharma <vipinsh@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  4 ++--
>   arch/x86/kvm/mmu/mmu.c          | 23 ++++++++++++++++++++++-
>   arch/x86/kvm/x86.c              |  5 ++++-
>   3 files changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 330cdcbed1a6..9667d6b929ee 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1343,7 +1343,7 @@ struct kvm_arch {
>   	bool has_private_mem;
>   	bool has_protected_state;
>   	bool pre_fault_allowed;
> -	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> +	struct hlist_head *mmu_page_hash;
>   	struct list_head active_mmu_pages;
>   	/*
>   	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
> @@ -2006,7 +2006,7 @@ void kvm_mmu_vendor_module_exit(void);
>   
>   void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
>   int kvm_mmu_create(struct kvm_vcpu *vcpu);
> -void kvm_mmu_init_vm(struct kvm *kvm);
> +int kvm_mmu_init_vm(struct kvm *kvm);
>   void kvm_mmu_uninit_vm(struct kvm *kvm);
>   
>   void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index cbc84c6abc2e..41da2cb1e3f1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3882,6 +3882,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>   	return r;
>   }
>   
> +static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
> +{
> +	typeof(kvm->arch.mmu_page_hash) h;

Out of curiosity, it is uncommon in KVM to use typeof() given that we
know what the type actually is. Is there some specific reason?

Anyway, it works.

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>

> +
> +	h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	kvm->arch.mmu_page_hash = h;
> +	return 0;
> +}
> +
>   static int mmu_first_shadow_root_alloc(struct kvm *kvm)
>   {
>   	struct kvm_memslots *slots;
> @@ -6675,13 +6687,19 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>   		kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
>   }
>   
> -void kvm_mmu_init_vm(struct kvm *kvm)
> +int kvm_mmu_init_vm(struct kvm *kvm)
>   {
> +	int r;
> +
>   	kvm->arch.shadow_mmio_value = shadow_mmio_value;
>   	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>   	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
>   	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>   
> +	r = kvm_mmu_alloc_page_hash(kvm);
> +	if (r)
> +		return r;
> +
>   	if (tdp_mmu_enabled)
>   		kvm_mmu_init_tdp_mmu(kvm);
>   
> @@ -6692,6 +6710,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>   
>   	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
>   	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> +	return 0;
>   }
>   
>   static void mmu_free_vm_memory_caches(struct kvm *kvm)
> @@ -6703,6 +6722,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
>   
>   void kvm_mmu_uninit_vm(struct kvm *kvm)
>   {
> +	kvfree(kvm->arch.mmu_page_hash);
> +
>   	if (tdp_mmu_enabled)
>   		kvm_mmu_uninit_tdp_mmu(kvm);
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f9f798f286ce..d204ba9368f8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12787,7 +12787,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	if (ret)
>   		goto out;
>   
> -	kvm_mmu_init_vm(kvm);
> +	ret = kvm_mmu_init_vm(kvm);
> +	if (ret)
> +		goto out_cleanup_page_track;
>   
>   	ret = kvm_x86_call(vm_init)(kvm);
>   	if (ret)
> @@ -12840,6 +12842,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   
>   out_uninit_mmu:
>   	kvm_mmu_uninit_vm(kvm);
> +out_cleanup_page_track:
>   	kvm_page_track_cleanup(kvm);
>   out:
>   	return ret;



* Re: [PATCH v4 3/4] KVM: x86: Use kvzalloc() to allocate VM struct
  2025-05-23  0:11 ` [PATCH v4 3/4] KVM: x86: Use kvzalloc() to allocate VM struct Sean Christopherson
@ 2025-05-28  8:35   ` Xiaoyao Li
  0 siblings, 0 replies; 14+ messages in thread
From: Xiaoyao Li @ 2025-05-28  8:35 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

On 5/23/2025 8:11 AM, Sean Christopherson wrote:
> Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
> allocation before falling back to __vmalloc(), to avoid the overhead of
> establishing the virtual mappings.  For non-debug builds, the SVM and VMX
> (and TDX) structures are now just below 7000 bytes in the worst case
> scenario (see below), i.e. are order-1 allocations, and will likely remain
> that way for quite some time.
> 
> Add compile-time assertions in vendor code to ensure the size of the
> structures, sans the memslos hash tables, are order-0 allocations, i.e.

s/memslos/memslots

> are less than 4KiB.  There's nothing fundamentally wrong with a larger
> kvm_{svm,vmx,tdx} size, but given that the size of the structure (without
> the memslots hash tables) is below 2KiB after 18+ years of existence,
> more than doubling the size would be quite notable.
> 
> Add sanity checks on the memslot hash table sizes, partly to ensure they
> aren't resized without accounting for the impact on VM structure size, and
> partly to document that the majority of the size of VM structures comes
> from the memslots.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>


* Re: [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  2025-05-28  8:04   ` Xiaoyao Li
@ 2025-05-28 12:54     ` Sean Christopherson
  2025-06-24 19:48       ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-05-28 12:54 UTC (permalink / raw)
  To: Xiaoyao Li; +Cc: Paolo Bonzini, kvm, linux-kernel, Vipin Sharma, James Houghton

On Wed, May 28, 2025, Xiaoyao Li wrote:
> On 5/23/2025 8:11 AM, Sean Christopherson wrote:
> > Dynamically allocate the (massive) array of hashed lists used to track
> > shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
> > all on its own, and is *exactly* an order-3 allocation.  Dynamically
> > allocating the array will allow allocating "struct kvm" using kvmalloc(),
> > and will also allow deferring allocation of the array until it's actually
> > needed, i.e. until the first shadow root is allocated.
> > 
> > Opportunistically use kvmalloc() for the hashed lists, as an order-3
> > allocation is (stating the obvious) less likely to fail than an order-4
> > allocation, and the overhead of vmalloc() is undesirable given that the
> > size of the allocation is fixed.
> > 
> > Cc: Vipin Sharma <vipinsh@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >   arch/x86/include/asm/kvm_host.h |  4 ++--
> >   arch/x86/kvm/mmu/mmu.c          | 23 ++++++++++++++++++++++-
> >   arch/x86/kvm/x86.c              |  5 ++++-
> >   3 files changed, 28 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 330cdcbed1a6..9667d6b929ee 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1343,7 +1343,7 @@ struct kvm_arch {
> >   	bool has_private_mem;
> >   	bool has_protected_state;
> >   	bool pre_fault_allowed;
> > -	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > +	struct hlist_head *mmu_page_hash;
> >   	struct list_head active_mmu_pages;
> >   	/*
> >   	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
> > @@ -2006,7 +2006,7 @@ void kvm_mmu_vendor_module_exit(void);
> >   void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
> >   int kvm_mmu_create(struct kvm_vcpu *vcpu);
> > -void kvm_mmu_init_vm(struct kvm *kvm);
> > +int kvm_mmu_init_vm(struct kvm *kvm);
> >   void kvm_mmu_uninit_vm(struct kvm *kvm);
> >   void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index cbc84c6abc2e..41da2cb1e3f1 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3882,6 +3882,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >   	return r;
> >   }
> > +static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
> > +{
> > +	typeof(kvm->arch.mmu_page_hash) h;
> 
> > Out of curiosity, it is uncommon in KVM to use typeof() given that we know
> what the type actually is. Is there some specific reason?

I'm pretty sure it's a leftover from various experiments.  IIRC, I was trying to
do something odd and was having a hard time getting the type right :-)

I'll drop the typeof() in favor of "struct hlist_head *", using typeof here isn't
justified and IMO makes the code a bit harder to read.
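
For reference, the two spellings side by side (using a stand-in struct, not
the real kvm_arch; typeof() is a GCC/clang extension):

  struct hlist_node;
  struct hlist_head { struct hlist_node *first; };

  struct example_arch {
          struct hlist_head *mmu_page_hash;
  };

  void demo(struct example_arch *arch)
  {
          typeof(arch->mmu_page_hash) a;  /* type inferred from the field */
          struct hlist_head *b;           /* explicit type, as applied */

          (void)a; (void)b;
  }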


* Re: [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list
  2025-05-23  0:11 [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
                   ` (3 preceding siblings ...)
  2025-05-23  0:11 ` [PATCH v4 4/4] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
@ 2025-06-24 19:38 ` Sean Christopherson
  2025-06-24 21:12   ` Sean Christopherson
  4 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-06-24 19:38 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

On Thu, 22 May 2025 17:11:34 -0700, Sean Christopherson wrote:
> Allocate the hashed list of shadow pages dynamically (separate from
> struct kvm), and on-demand.  The hashed list is 32KiB, i.e. absolutely
> belongs in a separate allocation, and is worth skipping if KVM isn't
> shadowing guest PTEs for the VM.
> 
> I double checked that padding kvm_arch with a 4KiB array trips the assert,
> but padding with 2KiB does not.  So knock on wood, I finally got the assert
> right.  Maybe.
> 
> [...]

Applied to kvm-x86 mmu, thanks!

[1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
      https://github.com/kvm-x86/linux/commit/1f287a4e7b90
[2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
      https://github.com/kvm-x86/linux/commit/02c6bea57d0d
[3/4] KVM: x86: Use kvzalloc() to allocate VM struct
      https://github.com/kvm-x86/linux/commit/97ad7dd0e53d
[4/4] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
      https://github.com/kvm-x86/linux/commit/59ce4bd2996b

--
https://github.com/kvm-x86/kvm-unit-tests/tree/next


* Re: [PATCH v4 2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  2025-05-28 12:54     ` Sean Christopherson
@ 2025-06-24 19:48       ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-06-24 19:48 UTC (permalink / raw)
  To: Xiaoyao Li; +Cc: Paolo Bonzini, kvm, linux-kernel, Vipin Sharma, James Houghton

On Wed, May 28, 2025, Sean Christopherson wrote:
> On Wed, May 28, 2025, Xiaoyao Li wrote:
> > On 5/23/2025 8:11 AM, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index cbc84c6abc2e..41da2cb1e3f1 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -3882,6 +3882,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> > >   	return r;
> > >   }
> > > +static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
> > > +{
> > > +	typeof(kvm->arch.mmu_page_hash) h;
> > 
> > Out of curiousity, it is uncommon in KVM to use typeof() given that we know
> > what the type actually is. Is there some specific reason?
> 
> I'm pretty sure it's a leftover from various experiments.  IIRC, I was trying to
> do something odd and was having a hard time getting the type right :-)
> 
> I'll drop the typeof() in favor of "struct hlist_head *", using typeof here isn't
> justified and IMO makes the code a bit harder to read.

Gah, I forgot to make the switch to address this when applying.  I'll fixup the commit
and force push; it'll only affect this series (hooray for topic branches).


* Re: [PATCH v4 0/4] KVM: x86: Dynamically allocate hashed page list
  2025-06-24 19:38 ` [PATCH v4 0/4] KVM: x86: Dynamically allocate " Sean Christopherson
@ 2025-06-24 21:12   ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-06-24 21:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma, James Houghton

On Tue, Jun 24, 2025, Sean Christopherson wrote:
> On Thu, 22 May 2025 17:11:34 -0700, Sean Christopherson wrote:
> > Allocate the hashed list of shadow pages dynamically (separate from
> > struct kvm), and on-demand.  The hashed list is 32KiB, i.e. absolutely
> > belongs in a separate allocation, and is worth skipping if KVM isn't
> > shadowing guest PTEs for the VM.
> > 
> > I double checked that padding kvm_arch with a 4KiB array trips the assert,
> > but padding with 2KiB does not.  So knock on wood, I finally got the assert
> > right.  Maybe.
> > 
> > [...]
> 
> Applied to kvm-x86 mmu, thanks!
> 
> [1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
>       https://github.com/kvm-x86/linux/commit/1f287a4e7b90
> [2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
>       https://github.com/kvm-x86/linux/commit/02c6bea57d0d
> [3/4] KVM: x86: Use kvzalloc() to allocate VM struct
>       https://github.com/kvm-x86/linux/commit/97ad7dd0e53d
> [4/4] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
>       https://github.com/kvm-x86/linux/commit/59ce4bd2996b

New hashes after a force push to fixup the typeof() oddity:

[1/4] KVM: TDX: Move TDX hardware setup from main.c to tdx.c
      https://github.com/kvm-x86/linux/commit/1f287a4e7b90
[2/4] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
      https://github.com/kvm-x86/linux/commit/039ef33e2f93
[3/4] KVM: x86: Use kvzalloc() to allocate VM struct
      https://github.com/kvm-x86/linux/commit/ac777fbf064f
[4/4] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
      https://github.com/kvm-x86/linux/commit/9c4fe6d1509b

