public inbox for kvm@vger.kernel.org
* [PATCH v2 0/3] KVM: x86: Dynamically allocate hashed page list
@ 2025-04-01 15:57 Sean Christopherson
  2025-04-01 15:57 ` [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma

Allocate the hashed list of shadow pages dynamically (separate from
struct kvm), and on-demand.  The hashed list is 32KiB, i.e. absolutely
belongs in a separate allocation, and is worth skipping if KVM isn't
shadowing guest PTEs for the VM.

v2:
 - Actually defer allocation when using TDP MMU. [Vipin]
 - Free allocation on MMU teardown. [Vipin]

v1: https://lore.kernel.org/all/20250315024010.2360884-1-seanjc@google.com

Sean Christopherson (3):
  KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list

 arch/x86/include/asm/kvm_host.h |  6 ++--
 arch/x86/kvm/mmu/mmu.c          | 53 +++++++++++++++++++++++++++++----
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/vmx/vmx.c          |  1 +
 arch/x86/kvm/x86.c              |  5 +++-
 5 files changed, 56 insertions(+), 10 deletions(-)


base-commit: 782f9feaa9517caf33186dcdd6b50a8f770ed29b
-- 
2.49.0.472.ge94155a9ec-goog


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  2025-04-01 15:57 [PATCH v2 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
@ 2025-04-01 15:57 ` Sean Christopherson
  2025-04-16 15:53   ` Vipin Sharma
  2025-04-01 15:57 ` [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc() Sean Christopherson
  2025-04-01 15:57 ` [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
  2 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma

Dynamically allocate the (massive) array of hashed lists used to track
shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
all on its own, and is *exactly* an order-3 allocation.  Dynamically
allocating the array will allow allocating "struct kvm" using regular
kmalloc(), and will also allow deferring allocation of the array until
it's actually needed, i.e. until the first shadow root is allocated.

Cc: Vipin Sharma <vipinsh@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/mmu/mmu.c          | 23 ++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  5 ++++-
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a884ab544335..e523d7d8a107 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1333,7 +1333,7 @@ struct kvm_arch {
 	bool has_private_mem;
 	bool has_protected_state;
 	bool pre_fault_allowed;
-	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
+	struct hlist_head *mmu_page_hash;
 	struct list_head active_mmu_pages;
 	/*
 	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
@@ -1985,7 +1985,7 @@ void kvm_mmu_vendor_module_exit(void);
 
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
-void kvm_mmu_init_vm(struct kvm *kvm);
+int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
 void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 63bb77ee1bb1..6b9c72405860 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3880,6 +3880,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	return r;
 }
 
+static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
+{
+	typeof(kvm->arch.mmu_page_hash) h;
+
+	h = kcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
+	if (!h)
+		return -ENOMEM;
+
+	kvm->arch.mmu_page_hash = h;
+	return 0;
+}
+
 static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
@@ -6673,13 +6685,19 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 		kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
 }
 
-void kvm_mmu_init_vm(struct kvm *kvm)
+int kvm_mmu_init_vm(struct kvm *kvm)
 {
+	int r;
+
 	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
+	r = kvm_mmu_alloc_page_hash(kvm);
+	if (r)
+		return r;
+
 	if (tdp_mmu_enabled)
 		kvm_mmu_init_tdp_mmu(kvm);
 
@@ -6690,6 +6708,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 
 	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
 	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+	return 0;
 }
 
 static void mmu_free_vm_memory_caches(struct kvm *kvm)
@@ -6701,6 +6720,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
 {
+	kfree(kvm->arch.mmu_page_hash);
+
 	if (tdp_mmu_enabled)
 		kvm_mmu_uninit_tdp_mmu(kvm);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c841817a914a..4070f9d34521 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12721,7 +12721,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (ret)
 		goto out;
 
-	kvm_mmu_init_vm(kvm);
+	ret = kvm_mmu_init_vm(kvm);
+	if (ret)
+		goto out_cleanup_page_track;
 
 	ret = kvm_x86_call(vm_init)(kvm);
 	if (ret)
@@ -12774,6 +12776,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 out_uninit_mmu:
 	kvm_mmu_uninit_vm(kvm);
+out_cleanup_page_track:
 	kvm_page_track_cleanup(kvm);
 out:
 	return ret;
-- 
2.49.0.472.ge94155a9ec-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-01 15:57 [PATCH v2 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
  2025-04-01 15:57 ` [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
@ 2025-04-01 15:57 ` Sean Christopherson
  2025-04-16 18:24   ` Vipin Sharma
  2025-04-01 15:57 ` [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
  2 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma

Now that the size of "struct kvm" is less than 2KiB, switch back to using
kzalloc() to allocate the VM structures.  Add compile-time assertions in
vendor code to ensure the size is an order-0 allocation, i.e. to prevent
unknowingly letting the size balloon in the future.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/svm/svm.c          | 1 +
 arch/x86/kvm/vmx/vmx.c          | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e523d7d8a107..6c7fd7db6f11 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1940,7 +1940,7 @@ void kvm_x86_vendor_exit(void);
 #define __KVM_HAVE_ARCH_VM_ALLOC
 static inline struct kvm *kvm_arch_alloc_vm(void)
 {
-	return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	return kzalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT);
 }
 
 #define __KVM_HAVE_ARCH_VM_FREE
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8abeab91d329..589adc5f92e0 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5536,6 +5536,7 @@ static int __init svm_init(void)
 	if (r)
 		goto err_kvm_init;
 
+	BUILD_BUG_ON(get_order(sizeof(struct kvm_svm) != 0));
 	return 0;
 
 err_kvm_init:
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b70ed72c1783..01264842bf45 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8755,6 +8755,7 @@ static int __init vmx_init(void)
 	if (r)
 		goto err_kvm_init;
 
+	BUILD_BUG_ON(get_order(sizeof(struct kvm_vmx) != 0));
 	return 0;
 
 err_kvm_init:
-- 
2.49.0.472.ge94155a9ec-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
  2025-04-01 15:57 [PATCH v2 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
  2025-04-01 15:57 ` [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
  2025-04-01 15:57 ` [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc() Sean Christopherson
@ 2025-04-01 15:57 ` Sean Christopherson
  2025-04-15 20:06   ` Vipin Sharma
  2 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-01 15:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini; +Cc: kvm, linux-kernel, Vipin Sharma

When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a
nested TDP VM is run, defer allocation of the array of hashed lists used
to track shadow MMU pages until the first shadow root is allocated.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6b9c72405860..213009cdba15 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1982,14 +1982,25 @@ static bool sp_has_gptes(struct kvm_mmu_page *sp)
 	return true;
 }
 
+static __ro_after_init HLIST_HEAD(empty_page_hash);
+
+static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
+{
+	struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
+
+	if (!page_hash)
+		return &empty_page_hash;
+
+	return &page_hash[kvm_page_table_hashfn(gfn)];
+}
+
 #define for_each_valid_sp(_kvm, _sp, _list)				\
 	hlist_for_each_entry(_sp, _list, hash_link)			\
 		if (is_obsolete_sp((_kvm), (_sp))) {			\
 		} else
 
 #define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn)		\
-	for_each_valid_sp(_kvm, _sp,					\
-	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)])	\
+	for_each_valid_sp(_kvm, _sp, kvm_get_mmu_page_hash(_kvm, _gfn))	\
 		if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
 
 static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
@@ -2357,6 +2368,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
 	struct kvm_mmu_page *sp;
 	bool created = false;
 
+	BUG_ON(!kvm->arch.mmu_page_hash);
 	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
 
 	sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
@@ -3884,11 +3896,14 @@ static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
 {
 	typeof(kvm->arch.mmu_page_hash) h;
 
+	if (kvm->arch.mmu_page_hash)
+		return 0;
+
 	h = kcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
 	if (!h)
 		return -ENOMEM;
 
-	kvm->arch.mmu_page_hash = h;
+	WRITE_ONCE(kvm->arch.mmu_page_hash, h);
 	return 0;
 }
 
@@ -3911,9 +3926,13 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 	if (kvm_shadow_root_allocated(kvm))
 		goto out_unlock;
 
+	r = kvm_mmu_alloc_page_hash(kvm);
+	if (r)
+		goto out_unlock;
+
 	/*
-	 * Check if anything actually needs to be allocated, e.g. all metadata
-	 * will be allocated upfront if TDP is disabled.
+	 * Check if memslot metadata actually needs to be allocated, e.g. all
+	 * metadata will be allocated upfront if TDP is disabled.
 	 */
 	if (kvm_memslots_have_rmaps(kvm) &&
 	    kvm_page_track_write_tracking_enabled(kvm))
@@ -6694,12 +6713,13 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
-	r = kvm_mmu_alloc_page_hash(kvm);
-	if (r)
-		return r;
-
-	if (tdp_mmu_enabled)
+	if (tdp_mmu_enabled) {
 		kvm_mmu_init_tdp_mmu(kvm);
+	} else {
+		r = kvm_mmu_alloc_page_hash(kvm);
+		if (r)
+			return r;
+	}
 
 	kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
 	kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
-- 
2.49.0.472.ge94155a9ec-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
  2025-04-01 15:57 ` [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
@ 2025-04-15 20:06   ` Vipin Sharma
  2025-04-15 21:52     ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Vipin Sharma @ 2025-04-15 20:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On 2025-04-01 08:57:14, Sean Christopherson wrote:
> +static __ro_after_init HLIST_HEAD(empty_page_hash);
> +
> +static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
> +{
> +	struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
> +
> +	if (!page_hash)
> +		return &empty_page_hash;
> +
> +	return &page_hash[kvm_page_table_hashfn(gfn)];
> +}
> +
>  
> @@ -2357,6 +2368,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
>  	struct kvm_mmu_page *sp;
>  	bool created = false;
>  
> +	BUG_ON(!kvm->arch.mmu_page_hash);
>  	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];

Why do we need READ_ONCE() at kvm_get_mmu_page_hash() but not here? My
understanding is that it is in kvm_get_mmu_page_hash() to avoid the compiler
doing any read tear. If yes, then the same condition is valid here,
isn't it?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
  2025-04-15 20:06   ` Vipin Sharma
@ 2025-04-15 21:52     ` Sean Christopherson
  2025-04-22  0:24       ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-15 21:52 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Apr 15, 2025, Vipin Sharma wrote:
> On 2025-04-01 08:57:14, Sean Christopherson wrote:
> > +static __ro_after_init HLIST_HEAD(empty_page_hash);
> > +
> > +static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
> > +
> > +	if (!page_hash)
> > +		return &empty_page_hash;
> > +
> > +	return &page_hash[kvm_page_table_hashfn(gfn)];
> > +}
> > +
> >  
> > @@ -2357,6 +2368,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> >  	struct kvm_mmu_page *sp;
> >  	bool created = false;
> >  
> > +	BUG_ON(!kvm->arch.mmu_page_hash);
> >  	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> 
> Why do we need READ_ONCE() at kvm_get_mmu_page_hash() but not here?

We don't (need it in kvm_get_mmu_page_hash()).  I suspect past me was thinking
it could be accessed without holding mmu_lock, but that's simply not true.  Unless
I'm forgetting something, I'll drop the READ_ONCE() and WRITE_ONCE() in
kvm_mmu_alloc_page_hash(), and instead assert that mmu_lock is held for write.

> My understanding is that it is in kvm_get_mmu_page_hash() to avoid the compiler
> doing any read tear. If yes, then the same condition is valid here, isn't it?

The intent wasn't to guard against a tear, but to instead ensure mmu_page_hash
couldn't be re-read and end up with a NULL pointer deref, e.g. if KVM set
mmu_page_hash and then nullified it because some later step failed.  But if
mmu_lock is held for write, that is simply impossible.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list
  2025-04-01 15:57 ` [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
@ 2025-04-16 15:53   ` Vipin Sharma
  0 siblings, 0 replies; 16+ messages in thread
From: Vipin Sharma @ 2025-04-16 15:53 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On 2025-04-01 08:57:12, Sean Christopherson wrote:
> Dynamically allocate the (massive) array of hashed lists used to track
> shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation
> all on its own, and is *exactly* an order-3 allocation.  Dynamically
> allocating the array will allow allocating "struct kvm" using regular
> kmalloc(), and will also allow deferring allocation of the array until
> it's actually needed, i.e. until the first shadow root is allocated.
> 
> Cc: Vipin Sharma <vipinsh@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 ++--
>  arch/x86/kvm/mmu/mmu.c          | 23 ++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |  5 ++++-
>  3 files changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index a884ab544335..e523d7d8a107 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1333,7 +1333,7 @@ struct kvm_arch {
>  	bool has_private_mem;
>  	bool has_protected_state;
>  	bool pre_fault_allowed;
> -	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> +	struct hlist_head *mmu_page_hash;
>  	struct list_head active_mmu_pages;
>  	/*
>  	 * A list of kvm_mmu_page structs that, if zapped, could possibly be
> @@ -1985,7 +1985,7 @@ void kvm_mmu_vendor_module_exit(void);
>  
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
>  int kvm_mmu_create(struct kvm_vcpu *vcpu);
> -void kvm_mmu_init_vm(struct kvm *kvm);
> +int kvm_mmu_init_vm(struct kvm *kvm);
>  void kvm_mmu_uninit_vm(struct kvm *kvm);
>  
>  void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 63bb77ee1bb1..6b9c72405860 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3880,6 +3880,18 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  	return r;
>  }
>  
> +static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
> +{
> +	typeof(kvm->arch.mmu_page_hash) h;
> +
> +	h = kcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
> +	if (!h)
> +		return -ENOMEM;
> +
> +	kvm->arch.mmu_page_hash = h;
> +	return 0;
> +}
> +
>  static int mmu_first_shadow_root_alloc(struct kvm *kvm)
>  {
>  	struct kvm_memslots *slots;
> @@ -6673,13 +6685,19 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  		kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
>  }
>  
> -void kvm_mmu_init_vm(struct kvm *kvm)
> +int kvm_mmu_init_vm(struct kvm *kvm)
>  {
> +	int r;
> +
>  	kvm->arch.shadow_mmio_value = shadow_mmio_value;
>  	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>  	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
>  	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>  
> +	r = kvm_mmu_alloc_page_hash(kvm);
> +	if (r)
> +		return r;
> +
>  	if (tdp_mmu_enabled)
>  		kvm_mmu_init_tdp_mmu(kvm);
>  
> @@ -6690,6 +6708,7 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>  
>  	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
>  	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
> +	return 0;
>  }
>  
>  static void mmu_free_vm_memory_caches(struct kvm *kvm)
> @@ -6701,6 +6720,8 @@ static void mmu_free_vm_memory_caches(struct kvm *kvm)
>  
>  void kvm_mmu_uninit_vm(struct kvm *kvm)
>  {
> +	kfree(kvm->arch.mmu_page_hash);
> +
>  	if (tdp_mmu_enabled)
>  		kvm_mmu_uninit_tdp_mmu(kvm);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c841817a914a..4070f9d34521 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12721,7 +12721,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	if (ret)
>  		goto out;
>  
> -	kvm_mmu_init_vm(kvm);
> +	ret = kvm_mmu_init_vm(kvm);
> +	if (ret)
> +		goto out_cleanup_page_track;
>  
>  	ret = kvm_x86_call(vm_init)(kvm);
>  	if (ret)
> @@ -12774,6 +12776,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  out_uninit_mmu:
>  	kvm_mmu_uninit_vm(kvm);
> +out_cleanup_page_track:
>  	kvm_page_track_cleanup(kvm);
>  out:
>  	return ret;
> -- 
> 2.49.0.472.ge94155a9ec-goog
> 

Reviewed-by: Vipin Sharma <vipinsh@google.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-01 15:57 ` [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc() Sean Christopherson
@ 2025-04-16 18:24   ` Vipin Sharma
  2025-04-16 19:06     ` Vipin Sharma
  0 siblings, 1 reply; 16+ messages in thread
From: Vipin Sharma @ 2025-04-16 18:24 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On 2025-04-01 08:57:13, Sean Christopherson wrote:
> Now that the size of "struct kvm" is less than 2KiB, switch back to using
> kzalloc() to allocate the VM structures.  Add compile-time assertions in
> vendor code to ensure the size is an order-0 allocation, i.e. to prevent
> unknowingly letting the size balloon in the future.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 2 +-
>  arch/x86/kvm/svm/svm.c          | 1 +
>  arch/x86/kvm/vmx/vmx.c          | 1 +
>  3 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e523d7d8a107..6c7fd7db6f11 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1940,7 +1940,7 @@ void kvm_x86_vendor_exit(void);
>  #define __KVM_HAVE_ARCH_VM_ALLOC
>  static inline struct kvm *kvm_arch_alloc_vm(void)
>  {
> -	return __vmalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +	return kzalloc(kvm_x86_ops.vm_size, GFP_KERNEL_ACCOUNT);
>  }
>  
>  #define __KVM_HAVE_ARCH_VM_FREE
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 8abeab91d329..589adc5f92e0 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -5536,6 +5536,7 @@ static int __init svm_init(void)
>  	if (r)
>  		goto err_kvm_init;
>  
> +	BUILD_BUG_ON(get_order(sizeof(struct kvm_svm) != 0));

There is a typo here. It is checking sizeof(struct kvm_svm) != 0, instead
of checking get_order(...) != 0.

>  	return 0;
>  
>  err_kvm_init:
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b70ed72c1783..01264842bf45 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8755,6 +8755,7 @@ static int __init vmx_init(void)
>  	if (r)
>  		goto err_kvm_init;
>  
> +	BUILD_BUG_ON(get_order(sizeof(struct kvm_vmx) != 0));

Same as above.

>  	return 0;
>  
>  err_kvm_init:
> -- 
> 2.49.0.472.ge94155a9ec-goog
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-16 18:24   ` Vipin Sharma
@ 2025-04-16 19:06     ` Vipin Sharma
  2025-04-16 19:57       ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Vipin Sharma @ 2025-04-16 19:06 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel

On 2025-04-16 11:24:37, Vipin Sharma wrote:
> On 2025-04-01 08:57:13, Sean Christopherson wrote:
> >  
> > +	BUILD_BUG_ON(get_order(sizeof(struct kvm_svm) != 0));
> 
> There is a typo here. It is checking sizeof(struct kvm_svm) != 0, instead
> of checking get_order(...) != 0.
> 
> >  	return 0;
> >  
> >  err_kvm_init:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index b70ed72c1783..01264842bf45 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -8755,6 +8755,7 @@ static int __init vmx_init(void)
> >  	if (r)
> >  		goto err_kvm_init;
> >  
> > +	BUILD_BUG_ON(get_order(sizeof(struct kvm_vmx) != 0));
> 
> Same as above.
> 

After fixing the typo, the build fails.

Checked via pahole, the struct sizes have shrunk but are still not under 4KiB.
After applying the patch:

struct kvm{} - 4104
struct kvm_svm{} - 4320
struct kvm_vmx{} - 4128

Also, this BUILD_BUG_ON() might not be reliable unless all of the ifdefs
under kvm_[vmx|svm] and its children are enabled. Won't that be an
issue?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-16 19:06     ` Vipin Sharma
@ 2025-04-16 19:57       ` Sean Christopherson
  2025-04-22 22:53         ` Huang, Kai
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-16 19:57 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: Paolo Bonzini, kvm, linux-kernel

On Wed, Apr 16, 2025, Vipin Sharma wrote:
> On 2025-04-16 11:24:37, Vipin Sharma wrote:
> > On 2025-04-01 08:57:13, Sean Christopherson wrote:
> > >  
> > > +	BUILD_BUG_ON(get_order(sizeof(struct kvm_svm) != 0));
> > 
> > There is a typo here. It is checking sizeof(struct kvm_svm) != 0, instead
> > of checking get_order(...) != 0.
> > 
> > >  	return 0;
> > >  
> > >  err_kvm_init:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index b70ed72c1783..01264842bf45 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -8755,6 +8755,7 @@ static int __init vmx_init(void)
> > >  	if (r)
> > >  		goto err_kvm_init;
> > >  
> > > +	BUILD_BUG_ON(get_order(sizeof(struct kvm_vmx) != 0));
> > 
> > Same as above.

Ugh.  That's what I get for violating the kernel's "don't check for '0'" rule
(I thought it would make the code more understandable).  Bad me.

> After fixing the typo, the build fails.
> 
> Checked via pahole, the struct sizes have shrunk but are still not under 4KiB.
> After applying the patch:
> 
> struct kvm{} - 4104
> struct kvm_svm{} - 4320
> struct kvm_vmx{} - 4128
> 
> Also, this BUILD_BUG_ON() might not be reliable unless all of the ifdefs
> under kvm_[vmx|svm] and its children are enabled. Won't that be an
> issue?

That's what build bots (and to a lesser extent, maintainers) are for.  An individual
developer might miss a particular config, but the build bots that run allyesconfig
will very quickly detect the issue, and then we fix it.

I also build what is effectively an "allkvmconfig" before officially applying
anything, so in general things like this shouldn't even make it to the bots.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
  2025-04-15 21:52     ` Sean Christopherson
@ 2025-04-22  0:24       ` Sean Christopherson
  2025-04-25 17:45         ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-22  0:24 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: Paolo Bonzini, kvm, linux-kernel

On Tue, Apr 15, 2025, Sean Christopherson wrote:
> On Tue, Apr 15, 2025, Vipin Sharma wrote:
> > On 2025-04-01 08:57:14, Sean Christopherson wrote:
> > > +static __ro_after_init HLIST_HEAD(empty_page_hash);
> > > +
> > > +static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > +	struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
> > > +
> > > +	if (!page_hash)
> > > +		return &empty_page_hash;
> > > +
> > > +	return &page_hash[kvm_page_table_hashfn(gfn)];
> > > +}
> > > +
> > >  
> > > @@ -2357,6 +2368,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> > >  	struct kvm_mmu_page *sp;
> > >  	bool created = false;
> > >  
> > > +	BUG_ON(!kvm->arch.mmu_page_hash);
> > >  	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > 
> > Why do we need READ_ONCE() at kvm_get_mmu_page_hash() but not here?
> 
> We don't (need it in kvm_get_mmu_page_hash()).  I suspect past me was thinking
> it could be accessed without holding mmu_lock, but that's simply not true.  Unless
> I'm forgetting something, I'll drop the READ_ONCE() and WRITE_ONCE() in
> kvm_mmu_alloc_page_hash(), and instead assert that mmu_lock is held for write.

I remembered what I was trying to do.  The _writer_, kvm_mmu_alloc_page_hash(),
doesn't hold mmu_lock, and so the READ/WRITE_ONCE() is needed.

But looking at this again, there's really no point in such games.  All readers
hold mmu_lock for write, so kvm_mmu_alloc_page_hash() can take mmu_lock for read
to ensure correctness.  That's far easier to reason about than taking a dependency
on shadow_root_allocated.

For performance, taking mmu_lock for read is unlikely to generate contention, as
this is only reachable at runtime if the TDP MMU is enabled.  And mmu_lock is
going to be taken for write anyways (to allocate the shadow root).

> > My understanding is that it is in kvm_get_mmu_page_hash() to avoid the compiler
> > doing any read tear. If yes, then the same condition is valid here, isn't it?
> 
> The intent wasn't to guard against a tear, but to instead ensure mmu_page_hash
> couldn't be re-read and end up with a NULL pointer deref, e.g. if KVM set
> mmu_page_hash and then nullified it because some later step failed.  But if
> mmu_lock is held for write, that is simply impossible.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-16 19:57       ` Sean Christopherson
@ 2025-04-22 22:53         ` Huang, Kai
  2025-04-23 17:07           ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Kai @ 2025-04-22 22:53 UTC (permalink / raw)
  To: vipinsh@google.com, seanjc@google.com
  Cc: kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Wed, 2025-04-16 at 12:57 -0700, Sean Christopherson wrote:
> On Wed, Apr 16, 2025, Vipin Sharma wrote:
> > On 2025-04-16 11:24:37, Vipin Sharma wrote:
> > > On 2025-04-01 08:57:13, Sean Christopherson wrote:
> > > >  
> > > > +	BUILD_BUG_ON(get_order(sizeof(struct kvm_svm) != 0));
> > > 
> > > There is a typo here. It is checking sizeof(struct kvm_svm) != 0, instead
> > > of checking get_order(...) != 0.
> > > 
> > > >  	return 0;
> > > >  
> > > >  err_kvm_init:
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index b70ed72c1783..01264842bf45 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -8755,6 +8755,7 @@ static int __init vmx_init(void)
> > > >  	if (r)
> > > >  		goto err_kvm_init;
> > > >  
> > > > +	BUILD_BUG_ON(get_order(sizeof(struct kvm_vmx) != 0));
> > > 
> > > Same as above.
> 
> Ugh.  That's what I get for violating the kernel's "don't check for '0'" rule
> (I thought it would make the code more understandable).  Bad me.
> 
> > After fixing the typo, the build fails.
> > 
> > Checked via pahole, the struct sizes have shrunk but are still not under 4KiB.
> > After applying the patch:
> > 
> > struct kvm{} - 4104
> > struct kvm_svm{} - 4320
> > struct kvm_vmx{} - 4128
> > 
> > Also, this BUILD_BUG_ON() might not be reliable unless all of the ifdefs
> > under kvm_[vmx|svm] and its children are enabled. Won't that be an
> > issue?
> 
> That's what build bots (and to a lesser extent, maintainers) are for.  An individual
> developer might miss a particular config, but the build bots that run allyesconfig
> will very quickly detect the issue, and then we fix it.
> 
> I also build what is effectively an "allkvmconfig" before officially applying
> anything, so in general things like this shouldn't even make it to the bots.
> 

Just want to understand the intention here:

What if someday a developer really needs to add some new field(s) to, let's say,
'struct kvm_vmx', and that makes the size exceed 4K?

What should the developer do?  Is it a hard requirement that the size should
never go beyond 4K?  Or, should the assert of order 0 allocation be changed to
the assert of order 1 allocation?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-22 22:53         ` Huang, Kai
@ 2025-04-23 17:07           ` Sean Christopherson
  2025-04-23 21:46             ` Huang, Kai
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2025-04-23 17:07 UTC (permalink / raw)
  To: Kai Huang
  Cc: vipinsh@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Tue, Apr 22, 2025, Kai Huang wrote:
> On Wed, 2025-04-16 at 12:57 -0700, Sean Christopherson wrote:
> > On Wed, Apr 16, 2025, Vipin Sharma wrote:
> > > Checked via pahole: the struct sizes have been reduced, but are still not under 4K.
> > > After applying the patch:
> > > 
> > > struct kvm{} - 4104
> > > struct kvm_svm{} - 4320
> > > struct kvm_vmx{} - 4128
> > > 
> > > Also, this BUILD_BUG_ON() might not be reliable unless all of the ifdefs
> > > under kvm_[vmx|svm] and its children are enabled. Won't that be an
> > > issue?
> > 
> > That's what build bots (and to a lesser extent, maintainers) are for.  An individual
> > developer might miss a particular config, but the build bots that run allyesconfig
> > will very quickly detect the issue, and then we fix it.
> > 
> > I also build what is effectively an "allkvmconfig" before officially applying
> > anything, so in general things like this shouldn't even make it to the bots.
> > 
> 
> Just want to understand the intention here:
> 
> What if someday a developer really needs to add some new field(s) to, let's say
> 'struct kvm_vmx', and that makes the size exceed 4K?

If it helps, here's the changelog I plan on posting for v3:
    
    Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
    allocation before falling back to __vmalloc(), to avoid the overhead of
    establishing the virtual mappings.  The SVM and VMX (and TDX) structures
    are now just above 4096 bytes, i.e. are order-1 allocations, and will
    likely remain that way for quite some time.
    
    Add compile-time assertions in vendor code to ensure the size is an
    order-0 or order-1 allocation, i.e. to prevent unknowingly letting the
    size balloon in the future.  There's nothing fundamentally wrong with a
    larger kvm_{svm,vmx,tdx} size, but given that the size is barely above
    4096 after 18+ years of existence, exceeding 8192 bytes would be
    quite notable.


> What should the developer do?  Is it a hard requirement that the size should
> never go beyond 4K?  Or, should the assert of order 0 allocation be changed to
> the assert of order 1 allocation?

It depends.  Now that Vipin has corrected my math, the assertion will be that the
VM struct is order-1 or smaller, i.e. <= 8KiB.  That gives us a _lot_ of room to
grow.  E.g. KVM has existed for ~18 years and is barely above 4KiB, so for organic
growth (small additions here and there), I don't expect to hit the 8KiB limit in
the next decade (famous last words).  And the memory landscape will likely be
quite different 10+ years from now, i.e. the assertion may be completely unnecessary
by the time it fires.

What I'm most interested in detecting and preventing is things like mmu_page_hash,
where a massive field is embedded in struct kvm for an *optional* feature.  I.e.
if a new feature adds a massive field, then it should probably be placed in a
separate, dynamically allocated structure.  And for those, it should be quite
obvious that a separate allocation is the way to go.


* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-23 17:07           ` Sean Christopherson
@ 2025-04-23 21:46             ` Huang, Kai
  2025-04-24 18:31               ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Kai @ 2025-04-23 21:46 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: vipinsh@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Wed, 2025-04-23 at 10:07 -0700, Sean Christopherson wrote:
> On Tue, Apr 22, 2025, Kai Huang wrote:
> > On Wed, 2025-04-16 at 12:57 -0700, Sean Christopherson wrote:
> > > On Wed, Apr 16, 2025, Vipin Sharma wrote:
> > > > Checked via pahole: the struct sizes have been reduced, but are still not under 4K.
> > > > After applying the patch:
> > > > 
> > > > struct kvm{} - 4104
> > > > struct kvm_svm{} - 4320
> > > > struct kvm_vmx{} - 4128
> > > > 
> > > > Also, this BUILD_BUG_ON() might not be reliable unless all of the ifdefs
> > > > under kvm_[vmx|svm] and its children are enabled. Won't that be an
> > > > issue?
> > > 
> > > That's what build bots (and to a lesser extent, maintainers) are for.  An individual
> > > developer might miss a particular config, but the build bots that run allyesconfig
> > > will very quickly detect the issue, and then we fix it.
> > > 
> > > I also build what is effectively an "allkvmconfig" before officially applying
> > > anything, so in general things like this shouldn't even make it to the bots.
> > > 
> > 
> > Just want to understand the intention here:
> > 
> > What if someday a developer really needs to add some new field(s) to, let's say
> > 'struct kvm_vmx', and that makes the size exceed 4K?
> 
> If it helps, here's the changelog I plan on posting for v3:
>     
>     Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
>     allocation before falling back to __vmalloc(), to avoid the overhead of
>     establishing the virtual mappings.  The SVM and VMX (and TDX) structures
>     are now just above 4096 bytes, i.e. are order-1 allocations, and will
>     likely remain that way for quite some time.
>     
>     Add compile-time assertions in vendor code to ensure the size is an
>     order-0 or order-1 allocation, i.e. to prevent unknowingly letting the
>     size balloon in the future.  There's nothing fundamentally wrong with a
>     larger kvm_{svm,vmx,tdx} size, but given that the size is barely above
> >     4096 after 18+ years of existence, exceeding 8192 bytes would be
>     quite notable.

Yeah looks reasonable.

Nit: I am not quite following the "falling back to __vmalloc()" part.  We are
replacing __vmalloc() with kzalloc() AFAICT, therefore there should be no
"falling back"?

> 
> 
> > What should the developer do?  Is it a hard requirement that the size should
> > never go beyond 4K?  Or, should the assert of order 0 allocation be changed to
> > the assert of order 1 allocation?
> 
> It depends.  Now that Vipin has corrected my math, the assertion will be that the
> VM struct is order-1 or smaller, i.e. <= 8KiB.  That gives us a _lot_ of room to
> grow.  E.g. KVM has existed for ~18 years and is barely above 4KiB, so for organic
> growth (small additions here and there), I don't expect to hit the 8KiB limit in
> the next decade (famous last words).  And the memory landscape will likely be
> quite different 10+ years from now, i.e. the assertion may be completely unnecessary
> by the time it fires.
> 
> What I'm most interested in detecting and preventing is things like mmu_page_hash,
> where a massive field is embedded in struct kvm for an *optional* feature.  I.e.
> if a new feature adds a massive field, then it should probably be placed in a
> separate, dynamically allocated structure.  And for those, it should be quite
> obvious that a separate allocation is the way to go.

Agreed.  Thanks for explaining.



* Re: [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc()
  2025-04-23 21:46             ` Huang, Kai
@ 2025-04-24 18:31               ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2025-04-24 18:31 UTC (permalink / raw)
  To: Kai Huang
  Cc: vipinsh@google.com, kvm@vger.kernel.org, pbonzini@redhat.com,
	linux-kernel@vger.kernel.org

On Wed, Apr 23, 2025, Kai Huang wrote:
> On Wed, 2025-04-23 at 10:07 -0700, Sean Christopherson wrote:
> > On Tue, Apr 22, 2025, Kai Huang wrote:
> > > On Wed, 2025-04-16 at 12:57 -0700, Sean Christopherson wrote:
> > > > On Wed, Apr 16, 2025, Vipin Sharma wrote:
> > > > > Checked via pahole: the struct sizes have been reduced, but are still not under 4K.
> > > > > After applying the patch:
> > > > > 
> > > > > struct kvm{} - 4104
> > > > > struct kvm_svm{} - 4320
> > > > > struct kvm_vmx{} - 4128
> > > > > 
> > > > > Also, this BUILD_BUG_ON() might not be reliable unless all of the ifdefs
> > > > > under kvm_[vmx|svm] and its children are enabled. Won't that be an
> > > > > issue?
> > > > 
> > > > That's what build bots (and to a lesser extent, maintainers) are for.  An individual
> > > > developer might miss a particular config, but the build bots that run allyesconfig
> > > > will very quickly detect the issue, and then we fix it.
> > > > 
> > > > I also build what is effectively an "allkvmconfig" before officially applying
> > > > anything, so in general things like this shouldn't even make it to the bots.
> > > > 
> > > 
> > > Just want to understand the intention here:
> > > 
> > > What if someday a developer really needs to add some new field(s) to, let's say
> > > 'struct kvm_vmx', and that makes the size exceed 4K?
> > 
> > If it helps, here's the changelog I plan on posting for v3:
> >     
> >     Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical
> >     allocation before falling back to __vmalloc(), to avoid the overhead of
> >     establishing the virtual mappings.  The SVM and VMX (and TDX) structures
> >     are now just above 4096 bytes, i.e. are order-1 allocations, and will
> >     likely remain that way for quite some time.
> >     
> >     Add compile-time assertions in vendor code to ensure the size is an
> >     order-0 or order-1 allocation, i.e. to prevent unknowingly letting the
> >     size balloon in the future.  There's nothing fundamentally wrong with a
> >     larger kvm_{svm,vmx,tdx} size, but given that the size is barely above
> >     4096 after 18+ years of existence, exceeding 8192 bytes would be
> >     quite notable.
> 
> Yeah looks reasonable.
> 
> Nit: I am not quite following the "falling back to __vmalloc()" part.  We are
> replacing __vmalloc() with kzalloc() AFAICT, therefore there should be no
> "falling back"?

Correct, not in this version.  In the next version, my plan is to use kvzalloc()
(though honestly, I'm not sure that's worth doing; it'll be an order-1 allocation,
and if that fails...).


* Re: [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
  2025-04-22  0:24       ` Sean Christopherson
@ 2025-04-25 17:45         ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2025-04-25 17:45 UTC (permalink / raw)
  To: Vipin Sharma; +Cc: Paolo Bonzini, kvm, linux-kernel

On Mon, Apr 21, 2025, Sean Christopherson wrote:
> On Tue, Apr 15, 2025, Sean Christopherson wrote:
> > On Tue, Apr 15, 2025, Vipin Sharma wrote:
> > > On 2025-04-01 08:57:14, Sean Christopherson wrote:
> > > > +static __ro_after_init HLIST_HEAD(empty_page_hash);
> > > > +
> > > > +static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
> > > > +{
> > > > +	struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
> > > > +
> > > > +	if (!page_hash)
> > > > +		return &empty_page_hash;
> > > > +
> > > > +	return &page_hash[kvm_page_table_hashfn(gfn)];
> > > > +}
> > > > +
> > > >  
> > > > @@ -2357,6 +2368,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> > > >  	struct kvm_mmu_page *sp;
> > > >  	bool created = false;
> > > >  
> > > > +	BUG_ON(!kvm->arch.mmu_page_hash);
> > > >  	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > > 
> > > Why do we need READ_ONCE() at kvm_get_mmu_page_hash() but not here?
> > 
> > We don't (need it in kvm_get_mmu_page_hash()).  I suspect past me was thinking
> > it could be accessed without holding mmu_lock, but that's simply not true.  Unless
> > I'm forgetting something, I'll drop the READ_ONCE() and WRITE_ONCE() in
> > kvm_mmu_alloc_page_hash(), and instead assert that mmu_lock is held for write.
> 
> I remembered what I was trying to do.  The _writer_, kvm_mmu_alloc_page_hash(),
> doesn't hold mmu_lock, and so the READ/WRITE_ONCE() is needed.
> 
> But looking at this again, there's really no point in such games.  All readers
> hold mmu_lock for write, so kvm_mmu_alloc_page_hash() can take mmu_lock for read
> to ensure correctness.  That's far easier to reason about than taking a dependency
> on shadow_root_allocated.
> 
> For performance, taking mmu_lock for read is unlikely to generate contention, as
> this is only reachable at runtime if the TDP MMU is enabled.  And mmu_lock is
> going to be taken for write anyways (to allocate the shadow root).

Wrong again.  After way, way too many failed attempts (I tried some truly stupid
ideas) and staring, I finally remembered why it's a-ok to set arch.mmu_page_hash
outside of mmu_lock, and why it's a-ok for __kvm_mmu_get_shadow_page() to not use
READ_ONCE().  I guess that's my penance for not writing a decent changelog or
comments.

Setting the list outside of mmu_lock is safe, as concurrent readers must hold
mmu_lock in some capacity, shadow pages can only be added (or removed) from the
list when mmu_lock is held for write, and tasks that are creating a shadow root
are serialized by slots_arch_lock.  I.e. it's impossible for the list to become
non-empty until all readers go away, and so readers are guaranteed to see an empty
list even if they make multiple calls to kvm_get_mmu_page_hash() in a single
mmu_lock critical section.

__kvm_mmu_get_shadow_page() doesn't need READ_ONCE() because it's only reachable
after the task has gone through mmu_first_shadow_root_alloc(), i.e. access to
mmu_page_hash in that context is fully serialized by slots_arch_lock.

> > > My understanding is that it is in kvm_get_mmu_page_hash() to avoid compiler
> > > doing any read tear. If yes, then the same condition is valid here, isn't it?
> > 
> > The intent wasn't to guard against a tear, but to instead ensure mmu_page_hash
> > couldn't be re-read and end up with a NULL pointer deref, e.g. if KVM set
> > mmu_page_hash and then nullfied it because some later step failed.  But if
> > mmu_lock is held for write, that is simply impossible.

So yes, you were 100% correct, the only reason for WRITE_ONCE/READ_ONCE is to
ensure the compiler doesn't do something stupid and tear the accesses.


end of thread, other threads:[~2025-04-25 17:45 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-01 15:57 [PATCH v2 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
2025-04-01 15:57 ` [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
2025-04-16 15:53   ` Vipin Sharma
2025-04-01 15:57 ` [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc() Sean Christopherson
2025-04-16 18:24   ` Vipin Sharma
2025-04-16 19:06     ` Vipin Sharma
2025-04-16 19:57       ` Sean Christopherson
2025-04-22 22:53         ` Huang, Kai
2025-04-23 17:07           ` Sean Christopherson
2025-04-23 21:46             ` Huang, Kai
2025-04-24 18:31               ` Sean Christopherson
2025-04-01 15:57 ` [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
2025-04-15 20:06   ` Vipin Sharma
2025-04-15 21:52     ` Sean Christopherson
2025-04-22  0:24       ` Sean Christopherson
2025-04-25 17:45         ` Sean Christopherson
