* [PATCH 0/2] Introduce configuration for LRU cache of previous CR3s
@ 2024-10-29  3:13 Yong He
  2024-10-29  3:13 ` [PATCH 1/2] KVM: x86: expand the " Yong He
  2024-10-29  3:14 ` [PATCH 2/2] KVM: x86: introduce cache configurations for " Yong He

From: Yong He @ 2024-10-29  3:13 UTC
To: pbonzini, seanjc, kvm
Cc: wanpengli, alexyonghe, junaids

From: Yong He <alexyonghe@tencent.com>

When running workloads inside a VM without EPT support, we found that
shadow page table rebuilds are very frequent even with only 3 processes
running inside the VM; in particular, kvm_mmu_free_roots() is invoked
frequently. PTI is enabled inside the VM, so the 3 processes use 6
valid CR3s, but there are only 3 entries in the LRU cache of previous
CR3s. As a result the cache almost always misses and the shadow page
tables are frequently rebuilt.

This series enlarges the LRU cache of previous CR3s and introduces a
parameter for it, so that users can enlarge the cache when needed.

Below are lmbench3 context switch latency results, measured on an Ice
Lake server. After enlarging the LRU cache, switch latency is reduced
by 14%~18%:

  process number   2      3      4      5      6      7      8
  LRU cache = 3    4.857  6.802  7.518  7.836  7.770  7.287  7.271
  LRU cache = 11   4.654  5.518  6.292  6.516  6.512  7.135  7.270

Also, kvm_mmu_free_roots() invocations dropped from 7k+ to 60 when
running the latency test with 4 processes.

Yong He (2):
  KVM: x86: expand the LRU cache of previous CR3s
  KVM: x86: introduce cache configurations for previous CR3s

 arch/x86/include/asm/kvm_host.h |  7 +++---
 arch/x86/kvm/mmu.h              |  1 +
 arch/x86/kvm/mmu/mmu.c          | 40 +++++++++++++++++++++++----------
 arch/x86/kvm/vmx/nested.c       |  4 ++--
 arch/x86/kvm/x86.c              |  2 +-
 5 files changed, 36 insertions(+), 18 deletions(-)

-- 
2.43.5
* [PATCH 1/2] KVM: x86: expand the LRU cache of previous CR3s

From: Yong He @ 2024-10-29  3:13 UTC
To: pbonzini, seanjc, kvm
Cc: wanpengli, alexyonghe, junaids

From: Yong He <alexyonghe@tencent.com>

Expand max number of LRU cache of previous CR3s, so that
we could cache more entry when needed, such as KPTI is
enabled inside guest.

Signed-off-by: Yong He <alexyonghe@tencent.com>
---
 arch/x86/include/asm/kvm_host.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a68cb3eb..02528d4a2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -430,11 +430,12 @@ struct kvm_mmu_root_info {
 #define KVM_MMU_ROOT_INFO_INVALID \
 	((struct kvm_mmu_root_info) { .pgd = INVALID_PAGE, .hpa = INVALID_PAGE })
 
-#define KVM_MMU_NUM_PREV_ROOTS 3
+#define KVM_MMU_NUM_PREV_ROOTS 3
+#define KVM_MMU_NUM_PREV_ROOTS_MAX 11
 
 #define KVM_MMU_ROOT_CURRENT		BIT(0)
 #define KVM_MMU_ROOT_PREVIOUS(i)	BIT(1+i)
-#define KVM_MMU_ROOTS_ALL		(BIT(1 + KVM_MMU_NUM_PREV_ROOTS) - 1)
+#define KVM_MMU_ROOTS_ALL		(BIT(1 + KVM_MMU_NUM_PREV_ROOTS_MAX) - 1)
 
 #define KVM_HAVE_MMU_RWLOCK
 
@@ -469,7 +470,7 @@ struct kvm_mmu {
 	 */
 	u32 pkru_mask;
 
-	struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
+	struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS_MAX];
 
 	/*
 	 * Bitmap; bit set = permission fault
-- 
2.43.5
* Re: [PATCH 1/2] KVM: x86: expand the LRU cache of previous CR3s

From: Sean Christopherson @ 2024-10-29 14:40 UTC
To: Yong He
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

KVM: x86/mmu:

On Tue, Oct 29, 2024, Yong He wrote:
> From: Yong He <alexyonghe@tencent.com>
>
> Expand max number of LRU cache of previous CR3s, so that
> we could cache more entry when needed, such as KPTI is

No "we".  Documentation/process/maintainer-kvm-x86.rst

And I would argue this changelog is misleading.  I was expecting that the patch
would actually change the number of roots that KVM caches, whereas this simply
increases the capacity.  The changelog should also mention that the whole reason
for doing so is to allow for a module param.

Something like:

  KVM: x86/mmu: Expand max capacity of per-MMU CR3/PGD caches

  Expand the maximum capacity of the "previous roots" cache in kvm_mmu so
  that a future patch can make the number of roots configurable via module
  param, without needing to dynamically allocate the array.

That said, I hope we can avoid this entirely.  More in the next patch.
* Re: [PATCH 1/2] KVM: x86: expand the LRU cache of previous CR3s

From: zhuangel570 @ 2024-10-30 12:08 UTC
To: Sean Christopherson
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Wed, Oct 30, 2024 at 3:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> KVM: x86/mmu:
>
> On Tue, Oct 29, 2024, Yong He wrote:
> > From: Yong He <alexyonghe@tencent.com>
> >
> > Expand max number of LRU cache of previous CR3s, so that
> > we could cache more entry when needed, such as KPTI is
>
> No "we".  Documentation/process/maintainer-kvm-x86.rst
>
> And I would argue this changelog is misleading.  I was expecting that the patch
> would actually change the number of roots that KVM caches, whereas this simply
> increases the capacity.  The changelog should also mention that the whole reason
> for doing so is to allow for a module param.
>
> Something like:
>
>   KVM: x86/mmu: Expand max capacity of per-MMU CR3/PGD caches
>
>   Expand the maximum capacity of the "previous roots" cache in kvm_mmu so
>   that a future patch can make the number of roots configurable via module
>   param, without needing to dynamically allocate the array.
>
> That said, I hope we can avoid this entirely.  More in the next patch.

Thanks for pointing out the problem, will fix in next version.
* [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: Yong He @ 2024-10-29  3:14 UTC
To: pbonzini, seanjc, kvm
Cc: wanpengli, alexyonghe, junaids

From: Yong He <alexyonghe@tencent.com>

Introduce prev_roots_num param, so that we use more cache of
previous CR3/root_hpa pairs, which help us to reduce shadow
page table evict and rebuild overhead.

Signed-off-by: Yong He <alexyonghe@tencent.com>
---
 arch/x86/kvm/mmu.h        |  1 +
 arch/x86/kvm/mmu/mmu.c    | 40 +++++++++++++++++++++++++++------------
 arch/x86/kvm/vmx/nested.c |  4 ++--
 arch/x86/kvm/x86.c        |  2 +-
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 4341e0e28..e5615433a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -7,6 +7,7 @@
 #include "cpuid.h"
 
 extern bool __read_mostly enable_mmio_caching;
+extern uint __read_mostly prev_roots_num;
 
 #define PT_WRITABLE_SHIFT 1
 #define PT_USER_SHIFT 2
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7813d28b0..2acc24dd2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -96,6 +96,22 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
 static bool __read_mostly force_flush_and_sync_on_reuse;
 module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
 
+static int prev_roots_num_param(const char *val, const struct kernel_param *kp)
+{
+	return param_set_uint_minmax(val, kp, KVM_MMU_NUM_PREV_ROOTS, KVM_MMU_NUM_PREV_ROOTS_MAX);
+}
+
+static const struct kernel_param_ops prev_roots_num_ops = {
+	.set = prev_roots_num_param,
+	.get = param_get_uint,
+};
+
+uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
+EXPORT_SYMBOL_GPL(prev_roots_num);
+module_param_cb(prev_roots_num, &prev_roots_num_ops,
+		&prev_roots_num, 0644);
+__MODULE_PARM_TYPE(prev_roots_num, "uint");
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -3594,12 +3610,12 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 		&& VALID_PAGE(mmu->root.hpa);
 
 	if (!free_active_root) {
-		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+		for (i = 0; i < prev_roots_num; i++)
 			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
 			    VALID_PAGE(mmu->prev_roots[i].hpa))
 				break;
 
-		if (i == KVM_MMU_NUM_PREV_ROOTS)
+		if (i == prev_roots_num)
 			return;
 	}
 
@@ -3608,7 +3624,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 	else
 		write_lock(&kvm->mmu_lock);
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
 			mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa,
 					   &invalid_list);
@@ -3655,7 +3671,7 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
 	 */
 	WARN_ON_ONCE(mmu->root_role.guest_mode);
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		root_hpa = mmu->prev_roots[i].hpa;
 		if (!VALID_PAGE(root_hpa))
 			continue;
@@ -4066,7 +4082,7 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
 	unsigned long roots_to_free = 0;
 	int i;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 
@@ -4814,7 +4830,7 @@ static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *m
 	if (is_root_usable(&mmu->root, new_pgd, new_role))
 		return true;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		/*
 		 * The swaps end up rotating the cache like this:
 		 *   C   0 1 2 3   (on entry to the function)
@@ -4845,7 +4861,7 @@ static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mm
 {
 	uint i;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (is_root_usable(&mmu->prev_roots[i], new_pgd, new_role))
 			goto hit;
 
@@ -4854,7 +4870,7 @@ static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mm
 hit:
 	swap(mmu->root, mmu->prev_roots[i]);
 	/* Bubble up the remaining roots. */
-	for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
+	for (; i < prev_roots_num - 1; i++)
 		mmu->prev_roots[i] = mmu->prev_roots[i + 1];
 	mmu->prev_roots[i].hpa = INVALID_PAGE;
 	return true;
@@ -5795,7 +5811,7 @@ static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
 	if (is_obsolete_root(kvm, mmu->root.hpa))
 		roots_to_free |= KVM_MMU_ROOT_CURRENT;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 	}
@@ -6125,7 +6141,7 @@ void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	if (roots & KVM_MMU_ROOT_CURRENT)
 		__kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->root.hpa);
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		if (roots & KVM_MMU_ROOT_PREVIOUS(i))
 			__kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
 	}
@@ -6159,7 +6175,7 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
 	if (pcid == kvm_get_active_pcid(vcpu))
 		roots |= KVM_MMU_ROOT_CURRENT;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
 		    pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd))
 			roots |= KVM_MMU_ROOT_PREVIOUS(i);
@@ -6271,7 +6287,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 
 	/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2392a7ef2..d7e375c34 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -394,7 +394,7 @@ static void nested_ept_invalidate_addr(struct kvm_vcpu *vcpu, gpa_t eptp,
 	WARN_ON_ONCE(!mmu_is_nested(vcpu));
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		cached_root = &vcpu->arch.mmu->prev_roots[i];
 
 		if (nested_ept_root_matches(cached_root->hpa, cached_root->pgd,
@@ -5820,7 +5820,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 				    operand.eptp))
 			roots_to_free |= KVM_MMU_ROOT_CURRENT;
 
-		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+		for (i = 0; i < prev_roots_num; i++) {
 			if (nested_ept_root_matches(mmu->prev_roots[i].hpa,
 						    mmu->prev_roots[i].pgd,
 						    operand.eptp))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c983c8e43..047cf66da 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1235,7 +1235,7 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 	if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE))
 		return;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 
-- 
2.43.5
* Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: Sean Christopherson @ 2024-10-29 15:14 UTC
To: Yong He
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Tue, Oct 29, 2024, Yong He wrote:
> From: Yong He <alexyonghe@tencent.com>
>
> Introduce prev_roots_num param, so that we use more cache of
> previous CR3/root_hpa pairs, which help us to reduce shadow
> page table evict and rebuild overhead.
>
> Signed-off-by: Yong He <alexyonghe@tencent.com>
> ---

...

> +uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
> +EXPORT_SYMBOL_GPL(prev_roots_num);
> +module_param_cb(prev_roots_num, &prev_roots_num_ops,
> +		&prev_roots_num, 0644);

Allowing the variable to be changed while KVM is running is unsafe.

I also think a module param is the wrong way to try to allow for bigger caches.
The caches themselves are relatively cheap, at 16 bytes per entry.  And I doubt
the cost of searching a larger cache in fast_pgd_switch() would have a measurable
impact, since the most recently used roots will be at the front of the cache,
i.e. only near-misses and misses will be affected.

The only potential downside to larger caches I can think of, is that keeping
root_count elevated would make it more difficult to reclaim shadow pages from
roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
in particular would refuse to reclaim roots.  That shouldn't be problematic for
legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
extremely uncommon in nested TDP).

And for the nested TDP issue, if it's actually a problem, I would *love* to
solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
idea would be to kick all vCPUs if the maximum number of pages has been reached,
have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
It would be a bit more complicated than that, as KVM would need a way to ensure
forward progress, e.g. if the shadow pages limit has been reached with a single
root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.

TL;DR: what if we simply bump the number of cached roots to ~16?
* Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: zhuangel570 @ 2024-10-30 12:51 UTC
To: Sean Christopherson
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Wed, Oct 30, 2024 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 29, 2024, Yong He wrote:
> > From: Yong He <alexyonghe@tencent.com>
> >
> > Introduce prev_roots_num param, so that we use more cache of
> > previous CR3/root_hpa pairs, which help us to reduce shadow
> > page table evict and rebuild overhead.
> >
> > Signed-off-by: Yong He <alexyonghe@tencent.com>
> > ---
>
> ...
>
> > +uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
> > +EXPORT_SYMBOL_GPL(prev_roots_num);
> > +module_param_cb(prev_roots_num, &prev_roots_num_ops,
> > +		&prev_roots_num, 0644);
>
> Allowing the variable to be changed while KVM is running is unsafe.

Sorry, I have no intention of revising prev_roots_num at runtime; changing it
to 0444 is simpler and can improve performance, will update in next version.

> I also think a module param is the wrong way to try to allow for bigger caches.
> The caches themselves are relatively cheap, at 16 bytes per entry.  And I doubt
> the cost of searching a larger cache in fast_pgd_switch() would have a measurable
> impact, since the most recently used roots will be at the front of the cache,
> i.e. only near-misses and misses will be affected.

Maybe the context switch test already shows this: when there are 7 or 8
processes in the guest (all cache misses), there is nearly no performance drop.

> The only potential downside to larger caches I can think of, is that keeping
> root_count elevated would make it more difficult to reclaim shadow pages from
> roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
> in particular would refuse to reclaim roots.  That shouldn't be problematic for
> legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
> nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
> child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
> extremely uncommon in nested TDP).
>
> And for the nested TDP issue, if it's actually a problem, I would *love* to
> solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
> idea would be to kick all vCPUs if the maximum number of pages has been reached,
> have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
> It would be a bit more complicated than that, as KVM would need a way to ensure
> forward progress, e.g. if the shadow pages limit has been reached with a single
> root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.

I'm not very familiar with TDP on TDP.
I think you mean force-freeing cached roots in kvm_mmu_zap_oldest_mmu_pages()
when no mmu pages could be zapped, such as kicking all vCPUs and purging
cached roots.

> TL;DR: what if we simply bump the number of cached roots to ~16?

I set the number to 11 because the number of PCIDs in the guest kernel is 6
(11 + current = 12); when there are more than 6 processes in the guest, the
PCIDs will be reused, and then the cached roots will not easily be hit.  The
context switch case shows no performance gain with 7 and 8 processes.
* Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: Sean Christopherson @ 2024-11-06  1:42 UTC
To: zhuangel570
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Wed, Oct 30, 2024, zhuangel570 wrote:
> On Wed, Oct 30, 2024 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 29, 2024, Yong He wrote:
> > The only potential downside to larger caches I can think of, is that keeping
> > root_count elevated would make it more difficult to reclaim shadow pages from
> > roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
> > in particular would refuse to reclaim roots.  That shouldn't be problematic for
> > legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
> > nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
> > child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
> > extremely uncommon in nested TDP).
> >
> > And for the nested TDP issue, if it's actually a problem, I would *love* to
> > solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
> > idea would be to kick all vCPUs if the maximum number of pages has been reached,
> > have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
> > It would be a bit more complicated than that, as KVM would need a way to ensure
> > forward progress, e.g. if the shadow pages limit has been reached with a single
> > root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.
>
> I'm not very familiar with TDP on TDP.
> I think you mean force-freeing cached roots in kvm_mmu_zap_oldest_mmu_pages()
> when no mmu pages could be zapped, such as kicking all vCPUs and purging
> cached roots.

Not just when no MMU pages could be zapped; any time KVM needs to reclaim MMU
pages due to n_max_mmu_pages.

> > TL;DR: what if we simply bump the number of cached roots to ~16?
>
> I set the number to 11 because the number of PCIDs in the guest kernel is 6
> (11 + current = 12); when there are more than 6 processes in the guest, the
> PCIDs will be reused, and then the cached roots will not easily be hit.  The
> context switch case shows no performance gain with 7 and 8 processes.

Do you control the guest kernel?  If so, it'd be interesting to see what happens
when you bump TLB_NR_DYN_ASIDS in the guest to something higher, and then adjust
KVM to match.

IIRC, Andy arrived at '6' in 10af6235e0d3 ("x86/mm: Implement PCID based
optimization: try to preserve old TLB entries using PCID") because that was the
"sweet spot" for hardware.  E.g. using fewer PCIDs wouldn't fully utilize
hardware, and using more PCIDs would oversubscribe the number of ASID tags too
much.

For KVM shadow paging, the only meaningful limitation is the number of shadow
pages that KVM allows.  E.g. with a sufficiently high n_max_mmu_pages, the guest
could theoretically use hundreds of PCIDs with no ill effects.