* [PATCH 0/2] Introduce configuration for LRU cache of previous CR3s
@ 2024-10-29  3:13 Yong He
  2024-10-29  3:13 ` [PATCH 1/2] KVM: x86: expand the " Yong He
  2024-10-29  3:14 ` [PATCH 2/2] KVM: x86: introduce cache configurations for " Yong He

From: Yong He @ 2024-10-29  3:13 UTC
To: pbonzini, seanjc, kvm
Cc: wanpengli, alexyonghe, junaids

From: Yong He <alexyonghe@tencent.com>

When running workloads inside a VM without EPT support, we found that
shadow page table rebuilds are very frequent even with only 3 processes
running inside the VM; in particular, kvm_mmu_free_roots() is invoked
frequently. PTI is enabled inside the VM, so the 3 processes use 6
valid CR3s, but there are only 3 entries in the LRU cache of previous
CR3s. As a result the cache almost always misses and the shadow page
tables are frequently rebuilt.

This series enlarges the LRU cache of previous CR3s and introduces a
parameter for it, so that users can enlarge the cache when needed.

Below are lmbench3 context switch latency results, measured on an Ice
Lake server. After enlarging the LRU cache, switch latency is reduced
by 14%~18%:

  process number   2      3      4      5      6      7      8
  LRU cache = 3    4.857  6.802  7.518  7.836  7.770  7.287  7.271
  LRU cache = 11   4.654  5.518  6.292  6.516  6.512  7.135  7.270

Also, kvm_mmu_free_roots() invocations dropped from 7k+ to 60 when
running the latency test with 4 processes.

Yong He (2):
  KVM: x86: expand the LRU cache of previous CR3s
  KVM: x86: introduce cache configurations for previous CR3s

 arch/x86/include/asm/kvm_host.h |  7 +++---
 arch/x86/kvm/mmu.h              |  1 +
 arch/x86/kvm/mmu/mmu.c          | 40 +++++++++++++++++++++++----------
 arch/x86/kvm/vmx/nested.c       |  4 ++--
 arch/x86/kvm/x86.c              |  2 +-
 5 files changed, 36 insertions(+), 18 deletions(-)

-- 
2.43.5
* [PATCH 1/2] KVM: x86: expand the LRU cache of previous CR3s

From: Yong He @ 2024-10-29  3:13 UTC
To: pbonzini, seanjc, kvm
Cc: wanpengli, alexyonghe, junaids

From: Yong He <alexyonghe@tencent.com>

Expand max number of LRU cache of previous CR3s, so that
we could cache more entry when needed, such as KPTI is
enabled inside guest.

Signed-off-by: Yong He <alexyonghe@tencent.com>
---
 arch/x86/include/asm/kvm_host.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a68cb3eb..02528d4a2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -430,11 +430,12 @@ struct kvm_mmu_root_info {
 #define KVM_MMU_ROOT_INFO_INVALID \
 	((struct kvm_mmu_root_info) { .pgd = INVALID_PAGE, .hpa = INVALID_PAGE })
 
-#define KVM_MMU_NUM_PREV_ROOTS 3
+#define KVM_MMU_NUM_PREV_ROOTS 3
+#define KVM_MMU_NUM_PREV_ROOTS_MAX 11
 
 #define KVM_MMU_ROOT_CURRENT		BIT(0)
 #define KVM_MMU_ROOT_PREVIOUS(i)	BIT(1+i)
-#define KVM_MMU_ROOTS_ALL		(BIT(1 + KVM_MMU_NUM_PREV_ROOTS) - 1)
+#define KVM_MMU_ROOTS_ALL		(BIT(1 + KVM_MMU_NUM_PREV_ROOTS_MAX) - 1)
 
 #define KVM_HAVE_MMU_RWLOCK
 
@@ -469,7 +470,7 @@ struct kvm_mmu {
 	 */
 	u32 pkru_mask;
 
-	struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
+	struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS_MAX];
 
 	/*
 	 * Bitmap; bit set = permission fault
-- 
2.43.5
* Re: [PATCH 1/2] KVM: x86: expand the LRU cache of previous CR3s

From: Sean Christopherson @ 2024-10-29 14:40 UTC
To: Yong He
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

KVM: x86/mmu:

On Tue, Oct 29, 2024, Yong He wrote:
> From: Yong He <alexyonghe@tencent.com>
>
> Expand max number of LRU cache of previous CR3s, so that
> we could cache more entry when needed, such as KPTI is

No "we".  Documentation/process/maintainer-kvm-x86.rst

And I would argue this changelog is misleading.  I was expecting that the patch
would actually change the number of roots that KVM caches, whereas this simply
increases the capacity.  The changelog should also mention that the whole reason
for doing so is to allow for a module param.

Something like:

  KVM: x86/mmu: Expand max capacity of per-MMU CR3/PGD caches

  Expand the maximum capacity of the "previous roots" cache in kvm_mmu so
  that a future patch can make the number of roots configurable via module
  param, without needing to dynamically allocate the array.

That said, I hope we can avoid this entirely.  More in the next patch.
* Re: [PATCH 1/2] KVM: x86: expand the LRU cache of previous CR3s

From: zhuangel570 @ 2024-10-30 12:08 UTC
To: Sean Christopherson
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Wed, Oct 30, 2024 at 3:44 PM Sean Christopherson <seanjc@google.com> wrote:
>
> KVM: x86/mmu:
>
> On Tue, Oct 29, 2024, Yong He wrote:
> > From: Yong He <alexyonghe@tencent.com>
> >
> > Expand max number of LRU cache of previous CR3s, so that
> > we could cache more entry when needed, such as KPTI is
>
> No "we".  Documentation/process/maintainer-kvm-x86.rst
>
> And I would argue this changelog is misleading.  I was expecting that the patch
> would actually change the number of roots that KVM caches, whereas this simply
> increases the capacity.  The changelog should also mention that the whole reason
> for doing so is to allow for a module param.
>
> Something like:
>
>   KVM: x86/mmu: Expand max capacity of per-MMU CR3/PGD caches
>
>   Expand the maximum capacity of the "previous roots" cache in kvm_mmu so
>   that a future patch can make the number of roots configurable via module
>   param, without needing to dynamically allocate the array.
>
> That said, I hope we can avoid this entirely.  More in the next patch.

Thanks for pointing out the problem, will fix in next version.
* [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: Yong He @ 2024-10-29  3:14 UTC
To: pbonzini, seanjc, kvm
Cc: wanpengli, alexyonghe, junaids

From: Yong He <alexyonghe@tencent.com>

Introduce prev_roots_num param, so that we use more cache of
previous CR3/root_hpa pairs, which help us to reduce shadow
page table evict and rebuild overhead.

Signed-off-by: Yong He <alexyonghe@tencent.com>
---
 arch/x86/kvm/mmu.h        |  1 +
 arch/x86/kvm/mmu/mmu.c    | 40 +++++++++++++++++++++++++++------------
 arch/x86/kvm/vmx/nested.c |  4 ++--
 arch/x86/kvm/x86.c        |  2 +-
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 4341e0e28..e5615433a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -7,6 +7,7 @@
 #include "cpuid.h"
 
 extern bool __read_mostly enable_mmio_caching;
+extern uint __read_mostly prev_roots_num;
 
 #define PT_WRITABLE_SHIFT 1
 #define PT_USER_SHIFT 2
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7813d28b0..2acc24dd2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -96,6 +96,22 @@ __MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
 static bool __read_mostly force_flush_and_sync_on_reuse;
 module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
 
+static int prev_roots_num_param(const char *val, const struct kernel_param *kp)
+{
+	return param_set_uint_minmax(val, kp, KVM_MMU_NUM_PREV_ROOTS, KVM_MMU_NUM_PREV_ROOTS_MAX);
+}
+
+static const struct kernel_param_ops prev_roots_num_ops = {
+	.set = prev_roots_num_param,
+	.get = param_get_uint,
+};
+
+uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
+EXPORT_SYMBOL_GPL(prev_roots_num);
+module_param_cb(prev_roots_num, &prev_roots_num_ops,
+		&prev_roots_num, 0644);
+__MODULE_PARM_TYPE(prev_roots_num, "uint");
+
 /*
  * When setting this variable to true it enables Two-Dimensional-Paging
  * where the hardware walks 2 page tables:
@@ -3594,12 +3610,12 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 		&& VALID_PAGE(mmu->root.hpa);
 
 	if (!free_active_root) {
-		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+		for (i = 0; i < prev_roots_num; i++)
 			if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
 			    VALID_PAGE(mmu->prev_roots[i].hpa))
 				break;
 
-		if (i == KVM_MMU_NUM_PREV_ROOTS)
+		if (i == prev_roots_num)
 			return;
 	}
 
@@ -3608,7 +3624,7 @@ void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
 	else
 		write_lock(&kvm->mmu_lock);
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
 			mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa,
 					   &invalid_list);
@@ -3655,7 +3671,7 @@ void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
 	 */
 	WARN_ON_ONCE(mmu->root_role.guest_mode);
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		root_hpa = mmu->prev_roots[i].hpa;
 		if (!VALID_PAGE(root_hpa))
 			continue;
@@ -4066,7 +4082,7 @@ void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
 	unsigned long roots_to_free = 0;
 	int i;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 
@@ -4814,7 +4830,7 @@ static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *m
 	if (is_root_usable(&mmu->root, new_pgd, new_role))
 		return true;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		/*
 		 * The swaps end up rotating the cache like this:
 		 *   C   0 1 2 3   (on entry to the function)
@@ -4845,7 +4861,7 @@ static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mm
 {
 	uint i;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (is_root_usable(&mmu->prev_roots[i], new_pgd, new_role))
 			goto hit;
 
@@ -4854,7 +4870,7 @@ static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mm
 hit:
 	swap(mmu->root, mmu->prev_roots[i]);
 	/* Bubble up the remaining roots. */
-	for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
+	for (; i < prev_roots_num - 1; i++)
 		mmu->prev_roots[i] = mmu->prev_roots[i + 1];
 	mmu->prev_roots[i].hpa = INVALID_PAGE;
 	return true;
@@ -5795,7 +5811,7 @@ static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
 	if (is_obsolete_root(kvm, mmu->root.hpa))
 		roots_to_free |= KVM_MMU_ROOT_CURRENT;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 	}
@@ -6125,7 +6141,7 @@ void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	if (roots & KVM_MMU_ROOT_CURRENT)
 		__kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->root.hpa);
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		if (roots & KVM_MMU_ROOT_PREVIOUS(i))
 			__kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
 	}
@@ -6159,7 +6175,7 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
 	if (pcid == kvm_get_active_pcid(vcpu))
 		roots |= KVM_MMU_ROOT_CURRENT;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
 		    pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd))
 			roots |= KVM_MMU_ROOT_PREVIOUS(i);
@@ -6271,7 +6287,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 
 	/* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2392a7ef2..d7e375c34 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -394,7 +394,7 @@ static void nested_ept_invalidate_addr(struct kvm_vcpu *vcpu, gpa_t eptp,
 	WARN_ON_ONCE(!mmu_is_nested(vcpu));
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+	for (i = 0; i < prev_roots_num; i++) {
 		cached_root = &vcpu->arch.mmu->prev_roots[i];
 
 		if (nested_ept_root_matches(cached_root->hpa, cached_root->pgd,
@@ -5820,7 +5820,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 				    operand.eptp))
 			roots_to_free |= KVM_MMU_ROOT_CURRENT;
 
-		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+		for (i = 0; i < prev_roots_num; i++) {
 			if (nested_ept_root_matches(mmu->prev_roots[i].hpa,
 						    mmu->prev_roots[i].pgd,
 						    operand.eptp))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c983c8e43..047cf66da 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1235,7 +1235,7 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 	if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE))
 		return;
 
-	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+	for (i = 0; i < prev_roots_num; i++)
 		if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
 
-- 
2.43.5
* Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: Sean Christopherson @ 2024-10-29 15:14 UTC
To: Yong He
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Tue, Oct 29, 2024, Yong He wrote:
> From: Yong He <alexyonghe@tencent.com>
>
> Introduce prev_roots_num param, so that we use more cache of
> previous CR3/root_hpa pairs, which help us to reduce shadow
> page table evict and rebuild overhead.
>
> Signed-off-by: Yong He <alexyonghe@tencent.com>
> ---

...

> +uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
> +EXPORT_SYMBOL_GPL(prev_roots_num);
> +module_param_cb(prev_roots_num, &prev_roots_num_ops,
> +		&prev_roots_num, 0644);

Allowing the variable to be changed while KVM is running is unsafe.

I also think a module param is the wrong way to try to allow for bigger caches.
The caches themselves are relatively cheap, at 16 bytes per entry.  And I doubt
the cost of searching a larger cache in fast_pgd_switch() would have a measurable
impact, since the most recently used roots will be at the front of the cache,
i.e. only near-misses and misses will be affected.

The only potential downside to larger caches I can think of, is that keeping
root_count elevated would make it more difficult to reclaim shadow pages from
roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
in particular would refuse to reclaim roots.  That shouldn't be problematic for
legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
extremely uncommon in nested TDP).

And for the nested TDP issue, if it's actually a problem, I would *love* to
solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
idea would be to kick all vCPUs if the maximum number of pages has been reached,
have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
It would be a bit more complicated than that, as KVM would need a way to ensure
forward progress, e.g. if the shadow pages limit has been reached with a single
root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.

TL;DR: what if we simply bump the number of cached roots to ~16?
* Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: zhuangel570 @ 2024-10-30 12:51 UTC
To: Sean Christopherson
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Wed, Oct 30, 2024 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Oct 29, 2024, Yong He wrote:
> > From: Yong He <alexyonghe@tencent.com>
> >
> > Introduce prev_roots_num param, so that we use more cache of
> > previous CR3/root_hpa pairs, which help us to reduce shadow
> > page table evict and rebuild overhead.
> >
> > Signed-off-by: Yong He <alexyonghe@tencent.com>
> > ---
>
> ...
>
> > +uint __read_mostly prev_roots_num = KVM_MMU_NUM_PREV_ROOTS;
> > +EXPORT_SYMBOL_GPL(prev_roots_num);
> > +module_param_cb(prev_roots_num, &prev_roots_num_ops,
> > +		&prev_roots_num, 0644);
>
> Allowing the variable to be changed while KVM is running is unsafe.

Sorry, I have no intention of revising prev_roots_num at runtime; changing it
to 0444 is simpler and can improve performance, will update in next version.

> I also think a module param is the wrong way to try to allow for bigger caches.
> The caches themselves are relatively cheap, at 16 bytes per entry.  And I doubt
> the cost of searching a larger cache in fast_pgd_switch() would have a measurable
> impact, since the most recently used roots will be at the front of the cache,
> i.e. only near-misses and misses will be affected.

Maybe the context switch test already shows this: when there are 7 or 8
processes in the guest (all cache misses), there is nearly no performance drop.

> The only potential downside to larger caches I can think of, is that keeping
> root_count elevated would make it more difficult to reclaim shadow pages from
> roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
> in particular would refuse to reclaim roots.  That shouldn't be problematic for
> legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
> nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
> child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
> extremely uncommon in nested TDP).
>
> And for the nested TDP issue, if it's actually a problem, I would *love* to
> solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
> idea would be to kick all vCPUs if the maximum number of pages has been reached,
> have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
> It would be a bit more complicated than that, as KVM would need a way to ensure
> forward progress, e.g. if the shadow pages limit has been reached with a single
> root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.

I'm not very familiar with TDP on TDP.
I think you mean force-freeing cached roots in kvm_mmu_zap_oldest_mmu_pages()
when no mmu pages could be zapped, such as kicking all vCPUs and purging
cached roots.

> TL;DR: what if we simply bump the number of cached roots to ~16?

I set the number to 11 because the number of PCIDs in the guest kernel is 6
(11 + current = 12); when there are more than 6 processes in the guest, the
PCIDs will be reused, and then the cached roots will not easily be hit.  The
context switch case shows no performance gain with 7 and 8 processes.
* Re: [PATCH 2/2] KVM: x86: introduce cache configurations for previous CR3s

From: Sean Christopherson @ 2024-11-06  1:42 UTC
To: zhuangel570
Cc: pbonzini, kvm, wanpengli, alexyonghe, junaids

On Wed, Oct 30, 2024, zhuangel570 wrote:
> On Wed, Oct 30, 2024 at 4:38 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Tue, Oct 29, 2024, Yong He wrote:
> > The only potential downside to larger caches I can think of, is that keeping
> > root_count elevated would make it more difficult to reclaim shadow pages from
> > roots that are no longer relevant to the guest.  kvm_mmu_zap_oldest_mmu_pages()
> > in particular would refuse to reclaim roots.  That shouldn't be problematic for
> > legacy shadow paging, because KVM doesn't recursively zap shadow pages.  But for
> > nested TDP, mmu_page_zap_pte() frees the entire tree, in the common case that
> > child SPTEs aren't shared across multiple trees (common in legacy shadow paging,
> > extremely uncommon in nested TDP).
> >
> > And for the nested TDP issue, if it's actually a problem, I would *love* to
> > solve that problem by making KVM's forced reclaim more sophisticated.  E.g. one
> > idea would be to kick all vCPUs if the maximum number of pages has been reached,
> > have each vCPU purge old roots from prev_roots, and then reclaim unused roots.
> > It would be a bit more complicated than that, as KVM would need a way to ensure
> > forward progress, e.g. if the shadow pages limit has been reached with a single
> > root.  But even then, kvm_mmu_zap_oldest_mmu_pages() could be made a _lot_ smarter.
>
> I'm not very familiar with TDP on TDP.
> I think you mean force-freeing cached roots in kvm_mmu_zap_oldest_mmu_pages()
> when no mmu pages could be zapped, such as kicking all vCPUs and purging
> cached roots.

Not just when no MMU pages could be zapped; any time KVM needs to reclaim MMU
pages due to n_max_mmu_pages.

> > TL;DR: what if we simply bump the number of cached roots to ~16?
>
> I set the number to 11 because the number of PCIDs in the guest kernel is 6
> (11 + current = 12); when there are more than 6 processes in the guest, the
> PCIDs will be reused, and then the cached roots will not easily be hit.  The
> context switch case shows no performance gain with 7 and 8 processes.

Do you control the guest kernel?  If so, it'd be interesting to see what happens
when you bump TLB_NR_DYN_ASIDS in the guest to something higher, and then adjust
KVM to match.

IIRC, Andy arrived at '6' in 10af6235e0d3 ("x86/mm: Implement PCID based
optimization: try to preserve old TLB entries using PCID") because that was the
"sweet spot" for hardware.  E.g. using fewer PCIDs wouldn't fully utilize
hardware, and using more PCIDs would oversubscribe the number of ASID tags too
much.

For KVM shadow paging, the only meaningful limitation is the number of shadow
pages that KVM allows.  E.g. with a sufficiently high n_max_mmu_pages, the guest
could theoretically use hundreds of PCIDs with no ill effects.