All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org, Tianyu Lan <ltykernel@gmail.com>,
	Michael Kelley <mikelley@microsoft.com>
Subject: Re: [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V
Date: Tue, 07 Mar 2023 11:07:33 +0100	[thread overview]
Message-ID: <87a60ozxga.fsf@redhat.com> (raw)
In-Reply-To: <c6bb4b57-f134-d992-7f30-be80151fb67e@linux.microsoft.com>

Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:

> On 06/03/2023 18:52, Vitaly Kuznetsov wrote:
>> Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> writes:
>> 
>>> TDP MMU has been broken on AMD CPUs when running on Hyper-V since v5.17.
>>> The issue was first introduced by two commmits:
>>>
>>> - bb95dfb9e2dfbe6b3f5eb5e8a20e0259dadbe906 "KVM: x86/mmu: Defer TLB
>>>   flush to caller when freeing TDP MMU shadow pages"
>>> - efd995dae5eba57c5d28d6886a85298b390a4f07 "KVM: x86/mmu: Zap defunct
>>>   roots via asynchronous worker"
>>>
>>> The root cause is that since then there are missing TLB flushes which
>>> are required by HV_X64_NESTED_ENLIGHTENED_TLB.
>> 
>> Please share more details on what's actually missing as you get them,
>> I'd like to understand which flushes can be legally avoided on bare
>> hardware and Hyper-V/VMX but not on Hyper-V/SVM.
>> 
>
> See the linked thread here
> https://lore.kernel.org/lkml/20d189fc-8d20-8083-b448-460cc0420151@linux.microsoft.com/#t
> for all the details/analyses but the summary was that either of these 2
> options would work, with a) having less flushes (footnote: less flushes is not necessarily
> better):
>
> a) adding a hyperv_flush_guest_mapping(__pa(root->spt) after kvm_tdp_mmu_get_vcpu_root_hpa's call to tdp_mmu_alloc_sp()
> b) adding a hyperv_flush_guest_mapping(vcpu->arch.mmu->root.hpa) to svm_flush_tlb_current()
>
> These are only needed on Hyper-V/SVM because of how the enlightenment works (needs an explicit
> flush to rebuild L0 shadow page tables). Hyper-V/VMX does not need any changes and currently
> works. Let me know if you need more information on something here, I'll try to get it.
>

Ah, I missed the whole party! Thanks for the pointers!

>>>  The failure manifests
>>> as L2 guest VMs being unable to complete boot due to memory
>>> inconsistencies between L1 and L2 guests which lead to various
>>> assertion/emulation failures.

Which levels are we talking about here, *real* L1 and L2 or L1 and L2
from KVM's perspective (real L2 and L3)?

>>>
>>> The HV_X64_NESTED_ENLIGHTENED_TLB enlightenment is always exposed by
>>> Hyper-V on AMD and is always used by Linux. The TLB flush required by
>>> HV_X64_NESTED_ENLIGHTENED_TLB is much stricter than the local TLB flush
>>> that TDP MMU wants to issue. We have also found that with TDP MMU L2 guest
>>> boot performance on AMD is reproducibly slower compared to when TDP MMU is
>>> disabled.
>>>
>>> Disable TDP MMU when using SVM Hyper-V for the time being while we
>>> search for a better fix.
>> 
>> I'd suggest we go the other way around: disable
>> HV_X64_NESTED_ENLIGHTENED_TLB on SVM:
>
> Paolo suggested disabling TDP_MMU when HV_X64_NESTED_ENLIGHTENED_TLB is used, and
> I prefer that option too. The enlighenment does offer a nice performance advantage
> with non-TDP_MMU, and I did not see TDP_MMU perform any better compared to that.
> Afaik the code to use the enlightenment on Hyper-V/SVM was written/tested before
> TDP_MMU became the default.
>
> If you have a specific scenario in mind, we could test and see what the implications
> are there.

I don't have a strong opinion here, I've suggested a smaller change so
it's easier to backport it to stable kernels and easier to revert when a
proper fix comes to mainline. For performance implication, I'd only
consider non-nested scenarios from KVM's perspective (i.e. real L2 from
Hyper-V's PoV), as running L3 is unlikely a common use-case and, if I
understood correctly, is broken anyway.

>
>> 
>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>> index 6981c1e9a809..be98da5a4277 100644
>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>> @@ -32,7 +32,8 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>  
>>  static inline void svm_hv_hardware_setup(void)
>>  {
>> -       if (npt_enabled &&
>> +       /* A comment about missing TLB flushes */
>> +       if (!tdp_mmu_enabled && npt_enabled &&
>>             ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB) {
>>                 pr_info(KBUILD_MODNAME ": Hyper-V enlightened NPT TLB flush enabled\n");
>>                 svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
>> 
>> this way we won't have a not-obvious-at-all MMU change on Hyper-V. I
>> understand this may have some performance implications but MMU switch
>> has some as well.
>> 
>>>
>>> Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/t/#u
>>> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
>>> ---
>>> Based on kvm-x86-mmu-6.3. The approach used here does not apply cleanly to
>>> <=v6.2. This would be needed in stable too, and I don't know about putting
>>> fixes tags.
>> 
>> Cc: stable@vger.kernel.org # 5.17.0 
>> 
>> should do)
>> 
>>>
>>> Jeremi
>>>
>>>  arch/x86/include/asm/kvm_host.h |  3 ++-
>>>  arch/x86/kvm/mmu/mmu.c          |  5 +++--
>>>  arch/x86/kvm/svm/svm.c          |  6 +++++-
>>>  arch/x86/kvm/svm/svm_onhyperv.h | 10 ++++++++++
>>>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>>>  5 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 4d2bc08794e4..a0868ae3688d 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -2031,7 +2031,8 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid);
>>>  void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd);
>>>  
>>>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> -		       int tdp_max_root_level, int tdp_huge_page_level);
>>> +		       int tdp_max_root_level, int tdp_huge_page_level,
>>> +		       bool enable_tdp_mmu);
>>>  
>>>  static inline u16 kvm_read_ldt(void)
>>>  {
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index c91ee2927dd7..5c0e28a7a3bc 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -5787,14 +5787,15 @@ void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
>>>  }
>>>  
>>>  void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>> -		       int tdp_max_root_level, int tdp_huge_page_level)
>>> +		       int tdp_max_root_level, int tdp_huge_page_level,
>>> +		       bool enable_tdp_mmu)
>>>  {
>>>  	tdp_enabled = enable_tdp;
>>>  	tdp_root_level = tdp_forced_root_level;
>>>  	max_tdp_level = tdp_max_root_level;
>>>  
>>>  #ifdef CONFIG_X86_64
>>> -	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
>>> +	tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled && enable_tdp_mmu;
>>>  #endif
>>>  	/*
>>>  	 * max_huge_page_level reflects KVM's MMU capabilities irrespective
>>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>>> index d13cf53e7390..070c3f7f8c9f 100644
>>> --- a/arch/x86/kvm/svm/svm.c
>>> +++ b/arch/x86/kvm/svm/svm.c
>>> @@ -4925,6 +4925,7 @@ static __init int svm_hardware_setup(void)
>>>  	struct page *iopm_pages;
>>>  	void *iopm_va;
>>>  	int r;
>>> +	bool enable_tdp_mmu;
>>>  	unsigned int order = get_order(IOPM_SIZE);
>>>  
>>>  	/*
>>> @@ -4991,9 +4992,12 @@ static __init int svm_hardware_setup(void)
>>>  	if (!boot_cpu_has(X86_FEATURE_NPT))
>>>  		npt_enabled = false;
>>>  
>>> +	enable_tdp_mmu = svm_hv_enable_tdp_mmu();
>>> +
>>>  	/* Force VM NPT level equal to the host's paging level */
>>>  	kvm_configure_mmu(npt_enabled, get_npt_level(),
>>> -			  get_npt_level(), PG_LEVEL_1G);
>>> +			  get_npt_level(), PG_LEVEL_1G,
>>> +			  enable_tdp_mmu);
>>>  	pr_info("Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
>>>  
>>>  	/* Setup shadow_me_value and shadow_me_mask */
>>> diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
>>> index 6981c1e9a809..aa49ac5d66bc 100644
>>> --- a/arch/x86/kvm/svm/svm_onhyperv.h
>>> +++ b/arch/x86/kvm/svm/svm_onhyperv.h
>>> @@ -30,6 +30,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>>  		hve->hv_enlightenments_control.msr_bitmap = 1;
>>>  }
>>>  
>>> +static inline bool svm_hv_enable_tdp_mmu(void)
>>> +{
>>> +	return !(npt_enabled && ms_hyperv.nested_features & HV_X64_NESTED_ENLIGHTENED_TLB);
>>> +}
>>> +
>>>  static inline void svm_hv_hardware_setup(void)
>>>  {
>>>  	if (npt_enabled &&
>>> @@ -84,6 +89,11 @@ static inline void svm_hv_init_vmcb(struct vmcb *vmcb)
>>>  {
>>>  }
>>>  
>>> +static inline bool svm_hv_enable_tdp_mmu(void)
>>> +{
>>> +	return true;
>>> +}
>>> +
>>>  static inline void svm_hv_hardware_setup(void)
>>>  {
>>>  }
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index c788aa382611..4d3808755d39 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -8442,7 +8442,8 @@ static __init int hardware_setup(void)
>>>  	vmx_setup_me_spte_mask();
>>>  
>>>  	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
>>> -			  ept_caps_to_lpage_level(vmx_capability.ept));
>>> +			  ept_caps_to_lpage_level(vmx_capability.ept),
>>> +			  true);
>>>  
>>>  	/*
>>>  	 * Only enable PML when hardware supports PML feature, and both EPT
>> 
>

-- 
Vitaly


  reply	other threads:[~2023-03-07 10:08 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-27 17:17 [PATCH] KVM: SVM: Disable TDP MMU when running on Hyper-V Jeremi Piotrowski
2023-03-06 17:52 ` Vitaly Kuznetsov
2023-03-06 18:31   ` Jeremi Piotrowski
2023-03-07 10:07     ` Vitaly Kuznetsov [this message]
2023-03-08 15:42       ` Jeremi Piotrowski
2023-03-07 17:36 ` Sean Christopherson
2023-03-08  0:00   ` Paolo Bonzini
2023-03-08  0:39     ` Sean Christopherson
2023-03-08 15:55       ` Jeremi Piotrowski
2023-03-08 17:22         ` Jeremi Piotrowski
2023-03-08 19:20           ` Sean Christopherson
2023-03-08 19:11         ` Sean Christopherson
2023-03-09 17:58           ` Jeremi Piotrowski
2023-03-12 17:42             ` Alexander Grest
2023-03-08 15:48     ` Jeremi Piotrowski
2023-04-05 16:43   ` Jeremi Piotrowski
2023-04-10 23:25     ` Sean Christopherson
2023-04-11 14:22       ` Jeremi Piotrowski
2023-04-11 16:02         ` Sean Christopherson
2023-04-13  9:53           ` Jeremi Piotrowski
2023-04-13 17:24             ` Sean Christopherson
2023-04-13 18:49               ` Sean Christopherson
2023-04-13 19:09               ` Sean Christopherson
2023-04-13 20:21                 ` David Matlack
2023-04-13 20:58                   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87a60ozxga.fsf@redhat.com \
    --to=vkuznets@redhat.com \
    --cc=jpiotrowski@linux.microsoft.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ltykernel@gmail.com \
    --cc=mikelley@microsoft.com \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.