From: Tao Su <tao1.su@linux.intel.com>
To: Sean Christopherson <seanjc@google.com>
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, eddie.dong@intel.com,
chao.gao@intel.com, xiaoyao.li@intel.com,
yuan.yao@linux.intel.com, yi1.lai@intel.com,
xudong.hao@intel.com, chao.p.peng@intel.com
Subject: Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
Date: Thu, 21 Dec 2023 15:45:15 +0800 [thread overview]
Message-ID: <ZYPtC01FMknBSVxV@linux.bj.intel.com> (raw)
In-Reply-To: <ZYMWFhVQ7dCjYegQ@google.com>
On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> As evidenced by my initial response, the shortlog is a bit misleading. In non-KVM
> code it would be perfectly ok to say "limit", but in KVM's split world where
> *userspace* is mostly responsible for the guest configuration, "limit guest ..."
> is often going to be read as "limit the capabilities of the guest".
>
> This is also a good example of why shortlogs and changelogs should NOT be play-by-play
> descriptions of the code change. The literal code change applies a limit to
> guest physical bits, but that doesn't capture the net effect of applying said limit.
>
> Something like
>
> KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
>
> better captures that the patch affects what KVM advertises to userspace. Yeah,
> it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
> ubiquitous enough that it's worth being technically wrong in order to clearly
> communicate the net effect. Alternatively, it could be something like:
>
> KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported
Yeah, these two shortlogs are more accurate and intuitive.
>
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> >
> > Six KVM selftests are timeout due to this issue:
> > kvm:access_tracking_perf_test
> > kvm:demand_paging_test
> > kvm:dirty_log_test
> > kvm:dirty_log_perf_test
> > kvm:kvm_page_table_test
> > kvm:memslot_modification_stress_test
> >
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> >
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> >
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
>
> There is far too much unnecessary cruft in this changelog. The entire last
> paragraph is extraneous information. Talking in detail about KVM's (flawed)
> emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
> it's correct for KVM to advertise an impossible configuration.
>
> And this is exactly why I dislike the "explain the symptom, then the solution"
> style for KVM. The symptoms described above don't actually explain why *KVM* is
> at fault.
>
> KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
>
> Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
> 48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
> CPU during a page table walk if and only if 5-level TDP is enabled. I.e.
> it's impossible for KVM to actually map GPAs with bits 51:49 set, which
> results in vCPUs getting stuck on endless EPT violations.
>
> From Intel's SDM:
>
> 4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
> walk length of 4) to translate a guest-physical address and uses only
> bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
> access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
> and uses guest-physical address bits 56:0. [Physical addresses and
> guest-physical addresses are architecturally limited to 52 bits (e.g.,
> by paging), so in practice bits 56:52 are zero.]
>
> While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
> not 5-level EPT, the combination is architecturally legal and such CPUs
> do exist (and can easily be "created" with nested virtualization).
>
> Note, because EPT capabilities are reported via MSRs, it's impossible
> for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
> is 100% KVM's responsibility.
Yes, I learn a lot, I will use the above shortlogs and changelogs as my
next version.
>
> > Reported-by: Yi Lai <yi1.lai@intel.com>
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > Tested-by: Xudong Hao <xudong.hao@intel.com>
> > ---
> > arch/x86/kvm/cpuid.c | 5 +++--
> > arch/x86/kvm/mmu.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 7 +++++++
> > 3 files changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index dda6fc4cfae8..91933ca739ad 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> > *
> > * If TDP is enabled but an explicit guest MAXPHYADDR is not
> > * provided, use the raw bare metal MAXPHYADDR as reductions to
> > - * the HPAs do not affect GPAs.
> > + * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> > + * doesn't exceed the bits that TDP can translate.
> > */
> > if (!tdp_enabled)
> > g_phys_as = boot_cpu_data.x86_phys_bits;
> > else if (!g_phys_as)
> > - g_phys_as = phys_as;
> > + g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
>
> I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
> advertises guest.MAXPHYADDR. Advertising a bad guest.MAXPHYADDR is arguably a
> blatant CPU bug, but being paranoid is cheap in this case.
Yes, agree.
>
> > entry->eax = g_phys_as | (virt_as << 8);
> > entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index bb8c86eefac0..1c7d649fcf6b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > u64 fault_address, char *insn, int insn_len);
> > void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu);
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void);
> >
> > int kvm_mmu_load(struct kvm_vcpu *vcpu);
> > void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c57e181bba21..72634d6b61b2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > reset_guest_paging_metadata(vcpu, mmu);
> > }
> >
> > +/* guest-physical-address bits limited by TDP */
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > +{
> > + return max_tdp_level == 5 ? 57 : 48;
>
> Using "57" is kinda sorta wrong, e.g. the SDM says:
>
> Bits 56:52 of each guest-physical address are necessarily zero because
> guest-physical addresses are architecturally limited to 52 bits.
>
> Rather than split hairs over something that doesn't matter, I think it makes sense
> for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> is still accurate when tdp_root_level is non-zero).
>
> That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
> *host* MAXPHYADDR.
>
> Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
> used by the CPUID code, so it's not the end of the world.
Yes, agree.
>
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
>
> This shouldn't be exported, I don't see any reason for vendor code to need access
> to this helper. It's essentially a moot point though if we avoid the helper in
> the first place.
Yes, previously I want to invoke this helper in the patch2, now it is no
longer needed.
>
> All in all, this?
The code is perfect, I will add it to my next version, thanks!
>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/cpuid.c | 16 ++++++++++++----
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7bc1daf68741..29b575b86912 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> bool mask);
>
> extern bool tdp_enabled;
> +extern int max_tdp_level;
>
> u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..637a1f388a51 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> *
> * If TDP is enabled but an explicit guest MAXPHYADDR is not
> * provided, use the raw bare metal MAXPHYADDR as reductions to
> - * the HPAs do not affect GPAs.
> + * the HPAs do not affect GPAs. Finally, if TDP is enabled and
> + * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
> + * as bits 51:49 are used by the CPU if and only if 5-level TDP
Nit, it should be bits 51:48?
Thanks,
Tao
> + * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
> */
> - if (!tdp_enabled)
> + if (!tdp_enabled) {
> g_phys_as = boot_cpu_data.x86_phys_bits;
> - else if (!g_phys_as)
> - g_phys_as = phys_as;
> + } else {
> + if (!g_phys_as)
> + g_phys_as = phys_as;
> +
> + if (max_tdp_level < 5)
> + g_phys_as = min(g_phys_as, 48);
> + }
>
> entry->eax = g_phys_as | (virt_as << 8);
> entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..5036c7eb7dac 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
>
> static int max_huge_page_level __read_mostly;
> static int tdp_root_level __read_mostly;
> -static int max_tdp_level __read_mostly;
> +int max_tdp_level __read_mostly;
>
> #define PTE_PREFETCH_NUM 8
>
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
>
next prev parent reply other threads:[~2023-12-21 7:48 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-18 14:05 [PATCH 0/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported Tao Su
2023-12-18 14:05 ` [PATCH 1/2] " Tao Su
2023-12-18 15:13 ` Sean Christopherson
2023-12-19 2:51 ` Chao Gao
2023-12-19 3:40 ` Jim Mattson
2023-12-19 8:09 ` Chao Gao
2023-12-19 15:26 ` Sean Christopherson
2023-12-20 7:16 ` Xiaoyao Li
2023-12-20 15:37 ` Sean Christopherson
2023-12-20 11:59 ` Tao Su
2023-12-20 13:39 ` Jim Mattson
2023-12-19 8:31 ` Tao Su
2023-12-20 16:28 ` Sean Christopherson
2023-12-21 7:45 ` Tao Su [this message]
2023-12-21 8:19 ` Xu Yilun
2024-01-02 23:24 ` Sean Christopherson
2024-01-03 0:34 ` Jim Mattson
2024-01-03 18:04 ` Sean Christopherson
2024-01-04 2:45 ` Chao Gao
2024-01-04 3:40 ` Jim Mattson
2024-01-04 4:34 ` Jim Mattson
2024-01-04 11:56 ` Tao Su
2024-01-04 14:03 ` Jim Mattson
2024-01-04 15:07 ` Chao Gao
2024-01-04 17:02 ` Jim Mattson
2024-01-05 20:26 ` Sean Christopherson
2024-01-08 13:45 ` Tao Su
2024-01-08 15:29 ` Sean Christopherson
2023-12-18 14:05 ` [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT Tao Su
2023-12-18 15:23 ` Sean Christopherson
2023-12-19 3:10 ` Chao Gao
2023-12-20 13:42 ` Jim Mattson
2024-01-08 13:48 ` Tao Su
2024-01-08 15:19 ` Sean Christopherson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZYPtC01FMknBSVxV@linux.bj.intel.com \
--to=tao1.su@linux.intel.com \
--cc=chao.gao@intel.com \
--cc=chao.p.peng@intel.com \
--cc=eddie.dong@intel.com \
--cc=kvm@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=seanjc@google.com \
--cc=xiaoyao.li@intel.com \
--cc=xudong.hao@intel.com \
--cc=yi1.lai@intel.com \
--cc=yuan.yao@linux.intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox