* [PATCH 0/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
@ 2023-12-18 14:05 Tao Su
2023-12-18 14:05 ` [PATCH 1/2] " Tao Su
2023-12-18 14:05 ` [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT Tao Su
0 siblings, 2 replies; 34+ messages in thread
From: Tao Su @ 2023-12-18 14:05 UTC (permalink / raw)
To: kvm
Cc: seanjc, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng, tao1.su
When the host doesn't support 5-level EPT, bits 51:48 of the guest physical
address must all be zero, otherwise an EPT violation always occurs[1] and
the current handler can't resolve it even when the GPA is in a RAM region.
The faulting instruction is then re-executed again and again, causing
infinite EPT violations.
Six KVM selftests time out due to this issue:
kvm:access_tracking_perf_test
kvm:demand_paging_test
kvm:dirty_log_test
kvm:dirty_log_perf_test
kvm:kvm_page_table_test
kvm:memslot_modification_stress_test
Report to the guest the maximum physical bits that TDP can actually
translate, rather than the host physical bits.
[1]
https://www.intel.com/content/www/us/en/content-details/671442/5-level-paging-and-5-level-ept-white-paper.html
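The constraint described above can be sketched in stand-alone form. This is an illustrative model, not KVM code; the helper names are hypothetical, and note that the architectural MAXPHYADDR additionally caps GPA width at 52 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a 4-level EPT walk can translate 48 bits of GPA,
 * a 5-level walk 57 bits (further capped by MAXPHYADDR, at most 52). */
static unsigned int ept_translatable_bits(int ept_levels)
{
	return ept_levels == 5 ? 57 : 48;
}

/* A GPA with any bit set at or above the translatable width can never
 * be mapped, so the EPT violation it causes can never be resolved. */
static int gpa_is_translatable(uint64_t gpa, int ept_levels)
{
	return (gpa >> ept_translatable_bits(ept_levels)) == 0;
}
```

With 4-level EPT, a GPA such as 1ULL << 48 is untranslatable even though it is below a 52-bit host MAXPHYADDR, which is exactly the situation the selftests hit.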
Tao Su (2):
x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
x86: KVM: Emulate instruction when GPA can't be translated by EPT
arch/x86/kvm/cpuid.c | 5 +++--
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 7 +++++++
arch/x86/kvm/vmx/vmx.c | 7 +++++++
4 files changed, 18 insertions(+), 2 deletions(-)
base-commit: ceb6a6f023fd3e8b07761ed900352ef574010bcb
--
2.34.1
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-18 14:05 [PATCH 0/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported Tao Su
@ 2023-12-18 14:05 ` Tao Su
2023-12-18 15:13 ` Sean Christopherson
2023-12-20 16:28 ` Sean Christopherson
2023-12-18 14:05 ` [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT Tao Su
1 sibling, 2 replies; 34+ messages in thread
From: Tao Su @ 2023-12-18 14:05 UTC (permalink / raw)
To: kvm
Cc: seanjc, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng, tao1.su
When the host doesn't support 5-level EPT, bits 51:48 of the guest physical
address must all be zero, otherwise an EPT violation always occurs and the
current handler can't resolve it if the GPA is in a RAM region. Hence, the
instruction keeps being re-executed, which causes infinite EPT violations.
Six KVM selftests time out due to this issue:
kvm:access_tracking_perf_test
kvm:demand_paging_test
kvm:dirty_log_test
kvm:dirty_log_perf_test
kvm:kvm_page_table_test
kvm:memslot_modification_stress_test
The above selftests add a RAM region close to max_gfn. If the host has 52
physical bits but doesn't support 5-level EPT, they trigger infinite EPT
violations when accessing that RAM region.
Since Intel CPUID doesn't report the maximum guest physical bits the way
AMD does, introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits
when TDP is enabled and report a maximum guest physical bits smaller than
the host's.
When guest physical bits is smaller than the host's, some GPAs are illegal
from the guest's perspective but still legal from the hardware's
perspective, and accesses to them should be trapped so a #PF can be
injected. KVM already has a module parameter, allow_smaller_maxphyaddr, to
support the case where guest.MAXPHYADDR < host.MAXPHYADDR; it is disabled
by default when EPT is enabled, but the user can enable it when loading the
kvm-intel module. When allow_smaller_maxphyaddr is enabled and the guest
accesses an address that is illegal from its own perspective, KVM uses the
EPT violation to emulate the instruction, determine the #PF error code and
inject the #PF.
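The CPUID leaf being adjusted packs both address widths into one register; a minimal stand-alone sketch of that encoding (mirroring the `entry->eax = g_phys_as | (virt_as << 8)` line in the diff below; the helper names are mine, not KVM's):

```c
#include <assert.h>
#include <stdint.h>

/* CPUID.0x80000008:EAX packs the address widths: bits 7:0 carry the
 * physical-address width, bits 15:8 the linear-address width. */
static uint32_t pack_addr_widths(uint32_t g_phys_as, uint32_t virt_as)
{
	return (g_phys_as & 0xff) | ((virt_as & 0xff) << 8);
}

static uint32_t unpack_phys_bits(uint32_t eax)
{
	return eax & 0xff;
}

static uint32_t unpack_virt_bits(uint32_t eax)
{
	return (eax >> 8) & 0xff;
}
```

So capping g_phys_as before this packing is all that is needed to advertise a smaller guest.MAXPHYADDR.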
Reported-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Tested-by: Xudong Hao <xudong.hao@intel.com>
---
arch/x86/kvm/cpuid.c | 5 +++--
arch/x86/kvm/mmu.h | 1 +
arch/x86/kvm/mmu/mmu.c | 7 +++++++
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dda6fc4cfae8..91933ca739ad 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
*
* If TDP is enabled but an explicit guest MAXPHYADDR is not
* provided, use the raw bare metal MAXPHYADDR as reductions to
- * the HPAs do not affect GPAs.
+ * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
+ * doesn't exceed the bits that TDP can translate.
*/
if (!tdp_enabled)
g_phys_as = boot_cpu_data.x86_phys_bits;
else if (!g_phys_as)
- g_phys_as = phys_as;
+ g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
entry->eax = g_phys_as | (virt_as << 8);
entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bb8c86eefac0..1c7d649fcf6b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
u64 fault_address, char *insn, int insn_len);
void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu);
+unsigned int kvm_mmu_tdp_maxphyaddr(void);
int kvm_mmu_load(struct kvm_vcpu *vcpu);
void kvm_mmu_unload(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c57e181bba21..72634d6b61b2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
reset_guest_paging_metadata(vcpu, mmu);
}
+/* guest-physical-address bits limited by TDP */
+unsigned int kvm_mmu_tdp_maxphyaddr(void)
+{
+ return max_tdp_level == 5 ? 57 : 48;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
+
static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
{
/* tdp_root_level is architecture forced level, use it if nonzero */
--
2.34.1
* [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT
2023-12-18 14:05 [PATCH 0/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported Tao Su
2023-12-18 14:05 ` [PATCH 1/2] " Tao Su
@ 2023-12-18 14:05 ` Tao Su
2023-12-18 15:23 ` Sean Christopherson
2023-12-20 13:42 ` Jim Mattson
1 sibling, 2 replies; 34+ messages in thread
From: Tao Su @ 2023-12-18 14:05 UTC (permalink / raw)
To: kvm
Cc: seanjc, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng, tao1.su
With 4-level EPT, bits 51:48 of the guest physical address must all be
zero; otherwise, an EPT violation always occurs, which is currently an
unexpected VM exit for KVM.
Even though KVM advertises the max physical bits to the guest, the guest
may ignore MAXPHYADDR in CPUID and set larger physical bits via KVM.
Rejecting invalid guest physical bits on the KVM side is one option, but it
would break the current KVM ABI: today's QEMU ignores the physical bits
advertised by KVM and by default uses the host physical bits as the guest
physical bits when '-cpu host' is used. Although we would like to send a
patch to QEMU, that would still leave backward compatibility issues.
For a GPA that can't be translated by EPT but is within host.MAXPHYADDR,
emulation is the best choice: KVM injects a #PF for the GPA, which is
invalid from the guest's perspective, and tries to emulate the instruction,
minimizing the impact on the guest.
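The check added in the diff below relies on KVM's rsvd_bits() helper. A minimal stand-alone model of its semantics (assuming s <= e; the real kernel helper also handles memory encryption masks, which this sketch ignores):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of rsvd_bits(s, e): a mask with bits s..e (inclusive) set.
 * "gpa & rsvd_bits(maxphyaddr, 63)" is then nonzero exactly when the
 * GPA sets bits EPT cannot translate. Assumes 0 <= s <= e <= 63. */
static uint64_t rsvd_bits(int s, int e)
{
	return ((~0ULL) >> (63 - (e - s))) << s;
}
```

For example, with a 48-bit limit, rsvd_bits(48, 63) yields 0xFFFF000000000000, so any GPA touching bits 51:48 is flagged for emulation.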
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
---
arch/x86/kvm/vmx/vmx.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index be20a60047b1..a8aa2cfa2f5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5774,6 +5774,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
vcpu->arch.exit_qualification = exit_qualification;
+ /*
+ * Emulate the instruction when it accesses a GPA that sets any bits
+ * beyond the guest-physical bits that EPT can translate.
+ */
+ if (unlikely(gpa & rsvd_bits(kvm_mmu_tdp_maxphyaddr(), 63)))
+ return kvm_emulate_instruction(vcpu, 0);
+
/*
* Check that the GPA doesn't exceed physical memory limits, as that is
* a guest page fault. We have to emulate the instruction here, because
--
2.34.1
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-18 14:05 ` [PATCH 1/2] " Tao Su
@ 2023-12-18 15:13 ` Sean Christopherson
2023-12-19 2:51 ` Chao Gao
2023-12-19 8:31 ` Tao Su
2023-12-20 16:28 ` Sean Christopherson
1 sibling, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-12-18 15:13 UTC (permalink / raw)
To: Tao Su
Cc: kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Mon, Dec 18, 2023, Tao Su wrote:
> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> address must all be zero, otherwise an EPT violation always occurs and
> current handler can't resolve this if the gpa is in RAM region. Hence,
> instruction will keep being executed repeatedly, which causes infinite
> EPT violation.
>
> Six KVM selftests are timeout due to this issue:
> kvm:access_tracking_perf_test
> kvm:demand_paging_test
> kvm:dirty_log_test
> kvm:dirty_log_perf_test
> kvm:kvm_page_table_test
> kvm:memslot_modification_stress_test
>
> The above selftests add a RAM region close to max_gfn, if host has 52
> physical bits but doesn't support 5-level EPT, these will trigger infinite
> EPT violation when access the RAM region.
>
> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> enabled and report the max guest physical bits which is smaller than host.
>
> When guest physical bits is smaller than host, some GPA are illegal from
> guest's perspective, but are still legal from hardware's perspective,
> which should be trapped to inject #PF. Current KVM already has a parameter
> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> is enabled and guest accesses an illegal address from guest's perspective,
> KVM will utilize EPT violation and emulate the instruction to inject #PF
> and determine #PF error code.
No, fix the selftests, it's not KVM's responsibility to advertise the correct
guest.MAXPHYADDR.
* Re: [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT
2023-12-18 14:05 ` [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT Tao Su
@ 2023-12-18 15:23 ` Sean Christopherson
2023-12-19 3:10 ` Chao Gao
2023-12-20 13:42 ` Jim Mattson
1 sibling, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2023-12-18 15:23 UTC (permalink / raw)
To: Tao Su
Cc: kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Mon, Dec 18, 2023, Tao Su wrote:
> With 4-level EPT, bits 51:48 of the guest physical address must all
> be zero; otherwise, an EPT violation always occurs, which is an unexpected
> VM exit in KVM currently.
>
> Even though KVM advertises the max physical bits to guest, guest may
> ignore MAXPHYADDR in CPUID and set a bigger physical bits to KVM.
> Rejecting invalid guest physical bits on KVM side is a choice, but it will
> break current KVM ABI, e.g., current QEMU ignores the physical bits
> advertised by KVM and uses host physical bits as guest physical bits by
> default when using '-cpu host', although we would like to send a patch to
> QEMU, it will still cause backward compatibility issues.
>
> For GPA that can't be translated by EPT but within host.MAXPHYADDR,
> emulation should be the best choice since KVM will inject #PF for the
> invalid GPA in guest's perspective and try to emulate the instructions
> which minimizes the impact on guests as much as possible.
NAK. allow_smaller_maxphyaddr is a bit of a mess and IMO was a mistake, but
at least there was reasonable motivation for trying to support guests with a small
MAXPHYADDR. Fudging around a QEMU bug is not good enough justification, especially
since the odds of a hack in KVM fully working are slim to none.
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-18 15:13 ` Sean Christopherson
@ 2023-12-19 2:51 ` Chao Gao
2023-12-19 3:40 ` Jim Mattson
2023-12-19 8:31 ` Tao Su
1 sibling, 1 reply; 34+ messages in thread
From: Chao Gao @ 2023-12-19 2:51 UTC (permalink / raw)
To: Sean Christopherson
Cc: Tao Su, kvm, pbonzini, eddie.dong, xiaoyao.li, yuan.yao, yi1.lai,
xudong.hao, chao.p.peng
On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
>On Mon, Dec 18, 2023, Tao Su wrote:
>> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
>> address must all be zero, otherwise an EPT violation always occurs and
>> current handler can't resolve this if the gpa is in RAM region. Hence,
>> instruction will keep being executed repeatedly, which causes infinite
>> EPT violation.
>>
>> Six KVM selftests are timeout due to this issue:
>> kvm:access_tracking_perf_test
>> kvm:demand_paging_test
>> kvm:dirty_log_test
>> kvm:dirty_log_perf_test
>> kvm:kvm_page_table_test
>> kvm:memslot_modification_stress_test
>>
>> The above selftests add a RAM region close to max_gfn, if host has 52
>> physical bits but doesn't support 5-level EPT, these will trigger infinite
>> EPT violation when access the RAM region.
>>
>> Since current Intel CPUID doesn't report max guest physical bits like AMD,
>> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
>> enabled and report the max guest physical bits which is smaller than host.
>>
>> When guest physical bits is smaller than host, some GPA are illegal from
>> guest's perspective, but are still legal from hardware's perspective,
>> which should be trapped to inject #PF. Current KVM already has a parameter
>> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
>> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
>> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
>> is enabled and guest accesses an illegal address from guest's perspective,
>> KVM will utilize EPT violation and emulate the instruction to inject #PF
>> and determine #PF error code.
>
>No, fix the selftests, it's not KVM's responsibility to advertise the correct
>guest.MAXPHYADDR.
In this case, host.MAXPHYADDR is 52 and EPT supports 4-level only thus can
translate up to 48 bits of GPA.
Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
is invalid/incorrect. How can we say the selftests are at fault and that we
should fix them?
* Re: [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT
2023-12-18 15:23 ` Sean Christopherson
@ 2023-12-19 3:10 ` Chao Gao
0 siblings, 0 replies; 34+ messages in thread
From: Chao Gao @ 2023-12-19 3:10 UTC (permalink / raw)
To: Sean Christopherson
Cc: Tao Su, kvm, pbonzini, eddie.dong, xiaoyao.li, yuan.yao, yi1.lai,
xudong.hao, chao.p.peng
On Mon, Dec 18, 2023 at 07:23:01AM -0800, Sean Christopherson wrote:
>On Mon, Dec 18, 2023, Tao Su wrote:
>> With 4-level EPT, bits 51:48 of the guest physical address must all
>> be zero; otherwise, an EPT violation always occurs, which is an unexpected
>> VM exit in KVM currently.
>>
>> Even though KVM advertises the max physical bits to guest, guest may
>> ignore MAXPHYADDR in CPUID and set a bigger physical bits to KVM.
>> Rejecting invalid guest physical bits on KVM side is a choice, but it will
>> break current KVM ABI, e.g., current QEMU ignores the physical bits
>> advertised by KVM and uses host physical bits as guest physical bits by
>> default when using '-cpu host', although we would like to send a patch to
>> QEMU, it will still cause backward compatibility issues.
>>
>> For GPA that can't be translated by EPT but within host.MAXPHYADDR,
>> emulation should be the best choice since KVM will inject #PF for the
>> invalid GPA in guest's perspective and try to emulate the instructions
>> which minimizes the impact on guests as much as possible.
>
>NAK. allow_smaller_maxphyaddr is a bit of a mess and in IMO was a mistake, but
>at least there was reasonable motivation for trying to support guests with a small
>MAXPHYADDR. Fudging around a QEMU bug is not good enough justification, especially
>since the odds of a hack in KVM fully working are slim to none.
The changelog is a little misleading. Even if there were no QEMU "bug" and QEMU
set guest.MAXPHYADDR to 48 correctly, the guest __can__ still set up its CR3
page tables to access GPAs with bits 51:48 set. In that case, KVM still gets an
EPT violation and needs to inject a #PF into the guest (by emulating the
instruction).
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-19 2:51 ` Chao Gao
@ 2023-12-19 3:40 ` Jim Mattson
2023-12-19 8:09 ` Chao Gao
0 siblings, 1 reply; 34+ messages in thread
From: Jim Mattson @ 2023-12-19 3:40 UTC (permalink / raw)
To: Chao Gao
Cc: Sean Christopherson, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Mon, Dec 18, 2023 at 6:51 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
> >On Mon, Dec 18, 2023, Tao Su wrote:
> >> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> >> address must all be zero, otherwise an EPT violation always occurs and
> >> current handler can't resolve this if the gpa is in RAM region. Hence,
> >> instruction will keep being executed repeatedly, which causes infinite
> >> EPT violation.
> >>
> >> Six KVM selftests are timeout due to this issue:
> >> kvm:access_tracking_perf_test
> >> kvm:demand_paging_test
> >> kvm:dirty_log_test
> >> kvm:dirty_log_perf_test
> >> kvm:kvm_page_table_test
> >> kvm:memslot_modification_stress_test
> >>
> >> The above selftests add a RAM region close to max_gfn, if host has 52
> >> physical bits but doesn't support 5-level EPT, these will trigger infinite
> >> EPT violation when access the RAM region.
> >>
> >> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> >> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> >> enabled and report the max guest physical bits which is smaller than host.
> >>
> >> When guest physical bits is smaller than host, some GPA are illegal from
> >> guest's perspective, but are still legal from hardware's perspective,
> >> which should be trapped to inject #PF. Current KVM already has a parameter
> >> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> >> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> >> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> >> is enabled and guest accesses an illegal address from guest's perspective,
> >> KVM will utilize EPT violation and emulate the instruction to inject #PF
> >> and determine #PF error code.
> >
> >No, fix the selftests, it's not KVM's responsibility to advertise the correct
> >guest.MAXPHYADDR.
>
> In this case, host.MAXPHYADDR is 52 and EPT supports 4-level only thus can
> translate up to 48 bits of GPA.
There are a number of issues that KVM does not handle when
guest.MAXPHYADDR < host.MAXPHYADDR. For starters, KVM doesn't raise a
#GP in PAE mode when one of the address bits above guest.MAXPHYADDR is
set in one of the PDPTRs.
Honestly, I think KVM should just disable EPT if the EPT tables can't
support the CPU's physical address width.
> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> is invalid/incorrect. how can we say selftests are at fault and we should fix
> them?
In this case, the CPU is at fault, and you should complain to the CPU vendor.
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-19 3:40 ` Jim Mattson
@ 2023-12-19 8:09 ` Chao Gao
2023-12-19 15:26 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: Chao Gao @ 2023-12-19 8:09 UTC (permalink / raw)
To: Jim Mattson
Cc: Sean Christopherson, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
>On Mon, Dec 18, 2023 at 6:51 PM Chao Gao <chao.gao@intel.com> wrote:
>>
>> On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
>> >On Mon, Dec 18, 2023, Tao Su wrote:
>> >> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
>> >> address must all be zero, otherwise an EPT violation always occurs and
>> >> current handler can't resolve this if the gpa is in RAM region. Hence,
>> >> instruction will keep being executed repeatedly, which causes infinite
>> >> EPT violation.
>> >>
>> >> Six KVM selftests are timeout due to this issue:
>> >> kvm:access_tracking_perf_test
>> >> kvm:demand_paging_test
>> >> kvm:dirty_log_test
>> >> kvm:dirty_log_perf_test
>> >> kvm:kvm_page_table_test
>> >> kvm:memslot_modification_stress_test
>> >>
>> >> The above selftests add a RAM region close to max_gfn, if host has 52
>> >> physical bits but doesn't support 5-level EPT, these will trigger infinite
>> >> EPT violation when access the RAM region.
>> >>
>> >> Since current Intel CPUID doesn't report max guest physical bits like AMD,
>> >> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
>> >> enabled and report the max guest physical bits which is smaller than host.
>> >>
>> >> When guest physical bits is smaller than host, some GPA are illegal from
>> >> guest's perspective, but are still legal from hardware's perspective,
>> >> which should be trapped to inject #PF. Current KVM already has a parameter
>> >> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
>> >> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
>> >> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
>> >> is enabled and guest accesses an illegal address from guest's perspective,
>> >> KVM will utilize EPT violation and emulate the instruction to inject #PF
>> >> and determine #PF error code.
>> >
>> >No, fix the selftests, it's not KVM's responsibility to advertise the correct
>> >guest.MAXPHYADDR.
>>
>> In this case, host.MAXPHYADDR is 52 and EPT supports 4-level only thus can
>> translate up to 48 bits of GPA.
>
>There are a number of issues that KVM does not handle when
>guest.MAXPHYADDR < host.MAXPHYADDR. For starters, KVM doesn't raise a
>#GP in PAE mode when one of the address bits above guest.MAXPHYADDR is
>set in one of the PDPTRs.
These are long-standing issues, I believe.
Note: current KVM ABI doesn't enforce guest.MAXPHYADDR = host.MAXPHYADDR
regardless of "allow_smaller_maxphyaddr".
>
>Honestly, I think KVM should just disable EPT if the EPT tables can't
>support the CPU's physical address width.
Yes, it is an option.
But I prefer to let the admin override this (i.e., the admin can still enable
EPT via a module parameter), because those issues are not new and disabling EPT
doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
>
>> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
>> is invalid/incorrect. how can we say selftests are at fault and we should fix
>> them?
>
>In this case, the CPU is at fault, and you should complain to the CPU vendor.
Yeah, I agree with you and will check with the related team inside Intel. My
point was just that this isn't a selftest issue, because not all of the
information is disclosed to the tests.
And I am afraid KVM as an L1 VMM may run into this situation, i.e., only
4-level EPT is supported but MAXPHYADDR is 52. So KVM needs a fix anyway.
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-18 15:13 ` Sean Christopherson
2023-12-19 2:51 ` Chao Gao
@ 2023-12-19 8:31 ` Tao Su
1 sibling, 0 replies; 34+ messages in thread
From: Tao Su @ 2023-12-19 8:31 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Mon, Dec 18, 2023 at 07:13:27AM -0800, Sean Christopherson wrote:
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> >
> > Six KVM selftests are timeout due to this issue:
> > kvm:access_tracking_perf_test
> > kvm:demand_paging_test
> > kvm:dirty_log_test
> > kvm:dirty_log_perf_test
> > kvm:kvm_page_table_test
> > kvm:memslot_modification_stress_test
> >
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> >
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> >
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
>
> No, fix the selftests, it's not KVM's responsibility to advertise the correct
> guest.MAXPHYADDR.
This patch is not meant to fix those selftests; it fixes the issue the
selftests exposed. KVM is responsible for reporting a correct
guest.MAXPHYADDR, which lets userspace set valid physical bits via KVM.
Actually, KVM will get stuck in a loop if it tries to build an SPTE with any
of bits[51:48] set, e.g., when a huge amount of RAM is assigned to the guest.
Thanks,
Tao
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-19 8:09 ` Chao Gao
@ 2023-12-19 15:26 ` Sean Christopherson
2023-12-20 7:16 ` Xiaoyao Li
` (2 more replies)
0 siblings, 3 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-12-19 15:26 UTC (permalink / raw)
To: Chao Gao
Cc: Jim Mattson, Tao Su, kvm, pbonzini, eddie.dong, xiaoyao.li,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Tue, Dec 19, 2023, Chao Gao wrote:
> On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> >Honestly, I think KVM should just disable EPT if the EPT tables can't
> >support the CPU's physical address width.
>
> Yes, it is an option.
> But I prefer to allow admin to override this (i.e., admin still can enable EPT
> via module parameter) because those issues are not new and disabling EPT
> doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
>
> >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> >> them?
> >
> >In this case, the CPU is at fault, and you should complain to the CPU vendor.
>
> Yeah, I agree with you and will check with related team inside Intel.
I agree that the CPU is being weird, but this is technically an architecturally
legal configuration, and KVM has largely committed to supporting weird setups.
At some point we have to draw a line when things get too ridiculous, but I don't
think this particular oddity crosses into absurd territory.
> My point was just this isn't a selftest issue because not all information is
> disclosed to the tests.
Ah, right, EPT capabilities are in MSRs that userspace can't read.
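For reference, those capabilities live in IA32_VMX_EPT_VPID_CAP (MSR 0x48C), which is readable only in ring 0. A hedged sketch of decoding it (the bit positions are per my reading of the Intel SDM and should be verified there; the helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* IA32_VMX_EPT_VPID_CAP (MSR 0x48C): bit 6 advertises support for a
 * 4-level EPT page walk, bit 7 for a 5-level walk. Userspace cannot
 * read this MSR, which is why the selftests cannot see that the host
 * is limited to 4-level EPT. */
static int ept_5level_supported(uint64_t ept_vpid_cap)
{
	return (ept_vpid_cap >> 7) & 1;
}

static unsigned int ept_max_gpa_bits(uint64_t ept_vpid_cap)
{
	return ept_5level_supported(ept_vpid_cap) ? 57 : 48;
}
```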
> And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
Yes, but forcing emulation for a funky setup is not a good fix. KVM can simply
constrain the advertised MAXPHYADDR, no?
---
arch/x86/kvm/cpuid.c | 17 +++++++++++++----
arch/x86/kvm/mmu.h | 2 ++
arch/x86/kvm/mmu/mmu.c | 5 +++++
3 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 294e5bd5f8a0..5c346e1a10bd 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
*
* If TDP is enabled but an explicit guest MAXPHYADDR is not
* provided, use the raw bare metal MAXPHYADDR as reductions to
- * the HPAs do not affect GPAs.
+ * the HPAs do not affect GPAs. Finally, if TDP is enabled and
+ * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
+ * bits as KVM can't install SPTEs for larger GPAs.
*/
- if (!tdp_enabled)
+ if (!tdp_enabled) {
g_phys_as = boot_cpu_data.x86_phys_bits;
- else if (!g_phys_as)
- g_phys_as = phys_as;
+ } else {
+ u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
+
+ if (!g_phys_as)
+ g_phys_as = phys_as;
+
+ if (max_tdp_level < 5)
+ g_phys_as = min(g_phys_as, 48);
+ }
entry->eax = g_phys_as | (virt_as << 8);
entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..b410a227c601 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
return boot_cpu_data.x86_phys_bits;
}
+u8 kvm_mmu_get_max_tdp_level(void);
+
void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..b2845f5520b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
return max_tdp_level;
}
+u8 kvm_mmu_get_max_tdp_level(void)
+{
+ return tdp_root_level ? tdp_root_level : max_tdp_level;
+}
+
static union kvm_mmu_page_role
kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
union kvm_cpu_role cpu_role)
base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
--
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-19 15:26 ` Sean Christopherson
@ 2023-12-20 7:16 ` Xiaoyao Li
2023-12-20 15:37 ` Sean Christopherson
2023-12-20 11:59 ` Tao Su
2023-12-20 13:39 ` Jim Mattson
2 siblings, 1 reply; 34+ messages in thread
From: Xiaoyao Li @ 2023-12-20 7:16 UTC (permalink / raw)
To: Sean Christopherson, Chao Gao
Cc: Jim Mattson, Tao Su, kvm, pbonzini, eddie.dong, yuan.yao, yi1.lai,
xudong.hao, chao.p.peng
On 12/19/2023 11:26 PM, Sean Christopherson wrote:
> KVM can simply
> constrain the advertised MAXPHYADDR, no?
Sean, it looks like you agree with this patch (patch 1) now.
I think it would be better for you to comment on Tao's original code
instead of throwing out your own version. Tao needs to know why his
version is not OK/correct and what can be improved.
Thanks,
-Xiaoyao
> ---
> arch/x86/kvm/cpuid.c | 17 +++++++++++++----
> arch/x86/kvm/mmu.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 5 +++++
> 3 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..5c346e1a10bd 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> *
> * If TDP is enabled but an explicit guest MAXPHYADDR is not
> * provided, use the raw bare metal MAXPHYADDR as reductions to
> - * the HPAs do not affect GPAs.
> + * the HPAs do not affect GPAs. Finally, if TDP is enabled and
> + * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
> + * bits as KVM can't install SPTEs for larger GPAs.
> */
> - if (!tdp_enabled)
> + if (!tdp_enabled) {
> g_phys_as = boot_cpu_data.x86_phys_bits;
> - else if (!g_phys_as)
> - g_phys_as = phys_as;
> + } else {
> + u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
> +
> + if (!g_phys_as)
> + g_phys_as = phys_as;
> +
> + if (max_tdp_level < 5)
> + g_phys_as = min(g_phys_as, 48);
> + }
>
> entry->eax = g_phys_as | (virt_as << 8);
> entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
> return boot_cpu_data.x86_phys_bits;
> }
>
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
> void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
> void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..b2845f5520b3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
> return max_tdp_level;
> }
>
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> + return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
> static union kvm_mmu_page_role
> kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
> union kvm_cpu_role cpu_role)
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-19 15:26 ` Sean Christopherson
2023-12-20 7:16 ` Xiaoyao Li
@ 2023-12-20 11:59 ` Tao Su
2023-12-20 13:39 ` Jim Mattson
2 siblings, 0 replies; 34+ messages in thread
From: Tao Su @ 2023-12-20 11:59 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, Jim Mattson, kvm, pbonzini, eddie.dong, xiaoyao.li,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Tue, Dec 19, 2023 at 07:26:00AM -0800, Sean Christopherson wrote:
> On Tue, Dec 19, 2023, Chao Gao wrote:
> > On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> > >Honestly, I think KVM should just disable EPT if the EPT tables can't
> > >support the CPU's physical address width.
> >
> > Yes, it is an option.
> > But I prefer to allow admin to override this (i.e., admin still can enable EPT
> > via module parameter) because those issues are not new and disabling EPT
> > doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
> >
> > >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> > >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> > >> them?
> > >
> > >In this case, the CPU is at fault, and you should complain to the CPU vendor.
> >
> > Yeah, I agree with you and will check with related team inside Intel.
>
> I agree that the CPU is being weird, but this is technically an architecturally
> legal configuration, and KVM has largely committed to supporting weird setups.
> At some point we have to draw a line when things get too ridiculous, but I don't
> think this particular oddity crosses into absurd territory.
>
> > My point was just this isn't a selftest issue because not all information is
> > disclosed to the tests.
>
> Ah, right, EPT capabilities are in MSRs that userspace can't read.
>
> > And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> > EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
>
> Yes, but forcing emulation for a funky setup is not a good fix. KVM can simply
> constrain the advertised MAXPHYADDR, no?
The GPA is controlled by the guest, i.e., the guest can simply install a PTE
in its own page table, and GPAs beyond 48 bits will always trigger EPT
violations. If KVM does nothing, the guest can't get a #PF when accessing
addresses above MAXPHYADDR, which is inconsistent with architectural behavior.
But doing nothing is also an option, because userspace doesn't respect the
reported value anyway.
Thanks,
Tao
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-19 15:26 ` Sean Christopherson
2023-12-20 7:16 ` Xiaoyao Li
2023-12-20 11:59 ` Tao Su
@ 2023-12-20 13:39 ` Jim Mattson
2 siblings, 0 replies; 34+ messages in thread
From: Jim Mattson @ 2023-12-20 13:39 UTC (permalink / raw)
To: Sean Christopherson
Cc: Chao Gao, Tao Su, kvm, pbonzini, eddie.dong, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Tue, Dec 19, 2023 at 7:26 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 19, 2023, Chao Gao wrote:
> > On Mon, Dec 18, 2023 at 07:40:11PM -0800, Jim Mattson wrote:
> > >Honestly, I think KVM should just disable EPT if the EPT tables can't
> > >support the CPU's physical address width.
> >
> > Yes, it is an option.
> > But I prefer to allow admin to override this (i.e., admin still can enable EPT
> > via module parameter) because those issues are not new and disabling EPT
> > doesn't prevent QEMU from launching guests w/ smaller MAXPHYADDR.
> >
> > >> Here nothing visible to selftests or QEMU indicates that guest.MAXPHYADDR = 52
> > >> is invalid/incorrect. how can we say selftests are at fault and we should fix
> > >> them?
> > >
> > >In this case, the CPU is at fault, and you should complain to the CPU vendor.
> >
> > Yeah, I agree with you and will check with related team inside Intel.
>
> I agree that the CPU is being weird, but this is technically an architecturally
> legal configuration, and KVM has largely committed to supporting weird setups.
> At some point we have to draw a line when things get too ridiculous, but I don't
> think this particular oddity crosses into absurd territory.
>
> > My point was just this isn't a selftest issue because not all information is
> > disclosed to the tests.
>
> Ah, right, EPT capabilities are in MSRs that userspace can't read.
>
> > And I am afraid KVM as L1 VMM may run into this situation, i.e., only 4-level
> > EPT is supported but MAXPHYADDR is 52. So, KVM needs a fix anyway.
>
> Yes, but forcing emulation for a funky setup is not a good fix. KVM can simply
> constrain the advertised MAXPHYADDR, no?
This is essentially the same as allow_smaller_maxphyaddr, and has the
same issues:
1) MOV-to-CR3 must be intercepted in PAE mode, to check the bits above
guest.MAXPHYADDR in the PDPTRs.
2) #PF must be intercepted whenever PTEs are 64 bits, to do a software
page walk of the guest's x86 page tables and check the bits above
guest.MAXPHYADDR for a terminal page fault that might result in a
different error code than the one produced by hardware.
3) EPT violations may require instruction emulation to synthesize an
appropriate #PF, and this requires the KVM instruction emulator to be
expanded to understand *all* memory-accessing instructions, not just
the limited set it understands now.
I just don't see how this is even remotely practical.
> ---
> arch/x86/kvm/cpuid.c | 17 +++++++++++++----
> arch/x86/kvm/mmu.h | 2 ++
> arch/x86/kvm/mmu/mmu.c | 5 +++++
> 3 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..5c346e1a10bd 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,21 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> *
> * If TDP is enabled but an explicit guest MAXPHYADDR is not
> * provided, use the raw bare metal MAXPHYADDR as reductions to
> - * the HPAs do not affect GPAs.
> + * the HPAs do not affect GPAs. Finally, if TDP is enabled and
> + * doesn't support 5-level paging, cap guest MAXPHYADDR at 48
> + * bits as KVM can't install SPTEs for larger GPAs.
> */
> - if (!tdp_enabled)
> + if (!tdp_enabled) {
> g_phys_as = boot_cpu_data.x86_phys_bits;
> - else if (!g_phys_as)
> - g_phys_as = phys_as;
> + } else {
> + u8 max_tdp_level = kvm_mmu_get_max_tdp_level();
> +
> + if (!g_phys_as)
> + g_phys_as = phys_as;
> +
> + if (max_tdp_level < 5)
> + g_phys_as = min(g_phys_as, 48);
> + }
>
> entry->eax = g_phys_as | (virt_as << 8);
> entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..b410a227c601 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -100,6 +100,8 @@ static inline u8 kvm_get_shadow_phys_bits(void)
> return boot_cpu_data.x86_phys_bits;
> }
>
> +u8 kvm_mmu_get_max_tdp_level(void);
> +
> void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
> void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..b2845f5520b3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5267,6 +5267,11 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
> return max_tdp_level;
> }
>
> +u8 kvm_mmu_get_max_tdp_level(void)
> +{
> + return tdp_root_level ? tdp_root_level : max_tdp_level;
> +}
> +
> static union kvm_mmu_page_role
> kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
> union kvm_cpu_role cpu_role)
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
* Re: [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT
2023-12-18 14:05 ` [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT Tao Su
2023-12-18 15:23 ` Sean Christopherson
@ 2023-12-20 13:42 ` Jim Mattson
2024-01-08 13:48 ` Tao Su
1 sibling, 1 reply; 34+ messages in thread
From: Jim Mattson @ 2023-12-20 13:42 UTC (permalink / raw)
To: Tao Su
Cc: kvm, seanjc, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Mon, Dec 18, 2023 at 6:08 AM Tao Su <tao1.su@linux.intel.com> wrote:
>
> With 4-level EPT, bits 51:48 of the guest physical address must all
> be zero; otherwise, an EPT violation always occurs, which is an unexpected
> VM exit in KVM currently.
>
> Even though KVM advertises the max physical bits to guest, guest may
> ignore MAXPHYADDR in CPUID and set a bigger physical bits to KVM.
> Rejecting invalid guest physical bits on KVM side is a choice, but it will
> break current KVM ABI, e.g., current QEMU ignores the physical bits
> advertised by KVM and uses host physical bits as guest physical bits by
> default when using '-cpu host', although we would like to send a patch to
> QEMU, it will still cause backward compatibility issues.
>
> For GPA that can't be translated by EPT but within host.MAXPHYADDR,
> emulation should be the best choice since KVM will inject #PF for the
> invalid GPA in guest's perspective and try to emulate the instructions
> which minimizes the impact on guests as much as possible.
>
> Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> Tested-by: Yi Lai <yi1.lai@intel.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index be20a60047b1..a8aa2cfa2f5d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5774,6 +5774,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>
> vcpu->arch.exit_qualification = exit_qualification;
>
> + /*
> + * Emulate the instruction when accessing a GPA which is set any bits
> + * beyond guest-physical bits that EPT can translate.
> + */
> + if (unlikely(gpa & rsvd_bits(kvm_mmu_tdp_maxphyaddr(), 63)))
> + return kvm_emulate_instruction(vcpu, 0);
> +
This doesn't really work, since the KVM instruction emulator is
woefully incomplete. To make this work, first you have to teach the
KVM instruction emulator how to emulate *all* memory-accessing
instructions.
> /*
> * Check that the GPA doesn't exceed physical memory limits, as that is
> * a guest page fault. We have to emulate the instruction here, because
> --
> 2.34.1
>
>
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-20 7:16 ` Xiaoyao Li
@ 2023-12-20 15:37 ` Sean Christopherson
0 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-12-20 15:37 UTC (permalink / raw)
To: Xiaoyao Li
Cc: Chao Gao, Jim Mattson, Tao Su, kvm, pbonzini, eddie.dong,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Wed, Dec 20, 2023, Xiaoyao Li wrote:
> On 12/19/2023 11:26 PM, Sean Christopherson wrote:
> > KVM can simply
> > constrain the advertised MAXPHYADDR, no?
>
> Sean. It looks you agree with this patch (Patch 1) now.
Doh, my apologies. I don't think I even got to the code, I reacted to the shortlog,
changelog, and to patch 2. I'll re-respond.
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-18 14:05 ` [PATCH 1/2] " Tao Su
2023-12-18 15:13 ` Sean Christopherson
@ 2023-12-20 16:28 ` Sean Christopherson
2023-12-21 7:45 ` Tao Su
2023-12-21 8:19 ` Xu Yilun
1 sibling, 2 replies; 34+ messages in thread
From: Sean Christopherson @ 2023-12-20 16:28 UTC (permalink / raw)
To: Tao Su
Cc: kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
As evidenced by my initial response, the shortlog is a bit misleading. In non-KVM
code it would be perfectly ok to say "limit", but in KVM's split world where
*userspace* is mostly responsible for the guest configuration, "limit guest ..."
is often going to be read as "limit the capabilities of the guest".
This is also a good example of why shortlogs and changelogs should NOT be play-by-play
descriptions of the code change. The literal code change applies a limit to
guest physical bits, but that doesn't capture the net effect of applying said limit.
Something like
KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
better captures that the patch affects what KVM advertises to userspace. Yeah,
it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
ubiquitous enough that it's worth being technically wrong in order to clearly
communicate the net effect. Alternatively, it could be something like:
KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported
On Mon, Dec 18, 2023, Tao Su wrote:
> When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> address must all be zero, otherwise an EPT violation always occurs and
> current handler can't resolve this if the gpa is in RAM region. Hence,
> instruction will keep being executed repeatedly, which causes infinite
> EPT violation.
>
> Six KVM selftests are timeout due to this issue:
> kvm:access_tracking_perf_test
> kvm:demand_paging_test
> kvm:dirty_log_test
> kvm:dirty_log_perf_test
> kvm:kvm_page_table_test
> kvm:memslot_modification_stress_test
>
> The above selftests add a RAM region close to max_gfn, if host has 52
> physical bits but doesn't support 5-level EPT, these will trigger infinite
> EPT violation when access the RAM region.
>
> Since current Intel CPUID doesn't report max guest physical bits like AMD,
> introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> enabled and report the max guest physical bits which is smaller than host.
>
> When guest physical bits is smaller than host, some GPA are illegal from
> guest's perspective, but are still legal from hardware's perspective,
> which should be trapped to inject #PF. Current KVM already has a parameter
> allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> is enabled and guest accesses an illegal address from guest's perspective,
> KVM will utilize EPT violation and emulate the instruction to inject #PF
> and determine #PF error code.
There is far too much unnecessary cruft in this changelog. The entire last
paragraph is extraneous information. Talking in detail about KVM's (flawed)
emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
it's correct for KVM to advertise an impossible configuration.
And this is exactly why I dislike the "explain the symptom, then the solution"
style for KVM. The symptoms described above don't actually explain why *KVM* is
at fault.
KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
CPU during a page table walk if and only if 5-level TDP is enabled. I.e.
it's impossible for KVM to actually map GPAs with bits 51:49 set, which
results in vCPUs getting stuck on endless EPT violations.
From Intel's SDM:
4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
walk length of 4) to translate a guest-physical address and uses only
bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
and uses guest-physical address bits 56:0. [Physical addresses and
guest-physical addresses are architecturally limited to 52 bits (e.g.,
by paging), so in practice bits 56:52 are zero.]
While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
not 5-level EPT, the combination is architecturally legal and such CPUs
do exist (and can easily be "created" with nested virtualization).
Note, because EPT capabilities are reported via MSRs, it's impossible
for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
is 100% KVM's responsibility.
> Reported-by: Yi Lai <yi1.lai@intel.com>
> Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> Tested-by: Yi Lai <yi1.lai@intel.com>
> Tested-by: Xudong Hao <xudong.hao@intel.com>
> ---
> arch/x86/kvm/cpuid.c | 5 +++--
> arch/x86/kvm/mmu.h | 1 +
> arch/x86/kvm/mmu/mmu.c | 7 +++++++
> 3 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index dda6fc4cfae8..91933ca739ad 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> *
> * If TDP is enabled but an explicit guest MAXPHYADDR is not
> * provided, use the raw bare metal MAXPHYADDR as reductions to
> - * the HPAs do not affect GPAs.
> + * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> + * doesn't exceed the bits that TDP can translate.
> */
> if (!tdp_enabled)
> g_phys_as = boot_cpu_data.x86_phys_bits;
> else if (!g_phys_as)
> - g_phys_as = phys_as;
> + g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
advertises guest.MAXPHYADDR. Advertising a bad guest.MAXPHYADDR is arguably a
blatant CPU bug, but being paranoid is cheap in this case.
> entry->eax = g_phys_as | (virt_as << 8);
> entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index bb8c86eefac0..1c7d649fcf6b 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> u64 fault_address, char *insn, int insn_len);
> void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> struct kvm_mmu *mmu);
> +unsigned int kvm_mmu_tdp_maxphyaddr(void);
>
> int kvm_mmu_load(struct kvm_vcpu *vcpu);
> void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c57e181bba21..72634d6b61b2 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> reset_guest_paging_metadata(vcpu, mmu);
> }
>
> +/* guest-physical-address bits limited by TDP */
> +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> +{
> + return max_tdp_level == 5 ? 57 : 48;
Using "57" is kinda sorta wrong, e.g. the SDM says:
Bits 56:52 of each guest-physical address are necessarily zero because
guest-physical addresses are architecturally limited to 52 bits.
Rather than split hairs over something that doesn't matter, I think it makes sense
for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
is still accurate when tdp_root_level is non-zero).
That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
*host* MAXPHYADDR.
Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
used by the CPUID code, so it's not the end of the world.
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
This shouldn't be exported, I don't see any reason for vendor code to need access
to this helper. It's essentially a moot point though if we avoid the helper in
the first place.
All in all, this?
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 16 ++++++++++++----
arch/x86/kvm/mmu/mmu.c | 2 +-
3 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7bc1daf68741..29b575b86912 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
bool mask);
extern bool tdp_enabled;
+extern int max_tdp_level;
u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 294e5bd5f8a0..637a1f388a51 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
*
* If TDP is enabled but an explicit guest MAXPHYADDR is not
* provided, use the raw bare metal MAXPHYADDR as reductions to
- * the HPAs do not affect GPAs.
+ * the HPAs do not affect GPAs. Finally, if TDP is enabled and
+ * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
+ * as bits 51:49 are used by the CPU if and only if 5-level TDP
+ * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
*/
- if (!tdp_enabled)
+ if (!tdp_enabled) {
g_phys_as = boot_cpu_data.x86_phys_bits;
- else if (!g_phys_as)
- g_phys_as = phys_as;
+ } else {
+ if (!g_phys_as)
+ g_phys_as = phys_as;
+
+ if (max_tdp_level < 5)
+ g_phys_as = min(g_phys_as, 48);
+ }
entry->eax = g_phys_as | (virt_as << 8);
entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3c844e428684..5036c7eb7dac 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
static int max_huge_page_level __read_mostly;
static int tdp_root_level __read_mostly;
-static int max_tdp_level __read_mostly;
+int max_tdp_level __read_mostly;
#define PTE_PREFETCH_NUM 8
base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
--
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-20 16:28 ` Sean Christopherson
@ 2023-12-21 7:45 ` Tao Su
2023-12-21 8:19 ` Xu Yilun
1 sibling, 0 replies; 34+ messages in thread
From: Tao Su @ 2023-12-21 7:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> As evidenced by my initial response, the shortlog is a bit misleading. In non-KVM
> code it would be perfectly ok to say "limit", but in KVM's split world where
> *userspace* is mostly responsible for the guest configuration, "limit guest ..."
> is often going to be read as "limit the capabilities of the guest".
>
> This is also a good example of why shortlogs and changelogs should NOT be play-by-play
> descriptions of the code change. The literal code change applies a limit to
> guest physical bits, but that doesn't capture the net effect of applying said limit.
>
> Something like
>
> KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
>
> better captures that the patch affects what KVM advertises to userspace. Yeah,
> it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
> ubiquitous enough that it's worth being technically wrong in order to clearly
> communicate the net effect. Alternatively, it could be something like:
>
> KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported
Yeah, these two shortlogs are more accurate and intuitive.
>
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > current handler can't resolve this if the gpa is in RAM region. Hence,
> > instruction will keep being executed repeatedly, which causes infinite
> > EPT violation.
> >
> > Six KVM selftests are timeout due to this issue:
> > kvm:access_tracking_perf_test
> > kvm:demand_paging_test
> > kvm:dirty_log_test
> > kvm:dirty_log_perf_test
> > kvm:kvm_page_table_test
> > kvm:memslot_modification_stress_test
> >
> > The above selftests add a RAM region close to max_gfn, if host has 52
> > physical bits but doesn't support 5-level EPT, these will trigger infinite
> > EPT violation when access the RAM region.
> >
> > Since current Intel CPUID doesn't report max guest physical bits like AMD,
> > introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical bits when tdp is
> > enabled and report the max guest physical bits which is smaller than host.
> >
> > When guest physical bits is smaller than host, some GPA are illegal from
> > guest's perspective, but are still legal from hardware's perspective,
> > which should be trapped to inject #PF. Current KVM already has a parameter
> > allow_smaller_maxphyaddr to support the case when guest.MAXPHYADDR <
> > host.MAXPHYADDR, which is disabled by default when EPT is enabled, user
> > can enable it when loading kvm-intel module. When allow_smaller_maxphyaddr
> > is enabled and guest accesses an illegal address from guest's perspective,
> > KVM will utilize EPT violation and emulate the instruction to inject #PF
> > and determine #PF error code.
>
> There is far too much unnecessary cruft in this changelog. The entire last
> paragraph is extraneous information. Talking in detail about KVM's (flawed)
> emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
> it's correct for KVM to advertise an impossible configuration.
>
> And this is exactly why I dislike the "explain the symptom, then the solution"
> style for KVM. The symptoms described above don't actually explain why *KVM* is
> at fault.
>
> KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
>
> Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
> 48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
> CPU during a page table walk if and only if 5-level TDP is enabled. I.e.
> it's impossible for KVM to actually map GPAs with bits 51:49 set, which
> results in vCPUs getting stuck on endless EPT violations.
>
> From Intel's SDM:
>
> 4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
> walk length of 4) to translate a guest-physical address and uses only
> bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
> access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
> and uses guest-physical address bits 56:0. [Physical addresses and
> guest-physical addresses are architecturally limited to 52 bits (e.g.,
> by paging), so in practice bits 56:52 are zero.]
>
> While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
> not 5-level EPT, the combination is architecturally legal and such CPUs
> do exist (and can easily be "created" with nested virtualization).
>
> Note, because EPT capabilities are reported via MSRs, it's impossible
> for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
> is 100% KVM's responsibility.
Yes, I learned a lot. I will use the above shortlog and changelog in my
next version.
>
> > Reported-by: Yi Lai <yi1.lai@intel.com>
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > Tested-by: Xudong Hao <xudong.hao@intel.com>
> > ---
> > arch/x86/kvm/cpuid.c | 5 +++--
> > arch/x86/kvm/mmu.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 7 +++++++
> > 3 files changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index dda6fc4cfae8..91933ca739ad 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> > *
> > * If TDP is enabled but an explicit guest MAXPHYADDR is not
> > * provided, use the raw bare metal MAXPHYADDR as reductions to
> > - * the HPAs do not affect GPAs.
> > + * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> > + * doesn't exceed the bits that TDP can translate.
> > */
> > if (!tdp_enabled)
> > g_phys_as = boot_cpu_data.x86_phys_bits;
> > else if (!g_phys_as)
> > - g_phys_as = phys_as;
> > + g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
>
> I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
> advertises guest.MAXPHYADDR. Advertising a bad guest.MAXPHYADDR is arguably a
> blatant CPU bug, but being paranoid is cheap in this case.
Yes, agree.
>
> > entry->eax = g_phys_as | (virt_as << 8);
> > entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index bb8c86eefac0..1c7d649fcf6b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > u64 fault_address, char *insn, int insn_len);
> > void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu);
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void);
> >
> > int kvm_mmu_load(struct kvm_vcpu *vcpu);
> > void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c57e181bba21..72634d6b61b2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > reset_guest_paging_metadata(vcpu, mmu);
> > }
> >
> > +/* guest-physical-address bits limited by TDP */
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > +{
> > + return max_tdp_level == 5 ? 57 : 48;
>
> Using "57" is kinda sorta wrong, e.g. the SDM says:
>
> Bits 56:52 of each guest-physical address are necessarily zero because
> guest-physical addresses are architecturally limited to 52 bits.
>
> Rather than split hairs over something that doesn't matter, I think it makes sense
> for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> is still accurate when tdp_root_level is non-zero).
>
> That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
> *host* MAXPHYADDR.
>
> Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
> used by the CPUID code, so it's not the end of the world.
Yes, agree.
>
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
>
> This shouldn't be exported, I don't see any reason for vendor code to need access
> to this helper. It's essentially a moot point though if we avoid the helper in
> the first place.
Yes, previously I wanted to invoke this helper in patch 2; it is no
longer needed.
>
> All in all, this?
The code is perfect; I will add it to my next version, thanks!
>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/cpuid.c | 16 ++++++++++++----
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7bc1daf68741..29b575b86912 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> bool mask);
>
> extern bool tdp_enabled;
> +extern int max_tdp_level;
>
> u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..637a1f388a51 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> *
> * If TDP is enabled but an explicit guest MAXPHYADDR is not
> * provided, use the raw bare metal MAXPHYADDR as reductions to
> - * the HPAs do not affect GPAs.
> + * the HPAs do not affect GPAs. Finally, if TDP is enabled and
> + * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
> + * as bits 51:49 are used by the CPU if and only if 5-level TDP
Nit: should it be bits 51:48?
Thanks,
Tao
> + * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
> */
> - if (!tdp_enabled)
> + if (!tdp_enabled) {
> g_phys_as = boot_cpu_data.x86_phys_bits;
> - else if (!g_phys_as)
> - g_phys_as = phys_as;
> + } else {
> + if (!g_phys_as)
> + g_phys_as = phys_as;
> +
> + if (max_tdp_level < 5)
> + g_phys_as = min(g_phys_as, 48);
> + }
>
> entry->eax = g_phys_as | (virt_as << 8);
> entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..5036c7eb7dac 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
>
> static int max_huge_page_level __read_mostly;
> static int tdp_root_level __read_mostly;
> -static int max_tdp_level __read_mostly;
> +int max_tdp_level __read_mostly;
>
> #define PTE_PREFETCH_NUM 8
>
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-20 16:28 ` Sean Christopherson
2023-12-21 7:45 ` Tao Su
@ 2023-12-21 8:19 ` Xu Yilun
2024-01-02 23:24 ` Sean Christopherson
1 sibling, 1 reply; 34+ messages in thread
From: Xu Yilun @ 2023-12-21 8:19 UTC (permalink / raw)
To: Sean Christopherson
Cc: Tao Su, kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> As evidenced by my initial response, the shortlog is a bit misleading. In non-KVM
> code it would be perfectly ok to say "limit", but in KVM's split world where
> *userspace* is mostly responsible for the guest configuration, "limit guest ..."
> is often going to be read as "limit the capabilities of the guest".
>
> This is also a good example of why shortlogs and changelogs should NOT be play-by-play
> descriptions of the code change. The literal code change applies a limit to
> guest physical bits, but that doesn't capture the net effect of applying said limit.
>
> Something like
>
> KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
>
> better captures that the patch affects what KVM advertises to userspace. Yeah,
> it's technically inaccurate to say "52-bit", but I think 52-bit MAXPHYADDR is
> ubiquitous enough that it's worth being technically wrong in order to clearly
> communicate the net effect. Alternatively, it could be something like:
>
> KVM: x86: Don't advertise incompatible MAXPHYADDR if 5-level TDP is unsupported
>
> On Mon, Dec 18, 2023, Tao Su wrote:
> > When the host doesn't support 5-level EPT, bits 51:48 of the guest physical
> > address must all be zero, otherwise an EPT violation always occurs and
> > the current handler can't resolve it if the GPA is in a RAM region. Hence,
> > the instruction will keep being re-executed, which causes infinite
> > EPT violations.
> >
> > Six KVM selftests time out due to this issue:
> > kvm:access_tracking_perf_test
> > kvm:demand_paging_test
> > kvm:dirty_log_test
> > kvm:dirty_log_perf_test
> > kvm:kvm_page_table_test
> > kvm:memslot_modification_stress_test
> >
> > The above selftests add a RAM region close to max_gfn; if the host has 52
> > physical bits but doesn't support 5-level EPT, they trigger infinite
> > EPT violations when accessing that RAM region.
> >
> > Since current Intel CPUID doesn't report the max guest physical bits the
> > way AMD does, introduce kvm_mmu_tdp_maxphyaddr() to limit guest physical
> > bits when TDP is enabled and report a max guest physical bits value that
> > is smaller than the host's.
> >
> > When the guest physical bits value is smaller than the host's, some GPAs
> > are illegal from the guest's perspective but still legal from the
> > hardware's perspective, and such accesses should be trapped to inject #PF.
> > KVM already has a parameter, allow_smaller_maxphyaddr, to support the case
> > where guest.MAXPHYADDR < host.MAXPHYADDR; it is disabled by default when
> > EPT is enabled, and users can enable it when loading the kvm-intel module.
> > When allow_smaller_maxphyaddr is enabled and the guest accesses an address
> > that is illegal from its perspective, KVM will utilize the EPT violation
> > and emulate the instruction to inject #PF and determine the #PF error code.
>
> There is far too much unnecessary cruft in this changelog. The entire last
> paragraph is extraneous information. Talking in detail about KVM's (flawed)
> emulation of smaller guest.MAXPHYADDR isn't at all relevant as to whether or not
> it's correct for KVM to advertise an impossible configuration.
>
> And this is exactly why I dislike the "explain the symptom, then the solution"
> style for KVM. The symptoms described above don't actually explain why *KVM* is
> at fault.
>
> KVM: x86: Don't advertise 52-bit MAXPHYADDR if 5-level TDP is unsupported
>
> Cap the effective guest.MAXPHYADDR that KVM advertises to userspace at
> 48 bits if 5-level TDP isn't supported, as bits 51:49 are consumed by the
> CPU during a page table walk if and only if 5-level TDP is enabled. I.e.
> it's impossible for KVM to actually map GPAs with bits 51:49 set, which
> results in vCPUs getting stuck on endless EPT violations.
>
> From Intel's SDM:
>
> 4-level EPT accesses at most 4 EPT paging-structure entries (an EPT page-
> walk length of 4) to translate a guest-physical address and uses only
> bits 47:0 of each guest-physical address. In contrast, 5-level EPT may
> access up to 5 EPT paging-structure entries (an EPT page-walk length of 5)
> and uses guest-physical address bits 56:0. [Physical addresses and
> guest-physical addresses are architecturally limited to 52 bits (e.g.,
> by paging), so in practice bits 56:52 are zero.]
>
> While it's decidedly odd for a CPU to support a 52-bit MAXPHYADDR but
> not 5-level EPT, the combination is architecturally legal and such CPUs
> do exist (and can easily be "created" with nested virtualization).
>
> Note, because EPT capabilities are reported via MSRs, it's impossible
> for userspace to avoid the funky setup, i.e. advertising a sane MAXPHYADDR
> is 100% KVM's responsibility.
>
> > Reported-by: Yi Lai <yi1.lai@intel.com>
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > Tested-by: Xudong Hao <xudong.hao@intel.com>
> > ---
> > arch/x86/kvm/cpuid.c | 5 +++--
> > arch/x86/kvm/mmu.h | 1 +
> > arch/x86/kvm/mmu/mmu.c | 7 +++++++
> > 3 files changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > index dda6fc4cfae8..91933ca739ad 100644
> > --- a/arch/x86/kvm/cpuid.c
> > +++ b/arch/x86/kvm/cpuid.c
> > @@ -1212,12 +1212,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> > *
> > * If TDP is enabled but an explicit guest MAXPHYADDR is not
> > * provided, use the raw bare metal MAXPHYADDR as reductions to
> > - * the HPAs do not affect GPAs.
> > + * the HPAs do not affect GPAs, but ensure guest MAXPHYADDR
> > + * doesn't exceed the bits that TDP can translate.
> > */
> > if (!tdp_enabled)
> > g_phys_as = boot_cpu_data.x86_phys_bits;
> > else if (!g_phys_as)
> > - g_phys_as = phys_as;
> > + g_phys_as = min(phys_as, kvm_mmu_tdp_maxphyaddr());
>
> I think KVM should be paranoid and cap the advertised MAXPHYADDR even if the CPU
> advertises guest.MAXPHYADDR. Advertising a bad guest.MAXPHYADDR is arguably a
> blatant CPU bug, but being paranoid is cheap in this case.
>
> > entry->eax = g_phys_as | (virt_as << 8);
> > entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index bb8c86eefac0..1c7d649fcf6b 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -115,6 +115,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> > u64 fault_address, char *insn, int insn_len);
> > void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu);
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void);
> >
> > int kvm_mmu_load(struct kvm_vcpu *vcpu);
> > void kvm_mmu_unload(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c57e181bba21..72634d6b61b2 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > reset_guest_paging_metadata(vcpu, mmu);
> > }
> >
> > +/* guest-physical-address bits limited by TDP */
> > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > +{
> > + return max_tdp_level == 5 ? 57 : 48;
>
> Using "57" is kinda sorta wrong, e.g. the SDM says:
>
> Bits 56:52 of each guest-physical address are necessarily zero because
> guest-physical addresses are architecturally limited to 52 bits.
>
> Rather than split hairs over something that doesn't matter, I think it makes sense
> for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> is still accurate when tdp_root_level is non-zero).
It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
max_tdp_level:
kvm_configure_mmu(npt_enabled, get_npt_level(),
get_npt_level(), PG_LEVEL_1G);
But I want to double-confirm whether directly using max_tdp_level has been
fully considered. In your last proposal, it is:
u8 kvm_mmu_get_max_tdp_level(void)
{
return tdp_root_level ? tdp_root_level : max_tdp_level;
}
and I think it makes more sense, because EPT setup follows the same
rule. If any future architecture sets tdp_root_level smaller than
max_tdp_level, the issue will happen again.
Thanks,
Yilun
>
> That also avoids confusion with kvm_mmu_max_gfn(), which deliberately uses the
> *host* MAXPHYADDR.
>
> Making max_tdp_level visible isn't ideal, but tdp_enabled is already visible and
> used by the CPUID code, so it's not the end of the world.
>
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_mmu_tdp_maxphyaddr);
>
> This shouldn't be exported, I don't see any reason for vendor code to need access
> to this helper. It's essentially a moot point though if we avoid the helper in
> the first place.
>
> All in all, this?
>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/cpuid.c | 16 ++++++++++++----
> arch/x86/kvm/mmu/mmu.c | 2 +-
> 3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7bc1daf68741..29b575b86912 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1932,6 +1932,7 @@ void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
> bool mask);
>
> extern bool tdp_enabled;
> +extern int max_tdp_level;
>
> u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 294e5bd5f8a0..637a1f388a51 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1233,12 +1233,20 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
> *
> * If TDP is enabled but an explicit guest MAXPHYADDR is not
> * provided, use the raw bare metal MAXPHYADDR as reductions to
> - * the HPAs do not affect GPAs.
> + * the HPAs do not affect GPAs. Finally, if TDP is enabled and
> + * doesn't support 5-level TDP, cap guest MAXPHYADDR at 48 bits
> + * as bits 51:49 are used by the CPU if and only if 5-level TDP
> + * is enabled, i.e. KVM can't map such GPAs with 4-level TDP.
> */
> - if (!tdp_enabled)
> + if (!tdp_enabled) {
> g_phys_as = boot_cpu_data.x86_phys_bits;
> - else if (!g_phys_as)
> - g_phys_as = phys_as;
> + } else {
> + if (!g_phys_as)
> + g_phys_as = phys_as;
> +
> + if (max_tdp_level < 5)
> + g_phys_as = min(g_phys_as, 48);
> + }
>
> entry->eax = g_phys_as | (virt_as << 8);
> entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3c844e428684..5036c7eb7dac 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -114,7 +114,7 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
>
> static int max_huge_page_level __read_mostly;
> static int tdp_root_level __read_mostly;
> -static int max_tdp_level __read_mostly;
> +int max_tdp_level __read_mostly;
>
> #define PTE_PREFETCH_NUM 8
>
>
> base-commit: f2a3fb7234e52f72ff4a38364dbf639cf4c7d6c6
> --
>
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2023-12-21 8:19 ` Xu Yilun
@ 2024-01-02 23:24 ` Sean Christopherson
2024-01-03 0:34 ` Jim Mattson
0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2024-01-02 23:24 UTC (permalink / raw)
To: Xu Yilun
Cc: Tao Su, kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Thu, Dec 21, 2023, Xu Yilun wrote:
> On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index c57e181bba21..72634d6b61b2 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > reset_guest_paging_metadata(vcpu, mmu);
> > > }
> > >
> > > +/* guest-physical-address bits limited by TDP */
> > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > +{
> > > + return max_tdp_level == 5 ? 57 : 48;
> >
> > Using "57" is kinda sorta wrong, e.g. the SDM says:
> >
> > Bits 56:52 of each guest-physical address are necessarily zero because
> > guest-physical addresses are architecturally limited to 52 bits.
> >
> > Rather than split hairs over something that doesn't matter, I think it makes sense
> > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > is still accurate when tdp_root_level is non-zero).
>
> It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> max_tdp_level:
>
> kvm_configure_mmu(npt_enabled, get_npt_level(),
> get_npt_level(), PG_LEVEL_1G);
>
> But I want to double-confirm whether directly using max_tdp_level has been
> fully considered. In your last proposal, it is:
>
> u8 kvm_mmu_get_max_tdp_level(void)
> {
> return tdp_root_level ? tdp_root_level : max_tdp_level;
> }
>
> and I think it makes more sense, because EPT setup follows the same
> rule. If any future architecture sets tdp_root_level smaller than
> max_tdp_level, the issue will happen again.
Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
really means "max possible TDP level KVM can use". If an exact TDP level is being
forced by tdp_root_level, then by definition it's also the max TDP level, because
it's the _only_ TDP level KVM supports.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-02 23:24 ` Sean Christopherson
@ 2024-01-03 0:34 ` Jim Mattson
2024-01-03 18:04 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: Jim Mattson @ 2024-01-03 0:34 UTC (permalink / raw)
To: Sean Christopherson
Cc: Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Dec 21, 2023, Xu Yilun wrote:
> > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index c57e181bba21..72634d6b61b2 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > > reset_guest_paging_metadata(vcpu, mmu);
> > > > }
> > > >
> > > > +/* guest-physical-address bits limited by TDP */
> > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > > +{
> > > > + return max_tdp_level == 5 ? 57 : 48;
> > >
> > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > >
> > > Bits 56:52 of each guest-physical address are necessarily zero because
> > > guest-physical addresses are architecturally limited to 52 bits.
> > >
> > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > is still accurate when tdp_root_level is non-zero).
> >
> > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > max_tdp_level:
> >
> > kvm_configure_mmu(npt_enabled, get_npt_level(),
> > get_npt_level(), PG_LEVEL_1G);
> >
> > But I want to double-confirm whether directly using max_tdp_level has been
> > fully considered. In your last proposal, it is:
> >
> > u8 kvm_mmu_get_max_tdp_level(void)
> > {
> > return tdp_root_level ? tdp_root_level : max_tdp_level;
> > }
> >
> > and I think it makes more sense, because EPT setup follows the same
> > rule. If any future architecture sets tdp_root_level smaller than
> > max_tdp_level, the issue will happen again.
>
> Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> really means "max possible TDP level KVM can use". If an exact TDP level is being
> forced by tdp_root_level, then by definition it's also the max TDP level, because
> it's the _only_ TDP level KVM supports.
This is all just so broken and wrong. The only guest.MAXPHYADDR that
can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
every #PF, and to emulate the faulting instruction to see if the RSVD
bit should be set in the error code. Hardware isn't going to do it.
Since some page faults may occur in CPL3, this means that KVM has to
be prepared to emulate any memory-accessing instruction. That's not
practical.
Basically, a CPU with more than 48 bits of physical address that
doesn't support 5-level EPT really doesn't support EPT at all, except
perhaps in the context of some new paravirtual pinky-swear from the
guest that it doesn't care about the RSVD bit in #PF error codes.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-03 0:34 ` Jim Mattson
@ 2024-01-03 18:04 ` Sean Christopherson
2024-01-04 2:45 ` Chao Gao
0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2024-01-03 18:04 UTC (permalink / raw)
To: Jim Mattson
Cc: Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Tue, Jan 02, 2024, Jim Mattson wrote:
> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index c57e181bba21..72634d6b61b2 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > > > reset_guest_paging_metadata(vcpu, mmu);
> > > > > }
> > > > >
> > > > > +/* guest-physical-address bits limited by TDP */
> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > > > +{
> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > > >
> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > > >
> > > > Bits 56:52 of each guest-physical address are necessarily zero because
> > > > guest-physical addresses are architecturally limited to 52 bits.
> > > >
> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > > is still accurate when tdp_root_level is non-zero).
> > >
> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > > max_tdp_level:
> > >
> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
> > > get_npt_level(), PG_LEVEL_1G);
> > >
> > > But I want to double-confirm whether directly using max_tdp_level has been
> > > fully considered. In your last proposal, it is:
> > >
> > > u8 kvm_mmu_get_max_tdp_level(void)
> > > {
> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
> > > }
> > >
> > > and I think it makes more sense, because EPT setup follows the same
> > > rule. If any future architecture sets tdp_root_level smaller than
> > > max_tdp_level, the issue will happen again.
> >
> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> > really means "max possible TDP level KVM can use". If an exact TDP level is being
> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > it's the _only_ TDP level KVM supports.
>
> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> every #PF, and to emulate the faulting instruction to see if the RSVD
> bit should be set in the error code. Hardware isn't going to do it.
> Since some page faults may occur in CPL3, this means that KVM has to
> be prepared to emulate any memory-accessing instruction. That's not
> practical.
>
> Basically, a CPU with more than 48 bits of physical address that
> doesn't support 5-level EPT really doesn't support EPT at all, except
> perhaps in the context of some new paravirtual pinky-swear from the
> guest that it doesn't care about the RSVD bit in #PF error codes.
Doh, I managed to forget about the RSVD #PF mess. That said, this patch will
"work" if userspace enables allow_smaller_maxphyaddr. In quotes because I'm still
skeptical that allow_smaller_maxphyaddr actually works in all scenarios. And we'd
need a way to communicate all of that to userspace. Blech.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-03 18:04 ` Sean Christopherson
@ 2024-01-04 2:45 ` Chao Gao
2024-01-04 3:40 ` Jim Mattson
0 siblings, 1 reply; 34+ messages in thread
From: Chao Gao @ 2024-01-04 2:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jim Mattson, Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
>On Tue, Jan 02, 2024, Jim Mattson wrote:
>> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
>> >
>> > On Thu, Dec 21, 2023, Xu Yilun wrote:
>> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
>> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> > > > > index c57e181bba21..72634d6b61b2 100644
>> > > > > --- a/arch/x86/kvm/mmu/mmu.c
>> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
>> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>> > > > > reset_guest_paging_metadata(vcpu, mmu);
>> > > > > }
>> > > > >
>> > > > > +/* guest-physical-address bits limited by TDP */
>> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
>> > > > > +{
>> > > > > + return max_tdp_level == 5 ? 57 : 48;
>> > > >
>> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
>> > > >
>> > > > Bits 56:52 of each guest-physical address are necessarily zero because
>> > > > guest-physical addresses are architecturally limited to 52 bits.
>> > > >
>> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
>> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
>> > > > is still accurate when tdp_root_level is non-zero).
>> > >
>> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
>> > > max_tdp_level:
>> > >
>> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
>> > > get_npt_level(), PG_LEVEL_1G);
>> > >
> >> > > But I want to double-confirm whether directly using max_tdp_level has
> >> > > been fully considered. In your last proposal, it is:
>> > >
>> > > u8 kvm_mmu_get_max_tdp_level(void)
>> > > {
>> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
>> > > }
>> > >
> >> > > and I think it makes more sense, because EPT setup follows the same
> >> > > rule. If any future architecture sets tdp_root_level smaller than
> >> > > max_tdp_level, the issue will happen again.
>> >
>> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
>> > really means "max possible TDP level KVM can use". If an exact TDP level is being
>> > forced by tdp_root_level, then by definition it's also the max TDP level, because
>> > it's the _only_ TDP level KVM supports.
>>
>> This is all just so broken and wrong. The only guest.MAXPHYADDR that
>> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
>> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
>> every #PF,
in this case (i.e., to support a 48-bit guest.MAXPHYADDR when the CPU supports
only 4-level EPT), KVM has no need to intercept #PF because accessing a GPA
with RSVD bits 51:48 set leads to an EPT violation.
>> and to emulate the faulting instruction to see if the RSVD
>> bit should be set in the error code. Hardware isn't going to do it.
Note that for EPT violation VM exits, the CPU stores the GPA that caused the
exit in the "guest-physical address" field of the VMCS. So, it is not
necessary to emulate the faulting instruction to determine whether any RSVD
bit is set.
>> Since some page faults may occur in CPL3, this means that KVM has to
>> be prepared to emulate any memory-accessing instruction. That's not
>> practical.
As said above, there is no need to intercept #PF for this specific case.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 2:45 ` Chao Gao
@ 2024-01-04 3:40 ` Jim Mattson
2024-01-04 4:34 ` Jim Mattson
2024-01-04 15:07 ` Chao Gao
0 siblings, 2 replies; 34+ messages in thread
From: Jim Mattson @ 2024-01-04 3:40 UTC (permalink / raw)
To: Chao Gao
Cc: Sean Christopherson, Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
>
> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >> >
> >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> > > > > index c57e181bba21..72634d6b61b2 100644
> >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >> > > > > reset_guest_paging_metadata(vcpu, mmu);
> >> > > > > }
> >> > > > >
> >> > > > > +/* guest-physical-address bits limited by TDP */
> >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> >> > > > > +{
> >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> >> > > >
> >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> >> > > >
> >> > > > Bits 56:52 of each guest-physical address are necessarily zero because
> >> > > > guest-physical addresses are architecturally limited to 52 bits.
> >> > > >
> >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> >> > > > is still accurate when tdp_root_level is non-zero).
> >> > >
> >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> >> > > max_tdp_level:
> >> > >
> >> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
> >> > > get_npt_level(), PG_LEVEL_1G);
> >> > >
> > >> > > But I want to double-confirm whether directly using max_tdp_level
> > >> > > has been fully considered. In your last proposal, it is:
> >> > >
> >> > > u8 kvm_mmu_get_max_tdp_level(void)
> >> > > {
> >> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
> >> > > }
> >> > >
> > >> > > and I think it makes more sense, because EPT setup follows the same
> > >> > > rule. If any future architecture sets tdp_root_level smaller than
> > >> > > max_tdp_level, the issue will happen again.
> >> >
> >> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> >> > really means "max possible TDP level KVM can use". If an exact TDP level is being
> >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> >> > it's the _only_ TDP level KVM supports.
> >>
> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> >> every #PF,
>
> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> RSVD bits 51-48 set leads to EPT violation.
At the completion of the page table walk, if there is a permission
fault, the data address should not be accessed, so there should not be
an EPT violation. Remember Meltdown?
> >> and to emulate the faulting instruction to see if the RSVD
> >> bit should be set in the error code. Hardware isn't going to do it.
>
> Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> the faulting instruction to determine if any RSVD bit is set.
There should not be an EPT violation in the case discussed.
> >> Since some page faults may occur in CPL3, this means that KVM has to
> >> be prepared to emulate any memory-accessing instruction. That's not
> >> practical.
>
> as said above, no need to intercept #PF for this specific case.
I disagree. See above.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 3:40 ` Jim Mattson
@ 2024-01-04 4:34 ` Jim Mattson
2024-01-04 11:56 ` Tao Su
2024-01-04 15:07 ` Chao Gao
1 sibling, 1 reply; 34+ messages in thread
From: Jim Mattson @ 2024-01-04 4:34 UTC (permalink / raw)
To: Chao Gao
Cc: Sean Christopherson, Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Wed, Jan 3, 2024 at 7:40 PM Jim Mattson <jmattson@google.com> wrote:
>
> On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> >
> > On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > >> >
> > >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > >> > > > > index c57e181bba21..72634d6b61b2 100644
> > >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > >> > > > > reset_guest_paging_metadata(vcpu, mmu);
> > >> > > > > }
> > >> > > > >
> > >> > > > > +/* guest-physical-address bits limited by TDP */
> > >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > >> > > > > +{
> > >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > >> > > >
> > >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > >> > > >
> > >> > > > Bits 56:52 of each guest-physical address are necessarily zero because
> > >> > > > guest-physical addresses are architecturally limited to 52 bits.
> > >> > > >
> > >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > >> > > > is still accurate when tdp_root_level is non-zero).
> > >> > >
> > >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > >> > > max_tdp_level:
> > >> > >
> > >> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
> > >> > > get_npt_level(), PG_LEVEL_1G);
> > >> > >
> > >> > > But I wanna double confirm if directly using max_tdp_level is fully
> > >> > > considered. In your last proposal, it is:
> > >> > >
> > >> > > u8 kvm_mmu_get_max_tdp_level(void)
> > >> > > {
> > >> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
> > >> > > }
> > >> > >
> > >> > > and I think it makes more sense, because EPT setup follows the same
> > >> > > rule. If any future architechture sets tdp_root_level smaller than
> > >> > > max_tdp_level, the issue will happen again.
> > >> >
> > >> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> > >> > really means "max possible TDP level KVM can use". If an exact TDP level is being
> > >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > >> > it's the _only_ TDP level KVM supports.
> > >>
> > >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > >> every #PF,
> >
> > in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > RSVD bits 51-48 set leads to EPT violation.
>
> At the completion of the page table walk, if there is a permission
> fault, the data address should not be accessed, so there should not be
> an EPT violation. Remember Meltdown?
>
> > >> and to emulate the faulting instruction to see if the RSVD
> > >> bit should be set in the error code. Hardware isn't going to do it.
> >
> > Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> > in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> > the faulting instruction to determine if any RSVD bit is set.
>
> There should not be an EPT violation in the case discussed.
For intercepted #PF, we can use CR2 to determine the necessary page
walk, and presumably the rest of the bits in the error code are
already set, so emulation is not necessary.
However, emulation is necessary when synthesizing a #PF from an EPT
violation, and bit 8 of the exit qualification is clear. See
https://lore.kernel.org/kvm/4463f391-0a25-017e-f913-69c297e13c5e@redhat.com/.
> > >> Since some page faults may occur in CPL3, this means that KVM has to
> > >> be prepared to emulate any memory-accessing instruction. That's not
> > >> practical.
> >
> > as said above, no need to intercept #PF for this specific case.
>
> I disagree. See above.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 4:34 ` Jim Mattson
@ 2024-01-04 11:56 ` Tao Su
2024-01-04 14:03 ` Jim Mattson
0 siblings, 1 reply; 34+ messages in thread
From: Tao Su @ 2024-01-04 11:56 UTC (permalink / raw)
To: Jim Mattson
Cc: Chao Gao, Sean Christopherson, Xu Yilun, kvm, pbonzini,
eddie.dong, xiaoyao.li, yuan.yao, yi1.lai, xudong.hao,
chao.p.peng
On Wed, Jan 03, 2024 at 08:34:16PM -0800, Jim Mattson wrote:
> On Wed, Jan 3, 2024 at 7:40 PM Jim Mattson <jmattson@google.com> wrote:
> >
> > On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > >
> > > On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > > >> >
> > > >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > > >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > >> > > > > index c57e181bba21..72634d6b61b2 100644
> > > >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > >> > > > > reset_guest_paging_metadata(vcpu, mmu);
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > +/* guest-physical-address bits limited by TDP */
> > > >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > >> > > > > +{
> > > >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > > >> > > >
> > > >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > > >> > > >
> > > >> > > > Bits 56:52 of each guest-physical address are necessarily zero because
> > > >> > > > guest-physical addresses are architecturally limited to 52 bits.
> > > >> > > >
> > > >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > >> > > > is still accurate when tdp_root_level is non-zero).
> > > >> > >
> > > >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > > >> > > max_tdp_level:
> > > >> > >
> > > >> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
> > > >> > > get_npt_level(), PG_LEVEL_1G);
> > > >> > >
> > > >> > > But I wanna double confirm if directly using max_tdp_level is fully
> > > >> > > considered. In your last proposal, it is:
> > > >> > >
> > > >> > > u8 kvm_mmu_get_max_tdp_level(void)
> > > >> > > {
> > > >> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
> > > >> > > }
> > > >> > >
> > > >> > > and I think it makes more sense, because EPT setup follows the same
> > > >> > > rule. If any future architechture sets tdp_root_level smaller than
> > > >> > > max_tdp_level, the issue will happen again.
> > > >> >
> > > >> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> > > >> > really means "max possible TDP level KVM can use". If an exact TDP level is being
> > > >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > > >> > it's the _only_ TDP level KVM supports.
> > > >>
> > > >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > >> every #PF,
> > >
> > > in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > RSVD bits 51-48 set leads to EPT violation.
> >
> > At the completion of the page table walk, if there is a permission
> > fault, the data address should not be accessed, so there should not be
> > an EPT violation. Remember Meltdown?
> >
> > > >> and to emulate the faulting instruction to see if the RSVD
> > > >> bit should be set in the error code. Hardware isn't going to do it.
> > >
> > > Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> > > in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> > > the faulting instruction to determine if any RSVD bit is set.
> >
> > There should not be an EPT violation in the case discussed.
>
> For intercepted #PF, we can use CR2 to determine the necessary page
> walk, and presumably the rest of the bits in the error code are
> already set, so emulation is not necessary.
>
> However, emulation is necessary when synthesizing a #PF from an EPT
> violation, and bit 8 of the exit qualification is clear. See
> https://lore.kernel.org/kvm/4463f391-0a25-017e-f913-69c297e13c5e@redhat.com/.
Although not all memory-accessing instructions can be emulated, emulation covers the
most common cases and is always better than a KVM hang. We can probably continue to
improve allow_smaller_maxphyaddr, but KVM should report the maximum physical width
it supports.
Thanks,
Tao
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 11:56 ` Tao Su
@ 2024-01-04 14:03 ` Jim Mattson
0 siblings, 0 replies; 34+ messages in thread
From: Jim Mattson @ 2024-01-04 14:03 UTC (permalink / raw)
To: Tao Su
Cc: Chao Gao, Sean Christopherson, Xu Yilun, kvm, pbonzini,
eddie.dong, xiaoyao.li, yuan.yao, yi1.lai, xudong.hao,
chao.p.peng
On Thu, Jan 4, 2024 at 3:59 AM Tao Su <tao1.su@linux.intel.com> wrote:
>
> On Wed, Jan 03, 2024 at 08:34:16PM -0800, Jim Mattson wrote:
> > On Wed, Jan 3, 2024 at 7:40 PM Jim Mattson <jmattson@google.com> wrote:
> > >
> > > On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > > >
> > > > On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > > >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > > >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > >> >
> > > > >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> > > > >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> > > > >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > >> > > > > index c57e181bba21..72634d6b61b2 100644
> > > > >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > > > >> > > > > reset_guest_paging_metadata(vcpu, mmu);
> > > > >> > > > > }
> > > > >> > > > >
> > > > >> > > > > +/* guest-physical-address bits limited by TDP */
> > > > >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> > > > >> > > > > +{
> > > > >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> > > > >> > > >
> > > > >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> > > > >> > > >
> > > > >> > > > Bits 56:52 of each guest-physical address are necessarily zero because
> > > > >> > > > guest-physical addresses are architecturally limited to 52 bits.
> > > > >> > > >
> > > > >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> > > > >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> > > > >> > > > is still accurate when tdp_root_level is non-zero).
> > > > >> > >
> > > > >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> > > > >> > > max_tdp_level:
> > > > >> > >
> > > > >> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
> > > > >> > > get_npt_level(), PG_LEVEL_1G);
> > > > >> > >
> > > > >> > > But I wanna double confirm if directly using max_tdp_level is fully
> > > > >> > > considered. In your last proposal, it is:
> > > > >> > >
> > > > >> > > u8 kvm_mmu_get_max_tdp_level(void)
> > > > >> > > {
> > > > >> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
> > > > >> > > }
> > > > >> > >
> > > > >> > > and I think it makes more sense, because EPT setup follows the same
> > > > >> > > rule. If any future architechture sets tdp_root_level smaller than
> > > > >> > > max_tdp_level, the issue will happen again.
> > > > >> >
> > > > >> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> > > > >> > really means "max possible TDP level KVM can use". If an exact TDP level is being
> > > > >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> > > > >> > it's the _only_ TDP level KVM supports.
> > > > >>
> > > > >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > > >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > > >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > > >> every #PF,
> > > >
> > > > in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > > 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > > RSVD bits 51-48 set leads to EPT violation.
> > >
> > > At the completion of the page table walk, if there is a permission
> > > fault, the data address should not be accessed, so there should not be
> > > an EPT violation. Remember Meltdown?
> > >
> > > > >> and to emulate the faulting instruction to see if the RSVD
> > > > >> bit should be set in the error code. Hardware isn't going to do it.
> > > >
> > > > Note for EPT violation VM exits, the CPU stores the GPA that caused this exit
> > > > in "guest-physical address" field of VMCS. so, it is not necessary to emulate
> > > > the faulting instruction to determine if any RSVD bit is set.
> > >
> > > There should not be an EPT violation in the case discussed.
> >
> > For intercepted #PF, we can use CR2 to determine the necessary page
> > walk, and presumably the rest of the bits in the error code are
> > already set, so emulation is not necessary.
> >
> > However, emulation is necessary when synthesizing a #PF from an EPT
> > violation, and bit 8 of the exit qualification is clear. See
> > https://lore.kernel.org/kvm/4463f391-0a25-017e-f913-69c297e13c5e@redhat.com/.
>
> Although not all memory-accessing instructions are emulated, it covers most common
> cases and is always better than KVM hangs anyway. We may probably continue to
> improve allow_smaller_maxphyaddr, but KVM should report the maximum physical width
> it supports.
KVM can only support the host MAXPHYADDR. If EPT on the CPU doesn't
support host MAXPHYADDR, it should be disabled. Shadow paging can
handle host MAXPHYADDR just fine.
KVM simply does not work when guest MAXPHYADDR < host MAXPHYADDR.
Without additional hardware support, no hypervisor can. I asked Intel
to add hardware support for such configurations about 15 years ago. I
have yet to see it.
> Thanks,
> Tao
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 3:40 ` Jim Mattson
2024-01-04 4:34 ` Jim Mattson
@ 2024-01-04 15:07 ` Chao Gao
2024-01-04 17:02 ` Jim Mattson
1 sibling, 1 reply; 34+ messages in thread
From: Chao Gao @ 2024-01-04 15:07 UTC (permalink / raw)
To: Jim Mattson
Cc: Sean Christopherson, Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
>On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
>>
>> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
>> >On Tue, Jan 02, 2024, Jim Mattson wrote:
>> >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
>> >> >
>> >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
>> >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
>> >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> >> > > > > index c57e181bba21..72634d6b61b2 100644
>> >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
>> >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
>> >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>> >> > > > > reset_guest_paging_metadata(vcpu, mmu);
>> >> > > > > }
>> >> > > > >
>> >> > > > > +/* guest-physical-address bits limited by TDP */
>> >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
>> >> > > > > +{
>> >> > > > > + return max_tdp_level == 5 ? 57 : 48;
>> >> > > >
>> >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
>> >> > > >
>> >> > > > Bits 56:52 of each guest-physical address are necessarily zero because
>> >> > > > guest-physical addresses are architecturally limited to 52 bits.
>> >> > > >
>> >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
>> >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
>> >> > > > is still accurate when tdp_root_level is non-zero).
>> >> > >
>> >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
>> >> > > max_tdp_level:
>> >> > >
>> >> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
>> >> > > get_npt_level(), PG_LEVEL_1G);
>> >> > >
>> >> > > But I wanna double confirm if directly using max_tdp_level is fully
>> >> > > considered. In your last proposal, it is:
>> >> > >
>> >> > > u8 kvm_mmu_get_max_tdp_level(void)
>> >> > > {
>> >> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
>> >> > > }
>> >> > >
>> >> > > and I think it makes more sense, because EPT setup follows the same
>> >> > > rule. If any future architecture sets tdp_root_level smaller than
>> >> > > max_tdp_level, the issue will happen again.
>> >> >
>> >> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
>> >> > really means "max possible TDP level KVM can use". If an exact TDP level is being
>> >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
>> >> > it's the _only_ TDP level KVM supports.
>> >>
>> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
>> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
>> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
>> >> every #PF,
>>
>> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
>> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
>> RSVD bits 51-48 set leads to EPT violation.
>
>At the completion of the page table walk, if there is a permission
>fault, the data address should not be accessed, so there should not be
>an EPT violation. Remember Meltdown?
You are right. I missed this case. KVM needs to intercept #PF to set the RSVD bit
in the PFEC.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 15:07 ` Chao Gao
@ 2024-01-04 17:02 ` Jim Mattson
2024-01-05 20:26 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: Jim Mattson @ 2024-01-04 17:02 UTC (permalink / raw)
To: Chao Gao
Cc: Sean Christopherson, Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
>
> On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> >>
> >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> >> >> On Tue, Jan 2, 2024 at 3:24 PM Sean Christopherson <seanjc@google.com> wrote:
> >> >> >
> >> >> > On Thu, Dec 21, 2023, Xu Yilun wrote:
> >> >> > > On Wed, Dec 20, 2023 at 08:28:06AM -0800, Sean Christopherson wrote:
> >> >> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> >> > > > > index c57e181bba21..72634d6b61b2 100644
> >> >> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> >> >> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> >> >> > > > > @@ -5177,6 +5177,13 @@ void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >> >> > > > > reset_guest_paging_metadata(vcpu, mmu);
> >> >> > > > > }
> >> >> > > > >
> >> >> > > > > +/* guest-physical-address bits limited by TDP */
> >> >> > > > > +unsigned int kvm_mmu_tdp_maxphyaddr(void)
> >> >> > > > > +{
> >> >> > > > > + return max_tdp_level == 5 ? 57 : 48;
> >> >> > > >
> >> >> > > > Using "57" is kinda sorta wrong, e.g. the SDM says:
> >> >> > > >
> >> >> > > > Bits 56:52 of each guest-physical address are necessarily zero because
> >> >> > > > guest-physical addresses are architecturally limited to 52 bits.
> >> >> > > >
> >> >> > > > Rather than split hairs over something that doesn't matter, I think it makes sense
> >> >> > > > for the CPUID code to consume max_tdp_level directly (I forgot that max_tdp_level
> >> >> > > > is still accurate when tdp_root_level is non-zero).
> >> >> > >
> >> >> > > It is still accurate for now. Only AMD SVM sets tdp_root_level the same as
> >> >> > > max_tdp_level:
> >> >> > >
> >> >> > > kvm_configure_mmu(npt_enabled, get_npt_level(),
> >> >> > > get_npt_level(), PG_LEVEL_1G);
> >> >> > >
> >> >> > > But I wanna double confirm if directly using max_tdp_level is fully
> >> >> > > considered. In your last proposal, it is:
> >> >> > >
> >> >> > > u8 kvm_mmu_get_max_tdp_level(void)
> >> >> > > {
> >> >> > > return tdp_root_level ? tdp_root_level : max_tdp_level;
> >> >> > > }
> >> >> > >
> >> >> > > and I think it makes more sense, because EPT setup follows the same
> >> >> > > rule. If any future architecture sets tdp_root_level smaller than
> >> >> > > max_tdp_level, the issue will happen again.
> >> >> >
> >> >> > Setting tdp_root_level != max_tdp_level would be a blatant bug. max_tdp_level
> >> >> > really means "max possible TDP level KVM can use". If an exact TDP level is being
> >> >> > forced by tdp_root_level, then by definition it's also the max TDP level, because
> >> >> > it's the _only_ TDP level KVM supports.
> >> >>
> >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> >> >> every #PF,
> >>
> >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> >> RSVD bits 51-48 set leads to EPT violation.
> >
> >At the completion of the page table walk, if there is a permission
> >fault, the data address should not be accessed, so there should not be
> >an EPT violation. Remember Meltdown?
>
> You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> in PFEC.
I have no problem with a user deliberately choosing an unsupported
configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
returning an unsupported configuration.
guest MAXPHYADDR < host MAXPHYADDR has the following issues:
1. In PAE mode, MOV to CR3 will not raise #GP for guest-reserved bits
in PDPTRs that are not host-reserved.
2. #PF for permission violations will not set the RSVD bit in the
error code for guest-reserved bits in the final data address PFN that
are not host-reserved.
3. #PF for other PFNs with guest-reserved bits that are not
host-reserved may not accurately set the non-RSVD bits (e.g. U/S, R/W)
in the error code.
Fix these three issues, and I will happily withdraw my objection.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-04 17:02 ` Jim Mattson
@ 2024-01-05 20:26 ` Sean Christopherson
2024-01-08 13:45 ` Tao Su
0 siblings, 1 reply; 34+ messages in thread
From: Sean Christopherson @ 2024-01-05 20:26 UTC (permalink / raw)
To: Jim Mattson
Cc: Chao Gao, Xu Yilun, Tao Su, kvm, pbonzini, eddie.dong, xiaoyao.li,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Thu, Jan 04, 2024, Jim Mattson wrote:
> On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
> >
> > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > >>
> > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > >> >> every #PF,
> > >>
> > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > >> RSVD bits 51-48 set leads to EPT violation.
> > >
> > >At the completion of the page table walk, if there is a permission
> > >fault, the data address should not be accessed, so there should not be
> > >an EPT violation. Remember Meltdown?
> >
> > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> > in PFEC.
>
> I have no problem with a user deliberately choosing an unsupported
> configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
> returning an unsupported configuration.
+1
Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply
isn't viable when TDP is enabled. I suppose KVM could do so when allow_smaller_maxphyaddr
is enabled, but that's just asking for confusion, e.g. if userspace reflects the
CPUID back into the guest, it could unknowingly create a VM that depends on
allow_smaller_maxphyaddr.
I think the least awful option is to have the kernel expose whether or not the
CPU supports 5-level EPT to userspace. That doesn't even require new uAPI per se,
just a new flag in /proc/cpuinfo. It'll be a bit gross for userspace to parse,
but it's not the end of the world. Alternatively, KVM could add a capability to
enumerate the max *addressable* GPA, but userspace would still need to manually
take action when KVM can't address all of memory, i.e. a capability would be less
ugly, but wouldn't meaningfully change userspace's responsibilities.
I.e.
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index c6a7eed03914..266daf5b5b84 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -25,6 +25,7 @@
#define VMX_FEATURE_EPT_EXECUTE_ONLY ( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */
#define VMX_FEATURE_EPT_AD ( 0*32+ 18) /* EPT Accessed/Dirty bits */
#define VMX_FEATURE_EPT_1GB ( 0*32+ 19) /* 1GB EPT pages */
+#define VMX_FEATURE_EPT_5LEVEL ( 0*32+ 20) /* 5-level EPT paging */
/* Aggregated APIC features 24-27 */
#define VMX_FEATURE_FLEXPRIORITY ( 0*32+ 24) /* TPR shadow + virt APIC */
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index 03851240c3e3..1640ae76548f 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
if (ept & VMX_EPT_1GB_PAGE_BIT)
c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
+ if (ept & VMX_EPT_PAGE_WALK_5_BIT)
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
/* Synthetic APIC features that are aggregates of multiple features. */
if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-05 20:26 ` Sean Christopherson
@ 2024-01-08 13:45 ` Tao Su
2024-01-08 15:29 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: Tao Su @ 2024-01-08 13:45 UTC (permalink / raw)
To: Sean Christopherson
Cc: Jim Mattson, Chao Gao, Xu Yilun, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Fri, Jan 05, 2024 at 12:26:08PM -0800, Sean Christopherson wrote:
> On Thu, Jan 04, 2024, Jim Mattson wrote:
> > On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
> > >
> > > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> > > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > > >>
> > > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > >> >> every #PF,
> > > >>
> > > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > >> RSVD bits 51-48 set leads to EPT violation.
> > > >
> > > >At the completion of the page table walk, if there is a permission
> > > >fault, the data address should not be accessed, so there should not be
> > > >an EPT violation. Remember Meltdown?
> > >
> > > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> > > in PFEC.
> >
> > I have no problem with a user deliberately choosing an unsupported
> > configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
> > returning an unsupported configuration.
>
> +1
>
> Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply
> isn't viable when TDP is enabled. I suppose KVM could do so when allow_smaller_maxphyaddr
> is enabled, but that's just asking for confusion, e.g. if userspace reflects the
> CPUID back into the guest, it could unknowingly create a VM that depends on
> allow_smaller_maxphyaddr.
>
> I think the least awful option is to have the kernel expose whether or not the
> CPU supports 5-level EPT to userspace. That doesn't even require new uAPI per se,
> just a new flag in /proc/cpuinfo. It'll be a bit gross for userspace to parse,
> but it's not the end of the world. Alternatively, KVM could add a capability to
> enumerate the max *addressable* GPA, but userspace would still need to manually
> take action when KVM can't address all of memory, i.e. a capability would be less
> ugly, but wouldn't meaningfully change userspace's responsibilities.
Yes, exposing whether the CPU supports 5-level EPT is indeed a better solution: it
not only bypasses KVM_GET_SUPPORTED_CPUID but also provides the information to
userspace.
I think a new KVM capability to enumerate the max GPA is better, since it is more
convenient for userspace to consume, e.g., to automatically limit physical bits or
the GPA in a user memory region.
But reporting this capability alone can't solve the KVM hang issue, since userspace
can choose to ignore the max GPA, e.g., the six selftests in the changelog still
fail. I think we can reconsider patch 2 if we don't advertise
guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID.
Thanks,
Tao
>
> I.e.
>
> diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
> index c6a7eed03914..266daf5b5b84 100644
> --- a/arch/x86/include/asm/vmxfeatures.h
> +++ b/arch/x86/include/asm/vmxfeatures.h
> @@ -25,6 +25,7 @@
> #define VMX_FEATURE_EPT_EXECUTE_ONLY ( 0*32+ 17) /* "ept_x_only" EPT entries can be execute only */
> #define VMX_FEATURE_EPT_AD ( 0*32+ 18) /* EPT Accessed/Dirty bits */
> #define VMX_FEATURE_EPT_1GB ( 0*32+ 19) /* 1GB EPT pages */
> +#define VMX_FEATURE_EPT_5LEVEL ( 0*32+ 20) /* 5-level EPT paging */
>
> /* Aggregated APIC features 24-27 */
> #define VMX_FEATURE_FLEXPRIORITY ( 0*32+ 24) /* TPR shadow + virt APIC */
> diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
> index 03851240c3e3..1640ae76548f 100644
> --- a/arch/x86/kernel/cpu/feat_ctl.c
> +++ b/arch/x86/kernel/cpu/feat_ctl.c
> @@ -72,6 +72,8 @@ static void init_vmx_capabilities(struct cpuinfo_x86 *c)
> c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
> if (ept & VMX_EPT_1GB_PAGE_BIT)
> c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
> + if (ept & VMX_EPT_PAGE_WALK_5_BIT)
> + c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
>
> /* Synthetic APIC features that are aggregates of multiple features. */
> if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT
2023-12-20 13:42 ` Jim Mattson
@ 2024-01-08 13:48 ` Tao Su
2024-01-08 15:19 ` Sean Christopherson
0 siblings, 1 reply; 34+ messages in thread
From: Tao Su @ 2024-01-08 13:48 UTC (permalink / raw)
To: Jim Mattson
Cc: kvm, seanjc, pbonzini, eddie.dong, chao.gao, xiaoyao.li, yuan.yao,
yi1.lai, xudong.hao, chao.p.peng
On Wed, Dec 20, 2023 at 05:42:56AM -0800, Jim Mattson wrote:
> On Mon, Dec 18, 2023 at 6:08 AM Tao Su <tao1.su@linux.intel.com> wrote:
> >
> > With 4-level EPT, bits 51:48 of the guest physical address must all
> > be zero; otherwise, an EPT violation always occurs, which is an unexpected
> > VM exit in KVM currently.
> >
> > Even though KVM advertises the max physical bits to guest, guest may
> > ignore MAXPHYADDR in CPUID and set a bigger physical bits to KVM.
> > Rejecting invalid guest physical bits on KVM side is a choice, but it will
> > break current KVM ABI, e.g., current QEMU ignores the physical bits
> > advertised by KVM and uses host physical bits as guest physical bits by
> > default when using '-cpu host', although we would like to send a patch to
> > QEMU, it will still cause backward compatibility issues.
> >
> > For GPA that can't be translated by EPT but within host.MAXPHYADDR,
> > emulation should be the best choice since KVM will inject #PF for the
> > invalid GPA in guest's perspective and try to emulate the instructions
> > which minimizes the impact on guests as much as possible.
> >
> > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > Tested-by: Yi Lai <yi1.lai@intel.com>
> > ---
> > arch/x86/kvm/vmx/vmx.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index be20a60047b1..a8aa2cfa2f5d 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -5774,6 +5774,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> >
> > vcpu->arch.exit_qualification = exit_qualification;
> >
> > + /*
> > + * Emulate the instruction when accessing a GPA that has any bits set
> > + * beyond the guest-physical bits that EPT can translate.
> > + */
> > + if (unlikely(gpa & rsvd_bits(kvm_mmu_tdp_maxphyaddr(), 63)))
> > + return kvm_emulate_instruction(vcpu, 0);
> > +
>
> This doesn't really work, since the KVM instruction emulator is
> woefully incomplete. To make this work, first you have to teach the
> KVM instruction emulator how to emulate *all* memory-accessing
> instructions.
Please forget allow_smaller_maxphyaddr and #PF for a while.
I agree the KVM instruction emulator is incomplete. However, hardware can't
execute instructions with a GPA above 48 bits and exits to KVM, where KVM just
repeatedly rebuilds the SPTE, i.e., current KVM is buggy.
In this case, emulation may be a choice, i.e., KVM can emulate the most common
instructions, which reflects KVM's best effort to maintain an environment for
continued execution. Even if some instructions can't be emulated, KVM can
terminate the guest rather than silently hang.
Thanks,
Tao
>
> > /*
> > * Check that the GPA doesn't exceed physical memory limits, as that is
> > * a guest page fault. We have to emulate the instruction here, because
> > --
> > 2.34.1
> >
> >
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT
2024-01-08 13:48 ` Tao Su
@ 2024-01-08 15:19 ` Sean Christopherson
0 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2024-01-08 15:19 UTC (permalink / raw)
To: Tao Su
Cc: Jim Mattson, kvm, pbonzini, eddie.dong, chao.gao, xiaoyao.li,
yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Mon, Jan 08, 2024, Tao Su wrote:
> On Wed, Dec 20, 2023 at 05:42:56AM -0800, Jim Mattson wrote:
> > On Mon, Dec 18, 2023 at 6:08 AM Tao Su <tao1.su@linux.intel.com> wrote:
> > >
> > > With 4-level EPT, bits 51:48 of the guest physical address must all
> > > be zero; otherwise, an EPT violation always occurs, which is an unexpected
> > > VM exit in KVM currently.
> > >
> > > Even though KVM advertises the max physical bits to guest, guest may
> > > ignore MAXPHYADDR in CPUID and set a bigger physical bits to KVM.
> > > Rejecting invalid guest physical bits on KVM side is a choice, but it will
> > > break current KVM ABI, e.g., current QEMU ignores the physical bits
> > > advertised by KVM and uses host physical bits as guest physical bits by
> > > default when using '-cpu host', although we would like to send a patch to
> > > QEMU, it will still cause backward compatibility issues.
> > >
> > > For GPA that can't be translated by EPT but within host.MAXPHYADDR,
> > > emulation should be the best choice since KVM will inject #PF for the
> > > invalid GPA in guest's perspective and try to emulate the instructions
> > > which minimizes the impact on guests as much as possible.
> > >
> > > Signed-off-by: Tao Su <tao1.su@linux.intel.com>
> > > Tested-by: Yi Lai <yi1.lai@intel.com>
> > > ---
> > > arch/x86/kvm/vmx/vmx.c | 7 +++++++
> > > 1 file changed, 7 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index be20a60047b1..a8aa2cfa2f5d 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -5774,6 +5774,13 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > >
> > > vcpu->arch.exit_qualification = exit_qualification;
> > >
> > > + /*
> > > + * Emulate the instruction when accessing a GPA that has any bits set
> > > + * beyond the guest-physical bits that EPT can translate.
> > > + */
> > > + if (unlikely(gpa & rsvd_bits(kvm_mmu_tdp_maxphyaddr(), 63)))
> > > + return kvm_emulate_instruction(vcpu, 0);
> > > +
> >
> > This doesn't really work, since the KVM instruction emulator is
> > woefully incomplete. To make this work, first you have to teach the
> > KVM instruction emulator how to emulate *all* memory-accessing
> > instructions.
>
> Please forget allow_smaller_maxphyaddr and #PF for a while.
>
> I agree the KVM instruction emulator is incomplete. However, hardware can't
> execute instructions with a GPA above 48 bits and exits to KVM, where KVM just
> repeatedly rebuilds the SPTE, i.e., current KVM is buggy.
Eh, hardware is just as much to blame as KVM. Garbage in, garbage out.
> In this case, emulation may be a choice,
Yes, it can be a choice, but that choice needs to be made consciously by userspace,
not silently by KVM.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported
2024-01-08 13:45 ` Tao Su
@ 2024-01-08 15:29 ` Sean Christopherson
0 siblings, 0 replies; 34+ messages in thread
From: Sean Christopherson @ 2024-01-08 15:29 UTC (permalink / raw)
To: Tao Su
Cc: Jim Mattson, Chao Gao, Xu Yilun, kvm, pbonzini, eddie.dong,
xiaoyao.li, yuan.yao, yi1.lai, xudong.hao, chao.p.peng
On Mon, Jan 08, 2024, Tao Su wrote:
> On Fri, Jan 05, 2024 at 12:26:08PM -0800, Sean Christopherson wrote:
> > On Thu, Jan 04, 2024, Jim Mattson wrote:
> > > On Thu, Jan 4, 2024 at 7:08 AM Chao Gao <chao.gao@intel.com> wrote:
> > > >
> > > > On Wed, Jan 03, 2024 at 07:40:02PM -0800, Jim Mattson wrote:
> > > > >On Wed, Jan 3, 2024 at 6:45 PM Chao Gao <chao.gao@intel.com> wrote:
> > > > >>
> > > > >> On Wed, Jan 03, 2024 at 10:04:41AM -0800, Sean Christopherson wrote:
> > > > >> >On Tue, Jan 02, 2024, Jim Mattson wrote:
> > > > >> >> This is all just so broken and wrong. The only guest.MAXPHYADDR that
> > > > >> >> can be supported under TDP is the host.MAXPHYADDR. If KVM claims to
> > > > >> >> support a smaller guest.MAXPHYADDR, then KVM is obligated to intercept
> > > > >> >> every #PF,
> > > > >>
> > > > >> in this case (i.e., to support 48-bit guest.MAXPHYADDR when CPU supports only
> > > > >> 4-level EPT), KVM has no need to intercept #PF because accessing a GPA with
> > > > >> RSVD bits 51-48 set leads to EPT violation.
> > > > >
> > > > >At the completion of the page table walk, if there is a permission
> > > > >fault, the data address should not be accessed, so there should not be
> > > > >an EPT violation. Remember Meltdown?
> > > >
> > > > You are right. I missed this case. KVM needs to intercept #PF to set RSVD bit
> > > > in PFEC.
> > >
> > > I have no problem with a user deliberately choosing an unsupported
> > > configuration, but I do have a problem with KVM_GET_SUPPORTED_CPUID
> > > returning an unsupported configuration.
> >
> > +1
> >
> > Advertising guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID simply
> > isn't viable when TDP is enabled. I suppose KVM could do so when allow_smaller_maxphyaddr
> > is enabled, but that's just asking for confusion, e.g. if userspace reflects the
> > CPUID back into the guest, it could unknowingly create a VM that depends on
> > allow_smaller_maxphyaddr.
> >
> > I think the least awful option is to have the kernel expose whether or not the
> > CPU supports 5-level EPT to userspace. That doesn't even require new uAPI per se,
> > just a new flag in /proc/cpuinfo. It'll be a bit gross for userspace to parse,
> > but it's not the end of the world. Alternatively, KVM could add a capability to
> > enumerate the max *addressable* GPA, but userspace would still need to manually
> > take action when KVM can't address all of memory, i.e. a capability would be less
> > ugly, but wouldn't meaningfully change userspace's responsibilities.
>
> Yes, exposing whether the CPU support 5-level EPT is indeed a better solution, it
> not only bypasses the KVM_GET_SUPPORTED_CPUID but also provides the information to
> userspace.
>
> I think a new KVM capability to enumerate the max GPA is better since it will be
> more convenient if userspace wants to use, e.g., automatically limit physical bits
> or the GPA in the user memory region.
Not really, because "automatically" limiting guest.MAXPHYADDR without setting
allow_smaller_maxphyaddr isn't exactly safe. I think it's also useful to capture
*why* KVM's max addressable GPA is smaller than host.MAXPHYADDR, e.g. if down the
road someone ships a CPU that actually does the right thing when guest.MAXPHYADDR
is smaller than host.MAXPHYADDR.
> But reporting this capability alone can't solve the KVM hang issue, since userspace
> can choose to ignore the max GPA, e.g., the six selftests in the changelog still fail.
I know. I just have more pressing concerns than making selftests work on funky
hardware that AFAIK isn't publicly available.
> I think we can reconsider patch 2 if we don't advertise
> guest.MAXPHYADDR < host.MAXPHYADDR in KVM_GET_SUPPORTED_CPUID.
Nah, someone just needs to update the selftests if/when the ept_5level flag
lands.
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2024-01-08 15:29 UTC | newest]
Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-18 14:05 [PATCH 0/2] x86: KVM: Limit guest physical bits when 5-level EPT is unsupported Tao Su
2023-12-18 14:05 ` [PATCH 1/2] " Tao Su
2023-12-18 15:13 ` Sean Christopherson
2023-12-19 2:51 ` Chao Gao
2023-12-19 3:40 ` Jim Mattson
2023-12-19 8:09 ` Chao Gao
2023-12-19 15:26 ` Sean Christopherson
2023-12-20 7:16 ` Xiaoyao Li
2023-12-20 15:37 ` Sean Christopherson
2023-12-20 11:59 ` Tao Su
2023-12-20 13:39 ` Jim Mattson
2023-12-19 8:31 ` Tao Su
2023-12-20 16:28 ` Sean Christopherson
2023-12-21 7:45 ` Tao Su
2023-12-21 8:19 ` Xu Yilun
2024-01-02 23:24 ` Sean Christopherson
2024-01-03 0:34 ` Jim Mattson
2024-01-03 18:04 ` Sean Christopherson
2024-01-04 2:45 ` Chao Gao
2024-01-04 3:40 ` Jim Mattson
2024-01-04 4:34 ` Jim Mattson
2024-01-04 11:56 ` Tao Su
2024-01-04 14:03 ` Jim Mattson
2024-01-04 15:07 ` Chao Gao
2024-01-04 17:02 ` Jim Mattson
2024-01-05 20:26 ` Sean Christopherson
2024-01-08 13:45 ` Tao Su
2024-01-08 15:29 ` Sean Christopherson
2023-12-18 14:05 ` [PATCH 2/2] x86: KVM: Emulate instruction when GPA can't be translated by EPT Tao Su
2023-12-18 15:23 ` Sean Christopherson
2023-12-19 3:10 ` Chao Gao
2023-12-20 13:42 ` Jim Mattson
2024-01-08 13:48 ` Tao Su
2024-01-08 15:19 ` Sean Christopherson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox