kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable
@ 2024-06-12 18:16 Reinette Chatre
  2024-06-12 18:16 ` [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility Reinette Chatre
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Reinette Chatre @ 2024-06-12 18:16 UTC (permalink / raw)
  To: isaku.yamahata, pbonzini, erdemaktas, vkuznets, seanjc,
	vannapurve, jmattson, mlevitsk, xiaoyao.li, chao.gao,
	rick.p.edgecombe, yuan.yao
  Cc: reinette.chatre, kvm, linux-kernel

Changes from v8:
- v8: https://lore.kernel.org/lkml/cover.1718043121.git.reinette.chatre@intel.com/
- Many changes to new udelay() utility patch as well as the APIC bus
  frequency test aimed to make it more robust (additional ASSERTs,
  consistent types, eliminate duplicate code, etc.) and useful with
  support for more user configuration. Please refer to individual patches for
  detailed changes.
- Series applies cleanly to next branch of kvm-x86 with HEAD
  e4e9e1067138e5620cf0500c3e5f6ebfb9d322c8.

Changes from v7:
- v7: https://lore.kernel.org/lkml/cover.1717695426.git.reinette.chatre@intel.com/
- Use appropriate ioctl() return type.
- Let new tsc_khz used by KVM selftest be of the same type as same variable
  used in kernel.
- Please refer to individual patches for detailed changes.

Changes from v6:
- V6: https://lore.kernel.org/lkml/cover.1715017765.git.reinette.chatre@intel.com/
- KVM changes were merged into kvm-x86 misc and subsequently dropped from
  this submission. This submission only contains the selftest changes.
  https://lore.kernel.org/lkml/ZmBw9R_jkNLYXkJh@google.com/
- New patch to add udelay() utility to x86_64 KVM selftests. (Sean)
- Many changes to APIC bus frequency selftest. Please refer to the patch
  for detailed changes.
- Series applies cleanly to next branch of kvm-x86 with HEAD
  af0903ab52ee6d6f0f63af67fa73d5eb00f79b9a.

Changes from v5:
- v5: https://lore.kernel.org/lkml/cover.1714081725.git.reinette.chatre@intel.com/
- Rebased on latest of "next" branch of https://github.com/kvm-x86/linux.git
- Added Xiaoyao Li and Yuan Yao's Reviewed-by tags.
- Use the KVM selftest vm_create() wrapper instead of open coding it. (Zide)
- Improve grammar of selftest description. (Zide)

Changes from v4:
- v4: https://lore.kernel.org/lkml/cover.1711035400.git.reinette.chatre@intel.com/
- Rename capability from KVM_CAP_X86_APIC_BUS_FREQUENCY to
  KVM_CAP_X86_APIC_BUS_CYCLES_NS. (Xiaoyao Li).
- Include "Testing" section in cover letter.
- Add Rick's Reviewed-by tags.
- Rebased on latest of "next" branch of https://github.com/kvm-x86/linux.git

Changes from v3:
- v3: https://lore.kernel.org/all/cover.1702974319.git.isaku.yamahata@intel.com/
- Reworked all changelogs.
- Regarding code changes: patches #1 and #2 are unchanged, patch #3 was
  reworked according to Sean's guidance, and patch #4 (the test)
  needed changes after rebase to kvm-x86/next and the implementation
  (patch #3) changes.
- Added Reviewed-by tags to patches #1, #2, and #4, but removed
  Maxim's Reviewed-by tag from patch #3 because it changed so much.
- Added a "Suggested-by" to patch #4 to reflect that it represents
  Sean's guidance.
- Reworked cover to match customs (in subject line) and reflect feedback
  to patches: capability renamed to KVM_CAP_X86_APIC_BUS_FREQUENCY, clarify
  distinction between "core crystal clock" and "bus clock", and update
  pro/con list.
- Please refer to individual patches for detailed changes.

Changes from v2:
- v2: https://lore.kernel.org/lkml/cover.1699936040.git.isaku.yamahata@intel.com/
- Removed APIC_BUS_FREQUENCY and apic_bus_frequency of struct kvm-arch.
- Update the commit messages.
- Added reviewed-by (Maxim Levitsky)
- Use 1.5GHz instead of 1GHz as frequency for the test case.

Changes from v1:
- v1: https://lore.kernel.org/all/cover.1699383993.git.isaku.yamahata@intel.com/
- Added a test case
- Fix a build error for i386 platform
- Add check if vcpu isn't created.
- Add check if lapic chip is in-kernel emulation.
- Updated api.rst

Summary
-------
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the
frequency in nanoseconds. When using this capability, the user space
VMM should configure CPUID leaf 0x15 to advertise the frequency.

Description
-----------
Vishal reported [1] that the TDX guest kernel expects a 25MHz APIC bus
frequency but ends up getting interrupts at a significantly higher rate.

The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.

In addition, per Intel SDM:
    "The APIC timer frequency will be the processor’s bus clock or core
     crystal clock frequency (when TSC/core crystal clock ratio is
     enumerated in CPUID leaf 0x15) divided by the value specified in
     the divide configuration register."

The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.

The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
enumerated, the guest kernel uses it as the APIC bus frequency. If not,
the guest kernel measures the frequency based on other known timers like
the ACPI timer or the legacy PIT. As reported in [1] the TDX guest kernel
expects a 25MHz timer frequency but gets timer interrupt more frequently
due to the 1GHz frequency used by KVM.

To ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow the userspace to tell KVM to use the same frequency that
TDX mandates instead of the default 1Ghz.

There are several options to address this:
1. Make the KVM able to configure APIC bus frequency (this series).
   Pro: It resembles the existing hardware.  The recent Intel CPUs
        adopts 25MHz.
   Con: Require the VMM to emulate the APIC timer at 25MHz.
2. Make the TDX architecture enumerate CPUID leaf 0x15 to configurable
   frequency or not enumerate it.
   Pro: Any APIC bus frequency is allowed.
   Con: Deviates from TDX architecture.
3. Make the TDX guest kernel use 1GHz when it's running on KVM.
   Con: The kernel ignores CPUID leaf 0x15.
4. Change CPUID leaf 0x15 under TDX to report the crystal clock frequency
   as 1 GHz.
   Pro: This has been the virtual APIC frequency for KVM guests for 13
        years.
   Pro: This requires changing only one hard-coded constant in TDX.
   Con: It doesn't work with other VMMs as TDX isn't specific to KVM.
   Con: Core crystal clock frequency is also used to calculate TSC frequency.
   Con: If it is configured to value different from hardware, it will
        break the correctness of INTEL-PT Mini Time Count (MTC) packets
        in TDs.

Testing
-------
Tested on non-TDX host using included kselftest. Host running kernel
with this series applied to "next" branch of
https://github.com/kvm-x86/linux.git

Tested on TDX host and TD using a modified version
of x86/apic.c:test_apic_timer_one_shot() available from
https://github.com/intel/kvm-unit-tests-tdx/blob/tdx/x86/apic.c.
Host running TDX KVM development patches and QEMU with corresponding TDX
changes.
The test needed to be modified to (a) stop any lingering timers before the
test starts, and (b) use CPUID 0x15 in TDX to accurately determine the TSC
and APIC frequencies instead of making 1GHz assumption and use similar
check as the kselftest test introduced in this series (while increasing
the amount with which the frequency is allowed to deviate by 1%).

The core changes made to x86/apic.c:test_apic_timer_one_shot() for this
testing are shown below for reference. Work is in progress to upstream
these modifications.

@@ -477,11 +478,29 @@ static void lvtt_handler(isr_regs_t *regs)
 
 static void test_apic_timer_one_shot(void)
 {
-	uint64_t tsc1, tsc2;
 	static const uint32_t interval = 0x10000;
+	uint64_t measured_apic_freq, tsc2, tsc1;
+	uint32_t tsc_freq = 0, apic_freq = 0;
+	struct cpuid cpuid_tsc = {};
 
 #define APIC_LVT_TIMER_VECTOR    (0xee)
 
+	/*
+	 * CPUID 0x15 is not available in VMX, can use it to obtain
+	 * TSC and APIC frequency for accurate testing
+	 */
+	if (is_tdx_guest()) {
+		cpuid_tsc = raw_cpuid(0x15, 0);
+		tsc_freq = cpuid_tsc.c * cpuid_tsc.b / cpuid_tsc.a;
+		apic_freq = cpuid_tsc.c;
+	}
+	/*
+	   stop already fired local timer
+	   the test case can be negative failure if the timer fired
+	   after installed lvtt_handler but *before*
+	   write to TIMICT again.
+	 */
+	apic_write(APIC_TMICT, 0);
 	handle_irq(APIC_LVT_TIMER_VECTOR, lvtt_handler);
 
 	/* One shot mode */
@@ -503,8 +522,16 @@ static void test_apic_timer_one_shot(void)
 	 * cases, the following should satisfy on all modern
 	 * processors.
 	 */
-	report((lvtt_counter == 1) && (tsc2 - tsc1 >= interval),
-	       "APIC LVT timer one shot");
+	if (is_tdx_guest()) {
+		measured_apic_freq = interval * (tsc_freq / (tsc2 - tsc1));
+		report((lvtt_counter == 1) &&
+		       (measured_apic_freq < apic_freq * 102 / 100) &&
+		       (measured_apic_freq > apic_freq * 98 / 100),
+		       "APIC LVT timer one shot");
+	} else {
+		report((lvtt_counter == 1) && (tsc2 - tsc1 >= interval),
+		"APIC LVT timer one shot");
+	}
 }

[1] https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com/

Isaku Yamahata (1):
  KVM: selftests: Add test for configure of x86 APIC bus frequency

Reinette Chatre (1):
  KVM: selftests: Add x86_64 guest udelay() utility

 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/apic.h       |   8 +
 .../selftests/kvm/include/x86_64/processor.h  |  17 ++
 .../selftests/kvm/lib/x86_64/processor.c      |  11 +
 .../kvm/x86_64/apic_bus_clock_test.c          | 202 ++++++++++++++++++
 5 files changed, 239 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c

base-commit: e4e9e1067138e5620cf0500c3e5f6ebfb9d322c8
-- 
2.34.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility
  2024-06-12 18:16 [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Reinette Chatre
@ 2024-06-12 18:16 ` Reinette Chatre
  2024-06-28 22:46   ` Sean Christopherson
  2024-06-12 18:16 ` [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency Reinette Chatre
  2024-06-28 22:55 ` [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Sean Christopherson
  2 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2024-06-12 18:16 UTC (permalink / raw)
  To: isaku.yamahata, pbonzini, erdemaktas, vkuznets, seanjc,
	vannapurve, jmattson, mlevitsk, xiaoyao.li, chao.gao,
	rick.p.edgecombe, yuan.yao
  Cc: reinette.chatre, kvm, linux-kernel

Create udelay() utility for x86_64 tests to match
udelay() available to ARM and RISC-V tests.

Calibrate guest frequency using the KVM_GET_TSC_KHZ ioctl()
and share it between host and guest with the new
tsc_khz global variable.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes v9:
- Change tsc_kz type to uint32_t (was unsigned int). (Sean)
- Change type used to store number of cycles to uint64_t
  (was unsigned long). (Sean)
- Add GUEST_ASSERT() in udelay() used in guest to avoid risk of
  udelay() unexpectedly doing nothing if something went wrong
  during TSC frequency discovery. (Sean)
- Use TEST_ASSERT() when checking for KVM_CAP_GET_TSC_KHZ. (Sean)
- Use TEST_ASSERT() to enforce that discovery of TSC frequency
  must never fail. (Sean)

Changes v8:
- Use appropriate signed type to discover TSC freq from KVM.
- Switch type used to store TSC frequency from unsigned long
  to unsigned int to match the type used by the kernel.

Changes v7:
- New patch
---
 .../selftests/kvm/include/x86_64/processor.h    | 17 +++++++++++++++++
 .../selftests/kvm/lib/x86_64/processor.c        | 11 +++++++++++
 2 files changed, 28 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index c0c7c1fe93f9..383a0f7fa9ef 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -23,6 +23,7 @@
 
 extern bool host_cpu_is_intel;
 extern bool host_cpu_is_amd;
+extern uint32_t tsc_khz;
 
 /* Forced emulation prefix, used to invoke the emulator unconditionally. */
 #define KVM_FEP "ud2; .byte 'k', 'v', 'm';"
@@ -816,6 +817,22 @@ static inline void cpu_relax(void)
 	asm volatile("rep; nop" ::: "memory");
 }
 
+static inline void udelay(unsigned long usec)
+{
+	uint64_t start, now, cycles;
+
+	GUEST_ASSERT(tsc_khz);
+	cycles = tsc_khz / 1000 * usec;
+
+	start = rdtsc();
+	for (;;) {
+		now = rdtsc();
+		if (now - start >= cycles)
+			break;
+		cpu_relax();
+	}
+}
+
 #define ud2()			\
 	__asm__ __volatile__(	\
 		"ud2\n"	\
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 594b061aef52..aaadda1ebcad 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -25,6 +25,7 @@ vm_vaddr_t exception_handlers;
 bool host_cpu_is_amd;
 bool host_cpu_is_intel;
 bool is_forced_emulation_enabled;
+uint32_t tsc_khz;
 
 static void regs_dump(FILE *stream, struct kvm_regs *regs, uint8_t indent)
 {
@@ -616,6 +617,11 @@ void assert_on_unhandled_exception(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vm_post_create(struct kvm_vm *vm)
 {
+	int r;
+
+	TEST_ASSERT(kvm_has_cap(KVM_CAP_GET_TSC_KHZ),
+		    "Require KVM_GET_TSC_KHZ to provide udelay() to guest.");
+
 	vm_create_irqchip(vm);
 	vm_init_descriptor_tables(vm);
 
@@ -628,6 +634,11 @@ void kvm_arch_vm_post_create(struct kvm_vm *vm)
 
 		vm_sev_ioctl(vm, KVM_SEV_INIT2, &init);
 	}
+
+	r = __vm_ioctl(vm, KVM_GET_TSC_KHZ, NULL);
+	TEST_ASSERT(r > 0, "KVM_GET_TSC_KHZ did not provide a valid TSC frequency.");
+	tsc_khz = r;
+	sync_global_to_guest(vm, tsc_khz);
 }
 
 void vcpu_arch_set_entry_point(struct kvm_vcpu *vcpu, void *guest_code)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency
  2024-06-12 18:16 [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Reinette Chatre
  2024-06-12 18:16 ` [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility Reinette Chatre
@ 2024-06-12 18:16 ` Reinette Chatre
  2024-06-28 22:50   ` Sean Christopherson
  2024-06-29  0:39   ` VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX Sean Christopherson
  2024-06-28 22:55 ` [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Sean Christopherson
  2 siblings, 2 replies; 13+ messages in thread
From: Reinette Chatre @ 2024-06-12 18:16 UTC (permalink / raw)
  To: isaku.yamahata, pbonzini, erdemaktas, vkuznets, seanjc,
	vannapurve, jmattson, mlevitsk, xiaoyao.li, chao.gao,
	rick.p.edgecombe, yuan.yao
  Cc: reinette.chatre, kvm, linux-kernel

From: Isaku Yamahata <isaku.yamahata@intel.com>

Test if KVM emulates the APIC bus clock at the expected frequency when
userspace configures the frequency via KVM_CAP_X86_APIC_BUS_CYCLES_NS.

Set APIC timer's initial count to the maximum value and busy wait for 100
msec (largely arbitrary) using the TSC. Read the APIC timer's "current
count" to calculate the actual APIC bus clock frequency based on TSC
frequency.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
---
Changes v9:
- Use uint64_t to store APIC frequency (was unsigned long). (Sean)
- Introduce new helpers (apic_enable(), apic_read_reg(), and
  apic_write_reg()) that runs appropriate xAPIC or x2APIC functions
  based on new is_x2apic global. (Sean)
- Use new apic_* helpers in guest test code to eliminate duplicate
  code. (Sean)
- Use constant defined at beginning of function for ICR register
  value. (Sean)
- Drop unnecessary GUEST_ASSERT(tsc_cycles > 0). (Sean)
- Drop unnecessary comments, use provided comments. (Sean)
- Change APIC frequency command line option to "-f" (was "-a"). (Sean)
- Enable user to override delay used in guest with new "-d" command
  line option. (Sean)
- Open code test checks in what is now single guest test function.

Changes v8:
- With TSC frequency switching to unsigned int type the implicit
  type promotion during bus frequency computation results in
  overflow. Use explicit unsigned long type within computation to
  avoid overflow.

Changes v7:
- Drop Maxim Levitsky's Reviewed-by because of significant changes.
- Remove redefine of _GNU_SOURCE. (kernel test robot)
- Rewrite changelog and test description. (Sean)
- Enable user space to set APIC bus frequency. (Sean)
- Use GUEST_ASSERT() from guest instead of TEST_ASSERT() from host. (Sean)
- Test xAPIC as well as x2APIC. (Sean)
- Add check that KVM_CAP_X86_APIC_BUS_CYCLES_NS cannot be set
  after vCPU created. (Sean)
- Use new udelay() utility. (Sean)
- Drop CLI. Mask LVT timer independently. (Sean)
- Use correct LVT timer entry (0x350 -> 0x320) to enable oneshot operation.
- Remove unnecessary static functions from single file test.
- Be consistent in types by using uint32_t/uint64_t instead of
  u32/u64.

Changes v6:
- Use vm_create() wrapper instead of open coding it. (Zide)
- Improve grammar of test description. (Zide)

Changes v5:
- Update to new name of capability KVM_CAP_X86_APIC_BUS_FREQUENCY ->
  KVM_CAP_X86_APIC_BUS_CYCLES_NS. (Xiaoyao Li)

Changes v4:
- Rework changelog.
- Add Sean's "Suggested-by" to acknowledge guidance received in
  https://lore.kernel.org/all/ZU0BASXWcck85r90@google.com/
- Add copyright.
- Add test description to file header.
- Consistent capitalization for acronyms.
- Rebase to kvm-x86/next.
- Update to v4 change of providing bus clock rate in nanoseconds.
- Add a "TEST_REQUIRE()" for the new capability so that the test can
  work on kernels that do not support the new capability.
- Address checkpatch warnings and use tabs instead of spaces in header
  file to match existing code.

Changes v3:
- Use 1.5GHz instead of 1GHz as frequency.

Changes v2:
- Newly added.
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/apic.h       |   8 +
 .../kvm/x86_64/apic_bus_clock_test.c          | 202 ++++++++++++++++++
 3 files changed, 211 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ce8ff8e8ce3a..ad8b5d15f2bd 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -112,6 +112,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_invalid_nested_guest_state
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
+TEST_GEN_PROGS_x86_64 += x86_64/apic_bus_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
 TEST_GEN_PROGS_x86_64 += x86_64/xapic_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/xcr0_cpuid_test
diff --git a/tools/testing/selftests/kvm/include/x86_64/apic.h b/tools/testing/selftests/kvm/include/x86_64/apic.h
index bed316fdecd5..0f268b55fa06 100644
--- a/tools/testing/selftests/kvm/include/x86_64/apic.h
+++ b/tools/testing/selftests/kvm/include/x86_64/apic.h
@@ -60,6 +60,14 @@
 #define		APIC_VECTOR_MASK	0x000FF
 #define	APIC_ICR2	0x310
 #define		SET_APIC_DEST_FIELD(x)	((x) << 24)
+#define APIC_LVTT	0x320
+#define		APIC_LVT_TIMER_ONESHOT		(0 << 17)
+#define		APIC_LVT_TIMER_PERIODIC		(1 << 17)
+#define		APIC_LVT_TIMER_TSCDEADLINE	(2 << 17)
+#define		APIC_LVT_MASKED			(1 << 16)
+#define	APIC_TMICT	0x380
+#define	APIC_TMCCT	0x390
+#define	APIC_TDCR	0x3E0
 
 void apic_disable(void);
 void xapic_enable(void);
diff --git a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
new file mode 100644
index 000000000000..bff3cd95134f
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024 Intel Corporation
+ *
+ * Verify KVM correctly emulates the APIC bus frequency when the VMM configures
+ * the frequency via KVM_CAP_X86_APIC_BUS_CYCLES_NS.  Start the APIC timer by
+ * programming TMICT (timer initial count) to the largest value possible (so
+ * that the timer will not expire during the test).  Then, after an arbitrary
+ * amount of time has elapsed, verify TMCCT (timer current count) is within 1%
+ * of the expected value based on the time elapsed, the APIC bus frequency, and
+ * the programmed TDCR (timer divide configuration register).
+ */
+
+#include "apic.h"
+#include "test_util.h"
+
+/*
+ * Pick 25MHz for APIC bus frequency. Different enough from the default 1GHz.
+ * User can override via command line.
+ */
+static uint64_t apic_hz = 25 * 1000 * 1000;
+
+/*
+ * Delay in msec that guest uses to determine APIC bus frequency.
+ * User can override via command line.
+ */
+static unsigned long delay_ms = 100;
+
+/*
+ * Possible TDCR values with matching divide count. Used to modify APIC
+ * timer frequency.
+ */
+static struct {
+	uint32_t tdcr;
+	uint32_t divide_count;
+} tdcrs[] = {
+	{0x0, 2},
+	{0x1, 4},
+	{0x2, 8},
+	{0x3, 16},
+	{0x8, 32},
+	{0x9, 64},
+	{0xa, 128},
+	{0xb, 1},
+};
+
+/* true if x2APIC test is running, false if xAPIC test is running. */
+static bool is_x2apic;
+
+static void apic_enable(void)
+{
+	if (is_x2apic)
+		x2apic_enable();
+	else
+		xapic_enable();
+}
+
+static uint32_t apic_read_reg(unsigned int reg)
+{
+	return is_x2apic ? x2apic_read_reg(reg) : xapic_read_reg(reg);
+}
+
+static void apic_write_reg(unsigned int reg, uint32_t val)
+{
+	if (is_x2apic)
+		x2apic_write_reg(reg, val);
+	else
+		xapic_write_reg(reg, val);
+}
+
+static void apic_guest_code(void)
+{
+	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
+	const uint32_t tmict = ~0u;
+	uint64_t tsc0, tsc1, freq;
+	uint32_t tmcct;
+	int i;
+
+	apic_enable();
+
+	/*
+	 * Setup one-shot timer.  The vector does not matter because the
+	 * interrupt should not fire.
+	 */
+	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
+
+	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
+
+		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
+		apic_write_reg(APIC_TMICT, tmict);
+
+		tsc0 = rdtsc();
+		udelay(delay_ms * 1000);
+		tmcct = apic_read_reg(APIC_TMCCT);
+		tsc1 = rdtsc();
+
+		/*
+		 * Stop the timer _after_ reading the current, final count, as
+		 * writing the initial counter also modifies the current count.
+		 */
+		apic_write_reg(APIC_TMICT, 0);
+
+		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
+		/* Check if measured frequency is within 1% of configured frequency. */
+		GUEST_ASSERT(freq < apic_hz * 101 / 100);
+		GUEST_ASSERT(freq > apic_hz * 99 / 100);
+	}
+
+	GUEST_DONE();
+}
+
+static void test_apic_bus_clock(struct kvm_vcpu *vcpu)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+
+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			TEST_FAIL("Unknown ucall %lu", uc.cmd);
+			break;
+		}
+	}
+}
+
+static void run_apic_bus_clock_test(bool x2apic)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	int ret;
+
+	is_x2apic = x2apic;
+
+	vm = vm_create(1);
+
+	sync_global_to_guest(vm, apic_hz);
+	sync_global_to_guest(vm, delay_ms);
+	sync_global_to_guest(vm, is_x2apic);
+
+	vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS,
+		      NSEC_PER_SEC / apic_hz);
+
+	vcpu = vm_vcpu_add(vm, 0, apic_guest_code);
+
+	ret = __vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS,
+			      NSEC_PER_SEC / apic_hz);
+	TEST_ASSERT(ret < 0 && errno == EINVAL,
+		    "Setting of APIC bus frequency after vCPU is created should fail.");
+
+	if (!is_x2apic)
+		virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+
+	test_apic_bus_clock(vcpu);
+	kvm_vm_free(vm);
+}
+
+static void help(char *name)
+{
+	puts("");
+	printf("usage: %s [-h] [-d delay] [-f APIC bus freq]\n", name);
+	puts("");
+	printf("-d: Delay (in msec) guest uses to measure APIC bus frequency.\n");
+	printf("-f: The APIC bus frequency (in Hz) to be configured for the guest.\n");
+	puts("");
+}
+
+int main(int argc, char *argv[])
+{
+	int opt;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS));
+
+	while ((opt = getopt(argc, argv, "d:f:h")) != -1) {
+		switch (opt) {
+		case 'f':
+			apic_hz = atol(optarg);
+			break;
+		case 'd':
+			delay_ms = atol(optarg);
+			break;
+		case 'h':
+			help(argv[0]);
+			exit(0);
+		default:
+			help(argv[0]);
+			exit(1);
+		}
+	}
+
+	run_apic_bus_clock_test(false);
+	run_apic_bus_clock_test(true);
+}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility
  2024-06-12 18:16 ` [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility Reinette Chatre
@ 2024-06-28 22:46   ` Sean Christopherson
  0 siblings, 0 replies; 13+ messages in thread
From: Sean Christopherson @ 2024-06-28 22:46 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel

On Wed, Jun 12, 2024, Reinette Chatre wrote:
> ---
>  .../selftests/kvm/include/x86_64/processor.h    | 17 +++++++++++++++++
>  .../selftests/kvm/lib/x86_64/processor.c        | 11 +++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
> index c0c7c1fe93f9..383a0f7fa9ef 100644
> --- a/tools/testing/selftests/kvm/include/x86_64/processor.h
> +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
> @@ -23,6 +23,7 @@
>  
>  extern bool host_cpu_is_intel;
>  extern bool host_cpu_is_amd;
> +extern uint32_t tsc_khz;

This should be guest_tsc_khz, because it most definitely isn't guaranteed to be
the host TSC frequency.  And because it's global, we should try to avoid variable
shadowing, e.g. tsc_scaling_sync.c also defines tsc_khz.

Which, by the by, probably needs to be addressed, i.e. we should probably add a
helper for setting KVM_SET_TSC_KHZ+guest_tsc_khz.

I think it also makes sense to have this be a 64-bit value, even though KVM
*internally* tracks a 32-bit value.  That way we don't have to worry about
casting to avoid truncation.

>  /* Forced emulation prefix, used to invoke the emulator unconditionally. */
>  #define KVM_FEP "ud2; .byte 'k', 'v', 'm';"
> @@ -816,6 +817,22 @@ static inline void cpu_relax(void)
>  	asm volatile("rep; nop" ::: "memory");
>  }
>  
> +static inline void udelay(unsigned long usec)
> +{
> +	uint64_t start, now, cycles;
> +
> +	GUEST_ASSERT(tsc_khz);
> +	cycles = tsc_khz / 1000 * usec;
> +
> +	start = rdtsc();
> +	for (;;) {
> +		now = rdtsc();
> +		if (now - start >= cycles)
> +			break;
> +		cpu_relax();

Given that this is guest code, we should omit the PAUSE so that it doesn't trigger
PLE exits, i.e. to make the delay as accurate as possible.  Then this simply becomes:

	start = rdtsc();
	do {
		now = rdtsc();
	} while (now - start < cycles);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency
  2024-06-12 18:16 ` [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency Reinette Chatre
@ 2024-06-28 22:50   ` Sean Christopherson
  2024-06-29  0:39   ` VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX Sean Christopherson
  1 sibling, 0 replies; 13+ messages in thread
From: Sean Christopherson @ 2024-06-28 22:50 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel

On Wed, Jun 12, 2024, Reinette Chatre wrote:
> +/*
> + * Pick 25MHz for APIC bus frequency. Different enough from the default 1GHz.
> + * User can override via command line.
> + */
> +static uint64_t apic_hz = 25 * 1000 * 1000;
> +
> +/*
> + * Delay in msec that guest uses to determine APIC bus frequency.
> + * User can override via command line.
> + */
> +static unsigned long delay_ms = 100;

There's no need for these to be global, it's easy enough to pass them as params
to apic_guest_code().  Making is_x2apic is wortwhile as it cuts down on the noise,
but for these, I think it's better to keep them local.

> +
> +/*
> + * Possible TDCR values with matching divide count. Used to modify APIC
> + * timer frequency.
> + */
> +static struct {
> +	uint32_t tdcr;
> +	uint32_t divide_count;

These can/should all be const.

> +} tdcrs[] = {
> +	{0x0, 2},
> +	{0x1, 4},
> +	{0x2, 8},
> +	{0x3, 16},
> +	{0x8, 32},
> +	{0x9, 64},
> +	{0xa, 128},
> +	{0xb, 1},
> +};
> +
> +/* true if x2APIC test is running, false if xAPIC test is running. */
> +static bool is_x2apic;
> +
> +static void apic_enable(void)
> +{
> +	if (is_x2apic)
> +		x2apic_enable();
> +	else
> +		xapic_enable();
> +}
> +
> +static uint32_t apic_read_reg(unsigned int reg)
> +{
> +	return is_x2apic ? x2apic_read_reg(reg) : xapic_read_reg(reg);
> +}
> +
> +static void apic_write_reg(unsigned int reg, uint32_t val)
> +{
> +	if (is_x2apic)
> +		x2apic_write_reg(reg, val);
> +	else
> +		xapic_write_reg(reg, val);
> +}
> +
> +static void apic_guest_code(void)
> +{
> +	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
> +	const uint32_t tmict = ~0u;
> +	uint64_t tsc0, tsc1, freq;
> +	uint32_t tmcct;
> +	int i;
> +
> +	apic_enable();
> +
> +	/*
> +	 * Setup one-shot timer.  The vector does not matter because the
> +	 * interrupt should not fire.
> +	 */
> +	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
> +
> +	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
> +
> +		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
> +		apic_write_reg(APIC_TMICT, tmict);
> +
> +		tsc0 = rdtsc();
> +		udelay(delay_ms * 1000);
> +		tmcct = apic_read_reg(APIC_TMCCT);
> +		tsc1 = rdtsc();
> +
> +		/*
> +		 * Stop the timer _after_ reading the current, final count, as
> +		 * writing the initial counter also modifies the current count.
> +		 */
> +		apic_write_reg(APIC_TMICT, 0);
> +
> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> +		/* Check if measured frequency is within 1% of configured frequency. */

1% is likely too aggressive, i.e. we'll get false failures due to host activity.
On our systems, even a single pr_warn() in the timer path causes failure.  For
now, I think 5% is good enough, e.g. it'll catch cases where KVM is waaay off.
If we want to do better (or that's still too tight), then we can add a '-t <tolerance'
param or something.

> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);

Combine these into a single assert, and print the params, i.e. don't rely on the
line number to figure out what's up.

		__GUEST_ASSERT(freq < apic_hz * 105 / 100 && freq > apic_hz * 95 / 100,
			       "Frequency = %lu (wanted %lu - %lu), bus = %lu, div = %u, tsc = %lu",
			       freq, apic_hz * 95 / 100, apic_hz * 105 / 100,
			       apic_hz, tdcrs[i].divide_count, tsc_hz);

> +int main(int argc, char *argv[])
> +{
> +	int opt;
> +
> +	TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS));
> +
> +	while ((opt = getopt(argc, argv, "d:f:h")) != -1) {
> +		switch (opt) {
> +		case 'f':
> +			apic_hz = atol(optarg);
> +			break;
> +		case 'd':
> +			delay_ms = atol(optarg);
> +			break;
> +		case 'h':
> +			help(argv[0]);
> +			exit(0);
> +		default:
> +			help(argv[0]);
> +			exit(1);

Heh, selftests are anything but consistent, but this should be exit(KSFT_SKIP)
for both.

> +		}
> +	}
> +
> +	run_apic_bus_clock_test(false);
> +	run_apic_bus_clock_test(true);
> +}
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable
  2024-06-12 18:16 [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Reinette Chatre
  2024-06-12 18:16 ` [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility Reinette Chatre
  2024-06-12 18:16 ` [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency Reinette Chatre
@ 2024-06-28 22:55 ` Sean Christopherson
  2024-06-29  0:10   ` Reinette Chatre
  2 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2024-06-28 22:55 UTC (permalink / raw)
  To: Sean Christopherson, isaku.yamahata, pbonzini, erdemaktas,
	vkuznets, vannapurve, jmattson, mlevitsk, xiaoyao.li, chao.gao,
	rick.p.edgecombe, yuan.yao, Reinette Chatre
  Cc: kvm, linux-kernel

On Wed, 12 Jun 2024 11:16:10 -0700, Reinette Chatre wrote:
> Changes from v8:
> - v8: https://lore.kernel.org/lkml/cover.1718043121.git.reinette.chatre@intel.com/
> - Many changes to new udelay() utility patch as well as the APIC bus
>   frequency test aimed to make it more robust (additional ASSERTs,
>   consistent types, eliminate duplicate code, etc.) and useful with
>   support for more user configuration. Please refer to individual patches for
>   detailed changes.
> - Series applies cleanly to next branch of kvm-x86 with HEAD
>   e4e9e1067138e5620cf0500c3e5f6ebfb9d322c8.
> 
> [...]

Applied to kvm-x86 misc, with all the changes mentioned in my earlier replies.
I'm out next week, and don't want to merge the KVM changes without these tests,
hence the rushed application.

Please holler if you disagree with anything (or if I broke something).  I won't
respond until July 8th at the earliest, but worst case scenario we can do fixup
patches after 6.11-rc1.

[1/2] KVM: selftests: Add x86_64 guest udelay() utility
      https://github.com/kvm-x86/linux/commit/6b878cbb87bf
[2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency
      https://github.com/kvm-x86/linux/commit/82222ee7e84c

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable
  2024-06-28 22:55 ` [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Sean Christopherson
@ 2024-06-29  0:10   ` Reinette Chatre
  2024-07-10 15:42     ` Sean Christopherson
  0 siblings, 1 reply; 13+ messages in thread
From: Reinette Chatre @ 2024-06-29  0:10 UTC (permalink / raw)
  To: Sean Christopherson, isaku.yamahata, pbonzini, erdemaktas,
	vkuznets, vannapurve, jmattson, mlevitsk, xiaoyao.li, chao.gao,
	rick.p.edgecombe, yuan.yao
  Cc: kvm, linux-kernel

Hi Sean,

On 6/28/24 3:55 PM, Sean Christopherson wrote:
> On Wed, 12 Jun 2024 11:16:10 -0700, Reinette Chatre wrote:
>> Changes from v8:
>> - v8: https://lore.kernel.org/lkml/cover.1718043121.git.reinette.chatre@intel.com/
>> - Many changes to new udelay() utility patch as well as the APIC bus
>>    frequency test aimed to make it more robust (additional ASSERTs,
>>    consistent types, eliminate duplicate code, etc.) and useful with
>>    support for more user configuration. Please refer to individual patches for
>>    detailed changes.
>> - Series applies cleanly to next branch of kvm-x86 with HEAD
>>    e4e9e1067138e5620cf0500c3e5f6ebfb9d322c8.
>>
>> [...]
> 
> Applied to kvm-x86 misc, with all the changes mentioned in my earlier replies.
> I'm out next week, and don't want to merge the KVM changes without these tests,
> hence the rushed application.
> 
> Please holler if you disagree with anything (or if I broke something).  I won't
> respond until July 8th at the earliest, but worst case scenario we can do fixup
> patches after 6.11-rc1.

Thank you very much for taking the time to make the changes and apply the patches.
All the changes look good to me and passes my testing.

Now that the x86 udelay() utility no longer use cpu_relax(), should ARM
and RISC-V's udelay() be modified to match in this regard? I can prepare
(unable to test) changes for you to consider on your return.

Reinette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
  2024-06-12 18:16 ` [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency Reinette Chatre
  2024-06-28 22:50   ` Sean Christopherson
@ 2024-06-29  0:39   ` Sean Christopherson
  2024-07-03 20:14     ` Reinette Chatre
  1 sibling, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2024-06-29  0:39 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7048 bytes --]

Forking this off to try and avoid confusion...

On Wed, Jun 12, 2024, Reinette Chatre wrote:
> +/*
> + * Possible TDCR values with matching divide count. Used to modify APIC
> + * timer frequency.
> + */
> +static struct {
> +	uint32_t tdcr;
> +	uint32_t divide_count;
> +} tdcrs[] = {
> +	{0x0, 2},
> +	{0x1, 4},
> +	{0x2, 8},
> +	{0x3, 16},
> +	{0x8, 32},
> +	{0x9, 64},
> +	{0xa, 128},
> +	{0xb, 1},
> +};

...

> +static void apic_guest_code(void)
> +{
> +	uint64_t tsc_hz = (uint64_t)tsc_khz * 1000;
> +	const uint32_t tmict = ~0u;
> +	uint64_t tsc0, tsc1, freq;
> +	uint32_t tmcct;
> +	int i;
> +
> +	apic_enable();
> +
> +	/*
> +	 * Setup one-shot timer.  The vector does not matter because the
> +	 * interrupt should not fire.
> +	 */
> +	apic_write_reg(APIC_LVTT, APIC_LVT_TIMER_ONESHOT | APIC_LVT_MASKED);
> +
> +	for (i = 0; i < ARRAY_SIZE(tdcrs); i++) {
> +
> +		apic_write_reg(APIC_TDCR, tdcrs[i].tdcr);
> +		apic_write_reg(APIC_TMICT, tmict);
> +
> +		tsc0 = rdtsc();
> +		udelay(delay_ms * 1000);
> +		tmcct = apic_read_reg(APIC_TMCCT);
> +		tsc1 = rdtsc();
> +
> +		/*
> +		 * Stop the timer _after_ reading the current, final count, as
> +		 * writing the initial counter also modifies the current count.
> +		 */
> +		apic_write_reg(APIC_TMICT, 0);
> +
> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> +		/* Check if measured frequency is within 1% of configured frequency. */
> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> +	}

This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
bug.  It looks like something APICv related is clobbering internal VMX timer state?
Or maybe there's a tearing or truncation issue?

As mentioned ad nauseum at this point, I'm offline all of next week, so hopefully
there's enough info here to get a root cause...


A spurious VM-Exit will occur after programming a vmcs.PREEMPTION_TIMER_VALUE that
shouldn't exit.  Every observed failure occurs when bits 27:16 are zero, with a
small value in bits 15:0, e.g. VM-Enter with a timer value of 0xe0003bf7 or
0xa0006db6 will cause a near-immediate VM-Exit.

Weirdly, it doesn't always happen, e.g. I've observed rollover from 0xa000xxxx
to 0x9fffxxxx without failure.  However, the *test* failure is 100% reproducible,
i.e. it's not _that_ rare of a failure.  So maybe there's state build-up required?
E.g. in the failing cases, there are 10s of entries with a slightly larger timer
value, with no preemption timer exit (the host's tick IRQ interrupts the guest,
and then KVM reprograms the VMX timer).

Even more sketchy, the failure only occurs if APICv is enabled.  Turning off APICv
makes the problem go away (I initially suspected KVM was somehow botching the TMCCT
emulation).  I am 99.9% certain there is no KVM APICv bug that is causing problems.
Our IVB servers don't fail (even with APICv enabled), nor does my Raptor Lake client
box (with APICv enabled).  And I confirmed that the VMX timer is still getting
programmed with the same sequence that fails when APICv is enabled.

(Before I realized the pattern of values), I sanity checked the VMCS field just
before VM-Enter, and again after VM-Exit (KVM runs without the CPU save
vmcs.PREEMPTION_TIMER_VALUE on exit).

I also verified the CPU thinks the timer has expired by enabling "save on exit"
and verifying vmcs.PREEMPTION_TIMER_VALUE is indeed '0', and that KVM really did
get a PREEMPTION_TIMER exit.

Attached is the debug patch I used to get the below data (sort of; one of the
post-exit prints is without saving vmcs.PREEMPTION_TIMER_VALUE on exit).

In kvm_hypercall (ignore the name, I piggybacked a tracepoint because trying to
log to dmesg was too slow, guest literally couldn't make forward progress), a1 is
the VMX timer value programmed by KVM (0xe0003bf7).
 
 apic_bus_clock_-11419   [056] d..2.   146.771179: kvm_hypercall: nr 0x2c8a9e9cc6703c a0 0x2c8b0e9ce46c37 a1 0xe0003bf7 a2 0xe0003bf7 a3 0x7  
 apic_bus_clock_-11419   [056] d..2.   146.771242: kvm_exit: vcpu 0 reason PREEMPTION_TIMER rip 0x402065 info1 0x0000000000000000 info2 0x0000000000000000 intr_info 0x00000000 error_code 0x00000000
 VMX timer exit, VMCS = e0003bf7, delta = e0003690

and the post-exit print with the attached patch:

  kvm_intel: VMX timer exit, EXIT_REASON = 34, VMCS = 0, delta = e00037e9

And I've been fiddling with the below hack to coerce KVM into programming VMX
timer values.  Had I more time, I would have booted kernels with the ability to
properly fuzz the timer values.

Note, with TMICT=-1, only divide_count=1 will fail, because larger divide counts
result in a timeout that doesn't fit into the 32-bit VMX timer field (don't ask
me how long it took me to realize the divide count affects the time frequency,
not the actual count, *sigh*).

diff --git a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
index f8916bb34405..4eb49e20ff9c 100644
--- a/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
+++ b/tools/testing/selftests/kvm/x86_64/apic_bus_clock_test.c
@@ -22,13 +22,13 @@ static const struct {
        const uint32_t tdcr;
        const uint32_t divide_count;
 } tdcrs[] = {
-       {0x0, 2},
-       {0x1, 4},
-       {0x2, 8},
-       {0x3, 16},
-       {0x8, 32},
-       {0x9, 64},
-       {0xa, 128},
+       // {0x0, 2},
+       // {0x1, 4},
+       // {0x2, 8},
+       // {0x3, 16},
+       // {0x8, 32},
+       // {0x9, 64},
+       // {0xa, 128},
        {0xb, 1},
 };
 
@@ -55,10 +55,11 @@ static void apic_write_reg(unsigned int reg, uint32_t val)
                xapic_write_reg(reg, val);
 }
 
+uint32_t tmict = ~0u;
+
 static void apic_guest_code(uint64_t apic_hz, uint64_t delay_ms)
 {
        uint64_t tsc_hz = guest_tsc_khz * 1000;
-       const uint32_t tmict = ~0u;
        uint64_t tsc0, tsc1, freq;
        uint32_t tmcct;
        int i;
@@ -133,6 +134,7 @@ static void run_apic_bus_clock_test(uint64_t apic_hz, uint64_t delay_ms,
        vm = vm_create(1);
 
        sync_global_to_guest(vm, is_x2apic);
+       sync_global_to_guest(vm, tmict);
 
        vm_enable_cap(vm, KVM_CAP_X86_APIC_BUS_CYCLES_NS,
                      NSEC_PER_SEC / apic_hz);
@@ -174,7 +176,7 @@ int main(int argc, char *argv[])
 
        TEST_REQUIRE(kvm_has_cap(KVM_CAP_X86_APIC_BUS_CYCLES_NS));
 
-       while ((opt = getopt(argc, argv, "d:f:h")) != -1) {
+       while ((opt = getopt(argc, argv, "d:f:i:h")) != -1) {
                switch (opt) {
                case 'f':
                        apic_hz = atoi_positive("APIC bus frequency", optarg) * 1000 * 1000;
@@ -182,6 +184,9 @@ int main(int argc, char *argv[])
                case 'd':
                        delay_ms = atoi_positive("Delay in milliseconds", optarg);
                        break;
+               case 'i':
+                       tmict = ~0u - atoi_positive("offset", optarg);
+                       break;
                case 'h':
                default:
                        help(argv[0]);

[-- Attachment #2: 0001-debug.patch --]
[-- Type: text/x-diff, Size: 5168 bytes --]

From 42da584d5e4e2dfaf4296f2ea666f3a04c82052f Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 28 Jun 2024 16:33:32 -0700
Subject: [PATCH] debug

---
 arch/x86/kvm/vmx/vmx.c | 33 +++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h |  1 +
 arch/x86/kvm/x86.c     |  1 +
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..e83351c690d9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,7 @@ static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
 
 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5997,6 +5996,8 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u32 delta_tsc;
+	u64 tscl;
 
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
@@ -6020,6 +6021,16 @@ static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 	if (is_guest_mode(vcpu))
 		return EXIT_FASTPATH_NONE;
 
+	tscl = rdtsc();
+	if (vmx->hv_deadline_tsc > tscl)
+		delta_tsc = (u32)((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi);
+	else
+		delta_tsc = 0;
+
+	pr_warn("VMX timer exit, EXIT_REASON = %x, VMCS = %x, delta = %x\n",
+		vmcs_read32(VM_EXIT_REASON),
+		vmcs_read32(VMX_PREEMPTION_TIMER_VALUE), delta_tsc);
+
 	kvm_lapic_expired_hv_timer(vcpu);
 	return EXIT_FASTPATH_REENTER_GUEST;
 }
@@ -7197,6 +7208,8 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	u32 delta_tsc;
 
 	if (force_immediate_exit) {
+		trace_kvm_hypercall(0, 0, 0, 0, 0);
+		vmx->preemption_timer = 0;
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
@@ -7208,9 +7221,14 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 		else
 			delta_tsc = 0;
 
+		trace_kvm_hypercall(tscl, vmx->hv_deadline_tsc, delta_tsc,
+				    ((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi),
+				    cpu_preemption_timer_multi);
+		vmx->preemption_timer = delta_tsc;
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (!vmx->loaded_vmcs->hv_timer_soft_disabled) {
+		trace_kvm_hypercall(-1, -1, -1, -1, -1);
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = true;
 	}
@@ -7218,6 +7236,8 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 
 void noinstr vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
 {
+	WARN_ON(!vmx->loaded_vmcs->hv_timer_soft_disabled &&
+		vmcs_read32(VMX_PREEMPTION_TIMER_VALUE) != vmx->preemption_timer);
 	if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) {
 		vmx->loaded_vmcs->host_state.rsp = host_rsp;
 		vmcs_writel(HOST_RSP, host_rsp);
@@ -8128,7 +8148,7 @@ int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	    delta_tsc && u64_shl_div_u64(delta_tsc,
 				kvm_caps.tsc_scaling_ratio_frac_bits,
 				vcpu->arch.l1_tsc_scaling_ratio, &delta_tsc))
-		return -ERANGE;
+		goto out_of_range;
 
 	/*
 	 * If the delta tsc can't fit in the 32 bit after the multi shift,
@@ -8137,11 +8157,16 @@ int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	 * on every vmentry is costly so we just use an hrtimer.
 	 */
 	if (delta_tsc >> (cpu_preemption_timer_multi + 32))
-		return -ERANGE;
+		goto out_of_range;
 
+	trace_kvm_hypercall(tscl, vmx->hv_deadline_tsc, delta_tsc, -1, -1);
 	vmx->hv_deadline_tsc = tscl + delta_tsc;
 	*expired = !delta_tsc;
 	return 0;
+
+out_of_range:
+	vmx->hv_deadline_tsc = -1;
+	return -ERANGE;
 }
 
 void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..ecafbb11519d 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -341,6 +341,7 @@ struct vcpu_vmx {
 
 	/* apic deadline value in host tsc */
 	u64 hv_deadline_tsc;
+	u32 preemption_timer;
 
 	unsigned long host_debugctlmsr;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..00847259bcc4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14024,6 +14024,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_hypercall);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);

base-commit: 128c71b7f489d8115d29a487367c648f8acc8374
-- 
2.45.2.803.g4e1b14247a-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
  2024-06-29  0:39   ` VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX Sean Christopherson
@ 2024-07-03 20:14     ` Reinette Chatre
  2024-07-03 21:37       ` Reinette Chatre
  2024-07-08  5:55       ` Yuan Yao
  0 siblings, 2 replies; 13+ messages in thread
From: Reinette Chatre @ 2024-07-03 20:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel, Hao, Xudong

Hi Sean,

On 6/28/24 5:39 PM, Sean Christopherson wrote:
> Forking this off to try and avoid confusion...
> 
> On Wed, Jun 12, 2024, Reinette Chatre wrote:

...

>> +
>> +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
>> +		/* Check if measured frequency is within 1% of configured frequency. */
>> +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
>> +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
>> +	}
> 
> This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> bug.  It looks like something APICv related is clobbering internal VMX timer state?
> Or maybe there's a tearing or truncation issue?

It has been a few days. Just a note to let you know that we are investigating this.
On my side I have not yet been able to reproduce this issue. I tested
kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
apic_bus_clock_test and they all passed. Since I have lack of experience here there are
some Intel virtualization experts helping out with this investigation and I hope that
they will be some insights from the analysis and testing that you already provided.

Reinette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
  2024-07-03 20:14     ` Reinette Chatre
@ 2024-07-03 21:37       ` Reinette Chatre
  2024-07-08  5:55       ` Yuan Yao
  1 sibling, 0 replies; 13+ messages in thread
From: Reinette Chatre @ 2024-07-03 21:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel, Hao, Xudong


(a short update ...)

On 7/3/24 1:14 PM, Reinette Chatre wrote:
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
>> Forking this off to try and avoid confusion...
>>
>> On Wed, Jun 12, 2024, Reinette Chatre wrote:
> 
> ...
> 
>>> +
>>> +        freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
>>> +        /* Check if measured frequency is within 1% of configured frequency. */
>>> +        GUEST_ASSERT(freq < apic_hz * 101 / 100);
>>> +        GUEST_ASSERT(freq > apic_hz * 99 / 100);
>>> +    }
>>
>> This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
>> bug.  It looks like something APICv related is clobbering internal VMX timer state?
>> Or maybe there's a tearing or truncation issue?
> 
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.

I have now been able to test on SKX also and I am not yet able to reproduce. For
reference, the systems I tested on are:
SKX: https://ark.intel.com/content/www/us/en/ark/products/120507/intel-xeon-platinum-8170m-processor-35-75m-cache-2-10-ghz.html
ICX: https://ark.intel.com/content/www/us/en/ark/products/212459/intel-xeon-platinum-8360y-processor-54m-cache-2-40-ghz.html
CLX: https://ark.intel.com/content/www/us/en/ark/products/192476/intel-xeon-platinum-8260l-processor-35-75m-cache-2-40-ghz.html

Reinette

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX
  2024-07-03 20:14     ` Reinette Chatre
  2024-07-03 21:37       ` Reinette Chatre
@ 2024-07-08  5:55       ` Yuan Yao
  1 sibling, 0 replies; 13+ messages in thread
From: Yuan Yao @ 2024-07-08  5:55 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: Sean Christopherson, isaku.yamahata, pbonzini, erdemaktas,
	vkuznets, vannapurve, jmattson, mlevitsk, xiaoyao.li, chao.gao,
	rick.p.edgecombe, yuan.yao, kvm, linux-kernel, Hao, Xudong

On Wed, Jul 03, 2024 at 01:14:09PM -0700, Reinette Chatre wrote:
> Hi Sean,
>
> On 6/28/24 5:39 PM, Sean Christopherson wrote:
> > Forking this off to try and avoid confusion...
> >
> > On Wed, Jun 12, 2024, Reinette Chatre wrote:
>
> ...
>
> > > +
> > > +		freq = (tmict - tmcct) * tdcrs[i].divide_count * tsc_hz / (tsc1 - tsc0);
> > > +		/* Check if measured frequency is within 1% of configured frequency. */
> > > +		GUEST_ASSERT(freq < apic_hz * 101 / 100);
> > > +		GUEST_ASSERT(freq > apic_hz * 99 / 100);
> > > +	}
> >
> > This test fails on our SKX, CLX, and ICX systems due to what appears to be a CPU
> > bug.  It looks like something APICv related is clobbering internal VMX timer state?
> > Or maybe there's a tearing or truncation issue?
>
> It has been a few days. Just a note to let you know that we are investigating this.
> On my side I have not yet been able to reproduce this issue. I tested
> kvm-x86-next-2024.06.28 on an ICX and an CLX system by running 100 iterations of
> apic_bus_clock_test and they all passed. Since I have lack of experience here there are
> some Intel virtualization experts helping out with this investigation and I hope that
> they will be some insights from the analysis and testing that you already provided.
>
> Reinette
>

I can reproduce this on my side, even with apicv disabled by
'insmod $kernel_path/arch/x86/kvm/kvm-intel.ko enable_apicv=N'.
@Sean I think we're observing same issue, please see the details
below:

apic_bus_clock_test can't reproduce this may because the
preemption timer value calculation relies on apic bus
frequency/TMICT/Divide count/host TSC frequency and ratio
between preemption timer and host TSC frequency, too many
factors to generate the 'magic' value there. So I changed
KVM and added a small KVM kselftest tool to set the
preemption timer value directly from guest, this makes the
reproducing easily. The changes are attached at end of this
comment.

The trace I captured below came form host with 1.7GHz TSC,
the VM_EXIT_SAVE_VMX_PREEMPTION_TIMER is enabled to get the
cpu saved vmcs.VMX_PREEMPTION_TIMER_VALUE after VMEXIT. I
set the vmcs.VMX_PREEMPTION_TIMER_VALUE to 0x880042ad which
is the 'magic' number on this 1.7Ghz TSC machine:

preempt_test  20677.199521: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5554 a1:0x880042ad a2:0x880042ad a3:0x20462e98d9b9
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The value read back from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, for double confirmation.
  a3: The host tsc at the time point, debug only.

preempt_test  20677.199579:      kvm:kvm_exit: reason PREEMPTION_TIMER rip 0x40274d info 0 0 intr 0

preempt_test  20677.199579: kvm:kvm_vmx_debug: kvm_vmx_debug: 2, a0:0x34 a1:0x0 a2:0x87fea9b0 a3:0x20462e9a5749
  a0: The VMEXIT reason, 0x34 is preemption timer VMEXIT.
  a1: The read back vmcs.VMX_PREEMPTION_TIMER_VALUE value, here 0.
  a2: The next preemption timer value should be written to
      vmcs, calculates from the (target tsc - current tsc) >>
      7. Now the preemption timer vmexit happend only after
      ~58 microseconds elapsed, it should happen after
      ~171.79 seconds but not such soon, the issue is
      reproduced.

Another more easy way to observe this symptom w/o care the
'magic' preemption timer vlaue is use the maximum preemption
timer value 0xffffffff, below log w/ 0xffffffff is captured
from same machine:

preempt_test 20530.456589: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77fd5551 a1:0xffffffff a2:0xffffffff a3:0x200c1971ca5d

  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT.
  a1: The new preemption timer value wrote to
      vmcs.VMX_PREEMPTION_TIMER_VALUE.
  a2: The read back value from
      vmcs.VMX_PREEMPTION_TIMER_VALUE, double confirmation.
  a3: the host tsc at the time point, debug only.

preempt_test 20530.456690:      kvm:kvm_exit: reason VMCALL rip 0x4131a0 info 0 0 intr 0
  The preempt_test checks preemption timer state every
  100us, this VMEXIT is expected behavior.

preempt_test 20530.456691:     kvm:kvm_entry: vcpu 0, rip 0x4131a3

preempt_test 20530.456691: kvm:kvm_vmx_debug: kvm_vmx_debug: 3, a0:0x77ff82cc a1:0xfffe900b a2:0xfffe900b a3:0x200c19746d45
  a0: The previous vmcs.VMX_PREEMPTION_TIMER_VALUE value
      saved by CPU when VMEXIT. The difference value
      shouldn't be such huge number 0xffffffff - 0x77ff82cc
      = 0x88007D33 when just ~100us elapsed from previous
      VMENTRY, the issue is reproduced.

Use 0x88000000 as preemption timer value to verify this
preempt_test tool, the preemption timer VMEXIT happend after
~171.45 seconds which is expected behavior on the host
1.7Ghz TSC system:

The preemption timer VMEXIT should happen after:
2281701376 × 128 / 1700000000 = 171.79 seconds.

Attached my changes in KVM and kselftest tool for
reproducing here, based on:
https://github.com/kvm-x86/linux.git
tag:kvm-x86-next-2024.06.28

Patch 01:

From a977bf12a8cd1bbe401e68d3702c0b5aa3bf66e4 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@intel.com>
Date: Thu, 4 Jul 2024 09:59:55 +0800
Subject: [PATCH 1/2] KVM: x86: Introudce trace_kvm_vmx_debug()

debug only common trace.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
---
 arch/x86/kvm/trace.h | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c   |  1 +
 2 files changed, 29 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index d3aeffd6ae75..7b9eb23d71d3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -34,6 +34,34 @@ TRACE_EVENT(kvm_entry,
 		  __entry->immediate_exit ? "[immediate exit]" : "")
 );

+TRACE_EVENT(kvm_vmx_debug,
+	    TP_PROTO(unsigned long n, unsigned long a0,
+		     unsigned long a1,
+		     unsigned long a2,
+		     unsigned long a3),
+	    TP_ARGS(n, a0, a1, a2, a3),
+
+	TP_STRUCT__entry(
+		 __field(	unsigned long,	n		)
+		__field(	unsigned long,	a0		)
+		__field(	unsigned long,	a1		)
+		__field(	unsigned long,	a2		)
+		__field(	unsigned long,	a3		)
+	),
+
+	TP_fast_assign(
+		__entry->n      = n;
+		__entry->a0		= a0;
+		__entry->a1		= a1;
+		__entry->a2		= a2;
+		__entry->a3		= a3;
+	),
+
+	TP_printk("kvm_vmx_debug: %ld, a0:0x%lx a1:0x%lx a2:0x%lx a3:0x%lx",
+		  __entry->n, __entry->a0, __entry->a1, __entry->a2, __entry->a3)
+);
+
+
 /*
  * Tracepoint for hypercall.
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..6d1972d6c988 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14036,6 +14036,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_msr_protocol_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_rmp_fault);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmx_debug);

 static int __init kvm_x86_init(void)
 {
--
2.27.0

Patch 02:

From a3bdce4f2f93810372f6068396776ac702946d16 Mon Sep 17 00:00:00 2001
From: Yao Yuan <yuan.yao@intel.com>
Date: Wed, 3 Jul 2024 14:08:02 +0800
Subject: [PATCH 2/2] [DEBUG] preempt timer debug test

A specific kselftesting based program to allow set the VMX
preempt timer value from VM directly.

Introduce 2 hypercall 0x56780001/2, 01 to set the preempt
timer value, 02 to wait for the preemption time expired.

Usage:
Reload kvm applied this change, then:
$KRNEL_SRC_ROOT/tools/testing/selftests/kvm/x86_64/preempt_test -p 'preempt_timer_vale'

'preempt_timer_vale' is the preempt timer value in DEC format, HEX is not supported.

For example:

perf record -e "kvm:*" tools/testing/selftests/kvm/x86_64/preempt_test -p 2281718445

Above set the preempt value to 2281718445(0x880042AD) and
capture the trace, then check the kvm_vmx_debug in the trace
to know the preempt timer behavior.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 arch/x86/kvm/vmx/vmx.h                        |   5 +
 arch/x86/kvm/vmx/vmx.c                        | 113 +++++++++++++++++-
 .../selftests/kvm/x86_64/preempt_test.c       |  82 +++++++++++++
 4 files changed, 198 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/preempt_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index ad8b5d15f2bd..957509957f80 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -129,6 +129,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/amx_test
 TEST_GEN_PROGS_x86_64 += x86_64/max_vcpuid_cap_test
 TEST_GEN_PROGS_x86_64 += x86_64/triple_fault_event_test
 TEST_GEN_PROGS_x86_64 += x86_64/recalc_apic_map_test
+TEST_GEN_PROGS_x86_64 += x86_64/preempt_test
 TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 42498fa63abb..82ea0ccc7a63 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -368,6 +368,11 @@ struct vcpu_vmx {

 	/* ve_info must be page aligned. */
 	struct vmx_ve_information *ve_info;
+
+	volatile bool debug_timer;
+	bool debug_timer_set_to_hardware;
+	u32 debug_timer_val;
+	u64 debug_timer_deadline_tsc;
 };

 struct kvm_vmx {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..73f084c29f9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4431,8 +4431,9 @@ static u32 vmx_vmexit_ctrl(void)
 	 * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
 	 * nested virtualization and thus allowed to be set in vmcs12.
 	 */
-	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
-			 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+	vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER);
+	pr_info("Set VM_EXIT_SAVE_VMX_PREEMPTION_TIMER forcedly for preempt timer debug\n");
+

 	if (vmx_pt_mode_is_system())
 		vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
@@ -5993,11 +5994,41 @@ static int handle_pml_full(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static fastpath_t handle_fastpath_debug_timer(struct kvm_vcpu *vcpu,
+					      bool force_immediate_exit)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	u64 tscl;
+	u32 delta;
+
+	tscl = rdtsc();
+
+	if (vmx->debug_timer_deadline_tsc > tscl)
+		delta = (u32)((vmx->debug_timer_deadline_tsc - tscl) >>
+			      cpu_preemption_timer_multi);
+	else
+		delta = 0;
+
+	trace_kvm_vmx_debug(2UL,
+			    (unsigned long)vmcs_read32(VM_EXIT_REASON),
+			    (unsigned long)vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+			    (unsigned long)delta, tscl);
+
+	vmx->debug_timer = false;
+
+	return EXIT_FASTPATH_REENTER_GUEST;
+}
+
 static fastpath_t handle_fastpath_preemption_timer(struct kvm_vcpu *vcpu,
 						   bool force_immediate_exit)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);

+	WARN_ON(vmx->debug_timer && force_immediate_exit);
+	if (vmx->debug_timer)
+		return handle_fastpath_debug_timer(vcpu,
+						   force_immediate_exit);
+
 	/*
 	 * In the *extremely* unlikely scenario that this is a spurious VM-Exit
 	 * due to the timer expiring while it was "soft" disabled, just eat the
@@ -6096,6 +6127,60 @@ static int handle_notify(struct kvm_vcpu *vcpu)
 	return 1;
 }

+static unsigned long vmx_debug_set_preempt_timer(struct kvm_vcpu *vcpu,
+						 unsigned long a0,
+						 unsigned long a1,
+						 unsigned long a2,
+						 unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->debug_timer = true;
+	vmx->debug_timer_set_to_hardware = false;
+	vmx->debug_timer_val = a0;
+	vmx->debug_timer_deadline_tsc = rdtsc() + (a0 << cpu_preemption_timer_multi);
+	pr_info("debug_timer = %u\n", (u32)a0);
+
+	return 0;
+}
+
+
+static unsigned long vmx_debug_get_preempt_timer_result(struct kvm_vcpu *vcpu,
+							unsigned long a0,
+							unsigned long a1,
+							unsigned long a2,
+							unsigned long a3)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (vmx->debug_timer)
+		return 1;
+	return 0;
+}
+
+static int vmx_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3;
+	unsigned long ret;
+
+	nr = kvm_rax_read(vcpu);
+	if (nr != 0x87650001 && nr != 0x87650002)
+		return kvm_emulate_hypercall(vcpu);
+
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+
+	if (nr == 0x87650001)
+		ret = vmx_debug_set_preempt_timer(vcpu, a0, a1, a2, a3);
+	else
+		ret = vmx_debug_get_preempt_timer_result(vcpu, a0, a1, a2, a3);
+
+	kvm_rax_write(vcpu, ret);
+	return kvm_skip_emulated_instruction(vcpu);
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -6117,7 +6202,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_INVD]		      = kvm_emulate_invd,
 	[EXIT_REASON_INVLPG]		      = handle_invlpg,
 	[EXIT_REASON_RDPMC]                   = kvm_emulate_rdpmc,
-	[EXIT_REASON_VMCALL]                  = kvm_emulate_hypercall,
+	[EXIT_REASON_VMCALL]                  = vmx_emulate_hypercall,
 	[EXIT_REASON_VMCLEAR]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMLAUNCH]		      = handle_vmx_instruction,
 	[EXIT_REASON_VMPTRLD]		      = handle_vmx_instruction,
@@ -7199,6 +7284,28 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu, bool force_immediate_exit
 	if (force_immediate_exit) {
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
 		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
+	} else if (vmx->debug_timer) {
+		u32 old;
+
+		tscl = rdtsc();
+
+		if (!vmx->debug_timer_set_to_hardware) {
+			delta_tsc = vmx->debug_timer_val;
+			vmx->debug_timer_set_to_hardware = true;
+		} else {
+			if (vmx->debug_timer_deadline_tsc > tscl)
+				delta_tsc = (u32)((vmx->debug_timer_deadline_tsc - tscl)
+						  >> cpu_preemption_timer_multi);
+			else
+				delta_tsc = 0;
+		}
+
+		old = vmcs_read32(VMX_PREEMPTION_TIMER_VALUE);
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
+		trace_kvm_vmx_debug(3UL, old,
+				    vmcs_read32(VMX_PREEMPTION_TIMER_VALUE),
+				    delta_tsc, tscl);
+		vmx->loaded_vmcs->hv_timer_soft_disabled = false;
 	} else if (vmx->hv_deadline_tsc != -1) {
 		tscl = rdtsc();
 		if (vmx->hv_deadline_tsc > tscl)
diff --git a/tools/testing/selftests/kvm/x86_64/preempt_test.c b/tools/testing/selftests/kvm/x86_64/preempt_test.c
new file mode 100644
index 000000000000..2e58cfee61d0
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/preempt_test.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024 Intel Corporation
+ *
+ * Debug the preemption timer behavior
+ */
+
+#include "test_util.h"
+#include "processor.h"
+#include "ucall_common.h"
+
+uint32_t preempt_timer_val = 0x1000000;
+static void guest_code(uint64_t apic_hz, uint64_t delay_ms)
+{
+	volatile unsigned long r;
+
+	kvm_hypercall(0x87650001, preempt_timer_val, 0, 0, 0);
+	do {
+		udelay(100);
+		r = kvm_hypercall(0x87650002, 0, 0, 0, 0);
+	} while(r != 0);
+
+	GUEST_DONE();
+}
+
+static void do_test(struct kvm_vcpu *vcpu)
+{
+	bool done = false;
+	struct ucall uc;
+
+	while (!done) {
+		vcpu_run(vcpu);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_DONE:
+			done = true;
+			break;
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		default:
+			break;
+		}
+	}
+}
+
+static void run_test(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	vm = vm_create(1);
+
+	sync_global_to_guest(vm, preempt_timer_val);
+
+	vcpu = vm_vcpu_add(vm, 0, guest_code);
+
+	do_test(vcpu);
+
+	kvm_vm_free(vm);
+}
+
+
+int main(int argc, char *argv[])
+{
+	int opt;
+
+	while ((opt = getopt(argc, argv, "p:h")) != -1) {
+		switch (opt) {
+		case 'p':
+			preempt_timer_val = atoi(optarg);
+			break;
+		default:
+			exit(KSFT_SKIP);
+		}
+	}
+
+	printf("preempt timer value:%u(0x%x)\n",
+	       preempt_timer_val, preempt_timer_val);
+
+	run_test();
+}
--
2.27.0

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable
  2024-06-29  0:10   ` Reinette Chatre
@ 2024-07-10 15:42     ` Sean Christopherson
  2024-07-10 17:14       ` Reinette Chatre
  0 siblings, 1 reply; 13+ messages in thread
From: Sean Christopherson @ 2024-07-10 15:42 UTC (permalink / raw)
  To: Reinette Chatre
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel

On Fri, Jun 28, 2024, Reinette Chatre wrote:
> Hi Sean,
> 
> On 6/28/24 3:55 PM, Sean Christopherson wrote:
> > On Wed, 12 Jun 2024 11:16:10 -0700, Reinette Chatre wrote:
> > > Changes from v8:
> > > - v8: https://lore.kernel.org/lkml/cover.1718043121.git.reinette.chatre@intel.com/
> > > - Many changes to new udelay() utility patch as well as the APIC bus
> > >    frequency test aimed to make it more robust (additional ASSERTs,
> > >    consistent types, eliminate duplicate code, etc.) and useful with
> > >    support for more user configuration. Please refer to individual patches for
> > >    detailed changes.
> > > - Series applies cleanly to next branch of kvm-x86 with HEAD
> > >    e4e9e1067138e5620cf0500c3e5f6ebfb9d322c8.
> > > 
> > > [...]
> > 
> > Applied to kvm-x86 misc, with all the changes mentioned in my earlier replies.
> > I'm out next week, and don't want to merge the KVM changes without these tests,
> > hence the rushed application.
> > 
> > Please holler if you disagree with anything (or if I broke something).  I won't
> > respond until July 8th at the earliest, but worst case scenario we can do fixup
> > patches after 6.11-rc1.
> 
> Thank you very much for taking the time to make the changes and apply the patches.
> All the changes look good to me and passes my testing.
> 
> Now that the x86 udelay() utility no longer use cpu_relax(), should ARM
> and RISC-V's udelay() be modified to match in this regard? I can prepare
> (unable to test) changes for you to consider on your return.

I don't think so?  IIUC, arm64's "yield", used by cpu_relax() doesn't trigger the
"on spin" exists.  Such exist are only triggered by "wfet" and friends.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable
  2024-07-10 15:42     ` Sean Christopherson
@ 2024-07-10 17:14       ` Reinette Chatre
  0 siblings, 0 replies; 13+ messages in thread
From: Reinette Chatre @ 2024-07-10 17:14 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, pbonzini, erdemaktas, vkuznets, vannapurve,
	jmattson, mlevitsk, xiaoyao.li, chao.gao, rick.p.edgecombe,
	yuan.yao, kvm, linux-kernel



On 7/10/24 8:42 AM, Sean Christopherson wrote:
> On Fri, Jun 28, 2024, Reinette Chatre wrote:

>> Now that the x86 udelay() utility no longer use cpu_relax(), should ARM
>> and RISC-V's udelay() be modified to match in this regard? I can prepare
>> (unable to test) changes for you to consider on your return.
> 
> I don't think so?  IIUC, arm64's "yield", used by cpu_relax() doesn't trigger the
> "on spin" exists.  Such exist are only triggered by "wfet" and friends.

ah, I see, thank you very much Sean.

Reinette

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-07-10 17:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-12 18:16 [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Reinette Chatre
2024-06-12 18:16 ` [PATCH V9 1/2] KVM: selftests: Add x86_64 guest udelay() utility Reinette Chatre
2024-06-28 22:46   ` Sean Christopherson
2024-06-12 18:16 ` [PATCH V9 2/2] KVM: selftests: Add test for configure of x86 APIC bus frequency Reinette Chatre
2024-06-28 22:50   ` Sean Christopherson
2024-06-29  0:39   ` VMX Preemption Timer appears to be buggy on SKX, CLX, and ICX Sean Christopherson
2024-07-03 20:14     ` Reinette Chatre
2024-07-03 21:37       ` Reinette Chatre
2024-07-08  5:55       ` Yuan Yao
2024-06-28 22:55 ` [PATCH V9 0/2] KVM: x86: Make bus clock frequency for vAPIC timer configurable Sean Christopherson
2024-06-29  0:10   ` Reinette Chatre
2024-07-10 15:42     ` Sean Christopherson
2024-07-10 17:14       ` Reinette Chatre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).