kvm.vger.kernel.org archive mirror
* [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements
@ 2024-09-14 10:17 Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 01/18] x86: pmu: Remove duplicate code in pmu_init() Dapeng Mi
                   ` (17 more replies)
  0 siblings, 18 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Changes:
v5 -> v6:
  * Limit the maximum number of fixed counters to
    MIN(pmu.nr_fixed_counters, ARRAY_SIZE(fixed_events)) to avoid
    out-of-bound access of fixed_events[]. If the number of supported
    fixed counters is larger than the array size of fixed_events[],
    print a message reminding users to update the test case. (Jim)
  * Limit the precise validation of the instructions & branches events
    to Intel processors only. (Sandipan Das)

The KUT pmu test passes on an Intel Sapphire Rapids platform.

History:
  v5: https://lore.kernel.org/all/20240703095712.64202-1-dapeng1.mi@linux.intel.com/
  v4: https://lore.kernel.org/all/20240419035233.3837621-1-dapeng1.mi@linux.intel.com/
  v3: https://lore.kernel.org/lkml/20240103031409.2504051-1-dapeng1.mi@linux.intel.com/ 
  v2: https://lore.kernel.org/lkml/20231031092921.2885109-1-dapeng1.mi@linux.intel.com/
  v1: https://lore.kernel.org/lkml/20231024075748.1675382-1-dapeng1.mi@linux.intel.com/

Dapeng Mi (17):
  x86: pmu: Remove blank line and redundant space
  x86: pmu: Refine fixed_events[] names
  x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line
  x86: pmu: Enlarge cnt[] length to 48 in check_counters_many()
  x86: pmu: Print measured event count if test fails
  x86: pmu: Fix potential out of bound access for fixed events
  x86: pmu: Fix cycles event validation failure
  x86: pmu: Use macro to replace hard-coded branches event index
  x86: pmu: Use macro to replace hard-coded ref-cycles event index
  x86: pmu: Use macro to replace hard-coded instructions event index
  x86: pmu: Enable and disable PMCs in loop() asm blob
  x86: pmu: Improve instruction and branches events verification
  x86: pmu: Improve LLC misses event verification
  x86: pmu: Adjust lower boundary of llc-misses event to 0 for legacy
    CPUs
  x86: pmu: Add IBPB indirect jump asm blob
  x86: pmu: Adjust lower boundary of branch-misses event
  x86: pmu: Optimize emulated instruction validation

Xiong Zhang (1):
  x86: pmu: Remove duplicate code in pmu_init()

 lib/x86/pmu.c |   5 -
 x86/pmu.c     | 427 ++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 346 insertions(+), 86 deletions(-)


base-commit: 17f6f2fd17935eb5e564f621c71244b4a3ddeafb
-- 
2.40.1



* [kvm-unit-tests patch v6 01/18] x86: pmu: Remove duplicate code in pmu_init()
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 02/18] x86: pmu: Remove blank line and redundant space Dapeng Mi
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

From: Xiong Zhang <xiong.y.zhang@intel.com>

There are two identical code blocks in the pmu_init() helper; remove
the duplicated one.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 lib/x86/pmu.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/lib/x86/pmu.c b/lib/x86/pmu.c
index 0f2afd65..d06e9455 100644
--- a/lib/x86/pmu.c
+++ b/lib/x86/pmu.c
@@ -16,11 +16,6 @@ void pmu_init(void)
 			pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
 		}
 
-		if (pmu.version > 1) {
-			pmu.nr_fixed_counters = cpuid_10.d & 0x1f;
-			pmu.fixed_counter_width = (cpuid_10.d >> 5) & 0xff;
-		}
-
 		pmu.nr_gp_counters = (cpuid_10.a >> 8) & 0xff;
 		pmu.gp_counter_width = (cpuid_10.a >> 16) & 0xff;
 		pmu.gp_counter_mask_length = (cpuid_10.a >> 24) & 0xff;
-- 
2.40.1



* [kvm-unit-tests patch v6 02/18] x86: pmu: Remove blank line and redundant space
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 01/18] x86: pmu: Remove duplicate code in pmu_init() Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 03/18] x86: pmu: Refine fixed_events[] names Dapeng Mi
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Code style changes only: remove a blank line and a redundant space.
No functional change.

Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index ce9abbe1..865dbe67 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -205,8 +205,7 @@ static noinline void __measure(pmu_counter_t *evt, uint64_t count)
 static bool verify_event(uint64_t count, struct pmu_event *e)
 {
 	// printf("%d <= %ld <= %d\n", e->min, count, e->max);
-	return count >= e->min  && count <= e->max;
-
+	return count >= e->min && count <= e->max;
 }
 
 static bool verify_counter(pmu_counter_t *cnt)
-- 
2.40.1



* [kvm-unit-tests patch v6 03/18] x86: pmu: Refine fixed_events[] names
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 01/18] x86: pmu: Remove duplicate code in pmu_init() Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 02/18] x86: pmu: Remove blank line and redundant space Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line Dapeng Mi
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

In the SDM the fixed counters are numbered from 0, but currently the
fixed_events[] names are numbered from 1, which can confuse users. So
change the fixed_events[] names to be numbered from 0 as well, keeping
them identical to the SDM.

Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 865dbe67..60db8bdf 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -45,9 +45,9 @@ struct pmu_event {
 	{"branches", 0x00c2, 1*N, 1.1*N},
 	{"branch misses", 0x00c3, 0, 0.1*N},
 }, fixed_events[] = {
-	{"fixed 1", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
-	{"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
-	{"fixed 3", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
+	{"fixed 0", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
+	{"fixed 1", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
+	{"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
 };
 
 char *buf;
-- 
2.40.1



* [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (2 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 03/18] x86: pmu: Refine fixed_events[] names Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2025-02-14 21:05   ` Sean Christopherson
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many() Dapeng Mi
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

When running the pmu test on SPR, the following #GP fault is reported.

Unhandled exception 13 #GP at ip 000000000040771f
error_code=0000      rflags=00010046      cs=00000008
rax=00000000004031ad rcx=0000000000000186 rdx=0000000000000000 rbx=00000000005142f0
rbp=0000000000514260 rsi=0000000000000020 rdi=0000000000000340
 r8=0000000000513a65  r9=00000000000003f8 r10=000000000000000d r11=00000000ffffffff
r12=000000000043003c r13=0000000000514450 r14=000000000000000b r15=0000000000000001
cr0=0000000080010011 cr2=0000000000000000 cr3=0000000001007000 cr4=0000000000000020
cr8=0000000000000000
        STACK: @40771f 40040e 400976 400aef 40148d 401da9 4001ad
FAIL pmu

It looks like the EVENTSEL0 MSR (0x186) is written with an invalid
value (0x4031ad), which causes a #GP.

Further investigation shows the #GP is caused by the code below in
__start_event().

wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
      evt->config | EVNTSEL_EN);

The evt->config is correctly initialized but appears to be corrupted
before being written to the MSR.

The original pmu_counter_t layout looks as below.

typedef struct {
	uint32_t ctr;
	uint64_t config;
	uint64_t count;
	int idx;
} pmu_counter_t;

Obviously the config field crosses two cache lines. When the two cache
lines are not updated simultaneously, the config value is corrupted.

Adjust the order of the pmu_counter_t fields and ensure the config
field is cache-line aligned.
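
For reference, a stand-alone sketch (not part of the patch; the
structs are redeclared here purely for illustration) that prints the
field offsets and sizes of the old and new layouts, so the cache-line
behaviour can be inspected on a given build:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {		/* old layout */
	uint32_t ctr;
	uint64_t config;
	uint64_t count;
	int idx;
} pmu_counter_old_t;

typedef struct {		/* new layout from this patch */
	uint32_t ctr;
	uint32_t idx;
	uint64_t config;
	uint64_t count;
} pmu_counter_new_t;

int main(void)
{
	printf("old: config at %zu, sizeof = %zu\n",
	       offsetof(pmu_counter_old_t, config),
	       sizeof(pmu_counter_old_t));
	printf("new: config at %zu, sizeof = %zu\n",
	       offsetof(pmu_counter_new_t, config),
	       sizeof(pmu_counter_new_t));
	return 0;
}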

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 60db8bdf..a0268db8 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -21,9 +21,9 @@
 
 typedef struct {
 	uint32_t ctr;
+	uint32_t idx;
 	uint64_t config;
 	uint64_t count;
-	int idx;
 } pmu_counter_t;
 
 struct pmu_event {
-- 
2.40.1



* [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many()
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (3 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2025-02-14 21:06   ` Sean Christopherson
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 06/18] x86: pmu: Print measured event count if test fails Dapeng Mi
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

There are already 8 GP counters and 4 fixed counters on the latest
Intel processors, such as Sapphire Rapids. The original cnt[] array
length of 10 is definitely not enough to cover all supported PMU
counters on these new processors, even though KVM currently supports
at most 3 fixed counters. This would cause out-of-bound memory access
and may trigger false alarms in PMU counter validation.

It's likely that more and more GP and fixed counters will be
introduced in the future, so directly extend the cnt[] array length to
48 once and for all. Based on the layout of IA32_PERF_GLOBAL_CTRL and
IA32_PERF_GLOBAL_STATUS, 48 looks enough for the near future.
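
As a hedged aside (bit positions per the SDM description of these
MSRs; the macro and assert below are illustrative and not part of the
test), the 48 can be derived as follows:

/*
 * IA32_PERF_GLOBAL_CTRL / IA32_PERF_GLOBAL_STATUS reserve the low 32
 * bits for per-GP-counter bits and the bits starting at bit 32 for
 * per-fixed-counter bits, so up to 32 GP + 16 fixed = 48 counters can
 * be described by these MSRs.
 */
#define MAX_NUM_COUNTERS	48

/* Illustrative runtime sanity check: */
assert(pmu.nr_gp_counters + pmu.nr_fixed_counters <= MAX_NUM_COUNTERS);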

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index a0268db8..b4de2680 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -255,7 +255,7 @@ static void check_fixed_counters(void)
 
 static void check_counters_many(void)
 {
-	pmu_counter_t cnt[10];
+	pmu_counter_t cnt[48];
 	int i, n;
 
 	for (i = 0, n = 0; n < pmu.nr_gp_counters; i++) {
-- 
2.40.1



* [kvm-unit-tests patch v6 06/18] x86: pmu: Print measured event count if test fails
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (4 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many() Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events Dapeng Mi
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Print the measured event count if the test case fails. This helps users
quickly know why the test case fails.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index b4de2680..53bb7ec0 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -204,8 +204,12 @@ static noinline void __measure(pmu_counter_t *evt, uint64_t count)
 
 static bool verify_event(uint64_t count, struct pmu_event *e)
 {
-	// printf("%d <= %ld <= %d\n", e->min, count, e->max);
-	return count >= e->min && count <= e->max;
+	bool pass = count >= e->min && count <= e->max;
+
+	if (!pass)
+		printf("FAIL: %d <= %"PRId64" <= %d\n", e->min, count, e->max);
+
+	return pass;
 }
 
 static bool verify_counter(pmu_counter_t *cnt)
-- 
2.40.1



* [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (5 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 06/18] x86: pmu: Print measured event count if test fails Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2025-02-14 21:07   ` Sean Christopherson
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure Dapeng Mi
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

The current PMU code doesn't check whether the number of PMU fixed
counters is larger than the number of pre-defined fixed events. If it
is, out-of-range memory accesses occur.

So limit the number of validated fixed counters to
MIN(pmu.nr_fixed_counters, ARRAY_SIZE(fixed_events)) and print a
message warning that the KUT pmu test needs to be updated if the
number of fixed counters exceeds the number of defined fixed events.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 53bb7ec0..cc940a61 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -54,6 +54,7 @@ char *buf;
 
 static struct pmu_event *gp_events;
 static unsigned int gp_events_size;
+static unsigned int fixed_counters_num;
 
 static inline void loop(void)
 {
@@ -113,8 +114,12 @@ static struct pmu_event* get_counter_event(pmu_counter_t *cnt)
 		for (i = 0; i < gp_events_size; i++)
 			if (gp_events[i].unit_sel == (cnt->config & 0xffff))
 				return &gp_events[i];
-	} else
-		return &fixed_events[cnt->ctr - MSR_CORE_PERF_FIXED_CTR0];
+	} else {
+		unsigned int idx = cnt->ctr - MSR_CORE_PERF_FIXED_CTR0;
+
+		if (idx < ARRAY_SIZE(fixed_events))
+			return &fixed_events[idx];
+	}
 
 	return (void*)0;
 }
@@ -204,8 +209,12 @@ static noinline void __measure(pmu_counter_t *evt, uint64_t count)
 
 static bool verify_event(uint64_t count, struct pmu_event *e)
 {
-	bool pass = count >= e->min && count <= e->max;
+	bool pass;
 
+	if (!e)
+		return false;
+
+	pass = count >= e->min && count <= e->max;
 	if (!pass)
 		printf("FAIL: %d <= %"PRId64" <= %d\n", e->min, count, e->max);
 
@@ -250,7 +259,7 @@ static void check_fixed_counters(void)
 	};
 	int i;
 
-	for (i = 0; i < pmu.nr_fixed_counters; i++) {
+	for (i = 0; i < fixed_counters_num; i++) {
 		cnt.ctr = fixed_events[i].unit_sel;
 		measure_one(&cnt);
 		report(verify_event(cnt.count, &fixed_events[i]), "fixed-%d", i);
@@ -271,7 +280,7 @@ static void check_counters_many(void)
 			gp_events[i % gp_events_size].unit_sel;
 		n++;
 	}
-	for (i = 0; i < pmu.nr_fixed_counters; i++) {
+	for (i = 0; i < fixed_counters_num; i++) {
 		cnt[n].ctr = fixed_events[i].unit_sel;
 		cnt[n].config = EVNTSEL_OS | EVNTSEL_USR;
 		n++;
@@ -419,7 +428,7 @@ static void check_rdpmc(void)
 		else
 			report(cnt.count == (u32)val, "fast-%d", i);
 	}
-	for (i = 0; i < pmu.nr_fixed_counters; i++) {
+	for (i = 0; i < fixed_counters_num; i++) {
 		uint64_t x = val & ((1ull << pmu.fixed_counter_width) - 1);
 		pmu_counter_t cnt = {
 			.ctr = MSR_CORE_PERF_FIXED_CTR0 + i,
@@ -744,6 +753,12 @@ int main(int ac, char **av)
 	printf("Fixed counters:      %d\n", pmu.nr_fixed_counters);
 	printf("Fixed counter width: %d\n", pmu.fixed_counter_width);
 
+	fixed_counters_num = MIN(pmu.nr_fixed_counters, ARRAY_SIZE(fixed_events));
+	if (pmu.nr_fixed_counters > ARRAY_SIZE(fixed_events))
+		report_info("Fixed counters number %d > defined fixed events %ld. "
+			    "Please update test case.", pmu.nr_fixed_counters,
+			    ARRAY_SIZE(fixed_events));
+
 	apic_write(APIC_LVTPC, PMI_VECTOR);
 
 	check_counters();
-- 
2.40.1



* [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (6 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2025-02-14 21:07   ` Sean Christopherson
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 09/18] x86: pmu: Use macro to replace hard-coded branches event index Dapeng Mi
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

When running the pmu test on SPR, the following failure is sometimes
reported.

PMU version:         2
GP counters:         8
GP counter width:    48
Mask length:         8
Fixed counters:      3
Fixed counter width: 48
1000000 <= 55109398 <= 50000000
FAIL: Intel: core cycles-0
1000000 <= 18279571 <= 50000000
PASS: Intel: core cycles-1
1000000 <= 12238092 <= 50000000
PASS: Intel: core cycles-2
1000000 <= 7981727 <= 50000000
PASS: Intel: core cycles-3
1000000 <= 6984711 <= 50000000
PASS: Intel: core cycles-4
1000000 <= 6773673 <= 50000000
PASS: Intel: core cycles-5
1000000 <= 6697842 <= 50000000
PASS: Intel: core cycles-6
1000000 <= 6747947 <= 50000000
PASS: Intel: core cycles-7

The "core cycles" count on the first counter exceeds the upper
boundary and leads to a failure, and then the "core cycles" count
drops gradually and reaches a stable state.

That looks reasonable. The "core cycles" event is defined as the first
event in the xxx_gp_events[] array and it is always verified first.
When the program loop() is executed for the first time, it needs to
warm up the pipeline and caches, e.g. it has to wait for the caches to
be filled. All this warm-up work leads to a quite large core cycles
count which may exceed the verification range.

To avoid a false positive on the cycles event caused by warm-up,
explicitly introduce a warm-up phase before really starting the
verification.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/x86/pmu.c b/x86/pmu.c
index cc940a61..e864ebc4 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -602,11 +602,27 @@ static void check_tsx_cycles(void)
 	report_prefix_pop();
 }
 
+static void warm_up(void)
+{
+	int i = 8;
+
+	/*
+	 * Since the cycles event is always run as the first event, there is
+	 * a warm-up phase that warms up the cache; the measured cycles value
+	 * may thus exceed the pre-defined cycles upper boundary and cause a
+	 * false positive. To avoid this, introduce a warm-up phase before
+	 * the real verification.
+	 */
+	while (i--)
+		loop();
+}
+
 static void check_counters(void)
 {
 	if (is_fep_available())
 		check_emulated_instr();
 
+	warm_up();
 	check_gp_counters();
 	check_fixed_counters();
 	check_rdpmc();
-- 
2.40.1



* [kvm-unit-tests patch v6 09/18] x86: pmu: Use macro to replace hard-coded branches event index
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (7 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 10/18] x86: pmu: Use macro to replace hard-coded ref-cycles " Dapeng Mi
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Currently the branches event index is a hard-coded number. Users could
add new events in the future and change the branches event index, but
not notice the hard-coded index and forget to update it accordingly,
which would break the test.

Thus, replace the hard-coded index with a macro.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index e864ebc4..496ee877 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -50,6 +50,22 @@ struct pmu_event {
 	{"fixed 2", MSR_CORE_PERF_FIXED_CTR0 + 2, 0.1*N, 30*N}
 };
 
+/*
+ * Events index in intel_gp_events[], ensure consistent with
+ * intel_gp_events[].
+ */
+enum {
+	INTEL_BRANCHES_IDX	= 5,
+};
+
+/*
+ * Events index in amd_gp_events[], ensure consistent with
+ * amd_gp_events[].
+ */
+enum {
+	AMD_BRANCHES_IDX	= 2,
+};
+
 char *buf;
 
 static struct pmu_event *gp_events;
@@ -492,7 +508,8 @@ static void check_emulated_instr(void)
 {
 	uint64_t status, instr_start, brnch_start;
 	uint64_t gp_counter_width = (1ull << pmu.gp_counter_width) - 1;
-	unsigned int branch_idx = pmu.is_intel ? 5 : 2;
+	unsigned int branch_idx = pmu.is_intel ?
+				  INTEL_BRANCHES_IDX : AMD_BRANCHES_IDX;
 	pmu_counter_t brnch_cnt = {
 		.ctr = MSR_GP_COUNTERx(0),
 		/* branch instructions */
-- 
2.40.1



* [kvm-unit-tests patch v6 10/18] x86: pmu: Use macro to replace hard-coded ref-cycles event index
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (8 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 09/18] x86: pmu: Use macro to replace hard-coded branches event index Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 11/18] x86: pmu: Use macro to replace hard-coded instructions " Dapeng Mi
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Replace the hard-coded ref-cycles event index with a macro to avoid a
possible mismatch if a new event added in the future changes the
ref-cycles event index but the hard-coded index is not updated
accordingly.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 496ee877..523369b2 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -55,6 +55,7 @@ struct pmu_event {
  * intel_gp_events[].
  */
 enum {
+	INTEL_REF_CYCLES_IDX	= 2,
 	INTEL_BRANCHES_IDX	= 5,
 };
 
@@ -708,7 +709,8 @@ static void set_ref_cycle_expectations(void)
 {
 	pmu_counter_t cnt = {
 		.ctr = MSR_IA32_PERFCTR0,
-		.config = EVNTSEL_OS | EVNTSEL_USR | intel_gp_events[2].unit_sel,
+		.config = EVNTSEL_OS | EVNTSEL_USR |
+			  intel_gp_events[INTEL_REF_CYCLES_IDX].unit_sel,
 	};
 	uint64_t tsc_delta;
 	uint64_t t0, t1, t2, t3;
@@ -744,8 +746,10 @@ static void set_ref_cycle_expectations(void)
 	if (!tsc_delta)
 		return;
 
-	intel_gp_events[2].min = (intel_gp_events[2].min * cnt.count) / tsc_delta;
-	intel_gp_events[2].max = (intel_gp_events[2].max * cnt.count) / tsc_delta;
+	intel_gp_events[INTEL_REF_CYCLES_IDX].min =
+		(intel_gp_events[INTEL_REF_CYCLES_IDX].min * cnt.count) / tsc_delta;
+	intel_gp_events[INTEL_REF_CYCLES_IDX].max =
+		(intel_gp_events[INTEL_REF_CYCLES_IDX].max * cnt.count) / tsc_delta;
 }
 
 static void check_invalid_rdpmc_gp(void)
-- 
2.40.1



* [kvm-unit-tests patch v6 11/18] x86: pmu: Use macro to replace hard-coded instructions event index
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (9 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 10/18] x86: pmu: Use macro to replace hard-coded ref-cycles " Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 12/18] x86: pmu: Enable and disable PMCs in loop() asm blob Dapeng Mi
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Replace the hard-coded instructions event index with a macro to avoid
a possible mismatch if a new event added in the future changes the
instructions event index but the hard-coded index is not updated
accordingly.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 523369b2..91484c77 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -55,6 +55,7 @@ struct pmu_event {
  * intel_gp_events[].
  */
 enum {
+	INTEL_INSTRUCTIONS_IDX  = 1,
 	INTEL_REF_CYCLES_IDX	= 2,
 	INTEL_BRANCHES_IDX	= 5,
 };
@@ -64,6 +65,7 @@ enum {
  * amd_gp_events[].
  */
 enum {
+	AMD_INSTRUCTIONS_IDX    = 1,
 	AMD_BRANCHES_IDX	= 2,
 };
 
@@ -328,11 +330,16 @@ static uint64_t measure_for_overflow(pmu_counter_t *cnt)
 
 static void check_counter_overflow(void)
 {
-	uint64_t overflow_preset;
 	int i;
+	uint64_t overflow_preset;
+	int instruction_idx = pmu.is_intel ?
+			      INTEL_INSTRUCTIONS_IDX :
+			      AMD_INSTRUCTIONS_IDX;
+
 	pmu_counter_t cnt = {
 		.ctr = MSR_GP_COUNTERx(0),
-		.config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
+		.config = EVNTSEL_OS | EVNTSEL_USR |
+			  gp_events[instruction_idx].unit_sel /* instructions */,
 	};
 	overflow_preset = measure_for_overflow(&cnt);
 
@@ -388,13 +395,18 @@ static void check_counter_overflow(void)
 
 static void check_gp_counter_cmask(void)
 {
+	int instruction_idx = pmu.is_intel ?
+			      INTEL_INSTRUCTIONS_IDX :
+			      AMD_INSTRUCTIONS_IDX;
+
 	pmu_counter_t cnt = {
 		.ctr = MSR_GP_COUNTERx(0),
-		.config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel /* instructions */,
+		.config = EVNTSEL_OS | EVNTSEL_USR |
+			  gp_events[instruction_idx].unit_sel /* instructions */,
 	};
 	cnt.config |= (0x2 << EVNTSEL_CMASK_SHIFT);
 	measure_one(&cnt);
-	report(cnt.count < gp_events[1].min, "cmask");
+	report(cnt.count < gp_events[instruction_idx].min, "cmask");
 }
 
 static void do_rdpmc_fast(void *ptr)
@@ -469,9 +481,14 @@ static void check_running_counter_wrmsr(void)
 {
 	uint64_t status;
 	uint64_t count;
+	unsigned int instruction_idx = pmu.is_intel ?
+				       INTEL_INSTRUCTIONS_IDX :
+				       AMD_INSTRUCTIONS_IDX;
+
 	pmu_counter_t evt = {
 		.ctr = MSR_GP_COUNTERx(0),
-		.config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
+		.config = EVNTSEL_OS | EVNTSEL_USR |
+			  gp_events[instruction_idx].unit_sel,
 	};
 
 	report_prefix_push("running counter wrmsr");
@@ -480,7 +497,7 @@ static void check_running_counter_wrmsr(void)
 	loop();
 	wrmsr(MSR_GP_COUNTERx(0), 0);
 	stop_event(&evt);
-	report(evt.count < gp_events[1].min, "cntr");
+	report(evt.count < gp_events[instruction_idx].min, "cntr");
 
 	/* clear status before overflow test */
 	if (this_cpu_has_perf_global_status())
@@ -511,6 +528,9 @@ static void check_emulated_instr(void)
 	uint64_t gp_counter_width = (1ull << pmu.gp_counter_width) - 1;
 	unsigned int branch_idx = pmu.is_intel ?
 				  INTEL_BRANCHES_IDX : AMD_BRANCHES_IDX;
+	unsigned int instruction_idx = pmu.is_intel ?
+				       INTEL_INSTRUCTIONS_IDX :
+				       AMD_INSTRUCTIONS_IDX;
 	pmu_counter_t brnch_cnt = {
 		.ctr = MSR_GP_COUNTERx(0),
 		/* branch instructions */
@@ -519,7 +539,7 @@ static void check_emulated_instr(void)
 	pmu_counter_t instr_cnt = {
 		.ctr = MSR_GP_COUNTERx(1),
 		/* instructions */
-		.config = EVNTSEL_OS | EVNTSEL_USR | gp_events[1].unit_sel,
+		.config = EVNTSEL_OS | EVNTSEL_USR | gp_events[instruction_idx].unit_sel,
 	};
 	report_prefix_push("emulated instruction");
 
-- 
2.40.1



* [kvm-unit-tests patch v6 12/18] x86: pmu: Enable and disable PMCs in loop() asm blob
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (10 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 11/18] x86: pmu: Use macro to replace hard-coded instructions " Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification Dapeng Mi
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Currently enabling PMCs, executing loop() and disabling PMCs are split
across three separate functions. So other instructions could be
executed between enabling the PMCs and running loop(), or between
running loop() and disabling the PMCs, e.g. if multiple counters are
enabled in the measure_many() function, the instructions which enable
the 2nd and subsequent counters are counted by the 1st counter.

So the current implementation can only verify the correctness of the
count against a rough range rather than a precise count, even for the
instructions and branches events. Strictly speaking, this verification
is meaningless, as the test could still pass even though the KVM vPMU
has something wrong and reports an incorrect instructions or branches
count that happens to fall within the rough range.

Thus, move the PMC enabling and disabling into the loop() asm blob and
ensure only the loop asm instructions are counted; then the
instructions and branches events can be verified against a precise
count instead of a rough range.
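
To illustrate the window being closed, the old flow (simplified from
the measure_many() hunk in the diff below) was:

	for (i = 0; i < count; i++)
		start_event(&evt[i]);	/* counter 0 starts counting here */
	loop();
	for (i = 0; i < count; i++)
		stop_event(&evt[i]);

so the MSR writes that start the 2nd and subsequent counters retire
while counter 0 is already running and get silently added to its
count.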

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 80 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 65 insertions(+), 15 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 91484c77..270f11b9 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -19,6 +19,15 @@
 #define EXPECTED_INSTR 17
 #define EXPECTED_BRNCH 5
 
+#define LOOP_ASM(_wrmsr)						\
+	_wrmsr "\n\t"							\
+	"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t"			\
+	"1: mov (%1), %2; add $64, %1;\n\t"				\
+	"nop; nop; nop; nop; nop; nop; nop;\n\t"			\
+	"loop 1b;\n\t"							\
+	"mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t"	\
+	_wrmsr "\n\t"
+
 typedef struct {
 	uint32_t ctr;
 	uint32_t idx;
@@ -75,13 +84,43 @@ static struct pmu_event *gp_events;
 static unsigned int gp_events_size;
 static unsigned int fixed_counters_num;
 
-static inline void loop(void)
+
+static inline void __loop(void)
+{
+	unsigned long tmp, tmp2, tmp3;
+
+	asm volatile(LOOP_ASM("nop")
+		     : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
+		     : "0"(N), "1"(buf));
+}
+
+/*
+ * Enable and disable the counters inside a single asm blob to ensure
+ * that no other instructions are counted in the window between
+ * enabling the counters and the LOOP_ASM code actually executing.
+ * Thus the instructions and branches events can be verified against
+ * precise counts instead of a rough valid count range.
+ */
+static inline void __precise_loop(u64 cntrs)
 {
 	unsigned long tmp, tmp2, tmp3;
+	unsigned int global_ctl = pmu.msr_global_ctl;
+	u32 eax = cntrs & (BIT_ULL(32) - 1);
+	u32 edx = cntrs >> 32;
 
-	asm volatile("1: mov (%1), %2; add $64, %1; nop; nop; nop; nop; nop; nop; nop; loop 1b"
-			: "=c"(tmp), "=r"(tmp2), "=r"(tmp3): "0"(N), "1"(buf));
+	asm volatile(LOOP_ASM("wrmsr")
+		     : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
+		     : "a"(eax), "d"(edx), "c"(global_ctl),
+		       "0"(N), "1"(buf)
+		     : "edi");
+}
 
+static inline void loop(u64 cntrs)
+{
+	if (!this_cpu_has_perf_global_ctrl())
+		__loop();
+	else
+		__precise_loop(cntrs);
 }
 
 volatile uint64_t irq_received;
@@ -181,18 +220,17 @@ static void __start_event(pmu_counter_t *evt, uint64_t count)
 	    ctrl = (ctrl & ~(0xf << shift)) | (usrospmi << shift);
 	    wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, ctrl);
     }
-    global_enable(evt);
     apic_write(APIC_LVTPC, PMI_VECTOR);
 }
 
 static void start_event(pmu_counter_t *evt)
 {
 	__start_event(evt, 0);
+	global_enable(evt);
 }
 
-static void stop_event(pmu_counter_t *evt)
+static void __stop_event(pmu_counter_t *evt)
 {
-	global_disable(evt);
 	if (is_gp(evt)) {
 		wrmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
 		      evt->config & ~EVNTSEL_EN);
@@ -204,14 +242,24 @@ static void stop_event(pmu_counter_t *evt)
 	evt->count = rdmsr(evt->ctr);
 }
 
+static void stop_event(pmu_counter_t *evt)
+{
+	global_disable(evt);
+	__stop_event(evt);
+}
+
 static noinline void measure_many(pmu_counter_t *evt, int count)
 {
 	int i;
+	u64 cntrs = 0;
+
+	for (i = 0; i < count; i++) {
+		__start_event(&evt[i], 0);
+		cntrs |= BIT_ULL(event_to_global_idx(&evt[i]));
+	}
+	loop(cntrs);
 	for (i = 0; i < count; i++)
-		start_event(&evt[i]);
-	loop();
-	for (i = 0; i < count; i++)
-		stop_event(&evt[i]);
+		__stop_event(&evt[i]);
 }
 
 static void measure_one(pmu_counter_t *evt)
@@ -221,9 +269,11 @@ static void measure_one(pmu_counter_t *evt)
 
 static noinline void __measure(pmu_counter_t *evt, uint64_t count)
 {
+	u64 cntrs = BIT_ULL(event_to_global_idx(evt));
+
 	__start_event(evt, count);
-	loop();
-	stop_event(evt);
+	loop(cntrs);
+	__stop_event(evt);
 }
 
 static bool verify_event(uint64_t count, struct pmu_event *e)
@@ -494,7 +544,7 @@ static void check_running_counter_wrmsr(void)
 	report_prefix_push("running counter wrmsr");
 
 	start_event(&evt);
-	loop();
+	__loop();
 	wrmsr(MSR_GP_COUNTERx(0), 0);
 	stop_event(&evt);
 	report(evt.count < gp_events[instruction_idx].min, "cntr");
@@ -511,7 +561,7 @@ static void check_running_counter_wrmsr(void)
 
 	wrmsr(MSR_GP_COUNTERx(0), count);
 
-	loop();
+	__loop();
 	stop_event(&evt);
 
 	if (this_cpu_has_perf_global_status()) {
@@ -652,7 +702,7 @@ static void warm_up(void)
 	 * the real verification.
 	 */
 	while (i--)
-		loop();
+		loop(0);
 }
 
 static void check_counters(void)
-- 
2.40.1



* [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (11 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 12/18] x86: pmu: Enable and disable PMCs in loop() asm blob Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2025-02-14 21:08   ` Sean Christopherson
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 14/18] x86: pmu: Improve LLC misses event verification Dapeng Mi
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

If the HW supports the GLOBAL_CTRL MSR, the enabling and disabling of
PMCs are moved into __precise_loop(). Thus, the instructions and
branches events can be verified against a precise count instead of a
rough range.

Note that some intermittent failures are seen on AMD processors using
PerfMonV2 due to variance in counts. This probably has to do with the
way instructions leading to a VM-Entry or VM-Exit are accounted when
counting retired instructions and branches.

https://lore.kernel.org/all/6d512a14-ace1-41a3-801e-0beb41425734@amd.com/

So only enable this precise check for Intel processors.
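
For reference, a sketch of how the precise counts below are derived
from the LOOP_ASM blob introduced in the previous patch (the
EXTRA_INSTRNS split simply mirrors the blob's enable/disable
sequences):

/*
 * Each of the N iterations executes 10 instructions:
 *   mov (%1),%2 ; add $64,%1 ; 7 x nop ; loop 1b
 * and exactly one branch (the "loop 1b"), hence:
 *   LOOP_INSTRNS  = N * 10 + EXTRA_INSTRNS
 *   LOOP_BRANCHES = N
 * where EXTRA_INSTRNS (3 + 3) covers the GLOBAL_CTRL enable and
 * disable sequences that now execute inside the blob.
 */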

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/x86/pmu.c b/x86/pmu.c
index 270f11b9..13c7c45d 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -19,6 +19,11 @@
 #define EXPECTED_INSTR 17
 #define EXPECTED_BRNCH 5
 
+
+/* Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
+#define EXTRA_INSTRNS  (3 + 3)
+#define LOOP_INSTRNS   (N * 10 + EXTRA_INSTRNS)
+#define LOOP_BRANCHES  (N)
 #define LOOP_ASM(_wrmsr)						\
 	_wrmsr "\n\t"							\
 	"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t"			\
@@ -123,6 +128,30 @@ static inline void loop(u64 cntrs)
 		__precise_loop(cntrs);
 }
 
+static void adjust_events_range(struct pmu_event *gp_events,
+				int instruction_idx, int branch_idx)
+{
+	/*
+	 * If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are
+	 * moved in __precise_loop(). Thus, instructions and branches events
+	 * can be verified against a precise count instead of a rough range.
+	 *
+	 * We see some intermittent failures on AMD processors using PerfMonV2
+	 * due to variance in counts. This probably has to do with the way
+	 * instructions leading to a VM-Entry or VM-Exit are accounted when
+	 * counting retired instructions and branches. Thus only enable the
+	 * precise validation for Intel processors.
+	 */
+	if (pmu.is_intel && this_cpu_has_perf_global_ctrl()) {
+		/* instructions event */
+		gp_events[instruction_idx].min = LOOP_INSTRNS;
+		gp_events[instruction_idx].max = LOOP_INSTRNS;
+		/* branches event */
+		gp_events[branch_idx].min = LOOP_BRANCHES;
+		gp_events[branch_idx].max = LOOP_BRANCHES;
+	}
+}
+
 volatile uint64_t irq_received;
 
 static void cnt_overflow(isr_regs_t *regs)
@@ -832,6 +861,9 @@ static void check_invalid_rdpmc_gp(void)
 
 int main(int ac, char **av)
 {
+	int instruction_idx;
+	int branch_idx;
+
 	setup_vm();
 	handle_irq(PMI_VECTOR, cnt_overflow);
 	buf = malloc(N*64);
@@ -845,13 +877,18 @@ int main(int ac, char **av)
 		}
 		gp_events = (struct pmu_event *)intel_gp_events;
 		gp_events_size = sizeof(intel_gp_events)/sizeof(intel_gp_events[0]);
+		instruction_idx = INTEL_INSTRUCTIONS_IDX;
+		branch_idx = INTEL_BRANCHES_IDX;
 		report_prefix_push("Intel");
 		set_ref_cycle_expectations();
 	} else {
 		gp_events_size = sizeof(amd_gp_events)/sizeof(amd_gp_events[0]);
 		gp_events = (struct pmu_event *)amd_gp_events;
+		instruction_idx = AMD_INSTRUCTIONS_IDX;
+		branch_idx = AMD_BRANCHES_IDX;
 		report_prefix_push("AMD");
 	}
+	adjust_events_range(gp_events, instruction_idx, branch_idx);
 
 	printf("PMU version:         %d\n", pmu.version);
 	printf("GP counters:         %d\n", pmu.nr_gp_counters);
-- 
2.40.1



* [kvm-unit-tests patch v6 14/18] x86: pmu: Improve LLC misses event verification
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (12 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 15/18] x86: pmu: Adjust lower boundary of llc-misses event to 0 for legacy CPUs Dapeng Mi
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

When running pmu test on SPR, sometimes the following failure is
reported.

1 <= 0 <= 1000000
FAIL: Intel: llc misses-4

Currently whether LLC misses occur depends only on probability. It's
possible that no LLC miss happens in the whole loop(), especially as
processors have larger and larger cache sizes, just like what we
observed on SPR.

Thus, add a clflush instruction into the loop() asm blob and ensure at
least one LLC miss is triggered.
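
The idea, as a minimal hedged sketch (the helper name is invented for
illustration; the patch open-codes this inside LOOP_ASM):

static inline void force_one_llc_miss(char *buf)
{
	/*
	 * Evict the first line of buf from all cache levels and order
	 * the flush before the loop's first load of buf; that load
	 * must then miss in the LLC regardless of cache size.
	 */
	asm volatile("clflush (%0); mfence" : : "r"(buf) : "memory");
}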

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 13c7c45d..c9160423 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -20,19 +20,30 @@
 #define EXPECTED_BRNCH 5
 
 
-/* Enable GLOBAL_CTRL + disable GLOBAL_CTRL instructions */
-#define EXTRA_INSTRNS  (3 + 3)
+/* Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
+#define EXTRA_INSTRNS  (3 + 3 + 2)
 #define LOOP_INSTRNS   (N * 10 + EXTRA_INSTRNS)
 #define LOOP_BRANCHES  (N)
-#define LOOP_ASM(_wrmsr)						\
+#define LOOP_ASM(_wrmsr, _clflush)					\
 	_wrmsr "\n\t"							\
 	"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t"			\
+	_clflush "\n\t"                                 		\
+	"mfence;\n\t"                                   		\
 	"1: mov (%1), %2; add $64, %1;\n\t"				\
 	"nop; nop; nop; nop; nop; nop; nop;\n\t"			\
 	"loop 1b;\n\t"							\
 	"mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t"	\
 	_wrmsr "\n\t"
 
+#define _loop_asm(_wrmsr, _clflush)				\
+do {								\
+	asm volatile(LOOP_ASM(_wrmsr, _clflush)			\
+		     : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)	\
+		     : "a"(eax), "d"(edx), "c"(global_ctl),	\
+		       "0"(N), "1"(buf)				\
+		     : "edi");					\
+} while (0)
+
 typedef struct {
 	uint32_t ctr;
 	uint32_t idx;
@@ -89,14 +100,17 @@ static struct pmu_event *gp_events;
 static unsigned int gp_events_size;
 static unsigned int fixed_counters_num;
 
-
 static inline void __loop(void)
 {
 	unsigned long tmp, tmp2, tmp3;
+	u32 global_ctl = 0;
+	u32 eax = 0;
+	u32 edx = 0;
 
-	asm volatile(LOOP_ASM("nop")
-		     : "=c"(tmp), "=r"(tmp2), "=r"(tmp3)
-		     : "0"(N), "1"(buf));
+	if (this_cpu_has(X86_FEATURE_CLFLUSH))
+		_loop_asm("nop", "clflush (%1)");
+	else
+		_loop_asm("nop", "nop");
 }
 
 /*
@@ -109,15 +123,14 @@ static inline void __loop(void)
 static inline void __precise_loop(u64 cntrs)
 {
 	unsigned long tmp, tmp2, tmp3;
-	unsigned int global_ctl = pmu.msr_global_ctl;
+	u32 global_ctl = pmu.msr_global_ctl;
 	u32 eax = cntrs & (BIT_ULL(32) - 1);
 	u32 edx = cntrs >> 32;
 
-	asm volatile(LOOP_ASM("wrmsr")
-		     : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)
-		     : "a"(eax), "d"(edx), "c"(global_ctl),
-		       "0"(N), "1"(buf)
-		     : "edi");
+	if (this_cpu_has(X86_FEATURE_CLFLUSH))
+		_loop_asm("wrmsr", "clflush (%1)");
+	else
+		_loop_asm("wrmsr", "nop");
 }
 
 static inline void loop(u64 cntrs)
-- 
2.40.1



* [kvm-unit-tests patch v6 15/18] x86: pmu: Adjust lower boundary of llc-misses event to 0 for legacy CPUs
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (13 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 14/18] x86: pmu: Improve LLC misses event verification Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 16/18] x86: pmu: Add IBPB indirect jump asm blob Dapeng Mi
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

For legacy Intel CPUs without clflush/clflushopt support, there is no
way to forcibly trigger an LLC miss and the measured llc misses count
may be 0. Thus adjust the lower boundary of the llc-misses event to 0
to avoid a possible false positive.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/x86/pmu.c b/x86/pmu.c
index c9160423..47b6305d 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -82,6 +82,7 @@ struct pmu_event {
 enum {
 	INTEL_INSTRUCTIONS_IDX  = 1,
 	INTEL_REF_CYCLES_IDX	= 2,
+	INTEL_LLC_MISSES_IDX	= 4,
 	INTEL_BRANCHES_IDX	= 5,
 };
 
@@ -892,6 +893,15 @@ int main(int ac, char **av)
 		gp_events_size = sizeof(intel_gp_events)/sizeof(intel_gp_events[0]);
 		instruction_idx = INTEL_INSTRUCTIONS_IDX;
 		branch_idx = INTEL_BRANCHES_IDX;
+
+		/*
+		 * For legacy Intel CPUS without clflush/clflushopt support,
+		 * there is no way to force to trigger a LLC miss, thus set
+		 * the minimum value to 0 to avoid false positives.
+		 */
+		if (!this_cpu_has(X86_FEATURE_CLFLUSH))
+			gp_events[INTEL_LLC_MISSES_IDX].min = 0;
+
 		report_prefix_push("Intel");
 		set_ref_cycle_expectations();
 	} else {
-- 
2.40.1



* [kvm-unit-tests patch v6 16/18] x86: pmu: Add IBPB indirect jump asm blob
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (14 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 15/18] x86: pmu: Adjust lower boundary of llc-misses event to 0 for legacy CPUs Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event Dapeng Mi
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 18/18] x86: pmu: Optimize emulated instruction validation Dapeng Mi
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Currently the lower boundary of the branch misses event is set to 0.
Strictly speaking, 0 shouldn't be a valid count since it can't tell us
whether the branch misses event counter works correctly or is even
disabled. On the other hand, it's also possible and reasonable that
the branch misses count is 0, especially for such a simple loop()
program running on a processor with an advanced branch predictor.

To eliminate this ambiguity and make the branch misses event
verification more accurate, an extra IBPB indirect jump asm blob is
appended and the IBPB command is leveraged to clear the branch target
buffer and force a branch miss on the indirect jump.
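
For reference, the IBPB command issued by the new asm blob boils down
to the MSR write below (constants spelled out here for illustration;
the blob encodes them as "mov $73, %%ecx; mov $1, %%eax"):

#define MSR_IA32_PRED_CMD	0x49	/* 73 decimal */
#define PRED_CMD_IBPB		(1ULL << 0)

/*
 * Flush the branch target buffer so that the indirect jump which
 * follows cannot be predicted and must be counted as a branch miss.
 */
wrmsr(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);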

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 15 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 47b6305d..279d418d 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -19,25 +19,52 @@
 #define EXPECTED_INSTR 17
 #define EXPECTED_BRNCH 5
 
-
-/* Enable GLOBAL_CTRL + disable GLOBAL_CTRL + clflush/mfence instructions */
-#define EXTRA_INSTRNS  (3 + 3 + 2)
+#define IBPB_JMP_INSTRNS      9
+#define IBPB_JMP_BRANCHES     2
+
+#if defined(__i386__) || defined(_M_IX86) /* i386 */
+#define IBPB_JMP_ASM(_wrmsr)				\
+	"mov $1, %%eax; xor %%edx, %%edx;\n\t"		\
+	"mov $73, %%ecx;\n\t"				\
+	_wrmsr "\n\t"					\
+	"call 1f\n\t"					\
+	"1: pop %%eax\n\t"				\
+	"add $(2f-1b), %%eax\n\t"			\
+	"jmp *%%eax;\n\t"                               \
+	"nop;\n\t"					\
+	"2: nop;\n\t"
+#else /* x86_64 */
+#define IBPB_JMP_ASM(_wrmsr)				\
+	"mov $1, %%eax; xor %%edx, %%edx;\n\t"		\
+	"mov $73, %%ecx;\n\t"				\
+	_wrmsr "\n\t"					\
+	"call 1f\n\t"					\
+	"1: pop %%rax\n\t"				\
+	"add $(2f-1b), %%rax\n\t"                       \
+	"jmp *%%rax;\n\t"                               \
+	"nop;\n\t"					\
+	"2: nop;\n\t"
+#endif
+
+/* GLOBAL_CTRL enable + disable + clflush/mfence + IBPB_JMP */
+#define EXTRA_INSTRNS  (3 + 3 + 2 + IBPB_JMP_INSTRNS)
 #define LOOP_INSTRNS   (N * 10 + EXTRA_INSTRNS)
-#define LOOP_BRANCHES  (N)
-#define LOOP_ASM(_wrmsr, _clflush)					\
-	_wrmsr "\n\t"							\
+#define LOOP_BRANCHES  (N + IBPB_JMP_BRANCHES)
+#define LOOP_ASM(_wrmsr1, _clflush, _wrmsr2)				\
+	_wrmsr1 "\n\t"							\
 	"mov %%ecx, %%edi; mov %%ebx, %%ecx;\n\t"			\
 	_clflush "\n\t"                                 		\
 	"mfence;\n\t"                                   		\
 	"1: mov (%1), %2; add $64, %1;\n\t"				\
 	"nop; nop; nop; nop; nop; nop; nop;\n\t"			\
 	"loop 1b;\n\t"							\
+	IBPB_JMP_ASM(_wrmsr2) 						\
 	"mov %%edi, %%ecx; xor %%eax, %%eax; xor %%edx, %%edx;\n\t"	\
-	_wrmsr "\n\t"
+	_wrmsr1 "\n\t"
 
-#define _loop_asm(_wrmsr, _clflush)				\
+#define _loop_asm(_wrmsr1, _clflush, _wrmsr2)			\
 do {								\
-	asm volatile(LOOP_ASM(_wrmsr, _clflush)			\
+	asm volatile(LOOP_ASM(_wrmsr1, _clflush, _wrmsr2)	\
 		     : "=b"(tmp), "=r"(tmp2), "=r"(tmp3)	\
 		     : "a"(eax), "d"(edx), "c"(global_ctl),	\
 		       "0"(N), "1"(buf)				\
@@ -101,6 +128,12 @@ static struct pmu_event *gp_events;
 static unsigned int gp_events_size;
 static unsigned int fixed_counters_num;
 
+static int has_ibpb(void)
+{
+	return this_cpu_has(X86_FEATURE_SPEC_CTRL) ||
+	       this_cpu_has(X86_FEATURE_AMD_IBPB);
+}
+
 static inline void __loop(void)
 {
 	unsigned long tmp, tmp2, tmp3;
@@ -108,10 +141,14 @@ static inline void __loop(void)
 	u32 eax = 0;
 	u32 edx = 0;
 
-	if (this_cpu_has(X86_FEATURE_CLFLUSH))
-		_loop_asm("nop", "clflush (%1)");
+	if (this_cpu_has(X86_FEATURE_CLFLUSH) && has_ibpb())
+		_loop_asm("nop", "clflush (%1)", "wrmsr");
+	else if (this_cpu_has(X86_FEATURE_CLFLUSH))
+		_loop_asm("nop", "clflush (%1)", "nop");
+	else if (has_ibpb())
+		_loop_asm("nop", "nop", "wrmsr");
 	else
-		_loop_asm("nop", "nop");
+		_loop_asm("nop", "nop", "nop");
 }
 
 /*
@@ -128,10 +165,14 @@ static inline void __precise_loop(u64 cntrs)
 	u32 eax = cntrs & (BIT_ULL(32) - 1);
 	u32 edx = cntrs >> 32;
 
-	if (this_cpu_has(X86_FEATURE_CLFLUSH))
-		_loop_asm("wrmsr", "clflush (%1)");
+	if (this_cpu_has(X86_FEATURE_CLFLUSH) && has_ibpb())
+		_loop_asm("wrmsr", "clflush (%1)", "wrmsr");
+	else if (this_cpu_has(X86_FEATURE_CLFLUSH))
+		_loop_asm("wrmsr", "clflush (%1)", "nop");
+	else if (has_ibpb())
+		_loop_asm("wrmsr", "nop", "wrmsr");
 	else
-		_loop_asm("wrmsr", "nop");
+		_loop_asm("wrmsr", "nop", "nop");
 }
 
 static inline void loop(u64 cntrs)
-- 
2.40.1



* [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (15 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 16/18] x86: pmu: Add IBPB indirect jump asm blob Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  2025-02-14 21:09   ` Sean Christopherson
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 18/18] x86: pmu: Optimize emulated instruction validation Dapeng Mi
  17 siblings, 1 reply; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

Since the IBPB command is added to force at least one branch miss, the
lower boundary of the branch misses event is increased to 1 by
default. For CPUs without IBPB support, dynamically adjust the lower
boundary to 0 to avoid a false positive.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index 279d418d..c7848fd1 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -90,12 +90,12 @@ struct pmu_event {
 	{"llc references", 0x4f2e, 1, 2*N},
 	{"llc misses", 0x412e, 1, 1*N},
 	{"branches", 0x00c4, 1*N, 1.1*N},
-	{"branch misses", 0x00c5, 0, 0.1*N},
+	{"branch misses", 0x00c5, 1, 0.1*N},
 }, amd_gp_events[] = {
 	{"core cycles", 0x0076, 1*N, 50*N},
 	{"instructions", 0x00c0, 10*N, 10.2*N},
 	{"branches", 0x00c2, 1*N, 1.1*N},
-	{"branch misses", 0x00c3, 0, 0.1*N},
+	{"branch misses", 0x00c3, 1, 0.1*N},
 }, fixed_events[] = {
 	{"fixed 0", MSR_CORE_PERF_FIXED_CTR0, 10*N, 10.2*N},
 	{"fixed 1", MSR_CORE_PERF_FIXED_CTR0 + 1, 1*N, 30*N},
@@ -111,6 +111,7 @@ enum {
 	INTEL_REF_CYCLES_IDX	= 2,
 	INTEL_LLC_MISSES_IDX	= 4,
 	INTEL_BRANCHES_IDX	= 5,
+	INTEL_BRANCH_MISS_IDX	= 6,
 };
 
 /*
@@ -120,6 +121,7 @@ enum {
 enum {
 	AMD_INSTRUCTIONS_IDX    = 1,
 	AMD_BRANCHES_IDX	= 2,
+	AMD_BRANCH_MISS_IDX	= 3,
 };
 
 char *buf;
@@ -184,7 +186,8 @@ static inline void loop(u64 cntrs)
 }
 
 static void adjust_events_range(struct pmu_event *gp_events,
-				int instruction_idx, int branch_idx)
+				int instruction_idx, int branch_idx,
+				int branch_miss_idx)
 {
 	/*
 	 * If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are
@@ -205,6 +208,17 @@ static void adjust_events_range(struct pmu_event *gp_events,
 		gp_events[branch_idx].min = LOOP_BRANCHES;
 		gp_events[branch_idx].max = LOOP_BRANCHES;
 	}
+
+	/*
+	 * For CPUs without IBPB support, there is no way to force a
+	 * branch miss and the measured branch misses count may be 0.
+	 * Thus overwrite the lower boundary of the branch misses event
+	 * with 0 to avoid a false positive.
+	 */
+	if (!has_ibpb()) {
+		/* branch misses event */
+		gp_events[branch_miss_idx].min = 0;
+	}
 }
 
 volatile uint64_t irq_received;
@@ -918,6 +932,7 @@ int main(int ac, char **av)
 {
 	int instruction_idx;
 	int branch_idx;
+	int branch_miss_idx;
 
 	setup_vm();
 	handle_irq(PMI_VECTOR, cnt_overflow);
@@ -934,6 +949,7 @@ int main(int ac, char **av)
 		gp_events_size = sizeof(intel_gp_events)/sizeof(intel_gp_events[0]);
 		instruction_idx = INTEL_INSTRUCTIONS_IDX;
 		branch_idx = INTEL_BRANCHES_IDX;
+		branch_miss_idx = INTEL_BRANCH_MISS_IDX;
 
 		/*
 		 * For legacy Intel CPUS without clflush/clflushopt support,
@@ -950,9 +966,10 @@ int main(int ac, char **av)
 		gp_events = (struct pmu_event *)amd_gp_events;
 		instruction_idx = AMD_INSTRUCTIONS_IDX;
 		branch_idx = AMD_BRANCHES_IDX;
+		branch_miss_idx = AMD_BRANCH_MISS_IDX;
 		report_prefix_push("AMD");
 	}
-	adjust_events_range(gp_events, instruction_idx, branch_idx);
+	adjust_events_range(gp_events, instruction_idx, branch_idx, branch_miss_idx);
 
 	printf("PMU version:         %d\n", pmu.version);
 	printf("GP counters:         %d\n", pmu.nr_gp_counters);
-- 
2.40.1



* [kvm-unit-tests patch v6 18/18] x86: pmu: Optimize emulated instruction validation
  2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
                   ` (16 preceding siblings ...)
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event Dapeng Mi
@ 2024-09-14 10:17 ` Dapeng Mi
  17 siblings, 0 replies; 33+ messages in thread
From: Dapeng Mi @ 2024-09-14 10:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, Jim Mattson, Mingwei Zhang, Xiong Zhang,
	Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma, Dapeng Mi,
	Dapeng Mi

For CPUs supporting the PERF_GLOBAL_CTRL MSR, the validation of
emulated instructions can be improved to check the instructions and
branches events against precise counts instead of a rough range.

Move enabling and disabling of the PERF_GLOBAL_CTRL MSR into the
kvm_fep_asm blob, so that the instructions and branches events can be
verified against precise counts.

Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
 x86/pmu.c | 108 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 65 insertions(+), 43 deletions(-)

diff --git a/x86/pmu.c b/x86/pmu.c
index c7848fd1..3a5659b2 100644
--- a/x86/pmu.c
+++ b/x86/pmu.c
@@ -14,11 +14,6 @@
 
 #define N 1000000
 
-// These values match the number of instructions and branches in the
-// assembly block in check_emulated_instr().
-#define EXPECTED_INSTR 17
-#define EXPECTED_BRNCH 5
-
 #define IBPB_JMP_INSTRNS      9
 #define IBPB_JMP_BRANCHES     2
 
@@ -71,6 +66,40 @@ do {								\
 		     : "edi");					\
 } while (0)
 
+/* the number of instructions and branches of the kvm_fep_asm() blob */
+#define KVM_FEP_INSTR	22
+#define KVM_FEP_BRNCH	5
+
+/*
+ * KVM_FEP is a magic prefix that forces emulation so
+ * 'KVM_FEP "jne label\n"' just counts as a single instruction.
+ */
+#define kvm_fep_asm(_wrmsr)			\
+do {						\
+	asm volatile(				\
+		_wrmsr "\n\t"			\
+		"mov %%ecx, %%edi;\n\t"		\
+		"mov $0x0, %%eax;\n\t"		\
+		"cmp $0x0, %%eax;\n\t"		\
+		KVM_FEP "jne 1f\n\t"		\
+		KVM_FEP "jne 1f\n\t"		\
+		KVM_FEP "jne 1f\n\t"		\
+		KVM_FEP "jne 1f\n\t"		\
+		KVM_FEP "jne 1f\n\t"		\
+		"mov $0xa, %%eax; cpuid;\n\t"	\
+		"mov $0xa, %%eax; cpuid;\n\t"	\
+		"mov $0xa, %%eax; cpuid;\n\t"	\
+		"mov $0xa, %%eax; cpuid;\n\t"	\
+		"mov $0xa, %%eax; cpuid;\n\t"	\
+		"1: mov %%edi, %%ecx; \n\t"	\
+		"xor %%eax, %%eax; \n\t"	\
+		"xor %%edx, %%edx;\n\t"		\
+		_wrmsr "\n\t"			\
+		:				\
+		: "a"(eax), "d"(edx), "c"(ecx)	\
+		: "ebx", "edi");		\
+} while (0)
+
 typedef struct {
 	uint32_t ctr;
 	uint32_t idx;
@@ -672,6 +701,7 @@ static void check_running_counter_wrmsr(void)
 
 static void check_emulated_instr(void)
 {
+	u32 eax, edx, ecx;
 	uint64_t status, instr_start, brnch_start;
 	uint64_t gp_counter_width = (1ull << pmu.gp_counter_width) - 1;
 	unsigned int branch_idx = pmu.is_intel ?
@@ -679,6 +709,7 @@ static void check_emulated_instr(void)
 	unsigned int instruction_idx = pmu.is_intel ?
 				       INTEL_INSTRUCTIONS_IDX :
 				       AMD_INSTRUCTIONS_IDX;
+
 	pmu_counter_t brnch_cnt = {
 		.ctr = MSR_GP_COUNTERx(0),
 		/* branch instructions */
@@ -694,55 +725,46 @@ static void check_emulated_instr(void)
 	if (this_cpu_has_perf_global_status())
 		pmu_clear_global_status();
 
-	start_event(&brnch_cnt);
-	start_event(&instr_cnt);
+	__start_event(&brnch_cnt, 0);
+	__start_event(&instr_cnt, 0);
 
-	brnch_start = -EXPECTED_BRNCH;
-	instr_start = -EXPECTED_INSTR;
+	brnch_start = -KVM_FEP_BRNCH;
+	instr_start = -KVM_FEP_INSTR;
 	wrmsr(MSR_GP_COUNTERx(0), brnch_start & gp_counter_width);
 	wrmsr(MSR_GP_COUNTERx(1), instr_start & gp_counter_width);
-	// KVM_FEP is a magic prefix that forces emulation so
-	// 'KVM_FEP "jne label\n"' just counts as a single instruction.
-	asm volatile(
-		"mov $0x0, %%eax\n"
-		"cmp $0x0, %%eax\n"
-		KVM_FEP "jne label\n"
-		KVM_FEP "jne label\n"
-		KVM_FEP "jne label\n"
-		KVM_FEP "jne label\n"
-		KVM_FEP "jne label\n"
-		"mov $0xa, %%eax\n"
-		"cpuid\n"
-		"mov $0xa, %%eax\n"
-		"cpuid\n"
-		"mov $0xa, %%eax\n"
-		"cpuid\n"
-		"mov $0xa, %%eax\n"
-		"cpuid\n"
-		"mov $0xa, %%eax\n"
-		"cpuid\n"
-		"label:\n"
-		:
-		:
-		: "eax", "ebx", "ecx", "edx");
 
-	if (this_cpu_has_perf_global_ctrl())
-		wrmsr(pmu.msr_global_ctl, 0);
+	if (this_cpu_has_perf_global_ctrl()) {
+		eax = BIT(0) | BIT(1);
+		ecx = pmu.msr_global_ctl;
+		edx = 0;
+		kvm_fep_asm("wrmsr");
+	} else {
+		eax = ecx = edx = 0;
+		kvm_fep_asm("nop");
+	}
 
-	stop_event(&brnch_cnt);
-	stop_event(&instr_cnt);
+	__stop_event(&brnch_cnt);
+	__stop_event(&instr_cnt);
 
 	// Check that the end count - start count is at least the expected
 	// number of instructions and branches.
-	report(instr_cnt.count - instr_start >= EXPECTED_INSTR,
-	       "instruction count");
-	report(brnch_cnt.count - brnch_start >= EXPECTED_BRNCH,
-	       "branch count");
+	if (this_cpu_has_perf_global_ctrl()) {
+		report(instr_cnt.count - instr_start == KVM_FEP_INSTR,
+		       "instruction count");
+		report(brnch_cnt.count - brnch_start == KVM_FEP_BRNCH,
+		       "branch count");
+	} else {
+		report(instr_cnt.count - instr_start >= KVM_FEP_INSTR,
+		       "instruction count");
+		report(brnch_cnt.count - brnch_start >= KVM_FEP_BRNCH,
+		       "branch count");
+	}
+
 	if (this_cpu_has_perf_global_status()) {
 		// Additionally check that those counters overflowed properly.
 		status = rdmsr(pmu.msr_global_status);
-		report(status & 1, "branch counter overflow");
-		report(status & 2, "instruction counter overflow");
+		report(status & BIT_ULL(0), "branch counter overflow");
+		report(status & BIT_ULL(1), "instruction counter overflow");
 	}
 
 	report_prefix_pop();
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line Dapeng Mi
@ 2025-02-14 21:05   ` Sean Christopherson
  2025-02-18  9:07     ` Mi, Dapeng
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-02-14 21:05 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Sat, Sep 14, 2024, Dapeng Mi wrote:
> When running pmu test on SPR, the following #GP fault is reported.
> 
> Unhandled exception 13 #GP at ip 000000000040771f
> error_code=0000      rflags=00010046      cs=00000008
> rax=00000000004031ad rcx=0000000000000186 rdx=0000000000000000 rbx=00000000005142f0
> rbp=0000000000514260 rsi=0000000000000020 rdi=0000000000000340
>  r8=0000000000513a65  r9=00000000000003f8 r10=000000000000000d r11=00000000ffffffff
> r12=000000000043003c r13=0000000000514450 r14=000000000000000b r15=0000000000000001
> cr0=0000000080010011 cr2=0000000000000000 cr3=0000000001007000 cr4=0000000000000020
> cr8=0000000000000000
>         STACK: @40771f 40040e 400976 400aef 40148d 401da9 4001ad
> FAIL pmu
> 
> It looks EVENTSEL0 MSR (0x186) is written a invalid value (0x4031ad) and
> cause a #GP.

Nope.

> Further investigation shows the #GP is caused by below code in
> __start_event().
> 
> rmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
> 		  evt->config | EVNTSEL_EN);

Nope.

> The evt->config is correctly initialized but seems corrupted before
> writing to MSR.

Still nope.

> 
> The original pmu_counter_t layout looks as below.
> 
> typedef struct {
> 	uint32_t ctr;
> 	uint64_t config;
> 	uint64_t count;
> 	int idx;
> } pmu_counter_t;
> 
> Obviously the config filed crosses two cache lines. 

Yeah, no.  Cache lines are 64 bytes on x86, and even with the bad layout, the size
only adds up to 32 bytes on x86-64.  Packing it slightly better drops it to 24
bytes, but that has nothing to do with cache lines.
 
> When the two cache lines are not updated simultaneously, the config value is
> corrupted.

This is simply nonsensical.  Compilers don't generate accesses that split cache
lines unless software is being deliberately stupid, and x86 doesn't corrupt data
on unaligned accesses.

The actual problem is that your next patch increases the size of the array in
check_counters_many() from 10 to 48 entries.  With 32 bytes per entry, it's just
enough to overflow the stack when making function calls (the array itself stays
on the stack page).  And because KUT's percpu and stack management is complete
and utter garbage, overflowing the stack clobbers the percpu area.

Of course, it's way too hard to even see that, because all of the code is beyond
stupid and (a) doesn't align the stacks to 4KiB, and (b) puts the percpu area at
the bottom of the stack "page".

I'll send patches to put band-aids on the per-CPU insanity, along with a refreshed
version of this series.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many()
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many() Dapeng Mi
@ 2025-02-14 21:06   ` Sean Christopherson
  2025-02-18  9:24     ` Mi, Dapeng
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-02-14 21:06 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Sat, Sep 14, 2024, Dapeng Mi wrote:
> Considering there are already 8 GP counters and 4 fixed counters on
> latest Intel processors, like Sapphire Rapids. The original cnt[] array
> length 10 is definitely not enough to cover all supported PMU counters on
> these new processors even through currently KVM only supports 3 fixed
> counters at most. This would cause out of bound memory access and may trigger
> false alarm on PMU counter validation
> 
> It's probably more and more GP and fixed counters are introduced in the
> future and then directly extends the cnt[] array length to 48 once and
> for all. Base on the layout of IA32_PERF_GLOBAL_CTRL and
> IA32_PERF_GLOBAL_STATUS, 48 looks enough in near feature.
> 
> Reviewed-by: Jim Mattson <jmattson@google.com>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
>  x86/pmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/x86/pmu.c b/x86/pmu.c
> index a0268db8..b4de2680 100644
> --- a/x86/pmu.c
> +++ b/x86/pmu.c
> @@ -255,7 +255,7 @@ static void check_fixed_counters(void)
>  
>  static void check_counters_many(void)
>  {
> -	pmu_counter_t cnt[10];
> +	pmu_counter_t cnt[48];

ARGH.  Since the *entire* purpose of increasing the size is to guard against
buffer overflow, add an assert that the loop doesn't overflow.
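
As a rough sketch only, assuming the existing cnt[] array and kvm-unit-tests'
assert() macro, such a guard could look like:

	for (i = 0, n = 0; n < pmu.nr_gp_counters; i++) {
		/* Never write cnt[n] past the end of the array, whatever CPUID reports. */
		assert(n < ARRAY_SIZE(cnt));
		/* ... existing per-counter setup ... */
	}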

>  	int i, n;
>  
>  	for (i = 0, n = 0; n < pmu.nr_gp_counters; i++) {
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events Dapeng Mi
@ 2025-02-14 21:07   ` Sean Christopherson
  2025-02-18  9:34     ` Mi, Dapeng
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-02-14 21:07 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Sat, Sep 14, 2024, Dapeng Mi wrote:
> @@ -744,6 +753,12 @@ int main(int ac, char **av)
>  	printf("Fixed counters:      %d\n", pmu.nr_fixed_counters);
>  	printf("Fixed counter width: %d\n", pmu.fixed_counter_width);
>  
> +	fixed_counters_num = MIN(pmu.nr_fixed_counters, ARRAY_SIZE(fixed_events));
> +	if (pmu.nr_fixed_counters > ARRAY_SIZE(fixed_events))
> +		report_info("Fixed counters number %d > defined fixed events %ld. "

Doesn't compile on 32-bit builds.  Easiest thing is to cast ARRAY_SIZE, because
size_t is different between 32-bit and 64-bit.

> +			    "Please update test case.", pmu.nr_fixed_counters,
> +			    ARRAY_SIZE(fixed_events));
> +
>  	apic_write(APIC_LVTPC, PMI_VECTOR);
>  
>  	check_counters();
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure Dapeng Mi
@ 2025-02-14 21:07   ` Sean Christopherson
  2025-02-18  9:36     ` Mi, Dapeng
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-02-14 21:07 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Sat, Sep 14, 2024, Dapeng Mi wrote:
> +static void warm_up(void)
> +{
> +	int i = 8;
> +
> +	/*
> +	 * Since cycles event is always run as the first event, there would be
> +	 * a warm-up state to warm up the cache, it leads to the measured cycles
> +	 * value may exceed the pre-defined cycles upper boundary and cause
> +	 * false positive. To avoid this, introduce an warm-up state before
> +	 * the real verification.
> +	 */
> +	while (i--)
> +		loop();

Use a for-loop.
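
For example, a minimal sketch of the same warm-up written with a for-loop
(same behavior, reusing the loop() helper from the patch):

	static void warm_up(void)
	{
		int i;

		/* Run the measured loop a few times so caches are warm. */
		for (i = 0; i < 8; i++)
			loop();
	}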

> +}
> +
>  static void check_counters(void)
>  {
>  	if (is_fep_available())
>  		check_emulated_instr();
>  
> +	warm_up();
>  	check_gp_counters();
>  	check_fixed_counters();
>  	check_rdpmc();
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification Dapeng Mi
@ 2025-02-14 21:08   ` Sean Christopherson
  2025-02-18  9:40     ` Mi, Dapeng
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-02-14 21:08 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Sat, Sep 14, 2024, Dapeng Mi wrote:
> If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are moved in
> __precise_count_loop(). Thus, instructions and branches events can be
> verified against a precise count instead of a rough range.
> 
> BTW, some intermittent failures on AMD processors using PerfMonV2 is
> seen due to variance in counts. This probably has to do with the way
> instructions leading to a VM-Entry or VM-Exit are accounted when
> counting retired instructions and branches.

AMD counts VMRUN as a branch in guest context.

> +	 * We see some intermittent failures on AMD processors using PerfMonV2
> +	 * due to variance in counts. This probably has to do with the way
> +	 * instructions leading to a VM-Entry or VM-Exit are accounted when
> +	 * counting retired instructions and branches. Thus only enable the
> +	 * precise validation for Intel processors.
> +	 */
> +	if (pmu.is_intel && this_cpu_has_perf_global_ctrl()) {
> +		/* instructions event */

These comments are useless.

> +		gp_events[instruction_idx].min = LOOP_INSTRNS;
> +		gp_events[instruction_idx].max = LOOP_INSTRNS;
> +		/* branches event */
> +		gp_events[branch_idx].min = LOOP_BRANCHES;
> +		gp_events[branch_idx].max = LOOP_BRANCHES;
> +	}
> +}
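
That is, keeping only the code from the hunk above, the assignments can
simply stand on their own:

	if (pmu.is_intel && this_cpu_has_perf_global_ctrl()) {
		gp_events[instruction_idx].min = LOOP_INSTRNS;
		gp_events[instruction_idx].max = LOOP_INSTRNS;
		gp_events[branch_idx].min = LOOP_BRANCHES;
		gp_events[branch_idx].max = LOOP_BRANCHES;
	}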

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event
  2024-09-14 10:17 ` [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event Dapeng Mi
@ 2025-02-14 21:09   ` Sean Christopherson
  2025-02-18  9:42     ` Mi, Dapeng
  0 siblings, 1 reply; 33+ messages in thread
From: Sean Christopherson @ 2025-02-14 21:09 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Sat, Sep 14, 2024, Dapeng Mi wrote:
> @@ -205,6 +208,17 @@ static void adjust_events_range(struct pmu_event *gp_events,
>  		gp_events[branch_idx].min = LOOP_BRANCHES;
>  		gp_events[branch_idx].max = LOOP_BRANCHES;
>  	}
> +
> +	/*
> +	 * For CPUs without IBPB support, no way to force to trigger a
> +	 * branch miss and the measured branch misses is possible to be
> +	 * 0. Thus overwrite the lower boundary of branch misses event
> +	 * to 0 to avoid false positive.
> +	 */
> +	if (!has_ibpb()) {
> +		/* branch misses event */

This comment is worse than useless, because it necessitates curly braces.

> +		gp_events[branch_miss_idx].min = 0;
> +	}
>  }
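
Dropping the comment lets the check collapse to a single braceless
statement, e.g.:

	if (!has_ibpb())
		gp_events[branch_miss_idx].min = 0;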

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line
  2025-02-14 21:05   ` Sean Christopherson
@ 2025-02-18  9:07     ` Mi, Dapeng
  0 siblings, 0 replies; 33+ messages in thread
From: Mi, Dapeng @ 2025-02-18  9:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Like Xu, Jinrong Liang, Dapeng Mi


On 2/15/2025 5:05 AM, Sean Christopherson wrote:
> On Sat, Sep 14, 2024, Dapeng Mi wrote:
>> When running pmu test on SPR, the following #GP fault is reported.
>>
>> Unhandled exception 13 #GP at ip 000000000040771f
>> error_code=0000      rflags=00010046      cs=00000008
>> rax=00000000004031ad rcx=0000000000000186 rdx=0000000000000000 rbx=00000000005142f0
>> rbp=0000000000514260 rsi=0000000000000020 rdi=0000000000000340
>>  r8=0000000000513a65  r9=00000000000003f8 r10=000000000000000d r11=00000000ffffffff
>> r12=000000000043003c r13=0000000000514450 r14=000000000000000b r15=0000000000000001
>> cr0=0000000080010011 cr2=0000000000000000 cr3=0000000001007000 cr4=0000000000000020
>> cr8=0000000000000000
>>         STACK: @40771f 40040e 400976 400aef 40148d 401da9 4001ad
>> FAIL pmu
>>
>> It looks EVENTSEL0 MSR (0x186) is written a invalid value (0x4031ad) and
>> cause a #GP.
> Nope.
>
>> Further investigation shows the #GP is caused by below code in
>> __start_event().
>>
>> rmsr(MSR_GP_EVENT_SELECTx(event_to_global_idx(evt)),
>> 		  evt->config | EVNTSEL_EN);
> Nope.
>
>> The evt->config is correctly initialized but seems corrupted before
>> writing to MSR.
> Still nope.
>
>> The original pmu_counter_t layout looks as below.
>>
>> typedef struct {
>> 	uint32_t ctr;
>> 	uint64_t config;
>> 	uint64_t count;
>> 	int idx;
>> } pmu_counter_t;
>>
>> Obviously the config filed crosses two cache lines. 
> Yeah, no.  Cache lines are 64 bytes on x86, and even with the bad layout, the size
> only adds up to 32 bytes on x86-64.  Packing it slightly better drops it to 24
> bytes, but that has nothing to do with cache lines.
>  
>> When the two cache lines are not updated simultaneously, the config value is
>> corrupted.
> This is simply nonsensical.  Compilers don't generate accesses that split cache
> lines unless software is being deliberately stupid, and x86 doesn't corrupt data
> on unaligned accesses.
>
> The actual problem is that your next patch increases the size of the array in
> check_counters_many() from 10 to 48 entries.  With 32 bytes per entry, it's just
> enough to overflow the stack when making function calls (the array itself stays
> on the stack page).  And because KUT's percpu and stack management is complete
> and utter garbage, overflowing the stack clobbers the percpu area.
>
> Of course, it's way too hard to even see that, because all of the code is beyond
> stupid and (a) doesn't align the stacks to 4KiB, and (b) puts the percpu area at
> the bottom of the stack "page".
>
> I'll send patches to put band-aids on the per-CPU insanity, along with a refreshed
> version of this series.

Thumbs up! I never realized this issue was caused by stack corruption. I
will review and test the v7 patch set later. Thanks.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many()
  2025-02-14 21:06   ` Sean Christopherson
@ 2025-02-18  9:24     ` Mi, Dapeng
  2025-02-18 15:56       ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: Mi, Dapeng @ 2025-02-18  9:24 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi


On 2/15/2025 5:06 AM, Sean Christopherson wrote:
> On Sat, Sep 14, 2024, Dapeng Mi wrote:
>> Considering there are already 8 GP counters and 4 fixed counters on
>> latest Intel processors, like Sapphire Rapids. The original cnt[] array
>> length 10 is definitely not enough to cover all supported PMU counters on
>> these new processors even through currently KVM only supports 3 fixed
>> counters at most. This would cause out of bound memory access and may trigger
>> false alarm on PMU counter validation
>>
>> It's probably more and more GP and fixed counters are introduced in the
>> future and then directly extends the cnt[] array length to 48 once and
>> for all. Base on the layout of IA32_PERF_GLOBAL_CTRL and
>> IA32_PERF_GLOBAL_STATUS, 48 looks enough in near feature.
>>
>> Reviewed-by: Jim Mattson <jmattson@google.com>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>>  x86/pmu.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/x86/pmu.c b/x86/pmu.c
>> index a0268db8..b4de2680 100644
>> --- a/x86/pmu.c
>> +++ b/x86/pmu.c
>> @@ -255,7 +255,7 @@ static void check_fixed_counters(void)
>>  
>>  static void check_counters_many(void)
>>  {
>> -	pmu_counter_t cnt[10];
>> +	pmu_counter_t cnt[48];
> ARGH.  Since the *entire* purpose of increasing the size is to guard against
> buffer overflow, add an assert that the loop doesn't overflow.

This is not only for ensuring no buffer overflow. As the commit message
says, the counter number has already exceeded 10; for example, SPR has 12
counters (8 GP + 4 fixed), and there will be more counters on later
platforms. The aim of enlarging the array size is to ensure these counters
can be enabled and verified simultaneously. 48 may be too large; 32
should be fair enough. Thanks.


>
>>  	int i, n;
>>  
>>  	for (i = 0, n = 0; n < pmu.nr_gp_counters; i++) {
>> -- 
>> 2.40.1
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events
  2025-02-14 21:07   ` Sean Christopherson
@ 2025-02-18  9:34     ` Mi, Dapeng
  2025-02-18 15:04       ` Sean Christopherson
  0 siblings, 1 reply; 33+ messages in thread
From: Mi, Dapeng @ 2025-02-18  9:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi


On 2/15/2025 5:07 AM, Sean Christopherson wrote:
> On Sat, Sep 14, 2024, Dapeng Mi wrote:
>> @@ -744,6 +753,12 @@ int main(int ac, char **av)
>>  	printf("Fixed counters:      %d\n", pmu.nr_fixed_counters);
>>  	printf("Fixed counter width: %d\n", pmu.fixed_counter_width);
>>  
>> +	fixed_counters_num = MIN(pmu.nr_fixed_counters, ARRAY_SIZE(fixed_events));
>> +	if (pmu.nr_fixed_counters > ARRAY_SIZE(fixed_events))
>> +		report_info("Fixed counters number %d > defined fixed events %ld. "
> Doesn't compile on 32-bit builds.  Easiest thing is to cast ARRAY_SIZE, because
> size_t is different between 32-bit and 64-bit.

But ARRAY_SIZE() should return the same value regardless of 32-bit or
64-bit, right?


>
>> +			    "Please update test case.", pmu.nr_fixed_counters,
>> +			    ARRAY_SIZE(fixed_events));
>> +
>>  	apic_write(APIC_LVTPC, PMI_VECTOR);
>>  
>>  	check_counters();
>> -- 
>> 2.40.1
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure
  2025-02-14 21:07   ` Sean Christopherson
@ 2025-02-18  9:36     ` Mi, Dapeng
  0 siblings, 0 replies; 33+ messages in thread
From: Mi, Dapeng @ 2025-02-18  9:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Like Xu, Jinrong Liang, Dapeng Mi


On 2/15/2025 5:07 AM, Sean Christopherson wrote:
> On Sat, Sep 14, 2024, Dapeng Mi wrote:
>> +static void warm_up(void)
>> +{
>> +	int i = 8;
>> +
>> +	/*
>> +	 * Since cycles event is always run as the first event, there would be
>> +	 * a warm-up state to warm up the cache, it leads to the measured cycles
>> +	 * value may exceed the pre-defined cycles upper boundary and cause
>> +	 * false positive. To avoid this, introduce an warm-up state before
>> +	 * the real verification.
>> +	 */
>> +	while (i--)
>> +		loop();
> Use a for-loop.

Sure.


>
>> +}
>> +
>>  static void check_counters(void)
>>  {
>>  	if (is_fep_available())
>>  		check_emulated_instr();
>>  
>> +	warm_up();
>>  	check_gp_counters();
>>  	check_fixed_counters();
>>  	check_rdpmc();
>> -- 
>> 2.40.1
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification
  2025-02-14 21:08   ` Sean Christopherson
@ 2025-02-18  9:40     ` Mi, Dapeng
  0 siblings, 0 replies; 33+ messages in thread
From: Mi, Dapeng @ 2025-02-18  9:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Like Xu, Jinrong Liang, Dapeng Mi


On 2/15/2025 5:08 AM, Sean Christopherson wrote:
> On Sat, Sep 14, 2024, Dapeng Mi wrote:
>> If HW supports GLOBAL_CTRL MSR, enabling and disabling PMCs are moved in
>> __precise_count_loop(). Thus, instructions and branches events can be
>> verified against a precise count instead of a rough range.
>>
>> BTW, some intermittent failures on AMD processors using PerfMonV2 is
>> seen due to variance in counts. This probably has to do with the way
>> instructions leading to a VM-Entry or VM-Exit are accounted when
>> counting retired instructions and branches.
> AMD counts VMRUN as a branch in guest context.

Good to know. Thanks.


>
>> +	 * We see some intermittent failures on AMD processors using PerfMonV2
>> +	 * due to variance in counts. This probably has to do with the way
>> +	 * instructions leading to a VM-Entry or VM-Exit are accounted when
>> +	 * counting retired instructions and branches. Thus only enable the
>> +	 * precise validation for Intel processors.
>> +	 */
>> +	if (pmu.is_intel && this_cpu_has_perf_global_ctrl()) {
>> +		/* instructions event */
> These comments are useless.

Sure.


>
>> +		gp_events[instruction_idx].min = LOOP_INSTRNS;
>> +		gp_events[instruction_idx].max = LOOP_INSTRNS;
>> +		/* branches event */
>> +		gp_events[branch_idx].min = LOOP_BRANCHES;
>> +		gp_events[branch_idx].max = LOOP_BRANCHES;
>> +	}
>> +}

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event
  2025-02-14 21:09   ` Sean Christopherson
@ 2025-02-18  9:42     ` Mi, Dapeng
  0 siblings, 0 replies; 33+ messages in thread
From: Mi, Dapeng @ 2025-02-18  9:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi


On 2/15/2025 5:09 AM, Sean Christopherson wrote:
> On Sat, Sep 14, 2024, Dapeng Mi wrote:
>> @@ -205,6 +208,17 @@ static void adjust_events_range(struct pmu_event *gp_events,
>>  		gp_events[branch_idx].min = LOOP_BRANCHES;
>>  		gp_events[branch_idx].max = LOOP_BRANCHES;
>>  	}
>> +
>> +	/*
>> +	 * For CPUs without IBPB support, no way to force to trigger a
>> +	 * branch miss and the measured branch misses is possible to be
>> +	 * 0. Thus overwrite the lower boundary of branch misses event
>> +	 * to 0 to avoid false positive.
>> +	 */
>> +	if (!has_ibpb()) {
>> +		/* branch misses event */
> This comment is worse than useless, because it necessitates curly braces.

Ah. Yes.


>
>> +		gp_events[branch_miss_idx].min = 0;
>> +	}
>>  }

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events
  2025-02-18  9:34     ` Mi, Dapeng
@ 2025-02-18 15:04       ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2025-02-18 15:04 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Tue, Feb 18, 2025, Dapeng Mi wrote:
> 
> On 2/15/2025 5:07 AM, Sean Christopherson wrote:
> > On Sat, Sep 14, 2024, Dapeng Mi wrote:
> >> @@ -744,6 +753,12 @@ int main(int ac, char **av)
> >>  	printf("Fixed counters:      %d\n", pmu.nr_fixed_counters);
> >>  	printf("Fixed counter width: %d\n", pmu.fixed_counter_width);
> >>  
> >> +	fixed_counters_num = MIN(pmu.nr_fixed_counters, ARRAY_SIZE(fixed_events));
> >> +	if (pmu.nr_fixed_counters > ARRAY_SIZE(fixed_events))
> >> +		report_info("Fixed counters number %d > defined fixed events %ld. "
> > Doesn't compile on 32-bit builds.  Easiest thing is to cast ARRAY_SIZE, because
> > size_t is different between 32-bit and 64-bit.
> 
> But ARRAY_SIZE() should return the same value regardless of 32-bit or
> 64-bit, right?

Yep.  The value is the same, but the type "returned" by sizeof() is different.
On 32-bit, it's an "unsigned int"; on 64-bit, it's an unsigned long.  I.e. the
size of sizeof() is different (sorry, couldn't resist).
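
For illustration, a cast along those lines could look like the following
(a sketch of the hunk quoted above, not the exact follow-up fix):

	if (pmu.nr_fixed_counters > ARRAY_SIZE(fixed_events))
		report_info("Fixed counters number %d > defined fixed events %u.  "
			    "Please update test case.", pmu.nr_fixed_counters,
			    (uint32_t)ARRAY_SIZE(fixed_events));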

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many()
  2025-02-18  9:24     ` Mi, Dapeng
@ 2025-02-18 15:56       ` Sean Christopherson
  0 siblings, 0 replies; 33+ messages in thread
From: Sean Christopherson @ 2025-02-18 15:56 UTC (permalink / raw)
  To: Dapeng Mi
  Cc: Paolo Bonzini, kvm, linux-kernel, Jim Mattson, Mingwei Zhang,
	Xiong Zhang, Zhenyu Wang, Like Xu, Jinrong Liang, Yongwei Ma,
	Dapeng Mi

On Tue, Feb 18, 2025, Dapeng Mi wrote:
> On 2/15/2025 5:06 AM, Sean Christopherson wrote:
> > On Sat, Sep 14, 2024, Dapeng Mi wrote:
> >> Considering there are already 8 GP counters and 4 fixed counters on
> >> latest Intel processors, like Sapphire Rapids. The original cnt[] array
> >> length 10 is definitely not enough to cover all supported PMU counters on
> >> these new processors even through currently KVM only supports 3 fixed
> >> counters at most. This would cause out of bound memory access and may trigger
> >> false alarm on PMU counter validation
> >>
> >> It's probably more and more GP and fixed counters are introduced in the
> >> future and then directly extends the cnt[] array length to 48 once and
> >> for all. Base on the layout of IA32_PERF_GLOBAL_CTRL and
> >> IA32_PERF_GLOBAL_STATUS, 48 looks enough in near feature.
> >>
> >> Reviewed-by: Jim Mattson <jmattson@google.com>
> >> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> >> ---
> >>  x86/pmu.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/x86/pmu.c b/x86/pmu.c
> >> index a0268db8..b4de2680 100644
> >> --- a/x86/pmu.c
> >> +++ b/x86/pmu.c
> >> @@ -255,7 +255,7 @@ static void check_fixed_counters(void)
> >>  
> >>  static void check_counters_many(void)
> >>  {
> >> -	pmu_counter_t cnt[10];
> >> +	pmu_counter_t cnt[48];
> > ARGH.  Since the *entire* purpose of increasing the size is to guard against
> > buffer overflow, add an assert that the loop doesn't overflow.
> 
> This is not only for ensuring no buffer overflow.

In practice, it is.  As is, there are *zero* sanity checks or restrictions on the
number of possible counters.  Yes, the net effect is that the test doesn't work
if a CPU supports more than ARRAY_SIZE(cnt) counters, but the reason the test
doesn't work is because such a CPU would cause buffer overflow.

Yes, there are more nuanced reasons for using a large, statically sized array.
If the goal was to support any theoretical CPU, the array would be dynamically
allocated, but that's not worth the complexity.  If the goal just was to support
SPR, the size would have been set to 12, but that would incur additional maintenance
in the not-too-distant future.

> As the commit message says, the counter number has already exceeded 10; for
> example, SPR has 12 counters (8 GP + 4 fixed),

I am well aware.

> and there will be more counters on later platforms. The aim of enlarging the
> array size is to ensure these counters can be enabled and verified
> simultaneously. 48 may be too large; 32 should be fair enough. Thanks.

No.  Just no.  Unless there is an architecturally defined limit, and even then a
sanity check is strongly encouraged, KVM-related software should never, ever blindly
assume a buffer size is "good enough".

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-02-18 15:56 UTC | newest]

Thread overview: 33+ messages
2024-09-14 10:17 [kvm-unit-tests patch v6 00/18] pmu test bugs fix and improvements Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 01/18] x86: pmu: Remove duplicate code in pmu_init() Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 02/18] x86: pmu: Remove blank line and redundant space Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 03/18] x86: pmu: Refine fixed_events[] names Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 04/18] x86: pmu: Fix the issue that pmu_counter_t.config crosses cache line Dapeng Mi
2025-02-14 21:05   ` Sean Christopherson
2025-02-18  9:07     ` Mi, Dapeng
2024-09-14 10:17 ` [kvm-unit-tests patch v6 05/18] x86: pmu: Enlarge cnt[] length to 48 in check_counters_many() Dapeng Mi
2025-02-14 21:06   ` Sean Christopherson
2025-02-18  9:24     ` Mi, Dapeng
2025-02-18 15:56       ` Sean Christopherson
2024-09-14 10:17 ` [kvm-unit-tests patch v6 06/18] x86: pmu: Print measured event count if test fails Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 07/18] x86: pmu: Fix potential out of bound access for fixed events Dapeng Mi
2025-02-14 21:07   ` Sean Christopherson
2025-02-18  9:34     ` Mi, Dapeng
2025-02-18 15:04       ` Sean Christopherson
2024-09-14 10:17 ` [kvm-unit-tests patch v6 08/18] x86: pmu: Fix cycles event validation failure Dapeng Mi
2025-02-14 21:07   ` Sean Christopherson
2025-02-18  9:36     ` Mi, Dapeng
2024-09-14 10:17 ` [kvm-unit-tests patch v6 09/18] x86: pmu: Use macro to replace hard-coded branches event index Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 10/18] x86: pmu: Use macro to replace hard-coded ref-cycles " Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 11/18] x86: pmu: Use macro to replace hard-coded instructions " Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 12/18] x86: pmu: Enable and disable PMCs in loop() asm blob Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 13/18] x86: pmu: Improve instruction and branches events verification Dapeng Mi
2025-02-14 21:08   ` Sean Christopherson
2025-02-18  9:40     ` Mi, Dapeng
2024-09-14 10:17 ` [kvm-unit-tests patch v6 14/18] x86: pmu: Improve LLC misses event verification Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 15/18] x86: pmu: Adjust lower boundary of llc-misses event to 0 for legacy CPUs Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 16/18] x86: pmu: Add IBPB indirect jump asm blob Dapeng Mi
2024-09-14 10:17 ` [kvm-unit-tests patch v6 17/18] x86: pmu: Adjust lower boundary of branch-misses event Dapeng Mi
2025-02-14 21:09   ` Sean Christopherson
2025-02-18  9:42     ` Mi, Dapeng
2024-09-14 10:17 ` [kvm-unit-tests patch v6 18/18] x86: pmu: Optimize emulated instruction validation Dapeng Mi
