* KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
@ 2025-07-08 5:01 Aithal, Srikanth
2025-07-08 20:57 ` Sean Christopherson
0 siblings, 1 reply; 5+ messages in thread
From: Aithal, Srikanth @ 2025-07-08 5:01 UTC
To: KVM
Hello all,
The KVM unit test suite for SVM has been regressing on the AMD EPYC Turin
platform (Zen 5) for a while now, even on the latest linux-next
[https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20250704].
The same test seems to work fine with linux-next tag next-20250505.
The TSC delay test fails intermittently (approximately once in three
runs) with an unexpected result (expected: 50, actual: 49). This test
passed consistently on earlier tags (e.g., next-20250505) and on
non-Turin platforms.
[stdout] timeout -k 1s --foreground 90s
/home/VT_BUILD/usr/local/bin/qemu-system-x86_64 --no-reboot -nodefaults
-device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc
none -serial stdio -device pci-testdev -machine accel=kvm -kernel
/tmp/tmp.j05CJuifTf -smp 2 -cpu max,+svm -m 4g -append
-pause_filter_test # -initrd /tmp/tmp.rYXaENVjEC
[stdout] enabling apic
[stdout] smp: waiting for 1 APs
[stdout] enabling apic
[stdout] setup: CPU 1 online
[stdout] paging enabled
[stdout] cr0 = 80010011
[stdout] cr3 = 10bf000
[stdout] cr4 = 20
[stdout] NPT detected - running all tests with NPT enabled
[stdout] PASS: null
..
[stdout] PASS: tsc delay (expected: 42, actual: 42)
[stdout] INFO: duration=20, tsc_scale=670, tsc_offset=505748840785296448
[stdout] PASS: tsc delay (expected: 20, actual: 20)
[stdout] INFO: duration=9, tsc_scale=830, tsc_offset=8332629130251870915
[stdout] PASS: tsc delay (expected: 9, actual: 9)
[stdout] INFO: duration=46, tsc_scale=550, tsc_offset=65726211827426474
[stdout] PASS: tsc delay (expected: 46, actual: 46)
[stdout] PASS: tsc delay (expected: 50, actual: 50)
[stdout] FAIL: tsc delay (expected: 50, actual: 49)
[stdout] PASS: shutdown test passed
[stdout] SUMMARY: 285 tests, 1 unexpected failures
[stdout] FAIL svm (285 tests, 1 unexpected failures)
[stdlog] 2025-06-30 08:38:28 | Dependency was not fulfilled.
[stdlog] Not logging /sys/kernel/debug/sched_features (file not found)
[stdlog] Not logging /proc/pci (file not found)
Versions of qemu and ovmf used:
qemu: v9.2.3 as well as v10.0.1
ovmf: edk2-stable202502
If this issue is resolved, please include
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Regards
Srikanth Aithal
* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
2025-07-08 5:01 KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5) Aithal, Srikanth
@ 2025-07-08 20:57 ` Sean Christopherson
2025-07-24 3:59 ` Jim Mattson
0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2025-07-08 20:57 UTC
To: Srikanth Aithal; +Cc: KVM
On Tue, Jul 08, 2025, Srikanth Aithal wrote:
> Hello all,
> KVM unit test suite for SVM is regressing on the AMD EPYC Turin platform
> (Zen 5) for a while now, even on latest linux-next
> [https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20250704].
> The same seem to work fine with linux-next tag next-20250505.
> The TSC delay test fails intermittently (approximately once in three runs)
> with an unexpected result (expected: 50, actual: 49). This test passed
> consistently on earlier tags (e.g., next-20250505) and on non-Turin
> platforms.
Stating the obvious to some extent, I suspect it's something to do with Turin,
not a KVM issue. This fails on our Turin hosts as far back as v6.12, i.e. long
before next-20250505 (I haven't bothered checking earlier builds), and AFAICT
the KUT test isn't doing anything to actually stress KVM itself. I.e. I would
expect KVM bugs to manifest as blatant, 100% reproducible failures, not random
TSC slop.
FAIL: tsc delay (expected: 50, actual: 49)
SUMMARY: 13 tests, 1 unexpected failures
✘ ~/build/kut/x86 # uname -r
6.12.0-smp--adc218676eef-tsc
* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
2025-07-08 20:57 ` Sean Christopherson
@ 2025-07-24 3:59 ` Jim Mattson
2025-11-18 22:29 ` Sean Christopherson
0 siblings, 1 reply; 5+ messages in thread
From: Jim Mattson @ 2025-07-24 3:59 UTC
To: Sean Christopherson; +Cc: Srikanth Aithal, KVM
On Tue, Jul 8, 2025 at 1:58 PM Sean Christopherson <seanjc@google.com> wrote:
> Stating the obvious to some extent, I suspect it's something to do with Turin,
> not a KVM issue. This fails on our Turin hosts as far back as v6.12, i.e. long
> before next-20250505 (I haven't bothered checking earlier builds), and AFAICT
> the KUT test isn't doing anything to actually stress KVM itself. I.e. I would
> expect KVM bugs to manifest as blatant, 100% reproducible failures, not random
> TSC slop.
I think the final test case is broken, actually.
The test case is:
svm_tsc_scale_run_testcase(50, 0.0001, rdrand());
So, guest_tsc_delay_value is (u64)((50 << 24) * 0.0001), which is
83886. Note that this is 83886.080000000002 truncated.
If L2 exits after 83886 scaled TSC cycles, the "duration" spent in L2
will be (u64)(83886 / 0.0001) >> 24, which is 49. To get up to 50, we
have to accumulate an additional (0.080000000002 / 0.0001 =
800.0000000199999) cycles between the two rdtsc() operations
bracketing the svm_vmrun() in L1.
The test probably passes on other CPUs because emulated VMRUN and
#VMEXIT add those 800 cycles.
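The arithmetic above can be checked in isolation with a small sketch. This is only an illustration, not code from kvm-unit-tests: the helper names guest_delay_truncated() and min_observed_duration() are made up for this example, and TSC_SHIFT is assumed to be 24 because of the "<< 24" in the expressions above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the shift is 24, as in "(50 << 24)" above. */
#define TSC_SHIFT 24

/*
 * Hypothetical helper: the guest delay when the double-precision product
 * is truncated on conversion to an unsigned 64-bit integer.
 */
uint64_t guest_delay_truncated(uint64_t duration, double tsc_scale)
{
	assert(tsc_scale > 0.0);
	return (uint64_t)((double)(duration << TSC_SHIFT) * tsc_scale);
}

/*
 * Hypothetical helper: the duration L1 would compute if L2 exits after
 * exactly guest_delay scaled cycles and no extra host cycles accumulate
 * between the two rdtsc() reads.
 */
uint64_t min_observed_duration(uint64_t guest_delay, double tsc_scale)
{
	assert(tsc_scale > 0.0);
	return (uint64_t)((double)guest_delay / tsc_scale) >> TSC_SHIFT;
}
```

Calling guest_delay_truncated(50, 0.0001) should give 83886, and feeding that into min_observed_duration() should give 49, i.e. the value reported in the failing run.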
Instead of truncating ((50 << 24) * 0.0001), I think we should
calculate guest_tsc_delay_value as ceil((50 << 24) * 0.0001).
Something like this:
diff --git a/x86/svm_tests.c b/x86/svm_tests.c
index 9358c1f0383a..1bfe11045bd1 100644
--- a/x86/svm_tests.c
+++ b/x86/svm_tests.c
@@ -891,6 +891,8 @@ static void svm_tsc_scale_run_testcase(u64 duration,
 	u64 start_tsc, actual_duration;
 
 	guest_tsc_delay_value = (duration << TSC_SHIFT) * tsc_scale;
+	if (guest_tsc_delay_value < (duration << TSC_SHIFT) * tsc_scale)
+		guest_tsc_delay_value++;
 
 	test_set_guest(svm_tsc_scale_guest);
 	vmcb->control.tsc_offset = tsc_offset;
Even then, equality of duration and actual_duration is only guaranteed
if there are no significant delays during the measurement.
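As a standalone illustration of the rounding idea, here is a sketch of the same compare-and-increment outside the test harness. The helper names guest_delay_rounded_up() and min_duration_seen_by_l1() are hypothetical, and TSC_SHIFT is again assumed to be 24.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the shift is 24, as in "(50 << 24)" earlier in this mail. */
#define TSC_SHIFT 24

/*
 * Hypothetical helper: round the scaled delay up instead of truncating,
 * using the same compare-and-increment as the patch (no libm needed).
 */
uint64_t guest_delay_rounded_up(uint64_t duration, double tsc_scale)
{
	double exact;
	uint64_t delay;

	assert(tsc_scale > 0.0);
	exact = (double)(duration << TSC_SHIFT) * tsc_scale;
	delay = (uint64_t)exact;	/* conversion truncates toward zero */
	if ((double)delay < exact)	/* a fractional part was dropped */
		delay++;
	return delay;
}

/*
 * Hypothetical helper: the duration L1 would compute if L2 exits after
 * exactly guest_delay scaled cycles with no additional host cycles.
 */
uint64_t min_duration_seen_by_l1(uint64_t guest_delay, double tsc_scale)
{
	assert(tsc_scale > 0.0);
	return (uint64_t)((double)guest_delay / tsc_scale) >> TSC_SHIFT;
}
```

For (50, 0.0001) the rounded-up delay should be 83887, which maps back to a duration of 50 even with zero extra host cycles. A scale whose product is already an integer, such as 255, is left unchanged. The upper side still has only finite slack: once roughly 1 << 24 extra host cycles accumulate, the computed duration would become 51.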
* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
2025-07-24 3:59 ` Jim Mattson
@ 2025-11-18 22:29 ` Sean Christopherson
2025-11-18 22:38 ` Sean Christopherson
0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2025-11-18 22:29 UTC
To: Jim Mattson; +Cc: Srikanth Aithal, KVM
On Wed, Jul 23, 2025, Jim Mattson wrote:
> I think the final test case is broken, actually.
>
> The test case is:
>
> svm_tsc_scale_run_testcase(50, 0.0001, rdrand());
>
> So, guest_tsc_delay_value is (u64)((50 << 24) * 0.0001), which is
> 83886. Note that this is 83886.080000000002 truncated.
>
> If L2 exits after 83886 scaled TSC cycles, the "duration" spent in L2
> will be (u64)(83886 / 0.0001) >> 24, which is 49. To get up to 50, we
> have to accumulate an additional (0.080000000002 / 0.0001 =
> 800.0000000199999) cycles between the two rdtsc() operations
> bracketing the svm_vmrun() in L1.
>
> The test probably passes on other CPUs because emulated VMRUN and
> #VMEXIT add those 800 cycles.
>
> Instead of truncating ((50 << 24) * 0.0001), I think we should
> calculate guest_tsc_delay_value as ceil((50 << 24) * 0.0001).
> Something like this:
>
> diff --git a/x86/svm_tests.c b/x86/svm_tests.c
> index 9358c1f0383a..1bfe11045bd1 100644
> --- a/x86/svm_tests.c
> +++ b/x86/svm_tests.c
> @@ -891,6 +891,8 @@ static void svm_tsc_scale_run_testcase(u64 duration,
>  	u64 start_tsc, actual_duration;
>
>  	guest_tsc_delay_value = (duration << TSC_SHIFT) * tsc_scale;
> +	if (guest_tsc_delay_value < (duration << TSC_SHIFT) * tsc_scale)
> +		guest_tsc_delay_value++;
>
>  	test_set_guest(svm_tsc_scale_guest);
>  	vmcb->control.tsc_offset = tsc_offset;
>
> Even then, equality of duration and actual_duration is only guaranteed
> if there are no significant delays during the measurement.
Wrote a changelog and applied this to kvm-x86 next. Thanks Jim!
[1/1] x86/svm: Account for numerical rounding errors in TSC scaling test
https://github.com/kvm-x86/linux/commit/5465145a
* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
2025-11-18 22:29 ` Sean Christopherson
@ 2025-11-18 22:38 ` Sean Christopherson
0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2025-11-18 22:38 UTC
To: Jim Mattson; +Cc: Srikanth Aithal, KVM
On Tue, Nov 18, 2025, Sean Christopherson wrote:
> Wrote a changelog and applied this to kvm-x86 next. Thanks Jim!
>
> [1/1] x86/svm: Account for numerical rounding errors in TSC scaling test
> https://github.com/kvm-x86/linux/commit/5465145a
Gah, my alias is hardcoded to point at linux, the actual commit is:
https://github.com/kvm-x86/kvm-unit-tests/commit/5465145a