KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)

public inbox for kvm@vger.kernel.org

* KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
From: Aithal, Srikanth @ 2025-07-08  5:01 UTC
  To: KVM

Hello all,
The KVM unit test suite for SVM has been regressing on the AMD EPYC Turin
platform (Zen 5) for a while now, even on the latest linux-next tag,
next-20250704
[https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20250704].
The same tests seem to work fine with linux-next tag next-20250505.
The TSC delay test fails intermittently (approximately once in three
runs) with an unexpected result (expected: 50, actual: 49). This test
passed consistently on earlier tags (e.g., next-20250505) and on
non-Turin platforms.

[stdout] timeout -k 1s --foreground 90s 
/home/VT_BUILD/usr/local/bin/qemu-system-x86_64 --no-reboot -nodefaults 
-device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc 
none -serial stdio -device pci-testdev -machine accel=kvm -kernel 
/tmp/tmp.j05CJuifTf -smp 2 -cpu max,+svm -m 4g -append 
-pause_filter_test # -initrd /tmp/tmp.rYXaENVjEC
[stdout] enabling apic
[stdout] smp: waiting for 1 APs
[stdout] enabling apic
[stdout] setup: CPU 1 online
[stdout] paging enabled
[stdout] cr0 = 80010011
[stdout] cr3 = 10bf000
[stdout] cr4 = 20
[stdout] NPT detected - running all tests with NPT enabled
[stdout] PASS: null
..
[stdout] PASS: tsc delay (expected: 42, actual: 42)
[stdout] INFO: duration=20, tsc_scale=670, tsc_offset=505748840785296448
[stdout] PASS: tsc delay (expected: 20, actual: 20)
[stdout] INFO: duration=9, tsc_scale=830, tsc_offset=8332629130251870915
[stdout] PASS: tsc delay (expected: 9, actual: 9)
[stdout] INFO: duration=46, tsc_scale=550, tsc_offset=65726211827426474
[stdout] PASS: tsc delay (expected: 46, actual: 46)
[stdout] PASS: tsc delay (expected: 50, actual: 50)
[stdout] FAIL: tsc delay (expected: 50, actual: 49)
[stdout] PASS: shutdown test passed
[stdout] SUMMARY: 285 tests, 1 unexpected failures
[stdout] FAIL svm (285 tests, 1 unexpected failures)
[stdlog] 2025-06-30 08:38:28 | Dependency was not fulfilled.
[stdlog] Not logging /sys/kernel/debug/sched_features (file not found)
[stdlog] Not logging /proc/pci (file not found)

Versions of QEMU and OVMF used:
qemu: v9.2.3 as well as v10.0.1
ovmf: edk2-stable202502

If this issue is resolved, please include:
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>

Regards
Srikanth Aithal



* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
From: Sean Christopherson @ 2025-07-08 20:57 UTC
  To: Srikanth Aithal; +Cc: KVM

On Tue, Jul 08, 2025, Srikanth Aithal wrote:
> Hello all,
> The KVM unit test suite for SVM has been regressing on the AMD EPYC Turin
> platform (Zen 5) for a while now, even on the latest linux-next tag,
> next-20250704
> [https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20250704].
> The same tests seem to work fine with linux-next tag next-20250505.
> The TSC delay test fails intermittently (approximately once in three runs)
> with an unexpected result (expected: 50, actual: 49). This test passed
> consistently on earlier tags (e.g., next-20250505) and on non-Turin
> platforms.

Stating the obvious to some extent, I suspect it's something to do with Turin,
not a KVM issue.  This fails on our Turin hosts as far back as v6.12, i.e. long
before next-20250505 (I haven't bothered checking earlier builds), and AFAICT
the KUT test isn't doing anything to actually stress KVM itself.  I.e. I would
expect KVM bugs to manifest as blatant, 100% reproducible failures, not random
TSC slop.

  FAIL: tsc delay (expected: 50, actual: 49)
  SUMMARY: 13 tests, 1 unexpected failures
  ✘ ~/build/kut/x86 # uname -r
  6.12.0-smp--adc218676eef-tsc


* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
From: Jim Mattson @ 2025-07-24  3:59 UTC
  To: Sean Christopherson; +Cc: Srikanth Aithal, KVM

On Tue, Jul 8, 2025 at 1:58 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Stating the obvious to some extent, I suspect it's something to do with Turin,
> not a KVM issue.  This fails on our Turin hosts as far back as v6.12, i.e. long
> before next-20250505 (I haven't bothered checking earlier builds), and AFAICT
> the KUT test isn't doing anything to actually stress KVM itself.  I.e. I would
> expect KVM bugs to manifest as blatant, 100% reproducible failures, not random
> TSC slop.

I think the final test case is broken, actually.

The test case is:

    svm_tsc_scale_run_testcase(50, 0.0001, rdrand());

So, guest_tsc_delay_value is (u64)((50 << 24) * 0.0001), which is
83886. Note that this is 83886.080000000002 truncated.

If L2 exits after 83886 scaled TSC cycles, the "duration" spent in L2
will be (u64)(83886 / 0.0001) >> 24, which is 49. To get up to 50, we
have to accumulate an additional (0.080000000002 / 0.0001 =
800.0000000199999) cycles between the two rdtsc() operations
bracketing the svm_vmrun() in L1.

The test probably passes on other CPUs because emulated VMRUN and
#VMEXIT add those 800 cycles.
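
The arithmetic above can be checked with a small user-space sketch. This
is only an illustration, not code from kvm-unit-tests: the helper names
(truncated_delay, observed_duration, slack_cycles) are hypothetical,
TSC_SHIFT is assumed to be 24 (matching the "50 << 24" above), and the
model assumes L2 exits exactly when the scaled delay elapses, with no
extra cycles spent in L1.

```c
#include <stdint.h>

#define TSC_SHIFT 24 /* assumed, matching the "50 << 24" above */

/* Delay handed to L2, in scaled TSC cycles: the product is truncated. */
static uint64_t truncated_delay(uint64_t duration, double tsc_scale)
{
	return (uint64_t)((double)(duration << TSC_SHIFT) * tsc_scale);
}

/*
 * Duration seen by L1 if L2 exits after exactly `delay` scaled cycles and
 * L1 adds nothing on top (an idealized model, not the real measurement).
 */
static uint64_t observed_duration(uint64_t delay, double tsc_scale)
{
	return (uint64_t)((double)delay / tsc_scale) >> TSC_SHIFT;
}

/* Unscaled cycles that would have to elapse in L1 to offset the truncation. */
static double slack_cycles(uint64_t duration, double tsc_scale)
{
	double exact = (double)(duration << TSC_SHIFT) * tsc_scale;

	return (exact - (double)truncated_delay(duration, tsc_scale)) / tsc_scale;
}
```

With these helpers, truncated_delay(50, 0.0001) gives 83886,
observed_duration(83886, 0.0001) gives 49, and slack_cycles(50, 0.0001)
comes out at roughly 800.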

Instead of truncating ((50 << 24) * 0.0001), I think we should
calculate guest_tsc_delay_value as ceil((50 << 24) * 0.0001).
Something like this:

diff --git a/x86/svm_tests.c b/x86/svm_tests.c
index 9358c1f0383a..1bfe11045bd1 100644
--- a/x86/svm_tests.c
+++ b/x86/svm_tests.c
@@ -891,6 +891,8 @@ static void svm_tsc_scale_run_testcase(u64 duration,
        u64 start_tsc, actual_duration;

        guest_tsc_delay_value = (duration << TSC_SHIFT) * tsc_scale;
+       if (guest_tsc_delay_value < (duration << TSC_SHIFT) * tsc_scale)
+               guest_tsc_delay_value++;

        test_set_guest(svm_tsc_scale_guest);
        vmcb->control.tsc_offset = tsc_offset;

Even then, equality of duration and actual_duration is only guaranteed
if there are no significant delays during the measurement.
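
For illustration, the rounding-up approach in the diff above can be
exercised outside the test harness. The following is a minimal user-space
sketch, not the code that was applied: ceil_delay and observed_duration
are hypothetical helpers, TSC_SHIFT is assumed to be 24, and the model
ignores any cycles L1 spends around svm_vmrun().

```c
#include <stdint.h>

#define TSC_SHIFT 24 /* assumed, matching the "50 << 24" above */

/* Round the scaled delay up instead of truncating, as in the diff above. */
static uint64_t ceil_delay(uint64_t duration, double tsc_scale)
{
	double exact = (double)(duration << TSC_SHIFT) * tsc_scale;
	uint64_t delay = (uint64_t)exact;

	/* Bump by one only when the conversion dropped a fractional part. */
	if ((double)delay < exact)
		delay++;

	return delay;
}

/* Idealized duration seen by L1 when L2 runs for exactly `delay` scaled cycles. */
static uint64_t observed_duration(uint64_t delay, double tsc_scale)
{
	return (uint64_t)((double)delay / tsc_scale) >> TSC_SHIFT;
}
```

In this model, ceil_delay(50, 0.0001) yields 83887, which maps back to a
duration of 50, while a case with no fractional part (for example a scale
of 0.5) is left unchanged.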


* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
From: Sean Christopherson @ 2025-11-18 22:29 UTC
  To: Jim Mattson; +Cc: Srikanth Aithal, KVM

On Wed, Jul 23, 2025, Jim Mattson wrote:
> 
> I think the final test case is broken, actually.
> 
> The test case is:
> 
>     svm_tsc_scale_run_testcase(50, 0.0001, rdrand());
> 
> So, guest_tsc_delay_value is (u64)((50 << 24) * 0.0001), which is
> 83886. Note that this is 83886.080000000002 truncated.
> 
> If L2 exits after 83886 scaled TSC cycles, the "duration" spent in L2
> will be (u64)(83886 / 0.0001) >> 24, which is 49. To get up to 50, we
> have to accumulate an additional (0.080000000002 / 0.0001 =
> 800.0000000199999) cycles between the two rdtsc() operations
> bracketing the svm_vmrun() in L1.
> 
> The test probably passes on other CPUs because emulated VMRUN and
> #VMEXIT add those 800 cycles.
> 
> Instead of truncating ((50 << 24) * 0.0001), I think we should
> calculate guest_tsc_delay_value as ceil((50 << 24) * 0.0001).
> Something like this:
> 
> diff --git a/x86/svm_tests.c b/x86/svm_tests.c
> index 9358c1f0383a..1bfe11045bd1 100644
> --- a/x86/svm_tests.c
> +++ b/x86/svm_tests.c
> @@ -891,6 +891,8 @@ static void svm_tsc_scale_run_testcase(u64 duration,
>         u64 start_tsc, actual_duration;
> 
>         guest_tsc_delay_value = (duration << TSC_SHIFT) * tsc_scale;
> +       if (guest_tsc_delay_value < (duration << TSC_SHIFT) * tsc_scale)
> +               guest_tsc_delay_value++;
> 
>         test_set_guest(svm_tsc_scale_guest);
>         vmcb->control.tsc_offset = tsc_offset;
> 
> Even then, equality of duration and actual_duration is only guaranteed
> if there are no significant delays during the measurement.

Wrote a changelog and applied this to kvm-x86 next.  Thanks Jim!

[1/1] x86/svm: Account for numerical rounding errors in TSC scaling test
      https://github.com/kvm-x86/linux/commit/5465145a


* Re: KVM Unit Test Suite Regression on AMD EPYC Turin (Zen 5)
From: Sean Christopherson @ 2025-11-18 22:38 UTC
  To: Jim Mattson; +Cc: Srikanth Aithal, KVM

On Tue, Nov 18, 2025, Sean Christopherson wrote:
> Wrote a changelog and applied this to kvm-x86 next.  Thanks Jim!
> 
> [1/1] x86/svm: Account for numerical rounding errors in TSC scaling test
>       https://github.com/kvm-x86/linux/commit/5465145a

Gah, my alias is hardcoded to point at linux; the actual commit is:

  https://github.com/kvm-x86/kvm-unit-tests/commit/5465145a
