kvm.vger.kernel.org archive mirror
* [Bug 218739] New: pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-04-17 14:29 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

            Bug ID: 218739
           Summary: pmu_counters_test kvm-selftest fails with (count !=
                    NUM_INSNS_RETIRED)
           Product: Virtualization
           Version: unspecified
          Hardware: Intel
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: kvm
          Assignee: virtualization_kvm@kernel-bugs.osdl.org
          Reporter: jarichte@redhat.com
        Regression: No

Environment:
CPU Architecture: x86_64, Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
Host OS: Fedora Rawhide
Host kernel: Linux Kernel 6.9.0-rc3
gcc: gcc (GCC) 14.0.1
Host kernel source: https://git.kernel.org/pub/scm/virt/kvm/kvm.git
Branch: master
Commit: 1c3bed8006691f485156153778192864c9d8e14f
Bug Detailed Description:
Assertion failure when executing the KVM selftest pmu_counters_test.

Reproducing Steps:
git clone https://git.kernel.org/pub/scm/virt/kvm/kvm.git
cd kvm && make headers_install
cd tools/testing/selftests/kvm && make
cd x86_64 && ./pmu_counters_test

Actual Result:
Testing arch events, PMU version 0, perf_caps = 0
Testing GP counters, PMU version 0, perf_caps = 0
Testing fixed counters, PMU version 0, perf_caps = 0
Testing arch events, PMU version 0, perf_caps = 2000
Testing GP counters, PMU version 0, perf_caps = 2000
Testing fixed counters, PMU version 0, perf_caps = 2000
Testing arch events, PMU version 1, perf_caps = 0
==== Test Assertion Failure ====
x86_64/pmu_counters_test.c:107: count == NUM_INSNS_RETIRED
pid=51128 tid=51128 errno=4 - Interrupted system call
1       0x0000000000402c7d: run_vcpu at pmu_counters_test.c:61
2       0x0000000000402ead: test_arch_events at pmu_counters_test.c:307
3       0x0000000000402674: test_arch_events at pmu_counters_test.c:296
4        (inlined by) test_intel_counters at pmu_counters_test.c:601
5        (inlined by) main at pmu_counters_test.c:635
6       0x00007f78bd1981c7: ?? ??:0
7       0x00007f78bd19828a: ?? ??:0
8       0x0000000000402924: _start at ??:?
0x12 != 0x11 (count != NUM_INSNS_RETIRED)

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.


* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-04-23  0:21 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

Dongli Zhang (dongli.zhang@oracle.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dongli.zhang@oracle.com

--- Comment #1 from Dongli Zhang (dongli.zhang@oracle.com) ---
Perhaps pmu_counters_test could print more information in the future, e.g.,
msr, msr_ctrl, their values, clflush, and whether forced emulation is in use?

Judging from the output alone, the number of instructions counted by
GUEST_MEASURE_EVENT() (0x12 = 18) does not match NUM_INSNS_RETIRED=17.
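
The last line of the assertion dump carries both operands in hex, so the size
of the mismatch can be read off mechanically. A minimal Python sketch, assuming
the line keeps the "0xA != 0xB (lhs != rhs)" shape shown in the report;
parse_mismatch is a hypothetical helper, not part of the selftests:

```python
import re

# Matches the final line of the assertion dump, e.g.
# "0x12 != 0x11 (count != NUM_INSNS_RETIRED)".
MISMATCH_RE = re.compile(
    r"(0x[0-9a-fA-F]+)\s*(==|!=)\s*(0x[0-9a-fA-F]+)\s*\((.+)\)"
)


def parse_mismatch(line):
    """Return (lhs, rhs, delta, description) for an assertion summary line."""
    m = MISMATCH_RE.search(line)
    if m is None:
        raise ValueError(f"not an assertion summary line: {line!r}")
    lhs = int(m.group(1), 16)
    rhs = int(m.group(3), 16)
    return lhs, rhs, lhs - rhs, m.group(4)


lhs, rhs, delta, what = parse_mismatch("0x12 != 0x11 (count != NUM_INSNS_RETIRED)")
print(f"{what}: measured {lhs}, expected {rhs}, delta {delta:+d}")
```

For the reported failure this gives measured 18, expected 17, i.e. exactly one
extra retired instruction was counted.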

------------------------

I tried on an Ice Lake server and could not reproduce the failure most of the
time; only once did I hit the failure below.

# ./pmu_counters_test 
Testing arch events, PMU version 0, perf_caps = 0
Testing GP counters, PMU version 0, perf_caps = 0
Testing fixed counters, PMU version 0, perf_caps = 0
Testing arch events, PMU version 0, perf_caps = 2000
Testing GP counters, PMU version 0, perf_caps = 2000
Testing fixed counters, PMU version 0, perf_caps = 2000
Testing arch events, PMU version 1, perf_caps = 0
Testing GP counters, PMU version 1, perf_caps = 0
Testing fixed counters, PMU version 1, perf_caps = 0
Testing arch events, PMU version 1, perf_caps = 2000
Testing GP counters, PMU version 1, perf_caps = 2000
Testing fixed counters, PMU version 1, perf_caps = 2000
Testing arch events, PMU version 2, perf_caps = 0
Testing GP counters, PMU version 2, perf_caps = 0
Testing fixed counters, PMU version 2, perf_caps = 0
Testing arch events, PMU version 2, perf_caps = 2000
Testing GP counters, PMU version 2, perf_caps = 2000
Testing fixed counters, PMU version 2, perf_caps = 2000
Testing arch events, PMU version 3, perf_caps = 0
Testing GP counters, PMU version 3, perf_caps = 0
Testing fixed counters, PMU version 3, perf_caps = 0
Testing arch events, PMU version 3, perf_caps = 2000
Testing GP counters, PMU version 3, perf_caps = 2000
Testing fixed counters, PMU version 3, perf_caps = 2000
Testing arch events, PMU version 4, perf_caps = 0
==== Test Assertion Failure ====
  x86_64/pmu_counters_test.c:120: count != 0
  pid=39696 tid=39696 errno=4 - Interrupted system call
     1  0x0000000000402baf: run_vcpu at pmu_counters_test.c:61
     2  0x0000000000402ddd: test_arch_events at pmu_counters_test.c:307
     3  0x0000000000402683: test_arch_events at pmu_counters_test.c:605
     4   (inlined by) test_intel_counters at pmu_counters_test.c:605
     5   (inlined by) main at pmu_counters_test.c:635
     6  0x00007fcfeb43ae44: ?? ??:0
     7  0x000000000040288d: _start at ??:?
  0x0 == 0x0 (count == 0)


# cat /sys/module/kvm/parameters/enable_pmu 
Y
# cat /sys/module/kvm/parameters/force_emulation_prefix 
0

# cpuid -l 0xa -1
CPU:
   Architecture Performance Monitoring Features (0xa):
      version ID                               = 0x5 (5)
      number of counters per logical processor = 0x8 (8)
      bit width of counter                     = 0x30 (48)
      length of EBX bit vector                 = 0x8 (8)
      core cycle event                         = available
      instruction retired event                = available
      reference cycles event                   = available
      last-level cache ref event               = available
      last-level cache miss event              = available
      branch inst retired event                = available
      branch mispred retired event             = available
      top-down slots event                     = available
      fixed counter  0 supported               = true
      fixed counter  1 supported               = true
      fixed counter  2 supported               = true
      fixed counter  3 supported               = true
      fixed counter  4 supported               = false
      fixed counter  5 supported               = false
      fixed counter  6 supported               = false
      fixed counter  7 supported               = false
      fixed counter  8 supported               = false
      fixed counter  9 supported               = false
      fixed counter 10 supported               = false
      fixed counter 11 supported               = false
      fixed counter 12 supported               = false
      fixed counter 13 supported               = false
      fixed counter 14 supported               = false
      fixed counter 15 supported               = false
      fixed counter 16 supported               = false
      fixed counter 17 supported               = false
      fixed counter 18 supported               = false
      fixed counter 19 supported               = false
      fixed counter 20 supported               = false
      fixed counter 21 supported               = false
      fixed counter 22 supported               = false
      fixed counter 23 supported               = false
      fixed counter 24 supported               = false
      fixed counter 25 supported               = false
      fixed counter 26 supported               = false
      fixed counter 27 supported               = false
      fixed counter 28 supported               = false
      fixed counter 29 supported               = false
      fixed counter 30 supported               = false
      fixed counter 31 supported               = false
      number of contiguous fixed counters      = 0x4 (4)
      bit width of fixed counters              = 0x30 (48)
      anythread deprecation                    = true
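
For readers without the cpuid tool: the summary fields above are bit fields of
CPUID leaf 0xA (EAX and EDX). A minimal Python sketch of the decoding; the raw
register values are hypothetical, reconstructed from the decoded output above
(the raw registers were not captured), and decode_cpuid_0a is an illustrative
helper name:

```python
def decode_cpuid_0a(eax, edx):
    """Decode the CPUID leaf 0xA summary fields from raw EAX/EDX values."""
    return {
        "version": eax & 0xFF,                            # EAX[7:0]
        "num_gp_counters": (eax >> 8) & 0xFF,             # EAX[15:8]
        "gp_counter_width": (eax >> 16) & 0xFF,           # EAX[23:16]
        "ebx_vector_length": (eax >> 24) & 0xFF,          # EAX[31:24]
        "num_fixed_counters": edx & 0x1F,                 # EDX[4:0]
        "fixed_counter_width": (edx >> 5) & 0xFF,         # EDX[12:5]
        "anythread_deprecation": bool(edx & (1 << 15)),   # EDX[15]
    }


# Hypothetical raw values chosen to match the decoded output above.
EAX = 0x08300805
EDX = 0x00008604

for name, value in decode_cpuid_0a(EAX, EDX).items():
    print(f"{name:24} = {value}")
```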



-------------------------------------------

I also ran the test on a nested L1 hypervisor (on more legacy hardware). Most
runs passed, except one.

# ./pmu_counters_test
Testing arch events, PMU version 0, perf_caps = 0
Testing GP counters, PMU version 0, perf_caps = 0
Testing fixed counters, PMU version 0, perf_caps = 0
Testing arch events, PMU version 0, perf_caps = 2000
Testing GP counters, PMU version 0, perf_caps = 2000
Testing fixed counters, PMU version 0, perf_caps = 2000
Testing arch events, PMU version 1, perf_caps = 0
Testing GP counters, PMU version 1, perf_caps = 0
Testing fixed counters, PMU version 1, perf_caps = 0
Testing arch events, PMU version 1, perf_caps = 2000
Testing GP counters, PMU version 1, perf_caps = 2000
Testing fixed counters, PMU version 1, perf_caps = 2000
Testing arch events, PMU version 2, perf_caps = 0
Testing GP counters, PMU version 2, perf_caps = 0
Testing fixed counters, PMU version 2, perf_caps = 0
Testing arch events, PMU version 2, perf_caps = 2000
Testing GP counters, PMU version 2, perf_caps = 2000
Testing fixed counters, PMU version 2, perf_caps = 2000
Testing arch events, PMU version 3, perf_caps = 0
==== Test Assertion Failure ====
  x86_64/pmu_counters_test.c:120: count != 0
  pid=9301 tid=9301 errno=4 - Interrupted system call
     1  0x0000000000402bdf: run_vcpu at pmu_counters_test.c:61
     2  0x0000000000402dfd: test_arch_events at pmu_counters_test.c:307
     3  0x00000000004026a3: test_arch_events at pmu_counters_test.c:605
     4   (inlined by) test_intel_counters at pmu_counters_test.c:605
     5   (inlined by) main at pmu_counters_test.c:635
     6  0x00007f05e2f60d8f: ?? ??:0
     7  0x00007f05e2f60e3f: ?? ??:0
     8  0x00000000004028b4: _start at ??:?
  0x0 == 0x0 (count == 0)

* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-05-27 18:19 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

mlevitsk@redhat.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mlevitsk@redhat.com

--- Comment #2 from mlevitsk@redhat.com ---
Adding my 2 cents:

I also see this test fail sometimes (once per hour or so of continuous
running), and in my case it fails because of the 'count != 0' assert on the
INTEL_ARCH_LLC_MISSES_INDEX event, and only for this event.

The reason, IMHO, is that it is possible to have 0 LLC misses if the cache
is large enough and the code has run for enough iterations.

I wasn't able to make the test fail for any other reason (so far I have only
tested the non-nested case; nested, this test also fails sometimes).
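
A failure that shows up roughly once per hour is easiest to catch with a soak
loop that reruns the binary until it fails. A minimal Python sketch, assuming
the test exits non-zero when an assertion fires; soak is a hypothetical helper,
and the demo uses a stand-in command rather than the real binary:

```python
import subprocess
import sys


def soak(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns (runs_executed, output_of_first_failure), where the output is
    None if every run exited with status 0.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return run, result.stdout + result.stderr
    return max_runs, None


# Stand-in command that fails immediately; replace it with the path to the
# test binary (e.g. ["./pmu_counters_test"] in the directory it was built in).
stand_in = [sys.executable, "-c", "import sys; print('boom'); sys.exit(1)"]
runs, output = soak(stand_in, max_runs=5)
print(f"stopped after {runs} run(s)")
if output is not None:
    print(output, end="")
```

Keeping the output of the first failing run matters here, because the failing
PMU version and assertion line differ between the reports in this thread.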

Best regards,
        Maxim Levitsky


* Re: [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: Sean Christopherson @ 2024-05-28 17:20 UTC
  To: bugzilla-daemon; +Cc: kvm

On Mon, May 27, 2024, bugzilla-daemon@kernel.org wrote:
> I also see this test fail sometimes (once per hour or so of continuous
> running), and in my case it fails because of the 'count != 0' assert on the
> INTEL_ARCH_LLC_MISSES_INDEX event, and only for this event.
> 
> The reason, IMHO, is that it is possible to have 0 LLC misses if the cache
> is large enough and the code has run for enough iterations.

The test does CLFLUSH{,OPT} on its future code sequence after enabling the counter.
In theory, that should guarantee an LLC miss.

Hmm, but this SDM blurb about speculative loads makes me think past me was wrong.

  (that is, data can be speculatively loaded into a cache line just before, during,
   or after the execution of a CLFLUSH instruction that references the cache line).
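
The concern can be illustrated with a deliberately tiny model. This is a toy
Python sketch, not a description of real hardware: it assumes a single cache
line, and it assumes that a speculative refill is not itself counted as a miss,
which is an open question rather than an established fact. demand_misses is a
hypothetical name.

```python
def demand_misses(flush_first, speculative_refill):
    """Count demand misses on one code line in a one-line toy 'LLC'.

    The line starts out cached. A flush evicts it; a speculative fetch may
    bring it back before the demand fetch, in which case the demand fetch
    hits. Assumption: the speculative refill is not counted as a miss.
    """
    cached = True
    misses = 0
    if flush_first:
        cached = False      # CLFLUSH evicts the line
    if speculative_refill:
        cached = True       # refill wins the race against the demand fetch
    if not cached:
        misses += 1         # demand fetch has to go to memory
        cached = True
    return misses


print(demand_misses(flush_first=True, speculative_refill=False))  # flush sticks
print(demand_misses(flush_first=True, speculative_refill=True))   # flush undone
```

Under these assumptions the second case reports zero misses even though the
line was flushed, which is the shape of the 'count != 0' failure.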


* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-05-28 17:20 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

--- Comment #3 from Sean Christopherson (seanjc@google.com) ---
On Mon, May 27, 2024, bugzilla-daemon@kernel.org wrote:
> I also see this test fail sometimes (once per hour or so of continuous
> running), and in my case it fails because of the 'count != 0' assert on the
> INTEL_ARCH_LLC_MISSES_INDEX event, and only for this event.
> 
> The reason, IMHO, is that it is possible to have 0 LLC misses if the cache
> is large enough and the code has run for enough iterations.

The test does CLFLUSH{,OPT} on its future code sequence after enabling the
counter. In theory, that should guarantee an LLC miss.

Hmm, but this SDM blurb about speculative loads makes me think past me was
wrong.

  (that is, data can be speculatively loaded into a cache line just before,
   during, or after the execution of a CLFLUSH instruction that references the
   cache line).


* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-06-10 19:22 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

--- Comment #4 from mlevitsk@redhat.com ---
I did some more testing:

1. I double-checked that INTEL_ARCH_LLC_MISSES_INDEX is the only event that
fails: the test survived a whole night with only that event commented out.

2. Using CLFLUSH instead of CLFLUSHOPT doesn't help.



Should we disable this event for now to avoid the failure until we figure out
how to use this event in a reliable way?

Best regards,
    Maxim Levitsky

PS: I also did some initial testing of running this test nested. It fails with
invalid guest state in L1, but only sometimes.
I'll investigate that further soon.


* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-06-20 21:28 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

--- Comment #5 from mlevitsk@redhat.com ---
I tested several approaches to eliminate the issue, but none of them seems to
be very robust.

In particular:
 - I tried to clflush a global memory location outside of the loop, then
   access it. 0 LLC misses still happen once in a while.

 - I also tried to access a location on the stack.
   Here the test sometimes started failing on INTEL_ARCH_TOPDOWN_SLOTS_INDEX;
   I am not sure why. I did push/pop, maybe ucode is smart enough to elide
   this?


I have now found a more or less robust solution, which is to clflush on each
loop iteration.

That both increases the chances of at least one clflush working and should
also confuse the speculation logic enough.

It survived about 4 hours of testing.

I attached a draft patch with this solution; if you think that it is
reasonable, I'll send it to LKML.

Note that I dropped the mfence instruction, thinking that it doesn't help
much: it helps with memory loads/stores, whereas we clflush memory that is
fetched for code execution.
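
The "at least one clflush working" argument can be put as back-of-the-envelope
arithmetic. A minimal Python sketch, assuming each flush is independently
undone with the same probability; both the independence and the probability
value are illustrative assumptions, not measurements, and the function names
are hypothetical:

```python
def p_zero_misses(p_undone, flushes):
    """Probability that every one of `flushes` flushes is undone.

    Assumes each flush is undone independently with probability p_undone.
    """
    return p_undone ** flushes


def p_at_least_one_miss(p_undone, flushes):
    """Probability that at least one flush results in a miss."""
    return 1.0 - p_zero_misses(p_undone, flushes)


# p_undone = 0.01 is a made-up value used only to show the trend.
for flushes in (1, 2, 5, 10):
    print(flushes, p_zero_misses(0.01, flushes))
```

Under this model the chance of a zero count shrinks geometrically with the
number of flushes, which would be consistent with the failure becoming too
rare to observe in a multi-hour run.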


* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-06-20 21:35 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

--- Comment #6 from mlevitsk@redhat.com ---
Created attachment 306480
  --> https://bugzilla.kernel.org/attachment.cgi?id=306480&action=edit
Patch to do CLFLUSH on each iteration of the loop


* [Bug 218739] pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED)
From: bugzilla-daemon @ 2024-06-21 15:14 UTC
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=218739

--- Comment #7 from mlevitsk@redhat.com ---
I ran the test overnight without a single failure.


Thread overview: 9+ messages
2024-04-17 14:29 [Bug 218739] New: pmu_counters_test kvm-selftest fails with (count != NUM_INSNS_RETIRED) bugzilla-daemon
2024-04-23  0:21 ` [Bug 218739] " bugzilla-daemon
2024-05-27 18:19 ` bugzilla-daemon
2024-05-28 17:20   ` Sean Christopherson
2024-05-28 17:20 ` bugzilla-daemon
2024-06-10 19:22 ` bugzilla-daemon
2024-06-20 21:28 ` bugzilla-daemon
2024-06-20 21:35 ` bugzilla-daemon
2024-06-21 15:14 ` bugzilla-daemon
