public inbox for kvm@vger.kernel.org
From: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
To: Sean Christopherson <seanjc@google.com>,
	kernel test robot <oliver.sang@intel.com>,
	g@google.com
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
	Maxim Levitsky <mlevitsk@redhat.com>,
	kvm@vger.kernel.org, xudong.hao@intel.com
Subject: Re: [linux-next:master] [KVM] 7803339fa9: kernel-selftests.kvm.pmu_counters_test.fail
Date: Wed, 15 Jan 2025 10:44:43 +0800	[thread overview]
Message-ID: <a2adf1b8-c394-4741-a42b-32288657b07e@linux.intel.com> (raw)
In-Reply-To: <Z4a_PmUVVmUtOd4p@google.com>


On 1/15/2025 3:47 AM, Sean Christopherson wrote:
> +Dapeng
>
> On Tue, Jan 14, 2025, kernel test robot wrote:
>> we found the test fails on a Cooper Lake, not sure if this is expected.
>> full report below, FYI.
>>
>>
>> kernel test robot noticed "kernel-selftests.kvm.pmu_counters_test.fail" on:
>>
>> commit: 7803339fa929387bbc66479532afbaf8cbebb41b ("KVM: selftests: Use data load to trigger LLC references/misses in Intel PMU")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> [test failed on linux-next/master 37136bf5c3a6f6b686d74f41837a6406bec6b7bc]
>>
>> in testcase: kernel-selftests
>> version: kernel-selftests-x86_64-7503345ac5f5-1_20241208
>> with following parameters:
>>
>> 	group: kvm
>>
>> config: x86_64-rhel-9.4-kselftests
>> compiler: gcc-12
>> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> *sigh*
>
> This fails on our Skylake and Cascade Lake systems, but I only tested an Emerald
> Rapids.
>
>> # Testing fixed counters, PMU version 0, perf_caps = 2000
>> # Testing arch events, PMU version 1, perf_caps = 0
>> # ==== Test Assertion Failure ====
>> #   x86/pmu_counters_test.c:129: count >= (10 * 4 + 5)
>> #   pid=6278 tid=6278 errno=4 - Interrupted system call
>> #      1	0x0000000000411281: assert_on_unhandled_exception at processor.c:625
>> #      2	0x00000000004075d4: _vcpu_run at kvm_util.c:1652
>> #      3	 (inlined by) vcpu_run at kvm_util.c:1663
>> #      4	0x0000000000402c5e: run_vcpu at pmu_counters_test.c:62
>> #      5	0x0000000000402e4d: test_arch_events at pmu_counters_test.c:315
>> #      6	0x0000000000402663: test_arch_events at pmu_counters_test.c:304
>> #      7	 (inlined by) test_intel_counters at pmu_counters_test.c:609
>> #      8	 (inlined by) main at pmu_counters_test.c:642
>> #      9	0x00007f3b134f9249: ?? ??:0
>> #     10	0x00007f3b134f9304: ?? ??:0
>> #     11	0x0000000000402900: _start at ??:?
>> #   count >= NUM_INSNS_RETIRED
> The failure is on top-down slots.  I modified the assert to actually print the
> count (I'll make sure to post a patch regardless of where this goes), and based
> on the count for failing vs. passing, I'm pretty sure the issue is not the extra
> instruction, but instead is due to changing the target of the CLFLUSHOPT from the
> address of the code to the address of kvm_pmu_version.
>
> However, I think the blame lies with the assertion itself, i.e. with commit
> 4a447b135e45 ("KVM: selftests: Test top-down slots event in x86's pmu_counters_test").
> Either that or top-down slots is broken on the Lakes.
>
> By my rudimentary measurements, tying the number of available slots to the number
> of instructions *retired* is fundamentally flawed.  E.g. on the Lakes (SKX is more
> or less identical to CLX), omitting the CLFLUSHOPT entirely results in *more*
> slots being available throughout the lifetime of the measured section.
>
> My best guess is that flushing the cache line used for the data load causes the
> backend to saturate its slots with prefetching data, and as a result the number
> of slots that are available goes down.
>
>         CLFLUSHOPT .    | CLFLUSHOPT [%m]       | NOP
> CLX     350-100         | 20-60[*]              | 135-150  
> SPR     49000-57000     | 32500-41000           | 6760-6830
>
> [*] CLX had a few outliers in the 200-400 range, but the majority of runs were
>     in the 20-60 range.
>
> Reading through more (and more and more) of the TMA documentation, I don't think
> we can assume anything about the number of available slots, beyond a very basic
> assertion that it's practically impossible for there to never be an available
> slot.  IIUC, retiring an instruction does NOT require an available slot, rather
> it requires the opposite: an occupied slot for the uop(s).

I'm not quite sure about this. IIUC, retiring an instruction may not need a
cycle, but it needs at least one slot unless the instruction is macro-fused.
Anyway, let me double-check with our micro-architecture and perf experts.


>
> I'm mildly curious as to why the counts for SPR are orders of magnitude higher
> than CLX (simple accounting differences?), but I don't think it changes anything
> in the test itself.
>
> Unless someone has a better idea, my plan is to post a patch to assert that the
> top-down slots count is non-zero, not that it's >= instructions retired.  E.g.
>
> diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> index accd7ecd3e5f..21acedcd46cd 100644
> --- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> +++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> @@ -123,10 +123,8 @@ static void guest_assert_event_count(uint8_t idx,
>                 fallthrough;
>         case INTEL_ARCH_CPU_CYCLES_INDEX:
>         case INTEL_ARCH_REFERENCE_CYCLES_INDEX:
> -               GUEST_ASSERT_NE(count, 0);
> -               break;
>         case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
> -               GUEST_ASSERT(count >= NUM_INSNS_RETIRED);
> +               GUEST_ASSERT_NE(count, 0);
>                 break;
>         default:
>                 break;
>


Thread overview: 8+ messages
2025-01-14  2:40 [linux-next:master] [KVM] 7803339fa9: kernel-selftests.kvm.pmu_counters_test.fail kernel test robot
2025-01-14 19:47 ` Sean Christopherson
2025-01-15  2:44   ` Mi, Dapeng [this message]
2025-01-17  3:04     ` Mi, Dapeng
2025-01-17 17:11       ` Sean Christopherson
2025-01-20  2:02         ` Mi, Dapeng
2025-01-21 16:13           ` Sean Christopherson
2025-01-22  1:26             ` Mi, Dapeng
