From: "Mi, Dapeng" <dapeng1.mi@linux.intel.com>
To: Sean Christopherson <seanjc@google.com>,
kernel test robot <oliver.sang@intel.com>,
g@google.com
Cc: oe-lkp@lists.linux.dev, lkp@intel.com,
Maxim Levitsky <mlevitsk@redhat.com>,
kvm@vger.kernel.org, xudong.hao@intel.com
Subject: Re: [linux-next:master] [KVM] 7803339fa9: kernel-selftests.kvm.pmu_counters_test.fail
Date: Wed, 15 Jan 2025 10:44:43 +0800
Message-ID: <a2adf1b8-c394-4741-a42b-32288657b07e@linux.intel.com>
In-Reply-To: <Z4a_PmUVVmUtOd4p@google.com>

On 1/15/2025 3:47 AM, Sean Christopherson wrote:
> +Dapeng
>
> On Tue, Jan 14, 2025, kernel test robot wrote:
>> we found that the test failed on a Cooper Lake; not sure if this is expected.
>> Full report below, FYI.
>>
>>
>> kernel test robot noticed "kernel-selftests.kvm.pmu_counters_test.fail" on:
>>
>> commit: 7803339fa929387bbc66479532afbaf8cbebb41b ("KVM: selftests: Use data load to trigger LLC references/misses in Intel PMU")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> [test failed on linux-next/master 37136bf5c3a6f6b686d74f41837a6406bec6b7bc]
>>
>> in testcase: kernel-selftests
>> version: kernel-selftests-x86_64-7503345ac5f5-1_20241208
>> with following parameters:
>>
>> group: kvm
>>
>> config: x86_64-rhel-9.4-kselftests
>> compiler: gcc-12
>> test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
> *sigh*
>
> This fails on our Skylake and Cascade Lake systems, but I only tested an Emerald
> Rapids.
>
>> # Testing fixed counters, PMU version 0, perf_caps = 2000
>> # Testing arch events, PMU version 1, perf_caps = 0
>> # ==== Test Assertion Failure ====
>> # x86/pmu_counters_test.c:129: count >= (10 * 4 + 5)
>> # pid=6278 tid=6278 errno=4 - Interrupted system call
>> # 1 0x0000000000411281: assert_on_unhandled_exception at processor.c:625
>> # 2 0x00000000004075d4: _vcpu_run at kvm_util.c:1652
>> # 3 (inlined by) vcpu_run at kvm_util.c:1663
>> # 4 0x0000000000402c5e: run_vcpu at pmu_counters_test.c:62
>> # 5 0x0000000000402e4d: test_arch_events at pmu_counters_test.c:315
>> # 6 0x0000000000402663: test_arch_events at pmu_counters_test.c:304
>> # 7 (inlined by) test_intel_counters at pmu_counters_test.c:609
>> # 8 (inlined by) main at pmu_counters_test.c:642
>> # 9 0x00007f3b134f9249: ?? ??:0
>> # 10 0x00007f3b134f9304: ?? ??:0
>> # 11 0x0000000000402900: _start at ??:?
>> # count >= NUM_INSNS_RETIRED
> The failure is on top-down slots. I modified the assert to actually print the
> count (I'll make sure to post a patch regardless of where this goes), and based
> on the count for failing vs. passing, I'm pretty sure the issue is not the extra
> instruction, but instead is due to changing the target of the CLFLUSHOPT from the
> address of the code to the address of kvm_pmu_version.
>
> However, I think the blame lies with the assertion itself, i.e. with commit
> 4a447b135e45 ("KVM: selftests: Test top-down slots event in x86's pmu_counters_test").
> Either that or top-down slots is broken on the Lakes.
>
> By my rudimentary measurements, tying the number of available slots to the number
> of instructions *retired* is fundamentally flawed. E.g. on the Lakes (SKX is more
> or less identical to CLX), omitting the CLFLUSHOPT entirely results in *more*
> slots being available throughout the lifetime of the measured section.
>
> My best guess is that flushing the cache line used for the data load causes the
> backend to saturate its slots with prefetching data, and as a result the number
> of slots that are available goes down.
>
>      CLFLUSHOPT .  | CLFLUSHOPT [%m] | NOP
> CLX  350-100       | 20-60[*]        | 135-150
> SPR  49000-57000   | 32500-41000     | 6760-6830
>
> [*] CLX had a few outliers in the 200-400 range, but the majority of runs were
> in the 20-60 range.
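
To make the numbers concrete: the failing assertion requires
count >= (10 * 4 + 5), i.e. at least 45. Below is a minimal sketch that
compares the observed ranges against that threshold, assuming the
"CLFLUSHOPT [%m]" column above is representative. The struct and function
names are hypothetical and not part of the selftest.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold from the failing assertion: count >= (10 * 4 + 5). */
#define NUM_INSNS_RETIRED (10 * 4 + 5)

/* Hypothetical container for one observed min/max range from the table. */
struct slots_range {
	const char *cpu;
	uint64_t lo;
	uint64_t hi;
};

/* Ranges copied from the "CLFLUSHOPT [%m]" column of the table. */
const struct slots_range clflushopt_mem[] = {
	{ "CLX", 20, 60 },
	{ "SPR", 32500, 41000 },
};

/* True if at least one count in the range would trip the old assertion. */
bool old_assert_can_fail(const struct slots_range *r)
{
	return r->lo < NUM_INSNS_RETIRED;
}

/* True if every count in the range would trip the old assertion. */
bool old_assert_always_fails(const struct slots_range *r)
{
	return r->hi < NUM_INSNS_RETIRED;
}
```

Under these assumptions the CLX range straddles the threshold, so the old
assertion can fail there without failing on every run, while the SPR range
sits far above it. That would be consistent with the test passing on Emerald
Rapids and failing on the Lakes.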
>
> Reading through more (and more and more) of the TMA documentation, I don't think
> we can assume anything about the number of available slots, beyond a very basic
> assertion that it's practically impossible for there to never be an available
> slot. IIUC, retiring an instruction does NOT require an available slot, rather
> it requires the opposite: an occupied slot for the uop(s).
I'm not quite sure about this. IIUC, retiring an instruction may not need a
cycle, but it needs at least one slot unless the instruction is macro-fused.
Anyway, let me double-check with our micro-architecture and perf experts.
>
> I'm mildly curious as to why the counts for SPR are orders of magnitude higher
> than CLX (simple accounting differences?), but I don't think it changes anything
> in the test itself.
>
> Unless someone has a better idea, my plan is to post a patch to assert that the
> top-down slots count is non-zero, not that it's >= instructions retired. E.g.
>
> diff --git a/tools/testing/selftests/kvm/x86/pmu_counters_test.c b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> index accd7ecd3e5f..21acedcd46cd 100644
> --- a/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> +++ b/tools/testing/selftests/kvm/x86/pmu_counters_test.c
> @@ -123,10 +123,8 @@ static void guest_assert_event_count(uint8_t idx,
>  		fallthrough;
>  	case INTEL_ARCH_CPU_CYCLES_INDEX:
>  	case INTEL_ARCH_REFERENCE_CYCLES_INDEX:
> -		GUEST_ASSERT_NE(count, 0);
> -		break;
>  	case INTEL_ARCH_TOPDOWN_SLOTS_INDEX:
> -		GUEST_ASSERT(count >= NUM_INSNS_RETIRED);
> +		GUEST_ASSERT_NE(count, 0);
>  		break;
>  	default:
>  		break;
>
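
As a rough standalone illustration of the relaxed check in the diff above
(not the selftest code itself): the enum below is a hypothetical stand-in for
the INTEL_ARCH_*_INDEX constants, and event_count_ok() is an invented helper
that returns a bool instead of calling GUEST_ASSERT_NE().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the selftest's INTEL_ARCH_*_INDEX values. */
enum arch_event {
	EV_CPU_CYCLES,
	EV_REFERENCE_CYCLES,
	EV_TOPDOWN_SLOTS,
	EV_OTHER,
};

/*
 * Mirror of the proposed switch: CPU cycles, reference cycles and top-down
 * slots only need a non-zero count. Other events are not checked here.
 */
bool event_count_ok(enum arch_event ev, uint64_t count)
{
	switch (ev) {
	case EV_CPU_CYCLES:
	case EV_REFERENCE_CYCLES:
	case EV_TOPDOWN_SLOTS:
		return count != 0;
	default:
		return true;
	}
}
```

With this check, a top-down slots count of 20 (the low end of the CLX range
quoted above) passes, while a count of 0 still fails.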