Re: [6.1.7][6.2-rc5] perf all metrics test: FAILED!

From: "Liang, Kan" <kan.liang@linux.intel.com>
To: Ian Rogers <irogers@google.com>
Cc: sedat.dilek@gmail.com, "Xing, Zhengjun" <zhengjun.xing@intel.com>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@kernel.org>, Namhyung Kim <namhyung@kernel.org>,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Nick Desaulniers <ndesaulniers@google.com>,
	Nathan Chancellor <natechancellor@gmail.com>,
	llvm@lists.linux.dev, Ben Hutchings <benh@debian.org>,
	James Clark <james.clark@arm.com>,
	Stephane Eranian <eranian@google.com>
Subject: Re: [6.1.7][6.2-rc5] perf all metrics test: FAILED!
Date: Wed, 1 Feb 2023 14:06:54 -0500
Message-ID: <df8d710c-2543-520e-fe82-dbc8b2a47950@linux.intel.com>
In-Reply-To: <CAP-5=fXHDuB5G+ujVCoe81H+-Y=J8ig6TZ-dGZHknskb7Z53ig@mail.gmail.com>



On 2023-02-01 12:02 p.m., Ian Rogers wrote:
> On Wed, Feb 1, 2023 at 7:28 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>> Hi Ian,
>>
>> On 2023-01-30 10:55 p.m., Ian Rogers wrote:
>>>>> There's a question about what we should do in the perf test about
>>>>> this? I have a few solutions:
>>>>>
>>>>> 1) try metric tests again with the --metric-no-group flag and don't
>>>>> fail the test if this succeeds. This allows kernel bugs to hide, so
>>>>> I'm not a huge fan.
>>>>>
>>>>> 2) add a new metric flag/constraint to say not to group, this way the
>>>>> metric will automatically apply the "--metric-no-group" flag. It is a
>>>>> bit of work to wire this up but this kind of failure is common enough
>>>>> in PMUs that it is probably worthwhile. We also need to add the flag
>>>>> to metrics and I'm not sure how to get a good list of the metrics that
>>>>> currently fail and require it. This is okay but error prone.
>>>>>
>>>>> 3) fix the kernel bug and let the perf test fail until an adequate
>>>>> kernel is installed. Probably the best option.
>>>>>
>>>> Hi Ian,
>>>>
>>>> I can confirm:
>>>>
>>>> $ echo 0 | sudo tee /proc/sys/kernel/kptr_restrict /proc/sys/kernel/perf_event_paranoid
>>>> 0
>>>>
>>>> $ ~/bin/perf stat -M tma_l3_bound --metric-no-group -a sleep 1
>>>>
>>>> Performance counter stats for 'system wide':
>>>>
>>>>         2.058.892      MEM_LOAD_UOPS_RETIRED.LLC_HIT         #      1,5 % tma_l3_bound   (99,30%)
>>>>       173.254.697      CYCLE_ACTIVITY.STALLS_L2_PENDING                                 (99,10%)
>>>>     2.396.130.501      CPU_CLK_UNHALTED.THREAD                                          (99,60%)
>>>>         1.110.486      MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS                              (99,53%)
>>>>
>>>>       1,001989022 seconds time elapsed
>>>>
>>>> $ ~/bin/perf stat -M tma_dram_bound --metric-no-group -a sleep 1
>>>>
>>>> Performance counter stats for 'system wide':
>>>>
>>>>         1.729.208      MEM_LOAD_UOPS_RETIRED.LLC_HIT         #      1,2 % tma_dram_bound (99,50%)
>>>>        50.346.734      CYCLE_ACTIVITY.STALLS_L2_PENDING                                 (99,50%)
>>>>     2.354.963.862      CPU_CLK_UNHALTED.THREAD                                          (99,80%)
>>>>           306.500      MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS                              (99,61%)
>>>>
>>>>       1,001981392 seconds time elapsed
>>>>
>>>> Thanks!
>>> Thanks, apparently it is an issue with SandyBridge/IvyBridge that some
>>> counters on one hyperthread will limit what can be on the other. I
>>> believe that's the comment related to EXCL access here:
>>> https://github.com/torvalds/linux/blob/master/arch/x86/events/intel/core.c#L124
>>> So you may have more success with the metric if you disable
>>> hyperthreading, but I imagine that's not a popular option.
>>
>> Thanks for debugging the issue. Yes, it's caused by the HT workaround
>> for SNB/IVB/HSW.
>>
>> The weak group check in the kernel is in validate_group(). It only does
>> a sanity check. It doesn't check all the workarounds and the current
>> status of counters (e.g., whether the fixed counter is occupied by NMI
>> watchdog.) It's possible that a false positive is returned to the perf
>> tool. I once tried to fix the NMI watchdog check in the kernel, but the
>> proposal was rejected. So the metric constraint is introduced.
>>
>> For this issue, I think option 2 above is the better and more
>> practical choice. The issue is only observed on old machines, which
>> usually have a stable kernel running on them. I don't think users want
>> to update their kernel just to work around an issue with several metrics.
>> But it should be much easier for them to update the perf tool.
>>
>> We know that the below events are the problematic events.
>> /* MEM_UOPS_RETIRED.* */
>> /* MEM_LOAD_UOPS_RETIRED.* */
>> /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.* */
>> /* MEM_LOAD_UOPS_LLC_MISS_RETIRED.* */
>> Can we update the converter script and apply the "--metric-no-group"
>> flag or add a new constraint if the above events are detected on
>> SNB/IVB/HSW?
>>
>> Thanks,
>> Kan
> 
> Thanks Kan,
> 
> We absolutely can do that! In this case should it be --metric-no-group
> only when SMT is enabled? I can do some patches but would like to know
> about whether we need SMT and not SMT versions of --metric-no-group.

The kernel workaround is disabled when SMT is off, so I think we only
need the SMT version of --metric-no-group.
https://lore.kernel.org/all/1416251225-17721-13-git-send-email-eranian@google.com/T/#u
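
As a rough illustration of that SMT-only behaviour (not how perf implements
it), a Python sketch could read the SMT state from sysfs and append the flag
only when SMT is active. The helper names and the needs_no_group parameter
are hypothetical; /sys/devices/system/cpu/smt/active is the kernel's sysfs
file reporting whether SMT is currently active.

```python
from pathlib import Path

# sysfs file that reads "1" when SMT is active and "0" otherwise.
SMT_ACTIVE_PATH = Path("/sys/devices/system/cpu/smt/active")


def smt_is_active(path: Path = SMT_ACTIVE_PATH) -> bool:
    """Return True if sysfs reports SMT as active, False if off or unknown."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # File missing or unreadable: treat SMT as not active.
        return False


def metric_args(metric: str, needs_no_group: bool, smt_active: bool) -> list[str]:
    """Build 'perf stat' arguments for one metric (hypothetical helper).

    --metric-no-group is added only when the metric is known to need it
    AND SMT is active, since the kernel workaround is disabled otherwise.
    """
    args = ["stat", "-M", metric]
    if needs_no_group and smt_active:
        args.append("--metric-no-group")
    return args


print(metric_args("tma_l3_bound", True, smt_is_active()))
```

With this shape there is a single code path, and the non-SMT case simply
leaves the metric grouped as usual.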

> Also, should we just have a list of metrics that need the flag or try
> to automate detection? 

I don't think Intel will update the metrics or events for the old
SNB/IVB/HSW platforms. Hard-coding a list of metrics may be simpler than
automated detection.

> Some warts in detection are the names of the
> events that vary between Ivybridge and Sandybridge, and how to
> determine which events conflict. For example, the perfmon event data:
> 
> MEM_LOAD_UOPS_RETIRED.LLC_HIT
> https://github.com/intel/perfmon/blob/main/IVB/events/ivybridge_core.json#L5368
> MEM_LOAD_UOPS_RETIRED.LLC_MISS
> https://github.com/intel/perfmon/blob/main/IVB/events/ivybridge_core.json#L5431
> CYCLE_ACTIVITY.STALLS_L2_PENDING
> https://github.com/intel/perfmon/blob/main/IVB/events/ivybridge_core.json#L3541
>

The problematic events should have the same names across these platforms.
If matching on the event name doesn't work, the event encoding is exactly
the same across those platforms, so the encoding can be matched instead.
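
For what it's worth, name-based detection could look roughly like the Python
sketch below. It is only a sketch: the prefixes are the four event families
listed earlier in this thread, while the model names and the needs_no_group()
helper are hypothetical and not part of the converter script.

```python
# Event families affected by the HT workaround, as listed in this thread.
PROBLEM_EVENT_PREFIXES = (
    "MEM_UOPS_RETIRED.",
    "MEM_LOAD_UOPS_RETIRED.",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.",
    "MEM_LOAD_UOPS_LLC_MISS_RETIRED.",
)

# Hypothetical model identifiers for SNB/IVB/HSW.
AFFECTED_MODELS = {"sandybridge", "ivybridge", "haswell"}


def needs_no_group(model: str, events: list[str]) -> bool:
    """Return True if a metric on this model uses any problematic event."""
    if model not in AFFECTED_MODELS:
        return False
    # str.startswith accepts a tuple and matches any of its prefixes.
    return any(ev.startswith(PROBLEM_EVENT_PREFIXES) for ev in events)


# Events used by tma_l3_bound in the 'perf stat' output quoted above.
tma_l3_bound_events = [
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT",
    "CYCLE_ACTIVITY.STALLS_L2_PENDING",
    "CPU_CLK_UNHALTED.THREAD",
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS",
]
print(needs_no_group("ivybridge", tma_l3_bound_events))
```

Running the same check once over all SNB/IVB/HSW metrics would also be one
way to produce the hard-coded list.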


> The events list all counters, there are no errata fields. Should the
> event data be updated and then in the converter script handle that? If
> I get shown an example I can modify the script accordingly.

If it helps the converter script, I think we can update the errata
field.

Here is the errata information:
 * SNB: BJ122
 * IVB: BV98
 * HSW: HSD29

Here are the details of the issue (search for BV98):
https://www.intel.com/content/www/us/en/content-details/619604/desktop-3rd-generation-intel-core-processor-family-specification-update.html
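
If the errata field does get populated, the converter script could key off
it. The Python sketch below shows one possible shape, under assumptions: the
"EventName" and "Errata" keys, the platform short names, and the tag_errata()
helper are illustrative and not confirmed against the perfmon JSON schema or
the script. The erratum IDs are the ones listed above.

```python
# Erratum ID per platform, as listed above.
HT_ERRATA = {"SNB": "BJ122", "IVB": "BV98", "HSW": "HSD29"}

# Event families affected by the HT workaround, as listed in this thread.
PROBLEM_EVENT_PREFIXES = (
    "MEM_UOPS_RETIRED.",
    "MEM_LOAD_UOPS_RETIRED.",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.",
    "MEM_LOAD_UOPS_LLC_MISS_RETIRED.",
)


def tag_errata(platform: str, events: list[dict]) -> list[dict]:
    """Return copies of the event records with the erratum ID filled in.

    Only events in the problematic families are tagged, and only on
    platforms that have an entry in HT_ERRATA. Inputs are not modified.
    """
    erratum = HT_ERRATA.get(platform)
    tagged = []
    for event in events:
        event = dict(event)  # shallow copy so the caller's data is untouched
        name = event.get("EventName", "")
        if erratum and name.startswith(PROBLEM_EVENT_PREFIXES):
            event["Errata"] = erratum
        tagged.append(event)
    return tagged


sample = [
    {"EventName": "MEM_LOAD_UOPS_RETIRED.LLC_HIT"},
    {"EventName": "CPU_CLK_UNHALTED.THREAD"},
]
print(tag_errata("IVB", sample))
```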
> 
> It is also hard for me to test anything other than SMT on Ivybridge.
> 

I think it's OK to test only on Ivybridge.
The original kernel patch indicates that the issue is the same on SNB, IVB
and HSW.
https://lore.kernel.org/all/1416251225-17721-7-git-send-email-eranian@google.com/T/#u

Thanks,
Kan
