Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information

From: maddy <maddy@linux.ibm.com>
To: Kim Phillips <kim.phillips@amd.com>,
	Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Andi Kleen <ak@linux.intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Jiri Olsa <jolsa@redhat.com>, LKML <linux-kernel@vger.kernel.org>,
	Stephane Eranian <eranian@google.com>,
	Adrian Hunter <adrian.hunter@intel.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	yao.jin@linux.intel.com, Ingo Molnar <mingo@redhat.com>,
	Paul Mackerras <paulus@samba.org>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Robert Richter <robert.richter@amd.com>,
	Namhyung Kim <namhyung@kernel.org>,
	linuxppc-dev@lists.ozlabs.org,
	Alexey Budankov <alexey.budankov@linux.intel.com>,
	"Liang, Kan" <kan.liang@linux.intel.com>
Subject: Re: [RFC 00/11] perf: Enhancing perf to export processor hazard information
Date: Thu, 26 Mar 2020 15:49:16 +0530
Message-ID: <965dba09-813a-59a7-9c10-97ed1c892245@linux.ibm.com>
In-Reply-To: <8803550e-5d6d-2eda-39f5-e4594052188c@amd.com>



On 3/18/20 11:05 PM, Kim Phillips wrote:
> Hi Maddy,
>
> On 3/17/20 1:50 AM, maddy wrote:
>> On 3/13/20 4:08 AM, Kim Phillips wrote:
>>> On 3/11/20 11:00 AM, Ravi Bangoria wrote:
>>>> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>>>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>>>>> Modern processors export such hazard data in Performance
>>>>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>>>>> AMD[3] provides similar information.
>>>>>>>>>>
>>>>>>>>>> Implementation detail:
>>>>>>>>>>
>>>>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>>>>> into generic format:
>>>>>>>>>>
>>>>>>>>>>       struct perf_pipeline_haz_data {
>>>>>>>>>>              /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>>>>              __u8    itype;
>>>>>>>>>>              /* Instruction Cache source */
>>>>>>>>>>              __u8    icache;
>>>>>>>>>>              /* Instruction suffered hazard in pipeline stage */
>>>>>>>>>>              __u8    hazard_stage;
>>>>>>>>>>              /* Hazard reason */
>>>>>>>>>>              __u8    hazard_reason;
>>>>>>>>>>              /* Instruction suffered stall in pipeline stage */
>>>>>>>>>>              __u8    stall_stage;
>>>>>>>>>>              /* Stall reason */
>>>>>>>>>>              __u8    stall_reason;
>>>>>>>>>>              __u16   pad;
>>>>>>>>>>       };
>>>>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>>>> It's not really 1:1, we don't have these separations of stages
>>>>>>> and reasons, for example: we have missed in L2 cache, for example.
>>>>>>> So IBS output is flatter, with more cycle latency figures than
>>>>>>> IBM's AFAICT.
>>>>>> AMD IBS captures pipeline latency data in the case of Fetch sampling, like
>>>>>> Fetch latency, tag to retire latency, completion to retire latency and
>>>>>> so on. Yes, Ops sampling does provide more load/store centric
>>>>>> information. But it also captures more detailed data for Branch instructions.
>>>>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>>>>> data and latency information.
>>>>>>
>>>>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>>>>> bad events.  IBS' PPR descriptions has one occurrence of the
>>>>>>> word stall, and no penalty.  The way I read IBS is it's just
>>>>>>> reporting more sample data than just the precise IP: things like
>>>>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>>>>> like 'extended', or the 'auxiliary' already used today even
>>>>>>> are more appropriate for IBS, although I'm the last person to
>>>>>>> bikeshed.
>>>>>> We are thinking of using "pipeline" word instead of Hazard.
>>>>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>>>> NP. We thought pipeline is generic hw term so we proposed "pipeline"
>>>> word. We are open to term which can be generic enough.
>>>>
>>>>> I realize there are a couple of core pipeline-specific pieces
>>>>> of information coming out of it, but the vast majority
>>>>> are addresses, latencies of various components in the memory
>>>>> hierarchy, and various component hit/miss bits.
>>>> Yes. We should capture core pipeline specific details. For example,
>>>> IBS generates Branch unit information (IbsOpData1) and Icache related
>>>> data (IbsFetchCtl), which is something that shouldn't be extended as
>>>> part of perf-mem, IMO.
>>> Sure, IBS Op-side output is more 'perf mem' friendly, and so it
>>> should populate perf_mem_data_src fields, just like POWER9 can:
>>>
>>> union perf_mem_data_src {
>>> ...
>>>                   __u64   mem_rsvd:24,
>>>                           mem_snoopx:2,   /* snoop mode, ext */
>>>                           mem_remote:1,   /* remote */
>>>                           mem_lvl_num:4,  /* memory hierarchy level number */
>>>                           mem_dtlb:7,     /* tlb access */
>>>                           mem_lock:2,     /* lock instr */
>>>                           mem_snoop:5,    /* snoop mode */
>>>                           mem_lvl:14,     /* memory hierarchy level */
>>>                           mem_op:5;       /* type of opcode */
>>>
>>>
>>> E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
>>> mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
>>> 'mem_lock', and the Reload Bus Source Encoding bits can
>>> be used to populate mem_snoop, right?
>> Hi Kim,
>>
>> Yes. We do expose these data as part of perf-mem for POWER.
> OK, I see relevant PERF_MEM_S bits in arch/powerpc/perf/isa207-common.c:
> isa207_find_source now, thanks.
>
>>> For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
>>> used for the ld/st target addresses, too.
>>>
>>>>> What's needed here is a vendor-specific extended
>>>>> sample information that all these technologies gather,
>>>>> of which things like e.g., 'L1 TLB cycle latency' we
>>>>> all should have in common.
>>>> Yes. We will include fields to capture the latency cycles (like Issue
>>>> latency, Instruction completion latency etc..) along with other pipeline
>>>> details in the proposed structure.
>>> Latency figures are just an example, and from what I
>>> can tell, struct perf_sample_data already has a 'weight' member,
>>> used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
>>> transfer memory access latency figures.  Granted, that's
>>> a bad name given all other vendors don't call latency
>>> 'weight'.
>>>
>>> I didn't see any latency figures coming out of POWER9,
>>> and do not expect this patchseries to implement those
>>> of other vendors, e.g., AMD's IBS; leave each vendor
>>> to amend perf to suit their own h/w output please.
>> The reference structure proposed in this patchset did not have members
>> to capture latency info for that exact reason. But the idea here is to
>> abstract as much vendor specific data as possible. So if we include a u16
>> array, then this format can also capture data from IBS, since it provides
>> a few latency details.
> OK, that sounds a bit different from the 6 x u8's + 1 u16 padded
> struct presented in this patchset.
>
> IBS Ops can report e.g.:
>
> 15 tag-to-retire cycles bits,
> 15 completion to retire count bits,
> 15 L1 DTLB refill latency bits,
> 15 DC miss latency bits,
> 5 outstanding memory requests on mem refill bits, and so on.
>
> IBS Fetch reports 15 bits of fetch latency, and another 16
> for iTLB latency, among others.
>
> Some of these may/may not be valid simultaneously, and
> there are IBS specific rules to establish validity.
>
>>> My main point there, however, was that each vendor should
>>> use streamlined record-level code to just copy the data
>>> in the proprietary format that their hardware produces,
>>> and then perf tooling can synthesize the events
>>> from the raw data at report/script/etc. time.
>>>
>>>>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>>>>> either.  Can we use PERF_SAMPLE_AUX instead?
>>>> We took a look at PERF_SAMPLE_AUX. IIUC, PERF_SAMPLE_AUX is intended when
>>>> large volume of data needs to be captured as part of perf.data without
>>>> frequent PMIs. But proposed type is to address the capture of pipeline
>>> SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
>>> PMIs are, even though it may be used in those environments.
>>>
>>>> information on each sample using PMI at periodic intervals. Hence proposing
>>>> PERF_SAMPLE_PIPELINE_HAZ.
>>> And that's fine for any extra bits that POWER9 has to convey
>>> to its users beyond things already represented by other sample
>>> types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
>>> and other vendor e.g., AMD IBS data can be made vendor-independent
>>> at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
>>> what IBS currently uses.
>> My bad. Not sure what you mean by this. We are trying to abstract
>> as much vendor specific data as possible with this (like perf-mem).
> Perhaps if I say it this way: instead of doing all the
> isa207_get_phazard_data() work past the mfspr(SPRN_SIER)
> in patch 4/11, rather/instead just put the raw sier value in a
> PERF_SAMPLE_RAW or _AUX event, and call perf_event_update_userpage.
> Specific SIER capabilities can be written as part of the perf.data
> header.  Then synthesize the true pipe events from the raw SIER
> values later, and in userspace.

Hi Kim,

I would like to stay away from the SAMPLE_RAW type because of these
comments in perf_event.h:

*      #
*      # The RAW record below is opaque data wrt the ABI
*      #
*      # That is, the ABI doesn't make any promises wrt to
*      # the stability of its content, it may vary depending
*      # on event, hardware, kernel version and phase of
*      # the moon.
*      #
*      # In other words, PERF_SAMPLE_RAW contents are not an ABI.
*      #

Secondly, sorry, I didn't understand your suggestion about using
PERF_SAMPLE_AUX. IIUC, SAMPLE_AUX data goes to the AUX ring buffer, which
needs more memory and makes it more challenging to correlate and present
the pipeline details for each IP. IMO, a new sample type is useful for
capturing the pipeline data in perf_sample_data and, if _AUX is enabled,
it can also be made to push the data to the AUX buffer.
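
To make the record-time conversion under discussion concrete, here is a
minimal sketch of decoding one raw register value into the generic struct
from the cover letter, so that it can travel with the sample for that IP.
The struct layout is the one quoted above; the demo_* names and all bit
positions are made up for illustration and do not describe SIER, IBS or
any other real register.

```c
#include <assert.h>
#include <stdint.h>

/* Generic per-sample layout quoted from the RFC cover letter. */
struct perf_pipeline_haz_data {
	uint8_t  itype;          /* instruction/opcode type */
	uint8_t  icache;         /* instruction cache source */
	uint8_t  hazard_stage;   /* pipeline stage that saw the hazard */
	uint8_t  hazard_reason;  /* hazard reason */
	uint8_t  stall_stage;    /* pipeline stage that saw the stall */
	uint8_t  stall_reason;   /* stall reason */
	uint16_t pad;
};

/* Six u8 fields plus one u16 of padding: 8 bytes per sample. */
static_assert(sizeof(struct perf_pipeline_haz_data) == 8,
	      "generic hazard record is expected to be 8 bytes");

/*
 * HYPOTHETICAL bit layout, for illustration only: six 4-bit fields
 * packed from bit 0 upwards. No real register is laid out like this.
 */
#define DEMO_FIELD_MASK          0xfULL
#define DEMO_ITYPE_SHIFT         0
#define DEMO_ICACHE_SHIFT        4
#define DEMO_HAZ_STAGE_SHIFT     8
#define DEMO_HAZ_REASON_SHIFT    12
#define DEMO_STALL_STAGE_SHIFT   16
#define DEMO_STALL_REASON_SHIFT  20

/* Extract one 4-bit field from the raw value. */
static uint8_t demo_field(uint64_t raw, unsigned int shift)
{
	return (uint8_t)((raw >> shift) & DEMO_FIELD_MASK);
}

/* Decode one raw register value into the generic record. */
static struct perf_pipeline_haz_data demo_decode(uint64_t raw)
{
	struct perf_pipeline_haz_data haz = { 0 };

	haz.itype         = demo_field(raw, DEMO_ITYPE_SHIFT);
	haz.icache        = demo_field(raw, DEMO_ICACHE_SHIFT);
	haz.hazard_stage  = demo_field(raw, DEMO_HAZ_STAGE_SHIFT);
	haz.hazard_reason = demo_field(raw, DEMO_HAZ_REASON_SHIFT);
	haz.stall_stage   = demo_field(raw, DEMO_STALL_STAGE_SHIFT);
	haz.stall_reason  = demo_field(raw, DEMO_STALL_REASON_SHIFT);
	return haz;
}
```

With the raw-copy alternative, only the uint64_t would be stored at
record time and a function like demo_decode() would run in userspace at
report time instead.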

Maddy

>
> I guess it's technically optional, but I think that's how
> I'd do it in IBS, since it minimizes the record-time overhead.
>
> Thanks,
>
> Kim
>
>> Maddy
>>>>>    Take a look at
>>>>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>>>>> definitions".  The sample identifier can be used to determine
>>>>> which vendor's sampling IP's data is in it, and events can
>>>>> be recorded just by copying the content of the SIER, etc.
>>>>> registers, and then events get synthesized from the aux
>>>>> sample at report/inject/annotate etc. time.  This allows
>>>>> for less sample recording overhead, and moves all the vendor
>>>>> specific decoding and common event conversions for userspace
>>>>> to figure out.
>>>> When AUX buffer data is structured, tool side changes added to present the
>>>> pipeline data can be re-used.
>>> Not sure I understand: AUX data would be structured on
>>> each vendor's raw h/w register formats.
>>>
>>> Thanks,
>>>
>>> Kim
>>>
>>>>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>>>>> Profiling Extension) which is their version of IBS.
>>>>>>>> Whatever gets added need to cover all three with no limitations.
>>>>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>>>>> similar sample data in perf already, like with perf mem/c2c?
>>>>>> perf-mem is more of data centric in my opinion. It is more towards
>>>>>> memory profiling. So proposal here is to expose pipeline related
>>>>>> details like stalls and latencies.
>>>>> Like I said, I don't see it that way, I see it as "any particular
>>>>> vendor's event's extended details', and these pipeline details
>>>>> have overlap with existing infrastructure within perf, e.g., L2
>>>>> cache misses.
>>>>>
>>>>> Kim
>>>>>

Thread overview: 37+ messages
2020-03-02  5:23 [RFC 00/11] perf: Enhancing perf to export processor hazard information Ravi Bangoria
2020-03-02  5:23 ` [RFC 01/11] powerpc/perf: Simplify ISA207_SIER macros Ravi Bangoria
2020-03-02  5:23 ` [RFC 02/11] perf/core: Data structure to present hazard data Ravi Bangoria
2020-03-02  9:55   ` Peter Zijlstra
2020-03-02 14:23     ` maddy
2020-03-02 14:48   ` Mark Rutland
2020-03-03 14:32     ` Ravi Bangoria
2020-03-02 14:54   ` Mark Rutland
2020-03-03 14:31     ` Ravi Bangoria
2020-03-02  5:23 ` [RFC 03/11] powerpc/perf: Arch specific definitions for pipeline Ravi Bangoria
2020-03-02  5:23 ` [RFC 04/11] powerpc/perf: Arch support to expose Hazard data Ravi Bangoria
2020-03-02  5:23 ` [RFC 05/11] perf tools: Enable record and script to record and show hazard data Ravi Bangoria
2020-03-02  5:23 ` [RFC 06/11] perf hists: Make a room for hazard info in struct hist_entry Ravi Bangoria
2020-03-02  5:23 ` [RFC 07/11] perf hazard: Functions to convert generic hazard data to arch specific string Ravi Bangoria
2020-03-02  5:23 ` [RFC 08/11] perf report: Enable hazard mode Ravi Bangoria
2020-03-02  5:23 ` [RFC 09/11] perf annotate: Introduce type for annotation_line Ravi Bangoria
2020-03-02  5:23 ` [RFC 10/11] perf annotate: Preparation for hazard Ravi Bangoria
2020-03-02  5:23 ` [RFC 11/11] perf annotate: Show hazard data in tui mode Ravi Bangoria
2020-03-02 10:13 ` [RFC 00/11] perf: Enhancing perf to export processor hazard information Peter Zijlstra
2020-03-02 20:21   ` Stephane Eranian
2020-03-02 22:25     ` Kim Phillips
2020-03-05  4:46       ` Ravi Bangoria
2020-03-05 22:06         ` Kim Phillips
2020-03-11 16:00           ` Ravi Bangoria
2020-03-12 22:38             ` Kim Phillips
2020-03-17  6:50               ` maddy
2020-03-18 17:35                 ` Kim Phillips
2020-03-19 11:22                   ` Michael Ellerman
2020-03-26 10:19                   ` maddy [this message]
2020-03-26 19:48                     ` Kim Phillips
2020-04-20  7:09                       ` Madhavan Srinivasan
2020-04-27  7:18                         ` Madhavan Srinivasan
2020-03-05  4:28     ` maddy
2020-03-03  1:33   ` Andi Kleen
2020-03-05  5:06     ` Ravi Bangoria
2020-03-02 21:08 ` Paul Clarke
2020-03-05  5:06   ` Ravi Bangoria
