From: Ben Gainey <Ben.Gainey@arm.com>
To: "linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-perf-users@vger.kernel.org"
<linux-perf-users@vger.kernel.org>
Cc: "alexander.shishkin@linux.intel.com"
<alexander.shishkin@linux.intel.com>,
"peterz@infradead.org" <peterz@infradead.org>,
"acme@kernel.org" <acme@kernel.org>,
"mingo@redhat.com" <mingo@redhat.com>,
Mark Rutland <Mark.Rutland@arm.com>,
"adrian.hunter@intel.com" <adrian.hunter@intel.com>,
"irogers@google.com" <irogers@google.com>,
"jolsa@kernel.org" <jolsa@kernel.org>,
"will@kernel.org" <will@kernel.org>,
"namhyung@kernel.org" <namhyung@kernel.org>
Subject: Re: [RFC PATCH 0/2] perf: A mechanism for efficient support for per-function metrics
Date: Wed, 14 Feb 2024 09:40:57 +0000
Message-ID: <7b6c55a93f5064147b85795d471a8ae990817d3b.camel@arm.com>
In-Reply-To: <20240123113420.1928154-1-ben.gainey@arm.com>

Hi all,

Appreciate everyone is busy, but if you have some time I'd appreciate
some comments, particularly around whether this makes sense as an
Arm-only PMU feature or whether it has wider applicability.

Thanks
Ben
On Tue, 2024-01-23 at 11:34 +0000, Ben Gainey wrote:
> I've been working on an approach to supporting per-function metrics
> for aarch64 cores, which requires some changes to the arm_pmuv3
> driver, and I'm wondering if this approach would make sense as a
> generic feature that could be used to enable the same on other
> architectures?
>
> The basic idea is as follows:
>
> * Periodically sample one or more counters as needed for the chosen
> set of metrics.
> * Record a sample count for each symbol so as to identify hot
> functions.
> * Accumulate counter totals for each of the counters in each of the
> metrics *but* only do this where the previous sample's symbol
> matches the current sample's symbol.
>
> Discarding the counter deltas when the symbol changes is important to
> ensure that counters are correctly attributed to a single function;
> without this step the resulting metrics trend towards some average
> value across the top 'n' functions. It is acknowledged that this
> heuristic can fail, for example if samples land either side of a
> nested call, so a sufficiently small sample window (over which the
> counters are collected) must be used to reduce the risk of
> misattribution.
>
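> As a rough sketch of the post-processing step this implies (purely
> illustrative: the types and names below are invented for the example
> and are not taken from the actual script or patch):
>
> #include <stdint.h>
> #include <string.h>
>
> #define MAX_COUNTERS 8
>
> /* One sample, already resolved to a symbol by the tooling. */
> struct sample {
>     const char *symbol;
>     uint64_t deltas[MAX_COUNTERS]; /* counter deltas since last sample */
> };
>
> /* Per-symbol totals from which the per-function metrics are derived. */
> struct sym_stats {
>     uint64_t nr_samples;
>     uint64_t totals[MAX_COUNTERS];
> };
>
> /* 'lookup' stands in for whatever symbol->stats map the tool keeps. */
> static void accumulate(struct sym_stats *(*lookup)(const char *),
>                        const struct sample *samples, size_t n)
> {
>     const char *prev = NULL;
>
>     for (size_t i = 0; i < n; i++) {
>         struct sym_stats *st = lookup(samples[i].symbol);
>
>         /* Every sample contributes to the hotness count... */
>         st->nr_samples++;
>
>         /*
>          * ...but counter deltas are only credited when the previous
>          * sample landed in the same symbol; otherwise they are
>          * discarded to avoid smearing counts across functions.
>          */
>         if (prev && strcmp(prev, samples[i].symbol) == 0)
>             for (int c = 0; c < MAX_COUNTERS; c++)
>                 st->totals[c] += samples[i].deltas[c];
>
>         prev = samples[i].symbol;
>     }
> }
>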
> So far, this can be achieved without any further modifications to
> perf tools or the kernel. However, as noted, it requires the counter
> collection window to be sufficiently small; in testing on
> Neoverse-N1/-V1, a window of about 300 cycles gives good results.
> Using the cycle counter with a sample_period of 300 is possible, but
> such an approach generates significant amounts of capture data and
> introduces a lot of overhead and probe effect. Whilst the kernel will
> throttle such a configuration, which helps to mitigate a large
> portion of the bandwidth and capture overhead, throttling cannot be
> controlled on a per-event basis or by non-root users, and because it
> is specified as a percentage of CPU time its effects vary from
> machine to machine.
>
> For this to work efficiently, it is useful to provide a means to
> decouple the sample window (the time over which events are counted)
> from the sample period (the time between interesting samples). This
> patchset modifies the Arm PMU driver to support alternating between
> two sample_period values, providing a simple and inexpensive way for
> tools to separate out the sample period and the sample window. It is
> expected to be used with the cycle counter event, alternating between
> a long and a short period and subsequently discarding the counter
> data for samples with the long period. The combined long and short
> period gives the overall sampling period, and the short period gives
> the sample window. The symbol taken from the sample at the end of the
> long period can be used by tools to ensure correct attribution as
> described previously. The cycle counter is recommended because it
> provides the fair temporal distribution of samples required for the
> per-symbol sample count mentioned previously, and because the PMU can
> be programmed to overflow after a sufficiently short window; this may
> not be possible with a software timer (for example). This patch does
> not restrict the feature to the cycle counter; there could be other
> novel uses based on different events.
>
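> To illustrate the driver-side idea, the change essentially amounts to
> flipping the programmed period on each overflow (this is only a
> minimal sketch with invented names, not the actual arm_pmu changes):
>
> #include <stdbool.h>
> #include <stdint.h>
>
> /*
>  * With a long period of 999700 and a short period of 300, the combined
>  * 1000000-cycle period sets the sampling rate, while the 300-cycle
>  * window bounds the interval over which counter deltas are kept.
>  */
> struct strobe {
>     uint64_t long_period;  /* gap between interesting samples  */
>     uint64_t short_period; /* counting window for the metrics  */
>     bool in_window;        /* true while the short window runs */
> };
>
> /* Called from the overflow path to choose the period to program next. */
> static uint64_t strobe_next_period(struct strobe *s)
> {
>     s->in_window = !s->in_window;
>     return s->in_window ? s->short_period : s->long_period;
> }
>
> Tools then keep only the counter data from samples taken at the end of
> the short window, using the preceding long-period sample's symbol to
> check attribution as described above.
>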
> To test this I have developed a simple `perf script` based Python
> script. For a limited set of Arm PMU events it will post-process a
> `perf record` capture and generate a table of metrics. Alongside this
> I have developed a benchmark application that rotates through a
> sequence of different classes of behaviour that can be detected by
> the Arm PMU (e.g. mispredicts, cache misses, different instruction
> mixes). The path through the benchmark can be rotated after each
> iteration so as to ensure the results don't land on some lucky
> harmonic with the sample period. The script can be used with and
> without the strobing patch, allowing comparison of the results.
> Testing was on Juno (A53+A57), N1SDP, and Graviton 2 and 3. In
> addition, this approach has been applied to a few of Arm's tools
> projects and has correctly identified improvements and regressions.
>
> Headline results from testing indicate that a ~300 cycle sample
> window gives good results with or without the strobing patch. When
> the strobing patch is used, the resulting `perf.data` files are
> typically 25-50x smaller than without, and take ~25x less time for
> the Python script to post-process. Without strobing, the test
> application's runtime was ~20x slower when sampling every 300 cycles
> as compared to every 1000000 cycles. With strobing enabled such that
> the long period was 999700 cycles and the short period was 300
> cycles, the test application's runtime was only ~1.2x slower than
> when sampling every 1000000 cycles. Notably, without the patch, L1D
> cache miss rates are significantly higher than with the patch, which
> we attribute to the increased cache impact of trapping into the
> kernel every 300 cycles. These results are given with
> `perf_cpu_time_max_percent=25`. When tested with
> `perf_cpu_time_max_percent=100` the size and time differences are
> even more significant. Disabling throttling did not lead to obvious
> improvements in the collected metrics, suggesting that the sampling
> approach is sufficient to collect representative metrics.
>
> Cursory testing on a Xeon(R) W-2145, sampling every 300 cycles
> (without the patch), suggests this approach would work for some
> counters. Calculating branch miss rates, for example, appears to be
> correct; likewise UOPS_EXECUTED.THREAD seems to give something like a
> sensible cycles-per-uop value. On the other hand, the fixed-function
> instructions counter does not appear to sample correctly (it seems to
> report either very small or very large numbers). No idea what's going
> on there, so any insight is welcome...
>
> This patch modifies the arm_pmu driver and introduces an additional
> field in config2 to configure the behaviour. If we think there is
> broad applicability, would it make sense to move this into a
> perf_event_attr flag or field, and possibly pull it up into core? If
> we don't think so, then some feedback on the core header and arm_pmu
> modifications would be appreciated.
>
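> For discussion, a generic interface might look something like the
> following; this is purely speculative, the field name is invented, and
> it is not what the current patch implements (which only adds
> Arm-specific config2 bits):
>
> #include <linux/types.h>
>
> /*
>  * Hypothetical sketch only: when alt_sample_period is non-zero the
>  * PMU would alternate between sample_period and alt_sample_period on
>  * successive overflows.
>  */
> struct strobe_attr_sketch {
>     __u64 sample_period;     /* long gap between samples            */
>     __u64 alt_sample_period; /* short counting window, 0 = disabled */
> };
>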
> A copy of the post-processing script is available at
> https://github.com/ARM-software/gator/blob/prototypes/strobing/prototypes/strobing-patches/test-script/generate-function-metrics.py
> Note that the script itself has a dependency on
> https://lore.kernel.org/linux-perf-users/20240123103137.1890779-1-ben.gainey@arm.com/
>
>
> Ben Gainey (2):
>   arm_pmu: Allow the PMU to alternate between two sample_period
>     values.
> arm_pmuv3: Add config bits for sample period strobing
>
>  drivers/perf/arm_pmu.c       | 74 +++++++++++++++++++++++++++++-------
>  drivers/perf/arm_pmuv3.c     | 25 ++++++++++++
>  include/linux/perf/arm_pmu.h |  1 +
>  include/linux/perf_event.h   | 10 ++++-
>  4 files changed, 95 insertions(+), 15 deletions(-)
>