[PATCH v1 0/4] A mechanism for efficient support for per-function metrics

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/4] A mechanism for efficient support for per-function metrics
@ 2024-11-07 16:07 Deepak Surti
  2024-11-07 16:07 ` [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods Deepak Surti
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Deepak Surti @ 2024-11-07 16:07 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: deepak.surti, mark.barnett, ben.gainey, ak, will, james.clark,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel, linux-arm-kernel

This patch introduces the concept on an alternating sample rate to perf
core and provides the necessary basic changes in the tools to activate
that option.

This patchset was original posted by Ben Gainey out for RFC back in April,
the latest version of which can be found at
https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/.
Going forward, I will be owning this.

The primary use case for this change is to be able to enable collecting
per-function performance metrics using the Arm PMU, as per the following 
approach:

 * Starting with a simple periodic sampling (hotspot) profile,
   augment each sample with PMU counters accumulated over a short window
   up to the point the sample was take.
 * For each sample, perform some filtering to improve attribution of
   the accumulated PMU counters (ensure they are attributed to a single
   function)
 * For each function accumulate a total for each PMU counter so that
   metrics may be derived.

Without modification, and sampling at a typical rate associated
with hotspot profiling (~1mS) leads to poor results. Such an
approach gives you a reasonable estimation of where the profiled
application is spending time for relatively low overhead, but the
PMU counters cannot easily be attributed to a single function as the
window over which they are collected is too large. A modern CPU may
execute many millions of instructions over many thousands of functions
within 1mS window. With this approach, the per-function metrics tend
to trend to some average value across the top N functions in the
profile.

In order to ensure a reasonable likelihood that the counters are
attributed to a single function, the sampling window must be rather
short; typically something in the order of a few hundred cycles proves
well as tested on a range of aarch64 Cortex and Neoverse cores.

As it stands, it is possible to achieve this with perf using a very high
sampling rate (e.g ~300cy), but there are at least three major concerns
with this approach:

 * For speculatively executing, out of order cores, can the results be
   accurately attributed to a give function or the given sample window?
 * A short sample window is not guaranteed to cover a single function.
 * The overhead of sampling every few hundred cycles is very high and
   is highly likely to cause throttling which is undesirable as it leads
   to patchy results; i.e. the profile alternates between periods of 
   high frequency samples followed by longer periods of no samples. 

This patch does not address the first two points directly. Some means
to address those are discussed on the RFC v2 cover letter. The patch
focuses on addressing the final point, though happily this approach
gives us a way to perform basic filtering on the second point.

The alternating sample period allows us to do two things:

 * We can control the risk of throttling and reduce overhead by
   alternating between a long and short period. This allows us to
   decouple the "periodic" sampling rate (as might be used for hotspot
   profiling) from the short sampling window needed for collecting
   the PMU counters.
 * The sample taken at the end of the long period can be otherwise 
   discarded (as the PMU data is not useful), but the
   PERF_RECORD_CALLCHAIN information can be used to identify the current
   function at the start of the short sample window. This is useful
   for filtering samples where the PMU counter data cannot be attributed
   to a single function.

There are several reasons why it is desirable to reduce the overhead and
risk of throttling:

  * PMU counter overflow typically causes an interrupt into the kernel;
    this affects program runtime, and can affect things like branch
    prediction, cache locality and so on which can skew the metrics.
  * The very high sample rate produces significant amounts of data.
    Depending on the configuration of the profiling session and machine,
    it is easily possible to produce many orders of magnitude more data
    which is costly for tools to post-process and increases the chance
    of data loss. This is especially relevant on larger core count
    systems where it is very easy to produce massive recordings.
    Whilst the kernel will throttle such a configuration,
    which helps to mitigate a large portion of the bandwidth and capture
    overhead, it is not something that can be controlled for on a per
    event basis, or for non-root users, and because throttling is
    controlled as a percentage of time its affects vary from machine to
    machine. AIUI throttling may also produce an uneven temporal
    distribution of samples. Finally, whilst throttling does a good job
    at reducing the overall amount of data produced, it still leads to
    much larger captures than with this method; typically we have
    observed 1-2 orders of magnitude larger captures.

This patch set modifies perf core to support alternating between two
sample_period values, providing a simple and inexpensive way for tools
to separate out the sample window (time over which events are
counted) from the sample period (time between interesting samples). 

It is expected to be used with the cycle counter event, alternating
between a long and short period and subsequently discarding the counter
data for samples with the long period. The combined long and short
period gives the overall sampling period, and the short sample period 
gives the sample window. The symbol taken from the sample at the end of
the long period can be used by tools to ensure correct attribution as
described previously. The cycle counter is recommended as it provides
fair temporal distribution of samples as would be required for the
per-symbol sample count mentioned previously, and because the PMU can
be programmed to overflow after a sufficiently short window (which may
not be possible with software timer, for example). This patch does not
restrict to only the cycle counter, it is possible there could be other
novel uses based on different events, or more appropriate counters on
other architectures. This patch set does not modify or otherwise disable
the kernel's existing throttling behaviour; if a configuration is given
that would lead high CPU usage, then throttling still occurs.

To test this a simple `perf script` based python script was developed.
For a limited set of Arm PMU events it will post process a
`perf record`-ing and generate a table of metrics. Along side this a
benchmark application was developed that rotates through a sequence
of different classes of behaviour that can be detected by the Arm PMU
(eg. mispredicts, cache misses, different instruction mixes). The path
through the benchmark can be rotated after each iteration so as to
ensure the results don't land on some lucky harmonic with the sample
period. The script can be used with and without this patch allowing
comparison of the results. Testing was on Juno (A53+A57), N1SDP,
Gravaton 2 and 3. In addition this approach has been applied to a few
of Arm's tools projects and has correctly identified improvements and
regressions.

Headline results from testing indicate that ~300 cycles sample window
gives good results with or without this patch. Typical output on N1SDP (Neoverse-N1)
for the provided benchmark when run as:

    perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
        -k CLOCK_MONOTONIC_RAW \
        -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
        benchmark 0 1

    perf script -s generate-function-metrics.py -- -s discard

Looks like (reformatted for email brevity):

    Symbol              #     CPI   BM/KI  CM/KI  %CM   %CY   %I    %BM   %L1DA  %L1DM
    fp_divider_stalls   6553   4.9   0.0     0.0   0.0  41.8  22.9   0.1   0.6    0.0
    int_divider_stalls  4741   3.5   0.0     0.0   1.1  28.3  21.5   0.1   1.9    0.2
    isb                 3414  20.1   0.2     0.0   0.4  17.6   2.3   0.1   0.8    0.0
    branch_mispredicts  1234   1.1  33.0     0.0   0.0   6.1  15.2  99.0  71.6    0.1
    double_to_int        694   0.5   0.0     0.0   0.6   3.4  19.1   0.1   1.2    0.1
    nops                 417   0.3   0.2     0.0   2.8   1.9  18.3   0.6   0.4    0.1
    dcache_miss          185   3.6   0.4   184.7  53.8   0.7   0.5   0.0  18.4   99.1

(CPI = Cycles/Instruction, BM/KI = Branch Misses per 1000 Instruction,
 CM/KI = Cache Misses per 1000 Instruction, %CM = Percent of Cache
 accesses that miss, %CY = Percentage of total cycles, %I = Percentage
 of total instructions, %BM = Percentage of total branch mispredicts,
 %L1DA = Percentage of total cache accesses, %L1DM = Percentage of total
 cache misses)

When the patch is used, the resulting `perf.data` files are typically
between 25-50x smaller than without, and take ~25x less time for the
python script to post-process. For example, running the following:

    perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
    perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
    perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1

produces captures on N1SDP (Neoverse-N1) of the following sizes:

    * period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
    * period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
    * period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s

The "script time" is the user time from running "time perf script -s generate-function-metrics.py"
on the recording. Similar processing times were observed for "time perf report --stdio|cat"
as well.

By comparison, with the patch active:

    perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1

produces 4.923 MB perf.data (107512 samples), and script time = 0m0.578s.
Which is as expected ~2x the size and ~2x the number of samples as per
the period=1000000 recording. When compared to the period=300 recording,
the results from the provided post-processing script are (within margin
of error) the same, but the data file is ~62x smaller. The same affect
is seen for the post-processing script runtime.

Notably, without the patch enable, L1D cache miss rates are often higher
than with, which we attribute to increased impact on the cache that
trapping into the kernel every 300 cycles has.

These results are given with `perf_cpu_time_max_percent=25`. When tested
with `perf_cpu_time_max_percent=100` the size and time comparisons are
more significant. Disabling throttling did not lead to obvious
improvements in the collected metrics, suggesting that the sampling
approach is sufficient to collect representative metrics.

Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
window (with and without the patch) suggests this approach might work
for some counters. Using the same test script, it was possible to identify
branch mispredicts correctly. However, whilst the patch is functionally 
correct, differences in the architectures may mean that this approach it
enables does not apply as a means to collect per-function metrics on x86.

Changes since RFC v2:
 - Rebased on v6.12-rc6.

Changes since RFC v1:
 - Rebased on v6.9-rc1.
 - Refactored from arm_pmu based extension to core feature
 - Added the ability to jitter the sample window based on feedback
   from Andi Kleen.
 - Modified perf tool to parse the "alt-period" and "alt-period-jitter"
   terms in the event specification.

Ben Gainey (4):
  perf: Allow periodic events to alternate between two sample periods
  perf: Allow adding fixed random jitter to the alternate sampling
    period
  tools/perf: Modify event parser to support alt-period term
  tools/perf: Modify event parser to support alt-period-jitter term

 include/linux/perf_event.h                    |  5 ++
 include/uapi/linux/perf_event.h               | 13 ++++-
 kernel/events/core.c                          | 47 +++++++++++++++++++
 tools/include/uapi/linux/perf_event.h         | 13 ++++-
 tools/perf/tests/attr.c                       |  2 +
 tools/perf/tests/attr.py                      |  2 +
 tools/perf/tests/attr/base-record             |  4 +-
 tools/perf/tests/attr/base-record-spe         |  2 +
 tools/perf/tests/attr/base-stat               |  4 +-
 tools/perf/tests/attr/system-wide-dummy       |  4 +-
 .../attr/test-record-alt-period-jitter-term   | 13 +++++
 .../tests/attr/test-record-alt-period-term    | 12 +++++
 tools/perf/tests/attr/test-record-dummy-C0    |  4 +-
 tools/perf/util/parse-events.c                | 30 ++++++++++++
 tools/perf/util/parse-events.h                |  4 +-
 tools/perf/util/parse-events.l                |  2 +
 tools/perf/util/perf_event_attr_fprintf.c     |  1 +
 tools/perf/util/pmu.c                         |  2 +
 18 files changed, 157 insertions(+), 7 deletions(-)
 create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term
 create mode 100644 tools/perf/tests/attr/test-record-alt-period-term

-- 
2.43.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods
  2024-11-07 16:07 [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Deepak Surti
@ 2024-11-07 16:07 ` Deepak Surti
  2024-11-14 15:01   ` Peter Zijlstra
  2024-11-07 16:07 ` [PATCH v1 2/4] perf: Allow adding fixed random jitter to the alternate sampling period Deepak Surti
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Deepak Surti @ 2024-11-07 16:07 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: deepak.surti, mark.barnett, ben.gainey, ak, will, james.clark,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel, linux-arm-kernel

From: Ben Gainey <ben.gainey@arm.com>

This change modifies perf_event_attr to add a second, alternative
sample period field, and modifies the core perf overflow handling
such that when specified an event will alternate between two sample
periods.

Currently, perf does not provide a  mechanism for decoupling the period
over which counters are counted from the period between samples. This is
problematic for building a tool to measure per-function metrics derived
from a sampled counter group. Ideally such a tool wants a very small
sample window in order to correctly attribute the metrics to a given
function, but prefers a larger sample period that provides representative
coverage without excessive probe effect, triggering throttling, or
generating excessive amounts of data.

By alternating between a long and short sample_period and subsequently
discarding the long samples, tools may decouple the period between
samples that the tool cares about from the window of time over which
interesting counts are collected.

It is expected that typically tools would use this feature with the
cycles or instructions events as an approximation for time, but no
restrictions are applied to which events this can be applied to.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 include/linux/perf_event.h      |  5 +++++
 include/uapi/linux/perf_event.h |  3 +++
 kernel/events/core.c            | 39 +++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..9910b3c115d0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -270,6 +270,11 @@ struct hw_perf_event {
 	 */
 	u64				freq_time_stamp;
 	u64				freq_count_stamp;
+
+	/*
+	 * Indicates that the alternative_sample_period is used
+	 */
+	bool				using_alternative_sample_period;
 #endif
 };
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 4842c36fdf80..bedae424ba36 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -379,6 +379,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7	128	/* add: sig_data */
 #define PERF_ATTR_SIZE_VER8	136	/* add: config3 */
+#define PERF_ATTR_SIZE_VER9	144	/* add: alternative_sample_period */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -522,6 +523,8 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+	__u64	alternative_sample_period;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index df27d08a7232..579b04af70a2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4185,6 +4185,8 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 	s64 period, sample_period;
 	s64 delta;
 
+	WARN_ON_ONCE(hwc->using_alternative_sample_period);
+
 	period = perf_calculate_period(event, nsec, count);
 
 	delta = (s64)(period - hwc->sample_period);
@@ -9806,6 +9808,7 @@ static int __perf_event_overflow(struct perf_event *event,
 				 int throttle, struct perf_sample_data *data,
 				 struct pt_regs *regs)
 {
+	struct hw_perf_event *hwc = &event->hw;
 	int events = atomic_read(&event->event_limit);
 	int ret = 0;
 
@@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
 	    !bpf_overflow_handler(event, data, regs))
 		return ret;
 
+	/*
+	 * Swap the sample period to the alternative period
+	 */
+	if (event->attr.alternative_sample_period) {
+		bool using_alt = hwc->using_alternative_sample_period;
+		u64 sample_period = (using_alt ? event->attr.sample_period
+					       : event->attr.alternative_sample_period);
+
+		hwc->sample_period = sample_period;
+		hwc->using_alternative_sample_period = !using_alt;
+
+		if (local64_read(&hwc->period_left) > 0) {
+			event->pmu->stop(event, PERF_EF_UPDATE);
+
+			local64_set(&hwc->period_left, 0);
+
+			event->pmu->start(event, PERF_EF_RELOAD);
+		}
+	}
+
 	/*
 	 * XXX event_limit might not quite work as expected on inherited
 	 * events
@@ -12244,6 +12267,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	local64_set(&hwc->period_left, hwc->sample_period);
 
+	/*
+	 * alternative_sample_period cannot be used with freq
+	 */
+	if (attr->freq && attr->alternative_sample_period)
+		goto err_ns;
+
 	/*
 	 * We do not support PERF_SAMPLE_READ on inherited events unless
 	 * PERF_SAMPLE_TID is also selected, which allows inherited events to
@@ -12700,9 +12729,19 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (attr.freq) {
 		if (attr.sample_freq > sysctl_perf_event_sample_rate)
 			return -EINVAL;
+		if (attr.alternative_sample_period)
+			return -EINVAL;
 	} else {
 		if (attr.sample_period & (1ULL << 63))
 			return -EINVAL;
+		if (attr.alternative_sample_period) {
+			if (!attr.sample_period)
+				return -EINVAL;
+			if (attr.alternative_sample_period & (1ULL << 63))
+				return -EINVAL;
+			if (attr.alternative_sample_period == attr.sample_period)
+				attr.alternative_sample_period = 0;
+		}
 	}
 
 	/* Only privileged users can get physical addresses */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods
  2024-11-07 16:07 ` [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods Deepak Surti
@ 2024-11-14 15:01   ` Peter Zijlstra
  2024-11-25 17:11     ` Deepak Surti
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2024-11-14 15:01 UTC (permalink / raw)
  To: Deepak Surti
  Cc: mingo, acme, namhyung, mark.barnett, ben.gainey, ak, will,
	james.clark, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel

On Thu, Nov 07, 2024 at 04:07:18PM +0000, Deepak Surti wrote:
> From: Ben Gainey <ben.gainey@arm.com>
> 
> This change modifies perf_event_attr to add a second, alternative
> sample period field, and modifies the core perf overflow handling
> such that when specified an event will alternate between two sample
> periods.
> 
> Currently, perf does not provide a  mechanism for decoupling the period
> over which counters are counted from the period between samples. This is
> problematic for building a tool to measure per-function metrics derived
> from a sampled counter group. Ideally such a tool wants a very small
> sample window in order to correctly attribute the metrics to a given
> function, but prefers a larger sample period that provides representative
> coverage without excessive probe effect, triggering throttling, or
> generating excessive amounts of data.
> 
> By alternating between a long and short sample_period and subsequently
> discarding the long samples, tools may decouple the period between
> samples that the tool cares about from the window of time over which
> interesting counts are collected.

Do you have a link to a paper or something that explains this method?


> +	/*
> +	 * Indicates that the alternative_sample_period is used
> +	 */
> +	bool				using_alternative_sample_period;

I typically prefer variables names that are shorter.


> @@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct perf_event *event,
>  	    !bpf_overflow_handler(event, data, regs))
>  		return ret;
>  
> +	/*
> +	 * Swap the sample period to the alternative period
> +	 */
> +	if (event->attr.alternative_sample_period) {
> +		bool using_alt = hwc->using_alternative_sample_period;
> +		u64 sample_period = (using_alt ? event->attr.sample_period
> +					       : event->attr.alternative_sample_period);
> +
> +		hwc->sample_period = sample_period;
> +		hwc->using_alternative_sample_period = !using_alt;
> +
> +		if (local64_read(&hwc->period_left) > 0) {
> +			event->pmu->stop(event, PERF_EF_UPDATE);
> +
> +			local64_set(&hwc->period_left, 0);
> +
> +			event->pmu->start(event, PERF_EF_RELOAD);
> +		}

This is quite terrible :-(

Getting here means we just went through the effort of programming the
period and you'll pretty much always hit that 'period_left > 0' case.

Why do we need this case at all? If you don't do this, then the next
overflow will pick the period you just wrote to hwc->sample_period
(although you might want to audit all arch implementations).

Looking at it again, that truncation to 0 is just plain wrong -- always.
Why are you doing this?




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods
  2024-11-14 15:01   ` Peter Zijlstra
@ 2024-11-25 17:11     ` Deepak Surti
  0 siblings, 0 replies; 12+ messages in thread
From: Deepak Surti @ 2024-11-25 17:11 UTC (permalink / raw)
  To: peterz@infradead.org
  Cc: Ben Gainey, alexander.shishkin@linux.intel.com, Mark Barnett,
	James Clark, adrian.hunter@intel.com, ak@linux.intel.com,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	irogers@google.com, mingo@redhat.com, will@kernel.org,
	Mark Rutland, linux-arm-kernel@lists.infradead.org,
	acme@kernel.org, jolsa@kernel.org, namhyung@kernel.org

On Thu, 2024-11-14 at 16:01 +0100, Peter Zijlstra wrote:
> On Thu, Nov 07, 2024 at 04:07:18PM +0000, Deepak Surti wrote:
> > From: Ben Gainey <ben.gainey@arm.com>
> > 
> > This change modifies perf_event_attr to add a second, alternative
> > sample period field, and modifies the core perf overflow handling
> > such that when specified an event will alternate between two sample
> > periods.
> > 
> > Currently, perf does not provide a  mechanism for decoupling the
> > period
> > over which counters are counted from the period between samples.
> > This is
> > problematic for building a tool to measure per-function metrics
> > derived
> > from a sampled counter group. Ideally such a tool wants a very
> > small
> > sample window in order to correctly attribute the metrics to a
> > given
> > function, but prefers a larger sample period that provides
> > representative
> > coverage without excessive probe effect, triggering throttling, or
> > generating excessive amounts of data.
> > 
> > By alternating between a long and short sample_period and
> > subsequently
> > discarding the long samples, tools may decouple the period between
> > samples that the tool cares about from the window of time over
> > which
> > interesting counts are collected.
> 
> Do you have a link to a paper or something that explains this method?

Ben had originally authored this as an internal doc but I can look at
publishing externally. Is there anything in particular about this
method that you are interested in?

> 
> 
> > +	/*
> > +	 * Indicates that the alternative_sample_period is used
> > +	 */
> > +	bool				using_alternative_sample_p
> > eriod;
> 
> I typically prefer variables names that are shorter.

Acknowledged. Will do it in version 2 of the patch.

> 
> 
> > @@ -9822,6 +9825,26 @@ static int __perf_event_overflow(struct
> > perf_event *event,
> >  	    !bpf_overflow_handler(event, data, regs))
> >  		return ret;
> >  
> > +	/*
> > +	 * Swap the sample period to the alternative period
> > +	 */
> > +	if (event->attr.alternative_sample_period) {
> > +		bool using_alt = hwc-
> > >using_alternative_sample_period;
> > +		u64 sample_period = (using_alt ? event-
> > >attr.sample_period
> > +					       : event-
> > >attr.alternative_sample_period);
> > +
> > +		hwc->sample_period = sample_period;
> > +		hwc->using_alternative_sample_period = !using_alt;
> > +
> > +		if (local64_read(&hwc->period_left) > 0) {
> > +			event->pmu->stop(event, PERF_EF_UPDATE);
> > +
> > +			local64_set(&hwc->period_left, 0);
> > +
> > +			event->pmu->start(event, PERF_EF_RELOAD);
> > +		}
> 
> This is quite terrible :-(
> 
> Getting here means we just went through the effort of programming the
> period and you'll pretty much always hit that 'period_left > 0' case.
> 
> Why do we need this case at all? If you don't do this, then the next
> overflow will pick the period you just wrote to hwc->sample_period
> (although you might want to audit all arch implementations).
> 
> Looking at it again, that truncation to 0 is just plain wrong --
> always.
> Why are you doing this?

This was due to Ben's lack of familiarity with the codebase when this
was originally written; this replicates what the IOCTL handler does to
change the sample period.

But you are right this is inefficient; we've tested with this removed
and for SW events and arm pmu events it appears to be fine.

A quick review of the other architecture overflow handlers tend to all
follow the same pattern so it's probably safe, but we will do some more
validation on that before version 2 of the patch.

Thanks,
Deepak
> 
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v1 2/4] perf: Allow adding fixed random jitter to the alternate sampling period
  2024-11-07 16:07 [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Deepak Surti
  2024-11-07 16:07 ` [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods Deepak Surti
@ 2024-11-07 16:07 ` Deepak Surti
  2024-11-07 16:07 ` [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term Deepak Surti
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Deepak Surti @ 2024-11-07 16:07 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: deepak.surti, mark.barnett, ben.gainey, ak, will, james.clark,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel, linux-arm-kernel

From: Ben Gainey <ben.gainey@arm.com>

This change modifies the core perf overflow handler, adding some small
random jitter to each sample period whenever an event switches between the
two alternate sample periods. A new flag is added to perf_event_attr to
opt into this behaviour.

This change follows the discussion in [1], where it is recognized that it
may be possible for certain patterns of execution to end up with biased
results.

[1] https://lore.kernel.org/linux-perf-users/Zc24eLqZycmIg3d2@tassilo/

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 include/uapi/linux/perf_event.h | 10 +++++++++-
 kernel/events/core.c            | 10 +++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index bedae424ba36..16dbeea5803e 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -461,7 +461,15 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				/*
+				 * jitter_alternate_period:
+				 *
+				 * add a limited amount of jitter on each
+				 * alternate period, where the jitter is between
+				 * [0, (2<<jitter_alternate_period) - 1]
+				 */
+				jitter_alternate_period : 3,
+				__reserved_1   : 23;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 579b04af70a2..14dfac6f13dd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -15,6 +15,7 @@
 #include <linux/idr.h>
 #include <linux/file.h>
 #include <linux/poll.h>
+#include <linux/random.h>
 #include <linux/slab.h>
 #include <linux/hash.h>
 #include <linux/tick.h>
@@ -9831,7 +9832,11 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (event->attr.alternative_sample_period) {
 		bool using_alt = hwc->using_alternative_sample_period;
 		u64 sample_period = (using_alt ? event->attr.sample_period
-					       : event->attr.alternative_sample_period);
+					       : event->attr.alternative_sample_period)
+				  + (event->attr.jitter_alternate_period
+					? get_random_u32_below(2 <<
+								event->attr.jitter_alternate_period)
+					: 0);
 
 		hwc->sample_period = sample_period;
 		hwc->using_alternative_sample_period = !using_alt;
@@ -12744,6 +12749,9 @@ SYSCALL_DEFINE5(perf_event_open,
 		}
 	}
 
+	if (attr.jitter_alternate_period && !attr.alternative_sample_period)
+		return -EINVAL;
+
 	/* Only privileged users can get physical addresses */
 	if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR)) {
 		err = perf_allow_kernel(&attr);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term
  2024-11-07 16:07 [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Deepak Surti
  2024-11-07 16:07 ` [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods Deepak Surti
  2024-11-07 16:07 ` [PATCH v1 2/4] perf: Allow adding fixed random jitter to the alternate sampling period Deepak Surti
@ 2024-11-07 16:07 ` Deepak Surti
  2024-11-14  2:10   ` Ian Rogers
  2024-11-07 16:07 ` [PATCH v1 4/4] tools/perf: Modify event parser to support alt-period-jitter term Deepak Surti
  2024-11-14  2:22 ` [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Ian Rogers
  4 siblings, 1 reply; 12+ messages in thread
From: Deepak Surti @ 2024-11-07 16:07 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: deepak.surti, mark.barnett, ben.gainey, ak, will, james.clark,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel, linux-arm-kernel

From: Ben Gainey <ben.gainey@arm.com>

parse-events is modified, adding the "alt-period" term which can be used
to specify the alternative sampling period.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 tools/include/uapi/linux/perf_event.h             |  3 +++
 tools/perf/tests/attr.c                           |  1 +
 tools/perf/tests/attr.py                          |  1 +
 tools/perf/tests/attr/base-record                 |  3 ++-
 tools/perf/tests/attr/base-record-spe             |  1 +
 tools/perf/tests/attr/base-stat                   |  3 ++-
 tools/perf/tests/attr/system-wide-dummy           |  3 ++-
 tools/perf/tests/attr/test-record-alt-period-term | 12 ++++++++++++
 tools/perf/tests/attr/test-record-dummy-C0        |  3 ++-
 tools/perf/util/parse-events.c                    | 15 +++++++++++++++
 tools/perf/util/parse-events.h                    |  3 ++-
 tools/perf/util/parse-events.l                    |  1 +
 tools/perf/util/perf_event_attr_fprintf.c         |  1 +
 tools/perf/util/pmu.c                             |  1 +
 14 files changed, 46 insertions(+), 5 deletions(-)
 create mode 100644 tools/perf/tests/attr/test-record-alt-period-term

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 4842c36fdf80..bedae424ba36 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -379,6 +379,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER6	120	/* add: aux_sample_size */
 #define PERF_ATTR_SIZE_VER7	128	/* add: sig_data */
 #define PERF_ATTR_SIZE_VER8	136	/* add: config3 */
+#define PERF_ATTR_SIZE_VER9	144	/* add: alternative_sample_period */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -522,6 +523,8 @@ struct perf_event_attr {
 	__u64	sig_data;
 
 	__u64	config3; /* extension of config2 */
+
+	__u64	alternative_sample_period;
 };
 
 /*
diff --git a/tools/perf/tests/attr.c b/tools/perf/tests/attr.c
index 97e1bdd6ec0e..956b58c7ba8f 100644
--- a/tools/perf/tests/attr.c
+++ b/tools/perf/tests/attr.c
@@ -139,6 +139,7 @@ static int store_event(struct perf_event_attr *attr, pid_t pid, struct perf_cpu
 	WRITE_ASS(branch_sample_type, "llu");
 	WRITE_ASS(sample_regs_user,   "llu");
 	WRITE_ASS(sample_stack_user,  PRIu32);
+	WRITE_ASS(alternative_sample_period,  "llu");
 
 	fclose(file);
 	return 0;
diff --git a/tools/perf/tests/attr.py b/tools/perf/tests/attr.py
index e890c261ad26..75c4527393f9 100644
--- a/tools/perf/tests/attr.py
+++ b/tools/perf/tests/attr.py
@@ -91,6 +91,7 @@ class Event(dict):
         'branch_sample_type',
         'sample_regs_user',
         'sample_stack_user',
+        'alternative_sample_period',
     ]
 
     def add(self, data):
diff --git a/tools/perf/tests/attr/base-record b/tools/perf/tests/attr/base-record
index b44e4e6e4443..403de2e2c891 100644
--- a/tools/perf/tests/attr/base-record
+++ b/tools/perf/tests/attr/base-record
@@ -5,7 +5,7 @@ group_fd=-1
 flags=0|8
 cpu=*
 type=0|1
-size=136
+size=144
 config=0|1
 sample_period=*
 sample_type=263
@@ -39,3 +39,4 @@ config2=0
 branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
+alternative_sample_period=0
\ No newline at end of file
diff --git a/tools/perf/tests/attr/base-record-spe b/tools/perf/tests/attr/base-record-spe
index 08fa96b59240..db528d7d8b73 100644
--- a/tools/perf/tests/attr/base-record-spe
+++ b/tools/perf/tests/attr/base-record-spe
@@ -38,3 +38,4 @@ config2=*
 branch_sample_type=*
 sample_regs_user=*
 sample_stack_user=*
+alternative_sample_period=0
\ No newline at end of file
diff --git a/tools/perf/tests/attr/base-stat b/tools/perf/tests/attr/base-stat
index fccd8ec4d1b0..27ef0fa1386f 100644
--- a/tools/perf/tests/attr/base-stat
+++ b/tools/perf/tests/attr/base-stat
@@ -5,7 +5,7 @@ group_fd=-1
 flags=0|8
 cpu=*
 type=0
-size=136
+size=144
 config=0
 sample_period=0
 sample_type=65536
@@ -39,3 +39,4 @@ config2=0
 branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
+alternative_sample_period=0
diff --git a/tools/perf/tests/attr/system-wide-dummy b/tools/perf/tests/attr/system-wide-dummy
index a1e1d6a263bf..5c4d2a60931d 100644
--- a/tools/perf/tests/attr/system-wide-dummy
+++ b/tools/perf/tests/attr/system-wide-dummy
@@ -7,7 +7,7 @@ cpu=*
 pid=-1
 flags=8
 type=1
-size=136
+size=144
 config=9
 sample_period=1
 # PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
@@ -50,3 +50,4 @@ config2=0
 branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
+alternative_sample_period=0
diff --git a/tools/perf/tests/attr/test-record-alt-period-term b/tools/perf/tests/attr/test-record-alt-period-term
new file mode 100644
index 000000000000..e0de4c8eb1c8
--- /dev/null
+++ b/tools/perf/tests/attr/test-record-alt-period-term
@@ -0,0 +1,12 @@
+[config]
+command = record
+args    = --no-bpf-event -e cycles/period=3,alt-period=2/ -- kill >/dev/null 2>&1
+ret     = 1
+kernel_since = 6.11
+
+[event-10:base-record]
+sample_period=3
+alternative_sample_period=2
+
+freq=0
+sample_type=7
diff --git a/tools/perf/tests/attr/test-record-dummy-C0 b/tools/perf/tests/attr/test-record-dummy-C0
index 576ec48b3aaf..d4f0546e02b6 100644
--- a/tools/perf/tests/attr/test-record-dummy-C0
+++ b/tools/perf/tests/attr/test-record-dummy-C0
@@ -10,7 +10,7 @@ cpu=0
 pid=-1
 flags=8
 type=1
-size=136
+size=144
 config=9
 sample_period=4000
 # PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
@@ -53,3 +53,4 @@ config2=0
 branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
+alternative_sample_period=0
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 9a8be1e46d67..48723cea3a51 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -826,6 +826,7 @@ static const char *config_term_name(enum parse_events__term_type term_type)
 		[PARSE_EVENTS__TERM_TYPE_RAW]                   = "raw",
 		[PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE]          = "legacy-cache",
 		[PARSE_EVENTS__TERM_TYPE_HARDWARE]              = "hardware",
+		[PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD]	= "alt-period",
 	};
 	if ((unsigned int)term_type >= __PARSE_EVENTS__TERM_TYPE_NR)
 		return "unknown term";
@@ -854,6 +855,7 @@ config_term_avail(enum parse_events__term_type term_type, struct parse_events_er
 	case PARSE_EVENTS__TERM_TYPE_NAME:
 	case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
 	case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
+	case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
 	case PARSE_EVENTS__TERM_TYPE_PERCORE:
 		return true;
 	case PARSE_EVENTS__TERM_TYPE_USER:
@@ -998,6 +1000,16 @@ do {									   \
 			return -EINVAL;
 		}
 		break;
+	case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
+		CHECK_TYPE_VAL(NUM);
+		if (term->val.num == 0) {
+			parse_events_error__handle(err, term->err_val,
+						strdup("expected a non-zero value"),
+						NULL);
+			return -EINVAL;
+		}
+		attr->alternative_sample_period = term->val.num;
+		break;
 	case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
 	case PARSE_EVENTS__TERM_TYPE_USER:
 	case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
@@ -1124,6 +1136,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
 	case PARSE_EVENTS__TERM_TYPE_RAW:
 	case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
 	case PARSE_EVENTS__TERM_TYPE_HARDWARE:
+	case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
 	default:
 		if (err) {
 			parse_events_error__handle(err, term->err_term,
@@ -1255,6 +1268,7 @@ do {								\
 		case PARSE_EVENTS__TERM_TYPE_RAW:
 		case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
 		case PARSE_EVENTS__TERM_TYPE_HARDWARE:
+		case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
 		default:
 			break;
 		}
@@ -1308,6 +1322,7 @@ static int get_config_chgs(struct perf_pmu *pmu, struct parse_events_terms *head
 		case PARSE_EVENTS__TERM_TYPE_RAW:
 		case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
 		case PARSE_EVENTS__TERM_TYPE_HARDWARE:
+		case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
 		default:
 			break;
 		}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 10cc9c433116..f1482361321a 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -79,7 +79,8 @@ enum parse_events__term_type {
 	PARSE_EVENTS__TERM_TYPE_RAW,
 	PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE,
 	PARSE_EVENTS__TERM_TYPE_HARDWARE,
-#define	__PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_HARDWARE + 1)
+	PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD,
+#define	__PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD + 1)
 };
 
 struct parse_events_term {
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 5a0bcd7f166a..1972a5a696ef 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -331,6 +331,7 @@ percore			{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_PERCORE); }
 aux-output		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT); }
 aux-sample-size		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
 metric-id		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
+alt-period		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD); }
 cpu-cycles|cycles				{ return hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
 stalled-cycles-frontend|idle-cycles-frontend	{ return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND); }
 stalled-cycles-backend|idle-cycles-backend	{ return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_BACKEND); }
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 59fbbba79697..c6af598070b5 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -335,6 +335,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
 	PRINT_ATTRf(sample_max_stack, p_unsigned);
 	PRINT_ATTRf(aux_sample_size, p_unsigned);
 	PRINT_ATTRf(sig_data, p_unsigned);
+	PRINT_ATTRf(alternative_sample_period, p_unsigned);
 
 	return ret;
 }
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 61bdda01a05a..7615d05e389f 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1738,6 +1738,7 @@ int perf_pmu__for_each_format(struct perf_pmu *pmu, void *state, pmu_format_call
 		"percore",
 		"aux-output",
 		"aux-sample-size=number",
+		"alt-period=number",
 	};
 	struct perf_pmu_format *format;
 	int ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term
  2024-11-07 16:07 ` [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term Deepak Surti
@ 2024-11-14  2:10   ` Ian Rogers
  2024-11-25 16:02     ` Deepak Surti
  0 siblings, 1 reply; 12+ messages in thread
From: Ian Rogers @ 2024-11-14  2:10 UTC (permalink / raw)
  To: Deepak Surti
  Cc: peterz, mingo, acme, namhyung, mark.barnett, ben.gainey, ak, will,
	james.clark, mark.rutland, alexander.shishkin, jolsa,
	adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel

On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@arm.com> wrote:
>
> From: Ben Gainey <ben.gainey@arm.com>
>
> parse-events is modified, adding the "alt-period" term which can be used
> to specify the alternative sampling period.
>
> Signed-off-by: Ben Gainey <ben.gainey@arm.com>

Hi Deepak,

this and the next change conflict with:
https://lore.kernel.org/all/20241015000158.871828-1-irogers@google.com/
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/tests/shell/attr.sh?h=perf-tools-next&id=8519e4f44c2af72214dc029f0334be068466e71f
Could you migrate the work to perf-tools-next?

Thanks,
Ian

> ---
>  tools/include/uapi/linux/perf_event.h             |  3 +++
>  tools/perf/tests/attr.c                           |  1 +
>  tools/perf/tests/attr.py                          |  1 +
>  tools/perf/tests/attr/base-record                 |  3 ++-
>  tools/perf/tests/attr/base-record-spe             |  1 +
>  tools/perf/tests/attr/base-stat                   |  3 ++-
>  tools/perf/tests/attr/system-wide-dummy           |  3 ++-
>  tools/perf/tests/attr/test-record-alt-period-term | 12 ++++++++++++
>  tools/perf/tests/attr/test-record-dummy-C0        |  3 ++-
>  tools/perf/util/parse-events.c                    | 15 +++++++++++++++
>  tools/perf/util/parse-events.h                    |  3 ++-
>  tools/perf/util/parse-events.l                    |  1 +
>  tools/perf/util/perf_event_attr_fprintf.c         |  1 +
>  tools/perf/util/pmu.c                             |  1 +
>  14 files changed, 46 insertions(+), 5 deletions(-)
>  create mode 100644 tools/perf/tests/attr/test-record-alt-period-term
>
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index 4842c36fdf80..bedae424ba36 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -379,6 +379,7 @@ enum perf_event_read_format {
>  #define PERF_ATTR_SIZE_VER6    120     /* add: aux_sample_size */
>  #define PERF_ATTR_SIZE_VER7    128     /* add: sig_data */
>  #define PERF_ATTR_SIZE_VER8    136     /* add: config3 */
> +#define PERF_ATTR_SIZE_VER9    144     /* add: alternative_sample_period */
>
>  /*
>   * Hardware event_id to monitor via a performance monitoring event:
> @@ -522,6 +523,8 @@ struct perf_event_attr {
>         __u64   sig_data;
>
>         __u64   config3; /* extension of config2 */
> +
> +       __u64   alternative_sample_period;
>  };
>
>  /*
> diff --git a/tools/perf/tests/attr.c b/tools/perf/tests/attr.c
> index 97e1bdd6ec0e..956b58c7ba8f 100644
> --- a/tools/perf/tests/attr.c
> +++ b/tools/perf/tests/attr.c
> @@ -139,6 +139,7 @@ static int store_event(struct perf_event_attr *attr, pid_t pid, struct perf_cpu
>         WRITE_ASS(branch_sample_type, "llu");
>         WRITE_ASS(sample_regs_user,   "llu");
>         WRITE_ASS(sample_stack_user,  PRIu32);
> +       WRITE_ASS(alternative_sample_period,  "llu");
>
>         fclose(file);
>         return 0;
> diff --git a/tools/perf/tests/attr.py b/tools/perf/tests/attr.py
> index e890c261ad26..75c4527393f9 100644
> --- a/tools/perf/tests/attr.py
> +++ b/tools/perf/tests/attr.py
> @@ -91,6 +91,7 @@ class Event(dict):
>          'branch_sample_type',
>          'sample_regs_user',
>          'sample_stack_user',
> +        'alternative_sample_period',
>      ]
>
>      def add(self, data):
> diff --git a/tools/perf/tests/attr/base-record b/tools/perf/tests/attr/base-record
> index b44e4e6e4443..403de2e2c891 100644
> --- a/tools/perf/tests/attr/base-record
> +++ b/tools/perf/tests/attr/base-record
> @@ -5,7 +5,7 @@ group_fd=-1
>  flags=0|8
>  cpu=*
>  type=0|1
> -size=136
> +size=144
>  config=0|1
>  sample_period=*
>  sample_type=263
> @@ -39,3 +39,4 @@ config2=0
>  branch_sample_type=0
>  sample_regs_user=0
>  sample_stack_user=0
> +alternative_sample_period=0
> \ No newline at end of file
> diff --git a/tools/perf/tests/attr/base-record-spe b/tools/perf/tests/attr/base-record-spe
> index 08fa96b59240..db528d7d8b73 100644
> --- a/tools/perf/tests/attr/base-record-spe
> +++ b/tools/perf/tests/attr/base-record-spe
> @@ -38,3 +38,4 @@ config2=*
>  branch_sample_type=*
>  sample_regs_user=*
>  sample_stack_user=*
> +alternative_sample_period=0
> \ No newline at end of file
> diff --git a/tools/perf/tests/attr/base-stat b/tools/perf/tests/attr/base-stat
> index fccd8ec4d1b0..27ef0fa1386f 100644
> --- a/tools/perf/tests/attr/base-stat
> +++ b/tools/perf/tests/attr/base-stat
> @@ -5,7 +5,7 @@ group_fd=-1
>  flags=0|8
>  cpu=*
>  type=0
> -size=136
> +size=144
>  config=0
>  sample_period=0
>  sample_type=65536
> @@ -39,3 +39,4 @@ config2=0
>  branch_sample_type=0
>  sample_regs_user=0
>  sample_stack_user=0
> +alternative_sample_period=0
> diff --git a/tools/perf/tests/attr/system-wide-dummy b/tools/perf/tests/attr/system-wide-dummy
> index a1e1d6a263bf..5c4d2a60931d 100644
> --- a/tools/perf/tests/attr/system-wide-dummy
> +++ b/tools/perf/tests/attr/system-wide-dummy
> @@ -7,7 +7,7 @@ cpu=*
>  pid=-1
>  flags=8
>  type=1
> -size=136
> +size=144
>  config=9
>  sample_period=1
>  # PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
> @@ -50,3 +50,4 @@ config2=0
>  branch_sample_type=0
>  sample_regs_user=0
>  sample_stack_user=0
> +alternative_sample_period=0
> diff --git a/tools/perf/tests/attr/test-record-alt-period-term b/tools/perf/tests/attr/test-record-alt-period-term
> new file mode 100644
> index 000000000000..e0de4c8eb1c8
> --- /dev/null
> +++ b/tools/perf/tests/attr/test-record-alt-period-term
> @@ -0,0 +1,12 @@
> +[config]
> +command = record
> +args    = --no-bpf-event -e cycles/period=3,alt-period=2/ -- kill >/dev/null 2>&1
> +ret     = 1
> +kernel_since = 6.11
> +
> +[event-10:base-record]
> +sample_period=3
> +alternative_sample_period=2
> +
> +freq=0
> +sample_type=7
> diff --git a/tools/perf/tests/attr/test-record-dummy-C0 b/tools/perf/tests/attr/test-record-dummy-C0
> index 576ec48b3aaf..d4f0546e02b6 100644
> --- a/tools/perf/tests/attr/test-record-dummy-C0
> +++ b/tools/perf/tests/attr/test-record-dummy-C0
> @@ -10,7 +10,7 @@ cpu=0
>  pid=-1
>  flags=8
>  type=1
> -size=136
> +size=144
>  config=9
>  sample_period=4000
>  # PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
> @@ -53,3 +53,4 @@ config2=0
>  branch_sample_type=0
>  sample_regs_user=0
>  sample_stack_user=0
> +alternative_sample_period=0
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 9a8be1e46d67..48723cea3a51 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -826,6 +826,7 @@ static const char *config_term_name(enum parse_events__term_type term_type)
>                 [PARSE_EVENTS__TERM_TYPE_RAW]                   = "raw",
>                 [PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE]          = "legacy-cache",
>                 [PARSE_EVENTS__TERM_TYPE_HARDWARE]              = "hardware",
> +               [PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD]     = "alt-period",
>         };
>         if ((unsigned int)term_type >= __PARSE_EVENTS__TERM_TYPE_NR)
>                 return "unknown term";
> @@ -854,6 +855,7 @@ config_term_avail(enum parse_events__term_type term_type, struct parse_events_er
>         case PARSE_EVENTS__TERM_TYPE_NAME:
>         case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
>         case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
> +       case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
>         case PARSE_EVENTS__TERM_TYPE_PERCORE:
>                 return true;
>         case PARSE_EVENTS__TERM_TYPE_USER:
> @@ -998,6 +1000,16 @@ do {                                                                         \
>                         return -EINVAL;
>                 }
>                 break;
> +       case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
> +               CHECK_TYPE_VAL(NUM);
> +               if (term->val.num == 0) {
> +                       parse_events_error__handle(err, term->err_val,
> +                                               strdup("expected a non-zero value"),
> +                                               NULL);
> +                       return -EINVAL;
> +               }
> +               attr->alternative_sample_period = term->val.num;
> +               break;
>         case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
>         case PARSE_EVENTS__TERM_TYPE_USER:
>         case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
> @@ -1124,6 +1136,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
>         case PARSE_EVENTS__TERM_TYPE_RAW:
>         case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
>         case PARSE_EVENTS__TERM_TYPE_HARDWARE:
> +       case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
>         default:
>                 if (err) {
>                         parse_events_error__handle(err, term->err_term,
> @@ -1255,6 +1268,7 @@ do {                                                              \
>                 case PARSE_EVENTS__TERM_TYPE_RAW:
>                 case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
>                 case PARSE_EVENTS__TERM_TYPE_HARDWARE:
> +               case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
>                 default:
>                         break;
>                 }
> @@ -1308,6 +1322,7 @@ static int get_config_chgs(struct perf_pmu *pmu, struct parse_events_terms *head
>                 case PARSE_EVENTS__TERM_TYPE_RAW:
>                 case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
>                 case PARSE_EVENTS__TERM_TYPE_HARDWARE:
> +               case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
>                 default:
>                         break;
>                 }
> diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
> index 10cc9c433116..f1482361321a 100644
> --- a/tools/perf/util/parse-events.h
> +++ b/tools/perf/util/parse-events.h
> @@ -79,7 +79,8 @@ enum parse_events__term_type {
>         PARSE_EVENTS__TERM_TYPE_RAW,
>         PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE,
>         PARSE_EVENTS__TERM_TYPE_HARDWARE,
> -#define        __PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_HARDWARE + 1)
> +       PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD,
> +#define        __PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD + 1)
>  };
>
>  struct parse_events_term {
> diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
> index 5a0bcd7f166a..1972a5a696ef 100644
> --- a/tools/perf/util/parse-events.l
> +++ b/tools/perf/util/parse-events.l
> @@ -331,6 +331,7 @@ percore                     { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_PERCORE); }
>  aux-output             { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT); }
>  aux-sample-size                { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
>  metric-id              { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
> +alt-period             { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD); }
>  cpu-cycles|cycles                              { return hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
>  stalled-cycles-frontend|idle-cycles-frontend   { return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND); }
>  stalled-cycles-backend|idle-cycles-backend     { return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_BACKEND); }
> diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
> index 59fbbba79697..c6af598070b5 100644
> --- a/tools/perf/util/perf_event_attr_fprintf.c
> +++ b/tools/perf/util/perf_event_attr_fprintf.c
> @@ -335,6 +335,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
>         PRINT_ATTRf(sample_max_stack, p_unsigned);
>         PRINT_ATTRf(aux_sample_size, p_unsigned);
>         PRINT_ATTRf(sig_data, p_unsigned);
> +       PRINT_ATTRf(alternative_sample_period, p_unsigned);
>
>         return ret;
>  }
> diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
> index 61bdda01a05a..7615d05e389f 100644
> --- a/tools/perf/util/pmu.c
> +++ b/tools/perf/util/pmu.c
> @@ -1738,6 +1738,7 @@ int perf_pmu__for_each_format(struct perf_pmu *pmu, void *state, pmu_format_call
>                 "percore",
>                 "aux-output",
>                 "aux-sample-size=number",
> +               "alt-period=number",
>         };
>         struct perf_pmu_format *format;
>         int ret;
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term
  2024-11-14  2:10   ` Ian Rogers
@ 2024-11-25 16:02     ` Deepak Surti
  0 siblings, 0 replies; 12+ messages in thread
From: Deepak Surti @ 2024-11-25 16:02 UTC (permalink / raw)
  To: irogers@google.com
  Cc: Ben Gainey, alexander.shishkin@linux.intel.com, Mark Barnett,
	James Clark, adrian.hunter@intel.com, ak@linux.intel.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	linux-perf-users@vger.kernel.org, will@kernel.org, Mark Rutland,
	peterz@infradead.org, linux-arm-kernel@lists.infradead.org,
	acme@kernel.org, jolsa@kernel.org, namhyung@kernel.org

On Wed, 2024-11-13 at 18:10 -0800, Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@arm.com>
> wrote:
> > 
> > From: Ben Gainey <ben.gainey@arm.com>
> > 
> > parse-events is modified, adding the "alt-period" term which can be
> > used
> > to specify the alternative sampling period.
> > 
> > Signed-off-by: Ben Gainey <ben.gainey@arm.com>
> 
> Hi Deepak,
> 
> this and the next change conflict with:
> https://lore.kernel.org/all/20241015000158.871828-1-irogers@google.com/
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/tests/shell/attr.sh?h=perf-tools-next&id=8519e4f44c2af72214dc029f0334be068466e71f
> 
Hi Ian,

> Could you migrate the work to perf-tools-next?

Acknowledged. Will do this in the next version 2 of the patch.

Thanks,
Deepak

> Thanks,
> Ian
> 
> > ---
> >  tools/include/uapi/linux/perf_event.h             |  3 +++
> >  tools/perf/tests/attr.c                           |  1 +
> >  tools/perf/tests/attr.py                          |  1 +
> >  tools/perf/tests/attr/base-record                 |  3 ++-
> >  tools/perf/tests/attr/base-record-spe             |  1 +
> >  tools/perf/tests/attr/base-stat                   |  3 ++-
> >  tools/perf/tests/attr/system-wide-dummy           |  3 ++-
> >  tools/perf/tests/attr/test-record-alt-period-term | 12
> > ++++++++++++
> >  tools/perf/tests/attr/test-record-dummy-C0        |  3 ++-
> >  tools/perf/util/parse-events.c                    | 15
> > +++++++++++++++
> >  tools/perf/util/parse-events.h                    |  3 ++-
> >  tools/perf/util/parse-events.l                    |  1 +
> >  tools/perf/util/perf_event_attr_fprintf.c         |  1 +
> >  tools/perf/util/pmu.c                             |  1 +
> >  14 files changed, 46 insertions(+), 5 deletions(-)
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-
> > term
> > 
> > diff --git a/tools/include/uapi/linux/perf_event.h
> > b/tools/include/uapi/linux/perf_event.h
> > index 4842c36fdf80..bedae424ba36 100644
> > --- a/tools/include/uapi/linux/perf_event.h
> > +++ b/tools/include/uapi/linux/perf_event.h
> > @@ -379,6 +379,7 @@ enum perf_event_read_format {
> >  #define PERF_ATTR_SIZE_VER6    120     /* add: aux_sample_size */
> >  #define PERF_ATTR_SIZE_VER7    128     /* add: sig_data */
> >  #define PERF_ATTR_SIZE_VER8    136     /* add: config3 */
> > +#define PERF_ATTR_SIZE_VER9    144     /* add:
> > alternative_sample_period */
> > 
> >  /*
> >   * Hardware event_id to monitor via a performance monitoring
> > event:
> > @@ -522,6 +523,8 @@ struct perf_event_attr {
> >         __u64   sig_data;
> > 
> >         __u64   config3; /* extension of config2 */
> > +
> > +       __u64   alternative_sample_period;
> >  };
> > 
> >  /*
> > diff --git a/tools/perf/tests/attr.c b/tools/perf/tests/attr.c
> > index 97e1bdd6ec0e..956b58c7ba8f 100644
> > --- a/tools/perf/tests/attr.c
> > +++ b/tools/perf/tests/attr.c
> > @@ -139,6 +139,7 @@ static int store_event(struct perf_event_attr
> > *attr, pid_t pid, struct perf_cpu
> >         WRITE_ASS(branch_sample_type, "llu");
> >         WRITE_ASS(sample_regs_user,   "llu");
> >         WRITE_ASS(sample_stack_user,  PRIu32);
> > +       WRITE_ASS(alternative_sample_period,  "llu");
> > 
> >         fclose(file);
> >         return 0;
> > diff --git a/tools/perf/tests/attr.py b/tools/perf/tests/attr.py
> > index e890c261ad26..75c4527393f9 100644
> > --- a/tools/perf/tests/attr.py
> > +++ b/tools/perf/tests/attr.py
> > @@ -91,6 +91,7 @@ class Event(dict):
> >          'branch_sample_type',
> >          'sample_regs_user',
> >          'sample_stack_user',
> > +        'alternative_sample_period',
> >      ]
> > 
> >      def add(self, data):
> > diff --git a/tools/perf/tests/attr/base-record
> > b/tools/perf/tests/attr/base-record
> > index b44e4e6e4443..403de2e2c891 100644
> > --- a/tools/perf/tests/attr/base-record
> > +++ b/tools/perf/tests/attr/base-record
> > @@ -5,7 +5,7 @@ group_fd=-1
> >  flags=0|8
> >  cpu=*
> >  type=0|1
> > -size=136
> > +size=144
> >  config=0|1
> >  sample_period=*
> >  sample_type=263
> > @@ -39,3 +39,4 @@ config2=0
> >  branch_sample_type=0
> >  sample_regs_user=0
> >  sample_stack_user=0
> > +alternative_sample_period=0
> > \ No newline at end of file
> > diff --git a/tools/perf/tests/attr/base-record-spe
> > b/tools/perf/tests/attr/base-record-spe
> > index 08fa96b59240..db528d7d8b73 100644
> > --- a/tools/perf/tests/attr/base-record-spe
> > +++ b/tools/perf/tests/attr/base-record-spe
> > @@ -38,3 +38,4 @@ config2=*
> >  branch_sample_type=*
> >  sample_regs_user=*
> >  sample_stack_user=*
> > +alternative_sample_period=0
> > \ No newline at end of file
> > diff --git a/tools/perf/tests/attr/base-stat
> > b/tools/perf/tests/attr/base-stat
> > index fccd8ec4d1b0..27ef0fa1386f 100644
> > --- a/tools/perf/tests/attr/base-stat
> > +++ b/tools/perf/tests/attr/base-stat
> > @@ -5,7 +5,7 @@ group_fd=-1
> >  flags=0|8
> >  cpu=*
> >  type=0
> > -size=136
> > +size=144
> >  config=0
> >  sample_period=0
> >  sample_type=65536
> > @@ -39,3 +39,4 @@ config2=0
> >  branch_sample_type=0
> >  sample_regs_user=0
> >  sample_stack_user=0
> > +alternative_sample_period=0
> > diff --git a/tools/perf/tests/attr/system-wide-dummy
> > b/tools/perf/tests/attr/system-wide-dummy
> > index a1e1d6a263bf..5c4d2a60931d 100644
> > --- a/tools/perf/tests/attr/system-wide-dummy
> > +++ b/tools/perf/tests/attr/system-wide-dummy
> > @@ -7,7 +7,7 @@ cpu=*
> >  pid=-1
> >  flags=8
> >  type=1
> > -size=136
> > +size=144
> >  config=9
> >  sample_period=1
> >  # PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
> > @@ -50,3 +50,4 @@ config2=0
> >  branch_sample_type=0
> >  sample_regs_user=0
> >  sample_stack_user=0
> > +alternative_sample_period=0
> > diff --git a/tools/perf/tests/attr/test-record-alt-period-term
> > b/tools/perf/tests/attr/test-record-alt-period-term
> > new file mode 100644
> > index 000000000000..e0de4c8eb1c8
> > --- /dev/null
> > +++ b/tools/perf/tests/attr/test-record-alt-period-term
> > @@ -0,0 +1,12 @@
> > +[config]
> > +command = record
> > +args    = --no-bpf-event -e cycles/period=3,alt-period=2/ -- kill
> > >/dev/null 2>&1
> > +ret     = 1
> > +kernel_since = 6.11
> > +
> > +[event-10:base-record]
> > +sample_period=3
> > +alternative_sample_period=2
> > +
> > +freq=0
> > +sample_type=7
> > diff --git a/tools/perf/tests/attr/test-record-dummy-C0
> > b/tools/perf/tests/attr/test-record-dummy-C0
> > index 576ec48b3aaf..d4f0546e02b6 100644
> > --- a/tools/perf/tests/attr/test-record-dummy-C0
> > +++ b/tools/perf/tests/attr/test-record-dummy-C0
> > @@ -10,7 +10,7 @@ cpu=0
> >  pid=-1
> >  flags=8
> >  type=1
> > -size=136
> > +size=144
> >  config=9
> >  sample_period=4000
> >  # PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
> > @@ -53,3 +53,4 @@ config2=0
> >  branch_sample_type=0
> >  sample_regs_user=0
> >  sample_stack_user=0
> > +alternative_sample_period=0
> > diff --git a/tools/perf/util/parse-events.c
> > b/tools/perf/util/parse-events.c
> > index 9a8be1e46d67..48723cea3a51 100644
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -826,6 +826,7 @@ static const char *config_term_name(enum
> > parse_events__term_type term_type)
> >                 [PARSE_EVENTS__TERM_TYPE_RAW]                   =
> > "raw",
> >                 [PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE]          =
> > "legacy-cache",
> >                 [PARSE_EVENTS__TERM_TYPE_HARDWARE]              =
> > "hardware",
> > +               [PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD]     =
> > "alt-period",
> >         };
> >         if ((unsigned int)term_type >=
> > __PARSE_EVENTS__TERM_TYPE_NR)
> >                 return "unknown term";
> > @@ -854,6 +855,7 @@ config_term_avail(enum parse_events__term_type
> > term_type, struct parse_events_er
> >         case PARSE_EVENTS__TERM_TYPE_NAME:
> >         case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
> >         case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
> > +       case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
> >         case PARSE_EVENTS__TERM_TYPE_PERCORE:
> >                 return true;
> >         case PARSE_EVENTS__TERM_TYPE_USER:
> > @@ -998,6 +1000,16 @@ do
> > {                                                                  
> >        \
> >                         return -EINVAL;
> >                 }
> >                 break;
> > +       case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
> > +               CHECK_TYPE_VAL(NUM);
> > +               if (term->val.num == 0) {
> > +                       parse_events_error__handle(err, term-
> > >err_val,
> > +                                               strdup("expected a
> > non-zero value"),
> > +                                               NULL);
> > +                       return -EINVAL;
> > +               }
> > +               attr->alternative_sample_period = term->val.num;
> > +               break;
> >         case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
> >         case PARSE_EVENTS__TERM_TYPE_USER:
> >         case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
> > @@ -1124,6 +1136,7 @@ static int config_term_tracepoint(struct
> > perf_event_attr *attr,
> >         case PARSE_EVENTS__TERM_TYPE_RAW:
> >         case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
> >         case PARSE_EVENTS__TERM_TYPE_HARDWARE:
> > +       case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
> >         default:
> >                 if (err) {
> >                         parse_events_error__handle(err, term-
> > >err_term,
> > @@ -1255,6 +1268,7 @@ do
> > {                                                              \
> >                 case PARSE_EVENTS__TERM_TYPE_RAW:
> >                 case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
> >                 case PARSE_EVENTS__TERM_TYPE_HARDWARE:
> > +               case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
> >                 default:
> >                         break;
> >                 }
> > @@ -1308,6 +1322,7 @@ static int get_config_chgs(struct perf_pmu
> > *pmu, struct parse_events_terms *head
> >                 case PARSE_EVENTS__TERM_TYPE_RAW:
> >                 case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
> >                 case PARSE_EVENTS__TERM_TYPE_HARDWARE:
> > +               case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
> >                 default:
> >                         break;
> >                 }
> > diff --git a/tools/perf/util/parse-events.h
> > b/tools/perf/util/parse-events.h
> > index 10cc9c433116..f1482361321a 100644
> > --- a/tools/perf/util/parse-events.h
> > +++ b/tools/perf/util/parse-events.h
> > @@ -79,7 +79,8 @@ enum parse_events__term_type {
> >         PARSE_EVENTS__TERM_TYPE_RAW,
> >         PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE,
> >         PARSE_EVENTS__TERM_TYPE_HARDWARE,
> > -#define        __PARSE_EVENTS__TERM_TYPE_NR
> > (PARSE_EVENTS__TERM_TYPE_HARDWARE + 1)
> > +       PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD,
> > +#define        __PARSE_EVENTS__TERM_TYPE_NR
> > (PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD + 1)
> >  };
> > 
> >  struct parse_events_term {
> > diff --git a/tools/perf/util/parse-events.l
> > b/tools/perf/util/parse-events.l
> > index 5a0bcd7f166a..1972a5a696ef 100644
> > --- a/tools/perf/util/parse-events.l
> > +++ b/tools/perf/util/parse-events.l
> > @@ -331,6 +331,7 @@ percore                     { return
> > term(yyscanner, PARSE_EVENTS__TERM_TYPE_PERCORE); }
> >  aux-output             { return term(yyscanner,
> > PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT); }
> >  aux-sample-size                { return term(yyscanner,
> > PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
> >  metric-id              { return term(yyscanner,
> > PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
> > +alt-period             { return term(yyscanner,
> > PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD); }
> >  cpu-cycles|cycles                              { return
> > hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
> >  stalled-cycles-frontend|idle-cycles-frontend   { return
> > hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND); }
> >  stalled-cycles-backend|idle-cycles-backend     { return
> > hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_BACKEND); }
> > diff --git a/tools/perf/util/perf_event_attr_fprintf.c
> > b/tools/perf/util/perf_event_attr_fprintf.c
> > index 59fbbba79697..c6af598070b5 100644
> > --- a/tools/perf/util/perf_event_attr_fprintf.c
> > +++ b/tools/perf/util/perf_event_attr_fprintf.c
> > @@ -335,6 +335,7 @@ int perf_event_attr__fprintf(FILE *fp, struct
> > perf_event_attr *attr,
> >         PRINT_ATTRf(sample_max_stack, p_unsigned);
> >         PRINT_ATTRf(aux_sample_size, p_unsigned);
> >         PRINT_ATTRf(sig_data, p_unsigned);
> > +       PRINT_ATTRf(alternative_sample_period, p_unsigned);
> > 
> >         return ret;
> >  }
> > diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
> > index 61bdda01a05a..7615d05e389f 100644
> > --- a/tools/perf/util/pmu.c
> > +++ b/tools/perf/util/pmu.c
> > @@ -1738,6 +1738,7 @@ int perf_pmu__for_each_format(struct perf_pmu
> > *pmu, void *state, pmu_format_call
> >                 "percore",
> >                 "aux-output",
> >                 "aux-sample-size=number",
> > +               "alt-period=number",
> >         };
> >         struct perf_pmu_format *format;
> >         int ret;
> > --
> > 2.43.0
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v1 4/4] tools/perf: Modify event parser to support alt-period-jitter term
  2024-11-07 16:07 [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Deepak Surti
                   ` (2 preceding siblings ...)
  2024-11-07 16:07 ` [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term Deepak Surti
@ 2024-11-07 16:07 ` Deepak Surti
  2024-11-14  2:22 ` [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Ian Rogers
  4 siblings, 0 replies; 12+ messages in thread
From: Deepak Surti @ 2024-11-07 16:07 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: deepak.surti, mark.barnett, ben.gainey, ak, will, james.clark,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel, linux-arm-kernel

From: Ben Gainey <ben.gainey@arm.com>

parse-events is modified, adding the "alt-period-jitter" term which
can be used to enable random jitter of the alternative sample
period.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 tools/include/uapi/linux/perf_event.h             | 10 +++++++++-
 tools/perf/tests/attr.c                           |  1 +
 tools/perf/tests/attr.py                          |  1 +
 tools/perf/tests/attr/base-record                 |  3 ++-
 tools/perf/tests/attr/base-record-spe             |  3 ++-
 tools/perf/tests/attr/base-stat                   |  1 +
 tools/perf/tests/attr/system-wide-dummy           |  1 +
 .../tests/attr/test-record-alt-period-jitter-term | 13 +++++++++++++
 tools/perf/tests/attr/test-record-dummy-C0        |  1 +
 tools/perf/util/parse-events.c                    | 15 +++++++++++++++
 tools/perf/util/parse-events.h                    |  3 ++-
 tools/perf/util/parse-events.l                    |  1 +
 tools/perf/util/pmu.c                             |  1 +
 13 files changed, 50 insertions(+), 4 deletions(-)
 create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index bedae424ba36..16dbeea5803e 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -461,7 +461,15 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				/*
+				 * jitter_alternate_period:
+				 *
+				 * add a limited amount of jitter on each
+				 * alternate period, where the jitter is between
+				 * [0, (2<<jitter_alternate_period) - 1]
+				 */
+				jitter_alternate_period : 3,
+				__reserved_1   : 23;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
diff --git a/tools/perf/tests/attr.c b/tools/perf/tests/attr.c
index 956b58c7ba8f..7fb5d1d0b0ab 100644
--- a/tools/perf/tests/attr.c
+++ b/tools/perf/tests/attr.c
@@ -140,6 +140,7 @@ static int store_event(struct perf_event_attr *attr, pid_t pid, struct perf_cpu
 	WRITE_ASS(sample_regs_user,   "llu");
 	WRITE_ASS(sample_stack_user,  PRIu32);
 	WRITE_ASS(alternative_sample_period,  "llu");
+	WRITE_ASS(jitter_alternate_period,  "d");
 
 	fclose(file);
 	return 0;
diff --git a/tools/perf/tests/attr.py b/tools/perf/tests/attr.py
index 75c4527393f9..ecab8e69418f 100644
--- a/tools/perf/tests/attr.py
+++ b/tools/perf/tests/attr.py
@@ -92,6 +92,7 @@ class Event(dict):
         'sample_regs_user',
         'sample_stack_user',
         'alternative_sample_period',
+        'jitter_alternate_period',
     ]
 
     def add(self, data):
diff --git a/tools/perf/tests/attr/base-record b/tools/perf/tests/attr/base-record
index 403de2e2c891..39a7228612c2 100644
--- a/tools/perf/tests/attr/base-record
+++ b/tools/perf/tests/attr/base-record
@@ -39,4 +39,5 @@ config2=0
 branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
-alternative_sample_period=0
\ No newline at end of file
+alternative_sample_period=0
+jitter_alternate_period=0
diff --git a/tools/perf/tests/attr/base-record-spe b/tools/perf/tests/attr/base-record-spe
index db528d7d8b73..b228cd98cfa1 100644
--- a/tools/perf/tests/attr/base-record-spe
+++ b/tools/perf/tests/attr/base-record-spe
@@ -38,4 +38,5 @@ config2=*
 branch_sample_type=*
 sample_regs_user=*
 sample_stack_user=*
-alternative_sample_period=0
\ No newline at end of file
+alternative_sample_period=0
+jitter_alternate_period=0
diff --git a/tools/perf/tests/attr/base-stat b/tools/perf/tests/attr/base-stat
index 27ef0fa1386f..d9057d780262 100644
--- a/tools/perf/tests/attr/base-stat
+++ b/tools/perf/tests/attr/base-stat
@@ -40,3 +40,4 @@ branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
 alternative_sample_period=0
+jitter_alternate_period=0
diff --git a/tools/perf/tests/attr/system-wide-dummy b/tools/perf/tests/attr/system-wide-dummy
index 5c4d2a60931d..4d80542c3a68 100644
--- a/tools/perf/tests/attr/system-wide-dummy
+++ b/tools/perf/tests/attr/system-wide-dummy
@@ -51,3 +51,4 @@ branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
 alternative_sample_period=0
+jitter_alternate_period=0
diff --git a/tools/perf/tests/attr/test-record-alt-period-jitter-term b/tools/perf/tests/attr/test-record-alt-period-jitter-term
new file mode 100644
index 000000000000..cda28c1ec8dd
--- /dev/null
+++ b/tools/perf/tests/attr/test-record-alt-period-jitter-term
@@ -0,0 +1,13 @@
+[config]
+command = record
+args    = --no-bpf-event -e cycles/period=3,alt-period=2,alt-period-jitter=7/ -- kill >/dev/null 2>&1
+ret     = 1
+kernel_since = 6.11
+
+[event-10:base-record]
+sample_period=3
+alternative_sample_period=2
+jitter_alternate_period=7
+
+freq=0
+sample_type=7
diff --git a/tools/perf/tests/attr/test-record-dummy-C0 b/tools/perf/tests/attr/test-record-dummy-C0
index d4f0546e02b6..0f3360c35a5e 100644
--- a/tools/perf/tests/attr/test-record-dummy-C0
+++ b/tools/perf/tests/attr/test-record-dummy-C0
@@ -54,3 +54,4 @@ branch_sample_type=0
 sample_regs_user=0
 sample_stack_user=0
 alternative_sample_period=0
+jitter_alternate_period=0
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 48723cea3a51..d3a4f52a7644 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -827,6 +827,7 @@ static const char *config_term_name(enum parse_events__term_type term_type)
 		[PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE]          = "legacy-cache",
 		[PARSE_EVENTS__TERM_TYPE_HARDWARE]              = "hardware",
 		[PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD]	= "alt-period",
+		[PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER]	= "alt-period-jitter",
 	};
 	if ((unsigned int)term_type >= __PARSE_EVENTS__TERM_TYPE_NR)
 		return "unknown term";
@@ -856,6 +857,7 @@ config_term_avail(enum parse_events__term_type term_type, struct parse_events_er
 	case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
 	case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
 	case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
+	case PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER:
 	case PARSE_EVENTS__TERM_TYPE_PERCORE:
 		return true;
 	case PARSE_EVENTS__TERM_TYPE_USER:
@@ -1010,6 +1012,16 @@ do {									   \
 		}
 		attr->alternative_sample_period = term->val.num;
 		break;
+	case PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER:
+		CHECK_TYPE_VAL(NUM);
+		if ((unsigned int)term->val.num > 7) {
+			parse_events_error__handle(err, term->err_val,
+						strdup("expected a value between 0-7"),
+						NULL);
+			return -EINVAL;
+		}
+		attr->jitter_alternate_period = (unsigned int)term->val.num;
+		break;
 	case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
 	case PARSE_EVENTS__TERM_TYPE_USER:
 	case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
@@ -1137,6 +1149,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
 	case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
 	case PARSE_EVENTS__TERM_TYPE_HARDWARE:
 	case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
+	case PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER:
 	default:
 		if (err) {
 			parse_events_error__handle(err, term->err_term,
@@ -1269,6 +1282,7 @@ do {								\
 		case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
 		case PARSE_EVENTS__TERM_TYPE_HARDWARE:
 		case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
+		case PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER:
 		default:
 			break;
 		}
@@ -1323,6 +1337,7 @@ static int get_config_chgs(struct perf_pmu *pmu, struct parse_events_terms *head
 		case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
 		case PARSE_EVENTS__TERM_TYPE_HARDWARE:
 		case PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD:
+		case PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER:
 		default:
 			break;
 		}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index f1482361321a..9b9649da1932 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -80,7 +80,8 @@ enum parse_events__term_type {
 	PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE,
 	PARSE_EVENTS__TERM_TYPE_HARDWARE,
 	PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD,
-#define	__PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD + 1)
+	PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER,
+#define	__PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER + 1)
 };
 
 struct parse_events_term {
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 1972a5a696ef..c22e64407d64 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -332,6 +332,7 @@ aux-output		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT); }
 aux-sample-size		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
 metric-id		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
 alt-period		{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_ALT_SAMPLE_PERIOD); }
+alt-period-jitter	{ return term(yyscanner, PARSE_EVENTS__TERM_TYPE_ALT_PERIOD_JITTER); }
 cpu-cycles|cycles				{ return hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
 stalled-cycles-frontend|idle-cycles-frontend	{ return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND); }
 stalled-cycles-backend|idle-cycles-backend	{ return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_BACKEND); }
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 7615d05e389f..c7a5ee61c436 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1739,6 +1739,7 @@ int perf_pmu__for_each_format(struct perf_pmu *pmu, void *state, pmu_format_call
 		"aux-output",
 		"aux-sample-size=number",
 		"alt-period=number",
+		"alt-period-jitter=number",
 	};
 	struct perf_pmu_format *format;
 	int ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 0/4] A mechanism for efficient support for per-function metrics
  2024-11-07 16:07 [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Deepak Surti
                   ` (3 preceding siblings ...)
  2024-11-07 16:07 ` [PATCH v1 4/4] tools/perf: Modify event parser to support alt-period-jitter term Deepak Surti
@ 2024-11-14  2:22 ` Ian Rogers
  2024-11-14 14:36   ` James Clark
  2024-11-25 17:05   ` Deepak Surti
  4 siblings, 2 replies; 12+ messages in thread
From: Ian Rogers @ 2024-11-14  2:22 UTC (permalink / raw)
  To: Deepak Surti
  Cc: peterz, mingo, acme, namhyung, mark.barnett, ben.gainey, ak, will,
	james.clark, mark.rutland, alexander.shishkin, jolsa,
	adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel

On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@arm.com> wrote:
>
> This patch introduces the concept on an alternating sample rate to perf
> core and provides the necessary basic changes in the tools to activate
> that option.
>
> This patchset was original posted by Ben Gainey out for RFC back in April,
> the latest version of which can be found at
> https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/.
> Going forward, I will be owning this.
>
> The primary use case for this change is to be able to enable collecting
> per-function performance metrics using the Arm PMU, as per the following
> approach:
>
>  * Starting with a simple periodic sampling (hotspot) profile,
>    augment each sample with PMU counters accumulated over a short window
>    up to the point the sample was take.
>  * For each sample, perform some filtering to improve attribution of
>    the accumulated PMU counters (ensure they are attributed to a single
>    function)
>  * For each function accumulate a total for each PMU counter so that
>    metrics may be derived.
>
> Without modification, and sampling at a typical rate associated
> with hotspot profiling (~1mS) leads to poor results. Such an
> approach gives you a reasonable estimation of where the profiled
> application is spending time for relatively low overhead, but the
> PMU counters cannot easily be attributed to a single function as the
> window over which they are collected is too large. A modern CPU may
> execute many millions of instructions over many thousands of functions
> within 1mS window. With this approach, the per-function metrics tend
> to trend to some average value across the top N functions in the
> profile.
>
> In order to ensure a reasonable likelihood that the counters are
> attributed to a single function, the sampling window must be rather
> short; typically something in the order of a few hundred cycles proves
> well as tested on a range of aarch64 Cortex and Neoverse cores.
>
> As it stands, it is possible to achieve this with perf using a very high
> sampling rate (e.g ~300cy), but there are at least three major concerns
> with this approach:
>
>  * For speculatively executing, out of order cores, can the results be
>    accurately attributed to a give function or the given sample window?
>  * A short sample window is not guaranteed to cover a single function.
>  * The overhead of sampling every few hundred cycles is very high and
>    is highly likely to cause throttling which is undesirable as it leads
>    to patchy results; i.e. the profile alternates between periods of
>    high frequency samples followed by longer periods of no samples.
>
> This patch does not address the first two points directly. Some means
> to address those are discussed on the RFC v2 cover letter. The patch
> focuses on addressing the final point, though happily this approach
> gives us a way to perform basic filtering on the second point.
>
> The alternating sample period allows us to do two things:
>
>  * We can control the risk of throttling and reduce overhead by
>    alternating between a long and short period. This allows us to
>    decouple the "periodic" sampling rate (as might be used for hotspot
>    profiling) from the short sampling window needed for collecting
>    the PMU counters.
>  * The sample taken at the end of the long period can be otherwise
>    discarded (as the PMU data is not useful), but the
>    PERF_RECORD_CALLCHAIN information can be used to identify the current
>    function at the start of the short sample window. This is useful
>    for filtering samples where the PMU counter data cannot be attributed
>    to a single function.

I think this is interesting. I'm a little concerned on the approach as
I wonder if a more flexible mechanism could be had.

One approach that wouldn't work would be to open high and low
frequency events, or groups of events, then use BPF filters to try to
replicate this approach by dropping most of the high frequency events.
I don't think it would work as the high frequency sampling is likely
going to trigger during the BPF filter execution, and the BPF filter
would be too much overhead.

Perhaps another approach is to change the perf event period with a new
BPF helper function that's called where we do the perf event
filtering. There's the overhead of running the BPF code, but the BPF
code could allow you to instead of alternating between two periods
allow you to alternate between an arbitrary number of them.

Thanks,
Ian

> There are several reasons why it is desirable to reduce the overhead and
> risk of throttling:
>
>   * PMU counter overflow typically causes an interrupt into the kernel;
>     this affects program runtime, and can affect things like branch
>     prediction, cache locality and so on which can skew the metrics.
>   * The very high sample rate produces significant amounts of data.
>     Depending on the configuration of the profiling session and machine,
>     it is easily possible to produce many orders of magnitude more data
>     which is costly for tools to post-process and increases the chance
>     of data loss. This is especially relevant on larger core count
>     systems where it is very easy to produce massive recordings.
>     Whilst the kernel will throttle such a configuration,
>     which helps to mitigate a large portion of the bandwidth and capture
>     overhead, it is not something that can be controlled for on a per
>     event basis, or for non-root users, and because throttling is
>     controlled as a percentage of time its affects vary from machine to
>     machine. AIUI throttling may also produce an uneven temporal
>     distribution of samples. Finally, whilst throttling does a good job
>     at reducing the overall amount of data produced, it still leads to
>     much larger captures than with this method; typically we have
>     observed 1-2 orders of magnitude larger captures.
>
> This patch set modifies perf core to support alternating between two
> sample_period values, providing a simple and inexpensive way for tools
> to separate out the sample window (time over which events are
> counted) from the sample period (time between interesting samples).
>
> It is expected to be used with the cycle counter event, alternating
> between a long and short period and subsequently discarding the counter
> data for samples with the long period. The combined long and short
> period gives the overall sampling period, and the short sample period
> gives the sample window. The symbol taken from the sample at the end of
> the long period can be used by tools to ensure correct attribution as
> described previously. The cycle counter is recommended as it provides
> fair temporal distribution of samples as would be required for the
> per-symbol sample count mentioned previously, and because the PMU can
> be programmed to overflow after a sufficiently short window (which may
> not be possible with software timer, for example). This patch does not
> restrict to only the cycle counter, it is possible there could be other
> novel uses based on different events, or more appropriate counters on
> other architectures. This patch set does not modify or otherwise disable
> the kernel's existing throttling behaviour; if a configuration is given
> that would lead high CPU usage, then throttling still occurs.
>
>
> To test this a simple `perf script` based python script was developed.
> For a limited set of Arm PMU events it will post process a
> `perf record`-ing and generate a table of metrics. Along side this a
> benchmark application was developed that rotates through a sequence
> of different classes of behaviour that can be detected by the Arm PMU
> (eg. mispredicts, cache misses, different instruction mixes). The path
> through the benchmark can be rotated after each iteration so as to
> ensure the results don't land on some lucky harmonic with the sample
> period. The script can be used with and without this patch allowing
> comparison of the results. Testing was on Juno (A53+A57), N1SDP,
> Gravaton 2 and 3. In addition this approach has been applied to a few
> of Arm's tools projects and has correctly identified improvements and
> regressions.
>
> Headline results from testing indicate that ~300 cycles sample window
> gives good results with or without this patch. Typical output on N1SDP (Neoverse-N1)
> for the provided benchmark when run as:
>
>     perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
>         -k CLOCK_MONOTONIC_RAW \
>         -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
>         benchmark 0 1
>
>     perf script -s generate-function-metrics.py -- -s discard
>
> Looks like (reformatted for email brevity):
>
>     Symbol              #     CPI   BM/KI  CM/KI  %CM   %CY   %I    %BM   %L1DA  %L1DM
>     fp_divider_stalls   6553   4.9   0.0     0.0   0.0  41.8  22.9   0.1   0.6    0.0
>     int_divider_stalls  4741   3.5   0.0     0.0   1.1  28.3  21.5   0.1   1.9    0.2
>     isb                 3414  20.1   0.2     0.0   0.4  17.6   2.3   0.1   0.8    0.0
>     branch_mispredicts  1234   1.1  33.0     0.0   0.0   6.1  15.2  99.0  71.6    0.1
>     double_to_int        694   0.5   0.0     0.0   0.6   3.4  19.1   0.1   1.2    0.1
>     nops                 417   0.3   0.2     0.0   2.8   1.9  18.3   0.6   0.4    0.1
>     dcache_miss          185   3.6   0.4   184.7  53.8   0.7   0.5   0.0  18.4   99.1
>
> (CPI = Cycles/Instruction, BM/KI = Branch Misses per 1000 Instruction,
>  CM/KI = Cache Misses per 1000 Instruction, %CM = Percent of Cache
>  accesses that miss, %CY = Percentage of total cycles, %I = Percentage
>  of total instructions, %BM = Percentage of total branch mispredicts,
>  %L1DA = Percentage of total cache accesses, %L1DM = Percentage of total
>  cache misses)
>
> When the patch is used, the resulting `perf.data` files are typically
> between 25-50x smaller than without, and take ~25x less time for the
> python script to post-process. For example, running the following:
>
>     perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
>     perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
>     perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1
>
> produces captures on N1SDP (Neoverse-N1) of the following sizes:
>
>     * period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
>     * period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
>     * period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s
>
> The "script time" is the user time from running "time perf script -s generate-function-metrics.py"
> on the recording. Similar processing times were observed for "time perf report --stdio|cat"
> as well.
>
> By comparison, with the patch active:
>
>     perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
>
> produces 4.923 MB perf.data (107512 samples), and script time = 0m0.578s.
> Which is as expected ~2x the size and ~2x the number of samples as per
> the period=1000000 recording. When compared to the period=300 recording,
> the results from the provided post-processing script are (within margin
> of error) the same, but the data file is ~62x smaller. The same affect
> is seen for the post-processing script runtime.
>
> Notably, without the patch enable, L1D cache miss rates are often higher
> than with, which we attribute to increased impact on the cache that
> trapping into the kernel every 300 cycles has.
>
> These results are given with `perf_cpu_time_max_percent=25`. When tested
> with `perf_cpu_time_max_percent=100` the size and time comparisons are
> more significant. Disabling throttling did not lead to obvious
> improvements in the collected metrics, suggesting that the sampling
> approach is sufficient to collect representative metrics.
>
> Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
> window (with and without the patch) suggests this approach might work
> for some counters. Using the same test script, it was possible to identify
> branch mispredicts correctly. However, whilst the patch is functionally
> correct, differences in the architectures may mean that this approach it
> enables does not apply as a means to collect per-function metrics on x86.
>
> Changes since RFC v2:
>  - Rebased on v6.12-rc6.
>
> Changes since RFC v1:
>  - Rebased on v6.9-rc1.
>  - Refactored from arm_pmu based extension to core feature
>  - Added the ability to jitter the sample window based on feedback
>    from Andi Kleen.
>  - Modified perf tool to parse the "alt-period" and "alt-period-jitter"
>    terms in the event specification.
>
> Ben Gainey (4):
>   perf: Allow periodic events to alternate between two sample periods
>   perf: Allow adding fixed random jitter to the alternate sampling
>     period
>   tools/perf: Modify event parser to support alt-period term
>   tools/perf: Modify event parser to support alt-period-jitter term
>
>  include/linux/perf_event.h                    |  5 ++
>  include/uapi/linux/perf_event.h               | 13 ++++-
>  kernel/events/core.c                          | 47 +++++++++++++++++++
>  tools/include/uapi/linux/perf_event.h         | 13 ++++-
>  tools/perf/tests/attr.c                       |  2 +
>  tools/perf/tests/attr.py                      |  2 +
>  tools/perf/tests/attr/base-record             |  4 +-
>  tools/perf/tests/attr/base-record-spe         |  2 +
>  tools/perf/tests/attr/base-stat               |  4 +-
>  tools/perf/tests/attr/system-wide-dummy       |  4 +-
>  .../attr/test-record-alt-period-jitter-term   | 13 +++++
>  .../tests/attr/test-record-alt-period-term    | 12 +++++
>  tools/perf/tests/attr/test-record-dummy-C0    |  4 +-
>  tools/perf/util/parse-events.c                | 30 ++++++++++++
>  tools/perf/util/parse-events.h                |  4 +-
>  tools/perf/util/parse-events.l                |  2 +
>  tools/perf/util/perf_event_attr_fprintf.c     |  1 +
>  tools/perf/util/pmu.c                         |  2 +
>  18 files changed, 157 insertions(+), 7 deletions(-)
>  create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term
>  create mode 100644 tools/perf/tests/attr/test-record-alt-period-term
>
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 0/4] A mechanism for efficient support for per-function metrics
  2024-11-14  2:22 ` [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Ian Rogers
@ 2024-11-14 14:36   ` James Clark
  2024-11-25 17:05   ` Deepak Surti
  1 sibling, 0 replies; 12+ messages in thread
From: James Clark @ 2024-11-14 14:36 UTC (permalink / raw)
  To: Ian Rogers, Deepak Surti, Leo Yan
  Cc: peterz, mingo, acme, namhyung, mark.barnett, ben.gainey, ak, will,
	james.clark, mark.rutland, alexander.shishkin, jolsa,
	adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel



On 14/11/2024 2:22 am, Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@arm.com> wrote:
>>
>> This patch introduces the concept on an alternating sample rate to perf
>> core and provides the necessary basic changes in the tools to activate
>> that option.
>>
>> This patchset was original posted by Ben Gainey out for RFC back in April,
>> the latest version of which can be found at
>> https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/.
>> Going forward, I will be owning this.
>>
>> The primary use case for this change is to be able to enable collecting
>> per-function performance metrics using the Arm PMU, as per the following
>> approach:
>>
>>   * Starting with a simple periodic sampling (hotspot) profile,
>>     augment each sample with PMU counters accumulated over a short window
>>     up to the point the sample was take.
>>   * For each sample, perform some filtering to improve attribution of
>>     the accumulated PMU counters (ensure they are attributed to a single
>>     function)
>>   * For each function accumulate a total for each PMU counter so that
>>     metrics may be derived.
>>
>> Without modification, and sampling at a typical rate associated
>> with hotspot profiling (~1mS) leads to poor results. Such an
>> approach gives you a reasonable estimation of where the profiled
>> application is spending time for relatively low overhead, but the
>> PMU counters cannot easily be attributed to a single function as the
>> window over which they are collected is too large. A modern CPU may
>> execute many millions of instructions over many thousands of functions
>> within 1mS window. With this approach, the per-function metrics tend
>> to trend to some average value across the top N functions in the
>> profile.
>>
>> In order to ensure a reasonable likelihood that the counters are
>> attributed to a single function, the sampling window must be rather
>> short; typically something in the order of a few hundred cycles proves
>> well as tested on a range of aarch64 Cortex and Neoverse cores.
>>
>> As it stands, it is possible to achieve this with perf using a very high
>> sampling rate (e.g ~300cy), but there are at least three major concerns
>> with this approach:
>>
>>   * For speculatively executing, out of order cores, can the results be
>>     accurately attributed to a give function or the given sample window?
>>   * A short sample window is not guaranteed to cover a single function.
>>   * The overhead of sampling every few hundred cycles is very high and
>>     is highly likely to cause throttling which is undesirable as it leads
>>     to patchy results; i.e. the profile alternates between periods of
>>     high frequency samples followed by longer periods of no samples.
>>
>> This patch does not address the first two points directly. Some means
>> to address those are discussed on the RFC v2 cover letter. The patch
>> focuses on addressing the final point, though happily this approach
>> gives us a way to perform basic filtering on the second point.
>>
>> The alternating sample period allows us to do two things:
>>
>>   * We can control the risk of throttling and reduce overhead by
>>     alternating between a long and short period. This allows us to
>>     decouple the "periodic" sampling rate (as might be used for hotspot
>>     profiling) from the short sampling window needed for collecting
>>     the PMU counters.
>>   * The sample taken at the end of the long period can be otherwise
>>     discarded (as the PMU data is not useful), but the
>>     PERF_RECORD_CALLCHAIN information can be used to identify the current
>>     function at the start of the short sample window. This is useful
>>     for filtering samples where the PMU counter data cannot be attributed
>>     to a single function.
> 
> I think this is interesting. I'm a little concerned on the approach as
> I wonder if a more flexible mechanism could be had.
> 
> One approach that wouldn't work would be to open high and low
> frequency events, or groups of events, then use BPF filters to try to
> replicate this approach by dropping most of the high frequency events.
> I don't think it would work as the high frequency sampling is likely
> going to trigger during the BPF filter execution, and the BPF filter
> would be too much overhead.
> 
> Perhaps another approach is to change the perf event period with a new
> BPF helper function that's called where we do the perf event
> filtering. There's the overhead of running the BPF code, but the BPF
> code could allow you to instead of alternating between two periods
> allow you to alternate between an arbitrary number of them.
> 
> Thanks,
> Ian
> 

There might be something to the arbitrary number of periods, because for 
a very short run you might want a high sample rate and for a long run 
you would want a low rate. BPF might allow you to reduce the rate over 
time so you don't have to worry about picking the right one so much.

+Leo because I think he's looking at linking BPF to the aux pause/resume 
patches [1] which could be similar.

[1]: 
https://lore.kernel.org/linux-perf-users/20241114101711.34987-1-adrian.hunter@intel.com/T/#t

>> There are several reasons why it is desirable to reduce the overhead and
>> risk of throttling:
>>
>>    * PMU counter overflow typically causes an interrupt into the kernel;
>>      this affects program runtime, and can affect things like branch
>>      prediction, cache locality and so on which can skew the metrics.
>>    * The very high sample rate produces significant amounts of data.
>>      Depending on the configuration of the profiling session and machine,
>>      it is easily possible to produce many orders of magnitude more data
>>      which is costly for tools to post-process and increases the chance
>>      of data loss. This is especially relevant on larger core count
>>      systems where it is very easy to produce massive recordings.
>>      Whilst the kernel will throttle such a configuration,
>>      which helps to mitigate a large portion of the bandwidth and capture
>>      overhead, it is not something that can be controlled for on a per
>>      event basis, or for non-root users, and because throttling is
>>      controlled as a percentage of time its affects vary from machine to
>>      machine. AIUI throttling may also produce an uneven temporal
>>      distribution of samples. Finally, whilst throttling does a good job
>>      at reducing the overall amount of data produced, it still leads to
>>      much larger captures than with this method; typically we have
>>      observed 1-2 orders of magnitude larger captures.
>>
>> This patch set modifies perf core to support alternating between two
>> sample_period values, providing a simple and inexpensive way for tools
>> to separate out the sample window (time over which events are
>> counted) from the sample period (time between interesting samples).
>>
>> It is expected to be used with the cycle counter event, alternating
>> between a long and short period and subsequently discarding the counter
>> data for samples with the long period. The combined long and short
>> period gives the overall sampling period, and the short sample period
>> gives the sample window. The symbol taken from the sample at the end of
>> the long period can be used by tools to ensure correct attribution as
>> described previously. The cycle counter is recommended as it provides
>> fair temporal distribution of samples as would be required for the
>> per-symbol sample count mentioned previously, and because the PMU can
>> be programmed to overflow after a sufficiently short window (which may
>> not be possible with software timer, for example). This patch does not
>> restrict to only the cycle counter, it is possible there could be other
>> novel uses based on different events, or more appropriate counters on
>> other architectures. This patch set does not modify or otherwise disable
>> the kernel's existing throttling behaviour; if a configuration is given
>> that would lead high CPU usage, then throttling still occurs.
>>
>>
>> To test this a simple `perf script` based python script was developed.
>> For a limited set of Arm PMU events it will post process a
>> `perf record`-ing and generate a table of metrics. Along side this a
>> benchmark application was developed that rotates through a sequence
>> of different classes of behaviour that can be detected by the Arm PMU
>> (eg. mispredicts, cache misses, different instruction mixes). The path
>> through the benchmark can be rotated after each iteration so as to
>> ensure the results don't land on some lucky harmonic with the sample
>> period. The script can be used with and without this patch allowing
>> comparison of the results. Testing was on Juno (A53+A57), N1SDP,
>> Gravaton 2 and 3. In addition this approach has been applied to a few
>> of Arm's tools projects and has correctly identified improvements and
>> regressions.
>>
>> Headline results from testing indicate that ~300 cycles sample window
>> gives good results with or without this patch. Typical output on N1SDP (Neoverse-N1)
>> for the provided benchmark when run as:
>>
>>      perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
>>          -k CLOCK_MONOTONIC_RAW \
>>          -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
>>          benchmark 0 1
>>
>>      perf script -s generate-function-metrics.py -- -s discard
>>
>> Looks like (reformatted for email brevity):
>>
>>      Symbol              #     CPI   BM/KI  CM/KI  %CM   %CY   %I    %BM   %L1DA  %L1DM
>>      fp_divider_stalls   6553   4.9   0.0     0.0   0.0  41.8  22.9   0.1   0.6    0.0
>>      int_divider_stalls  4741   3.5   0.0     0.0   1.1  28.3  21.5   0.1   1.9    0.2
>>      isb                 3414  20.1   0.2     0.0   0.4  17.6   2.3   0.1   0.8    0.0
>>      branch_mispredicts  1234   1.1  33.0     0.0   0.0   6.1  15.2  99.0  71.6    0.1
>>      double_to_int        694   0.5   0.0     0.0   0.6   3.4  19.1   0.1   1.2    0.1
>>      nops                 417   0.3   0.2     0.0   2.8   1.9  18.3   0.6   0.4    0.1
>>      dcache_miss          185   3.6   0.4   184.7  53.8   0.7   0.5   0.0  18.4   99.1
>>
>> (CPI = Cycles/Instruction, BM/KI = Branch Misses per 1000 Instruction,
>>   CM/KI = Cache Misses per 1000 Instruction, %CM = Percent of Cache
>>   accesses that miss, %CY = Percentage of total cycles, %I = Percentage
>>   of total instructions, %BM = Percentage of total branch mispredicts,
>>   %L1DA = Percentage of total cache accesses, %L1DM = Percentage of total
>>   cache misses)
>>
>> When the patch is used, the resulting `perf.data` files are typically
>> between 25-50x smaller than without, and take ~25x less time for the
>> python script to post-process. For example, running the following:
>>
>>      perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
>>      perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
>>      perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1
>>
>> produces captures on N1SDP (Neoverse-N1) of the following sizes:
>>
>>      * period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
>>      * period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
>>      * period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s
>>
>> The "script time" is the user time from running "time perf script -s generate-function-metrics.py"
>> on the recording. Similar processing times were observed for "time perf report --stdio|cat"
>> as well.
>>
>> By comparison, with the patch active:
>>
>>      perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
>>
>> produces 4.923 MB perf.data (107512 samples), and script time = 0m0.578s.
>> Which is as expected ~2x the size and ~2x the number of samples as per
>> the period=1000000 recording. When compared to the period=300 recording,
>> the results from the provided post-processing script are (within margin
>> of error) the same, but the data file is ~62x smaller. The same affect
>> is seen for the post-processing script runtime.
>>
>> Notably, without the patch enable, L1D cache miss rates are often higher
>> than with, which we attribute to increased impact on the cache that
>> trapping into the kernel every 300 cycles has.
>>
>> These results are given with `perf_cpu_time_max_percent=25`. When tested
>> with `perf_cpu_time_max_percent=100` the size and time comparisons are
>> more significant. Disabling throttling did not lead to obvious
>> improvements in the collected metrics, suggesting that the sampling
>> approach is sufficient to collect representative metrics.
>>
>> Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
>> window (with and without the patch) suggests this approach might work
>> for some counters. Using the same test script, it was possible to identify
>> branch mispredicts correctly. However, whilst the patch is functionally
>> correct, differences in the architectures may mean that this approach it
>> enables does not apply as a means to collect per-function metrics on x86.
>>
>> Changes since RFC v2:
>>   - Rebased on v6.12-rc6.
>>
>> Changes since RFC v1:
>>   - Rebased on v6.9-rc1.
>>   - Refactored from arm_pmu based extension to core feature
>>   - Added the ability to jitter the sample window based on feedback
>>     from Andi Kleen.
>>   - Modified perf tool to parse the "alt-period" and "alt-period-jitter"
>>     terms in the event specification.
>>
>> Ben Gainey (4):
>>    perf: Allow periodic events to alternate between two sample periods
>>    perf: Allow adding fixed random jitter to the alternate sampling
>>      period
>>    tools/perf: Modify event parser to support alt-period term
>>    tools/perf: Modify event parser to support alt-period-jitter term
>>
>>   include/linux/perf_event.h                    |  5 ++
>>   include/uapi/linux/perf_event.h               | 13 ++++-
>>   kernel/events/core.c                          | 47 +++++++++++++++++++
>>   tools/include/uapi/linux/perf_event.h         | 13 ++++-
>>   tools/perf/tests/attr.c                       |  2 +
>>   tools/perf/tests/attr.py                      |  2 +
>>   tools/perf/tests/attr/base-record             |  4 +-
>>   tools/perf/tests/attr/base-record-spe         |  2 +
>>   tools/perf/tests/attr/base-stat               |  4 +-
>>   tools/perf/tests/attr/system-wide-dummy       |  4 +-
>>   .../attr/test-record-alt-period-jitter-term   | 13 +++++
>>   .../tests/attr/test-record-alt-period-term    | 12 +++++
>>   tools/perf/tests/attr/test-record-dummy-C0    |  4 +-
>>   tools/perf/util/parse-events.c                | 30 ++++++++++++
>>   tools/perf/util/parse-events.h                |  4 +-
>>   tools/perf/util/parse-events.l                |  2 +
>>   tools/perf/util/perf_event_attr_fprintf.c     |  1 +
>>   tools/perf/util/pmu.c                         |  2 +
>>   18 files changed, 157 insertions(+), 7 deletions(-)
>>   create mode 100644 tools/perf/tests/attr/test-record-alt-period-jitter-term
>>   create mode 100644 tools/perf/tests/attr/test-record-alt-period-term
>>
>> --
>> 2.43.0
>>
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1 0/4] A mechanism for efficient support for per-function metrics
  2024-11-14  2:22 ` [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Ian Rogers
  2024-11-14 14:36   ` James Clark
@ 2024-11-25 17:05   ` Deepak Surti
  1 sibling, 0 replies; 12+ messages in thread
From: Deepak Surti @ 2024-11-25 17:05 UTC (permalink / raw)
  To: irogers@google.com
  Cc: Ben Gainey, alexander.shishkin@linux.intel.com, Mark Barnett,
	James Clark, adrian.hunter@intel.com, ak@linux.intel.com,
	linux-kernel@vger.kernel.org, mingo@redhat.com,
	linux-perf-users@vger.kernel.org, will@kernel.org, Mark Rutland,
	peterz@infradead.org, linux-arm-kernel@lists.infradead.org,
	acme@kernel.org, jolsa@kernel.org, namhyung@kernel.org

On Wed, 2024-11-13 at 18:22 -0800, Ian Rogers wrote:
> On Thu, Nov 7, 2024 at 8:08 AM Deepak Surti <deepak.surti@arm.com>
> wrote:
> > 
> > This patch introduces the concept on an alternating sample rate to
> > perf
> > core and provides the necessary basic changes in the tools to
> > activate
> > that option.
> > 
> > This patchset was original posted by Ben Gainey out for RFC back in
> > April,
> > the latest version of which can be found at
> > https://lore.kernel.org/linux-perf-users/20240422104929.264241-1-ben.gainey@arm.com/
> > .
> > Going forward, I will be owning this.
> > 
> > The primary use case for this change is to be able to enable
> > collecting
> > per-function performance metrics using the Arm PMU, as per the
> > following
> > approach:
> > 
> >  * Starting with a simple periodic sampling (hotspot) profile,
> >    augment each sample with PMU counters accumulated over a short
> > window
> >    up to the point the sample was take.
> >  * For each sample, perform some filtering to improve attribution
> > of
> >    the accumulated PMU counters (ensure they are attributed to a
> > single
> >    function)
> >  * For each function accumulate a total for each PMU counter so
> > that
> >    metrics may be derived.
> > 
> > Without modification, and sampling at a typical rate associated
> > with hotspot profiling (~1mS) leads to poor results. Such an
> > approach gives you a reasonable estimation of where the profiled
> > application is spending time for relatively low overhead, but the
> > PMU counters cannot easily be attributed to a single function as
> > the
> > window over which they are collected is too large. A modern CPU may
> > execute many millions of instructions over many thousands of
> > functions
> > within 1mS window. With this approach, the per-function metrics
> > tend
> > to trend to some average value across the top N functions in the
> > profile.
> > 
> > In order to ensure a reasonable likelihood that the counters are
> > attributed to a single function, the sampling window must be rather
> > short; typically something in the order of a few hundred cycles
> > proves
> > well as tested on a range of aarch64 Cortex and Neoverse cores.
> > 
> > As it stands, it is possible to achieve this with perf using a very
> > high
> > sampling rate (e.g ~300cy), but there are at least three major
> > concerns
> > with this approach:
> > 
> >  * For speculatively executing, out of order cores, can the results
> > be
> >    accurately attributed to a give function or the given sample
> > window?
> >  * A short sample window is not guaranteed to cover a single
> > function.
> >  * The overhead of sampling every few hundred cycles is very high
> > and
> >    is highly likely to cause throttling which is undesirable as it
> > leads
> >    to patchy results; i.e. the profile alternates between periods
> > of
> >    high frequency samples followed by longer periods of no samples.
> > 
> > This patch does not address the first two points directly. Some
> > means
> > to address those are discussed on the RFC v2 cover letter. The
> > patch
> > focuses on addressing the final point, though happily this approach
> > gives us a way to perform basic filtering on the second point.
> > 
> > The alternating sample period allows us to do two things:
> > 
> >  * We can control the risk of throttling and reduce overhead by
> >    alternating between a long and short period. This allows us to
> >    decouple the "periodic" sampling rate (as might be used for
> > hotspot
> >    profiling) from the short sampling window needed for collecting
> >    the PMU counters.
> >  * The sample taken at the end of the long period can be otherwise
> >    discarded (as the PMU data is not useful), but the
> >    PERF_RECORD_CALLCHAIN information can be used to identify the
> > current
> >    function at the start of the short sample window. This is useful
> >    for filtering samples where the PMU counter data cannot be
> > attributed
> >    to a single function.
> 
> 
Hi Ian,

> I think this is interesting. I'm a little concerned on the approach
> as
> I wonder if a more flexible mechanism could be had.
> 
> One approach that wouldn't work would be to open high and low
> frequency events, or groups of events, then use BPF filters to try to
> replicate this approach by dropping most of the high frequency
> events.
> I don't think it would work as the high frequency sampling is likely
> going to trigger during the BPF filter execution, and the BPF filter
> would be too much overhead.
> 
> Perhaps another approach is to change the perf event period with a
> new
> BPF helper function that's called where we do the perf event
> filtering. There's the overhead of running the BPF code, but the BPF
> code could allow you to instead of alternating between two periods
> allow you to alternate between an arbitrary number of them.

As you note, just using BPF to filter high frequency samples is
probably too much overhead.

But I think more generally the issue with BPF is that it would impose
additional permissions restrictions vs perf-only. That is to say, in
many cases BPF is restricted to root-only, whereas the approach of
perf-only would support application profiling for non-root users. For
our use-case in particular, this is more important than the theoretical
additional flexibility that BPF could bring.

Thanks,
Deepak

> Thanks,
> Ian
> 
> > There are several reasons why it is desirable to reduce the
> > overhead and
> > risk of throttling:
> > 
> >   * PMU counter overflow typically causes an interrupt into the
> > kernel;
> >     this affects program runtime, and can affect things like branch
> >     prediction, cache locality and so on which can skew the
> > metrics.
> >   * The very high sample rate produces significant amounts of data.
> >     Depending on the configuration of the profiling session and
> > machine,
> >     it is easily possible to produce many orders of magnitude more
> > data
> >     which is costly for tools to post-process and increases the
> > chance
> >     of data loss. This is especially relevant on larger core count
> >     systems where it is very easy to produce massive recordings.
> >     Whilst the kernel will throttle such a configuration,
> >     which helps to mitigate a large portion of the bandwidth and
> > capture
> >     overhead, it is not something that can be controlled for on a
> > per
> >     event basis, or for non-root users, and because throttling is
> >     controlled as a percentage of time its affects vary from
> > machine to
> >     machine. AIUI throttling may also produce an uneven temporal
> >     distribution of samples. Finally, whilst throttling does a good
> > job
> >     at reducing the overall amount of data produced, it still leads
> > to
> >     much larger captures than with this method; typically we have
> >     observed 1-2 orders of magnitude larger captures.
> > 
> > This patch set modifies perf core to support alternating between
> > two
> > sample_period values, providing a simple and inexpensive way for
> > tools
> > to separate out the sample window (time over which events are
> > counted) from the sample period (time between interesting samples).
> > 
> > It is expected to be used with the cycle counter event, alternating
> > between a long and short period and subsequently discarding the
> > counter
> > data for samples with the long period. The combined long and short
> > period gives the overall sampling period, and the short sample
> > period
> > gives the sample window. The symbol taken from the sample at the
> > end of
> > the long period can be used by tools to ensure correct attribution
> > as
> > described previously. The cycle counter is recommended as it
> > provides
> > fair temporal distribution of samples as would be required for the
> > per-symbol sample count mentioned previously, and because the PMU
> > can
> > be programmed to overflow after a sufficiently short window (which
> > may
> > not be possible with software timer, for example). This patch does
> > not
> > restrict to only the cycle counter, it is possible there could be
> > other
> > novel uses based on different events, or more appropriate counters
> > on
> > other architectures. This patch set does not modify or otherwise
> > disable
> > the kernel's existing throttling behaviour; if a configuration is
> > given
> > that would lead high CPU usage, then throttling still occurs.
> > 
> > 
> > To test this a simple `perf script` based python script was
> > developed.
> > For a limited set of Arm PMU events it will post process a
> > `perf record`-ing and generate a table of metrics. Along side this
> > a
> > benchmark application was developed that rotates through a sequence
> > of different classes of behaviour that can be detected by the Arm
> > PMU
> > (eg. mispredicts, cache misses, different instruction mixes). The
> > path
> > through the benchmark can be rotated after each iteration so as to
> > ensure the results don't land on some lucky harmonic with the
> > sample
> > period. The script can be used with and without this patch allowing
> > comparison of the results. Testing was on Juno (A53+A57), N1SDP,
> > Gravaton 2 and 3. In addition this approach has been applied to a
> > few
> > of Arm's tools projects and has correctly identified improvements
> > and
> > regressions.
> > 
> > Headline results from testing indicate that ~300 cycles sample
> > window
> > gives good results with or without this patch. Typical output on
> > N1SDP (Neoverse-N1)
> > for the provided benchmark when run as:
> > 
> >     perf record -T --sample-cpu --call-graph fp,4 --user-callchains
> > \
> >         -k CLOCK_MONOTONIC_RAW \
> >         -e '{cycles/period=999700,alt-
> > period=300/,instructions,branch-misses,cache-references,cache-
> > misses}:uS' \
> >         benchmark 0 1
> > 
> >     perf script -s generate-function-metrics.py -- -s discard
> > 
> > Looks like (reformatted for email brevity):
> > 
> >     Symbol              #     CPI   BM/KI  CM/KI  %CM   %CY   %I   
> > %BM   %L1DA  %L1DM
> >     fp_divider_stalls   6553   4.9   0.0     0.0   0.0  41.8 
> > 22.9   0.1   0.6    0.0
> >     int_divider_stalls  4741   3.5   0.0     0.0   1.1  28.3 
> > 21.5   0.1   1.9    0.2
> >     isb                 3414  20.1   0.2     0.0   0.4  17.6  
> > 2.3   0.1   0.8    0.0
> >     branch_mispredicts  1234   1.1  33.0     0.0   0.0   6.1  15.2 
> > 99.0  71.6    0.1
> >     double_to_int        694   0.5   0.0     0.0   0.6   3.4 
> > 19.1   0.1   1.2    0.1
> >     nops                 417   0.3   0.2     0.0   2.8   1.9 
> > 18.3   0.6   0.4    0.1
> >     dcache_miss          185   3.6   0.4   184.7  53.8   0.7  
> > 0.5   0.0  18.4   99.1
> > 
> > (CPI = Cycles/Instruction, BM/KI = Branch Misses per 1000
> > Instruction,
> >  CM/KI = Cache Misses per 1000 Instruction, %CM = Percent of Cache
> >  accesses that miss, %CY = Percentage of total cycles, %I =
> > Percentage
> >  of total instructions, %BM = Percentage of total branch
> > mispredicts,
> >  %L1DA = Percentage of total cache accesses, %L1DM = Percentage of
> > total
> >  cache misses)
> > 
> > When the patch is used, the resulting `perf.data` files are
> > typically
> > between 25-50x smaller than without, and take ~25x less time for
> > the
> > python script to post-process. For example, running the following:
> > 
> >     perf record -i -vvv -e
> > '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
> >     perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS'
> > benchmark 0 1
> >     perf record -i -vvv -e '{cycles/period=300/,instructions}:uS'
> > benchmark 0 1
> > 
> > produces captures on N1SDP (Neoverse-N1) of the following sizes:
> > 
> >     * period=1000000: 2.601 MB perf.data (55780 samples), script
> > time = 0m0.362s
> >     * period=1000: 283.749 MB perf.data (6162932 samples), script
> > time = 0m33.100s
> >     * period=300: 304.281 MB perf.data (6614182 samples), script
> > time = 0m35.826s
> > 
> > The "script time" is the user time from running "time perf script -
> > s generate-function-metrics.py"
> > on the recording. Similar processing times were observed for "time
> > perf report --stdio|cat"
> > as well.
> > 
> > By comparison, with the patch active:
> > 
> >     perf record -i -vvv -e '{cycles/period=999700,alt-
> > period=300/,instructions}:uS' benchmark 0 1
> > 
> > produces 4.923 MB perf.data (107512 samples), and script time =
> > 0m0.578s.
> > Which is as expected ~2x the size and ~2x the number of samples as
> > per
> > the period=1000000 recording. When compared to the period=300
> > recording,
> > the results from the provided post-processing script are (within
> > margin
> > of error) the same, but the data file is ~62x smaller. The same
> > affect
> > is seen for the post-processing script runtime.
> > 
> > Notably, without the patch enable, L1D cache miss rates are often
> > higher
> > than with, which we attribute to increased impact on the cache that
> > trapping into the kernel every 300 cycles has.
> > 
> > These results are given with `perf_cpu_time_max_percent=25`. When
> > tested
> > with `perf_cpu_time_max_percent=100` the size and time comparisons
> > are
> > more significant. Disabling throttling did not lead to obvious
> > improvements in the collected metrics, suggesting that the sampling
> > approach is sufficient to collect representative metrics.
> > 
> > Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
> > window (with and without the patch) suggests this approach might
> > work
> > for some counters. Using the same test script, it was possible to
> > identify
> > branch mispredicts correctly. However, whilst the patch is
> > functionally
> > correct, differences in the architectures may mean that this
> > approach it
> > enables does not apply as a means to collect per-function metrics
> > on x86.
> > 
> > Changes since RFC v2:
> >  - Rebased on v6.12-rc6.
> > 
> > Changes since RFC v1:
> >  - Rebased on v6.9-rc1.
> >  - Refactored from arm_pmu based extension to core feature
> >  - Added the ability to jitter the sample window based on feedback
> >    from Andi Kleen.
> >  - Modified perf tool to parse the "alt-period" and "alt-period-
> > jitter"
> >    terms in the event specification.
> > 
> > Ben Gainey (4):
> >   perf: Allow periodic events to alternate between two sample
> > periods
> >   perf: Allow adding fixed random jitter to the alternate sampling
> >     period
> >   tools/perf: Modify event parser to support alt-period term
> >   tools/perf: Modify event parser to support alt-period-jitter term
> > 
> >  include/linux/perf_event.h                    |  5 ++
> >  include/uapi/linux/perf_event.h               | 13 ++++-
> >  kernel/events/core.c                          | 47
> > +++++++++++++++++++
> >  tools/include/uapi/linux/perf_event.h         | 13 ++++-
> >  tools/perf/tests/attr.c                       |  2 +
> >  tools/perf/tests/attr.py                      |  2 +
> >  tools/perf/tests/attr/base-record             |  4 +-
> >  tools/perf/tests/attr/base-record-spe         |  2 +
> >  tools/perf/tests/attr/base-stat               |  4 +-
> >  tools/perf/tests/attr/system-wide-dummy       |  4 +-
> >  .../attr/test-record-alt-period-jitter-term   | 13 +++++
> >  .../tests/attr/test-record-alt-period-term    | 12 +++++
> >  tools/perf/tests/attr/test-record-dummy-C0    |  4 +-
> >  tools/perf/util/parse-events.c                | 30 ++++++++++++
> >  tools/perf/util/parse-events.h                |  4 +-
> >  tools/perf/util/parse-events.l                |  2 +
> >  tools/perf/util/perf_event_attr_fprintf.c     |  1 +
> >  tools/perf/util/pmu.c                         |  2 +
> >  18 files changed, 157 insertions(+), 7 deletions(-)
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-
> > jitter-term
> >  create mode 100644 tools/perf/tests/attr/test-record-alt-period-
> > term
> > 
> > --
> > 2.43.0
> > 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-11-25 17:12 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-07 16:07 [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Deepak Surti
2024-11-07 16:07 ` [PATCH v1 1/4] perf: Allow periodic events to alternate between two sample periods Deepak Surti
2024-11-14 15:01   ` Peter Zijlstra
2024-11-25 17:11     ` Deepak Surti
2024-11-07 16:07 ` [PATCH v1 2/4] perf: Allow adding fixed random jitter to the alternate sampling period Deepak Surti
2024-11-07 16:07 ` [PATCH v1 3/4] tools/perf: Modify event parser to support alt-period term Deepak Surti
2024-11-14  2:10   ` Ian Rogers
2024-11-25 16:02     ` Deepak Surti
2024-11-07 16:07 ` [PATCH v1 4/4] tools/perf: Modify event parser to support alt-period-jitter term Deepak Surti
2024-11-14  2:22 ` [PATCH v1 0/4] A mechanism for efficient support for per-function metrics Ian Rogers
2024-11-14 14:36   ` James Clark
2024-11-25 17:05   ` Deepak Surti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).