* [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
@ 2025-04-08 17:15 mark.barnett
2025-04-08 17:15 ` [PATCH v4 1/5] perf: Record sample last_period before updating mark.barnett
` (6 more replies)
0 siblings, 7 replies; 16+ messages in thread
From: mark.barnett @ 2025-04-08 17:15 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers
Cc: ben.gainey, deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel, Mark Barnett
From: Mark Barnett <mark.barnett@arm.com>
This patch introduces the concept of an alternating sample rate to perf
core and provides the necessary basic changes in the tools to activate
that option.
The primary use case for this change is to be able to enable collecting
per-function performance metrics using the Arm PMU, as per the following
approach:
* Starting with a simple periodic sampling (hotspot) profile,
augment each sample with PMU counters accumulated over a short window
up to the point the sample was taken.
* For each sample, perform some filtering to improve attribution of
the accumulated PMU counters (ensure they are attributed to a single
function).
* For each function, accumulate a total for each PMU counter so that
metrics may be derived; a minimal sketch of this accumulation step is
given after the list.
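For illustration only, the accumulation step might look something like
the sketch below. The struct layout and the account_sample() helper are
hypothetical and not part of this series; they simply show the shape of
the per-function totals from which metrics such as CPI are later
derived:

#include <stdint.h>
#include <string.h>

#define MAX_FUNCS 1024

struct func_totals {
	char symbol[64];
	uint64_t samples;
	uint64_t cycles;
	uint64_t instructions;
	uint64_t branch_misses;
};

static struct func_totals totals[MAX_FUNCS];
static unsigned int nr_funcs;

/* Accumulate one attributed short-window sample into the per-function totals. */
static void account_sample(const char *symbol, uint64_t cycles,
			   uint64_t instructions, uint64_t branch_misses)
{
	unsigned int i;

	for (i = 0; i < nr_funcs; i++)
		if (!strcmp(totals[i].symbol, symbol))
			break;

	if (i == nr_funcs) {
		if (nr_funcs == MAX_FUNCS)
			return;
		strncpy(totals[i].symbol, symbol, sizeof(totals[i].symbol) - 1);
		nr_funcs++;
	}

	totals[i].samples++;
	totals[i].cycles += cycles;
	totals[i].instructions += instructions;
	totals[i].branch_misses += branch_misses;
}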
Without modification, sampling at a typical rate associated with
hotspot profiling (~1ms) leads to poor results. Such an approach gives
a reasonable estimation of where the profiled application is spending
time for relatively low overhead, but the PMU counters cannot easily
be attributed to a single function as the window over which they are
collected is too large. A modern CPU may execute many millions of
instructions across many thousands of functions within a 1ms window.
With this approach, the per-function metrics tend towards some average
value across the top N functions in the profile.
In order to ensure a reasonable likelihood that the counters are
attributed to a single function, the sampling window must be rather
short; typically something in the order of a few hundred cycles works
well, as tested on a range of aarch64 Cortex and Neoverse cores.
As it stands, it is possible to achieve this with perf using a very
high sampling rate (e.g. ~300 cycles), but there are at least three
major concerns with this approach:
* For speculatively executing, out-of-order cores, can the results be
accurately attributed to a given function or the given sample window?
* A short sample window is not guaranteed to cover a single function.
* The overhead of sampling every few hundred cycles is very high and
is highly likely to cause throttling, which is undesirable as it leads
to patchy results; i.e. the profile alternates between periods of
high-frequency samples followed by longer periods of no samples.
This patch does not address the first two points directly. Some means
to address those are discussed in the RFC v2 cover letter. The patch
focuses on addressing the final point, though happily this approach
gives us a way to perform basic filtering on the second point.
The alternating sample period allows us to do two things:
* We can control the risk of throttling and reduce overhead by
alternating between a long and short period. This allows us to
decouple the "periodic" sampling rate (as might be used for hotspot
profiling) from the short sampling window needed for collecting
the PMU counters.
* The sample taken at the end of the long period can be otherwise
discarded (as the PMU data is not useful), but the
PERF_RECORD_CALLCHAIN information can be used to identify the current
function at the start of the short sample window. This is useful
for filtering out samples where the PMU counter data cannot be
attributed to a single function; a sketch of this filtering step is
given below.
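For illustration only, such a post-processing filter might look like
the sketch below. The sample layout and the symbol_of() helper are
hypothetical (a real tool would work from the perf.data records and
symbol tables), and account_sample() is the hypothetical accumulator
sketched earlier:

#include <stdint.h>
#include <string.h>

struct sample {
	uint64_t ip;		/* PERF_SAMPLE_IP                          */
	uint64_t period;	/* PERF_SAMPLE_PERIOD: long or short value */
	uint64_t cycles;
	uint64_t instructions;
	uint64_t branch_misses;
};

/* Hypothetical helpers: symbol resolution and the accumulator sketched above. */
const char *symbol_of(uint64_t ip);
void account_sample(const char *symbol, uint64_t cycles,
		    uint64_t instructions, uint64_t branch_misses);

/*
 * Given the (discarded) long-period sample and the short-window sample
 * that follows it, only account the short window when both resolve to
 * the same symbol, i.e. when the window plausibly sits within a single
 * function.
 */
static void filter_pair(const struct sample *long_sample,
			const struct sample *short_sample)
{
	const char *start = symbol_of(long_sample->ip);
	const char *end = symbol_of(short_sample->ip);

	if (!start || !end || strcmp(start, end))
		return;

	account_sample(end, short_sample->cycles, short_sample->instructions,
		       short_sample->branch_misses);
}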
There are several reasons why it is desirable to reduce the overhead and
risk of throttling:
* PMU counter overflow typically causes an interrupt into the kernel;
this affects program runtime, and can affect things like branch
prediction, cache locality and so on, which can skew the metrics.
* The very high sample rate produces significant amounts of data.
Depending on the configuration of the profiling session and machine,
it is easily possible to produce many orders of magnitude more data,
which is costly for tools to post-process and increases the chance
of data loss. This is especially relevant on larger core count
systems where it is very easy to produce massive recordings.
Whilst the kernel will throttle such a configuration,
which helps to mitigate a large portion of the bandwidth and capture
overhead, it is not something that can be controlled on a per-event
basis, or for non-root users, and because throttling is controlled as
a percentage of time its effects vary from machine to machine. AIUI
throttling may also produce an uneven temporal distribution of
samples. Finally, whilst throttling does a good job of reducing the
overall amount of data produced, it still leads to much larger
captures than with this method; typically we have observed 1-2 orders
of magnitude larger captures.
This patch set modifies perf core to support alternating between two
sample_period values, providing a simple and inexpensive way for tools
to separate out the sample window (time over which events are
counted) from the sample period (time between interesting samples).
It is expected to be used with the cycle counter event, alternating
between a long and short period and subsequently discarding the counter
data for samples with the long period. The combined long and short
period gives the overall sampling period, and the short sample period
gives the sample window. The symbol taken from the sample at the end of
the long period can be used by tools to ensure correct attribution as
described previously. The cycle counter is recommended as it provides
a fair temporal distribution of samples, as would be required for the
per-symbol sample count mentioned previously, and because the PMU can
be programmed to overflow after a sufficiently short window (which may
not be possible with a software timer, for example). This patch set
does not restrict use to the cycle counter; there may be other novel
uses based on different events, or more appropriate counters on other
architectures. This patch set does not modify or otherwise disable the
kernel's existing throttling behaviour; if a configuration is given
that would lead to high CPU usage, then throttling still occurs.
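For reference, opening such an event directly via perf_event_open()
with the uapi addition from this series might look like the following
sketch. It is illustrative only: it assumes the patched
include/uapi/linux/perf_event.h so that hf_sample_period exists, the
chosen sample_type is arbitrary, and the usual ring-buffer mmap and
sample consumption are omitted:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);	/* PERF_ATTR_SIZE_VER9 with patched headers */
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 1000000;	/* overall period (long + short windows) */
	attr.hf_sample_period = 300;	/* short, high-frequency window */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_PERIOD;
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	fd = sys_perf_event_open(&attr, 0 /* this thread */, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* ... enable the event, mmap the ring buffer, consume samples ... */

	close(fd);
	return 0;
}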
To test this, a simple `perf script`-based Python script was developed.
For a limited set of Arm PMU events it will post-process a
`perf record` capture and generate a table of metrics. Alongside this,
a benchmark application was developed that rotates through a sequence
of different classes of behaviour that can be detected by the Arm PMU
(e.g. mispredicts, cache misses, different instruction mixes). The path
through the benchmark can be rotated after each iteration so as to
ensure the results don't land on some lucky harmonic with the sample
period. The script can be used with and without this patch, allowing
comparison of the results. Testing was performed on Juno (A53+A57),
N1SDP, Graviton 2 and 3. In addition, this approach has been applied
to a few of Arm's tools projects and has correctly identified
improvements and regressions.
Headline results from testing indicate that a ~300-cycle sample window
gives good results with or without this patch. Typical output on N1SDP
(Neoverse-N1) for the provided benchmark when run as:
perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
-k CLOCK_MONOTONIC_RAW \
-e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
benchmark 0 1
perf script -s generate-function-metrics.py -- -s discard
looks like this (reformatted for email brevity):
Symbol # CPI BM/KI CM/KI %CM %CY %I %BM %L1DA %L1DM
fp_divider_stalls 6553 4.9 0.0 0.0 0.0 41.8 22.9 0.1 0.6 0.0
int_divider_stalls 4741 3.5 0.0 0.0 1.1 28.3 21.5 0.1 1.9 0.2
isb 3414 20.1 0.2 0.0 0.4 17.6 2.3 0.1 0.8 0.0
branch_mispredicts 1234 1.1 33.0 0.0 0.0 6.1 15.2 99.0 71.6 0.1
double_to_int 694 0.5 0.0 0.0 0.6 3.4 19.1 0.1 1.2 0.1
nops 417 0.3 0.2 0.0 2.8 1.9 18.3 0.6 0.4 0.1
dcache_miss 185 3.6 0.4 184.7 53.8 0.7 0.5 0.0 18.4 99.1
(CPI = Cycles/Instruction, BM/KI = Branch Misses per 1000 Instructions,
CM/KI = Cache Misses per 1000 Instructions, %CM = Percent of Cache
accesses that miss, %CY = Percentage of total cycles, %I = Percentage
of total instructions, %BM = Percentage of total branch mispredicts,
%L1DA = Percentage of total cache accesses, %L1DM = Percentage of total
cache misses)
When the patch is used, the resulting `perf.data` files are typically
between 25-50x smaller than without, and take ~25x less time for the
python script to post-process. For example, running the following:
perf record -i -vvv -e '{cycles/period=1000000/,instructions}:uS' benchmark 0 1
perf record -i -vvv -e '{cycles/period=1000/,instructions}:uS' benchmark 0 1
perf record -i -vvv -e '{cycles/period=300/,instructions}:uS' benchmark 0 1
produces captures on N1SDP (Neoverse-N1) of the following sizes:
* period=1000000: 2.601 MB perf.data (55780 samples), script time = 0m0.362s
* period=1000: 283.749 MB perf.data (6162932 samples), script time = 0m33.100s
* period=300: 304.281 MB perf.data (6614182 samples), script time = 0m35.826s
The "script time" is the user time from running "time perf script -s generate-function-metrics.py"
on the recording. Similar processing times were observed for "time perf report --stdio|cat"
as well.
By comparison, with the patch active:
perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
produces 4.923 MB perf.data (107512 samples), and script time = 0m0.578s.
This is, as expected, ~2x the size and ~2x the number of samples of
the period=1000000 recording. When compared to the period=300
recording, the results from the provided post-processing script are
(within margin of error) the same, but the data file is ~62x smaller.
The same effect is seen for the post-processing script runtime.
Notably, without the patch enabled, L1D cache miss rates are often
higher than with it, which we attribute to the increased impact on the
cache of trapping into the kernel every 300 cycles.
These results are given with `perf_cpu_time_max_percent=25`. When tested
with `perf_cpu_time_max_percent=100` the size and time comparisons are
more significant. Disabling throttling did not lead to obvious
improvements in the collected metrics, suggesting that the sampling
approach is sufficient to collect representative metrics.
Cursory testing on a Xeon(R) W-2145 with a 300 *instruction* sample
window (with and without the patch) suggests this approach might work
for some counters. Using the same test script, it was possible to
identify branch mispredicts correctly. However, whilst the patch is
functionally correct, differences in the architectures may mean that
the approach it enables does not apply as a means to collect
per-function metrics on x86.
Changes since v3:
- Rebased onto v6.15-rc1.
- Refactored to use 'high-frequency period' naming instead of
'alternate period', using interface changes suggested by Peter Zijlstra.
Note:
This introduces a 'sample_period_state' field to hw_perf_event, rather
than making use of the 'state' field. I went this way because there was
no clear ownership of the contents of 'state' - many drivers completely
clear and assign to the field without preserving existing bit flags set
elsewhere.
- eBPF handling in __perf_event_overflow moved to a later point in the function
so that sample period updates continue to happen, even if an eBPF filter is
active.
- hf-period injection now works with frequency-based sampling.
- Changed jitter functionality to use prandom_u32_state instead of
get_random_u32_below.
Changes since v2:
- Rebased onto latest perf-tools-next.
- Reordered patch series so that "Record sample last_period before updating"
applies first.
- Addressed issue reported by LKP tests.
- Addressed review comments from Leo Yan.
Changes since v1:
- Rebased onto perf-tools-next, as per request from Ian Rogers.
- Removed unnecessary code that truncated period_left to 0 and restarted
the PMU.
- Renamed variables to use the shorter 'alt_period' instead of
'alternative_period'.
- Addressed review comments from Leo Yan.
- Added patch #5 that addresses an issue in the x86 and PowerPC drivers that
caused the opposite period to be reported in the sample record.
Changes since RFC v2:
- Rebased on v6.12-rc6.
Changes since RFC v1:
- Rebased on v6.9-rc1.
- Refactored from arm_pmu based extension to core feature
- Added the ability to jitter the sample window based on feedback
from Andi Kleen.
- Modified perf tool to parse the "alt-period" and "alt-period-jitter"
terms in the event specification.
Ben Gainey (4):
perf: Allow periodic events to alternate between two sample periods
perf: Allow adding fixed random jitter to the sampling period
tools/perf: Modify event parser to support hf-period term
tools/perf: Modify event parser to support hf-rand term
Mark Barnett (1):
perf: Record sample last_period before updating
arch/powerpc/perf/core-book3s.c | 3 +-
arch/powerpc/perf/core-fsl-emb.c | 3 +-
arch/x86/events/core.c | 4 +-
arch/x86/events/intel/core.c | 5 +-
arch/x86/events/intel/knc.c | 4 +-
include/linux/perf_event.h | 13 ++-
include/uapi/linux/perf_event.h | 10 ++
kernel/events/core.c | 95 +++++++++++++++++--
tools/include/uapi/linux/perf_event.h | 10 ++
tools/perf/tests/shell/attr/base-record | 4 +-
tools/perf/tests/shell/attr/base-record-spe | 2 +
tools/perf/tests/shell/attr/base-stat | 4 +-
tools/perf/tests/shell/attr/system-wide-dummy | 4 +-
.../tests/shell/attr/test-record-dummy-C0 | 4 +-
.../shell/attr/test-record-hf-period-rand | 13 +++
.../shell/attr/test-record-hf-period-term | 12 +++
tools/perf/tests/shell/lib/attr.py | 2 +
tools/perf/util/evsel.c | 2 +
tools/perf/util/parse-events.c | 30 ++++++
tools/perf/util/parse-events.h | 4 +-
tools/perf/util/parse-events.l | 2 +
tools/perf/util/perf_event_attr_fprintf.c | 2 +
tools/perf/util/pmu.c | 4 +-
23 files changed, 216 insertions(+), 20 deletions(-)
create mode 100644 tools/perf/tests/shell/attr/test-record-hf-period-rand
create mode 100644 tools/perf/tests/shell/attr/test-record-hf-period-term
--
2.43.0
* [PATCH v4 1/5] perf: Record sample last_period before updating
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
@ 2025-04-08 17:15 ` mark.barnett
2025-04-09 11:39 ` Ingo Molnar
2025-04-08 17:15 ` [PATCH v4 2/5] perf: Allow periodic events to alternate between two sample periods mark.barnett
` (5 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: mark.barnett @ 2025-04-08 17:15 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers
Cc: ben.gainey, deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel, Mark Barnett
From: Mark Barnett <mark.barnett@arm.com>
This change alters the PowerPC and x86 driver implementations to record
the last sample period before the event is updated for the next period.
A common pattern in PMU driver implementations is to have a
"*_event_set_period" function which takes care of updating the various
period-related fields in a perf_event structure. In most cases, the
drivers choose to call this function after initializing a sample data
structure with perf_sample_data_init. The x86 and PowerPC drivers
deviate from this, choosing to update the period before initializing the
sample data. When using an event with an alternate sample period, this
causes an incorrect period to be written to the sample data that gets
reported to userspace.
Link: https://lore.kernel.org/r/20240515193610.2350456-4-yabinc@google.com
Signed-off-by: Mark Barnett <mark.barnett@arm.com>
---
arch/powerpc/perf/core-book3s.c | 3 ++-
arch/powerpc/perf/core-fsl-emb.c | 3 ++-
arch/x86/events/core.c | 4 +++-
arch/x86/events/intel/core.c | 5 ++++-
arch/x86/events/intel/knc.c | 4 +++-
5 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index b906d28f74fd..42ff4d167acc 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2239,6 +2239,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
struct pt_regs *regs)
{
u64 period = event->hw.sample_period;
+ const u64 last_period = event->hw.last_period;
s64 prev, delta, left;
int record = 0;
@@ -2320,7 +2321,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
if (record) {
struct perf_sample_data data;
- perf_sample_data_init(&data, ~0ULL, event->hw.last_period);
+ perf_sample_data_init(&data, ~0ULL, last_period);
if (event->attr.sample_type & PERF_SAMPLE_ADDR_TYPE)
perf_get_data_addr(event, regs, &data.addr);
diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
index 1a53ab08447c..d2ffcc7021c5 100644
--- a/arch/powerpc/perf/core-fsl-emb.c
+++ b/arch/powerpc/perf/core-fsl-emb.c
@@ -590,6 +590,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
struct pt_regs *regs)
{
u64 period = event->hw.sample_period;
+ const u64 last_period = event->hw.last_period;
s64 prev, delta, left;
int record = 0;
@@ -632,7 +633,7 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
if (record) {
struct perf_sample_data data;
- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(&data, 0, last_period);
if (perf_event_overflow(event, &data, regs))
fsl_emb_pmu_stop(event, 0);
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6866cc5acb0b..4ccf44943370 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1683,6 +1683,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
struct cpu_hw_events *cpuc;
struct perf_event *event;
int idx, handled = 0;
+ u64 last_period;
u64 val;
cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1702,6 +1703,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
continue;
event = cpuc->events[idx];
+ last_period = event->hw.last_period;
val = static_call(x86_pmu_update)(event);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
@@ -1715,7 +1717,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
if (!static_call(x86_pmu_set_period)(event))
continue;
- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(&data, 0, last_period);
perf_sample_save_brstack(&data, event, &cpuc->lbr_stack, NULL);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 09d2d66c9f21..9c0afdbf9d78 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3132,6 +3132,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
struct perf_event *event = cpuc->events[bit];
+ u64 last_period;
handled++;
@@ -3159,10 +3160,12 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
if (is_pebs_counter_event_group(event))
x86_pmu.drain_pebs(regs, &data);
+ last_period = event->hw.last_period;
+
if (!intel_pmu_save_and_restart(event))
continue;
- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(&data, 0, last_period);
if (has_branch_stack(event))
intel_pmu_lbr_save_brstack(&data, cpuc, event);
diff --git a/arch/x86/events/intel/knc.c b/arch/x86/events/intel/knc.c
index 034a1f6a457c..3e8ec049b46d 100644
--- a/arch/x86/events/intel/knc.c
+++ b/arch/x86/events/intel/knc.c
@@ -241,16 +241,18 @@ static int knc_pmu_handle_irq(struct pt_regs *regs)
for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
struct perf_event *event = cpuc->events[bit];
+ u64 last_period;
handled++;
if (!test_bit(bit, cpuc->active_mask))
continue;
+ last_period = event->hw.last_period;
if (!intel_pmu_save_and_restart(event))
continue;
- perf_sample_data_init(&data, 0, event->hw.last_period);
+ perf_sample_data_init(&data, 0, last_period);
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
--
2.43.0
* [PATCH v4 2/5] perf: Allow periodic events to alternate between two sample periods
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
2025-04-08 17:15 ` [PATCH v4 1/5] perf: Record sample last_period before updating mark.barnett
@ 2025-04-08 17:15 ` mark.barnett
2025-04-08 17:15 ` [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period mark.barnett
` (4 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: mark.barnett @ 2025-04-08 17:15 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers
Cc: ben.gainey, deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel, Mark Barnett
From: Ben Gainey <ben.gainey@arm.com>
This change modifies perf_event_attr to add a second, alternative
sample period field, and modifies the core perf overflow handling
such that, when specified, an event will alternate between two sample
periods.
Currently, perf does not provide a mechanism for decoupling the period
over which counters are counted from the period between samples. This is
problematic for building a tool to measure per-function metrics derived
from a sampled counter group. Ideally such a tool wants a very small
sample window in order to correctly attribute the metrics to a given
function, but prefers a larger sample period that provides representative
coverage without excessive probe effect, triggering throttling, or
generating excessive amounts of data.
By alternating between a long and short sample_period and subsequently
discarding the long samples, tools may decouple the period between
samples that the tool cares about from the window of time over which
interesting counts are collected.
It is expected that typically tools would use this feature with the
cycles or instructions events as an approximation for time, but no
restrictions are applied to which events this can be applied to.
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Mark Barnett <mark.barnett@arm.com>
---
include/linux/perf_event.h | 12 +++++-
include/uapi/linux/perf_event.h | 10 +++++
kernel/events/core.c | 69 +++++++++++++++++++++++++++++----
3 files changed, 82 insertions(+), 9 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 5a9bf15d4461..be006965054e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -229,7 +229,11 @@ struct hw_perf_event {
#define PERF_HES_UPTODATE 0x02 /* event->count up-to-date */
#define PERF_HES_ARCH 0x04
- int state;
+ u32 state;
+
+#define PERF_SPS_HF_ON 0x00000001
+#define PERF_SPS_HF_SAMPLE 0x00000002
+ u32 sample_period_state;
/*
* The last observed hardware counter value, updated with a
@@ -242,6 +246,12 @@ struct hw_perf_event {
*/
u64 sample_period;
+ /*
+ * The original sample_period value before being modified with
+ * a high-frequency sampling window.
+ */
+ u64 sample_period_base;
+
union {
struct { /* Sampling */
/*
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 5fc753c23734..1529f97fb15d 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -379,6 +379,7 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
+#define PERF_ATTR_SIZE_VER9 144 /* add: hf_sample */
/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -533,6 +534,15 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ union {
+ __u64 hf_sample;
+ struct {
+ __u64 hf_sample_period : 32,
+ hf_sample_rand : 4,
+ __reserved_4 : 28;
+ };
+ };
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 128db74e9eab..5752ac7408b1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4195,19 +4195,19 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
period = perf_calculate_period(event, nsec, count);
- delta = (s64)(period - hwc->sample_period);
+ delta = (s64)(period - hwc->sample_period_base);
if (delta >= 0)
delta += 7;
else
delta -= 7;
delta /= 8; /* low pass filter */
- sample_period = hwc->sample_period + delta;
+ sample_period = hwc->sample_period_base + delta;
if (!sample_period)
sample_period = 1;
- hwc->sample_period = sample_period;
+ hwc->sample_period_base = sample_period;
if (local64_read(&hwc->period_left) > 8*sample_period) {
if (disable)
@@ -6179,7 +6179,7 @@ static void __perf_event_period(struct perf_event *event,
event->attr.sample_freq = value;
} else {
event->attr.sample_period = value;
- event->hw.sample_period = value;
+ event->hw.sample_period_base = value;
}
active = (event->state == PERF_EVENT_STATE_ACTIVE);
@@ -10064,7 +10064,7 @@ __perf_event_account_interrupt(struct perf_event *event, int throttle)
}
}
- if (event->attr.freq) {
+ if (event->attr.freq && !(hwc->sample_period_state & PERF_SPS_HF_SAMPLE)) {
u64 now = perf_clock();
s64 delta = now - hwc->freq_time_stamp;
@@ -10197,6 +10197,8 @@ static int __perf_event_overflow(struct perf_event *event,
int throttle, struct perf_sample_data *data,
struct pt_regs *regs)
{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 sample_period;
int events = atomic_read(&event->event_limit);
int ret = 0;
@@ -10212,6 +10214,33 @@ static int __perf_event_overflow(struct perf_event *event,
if (event->attr.aux_pause)
perf_event_aux_pause(event->aux_event, true);
+ sample_period = hwc->sample_period_base;
+
+ /*
+ * High Freq samples are injected inside the larger period:
+ *
+ * |------------|-|------------|-|
+ * P0 HF P1 HF
+ *
+ * By ignoring the HF samples, we measure the actual period.
+ */
+ if (hwc->sample_period_state & PERF_SPS_HF_ON) {
+ u64 hf_sample_period = event->attr.hf_sample_period;
+
+ if (sample_period <= hf_sample_period)
+ goto set_period;
+
+ if (hwc->sample_period_state & PERF_SPS_HF_SAMPLE)
+ sample_period = hf_sample_period;
+ else
+ sample_period -= hf_sample_period;
+
+ hwc->sample_period_state ^= PERF_SPS_HF_SAMPLE;
+ }
+
+set_period:
+ hwc->sample_period = sample_period;
+
if (event->prog && event->prog->type == BPF_PROG_TYPE_PERF_EVENT &&
!bpf_overflow_handler(event, data, regs))
goto out;
@@ -11694,6 +11723,7 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
long freq = event->attr.sample_freq;
event->attr.sample_period = NSEC_PER_SEC / freq;
+ hwc->sample_period_base = event->attr.sample_period;
hwc->sample_period = event->attr.sample_period;
local64_set(&hwc->period_left, hwc->sample_period);
hwc->last_period = hwc->sample_period;
@@ -12675,12 +12705,25 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
pmu = NULL;
hwc = &event->hw;
+ hwc->sample_period_base = attr->sample_period;
hwc->sample_period = attr->sample_period;
- if (attr->freq && attr->sample_freq)
+ if (attr->freq && attr->sample_freq) {
hwc->sample_period = 1;
- hwc->last_period = hwc->sample_period;
+ hwc->sample_period_base = 1;
+ }
- local64_set(&hwc->period_left, hwc->sample_period);
+ /*
+ * If the user requested a high-frequency sample period subtract that
+ * from the first period (the larger one), and set the high-frequency
+ * value to be used next.
+ */
+ u64 first_sample_period = hwc->sample_period;
+ if (attr->hf_sample_period && attr->hf_sample_period < hwc->sample_period) {
+ first_sample_period -= attr->hf_sample_period;
+ hwc->sample_period = attr->hf_sample_period;
+ }
+ hwc->last_period = first_sample_period;
+ local64_set(&hwc->period_left, first_sample_period);
/*
* We do not support PERF_SAMPLE_READ on inherited events unless
@@ -12710,6 +12753,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
return ERR_PTR(err);
}
+ if (attr->hf_sample_period)
+ hwc->sample_period_state |= PERF_SPS_HF_ON;
+
/*
* Disallow uncore-task events. Similarly, disallow uncore-cgroup
* events (they don't make sense as the cgroup will be different
@@ -13131,6 +13177,12 @@ SYSCALL_DEFINE5(perf_event_open,
} else {
if (attr.sample_period & (1ULL << 63))
return -EINVAL;
+ if (attr.hf_sample_period) {
+ if (!attr.sample_period)
+ return -EINVAL;
+ if (attr.hf_sample_period >= attr.sample_period)
+ return -EINVAL;
+ }
}
/* Only privileged users can get physical addresses */
@@ -14054,6 +14106,7 @@ inherit_event(struct perf_event *parent_event,
struct hw_perf_event *hwc = &child_event->hw;
hwc->sample_period = sample_period;
+ hwc->sample_period_base = sample_period;
hwc->last_period = sample_period;
local64_set(&hwc->period_left, sample_period);
--
2.43.0
* [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
2025-04-08 17:15 ` [PATCH v4 1/5] perf: Record sample last_period before updating mark.barnett
2025-04-08 17:15 ` [PATCH v4 2/5] perf: Allow periodic events to alternate between two sample periods mark.barnett
@ 2025-04-08 17:15 ` mark.barnett
2025-04-09 10:54 ` Ingo Molnar
2025-04-09 14:24 ` Peter Zijlstra
2025-04-08 17:15 ` [PATCH v4 4/5] tools/perf: Modify event parser to support hf-period term mark.barnett
` (3 subsequent siblings)
6 siblings, 2 replies; 16+ messages in thread
From: mark.barnett @ 2025-04-08 17:15 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers
Cc: ben.gainey, deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel, Mark Barnett
From: Ben Gainey <ben.gainey@arm.com>
This change modifies the core perf overflow handler, adding some small
random jitter to each sample period when the high-frequency sample
period is in use. A new flag is added to perf_event_attr to opt into
this behaviour.
This change follows the discussion in [1], where it is recognized that it
may be possible for certain patterns of execution to end up with biased
results.
[1] https://lore.kernel.org/linux-perf-users/Zc24eLqZycmIg3d2@tassilo/
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Mark Barnett <mark.barnett@arm.com>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index be006965054e..78a6fd14b412 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -233,6 +233,7 @@ struct hw_perf_event {
#define PERF_SPS_HF_ON 0x00000001
#define PERF_SPS_HF_SAMPLE 0x00000002
+#define PERF_SPS_HF_RAND 0x00000004
u32 sample_period_state;
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5752ac7408b1..bc6991a33048 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -56,6 +56,7 @@
#include <linux/buildid.h>
#include <linux/task_work.h>
#include <linux/percpu-rwsem.h>
+#include <linux/prandom.h>
#include "internal.h"
@@ -472,6 +473,8 @@ static int perf_sample_period_ns __read_mostly = DEFAULT_SAMPLE_PERIOD_NS;
static int perf_sample_allowed_ns __read_mostly =
DEFAULT_SAMPLE_PERIOD_NS * DEFAULT_CPU_TIME_MAX_PERCENT / 100;
+static DEFINE_PER_CPU(struct rnd_state, sample_period_jitter_rnd);
+
static void update_perf_cpu_limits(void)
{
u64 tmp = perf_sample_period_ns;
@@ -10224,6 +10227,19 @@ static int __perf_event_overflow(struct perf_event *event,
*
* By ignoring the HF samples, we measure the actual period.
*/
+
+ /*
+ * Apply optional jitter to the overall sample period
+ */
+ if (hwc->sample_period_state & PERF_SPS_HF_RAND
+ && !(hwc->sample_period_state & PERF_SPS_HF_SAMPLE)) {
+ struct rnd_state *state = &get_cpu_var(sample_period_jitter_rnd);
+ u64 rand_period = 1 << event->attr.hf_sample_rand;
+
+ sample_period -= rand_period / 2;
+ sample_period += prandom_u32_state(state) & (rand_period - 1);
+ }
+
if (hwc->sample_period_state & PERF_SPS_HF_ON) {
u64 hf_sample_period = event->attr.hf_sample_period;
@@ -12756,6 +12772,14 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
if (attr->hf_sample_period)
hwc->sample_period_state |= PERF_SPS_HF_ON;
+ if (attr->hf_sample_rand) {
+ /* high-frequency jitter is only valid with a high-freq period */
+ if (!attr->hf_sample_period)
+ return ERR_PTR(-EINVAL);
+
+ hwc->sample_period_state |= PERF_SPS_HF_RAND;
+ }
+
/*
* Disallow uncore-task events. Similarly, disallow uncore-cgroup
* events (they don't make sense as the cgroup will be different
@@ -14367,6 +14391,7 @@ static void __init perf_event_init_all_cpus(void)
zalloc_cpumask_var(&perf_online_pkg_mask, GFP_KERNEL);
zalloc_cpumask_var(&perf_online_sys_mask, GFP_KERNEL);
+ prandom_seed_full_state(&sample_period_jitter_rnd);
for_each_possible_cpu(cpu) {
swhash = &per_cpu(swevent_htable, cpu);
@@ -14384,6 +14409,7 @@ static void __init perf_event_init_all_cpus(void)
cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
cpuctx->heap = cpuctx->heap_default;
+
}
}
--
2.43.0
* [PATCH v4 4/5] tools/perf: Modify event parser to support hf-period term
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
` (2 preceding siblings ...)
2025-04-08 17:15 ` [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period mark.barnett
@ 2025-04-08 17:15 ` mark.barnett
2025-04-08 17:15 ` [PATCH v4 5/5] tools/perf: Modify event parser to support hf-rand term mark.barnett
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: mark.barnett @ 2025-04-08 17:15 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers
Cc: ben.gainey, deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel, Mark Barnett
From: Ben Gainey <ben.gainey@arm.com>
parse-events is modified, adding the "hf-period" term which can be used
to specify the high-frequency sampling period.
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Mark Barnett <mark.barnett@arm.com>
---
tools/include/uapi/linux/perf_event.h | 10 ++++++++++
tools/perf/tests/shell/attr/base-record | 3 ++-
tools/perf/tests/shell/attr/base-record-spe | 1 +
tools/perf/tests/shell/attr/base-stat | 3 ++-
tools/perf/tests/shell/attr/system-wide-dummy | 3 ++-
tools/perf/tests/shell/attr/test-record-dummy-C0 | 3 ++-
.../tests/shell/attr/test-record-hf-period-term | 12 ++++++++++++
tools/perf/tests/shell/lib/attr.py | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/parse-events.c | 15 +++++++++++++++
tools/perf/util/parse-events.h | 3 ++-
tools/perf/util/parse-events.l | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/pmu.c | 3 ++-
14 files changed, 54 insertions(+), 6 deletions(-)
create mode 100644 tools/perf/tests/shell/attr/test-record-hf-period-term
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 0524d541d4e3..2a6cc543df0e 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -379,6 +379,7 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
+#define PERF_ATTR_SIZE_VER9 144 /* add: hf_sample */
/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -531,6 +532,15 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ union {
+ __u64 hf_sample;
+ struct {
+ __u64 hf_sample_period : 32,
+ hf_sample_rand : 4,
+ __reserved_4 : 28;
+ };
+ };
};
/*
diff --git a/tools/perf/tests/shell/attr/base-record b/tools/perf/tests/shell/attr/base-record
index b44e4e6e4443..8369f505dfb2 100644
--- a/tools/perf/tests/shell/attr/base-record
+++ b/tools/perf/tests/shell/attr/base-record
@@ -5,7 +5,7 @@ group_fd=-1
flags=0|8
cpu=*
type=0|1
-size=136
+size=144
config=0|1
sample_period=*
sample_type=263
@@ -39,3 +39,4 @@ config2=0
branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
+hf_sample_period=0
diff --git a/tools/perf/tests/shell/attr/base-record-spe b/tools/perf/tests/shell/attr/base-record-spe
index 08fa96b59240..2b4f051b6717 100644
--- a/tools/perf/tests/shell/attr/base-record-spe
+++ b/tools/perf/tests/shell/attr/base-record-spe
@@ -38,3 +38,4 @@ config2=*
branch_sample_type=*
sample_regs_user=*
sample_stack_user=*
+hf_sample_period=0
diff --git a/tools/perf/tests/shell/attr/base-stat b/tools/perf/tests/shell/attr/base-stat
index fccd8ec4d1b0..499c44c6216c 100644
--- a/tools/perf/tests/shell/attr/base-stat
+++ b/tools/perf/tests/shell/attr/base-stat
@@ -5,7 +5,7 @@ group_fd=-1
flags=0|8
cpu=*
type=0
-size=136
+size=144
config=0
sample_period=0
sample_type=65536
@@ -39,3 +39,4 @@ config2=0
branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
+hf_sample_period=0
diff --git a/tools/perf/tests/shell/attr/system-wide-dummy b/tools/perf/tests/shell/attr/system-wide-dummy
index a1e1d6a263bf..1dad060d304a 100644
--- a/tools/perf/tests/shell/attr/system-wide-dummy
+++ b/tools/perf/tests/shell/attr/system-wide-dummy
@@ -7,7 +7,7 @@ cpu=*
pid=-1
flags=8
type=1
-size=136
+size=144
config=9
sample_period=1
# PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
@@ -50,3 +50,4 @@ config2=0
branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
+hf_sample_period=0
diff --git a/tools/perf/tests/shell/attr/test-record-dummy-C0 b/tools/perf/tests/shell/attr/test-record-dummy-C0
index 91499405fff4..18f0e3766389 100644
--- a/tools/perf/tests/shell/attr/test-record-dummy-C0
+++ b/tools/perf/tests/shell/attr/test-record-dummy-C0
@@ -10,7 +10,7 @@ cpu=0
pid=-1
flags=8
type=1
-size=136
+size=144
config=9
sample_period=4000
# PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
@@ -53,3 +53,4 @@ config2=0
branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
+hf_sample_period=0
diff --git a/tools/perf/tests/shell/attr/test-record-hf-period-term b/tools/perf/tests/shell/attr/test-record-hf-period-term
new file mode 100644
index 000000000000..539b9946c8cc
--- /dev/null
+++ b/tools/perf/tests/shell/attr/test-record-hf-period-term
@@ -0,0 +1,12 @@
+[config]
+command = record
+args = --no-bpf-event -e cycles/period=3,hf-period=2/ -- kill >/dev/null 2>&1
+ret = 1
+kernel_since = 6.15
+
+[event-10:base-record]
+sample_period=3
+hf_sample_period=2
+
+freq=0
+sample_type=7
diff --git a/tools/perf/tests/shell/lib/attr.py b/tools/perf/tests/shell/lib/attr.py
index bfccc727d9b2..80c99758bd86 100644
--- a/tools/perf/tests/shell/lib/attr.py
+++ b/tools/perf/tests/shell/lib/attr.py
@@ -85,6 +85,7 @@ class Event(dict):
'branch_sample_type',
'sample_regs_user',
'sample_stack_user',
+ 'hf_sample_period',
]
def add(self, data):
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1974395492d7..6e8eb34ef957 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -180,6 +180,7 @@ static int store_event(struct perf_event_attr *attr, pid_t pid, struct perf_cpu
WRITE_ASS(branch_sample_type, "llu");
WRITE_ASS(sample_regs_user, "llu");
WRITE_ASS(sample_stack_user, PRIu32);
+ WRITE_ASS(hf_sample_period, PRIu32);
fclose(file);
return 0;
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 5152fd5a6ead..c0943eb7f171 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -805,6 +805,7 @@ const char *parse_events__term_type_str(enum parse_events__term_type term_type)
[PARSE_EVENTS__TERM_TYPE_RAW] = "raw",
[PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE] = "legacy-cache",
[PARSE_EVENTS__TERM_TYPE_HARDWARE] = "hardware",
+ [PARSE_EVENTS__TERM_TYPE_HF_PERIOD] = "hf-period",
};
if ((unsigned int)term_type >= __PARSE_EVENTS__TERM_TYPE_NR)
return "unknown term";
@@ -833,6 +834,7 @@ config_term_avail(enum parse_events__term_type term_type, struct parse_events_er
case PARSE_EVENTS__TERM_TYPE_NAME:
case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
+ case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
case PARSE_EVENTS__TERM_TYPE_PERCORE:
return true;
case PARSE_EVENTS__TERM_TYPE_USER:
@@ -981,6 +983,16 @@ do { \
return -EINVAL;
}
break;
+ case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
+ CHECK_TYPE_VAL(NUM);
+ if (term->val.num == 0) {
+ parse_events_error__handle(err, term->err_val,
+ strdup("expected a non-zero value"),
+ NULL);
+ return -EINVAL;
+ }
+ attr->hf_sample_period = term->val.num;
+ break;
case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
case PARSE_EVENTS__TERM_TYPE_USER:
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
@@ -1108,6 +1120,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
case PARSE_EVENTS__TERM_TYPE_RAW:
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
case PARSE_EVENTS__TERM_TYPE_HARDWARE:
+ case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
default:
if (err) {
parse_events_error__handle(err, term->err_term,
@@ -1242,6 +1255,7 @@ do { \
case PARSE_EVENTS__TERM_TYPE_RAW:
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
case PARSE_EVENTS__TERM_TYPE_HARDWARE:
+ case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
default:
break;
}
@@ -1296,6 +1310,7 @@ static int get_config_chgs(struct perf_pmu *pmu, struct parse_events_terms *head
case PARSE_EVENTS__TERM_TYPE_RAW:
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
case PARSE_EVENTS__TERM_TYPE_HARDWARE:
+ case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
default:
break;
}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index e176a34ab088..a6c4f81d5989 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -80,7 +80,8 @@ enum parse_events__term_type {
PARSE_EVENTS__TERM_TYPE_RAW,
PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE,
PARSE_EVENTS__TERM_TYPE_HARDWARE,
-#define __PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_HARDWARE + 1)
+ PARSE_EVENTS__TERM_TYPE_HF_PERIOD,
+#define __PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_HF_PERIOD + 1)
};
struct parse_events_term {
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 7ed86e3e34e3..482321ace228 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -335,6 +335,7 @@ aux-output { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_OUTPUT); }
aux-action { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_ACTION); }
aux-sample-size { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
metric-id { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
+hf-period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_HF_PERIOD); }
cpu-cycles|cycles { return hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
stalled-cycles-frontend|idle-cycles-frontend { return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND); }
stalled-cycles-backend|idle-cycles-backend { return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_BACKEND); }
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index 66b666d9ce64..c93904a299af 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -360,6 +360,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(aux_start_paused, p_unsigned);
PRINT_ATTRf(aux_pause, p_unsigned);
PRINT_ATTRf(aux_resume, p_unsigned);
+ PRINT_ATTRf(hf_sample_period, p_unsigned);
return ret;
}
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index b7ebac5ab1d1..f90c59e29371 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1429,7 +1429,7 @@ static int pmu_config_term(const struct perf_pmu *pmu,
break;
case PARSE_EVENTS__TERM_TYPE_USER: /* Not hardcoded. */
return -EINVAL;
- case PARSE_EVENTS__TERM_TYPE_NAME ... PARSE_EVENTS__TERM_TYPE_HARDWARE:
+ case PARSE_EVENTS__TERM_TYPE_NAME ... PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
/* Skip non-config terms. */
break;
default:
@@ -1804,6 +1804,7 @@ int perf_pmu__for_each_format(struct perf_pmu *pmu, void *state, pmu_format_call
"aux-output",
"aux-action=(pause|resume|start-paused)",
"aux-sample-size=number",
+ "hf-period=number",
};
struct perf_pmu_format *format;
int ret;
--
2.43.0
* [PATCH v4 5/5] tools/perf: Modify event parser to support hf-rand term
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
` (3 preceding siblings ...)
2025-04-08 17:15 ` [PATCH v4 4/5] tools/perf: Modify event parser to support hf-period term mark.barnett
@ 2025-04-08 17:15 ` mark.barnett
2025-04-09 11:38 ` [PATCH v4 0/5] A mechanism for efficient support for per-function metrics Ingo Molnar
2025-04-09 14:29 ` Peter Zijlstra
6 siblings, 0 replies; 16+ messages in thread
From: mark.barnett @ 2025-04-08 17:15 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers
Cc: ben.gainey, deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel, Mark Barnett
From: Ben Gainey <ben.gainey@arm.com>
parse-events is modified, adding the "hf-rand" term which can be used to
enable random jitter of the sample period.
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Mark Barnett <mark.barnett@arm.com>
---
tools/perf/tests/shell/attr/base-record | 1 +
tools/perf/tests/shell/attr/base-record-spe | 1 +
tools/perf/tests/shell/attr/base-stat | 1 +
tools/perf/tests/shell/attr/system-wide-dummy | 1 +
tools/perf/tests/shell/attr/test-record-dummy-C0 | 1 +
.../tests/shell/attr/test-record-hf-period-rand | 13 +++++++++++++
tools/perf/tests/shell/lib/attr.py | 1 +
tools/perf/util/evsel.c | 1 +
tools/perf/util/parse-events.c | 15 +++++++++++++++
tools/perf/util/parse-events.h | 3 ++-
tools/perf/util/parse-events.l | 1 +
tools/perf/util/perf_event_attr_fprintf.c | 1 +
tools/perf/util/pmu.c | 3 ++-
13 files changed, 41 insertions(+), 2 deletions(-)
create mode 100644 tools/perf/tests/shell/attr/test-record-hf-period-rand
diff --git a/tools/perf/tests/shell/attr/base-record b/tools/perf/tests/shell/attr/base-record
index 8369f505dfb2..f51a20abde2e 100644
--- a/tools/perf/tests/shell/attr/base-record
+++ b/tools/perf/tests/shell/attr/base-record
@@ -40,3 +40,4 @@ branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
hf_sample_period=0
+hf_sample_rand=0
diff --git a/tools/perf/tests/shell/attr/base-record-spe b/tools/perf/tests/shell/attr/base-record-spe
index 2b4f051b6717..d8695abeb2b2 100644
--- a/tools/perf/tests/shell/attr/base-record-spe
+++ b/tools/perf/tests/shell/attr/base-record-spe
@@ -39,3 +39,4 @@ branch_sample_type=*
sample_regs_user=*
sample_stack_user=*
hf_sample_period=0
+hf_sample_rand=0
diff --git a/tools/perf/tests/shell/attr/base-stat b/tools/perf/tests/shell/attr/base-stat
index 499c44c6216c..11df6cb36b9a 100644
--- a/tools/perf/tests/shell/attr/base-stat
+++ b/tools/perf/tests/shell/attr/base-stat
@@ -40,3 +40,4 @@ branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
hf_sample_period=0
+hf_sample_rand=0
diff --git a/tools/perf/tests/shell/attr/system-wide-dummy b/tools/perf/tests/shell/attr/system-wide-dummy
index 1dad060d304a..0439c81c5895 100644
--- a/tools/perf/tests/shell/attr/system-wide-dummy
+++ b/tools/perf/tests/shell/attr/system-wide-dummy
@@ -51,3 +51,4 @@ branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
hf_sample_period=0
+hf_sample_rand=0
diff --git a/tools/perf/tests/shell/attr/test-record-dummy-C0 b/tools/perf/tests/shell/attr/test-record-dummy-C0
index 18f0e3766389..d9eecbd5d246 100644
--- a/tools/perf/tests/shell/attr/test-record-dummy-C0
+++ b/tools/perf/tests/shell/attr/test-record-dummy-C0
@@ -54,3 +54,4 @@ branch_sample_type=0
sample_regs_user=0
sample_stack_user=0
hf_sample_period=0
+hf_sample_rand=0
diff --git a/tools/perf/tests/shell/attr/test-record-hf-period-rand b/tools/perf/tests/shell/attr/test-record-hf-period-rand
new file mode 100644
index 000000000000..517ff6cbc0e9
--- /dev/null
+++ b/tools/perf/tests/shell/attr/test-record-hf-period-rand
@@ -0,0 +1,13 @@
+[config]
+command = record
+args = --no-bpf-event -e cycles/period=3,hf-period=2,hf-rand=7/ -- kill >/dev/null 2>&1
+ret = 1
+kernel_since = 6.15
+
+[event-10:base-record]
+sample_period=3
+hf_sample_period=2
+hf_sample_rand=7
+
+freq=0
+sample_type=7
diff --git a/tools/perf/tests/shell/lib/attr.py b/tools/perf/tests/shell/lib/attr.py
index 80c99758bd86..a9600997a9ea 100644
--- a/tools/perf/tests/shell/lib/attr.py
+++ b/tools/perf/tests/shell/lib/attr.py
@@ -86,6 +86,7 @@ class Event(dict):
'sample_regs_user',
'sample_stack_user',
'hf_sample_period',
+ 'hf_sample_rand',
]
def add(self, data):
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 6e8eb34ef957..b0a54d64f616 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -181,6 +181,7 @@ static int store_event(struct perf_event_attr *attr, pid_t pid, struct perf_cpu
WRITE_ASS(sample_regs_user, "llu");
WRITE_ASS(sample_stack_user, PRIu32);
WRITE_ASS(hf_sample_period, PRIu32);
+ WRITE_ASS(hf_sample_rand, "d");
fclose(file);
return 0;
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index c0943eb7f171..91e5a0d07d58 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -806,6 +806,7 @@ const char *parse_events__term_type_str(enum parse_events__term_type term_type)
[PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE] = "legacy-cache",
[PARSE_EVENTS__TERM_TYPE_HARDWARE] = "hardware",
[PARSE_EVENTS__TERM_TYPE_HF_PERIOD] = "hf-period",
+ [PARSE_EVENTS__TERM_TYPE_HF_RAND] = "hf-rand",
};
if ((unsigned int)term_type >= __PARSE_EVENTS__TERM_TYPE_NR)
return "unknown term";
@@ -835,6 +836,7 @@ config_term_avail(enum parse_events__term_type term_type, struct parse_events_er
case PARSE_EVENTS__TERM_TYPE_METRIC_ID:
case PARSE_EVENTS__TERM_TYPE_SAMPLE_PERIOD:
case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
+ case PARSE_EVENTS__TERM_TYPE_HF_RAND:
case PARSE_EVENTS__TERM_TYPE_PERCORE:
return true;
case PARSE_EVENTS__TERM_TYPE_USER:
@@ -993,6 +995,16 @@ do { \
}
attr->hf_sample_period = term->val.num;
break;
+ case PARSE_EVENTS__TERM_TYPE_HF_RAND:
+ CHECK_TYPE_VAL(NUM);
+ if ((unsigned int)term->val.num > 15) {
+ parse_events_error__handle(err, term->err_val,
+ strdup("expected a value between 0-15"),
+ NULL);
+ return -EINVAL;
+ }
+ attr->hf_sample_rand = (unsigned int)term->val.num;
+ break;
case PARSE_EVENTS__TERM_TYPE_DRV_CFG:
case PARSE_EVENTS__TERM_TYPE_USER:
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
@@ -1121,6 +1133,7 @@ static int config_term_tracepoint(struct perf_event_attr *attr,
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
case PARSE_EVENTS__TERM_TYPE_HARDWARE:
case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
+ case PARSE_EVENTS__TERM_TYPE_HF_RAND:
default:
if (err) {
parse_events_error__handle(err, term->err_term,
@@ -1256,6 +1269,7 @@ do { \
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
case PARSE_EVENTS__TERM_TYPE_HARDWARE:
case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
+ case PARSE_EVENTS__TERM_TYPE_HF_RAND:
default:
break;
}
@@ -1311,6 +1325,7 @@ static int get_config_chgs(struct perf_pmu *pmu, struct parse_events_terms *head
case PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE:
case PARSE_EVENTS__TERM_TYPE_HARDWARE:
case PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
+ case PARSE_EVENTS__TERM_TYPE_HF_RAND:
default:
break;
}
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index a6c4f81d5989..4c2e950dcf81 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -81,7 +81,8 @@ enum parse_events__term_type {
PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE,
PARSE_EVENTS__TERM_TYPE_HARDWARE,
PARSE_EVENTS__TERM_TYPE_HF_PERIOD,
-#define __PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_HF_PERIOD + 1)
+ PARSE_EVENTS__TERM_TYPE_HF_RAND,
+#define __PARSE_EVENTS__TERM_TYPE_NR (PARSE_EVENTS__TERM_TYPE_HF_RAND + 1)
};
struct parse_events_term {
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index 482321ace228..b60b5e796d3a 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -336,6 +336,7 @@ aux-action { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_ACTION); }
aux-sample-size { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_AUX_SAMPLE_SIZE); }
metric-id { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_METRIC_ID); }
hf-period { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_HF_PERIOD); }
+hf-rand { return term(yyscanner, PARSE_EVENTS__TERM_TYPE_HF_RAND); }
cpu-cycles|cycles { return hw_term(yyscanner, PERF_COUNT_HW_CPU_CYCLES); }
stalled-cycles-frontend|idle-cycles-frontend { return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_FRONTEND); }
stalled-cycles-backend|idle-cycles-backend { return hw_term(yyscanner, PERF_COUNT_HW_STALLED_CYCLES_BACKEND); }
diff --git a/tools/perf/util/perf_event_attr_fprintf.c b/tools/perf/util/perf_event_attr_fprintf.c
index c93904a299af..3d769a10304b 100644
--- a/tools/perf/util/perf_event_attr_fprintf.c
+++ b/tools/perf/util/perf_event_attr_fprintf.c
@@ -361,6 +361,7 @@ int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
PRINT_ATTRf(aux_pause, p_unsigned);
PRINT_ATTRf(aux_resume, p_unsigned);
PRINT_ATTRf(hf_sample_period, p_unsigned);
+ PRINT_ATTRf(hf_sample_rand, p_unsigned);
return ret;
}
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index f90c59e29371..f735216e88c1 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -1429,7 +1429,7 @@ static int pmu_config_term(const struct perf_pmu *pmu,
break;
case PARSE_EVENTS__TERM_TYPE_USER: /* Not hardcoded. */
return -EINVAL;
- case PARSE_EVENTS__TERM_TYPE_NAME ... PARSE_EVENTS__TERM_TYPE_HF_PERIOD:
+ case PARSE_EVENTS__TERM_TYPE_NAME ... PARSE_EVENTS__TERM_TYPE_HF_RAND:
/* Skip non-config terms. */
break;
default:
@@ -1805,6 +1805,7 @@ int perf_pmu__for_each_format(struct perf_pmu *pmu, void *state, pmu_format_call
"aux-action=(pause|resume|start-paused)",
"aux-sample-size=number",
"hf-period=number",
+	"hf-rand=number",
};
struct perf_pmu_format *format;
int ret;
--
2.43.0
* Re: [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period
2025-04-08 17:15 ` [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period mark.barnett
@ 2025-04-09 10:54 ` Ingo Molnar
2025-04-09 14:24 ` Peter Zijlstra
1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2025-04-09 10:54 UTC (permalink / raw)
To: mark.barnett
Cc: peterz, mingo, acme, namhyung, irogers, ben.gainey, deepak.surti,
ak, will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
* mark.barnett@arm.com <mark.barnett@arm.com> wrote:
> @@ -14384,6 +14409,7 @@ static void __init perf_event_init_all_cpus(void)
> cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
> cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
> cpuctx->heap = cpuctx->heap_default;
> +
> }
> }
A stray newline snuck up on you here I think. ;-)
Ingo
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
` (4 preceding siblings ...)
2025-04-08 17:15 ` [PATCH v4 5/5] tools/perf: Modify event parser to support hf-rand term mark.barnett
@ 2025-04-09 11:38 ` Ingo Molnar
2025-04-11 11:07 ` Mark Barnett
2025-04-09 14:29 ` Peter Zijlstra
6 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2025-04-09 11:38 UTC (permalink / raw)
To: mark.barnett
Cc: peterz, mingo, acme, namhyung, irogers, ben.gainey, deepak.surti,
ak, will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
* mark.barnett@arm.com <mark.barnett@arm.com> wrote:
> From: Mark Barnett <mark.barnett@arm.com>
>
> This patch introduces the concept of an alternating sample rate to perf
> core and provides the necessary basic changes in the tools to activate
> that option.
>
> The primary use case for this change is to be able to enable collecting
> per-function performance metrics using the Arm PMU, as per the following
> approach:
>
> * Starting with a simple periodic sampling (hotspot) profile,
> augment each sample with PMU counters accumulated over a short window
> up to the point the sample was taken.
> * For each sample, perform some filtering to improve attribution of
> the accumulated PMU counters (ensure they are attributed to a single
> function)
> * For each function accumulate a total for each PMU counter so that
> metrics may be derived.
>
> Without modification, and sampling at a typical rate associated
> with hotspot profiling (~1mS) leads to poor results. Such an
> approach gives you a reasonable estimation of where the profiled
> application is spending time for relatively low overhead, but the
> PMU counters cannot easily be attributed to a single function as the
> window over which they are collected is too large. A modern CPU may
> execute many millions of instructions over many thousands of functions
> within 1mS window. With this approach, the per-function metrics tend
> to trend to some average value across the top N functions in the
> profile.
>
> In order to ensure a reasonable likelihood that the counters are
> attributed to a single function, the sampling window must be rather
> short; typically something in the order of a few hundred cycles proves
> well as tested on a range of aarch64 Cortex and Neoverse cores.
>
> As it stands, it is possible to achieve this with perf using a very high
> sampling rate (e.g ~300cy), but there are at least three major concerns
> with this approach:
>
> * For speculatively executing, out of order cores, can the results be
> accurately attributed to a give function or the given sample window?
> * A short sample window is not guaranteed to cover a single function.
> * The overhead of sampling every few hundred cycles is very high and
> is highly likely to cause throttling which is undesirable as it leads
> to patchy results; i.e. the profile alternates between periods of
> high frequency samples followed by longer periods of no samples.
>
> This patch does not address the first two points directly. Some means
> to address those are discussed on the RFC v2 cover letter. The patch
> focuses on addressing the final point, though happily this approach
> gives us a way to perform basic filtering on the second point.
>
> The alternating sample period allows us to do two things:
>
> * We can control the risk of throttling and reduce overhead by
> alternating between a long and short period. This allows us to
> decouple the "periodic" sampling rate (as might be used for hotspot
> profiling) from the short sampling window needed for collecting
> the PMU counters.
> * The sample taken at the end of the long period can be otherwise
> discarded (as the PMU data is not useful), but the
> PERF_RECORD_CALLCHAIN information can be used to identify the current
> function at the start of the short sample window. This is useful
> for filtering samples where the PMU counter data cannot be attributed
> to a single function.
>
> There are several reasons why it is desirable to reduce the overhead and
> risk of throttling:
>
> * PMU counter overflow typically causes an interrupt into the kernel;
> this affects program runtime, and can affect things like branch
> prediction, cache locality and so on which can skew the metrics.
> * The very high sample rate produces significant amounts of data.
> Depending on the configuration of the profiling session and machine,
> it is easily possible to produce many orders of magnitude more data
> which is costly for tools to post-process and increases the chance
> of data loss. This is especially relevant on larger core count
> systems where it is very easy to produce massive recordings.
> Whilst the kernel will throttle such a configuration,
> which helps to mitigate a large portion of the bandwidth and capture
> overhead, it is not something that can be controlled for on a per
> event basis, or for non-root users, and because throttling is
> controlled as a percentage of time its affects vary from machine to
> machine. AIUI throttling may also produce an uneven temporal
> distribution of samples. Finally, whilst throttling does a good job
> at reducing the overall amount of data produced, it still leads to
> much larger captures than with this method; typically we have
> observed 1-2 orders of magnitude larger captures.
>
> This patch set modifies perf core to support alternating between two
> sample_period values, providing a simple and inexpensive way for tools
> to separate out the sample window (time over which events are
> counted) from the sample period (time between interesting samples).
Upstreaming path:
=================
So, while this looks interesting and it might work, a big problem as I
see it is to get tools to use it: the usual kernel feature catch-22.
So I think a hard precondition for an upstream merge would be for the
usage of this new ABI to be built into 'perf top/record' and used by
default, so the kernel side code gets tested and verified - and our
default profiling output would improve rather substantially as well.
ABI details:
============
I'd propose a couple of common-sense extensions to the ABI:
1)
I think a better approach would be to also batch the short periods,
i.e. instead of interleaved long-short periods:
L S L S L
we'd support batches of short periods:
L SSSS L SSSS L SSSS L SSSS
As long as the long periods are 'long enough', throttling wouldn't
(or, at least, shouldn't) trigger. (If throttling triggers, it's the
throttling code that needs to be fixed.)
This means that your proposed ABI would also require an additional
parameter: [long,short,batch-count]. Your current proposal is basically
[long,short,1].
Advantages of batching the short periods (let's coin it
'burst-profiling'?) would be:
- Performance: the caching of the profiling machinery, which would
reduce the per-short-sample overhead rather substantially I believe.
With your current approach we bring all that code into CPU caches
and use it 1-2 times for a single data record, which is kind of a
waste.
- Data quality: batching increases the effective data rate of
'relevant' short samples, with very little overall performance
impact. By tuning the long-period and the batch length the overall
tradeoff between profiling overhead and amount of data extracted can
be finetuned pretty well IMHO. (Tools might even opt to discard the
first 'short' sample to decouple it from the first cache-cold
execution of the perf machinery.)
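A minimal sketch of that schedule (hypothetical helper, not code from
this series; [period_long, period_short, batch_count] are the suggested
parameters above):
#include <stdint.h>
/*
 * Pick the next sample period for an L SSSS L SSSS ... schedule.
 * Starting with *burst_left == 0, each long window arms batch_count
 * short windows, which are then consumed one per call.
 */
static uint64_t next_period(uint64_t period_long, uint64_t period_short,
			    unsigned int batch_count,
			    unsigned int *burst_left)
{
	if (*burst_left) {
		(*burst_left)--;	/* still inside a burst of short windows */
		return period_short;
	}
	*burst_left = batch_count;	/* arm a burst after this long window */
	return period_long;
}
With batch_count == 4 this yields L SSSS L SSSS ..., matching the
diagram above.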
2)
I agree with the random-jitter approach as well, to remove short-period
sampling artifacts that may arise from the period length resonating
with the execution time of key code sequences, especially when the
sampling period is a 2-3 digit integer, but maybe it should be
expressed in terms of a generic period length, not as a random 4-bit
parameter overlaid on another parameter.
Ie. the ABI would be something like:
[period_long, period_short, period_jitter, batch_count]
I see no reason why the random jitter necessarily has to be limited to
4 bits, and it could apply to the 'long' periods as well. Obviously this
all complicates the math on the tooling side a bit. ;-)
If data size is a concern: there's no real need to save space all that
much on the perf_attr ABI side: it's a setup/configuration structure,
not a per sample field where every bit counts.
Implementation:
===============
Instead of making it an entirely different mode, we could allow
period_long to be zero, and map regular periodic events to
[0,period_short,0,1], or so? But only if that simplifies/unifies the
code.
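As a rough illustration only (the field names here are assumptions; the
series as posted adds hf_sample_period plus a 4-bit hf_sample_rand
exponent to perf_event_attr rather than this struct):
#include <stdint.h>
/* Hypothetical grouping of the four parameters discussed above. */
struct alt_period_config {
	uint64_t period_long;	/* long period between bursts; 0 = plain periodic */
	uint64_t period_short;	/* length of each short sample window */
	uint64_t period_jitter;	/* flat jitter range, usable on either period */
	uint64_t batch_count;	/* short windows per long period */
};
A regular periodic event would then be the [0, period_short, 0, 1]
mapping mentioned above.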
Summary:
========
Anyway, would something like this work for you? I think the most
important aspect is to demonstrate working tooling side. Good thing
we have tools/perf/ in-tree for exactly such purposes. ;-)
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 1/5] perf: Record sample last_period before updating
2025-04-08 17:15 ` [PATCH v4 1/5] perf: Record sample last_period before updating mark.barnett
@ 2025-04-09 11:39 ` Ingo Molnar
0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2025-04-09 11:39 UTC (permalink / raw)
To: mark.barnett
Cc: peterz, mingo, acme, namhyung, irogers, ben.gainey, deepak.surti,
ak, will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
* mark.barnett@arm.com <mark.barnett@arm.com> wrote:
> From: Mark Barnett <mark.barnett@arm.com>
>
> This change alters the PowerPC and x86 driver implementations to record
> the last sample period before the event is updated for the next period.
>
> A common pattern in PMU driver implementations is to have a
> "*_event_set_period" function which takes care of updating the various
> period-related fields in a perf_event structure. In most cases, the
> drivers choose to call this function after initializing a sample data
> structure with perf_sample_data_init. The x86 and PowerPC drivers
> deviate from this, choosing to update the period before initializing the
> sample data. When using an event with an alternate sample period, this
> causes an incorrect period to be written to the sample data that gets
> reported to userspace.
>
> Link: https://lore.kernel.org/r/20240515193610.2350456-4-yabinc@google.com
> Signed-off-by: Mark Barnett <mark.barnett@arm.com>
> ---
> arch/powerpc/perf/core-book3s.c | 3 ++-
> arch/powerpc/perf/core-fsl-emb.c | 3 ++-
> arch/x86/events/core.c | 4 +++-
> arch/x86/events/intel/core.c | 5 ++++-
> arch/x86/events/intel/knc.c | 4 +++-
> 5 files changed, 14 insertions(+), 5 deletions(-)
I've picked up this patch into tip:perf/core, because I think it makes
sense standalone as well.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period
2025-04-08 17:15 ` [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period mark.barnett
2025-04-09 10:54 ` Ingo Molnar
@ 2025-04-09 14:24 ` Peter Zijlstra
1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-04-09 14:24 UTC (permalink / raw)
To: mark.barnett
Cc: mingo, acme, namhyung, irogers, ben.gainey, deepak.surti, ak,
will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
On Tue, Apr 08, 2025 at 06:15:28PM +0100, mark.barnett@arm.com wrote:
> @@ -10224,6 +10227,19 @@ static int __perf_event_overflow(struct perf_event *event,
> *
> * By ignoring the HF samples, we measure the actual period.
> */
> +
> + /*
> + * Apply optional jitter to the overall sample period
> + */
> + if (hwc->sample_period_state & PERF_SPS_HF_RAND
> + && !(hwc->sample_period_state & PERF_SPS_HF_SAMPLE)) {
Coding style nit: when breaking lines, the operator goes on the end of
the preceding line.
> + struct rnd_state *state = &get_cpu_var(sample_period_jitter_rnd);
> + u64 rand_period = 1 << event->attr.hf_sample_rand;
> +
> + sample_period -= rand_period / 2;
> + sample_period += prandom_u32_state(state) & (rand_period - 1);
> + }
> +
> if (hwc->sample_period_state & PERF_SPS_HF_ON) {
> u64 hf_sample_period = event->attr.hf_sample_period;
>
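For reference, the same arithmetic as a standalone userspace sketch
(rand() stands in for the per-CPU prandom_u32_state() above, and the
operator placement follows the style nit):
#include <stdint.h>
#include <stdlib.h>
/*
 * hf_sample_rand is a base-2 exponent: shift the period down by half of
 * 2^n, then add a random value in [0, 2^n), i.e. jitter the period by
 * roughly +/- 2^(n-1).
 */
static uint64_t jitter_period(uint64_t sample_period,
			      unsigned int hf_sample_rand)
{
	uint64_t rand_period = 1ULL << hf_sample_rand;

	sample_period -= rand_period / 2;
	sample_period += (uint64_t)rand() & (rand_period - 1);
	return sample_period;
}
With hf_sample_rand == 4, a 300-cycle short period becomes a value in
[292, 307].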
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
` (5 preceding siblings ...)
2025-04-09 11:38 ` [PATCH v4 0/5] A mechanism for efficient support for per-function metrics Ingo Molnar
@ 2025-04-09 14:29 ` Peter Zijlstra
2025-04-11 9:44 ` Mark Barnett
6 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2025-04-09 14:29 UTC (permalink / raw)
To: mark.barnett
Cc: mingo, acme, namhyung, irogers, ben.gainey, deepak.surti, ak,
will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
On Tue, Apr 08, 2025 at 06:15:25PM +0100, mark.barnett@arm.com wrote:
> perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
> -k CLOCK_MONOTONIC_RAW \
> -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
> benchmark 0 1
> perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
Should be updated to read something like:
cycles/period=1000000,hf-period=300/
right?
Also, cycles/freq=1000,hf-period=300/ should now also work, right?
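(Illustratively, combining both new terms -- spellings taken from the
hf-period / hf-period-rand format strings added by the tooling patches,
with the hf-period-rand value assumed to be the base-2 exponent:
  perf record -e 'cycles/period=1000000,hf-period=300,hf-period-rand=4/' benchmark 0 1
The exact accepted syntax is whatever the event parser changes define.)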
Anyway, the kernel bits look good to me now (with the nits fixed), so:
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-09 14:29 ` Peter Zijlstra
@ 2025-04-11 9:44 ` Mark Barnett
0 siblings, 0 replies; 16+ messages in thread
From: Mark Barnett @ 2025-04-11 9:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, irogers, ben.gainey, deepak.surti, ak,
will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
On 4/9/25 15:29, Peter Zijlstra wrote:
> On Tue, Apr 08, 2025 at 06:15:25PM +0100, mark.barnett@arm.com wrote:
>
>> perf record -T --sample-cpu --call-graph fp,4 --user-callchains \
>> -k CLOCK_MONOTONIC_RAW \
>> -e '{cycles/period=999700,alt-period=300/,instructions,branch-misses,cache-references,cache-misses}:uS' \
>> benchmark 0 1
>
>> perf record -i -vvv -e '{cycles/period=999700,alt-period=300/,instructions}:uS' benchmark 0 1
>
> Should be updated to read something like:
>
> cycles/period=1000000,hf-period=300/
>
> right?
>
> Also, cycles/freq=1000,hf-period=300/ should now also work, right?
>
> Anyway, the kernel bits look good to me now (with the nits fixed), so:
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Yes, freq works. I'll update the cover letter and address the nits in
the next submission.
Thanks!
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-09 11:38 ` [PATCH v4 0/5] A mechanism for efficient support for per-function metrics Ingo Molnar
@ 2025-04-11 11:07 ` Mark Barnett
2025-04-11 17:34 ` Ian Rogers
2025-04-12 20:38 ` Ingo Molnar
0 siblings, 2 replies; 16+ messages in thread
From: Mark Barnett @ 2025-04-11 11:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: peterz, mingo, acme, namhyung, irogers, ben.gainey, deepak.surti,
ak, will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
On 4/9/25 12:38, Ingo Molnar wrote:
> [...]
Thanks, Ingo, for the detailed notes. Your feedback is very much
appreciated.
Tool Integration
==================
We've been using a python script to process the data into a report. We
can look at implementing this directly in perf report, if that is
required. However, I'm nervous about making the new feature the default
behaviour for the tool.
This feature has been integrated into our tools [1] for the last 12
months, and has received a lot of testing on Arm Neoverse hardware.
Other platforms have received less rigorous testing. In my opinion, more
work would be needed to validate the PMU hardware & software
characteristics of other architectures before this can be made the default.
Burst Sampling
================
I like the burst sampling idea. Increased I-Cache pressure is an
inherent weakness of this sampling method, and this would help to
alleviate that somewhat. I'll add this in the next spin.
Period Jitter
===============
Yes, we can apply this to both periods. I will make that change.
I'm not sure I've fully understood your suggestion here. In its current
state, the 4-bit jitter field acts as a base-2 exponent. This gives us a
random jitter value of up to 2**15. Is the suggestion to change this to
a fixed, absolute value that can be applied to both long & short periods?
Thanks,
Mark
[1]
https://developer.arm.com/documentation/109847/9-3/Overview-of-Streamline-CLI-Tools
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-11 11:07 ` Mark Barnett
@ 2025-04-11 17:34 ` Ian Rogers
2025-04-12 20:42 ` Ingo Molnar
2025-04-12 20:38 ` Ingo Molnar
1 sibling, 1 reply; 16+ messages in thread
From: Ian Rogers @ 2025-04-11 17:34 UTC (permalink / raw)
To: Mark Barnett
Cc: Ingo Molnar, peterz, mingo, acme, namhyung, ben.gainey,
deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel
On Fri, Apr 11, 2025 at 4:08 AM Mark Barnett <mark.barnett@arm.com> wrote:
> Tool Integration
> ==================
>
> We've been using a python script to process the data into a report. We
> can look at implementing this directly in perf report, if that is
> required. However, I'm nervous about making the new feature the default
> behaviour for the tool.
>
> This feature has been integrated into our tools [1] for the last 12
> months, and has received a lot of testing on Arm Neoverse hardware.
> Other platforms have received less rigorous testing. In my opinion, more
> work would be needed to validate the PMU hardware & software
> characteristics of other architectures before this can be made the default.
Hi Mark,
Wrt testing, in v6.14 we've fixed up the python scripting with perf a
bit more and there is an example that parses an event and then dumps
samples here:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/python/tracepoint.py?h=perf-tools-next
There is also the perf script integration for things like flame graphs:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/scripts/python/flamegraph.py?h=perf-tools-next#n96
I don't think this work should be gated on cleaning up perf report, top,
etc., which still need cleanup for things like hybrid events. As the
histograms should use the sample's period, I believe things should
just work in much the same way as leader sampling does. It'd be
worth checking.
Thanks,
Ian
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-11 11:07 ` Mark Barnett
2025-04-11 17:34 ` Ian Rogers
@ 2025-04-12 20:38 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2025-04-12 20:38 UTC (permalink / raw)
To: Mark Barnett
Cc: peterz, mingo, acme, namhyung, irogers, ben.gainey, deepak.surti,
ak, will, james.clark, mark.rutland, alexander.shishkin, jolsa,
adrian.hunter, linux-perf-users, linux-kernel, linux-arm-kernel
* Mark Barnett <mark.barnett@arm.com> wrote:
> Tool Integration
> ==================
>
> We've been using a python script to process the data into a report. We can
> look at implementing this directly in perf report, if that is required.
> However, I'm nervous about making the new feature the default behaviour for
> the tool.
That's OK - but it should be very simple to activate for perf
record/top. A single option or so to get some sane default behavior,
without having to micro-manage the period values?
In particular perf defaults to 4000 Hz auto-freq samples (at least on
my x86 devel box), and it would be nice to make this new feature work
well with the freq representation too, without complicating it too
much.
> This feature has been integrated into our tools [1] for the last 12
> months, and has received a lot of testing on Arm Neoverse hardware.
> Other platforms have received less rigorous testing. In my opinion,
> more work would be needed to validate the PMU hardware & software
> characteristics of other architectures before this can be made the
> default.
Sure. As long as the switch is simple, I think it will be a popular
change once the kernel feature goes upstream.
In other words: please add a simple, idiot-proof switch to perf
top/record for maintainers with chronically short attention spans who
want to try out your kernel feature. ;)
Please Cc: the perf tooling people to those changes, in general they
are very open to such features.
> Burst Sampling
> ================
>
> I like the burst sampling idea. Increased I-Cache pressure is an inherent
> weakness of this sampling method, and this would help to alleviate that
> somewhat. I'll add this in the next spin.
Great, thanks!
> Period Jitter
> ===============
>
> Yes, we can apply this to both periods. I will make that change.
>
> I'm not sure I've fully understood your suggestion here. In its
> current state, the 4-bit jitter field acts as a base-2 exponent. This
> gives us a random jitter value of up to 2**15. Is the suggestion to
> change this to a fixed, absolute value that can be applied to both
> long & short periods?
Oh, I missed the base-2 exponent aspect; I assumed it was a flat period
in the small-integers range, a kind of micro-jitter to counter
resonance with the finest CPU microarchitectural base frequencies.
What is the typical useful jitter range in your experience? Base-2
exponents sound a bit too limiting - although the prandom32() indeed
smears it between [0..2^param).
to be larger than the long-period? (Although that might not always be
the case.)
I'd generally err on the side of being a bit too generic & generous in
the design of ABI parameters, because we never know what people will
use it for ... Within reason that is.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 0/5] A mechanism for efficient support for per-function metrics
2025-04-11 17:34 ` Ian Rogers
@ 2025-04-12 20:42 ` Ingo Molnar
0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2025-04-12 20:42 UTC (permalink / raw)
To: Ian Rogers
Cc: Mark Barnett, peterz, mingo, acme, namhyung, ben.gainey,
deepak.surti, ak, will, james.clark, mark.rutland,
alexander.shishkin, jolsa, adrian.hunter, linux-perf-users,
linux-kernel, linux-arm-kernel
* Ian Rogers <irogers@google.com> wrote:
> I don't think work should be gated on cleaning up perf report, top,
> etc. which still needs clean up for things like hybrid events. As the
> histograms should use the sample's period then I believe things
> should just work in much the same way as leader sampling can work.
> It'd be worth checking.
Yeah, so I think burst-profiling is basically still a single-event
profiling mode - with a tooling-side filter that skips the long-periods
and includes the burst-periods, and transforms all the statistics and
counts to make sense in the usual perf context.
Ie. I don't think it should be overly intrusive, and it could be a nice
performance & profiling quality feature we'd consider using by default
eventually.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 16+ messages in thread
Thread overview: 16+ messages
2025-04-08 17:15 [PATCH v4 0/5] A mechanism for efficient support for per-function metrics mark.barnett
2025-04-08 17:15 ` [PATCH v4 1/5] perf: Record sample last_period before updating mark.barnett
2025-04-09 11:39 ` Ingo Molnar
2025-04-08 17:15 ` [PATCH v4 2/5] perf: Allow periodic events to alternate between two sample periods mark.barnett
2025-04-08 17:15 ` [PATCH v4 3/5] perf: Allow adding fixed random jitter to the sampling period mark.barnett
2025-04-09 10:54 ` Ingo Molnar
2025-04-09 14:24 ` Peter Zijlstra
2025-04-08 17:15 ` [PATCH v4 4/5] tools/perf: Modify event parser to support hf-period term mark.barnett
2025-04-08 17:15 ` [PATCH v4 5/5] tools/perf: Modify event parser to support hf-rand term mark.barnett
2025-04-09 11:38 ` [PATCH v4 0/5] A mechanism for efficient support for per-function metrics Ingo Molnar
2025-04-11 11:07 ` Mark Barnett
2025-04-11 17:34 ` Ian Rogers
2025-04-12 20:42 ` Ingo Molnar
2025-04-12 20:38 ` Ingo Molnar
2025-04-09 14:29 ` Peter Zijlstra
2025-04-11 9:44 ` Mark Barnett