* [PATCH V5 1/4] perf/x86/intel/ds: Add PEBS format 6
@ 2024-12-16 20:45 kan.liang
2024-12-16 20:45 ` [PATCH V5 2/4] perf/x86: Extend event update interface kan.liang
` (3 more replies)
0 siblings, 4 replies; 11+ messages in thread
From: kan.liang @ 2024-12-16 20:45 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers, linux-kernel,
linux-perf-users
Cc: ak, eranian, Kan Liang, stable
From: Kan Liang <kan.liang@linux.intel.com>
The only difference between format 5 and format 6 is the new counters
snapshotting group. Without the subsequent counters snapshotting enabling
patches, it's impossible to utilize the feature in a PEBS record, so it's
safe to share the same code path with format 5.
Add format 6 so that end users can at least utilize the legacy PEBS
features.
Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
---
New patch since V4
arch/x86/events/intel/ds.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8dcf90f6fb59..ba74e1198328 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2551,6 +2551,7 @@ void __init intel_ds_init(void)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
break;
+ case 6:
case 5:
x86_pmu.pebs_ept = 1;
fallthrough;
--
2.38.1
* [PATCH V5 2/4] perf/x86: Extend event update interface
2024-12-16 20:45 [PATCH V5 1/4] perf/x86/intel/ds: Add PEBS format 6 kan.liang
@ 2024-12-16 20:45 ` kan.liang
2024-12-16 20:45 ` [PATCH V5 3/4] perf: Extend perf_output_read kan.liang
` (2 subsequent siblings)
3 siblings, 0 replies; 11+ messages in thread
From: kan.liang @ 2024-12-16 20:45 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers, linux-kernel,
linux-perf-users
Cc: ak, eranian, Kan Liang, Sandipan Das, Ravi Bangoria, silviazhao
From: Kan Liang <kan.liang@linux.intel.com>
The current event update interface directly reads the values from the
counter, but those values may not be the accurate ones users require. For
example, the sample read feature wants the counter values of the member
events at the moment the leader event overflows. With the current
implementation, the read (event update) actually happens in the NMI
handler, so there may be a small gap between the overflow and the NMI
handler. The new Intel PEBS counters snapshotting feature can provide
the accurate counter value at the overflow. The event update interface
has to be extended to apply such externally provided values.
Pass the accurate values via the event update interface. If no value is
available, keep reading the counter directly.
Use u64 * rather than u64 as the new parameter, because 0 might be a
valid rdpmc() value, so !val cannot be used to distinguish between an
argument being passed and one being absent. Also, in some cases,
e.g., intel_update_topdown_event(), more than one counter/register is
read.
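For illustration, the two calling modes of the extended interface would
look roughly as follows (a minimal sketch; the snapshot value and where
it comes from are hypothetical, the function name follows the diff below):

	u64 snapshot;

	/* No accurate value available: read the counter via rdpmc(), as before. */
	x86_perf_event_update(event, NULL);

	/* Accurate value captured at overflow, e.g. taken from a PEBS record. */
	snapshot = pebs_record_value;	/* hypothetical source */
	x86_perf_event_update(event, &snapshot);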
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: silviazhao <silviazhao-oc@zhaoxin.com>
---
No change from V4
The V4 can be found at
https://lore.kernel.org/lkml/20240731143835.771618-1-kan.liang@linux.intel.com/
arch/x86/events/amd/core.c | 2 +-
arch/x86/events/core.c | 13 ++++++-----
arch/x86/events/intel/core.c | 40 +++++++++++++++++++---------------
arch/x86/events/intel/p4.c | 2 +-
arch/x86/events/perf_event.h | 4 ++--
arch/x86/events/zhaoxin/core.c | 2 +-
6 files changed, 36 insertions(+), 27 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 30d6ceb4c8ad..711748ea6117 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -987,7 +987,7 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
event = cpuc->events[idx];
hwc = &event->hw;
- x86_perf_event_update(event);
+ x86_perf_event_update(event, NULL);
mask = BIT_ULL(idx);
if (!(status & mask))
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index cda754f3dabe..2b6029b5e78c 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -114,7 +114,7 @@ u64 __read_mostly hw_cache_extra_regs
* Can only be executed on the CPU where the event is active.
* Returns the delta events processed.
*/
-u64 x86_perf_event_update(struct perf_event *event)
+u64 x86_perf_event_update(struct perf_event *event, u64 *val)
{
struct hw_perf_event *hwc = &event->hw;
int shift = 64 - x86_pmu.cntval_bits;
@@ -133,7 +133,10 @@ u64 x86_perf_event_update(struct perf_event *event)
*/
prev_raw_count = local64_read(&hwc->prev_count);
do {
- rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+ if (!val)
+ rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+ else
+ new_raw_count = *val;
} while (!local64_try_cmpxchg(&hwc->prev_count,
&prev_raw_count, new_raw_count));
@@ -1602,7 +1605,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
* Drain the remaining delta count out of a event
* that we are disabling:
*/
- static_call(x86_pmu_update)(event);
+ static_call(x86_pmu_update)(event, NULL);
hwc->state |= PERF_HES_UPTODATE;
}
}
@@ -1693,7 +1696,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
event = cpuc->events[idx];
- val = static_call(x86_pmu_update)(event);
+ val = static_call(x86_pmu_update)(event, NULL);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
continue;
@@ -2039,7 +2042,7 @@ static void x86_pmu_static_call_update(void)
static void _x86_pmu_read(struct perf_event *event)
{
- static_call(x86_pmu_update)(event);
+ static_call(x86_pmu_update)(event, NULL);
}
void x86_pmu_show_pmu_cap(struct pmu *pmu)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2e1e26846050..b448db65df52 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2418,7 +2418,7 @@ static void intel_pmu_nhm_workaround(void)
for (i = 0; i < 4; i++) {
event = cpuc->events[i];
if (event)
- static_call(x86_pmu_update)(event);
+ static_call(x86_pmu_update)(event, NULL);
}
for (i = 0; i < 4; i++) {
@@ -2710,7 +2710,7 @@ static void update_saved_topdown_regs(struct perf_event *event, u64 slots,
* modify by a NMI. PMU has to be disabled before calling this function.
*/
-static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
+static u64 intel_update_topdown_event(struct perf_event *event, int metric_end, u64 *val)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *other;
@@ -2718,13 +2718,18 @@ static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
bool reset = true;
int idx;
- /* read Fixed counter 3 */
- rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
- if (!slots)
- return 0;
+ if (!val) {
+ /* read Fixed counter 3 */
+ rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
+ if (!slots)
+ return 0;
- /* read PERF_METRICS */
- rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+ /* read PERF_METRICS */
+ rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+ } else {
+ slots = val[0];
+ metrics = val[1];
+ }
for_each_set_bit(idx, cpuc->active_mask, metric_end + 1) {
if (!is_topdown_idx(idx))
@@ -2767,10 +2772,11 @@ static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
return slots;
}
-static u64 icl_update_topdown_event(struct perf_event *event)
+static u64 icl_update_topdown_event(struct perf_event *event, u64 *val)
{
return intel_update_topdown_event(event, INTEL_PMC_IDX_METRIC_BASE +
- x86_pmu.num_topdown_events - 1);
+ x86_pmu.num_topdown_events - 1,
+ val);
}
DEFINE_STATIC_CALL(intel_pmu_update_topdown_event, x86_perf_event_update);
@@ -2785,7 +2791,7 @@ static void intel_pmu_read_topdown_event(struct perf_event *event)
return;
perf_pmu_disable(event->pmu);
- static_call(intel_pmu_update_topdown_event)(event);
+ static_call(intel_pmu_update_topdown_event)(event, NULL);
perf_pmu_enable(event->pmu);
}
@@ -2796,7 +2802,7 @@ static void intel_pmu_read_event(struct perf_event *event)
else if (is_topdown_count(event))
intel_pmu_read_topdown_event(event);
else
- x86_perf_event_update(event);
+ x86_perf_event_update(event, NULL);
}
static void intel_pmu_enable_fixed(struct perf_event *event)
@@ -2899,7 +2905,7 @@ static void intel_pmu_add_event(struct perf_event *event)
*/
int intel_pmu_save_and_restart(struct perf_event *event)
{
- static_call(x86_pmu_update)(event);
+ static_call(x86_pmu_update)(event, NULL);
/*
* For a checkpointed counter always reset back to 0. This
* avoids a situation where the counter overflows, aborts the
@@ -2922,12 +2928,12 @@ static int intel_pmu_set_period(struct perf_event *event)
return x86_perf_event_set_period(event);
}
-static u64 intel_pmu_update(struct perf_event *event)
+static u64 intel_pmu_update(struct perf_event *event, u64 *val)
{
if (unlikely(is_topdown_count(event)))
- return static_call(intel_pmu_update_topdown_event)(event);
+ return static_call(intel_pmu_update_topdown_event)(event, val);
- return x86_perf_event_update(event);
+ return x86_perf_event_update(event, val);
}
static void intel_pmu_reset(void)
@@ -3091,7 +3097,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
*/
if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
handled++;
- static_call(intel_pmu_update_topdown_event)(NULL);
+ static_call(intel_pmu_update_topdown_event)(NULL, NULL);
}
/*
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index 844bc4fc4724..3177be0dedd1 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -1058,7 +1058,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
/* it might be unflagged overflow */
overflow = p4_pmu_clear_cccr_ovf(hwc);
- val = x86_perf_event_update(event);
+ val = x86_perf_event_update(event, NULL);
if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
continue;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 82c6f45ce975..14c8262d4811 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -796,7 +796,7 @@ struct x86_pmu {
void (*del)(struct perf_event *);
void (*read)(struct perf_event *event);
int (*set_period)(struct perf_event *event);
- u64 (*update)(struct perf_event *event);
+ u64 (*update)(struct perf_event *event, u64 *val);
int (*hw_config)(struct perf_event *event);
int (*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
unsigned eventsel;
@@ -1145,7 +1145,7 @@ extern u64 __read_mostly hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_OP_MAX]
[PERF_COUNT_HW_CACHE_RESULT_MAX];
-u64 x86_perf_event_update(struct perf_event *event);
+u64 x86_perf_event_update(struct perf_event *event, u64 *cntr);
static inline unsigned int x86_pmu_config_addr(int index)
{
diff --git a/arch/x86/events/zhaoxin/core.c b/arch/x86/events/zhaoxin/core.c
index 2fd9b0cf9a5e..5fe3a9eed650 100644
--- a/arch/x86/events/zhaoxin/core.c
+++ b/arch/x86/events/zhaoxin/core.c
@@ -391,7 +391,7 @@ static int zhaoxin_pmu_handle_irq(struct pt_regs *regs)
if (!test_bit(bit, cpuc->active_mask))
continue;
- x86_perf_event_update(event);
+ x86_perf_event_update(event, NULL);
perf_sample_data_init(&data, 0, event->hw.last_period);
if (!x86_perf_event_set_period(event))
--
2.38.1
* [PATCH V5 3/4] perf: Extend perf_output_read
2024-12-16 20:45 [PATCH V5 1/4] perf/x86/intel/ds: Add PEBS format 6 kan.liang
2024-12-16 20:45 ` [PATCH V5 2/4] perf/x86: Extend event update interface kan.liang
@ 2024-12-16 20:45 ` kan.liang
2024-12-16 20:45 ` [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting kan.liang
2024-12-24 9:46 ` [tip: perf/urgent] perf/x86/intel/ds: Add PEBS format 6 tip-bot2 for Kan Liang
3 siblings, 0 replies; 11+ messages in thread
From: kan.liang @ 2024-12-16 20:45 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers, linux-kernel,
linux-perf-users
Cc: ak, eranian, Kan Liang
From: Kan Liang <kan.liang@linux.intel.com>
The event may have already been updated in the PMU-specific
implementation, e.g., by Intel PEBS counters snapshotting. The common
code should not read the counter again and overwrite that value.
PERF_SAMPLE_READ in data->sample_flags can be used to detect whether
the PMU-specific value is available. If it is, avoid the pmu->read()
in the common code.
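A minimal sketch of the resulting check in the output path (names as in
the diff below; the surrounding context is simplified):

	/*
	 * event->attr.sample_type carries what the user asked for;
	 * data->sample_flags records what the PMU has already filled in.
	 * Skip the pmu->read() in the common code when the value is already there.
	 */
	if (sample_type & PERF_SAMPLE_READ)
		perf_output_read(handle, event,
				 !(data->sample_flags & PERF_SAMPLE_READ));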
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
No change from V4
The V4 can be found at
https://lore.kernel.org/lkml/20240731143835.771618-1-kan.liang@linux.intel.com/
kernel/events/core.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 684d631e78da..e2045403521a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7422,7 +7422,7 @@ static void perf_output_read_one(struct perf_output_handle *handle,
static void perf_output_read_group(struct perf_output_handle *handle,
struct perf_event *event,
- u64 enabled, u64 running)
+ u64 enabled, u64 running, bool read)
{
struct perf_event *leader = event->group_leader, *sub;
u64 read_format = event->attr.read_format;
@@ -7445,7 +7445,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
values[n++] = running;
- if ((leader != event) &&
+ if ((leader != event) && read &&
(leader->state == PERF_EVENT_STATE_ACTIVE))
leader->pmu->read(leader);
@@ -7460,7 +7460,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
for_each_sibling_event(sub, leader) {
n = 0;
- if ((sub != event) &&
+ if ((sub != event) && read &&
(sub->state == PERF_EVENT_STATE_ACTIVE))
sub->pmu->read(sub);
@@ -7491,7 +7491,8 @@ static void perf_output_read_group(struct perf_output_handle *handle,
* all cores.
*/
static void perf_output_read(struct perf_output_handle *handle,
- struct perf_event *event)
+ struct perf_event *event,
+ bool read)
{
u64 enabled = 0, running = 0, now;
u64 read_format = event->attr.read_format;
@@ -7509,7 +7510,7 @@ static void perf_output_read(struct perf_output_handle *handle,
calc_timer_values(event, &now, &enabled, &running);
if (event->attr.read_format & PERF_FORMAT_GROUP)
- perf_output_read_group(handle, event, enabled, running);
+ perf_output_read_group(handle, event, enabled, running, read);
else
perf_output_read_one(handle, event, enabled, running);
}
@@ -7551,7 +7552,7 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_put(handle, data->period);
if (sample_type & PERF_SAMPLE_READ)
- perf_output_read(handle, event);
+ perf_output_read(handle, event, !(data->sample_flags & PERF_SAMPLE_READ));
if (sample_type & PERF_SAMPLE_CALLCHAIN) {
int size = 1;
@@ -8195,7 +8196,7 @@ perf_event_read_event(struct perf_event *event,
return;
perf_output_put(&handle, read_event);
- perf_output_read(&handle, event);
+ perf_output_read(&handle, event, true);
perf_event__output_id_sample(event, &handle, &sample);
perf_output_end(&handle);
--
2.38.1
* [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-16 20:45 [PATCH V5 1/4] perf/x86/intel/ds: Add PEBS format 6 kan.liang
2024-12-16 20:45 ` [PATCH V5 2/4] perf/x86: Extend event update interface kan.liang
2024-12-16 20:45 ` [PATCH V5 3/4] perf: Extend perf_output_read kan.liang
@ 2024-12-16 20:45 ` kan.liang
2024-12-17 13:37 ` Peter Zijlstra
2024-12-24 9:46 ` [tip: perf/urgent] perf/x86/intel/ds: Add PEBS format 6 tip-bot2 for Kan Liang
3 siblings, 1 reply; 11+ messages in thread
From: kan.liang @ 2024-12-16 20:45 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, irogers, linux-kernel,
linux-perf-users
Cc: ak, eranian, Kan Liang
From: Kan Liang <kan.liang@linux.intel.com>
Counters snapshotting is a new adaptive PEBS extension, which can
capture programmable counters, fixed-function counters, and performance
metrics in a PEBS record. The feature is available in PEBS format V6.
The target counters can be configured in the new fields of MSR_PEBS_CFG.
The PEBS HW then writes a bit mask of counters (the Counters Group
Header) followed by the contents of all the requested counters into the
PEBS record.
The current Linux perf sample read feature intends to read the counters
of the other member events when the leader event overflows. But that
read happens in the NMI handler, which may have a small gap from the
overflow. Use the counters snapshotting feature for the sample read
instead.
Add a new PEBS_CNTR flag to indicate a sample read group that utilizes
the counters snapshotting feature. When the group is scheduled, the
PEBS configuration can be updated accordingly.
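For orientation, a sketch of the new MSR_PEBS_CFG fields and a
hypothetical request assembled from them (derived from the PEBS_DATACFG_*
defines added in this patch; illustrative only, the chosen counter
indices are arbitrary):

	/*
	 * bit  4       PEBS_DATACFG_CNTR     - request the GP/fixed counter group
	 * bit  5       PEBS_DATACFG_METRICS  - request the PERF_METRICS group
	 * bits 47:32   GP counter mask       (PEBS_DATACFG_CNTR_SHIFT/_MASK)
	 * bits 55:48   fixed counter mask    (PEBS_DATACFG_FIX_SHIFT/_MASK)
	 */
	u64 pebs_data_cfg = PEBS_DATACFG_CNTR |
			    ((1ULL << 0) << PEBS_DATACFG_CNTR_SHIFT) |	/* GP counter 0 */
			    ((1ULL << 1) << PEBS_DATACFG_FIX_SHIFT);	/* fixed counter 1 */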
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
Changes since V4
- Always drain PEBS before handling the overflow of a non-PEBS event.
The V4 can be found at
https://lore.kernel.org/lkml/20240731143835.771618-1-kan.liang@linux.intel.com/
arch/x86/events/intel/core.c | 45 ++++++++++-
arch/x86/events/intel/ds.c | 122 ++++++++++++++++++++++++++---
arch/x86/events/perf_event.h | 8 ++
arch/x86/events/perf_event_flags.h | 2 +-
arch/x86/include/asm/perf_event.h | 15 ++++
5 files changed, 180 insertions(+), 12 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b448db65df52..32b28d674641 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3115,6 +3115,14 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
if (!test_bit(bit, cpuc->active_mask))
continue;
+ /*
+ * There may be unprocessed PEBS records in the PEBS buffer,
+ * which still stores the previous values.
+ * Process those records first before handling the latest value.
+ */
+ if (is_pebs_counter_event(event))
+ x86_pmu.drain_pebs(regs, &data);
+
if (!intel_pmu_save_and_restart(event))
continue;
@@ -4062,6 +4070,23 @@ static int intel_pmu_hw_config(struct perf_event *event)
event->hw.flags |= PERF_X86_EVENT_PEBS_VIA_PT;
}
+ if ((event->attr.sample_type & PERF_SAMPLE_READ) &&
+ (x86_pmu.intel_cap.pebs_format >= 6)) {
+ struct perf_event *leader = event->group_leader;
+ bool slots_leader = is_slots_event(leader);
+
+ if (slots_leader)
+ leader = list_next_entry(leader, sibling_list);
+
+ if (leader->attr.precise_ip) {
+ event->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
+ if (slots_leader) {
+ leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
+ event->group_leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
+ }
+ }
+ }
+
if ((event->attr.type == PERF_TYPE_HARDWARE) ||
(event->attr.type == PERF_TYPE_HW_CACHE))
return 0;
@@ -4165,6 +4190,24 @@ static int intel_pmu_hw_config(struct perf_event *event)
return 0;
}
+static int intel_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+ struct perf_event *event;
+ int ret = x86_schedule_events(cpuc, n, assign);
+
+ if (ret)
+ return ret;
+
+ if (cpuc->is_fake)
+ return ret;
+
+ event = cpuc->event_list[n - 1];
+ if (event && (event->group_leader->hw.flags & PERF_X86_EVENT_PEBS_CNTR))
+ intel_pmu_pebs_update_cfg(cpuc, n, assign);
+
+ return 0;
+}
+
/*
* Currently, the only caller of this function is the atomic_switch_perf_msrs().
* The host perf context helps to prepare the values of the real hardware for
@@ -5298,7 +5341,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.set_period = intel_pmu_set_period,
.update = intel_pmu_update,
.hw_config = intel_pmu_hw_config,
- .schedule_events = x86_schedule_events,
+ .schedule_events = intel_pmu_schedule_events,
.eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
.perfctr = MSR_ARCH_PERFMON_PERFCTR0,
.fixedctr = MSR_ARCH_PERFMON_FIXED_CTR0,
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index ba74e1198328..e06ac9a3cdf8 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1308,10 +1308,63 @@ static void adaptive_pebs_record_size_update(void)
sz += sizeof(struct pebs_xmm);
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
sz += x86_pmu.lbr_nr * sizeof(struct lbr_entry);
+ if (pebs_data_cfg & (PEBS_DATACFG_METRICS | PEBS_DATACFG_CNTR)) {
+ sz += sizeof(struct pebs_cntr_header);
+
+ /* Metrics base and Metrics Data */
+ if (pebs_data_cfg & PEBS_DATACFG_METRICS)
+ sz += 2 * sizeof(u64);
+
+ if (pebs_data_cfg & PEBS_DATACFG_CNTR) {
+ sz += hweight64((pebs_data_cfg >> PEBS_DATACFG_CNTR_SHIFT) & PEBS_DATACFG_CNTR_MASK)
+ * sizeof(u64);
+ sz += hweight64((pebs_data_cfg >> PEBS_DATACFG_FIX_SHIFT) & PEBS_DATACFG_FIX_MASK)
+ * sizeof(u64);
+ }
+ }
cpuc->pebs_record_size = sz;
}
+static void __intel_pmu_pebs_update_cfg(struct perf_event *event,
+ int idx, u64 *pebs_data_cfg)
+{
+ if (is_metric_event(event)) {
+ *pebs_data_cfg |= PEBS_DATACFG_METRICS;
+ return;
+ }
+
+ *pebs_data_cfg |= PEBS_DATACFG_CNTR;
+
+ if (idx >= INTEL_PMC_IDX_FIXED) {
+ *pebs_data_cfg |= ((1ULL << (idx - INTEL_PMC_IDX_FIXED)) & PEBS_DATACFG_FIX_MASK)
+ << PEBS_DATACFG_FIX_SHIFT;
+ } else {
+ *pebs_data_cfg |= ((1ULL << idx) & PEBS_DATACFG_CNTR_MASK)
+ << PEBS_DATACFG_CNTR_SHIFT;
+ }
+}
+
+void intel_pmu_pebs_update_cfg(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+ struct perf_event *leader, *event;
+ u64 pebs_data_cfg = 0;
+ int i = n - 1;
+
+ leader = cpuc->event_list[i]->group_leader;
+ for (; i >= 0; i--) {
+ event = cpuc->event_list[i];
+ if (!is_pebs_counter_event(event))
+ continue;
+ if (leader != event->group_leader)
+ break;
+ __intel_pmu_pebs_update_cfg(event, assign[i], &pebs_data_cfg);
+ }
+
+ if (pebs_data_cfg & ~cpuc->pebs_data_cfg)
+ cpuc->pebs_data_cfg |= pebs_data_cfg | PEBS_UPDATE_DS_SW;
+}
+
#define PERF_PEBS_MEMINFO_TYPE (PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC | \
PERF_SAMPLE_PHYS_ADDR | \
PERF_SAMPLE_WEIGHT_TYPE | \
@@ -2049,6 +2102,40 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
}
}
+ if (format_group & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) {
+ struct pebs_cntr_header *cntr = next_record;
+ int bit;
+
+ next_record += sizeof(struct pebs_cntr_header);
+
+ for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
+ x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
+ next_record += sizeof(u64);
+ }
+
+ for_each_set_bit(bit, (unsigned long *)&cntr->fixed, INTEL_PMC_MAX_FIXED) {
+ /* The slots event will be handled with perf_metric later */
+ if ((cntr->metrics == INTEL_CNTR_METRICS) &&
+ (INTEL_PMC_IDX_FIXED_SLOTS == bit + INTEL_PMC_IDX_FIXED)) {
+ next_record += sizeof(u64);
+ continue;
+ }
+ x86_perf_event_update(cpuc->events[bit + INTEL_PMC_IDX_FIXED], (u64 *)next_record);
+ next_record += sizeof(u64);
+ }
+
+ /* HW will reload the value right after the overflow. */
+ if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
+ local64_set(&event->hw.prev_count, (u64)-event->hw.sample_period);
+
+ if (cntr->metrics == INTEL_CNTR_METRICS) {
+ static_call(intel_pmu_update_topdown_event)
+ (event->group_leader, (u64 *)next_record);
+ next_record += 2 * sizeof(u64);
+ }
+ data->sample_flags |= PERF_SAMPLE_READ;
+ }
+
WARN_ONCE(next_record != __pebs + basic->format_size,
"PEBS record size %u, expected %llu, config %llx\n",
basic->format_size,
@@ -2211,15 +2298,27 @@ __intel_pmu_pebs_last_event(struct perf_event *event,
}
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
- /*
- * Now, auto-reload is only enabled in fixed period mode.
- * The reload value is always hwc->sample_period.
- * May need to change it, if auto-reload is enabled in
- * freq mode later.
- */
- intel_pmu_save_and_restart_reload(event, count);
- } else
- intel_pmu_save_and_restart(event);
+ if ((is_pebs_counter_event(event))) {
+ /*
+ * The value of each sample has been updated when setup
+ * the corresponding sample data.
+ */
+ perf_event_update_userpage(event);
+ } else {
+ /*
+ * Now, auto-reload is only enabled in fixed period mode.
+ * The reload value is always hwc->sample_period.
+ * May need to change it, if auto-reload is enabled in
+ * freq mode later.
+ */
+ intel_pmu_save_and_restart_reload(event, count);
+ }
+ } else {
+ if (is_pebs_counter_event(event))
+ static_call(x86_pmu_set_period)(event);
+ else
+ intel_pmu_save_and_restart(event);
+ }
}
static __always_inline void
@@ -2552,6 +2651,9 @@ void __init intel_ds_init(void)
break;
case 6:
+ if (x86_pmu.intel_cap.pebs_baseline)
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
+ fallthrough;
case 5:
x86_pmu.pebs_ept = 1;
fallthrough;
@@ -2576,7 +2678,7 @@ void __init intel_ds_init(void)
PERF_SAMPLE_REGS_USER |
PERF_SAMPLE_REGS_INTR);
}
- pr_cont("PEBS fmt4%c%s, ", pebs_type, pebs_qual);
+ pr_cont("PEBS fmt%d%c%s, ", format, pebs_type, pebs_qual);
if (!is_hybrid() && x86_pmu.intel_cap.pebs_output_pt_available) {
pr_cont("PEBS-via-PT, ");
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 14c8262d4811..e88dd0fcc4d4 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -115,6 +115,11 @@ static inline bool is_branch_counters_group(struct perf_event *event)
return event->group_leader->hw.flags & PERF_X86_EVENT_BRANCH_COUNTERS;
}
+static inline bool is_pebs_counter_event(struct perf_event *event)
+{
+ return event->hw.flags & PERF_X86_EVENT_PEBS_CNTR;
+}
+
struct amd_nb {
int nb_id; /* NorthBridge id */
int refcnt; /* reference count */
@@ -1146,6 +1151,7 @@ extern u64 __read_mostly hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_RESULT_MAX];
u64 x86_perf_event_update(struct perf_event *event, u64 *cntr);
+DECLARE_STATIC_CALL(intel_pmu_update_topdown_event, x86_perf_event_update);
static inline unsigned int x86_pmu_config_addr(int index)
{
@@ -1642,6 +1648,8 @@ void intel_pmu_pebs_disable_all(void);
void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
+void intel_pmu_pebs_update_cfg(struct cpu_hw_events *cpuc, int n, int *assign);
+
void intel_pmu_auto_reload_read(struct perf_event *event);
void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
diff --git a/arch/x86/events/perf_event_flags.h b/arch/x86/events/perf_event_flags.h
index 6c977c19f2cd..1d9e385649b5 100644
--- a/arch/x86/events/perf_event_flags.h
+++ b/arch/x86/events/perf_event_flags.h
@@ -9,7 +9,7 @@ PERF_ARCH(PEBS_LD_HSW, 0x00008) /* haswell style datala, load */
PERF_ARCH(PEBS_NA_HSW, 0x00010) /* haswell style datala, unknown */
PERF_ARCH(EXCL, 0x00020) /* HT exclusivity on counter */
PERF_ARCH(DYNAMIC, 0x00040) /* dynamic alloc'd constraint */
- /* 0x00080 */
+PERF_ARCH(PEBS_CNTR, 0x00080) /* PEBS counters snapshot */
PERF_ARCH(EXCL_ACCT, 0x00100) /* accounted EXCL event */
PERF_ARCH(AUTO_RELOAD, 0x00200) /* use PEBS auto-reload */
PERF_ARCH(LARGE_PEBS, 0x00400) /* use large PEBS */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index cd8023d5ea46..4e28ee0f2f3e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -140,6 +140,12 @@
#define PEBS_DATACFG_XMMS BIT_ULL(2)
#define PEBS_DATACFG_LBRS BIT_ULL(3)
#define PEBS_DATACFG_LBR_SHIFT 24
+#define PEBS_DATACFG_CNTR BIT_ULL(4)
+#define PEBS_DATACFG_CNTR_SHIFT 32
+#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
+#define PEBS_DATACFG_FIX_SHIFT 48
+#define PEBS_DATACFG_FIX_MASK GENMASK_ULL(7, 0)
+#define PEBS_DATACFG_METRICS BIT_ULL(5)
/* Steal the highest bit of pebs_data_cfg for SW usage */
#define PEBS_UPDATE_DS_SW BIT_ULL(63)
@@ -467,6 +473,15 @@ struct pebs_xmm {
#define IBS_CPUID_FEATURES 0x8000001b
+struct pebs_cntr_header {
+ u32 cntr;
+ u32 fixed;
+ u32 metrics;
+ u32 reserved;
+};
+
+#define INTEL_CNTR_METRICS 0x3
+
/*
* Same bit mask as for IBS cpuid feature flags (Fn8000_001B_EAX), but
* bit 0 is used to indicate the existence of IBS.
--
2.38.1
* Re: [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-16 20:45 ` [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting kan.liang
@ 2024-12-17 13:37 ` Peter Zijlstra
2024-12-17 17:45 ` Liang, Kan
0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2024-12-17 13:37 UTC (permalink / raw)
To: kan.liang
Cc: mingo, acme, namhyung, irogers, linux-kernel, linux-perf-users,
ak, eranian
On Mon, Dec 16, 2024 at 12:45:05PM -0800, kan.liang@linux.intel.com wrote:
> @@ -2049,6 +2102,40 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
> }
> }
>
> + if (format_group & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) {
> + struct pebs_cntr_header *cntr = next_record;
> + int bit;
> +
> + next_record += sizeof(struct pebs_cntr_header);
> +
> + for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
> + x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
> + next_record += sizeof(u64);
> + }
I still don't much like any of this -- the next_record value might be in
the past relative to what is already in the event.
Why can't you use something like the below -- that gives you a count
value matching the pmc value you put in, as long as it is 'near' the
current value.
---
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8f218ac0d445..3cf8b4f2b2c1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -154,6 +154,26 @@ u64 x86_perf_event_update(struct perf_event *event)
return new_raw_count;
}
+u64 x86_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ int shift = 64 - x86_pmu.cntval_bits;
+ u64 prev_pmc, prev_count;
+ u64 delta;
+
+ do {
+ prev_pmc = local64_read(&hwc->prev_count);
+ barrier();
+ prev_count = local64_read(&event->count);
+ barrier();
+ } while (prev_pmc != local64_read(&hwc->prev_count));
+
+ delta = (pmc << shift) - (prev_pmc << shift);
+ delta >>= shift;
+
+ return prev_count + delta;
+}
+
/*
* Find and validate any extra registers to set up.
*/
* Re: [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-17 13:37 ` Peter Zijlstra
@ 2024-12-17 17:45 ` Liang, Kan
2024-12-17 20:29 ` Peter Zijlstra
2024-12-18 8:24 ` Peter Zijlstra
0 siblings, 2 replies; 11+ messages in thread
From: Liang, Kan @ 2024-12-17 17:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, irogers, linux-kernel, linux-perf-users,
ak, eranian
On 2024-12-17 8:37 a.m., Peter Zijlstra wrote:
> On Mon, Dec 16, 2024 at 12:45:05PM -0800, kan.liang@linux.intel.com wrote:
>
>> @@ -2049,6 +2102,40 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
>> }
>> }
>>
>> + if (format_group & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) {
>> + struct pebs_cntr_header *cntr = next_record;
>> + int bit;
>> +
>> + next_record += sizeof(struct pebs_cntr_header);
>> +
>> + for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
>> + x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
>> + next_record += sizeof(u64);
>> + }
>
> I still don't much like any of this -- the next_record value might be in
> the past relative to what is already in the event.
>
I forgot to handle the read case. With the below patch, the next_record
value should not contain a previous value, unless there is a HW bug.
No matter whether it's a read, a PMI, or a context switch, perf always
stops the PMU and drains the buffer, and the value always comes from the
PEBS record.
There are two places where perf directly reads the counter and updates
event->count. One is x86_pmu_stop(). The other is handling a non-PEBS
event's PMI. The PMU is stopped in both cases, and the PEBS buffer is
drained in advance, so the next "next_record" must be a newer value.
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 32b28d674641..13ab7fca72dd 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2797,7 +2797,8 @@ static void intel_pmu_read_topdown_event(struct perf_event *event)
static void intel_pmu_read_event(struct perf_event *event)
{
- if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
+ if ((event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD) ||
+ is_pebs_counter_event(event))
intel_pmu_auto_reload_read(event);
else if (is_topdown_count(event))
intel_pmu_read_topdown_event(event);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e06ac9a3cdf8..5e11b38427c7 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2181,10 +2181,8 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
return NULL;
}
-void intel_pmu_auto_reload_read(struct perf_event *event)
+void intel_pmu_pebs_read(struct perf_event *event)
{
- WARN_ON(!(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD));
-
perf_pmu_disable(event->pmu);
intel_pmu_drain_pebs_buffer();
perf_pmu_enable(event->pmu);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e88dd0fcc4d4..22b0c3e565a6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1650,7 +1650,7 @@ void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sche
void intel_pmu_pebs_update_cfg(struct cpu_hw_events *cpuc, int n, int *assign);
-void intel_pmu_auto_reload_read(struct perf_event *event);
+void intel_pmu_pebs_read(struct perf_event *event);
void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
> Why can't you use something like the below -- that gives you a count
> value matching the pmc value you put in, as long as it is 'near' the
> current value.
>
> ---
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 8f218ac0d445..3cf8b4f2b2c1 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -154,6 +154,26 @@ u64 x86_perf_event_update(struct perf_event *event)
> return new_raw_count;
> }
>
> +u64 x86_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 prev_pmc, prev_count;
> + u64 delta;
> +
> + do {
> + prev_pmc = local64_read(&hwc->prev_count);
> + barrier();
> + prev_count = local64_read(&event->count);
> + barrier();
> + } while (prev_pmc != local64_read(&hwc->prev_count));
Is the "while()" to handle PMI? But there should be no PMI, since the
PMU has been disabled when draining the PEBS buffer.
I'm thinking to introduce a dedicated function to update the count from
pmc. It will only be used in the drain_pebs(). So we can avoid extending
the x86_perf_event_update() in patch 2. (Will do more tests.)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e06ac9a3cdf8..7f0b850f7277 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1969,6 +1969,23 @@ static void adaptive_pebs_save_regs(struct pt_regs *regs,
#define PEBS_LATENCY_MASK 0xffff
+void intel_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ int shift = 64 - x86_pmu.cntval_bits;
+ u64 prev_pmc;
+ u64 delta;
+
+ prev_pmc = local64_read(&hwc->prev_count);
+
+ delta = (pmc << shift) - (prev_pmc << shift);
+ delta >>= shift;
+
+ local64_add(delta, &event->count);
+ local64_sub(delta, &hwc->period_left);
+ local64_set(&hwc->prev_count, pmc);
+}
+
/*
* With adaptive PEBS the layout depends on what fields are configured.
*/
@@ -2109,7 +2126,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
next_record += sizeof(struct pebs_cntr_header);
for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
- x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
+ intel_perf_event_pmc_to_count(cpuc->events[bit], (u64 *)next_record);
next_record += sizeof(u64);
}
Thanks,
Kan
> +
> + delta = (pmc << shift) - (prev_pmc << shift);
> + delta >>= shift;
> +
> + return prev_count + delta;
> +}
> +
> /*
> * Find and validate any extra registers to set up.
> */
>
* Re: [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-17 17:45 ` Liang, Kan
@ 2024-12-17 20:29 ` Peter Zijlstra
2024-12-17 20:59 ` Liang, Kan
2024-12-18 8:24 ` Peter Zijlstra
1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2024-12-17 20:29 UTC (permalink / raw)
To: Liang, Kan
Cc: mingo, acme, namhyung, irogers, linux-kernel, linux-perf-users,
ak, eranian
On Tue, Dec 17, 2024 at 12:45:56PM -0500, Liang, Kan wrote:
> > Why can't you use something like the below -- that gives you a count
> > value matching the pmc value you put in, as long as it is 'near' the
> > current value.
> >
> > ---
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index 8f218ac0d445..3cf8b4f2b2c1 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -154,6 +154,26 @@ u64 x86_perf_event_update(struct perf_event *event)
> > return new_raw_count;
> > }
> >
> > +u64 x86_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > + int shift = 64 - x86_pmu.cntval_bits;
> > + u64 prev_pmc, prev_count;
> > + u64 delta;
> > +
> > + do {
> > + prev_pmc = local64_read(&hwc->prev_count);
> > + barrier();
> > + prev_count = local64_read(&event->count);
> > + barrier();
> > + } while (prev_pmc != local64_read(&hwc->prev_count));
>
> Is the "while()" to handle PMI? But there should be no PMI, since the
> PMU has been disabled when draining the PEBS buffer.
Perhaps not in your case, but this way the function is more widely
usable.
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index e06ac9a3cdf8..7f0b850f7277 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1969,6 +1969,23 @@ static void adaptive_pebs_save_regs(struct
> pt_regs *regs,
>
> #define PEBS_LATENCY_MASK 0xffff
>
> +void intel_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 prev_pmc;
> + u64 delta;
> +
> + prev_pmc = local64_read(&hwc->prev_count);
> +
> + delta = (pmc << shift) - (prev_pmc << shift);
> + delta >>= shift;
> +
> + local64_add(delta, &event->count);
> + local64_sub(delta, &hwc->period_left);
> + local64_set(&hwc->prev_count, pmc);
> +}
This seems very fragile, at least keep the same store order and assert
you're in NMI/PMI context.
* Re: [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-17 20:29 ` Peter Zijlstra
@ 2024-12-17 20:59 ` Liang, Kan
0 siblings, 0 replies; 11+ messages in thread
From: Liang, Kan @ 2024-12-17 20:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, irogers, linux-kernel, linux-perf-users,
ak, eranian
On 2024-12-17 3:29 p.m., Peter Zijlstra wrote:
> On Tue, Dec 17, 2024 at 12:45:56PM -0500, Liang, Kan wrote:
>
>
>>> Why can't you use something like the below -- that gives you a count
>>> value matching the pmc value you put in, as long as it is 'near' the
>>> current value.
>>>
>>> ---
>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>> index 8f218ac0d445..3cf8b4f2b2c1 100644
>>> --- a/arch/x86/events/core.c
>>> +++ b/arch/x86/events/core.c
>>> @@ -154,6 +154,26 @@ u64 x86_perf_event_update(struct perf_event *event)
>>> return new_raw_count;
>>> }
>>>
>>> +u64 x86_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + int shift = 64 - x86_pmu.cntval_bits;
>>> + u64 prev_pmc, prev_count;
>>> + u64 delta;
>>> +
>>> + do {
>>> + prev_pmc = local64_read(&hwc->prev_count);
>>> + barrier();
>>> + prev_count = local64_read(&event->count);
>>> + barrier();
>>> + } while (prev_pmc != local64_read(&hwc->prev_count));
>>
>> Is the "while()" to handle PMI? But there should be no PMI, since the
>> PMU has been disabled when draining the PEBS buffer.
>
> Perhaps not in your case, but this way the function is more widely
> usable.
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index e06ac9a3cdf8..7f0b850f7277 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1969,6 +1969,23 @@ static void adaptive_pebs_save_regs(struct
>> pt_regs *regs,
>>
>> #define PEBS_LATENCY_MASK 0xffff
>>
>> +void intel_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + int shift = 64 - x86_pmu.cntval_bits;
>> + u64 prev_pmc;
>> + u64 delta;
>> +
>> + prev_pmc = local64_read(&hwc->prev_count);
>> +
>> + delta = (pmc << shift) - (prev_pmc << shift);
>> + delta >>= shift;
>> +
>> + local64_add(delta, &event->count);
>> + local64_sub(delta, &hwc->period_left);
>> + local64_set(&hwc->prev_count, pmc);
>> +}
>
> This seems very fragile, at least keep the same store order and assert
> you're in NMI/PMI context.
Sure, I will keep the same store order.
You mean to assert in case there is an unexpected NMI in the normal
drain case, right? I think we can check whether the PMU is disabled, as
below.
@@ -1974,12 +1974,15 @@ static void intel_perf_event_pmc_to_count(struct perf_event *event, u64 pmc)
int shift = 64 - x86_pmu.cntval_bits;
u64 delta;
+ /* Only read update the count when the PMU is disabled */
+ WARN_ON(this_cpu_read(cpu_hw_events.enabled));
+ local64_set(&hwc->prev_count, pmc);
+
delta = (pmc << shift) - (prev_pmc << shift);
delta >>= shift;
local64_add(delta, &event->count);
local64_sub(delta, &hwc->period_left);
- local64_set(&hwc->prev_count, pmc);
}
Thanks,
Kan
* Re: [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-17 17:45 ` Liang, Kan
2024-12-17 20:29 ` Peter Zijlstra
@ 2024-12-18 8:24 ` Peter Zijlstra
2024-12-18 14:52 ` Liang, Kan
1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2024-12-18 8:24 UTC (permalink / raw)
To: Liang, Kan
Cc: mingo, acme, namhyung, irogers, linux-kernel, linux-perf-users,
ak, eranian
On Tue, Dec 17, 2024 at 12:45:56PM -0500, Liang, Kan wrote:
>
>
> On 2024-12-17 8:37 a.m., Peter Zijlstra wrote:
> > On Mon, Dec 16, 2024 at 12:45:05PM -0800, kan.liang@linux.intel.com wrote:
> >
> >> @@ -2049,6 +2102,40 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
> >> }
> >> }
> >>
> >> + if (format_group & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) {
> >> + struct pebs_cntr_header *cntr = next_record;
> >> + int bit;
> >> +
> >> + next_record += sizeof(struct pebs_cntr_header);
> >> +
> >> + for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
> >> + x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
> >> + next_record += sizeof(u64);
> >> + }
> >
> > I still don't much like any of this -- the next_record value might be in
> > the past relative to what is already in the event.
> >
>
> I forgot to handle the read case. With the below patch, the next_record
> value should not contain a previous value, unless there is a HW bug.
> Because no matter it's read/pmi/context switch, perf always stop the
> PMU, drains the buffer, and the value is always from the PEBS record.
Consider this 3 counter case:
A is our regular counter
B is a PEBS event which reads A
C is a PEBS event
For convenience, let A count the lines in our example (anything
incrementing at a faster rate will get the same result after all).
Then, afaict, the following can happen:
B-assist A=1
C A=2
B-assist A=3
A-overflow-PMI A=4
C-assist-PMI (DS buffer) A=5
So the A-PMI will update A->count to 4.
Then the C-PMI will process the DS buffer, which will:
- adjust A->count to 1
- adjust A->count to 3
Leaving us, in the end, with A going backwards.
How is that not a problem?
IIRC I've raised this point before, and your Changelog does not mention
anything about this issue at all :-(
* Re: [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting
2024-12-18 8:24 ` Peter Zijlstra
@ 2024-12-18 14:52 ` Liang, Kan
0 siblings, 0 replies; 11+ messages in thread
From: Liang, Kan @ 2024-12-18 14:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, irogers, linux-kernel, linux-perf-users,
ak, eranian
On 2024-12-18 3:24 a.m., Peter Zijlstra wrote:
> On Tue, Dec 17, 2024 at 12:45:56PM -0500, Liang, Kan wrote:
>>
>>
>> On 2024-12-17 8:37 a.m., Peter Zijlstra wrote:
>>> On Mon, Dec 16, 2024 at 12:45:05PM -0800, kan.liang@linux.intel.com wrote:
>>>
>>>> @@ -2049,6 +2102,40 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
>>>> }
>>>> }
>>>>
>>>> + if (format_group & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) {
>>>> + struct pebs_cntr_header *cntr = next_record;
>>>> + int bit;
>>>> +
>>>> + next_record += sizeof(struct pebs_cntr_header);
>>>> +
>>>> + for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
>>>> + x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
>>>> + next_record += sizeof(u64);
>>>> + }
>>>
>>> I still don't much like any of this -- the next_record value might be in
>>> the past relative to what is already in the event.
>>>
>>
>> I forgot to handle the read case. With the below patch, the next_record
>> value should not contain a previous value, unless there is a HW bug.
>> Because no matter it's read/pmi/context switch, perf always stop the
>> PMU, drains the buffer, and the value is always from the PEBS record.
>
> Consider this 3 counter case:
>
> A is our regular counter
> B is a PEBS event which reads A
> C is a PEBS event
>
> For convencience, let A count the lines in our example (anything
> incrementing at a faster rate will get the same result after all).
>
> Then, afaict, the following can happen:
>
> B-assist A=1
> C A=2
> B-assist A=3
> A-overflow-PMI A=4
> C-assist-PMI (DS buffer) A=5
>
> So the A-PMI will update A->count to 4.
> Then the C-PMI will process the DS buffer, which will:
>
> - adjust A->count to 1
> - adjust A->count to 3
The patch should have addressed that case. I will update the changelog
to point to it in the next version.
Below is the code in this patch that addresses the issue.
Perf will drain the PEBS buffer before handling the non-PEBS overflow.
So the A-PMI will
- process the DS buffer, adjusting A->count to 3
- adjust A->count to 4
The C-PMI will not touch A->count since there is no record from B.
@@ -3115,6 +3115,14 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
if (!test_bit(bit, cpuc->active_mask))
continue;
+ /*
+ * There may be unprocessed PEBS records in the PEBS buffer,
+ * which still stores the previous values.
+ * Process those records first before handling the latest value.
+ */
+ if (is_pebs_counter_event(event))
+ x86_pmu.drain_pebs(regs, &data);
+
if (!intel_pmu_save_and_restart(event))
continue;
Thanks,
Kan
>
> Leaving us, in the end, with A going backwards.
>
> How is that not a problem?
>
> IIRC I've raised this point before, and your Changelog does not mention
> anything about this issue at all :-(
>
* [tip: perf/urgent] perf/x86/intel/ds: Add PEBS format 6
2024-12-16 20:45 [PATCH V5 1/4] perf/x86/intel/ds: Add PEBS format 6 kan.liang
` (2 preceding siblings ...)
2024-12-16 20:45 ` [PATCH V5 4/4] perf/x86/intel: Support PEBS counters snapshotting kan.liang
@ 2024-12-24 9:46 ` tip-bot2 for Kan Liang
3 siblings, 0 replies; 11+ messages in thread
From: tip-bot2 for Kan Liang @ 2024-12-24 9:46 UTC (permalink / raw)
To: linux-tip-commits
Cc: Kan Liang, Peter Zijlstra (Intel), stable, x86, linux-kernel
The following commit has been merged into the perf/urgent branch of tip:
Commit-ID: b8c3a2502a205321fe66c356f4b70cabd8e1a5fc
Gitweb: https://git.kernel.org/tip/b8c3a2502a205321fe66c356f4b70cabd8e1a5fc
Author: Kan Liang <kan.liang@linux.intel.com>
AuthorDate: Mon, 16 Dec 2024 12:45:02 -08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 17 Dec 2024 17:47:23 +01:00
perf/x86/intel/ds: Add PEBS format 6
The only difference between format 5 and format 6 is the new counters
snapshotting group. Without the subsequent counters snapshotting enabling
patches, it's impossible to utilize the feature in a PEBS record, so it's
safe to share the same code path with format 5.
Add format 6 so that end users can at least utilize the legacy PEBS
features.
Fixes: a932aa0e868f ("perf/x86: Add Lunar Lake and Arrow Lake support")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241216204505.748363-1-kan.liang@linux.intel.com
---
arch/x86/events/intel/ds.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 1a4b326..6ba6549 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2517,6 +2517,7 @@ void __init intel_ds_init(void)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
break;
+ case 6:
case 5:
x86_pmu.pebs_ept = 1;
fallthrough;