public inbox for linux-kernel@vger.kernel.org
* [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU
@ 2024-07-31 14:38 kan.liang
  2024-07-31 14:38 ` [PATCH V4 1/5] perf/x86: Extend event update interface kan.liang
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: kan.liang @ 2024-07-31 14:38 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Changes since V3:
- Rebase on top of 6.11-rc1
- Remove the patches which have been merged.
V3 can be found at
https://lore.kernel.org/lkml/20240626143545.480761-1-kan.liang@linux.intel.com/

Changes since V2:
- Rebase on top of the tip.git. Specifically on top of the below two
  patches.
  commit cd84351c8c1b ("perf/x86/amd: Use try_cmpxchg() in events/amd/{un,}core.c")
  commit d142df13f357 ("perf/x86/intel: Switch to new Intel CPU model defines")
- Add Reviewed-by tag from Ian

Changes since V1:
- Add x86/intel_pmu_max_num_pebs/counters/counters_fixed()
- Rename model-specific pebs_latency_data functions
- Rename V6 counter MSRs

From the core PMU's perspective, Lunar Lake and Arrow Lake are identical,
and both are similar to the previous generation, Meteor Lake. All are
hybrid platforms, with e-cores and p-cores.

The key differences include:
- The e-core supports 3 new fixed counters
- The p-core supports an updated PEBS Data Source format
- More GP counters (Updated event constraint table)
- New Architectural performance monitoring V6
  (New Perfmon MSRs aliasing, umask2, eq).
- New PEBS format V6 (Counters Snapshotting group)
- New RDPMC metrics clear mode

The details for the above new features can be found in the Intel
Architecture Instruction Set Extensions and Future Features (052).
https://cdrdv2.intel.com/v1/dl/getContent/671368

The counters may no longer be contiguous. The patches converting the max
number of counters to a mask of counters (a generic change which impacts
all x86 platforms) have already been merged since V3.

The patches supporting the legacy features on LNL and ARL and the new
Architectural performance monitoring V6 have also been merged since V3.

Patches 1-4 support the new PEBS format V6 (counters snapshotting).

Patch 5 supports the new RDPMC metrics clear mode.

Only two features (Architectural performance monitoring V6 and the RDPMC
metrics clear mode) add new formats, which impact the ABI. The "Sysfs PMU
tests" case already covers the non-contiguous format definition caused by
the new umask. The current perf tests should be sufficient to cover the
ABI changes.

Kan Liang (5):
  perf/x86: Extend event update interface
  perf: Extend perf_output_read
  perf/x86/intel: Move PEBS event update after the sample output
  perf/x86/intel: Support PEBS counters snapshotting
  perf/x86/intel: Support RDPMC metrics clear mode

 arch/x86/events/amd/core.c           |   2 +-
 arch/x86/events/core.c               |  13 +--
 arch/x86/events/intel/core.c         |  93 +++++++++++++++-----
 arch/x86/events/intel/ds.c           | 122 ++++++++++++++++++++++++---
 arch/x86/events/intel/p4.c           |   2 +-
 arch/x86/events/perf_event.h         |   8 +-
 arch/x86/events/perf_event_flags.h   |   2 +-
 arch/x86/events/zhaoxin/core.c       |   2 +-
 arch/x86/include/asm/perf_event.h    |  19 +++++
 kernel/events/core.c                 |  15 ++--
 tools/perf/Documentation/topdown.txt |   9 +-
 11 files changed, 236 insertions(+), 51 deletions(-)

-- 
2.38.1



* [PATCH V4 1/5] perf/x86: Extend event update interface
  2024-07-31 14:38 [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU kan.liang
@ 2024-07-31 14:38 ` kan.liang
  2024-08-01 14:03   ` Peter Zijlstra
  2024-07-31 14:38 ` [PATCH V4 2/5] perf: Extend perf_output_read kan.liang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: kan.liang @ 2024-07-31 14:38 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, Kan Liang, Sandipan Das, Ravi Bangoria, silviazhao

From: Kan Liang <kan.liang@linux.intel.com>

The current event update interface reads the values directly from the
counter, but those may not be the accurate values users require. For
example, the sample read feature wants the counter values of the member
events at the moment the leader event overflows. With the current
implementation, the read (event update) actually happens in the NMI
handler, and there may be a small gap between the overflow and the NMI
handler. The new Intel PEBS counters snapshotting feature can provide
the accurate counter value at the overflow. The event update interface
has to be extended to apply such externally provided, accurate values.

Pass the accurate values via the event update interface. If the value is
not available, still directly read the counter.

Use u64 * rather than u64 as the new parameter, because 0 might be a
valid rdpmc() value; !val could not distinguish between an argument
being present and one being absent. Also, in some cases, e.g.,
intel_update_topdown_event(), more than one counter/register is read.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Sandipan Das <sandipan.das@amd.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: silviazhao <silviazhao-oc@zhaoxin.com>
---
 arch/x86/events/amd/core.c     |  2 +-
 arch/x86/events/core.c         | 13 ++++++-----
 arch/x86/events/intel/core.c   | 40 +++++++++++++++++++---------------
 arch/x86/events/intel/p4.c     |  2 +-
 arch/x86/events/perf_event.h   |  4 ++--
 arch/x86/events/zhaoxin/core.c |  2 +-
 6 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 920e3a640cad..284bf6157545 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -986,7 +986,7 @@ static int amd_pmu_v2_handle_irq(struct pt_regs *regs)
 
 		event = cpuc->events[idx];
 		hwc = &event->hw;
-		x86_perf_event_update(event);
+		x86_perf_event_update(event, NULL);
 		mask = BIT_ULL(idx);
 
 		if (!(status & mask))
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 12f2a0c14d33..07a56bf71160 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -112,7 +112,7 @@ u64 __read_mostly hw_cache_extra_regs
  * Can only be executed on the CPU where the event is active.
  * Returns the delta events processed.
  */
-u64 x86_perf_event_update(struct perf_event *event)
+u64 x86_perf_event_update(struct perf_event *event, u64 *val)
 {
 	struct hw_perf_event *hwc = &event->hw;
 	int shift = 64 - x86_pmu.cntval_bits;
@@ -131,7 +131,10 @@ u64 x86_perf_event_update(struct perf_event *event)
 	 */
 	prev_raw_count = local64_read(&hwc->prev_count);
 	do {
-		rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+		if (!val)
+			rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+		else
+			new_raw_count = *val;
 	} while (!local64_try_cmpxchg(&hwc->prev_count,
 				      &prev_raw_count, new_raw_count));
 
@@ -1598,7 +1601,7 @@ void x86_pmu_stop(struct perf_event *event, int flags)
 		 * Drain the remaining delta count out of a event
 		 * that we are disabling:
 		 */
-		static_call(x86_pmu_update)(event);
+		static_call(x86_pmu_update)(event, NULL);
 		hwc->state |= PERF_HES_UPTODATE;
 	}
 }
@@ -1689,7 +1692,7 @@ int x86_pmu_handle_irq(struct pt_regs *regs)
 
 		event = cpuc->events[idx];
 
-		val = static_call(x86_pmu_update)(event);
+		val = static_call(x86_pmu_update)(event, NULL);
 		if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
 			continue;
 
@@ -2036,7 +2039,7 @@ static void x86_pmu_static_call_update(void)
 
 static void _x86_pmu_read(struct perf_event *event)
 {
-	static_call(x86_pmu_update)(event);
+	static_call(x86_pmu_update)(event, NULL);
 }
 
 void x86_pmu_show_pmu_cap(struct pmu *pmu)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..f32d47cbe37f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2418,7 +2418,7 @@ static void intel_pmu_nhm_workaround(void)
 	for (i = 0; i < 4; i++) {
 		event = cpuc->events[i];
 		if (event)
-			static_call(x86_pmu_update)(event);
+			static_call(x86_pmu_update)(event, NULL);
 	}
 
 	for (i = 0; i < 4; i++) {
@@ -2710,7 +2710,7 @@ static void update_saved_topdown_regs(struct perf_event *event, u64 slots,
  * modify by a NMI. PMU has to be disabled before calling this function.
  */
 
-static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
+static u64 intel_update_topdown_event(struct perf_event *event, int metric_end, u64 *val)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_event *other;
@@ -2718,13 +2718,18 @@ static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
 	bool reset = true;
 	int idx;
 
-	/* read Fixed counter 3 */
-	rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
-	if (!slots)
-		return 0;
+	if (!val) {
+		/* read Fixed counter 3 */
+		rdpmcl((3 | INTEL_PMC_FIXED_RDPMC_BASE), slots);
+		if (!slots)
+			return 0;
 
-	/* read PERF_METRICS */
-	rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+		/* read PERF_METRICS */
+		rdpmcl(INTEL_PMC_FIXED_RDPMC_METRICS, metrics);
+	} else {
+		slots = val[0];
+		metrics = val[1];
+	}
 
 	for_each_set_bit(idx, cpuc->active_mask, metric_end + 1) {
 		if (!is_topdown_idx(idx))
@@ -2767,10 +2772,11 @@ static u64 intel_update_topdown_event(struct perf_event *event, int metric_end)
 	return slots;
 }
 
-static u64 icl_update_topdown_event(struct perf_event *event)
+static u64 icl_update_topdown_event(struct perf_event *event, u64 *val)
 {
 	return intel_update_topdown_event(event, INTEL_PMC_IDX_METRIC_BASE +
-						 x86_pmu.num_topdown_events - 1);
+						 x86_pmu.num_topdown_events - 1,
+					  val);
 }
 
 DEFINE_STATIC_CALL(intel_pmu_update_topdown_event, x86_perf_event_update);
@@ -2785,7 +2791,7 @@ static void intel_pmu_read_topdown_event(struct perf_event *event)
 		return;
 
 	perf_pmu_disable(event->pmu);
-	static_call(intel_pmu_update_topdown_event)(event);
+	static_call(intel_pmu_update_topdown_event)(event, NULL);
 	perf_pmu_enable(event->pmu);
 }
 
@@ -2796,7 +2802,7 @@ static void intel_pmu_read_event(struct perf_event *event)
 	else if (is_topdown_count(event))
 		intel_pmu_read_topdown_event(event);
 	else
-		x86_perf_event_update(event);
+		x86_perf_event_update(event, NULL);
 }
 
 static void intel_pmu_enable_fixed(struct perf_event *event)
@@ -2899,7 +2905,7 @@ static void intel_pmu_add_event(struct perf_event *event)
  */
 int intel_pmu_save_and_restart(struct perf_event *event)
 {
-	static_call(x86_pmu_update)(event);
+	static_call(x86_pmu_update)(event, NULL);
 	/*
 	 * For a checkpointed counter always reset back to 0.  This
 	 * avoids a situation where the counter overflows, aborts the
@@ -2922,12 +2928,12 @@ static int intel_pmu_set_period(struct perf_event *event)
 	return x86_perf_event_set_period(event);
 }
 
-static u64 intel_pmu_update(struct perf_event *event)
+static u64 intel_pmu_update(struct perf_event *event, u64 *val)
 {
 	if (unlikely(is_topdown_count(event)))
-		return static_call(intel_pmu_update_topdown_event)(event);
+		return static_call(intel_pmu_update_topdown_event)(event, val);
 
-	return x86_perf_event_update(event);
+	return x86_perf_event_update(event, val);
 }
 
 static void intel_pmu_reset(void)
@@ -3091,7 +3097,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	 */
 	if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
 		handled++;
-		static_call(intel_pmu_update_topdown_event)(NULL);
+		static_call(intel_pmu_update_topdown_event)(NULL, NULL);
 	}
 
 	/*
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index 844bc4fc4724..3177be0dedd1 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -1058,7 +1058,7 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
 		/* it might be unflagged overflow */
 		overflow = p4_pmu_clear_cccr_ovf(hwc);
 
-		val = x86_perf_event_update(event);
+		val = x86_perf_event_update(event, NULL);
 		if (!overflow && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
 			continue;
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ac1182141bf6..2cb5c2e31b1f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -782,7 +782,7 @@ struct x86_pmu {
 	void		(*del)(struct perf_event *);
 	void		(*read)(struct perf_event *event);
 	int		(*set_period)(struct perf_event *event);
-	u64		(*update)(struct perf_event *event);
+	u64		(*update)(struct perf_event *event, u64 *val);
 	int		(*hw_config)(struct perf_event *event);
 	int		(*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
 	unsigned	eventsel;
@@ -1131,7 +1131,7 @@ extern u64 __read_mostly hw_cache_extra_regs
 				[PERF_COUNT_HW_CACHE_OP_MAX]
 				[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
-u64 x86_perf_event_update(struct perf_event *event);
+u64 x86_perf_event_update(struct perf_event *event, u64 *cntr);
 
 static inline unsigned int x86_pmu_config_addr(int index)
 {
diff --git a/arch/x86/events/zhaoxin/core.c b/arch/x86/events/zhaoxin/core.c
index 2fd9b0cf9a5e..5fe3a9eed650 100644
--- a/arch/x86/events/zhaoxin/core.c
+++ b/arch/x86/events/zhaoxin/core.c
@@ -391,7 +391,7 @@ static int zhaoxin_pmu_handle_irq(struct pt_regs *regs)
 		if (!test_bit(bit, cpuc->active_mask))
 			continue;
 
-		x86_perf_event_update(event);
+		x86_perf_event_update(event, NULL);
 		perf_sample_data_init(&data, 0, event->hw.last_period);
 
 		if (!x86_perf_event_set_period(event))
-- 
2.38.1



* [PATCH V4 2/5] perf: Extend perf_output_read
  2024-07-31 14:38 [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU kan.liang
  2024-07-31 14:38 ` [PATCH V4 1/5] perf/x86: Extend event update interface kan.liang
@ 2024-07-31 14:38 ` kan.liang
  2024-07-31 14:38 ` [PATCH V4 3/5] perf/x86/intel: Move PEBS event update after the sample output kan.liang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: kan.liang @ 2024-07-31 14:38 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The event may have been updated in the PMU-specific implementation,
e.g., Intel PEBS counters snapshotting. The common code should not
read and overwrite the value.

The PERF_SAMPLE_READ flag in data->sample_flags can be used to detect
whether the PMU-specific value is available. If it is, avoid the
pmu->read() in the common code.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 kernel/events/core.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index aa3450bdc227..fcc55d0b5848 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7269,7 +7269,7 @@ static void perf_output_read_one(struct perf_output_handle *handle,
 
 static void perf_output_read_group(struct perf_output_handle *handle,
 			    struct perf_event *event,
-			    u64 enabled, u64 running)
+			    u64 enabled, u64 running, bool read)
 {
 	struct perf_event *leader = event->group_leader, *sub;
 	u64 read_format = event->attr.read_format;
@@ -7291,7 +7291,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
 		values[n++] = running;
 
-	if ((leader != event) &&
+	if ((leader != event) && read &&
 	    (leader->state == PERF_EVENT_STATE_ACTIVE))
 		leader->pmu->read(leader);
 
@@ -7306,7 +7306,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 	for_each_sibling_event(sub, leader) {
 		n = 0;
 
-		if ((sub != event) &&
+		if ((sub != event) && read &&
 		    (sub->state == PERF_EVENT_STATE_ACTIVE))
 			sub->pmu->read(sub);
 
@@ -7333,7 +7333,8 @@ static void perf_output_read_group(struct perf_output_handle *handle,
  * on another CPU, from interrupt/NMI context.
  */
 static void perf_output_read(struct perf_output_handle *handle,
-			     struct perf_event *event)
+			     struct perf_event *event,
+			     bool read)
 {
 	u64 enabled = 0, running = 0, now;
 	u64 read_format = event->attr.read_format;
@@ -7351,7 +7352,7 @@ static void perf_output_read(struct perf_output_handle *handle,
 		calc_timer_values(event, &now, &enabled, &running);
 
 	if (event->attr.read_format & PERF_FORMAT_GROUP)
-		perf_output_read_group(handle, event, enabled, running);
+		perf_output_read_group(handle, event, enabled, running, read);
 	else
 		perf_output_read_one(handle, event, enabled, running);
 }
@@ -7393,7 +7394,7 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_put(handle, data->period);
 
 	if (sample_type & PERF_SAMPLE_READ)
-		perf_output_read(handle, event);
+		perf_output_read(handle, event, !(data->sample_flags & PERF_SAMPLE_READ));
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
 		int size = 1;
@@ -7994,7 +7995,7 @@ perf_event_read_event(struct perf_event *event,
 		return;
 
 	perf_output_put(&handle, read_event);
-	perf_output_read(&handle, event);
+	perf_output_read(&handle, event, true);
 	perf_event__output_id_sample(event, &handle, &sample);
 
 	perf_output_end(&handle);
-- 
2.38.1



* [PATCH V4 3/5] perf/x86/intel: Move PEBS event update after the sample output
  2024-07-31 14:38 [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU kan.liang
  2024-07-31 14:38 ` [PATCH V4 1/5] perf/x86: Extend event update interface kan.liang
  2024-07-31 14:38 ` [PATCH V4 2/5] perf: Extend perf_output_read kan.liang
@ 2024-07-31 14:38 ` kan.liang
  2024-07-31 14:38 ` [PATCH V4 4/5] perf/x86/intel: Support PEBS counters snapshotting kan.liang
  2024-07-31 14:38 ` [PATCH V4 5/5] perf/x86/intel: Support RDPMC metrics clear mode kan.liang
  4 siblings, 0 replies; 10+ messages in thread
From: kan.liang @ 2024-07-31 14:38 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In drain_pebs(), besides outputting the sample data, perf needs to
update the PEBS event (e.g., prev_count, event->count, etc.) as well.
Both operations may invoke perf_event_update(), but for now the order
of the two operations doesn't matter, because the updated event value
is read directly from the counter via rdpmc, and the counter has
stopped in drain_pebs().

But if the updated event value comes from different places (a PEBS
record vs. the counter), the order does matter. For example, with the
new Intel PEBS counters snapshotting feature, large PEBS can be enabled
for the sample read, since the counter values for each sample are
recorded in the PEBS records. The current perf code does the PEBS event
update first, which updates the event for all the records at once. The
later sample read output then cannot dump a value for each sample,
since prev_count already holds the newest value from the current
counter.

Move the PEBS event update after the sample output. Each sample read
output then updates and outputs the value only for that sample
(according to the value in the PEBS record). Once all samples are
output, update the PEBS event again from the current counter, and set
the remaining period.

The !intel_pmu_save_and_restart() case only happens when
!hwc->event_base or when the remaining period is > 0. !hwc->event_base
is impossible for a PEBS event, which is only available on GP and fixed
counters. __intel_pmu_pebs_event() only processes overflowed samples,
so the remaining period should always be <= 0. It's safe to ignore the
return value of the intel_pmu_save_and_restart() check.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index fa5ea65de0d0..9c28c7e34b57 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2168,17 +2168,6 @@ __intel_pmu_pebs_event(struct perf_event *event,
 	void *at = get_next_pebs_record_by_bit(base, top, bit);
 	static struct pt_regs dummy_iregs;
 
-	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
-		/*
-		 * Now, auto-reload is only enabled in fixed period mode.
-		 * The reload value is always hwc->sample_period.
-		 * May need to change it, if auto-reload is enabled in
-		 * freq mode later.
-		 */
-		intel_pmu_save_and_restart_reload(event, count);
-	} else if (!intel_pmu_save_and_restart(event))
-		return;
-
 	if (!iregs)
 		iregs = &dummy_iregs;
 
@@ -2207,6 +2196,17 @@ __intel_pmu_pebs_event(struct perf_event *event,
 		if (perf_event_overflow(event, data, regs))
 			x86_pmu_stop(event, 0);
 	}
+
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Now, auto-reload is only enabled in fixed period mode.
+		 * The reload value is always hwc->sample_period.
+		 * May need to change it, if auto-reload is enabled in
+		 * freq mode later.
+		 */
+		intel_pmu_save_and_restart_reload(event, count);
+	} else
+		intel_pmu_save_and_restart(event);
 }
 
 static void intel_pmu_drain_pebs_core(struct pt_regs *iregs, struct perf_sample_data *data)
-- 
2.38.1



* [PATCH V4 4/5] perf/x86/intel: Support PEBS counters snapshotting
  2024-07-31 14:38 [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU kan.liang
                   ` (2 preceding siblings ...)
  2024-07-31 14:38 ` [PATCH V4 3/5] perf/x86/intel: Move PEBS event update after the sample output kan.liang
@ 2024-07-31 14:38 ` kan.liang
  2024-07-31 14:38 ` [PATCH V4 5/5] perf/x86/intel: Support RDPMC metrics clear mode kan.liang
  4 siblings, 0 replies; 10+ messages in thread
From: kan.liang @ 2024-07-31 14:38 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The counters snapshotting is a new adaptive PEBS extension, which can
capture programmable counters, fixed-function counters, and performance
metrics in a PEBS record. The feature is available in the PEBS format
V6.

The target counters can be configured in the new fields of MSR_PEBS_CFG.
The PEBS HW then writes a bit mask of the counters (the Counters Group
Header), followed by the contents of all requested counters, into the
PEBS record.

The current Linux perf sample read feature intends to read the counters
of the other member events when the leader event overflows. But the read
currently happens in the NMI handler, which may have a small gap from
the overflow. Use the counters snapshotting feature for the sample read
instead.

Add a new PEBS_CNTR flag to indicate a sample read group that utilizes
the counters snapshotting feature. When the group is scheduled, the
PEBS configuration can be updated accordingly.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c       |  33 ++++++++-
 arch/x86/events/intel/ds.c         | 114 +++++++++++++++++++++++++++--
 arch/x86/events/perf_event.h       |   3 +
 arch/x86/events/perf_event_flags.h |   2 +-
 arch/x86/include/asm/perf_event.h  |  15 ++++
 5 files changed, 157 insertions(+), 10 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f32d47cbe37f..1988de2dd4f4 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4058,6 +4058,19 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		event->hw.flags |= PERF_X86_EVENT_PEBS_VIA_PT;
 	}
 
+	if ((event->attr.sample_type & PERF_SAMPLE_READ) &&
+	    (x86_pmu.intel_cap.pebs_format >= 6)) {
+		struct perf_event *leader = event->group_leader;
+
+		if (is_slots_event(leader))
+			leader = list_next_entry(leader, sibling_list);
+
+		if (leader->attr.precise_ip) {
+			leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
+			event->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
+		}
+	}
+
 	if ((event->attr.type == PERF_TYPE_HARDWARE) ||
 	    (event->attr.type == PERF_TYPE_HW_CACHE))
 		return 0;
@@ -4161,6 +4174,24 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	return 0;
 }
 
+static int intel_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+	struct perf_event *event;
+	int ret = x86_schedule_events(cpuc, n, assign);
+
+	if (ret)
+		return ret;
+
+	if (cpuc->is_fake)
+		return ret;
+
+	event = cpuc->event_list[n - 1];
+	if (event && (event->hw.flags & PERF_X86_EVENT_PEBS_CNTR))
+		intel_pmu_pebs_update_cfg(cpuc, n, assign);
+
+	return 0;
+}
+
 /*
  * Currently, the only caller of this function is the atomic_switch_perf_msrs().
  * The host perf context helps to prepare the values of the real hardware for
@@ -5245,7 +5276,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.set_period		= intel_pmu_set_period,
 	.update			= intel_pmu_update,
 	.hw_config		= intel_pmu_hw_config,
-	.schedule_events	= x86_schedule_events,
+	.schedule_events	= intel_pmu_schedule_events,
 	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
 	.perfctr		= MSR_ARCH_PERFMON_PERFCTR0,
 	.fixedctr		= MSR_ARCH_PERFMON_FIXED_CTR0,
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 9c28c7e34b57..1bb9223c31cc 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1287,10 +1287,61 @@ static void adaptive_pebs_record_size_update(void)
 		sz += sizeof(struct pebs_xmm);
 	if (pebs_data_cfg & PEBS_DATACFG_LBRS)
 		sz += x86_pmu.lbr_nr * sizeof(struct lbr_entry);
+	if (pebs_data_cfg & (PEBS_DATACFG_METRICS | PEBS_DATACFG_CNTR)) {
+		sz += sizeof(struct pebs_cntr_header);
+
+		/* Metrics base and Metrics Data */
+		if (pebs_data_cfg & PEBS_DATACFG_METRICS)
+			sz += 2 * sizeof(u64);
+
+		if (pebs_data_cfg & PEBS_DATACFG_CNTR) {
+			sz += hweight64((pebs_data_cfg >> PEBS_DATACFG_CNTR_SHIFT) & PEBS_DATACFG_CNTR_MASK)
+			      * sizeof(u64);
+			sz += hweight64((pebs_data_cfg >> PEBS_DATACFG_FIX_SHIFT) & PEBS_DATACFG_FIX_MASK)
+			      * sizeof(u64);
+		}
+	}
 
 	cpuc->pebs_record_size = sz;
 }
 
+static void __intel_pmu_pebs_update_cfg(struct perf_event *event,
+					int idx, u64 *pebs_data_cfg)
+{
+	if (is_metric_event(event)) {
+		*pebs_data_cfg |= PEBS_DATACFG_METRICS;
+		return;
+	}
+
+	*pebs_data_cfg |= PEBS_DATACFG_CNTR;
+
+	if (idx >= INTEL_PMC_IDX_FIXED) {
+		*pebs_data_cfg |= ((1ULL << (idx - INTEL_PMC_IDX_FIXED)) & PEBS_DATACFG_FIX_MASK)
+				  << PEBS_DATACFG_FIX_SHIFT;
+	} else {
+		*pebs_data_cfg |= ((1ULL << idx) & PEBS_DATACFG_CNTR_MASK)
+				  << PEBS_DATACFG_CNTR_SHIFT;
+	}
+}
+
+void intel_pmu_pebs_update_cfg(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+	struct perf_event *leader, *event;
+	u64 pebs_data_cfg = 0;
+	int i = n - 1;
+
+	leader = cpuc->event_list[i]->group_leader;
+	for (; i >= 0; i--) {
+		event = cpuc->event_list[i];
+		if (leader != event->group_leader)
+			break;
+		__intel_pmu_pebs_update_cfg(event, assign[i], &pebs_data_cfg);
+	}
+
+	if (pebs_data_cfg & ~cpuc->pebs_data_cfg)
+		cpuc->pebs_data_cfg |= pebs_data_cfg | PEBS_UPDATE_DS_SW;
+}
+
 #define PERF_PEBS_MEMINFO_TYPE	(PERF_SAMPLE_ADDR | PERF_SAMPLE_DATA_SRC |   \
 				PERF_SAMPLE_PHYS_ADDR |			     \
 				PERF_SAMPLE_WEIGHT_TYPE |		     \
@@ -2034,6 +2085,40 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
 		}
 	}
 
+	if (format_size & (PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS)) {
+		struct pebs_cntr_header *cntr = next_record;
+		int bit;
+
+		next_record += sizeof(struct pebs_cntr_header);
+
+		for_each_set_bit(bit, (unsigned long *)&cntr->cntr, INTEL_PMC_MAX_GENERIC) {
+			x86_perf_event_update(cpuc->events[bit], (u64 *)next_record);
+			next_record += sizeof(u64);
+		}
+
+		for_each_set_bit(bit, (unsigned long *)&cntr->fixed, INTEL_PMC_MAX_FIXED) {
+			/* The slots event will be handled with perf_metric later */
+			if ((cntr->metrics == INTEL_CNTR_METRICS) &&
+			    (INTEL_PMC_IDX_FIXED_SLOTS == bit + INTEL_PMC_IDX_FIXED)) {
+				next_record += sizeof(u64);
+				continue;
+			}
+			x86_perf_event_update(cpuc->events[bit + INTEL_PMC_IDX_FIXED], (u64 *)next_record);
+			next_record += sizeof(u64);
+		}
+
+		/* HW will reload the value right after the overflow. */
+		if (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
+			local64_set(&event->hw.prev_count, (u64)-event->hw.sample_period);
+
+		if (cntr->metrics == INTEL_CNTR_METRICS) {
+			static_call(intel_pmu_update_topdown_event)
+					(event->group_leader, (u64 *)next_record);
+			next_record += 2 * sizeof(u64);
+		}
+		data->sample_flags |= PERF_SAMPLE_READ;
+	}
+
 	WARN_ONCE(next_record != __pebs + (format_size >> 48),
 			"PEBS record size %llu, expected %llu, config %llx\n",
 			format_size >> 48,
@@ -2198,13 +2283,22 @@ __intel_pmu_pebs_event(struct perf_event *event,
 	}
 
 	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
-		/*
-		 * Now, auto-reload is only enabled in fixed period mode.
-		 * The reload value is always hwc->sample_period.
-		 * May need to change it, if auto-reload is enabled in
-		 * freq mode later.
-		 */
-		intel_pmu_save_and_restart_reload(event, count);
+		if (event->hw.flags & PERF_X86_EVENT_PEBS_CNTR) {
+			/*
+			 * The value of each sample has been updated when setup
+			 * the corresponding sample data. But there may be a small
+			 * gap between the last overflow and the drain_pebs().
+			 */
+			intel_pmu_save_and_restart_reload(event, 0);
+		} else {
+			/*
+			 * Now, auto-reload is only enabled in fixed period mode.
+			 * The reload value is always hwc->sample_period.
+			 * May need to change it, if auto-reload is enabled in
+			 * freq mode later.
+			 */
+			intel_pmu_save_and_restart_reload(event, count);
+		}
 	} else
 		intel_pmu_save_and_restart(event);
 }
@@ -2496,6 +2590,10 @@ void __init intel_ds_init(void)
 			x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
 			break;
 
+		case 6:
+			if (x86_pmu.intel_cap.pebs_baseline)
+				x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
+			fallthrough;
 		case 5:
 			x86_pmu.pebs_ept = 1;
 			fallthrough;
@@ -2520,7 +2618,7 @@ void __init intel_ds_init(void)
 					  PERF_SAMPLE_REGS_USER |
 					  PERF_SAMPLE_REGS_INTR);
 			}
-			pr_cont("PEBS fmt4%c%s, ", pebs_type, pebs_qual);
+			pr_cont("PEBS fmt%d%c%s, ", format, pebs_type, pebs_qual);
 
 			if (!is_hybrid() && x86_pmu.intel_cap.pebs_output_pt_available) {
 				pr_cont("PEBS-via-PT, ");
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2cb5c2e31b1f..de839dfa7dfb 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1132,6 +1132,7 @@ extern u64 __read_mostly hw_cache_extra_regs
 				[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
 u64 x86_perf_event_update(struct perf_event *event, u64 *cntr);
+DECLARE_STATIC_CALL(intel_pmu_update_topdown_event, x86_perf_event_update);
 
 static inline unsigned int x86_pmu_config_addr(int index)
 {
@@ -1626,6 +1627,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 
+void intel_pmu_pebs_update_cfg(struct cpu_hw_events *cpuc, int n, int *assign);
+
 void intel_pmu_auto_reload_read(struct perf_event *event);
 
 void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
diff --git a/arch/x86/events/perf_event_flags.h b/arch/x86/events/perf_event_flags.h
index 6c977c19f2cd..1d9e385649b5 100644
--- a/arch/x86/events/perf_event_flags.h
+++ b/arch/x86/events/perf_event_flags.h
@@ -9,7 +9,7 @@ PERF_ARCH(PEBS_LD_HSW,		0x00008) /* haswell style datala, load */
 PERF_ARCH(PEBS_NA_HSW,		0x00010) /* haswell style datala, unknown */
 PERF_ARCH(EXCL,			0x00020) /* HT exclusivity on counter */
 PERF_ARCH(DYNAMIC,		0x00040) /* dynamic alloc'd constraint */
-			/*	0x00080	*/
+PERF_ARCH(PEBS_CNTR,		0x00080) /* PEBS counters snapshot */
 PERF_ARCH(EXCL_ACCT,		0x00100) /* accounted EXCL event */
 PERF_ARCH(AUTO_RELOAD,		0x00200) /* use PEBS auto-reload */
 PERF_ARCH(LARGE_PEBS,		0x00400) /* use large PEBS */
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 91b73571412f..709746cd7c19 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -140,6 +140,12 @@
 #define PEBS_DATACFG_XMMS	BIT_ULL(2)
 #define PEBS_DATACFG_LBRS	BIT_ULL(3)
 #define PEBS_DATACFG_LBR_SHIFT	24
+#define PEBS_DATACFG_CNTR	BIT_ULL(4)
+#define PEBS_DATACFG_CNTR_SHIFT	32
+#define PEBS_DATACFG_CNTR_MASK	GENMASK_ULL(15, 0)
+#define PEBS_DATACFG_FIX_SHIFT	48
+#define PEBS_DATACFG_FIX_MASK	GENMASK_ULL(7, 0)
+#define PEBS_DATACFG_METRICS	BIT_ULL(5)
 
 /* Steal the highest bit of pebs_data_cfg for SW usage */
 #define PEBS_UPDATE_DS_SW	BIT_ULL(63)
@@ -444,6 +450,15 @@ struct pebs_xmm {
 	u64 xmm[16*2];	/* two entries for each register */
 };
 
+struct pebs_cntr_header {
+	u32 cntr;
+	u32 fixed;
+	u32 metrics;
+	u32 reserved;
+};
+
+#define INTEL_CNTR_METRICS		0x3
+
 /*
  * AMD Extended Performance Monitoring and Debug cpuid feature detection
  */
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V4 5/5] perf/x86/intel: Support RDPMC metrics clear mode
  2024-07-31 14:38 [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU kan.liang
                   ` (3 preceding siblings ...)
  2024-07-31 14:38 ` [PATCH V4 4/5] perf/x86/intel: Support PEBS counters snapshotting kan.liang
@ 2024-07-31 14:38 ` kan.liang
  4 siblings, 0 replies; 10+ messages in thread
From: kan.liang @ 2024-07-31 14:38 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The new RDPMC enhancement, metrics clear mode, clears the
PERF_METRICS-related resources as well as the fixed-function performance
monitoring counter 3 after a read is performed. It is available in
ring 3. The feature is enumerated by
IA32_PERF_CAPABILITIES.RDPMC_CLEAR_METRICS[bit 19]. To enable the
feature, IA32_FIXED_CTR_CTRL.METRICS_CLEAR_EN[bit 14] must be set.

Two ways were considered to enable the feature:
- Expose a global knob in sysfs. One user changing the knob may affect
  the measurements of other users, so this approach was dropped.
- Introduce a new event format, metrics_clear, for the slots event to
  enable/disable the feature only for the current process. Users can
  utilize the feature as needed.
The latter solution is implemented in this patch.

KVM doesn't support perf metrics yet. For virtualization, the feature
can be enabled separately later.

Also update the perf metrics documentation.

Suggested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c         | 20 +++++++++++++++++++-
 arch/x86/events/perf_event.h         |  1 +
 arch/x86/include/asm/perf_event.h    |  4 ++++
 tools/perf/Documentation/topdown.txt |  9 +++++++--
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 1988de2dd4f4..ba981b37900e 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2822,6 +2822,9 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
 			return;
 
 		idx = INTEL_PMC_IDX_FIXED_SLOTS;
+
+		if (event->attr.config1 & INTEL_TD_CFG_METRIC_CLEAR)
+			bits |= INTEL_FIXED_3_METRICS_CLEAR;
 	}
 
 	intel_set_masks(event, idx);
@@ -4086,7 +4089,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	 * is used in a metrics group, it too cannot support sampling.
 	 */
 	if (intel_pmu_has_cap(event, PERF_CAP_METRICS_IDX) && is_topdown_event(event)) {
-		if (event->attr.config1 || event->attr.config2)
+		/* The metrics_clear can only be set for the slots event */
+		if (event->attr.config1 &&
+		    (!is_slots_event(event) || (event->attr.config1 & ~INTEL_TD_CFG_METRIC_CLEAR)))
+			return -EINVAL;
+
+		if (event->attr.config2)
 			return -EINVAL;
 
 		/*
@@ -4673,6 +4681,8 @@ PMU_FORMAT_ATTR(in_tx,  "config:32"	);
 PMU_FORMAT_ATTR(in_tx_cp, "config:33"	);
 PMU_FORMAT_ATTR(eq,	"config:36"	); /* v6 + */
 
+PMU_FORMAT_ATTR(metrics_clear,	"config1:0"); /* PERF_CAPABILITIES.RDPMC_METRICS_CLEAR */
+
 static ssize_t umask2_show(struct device *dev,
 			   struct device_attribute *attr,
 			   char *page)
@@ -4692,6 +4702,7 @@ static struct device_attribute format_attr_umask2  =
 static struct attribute *format_evtsel_ext_attrs[] = {
 	&format_attr_umask2.attr,
 	&format_attr_eq.attr,
+	&format_attr_metrics_clear.attr,
 	NULL
 };
 
@@ -4716,6 +4727,13 @@ evtsel_ext_is_visible(struct kobject *kobj, struct attribute *attr, int i)
 	if (i == 1)
 		return (mask & ARCH_PERFMON_EVENTSEL_EQ) ? attr->mode : 0;
 
+	/* PERF_CAPABILITIES.RDPMC_METRICS_CLEAR */
+	if (i == 2) {
+		union perf_capabilities intel_cap = hybrid(dev_get_drvdata(dev), intel_cap);
+
+		return intel_cap.rdpmc_metrics_clear ? attr->mode : 0;
+	}
+
 	return 0;
 }
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index de839dfa7dfb..c50f8b4f7a89 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -624,6 +624,7 @@ union perf_capabilities {
 		u64	pebs_output_pt_available:1;
 		u64	pebs_timing_info:1;
 		u64	anythread_deprecated:1;
+		u64	rdpmc_metrics_clear:1;
 	};
 	u64	capabilities;
 };
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 709746cd7c19..21e1d1fe5972 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -41,6 +41,7 @@
 #define INTEL_FIXED_0_USER				(1ULL << 1)
 #define INTEL_FIXED_0_ANYTHREAD			(1ULL << 2)
 #define INTEL_FIXED_0_ENABLE_PMI			(1ULL << 3)
+#define INTEL_FIXED_3_METRICS_CLEAR			(1ULL << 2)
 
 #define HSW_IN_TX					(1ULL << 32)
 #define HSW_IN_TX_CHECKPOINTED				(1ULL << 33)
@@ -378,6 +379,9 @@ static inline bool use_fixed_pseudo_encoding(u64 code)
 #define INTEL_TD_METRIC_MAX			INTEL_TD_METRIC_MEM_BOUND
 #define INTEL_TD_METRIC_NUM			8
 
+#define INTEL_TD_CFG_METRIC_CLEAR_BIT		0
+#define INTEL_TD_CFG_METRIC_CLEAR		BIT_ULL(INTEL_TD_CFG_METRIC_CLEAR_BIT)
+
 static inline bool is_metric_idx(int idx)
 {
 	return (unsigned)(idx - INTEL_PMC_IDX_METRIC_BASE) < INTEL_TD_METRIC_NUM;
diff --git a/tools/perf/Documentation/topdown.txt b/tools/perf/Documentation/topdown.txt
index ae0aee86844f..f36c8ca1dc53 100644
--- a/tools/perf/Documentation/topdown.txt
+++ b/tools/perf/Documentation/topdown.txt
@@ -280,8 +280,13 @@ with no longer interval than a few seconds
 
 	perf stat -I 1000 --topdown ...
 
-For user programs using RDPMC directly the counter can
-be reset explicitly using ioctl:
+Starting from the Lunar Lake p-core, an RDPMC metrics clear mode is
+introduced. The metrics and the fixed counter 3 are automatically
+cleared after a read is performed. It is recommended to always enable
+the mode. To enable it, set the config1 of the slots event to 1.
+
+On previous platforms, user programs using RDPMC directly have to
+reset the counter explicitly using ioctl:
 
 	ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);
 
-- 
2.38.1



* Re: [PATCH V4 1/5] perf/x86: Extend event update interface
  2024-07-31 14:38 ` [PATCH V4 1/5] perf/x86: Extend event update interface kan.liang
@ 2024-08-01 14:03   ` Peter Zijlstra
  2024-08-01 15:31     ` Liang, Kan
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2024-08-01 14:03 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, alexander.shishkin,
	linux-kernel, ak, eranian, Sandipan Das, Ravi Bangoria,
	silviazhao

On Wed, Jul 31, 2024 at 07:38:31AM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> The current event update interface directly reads the values from the
> counter, but the values may not be the accurate ones users require. For
> example, the sample read feature wants the counter value of the member
> events when the leader event is overflow. But with the current
> implementation, the read (event update) actually happens in the NMI
> handler. There may be a small gap between the overflow and the NMI
> handler.

This...

> The new Intel PEBS counters snapshotting feature can provide
> the accurate counter value in the overflow. The event update interface
> has to be updated to apply the given accurate values.
> 
> Pass the accurate values via the event update interface. If the value is
> not available, still directly read the counter.

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 12f2a0c14d33..07a56bf71160 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -112,7 +112,7 @@ u64 __read_mostly hw_cache_extra_regs
>   * Can only be executed on the CPU where the event is active.
>   * Returns the delta events processed.
>   */
> -u64 x86_perf_event_update(struct perf_event *event)
> +u64 x86_perf_event_update(struct perf_event *event, u64 *val)
>  {
>  	struct hw_perf_event *hwc = &event->hw;
>  	int shift = 64 - x86_pmu.cntval_bits;
> @@ -131,7 +131,10 @@ u64 x86_perf_event_update(struct perf_event *event)
>  	 */
>  	prev_raw_count = local64_read(&hwc->prev_count);
>  	do {
> -		rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +		if (!val)
> +			rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +		else
> +			new_raw_count = *val;
>  	} while (!local64_try_cmpxchg(&hwc->prev_count,
>  				      &prev_raw_count, new_raw_count));
>  

Does that mean the following is possible?

Two counters: C0 and C1, where C0 is a PEBS counter that also samples
C1.

  C0: overflow-with-PEBS-assist -> PEBS entry with counter value A
      (DS buffer threshold not reached)

  C1: overflow -> PMI -> x86_perf_event_update(C1, NULL)
      rdpmcl reads value 'A+d', and sets prev_raw_count

  C0: more assists, hit DS threshold -> PMI
      PEBS processing does x86_perf_event_update(C1, A)
      and sets prev_raw_count *backwards*

How is that sane?


* Re: [PATCH V4 1/5] perf/x86: Extend event update interface
  2024-08-01 14:03   ` Peter Zijlstra
@ 2024-08-01 15:31     ` Liang, Kan
  2024-08-01 16:36       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Liang, Kan @ 2024-08-01 15:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, alexander.shishkin,
	linux-kernel, ak, eranian, Sandipan Das, Ravi Bangoria,
	silviazhao



On 2024-08-01 10:03 a.m., Peter Zijlstra wrote:
> On Wed, Jul 31, 2024 at 07:38:31AM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The current event update interface directly reads the values from the
>> counter, but the values may not be the accurate ones users require. For
>> example, the sample read feature wants the counter value of the member
>> events when the leader event is overflow. But with the current
>> implementation, the read (event update) actually happens in the NMI
>> handler. There may be a small gap between the overflow and the NMI
>> handler.
> 
> This...
> 
>> The new Intel PEBS counters snapshotting feature can provide
>> the accurate counter value in the overflow. The event update interface
>> has to be updated to apply the given accurate values.
>>
>> Pass the accurate values via the event update interface. If the value is
>> not available, still directly read the counter.
> 
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 12f2a0c14d33..07a56bf71160 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -112,7 +112,7 @@ u64 __read_mostly hw_cache_extra_regs
>>   * Can only be executed on the CPU where the event is active.
>>   * Returns the delta events processed.
>>   */
>> -u64 x86_perf_event_update(struct perf_event *event)
>> +u64 x86_perf_event_update(struct perf_event *event, u64 *val)
>>  {
>>  	struct hw_perf_event *hwc = &event->hw;
>>  	int shift = 64 - x86_pmu.cntval_bits;
>> @@ -131,7 +131,10 @@ u64 x86_perf_event_update(struct perf_event *event)
>>  	 */
>>  	prev_raw_count = local64_read(&hwc->prev_count);
>>  	do {
>> -		rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>> +		if (!val)
>> +			rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>> +		else
>> +			new_raw_count = *val;
>>  	} while (!local64_try_cmpxchg(&hwc->prev_count,
>>  				      &prev_raw_count, new_raw_count));
>>  
> 
> Does that mean the following is possible?
> 
> Two counters: C0 and C1, where C0 is a PEBS counter that also samples
> C1.
> 
>   C0: overflow-with-PEBS-assist -> PEBS entry with counter value A
>       (DS buffer threshold not reached)
> 
>   C1: overflow -> PMI -> x86_perf_event_update(C1, NULL)
>       rdpmcl reads value 'A+d', and sets prev_raw_count
> 
>   C0: more assists, hit DS threshold -> PMI
>       PEBS processing does x86_perf_event_update(C1, A)
>       and sets prev_raw_count *backwards*

I think the C0 PMI handler doesn't touch other counters unless
PERF_SAMPLE_READ is applied. With PERF_SAMPLE_READ, only one counter
does the sampling, so it's impossible for C0 and C1 to both be sampling
at the same time. I don't think the above scenario is possible.

Maybe we can add the below check to further prevent the abuse of the
interface.

WARN_ON_ONCE(!(event->attr.sample_type & PERF_SAMPLE_READ) && val);

Thanks,
Kan



* Re: [PATCH V4 1/5] perf/x86: Extend event update interface
  2024-08-01 15:31     ` Liang, Kan
@ 2024-08-01 16:36       ` Peter Zijlstra
  2024-08-01 19:18         ` Liang, Kan
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2024-08-01 16:36 UTC (permalink / raw)
  To: Liang, Kan
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, alexander.shishkin,
	linux-kernel, ak, eranian, Sandipan Das, Ravi Bangoria,
	silviazhao

On Thu, Aug 01, 2024 at 11:31:40AM -0400, Liang, Kan wrote:
> 
> 
> On 2024-08-01 10:03 a.m., Peter Zijlstra wrote:
> > On Wed, Jul 31, 2024 at 07:38:31AM -0700, kan.liang@linux.intel.com wrote:
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> The current event update interface directly reads the values from the
> >> counter, but the values may not be the accurate ones users require. For
> >> example, the sample read feature wants the counter value of the member
> >> events when the leader event is overflow. But with the current
> >> implementation, the read (event update) actually happens in the NMI
> >> handler. There may be a small gap between the overflow and the NMI
> >> handler.
> > 
> > This...
> > 
> >> The new Intel PEBS counters snapshotting feature can provide
> >> the accurate counter value in the overflow. The event update interface
> >> has to be updated to apply the given accurate values.
> >>
> >> Pass the accurate values via the event update interface. If the value is
> >> not available, still directly read the counter.
> > 
> >> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> >> index 12f2a0c14d33..07a56bf71160 100644
> >> --- a/arch/x86/events/core.c
> >> +++ b/arch/x86/events/core.c
> >> @@ -112,7 +112,7 @@ u64 __read_mostly hw_cache_extra_regs
> >>   * Can only be executed on the CPU where the event is active.
> >>   * Returns the delta events processed.
> >>   */
> >> -u64 x86_perf_event_update(struct perf_event *event)
> >> +u64 x86_perf_event_update(struct perf_event *event, u64 *val)
> >>  {
> >>  	struct hw_perf_event *hwc = &event->hw;
> >>  	int shift = 64 - x86_pmu.cntval_bits;
> >> @@ -131,7 +131,10 @@ u64 x86_perf_event_update(struct perf_event *event)
> >>  	 */
> >>  	prev_raw_count = local64_read(&hwc->prev_count);
> >>  	do {
> >> -		rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> >> +		if (!val)
> >> +			rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> >> +		else
> >> +			new_raw_count = *val;
> >>  	} while (!local64_try_cmpxchg(&hwc->prev_count,
> >>  				      &prev_raw_count, new_raw_count));
> >>  
> > 
> > Does that mean the following is possible?
> > 
> > Two counters: C0 and C1, where C0 is a PEBS counter that also samples
> > C1.
> > 
> >   C0: overflow-with-PEBS-assist -> PEBS entry with counter value A
> >       (DS buffer threshold not reached)
> > 
> >   C1: overflow -> PMI -> x86_perf_event_update(C1, NULL)
> >       rdpmcl reads value 'A+d', and sets prev_raw_count
> > 
> >   C0: more assists, hit DS threshold -> PMI
> >       PEBS processing does x86_perf_event_update(C1, A)
> >       and sets prev_raw_count *backwards*
> 
> I think the C0 PMI handler doesn't touch other counters unless
> PERF_SAMPLE_READ is applied. For the PERF_SAMPLE_READ, only one counter
> does sampling. It's impossible that C0 and C1 do sampling at the same
> time. I don't think the above scenario is possible.

It is perfectly fine for C0 to have PERF_SAMPLE_READ and C1 to be a
normal counter, sampling or otherwise.

> Maybe we can add the below check to further prevent the abuse of the
> interface.

There is no abuse in the above scenario. You can have a group with all
sampling events and any number of them can have PERF_SAMPLE_READ. This
is perfectly fine.

> WARN_ON_ONCE(!(event->attr.sample_type & PERF_SAMPLE_READ) && val);

I don't see how PERF_SAMPLE_READ is relevant; *any* PMI for the C1 event
will cause x86_perf_event_update() to be called. And remember that even
non-sampling events have EVENTSEL_INT set to deal with counter overflow.

The problem here is that C0/PEBS will come in late and try to force
update an out-of-date value.

If you have C1 be a non-sampling event, this will typically not happen,
but it still *can*, and when you do, you get your counter moving
backwards.


* Re: [PATCH V4 1/5] perf/x86: Extend event update interface
  2024-08-01 16:36       ` Peter Zijlstra
@ 2024-08-01 19:18         ` Liang, Kan
  0 siblings, 0 replies; 10+ messages in thread
From: Liang, Kan @ 2024-08-01 19:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, alexander.shishkin,
	linux-kernel, ak, eranian, Sandipan Das, Ravi Bangoria,
	silviazhao



On 2024-08-01 12:36 p.m., Peter Zijlstra wrote:
> On Thu, Aug 01, 2024 at 11:31:40AM -0400, Liang, Kan wrote:
>>
>>
>> On 2024-08-01 10:03 a.m., Peter Zijlstra wrote:
>>> On Wed, Jul 31, 2024 at 07:38:31AM -0700, kan.liang@linux.intel.com wrote:
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> The current event update interface directly reads the values from the
>>>> counter, but the values may not be the accurate ones users require. For
>>>> example, the sample read feature wants the counter value of the member
>>>> events when the leader event is overflow. But with the current
>>>> implementation, the read (event update) actually happens in the NMI
>>>> handler. There may be a small gap between the overflow and the NMI
>>>> handler.
>>>
>>> This...
>>>
>>>> The new Intel PEBS counters snapshotting feature can provide
>>>> the accurate counter value in the overflow. The event update interface
>>>> has to be updated to apply the given accurate values.
>>>>
>>>> Pass the accurate values via the event update interface. If the value is
>>>> not available, still directly read the counter.
>>>
>>>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>>>> index 12f2a0c14d33..07a56bf71160 100644
>>>> --- a/arch/x86/events/core.c
>>>> +++ b/arch/x86/events/core.c
>>>> @@ -112,7 +112,7 @@ u64 __read_mostly hw_cache_extra_regs
>>>>   * Can only be executed on the CPU where the event is active.
>>>>   * Returns the delta events processed.
>>>>   */
>>>> -u64 x86_perf_event_update(struct perf_event *event)
>>>> +u64 x86_perf_event_update(struct perf_event *event, u64 *val)
>>>>  {
>>>>  	struct hw_perf_event *hwc = &event->hw;
>>>>  	int shift = 64 - x86_pmu.cntval_bits;
>>>> @@ -131,7 +131,10 @@ u64 x86_perf_event_update(struct perf_event *event)
>>>>  	 */
>>>>  	prev_raw_count = local64_read(&hwc->prev_count);
>>>>  	do {
>>>> -		rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>>>> +		if (!val)
>>>> +			rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>>>> +		else
>>>> +			new_raw_count = *val;
>>>>  	} while (!local64_try_cmpxchg(&hwc->prev_count,
>>>>  				      &prev_raw_count, new_raw_count));
>>>>  
>>>
>>> Does that mean the following is possible?
>>>
>>> Two counters: C0 and C1, where C0 is a PEBS counter that also samples
>>> C1.
>>>
>>>   C0: overflow-with-PEBS-assist -> PEBS entry with counter value A
>>>       (DS buffer threshold not reached)
>>>
>>>   C1: overflow -> PMI -> x86_perf_event_update(C1, NULL)
>>>       rdpmcl reads value 'A+d', and sets prev_raw_count
>>>
>>>   C0: more assists, hit DS threshold -> PMI
>>>       PEBS processing does x86_perf_event_update(C1, A)
>>>       and sets prev_raw_count *backwards*
>>
>> I think the C0 PMI handler doesn't touch other counters unless
>> PERF_SAMPLE_READ is applied. With PERF_SAMPLE_READ, only one counter
>> does the sampling, so it's impossible for C0 and C1 to both be sampling
>> at the same time. I don't think the above scenario is possible.
> 
> It is perfectly fine for C0 to have PERF_SAMPLE_READ and C1 to be a
> normal counter, sampling or otherwise.
> 
>> Maybe we can add the below check to further prevent the abuse of the
>> interface.
> 
> There is no abuse in the above scenario. You can have a group with all
> sampling events and any number of them can have PERF_SAMPLE_READ. This
> is perfectly fine.
> 
>> WARN_ON_ONCE(!(event->attr.sample_type & PERF_SAMPLE_READ) && val);
> 
> I don't see how PERF_SAMPLE_READ is relevant, *any* PMI for the C1 event
> will cause x86_perf_event_update() to be called. And remember that even
> non-sampling events have EVENTSEL_INT set to deal with counter overflow.
> 
> The problem here is that C0/PEBS will come in late and try to force
> update an out-of-date value.
> 
> If you have C1 be a non-sampling event, this will typically not happen,
> but it still *can*, and when you do, you get your counter moving
> backwards.

Now, only the PEBS records may include an out-of-date value. I think we
can always drain PEBS before handling the overflow of a non-PEBS
event.

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..255eb7231181 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3109,6 +3109,9 @@ static int handle_pmi_common(struct pt_regs *regs,
u64 status)
 		if (!test_bit(bit, cpuc->active_mask))
 			continue;

+		if (event->hw.flags & PERF_X86_EVENT_PEBS_CNTR)
+			x86_pmu.drain_pebs(regs, &data);
+
 		if (!intel_pmu_save_and_restart(event))
 			continue;


Thanks,
Kan




Thread overview: 10+ messages
2024-07-31 14:38 [PATCH V4 0/5] Support Lunar Lake and Arrow Lake core PMU kan.liang
2024-07-31 14:38 ` [PATCH V4 1/5] perf/x86: Extend event update interface kan.liang
2024-08-01 14:03   ` Peter Zijlstra
2024-08-01 15:31     ` Liang, Kan
2024-08-01 16:36       ` Peter Zijlstra
2024-08-01 19:18         ` Liang, Kan
2024-07-31 14:38 ` [PATCH V4 2/5] perf: Extend perf_output_read kan.liang
2024-07-31 14:38 ` [PATCH V4 3/5] perf/x86/intel: Move PEBS event update after the sample output kan.liang
2024-07-31 14:38 ` [PATCH V4 4/5] perf/x86/intel: Support PEBS counters snapshotting kan.liang
2024-07-31 14:38 ` [PATCH V4 5/5] perf/x86/intel: Support RDPMC metrics clear mode kan.liang
