linux-perf-users.vger.kernel.org archive mirror
* [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit
@ 2024-07-30  8:44 Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 1/4] perf: Rename perf_event_context.nr_pending to nr_no_switch_fast Ben Gainey
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Ben Gainey @ 2024-07-30  8:44 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: james.clark, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, Ben Gainey

This change allows events to use PERF_SAMPLE_READ with inherit so long
as PERF_SAMPLE_TID is also set.

Currently it is not possible to use PERF_SAMPLE_READ with inherit. This
restriction assumes the user is interested in collecting aggregate
statistics as per `perf stat`. It prevents a user from collecting
per-thread samples using counter groups from a multi-threaded or
multi-process application, as with `perf record -e '{....}:S'`. Instead
users must use system-wide mode, or forgo the ability to sample counter
groups, or profile only a single thread. System-wide mode is often
problematic: it requires elevated permissions (root access or
CAP_PERFMON), and it may capture significant amounts of extra data
from other processes running on the system.

This patch changes `perf_event_alloc`, relaxing the restriction against
combining `inherit` with `PERF_SAMPLE_READ` so that the combination is
allowed so long as `PERF_SAMPLE_TID` is also enabled. It modifies
sampling so that only the count associated with the active thread is
recorded into the buffer. It modifies the context switch handling so
that perf contexts are always switched out if they have this kind of
event so that the correct per-thread state is maintained. Finally, the
tools are updated to allow perf record to specify this combination and
to correctly decode the sample data.

In this configuration sample values, as may appear in the read_format
field of a PERF_RECORD_SAMPLE, are no longer global counters. Instead
the value reports the per-thread value for the active thread.
Tools that expect the global total, for example when calculating a
delta between samples, would need updating to take this into account
when opting into this new behaviour. Previously valid event configurations
(system-wide, no-inherit and so on) are unaffected.


Changes since v8:
 - Rebase on v6.11-rc1

Changes since v7:
 - Rebase on v6.10-rc3
 - Respond to Peter Zijlstra's feedback:
 - Renamed nr_pending to nr_no_switch_fast and merged in nr_inherit_read
   which otherwise had overlapping use
 - Updated some of the commit messages to provide better justifications
   of the use case, behavioural changes and so on
 - Cleanup perf_event_count/_cumulative
 - Make it explicit that the sampling event decides whether or not the
   per-thread value is given in read_format for PERF_RECORD_SAMPLE and
   PERF_RECORD_READ; updated tools to account for this.

Changes since v6:
 - Rebase on v6.10-rc2
 - Make additional "perf test" tests succeed / skip based on kernel
   version as per feedback from Namhyung.

Changes since v5:
 - Rebase on v6.9
 - Cleanup feedback from Namhyung Kim

Changes since v4:
 - Rebase on v6.9-rc1
 - Removed the dependency on inherit_stat that was previously assumed
   necessary as per feedback from Namhyung Kim.
 - Fixed an incorrect use of zfree instead of free in the tools leading
   to an abort on tool shutdown.
 - Additional test coverage improvements added to perf test.
 - Cleaned up the remaining bit of irrelevant change missed between v3
   and v4.

Changes since v3:
 - Cleaned up perf test data changes incorrectly included into this
   series from elsewhere.

Changes since v2:
 - Rebase on v6.8
 - Respond to James Clark's feedback; fixed up some typos and moved some
   repeated checks into a helper macro.
 - Cleaned up checkpatch lints.
 - Updated perf test; fixed evsel handling so that existing tests pass
   and added new tests to cover the new behaviour.

Changes since v1:
 - Rebase on v6.8-rc1
 - Fixed value written into sample after child exits.
 - Modified handling of switch-out so that context with these events
   take the slow path, so that the per-event/per-thread PMU state is
   correctly switched.
 - Modified perf tools to support this mode of operation.

Ben Gainey (4):
  perf: Rename perf_event_context.nr_pending to nr_no_switch_fast.
  perf: Support PERF_SAMPLE_READ with inherit
  tools/perf: Correctly calculate sample period for inherited
    SAMPLE_READ values
  tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events

 include/linux/perf_event.h                    |  8 ++-
 kernel/events/core.c                          | 67 +++++++++++++------
 tools/lib/perf/evsel.c                        | 48 +++++++++++++
 tools/lib/perf/include/internal/evsel.h       | 63 ++++++++++++++++-
 tools/perf/tests/attr/README                  |  2 +
 .../tests/attr/test-record-group-sampling     |  3 +-
 .../tests/attr/test-record-group-sampling1    | 51 ++++++++++++++
 .../tests/attr/test-record-group-sampling2    | 61 +++++++++++++++++
 tools/perf/tests/attr/test-record-group2      |  1 +
 ...{test-record-group2 => test-record-group3} | 10 +--
 tools/perf/util/evsel.c                       | 19 +++++-
 tools/perf/util/evsel.h                       |  1 +
 tools/perf/util/session.c                     | 25 ++++---
 13 files changed, 320 insertions(+), 39 deletions(-)
 create mode 100644 tools/perf/tests/attr/test-record-group-sampling1
 create mode 100644 tools/perf/tests/attr/test-record-group-sampling2
 copy tools/perf/tests/attr/{test-record-group2 => test-record-group3} (81%)

-- 
2.45.2



* [PATCH v9 1/4] perf: Rename perf_event_context.nr_pending to nr_no_switch_fast.
  2024-07-30  8:44 [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
@ 2024-07-30  8:44 ` Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 2/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Ben Gainey @ 2024-07-30  8:44 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: james.clark, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, Ben Gainey

nr_pending counts the number of events in the context that have either
pending_sigtrap or pending_work set, but it is used to prevent taking
the fast path in perf_event_context_sched_out.

Rename it to reflect what it is used for, rather than what it counts.
This change allows using the field to track other event properties that
also require skipping the fast path, without possible confusion over
the name.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 include/linux/perf_event.h |  5 +++--
 kernel/events/core.c       | 12 ++++++------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1a8942277dda..87ccb7ca241f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -963,12 +963,13 @@ struct perf_event_context {
 	struct rcu_head			rcu_head;
 
 	/*
-	 * Sum (event->pending_work + event->pending_work)
+	 * The count of events for which using the switch-out fast path
+	 * should be avoided.
 	 *
 	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
 	 * that until the signal is delivered.
 	 */
-	local_t				nr_pending;
+	local_t				nr_no_switch_fast;
 };
 
 struct perf_cpu_pmu_context {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index aa3450bdc227..e6cc354a3cee 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3516,9 +3516,9 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 
 			perf_ctx_disable(ctx, false);
 
-			/* PMIs are disabled; ctx->nr_pending is stable. */
-			if (local_read(&ctx->nr_pending) ||
-			    local_read(&next_ctx->nr_pending)) {
+			/* PMIs are disabled; ctx->nr_no_switch_fast is stable. */
+			if (local_read(&ctx->nr_no_switch_fast) ||
+			    local_read(&next_ctx->nr_no_switch_fast)) {
 				/*
 				 * Must not swap out ctx when there's pending
 				 * events that rely on the ctx->task relation.
@@ -5204,7 +5204,7 @@ static void perf_pending_task_sync(struct perf_event *event)
 	 */
 	if (task_work_cancel(current, head)) {
 		event->pending_work = 0;
-		local_dec(&event->ctx->nr_pending);
+		local_dec(&event->ctx->nr_no_switch_fast);
 		return;
 	}
 
@@ -6868,7 +6868,7 @@ static void perf_pending_task(struct callback_head *head)
 	if (event->pending_work) {
 		event->pending_work = 0;
 		perf_sigtrap(event);
-		local_dec(&event->ctx->nr_pending);
+		local_dec(&event->ctx->nr_no_switch_fast);
 		rcuwait_wake_up(&event->pending_work_wait);
 	}
 	rcu_read_unlock();
@@ -9740,7 +9740,7 @@ static int __perf_event_overflow(struct perf_event *event,
 		if (!event->pending_work &&
 		    !task_work_add(current, &event->pending_task, notify_mode)) {
 			event->pending_work = pending_id;
-			local_inc(&event->ctx->nr_pending);
+			local_inc(&event->ctx->nr_no_switch_fast);
 
 			event->pending_addr = 0;
 			if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR))
-- 
2.45.2



* [PATCH v9 2/4] perf: Support PERF_SAMPLE_READ with inherit
  2024-07-30  8:44 [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 1/4] perf: Rename perf_event_context.nr_pending to nr_no_switch_fast Ben Gainey
@ 2024-07-30  8:44 ` Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 3/4] tools/perf: Correctly calculate sample period for inherited SAMPLE_READ values Ben Gainey
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Ben Gainey @ 2024-07-30  8:44 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: james.clark, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, Ben Gainey

This change allows events to use PERF_SAMPLE_READ with inherit
so long as PERF_SAMPLE_TID is also set. This enables sample based
profiling of a group of counters over a hierarchy of processes or
threads. This is useful, for example, for collecting per-thread
counters/metrics, event based sampling of multiple counters as a unit,
access to the enabled and running time when using multiplexing and so
on.

Prior to this, users were restricted to either collecting aggregate
statistics for a multi-threaded/-process application (e.g. with
"perf stat"), or to sample individual threads, or to profile the entire
system (which requires root or CAP_PERFMON, and may produce much more
data than is required). Theoretically a tool could poll for or otherwise
monitor thread/process creation and construct whatever events the user
is interested in using perf_event_open, for each new thread or process,
but this is racy, can lead to file-descriptor exhaustion, and ultimately
just replicates the behaviour of inherit, but in userspace.

This configuration differs from inherit without PERF_SAMPLE_READ in that
the accumulated event count, and consequently any sample (such as if
triggered by overflow of sample_period) will be on a per-thread rather
than on an aggregate basis.

The meaning of read_format::value field of both PERF_RECORD_READ and
PERF_RECORD_SAMPLE is changed such that if the sampled event uses this
new configuration then the values reported will be per-thread rather
than the global aggregate value. This is a change from the existing
semantics of read_format (where PERF_SAMPLE_READ is used without
inherit), but it is necessary to expose the per-thread counter values,
and it avoids reinventing a separate "read_format_thread" field that
otherwise replicates the same behaviour. This change should not break
existing tools, since this configuration was not previously valid and
was rejected by the kernel. Tools that opt into this new mode will need
to account for this when calculating the counter delta for a given
sample. Tools that wish to have both the per-thread and aggregate value
can perform the global aggregation themselves from the per-thread
values.

The change to read_format::value does not affect existing valid
perf_event_attr configurations, nor does it change the behaviour of
calls to "read" on an event descriptor. Both continue to report the
aggregate value for the entire thread/process hierarchy. The difference
between the results reported by "read" and PERF_RECORD_SAMPLE in this
new configuration is justified on the basis that it is not (easily)
possible for "read" to target a specific thread (the caller only has
the fd for the original parent event).

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       | 55 ++++++++++++++++++++++++++++----------
 2 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 87ccb7ca241f..6c96da389b30 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -966,6 +966,9 @@ struct perf_event_context {
 	 * The count of events for which using the switch-out fast path
 	 * should be avoided.
 	 *
+	 * Sum (event->pending_work + events with
+	 *    (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ)))
+	 *
 	 * The SIGTRAP is targeted at ctx->task, as such it won't do changing
 	 * that until the signal is delivered.
 	 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e6cc354a3cee..c01a32687dad 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1767,6 +1767,14 @@ perf_event_groups_next(struct perf_event *event, struct pmu *pmu)
 		event = rb_entry_safe(rb_next(&event->group_node),	\
 				typeof(*event), group_node))
 
+/*
+ * Does the event attribute request inherit with PERF_SAMPLE_READ
+ */
+static inline bool has_inherit_and_sample_read(struct perf_event_attr *attr)
+{
+	return attr->inherit && (attr->sample_type & PERF_SAMPLE_READ);
+}
+
 /*
  * Add an event from the lists for its context.
  * Must be called with ctx->mutex and ctx->lock held.
@@ -1797,6 +1805,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 		ctx->nr_user++;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat++;
+	if (has_inherit_and_sample_read(&event->attr))
+		local_inc(&ctx->nr_no_switch_fast);
 
 	if (event->state > PERF_EVENT_STATE_OFF)
 		perf_cgroup_event_enable(event, ctx);
@@ -2021,6 +2031,8 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 		ctx->nr_user--;
 	if (event->attr.inherit_stat)
 		ctx->nr_stat--;
+	if (has_inherit_and_sample_read(&event->attr))
+		local_dec(&ctx->nr_no_switch_fast);
 
 	list_del_rcu(&event->event_entry);
 
@@ -3522,6 +3534,11 @@ perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 				/*
 				 * Must not swap out ctx when there's pending
 				 * events that rely on the ctx->task relation.
+				 *
+				 * Likewise, when a context contains inherit +
+				 * SAMPLE_READ events they should be switched
+				 * out using the slow path so that they are
+				 * treated as if they were distinct contexts.
 				 */
 				raw_spin_unlock(&next_ctx->lock);
 				rcu_read_unlock();
@@ -4538,8 +4555,11 @@ static void __perf_event_read(void *info)
 	raw_spin_unlock(&ctx->lock);
 }
 
-static inline u64 perf_event_count(struct perf_event *event)
+static inline u64 perf_event_count(struct perf_event *event, bool self)
 {
+	if (self)
+		return local64_read(&event->count);
+
 	return local64_read(&event->count) + atomic64_read(&event->child_count);
 }
 
@@ -5498,7 +5518,7 @@ static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *
 	mutex_lock(&event->child_mutex);
 
 	(void)perf_event_read(event, false);
-	total += perf_event_count(event);
+	total += perf_event_count(event, false);
 
 	*enabled += event->total_time_enabled +
 			atomic64_read(&event->child_total_time_enabled);
@@ -5507,7 +5527,7 @@ static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *
 
 	list_for_each_entry(child, &event->child_list, child_list) {
 		(void)perf_event_read(child, false);
-		total += perf_event_count(child);
+		total += perf_event_count(child, false);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
 	}
@@ -5589,14 +5609,14 @@ static int __perf_read_group_add(struct perf_event *leader,
 	/*
 	 * Write {count,id} tuples for every sibling.
 	 */
-	values[n++] += perf_event_count(leader);
+	values[n++] += perf_event_count(leader, false);
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(leader);
 	if (read_format & PERF_FORMAT_LOST)
 		values[n++] = atomic64_read(&leader->lost_samples);
 
 	for_each_sibling_event(sub, leader) {
-		values[n++] += perf_event_count(sub);
+		values[n++] += perf_event_count(sub, false);
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);
 		if (read_format & PERF_FORMAT_LOST)
@@ -6176,7 +6196,7 @@ void perf_event_update_userpage(struct perf_event *event)
 	++userpg->lock;
 	barrier();
 	userpg->index = perf_event_index(event);
-	userpg->offset = perf_event_count(event);
+	userpg->offset = perf_event_count(event, false);
 	if (userpg->index)
 		userpg->offset -= local64_read(&event->hw.prev_count);
 
@@ -7250,7 +7270,7 @@ static void perf_output_read_one(struct perf_output_handle *handle,
 	u64 values[5];
 	int n = 0;
 
-	values[n++] = perf_event_count(event);
+	values[n++] = perf_event_count(event, has_inherit_and_sample_read(&event->attr));
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
 		values[n++] = enabled +
 			atomic64_read(&event->child_total_time_enabled);
@@ -7268,14 +7288,15 @@ static void perf_output_read_one(struct perf_output_handle *handle,
 }
 
 static void perf_output_read_group(struct perf_output_handle *handle,
-			    struct perf_event *event,
-			    u64 enabled, u64 running)
+				   struct perf_event *event,
+				   u64 enabled, u64 running)
 {
 	struct perf_event *leader = event->group_leader, *sub;
 	u64 read_format = event->attr.read_format;
 	unsigned long flags;
 	u64 values[6];
 	int n = 0;
+	bool self = has_inherit_and_sample_read(&event->attr);
 
 	/*
 	 * Disabling interrupts avoids all counter scheduling
@@ -7295,7 +7316,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 	    (leader->state == PERF_EVENT_STATE_ACTIVE))
 		leader->pmu->read(leader);
 
-	values[n++] = perf_event_count(leader);
+	values[n++] = perf_event_count(leader, self);
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(leader);
 	if (read_format & PERF_FORMAT_LOST)
@@ -7310,7 +7331,7 @@ static void perf_output_read_group(struct perf_output_handle *handle,
 		    (sub->state == PERF_EVENT_STATE_ACTIVE))
 			sub->pmu->read(sub);
 
-		values[n++] = perf_event_count(sub);
+		values[n++] = perf_event_count(sub, self);
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);
 		if (read_format & PERF_FORMAT_LOST)
@@ -7331,6 +7352,10 @@ static void perf_output_read_group(struct perf_output_handle *handle,
  * The problem is that its both hard and excessively expensive to iterate the
  * child list, not to mention that its impossible to IPI the children running
  * on another CPU, from interrupt/NMI context.
+ *
+ * Instead the combination of PERF_SAMPLE_READ and inherit will track per-thread
+ * counts rather than attempting to accumulate some value across all children on
+ * all cores.
  */
 static void perf_output_read(struct perf_output_handle *handle,
 			     struct perf_event *event)
@@ -12057,10 +12082,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	local64_set(&hwc->period_left, hwc->sample_period);
 
 	/*
-	 * We currently do not support PERF_SAMPLE_READ on inherited events.
+	 * We do not support PERF_SAMPLE_READ on inherited events unless
+	 * PERF_SAMPLE_TID is also selected, which allows inherited events to
+	 * collect per-thread samples.
 	 * See perf_output_read().
 	 */
-	if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ))
+	if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID))
 		goto err_ns;
 
 	if (!has_branch_stack(event))
@@ -13084,7 +13111,7 @@ static void sync_child_event(struct perf_event *child_event)
 			perf_event_read_event(child_event, task);
 	}
 
-	child_val = perf_event_count(child_event);
+	child_val = perf_event_count(child_event, false);
 
 	/*
 	 * Add back the child's count to the parent's count:
-- 
2.45.2



* [PATCH v9 3/4] tools/perf: Correctly calculate sample period for inherited SAMPLE_READ values
  2024-07-30  8:44 [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 1/4] perf: Rename perf_event_context.nr_pending to nr_no_switch_fast Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 2/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
@ 2024-07-30  8:44 ` Ben Gainey
  2024-07-30  8:44 ` [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events Ben Gainey
  2024-07-30 12:25 ` [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Peter Zijlstra
  4 siblings, 0 replies; 8+ messages in thread
From: Ben Gainey @ 2024-07-30  8:44 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: james.clark, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, Ben Gainey

Sample period calculation in deliver_sample_value is updated to
calculate the per-thread period delta for events that are inherit +
PERF_SAMPLE_READ. When the sampling event has this configuration, the
read_format.id is used with the tid from the sample to lookup the
storage of the previously accumulated counter total before calculating
the delta. All existing valid configurations where read_format.value
represents some global value continue to use just the read_format.id to
locate the storage of the previously accumulated total.

perf_sample_id is modified to support tracking per-thread
values, along with the existing global per-id values. In the
per-thread case, values are stored in a hash by tid within the
perf_sample_id, and are dynamically allocated as the number is not known
ahead of time.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 tools/lib/perf/evsel.c                  | 48 +++++++++++++++++++
 tools/lib/perf/include/internal/evsel.h | 63 ++++++++++++++++++++++++-
 tools/perf/util/session.c               | 25 ++++++----
 3 files changed, 126 insertions(+), 10 deletions(-)

diff --git a/tools/lib/perf/evsel.c b/tools/lib/perf/evsel.c
index c07160953224..abdae2f9498b 100644
--- a/tools/lib/perf/evsel.c
+++ b/tools/lib/perf/evsel.c
@@ -5,6 +5,7 @@
 #include <perf/evsel.h>
 #include <perf/cpumap.h>
 #include <perf/threadmap.h>
+#include <linux/hash.h>
 #include <linux/list.h>
 #include <internal/evsel.h>
 #include <linux/zalloc.h>
@@ -23,6 +24,7 @@ void perf_evsel__init(struct perf_evsel *evsel, struct perf_event_attr *attr,
 		      int idx)
 {
 	INIT_LIST_HEAD(&evsel->node);
+	INIT_LIST_HEAD(&evsel->per_stream_periods);
 	evsel->attr = *attr;
 	evsel->idx  = idx;
 	evsel->leader = evsel;
@@ -531,10 +533,56 @@ int perf_evsel__alloc_id(struct perf_evsel *evsel, int ncpus, int nthreads)
 
 void perf_evsel__free_id(struct perf_evsel *evsel)
 {
+	struct perf_sample_id_period *pos, *n;
+
 	xyarray__delete(evsel->sample_id);
 	evsel->sample_id = NULL;
 	zfree(&evsel->id);
 	evsel->ids = 0;
+
+	perf_evsel_for_each_per_thread_period_safe(evsel, n, pos) {
+		list_del_init(&pos->node);
+		free(pos);
+	}
+}
+
+bool perf_evsel__attr_has_per_thread_sample_period(struct perf_evsel *evsel)
+{
+	return (evsel->attr.sample_type & PERF_SAMPLE_READ)
+		&& (evsel->attr.sample_type & PERF_SAMPLE_TID)
+		&& evsel->attr.inherit;
+}
+
+u64 *perf_sample_id__get_period_storage(struct perf_sample_id *sid, u32 tid, bool per_thread)
+{
+	struct hlist_head *head;
+	struct perf_sample_id_period *res;
+	int hash;
+
+	if (!per_thread)
+		return &sid->period;
+
+	hash = hash_32(tid, PERF_SAMPLE_ID__HLIST_BITS);
+	head = &sid->periods[hash];
+
+	hlist_for_each_entry(res, head, hnode)
+		if (res->tid == tid)
+			return &res->period;
+
+	if (sid->evsel == NULL)
+		return NULL;
+
+	res = zalloc(sizeof(struct perf_sample_id_period));
+	if (res == NULL)
+		return NULL;
+
+	INIT_LIST_HEAD(&res->node);
+	res->tid = tid;
+
+	list_add_tail(&res->node, &sid->evsel->per_stream_periods);
+	hlist_add_head(&res->hnode, &sid->periods[hash]);
+
+	return &res->period;
 }
 
 void perf_counts_values__scale(struct perf_counts_values *count,
diff --git a/tools/lib/perf/include/internal/evsel.h b/tools/lib/perf/include/internal/evsel.h
index 5cd220a61962..ea78defa77d0 100644
--- a/tools/lib/perf/include/internal/evsel.h
+++ b/tools/lib/perf/include/internal/evsel.h
@@ -11,6 +11,32 @@
 struct perf_thread_map;
 struct xyarray;
 
+/**
+ * The per-thread accumulated period storage node.
+ */
+struct perf_sample_id_period {
+	struct list_head	node;
+	struct hlist_node	hnode;
+	/* Holds total ID period value for PERF_SAMPLE_READ processing. */
+	u64			period;
+	/* The TID that the values belongs to */
+	u32			tid;
+};
+
+/**
+ * perf_evsel_for_each_per_thread_period_safe - safely iterate through all
+ * the per_stream_periods
+ * @evsel: perf_evsel instance to iterate
+ * @item: struct perf_sample_id_period iterator
+ * @tmp: struct perf_sample_id_period temp iterator
+ */
+#define perf_evsel_for_each_per_thread_period_safe(evsel, tmp, item) \
+	list_for_each_entry_safe(item, tmp, &(evsel)->per_stream_periods, node)
+
+
+#define PERF_SAMPLE_ID__HLIST_BITS 4
+#define PERF_SAMPLE_ID__HLIST_SIZE (1 << PERF_SAMPLE_ID__HLIST_BITS)
+
 /*
  * Per fd, to map back from PERF_SAMPLE_ID to evsel, only used when there are
  * more than one entry in the evlist.
@@ -34,8 +60,32 @@ struct perf_sample_id {
 	pid_t			 machine_pid;
 	struct perf_cpu		 vcpu;
 
-	/* Holds total ID period value for PERF_SAMPLE_READ processing. */
-	u64			 period;
+	/*
+	 * Per-thread, and global event counts are mutually exclusive:
+	 * Whilst it is possible to combine events into a group with differing
+	 * values of PERF_SAMPLE_READ, it is not valid to have inconsistent
+	 * values for `inherit`. Therefore it is not possible to have a
+	 * situation where a per-thread event is sampled as a global event;
+	 * all !inherit groups are global, and all groups where the sampling
+	 * event is inherit + PERF_SAMPLE_READ will be per-thread. Any event
+	 * that is part of such a group that is inherit but not PERF_SAMPLE_READ
+	 * will be read as per-thread. If such an event can also trigger a
+	 * sample (such as with sample_period > 0) then it will not cause
+	 * `read_format` to be included in its PERF_RECORD_SAMPLE, and
+	 * therefore will not expose the per-thread group members as global.
+	 */
+	union {
+		/*
+		 * Holds total ID period value for PERF_SAMPLE_READ processing
+		 * (when period is not per-thread).
+		 */
+		u64			period;
+		/*
+		 * Holds total ID period value for PERF_SAMPLE_READ processing
+		 * (when period is per-thread).
+		 */
+		struct hlist_head	periods[PERF_SAMPLE_ID__HLIST_SIZE];
+	};
 };
 
 struct perf_evsel {
@@ -58,6 +108,10 @@ struct perf_evsel {
 	u32			 ids;
 	struct perf_evsel	*leader;
 
+	/* For events where the read_format value is per-thread rather than
+	 * global, stores the per-thread cumulative period */
+	struct list_head	per_stream_periods;
+
 	/* parse modifier helper */
 	int			 nr_members;
 	/*
@@ -88,4 +142,9 @@ int perf_evsel__apply_filter(struct perf_evsel *evsel, const char *filter);
 int perf_evsel__alloc_id(struct perf_evsel *evsel, int ncpus, int nthreads);
 void perf_evsel__free_id(struct perf_evsel *evsel);
 
+bool perf_evsel__attr_has_per_thread_sample_period(struct perf_evsel *evsel);
+
+u64 *perf_sample_id__get_period_storage(struct perf_sample_id *sid, u32 tid,
+					bool per_thread);
+
 #endif /* __LIBPERF_INTERNAL_EVSEL_H */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 5596bed1b8c8..fac0557ff6ea 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1474,18 +1474,24 @@ static int deliver_sample_value(struct evlist *evlist,
 				union perf_event *event,
 				struct perf_sample *sample,
 				struct sample_read_value *v,
-				struct machine *machine)
+				struct machine *machine,
+				bool per_thread)
 {
 	struct perf_sample_id *sid = evlist__id2sid(evlist, v->id);
 	struct evsel *evsel;
+	u64 *storage = NULL;
 
 	if (sid) {
+		storage = perf_sample_id__get_period_storage(sid, sample->tid, per_thread);
+	}
+
+	if (storage) {
 		sample->id     = v->id;
-		sample->period = v->value - sid->period;
-		sid->period    = v->value;
+		sample->period = v->value - *storage;
+		*storage       = v->value;
 	}
 
-	if (!sid || sid->evsel == NULL) {
+	if (!storage || sid->evsel == NULL) {
 		++evlist->stats.nr_unknown_id;
 		return 0;
 	}
@@ -1506,14 +1512,15 @@ static int deliver_sample_group(struct evlist *evlist,
 				union  perf_event *event,
 				struct perf_sample *sample,
 				struct machine *machine,
-				u64 read_format)
+				u64 read_format,
+				bool per_thread)
 {
 	int ret = -EINVAL;
 	struct sample_read_value *v = sample->read.group.values;
 
 	sample_read_group__for_each(v, sample->read.group.nr, read_format) {
 		ret = deliver_sample_value(evlist, tool, event, sample, v,
-					   machine);
+					   machine, per_thread);
 		if (ret)
 			break;
 	}
@@ -1528,6 +1535,7 @@ static int evlist__deliver_sample(struct evlist *evlist, struct perf_tool *tool,
 	/* We know evsel != NULL. */
 	u64 sample_type = evsel->core.attr.sample_type;
 	u64 read_format = evsel->core.attr.read_format;
+	bool per_thread = perf_evsel__attr_has_per_thread_sample_period(&evsel->core);
 
 	/* Standard sample delivery. */
 	if (!(sample_type & PERF_SAMPLE_READ))
@@ -1536,10 +1544,11 @@ static int evlist__deliver_sample(struct evlist *evlist, struct perf_tool *tool,
 	/* For PERF_SAMPLE_READ we have either single or group mode. */
 	if (read_format & PERF_FORMAT_GROUP)
 		return deliver_sample_group(evlist, tool, event, sample,
-					    machine, read_format);
+					    machine, read_format, per_thread);
 	else
 		return deliver_sample_value(evlist, tool, event, sample,
-					    &sample->read.one, machine);
+					    &sample->read.one, machine,
+					    per_thread);
 }
 
 static int machines__deliver_event(struct machines *machines,
-- 
2.45.2



* [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events
  2024-07-30  8:44 [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
                   ` (2 preceding siblings ...)
  2024-07-30  8:44 ` [PATCH v9 3/4] tools/perf: Correctly calculate sample period for inherited SAMPLE_READ values Ben Gainey
@ 2024-07-30  8:44 ` Ben Gainey
  2024-07-31 18:17   ` Namhyung Kim
  2024-07-30 12:25 ` [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Peter Zijlstra
  4 siblings, 1 reply; 8+ messages in thread
From: Ben Gainey @ 2024-07-30  8:44 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung
  Cc: james.clark, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, Ben Gainey

The "perf record" tool will now default to this new mode if the user
specifies a sampling group when not in system-wide mode, and when
"--no-inherit" is not specified.

This change updates evsel to allow the combination of inherit
and PERF_SAMPLE_READ.

A fallback is implemented for kernel versions where this feature is not
supported.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
---
 tools/perf/tests/attr/README                  |  2 +
 .../tests/attr/test-record-group-sampling     |  3 +-
 .../tests/attr/test-record-group-sampling1    | 51 ++++++++++++++++
 .../tests/attr/test-record-group-sampling2    | 61 +++++++++++++++++++
 tools/perf/tests/attr/test-record-group2      |  1 +
 ...{test-record-group2 => test-record-group3} | 10 +--
 tools/perf/util/evsel.c                       | 19 +++++-
 tools/perf/util/evsel.h                       |  1 +
 8 files changed, 141 insertions(+), 7 deletions(-)
 create mode 100644 tools/perf/tests/attr/test-record-group-sampling1
 create mode 100644 tools/perf/tests/attr/test-record-group-sampling2
 copy tools/perf/tests/attr/{test-record-group2 => test-record-group3} (81%)

diff --git a/tools/perf/tests/attr/README b/tools/perf/tests/attr/README
index 4066fec7180a..67c4ca76b85d 100644
--- a/tools/perf/tests/attr/README
+++ b/tools/perf/tests/attr/README
@@ -51,6 +51,8 @@ Following tests are defined (with perf commands):
   perf record --call-graph fp kill              (test-record-graph-fp-aarch64)
   perf record -e '{cycles,instructions}' kill   (test-record-group1)
   perf record -e '{cycles/period=1/,instructions/period=2/}:S' kill (test-record-group2)
+  perf record -e '{cycles,cache-misses}:S' kill (test-record-group-sampling1)
+  perf record -c 10000 -e '{cycles,cache-misses}:S' kill (test-record-group-sampling2)
   perf record -D kill                           (test-record-no-delay)
   perf record -i kill                           (test-record-no-inherit)
   perf record -n kill                           (test-record-no-samples)
diff --git a/tools/perf/tests/attr/test-record-group-sampling b/tools/perf/tests/attr/test-record-group-sampling
index 97e7e64a38f0..da7a5d10785f 100644
--- a/tools/perf/tests/attr/test-record-group-sampling
+++ b/tools/perf/tests/attr/test-record-group-sampling
@@ -2,6 +2,7 @@
 command = record
 args    = --no-bpf-event -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
 ret     = 1
+kernel_until = 6.11
 
 [event-1:base-record]
 fd=1
@@ -18,7 +19,7 @@ group_fd=1
 type=0
 config=3
 
-# default | PERF_SAMPLE_READ
+# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
 sample_type=343
 
 # PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST
diff --git a/tools/perf/tests/attr/test-record-group-sampling1 b/tools/perf/tests/attr/test-record-group-sampling1
new file mode 100644
index 000000000000..b02de391718d
--- /dev/null
+++ b/tools/perf/tests/attr/test-record-group-sampling1
@@ -0,0 +1,51 @@
+[config]
+command = record
+args    = --no-bpf-event -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
+ret     = 1
+kernel_since = 6.11
+
+[event-1:base-record]
+fd=1
+group_fd=-1
+
+# cycles
+type=0
+config=0
+
+# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
+sample_type=343
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=1
+mmap=1
+comm=1
+enable_on_exec=1
+disabled=1
+
+# inherit is enabled for group sampling
+inherit=1
+
+[event-2:base-record]
+fd=2
+group_fd=1
+
+# cache-misses
+type=0
+config=3
+
+# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
+sample_type=343
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=0
+mmap=0
+comm=0
+enable_on_exec=0
+disabled=0
+freq=0
+
+# inherit is enabled for group sampling
+inherit=1
+
diff --git a/tools/perf/tests/attr/test-record-group-sampling2 b/tools/perf/tests/attr/test-record-group-sampling2
new file mode 100644
index 000000000000..060fd1d24f63
--- /dev/null
+++ b/tools/perf/tests/attr/test-record-group-sampling2
@@ -0,0 +1,61 @@
+[config]
+command = record
+args    = --no-bpf-event -c 10000 -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
+ret     = 1
+kernel_since = 6.11
+
+[event-1:base-record]
+fd=1
+group_fd=-1
+
+# cycles
+type=0
+config=0
+
+# default | PERF_SAMPLE_READ
+sample_type=87
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=1
+mmap=1
+comm=1
+enable_on_exec=1
+disabled=1
+
+# inherit is enabled for group sampling
+inherit=1
+
+# sampling disabled
+sample_freq=0
+sample_period=10000
+freq=0
+write_backward=0
+
+[event-2:base-record]
+fd=2
+group_fd=1
+
+# cache-misses
+type=0
+config=3
+
+# default | PERF_SAMPLE_READ
+sample_type=87
+
+# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
+read_format=28|31
+task=0
+mmap=0
+comm=0
+enable_on_exec=0
+disabled=0
+
+# inherit is enabled for group sampling
+inherit=1
+
+# sampling disabled
+sample_freq=0
+sample_period=0
+freq=0
+write_backward=0
diff --git a/tools/perf/tests/attr/test-record-group2 b/tools/perf/tests/attr/test-record-group2
index cebdaa8e64e4..ad97df77a506 100644
--- a/tools/perf/tests/attr/test-record-group2
+++ b/tools/perf/tests/attr/test-record-group2
@@ -2,6 +2,7 @@
 command = record
 args    = --no-bpf-event -e '{cycles/period=1234000/,instructions/period=6789000/}:S' kill >/dev/null 2>&1
 ret     = 1
+kernel_until = 6.11
 
 [event-1:base-record]
 fd=1
diff --git a/tools/perf/tests/attr/test-record-group2 b/tools/perf/tests/attr/test-record-group3
similarity index 81%
copy from tools/perf/tests/attr/test-record-group2
copy to tools/perf/tests/attr/test-record-group3
index cebdaa8e64e4..311afb478b85 100644
--- a/tools/perf/tests/attr/test-record-group2
+++ b/tools/perf/tests/attr/test-record-group3
@@ -2,6 +2,7 @@
 command = record
 args    = --no-bpf-event -e '{cycles/period=1234000/,instructions/period=6789000/}:S' kill >/dev/null 2>&1
 ret     = 1
+kernel_since = 6.11
 
 [event-1:base-record]
 fd=1
@@ -9,8 +10,9 @@ group_fd=-1
 config=0|1
 sample_period=1234000
 sample_type=87
-read_format=12|28
-inherit=0
+read_format=28|31
+disabled=1
+inherit=1
 freq=0
 
 [event-2:base-record]
@@ -19,9 +21,9 @@ group_fd=1
 config=0|1
 sample_period=6789000
 sample_type=87
-read_format=12|28
+read_format=28|31
 disabled=0
-inherit=0
+inherit=1
 mmap=0
 comm=0
 freq=0
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index bc603193c477..ceb09b6a8c2f 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1171,7 +1171,15 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
 		 */
 		if (leader->core.nr_members > 1) {
 			attr->read_format |= PERF_FORMAT_GROUP;
-			attr->inherit = 0;
+		}
+
+		/*
+		 * Inherit + SAMPLE_READ requires SAMPLE_TID in the read_format
+		 */
+		if (attr->inherit) {
+			evsel__set_sample_bit(evsel, TID);
+			evsel->core.attr.read_format |=
+				PERF_FORMAT_ID;
 		}
 	}
 
@@ -2020,6 +2028,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
 
 static void evsel__disable_missing_features(struct evsel *evsel)
 {
+	if (perf_missing_features.inherit_sample_read)
+		evsel->core.attr.inherit = 0;
 	if (perf_missing_features.branch_counters)
 		evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_COUNTERS;
 	if (perf_missing_features.read_lost)
@@ -2075,7 +2085,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
 	 * Must probe features in the order they were added to the
 	 * perf_event_attr interface.
 	 */
-	if (!perf_missing_features.branch_counters &&
+	if (!perf_missing_features.inherit_sample_read &&
+	    evsel->core.attr.inherit && (evsel->core.attr.sample_type & PERF_SAMPLE_READ)) {
+		perf_missing_features.inherit_sample_read = true;
+		pr_debug2("Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit, falling back to no-inherit.\n");
+		return true;
+	} else if (!perf_missing_features.branch_counters &&
 	    (evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
 		perf_missing_features.branch_counters = true;
 		pr_debug2("switching off branch counters support\n");
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 80b5f6dd868e..bb0c91c23679 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -206,6 +206,7 @@ struct perf_missing_features {
 	bool weight_struct;
 	bool read_lost;
 	bool branch_counters;
+	bool inherit_sample_read;
 };
 
 extern struct perf_missing_features perf_missing_features;
-- 
2.45.2



* Re: [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit
  2024-07-30  8:44 [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
                   ` (3 preceding siblings ...)
  2024-07-30  8:44 ` [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events Ben Gainey
@ 2024-07-30 12:25 ` Peter Zijlstra
  4 siblings, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2024-07-30 12:25 UTC (permalink / raw)
  To: Ben Gainey
  Cc: mingo, acme, namhyung, james.clark, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel

On Tue, Jul 30, 2024 at 09:44:13AM +0100, Ben Gainey wrote:
> This change allows events to use PERF_SAMPLE_READ with inherit so long
> as PERF_SAMPLE_TID is also set.
> 
> Currently it is not possible to use PERF_SAMPLE_READ with inherit. This
> restriction assumes the user is interested in collecting aggregate
> statistics as per `perf stat`. It prevents a user from collecting
> per-thread samples using counter groups from a multi-threaded or
> multi-process application, as with `perf record -e '{....}:S'`. Instead
> users must use system-wide mode, or forgo the ability to sample counter
> groups, or profile a single thread. System-wide mode is often
> problematic as it requires specific permissions (no CAP_PERFMON / root
> access), or may lead to capture of significant amounts of extra data
> from other processes running on the system.
> 
> This patch changes `perf_event_alloc` relaxing the restriction against
> combining `inherit` with `PERF_SAMPLE_READ` so that the combination
> will be allowed so long as `PERF_SAMPLE_TID` is enabled. It modifies
> sampling so that only the count associated with the active thread is
> recorded into the buffer. It modifies the context switch handling so
> that perf contexts are always switched out if they have this kind of
> event so that the correct per-thread state is maintained. Finally, the
> tools are updated to allow perf record to specify this combination and
> to correctly decode the sample data.
> 
> In this configuration sample values, as may appear in the read_format
> field of a PERF_RECORD_SAMPLE, are no longer global counters. Instead
> the value reports the per-thread value for the active thread.
> Tools that expect the global total, for example when calculating a delta
> between samples, would need updating to take this into account when
> opting into this new behaviour. Previously valid event configurations
> (system-wide, no-inherit and so on) are unaffected.
> 

Thanks, I've picked up the kernel patches, and provided the robot
doesn't hate on them, they will appear in tip/perf/core soonish.


* Re: [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events
  2024-07-30  8:44 ` [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events Ben Gainey
@ 2024-07-31 18:17   ` Namhyung Kim
  2024-08-01 12:28     ` Ben Gainey
  0 siblings, 1 reply; 8+ messages in thread
From: Namhyung Kim @ 2024-07-31 18:17 UTC (permalink / raw)
  To: Ben Gainey
  Cc: peterz, mingo, acme, james.clark, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter,
	linux-perf-users, linux-kernel

Hello,

On Tue, Jul 30, 2024 at 09:44:17AM +0100, Ben Gainey wrote:
> The "perf record" tool will now default to this new mode if the user
> specifies a sampling group when not in system-wide mode, and when
> "--no-inherit" is not specified.
> 
> This change updates evsel to allow the combination of inherit
> and PERF_SAMPLE_READ.
> 
> A fallback is implemented for kernel versions where this feature is not
> supported.
> 
> Signed-off-by: Ben Gainey <ben.gainey@arm.com>
> ---
>  tools/perf/tests/attr/README                  |  2 +
>  .../tests/attr/test-record-group-sampling     |  3 +-
>  .../tests/attr/test-record-group-sampling1    | 51 ++++++++++++++++
>  .../tests/attr/test-record-group-sampling2    | 61 +++++++++++++++++++
>  tools/perf/tests/attr/test-record-group2      |  1 +
>  ...{test-record-group2 => test-record-group3} | 10 +--
>  tools/perf/util/evsel.c                       | 19 +++++-
>  tools/perf/util/evsel.h                       |  1 +
>  8 files changed, 141 insertions(+), 7 deletions(-)
>  create mode 100644 tools/perf/tests/attr/test-record-group-sampling1
>  create mode 100644 tools/perf/tests/attr/test-record-group-sampling2
>  copy tools/perf/tests/attr/{test-record-group2 => test-record-group3} (81%)
> 
> diff --git a/tools/perf/tests/attr/README b/tools/perf/tests/attr/README
> index 4066fec7180a..67c4ca76b85d 100644
> --- a/tools/perf/tests/attr/README
> +++ b/tools/perf/tests/attr/README
> @@ -51,6 +51,8 @@ Following tests are defined (with perf commands):
>    perf record --call-graph fp kill              (test-record-graph-fp-aarch64)
>    perf record -e '{cycles,instructions}' kill   (test-record-group1)
>    perf record -e '{cycles/period=1/,instructions/period=2/}:S' kill (test-record-group2)
> +  perf record -e '{cycles,cache-misses}:S' kill (test-record-group-sampling1)
> +  perf record -c 10000 -e '{cycles,cache-misses}:S' kill (test-record-group-sampling2)
>    perf record -D kill                           (test-record-no-delay)
>    perf record -i kill                           (test-record-no-inherit)
>    perf record -n kill                           (test-record-no-samples)
> diff --git a/tools/perf/tests/attr/test-record-group-sampling b/tools/perf/tests/attr/test-record-group-sampling
> index 97e7e64a38f0..da7a5d10785f 100644
> --- a/tools/perf/tests/attr/test-record-group-sampling
> +++ b/tools/perf/tests/attr/test-record-group-sampling
> @@ -2,6 +2,7 @@
>  command = record
>  args    = --no-bpf-event -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
>  ret     = 1
> +kernel_until = 6.11

I guess it's v6.12. :)

>  
>  [event-1:base-record]
>  fd=1
> @@ -18,7 +19,7 @@ group_fd=1
>  type=0
>  config=3
>  
> -# default | PERF_SAMPLE_READ
> +# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
>  sample_type=343
>  
>  # PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST
> diff --git a/tools/perf/tests/attr/test-record-group-sampling1 b/tools/perf/tests/attr/test-record-group-sampling1
> new file mode 100644
> index 000000000000..b02de391718d
> --- /dev/null
> +++ b/tools/perf/tests/attr/test-record-group-sampling1
> @@ -0,0 +1,51 @@
> +[config]
> +command = record
> +args    = --no-bpf-event -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
> +ret     = 1
> +kernel_since = 6.11
> +
> +[event-1:base-record]
> +fd=1
> +group_fd=-1
> +
> +# cycles
> +type=0
> +config=0
> +
> +# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
> +sample_type=343
> +
> +# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
> +read_format=28|31
> +task=1
> +mmap=1
> +comm=1
> +enable_on_exec=1
> +disabled=1
> +
> +# inherit is enabled for group sampling
> +inherit=1
> +
> +[event-2:base-record]
> +fd=2
> +group_fd=1
> +
> +# cache-misses
> +type=0
> +config=3
> +
> +# default | PERF_SAMPLE_READ | PERF_SAMPLE_PERIOD
> +sample_type=343
> +
> +# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
> +read_format=28|31
> +task=0
> +mmap=0
> +comm=0
> +enable_on_exec=0
> +disabled=0
> +freq=0
> +
> +# inherit is enabled for group sampling
> +inherit=1
> +
> diff --git a/tools/perf/tests/attr/test-record-group-sampling2 b/tools/perf/tests/attr/test-record-group-sampling2
> new file mode 100644
> index 000000000000..060fd1d24f63
> --- /dev/null
> +++ b/tools/perf/tests/attr/test-record-group-sampling2
> @@ -0,0 +1,61 @@
> +[config]
> +command = record
> +args    = --no-bpf-event -c 10000 -e '{cycles,cache-misses}:S' kill >/dev/null 2>&1
> +ret     = 1
> +kernel_since = 6.11
> +
> +[event-1:base-record]
> +fd=1
> +group_fd=-1
> +
> +# cycles
> +type=0
> +config=0
> +
> +# default | PERF_SAMPLE_READ
> +sample_type=87
> +
> +# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
> +read_format=28|31
> +task=1
> +mmap=1
> +comm=1
> +enable_on_exec=1
> +disabled=1
> +
> +# inherit is enabled for group sampling
> +inherit=1
> +
> +# sampling disabled
> +sample_freq=0
> +sample_period=10000
> +freq=0
> +write_backward=0
> +
> +[event-2:base-record]
> +fd=2
> +group_fd=1
> +
> +# cache-misses
> +type=0
> +config=3
> +
> +# default | PERF_SAMPLE_READ
> +sample_type=87
> +
> +# PERF_FORMAT_ID | PERF_FORMAT_GROUP  | PERF_FORMAT_LOST | PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING
> +read_format=28|31
> +task=0
> +mmap=0
> +comm=0
> +enable_on_exec=0
> +disabled=0
> +
> +# inherit is enabled for group sampling
> +inherit=1
> +
> +# sampling disabled
> +sample_freq=0
> +sample_period=0
> +freq=0
> +write_backward=0
> diff --git a/tools/perf/tests/attr/test-record-group2 b/tools/perf/tests/attr/test-record-group2
> index cebdaa8e64e4..ad97df77a506 100644
> --- a/tools/perf/tests/attr/test-record-group2
> +++ b/tools/perf/tests/attr/test-record-group2
> @@ -2,6 +2,7 @@
>  command = record
>  args    = --no-bpf-event -e '{cycles/period=1234000/,instructions/period=6789000/}:S' kill >/dev/null 2>&1
>  ret     = 1
> +kernel_until = 6.11
>  
>  [event-1:base-record]
>  fd=1
> diff --git a/tools/perf/tests/attr/test-record-group2 b/tools/perf/tests/attr/test-record-group3
> similarity index 81%
> copy from tools/perf/tests/attr/test-record-group2
> copy to tools/perf/tests/attr/test-record-group3
> index cebdaa8e64e4..311afb478b85 100644
> --- a/tools/perf/tests/attr/test-record-group2
> +++ b/tools/perf/tests/attr/test-record-group3
> @@ -2,6 +2,7 @@
>  command = record
>  args    = --no-bpf-event -e '{cycles/period=1234000/,instructions/period=6789000/}:S' kill >/dev/null 2>&1
>  ret     = 1
> +kernel_since = 6.11
>  
>  [event-1:base-record]
>  fd=1
> @@ -9,8 +10,9 @@ group_fd=-1
>  config=0|1
>  sample_period=1234000
>  sample_type=87
> -read_format=12|28
> -inherit=0
> +read_format=28|31
> +disabled=1
> +inherit=1
>  freq=0
>  
>  [event-2:base-record]
> @@ -19,9 +21,9 @@ group_fd=1
>  config=0|1
>  sample_period=6789000
>  sample_type=87
> -read_format=12|28
> +read_format=28|31
>  disabled=0
> -inherit=0
> +inherit=1
>  mmap=0
>  comm=0
>  freq=0
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index bc603193c477..ceb09b6a8c2f 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1171,7 +1171,15 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
>  		 */
>  		if (leader->core.nr_members > 1) {
>  			attr->read_format |= PERF_FORMAT_GROUP;
> -			attr->inherit = 0;
> +		}
> +
> +		/*
> +		 * Inherit + SAMPLE_READ requires SAMPLE_TID in the read_format
> +		 */
> +		if (attr->inherit) {
> +			evsel__set_sample_bit(evsel, TID);
> +			evsel->core.attr.read_format |=
> +				PERF_FORMAT_ID;
>  		}

Also I think we should reset the inherit bit for system-wide events.

  $ perf record -a --synth=no true
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.042 MB perf.data (51 samples) ]
  
  $ perf evlist -v | tr ',' '\n' | grep inherit
   inherit: 1
   inherit: 1

Maybe something like this:

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index bc603193c477..9423cd65c3c4 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1149,7 +1149,7 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
        bool per_cpu = opts->target.default_per_cpu && !opts->target.per_thread;
 
        attr->sample_id_all = perf_missing_features.sample_id_all ? 0 : 1;
-       attr->inherit       = !opts->no_inherit;
+       attr->inherit       = target__has_cpu(&opts->target) ? 0 : !opts->no_inherit;
        attr->write_backward = opts->overwrite ? 1 : 0;
        attr->read_format   = PERF_FORMAT_LOST;
 

Thanks,
Namhyung


>  	}
>  
> @@ -2020,6 +2028,8 @@ static int __evsel__prepare_open(struct evsel *evsel, struct perf_cpu_map *cpus,
>  
>  static void evsel__disable_missing_features(struct evsel *evsel)
>  {
> +	if (perf_missing_features.inherit_sample_read)
> +		evsel->core.attr.inherit = 0;
>  	if (perf_missing_features.branch_counters)
>  		evsel->core.attr.branch_sample_type &= ~PERF_SAMPLE_BRANCH_COUNTERS;
>  	if (perf_missing_features.read_lost)
> @@ -2075,7 +2085,12 @@ bool evsel__detect_missing_features(struct evsel *evsel)
>  	 * Must probe features in the order they were added to the
>  	 * perf_event_attr interface.
>  	 */
> -	if (!perf_missing_features.branch_counters &&
> +	if (!perf_missing_features.inherit_sample_read &&
> +	    evsel->core.attr.inherit && (evsel->core.attr.sample_type & PERF_SAMPLE_READ)) {
> +		perf_missing_features.inherit_sample_read = true;
> +		pr_debug2("Using PERF_SAMPLE_READ / :S modifier is not compatible with inherit, falling back to no-inherit.\n");
> +		return true;
> +	} else if (!perf_missing_features.branch_counters &&
>  	    (evsel->core.attr.branch_sample_type & PERF_SAMPLE_BRANCH_COUNTERS)) {
>  		perf_missing_features.branch_counters = true;
>  		pr_debug2("switching off branch counters support\n");
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 80b5f6dd868e..bb0c91c23679 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -206,6 +206,7 @@ struct perf_missing_features {
>  	bool weight_struct;
>  	bool read_lost;
>  	bool branch_counters;
> +	bool inherit_sample_read;
>  };
>  
>  extern struct perf_missing_features perf_missing_features;
> -- 
> 2.45.2
> 


* Re: [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events
  2024-07-31 18:17   ` Namhyung Kim
@ 2024-08-01 12:28     ` Ben Gainey
  0 siblings, 0 replies; 8+ messages in thread
From: Ben Gainey @ 2024-08-01 12:28 UTC (permalink / raw)
  To: namhyung@kernel.org
  Cc: alexander.shishkin@linux.intel.com, peterz@infradead.org,
	acme@kernel.org, mingo@redhat.com, James Clark,
	adrian.hunter@intel.com, irogers@google.com, jolsa@kernel.org,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	Mark Rutland

On Wed, 2024-07-31 at 11:17 -0700, Namhyung Kim wrote:
> Hello,
> 
> On Tue, Jul 30, 2024 at 09:44:17AM +0100, Ben Gainey wrote:
> > The "perf record" tool will now default to this new mode if the
> > user
> > specifies a sampling group when not in system-wide mode, and when
> > "--no-inherit" is not specified.
> > 
> > This change updates evsel to allow the combination of inherit
> > and PERF_SAMPLE_READ.
> > 
> > A fallback is implemented for kernel versions where this feature is
> > not
> > supported.
> > 
> > Signed-off-by: Ben Gainey <ben.gainey@arm.com>
> > ---
> > 

snip


> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index bc603193c477..ceb09b6a8c2f 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -1171,7 +1171,15 @@ void evsel__config(struct evsel *evsel,
> > struct record_opts *opts,
> >   */
> >   if (leader->core.nr_members > 1) {
> >   attr->read_format |= PERF_FORMAT_GROUP;
> > - attr->inherit = 0;
> > + }
> > +
> > + /*
> > + * Inherit + SAMPLE_READ requires SAMPLE_TID in the read_format
> > + */
> > + if (attr->inherit) {
> > + evsel__set_sample_bit(evsel, TID);
> > + evsel->core.attr.read_format |=
> > + PERF_FORMAT_ID;
> >   }
> 
> Also I think we should reset the inherit bit for system-wide events.
> 
>   $ perf record -a --synth=no true
>   [ perf record: Woken up 1 times to write data ]
>   [ perf record: Captured and wrote 0.042 MB perf.data (51 samples) ]
>   
>   $ perf evlist -v | tr ',' '\n' | grep inherit
>    inherit: 1
>    inherit: 1
> 
> Maybe something like this:
> 
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index bc603193c477..9423cd65c3c4 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1149,7 +1149,7 @@ void evsel__config(struct evsel *evsel, struct
> record_opts *opts,
>         bool per_cpu = opts->target.default_per_cpu && !opts-
> >target.per_thread;
>  
>         attr->sample_id_all = perf_missing_features.sample_id_all ? 0
> : 1;
> -       attr->inherit       = !opts->no_inherit;
> +       attr->inherit       = target__has_cpu(&opts->target) ? 0 :
> !opts->no_inherit;
>         attr->write_backward = opts->overwrite ? 1 : 0;
>         attr->read_format   = PERF_FORMAT_LOST;
>  
> 
> Thanks,
> Namhyung
> 

Done, though this looks benign; the current tools appear to do this,
and there are a couple of files in tests/attr that assumed inherit=1
for system-wide mode that I've had to update.


   # perf --version
   perf version 6.10-1
   # perf record -vvv -a --synth=no true
   ...
   ------------------------------------------------------------
   perf_event_attr:
     type                             0 (PERF_TYPE_HARDWARE)
     size                             136
     config                           0 (PERF_COUNT_HW_CPU_CYCLES)
     { sample_period, sample_freq }   4000
     sample_type                      IP|TID|TIME|CPU|PERIOD|IDENTIFIER
     read_format                      ID|LOST
     disabled                         1
     inherit                          1
     freq                             1
     precise_ip                       3
     sample_id_all                    1
   ------------------------------------------------------------

Testing your fix with -a, -C <n> as well as with application profiling
appears to do the right thing.


Regards
Ben


> 
> >   }
> >  
> > @@ -2020,6 +2028,8 @@ static int __evsel__prepare_open(struct evsel
> > *evsel, struct perf_cpu_map *cpus,
> >  
> >  static void evsel__disable_missing_features(struct evsel *evsel)
> >  {
> > + if (perf_missing_features.inherit_sample_read)
> > + evsel->core.attr.inherit = 0;
> >   if (perf_missing_features.branch_counters)
> >   evsel->core.attr.branch_sample_type &=
> > ~PERF_SAMPLE_BRANCH_COUNTERS;
> >   if (perf_missing_features.read_lost)
> > @@ -2075,7 +2085,12 @@ bool evsel__detect_missing_features(struct
> > evsel *evsel)
> >   * Must probe features in the order they were added to the
> >   * perf_event_attr interface.
> >   */
> > - if (!perf_missing_features.branch_counters &&
> > + if (!perf_missing_features.inherit_sample_read &&
> > +     evsel->core.attr.inherit && (evsel->core.attr.sample_type &
> > PERF_SAMPLE_READ)) {
> > + perf_missing_features.inherit_sample_read = true;
> > + pr_debug2("Using PERF_SAMPLE_READ / :S modifier is not compatible
> > with inherit, falling back to no-inherit.\n");
> > + return true;
> > + } else if (!perf_missing_features.branch_counters &&
> >       (evsel->core.attr.branch_sample_type &
> > PERF_SAMPLE_BRANCH_COUNTERS)) {
> >   perf_missing_features.branch_counters = true;
> >   pr_debug2("switching off branch counters support\n");
> > diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> > index 80b5f6dd868e..bb0c91c23679 100644
> > --- a/tools/perf/util/evsel.h
> > +++ b/tools/perf/util/evsel.h
> > @@ -206,6 +206,7 @@ struct perf_missing_features {
> >   bool weight_struct;
> >   bool read_lost;
> >   bool branch_counters;
> > + bool inherit_sample_read;
> >  };
> >  
> >  extern struct perf_missing_features perf_missing_features;
> > -- 
> > 2.45.2
> > 



end of thread, other threads:[~2024-08-01 12:28 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-30  8:44 [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
2024-07-30  8:44 ` [PATCH v9 1/4] perf: Rename perf_event_context.nr_pending to nr_no_switch_fast Ben Gainey
2024-07-30  8:44 ` [PATCH v9 2/4] perf: Support PERF_SAMPLE_READ with inherit Ben Gainey
2024-07-30  8:44 ` [PATCH v9 3/4] tools/perf: Correctly calculate sample period for inherited SAMPLE_READ values Ben Gainey
2024-07-30  8:44 ` [PATCH v9 4/4] tools/perf: Allow inherit + PERF_SAMPLE_READ when opening events Ben Gainey
2024-07-31 18:17   ` Namhyung Kim
2024-08-01 12:28     ` Ben Gainey
2024-07-30 12:25 ` [PATCH v9 0/4] perf: Support PERF_SAMPLE_READ with inherit Peter Zijlstra
