Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] perf cs-etm: stamp pid/tid/EL on each buffered packet to fix cross-pid attribution
@ 2026-05-15  2:11 Amir Ayupov
  0 siblings, 0 replies; only message in thread
From: Amir Ayupov @ 2026-05-15  2:11 UTC (permalink / raw)
  To: Suzuki K Poulose, Mike Leach, James Clark, Leo Yan,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, John Garry, Will Deacon, coresight,
	linux-arm-kernel, linux-perf-users, linux-kernel
  Cc: stable

In a system-wide `perf record -e cs_etm/.../u` capture on aarch64,
synthesized samples emitted by `perf script --itrace=il64` are
sometimes attributed to the WRONG sample.pid/tid (and to the wrong
EL/cpumode) for the chunk of branches that straddle a context-switch
boundary on a CPU. A branch actually retired by process A is emitted
with sample.pid set to the thread that next ran on the same CPU.

Mechanism:
  1. ETM emits CONTEXTIDR/EL packets in-stream when the kernel updates
     CONTEXTIDR_EL1 on context switch / EL change. OpenCSD turns these
     into OCSD_GEN_TRC_ELEM_PE_CONTEXT elements interleaved with
     OCSD_GEN_TRC_ELEM_INSTR_RANGE elements for retired branch ranges.
  2. cs_etm_decoder__buffer_range() queues each INSTR_RANGE into
     packet_queue->packet_buffer[]; packets carry start/end addrs,
     instr_count, last-instruction info, etc., but NO owner identity.
  3. PE_CONTEXT goes through cs_etm_decoder__set_tid() ->
     cs_etm__set_thread(), which immediately mutates tidq->thread and
     tidq->el. Queued packets are not drained first; reset_timestamp()
     is called so the next TIMESTAMP triggers OCSD_RESP_WAIT and a
     drain.
  4. By drain time in cs_etm__process_traceid_queue() ->
     cs_etm__sample(), sample.pid/tid is read from the now-mutated
     tidq->thread and sample.cpumode from the now-mutated tidq->el.
     Pre-context INSTR_RANGEs get the post-context owner.

The same race affects branch samples via tidq->prev_packet_thread /
tidq->prev_packet_el, captured at packet-swap time from
tidq->thread / tidq->el (which may already have flipped).

This is independent of PERF_RECORD_SWITCH_CPU_WIDE, which is
deliberately not used to assign sample identity in this path. The
bug applies to any cs_etm capture with in-stream CONTEXTIDR
(PIDFMT_CTXTID or PIDFMT_CTXTID2).

Effect on downstream tools: branches that should belong to the
previous thread on the CPU get attributed to the next thread. When
the two threads share a binary, leaked branches' VAs land in the
wrong thread's mappings; samples whose IPs land in r-x mappings
silently pollute that binary's profile, while samples landing in
R-only/RW mappings show up as out-of-range / non-text samples.
Either way, AutoFDO/BOLT profiles built from `perf script --itrace`
output of system-wide cs_etm captures contain misattributed samples.

Concrete example from `perf script --itrace=il64` of the same
captured branch (same timestamp, same IP, same from/to addrs) before
and after this fix:

  before: launcher_multia 2638146/2638146 705897.219172: \
              fffcda6b124c 0xfffcda641958/0xfffcda6b123c
  after:  ws-tcf-sr-io13  2736581/2741587 705897.219172: \
              fffcda6b124c 0xfffcda641958/0xfffcda6b123c

The branch was retired by ws-tcf-sr-io13 (tid 2741587) but, before
the fix, was attributed to launcher_multia (the next thread to run on
that CPU after the context switch). After the fix, it is correctly
attributed to ws-tcf-sr-io13.

Why not "drain on PE_CONTEXT then switch" (deferred-set_thread):
tidq->thread has two consumers \u2014 sample emission needs the OUTGOING
identity for queued packets, but cs_etm__mem_access() needs the
CURRENT thread's maps to fetch instruction bytes for OpenCSD. The
two needs are temporally inverted; a single tidq->thread cannot
serve both. Keeping tidq->thread current and stamping owner identity
per packet is the only design that decouples them cleanly.

Fix: capture the owning pid/tid/EL on each buffered packet at
cs_etm_decoder__buffer_packet() time (before any subsequent
PE_CONTEXT can mutate tidq->thread / tidq->el), and read them at
sample emission time.

  - struct cs_etm_packet gains pid_t pid, pid_t tid, int el (storing
    an ocsd_ex_level value; typed as int so the struct does not
    depend on OpenCSD headers, which are only included inside
    HAVE_CSTRACE_SUPPORT).
  - cs_etm__etmq_get_pid_tid_el() (formerly cs_etm__etmq_get_pid_tid)
    returns all three.
  - cs_etm__synth_instruction_sample() reads sample.pid / sample.tid
    from tidq->packet->{pid,tid} and derives sample.cpumode from
    tidq->packet->el.
  - cs_etm__synth_branch_sample() reads sample.pid / sample.tid /
    cpumode from tidq->prev_packet->{pid,tid,el}.
  - The separate prev_packet_thread / prev_packet_el bookkeeping in
    cs_etm__packet_swap() / cs_etm__init_traceid_queue() /
    cs_etm__free_traceid_queues() is removed; the per-packet stamp
    on prev_packet now carries that information.

Cost: 12 bytes added to struct cs_etm_packet (~12-16 KB per
packet_queue with CS_ETM_PACKET_MAX_BUFFER=1024), 16 bytes saved per
cs_etm_traceid_queue (one struct thread * + one ocsd_ex_level).

A residual gap: cs_etm__copy_insn() reads sample.insn bytes via
cs_etm__mem_access(), which still uses tidq->thread (the current
thread), so the inline insn bytes for an outgoing-thread sample may
be looked up against the wrong address space. Fixing this requires
threading the packet's owner pid through cs_etm__mem_access and is
left for a follow-up. sample.ip / sample.pid attribution \u2014 what
AutoFDO/BOLT consume \u2014 is correct.

A workaround for users on existing perf binaries:
`perf record -p PIDS \u2026` programs the ETM
TRCCONTEXTIDCTLR/TRCVMIDCCTLR registers so the trace stream itself
never carries foreign-PID instructions, eliminating the leak at the
trace source.

Signed-off-by: Amir Ayupov <aaupov@meta.com>
Signed-off-by: Amir Ayupov <aaupov@fb.com>
---
 .../perf/util/cs-etm-decoder/cs-etm-decoder.c | 12 +++
 tools/perf/util/cs-etm.c                      | 75 +++++++++++++------
 tools/perf/util/cs-etm.h                      | 25 +++++++
 3 files changed, 88 insertions(+), 24 deletions(-)

diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index dee3020ceaa91..ed99dfc7b0f8d 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -402,6 +402,18 @@ cs_etm_decoder__buffer_packet(struct cs_etm_queue *etmq,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	/*
+	 * Stamp the owner thread (pid/tid) onto the packet at buffer time.
+	 * A subsequent OCSD_GEN_TRC_ELEM_PE_CONTEXT element will mutate
+	 * tidq->thread before this packet is emitted as a sample; recording
+	 * the identity here keeps each buffered packet correctly attributed
+	 * to the thread that retired its instructions. See
+	 * cs_etm__synth_instruction_sample().
+	 */
+	cs_etm__etmq_get_pid_tid_el(etmq, trace_chan_id,
+				    &packet_queue->packet_buffer[et].pid,
+				    &packet_queue->packet_buffer[et].tid,
+				    &packet_queue->packet_buffer[et].el);
 
 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8a639d2e51a4c..ee3437488b753 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -86,8 +86,6 @@ struct cs_etm_traceid_queue {
 	size_t last_branch_pos;
 	union perf_event *event_buf;
 	struct thread *thread;
-	struct thread *prev_packet_thread;
-	ocsd_ex_level prev_packet_el;
 	ocsd_ex_level el;
 	struct branch_stack *last_branch;
 	struct branch_stack *last_branch_rb;
@@ -614,10 +612,9 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 
 	queue = &etmq->etm->queues.queue_array[etmq->queue_nr];
 	tidq->trace_chan_id = trace_chan_id;
-	tidq->el = tidq->prev_packet_el = ocsd_EL_unknown;
+	tidq->el = ocsd_EL_unknown;
 	tidq->thread = machine__findnew_thread(&etm->session->machines.host, -1,
 					       queue->tid);
-	tidq->prev_packet_thread = machine__idle_thread(&etm->session->machines.host);
 
 	tidq->packet = zalloc(sizeof(struct cs_etm_packet));
 	if (!tidq->packet)
@@ -740,6 +737,26 @@ struct cs_etm_packet_queue
 	return NULL;
 }
 
+void cs_etm__etmq_get_pid_tid_el(struct cs_etm_queue *etmq, u8 trace_chan_id,
+				 pid_t *pid, pid_t *tid, int *el)
+{
+	struct cs_etm_traceid_queue *tidq;
+
+	*pid = -1;
+	*tid = -1;
+	*el  = ocsd_EL_unknown;
+
+	tidq = cs_etm__etmq_get_traceid_queue(etmq, trace_chan_id);
+	if (!tidq)
+		return;
+
+	*el = tidq->el;
+	if (tidq->thread) {
+		*pid = thread__pid(tidq->thread);
+		*tid = thread__tid(tidq->thread);
+	}
+}
+
 static void cs_etm__packet_swap(struct cs_etm_auxtrace *etm,
 				struct cs_etm_traceid_queue *tidq)
 {
@@ -748,23 +765,15 @@ static void cs_etm__packet_swap(struct cs_etm_auxtrace *etm,
 	if (etm->synth_opts.branches || etm->synth_opts.last_branch ||
 	    etm->synth_opts.instructions) {
 		/*
-		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
-		 * the next incoming packet.
-		 *
-		 * Threads and exception levels are also tracked for both the
-		 * previous and current packets. This is because the previous
-		 * packet is used for the 'from' IP for branch samples, so the
-		 * thread at that time must also be assigned to that sample.
-		 * Across discontinuity packets the thread can change, so by
-		 * tracking the thread for the previous packet the branch sample
-		 * will have the correct info.
+		 * Rotate PACKET into PREV_PACKET so the next decoded packet
+		 * lands in PACKET. Owner identity (pid/tid/el) travels with
+		 * the packet itself — it was stamped at
+		 * cs_etm_decoder__buffer_packet() time — so no separate
+		 * thread/EL tracking is needed here.
 		 */
 		tmp = tidq->packet;
 		tidq->packet = tidq->prev_packet;
 		tidq->prev_packet = tmp;
-		tidq->prev_packet_el = tidq->el;
-		thread__put(tidq->prev_packet_thread);
-		tidq->prev_packet_thread = thread__get(tidq->thread);
 	}
 }
 
@@ -938,7 +947,6 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq)
 		/* Free this traceid_queue from the array */
 		tidq = etmq->traceid_queues[idx];
 		thread__zput(tidq->thread);
-		thread__zput(tidq->prev_packet_thread);
 		zfree(&tidq->event_buf);
 		zfree(&tidq->last_branch);
 		zfree(&tidq->last_branch_rb);
@@ -1570,15 +1578,24 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 
 	perf_sample__init(&sample, /*all=*/true);
 	event->sample.header.type = PERF_RECORD_SAMPLE;
-	event->sample.header.misc = cs_etm__cpu_mode(etmq, addr, tidq->el);
+	event->sample.header.misc = cs_etm__cpu_mode(etmq, addr,
+						     (ocsd_ex_level)tidq->packet->el);
 	event->sample.header.size = sizeof(struct perf_event_header);
 
 	/* Set time field based on etm auxtrace config. */
 	sample.time = cs_etm__resolve_sample_time(etmq, tidq);
 
 	sample.ip = addr;
-	sample.pid = thread__pid(tidq->thread);
-	sample.tid = thread__tid(tidq->thread);
+	/*
+	 * Read pid/tid (and EL above for cpumode) from the packet's
+	 * stamped owner identity rather than tidq->thread / tidq->el,
+	 * which reflect the thread that is current at sample emission
+	 * time. A PE_CONTEXT element delivered between buffer time and
+	 * emit time would otherwise misattribute pre-context packets to
+	 * the next thread/EL on the CPU.
+	 */
+	sample.pid = tidq->packet->pid;
+	sample.tid = tidq->packet->tid;
 	sample.id = etmq->etm->instructions_id;
 	sample.stream_id = etmq->etm->instructions_id;
 	sample.period = period;
@@ -1586,6 +1603,16 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
 
+	/*
+	 * Note: cs_etm__copy_insn() reads sample.insn bytes via
+	 * cs_etm__mem_access(), which uses tidq->thread (the *current*
+	 * thread). For samples whose pid/tid were stamped from an
+	 * outgoing thread that has since been replaced by a PE_CONTEXT,
+	 * the inline insn bytes may be looked up against the wrong
+	 * address space. sample.ip / sample.pid attribution is correct;
+	 * fixing the insn bytes requires threading the packet's owner
+	 * pid through cs_etm__mem_access and is left for a follow-up.
+	 */
 	cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
 
 	if (etm->synth_opts.last_branch)
@@ -1631,15 +1658,15 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 
 	event->sample.header.type = PERF_RECORD_SAMPLE;
 	event->sample.header.misc = cs_etm__cpu_mode(etmq, ip,
-						     tidq->prev_packet_el);
+						     (ocsd_ex_level)tidq->prev_packet->el);
 	event->sample.header.size = sizeof(struct perf_event_header);
 
 	/* Set time field based on etm auxtrace config. */
 	sample.time = cs_etm__resolve_sample_time(etmq, tidq);
 
 	sample.ip = ip;
-	sample.pid = thread__pid(tidq->prev_packet_thread);
-	sample.tid = thread__tid(tidq->prev_packet_thread);
+	sample.pid = tidq->prev_packet->pid;
+	sample.tid = tidq->prev_packet->tid;
 	sample.addr = cs_etm__first_executed_instr(tidq->packet);
 	sample.id = etmq->etm->branches_id;
 	sample.stream_id = etmq->etm->branches_id;
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index aa9bb4a32ecaf..6ba47604b8c52 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -10,6 +10,7 @@
 #include "debug.h"
 #include "util/event.h"
 #include <linux/bits.h>
+#include <sys/types.h>
 
 struct perf_session;
 struct perf_pmu;
@@ -184,6 +185,20 @@ struct cs_etm_packet {
 	u8 last_instr_size;
 	u8 trace_chan_id;
 	int cpu;
+	/*
+	 * Owner identity captured at cs_etm_decoder__buffer_packet() time.
+	 * A subsequent PE_CONTEXT element will mutate tidq->thread/tidq->el
+	 * before this packet is emitted as a sample; stamping pid/tid/el on
+	 * the packet keeps each one attributed to the thread that actually
+	 * retired its instructions and the EL it ran at. Read at sample
+	 * emission time by cs_etm__synth_instruction_sample() and
+	 * cs_etm__synth_branch_sample(). 'el' holds an ocsd_ex_level value
+	 * but is typed as int so the struct does not depend on OpenCSD
+	 * headers (which are only included inside HAVE_CSTRACE_SUPPORT).
+	 */
+	pid_t pid;
+	pid_t tid;
+	int el;
 };
 
 #define CS_ETM_PACKET_MAX_BUFFER 1024
@@ -266,6 +281,16 @@ void cs_etm__etmq_set_traceid_queue_timestamp(struct cs_etm_queue *etmq,
 					      u8 trace_chan_id);
 struct cs_etm_packet_queue
 *cs_etm__etmq_get_packet_queue(struct cs_etm_queue *etmq, u8 trace_chan_id);
+/*
+ * Read the pid/tid/EL currently associated with the given trace_chan_id.
+ * Called from cs_etm_decoder__buffer_packet() to stamp owner identity on
+ * each buffered packet at buffer time, before any subsequent PE_CONTEXT
+ * (CONTEXTIDR / EL change) can mutate tidq->thread / tidq->el. The stamp
+ * is consumed at sample emission time so that each sample is attributed
+ * to the thread/EL that actually retired its instructions.
+ */
+void cs_etm__etmq_get_pid_tid_el(struct cs_etm_queue *etmq, u8 trace_chan_id,
+				 pid_t *pid, pid_t *tid, int *el);
 int cs_etm__process_auxtrace_info_full(union perf_event *event __maybe_unused,
 				       struct perf_session *session __maybe_unused);
 u64 cs_etm__convert_sample_time(struct cs_etm_queue *etmq, u64 cs_timestamp);
-- 
2.52.0



^ permalink raw reply related	[flat|nested] only message in thread

only message in thread, other threads:[~2026-05-15  2:18 UTC | newest]

Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-15  2:11 [PATCH] perf cs-etm: stamp pid/tid/EL on each buffered packet to fix cross-pid attribution Amir Ayupov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox