* [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain
@ 2026-05-26 16:59 Leo Yan
2026-05-26 16:59 ` [PATCH v6 1/8] perf cs-etm: Decode ETE exception packets Leo Yan
` (7 more replies)
0 siblings, 8 replies; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
This series adds thread-stack and synthesized callchain support for Arm
CoreSight. It derives from an older series [1] but has been heavily
rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, and flushes thread
stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed. This avoids carrying
stale call/return history across the discontinuity.
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because execution after the exception return
can resume one instruction ahead. The stack can still recover when a
later return matches an upper caller. Emulated instructions are not a
common target for performance callchain analysis, and supporting them
would require extending the common thread-stack path to accept both the
real target address and an adjusted address for stack matching, so this
series leaves that extra complexity out.
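To illustrate the recovery behaviour described above, here is a toy
sketch in C. It is not the perf thread-stack code: the toy_ names, the
fixed stack size and the matching policy are all assumptions made for
illustration. It only models a return that misses the innermost frame
and a later return that matches an upper caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_STACK_MAX 64

/* Toy model only: one saved return address per call frame. */
struct toy_stack {
	uint64_t ret_addr[TOY_STACK_MAX];
	size_t cnt;
};

/* Record a call: remember where the matching return should land. */
static int toy_push(struct toy_stack *ts, uint64_t ret_addr)
{
	if (ts->cnt >= TOY_STACK_MAX)
		return -1;
	ts->ret_addr[ts->cnt++] = ret_addr;
	return 0;
}

/*
 * Handle a return to 'to'. Search from the innermost frame outwards for
 * a frame whose saved return address equals 'to', and drop it together
 * with every frame above it. Returns the number of frames popped; 0
 * means nothing matched and the stack is left unchanged.
 */
static size_t toy_pop(struct toy_stack *ts, uint64_t to)
{
	size_t i = ts->cnt;

	assert(ts->cnt <= TOY_STACK_MAX);
	while (i > 0) {
		i--;
		if (ts->ret_addr[i] == to) {
			size_t popped = ts->cnt - i;

			ts->cnt = i;
			return popped;
		}
	}
	return 0;
}
```

In this model, a return that lands one instruction past the saved
address pops nothing, and the stale frame is discarded only when an
outer frame's return address matches later.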
The series has been tested on an Orion6 board:
perf test 150 -vvv
150: Check Arm CoreSight synthesized callchain:
--- start ---
test child forked, pid 13528
Test callchain push: PASS
Test callchain pop: PASS
---- end(0) ----
150: Check Arm CoreSight synthesized callchain : Ok
perf script --itrace=g16i10il64
callchain_test 17468 [005] 1031003.229943: 10 instructions:
aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 17468 [005] 1031003.229943: 10 instructions:
aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 17468 [005] 1031003.229944: 10 instructions:
ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because the trace contains
many discontinuity packets (mainly NO_SYNC elements). This is likely
caused by FIFO overflow on the trace path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linaro.org/
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
Leo Yan (8):
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
.../tests/shell/test_arm_coresight_callchain.sh | 235 ++++++++++++++++
tools/perf/util/cs-etm.c | 309 ++++++++++++---------
2 files changed, 408 insertions(+), 136 deletions(-)
---
base-commit: bd2a5be1fe731bc7548205dd148db75f1d588da2
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan@arm.com>
* [PATCH v6 1/8] perf cs-etm: Decode ETE exception packets
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 16:59 ` [PATCH v6 2/8] perf cs-etm: Refactor instruction size handling Leo Yan
` (6 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users
ETE shares the same packet format as ETMv4, but exception decoding
handled ETMv4 packets only. As a result, ETE exception packets were
not classified.
Recognize the ETE magic for exception number decoding.
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 6ec48de29441012f3d827d50616349c6c0d1f037..ab79d08f5a6095448470e2c3ec85ff3db2fb5634 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -2138,7 +2138,7 @@ static bool cs_etm__is_syscall(struct cs_etm_queue *etmq,
* HVC cases; need to check if it's SVC instruction based on
* packet address.
*/
- if (magic == __perf_cs_etmv4_magic) {
+ if (magic == __perf_cs_etmv4_magic || magic == __perf_cs_ete_magic) {
if (packet->exception_number == CS_ETMV4_EXC_CALL &&
cs_etm__is_svc_instr(etmq, trace_chan_id, prev_packet,
prev_packet->end_addr))
@@ -2161,7 +2161,7 @@ static bool cs_etm__is_async_exception(struct cs_etm_traceid_queue *tidq,
packet->exception_number == CS_ETMV3_EXC_FIQ)
return true;
- if (magic == __perf_cs_etmv4_magic)
+ if (magic == __perf_cs_etmv4_magic || magic == __perf_cs_ete_magic)
if (packet->exception_number == CS_ETMV4_EXC_RESET ||
packet->exception_number == CS_ETMV4_EXC_DEBUG_HALT ||
packet->exception_number == CS_ETMV4_EXC_SYSTEM_ERROR ||
@@ -2192,7 +2192,7 @@ static bool cs_etm__is_sync_exception(struct cs_etm_queue *etmq,
packet->exception_number == CS_ETMV3_EXC_GENERIC)
return true;
- if (magic == __perf_cs_etmv4_magic) {
+ if (magic == __perf_cs_etmv4_magic || magic == __perf_cs_ete_magic) {
if (packet->exception_number == CS_ETMV4_EXC_TRAP ||
packet->exception_number == CS_ETMV4_EXC_ALIGNMENT ||
packet->exception_number == CS_ETMV4_EXC_INST_FAULT ||
--
2.34.1
* [PATCH v6 2/8] perf cs-etm: Refactor instruction size handling
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
2026-05-26 16:59 ` [PATCH v6 1/8] perf cs-etm: Decode ETE exception packets Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 16:59 ` [PATCH v6 3/8] perf cs-etm: Use thread-stack for last branch entries Leo Yan
` (5 subsequent siblings)
7 siblings, 0 replies; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
From: Leo Yan <leo.yan@linaro.org>
This patch introduces a new function cs_etm__instr_size() to calculate
the instruction size based on ISA type and instruction address.
Trace data can be megabytes in size and is most likely A64/A32 on many
platforms, so cs_etm__instr_addr() keeps a single ISA type check for
A64/A32 and uses an optimized calculation (addr + offset * 4).
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 44 +++++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 21 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index ab79d08f5a6095448470e2c3ec85ff3db2fb5634..5bff8811d61e423463b7bd4e20d599d5b5307a1a 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1347,6 +1347,17 @@ static inline int cs_etm__t32_instr_size(struct cs_etm_queue *etmq,
return ((instrBytes[1] & 0xF8) >= 0xE8) ? 4 : 2;
}
+static inline int cs_etm__instr_size(struct cs_etm_queue *etmq,
+ u8 trace_chan_id,
+ enum cs_etm_isa isa, u64 addr)
+{
+ if (isa == CS_ETM_ISA_T32)
+ return cs_etm__t32_instr_size(etmq, trace_chan_id, addr);
+
+ /* Otherwise, 4-byte instruction size for A32/A64 */
+ return 4;
+}
+
static inline u64 cs_etm__first_executed_instr(struct cs_etm_packet *packet)
{
/*
@@ -1375,19 +1386,18 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
const struct cs_etm_packet *packet,
u64 offset)
{
- if (packet->isa == CS_ETM_ISA_T32) {
- u64 addr = packet->start_addr;
+ u64 addr = packet->start_addr;
- while (offset) {
- addr += cs_etm__t32_instr_size(etmq,
- trace_chan_id, addr);
- offset--;
- }
- return addr;
- }
+ /* 4-byte instruction size for A32/A64 */
+ if (packet->isa == CS_ETM_ISA_A64 || packet->isa == CS_ETM_ISA_A32)
+ return addr + offset * 4;
- /* Assume a 4 byte instruction size (A32/A64) */
- return packet->start_addr + offset * 4;
+ while (offset) {
+ addr += cs_etm__instr_size(etmq, trace_chan_id,
+ packet->isa, addr);
+ offset--;
+ }
+ return addr;
}
static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq,
@@ -1540,16 +1550,8 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq,
return;
}
- /*
- * T32 instruction size might be 32-bit or 16-bit, decide by calling
- * cs_etm__t32_instr_size().
- */
- if (packet->isa == CS_ETM_ISA_T32)
- sample->insn_len = cs_etm__t32_instr_size(etmq, trace_chan_id,
- sample->ip);
- /* Otherwise, A64 and A32 instruction size are always 32-bit. */
- else
- sample->insn_len = 4;
+ sample->insn_len = cs_etm__instr_size(etmq, trace_chan_id,
+ packet->isa, sample->ip);
cs_etm__mem_access(etmq, trace_chan_id, sample->ip, sample->insn_len,
(void *)sample->insn, 0);
--
2.34.1
* [PATCH v6 3/8] perf cs-etm: Use thread-stack for last branch entries
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
2026-05-26 16:59 ` [PATCH v6 1/8] perf cs-etm: Decode ETE exception packets Leo Yan
2026-05-26 16:59 ` [PATCH v6 2/8] perf cs-etm: Refactor instruction size handling Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 18:32 ` sashiko-bot
2026-05-26 16:59 ` [PATCH v6 4/8] perf cs-etm: Flush thread stacks after decoder reset Leo Yan
` (4 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users
CS ETM maintains its own circular array for last branch entries, with
local helpers to update, copy and reset the branch stack. This duplicates
logic already provided by the common code.
Record branches with thread_stack__event() and synthesize the branch
stack with thread_stack__br_sample(). This removes the local last_branch_rb
buffer and position tracking. Keep the buffer number updated via
thread_stack__set_trace_nr(), which is used when exporting samples to
Python scripts.
The output should remain the same, except that be->flags.predicted is no
longer set. Since CoreSight trace does not provide branch prediction
information, clearing the flag avoids confusion.
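For readers unfamiliar with the local helpers being removed, the sketch
below is a toy version of a last-branch ring that is filled backwards
and read out newest first. The toy_ names and the fixed size are
assumptions for illustration; it is neither the removed code nor the
common thread-stack implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BR_MAX 4

struct toy_branch {
	uint64_t from;
	uint64_t to;
};

struct toy_br_ring {
	struct toy_branch entries[TOY_BR_MAX];
	size_t pos;	/* index of the most recent entry */
	size_t nr;	/* number of valid entries, at most TOY_BR_MAX */
};

static void toy_br_record(struct toy_br_ring *rb, uint64_t from, uint64_t to)
{
	/* Fill backwards so that entries[pos] is always the newest branch. */
	if (!rb->pos)
		rb->pos = TOY_BR_MAX;
	rb->pos--;

	rb->entries[rb->pos].from = from;
	rb->entries[rb->pos].to = to;

	if (rb->nr < TOY_BR_MAX)
		rb->nr++;
}

/* Copy up to 'sz' entries into dst, newest first; returns the count copied. */
static size_t toy_br_snapshot(const struct toy_br_ring *rb,
			      struct toy_branch *dst, size_t sz)
{
	size_t n = rb->nr < sz ? rb->nr : sz;
	size_t i;

	assert(rb->pos < TOY_BR_MAX);
	for (i = 0; i < n; i++)
		dst[i] = rb->entries[(rb->pos + i) % TOY_BR_MAX];
	return n;
}
```

A zero-initialized struct toy_br_ring is a valid empty ring. Once the
ring is full, each new record overwrites the oldest entry.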
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 152 +++++++++++++----------------------------------
1 file changed, 41 insertions(+), 111 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 5bff8811d61e423463b7bd4e20d599d5b5307a1a..398ab3b7a429d402cc8e5f6cccb35c0b7c253732 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -83,14 +83,13 @@ struct cs_etm_auxtrace {
struct cs_etm_traceid_queue {
u8 trace_chan_id;
u64 period_instructions;
- size_t last_branch_pos;
union perf_event *event_buf;
struct thread *thread;
struct thread *prev_packet_thread;
ocsd_ex_level prev_packet_el;
ocsd_ex_level el;
+ unsigned int br_stack_sz;
struct branch_stack *last_branch;
- struct branch_stack *last_branch_rb;
struct cs_etm_packet *prev_packet;
struct cs_etm_packet *packet;
struct cs_etm_packet_queue packet_queue;
@@ -635,9 +634,8 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
tidq->last_branch = zalloc(sz);
if (!tidq->last_branch)
goto out_free;
- tidq->last_branch_rb = zalloc(sz);
- if (!tidq->last_branch_rb)
- goto out_free;
+
+ tidq->br_stack_sz = etm->synth_opts.last_branch_sz;
}
tidq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE);
@@ -647,7 +645,6 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
return 0;
out_free:
- zfree(&tidq->last_branch_rb);
zfree(&tidq->last_branch);
zfree(&tidq->prev_packet);
zfree(&tidq->packet);
@@ -941,7 +938,6 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq)
thread__zput(tidq->prev_packet_thread);
zfree(&tidq->event_buf);
zfree(&tidq->last_branch);
- zfree(&tidq->last_branch_rb);
zfree(&tidq->prev_packet);
zfree(&tidq->packet);
zfree(&tidq);
@@ -1281,57 +1277,6 @@ static int cs_etm__queue_first_cs_timestamp(struct cs_etm_auxtrace *etm,
return ret;
}
-static inline
-void cs_etm__copy_last_branch_rb(struct cs_etm_queue *etmq,
- struct cs_etm_traceid_queue *tidq)
-{
- struct branch_stack *bs_src = tidq->last_branch_rb;
- struct branch_stack *bs_dst = tidq->last_branch;
- size_t nr = 0;
-
- /*
- * Set the number of records before early exit: ->nr is used to
- * determine how many branches to copy from ->entries.
- */
- bs_dst->nr = bs_src->nr;
-
- /*
- * Early exit when there is nothing to copy.
- */
- if (!bs_src->nr)
- return;
-
- /*
- * As bs_src->entries is a circular buffer, we need to copy from it in
- * two steps. First, copy the branches from the most recently inserted
- * branch ->last_branch_pos until the end of bs_src->entries buffer.
- */
- nr = etmq->etm->synth_opts.last_branch_sz - tidq->last_branch_pos;
- memcpy(&bs_dst->entries[0],
- &bs_src->entries[tidq->last_branch_pos],
- sizeof(struct branch_entry) * nr);
-
- /*
- * If we wrapped around at least once, the branches from the beginning
- * of the bs_src->entries buffer and until the ->last_branch_pos element
- * are older valid branches: copy them over. The total number of
- * branches copied over will be equal to the number of branches asked by
- * the user in last_branch_sz.
- */
- if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) {
- memcpy(&bs_dst->entries[nr],
- &bs_src->entries[0],
- sizeof(struct branch_entry) * tidq->last_branch_pos);
- }
-}
-
-static inline
-void cs_etm__reset_last_branch_rb(struct cs_etm_traceid_queue *tidq)
-{
- tidq->last_branch_pos = 0;
- tidq->last_branch_rb->nr = 0;
-}
-
static inline int cs_etm__t32_instr_size(struct cs_etm_queue *etmq,
u8 trace_chan_id, u64 addr)
{
@@ -1400,38 +1345,6 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
return addr;
}
-static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq,
- struct cs_etm_traceid_queue *tidq)
-{
- struct branch_stack *bs = tidq->last_branch_rb;
- struct branch_entry *be;
-
- /*
- * The branches are recorded in a circular buffer in reverse
- * chronological order: we start recording from the last element of the
- * buffer down. After writing the first element of the stack, move the
- * insert position back to the end of the buffer.
- */
- if (!tidq->last_branch_pos)
- tidq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz;
-
- tidq->last_branch_pos -= 1;
-
- be = &bs->entries[tidq->last_branch_pos];
- be->from = cs_etm__last_executed_instr(tidq->prev_packet);
- be->to = cs_etm__first_executed_instr(tidq->packet);
- /* No support for mispredict */
- be->flags.mispred = 0;
- be->flags.predicted = 1;
-
- /*
- * Increment bs->nr until reaching the number of last branches asked by
- * the user on the command line.
- */
- if (bs->nr < etmq->etm->synth_opts.last_branch_sz)
- bs->nr += 1;
-}
-
static int cs_etm__inject_event(struct cs_etm_auxtrace *etm, union perf_event *event,
struct perf_sample *sample, u64 type)
{
@@ -1579,6 +1492,37 @@ static inline u64 cs_etm__resolve_sample_time(struct cs_etm_queue *etmq,
return etm->latest_kernel_timestamp;
}
+static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
+ struct cs_etm_traceid_queue *tidq)
+{
+ u64 from, to;
+ int size;
+
+ if (!tidq->prev_packet->last_instr_taken_branch)
+ return;
+
+ if (tidq->prev_packet->sample_type != CS_ETM_RANGE ||
+ tidq->packet->sample_type != CS_ETM_RANGE)
+ return;
+
+ if (etmq->etm->synth_opts.last_branch) {
+ from = cs_etm__last_executed_instr(tidq->prev_packet);
+ to = cs_etm__first_executed_instr(tidq->packet);
+
+ size = cs_etm__instr_size(etmq, tidq->trace_chan_id,
+ tidq->prev_packet->isa, from);
+
+ /* Enable callchain so thread stack entry can be allocated */
+ thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
+ tidq->prev_packet->flags, from, to, size,
+ etmq->buffer->buffer_nr + 1, true,
+ tidq->br_stack_sz, 0);
+ } else {
+ thread_stack__set_trace_nr(tidq->thread, tidq->prev_packet->cpu,
+ etmq->buffer->buffer_nr + 1);
+ }
+}
+
static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
struct cs_etm_traceid_queue *tidq,
u64 addr, u64 period)
@@ -1608,8 +1552,12 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
- if (etm->synth_opts.last_branch)
+ if (etm->synth_opts.last_branch) {
+ thread_stack__br_sample(tidq->thread, tidq->packet->cpu,
+ tidq->last_branch,
+ tidq->br_stack_sz);
sample.branch_stack = tidq->last_branch;
+ }
if (etm->synth_opts.inject) {
ret = cs_etm__inject_event(etm, event, &sample,
@@ -1798,14 +1746,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
tidq->period_instructions += tidq->packet->instr_count;
- /*
- * Record a branch when the last instruction in
- * PREV_PACKET is a branch.
- */
- if (etm->synth_opts.last_branch &&
- tidq->prev_packet->sample_type == CS_ETM_RANGE &&
- tidq->prev_packet->last_instr_taken_branch)
- cs_etm__update_last_branch_rb(etmq, tidq);
+ cs_etm__add_stack_event(etmq, tidq);
if (etm->synth_opts.instructions &&
tidq->period_instructions >= etm->instructions_sample_period) {
@@ -1864,10 +1805,6 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
u64 offset = etm->instructions_sample_period - instrs_prev;
u64 addr;
- /* Prepare last branches for instruction sample */
- if (etm->synth_opts.last_branch)
- cs_etm__copy_last_branch_rb(etmq, tidq);
-
while (tidq->period_instructions >=
etm->instructions_sample_period) {
/*
@@ -1947,10 +1884,6 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
etmq->etm->synth_opts.instructions &&
tidq->prev_packet->sample_type == CS_ETM_RANGE) {
u64 addr;
-
- /* Prepare last branches for instruction sample */
- cs_etm__copy_last_branch_rb(etmq, tidq);
-
/*
* Generate a last branch event for the branches left in the
* circular buffer at the end of the trace.
@@ -1982,7 +1915,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
/* Reset last branches after flush the trace */
if (etm->synth_opts.last_branch)
- cs_etm__reset_last_branch_rb(tidq);
+ thread_stack__flush(tidq->thread);
return err;
}
@@ -2006,9 +1939,6 @@ static int cs_etm__end_block(struct cs_etm_queue *etmq,
tidq->prev_packet->sample_type == CS_ETM_RANGE) {
u64 addr;
- /* Prepare last branches for instruction sample */
- cs_etm__copy_last_branch_rb(etmq, tidq);
-
/*
* Use the address of the end of the last reported execution
* range.
--
2.34.1
* [PATCH v6 4/8] perf cs-etm: Flush thread stacks after decoder reset
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
` (2 preceding siblings ...)
2026-05-26 16:59 ` [PATCH v6 3/8] perf cs-etm: Use thread-stack for last branch entries Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 19:10 ` sashiko-bot
2026-05-26 16:59 ` [PATCH v6 5/8] perf cs-etm: Support call indentation Leo Yan
` (3 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users
Perf resets the CoreSight decoder when moving to a new AUX trace buffer,
which causes a global trace discontinuity.
For callchain synthesis, keeping thread-stack state after decoder reset
can leave stale call/return history attached to threads that are decoded
later, producing incorrect synthesized callchains.
Flush all host thread stacks after a decoder reset. When virtualization
is present, flush the guest thread stacks as well.
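The flush follows a for-each-thread callback pattern. The sketch below
shows that pattern in isolation over a plain array. All toy_ names are
hypothetical, and the stack_depth field merely stands in for the
per-thread call/return history.

```c
#include <assert.h>
#include <stddef.h>

struct toy_thread {
	int tid;
	size_t stack_depth;	/* stands in for the call/return history */
};

typedef int (*toy_thread_cb)(struct toy_thread *thread, void *data);

/* Visit every thread; stop early if the callback reports an error. */
static int toy_for_each_thread(struct toy_thread *threads, size_t nr,
			       toy_thread_cb cb, void *data)
{
	size_t i;
	int ret;

	assert(cb != NULL);
	for (i = 0; i < nr; i++) {
		ret = cb(&threads[i], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Callback: drop the history of one thread. */
static int toy_flush_stack_cb(struct toy_thread *thread, void *data)
{
	(void)data;	/* unused in this sketch */
	thread->stack_depth = 0;
	return 0;
}

/* After a (toy) decoder reset, no thread may keep stale history. */
static int toy_flush_all_stacks(struct toy_thread *threads, size_t nr)
{
	return toy_for_each_thread(threads, nr, toy_flush_stack_cb, NULL);
}
```

Flushing every thread, not only the one currently being decoded, is the
point: a thread scheduled later would otherwise resume with history
recorded before the discontinuity.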
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 398ab3b7a429d402cc8e5f6cccb35c0b7c253732..ea2424175558ddc0a6f20a9de6c30f377facdc52 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1956,6 +1956,37 @@ static int cs_etm__end_block(struct cs_etm_queue *etmq,
return 0;
}
+
+static int cs_etm__flush_stack_cb(struct thread *thread,
+ void *data __maybe_unused)
+{
+ thread_stack__flush(thread);
+ return 0;
+}
+
+static void cs_etm__flush_machine_stack(struct cs_etm_queue *etmq, pid_t pid)
+{
+ struct machine *machine;
+
+ machine = machines__find(&etmq->etm->session->machines, pid);
+ if (machine)
+ machine__for_each_thread(machine, cs_etm__flush_stack_cb, NULL);
+}
+
+static void cs_etm__flush_all_stack(struct cs_etm_queue *etmq)
+{
+ enum cs_etm_pid_fmt pid_fmt = cs_etm__get_pid_fmt(etmq);
+
+ if (!etmq->etm->synth_opts.last_branch)
+ return;
+
+ cs_etm__flush_machine_stack(etmq, HOST_KERNEL_ID);
+
+ /* Clear the guest stack if virtualization is supported */
+ if (pid_fmt == CS_ETM_PIDFMT_CTXTID2)
+ cs_etm__flush_machine_stack(etmq, DEFAULT_GUEST_KERNEL_ID);
+}
+
/*
* cs_etm__get_data_block: Fetch a block from the auxtrace_buffer queue
* if need be.
@@ -1978,6 +2009,12 @@ static int cs_etm__get_data_block(struct cs_etm_queue *etmq)
ret = cs_etm_decoder__reset(etmq->decoder);
if (ret)
return ret;
+
+ /*
+ * Since the decoder is reset, this causes a global trace
+ * discontinuity. Flush all thread stacks.
+ */
+ cs_etm__flush_all_stack(etmq);
}
return etmq->buf_len;
--
2.34.1
* [PATCH v6 5/8] perf cs-etm: Support call indentation
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
` (3 preceding siblings ...)
2026-05-26 16:59 ` [PATCH v6 4/8] perf cs-etm: Flush thread stacks after decoder reset Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 19:42 ` sashiko-bot
2026-05-26 16:59 ` [PATCH v6 6/8] perf cs-etm: Filter synthesized branch samples Leo Yan
` (2 subsequent siblings)
7 siblings, 1 reply; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
From: Leo Yan <leo.yan@linaro.org>
This commit supports the "callindent" field to reflect the call stack
depth.
The branch stack is used by both call indentation and the last branch
record, which are separate features. Use a new flag "use_thread_stack"
to track whether branches need to be recorded in the thread stack.
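A minimal sketch of the two ideas in this commit message follows: one
flag derived from two independent options (as the diff does for
use_thread_stack), and a call depth that can drive indentation. The
toy_ names and the depth rule are assumptions for illustration, not the
perf script implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_synth_opts {
	bool thread_stack;	/* requested for call indentation */
	bool last_branch;	/* requested for last branch records */
};

/* Either feature alone is enough to require branch recording. */
static bool toy_use_thread_stack(const struct toy_synth_opts *opts)
{
	assert(opts != NULL);
	return opts->thread_stack || opts->last_branch;
}

enum toy_branch_kind { TOY_BR_OTHER, TOY_BR_CALL, TOY_BR_RETURN };

/* Depth to use for the next sample after a branch of the given kind. */
static size_t toy_next_depth(size_t depth, enum toy_branch_kind kind)
{
	if (kind == TOY_BR_CALL)
		return depth + 1;
	/* Never go below zero: the trace may start in the middle of a call. */
	if (kind == TOY_BR_RETURN && depth > 0)
		return depth - 1;
	return depth;
}
```

Deriving the flag once keeps each call site to a single check, whichever
feature asked for the recording.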
Before:
perf script -F +callindent
callchain_test 9187 [002] 599611.826599: 1 branches: main ffff83312258 __libc_start_call_main+0x78 (/usr/lib/aarch64-linux-gnu/libc.so.6)
callchain_test 9187 [002] 599611.826599: 1 branches: foo aaaae3ed07c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: print aaaae3ed07ac foo+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: do_svc aaaae3ed0794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: aaaae3ed077c do_svc+0x14 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: vectors aaaae3ed0780 do_svc+0x18 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: ffff800080010c00 vectors+0x400 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080010c24 vectors+0x424 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff8000800114dc el0t_64_sync+0xd4 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff8000800114f8 el0t_64_sync+0xf0 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080011528 el0t_64_sync+0x120 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080011538 el0t_64_sync+0x130 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: ffff800080011568 el0t_64_sync+0x160 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff80008001159c el0t_64_sync+0x194 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: ffff800081829110 el0t_64_sync_handler+0x18 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff800081829140 el0t_64_sync_handler+0x48 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0_svc ffff800081829194 el0t_64_sync_handler+0x9c ([kernel.kallsyms])
After:
callchain_test 9187 [002] 599611.826599: 1 branches: main ffff83312258 __libc_start_call_main+0x78 (/usr/lib/aarch64-linux-gnu/libc.so.6)
callchain_test 9187 [002] 599611.826599: 1 branches: foo aaaae3ed07c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: print aaaae3ed07ac foo+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: do_svc aaaae3ed0794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: aaaae3ed077c do_svc+0x14 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: vectors aaaae3ed0780 do_svc+0x18 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: ffff800080010c00 vectors+0x400 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080010c24 vectors+0x424 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff8000800114dc el0t_64_sync+0xd4 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff8000800114f8 el0t_64_sync+0xf0 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080011528 el0t_64_sync+0x120 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080011538 el0t_64_sync+0x130 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: ffff800080011568 el0t_64_sync+0x160 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff80008001159c el0t_64_sync+0x194 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: ffff800081829110 el0t_64_sync_handler+0x18 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff800081829140 el0t_64_sync_handler+0x48 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0_svc ffff800081829194 el0t_64_sync_handler+0x9c ([kernel.kallsyms])
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index ea2424175558ddc0a6f20a9de6c30f377facdc52..b31d0dd46a45dc365edd7c2f9e9b2eb077ca23db 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -66,6 +66,7 @@ struct cs_etm_auxtrace {
bool snapshot_mode;
bool data_queued;
bool has_virtual_ts; /* Virtual/Kernel timestamps in the trace. */
+ bool use_thread_stack;
int num_cpu;
u64 latest_kernel_timestamp;
@@ -626,7 +627,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
if (!tidq->prev_packet)
goto out_free;
- if (etm->synth_opts.last_branch) {
+ if (etm->use_thread_stack) {
size_t sz = sizeof(struct branch_stack);
sz += etm->synth_opts.last_branch_sz *
@@ -1505,7 +1506,7 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
tidq->packet->sample_type != CS_ETM_RANGE)
return;
- if (etmq->etm->synth_opts.last_branch) {
+ if (etmq->etm->use_thread_stack) {
from = cs_etm__last_executed_instr(tidq->prev_packet);
to = cs_etm__first_executed_instr(tidq->packet);
@@ -1914,7 +1915,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
cs_etm__packet_swap(etm, tidq);
/* Reset last branches after flush the trace */
- if (etm->synth_opts.last_branch)
+ if (etm->use_thread_stack)
thread_stack__flush(tidq->thread);
return err;
@@ -1977,7 +1978,7 @@ static void cs_etm__flush_all_stack(struct cs_etm_queue *etmq)
{
enum cs_etm_pid_fmt pid_fmt = cs_etm__get_pid_fmt(etmq);
- if (!etmq->etm->synth_opts.last_branch)
+ if (!etmq->etm->use_thread_stack)
return;
cs_etm__flush_machine_stack(etmq, HOST_KERNEL_ID);
@@ -3438,6 +3439,7 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
itrace_synth_opts__set_default(&etm->synth_opts,
session->itrace_synth_opts->default_no_sample);
etm->synth_opts.callchain = false;
+ etm->synth_opts.thread_stack = session->itrace_synth_opts->thread_stack;
}
etm->session = session;
@@ -3489,6 +3491,10 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
etm->tc.cap_user_time_zero = tc->cap_user_time_zero;
etm->tc.cap_user_time_short = tc->cap_user_time_short;
}
+
+ etm->use_thread_stack = etm->synth_opts.thread_stack ||
+ etm->synth_opts.last_branch;
+
err = cs_etm__synth_events(etm, session);
if (err)
goto err_free_queues;
--
2.34.1
* [PATCH v6 6/8] perf cs-etm: Filter synthesized branch samples
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
` (4 preceding siblings ...)
2026-05-26 16:59 ` [PATCH v6 5/8] perf cs-etm: Support call indentation Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 16:59 ` [PATCH v6 7/8] perf cs-etm: Synthesize callchains for instruction samples Leo Yan
2026-05-26 16:59 ` [PATCH v6 8/8] perf test: Add Arm CoreSight callchain test Leo Yan
7 siblings, 0 replies; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
From: Leo Yan <leo.yan@linaro.org>
CS ETM currently emits branch samples for every decoded branch when
branch synthesis is enabled. This delivers redundant info when users
request only call or return branches.
Add a branch filter derived from the itrace "calls" and "returns" options.
When no filter is set, keep the existing behavior and emit all branch
samples. When call or return filtering is requested, only synthesize branch
samples whose flags match the selected branch types, including trace
begin and end markers.
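The filtering rule described above can be sketched in isolation as follows. This is only an illustration: the FLAG_* values are hypothetical stand-ins for the PERF_IP_FLAG_* bits, and build_filter() and keep_branch() are made-up helper names, not functions in cs-etm.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real PERF_IP_FLAG_* values are defined by perf. */
enum {
	FLAG_CALL        = 1u << 0,
	FLAG_RETURN      = 1u << 1,
	FLAG_TRACE_BEGIN = 1u << 2,
	FLAG_TRACE_END   = 1u << 3,
	FLAG_JUMP        = 1u << 4,
};

/* Build the filter mask from the "calls" and "returns" options. */
static uint32_t build_filter(bool calls, bool returns)
{
	uint32_t filter = 0;

	if (calls)
		filter |= FLAG_CALL | FLAG_TRACE_BEGIN | FLAG_TRACE_END;
	if (returns)
		filter |= FLAG_RETURN | FLAG_TRACE_BEGIN | FLAG_TRACE_END;

	return filter;
}

/* An empty filter keeps every branch; otherwise at least one bit must match. */
static bool keep_branch(uint32_t filter, uint32_t branch_flags)
{
	return !filter || (filter & branch_flags) != 0;
}
```

With only "calls" selected, a plain jump or a return is dropped while calls and trace begin/end markers are kept.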
Before:
perf script -F +callindent
callchain_test 9187 [002] 599611.826599: 1 branches: main ffff83312258 __libc_start_call_main+0x78 (/usr/lib/aarch64-linux-gnu/libc.so.6)
callchain_test 9187 [002] 599611.826599: 1 branches: foo aaaae3ed07c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: print aaaae3ed07ac foo+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: do_svc aaaae3ed0794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: aaaae3ed077c do_svc+0x14 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: vectors aaaae3ed0780 do_svc+0x18 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: ffff800080010c00 vectors+0x400 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080010c24 vectors+0x424 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff8000800114dc el0t_64_sync+0xd4 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff8000800114f8 el0t_64_sync+0xf0 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080011528 el0t_64_sync+0x120 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826600: 1 branches: ffff800080011538 el0t_64_sync+0x130 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: ffff800080011568 el0t_64_sync+0x160 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff80008001159c el0t_64_sync+0x194 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: ffff800081829110 el0t_64_sync_handler+0x18 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff800081829140 el0t_64_sync_handler+0x48 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0_svc ffff800081829194 el0t_64_sync_handler+0x9c ([kernel.kallsyms])
After:
callchain_test 9187 [002] 599611.826599: 1 branches: main ffff83312258 __libc_start_call_main+0x78 (/usr/lib/aarch64-linux-gnu/libc.so.6)
callchain_test 9187 [002] 599611.826599: 1 branches: foo aaaae3ed07c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: print aaaae3ed07ac foo+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: do_svc aaaae3ed0794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826599: 1 branches: vectors aaaae3ed0780 do_svc+0x18 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 9187 [002] 599611.826601: 1 branches: el0t_64_sync_handler ffff80008001159c el0t_64_sync+0x194 ([kernel.kallsyms])
callchain_test 9187 [002] 599611.826601: 1 branches: el0_svc ffff800081829194 el0t_64_sync_handler+0x9c ([kernel.kallsyms])
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index b31d0dd46a45dc365edd7c2f9e9b2eb077ca23db..8d98e772ecb307381b5ed1b4bbc4056e8779b261 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -71,6 +71,7 @@ struct cs_etm_auxtrace {
int num_cpu;
u64 latest_kernel_timestamp;
u32 auxtrace_type;
+ u32 branches_filter;
u64 branches_sample_type;
u64 branches_id;
u64 instructions_sample_type;
@@ -1596,6 +1597,10 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
} dummy_bs;
u64 ip;
+ if (etm->branches_filter &&
+ !(etm->branches_filter & tidq->prev_packet->flags))
+ return 0;
+
ip = cs_etm__last_executed_instr(tidq->prev_packet);
event->sample.header.type = PERF_RECORD_SAMPLE;
@@ -3442,6 +3447,16 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
etm->synth_opts.thread_stack = session->itrace_synth_opts->thread_stack;
}
+ if (etm->synth_opts.calls)
+ etm->branches_filter |= PERF_IP_FLAG_CALL |
+ PERF_IP_FLAG_TRACE_BEGIN |
+ PERF_IP_FLAG_TRACE_END;
+
+ if (etm->synth_opts.returns)
+ etm->branches_filter |= PERF_IP_FLAG_RETURN |
+ PERF_IP_FLAG_TRACE_BEGIN |
+ PERF_IP_FLAG_TRACE_END;
+
etm->session = session;
etm->num_cpu = num_cpu;
--
2.34.1
* [PATCH v6 7/8] perf cs-etm: Synthesize callchains for instruction samples
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
` (5 preceding siblings ...)
2026-05-26 16:59 ` [PATCH v6 6/8] perf cs-etm: Filter synthesized branch samples Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 16:59 ` [PATCH v6 8/8] perf test: Add Arm CoreSight callchain test Leo Yan
7 siblings, 0 replies; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
From: Leo Yan <leo.yan@linaro.org>
CS ETM already records branches into the thread stack, but instruction
samples do not carry synthesized callchains. As a result, the itrace
option 'g' is not supported and produces no callchain output.
Allocate a callchain buffer per queue and use thread_stack__sample()
when synthesizing instruction samples. Advertise PERF_SAMPLE_CALLCHAIN
on the synthetic instruction event.
Allocate the callchain stack with one more entry than requested, as the
first entry is reserved for storing context information.
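The sizing rule above, as a minimal sketch. struct callchain_buf is a simplified stand-in for perf's struct ip_callchain (a count followed by a flexible array of addresses), and both helper names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for struct ip_callchain: count plus flexible array. */
struct callchain_buf {
	uint64_t nr;
	uint64_t ips[];
};

/* Bytes needed for callchain_sz entries plus one reserved context entry. */
static size_t callchain_buf_size(size_t callchain_sz)
{
	return sizeof(struct callchain_buf) +
	       (callchain_sz + 1) * sizeof(uint64_t);
}

/* Zero-initialized allocation, so an unused buffer reads as an empty chain. */
static struct callchain_buf *callchain_buf_alloc(size_t callchain_sz)
{
	return calloc(1, callchain_buf_size(callchain_sz));
}
```

The same size expression has to be used wherever the buffer is later cleared, otherwise the reserved entry is left untouched.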
After:
perf script --itrace=g16l64i100
callchain_test 9187 [002] 599611.826599: 1 instructions:
aaaae3ed0774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaae3ed0798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaae3ed07b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaae3ed07c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff8331225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff8331233c call_init+0x9c (inlined)
ffff8331233c __libc_start_main_impl+0x9c (inlined)
aaaae3ed0670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
tools/perf/util/cs-etm.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8d98e772ecb307381b5ed1b4bbc4056e8779b261..90e0beb910156093d8bd0f320bb0210aca95dd26 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -17,6 +17,7 @@
#include <stdlib.h>
#include "auxtrace.h"
+#include "callchain.h"
#include "color.h"
#include "cs-etm.h"
#include "cs-etm-decoder/cs-etm-decoder.h"
@@ -85,6 +86,7 @@ struct cs_etm_auxtrace {
struct cs_etm_traceid_queue {
u8 trace_chan_id;
u64 period_instructions;
+ u64 kernel_start;
union perf_event *event_buf;
struct thread *thread;
struct thread *prev_packet_thread;
@@ -92,6 +94,7 @@ struct cs_etm_traceid_queue {
ocsd_ex_level el;
unsigned int br_stack_sz;
struct branch_stack *last_branch;
+ struct ip_callchain *callchain;
struct cs_etm_packet *prev_packet;
struct cs_etm_packet *packet;
struct cs_etm_packet_queue packet_queue;
@@ -640,6 +643,16 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
tidq->br_stack_sz = etm->synth_opts.last_branch_sz;
}
+ if (etm->synth_opts.callchain) {
+ size_t sz = sizeof(struct ip_callchain);
+
+ /* Add 1 to callchain_sz for callchain context */
+ sz += (etm->synth_opts.callchain_sz + 1) * sizeof(u64);
+ tidq->callchain = zalloc(sz);
+ if (!tidq->callchain)
+ goto out_free;
+ }
+
tidq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE);
if (!tidq->event_buf)
goto out_free;
@@ -647,6 +660,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
return 0;
out_free:
+ zfree(&tidq->callchain);
zfree(&tidq->last_branch);
zfree(&tidq->prev_packet);
zfree(&tidq->packet);
@@ -939,6 +953,7 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq)
thread__zput(tidq->thread);
thread__zput(tidq->prev_packet_thread);
zfree(&tidq->event_buf);
+ zfree(&tidq->callchain);
zfree(&tidq->last_branch);
zfree(&tidq->prev_packet);
zfree(&tidq->packet);
@@ -1431,6 +1446,7 @@ static void cs_etm__set_thread(struct cs_etm_queue *etmq,
tidq->thread = machine__idle_thread(machine);
tidq->el = el;
+ tidq->kernel_start = machine__kernel_start(machine);
}
int cs_etm__etmq_set_tid_el(struct cs_etm_queue *etmq, pid_t tid,
@@ -1561,6 +1577,25 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
sample.branch_stack = tidq->last_branch;
}
+ if (etm->synth_opts.callchain) {
+ if (tidq->kernel_start)
+ thread_stack__sample(tidq->thread, tidq->packet->cpu,
+ tidq->callchain,
+ etm->synth_opts.callchain_sz + 1,
+ sample.ip, tidq->kernel_start);
+ else
+ /*
+ * Clear the callchain when the kernel start address is
+ * not available yet. The empty callchain can then be
+ * consumed by cs_etm__inject_event().
+ */
+ memset(tidq->callchain, 0,
+ sizeof(struct ip_callchain) +
+ (etm->synth_opts.callchain_sz + 1) * sizeof(u64));
+
+ sample.callchain = tidq->callchain;
+ }
+
if (etm->synth_opts.inject) {
ret = cs_etm__inject_event(etm, event, &sample,
etm->instructions_sample_type);
@@ -1724,6 +1759,9 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
attr.branch_sample_type |= PERF_SAMPLE_BRANCH_HW_INDEX;
}
+ if (etm->synth_opts.callchain)
+ attr.sample_type |= PERF_SAMPLE_CALLCHAIN;
+
if (etm->synth_opts.instructions) {
attr.config = PERF_COUNT_HW_INSTRUCTIONS;
attr.sample_period = etm->synth_opts.period;
@@ -3457,6 +3495,14 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
PERF_IP_FLAG_TRACE_BEGIN |
PERF_IP_FLAG_TRACE_END;
+ if (etm->synth_opts.callchain && !symbol_conf.use_callchain) {
+ symbol_conf.use_callchain = true;
+ if (callchain_register_param(&callchain_param) < 0) {
+ symbol_conf.use_callchain = false;
+ etm->synth_opts.callchain = false;
+ }
+ }
+
etm->session = session;
etm->num_cpu = num_cpu;
@@ -3508,7 +3554,8 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
}
etm->use_thread_stack = etm->synth_opts.thread_stack ||
- etm->synth_opts.last_branch;
+ etm->synth_opts.last_branch ||
+ etm->synth_opts.callchain;
err = cs_etm__synth_events(etm, session);
if (err)
--
2.34.1
* [PATCH v6 8/8] perf test: Add Arm CoreSight callchain test
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
` (6 preceding siblings ...)
2026-05-26 16:59 ` [PATCH v6 7/8] perf cs-etm: Synthesize callchains for instruction samples Leo Yan
@ 2026-05-26 16:59 ` Leo Yan
2026-05-26 20:56 ` sashiko-bot
7 siblings, 1 reply; 13+ messages in thread
From: Leo Yan @ 2026-05-26 16:59 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
Al Grant, Paschalis Mpeis, Amir Ayupov
Cc: linux-arm-kernel, coresight, linux-perf-users
Add a shell test for synthesized callchains from Arm CoreSight trace.
The test runs only on arm64 systems with the cs_etm event and gcc
available.
Build a small test program that issues a syscall, record it with
CoreSight trace, and decode the data with itrace callchain synthesis
enabled. Verify that the callchain is pushed and popped as expected.
After:
perf test 150 -vvv
150: Check Arm CoreSight synthesized callchain:
--- start ---
test child forked, pid 13528
Test callchain push: PASS
Test callchain pop: PASS
---- end(0) ----
150: Check Arm CoreSight synthesized callchain : Ok
Assisted-by: Codex:GPT-5.5
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
.../tests/shell/test_arm_coresight_callchain.sh | 235 +++++++++++++++++++++
1 file changed, 235 insertions(+)
diff --git a/tools/perf/tests/shell/test_arm_coresight_callchain.sh b/tools/perf/tests/shell/test_arm_coresight_callchain.sh
new file mode 100755
index 0000000000000000000000000000000000000000..0e5a5d1129ae7d34f8e0c5942fb62d27db3e862d
--- /dev/null
+++ b/tools/perf/tests/shell/test_arm_coresight_callchain.sh
@@ -0,0 +1,235 @@
+#!/bin/bash
+# Check Arm CoreSight synthesized callchain (exclusive)
+# SPDX-License-Identifier: GPL-2.0
+
+glb_err=1
+
+if ! tmpdir=$(mktemp -d /tmp/perf-cs-callchain-test.XXXXXX); then
+ echo "mktemp failed"
+ exit 1
+fi
+
+cleanup_files()
+{
+ rm -rf "$tmpdir"
+}
+
+trap cleanup_files EXIT
+trap 'cleanup_files; exit $glb_err' TERM INT
+
+skip_if_system_is_not_ready()
+{
+ [ "$(uname -m)" = "aarch64" ] || {
+ echo "Skip: arm64 only test" >&2
+ return 2
+ }
+
+ perf list | grep -q 'cs_etm//' || {
+ echo "Skip: cs_etm event is not available" >&2
+ return 2
+ }
+
+ command -v gcc >/dev/null 2>&1 || {
+ echo "Skip: gcc is not available" >&2
+ return 2
+ }
+
+ return 0
+}
+
+build_test_program()
+{
+ local src=$1
+ local bin=$2
+
+ gcc -g -O0 -o "$bin" "$src"
+}
+
+record_trace()
+{
+ local bin=$1
+ local data=$2
+ local script=$3
+
+ perf record -m ,32M -o "$data" --per-thread -e cs_etm// -- "$bin" >/dev/null 2>&1 &&
+ perf script --itrace=g16i10il64 -i "$data" > "$script"
+}
+
+check_regex()
+{
+ local name=$1
+ local regex=$2
+ local script=$3
+
+ if grep -Pzo "$regex" "$script" >/dev/null; then
+ echo "Test $name: PASS"
+ return 0
+ else
+ echo "Test $name: FAIL"
+ return 1
+ fi
+}
+
+run_test()
+{
+ local name=$1
+ local src=$tmpdir/$name.S
+ local bin=$tmpdir/$name
+ local data=$tmpdir/perf.$name.data
+ local script=$tmpdir/perf.$name.script
+ local regex
+
+ "${name}_src" "$src"
+
+ if ! build_test_program "$src" "$bin"; then
+ echo "$name: build failed"
+ return
+ fi
+
+ if ! record_trace "$bin" "$data" "$script"; then
+ echo "$name: perf record/script failed"
+ return
+ fi
+
+ regex=$("${name}_push_regex")
+ check_regex "${name} push" "$regex" "$script" || return
+
+ regex=$("${name}_pop_regex")
+ check_regex "${name} pop" "$regex" "$script" || return
+
+ glb_err=0
+}
+
+callchain_src()
+{
+ cat > "$1" <<'EOF'
+/* callchain.S */
+ .text
+
+ .global do_svc
+ .type do_svc, %function
+do_svc:
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
+ mov x0, #1
+ adr x1, msg
+ mov x2, #23
+ mov x8, #64
+
+ nop
+ nop // Pad nops for 9 insns before svc
+
+ b 1f
+1:
+ svc #0
+
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop // Pad nops for 10 insns after svc
+
+ ldp x29, x30, [sp], #16
+ ret
+ .size do_svc, .-do_svc
+
+ .global foo
+ .type foo, %function
+foo:
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop // Pad nops for 9 insns before call
+
+ bl do_svc
+
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop
+ nop // Pad nops for 10 insns after call
+
+ ldp x29, x30, [sp], #16
+ ret
+ .size foo, .-foo
+
+ .global main
+ .type main, %function
+main:
+ stp x29, x30, [sp, #-16]!
+ mov x29, sp
+
+ bl foo
+
+ mov w0, #0
+ ldp x29, x30, [sp], #16
+ .size main, .-main
+ ret
+
+ .section .rodata
+msg:
+ .asciz "hello from svc syscall\n"
+EOF
+}
+
+callchain_push_regex()
+{
+ printf '%s' \
+'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ foo\+0x[[:xdigit:]]+ \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ main\+0xc \(.*/callchain\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'\
+'\n'\
+'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ do_svc\+0x[[:xdigit:]]+ \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ foo\+0x28 \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ main\+0xc \(.*/callchain\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'\
+'\n'\
+'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ (vectors|el.*_64_sync|tramp_vectors)\+0x[[:xdigit:]]+ \(\[kernel\.kallsyms\]\)\n'\
+'[[:space:]]+[[:xdigit:]]+ do_svc\+0x28 \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ foo\+0x28 \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ main\+0xc \(.*/callchain\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'
+}
+
+callchain_pop_regex()
+{
+ printf '%s' \
+'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ (ret_to_user|tramp_exit)\+0x[[:xdigit:]]+ \(\[kernel\.kallsyms\]\)\n'\
+'[[:space:]]+[[:xdigit:]]+ do_svc\+0x28 \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ foo\+0x28 \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ main\+0xc \(.*/callchain\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'\
+'\n'\
+'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ do_svc\+0x[[:xdigit:]]+ \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ foo\+0x28 \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ main\+0xc \(.*/callchain\)\n' \
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'\
+'\n'\
+'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ foo\+0x[[:xdigit:]]+ \(.*/callchain\)\n'\
+'[[:space:]]+[[:xdigit:]]+ main\+0xc \(.*/callchain\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'
+}
+
+skip_if_system_is_not_ready || exit 2
+
+run_test "callchain"
+
+exit $glb_err
--
2.34.1
* Re: [PATCH v6 3/8] perf cs-etm: Use thread-stack for last branch entries
2026-05-26 16:59 ` [PATCH v6 3/8] perf cs-etm: Use thread-stack for last branch entries Leo Yan
@ 2026-05-26 18:32 ` sashiko-bot
0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-05-26 18:32 UTC (permalink / raw)
To: Leo Yan; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
New issues:
- [High] Flushing the thread branch stack via thread_stack__flush() globally resets branch state for all CPUs a thread has run on, corrupting concurrent branch traces.
- [High] Unconditional NULL pointer dereference of `tidq->thread` in `cs_etm__flush()` when flushing the thread stack.
- [Medium] Forcing the creation of a full callstack introduces hidden CPU and memory overhead even when only branch tracking is requested.
Pre-existing issues:
- [Critical] Heap buffer overflow in thread_stack__br_sample() when copying branch entries.
- [High] NULL pointer dereference in thread_stack__update_br_stack() when processing mixed perf events.
--
commit 423cf82c1cfc3330f4878f3ed31b1cf46de41008
Author: Leo Yan <leo.yan@arm.com>
perf cs-etm: Use thread-stack for last branch entries
CS ETM maintains its own circular array for last branch entries, with
local helpers to update, copy and reset the branch stack. This duplicates
logic already provided by the common code.
Record branch with thread_stack__event() and synthesize branch stack
with thread_stack__br_sample(). This removes the local last_branch_rb
buffer and position tracking.
> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
> --- a/tools/perf/util/cs-etm.c
> +++ b/tools/perf/util/cs-etm.c
[ ... ]
> @@ -1579,6 +1492,37 @@ static inline u64 cs_etm__resolve_sample_time(
> + /* Enable callchain so thread stack entry can be allocated */
> + thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
> + tidq->prev_packet->flags, from, to, size,
> + etmq->buffer->buffer_nr + 1, true,
> + tidq->br_stack_sz, 0);
[Severity: High]
This is a pre-existing issue, but is there a NULL pointer dereference in
thread_stack__update_br_stack() when processing mixed perf events?
If a thread stack is initially created by a normal PERF_RECORD_SAMPLE event
via thread_stack__process(), it is initialized with br_stack_sz = 0.
Consequently, ts->br_stack_rb is left as NULL.
When an auxtrace record (from intel-pt or cs-etm) is processed subsequently,
thread_stack__event() bypasses initialization because the thread stack already
exists.
It then unconditionally calls thread_stack__update_br_stack() since
tidq->br_stack_sz > 0, which dereferences ts->br_stack_rb->entries[...],
causing an immediate crash.
[Severity: Medium]
Does passing true for the callstack parameter force the creation of a full
callstack?
This causes thread_stack__event() to parse PERF_IP_FLAG_CALL and
PERF_IP_FLAG_RETURN events and maintain a full function call stack in memory.
Could this introduce hidden CPU and memory overhead even when users only
requested the lightweight --last-branch tracking?
[ ... ]
> @@ -1608,8 +1552,12 @@ static int cs_etm__synth_instruction_sample(
>
> cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
>
> - if (etm->synth_opts.last_branch)
> + if (etm->synth_opts.last_branch) {
> + thread_stack__br_sample(tidq->thread, tidq->packet->cpu,
> + tidq->last_branch,
> + tidq->br_stack_sz);
[Severity: Critical]
This is a pre-existing issue, but does thread_stack__br_sample() have a heap
buffer overflow when copying branch entries?
If ts->br_stack_sz is greater than the size of the destination buffer, the
wrap-around logic is triggered.
The second memcpy intended to copy the remaining elements is:
memcpy(be, &src->entries[0], bsz * ts->br_stack_pos);
It completely ignores the calculated nr bounds and copies up to ts->br_stack_sz
entries into the destination buffer. For Intel PT, ts->br_stack_sz can be 1024
while the destination size is 16.
Can this write hundreds of branch entries past the end of the array,
leading to a heap buffer overflow?
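For illustration only, a bounded two-part copy out of a ring buffer could look like the sketch below. It is not the thread-stack code: the function name, the layout (newest entry at `pos`, older entries at increasing indices with wrap-around) and the assumption that the ring is full are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy the newest entries of a full ring buffer into dst without ever
 * writing more than dst_sz entries. Returns the number of entries copied.
 */
static size_t ring_copy_newest(uint64_t *dst, size_t dst_sz,
			       const uint64_t *ring, size_t ring_sz,
			       size_t pos)
{
	size_t nr, first, second;

	if (!dst_sz || !ring_sz)
		return 0;
	assert(pos < ring_sz);

	/* Never copy more than the destination can hold. */
	nr = dst_sz < ring_sz ? dst_sz : ring_sz;

	/* First part: from pos to the end of the ring, clamped to nr. */
	first = ring_sz - pos;
	if (first > nr)
		first = nr;
	memcpy(dst, &ring[pos], first * sizeof(*dst));

	/* Second part: wrap to index 0, limited to what is left of nr. */
	second = nr - first;
	memcpy(dst + first, ring, second * sizeof(*dst));

	return nr;
}
```

The point of the sketch is that both memcpy() lengths are derived from nr, so the second copy cannot exceed the destination regardless of the ring size.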
[ ... ]
> @@ -1982,7 +1915,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
>
> /* Reset last branches after flush the trace */
> if (etm->synth_opts.last_branch)
> - cs_etm__reset_last_branch_rb(tidq);
> + thread_stack__flush(tidq->thread);
[Severity: High]
Does thread_stack__flush() globally reset the branch state for all CPUs a
thread has run on?
For threads that can run on multiple CPUs concurrently (e.g., the idle thread),
thread->ts points to an array of thread stacks for each CPU, and
thread_stack__flush() iterates over ts->arr_sz and resets all of them.
Can this cause a trace discontinuity on one CPU's queue to wipe out the active
branch histories and call chains of the same thread on all other CPUs?
[Severity: High]
Can tidq->thread be NULL here?
In cs_etm__init_traceid_queue(), tidq->thread is initialized via
machine__findnew_thread(), which may return NULL on allocation failure.
If tidq->thread is NULL, won't thread_stack__flush() cause an immediate NULL
pointer dereference?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49f53c9dd@arm.com?part=3
* Re: [PATCH v6 4/8] perf cs-etm: Flush thread stacks after decoder reset
2026-05-26 16:59 ` [PATCH v6 4/8] perf cs-etm: Flush thread stacks after decoder reset Leo Yan
@ 2026-05-26 19:10 ` sashiko-bot
0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-05-26 19:10 UTC (permalink / raw)
To: Leo Yan; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Flushing all thread stacks globally across the machine upon a per-queue decoder reset corrupts callchains for threads executing on other CPUs.
- [Medium] The thread stack flush callback silently ignores errors returned by `thread_stack__flush()`.
--
commit 5bb375ff87340381fc1d45a283d0256f7738d4d0
Author: Leo Yan <leo.yan@arm.com>
perf cs-etm: Flush thread stacks after decoder reset
Perf resets the CoreSight decoder when moving to a new AUX trace buffer,
causing a global trace discontinuity. To prevent stale call and return
history from attaching to later threads, this patch flushes all host and
guest thread stacks after a decoder reset.
> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
> index 398ab3b7a429d..ea2424175558d 100644
> --- a/tools/perf/util/cs-etm.c
> +++ b/tools/perf/util/cs-etm.c
[ ... ]
> +static int cs_etm__flush_stack_cb(struct thread *thread,
> + void *data __maybe_unused)
> +{
> + thread_stack__flush(thread);
> + return 0;
> +}
[Severity: Medium]
Does this callback safely ignore errors from thread_stack__flush()?
thread_stack__flush() can return error codes such as -ENOMEM when memory
allocation fails during call and return synthesis. Returning 0 unconditionally
prevents machine__for_each_thread() from aborting early, but it also
silently drops the error.
Could this mask underlying synthesis failures?
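One possible shape for this, sketched with hypothetical types rather than the perf structures: the callback keeps returning 0 so every thread is still flushed, but records the first error through its data pointer so the caller can report it. fake_thread, fake_flush() and flush_all() are made up for the example.

```c
#include <stddef.h>

/* Hypothetical thread: flush_err is what its flush would return. */
struct fake_thread {
	int flush_err;
	int flushed;
};

static int fake_flush(struct fake_thread *t)
{
	t->flushed = 1;
	return t->flush_err;
}

/* Flush one thread; remember the first error without stopping iteration. */
static int flush_cb(struct fake_thread *t, void *data)
{
	int *first_err = data;
	int err = fake_flush(t);

	if (err && !*first_err)
		*first_err = err;
	return 0;
}

/* Visit every thread and return the first flush error, or 0. */
static int flush_all(struct fake_thread *threads, size_t nr)
{
	int first_err = 0;

	for (size_t i = 0; i < nr; i++) {
		int ret = flush_cb(&threads[i], &first_err);

		if (ret)
			return ret;
	}
	return first_err;
}
```

Whether aborting early or flushing everything is preferable here is a design decision for the series; the sketch only shows that the error need not be lost in either case.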
> +
> +static void cs_etm__flush_machine_stack(struct cs_etm_queue *etmq, pid_t pid)
> +{
> + struct machine *machine;
> +
> + machine = machines__find(&etmq->etm->session->machines, pid);
> + if (machine)
> + machine__for_each_thread(machine, cs_etm__flush_stack_cb, NULL);
> +}
[Severity: Medium]
Does flushing all threads globally cause issues for other CPUs?
In CoreSight tracing, traces are decoded using per-CPU queues. When one
cs_etm_queue encounters a trace discontinuity, calling
machine__for_each_thread() will flush the call stacks for all threads in
the machine.
Could this destroy the call and return state of threads that are actively
executing and being decoded on other queues, resulting in corrupted
synthesized callchains for those unrelated threads?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49f53c9dd@arm.com?part=4
* Re: [PATCH v6 5/8] perf cs-etm: Support call indentation
2026-05-26 16:59 ` [PATCH v6 5/8] perf cs-etm: Support call indentation Leo Yan
@ 2026-05-26 19:42 ` sashiko-bot
0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-05-26 19:42 UTC (permalink / raw)
To: Leo Yan; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The condition for swapping packets in `cs_etm__packet_swap()` was not updated to include the new `use_thread_stack` flag, which breaks the `callindent` feature if branch and instruction sampling are disabled.
--
commit 36aa886e2c5be33efb8d11e03a02df7c74dce99b
Author: Leo Yan <leo.yan@linux.dev>
perf cs-etm: Support call indentation
This commit introduces a use_thread_stack flag to track whether the branch
stack needs to be recorded. This allows call indentation to function
independently of the last branch record feature.
[ ... ]
> diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
> index ea2424175558d..b31d0dd46a45d 100644
> --- a/tools/perf/util/cs-etm.c
> +++ b/tools/perf/util/cs-etm.c
[ ... ]
> @@ -1914,7 +1915,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
> cs_etm__packet_swap(etm, tidq);
[Severity: Medium]
Will this swap logic still work if branch and instruction sampling are
disabled?
Looking at cs_etm__packet_swap(), the condition to perform the swap was not
updated to check the new use_thread_stack flag:
tools/perf/util/cs-etm.c:cs_etm__packet_swap() {
if (etm->synth_opts.branches || etm->synth_opts.last_branch ||
etm->synth_opts.instructions) {
...
// swap packets
}
}
If a user runs perf script -F +callindent without enabling branches,
last_branch, or instructions, cs_etm__packet_swap() won't swap the packets.
Does the previous packet remaining empty prevent the thread stack from being
populated, causing callindent to silently fail?
Could cs_etm__packet_swap() be updated to check use_thread_stack as well?
>
> /* Reset last branches after flush the trace */
> - if (etm->synth_opts.last_branch)
> + if (etm->use_thread_stack)
> thread_stack__flush(tidq->thread);
>
> return err;
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49f53c9dd@arm.com?part=5
* Re: [PATCH v6 8/8] perf test: Add Arm CoreSight callchain test
2026-05-26 16:59 ` [PATCH v6 8/8] perf test: Add Arm CoreSight callchain test Leo Yan
@ 2026-05-26 20:56 ` sashiko-bot
0 siblings, 0 replies; 13+ messages in thread
From: sashiko-bot @ 2026-05-26 20:56 UTC (permalink / raw)
To: Leo Yan; +Cc: linux-perf-users
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] The regular expressions for matching `perf script` output do not account for the timestamp field, guaranteeing test failure.
- [Low] The `.size` directive for the `main` function incorrectly excludes its `ret` instruction.
- [Low] The `adr` instruction is used to reference a symbol in a different section, which can cause a linker error.
--
commit c68e5facc9f76be5097ddc178f00bfa18d0f188f
Author: Leo Yan <leo.yan@arm.com>
perf test: Add Arm CoreSight callchain test
This commit adds a shell test for synthesized callchains from Arm CoreSight
trace data. It builds a test program for syscalls, records them, and
decodes with itrace callchain synthesis enabled to verify the push and pop
callchains.
> diff --git a/tools/perf/tests/shell/test_arm_coresight_callchain.sh b/tools/perf/tests/shell/test_arm_coresight_callchain.sh
> new file mode 100755
> index 0000000000000..0e5a5d1129ae7
> --- /dev/null
> +++ b/tools/perf/tests/shell/test_arm_coresight_callchain.sh
[ ... ]
> +do_svc:
> + stp x29, x30, [sp, #-16]!
> + mov x29, sp
> +
> + mov x0, #1
> + adr x1, msg
[Severity: Low]
Can this adr instruction cause a linker error?
The adr instruction computes a pc-relative offset but has a strict +/- 1 MB
range limit. If the linker places .text and .rodata further apart than 1 MB
(due to page size alignment or hardening constraints), the build will fail
with an out-of-range relocation error.
Would it be safer to use adrp and add here?
[ ... ]
> +main:
> + stp x29, x30, [sp, #-16]!
> + mov x29, sp
> +
> + bl foo
> +
> + mov w0, #0
> + ldp x29, x30, [sp], #16
> + .size main, .-main
> + ret
[Severity: Low]
Does this .size directive incorrectly exclude the ret instruction from the
main function bounds?
Placing it before the ret instruction might cause profiling tools to fail to
correctly attribute an execution sample on that instruction to the main
function.
[ ... ]
> +callchain_push_regex()
> +{
> + printf '%s' \
> +'callchain[[:space:]]+[0-9]+ \[[0-9]+\][[:space:]]+10 instructions:[[:space:]]*\n'\
[Severity: High]
Will this regex fail to match because of the perf script timestamp field?
By default, perf script output includes a timestamp field, formatting lines
as "comm tid [cpu] time: event:". The test invokes perf script without
restricting the output fields (for example, it does not use -F,-time to
disable the timestamp).
Since this regex strictly requires spaces immediately between the cpu
identifier and the event name, it seems it will fail to match the timestamp
and cause the test to fail.
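The concern can be checked in isolation with a small sketch like the one below. It uses POSIX extended regular expressions instead of the grep -P syntax in the test, only looks at the header line, and says nothing about whether perf script actually prints a timestamp for this recording mode. The pattern names and line_matches() are hypothetical.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Header pattern as in the test: no field between "[cpu]" and the period. */
static const char *PAT_NO_TIME =
	"^callchain[[:space:]]+[0-9]+ \\[[0-9]+\\][[:space:]]+"
	"10 instructions:";

/* Variant that tolerates an optional "seconds.usecs:" timestamp field. */
static const char *PAT_OPT_TIME =
	"^callchain[[:space:]]+[0-9]+ \\[[0-9]+\\][[:space:]]+"
	"([0-9]+\\.[0-9]+:[[:space:]]+)?10 instructions:";

/* Return true if the POSIX extended regex matches the line. */
static bool line_matches(const char *pattern, const char *line)
{
	regex_t re;
	bool ok;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB))
		return false;
	ok = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return ok;
}
```

The first pattern rejects a header line that carries a timestamp, while the second accepts the line with or without one.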
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49f53c9dd@arm.com?part=8
Thread overview: 13+ messages
2026-05-26 16:59 [PATCH v6 0/8] perf cs-etm: Support thread stack and callchain Leo Yan
2026-05-26 16:59 ` [PATCH v6 1/8] perf cs-etm: Decode ETE exception packets Leo Yan
2026-05-26 16:59 ` [PATCH v6 2/8] perf cs-etm: Refactor instruction size handling Leo Yan
2026-05-26 16:59 ` [PATCH v6 3/8] perf cs-etm: Use thread-stack for last branch entries Leo Yan
2026-05-26 18:32 ` sashiko-bot
2026-05-26 16:59 ` [PATCH v6 4/8] perf cs-etm: Flush thread stacks after decoder reset Leo Yan
2026-05-26 19:10 ` sashiko-bot
2026-05-26 16:59 ` [PATCH v6 5/8] perf cs-etm: Support call indentation Leo Yan
2026-05-26 19:42 ` sashiko-bot
2026-05-26 16:59 ` [PATCH v6 6/8] perf cs-etm: Filter synthesized branch samples Leo Yan
2026-05-26 16:59 ` [PATCH v6 7/8] perf cs-etm: Synthesize callchains for instruction samples Leo Yan
2026-05-26 16:59 ` [PATCH v6 8/8] perf test: Add Arm CoreSight callchain test Leo Yan
2026-05-26 20:56 ` sashiko-bot