Linux-ARM-Kernel Archive on lore.kernel.org

Linux-ARM-Kernel Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v10 7/9] perf cs-etm: Support call indentation
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

From: Leo Yan <leo.yan@linaro.org>

The perf script callindent is derived from call stack in thread context,
CS ETM ignores the requirement for callindent without pushing and poping
call stack.

Enable thread-stack when either itrace thread-stack support or last branch
entries are requested, allocate the branch stack storage accordingly, and
feed taken branches to thread_stack__event() whenever thread-stack state
is needed.

When callindent is requested, pass callstack=true to thread_stack__event()
so the common thread-stack code maintains call depth for branch samples.

Before:

  perf script -F +callindent

  callchain_test    6543 [002]          1 branches: main                                 ffff93252258 __libc_start_call_main+0x78 (/usr/lib/aarch64-linux-gnu/libc.so.6)
  callchain_test    6543 [002]          1 branches: foo                                  aaaad6b607c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches: print                                aaaad6b607ac foo+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches: do_svc                               aaaad6b60794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches: vectors                              aaaad6b60780 do_svc+0x18 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches: el0t_64_sync_handler             ffff80008001159c el0t_64_sync+0x194 ([kernel.kallsyms])
  callchain_test    6543 [002]          1 branches: el0_svc                          ffff800081829194 el0t_64_sync_handler+0x9c ([kernel.kallsyms])
  callchain_test    6543 [002]          1 branches: lockdep_hardirqs_off             ffff800081828794 el0_svc+0x24 ([kernel.kallsyms])
  callchain_test    6543 [002]          1 branches: __this_cpu_preempt_check         ffff80008182b348 lockdep_hardirqs_off+0xf0 ([kernel.kallsyms])

After:

  callchain_test    6543 [002]          1 branches:                 main                                                 ffff93252258 __libc_start_call_main+0x78 (/usr/lib/aarch64-linux-gnu/libc.so.6)
  callchain_test    6543 [002]          1 branches:                     foo                                              aaaad6b607c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches:                         print                                        aaaad6b607ac foo+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches:                             do_svc                                   aaaad6b60794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches:                                 vectors                              aaaad6b60780 do_svc+0x18 (/home/kernel/leoy/test_cs_callchain/callchain_test)
  callchain_test    6543 [002]          1 branches:                                     el0t_64_sync_handler         ffff80008001159c el0t_64_sync+0x194 ([kernel.kallsyms])
  callchain_test    6543 [002]          1 branches:                                         el0_svc                  ffff800081829194 el0t_64_sync_handler+0x9c ([kernel.kallsyms])
  callchain_test    6543 [002]          1 branches:                                             lockdep_hardirqs_off ffff800081828794 el0_svc+0x24 ([kernel.kallsyms])
  callchain_test    6543 [002]          1 branches:                                                 __this_cpu_preempt_check                         ffff80008182b348 lockdep_hardirqs_off+0xf0 ([kernel.kallsyms])

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index e43f0c1dd00788abaed4455bc0da4723b0b36b9e..30536b919af732abeee11d4f7813dcef0b4047eb 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -67,6 +67,8 @@ struct cs_etm_auxtrace {
 	bool snapshot_mode;
 	bool data_queued;
 	bool has_virtual_ts; /* Virtual/Kernel timestamps in the trace. */
+	bool use_thread_stack;
+	bool use_callchain;
 
 	int num_cpu;
 	u64 latest_kernel_timestamp;
@@ -638,7 +640,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 	if (!tidq->prev_packet)
 		goto out_free;
 
-	if (etm->synth_opts.last_branch) {
+	if (etm->use_thread_stack) {
 		size_t sz = sizeof(struct branch_stack);
 
 		sz += etm->synth_opts.last_branch_sz *
@@ -1554,7 +1556,7 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
 	if (!cs_etm__packet_has_taken_branch(tidq->prev_packet))
 		return;
 
-	if (etmq->etm->synth_opts.last_branch) {
+	if (etmq->etm->use_thread_stack) {
 		from = cs_etm__last_executed_instr(tidq->prev_packet);
 		to = cs_etm__first_executed_instr(tidq->packet);
 
@@ -1563,7 +1565,8 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
 		/* Enable callchain so thread stack entry can be allocated */
 		thread_stack__event(tidq->frontend_thread, tidq->prev_packet->cpu,
 				    tidq->prev_packet->flags, from, to, size,
-				    etmq->buffer->buffer_nr + 1, false,
+				    etmq->buffer->buffer_nr + 1,
+				    etmq->etm->use_callchain,
 				    tidq->br_stack_sz, 0);
 	} else {
 		thread_stack__set_trace_nr(tidq->frontend_thread,
@@ -1970,7 +1973,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
 	cs_etm__packet_swap(etm, tidq);
 
 	/* Reset last branches after flush the trace */
-	if (etm->synth_opts.last_branch)
+	if (etm->use_thread_stack)
 		thread_stack__flush(tidq->frontend_thread);
 
 	return err;
@@ -2033,7 +2036,7 @@ static void cs_etm__flush_all_stack(struct cs_etm_queue *etmq)
 {
 	enum cs_etm_pid_fmt pid_fmt = cs_etm__get_pid_fmt(etmq);
 
-	if (!etmq->etm->synth_opts.last_branch)
+	if (!etmq->etm->use_thread_stack)
 		return;
 
 	switch (pid_fmt) {
@@ -3520,6 +3523,7 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
 		itrace_synth_opts__set_default(&etm->synth_opts,
 				session->itrace_synth_opts->default_no_sample);
 		etm->synth_opts.callchain = false;
+		etm->synth_opts.thread_stack = session->itrace_synth_opts->thread_stack;
 	}
 
 	if (etm->synth_opts.calls)
@@ -3581,6 +3585,12 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
 		etm->tc.cap_user_time_zero = tc->cap_user_time_zero;
 		etm->tc.cap_user_time_short = tc->cap_user_time_short;
 	}
+
+	etm->use_thread_stack = etm->synth_opts.thread_stack ||
+				etm->synth_opts.last_branch;
+
+	etm->use_callchain = etm->synth_opts.thread_stack;
+
 	err = cs_etm__synth_events(etm, session);
 	if (err)
 		goto err_free_queues;

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 9/9] perf test: Add Arm CoreSight callchain test
From: Leo Yan @ 2026-06-17 15:09 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

Add a CoreSight shell test for synthesized callchains.

The test uses the new callchain workload to generate trace and decodes
it with synthesis callchain. It then verifies that the instruction
samples show the expected callchain push and pop.

Use control FIFOs so tracing starts only around the workload, which
keeps the trace data small. The test is limited to with the cs_etm
event available and root permission.

After:

  perf test 138 -vvv
  138: CoreSight synthesized callchain:
  ---- start ----
  test child forked, pid 35581
  Callchain flow matched:
    l1=4642868 l2=4642880 l3=4642895 l4=4642919 l5=4670494 l6=4670500 l7=4670520
  ---- end(0) ----
  138: CoreSight synthesized callchain                                                                           : Ok

Assisted-by: Codex:GPT-5.5
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/Documentation/perf-test.txt        |   6 +-
 tools/perf/tests/builtin-test.c               |   1 +
 tools/perf/tests/shell/coresight/callchain.sh | 172 ++++++++++++++++++++++++++
 tools/perf/tests/tests.h                      |   1 +
 tools/perf/tests/workloads/Build              |   2 +
 tools/perf/tests/workloads/callchain.c        |  33 +++++
 6 files changed, 213 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-test.txt b/tools/perf/Documentation/perf-test.txt
index 81c8525f594680d814f80e6f88bcce8d867bb350..859df74e62efc4b1e80da13ae8e053356f68ae54 100644
--- a/tools/perf/Documentation/perf-test.txt
+++ b/tools/perf/Documentation/perf-test.txt
@@ -57,7 +57,8 @@ OPTIONS
 --workload=::
 	Run a built-in workload, to list them use '--list-workloads', current
 	ones include: noploop, thloop, leafloop, sqrtloop, brstack, datasym,
-	context_switch_loop, deterministic, named_threads and landlock.
+	context_switch_loop, deterministic, named_threads, landlock and
+	callchain.
 
 	Used with the shell script regression tests.
 
@@ -69,7 +70,8 @@ OPTIONS
 	'named_threads' accepts the number of threads and the number of loops to
 	do in each thread.
 
-	The datasym, landlock and deterministic workloads don't accept any.
+	The datasym, landlock, deterministic and callchain workloads don't accept
+	any.
 
 --list-workloads::
 	List the available workloads to use with -w/--workload.
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 7e75f590f225e3284980829707ca8d916c98cada..1d1f38127e05429a27f31beda814f2b5f5a75089 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -168,6 +168,7 @@ static struct test_workload *workloads[] = {
 	&workload__jitdump,
 	&workload__context_switch_loop,
 	&workload__deterministic,
+	&workload__callchain,
 
 #ifdef HAVE_RUST_SUPPORT
 	&workload__code_with_type,
diff --git a/tools/perf/tests/shell/coresight/callchain.sh b/tools/perf/tests/shell/coresight/callchain.sh
new file mode 100755
index 0000000000000000000000000000000000000000..13cca7dc11184002e3ddc058c0d0ffa1c458c483
--- /dev/null
+++ b/tools/perf/tests/shell/coresight/callchain.sh
@@ -0,0 +1,172 @@
+#!/bin/bash
+# CoreSight synthesized callchain (exclusive)
+# SPDX-License-Identifier: GPL-2.0
+
+glb_err=1
+
+if ! tmpdir=$(mktemp -d /tmp/perf-cs-callchain-test.XXXXXX); then
+	echo "mktemp failed"
+	exit 1
+fi
+
+cleanup_files()
+{
+	rm -rf "$tmpdir"
+}
+
+trap cleanup_files EXIT
+trap 'cleanup_files; exit $glb_err' TERM INT
+
+skip_if_system_is_not_ready()
+{
+	perf list | grep -Pzq 'cs_etm//' || {
+		echo "[Skip] cs_etm event is not available" >&2
+		return 2
+	}
+
+	# Requires root for trace in kernel
+	[ "$(id -u)" = 0 ] || {
+		echo "[Skip] No root permission" >&2
+		return 2
+	}
+
+	return 0
+}
+
+record_trace()
+{
+	local data=$1
+	local script=$2
+
+	local cf="$tmpdir/ctl"
+	local af="$tmpdir/ack"
+
+	mkfifo "$cf" "$af"
+
+	perf record -o "$data" -e cs_etm// --per-thread -D -1 --control fifo:"$cf","$af" -- \
+		perf test --record-ctl fifo:"$cf","$af" -w callchain >/dev/null 2>&1 &&
+
+	# It is safe to use 'i3i' with a three-instruction interval, since the
+	# workload is compiled with -O0.
+	perf script --itrace=g16i3il64 -i "$data" > "$script"
+}
+
+callchain_regex_1()
+{
+	printf '%s' \
+'perf[[:space:]]+[0-9]+[[:space:]]+\[[0-9]+\][[:space:]]+([0-9.]+:[[:space:]]+)?[0-9]+ instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_foo\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'
+}
+
+callchain_regex_2()
+{
+	printf '%s' \
+'perf[[:space:]]+[0-9]+[[:space:]]+\[[0-9]+\][[:space:]]+([0-9.]+:[[:space:]]+)?[0-9]+ instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_do_syscall\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_foo\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'
+}
+
+callchain_regex_3()
+{
+	printf '%s' \
+'perf[[:space:]]+[0-9]+[[:space:]]+\[[0-9]+\][[:space:]]+([0-9.]+:[[:space:]]+)?[0-9]+ instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ syscall(@plt)?\+0x[[:xdigit:]]+ \(.*\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_do_syscall\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_foo\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'
+}
+
+callchain_regex_4()
+{
+	printf '%s' \
+'perf[[:space:]]+[0-9]+[[:space:]]+\[[0-9]+\][[:space:]]+([0-9.]+:[[:space:]]+)?[0-9]+ instructions:[[:space:]]*\n'\
+'[[:space:]]+[[:xdigit:]]+ .*\+0x[[:xdigit:]]+ \(\[kernel\.kallsyms\]\)\n'\
+'[[:space:]]+[[:xdigit:]]+ syscall(@plt)?\+0x[[:xdigit:]]+ \(.*\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_do_syscall\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain_foo\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'[[:space:]]+[[:xdigit:]]+ callchain\+0x[[:xdigit:]]+ \(.*/perf\)\n'\
+'([[:space:]]+[[:xdigit:]]+ .*\n)*'
+}
+
+find_after_line()
+{
+	local regex="$1"
+	local file="$2"
+	local start="$3"
+	local offset
+	local line
+
+	# Search in byte offset
+	offset=$(
+		tail -n +"$start" "$file" |
+		grep -Pzob -m1 "$regex" |
+		tr '\0' '\n' |
+		sed -n 's/^\([0-9][0-9]*\):.*/\1/p;q'
+	)
+
+	if [ -z "$offset" ]; then
+		echo "Failed to match regex after line $start" >&2
+		echo "Regex:" >&2
+		printf '%s\n' "$regex" >&2
+		echo "Context from line $start:" >&2
+		sed -n "${start},$((start + 100))p" "$file" >&2
+		return 1
+	fi
+
+	# Convert from offset to line
+	line=$(
+		tail -n +"$start" "$file" |
+		head -c "$offset" |
+		wc -l
+	)
+
+	echo "$((start + line))"
+}
+
+check_callchain_flow()
+{
+	local file="$1"
+	local l1 l2 l3 l4 l5 l6 l7
+
+	# Callchain push
+	l1=$(find_after_line "$(callchain_regex_1)" "$file" 1) || return 1
+	l2=$(find_after_line "$(callchain_regex_2)" "$file" "$((l1 + 1))") || return 1
+	l3=$(find_after_line "$(callchain_regex_3)" "$file" "$((l2 + 1))") || return 1
+	l4=$(find_after_line "$(callchain_regex_4)" "$file" "$((l3 + 1))") || return 1
+
+	# Callchain pop
+	l5=$(find_after_line "$(callchain_regex_3)" "$file" "$((l4 + 1))") || return 1
+	l6=$(find_after_line "$(callchain_regex_2)" "$file" "$((l5 + 1))") || return 1
+	l7=$(find_after_line "$(callchain_regex_1)" "$file" "$((l6 + 1))") || return 1
+
+	echo "Callchain flow matched:"
+	echo "  l1=$l1 l2=$l2 l3=$l3 l4=$l4 l5=$l5 l6=$l6 l7=$l7"
+
+	return 0
+}
+
+run_test()
+{
+	local data=$tmpdir/perf.data
+	local script=$tmpdir/perf.script
+
+	if ! record_trace "$data" "$script"; then
+		echo "perf record/script failed"
+		return
+	fi
+
+	check_callchain_flow "$script" || return
+
+	glb_err=0
+}
+
+skip_if_system_is_not_ready || exit 2
+
+run_test
+
+exit $glb_err
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 7cedf05be544ad79a99e86d30dfa4f7b01ca0837..cee9e6b62dcc838c864bbe76efe3b638ed75b134 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -248,6 +248,7 @@ DECLARE_WORKLOAD(inlineloop);
 DECLARE_WORKLOAD(jitdump);
 DECLARE_WORKLOAD(context_switch_loop);
 DECLARE_WORKLOAD(deterministic);
+DECLARE_WORKLOAD(callchain);
 
 #ifdef HAVE_RUST_SUPPORT
 DECLARE_WORKLOAD(code_with_type);
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index 7bb4b9829ba245740c8967e6bf3235614cdd55a3..048e371eb63e316453b6b46ebd0a02794c3d25d7 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -13,6 +13,7 @@ perf-test-y += inlineloop.o
 perf-test-y += jitdump.o
 perf-test-y += context_switch_loop.o
 perf-test-y += deterministic.o
+perf-test-y += callchain.o
 
 ifeq ($(CONFIG_RUST_SUPPORT),y)
     perf-test-y += code_with_type.o
@@ -27,3 +28,4 @@ CFLAGS_traploop.o         = -g -O0 -fno-inline -U_FORTIFY_SOURCE
 CFLAGS_inlineloop.o       = -g -O2
 CFLAGS_deterministic.o    = -g -O0 -fno-inline -U_FORTIFY_SOURCE
 CFLAGS_named_threads.o    = -g -O0 -fno-inline -U_FORTIFY_SOURCE
+CFLAGS_callchain.o        = -g -O0 -fno-inline -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/callchain.c b/tools/perf/tests/workloads/callchain.c
new file mode 100644
index 0000000000000000000000000000000000000000..abbb406ba90b30a012620f46c6e84a5661066b76
--- /dev/null
+++ b/tools/perf/tests/workloads/callchain.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/compiler.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+#include "../tests.h"
+
+/*
+ * Mark as noinline to establish the call chain, and avoid the static
+ * annotation to prevent LTO from renaming the functions.
+ */
+noinline void callchain_do_syscall(void);
+noinline void callchain_foo(void);
+noinline int callchain(int argc, const char **argv);
+
+noinline void callchain_do_syscall(void)
+{
+	syscall(SYS_gettid);
+}
+
+noinline void callchain_foo(void)
+{
+	callchain_do_syscall();
+}
+
+noinline int callchain(int argc __maybe_unused,
+		       const char **argv __maybe_unused)
+{
+	callchain_foo();
+
+	return 0;
+}
+
+DEFINE_WORKLOAD(callchain);

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 8/9] perf cs-etm: Synthesize callchains for instruction samples
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

From: Leo Yan <leo.yan@linaro.org>

CS ETM already records branches into the thread stack, but instruction
samples do not carry synthesized callchains. It misses to support the
callchain and no output with the itrace option 'g'.

Allocate a callchain buffer per queue and use thread_stack__sample()
when synthesizing instruction samples.

Advertise PERF_SAMPLE_CALLCHAIN on the synthetic instruction event.
Allocate one extra callchain entry than requested, as the first entry
is reserved for storing context information.

cs_etm__context() is introduced for handling context packet and update
the thread info and start kernel address for frontend decoding.

After:

  perf script --itrace=g16l64i1i

  callchain_test    6543 [002]          1 instructions:
        ffff800080010c14 vectors+0x414 ([kernel.kallsyms])
            aaaad6b60784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
            aaaad6b60798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
            aaaad6b607b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
            aaaad6b607c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
            ffff9325225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
            ffff9325233c call_init+0x9c (inlined)
            ffff9325233c __libc_start_main_impl+0x9c (inlined)
            aaaad6b60670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
        ffff800080012290 ret_to_user+0x120 ([kernel.kallsyms])

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 83 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 30536b919af732abeee11d4f7813dcef0b4047eb..67885ed43bab9cfd47bdb30c598c4e36d0e191fb 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -18,6 +18,7 @@
 #include <stdlib.h>
 
 #include "auxtrace.h"
+#include "callchain.h"
 #include "color.h"
 #include "cs-etm.h"
 #include "cs-etm-decoder/cs-etm-decoder.h"
@@ -87,9 +88,11 @@ struct cs_etm_auxtrace {
 struct cs_etm_traceid_queue {
 	u8 trace_chan_id;
 	u64 period_instructions;
+	u64 kernel_start;
 	union perf_event *event_buf;
 	unsigned int br_stack_sz;
 	struct branch_stack *last_branch;
+	struct ip_callchain *callchain;
 	struct cs_etm_packet *prev_packet;
 	struct cs_etm_packet *packet;
 	struct cs_etm_packet_queue packet_queue;
@@ -652,6 +655,15 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 		tidq->br_stack_sz = etm->synth_opts.last_branch_sz;
 	}
 
+	if (etm->synth_opts.callchain) {
+		/* Add 1 to callchain_sz for callchain context */
+		tidq->callchain =
+			zalloc(struct_size(tidq->callchain, ips,
+					   etm->synth_opts.callchain_sz + 1));
+		if (!tidq->callchain)
+			goto out_free;
+	}
+
 	tidq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE);
 	if (!tidq->event_buf)
 		goto out_free;
@@ -659,6 +671,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 	return 0;
 
 out_free:
+	zfree(&tidq->callchain);
 	zfree(&tidq->last_branch);
 	zfree(&tidq->prev_packet);
 	zfree(&tidq->packet);
@@ -942,6 +955,7 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq)
 		thread__zput(tidq->frontend_thread);
 		thread__zput(tidq->decode_thread);
 		zfree(&tidq->event_buf);
+		zfree(&tidq->callchain);
 		zfree(&tidq->last_branch);
 		zfree(&tidq->prev_packet);
 		zfree(&tidq->packet);
@@ -1584,6 +1598,26 @@ static void cs_etm__sample_branch_stack(struct cs_etm_auxtrace *etm,
 					tidq->last_branch, tidq->br_stack_sz);
 		sample->branch_stack = tidq->last_branch;
 	}
+
+	if (etm->synth_opts.callchain) {
+		if (tidq->kernel_start)
+			thread_stack__sample(tidq->frontend_thread,
+					     tidq->packet->cpu,
+					     tidq->callchain,
+					     etm->synth_opts.callchain_sz + 1,
+					     sample->ip, tidq->kernel_start);
+		else
+			/*
+			 * Clear the callchain when the kernel start address is
+			 * not available yet. The empty callchain can then be
+			 * consumed by cs_etm__inject_event().
+			 */
+			memset(tidq->callchain, 0,
+			       struct_size(tidq->callchain, ips,
+					   etm->synth_opts.callchain_sz + 1));
+
+		sample->callchain = tidq->callchain;
+	}
 }
 
 static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
@@ -1779,6 +1813,9 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 		attr.branch_sample_type |= PERF_SAMPLE_BRANCH_HW_INDEX;
 	}
 
+	if (etm->synth_opts.callchain)
+		attr.sample_type |= PERF_SAMPLE_CALLCHAIN;
+
 	if (etm->synth_opts.instructions) {
 		attr.config = PERF_COUNT_HW_INSTRUCTIONS;
 		attr.sample_period = etm->synth_opts.period;
@@ -1910,6 +1947,34 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 	return 0;
 }
 
+static int cs_etm__context(struct cs_etm_queue *etmq,
+			   struct cs_etm_traceid_queue *tidq)
+{
+	ocsd_ex_level el = tidq->packet->el;
+	struct machine *machine;
+	int ret;
+
+	machine = cs_etm__get_machine(etmq, el);
+	if (!machine) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	tidq->kernel_start = machine__kernel_start(machine);
+
+	ret = cs_etm__etmq_update_thread(etmq, el, tidq->packet->tid,
+					 &tidq->frontend_thread);
+	if (ret)
+		goto err;
+
+	return 0;
+
+err:
+	thread__zput(tidq->frontend_thread);
+	tidq->kernel_start = 0;
+	return ret;
+}
+
 static int cs_etm__exception(struct cs_etm_traceid_queue *tidq)
 {
 	/*
@@ -2510,9 +2575,7 @@ static int cs_etm__process_traceid_queue(struct cs_etm_queue *etmq,
 			 * tracing the kernel the context packet will be emitted
 			 * between two ranges.
 			 */
-			ret = cs_etm__etmq_update_thread(etmq, tidq->packet->el,
-							 tidq->packet->tid,
-							 &tidq->frontend_thread);
+			ret = cs_etm__context(etmq, tidq);
 			if (ret)
 				goto out;
 			break;
@@ -3536,6 +3599,14 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
 					PERF_IP_FLAG_TRACE_BEGIN |
 					PERF_IP_FLAG_TRACE_END;
 
+	if (etm->synth_opts.callchain && !symbol_conf.use_callchain) {
+		symbol_conf.use_callchain = true;
+		if (callchain_register_param(&callchain_param) < 0) {
+			symbol_conf.use_callchain = false;
+			etm->synth_opts.callchain = false;
+		}
+	}
+
 	etm->session = session;
 
 	etm->num_cpu = num_cpu;
@@ -3587,9 +3658,11 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
 	}
 
 	etm->use_thread_stack = etm->synth_opts.thread_stack ||
-				etm->synth_opts.last_branch;
+				etm->synth_opts.last_branch ||
+				etm->synth_opts.callchain;
 
-	etm->use_callchain = etm->synth_opts.thread_stack;
+	etm->use_callchain = etm->synth_opts.thread_stack ||
+			     etm->synth_opts.callchain;
 
 	err = cs_etm__synth_events(etm, session);
 	if (err)

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 5/9] perf cs-etm: Use thread-stack for last branch entries
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

CS ETM maintains its own circular array for last branch entries, with
local helpers to update, copy and reset the branch stack. This
duplicates logic already provided by the common code.

Record taken branches with thread_stack__event() and synthesize
PERF_SAMPLE_BRANCH_STACK data with thread_stack__br_sample(). This
removes the private last_branch_rb buffer and its position tracking.

This also makes the branch history state belong to the thread rather
than the trace queue. That is a better fit for CoreSight traces where
a trace queue can effectively be CPU scoped, while call/return history
is per thread.

Keep the buffer number updated via thread_stack__set_trace_nr(), which
is used when exporting samples to Python scripts. Pass callstack=false
for now; synthesized callchains are added by a later patch.

The output should remain same, except that be->flags.predicted is no
longer set. Since CoreSight trace does not provide branch prediction
information, clearing the flag avoids confusion.

Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 173 ++++++++++++++++-------------------------------
 1 file changed, 58 insertions(+), 115 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 6827ef8871a8fc092500b93a8284d5d162558357..5ede0f0ff8c6ec3aa10545693eb914af2e7b5285 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -85,10 +85,9 @@ struct cs_etm_auxtrace {
 struct cs_etm_traceid_queue {
 	u8 trace_chan_id;
 	u64 period_instructions;
-	size_t last_branch_pos;
 	union perf_event *event_buf;
+	unsigned int br_stack_sz;
 	struct branch_stack *last_branch;
-	struct branch_stack *last_branch_rb;
 	struct cs_etm_packet *prev_packet;
 	struct cs_etm_packet *packet;
 	struct cs_etm_packet_queue packet_queue;
@@ -647,9 +646,8 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 		tidq->last_branch = zalloc(sz);
 		if (!tidq->last_branch)
 			goto out_free;
-		tidq->last_branch_rb = zalloc(sz);
-		if (!tidq->last_branch_rb)
-			goto out_free;
+
+		tidq->br_stack_sz = etm->synth_opts.last_branch_sz;
 	}
 
 	tidq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE);
@@ -659,7 +657,6 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 	return 0;
 
 out_free:
-	zfree(&tidq->last_branch_rb);
 	zfree(&tidq->last_branch);
 	zfree(&tidq->prev_packet);
 	zfree(&tidq->packet);
@@ -944,7 +941,6 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq)
 		thread__zput(tidq->decode_thread);
 		zfree(&tidq->event_buf);
 		zfree(&tidq->last_branch);
-		zfree(&tidq->last_branch_rb);
 		zfree(&tidq->prev_packet);
 		zfree(&tidq->packet);
 		zfree(&tidq);
@@ -1304,57 +1300,6 @@ static int cs_etm__queue_first_cs_timestamp(struct cs_etm_auxtrace *etm,
 	return ret;
 }
 
-static inline
-void cs_etm__copy_last_branch_rb(struct cs_etm_queue *etmq,
-				 struct cs_etm_traceid_queue *tidq)
-{
-	struct branch_stack *bs_src = tidq->last_branch_rb;
-	struct branch_stack *bs_dst = tidq->last_branch;
-	size_t nr = 0;
-
-	/*
-	 * Set the number of records before early exit: ->nr is used to
-	 * determine how many branches to copy from ->entries.
-	 */
-	bs_dst->nr = bs_src->nr;
-
-	/*
-	 * Early exit when there is nothing to copy.
-	 */
-	if (!bs_src->nr)
-		return;
-
-	/*
-	 * As bs_src->entries is a circular buffer, we need to copy from it in
-	 * two steps.  First, copy the branches from the most recently inserted
-	 * branch ->last_branch_pos until the end of bs_src->entries buffer.
-	 */
-	nr = etmq->etm->synth_opts.last_branch_sz - tidq->last_branch_pos;
-	memcpy(&bs_dst->entries[0],
-	       &bs_src->entries[tidq->last_branch_pos],
-	       sizeof(struct branch_entry) * nr);
-
-	/*
-	 * If we wrapped around at least once, the branches from the beginning
-	 * of the bs_src->entries buffer and until the ->last_branch_pos element
-	 * are older valid branches: copy them over.  The total number of
-	 * branches copied over will be equal to the number of branches asked by
-	 * the user in last_branch_sz.
-	 */
-	if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) {
-		memcpy(&bs_dst->entries[nr],
-		       &bs_src->entries[0],
-		       sizeof(struct branch_entry) * tidq->last_branch_pos);
-	}
-}
-
-static inline
-void cs_etm__reset_last_branch_rb(struct cs_etm_traceid_queue *tidq)
-{
-	tidq->last_branch_pos = 0;
-	tidq->last_branch_rb->nr = 0;
-}
-
 static inline int cs_etm__t32_instr_size(struct cs_etm_queue *etmq,
 					 struct cs_etm_traceid_queue *tidq,
 					 struct cs_etm_packet *packet, u64 addr)
@@ -1424,38 +1369,6 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
 	return addr;
 }
 
-static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq,
-					  struct cs_etm_traceid_queue *tidq)
-{
-	struct branch_stack *bs = tidq->last_branch_rb;
-	struct branch_entry *be;
-
-	/*
-	 * The branches are recorded in a circular buffer in reverse
-	 * chronological order: we start recording from the last element of the
-	 * buffer down.  After writing the first element of the stack, move the
-	 * insert position back to the end of the buffer.
-	 */
-	if (!tidq->last_branch_pos)
-		tidq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz;
-
-	tidq->last_branch_pos -= 1;
-
-	be       = &bs->entries[tidq->last_branch_pos];
-	be->from = cs_etm__last_executed_instr(tidq->prev_packet);
-	be->to	 = cs_etm__first_executed_instr(tidq->packet);
-	/* No support for mispredict */
-	be->flags.mispred = 0;
-	be->flags.predicted = 1;
-
-	/*
-	 * Increment bs->nr until reaching the number of last branches asked by
-	 * the user on the command line.
-	 */
-	if (bs->nr < etmq->etm->synth_opts.last_branch_sz)
-		bs->nr += 1;
-}
-
 static int cs_etm__inject_event(struct cs_etm_auxtrace *etm, union perf_event *event,
 			       struct perf_sample *sample, u64 type)
 {
@@ -1619,6 +1532,57 @@ static inline u64 cs_etm__resolve_sample_time(struct cs_etm_queue *etmq,
 		return etm->latest_kernel_timestamp;
 }
 
+static bool cs_etm__packet_has_taken_branch(struct cs_etm_packet *packet)
+{
+	if (packet->sample_type == CS_ETM_RANGE &&
+	    packet->last_instr_taken_branch)
+		return true;
+
+	return false;
+}
+
+static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
+				    struct cs_etm_traceid_queue *tidq)
+{
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	u64 from, to;
+	int size;
+
+	if (!etm->synth_opts.branches && !etm->synth_opts.instructions)
+		return;
+
+	if (!cs_etm__packet_has_taken_branch(tidq->prev_packet))
+		return;
+
+	if (etmq->etm->synth_opts.last_branch) {
+		from = cs_etm__last_executed_instr(tidq->prev_packet);
+		to = cs_etm__first_executed_instr(tidq->packet);
+
+		size = cs_etm__instr_size(etmq, tidq, tidq->prev_packet, from);
+
+		/* Enable callchain so thread stack entry can be allocated */
+		thread_stack__event(tidq->frontend_thread, tidq->prev_packet->cpu,
+				    tidq->prev_packet->flags, from, to, size,
+				    etmq->buffer->buffer_nr + 1, false,
+				    tidq->br_stack_sz, 0);
+	} else {
+		thread_stack__set_trace_nr(tidq->frontend_thread,
+					   tidq->prev_packet->cpu,
+					   etmq->buffer->buffer_nr + 1);
+	}
+}
+
+static void cs_etm__sample_branch_stack(struct cs_etm_auxtrace *etm,
+					struct cs_etm_traceid_queue *tidq,
+					struct perf_sample *sample)
+{
+	if (etm->synth_opts.last_branch) {
+		thread_stack__br_sample(tidq->frontend_thread, tidq->packet->cpu,
+					tidq->last_branch, tidq->br_stack_sz);
+		sample->branch_stack = tidq->last_branch;
+	}
+}
+
 static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 					    struct cs_etm_traceid_queue *tidq,
 					    struct cs_etm_packet *packet,
@@ -1648,9 +1612,7 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 	sample.cpumode = event->sample.header.misc;
 
 	cs_etm__copy_insn(etmq, tidq, packet, &sample);
-
-	if (etm->synth_opts.last_branch)
-		sample.branch_stack = tidq->last_branch;
+	cs_etm__sample_branch_stack(etm, tidq, &sample);
 
 	if (etm->synth_opts.inject) {
 		ret = cs_etm__inject_event(etm, event, &sample,
@@ -1841,14 +1803,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 
 	tidq->period_instructions += tidq->packet->instr_count;
 
-	/*
-	 * Record a branch when the last instruction in
-	 * PREV_PACKET is a branch.
-	 */
-	if (etm->synth_opts.last_branch &&
-	    tidq->prev_packet->sample_type == CS_ETM_RANGE &&
-	    tidq->prev_packet->last_instr_taken_branch)
-		cs_etm__update_last_branch_rb(etmq, tidq);
+	cs_etm__add_stack_event(etmq, tidq);
 
 	if (etm->synth_opts.instructions &&
 	    tidq->period_instructions >= etm->instructions_sample_period) {
@@ -1907,10 +1862,6 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 		u64 offset = etm->instructions_sample_period - instrs_prev;
 		u64 addr;
 
-		/* Prepare last branches for instruction sample */
-		if (etm->synth_opts.last_branch)
-			cs_etm__copy_last_branch_rb(etmq, tidq);
-
 		while (tidq->period_instructions >=
 				etm->instructions_sample_period) {
 			/*
@@ -1941,8 +1892,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 			generate_sample = true;
 
 		/* Generate sample for branch taken packet */
-		if (tidq->prev_packet->sample_type == CS_ETM_RANGE &&
-		    tidq->prev_packet->last_instr_taken_branch)
+		if (cs_etm__packet_has_taken_branch(tidq->prev_packet))
 			generate_sample = true;
 
 		if (generate_sample) {
@@ -1990,10 +1940,6 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
 	    etmq->etm->synth_opts.instructions &&
 	    tidq->prev_packet->sample_type == CS_ETM_RANGE) {
 		u64 addr;
-
-		/* Prepare last branches for instruction sample */
-		cs_etm__copy_last_branch_rb(etmq, tidq);
-
 		/*
 		 * Generate a last branch event for the branches left in the
 		 * circular buffer at the end of the trace.
@@ -2025,7 +1971,7 @@ static int cs_etm__flush(struct cs_etm_queue *etmq,
 
 	/* Reset last branches after flush the trace */
 	if (etm->synth_opts.last_branch)
-		cs_etm__reset_last_branch_rb(tidq);
+		thread_stack__flush(tidq->frontend_thread);
 
 	return err;
 }
@@ -2049,9 +1995,6 @@ static int cs_etm__end_block(struct cs_etm_queue *etmq,
 	    tidq->prev_packet->sample_type == CS_ETM_RANGE) {
 		u64 addr;
 
-		/* Prepare last branches for instruction sample */
-		cs_etm__copy_last_branch_rb(etmq, tidq);
-
 		/*
 		 * Use the address of the end of the last reported execution
 		 * range.

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 6/9] perf cs-etm: Flush thread stacks after decoder reset
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

Perf resets the CoreSight decoder when moving to a new AUX trace buffer,
this causes trace discontinunity globally.

For callchain synthesis, keeping thread-stack state after decoder reset
can leave stale call/return history attached to threads that are decoded
later, producing incorrect synthesized callchains.

Flush all host thread stacks after a decoder reset. When virtualization
is present, flush the guest thread stacks as well.

Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 5ede0f0ff8c6ec3aa10545693eb914af2e7b5285..e43f0c1dd00788abaed4455bc0da4723b0b36b9e 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -2012,6 +2012,45 @@ static int cs_etm__end_block(struct cs_etm_queue *etmq,
 
 	return 0;
 }
+
+static int cs_etm__flush_stack_cb(struct thread *thread,
+				  void *data __maybe_unused)
+{
+	thread_stack__flush(thread);
+	return 0;
+}
+
+static void cs_etm__flush_machine_stack(struct cs_etm_queue *etmq, pid_t pid)
+{
+	struct machine *machine;
+
+	machine = machines__find(&etmq->etm->session->machines, pid);
+	if (machine)
+		machine__for_each_thread(machine, cs_etm__flush_stack_cb, NULL);
+}
+
+static void cs_etm__flush_all_stack(struct cs_etm_queue *etmq)
+{
+	enum cs_etm_pid_fmt pid_fmt = cs_etm__get_pid_fmt(etmq);
+
+	if (!etmq->etm->synth_opts.last_branch)
+		return;
+
+	switch (pid_fmt) {
+	case CS_ETM_PIDFMT_CTXTID2:
+		/* Clear the guest stack if virtualization is supported */
+		cs_etm__flush_machine_stack(etmq, DEFAULT_GUEST_KERNEL_ID);
+		fallthrough;
+	case CS_ETM_PIDFMT_CTXTID:
+		cs_etm__flush_machine_stack(etmq, HOST_KERNEL_ID);
+		break;
+	case CS_ETM_PIDFMT_NONE:
+	default:
+		break;
+
+	}
+}
+
 /*
  * cs_etm__get_data_block: Fetch a block from the auxtrace_buffer queue
  *			   if need be.
@@ -2034,6 +2073,12 @@ static int cs_etm__get_data_block(struct cs_etm_queue *etmq)
 		ret = cs_etm_decoder__reset(etmq->decoder);
 		if (ret)
 			return ret;
+
+		/*
+		 * Since the decoder is reset, this causes a global trace
+		 * discontinuity. Flush all thread stacks.
+		 */
+		cs_etm__flush_all_stack(etmq);
 	}
 
 	return etmq->buf_len;

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 4/9] perf cs-etm: Refactor instruction size handling
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

From: Leo Yan <leo.yan@linaro.org>

This patch introduces a new function cs_etm__instr_size() to calculate
the instruction size based on ISA type and instruction address.

Given the trace data can be MB and most likely that will be A64/A32 on
a lot of platforms, cs_etm__instr_addr() keeps a single ISA type check
for A64/A32 and executes an optimized calculation (addr + offset * 4).

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 43 ++++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index e2c3d2efb5982136abf9295159acab04271897a0..6827ef8871a8fc092500b93a8284d5d162558357 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1371,6 +1371,18 @@ static inline int cs_etm__t32_instr_size(struct cs_etm_queue *etmq,
 	return ((instrBytes[1] & 0xF8) >= 0xE8) ? 4 : 2;
 }
 
+static inline int cs_etm__instr_size(struct cs_etm_queue *etmq,
+				     struct cs_etm_traceid_queue *tidq,
+				     struct cs_etm_packet *packet,
+				     u64 addr)
+{
+	if (packet->isa == CS_ETM_ISA_T32)
+		return cs_etm__t32_instr_size(etmq, tidq, packet, addr);
+
+	/* Otherwise, 4-byte instruction size for A32/A64 */
+	return 4;
+}
+
 static inline u64 cs_etm__first_executed_instr(struct cs_etm_packet *packet)
 {
 	/*
@@ -1399,19 +1411,17 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
 				     struct cs_etm_packet *packet,
 				     u64 offset)
 {
-	if (packet->isa == CS_ETM_ISA_T32) {
-		u64 addr = packet->start_addr;
+	u64 addr = packet->start_addr;
 
-		while (offset) {
-			addr += cs_etm__t32_instr_size(etmq, tidq, packet,
-						       addr);
-			offset--;
-		}
-		return addr;
-	}
+	/* 4-byte instruction size for A32/A64 */
+	if (packet->isa == CS_ETM_ISA_A64 || packet->isa == CS_ETM_ISA_A32)
+		return addr + offset * 4;
 
-	/* Assume a 4 byte instruction size (A32/A64) */
-	return packet->start_addr + offset * 4;
+	while (offset) {
+		addr += cs_etm__instr_size(etmq, tidq, packet, addr);
+		offset--;
+	}
+	return addr;
 }
 
 static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq,
@@ -1581,16 +1591,7 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq,
 		return;
 	}
 
-	/*
-	 * T32 instruction size might be 32-bit or 16-bit, decide by calling
-	 * cs_etm__t32_instr_size().
-	 */
-	if (packet->isa == CS_ETM_ISA_T32)
-		sample->insn_len = cs_etm__t32_instr_size(etmq, tidq, packet,
-							  sample->ip);
-	/* Otherwise, A64 and A32 instruction size are always 32-bit. */
-	else
-		sample->insn_len = 4;
+	sample->insn_len = cs_etm__instr_size(etmq, tidq, packet, sample->ip);
 
 	cs_etm__frontend_mem_access(etmq, tidq, packet, sample->ip,
 				    sample->insn_len, (void *)sample->insn);

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 2/9] perf cs-etm: Filter synthesized branch samples
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

From: Leo Yan <leo.yan@linaro.org>

The itrace 'c' and 'r' options request synthesized branch events for
calls and returns only. For perf script the default itrace options are
"--itrace=ce", so CS ETM should emit call branches and error events by
default.

CS ETM currently synthesizes a branch sample for every decoded taken
branch whenever branch synthesis is enabled. This produces redundant
jump and conditional branch samples.

Add a branch filter derived from the itrace calls and returns options.
When neither option is set, keep the existing behavior and synthesize all
branch samples. When calls or returns are requested, emit only branch
samples whose flags match the selected branch type, while preserving trace
begin/end markers.

Also update test_arm_coresight_disasm.sh and arm-cs-trace-disasm.py
to use the --itrace=b option for generating branch samples.

Before:

  perf script -F,+flags

  callchain_test    6114 [005] 331519.825214:          1 branches:   tr strt jmp                           0 [unknown] ([unknown]) => ffff8000803a3a68 perf_report_aux_output_id+0x50 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000803a3a74 perf_report_aux_output_id+0x5c ([kernel.kallsyms]) => ffff8000817f4d88 memset+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jmp                    ffff8000817f4d8c memset+0x4 ([kernel.kallsyms]) => ffff8000817f4c00 __pi_memset_generic+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4c1c __pi_memset_generic+0x1c ([kernel.kallsyms]) => ffff8000817f4c44 __pi_memset_generic+0x44 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4c4c __pi_memset_generic+0x4c ([kernel.kallsyms]) => ffff8000817f4c5c __pi_memset_generic+0x5c ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4c5c __pi_memset_generic+0x5c ([kernel.kallsyms]) => ffff8000817f4cf0 __pi_memset_generic+0xf0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4d30 __pi_memset_generic+0x130 ([kernel.kallsyms]) => ffff8000817f4d68 __pi_memset_generic+0x168 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4d78 __pi_memset_generic+0x178 ([kernel.kallsyms]) => ffff8000817f4d6c __pi_memset_generic+0x16c ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4d78 __pi_memset_generic+0x178 ([kernel.kallsyms]) => ffff8000817f4d6c __pi_memset_generic+0x16c ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000817f4d78 __pi_memset_generic+0x178 ([kernel.kallsyms]) => ffff8000817f4d6c __pi_memset_generic+0x16c ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   return                 ffff8000817f4d84 __pi_memset_generic+0x184 ([kernel.kallsyms]) => ffff8000803a3a78 perf_report_aux_output_id+0x60 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   jcc                    ffff8000803a3a98 perf_report_aux_output_id+0x80 ([kernel.kallsyms]) => ffff8000803a3b04 perf_report_aux_output_id+0xec ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000803a3b1c perf_report_aux_output_id+0x104 ([kernel.kallsyms]) => ffff8000803a38f8 __perf_event_header__init_id+0x0 ([kernel.kallsyms])

After:

  callchain_test    6114 [005] 331519.825214:          1 branches:   tr strt jmp                           0 [unknown] ([unknown]) => ffff8000803a3a68 perf_report_aux_output_id+0x50 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000803a3a74 perf_report_aux_output_id+0x5c ([kernel.kallsyms]) => ffff8000817f4d88 memset+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000803a3b1c perf_report_aux_output_id+0x104 ([kernel.kallsyms]) => ffff8000803a38f8 __perf_event_header__init_id+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000803a39c0 __perf_event_header__init_id+0xc8 ([kernel.kallsyms]) => ffff800080105258 __task_pid_nr_ns+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff80008010528c __task_pid_nr_ns+0x34 ([kernel.kallsyms]) => ffff8000801d5610 __rcu_read_lock+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000801052b0 __task_pid_nr_ns+0x58 ([kernel.kallsyms]) => ffff800080192078 lock_acquire+0x0 ([kernel.kallsyms])
  callchain_test    6114 [005] 331519.825214:          1 branches:   call                   ffff8000801923f4 lock_acquire+0x37c ([kernel.kallsyms]) => ffff8000801d6da0 rcu_is_watching+0x0 ([kernel.kallsyms])

Fixes: b12235b113cf ("perf tools: Add mechanic to synthesise CoreSight trace packets")
Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/scripts/python/arm-cs-trace-disasm.py          |  9 +++++----
 .../tests/shell/coresight/test_arm_coresight_disasm.sh    |  4 ++--
 tools/perf/util/cs-etm.c                                  | 15 +++++++++++++++
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/tools/perf/scripts/python/arm-cs-trace-disasm.py b/tools/perf/scripts/python/arm-cs-trace-disasm.py
index 8f6fa4a007b42fcc98e71b74b36ba3a61d7acb2f..42579f8586842704d3800ad731d4609d2bb968da 100755
--- a/tools/perf/scripts/python/arm-cs-trace-disasm.py
+++ b/tools/perf/scripts/python/arm-cs-trace-disasm.py
@@ -31,18 +31,19 @@ from perf_trace_context import perf_sample_srccode, perf_config_get
 #
 # Output disassembly with objdump and auto detect vmlinux
 # (when running on same machine.):
-#  perf script -s scripts/python/arm-cs-trace-disasm.py -d
+#  perf script --itrace=b -s scripts/python/arm-cs-trace-disasm.py \
+#       -- -d
 #
 # Output disassembly with llvm-objdump:
-#  perf script -s scripts/python/arm-cs-trace-disasm.py \
+#  perf script --itrace=b -s scripts/python/arm-cs-trace-disasm.py \
 #		-- -d llvm-objdump-11 -k path/to/vmlinux
 #
 # Output accurate disassembly by passing kcore to script:
-#  perf script -s scripts/python/arm-cs-trace-disasm.py \
+#  perf script --itrace=b -s scripts/python/arm-cs-trace-disasm.py \
 #		-- -d -k perf.data/kcore_dir/kcore
 #
 # Output only source line and symbols:
-#  perf script -s scripts/python/arm-cs-trace-disasm.py
+#  perf script --itrace=b -s scripts/python/arm-cs-trace-disasm.py
 
 def default_objdump():
 	config = perf_config_get("annotate.objdump")
diff --git a/tools/perf/tests/shell/coresight/test_arm_coresight_disasm.sh b/tools/perf/tests/shell/coresight/test_arm_coresight_disasm.sh
index ccb90dda24758522be12cba27140abc9b60d8261..f3ebad5963783e9ae74be5b046d20c3f2e01a5a1 100755
--- a/tools/perf/tests/shell/coresight/test_arm_coresight_disasm.sh
+++ b/tools/perf/tests/shell/coresight/test_arm_coresight_disasm.sh
@@ -44,7 +44,7 @@ branch_search='[[:space:]](bl|b(\.(eq|ne|cs|cc|mi|pl|vs|vc|hi|ls|ge|lt|gt|le|al)
 if [ "$(id -u)" == 0 ] && [ -e /proc/kcore ]; then
 	echo "Testing kernel disassembly"
 	perf record -o ${perfdata} -e cs_etm//k --kcore -Se -m,64K -- touch $file > /dev/null 2>&1
-	perf script -i ${perfdata} -s python:${script_path} -- \
+	perf script -i ${perfdata} --itrace=b -s python:${script_path} -- \
 		-d --stop-sample=2 -k ${perfdata}/kcore_dir/kcore 2> /dev/null > ${file}
 	grep -q -E ${branch_search} ${file}
 	echo "Found kernel branches"
@@ -56,7 +56,7 @@ fi
 ## Test user ##
 echo "Testing userspace disassembly"
 perf record -o ${perfdata} -e cs_etm//u -Se -m,64K -- touch $file > /dev/null 2>&1
-perf script -i ${perfdata} -s python:${script_path} -- \
+perf script -i ${perfdata} --itrace=b -s python:${script_path} -- \
 	-d --stop-sample=2 2> /dev/null > ${file}
 grep -q -E ${branch_search} ${file}
 echo "Found userspace branches"
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index d484a6155c2c22fa916d0365987302f6bb9978e9..42de2d82fd728bcc719adcab80670efa9859762f 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -71,6 +71,7 @@ struct cs_etm_auxtrace {
 	int num_cpu;
 	u64 latest_kernel_timestamp;
 	u32 auxtrace_type;
+	u32 branches_filter;
 	u64 branches_sample_type;
 	u64 branches_id;
 	u64 instructions_sample_type;
@@ -1686,6 +1687,10 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	} dummy_bs;
 	u64 ip;
 
+	if (etm->branches_filter &&
+		!(etm->branches_filter & tidq->prev_packet->flags))
+		return 0;
+
 	ip = cs_etm__last_executed_instr(tidq->prev_packet);
 
 	event->sample.header.type = PERF_RECORD_SAMPLE;
@@ -3528,6 +3533,16 @@ int cs_etm__process_auxtrace_info_full(union perf_event *event,
 		etm->synth_opts.callchain = false;
 	}
 
+	if (etm->synth_opts.calls)
+		etm->branches_filter |= PERF_IP_FLAG_CALL |
+					PERF_IP_FLAG_TRACE_BEGIN |
+					PERF_IP_FLAG_TRACE_END;
+
+	if (etm->synth_opts.returns)
+		etm->branches_filter |= PERF_IP_FLAG_RETURN |
+					PERF_IP_FLAG_TRACE_BEGIN |
+					PERF_IP_FLAG_TRACE_END;
+
 	etm->session = session;
 
 	etm->num_cpu = num_cpu;

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 3/9] perf cs-etm: Decode ETE exception packets
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

ETE shares the same packet format as ETMv4, but exception decoding
handled ETMv4 packets only. As a result, ETE exception packets were
not classified.

Recognize the ETE magic for exception number decoding.

Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 42de2d82fd728bcc719adcab80670efa9859762f..e2c3d2efb5982136abf9295159acab04271897a0 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -2181,7 +2181,7 @@ static bool cs_etm__is_syscall(struct cs_etm_queue *etmq,
 	 * HVC cases; need to check if it's SVC instruction based on
 	 * packet address.
 	 */
-	if (magic == __perf_cs_etmv4_magic) {
+	if (magic == __perf_cs_etmv4_magic || magic == __perf_cs_ete_magic) {
 		if (packet->exception_number == CS_ETMV4_EXC_CALL &&
 		    cs_etm__is_svc_instr(etmq, tidq, prev_packet,
 					 prev_packet->end_addr))
@@ -2204,7 +2204,7 @@ static bool cs_etm__is_async_exception(struct cs_etm_traceid_queue *tidq,
 		    packet->exception_number == CS_ETMV3_EXC_FIQ)
 			return true;
 
-	if (magic == __perf_cs_etmv4_magic)
+	if (magic == __perf_cs_etmv4_magic || magic == __perf_cs_ete_magic)
 		if (packet->exception_number == CS_ETMV4_EXC_RESET ||
 		    packet->exception_number == CS_ETMV4_EXC_DEBUG_HALT ||
 		    packet->exception_number == CS_ETMV4_EXC_SYSTEM_ERROR ||
@@ -2234,7 +2234,7 @@ static bool cs_etm__is_sync_exception(struct cs_etm_queue *etmq,
 		    packet->exception_number == CS_ETMV3_EXC_GENERIC)
 			return true;
 
-	if (magic == __perf_cs_etmv4_magic) {
+	if (magic == __perf_cs_etmv4_magic || magic == __perf_cs_ete_magic) {
 		if (packet->exception_number == CS_ETMV4_EXC_TRAP ||
 		    packet->exception_number == CS_ETMV4_EXC_ALIGNMENT ||
 		    packet->exception_number == CS_ETMV4_EXC_INST_FAULT ||

-- 
2.34.1



^ permalink raw reply related

* [PATCH v10 0/9] perf cs-etm: Support thread stack and callchain
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan, Leo Yan

This series adds thread-stack and synthesized callchain support for Arm
CoreSight, which comes from older series [1] but heavily rewritten.

CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.

The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, flushes thread stacks
after decoder resets.

A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed, so avoids carrying
stale call/return history across a trace discontinuity.

One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because after exception return can be one
instruction ahead. The stack can still recover when a later return
matches an upper caller. Given emulated instructions are not the common
target for performance callchain analysis. Supporting this would require
extending the common thread-stack path to accept both the real target
address and an adjusted address for stack matching, so this series
leaves that extra complexity out.

The series has been tested on Orion6 board:

  perf test 136 -vvv
  136: CoreSight synthesized callchain:
  --- start ---
  test child forked, pid 3539
  ---- end(0) ----
  136: CoreSight synthesized callchain			: Ok

  perf script --itrace=g16i10il64

  callchain_test   17468 [005] 1031003.229943:         10 instructions:
              aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
              ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
              ffff90bd233c call_init+0x9c (inlined)
              ffff90bd233c __libc_start_main_impl+0x9c (inlined)
              aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test   17468 [005] 1031003.229943:         10 instructions:
              aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
              ffff90bd233c call_init+0x9c (inlined)
              ffff90bd233c __libc_start_main_impl+0x9c (inlined)
              aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test   17468 [005] 1031003.229944:         10 instructions:
          ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
              aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
              aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
              ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
              ffff90bd233c call_init+0x9c (inlined)
              ffff90bd233c __libc_start_main_impl+0x9c (inlined)
              aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

Note, the test fails on Juno board which is caused by many discontinuity
packets (mainly caused by NO_SYNC elem). This is likely caused by the
FIFO overflow on the path.

[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linaro.org/

Signed-off-by: Leo Yan <leo.yan@arm.com>
---
Changes in v10:
- Change to syscall(SYS_gettid) for build failure on x86 (James).
- Extracted sample thread stack into cs_etm__sample_branch_stack().
- Link to v9: https://lore.kernel.org/r/20260616-b4-arm_cs_callchain_support_v1-v9-0-f8fad931c413@arm.com

Changes in v9:
- Added patch 01 to fixed thread leak during trace queue init (sashiko).
- Added check in instruction and branch samples in
  cs_etm__add_stack_event() (sashiko).
- Released frontend_thread properly in cs_etm__context() (sashiko).
- Refined cs_etm__flush_all_stack() to use switch (sashiko).
- Gathered James' review tags.
- Rebased on the latest perf-tools-next.
- Link to v8: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v8-0-737948584fea@arm.com

Changes in v8:
- Updated test_arm_coresight_disasm.sh to pass "--itrace=b" and updated
  examples in arm-cs-trace-disasm.py (James).
- Removed static annotation in callchain workload and renamed functions
  with prefix "callchain_" to reduce naming conflict (James).
- For callchain test pre-condition check, removed the aarch64 check and
  added the root permission check (James).
- Resolved the shellcheck errors (James).
- Link to v7: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v7-0-1ba770c862ae@arm.com

Changes in v7:
- Rebased on the latest perf-tools-next.
- Used struct_size() for allocation callchain struct (James).
- Added a helper cs_etm__packet_has_taken_branch() (James).
- Minor improvements for the callchain test (used record-ctl FIFO and
  reworked the validation callstack push / pop).
- Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49f53c9dd@arm.com

Changes in v6:
- Heavily rewrote the patches since restarted the work after 6 years.
- Changed to use the common thread-stack for branch stack and callchain
  management.
- Added a callchain test.
- Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linaro.org/

Changes in v5:
- Addressed Mike's suggestion for performance improvement for function
  cs_etm__instr_addr() for quick calculation for non T32;
- Removed the patch 'perf cs-etm: Synchronize instruction sample with
  the thread stack' (Mike);
- Fixed the issue for exception is taken for branch target address
  accessing, for the branch sample and stack thread handling, the
  related patches are 01, 02, 07;
- Fixed the stack thread handling for instruction emulation and single
  step with patches 08, 09.
- Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@linaro.org/

---
Leo Yan (9):
      perf cs-etm: Fix thread leaks on trace queue init failure
      perf cs-etm: Filter synthesized branch samples
      perf cs-etm: Decode ETE exception packets
      perf cs-etm: Refactor instruction size handling
      perf cs-etm: Use thread-stack for last branch entries
      perf cs-etm: Flush thread stacks after decoder reset
      perf cs-etm: Support call indentation
      perf cs-etm: Synthesize callchains for instruction samples
      perf test: Add Arm CoreSight callchain test

 tools/perf/Documentation/perf-test.txt             |   6 +-
 tools/perf/scripts/python/arm-cs-trace-disasm.py   |   9 +-
 tools/perf/tests/builtin-test.c                    |   1 +
 tools/perf/tests/shell/coresight/callchain.sh      | 172 ++++++++++
 .../shell/coresight/test_arm_coresight_disasm.sh   |   4 +-
 tools/perf/tests/tests.h                           |   1 +
 tools/perf/tests/workloads/Build                   |   2 +
 tools/perf/tests/workloads/callchain.c             |  33 ++
 tools/perf/util/cs-etm.c                           | 377 +++++++++++++--------
 9 files changed, 454 insertions(+), 151 deletions(-)
---
base-commit: 8c214ad8cb8d692c82c6466b8e88973dbfa8e064
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc

Best regards,
-- 
Leo Yan <leo.yan@arm.com>



^ permalink raw reply

* [PATCH v10 1/9] perf cs-etm: Fix thread leaks on trace queue init failure
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, John Garry, Will Deacon, James Clark,
	Mike Leach, Suzuki K Poulose, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Al Grant, Paschalis Mpeis, Amir Ayupov
  Cc: linux-arm-kernel, coresight, linux-perf-users, Leo Yan
In-Reply-To: <20260617-b4-arm_cs_callchain_support_v1-v10-0-e8b6e5d63db5@arm.com>

cs_etm__init_traceid_queue() allocates the frontend and decode threads,
if a later allocation fails, the error path does not drop thread
reference that was already acquired.

Release both thread pointers with thread__zput() on the error path, so
does not leak thread references or leave stale pointers behind.

Fixes: 951ccccdc715 ("perf cs-etm: Only track threads instead of PID and TIDs")
Reviewed-by: James Clark <james.clark@linaro.org>
Signed-off-by: Leo Yan <leo.yan@arm.com>
---
 tools/perf/util/cs-etm.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 0927b0b9c06b15046afeafe23fe170b8248cfcc6..d484a6155c2c22fa916d0365987302f6bb9978e9 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -627,6 +627,8 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 					       queue->tid);
 	tidq->decode_thread = machine__findnew_thread(&etm->session->machines.host, -1,
 					       queue->tid);
+	if (!tidq->frontend_thread || !tidq->decode_thread)
+		goto out;
 
 	tidq->packet = zalloc(sizeof(struct cs_etm_packet));
 	if (!tidq->packet)
@@ -661,6 +663,8 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 	zfree(&tidq->prev_packet);
 	zfree(&tidq->packet);
 out:
+	thread__zput(tidq->frontend_thread);
+	thread__zput(tidq->decode_thread);
 	return rc;
 }
 

-- 
2.34.1



^ permalink raw reply related

* Re: [PATCH v7 00/13] Add support for SCMIv4.0 Powercap Extensions
From: Cristian Marussi @ 2026-06-17 15:08 UTC (permalink / raw)
  To: Philip Radford
  Cc: linux-kernel, linux-arm-kernel, arm-scmi, linux-pm, sudeep.holla,
	james.quinlan, f.fainelli, vincent.guittot, etienne.carriere,
	peng.fan, michal.simek, quic_sibis, dan.carpenter, d-gole,
	souvik.chakravarty, cristian.marussi
In-Reply-To: <20260617095910.1963578-1-philip.radford@arm.com>

On Wed, Jun 17, 2026 at 10:58:57AM +0100, Philip Radford wrote:
> Hi all,
> 
> I will be taking over this series from Cristian and in doing so I have
> addressed a couple of issues raised by the first version and added six
> additional patches since Cristian's original series:

Hi Phil,

please refrain from posting big series like this during the merge window
in the future.

Thanks,
Cristian


^ permalink raw reply

* Re: [PATCH v9 9/9] perf test: Add Arm CoreSight callchain test
From: Leo Yan @ 2026-06-17 15:08 UTC (permalink / raw)
  To: James Clark, linux-arm-kernel, coresight, linux-perf-users,
	Arnaldo Carvalho de Melo, John Garry, Will Deacon, Mike Leach,
	Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Ian Rogers, Adrian Hunter, Paschalis Mpeis, Amir Ayupov
In-Reply-To: <20260617123322.GD31870@e132581.arm.com>

On Wed, Jun 17, 2026 at 01:33:22PM +0100, Coresight ML wrote:
> On Wed, Jun 17, 2026 at 11:03:07AM +0100, James Clark wrote:
> 
> [...]
> 
> > > +	# It is safe to use 'i3i' with a three-instruction interval, since the
> > > +	# workload is compiled with -O0.
> > > +	perf script --itrace=g16i3il64 -i "$data" > "$script"
> > 
> > Is there a reason we don't generate callstacks on branch samples and use
> > --itrace=g16bl64? That removes the magic number 3 and reduces the output
> > file size and test runtime a bit.
> 
> I checked Intel-PT which does not generate callchain and branch stack for
> branch samples. I just keep cs-etm aligned.
> 
> I can add callstack / branch stack for branch samples.

Tried a bit for this.

The branch stack is skipped due the check:

  if (is_bts_event(attr)) {
          perf_sample__fprintf_bts(sample, evsel, thread, al, addr_al, machine, fp);
          return;
  }

For the callstack attached to branch samples, the output seems not
directive:

  callchain_test    4372 [003] 75596.459422:          1 branches:
            aaaaabdb0794 print+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
            aaaaabdb0798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
            aaaaabdb07b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
            aaaaabdb07c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
            ffff9a10225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
            ffff9a10233c call_init+0x9c (inlined)
            ffff9a10233c __libc_start_main_impl+0x9c (inlined)
            aaaaabdb0670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
            ffff9a2206a0 __libc_early_init+0x100 (/usr/lib/aarch64-linux-gnu/libc.so.6)
 =>     aaaaabdb0768 do_svc+0x0 (/home/kernel/leoy/test_cs_callchain/callchain_test)

It is hard to digest the log as it separates branch from address
(aaaaabdb0794 print+0x8) and to address (aaaaabdb0768 do_svc+0x0),
and put the callchain in the middle of from and to ranges.

Given this is not enabled by other hardware trace (e.g., Intel-PT),
and we need to change the common code to make it better, I'd first
enable callchain/branch stack for instruction samples. Let's see if
further requirement after get this done.

Thanks,
Leo


^ permalink raw reply

* Re: [PATCH v3 3/3] arm64: escalate smp_send_stop() to an SDEI NMI as a last resort
From: Kiryl Shutsemau @ 2026-06-17 15:07 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Catalin Marinas, Will Deacon, James Morse, Mark Rutland,
	Marc Zyngier, Petr Mladek, Thomas Gleixner, Andrew Morton,
	Baoquan He, Puranjay Mohan, Usama Arif, Breno Leitao,
	Julien Thierry, Lecopzer Chen, Sumit Garg, kernel-team, kexec,
	linux-arm-kernel, linux-kernel
In-Reply-To: <CAD=FV=WK0=xTZOWK+yDqEtGbbhkvoW50ekHKBBWhpoO9Zb8cBQ@mail.gmail.com>

On Tue, Jun 16, 2026 at 02:38:16PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Sun, Jun 14, 2026 at 7:36 PM Kiryl Shutsemau <kirill@shutemov.name> wrote:
> >
> > +/*
> > + * Bring the local CPU to a stop, saving its register state into the vmcore
> > + * on the kdump crash path first. The single point every arm64 stop path
> > + * funnels through, so the bookkeeping (mask interrupts, mark offline, mask
> > + * SDEI, optionally power off) lives in one place:
> > + *
> > + *   - the regular IPI_CPU_STOP and pseudo-NMI IPI_CPU_STOP_NMI handlers;
> > + *   - panic_smp_self_stop(), a CPU parking itself on a parallel panic();
> > + *   - the SDEI cross-CPU NMI handler (drivers/firmware/arm_sdei_nmi.c),
> > + *     which reaches CPUs the stop IPIs could not.
> > + *
> > + * @regs is the register state to record in the vmcore on a crash stop; NULL
> > + * means "capture the current context". @die_on_crash decides the kdump crash
> > + * path: the IPI stop handlers pass true and power the CPU off (PSCI CPU_OFF,
> > + * via __cpu_try_die()) so a capture kernel can reclaim it. The SDEI handler
> > + * and panic_smp_self_stop() pass false and only park. For SDEI that is
> > + * required, not just conservative: it runs inside an SDEI event that is
> > + * deliberately never completed (completing it has firmware resume the wedged
> > + * context), and a CPU_OFF from that not-yet-completed context wedges EL3 on
> > + * some firmware -- a documented follow-up. Parking also matches this path's
> > + * own fallback when CPU_OFF is unavailable.
> 
> Nice to have all the details in the function comment. Any reason why
> you didn't use kernel-doc format? Nothing else in this file does, I
> guess, but it doesn't seem like it would be a problem to start the
> trend... ;-)

No reason -- switched it to kernel-doc in v4.

> > +       if (READ_ONCE(sdei_nmi_stopping)) {
>
> Don't you need a smp_rmb() before that, to match with the smp_wmb()?

No -- there's nothing for it to pair against. An smp_rmb() orders a
flag-load before a later *data*-load (the message-passing pattern), but
the flag is the only value shared here: the handler reads
sdei_nmi_stopping and nothing else published through it. crash_stop,
which the stop path reads afterwards, was written before the flag, so
reordering the reads can't expose a stale value either.

And the handler isn't spinning on the flag: it runs only because firmware
delivered the SDEI event, which is a full round-trip (SMC -> EL3 -> GIC ->
exception entry) strictly after sdei_nmi_stop_cpus()'s
WRITE_ONCE() + smp_wmb() + signal. By the time the read executes the store
is already globally visible -- there's no window for a stale read. The
publish-before-signal barrier is the half that matters.

The IPI core does exactly this. gic_ipi_send_mask()
(drivers/irqchip/irq-gic-v3.c) issues dsb(ishst) -- "stores to Normal
memory ... visible to the other CPUs before issuing the IPI" -- before
sending the SGI, and no IPI handler takes a matching read-side barrier. I
use an explicit smp_wmb() only because the SDEI signal goes through an
SMC, which carries no such barrier, where gic_ipi_send_mask() bakes it
into the SGI send. The backtrace framework this driver plugs into is the
same shape: nmi_cpu_backtrace() reads backtrace_mask -- set by
nmi_trigger_cpumask_backtrace() before raise() -- with no smp_rmb().

A bare smp_rmb() with no second load to order would just be confusing, so
v4 adds a comment at the READ_ONCE() explaining why there isn't one.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH RFC 2/4] printk: deprecate boot_delay in favour of printk_delay
From: Petr Mladek @ 2026-06-17 15:00 UTC (permalink / raw)
  To: Andrew Murray
  Cc: Jonathan Corbet, Shuah Khan, Russell King, Florian Fainelli,
	Broadcom internal kernel review list, Ray Jui, Scott Branden,
	Steven Rostedt, John Ogness, Sergey Senozhatsky, Andrew Morton,
	Sebastian Andrzej Siewior, Clark Williams, Randy Dunlap,
	Linus Torvalds, linux-doc, linux-kernel, linux-arm-kernel,
	linux-rpi-kernel, linux-rt-devel
In-Reply-To: <CALqELGzTH8cTLVgX9CXuf_LFLgC97_yfqYJVHzU9ghPuev7SNA@mail.gmail.com>

On Sun 2026-06-14 12:45:44, Andrew Murray wrote:
> On Mon, 8 Jun 2026 at 15:07, Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Mon 2026-06-01 00:17:38, Andrew Murray wrote:
> > > The boot_delay (BOOT_PRINTK_DELAY) kernel parameter and printk_delay sysctl
> > > are two distinct mechanisms for providing similar functionality which add a
> > > delay prior to each printed printk message.
> > >
> > > boot_delay provides a kernel parameter for delaying printk output from
> > > kernel start through to boot (SYSTEM_RUNNING), whereas printk_delay is
> > > configurable only via sysctl and thus is only used post boot.
> > >
> > > Let's deprecate the boot_delay feature in favour of printk_delay. In order
> > > to preserve functionality, we'll also extend printk_delay such that it can
> > > additionally configured via a kernel parameter.
> >
> > > --- a/kernel/printk/printk.c
> > > +++ b/kernel/printk/printk.c
> > > @@ -1339,11 +1327,34 @@ static void boot_delay_msec(int level)
> > >       }
> > >  }
> > >  #else
> > > -static inline void boot_delay_msec(int level)
> > > +static inline void __init printk_delay_calculate(void)
> > > +{
> > > +}
> > > +
> > > +static inline void early_boot_delay_msec(void)
> > >  {
> >
> > It would be nice to print a warning that the early boot delay
> > does not work, something like:
> >
> >         pr_warn_once("Early boot delay does not work without CONFIG_GENERIC_CALIBRATE_DELAY enabled.\n");
> >
> > >  }
> > >  #endif
> > >
> > > +static int __init printk_delay_setup(char *str)
> > > +{
> > > +     get_option(&str, &printk_delay_msec);
> > > +     if (printk_delay_msec > 10 * 1000)
> > > +             printk_delay_msec = 0;
> >
> > Sashiko AI warns that this code accepts negative values.
> > It might cause long delays, see
> > https://sashiko.dev/#/patchset/20260601-deprecate_boot_delay-v1-0-c34c187142a6%40thegoodpenguin.co.uk
> >
> > The problem has already been there even before. But it would be nice
> > to fix it.
> 
> Thanks for pointing out Sashiko, I hadn't seen its review on my
> patches. Are authors expected to get emails from it, as I didn't?

Sashiko is able to send mails but it is opt-in.

It might create too much noise because it has false positives, it
keeps reporting minor or nice-to-fix problems which will "never"
get fixed. Also it is not predictable so that you could not reliably
check the patchset before sending.

Anyway, I thought about enabling this for printk-related patches
because Sashiko also gives a lot of useful feedback. And it is easier
to discuss it as reply to a mail. But AFAIK, it can be done only
per-mailing list. And printk does not have any dedicated mailing list.

So, I "always" search for the feedback at https://sashiko.dev/

Best Regards,
Petr


^ permalink raw reply

* [PATCH v7 7/7] KVM: arm64: Zero out the stack initialized data in the FFA handler
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui, Sashiko AI
In-Reply-To: <20260617145130.3729015-1-sebastianene@google.com>

Don't leak hypervisor stack data when using the FFA_VERSION call.
When the compiler doesn't support -ftrivial-auto-var-init=zero option
we need to zero out the stack initialized variable before returning data
to the host caller.

Reported-by: Sashiko AI <sashiko-bot@kernel.org>
Closes:
https://lore.kernel.org/all/20260616160016.C62C81F000E9@smtp.kernel.org/
Fixes: c9c012625e12 ("KVM: arm64: Trap FFA_VERSION host call in pKVM")
Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/ffa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c
index d7c5701d0584..b321682ead04 100644
--- a/arch/arm64/kvm/hyp/nvhe/ffa.c
+++ b/arch/arm64/kvm/hyp/nvhe/ffa.c
@@ -883,7 +883,7 @@ static void do_ffa_part_get(struct arm_smccc_1_2_regs *res,
 
 bool kvm_host_ffa_handler(struct kvm_cpu_context *host_ctxt, u32 func_id)
 {
-	struct arm_smccc_1_2_regs res;
+	struct arm_smccc_1_2_regs res = {0};
 
 	/*
 	 * There's no way we can tell what a non-standard SMC call might
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* [PATCH v7 6/7] KVM: arm64: Ensure FFA ranges are page aligned
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui
In-Reply-To: <20260617145130.3729015-1-sebastianene@google.com>

From: Mostafa Saleh <smostafa@google.com>

At the moment we only check that the size of the range is page
aligned, and truncate the address to the page boundary.
This make an assumption that TZ will do the same.

However, it might decide to use the extra offset of the neighbour
page at the end, which is valid under FFA if NS is using larger
page size.

Harden this check by also checking that the base address is aligned
and reject it otherwise.

Fixes: 436090001776 ("KVM: arm64: Handle FFA_MEM_SHARE calls from the host")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/ffa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c
index 1a2abd0154c6..d7c5701d0584 100644
--- a/arch/arm64/kvm/hyp/nvhe/ffa.c
+++ b/arch/arm64/kvm/hyp/nvhe/ffa.c
@@ -352,7 +352,7 @@ static u32 __ffa_host_share_ranges(struct ffa_mem_region_addr_range *ranges,
 		u64 sz = (u64)range->pg_cnt * FFA_PAGE_SIZE;
 		u64 pfn = hyp_phys_to_pfn(range->address);
 
-		if (!PAGE_ALIGNED(sz))
+		if (!PAGE_ALIGNED(sz | range->address))
 			break;
 
 		if (__pkvm_host_share_ffa(pfn, sz / PAGE_SIZE))
@@ -372,7 +372,7 @@ static u32 __ffa_host_unshare_ranges(struct ffa_mem_region_addr_range *ranges,
 		u64 sz = (u64)range->pg_cnt * FFA_PAGE_SIZE;
 		u64 pfn = hyp_phys_to_pfn(range->address);
 
-		if (!PAGE_ALIGNED(sz))
+		if (!PAGE_ALIGNED(sz | range->address))
 			break;
 
 		if (__pkvm_host_unshare_ffa(pfn, sz / PAGE_SIZE))
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* [PATCH v7 4/7] KVM: arm64: Fix bounds checking in do_ffa_mem_reclaim()
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui
In-Reply-To: <20260617145130.3729015-1-sebastianene@google.com>

From: Mostafa Saleh <smostafa@google.com>

Sashiko (locally) reports out of bound write possiblity if SPMD
returns an invalid data.

While SPMD is considered trusted, pKVM does some basic checks,
for offset to be less than or equal len.

However, that is incorrect as even if the offset is smaller than
len pKVM can still access out of bound memory in the next
ffa_host_unshare_ranges().

Split this check into 2:
1- Check that the fixed portion of the descriptor fits.
2- After getting reg, check the variable array size addr_range_cnt
   fits.

Also, drop the WARN_ONs as that will panic the kernel and in the
next checks there are no WARNs, so that makes it consistent.

Fixes: 0a9f15fd5674 ("KVM: arm64: pkvm: Add support for fragmented FF-A descriptors")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 arch/arm64/kvm/hyp/nvhe/ffa.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/hyp/nvhe/ffa.c b/arch/arm64/kvm/hyp/nvhe/ffa.c
index 1af722771178..2d211661952e 100644
--- a/arch/arm64/kvm/hyp/nvhe/ffa.c
+++ b/arch/arm64/kvm/hyp/nvhe/ffa.c
@@ -607,8 +607,8 @@ static void do_ffa_mem_reclaim(struct arm_smccc_1_2_regs *res,
 	 * check that we end up with something that doesn't look _completely_
 	 * bogus.
 	 */
-	if (WARN_ON(offset > len ||
-		    fraglen > KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE)) {
+	if (offset + CONSTITUENTS_OFFSET(0) > len ||
+	    fraglen > KVM_FFA_MBOX_NR_PAGES * PAGE_SIZE) {
 		ret = FFA_RET_ABORTED;
 		ffa_rx_release(res);
 		goto out_unlock;
@@ -636,11 +636,17 @@ static void do_ffa_mem_reclaim(struct arm_smccc_1_2_regs *res,
 		ffa_rx_release(res);
 	}
 
+	reg = (void *)buf + offset;
+	if (offset + CONSTITUENTS_OFFSET(reg->addr_range_cnt) > len) {
+		ret = FFA_RET_ABORTED;
+		ffa_rx_release(res);
+		goto out_unlock;
+	}
+
 	ffa_mem_reclaim(res, handle_lo, handle_hi, flags);
 	if (res->a0 != FFA_SUCCESS)
 		goto out_unlock;
 
-	reg = (void *)buf + offset;
 	/* If the SPMD was happy, then we should be too. */
 	WARN_ON(ffa_host_unshare_ranges(reg->constituents,
 					reg->addr_range_cnt));
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* [PATCH v7 3/7] firmware: arm_ffa: Fix Endpoint Memory Access Descriptor offset calculation
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui
In-Reply-To: <20260617145130.3729015-1-sebastianene@google.com>

Use the descriptor's `ep_mem_offset` to calculate the start of the endpoint
memory access array and to comply with the FF-A spec instead of defaulting
to `sizeof(struct ffa_mem_region)`.
This requires moving `ffa_mem_region_additional_setup()` earlier in the setup
flow.
Also, add sanity checks to ensure the calculated descriptor offsets do not
exceed `max_fragsize`.

Fixes: 113580530ee7 ("firmware: arm_ffa: Update memory descriptor to support v1.1 format")
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Sudeep Holla <sudeep.holla@kernel.org>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
---
 drivers/firmware/arm_ffa/driver.c | 20 +++++++++++++++-----
 include/linux/arm_ffa.h           |  2 +-
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/firmware/arm_ffa/driver.c b/drivers/firmware/arm_ffa/driver.c
index 059e2aae7ca0..92edf397bcd2 100644
--- a/drivers/firmware/arm_ffa/driver.c
+++ b/drivers/firmware/arm_ffa/driver.c
@@ -703,19 +703,30 @@ ffa_setup_and_transmit(u32 func_id, void *buffer, u32 max_fragsize,
 	struct ffa_composite_mem_region *composite;
 	struct ffa_mem_region_addr_range *constituents;
 	struct ffa_mem_region_attributes *ep_mem_access;
-	u32 idx, frag_len, length, buf_sz = 0, num_entries = sg_nents(args->sg);
+	u32 idx, frag_len, length, buf_sz = 0, num_entries = sg_nents(args->sg), ep_offset;
+	u32 emad_end, emad_size = ffa_emad_size_get(drv_info->version);
 
 	mem_region->tag = args->tag;
 	mem_region->flags = args->flags;
 	mem_region->sender_id = drv_info->vm_id;
 	mem_region->attributes = ffa_memory_attributes_get(func_id);
+
+	ffa_mem_region_additional_setup(drv_info->version, mem_region);
 	composite_offset = ffa_mem_desc_offset(buffer, args->nattrs,
 					       drv_info->version);
+	if (composite_offset + sizeof(*composite) > max_fragsize)
+		return -ENXIO;
 
 	for (idx = 0; idx < args->nattrs; idx++) {
-		ep_mem_access = buffer +
-			ffa_mem_desc_offset(buffer, idx, drv_info->version);
-		memset(ep_mem_access, 0, ffa_emad_size_get(drv_info->version));
+		ep_offset = ffa_mem_desc_offset(buffer, idx, drv_info->version);
+		if (check_add_overflow(ep_offset, emad_size, &emad_end))
+			return -ENXIO;
+
+		if (emad_end > max_fragsize)
+			return -ENXIO;
+
+		ep_mem_access = buffer + ep_offset;
+		memset(ep_mem_access, 0, emad_size);
 		ep_mem_access->receiver = args->attrs[idx].receiver;
 		ep_mem_access->attrs = args->attrs[idx].attrs;
 		ep_mem_access->composite_off = composite_offset;
@@ -725,7 +736,6 @@ ffa_setup_and_transmit(u32 func_id, void *buffer, u32 max_fragsize,
 	}
 	mem_region->handle = 0;
 	mem_region->ep_count = args->nattrs;
-	ffa_mem_region_additional_setup(drv_info->version, mem_region);
 
 	composite = buffer + composite_offset;
 	composite->total_pg_cnt = ffa_get_num_pages_sg(args->sg);
diff --git a/include/linux/arm_ffa.h b/include/linux/arm_ffa.h
index 81e603839c4a..62d67dae8b70 100644
--- a/include/linux/arm_ffa.h
+++ b/include/linux/arm_ffa.h
@@ -445,7 +445,7 @@ ffa_mem_desc_offset(struct ffa_mem_region *buf, int count, u32 ffa_version)
 	if (!FFA_MEM_REGION_HAS_EP_MEM_OFFSET(ffa_version))
 		offset += offsetof(struct ffa_mem_region, ep_mem_offset);
 	else
-		offset += sizeof(struct ffa_mem_region);
+		offset += buf->ep_mem_offset;
 
 	return offset;
 }
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* [PATCH v7 2/7] firmware: arm_ffa: Fix out-of-bound writes in ffa_setup_and_transmit()
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui
In-Reply-To: <20260617145130.3729015-1-sebastianene@google.com>

From: Mostafa Saleh <smostafa@google.com>

From: Mostafa Saleh <smostafa@google.com>

Sashiko (locally) reports multiple out-of-bound issues in
ffa_setup_and_transmit:
1) Writing ep_mem_access->reserved can write out of bounds for FFA
   versions < 1.2 as ffa_emad_size_get() returns 16 bytes in that case
   while reserved has an offset of 24.
   Instead of zeroing fields, memset the struct to zero first based on
   the FFA version.

2) Make sure there is enough size to write constituents.

While at it, convert the only sizeof() in the driver that uses a
type instead of variable.

Reviewed-by: Sudeep Holla <sudeep.holla@kernel.org>
Fixes: 111a833dc5cb ("firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
---
 drivers/firmware/arm_ffa/driver.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/firmware/arm_ffa/driver.c b/drivers/firmware/arm_ffa/driver.c
index b9f17fda7243..059e2aae7ca0 100644
--- a/drivers/firmware/arm_ffa/driver.c
+++ b/drivers/firmware/arm_ffa/driver.c
@@ -715,11 +715,10 @@ ffa_setup_and_transmit(u32 func_id, void *buffer, u32 max_fragsize,
 	for (idx = 0; idx < args->nattrs; idx++) {
 		ep_mem_access = buffer +
 			ffa_mem_desc_offset(buffer, idx, drv_info->version);
+		memset(ep_mem_access, 0, ffa_emad_size_get(drv_info->version));
 		ep_mem_access->receiver = args->attrs[idx].receiver;
 		ep_mem_access->attrs = args->attrs[idx].attrs;
 		ep_mem_access->composite_off = composite_offset;
-		ep_mem_access->flag = 0;
-		ep_mem_access->reserved = 0;
 		ffa_emad_impdef_value_init(drv_info->version,
 					   ep_mem_access->impdef_val,
 					   args->attrs[idx].impdef_val);
@@ -759,7 +758,7 @@ ffa_setup_and_transmit(u32 func_id, void *buffer, u32 max_fragsize,
 			constituents = buffer;
 		}
 
-		if ((void *)constituents - buffer > max_fragsize) {
+		if ((void *)constituents + sizeof(*constituents) - buffer > max_fragsize) {
 			pr_err("Memory Region Fragment > Tx Buffer size\n");
 			return -EFAULT;
 		}
@@ -768,7 +767,7 @@ ffa_setup_and_transmit(u32 func_id, void *buffer, u32 max_fragsize,
 		constituents->pg_cnt = args->sg->length / FFA_PAGE_SIZE;
 		constituents->reserved = 0;
 		constituents++;
-		frag_len += sizeof(struct ffa_mem_region_addr_range);
+		frag_len += sizeof(*constituents);
 	} while ((args->sg = sg_next(args->sg)));
 
 	return ffa_transmit_fragment(func_id, addr, buf_sz, frag_len,
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* [PATCH v7 1/7] optee: ffa: Add NULL check in optee_ffa_lend_protmem
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui, Sumit Garg
In-Reply-To: <20260617145130.3729015-1-sebastianene@google.com>

From: Mostafa Saleh <smostafa@google.com>

From: Mostafa Saleh <smostafa@google.com>

Sashiko (locally) reports a possible null dereference under memory
pressure due to the lack of validation of the allocated pointer.

Fix that by adding the missing check.

Fixes: 2b78d79cdf96 ("optee: FF-A: dynamic protected memory allocation")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Sebastian Ene <sebastianene@google.com>
Reviewed-by: Sumit Garg <sumit.garg@oss.qualcomm.com>
---
 drivers/tee/optee/ffa_abi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/tee/optee/ffa_abi.c b/drivers/tee/optee/ffa_abi.c
index b4372fa268d0..633715b98625 100644
--- a/drivers/tee/optee/ffa_abi.c
+++ b/drivers/tee/optee/ffa_abi.c
@@ -698,6 +698,9 @@ static int optee_ffa_lend_protmem(struct optee *optee, struct tee_shm *protmem,
 	int rc;
 
 	mem_attr = kzalloc_objs(*mem_attr, ma_count);
+	if (!mem_attr)
+		return -ENOMEM;
+
 	for (n = 0; n < ma_count; n++) {
 		mem_attr[n].receiver = mem_attrs[n] & U16_MAX;
 		mem_attr[n].attrs = mem_attrs[n] >> 16;
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* [PATCH v7 0/7] arm_ffa, KVM: Fix FF-A emad offset calculations
From: Sebastian Ene @ 2026-06-17 14:51 UTC (permalink / raw)
  To: catalin.marinas, oupton, sudeep.holla, will
  Cc: jens.wiklander, joey.gouly, kvmarm, linux-arm-kernel,
	linux-kernel, android-kvm, maz, mrigendra.chaubey, op-tee,
	perlarsen, sebastianene, seiden, smostafa, sumit.garg,
	suzuki.poulose, vdonnefort, yuzenghui

Hi all,

This series fixes the Endpoint Memory Access Descriptor (EMAD) offset
calculations and adds the necessary bounds checks for both the core
FF-A driver and the pKVM hypervisor.

Prior to FF-A version 1.1, the memory region header didn't specify an
explicit offset for the EMADs, leading to the assumption that they
immediately follow the header.
However, from v1.1 onwards, the specification dictates using the
ep_mem_offset` field to determine the start of the memory access
array.

The patches in this series address this by:
1. Updating the core `arm_ffa` firmware driver to correctly calculate
the descriptor
   offset using `ep_mem_offset` rather than defaulting to `sizeof(struct
ffa_mem_region)`.
   It also introduces bounds checking against `max_fragsize`.
2. Enhancing the pKVM hypervisor validation logic to no longer strictly
enforce that
   the descriptor strictly follows the header, aligning it with the
driver behavior
   and the FF-A specification, while also ensuring the offset falls
within the mailbox
   buffer bounds.

While addressing these bugs, Sashiko uncovered other issues that were
fixed in the same series.

All the patches aside from the first one in optee are urgent fixes as
they either impact the hypervisor security or kernel stability.

Changelog
#########
v6->v7:
- taking the patches from Mostafa and sending a new version with the
  collected tags
- Added overflow checks when doing `ep_offset + emad_size` in the arm
  ff-a driver
- Move the length check before the ffa_mem_reclaim
- fix compatibility break with ff-a version 1.0 reported by Sashiko
- add one more patch to fix an issue with the FFA_VERSION call
  that can lead to leaking pKVM stack un-initialized data to
  a host when -ftrivial-auto-var-init=zero is not used.

v5->v6:
- Add fixes tag
- Small clean up make variable declaration reverse christmas tree.

v4->v5:
- Collect Sudeep Rbs
- Add extra patch to check base address alignment.
- Remove WARN_ONs in KVM code
- Use ffa_emad_size_get() instead of hardcoded size in KVM code.

v3 -> v4:
- Address review comments and fix Sashiko bugs

v2 -> v3:
- Fixed typo in nvhe/ffa.c (missing sizeof)

v1 -> v2:
- For pKVM, removed the strict placement enforcement for `ep_mem_offset`
  as it is not
  compliant with the spec, and avoids making assumptions about the
driver's memory
  layout.

Link to:
########
v6:
https://lore.kernel.org/all/20260527150236.1978655-1-smostafa@google.com/
v5:
https://lore.kernel.org/all/20260526151934.3783707-1-smostafa@google.com/
v4:
https://lore.kernel.org/all/20260520204948.2440882-1-smostafa@google.com/
v3:
https://lore.kernel.org/all/20260512124442.1899107-1-sebastianene@google.com/
v2:
https://lore.kernel.org/all/20260430160241.1934777-1-sebastianene@google.com/
v1: https://lore.kernel.org/all/ae9KN9nkOgDYJcGP@google.com/T/#t

Mostafa Saleh (4):
  optee: ffa: Add NULL check in optee_ffa_lend_protmem
  firmware: arm_ffa: Fix out-of-bound writes in ffa_setup_and_transmit()
  KVM: arm64: Fix bounds checking in do_ffa_mem_reclaim()
  KVM: arm64: Ensure FFA ranges are page aligned

Sebastian Ene (3):
  firmware: arm_ffa: Fix Endpoint Memory Access Descriptor offset
    calculation
  KVM: arm64: Validate the offset to the mem access descriptor
  KVM: arm64: Zero out the stack initialized data in the FFA handler

 arch/arm64/kvm/hyp/nvhe/ffa.c     | 47 ++++++++++++++++++++++---------
 drivers/firmware/arm_ffa/driver.c | 25 ++++++++++------
 drivers/tee/optee/ffa_abi.c       |  3 ++
 include/linux/arm_ffa.h           |  2 +-
 4 files changed, 54 insertions(+), 23 deletions(-)

-- 
2.54.0.1136.gdb2ca164c4-goog

^ permalink raw reply

* [PATCH] KVM: arm64: nv: Fix PSTATE construction on illegal exception return
From: Fuad Tabba @ 2026-06-17 14:49 UTC (permalink / raw)
  To: Marc Zyngier, Oliver Upton, kvmarm, linux-arm-kernel
  Cc: Joey Gouly, Steffen Eiden, Suzuki K Poulose, Zenghui Yu,
	Catalin Marinas, Will Deacon, Jintack Lim, Ganapatrao Kulkarni,
	Christoffer Dall, linux-kernel, tabba

kvm_check_illegal_exception_return() sourced the flags {N,Z,C,V} and
masks {D,A,I,F} of the resulting PSTATE from the current PSTATE, but
R_VWJHB takes them from the SPSR being returned to and leaves
PSTATE.{EL,SP,nRW} (and EXLOCK when FEAT_GCS) unchanged. PAN, ALLINT
and PM were not applied at all.

Build the PSTATE by taking those fields from the SPSR while preserving
EL, SP, nRW and EXLOCK from the current PSTATE, then set IL.

Fixes: 47f3a2fc765a ("KVM: arm64: nv: Support virtual EL2 exceptions")
Suggested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/86wlvxs5r0.wl-maz@kernel.org/
Signed-off-by: Fuad Tabba <tabba@google.com>
---
This is a modified version of Marc's suggested diff [1]. That diff applied
a single mask to the incoming SPSR, which also takes PSTATE.{EL,SP,nRW}
(and EXLOCK) from the SPSR. The ARM ARM leaves those fields unchanged on an
illegal exception return. This path is reached precisely because SPSR.M is
illegal (EL3, M[1]=1, AArch32, EL1 under TGE), so this version preserves
EL/SP/nRW/EXLOCK from the current PSTATE and takes only the flags, masks
and PAN/ALLINT/PM from the SPSR.

[1] https://lore.kernel.org/all/86wlvxs5r0.wl-maz@kernel.org/
---
 arch/arm64/kvm/emulate-nested.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c
index dba7ced74ca5..ace2b40cf875 100644
--- a/arch/arm64/kvm/emulate-nested.c
+++ b/arch/arm64/kvm/emulate-nested.c
@@ -2738,17 +2738,30 @@ static u64 kvm_check_illegal_exception_return(struct kvm_vcpu *vcpu, u64 spsr)
 	    (spsr & PSR_MODE32_BIT) ||
 	    (vcpu_el2_tge_is_set(vcpu) && (mode == PSR_MODE_EL1t ||
 					   mode == PSR_MODE_EL1h))) {
-		/*
-		 * The guest is playing with our nerves. Preserve EL, SP,
-		 * masks, flags from the existing PSTATE, and set IL.
-		 * The HW will then generate an Illegal State Exception
-		 * immediately after ERET.
-		 */
-		spsr = *vcpu_cpsr(vcpu);
+		u64 cpsr = *vcpu_cpsr(vcpu);
+		u64 mask;
 
-		spsr &= (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT |
-			 PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT |
-			 PSR_MODE_MASK | PSR_MODE32_BIT);
+		/*
+		 * On an illegal exception return, PSTATE.{EL,SP,nRW} and,
+		 * if FEAT_GCS, PSTATE.EXLOCK are unchanged, while the flags
+		 * and masks are taken from the SPSR (R_VWJHB). Set IL so the
+		 * HW generates an Illegal State Exception right after ERET.
+		 */
+		mask = PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT |
+		       PSR_N_BIT | PSR_Z_BIT | PSR_C_BIT | PSR_V_BIT;
+
+		if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, IMP))
+			mask |= PSR_PAN_BIT;
+		if (kvm_has_feat(vcpu->kvm, ID_AA64PFR1_EL1, NMI, IMP))
+			mask |= ALLINT_ALLINT;
+		/* FEAT_SPE_EXC and FEAT_TRBE_EXC also gate PSTATE.PM one day... */
+		if (kvm_has_feat(vcpu->kvm, ID_AA64DFR1_EL1, EBEP, IMP))
+			mask |= BIT_ULL(32);	/* PSTATE.PM */
+
+		spsr &= mask;
+		spsr |= cpsr & (PSR_MODE_MASK | PSR_MODE32_BIT);
+		if (kvm_has_feat(vcpu->kvm, ID_AA64PFR1_EL1, GCS, IMP))
+			spsr |= cpsr & BIT_ULL(34);	/* PSTATE.EXLOCK */
 		spsr |= PSR_IL_BIT;
 	}
 
-- 
2.54.0.1136.gdb2ca164c4-goog



^ permalink raw reply related

* Re: [PATCH v6 03/20] dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
From: Aneesh Kumar K.V @ 2026-06-17 14:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy, iommu, linux-arm-kernel, linux-kernel,
	linux-coco
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik, Dan Williams,
	Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86, Jiri Pirko,
	Michael Kelley, Cheloha, Scott
In-Reply-To: <845d0c8a-6d51-47aa-8e0b-8381e733444a@amd.com>

Alexey Kardashevskiy <aik@amd.com> writes:

> On 4/6/26 18:39, Aneesh Kumar K.V (Arm) wrote:
>> Propagate force_dma_unencrypted() into DMA_ATTR_CC_SHARED in the
>> dma-direct allocation path and use the attribute to drive the related
>> decisions.
>> 
>> This updates dma_direct_alloc(), dma_direct_free(), and
>> dma_direct_alloc_pages() to fold the forced unencrypted case into attrs.
>> 
>> Tested-by: Jiri Pirko <jiri@nvidia.com>
>> Tested-by: Michael Kelley <mhklinux@outlook.com>
>> Tested-by: Mostafa Saleh <smostafa@google.com>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>   kernel/dma/direct.c | 53 +++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 44 insertions(+), 9 deletions(-)
>> 
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index a741c8a2ee66..90dc5057a0c0 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -193,16 +193,31 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>>   		dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>>   {
>>   	bool remap = false, set_uncached = false;
>> -	bool mark_mem_decrypt = true;
>> +	bool mark_mem_decrypt = false;
>>   	struct page *page;
>>   	void *ret;
>>   
>> +	/*
>> +	 * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
>> +	 * attribute. The direct allocator uses it internally after it has
>> +	 * decided that the backing pages must be shared/decrypted, so the
>> +	 * rest of the allocation path can consistently select DMA addresses,
>> +	 * choose compatible pools and restore encryption on free.
>
> Why this limit?
>
> Context: I am looking for a memory pool for a few shared pages (to do
> some guest<->host communication), SWIOTLB seems like the right fit but
> swiotlb_alloc() is not exported and
> dma_direct_alloc(DMA_ATTR_CC_SHARED) is not allowed. Thanks,
>

swiotlb is not the right pool to use for that, right?
CCA had a similar requirement for ITS pages and ended up creating a genpool:
b08e2f42e86b ("irqchip/gic-v3-its: Share ITS tables with a non-trusted hypervisor")

-aneesh


^ permalink raw reply

* Re: [PATCH v2 2/6] iommu/arm-smmu: Add interconnect bandwidth voting support
From: Bibek Kumar Patro @ 2026-06-17 14:26 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Bjorn Andersson, Konrad Dybcio,
	linux-arm-kernel, iommu, devicetree, linux-kernel, linux-arm-msm
In-Reply-To: <sdm7sqiokmsgczeat2mxch42ois5rwabav6c5fm7abct2xoepf@j3kraqrjvpoc>



On 6/16/2026 5:51 AM, Dmitry Baryshkov wrote:
> On Mon, Jun 15, 2026 at 06:36:51PM +0530, Bibek Kumar Patro wrote:
>>
>>
>> On 6/8/2026 7:25 PM, Dmitry Baryshkov wrote:
>>> On Tue, May 26, 2026 at 08:12:03PM +0530, Bibek Kumar Patro wrote:
>>>> On some SoCs the SMMU registers require an active interconnect
>>>> bandwidth vote to be accessible. While other clients typically
>>>> satisfy this requirement implicitly, certain corner cases (e.g.
>>>> during sleep/wakeup transitions) can leave the SMMU without a
>>>> vote, causing intermittent register access failures.
>>>>
>>>> Add support for an optional interconnect path to the arm-smmu
>>>> driver and vote for bandwidth while the SMMU is active. The path
>>>> is acquired from DT if present and ignored otherwise.
>>>>
>>>> The bandwidth vote is enabled before accessing SMMU registers
>>>> during probe and runtime resume, and released during runtime
>>>> suspend and on error paths.
>>>>
>>>> Generally, from an architectural perspective, GEM_NOC and DDR are
>>>> expected to have an active vote whenever the adreno_smmu block is
>>>> powered on. In most common use cases, this requirement is implicitly
>>>> satisfied because other GPU-related clients (for example, the GMU
>>>> device) already hold a GEM_NOC vote when adreno_smmu is enabled.
>>>>
>>>> However, there are certain corner cases, such as during sleep/wakeup
>>>> transitions, where the GEM_NOC vote can be removed before adreno_smmu
>>>> is powered down. If adreno_smmu is then accessed while the interconnect
>>>> vote is missing, it can lead to the observed failures. Because of the
>>>> precise ordering involved, this scenario is difficult to reproduce
>>>> consistently.
>>>> (also GDSC is involved in adreno usecases can have an independent vote)
>>>>
>>>> Signed-off-by: Bibek Kumar Patro <bibek.patro@oss.qualcomm.com>
>>>> ---
>>>>    drivers/iommu/arm/arm-smmu/arm-smmu.c | 57 +++++++++++++++++++++++++++++++++--
>>>>    drivers/iommu/arm/arm-smmu/arm-smmu.h |  2 ++
>>>>    2 files changed, 57 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
>>>> index 0bd21d206eb3e75c3b9fb1364cdc92e82c5aa499..07c7e44ec6a5bd1488f00f87d859a20495e46601 100644
>>>> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
>>>> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
>>>> @@ -53,6 +53,11 @@
>>>>    #define MSI_IOVA_BASE			0x8000000
>>>>    #define MSI_IOVA_LENGTH			0x100000
>>>> +/* Interconnect bandwidth vote values for the SMMU register access path */
>>>> +#define ARM_SMMU_ICC_AVG_BW		0
>>>> +#define ARM_SMMU_ICC_PEAK_BW_HIGH	1000
>>>
>>> totally random numbers, which might be different for non-Qualcomm platform.
>>>
>>
>> Ideally, any non-zero value would be enough to keep the path active.
> 
> This is true for Qualcomm devices. However, you are adding this to a
> generic code.
> 
>> Here 1 Would be enough to keep the path active, but might be too small to
>> reliably keep the bus active.
>> Other is UINT_MAX, which will reliably keep the bus active but might cause a
>> power penalty.
>>
>> #define ARM_SMMU_ICC_PEAK_BW_HIGH	UINT_MAX
>>
>> seems to be suitable here to reliably keep the bus active by BCM
>> for both Qualcomm and non-Qualcomm platforms (with some power penalty).
>>
>> LMK, if you feel otherwise.
> 
> Shift it to the qcom instance or provide platform-specific values? (My
> preference would be towards the first solution).
> 


To support platform-specific values, we may need to introduce a 
LUT-based approach in the driver. (Bandwidth voting values cannot be 
placed in device-tree property IIRC ?)

Currently, all Qualcomm platforms use 0x1000 for SMMU ICC voting. I
can evaluate if this could be moved to a Qualcomm-specific
implementation.

To clarify, this applies only to the bandwidth values.
Since the ICC path itself can remain part of struct arm_smmu_device, 
similar to clocks and IRQs, as it represents common infrastructure 
required for the SMMU device.

Thanks & regards,
Bibek


>>
>>
>>>> +#define ARM_SMMU_ICC_PEAK_BW_LOW	0
>>>> +
>>>>    static int force_stage;
>>>>    module_param(force_stage, int, S_IRUGO);
>>>>    MODULE_PARM_DESC(force_stage,
> 



^ permalink raw reply

* Re: [RFC PATCH] KVM: Ignore MMU notifiers for guest_memfd-only memslots
From: Alexandru Elisei @ 2026-06-17 13:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: pbonzini, kvm, linux-kernel, maz, oupton, suzuki.poulose, kvmarm,
	linux-arm-kernel, seanjc, mark.rutland
In-Reply-To: <3b1cda8e-96d2-4b66-9916-caef7762209e@arm.com>

Hi David,

On Wed, Jun 17, 2026 at 03:41:41PM +0200, David Hildenbrand wrote:
> On 6/17/26 15:23, Alexandru Elisei wrote:
> > Hi David,
> > 
> > On Mon, Jun 15, 2026 at 09:07:50PM +0200, David Hildenbrand wrote:
> >> On 6/15/26 17:52, Alexandru Elisei wrote:
> >>> For guest_memfd-only memslots (kvm_memslot_is_gmem_only() is true), the
> >>> memory provider for the virtual machine is the guest_memfd file, not the
> >>> userspace mapping. Faults are resolved using the guest_memfd page cache,
> >>> and the permissions for the secondary MMU mapping depends exclusively on
> >>> the memslot (i.e, if the memslot is read-only). How userspace happens to
> >>> have the memory mmaped at fault time, or even if the memory is mapped at
> >>> all into userspace, is not taken into consideration.
> >>>
> >>> guest_memfd memory is not evictable, is not movable and there's no backing
> >>> storage. Once memory is allocated for an offset in guest_memfd file, the
> >>> offset will not change, and that memory is not freed unless userspace
> >>> explicitly punches a hole in the file. As a result, memory reclaim, page
> >>> migration, page aging and dirty page tracking for the userspace mapping
> >>> serve little purpose.
> >>
> >> I don't think any of that is relevant for the patch at hand?
> >>
> >> The thing is: invalidation (truncation, later migration, for any other reason)
> >> is driven through guest_memfd notifications, not through unrelated page tables.
> >>
> >> If we don't lookup pages for the KVM MMU through the page table, then there is
> >> also no need for MMU notifiers. It's all guest_memfd only.
> >>
> >> Or am I missing something?
> > 
> > My thinking was that, because guest_memfd is not evictable, there is no need to
> > do page ageing, which would require that secondary MMU mappings be made old.
> 
> Not really.
> 
> The KVM MMU did not obtain the folios through the page tables, but directly
> through guest_memfd. Any aging would, therefore, have to be done through
> guest_memfd.
> 
> Which we don't support and don't want to support :)
> 
> That we happen to have a matching user space range that maps the guest_memfd is
> just coincidence from a KVM MMU point of view.
> 
> > 
> > The invalidate callbacks are also used when userspace memory is marked read-only
> > for dirty state tracking. I was trying to explaing that, since there is no
> > backing for the guest_memfd file, host doesn't need to keep track of dirty state
> > for the memory, and ignoring the invalidate callbacks is correct for all cases.
> > 
> > I can drop the paragraph entirely, if you think that would make the commit
> > message clearer.
> 
> I think the real motivation is:
> 
> "Mappings in the secondary MMU were established by obtaining folios from
> guest_memfd directly, not by looking the folios up through the page tables
> through GUP. Consequently, there is no relationship between the page tables and
> the secondary MMU: MMU notifiers do not apply."

That's much better than my version, thanks!

Alex


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox