* [PATCH bpf-next 0/4] x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi
@ 2026-01-12 21:49 Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 1/4] x86/fgraph: Fix return_to_handler regs.rsp value Jiri Olsa
` (3 more replies)
0 siblings, 4 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-12 21:49 UTC (permalink / raw)
To: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf
Cc: Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko, Mahe Tardy
hi,
Mahe reported a missing function in the stack trace taken on top of a
kprobe multi program. It turned out the latest fix [1] needs some more
fixing.
thanks,
jirka
[1] https://lore.kernel.org/bpf/20251104215405.168643-1-jolsa@kernel.org/
---
Jiri Olsa (4):
x86/fgraph: Fix return_to_handler regs.rsp value
x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
selftests/bpf: Fix kprobe multi stacktrace_ips test
selftests/bpf: Allow to benchmark trigger with stacktrace
arch/x86/include/asm/ftrace.h | 2 +-
arch/x86/kernel/ftrace_64.S | 4 +++-
kernel/trace/fgraph.c | 2 +-
tools/testing/selftests/bpf/bench.c | 4 ++++
tools/testing/selftests/bpf/bench.h | 1 +
tools/testing/selftests/bpf/benchs/bench_trigger.c | 1 +
tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c | 3 ++-
tools/testing/selftests/bpf/progs/trigger_bench.c | 18 ++++++++++++++++++
8 files changed, 31 insertions(+), 4 deletions(-)
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH bpf-next 1/4] x86/fgraph: Fix return_to_handler regs.rsp value
2026-01-12 21:49 [PATCH bpf-next 0/4] x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi Jiri Olsa
@ 2026-01-12 21:49 ` Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path Jiri Olsa
` (2 subsequent siblings)
3 siblings, 0 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-12 21:49 UTC (permalink / raw)
To: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf
Cc: Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko, Mahe Tardy
The previous change (the Fixes commit) stored the wrong rsp register
value: at that point rsp has already been adjusted by FRAME_SIZE, while
we need the original value (the one after the UNWIND_HINT_FUNC hint).
Fixes: 20a0bc10272f ("x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/kernel/ftrace_64.S | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index a132608265f6..613be81b6e88 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -368,13 +368,15 @@ SYM_CODE_START(return_to_handler)
subq $8, %rsp
UNWIND_HINT_FUNC
+ movq %rsp, %rdi
+
/* Save ftrace_regs for function exit context */
subq $(FRAME_SIZE), %rsp
movq %rax, RAX(%rsp)
movq %rdx, RDX(%rsp)
movq %rbp, RBP(%rsp)
- movq %rsp, RSP(%rsp)
+ movq %rdi, RSP(%rsp)
movq %rsp, %rdi
call ftrace_return_to_handler
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-12 21:49 [PATCH bpf-next 0/4] x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 1/4] x86/fgraph: Fix return_to_handler regs.rsp value Jiri Olsa
@ 2026-01-12 21:49 ` Jiri Olsa
2026-01-12 22:07 ` Steven Rostedt
2026-01-12 21:49 ` [PATCH bpf-next 3/4] selftests/bpf: Fix kprobe multi stacktrace_ips test Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace Jiri Olsa
3 siblings, 1 reply; 17+ messages in thread
From: Jiri Olsa @ 2026-01-12 21:49 UTC (permalink / raw)
To: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf
Cc: Mahe Tardy, Peter Zijlstra, bpf, linux-trace-kernel, x86,
Yonghong Song, Song Liu, Andrii Nakryiko
Mahe reported a missing function in the stack trace taken on top of a
kprobe multi program. The missing function is the very first one in the
stack trace, the one the bpf program is attached to.
# bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
('*' is used for kprobe_multi attachment)
The reason is that the previous change (the Fixes commit) fixed the
stack unwind for tracepoints, but dropped the attached function's
address from stack traces taken on top of kprobe multi programs, which
I also overlooked in the related test (see the following patch).
Tracepoints and kprobe_multi have different stack setups, but use the
same unwind path. I think it's better to keep the previous change,
which fixed the tracepoint unwind, and instead change the kprobe multi
unwind as explained below.
The bpf program stack unwind calls perf_callchain_kernel for the kernel
portion, which follows one of two unwind paths based on the
X86_EFLAGS_FIXED bit in pt_regs.flags.
When the bit is set, we unwind the stack represented by the pt_regs
argument; otherwise we unwind the currently executing stack up to the
'first_frame' boundary.
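The path selection above can be sketched in plain userspace C. This is
an illustrative sketch only, not kernel code: fake_pt_regs and
unwind_start_from() are hypothetical names, and only the
X86_EFLAGS_FIXED test mirrors the actual perf_callchain_kernel logic.

```c
/*
 * Illustrative userspace sketch of how perf_callchain_kernel picks
 * its unwind path. fake_pt_regs and unwind_start_from() are made-up
 * names; only the X86_EFLAGS_FIXED branch mirrors the kernel logic.
 */
#include <stdint.h>

#define X86_EFLAGS_FIXED 0x0002UL  /* bit 1, reserved, always set in EFLAGS */

struct fake_pt_regs {
	unsigned long flags;
	unsigned long sp;   /* becomes 'first_frame' on the live-stack path */
	unsigned long ip;
};

/*
 * Pick where unwinding starts:
 * - bit set:   trust the saved regs and unwind from regs->ip
 * - bit clear: walk the currently executing stack (current_pc) and
 *              stop once the frame passes regs->sp ('first_frame')
 */
static unsigned long unwind_start_from(const struct fake_pt_regs *regs,
				       unsigned long current_pc)
{
	if (regs->flags & X86_EFLAGS_FIXED)
		return regs->ip;    /* hw_regs path */
	return current_pc;          /* live-stack path, bounded by first_frame */
}
```

With the bit set, the unwind starts exactly at the saved regs->ip (the
attached function) rather than walking the live stack up to
'first_frame'.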
The 'first_frame' value is taken from regs.rsp, but the ftrace_caller
and ftrace_regs_caller (ftrace trampoline) functions set regs.rsp to
the previous stack frame, so we skip the attached function's entry.
If we switch the kprobe_multi unwind to use the X86_EFLAGS_FIXED bit,
we can control the start of the unwind and get the attached function's
address back. As another benefit, we also cut the extra unwind cycles
needed to reach the 'first_frame' boundary.
The speedup can be measured with the trigger benchmark for the
kprobe_multi program with stacktrace support.
- without bpf_get_stackid call:
# ./bench -w2 -d5 -a -p1 trig-kprobe-multi
Summary: hits 0.857 ± 0.003M/s ( 0.857M/prod), drops 0.000 ± 0.000M/s, total operations 0.857 ± 0.003M/s
- with bpf_get_stackid call:
# ./bench -w2 -d5 -a -g -p1 trig-kprobe-multi
Summary: hits 1.302 ± 0.002M/s ( 1.302M/prod), drops 0.000 ± 0.000M/s, total operations 1.302 ± 0.002M/s
Note the '-g' stacktrace option is added in the following change.
To recreate the same stack setup for the return probe as we have for
the entry probe, we set the instruction pointer to the attached
function's address, which gets us the same unwind setup and the same
stack trace.
With the fix, entry probe:
# bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
__x64_sys_newuname+9
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
return probe:
# bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
Attaching 1 probe...
__x64_sys_newuname+4
do_syscall_64+134
entry_SYSCALL_64_after_hwframe+118
Fixes: 6d08340d1e35 ("Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"")
Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
arch/x86/include/asm/ftrace.h | 2 +-
kernel/trace/fgraph.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index b08c95872eed..c56e1e63b893 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -57,7 +57,7 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
}
#define arch_ftrace_partial_regs(regs) do { \
- regs->flags &= ~X86_EFLAGS_FIXED; \
+ regs->flags |= X86_EFLAGS_FIXED; \
regs->cs = __KERNEL_CS; \
} while (0)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index cc48d16be43e..6279e0a753cf 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -825,7 +825,7 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
}
if (fregs)
- ftrace_regs_set_instruction_pointer(fregs, ret);
+ ftrace_regs_set_instruction_pointer(fregs, trace.func);
bit = ftrace_test_recursion_trylock(trace.func, ret);
/*
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next 3/4] selftests/bpf: Fix kprobe multi stacktrace_ips test
2026-01-12 21:49 [PATCH bpf-next 0/4] x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 1/4] x86/fgraph: Fix return_to_handler regs.rsp value Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path Jiri Olsa
@ 2026-01-12 21:49 ` Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace Jiri Olsa
3 siblings, 0 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-12 21:49 UTC (permalink / raw)
To: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf
Cc: Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko, Mahe Tardy
We now include the attached function in the stack trace; fix the test
accordingly.
Fixes: c9e208fa93cd ("selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
index c9efdd2a5b18..43781a67051b 100644
--- a/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
+++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
@@ -74,7 +74,8 @@ static void test_stacktrace_ips_kprobe_multi(bool retprobe)
load_kallsyms();
- check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 4,
+ check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 5,
+ ksym_get_addr("bpf_testmod_stacktrace_test"),
ksym_get_addr("bpf_testmod_stacktrace_test_3"),
ksym_get_addr("bpf_testmod_stacktrace_test_2"),
ksym_get_addr("bpf_testmod_stacktrace_test_1"),
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace
2026-01-12 21:49 [PATCH bpf-next 0/4] x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi Jiri Olsa
` (2 preceding siblings ...)
2026-01-12 21:49 ` [PATCH bpf-next 3/4] selftests/bpf: Fix kprobe multi stacktrace_ips test Jiri Olsa
@ 2026-01-12 21:49 ` Jiri Olsa
2026-01-15 18:48 ` Andrii Nakryiko
3 siblings, 1 reply; 17+ messages in thread
From: Jiri Olsa @ 2026-01-12 21:49 UTC (permalink / raw)
To: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf
Cc: Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko, Mahe Tardy
Adding support to call bpf_get_stackid helper from trigger programs,
so far added for kprobe multi.
Adding the --stacktrace/-g option to enable it.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
tools/testing/selftests/bpf/bench.c | 4 ++++
tools/testing/selftests/bpf/bench.h | 1 +
.../selftests/bpf/benchs/bench_trigger.c | 1 +
.../selftests/bpf/progs/trigger_bench.c | 18 ++++++++++++++++++
4 files changed, 24 insertions(+)
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index bd29bb2e6cb5..8dadd9c928ec 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -265,6 +265,7 @@ static const struct argp_option opts[] = {
{ "verbose", 'v', NULL, 0, "Verbose debug output"},
{ "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
{ "quiet", 'q', NULL, 0, "Be more quiet"},
+ { "stacktrace", 'g', NULL, 0, "Get stack trace"},
{ "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
"Set of CPUs for producer threads; implies --affinity"},
{ "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
@@ -350,6 +351,9 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
case 'q':
env.quiet = true;
break;
+ case 'g':
+ env.stacktrace = true;
+ break;
case ARG_PROD_AFFINITY_SET:
env.affinity = true;
if (parse_num_list(arg, &env.prod_cpus.cpus,
diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
index bea323820ffb..7cf21936e7ed 100644
--- a/tools/testing/selftests/bpf/bench.h
+++ b/tools/testing/selftests/bpf/bench.h
@@ -26,6 +26,7 @@ struct env {
bool list;
bool affinity;
bool quiet;
+ bool stacktrace;
int consumer_cnt;
int producer_cnt;
int nr_cpus;
diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
index 34018fc3927f..aeec9edd3851 100644
--- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
+++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
@@ -146,6 +146,7 @@ static void setup_ctx(void)
bpf_program__set_autoload(ctx.skel->progs.trigger_driver, true);
ctx.skel->rodata->batch_iters = args.batch_iters;
+ ctx.skel->rodata->stacktrace = env.stacktrace;
}
static void load_ctx(void)
diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
index 2898b3749d07..479400d96fa4 100644
--- a/tools/testing/selftests/bpf/progs/trigger_bench.c
+++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
@@ -25,6 +25,23 @@ static __always_inline void inc_counter(void)
__sync_add_and_fetch(&hits[cpu & CPU_MASK].value, 1);
}
+volatile const int stacktrace;
+
+typedef __u64 stack_trace_t[128];
+
+struct {
+ __uint(type, BPF_MAP_TYPE_STACK_TRACE);
+ __uint(max_entries, 16384);
+ __type(key, __u32);
+ __type(value, stack_trace_t);
+} stackmap SEC(".maps");
+
+static __always_inline void do_stacktrace(void *ctx)
+{
+ if (stacktrace)
+ bpf_get_stackid(ctx, &stackmap, 0);
+}
+
SEC("?uprobe")
int bench_trigger_uprobe(void *ctx)
{
@@ -96,6 +113,7 @@ SEC("?kprobe.multi/bpf_get_numa_node_id")
int bench_trigger_kprobe_multi(void *ctx)
{
inc_counter();
+ do_stacktrace(ctx);
return 0;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-12 21:49 ` [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path Jiri Olsa
@ 2026-01-12 22:07 ` Steven Rostedt
2026-01-13 11:43 ` Jiri Olsa
0 siblings, 1 reply; 17+ messages in thread
From: Steven Rostedt @ 2026-01-12 22:07 UTC (permalink / raw)
To: Jiri Olsa
Cc: Masami Hiramatsu, Josh Poimboeuf, Mahe Tardy, Peter Zijlstra, bpf,
linux-trace-kernel, x86, Yonghong Song, Song Liu, Andrii Nakryiko
On Mon, 12 Jan 2026 22:49:38 +0100
Jiri Olsa <jolsa@kernel.org> wrote:
> To recreate the same stack setup for the return probe as we have for
> the entry probe, we set the instruction pointer to the attached
> function's address, which gets us the same unwind setup and the same
> stack trace.
>
> With the fix, entry probe:
>
> # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
> Attaching 1 probe...
>
> __x64_sys_newuname+9
> do_syscall_64+134
> entry_SYSCALL_64_after_hwframe+118
>
> return probe:
>
> # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
> Attaching 1 probe...
>
> __x64_sys_newuname+4
> do_syscall_64+134
> entry_SYSCALL_64_after_hwframe+118
But is this really correct?
The stack trace of the return from __x64_sys_newuname is from offset "+4".
The stack trace from entry is from offset "+9". Isn't it confusing that
the offset is likely not from the return portion of that function?
-- Steve
>
> Fixes: 6d08340d1e35 ("Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"")
> Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-12 22:07 ` Steven Rostedt
@ 2026-01-13 11:43 ` Jiri Olsa
2026-01-15 18:52 ` Andrii Nakryiko
2026-01-20 16:17 ` Jiri Olsa
0 siblings, 2 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-13 11:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: Masami Hiramatsu, Josh Poimboeuf, Mahe Tardy, Peter Zijlstra, bpf,
linux-trace-kernel, x86, Yonghong Song, Song Liu, Andrii Nakryiko
On Mon, Jan 12, 2026 at 05:07:57PM -0500, Steven Rostedt wrote:
> On Mon, 12 Jan 2026 22:49:38 +0100
> Jiri Olsa <jolsa@kernel.org> wrote:
>
> > > To recreate the same stack setup for the return probe as we have
> > > for the entry probe, we set the instruction pointer to the
> > > attached function's address, which gets us the same unwind setup
> > > and the same stack trace.
> >
> > With the fix, entry probe:
> >
> > # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
> > Attaching 1 probe...
> >
> > __x64_sys_newuname+9
> > do_syscall_64+134
> > entry_SYSCALL_64_after_hwframe+118
> >
> > return probe:
> >
> > # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
> > Attaching 1 probe...
> >
> > __x64_sys_newuname+4
> > do_syscall_64+134
> > entry_SYSCALL_64_after_hwframe+118
>
> But is this really correct?
>
> The stack trace of the return from __x64_sys_newuname is from offset "+4".
>
> The stack trace from entry is from offset "+9". Isn't it confusing
> that the offset is likely not from the return portion of that function?
right, makes sense.. so the standard kprobe actually skips the attached
function (__x64_sys_newuname) in the return probe stacktrace.. perhaps
we should do the same for kprobe_multi
I managed to get that with the change below, but it's wrong wrt the
arch code, note the ftrace_regs_set_stack_pointer(fregs, stack + 8)
call.. I will try to figure out a better way once we agree on the
solution
thanks,
jirka
---
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index c56e1e63b893..b0e8ce4934e7 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -71,6 +71,9 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
#define ftrace_regs_set_instruction_pointer(fregs, _ip) \
do { arch_ftrace_regs(fregs)->regs.ip = (_ip); } while (0)
+#define ftrace_regs_set_stack_pointer(fregs, _sp) \
+ do { arch_ftrace_regs(fregs)->regs.sp = (_sp); } while (0)
+
static __always_inline unsigned long
ftrace_regs_get_return_address(struct ftrace_regs *fregs)
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 6279e0a753cf..b1510c412dcb 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -717,7 +717,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned long func,
/* Retrieve a function return address to the trace stack on thread info.*/
static struct ftrace_ret_stack *
ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
- unsigned long frame_pointer, int *offset)
+ unsigned long *stack, unsigned long frame_pointer,
+ int *offset)
{
struct ftrace_ret_stack *ret_stack;
@@ -762,6 +763,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
*offset += FGRAPH_FRAME_OFFSET;
*ret = ret_stack->ret;
+ *stack = (unsigned long) ret_stack->retp;
trace->func = ret_stack->func;
trace->overrun = atomic_read(&current->trace_overrun);
trace->depth = current->curr_ret_depth;
@@ -810,12 +812,13 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
struct ftrace_ret_stack *ret_stack;
struct ftrace_graph_ret trace;
unsigned long bitmap;
+ unsigned long stack;
unsigned long ret;
int offset;
int bit;
int i;
- ret_stack = ftrace_pop_return_trace(&trace, &ret, frame_pointer, &offset);
+ ret_stack = ftrace_pop_return_trace(&trace, &ret, &stack, frame_pointer, &offset);
if (unlikely(!ret_stack)) {
ftrace_graph_stop();
@@ -824,8 +827,11 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
return (unsigned long)panic;
}
- if (fregs)
- ftrace_regs_set_instruction_pointer(fregs, trace.func);
+ if (fregs) {
+ ftrace_regs_set_instruction_pointer(fregs, ret);
+ ftrace_regs_set_stack_pointer(fregs, stack + 8);
+ }
+
bit = ftrace_test_recursion_trylock(trace.func, ret);
/*
diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
index e1a9b55e07cb..852830536109 100644
--- a/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
+++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
@@ -74,12 +74,20 @@ static void test_stacktrace_ips_kprobe_multi(bool retprobe)
load_kallsyms();
- check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 5,
- ksym_get_addr("bpf_testmod_stacktrace_test"),
- ksym_get_addr("bpf_testmod_stacktrace_test_3"),
- ksym_get_addr("bpf_testmod_stacktrace_test_2"),
- ksym_get_addr("bpf_testmod_stacktrace_test_1"),
- ksym_get_addr("bpf_testmod_test_read"));
+ if (retprobe) {
+ check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 4,
+ ksym_get_addr("bpf_testmod_stacktrace_test_3"),
+ ksym_get_addr("bpf_testmod_stacktrace_test_2"),
+ ksym_get_addr("bpf_testmod_stacktrace_test_1"),
+ ksym_get_addr("bpf_testmod_test_read"));
+ } else {
+ check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 5,
+ ksym_get_addr("bpf_testmod_stacktrace_test"),
+ ksym_get_addr("bpf_testmod_stacktrace_test_3"),
+ ksym_get_addr("bpf_testmod_stacktrace_test_2"),
+ ksym_get_addr("bpf_testmod_stacktrace_test_1"),
+ ksym_get_addr("bpf_testmod_test_read"));
+ }
cleanup:
stacktrace_ips__destroy(skel);
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace
2026-01-12 21:49 ` [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace Jiri Olsa
@ 2026-01-15 18:48 ` Andrii Nakryiko
2026-01-15 18:50 ` Andrii Nakryiko
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Andrii Nakryiko @ 2026-01-15 18:48 UTC (permalink / raw)
To: Jiri Olsa
Cc: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf, Peter Zijlstra,
bpf, linux-trace-kernel, x86, Yonghong Song, Song Liu,
Andrii Nakryiko, Mahe Tardy
On Mon, Jan 12, 2026 at 1:50 PM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding support to call bpf_get_stackid helper from trigger programs,
> so far added for kprobe multi.
>
> Adding the --stacktrace/-g option to enable it.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
> tools/testing/selftests/bpf/bench.c | 4 ++++
> tools/testing/selftests/bpf/bench.h | 1 +
> .../selftests/bpf/benchs/bench_trigger.c | 1 +
> .../selftests/bpf/progs/trigger_bench.c | 18 ++++++++++++++++++
> 4 files changed, 24 insertions(+)
>
This now actually becomes a stack trace benchmark :) But I don't mind,
I think it would be good to be able to benchmark this. But I think we
should then implement it for all different tracing programs (tp,
raw_tp, fentry/fexit/fmod_ret) for consistency and so we can compare
and contrast?...
> diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> index bd29bb2e6cb5..8dadd9c928ec 100644
> --- a/tools/testing/selftests/bpf/bench.c
> +++ b/tools/testing/selftests/bpf/bench.c
> @@ -265,6 +265,7 @@ static const struct argp_option opts[] = {
> { "verbose", 'v', NULL, 0, "Verbose debug output"},
> { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> { "quiet", 'q', NULL, 0, "Be more quiet"},
> + { "stacktrace", 'g', NULL, 0, "Get stack trace"},
bikeshedding time: why "g"? why not -S or something like that?
> { "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
> "Set of CPUs for producer threads; implies --affinity"},
> { "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
> @@ -350,6 +351,9 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
> case 'q':
> env.quiet = true;
> break;
> + case 'g':
> + env.stacktrace = true;
> + break;
> case ARG_PROD_AFFINITY_SET:
> env.affinity = true;
> if (parse_num_list(arg, &env.prod_cpus.cpus,
> diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
> index bea323820ffb..7cf21936e7ed 100644
> --- a/tools/testing/selftests/bpf/bench.h
> +++ b/tools/testing/selftests/bpf/bench.h
> @@ -26,6 +26,7 @@ struct env {
> bool list;
> bool affinity;
> bool quiet;
> + bool stacktrace;
> int consumer_cnt;
> int producer_cnt;
> int nr_cpus;
> diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> index 34018fc3927f..aeec9edd3851 100644
> --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> @@ -146,6 +146,7 @@ static void setup_ctx(void)
> bpf_program__set_autoload(ctx.skel->progs.trigger_driver, true);
>
> ctx.skel->rodata->batch_iters = args.batch_iters;
> + ctx.skel->rodata->stacktrace = env.stacktrace;
> }
>
> static void load_ctx(void)
> diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> index 2898b3749d07..479400d96fa4 100644
> --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> @@ -25,6 +25,23 @@ static __always_inline void inc_counter(void)
> __sync_add_and_fetch(&hits[cpu & CPU_MASK].value, 1);
> }
>
> +volatile const int stacktrace;
> +
> +typedef __u64 stack_trace_t[128];
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> + __uint(max_entries, 16384);
> + __type(key, __u32);
> + __type(value, stack_trace_t);
> +} stackmap SEC(".maps");
> +
> +static __always_inline void do_stacktrace(void *ctx)
> +{
> + if (stacktrace)
> + bpf_get_stackid(ctx, &stackmap, 0);
> +}
> +
> SEC("?uprobe")
> int bench_trigger_uprobe(void *ctx)
> {
> @@ -96,6 +113,7 @@ SEC("?kprobe.multi/bpf_get_numa_node_id")
> int bench_trigger_kprobe_multi(void *ctx)
> {
> inc_counter();
> + do_stacktrace(ctx);
> return 0;
> }
>
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace
2026-01-15 18:48 ` Andrii Nakryiko
@ 2026-01-15 18:50 ` Andrii Nakryiko
2026-01-16 16:30 ` Jiri Olsa
2026-01-16 16:26 ` Jiri Olsa
2026-01-22 8:35 ` Jiri Olsa
2 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2026-01-15 18:50 UTC (permalink / raw)
To: Jiri Olsa
Cc: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf, Peter Zijlstra,
bpf, linux-trace-kernel, x86, Yonghong Song, Song Liu,
Andrii Nakryiko, Mahe Tardy
On Thu, Jan 15, 2026 at 10:48 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jan 12, 2026 at 1:50 PM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding support to call bpf_get_stackid helper from trigger programs,
> > so far added for kprobe multi.
> >
> > Adding the --stacktrace/-g option to enable it.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> > tools/testing/selftests/bpf/bench.c | 4 ++++
> > tools/testing/selftests/bpf/bench.h | 1 +
> > .../selftests/bpf/benchs/bench_trigger.c | 1 +
> > .../selftests/bpf/progs/trigger_bench.c | 18 ++++++++++++++++++
> > 4 files changed, 24 insertions(+)
> >
>
> This now actually becomes a stack trace benchmark :) But I don't mind,
> I think it would be good to be able to benchmark this. But I think we
> should then implement it for all different tracing programs (tp,
> raw_tp, fentry/fexit/fmod_ret) for consistency and so we can compare
> and contrast?...
>
> > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > index bd29bb2e6cb5..8dadd9c928ec 100644
> > --- a/tools/testing/selftests/bpf/bench.c
> > +++ b/tools/testing/selftests/bpf/bench.c
> > @@ -265,6 +265,7 @@ static const struct argp_option opts[] = {
> > { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > { "quiet", 'q', NULL, 0, "Be more quiet"},
> > + { "stacktrace", 'g', NULL, 0, "Get stack trace"},
>
> bikeshedding time: why "g"? why not -S or something like that?
>
> > { "prod-affinity", ARG_PROD_AFFINITY_SET, "CPUSET", 0,
> > "Set of CPUs for producer threads; implies --affinity"},
> > { "cons-affinity", ARG_CONS_AFFINITY_SET, "CPUSET", 0,
> > @@ -350,6 +351,9 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
> > case 'q':
> > env.quiet = true;
> > break;
> > + case 'g':
> > + env.stacktrace = true;
> > + break;
> > case ARG_PROD_AFFINITY_SET:
> > env.affinity = true;
> > if (parse_num_list(arg, &env.prod_cpus.cpus,
> > diff --git a/tools/testing/selftests/bpf/bench.h b/tools/testing/selftests/bpf/bench.h
> > index bea323820ffb..7cf21936e7ed 100644
> > --- a/tools/testing/selftests/bpf/bench.h
> > +++ b/tools/testing/selftests/bpf/bench.h
> > @@ -26,6 +26,7 @@ struct env {
> > bool list;
> > bool affinity;
> > bool quiet;
> > + bool stacktrace;
> > int consumer_cnt;
> > int producer_cnt;
> > int nr_cpus;
> > diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > index 34018fc3927f..aeec9edd3851 100644
> > --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > @@ -146,6 +146,7 @@ static void setup_ctx(void)
> > bpf_program__set_autoload(ctx.skel->progs.trigger_driver, true);
> >
> > ctx.skel->rodata->batch_iters = args.batch_iters;
> > + ctx.skel->rodata->stacktrace = env.stacktrace;
> > }
> >
> > static void load_ctx(void)
> > diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > index 2898b3749d07..479400d96fa4 100644
> > --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> > +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > @@ -25,6 +25,23 @@ static __always_inline void inc_counter(void)
> > __sync_add_and_fetch(&hits[cpu & CPU_MASK].value, 1);
> > }
> >
> > +volatile const int stacktrace;
> > +
> > +typedef __u64 stack_trace_t[128];
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> > + __uint(max_entries, 16384);
> > + __type(key, __u32);
> > + __type(value, stack_trace_t);
> > +} stackmap SEC(".maps");
oh, and why bother with the STACK_TRACE map? just call the
bpf_get_stack() API and maybe have a per-CPU scratch array for the
stack trace (per-CPU so that in multi-CPU benchmarks they don't just
contend on the same cache lines)
> > +
> > +static __always_inline void do_stacktrace(void *ctx)
> > +{
> > + if (stacktrace)
> > + bpf_get_stackid(ctx, &stackmap, 0);
> > +}
> > +
> > SEC("?uprobe")
> > int bench_trigger_uprobe(void *ctx)
> > {
> > @@ -96,6 +113,7 @@ SEC("?kprobe.multi/bpf_get_numa_node_id")
> > int bench_trigger_kprobe_multi(void *ctx)
> > {
> > inc_counter();
> > + do_stacktrace(ctx);
> > return 0;
> > }
> >
> > --
> > 2.52.0
> >
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-13 11:43 ` Jiri Olsa
@ 2026-01-15 18:52 ` Andrii Nakryiko
2026-01-16 16:25 ` Jiri Olsa
2026-01-20 16:17 ` Jiri Olsa
1 sibling, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2026-01-15 18:52 UTC (permalink / raw)
To: Jiri Olsa
Cc: Steven Rostedt, Masami Hiramatsu, Josh Poimboeuf, Mahe Tardy,
Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko
On Tue, Jan 13, 2026 at 3:43 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Jan 12, 2026 at 05:07:57PM -0500, Steven Rostedt wrote:
> > On Mon, 12 Jan 2026 22:49:38 +0100
> > Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > > To recreate the same stack setup for the return probe as we have
> > > for the entry probe, we set the instruction pointer to the
> > > attached function's address, which gets us the same unwind setup
> > > and the same stack trace.
> > >
> > > With the fix, entry probe:
> > >
> > > # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
> > > Attaching 1 probe...
> > >
> > > __x64_sys_newuname+9
> > > do_syscall_64+134
> > > entry_SYSCALL_64_after_hwframe+118
> > >
> > > return probe:
> > >
> > > # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
> > > Attaching 1 probe...
> > >
> > > __x64_sys_newuname+4
> > > do_syscall_64+134
> > > entry_SYSCALL_64_after_hwframe+118
> >
> > But is this really correct?
> >
> > The stack trace of the return from __x64_sys_newuname is from offset "+4".
> >
> > The stack trace from entry is from offset "+9". Isn't it confusing
> > that the offset is likely not from the return portion of that function?
>
> right, makes sense.. so the standard kprobe actually skips the
> attached function (__x64_sys_newuname) in the return probe
> stacktrace.. perhaps we should do the same for kprobe_multi
but it is quite nice to see what function we were kretprobing,
actually... How hard would it be to support that for singular kprobe
as well? And what does fexit's stack trace show in such a case?
>
> I managed to get that with the change below, but it's wrong wrt arch code,
> note the ftrace_regs_set_stack_pointer(fregs, stack + 8) .. will try to
> figure out a better way when we agree on the solution
>
> thanks,
> jirka
>
>
> ---
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index c56e1e63b893..b0e8ce4934e7 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -71,6 +71,9 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
> #define ftrace_regs_set_instruction_pointer(fregs, _ip) \
> do { arch_ftrace_regs(fregs)->regs.ip = (_ip); } while (0)
>
> +#define ftrace_regs_set_stack_pointer(fregs, _sp) \
> + do { arch_ftrace_regs(fregs)->regs.sp = (_sp); } while (0)
> +
>
> static __always_inline unsigned long
> ftrace_regs_get_return_address(struct ftrace_regs *fregs)
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 6279e0a753cf..b1510c412dcb 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -717,7 +717,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned long func,
> /* Retrieve a function return address to the trace stack on thread info.*/
> static struct ftrace_ret_stack *
> ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
> - unsigned long frame_pointer, int *offset)
> + unsigned long *stack, unsigned long frame_pointer,
> + int *offset)
> {
> struct ftrace_ret_stack *ret_stack;
>
> @@ -762,6 +763,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
>
> *offset += FGRAPH_FRAME_OFFSET;
> *ret = ret_stack->ret;
> + *stack = (unsigned long) ret_stack->retp;
> trace->func = ret_stack->func;
> trace->overrun = atomic_read(&current->trace_overrun);
> trace->depth = current->curr_ret_depth;
> @@ -810,12 +812,13 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
> struct ftrace_ret_stack *ret_stack;
> struct ftrace_graph_ret trace;
> unsigned long bitmap;
> + unsigned long stack;
> unsigned long ret;
> int offset;
> int bit;
> int i;
>
> - ret_stack = ftrace_pop_return_trace(&trace, &ret, frame_pointer, &offset);
> + ret_stack = ftrace_pop_return_trace(&trace, &ret, &stack, frame_pointer, &offset);
>
> if (unlikely(!ret_stack)) {
> ftrace_graph_stop();
> @@ -824,8 +827,11 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
> return (unsigned long)panic;
> }
>
> - if (fregs)
> - ftrace_regs_set_instruction_pointer(fregs, trace.func);
> + if (fregs) {
> + ftrace_regs_set_instruction_pointer(fregs, ret);
> + ftrace_regs_set_stack_pointer(fregs, stack + 8);
> + }
> +
>
> bit = ftrace_test_recursion_trylock(trace.func, ret);
> /*
> diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
> index e1a9b55e07cb..852830536109 100644
> --- a/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
> +++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_ips.c
> @@ -74,12 +74,20 @@ static void test_stacktrace_ips_kprobe_multi(bool retprobe)
>
> load_kallsyms();
>
> - check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 5,
> - ksym_get_addr("bpf_testmod_stacktrace_test"),
> - ksym_get_addr("bpf_testmod_stacktrace_test_3"),
> - ksym_get_addr("bpf_testmod_stacktrace_test_2"),
> - ksym_get_addr("bpf_testmod_stacktrace_test_1"),
> - ksym_get_addr("bpf_testmod_test_read"));
> + if (retprobe) {
> + check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 4,
> + ksym_get_addr("bpf_testmod_stacktrace_test_3"),
> + ksym_get_addr("bpf_testmod_stacktrace_test_2"),
> + ksym_get_addr("bpf_testmod_stacktrace_test_1"),
> + ksym_get_addr("bpf_testmod_test_read"));
> + } else {
> + check_stacktrace_ips(bpf_map__fd(skel->maps.stackmap), skel->bss->stack_key, 5,
> + ksym_get_addr("bpf_testmod_stacktrace_test"),
> + ksym_get_addr("bpf_testmod_stacktrace_test_3"),
> + ksym_get_addr("bpf_testmod_stacktrace_test_2"),
> + ksym_get_addr("bpf_testmod_stacktrace_test_1"),
> + ksym_get_addr("bpf_testmod_test_read"));
> + }
>
> cleanup:
> stacktrace_ips__destroy(skel);
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-15 18:52 ` Andrii Nakryiko
@ 2026-01-16 16:25 ` Jiri Olsa
2026-01-16 22:24 ` Andrii Nakryiko
0 siblings, 1 reply; 17+ messages in thread
From: Jiri Olsa @ 2026-01-16 16:25 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Steven Rostedt, Masami Hiramatsu, Josh Poimboeuf,
Mahe Tardy, Peter Zijlstra, bpf, linux-trace-kernel, x86,
Yonghong Song, Song Liu, Andrii Nakryiko
On Thu, Jan 15, 2026 at 10:52:04AM -0800, Andrii Nakryiko wrote:
> On Tue, Jan 13, 2026 at 3:43 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Jan 12, 2026 at 05:07:57PM -0500, Steven Rostedt wrote:
> > > On Mon, 12 Jan 2026 22:49:38 +0100
> > > Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > > To recreate same stack setup for return probe as we have for entry
> > > > probe, we set the instruction pointer to the attached function address,
> > > > which gets us the same unwind setup and same stack trace.
> > > >
> > > > With the fix, entry probe:
> > > >
> > > > # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
> > > > Attaching 1 probe...
> > > >
> > > > __x64_sys_newuname+9
> > > > do_syscall_64+134
> > > > entry_SYSCALL_64_after_hwframe+118
> > > >
> > > > return probe:
> > > >
> > > > # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
> > > > Attaching 1 probe...
> > > >
> > > > __x64_sys_newuname+4
> > > > do_syscall_64+134
> > > > entry_SYSCALL_64_after_hwframe+118
> > >
> > > But is this really correct?
> > >
> > > The stack trace of the return from __x64_sys_newuname is from offset "+4".
> > >
> > > The stack trace from entry is offset "+9". Isn't it confusing that the
> > > offset is likely not from the return portion of that function?
> >
> > right, makes sense.. so standard kprobe actually skips the attached function
> > (__x64_sys_newuname) on return probe stacktrace.. perhaps we should do
> > the same for kprobe_multi
>
> but it is quite nice to see what function we were kretprobing,
> actually...
IIUC Steven doesn't like the wrong +offset that comes from entry probe,
maybe we could have func+(ADDRESS_OF_RET+1) ..but not sure how hard that
would be
still.. you always have the attached function ip when you get the stacktrace,
so I'm not sure how useful it is to have it in the stacktrace as well.. you can
always add it yourself
> How hard would it be to support that for singular kprobe
> as well? And what does fexit's stack trace show in such a case?
I think we will get the bpf_program address, so we see the attached
function in stacktrace, will check
thanks,
jirka
* Re: [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace
2026-01-15 18:48 ` Andrii Nakryiko
2026-01-15 18:50 ` Andrii Nakryiko
@ 2026-01-16 16:26 ` Jiri Olsa
2026-01-22 8:35 ` Jiri Olsa
2 siblings, 0 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-16 16:26 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf, Peter Zijlstra,
bpf, linux-trace-kernel, x86, Yonghong Song, Song Liu,
Andrii Nakryiko, Mahe Tardy
On Thu, Jan 15, 2026 at 10:48:29AM -0800, Andrii Nakryiko wrote:
> On Mon, Jan 12, 2026 at 1:50 PM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding support to call bpf_get_stackid helper from trigger programs,
> > so far added for kprobe multi.
> >
> > Adding the --stacktrace/-g option to enable it.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> > tools/testing/selftests/bpf/bench.c | 4 ++++
> > tools/testing/selftests/bpf/bench.h | 1 +
> > .../selftests/bpf/benchs/bench_trigger.c | 1 +
> > .../selftests/bpf/progs/trigger_bench.c | 18 ++++++++++++++++++
> > 4 files changed, 24 insertions(+)
> >
>
> This now actually becomes a stack trace benchmark :) But I don't mind,
> I think it would be good to be able to benchmark this. But I think we
> should then implement it for all different tracing programs (tp,
> raw_tp, fentry/fexit/fmod_ret) for consistency and so we can compare
> and contrast?...
yep, agreed
>
> > diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
> > index bd29bb2e6cb5..8dadd9c928ec 100644
> > --- a/tools/testing/selftests/bpf/bench.c
> > +++ b/tools/testing/selftests/bpf/bench.c
> > @@ -265,6 +265,7 @@ static const struct argp_option opts[] = {
> > { "verbose", 'v', NULL, 0, "Verbose debug output"},
> > { "affinity", 'a', NULL, 0, "Set consumer/producer thread affinity"},
> > { "quiet", 'q', NULL, 0, "Be more quiet"},
> > + { "stacktrace", 'g', NULL, 0, "Get stack trace"},
>
> bikeshedding time: why "g"? why not -S or something like that?
perf tool strikes back ;-) -S is better
thanks,
jirka
* Re: [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace
2026-01-15 18:50 ` Andrii Nakryiko
@ 2026-01-16 16:30 ` Jiri Olsa
0 siblings, 0 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-16 16:30 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf, Peter Zijlstra,
bpf, linux-trace-kernel, x86, Yonghong Song, Song Liu,
Andrii Nakryiko, Mahe Tardy
On Thu, Jan 15, 2026 at 10:50:13AM -0800, Andrii Nakryiko wrote:
SNIP
> > > diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > > index 34018fc3927f..aeec9edd3851 100644
> > > --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > > +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c
> > > @@ -146,6 +146,7 @@ static void setup_ctx(void)
> > > bpf_program__set_autoload(ctx.skel->progs.trigger_driver, true);
> > >
> > > ctx.skel->rodata->batch_iters = args.batch_iters;
> > > + ctx.skel->rodata->stacktrace = env.stacktrace;
> > > }
> > >
> > > static void load_ctx(void)
> > > diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > > index 2898b3749d07..479400d96fa4 100644
> > > --- a/tools/testing/selftests/bpf/progs/trigger_bench.c
> > > +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c
> > > @@ -25,6 +25,23 @@ static __always_inline void inc_counter(void)
> > > __sync_add_and_fetch(&hits[cpu & CPU_MASK].value, 1);
> > > }
> > >
> > > +volatile const int stacktrace;
> > > +
> > > +typedef __u64 stack_trace_t[128];
> > > +
> > > +struct {
> > > + __uint(type, BPF_MAP_TYPE_STACK_TRACE);
> > > + __uint(max_entries, 16384);
> > > + __type(key, __u32);
> > > + __type(value, stack_trace_t);
> > > +} stackmap SEC(".maps");
>
> oh, and why bother with STACK_TRACE map, just call bpf_get_stack() API
> and have maybe per-CPU scratch array for stack trace (per-CPU so that
> in multi-cpu benchmarks they don't just contend on the same cache
> lines)
ok, will change
thanks,
jirka
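For reference, on the BPF side Andrii's suggestion could look roughly like the sketch below — a single-slot per-CPU array as scratch space plus bpf_get_stack(), instead of the BPF_MAP_TYPE_STACK_TRACE map. This is an untested illustration reusing the names from the patch (stacktrace, do_stacktrace, stack_trace_t); it is not the code Jiri later posted:

```c
typedef __u64 stack_trace_t[128];

/* One scratch slot per CPU so concurrent producers don't contend on
 * shared stack-map buckets or cache lines. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, stack_trace_t);
} scratch SEC(".maps");

static __always_inline void do_stacktrace(void *ctx)
{
	__u32 zero = 0;
	__u64 *buf;

	if (!stacktrace)
		return;
	buf = bpf_map_lookup_elem(&scratch, &zero);
	if (buf)
		bpf_get_stack(ctx, buf, sizeof(stack_trace_t), 0);
}
```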
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-16 16:25 ` Jiri Olsa
@ 2026-01-16 22:24 ` Andrii Nakryiko
2026-01-20 14:50 ` Steven Rostedt
0 siblings, 1 reply; 17+ messages in thread
From: Andrii Nakryiko @ 2026-01-16 22:24 UTC (permalink / raw)
To: Jiri Olsa
Cc: Steven Rostedt, Masami Hiramatsu, Josh Poimboeuf, Mahe Tardy,
Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko
On Fri, Jan 16, 2026 at 8:25 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Jan 15, 2026 at 10:52:04AM -0800, Andrii Nakryiko wrote:
> > On Tue, Jan 13, 2026 at 3:43 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Jan 12, 2026 at 05:07:57PM -0500, Steven Rostedt wrote:
> > > > On Mon, 12 Jan 2026 22:49:38 +0100
> > > > Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > > To recreate same stack setup for return probe as we have for entry
> > > > > probe, we set the instruction pointer to the attached function address,
> > > > > which gets us the same unwind setup and same stack trace.
> > > > >
> > > > > With the fix, entry probe:
> > > > >
> > > > > # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
> > > > > Attaching 1 probe...
> > > > >
> > > > > __x64_sys_newuname+9
> > > > > do_syscall_64+134
> > > > > entry_SYSCALL_64_after_hwframe+118
> > > > >
> > > > > return probe:
> > > > >
> > > > > # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
> > > > > Attaching 1 probe...
> > > > >
> > > > > __x64_sys_newuname+4
> > > > > do_syscall_64+134
> > > > > entry_SYSCALL_64_after_hwframe+118
> > > >
> > > > But is this really correct?
> > > >
> > > > The stack trace of the return from __x64_sys_newuname is from offset "+4".
> > > >
> > > > The stack trace from entry is offset "+9". Isn't it confusing that the
> > > > offset is likely not from the return portion of that function?
> > >
> > > right, makes sense.. so standard kprobe actually skips the attached function
> > > (__x64_sys_newuname) on return probe stacktrace.. perhaps we should do
> > > the same for kprobe_multi
> >
> > but it is quite nice to see what function we were kretprobing,
> > actually...
>
> IIUC Steven doesn't like the wrong +offset that comes from entry probe,
> maybe we could have func+(ADDRESS_OF_RET+1) ..but not sure how hard that
> would be
>
> still.. you always have the attached function ip when you get the stacktrace,
> so I'm not sure how useful it is to have it in the stacktrace as well.. you can
> always add it yourself
You can, but that's a custom thing that every single tool has to
implement. Having the traced function right there in the stack would be so
nice and convenient.
I don't insist, but I'm just saying that practically speaking this
would make sense. Even conceptually, kretprobe is (logically) called
from the traced function right before exit. In reality it's not exactly
like that and we don't know where ret happened, but having the traced
function in kretprobe's stack trace is more useful than confusing,
IMO.
But again, I just found it interesting that we could have it, if we wanted to.
>
>
> > How hard would it be to support that for singular kprobe
> > as well? And what does fexit's stack trace show in such a case?
>
> I think we will get the bpf_program address, so we see the attached
> function in stacktrace, will check
>
> thanks,
> jirka
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-16 22:24 ` Andrii Nakryiko
@ 2026-01-20 14:50 ` Steven Rostedt
0 siblings, 0 replies; 17+ messages in thread
From: Steven Rostedt @ 2026-01-20 14:50 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Jiri Olsa, Masami Hiramatsu, Josh Poimboeuf, Mahe Tardy,
Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko
On Fri, 16 Jan 2026 14:24:51 -0800
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> I don't insist, but I'm just saying that practically speaking this
> would make sense. Even conceptually, kretprobe is (logically) called
> from the traced function right before exit. In reality it's not exactly
> like that and we don't know where ret happened, but having the traced
> function in kretprobe's stack trace is more useful than confusing,
> IMO.
It's more useful than confusing for you because you understand it. For
anyone else, it will be very confusing, or worse, misleading, to see the
backtrace coming from the start of the function when the function has
already executed.
Sure, having the function is very useful, but the function is already
completed. Technically it shouldn't be in the stacktrace. Having the return
address in the trace should point out that the stacktrace came from the
function right before that address.
-- Steve
* Re: [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path
2026-01-13 11:43 ` Jiri Olsa
2026-01-15 18:52 ` Andrii Nakryiko
@ 2026-01-20 16:17 ` Jiri Olsa
1 sibling, 0 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-20 16:17 UTC (permalink / raw)
To: Jiri Olsa
Cc: Steven Rostedt, Masami Hiramatsu, Josh Poimboeuf, Mahe Tardy,
Peter Zijlstra, bpf, linux-trace-kernel, x86, Yonghong Song,
Song Liu, Andrii Nakryiko
On Tue, Jan 13, 2026 at 12:43:39PM +0100, Jiri Olsa wrote:
> On Mon, Jan 12, 2026 at 05:07:57PM -0500, Steven Rostedt wrote:
> > On Mon, 12 Jan 2026 22:49:38 +0100
> > Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > > To recreate same stack setup for return probe as we have for entry
> > > probe, we set the instruction pointer to the attached function address,
> > > which gets us the same unwind setup and same stack trace.
> > >
> > > With the fix, entry probe:
> > >
> > > # bpftrace -e 'kprobe:__x64_sys_newuname* { print(kstack)}'
> > > Attaching 1 probe...
> > >
> > > __x64_sys_newuname+9
> > > do_syscall_64+134
> > > entry_SYSCALL_64_after_hwframe+118
> > >
> > > return probe:
> > >
> > > # bpftrace -e 'kretprobe:__x64_sys_newuname* { print(kstack)}'
> > > Attaching 1 probe...
> > >
> > > __x64_sys_newuname+4
> > > do_syscall_64+134
> > > entry_SYSCALL_64_after_hwframe+118
> >
> > But is this really correct?
> >
> > > The stack trace of the return from __x64_sys_newuname is from offset "+4".
> >
> > The stack trace from entry is offset "+9". Isn't it confusing that the
> > offset is likely not from the return portion of that function?
>
> right, makes sense.. so standard kprobe actually skips the attached function
> (__x64_sys_newuname) on return probe stacktrace.. perhaps we should do
> the same for kprobe_multi
>
> I managed to get that with the change below, but it's wrong wrt arch code,
> note the ftrace_regs_set_stack_pointer(fregs, stack + 8) .. will try to
> figure out a better way when we agree on the solution
>
> thanks,
> jirka
>
>
> ---
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index c56e1e63b893..b0e8ce4934e7 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -71,6 +71,9 @@ arch_ftrace_get_regs(struct ftrace_regs *fregs)
> #define ftrace_regs_set_instruction_pointer(fregs, _ip) \
> do { arch_ftrace_regs(fregs)->regs.ip = (_ip); } while (0)
>
> +#define ftrace_regs_set_stack_pointer(fregs, _sp) \
> + do { arch_ftrace_regs(fregs)->regs.sp = (_sp); } while (0)
> +
>
> static __always_inline unsigned long
> ftrace_regs_get_return_address(struct ftrace_regs *fregs)
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 6279e0a753cf..b1510c412dcb 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -717,7 +717,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned long func,
> /* Retrieve a function return address to the trace stack on thread info.*/
> static struct ftrace_ret_stack *
> ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
> - unsigned long frame_pointer, int *offset)
> + unsigned long *stack, unsigned long frame_pointer,
> + int *offset)
> {
> struct ftrace_ret_stack *ret_stack;
>
> @@ -762,6 +763,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, unsigned long *ret,
>
> *offset += FGRAPH_FRAME_OFFSET;
> *ret = ret_stack->ret;
> + *stack = (unsigned long) ret_stack->retp;
> trace->func = ret_stack->func;
> trace->overrun = atomic_read(&current->trace_overrun);
> trace->depth = current->curr_ret_depth;
> @@ -810,12 +812,13 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
> struct ftrace_ret_stack *ret_stack;
> struct ftrace_graph_ret trace;
> unsigned long bitmap;
> + unsigned long stack;
> unsigned long ret;
> int offset;
> int bit;
> int i;
>
> - ret_stack = ftrace_pop_return_trace(&trace, &ret, frame_pointer, &offset);
> + ret_stack = ftrace_pop_return_trace(&trace, &ret, &stack, frame_pointer, &offset);
>
> if (unlikely(!ret_stack)) {
> ftrace_graph_stop();
> @@ -824,8 +827,11 @@ __ftrace_return_to_handler(struct ftrace_regs *fregs, unsigned long frame_pointe
> return (unsigned long)panic;
> }
>
> - if (fregs)
> - ftrace_regs_set_instruction_pointer(fregs, trace.func);
> + if (fregs) {
> + ftrace_regs_set_instruction_pointer(fregs, ret);
> + ftrace_regs_set_stack_pointer(fregs, stack + 8);
actually looks like this might be better solution.. storing the proper
rsp value directly to the regs in return_to_handler
jirka
---
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index a132608265f6..971883045b75 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -368,13 +368,16 @@ SYM_CODE_START(return_to_handler)
subq $8, %rsp
UNWIND_HINT_FUNC
+ movq %rsp, %rdi
+ addq $8, %rdi
+
/* Save ftrace_regs for function exit context */
subq $(FRAME_SIZE), %rsp
movq %rax, RAX(%rsp)
movq %rdx, RDX(%rsp)
movq %rbp, RBP(%rsp)
- movq %rsp, RSP(%rsp)
+ movq %rdi, RSP(%rsp)
movq %rsp, %rdi
call ftrace_return_to_handler
* Re: [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace
2026-01-15 18:48 ` Andrii Nakryiko
2026-01-15 18:50 ` Andrii Nakryiko
2026-01-16 16:26 ` Jiri Olsa
@ 2026-01-22 8:35 ` Jiri Olsa
2 siblings, 0 replies; 17+ messages in thread
From: Jiri Olsa @ 2026-01-22 8:35 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Masami Hiramatsu, Steven Rostedt, Josh Poimboeuf, Peter Zijlstra,
bpf, linux-trace-kernel, x86, Yonghong Song, Song Liu,
Andrii Nakryiko, Mahe Tardy
On Thu, Jan 15, 2026 at 10:48:29AM -0800, Andrii Nakryiko wrote:
> On Mon, Jan 12, 2026 at 1:50 PM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding support to call bpf_get_stackid helper from trigger programs,
> > so far added for kprobe multi.
> >
> > Adding the --stacktrace/-g option to enable it.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> > tools/testing/selftests/bpf/bench.c | 4 ++++
> > tools/testing/selftests/bpf/bench.h | 1 +
> > .../selftests/bpf/benchs/bench_trigger.c | 1 +
> > .../selftests/bpf/progs/trigger_bench.c | 18 ++++++++++++++++++
> > 4 files changed, 24 insertions(+)
> >
>
> This now actually becomes a stack trace benchmark :) But I don't mind,
> I think it would be good to be able to benchmark this. But I think we
> should then implement it for all different tracing programs (tp,
> raw_tp, fentry/fexit/fmod_ret) for consistency and so we can compare
> and contrast?...
fyi I updated the bench for all program types and got some stats
current fix WITHOUT stacktrace:
usermode-count : 810.652 ± 1.036M/s
kernel-count : 336.645 ± 2.812M/s
syscall-count : 27.798 ± 0.063M/s
fentry : 67.677 ± 0.291M/s
fexit : 49.970 ± 0.214M/s
fmodret : 52.860 ± 0.237M/s
rawtp : 65.196 ± 0.224M/s
tp : 34.120 ± 0.042M/s
kprobe : 25.157 ± 0.019M/s
kprobe-multi : 33.223 ± 0.205M/s
kprobe-multi-all: 4.739 ± 0.003M/s
kretprobe : 10.904 ± 0.020M/s
kretprobe-multi: 15.996 ± 0.023M/s
kretprobe-multi-all: 2.559 ± 0.092M/s
current fix WITH stacktrace:
usermode-count : 782.529 ± 5.866M/s
kernel-count : 341.116 ± 2.247M/s
syscall-count : 27.481 ± 0.267M/s
fentry : 2.397 ± 0.026M/s
fexit : 2.472 ± 0.008M/s
fmodret : 2.475 ± 0.014M/s
rawtp : 2.593 ± 0.031M/s
tp : 2.641 ± 0.020M/s
kprobe : 3.848 ± 0.014M/s
kprobe-multi : 4.188 ± 0.025M/s
kprobe-multi-all: 0.261 ± 0.026M/s
kretprobe : 3.782 ± 0.011M/s
kretprobe-multi: 4.157 ± 0.023M/s
kretprobe-multi-all: 0.177 ± 0.000M/s
with similar fix for fentry/fexit/raw_tp/tp WITH stacktrace:
usermode-count : 792.613 ± 1.322M/s
kernel-count : 337.725 ± 2.422M/s
syscall-count : 27.363 ± 0.030M/s
fentry : 14.911 ± 0.083M/s
fexit : 13.749 ± 0.060M/s
fmodret : 13.987 ± 0.049M/s
rawtp : 13.760 ± 0.042M/s
tp : 7.060 ± 0.026M/s
kprobe : 3.920 ± 0.012M/s
kprobe-multi : 4.186 ± 0.030M/s
kprobe-multi-all: 0.281 ± 0.006M/s
kretprobe : 3.782 ± 0.005M/s
kretprobe-multi: 4.030 ± 0.014M/s
kretprobe-multi-all: 0.178 ± 0.000M/s
so cutting the extra initial unwind gets some speedup as expected
I'm getting a wrong callstack for rawtp programs, so I need to find out why,
but the rest of the tracing programs (fentry/fexit..) work ok
jirka
end of thread, other threads:[~2026-01-22 8:35 UTC | newest]
Thread overview: 17+ messages
2026-01-12 21:49 [PATCH bpf-next 0/4] x86/fgraph,bpf: Fix ORC stack unwind from kprobe_multi Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 1/4] x86/fgraph: Fix return_to_handler regs.rsp value Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 2/4] x86/fgraph,bpf: Switch kprobe_multi program stack unwind to hw_regs path Jiri Olsa
2026-01-12 22:07 ` Steven Rostedt
2026-01-13 11:43 ` Jiri Olsa
2026-01-15 18:52 ` Andrii Nakryiko
2026-01-16 16:25 ` Jiri Olsa
2026-01-16 22:24 ` Andrii Nakryiko
2026-01-20 14:50 ` Steven Rostedt
2026-01-20 16:17 ` Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 3/4] selftests/bpf: Fix kprobe multi stacktrace_ips test Jiri Olsa
2026-01-12 21:49 ` [PATCH bpf-next 4/4] selftests/bpf: Allow to benchmark trigger with stacktrace Jiri Olsa
2026-01-15 18:48 ` Andrii Nakryiko
2026-01-15 18:50 ` Andrii Nakryiko
2026-01-16 16:30 ` Jiri Olsa
2026-01-16 16:26 ` Jiri Olsa
2026-01-22 8:35 ` Jiri Olsa