* [PATCH RESEND bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain
@ 2025-12-17 9:33 Tao Chen
2025-12-17 9:33 ` [PATCH bpf-next v7 1/2] perf: Refactor get_perf_callchain Tao Chen
2025-12-17 9:33 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
0 siblings, 2 replies; 13+ messages in thread
From: Tao Chen @ 2025-12-17 9:33 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen
Background
==========
Alexei noted that we should use preempt_disable to protect get_perf_callchain
in the BPF stackmap:
https://lore.kernel.org/bpf/CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com
A previous patch attempted to fix this issue, and Andrii suggested
teaching get_perf_callchain to let us pass that buffer directly,
avoiding the unnecessary copy:
https://lore.kernel.org/bpf/20250926153952.1661146-1-chen.dylane@linux.dev
Proposed Solution
=================
Add an external perf_callchain_entry parameter to get_perf_callchain so
that the BPF side can pass in its own buffer. The biggest advantage is
that it avoids unnecessary copies.
Todo
====
I am not sure whether this modification is appropriate, though; after all,
the implementation of get_callchain_entry in the perf subsystem is
considerably more complex than using an external buffer directly.
Comments and suggestions are always welcome.
Change list:
- v1 -> v2:
From Jiri
- rebase code, fix conflict
- v1: https://lore.kernel.org/bpf/20251013174721.2681091-1-chen.dylane@linux.dev
- v2 -> v3:
From Andrii
- entries per CPU used in a stack-like fashion
- v2: https://lore.kernel.org/bpf/20251014100128.2721104-1-chen.dylane@linux.dev
- v3 -> v4:
From Peter
- refactor get_perf_callchain and add three new APIs to use perf
callchain easily.
From Andrii
- reuse the perf callchain management.
- rename patch1 and patch2.
- v3: https://lore.kernel.org/bpf/20251019170118.2955346-1-chen.dylane@linux.dev
- v4 -> v5:
From Yonghong
- keep add_mark false in stackmap when refactoring get_perf_callchain in
patch1.
- add atomic operation in get_recursion_context in patch2.
- rename bpf_put_callchain_entry to bpf_put_perf_callchain in
patch3.
- rebase bpf-next master.
- v4: https://lore.kernel.org/bpf/20251028162502.3418817-1-chen.dylane@linux.dev
- v5 -> v6:
From Peter
- disable preemption from BPF side in patch2.
From AI
- use ctx->entry->nr instead of ctx->nr in patch1.
- v5: https://lore.kernel.org/bpf/20251109163559.4102849-1-chen.dylane@linux.dev
- v6 -> v7:
From Yonghong
- Add ack in patch2
From AI
- resolve conflict
- v6: https://lore.kernel.org/bpf/20251112163148.100949-1-chen.dylane@linux.dev
Tao Chen (2):
perf: Refactor get_perf_callchain
bpf: Hold the perf callchain entry until used completely
include/linux/perf_event.h | 10 ++++
kernel/bpf/stackmap.c | 68 +++++++++++++++++++++-----
kernel/events/callchain.c | 99 +++++++++++++++++++++++---------------
3 files changed, 126 insertions(+), 51 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 13+ messages in thread

* [PATCH bpf-next v7 1/2] perf: Refactor get_perf_callchain
2025-12-17 9:33 [PATCH RESEND bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
@ 2025-12-17 9:33 ` Tao Chen
2025-12-17 9:33 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
1 sibling, 0 replies; 13+ messages in thread
From: Tao Chen @ 2025-12-17 9:33 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

From the BPF stack map, we want to ensure that the callchain buffer
will not be overwritten by preempting tasks. Peter suggested providing
more flexible stack-sampling APIs that can be used from BPF while still
reusing the perf callchain entry. The next patch will modify the BPF
part. In the future, these APIs will also make it convenient to add
stack-sampling kfuncs to the eBPF subsystem, as Andrii and Alexei
discussed earlier.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 include/linux/perf_event.h | 10 ++++
 kernel/events/callchain.c  | 99 +++++++++++++++++++++++---------------
 2 files changed, 70 insertions(+), 39 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9870d768db4..e727ff7fa0c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
 	u32 nr;
 	short contexts;
 	bool contexts_maxed;
+	bool add_mark;
 };
 
 typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
@@ -1718,6 +1719,15 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+
+extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
+				      struct perf_callchain_entry *entry,
+				      u32 max_stack, bool add_mark);
+
+extern void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
+extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs,
+				      u64 defer_cookie);
+
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark,
 		   u64 defer_cookie);
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index b9c7e00725d..17030c22175 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,13 +216,67 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 #endif
 }
 
+void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
+			       struct perf_callchain_entry *entry,
+			       u32 max_stack, bool add_mark)
+
+{
+	ctx->entry = entry;
+	ctx->max_stack = max_stack;
+	ctx->nr = entry->nr = 0;
+	ctx->contexts = 0;
+	ctx->contexts_maxed = false;
+	ctx->add_mark = add_mark;
+}
+
+void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		return;
+
+	if (ctx->add_mark)
+		perf_callchain_store_context(ctx, PERF_CONTEXT_KERNEL);
+	perf_callchain_kernel(ctx, regs);
+}
+
+void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs,
+			       u64 defer_cookie)
+{
+	int start_entry_idx;
+
+	if (!user_mode(regs)) {
+		if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+			return;
+		regs = task_pt_regs(current);
+	}
+
+	if (defer_cookie) {
+		/*
+		 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+		 * which can be stitched to this one, and add
+		 * the cookie after it (it will be cut off when the
+		 * user stack is copied to the callchain).
+		 */
+		perf_callchain_store_context(ctx, PERF_CONTEXT_USER_DEFERRED);
+		perf_callchain_store_context(ctx, defer_cookie);
+		return;
+	}
+
+	if (ctx->add_mark)
+		perf_callchain_store_context(ctx, PERF_CONTEXT_USER);
+
+	start_entry_idx = ctx->entry->nr;
+	perf_callchain_user(ctx, regs);
+	fixup_uretprobe_trampoline_entries(ctx->entry, start_entry_idx);
+}
+
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark,
 		   u64 defer_cookie)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
-	int rctx, start_entry_idx;
+	int rctx;
 
 	/* crosstask is not supported for user stacks */
 	if (crosstask && user && !kernel)
@@ -232,46 +286,13 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry = entry;
-	ctx.max_stack = max_stack;
-	ctx.nr = entry->nr = 0;
-	ctx.contexts = 0;
-	ctx.contexts_maxed = false;
-
-	if (kernel && !user_mode(regs)) {
-		if (add_mark)
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
-		perf_callchain_kernel(&ctx, regs);
-	}
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, add_mark);
 
-	if (user && !crosstask) {
-		if (!user_mode(regs)) {
-			if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
-				goto exit_put;
-			regs = task_pt_regs(current);
-		}
-
-		if (defer_cookie) {
-			/*
-			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
-			 * which can be stitched to this one, and add
-			 * the cookie after it (it will be cut off when the
-			 * user stack is copied to the callchain).
-			 */
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
-			perf_callchain_store_context(&ctx, defer_cookie);
-			goto exit_put;
-		}
-
-		if (add_mark)
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
-
-		start_entry_idx = entry->nr;
-		perf_callchain_user(&ctx, regs);
-		fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
-	}
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
 
-exit_put:
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs, defer_cookie);
 
 	put_callchain_entry(rctx);
 
 	return entry;
--
2.48.1

^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
2025-12-17 9:33 [PATCH RESEND bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
2025-12-17 9:33 ` [PATCH bpf-next v7 1/2] perf: Refactor get_perf_callchain Tao Chen
@ 2025-12-17 9:33 ` Tao Chen
2025-12-23 6:29 ` Tao Chen
2026-01-23 0:38 ` Andrii Nakryiko
1 sibling, 2 replies; 13+ messages in thread
From: Tao Chen @ 2025-12-17 9:33 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

As Alexei noted, get_perf_callchain() return values may be reused
if a task is preempted after the BPF program enters migrate-disable
mode. perf_callchain_entries keeps a small stack of entries, which
we can reuse as follows:

1. get the perf callchain entry
2. BPF uses it ...
3. put the perf callchain entry

Peter suggested that get_recursion_context() be used with preemption
disabled, so we disable preemption on the BPF side.
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 kernel/bpf/stackmap.c | 68 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 56 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index da3d328f5c1..3bdd99a630d 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -210,13 +210,14 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 }
 
 static struct perf_callchain_entry *
-get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
+get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth)
 {
 #ifdef CONFIG_STACKTRACE
 	struct perf_callchain_entry *entry;
-	int rctx;
 
-	entry = get_callchain_entry(&rctx);
+	preempt_disable();
+	entry = get_callchain_entry(rctx);
+	preempt_enable();
 
 	if (!entry)
 		return NULL;
@@ -238,8 +239,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
 		to[i] = (u64)(from[i]);
 	}
 
-	put_callchain_entry(rctx);
-
 	return entry;
 #else /* CONFIG_STACKTRACE */
 	return NULL;
@@ -320,6 +319,34 @@ static long __bpf_get_stackid(struct bpf_map *map,
 	return id;
 }
 
+static struct perf_callchain_entry *
+bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
+		       int max_stack, bool crosstask)
+{
+	struct perf_callchain_entry_ctx ctx;
+	struct perf_callchain_entry *entry;
+
+	preempt_disable();
+	entry = get_callchain_entry(rctx);
+	preempt_enable();
+
+	if (unlikely(!entry))
+		return NULL;
+
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, false);
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs, 0);
+
+	return entry;
+}
+
+static void bpf_put_perf_callchain(int rctx)
+{
+	put_callchain_entry(rctx);
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
@@ -328,20 +355,25 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	struct perf_callchain_entry *trace;
 	bool kernel = !user;
 	u32 max_depth;
+	int rctx, ret;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
 		return -EINVAL;
 
 	max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false, 0);
+
+	trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth,
+				       false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
 
-	return __bpf_get_stackid(map, trace, flags);
+	ret = __bpf_get_stackid(map, trace, flags);
+	bpf_put_perf_callchain(rctx);
+
+	return ret;
 }
 
 const struct bpf_func_proto bpf_get_stackid_proto = {
@@ -435,6 +467,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	bool kernel = !user;
 	int err = -EINVAL;
 	u64 *ips;
+	int rctx;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_USER_BUILD_ID)))
@@ -467,18 +500,26 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = trace_in;
 		trace->nr = min_t(u32, trace->nr, max_depth);
 	} else if (kernel && task) {
-		trace = get_callchain_entry_for_task(task, max_depth);
+		trace = get_callchain_entry_for_task(&rctx, task, max_depth);
 	} else {
-		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false, 0);
+		trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth,
+					       crosstask);
 	}
 
-	if (unlikely(!trace) || trace->nr < skip) {
+	if (unlikely(!trace)) {
 		if (may_fault)
 			rcu_read_unlock();
 		goto err_fault;
 	}
 
+	if (trace->nr < skip) {
+		if (may_fault)
+			rcu_read_unlock();
+		if (!trace_in)
+			bpf_put_perf_callchain(rctx);
+		goto err_fault;
+	}
+
 	trace_nr = trace->nr - skip;
 	copy_len = trace_nr * elem_size;
 
@@ -497,6 +538,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	if (may_fault)
 		rcu_read_unlock();
 
+	if (!trace_in)
+		bpf_put_perf_callchain(rctx);
+
 	if (user_build_id)
 		stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);

--
2.48.1

^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
2025-12-17 9:33 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
@ 2025-12-23 6:29 ` Tao Chen
2026-01-06 16:00 ` Tao Chen
2026-01-23 0:38 ` Andrii Nakryiko
1 sibling, 1 reply; 13+ messages in thread
From: Tao Chen @ 2025-12-23 6:29 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf

On 2025/12/17 17:33, Tao Chen wrote:
> As Alexei noted, get_perf_callchain() return values may be reused
> if a task is preempted after the BPF program enters migrate-disable
> mode. perf_callchain_entries keeps a small stack of entries, which
> we can reuse as follows:
>
> 1. get the perf callchain entry
> 2. BPF uses it ...
> 3. put the perf callchain entry
>
> [...]

Hi Peter,

As Alexei said, the patch needs your ack, please review again, thanks.

--
Best Regards
Tao Chen

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
2025-12-23 6:29 ` Tao Chen
@ 2026-01-06 16:00 ` Tao Chen
2026-01-09 23:47 ` Andrii Nakryiko
0 siblings, 1 reply; 13+ messages in thread
From: Tao Chen @ 2026-01-06 16:00 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf

On 2025/12/23 14:29, Tao Chen wrote:
> On 2025/12/17 17:33, Tao Chen wrote:
>> [...]
>
> Hi Peter,
>
> As Alexei said, the patch needs your ack, please review again, thanks.
>

ping...

--
Best Regards
Tao Chen

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
2026-01-06 16:00 ` Tao Chen
@ 2026-01-09 23:47 ` Andrii Nakryiko
2026-01-16 4:35 ` Tao Chen
0 siblings, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 23:47 UTC (permalink / raw)
To: Tao Chen, peterz
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii,
martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
haoluo, linux-perf-users, linux-kernel, bpf

On Tue, Jan 6, 2026 at 8:00 AM Tao Chen <chen.dylane@linux.dev> wrote:
>
> On 2025/12/23 14:29, Tao Chen wrote:
> > On 2025/12/17 17:33, Tao Chen wrote:
> >> [...]
> >
> > Hi Peter,
> >
> > As Alexei said, the patch needs your ack, please review again, thanks.
> >
>
> ping...

Peter, if I understand correctly, this will go through the bpf-next
tree, but it would be great if you could take a look and confirm this
overall is not broken. Thanks!

>
> --
> Best Regards
> Tao Chen

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely 2026-01-09 23:47 ` Andrii Nakryiko @ 2026-01-16 4:35 ` Tao Chen 0 siblings, 0 replies; 13+ messages in thread From: Tao Chen @ 2026-01-16 4:35 UTC (permalink / raw) To: Andrii Nakryiko, peterz Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/1/10 07:47, Andrii Nakryiko 写道: > On Tue, Jan 6, 2026 at 8:00 AM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2025/12/23 14:29, Tao Chen 写道: >>> 在 2025/12/17 17:33, Tao Chen 写道: >>>> As Alexei noted, get_perf_callchain() return values may be reused >>>> if a task is preempted after the BPF program enters migrate disable >>>> mode. The perf_callchain_entres has a small stack of entries, and >>>> we can reuse it as follows: >>>> >>>> 1. get the perf callchain entry >>>> 2. BPF use... >>>> 3. put the perf callchain entry >>>> >>>> And Peter suggested that get_recursion_context used with preemption >>>> disabled, so we should disable preemption at BPF side. 
>>>> >>>> Acked-by: Yonghong Song <yonghong.song@linux.dev> >>>> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >>>> --- >>>> kernel/bpf/stackmap.c | 68 +++++++++++++++++++++++++++++++++++-------- >>>> 1 file changed, 56 insertions(+), 12 deletions(-) >>>> >>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >>>> index da3d328f5c1..3bdd99a630d 100644 >>>> --- a/kernel/bpf/stackmap.c >>>> +++ b/kernel/bpf/stackmap.c >>>> @@ -210,13 +210,14 @@ static void stack_map_get_build_id_offset(struct >>>> bpf_stack_build_id *id_offs, >>>> } >>>> static struct perf_callchain_entry * >>>> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >>>> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 >>>> max_depth) >>>> { >>>> #ifdef CONFIG_STACKTRACE >>>> struct perf_callchain_entry *entry; >>>> - int rctx; >>>> - entry = get_callchain_entry(&rctx); >>>> + preempt_disable(); >>>> + entry = get_callchain_entry(rctx); >>>> + preempt_enable(); >>>> if (!entry) >>>> return NULL; >>>> @@ -238,8 +239,6 @@ get_callchain_entry_for_task(struct task_struct >>>> *task, u32 max_depth) >>>> to[i] = (u64)(from[i]); >>>> } >>>> - put_callchain_entry(rctx); >>>> - >>>> return entry; >>>> #else /* CONFIG_STACKTRACE */ >>>> return NULL; >>>> @@ -320,6 +319,34 @@ static long __bpf_get_stackid(struct bpf_map *map, >>>> return id; >>>> } >>>> +static struct perf_callchain_entry * >>>> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, >>>> bool user, >>>> + int max_stack, bool crosstask) >>>> +{ >>>> + struct perf_callchain_entry_ctx ctx; >>>> + struct perf_callchain_entry *entry; >>>> + >>>> + preempt_disable(); >>>> + entry = get_callchain_entry(rctx); >>>> + preempt_enable(); >>>> + >>>> + if (unlikely(!entry)) >>>> + return NULL; >>>> + >>>> + __init_perf_callchain_ctx(&ctx, entry, max_stack, false); >>>> + if (kernel) >>>> + __get_perf_callchain_kernel(&ctx, regs); >>>> + if (user && !crosstask) >>>> + 
__get_perf_callchain_user(&ctx, regs, 0); >>>> + >>>> + return entry; >>>> +} >>>> + >>>> +static void bpf_put_perf_callchain(int rctx) >>>> +{ >>>> + put_callchain_entry(rctx); >>>> +} >>>> + >>>> BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map >>>> *, map, >>>> u64, flags) >>>> { >>>> @@ -328,20 +355,25 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, >>>> regs, struct bpf_map *, map, >>>> struct perf_callchain_entry *trace; >>>> bool kernel = !user; >>>> u32 max_depth; >>>> + int rctx, ret; >>>> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | >>>> BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) >>>> return -EINVAL; >>>> max_depth = stack_map_calculate_max_depth(map->value_size, >>>> elem_size, flags); >>>> - trace = get_perf_callchain(regs, kernel, user, max_depth, >>>> - false, false, 0); >>>> + >>>> + trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, >>>> + false); >>>> if (unlikely(!trace)) >>>> /* couldn't fetch the stack trace */ >>>> return -EFAULT; >>>> - return __bpf_get_stackid(map, trace, flags); >>>> + ret = __bpf_get_stackid(map, trace, flags); >>>> + bpf_put_perf_callchain(rctx); >>>> + >>>> + return ret; >>>> } >>>> const struct bpf_func_proto bpf_get_stackid_proto = { >>>> @@ -435,6 +467,7 @@ static long __bpf_get_stack(struct pt_regs *regs, >>>> struct task_struct *task, >>>> bool kernel = !user; >>>> int err = -EINVAL; >>>> u64 *ips; >>>> + int rctx; >>>> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | >>>> BPF_F_USER_BUILD_ID))) >>>> @@ -467,18 +500,26 @@ static long __bpf_get_stack(struct pt_regs >>>> *regs, struct task_struct *task, >>>> trace = trace_in; >>>> trace->nr = min_t(u32, trace->nr, max_depth); >>>> } else if (kernel && task) { >>>> - trace = get_callchain_entry_for_task(task, max_depth); >>>> + trace = get_callchain_entry_for_task(&rctx, task, max_depth); >>>> } else { >>>> - trace = get_perf_callchain(regs, kernel, user, max_depth, >>>> - crosstask, false, 
0); >>>> + trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, >>>> max_depth, >>>> + crosstask); >>>> } >>>> - if (unlikely(!trace) || trace->nr < skip) { >>>> + if (unlikely(!trace)) { >>>> if (may_fault) >>>> rcu_read_unlock(); >>>> goto err_fault; >>>> } >>>> + if (trace->nr < skip) { >>>> + if (may_fault) >>>> + rcu_read_unlock(); >>>> + if (!trace_in) >>>> + bpf_put_perf_callchain(rctx); >>>> + goto err_fault; >>>> + } >>>> + >>>> trace_nr = trace->nr - skip; >>>> copy_len = trace_nr * elem_size; >>>> @@ -497,6 +538,9 @@ static long __bpf_get_stack(struct pt_regs *regs, >>>> struct task_struct *task, >>>> if (may_fault) >>>> rcu_read_unlock(); >>>> + if (!trace_in) >>>> + bpf_put_perf_callchain(rctx); >>>> + >>>> if (user_build_id) >>>> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); >>> >>> Hi Peter, >>> >>> As Alexei said, the patch needs your ack, please review again, thanks. >>> >> >> ping... > > Peter, if I understand correctly, this will go through bpf-next tree, > but it would be great if you could take a look and confirm this > overall is not broken. Thanks! > >> >> -- >> Best Regards >> Tao Chen Hi Andrii, Peter It appears that the code does not require a rebase, and the latest CI build is valid. Looking forward to your response. Thanks. CI has tested the following submission: Status: SUCCESS Name: [RESEND,bpf-next,v7,0/2] Pass external callchain entry to get_perf_callchain Patchwork: https://patchwork.kernel.org/project/netdevbpf/list/?series=1034091&state=* Matrix: https://github.com/kernel-patches/bpf/actions/runs/21051611369 -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
2025-12-17 9:33 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
2025-12-23 6:29 ` Tao Chen
@ 2026-01-23 0:38 ` Andrii Nakryiko
2026-01-23 5:42 ` Tao Chen
1 sibling, 1 reply; 13+ messages in thread
From: Andrii Nakryiko @ 2026-01-23 0:38 UTC (permalink / raw)
To: Tao Chen
Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf

On Wed, Dec 17, 2025 at 1:34 AM Tao Chen <chen.dylane@linux.dev> wrote:
>
> As Alexei noted, get_perf_callchain() return values may be reused
> if a task is preempted after the BPF program enters migrate disable
> mode. The perf_callchain_entries has a small stack of entries, and
> we can reuse it as follows:
>
> 1. get the perf callchain entry
> 2. BPF use...
> 3. put the perf callchain entry
>
> And Peter suggested that get_recursion_context should be used with
> preemption disabled, so we should disable preemption at the BPF side.
>
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
> ---
> kernel/bpf/stackmap.c | 68 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 56 insertions(+), 12 deletions(-)
>

I took a closer look at these changes and I'm not a fan of the
particular implementation, tbh. It's a bit of a maze how all these
different call chain cases are handled, so I might be missing
something, but I'd address this a bit differently.

First, instead of manipulating this obscure rctx as part of the
interface, I'd record rctx inside the perf_callchain_entry itself, and
make sure that get_callchain_entry() doesn't have any output arguments.
put_callchain_entry() would then accept a perf_callchain_entry reference
and just fetch rctx from inside it.
Then instead of open-coding get_perf_callchain by exposing
__init_perf_callchain_ctx, __get_perf_callchain_kernel, and
__get_perf_callchain_user, can't we have __get_perf_callchain(), which
will accept perf_callchain_entry as an input and won't do get/put
internally? And then the existing get_perf_callchain() will just do get +
__get_perf_callchain + put, while BPF-side code will do its own get
(with preemption temporarily disabled), will fetch the callstack in one of
a few possible ways, and then put it (unless callchain_entry is coming
from outside, that trace_in thing).

It's close to what you are doing, but I don't think anyone likes those
exposed __init_perf_callchain_ctx + __get_perf_callchain_kernel +
__get_perf_callchain_user. Can't we avoid that? (and also not sure we
need add_mark inside the ctx itself, do we?)

pw-bot: cr

> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index da3d328f5c1..3bdd99a630d 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -210,13 +210,14 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> }
>
> static struct perf_callchain_entry *
> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth)
> {
> #ifdef CONFIG_STACKTRACE
> 	struct perf_callchain_entry *entry;
> -	int rctx;
>
> -	entry = get_callchain_entry(&rctx);
> +	preempt_disable();
> +	entry = get_callchain_entry(rctx);
> +	preempt_enable();
>
> 	if (!entry)
> 		return NULL;
> @@ -238,8 +239,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
> 		to[i] = (u64)(from[i]);
> 	}
>
> -	put_callchain_entry(rctx);
> -
> 	return entry;
> #else /* CONFIG_STACKTRACE */
> 	return NULL;
> @@ -320,6 +319,34 @@ static long __bpf_get_stackid(struct bpf_map *map,
> 	return id;
> }
>
> +static struct perf_callchain_entry *
> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
> +		       int max_stack,
bool crosstask) > +{ > + struct perf_callchain_entry_ctx ctx; > + struct perf_callchain_entry *entry; > + > + preempt_disable(); > + entry = get_callchain_entry(rctx); > + preempt_enable(); > + > + if (unlikely(!entry)) > + return NULL; > + > + __init_perf_callchain_ctx(&ctx, entry, max_stack, false); > + if (kernel) > + __get_perf_callchain_kernel(&ctx, regs); > + if (user && !crosstask) > + __get_perf_callchain_user(&ctx, regs, 0); > + > + return entry; > +} > + > +static void bpf_put_perf_callchain(int rctx) > +{ > + put_callchain_entry(rctx); > +} > + > BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, > u64, flags) > { > @@ -328,20 +355,25 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, > struct perf_callchain_entry *trace; > bool kernel = !user; > u32 max_depth; > + int rctx, ret; > > if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | > BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) > return -EINVAL; > > max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); > - trace = get_perf_callchain(regs, kernel, user, max_depth, > - false, false, 0); > + > + trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, > + false); > > if (unlikely(!trace)) > /* couldn't fetch the stack trace */ > return -EFAULT; > > - return __bpf_get_stackid(map, trace, flags); > + ret = __bpf_get_stackid(map, trace, flags); > + bpf_put_perf_callchain(rctx); > + > + return ret; > } > > const struct bpf_func_proto bpf_get_stackid_proto = { > @@ -435,6 +467,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > bool kernel = !user; > int err = -EINVAL; > u64 *ips; > + int rctx; > > if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | > BPF_F_USER_BUILD_ID))) > @@ -467,18 +500,26 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > trace = trace_in; > trace->nr = min_t(u32, trace->nr, max_depth); > } else if (kernel && 
task) { > - trace = get_callchain_entry_for_task(task, max_depth); > + trace = get_callchain_entry_for_task(&rctx, task, max_depth); > } else { > - trace = get_perf_callchain(regs, kernel, user, max_depth, > - crosstask, false, 0); > + trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, > + crosstask); > } > > - if (unlikely(!trace) || trace->nr < skip) { > + if (unlikely(!trace)) { > if (may_fault) > rcu_read_unlock(); > goto err_fault; > } > > + if (trace->nr < skip) { > + if (may_fault) > + rcu_read_unlock(); > + if (!trace_in) > + bpf_put_perf_callchain(rctx); > + goto err_fault; > + } > + > trace_nr = trace->nr - skip; > copy_len = trace_nr * elem_size; > > @@ -497,6 +538,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > if (may_fault) > rcu_read_unlock(); > > + if (!trace_in) > + bpf_put_perf_callchain(rctx); > + > if (user_build_id) > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely 2026-01-23 0:38 ` Andrii Nakryiko @ 2026-01-23 5:42 ` Tao Chen 2026-01-23 18:40 ` Andrii Nakryiko 0 siblings, 1 reply; 13+ messages in thread From: Tao Chen @ 2026-01-23 5:42 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/1/23 08:38, Andrii Nakryiko 写道: > On Wed, Dec 17, 2025 at 1:34 AM Tao Chen <chen.dylane@linux.dev> wrote: >> >> As Alexei noted, get_perf_callchain() return values may be reused >> if a task is preempted after the BPF program enters migrate disable >> mode. The perf_callchain_entres has a small stack of entries, and >> we can reuse it as follows: >> >> 1. get the perf callchain entry >> 2. BPF use... >> 3. put the perf callchain entry >> >> And Peter suggested that get_recursion_context used with preemption >> disabled, so we should disable preemption at BPF side. >> >> Acked-by: Yonghong Song <yonghong.song@linux.dev> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> kernel/bpf/stackmap.c | 68 +++++++++++++++++++++++++++++++++++-------- >> 1 file changed, 56 insertions(+), 12 deletions(-) >> > > I took a bit closer look at these changes and I'm a fan of the > particular implementation, tbh. It's a bit of a maze how all these > different call chain cases are handled, so I might be missing > something, but I'd address this a bit differently. > > First, instead of manipulating this obscure rctx as part of interface, > I'd record rctx inside the perf_callchain_entry itself, and make sure > that get_callchain_entry does have any output arguments. > put_callchain_entry() would then accept perf_callchain_entry reference > and just fetch rctx from inside it. 
>

Hi Andrii,

Try to implement this briefly with code, is my understanding correct?

struct perf_callchain_entry *get_callchain_entry(void)
{
	int cpu;
	int rctx;
	struct perf_callchain_entry *entry;
	struct callchain_cpus_entries *entries;

	rctx = get_recursion_context(this_cpu_ptr(callchain_recursion));
	if (rctx == -1)
		return NULL;

	entries = rcu_dereference(callchain_cpus_entries);
	if (!entries) {
		put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
		return NULL;
	}

	cpu = smp_processor_id();

	entry = ((void *)entries->cpu_entries[cpu]) +
		(rctx * perf_callchain_entry__sizeof());
	entry->rctx = rctx;

	return entry;
}

void
put_callchain_entry(struct perf_callchain_entry *entry)
{
	put_recursion_context(this_cpu_ptr(callchain_recursion), entry->rctx);
}

And then no need for rctx in bpf_get_perf_callchain:

bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
		       int max_stack, bool crosstask)

Functionally, this seems fine. The only concern is whether the perf
maintainer will approve it; after all, this change involves modifying
their core interfaces.

Peter, Yonghong, can we do this?

> Then instead of open-coding get_perf_callchain by exposing
> __init_perf_callchain_ctx, __get_perf_callchain_kernel, and
> __get_perf_callchain_user, can't we have __get_perf_callchain() which
> will accept perf_callchain_entry as an input and won't do get/put
> internally. And then existing get_perf_callchain() will just do get +
> __get_perf_callchain + put, while BPF-side code will do its own get
> (with preemption temporarily disabled), will fetch callstack in one of
> a few possible ways, and then put it (unless callchain_entry is coming
> from outside, that trace_in thing).
>
And following Peter's suggestion, we can mark _init_perf_callchain_ctx, __get_perf_callchain_kernel, and __get_perf_callchain_user as static, then have __get_perf_callchain encapsulate the logic of _init_perf_callchain_ctx, __get_perf_callchain_kernel, and __get_perf_callchain_user. What do you think? > It's close to what you are doing, but I don't think anyone likes those > exposed __init_perf_callchain_ctx + __get_perf_callchain_kernel + > __get_perf_callchain_user. Can't we avoid that? (and also not sure we > need add_mark inside the ctx itself, do we?) > > pw-bot: cr > > > >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >> index da3d328f5c1..3bdd99a630d 100644 >> --- a/kernel/bpf/stackmap.c >> +++ b/kernel/bpf/stackmap.c >> @@ -210,13 +210,14 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, >> } >> >> static struct perf_callchain_entry * >> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth) >> { >> #ifdef CONFIG_STACKTRACE >> struct perf_callchain_entry *entry; >> - int rctx; >> >> - entry = get_callchain_entry(&rctx); >> + preempt_disable(); >> + entry = get_callchain_entry(rctx); >> + preempt_enable(); >> >> if (!entry) >> return NULL; >> @@ -238,8 +239,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >> to[i] = (u64)(from[i]); >> } >> >> - put_callchain_entry(rctx); >> - >> return entry; >> #else /* CONFIG_STACKTRACE */ >> return NULL; >> @@ -320,6 +319,34 @@ static long __bpf_get_stackid(struct bpf_map *map, >> return id; >> } >> >> +static struct perf_callchain_entry * >> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user, >> + int max_stack, bool crosstask) >> +{ >> + struct perf_callchain_entry_ctx ctx; >> + struct perf_callchain_entry *entry; >> + >> + preempt_disable(); >> + entry = get_callchain_entry(rctx); >> + preempt_enable(); >> + >> + if 
(unlikely(!entry)) >> + return NULL; >> + >> + __init_perf_callchain_ctx(&ctx, entry, max_stack, false); >> + if (kernel) >> + __get_perf_callchain_kernel(&ctx, regs); >> + if (user && !crosstask) >> + __get_perf_callchain_user(&ctx, regs, 0); >> + >> + return entry; >> +} >> + >> +static void bpf_put_perf_callchain(int rctx) >> +{ >> + put_callchain_entry(rctx); >> +} >> + >> BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, >> u64, flags) >> { >> @@ -328,20 +355,25 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, >> struct perf_callchain_entry *trace; >> bool kernel = !user; >> u32 max_depth; >> + int rctx, ret; >> >> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | >> BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) >> return -EINVAL; >> >> max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); >> - trace = get_perf_callchain(regs, kernel, user, max_depth, >> - false, false, 0); >> + >> + trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, >> + false); >> >> if (unlikely(!trace)) >> /* couldn't fetch the stack trace */ >> return -EFAULT; >> >> - return __bpf_get_stackid(map, trace, flags); >> + ret = __bpf_get_stackid(map, trace, flags); >> + bpf_put_perf_callchain(rctx); >> + >> + return ret; >> } >> >> const struct bpf_func_proto bpf_get_stackid_proto = { >> @@ -435,6 +467,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> bool kernel = !user; >> int err = -EINVAL; >> u64 *ips; >> + int rctx; >> >> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | >> BPF_F_USER_BUILD_ID))) >> @@ -467,18 +500,26 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> trace = trace_in; >> trace->nr = min_t(u32, trace->nr, max_depth); >> } else if (kernel && task) { >> - trace = get_callchain_entry_for_task(task, max_depth); >> + trace = get_callchain_entry_for_task(&rctx, task, max_depth); >> } else 
{ >> - trace = get_perf_callchain(regs, kernel, user, max_depth, >> - crosstask, false, 0); >> + trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, >> + crosstask); >> } >> >> - if (unlikely(!trace) || trace->nr < skip) { >> + if (unlikely(!trace)) { >> if (may_fault) >> rcu_read_unlock(); >> goto err_fault; >> } >> >> + if (trace->nr < skip) { >> + if (may_fault) >> + rcu_read_unlock(); >> + if (!trace_in) >> + bpf_put_perf_callchain(rctx); >> + goto err_fault; >> + } >> + >> trace_nr = trace->nr - skip; >> copy_len = trace_nr * elem_size; >> >> @@ -497,6 +538,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> if (may_fault) >> rcu_read_unlock(); >> >> + if (!trace_in) >> + bpf_put_perf_callchain(rctx); >> + >> if (user_build_id) >> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); >> >> -- >> 2.48.1 >> -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely 2026-01-23 5:42 ` Tao Chen @ 2026-01-23 18:40 ` Andrii Nakryiko 0 siblings, 0 replies; 13+ messages in thread From: Andrii Nakryiko @ 2026-01-23 18:40 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Thu, Jan 22, 2026 at 9:42 PM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/1/23 08:38, Andrii Nakryiko 写道: > > On Wed, Dec 17, 2025 at 1:34 AM Tao Chen <chen.dylane@linux.dev> wrote: > >> > >> As Alexei noted, get_perf_callchain() return values may be reused > >> if a task is preempted after the BPF program enters migrate disable > >> mode. The perf_callchain_entres has a small stack of entries, and > >> we can reuse it as follows: > >> > >> 1. get the perf callchain entry > >> 2. BPF use... > >> 3. put the perf callchain entry > >> > >> And Peter suggested that get_recursion_context used with preemption > >> disabled, so we should disable preemption at BPF side. > >> > >> Acked-by: Yonghong Song <yonghong.song@linux.dev> > >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> > >> --- > >> kernel/bpf/stackmap.c | 68 +++++++++++++++++++++++++++++++++++-------- > >> 1 file changed, 56 insertions(+), 12 deletions(-) > >> > > > > I took a bit closer look at these changes and I'm a fan of the > > particular implementation, tbh. It's a bit of a maze how all these > > different call chain cases are handled, so I might be missing > > something, but I'd address this a bit differently. > > > > First, instead of manipulating this obscure rctx as part of interface, > > I'd record rctx inside the perf_callchain_entry itself, and make sure > > that get_callchain_entry does have any output arguments. 
> > put_callchain_entry() would then accept perf_callchain_entry reference > > and just fetch rctx from inside it. > > > > Hi Andrri, > > Try to implement this briefly with code, is my understanding correct? > > struct perf_callchain_entry *get_callchain_entry(void) > { > int cpu; > int rctx; > struct perf_callchain_entry *entry; > struct callchain_cpus_entries *entries; > > rctx = get_recursion_context(this_cpu_ptr(callchain_recursion)); > if (rctx == -1) > return NULL; > > entries = rcu_dereference(callchain_cpus_entries); > if (!entries) { > > put_recursion_context(this_cpu_ptr(callchain_recursion), *rctx); > return NULL; > } > > cpu = smp_processor_id(); > > entry = ((void *)entries->cpu_entries[cpu]) + > (rctx * perf_callchain_entry__sizeof()); > entry->rctx = rctx; > > return entry; > } > > void > put_callchain_entry(struct perf_callchain_entry *entry) > { > put_recursion_context(this_cpu_ptr(callchain_recursion), > entry->rctx); > } > > And then no need rtcx in bpf_get_perf_callchain. yes, exactly > > bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, > int max_stack, bool crosstask) > > Functionally, this seems fine. The only concern is whether the perf > maintainer will approve it after all, this change involves modifying > their core interfaces. > > Peter, Yonghong, can we do this? > > > Then instead of open-coding get_perf_callchain by exposing > > __init_perf_callchain_ctx, __get_perf_callchain_kernel, and > > __get_perf_callchain_user, can't we have __get_perf_callchain() which > > will accept perf_callchain_entry as an input and won't do get/put > > internally. And then existing get_perf_callchain() will just do get + > > __get_perf_callchain + put, while BPF-side code will do it's own get > > (with preemption temporarily disabled), will fetch callstack in one of > > a few possible ways, and then put it (unless callchain_entry is coming > > from outside, that trace_in thing). 
> >
> Exposing only __get_perf_callchain will make it much easier for BPF
> callers to understand and use. And following Peter's suggestion, we can
> mark __init_perf_callchain_ctx, __get_perf_callchain_kernel, and
> __get_perf_callchain_user as static, then have __get_perf_callchain
> encapsulate the logic of __init_perf_callchain_ctx,
> __get_perf_callchain_kernel, and __get_perf_callchain_user. What do you
> think?

That's exactly what I proposed, except I don't think we even need to
have __get_perf_callchain_user and __get_perf_callchain_kernel; keep
__get_perf_callchain() almost identical to the current
get_perf_callchain(), except for getting/putting the callchain entry.

> > It's close to what you are doing, but I don't think anyone likes those
> > exposed __init_perf_callchain_ctx + __get_perf_callchain_kernel +
> > __get_perf_callchain_user. Can't we avoid that? (and also not sure we
> > need add_mark inside the ctx itself, do we?)
> >
> > pw-bot: cr
> >
> >

[please trim irrelevant parts]

^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain
@ 2025-12-17 5:12 Tao Chen
2025-12-17 5:12 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
0 siblings, 1 reply; 13+ messages in thread
From: Tao Chen @ 2025-12-17 5:12 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen
Background
==========
Alexei noted we should use preempt_disable to protect get_perf_callchain
in bpf stackmap.
https://lore.kernel.org/bpf/CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com
A previous patch was submitted to attempt to fix this issue, and Andrii
suggested teaching get_perf_callchain to let us pass that buffer directly
to avoid the unnecessary copy.
https://lore.kernel.org/bpf/20250926153952.1661146-1-chen.dylane@linux.dev
Proposed Solution
=================
Add an external perf_callchain_entry parameter to get_perf_callchain so
that the BPF side can supply its own buffer. The biggest advantage is
that this avoids unnecessary copies.
Todo
====
But I'm not sure if this modification is appropriate. After all, the
implementation of get_callchain_entry in the perf subsystem seems much more
complex than directly using an external buffer.
Comments and suggestions are always welcome.
Change list:
- v1 -> v2
From Jiri
- rebase code, fix conflict
- v1: https://lore.kernel.org/bpf/20251013174721.2681091-1-chen.dylane@linux.dev
- v2 -> v3:
From Andrii
- entries per CPU used in a stack-like fashion
- v2: https://lore.kernel.org/bpf/20251014100128.2721104-1-chen.dylane@linux.dev
- v3 -> v4:
From Peter
- refactor get_perf_callchain and add three new APIs to use perf
callchain easily.
From Andrii
- reuse the perf callchain management.
- rename patch1 and patch2.
- v3: https://lore.kernel.org/bpf/20251019170118.2955346-1-chen.dylane@linux.dev
- v4 -> v5:
From Yonghong
- keep add_mark false in stackmap when refactoring get_perf_callchain in
patch1.
- add atomic operation in get_recursion_context in patch2.
- rename bpf_put_callchain_entry to bpf_put_perf_callchain in
patch3.
- rebase bpf-next master.
- v4: https://lore.kernel.org/bpf/20251028162502.3418817-1-chen.dylane@linux.dev
- v5 -> v6:
From Peter
- disable preemption from BPF side in patch2.
From AI
- use ctx->entry->nr instead of ctx->nr in patch1.
- v5: https://lore.kernel.org/bpf/20251109163559.4102849-1-chen.dylane@linux.dev
- v6 -> v7:
From Yonghong
- Add ack in patch2
- v6: https://lore.kernel.org/bpf/20251112163148.100949-1-chen.dylane@linux.dev
Tao Chen (2):
perf: Refactor get_perf_callchain
bpf: Hold the perf callchain entry until used completely
include/linux/perf_event.h | 9 +++++
kernel/bpf/stackmap.c | 67 +++++++++++++++++++++++++++-------
kernel/events/callchain.c | 73 ++++++++++++++++++++++++--------------
3 files changed, 111 insertions(+), 38 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 13+ messages in thread

* [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
2025-12-17 5:12 [PATCH bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
@ 2025-12-17 5:12 ` Tao Chen
2025-12-17 5:22 ` Tao Chen
0 siblings, 1 reply; 13+ messages in thread
From: Tao Chen @ 2025-12-17 5:12 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

As Alexei noted, get_perf_callchain() return values may be reused
if a task is preempted after the BPF program enters migrate disable
mode. The perf_callchain_entries has a small stack of entries, and
we can reuse it as follows:

1. get the perf callchain entry
2. BPF use...
3. put the perf callchain entry

And Peter suggested that get_recursion_context should be used with
preemption disabled, so we should disable preemption at the BPF side.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 kernel/bpf/stackmap.c | 67 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 55 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 2365541c81d..64ace4ed50e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -210,13 +210,14 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 }
 
 static struct perf_callchain_entry *
-get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
+get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth)
 {
 #ifdef CONFIG_STACKTRACE
 	struct perf_callchain_entry *entry;
-	int rctx;
 
-	entry = get_callchain_entry(&rctx);
+	preempt_disable();
+	entry = get_callchain_entry(rctx);
+	preempt_enable();
 
 	if (!entry)
 		return NULL;
@@ -238,8 +239,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
 		to[i] = (u64)(from[i]);
 	}
 
-	put_callchain_entry(rctx);
-
 	return entry;
 #else /* CONFIG_STACKTRACE */
 	return NULL;
@@ -320,6 +319,34 @@ static long __bpf_get_stackid(struct bpf_map *map,
 	return id;
 }
 
+static struct perf_callchain_entry *
+bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
+		       int max_stack, bool crosstask)
+{
+	struct perf_callchain_entry_ctx ctx;
+	struct perf_callchain_entry *entry;
+
+	preempt_disable();
+	entry = get_callchain_entry(rctx);
+	preempt_enable();
+
+	if (unlikely(!entry))
+		return NULL;
+
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, false);
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs);
+
+	return entry;
+}
+
+static void bpf_put_perf_callchain(int rctx)
+{
+	put_callchain_entry(rctx);
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
@@ -328,20 +355,24 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	struct perf_callchain_entry *trace;
 	bool kernel = !user;
 	u32 max_depth;
+	int rctx, ret;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
 		return -EINVAL;
 
 	max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+	trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth,
+				       false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
 
-	return __bpf_get_stackid(map, trace, flags);
+	ret = __bpf_get_stackid(map, trace, flags);
+	bpf_put_perf_callchain(rctx);
+
+	return ret;
 }
 
 const struct bpf_func_proto bpf_get_stackid_proto = {
@@ -435,6 +466,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	bool kernel = !user;
 	int err = -EINVAL;
 	u64 *ips;
+	int rctx;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_USER_BUILD_ID)))
@@ -467,18 +499,26 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = trace_in;
 		trace->nr = min_t(u32, trace->nr, max_depth);
 	} else if (kernel && task) {
-		trace = get_callchain_entry_for_task(task, max_depth);
+		trace = get_callchain_entry_for_task(&rctx, task, max_depth);
 	} else {
-		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+		trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth,
+					       crosstask);
 	}
 
-	if (unlikely(!trace) || trace->nr < skip) {
+	if (unlikely(!trace)) {
 		if (may_fault)
 			rcu_read_unlock();
 		goto err_fault;
 	}
 
+	if (trace->nr < skip) {
+		if (may_fault)
+			rcu_read_unlock();
+		if (!trace_in)
+			bpf_put_perf_callchain(rctx);
+		goto err_fault;
+	}
+
 	trace_nr = trace->nr - skip;
 	copy_len = trace_nr * elem_size;
 
@@ -497,6 +537,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	if (may_fault)
 		rcu_read_unlock();
 
+	if (!trace_in)
+		bpf_put_perf_callchain(rctx);
+
 	if (user_build_id)
 		stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);

--
2.48.1

^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
  2025-12-17  5:12 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
@ 2025-12-17  5:22 ` Tao Chen
  2025-12-17  9:11   ` Tao Chen
  0 siblings, 1 reply; 13+ messages in thread
From: Tao Chen @ 2025-12-17 5:22 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
	andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf

On 2025/12/17 13:12, Tao Chen wrote:
> As Alexei noted, get_perf_callchain() return values may be reused
> if a task is preempted after the BPF program enters migrate-disable
> mode. The perf callchain code keeps only a small stack of entries,
> and we reuse one as follows:
> 
> 1. get the perf callchain entry
> 2. BPF use...
> 3. put the perf callchain entry
> 
> And Peter suggested that get_recursion_context be used with preemption
> disabled, so we disable preemption on the BPF side.
> 
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
> ---
>  kernel/bpf/stackmap.c | 67 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 55 insertions(+), 12 deletions(-)
> 
> [... quoted patch diff snipped; identical to the patch above ...]

Hi Peter,

As requested by Alexei, I have re-sent the v7 version. Compared with v6,
the only change is the added Acked-by tag on patch 2. Following your
previous suggestions, patch 1 was reworked on top of your earlier patch,
and patch 2 adds preempt_disable on the BPF side; this does not affect
the original perf logic. Please review it again, thank you.

-- 
Best Regards
Tao Chen
* Re: [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely
  2025-12-17  5:22 ` Tao Chen
@ 2025-12-17  9:11 ` Tao Chen
  0 siblings, 0 replies; 13+ messages in thread
From: Tao Chen @ 2025-12-17 9:11 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
	andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf

On 2025/12/17 13:22, Tao Chen wrote:
> On 2025/12/17 13:12, Tao Chen wrote:
>> [... quoted commit message and patch diff snipped; identical to the
>> patch above ...]
> 
> Hi Peter,
> 
> As requested by Alexei, I have re-sent the v7 version. Compared with
> v6, the only change is the added Acked-by tag on patch 2. Following
> your previous suggestions, patch 1 was reworked on top of your earlier
> patch, and patch 2 adds preempt_disable on the BPF side; this does not
> affect the original perf logic. Please review it again, thank you.
> 

Sorry, there are code conflicts. I will resend it.

-- 
Best Regards
Tao Chen
end of thread, other threads: [~2026-01-23 18:40 UTC | newest]

Thread overview: 13+ messages (-- links below jump to the message on this page --)
2025-12-17  9:33 [PATCH RESEND bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
2025-12-17  9:33 ` [PATCH bpf-next v7 1/2] perf: Refactor get_perf_callchain Tao Chen
2025-12-17  9:33 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
2025-12-23  6:29   ` Tao Chen
2026-01-06 16:00   ` Tao Chen
2026-01-09 23:47   ` Andrii Nakryiko
2026-01-16  4:35   ` Tao Chen
2026-01-23  0:38   ` Andrii Nakryiko
2026-01-23  5:42   ` Tao Chen
2026-01-23 18:40   ` Andrii Nakryiko
-- strict thread matches above, loose matches on Subject: below --
2025-12-17  5:12 [PATCH bpf-next v7 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
2025-12-17  5:12 ` [PATCH bpf-next v7 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
2025-12-17  5:22   ` Tao Chen
2025-12-17  9:11     ` Tao Chen