* [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain
@ 2026-01-26 7:43 Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen
Background
==========
Alexei noted that we should use preempt_disable to protect get_perf_callchain
in the BPF stackmap:
https://lore.kernel.org/bpf/CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com
A previous patch attempted to fix this issue, and Andrii suggested
teaching get_perf_callchain to accept an external buffer so that we can
pass our buffer directly and avoid the unnecessary copy:
https://lore.kernel.org/bpf/20250926153952.1661146-1-chen.dylane@linux.dev
Proposed Solution
=================
Add an external perf_callchain_entry parameter to get_perf_callchain so
that the BPF side can pass in its own buffer. The biggest advantage is
that it eliminates an unnecessary copy.
Todo
====
I am not sure whether this modification is appropriate; after all, the
implementation of get_callchain_entry in the perf subsystem is
considerably more complex than simply using an external buffer.
Comments and suggestions are always welcome.
Change list:
- v1 -> v2
From Jiri
- rebase code, fix conflict
- v1: https://lore.kernel.org/bpf/20251013174721.2681091-1-chen.dylane@linux.dev
- v2 -> v3:
From Andrii
- entries per CPU used in a stack-like fashion
- v2: https://lore.kernel.org/bpf/20251014100128.2721104-1-chen.dylane@linux.dev
- v3 -> v4:
From Peter
- refactor get_perf_callchain and add three new APIs to make the perf
callchain easier to use.
From Andrii
- reuse the perf callchain management.
- rename patch1 and patch2.
- v3: https://lore.kernel.org/bpf/20251019170118.2955346-1-chen.dylane@linux.dev
- v4 -> v5:
From Yonghong
- keep add_mark false in stackmap when refactoring get_perf_callchain in
patch1.
- add atomic operation in get_recursion_context in patch2.
- rename bpf_put_callchain_entry with bpf_put_perf_callchain in
patch3.
- rebase bpf-next master.
- v4: https://lore.kernel.org/bpf/20251028162502.3418817-1-chen.dylane@linux.dev
- v5 -> v6:
From Peter
- disable preemption from BPF side in patch2.
From AI
- use ctx->entry->nr instead of ctx->nr in patch1.
- v5: https://lore.kernel.org/bpf/20251109163559.4102849-1-chen.dylane@linux.dev
- v6 -> v7:
From Yonghong
- Add ack in patch2
From AI
- resolve conflict
- v6: https://lore.kernel.org/bpf/20251112163148.100949-1-chen.dylane@linux.dev
- v7 -> v8:
From Andrii
- Record rctx in perf_callchain_entry in patch1
- Export __get_perf_callchain in patch2
- v7: https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev
Tao Chen (3):
perf: Add rctx in perf_callchain_entry
perf: Refactor get_perf_callchain
bpf: Hold the perf callchain entry until used completely
include/linux/perf_event.h | 10 +++++--
kernel/bpf/stackmap.c | 58 ++++++++++++++++++++++++++++++-------
kernel/events/callchain.c | 59 ++++++++++++++++++++++++--------------
3 files changed, 93 insertions(+), 34 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 28+ messages in thread* [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen @ 2026-01-26 7:43 ` Tao Chen 2026-01-26 8:03 ` bot+bpf-ci 2026-01-28 8:59 ` Peter Zijlstra 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold ther perf callchain entry until used completely Tao Chen 2 siblings, 2 replies; 28+ messages in thread From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw) To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, Tao Chen Record rctx inside the perf_callchain_entry itself, when callers of get_callchain_entry no longer care about the assignment of rctx, and will be used in the next patch. 
Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> --- include/linux/perf_event.h | 5 +++-- kernel/bpf/stackmap.c | 5 ++--- kernel/events/callchain.c | 27 ++++++++++++++++----------- 3 files changed, 21 insertions(+), 16 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 9870d768db4..f0489843ebc 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -57,6 +57,7 @@ #include <asm/local.h> struct perf_callchain_entry { + int rctx; u64 nr; u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ }; @@ -1723,8 +1724,8 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie); extern int get_callchain_buffers(int max_stack); extern void put_callchain_buffers(void); -extern struct perf_callchain_entry *get_callchain_entry(int *rctx); -extern void put_callchain_entry(int rctx); +extern struct perf_callchain_entry *get_callchain_entry(void); +extern void put_callchain_entry(struct perf_callchain_entry *entry); extern int sysctl_perf_event_max_stack; extern int sysctl_perf_event_max_contexts_per_stack; diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index da3d328f5c1..e77dcdc2164 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -214,9 +214,8 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) { #ifdef CONFIG_STACKTRACE struct perf_callchain_entry *entry; - int rctx; - entry = get_callchain_entry(&rctx); + entry = get_callchain_entry(); if (!entry) return NULL; @@ -238,7 +237,7 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) to[i] = (u64)(from[i]); } - put_callchain_entry(rctx); + put_callchain_entry(entry); return entry; #else /* CONFIG_STACKTRACE */ diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index b9c7e00725d..6cdbc5937b1 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -151,31 
+151,36 @@ void put_callchain_buffers(void) } } -struct perf_callchain_entry *get_callchain_entry(int *rctx) +struct perf_callchain_entry *get_callchain_entry(void) { int cpu; + int rctx; struct callchain_cpus_entries *entries; + struct perf_callchain_entry *entry; - *rctx = get_recursion_context(this_cpu_ptr(callchain_recursion)); - if (*rctx == -1) + rctx = get_recursion_context(this_cpu_ptr(callchain_recursion)); + if (rctx == -1) return NULL; entries = rcu_dereference(callchain_cpus_entries); if (!entries) { - put_recursion_context(this_cpu_ptr(callchain_recursion), *rctx); + put_recursion_context(this_cpu_ptr(callchain_recursion), rctx); return NULL; } cpu = smp_processor_id(); - return (((void *)entries->cpu_entries[cpu]) + - (*rctx * perf_callchain_entry__sizeof())); + entry = ((void *)entries->cpu_entries[cpu]) + + (rctx * perf_callchain_entry__sizeof()); + entry->rctx = rctx; + + return entry; } void -put_callchain_entry(int rctx) +put_callchain_entry(struct perf_callchain_entry *entry) { - put_recursion_context(this_cpu_ptr(callchain_recursion), rctx); + put_recursion_context(this_cpu_ptr(callchain_recursion), entry->rctx); } static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry, @@ -222,13 +227,13 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, { struct perf_callchain_entry *entry; struct perf_callchain_entry_ctx ctx; - int rctx, start_entry_idx; + int start_entry_idx; /* crosstask is not supported for user stacks */ if (crosstask && user && !kernel) return NULL; - entry = get_callchain_entry(&rctx); + entry = get_callchain_entry(); if (!entry) return NULL; @@ -272,7 +277,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, } exit_put: - put_callchain_entry(rctx); + put_callchain_entry(entry); return entry; } -- 2.48.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen @ 2026-01-26 8:03 ` bot+bpf-ci 2026-01-26 8:51 ` Tao Chen 2026-01-28 8:59 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: bot+bpf-ci @ 2026-01-26 8:03 UTC (permalink / raw) To: chen.dylane, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, chen.dylane, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 1734 bytes --] > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 9ded2e582c60..0330f265173d 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -57,6 +57,7 @@ > #include <asm/local.h> > > struct perf_callchain_entry { > + int rctx; > u64 nr; > u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ > }; Adding rctx before nr changes the structure layout in a way that appears to break the perf userspace ABI. In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN is set, the code does: int size = 1; size += data->callchain->nr; size *= sizeof(u64); __output_copy(handle, data->callchain, size); This copies (1 + nr) * 8 bytes starting from the beginning of the structure. Before this patch, offset 0 contained nr (u64), followed by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes padding, then nr at offset 8, then ip[] at offset 16. 
With the new layout, this copy sends the following to userspace: - First 8 bytes: rctx + padding (kernel-internal data) instead of nr - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] This means the callchain data sent to userspace is shifted by 8 bytes, the last IP is truncated, and the rctx value is leaked to userspace. Is there a plan to update perf_output_sample() to output starting from &entry->nr instead of the structure start? Or am I missing something about how this structure is used for userspace output? [ ... ] --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 8:03 ` bot+bpf-ci @ 2026-01-26 8:51 ` Tao Chen 2026-01-27 21:01 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-01-26 8:51 UTC (permalink / raw) To: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, martin.lau, clm, ihor.solodrai 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道: >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >> index 9ded2e582c60..0330f265173d 100644 >> --- a/include/linux/perf_event.h >> +++ b/include/linux/perf_event.h >> @@ -57,6 +57,7 @@ >> #include <asm/local.h> >> >> struct perf_callchain_entry { >> + int rctx; >> u64 nr; >> u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ >> }; > > Adding rctx before nr changes the structure layout in a way that > appears to break the perf userspace ABI. > > In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN > is set, the code does: > > int size = 1; > size += data->callchain->nr; > size *= sizeof(u64); > __output_copy(handle, data->callchain, size); > > This copies (1 + nr) * 8 bytes starting from the beginning of the > structure. Before this patch, offset 0 contained nr (u64), followed > by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes > padding, then nr at offset 8, then ip[] at offset 16. > > With the new layout, this copy sends the following to userspace: > - First 8 bytes: rctx + padding (kernel-internal data) instead of nr > - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] > > This means the callchain data sent to userspace is shifted by 8 bytes, > the last IP is truncated, and the rctx value is leaked to userspace. 
> > Is there a plan to update perf_output_sample() to output starting from > &entry->nr instead of the structure start? Or am I missing something > about how this structure is used for userspace output? > As AI said, this change shifts the memory layout, which effectively breaks the userspace ABI. Maybe we can use __output_copy(handle, &data->callchain->nr, size); > [ ... ] > > > --- > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 8:51 ` Tao Chen @ 2026-01-27 21:01 ` Andrii Nakryiko 2026-01-28 2:41 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-27 21:01 UTC (permalink / raw) To: Tao Chen Cc: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf, martin.lau, clm, ihor.solodrai On Mon, Jan 26, 2026 at 12:51 AM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道: > >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > >> index 9ded2e582c60..0330f265173d 100644 > >> --- a/include/linux/perf_event.h > >> +++ b/include/linux/perf_event.h > >> @@ -57,6 +57,7 @@ > >> #include <asm/local.h> > >> > >> struct perf_callchain_entry { > >> + int rctx; > >> u64 nr; > >> u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ > >> }; > > > > Adding rctx before nr changes the structure layout in a way that > > appears to break the perf userspace ABI. > > > > In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN > > is set, the code does: > > > > int size = 1; > > size += data->callchain->nr; > > size *= sizeof(u64); > > __output_copy(handle, data->callchain, size); > > > > This copies (1 + nr) * 8 bytes starting from the beginning of the > > structure. Before this patch, offset 0 contained nr (u64), followed > > by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes > > padding, then nr at offset 8, then ip[] at offset 16. 
> > > > With the new layout, this copy sends the following to userspace: > > - First 8 bytes: rctx + padding (kernel-internal data) instead of nr > > - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] > > > > This means the callchain data sent to userspace is shifted by 8 bytes, > > the last IP is truncated, and the rctx value is leaked to userspace. > > > > Is there a plan to update perf_output_sample() to output starting from > > &entry->nr instead of the structure start? Or am I missing something > > about how this structure is used for userspace output? > > > > As AI said, this change shifts the memory layout, which effectively > breaks the userspace ABI. > > Maybe we can use __output_copy(handle, &data->callchain->nr, size); yep, very impressive for AI to notice this. I agree that &data->callchain->nr seems like the best way forward. > > > [ ... ] > > > > > > --- > > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 > > > -- > Best Regards > Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-27 21:01 ` Andrii Nakryiko @ 2026-01-28 2:41 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-28 2:41 UTC (permalink / raw) To: Andrii Nakryiko Cc: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf, martin.lau, clm, ihor.solodrai 在 2026/1/28 05:01, Andrii Nakryiko 写道: > On Mon, Jan 26, 2026 at 12:51 AM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道: >>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >>>> index 9ded2e582c60..0330f265173d 100644 >>>> --- a/include/linux/perf_event.h >>>> +++ b/include/linux/perf_event.h >>>> @@ -57,6 +57,7 @@ >>>> #include <asm/local.h> >>>> >>>> struct perf_callchain_entry { >>>> + int rctx; >>>> u64 nr; >>>> u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ >>>> }; >>> >>> Adding rctx before nr changes the structure layout in a way that >>> appears to break the perf userspace ABI. >>> >>> In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN >>> is set, the code does: >>> >>> int size = 1; >>> size += data->callchain->nr; >>> size *= sizeof(u64); >>> __output_copy(handle, data->callchain, size); >>> >>> This copies (1 + nr) * 8 bytes starting from the beginning of the >>> structure. Before this patch, offset 0 contained nr (u64), followed >>> by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes >>> padding, then nr at offset 8, then ip[] at offset 16. 
>>> >>> With the new layout, this copy sends the following to userspace: >>> - First 8 bytes: rctx + padding (kernel-internal data) instead of nr >>> - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] >>> >>> This means the callchain data sent to userspace is shifted by 8 bytes, >>> the last IP is truncated, and the rctx value is leaked to userspace. >>> >>> Is there a plan to update perf_output_sample() to output starting from >>> &entry->nr instead of the structure start? Or am I missing something >>> about how this structure is used for userspace output? >>> >> >> As AI said, this change shifts the memory layout, which effectively >> breaks the userspace ABI. >> >> Maybe we can use __output_copy(handle, &data->callchain->nr, size); > > yep, very impressive for AI to notice this. I agree that > &data->callchain->nr seems like the best way forward. > will fix it in v9. >> >>> [ ... ] >>> >>> >>> --- >>> AI reviewed your patch. Please fix the bug or email reply why it's not a bug. >>> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md >>> >>> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 >> >> >> -- >> Best Regards >> Tao Chen -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry
  2026-01-26  7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen
  2026-01-26  8:03   ` bot+bpf-ci
@ 2026-01-28  8:59   ` Peter Zijlstra
  2026-01-28 16:52     ` Tao Chen
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2026-01-28 8:59 UTC (permalink / raw)
To: Tao Chen
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
    irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii,
    martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
    haoluo, linux-perf-users, linux-kernel, bpf

On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
> Record rctx inside the perf_callchain_entry itself, when callers of
> get_callchain_entry no longer care about the assignment of rctx, and
> will be used in the next patch.

Sorry, what?

The recursion context is very much about the caller, and very much not
about the data. This just doesn't make sense.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry
  2026-01-28  8:59 ` Peter Zijlstra
@ 2026-01-28 16:52   ` Tao Chen
  2026-01-28 18:59     ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Tao Chen @ 2026-01-28 16:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
    irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii,
    martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
    haoluo, linux-perf-users, linux-kernel, bpf

On 2026/1/28 16:59, Peter Zijlstra wrote:
> On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
>> Record rctx inside the perf_callchain_entry itself, when callers of
>> get_callchain_entry no longer care about the assignment of rctx, and
>> will be used in the next patch.
>
> Sorry, what?
>
> The recursion context is very much about the caller, and very much not
> about the data. This just doesn't make sense.

Well, Andrii, it seems we have to go back to the original way.

--
Best Regards
Tao Chen

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-28 16:52 ` Tao Chen @ 2026-01-28 18:59 ` Andrii Nakryiko 2026-01-29 3:03 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-28 18:59 UTC (permalink / raw) To: Tao Chen Cc: Peter Zijlstra, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Jan 28, 2026 at 8:53 AM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/1/28 16:59, Peter Zijlstra 写道: > > On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote: > >> Record rctx inside the perf_callchain_entry itself, when callers of > >> get_callchain_entry no longer care about the assignment of rctx, and > >> will be used in the next patch. > > > > Sorry, what? > > > > The recursion context is very much about the caller, and very much not > > about the data. This just doesn't make sense. > > Well, Andrii, it seems we have to go back to the original way. Huh, why? Peter is confused by your wording, he didn't say that what you are doing is broken. The point is to couple rctx with the exact perf_callchain_entry returned to the caller in that recursion context. There is no contradiction. It's purely about simplifying the interface. While previously the caller would have to juggle perf_callchain_entry and rctx separately (but still ensure that entry and rctx do match each other when calling put_callchain_entry), now it's more convenient and less error-prone because returned entry will record rctx it was retrieved with. That perf_callchain_entry should not be reused by anyone else between successful get_callchain_entry and put_callchain_entry, so there is no problem here. > > -- > Best Regards > Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-28 18:59 ` Andrii Nakryiko @ 2026-01-29 3:03 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-29 3:03 UTC (permalink / raw) To: Andrii Nakryiko Cc: Peter Zijlstra, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/1/29 02:59, Andrii Nakryiko 写道: > On Wed, Jan 28, 2026 at 8:53 AM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2026/1/28 16:59, Peter Zijlstra 写道: >>> On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote: >>>> Record rctx inside the perf_callchain_entry itself, when callers of >>>> get_callchain_entry no longer care about the assignment of rctx, and >>>> will be used in the next patch. >>> >>> Sorry, what? >>> >>> The recursion context is very much about the caller, and very much not >>> about the data. This just doesn't make sense. >> >> Well, Andrii, it seems we have to go back to the original way. > > Huh, why? Peter is confused by your wording, he didn't say that what > you are doing is broken. > I see, will update the changelog, thanks. > The point is to couple rctx with the exact perf_callchain_entry > returned to the caller in that recursion context. There is no > contradiction. > > It's purely about simplifying the interface. While previously the > caller would have to juggle perf_callchain_entry and rctx separately > (but still ensure that entry and rctx do match each other when calling > put_callchain_entry), now it's more convenient and less error-prone > because returned entry will record rctx it was retrieved with. > > That perf_callchain_entry should not be reused by anyone else between > successful get_callchain_entry and put_callchain_entry, so there is no > problem here. 
> >> >> -- >> Best Regards >> Tao Chen -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen @ 2026-01-26 7:43 ` Tao Chen 2026-01-27 21:07 ` Andrii Nakryiko ` (2 more replies) 2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold ther perf callchain entry until used completely Tao Chen 2 siblings, 3 replies; 28+ messages in thread From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw) To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, Tao Chen From BPF stack map, we want to ensure that the callchain buffer will not be overwritten by other preemptive tasks and we also aim to reduce the preempt disable interval, Based on the suggestions from Peter and Andrrii, export new API __get_perf_callchain and the usage scenarios are as follows from BPF side: preempt_disable() entry = get_callchain_entry() preempt_enable() __get_perf_callchain(entry) put_callchain_entry(entry) Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> --- include/linux/perf_event.h | 5 +++++ kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ 2 files changed, 27 insertions(+), 12 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f0489843ebc..7132fa97bb1 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1722,6 +1722,11 @@ extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct extern struct perf_callchain_entry * get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie); + +extern int __get_perf_callchain(struct perf_callchain_entry 
*entry, struct pt_regs *regs, + bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark, + u64 defer_cookie); + extern int get_callchain_buffers(int max_stack); extern void put_callchain_buffers(void); extern struct perf_callchain_entry *get_callchain_entry(void); diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index 6cdbc5937b1..f9789d10fa4 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -221,21 +221,15 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr #endif } -struct perf_callchain_entry * -get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, - u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) +int __get_perf_callchain(struct perf_callchain_entry *entry, struct pt_regs *regs, bool kernel, + bool user, u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) { - struct perf_callchain_entry *entry; struct perf_callchain_entry_ctx ctx; int start_entry_idx; /* crosstask is not supported for user stacks */ if (crosstask && user && !kernel) - return NULL; - - entry = get_callchain_entry(); - if (!entry) - return NULL; + return -EINVAL; ctx.entry = entry; ctx.max_stack = max_stack; @@ -252,7 +246,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, if (user && !crosstask) { if (!user_mode(regs)) { if (current->flags & (PF_KTHREAD | PF_USER_WORKER)) - goto exit_put; + return 0; regs = task_pt_regs(current); } @@ -265,7 +259,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, */ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED); perf_callchain_store_context(&ctx, defer_cookie); - goto exit_put; + return 0; } if (add_mark) @@ -275,9 +269,25 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, perf_callchain_user(&ctx, regs); fixup_uretprobe_trampoline_entries(entry, start_entry_idx); } + return 0; +} + +struct perf_callchain_entry * +get_perf_callchain(struct pt_regs *regs, bool kernel, 
bool user, + u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) +{ + struct perf_callchain_entry *entry; + int ret; + + entry = get_callchain_entry(); + if (!entry) + return NULL; -exit_put: + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, add_mark, + defer_cookie); put_callchain_entry(entry); + if (ret) + entry = NULL; return entry; } -- 2.48.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen @ 2026-01-27 21:07 ` Andrii Nakryiko 2026-01-28 2:42 ` Tao Chen 2026-01-28 9:10 ` Peter Zijlstra 2026-02-04 1:08 ` Andrii Nakryiko 2 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-27 21:07 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: > > From BPF stack map, we want to ensure that the callchain buffer > will not be overwritten by other preemptive tasks and we also aim > to reduce the preempt disable interval, Based on the suggestions from Peter > and Andrrii, export new API __get_perf_callchain and the usage scenarios > are as follows from BPF side: > > preempt_disable() > entry = get_callchain_entry() > preempt_enable() > __get_perf_callchain(entry) > put_callchain_entry(entry) > > Suggested-by: Andrii Nakryiko <andrii@kernel.org> > Signed-off-by: Tao Chen <chen.dylane@linux.dev> > --- > include/linux/perf_event.h | 5 +++++ > kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ > 2 files changed, 27 insertions(+), 12 deletions(-) [...] 
> +struct perf_callchain_entry * > +get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, > + u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) > +{ > + struct perf_callchain_entry *entry; > + int ret; > + > + entry = get_callchain_entry(); > + if (!entry) > + return NULL; > > -exit_put: > + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, add_mark, > + defer_cookie); > put_callchain_entry(entry); > + if (ret) > + entry = NULL; > purely stylistical nit, so this can be ignored if you disagree, but I find code that modifies some variable before returning it slightly less preferable to more explicit: if (__get_perf_callchain(...)) { put_callchain_entry(entry); return NULL; } return entry; > return entry; > } > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-27 21:07 ` Andrii Nakryiko @ 2026-01-28 2:42 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-28 2:42 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On 2026/1/28 05:07, Andrii Nakryiko wrote: > On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> From the BPF stack map, we want to ensure that the callchain buffer >> will not be overwritten by other preempting tasks, and we also aim >> to reduce the preempt-disabled interval. Based on the suggestions from Peter >> and Andrii, export a new API, __get_perf_callchain; the usage from the BPF >> side is as follows: >> >> preempt_disable() >> entry = get_callchain_entry() >> preempt_enable() >> __get_perf_callchain(entry) >> put_callchain_entry(entry) >> >> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> include/linux/perf_event.h | 5 +++++ >> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ >> 2 files changed, 27 insertions(+), 12 deletions(-) > > [...] 
> >> +struct perf_callchain_entry * >> +get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, >> + u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) >> +{ >> + struct perf_callchain_entry *entry; >> + int ret; >> + >> + entry = get_callchain_entry(); >> + if (!entry) >> + return NULL; >> >> -exit_put: >> + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, add_mark, >> + defer_cookie); >> put_callchain_entry(entry); >> + if (ret) >> + entry = NULL; >> > > purely stylistical nit, so this can be ignored if you disagree, but I > find code that modifies some variable before returning it slightly > less preferable to more explicit: > > > if (__get_perf_callchain(...)) { > put_callchain_entry(entry); > return NULL; > } > > return entry; > >> return entry; >> } > agree, will change it in v9, thanks. >> -- >> 2.48.1 >> -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen 2026-01-27 21:07 ` Andrii Nakryiko @ 2026-01-28 9:10 ` Peter Zijlstra 2026-01-28 16:49 ` Tao Chen 2026-01-28 19:12 ` Andrii Nakryiko 2026-02-04 1:08 ` Andrii Nakryiko 2 siblings, 2 replies; 28+ messages in thread From: Peter Zijlstra @ 2026-01-28 9:10 UTC (permalink / raw) To: Tao Chen Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > From BPF stack map, we want to ensure that the callchain buffer > will not be overwritten by other preemptive tasks and we also aim > to reduce the preempt disable interval, Based on the suggestions from Peter > and Andrrii, export new API __get_perf_callchain and the usage scenarios > are as follows from BPF side: > > preempt_disable() > entry = get_callchain_entry() > preempt_enable() > __get_perf_callchain(entry) > put_callchain_entry(entry) That makes no sense, this means any other task on that CPU is getting screwed over. Why are you worried about the preempt_disable() here? If this were an interrupt context we'd still do that unwind -- but then with IRQs disabled. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-28 9:10 ` Peter Zijlstra @ 2026-01-28 16:49 ` Tao Chen 2026-01-28 19:12 ` Andrii Nakryiko 1 sibling, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-28 16:49 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On 2026/1/28 17:10, Peter Zijlstra wrote: > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: >> From the BPF stack map, we want to ensure that the callchain buffer >> will not be overwritten by other preempting tasks, and we also aim >> to reduce the preempt-disabled interval. Based on the suggestions from Peter >> and Andrii, export a new API, __get_perf_callchain; the usage from the BPF >> side is as follows: >> >> preempt_disable() >> entry = get_callchain_entry() >> preempt_enable() >> __get_perf_callchain(entry) >> put_callchain_entry(entry) > > That makes no sense, this means any other task on that CPU is getting > screwed over. > > Why are you worried about the preempt_disable() here? If this were an > interrupt context we'd still do that unwind -- but then with IRQs > disabled. Hi Peter, Right now, obtaining stack information in BPF involves two steps: 1. get the callchain 2. store the callchain in a BPF map or copy it to a buffer There is no preempt disable in BPF today. When obtaining the stack information of Process A, Process A may be preempted by Process B, and by the same logic we then acquire the stack information of Process B. However, when execution resumes in Process A, the callchain buffer holds the stack information of Process B, because each context (task, soft irq, irq, NMI) has only one callchain entry. 
taskA taskB 1. callchain(A) = get_perf_callchain <-- preempted by B callchain(B) = get_perf_callchain 2. stack_map(callchain(B)) So we want to ensure that while task A's entry is in use, the preempting task B cannot use it. The approach is to defer put_callchain_entry until the stack has been captured and saved in the stack map. taskA taskB 1. callchain(A) = __get_perf_callchain <-- preempted by B callchain(B) = __get_perf_callchain 2. stack_map(callchain(A)) 3. put_callchain_entry() Task B cannot get the callchain entry because task A holds it. And the preempt_disable() around get_callchain_entry was suggested by Yonghong in v4: https://lore.kernel.org/bpf/c352f357-1417-47b5-9d8c-28d99f20f5a6@linux.dev/ Please correct me if I'm mistaken. Thanks. -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-28 9:10 ` Peter Zijlstra 2026-01-28 16:49 ` Tao Chen @ 2026-01-28 19:12 ` Andrii Nakryiko 2026-01-30 11:31 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-28 19:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > > From BPF stack map, we want to ensure that the callchain buffer > > will not be overwritten by other preemptive tasks and we also aim > > to reduce the preempt disable interval, Based on the suggestions from Peter > > and Andrrii, export new API __get_perf_callchain and the usage scenarios > > are as follows from BPF side: > > > > preempt_disable() > > entry = get_callchain_entry() > > preempt_enable() > > __get_perf_callchain(entry) > > put_callchain_entry(entry) > > That makes no sense, this means any other task on that CPU is getting > screwed over. Yes, unfortunately, but unless we dynamically allocate new entry each time and/or keep per-current entry cached there isn't much choice we have here, no? Maybe that's what we have to do, honestly, because get_perf_callchain() usage we have right now from sleepable BPF isn't great no matter with or without changes in this patch set. > > Why are you worried about the preempt_disable() here? If this were an > interrupt context we'd still do that unwind -- but then with IRQs > disabled. 
Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called from sleepable/faultable context and also we can do a rather expensive build ID resolution (either in sleepable or not, which only changes if build ID parsing logic waits for file backed pages to be paged in or not). ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-28 19:12 ` Andrii Nakryiko @ 2026-01-30 11:31 ` Peter Zijlstra 2026-01-30 20:04 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-01-30 11:31 UTC (permalink / raw) To: Andrii Nakryiko Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Jan 28, 2026 at 11:12:09AM -0800, Andrii Nakryiko wrote: > On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > > > From BPF stack map, we want to ensure that the callchain buffer > > > will not be overwritten by other preemptive tasks and we also aim > > > to reduce the preempt disable interval, Based on the suggestions from Peter > > > and Andrrii, export new API __get_perf_callchain and the usage scenarios > > > are as follows from BPF side: > > > > > > preempt_disable() > > > entry = get_callchain_entry() > > > preempt_enable() > > > __get_perf_callchain(entry) > > > put_callchain_entry(entry) > > > > That makes no sense, this means any other task on that CPU is getting > > screwed over. > > Yes, unfortunately, but unless we dynamically allocate new entry each > time and/or keep per-current entry cached there isn't much choice we > have here, no? > > Maybe that's what we have to do, honestly, because > get_perf_callchain() usage we have right now from sleepable BPF isn't > great no matter with or without changes in this patch set. All of the perf stuff is based on the fact that if we do it from IRQ/NMI context, we can certainly do it with preemption disabled. Bending the interface this way is just horrible. > > Why are you worried about the preempt_disable() here? 
If this were an > > interrupt context we'd still do that unwind -- but then with IRQs > > disabled. > > Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called > from sleepable/faultable context and also we can do a rather expensive > build ID resolution (either in sleepable or not, which only changes if > build ID parsing logic waits for file backed pages to be paged in or > not). <rant> So stack_map_get_build_id_offset() is a piece of crap -- and I've always said it was. And I hate that any of that ever got merged -- it's the pinnacle of bad engineering and simply refusing to do the right thing in the interest of hack now, fix never :/ </rant> Anyway, as you well know, we now *do* have lockless vma lookups and should be able to do this buildid thing much saner. Also, there appears to be no buildid caching whatsoever, surely that would help some. (and I'm not sure I've ever understood why the buildid crap needs to be in this path in any case) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-30 11:31 ` Peter Zijlstra @ 2026-01-30 20:04 ` Andrii Nakryiko 2026-02-02 19:59 ` Peter Zijlstra 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-30 20:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Fri, Jan 30, 2026 at 3:31 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Jan 28, 2026 at 11:12:09AM -0800, Andrii Nakryiko wrote: > > On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > > > > From BPF stack map, we want to ensure that the callchain buffer > > > > will not be overwritten by other preemptive tasks and we also aim > > > > to reduce the preempt disable interval, Based on the suggestions from Peter > > > > and Andrrii, export new API __get_perf_callchain and the usage scenarios > > > > are as follows from BPF side: > > > > > > > > preempt_disable() > > > > entry = get_callchain_entry() > > > > preempt_enable() > > > > __get_perf_callchain(entry) > > > > put_callchain_entry(entry) > > > > > > That makes no sense, this means any other task on that CPU is getting > > > screwed over. > > > > Yes, unfortunately, but unless we dynamically allocate new entry each > > time and/or keep per-current entry cached there isn't much choice we > > have here, no? > > > > Maybe that's what we have to do, honestly, because > > get_perf_callchain() usage we have right now from sleepable BPF isn't > > great no matter with or without changes in this patch set. > > All of the perf stuff is based on the fact that if we do it from IRQ/NMI > context, we can certainly do it with preemption disabled. 
> > Bending the interface this way is just horrible. Sure, but I'm just trying to help mitigate the issue at hand (way too long preemption disabled region). I agree that we should do something better in terms of perf_callchain_entry retrieval and reuse, but maybe one thing at a time? > > > > Why are you worried about the preempt_disable() here? If this were an > > > interrupt context we'd still do that unwind -- but then with IRQs > > > disabled. > > > > Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called > > from sleepable/faultable context and also we can do a rather expensive > > build ID resolution (either in sleepable or not, which only changes if > > build ID parsing logic waits for file backed pages to be paged in or > > not). > > <rant> > So stack_map_get_build_id_offset() is a piece of crap -- and I've always > said it was. And I hate that any of that ever got merged -- its the > pinnacle of bad engineering and simply refusing to do the right thing in > the interest of hack now, fix never :/ > </rant> <hug><nod></hug> > > Anyway, as you well know, we now *do* have lockless vma lookups and > should be able to do this buildid thing much saner. Yes, probably, and I am aware of a mmap_lock use inside stack_map_get_build_id_offset() being problematic, we'll need to fix this as well. One step at a time. > > Also, there appears to be no buildid caching what so ever, surely that > would help some. Jiri Olsa proposed caching build id per file or per inode some time back, there was vehement opposition to it. And doing some locked global resizable hash that might need to be used from NMI sounds horrible, tbh. So we have what we have today. > > (and I'm not sure I've ever understood why the buildid crap needs to be > in this path in any case) Yeah, perhaps, I haven't dealt with stack_map_get_build_id_offset() much, so will need to go a bit deeper and analyze. As I said we have mmap_lock problem there, so we need to address that. 
I'll think if/how we can improve this. Do I understand correctly that you'd rather just not touch all this for now and we should just drop this patch set? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-30 20:04 ` Andrii Nakryiko @ 2026-02-02 19:59 ` Peter Zijlstra 2026-02-04 0:24 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-02-02 19:59 UTC (permalink / raw) To: Andrii Nakryiko Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Fri, Jan 30, 2026 at 12:04:45PM -0800, Andrii Nakryiko wrote: > > Also, there appears to be no buildid caching what so ever, surely that > > would help some. > > Jiri Olsa proposed caching build id per file or per inode some time > back, there was vehement opposition to it. And doing some locked > global resizable hash that might need to be used from NMI sounds > horrible, tbh. So we have what we have today. See kernel/module/tree_lookup.c and include/linux/rbtree_latch.h :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-02 19:59 ` Peter Zijlstra @ 2026-02-04 0:24 ` Andrii Nakryiko 0 siblings, 0 replies; 28+ messages in thread From: Andrii Nakryiko @ 2026-02-04 0:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Mon, Feb 2, 2026 at 11:59 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Jan 30, 2026 at 12:04:45PM -0800, Andrii Nakryiko wrote: > > > > Also, there appears to be no buildid caching what so ever, surely that > > > would help some. > > > > Jiri Olsa proposed caching build id per file or per inode some time > > back, there was vehement opposition to it. And doing some locked > > global resizable hash that might need to be used from NMI sounds > > horrible, tbh. So we have what we have today. > > See kernel/module/tree_lookup.c and include/linux/rbtree_latch.h :-) Ah, neat idea. Each node is really two parallel nodes for each of the latched copies. TIL, good to know and thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen 2026-01-27 21:07 ` Andrii Nakryiko 2026-01-28 9:10 ` Peter Zijlstra @ 2026-02-04 1:08 ` Andrii Nakryiko 2026-02-05 6:16 ` Tao Chen 2 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-02-04 1:08 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: > > From BPF stack map, we want to ensure that the callchain buffer > will not be overwritten by other preemptive tasks and we also aim > to reduce the preempt disable interval, Based on the suggestions from Peter > and Andrrii, export new API __get_perf_callchain and the usage scenarios > are as follows from BPF side: > > preempt_disable() > entry = get_callchain_entry() > preempt_enable() > __get_perf_callchain(entry) > put_callchain_entry(entry) > > Suggested-by: Andrii Nakryiko <andrii@kernel.org> > Signed-off-by: Tao Chen <chen.dylane@linux.dev> > --- > include/linux/perf_event.h | 5 +++++ > kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ > 2 files changed, 27 insertions(+), 12 deletions(-) > Looking at the whole __bpf_get_stack() logic again, why didn't we just do something like this: diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index da3d328f5c15..80364561611c 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, max_depth = stack_map_calculate_max_depth(size, elem_size, flags); - if (may_fault) - rcu_read_lock(); /* need RCU for perf's callchain below */ + if (!trace_in) + preempt_disable(); if (trace_in) { trace = 
trace_in; @@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } if (unlikely(!trace) || trace->nr < skip) { - if (may_fault) - rcu_read_unlock(); + if (!trace_in) + preempt_enable(); goto err_fault; } @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } /* trace/ips should not be dereferenced after this point */ - if (may_fault) - rcu_read_unlock(); + if (!trace_in) + preempt_enable(); if (user_build_id) stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); ? Build ID parsing is happening after we copied data from perf's callchain_entry into user-provided memory, so raw callchain retrieval can be done with preemption disabled, as it's supposed to be brief. Build ID parsing part which indeed might fault and be much slower will be done well after that (we even have a comment there saying that trace/ips should not be touched). Am I missing something? [...] ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-04 1:08 ` Andrii Nakryiko @ 2026-02-05 6:16 ` Tao Chen 2026-02-05 17:34 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-02-05 6:16 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/2/4 09:08, Andrii Nakryiko 写道: > On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> From BPF stack map, we want to ensure that the callchain buffer >> will not be overwritten by other preemptive tasks and we also aim >> to reduce the preempt disable interval, Based on the suggestions from Peter >> and Andrrii, export new API __get_perf_callchain and the usage scenarios >> are as follows from BPF side: >> >> preempt_disable() >> entry = get_callchain_entry() >> preempt_enable() >> __get_perf_callchain(entry) >> put_callchain_entry(entry) >> >> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> include/linux/perf_event.h | 5 +++++ >> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ >> 2 files changed, 27 insertions(+), 12 deletions(-) >> > > Looking at the whole __bpf_get_stack() logic again, why didn't we just > do something like this: > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > index da3d328f5c15..80364561611c 100644 > --- a/kernel/bpf/stackmap.c > +++ b/kernel/bpf/stackmap.c > @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > struct task_struct *task, > > max_depth = stack_map_calculate_max_depth(size, elem_size, flags); > > - if (may_fault) > - rcu_read_lock(); /* need RCU for perf's callchain below */ > + if (!trace_in) > + preempt_disable(); > > if (trace_in) { > trace = trace_in; > @@ 
-474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > struct task_struct *task, > } > > if (unlikely(!trace) || trace->nr < skip) { > - if (may_fault) > - rcu_read_unlock(); > + if (!trace_in) > + preempt_enable(); > goto err_fault; > } > > @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > struct task_struct *task, > } > > /* trace/ips should not be dereferenced after this point */ > - if (may_fault) > - rcu_read_unlock(); > + if (!trace_in) > + preempt_enable(); > > if (user_build_id) > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > > > ? > > Build ID parsing is happening after we copied data from perf's > callchain_entry into user-provided memory, so raw callchain retrieval > can be done with preemption disabled, as it's supposed to be brief. > Build ID parsing part which indeed might fault and be much slower will > be done well after that (we even have a comment there saying that > trace/ips should not be touched). > > Am I missing something? Yes it looks good for bpf_get_stack, and I also proposed a similar change previously. But Alexei suggested that we should reduce the preemption-disabled section protected in bpf_get_stackid if we do like bpf_get_stack. Maybe we can change it first for bpf_get_stack? https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@linux.dev/ > > [...] -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-05 6:16 ` Tao Chen @ 2026-02-05 17:34 ` Andrii Nakryiko 2026-02-06 9:20 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-02-05 17:34 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Feb 4, 2026 at 10:16 PM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/2/4 09:08, Andrii Nakryiko 写道: > > On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: > >> > >> From BPF stack map, we want to ensure that the callchain buffer > >> will not be overwritten by other preemptive tasks and we also aim > >> to reduce the preempt disable interval, Based on the suggestions from Peter > >> and Andrrii, export new API __get_perf_callchain and the usage scenarios > >> are as follows from BPF side: > >> > >> preempt_disable() > >> entry = get_callchain_entry() > >> preempt_enable() > >> __get_perf_callchain(entry) > >> put_callchain_entry(entry) > >> > >> Suggested-by: Andrii Nakryiko <andrii@kernel.org> > >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> > >> --- > >> include/linux/perf_event.h | 5 +++++ > >> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ > >> 2 files changed, 27 insertions(+), 12 deletions(-) > >> > > > > Looking at the whole __bpf_get_stack() logic again, why didn't we just > > do something like this: > > > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > > index da3d328f5c15..80364561611c 100644 > > --- a/kernel/bpf/stackmap.c > > +++ b/kernel/bpf/stackmap.c > > @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > > struct task_struct *task, > > > > max_depth = stack_map_calculate_max_depth(size, elem_size, flags); > > > > - if (may_fault) > > - 
rcu_read_lock(); /* need RCU for perf's callchain below */ > > + if (!trace_in) > > + preempt_disable(); > > > > if (trace_in) { > > trace = trace_in; > > @@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > > struct task_struct *task, > > } > > > > if (unlikely(!trace) || trace->nr < skip) { > > - if (may_fault) > > - rcu_read_unlock(); > > + if (!trace_in) > > + preempt_enable(); > > goto err_fault; > > } > > > > @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > > struct task_struct *task, > > } > > > > /* trace/ips should not be dereferenced after this point */ > > - if (may_fault) > > - rcu_read_unlock(); > > + if (!trace_in) > > + preempt_enable(); > > > > if (user_build_id) > > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > > > > > > ? > > > > Build ID parsing is happening after we copied data from perf's > > callchain_entry into user-provided memory, so raw callchain retrieval > > can be done with preemption disabled, as it's supposed to be brief. > > Build ID parsing part which indeed might fault and be much slower will > > be done well after that (we even have a comment there saying that > > trace/ips should not be touched). > > > > Am I missing something? > > Yes it looks good for bpf_get_stack, and I also proposed a similar > change previously. But Alexei suggested that we should reduce the > preemption-disabled section protected in bpf_get_stackid if we do like > bpf_get_stack. Maybe we can change it first for bpf_get_stack? > > https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@linux.dev/ This is broken because we are still using trace after you re-enabled preemption. We need to keep preemption disabled until we copy captured stack trace data from ips into our own memory. > > > > > [...] > > > -- > Best Regards > Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-05 17:34 ` Andrii Nakryiko @ 2026-02-06 9:20 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-02-06 9:20 UTC (permalink / raw) To: Andrii Nakryiko, Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/2/6 01:34, Andrii Nakryiko 写道: > On Wed, Feb 4, 2026 at 10:16 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2026/2/4 09:08, Andrii Nakryiko 写道: >>> On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: >>>> >>>> From BPF stack map, we want to ensure that the callchain buffer >>>> will not be overwritten by other preemptive tasks and we also aim >>>> to reduce the preempt disable interval, Based on the suggestions from Peter >>>> and Andrrii, export new API __get_perf_callchain and the usage scenarios >>>> are as follows from BPF side: >>>> >>>> preempt_disable() >>>> entry = get_callchain_entry() >>>> preempt_enable() >>>> __get_perf_callchain(entry) >>>> put_callchain_entry(entry) >>>> >>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >>>> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >>>> --- >>>> include/linux/perf_event.h | 5 +++++ >>>> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ >>>> 2 files changed, 27 insertions(+), 12 deletions(-) >>>> >>> >>> Looking at the whole __bpf_get_stack() logic again, why didn't we just >>> do something like this: >>> >>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >>> index da3d328f5c15..80364561611c 100644 >>> --- a/kernel/bpf/stackmap.c >>> +++ b/kernel/bpf/stackmap.c >>> @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, >>> struct task_struct *task, >>> >>> max_depth = stack_map_calculate_max_depth(size, elem_size, flags); >>> >>> - if 
(may_fault) >>> - rcu_read_lock(); /* need RCU for perf's callchain below */ >>> + if (!trace_in) >>> + preempt_disable(); >>> >>> if (trace_in) { >>> trace = trace_in; >>> @@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, >>> struct task_struct *task, >>> } >>> >>> if (unlikely(!trace) || trace->nr < skip) { >>> - if (may_fault) >>> - rcu_read_unlock(); >>> + if (!trace_in) >>> + preempt_enable(); >>> goto err_fault; >>> } >>> >>> @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, >>> struct task_struct *task, >>> } >>> >>> /* trace/ips should not be dereferenced after this point */ >>> - if (may_fault) >>> - rcu_read_unlock(); >>> + if (!trace_in) >>> + preempt_enable(); >>> >>> if (user_build_id) >>> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); >>> >>> >>> ? >>> >>> Build ID parsing is happening after we copied data from perf's >>> callchain_entry into user-provided memory, so raw callchain retrieval >>> can be done with preemption disabled, as it's supposed to be brief. >>> Build ID parsing part which indeed might fault and be much slower will >>> be done well after that (we even have a comment there saying that >>> trace/ips should not be touched). >>> Hi Andrii, Building upon your work, I have also added preempt_disable() to bpf_get_stackid, and try to reduce the length of preempt section. Please review, thanks. https://lore.kernel.org/bpf/20260206090653.1336687-1-chen.dylane@linux.dev >>> Am I missing something? >> >> Yes it looks good for bpf_get_stack, and I also proposed a similar >> change previously. But Alexei suggested that we should reduce the >> preemption-disabled section protected in bpf_get_stackid if we do like >> bpf_get_stack. Maybe we can change it first for bpf_get_stack? >> >> https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@linux.dev/ > > This is broken because we are still using trace after you re-enabled > preemption. 
We need to keep preemption disabled until we copy captured > stack trace data from ips into our own memory. > >> >> >>> >>> [...] >> >> >> -- >> Best Regards >> Tao Chen > -- Best Regards Tao Chen
* [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen @ 2026-01-26 7:43 ` Tao Chen 2026-01-27 21:35 ` Andrii Nakryiko 2 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw) To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, Tao Chen As Alexei noted, the values returned by get_perf_callchain() may be reused if a task is preempted after the BPF program enters migrate-disable mode. The perf callchain entries pool has a small stack of entries, and we can reuse it as follows: 1. get the perf callchain entry 2. BPF use... 3. put the perf callchain entry And Peter suggested that get_recursion_context be used with preemption disabled, so we should disable preemption on the BPF side.
Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Tao Chen <chen.dylane@linux.dev> --- kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- 1 file changed, 47 insertions(+), 8 deletions(-) diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index e77dcdc2164..6bdee6cc05f 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) #ifdef CONFIG_STACKTRACE struct perf_callchain_entry *entry; + preempt_disable(); entry = get_callchain_entry(); + preempt_enable(); if (!entry) return NULL; @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) to[i] = (u64)(from[i]); } - put_callchain_entry(entry); - return entry; #else /* CONFIG_STACKTRACE */ return NULL; #endif } +static struct perf_callchain_entry * +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, + bool crosstask) +{ + struct perf_callchain_entry *entry; + int ret; + + preempt_disable(); + entry = get_callchain_entry(); + preempt_enable(); + + if (unlikely(!entry)) + return NULL; + + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); + if (ret) { + put_callchain_entry(entry); + return NULL; + } + + return entry; +} + +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) +{ + put_callchain_entry(entry); +} + static long __bpf_get_stackid(struct bpf_map *map, struct perf_callchain_entry *trace, u64 flags) { @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, struct perf_callchain_entry *trace; bool kernel = !user; u32 max_depth; + int ret; if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) return -EINVAL; max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); - trace = get_perf_callchain(regs, kernel, user, max_depth, - false, false, 
0); + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, false); if (unlikely(!trace)) /* couldn't fetch the stack trace */ return -EFAULT; - return __bpf_get_stackid(map, trace, flags); + ret = __bpf_get_stackid(map, trace, flags); + bpf_put_perf_callchain(trace); + + return ret; } const struct bpf_func_proto bpf_get_stackid_proto = { @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } else if (kernel && task) { trace = get_callchain_entry_for_task(task, max_depth); } else { - trace = get_perf_callchain(regs, kernel, user, max_depth, - crosstask, false, 0); + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); } - if (unlikely(!trace) || trace->nr < skip) { + if (unlikely(!trace)) { + if (may_fault) + rcu_read_unlock(); + goto err_fault; + } + if (trace->nr < skip) { if (may_fault) rcu_read_unlock(); + if (!trace_in) + bpf_put_perf_callchain(trace); goto err_fault; } @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, /* trace/ips should not be dereferenced after this point */ if (may_fault) rcu_read_unlock(); + if (!trace_in) + bpf_put_perf_callchain(trace); if (user_build_id) stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); -- 2.48.1
* Re: [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely Tao Chen @ 2026-01-27 21:35 ` Andrii Nakryiko 2026-01-28 4:21 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-27 21:35 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Sun, Jan 25, 2026 at 11:46 PM Tao Chen <chen.dylane@linux.dev> wrote: > > As Alexei noted, the values returned by get_perf_callchain() may be reused > if a task is preempted after the BPF program enters migrate-disable > mode. The perf callchain entries pool has a small stack of entries, and > we can reuse it as follows: > > 1. get the perf callchain entry > 2. BPF use... > 3. put the perf callchain entry > > And Peter suggested that get_recursion_context be used with preemption > disabled, so we should disable preemption on the BPF side.
> > Acked-by: Yonghong Song <yonghong.song@linux.dev> > Signed-off-by: Tao Chen <chen.dylane@linux.dev> > --- > kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- > 1 file changed, 47 insertions(+), 8 deletions(-) > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > index e77dcdc2164..6bdee6cc05f 100644 > --- a/kernel/bpf/stackmap.c > +++ b/kernel/bpf/stackmap.c > @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > #ifdef CONFIG_STACKTRACE > struct perf_callchain_entry *entry; > > + preempt_disable(); > entry = get_callchain_entry(); > + preempt_enable(); pass perf_callchain_entry as input argument, to keep similar approach to __get_perf_callchain, see below > > if (!entry) > return NULL; > @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > to[i] = (u64)(from[i]); > } > > - put_callchain_entry(entry); > - > return entry; > #else /* CONFIG_STACKTRACE */ > return NULL; > #endif > } > > +static struct perf_callchain_entry * > +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, > + bool crosstask) > +{ I don't really like having this wrapper, it feels like the flow will be cleaner and easier to follow if we modify the code as suggested below > + struct perf_callchain_entry *entry; > + int ret; > + > + preempt_disable(); > + entry = get_callchain_entry(); > + preempt_enable(); I'd actually consider having __get_callchain_entry() that does what get_callchain_entry() does right now under assumption that preemption/migration is disabled, and then make get_callchain_entry do preempt_disable + fetch entry + preevent_enable + return entry dance. This will simplify the flow here to just with no explicit preempt_{disable,enable} visible. 
Either way all of this has assumption that we are staying on the same CPU throughout (so at the very least we need to have migration disabled) entry = get_callchain_entry(); __get_perf_callchain(entry, ...); put_callchain_entry(); BTW, is there a way to assert that either preemption or migration is currently disabled? I think both get_callchain_entry and put_callchain_entry would benefit from that pw-bot: cr > + > + if (unlikely(!entry)) > + return NULL; > + > + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); > + if (ret) { > + put_callchain_entry(entry); > + return NULL; > + } > + > + return entry; > +} > + > +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) > +{ > + put_callchain_entry(entry); > +} > + > static long __bpf_get_stackid(struct bpf_map *map, > struct perf_callchain_entry *trace, u64 flags) > { > @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, > struct perf_callchain_entry *trace; > bool kernel = !user; > u32 max_depth; > + int ret; > > if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | > BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) > return -EINVAL; > > max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); > - trace = get_perf_callchain(regs, kernel, user, max_depth, > - false, false, 0); > + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, false); > > if (unlikely(!trace)) > /* couldn't fetch the stack trace */ > return -EFAULT; > > - return __bpf_get_stackid(map, trace, flags); > + ret = __bpf_get_stackid(map, trace, flags); > + bpf_put_perf_callchain(trace); Just as above, I think get_callchain_entry + __get_perf_callchain + put_callchain_entry is better, IMO > + > + return ret; > } > > const struct bpf_func_proto bpf_get_stackid_proto = { > @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > } else if (kernel && task) { > trace = 
get_callchain_entry_for_task(task, max_depth); > } else { > - trace = get_perf_callchain(regs, kernel, user, max_depth, > - crosstask, false, 0); > + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); > } with the above suggestions this will be a pretty streamlined: trace = trace_in ?: get_callchain_entry(); if (!trace) goto err_fault; if (trace_in) { trace->nr = ... err = 0 } else if (kernel && task) { err = get_callchain_entry_for_task(trace, ...); } else { err = __get_perf_callchain(trace, ...); } if (err) goto clear; ... proceed as before, we have our stack trace inside trace ... for successful and failed paths (you'll have to duplicate this logic): if (trace != trace_in) put_callchain_entry(trace); > > - if (unlikely(!trace) || trace->nr < skip) { > + if (unlikely(!trace)) { this condition cannot happen: we either get trace_in != NULL or we get it using __get_callchain_entry and then validate it's not NULL earlier, so drop this condition > + if (may_fault) > + rcu_read_unlock(); > + goto err_fault; > + } > + if (trace->nr < skip) { > if (may_fault) > rcu_read_unlock(); > + if (!trace_in) > + bpf_put_perf_callchain(trace); do this clean up in one place, behind the new goto label? it's a bit too easy to miss this, IMO > goto err_fault; > } > > @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > /* trace/ips should not be dereferenced after this point */ > if (may_fault) > rcu_read_unlock(); now that I looked at this code, I feel like we don't really need this rcu_read_{lock,unlock}() dance (even though I added it in the first place). 
I think this RCU was supposed to be needed to keep perf_callchain_entry alive long enough, but for BPF this is guaranteed because either the BPF stack map will keep them alive by delaying put_callchain_buffer() until freeing time (after RCU Tasks Trace + RCU grace periods), or, for bpf_get_stack/bpf_get_task_stack, the BPF program itself will hold these buffers alive, again until freeing time, which is delayed until after the RCU Tasks Trace + RCU grace period. Please send this cleanup as the first patch in the series so we can review and ack this separately. Thanks! > + if (!trace_in) > + bpf_put_perf_callchain(trace); > > if (user_build_id) > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > -- > 2.48.1 >
* Re: [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-27 21:35 ` Andrii Nakryiko @ 2026-01-28 4:21 ` Tao Chen 2026-01-28 19:13 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-01-28 4:21 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On 2026/1/28 05:35, Andrii Nakryiko wrote: > On Sun, Jan 25, 2026 at 11:46 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> As Alexei noted, the values returned by get_perf_callchain() may be reused >> if a task is preempted after the BPF program enters migrate-disable >> mode. The perf callchain entries pool has a small stack of entries, and >> we can reuse it as follows: >> >> 1. get the perf callchain entry >> 2. BPF use... >> 3. put the perf callchain entry >> >> And Peter suggested that get_recursion_context be used with preemption >> disabled, so we should disable preemption on the BPF side.
>> >> Acked-by: Yonghong Song <yonghong.song@linux.dev> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- >> 1 file changed, 47 insertions(+), 8 deletions(-) >> >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >> index e77dcdc2164..6bdee6cc05f 100644 >> --- a/kernel/bpf/stackmap.c >> +++ b/kernel/bpf/stackmap.c >> @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >> #ifdef CONFIG_STACKTRACE >> struct perf_callchain_entry *entry; >> >> + preempt_disable(); >> entry = get_callchain_entry(); >> + preempt_enable(); > > pass perf_callchain_entry as input argument, to keep similar approach > to __get_perf_callchain, see below > >> >> if (!entry) >> return NULL; >> @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >> to[i] = (u64)(from[i]); >> } >> >> - put_callchain_entry(entry); >> - >> return entry; >> #else /* CONFIG_STACKTRACE */ >> return NULL; >> #endif >> } >> >> +static struct perf_callchain_entry * >> +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, >> + bool crosstask) >> +{ > > I don't really like having this wrapper, it feels like the flow will > be cleaner and easier to follow if we modify the code as suggested > below > Ok, will use it directly. >> + struct perf_callchain_entry *entry; >> + int ret; >> + >> + preempt_disable(); >> + entry = get_callchain_entry(); >> + preempt_enable(); > > I'd actually consider having __get_callchain_entry() that does what > get_callchain_entry() does right now under assumption that > preemption/migration is disabled, and then make get_callchain_entry do > preempt_disable + fetch entry + preevent_enable + return entry dance. > in v4, YongHong suggested add preempt_disable in get_callchain_entry, but Peter suggested that do it from BPF side, so maybe keeping the existing method is a compromise. 
> This will simplify the flow here to just with no explicit > preempt_{disable,enable} visible. Either way all of this has > assumption that we are staying on the same CPU throughout (so at the > very least we need to have migration disabled) > > entry = get_callchain_entry(); > __get_perf_callchain(entry, ...); > put_callchain_entry(); > > > BTW, is there a way to assert that either preemption or migration is > currently disabled? I think both get_callchain_entry and > put_callchain_entry would benefit from that > > pw-bot: cr > > >> + >> + if (unlikely(!entry)) >> + return NULL; >> + >> + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); >> + if (ret) { >> + put_callchain_entry(entry); >> + return NULL; >> + } >> + >> + return entry; >> +} >> + >> +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) >> +{ >> + put_callchain_entry(entry); >> +} >> + >> static long __bpf_get_stackid(struct bpf_map *map, >> struct perf_callchain_entry *trace, u64 flags) >> { >> @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, >> struct perf_callchain_entry *trace; >> bool kernel = !user; >> u32 max_depth; >> + int ret; >> >> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | >> BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) >> return -EINVAL; >> >> max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); >> - trace = get_perf_callchain(regs, kernel, user, max_depth, >> - false, false, 0); >> + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, false); >> >> if (unlikely(!trace)) >> /* couldn't fetch the stack trace */ >> return -EFAULT; >> >> - return __bpf_get_stackid(map, trace, flags); >> + ret = __bpf_get_stackid(map, trace, flags); >> + bpf_put_perf_callchain(trace); > > Just as above, I think get_callchain_entry + __get_perf_callchain + > put_callchain_entry is better, IMO > >> + >> + return ret; >> } >> >> const struct 
bpf_func_proto bpf_get_stackid_proto = { >> @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> } else if (kernel && task) { >> trace = get_callchain_entry_for_task(task, max_depth); >> } else { >> - trace = get_perf_callchain(regs, kernel, user, max_depth, >> - crosstask, false, 0); >> + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); >> } > > with the above suggestions this will be a pretty streamlined: > > trace = trace_in ?: get_callchain_entry(); > if (!trace) > goto err_fault; > > if (trace_in) { > trace->nr = ... > err = 0 > } else if (kernel && task) { > err = get_callchain_entry_for_task(trace, ...); > } else { > err = __get_perf_callchain(trace, ...); > } > if (err) > goto clear; > This code looks much cleaner, i will change it, thanks. > ... proceed as before, we have our stack trace inside trace ... > > for successful and failed paths (you'll have to duplicate this logic): > > if (trace != trace_in) > put_callchain_entry(trace); > >> >> - if (unlikely(!trace) || trace->nr < skip) { >> + if (unlikely(!trace)) { > > this condition cannot happen: we either get trace_in != NULL or we get > it using __get_callchain_entry and then validate it's not NULL > earlier, so drop this condition > will remove it. >> + if (may_fault) >> + rcu_read_unlock(); >> + goto err_fault; >> + } >> + if (trace->nr < skip) { >> if (may_fault) >> rcu_read_unlock(); >> + if (!trace_in) >> + bpf_put_perf_callchain(trace); > > do this clean up in one place, behind the new goto label? it's a bit > too easy to miss this, IMO > ok, will do. 
>> goto err_fault; >> } >> >> @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> /* trace/ips should not be dereferenced after this point */ >> if (may_fault) >> rcu_read_unlock(); > > now that I looked at this code, I feel like we don't really need this > rcu_read_{lock,unlock}() dance (even though I added it in the first > place). I think this RCU was supposed to be needed to keep > perf_callchain_entry alive long enough, but for BPF this is guaranteed > because either BPF stack map will keep them alive by delaying > put_callchain_buffer() until freeing time (after RCU Tasks Trace + RCU > grace periods), or for bpf_get_stack/bpf_get_task_stack, BPF program > itself will hold these buffers alive again, until freeing time which > is delayed until after RCU Tasks Trace + RCU grace period. It seems so; for both, put_callchain_buffer is always called at the end, which ensures it won't be released during use. I will remove it as a new patch. > > Please send this cleanup as the first patch in the series so we can > review and ack this separately. Thanks! > >> + if (!trace_in) >> + bpf_put_perf_callchain(trace); >> >> if (user_build_id) >> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); >> -- >> 2.48.1 >> -- Best Regards Tao Chen
* Re: [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-28 4:21 ` Tao Chen @ 2026-01-28 19:13 ` Andrii Nakryiko 0 siblings, 0 replies; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-28 19:13 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Tue, Jan 27, 2026 at 8:21 PM Tao Chen <chen.dylane@linux.dev> wrote: > > On 2026/1/28 05:35, Andrii Nakryiko wrote: > > On Sun, Jan 25, 2026 at 11:46 PM Tao Chen <chen.dylane@linux.dev> wrote: > >> > >> As Alexei noted, the values returned by get_perf_callchain() may be reused > >> if a task is preempted after the BPF program enters migrate-disable > >> mode. The perf callchain entries pool has a small stack of entries, and > >> we can reuse it as follows: > >> > >> 1. get the perf callchain entry > >> 2. BPF use... > >> 3. put the perf callchain entry > >> > >> And Peter suggested that get_recursion_context be used with preemption > >> disabled, so we should disable preemption on the BPF side.
> >> > >> Acked-by: Yonghong Song <yonghong.song@linux.dev> > >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> > >> --- > >> kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- > >> 1 file changed, 47 insertions(+), 8 deletions(-) > >> > >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > >> index e77dcdc2164..6bdee6cc05f 100644 > >> --- a/kernel/bpf/stackmap.c > >> +++ b/kernel/bpf/stackmap.c > >> @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > >> #ifdef CONFIG_STACKTRACE > >> struct perf_callchain_entry *entry; > >> > >> + preempt_disable(); > >> entry = get_callchain_entry(); > >> + preempt_enable(); > > > > pass perf_callchain_entry as input argument, to keep similar approach > > to __get_perf_callchain, see below > > > >> > >> if (!entry) > >> return NULL; > >> @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > >> to[i] = (u64)(from[i]); > >> } > >> > >> - put_callchain_entry(entry); > >> - > >> return entry; > >> #else /* CONFIG_STACKTRACE */ > >> return NULL; > >> #endif > >> } > >> > >> +static struct perf_callchain_entry * > >> +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, > >> + bool crosstask) > >> +{ > > > > I don't really like having this wrapper, it feels like the flow will > > be cleaner and easier to follow if we modify the code as suggested > > below > > > > Ok, will use it directly. > >> + struct perf_callchain_entry *entry; > >> + int ret; > >> + > >> + preempt_disable(); > >> + entry = get_callchain_entry(); > >> + preempt_enable(); > > > > I'd actually consider having __get_callchain_entry() that does what > > get_callchain_entry() does right now under assumption that > > preemption/migration is disabled, and then make get_callchain_entry do > > preempt_disable + fetch entry + preevent_enable + return entry dance. 
> > > > in v4, YongHong suggested add preempt_disable in get_callchain_entry, > but Peter suggested that do it from BPF side, so maybe keeping the > existing method is a compromise. yeah, I guess perf's own usage of this is happening under constant preempt_disable(), so this would be unnecessary for them. That's fine, let's keep it outside > > > This will simplify the flow here to just with no explicit > > preempt_{disable,enable} visible. Either way all of this has > > assumption that we are staying on the same CPU throughout (so at the > > very least we need to have migration disabled) > > > > entry = get_callchain_entry(); > > __get_perf_callchain(entry, ...); > > put_callchain_entry(); > > > > > > BTW, is there a way to assert that either preemption or migration is > > currently disabled? I think both get_callchain_entry and > > put_callchain_entry would benefit from that > > > > pw-bot: cr > > > > > >> + > >> + if (unlikely(!entry)) > >> + return NULL; > >> + > >> + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); > >> + if (ret) { > >> + put_callchain_entry(entry); > >> + return NULL; > >> + } > >> + > >> + return entry; > >> +} > >> + > >> +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) > >> +{ > >> + put_callchain_entry(entry); > >> +} > >> + > >> static long __bpf_get_stackid(struct bpf_map *map, > >> struct perf_callchain_entry *trace, u64 flags) > >> { > >> @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, > >> struct perf_callchain_entry *trace; > >> bool kernel = !user; > >> u32 max_depth; > >> + int ret; > >> > >> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | > >> BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) > >> return -EINVAL; > >> > >> max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); > >> - trace = get_perf_callchain(regs, kernel, user, max_depth, > >> - false, false, 0); > >> + trace = 
bpf_get_perf_callchain(regs, kernel, user, max_depth, false); > >> > >> if (unlikely(!trace)) > >> /* couldn't fetch the stack trace */ > >> return -EFAULT; > >> > >> - return __bpf_get_stackid(map, trace, flags); > >> + ret = __bpf_get_stackid(map, trace, flags); > >> + bpf_put_perf_callchain(trace); > > > > Just as above, I think get_callchain_entry + __get_perf_callchain + > > put_callchain_entry is better, IMO > > > >> + > >> + return ret; > >> } > >> > >> const struct bpf_func_proto bpf_get_stackid_proto = { > >> @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > >> } else if (kernel && task) { > >> trace = get_callchain_entry_for_task(task, max_depth); > >> } else { > >> - trace = get_perf_callchain(regs, kernel, user, max_depth, > >> - crosstask, false, 0); > >> + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); > >> } > > > > with the above suggestions this will be a pretty streamlined: > > > > trace = trace_in ?: get_callchain_entry(); > > if (!trace) > > goto err_fault; > > > > if (trace_in) { > > trace->nr = ... > > err = 0 > > } else if (kernel && task) { > > err = get_callchain_entry_for_task(trace, ...); > > } else { > > err = __get_perf_callchain(trace, ...); > > } > > if (err) > > goto clear; > > > > This code looks much cleaner, i will change it, thanks. > > > ... proceed as before, we have our stack trace inside trace ... > > > > for successful and failed paths (you'll have to duplicate this logic): > > > > if (trace != trace_in) > > put_callchain_entry(trace); > > > >> > >> - if (unlikely(!trace) || trace->nr < skip) { > >> + if (unlikely(!trace)) { > > > > this condition cannot happen: we either get trace_in != NULL or we get > > it using __get_callchain_entry and then validate it's not NULL > > earlier, so drop this condition > > > > will remove it. 
> >> + if (may_fault) > >> + rcu_read_unlock(); > >> + goto err_fault; > >> + } > >> + if (trace->nr < skip) { > >> if (may_fault) > >> rcu_read_unlock(); > >> + if (!trace_in) > >> + bpf_put_perf_callchain(trace); > > > > do this clean up in one place, behind the new goto label? it's a bit > > too easy to miss this, IMO > > > > ok, will do. > > >> goto err_fault; > >> } > >> > >> @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > >> /* trace/ips should not be dereferenced after this point */ > >> if (may_fault) > >> rcu_read_unlock(); > > > > now that I looked at this code, I feel like we don't really need this > > rcu_read_{lock,unlock}() dance (even though I added it in the first > > place). I think this RCU was supposed to be needed to keep > > perf_callchain_entry alive long enough, but for BPF this is guaranteed > > because either BPF stack map will keep them alive by delaying > > put_callchain_buffer() until freeing time (after RCU Tasks Trace + RCU > > grace periods), or for bpf_get_stack/bpf_get_task_stack, BPF program > > itself will hold these buffers alive again, until freeing time which > > is delayed until after RCU Tasks Trace + RCU grace period. > > It seems so; for both, put_callchain_buffer is always called at the end, > which ensures it won't be released during use. I will remove it as a new > patch. > > > > > > Please send this cleanup as the first patch in the series so we can > > review and ack this separately. Thanks! > > > >> + if (!trace_in) > >> + bpf_put_perf_callchain(trace); > >> > >> if (user_build_id) > >> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > >> -- > >> 2.48.1 > >> > > > -- > Best Regards > Tao Chen
end of thread, other threads:[~2026-02-06 9:20 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen
2026-01-26 8:03 ` bot+bpf-ci
2026-01-26 8:51 ` Tao Chen
2026-01-27 21:01 ` Andrii Nakryiko
2026-01-28 2:41 ` Tao Chen
2026-01-28 8:59 ` Peter Zijlstra
2026-01-28 16:52 ` Tao Chen
2026-01-28 18:59 ` Andrii Nakryiko
2026-01-29 3:03 ` Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen
2026-01-27 21:07 ` Andrii Nakryiko
2026-01-28 2:42 ` Tao Chen
2026-01-28 9:10 ` Peter Zijlstra
2026-01-28 16:49 ` Tao Chen
2026-01-28 19:12 ` Andrii Nakryiko
2026-01-30 11:31 ` Peter Zijlstra
2026-01-30 20:04 ` Andrii Nakryiko
2026-02-02 19:59 ` Peter Zijlstra
2026-02-04 0:24 ` Andrii Nakryiko
2026-02-04 1:08 ` Andrii Nakryiko
2026-02-05 6:16 ` Tao Chen
2026-02-05 17:34 ` Andrii Nakryiko
2026-02-06 9:20 ` Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely Tao Chen
2026-01-27 21:35 ` Andrii Nakryiko
2026-01-28 4:21 ` Tao Chen
2026-01-28 19:13 ` Andrii Nakryiko