* [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain
@ 2026-01-26 7:43 Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw)
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen
Background
==========
Alexei noted that we should use preempt_disable to protect get_perf_callchain
in the BPF stackmap:
https://lore.kernel.org/bpf/CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com
A previous patch attempted to fix this issue, and Andrii suggested
teaching get_perf_callchain to accept an external buffer so that we can
pass our buffer directly and avoid the unnecessary copy:
https://lore.kernel.org/bpf/20250926153952.1661146-1-chen.dylane@linux.dev
Proposed Solution
=================
Add an external perf_callchain_entry parameter to get_perf_callchain so
that the BPF side can pass in its own buffer. The biggest advantage is
that it eliminates an unnecessary copy.
Todo
====
I am not sure whether this modification is appropriate; after all, the
implementation of get_callchain_entry in the perf subsystem is
considerably more complex than simply using an external buffer.
Comments and suggestions are always welcome.
Change list:
- v1 -> v2
From Jiri
- rebase code, fix conflict
- v1: https://lore.kernel.org/bpf/20251013174721.2681091-1-chen.dylane@linux.dev
- v2 -> v3:
From Andrii
- entries per CPU used in a stack-like fashion
- v2: https://lore.kernel.org/bpf/20251014100128.2721104-1-chen.dylane@linux.dev
- v3 -> v4:
From Peter
- refactor get_perf_callchain and add three new APIs to make the perf
callchain easier to use.
From Andrii
- reuse the perf callchain management.
- rename patch1 and patch2.
- v3: https://lore.kernel.org/bpf/20251019170118.2955346-1-chen.dylane@linux.dev
- v4 -> v5:
From Yonghong
- keep add_mark false in stackmap when refactoring get_perf_callchain in
patch1.
- add atomic operation in get_recursion_context in patch2.
- rename bpf_put_callchain_entry with bpf_put_perf_callchain in
patch3.
- rebase bpf-next master.
- v4: https://lore.kernel.org/bpf/20251028162502.3418817-1-chen.dylane@linux.dev
- v5 -> v6:
From Peter
- disable preemption from BPF side in patch2.
From AI
- use ctx->entry->nr instead of ctx->nr in patch1.
- v5: https://lore.kernel.org/bpf/20251109163559.4102849-1-chen.dylane@linux.dev
- v6 -> v7:
From Yonghong
- Add ack in patch2
From AI
- resolve conflict
- v6: https://lore.kernel.org/bpf/20251112163148.100949-1-chen.dylane@linux.dev
- v7 -> v8:
From Andrii
- Record rctx in perf_callchain_entry in patch1
- Export __get_perf_callchain in patch2
- v7: https://lore.kernel.org/bpf/20251217093326.1745307-1-chen.dylane@linux.dev
Tao Chen (3):
perf: Add rctx in perf_callchain_entry
perf: Refactor get_perf_callchain
bpf: Hold the perf callchain entry until used completely
include/linux/perf_event.h | 10 +++++--
kernel/bpf/stackmap.c | 58 ++++++++++++++++++++++++++++++-------
kernel/events/callchain.c | 59 ++++++++++++++++++++++++--------------
3 files changed, 93 insertions(+), 34 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 28+ messages in thread* [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen @ 2026-01-26 7:43 ` Tao Chen 2026-01-26 8:03 ` bot+bpf-ci 2026-01-28 8:59 ` Peter Zijlstra 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold ther perf callchain entry until used completely Tao Chen 2 siblings, 2 replies; 28+ messages in thread From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw) To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, Tao Chen Record rctx inside the perf_callchain_entry itself, when callers of get_callchain_entry no longer care about the assignment of rctx, and will be used in the next patch. 
Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> --- include/linux/perf_event.h | 5 +++-- kernel/bpf/stackmap.c | 5 ++--- kernel/events/callchain.c | 27 ++++++++++++++++----------- 3 files changed, 21 insertions(+), 16 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 9870d768db4..f0489843ebc 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -57,6 +57,7 @@ #include <asm/local.h> struct perf_callchain_entry { + int rctx; u64 nr; u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ }; @@ -1723,8 +1724,8 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie); extern int get_callchain_buffers(int max_stack); extern void put_callchain_buffers(void); -extern struct perf_callchain_entry *get_callchain_entry(int *rctx); -extern void put_callchain_entry(int rctx); +extern struct perf_callchain_entry *get_callchain_entry(void); +extern void put_callchain_entry(struct perf_callchain_entry *entry); extern int sysctl_perf_event_max_stack; extern int sysctl_perf_event_max_contexts_per_stack; diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index da3d328f5c1..e77dcdc2164 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -214,9 +214,8 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) { #ifdef CONFIG_STACKTRACE struct perf_callchain_entry *entry; - int rctx; - entry = get_callchain_entry(&rctx); + entry = get_callchain_entry(); if (!entry) return NULL; @@ -238,7 +237,7 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) to[i] = (u64)(from[i]); } - put_callchain_entry(rctx); + put_callchain_entry(entry); return entry; #else /* CONFIG_STACKTRACE */ diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index b9c7e00725d..6cdbc5937b1 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -151,31 
+151,36 @@ void put_callchain_buffers(void) } } -struct perf_callchain_entry *get_callchain_entry(int *rctx) +struct perf_callchain_entry *get_callchain_entry(void) { int cpu; + int rctx; struct callchain_cpus_entries *entries; + struct perf_callchain_entry *entry; - *rctx = get_recursion_context(this_cpu_ptr(callchain_recursion)); - if (*rctx == -1) + rctx = get_recursion_context(this_cpu_ptr(callchain_recursion)); + if (rctx == -1) return NULL; entries = rcu_dereference(callchain_cpus_entries); if (!entries) { - put_recursion_context(this_cpu_ptr(callchain_recursion), *rctx); + put_recursion_context(this_cpu_ptr(callchain_recursion), rctx); return NULL; } cpu = smp_processor_id(); - return (((void *)entries->cpu_entries[cpu]) + - (*rctx * perf_callchain_entry__sizeof())); + entry = ((void *)entries->cpu_entries[cpu]) + + (rctx * perf_callchain_entry__sizeof()); + entry->rctx = rctx; + + return entry; } void -put_callchain_entry(int rctx) +put_callchain_entry(struct perf_callchain_entry *entry) { - put_recursion_context(this_cpu_ptr(callchain_recursion), rctx); + put_recursion_context(this_cpu_ptr(callchain_recursion), entry->rctx); } static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry, @@ -222,13 +227,13 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, { struct perf_callchain_entry *entry; struct perf_callchain_entry_ctx ctx; - int rctx, start_entry_idx; + int start_entry_idx; /* crosstask is not supported for user stacks */ if (crosstask && user && !kernel) return NULL; - entry = get_callchain_entry(&rctx); + entry = get_callchain_entry(); if (!entry) return NULL; @@ -272,7 +277,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, } exit_put: - put_callchain_entry(rctx); + put_callchain_entry(entry); return entry; } -- 2.48.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen @ 2026-01-26 8:03 ` bot+bpf-ci 2026-01-26 8:51 ` Tao Chen 2026-01-28 8:59 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: bot+bpf-ci @ 2026-01-26 8:03 UTC (permalink / raw) To: chen.dylane, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, chen.dylane, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 1734 bytes --] > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 9ded2e582c60..0330f265173d 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -57,6 +57,7 @@ > #include <asm/local.h> > > struct perf_callchain_entry { > + int rctx; > u64 nr; > u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ > }; Adding rctx before nr changes the structure layout in a way that appears to break the perf userspace ABI. In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN is set, the code does: int size = 1; size += data->callchain->nr; size *= sizeof(u64); __output_copy(handle, data->callchain, size); This copies (1 + nr) * 8 bytes starting from the beginning of the structure. Before this patch, offset 0 contained nr (u64), followed by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes padding, then nr at offset 8, then ip[] at offset 16. 
With the new layout, this copy sends the following to userspace: - First 8 bytes: rctx + padding (kernel-internal data) instead of nr - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] This means the callchain data sent to userspace is shifted by 8 bytes, the last IP is truncated, and the rctx value is leaked to userspace. Is there a plan to update perf_output_sample() to output starting from &entry->nr instead of the structure start? Or am I missing something about how this structure is used for userspace output? [ ... ] --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 8:03 ` bot+bpf-ci @ 2026-01-26 8:51 ` Tao Chen 2026-01-27 21:01 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-01-26 8:51 UTC (permalink / raw) To: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, martin.lau, clm, ihor.solodrai 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道: >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >> index 9ded2e582c60..0330f265173d 100644 >> --- a/include/linux/perf_event.h >> +++ b/include/linux/perf_event.h >> @@ -57,6 +57,7 @@ >> #include <asm/local.h> >> >> struct perf_callchain_entry { >> + int rctx; >> u64 nr; >> u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ >> }; > > Adding rctx before nr changes the structure layout in a way that > appears to break the perf userspace ABI. > > In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN > is set, the code does: > > int size = 1; > size += data->callchain->nr; > size *= sizeof(u64); > __output_copy(handle, data->callchain, size); > > This copies (1 + nr) * 8 bytes starting from the beginning of the > structure. Before this patch, offset 0 contained nr (u64), followed > by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes > padding, then nr at offset 8, then ip[] at offset 16. > > With the new layout, this copy sends the following to userspace: > - First 8 bytes: rctx + padding (kernel-internal data) instead of nr > - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] > > This means the callchain data sent to userspace is shifted by 8 bytes, > the last IP is truncated, and the rctx value is leaked to userspace. 
> > Is there a plan to update perf_output_sample() to output starting from > &entry->nr instead of the structure start? Or am I missing something > about how this structure is used for userspace output? > As AI said, this change shifts the memory layout, which effectively breaks the userspace ABI. Maybe we can use __output_copy(handle, &data->callchain->nr, size); > [ ... ] > > > --- > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-26 8:51 ` Tao Chen @ 2026-01-27 21:01 ` Andrii Nakryiko 2026-01-28 2:41 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-27 21:01 UTC (permalink / raw) To: Tao Chen Cc: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf, martin.lau, clm, ihor.solodrai On Mon, Jan 26, 2026 at 12:51 AM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道: > >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > >> index 9ded2e582c60..0330f265173d 100644 > >> --- a/include/linux/perf_event.h > >> +++ b/include/linux/perf_event.h > >> @@ -57,6 +57,7 @@ > >> #include <asm/local.h> > >> > >> struct perf_callchain_entry { > >> + int rctx; > >> u64 nr; > >> u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ > >> }; > > > > Adding rctx before nr changes the structure layout in a way that > > appears to break the perf userspace ABI. > > > > In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN > > is set, the code does: > > > > int size = 1; > > size += data->callchain->nr; > > size *= sizeof(u64); > > __output_copy(handle, data->callchain, size); > > > > This copies (1 + nr) * 8 bytes starting from the beginning of the > > structure. Before this patch, offset 0 contained nr (u64), followed > > by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes > > padding, then nr at offset 8, then ip[] at offset 16. 
> > > > With the new layout, this copy sends the following to userspace: > > - First 8 bytes: rctx + padding (kernel-internal data) instead of nr > > - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] > > > > This means the callchain data sent to userspace is shifted by 8 bytes, > > the last IP is truncated, and the rctx value is leaked to userspace. > > > > Is there a plan to update perf_output_sample() to output starting from > > &entry->nr instead of the structure start? Or am I missing something > > about how this structure is used for userspace output? > > > > As AI said, this change shifts the memory layout, which effectively > breaks the userspace ABI. > > Maybe we can use __output_copy(handle, &data->callchain->nr, size); yep, very impressive for AI to notice this. I agree that &data->callchain->nr seems like the best way forward. > > > [ ... ] > > > > > > --- > > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 > > > -- > Best Regards > Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-27 21:01 ` Andrii Nakryiko @ 2026-01-28 2:41 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-28 2:41 UTC (permalink / raw) To: Andrii Nakryiko Cc: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf, martin.lau, clm, ihor.solodrai 在 2026/1/28 05:01, Andrii Nakryiko 写道: > On Mon, Jan 26, 2026 at 12:51 AM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道: >>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >>>> index 9ded2e582c60..0330f265173d 100644 >>>> --- a/include/linux/perf_event.h >>>> +++ b/include/linux/perf_event.h >>>> @@ -57,6 +57,7 @@ >>>> #include <asm/local.h> >>>> >>>> struct perf_callchain_entry { >>>> + int rctx; >>>> u64 nr; >>>> u64 ip[]; /* /proc/sys/kernel/perf_event_max_stack */ >>>> }; >>> >>> Adding rctx before nr changes the structure layout in a way that >>> appears to break the perf userspace ABI. >>> >>> In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN >>> is set, the code does: >>> >>> int size = 1; >>> size += data->callchain->nr; >>> size *= sizeof(u64); >>> __output_copy(handle, data->callchain, size); >>> >>> This copies (1 + nr) * 8 bytes starting from the beginning of the >>> structure. Before this patch, offset 0 contained nr (u64), followed >>> by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes >>> padding, then nr at offset 8, then ip[] at offset 16. 
>>> >>> With the new layout, this copy sends the following to userspace: >>> - First 8 bytes: rctx + padding (kernel-internal data) instead of nr >>> - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1] >>> >>> This means the callchain data sent to userspace is shifted by 8 bytes, >>> the last IP is truncated, and the rctx value is leaked to userspace. >>> >>> Is there a plan to update perf_output_sample() to output starting from >>> &entry->nr instead of the structure start? Or am I missing something >>> about how this structure is used for userspace output? >>> >> >> As AI said, this change shifts the memory layout, which effectively >> breaks the userspace ABI. >> >> Maybe we can use __output_copy(handle, &data->callchain->nr, size); > > yep, very impressive for AI to notice this. I agree that > &data->callchain->nr seems like the best way forward. > will fix it in v9. >> >>> [ ... ] >>> >>> >>> --- >>> AI reviewed your patch. Please fix the bug or email reply why it's not a bug. >>> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md >>> >>> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831 >> >> >> -- >> Best Regards >> Tao Chen -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry
  2026-01-26  7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen
  2026-01-26  8:03   ` bot+bpf-ci
@ 2026-01-28  8:59   ` Peter Zijlstra
  2026-01-28 16:52     ` Tao Chen
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2026-01-28 8:59 UTC (permalink / raw)
To: Tao Chen
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
    irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii,
    martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
    haoluo, linux-perf-users, linux-kernel, bpf

On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
> Record rctx inside the perf_callchain_entry itself, when callers of
> get_callchain_entry no longer care about the assignment of rctx, and
> will be used in the next patch.

Sorry, what?

The recursion context is very much about the caller, and very much not
about the data. This just doesn't make sense.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry
  2026-01-28  8:59 ` Peter Zijlstra
@ 2026-01-28 16:52   ` Tao Chen
  2026-01-28 18:59     ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Tao Chen @ 2026-01-28 16:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
    irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii,
    martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf,
    haoluo, linux-perf-users, linux-kernel, bpf

On 2026/1/28 16:59, Peter Zijlstra wrote:
> On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
>> Record rctx inside the perf_callchain_entry itself, when callers of
>> get_callchain_entry no longer care about the assignment of rctx, and
>> will be used in the next patch.
>
> Sorry, what?
>
> The recursion context is very much about the caller, and very much not
> about the data. This just doesn't make sense.

Well, Andrii, it seems we have to go back to the original way.

--
Best Regards
Tao Chen

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-28 16:52 ` Tao Chen @ 2026-01-28 18:59 ` Andrii Nakryiko 2026-01-29 3:03 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-28 18:59 UTC (permalink / raw) To: Tao Chen Cc: Peter Zijlstra, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Jan 28, 2026 at 8:53 AM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/1/28 16:59, Peter Zijlstra 写道: > > On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote: > >> Record rctx inside the perf_callchain_entry itself, when callers of > >> get_callchain_entry no longer care about the assignment of rctx, and > >> will be used in the next patch. > > > > Sorry, what? > > > > The recursion context is very much about the caller, and very much not > > about the data. This just doesn't make sense. > > Well, Andrii, it seems we have to go back to the original way. Huh, why? Peter is confused by your wording, he didn't say that what you are doing is broken. The point is to couple rctx with the exact perf_callchain_entry returned to the caller in that recursion context. There is no contradiction. It's purely about simplifying the interface. While previously the caller would have to juggle perf_callchain_entry and rctx separately (but still ensure that entry and rctx do match each other when calling put_callchain_entry), now it's more convenient and less error-prone because returned entry will record rctx it was retrieved with. That perf_callchain_entry should not be reused by anyone else between successful get_callchain_entry and put_callchain_entry, so there is no problem here. > > -- > Best Regards > Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry 2026-01-28 18:59 ` Andrii Nakryiko @ 2026-01-29 3:03 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-29 3:03 UTC (permalink / raw) To: Andrii Nakryiko Cc: Peter Zijlstra, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/1/29 02:59, Andrii Nakryiko 写道: > On Wed, Jan 28, 2026 at 8:53 AM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2026/1/28 16:59, Peter Zijlstra 写道: >>> On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote: >>>> Record rctx inside the perf_callchain_entry itself, when callers of >>>> get_callchain_entry no longer care about the assignment of rctx, and >>>> will be used in the next patch. >>> >>> Sorry, what? >>> >>> The recursion context is very much about the caller, and very much not >>> about the data. This just doesn't make sense. >> >> Well, Andrii, it seems we have to go back to the original way. > > Huh, why? Peter is confused by your wording, he didn't say that what > you are doing is broken. > I see, will update the changelog, thanks. > The point is to couple rctx with the exact perf_callchain_entry > returned to the caller in that recursion context. There is no > contradiction. > > It's purely about simplifying the interface. While previously the > caller would have to juggle perf_callchain_entry and rctx separately > (but still ensure that entry and rctx do match each other when calling > put_callchain_entry), now it's more convenient and less error-prone > because returned entry will record rctx it was retrieved with. > > That perf_callchain_entry should not be reused by anyone else between > successful get_callchain_entry and put_callchain_entry, so there is no > problem here. 
> >> >> -- >> Best Regards >> Tao Chen -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen @ 2026-01-26 7:43 ` Tao Chen 2026-01-27 21:07 ` Andrii Nakryiko ` (2 more replies) 2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold ther perf callchain entry until used completely Tao Chen 2 siblings, 3 replies; 28+ messages in thread From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw) To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, Tao Chen From BPF stack map, we want to ensure that the callchain buffer will not be overwritten by other preemptive tasks and we also aim to reduce the preempt disable interval, Based on the suggestions from Peter and Andrrii, export new API __get_perf_callchain and the usage scenarios are as follows from BPF side: preempt_disable() entry = get_callchain_entry() preempt_enable() __get_perf_callchain(entry) put_callchain_entry(entry) Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> --- include/linux/perf_event.h | 5 +++++ kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ 2 files changed, 27 insertions(+), 12 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f0489843ebc..7132fa97bb1 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1722,6 +1722,11 @@ extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct extern struct perf_callchain_entry * get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie); + +extern int __get_perf_callchain(struct perf_callchain_entry 
*entry, struct pt_regs *regs, + bool kernel, bool user, u32 max_stack, bool crosstask, bool add_mark, + u64 defer_cookie); + extern int get_callchain_buffers(int max_stack); extern void put_callchain_buffers(void); extern struct perf_callchain_entry *get_callchain_entry(void); diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index 6cdbc5937b1..f9789d10fa4 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -221,21 +221,15 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr #endif } -struct perf_callchain_entry * -get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, - u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) +int __get_perf_callchain(struct perf_callchain_entry *entry, struct pt_regs *regs, bool kernel, + bool user, u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) { - struct perf_callchain_entry *entry; struct perf_callchain_entry_ctx ctx; int start_entry_idx; /* crosstask is not supported for user stacks */ if (crosstask && user && !kernel) - return NULL; - - entry = get_callchain_entry(); - if (!entry) - return NULL; + return -EINVAL; ctx.entry = entry; ctx.max_stack = max_stack; @@ -252,7 +246,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, if (user && !crosstask) { if (!user_mode(regs)) { if (current->flags & (PF_KTHREAD | PF_USER_WORKER)) - goto exit_put; + return 0; regs = task_pt_regs(current); } @@ -265,7 +259,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, */ perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED); perf_callchain_store_context(&ctx, defer_cookie); - goto exit_put; + return 0; } if (add_mark) @@ -275,9 +269,25 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, perf_callchain_user(&ctx, regs); fixup_uretprobe_trampoline_entries(entry, start_entry_idx); } + return 0; +} + +struct perf_callchain_entry * +get_perf_callchain(struct pt_regs *regs, bool kernel, 
bool user, + u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) +{ + struct perf_callchain_entry *entry; + int ret; + + entry = get_callchain_entry(); + if (!entry) + return NULL; -exit_put: + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, add_mark, + defer_cookie); put_callchain_entry(entry); + if (ret) + entry = NULL; return entry; } -- 2.48.1 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen @ 2026-01-27 21:07 ` Andrii Nakryiko 2026-01-28 2:42 ` Tao Chen 2026-01-28 9:10 ` Peter Zijlstra 2026-02-04 1:08 ` Andrii Nakryiko 2 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-27 21:07 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: > > From BPF stack map, we want to ensure that the callchain buffer > will not be overwritten by other preemptive tasks and we also aim > to reduce the preempt disable interval, Based on the suggestions from Peter > and Andrrii, export new API __get_perf_callchain and the usage scenarios > are as follows from BPF side: > > preempt_disable() > entry = get_callchain_entry() > preempt_enable() > __get_perf_callchain(entry) > put_callchain_entry(entry) > > Suggested-by: Andrii Nakryiko <andrii@kernel.org> > Signed-off-by: Tao Chen <chen.dylane@linux.dev> > --- > include/linux/perf_event.h | 5 +++++ > kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ > 2 files changed, 27 insertions(+), 12 deletions(-) [...] 
> +struct perf_callchain_entry * > +get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, > + u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) > +{ > + struct perf_callchain_entry *entry; > + int ret; > + > + entry = get_callchain_entry(); > + if (!entry) > + return NULL; > > -exit_put: > + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, add_mark, > + defer_cookie); > put_callchain_entry(entry); > + if (ret) > + entry = NULL; > purely stylistical nit, so this can be ignored if you disagree, but I find code that modifies some variable before returning it slightly less preferable to more explicit: if (__get_perf_callchain(...)) { put_callchain_entry(entry); return NULL; } return entry; > return entry; > } > -- > 2.48.1 > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-27 21:07 ` Andrii Nakryiko @ 2026-01-28 2:42 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-28 2:42 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On 2026/1/28 05:07, Andrii Nakryiko wrote: > On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> From the BPF stack map, we want to ensure that the callchain buffer >> will not be overwritten by other preempting tasks, and we also aim >> to reduce the preempt-disabled interval. Based on the suggestions from Peter >> and Andrii, export a new API, __get_perf_callchain; the usage from the BPF >> side is as follows: >> >> preempt_disable() >> entry = get_callchain_entry() >> preempt_enable() >> __get_perf_callchain(entry) >> put_callchain_entry(entry) >> >> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> include/linux/perf_event.h | 5 +++++ >> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ >> 2 files changed, 27 insertions(+), 12 deletions(-) > > [...] 
> >> +struct perf_callchain_entry * >> +get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, >> + u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie) >> +{ >> + struct perf_callchain_entry *entry; >> + int ret; >> + >> + entry = get_callchain_entry(); >> + if (!entry) >> + return NULL; >> >> -exit_put: >> + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, add_mark, >> + defer_cookie); >> put_callchain_entry(entry); >> + if (ret) >> + entry = NULL; >> > > purely stylistical nit, so this can be ignored if you disagree, but I > find code that modifies some variable before returning it slightly > less preferable to more explicit: > > > if (__get_perf_callchain(...)) { > put_callchain_entry(entry); > return NULL; > } > > return entry; > >> return entry; >> } > agree, will change it in v9, thanks. >> -- >> 2.48.1 >> -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen 2026-01-27 21:07 ` Andrii Nakryiko @ 2026-01-28 9:10 ` Peter Zijlstra 2026-01-28 16:49 ` Tao Chen 2026-01-28 19:12 ` Andrii Nakryiko 2026-02-04 1:08 ` Andrii Nakryiko 2 siblings, 2 replies; 28+ messages in thread From: Peter Zijlstra @ 2026-01-28 9:10 UTC (permalink / raw) To: Tao Chen Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > From BPF stack map, we want to ensure that the callchain buffer > will not be overwritten by other preemptive tasks and we also aim > to reduce the preempt disable interval, Based on the suggestions from Peter > and Andrrii, export new API __get_perf_callchain and the usage scenarios > are as follows from BPF side: > > preempt_disable() > entry = get_callchain_entry() > preempt_enable() > __get_perf_callchain(entry) > put_callchain_entry(entry) That makes no sense, this means any other task on that CPU is getting screwed over. Why are you worried about the preempt_disable() here? If this were an interrupt context we'd still do that unwind -- but then with IRQs disabled. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-28 9:10 ` Peter Zijlstra @ 2026-01-28 16:49 ` Tao Chen 2026-01-28 19:12 ` Andrii Nakryiko 1 sibling, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-01-28 16:49 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On 2026/1/28 17:10, Peter Zijlstra wrote: > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: >> From the BPF stack map, we want to ensure that the callchain buffer >> will not be overwritten by other preempting tasks, and we also aim >> to reduce the preempt-disabled interval. Based on the suggestions from Peter >> and Andrii, export a new API, __get_perf_callchain; the usage from the BPF >> side is as follows: >> >> preempt_disable() >> entry = get_callchain_entry() >> preempt_enable() >> __get_perf_callchain(entry) >> put_callchain_entry(entry) > > That makes no sense, this means any other task on that CPU is getting > screwed over. > > Why are you worried about the preempt_disable() here? If this were an > interrupt context we'd still do that unwind -- but then with IRQs > disabled. Hi Peter, Right now, obtaining stack information in BPF involves two steps: 1. get the callchain 2. store the callchain in a BPF map or copy it to a buffer There is no preempt disable in BPF today. When obtaining the stack information of Process A, Process A may be preempted by Process B, and by the same logic we then acquire the stack information of Process B. However, when execution resumes in Process A, the callchain buffer holds the stack information of Process B, because each context (task, soft irq, irq, NMI) has only one callchain entry. 
taskA taskB 1. callchain(A) = get_perf_callchain <-- preempted by B callchain(B) = get_perf_callchain 2. stack_map(callchain(B)) So we want to ensure that while task A's entry is in use, the preempting task B cannot use it. The approach is to defer put_callchain_entry until the stack has been captured and saved in the stack map. taskA taskB 1. callchain(A) = __get_perf_callchain <-- preempted by B callchain(B) = __get_perf_callchain 2. stack_map(callchain(A)) 3. put_callchain_entry() Task B cannot get the callchain entry because task A holds it. And the preempt_disable() around get_callchain_entry was suggested by Yonghong in v4: https://lore.kernel.org/bpf/c352f357-1417-47b5-9d8c-28d99f20f5a6@linux.dev/ Please correct me if I'm mistaken. Thanks. -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-28 9:10 ` Peter Zijlstra 2026-01-28 16:49 ` Tao Chen @ 2026-01-28 19:12 ` Andrii Nakryiko 2026-01-30 11:31 ` Peter Zijlstra 1 sibling, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-28 19:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > > From BPF stack map, we want to ensure that the callchain buffer > > will not be overwritten by other preemptive tasks and we also aim > > to reduce the preempt disable interval, Based on the suggestions from Peter > > and Andrrii, export new API __get_perf_callchain and the usage scenarios > > are as follows from BPF side: > > > > preempt_disable() > > entry = get_callchain_entry() > > preempt_enable() > > __get_perf_callchain(entry) > > put_callchain_entry(entry) > > That makes no sense, this means any other task on that CPU is getting > screwed over. Yes, unfortunately, but unless we dynamically allocate new entry each time and/or keep per-current entry cached there isn't much choice we have here, no? Maybe that's what we have to do, honestly, because get_perf_callchain() usage we have right now from sleepable BPF isn't great no matter with or without changes in this patch set. > > Why are you worried about the preempt_disable() here? If this were an > interrupt context we'd still do that unwind -- but then with IRQs > disabled. 
Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called from sleepable/faultable context and also we can do a rather expensive build ID resolution (either in sleepable or not, which only changes if build ID parsing logic waits for file backed pages to be paged in or not). ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-28 19:12 ` Andrii Nakryiko @ 2026-01-30 11:31 ` Peter Zijlstra 2026-01-30 20:04 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-01-30 11:31 UTC (permalink / raw) To: Andrii Nakryiko Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Jan 28, 2026 at 11:12:09AM -0800, Andrii Nakryiko wrote: > On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > > > From BPF stack map, we want to ensure that the callchain buffer > > > will not be overwritten by other preemptive tasks and we also aim > > > to reduce the preempt disable interval, Based on the suggestions from Peter > > > and Andrrii, export new API __get_perf_callchain and the usage scenarios > > > are as follows from BPF side: > > > > > > preempt_disable() > > > entry = get_callchain_entry() > > > preempt_enable() > > > __get_perf_callchain(entry) > > > put_callchain_entry(entry) > > > > That makes no sense, this means any other task on that CPU is getting > > screwed over. > > Yes, unfortunately, but unless we dynamically allocate new entry each > time and/or keep per-current entry cached there isn't much choice we > have here, no? > > Maybe that's what we have to do, honestly, because > get_perf_callchain() usage we have right now from sleepable BPF isn't > great no matter with or without changes in this patch set. All of the perf stuff is based on the fact that if we do it from IRQ/NMI context, we can certainly do it with preemption disabled. Bending the interface this way is just horrible. > > Why are you worried about the preempt_disable() here? 
If this were an > > interrupt context we'd still do that unwind -- but then with IRQs > > disabled. > > Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called > from sleepable/faultable context and also we can do a rather expensive > build ID resolution (either in sleepable or not, which only changes if > build ID parsing logic waits for file backed pages to be paged in or > not). <rant> So stack_map_get_build_id_offset() is a piece of crap -- and I've always said it was. And I hate that any of that ever got merged -- it's the pinnacle of bad engineering and simply refusing to do the right thing in the interest of hack now, fix never :/ </rant> Anyway, as you well know, we now *do* have lockless vma lookups and should be able to do this buildid thing much saner. Also, there appears to be no buildid caching whatsoever, surely that would help some. (and I'm not sure I've ever understood why the buildid crap needs to be in this path in any case) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-30 11:31 ` Peter Zijlstra @ 2026-01-30 20:04 ` Andrii Nakryiko 2026-02-02 19:59 ` Peter Zijlstra 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-30 20:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Fri, Jan 30, 2026 at 3:31 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Jan 28, 2026 at 11:12:09AM -0800, Andrii Nakryiko wrote: > > On Wed, Jan 28, 2026 at 1:10 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Mon, Jan 26, 2026 at 03:43:30PM +0800, Tao Chen wrote: > > > > From BPF stack map, we want to ensure that the callchain buffer > > > > will not be overwritten by other preemptive tasks and we also aim > > > > to reduce the preempt disable interval, Based on the suggestions from Peter > > > > and Andrrii, export new API __get_perf_callchain and the usage scenarios > > > > are as follows from BPF side: > > > > > > > > preempt_disable() > > > > entry = get_callchain_entry() > > > > preempt_enable() > > > > __get_perf_callchain(entry) > > > > put_callchain_entry(entry) > > > > > > That makes no sense, this means any other task on that CPU is getting > > > screwed over. > > > > Yes, unfortunately, but unless we dynamically allocate new entry each > > time and/or keep per-current entry cached there isn't much choice we > > have here, no? > > > > Maybe that's what we have to do, honestly, because > > get_perf_callchain() usage we have right now from sleepable BPF isn't > > great no matter with or without changes in this patch set. > > All of the perf stuff is based on the fact that if we do it from IRQ/NMI > context, we can certainly do it with preemption disabled. 
> > Bending the interface this way is just horrible. Sure, but I'm just trying to help mitigate the issue at hand (way too long preemption disabled region). I agree that we should do something better in terms of perf_callchain_entry retrieval and reuse, but maybe one thing at a time? > > > > Why are you worried about the preempt_disable() here? If this were an > > > interrupt context we'd still do that unwind -- but then with IRQs > > > disabled. > > > > Because __bpf_get_stack() from kernel/bpf/stackmap.c can be called > > from sleepable/faultable context and also we can do a rather expensive > > build ID resolution (either in sleepable or not, which only changes if > > build ID parsing logic waits for file backed pages to be paged in or > > not). > > <rant> > So stack_map_get_build_id_offset() is a piece of crap -- and I've always > said it was. And I hate that any of that ever got merged -- its the > pinnacle of bad engineering and simply refusing to do the right thing in > the interest of hack now, fix never :/ > </rant> <hug><nod></hug> > > Anyway, as you well know, we now *do* have lockless vma lookups and > should be able to do this buildid thing much saner. Yes, probably, and I am aware of a mmap_lock use inside stack_map_get_build_id_offset() being problematic, we'll need to fix this as well. One step at a time. > > Also, there appears to be no buildid caching what so ever, surely that > would help some. Jiri Olsa proposed caching build id per file or per inode some time back, there was vehement opposition to it. And doing some locked global resizable hash that might need to be used from NMI sounds horrible, tbh. So we have what we have today. > > (and I'm not sure I've ever understood why the buildid crap needs to be > in this path in any case) Yeah, perhaps, I haven't dealt with stack_map_get_build_id_offset() much, so will need to go a bit deeper and analyze. As I said we have mmap_lock problem there, so we need to address that. 
I'll think if/how we can improve this. Do I understand correctly that you'd rather just not touch all this for now and we should just drop this patch set? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-30 20:04 ` Andrii Nakryiko @ 2026-02-02 19:59 ` Peter Zijlstra 2026-02-04 0:24 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Peter Zijlstra @ 2026-02-02 19:59 UTC (permalink / raw) To: Andrii Nakryiko Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Fri, Jan 30, 2026 at 12:04:45PM -0800, Andrii Nakryiko wrote: > > Also, there appears to be no buildid caching what so ever, surely that > > would help some. > > Jiri Olsa proposed caching build id per file or per inode some time > back, there was vehement opposition to it. And doing some locked > global resizable hash that might need to be used from NMI sounds > horrible, tbh. So we have what we have today. See kernel/module/tree_lookup.c and include/linux/rbtree_latch.h :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-02 19:59 ` Peter Zijlstra @ 2026-02-04 0:24 ` Andrii Nakryiko 0 siblings, 0 replies; 28+ messages in thread From: Andrii Nakryiko @ 2026-02-04 0:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Tao Chen, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Mon, Feb 2, 2026 at 11:59 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Jan 30, 2026 at 12:04:45PM -0800, Andrii Nakryiko wrote: > > > > Also, there appears to be no buildid caching what so ever, surely that > > > would help some. > > > > Jiri Olsa proposed caching build id per file or per inode some time > > back, there was vehement opposition to it. And doing some locked > > global resizable hash that might need to be used from NMI sounds > > horrible, tbh. So we have what we have today. > > See kernel/module/tree_lookup.c and include/linux/rbtree_latch.h :-) Ah, neat idea. Each node is really two parallel nodes for each of the latched copies. TIL, good to know and thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen 2026-01-27 21:07 ` Andrii Nakryiko 2026-01-28 9:10 ` Peter Zijlstra @ 2026-02-04 1:08 ` Andrii Nakryiko 2026-02-05 6:16 ` Tao Chen 2 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-02-04 1:08 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: > > From BPF stack map, we want to ensure that the callchain buffer > will not be overwritten by other preemptive tasks and we also aim > to reduce the preempt disable interval, Based on the suggestions from Peter > and Andrrii, export new API __get_perf_callchain and the usage scenarios > are as follows from BPF side: > > preempt_disable() > entry = get_callchain_entry() > preempt_enable() > __get_perf_callchain(entry) > put_callchain_entry(entry) > > Suggested-by: Andrii Nakryiko <andrii@kernel.org> > Signed-off-by: Tao Chen <chen.dylane@linux.dev> > --- > include/linux/perf_event.h | 5 +++++ > kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ > 2 files changed, 27 insertions(+), 12 deletions(-) > Looking at the whole __bpf_get_stack() logic again, why didn't we just do something like this: diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index da3d328f5c15..80364561611c 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, max_depth = stack_map_calculate_max_depth(size, elem_size, flags); - if (may_fault) - rcu_read_lock(); /* need RCU for perf's callchain below */ + if (!trace_in) + preempt_disable(); if (trace_in) { trace = 
trace_in; @@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } if (unlikely(!trace) || trace->nr < skip) { - if (may_fault) - rcu_read_unlock(); + if (!trace_in) + preempt_enable(); goto err_fault; } @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } /* trace/ips should not be dereferenced after this point */ - if (may_fault) - rcu_read_unlock(); + if (!trace_in) + preempt_enable(); if (user_build_id) stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); ? Build ID parsing is happening after we copied data from perf's callchain_entry into user-provided memory, so raw callchain retrieval can be done with preemption disabled, as it's supposed to be brief. Build ID parsing part which indeed might fault and be much slower will be done well after that (we even have a comment there saying that trace/ips should not be touched). Am I missing something? [...] ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-04 1:08 ` Andrii Nakryiko @ 2026-02-05 6:16 ` Tao Chen 2026-02-05 17:34 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-02-05 6:16 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/2/4 09:08, Andrii Nakryiko 写道: > On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> From BPF stack map, we want to ensure that the callchain buffer >> will not be overwritten by other preemptive tasks and we also aim >> to reduce the preempt disable interval, Based on the suggestions from Peter >> and Andrrii, export new API __get_perf_callchain and the usage scenarios >> are as follows from BPF side: >> >> preempt_disable() >> entry = get_callchain_entry() >> preempt_enable() >> __get_perf_callchain(entry) >> put_callchain_entry(entry) >> >> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> include/linux/perf_event.h | 5 +++++ >> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ >> 2 files changed, 27 insertions(+), 12 deletions(-) >> > > Looking at the whole __bpf_get_stack() logic again, why didn't we just > do something like this: > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > index da3d328f5c15..80364561611c 100644 > --- a/kernel/bpf/stackmap.c > +++ b/kernel/bpf/stackmap.c > @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > struct task_struct *task, > > max_depth = stack_map_calculate_max_depth(size, elem_size, flags); > > - if (may_fault) > - rcu_read_lock(); /* need RCU for perf's callchain below */ > + if (!trace_in) > + preempt_disable(); > > if (trace_in) { > trace = trace_in; > @@ 
-474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > struct task_struct *task, > } > > if (unlikely(!trace) || trace->nr < skip) { > - if (may_fault) > - rcu_read_unlock(); > + if (!trace_in) > + preempt_enable(); > goto err_fault; > } > > @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > struct task_struct *task, > } > > /* trace/ips should not be dereferenced after this point */ > - if (may_fault) > - rcu_read_unlock(); > + if (!trace_in) > + preempt_enable(); > > if (user_build_id) > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > > > ? > > Build ID parsing is happening after we copied data from perf's > callchain_entry into user-provided memory, so raw callchain retrieval > can be done with preemption disabled, as it's supposed to be brief. > Build ID parsing part which indeed might fault and be much slower will > be done well after that (we even have a comment there saying that > trace/ips should not be touched). > > Am I missing something? Yes it looks good for bpf_get_stack, and I also proposed a similar change previously. But Alexei suggested that we should reduce the preemption-disabled section protected in bpf_get_stackid if we do like bpf_get_stack. Maybe we can change it first for bpf_get_stack? https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@linux.dev/ > > [...] -- Best Regards Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-05 6:16 ` Tao Chen @ 2026-02-05 17:34 ` Andrii Nakryiko 2026-02-06 9:20 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-02-05 17:34 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Wed, Feb 4, 2026 at 10:16 PM Tao Chen <chen.dylane@linux.dev> wrote: > > 在 2026/2/4 09:08, Andrii Nakryiko 写道: > > On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: > >> > >> From BPF stack map, we want to ensure that the callchain buffer > >> will not be overwritten by other preemptive tasks and we also aim > >> to reduce the preempt disable interval, Based on the suggestions from Peter > >> and Andrrii, export new API __get_perf_callchain and the usage scenarios > >> are as follows from BPF side: > >> > >> preempt_disable() > >> entry = get_callchain_entry() > >> preempt_enable() > >> __get_perf_callchain(entry) > >> put_callchain_entry(entry) > >> > >> Suggested-by: Andrii Nakryiko <andrii@kernel.org> > >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> > >> --- > >> include/linux/perf_event.h | 5 +++++ > >> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ > >> 2 files changed, 27 insertions(+), 12 deletions(-) > >> > > > > Looking at the whole __bpf_get_stack() logic again, why didn't we just > > do something like this: > > > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > > index da3d328f5c15..80364561611c 100644 > > --- a/kernel/bpf/stackmap.c > > +++ b/kernel/bpf/stackmap.c > > @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > > struct task_struct *task, > > > > max_depth = stack_map_calculate_max_depth(size, elem_size, flags); > > > > - if (may_fault) > > - 
rcu_read_lock(); /* need RCU for perf's callchain below */ > > + if (!trace_in) > > + preempt_disable(); > > > > if (trace_in) { > > trace = trace_in; > > @@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > > struct task_struct *task, > > } > > > > if (unlikely(!trace) || trace->nr < skip) { > > - if (may_fault) > > - rcu_read_unlock(); > > + if (!trace_in) > > + preempt_enable(); > > goto err_fault; > > } > > > > @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, > > struct task_struct *task, > > } > > > > /* trace/ips should not be dereferenced after this point */ > > - if (may_fault) > > - rcu_read_unlock(); > > + if (!trace_in) > > + preempt_enable(); > > > > if (user_build_id) > > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > > > > > > ? > > > > Build ID parsing is happening after we copied data from perf's > > callchain_entry into user-provided memory, so raw callchain retrieval > > can be done with preemption disabled, as it's supposed to be brief. > > Build ID parsing part which indeed might fault and be much slower will > > be done well after that (we even have a comment there saying that > > trace/ips should not be touched). > > > > Am I missing something? > > Yes it looks good for bpf_get_stack, and I also proposed a similar > change previously. But Alexei suggested that we should reduce the > preemption-disabled section protected in bpf_get_stackid if we do like > bpf_get_stack. Maybe we can change it first for bpf_get_stack? > > https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@linux.dev/ This is broken because we are still using trace after you re-enabled preemption. We need to keep preemption disabled until we copy captured stack trace data from ips into our own memory. > > > > > [...] > > > -- > Best Regards > Tao Chen ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain 2026-02-05 17:34 ` Andrii Nakryiko @ 2026-02-06 9:20 ` Tao Chen 0 siblings, 0 replies; 28+ messages in thread From: Tao Chen @ 2026-02-06 9:20 UTC (permalink / raw) To: Andrii Nakryiko, Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf 在 2026/2/6 01:34, Andrii Nakryiko 写道: > On Wed, Feb 4, 2026 at 10:16 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> 在 2026/2/4 09:08, Andrii Nakryiko 写道: >>> On Sun, Jan 25, 2026 at 11:45 PM Tao Chen <chen.dylane@linux.dev> wrote: >>>> >>>> From BPF stack map, we want to ensure that the callchain buffer >>>> will not be overwritten by other preemptive tasks and we also aim >>>> to reduce the preempt disable interval, Based on the suggestions from Peter >>>> and Andrrii, export new API __get_perf_callchain and the usage scenarios >>>> are as follows from BPF side: >>>> >>>> preempt_disable() >>>> entry = get_callchain_entry() >>>> preempt_enable() >>>> __get_perf_callchain(entry) >>>> put_callchain_entry(entry) >>>> >>>> Suggested-by: Andrii Nakryiko <andrii@kernel.org> >>>> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >>>> --- >>>> include/linux/perf_event.h | 5 +++++ >>>> kernel/events/callchain.c | 34 ++++++++++++++++++++++------------ >>>> 2 files changed, 27 insertions(+), 12 deletions(-) >>>> >>> >>> Looking at the whole __bpf_get_stack() logic again, why didn't we just >>> do something like this: >>> >>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >>> index da3d328f5c15..80364561611c 100644 >>> --- a/kernel/bpf/stackmap.c >>> +++ b/kernel/bpf/stackmap.c >>> @@ -460,8 +460,8 @@ static long __bpf_get_stack(struct pt_regs *regs, >>> struct task_struct *task, >>> >>> max_depth = stack_map_calculate_max_depth(size, elem_size, flags); >>> >>> - if 
(may_fault) >>> - rcu_read_lock(); /* need RCU for perf's callchain below */ >>> + if (!trace_in) >>> + preempt_disable(); >>> >>> if (trace_in) { >>> trace = trace_in; >>> @@ -474,8 +474,8 @@ static long __bpf_get_stack(struct pt_regs *regs, >>> struct task_struct *task, >>> } >>> >>> if (unlikely(!trace) || trace->nr < skip) { >>> - if (may_fault) >>> - rcu_read_unlock(); >>> + if (!trace_in) >>> + preempt_enable(); >>> goto err_fault; >>> } >>> >>> @@ -494,8 +494,8 @@ static long __bpf_get_stack(struct pt_regs *regs, >>> struct task_struct *task, >>> } >>> >>> /* trace/ips should not be dereferenced after this point */ >>> - if (may_fault) >>> - rcu_read_unlock(); >>> + if (!trace_in) >>> + preempt_enable(); >>> >>> if (user_build_id) >>> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); >>> >>> >>> ? >>> >>> Build ID parsing is happening after we copied data from perf's >>> callchain_entry into user-provided memory, so raw callchain retrieval >>> can be done with preemption disabled, as it's supposed to be brief. >>> Build ID parsing part which indeed might fault and be much slower will >>> be done well after that (we even have a comment there saying that >>> trace/ips should not be touched). >>> Hi Andrii, Building upon your work, I have also added preempt_disable() to bpf_get_stackid, and try to reduce the length of preempt section. Please review, thanks. https://lore.kernel.org/bpf/20260206090653.1336687-1-chen.dylane@linux.dev >>> Am I missing something? >> >> Yes it looks good for bpf_get_stack, and I also proposed a similar >> change previously. But Alexei suggested that we should reduce the >> preemption-disabled section protected in bpf_get_stackid if we do like >> bpf_get_stack. Maybe we can change it first for bpf_get_stack? >> >> https://lore.kernel.org/bpf/20250922075333.1452803-1-chen.dylane@linux.dev/ > > This is broken because we are still using trace after you re-enabled > preemption. 
We need to keep preemption disabled until we copy captured > stack trace data from ips into our own memory. > >> >> >>> >>> [...] >> >> >> -- >> Best Regards >> Tao Chen > -- Best Regards Tao Chen
* [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen 2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen @ 2026-01-26 7:43 ` Tao Chen 2026-01-27 21:35 ` Andrii Nakryiko 2 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-01-26 7:43 UTC (permalink / raw) To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo Cc: linux-perf-users, linux-kernel, bpf, Tao Chen As Alexei noted, the values returned by get_perf_callchain() may be reused if a task is preempted after the BPF program enters migrate-disable mode. The perf callchain entries pool has a small stack of entries, and we can reuse it as follows: 1. get the perf callchain entry 2. BPF use... 3. put the perf callchain entry And Peter suggested that get_recursion_context be used with preemption disabled, so we should disable preemption on the BPF side.
Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Tao Chen <chen.dylane@linux.dev> --- kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- 1 file changed, 47 insertions(+), 8 deletions(-) diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index e77dcdc2164..6bdee6cc05f 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) #ifdef CONFIG_STACKTRACE struct perf_callchain_entry *entry; + preempt_disable(); entry = get_callchain_entry(); + preempt_enable(); if (!entry) return NULL; @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) to[i] = (u64)(from[i]); } - put_callchain_entry(entry); - return entry; #else /* CONFIG_STACKTRACE */ return NULL; #endif } +static struct perf_callchain_entry * +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, + bool crosstask) +{ + struct perf_callchain_entry *entry; + int ret; + + preempt_disable(); + entry = get_callchain_entry(); + preempt_enable(); + + if (unlikely(!entry)) + return NULL; + + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); + if (ret) { + put_callchain_entry(entry); + return NULL; + } + + return entry; +} + +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) +{ + put_callchain_entry(entry); +} + static long __bpf_get_stackid(struct bpf_map *map, struct perf_callchain_entry *trace, u64 flags) { @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, struct perf_callchain_entry *trace; bool kernel = !user; u32 max_depth; + int ret; if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) return -EINVAL; max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); - trace = get_perf_callchain(regs, kernel, user, max_depth, - false, false, 
0); + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, false); if (unlikely(!trace)) /* couldn't fetch the stack trace */ return -EFAULT; - return __bpf_get_stackid(map, trace, flags); + ret = __bpf_get_stackid(map, trace, flags); + bpf_put_perf_callchain(trace); + + return ret; } const struct bpf_func_proto bpf_get_stackid_proto = { @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } else if (kernel && task) { trace = get_callchain_entry_for_task(task, max_depth); } else { - trace = get_perf_callchain(regs, kernel, user, max_depth, - crosstask, false, 0); + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); } - if (unlikely(!trace) || trace->nr < skip) { + if (unlikely(!trace)) { + if (may_fault) + rcu_read_unlock(); + goto err_fault; + } + if (trace->nr < skip) { if (may_fault) rcu_read_unlock(); + if (!trace_in) + bpf_put_perf_callchain(trace); goto err_fault; } @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, /* trace/ips should not be dereferenced after this point */ if (may_fault) rcu_read_unlock(); + if (!trace_in) + bpf_put_perf_callchain(trace); if (user_build_id) stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); -- 2.48.1
* Re: [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely Tao Chen @ 2026-01-27 21:35 ` Andrii Nakryiko 2026-01-28 4:21 ` Tao Chen 0 siblings, 1 reply; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-27 21:35 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Sun, Jan 25, 2026 at 11:46 PM Tao Chen <chen.dylane@linux.dev> wrote: > > As Alexei noted, the values returned by get_perf_callchain() may be reused > if a task is preempted after the BPF program enters migrate-disable > mode. The perf callchain entries pool has a small stack of entries, and > we can reuse it as follows: > > 1. get the perf callchain entry > 2. BPF use... > 3. put the perf callchain entry > > And Peter suggested that get_recursion_context be used with preemption > disabled, so we should disable preemption on the BPF side.
> > Acked-by: Yonghong Song <yonghong.song@linux.dev> > Signed-off-by: Tao Chen <chen.dylane@linux.dev> > --- > kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- > 1 file changed, 47 insertions(+), 8 deletions(-) > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > index e77dcdc2164..6bdee6cc05f 100644 > --- a/kernel/bpf/stackmap.c > +++ b/kernel/bpf/stackmap.c > @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > #ifdef CONFIG_STACKTRACE > struct perf_callchain_entry *entry; > > + preempt_disable(); > entry = get_callchain_entry(); > + preempt_enable(); pass perf_callchain_entry as input argument, to keep similar approach to __get_perf_callchain, see below > > if (!entry) > return NULL; > @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > to[i] = (u64)(from[i]); > } > > - put_callchain_entry(entry); > - > return entry; > #else /* CONFIG_STACKTRACE */ > return NULL; > #endif > } > > +static struct perf_callchain_entry * > +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, > + bool crosstask) > +{ I don't really like having this wrapper, it feels like the flow will be cleaner and easier to follow if we modify the code as suggested below > + struct perf_callchain_entry *entry; > + int ret; > + > + preempt_disable(); > + entry = get_callchain_entry(); > + preempt_enable(); I'd actually consider having __get_callchain_entry() that does what get_callchain_entry() does right now under assumption that preemption/migration is disabled, and then make get_callchain_entry do preempt_disable + fetch entry + preevent_enable + return entry dance. This will simplify the flow here to just with no explicit preempt_{disable,enable} visible. 
Either way all of this has assumption that we are staying on the same CPU throughout (so at the very least we need to have migration disabled) entry = get_callchain_entry(); __get_perf_callchain(entry, ...); put_callchain_entry(); BTW, is there a way to assert that either preemption or migration is currently disabled? I think both get_callchain_entry and put_callchain_entry would benefit from that pw-bot: cr > + > + if (unlikely(!entry)) > + return NULL; > + > + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); > + if (ret) { > + put_callchain_entry(entry); > + return NULL; > + } > + > + return entry; > +} > + > +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) > +{ > + put_callchain_entry(entry); > +} > + > static long __bpf_get_stackid(struct bpf_map *map, > struct perf_callchain_entry *trace, u64 flags) > { > @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, > struct perf_callchain_entry *trace; > bool kernel = !user; > u32 max_depth; > + int ret; > > if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | > BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) > return -EINVAL; > > max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); > - trace = get_perf_callchain(regs, kernel, user, max_depth, > - false, false, 0); > + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, false); > > if (unlikely(!trace)) > /* couldn't fetch the stack trace */ > return -EFAULT; > > - return __bpf_get_stackid(map, trace, flags); > + ret = __bpf_get_stackid(map, trace, flags); > + bpf_put_perf_callchain(trace); Just as above, I think get_callchain_entry + __get_perf_callchain + put_callchain_entry is better, IMO > + > + return ret; > } > > const struct bpf_func_proto bpf_get_stackid_proto = { > @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > } else if (kernel && task) { > trace = 
get_callchain_entry_for_task(task, max_depth); > } else { > - trace = get_perf_callchain(regs, kernel, user, max_depth, > - crosstask, false, 0); > + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); > } with the above suggestions this will be a pretty streamlined: trace = trace_in ?: get_callchain_entry(); if (!trace) goto err_fault; if (trace_in) { trace->nr = ... err = 0 } else if (kernel && task) { err = get_callchain_entry_for_task(trace, ...); } else { err = __get_perf_callchain(trace, ...); } if (err) goto clear; ... proceed as before, we have our stack trace inside trace ... for successful and failed paths (you'll have to duplicate this logic): if (trace != trace_in) put_callchain_entry(trace); > > - if (unlikely(!trace) || trace->nr < skip) { > + if (unlikely(!trace)) { this condition cannot happen: we either get trace_in != NULL or we get it using __get_callchain_entry and then validate it's not NULL earlier, so drop this condition > + if (may_fault) > + rcu_read_unlock(); > + goto err_fault; > + } > + if (trace->nr < skip) { > if (may_fault) > rcu_read_unlock(); > + if (!trace_in) > + bpf_put_perf_callchain(trace); do this clean up in one place, behind the new goto label? it's a bit too easy to miss this, IMO > goto err_fault; > } > > @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > /* trace/ips should not be dereferenced after this point */ > if (may_fault) > rcu_read_unlock(); now that I looked at this code, I feel like we don't really need this rcu_read_{lock,unlock}() dance (even though I added it in the first place). 
I think this RCU was supposed to be needed to keep perf_callchain_entry alive long enough, but for BPF this is guaranteed because either the BPF stack map will keep them alive by delaying put_callchain_buffer() until freeing time (after RCU Tasks Trace + RCU grace periods), or, for bpf_get_stack/bpf_get_task_stack, the BPF program itself will hold these buffers alive, again until freeing time, which is delayed until after the RCU Tasks Trace + RCU grace period. Please send this cleanup as the first patch in the series so we can review and ack this separately. Thanks! > + if (!trace_in) > + bpf_put_perf_callchain(trace); > > if (user_build_id) > stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > -- > 2.48.1 >
* Re: [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-27 21:35 ` Andrii Nakryiko @ 2026-01-28 4:21 ` Tao Chen 2026-01-28 19:13 ` Andrii Nakryiko 0 siblings, 1 reply; 28+ messages in thread From: Tao Chen @ 2026-01-28 4:21 UTC (permalink / raw) To: Andrii Nakryiko Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On 2026/1/28 05:35, Andrii Nakryiko wrote: > On Sun, Jan 25, 2026 at 11:46 PM Tao Chen <chen.dylane@linux.dev> wrote: >> >> As Alexei noted, the values returned by get_perf_callchain() may be reused >> if a task is preempted after the BPF program enters migrate-disable >> mode. The perf callchain entries pool has a small stack of entries, and >> we can reuse it as follows: >> >> 1. get the perf callchain entry >> 2. BPF use... >> 3. put the perf callchain entry >> >> And Peter suggested that get_recursion_context be used with preemption >> disabled, so we should disable preemption on the BPF side.
>> >> Acked-by: Yonghong Song <yonghong.song@linux.dev> >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> >> --- >> kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- >> 1 file changed, 47 insertions(+), 8 deletions(-) >> >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >> index e77dcdc2164..6bdee6cc05f 100644 >> --- a/kernel/bpf/stackmap.c >> +++ b/kernel/bpf/stackmap.c >> @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >> #ifdef CONFIG_STACKTRACE >> struct perf_callchain_entry *entry; >> >> + preempt_disable(); >> entry = get_callchain_entry(); >> + preempt_enable(); > > pass perf_callchain_entry as input argument, to keep similar approach > to __get_perf_callchain, see below > >> >> if (!entry) >> return NULL; >> @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) >> to[i] = (u64)(from[i]); >> } >> >> - put_callchain_entry(entry); >> - >> return entry; >> #else /* CONFIG_STACKTRACE */ >> return NULL; >> #endif >> } >> >> +static struct perf_callchain_entry * >> +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, >> + bool crosstask) >> +{ > > I don't really like having this wrapper, it feels like the flow will > be cleaner and easier to follow if we modify the code as suggested > below > Ok, will use it directly. >> + struct perf_callchain_entry *entry; >> + int ret; >> + >> + preempt_disable(); >> + entry = get_callchain_entry(); >> + preempt_enable(); > > I'd actually consider having __get_callchain_entry() that does what > get_callchain_entry() does right now under assumption that > preemption/migration is disabled, and then make get_callchain_entry do > preempt_disable + fetch entry + preevent_enable + return entry dance. > in v4, YongHong suggested add preempt_disable in get_callchain_entry, but Peter suggested that do it from BPF side, so maybe keeping the existing method is a compromise. 
> This will simplify the flow here to just with no explicit > preempt_{disable,enable} visible. Either way all of this has > assumption that we are staying on the same CPU throughout (so at the > very least we need to have migration disabled) > > entry = get_callchain_entry(); > __get_perf_callchain(entry, ...); > put_callchain_entry(); > > > BTW, is there a way to assert that either preemption or migration is > currently disabled? I think both get_callchain_entry and > put_callchain_entry would benefit from that > > pw-bot: cr > > >> + >> + if (unlikely(!entry)) >> + return NULL; >> + >> + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); >> + if (ret) { >> + put_callchain_entry(entry); >> + return NULL; >> + } >> + >> + return entry; >> +} >> + >> +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) >> +{ >> + put_callchain_entry(entry); >> +} >> + >> static long __bpf_get_stackid(struct bpf_map *map, >> struct perf_callchain_entry *trace, u64 flags) >> { >> @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, >> struct perf_callchain_entry *trace; >> bool kernel = !user; >> u32 max_depth; >> + int ret; >> >> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | >> BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) >> return -EINVAL; >> >> max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); >> - trace = get_perf_callchain(regs, kernel, user, max_depth, >> - false, false, 0); >> + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, false); >> >> if (unlikely(!trace)) >> /* couldn't fetch the stack trace */ >> return -EFAULT; >> >> - return __bpf_get_stackid(map, trace, flags); >> + ret = __bpf_get_stackid(map, trace, flags); >> + bpf_put_perf_callchain(trace); > > Just as above, I think get_callchain_entry + __get_perf_callchain + > put_callchain_entry is better, IMO > >> + >> + return ret; >> } >> >> const struct 
bpf_func_proto bpf_get_stackid_proto = { >> @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> } else if (kernel && task) { >> trace = get_callchain_entry_for_task(task, max_depth); >> } else { >> - trace = get_perf_callchain(regs, kernel, user, max_depth, >> - crosstask, false, 0); >> + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); >> } > > with the above suggestions this will be a pretty streamlined: > > trace = trace_in ?: get_callchain_entry(); > if (!trace) > goto err_fault; > > if (trace_in) { > trace->nr = ... > err = 0 > } else if (kernel && task) { > err = get_callchain_entry_for_task(trace, ...); > } else { > err = __get_perf_callchain(trace, ...); > } > if (err) > goto clear; > This code looks much cleaner, i will change it, thanks. > ... proceed as before, we have our stack trace inside trace ... > > for successful and failed paths (you'll have to duplicate this logic): > > if (trace != trace_in) > put_callchain_entry(trace); > >> >> - if (unlikely(!trace) || trace->nr < skip) { >> + if (unlikely(!trace)) { > > this condition cannot happen: we either get trace_in != NULL or we get > it using __get_callchain_entry and then validate it's not NULL > earlier, so drop this condition > will remove it. >> + if (may_fault) >> + rcu_read_unlock(); >> + goto err_fault; >> + } >> + if (trace->nr < skip) { >> if (may_fault) >> rcu_read_unlock(); >> + if (!trace_in) >> + bpf_put_perf_callchain(trace); > > do this clean up in one place, behind the new goto label? it's a bit > too easy to miss this, IMO > ok, will do. 
>> goto err_fault; >> } >> >> @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, >> /* trace/ips should not be dereferenced after this point */ >> if (may_fault) >> rcu_read_unlock(); > > now that I looked at this code, I feel like we don't really need this > rcu_read_{lock,unlock}() dance (even though I added it in the first > place). I think this RCU was supposed to be needed to keep > perf_callchain_entry alive long enough, but for BPF this is guaranteed > because either BPF stack map will keep them alive by delaying > put_callchain_buffer() until freeing time (after RCU Tasks Trace + RCU > grace periods), or for bpf_get_stack/bpf_get_task_stack, BPF program > itself will hold these buffers alive again, until freeing time which > is delayed until after RCU Tasks Trace + RCU grace period. It seems so; for both, put_callchain_buffer is always called at the end, which ensures it won't be released during use. I will remove it as a new patch. > > Please send this cleanup as the first patch in the series so we can > review and ack this separately. Thanks! > >> + if (!trace_in) >> + bpf_put_perf_callchain(trace); >> >> if (user_build_id) >> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); >> -- >> 2.48.1 >> -- Best Regards Tao Chen
* Re: [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely 2026-01-28 4:21 ` Tao Chen @ 2026-01-28 19:13 ` Andrii Nakryiko 0 siblings, 0 replies; 28+ messages in thread From: Andrii Nakryiko @ 2026-01-28 19:13 UTC (permalink / raw) To: Tao Chen Cc: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song, john.fastabend, kpsingh, sdf, haoluo, linux-perf-users, linux-kernel, bpf On Tue, Jan 27, 2026 at 8:21 PM Tao Chen <chen.dylane@linux.dev> wrote: > > On 2026/1/28 05:35, Andrii Nakryiko wrote: > > On Sun, Jan 25, 2026 at 11:46 PM Tao Chen <chen.dylane@linux.dev> wrote: > >> > >> As Alexei noted, the values returned by get_perf_callchain() may be reused > >> if a task is preempted after the BPF program enters migrate-disable > >> mode. The perf callchain entries pool has a small stack of entries, and > >> we can reuse it as follows: > >> > >> 1. get the perf callchain entry > >> 2. BPF use... > >> 3. put the perf callchain entry > >> > >> And Peter suggested that get_recursion_context be used with preemption > >> disabled, so we should disable preemption on the BPF side.
> >> > >> Acked-by: Yonghong Song <yonghong.song@linux.dev> > >> Signed-off-by: Tao Chen <chen.dylane@linux.dev> > >> --- > >> kernel/bpf/stackmap.c | 55 ++++++++++++++++++++++++++++++++++++------- > >> 1 file changed, 47 insertions(+), 8 deletions(-) > >> > >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > >> index e77dcdc2164..6bdee6cc05f 100644 > >> --- a/kernel/bpf/stackmap.c > >> +++ b/kernel/bpf/stackmap.c > >> @@ -215,7 +215,9 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > >> #ifdef CONFIG_STACKTRACE > >> struct perf_callchain_entry *entry; > >> > >> + preempt_disable(); > >> entry = get_callchain_entry(); > >> + preempt_enable(); > > > > pass perf_callchain_entry as input argument, to keep similar approach > > to __get_perf_callchain, see below > > > >> > >> if (!entry) > >> return NULL; > >> @@ -237,14 +239,40 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) > >> to[i] = (u64)(from[i]); > >> } > >> > >> - put_callchain_entry(entry); > >> - > >> return entry; > >> #else /* CONFIG_STACKTRACE */ > >> return NULL; > >> #endif > >> } > >> > >> +static struct perf_callchain_entry * > >> +bpf_get_perf_callchain(struct pt_regs *regs, bool kernel, bool user, int max_stack, > >> + bool crosstask) > >> +{ > > > > I don't really like having this wrapper, it feels like the flow will > > be cleaner and easier to follow if we modify the code as suggested > > below > > > > Ok, will use it directly. > >> + struct perf_callchain_entry *entry; > >> + int ret; > >> + > >> + preempt_disable(); > >> + entry = get_callchain_entry(); > >> + preempt_enable(); > > > > I'd actually consider having __get_callchain_entry() that does what > > get_callchain_entry() does right now under assumption that > > preemption/migration is disabled, and then make get_callchain_entry do > > preempt_disable + fetch entry + preevent_enable + return entry dance. 
> > > > in v4, YongHong suggested add preempt_disable in get_callchain_entry, > but Peter suggested that do it from BPF side, so maybe keeping the > existing method is a compromise. yeah, I guess perf's own usage of this is happening under constant preempt_disable(), so this would be unnecessary for them. That's fine, let's keep it outside > > > This will simplify the flow here to just with no explicit > > preempt_{disable,enable} visible. Either way all of this has > > assumption that we are staying on the same CPU throughout (so at the > > very least we need to have migration disabled) > > > > entry = get_callchain_entry(); > > __get_perf_callchain(entry, ...); > > put_callchain_entry(); > > > > > > BTW, is there a way to assert that either preemption or migration is > > currently disabled? I think both get_callchain_entry and > > put_callchain_entry would benefit from that > > > > pw-bot: cr > > > > > >> + > >> + if (unlikely(!entry)) > >> + return NULL; > >> + > >> + ret = __get_perf_callchain(entry, regs, kernel, user, max_stack, crosstask, false, 0); > >> + if (ret) { > >> + put_callchain_entry(entry); > >> + return NULL; > >> + } > >> + > >> + return entry; > >> +} > >> + > >> +static void bpf_put_perf_callchain(struct perf_callchain_entry *entry) > >> +{ > >> + put_callchain_entry(entry); > >> +} > >> + > >> static long __bpf_get_stackid(struct bpf_map *map, > >> struct perf_callchain_entry *trace, u64 flags) > >> { > >> @@ -327,20 +355,23 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, > >> struct perf_callchain_entry *trace; > >> bool kernel = !user; > >> u32 max_depth; > >> + int ret; > >> > >> if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | > >> BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) > >> return -EINVAL; > >> > >> max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); > >> - trace = get_perf_callchain(regs, kernel, user, max_depth, > >> - false, false, 0); > >> + trace = 
bpf_get_perf_callchain(regs, kernel, user, max_depth, false); > >> > >> if (unlikely(!trace)) > >> /* couldn't fetch the stack trace */ > >> return -EFAULT; > >> > >> - return __bpf_get_stackid(map, trace, flags); > >> + ret = __bpf_get_stackid(map, trace, flags); > >> + bpf_put_perf_callchain(trace); > > > > Just as above, I think get_callchain_entry + __get_perf_callchain + > > put_callchain_entry is better, IMO > > > >> + > >> + return ret; > >> } > >> > >> const struct bpf_func_proto bpf_get_stackid_proto = { > >> @@ -468,13 +499,19 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > >> } else if (kernel && task) { > >> trace = get_callchain_entry_for_task(task, max_depth); > >> } else { > >> - trace = get_perf_callchain(regs, kernel, user, max_depth, > >> - crosstask, false, 0); > >> + trace = bpf_get_perf_callchain(regs, kernel, user, max_depth, crosstask); > >> } > > > > with the above suggestions this will be a pretty streamlined: > > > > trace = trace_in ?: get_callchain_entry(); > > if (!trace) > > goto err_fault; > > > > if (trace_in) { > > trace->nr = ... > > err = 0 > > } else if (kernel && task) { > > err = get_callchain_entry_for_task(trace, ...); > > } else { > > err = __get_perf_callchain(trace, ...); > > } > > if (err) > > goto clear; > > > > This code looks much cleaner, i will change it, thanks. > > > ... proceed as before, we have our stack trace inside trace ... > > > > for successful and failed paths (you'll have to duplicate this logic): > > > > if (trace != trace_in) > > put_callchain_entry(trace); > > > >> > >> - if (unlikely(!trace) || trace->nr < skip) { > >> + if (unlikely(!trace)) { > > > > this condition cannot happen: we either get trace_in != NULL or we get > > it using __get_callchain_entry and then validate it's not NULL > > earlier, so drop this condition > > > > will remove it. 
> >> + if (may_fault) > >> + rcu_read_unlock(); > >> + goto err_fault; > >> + } > >> + if (trace->nr < skip) { > >> if (may_fault) > >> rcu_read_unlock(); > >> + if (!trace_in) > >> + bpf_put_perf_callchain(trace); > > > > do this clean up in one place, behind the new goto label? it's a bit > > too easy to miss this, IMO > > > > ok, will do. > > >> goto err_fault; > >> } > >> > >> @@ -495,6 +532,8 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > >> /* trace/ips should not be dereferenced after this point */ > >> if (may_fault) > >> rcu_read_unlock(); > > > > now that I looked at this code, I feel like we don't really need this > > rcu_read_{lock,unlock}() dance (even though I added it in the first > > place). I think this RCU was supposed to be needed to keep > > perf_callchain_entry alive long enough, but for BPF this is guaranteed > > because either BPF stack map will keep them alive by delaying > > put_callchain_buffer() until freeing time (after RCU Tasks Trace + RCU > > grace periods), or for bpf_get_stack/bpf_get_task_stack, BPF program > > itself will hold these buffers alive again, until freeing time which > > is delayed until after RCU Tasks Trace + RCU grace period. > > It seems so; for both, put_callchain_buffer is always called at the end, > which ensures it won't be released during use. I will remove it as a new > patch. > > > > > > Please send this cleanup as the first patch in the series so we can > > review and ack this separately. Thanks! > > > >> + if (!trace_in) > >> + bpf_put_perf_callchain(trace); > >> > >> if (user_build_id) > >> stack_map_get_build_id_offset(buf, trace_nr, user, may_fault); > >> -- > >> 2.48.1 > >> > > > -- > Best Regards > Tao Chen
end of thread, other threads:[~2026-02-06 9:20 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
2026-01-26 7:43 [PATCH bpf-next v8 0/3] Pass external callchain entry to get_perf_callchain Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry Tao Chen
2026-01-26 8:03 ` bot+bpf-ci
2026-01-26 8:51 ` Tao Chen
2026-01-27 21:01 ` Andrii Nakryiko
2026-01-28 2:41 ` Tao Chen
2026-01-28 8:59 ` Peter Zijlstra
2026-01-28 16:52 ` Tao Chen
2026-01-28 18:59 ` Andrii Nakryiko
2026-01-29 3:03 ` Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 2/3] perf: Refactor get_perf_callchain Tao Chen
2026-01-27 21:07 ` Andrii Nakryiko
2026-01-28 2:42 ` Tao Chen
2026-01-28 9:10 ` Peter Zijlstra
2026-01-28 16:49 ` Tao Chen
2026-01-28 19:12 ` Andrii Nakryiko
2026-01-30 11:31 ` Peter Zijlstra
2026-01-30 20:04 ` Andrii Nakryiko
2026-02-02 19:59 ` Peter Zijlstra
2026-02-04 0:24 ` Andrii Nakryiko
2026-02-04 1:08 ` Andrii Nakryiko
2026-02-05 6:16 ` Tao Chen
2026-02-05 17:34 ` Andrii Nakryiko
2026-02-06 9:20 ` Tao Chen
2026-01-26 7:43 ` [PATCH bpf-next v8 3/3] bpf: Hold the perf callchain entry until used completely Tao Chen
2026-01-27 21:35 ` Andrii Nakryiko
2026-01-28 4:21 ` Tao Chen
2026-01-28 19:13 ` Andrii Nakryiko