* [PATCH bpf-next v5 0/3] Pass external callchain entry to get_perf_callchain
From: Tao Chen @ 2025-11-09 16:35 UTC
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen
Background
==========
Alexei noted we should use preempt_disable to protect get_perf_callchain
in bpf stackmap.
https://lore.kernel.org/bpf/CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com
A previous patch attempted to fix this issue, and Andrii suggested
teaching get_perf_callchain() to accept the caller's buffer directly,
avoiding the unnecessary copy.
https://lore.kernel.org/bpf/20250926153952.1661146-1-chen.dylane@linux.dev
Proposed Solution
=================
Add an external perf_callchain_entry parameter to get_perf_callchain()
so that the BPF side can pass in its own buffer. The biggest advantage
is that it avoids an unnecessary copy.
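
In outline, the flow this enables on the BPF side is get/use/put (a
minimal sketch; the helper names follow patch 3 below, error handling
elided):

    int rctx;
    struct perf_callchain_entry *entry;

    entry = get_callchain_entry(&rctx);   /* 1. get the entry, pinned by rctx */
    if (!entry)
            return -EFAULT;

    /* 2. fill the entry and let the BPF program consume it ... */

    put_callchain_entry(rctx);            /* 3. release the entry for reuse */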
Todo
====
But I'm not sure whether this modification is appropriate; after all,
the implementation of get_callchain_entry() in the perf subsystem is
considerably more complex than using an external buffer directly.
Comments and suggestions are always welcome.
Change list:
- v1 -> v2:
From Jiri
- rebase code, fix conflict
- v1: https://lore.kernel.org/bpf/20251013174721.2681091-1-chen.dylane@linux.dev
- v2 -> v3:
From Andrii
- entries per CPU used in a stack-like fashion
- v2: https://lore.kernel.org/bpf/20251014100128.2721104-1-chen.dylane@linux.dev
- v3 -> v4:
From Peter
- refactor get_perf_callchain and add three new APIs to make the perf
  callchain easier to use.
From Andrii
- reuse the perf callchain management.
- rename patch1 and patch2.
- v3: https://lore.kernel.org/bpf/20251019170118.2955346-1-chen.dylane@linux.dev
- v4 -> v5:
From Yonghong
- keep add_mark false in stackmap when refactoring get_perf_callchain
  in patch 1.
- add atomic operation in get_recursion_context in patch2.
- rename bpf_put_callchain_entry to bpf_put_perf_callchain in
  patch 3.
- rebase onto bpf-next master.
- v4: https://lore.kernel.org/bpf/20251028162502.3418817-1-chen.dylane@linux.dev
Tao Chen (3):
perf: Refactor get_perf_callchain
perf: Add atomic operation in get_recursion_context
bpf: Hold the perf callchain entry until used completely
 include/linux/perf_event.h |  9 +++++
 kernel/bpf/stackmap.c      | 62 +++++++++++++++++++++++++-------
 kernel/events/callchain.c  | 73 ++++++++++++++++++++++++--------------
 kernel/events/internal.h   |  5 +--
 4 files changed, 107 insertions(+), 42 deletions(-)
--
2.48.1

* [PATCH bpf-next v5 1/3] perf: Refactor get_perf_callchain
From: Tao Chen @ 2025-11-09 16:35 UTC
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
    jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

From the BPF stack map we want to ensure that the callchain buffer will
not be overwritten by other preempting tasks. Peter suggested providing
more flexible stack-sampling APIs that can be used from BPF while still
using the perf callchain entry. The next patch modifies the BPF part.
In the future, these APIs will also make it convenient to add
stack-sampling kfuncs to the eBPF subsystem, as Andrii and Alexei
discussed earlier.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 include/linux/perf_event.h |  9 +++++
 kernel/events/callchain.c  | 73 ++++++++++++++++++++++++--------------
 2 files changed, 56 insertions(+), 26 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b9..edd3058e4d8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
 	u32		nr;
 	short		contexts;
 	bool		contexts_maxed;
+	bool		add_mark;
 };
 
 typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
@@ -1718,6 +1719,14 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+
+extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
+				      struct perf_callchain_entry *entry,
+				      u32 max_stack, bool add_mark);
+
+extern void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
+extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
+
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark);
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31f..fb1f26be297 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,13 +216,54 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 #endif
 }
 
+void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
+			       struct perf_callchain_entry *entry,
+			       u32 max_stack, bool add_mark)
+
+{
+	ctx->entry		= entry;
+	ctx->max_stack		= max_stack;
+	ctx->nr			= entry->nr = 0;
+	ctx->contexts		= 0;
+	ctx->contexts_maxed	= false;
+	ctx->add_mark		= add_mark;
+}
+
+void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		return;
+
+	if (ctx->add_mark)
+		perf_callchain_store_context(ctx, PERF_CONTEXT_KERNEL);
+	perf_callchain_kernel(ctx, regs);
+}
+
+void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
+{
+	int start_entry_idx;
+
+	if (!user_mode(regs)) {
+		if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+			return;
+		regs = task_pt_regs(current);
+	}
+
+	if (ctx->add_mark)
+		perf_callchain_store_context(ctx, PERF_CONTEXT_USER);
+
+	start_entry_idx = ctx->nr;
+	perf_callchain_user(ctx, regs);
+	fixup_uretprobe_trampoline_entries(ctx->entry, start_entry_idx);
+}
+
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
-	int rctx, start_entry_idx;
+	int rctx;
 
 	/* crosstask is not supported for user stacks */
 	if (crosstask && user && !kernel)
@@ -232,34 +273,14 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry		= entry;
-	ctx.max_stack		= max_stack;
-	ctx.nr			= entry->nr = 0;
-	ctx.contexts		= 0;
-	ctx.contexts_maxed	= false;
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, add_mark);
 
-	if (kernel && !user_mode(regs)) {
-		if (add_mark)
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
-		perf_callchain_kernel(&ctx, regs);
-	}
-
-	if (user && !crosstask) {
-		if (!user_mode(regs)) {
-			if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
-				goto exit_put;
-			regs = task_pt_regs(current);
-		}
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
 
-		if (add_mark)
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
-
-		start_entry_idx = entry->nr;
-		perf_callchain_user(&ctx, regs);
-		fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
-	}
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs);
 
-exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
-- 
2.48.1
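
With this refactor, a caller that manages the entry lifetime itself can
compose the helpers the same way the refactored get_perf_callchain()
does; a sketch, assuming entry, regs, kernel, user and crosstask come
from the caller:

    struct perf_callchain_entry_ctx ctx;

    /* Bind the caller-owned entry to a fresh context, without
     * PERF_CONTEXT_* markers (add_mark == false).
     */
    __init_perf_callchain_ctx(&ctx, entry, max_stack, false);

    if (kernel)
            __get_perf_callchain_kernel(&ctx, regs);  /* no-op for user-mode regs */
    if (user && !crosstask)
            __get_perf_callchain_user(&ctx, regs);    /* skips kthreads/user workers */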

* Re: [PATCH bpf-next v5 1/3] perf: Refactor get_perf_callchain
From: bot+bpf-ci @ 2025-11-09 16:58 UTC
To: chen.dylane, peterz, mingo, acme, namhyung, mark.rutland,
    alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-perf-users, linux-kernel, bpf, chen.dylane, ast, andrii,
    daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fd1d91017..edd3058e4 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
>  	u32		nr;
>  	short		contexts;
>  	bool		contexts_maxed;
> +	bool		add_mark;
>  };

[ ... ]

> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 808c0d7a3..fb1f26be2 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c

[ ... ]

> +void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
> +{
> +	int start_entry_idx;
> +
> +	if (!user_mode(regs)) {
> +		if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> +			return;
> +		regs = task_pt_regs(current);
> +	}
> +
> +	if (ctx->add_mark)
> +		perf_callchain_store_context(ctx, PERF_CONTEXT_USER);
> +
> +	start_entry_idx = ctx->nr;
                          ^^^^^^^^
> +	perf_callchain_user(ctx, regs);
> +	fixup_uretprobe_trampoline_entries(ctx->entry, start_entry_idx);
> +}

Can this break uprobe trampoline fixups? The original code used
entry->nr for start_entry_idx, but the refactored version uses ctx->nr.

These two counters track different things:

- entry->nr counts ALL entries (incremented by both
  perf_callchain_store_context() and perf_callchain_store())
- ctx->nr only counts regular callchain entries (incremented only by
  perf_callchain_store(), not by perf_callchain_store_context())

When add_mark is true, perf_callchain_store_context() increments
entry->nr but not ctx->nr, causing them to become out of sync.

For example, if kernel callchain collection runs first with add_mark
set:

  Initial: ctx->nr = entry->nr = 0

  In __get_perf_callchain_kernel():
    perf_callchain_store_context() stores PERF_CONTEXT_KERNEL marker
      -> entry->nr becomes 1, ctx->nr still 0
    perf_callchain_kernel() stores N kernel frames
      -> entry->nr becomes 1+N, ctx->nr becomes N

  In __get_perf_callchain_user():
    perf_callchain_store_context() stores PERF_CONTEXT_USER marker
      -> entry->nr becomes 2+N, ctx->nr still N
    start_entry_idx = ctx->nr assigns N (should be 2+N)
    fixup_uretprobe_trampoline_entries() scans from wrong index

This causes fixup_uretprobe_trampoline_entries() to scan kernel
callchain entries instead of user callchain entries, missing the
uretprobe trampolines that need to be fixed up.

Should this use ctx->entry->nr instead of ctx->nr?

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19211481431
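
The divergence the bot describes comes from the two store helpers;
simplified versions (capacity checks dropped) of the helpers in
include/linux/perf_event.h:

    /* Context marker: advances entry->nr and ctx->contexts, never ctx->nr. */
    static inline int perf_callchain_store_context(struct perf_callchain_entry_ctx *ctx, u64 ip)
    {
            struct perf_callchain_entry *entry = ctx->entry;

            entry->ip[entry->nr++] = ip;
            ++ctx->contexts;
            return 0;
    }

    /* Regular frame: advances both entry->nr and ctx->nr. */
    static inline int perf_callchain_store(struct perf_callchain_entry_ctx *ctx, u64 ip)
    {
            struct perf_callchain_entry *entry = ctx->entry;

            entry->ip[entry->nr++] = ip;
            ++ctx->nr;
            return 0;
    }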

* Re: [PATCH bpf-next v5 1/3] perf: Refactor get_perf_callchain
From: Tao Chen @ 2025-11-10  9:31 UTC
To: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland,
    alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-perf-users, linux-kernel, bpf, ast, andrii, daniel,
    martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

On 2025/11/10 00:58, bot+bpf-ci@kernel.org wrote:

[ ... ]

> This causes fixup_uretprobe_trampoline_entries() to scan kernel
> callchain entries instead of user callchain entries, missing the
> uretprobe trampolines that need to be fixed up.
>
> Should this use ctx->entry->nr instead of ctx->nr?

Using ctx->entry->nr looks better, will change it.

-- 
Best Regards
Tao Chen

* [PATCH bpf-next v5 2/3] perf: Add atomic operation in get_recursion_context
From: Tao Chen @ 2025-11-09 16:35 UTC
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
    jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen, Yonghong Song

On the BPF side, preemption is usually enabled. As Yonghong pointed
out, it is possible that both tasks (at process level) reach the point
right before "recursion[rctx]++;". In such cases, both tasks will be
able to get the buffer, and this is not right.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 kernel/events/internal.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index d9cc5708309..684bde972ba 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -214,12 +214,9 @@ static inline int get_recursion_context(u8 *recursion)
 {
 	unsigned char rctx = interrupt_context_level();
 
-	if (recursion[rctx])
+	if (cmpxchg(&recursion[rctx], 0, 1) != 0)
 		return -1;
 
-	recursion[rctx]++;
-	barrier();
-
 	return rctx;
 }
 
-- 
2.48.1
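
The race this closes, sketched as an interleaving (two tasks on one CPU
with preemption enabled, sharing the same recursion counter):

    /*
     * Task A                         Task B (preempts A)
     * ------                         -------------------
     * if (recursion[rctx])  // 0
     *                                if (recursion[rctx])  // still 0
     *                                recursion[rctx]++;    // gets the buffer
     * recursion[rctx]++;             // also gets the buffer
     *
     * With cmpxchg(&recursion[rctx], 0, 1), exactly one of the two
     * test-and-set attempts observes 0 and wins; the other gets -1.
     */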

* Re: [PATCH bpf-next v5 2/3] perf: Add atomic operation in get_recursion_context
From: Peter Zijlstra @ 2025-11-10  8:52 UTC
To: Tao Chen
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
    irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
    bpf, Yonghong Song

On Mon, Nov 10, 2025 at 12:35:58AM +0800, Tao Chen wrote:
> On the BPF side, preemption is usually enabled. As Yonghong pointed
> out, it is possible that both tasks (at process level) reach the
> point right before "recursion[rctx]++;". In such cases, both tasks
> will be able to get the buffer, and this is not right.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
> ---

Nope, this function really is meant to be used with preemption disabled.
If BPF doesn't abide, fix that.

> [ ... ]

* Re: [PATCH bpf-next v5 2/3] perf: Add atomic operation in get_recursion_context
From: Tao Chen @ 2025-11-10  9:26 UTC
To: Peter Zijlstra
Cc: mingo, acme, namhyung, mark.rutland, alexander.shishkin, jolsa,
    irogers, adrian.hunter, kan.liang, linux-perf-users, linux-kernel,
    bpf, Yonghong Song

On 2025/11/10 16:52, Peter Zijlstra wrote:

[ ... ]

> Nope, this function really is meant to be used with preemption disabled.
> If BPF doesn't abide, fix that.

Ok, let us use preempt_disable in bpf stackmap, thanks. I will change
it in v6.

-- 
Best Regards
Tao Chen
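
A plausible shape for that v6 change (hypothetical sketch; the actual
patch may differ), bracketing the stackmap get/use/put window so that
get_recursion_context()'s non-atomic test-and-increment stays correct:

    /* Hypothetical v6 direction in kernel/bpf/stackmap.c. */
    preempt_disable();
    entry = get_callchain_entry(&rctx);
    if (entry) {
            /* ... collect the callchain and use it ... */
            put_callchain_entry(rctx);
    }
    preempt_enable();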

* [PATCH bpf-next v5 3/3] bpf: Hold the perf callchain entry until used completely
From: Tao Chen @ 2025-11-09 16:35 UTC
To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
    jolsa, irogers, adrian.hunter, kan.liang
Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

As Alexei noted, get_perf_callchain() return values may be reused if a
task is preempted after the BPF program enters migrate-disable mode.
The perf callchain machinery keeps only a small per-CPU stack of
entries, and we can use it as follows:

1. get the perf callchain entry
2. BPF use ...
3. put the perf callchain entry

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 kernel/bpf/stackmap.c | 62 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 50 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 2365541c81d..58b4432ab00 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -210,13 +210,12 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 }
 
 static struct perf_callchain_entry *
-get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
+get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth)
 {
 #ifdef CONFIG_STACKTRACE
 	struct perf_callchain_entry *entry;
-	int rctx;
 
-	entry = get_callchain_entry(&rctx);
+	entry = get_callchain_entry(rctx);
 	if (!entry)
 		return NULL;
 
@@ -238,8 +237,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
 		to[i] = (u64)(from[i]);
 	}
 
-	put_callchain_entry(rctx);
-
 	return entry;
 #else /* CONFIG_STACKTRACE */
 	return NULL;
@@ -320,6 +317,31 @@ static long __bpf_get_stackid(struct bpf_map *map,
 	return id;
 }
 
+static struct perf_callchain_entry *
+bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
+		       int max_stack, bool crosstask)
+{
+	struct perf_callchain_entry_ctx ctx;
+	struct perf_callchain_entry *entry;
+
+	entry = get_callchain_entry(rctx);
+	if (unlikely(!entry))
+		return NULL;
+
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, false);
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs);
+
+	return entry;
+}
+
+static void bpf_put_perf_callchain(int rctx)
+{
+	put_callchain_entry(rctx);
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
@@ -328,20 +350,24 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	struct perf_callchain_entry *trace;
 	bool kernel = !user;
 	u32 max_depth;
+	int rctx, ret;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
 		return -EINVAL;
 
 	max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+	trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth,
+				       false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
 
-	return __bpf_get_stackid(map, trace, flags);
+	ret = __bpf_get_stackid(map, trace, flags);
+	bpf_put_perf_callchain(rctx);
+
+	return ret;
 }
 
 const struct bpf_func_proto bpf_get_stackid_proto = {
@@ -435,6 +461,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	bool kernel = !user;
 	int err = -EINVAL;
 	u64 *ips;
+	int rctx;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_USER_BUILD_ID)))
@@ -467,18 +494,26 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = trace_in;
 		trace->nr = min_t(u32, trace->nr, max_depth);
 	} else if (kernel && task) {
-		trace = get_callchain_entry_for_task(task, max_depth);
+		trace = get_callchain_entry_for_task(&rctx, task, max_depth);
 	} else {
-		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+		trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth,
+					       crosstask);
 	}
 
-	if (unlikely(!trace) || trace->nr < skip) {
+	if (unlikely(!trace)) {
 		if (may_fault)
 			rcu_read_unlock();
 		goto err_fault;
 	}
 
+	if (trace->nr < skip) {
+		if (may_fault)
+			rcu_read_unlock();
+		if (!trace_in)
+			bpf_put_perf_callchain(rctx);
+		goto err_fault;
+	}
+
 	trace_nr = trace->nr - skip;
 	copy_len = trace_nr * elem_size;
 
@@ -497,6 +532,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	if (may_fault)
 		rcu_read_unlock();
 
+	if (!trace_in)
+		bpf_put_perf_callchain(rctx);
+
 	if (user_build_id)
 		stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);
 
-- 
2.48.1
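
The resulting lifetime rule, condensed from __bpf_get_stack() above
(sketch; error paths trimmed, and stack_map_consume() is a hypothetical
stand-in for the copy-out logic):

    trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, crosstask);
    if (unlikely(!trace))
            return -EFAULT;

    /* trace->ip[] stays valid here: the rctx slot is still held, so a
     * preempting program on this CPU gets the next stacked entry
     * instead of overwriting this one.
     */
    err = stack_map_consume(trace);

    bpf_put_perf_callchain(rctx);   /* only now may the slot be reused */
    return err;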