bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v4 0/2] Pass external callchain entry to get_perf_callchain
@ 2025-10-28 16:25 Tao Chen
  2025-10-28 16:25 ` [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain Tao Chen
  2025-10-28 16:25 ` [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
  0 siblings, 2 replies; 11+ messages in thread
From: Tao Chen @ 2025-10-28 16:25 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
	andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

Background
==========
Alexei noted we should use preempt_disable to protect get_perf_callchain
in bpf stackmap.
https://lore.kernel.org/bpf/CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com

A previous patch was submitted attempting to fix this issue, and Andrii
suggested teaching get_perf_callchain to let us pass that buffer directly
to avoid the unnecessary copy.
https://lore.kernel.org/bpf/20250926153952.1661146-1-chen.dylane@linux.dev

Proposed Solution
=================
Add an external perf_callchain_entry parameter to get_perf_callchain to
allow us to use an external buffer from the BPF side. The biggest advantage
is that it avoids unnecessary copies.
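
As the patches below implement it, the usage pattern on the BPF side
roughly becomes the following (sketch only; max_depth and regs come
from the calling context):

        int rctx;
        struct perf_callchain_entry *entry;
        struct perf_callchain_entry_ctx ctx;

        /* 1. grab a per-CPU perf callchain entry */
        entry = get_callchain_entry(&rctx);
        if (entry) {
                __init_perf_callchain_ctx(&ctx, entry, max_depth, false);
                __get_perf_callchain_kernel(&ctx, regs);
                __get_perf_callchain_user(&ctx, regs);
                /* 2. BPF consumes entry->ip[] here ... */
                /* 3. release the entry once it is no longer needed */
                put_callchain_entry(rctx);
        }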

Todo
====
But I'm not sure if this modification is appropriate. After all, the
implementation of get_callchain_entry in the perf subsystem seems much more
complex than directly using an external buffer.

Comments and suggestions are always welcome.

Change list:
 - v1 -> v2:
   From Jiri
   - rebase code, fix conflict
 - v1: https://lore.kernel.org/bpf/20251013174721.2681091-1-chen.dylane@linux.dev
 
 - v2 -> v3:
   From Andrii
   - entries per CPU used in a stack-like fashion
 - v2: https://lore.kernel.org/bpf/20251014100128.2721104-1-chen.dylane@linux.dev

 - v3 -> v4:
   From Peter
   - refactor get_perf_callchain and add three new APIs to use perf
     callchain easily.
   From Andrii
   - reuse the perf callchain management.

   - rename patch1 and patch2.
 - v3: https://lore.kernel.org/bpf/20251019170118.2955346-1-chen.dylane@linux.dev

Tao Chen (2):
  perf: Refactor get_perf_callchain
  bpf: Hold the perf callchain entry until used completely

 include/linux/perf_event.h | 11 +++++-
 kernel/bpf/stackmap.c      | 61 ++++++++++++++++++++++++-------
 kernel/events/callchain.c  | 75 ++++++++++++++++++++++++--------------
 kernel/events/core.c       |  2 +-
 4 files changed, 107 insertions(+), 42 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain
  2025-10-28 16:25 [PATCH bpf-next v4 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
@ 2025-10-28 16:25 ` Tao Chen
  2025-10-28 17:09   ` bot+bpf-ci
  2025-11-05 20:45   ` Yonghong Song
  2025-10-28 16:25 ` [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
  1 sibling, 2 replies; 11+ messages in thread
From: Tao Chen @ 2025-10-28 16:25 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
	andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

From BPF stack map, we want to use our own buffers to avoid
an unnecessary copy and to ensure that the buffer will not be
overwritten by other preempting tasks. Peter suggested
providing more flexible stack-sampling APIs, which can be used
by BPF, and we can still use the perf callchain entry with
the help of these APIs. The next patch will modify the BPF part.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 include/linux/perf_event.h | 11 +++++-
 kernel/bpf/stackmap.c      |  4 +-
 kernel/events/callchain.c  | 75 ++++++++++++++++++++++++--------------
 kernel/events/core.c       |  2 +-
 4 files changed, 61 insertions(+), 31 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b9..14a382cad1d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
 	u32				nr;
 	short				contexts;
 	bool				contexts_maxed;
+	bool				add_mark;
 };
 
 typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
@@ -1718,9 +1719,17 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+
+extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
+				      struct perf_callchain_entry *entry,
+				      u32 max_stack, bool add_mark);
+
+extern void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
+extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
+
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 4d53cdd1374..e28b35c7e0b 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31f..2c36e490625 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -216,13 +216,54 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 #endif
 }
 
+void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
+			       struct perf_callchain_entry *entry,
+			       u32 max_stack, bool add_mark)
+
+{
+	ctx->entry		= entry;
+	ctx->max_stack		= max_stack;
+	ctx->nr			= entry->nr = 0;
+	ctx->contexts		= 0;
+	ctx->contexts_maxed	= false;
+	ctx->add_mark		= add_mark;
+}
+
+void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
+{
+	if (user_mode(regs))
+		return;
+
+	if (ctx->add_mark)
+		perf_callchain_store_context(ctx, PERF_CONTEXT_KERNEL);
+	perf_callchain_kernel(ctx, regs);
+}
+
+void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
+{
+	int start_entry_idx;
+
+	if (!user_mode(regs)) {
+		if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+			return;
+		regs = task_pt_regs(current);
+	}
+
+	if (ctx->add_mark)
+		perf_callchain_store_context(ctx, PERF_CONTEXT_USER);
+
+	start_entry_idx = ctx->nr;
+	perf_callchain_user(ctx, regs);
+	fixup_uretprobe_trampoline_entries(ctx->entry, start_entry_idx);
+}
+
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
-	int rctx, start_entry_idx;
+	int rctx;
 
 	/* crosstask is not supported for user stacks */
 	if (crosstask && user && !kernel)
@@ -232,34 +273,14 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry		= entry;
-	ctx.max_stack		= max_stack;
-	ctx.nr			= entry->nr = 0;
-	ctx.contexts		= 0;
-	ctx.contexts_maxed	= false;
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, true);
 
-	if (kernel && !user_mode(regs)) {
-		if (add_mark)
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
-		perf_callchain_kernel(&ctx, regs);
-	}
-
-	if (user && !crosstask) {
-		if (!user_mode(regs)) {
-			if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
-				goto exit_put;
-			regs = task_pt_regs(current);
-		}
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
 
-		if (add_mark)
-			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
-
-		start_entry_idx = entry->nr;
-		perf_callchain_user(&ctx, regs);
-		fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
-	}
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs);
 
-exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7541f6f85fc..eb0f110593d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8218,7 +8218,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 		return &__empty_callchain;
 
 	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+				       max_stack, crosstask);
 	return callchain ?: &__empty_callchain;
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely
  2025-10-28 16:25 [PATCH bpf-next v4 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
  2025-10-28 16:25 ` [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain Tao Chen
@ 2025-10-28 16:25 ` Tao Chen
  2025-11-05 22:16   ` Yonghong Song
  1 sibling, 1 reply; 11+ messages in thread
From: Tao Chen @ 2025-10-28 16:25 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, mark.rutland, alexander.shishkin,
	jolsa, irogers, adrian.hunter, kan.liang, song, ast, daniel,
	andrii, martin.lau, eddyz87, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf, Tao Chen

As Alexei noted, the entry returned by get_perf_callchain() may be
reused if a task is preempted after the BPF program enters the
migrate-disabled section. The perf_callchain_entries buffer keeps a
small per-CPU stack of entries, and we can use it as follows:

1. get the perf callchain entry
2. BPF uses it ...
3. put the perf callchain entry
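
Concretely, with the helpers added in this patch, the three steps map
onto the bpf_get_stackid() path roughly as follows (sketch, see the
diff below for the real call sites):

        /* 1. get the perf callchain entry */
        trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, false);

        /* 2. BPF use: hash the trace for the stack map (or copy it out) */
        ret = __bpf_get_stackid(map, trace, flags);

        /* 3. put the perf callchain entry */
        bpf_put_callchain_entry(rctx);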

Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 kernel/bpf/stackmap.c | 61 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 13 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index e28b35c7e0b..70d38249083 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -188,13 +188,12 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 }
 
 static struct perf_callchain_entry *
-get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
+get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth)
 {
 #ifdef CONFIG_STACKTRACE
 	struct perf_callchain_entry *entry;
-	int rctx;
 
-	entry = get_callchain_entry(&rctx);
+	entry = get_callchain_entry(rctx);
 
 	if (!entry)
 		return NULL;
@@ -216,8 +215,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
 			to[i] = (u64)(from[i]);
 	}
 
-	put_callchain_entry(rctx);
-
 	return entry;
 #else /* CONFIG_STACKTRACE */
 	return NULL;
@@ -297,6 +294,31 @@ static long __bpf_get_stackid(struct bpf_map *map,
 	return id;
 }
 
+static struct perf_callchain_entry *
+bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
+		       int max_stack, bool crosstask)
+{
+	struct perf_callchain_entry_ctx ctx;
+	struct perf_callchain_entry *entry;
+
+	entry = get_callchain_entry(rctx);
+	if (unlikely(!entry))
+		return NULL;
+
+	__init_perf_callchain_ctx(&ctx, entry, max_stack, false);
+	if (kernel)
+		__get_perf_callchain_kernel(&ctx, regs);
+	if (user && !crosstask)
+		__get_perf_callchain_user(&ctx, regs);
+
+	return entry;
+}
+
+static void bpf_put_callchain_entry(int rctx)
+{
+	put_callchain_entry(rctx);
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
@@ -305,6 +327,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	bool user = flags & BPF_F_USER_STACK;
 	struct perf_callchain_entry *trace;
 	bool kernel = !user;
+	int rctx, ret;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
@@ -314,14 +337,15 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false);
-
+	trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, false);
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
 
-	return __bpf_get_stackid(map, trace, flags);
+	ret = __bpf_get_stackid(map, trace, flags);
+	bpf_put_callchain_entry(rctx);
+
+	return ret;
 }
 
 const struct bpf_func_proto bpf_get_stackid_proto = {
@@ -415,6 +439,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	bool kernel = !user;
 	int err = -EINVAL;
 	u64 *ips;
+	int rctx;
 
 	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
 			       BPF_F_USER_BUILD_ID)))
@@ -449,17 +474,24 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	if (trace_in)
 		trace = trace_in;
 	else if (kernel && task)
-		trace = get_callchain_entry_for_task(task, max_depth);
+		trace = get_callchain_entry_for_task(&rctx, task, max_depth);
 	else
-		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask);
+		trace = bpf_get_perf_callchain(&rctx, regs, kernel, user, max_depth, crosstask);
 
-	if (unlikely(!trace) || trace->nr < skip) {
+	if (unlikely(!trace)) {
 		if (may_fault)
 			rcu_read_unlock();
 		goto err_fault;
 	}
 
+	if (trace->nr < skip) {
+		if (may_fault)
+			rcu_read_unlock();
+		if (!trace_in)
+			bpf_put_callchain_entry(rctx);
+		goto err_fault;
+	}
+
 	trace_nr = trace->nr - skip;
 	trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
 	copy_len = trace_nr * elem_size;
@@ -479,6 +511,9 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	if (may_fault)
 		rcu_read_unlock();
 
+	if (!trace_in)
+		bpf_put_callchain_entry(rctx);
+
 	if (user_build_id)
 		stack_map_get_build_id_offset(buf, trace_nr, user, may_fault);
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain
  2025-10-28 16:25 ` [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain Tao Chen
@ 2025-10-28 17:09   ` bot+bpf-ci
  2025-10-30  2:36     ` Tao Chen
  2025-11-05 20:45   ` Yonghong Song
  1 sibling, 1 reply; 11+ messages in thread
From: bot+bpf-ci @ 2025-10-28 17:09 UTC (permalink / raw)
  To: chen.dylane, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf, chen.dylane, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 5548 bytes --]


```
commit 59cfc61f83ad4ec998f4242ca000c554e3001b47
Author: Tao Chen <chen.dylane@linux.dev>

perf: Refactor get_perf_callchain

This patch refactors get_perf_callchain() by extracting logic into new
helper functions to provide more flexible stack-sampling APIs for BPF.

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fd1d91017..14a382cad 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
>  	u32				nr;
>  	short				contexts;
>  	bool				contexts_maxed;
> +	bool				add_mark;
>  };

[ ... ]

> @@ -1718,9 +1719,17 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
>
>  extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>  extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
> +
> +extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
> +				      struct perf_callchain_entry *entry,
> +				      u32 max_stack, bool add_mark);
> +
> +extern void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
> +extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
> +
>  extern struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -		   u32 max_stack, bool crosstask, bool add_mark);
> +		   u32 max_stack, bool crosstask);

> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 4d53cdd13..e28b35c7e 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>  		max_depth = sysctl_perf_event_max_stack;
>
>  	trace = get_perf_callchain(regs, kernel, user, max_depth,
> -				   false, false);
> +				   false);
                                       ^^^^^ removed last parameter

Previously bpf_get_stackid() passed add_mark=false to get_perf_callchain().
Now the parameter is gone, but get_perf_callchain() hardcodes add_mark=true
when calling __init_perf_callchain_ctx(). This means BPF stack traces will
now include PERF_CONTEXT_KERNEL/PERF_CONTEXT_USER marker values when they
didn't before.

The markers get added by __get_perf_callchain_kernel() and
__get_perf_callchain_user() when ctx->add_mark is true. These markers are
sentinel values (-128 for KERNEL, -512 for USER) that appear in the trace
IP array alongside actual instruction pointers.
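
For illustration, with add_mark=true a mixed kernel+user trace lays out
trace->ip[] roughly like this (constants from
include/uapi/linux/perf_event.h):

        ip[0]     = PERF_CONTEXT_KERNEL   /* (__u64)-128 */
        ip[1..k]  = kernel return addresses
        ip[k+1]   = PERF_CONTEXT_USER     /* (__u64)-512 */
        ip[k+2..] = user return addresses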

>
>  	if (unlikely(!trace))
>  		/* couldn't fetch the stack trace */
> @@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>  		trace = get_callchain_entry_for_task(task, max_depth);
>  	else
>  		trace = get_perf_callchain(regs, kernel, user, max_depth,
> -					   crosstask, false);
> +					   crosstask);
                                                   ^^^^^ same issue here

Same problem in __bpf_get_stack() which was also passing add_mark=false.

This changes BPF stack map behavior in user-visible ways:

1) bpf_get_stackid() hashes the entire trace->ip[] array including any
   markers, so identical stacks will now produce different stack IDs.

2) bpf_get_stack() copies trace->ip[] directly to userspace buffers,
   so BPF programs will see marker values mixed with instruction pointers.

Can BPF tools handle these unexpected marker values? The commit message
says "The next patch will modify the BPF part" but doesn't explain this
intermediate behavior change.

> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 808c0d7a3..2c36e4906 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c

[ ... ]

>  struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -		   u32 max_stack, bool crosstask, bool add_mark)
> +		   u32 max_stack, bool crosstask)
>  {
>  	struct perf_callchain_entry *entry;
>  	struct perf_callchain_entry_ctx ctx;
> -	int rctx, start_entry_idx;
> +	int rctx;
>
>  	/* crosstask is not supported for user stacks */
>  	if (crosstask && user && !kernel)
> @@ -232,34 +273,14 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>  	if (!entry)
>  		return NULL;
>
> -	ctx.entry		= entry;
> -	ctx.max_stack		= max_stack;
> -	ctx.nr			= entry->nr = 0;
> -	ctx.contexts		= 0;
> -	ctx.contexts_maxed	= false;
> +	__init_perf_callchain_ctx(&ctx, entry, max_stack, true);
                                                                ^^^^
Here's where add_mark is hardcoded to true for all callers. The old code
allowed callers to control this via parameter. BPF was passing false,
perf_callchain() was passing true. Now both get true unconditionally.

[ ... ]

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7541f6f85..eb0f11059 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8218,7 +8218,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>  		return &__empty_callchain;
>
>  	callchain = get_perf_callchain(regs, kernel, user,
> -				       max_stack, crosstask, true);
> +				       max_stack, crosstask);
>  	return callchain ?: &__empty_callchain;
>  }


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `perf: Refactor get_perf_callchain`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18882141979

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain
  2025-10-28 17:09   ` bot+bpf-ci
@ 2025-10-30  2:36     ` Tao Chen
  0 siblings, 0 replies; 11+ messages in thread
From: Tao Chen @ 2025-10-30  2:36 UTC (permalink / raw)
  To: bot+bpf-ci, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf, martin.lau, clm,
	ihor.solodrai

On 2025/10/29 01:09, bot+bpf-ci@kernel.org wrote:
> 
> ```
> commit 59cfc61f83ad4ec998f4242ca000c554e3001b47
> Author: Tao Chen <chen.dylane@linux.dev>
> 
> perf: Refactor get_perf_callchain
> 
> This patch refactors get_perf_callchain() by extracting logic into new
> helper functions to provide more flexible stack-sampling APIs for BPF.
> 
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index fd1d91017..14a382cad 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
>>   	u32				nr;
>>   	short				contexts;
>>   	bool				contexts_maxed;
>> +	bool				add_mark;
>>   };
> 
> [ ... ]
> 
>> @@ -1718,9 +1719,17 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
>>
>>   extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>>   extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>> +
>> +extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
>> +				      struct perf_callchain_entry *entry,
>> +				      u32 max_stack, bool add_mark);
>> +
>> +extern void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
>> +extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
>> +
>>   extern struct perf_callchain_entry *
>>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>> -		   u32 max_stack, bool crosstask, bool add_mark);
>> +		   u32 max_stack, bool crosstask);
> 
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 4d53cdd13..e28b35c7e 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>>   		max_depth = sysctl_perf_event_max_stack;
>>
>>   	trace = get_perf_callchain(regs, kernel, user, max_depth,
>> -				   false, false);
>> +				   false);
>                                         ^^^^^ removed last parameter
> 
> Previously bpf_get_stackid() passed add_mark=false to get_perf_callchain().
> Now the parameter is gone, but get_perf_callchain() hardcodes add_mark=true
> when calling __init_perf_callchain_ctx(). This means BPF stack traces will
> now include PERF_CONTEXT_KERNEL/PERF_CONTEXT_USER marker values when they
> didn't before.
> 

The entire modification involves changes to two subsystems. To make the
review easier, get_perf_callchain() has simply been adapted here. In
the next patch, BPF no longer uses get_perf_callchain(), and add_mark
will be set to false.

> The markers get added by __get_perf_callchain_kernel() and
> __get_perf_callchain_user() when ctx->add_mark is true. These markers are
> sentinel values (-128 for KERNEL, -512 for USER) that appear in the trace
> IP array alongside actual instruction pointers.
> 
>>
>>   	if (unlikely(!trace))
>>   		/* couldn't fetch the stack trace */
>> @@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>>   		trace = get_callchain_entry_for_task(task, max_depth);
>>   	else
>>   		trace = get_perf_callchain(regs, kernel, user, max_depth,
>> -					   crosstask, false);
>> +					   crosstask);
>                                                     ^^^^^ same issue here
> 
> Same problem in __bpf_get_stack() which was also passing add_mark=false.
> 
> This changes BPF stack map behavior in user-visible ways:
> 
> 1) bpf_get_stackid() hashes the entire trace->ip[] array including any
>     markers, so identical stacks will now produce different stack IDs.
> 
> 2) bpf_get_stack() copies trace->ip[] directly to userspace buffers,
>     so BPF programs will see marker values mixed with instruction pointers.
> 
> Can BPF tools handle these unexpected marker values? The commit message
> says "The next patch will modify the BPF part" but doesn't explain this
> intermediate behavior change.
> 
>> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
>> index 808c0d7a3..2c36e4906 100644
>> --- a/kernel/events/callchain.c
>> +++ b/kernel/events/callchain.c
> 
> [ ... ]
> 
>>   struct perf_callchain_entry *
>>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>> -		   u32 max_stack, bool crosstask, bool add_mark)
>> +		   u32 max_stack, bool crosstask)
>>   {
>>   	struct perf_callchain_entry *entry;
>>   	struct perf_callchain_entry_ctx ctx;
>> -	int rctx, start_entry_idx;
>> +	int rctx;
>>
>>   	/* crosstask is not supported for user stacks */
>>   	if (crosstask && user && !kernel)
>> @@ -232,34 +273,14 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>>   	if (!entry)
>>   		return NULL;
>>
>> -	ctx.entry		= entry;
>> -	ctx.max_stack		= max_stack;
>> -	ctx.nr			= entry->nr = 0;
>> -	ctx.contexts		= 0;
>> -	ctx.contexts_maxed	= false;
>> +	__init_perf_callchain_ctx(&ctx, entry, max_stack, true);
>                                                                  ^^^^
> Here's where add_mark is hardcoded to true for all callers. The old code
> allowed callers to control this via parameter. BPF was passing false,
> perf_callchain() was passing true. Now both get true unconditionally.
> 
> [ ... ]
> 
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 7541f6f85..eb0f11059 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -8218,7 +8218,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>>   		return &__empty_callchain;
>>
>>   	callchain = get_perf_callchain(regs, kernel, user,
>> -				       max_stack, crosstask, true);
>> +				       max_stack, crosstask);
>>   	return callchain ?: &__empty_callchain;
>>   }
> 
> 
> ```
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> In-Reply-To-Subject: `perf: Refactor get_perf_callchain`
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18882141979


-- 
Best Regards
Tao Chen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain
  2025-10-28 16:25 ` [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain Tao Chen
  2025-10-28 17:09   ` bot+bpf-ci
@ 2025-11-05 20:45   ` Yonghong Song
  2025-11-06  3:28     ` Tao Chen
  1 sibling, 1 reply; 11+ messages in thread
From: Yonghong Song @ 2025-11-05 20:45 UTC (permalink / raw)
  To: Tao Chen, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf



On 10/28/25 9:25 AM, Tao Chen wrote:
>  From BPF stack map, we want to use our own buffers to avoid
> unnecessary copy and ensure that the buffer will not be
> overwritten by other preemptive tasks. Peter suggested
> provide more flexible stack-sampling APIs, which can be used
> in BPF, and we can still use the perf callchain entry with
> the help of these APIs. The next patch will modify the BPF part.
>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
> ---
>   include/linux/perf_event.h | 11 +++++-
>   kernel/bpf/stackmap.c      |  4 +-
>   kernel/events/callchain.c  | 75 ++++++++++++++++++++++++--------------
>   kernel/events/core.c       |  2 +-
>   4 files changed, 61 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index fd1d91017b9..14a382cad1d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
>   	u32				nr;
>   	short				contexts;
>   	bool				contexts_maxed;
> +	bool				add_mark;
>   };
>   
>   typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
> @@ -1718,9 +1719,17 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
>   
>   extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>   extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
> +
> +extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
> +				      struct perf_callchain_entry *entry,
> +				      u32 max_stack, bool add_mark);
> +
> +extern void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
> +extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
> +
>   extern struct perf_callchain_entry *
>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -		   u32 max_stack, bool crosstask, bool add_mark);
> +		   u32 max_stack, bool crosstask);
>   extern int get_callchain_buffers(int max_stack);
>   extern void put_callchain_buffers(void);
>   extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 4d53cdd1374..e28b35c7e0b 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>   		max_depth = sysctl_perf_event_max_stack;
>   
>   	trace = get_perf_callchain(regs, kernel, user, max_depth,
> -				   false, false);
> +				   false);

This is not a refactor. Here, the add_mark parameter is removed. The 'add_mark'
value here is expected to be false, but get_perf_callchain(...) now passes
'add_mark' as true to __init_perf_callchain_ctx().

Applying only this patch on top of the bpf-next master branch, we get the
following crash:

[  457.730077] bpf_testmod: oh no, recursing into test_1, recursion_misses 1
[  460.221871] BUG: unable to handle page fault for address: fffa3bfffffff000
[  460.221912] #PF: supervisor read access in kernel mode
[  460.221912] #PF: error_code(0x0000) - not-present page
[  460.221912] PGD 1e0ef1067 P4D 1e0ef0067 PUD 1e0eef067 PMD 1e0eee067 PTE 0
[  460.221912] Oops: Oops: 0000 [#1] SMP KASAN NOPTI
[  460.221912] CPU: 2 UID: 0 PID: 2012 Comm: test_progs Tainted: G        W  OE       6.18.0-rc4-gafe2e8
[  460.221912] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[  460.221912] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-pr4
[  460.221912] RIP: 0010:kasan_check_range+0x183/0x2c0
[  460.221912] Code: 41 bf 08 00 00 00 41 29 ef 4d 01 fb 4d 29 de 4d 89 f4 4d 8d 6c 24 07 4d 85 e4 4d 0fd
[  460.221912] RSP: 0018:ff110001193bfc78 EFLAGS: 00010206
[  460.221912] RAX: ffd1ffffffd5b301 RBX: dffffc0000000001 RCX: ffffffff819a2ecb
[  460.221912] RDX: 0000000000000001 RSI: 00000000ffffffb0 RDI: ffd1ffffffd5b360
[  460.221912] RBP: 0000000000000004 R08: ffd20000ffd5b30f R09: 1ffa40001ffab661
[  460.221912] R10: dffffc0000000000 R11: fffa3bfffffab670 R12: 000000001ffffff2
[  460.221912] R13: 0000000003ff58cc R14: 0000000000053990 R15: 0000000000000000
[  460.221912] FS:  00007f358c6460c0(0000) GS:ff110002384b4000(0000) knlGS:0000000000000000
[  460.221912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  460.221912] CR2: fffa3bfffffff000 CR3: 000000011468c006 CR4: 0000000000371ef0
[  460.221912] Call Trace:
[  460.221912]  <TASK>
[  460.221912]  __asan_memset+0x22/0x50
[  460.221912]  __bpf_get_stack+0x6eb/0x7a0
[  460.221912]  ? bpf_perf_event_output_raw_tp+0x58c/0x6c0
[  460.221912]  bpf_get_stack+0x1d/0x30
[  460.221912]  bpf_get_stack_raw_tp+0x148/0x180
[  460.221912]  bpf_prog_40e346a03dc2914c_bpf_prog1+0x169/0x1af
[  460.221912]  bpf_trace_run2+0x1bc/0x350
[  460.221912]  ? bpf_trace_run2+0x104/0x350
[  460.221912]  ? trace_sys_enter+0x6b/0xf0
[  460.221912]  __bpf_trace_sys_enter+0x38/0x60
[  460.221912]  trace_sys_enter+0xa7/0xf0
[  460.221912]  syscall_trace_enter+0xfc/0x160
[  460.221912]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  460.221912]  do_syscall_64+0x5a/0xfa0
[  460.221912]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  460.221912]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

>   
>   	if (unlikely(!trace))
>   		/* couldn't fetch the stack trace */
> @@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>   		trace = get_callchain_entry_for_task(task, max_depth);
>   	else
>   		trace = get_perf_callchain(regs, kernel, user, max_depth,
> -					   crosstask, false);
> +					   crosstask);
>   
>   	if (unlikely(!trace) || trace->nr < skip) {
>   		if (may_fault)
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 808c0d7a31f..2c36e490625 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -216,13 +216,54 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
>   #endif
>   }
>   
> +void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
> +			       struct perf_callchain_entry *entry,
> +			       u32 max_stack, bool add_mark)
> +
> +{
> +	ctx->entry		= entry;
> +	ctx->max_stack		= max_stack;
> +	ctx->nr			= entry->nr = 0;
> +	ctx->contexts		= 0;
> +	ctx->contexts_maxed	= false;
> +	ctx->add_mark		= add_mark;
> +}
> +
> +void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
> +{
> +	if (user_mode(regs))
> +		return;
> +
> +	if (ctx->add_mark)
> +		perf_callchain_store_context(ctx, PERF_CONTEXT_KERNEL);
> +	perf_callchain_kernel(ctx, regs);
> +}
> +
> +void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, struct pt_regs *regs)
> +{
> +	int start_entry_idx;
> +
> +	if (!user_mode(regs)) {
> +		if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> +			return;
> +		regs = task_pt_regs(current);
> +	}
> +
> +	if (ctx->add_mark)
> +		perf_callchain_store_context(ctx, PERF_CONTEXT_USER);
> +
> +	start_entry_idx = ctx->nr;
> +	perf_callchain_user(ctx, regs);
> +	fixup_uretprobe_trampoline_entries(ctx->entry, start_entry_idx);
> +}
> +
>   struct perf_callchain_entry *
>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -		   u32 max_stack, bool crosstask, bool add_mark)
> +		   u32 max_stack, bool crosstask)
>   {
>   	struct perf_callchain_entry *entry;
>   	struct perf_callchain_entry_ctx ctx;
> -	int rctx, start_entry_idx;
> +	int rctx;
>   
>   	/* crosstask is not supported for user stacks */
>   	if (crosstask && user && !kernel)
> @@ -232,34 +273,14 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>   	if (!entry)
>   		return NULL;
>   
> -	ctx.entry		= entry;
> -	ctx.max_stack		= max_stack;
> -	ctx.nr			= entry->nr = 0;
> -	ctx.contexts		= 0;
> -	ctx.contexts_maxed	= false;
> +	__init_perf_callchain_ctx(&ctx, entry, max_stack, true);
>   
> -	if (kernel && !user_mode(regs)) {
> -		if (add_mark)
> -			perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
> -		perf_callchain_kernel(&ctx, regs);
> -	}
> -
> -	if (user && !crosstask) {
> -		if (!user_mode(regs)) {
> -			if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
> -				goto exit_put;
> -			regs = task_pt_regs(current);
> -		}
> +	if (kernel)
> +		__get_perf_callchain_kernel(&ctx, regs);
>   
> -		if (add_mark)
> -			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
> -
> -		start_entry_idx = entry->nr;
> -		perf_callchain_user(&ctx, regs);
> -		fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
> -	}
> +	if (user && !crosstask)
> +		__get_perf_callchain_user(&ctx, regs);
>   
> -exit_put:
>   	put_callchain_entry(rctx);
>   
>   	return entry;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7541f6f85fc..eb0f110593d 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8218,7 +8218,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>   		return &__empty_callchain;
>   
>   	callchain = get_perf_callchain(regs, kernel, user,
> -				       max_stack, crosstask, true);
> +				       max_stack, crosstask);
>   	return callchain ?: &__empty_callchain;
>   }
>   


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely
  2025-10-28 16:25 ` [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
@ 2025-11-05 22:16   ` Yonghong Song
  2025-11-06  5:12     ` Tao Chen
  0 siblings, 1 reply; 11+ messages in thread
From: Yonghong Song @ 2025-11-05 22:16 UTC (permalink / raw)
  To: Tao Chen, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf



On 10/28/25 9:25 AM, Tao Chen wrote:
> As Alexei noted, get_perf_callchain() return values may be reused
> if a task is preempted after the BPF program enters migrate disable
> mode. The perf_callchain_entres has a small stack of entries, and
> we can reuse it as follows:
>
> 1. get the perf callchain entry
> 2. BPF use...
> 3. put the perf callchain entry
>
> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
> ---
>   kernel/bpf/stackmap.c | 61 ++++++++++++++++++++++++++++++++++---------
>   1 file changed, 48 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index e28b35c7e0b..70d38249083 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -188,13 +188,12 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>   }
>   
>   static struct perf_callchain_entry *
> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 max_depth)
>   {
>   #ifdef CONFIG_STACKTRACE
>   	struct perf_callchain_entry *entry;
> -	int rctx;
>   
> -	entry = get_callchain_entry(&rctx);
> +	entry = get_callchain_entry(rctx);
>   
>   	if (!entry)
>   		return NULL;
> @@ -216,8 +215,6 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
>   			to[i] = (u64)(from[i]);
>   	}
>   
> -	put_callchain_entry(rctx);
> -
>   	return entry;
>   #else /* CONFIG_STACKTRACE */
>   	return NULL;
> @@ -297,6 +294,31 @@ static long __bpf_get_stackid(struct bpf_map *map,
>   	return id;
>   }
>   
> +static struct perf_callchain_entry *
> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
> +		       int max_stack, bool crosstask)
> +{
> +	struct perf_callchain_entry_ctx ctx;
> +	struct perf_callchain_entry *entry;
> +
> +	entry = get_callchain_entry(rctx);

I think this may not work. Let us say we have two bpf programs,
both pinned to a particular cpu (migrate disabled but preempt enabled).
get_callchain_entry() calls get_recursion_context() to get the
buffer for a particular level.

static inline int get_recursion_context(u8 *recursion)
{
         unsigned char rctx = interrupt_context_level();
         
         if (recursion[rctx])
                 return -1;
         
         recursion[rctx]++;
         barrier();
         
         return rctx;
}

It is possible that both tasks (at process level) may
reach the point right before "recursion[rctx]++;".
In such a case, both tasks will be able to get the
buffer, and this is not right.

To fix this, we either need preempt disable on the
bpf side, or maybe some kind of atomic operation
(cmpxchg or similar), or maybe a preempt disable
between the if statement and recursion[rctx]++,
so that only one task can get the buffer?
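
For example, the cmpxchg variant could look roughly like this (untested
sketch, assuming the architecture supports byte-sized cmpxchg):

static inline int get_recursion_context(u8 *recursion)
{
        unsigned char rctx = interrupt_context_level();

        /* Claim the slot atomically so that two tasks preempting each
         * other on the same CPU cannot both observe recursion[rctx] == 0.
         */
        if (cmpxchg(&recursion[rctx], 0, 1) != 0)
                return -1;

        barrier();

        return rctx;
}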


> +	if (unlikely(!entry))
> +		return NULL;
> +
> +	__init_perf_callchain_ctx(&ctx, entry, max_stack, false);
> +	if (kernel)
> +		__get_perf_callchain_kernel(&ctx, regs);
> +	if (user && !crosstask)
> +		__get_perf_callchain_user(&ctx, regs);
> +
> +	return entry;
> +}
> +
> +static void bpf_put_callchain_entry(int rctx)

we have bpf_get_perf_callchain(), maybe rename the above
to bpf_put_perf_callchain()?

> +{
> +	put_callchain_entry(rctx);
> +}
> +

[...]


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain
  2025-11-05 20:45   ` Yonghong Song
@ 2025-11-06  3:28     ` Tao Chen
  0 siblings, 0 replies; 11+ messages in thread
From: Tao Chen @ 2025-11-06  3:28 UTC (permalink / raw)
  To: Yonghong Song, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf

On 2025/11/6 04:45, Yonghong Song wrote:
> 
> 
> On 10/28/25 9:25 AM, Tao Chen wrote:
>>  From BPF stack map, we want to use our own buffers to avoid
>> unnecessary copy and ensure that the buffer will not be
>> overwritten by other preemptive tasks. Peter suggested
>> provide more flexible stack-sampling APIs, which can be used
>> in BPF, and we can still use the perf callchain entry with
>> the help of these APIs. The next patch will modify the BPF part.
>>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
>> ---
>>   include/linux/perf_event.h | 11 +++++-
>>   kernel/bpf/stackmap.c      |  4 +-
>>   kernel/events/callchain.c  | 75 ++++++++++++++++++++++++--------------
>>   kernel/events/core.c       |  2 +-
>>   4 files changed, 61 insertions(+), 31 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index fd1d91017b9..14a382cad1d 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -67,6 +67,7 @@ struct perf_callchain_entry_ctx {
>>       u32                nr;
>>       short                contexts;
>>       bool                contexts_maxed;
>> +    bool                add_mark;
>>   };
>>   typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
>> @@ -1718,9 +1719,17 @@ DECLARE_PER_CPU(struct perf_callchain_entry, 
>> perf_callchain_entry);
>>   extern void perf_callchain_user(struct perf_callchain_entry_ctx 
>> *entry, struct pt_regs *regs);
>>   extern void perf_callchain_kernel(struct perf_callchain_entry_ctx 
>> *entry, struct pt_regs *regs);
>> +
>> +extern void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx 
>> *ctx,
>> +                      struct perf_callchain_entry *entry,
>> +                      u32 max_stack, bool add_mark);
>> +
>> +extern void __get_perf_callchain_kernel(struct 
>> perf_callchain_entry_ctx *ctx, struct pt_regs *regs);
>> +extern void __get_perf_callchain_user(struct perf_callchain_entry_ctx 
>> *ctx, struct pt_regs *regs);
>> +
>>   extern struct perf_callchain_entry *
>>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>> -           u32 max_stack, bool crosstask, bool add_mark);
>> +           u32 max_stack, bool crosstask);
>>   extern int get_callchain_buffers(int max_stack);
>>   extern void put_callchain_buffers(void);
>>   extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 4d53cdd1374..e28b35c7e0b 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, 
>> regs, struct bpf_map *, map,
>>           max_depth = sysctl_perf_event_max_stack;
>>       trace = get_perf_callchain(regs, kernel, user, max_depth,
>> -                   false, false);
>> +                   false);
> 
> This is not a refactor. Here, the add_mark parameter is removed. The 
> 'add_mark'
> value here is expected to be false, but later get_perf_callchain(...) 
> has 'add_mark'
> is true in __init_perf_callchain_ctx().
> 

Hi Yonghong,

Thanks for your report, you are right. Maybe we should keep the 
get_perf_callchain parameters unchanged. I will change it in v5.
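
A rough sketch of that direction (illustrative only; v5 may differ):
keep the add_mark parameter on get_perf_callchain() and simply forward
it to __init_perf_callchain_ctx(), so existing callers keep their old
behavior:

struct perf_callchain_entry *
get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
                   u32 max_stack, bool crosstask, bool add_mark)
{
        struct perf_callchain_entry *entry;
        struct perf_callchain_entry_ctx ctx;
        int rctx;

        /* crosstask is not supported for user stacks */
        if (crosstask && user && !kernel)
                return NULL;

        entry = get_callchain_entry(&rctx);
        if (!entry)
                return NULL;

        __init_perf_callchain_ctx(&ctx, entry, max_stack, add_mark);

        if (kernel)
                __get_perf_callchain_kernel(&ctx, regs);
        if (user && !crosstask)
                __get_perf_callchain_user(&ctx, regs);

        put_callchain_entry(rctx);

        return entry;
}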

> Applying this patch only on top of bpf-next master branch, we will have the
> following crash:
> 
> [  457.730077] bpf_testmod: oh no, recursing into test_1, 
> recursion_misses 1
> [  460.221871] BUG: unable to handle page fault for address: 
> fffa3bfffffff000
> [  460.221912] #PF: supervisor read access in kernel mode
> [  460.221912] #PF: error_code(0x0000) - not-present page
> [  460.221912] PGD 1e0ef1067 P4D 1e0ef0067 PUD 1e0eef067 PMD 1e0eee067 
> PTE 0
> [  460.221912] Oops: Oops: 0000 [#1] SMP KASAN NOPTI
> [  460.221912] CPU: 2 UID: 0 PID: 2012 Comm: test_progs Tainted: 
> G        W  OE       6.18.0-rc4-gafe2e8
> [  460.221912] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [  460.221912] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS rel-1.14.0-0-g155821a1990b-pr4
> [  460.221912] RIP: 0010:kasan_check_range+0x183/0x2c0
> [  460.221912] Code: 41 bf 08 00 00 00 41 29 ef 4d 01 fb 4d 29 de 4d 89 
> f4 4d 8d 6c 24 07 4d 85 e4 4d 0fd
> [  460.221912] RSP: 0018:ff110001193bfc78 EFLAGS: 00010206
> [  460.221912] RAX: ffd1ffffffd5b301 RBX: dffffc0000000001 RCX: 
> ffffffff819a2ecb
> [  460.221912] RDX: 0000000000000001 RSI: 00000000ffffffb0 RDI: 
> ffd1ffffffd5b360
> [  460.221912] RBP: 0000000000000004 R08: ffd20000ffd5b30f R09: 
> 1ffa40001ffab661
> [  460.221912] R10: dffffc0000000000 R11: fffa3bfffffab670 R12: 
> 000000001ffffff2
> [  460.221912] R13: 0000000003ff58cc R14: 0000000000053990 R15: 
> 0000000000000000
> [  460.221912] FS:  00007f358c6460c0(0000) GS:ff110002384b4000(0000) 
> knlGS:0000000000000000
> [  460.221912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  460.221912] CR2: fffa3bfffffff000 CR3: 000000011468c006 CR4: 
> 0000000000371ef0
> [  460.221912] Call Trace:
> [  460.221912]  <TASK>
> [  460.221912]  __asan_memset+0x22/0x50
> [  460.221912]  __bpf_get_stack+0x6eb/0x7a0
> [  460.221912]  ? bpf_perf_event_output_raw_tp+0x58c/0x6c0
> [  460.221912]  bpf_get_stack+0x1d/0x30
> [  460.221912]  bpf_get_stack_raw_tp+0x148/0x180
> [  460.221912]  bpf_prog_40e346a03dc2914c_bpf_prog1+0x169/0x1af
> [  460.221912]  bpf_trace_run2+0x1bc/0x350
> [  460.221912]  ? bpf_trace_run2+0x104/0x350
> [  460.221912]  ? trace_sys_enter+0x6b/0xf0
> [  460.221912]  __bpf_trace_sys_enter+0x38/0x60
> [  460.221912]  trace_sys_enter+0xa7/0xf0
> [  460.221912]  syscall_trace_enter+0xfc/0x160
> [  460.221912]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  460.221912]  do_syscall_64+0x5a/0xfa0
> [  460.221912]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  460.221912]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
>>       if (unlikely(!trace))
>>           /* couldn't fetch the stack trace */
>> @@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, 
>> struct task_struct *task,
>>           trace = get_callchain_entry_for_task(task, max_depth);
>>       else
>>           trace = get_perf_callchain(regs, kernel, user, max_depth,
>> -                       crosstask, false);
>> +                       crosstask);
>>       if (unlikely(!trace) || trace->nr < skip) {
>>           if (may_fault)
>> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
>> index 808c0d7a31f..2c36e490625 100644
>> --- a/kernel/events/callchain.c
>> +++ b/kernel/events/callchain.c
>> @@ -216,13 +216,54 @@ static void 
>> fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
>>   #endif
>>   }
>> +void __init_perf_callchain_ctx(struct perf_callchain_entry_ctx *ctx,
>> +                   struct perf_callchain_entry *entry,
>> +                   u32 max_stack, bool add_mark)
>> +
>> +{
>> +    ctx->entry        = entry;
>> +    ctx->max_stack        = max_stack;
>> +    ctx->nr            = entry->nr = 0;
>> +    ctx->contexts        = 0;
>> +    ctx->contexts_maxed    = false;
>> +    ctx->add_mark        = add_mark;
>> +}
>> +
>> +void __get_perf_callchain_kernel(struct perf_callchain_entry_ctx 
>> *ctx, struct pt_regs *regs)
>> +{
>> +    if (user_mode(regs))
>> +        return;
>> +
>> +    if (ctx->add_mark)
>> +        perf_callchain_store_context(ctx, PERF_CONTEXT_KERNEL);
>> +    perf_callchain_kernel(ctx, regs);
>> +}
>> +
>> +void __get_perf_callchain_user(struct perf_callchain_entry_ctx *ctx, 
>> struct pt_regs *regs)
>> +{
>> +    int start_entry_idx;
>> +
>> +    if (!user_mode(regs)) {
>> +        if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
>> +            return;
>> +        regs = task_pt_regs(current);
>> +    }
>> +
>> +    if (ctx->add_mark)
>> +        perf_callchain_store_context(ctx, PERF_CONTEXT_USER);
>> +
>> +    start_entry_idx = ctx->nr;
>> +    perf_callchain_user(ctx, regs);
>> +    fixup_uretprobe_trampoline_entries(ctx->entry, start_entry_idx);
>> +}
>> +
>>   struct perf_callchain_entry *
>>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>> -           u32 max_stack, bool crosstask, bool add_mark)
>> +           u32 max_stack, bool crosstask)
>>   {
>>       struct perf_callchain_entry *entry;
>>       struct perf_callchain_entry_ctx ctx;
>> -    int rctx, start_entry_idx;
>> +    int rctx;
>>       /* crosstask is not supported for user stacks */
>>       if (crosstask && user && !kernel)
>> @@ -232,34 +273,14 @@ get_perf_callchain(struct pt_regs *regs, bool 
>> kernel, bool user,
>>       if (!entry)
>>           return NULL;
>> -    ctx.entry        = entry;
>> -    ctx.max_stack        = max_stack;
>> -    ctx.nr            = entry->nr = 0;
>> -    ctx.contexts        = 0;
>> -    ctx.contexts_maxed    = false;
>> +    __init_perf_callchain_ctx(&ctx, entry, max_stack, true);
>> -    if (kernel && !user_mode(regs)) {
>> -        if (add_mark)
>> -            perf_callchain_store_context(&ctx, PERF_CONTEXT_KERNEL);
>> -        perf_callchain_kernel(&ctx, regs);
>> -    }
>> -
>> -    if (user && !crosstask) {
>> -        if (!user_mode(regs)) {
>> -            if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
>> -                goto exit_put;
>> -            regs = task_pt_regs(current);
>> -        }
>> +    if (kernel)
>> +        __get_perf_callchain_kernel(&ctx, regs);
>> -        if (add_mark)
>> -            perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>> -
>> -        start_entry_idx = entry->nr;
>> -        perf_callchain_user(&ctx, regs);
>> -        fixup_uretprobe_trampoline_entries(entry, start_entry_idx);
>> -    }
>> +    if (user && !crosstask)
>> +        __get_perf_callchain_user(&ctx, regs);
>> -exit_put:
>>       put_callchain_entry(rctx);
>>       return entry;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 7541f6f85fc..eb0f110593d 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -8218,7 +8218,7 @@ perf_callchain(struct perf_event *event, struct 
>> pt_regs *regs)
>>           return &__empty_callchain;
>>       callchain = get_perf_callchain(regs, kernel, user,
>> -                       max_stack, crosstask, true);
>> +                       max_stack, crosstask);
>>       return callchain ?: &__empty_callchain;
>>   }
> 


-- 
Best Regards
Tao Chen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely
  2025-11-05 22:16   ` Yonghong Song
@ 2025-11-06  5:12     ` Tao Chen
  2025-11-06  6:20       ` Yonghong Song
  0 siblings, 1 reply; 11+ messages in thread
From: Tao Chen @ 2025-11-06  5:12 UTC (permalink / raw)
  To: Yonghong Song, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf

On 2025/11/6 06:16, Yonghong Song wrote:
> 
> 
> On 10/28/25 9:25 AM, Tao Chen wrote:
>> As Alexei noted, get_perf_callchain() return values may be reused
>> if a task is preempted after the BPF program enters migrate disable
>> mode. The perf_callchain_entres has a small stack of entries, and
>> we can reuse it as follows:
>>
>> 1. get the perf callchain entry
>> 2. BPF use...
>> 3. put the perf callchain entry
>>
>> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
>> ---
>>   kernel/bpf/stackmap.c | 61 ++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 48 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index e28b35c7e0b..70d38249083 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -188,13 +188,12 @@ static void stack_map_get_build_id_offset(struct 
>> bpf_stack_build_id *id_offs,
>>   }
>>   static struct perf_callchain_entry *
>> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
>> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, u32 
>> max_depth)
>>   {
>>   #ifdef CONFIG_STACKTRACE
>>       struct perf_callchain_entry *entry;
>> -    int rctx;
>> -    entry = get_callchain_entry(&rctx);
>> +    entry = get_callchain_entry(rctx);
>>       if (!entry)
>>           return NULL;
>> @@ -216,8 +215,6 @@ get_callchain_entry_for_task(struct task_struct 
>> *task, u32 max_depth)
>>               to[i] = (u64)(from[i]);
>>       }
>> -    put_callchain_entry(rctx);
>> -
>>       return entry;
>>   #else /* CONFIG_STACKTRACE */
>>       return NULL;
>> @@ -297,6 +294,31 @@ static long __bpf_get_stackid(struct bpf_map *map,
>>       return id;
>>   }
>> +static struct perf_callchain_entry *
>> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, 
>> bool user,
>> +               int max_stack, bool crosstask)
>> +{
>> +    struct perf_callchain_entry_ctx ctx;
>> +    struct perf_callchain_entry *entry;
>> +
>> +    entry = get_callchain_entry(rctx);
> 
> I think this may not work. Let us say we have two bpf programs
> both pinned to a particular cpu (migrate disabled but preempt enabled).
> get_callchain_entry() calls get_recursion_context() to get the
> buffer for a particulart level.
> 
> static inline int get_recursion_context(u8 *recursion)
> {
>          unsigned char rctx = interrupt_context_level();
>          if (recursion[rctx])
>                  return -1;
>          recursion[rctx]++;
>          barrier();
>          return rctx;
> }
> 
> It is possible that both tasks (at process level) may
> reach right before "recursion[rctx]++;".
> In such cases, both tasks will be able to get
> buffer and this is not right.
> 
> To fix this, we either need to have preempt disable
> in bpf side, or maybe we have some kind of atomic
> operation (cmpxchg or similar things), or maybe
> has a preempt disable between if statement and recursion[rctx]++,
> so only one task can get buffer?
> 

Thanks for the reminder. Can we add preempt disable before and after
get_callchain_entry() to avoid affecting the original perf code paths?

Regarding preemption by multiple tasks: if the entry has not been released
via put_callchain_entry(), it appears that perf's buffer does not support
recording a second task at the same level, so it returns directly here:

           if (recursion[rctx])
                   return -1;
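
As for the preempt disable idea above, something along these lines
(untested sketch, based on the bpf_get_perf_callchain() added in patch 2):

static struct perf_callchain_entry *
bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool kernel, bool user,
                       int max_stack, bool crosstask)
{
        struct perf_callchain_entry_ctx ctx;
        struct perf_callchain_entry *entry;

        /* Disable preemption only around the entry acquisition, so that
         * another task preempting us on the same CPU cannot race the
         * recursion counter check in get_recursion_context().
         */
        preempt_disable();
        entry = get_callchain_entry(rctx);
        preempt_enable();
        if (unlikely(!entry))
                return NULL;

        __init_perf_callchain_ctx(&ctx, entry, max_stack, false);
        if (kernel)
                __get_perf_callchain_kernel(&ctx, regs);
        if (user && !crosstask)
                __get_perf_callchain_user(&ctx, regs);

        return entry;
}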

> 
>> +    if (unlikely(!entry))
>> +        return NULL;
>> +
>> +    __init_perf_callchain_ctx(&ctx, entry, max_stack, false);
>> +    if (kernel)
>> +        __get_perf_callchain_kernel(&ctx, regs);
>> +    if (user && !crosstask)
>> +        __get_perf_callchain_user(&ctx, regs);
>> +
>> +    return entry;
>> +}
>> +
>> +static void bpf_put_callchain_entry(int rctx)
> 
> we have bpf_get_perf_callchain(), maybe rename the above
> to bpf_put_perf_callchain()?
> 

Ack, thanks.

>> +{
>> +    put_callchain_entry(rctx);
>> +}
>> +
> 
> [...]
> 


-- 
Best Regards
Tao Chen

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely
  2025-11-06  5:12     ` Tao Chen
@ 2025-11-06  6:20       ` Yonghong Song
  2025-11-06  7:08         ` Tao Chen
  0 siblings, 1 reply; 11+ messages in thread
From: Yonghong Song @ 2025-11-06  6:20 UTC (permalink / raw)
  To: Tao Chen, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf



On 11/5/25 9:12 PM, Tao Chen wrote:
> On 2025/11/6 06:16, Yonghong Song wrote:
>>
>>
>> On 10/28/25 9:25 AM, Tao Chen wrote:
>>> As Alexei noted, the entry returned by get_perf_callchain() may be
>>> reused if a task is preempted after the BPF program enters migrate
>>> disable mode. The perf_callchain_entries pool has a small stack of
>>> entries, and we can hold an entry as follows:
>>>
>>> 1. get the perf callchain entry
>>> 2. BPF use...
>>> 3. put the perf callchain entry
>>>
>>> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
>>> ---
>>>   kernel/bpf/stackmap.c | 61 
>>> ++++++++++++++++++++++++++++++++++---------
>>>   1 file changed, 48 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>> index e28b35c7e0b..70d38249083 100644
>>> --- a/kernel/bpf/stackmap.c
>>> +++ b/kernel/bpf/stackmap.c
>>> @@ -188,13 +188,12 @@ static void 
>>> stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>>   }
>>>   static struct perf_callchain_entry *
>>> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
>>> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, 
>>> u32 max_depth)
>>>   {
>>>   #ifdef CONFIG_STACKTRACE
>>>       struct perf_callchain_entry *entry;
>>> -    int rctx;
>>> -    entry = get_callchain_entry(&rctx);
>>> +    entry = get_callchain_entry(rctx);
>>>       if (!entry)
>>>           return NULL;
>>> @@ -216,8 +215,6 @@ get_callchain_entry_for_task(struct task_struct 
>>> *task, u32 max_depth)
>>>               to[i] = (u64)(from[i]);
>>>       }
>>> -    put_callchain_entry(rctx);
>>> -
>>>       return entry;
>>>   #else /* CONFIG_STACKTRACE */
>>>       return NULL;
>>> @@ -297,6 +294,31 @@ static long __bpf_get_stackid(struct bpf_map *map,
>>>       return id;
>>>   }
>>> +static struct perf_callchain_entry *
>>> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool 
>>> kernel, bool user,
>>> +               int max_stack, bool crosstask)
>>> +{
>>> +    struct perf_callchain_entry_ctx ctx;
>>> +    struct perf_callchain_entry *entry;
>>> +
>>> +    entry = get_callchain_entry(rctx);
>>
>> I think this may not work. Let us say we have two bpf programs
>> both pinned to a particular cpu (migrate disabled but preempt enabled).
>> get_callchain_entry() calls get_recursion_context() to get the
>> buffer for a particular level.
>>
>> static inline int get_recursion_context(u8 *recursion)
>> {
>>          unsigned char rctx = interrupt_context_level();
>>          if (recursion[rctx])
>>                  return -1;
>>          recursion[rctx]++;
>>          barrier();
>>          return rctx;
>> }
>>
>> It is possible that both tasks (at process level)
>> reach the point right before "recursion[rctx]++;".
>> In that case, both tasks will be able to get the
>> buffer, and this is not right.
>>
>> To fix this, we either need preempt disable on
>> the bpf side, or some kind of atomic operation
>> (cmpxchg or similar), or perhaps a preempt disable
>> between the if statement and recursion[rctx]++,
>> so that only one task can get the buffer?
>>
>
> Thanks for the reminder. Can we add preempt disable/enable before and
> after get_callchain_entry(), so that the original perf code paths are
> not affected?

Yes, we have two get_callchain_entry() call sites:
   bpf/stackmap.c: entry = get_callchain_entry(&rctx);
   events/callchain.c:     entry = get_callchain_entry(&rctx);
We need to have preempt_disable()/preempt_enable() around them.

Another choice would be to add preempt_disable()/enable() around
get_callchain_entry() and get_perf_callchain() only in stackmap.c,
assuming the other users of these two functions run in interrupt
context (softirq, hardirq and nmi) and are therefore okay.
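
i.e. roughly something like this on the bpf side only (just a sketch,
the other callers keep their current behavior):

        preempt_disable();
        entry = get_callchain_entry(&rctx);
        preempt_enable();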

But maybe the following is better?

diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index d9cc57083091..0ccf94315954 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -214,12 +214,9 @@ static inline int get_recursion_context(u8 *recursion)
  {
         unsigned char rctx = interrupt_context_level();
  
-       if (recursion[rctx])
+       if (cmpxchg(&recursion[rctx], 0, 1) != 0)
                 return -1;
  
-       recursion[rctx]++;
-       barrier();
-
         return rctx;
  }
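
With that, the claim becomes a single atomic test-and-set per level:
cmpxchg() returns the old value, so a non-zero result means another
task or context on this cpu already owns that level's entry. The
release path would not need to change, roughly:

        /* claim: only one winner per interrupt-context level */
        if (cmpxchg(&recursion[rctx], 0, 1) != 0)
                return -1;
        ...
        /* release (put_recursion_context()): owner-only, so the
         * existing decrement is still fine
         */
        recursion[rctx]--;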

>
> Regarding preemption by multiple tasks: if the entry has not been
> released via put_callchain_entry(), perf's buffer cannot record a
> second task at the same level, so it simply returns here:
>
>           if (recursion[rctx])
>                   return -1;
>
>>
>>> +    if (unlikely(!entry))
>>> +        return NULL;
>>> +
>>> +    __init_perf_callchain_ctx(&ctx, entry, max_stack, false);
>>> +    if (kernel)
>>> +        __get_perf_callchain_kernel(&ctx, regs);
>>> +    if (user && !crosstask)
>>> +        __get_perf_callchain_user(&ctx, regs);
>>> +
>>> +    return entry;
>>> +}
>>> +
>>> +static void bpf_put_callchain_entry(int rctx)
>>
>> we have bpf_get_perf_callchain(), maybe rename the above
>> to bpf_put_perf_callchain()?
>>
>
> Ack, thanks.
>
>>> +{
>>> +    put_callchain_entry(rctx);
>>> +}
>>> +
>>
>> [...]
>>
>
>


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely
  2025-11-06  6:20       ` Yonghong Song
@ 2025-11-06  7:08         ` Tao Chen
  0 siblings, 0 replies; 11+ messages in thread
From: Tao Chen @ 2025-11-06  7:08 UTC (permalink / raw)
  To: Yonghong Song, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, kan.liang,
	song, ast, daniel, andrii, martin.lau, eddyz87, john.fastabend,
	kpsingh, sdf, haoluo
  Cc: linux-perf-users, linux-kernel, bpf

On 2025/11/6 14:20, Yonghong Song wrote:
> 
> 
> On 11/5/25 9:12 PM, Tao Chen wrote:
>> On 2025/11/6 06:16, Yonghong Song wrote:
>>>
>>>
>>> On 10/28/25 9:25 AM, Tao Chen wrote:
>>>> As Alexei noted, the entry returned by get_perf_callchain() may be
>>>> reused if a task is preempted after the BPF program enters migrate
>>>> disable mode. The perf_callchain_entries pool has a small stack of
>>>> entries, and we can hold an entry as follows:
>>>>
>>>> 1. get the perf callchain entry
>>>> 2. BPF use...
>>>> 3. put the perf callchain entry
>>>>
>>>> Signed-off-by: Tao Chen <chen.dylane@linux.dev>
>>>> ---
>>>>   kernel/bpf/stackmap.c | 61 +++++++++++++++++++++++++++++++++ 
>>>> +---------
>>>>   1 file changed, 48 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>>> index e28b35c7e0b..70d38249083 100644
>>>> --- a/kernel/bpf/stackmap.c
>>>> +++ b/kernel/bpf/stackmap.c
>>>> @@ -188,13 +188,12 @@ static void 
>>>> stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>>>   }
>>>>   static struct perf_callchain_entry *
>>>> -get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
>>>> +get_callchain_entry_for_task(int *rctx, struct task_struct *task, 
>>>> u32 max_depth)
>>>>   {
>>>>   #ifdef CONFIG_STACKTRACE
>>>>       struct perf_callchain_entry *entry;
>>>> -    int rctx;
>>>> -    entry = get_callchain_entry(&rctx);
>>>> +    entry = get_callchain_entry(rctx);
>>>>       if (!entry)
>>>>           return NULL;
>>>> @@ -216,8 +215,6 @@ get_callchain_entry_for_task(struct task_struct 
>>>> *task, u32 max_depth)
>>>>               to[i] = (u64)(from[i]);
>>>>       }
>>>> -    put_callchain_entry(rctx);
>>>> -
>>>>       return entry;
>>>>   #else /* CONFIG_STACKTRACE */
>>>>       return NULL;
>>>> @@ -297,6 +294,31 @@ static long __bpf_get_stackid(struct bpf_map *map,
>>>>       return id;
>>>>   }
>>>> +static struct perf_callchain_entry *
>>>> +bpf_get_perf_callchain(int *rctx, struct pt_regs *regs, bool 
>>>> kernel, bool user,
>>>> +               int max_stack, bool crosstask)
>>>> +{
>>>> +    struct perf_callchain_entry_ctx ctx;
>>>> +    struct perf_callchain_entry *entry;
>>>> +
>>>> +    entry = get_callchain_entry(rctx);
>>>
>>> I think this may not work. Let us say we have two bpf programs
>>> both pinned to a particular cpu (migrate disabled but preempt enabled).
>>> get_callchain_entry() calls get_recursion_context() to get the
>>> buffer for a particular level.
>>>
>>> static inline int get_recursion_context(u8 *recursion)
>>> {
>>>          unsigned char rctx = interrupt_context_level();
>>>          if (recursion[rctx])
>>>                  return -1;
>>>          recursion[rctx]++;
>>>          barrier();
>>>          return rctx;
>>> }
>>>
>>> It is possible that both tasks (at process level)
>>> reach the point right before "recursion[rctx]++;".
>>> In that case, both tasks will be able to get the
>>> buffer, and this is not right.
>>>
>>> To fix this, we either need preempt disable on
>>> the bpf side, or some kind of atomic operation
>>> (cmpxchg or similar), or perhaps a preempt disable
>>> between the if statement and recursion[rctx]++,
>>> so that only one task can get the buffer?
>>>
>>
>> Thanks for the reminder. Can we add preempt disable/enable before and
>> after get_callchain_entry(), so that the original perf code paths are
>> not affected?
> 
> Yes, we have two get_callchain_entry() call sites:
>    bpf/stackmap.c: entry = get_callchain_entry(&rctx);
>    events/callchain.c:     entry = get_callchain_entry(&rctx);
> We need to have preempt_disable()/preempt_enable() around them.
> 
> Another choice would be to add preempt_disable()/enable() around
> get_callchain_entry() and get_perf_callchain() only in stackmap.c,
> assuming the other users of these two functions run in interrupt
> context (softirq, hardirq and nmi) and are therefore okay.
> 
> But maybe the following is better?
> 
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index d9cc57083091..0ccf94315954 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -214,12 +214,9 @@ static inline int get_recursion_context(u8 *recursion)
>   {
>          unsigned char rctx = interrupt_context_level();
> 
> -       if (recursion[rctx])
> +       if (cmpxchg(&recursion[rctx], 0, 1) != 0)
>                  return -1;
> 
> -       recursion[rctx]++;
> -       barrier();
> -
>          return rctx;
>   }
> 

Agree, this seems to have fewer side effects, thanks.
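
With that claim in place, the bpf side of this patch can keep the
simple hold pattern from the commit message, roughly like this
(just a sketch; bpf_put_perf_callchain() is the rename suggested
earlier in the thread, and max_depth/false follow the existing
stackmap callers):

        entry = bpf_get_perf_callchain(&rctx, regs, kernel, user,
                                       max_depth, false);
        if (unlikely(!entry))
                return -EFAULT;

        /* ... compute the stack id / copy the frames ... */

        bpf_put_perf_callchain(rctx);

The entry stays owned by this task until the copy is done.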

>>
>> Regarding preemption by multiple tasks: if the entry has not been
>> released via put_callchain_entry(), perf's buffer cannot record a
>> second task at the same level, so it simply returns here:
>>
>>           if (recursion[rctx])
>>                   return -1;
>>
>>>
>>>> +    if (unlikely(!entry))
>>>> +        return NULL;
>>>> +
>>>> +    __init_perf_callchain_ctx(&ctx, entry, max_stack, false);
>>>> +    if (kernel)
>>>> +        __get_perf_callchain_kernel(&ctx, regs);
>>>> +    if (user && !crosstask)
>>>> +        __get_perf_callchain_user(&ctx, regs);
>>>> +
>>>> +    return entry;
>>>> +}
>>>> +
>>>> +static void bpf_put_callchain_entry(int rctx)
>>>
>>> we have bpf_get_perf_callchain(), maybe rename the above
>>> to bpf_put_perf_callchain()?
>>>
>>
>> Ack, thanks.
>>
>>>> +{
>>>> +    put_callchain_entry(rctx);
>>>> +}
>>>> +
>>>
>>> [...]
>>>
>>
>>
> 


-- 
Best Regards
Tao Chen

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-11-06  7:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-28 16:25 [PATCH bpf-next v4 0/2] Pass external callchain entry to get_perf_callchain Tao Chen
2025-10-28 16:25 ` [PATCH bpf-next v4 1/2] perf: Refactor get_perf_callchain Tao Chen
2025-10-28 17:09   ` bot+bpf-ci
2025-10-30  2:36     ` Tao Chen
2025-11-05 20:45   ` Yonghong Song
2025-11-06  3:28     ` Tao Chen
2025-10-28 16:25 ` [PATCH bpf-next v4 2/2] bpf: Hold the perf callchain entry until used completely Tao Chen
2025-11-05 22:16   ` Yonghong Song
2025-11-06  5:12     ` Tao Chen
2025-11-06  6:20       ` Yonghong Song
2025-11-06  7:08         ` Tao Chen
