Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE

From: Tao Chen <chen.dylane@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, bpf <bpf@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE
Date: Thu, 18 Sep 2025 21:34:54 +0800
Message-ID: <457b805f-ea5c-460e-b93f-b7b63f3358af@linux.dev>
In-Reply-To: <CAADnVQ+s8B7-fvR1TNO-bniSyKv57cH_ihRszmZV7pQDyV=VDQ@mail.gmail.com>

On 2025/9/18 09:35, Alexei Starovoitov wrote:
> On Wed, Sep 17, 2025 at 3:16 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>
>>
>> P.S. It seems like a good idea to switch STACKMAP to open addressing
>> instead of the current kind-of-bucket-chain-but-not-really
>> implementation. It's fixed size and pre-allocated already, so open
>> addressing seems like a great approach here, IMO.
> 
> That makes sense. It won't have backward compat issues.
> Just more reliable stack_id.
> 
> Fixed value_size is another footgun there.
> Especially for collecting user stack traces.
> We can switch the whole stackmap to bpf_mem_alloc()
> or wait for kmalloc_nolock().
> But it's probably a diminishing return.
> 
> bpf_get_stack() also isn't great with a copy into
> perf_callchain_entry, then 2nd copy into on stack/percpu buf/ringbuf,
> and 3rd copy of correct size into ringbuf (optional).
> 
> Also, I just realized we have another nasty race there.
> In the past bpf progs were run in preempt disabled context,
> but we forgot to adjust bpf_get_stack[id]() helpers when everything
> switched to migrate disable.
> 
> The return value from get_perf_callchain() may be reused
> if another task preempts and requests the stack.
> We have partially incorrect comment in __bpf_get_stack() too:
>          if (may_fault)
>                  rcu_read_lock(); /* need RCU for perf's callchain below */
> 
> rcu can be preemptable. so rcu_read_lock() makes
> trace = get_perf_callchain(...)
> accessible, but that per-cpu trace buffer can be overwritten.
> It's not an issue for CONFIG_PREEMPT_NONE=y, but that doesn't
> give much comfort.

Hi Alexei,

Can we fix it like this?

-       if (may_fault)
-               rcu_read_lock(); /* need RCU for perf's callchain below */
+       preempt_disable();

         if (trace_in)
                 trace = trace_in;
@@ -455,8 +454,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
                                            crosstask, false);

         if (unlikely(!trace) || trace->nr < skip) {
-               if (may_fault)
-                       rcu_read_unlock();
+               preempt_enable();
                 goto err_fault;
         }

@@ -475,9 +473,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
                 memcpy(buf, ips, copy_len);
         }

-       /* trace/ips should not be dereferenced after this point */
-       if (may_fault)
-               rcu_read_unlock();
+       preempt_enable();
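
To make the window concrete, here is a rough userspace simulation of
the reuse problem (not kernel code). The names percpu_entry,
get_callchain() and copy_stack() are made up for illustration; they
stand in for the per-CPU perf_callchain_entry, get_perf_callchain()
and the get-then-memcpy sequence in __bpf_get_stack(). The "preemption"
is modeled as a second get_callchain() call between the get and the
copy, which is the interleaving that disabling preemption is meant to
rule out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 4

/* Hypothetical stand-in for the per-CPU perf_callchain_entry. */
struct callchain_entry {
	size_t nr;
	unsigned long ip[MAX_DEPTH];
};

/* One simulated CPU, so one shared scratch buffer. */
static struct callchain_entry percpu_entry;

/* Stand-in for get_perf_callchain(): fills and returns the shared buffer. */
static struct callchain_entry *get_callchain(unsigned long base)
{
	percpu_entry.nr = MAX_DEPTH;
	for (size_t i = 0; i < MAX_DEPTH; i++)
		percpu_entry.ip[i] = base + i;
	return &percpu_entry;
}

/*
 * Models task A collecting its stack. If preemptible is true, task B
 * runs on the same CPU between the get and the copy and reuses the
 * buffer. Returns true if A ended up with its own stack in buf.
 */
static bool copy_stack(bool preemptible, unsigned long *buf)
{
	struct callchain_entry *trace = get_callchain(0x1000); /* task A */

	if (preemptible)
		get_callchain(0x2000); /* task B overwrites the buffer */

	assert(trace->nr <= MAX_DEPTH);
	memcpy(buf, trace->ip, trace->nr * sizeof(trace->ip[0]));
	return buf[0] == 0x1000;
}
```

With preemptible == false the copy holds task A's addresses; with
preemptible == true the pointer is still valid but the contents belong
to task B, which matches the case described above where the buffer
stays accessible but can be overwritten.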

> 
> Modern day bpf api would probably be
> - get_callchain_entry()/put() kfuncs to expose low level mechanism
> with safe acq/rel of temp buffer.
> - then another kfuncs to perf_callchain_kernel/user into that buffer.
> 
> and with bpf_mem_alloc and hash kfuncs the bpf prog can
> implement either bpf_get_stack() equivalent or much better
> bpf_get_stackid() with variable length stack traces and so on.
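
As a way to picture the acquire/release idea quoted above, here is a
minimal userspace sketch, assuming a small fixed pool of scratch
entries. scratch_get(), scratch_put(), NR_SLOTS and struct
scratch_entry are hypothetical names, not existing kernel or kfunc
APIs, and the model is single-threaded (a real version would need
atomic or per-context accounting).

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS  2
#define MAX_DEPTH 8

/* Hypothetical temp buffer with an ownership flag. */
struct scratch_entry {
	int in_use;
	size_t nr;
	unsigned long ip[MAX_DEPTH];
};

static struct scratch_entry slots[NR_SLOTS];

/* Acquire: hand out a free slot, or NULL if every slot is busy. */
static struct scratch_entry *scratch_get(void)
{
	for (size_t i = 0; i < NR_SLOTS; i++) {
		if (!slots[i].in_use) {
			slots[i].in_use = 1;
			slots[i].nr = 0;
			return &slots[i];
		}
	}
	return NULL;
}

/* Release: give the slot back so a later scratch_get() can reuse it. */
static void scratch_put(struct scratch_entry *e)
{
	assert(e && e->in_use); /* catch a double put in this model */
	e->in_use = 0;
}
```

The point of the pairing is that a nested user gets a different slot
(or a clean failure) instead of silently sharing a buffer that is
still being read.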


-- 
Best Regards
Tao Chen

Thread overview: 8+ messages
2025-09-09 16:32 [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE Tao Chen
2025-09-09 16:32 ` [PATCH bpf-next v2 2/2] selftests/bpf: Add stacktrace map lookup_and_delete_elem test case Tao Chen
2025-09-17 22:16 ` [PATCH bpf-next v2 1/2] bpf: Add lookup_and_delete_elem for BPF_MAP_STACK_TRACE Andrii Nakryiko
2025-09-18  1:35   ` Alexei Starovoitov
2025-09-18 13:34     ` Tao Chen [this message]
2025-09-19  2:01       ` Alexei Starovoitov
2025-09-19  2:08         ` Tao Chen
2025-09-18 12:45   ` Tao Chen
