From: Eugene Loh <eugene.loh@oracle.com>
To: bpf@vger.kernel.org
Subject: Re: using skip>0 with bpf_get_stack()
Date: Fri, 25 Jun 2021 21:22:02 -0400 [thread overview]
Message-ID: <1b59751f-0bb1-a4ad-6548-2536e60a80ec@oracle.com> (raw)
In-Reply-To: <c65ac449-ec54-3dff-5447-8a318001285b@fb.com>
[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]
On 6/1/21 5:48 PM, Yonghong Song wrote:
>
>
> On 5/28/21 3:16 PM, Eugene Loh wrote:
>> I have a question about bpf_get_stack(). I'm interested in the case
>> skip > 0
>> user_build_id == 0
>> num_elem < sysctl_perf_event_max_stack
>>
>> The function sets
>> init_nr = sysctl_perf_event_max_stack - num_elem;
>> which means that get_perf_callchain() will return "num_elem" stack
>> frames. Then, since we skip "skip" frames, we fill the user
>> buffer with only "num_elem - skip" frames, the remaining entries
>> being zero-filled.
>>
>> For example, let's say the call stack is
>> leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>
>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask
>> skip==2. I would expect to skip 2 frames then get 4 frames, getting
>> back:
>> foo1 foo2 foo3 foo4
>>
>> Instead, I get
>> foo1 foo2 0 0
>> skipping 2 frames but also leaving frames zeroed out.
>
> Thanks for reporting. I looked at the code and it does seem that we may
> have a kernel bug when skip != 0. Basically, as you described, we
> initially collect num_elem stack frames and later reduce by skip,
> so we end up with num_elem - skip, as you calculated above.
>
>>
>> I think the init_nr computation should be:
>>
>> - if (sysctl_perf_event_max_stack < num_elem)
>> + if (sysctl_perf_event_max_stack <= num_elem + skip)
>> init_nr = 0;
>> else
>> - init_nr = sysctl_perf_event_max_stack - num_elem;
>> + init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>
> A rough check looks like this is one correct way to fix the issue.
>
>> Incidentally, the return value of the function is presumably the size
>> of the returned data. Would it make sense to say so in
>> include/uapi/linux/bpf.h?
>
> The current documentation says:
> * Return
> * A non-negative value equal to or less than *size* on success,
> * or a negative error in case of failure.
>
> I think you can improve this with a more precise description, such as:
> a non-negative value giving the copied **buf** length.
>
> Could you submit a patch for this? Thanks!
Sure. Thanks for looking at this and sorry about the long delay getting
back to you.
Could you take a look at the attached proposed patch? As you can see in
the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
They might use an earlier-constructed callchain, and I do not know how
init_nr was set for them.
[-- Attachment #2: 0001-bpf-Adjust-BPF-stack-helper-functions-to-accommodate.patch --]
[-- Type: text/plain, Size: 4186 bytes --]
From 70da9057d7fa7dda76e1b0861b8a0174078434ea Mon Sep 17 00:00:00 2001
From: Eugene Loh <eugene.loh@oracle.com>
Date: Fri, 25 Jun 2021 15:18:46 -0700
Subject: [PATCH] bpf: Adjust BPF stack helper functions to accommodate skip>0
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip>0, one keeps only num_elem-skip frames.
Change the computation of init_nr so that num_elem+skip stack frames are
walked (and hence num_elem frames are kept).
[NB: I am unsure of the bpf_get_stack*_pe() variants, which in the case of
__PERF_SAMPLE_CALLCHAIN_EARLY use ctx->data->callchain, which was walked
earlier. I am unclear how init_nr was chosen for it.]
Change the comment on bpf_get_stack() in the header file to be more
explicit what the return value means.
Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
---
include/uapi/linux/bpf.h | 4 ++--
kernel/bpf/stackmap.c | 26 +++++++++++++++-----------
2 files changed, 17 insertions(+), 13 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 79c893310492..7c7b93e1db90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2183,8 +2183,8 @@ union bpf_attr {
*
* # sysctl kernel.perf_event_max_stack=<new value>
* Return
- * A non-negative value equal to or less than *size* on success,
- * or a negative error in case of failure.
+ * The non-negative copied *buf* length equal to or less than
+ * *size* on success, or a negative error in case of failure.
*
* long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
* Description
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index be35bfb7fb13..e2a193581550 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -249,23 +249,30 @@ get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
#endif
}
+static u32 get_init_nr(u32 nelem, u64 flags)
+{
+ u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+ if (sysctl_perf_event_max_stack <= nelem + skip)
+ return 0;
+ else
+ return sysctl_perf_event_max_stack - nelem - skip;
+}
+
static long __bpf_get_stackid(struct bpf_map *map,
struct perf_callchain_entry *trace, u64 flags)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
u32 max_depth = map->value_size / stack_map_data_size(map);
- /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
- u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+ u32 init_nr;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
u32 hash, id, trace_nr, trace_len;
bool user = flags & BPF_F_USER_STACK;
u64 *ips;
bool hash_matches;
- /* get_perf_callchain() guarantees that trace->nr >= init_nr
- * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
- */
+ init_nr = get_init_nr(max_depth, flags);
trace_nr = trace->nr - init_nr;
if (trace_nr <= skip)
@@ -331,8 +338,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
u64, flags)
{
u32 max_depth = map->value_size / stack_map_data_size(map);
- /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
- u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+ u32 init_nr;
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
@@ -341,6 +347,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
return -EINVAL;
+ init_nr = get_init_nr(max_depth, flags);
trace = get_perf_callchain(regs, init_nr, kernel, user,
sysctl_perf_event_max_stack, false, false);
@@ -458,10 +465,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
goto err_fault;
num_elem = size / elem_size;
- if (sysctl_perf_event_max_stack < num_elem)
- init_nr = 0;
- else
- init_nr = sysctl_perf_event_max_stack - num_elem;
+ init_nr = get_init_nr(num_elem, flags);
if (trace_in)
trace = trace_in;
--
2.27.0
Thread overview: 7+ messages
2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
2021-06-01 21:48 ` Yonghong Song
2021-06-26 1:22 ` Eugene Loh [this message]
2021-06-29 3:33 ` Yonghong Song
2022-03-04 20:37 ` Namhyung Kim
2022-03-04 20:50 ` Eugene Loh
2022-03-04 21:16 ` Namhyung Kim