From: Eugene Loh <eugene.loh@oracle.com>
To: bpf@vger.kernel.org
Subject: Re: using skip>0 with bpf_get_stack()
Date: Fri, 25 Jun 2021 21:22:02 -0400 [thread overview]
Message-ID: <1b59751f-0bb1-a4ad-6548-2536e60a80ec@oracle.com> (raw)
In-Reply-To: <c65ac449-ec54-3dff-5447-8a318001285b@fb.com>
[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]
On 6/1/21 5:48 PM, Yonghong Song wrote:
>
>
> On 5/28/21 3:16 PM, Eugene Loh wrote:
>> I have a question about bpf_get_stack(). I'm interested in the case
>> skip > 0
>> user_build_id == 0
>> num_elem < sysctl_perf_event_max_stack
>>
>> The function sets
>> init_nr = sysctl_perf_event_max_stack - num_elem;
>> which means that get_perf_callchain() will return "num_elem" stack
>> frames. Then, since we skip "skip" frames, we fill the user
>> buffer with only "num_elem - skip" frames, the remaining entries
>> being zero-filled.
>>
>> For example, let's say the call stack is
>> leaf <- caller <- foo1 <- foo2 <- foo3 <- foo4 <- foo5 <- foo6
>>
>> Let's say I pass bpf_get_stack() a buffer with num_elem==4 and ask
>> skip==2. I would expect to skip 2 frames then get 4 frames, getting
>> back:
>> foo1 foo2 foo3 foo4
>>
>> Instead, I get
>> foo1 foo2 0 0
>> skipping 2 frames but also leaving frames zeroed out.
>
> Thanks for reporting. I looked at the code and it does seem that we may
> have a kernel bug when skip != 0. Basically, as you described, we
> initially collect num_elem stack frames and later reduce by skip,
> so we end up with num_elem - skip, as you calculated above.
>
>>
>> I think the init_nr computation should be:
>>
>> - if (sysctl_perf_event_max_stack < num_elem)
>> + if (sysctl_perf_event_max_stack <= num_elem + skip)
>> init_nr = 0;
>> else
>> - init_nr = sysctl_perf_event_max_stack - num_elem;
>> + init_nr = sysctl_perf_event_max_stack - num_elem - skip;
>
> A rough check looks like this is one correct way to fix the issue.
>
>> Incidentally, the return value of the function is presumably the size
>> of the returned data. Would it make sense to say so in
>> include/uapi/linux/bpf.h?
>
> The current documentation says:
> * Return
> * A non-negative value equal to or less than *size* on success,
> * or a negative error in case of failure.
>
> I think you can improve this with a more precise description, such as:
> a non-negative value giving the copied **buf** length.
>
> Could you submit a patch for this? Thanks!
Sure. Thanks for looking at this and sorry about the long delay getting
back to you.
Could you take a look at the attached proposed patch? As you can see in
the commit message, I'm unclear about the bpf_get_stack*_pe() variants.
They might use an earlier-constructed callchain, and I do not know how
init_nr was set for them.
[-- Attachment #2: 0001-bpf-Adjust-BPF-stack-helper-functions-to-accommodate.patch --]
[-- Type: text/plain, Size: 4186 bytes --]
From 70da9057d7fa7dda76e1b0861b8a0174078434ea Mon Sep 17 00:00:00 2001
From: Eugene Loh <eugene.loh@oracle.com>
Date: Fri, 25 Jun 2021 15:18:46 -0700
Subject: [PATCH] bpf: Adjust BPF stack helper functions to accommodate skip>0
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip>0, one keeps only num_elem-skip frames.
Change the computation of init_nr so that num_elem+skip stack frames are
walked (and hence num_elem frames are kept).
[NB: I am unsure of the bpf_get_stack*_pe() variants, which in the case of
__PERF_SAMPLE_CALLCHAIN_EARLY use ctx->data->callchain, which was walked
earlier. I am unclear how init_nr was chosen for it.]
Change the comment on bpf_get_stack() in the header file to be more
explicit what the return value means.
Signed-off-by: Eugene Loh <eugene.loh@oracle.com>
---
include/uapi/linux/bpf.h | 4 ++--
kernel/bpf/stackmap.c | 26 +++++++++++++++-----------
2 files changed, 17 insertions(+), 13 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 79c893310492..7c7b93e1db90 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2183,8 +2183,8 @@ union bpf_attr {
*
* # sysctl kernel.perf_event_max_stack=<new value>
* Return
- * A non-negative value equal to or less than *size* on success,
- * or a negative error in case of failure.
+ * The non-negative copied *buf* length equal to or less than
+ * *size* on success, or a negative error in case of failure.
*
* long bpf_skb_load_bytes_relative(const void *skb, u32 offset, void *to, u32 len, u32 start_header)
* Description
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index be35bfb7fb13..e2a193581550 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -249,23 +249,30 @@ get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
#endif
}
+static u32 get_init_nr(u32 nelem, u64 flags)
+{
+ u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+
+ if (sysctl_perf_event_max_stack <= nelem + skip)
+ return 0;
+ else
+ return sysctl_perf_event_max_stack - nelem - skip;
+}
+
static long __bpf_get_stackid(struct bpf_map *map,
struct perf_callchain_entry *trace, u64 flags)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
u32 max_depth = map->value_size / stack_map_data_size(map);
- /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
- u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+ u32 init_nr;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
u32 hash, id, trace_nr, trace_len;
bool user = flags & BPF_F_USER_STACK;
u64 *ips;
bool hash_matches;
- /* get_perf_callchain() guarantees that trace->nr >= init_nr
- * and trace-nr <= sysctl_perf_event_max_stack, so trace_nr <= max_depth
- */
+ init_nr = get_init_nr(max_depth, flags);
trace_nr = trace->nr - init_nr;
if (trace_nr <= skip)
@@ -331,8 +338,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
u64, flags)
{
u32 max_depth = map->value_size / stack_map_data_size(map);
- /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */
- u32 init_nr = sysctl_perf_event_max_stack - max_depth;
+ u32 init_nr;
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
@@ -341,6 +347,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
return -EINVAL;
+ init_nr = get_init_nr(max_depth, flags);
trace = get_perf_callchain(regs, init_nr, kernel, user,
sysctl_perf_event_max_stack, false, false);
@@ -458,10 +465,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
goto err_fault;
num_elem = size / elem_size;
- if (sysctl_perf_event_max_stack < num_elem)
- init_nr = 0;
- else
- init_nr = sysctl_perf_event_max_stack - num_elem;
+ init_nr = get_init_nr(num_elem, flags);
if (trace_in)
trace = trace_in;
--
2.27.0
Thread overview: 7+ messages
2021-05-28 22:16 using skip>0 with bpf_get_stack() Eugene Loh
2021-06-01 21:48 ` Yonghong Song
2021-06-26 1:22 ` Eugene Loh [this message]
2021-06-29 3:33 ` Yonghong Song
2022-03-04 20:37 ` Namhyung Kim
2022-03-04 20:50 ` Eugene Loh
2022-03-04 21:16 ` Namhyung Kim