* [RFC] A couple of issues on BPF callstack
From: Namhyung Kim @ 2022-03-04 23:28 UTC
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, Eugene Loh, Peter Zijlstra, Hao Luo

Hello,

While working on lock contention tracepoints [1] for future BPF use,
I found some issues with stack traces in BPF programs.  Maybe I
missed something, but I'd like to share my thoughts and get your
feedback.  Please correct me if I'm wrong.

The first thing I found is how bpf_get_stack{,id} handles skipped
frames.  Initially I wanted a short stack trace, like 4 entries deep,
to identify callers quickly, but it turned out that 4 is not enough:
the trace is entirely filled with BPF code itself.

So I set it to skip 4 frames, but it always returned an error
(-EFAULT).  After some time I figured out that BPF doesn't allow
skipping a number of frames greater than or equal to the buffer size.
This seems strange and looks like a bug.  I then found a bug report
(and a partial fix) [2] and am working on a full fix now.
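
For reference, a minimal sketch of what I tried (the tracepoint name
is hypothetical since my lock contention tracepoints are not merged
yet):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_DEPTH  4

SEC("tracepoint/lock/contention_begin")		/* hypothetical name */
int collect_stack(void *ctx)
{
	__u64 stack[MAX_DEPTH];
	long ret;

	/* The low 8 bits of the flags argument encode the number of
	 * frames to skip.  Asking to skip 4 frames with a 4-entry
	 * buffer hits the skip >= buffer depth check and fails with
	 * -EFAULT. */
	ret = bpf_get_stack(ctx, stack, sizeof(stack), 4);
	if (ret < 0)
		bpf_printk("get_stack failed: %ld", ret);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";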

But it revealed another problem with BPF programs on perf_event,
which use a variant of the stack trace functions.  The difference is
that they need to use the callchain in the perf sample data.  The
perf callchain is saved from the beginning, while the BPF callchain
is saved at the end to limit the stack depth to the buffer size.  But
I can handle that.
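
To illustrate the layout difference (a rough sketch, not actual
kernel code):

/*
 * perf sample callchain: filled from index 0, leaf frame first
 *   ip[0] .. ip[nr - 1]
 *
 * BPF-requested callchain: frames stored at the tail of the entry,
 * so the depth is capped by the requested buffer size
 *   ip[max_stack - depth] .. ip[max_stack - 1]
 */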

More important to me is the content of the (perf) callchain.  If the
event has __PERF_SAMPLE_CALLCHAIN_EARLY, the callchain will contain
context markers like PERF_CONTEXT_KERNEL.  So users might or might
not see them depending on whether the perf_event was set up with
precise_ip and SAMPLE_CALLCHAIN.  This doesn't look good.
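
For example, with __PERF_SAMPLE_CALLCHAIN_EARLY the trace can begin
with a marker instead of a real address (the address below is made
up):

	ip[0] = PERF_CONTEXT_KERNEL	/* (__u64)-128, a marker */
	ip[1] = 0xffffffff81234567	/* first real kernel IP */

Without it, ip[0] is already a real address.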

After all, I think it'd be really great if we could easily skip this
uninteresting info.  Maybe we could add flags to skip the BPF code,
the perf context markers, and even some scheduler code from the
trace, like stack_trace_consume_entry_nosched() does.

Thoughts?

Thanks,
Namhyung


[1] https://lore.kernel.org/all/20220301010412.431299-1-namhyung@kernel.org/
[2] https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com/


* Re: [RFC] A couple of issues on BPF callstack
From: Yonghong Song @ 2022-03-06  0:28 UTC
  To: Namhyung Kim, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Martin KaFai Lau, Song Liu, John Fastabend, KP Singh, netdev, bpf,
	Eugene Loh, Peter Zijlstra, Hao Luo



On 3/4/22 3:28 PM, Namhyung Kim wrote:
> Hello,
> 
> While working on lock contention tracepoints [1] for future BPF
> use, I found some issues with stack traces in BPF programs.  Maybe
> I missed something, but I'd like to share my thoughts and get your
> feedback.  Please correct me if I'm wrong.
> 
> The first thing I found is how bpf_get_stack{,id} handles skipped
> frames.  Initially I wanted a short stack trace, like 4 entries
> deep, to identify callers quickly, but it turned out that 4 is not
> enough: the trace is entirely filled with BPF code itself.
> 
> So I set it to skip 4 frames, but it always returned an error
> (-EFAULT).  After some time I figured out that BPF doesn't allow
> skipping a number of frames greater than or equal to the buffer
> size.  This seems strange and looks like a bug.  I then found a bug
> report (and a partial fix) [2] and am working on a full fix now.

Thanks for volunteering. Looking forward to the patch.

> 
> But it revealed another problem with BPF programs on perf_event,
> which use a variant of the stack trace functions.  The difference
> is that they need to use the callchain in the perf sample data.
> The perf callchain is saved from the beginning, while the BPF
> callchain is saved at the end to limit the stack depth to the
> buffer size.  But I can handle that.
> 
> More important to me is the content of the (perf) callchain.  If
> the event has __PERF_SAMPLE_CALLCHAIN_EARLY, the callchain will
> contain context markers like PERF_CONTEXT_KERNEL.  So users might
> or might not see them depending on whether the perf_event was set
> up with precise_ip and SAMPLE_CALLCHAIN.  This doesn't look good.

Patch 7b04d6d60fcf ("bpf: Separate bpf_get_[stack|stackid] for
perf events BPF") tried to fix the __PERF_SAMPLE_CALLCHAIN_EARLY
issue for the bpf_get_stack[id]() helpers.
The helpers check whether event->attr.sample_type has
__PERF_SAMPLE_CALLCHAIN_EARLY encoded, and retrieve the stacks
accordingly.
Did you see any issue here?
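
Roughly, the logic is (a loose paraphrase of bpf_get_stackid_pe(),
not verbatim kernel code):

	if (!(event->attr.sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
		/* no pre-sampled chain, use the generic path */
		return bpf_get_stackid(regs, map, flags);

	/* otherwise reuse the callchain perf already sampled */
	trace = ctx->data->callchain;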

> 
> After all, I think it'd be really great if we could easily skip
> this uninteresting info.  Maybe we could add flags to skip the
> BPF code,

We cannot just skip those callchains with __PERF_SAMPLE_CALLCHAIN_EARLY.
There are real use cases for it.

> the perf context markers, and even some scheduler code from the
> trace, like stack_trace_consume_entry_nosched() does.

A flag for the bpf_get_stack[id]() helpers? It is possible. It would
be great if you could detail your use case here and how a flag would
help you.

> 
> Thoughts?
> 
> Thanks,
> Namhyung
> 
> 
> [1] https://lore.kernel.org/all/20220301010412.431299-1-namhyung@kernel.org/
> [2] https://lore.kernel.org/bpf/30a7b5d5-6726-1cc2-eaee-8da2828a9a9c@oracle.com/


* Re: [RFC] A couple of issues on BPF callstack
From: Namhyung Kim @ 2022-03-08  4:37 UTC
  To: Yonghong Song
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Network Development, bpf, Eugene Loh, Peter Zijlstra, Hao Luo

Hello,

On Sat, Mar 5, 2022 at 4:28 PM Yonghong Song <yhs@fb.com> wrote:
> On 3/4/22 3:28 PM, Namhyung Kim wrote:
> > More important to me is the content of the (perf) callchain.  If
> > the event has __PERF_SAMPLE_CALLCHAIN_EARLY, the callchain will
> > contain context markers like PERF_CONTEXT_KERNEL.  So users might
> > or might not see them depending on whether the perf_event was set
> > up with precise_ip and SAMPLE_CALLCHAIN.  This doesn't look good.
>
> Patch 7b04d6d60fcf ("bpf: Separate bpf_get_[stack|stackid] for
> perf events BPF") tried to fix the __PERF_SAMPLE_CALLCHAIN_EARLY
> issue for the bpf_get_stack[id]() helpers.

Right.

> The helpers check whether event->attr.sample_type has
> __PERF_SAMPLE_CALLCHAIN_EARLY encoded, and retrieve the stacks
> accordingly.
> Did you see any issue here?

It changes the stack trace results by adding perf context markers
like PERF_CONTEXT_KERNEL and PERF_CONTEXT_USER.
Without __PERF_SAMPLE_CALLCHAIN_EARLY, I don't see them.
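
Today the only option is to filter them out after the fact, e.g. in
user space (a sketch; all PERF_CONTEXT_* values sit at or above
PERF_CONTEXT_MAX, which is (__u64)-4095):

	#include <stdint.h>

	/* Context markers are huge values near -1UL; real kernel
	 * addresses are below PERF_CONTEXT_MAX. */
	static int is_context_marker(uint64_t ip)
	{
		return ip >= (uint64_t)-4095;	/* PERF_CONTEXT_MAX */
	}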

> >
> > After all, I think it'd be really great if we could easily skip
> > this uninteresting info.  Maybe we could add flags to skip the
> > BPF code,
>
> We cannot just skip those callchains with __PERF_SAMPLE_CALLCHAIN_EARLY.
> There are real use cases for it.

I'm not saying that I want to skip all the callchains.
What I want is a way to avoid the perf context info in
the callchains, so that I can be sure to get the same
stack traces in a known code path regardless of the event
attributes and CPU vendor - as far as I know,
__PERF_SAMPLE_CALLCHAIN_EARLY is enabled on Intel CPUs only.

>
> > the perf context markers, and even some scheduler code from the
> > trace, like stack_trace_consume_entry_nosched() does.
>
> A flag for the bpf_get_stack[id]() helpers? It is possible. It
> would be great if you could detail your use case here and how a
> flag would help you.

Yep, something like BPF_F_SKIP_BPF_STACK.

In my case, I collect a callchain at a tracepoint to find its caller.
And I want a short call stack depth for performance reasons.
But the first 3 or 4 entries are always filled with BPF code, and
I want to skip them.  I know that I can set that with the skip mask,
but having a hard-coded value can be annoying since it might change
with different compilers, kernel versions, or configurations.
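
With such a flag, the call could look like this instead (again,
BPF_F_SKIP_BPF_STACK is just a proposed name, not an existing flag):

	/* hypothetical: let the kernel skip its own BPF frames */
	ret = bpf_get_stack(ctx, stack, sizeof(stack),
			    BPF_F_SKIP_BPF_STACK);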

Similarly, I think it'd be useful to skip some scheduler functions
like __schedule when collecting stack traces in sched_switch.

Thanks,
Namhyung

