Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-20  3:55 ` Alexei Starovoitov
  0 siblings, 0 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2015-01-20  3:55 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, zhangwei(Jovi),
	yrl.pp-manager.tt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org

On Mon, Jan 19, 2015 at 6:58 PM, Masami Hiramatsu
<masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org> wrote:
>>
>> it's done already... one can do the same skb->dev->name logic
>> in kprobe attached program... so from bpf program point of view,
>> tracepoints and kprobes feature-wise are exactly the same.
>> Only input is different.
>
> No, I meant that the input should also be same, at least for the first step.
> I guess it is easy to hook the ring buffer committing and fetch arguments
> from the event entry.

No. That would be very slow. See my comment to Steven
and more detailed numbers below.
Allocating ring buffer takes too much time.

> And what I expected scenario was
>
> 1. setup kprobe traceevent with fd, buf, count by using perf-probe.
> 2. load bpf module
> 3. the module processes given event arguments.

from ring buffer? that's too slow.
It's not usable for high frequency events which
need this in-kernel aggregation.
If events are rare, then just dumping everything
into trace buffer is just fine. No in-kernel program is needed.

> Hmm, it sounds making another systemtap on top of tracepoint and kprobes.
> Why don't you just reuse the existing facilities (perftools and ftrace)
> instead of co-exist?

hmm. I don't think we're on the same page yet...
ring buffer and tracing interface is fully reused.
programs are run as soon as event triggers.
They can return non-zero and kernel will allocate ring
buffer which user space will consume.
Please take a look at tracex1

>> Just look how ktap scripts look alike for kprobes and tracepoints.
>
> Ktap is a good example, it provides only a language parser and a runtime engine.
> Actually, currently it lacks a feature to execute "perf-probe" helper from
> script, but it is easy to add such feature.
...
> For this usecase, I've made --output option for perf probe
> https://lkml.org/lkml/2014/10/31/210

you're proposing to call perf binary from ktap binary?
I think packaging headaches and error conditions
will make such approach very hard to use.
it would be much cleaner to have ktap as part of perf
generating bpf on the fly and feeding into kernel.
'perf probe' parsing and functions don't belong in kernel
when userspace can generate them in more efficient way.

Speaking of performance...
I've added temporary tracepoint like this:
TRACE_EVENT(sys_write,
        TP_PROTO(int count),
        TP_fast_assign(
                __entry->cnt = count;
        ),
and call it from SYSCALL_DEFINE3(write,..., count):
 trace_sys_write(count);

and run the following test:
dd if=/dev/zero of=/dev/null count=5000000

1.19343 s, 2.1 GB/s - raw base line
1.53301 s, 1.7 GB/s - echo 1 > enable
1.62742 s, 1.6 GB/s - echo cnt==1234 > filter
and profile looks like:
     6.23%  dd       [kernel.vmlinux]  [k] __clear_user
     6.19%  dd       [kernel.vmlinux]  [k] __srcu_read_lock
     5.94%  dd       [kernel.vmlinux]  [k] system_call
     4.54%  dd       [kernel.vmlinux]  [k] __srcu_read_unlock
     4.14%  dd       [kernel.vmlinux]  [k] system_call_after_swapgs
     3.96%  dd       [kernel.vmlinux]  [k] fsnotify
     3.74%  dd       [kernel.vmlinux]  [k] ring_buffer_discard_commit
     3.18%  dd       [kernel.vmlinux]  [k] rb_reserve_next_event
     1.69%  dd       [kernel.vmlinux]  [k] rb_add_time_stamp

the slowdown due to unconditional buffer allocation
is too high to use this in production for aggregation
of high frequency events.
There is little reason to run bpf program in kernel after
such penalty. User space can just read trace_pipe_raw
and process data there.

Now if program is run right after tracepoint fires
the profile will look like:
    10.01%  dd             [kernel.vmlinux]            [k] __clear_user
     7.50%  dd             [kernel.vmlinux]            [k] system_call
     6.95%  dd             [kernel.vmlinux]            [k] __srcu_read_lock
     6.02%  dd             [kernel.vmlinux]            [k] __srcu_read_unlock
...
     1.15%  dd             [kernel.vmlinux]            [k]
ftrace_raw_event_sys_write
     0.90%  dd             [kernel.vmlinux]            [k] __bpf_prog_run
this is much more usable.
For empty bpf program that does 'return 0':
1.23418 s, 2.1 GB/s
For full tracex4 example that does map[log2(count)]++
1.2589 s, 2.0 GB/s

so the cost of doing such in-kernel aggregation is
1.19/1.25 is ~ 5%
which makes the whole solution usable as live
monitoring/analytics tool.
We would only need good set of tracepoints.
kprobe via fentry overhead is also not cheap.
Same tracex4 example via kprobe (instead of tracepoint)
1.45673 s, 1.8 GB/s
So tracepoints are 1.45/1.25 ~ 15% faster than kprobes.
which is huge when the cost of running bpf program
is just 5%.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-20  3:55 ` Alexei Starovoitov
  0 siblings, 0 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2015-01-20  3:55 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, zhangwei(Jovi),
	yrl.pp-manager.tt@hitachi.com

On Mon, Jan 19, 2015 at 6:58 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
>>
>> it's done already... one can do the same skb->dev->name logic
>> in kprobe attached program... so from bpf program point of view,
>> tracepoints and kprobes feature-wise are exactly the same.
>> Only input is different.
>
> No, I meant that the input should also be same, at least for the first step.
> I guess it is easy to hook the ring buffer committing and fetch arguments
> from the event entry.

No. That would be very slow. See my comment to Steven
and more detailed numbers below.
Allocating ring buffer takes too much time.

> And what I expected scenario was
>
> 1. setup kprobe traceevent with fd, buf, count by using perf-probe.
> 2. load bpf module
> 3. the module processes given event arguments.

from ring buffer? that's too slow.
It's not usable for high frequency events which
need this in-kernel aggregation.
If events are rare, then just dumping everything
into trace buffer is just fine. No in-kernel program is needed.

> Hmm, it sounds making another systemtap on top of tracepoint and kprobes.
> Why don't you just reuse the existing facilities (perftools and ftrace)
> instead of co-exist?

hmm. I don't think we're on the same page yet...
ring buffer and tracing interface is fully reused.
programs are run as soon as event triggers.
They can return non-zero and kernel will allocate ring
buffer which user space will consume.
Please take a look at tracex1

>> Just look how ktap scripts look alike for kprobes and tracepoints.
>
> Ktap is a good example, it provides only a language parser and a runtime engine.
> Actually, currently it lacks a feature to execute "perf-probe" helper from
> script, but it is easy to add such feature.
...
> For this usecase, I've made --output option for perf probe
> https://lkml.org/lkml/2014/10/31/210

you're proposing to call perf binary from ktap binary?
I think packaging headaches and error conditions
will make such approach very hard to use.
it would be much cleaner to have ktap as part of perf
generating bpf on the fly and feeding into kernel.
'perf probe' parsing and functions don't belong in kernel
when userspace can generate them in more efficient way.

Speaking of performance...
I've added temporary tracepoint like this:
TRACE_EVENT(sys_write,
        TP_PROTO(int count),
        TP_fast_assign(
                __entry->cnt = count;
        ),
and call it from SYSCALL_DEFINE3(write,..., count):
 trace_sys_write(count);

and run the following test:
dd if=/dev/zero of=/dev/null count=5000000

1.19343 s, 2.1 GB/s - raw base line
1.53301 s, 1.7 GB/s - echo 1 > enable
1.62742 s, 1.6 GB/s - echo cnt==1234 > filter
and profile looks like:
     6.23%  dd       [kernel.vmlinux]  [k] __clear_user
     6.19%  dd       [kernel.vmlinux]  [k] __srcu_read_lock
     5.94%  dd       [kernel.vmlinux]  [k] system_call
     4.54%  dd       [kernel.vmlinux]  [k] __srcu_read_unlock
     4.14%  dd       [kernel.vmlinux]  [k] system_call_after_swapgs
     3.96%  dd       [kernel.vmlinux]  [k] fsnotify
     3.74%  dd       [kernel.vmlinux]  [k] ring_buffer_discard_commit
     3.18%  dd       [kernel.vmlinux]  [k] rb_reserve_next_event
     1.69%  dd       [kernel.vmlinux]  [k] rb_add_time_stamp

the slowdown due to unconditional buffer allocation
is too high to use this in production for aggregation
of high frequency events.
There is little reason to run bpf program in kernel after
such penalty. User space can just read trace_pipe_raw
and process data there.

Now if program is run right after tracepoint fires
the profile will look like:
    10.01%  dd             [kernel.vmlinux]            [k] __clear_user
     7.50%  dd             [kernel.vmlinux]            [k] system_call
     6.95%  dd             [kernel.vmlinux]            [k] __srcu_read_lock
     6.02%  dd             [kernel.vmlinux]            [k] __srcu_read_unlock
...
     1.15%  dd             [kernel.vmlinux]            [k]
ftrace_raw_event_sys_write
     0.90%  dd             [kernel.vmlinux]            [k] __bpf_prog_run
this is much more usable.
For empty bpf program that does 'return 0':
1.23418 s, 2.1 GB/s
For full tracex4 example that does map[log2(count)]++
1.2589 s, 2.0 GB/s

so the cost of doing such in-kernel aggregation is
1.19/1.25 is ~ 5%
which makes the whole solution usable as live
monitoring/analytics tool.
We would only need good set of tracepoints.
kprobe via fentry overhead is also not cheap.
Same tracex4 example via kprobe (instead of tracepoint)
1.45673 s, 1.8 GB/s
So tracepoints are 1.45/1.25 ~ 15% faster than kprobes.
which is huge when the cost of running bpf program
is just 5%.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  2015-01-20  3:55 ` Alexei Starovoitov
  (?)
@ 2015-01-20 11:57 ` Masami Hiramatsu
  -1 siblings, 0 replies; 4+ messages in thread
From: Masami Hiramatsu @ 2015-01-20 11:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, zhangwei(Jovi),
	yrl.pp-manager.tt@hitachi.com

(2015/01/20 12:55), Alexei Starovoitov wrote:
> On Mon, Jan 19, 2015 at 6:58 PM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>>>
>>> it's done already... one can do the same skb->dev->name logic
>>> in kprobe attached program... so from bpf program point of view,
>>> tracepoints and kprobes feature-wise are exactly the same.
>>> Only input is different.
>>
>> No, I meant that the input should also be same, at least for the first step.
>> I guess it is easy to hook the ring buffer committing and fetch arguments
>> from the event entry.
> 
> No. That would be very slow. See my comment to Steven
> and more detailed numbers below.

Thank you for measuring the performance differences.
Indeed, the ring buffer looks slow.

> Allocating ring buffer takes too much time.
> 
>> And what I expected scenario was
>>
>> 1. setup kprobe traceevent with fd, buf, count by using perf-probe.
>> 2. load bpf module
>> 3. the module processes given event arguments.
> 
> from ring buffer? that's too slow.

Ok, BTW, would you think is it possible to use a reusable small scratchpad
memory for passing arguments? (just a thought)

> It's not usable for high frequency events which
> need this in-kernel aggregation.
> If events are rare, then just dumping everything
> into trace buffer is just fine. No in-kernel program is needed.

Hmm, let me ensure your point, the performance number is the reason why
we need to do it in the kernel, right? Not mainly for the flexibility but speed.

>> Hmm, it sounds making another systemtap on top of tracepoint and kprobes.
>> Why don't you just reuse the existing facilities (perftools and ftrace)
>> instead of co-exist?
> 
> hmm. I don't think we're on the same page yet...
> ring buffer and tracing interface is fully reused.
> programs are run as soon as event triggers.
> They can return non-zero and kernel will allocate ring
> buffer which user space will consume.
> Please take a look at tracex1

I see, this code itself is not a destructive change.

>>> Just look how ktap scripts look alike for kprobes and tracepoints.
>>
>> Ktap is a good example, it provides only a language parser and a runtime engine.
>> Actually, currently it lacks a feature to execute "perf-probe" helper from
>> script, but it is easy to add such feature.
> ...
>> For this usecase, I've made --output option for perf probe
>> https://lkml.org/lkml/2014/10/31/210
> 
> you're proposing to call perf binary from ktap binary?

Yes, that's right :)

> I think packaging headaches and error conditions
> will make such approach very hard to use.

No, I don't think so. perf can be a "buffer" from the kernel API
and command-line API. If you need to get clearer error, you also
can join the upstream development.

> it would be much cleaner to have ktap as part of perf
> generating bpf on the fly and feeding into kernel.
> 'perf probe' parsing and functions don't belong in kernel
> when userspace can generate them in more efficient way.

No, perf probe still be needed to users who don't choose "injecting
binary blob" tracing. Efficiency is NOT only one index.

- perf probe and kprobe-event gives us a complete understandable
 interface for what will be recorded at where.
 (we can see the event definitions via kprobe_events interface,
  without any tools)
- kprobe-event gives a completely same interface as other tracepoint
  events.
- it also doesn't require any build-binary parts :) nor special tools.
  We can play with ftrace on just a small busybox.

However, this does NOT interfere your patch upstreaming. I just said current
ftrace method is also meaningful for some reasons :)


By the way, I concern about that bpf compiler can become another systemtap,
especially if you build it on llvm. Would you plan to develop it on kernel
tree? or apart from the kernel-side development?
I think it is hard to sync the development if you do it out-of-tree.


Thank you,



-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-20 20:33 Alexei Starovoitov
  0 siblings, 0 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2015-01-20 20:33 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, yrl.pp-manager.tt@hitachi.com,
	Jovi Zhangwei

On Tue, Jan 20, 2015 at 3:57 AM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
>
> Ok, BTW, would you think is it possible to use a reusable small scratchpad
> memory for passing arguments? (just a thought)

sure. doable, but what's the use case?

>> It's not usable for high frequency events which
>> need this in-kernel aggregation.
>> If events are rare, then just dumping everything
>> into trace buffer is just fine. No in-kernel program is needed.
>
> Hmm, let me ensure your point, the performance number is the reason why
> we need to do it in the kernel, right? Not mainly for the flexibility but speed.

if user space can do X at the same speed as kernel,
then user space is a better choice and more flexible.
In case of bpf programs two things user space cannot do:
- fast aggregation without adding penalty to things being traced
- access to in-kernel data structures
And often both used together.
Say, we want to monitor amount of network traffic per user.
So we'd use trace_net_dev_xmit() tracepoint and do
map[current_uid()] += skb_len
as part of the program.
Overhead will be tiny and users won't notice any slowdown.
Trying to do the same in user space by enabling
this tracepoint has two problems:
high overhead and events are hard to aggregate
per user, since trace has 'pid', but short lived
processes will have dead pids in trace output.

> - perf probe and kprobe-event gives us a complete understandable
>  interface for what will be recorded at where.
>  (we can see the event definitions via kprobe_events interface,
>   without any tools)
> - kprobe-event gives a completely same interface as other tracepoint
>   events.
> - it also doesn't require any build-binary parts :) nor special tools.
>   We can play with ftrace on just a small busybox.

yeah, when debugging in busybox is the goal
and 'cat' and 'echo' are your only tools, then
debugfs interface is the only choice :)

> However, this does NOT interfere your patch upstreaming. I just said current
> ftrace method is also meaningful for some reasons :)

of course :)
To emphasize the point I was trying to make with tracex1:
The program is a filter/aggregator. The bpf maps
are not suitable for streaming the events. That's the job
of ring buffer/trace_pipe. The program may choose
to aggregate some events and discard them (by
returning 0 from the program), and the rest of
the events will be streamed to user space via
ring buffer in the format statically defined by tracepoint
or by kprobe arguments.
The tracex1 example loads the program and then
reads /sys/kernel/debugfs/tracing/trace_pipe...

That part I was trying to improve with bpf_trace_printk:
to give ability to programs to stream data in a format
different from the one statically defined by tracepoints.
But trace_printk has its disadvantages, so probably
something cleaner is needed.
Like in my earlier example of trace_net_dev_xmit,
if the program could add printing of uid to arguments
already printed, it would have helped user space.

> By the way, I concern about that bpf compiler can become another systemtap,
> especially if you build it on llvm.
> Would you plan to develop it on kernel
> tree? or apart from the kernel-side development?

I'm not sure I completely understand the concern.
perf is using a bunch of out-of-tree libraries.
mcjit of llvm or libgccjit are another libraries.
Or may be eventually eBPF can be generated
by something like libpcap.

Ideally I would like to see 'perf run script.txt'
where script.txt is a program in a language suited
for tracing. The tracing language not necessary
will fit networking use cases. Currently I'm
using C for both and it's the most convenient,
but some folks complained that 'restricted'
nature of this C is hard to grasp, so I can only
encourage Jovi to do ktap language to bpf
translator. If it generates bpf directly that's great,
if it uses gcc or llvm backend that's fine too.

> I think it is hard to sync the development if you do it out-of-tree.

I think some pieces would have to be out of tree.
I've kept standalone llvm backend across 3.2, 3.3 and 3.4
but it gets polluted with ifdefs and not really a long term
solution, so now I'm working on upstreaming it
and feedback/codereviews I got, definitely improved
the quality of the bpf backend.
In case of backends the only bit to sync is instruction
set itself, which is stable. New instructions may be
added, but that's not a concern.
llvm backend doesn't care what language is
used in front-end or how programs are attached
to tracepoints or what set of bpf helper
functions is available.
All such bits and the main interface for
dynamic tracer, imo, should be in perf binary.
What it does underneath and how
many times it calls into llvm/gcc lib, won't be visible.
In case of systemtap compile time, for whatever
reason, is slow to the point of being annoying,
but here it should be instant.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2015-01-20 20:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-01-20  3:55 Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
2015-01-20  3:55 ` Alexei Starovoitov
2015-01-20 11:57 ` Masami Hiramatsu
  -- strict thread matches above, loose matches on Subject: below --
2015-01-20 20:33 Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.