* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-22  1:49 Alexei Starovoitov
  [not found] ` <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-22  1:49 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Brendan Gregg, Linux API, Network Development, LKML

On Wed, Jan 21, 2015 at 5:03 PM, Namhyung Kim
<namhyung-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> AFAIK a trigger can be fired before allocating a ring buffer if it
> doesn't use the event record (i.e. has a filter) or ->post_trigger bit
> set (stacktrace).  Please see ftrace_trigger_soft_disabled().

Yes, but such a trigger has no arguments, so I would have to hack
ftrace_trigger_soft_disabled() to pass 'ctx' further down, through all
the pointer dereferences and list walking. There is also no return
value, so I would have to add one, similar to post-triggers. That's
quite a bit of overhead that I would like to avoid.

Actually, now I'm thinking of moving the
  if (ftrace_file->flags & TRACE_EVENT_FL_BPF)
condition before the ftrace_trigger_soft_disabled() check, so that
programs always run first, and if they return non-zero, all standard
processing follows. Maybe the return value of the program can influence
triggers. That would nicely replace bpf_dump_stack(): the program would
return an ETT_STACKTRACE constant to trigger the dump.

> This also allows keeping events in the soft-disabled state.

I was never able to figure out the use case for the soft-disabled state.
Probably historical, from before static_key was done.

^ permalink raw reply	[flat|nested] 9+ messages in thread
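The reordering Alexei proposes can be sketched in user-space C. TRACE_EVENT_FL_BPF appears in the patch set, but run_bpf_prog() and handle_event() below are illustrative stand-ins for the kernel path, not the actual kernel API:

```c
#include <assert.h>

#define TRACE_EVENT_FL_BPF (1 << 0)  /* stand-in flag bit */

/* stand-in for the attached eBPF program: filter on one u64 argument */
static int run_bpf_prog(unsigned long long arg)
{
    return arg == 123;  /* non-zero: keep the event */
}

static int handle_event(unsigned int flags, unsigned long long arg)
{
    /* proposed ordering: the program runs before any trigger or
     * soft-disable checks; returning 0 discards the event before
     * any buffer allocation happens */
    if (flags & TRACE_EVENT_FL_BPF) {
        if (!run_bpf_prog(arg))
            return 0;
    }
    /* ... ftrace_trigger_soft_disabled() and the normal buffer path ... */
    return 1;  /* event recorded */
}
```

The point of the ordering is that a discarded event pays only for the program run, never for buffer alloc/free.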
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  [not found] ` <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-22  1:56 ` Steven Rostedt
  [not found] ` <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2015-01-22  1:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Brendan Gregg, Linux API, Network Development, LKML

On Wed, 21 Jan 2015 17:49:08 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> > This also allows keeping events in the soft-disabled state.
>
> I was never able to figure out the use case for the soft-disabled state.
> Probably historical, from before static_key was done.

No, it's not historical at all. The "soft-disable" is a way to enable
from any context. You can't enable a static key from NMI or interrupt
context, but you can enable a "soft-disable" there.

As you can enable or disable events from any function that the function
tracer may trace, I needed a way to enable them (make the tracepoint
active), but do nothing until something else turns them on.

-- Steve

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  [not found] ` <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
@ 2015-01-22  2:13 ` Alexei Starovoitov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-22  2:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Brendan Gregg, Linux API, Network Development, LKML

On Wed, Jan 21, 2015 at 5:56 PM, Steven Rostedt
<rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
> On Wed, 21 Jan 2015 17:49:08 -0800
> Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>
>> > This also allows keeping events in the soft-disabled state.
>>
>> I was never able to figure out the use case for the soft-disabled state.
>> Probably historical, from before static_key was done.
>
> No, it's not historical at all. The "soft-disable" is a way to enable
> from any context. You can't enable a static key from NMI or interrupt
> context, but you can enable a "soft-disable" there.
>
> As you can enable or disable events from any function that the function
> tracer may trace, I needed a way to enable them (make the tracepoint
> active), but do nothing until something else turns them on.

Thanks for the explanation, that makes sense.

Speaking of NMI... I think I will add a check that skips running the
program if (in_nmi()), since supporting that use case is not needed at
the moment.

^ permalink raw reply	[flat|nested] 9+ messages in thread
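A minimal sketch of that guard, with in_nmi() and the program stubbed out so the logic runs in user space; treating a skipped program as "discard the event" is just one possible policy, and none of these names are the actual kernel API:

```c
#include <assert.h>

/* stand-in for the kernel's in_nmi(); togglable for illustration */
static int fake_nmi_context;
static int in_nmi(void) { return fake_nmi_context; }

/* stand-in for the attached eBPF program */
static int run_bpf_prog(unsigned long long arg) { return arg == 42; }

/* skip the program entirely in NMI context; returning 0 here treats
 * "skipped" the same as "discard the event" */
static int filter_event(unsigned long long arg)
{
    if (in_nmi())
        return 0;
    return run_bpf_prog(arg);
}
```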
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-16 18:57 Alexei Starovoitov
  2015-01-22  1:03 ` Namhyung Kim
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-16 18:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Brendan Gregg, Linux API, Network Development, LKML

On Fri, Jan 16, 2015 at 7:02 AM, Steven Rostedt
<rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
> On Thu, 15 Jan 2015 20:16:01 -0800
> Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>
>> Hi Ingo, Steven,
>>
>> This patch set is based on tip/master.
>
> Note, the tracing code isn't maintained in tip/master, but perf code is.

I know. I can rebase against the linux-trace tree, but wanted to go
through tip to take advantage of the tip-bot ;)

> Do you have a git repo somewhere that I can look at? It makes it easier
> than loading in 9 patches ;-)

https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/

> For syscalls this is fine as the parameters are usually set. But
> there's a lot of tracepoints that we need to know the result of the
> copied data to decide to filter or not, where the result happens at the
> TP_fast_assign() part which requires allocating the buffers.
...
> Again, for syscalls it may not be a problem, but for other tracepoints,
> I'm not sure we can do that. How do you handle sched_switch for
> example? The tracepoint only gets two pointers to task structs, you
> need to then dereference them to get the pid, prio, state and other
> data.

Exactly. In this patch set the user can use the bpf_fetch_*() helpers to
dereference task_struct to get to pid, prio, anything. The user is not
limited by what the tracepoint hard-codes as part of TP_fast_assign.
In the future, when the generic type/field infra is ready, the bpf
program will have a faster way to access such fields without going
through the bpf_fetch_*() helpers. In other words, the program is a
superset of the existing TP_fast_assign + TP_print + filters + triggers
+ stack dumps; all are done from the program on demand. Some programs
may just count the number of events without accessing arguments at all.

> Maybe we should have a way to do the program before and/or after the
> buffering depending on what to filter on. There's no way to know what
> the parameters of the tracepoint are without looking at the source.

Right now most tracepoints copy as many fields as possible as part of
TP_fast_assign, since there is no way to know what will be needed later.
And despite copying a lot, it is often not enough for analytics. With
the program attached before the copy, the overhead is much lower. The
program looks at whatever fields are necessary, may do some operations
on them, and can even store these fields into maps for further
processing.

>> - dump_stack
>> - trace_printk
>> The last two are mainly to debug the programs and to print data for user
>> space consumptions.
>
> I have to look at the code, but currently trace_printk() isn't made to
> be used in production systems.

Other than allocating a bunch of per-cpu pages, what concerns do you
have? Some printk-like facility from the program is needed. I'd rather
fix whatever is necessary in trace_printk instead of inventing another
way of printing. Anyway, I think I will drop these two helpers for now.

> One last thing. If the ebpf is used for anything but filtering, it
> should go into the trigger file. The filtering is only a way to say if
> the event should be recorded or not. But the trigger could do something
> else (a printk, a stacktrace, etc).

It does way more than just filtering, but invoking the program as a
trigger is too slow.
When the program is called as soon as the tracepoint fires, it can fetch
other fields, evaluate them, printk some of them, optionally dump the
stack, and aggregate into maps. We can let it call triggers too, so that
the user program will be able to enable/disable other events. I'm not
against invoking programs as a trigger, but I don't see a use case for
it. It's just too slow for production analytics that needs to act on a
huge number of events per second.

We must minimize the overhead between the tracepoint firing and the
program executing, so that programs can be used on events like packet
receive, which occur millions of times per second. Every nsec counts.
For example:
- raw dd if=/dev/zero of=/dev/null
  does 760 MB/s (on my debug kernel)
- echo 1 > events/syscalls/sys_enter_write/enable
  drops it to 400 MB/s
- echo "count == 123" > events/syscalls/sys_enter_write/filter
  drops it even further, down to 388 MB/s.
  This slowdown is too high for this to be used on a live system.
- tracex4, which computes a histogram of sys_write sizes and stores
  log2(count) into a map, does 580 MB/s.
  This is still not great, but the slowdown is now usable, and we can
  work further on minimizing the overhead.

^ permalink raw reply	[flat|nested] 9+ messages in thread
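The tracex4-style accounting above reduces to a log2 bucket computation plus a per-bucket counter increment, which is why it stays so cheap per event. A user-space sketch of that logic (the eBPF map is modeled as a plain array here; the function names are illustrative, not from the patch set):

```c
#include <assert.h>

#define MAX_BUCKET 64
static unsigned long hist[MAX_BUCKET];  /* models an eBPF array map */

/* integer log2: position of the highest set bit, 0 for v <= 1 */
static unsigned int log2_bucket(unsigned long long v)
{
    unsigned int r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* per-event work: one bucket computation and one counter increment */
static void count_write(unsigned long long size)
{
    hist[log2_bucket(size)]++;
}
```

In the real program the increment is a map lookup plus update; user space reads the map afterwards to print the histogram.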
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  2015-01-16 18:57 Alexei Starovoitov
@ 2015-01-22  1:03 ` Namhyung Kim
  0 siblings, 0 replies; 9+ messages in thread
From: Namhyung Kim @ 2015-01-22  1:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Brendan Gregg, Linux API, Network Development, LKML

Hi Alexei,

On Fri, Jan 16, 2015 at 10:57:15AM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 16, 2015 at 7:02 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > One last thing. If the ebpf is used for anything but filtering, it
> > should go into the trigger file. The filtering is only a way to say if
> > the event should be recorded or not. But the trigger could do something
> > else (a printk, a stacktrace, etc).
>
> it does way more than just filtering, but
> invoking program as a trigger is too slow.
> When program is called as soon as tracepoint fires,
> it can fetch other fields, evaluate them, printk some of them,
> optionally dump stack, aggregate into maps.
> We can let it call triggers too, so that user program will
> be able to enable/disable other events.
> I'm not against invoking programs as a trigger, but I don't
> see a use case for it. It's just too slow for production
> analytics that needs to act on huge number of events
> per second.

AFAIK a trigger can be fired before allocating a ring buffer if it
doesn't use the event record (i.e. has a filter) or has the
->post_trigger bit set (stacktrace). Please see
ftrace_trigger_soft_disabled(). This also allows keeping events in the
soft-disabled state.

Thanks,
Namhyung

> We must minimize the overhead between tracepoint
> firing and program executing, so that programs can
> be used on events like packet receive which will be
> in millions per second. Every nsec counts.
> For example:
> - raw dd if=/dev/zero of=/dev/null
>   does 760 MB/s (on my debug kernel)
> - echo 1 > events/syscalls/sys_enter_write/enable
>   drops it to 400 MB/s
> - echo "count == 123" > events/syscalls/sys_enter_write/filter
>   drops it even further down to 388 MB/s
>   This slowdown is too high for this to be used on a live system.
> - tracex4 that computes histogram of sys_write sizes
>   and stores log2(count) into a map does 580 MB/s
>   This is still not great, but this slowdown is now usable
>   and we can work further on minimizing the overhead.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-16  4:16 Alexei Starovoitov
  [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
      netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi Ingo, Steven,

This patch set is based on tip/master.
It adds the ability to attach eBPF programs to tracepoints, syscalls
and kprobes.

Mechanism of attaching:
- load the program via the bpf() syscall and receive program_fd
- event_fd = open("/sys/kernel/debug/tracing/events/.../filter")
- write 'bpf-123' to event_fd, where 123 is program_fd
- the program is now attached to that particular event and the event is
  automatically enabled
- close(event_fd) detaches the bpf program from the event and disables
  the event

Program attach point and input arguments:
- programs attached to kprobes receive 'struct pt_regs *' as input.
  See tracex4_kern.c, which demonstrates how users can write a C program
  like:
    SEC("events/kprobes/sys_write")
    int bpf_prog4(struct pt_regs *regs)
    {
        long write_size = regs->dx;
        // here the user needs to know the prototype of sys_write() from
        // kernel sources and the x64 calling convention to know that
        // register $rdx contains the 3rd argument to sys_write(),
        // which is 'size_t count'

  It's obviously architecture dependent, but allows building
  sophisticated user tools on top that can see from the debug info of
  vmlinux which variables are in which registers or stack locations and
  fetch them from there. 'perf probe' can potentially use this hook to
  generate programs in user space and insert them instead of letting
  the kernel parse a string during kprobe creation.
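The attach sequence above can be sketched as plain user-space C; attach_prog() and format_bpf_cmd() are hypothetical helper names, and prog_fd is assumed to come from a prior bpf() syscall:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* build the "bpf-<fd>" command that is written into the filter file */
static int format_bpf_cmd(char *buf, size_t len, int prog_fd)
{
    return snprintf(buf, len, "bpf-%d", prog_fd);
}

/* hypothetical helper: attach prog_fd to the given event; the returned
 * fd must stay open for as long as the program should remain attached */
static int attach_prog(int prog_fd, const char *event)
{
    char path[256], cmd[32];
    int fd, n;

    snprintf(path, sizeof(path),
             "/sys/kernel/debug/tracing/events/%s/filter", event);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    n = format_bpf_cmd(cmd, sizeof(cmd), prog_fd);
    if (write(fd, cmd, n) != n) {
        close(fd);
        return -1;
    }
    return fd;  /* close(fd) detaches the program and disables the event */
}
```

Keeping the filter fd open as the attachment handle matches the lifetime rule in the cover letter: close() both detaches and disables.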
- programs attached to tracepoints and syscalls receive
  'struct bpf_context *':
    u64 arg1, arg2, ..., arg6;
  For syscalls these match the syscall arguments.
  For tracepoints these args match the arguments passed to the
  tracepoint. For example:
    trace_sched_migrate_task(p, new_cpu); from sched/core.c
    arg1 <- p which is 'struct task_struct *'
    arg2 <- new_cpu which is 'unsigned int'
    arg3..arg6 = 0
  The program can use the bpf_fetch_u8/16/32/64/ptr() helpers to walk
  'task_struct' or any other kernel data structure. These helpers use
  probe_kernel_read(), similar to 'perf probe', which is not 100% safe
  in either case, but good enough.
  To access task_struct's pid inside the 'sched_migrate_task' tracepoint
  the program can do:
    struct task_struct *task = (struct task_struct *)ctx->arg1;
    u32 pid = bpf_fetch_u32(&task->pid);
  Since the struct layout is kernel-configuration specific, such
  programs are not portable and require access to kernel headers to be
  compiled, but in this case we don't need debug info. llvm with the bpf
  backend will statically compute the task->pid offset as a constant
  based on kernel headers alone.
  The example of this arbitrary pointer walking is tracex1_kern.c,
  which does skb->dev->name == "lo" filtering.

In all cases the programs are called before the trace buffer is
allocated, to minimize overhead: we want to filter a huge number of
events, and buffer alloc/free plus argument copying for every event is
too costly. Theoretically we could invoke programs after the buffer is
allocated, but that doesn't seem needed, since the above approach is
faster and achieves the same.

Note: tracepoint/syscall and kprobe programs are two different types,
BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER, since
they expect different input.
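A sketch of a sched_migrate_task filter against this 'struct bpf_context' interface; task_struct, bpf_fetch_u32() and the filter condition are all stubbed or illustrative so the logic can run in user space (the real helper wraps probe_kernel_read()):

```c
#include <assert.h>

/* user-space stand-ins for the kernel structures */
struct task_struct { int pad; unsigned int pid; };
struct bpf_context { unsigned long long arg1, arg2; };

/* the real helper wraps probe_kernel_read(); plain dereference here */
static unsigned int bpf_fetch_u32(const void *p)
{
    return *(const unsigned int *)p;
}

/* return non-zero to keep the event, 0 to discard it */
static int filter_migrate(struct bpf_context *ctx)
{
    struct task_struct *task =
        (struct task_struct *)(unsigned long)ctx->arg1;
    unsigned int pid = bpf_fetch_u32(&task->pid);
    unsigned int new_cpu = (unsigned int)ctx->arg2;

    /* illustrative condition: keep events for init or moves to CPU 0 */
    return pid == 1 || new_cpu == 0;
}
```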
Both use the same set of helper functions:
- map access (lookup/update/delete)
- fetch (probe_kernel_read wrappers)
- memcmp (probe_kernel_read + memcmp)
- dump_stack
- trace_printk
The last two are mainly to debug the programs and to print data for
user-space consumption.

Portability:
- kprobe programs are architecture dependent and need a user scripting
  language like ktap/stap/dtrace/perf that will dynamically generate
  them based on the debug info in vmlinux
- tracepoint programs are architecture independent, but if arbitrary
  pointer walking (with the fetch() helpers) is used, they need the data
  struct layout to match. Debug info is not necessary
- for the networking use case we need to access 'struct sk_buff' fields
  in a portable way (user space needs to fetch the packet length without
  knowing the skb->len offset), so for some frequently used data
  structures we will add helper functions or pseudo instructions to
  access them. I've hacked up a few ways specifically for skb, but
  abandoned them in favor of a more generic type/field infrastructure.
  That work is still wip and not part of this set. Once it's ready,
  tracepoint programs that access common data structs will be kernel
  independent.

Program return value:
- programs return 0 to discard an event
- and return non-zero to proceed with the event (allocate the trace
  buffer, copy arguments there and eventually print it in trace_pipe in
  the traditional way)

Examples:
- dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
  to the dropmon tool
- tracex1_kern.c - does net/netif_receive_skb event filtering for the
  skb->dev->name == "lo" condition
- tracex2_kern.c - same kfree_skb() accounting as dropmon, but now in C,
  plus computes a histogram of all write sizes from the sys_write
  syscall and prints the histogram in user space
- tracex3_kern.c - the most sophisticated example, which computes IO
  latency between block/block_rq_issue and block/block_rq_complete
  events and prints a 'heatmap' using gray shades of the text terminal.
  Useful for analyzing disk performance.
- tracex4_kern.c - computes a histogram of write sizes from the
  sys_write syscall using the kprobe mechanism instead of the syscall
  hook. Since the kprobe is optimized via ftrace, the instrumentation
  overhead is smaller than in example 2.

User space tools like ktap/dtrace/systemtap/perf that have access to
debug info would probably want to use the kprobe attachment point, since
a kprobe can be inserted anywhere and all registers are available in the
program. Tracepoint attachments are useful without debug info, so
standalone tools like iosnoop will use them.

The main difference vs the existing perf_probe/ftrace infra is in-kernel
aggregation and conditional walking of arbitrary data structures.

Thanks!

Alexei Starovoitov (9):
  tracing: attach eBPF programs to tracepoints and syscalls
  tracing: allow eBPF programs to call bpf_printk()
  tracing: allow eBPF programs to call ktime_get_ns()
  samples: bpf: simple tracing example in eBPF assembler
  samples: bpf: simple tracing example in C
  samples: bpf: counting example for kfree_skb tracepoint and write
    syscall
  samples: bpf: IO latency analysis (iosnoop/heatmap)
  tracing: attach eBPF programs to kprobe/kretprobe
  samples: bpf: simple kprobe example

 include/linux/ftrace_event.h       |    6 +
 include/trace/bpf_trace.h          |   25 ++++
 include/trace/ftrace.h             |   30 +++++
 include/uapi/linux/bpf.h           |   11 ++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace.c           |  250 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    3 +
 kernel/trace/trace_events.c        |   41 +++++-
 kernel/trace/trace_events_filter.c |   80 +++++++++++-
 kernel/trace/trace_kprobe.c        |   11 +-
 kernel/trace/trace_syscalls.c      |   31 +++++
 samples/bpf/Makefile               |   18 +++
 samples/bpf/bpf_helpers.h          |   18 +++
 samples/bpf/bpf_load.c             |   62 ++++++++-
 samples/bpf/bpf_load.h             |    3 +
 samples/bpf/dropmon.c              |  129 +++++++++++++++++++
 samples/bpf/tracex1_kern.c         |   28 ++++
 samples/bpf/tracex1_user.c         |   24 ++++
 samples/bpf/tracex2_kern.c         |   71 ++++++++++
 samples/bpf/tracex2_user.c         |   95 ++++++++++++++
 samples/bpf/tracex3_kern.c         |   96 ++++++++++++++
 samples/bpf/tracex3_user.c         |  146 +++++++++++++++++++++
 samples/bpf/tracex4_kern.c         |   36 ++++++
 samples/bpf/tracex4_user.c         |   83 ++++++++++++
 25 files changed, 1290 insertions(+), 9 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c
 create mode 100644 samples/bpf/dropmon.c
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c
 create mode 100644 samples/bpf/tracex2_kern.c
 create mode 100644 samples/bpf/tracex2_user.c
 create mode 100644 samples/bpf/tracex3_kern.c
 create mode 100644 samples/bpf/tracex3_user.c
 create mode 100644 samples/bpf/tracex4_kern.c
 create mode 100644 samples/bpf/tracex4_user.c

--
1.7.9.5

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> @ 2015-01-16 15:02 ` Steven Rostedt 2015-01-19 9:52 ` Masami Hiramatsu 1 sibling, 0 replies; 9+ messages in thread From: Steven Rostedt @ 2015-01-16 15:02 UTC (permalink / raw) To: Alexei Starovoitov Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller, Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Thu, 15 Jan 2015 20:16:01 -0800 Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote: > Hi Ingo, Steven, > > This patch set is based on tip/master. Note, the tracing code isn't maintained in tip/master, but perf code is. Using the latest 3.19-rc is probably sufficient for now. Do you have a git repo somewhere that I can look at? It makes it easier than loading in 9 patches ;-) > It adds ability to attach eBPF programs to tracepoints, syscalls and kprobes. > > Mechanism of attaching: > - load program via bpf() syscall and receive program_fd > - event_fd = open("/sys/kernel/debug/tracing/events/.../filter") > - write 'bpf-123' to event_fd where 123 is program_fd > - program will be attached to particular event and event automatically enabled > - close(event_fd) will detach bpf program from event and event disabled > > Program attach point and input arguments: > - programs attached to kprobes receive 'struct pt_regs *' as an input. 
> See tracex4_kern.c that demonstrates how users can write a C program like: > SEC("events/kprobes/sys_write") > int bpf_prog4(struct pt_regs *regs) > { > long write_size = regs->dx; > // here user need to know the proto of sys_write() from kernel > // sources and x64 calling convention to know that register $rdx > // contains 3rd argument to sys_write() which is 'size_t count' > > it's obviously architecture dependent, but allows building sophisticated > user tools on top, that can see from debug info of vmlinux which variables > are in which registers or stack locations and fetch it from there. > 'perf probe' can potentialy use this hook to generate programs in user space > and insert them instead of letting kernel parse string during kprobe creation. > > - programs attached to tracepoints and syscalls receive 'struct bpf_context *': > u64 arg1, arg2, ..., arg6; > for syscalls they match syscall arguments. > for tracepoints these args match arguments passed to tracepoint. > For example: > trace_sched_migrate_task(p, new_cpu); from sched/core.c > arg1 <- p which is 'struct task_struct *' > arg2 <- new_cpu which is 'unsigned int' > arg3..arg6 = 0 > the program can use bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct' > or any other kernel data structures. > These helpers are using probe_kernel_read() similar to 'perf probe' which is > not 100% safe in both cases, but good enough. > To access task_struct's pid inside 'sched_migrate_task' tracepoint > the program can do: > struct task_struct *task = (struct task_struct *)ctx->arg1; > u32 pid = bpf_fetch_u32(&task->pid); > Since struct layout is kernel configuration specific such programs are not > portable and require access to kernel headers to be compiled, > but in this case we don't need debug info. > llvm with bpf backend will statically compute task->pid offset as a constant > based on kernel headers only. 
> The example of this arbitrary pointer walking is tracex1_kern.c > which does skb->dev->name == "lo" filtering. > > In all cases the programs are called before trace buffer is allocated to > minimize the overhead, since we want to filter huge number of events, but > buffer alloc/free and argument copy for every event is too costly. For syscalls this is fine as the parameters are usually set. But there's a lot of tracepoints that we need to know the result of the copied data to decide to filter or not, where the result happens at the TP_fast_assign() part which requires allocating the buffers. Maybe we should have a way to do the program before and/or after the buffering depending on what to filter on. There's no way to know what the parameters of the tracepoint are without looking at the source. > Theoretically we can invoke programs after buffer is allocated, but it > doesn't seem needed, since above approach is faster and achieves the same. Again, for syscalls it may not be a problem, but for other tracepoints, I'm not sure we can do that. How do you handle sched_switch for example? The tracepoint only gets two pointers to task structs, you need to then dereference them to get the pid, prio, state and other data. > > Note, tracepoint/syscall and kprobe programs are two different types: > BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER, > since they expect different input. > Both use the same set of helper functions: > - map access (lookup/update/delete) > - fetch (probe_kernel_read wrappers) > - memcmp (probe_kernel_read + memcmp) > - dump_stack > - trace_printk > The last two are mainly to debug the programs and to print data for user > space consumptions. I have to look at the code, but currently trace_printk() isn't made to be used in production systems. 
> > Portability: > - kprobe programs are architecture dependent and need user scripting > language like ktap/stap/dtrace/perf that will dynamically generate > them based on debug info in vmlinux > - tracepoint programs are architecture independent, but if arbitrary pointer > walking (with fetch() helpers) is used, they need data struct layout to match. > Debug info is not necessary If the program runs after the buffers are allocated, it could still be architecture independent because ftrace gives the information on how to retrieve the fields. One last thing. If the ebpf is used for anything but filtering, it should go into the trigger file. The filtering is only a way to say if the event should be recorded or not. But the trigger could do something else (a printk, a stacktrace, etc). -- Steve > - for networking use case we need to access 'struct sk_buff' fields in portable > way (user space needs to fetch packet length without knowing skb->len offset), > so for some frequently used data structures we will add helper functions > or pseudo instructions to access them. I've hacked few ways specifically > for skb, but abandoned them in favor of more generic type/field infra. > That work is still wip. Not part of this set. > Once it's ready tracepoint programs that access common data structs > will be kernel independent. 
> > Program return value: > - programs return 0 to discard an event > - and return non-zero to proceed with event (allocate trace buffer, copy > arguments there and print it eventually in trace_pipe in traditional way) > > Examples: > - dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar > to dropmon tool > - tracex1_kern.c - does net/netif_receive_skb event filtering > for dev->skb->name == "lo" condition > - tracex2_kern.c - same kfree_skb() accounting like dropmon, but now in C > plus computes histogram of all write sizes from sys_write syscall > and prints the histogram in userspace > - tracex3_kern.c - most sophisticated example that computes IO latency > between block/block_rq_issue and block/block_rq_complete events > and prints 'heatmap' using gray shades of text terminal. > Useful to analyze disk performance. > - tracex4_kern.c - computes histogram of write sizes from sys_write syscall > using kprobe mechanism instead of syscall. Since kprobe is optimized into > ftrace the overhead of instrumentation is smaller than in example 2. > > The user space tools like ktap/dtrace/systemptap/perf that has access > to debug info would probably want to use kprobe attachment point, since kprobe > can be inserted anywhere and all registers are avaiable in the program. > tracepoint attachments are useful without debug info, so standalone tools > like iosnoop will use them. > > The main difference vs existing perf_probe/ftrace infra is in kernel aggregation > and conditional walking of arbitrary data structures. > > Thanks! 
> > Alexei Starovoitov (9): > tracing: attach eBPF programs to tracepoints and syscalls > tracing: allow eBPF programs to call bpf_printk() > tracing: allow eBPF programs to call ktime_get_ns() > samples: bpf: simple tracing example in eBPF assembler > samples: bpf: simple tracing example in C > samples: bpf: counting example for kfree_skb tracepoint and write > syscall > samples: bpf: IO latency analysis (iosnoop/heatmap) > tracing: attach eBPF programs to kprobe/kretprobe > samples: bpf: simple kprobe example > > include/linux/ftrace_event.h | 6 + > include/trace/bpf_trace.h | 25 ++++ > include/trace/ftrace.h | 30 +++++ > include/uapi/linux/bpf.h | 11 ++ > kernel/trace/Kconfig | 1 + > kernel/trace/Makefile | 1 + > kernel/trace/bpf_trace.c | 250 ++++++++++++++++++++++++++++++++++++ > kernel/trace/trace.h | 3 + > kernel/trace/trace_events.c | 41 +++++- > kernel/trace/trace_events_filter.c | 80 +++++++++++- > kernel/trace/trace_kprobe.c | 11 +- > kernel/trace/trace_syscalls.c | 31 +++++ > samples/bpf/Makefile | 18 +++ > samples/bpf/bpf_helpers.h | 18 +++ > samples/bpf/bpf_load.c | 62 ++++++++- > samples/bpf/bpf_load.h | 3 + > samples/bpf/dropmon.c | 129 +++++++++++++++++++ > samples/bpf/tracex1_kern.c | 28 ++++ > samples/bpf/tracex1_user.c | 24 ++++ > samples/bpf/tracex2_kern.c | 71 ++++++++++ > samples/bpf/tracex2_user.c | 95 ++++++++++++++ > samples/bpf/tracex3_kern.c | 96 ++++++++++++++ > samples/bpf/tracex3_user.c | 146 +++++++++++++++++++++ > samples/bpf/tracex4_kern.c | 36 ++++++ > samples/bpf/tracex4_user.c | 83 ++++++++++++ > 25 files changed, 1290 insertions(+), 9 deletions(-) > create mode 100644 include/trace/bpf_trace.h > create mode 100644 kernel/trace/bpf_trace.c > create mode 100644 samples/bpf/dropmon.c > create mode 100644 samples/bpf/tracex1_kern.c > create mode 100644 samples/bpf/tracex1_user.c > create mode 100644 samples/bpf/tracex2_kern.c > create mode 100644 samples/bpf/tracex2_user.c > create mode 100644 samples/bpf/tracex3_kern.c > create 
mode 100644 samples/bpf/tracex3_user.c > create mode 100644 samples/bpf/tracex4_kern.c > create mode 100644 samples/bpf/tracex4_user.c > ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> 2015-01-16 15:02 ` Steven Rostedt @ 2015-01-19 9:52 ` Masami Hiramatsu 2015-01-19 20:48 ` Alexei Starovoitov 1 sibling, 1 reply; 9+ messages in thread From: Masami Hiramatsu @ 2015-01-19 9:52 UTC (permalink / raw) To: Alexei Starovoitov Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller, Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA (2015/01/16 13:16), Alexei Starovoitov wrote: > Hi Ingo, Steven, > > This patch set is based on tip/master. > It adds ability to attach eBPF programs to tracepoints, syscalls and kprobes. > > Mechanism of attaching: > - load program via bpf() syscall and receive program_fd > - event_fd = open("/sys/kernel/debug/tracing/events/.../filter") > - write 'bpf-123' to event_fd where 123 is program_fd > - program will be attached to particular event and event automatically enabled > - close(event_fd) will detach bpf program from event and event disabled > > Program attach point and input arguments: > - programs attached to kprobes receive 'struct pt_regs *' as an input. > See tracex4_kern.c that demonstrates how users can write a C program like: > SEC("events/kprobes/sys_write") > int bpf_prog4(struct pt_regs *regs) > { > long write_size = regs->dx; > // here user need to know the proto of sys_write() from kernel > // sources and x64 calling convention to know that register $rdx > // contains 3rd argument to sys_write() which is 'size_t count' > > it's obviously architecture dependent, but allows building sophisticated > user tools on top, that can see from debug info of vmlinux which variables > are in which registers or stack locations and fetch it from there. 
> 'perf probe' can potentially use this hook to generate programs in user space
> and insert them, instead of letting the kernel parse a string during kprobe
> creation.

Actually, this program just shows raw pt_regs to the handler, but I guess it
is also possible to pass the event arguments that the user gave to perf-probe.
If we can write the script as

int bpf_prog4(s64 write_size)
{
    ...
}

this will be much easier to play with.

> - programs attached to tracepoints and syscalls receive 'struct bpf_context *':
>   u64 arg1, arg2, ..., arg6;
>   for syscalls they match the syscall arguments.
>   for tracepoints these args match the arguments passed to the tracepoint.
>   For example:
>   trace_sched_migrate_task(p, new_cpu); from sched/core.c
>   arg1 <- p which is 'struct task_struct *'
>   arg2 <- new_cpu which is 'unsigned int'
>   arg3..arg6 = 0
>   The program can use the bpf_fetch_u8/16/32/64/ptr() helpers to walk
>   'task_struct' or any other kernel data structure.
>   These helpers use probe_kernel_read(), similar to 'perf probe', which is
>   not 100% safe in either case, but good enough.
>   To access task_struct's pid inside the 'sched_migrate_task' tracepoint
>   the program can do:
>   struct task_struct *task = (struct task_struct *)ctx->arg1;
>   u32 pid = bpf_fetch_u32(&task->pid);
>   Since struct layout is kernel-configuration specific, such programs are not
>   portable and require access to kernel headers to be compiled,
>   but in this case we don't need debug info.
>   llvm with the bpf backend will statically compute the task->pid offset as a
>   constant based on kernel headers alone.
>   The example of this arbitrary pointer walking is tracex1_kern.c,
>   which does skb->dev->name == "lo" filtering.

At least I would like to see it done this way for kprobe events too, since a
kprobe event should be treated as a trace event.
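The bpf_fetch_u32() pointer walk quoted above can't run outside the kernel, but its access pattern can be modeled in plain user-space C. In this sketch the structs are mocks (real layouts are kernel-config specific, which is exactly why the text says such programs need kernel headers), and a memcpy-based fetch_u32() stands in for the probe_kernel_read()-backed helper:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mock stand-ins for the kernel structs referenced by the example. */
struct task_struct_mock {
    long state;
    uint32_t pid;
};

struct bpf_context_mock {
    uint64_t arg1, arg2, arg3, arg4, arg5, arg6;
};

/* User-space stand-in for bpf_fetch_u32(): the kernel helper wraps
 * probe_kernel_read(); a plain memcpy models the same "read 4 bytes
 * from an arbitrary address" contract (minus the fault safety). */
static uint32_t fetch_u32(const void *addr)
{
    uint32_t v;
    memcpy(&v, addr, sizeof(v));
    return v;
}

/* Mirrors the quoted sched_migrate_task filter body: arg1 carries the
 * 'struct task_struct *', and the program walks it to read the pid. */
static uint32_t prog_get_pid(struct bpf_context_mock *ctx)
{
    struct task_struct_mock *task = (struct task_struct_mock *)(uintptr_t)ctx->arg1;
    return fetch_u32(&task->pid);
}
```

The point of the model: the offset of pid inside the struct is resolved at compile time from the headers, with no debug info involved.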
> In all cases the programs are called before the trace buffer is allocated, to
> minimize overhead: we want to filter a huge number of events, and
> buffer alloc/free plus argument copy for every event is too costly.
> Theoretically we could invoke programs after the buffer is allocated, but it
> doesn't seem needed, since the above approach is faster and achieves the same.
>
> Note, tracepoint/syscall and kprobe programs are two different types:
> BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER,
> since they expect different input.
> Both use the same set of helper functions:
> - map access (lookup/update/delete)
> - fetch (probe_kernel_read wrappers)
> - memcmp (probe_kernel_read + memcmp)
> - dump_stack
> - trace_printk
> The last two exist mainly to debug the programs and to print data for
> user-space consumption.
>
> Portability:
> - kprobe programs are architecture dependent and need a user scripting
>   language like ktap/stap/dtrace/perf that will dynamically generate
>   them based on the debug info in vmlinux

If we can use a kprobe event as a normal trace event, user scripting can be
architecture independent too. Only perf-probe fills the gap. All other
userspace tools can collaborate with perf-probe to set up the events.
If so, we can avoid redundant work on debuginfo. That is my point.

Thank you,

> - tracepoint programs are architecture independent, but if arbitrary pointer
>   walking (with the fetch() helpers) is used, they need the data struct layout
>   to match. Debug info is not necessary
> - for the networking use case we need to access 'struct sk_buff' fields in a
>   portable way (user space needs to fetch the packet length without knowing
>   the skb->len offset), so for some frequently used data structures we will
>   add helper functions or pseudo instructions to access them. I've hacked a
>   few ways specifically for skb, but abandoned them in favor of a more
>   generic type/field infra. That work is still wip. Not part of this set.
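The architecture dependence discussed above comes down to the calling convention: on x86-64 (System V ABI) the first six integer/pointer arguments arrive in rdi, rsi, rdx, rcx, r8, r9, which is how the earlier tracex4 example knows the 3rd sys_write() argument ('count') sits in %rdx. A hypothetical script generator, of the kind debuginfo-driven tools could build, might start from a table like this:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* First six integer/pointer argument registers of the x86-64 System V
 * calling convention (regular function calls, as seen by a kprobe on a
 * kernel function such as sys_write). */
static const char *x86_64_arg_regs[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Map a 1-based argument index to its register name, or NULL when the
 * argument is passed on the stack and debuginfo must supply the
 * frame offset instead. */
static const char *arg_reg(unsigned int argno)
{
    if (argno < 1 || argno > 6)
        return NULL;
    return x86_64_arg_regs[argno - 1];
}
```

arg_reg(3) yields "rdx", matching the regs->dx read in bpf_prog4() above; on another architecture the table, and hence the generated program, would differ, which is the portability gap the quoted text describes.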
> Once it's ready, tracepoint programs that access common data structs
> will be kernel independent.
>
> Program return value:
> - programs return 0 to discard an event
> - and return non-zero to proceed with the event (allocate the trace buffer,
>   copy arguments there and eventually print it in trace_pipe in the
>   traditional way)
>
> Examples:
> - dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
>   to the dropmon tool
> - tracex1_kern.c - does net/netif_receive_skb event filtering
>   on the skb->dev->name == "lo" condition
> - tracex2_kern.c - same kfree_skb() accounting as dropmon, but now in C,
>   plus computes a histogram of all write sizes from the sys_write syscall
>   and prints the histogram in userspace
> - tracex3_kern.c - the most sophisticated example: computes IO latency
>   between block/block_rq_issue and block/block_rq_complete events
>   and prints a 'heatmap' using gray shades of a text terminal.
>   Useful to analyze disk performance.
> - tracex4_kern.c - computes a histogram of write sizes from the sys_write
>   syscall using the kprobe mechanism instead of the syscall one. Since the
>   kprobe is optimized into ftrace, the overhead of instrumentation is
>   smaller than in example 2.
>
> User space tools like ktap/dtrace/systemtap/perf that have access
> to debug info would probably want to use the kprobe attachment point, since
> a kprobe can be inserted anywhere and all registers are available to the
> program. Tracepoint attachments are useful without debug info, so standalone
> tools like iosnoop will use them.
>
> The main difference vs the existing perf_probe/ftrace infra is in-kernel
> aggregation and conditional walking of arbitrary data structures.
>
> Thanks!
>
> Alexei Starovoitov (9):
>   tracing: attach eBPF programs to tracepoints and syscalls
>   tracing: allow eBPF programs to call bpf_printk()
>   tracing: allow eBPF programs to call ktime_get_ns()
>   samples: bpf: simple tracing example in eBPF assembler
>   samples: bpf: simple tracing example in C
>   samples: bpf: counting example for kfree_skb tracepoint and write syscall
>   samples: bpf: IO latency analysis (iosnoop/heatmap)
>   tracing: attach eBPF programs to kprobe/kretprobe
>   samples: bpf: simple kprobe example
>
>  include/linux/ftrace_event.h       |    6 +
>  include/trace/bpf_trace.h          |   25 ++++
>  include/trace/ftrace.h             |   30 +++++
>  include/uapi/linux/bpf.h           |   11 ++
>  kernel/trace/Kconfig               |    1 +
>  kernel/trace/Makefile              |    1 +
>  kernel/trace/bpf_trace.c           |  250 ++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.h               |    3 +
>  kernel/trace/trace_events.c        |   41 +++++-
>  kernel/trace/trace_events_filter.c |   80 +++++++++++-
>  kernel/trace/trace_kprobe.c        |   11 +-
>  kernel/trace/trace_syscalls.c      |   31 +++++
>  samples/bpf/Makefile               |   18 +++
>  samples/bpf/bpf_helpers.h          |   18 +++
>  samples/bpf/bpf_load.c             |   62 ++++++++-
>  samples/bpf/bpf_load.h             |    3 +
>  samples/bpf/dropmon.c              |  129 +++++++++++++++++++
>  samples/bpf/tracex1_kern.c         |   28 ++++
>  samples/bpf/tracex1_user.c         |   24 ++++
>  samples/bpf/tracex2_kern.c         |   71 ++++++++++
>  samples/bpf/tracex2_user.c         |   95 ++++++++++++++
>  samples/bpf/tracex3_kern.c         |   96 ++++++++++++++
>  samples/bpf/tracex3_user.c         |  146 +++++++++++++++++++++
>  samples/bpf/tracex4_kern.c         |   36 ++++++
>  samples/bpf/tracex4_user.c         |   83 ++++++++++++
>  25 files changed, 1290 insertions(+), 9 deletions(-)
>  create mode 100644 include/trace/bpf_trace.h
>  create mode 100644 kernel/trace/bpf_trace.c
>  create mode 100644 samples/bpf/dropmon.c
>  create mode 100644 samples/bpf/tracex1_kern.c
>  create mode 100644 samples/bpf/tracex1_user.c
>  create mode 100644 samples/bpf/tracex2_kern.c
>  create mode 100644 samples/bpf/tracex2_user.c
>  create mode 100644 samples/bpf/tracex3_kern.c
>  create mode 100644 samples/bpf/tracex3_user.c
>  create mode 100644 samples/bpf/tracex4_kern.c
>  create mode 100644 samples/bpf/tracex4_user.c

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org

^ permalink raw reply [flat|nested] 9+ messages in thread
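The write-size histograms computed by tracex2/tracex4 in the series above typically bucket by power of two. The actual samples live in the patch set; this is just a user-space sketch of that bucketing logic (names and bucket count are illustrative, not taken from the samples):

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 64

/* Integer log2: index of the highest set bit; returns 0 for v <= 1. */
static unsigned int ilog2_ul(unsigned long v)
{
    unsigned int r = 0;

    while (v > 1) {
        v >>= 1;
        r++;
    }
    return r;
}

/* One histogram slot per power of two of the write size — a user-space
 * model of the per-size aggregation an eBPF map would do in-kernel,
 * which is the "in-kernel aggregation" advantage the cover letter names. */
static void record_write(unsigned long hist[NBUCKETS], unsigned long size)
{
    unsigned int idx = ilog2_ul(size);

    if (idx >= NBUCKETS)
        idx = NBUCKETS - 1;
    hist[idx]++;
}
```

In the real samples the program updates a bpf map on every sys_write and userspace only reads and renders the final histogram, so no per-event data ever crosses the kernel boundary.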
* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  2015-01-19  9:52   ` Masami Hiramatsu
@ 2015-01-19 20:48     ` Alexei Starovoitov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2015-01-19 20:48 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
    Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
    Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg,
    Linux API, Network Development, LKML

On Mon, Jan 19, 2015 at 1:52 AM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> If we can write the script as
>
> int bpf_prog4(s64 write_size)
> {
>     ...
> }
>
> This will be much easier to play with.

Yes, that's the intent for user space to do.

>> The example of this arbitrary pointer walking is tracex1_kern.c
>> which does skb->dev->name == "lo" filtering.
>
> At least I would like to see this way on kprobes event too, since it should be
> treated as a traceevent.

It's done already... one can do the same skb->dev->name logic in a
kprobe-attached program... so from the bpf program's point of view,
tracepoints and kprobes are feature-wise exactly the same. Only the
input is different.

>> - kprobe programs are architecture dependent and need user scripting
>> language like ktap/stap/dtrace/perf that will dynamically generate
>> them based on debug info in vmlinux
>
> If we can use kprobe event as a normal traceevent, user scripting can be
> architecture independent too. Only perf-probe fills the gap. All other
> userspace tools can collaborate with perf-probe to setup the events.
> If so, we can avoid redundant works on debuginfo. That is my point.

Yes. perf already has the infra to read debug info, and it can be extended
to understand a C-like script such as:

int kprobe:sys_write(int fd, char *buf, size_t count)
{
    // do stuff with 'count'
}

perf can be made to parse this text and recognize that it wants to create
a kprobe on the 'sys_write' function.
Then, based on debuginfo, it can figure out where 'count' lives (in a register
or on the stack) and generate the corresponding bpf program, either through
the llvm/gcc backends or directly.
perf's facility for extracting debug info can be made into a library too, and
used by the ktap/dtrace tools for their languages. User space can innovate in
many directions.

And yes, once we have a scripting language, whether it's C-like with perf or
something else, that language hides architecture-dependent things from users.
Such a scripting language will also hide the kernel-side differences between
tracepoints and kprobes. Just look at how similar ktap scripts are for kprobes
and tracepoints. Whether the ktap syntax becomes part of perf or perf invents
its own language, it's going to be good for users regardless.
The C examples here are just that: examples. Something users can play with
already, until more user-friendly tools are built.

^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2015-01-22  2:13 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-22  1:49 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
     [not found] ` <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-22  1:56   ` Steven Rostedt
     [not found]     ` <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
2015-01-22  2:13       ` Alexei Starovoitov
-- strict thread matches above, loose matches on Subject: below --
2015-01-16 18:57 Alexei Starovoitov
2015-01-22  1:03 ` Namhyung Kim
2015-01-16  4:16 Alexei Starovoitov
     [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
2015-01-16 15:02   ` Steven Rostedt
2015-01-19  9:52   ` Masami Hiramatsu
2015-01-19 20:48     ` Alexei Starovoitov