[PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-16  4:16 Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 5/9] samples: bpf: simple tracing example in C Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi Ingo, Steven,

This patch set is based on tip/master.
It adds ability to attach eBPF programs to tracepoints, syscalls and kprobes.

Mechanism of attaching:
- load program via bpf() syscall and receive program_fd
- event_fd = open("/sys/kernel/debug/tracing/events/.../filter")
- write 'bpf-123' to event_fd where 123 is program_fd
- program will be attached to particular event and event automatically enabled
- close(event_fd) will detach bpf program from event and event disabled

Program attach point and input arguments:
- programs attached to kprobes receive 'struct pt_regs *' as an input.
  See tracex4_kern.c that demonstrates how users can write a C program like:
  SEC("events/kprobes/sys_write")
  int bpf_prog4(struct pt_regs *regs)
  {
     long write_size = regs->dx; 
     // here user need to know the proto of sys_write() from kernel
     // sources and x64 calling convention to know that register $rdx
     // contains 3rd argument to sys_write() which is 'size_t count'

  it's obviously architecture dependent, but allows building sophisticated
  user tools on top, that can see from debug info of vmlinux which variables
  are in which registers or stack locations and fetch it from there.
  'perf probe' can potentialy use this hook to generate programs in user space
  and insert them instead of letting kernel parse string during kprobe creation.

- programs attached to tracepoints and syscalls receive 'struct bpf_context *':
  u64 arg1, arg2, ..., arg6;
  for syscalls they match syscall arguments.
  for tracepoints these args match arguments passed to tracepoint.
  For example:
  trace_sched_migrate_task(p, new_cpu); from sched/core.c
  arg1 <- p        which is 'struct task_struct *'
  arg2 <- new_cpu  which is 'unsigned int'
  arg3..arg6 = 0
  the program can use bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct'
  or any other kernel data structures.
  These helpers are using probe_kernel_read() similar to 'perf probe' which is
  not 100% safe in both cases, but good enough.
  To access task_struct's pid inside 'sched_migrate_task' tracepoint
  the program can do:
  struct task_struct *task = (struct task_struct *)ctx->arg1;
  u32 pid = bpf_fetch_u32(&task->pid);
  Since struct layout is kernel configuration specific such programs are not
  portable and require access to kernel headers to be compiled,
  but in this case we don't need debug info.
  llvm with bpf backend will statically compute task->pid offset as a constant
  based on kernel headers only.
  The example of this arbitrary pointer walking is tracex1_kern.c
  which does skb->dev->name == "lo" filtering.

In all cases the programs are called before trace buffer is allocated to
minimize the overhead, since we want to filter huge number of events, but
buffer alloc/free and argument copy for every event is too costly.
Theoretically we can invoke programs after buffer is allocated, but it
doesn't seem needed, since above approach is faster and achieves the same.

Note, tracepoint/syscall and kprobe programs are two different types:
BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER,
since they expect different input.
Both use the same set of helper functions:
- map access (lookup/update/delete)
- fetch (probe_kernel_read wrappers)
- memcmp (probe_kernel_read + memcmp)
- dump_stack
- trace_printk
The last two are mainly to debug the programs and to print data for user
space consumptions.

Portability:
- kprobe programs are architecture dependent and need user scripting
  language like ktap/stap/dtrace/perf that will dynamically generate
  them based on debug info in vmlinux
- tracepoint programs are architecture independent, but if arbitrary pointer
  walking (with fetch() helpers) is used, they need data struct layout to match.
  Debug info is not necessary
- for networking use case we need to access 'struct sk_buff' fields in portable
  way (user space needs to fetch packet length without knowing skb->len offset),
  so for some frequently used data structures we will add helper functions
  or pseudo instructions to access them. I've hacked few ways specifically
  for skb, but abandoned them in favor of more generic type/field infra.
  That work is still wip. Not part of this set.
  Once it's ready tracepoint programs that access common data structs
  will be kernel independent.

Program return value:
- programs return 0 to discard an event
- and return non-zero to proceed with event (allocate trace buffer, copy
  arguments there and print it eventually in trace_pipe in traditional way)

Examples:
- dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
  to dropmon tool
- tracex1_kern.c - does net/netif_receive_skb event filtering
  for dev->skb->name == "lo" condition
- tracex2_kern.c - same kfree_skb() accounting like dropmon, but now in C
  plus computes histogram of all write sizes from sys_write syscall
  and prints the histogram in userspace
- tracex3_kern.c - most sophisticated example that computes IO latency
  between block/block_rq_issue and block/block_rq_complete events
  and prints 'heatmap' using gray shades of text terminal.
  Useful to analyze disk performance.
- tracex4_kern.c - computes histogram of write sizes from sys_write syscall
  using kprobe mechanism instead of syscall. Since kprobe is optimized into
  ftrace the overhead of instrumentation is smaller than in example 2.

The user space tools like ktap/dtrace/systemptap/perf that has access
to debug info would probably want to use kprobe attachment point, since kprobe
can be inserted anywhere and all registers are avaiable in the program.
tracepoint attachments are useful without debug info, so standalone tools
like iosnoop will use them.

The main difference vs existing perf_probe/ftrace infra is in kernel aggregation
and conditional walking of arbitrary data structures.

Thanks!

Alexei Starovoitov (9):
  tracing: attach eBPF programs to tracepoints and syscalls
  tracing: allow eBPF programs to call bpf_printk()
  tracing: allow eBPF programs to call ktime_get_ns()
  samples: bpf: simple tracing example in eBPF assembler
  samples: bpf: simple tracing example in C
  samples: bpf: counting example for kfree_skb tracepoint and write
    syscall
  samples: bpf: IO latency analysis (iosnoop/heatmap)
  tracing: attach eBPF programs to kprobe/kretprobe
  samples: bpf: simple kprobe example

 include/linux/ftrace_event.h       |    6 +
 include/trace/bpf_trace.h          |   25 ++++
 include/trace/ftrace.h             |   30 +++++
 include/uapi/linux/bpf.h           |   11 ++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace.c           |  250 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    3 +
 kernel/trace/trace_events.c        |   41 +++++-
 kernel/trace/trace_events_filter.c |   80 +++++++++++-
 kernel/trace/trace_kprobe.c        |   11 +-
 kernel/trace/trace_syscalls.c      |   31 +++++
 samples/bpf/Makefile               |   18 +++
 samples/bpf/bpf_helpers.h          |   18 +++
 samples/bpf/bpf_load.c             |   62 ++++++++-
 samples/bpf/bpf_load.h             |    3 +
 samples/bpf/dropmon.c              |  129 +++++++++++++++++++
 samples/bpf/tracex1_kern.c         |   28 ++++
 samples/bpf/tracex1_user.c         |   24 ++++
 samples/bpf/tracex2_kern.c         |   71 ++++++++++
 samples/bpf/tracex2_user.c         |   95 ++++++++++++++
 samples/bpf/tracex3_kern.c         |   96 ++++++++++++++
 samples/bpf/tracex3_user.c         |  146 +++++++++++++++++++++
 samples/bpf/tracex4_kern.c         |   36 ++++++
 samples/bpf/tracex4_user.c         |   83 ++++++++++++
 25 files changed, 1290 insertions(+), 9 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c
 create mode 100644 samples/bpf/dropmon.c
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c
 create mode 100644 samples/bpf/tracex2_kern.c
 create mode 100644 samples/bpf/tracex2_user.c
 create mode 100644 samples/bpf/tracex3_kern.c
 create mode 100644 samples/bpf/tracex3_user.c
 create mode 100644 samples/bpf/tracex4_kern.c
 create mode 100644 samples/bpf/tracex4_user.c

-- 
1.7.9.5

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH tip 5/9] samples: bpf: simple tracing example in C
  2015-01-16  4:16 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
@ 2015-01-16  4:16 ` Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 6/9] samples: bpf: counting example for kfree_skb tracepoint and write syscall Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api, netdev, linux-kernel

tracex1_kern.c - C program which will be compiled into eBPF
to filter netif_receive_skb events on skb->dev->name == "lo"
The programs returns 1 to continue storing an event into trace buffer
and returns 0 - to discard an event.

tracex1_user.c - corresponding user space component that
forever reads /sys/.../trace_pipe

Usage:
$ sudo tracex1

should see:
writing bpf-4 -> /sys/kernel/debug/tracing/events/net/netif_receive_skb/filter
  ping-364   [000] ..s2     8.089771: netif_receive_skb: dev=lo skbaddr=ffff88000dfcc100 len=84
  ping-364   [000] ..s2     8.089889: netif_receive_skb: dev=lo skbaddr=ffff88000dfcc900 len=84

Ctrl-C at any time, kernel will auto cleanup

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |    4 +++
 samples/bpf/bpf_helpers.h  |   18 ++++++++++++++
 samples/bpf/bpf_load.c     |   59 +++++++++++++++++++++++++++++++++++++++-----
 samples/bpf/bpf_load.h     |    3 +++
 samples/bpf/tracex1_kern.c |   28 +++++++++++++++++++++
 samples/bpf/tracex1_user.c |   24 ++++++++++++++++++
 6 files changed, 130 insertions(+), 6 deletions(-)
 create mode 100644 samples/bpf/tracex1_kern.c
 create mode 100644 samples/bpf/tracex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 789691374562..da28e1b6d3a6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -7,6 +7,7 @@ hostprogs-y += sock_example
 hostprogs-y += sockex1
 hostprogs-y += sockex2
 hostprogs-y += dropmon
+hostprogs-y += tracex1
 
 dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
@@ -14,17 +15,20 @@ test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
+tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 always += sockex1_kern.o
 always += sockex2_kern.o
+always += tracex1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_sockex1 += -lelf
 HOSTLOADLIBES_sockex2 += -lelf
+HOSTLOADLIBES_tracex1 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index ca0333146006..81388e821eb3 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -15,6 +15,24 @@ static int (*bpf_map_update_elem)(void *map, void *key, void *value,
 	(void *) BPF_FUNC_map_update_elem;
 static int (*bpf_map_delete_elem)(void *map, void *key) =
 	(void *) BPF_FUNC_map_delete_elem;
+static void *(*bpf_fetch_ptr)(void *unsafe_ptr) =
+	(void *) BPF_FUNC_fetch_ptr;
+static unsigned long long (*bpf_fetch_u64)(void *unsafe_ptr) =
+	(void *) BPF_FUNC_fetch_u64;
+static unsigned int (*bpf_fetch_u32)(void *unsafe_ptr) =
+	(void *) BPF_FUNC_fetch_u32;
+static unsigned short (*bpf_fetch_u16)(void *unsafe_ptr) =
+	(void *) BPF_FUNC_fetch_u16;
+static unsigned char (*bpf_fetch_u8)(void *unsafe_ptr) =
+	(void *) BPF_FUNC_fetch_u8;
+static int (*bpf_printk)(const char *fmt, int fmt_size, ...) =
+	(void *) BPF_FUNC_printk;
+static int (*bpf_memcmp)(void *unsafe_ptr, void *safe_ptr, int size) =
+	(void *) BPF_FUNC_memcmp;
+static void (*bpf_dump_stack)(void) =
+	(void *) BPF_FUNC_dump_stack;
+static unsigned long long (*bpf_ktime_get_ns)(void) =
+	(void *) BPF_FUNC_ktime_get_ns;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 1831d236382b..788ac51c1024 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -14,6 +14,8 @@
 #include "bpf_helpers.h"
 #include "bpf_load.h"
 
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
 static char license[128];
 static bool processed_sec[128];
 int map_fd[MAX_MAPS];
@@ -22,15 +24,18 @@ int prog_cnt;
 
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
-	int fd;
 	bool is_socket = strncmp(event, "socket", 6) == 0;
+	enum bpf_prog_type prog_type;
+	char path[256] = DEBUGFS;
+	char fmt[32];
+	int fd, event_fd, err;
 
-	if (!is_socket)
-		/* tracing events tbd */
-		return -1;
+	if (is_socket)
+		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+	else
+		prog_type = BPF_PROG_TYPE_TRACING_FILTER;
 
-	fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
-			   prog, size, license);
+	fd = bpf_prog_load(prog_type, prog, size, license);
 
 	if (fd < 0) {
 		printf("bpf_prog_load() err=%d\n%s", errno, bpf_log_buf);
@@ -39,6 +44,28 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	prog_fd[prog_cnt++] = fd;
 
+	if (is_socket)
+		return 0;
+
+	snprintf(fmt, sizeof(fmt), "bpf-%d", fd);
+
+	strcat(path, event);
+	strcat(path, "/filter");
+
+	printf("writing %s -> %s\n", fmt, path);
+
+	event_fd = open(path, O_WRONLY, 0);
+	if (event_fd < 0) {
+		printf("failed to open event %s\n", event);
+		return -1;
+	}
+
+	err = write(event_fd, fmt, strlen(fmt));
+	if (err < 0) {
+		printf("write to '%s' failed '%s'\n", event, strerror(errno));
+		return -1;
+	}
+
 	return 0;
 }
 
@@ -201,3 +228,23 @@ int load_bpf_file(char *path)
 	close(fd);
 	return 0;
 }
+
+void read_trace_pipe(void)
+{
+	int trace_fd;
+
+	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+	if (trace_fd < 0)
+		return;
+
+	while (1) {
+		static char buf[4096];
+		ssize_t sz;
+
+		sz = read(trace_fd, buf, sizeof(buf));
+		if (sz) {
+			buf[sz] = 0;
+			puts(buf);
+		}
+	}
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 27789a34f5e6..d154fc2b0535 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -21,4 +21,7 @@ extern int prog_fd[MAX_PROGS];
  */
 int load_bpf_file(char *path);
 
+/* forever reads /sys/.../trace_pipe */
+void read_trace_pipe(void);
+
 #endif
diff --git a/samples/bpf/tracex1_kern.c b/samples/bpf/tracex1_kern.c
new file mode 100644
index 000000000000..7849ceb4bce6
--- /dev/null
+++ b/samples/bpf/tracex1_kern.c
@@ -0,0 +1,28 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+SEC("events/net/netif_receive_skb")
+int bpf_prog1(struct bpf_context *ctx)
+{
+	/*
+	 * attaches to /sys/kernel/debug/tracing/events/net/netif_receive_skb
+	 * prints events for loobpack device only
+	 */
+	char devname[] = "lo";
+	struct net_device *dev;
+	struct sk_buff *skb = 0;
+
+	skb = (struct sk_buff *) ctx->arg1;
+	dev = bpf_fetch_ptr(&skb->dev);
+	if (bpf_memcmp(dev->name, devname, 2) == 0)
+		/* print event using default tracepoint format */
+		return 1;
+
+	/* drop event */
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex1_user.c b/samples/bpf/tracex1_user.c
new file mode 100644
index 000000000000..e85c1b483f57
--- /dev/null
+++ b/samples/bpf/tracex1_user.c
@@ -0,0 +1,24 @@
+#include <stdio.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int ac, char **argv)
+{
+	FILE *f;
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	read_trace_pipe();
+
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 6/9] samples: bpf: counting example for kfree_skb tracepoint and write syscall
  2015-01-16  4:16 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 5/9] samples: bpf: simple tracing example in C Alexei Starovoitov
@ 2015-01-16  4:16 ` Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 7/9] samples: bpf: IO latency analysis (iosnoop/heatmap) Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api, netdev, linux-kernel

this example has two probes in one C file that attach to different tracepoints
and use two different maps.

1st probe is the similar to dropmon.c. It attaches to kfree_skb tracepoint and
count number of packet drops at different locations

2nd probe attaches to syscalls/sys_enter_write and computes a histogram of different
write sizes

Usage:
$ sudo tracex2
writing bpf-8 -> /sys/kernel/debug/tracing/events/skb/kfree_skb/filter
writing bpf-10 -> /sys/kernel/debug/tracing/events/syscalls/sys_enter_write/filter
location 0xffffffff816959a5 count 1

location 0xffffffff816959a5 count 2

557145+0 records in
557145+0 records out
285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
           syscall write() stats
     byte_size       : count     distribution
       1 -> 1        : 3        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 2        |                                      |
      32 -> 63       : 3        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 1        |                                      |
     256 -> 511      : 0        |                                      |
     512 -> 1023     : 1118968  |************************************* |

Ctrl-C at any time. Kernel will auto cleanup maps and programs

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |    4 ++
 samples/bpf/tracex2_kern.c |   71 +++++++++++++++++++++++++++++++++
 samples/bpf/tracex2_user.c |   95 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 170 insertions(+)
 create mode 100644 samples/bpf/tracex2_kern.c
 create mode 100644 samples/bpf/tracex2_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index da28e1b6d3a6..416af24b01fd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -8,6 +8,7 @@ hostprogs-y += sockex1
 hostprogs-y += sockex2
 hostprogs-y += dropmon
 hostprogs-y += tracex1
+hostprogs-y += tracex2
 
 dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
@@ -16,12 +17,14 @@ sock_example-objs := sock_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
+tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 always += sockex1_kern.o
 always += sockex2_kern.o
 always += tracex1_kern.o
+always += tracex2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -29,6 +32,7 @@ HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_sockex1 += -lelf
 HOSTLOADLIBES_sockex2 += -lelf
 HOSTLOADLIBES_tracex1 += -lelf
+HOSTLOADLIBES_tracex2 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2_kern.c
new file mode 100644
index 000000000000..a789c456c1b4
--- /dev/null
+++ b/samples/bpf/tracex2_kern.c
@@ -0,0 +1,71 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(long),
+	.value_size = sizeof(long),
+	.max_entries = 1024,
+};
+
+SEC("events/skb/kfree_skb")
+int bpf_prog2(struct bpf_context *ctx)
+{
+	long loc = ctx->arg2;
+	long init_val = 1;
+	long *value;
+
+	value = bpf_map_lookup_elem(&my_map, &loc);
+	if (value)
+		*value += 1;
+	else
+		bpf_map_update_elem(&my_map, &loc, &init_val, BPF_ANY);
+	return 0;
+}
+
+static unsigned int log2(unsigned int v)
+{
+	unsigned int r;
+	unsigned int shift;
+
+	r = (v > 0xFFFF) << 4; v >>= r;
+	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
+	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
+	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
+	r |= (v >> 1);
+	return r;
+}
+
+static unsigned int log2l(unsigned long v)
+{
+	unsigned int hi = v >> 32;
+	if (hi)
+		return log2(hi) + 32;
+	else
+		return log2(v);
+}
+
+struct bpf_map_def SEC("maps") my_hist_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 64,
+};
+
+SEC("events/syscalls/sys_enter_write")
+int bpf_prog3(struct bpf_context *ctx)
+{
+	long write_size = ctx->arg3;
+	long init_val = 1;
+	long *value;
+	u32 index = log2l(write_size);
+
+	value = bpf_map_lookup_elem(&my_hist_map, &index);
+	if (value)
+		__sync_fetch_and_add(value, 1);
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex2_user.c b/samples/bpf/tracex2_user.c
new file mode 100644
index 000000000000..016a76e97cd7
--- /dev/null
+++ b/samples/bpf/tracex2_user.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_INDEX	64
+#define MAX_STARS	38
+
+static void stars(char *str, long val, long max, int width)
+{
+	int i;
+
+	for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+		str[i] = '*';
+	if (val > max)
+		str[i - 1] = '+';
+	str[i] = '\0';
+}
+
+static void print_hist(int fd)
+{
+	int key;
+	long value;
+	long data[MAX_INDEX] = {};
+	char starstr[MAX_STARS];
+	int i;
+	int max_ind = -1;
+	long max_value = 0;
+
+	for (key = 0; key < MAX_INDEX; key++) {
+		bpf_lookup_elem(fd, &key, &value);
+		data[key] = value;
+		if (value && key > max_ind)
+			max_ind = key;
+		if (value > max_value)
+			max_value = value;
+	}
+
+	printf("           syscall write() stats\n");
+	printf("     byte_size       : count     distribution\n");
+	for (i = 1; i <= max_ind + 1; i++) {
+		stars(starstr, data[i - 1], max_value, MAX_STARS);
+		printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+		       (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+		       MAX_STARS, starstr);
+	}
+}
+static void int_exit(int sig)
+{
+	print_hist(map_fd[1]);
+	exit(0);
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	long key, next_key, value;
+	FILE *f;
+	int i;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	signal(SIGINT, int_exit);
+
+	/* start 'ping' in the background to have some kfree_skb events */
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	/* start 'dd' in the background to have plenty of 'write' syscalls */
+	f = popen("dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	for (i = 0; i < 5; i++) {
+		key = 0;
+		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_lookup_elem(map_fd[0], &next_key, &value);
+			printf("location 0x%lx count %ld\n", next_key, value);
+			key = next_key;
+		}
+		if (key)
+			printf("\n");
+		sleep(1);
+	}
+	print_hist(map_fd[1]);
+
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 7/9] samples: bpf: IO latency analysis (iosnoop/heatmap)
  2015-01-16  4:16 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 5/9] samples: bpf: simple tracing example in C Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 6/9] samples: bpf: counting example for kfree_skb tracepoint and write syscall Alexei Starovoitov
@ 2015-01-16  4:16 ` Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 8/9] tracing: attach eBPF programs to kprobe/kretprobe Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api, netdev, linux-kernel

eBPF C program attaches to block_rq_issue/block_rq_complete events to calculate
IO latency. Then it waits for the first 100 events to compute average latency
and uses range [0 .. ave_lat * 2] to record histogram of events in this latency
range.
User space reads this histogram map every 2 seconds and prints it as a 'heatmap'
using gray shades of text terminal. Black spaces have many events and white
spaces have very few events. Left most space is the smallest latency, right most
space is the largest latency in the range.
If kernel sees too many events that fall out of histogram range, user space
adjusts the range up, so heatmap for next 2 seconds will be more accurate.

Usage:
$ sudo ./tracex3
and do 'sudo dd if=/dev/sda of=/dev/null' in other terminal.
Observe IO latencies and how different activity (like 'make kernel') affects it.

Similar experiments can be done for network transmit latencies, syscalls, etc

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |    4 ++
 samples/bpf/tracex3_kern.c |   96 +++++++++++++++++++++++++++++
 samples/bpf/tracex3_user.c |  146 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 246 insertions(+)
 create mode 100644 samples/bpf/tracex3_kern.c
 create mode 100644 samples/bpf/tracex3_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 416af24b01fd..da0efd8032ab 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -9,6 +9,7 @@ hostprogs-y += sockex2
 hostprogs-y += dropmon
 hostprogs-y += tracex1
 hostprogs-y += tracex2
+hostprogs-y += tracex3
 
 dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
@@ -18,6 +19,7 @@ sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
 tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
+tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -25,6 +27,7 @@ always += sockex1_kern.o
 always += sockex2_kern.o
 always += tracex1_kern.o
 always += tracex2_kern.o
+always += tracex3_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -33,6 +36,7 @@ HOSTLOADLIBES_sockex1 += -lelf
 HOSTLOADLIBES_sockex2 += -lelf
 HOSTLOADLIBES_tracex1 += -lelf
 HOSTLOADLIBES_tracex2 += -lelf
+HOSTLOADLIBES_tracex3 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/tracex3_kern.c b/samples/bpf/tracex3_kern.c
new file mode 100644
index 000000000000..fa04603b80b8
--- /dev/null
+++ b/samples/bpf/tracex3_kern.c
@@ -0,0 +1,96 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(long),
+	.value_size = sizeof(u64),
+	.max_entries = 4096,
+};
+
+SEC("events/block/block_rq_issue")
+int bpf_prog1(struct bpf_context *ctx)
+{
+	long rq = ctx->arg2;
+	u64 val = bpf_ktime_get_ns();
+
+	bpf_map_update_elem(&my_map, &rq, &val, BPF_ANY);
+	return 0;
+}
+
+struct globals {
+	u64 lat_ave;
+	u64 lat_sum;
+	u64 missed;
+	u64 max_lat;
+	int num_samples;
+};
+
+struct bpf_map_def SEC("maps") global_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct globals),
+	.max_entries = 1,
+};
+
+#define MAX_SLOT 32
+
+struct bpf_map_def SEC("maps") lat_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u64),
+	.max_entries = MAX_SLOT,
+};
+
+SEC("events/block/block_rq_complete")
+int bpf_prog2(struct bpf_context *ctx)
+{
+	long rq = ctx->arg2;
+	void *value;
+
+	value = bpf_map_lookup_elem(&my_map, &rq);
+	if (!value)
+		return 0;
+
+	u64 cur_time = bpf_ktime_get_ns();
+	u64 delta = (cur_time - *(u64 *)value) / 1000;
+
+	bpf_map_delete_elem(&my_map, &rq);
+
+	int ind = 0;
+	struct globals *g = bpf_map_lookup_elem(&global_map, &ind);
+	if (!g)
+		return 0;
+	if (g->lat_ave == 0) {
+		g->num_samples++;
+		g->lat_sum += delta;
+		if (g->num_samples >= 100) {
+			g->lat_ave = g->lat_sum / g->num_samples;
+			if (0/* debug */) {
+				char fmt[] = "after %d samples average latency %ld usec\n";
+				bpf_printk(fmt, sizeof(fmt), g->num_samples,
+					   g->lat_ave);
+			}
+		}
+	} else {
+		u64 max_lat = g->lat_ave * 2;
+		if (delta > max_lat) {
+			g->missed++;
+			if (delta > g->max_lat)
+				g->max_lat = delta;
+			return 0;
+		}
+
+		ind = delta * MAX_SLOT / max_lat;
+		value = bpf_map_lookup_elem(&lat_map, &ind);
+		if (!value)
+			return 0;
+		(*(u64 *)value) ++;
+	}
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex3_user.c b/samples/bpf/tracex3_user.c
new file mode 100644
index 000000000000..1945147925b5
--- /dev/null
+++ b/samples/bpf/tracex3_user.c
@@ -0,0 +1,146 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+struct globals {
+	__u64 lat_ave;
+	__u64 lat_sum;
+	__u64 missed;
+	__u64 max_lat;
+	int num_samples;
+};
+
+static void clear_stats(int fd)
+{
+	int key;
+	__u64 value = 0;
+	for (key = 0; key < 32; key++)
+		bpf_update_elem(fd, &key, &value, BPF_ANY);
+}
+
+const char *color[] = {
+	"\033[48;5;255m",
+	"\033[48;5;252m",
+	"\033[48;5;250m",
+	"\033[48;5;248m",
+	"\033[48;5;246m",
+	"\033[48;5;244m",
+	"\033[48;5;242m",
+	"\033[48;5;240m",
+	"\033[48;5;238m",
+	"\033[48;5;236m",
+	"\033[48;5;234m",
+	"\033[48;5;232m",
+};
+const int num_colors = ARRAY_SIZE(color);
+
+const char nocolor[] = "\033[00m";
+
+static void print_banner(__u64 max_lat)
+{
+	printf("0 usec     ...          %lld usec\n", max_lat);
+}
+
+static void print_hist(int fd)
+{
+	int key;
+	__u64 value;
+	__u64 cnt[32];
+	__u64 max_cnt = 0;
+	__u64 total_events = 0;
+	int max_bucket = 0;
+
+	for (key = 0; key < 32; key++) {
+		value = 0;
+		bpf_lookup_elem(fd, &key, &value);
+		if (value > 0)
+			max_bucket = key;
+		cnt[key] = value;
+		total_events += value;
+		if (value > max_cnt)
+			max_cnt = value;
+	}
+	clear_stats(fd);
+	for (key = 0; key < 32; key++) {
+		int c = num_colors * cnt[key] / (max_cnt + 1);
+		printf("%s %s", color[c], nocolor);
+	}
+	printf(" captured=%lld", total_events);
+
+	key = 0;
+	struct globals g = {};
+	bpf_lookup_elem(map_fd[1], &key, &g);
+
+	printf(" missed=%lld max_lat=%lld usec\n",
+	       g.missed, g.max_lat);
+
+	if (g.missed > 10 && g.missed > total_events / 10) {
+		printf("adjusting range UP...\n");
+		g.lat_ave = g.max_lat / 2;
+		print_banner(g.lat_ave * 2);
+	} else if (max_bucket < 4 && total_events > 100) {
+		printf("adjusting range DOWN...\n");
+		g.lat_ave = g.lat_ave / 4;
+		print_banner(g.lat_ave * 2);
+	}
+	/* clear some globals */
+	g.missed = 0;
+	g.max_lat = 0;
+	bpf_update_elem(map_fd[1], &key, &g, BPF_ANY);
+}
+
+static void int_exit(int sig)
+{
+	print_hist(map_fd[2]);
+	exit(0);
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	clear_stats(map_fd[2]);
+
+	signal(SIGINT, int_exit);
+
+	if (fork() == 0) {
+		read_trace_pipe();
+	} else {
+		struct globals g;
+
+		printf("waiting for events to determine average latency...\n");
+		for (;;) {
+			int key = 0;
+			bpf_lookup_elem(map_fd[1], &key, &g);
+			if (g.lat_ave)
+				break;
+			sleep(1);
+		}
+
+		printf("  IO latency in usec\n"
+		       "  %s %s - many events with this latency\n"
+		       "  %s %s - few events\n",
+		       color[num_colors - 1], nocolor,
+		       color[0], nocolor);
+		print_banner(g.lat_ave * 2);
+		for (;;) {
+			print_hist(map_fd[2]);
+			sleep(2);
+		}
+	}
+
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 8/9] tracing: attach eBPF programs to kprobe/kretprobe
  2015-01-16  4:16 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2015-01-16  4:16 ` [PATCH tip 7/9] samples: bpf: IO latency analysis (iosnoop/heatmap) Alexei Starovoitov
@ 2015-01-16  4:16 ` Alexei Starovoitov
  2015-01-16  4:16 ` [PATCH tip 9/9] samples: bpf: simple kprobe example Alexei Starovoitov
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api, netdev, linux-kernel

introduce new type of eBPF programs BPF_PROG_TYPE_KPROBE_FILTER.
Such programs are allowed to call the same helper functions
as tracing filters, but bpf_context is different:
For tracing filters bpf_context is 6 arguments of tracepoints or syscalls
For kprobe filters bpf_context == pt_regs

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/ftrace_event.h       |    2 ++
 include/uapi/linux/bpf.h           |    1 +
 kernel/trace/bpf_trace.c           |   39 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_events_filter.c |   10 ++++++---
 kernel/trace/trace_kprobe.c        |   11 +++++++++-
 5 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index a3897f5e43ca..0f1a0418bef7 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -249,6 +249,7 @@ enum {
 	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
 	TRACE_EVENT_FL_TRACEPOINT_BIT,
 	TRACE_EVENT_FL_BPF_BIT,
+	TRACE_EVENT_FL_KPROBE_BIT,
 };
 
 /*
@@ -272,6 +273,7 @@ enum {
 	TRACE_EVENT_FL_USE_CALL_FILTER	= (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
 	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
 	TRACE_EVENT_FL_BPF		= (1 << TRACE_EVENT_FL_BPF_BIT),
+	TRACE_EVENT_FL_KPROBE		= (1 << TRACE_EVENT_FL_KPROBE_BIT),
 };
 
 struct ftrace_event_call {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6075c4f4b67e..79ca0c63ffaf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -119,6 +119,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
 	BPF_PROG_TYPE_SOCKET_FILTER,
 	BPF_PROG_TYPE_TRACING_FILTER,
+	BPF_PROG_TYPE_KPROBE_FILTER,
 };
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 14cfbbcec32e..c485c7cc8d57 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -209,3 +209,42 @@ static int __init register_tracing_filter_ops(void)
 	return 0;
 }
 late_initcall(register_tracing_filter_ops);
+
+/* check access to fields of 'struct pt_regs' from BPF program */
+static bool kprobe_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	/* check bounds */
+	if (off < 0 || off >= sizeof(struct pt_regs))
+		return false;
+
+	/* only read is allowed */
+	if (type != BPF_READ)
+		return false;
+
+	/* disallow misaligned access */
+	if (off % size != 0)
+		return false;
+
+	return true;
+}
+/* kprobe filter programs are allowed to call the same helper functions
+ * as tracing filters, but bpf_context is different:
+ * For tracing filters bpf_context is 6 arguments of tracepoints or syscalls
+ * For kprobe filters bpf_context == pt_regs
+ */
+static struct bpf_verifier_ops kprobe_filter_ops = {
+	.get_func_proto = tracing_filter_func_proto,
+	.is_valid_access = kprobe_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list kprobe_tl = {
+	.ops = &kprobe_filter_ops,
+	.type = BPF_PROG_TYPE_KPROBE_FILTER,
+};
+
+static int __init register_kprobe_filter_ops(void)
+{
+	bpf_register_prog_type(&kprobe_tl);
+	return 0;
+}
+late_initcall(register_kprobe_filter_ops);
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index bb0140414238..75b7e93b2d28 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1891,7 +1891,8 @@ static int create_filter_start(char *filter_str, bool set_str,
 	return err;
 }
 
-static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
+static int create_filter_bpf(struct ftrace_event_call *call, char *filter_str,
+			     struct event_filter **filterp)
 {
 	struct event_filter *filter;
 	struct bpf_prog *prog;
@@ -1920,7 +1921,10 @@ static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
 
 	filter->prog = prog;
 
-	if (prog->aux->prog_type != BPF_PROG_TYPE_TRACING_FILTER) {
+	if (((call->flags & TRACE_EVENT_FL_KPROBE) &&
+	     prog->aux->prog_type != BPF_PROG_TYPE_KPROBE_FILTER) ||
+	    (!(call->flags & TRACE_EVENT_FL_KPROBE) &&
+	     prog->aux->prog_type != BPF_PROG_TYPE_TRACING_FILTER)) {
 		/* valid fd, but invalid bpf program type */
 		err = -EINVAL;
 		goto free_filter;
@@ -2051,7 +2055,7 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
 	 */
 	if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
 	    filter_string[4] != 0) {
-		err = create_filter_bpf(filter_string, &filter);
+		err = create_filter_bpf(call, filter_string, &filter);
 		if (!err)
 			file->flags |= TRACE_EVENT_FL_BPF;
 	} else {
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 296079ae6583..113d10973e39 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,6 +19,7 @@
 
 #include <linux/module.h>
 #include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
 
 #include "trace_probe.h"
 
@@ -930,6 +931,10 @@ __kprobe_trace_func(struct trace_kprobe *tk, struct pt_regs *regs,
 	if (ftrace_trigger_soft_disabled(ftrace_file))
 		return;
 
+	if (ftrace_file->flags & TRACE_EVENT_FL_BPF)
+		if (trace_filter_call_bpf(ftrace_file->filter, regs) == 0)
+			return;
+
 	local_save_flags(irq_flags);
 	pc = preempt_count();
 
@@ -978,6 +983,10 @@ __kretprobe_trace_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
 	if (ftrace_trigger_soft_disabled(ftrace_file))
 		return;
 
+	if (ftrace_file->flags & TRACE_EVENT_FL_BPF)
+		if (trace_filter_call_bpf(ftrace_file->filter, regs) == 0)
+			return;
+
 	local_save_flags(irq_flags);
 	pc = preempt_count();
 
@@ -1286,7 +1295,7 @@ static int register_kprobe_event(struct trace_kprobe *tk)
 		kfree(call->print_fmt);
 		return -ENODEV;
 	}
-	call->flags = 0;
+	call->flags = TRACE_EVENT_FL_KPROBE;
 	call->class->reg = kprobe_register;
 	call->data = tk;
 	ret = trace_add_event_call(call);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 9/9] samples: bpf: simple kprobe example
  2015-01-16  4:16 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2015-01-16  4:16 ` [PATCH tip 8/9] tracing: attach eBPF programs to kprobe/kretprobe Alexei Starovoitov
@ 2015-01-16  4:16 ` Alexei Starovoitov
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api, netdev, linux-kernel

the logic of the example is similar to tracex2, but syscall 'write' statistics
is capturead from kprobe placed at sys_write function instead of through
syscall instrumentation.
Also tracex4_kern.c has a different way of doing log2 in C.
Note, unlike tracepoint and syscall programs, kprobe programs receive
'struct pt_regs' as an input. It's responsibility of the program author
or higher level dynamic tracing tool to match registers to function arguments.
Since pt_regs are architecture dependent, programs are also arch dependent,
unlike tracepoint/syscalls programs which are universal.

Usage:
$ sudo tracex4
writing bpf-6 -> /sys/kernel/debug/tracing/events/kprobes/sys_write/filter
2216443+0 records in
2216442+0 records out
1134818304 bytes (1.1 GB) copied, 2.00746 s, 565 MB/s

           kprobe sys_write() stats
     byte_size       : count     distribution
       1 -> 1        : 0        |                                      |
       2 -> 3        : 0        |                                      |
       4 -> 7        : 0        |                                      |
       8 -> 15       : 0        |                                      |
      16 -> 31       : 0        |                                      |
      32 -> 63       : 0        |                                      |
      64 -> 127      : 1        |                                      |
     128 -> 255      : 0        |                                      |
     256 -> 511      : 0        |                                      |
     512 -> 1023     : 2214734  |************************************* |

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 samples/bpf/Makefile       |    4 +++
 samples/bpf/bpf_load.c     |    3 ++
 samples/bpf/tracex4_kern.c |   36 +++++++++++++++++++
 samples/bpf/tracex4_user.c |   83 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 126 insertions(+)
 create mode 100644 samples/bpf/tracex4_kern.c
 create mode 100644 samples/bpf/tracex4_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index da0efd8032ab..22c7a38f3f95 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -10,6 +10,7 @@ hostprogs-y += dropmon
 hostprogs-y += tracex1
 hostprogs-y += tracex2
 hostprogs-y += tracex3
+hostprogs-y += tracex4
 
 dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
@@ -20,6 +21,7 @@ sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
 tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
 tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
+tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -28,6 +30,7 @@ always += sockex2_kern.o
 always += tracex1_kern.o
 always += tracex2_kern.o
 always += tracex3_kern.o
+always += tracex4_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -37,6 +40,7 @@ HOSTLOADLIBES_sockex2 += -lelf
 HOSTLOADLIBES_tracex1 += -lelf
 HOSTLOADLIBES_tracex2 += -lelf
 HOSTLOADLIBES_tracex3 += -lelf
+HOSTLOADLIBES_tracex4 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 788ac51c1024..d8c5176f0564 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -25,6 +25,7 @@ int prog_cnt;
 static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 {
 	bool is_socket = strncmp(event, "socket", 6) == 0;
+	bool is_kprobe = strncmp(event, "events/kprobes/", 15) == 0;
 	enum bpf_prog_type prog_type;
 	char path[256] = DEBUGFS;
 	char fmt[32];
@@ -32,6 +33,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	if (is_socket)
 		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+	else if (is_kprobe)
+		prog_type = BPF_PROG_TYPE_KPROBE_FILTER;
 	else
 		prog_type = BPF_PROG_TYPE_TRACING_FILTER;
 
diff --git a/samples/bpf/tracex4_kern.c b/samples/bpf/tracex4_kern.c
new file mode 100644
index 000000000000..9646f9e43417
--- /dev/null
+++ b/samples/bpf/tracex4_kern.c
@@ -0,0 +1,36 @@
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <uapi/linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include "bpf_helpers.h"
+
+static unsigned int log2l(unsigned long long n)
+{
+#define S(k) if (n >= (1ull << k)) { i += k; n >>= k; }
+	int i = -(n == 0);
+	S(32); S(16); S(8); S(4); S(2); S(1);
+	return i;
+#undef S
+}
+
+struct bpf_map_def SEC("maps") my_hist_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 64,
+};
+
+SEC("events/kprobes/sys_write")
+int bpf_prog4(struct pt_regs *regs)
+{
+	long write_size = regs->dx; /* $rdx contains 3rd argument to a function */
+	long init_val = 1;
+	void *value;
+	u32 index = log2l(write_size);
+
+	value = bpf_map_lookup_elem(&my_hist_map, &index);
+	if (value)
+		__sync_fetch_and_add((long *)value, 1);
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/tracex4_user.c b/samples/bpf/tracex4_user.c
new file mode 100644
index 000000000000..47dde2791f9e
--- /dev/null
+++ b/samples/bpf/tracex4_user.c
@@ -0,0 +1,83 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_INDEX	64
+#define MAX_STARS	38
+
+static void stars(char *str, long val, long max, int width)
+{
+	int i;
+
+	for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+		str[i] = '*';
+	if (val > max)
+		str[i - 1] = '+';
+	str[i] = '\0';
+}
+
+static void print_hist(int fd)
+{
+	int key;
+	long value;
+	long data[MAX_INDEX] = {};
+	char starstr[MAX_STARS];
+	int i;
+	int max_ind = -1;
+	long max_value = 0;
+
+	for (key = 0; key < MAX_INDEX; key++) {
+		bpf_lookup_elem(fd, &key, &value);
+		data[key] = value;
+		if (value && key > max_ind)
+			max_ind = key;
+		if (value > max_value)
+			max_value = value;
+	}
+
+	printf("\n           kprobe sys_write() stats\n");
+	printf("     byte_size       : count     distribution\n");
+	for (i = 1; i <= max_ind + 1; i++) {
+		stars(starstr, data[i - 1], max_value, MAX_STARS);
+		printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+		       (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+		       MAX_STARS, starstr);
+	}
+}
+static void int_exit(int sig)
+{
+	print_hist(map_fd[0]);
+	exit(0);
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	FILE *f;
+	int i;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	signal(SIGINT, int_exit);
+
+	i = system("echo 'p:sys_write sys_write' > /sys/kernel/debug/tracing/kprobe_events");
+	(void) i;
+	
+	/* start 'dd' in the background to have plenty of 'write' syscalls */
+	f = popen("dd if=/dev/zero of=/dev/null", "r");
+	(void) f;
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	sleep(2);
+	kill(0, SIGINT); /* send Ctrl-C to self and to 'dd' */
+
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

[parent not found: <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>]

* [PATCH tip 1/9] tracing: attach eBPF programs to tracepoints and syscalls
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
@ 2015-01-16  4:16   ` Alexei Starovoitov
  2015-01-16  4:16   ` [PATCH tip 2/9] tracing: allow eBPF programs to call bpf_printk() Alexei Starovoitov
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

User interface:
fd = open("/sys/kernel/debug/tracing/__event__/filter")

write(fd, "bpf_123")

where 123 is process local FD associated with eBPF program previously loaded.
__event__ is static tracepoint event or syscall.
(kprobe support is in next patch)
Once program is successfully attached to tracepoint event, the tracepoint
will be auto-enabled

close(fd)
auto-disables tracepoint event and detaches eBPF program from it

eBPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- memcmp
- dump_stack
- fetch_ptr/u64/u32/u16/u8 values from unsafe address via probe_kernel_read(),
  so that eBPF program can walk any kernel data structures

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 include/linux/ftrace_event.h       |    4 ++
 include/trace/bpf_trace.h          |   25 +++++++
 include/trace/ftrace.h             |   30 ++++++++
 include/uapi/linux/bpf.h           |    8 +++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace.c           |  140 ++++++++++++++++++++++++++++++++++++
 kernel/trace/trace.h               |    3 +
 kernel/trace/trace_events.c        |   33 ++++++++-
 kernel/trace/trace_events_filter.c |   76 +++++++++++++++++++-
 kernel/trace/trace_syscalls.c      |   31 ++++++++
 11 files changed, 350 insertions(+), 2 deletions(-)
 create mode 100644 include/trace/bpf_trace.h
 create mode 100644 kernel/trace/bpf_trace.c

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index d36f68b08acc..a3897f5e43ca 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -248,6 +248,7 @@ enum {
 	TRACE_EVENT_FL_WAS_ENABLED_BIT,
 	TRACE_EVENT_FL_USE_CALL_FILTER_BIT,
 	TRACE_EVENT_FL_TRACEPOINT_BIT,
+	TRACE_EVENT_FL_BPF_BIT,
 };
 
 /*
@@ -270,6 +271,7 @@ enum {
 	TRACE_EVENT_FL_WAS_ENABLED	= (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
 	TRACE_EVENT_FL_USE_CALL_FILTER	= (1 << TRACE_EVENT_FL_USE_CALL_FILTER_BIT),
 	TRACE_EVENT_FL_TRACEPOINT	= (1 << TRACE_EVENT_FL_TRACEPOINT_BIT),
+	TRACE_EVENT_FL_BPF		= (1 << TRACE_EVENT_FL_BPF_BIT),
 };
 
 struct ftrace_event_call {
@@ -544,6 +546,8 @@ event_trigger_unlock_commit_regs(struct ftrace_event_file *file,
 		event_triggers_post_call(file, tt);
 }
 
+unsigned int trace_filter_call_bpf(struct event_filter *filter, void *ctx);
+
 enum {
 	FILTER_OTHER = 0,
 	FILTER_STATIC_STRING,
diff --git a/include/trace/bpf_trace.h b/include/trace/bpf_trace.h
new file mode 100644
index 000000000000..4e64f61f484d
--- /dev/null
+++ b/include/trace/bpf_trace.h
@@ -0,0 +1,25 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _LINUX_KERNEL_BPF_TRACE_H
+#define _LINUX_KERNEL_BPF_TRACE_H
+
+/* For tracepoint filters argN fields match one to one to arguments
+ * passed to tracepoint events
+ *
+ * For syscall entry filters argN fields match syscall arguments
+ * For syscall exit filters arg1 is a return value
+ */
+struct bpf_context {
+	u64 arg1;
+	u64 arg2;
+	u64 arg3;
+	u64 arg4;
+	u64 arg5;
+	u64 arg6;
+};
+
+#endif /* _LINUX_KERNEL_BPF_TRACE_H */
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 27609dfcce25..7b2cf74a9b08 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -17,6 +17,7 @@
  */
 
 #include <linux/ftrace_event.h>
+#include <trace/bpf_trace.h>
 
 /*
  * DECLARE_EVENT_CLASS can be used to add a generic function
@@ -617,6 +618,24 @@ static inline notrace int ftrace_get_offsets_##call(			\
 #undef __perf_task
 #define __perf_task(t)	(t)
 
+/* zero extend integer, pointer or aggregate type to u64 without warnings */
+#define __CAST_TO_U64(expr) ({ \
+	u64 ret = 0; \
+	switch (sizeof(expr)) { \
+	case 8: ret = *(u64 *) &expr; break; \
+	case 4: ret = *(u32 *) &expr; break; \
+	case 2: ret = *(u16 *) &expr; break; \
+	case 1: ret = *(u8 *) &expr; break; \
+	} \
+	ret; })
+
+#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
+#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
+#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
+#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
+#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
+#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
+
 #undef DECLARE_EVENT_CLASS
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 									\
@@ -632,6 +651,17 @@ ftrace_raw_event_##call(void *__data, proto)				\
 	if (ftrace_trigger_soft_disabled(ftrace_file))			\
 		return;							\
 									\
+	if (unlikely(ftrace_file->flags & FTRACE_EVENT_FL_FILTERED) &&	\
+	    unlikely(ftrace_file->flags & TRACE_EVENT_FL_BPF)) {	\
+		__maybe_unused const u64 z = 0;				\
+		struct bpf_context __ctx = ((struct bpf_context) {	\
+				__BPF_CAST6(args, z, z, z, z, z)	\
+			});						\
+									\
+		if (trace_filter_call_bpf(ftrace_file->filter, &__ctx) == 0) \
+			return;						\
+	}								\
+									\
 	__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
 									\
 	entry = ftrace_event_buffer_reserve(&fbuffer, ftrace_file,	\
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 45da7ec7d274..959538c50117 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -118,6 +118,7 @@ enum bpf_map_type {
 enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
 	BPF_PROG_TYPE_SOCKET_FILTER,
+	BPF_PROG_TYPE_TRACING_FILTER,
 };
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
@@ -162,6 +163,13 @@ enum bpf_func_id {
 	BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
 	BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
 	BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
+	BPF_FUNC_fetch_ptr,       /* void *bpf_fetch_ptr(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u64,       /* u64 bpf_fetch_u64(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u32,       /* u32 bpf_fetch_u32(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u16,       /* u16 bpf_fetch_u16(void *unsafe_ptr) */
+	BPF_FUNC_fetch_u8,        /* u8 bpf_fetch_u8(void *unsafe_ptr) */
+	BPF_FUNC_memcmp,          /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
+	BPF_FUNC_dump_stack,      /* void bpf_dump_stack(void) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index a5da09c899dd..eb60b234b824 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -75,6 +75,7 @@ config FTRACE_NMI_ENTER
 
 config EVENT_TRACING
 	select CONTEXT_SWITCH_TRACER
+	select BPF_SYSCALL
 	bool
 
 config CONTEXT_SWITCH_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 979ccde26720..ef821d90f3f5 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_EVENT_TRACING) += trace_event_perf.o
 endif
 obj-$(CONFIG_EVENT_TRACING) += trace_events_filter.o
 obj-$(CONFIG_EVENT_TRACING) += trace_events_trigger.o
+obj-$(CONFIG_EVENT_TRACING) += bpf_trace.o
 obj-$(CONFIG_KPROBE_EVENT) += trace_kprobe.o
 obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_PM),y)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
new file mode 100644
index 000000000000..639d3c25dead
--- /dev/null
+++ b/kernel/trace/bpf_trace.c
@@ -0,0 +1,140 @@
+/* Copyright (c) 2011-2015 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/uaccess.h>
+#include <trace/bpf_trace.h>
+#include "trace.h"
+
+static u64 bpf_fetch_ptr(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) (long) r1;
+	void *ptr = NULL;
+
+	probe_kernel_read(&ptr, unsafe_ptr, sizeof(ptr));
+	return (u64) (unsigned long) ptr;
+}
+
+#define FETCH(SIZE) \
+static u64 bpf_fetch_##SIZE(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)	\
+{									\
+	void *unsafe_ptr = (void *) (long) r1;				\
+	SIZE val = 0;							\
+									\
+	probe_kernel_read(&val, unsafe_ptr, sizeof(val));		\
+	return (u64) (SIZE) val;					\
+}
+FETCH(u64)
+FETCH(u32)
+FETCH(u16)
+FETCH(u8)
+#undef FETCH
+
+static u64 bpf_memcmp(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	void *unsafe_ptr = (void *) (long) r1;
+	void *safe_ptr = (void *) (long) r2;
+	u32 size = (u32) r3;
+	char buf[64];
+	int err;
+
+	if (size < 64) {
+		err = probe_kernel_read(buf, unsafe_ptr, size);
+		if (err)
+			return err;
+		return memcmp(buf, safe_ptr, size);
+	}
+	return -1;
+}
+
+static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	trace_dump_stack(0);
+	return 0;
+}
+
+static struct bpf_func_proto tracing_filter_funcs[] = {
+#define FETCH(SIZE)				\
+	[BPF_FUNC_fetch_##SIZE] = {		\
+		.func = bpf_fetch_##SIZE,	\
+		.gpl_only = true,		\
+		.ret_type = RET_INTEGER,	\
+	},
+	FETCH(ptr)
+	FETCH(u64)
+	FETCH(u32)
+	FETCH(u16)
+	FETCH(u8)
+#undef FETCH
+	[BPF_FUNC_memcmp] = {
+		.func = bpf_memcmp,
+		.gpl_only = false,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_ANYTHING,
+		.arg2_type = ARG_PTR_TO_STACK,
+		.arg3_type = ARG_CONST_STACK_SIZE,
+	},
+	[BPF_FUNC_dump_stack] = {
+		.func = bpf_dump_stack,
+		.gpl_only = false,
+		.ret_type = RET_VOID,
+	},
+};
+
+static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	default:
+		if (func_id < 0 || func_id >= ARRAY_SIZE(tracing_filter_funcs))
+			return NULL;
+		return &tracing_filter_funcs[func_id];
+	}
+}
+
+/* check access to argN fields of 'struct bpf_context' from program */
+static bool tracing_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	/* check bounds */
+	if (off < 0 || off >= sizeof(struct bpf_context))
+		return false;
+
+	/* only read is allowed */
+	if (type != BPF_READ)
+		return false;
+
+	/* disallow misaligned access */
+	if (off % size != 0)
+		return false;
+
+	return true;
+}
+
+static struct bpf_verifier_ops tracing_filter_ops = {
+	.get_func_proto = tracing_filter_func_proto,
+	.is_valid_access = tracing_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+	.ops = &tracing_filter_ops,
+	.type = BPF_PROG_TYPE_TRACING_FILTER,
+};
+
+static int __init register_tracing_filter_ops(void)
+{
+	bpf_register_prog_type(&tl);
+	return 0;
+}
+late_initcall(register_tracing_filter_ops);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 8de48bac1ce2..d667547c6f0e 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -977,12 +977,15 @@ struct ftrace_event_field {
 	int			is_signed;
 };
 
+struct bpf_prog;
+
 struct event_filter {
 	int			n_preds;	/* Number assigned */
 	int			a_preds;	/* allocated */
 	struct filter_pred	*preds;
 	struct filter_pred	*root;
 	char			*filter_string;
+	struct bpf_prog		*prog;
 };
 
 struct event_subsystem {
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 366a78a3e61e..189cc4d697b5 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1084,6 +1084,26 @@ event_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
 	return r;
 }
 
+static int event_filter_release(struct inode *inode, struct file *filp)
+{
+	struct ftrace_event_file *file;
+	char buf[2] = "0";
+
+	mutex_lock(&event_mutex);
+	file = event_file_data(filp);
+	if (file) {
+		if (file->flags & TRACE_EVENT_FL_BPF) {
+			/* auto-disable the filter */
+			ftrace_event_enable_disable(file, 0);
+
+			/* if BPF filter was used, clear it on fd close */
+			apply_event_filter(file, buf);
+		}
+	}
+	mutex_unlock(&event_mutex);
+	return 0;
+}
+
 static ssize_t
 event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 		   loff_t *ppos)
@@ -1107,8 +1127,18 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 
 	mutex_lock(&event_mutex);
 	file = event_file_data(filp);
-	if (file)
+	if (file) {
+		/*
+		 * note to user space tools:
+		 * write() into debugfs/tracing/events/xxx/filter file
+		 * must be done with the same privilege level as open()
+		 */
 		err = apply_event_filter(file, buf);
+		if (!err && file->flags & TRACE_EVENT_FL_BPF)
+			/* once filter is applied, auto-enable it */
+			ftrace_event_enable_disable(file, 1);
+	}
+
 	mutex_unlock(&event_mutex);
 
 	free_page((unsigned long) buf);
@@ -1363,6 +1393,7 @@ static const struct file_operations ftrace_event_filter_fops = {
 	.open = tracing_open_generic,
 	.read = event_filter_read,
 	.write = event_filter_write,
+	.release = event_filter_release,
 	.llseek = default_llseek,
 };
 
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index ced69da0ff55..bb0140414238 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -23,6 +23,9 @@
 #include <linux/mutex.h>
 #include <linux/perf_event.h>
 #include <linux/slab.h>
+#include <linux/bpf.h>
+#include <trace/bpf_trace.h>
+#include <linux/filter.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -541,6 +544,18 @@ static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
 	return WALK_PRED_DEFAULT;
 }
 
+unsigned int trace_filter_call_bpf(struct event_filter *filter, void *ctx)
+{
+	unsigned int ret;
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(filter->prog, ctx);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(trace_filter_call_bpf);
+
 /* return 1 if event matches, 0 otherwise (discard) */
 int filter_match_preds(struct event_filter *filter, void *rec)
 {
@@ -795,6 +810,8 @@ static void __free_filter(struct event_filter *filter)
 	if (!filter)
 		return;
 
+	if (filter->prog)
+		bpf_prog_put(filter->prog);
 	__free_preds(filter);
 	kfree(filter->filter_string);
 	kfree(filter);
@@ -1874,6 +1891,50 @@ static int create_filter_start(char *filter_str, bool set_str,
 	return err;
 }
 
+static int create_filter_bpf(char *filter_str, struct event_filter **filterp)
+{
+	struct event_filter *filter;
+	struct bpf_prog *prog;
+	long ufd;
+	int err = 0;
+
+	*filterp = NULL;
+
+	filter = __alloc_filter();
+	if (!filter)
+		return -ENOMEM;
+
+	err = replace_filter_string(filter, filter_str);
+	if (err)
+		goto free_filter;
+
+	err = kstrtol(filter_str + 4, 0, &ufd);
+	if (err)
+		goto free_filter;
+
+	prog = bpf_prog_get(ufd);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto free_filter;
+	}
+
+	filter->prog = prog;
+
+	if (prog->aux->prog_type != BPF_PROG_TYPE_TRACING_FILTER) {
+		/* valid fd, but invalid bpf program type */
+		err = -EINVAL;
+		goto free_filter;
+	}
+
+	*filterp = filter;
+
+	return 0;
+
+free_filter:
+	__free_filter(filter);
+	return err;
+}
+
 static void create_filter_finish(struct filter_parse_state *ps)
 {
 	if (ps) {
@@ -1971,6 +2032,7 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
 		filter_disable(file);
 		filter = event_filter(file);
 
+		file->flags &= ~TRACE_EVENT_FL_BPF;
 		if (!filter)
 			return 0;
 
@@ -1983,7 +2045,19 @@ int apply_event_filter(struct ftrace_event_file *file, char *filter_string)
 		return 0;
 	}
 
-	err = create_filter(call, filter_string, true, &filter);
+	/*
+	 * 'bpf_123' string is a request to attach eBPF program with id == 123
+	 * also accept 'bpf 123', 'bpf.123', 'bpf-123' variants
+	 */
+	if (memcmp(filter_string, "bpf", 3) == 0 && filter_string[3] != 0 &&
+	    filter_string[4] != 0) {
+		err = create_filter_bpf(filter_string, &filter);
+		if (!err)
+			file->flags |= TRACE_EVENT_FL_BPF;
+	} else {
+		err = create_filter(call, filter_string, true, &filter);
+		file->flags &= ~TRACE_EVENT_FL_BPF;
+	}
 
 	/*
 	 * Always swap the call filter with the new filter
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index f97f6e3a676c..94242d2a9d76 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -7,6 +7,7 @@
 #include <linux/ftrace.h>
 #include <linux/perf_event.h>
 #include <asm/syscall.h>
+#include <trace/bpf_trace.h>
 
 #include "trace_output.h"
 #include "trace.h"
@@ -290,6 +291,20 @@ static int __init syscall_exit_define_fields(struct ftrace_event_call *call)
 	return ret;
 }
 
+static void populate_bpf_ctx(struct bpf_context *ctx, struct pt_regs *regs)
+{
+	struct task_struct *task = current;
+	unsigned long args[6];
+
+	syscall_get_arguments(task, regs, 0, 6, args);
+	ctx->arg1 = args[0];
+	ctx->arg2 = args[1];
+	ctx->arg3 = args[2];
+	ctx->arg4 = args[3];
+	ctx->arg5 = args[4];
+	ctx->arg6 = args[5];
+}
+
 static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 {
 	struct trace_array *tr = data;
@@ -319,6 +334,14 @@ static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	if (!sys_data)
 		return;
 
+	if (ftrace_file->flags & TRACE_EVENT_FL_BPF) {
+		struct bpf_context ctx;
+
+		populate_bpf_ctx(&ctx, regs);
+		if (trace_filter_call_bpf(ftrace_file->filter, &ctx) == 0)
+			return;
+	}
+
 	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
 
 	local_save_flags(irq_flags);
@@ -366,6 +389,14 @@ static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
+	if (ftrace_file->flags & TRACE_EVENT_FL_BPF) {
+		struct bpf_context ctx = {};
+
+		ctx.arg1 = syscall_get_return_value(current, regs);
+		if (trace_filter_call_bpf(ftrace_file->filter, &ctx) == 0)
+			return;
+	}
+
 	local_save_flags(irq_flags);
 	pc = preempt_count();
 
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 2/9] tracing: allow eBPF programs to call bpf_printk()
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  2015-01-16  4:16   ` [PATCH tip 1/9] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov
@ 2015-01-16  4:16   ` Alexei Starovoitov
  2015-01-16  4:16   ` [PATCH tip 3/9] tracing: allow eBPF programs to call ktime_get_ns() Alexei Starovoitov
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Debugging of eBPF programs needs some form of printk from the program,
so let programs call limited trace_printk() with %d %u %x %p modifiers only.

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 include/uapi/linux/bpf.h    |    1 +
 kernel/trace/bpf_trace.c    |   61 +++++++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_events.c |    8 ++++++
 3 files changed, 70 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 959538c50117..ef88e3f45b85 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_func_id {
 	BPF_FUNC_fetch_u8,        /* u8 bpf_fetch_u8(void *unsafe_ptr) */
 	BPF_FUNC_memcmp,          /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
 	BPF_FUNC_dump_stack,      /* void bpf_dump_stack(void) */
+	BPF_FUNC_printk,          /* int bpf_printk(const char *fmt, int fmt_size, ...) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 639d3c25dead..3825d7a3cbd1 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -60,6 +60,60 @@ static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 	return 0;
 }
 
+/* limited printk()
+ * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed
+ */
+static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
+{
+	char *fmt = (char *) (long) r1;
+	int fmt_cnt = 0;
+	bool mod_l[3] = {};
+	int i;
+
+	/* bpf_check() guarantees that fmt points to bpf program stack and
+	 * fmt_size bytes of it were initialized by bpf program
+	 */
+	if (fmt[fmt_size - 1] != 0)
+		return -EINVAL;
+
+	/* check format string for allowed specifiers */
+	for (i = 0; i < fmt_size; i++)
+		if (fmt[i] == '%') {
+			if (fmt_cnt >= 3)
+				return -EINVAL;
+			i++;
+			if (i >= fmt_size)
+				return -EINVAL;
+
+			if (fmt[i] == 'l') {
+				mod_l[fmt_cnt] = true;
+				i++;
+				if (i >= fmt_size)
+					return -EINVAL;
+			} else if (fmt[i] == 'p') {
+				mod_l[fmt_cnt] = true;
+				fmt_cnt++;
+				continue;
+			}
+
+			if (fmt[i] == 'l') {
+				mod_l[fmt_cnt] = true;
+				i++;
+				if (i >= fmt_size)
+					return -EINVAL;
+			}
+
+			if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
+				return -EINVAL;
+			fmt_cnt++;
+		}
+
+	return __trace_printk((unsigned long) __builtin_return_address(3), fmt,
+			      mod_l[0] ? r3 : (u32) r3,
+			      mod_l[1] ? r4 : (u32) r4,
+			      mod_l[2] ? r5 : (u32) r5);
+}
+
 static struct bpf_func_proto tracing_filter_funcs[] = {
 #define FETCH(SIZE)				\
 	[BPF_FUNC_fetch_##SIZE] = {		\
@@ -86,6 +140,13 @@ static struct bpf_func_proto tracing_filter_funcs[] = {
 		.gpl_only = false,
 		.ret_type = RET_VOID,
 	},
+	[BPF_FUNC_printk] = {
+		.func = bpf_printk,
+		.gpl_only = true,
+		.ret_type = RET_INTEGER,
+		.arg1_type = ARG_PTR_TO_STACK,
+		.arg2_type = ARG_CONST_STACK_SIZE,
+	},
 };
 
 static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 189cc4d697b5..282ea5822480 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -1141,6 +1141,14 @@ event_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
 
 	mutex_unlock(&event_mutex);
 
+	if (file && file->flags & TRACE_EVENT_FL_BPF) {
+		/*
+		 * allocate per-cpu printk buffers, since programs
+		 * might be calling bpf_printk
+		 */
+		trace_printk_init_buffers();
+	}
+
 	free_page((unsigned long) buf);
 	if (err < 0)
 		return err;
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 3/9] tracing: allow eBPF programs to call ktime_get_ns()
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
  2015-01-16  4:16   ` [PATCH tip 1/9] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov
  2015-01-16  4:16   ` [PATCH tip 2/9] tracing: allow eBPF programs to call bpf_printk() Alexei Starovoitov
@ 2015-01-16  4:16   ` Alexei Starovoitov
  2015-01-16  4:16   ` [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler Alexei Starovoitov
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

bpf_ktime_get_ns() is used by programs to compue time delta between events
or as a timestamp

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 include/uapi/linux/bpf.h |    1 +
 kernel/trace/bpf_trace.c |   10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef88e3f45b85..6075c4f4b67e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -171,6 +171,7 @@ enum bpf_func_id {
 	BPF_FUNC_memcmp,          /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */
 	BPF_FUNC_dump_stack,      /* void bpf_dump_stack(void) */
 	BPF_FUNC_printk,          /* int bpf_printk(const char *fmt, int fmt_size, ...) */
+	BPF_FUNC_ktime_get_ns,    /* u64 bpf_ktime_get_ns(void) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3825d7a3cbd1..14cfbbcec32e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -114,6 +114,11 @@ static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5)
 			      mod_l[2] ? r5 : (u32) r5);
 }
 
+static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	return ktime_get_ns();
+}
+
 static struct bpf_func_proto tracing_filter_funcs[] = {
 #define FETCH(SIZE)				\
 	[BPF_FUNC_fetch_##SIZE] = {		\
@@ -147,6 +152,11 @@ static struct bpf_func_proto tracing_filter_funcs[] = {
 		.arg1_type = ARG_PTR_TO_STACK,
 		.arg2_type = ARG_CONST_STACK_SIZE,
 	},
+	[BPF_FUNC_ktime_get_ns] = {
+		.func = bpf_ktime_get_ns,
+		.gpl_only = true,
+		.ret_type = RET_INTEGER,
+	},
 };
 
 static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id)
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-01-16  4:16   ` [PATCH tip 3/9] tracing: allow eBPF programs to call ktime_get_ns() Alexei Starovoitov
@ 2015-01-16  4:16   ` Alexei Starovoitov
  2015-01-20 11:57     ` Masami Hiramatsu
  2015-01-16 15:02   ` [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Steven Rostedt
  2015-01-19  9:52   ` Masami Hiramatsu
  5 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16  4:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

simple packet drop monitor:
- in-kernel eBPF program attaches to kfree_skb() event and records number
  of packet drops at given location
- userspace iterates over the map every second and prints stats

Usage:
$ sudo dropmon
location 0xffffffff81695995 count 1
location 0xffffffff816d0da9 count 2

location 0xffffffff81695995 count 2
location 0xffffffff816d0da9 count 2

location 0xffffffff81695995 count 3
location 0xffffffff816d0da9 count 2

$ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 samples/bpf/Makefile  |    2 +
 samples/bpf/dropmon.c |  129 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 131 insertions(+)
 create mode 100644 samples/bpf/dropmon.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b5b3600dcdf5..789691374562 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -6,7 +6,9 @@ hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
 hostprogs-y += sockex1
 hostprogs-y += sockex2
+hostprogs-y += dropmon
 
+dropmon-objs := dropmon.o libbpf.o
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
new file mode 100644
index 000000000000..9a2cd3344d69
--- /dev/null
+++ b/samples/bpf/dropmon.c
@@ -0,0 +1,129 @@
+/* simple packet drop monitor:
+ * - in-kernel eBPF program attaches to kfree_skb() event and records number
+ *   of packet drops at given location
+ * - userspace iterates over the map every second and prints stats
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <linux/unistd.h>
+#include <string.h>
+#include <linux/filter.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include "libbpf.h"
+
+#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
+
+static int write_to_file(const char *file, const char *str, bool keep_open)
+{
+	int fd, err;
+
+	fd = open(file, O_WRONLY);
+	err = write(fd, str, strlen(str));
+	(void) err;
+
+	if (keep_open) {
+		return fd;
+	} else {
+		close(fd);
+		return -1;
+	}
+}
+
+static int dropmon(void)
+{
+	long long key, next_key, value = 0;
+	int prog_fd, map_fd, i;
+	char fmt[32];
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
+	if (map_fd < 0) {
+		printf("failed to create map '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	/* the following eBPF program is equivalent to C:
+	 * int filter(struct bpf_context *ctx)
+	 * {
+	 *   long loc = ctx->arg2;
+	 *   long init_val = 1;
+	 *   long *value;
+	 *
+	 *   value = bpf_map_lookup_elem(MAP_ID, &loc);
+	 *   if (value) {
+	 *      __sync_fetch_and_add(value, 1);
+	 *   } else {
+	 *      bpf_map_update_elem(MAP_ID, &loc, &init_val, BPF_ANY);
+	 *   }
+	 *   return 0;
+	 * }
+	 */
+	struct bpf_insn prog[] = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
+		BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+		BPF_EXIT_INSN(),
+		BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
+		BPF_MOV64_IMM(BPF_REG_4, BPF_ANY),
+		BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
+		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+		BPF_EXIT_INSN(),
+	};
+
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
+				sizeof(prog), "GPL");
+	if (prog_fd < 0) {
+		printf("failed to load prog '%s'\n%s", strerror(errno), bpf_log_buf);
+		return -1;
+	}
+
+	sprintf(fmt, "bpf_%d", prog_fd);
+
+	write_to_file(TRACEPOINT "filter", fmt, true);
+
+	for (i = 0; i < 10; i++) {
+		key = 0;
+		while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+			bpf_lookup_elem(map_fd, &next_key, &value);
+			printf("location 0x%llx count %lld\n", next_key, value);
+			key = next_key;
+		}
+		if (key)
+			printf("\n");
+		sleep(1);
+	}
+
+cleanup:
+	/* maps, programs, tracepoint filters will auto cleanup on process exit */
+
+	return 0;
+}
+
+int main(void)
+{
+	FILE *f;
+
+	/* start ping in the background to get some kfree_skb events */
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	dropmon();
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler
  2015-01-16  4:16   ` [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler Alexei Starovoitov
@ 2015-01-20 11:57     ` Masami Hiramatsu
  0 siblings, 0 replies; 20+ messages in thread
From: Masami Hiramatsu @ 2015-01-20 11:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, linux-api,
	netdev, linux-kernel, yrl.pp-manager.tt@hitachi.com

(2015/01/16 13:16), Alexei Starovoitov wrote:
> simple packet drop monitor:
> - in-kernel eBPF program attaches to kfree_skb() event and records number
>   of packet drops at given location
> - userspace iterates over the map every second and prints stats

Hmm, this eBPF assembly macros are very interesting. Maybe I can
replace current kprobe's argument "fetching methods" with this.

Thank you,

> 
> Usage:
> $ sudo dropmon
> location 0xffffffff81695995 count 1
> location 0xffffffff816d0da9 count 2
> 
> location 0xffffffff81695995 count 2
> location 0xffffffff816d0da9 count 2
> 
> location 0xffffffff81695995 count 3
> location 0xffffffff816d0da9 count 2
> 
> $ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
> 0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
> 0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231
> 
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  samples/bpf/Makefile  |    2 +
>  samples/bpf/dropmon.c |  129 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 131 insertions(+)
>  create mode 100644 samples/bpf/dropmon.c
> 
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index b5b3600dcdf5..789691374562 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -6,7 +6,9 @@ hostprogs-y := test_verifier test_maps
>  hostprogs-y += sock_example
>  hostprogs-y += sockex1
>  hostprogs-y += sockex2
> +hostprogs-y += dropmon
>  
> +dropmon-objs := dropmon.o libbpf.o
>  test_verifier-objs := test_verifier.o libbpf.o
>  test_maps-objs := test_maps.o libbpf.o
>  sock_example-objs := sock_example.o libbpf.o
> diff --git a/samples/bpf/dropmon.c b/samples/bpf/dropmon.c
> new file mode 100644
> index 000000000000..9a2cd3344d69
> --- /dev/null
> +++ b/samples/bpf/dropmon.c
> @@ -0,0 +1,129 @@
> +/* simple packet drop monitor:
> + * - in-kernel eBPF program attaches to kfree_skb() event and records number
> + *   of packet drops at given location
> + * - userspace iterates over the map every second and prints stats
> + */
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <linux/bpf.h>
> +#include <errno.h>
> +#include <linux/unistd.h>
> +#include <string.h>
> +#include <linux/filter.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <stdbool.h>
> +#include "libbpf.h"
> +
> +#define TRACEPOINT "/sys/kernel/debug/tracing/events/skb/kfree_skb/"
> +
> +static int write_to_file(const char *file, const char *str, bool keep_open)
> +{
> +	int fd, err;
> +
> +	fd = open(file, O_WRONLY);
> +	err = write(fd, str, strlen(str));
> +	(void) err;
> +
> +	if (keep_open) {
> +		return fd;
> +	} else {
> +		close(fd);
> +		return -1;
> +	}
> +}
> +
> +static int dropmon(void)
> +{
> +	long long key, next_key, value = 0;
> +	int prog_fd, map_fd, i;
> +	char fmt[32];
> +
> +	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
> +	if (map_fd < 0) {
> +		printf("failed to create map '%s'\n", strerror(errno));
> +		goto cleanup;
> +	}
> +
> +	/* the following eBPF program is equivalent to C:
> +	 * int filter(struct bpf_context *ctx)
> +	 * {
> +	 *   long loc = ctx->arg2;
> +	 *   long init_val = 1;
> +	 *   long *value;
> +	 *
> +	 *   value = bpf_map_lookup_elem(MAP_ID, &loc);
> +	 *   if (value) {
> +	 *      __sync_fetch_and_add(value, 1);
> +	 *   } else {
> +	 *      bpf_map_update_elem(MAP_ID, &loc, &init_val, BPF_ANY);
> +	 *   }
> +	 *   return 0;
> +	 * }
> +	 */
> +	struct bpf_insn prog[] = {
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1, 8), /* r2 = *(u64 *)(r1 + 8) */
> +		BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_2, -8), /* *(u64 *)(fp - 8) = r2 */
> +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
> +		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
> +		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4),
> +		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
> +		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
> +		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +		BPF_EXIT_INSN(),
> +		BPF_ST_MEM(BPF_DW, BPF_REG_10, -16, 1), /* *(u64 *)(fp - 16) = 1 */
> +		BPF_MOV64_IMM(BPF_REG_4, BPF_ANY),
> +		BPF_MOV64_REG(BPF_REG_3, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -16), /* r3 = fp - 16 */
> +		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), /* r2 = fp - 8 */
> +		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
> +		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
> +		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
> +		BPF_EXIT_INSN(),
> +	};
> +
> +	prog_fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog,
> +				sizeof(prog), "GPL");
> +	if (prog_fd < 0) {
> +		printf("failed to load prog '%s'\n%s", strerror(errno), bpf_log_buf);
> +		return -1;
> +	}
> +
> +	sprintf(fmt, "bpf_%d", prog_fd);
> +
> +	write_to_file(TRACEPOINT "filter", fmt, true);
> +
> +	for (i = 0; i < 10; i++) {
> +		key = 0;
> +		while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
> +			bpf_lookup_elem(map_fd, &next_key, &value);
> +			printf("location 0x%llx count %lld\n", next_key, value);
> +			key = next_key;
> +		}
> +		if (key)
> +			printf("\n");
> +		sleep(1);
> +	}
> +
> +cleanup:
> +	/* maps, programs, tracepoint filters will auto cleanup on process exit */
> +
> +	return 0;
> +}
> +
> +int main(void)
> +{
> +	FILE *f;
> +
> +	/* start ping in the background to get some kfree_skb events */
> +	f = popen("ping -c5 localhost", "r");
> +	(void) f;
> +
> +	dropmon();
> +	return 0;
> +}
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-01-16  4:16   ` [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler Alexei Starovoitov
@ 2015-01-16 15:02   ` Steven Rostedt
  2015-01-19  9:52   ` Masami Hiramatsu
  5 siblings, 0 replies; 20+ messages in thread
From: Steven Rostedt @ 2015-01-16 15:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Thu, 15 Jan 2015 20:16:01 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:

> Hi Ingo, Steven,
> 
> This patch set is based on tip/master.

Note, the tracing code isn't maintained in tip/master, but perf code is.

Using the latest 3.19-rc is probably sufficient for now.

Do you have a git repo somewhere that I can look at? It makes it easier
than loading in 9 patches ;-)

> It adds ability to attach eBPF programs to tracepoints, syscalls and kprobes.
> 
> Mechanism of attaching:
> - load program via bpf() syscall and receive program_fd
> - event_fd = open("/sys/kernel/debug/tracing/events/.../filter")
> - write 'bpf-123' to event_fd where 123 is program_fd
> - program will be attached to particular event and event automatically enabled
> - close(event_fd) will detach bpf program from event and event disabled
> 
> Program attach point and input arguments:
> - programs attached to kprobes receive 'struct pt_regs *' as an input.
>   See tracex4_kern.c that demonstrates how users can write a C program like:
>   SEC("events/kprobes/sys_write")
>   int bpf_prog4(struct pt_regs *regs)
>   {
>      long write_size = regs->dx; 
>      // here user need to know the proto of sys_write() from kernel
>      // sources and x64 calling convention to know that register $rdx
>      // contains 3rd argument to sys_write() which is 'size_t count'
> 
>   it's obviously architecture dependent, but allows building sophisticated
>   user tools on top, that can see from debug info of vmlinux which variables
>   are in which registers or stack locations and fetch it from there.
>   'perf probe' can potentialy use this hook to generate programs in user space
>   and insert them instead of letting kernel parse string during kprobe creation.
> 
> - programs attached to tracepoints and syscalls receive 'struct bpf_context *':
>   u64 arg1, arg2, ..., arg6;
>   for syscalls they match syscall arguments.
>   for tracepoints these args match arguments passed to tracepoint.
>   For example:
>   trace_sched_migrate_task(p, new_cpu); from sched/core.c
>   arg1 <- p        which is 'struct task_struct *'
>   arg2 <- new_cpu  which is 'unsigned int'
>   arg3..arg6 = 0
>   the program can use bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct'
>   or any other kernel data structures.
>   These helpers are using probe_kernel_read() similar to 'perf probe' which is
>   not 100% safe in both cases, but good enough.
>   To access task_struct's pid inside 'sched_migrate_task' tracepoint
>   the program can do:
>   struct task_struct *task = (struct task_struct *)ctx->arg1;
>   u32 pid = bpf_fetch_u32(&task->pid);
>   Since struct layout is kernel configuration specific such programs are not
>   portable and require access to kernel headers to be compiled,
>   but in this case we don't need debug info.
>   llvm with bpf backend will statically compute task->pid offset as a constant
>   based on kernel headers only.
>   The example of this arbitrary pointer walking is tracex1_kern.c
>   which does skb->dev->name == "lo" filtering.
> 
> In all cases the programs are called before trace buffer is allocated to
> minimize the overhead, since we want to filter huge number of events, but
> buffer alloc/free and argument copy for every event is too costly.

For syscalls this is fine as the parameters are usually set. But
there's a lot of tracepoints that we need to know the result of the
copied data to decide to filter or not, where the result happens at the
TP_fast_assign() part which requires allocating the buffers.

Maybe we should have a way to do the program before and/or after the
buffering depending on what to filter on. There's no way to know what
the parameters of the tracepoint are without looking at the source.



> Theoretically we can invoke programs after buffer is allocated, but it
> doesn't seem needed, since above approach is faster and achieves the same.

Again, for syscalls it may not be a problem, but for other tracepoints,
I'm not sure we can do that. How do you handle sched_switch for
example? The tracepoint only gets two pointers to task structs, you
need to then dereference them to get the pid, prio, state and other
data.

> 
> Note, tracepoint/syscall and kprobe programs are two different types:
> BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER,
> since they expect different input.
> Both use the same set of helper functions:
> - map access (lookup/update/delete)
> - fetch (probe_kernel_read wrappers)
> - memcmp (probe_kernel_read + memcmp)
> - dump_stack
> - trace_printk
> The last two are mainly to debug the programs and to print data for user
> space consumptions.

I have to look at the code, but currently trace_printk() isn't made to
be used in production systems.

> 
> Portability:
> - kprobe programs are architecture dependent and need user scripting
>   language like ktap/stap/dtrace/perf that will dynamically generate
>   them based on debug info in vmlinux
> - tracepoint programs are architecture independent, but if arbitrary pointer
>   walking (with fetch() helpers) is used, they need data struct layout to match.
>   Debug info is not necessary

If the program runs after the buffers are allocated, it could still be
architecture independent because ftrace gives the information on how to
retrieve the fields.

One last thing. If the ebpf is used for anything but filtering, it
should go into the trigger file. The filtering is only a way to say if
the event should be recorded or not. But the trigger could do something
else (a printk, a stacktrace, etc).

-- Steve


> - for networking use case we need to access 'struct sk_buff' fields in portable
>   way (user space needs to fetch packet length without knowing skb->len offset),
>   so for some frequently used data structures we will add helper functions
>   or pseudo instructions to access them. I've hacked few ways specifically
>   for skb, but abandoned them in favor of more generic type/field infra.
>   That work is still wip. Not part of this set.
>   Once it's ready tracepoint programs that access common data structs
>   will be kernel independent.
> 
> Program return value:
> - programs return 0 to discard an event
> - and return non-zero to proceed with event (allocate trace buffer, copy
>   arguments there and print it eventually in trace_pipe in traditional way)
> 
> Examples:
> - dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
>   to dropmon tool
> - tracex1_kern.c - does net/netif_receive_skb event filtering
>   for dev->skb->name == "lo" condition
> - tracex2_kern.c - same kfree_skb() accounting like dropmon, but now in C
>   plus computes histogram of all write sizes from sys_write syscall
>   and prints the histogram in userspace
> - tracex3_kern.c - most sophisticated example that computes IO latency
>   between block/block_rq_issue and block/block_rq_complete events
>   and prints 'heatmap' using gray shades of text terminal.
>   Useful to analyze disk performance.
> - tracex4_kern.c - computes histogram of write sizes from sys_write syscall
>   using kprobe mechanism instead of syscall. Since kprobe is optimized into
>   ftrace the overhead of instrumentation is smaller than in example 2.
> 
> The user space tools like ktap/dtrace/systemptap/perf that has access
> to debug info would probably want to use kprobe attachment point, since kprobe
> can be inserted anywhere and all registers are avaiable in the program.
> tracepoint attachments are useful without debug info, so standalone tools
> like iosnoop will use them.
> 
> The main difference vs existing perf_probe/ftrace infra is in kernel aggregation
> and conditional walking of arbitrary data structures.
> 
> Thanks!
> 
> Alexei Starovoitov (9):
>   tracing: attach eBPF programs to tracepoints and syscalls
>   tracing: allow eBPF programs to call bpf_printk()
>   tracing: allow eBPF programs to call ktime_get_ns()
>   samples: bpf: simple tracing example in eBPF assembler
>   samples: bpf: simple tracing example in C
>   samples: bpf: counting example for kfree_skb tracepoint and write
>     syscall
>   samples: bpf: IO latency analysis (iosnoop/heatmap)
>   tracing: attach eBPF programs to kprobe/kretprobe
>   samples: bpf: simple kprobe example
> 
>  include/linux/ftrace_event.h       |    6 +
>  include/trace/bpf_trace.h          |   25 ++++
>  include/trace/ftrace.h             |   30 +++++
>  include/uapi/linux/bpf.h           |   11 ++
>  kernel/trace/Kconfig               |    1 +
>  kernel/trace/Makefile              |    1 +
>  kernel/trace/bpf_trace.c           |  250 ++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.h               |    3 +
>  kernel/trace/trace_events.c        |   41 +++++-
>  kernel/trace/trace_events_filter.c |   80 +++++++++++-
>  kernel/trace/trace_kprobe.c        |   11 +-
>  kernel/trace/trace_syscalls.c      |   31 +++++
>  samples/bpf/Makefile               |   18 +++
>  samples/bpf/bpf_helpers.h          |   18 +++
>  samples/bpf/bpf_load.c             |   62 ++++++++-
>  samples/bpf/bpf_load.h             |    3 +
>  samples/bpf/dropmon.c              |  129 +++++++++++++++++++
>  samples/bpf/tracex1_kern.c         |   28 ++++
>  samples/bpf/tracex1_user.c         |   24 ++++
>  samples/bpf/tracex2_kern.c         |   71 ++++++++++
>  samples/bpf/tracex2_user.c         |   95 ++++++++++++++
>  samples/bpf/tracex3_kern.c         |   96 ++++++++++++++
>  samples/bpf/tracex3_user.c         |  146 +++++++++++++++++++++
>  samples/bpf/tracex4_kern.c         |   36 ++++++
>  samples/bpf/tracex4_user.c         |   83 ++++++++++++
>  25 files changed, 1290 insertions(+), 9 deletions(-)
>  create mode 100644 include/trace/bpf_trace.h
>  create mode 100644 kernel/trace/bpf_trace.c
>  create mode 100644 samples/bpf/dropmon.c
>  create mode 100644 samples/bpf/tracex1_kern.c
>  create mode 100644 samples/bpf/tracex1_user.c
>  create mode 100644 samples/bpf/tracex2_kern.c
>  create mode 100644 samples/bpf/tracex2_user.c
>  create mode 100644 samples/bpf/tracex3_kern.c
>  create mode 100644 samples/bpf/tracex3_user.c
>  create mode 100644 samples/bpf/tracex4_kern.c
>  create mode 100644 samples/bpf/tracex4_user.c
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
       [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-01-16 15:02   ` [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Steven Rostedt
@ 2015-01-19  9:52   ` Masami Hiramatsu
  2015-01-19 20:48     ` Alexei Starovoitov
  5 siblings, 1 reply; 20+ messages in thread
From: Masami Hiramatsu @ 2015-01-19  9:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

(2015/01/16 13:16), Alexei Starovoitov wrote:
> Hi Ingo, Steven,
> 
> This patch set is based on tip/master.
> It adds ability to attach eBPF programs to tracepoints, syscalls and kprobes.
> 
> Mechanism of attaching:
> - load program via bpf() syscall and receive program_fd
> - event_fd = open("/sys/kernel/debug/tracing/events/.../filter")
> - write 'bpf-123' to event_fd where 123 is program_fd
> - program will be attached to particular event and event automatically enabled
> - close(event_fd) will detach bpf program from event and event disabled
> 
> Program attach point and input arguments:
> - programs attached to kprobes receive 'struct pt_regs *' as an input.
>   See tracex4_kern.c that demonstrates how users can write a C program like:
>   SEC("events/kprobes/sys_write")
>   int bpf_prog4(struct pt_regs *regs)
>   {
>      long write_size = regs->dx; 
>      // here user need to know the proto of sys_write() from kernel
>      // sources and x64 calling convention to know that register $rdx
>      // contains 3rd argument to sys_write() which is 'size_t count'
> 
>   it's obviously architecture dependent, but allows building sophisticated
>   user tools on top, that can see from debug info of vmlinux which variables
>   are in which registers or stack locations and fetch it from there.
>   'perf probe' can potentialy use this hook to generate programs in user space
>   and insert them instead of letting kernel parse string during kprobe creation.

Actually, this program just shows raw pt_regs for handlers, but I guess it is also
possible to pass event arguments from perf probe which given by user and perf-probe.
If we can write the script as

int bpf_prog4(s64 write_size)
{
   ...
}

This will be much easier to play with.

> - programs attached to tracepoints and syscalls receive 'struct bpf_context *':
>   u64 arg1, arg2, ..., arg6;
>   for syscalls they match syscall arguments.
>   for tracepoints these args match arguments passed to tracepoint.
>   For example:
>   trace_sched_migrate_task(p, new_cpu); from sched/core.c
>   arg1 <- p        which is 'struct task_struct *'
>   arg2 <- new_cpu  which is 'unsigned int'
>   arg3..arg6 = 0
>   the program can use bpf_fetch_u8/16/32/64/ptr() helpers to walk 'task_struct'
>   or any other kernel data structures.
>   These helpers are using probe_kernel_read() similar to 'perf probe' which is
>   not 100% safe in both cases, but good enough.
>   To access task_struct's pid inside 'sched_migrate_task' tracepoint
>   the program can do:
>   struct task_struct *task = (struct task_struct *)ctx->arg1;
>   u32 pid = bpf_fetch_u32(&task->pid);
>   Since struct layout is kernel configuration specific such programs are not
>   portable and require access to kernel headers to be compiled,
>   but in this case we don't need debug info.
>   llvm with bpf backend will statically compute task->pid offset as a constant
>   based on kernel headers only.
>   The example of this arbitrary pointer walking is tracex1_kern.c
>   which does skb->dev->name == "lo" filtering.

At least I would like to see this way on kprobes event too, since it should be
treated as a traceevent.

> In all cases the programs are called before trace buffer is allocated to
> minimize the overhead, since we want to filter huge number of events, but
> buffer alloc/free and argument copy for every event is too costly.
> Theoretically we can invoke programs after buffer is allocated, but it
> doesn't seem needed, since above approach is faster and achieves the same.
> 
> Note, tracepoint/syscall and kprobe programs are two different types:
> BPF_PROG_TYPE_TRACING_FILTER and BPF_PROG_TYPE_KPROBE_FILTER,
> since they expect different input.
> Both use the same set of helper functions:
> - map access (lookup/update/delete)
> - fetch (probe_kernel_read wrappers)
> - memcmp (probe_kernel_read + memcmp)
> - dump_stack
> - trace_printk
> The last two are mainly to debug the programs and to print data for user
> space consumptions.
> 
> Portability:
> - kprobe programs are architecture dependent and need user scripting
>   language like ktap/stap/dtrace/perf that will dynamically generate
>   them based on debug info in vmlinux

If we can use kprobe event as a normal traceevent, user scripting can be
architecture independent too. Only perf-probe fills the gap. All other
userspace tools can collaborate with perf-probe to setup the events.
If so, we can avoid redundant works on debuginfo. That is my point.

Thank you,

> - tracepoint programs are architecture independent, but if arbitrary pointer
>   walking (with fetch() helpers) is used, they need data struct layout to match.
>   Debug info is not necessary
> - for networking use case we need to access 'struct sk_buff' fields in portable
>   way (user space needs to fetch packet length without knowing skb->len offset),
>   so for some frequently used data structures we will add helper functions
>   or pseudo instructions to access them. I've hacked few ways specifically
>   for skb, but abandoned them in favor of more generic type/field infra.
>   That work is still wip. Not part of this set.
>   Once it's ready tracepoint programs that access common data structs
>   will be kernel independent.
> 
> Program return value:
> - programs return 0 to discard an event
> - and return non-zero to proceed with event (allocate trace buffer, copy
>   arguments there and print it eventually in trace_pipe in traditional way)
> 
> Examples:
> - dropmon.c - simple kfree_skb() accounting in eBPF assembler, similar
>   to dropmon tool
> - tracex1_kern.c - does net/netif_receive_skb event filtering
>   for dev->skb->name == "lo" condition
> - tracex2_kern.c - same kfree_skb() accounting like dropmon, but now in C
>   plus computes histogram of all write sizes from sys_write syscall
>   and prints the histogram in userspace
> - tracex3_kern.c - most sophisticated example that computes IO latency
>   between block/block_rq_issue and block/block_rq_complete events
>   and prints 'heatmap' using gray shades of text terminal.
>   Useful to analyze disk performance.
> - tracex4_kern.c - computes histogram of write sizes from sys_write syscall
>   using kprobe mechanism instead of syscall. Since kprobe is optimized into
>   ftrace the overhead of instrumentation is smaller than in example 2.
> 
> The user space tools like ktap/dtrace/systemptap/perf that has access
> to debug info would probably want to use kprobe attachment point, since kprobe
> can be inserted anywhere and all registers are avaiable in the program.
> tracepoint attachments are useful without debug info, so standalone tools
> like iosnoop will use them.
> 
> The main difference vs existing perf_probe/ftrace infra is in kernel aggregation
> and conditional walking of arbitrary data structures.
> 
> Thanks!
> 
> Alexei Starovoitov (9):
>   tracing: attach eBPF programs to tracepoints and syscalls
>   tracing: allow eBPF programs to call bpf_printk()
>   tracing: allow eBPF programs to call ktime_get_ns()
>   samples: bpf: simple tracing example in eBPF assembler
>   samples: bpf: simple tracing example in C
>   samples: bpf: counting example for kfree_skb tracepoint and write
>     syscall
>   samples: bpf: IO latency analysis (iosnoop/heatmap)
>   tracing: attach eBPF programs to kprobe/kretprobe
>   samples: bpf: simple kprobe example
> 
>  include/linux/ftrace_event.h       |    6 +
>  include/trace/bpf_trace.h          |   25 ++++
>  include/trace/ftrace.h             |   30 +++++
>  include/uapi/linux/bpf.h           |   11 ++
>  kernel/trace/Kconfig               |    1 +
>  kernel/trace/Makefile              |    1 +
>  kernel/trace/bpf_trace.c           |  250 ++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.h               |    3 +
>  kernel/trace/trace_events.c        |   41 +++++-
>  kernel/trace/trace_events_filter.c |   80 +++++++++++-
>  kernel/trace/trace_kprobe.c        |   11 +-
>  kernel/trace/trace_syscalls.c      |   31 +++++
>  samples/bpf/Makefile               |   18 +++
>  samples/bpf/bpf_helpers.h          |   18 +++
>  samples/bpf/bpf_load.c             |   62 ++++++++-
>  samples/bpf/bpf_load.h             |    3 +
>  samples/bpf/dropmon.c              |  129 +++++++++++++++++++
>  samples/bpf/tracex1_kern.c         |   28 ++++
>  samples/bpf/tracex1_user.c         |   24 ++++
>  samples/bpf/tracex2_kern.c         |   71 ++++++++++
>  samples/bpf/tracex2_user.c         |   95 ++++++++++++++
>  samples/bpf/tracex3_kern.c         |   96 ++++++++++++++
>  samples/bpf/tracex3_user.c         |  146 +++++++++++++++++++++
>  samples/bpf/tracex4_kern.c         |   36 ++++++
>  samples/bpf/tracex4_user.c         |   83 ++++++++++++
>  25 files changed, 1290 insertions(+), 9 deletions(-)
>  create mode 100644 include/trace/bpf_trace.h
>  create mode 100644 kernel/trace/bpf_trace.c
>  create mode 100644 samples/bpf/dropmon.c
>  create mode 100644 samples/bpf/tracex1_kern.c
>  create mode 100644 samples/bpf/tracex1_user.c
>  create mode 100644 samples/bpf/tracex2_kern.c
>  create mode 100644 samples/bpf/tracex2_user.c
>  create mode 100644 samples/bpf/tracex3_kern.c
>  create mode 100644 samples/bpf/tracex3_user.c
>  create mode 100644 samples/bpf/tracex4_kern.c
>  create mode 100644 samples/bpf/tracex4_user.c
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  2015-01-19  9:52   ` Masami Hiramatsu
@ 2015-01-19 20:48     ` Alexei Starovoitov
  2015-01-20  2:58       ` Masami Hiramatsu
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-19 20:48 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML

On Mon, Jan 19, 2015 at 1:52 AM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> If we can write the script as
>
> int bpf_prog4(s64 write_size)
> {
>    ...
> }
>
> This will be much easier to play with.

yes. that's the intent for user space to do.

>>   The example of this arbitrary pointer walking is tracex1_kern.c
>>   which does skb->dev->name == "lo" filtering.
>
> At least I would like to see this way on kprobes event too, since it should be
> treated as a traceevent.

it's done already... one can do the same skb->dev->name logic
in kprobe attached program... so from bpf program point of view,
tracepoints and kprobes feature-wise are exactly the same.
Only input is different.

>> - kprobe programs are architecture dependent and need user scripting
>>   language like ktap/stap/dtrace/perf that will dynamically generate
>>   them based on debug info in vmlinux
>
> If we can use kprobe event as a normal traceevent, user scripting can be
> architecture independent too. Only perf-probe fills the gap. All other
> userspace tools can collaborate with perf-probe to setup the events.
> If so, we can avoid redundant works on debuginfo. That is my point.

yes. perf already has infra to read debug info and it can be extended
to understand C like script as:
int kprobe:sys_write(int fd, char *buf, size_t count)
{
   // do stuff with 'count'
}
perf can be made to parse this text, recognize that it wants
to create kprobe on 'sys_write' function. Then based on
debuginfo figure out where 'count' is (either register or stack)
and generate corresponding bpf program either
using llvm/gcc backends or directly.
perf facility of extracting debug info can be made into
library too and used by ktap/dtrace tools for their
languages.
User space can innovate in many directions.
and, yes, once we have a scripting language whether
it's C like with perf or else, this language hides architecture
depend things from users.
Such scripting language will also hide the kernel
side differences between tracepoint and kprobe.
Just look how ktap scripts look alike for kprobes and tracepoints.
Whether ktap syntax becomes part of perf or perf invents
its own language, it's going to be good for users regardless.
The C examples here are just examples. Something
users can play with already until more user friendly
tools are being worked on.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  2015-01-19 20:48     ` Alexei Starovoitov
@ 2015-01-20  2:58       ` Masami Hiramatsu
  0 siblings, 0 replies; 20+ messages in thread
From: Masami Hiramatsu @ 2015-01-20  2:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ingo Molnar, Steven Rostedt, Namhyung Kim,
	Arnaldo Carvalho de Melo, Jiri Olsa, David S. Miller,
	Daniel Borkmann, Hannes Frederic Sowa, Brendan Gregg, Linux API,
	Network Development, LKML, zhangwei(Jovi),
	yrl.pp-manager.tt@hitachi.com

(2015/01/20 5:48), Alexei Starovoitov wrote:
> On Mon, Jan 19, 2015 at 1:52 AM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>> If we can write the script as
>>
>> int bpf_prog4(s64 write_size)
>> {
>>    ...
>> }
>>
>> This will be much easier to play with.
> 
> yes. that's the intent for user space to do.
> 
>>>   The example of this arbitrary pointer walking is tracex1_kern.c
>>>   which does skb->dev->name == "lo" filtering.
>>
>> At least I would like to see this way on kprobes event too, since it should be
>> treated as a traceevent.
> 
> it's done already... one can do the same skb->dev->name logic
> in kprobe attached program... so from bpf program point of view,
> tracepoints and kprobes feature-wise are exactly the same.
> Only input is different.

No, I meant that the input should also be same, at least for the first step.
I guess it is easy to hook the ring buffer committing and fetch arguments
from the event entry.

>>> - kprobe programs are architecture dependent and need user scripting
>>>   language like ktap/stap/dtrace/perf that will dynamically generate
>>>   them based on debug info in vmlinux
>>
>> If we can use kprobe event as a normal traceevent, user scripting can be
>> architecture independent too. Only perf-probe fills the gap. All other
>> userspace tools can collaborate with perf-probe to setup the events.
>> If so, we can avoid redundant works on debuginfo. That is my point.
> 
> yes. perf already has infra to read debug info and it can be extended
> to understand C like script as:
> int kprobe:sys_write(int fd, char *buf, size_t count)
> {
>    // do stuff with 'count'
> }
> perf can be made to parse this text, recognize that it wants
> to create kprobe on 'sys_write' function. Then based on
> debuginfo figure out where 'count' is (either register or stack)
> and generate corresponding bpf program either
> using llvm/gcc backends or directly.

And what I expected scenario was

1. setup kprobe traceevent with fd, buf, count by using perf-probe.
2. load bpf module
3. the module processes given event arguments.

> perf facility of extracting debug info can be made into
> library too and used by ktap/dtrace tools for their
> languages.
> User space can innovate in many directions.
> and, yes, once we have a scripting language whether
> it's C like with perf or else, this language hides architecture
> depend things from users.
> Such scripting language will also hide the kernel
> side differences between tracepoint and kprobe.

Hmm, it sounds making another systemtap on top of tracepoint and kprobes.
Why don't you just reuse the existing facilities (perftools and ftrace)
instead of co-exist?

> Just look how ktap scripts look alike for kprobes and tracepoints.

Ktap is a good example, it provides only a language parser and a runtime engine.
Actually, currently it lacks a feature to execute "perf-probe" helper from
script, but it is easy to add such feature.

Jovi, if you hire perf-probe helper, you could do

trace probe:do_sys_open dfd fname flags mode {
...
}

instead of

trace probe:do_sys_open dfd=%di fname=%dx flags=%cx mode=+4($stack) {
...
}

For this usecase, I've made --output option for perf probe
https://lkml.org/lkml/2014/10/31/210

It currently stopped, but easy to resume on the latest perf.

Thank you,

> Whether ktap syntax becomes part of perf or perf invents
> its own language, it's going to be good for users regardless.
> The C examples here are just examples. Something
> users can play with already until more user friendly
> tools are being worked on.


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-16 18:57 Alexei Starovoitov
  2015-01-22  1:03 ` Namhyung Kim
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-16 18:57 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Namhyung Kim, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, Linux API, Network Development, LKML

On Fri, Jan 16, 2015 at 7:02 AM, Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
> On Thu, 15 Jan 2015 20:16:01 -0800
> Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>
>> Hi Ingo, Steven,
>>
>> This patch set is based on tip/master.
>
> Note, the tracing code isn't maintained in tip/master, but perf code is.

I know. I can rebase against linux-trace tree, but wanted to go
through tip to get advantage of tip-bot ;)

> Do you have a git repo somewhere that I can look at? It makes it easier
> than loading in 9 patches ;-)

https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/

> For syscalls this is fine as the parameters are usually set. But
> there's a lot of tracepoints that we need to know the result of the
> copied data to decide to filter or not, where the result happens at the
> TP_fast_assign() part which requires allocating the buffers.
...
> Again, for syscalls it may not be a problem, but for other tracepoints,
> I'm not sure we can do that. How do you handle sched_switch for
> example? The tracepoint only gets two pointers to task structs, you
> need to then dereference them to get the pid, prio, state and other
> data.

exactly. In this patch set the user can use bpf_fetch_*()
helpers to dereference task_struct to get to pid, prio, anything.
The user is not limited by what tracepoints hard coded as
part of TP_fast_assign.
In the future when generic type/field infra is ready,
the bpf program will have faster way to access such fields
without going through bpf_fetch_*() helpers.

In other words the program is a superset of existing
TP_fast_assign + TP_print + filters + triggers + stack dumps.
All are done from the program on demand.
Some programs may just count number of events
without accessing arguments at all.

> Maybe we should have a way to do the program before and/or after the
> buffering depending on what to filter on. There's no way to know what
> the parameters of the tracepoint are without looking at the source.

right now most of the tracepoints copy as much as possible
fields as part of TP_fast_assign, since there is no way to
know what will be needed later.
And despite copying a lot, often it's not enough for analytics.
With program is attached before the copy, the overhead
is much lower. The program will look at whatever fields
necessary, may do some operations on them and even
store these fields into maps for further processing.

>> - dump_stack
>> - trace_printk
>> The last two are mainly to debug the programs and to print data for user
>> space consumptions.
>
> I have to look at the code, but currently trace_printk() isn't made to
> be used in production systems.

other than allocating a bunch of per_cpu pages.
what concerns do you have?
Some printk-like facility from the program is needed.
I'd rather fix whatever necessary in trace_printk instead
of inventing another way of printing.
Anyway, I think I will drop these two helpers for now.

> One last thing. If the ebpf is used for anything but filtering, it
> should go into the trigger file. The filtering is only a way to say if
> the event should be recorded or not. But the trigger could do something
> else (a printk, a stacktrace, etc).

it does way more than just filtering, but
invoking program as a trigger is too slow.
When program is called as soon as tracepoint fires,
it can fetch other fields, evaluate them, printk some of them,
optionally dump stack, aggregate into maps.
We can let it call triggers too, so that user program will
be able to enable/disable other events.
I'm not against invoking programs as a trigger, but I don't
see a use case for it. It's just too slow for production
analytics that needs to act on huge number of events
per second.
We must minimize the overhead between tracepoint
firing and program executing, so that programs can
be used on events like packet receive which will be
in millions per second. Every nsec counts.
For example:
- raw dd if=/dev/zero of=/dev/null
  does 760 MB/s (on my debug kernel)
- echo 1 > events/syscalls/sys_enter_write/enable
  drops it to 400 MB/s
- echo "echo "count == 123 " > events/syscalls/sys_enter_write/filter
  drops it even further down to 388 MB/s
This slowdown is too high for this to be used on a live system.
- tracex4 that computes histogram of sys_write sizes
  and stores log2(count) into a map does 580 MB/s
  This is still not great, but this slowdown is now usable
  and we can work further on minimizing the overhead.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
  2015-01-16 18:57 Alexei Starovoitov
@ 2015-01-22  1:03 ` Namhyung Kim
  0 siblings, 0 replies; 20+ messages in thread
From: Namhyung Kim @ 2015-01-22  1:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, Linux API, Network Development, LKML

Hi Alexei,

On Fri, Jan 16, 2015 at 10:57:15AM -0800, Alexei Starovoitov wrote:
> On Fri, Jan 16, 2015 at 7:02 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > One last thing. If the ebpf is used for anything but filtering, it
> > should go into the trigger file. The filtering is only a way to say if
> > the event should be recorded or not. But the trigger could do something
> > else (a printk, a stacktrace, etc).
> 
> it does way more than just filtering, but
> invoking program as a trigger is too slow.
> When program is called as soon as tracepoint fires,
> it can fetch other fields, evaluate them, printk some of them,
> optionally dump stack, aggregate into maps.
> We can let it call triggers too, so that user program will
> be able to enable/disable other events.
> I'm not against invoking programs as a trigger, but I don't
> see a use case for it. It's just too slow for production
> analytics that needs to act on huge number of events
> per second.

AFAIK a trigger can be fired before allocating a ring buffer if it
doesn't use the event record (i.e. has filter) or ->post_trigger bit
set (stacktrace).  Please see ftrace_trigger_soft_disabled().

This also makes it keeping events in the soft-disabled state.

Thanks,
Namhyung


> We must minimize the overhead between tracepoint
> firing and program executing, so that programs can
> be used on events like packet receive which will be
> in millions per second. Every nsec counts.
> For example:
> - raw dd if=/dev/zero of=/dev/null
>   does 760 MB/s (on my debug kernel)
> - echo 1 > events/syscalls/sys_enter_write/enable
>   drops it to 400 MB/s
> - echo "echo "count == 123 " > events/syscalls/sys_enter_write/filter
>   drops it even further down to 388 MB/s
> This slowdown is too high for this to be used on a live system.
> - tracex4 that computes histogram of sys_write sizes
>   and stores log2(count) into a map does 580 MB/s
>   This is still not great, but this slowdown is now usable
>   and we can work further on minimizing the overhead.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
@ 2015-01-22  1:49 Alexei Starovoitov
       [not found] ` <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-22  1:49 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, Linux API, Network Development, LKML

On Wed, Jan 21, 2015 at 5:03 PM, Namhyung Kim <namhyung-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> AFAIK a trigger can be fired before allocating a ring buffer if it
> doesn't use the event record (i.e. has filter) or ->post_trigger bit
> set (stacktrace).  Please see ftrace_trigger_soft_disabled().

yes, but such trigger has no arguments, so I would have to hack
ftrace_trigger_soft_disabled() to pass 'ctx' further down
and through all pointer dereferences and list walking.
Also there is no return value, so I have to add it as well
similar to post-triggers.
that's quite a bit of overhead that I would like to avoid.

Actually now I'm thinking to move condition
if (ftrace_file->flags & TRACE_EVENT_FL_BPF)
before ftrace_trigger_soft_disabled() check.
So programs always run first and if they return non-zero
then all standard processing will follow.
May be return value from the program can influence triggers.
That will nicely replace bpf_dump_stack()...the program
will return ETT_STACKTRACE constant to trigger dump.

> This also makes it keeping events in the soft-disabled state.

I was never able to figure out the use case for soft-disabled state.
Probably historical before static_key was done.

^ permalink raw reply	[flat|nested] 20+ messages in thread

[parent not found: <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
       [not found] ` <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-22  1:56   ` Steven Rostedt
       [not found]     ` <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Steven Rostedt @ 2015-01-22  1:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, Linux API, Network Development, LKML

On Wed, 21 Jan 2015 17:49:08 -0800
Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:


> > This also makes it keeping events in the soft-disabled state.
> 
> I was never able to figure out the use case for soft-disabled state.
> Probably historical before static_key was done.

No, it's not historical at all. The "soft-disable" is a way to enable
from any context. You can't enable a static key from NMI or interrupt
context, but you can enable a "soft-disable" there.

As you can enable or disable events from any function that the function
tracer may trace, I needed a way to enable them (make the tracepoint
active), but do nothing until something else turns them on.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread

[parent not found: <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>]

* Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe
       [not found]     ` <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
@ 2015-01-22  2:13       ` Alexei Starovoitov
  0 siblings, 0 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2015-01-22  2:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Namhyung Kim, Ingo Molnar, Arnaldo Carvalho de Melo, Jiri Olsa,
	David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
	Brendan Gregg, Linux API, Network Development, LKML

On Wed, Jan 21, 2015 at 5:56 PM, Steven Rostedt <rostedt-nx8X9YLhiw1AfugRpC6u6w@public.gmane.org> wrote:
> On Wed, 21 Jan 2015 17:49:08 -0800
> Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org> wrote:
>
>
>> > This also makes it keeping events in the soft-disabled state.
>>
>> I was never able to figure out the use case for soft-disabled state.
>> Probably historical before static_key was done.
>
> No, it's not historical at all. The "soft-disable" is a way to enable
> from any context. You can't enable a static key from NMI or interrupt
> context, but you can enable a "soft-disable" there.
>
> As you can enable or disable events from any function that the function
> tracer may trace, I needed a way to enable them (make the tracepoint
> active), but do nothing until something else turns them on.

Thanks for explanation. Makes sense.
Speaking of nmi... I think I will add a check that if (in_nmi())
just skip running the program, since supporting this use
case is not needed at the moment.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-01-22  2:13 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-01-16  4:16 [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Alexei Starovoitov
2015-01-16  4:16 ` [PATCH tip 5/9] samples: bpf: simple tracing example in C Alexei Starovoitov
2015-01-16  4:16 ` [PATCH tip 6/9] samples: bpf: counting example for kfree_skb tracepoint and write syscall Alexei Starovoitov
2015-01-16  4:16 ` [PATCH tip 7/9] samples: bpf: IO latency analysis (iosnoop/heatmap) Alexei Starovoitov
2015-01-16  4:16 ` [PATCH tip 8/9] tracing: attach eBPF programs to kprobe/kretprobe Alexei Starovoitov
2015-01-16  4:16 ` [PATCH tip 9/9] samples: bpf: simple kprobe example Alexei Starovoitov
     [not found] ` <1421381770-4866-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
2015-01-16  4:16   ` [PATCH tip 1/9] tracing: attach eBPF programs to tracepoints and syscalls Alexei Starovoitov
2015-01-16  4:16   ` [PATCH tip 2/9] tracing: allow eBPF programs to call bpf_printk() Alexei Starovoitov
2015-01-16  4:16   ` [PATCH tip 3/9] tracing: allow eBPF programs to call ktime_get_ns() Alexei Starovoitov
2015-01-16  4:16   ` [PATCH tip 4/9] samples: bpf: simple tracing example in eBPF assembler Alexei Starovoitov
2015-01-20 11:57     ` Masami Hiramatsu
2015-01-16 15:02   ` [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe Steven Rostedt
2015-01-19  9:52   ` Masami Hiramatsu
2015-01-19 20:48     ` Alexei Starovoitov
2015-01-20  2:58       ` Masami Hiramatsu
  -- strict thread matches above, loose matches on Subject: below --
2015-01-16 18:57 Alexei Starovoitov
2015-01-22  1:03 ` Namhyung Kim
2015-01-22  1:49 Alexei Starovoitov
     [not found] ` <CAMEtUux8v2LDtLcgpT9hCvJgnrCwT2fkzsSvAPFSuEUx+itxyQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-22  1:56   ` Steven Rostedt
     [not found]     ` <20150121205643.4d8a3516-f9ZlEuEWxVcJvu8Pb33WZ0EMvNT87kid@public.gmane.org>
2015-01-22  2:13       ` Alexei Starovoitov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).