netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
@ 2016-02-18  3:58 Alexei Starovoitov
  2016-02-18  3:58 ` [PATCH net-next 1/3] perf: generalize perf_callchain Alexei Starovoitov
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-18  3:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

This patch set introduces a new map type to store stack traces and
a corresponding bpf_get_stackid() helper.
BPF programs can already walk the stack via an unrolled loop of
bpf_probe_read()s, which is fine for simple analysis, but it's
not efficient and is limited to fewer than 30 frames; beyond that
the programs no longer fit into MAX_BPF_STACK. With the
bpf_get_stackid() helper the programs can collect up to
PERF_MAX_STACK_DEPTH frames of both user and kernel stacks.
Using stack traces as keys in a map turned out to be very useful
for generating flame graphs, off-CPU graphs, and waker and chain graphs.
Patch 3 is a simplified version of the 'offwaketime' tool, which is
described in detail here:
http://brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
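
For illustration only, here is a minimal sketch (not part of the patches)
of the usage pattern. It assumes the samples/bpf build environment and the
bpf_helpers.h wrappers from patch 3; the program and map names are made up:

  #include <uapi/linux/bpf.h>
  #include <uapi/linux/perf_event.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") stackmap = {
  	.type = BPF_MAP_TYPE_STACK_TRACE,
  	.key_size = sizeof(__u32),
  	.value_size = PERF_MAX_STACK_DEPTH * sizeof(__u64),
  	.max_entries = 10000,
  };

  /* count how many times each unique kernel stack was seen */
  struct bpf_map_def SEC("maps") counts = {
  	.type = BPF_MAP_TYPE_HASH,
  	.key_size = sizeof(__u32),
  	.value_size = sizeof(__u64),
  	.max_entries = 10000,
  };

  SEC("kprobe/finish_task_switch")
  int count_stacks(struct pt_regs *ctx)
  {
  	/* one helper call replaces the unrolled bpf_probe_read() loop */
  	int id = bpf_get_stackid(ctx, &stackmap, BPF_F_FAST_STACK_CMP);
  	__u64 zero = 0, *val;
  	__u32 key;

  	if (id < 0)
  		return 0;
  	key = id;
  	val = bpf_map_lookup_elem(&counts, &key);
  	if (!val) {
  		bpf_map_update_elem(&counts, &key, &zero, BPF_NOEXIST);
  		val = bpf_map_lookup_elem(&counts, &key);
  		if (!val)
  			return 0;
  	}
  	(*val)++;
  	return 0;
  }
  char _license[] SEC("license") = "GPL";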

An earlier version of this patch set used the save_stack_trace() helper,
but 'unreliable' frames add too much noise and two equivalent
stack traces produce different stackids.
The lockdep style of storing frames with MAX_STACK_TRACE_ENTRIES is
great for lockdep, but not acceptable for bpf, since the stack_trace
map needs to be freed when the user Ctrl-Cs the tool.
The ftrace style with per_cpu(struct ftrace_stack) is great, but it's
tightly coupled with the ftrace ring buffer and has the same 'unreliable'
noise. perf_event's perf_callchain() mechanism is also very efficient,
and it needed only minor generalization, done in patch 1, to be
usable by bpf stack_trace maps.
Peter, please take a look at patch 1.
If you're ok with it, I'd like to take the whole set via net-next.

Patch 1 - generalization of perf_callchain()
Patch 2 - stack_trace map implemented as a lock-less hash table without
  linked lists, to avoid taking a spinlock on insertion, which is on the
  critical path when the bpf_get_stackid() helper is called for every
  task switch event
Patch 3 - offwaketime example

With the patches applied, here is 'perf report' for an artificial
'sched_bench' benchmark that does pthread_cond_wait/signal, with the
'offwaketime' example running in the background:
 16.35%  swapper      [kernel.vmlinux]    [k] intel_idle
  2.18%  sched_bench  [kernel.vmlinux]    [k] __switch_to
  2.18%  sched_bench  libpthread-2.12.so  [.] pthread_cond_signal@@GLIBC_2.3.2
  1.72%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_unlock
  1.53%  sched_bench  [kernel.vmlinux]    [k] bpf_get_stackid
  1.44%  sched_bench  [kernel.vmlinux]    [k] entry_SYSCALL_64
  1.39%  sched_bench  [kernel.vmlinux]    [k] __call_rcu.constprop.73
  1.13%  sched_bench  libpthread-2.12.so  [.] pthread_mutex_lock
  1.07%  sched_bench  libpthread-2.12.so  [.] pthread_cond_wait@@GLIBC_2.3.2
  1.07%  sched_bench  [kernel.vmlinux]    [k] hash_futex
  1.05%  sched_bench  [kernel.vmlinux]    [k] do_futex
  1.05%  sched_bench  [kernel.vmlinux]    [k] get_futex_key_refs.isra.13

The hottest part of bpf_get_stackid() is the inlined jhash2, so we may
consider using a faster hash in the future, but it's good enough for now.

Alexei Starovoitov (3):
  perf: generalize perf_callchain
  bpf: introduce BPF_MAP_TYPE_STACK_TRACE
  samples/bpf: offwaketime example

 arch/x86/include/asm/stacktrace.h |   2 +-
 arch/x86/kernel/cpu/perf_event.c  |   4 +-
 arch/x86/kernel/dumpstack.c       |   6 +-
 arch/x86/kernel/stacktrace.c      |  18 +--
 arch/x86/oprofile/backtrace.c     |   3 +-
 include/linux/bpf.h               |   1 +
 include/linux/perf_event.h        |  13 ++-
 include/uapi/linux/bpf.h          |  21 ++++
 kernel/bpf/Makefile               |   3 +
 kernel/bpf/stackmap.c             | 237 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c             |   6 +-
 kernel/events/callchain.c         |  32 +++--
 kernel/events/internal.h          |   2 -
 kernel/trace/bpf_trace.c          |   2 +
 samples/bpf/Makefile              |   4 +
 samples/bpf/bpf_helpers.h         |   2 +
 samples/bpf/offwaketime_kern.c    | 131 +++++++++++++++++++++
 samples/bpf/offwaketime_user.c    | 185 +++++++++++++++++++++++++++++
 18 files changed, 642 insertions(+), 30 deletions(-)
 create mode 100644 kernel/bpf/stackmap.c
 create mode 100644 samples/bpf/offwaketime_kern.c
 create mode 100644 samples/bpf/offwaketime_user.c

-- 
2.4.6

* [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-18  3:58 [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map Alexei Starovoitov
@ 2016-02-18  3:58 ` Alexei Starovoitov
  2016-02-25 14:18   ` Peter Zijlstra
  2016-02-25 16:47   ` Peter Zijlstra
  2016-02-18  3:58 ` [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE Alexei Starovoitov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-18  3:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

. avoid walking the stack when there is no room left in the buffer
. generalize get_perf_callchain() so it can be called from a bpf helper
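
A short sketch (not part of the diff below) of how a non-perf caller is
expected to use the generalized interface; it mirrors the way the bpf
helper in patch 2 calls it:

	struct perf_callchain_entry *trace;

	/* kernel-only callchain of the current task; add_mark == false
	 * means no PERF_CONTEXT_* markers are stored in the buffer
	 */
	trace = get_perf_callchain(regs, 0 /* init_nr */,
				   true /* kernel */, false /* user */,
				   false /* crosstask */, false /* add_mark */);
	if (!trace)
		/* couldn't fetch the stack trace */
		return -EFAULT;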

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 arch/x86/include/asm/stacktrace.h |  2 +-
 arch/x86/kernel/cpu/perf_event.c  |  4 ++--
 arch/x86/kernel/dumpstack.c       |  6 ++++--
 arch/x86/kernel/stacktrace.c      | 18 +++++++++++-------
 arch/x86/oprofile/backtrace.c     |  3 ++-
 include/linux/perf_event.h        | 13 +++++++++++--
 kernel/events/callchain.c         | 32 ++++++++++++++++++++------------
 kernel/events/internal.h          |  2 --
 8 files changed, 51 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 70bbe39043a9..7c247e7404be 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -37,7 +37,7 @@ print_context_stack_bp(struct thread_info *tinfo,
 /* Generic stack tracer with callbacks */
 
 struct stacktrace_ops {
-	void (*address)(void *data, unsigned long address, int reliable);
+	int (*address)(void *data, unsigned long address, int reliable);
 	/* On negative return stop dumping */
 	int (*stack)(void *data, char *name);
 	walk_stack_t	walk_stack;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 1b443db2db50..d276b31ca473 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -2180,11 +2180,11 @@ static int backtrace_stack(void *data, char *name)
 	return 0;
 }
 
-static void backtrace_address(void *data, unsigned long addr, int reliable)
+static int backtrace_address(void *data, unsigned long addr, int reliable)
 {
 	struct perf_callchain_entry *entry = data;
 
-	perf_callchain_store(entry, addr);
+	return perf_callchain_store(entry, addr);
 }
 
 static const struct stacktrace_ops backtrace_ops = {
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 9c30acfadae2..0d1ff4b407d4 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -135,7 +135,8 @@ print_context_stack_bp(struct thread_info *tinfo,
 		if (!__kernel_text_address(addr))
 			break;
 
-		ops->address(data, addr, 1);
+		if (ops->address(data, addr, 1))
+			break;
 		frame = frame->next_frame;
 		ret_addr = &frame->return_address;
 		print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
@@ -154,10 +155,11 @@ static int print_trace_stack(void *data, char *name)
 /*
  * Print one address/symbol entries per line.
  */
-static void print_trace_address(void *data, unsigned long addr, int reliable)
+static int print_trace_address(void *data, unsigned long addr, int reliable)
 {
 	touch_nmi_watchdog();
 	printk_stack_address(addr, reliable, data);
+	return 0;
 }
 
 static const struct stacktrace_ops print_trace_ops = {
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index fdd0c6430e5a..9ee98eefc44d 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -14,30 +14,34 @@ static int save_stack_stack(void *data, char *name)
 	return 0;
 }
 
-static void
+static int
 __save_stack_address(void *data, unsigned long addr, bool reliable, bool nosched)
 {
 	struct stack_trace *trace = data;
 #ifdef CONFIG_FRAME_POINTER
 	if (!reliable)
-		return;
+		return 0;
 #endif
 	if (nosched && in_sched_functions(addr))
-		return;
+		return 0;
 	if (trace->skip > 0) {
 		trace->skip--;
-		return;
+		return 0;
 	}
-	if (trace->nr_entries < trace->max_entries)
+	if (trace->nr_entries < trace->max_entries) {
 		trace->entries[trace->nr_entries++] = addr;
+		return 0;
+	} else {
+		return -1; /* no more room, stop walking the stack */
+	}
 }
 
-static void save_stack_address(void *data, unsigned long addr, int reliable)
+static int save_stack_address(void *data, unsigned long addr, int reliable)
 {
 	return __save_stack_address(data, addr, reliable, false);
 }
 
-static void
+static int
 save_stack_address_nosched(void *data, unsigned long addr, int reliable)
 {
 	return __save_stack_address(data, addr, reliable, true);
diff --git a/arch/x86/oprofile/backtrace.c b/arch/x86/oprofile/backtrace.c
index 4e664bdb535a..cb31a4440e58 100644
--- a/arch/x86/oprofile/backtrace.c
+++ b/arch/x86/oprofile/backtrace.c
@@ -23,12 +23,13 @@ static int backtrace_stack(void *data, char *name)
 	return 0;
 }
 
-static void backtrace_address(void *data, unsigned long addr, int reliable)
+static int backtrace_address(void *data, unsigned long addr, int reliable)
 {
 	unsigned int *depth = data;
 
 	if ((*depth)--)
 		oprofile_add_trace(addr);
+	return 0;
 }
 
 static struct stacktrace_ops backtrace_ops = {
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b35a61a481fa..7da3c25999df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -964,11 +964,20 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 
 extern void perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry *entry, struct pt_regs *regs);
+extern struct perf_callchain_entry *
+get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+		   bool crosstask, bool add_mark);
+extern int get_callchain_buffers(void);
+extern void put_callchain_buffers(void);
 
-static inline void perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
+static inline int perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
 {
-	if (entry->nr < PERF_MAX_STACK_DEPTH)
+	if (entry->nr < PERF_MAX_STACK_DEPTH) {
 		entry->ip[entry->nr++] = ip;
+		return 0;
+	} else {
+		return -1; /* no more room, stop walking the stack */
+	}
 }
 
 extern int sysctl_perf_event_paranoid;
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 9c418002b8c1..343c22f5e867 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -159,15 +159,24 @@ put_callchain_entry(int rctx)
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
-	int rctx;
-	struct perf_callchain_entry *entry;
-
-	int kernel = !event->attr.exclude_callchain_kernel;
-	int user   = !event->attr.exclude_callchain_user;
+	bool kernel = !event->attr.exclude_callchain_kernel;
+	bool user   = !event->attr.exclude_callchain_user;
+	/* Disallow cross-task user callchains. */
+	bool crosstask = event->ctx->task && event->ctx->task != current;
 
 	if (!kernel && !user)
 		return NULL;
 
+	return get_perf_callchain(regs, 0, kernel, user, crosstask, true);
+}
+
+struct perf_callchain_entry *
+get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+		   bool crosstask, bool add_mark)
+{
+	struct perf_callchain_entry *entry;
+	int rctx;
+
 	entry = get_callchain_entry(&rctx);
 	if (rctx == -1)
 		return NULL;
@@ -175,10 +184,11 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!entry)
 		goto exit_put;
 
-	entry->nr = 0;
+	entry->nr = init_nr;
 
 	if (kernel && !user_mode(regs)) {
-		perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
+		if (add_mark)
+			perf_callchain_store(entry, PERF_CONTEXT_KERNEL);
 		perf_callchain_kernel(entry, regs);
 	}
 
@@ -191,13 +201,11 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 		}
 
 		if (regs) {
-			/*
-			 * Disallow cross-task user callchains.
-			 */
-			if (event->ctx->task && event->ctx->task != current)
+			if (crosstask)
 				goto exit_put;
 
-			perf_callchain_store(entry, PERF_CONTEXT_USER);
+			if (add_mark)
+				perf_callchain_store(entry, PERF_CONTEXT_USER);
 			perf_callchain_user(entry, regs);
 		}
 	}
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 2bbad9c1274c..4199b6d193f5 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -182,8 +182,6 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
 /* Callchain handling */
 extern struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs);
-extern int get_callchain_buffers(void);
-extern void put_callchain_buffers(void);
 
 static inline int get_recursion_context(int *recursion)
 {
-- 
2.4.6

* [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE
  2016-02-18  3:58 [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map Alexei Starovoitov
  2016-02-18  3:58 ` [PATCH net-next 1/3] perf: generalize perf_callchain Alexei Starovoitov
@ 2016-02-18  3:58 ` Alexei Starovoitov
  2016-02-25 14:23   ` Peter Zijlstra
  2016-02-18  3:58 ` [PATCH net-next 3/3] samples/bpf: offwaketime example Alexei Starovoitov
  2016-02-20  5:25 ` [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map David Miller
  3 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-18  3:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

add a new map type to store stack traces and a corresponding helper:

bpf_get_stackid(ctx, map, flags) - walk the user or kernel stack and return an id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - number of stack frames to skip
        bit 8 - collect user stack instead of kernel
        bit 9 - compare stacks by hash only
        bit 10 - if two different stacks hash into the same stackid,
                 discard the old one
        other bits - reserved
Return: >= 0 stackid on success or negative error

stackid is a 32-bit integer handle that can be further combined with
other data (including other stackids) and used as a key into maps.

Userspace accesses the stackmap using the standard lookup/delete syscall
commands to retrieve the full stack trace for a given stackid.
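
For illustration, a sketch of the userspace side; it uses the
bpf_lookup_elem() wrapper from samples/bpf/libbpf.h (as in patch 3), and
'stack_map_fd' and 'stackid' are assumed to be set up elsewhere:

	__u64 ip[PERF_MAX_STACK_DEPTH] = {};
	int i;

	/* key is the 32-bit stackid, value is the zero-padded array of
	 * up to PERF_MAX_STACK_DEPTH frame addresses
	 */
	if (bpf_lookup_elem(stack_map_fd, &stackid, ip) == 0) {
		for (i = 0; i < PERF_MAX_STACK_DEPTH && ip[i]; i++)
			printf("0x%llx\n", ip[i]);
	}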

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h      |   1 +
 include/uapi/linux/bpf.h |  21 +++++
 kernel/bpf/Makefile      |   3 +
 kernel/bpf/stackmap.c    | 237 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c    |   6 +-
 kernel/trace/bpf_trace.c |   2 +
 6 files changed, 269 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/stackmap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 90ee6ab24bc5..0cadbb7456c0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -237,6 +237,7 @@ extern const struct bpf_func_proto bpf_get_current_uid_gid_proto;
 extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
+extern const struct bpf_func_proto bpf_get_stackid_proto;
 
 /* Shared helpers among cBPF and eBPF. */
 void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2ee0fde1bf96..d3e77da8e9e8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -83,6 +83,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_PERF_EVENT_ARRAY,
 	BPF_MAP_TYPE_PERCPU_HASH,
 	BPF_MAP_TYPE_PERCPU_ARRAY,
+	BPF_MAP_TYPE_STACK_TRACE,
 };
 
 enum bpf_prog_type {
@@ -272,6 +273,20 @@ enum bpf_func_id {
 	 */
 	BPF_FUNC_perf_event_output,
 	BPF_FUNC_skb_load_bytes,
+
+	/**
+	 * bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
+	 * @ctx: struct pt_regs*
+	 * @map: pointer to stack_trace map
+	 * @flags: bits 0-7 - numer of stack frames to skip
+	 *         bit 8 - collect user stack instead of kernel
+	 *         bit 9 - compare stacks by hash only
+	 *         bit 10 - if two different stacks hash into the same stackid
+	 *                  discard old
+	 *         other bits - reserved
+	 * Return: >= 0 stackid on success or negative error
+	 */
+	BPF_FUNC_get_stackid,
 	__BPF_FUNC_MAX_ID,
 };
 
@@ -294,6 +309,12 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6		(1ULL << 0)
 
+/* BPF_FUNC_get_stackid flags. */
+#define BPF_F_SKIP_FIELD_MASK		0xffULL
+#define BPF_F_USER_STACK		(1ULL << 8)
+#define BPF_F_FAST_STACK_CMP		(1ULL << 9)
+#define BPF_F_REUSE_STACKID		(1ULL << 10)
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 13272582eee0..8a932d079c24 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,3 +2,6 @@ obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
+ifeq ($(CONFIG_PERF_EVENTS),y)
+obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
+endif
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
new file mode 100644
index 000000000000..8a60ee14a977
--- /dev/null
+++ b/kernel/bpf/stackmap.c
@@ -0,0 +1,237 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/jhash.h>
+#include <linux/filter.h>
+#include <linux/vmalloc.h>
+#include <linux/stacktrace.h>
+#include <linux/perf_event.h>
+
+struct stack_map_bucket {
+	struct rcu_head rcu;
+	u32 hash;
+	u32 nr;
+	u64 ip[];
+};
+
+struct bpf_stack_map {
+	struct bpf_map map;
+	u32 n_buckets;
+	struct stack_map_bucket __rcu *buckets[];
+};
+
+/* Called from syscall */
+static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
+{
+	u32 value_size = attr->value_size;
+	struct bpf_stack_map *smap;
+	u64 cost, n_buckets;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    value_size < 8 || value_size % 8 ||
+	    value_size / 8 > PERF_MAX_STACK_DEPTH)
+		return ERR_PTR(-EINVAL);
+
+	/* hash table size must be power of 2 */
+	n_buckets = roundup_pow_of_two(attr->max_entries);
+
+	cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap);
+	if (cost >= U32_MAX - PAGE_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	smap = kzalloc(cost, GFP_USER | __GFP_NOWARN);
+	if (!smap) {
+		smap = vzalloc(cost);
+		if (!smap)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	err = -E2BIG;
+	cost += n_buckets * (value_size + sizeof(struct stack_map_bucket));
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_smap;
+
+	smap->map.map_type = attr->map_type;
+	smap->map.key_size = attr->key_size;
+	smap->map.value_size = value_size;
+	smap->map.max_entries = attr->max_entries;
+	smap->n_buckets = n_buckets;
+	smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	err = get_callchain_buffers();
+	if (err)
+		goto free_smap;
+
+	return &smap->map;
+
+free_smap:
+	kvfree(smap);
+	return ERR_PTR(err);
+}
+
+static u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
+{
+	struct pt_regs *regs = (struct pt_regs *) (long) r1;
+	struct bpf_map *map = (struct bpf_map *) (long) r2;
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	struct perf_callchain_entry *trace;
+	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
+	u32 max_depth = map->value_size / 8;
+	/* stack_map_alloc() checks that max_depth <= PERF_MAX_STACK_DEPTH */
+	u32 init_nr = PERF_MAX_STACK_DEPTH - max_depth;
+	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+	u32 hash, id, trace_nr, trace_len;
+	bool user = flags & BPF_F_USER_STACK;
+	bool kernel = !user;
+	u64 *ips;
+
+	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
+			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
+		return -EINVAL;
+
+	trace = get_perf_callchain(regs, init_nr, kernel, user, false, false);
+
+	if (unlikely(!trace))
+		/* couldn't fetch the stack trace */
+		return -EFAULT;
+
+	/* get_perf_callchain() guarantees that trace->nr >= init_nr
+	 * and trace-nr <= PERF_MAX_STACK_DEPTH, so trace_nr <= max_depth
+	 */
+	trace_nr = trace->nr - init_nr;
+
+	if (trace_nr <= skip)
+		/* skipping more than usable stack trace */
+		return -EFAULT;
+
+	trace_nr -= skip;
+	trace_len = trace_nr * sizeof(u64);
+	ips = trace->ip + skip + init_nr;
+	hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
+	id = hash & (smap->n_buckets - 1);
+	bucket = rcu_dereference(smap->buckets[id]);
+
+	if (bucket && bucket->hash == hash) {
+		if (flags & BPF_F_FAST_STACK_CMP)
+			return id;
+		if (bucket->nr == trace_nr &&
+		    memcmp(bucket->ip, ips, trace_len) == 0)
+			return id;
+	}
+
+	/* this call stack is not in the map, try to add it */
+	if (bucket && !(flags & BPF_F_REUSE_STACKID))
+		return -EEXIST;
+
+	new_bucket = kmalloc(sizeof(struct stack_map_bucket) + map->value_size,
+			     GFP_ATOMIC | __GFP_NOWARN);
+	if (unlikely(!new_bucket))
+		return -ENOMEM;
+
+	memcpy(new_bucket->ip, ips, trace_len);
+	memset(new_bucket->ip + trace_len / 8, 0, map->value_size - trace_len);
+	new_bucket->hash = hash;
+	new_bucket->nr = trace_nr;
+
+	old_bucket = xchg(&smap->buckets[id], new_bucket);
+	if (old_bucket)
+		kfree_rcu(old_bucket, rcu);
+	return id;
+}
+
+const struct bpf_func_proto bpf_get_stackid_proto = {
+	.func		= bpf_get_stackid,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+};
+
+/* Called from syscall or from eBPF program */
+static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	struct stack_map_bucket *bucket;
+	u32 id = *(u32 *)key;
+
+	if (unlikely(id >= smap->n_buckets))
+		return NULL;
+	bucket = rcu_dereference(smap->buckets[id]);
+	return bucket ? bucket->ip : NULL;
+}
+
+static int stack_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -EINVAL;
+}
+
+static int stack_map_update_elem(struct bpf_map *map, void *key, void *value,
+				 u64 map_flags)
+{
+	return -EINVAL;
+}
+
+/* Called from syscall or from eBPF program */
+static int stack_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	struct stack_map_bucket *old_bucket;
+	u32 id = *(u32 *)key;
+
+	if (unlikely(id >= smap->n_buckets))
+		return -E2BIG;
+
+	old_bucket = xchg(&smap->buckets[id], NULL);
+	if (old_bucket) {
+		kfree_rcu(old_bucket, rcu);
+		return 0;
+	} else {
+		return -ENOENT;
+	}
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void stack_map_free(struct bpf_map *map)
+{
+	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
+	int i;
+
+	synchronize_rcu();
+
+	for (i = 0; i < smap->n_buckets; i++)
+		if (smap->buckets[i])
+			kfree_rcu(smap->buckets[i], rcu);
+	kvfree(smap);
+	put_callchain_buffers();
+}
+
+static const struct bpf_map_ops stack_map_ops = {
+	.map_alloc = stack_map_alloc,
+	.map_free = stack_map_free,
+	.map_get_next_key = stack_map_get_next_key,
+	.map_lookup_elem = stack_map_lookup_elem,
+	.map_update_elem = stack_map_update_elem,
+	.map_delete_elem = stack_map_delete_elem,
+};
+
+static struct bpf_map_type_list stack_map_type __read_mostly = {
+	.ops = &stack_map_ops,
+	.type = BPF_MAP_TYPE_STACK_TRACE,
+};
+
+static int __init register_stack_map(void)
+{
+	bpf_register_map_type(&stack_map_type);
+	return 0;
+}
+late_initcall(register_stack_map);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d1d3e8f57de9..42ba4ccc020b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -246,6 +246,7 @@ static const struct {
 	{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
 	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
 	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_output},
+	{BPF_MAP_TYPE_STACK_TRACE, BPF_FUNC_get_stackid},
 };
 
 static void print_verifier_state(struct verifier_env *env)
@@ -911,8 +912,11 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		 * don't allow any other map type to be passed into
 		 * the special func;
 		 */
-		if (bool_func && bool_map != bool_func)
+		if (bool_func && bool_map != bool_func) {
+			verbose("cannot pass map_type %d into func %d\n",
+				map->map_type, func_id);
 			return -EINVAL;
+		}
 	}
 
 	return 0;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 326a75e884db..4b8caa392b86 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -299,6 +299,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return &bpf_perf_event_read_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_perf_event_output_proto;
+	case BPF_FUNC_get_stackid:
+		return &bpf_get_stackid_proto;
 	default:
 		return NULL;
 	}
-- 
2.4.6

* [PATCH net-next 3/3] samples/bpf: offwaketime example
  2016-02-18  3:58 [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map Alexei Starovoitov
  2016-02-18  3:58 ` [PATCH net-next 1/3] perf: generalize perf_callchain Alexei Starovoitov
  2016-02-18  3:58 ` [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE Alexei Starovoitov
@ 2016-02-18  3:58 ` Alexei Starovoitov
  2016-02-20  5:25 ` [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map David Miller
  3 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-18  3:58 UTC (permalink / raw)
  To: David S. Miller
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

This is a simplified version of Brendan Gregg's offwaketime tool:
the program shows the kernel stack traces and task names of threads that were
blocked and "off-CPU", along with the stack traces and task names of the
threads that woke them, and the total elapsed time from when they blocked to
when they were woken up. The combined stacks, task names, and total time are
summarized in kernel context for efficiency.

Example:
$ sudo ./offwaketime | flamegraph.pl > demo.svg
Open demo.svg in a browser to view it as a flame graph.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/Makefile           |   4 +
 samples/bpf/bpf_helpers.h      |   2 +
 samples/bpf/offwaketime_kern.c | 131 +++++++++++++++++++++++++++++
 samples/bpf/offwaketime_user.c | 185 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 322 insertions(+)
 create mode 100644 samples/bpf/offwaketime_kern.c
 create mode 100644 samples/bpf/offwaketime_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index edd638b5825f..c4f8ae0c8afe 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -16,6 +16,7 @@ hostprogs-y += tracex5
 hostprogs-y += tracex6
 hostprogs-y += trace_output
 hostprogs-y += lathist
+hostprogs-y += offwaketime
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -32,6 +33,7 @@ tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
 tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
 trace_output-objs := bpf_load.o libbpf.o trace_output_user.o
 lathist-objs := bpf_load.o libbpf.o lathist_user.o
+offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -47,6 +49,7 @@ always += tracex6_kern.o
 always += trace_output_kern.o
 always += tcbpf1_kern.o
 always += lathist_kern.o
+always += offwaketime_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -63,6 +66,7 @@ HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
+HOSTLOADLIBES_offwaketime += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 7ad19e1dbaf4..811bcca0f29d 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -39,6 +39,8 @@ static int (*bpf_redirect)(int ifindex, int flags) =
 	(void *) BPF_FUNC_redirect;
 static int (*bpf_perf_event_output)(void *ctx, void *map, int index, void *data, int size) =
 	(void *) BPF_FUNC_perf_event_output;
+static int (*bpf_get_stackid)(void *ctx, void *map, int flags) =
+	(void *) BPF_FUNC_get_stackid;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/offwaketime_kern.c b/samples/bpf/offwaketime_kern.c
new file mode 100644
index 000000000000..c0aa5a9b9c48
--- /dev/null
+++ b/samples/bpf/offwaketime_kern.c
@@ -0,0 +1,131 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+#include <uapi/linux/ptrace.h>
+#include <uapi/linux/perf_event.h>
+#include <linux/version.h>
+#include <linux/sched.h>
+
+#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
+
+#define MINBLOCK_US	1
+
+struct key_t {
+	char waker[TASK_COMM_LEN];
+	char target[TASK_COMM_LEN];
+	u32 wret;
+	u32 tret;
+};
+
+struct bpf_map_def SEC("maps") counts = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(struct key_t),
+	.value_size = sizeof(u64),
+	.max_entries = 10000,
+};
+
+struct bpf_map_def SEC("maps") start = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u64),
+	.max_entries = 10000,
+};
+
+struct wokeby_t {
+	char name[TASK_COMM_LEN];
+	u32 ret;
+};
+
+struct bpf_map_def SEC("maps") wokeby = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct wokeby_t),
+	.max_entries = 10000,
+};
+
+struct bpf_map_def SEC("maps") stackmap = {
+	.type = BPF_MAP_TYPE_STACK_TRACE,
+	.key_size = sizeof(u32),
+	.value_size = PERF_MAX_STACK_DEPTH * sizeof(u64),
+	.max_entries = 10000,
+};
+
+#define STACKID_FLAGS (0 | BPF_F_FAST_STACK_CMP)
+
+SEC("kprobe/try_to_wake_up")
+int waker(struct pt_regs *ctx)
+{
+	struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
+	struct wokeby_t woke = {};
+	u32 pid;
+
+	pid = _(p->pid);
+
+	bpf_get_current_comm(&woke.name, sizeof(woke.name));
+	woke.ret = bpf_get_stackid(ctx, &stackmap, STACKID_FLAGS);
+
+	bpf_map_update_elem(&wokeby, &pid, &woke, BPF_ANY);
+	return 0;
+}
+
+static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
+{
+	struct key_t key = {};
+	struct wokeby_t *woke;
+	u64 zero = 0, *val;
+
+	bpf_get_current_comm(&key.target, sizeof(key.target));
+	key.tret = bpf_get_stackid(ctx, &stackmap, STACKID_FLAGS);
+
+	woke = bpf_map_lookup_elem(&wokeby, &pid);
+	if (woke) {
+		key.wret = woke->ret;
+		__builtin_memcpy(&key.waker, woke->name, TASK_COMM_LEN);
+		bpf_map_delete_elem(&wokeby, &pid);
+	}
+
+	val = bpf_map_lookup_elem(&counts, &key);
+	if (!val) {
+		bpf_map_update_elem(&counts, &key, &zero, BPF_NOEXIST);
+		val = bpf_map_lookup_elem(&counts, &key);
+		if (!val)
+			return 0;
+	}
+	(*val) += delta;
+	return 0;
+}
+
+SEC("kprobe/finish_task_switch")
+int oncpu(struct pt_regs *ctx)
+{
+	struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
+	u64 delta, ts, *tsp;
+	u32 pid;
+
+	/* record previous thread sleep time */
+	pid = _(p->pid);
+	ts = bpf_ktime_get_ns();
+	bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
+
+	/* calculate current thread's delta time */
+	pid = bpf_get_current_pid_tgid();
+	tsp = bpf_map_lookup_elem(&start, &pid);
+	if (!tsp)
+		/* missed start or filtered */
+		return 0;
+
+	delta = bpf_ktime_get_ns() - *tsp;
+	bpf_map_delete_elem(&start, &pid);
+	delta = delta / 1000;
+	if (delta < MINBLOCK_US)
+		return 0;
+
+	return update_counts(ctx, pid, delta);
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/offwaketime_user.c b/samples/bpf/offwaketime_user.c
new file mode 100644
index 000000000000..17cf3024e22c
--- /dev/null
+++ b/samples/bpf/offwaketime_user.c
@@ -0,0 +1,185 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <linux/bpf.h>
+#include <string.h>
+#include <linux/perf_event.h>
+#include <errno.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <sys/resource.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define MAX_SYMS 300000
+#define PRINT_RAW_ADDR 0
+
+static struct ksym {
+	long addr;
+	char *name;
+} syms[MAX_SYMS];
+static int sym_cnt;
+
+static int ksym_cmp(const void *p1, const void *p2)
+{
+	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
+}
+
+static int load_kallsyms(void)
+{
+	FILE *f = fopen("/proc/kallsyms", "r");
+	char func[256], buf[256];
+	char symbol;
+	void *addr;
+	int i = 0;
+
+	if (!f)
+		return -ENOENT;
+
+	while (!feof(f)) {
+		if (!fgets(buf, sizeof(buf), f))
+			break;
+		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
+			break;
+		if (!addr)
+			continue;
+		syms[i].addr = (long) addr;
+		syms[i].name = strdup(func);
+		i++;
+	}
+	sym_cnt = i;
+	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
+	return 0;
+}
+
+static void *search(long key)
+{
+	int start = 0, end = sym_cnt;
+	int result;
+
+	while (start < end) {
+		size_t mid = start + (end - start) / 2;
+
+		result = key - syms[mid].addr;
+		if (result < 0)
+			end = mid;
+		else if (result > 0)
+			start = mid + 1;
+		else
+			return &syms[mid];
+	}
+
+	if (start >= 1 && syms[start - 1].addr < key &&
+	    key < syms[start].addr)
+		/* valid ksym */
+		return &syms[start - 1];
+
+	/* out of range. return _stext */
+	return &syms[0];
+}
+
+static void print_ksym(__u64 addr)
+{
+	struct ksym *sym;
+
+	if (!addr)
+		return;
+	sym = search(addr);
+	if (PRINT_RAW_ADDR)
+		printf("%s/%llx;", sym->name, addr);
+	else
+		printf("%s;", sym->name);
+}
+
+#define TASK_COMM_LEN 16
+
+struct key_t {
+	char waker[TASK_COMM_LEN];
+	char target[TASK_COMM_LEN];
+	__u32 wret;
+	__u32 tret;
+};
+
+static void print_stack(struct key_t *key, __u64 count)
+{
+	__u64 ip[PERF_MAX_STACK_DEPTH] = {};
+	static bool warned;
+	int i;
+
+	printf("%s;", key->target);
+	if (bpf_lookup_elem(map_fd[3], &key->tret, ip) != 0) {
+		printf("---;");
+	} else {
+		for (i = PERF_MAX_STACK_DEPTH - 1; i >= 0; i--)
+			print_ksym(ip[i]);
+	}
+	printf("-;");
+	if (bpf_lookup_elem(map_fd[3], &key->wret, ip) != 0) {
+		printf("---;");
+	} else {
+		for (i = 0; i < PERF_MAX_STACK_DEPTH; i++)
+			print_ksym(ip[i]);
+	}
+	printf(";%s %lld\n", key->waker, count);
+
+	if ((key->tret == -EEXIST || key->wret == -EEXIST) && !warned) {
+		printf("stackmap collisions seen. Consider increasing size\n");
+		warned = true;
+	} else if (((int)(key->tret) < 0 || (int)(key->wret) < 0)) {
+		printf("err stackid %d %d\n", key->tret, key->wret);
+	}
+}
+
+static void print_stacks(int fd)
+{
+	struct key_t key = {}, next_key;
+	__u64 value;
+
+	while (bpf_get_next_key(fd, &key, &next_key) == 0) {
+		bpf_lookup_elem(fd, &next_key, &value);
+		print_stack(&next_key, value);
+		key = next_key;
+	}
+}
+
+static void int_exit(int sig)
+{
+	print_stacks(map_fd[0]);
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char filename[256];
+	int delay = 1;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	setrlimit(RLIMIT_MEMLOCK, &r);
+
+	signal(SIGINT, int_exit);
+
+	if (load_kallsyms()) {
+		printf("failed to process /proc/kallsyms\n");
+		return 2;
+	}
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (argc > 1)
+		delay = atoi(argv[1]);
+	sleep(delay);
+	print_stacks(map_fd[0]);
+
+	return 0;
+}
-- 
2.4.6

* Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
  2016-02-18  3:58 [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2016-02-18  3:58 ` [PATCH net-next 3/3] samples/bpf: offwaketime example Alexei Starovoitov
@ 2016-02-20  5:25 ` David Miller
  2016-02-25 14:24   ` Peter Zijlstra
  3 siblings, 1 reply; 16+ messages in thread
From: David Miller @ 2016-02-20  5:25 UTC (permalink / raw)
  To: ast
  Cc: a.p.zijlstra, mingo, rostedt, wangnan0, daniel, brendan.d.gregg,
	netdev, linux-kernel

From: Alexei Starovoitov <ast@fb.com>
Date: Wed, 17 Feb 2016 19:58:56 -0800

> This patch set introduces new map type to store stack traces and
> corresponding bpf_get_stackid() helper.
 ...

Series applied, thanks Alexei.

* Re: [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-18  3:58 ` [PATCH net-next 1/3] perf: generalize perf_callchain Alexei Starovoitov
@ 2016-02-25 14:18   ` Peter Zijlstra
  2016-02-25 16:37     ` Alexei Starovoitov
  2016-02-25 16:47   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 14:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On Wed, Feb 17, 2016 at 07:58:57PM -0800, Alexei Starovoitov wrote:
> . avoid walking the stack when there is no room left in the buffer
> . generalize get_perf_callchain() to be called from bpf helper

If it does two things it should be two patches.

> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  arch/x86/include/asm/stacktrace.h |  2 +-
>  arch/x86/kernel/cpu/perf_event.c  |  4 ++--
>  arch/x86/kernel/dumpstack.c       |  6 ++++--
>  arch/x86/kernel/stacktrace.c      | 18 +++++++++++-------
>  arch/x86/oprofile/backtrace.c     |  3 ++-
>  include/linux/perf_event.h        | 13 +++++++++++--
>  kernel/events/callchain.c         | 32 ++++++++++++++++++++------------
>  kernel/events/internal.h          |  2 --
>  8 files changed, 51 insertions(+), 29 deletions(-)

And at the very least this should have had a note that it doesn't break
all the other archs that implement perf-callchain stuff.

* Re: [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE
  2016-02-18  3:58 ` [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE Alexei Starovoitov
@ 2016-02-25 14:23   ` Peter Zijlstra
  2016-02-25 16:42     ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 14:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On Wed, Feb 17, 2016 at 07:58:58PM -0800, Alexei Starovoitov wrote:
> +static u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
> +{
> +	struct pt_regs *regs = (struct pt_regs *) (long) r1;
> +	struct bpf_map *map = (struct bpf_map *) (long) r2;
> +	struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
> +	struct perf_callchain_entry *trace;
> +	struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
> +	u32 max_depth = map->value_size / 8;
> +	/* stack_map_alloc() checks that max_depth <= PERF_MAX_STACK_DEPTH */
> +	u32 init_nr = PERF_MAX_STACK_DEPTH - max_depth;
> +	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
> +	u32 hash, id, trace_nr, trace_len;
> +	bool user = flags & BPF_F_USER_STACK;
> +	bool kernel = !user;
> +	u64 *ips;
> +
> +	if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
> +			       BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
> +		return -EINVAL;
> +
> +	trace = get_perf_callchain(regs, init_nr, kernel, user, false, false);
> +
> +	if (unlikely(!trace))
> +		/* couldn't fetch the stack trace */
> +		return -EFAULT;
> +
> +	/* get_perf_callchain() guarantees that trace->nr >= init_nr
> +	 * and trace-nr <= PERF_MAX_STACK_DEPTH, so trace_nr <= max_depth
> +	 */
> +	trace_nr = trace->nr - init_nr;
> +
> +	if (trace_nr <= skip)
> +		/* skipping more than usable stack trace */
> +		return -EFAULT;
> +
> +	trace_nr -= skip;
> +	trace_len = trace_nr * sizeof(u64);
> +	ips = trace->ip + skip + init_nr;
> +	hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
> +	id = hash & (smap->n_buckets - 1);

Its not at all clear where the corresponding rcu_read_lock() is at.

> +	bucket = rcu_dereference(smap->buckets[id]);
> +
> +	if (bucket && bucket->hash == hash) {
> +		if (flags & BPF_F_FAST_STACK_CMP)
> +			return id;
> +		if (bucket->nr == trace_nr &&
> +		    memcmp(bucket->ip, ips, trace_len) == 0)
> +			return id;
> +	}
> +
> +	/* this call stack is not in the map, try to add it */
> +	if (bucket && !(flags & BPF_F_REUSE_STACKID))
> +		return -EEXIST;
> +
> +	new_bucket = kmalloc(sizeof(struct stack_map_bucket) + map->value_size,
> +			     GFP_ATOMIC | __GFP_NOWARN);
> +	if (unlikely(!new_bucket))
> +		return -ENOMEM;
> +
> +	memcpy(new_bucket->ip, ips, trace_len);
> +	memset(new_bucket->ip + trace_len / 8, 0, map->value_size - trace_len);
> +	new_bucket->hash = hash;
> +	new_bucket->nr = trace_nr;
> +
> +	old_bucket = xchg(&smap->buckets[id], new_bucket);
> +	if (old_bucket)
> +		kfree_rcu(old_bucket, rcu);
> +	return id;
> +}

* Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
  2016-02-20  5:25 ` [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map David Miller
@ 2016-02-25 14:24   ` Peter Zijlstra
  2016-02-25 16:44     ` David Miller
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 14:24 UTC (permalink / raw)
  To: David Miller
  Cc: ast, mingo, rostedt, wangnan0, daniel, brendan.d.gregg, netdev,
	linux-kernel

On Sat, Feb 20, 2016 at 12:25:05AM -0500, David Miller wrote:
> From: Alexei Starovoitov <ast@fb.com>
> Date: Wed, 17 Feb 2016 19:58:56 -0800
> 
> > This patch set introduces new map type to store stack traces and
> > corresponding bpf_get_stackid() helper.
>  ...
> 
> Series applied, thanks Alexei.

David, please do not apply perf patches without an ACK from me or mingo.

* Re: [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-25 14:18   ` Peter Zijlstra
@ 2016-02-25 16:37     ` Alexei Starovoitov
  2016-02-25 16:45       ` Peter Zijlstra
  2016-02-25 16:48       ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-25 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On 2/25/16 6:18 AM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 07:58:57PM -0800, Alexei Starovoitov wrote:
>> . avoid walking the stack when there is no room left in the buffer
>> . generalize get_perf_callchain() to be called from bpf helper
>
> If it does two things it should be two patches.

could have been two patches, but it will only add more churn
to the same lines. what's the concern?

>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>> ---
>>   arch/x86/include/asm/stacktrace.h |  2 +-
>>   arch/x86/kernel/cpu/perf_event.c  |  4 ++--
>>   arch/x86/kernel/dumpstack.c       |  6 ++++--
>>   arch/x86/kernel/stacktrace.c      | 18 +++++++++++-------
>>   arch/x86/oprofile/backtrace.c     |  3 ++-
>>   include/linux/perf_event.h        | 13 +++++++++++--
>>   kernel/events/callchain.c         | 32 ++++++++++++++++++++------------
>>   kernel/events/internal.h          |  2 --
>>   8 files changed, 51 insertions(+), 29 deletions(-)
>
> And at the very least this should have had a note that it doesn't break
> all the other archs that implement perf-callchain stuff.

the cross-arch interface is two weak functions
perf_callchain_kernel() and perf_callchain_user()
and back into generic via perf_callchain_store().
Nothing changes there. The code speaks for itself.
"non-x86 archs are not broken" would be a silly comment.

* Re: [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE
  2016-02-25 14:23   ` Peter Zijlstra
@ 2016-02-25 16:42     ` Alexei Starovoitov
  2016-02-25 16:50       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-25 16:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On 2/25/16 6:23 AM, Peter Zijlstra wrote:
>> +	id = hash & (smap->n_buckets - 1);
> Its not at all clear where the corresponding rcu_read_lock() is at.
>
>> >+	bucket = rcu_dereference(smap->buckets[id]);

bpf programs of all types always execute under rcu_read_lock().
This is fundamental for maps and the majority of the helpers;
that's why there is no WARN_ON(rcu_read_lock_held) in this helper,
since we already have it in many other places.
The rcu_read_lock() for the kprobe program type is in trace_call_bpf().

* Re: [PATCH net-next 0/3] bpf_get_stackid() and stack_trace map
  2016-02-25 14:24   ` Peter Zijlstra
@ 2016-02-25 16:44     ` David Miller
  0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2016-02-25 16:44 UTC (permalink / raw)
  To: peterz
  Cc: ast, mingo, rostedt, wangnan0, daniel, brendan.d.gregg, netdev,
	linux-kernel

From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 25 Feb 2016 15:24:54 +0100

> On Sat, Feb 20, 2016 at 12:25:05AM -0500, David Miller wrote:
>> From: Alexei Starovoitov <ast@fb.com>
>> Date: Wed, 17 Feb 2016 19:58:56 -0800
>> 
>> > This patch set introduces new map type to store stack traces and
>> > corresponding bpf_get_stackid() helper.
>>  ...
>> 
>> Series applied, thanks Alexei.
> 
> David, please do not apply perf patches without an ACK from me or mingo.

Sure thing.

* Re: [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-25 16:37     ` Alexei Starovoitov
@ 2016-02-25 16:45       ` Peter Zijlstra
  2016-02-25 16:48       ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 16:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On Thu, Feb 25, 2016 at 08:37:34AM -0800, Alexei Starovoitov wrote:
> On 2/25/16 6:18 AM, Peter Zijlstra wrote:
> >On Wed, Feb 17, 2016 at 07:58:57PM -0800, Alexei Starovoitov wrote:
> >>. avoid walking the stack when there is no room left in the buffer
> >>. generalize get_perf_callchain() to be called from bpf helper
> >
> >If it does two things it should be two patches.
> 
> could have been two patches, but it will only add more churn
> to the same lines. what's the concern?

It makes review easier, shows which modification is for what purpose.

Also the changelog really needs more; it should for example explain what
BPF needs from the callchain code, and preferably why.

* Re: [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-18  3:58 ` [PATCH net-next 1/3] perf: generalize perf_callchain Alexei Starovoitov
  2016-02-25 14:18   ` Peter Zijlstra
@ 2016-02-25 16:47   ` Peter Zijlstra
  2016-02-25 17:27     ` Alexei Starovoitov
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 16:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On Wed, Feb 17, 2016 at 07:58:57PM -0800, Alexei Starovoitov wrote:
> +static inline int perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
>  {
> +	if (entry->nr < PERF_MAX_STACK_DEPTH) {
>  		entry->ip[entry->nr++] = ip;
> +		return 0;
> +	} else {
> +		return -1; /* no more room, stop walking the stack */
> +	}
>  }

Why 0 and -1 ?

What's wrong with something like:

static inline bool perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
{
	if (entry->nr < PERF_MAX_STACK_DEPTH) {
		entry->ip[entry->nr++] = ip;
		return true;
	}
	return false;
}

* Re: [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-25 16:37     ` Alexei Starovoitov
  2016-02-25 16:45       ` Peter Zijlstra
@ 2016-02-25 16:48       ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 16:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On Thu, Feb 25, 2016 at 08:37:34AM -0800, Alexei Starovoitov wrote:
> On 2/25/16 6:18 AM, Peter Zijlstra wrote:

> >>  arch/x86/include/asm/stacktrace.h |  2 +-
> >>  arch/x86/kernel/cpu/perf_event.c  |  4 ++--
> >>  arch/x86/kernel/dumpstack.c       |  6 ++++--
> >>  arch/x86/kernel/stacktrace.c      | 18 +++++++++++-------
> >>  arch/x86/oprofile/backtrace.c     |  3 ++-
> >>  include/linux/perf_event.h        | 13 +++++++++++--
> >>  kernel/events/callchain.c         | 32 ++++++++++++++++++++------------
> >>  kernel/events/internal.h          |  2 --
> >>  8 files changed, 51 insertions(+), 29 deletions(-)
> >
> >And at the very least this should have had a note that it doesn't break
> >all the other archs that implement perf-callchain stuff.
> 
> the cross-arch interface is two weak functions
> perf_callchain_kernel() and perf_callchain_user()
> and back into generic via perf_callchain_store().
> Nothing changes there. The code speaks for itself.
> "non-x86 archs are not broken" would be a silly comment.

No, a diffstat like that immediately makes me wonder if you've even
bothered looking at !x86.

A statement to this effect would've shown you did indeed consider it.

* Re: [PATCH net-next 2/3] bpf: introduce BPF_MAP_TYPE_STACK_TRACE
  2016-02-25 16:42     ` Alexei Starovoitov
@ 2016-02-25 16:50       ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2016-02-25 16:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On Thu, Feb 25, 2016 at 08:42:40AM -0800, Alexei Starovoitov wrote:
> On 2/25/16 6:23 AM, Peter Zijlstra wrote:
> >>+	id = hash & (smap->n_buckets - 1);
> >Its not at all clear where the corresponding rcu_read_lock() is at.
> >
> >>>+	bucket = rcu_dereference(smap->buckets[id]);
> 
> bpf programs of all types are always executing under rcu_read_lock().
> This is fundamental for maps and majority of the helpers
> that's why there is no WARN_ON(rcu_read_lock_held) in this helper,
> since we already have it in many other places.
> The rcu_read_lock() for kprobe type is in trace_call_bpf().

OK, just not clear from reading this patch in isolation I suppose.

* Re: [PATCH net-next 1/3] perf: generalize perf_callchain
  2016-02-25 16:47   ` Peter Zijlstra
@ 2016-02-25 17:27     ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2016-02-25 17:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David S. Miller, Ingo Molnar, Steven Rostedt, Wang Nan,
	Daniel Borkmann, Brendan Gregg, netdev, linux-kernel

On 2/25/16 8:47 AM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 07:58:57PM -0800, Alexei Starovoitov wrote:
>> +static inline int perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
>>   {
>> +	if (entry->nr < PERF_MAX_STACK_DEPTH) {
>>   		entry->ip[entry->nr++] = ip;
>> +		return 0;
>> +	} else {
>> +		return -1; /* no more room, stop walking the stack */
>> +	}
>>   }
>
> Why 0 and -1 ?

because that's the interface you already have for callbacks in
'struct stacktrace_ops', including the comment there:
/* On negative return stop dumping */

> What's wrong with something like:
>
> static inline bool perf_callchain_store(struct perf_callchain_entry *entry, u64 ip)
> {
> 	if (entry->nr < PERF_MAX_STACK_DEPTH) {
> 		entry->ip[entry->nr++] = ip;
> 		return true;
> 	}
> 	return false;
> }

I would prefer something like this as well, but it would look
inconsistent with what is already there. To make it bool
one would need to change struct stacktrace_ops for all archs
and touch a lot of files all over the tree.
Way more than 51 insertions(+), 29 deletions(-) as this patch did
for no real gain.
It's better to be consistent with existing code.
