netdev.vger.kernel.org archive mirror
* [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event
@ 2017-10-24  6:53 Yonghong Song
  2017-10-24  6:53 ` [PATCH net-next v3 1/3] bpf: use the same condition in perf event set/free bpf handler Yonghong Song
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Yonghong Song @ 2017-10-24  6:53 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, kafai, netdev; +Cc: kernel-team

This patch set adds support to permit multiple bpf prog attachments
for a single perf tracepoint event. Patch 1 does some cleanup such
that perf_event_{set|free}_bpf_handler is called under the
same condition. Patch 2 has the core implementation, and
Patch 3 adds a test case.
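
For illustration, below is a minimal userspace sketch of what the series
enables. It is only a sketch: attach_prog_to_tp() is a hypothetical helper,
prog_fd1/prog_fd2 are assumed to be program fds obtained via BPF_PROG_LOAD,
and error handling is trimmed.

#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Attach one bpf program to a tracepoint through its own perf event fd. */
static int attach_prog_to_tp(int tp_id, int prog_fd)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_TRACEPOINT,
		.size = sizeof(attr),
		.config = tp_id,	/* id from tracing/events/.../id */
		.sample_period = 1,
		.wakeup_events = 1,
	};
	int efd;

	efd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
		      0 /* cpu */, -1 /* group_fd */, 0 /* flags */);
	if (efd < 0)
		return -1;
	if (ioctl(efd, PERF_EVENT_IOC_ENABLE, 0) ||
	    ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd)) {
		close(efd);
		return -1;
	}
	return efd;
}

/* Before this series, the second attachment below failed with -EEXIST
 * because tp_event->prog was already set; with patch 2 both programs
 * run, in attachment order, when the tracepoint fires:
 *
 *	attach_prog_to_tp(tp_id, prog_fd1);
 *	attach_prog_to_tp(tp_id, prog_fd2);
 */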

Changelogs:
v2 -> v3:
  . fix compilation error.
v1 -> v2:
  . fix a potential deadlock issue discovered by Daniel.
  . fix some coding style issues.

Yonghong Song (3):
  bpf: use the same condition in perf event set/free bpf handler
  bpf: permit multiple bpf attachments for a single perf event
  bpf: add a test case to test single tp multiple bpf attachment

 include/linux/bpf.h           | 30 +++++++++++++---
 include/linux/trace_events.h  | 43 ++++++++++++++++++++---
 include/trace/perf.h          |  6 ++--
 kernel/bpf/core.c             | 81 ++++++++++++++++++++++++++++++++++++++++++
 kernel/events/core.c          | 30 ++++++----------
 kernel/trace/bpf_trace.c      | 82 ++++++++++++++++++++++++++++++++++++++++---
 kernel/trace/trace_kprobe.c   |  6 ++--
 kernel/trace/trace_syscalls.c | 34 ++++++++++--------
 kernel/trace/trace_uprobe.c   |  3 +-
 samples/bpf/syscall_tp_user.c | 66 +++++++++++++++++++++++++++-------
 10 files changed, 310 insertions(+), 71 deletions(-)

-- 
2.9.5

* [PATCH net-next v3 1/3] bpf: use the same condition in perf event set/free bpf handler
  2017-10-24  6:53 [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event Yonghong Song
@ 2017-10-24  6:53 ` Yonghong Song
  2017-10-24  6:53 ` [PATCH net-next v3 2/3] bpf: permit multiple bpf attachments for a single perf event Yonghong Song
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Yonghong Song @ 2017-10-24  6:53 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, kafai, netdev; +Cc: kernel-team

This is a cleanup so that perf_event_free_bpf_prog does
the same check as perf_event_set_bpf_prog already does.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/events/core.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 31ee304..9f78a682 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8191,10 +8191,10 @@ static void perf_event_free_bpf_prog(struct perf_event *event)
 {
 	struct bpf_prog *prog;
 
-	perf_event_free_bpf_handler(event);
-
-	if (!event->tp_event)
+	if (event->attr.type != PERF_TYPE_TRACEPOINT) {
+		perf_event_free_bpf_handler(event);
 		return;
+	}
 
 	prog = event->tp_event->prog;
 	if (prog && event->tp_event->bpf_prog_owner == event) {
-- 
2.9.5

* [PATCH net-next v3 2/3] bpf: permit multiple bpf attachments for a single perf event
  2017-10-24  6:53 [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event Yonghong Song
  2017-10-24  6:53 ` [PATCH net-next v3 1/3] bpf: use the same condition in perf event set/free bpf handler Yonghong Song
@ 2017-10-24  6:53 ` Yonghong Song
  2017-10-25  9:32   ` Jesper Dangaard Brouer
  2017-10-24  6:53 ` [PATCH net-next v3 3/3] bpf: add a test case to test single tp multiple bpf attachment Yonghong Song
  2017-10-25  1:48 ` [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event David Miller
  3 siblings, 1 reply; 7+ messages in thread
From: Yonghong Song @ 2017-10-24  6:53 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, kafai, netdev; +Cc: kernel-team

This patch enables multiple bpf attachments for a
single kprobe/uprobe/tracepoint trace event.
Each trace_event keeps a list of attached perf events.
When an event happens, all attached bpf programs will
be executed based on the order of attachment.

A global bpf_event_mutex lock is introduced to protect
prog_array attaching and detaching. An alternative would
be to introduce a mutex lock in every trace_event_call
structure, but that would take a lot of extra memory.
So a global bpf_event_mutex lock is a good compromise.

The bpf prog detachment involves allocation of memory.
If the allocation fails, a dummy do-nothing program
will replace the to-be-detached program in place.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h           | 30 +++++++++++++---
 include/linux/trace_events.h  | 43 ++++++++++++++++++++---
 include/trace/perf.h          |  6 ++--
 kernel/bpf/core.c             | 81 ++++++++++++++++++++++++++++++++++++++++++
 kernel/events/core.c          | 26 +++++---------
 kernel/trace/bpf_trace.c      | 82 ++++++++++++++++++++++++++++++++++++++++---
 kernel/trace/trace_kprobe.c   |  6 ++--
 kernel/trace/trace_syscalls.c | 34 ++++++++++--------
 kernel/trace/trace_uprobe.c   |  3 +-
 9 files changed, 255 insertions(+), 56 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1e334b2..172be7f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -273,18 +273,38 @@ int bpf_prog_array_length(struct bpf_prog_array __rcu *progs);
 int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 				__u32 __user *prog_ids, u32 cnt);
 
-#define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
+void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
+				struct bpf_prog *old_prog);
+int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
+			struct bpf_prog *exclude_prog,
+			struct bpf_prog *include_prog,
+			struct bpf_prog_array **new_array);
+
+#define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null)	\
 	({						\
-		struct bpf_prog **_prog;		\
+		struct bpf_prog **_prog, *__prog;	\
+		struct bpf_prog_array *_array;		\
 		u32 _ret = 1;				\
 		rcu_read_lock();			\
-		_prog = rcu_dereference(array)->progs;	\
-		for (; *_prog; _prog++)			\
-			_ret &= func(*_prog, ctx);	\
+		_array = rcu_dereference(array);	\
+		if (unlikely(check_non_null && !_array))\
+			goto _out;			\
+		_prog = _array->progs;			\
+		while ((__prog = READ_ONCE(*_prog))) {	\
+			_ret &= func(__prog, ctx);	\
+			_prog++;			\
+		}					\
+_out:							\
 		rcu_read_unlock();			\
 		_ret;					\
 	 })
 
+#define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
+	__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
+
+#define BPF_PROG_RUN_ARRAY_CHECK(array, ctx, func)	\
+	__BPF_PROG_RUN_ARRAY(array, ctx, func, true)
+
 #ifdef CONFIG_BPF_SYSCALL
 DECLARE_PER_CPU(int, bpf_prog_active);
 
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 2e0f222..fc6aeca 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -271,14 +271,37 @@ struct trace_event_call {
 #ifdef CONFIG_PERF_EVENTS
 	int				perf_refcount;
 	struct hlist_head __percpu	*perf_events;
-	struct bpf_prog			*prog;
-	struct perf_event		*bpf_prog_owner;
+	struct bpf_prog_array __rcu	*prog_array;
 
 	int	(*perf_perm)(struct trace_event_call *,
 			     struct perf_event *);
 #endif
 };
 
+#ifdef CONFIG_PERF_EVENTS
+static inline bool bpf_prog_array_valid(struct trace_event_call *call)
+{
+	/*
+	 * This inline function checks whether call->prog_array
+	 * is valid or not. The function is called in various places,
+	 * outside rcu_read_lock/unlock, as a heuristic to speed up execution.
+	 *
+	 * If this function returns true, and later call->prog_array
+	 * becomes false inside rcu_read_lock/unlock region,
+	 * we bail out then. If this function returns false,
+	 * there is a risk that we might miss a few events if the checking
+	 * were delayed until inside rcu_read_lock/unlock region and
+	 * call->prog_array happened to become non-NULL then.
+	 *
+	 * Here, READ_ONCE() is used instead of rcu_access_pointer().
+	 * rcu_access_pointer() requires the actual definition of
+	 * "struct bpf_prog_array" while READ_ONCE() only needs
+	 * a declaration of the same type.
+	 */
+	return !!READ_ONCE(call->prog_array);
+}
+#endif
+
 static inline const char *
 trace_event_name(struct trace_event_call *call)
 {
@@ -435,12 +458,23 @@ trace_trigger_soft_disabled(struct trace_event_file *file)
 }
 
 #ifdef CONFIG_BPF_EVENTS
-unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx);
+unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx);
+int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
+void perf_event_detach_bpf_prog(struct perf_event *event);
 #else
-static inline unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	return 1;
 }
+
+static inline int
+perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void perf_event_detach_bpf_prog(struct perf_event *event) { }
+
 #endif
 
 enum {
@@ -511,6 +545,7 @@ perf_trace_buf_submit(void *raw_data, int size, int rctx, u16 type,
 {
 	perf_tp_event(type, count, raw_data, size, regs, head, rctx, task, event);
 }
+
 #endif
 
 #endif /* _LINUX_TRACE_EVENT_H */
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 04fe68bb..14f127b6 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -34,7 +34,6 @@ perf_trace_##call(void *__data, proto)					\
 	struct trace_event_call *event_call = __data;			\
 	struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
 	struct trace_event_raw_##call *entry;				\
-	struct bpf_prog *prog = event_call->prog;			\
 	struct pt_regs *__regs;						\
 	u64 __count = 1;						\
 	struct task_struct *__task = NULL;				\
@@ -46,8 +45,9 @@ perf_trace_##call(void *__data, proto)					\
 	__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
 									\
 	head = this_cpu_ptr(event_call->perf_events);			\
-	if (!prog && __builtin_constant_p(!__task) && !__task &&	\
-				hlist_empty(head))			\
+	if (!bpf_prog_array_valid(event_call) &&			\
+	    __builtin_constant_p(!__task) && !__task &&			\
+	    hlist_empty(head))						\
 		return;							\
 									\
 	__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8e7c8bf..7fe4487 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1394,6 +1394,20 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
 
+static unsigned int __bpf_prog_ret1(const void *ctx,
+				    const struct bpf_insn *insn)
+{
+	return 1;
+}
+
+static struct bpf_prog_dummy {
+	struct bpf_prog prog;
+} dummy_bpf_prog = {
+	.prog = {
+		.bpf_func = __bpf_prog_ret1,
+	},
+};
+
 /* to avoid allocating empty bpf_prog_array for cgroups that
  * don't have bpf program attached use one global 'empty_prog_array'
  * It will not be modified the caller of bpf_prog_array_alloc()
@@ -1463,6 +1477,73 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 	return 0;
 }
 
+void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
+				struct bpf_prog *old_prog)
+{
+	struct bpf_prog **prog = progs->progs;
+
+	for (; *prog; prog++)
+		if (*prog == old_prog) {
+			WRITE_ONCE(*prog, &dummy_bpf_prog.prog);
+			break;
+		}
+}
+
+int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
+			struct bpf_prog *exclude_prog,
+			struct bpf_prog *include_prog,
+			struct bpf_prog_array **new_array)
+{
+	int new_prog_cnt, carry_prog_cnt = 0;
+	struct bpf_prog **existing_prog;
+	struct bpf_prog_array *array;
+	int new_prog_idx = 0;
+
+	/* Figure out how many existing progs we need to carry over to
+	 * the new array.
+	 */
+	if (old_array) {
+		existing_prog = old_array->progs;
+		for (; *existing_prog; existing_prog++) {
+			if (*existing_prog != exclude_prog &&
+			    *existing_prog != &dummy_bpf_prog.prog)
+				carry_prog_cnt++;
+			if (*existing_prog == include_prog)
+				return -EEXIST;
+		}
+	}
+
+	/* How many progs (not NULL) will be in the new array? */
+	new_prog_cnt = carry_prog_cnt;
+	if (include_prog)
+		new_prog_cnt += 1;
+
+	/* Do we have any prog (not NULL) in the new array? */
+	if (!new_prog_cnt) {
+		*new_array = NULL;
+		return 0;
+	}
+
+	/* +1 as the end of prog_array is marked with NULL */
+	array = bpf_prog_array_alloc(new_prog_cnt + 1, GFP_KERNEL);
+	if (!array)
+		return -ENOMEM;
+
+	/* Fill in the new prog array */
+	if (carry_prog_cnt) {
+		existing_prog = old_array->progs;
+		for (; *existing_prog; existing_prog++)
+			if (*existing_prog != exclude_prog &&
+			    *existing_prog != &dummy_bpf_prog.prog)
+				array->progs[new_prog_idx++] = *existing_prog;
+	}
+	if (include_prog)
+		array->progs[new_prog_idx++] = include_prog;
+	array->progs[new_prog_idx] = NULL;
+	*new_array = array;
+	return 0;
+}
+
 static void bpf_prog_free_deferred(struct work_struct *work)
 {
 	struct bpf_prog_aux *aux;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9f78a682..9660ee6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7954,11 +7954,9 @@ void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx,
 			       struct pt_regs *regs, struct hlist_head *head,
 			       struct task_struct *task)
 {
-	struct bpf_prog *prog = call->prog;
-
-	if (prog) {
+	if (bpf_prog_array_valid(call)) {
 		*(struct pt_regs **)raw_data = regs;
-		if (!trace_call_bpf(prog, raw_data) || hlist_empty(head)) {
+		if (!trace_call_bpf(call, raw_data) || hlist_empty(head)) {
 			perf_swevent_put_recursion_context(rctx);
 			return;
 		}
@@ -8147,13 +8145,11 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 {
 	bool is_kprobe, is_tracepoint, is_syscall_tp;
 	struct bpf_prog *prog;
+	int ret;
 
 	if (event->attr.type != PERF_TYPE_TRACEPOINT)
 		return perf_event_set_bpf_handler(event, prog_fd);
 
-	if (event->tp_event->prog)
-		return -EEXIST;
-
 	is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE;
 	is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
 	is_syscall_tp = is_syscall_trace_event(event->tp_event);
@@ -8181,26 +8177,20 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
 			return -EACCES;
 		}
 	}
-	event->tp_event->prog = prog;
-	event->tp_event->bpf_prog_owner = event;
 
-	return 0;
+	ret = perf_event_attach_bpf_prog(event, prog);
+	if (ret)
+		bpf_prog_put(prog);
+	return ret;
 }
 
 static void perf_event_free_bpf_prog(struct perf_event *event)
 {
-	struct bpf_prog *prog;
-
 	if (event->attr.type != PERF_TYPE_TRACEPOINT) {
 		perf_event_free_bpf_handler(event);
 		return;
 	}
-
-	prog = event->tp_event->prog;
-	if (prog && event->tp_event->bpf_prog_owner == event) {
-		event->tp_event->prog = NULL;
-		bpf_prog_put(prog);
-	}
+	perf_event_detach_bpf_prog(event);
 }
 
 #else
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3126da2..b65011d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -17,7 +17,7 @@
 
 /**
  * trace_call_bpf - invoke BPF program
- * @prog: BPF program
+ * @call: tracepoint event
  * @ctx: opaque context pointer
  *
  * kprobe handlers execute BPF programs via this helper.
@@ -29,7 +29,7 @@
  * 1 - store kprobe event into ring buffer
  * Other values are reserved and currently alias to 1
  */
-unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
+unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
 {
 	unsigned int ret;
 
@@ -49,9 +49,22 @@ unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
 		goto out;
 	}
 
-	rcu_read_lock();
-	ret = BPF_PROG_RUN(prog, ctx);
-	rcu_read_unlock();
+	/*
+	 * Instead of moving rcu_read_lock/rcu_dereference/rcu_read_unlock
+	 * to all call sites, we did a bpf_prog_array_valid() there to check
+	 * whether call->prog_array is empty or not, which is
+	 * a heuristic to speed up execution.
+	 *
+	 * If bpf_prog_array_valid() fetched prog_array was
+	 * non-NULL, we go into trace_call_bpf() and do the actual
+	 * proper rcu_dereference() under RCU lock.
+	 * If it turns out that prog_array is NULL then, we bail out.
+	 * For the opposite, if the bpf_prog_array_valid() fetched pointer
+	 * was NULL, you'll skip the prog_array with the risk of missing
+	 * out on events when it was updated in between this and the
+	 * rcu_dereference() which is accepted risk.
+	 */
+	ret = BPF_PROG_RUN_ARRAY_CHECK(call->prog_array, ctx, BPF_PROG_RUN);
 
  out:
 	__this_cpu_dec(bpf_prog_active);
@@ -741,3 +754,62 @@ const struct bpf_verifier_ops perf_event_verifier_ops = {
 
 const struct bpf_prog_ops perf_event_prog_ops = {
 };
+
+static DEFINE_MUTEX(bpf_event_mutex);
+
+int perf_event_attach_bpf_prog(struct perf_event *event,
+			       struct bpf_prog *prog)
+{
+	struct bpf_prog_array __rcu *old_array;
+	struct bpf_prog_array *new_array;
+	int ret = -EEXIST;
+
+	mutex_lock(&bpf_event_mutex);
+
+	if (event->prog)
+		goto out;
+
+	old_array = rcu_dereference_protected(event->tp_event->prog_array,
+					      lockdep_is_held(&bpf_event_mutex));
+	ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array);
+	if (ret < 0)
+		goto out;
+
+	/* set the new array to event->tp_event and set event->prog */
+	event->prog = prog;
+	rcu_assign_pointer(event->tp_event->prog_array, new_array);
+	bpf_prog_array_free(old_array);
+
+out:
+	mutex_unlock(&bpf_event_mutex);
+	return ret;
+}
+
+void perf_event_detach_bpf_prog(struct perf_event *event)
+{
+	struct bpf_prog_array __rcu *old_array;
+	struct bpf_prog_array *new_array;
+	int ret;
+
+	mutex_lock(&bpf_event_mutex);
+
+	if (!event->prog)
+		goto out;
+
+	old_array = rcu_dereference_protected(event->tp_event->prog_array,
+					      lockdep_is_held(&bpf_event_mutex));
+
+	ret = bpf_prog_array_copy(old_array, event->prog, NULL, &new_array);
+	if (ret < 0) {
+		bpf_prog_array_delete_safe(old_array, event->prog);
+	} else {
+		rcu_assign_pointer(event->tp_event->prog_array, new_array);
+		bpf_prog_array_free(old_array);
+	}
+
+	bpf_prog_put(event->prog);
+	event->prog = NULL;
+
+out:
+	mutex_unlock(&bpf_event_mutex);
+}
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 8a907e1..abf92e4 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1174,13 +1174,12 @@ static void
 kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 {
 	struct trace_event_call *call = &tk->tp.call;
-	struct bpf_prog *prog = call->prog;
 	struct kprobe_trace_entry_head *entry;
 	struct hlist_head *head;
 	int size, __size, dsize;
 	int rctx;
 
-	if (prog && !trace_call_bpf(prog, regs))
+	if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs))
 		return;
 
 	head = this_cpu_ptr(call->perf_events);
@@ -1210,13 +1209,12 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
 		    struct pt_regs *regs)
 {
 	struct trace_event_call *call = &tk->tp.call;
-	struct bpf_prog *prog = call->prog;
 	struct kretprobe_trace_entry_head *entry;
 	struct hlist_head *head;
 	int size, __size, dsize;
 	int rctx;
 
-	if (prog && !trace_call_bpf(prog, regs))
+	if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs))
 		return;
 
 	head = this_cpu_ptr(call->perf_events);
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 696afe7..71a6af3 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -559,9 +559,10 @@ static DECLARE_BITMAP(enabled_perf_exit_syscalls, NR_syscalls);
 static int sys_perf_refcount_enter;
 static int sys_perf_refcount_exit;
 
-static int perf_call_bpf_enter(struct bpf_prog *prog, struct pt_regs *regs,
-			      struct syscall_metadata *sys_data,
-			      struct syscall_trace_enter *rec) {
+static int perf_call_bpf_enter(struct trace_event_call *call, struct pt_regs *regs,
+			       struct syscall_metadata *sys_data,
+			       struct syscall_trace_enter *rec)
+{
 	struct syscall_tp_t {
 		unsigned long long regs;
 		unsigned long syscall_nr;
@@ -573,7 +574,7 @@ static int perf_call_bpf_enter(struct bpf_prog *prog, struct pt_regs *regs,
 	param.syscall_nr = rec->nr;
 	for (i = 0; i < sys_data->nb_args; i++)
 		param.args[i] = rec->args[i];
-	return trace_call_bpf(prog, &param);
+	return trace_call_bpf(call, &param);
 }
 
 static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
@@ -581,7 +582,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_enter *rec;
 	struct hlist_head *head;
-	struct bpf_prog *prog;
+	bool valid_prog_array;
 	int syscall_nr;
 	int rctx;
 	int size;
@@ -596,9 +597,9 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	if (!sys_data)
 		return;
 
-	prog = READ_ONCE(sys_data->enter_event->prog);
 	head = this_cpu_ptr(sys_data->enter_event->perf_events);
-	if (!prog && hlist_empty(head))
+	valid_prog_array = bpf_prog_array_valid(sys_data->enter_event);
+	if (!valid_prog_array && hlist_empty(head))
 		return;
 
 	/* get the size after alignment with the u32 buffer size field */
@@ -614,7 +615,8 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	syscall_get_arguments(current, regs, 0, sys_data->nb_args,
 			       (unsigned long *)&rec->args);
 
-	if ((prog && !perf_call_bpf_enter(prog, regs, sys_data, rec)) ||
+	if ((valid_prog_array &&
+	     !perf_call_bpf_enter(sys_data->enter_event, regs, sys_data, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
 		return;
@@ -659,8 +661,9 @@ static void perf_sysenter_disable(struct trace_event_call *call)
 	mutex_unlock(&syscall_trace_lock);
 }
 
-static int perf_call_bpf_exit(struct bpf_prog *prog, struct pt_regs *regs,
-			      struct syscall_trace_exit *rec) {
+static int perf_call_bpf_exit(struct trace_event_call *call, struct pt_regs *regs,
+			      struct syscall_trace_exit *rec)
+{
 	struct syscall_tp_t {
 		unsigned long long regs;
 		unsigned long syscall_nr;
@@ -670,7 +673,7 @@ static int perf_call_bpf_exit(struct bpf_prog *prog, struct pt_regs *regs,
 	*(struct pt_regs **)&param = regs;
 	param.syscall_nr = rec->nr;
 	param.ret = rec->ret;
-	return trace_call_bpf(prog, &param);
+	return trace_call_bpf(call, &param);
 }
 
 static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
@@ -678,7 +681,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	struct syscall_metadata *sys_data;
 	struct syscall_trace_exit *rec;
 	struct hlist_head *head;
-	struct bpf_prog *prog;
+	bool valid_prog_array;
 	int syscall_nr;
 	int rctx;
 	int size;
@@ -693,9 +696,9 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	if (!sys_data)
 		return;
 
-	prog = READ_ONCE(sys_data->exit_event->prog);
 	head = this_cpu_ptr(sys_data->exit_event->perf_events);
-	if (!prog && hlist_empty(head))
+	valid_prog_array = bpf_prog_array_valid(sys_data->exit_event);
+	if (!valid_prog_array && hlist_empty(head))
 		return;
 
 	/* We can probably do that at build time */
@@ -709,7 +712,8 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	rec->nr = syscall_nr;
 	rec->ret = syscall_get_return_value(current, regs);
 
-	if ((prog && !perf_call_bpf_exit(prog, regs, rec)) ||
+	if ((valid_prog_array &&
+	     !perf_call_bpf_exit(sys_data->exit_event, regs, rec)) ||
 	    hlist_empty(head)) {
 		perf_swevent_put_recursion_context(rctx);
 		return;
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 4525e02..153c0e4 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1113,13 +1113,12 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 {
 	struct trace_event_call *call = &tu->tp.call;
 	struct uprobe_trace_entry_head *entry;
-	struct bpf_prog *prog = call->prog;
 	struct hlist_head *head;
 	void *data;
 	int size, esize;
 	int rctx;
 
-	if (prog && !trace_call_bpf(prog, regs))
+	if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs))
 		return;
 
 	esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-- 
2.9.5

* [PATCH net-next v3 3/3] bpf: add a test case to test single tp multiple bpf attachment
  2017-10-24  6:53 [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event Yonghong Song
  2017-10-24  6:53 ` [PATCH net-next v3 1/3] bpf: use the same condition in perf event set/free bpf handler Yonghong Song
  2017-10-24  6:53 ` [PATCH net-next v3 2/3] bpf: permit multiple bpf attachments for a single perf event Yonghong Song
@ 2017-10-24  6:53 ` Yonghong Song
  2017-10-25  1:48 ` [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event David Miller
  3 siblings, 0 replies; 7+ messages in thread
From: Yonghong Song @ 2017-10-24  6:53 UTC (permalink / raw)
  To: peterz, rostedt, ast, daniel, kafai, netdev; +Cc: kernel-team

The bpf sample program syscall_tp is modified to
show attachment of more than one bpf program
to a particular kernel tracepoint.

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
 samples/bpf/syscall_tp_user.c | 66 ++++++++++++++++++++++++++++++++++---------
 1 file changed, 53 insertions(+), 13 deletions(-)

diff --git a/samples/bpf/syscall_tp_user.c b/samples/bpf/syscall_tp_user.c
index a3cb91e..9169d32 100644
--- a/samples/bpf/syscall_tp_user.c
+++ b/samples/bpf/syscall_tp_user.c
@@ -23,6 +23,13 @@
  * This requires kernel CONFIG_FTRACE_SYSCALLS to be set.
  */
 
+static void usage(const char *cmd)
+{
+	printf("USAGE: %s [-i num_progs] [-h]\n", cmd);
+	printf("       -i num_progs      # number of progs of the test\n");
+	printf("       -h                # help\n");
+}
+
 static void verify_map(int map_id)
 {
 	__u32 key = 0;
@@ -32,22 +39,29 @@ static void verify_map(int map_id)
 		fprintf(stderr, "map_lookup failed: %s\n", strerror(errno));
 		return;
 	}
-	if (val == 0)
+	if (val == 0) {
 		fprintf(stderr, "failed: map #%d returns value 0\n", map_id);
+		return;
+	}
+	val = 0;
+	if (bpf_map_update_elem(map_id, &key, &val, BPF_ANY) != 0) {
+		fprintf(stderr, "map_update failed: %s\n", strerror(errno));
+		return;
+	}
 }
 
-int main(int argc, char **argv)
+static int test(char *filename, int num_progs)
 {
-	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
-	char filename[256];
-	int fd;
+	int i, fd, map0_fds[num_progs], map1_fds[num_progs];
 
-	setrlimit(RLIMIT_MEMLOCK, &r);
-	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
-
-	if (load_bpf_file(filename)) {
-		fprintf(stderr, "%s", bpf_log_buf);
-		return 1;
+	for (i = 0; i < num_progs; i++) {
+		if (load_bpf_file(filename)) {
+			fprintf(stderr, "%s", bpf_log_buf);
+			return 1;
+		}
+		printf("prog #%d: map ids %d %d\n", i, map_fd[0], map_fd[1]);
+		map0_fds[i] = map_fd[0];
+		map1_fds[i] = map_fd[1];
 	}
 
 	/* current load_bpf_file has perf_event_open default pid = -1
@@ -64,8 +78,34 @@ int main(int argc, char **argv)
 	close(fd);
 
 	/* verify the map */
-	verify_map(map_fd[0]);
-	verify_map(map_fd[1]);
+	for (i = 0; i < num_progs; i++) {
+		verify_map(map0_fds[i]);
+		verify_map(map1_fds[i]);
+	}
 
 	return 0;
 }
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	int opt, num_progs = 1;
+	char filename[256];
+
+	while ((opt = getopt(argc, argv, "i:h")) != -1) {
+		switch (opt) {
+		case 'i':
+			num_progs = atoi(optarg);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+			return 0;
+		}
+	}
+
+	setrlimit(RLIMIT_MEMLOCK, &r);
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	return test(filename, num_progs);
+}
-- 
2.9.5

* Re: [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event
  2017-10-24  6:53 [PATCH net-next v3 0/3] bpf: permit multiple bpf attachments for a single perf tracepoint event Yonghong Song
                   ` (2 preceding siblings ...)
  2017-10-24  6:53 ` [PATCH net-next v3 3/3] bpf: add a test case to test single tp multiple bpf attachment Yonghong Song
@ 2017-10-25  1:48 ` David Miller
  3 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2017-10-25  1:48 UTC (permalink / raw)
  To: yhs; +Cc: peterz, rostedt, ast, daniel, kafai, netdev, kernel-team

From: Yonghong Song <yhs@fb.com>
Date: Mon, 23 Oct 2017 23:53:06 -0700

> This patch set adds support to permit multiple bpf prog attachments
> for a single perf tracepoint event. Patch 1 does some cleanup such
> that perf_event_{set|free}_bpf_handler is called under the
> same condition. Patch 2 has the core implementation, and
> Patch 3 adds a test case.
> 
> Changelogs:
> v2 -> v3:
>   . fix compilation error.
> v1 -> v2:
>   . fix a potential deadlock issue discovered by Daniel.
>   . fix some coding style issues.

Series applied.

* Re: [PATCH net-next v3 2/3] bpf: permit multiple bpf attachments for a single perf event
  2017-10-24  6:53 ` [PATCH net-next v3 2/3] bpf: permit multiple bpf attachments for a single perf event Yonghong Song
@ 2017-10-25  9:32   ` Jesper Dangaard Brouer
  2017-10-25 17:01     ` Alexei Starovoitov
  0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2017-10-25  9:32 UTC (permalink / raw)
  To: Yonghong Song
  Cc: brouer, peterz, rostedt, ast, daniel, kafai, netdev, kernel-team


On Mon, 23 Oct 2017 23:53:08 -0700 Yonghong Song <yhs@fb.com> wrote:

> This patch enables multiple bpf attachments for a
> single kprobe/uprobe/tracepoint trace event.

Thanks for working on this. I've hit this issue, where another program
BPF-attaches to a tracepoint and the existing userspace-side prog
doesn't notice.  (Specifically in samples/bpf/xdp_monitor_user.c)

Should my issue be gone now?


> Each trace_event keeps a list of attached perf events.
> When an event happens, all attached bpf programs will
> be executed based on the order of attachment.

Can I somehow view/list the attached bpf programs from userspace?

[...]

You didn't describe the expected semantics of bpf-programs return codes.
From the code below it looks like, if a single program in the list/array
returns 0 then the collective return code is also 0 (is that correct?).

Where 0 means don't store the event into the perf record ring-buffer.

Is this a good semantics?

I do use the return 0 trick to save cycles (in samples/bpf/xdp_monitor_kern.c).
But when someone attaches to the tracepoint, e.g. via perf record -e, then
they might be surprised that they don't receive any events, when my
xdp_monitor happens to be running at the same time...?


> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 1e334b2..172be7f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -273,18 +273,38 @@ int bpf_prog_array_length(struct bpf_prog_array __rcu *progs);
>  int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
>  				__u32 __user *prog_ids, u32 cnt);
>  
> -#define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
> +void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
> +				struct bpf_prog *old_prog);
> +int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
> +			struct bpf_prog *exclude_prog,
> +			struct bpf_prog *include_prog,
> +			struct bpf_prog_array **new_array);
> +
> +#define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null)	\
>  	({						\
> -		struct bpf_prog **_prog;		\
> +		struct bpf_prog **_prog, *__prog;	\
> +		struct bpf_prog_array *_array;		\
>  		u32 _ret = 1;				\
>  		rcu_read_lock();			\
> -		_prog = rcu_dereference(array)->progs;	\
> -		for (; *_prog; _prog++)			\
> -			_ret &= func(*_prog, ctx);	\
> +		_array = rcu_dereference(array);	\
> +		if (unlikely(check_non_null && !_array))\
> +			goto _out;			\
> +		_prog = _array->progs;			\
> +		while ((__prog = READ_ONCE(*_prog))) {	\
> +			_ret &= func(__prog, ctx);	\
> +			_prog++;			\
> +		}					\
> +_out:							\
>  		rcu_read_unlock();			\
>  		_ret;					\
>  	 })
>  
> +#define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
> +	__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
> +
> +#define BPF_PROG_RUN_ARRAY_CHECK(array, ctx, func)	\
> +	__BPF_PROG_RUN_ARRAY(array, ctx, func, true)
> +
>  #ifdef CONFIG_BPF_SYSCALL
>  DECLARE_PER_CPU(int, bpf_prog_active);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH net-next v3 2/3] bpf: permit multiple bpf attachments for a single perf event
  2017-10-25  9:32   ` Jesper Dangaard Brouer
@ 2017-10-25 17:01     ` Alexei Starovoitov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2017-10-25 17:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Yonghong Song
  Cc: peterz, rostedt, daniel, kafai, netdev, kernel-team

On 10/25/17 2:32 AM, Jesper Dangaard Brouer wrote:
>
> On Mon, 23 Oct 2017 23:53:08 -0700 Yonghong Song <yhs@fb.com> wrote:
>
>> This patch enables multiple bpf attachments for a
>> single kprobe/uprobe/tracepoint trace event.
>
> Thanks for working on this, I've hit this issue, where another program
> BPF-attach to a tracepoint, and the existing userspace-side prog
> doesn't notice.  (Specifically in samples/bpf/xdp_monitor_user.c)
>
> Should my issue be gone now?

that issue was fixed earlier in 'net' by
commit ec9dd352d591 ("bpf: one perf event close won't free bpf program 
attached by another perf event")
Back then there was a concern that the fix added an extra pointer to
struct trace_event_call just to fix the bug.
Now with this commit the extra pointer is gone and we can attach
multiple programs. Two birds.
We still only allow one program per FD received from perf_event_open().
The main difference is that different user space processes can attach
their progs to the same tracepoint/kprobe.
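
For illustration only (a hedged sketch, not code from the series; the fds
and helper name are hypothetical), the distinction looks like this from
userspace:

#include <errno.h>
#include <assert.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* efd and efd2: two perf event fds opened on the same tracepoint;
 * prog_fd1 and prog_fd2: two already-loaded bpf programs. */
static void demo_one_prog_per_fd(int efd, int efd2, int prog_fd1, int prog_fd2)
{
	/* first program on this fd: succeeds */
	ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd1);

	/* a second program on the *same* fd is rejected, since
	 * event->prog is already set */
	if (ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd2) < 0)
		assert(errno == EEXIST);

	/* a second program through its *own* fd on the same tracepoint
	 * succeeds; both programs then run when the tracepoint fires */
	ioctl(efd2, PERF_EVENT_IOC_SET_BPF, prog_fd2);
}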

>> Each trace_event keeps a list of attached perf events.
>> When an event happens, all attached bpf programs will
>> be executed based on the order of attachment.
>
> Can I somehow view/list the attached bpf programs from userspace?

this work is ongoing.
The api is not yet obvious, since it's not clear whether to reuse
sys_bpf's BPF_PROG_QUERY command that is used for bpf-cgroup
introspection or extend perf_event's ioctl.
There are pros and cons to both.

> [...]
>
> You didn't describe the expected semantics of bpf-programs return codes.
> From the code below it looks like, if a single program in the list/array
> returns 0 then the collective return code is also 0 (is that correct?).

the logic is the same as that used by BPF_PROG_RUN_ARRAY in bpf-cgroup, which is
"all yes is yes, any no is no" ...

> Where 0 means don't store the event into the perf record ring-buffer.

... meaning that all attached programs are run, and any program
returning 0 prevents the event from being stored into the ring buffer.

> Is this a good semantics?

This semantic was there from day one.
The viewpoint back then was that bpf programs act as
global filters for kprobes and tracepoints, which are global
objects in their own right. This patch doesn't change that.

> I do use the return 0 trick to save cycles (in samples/bpf/xdp_monitor_kern.c).
> But when someone attaches to the tracepoint, e.g. via perf record -e, then
> they might be surprised that they don't receive any events, when my
> xdp_monitor happens to be running at the same time...?

Yes. Most bpf monitoring scripts return 0 to save cycles, and
yes, it prevents a parallel 'perf record -e of_the_same_kprobe'
from seeing anything.
It's a trade-off that programs make. If they want to be fast they
return 0; if they want to be friendly to parallel users of the
ring buffer they return 1.
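
A hedged sketch of that trade-off, in the style of the samples/bpf programs
(the section name, program name and the elided map update are illustrative,
not taken from this series):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"	/* SEC(), as used by samples/bpf */

SEC("tracepoint/syscalls/sys_enter_open")
int monitor_open(void *ctx)
{
	/* ... update a counting map here ... */

	/* return 0: cheapest, but suppresses the perf ring-buffer record
	 * for every other user attached to this tracepoint;
	 * return 1: pays for the ring-buffer write but stays friendly to
	 * a parallel 'perf record -e' on the same event. */
	return 0;
}

char _license[] SEC("license") = "GPL";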
