[PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter

* [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
@ 2015-08-04  8:58 Kaixu Xia
  2015-08-04  8:58 ` [PATCH v6 1/4] bpf: Make the bpf_prog_array_map more generic Kaixu Xia
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-08-04  8:58 UTC
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

Previous patch v5 url:
https://lkml.org/lkml/2015/7/31/299

changes in V6:
 - make the Patch 1/4 commit message more meaningful and readable;
 - remove the unnecessary comment in Patch 2/4;
 - declare the function perf_event_release_kernel() in
   include/linux/perf_event.h to fix the build error when
   CONFIG_PERF_EVENTS isn't configured in Patch 2/4;
 - add function perf_event_attrs() to get the struct perf_event_attr
   in Patch 2/4;
 - move the related code from kernel/trace/bpf_trace.c to
   kernel/events/core.c and add function perf_event_read_internal()
   to avoid poking inside of the event outside of perf code in
   Patch 3/4;
 - generalize the func & map match-pair with an array in Patch 3/4;

changes in V5: 
 - move struct fd_array_map_ops* fd_ops to bpf_map;
 - move array perf event decrement refcnt function to
   map_free;
 - fix the NULL ptr of perf_event_get();
 - move bpf_perf_event_read() to kernel/trace/bpf_trace.c;
 - get rid of the remaining struct bpf_prog;
 - remove the unnecessary cast on void *;

changes in V4: 
 - make the bpf_prog_array_map more generic;
 - fix the bug of event refcnt leak;
 - use more useful errno in bpf_perf_event_read();

changes in V3: 
 - collapse V2 patches 1-3 into one;
 - drop the function map->ops->map_traverse_elem() and release
   the struct perf_event in map_free;
 - only allow to access bpf_perf_event_read() from programs;
 - update the perf_event_array_map elem via xchg();
 - pass index directly to bpf_perf_event_read() instead of
   MAP_KEY;

changes in V2:
 - put atomic_long_inc_not_zero() between fdget() and fdput();
 - limit the event type to PERF_TYPE_RAW and PERF_TYPE_HARDWARE;
 - only read the event counter on the current CPU or for the
   current process;
 - add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY to store the
   pointer to the struct perf_event;
 - bpf_perf_event_read() uses the perf_event_map_fd and key to get
   the hardware PMU counter value;

Patch 4/4 is a simple example that shows how to use this new eBPF
program ability. The PMU counter data appears in
/sys/kernel/debug/tracing/trace (or trace_pipe). The output below
shows the cycles PMU value sampled at 'kprobe/sys_write':

  $ cat /sys/kernel/debug/tracing/trace_pipe
  $ ./tracex6
       ...
       syslog-ng-548   [000] d..1    76.905673: : CPU-0   681765271
       syslog-ng-548   [000] d..1    76.905690: : CPU-0   681787855
       syslog-ng-548   [000] d..1    76.905707: : CPU-0   681810504
       syslog-ng-548   [000] d..1    76.905725: : CPU-0   681834771
       syslog-ng-548   [000] d..1    76.905745: : CPU-0   681859519
       syslog-ng-548   [000] d..1    76.905766: : CPU-0   681890419
       syslog-ng-548   [000] d..1    76.905783: : CPU-0   681914045
       syslog-ng-548   [000] d..1    76.905800: : CPU-0   681935950
       syslog-ng-548   [000] d..1    76.905816: : CPU-0   681958299
              ls-690   [005] d..1    82.241308: : CPU-5   3138451
              sh-691   [004] d..1    82.244570: : CPU-4   7324988
           <...>-699   [007] d..1    99.961387: : CPU-7   3194027
           <...>-695   [003] d..1    99.961474: : CPU-3   288901
           <...>-695   [003] d..1    99.961541: : CPU-3   383145
           <...>-695   [003] d..1    99.961591: : CPU-3   450365
           <...>-695   [003] d..1    99.961639: : CPU-3   515751
           <...>-695   [003] d..1    99.961686: : CPU-3   579047
       ...

The patches are as follows:

Patch 1/4 rewrites part of the bpf_prog_array map code and makes it
more generic;

Patch 2/4 introduces a new bpf map type. This map only stores
pointers to struct perf_event;

Patch 3/4 implements the function bpf_perf_event_read(), which reads
the selected hardware PMU counter;

Patch 4/4 gives a simple example.

Kaixu Xia (3):
  bpf: Add new bpf map type to store the pointer to struct perf_event
  bpf: Implement function bpf_perf_event_read() that get the selected
    hardware PMU conuter
  samples/bpf: example of get selected PMU counter value

Wang Nan (1):
  bpf: Make the bpf_prog_array_map more generic

 arch/x86/net/bpf_jit_comp.c |   6 +-
 include/linux/bpf.h         |  10 +++-
 include/linux/perf_event.h  |  10 ++++
 include/uapi/linux/bpf.h    |   2 +
 kernel/bpf/arraymap.c       | 137 ++++++++++++++++++++++++++++++++++----------
 kernel/bpf/core.c           |   2 +-
 kernel/bpf/syscall.c        |   2 +-
 kernel/bpf/verifier.c       |  49 +++++++++++-----
 kernel/events/core.c        |  44 ++++++++++++++
 kernel/trace/bpf_trace.c    |  31 ++++++++++
 samples/bpf/Makefile        |   4 ++
 samples/bpf/bpf_helpers.h   |   2 +
 samples/bpf/tracex6_kern.c  |  26 +++++++++
 samples/bpf/tracex6_user.c  |  68 ++++++++++++++++++++++
 14 files changed, 340 insertions(+), 53 deletions(-)
 create mode 100644 samples/bpf/tracex6_kern.c
 create mode 100644 samples/bpf/tracex6_user.c

-- 
1.8.3.4



* [PATCH v6 1/4] bpf: Make the bpf_prog_array_map more generic
  2015-08-04  8:58 [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Kaixu Xia
@ 2015-08-04  8:58 ` Kaixu Xia
  2015-08-04  8:58 ` [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event Kaixu Xia
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-08-04  8:58 UTC
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

From: Wang Nan <wangnan0@huawei.com>

All the map backends are generic in nature. To avoid adding much
special-case code to the eBPF core, rewrite part of the
bpf_prog_array map code and make it more generic, so that the new
perf_event_array map type can reuse most of the bpf_prog_array map
code and add only a few lines of its own.
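
The split between generic slot handling and per-type callbacks can be
illustrated outside the kernel. The following is a minimal user-space
sketch, not the kernel code: fd_array, fd_array_ops, fd_array_update(),
fd_array_delete() and the demo_* backend are hypothetical names, the
errno choice for a failed lookup is an assumption, and C11
atomic_exchange() stands in for the kernel's xchg().

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define FD_ARRAY_MAX 4

/* Per-type callbacks, modelled on map_fd_get_ptr/map_fd_put_ptr. */
struct fd_array_ops {
	void *(*get_ptr)(int fd);   /* resolve fd to a referenced object, NULL on failure */
	void (*put_ptr)(void *ptr); /* drop the reference taken by get_ptr */
};

struct fd_array {
	const struct fd_array_ops *ops;
	_Atomic(void *) ptrs[FD_ARRAY_MAX];
};

/* Generic update: publish the new pointer, then release the old one. */
static int fd_array_update(struct fd_array *a, unsigned int index, int fd)
{
	void *new_ptr, *old_ptr;

	if (index >= FD_ARRAY_MAX)
		return -E2BIG;

	new_ptr = a->ops->get_ptr(fd);
	if (!new_ptr)
		return -EBADF;

	old_ptr = atomic_exchange(&a->ptrs[index], new_ptr);
	if (old_ptr)
		a->ops->put_ptr(old_ptr);
	return 0;
}

/* Generic delete: clear the slot and release whatever was stored. */
static int fd_array_delete(struct fd_array *a, unsigned int index)
{
	void *old_ptr;

	if (index >= FD_ARRAY_MAX)
		return -E2BIG;

	old_ptr = atomic_exchange(&a->ptrs[index], NULL);
	if (!old_ptr)
		return -ENOENT;
	a->ops->put_ptr(old_ptr);
	return 0;
}

/* Toy backend: every non-negative "fd" maps to one counted object. */
static int demo_refs;
static int demo_object;

static void *demo_get_ptr(int fd)
{
	if (fd < 0)
		return NULL;
	demo_refs++;
	return &demo_object;
}

static void demo_put_ptr(void *ptr)
{
	(void)ptr;
	demo_refs--;
}

static const struct fd_array_ops demo_ops = {
	.get_ptr = demo_get_ptr,
	.put_ptr = demo_put_ptr,
};
```

A second map type would supply only its own get_ptr/put_ptr pair and
reuse fd_array_update() and fd_array_delete() unchanged, which is the
point of the refactoring in this patch.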

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 arch/x86/net/bpf_jit_comp.c |  6 ++--
 include/linux/bpf.h         |  8 +++--
 kernel/bpf/arraymap.c       | 80 +++++++++++++++++++++++++++------------------
 kernel/bpf/core.c           |  2 +-
 kernel/bpf/syscall.c        |  2 +-
 5 files changed, 60 insertions(+), 38 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 579a8fd..e377f07 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -246,7 +246,7 @@ static void emit_prologue(u8 **pprog)
  *     goto out;
  *   if (++tail_call_cnt > MAX_TAIL_CALL_CNT)
  *     goto out;
- *   prog = array->prog[index];
+ *   prog = array->ptrs[index];
  *   if (prog == NULL)
  *     goto out;
  *   goto *(prog->bpf_func + prologue_size);
@@ -284,9 +284,9 @@ static void emit_bpf_tail_call(u8 **pprog)
 	EMIT3(0x83, 0xC0, 0x01);                  /* add eax, 1 */
 	EMIT2_off32(0x89, 0x85, -STACKSIZE + 36); /* mov dword ptr [rbp - 516], eax */
 
-	/* prog = array->prog[index]; */
+	/* prog = array->ptrs[index]; */
 	EMIT4(0x48, 0x8D, 0x44, 0xD6);            /* lea rax, [rsi + rdx * 8 + 0x50] */
-	EMIT1(offsetof(struct bpf_array, prog));
+	EMIT1(offsetof(struct bpf_array, ptrs));
 	EMIT3(0x48, 0x8B, 0x00);                  /* mov rax, qword ptr [rax] */
 
 	/* if (prog == NULL)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4383476..a8ce262 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -24,6 +24,10 @@ struct bpf_map_ops {
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
 	int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
 	int (*map_delete_elem)(struct bpf_map *map, void *key);
+
+	/* funcs called by prog_array and perf_event_array map */
+	void *(*map_fd_get_ptr) (struct bpf_map *map, int fd);
+	void (*map_fd_put_ptr) (void *ptr);
 };
 
 struct bpf_map {
@@ -142,13 +146,13 @@ struct bpf_array {
 	bool owner_jited;
 	union {
 		char value[0] __aligned(8);
-		struct bpf_prog *prog[0] __aligned(8);
+		void *ptrs[0] __aligned(8);
 	};
 };
 #define MAX_TAIL_CALL_CNT 32
 
 u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5);
-void bpf_prog_array_map_clear(struct bpf_map *map);
+void bpf_fd_array_map_clear(struct bpf_map *map);
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp);
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index cb31229..45df657 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -150,15 +150,15 @@ static int __init register_array_map(void)
 }
 late_initcall(register_array_map);
 
-static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
+static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr)
 {
-	/* only bpf_prog file descriptors can be stored in prog_array map */
+	/* only file descriptors can be stored in this type of map */
 	if (attr->value_size != sizeof(u32))
 		return ERR_PTR(-EINVAL);
 	return array_map_alloc(attr);
 }
 
-static void prog_array_map_free(struct bpf_map *map)
+static void fd_array_map_free(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
@@ -167,21 +167,21 @@ static void prog_array_map_free(struct bpf_map *map)
 
 	/* make sure it's empty */
 	for (i = 0; i < array->map.max_entries; i++)
-		BUG_ON(array->prog[i] != NULL);
+		BUG_ON(array->ptrs[i] != NULL);
 	kvfree(array);
 }
 
-static void *prog_array_map_lookup_elem(struct bpf_map *map, void *key)
+static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	return NULL;
 }
 
 /* only called from syscall */
-static int prog_array_map_update_elem(struct bpf_map *map, void *key,
-				      void *value, u64 map_flags)
+static int fd_array_map_update_elem(struct bpf_map *map, void *key,
+				    void *value, u64 map_flags)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
-	struct bpf_prog *prog, *old_prog;
+	void *new_ptr, *old_ptr;
 	u32 index = *(u32 *)key, ufd;
 
 	if (map_flags != BPF_ANY)
@@ -191,57 +191,75 @@ static int prog_array_map_update_elem(struct bpf_map *map, void *key,
 		return -E2BIG;
 
 	ufd = *(u32 *)value;
-	prog = bpf_prog_get(ufd);
-	if (IS_ERR(prog))
-		return PTR_ERR(prog);
-
-	if (!bpf_prog_array_compatible(array, prog)) {
-		bpf_prog_put(prog);
-		return -EINVAL;
-	}
+	new_ptr = map->ops->map_fd_get_ptr(map, ufd);
+	if (IS_ERR(new_ptr))
+		return PTR_ERR(new_ptr);
 
-	old_prog = xchg(array->prog + index, prog);
-	if (old_prog)
-		bpf_prog_put_rcu(old_prog);
+	old_ptr = xchg(array->ptrs + index, new_ptr);
+	if (old_ptr)
+		map->ops->map_fd_put_ptr(old_ptr);
 
 	return 0;
 }
 
-static int prog_array_map_delete_elem(struct bpf_map *map, void *key)
+static int fd_array_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
-	struct bpf_prog *old_prog;
+	void *old_ptr;
 	u32 index = *(u32 *)key;
 
 	if (index >= array->map.max_entries)
 		return -E2BIG;
 
-	old_prog = xchg(array->prog + index, NULL);
-	if (old_prog) {
-		bpf_prog_put_rcu(old_prog);
+	old_ptr = xchg(array->ptrs + index, NULL);
+	if (old_ptr) {
+		map->ops->map_fd_put_ptr(old_ptr);
 		return 0;
 	} else {
 		return -ENOENT;
 	}
 }
 
+static void *prog_fd_array_get_ptr(struct bpf_map *map, int fd)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	struct bpf_prog *prog = bpf_prog_get(fd);
+	if (IS_ERR(prog))
+		return prog;
+
+	if (!bpf_prog_array_compatible(array, prog)) {
+		bpf_prog_put(prog);
+		return ERR_PTR(-EINVAL);
+	}
+	return prog;
+}
+
+static void prog_fd_array_put_ptr(void *ptr)
+{
+	struct bpf_prog *prog = ptr;
+
+	bpf_prog_put_rcu(prog);
+}
+
 /* decrement refcnt of all bpf_progs that are stored in this map */
-void bpf_prog_array_map_clear(struct bpf_map *map)
+void bpf_fd_array_map_clear(struct bpf_map *map)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	int i;
 
 	for (i = 0; i < array->map.max_entries; i++)
-		prog_array_map_delete_elem(map, &i);
+		fd_array_map_delete_elem(map, &i);
 }
 
 static const struct bpf_map_ops prog_array_ops = {
-	.map_alloc = prog_array_map_alloc,
-	.map_free = prog_array_map_free,
+	.map_alloc = fd_array_map_alloc,
+	.map_free = fd_array_map_free,
 	.map_get_next_key = array_map_get_next_key,
-	.map_lookup_elem = prog_array_map_lookup_elem,
-	.map_update_elem = prog_array_map_update_elem,
-	.map_delete_elem = prog_array_map_delete_elem,
+	.map_lookup_elem = fd_array_map_lookup_elem,
+	.map_update_elem = fd_array_map_update_elem,
+	.map_delete_elem = fd_array_map_delete_elem,
+	.map_fd_get_ptr = prog_fd_array_get_ptr,
+	.map_fd_put_ptr = prog_fd_array_put_ptr,
 };
 
 static struct bpf_map_type_list prog_array_type __read_mostly = {
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index c5bedc8..be848d0 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -449,7 +449,7 @@ select_insn:
 
 		tail_call_cnt++;
 
-		prog = READ_ONCE(array->prog[index]);
+		prog = READ_ONCE(array->ptrs[index]);
 		if (unlikely(!prog))
 			goto out;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index a1b14d1..dc9b464 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -72,7 +72,7 @@ static int bpf_map_release(struct inode *inode, struct file *filp)
 		/* prog_array stores refcnt-ed bpf_prog pointers
 		 * release them all when user space closes prog_array_fd
 		 */
-		bpf_prog_array_map_clear(map);
+		bpf_fd_array_map_clear(map);
 
 	bpf_map_put(map);
 	return 0;
-- 
1.8.3.4



* [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event
  2015-08-04  8:58 [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Kaixu Xia
  2015-08-04  8:58 ` [PATCH v6 1/4] bpf: Make the bpf_prog_array_map more generic Kaixu Xia
@ 2015-08-04  8:58 ` Kaixu Xia
  2015-08-04 17:43   ` Alexei Starovoitov
                     ` (2 more replies)
  2015-08-04  8:58 ` [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter Kaixu Xia
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-08-04  8:58 UTC
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'.
This map only stores pointers to struct perf_event. The event FDs
that user space obtains from the perf_event_open() syscall are
converted to pointers to struct perf_event and stored in the map.
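
The map accepts only events of type PERF_TYPE_RAW or
PERF_TYPE_HARDWARE. A minimal stand-alone sketch of that filter
follows; the demo_* names are hypothetical, struct demo_event_attr
stands in for the single struct perf_event_attr field the check reads,
and the enum values are assumed to mirror enum perf_type_id from the
perf UAPI header.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror enum perf_type_id in the perf UAPI header. */
enum demo_perf_type {
	DEMO_PERF_TYPE_HARDWARE = 0,
	DEMO_PERF_TYPE_SOFTWARE = 1,
	DEMO_PERF_TYPE_TRACEPOINT = 2,
	DEMO_PERF_TYPE_HW_CACHE = 3,
	DEMO_PERF_TYPE_RAW = 4,
};

/* Stand-in for the one perf_event_attr field the filter looks at. */
struct demo_event_attr {
	unsigned int type;
};

static bool demo_type_allowed(const struct demo_event_attr *attr)
{
	return attr->type == DEMO_PERF_TYPE_RAW ||
	       attr->type == DEMO_PERF_TYPE_HARDWARE;
}

/* Returns 0 if the event may be stored in the array, -EINVAL otherwise. */
static int demo_check_event(const struct demo_event_attr *attr)
{
	if (!attr)
		return -EINVAL;
	return demo_type_allowed(attr) ? 0 : -EINVAL;
}
```

With this filter a software event such as a context-switch counter is
rejected at update time, so the read path never has to handle it.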

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 include/linux/bpf.h        |  1 +
 include/linux/perf_event.h |  8 +++++++
 include/uapi/linux/bpf.h   |  1 +
 kernel/bpf/arraymap.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/events/core.c       | 25 ++++++++++++++++++++
 5 files changed, 92 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a8ce262..d0b394a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -10,6 +10,7 @@
 #include <uapi/linux/bpf.h>
 #include <linux/workqueue.h>
 #include <linux/file.h>
+#include <linux/perf_event.h>
 
 struct bpf_map;
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..81fc99e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -641,6 +641,8 @@ extern int perf_event_init_task(struct task_struct *child);
 extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
+extern struct perf_event *perf_event_get(unsigned int fd);
+extern struct perf_event_attr *perf_event_attrs(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
@@ -979,6 +981,11 @@ static inline int perf_event_init_task(struct task_struct *child)	{ return 0; }
 static inline void perf_event_exit_task(struct task_struct *child)	{ }
 static inline void perf_event_free_task(struct task_struct *task)	{ }
 static inline void perf_event_delayed_put(struct task_struct *task)	{ }
+static inline struct perf_event *perf_event_get(unsigned int fd)	{ return ERR_PTR(-EINVAL); }
+static inline struct perf_event_attr *perf_event_attrs(struct perf_event *event)
+{
+	return ERR_PTR(-EINVAL);
+}
 static inline void perf_event_print_debug(void)				{ }
 static inline int perf_event_task_disable(void)				{ return -EINVAL; }
 static inline int perf_event_task_enable(void)				{ return -EINVAL; }
@@ -1011,6 +1018,7 @@ static inline void perf_event_enable(struct perf_event *event)		{ }
 static inline void perf_event_disable(struct perf_event *event)		{ }
 static inline int __perf_event_disable(void *info)			{ return -1; }
 static inline void perf_event_task_tick(void)				{ }
+static inline int perf_event_release_kernel(struct perf_event *event)	{ return 0; }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_NO_HZ_FULL)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 29ef6f9..69a1f6b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -114,6 +114,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_HASH,
 	BPF_MAP_TYPE_ARRAY,
 	BPF_MAP_TYPE_PROG_ARRAY,
+	BPF_MAP_TYPE_PERF_EVENT_ARRAY,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 45df657..b1e98ff 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -273,3 +273,60 @@ static int __init register_prog_array_map(void)
 	return 0;
 }
 late_initcall(register_prog_array_map);
+
+static void perf_event_array_map_free(struct bpf_map *map)
+{
+	bpf_fd_array_map_clear(map);
+	fd_array_map_free(map);
+}
+
+static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
+{
+	struct perf_event *event;
+	struct perf_event_attr *attr;
+
+	event = perf_event_get(fd);
+	if (IS_ERR(event))
+		return event;
+
+	attr = perf_event_attrs(event);
+	if (IS_ERR(attr))
+		return attr;
+
+	if (attr->type != PERF_TYPE_RAW &&
+	    attr->type != PERF_TYPE_HARDWARE) {
+		perf_event_release_kernel(event);
+		return ERR_PTR(-EINVAL);
+	}
+	return event;
+}
+
+static void perf_event_fd_array_put_ptr(void *ptr)
+{
+	struct perf_event *event = ptr;
+
+	perf_event_release_kernel(event);
+}
+
+static const struct bpf_map_ops perf_event_array_ops = {
+	.map_alloc = fd_array_map_alloc,
+	.map_free = perf_event_array_map_free,
+	.map_get_next_key = array_map_get_next_key,
+	.map_lookup_elem = fd_array_map_lookup_elem,
+	.map_update_elem = fd_array_map_update_elem,
+	.map_delete_elem = fd_array_map_delete_elem,
+	.map_fd_get_ptr = perf_event_fd_array_get_ptr,
+	.map_fd_put_ptr = perf_event_fd_array_put_ptr,
+};
+
+static struct bpf_map_type_list perf_event_array_type __read_mostly = {
+	.ops = &perf_event_array_ops,
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+};
+
+static int __init register_perf_event_array_map(void)
+{
+	bpf_register_map_type(&perf_event_array_type);
+	return 0;
+}
+late_initcall(register_perf_event_array_map);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..6251b53 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8574,6 +8574,31 @@ void perf_event_delayed_put(struct task_struct *task)
 		WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
 }
 
+struct perf_event *perf_event_get(unsigned int fd)
+{
+	int err;
+	struct fd f;
+	struct perf_event *event;
+
+	err = perf_fget_light(fd, &f);
+	if (err)
+		return ERR_PTR(err);
+
+	event = f.file->private_data;
+	atomic_long_inc(&event->refcount);
+	fdput(f);
+
+	return event;
+}
+
+struct perf_event_attr *perf_event_attrs(struct perf_event *event)
+{
+	if (!event)
+		return ERR_PTR(-EINVAL);
+
+	return &event->attr;
+}
+
 /*
  * inherit a event from parent task to child task:
  */
-- 
1.8.3.4



* [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-04  8:58 [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Kaixu Xia
  2015-08-04  8:58 ` [PATCH v6 1/4] bpf: Make the bpf_prog_array_map more generic Kaixu Xia
  2015-08-04  8:58 ` [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event Kaixu Xia
@ 2015-08-04  8:58 ` Kaixu Xia
  2015-08-04 17:55   ` Alexei Starovoitov
  2015-08-05 10:04   ` Peter Zijlstra
  2015-08-04  8:58 ` [PATCH v6 4/4] samples/bpf: example of get selected PMU counter value Kaixu Xia
  2015-08-05 10:06 ` [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Peter Zijlstra
  4 siblings, 2 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-08-04  8:58 UTC
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

Given the perf_event_map_fd and an index, the function
bpf_perf_event_read() converts the corresponding map value to a
pointer to struct perf_event and returns the hardware PMU counter
value.
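
On the verifier side, this patch also pairs each special map type with
the one helper allowed to take it. The pairing rule can be sketched in
user space as below. This is an illustration under assumptions, not
the verifier code: the demo_* enums and demo_check_func_limit() are
hypothetical names, and the loop bound is written as a strict '<' so
the table is never indexed past its last entry.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

enum demo_map_type {
	DEMO_MAP_HASH,
	DEMO_MAP_ARRAY,
	DEMO_MAP_PROG_ARRAY,
	DEMO_MAP_PERF_EVENT_ARRAY,
};

enum demo_func_id {
	DEMO_FUNC_map_lookup_elem,
	DEMO_FUNC_tail_call,
	DEMO_FUNC_perf_event_read,
};

/* Each special map type may only be used with its matching helper. */
static const struct {
	int map_type;
	int func_id;
} demo_func_limit[] = {
	{ DEMO_MAP_PROG_ARRAY, DEMO_FUNC_tail_call },
	{ DEMO_MAP_PERF_EVENT_ARRAY, DEMO_FUNC_perf_event_read },
};

static int demo_check_func_limit(int map_type, int func_id)
{
	size_t i;

	for (i = 0; i < DEMO_ARRAY_SIZE(demo_func_limit); i++) {
		bool map_match = (map_type == demo_func_limit[i].map_type);
		bool func_match = (func_id == demo_func_limit[i].func_id);

		/* Reject when exactly one side of a pair matches. */
		if (map_match != func_match)
			return -EINVAL;
	}
	return 0;
}
```

A call passes only when, for every table row, the map type and the
helper either both match the row or both differ from it, so ordinary
maps with ordinary helpers are unaffected.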

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 include/linux/bpf.h        |  1 +
 include/linux/perf_event.h |  2 ++
 include/uapi/linux/bpf.h   |  1 +
 kernel/bpf/verifier.c      | 49 ++++++++++++++++++++++++++++++++--------------
 kernel/events/core.c       | 19 ++++++++++++++++++
 kernel/trace/bpf_trace.c   | 31 +++++++++++++++++++++++++++++
 6 files changed, 88 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d0b394a..db9f781 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -190,6 +190,7 @@ extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
 
+extern const struct bpf_func_proto bpf_perf_event_read_proto;
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
 extern const struct bpf_func_proto bpf_tail_call_proto;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 81fc99e..6f1e448 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -643,6 +643,7 @@ extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
 extern struct perf_event *perf_event_get(unsigned int fd);
 extern struct perf_event_attr *perf_event_attrs(struct perf_event *event);
+extern u64 perf_event_read_internal(struct perf_event *event);
 extern void perf_event_print_debug(void);
 extern void perf_pmu_disable(struct pmu *pmu);
 extern void perf_pmu_enable(struct pmu *pmu);
@@ -986,6 +987,7 @@ static inline struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 {
 	return ERR_PTR(-EINVAL);
 }
+static inline u64 perf_event_read_internal(struct perf_event *event)	{ return -EINVAL; }
 static inline void perf_event_print_debug(void)				{ }
 static inline int perf_event_task_disable(void)				{ return -EINVAL; }
 static inline int perf_event_task_enable(void)				{ return -EINVAL; }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69a1f6b..b9b13ce 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -250,6 +250,7 @@ enum bpf_func_id {
 	 * Return: 0 on success
 	 */
 	BPF_FUNC_get_current_comm,
+	BPF_FUNC_perf_event_read,	/* u64 bpf_perf_event_read(&map, index) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039d866..45fae14 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -238,6 +238,14 @@ static const char * const reg_type_str[] = {
 	[CONST_IMM]		= "imm",
 };
 
+static const struct {
+	int map_type;
+	int func_id;
+} func_limit[] = {
+	{BPF_MAP_TYPE_PROG_ARRAY, BPF_FUNC_tail_call},
+	{BPF_MAP_TYPE_PERF_EVENT_ARRAY, BPF_FUNC_perf_event_read},
+};
+
 static void print_verifier_state(struct verifier_env *env)
 {
 	enum bpf_reg_type t;
@@ -833,6 +841,29 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 	return err;
 }
 
+static int check_func_limit(struct bpf_map **mapp, int func_id)
+{
+	struct bpf_map *map = *mapp;
+	bool bool_map, bool_func;
+	int i;
+
+	if (!map)
+		return 0;
+
+	for (i = 0; i < ARRAY_SIZE(func_limit); i++) {
+		bool_map = (map->map_type == func_limit[i].map_type);
+		bool_func = (func_id == func_limit[i].func_id);
+		/* only when map & func pair match it can continue.
+		 * don't allow any other map type to be passed into
+		 * the special func;
+		 */
+		if (bool_map != bool_func)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int check_call(struct verifier_env *env, int func_id)
 {
 	struct verifier_state *state = &env->cur_state;
@@ -908,21 +939,9 @@ static int check_call(struct verifier_env *env, int func_id)
 		return -EINVAL;
 	}
 
-	if (map && map->map_type == BPF_MAP_TYPE_PROG_ARRAY &&
-	    func_id != BPF_FUNC_tail_call)
-		/* prog_array map type needs extra care:
-		 * only allow to pass it into bpf_tail_call() for now.
-		 * bpf_map_delete_elem() can be allowed in the future,
-		 * while bpf_map_update_elem() must only be done via syscall
-		 */
-		return -EINVAL;
-
-	if (func_id == BPF_FUNC_tail_call &&
-	    map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
-		/* don't allow any other map type to be passed into
-		 * bpf_tail_call()
-		 */
-		return -EINVAL;
+	err = check_func_limit(&map, func_id);
+	if (err)
+		return err;
 
 	return 0;
 }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6251b53..726ca1b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8599,6 +8599,25 @@ struct perf_event_attr *perf_event_attrs(struct perf_event *event)
 	return &event->attr;
 }
 
+u64 perf_event_read_internal(struct perf_event *event)
+{
+	if (!event)
+		return -EINVAL;
+
+	if (unlikely(event->state != PERF_EVENT_STATE_ACTIVE))
+		return -EINVAL;
+
+	if (event->oncpu != raw_smp_processor_id() &&
+	    event->ctx->task != current)
+		return -EINVAL;
+
+	if (unlikely(event->attr.inherit))
+		return -EINVAL;
+
+	__perf_event_read(event);
+	return perf_event_count(event);
+}
+
 /*
  * inherit a event from parent task to child task:
  */
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 88a041a..7d7b724 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -158,6 +158,35 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
 	return &bpf_trace_printk_proto;
 }
 
+static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	struct perf_event *event;
+
+	if (unlikely(index >= array->map.max_entries))
+		return -E2BIG;
+
+	event = (struct perf_event *)array->ptrs[index];
+	if (!event)
+		return -ENOENT;
+
+	/*
+	 * The u64 return value alone does not tell whether the read
+	 * succeeded. That has to be judged elsewhere, for example in
+	 * the eBPF program that calls this helper.
+	 */
+	return perf_event_read_internal(event);
+}
+
+const struct bpf_func_proto bpf_perf_event_read_proto = {
+	.func		= bpf_perf_event_read,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_CONST_MAP_PTR,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -183,6 +212,8 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
 		return bpf_get_trace_printk_proto();
 	case BPF_FUNC_get_smp_processor_id:
 		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_perf_event_read:
+		return &bpf_perf_event_read_proto;
 	default:
 		return NULL;
 	}
-- 
1.8.3.4



* [PATCH v6 4/4] samples/bpf: example of get selected PMU counter value
  2015-08-04  8:58 [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Kaixu Xia
                   ` (2 preceding siblings ...)
  2015-08-04  8:58 ` [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter Kaixu Xia
@ 2015-08-04  8:58 ` Kaixu Xia
  2015-08-05 10:06 ` [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Peter Zijlstra
  4 siblings, 0 replies; 22+ messages in thread
From: Kaixu Xia @ 2015-08-04  8:58 UTC
  To: ast, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel
  Cc: xiakaixu, wangnan0, linux-kernel, pi3orama, hekuang, netdev

This simple example shows how to use the new ability to read the
selected hardware PMU counter value.
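
The sample prints the helper's return value unconditionally. Since the
helper in Patch 3/4 returns negative errno values through a u64, a
caller that wants to tell errors from counts could apply a range
check in the style of the kernel's IS_ERR_VALUE(). The sketch below is
a heuristic under that assumption, not part of this series; the demo_*
names and DEMO_MAX_ERRNO are hypothetical, and a genuine counter value
in the top 4095 values of the u64 range would be misclassified.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* True if a u64 helper result falls in the range used for -errno values. */
static bool demo_is_err_value(uint64_t v)
{
	return v >= (uint64_t)-DEMO_MAX_ERRNO;
}

/* Recover the positive errno, or 0 if the value looks like a counter. */
static int demo_err_from_value(uint64_t v)
{
	return demo_is_err_value(v) ? (int)-(int64_t)v : 0;
}
```

An eBPF program could use the same comparison to skip the
bpf_trace_printk() call when the slot for the current CPU is empty or
the event is not active.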

Signed-off-by: Kaixu Xia <xiakaixu@huawei.com>
---
 samples/bpf/Makefile       |  4 +++
 samples/bpf/bpf_helpers.h  |  2 ++
 samples/bpf/tracex6_kern.c | 26 ++++++++++++++++++
 samples/bpf/tracex6_user.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 100 insertions(+)
 create mode 100644 samples/bpf/tracex6_kern.c
 create mode 100644 samples/bpf/tracex6_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4450fed..63e7d50 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -12,6 +12,7 @@ hostprogs-y += tracex2
 hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
+hostprogs-y += tracex6
 hostprogs-y += lathist
 
 test_verifier-objs := test_verifier.o libbpf.o
@@ -25,6 +26,7 @@ tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
 tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
 tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
 tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
+tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
 lathist-objs := bpf_load.o libbpf.o lathist_user.o
 
 # Tell kbuild to always build the programs
@@ -37,6 +39,7 @@ always += tracex2_kern.o
 always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
+always += tracex6_kern.o
 always += tcbpf1_kern.o
 always += lathist_kern.o
 
@@ -51,6 +54,7 @@ HOSTLOADLIBES_tracex2 += -lelf
 HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
+HOSTLOADLIBES_tracex6 += -lelf
 HOSTLOADLIBES_lathist += -lelf
 
 # point this to your LLVM backend with bpf support
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index bdf1c16..c8a3594 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -31,6 +31,8 @@ static unsigned long long (*bpf_get_current_uid_gid)(void) =
 	(void *) BPF_FUNC_get_current_uid_gid;
 static int (*bpf_get_current_comm)(void *buf, int buf_size) =
 	(void *) BPF_FUNC_get_current_comm;
+static unsigned long long (*bpf_perf_event_read)(void *map, int index) =
+	(void *) BPF_FUNC_perf_event_read;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
new file mode 100644
index 0000000..23d1cff
--- /dev/null
+++ b/samples/bpf/tracex6_kern.c
@@ -0,0 +1,26 @@
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(u32),
+	.max_entries = 32,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+	u64 count;
+	u32 key = bpf_get_smp_processor_id();
+	char fmt[] = "CPU-%d   %llu\n";
+
+	count = bpf_perf_event_read(&my_map, key);
+	bpf_trace_printk(fmt, sizeof(fmt), key, count);
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
new file mode 100644
index 0000000..928f05e
--- /dev/null
+++ b/samples/bpf/tracex6_user.c
@@ -0,0 +1,68 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <sys/ioctl.h>
+#include <linux/perf_event.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+
+#define SAMPLE_PERIOD  0x7fffffffffffffffULL
+
+static void test_bpf_perf_event(void)
+{
+	int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	int *pmu_fd = malloc(nr_cpus * sizeof(int));
+	unsigned long value;
+	int i;
+
+	struct perf_event_attr attr_insn_pmu = {
+		.freq = 0,
+		.sample_period = SAMPLE_PERIOD,
+		.inherit = 0,
+		.type = PERF_TYPE_HARDWARE,
+		.read_format = 0,
+		.sample_type = 0,
+		.config = 0,/* PMU: cycles */
+	};
+
+	for (i = 0; i < nr_cpus; i++) {
+		pmu_fd[i] = perf_event_open(&attr_insn_pmu, -1/*pid*/, i/*cpu*/, -1/*group_fd*/, 0);
+		if (pmu_fd[i] < 0)
+			printf("event syscall failed\n");
+
+		bpf_update_elem(map_fd[0], &i, &pmu_fd[i], BPF_ANY);
+		ioctl(pmu_fd[i], PERF_EVENT_IOC_ENABLE, 0);
+	}
+
+	system("ls");
+	system("pwd");
+	system("sleep 2");
+
+	for (i = 0; i < nr_cpus; i++)
+		close(pmu_fd[i]);
+
+	close(map_fd);
+
+	free(pmu_fd);
+}
+
+int main(int argc, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	test_bpf_perf_event();
+
+	return 0;
+}
-- 
1.8.3.4



* Re: [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event
  2015-08-04  8:58 ` [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event Kaixu Xia
@ 2015-08-04 17:43   ` Alexei Starovoitov
  2015-08-05  9:41     ` Peter Zijlstra
  2015-08-05  9:39   ` Peter Zijlstra
  2015-08-05  9:43   ` Peter Zijlstra
  2 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-08-04 17:43 UTC (permalink / raw)
  To: Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 8/4/15 1:58 AM, Kaixu Xia wrote:
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 45df657..b1e98ff 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -273,3 +273,60 @@ static int __init register_prog_array_map(void)
>   	return 0;
>   }
>   late_initcall(register_prog_array_map);
> +
> +static void perf_event_array_map_free(struct bpf_map *map)
> +{
> +	bpf_fd_array_map_clear(map);
> +	fd_array_map_free(map);
> +}
> +
> +static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
> +{
> +	struct perf_event *event;
> +	struct perf_event_attr *attr;
> +
> +	event = perf_event_get(fd);
> +	if (IS_ERR(event))
> +		return event;
> +
> +	attr = perf_event_attrs(event);
> +	if (IS_ERR(attr))
> +		return attr;
> +
> +	if (attr->type != PERF_TYPE_RAW &&
> +	    attr->type != PERF_TYPE_HARDWARE) {
> +		perf_event_release_kernel(event);
> +		return ERR_PTR(-EINVAL);
> +	}
> +	return event;
> +}

I'm not sure whether Peter wanted to see the above function to be
in events/core.c or not.
imo it's fine here, since perf_event_attr is an uapi struct.



* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-04  8:58 ` [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter Kaixu Xia
@ 2015-08-04 17:55   ` Alexei Starovoitov
  2015-08-05  2:09     ` xiakaixu
  2015-08-05 10:04   ` Peter Zijlstra
  1 sibling, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-08-04 17:55 UTC (permalink / raw)
  To: Kaixu Xia, davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt,
	jolsa, daniel
  Cc: wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 8/4/15 1:58 AM, Kaixu Xia wrote:
> +static int check_func_limit(struct bpf_map **mapp, int func_id)

how about 'check_map_func_compatibility' or 'check_map_func_affinity' ?

> +{
> +	struct bpf_map *map = *mapp;

why pass a pointer to a pointer? a single pointer would be fine.

> +	bool bool_map, bool_func;
> +	int i;
> +
> +	if (!map)
> +		return 0;
> +
> +	for (i = 0; i <= ARRAY_SIZE(func_limit); i++) {
> +		bool_map = (map->map_type == func_limit[i].map_type);
> +		bool_func = (func_id == func_limit[i].func_id);
> +		/* only when map & func pair match it can continue.
> +		 * don't allow any other map type to be passed into
> +		 * the special func;
> +		 */
> +		if (bool_map != bool_func)
> +			return -EINVAL;
> +	}

nice simplification!

the rest of the changes look good.
please respin your next set against net-next.
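
For readers following the review, the pairing rule can be sketched in user space with the two suggestions applied (the new name and a single pointer). This is only an illustrative model: the enums, the table contents and `struct map` are hypothetical stand-ins, not the kernel definitions. The sketch also bounds the loop with `i < ARRAY_SIZE(...)`, because a `<=` bound would read one element past the end of the table.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel enums; the values are arbitrary. */
enum map_type { MAP_TYPE_HASH, MAP_TYPE_PROG_ARRAY, MAP_TYPE_PERF_EVENT_ARRAY };
enum func_id  { FUNC_map_lookup_elem, FUNC_tail_call, FUNC_perf_event_read };

struct map {
	enum map_type map_type;
};

/* Each row ties one special map type to the only helper allowed to use it. */
static const struct {
	enum map_type map_type;
	enum func_id func_id;
} func_limit[] = {
	{ MAP_TYPE_PROG_ARRAY,       FUNC_tail_call },
	{ MAP_TYPE_PERF_EVENT_ARRAY, FUNC_perf_event_read },
};

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Takes the map by single pointer; returns 0 if the pair is allowed. */
static int check_map_func_compatibility(const struct map *map,
					enum func_id func_id)
{
	size_t i;

	if (!map)
		return 0;

	for (i = 0; i < ARRAY_SIZE(func_limit); i++) {
		int map_matches = (map->map_type == func_limit[i].map_type);
		int func_matches = (func_id == func_limit[i].func_id);

		/* A special map and its helper must appear together or not at all. */
		if (map_matches != func_matches)
			return -EINVAL;
	}
	return 0;
}
```

With this table, an ordinary map passed to an ordinary helper matches no row and is accepted, while a special map passed to the wrong helper (or a special helper given the wrong map) is rejected.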



* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-04 17:55   ` Alexei Starovoitov
@ 2015-08-05  2:09     ` xiakaixu
  0 siblings, 0 replies; 22+ messages in thread
From: xiakaixu @ 2015-08-05  2:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, acme, mingo, a.p.zijlstra, masami.hiramatsu.pt, jolsa,
	daniel, wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 2015/8/5 1:55, Alexei Starovoitov wrote:
> On 8/4/15 1:58 AM, Kaixu Xia wrote:
>> +static int check_func_limit(struct bpf_map **mapp, int func_id)
> 
> how about 'check_map_func_compatibility' or 'check_map_func_affinity' ?
> 
>> +{
>> +    struct bpf_map *map = *mapp;
> 
> why pass a pointer to a pointer? a single pointer would be fine.
> 
>> +    bool bool_map, bool_func;
>> +    int i;
>> +
>> +    if (!map)
>> +        return 0;
>> +
>> +    for (i = 0; i <= ARRAY_SIZE(func_limit); i++) {
>> +        bool_map = (map->map_type == func_limit[i].map_type);
>> +        bool_func = (func_id == func_limit[i].func_id);
>> +        /* only when map & func pair match it can continue.
>> +         * don't allow any other map type to be passed into
>> +         * the special func;
>> +         */
>> +        if (bool_map != bool_func)
>> +            return -EINVAL;
>> +    }
> 
> nice simplification!
> 
> the rest of the changes look good.
> please respin your next set against net-next.

Thanks for your review! I will address these comments in the next set.




* Re: [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event
  2015-08-04  8:58 ` [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event Kaixu Xia
  2015-08-04 17:43   ` Alexei Starovoitov
@ 2015-08-05  9:39   ` Peter Zijlstra
  2015-08-05  9:43   ` Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05  9:39 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Tue, Aug 04, 2015 at 08:58:14AM +0000, Kaixu Xia wrote:

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2027809..81fc99e 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -641,6 +641,8 @@ extern int perf_event_init_task(struct task_struct *child);
>  extern void perf_event_exit_task(struct task_struct *child);
>  extern void perf_event_free_task(struct task_struct *task);
>  extern void perf_event_delayed_put(struct task_struct *task);
> +extern struct perf_event *perf_event_get(unsigned int fd);
> +extern struct perf_event_attr *perf_event_attrs(struct perf_event *event);

const struct perf_event_attr *perf_event_attrs();


> +static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
> +{
> +	struct perf_event *event;
> +	struct perf_event_attr *attr;

Also const

> +
> +	event = perf_event_get(fd);
> +	if (IS_ERR(event))
> +		return event;
> +
> +	attr = perf_event_attrs(event);
> +	if (IS_ERR(attr))
> +		return attr;
> +
> +	if (attr->type != PERF_TYPE_RAW &&
> +	    attr->type != PERF_TYPE_HARDWARE) {
> +		perf_event_release_kernel(event);
> +		return ERR_PTR(-EINVAL);
> +	}
> +	return event;
> +}

> +struct perf_event_attr *perf_event_attrs(struct perf_event *event)

Also const

> +{
> +	if (!event)
> +		return ERR_PTR(-EINVAL);
> +
> +	return &event->attr;
> +}

It doesn't make sense (ever) to allow users of this function to change
attributes of the event; inspecting them is fine.
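
As an illustration of the const request, here is a minimal user-space sketch of a read-only accessor. The struct layouts and names are hypothetical, and it returns NULL on a missing event where the patch uses ERR_PTR(-EINVAL); that substitution is an assumption made to keep the example self-contained.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct event_attr {
	unsigned int type;
};

struct event {
	struct event_attr attr;
};

/* Callers receive a pointer they can read through but not write through. */
static const struct event_attr *event_attrs(const struct event *event)
{
	return event ? &event->attr : NULL;
}

/* Example caller: inspecting the attribute is all it is allowed to do. */
static int event_type_is(const struct event *event, unsigned int type)
{
	const struct event_attr *attr = event_attrs(event);

	return attr && attr->type == type;
}
```

With this signature, an assignment such as `event_attrs(ev)->type = 0;` is rejected by the compiler, which is the property being asked for.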


* Re: [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event
  2015-08-04 17:43   ` Alexei Starovoitov
@ 2015-08-05  9:41     ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05  9:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kaixu Xia, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Tue, Aug 04, 2015 at 10:43:35AM -0700, Alexei Starovoitov wrote:
> >+static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
> >+{
> >+	struct perf_event *event;
> >+	struct perf_event_attr *attr;
> >+
> >+	event = perf_event_get(fd);
> >+	if (IS_ERR(event))
> >+		return event;
> >+
> >+	attr = perf_event_attrs(event);
> >+	if (IS_ERR(attr))
> >+		return attr;
> >+
> >+	if (attr->type != PERF_TYPE_RAW &&
> >+	    attr->type != PERF_TYPE_HARDWARE) {
> >+		perf_event_release_kernel(event);
> >+		return ERR_PTR(-EINVAL);
> >+	}
> >+	return event;
> >+}
> 
> I'm not sure whether Peter wanted to see the above function to be
> in events/core.c or not.
> imo it's fine here, since perf_event_attr is an uapi struct.

Right, aside from the const issue this is fine, for the exact reason you
state, perf_event_attr is an exposed interface.


* Re: [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event
  2015-08-04  8:58 ` [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event Kaixu Xia
  2015-08-04 17:43   ` Alexei Starovoitov
  2015-08-05  9:39   ` Peter Zijlstra
@ 2015-08-05  9:43   ` Peter Zijlstra
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05  9:43 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Tue, Aug 04, 2015 at 08:58:14AM +0000, Kaixu Xia wrote:
> +static void *perf_event_fd_array_get_ptr(struct bpf_map *map, int fd)
> +{
> +	struct perf_event *event;
> +	struct perf_event_attr *attr;
> +
> +	event = perf_event_get(fd);
> +	if (IS_ERR(event))
> +		return event;
> +
> +	attr = perf_event_attrs(event);
> +	if (IS_ERR(attr)) {
		perf_event_release_kernel(event);
> +		return attr;
	}

And maybe I should do a tree wide:
s/perf_event_release_kernel/perf_event_put/, but that's not too
important and can be done at any time.

> +
> +	if (attr->type != PERF_TYPE_RAW &&
> +	    attr->type != PERF_TYPE_HARDWARE) {
> +		perf_event_release_kernel(event);
> +		return ERR_PTR(-EINVAL);
> +	}
> +	return event;
> +}
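
The point about releasing the event on the attr error path generalises to: once the reference is taken, every failing exit must drop it. A minimal user-space sketch of that pattern follows. All names are hypothetical, and the plain integer refcount stands in for whatever perf_event_get() and perf_event_release_kernel() do internally.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical type codes, for illustration only. */
enum { TYPE_HARDWARE = 0, TYPE_SOFTWARE = 1, TYPE_RAW = 4 };

struct event {
	int refcount;
	int has_attr;	/* models whether the attr lookup succeeds */
	int type;
};

static struct event *event_get(struct event *e)
{
	if (!e)
		return NULL;
	e->refcount++;
	return e;
}

static void event_put(struct event *e)
{
	e->refcount--;
}

/*
 * On success returns 0 and hands the held reference to *out.
 * On failure returns a negative errno with the reference already dropped.
 */
static int event_accept(struct event *candidate, struct event **out)
{
	struct event *event = event_get(candidate);
	int err;

	if (!event)
		return -EBADF;	/* nothing was taken, nothing to drop */

	if (!event->has_attr) {
		err = -EINVAL;
		goto put;
	}

	if (event->type != TYPE_RAW && event->type != TYPE_HARDWARE) {
		err = -EINVAL;
		goto put;
	}

	*out = event;
	return 0;

put:
	event_put(event);	/* single exit for all post-get failures */
	return err;
}
```

Routing both failures through one label keeps the get and put visibly paired, so a later check added between them cannot leak the reference by accident.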


* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-04  8:58 ` [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter Kaixu Xia
  2015-08-04 17:55   ` Alexei Starovoitov
@ 2015-08-05 10:04   ` Peter Zijlstra
  2015-08-05 10:15     ` Peter Zijlstra
                       ` (3 more replies)
  1 sibling, 4 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05 10:04 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Tue, Aug 04, 2015 at 08:58:15AM +0000, Kaixu Xia wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6251b53..726ca1b 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8599,6 +8599,25 @@ struct perf_event_attr *perf_event_attrs(struct perf_event *event)
>  	return &event->attr;
>  }
>  
> +u64 perf_event_read_internal(struct perf_event *event)

Maybe: perf_event_read_local(), as this function only works for
events active on the current CPU.

> +{
> +	if (!event)
> +		return -EINVAL;
> +
> +	if (unlikely(event->state != PERF_EVENT_STATE_ACTIVE))
> +		return -EINVAL;

You can return perf_event_count() in that case.

> +
> +	if (event->oncpu != raw_smp_processor_id() &&

That _must_ be smp_processor_id(). If that gives a warning (ie.
preemption is not disabled or we're not affine to this one cpu) then the
warning is valid.

> +	    event->ctx->task != current)

Write it like:

	if (event->ctx->task != current &&
	    event->oncpu != smp_processor_id())

That way you'll not evaluate smp_processor_id() for current task events.

> +		return -EINVAL;
> +
> +	if (unlikely(event->attr.inherit))
> +		return -EINVAL;

This should be in your accept function, inherited events should never
get this far.

You need IRQs disabled while calling __perf_event_read(), did you test
with lockdep enabled?

> +	__perf_event_read(event);
> +	return perf_event_count(event);
> +}

Also, you probably want a WARN_ON(in_nmi()) there, this function is
_NOT_ NMI safe.
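
The reordering suggested above relies on C's short-circuit evaluation of `&&`. A small user-space sketch, with a counter standing in for the cost (and the debug check) of smp_processor_id(), shows the effect. The names `fake_smp_processor_id`, `owner_task` and `is_foreign` are hypothetical.

```c
/* Counts how often the "expensive" CPU-id lookup is evaluated. */
static int cpu_id_calls;

/* Hypothetical stand-in for smp_processor_id(); always reports CPU 0. */
static int fake_smp_processor_id(void)
{
	cpu_id_calls++;
	return 0;
}

struct event {
	int owner_task;	/* models event->ctx->task */
	int oncpu;	/* models event->oncpu */
};

/*
 * True when the event belongs neither to the current task nor to this CPU.
 * The task test comes first, so the CPU lookup is skipped entirely for
 * events owned by the current task.
 */
static int is_foreign(const struct event *event, int current_task)
{
	return event->owner_task != current_task &&
	       event->oncpu != fake_smp_processor_id();
}
```

For an event owned by the current task the left operand is false, so `fake_smp_processor_id()` is never called; it is evaluated only for events owned by some other task.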




* Re: [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
  2015-08-04  8:58 [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Kaixu Xia
                   ` (3 preceding siblings ...)
  2015-08-04  8:58 ` [PATCH v6 4/4] samples/bpf: example of get selected PMU counter value Kaixu Xia
@ 2015-08-05 10:06 ` Peter Zijlstra
  4 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05 10:06 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev



Please split out the core perf API stuff into separate patches.




* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 10:04   ` Peter Zijlstra
@ 2015-08-05 10:15     ` Peter Zijlstra
  2015-08-05 16:00       ` Alexei Starovoitov
  2015-08-05 10:31     ` xiakaixu
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05 10:15 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Wed, Aug 05, 2015 at 12:04:25PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 04, 2015 at 08:58:15AM +0000, Kaixu Xia wrote:

> > +	    event->ctx->task != current)

Strictly speaking we should hold rcu_read_lock around dereferencing
event->ctx (or have IRQs disabled -- although I know Paul doesn't like
us relying on that).




* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 10:04   ` Peter Zijlstra
  2015-08-05 10:15     ` Peter Zijlstra
@ 2015-08-05 10:31     ` xiakaixu
  2015-08-05 13:53     ` Peter Zijlstra
  2015-08-05 15:59     ` Alexei Starovoitov
  3 siblings, 0 replies; 22+ messages in thread
From: xiakaixu @ 2015-08-05 10:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 2015/8/5 18:04, Peter Zijlstra wrote:
> On Tue, Aug 04, 2015 at 08:58:15AM +0000, Kaixu Xia wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 6251b53..726ca1b 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -8599,6 +8599,25 @@ struct perf_event_attr *perf_event_attrs(struct perf_event *event)
>>  	return &event->attr;
>>  }
>>  
>> +u64 perf_event_read_internal(struct perf_event *event)
> 
> Maybe: perf_event_read_local(), as this function only works for
> events active on the current CPU.
> 
>> +{
>> +	if (!event)
>> +		return -EINVAL;
>> +
>> +	if (unlikely(event->state != PERF_EVENT_STATE_ACTIVE))
>> +		return -EINVAL;
> 
> You can return perf_event_count() in that case.
> 
>> +
>> +	if (event->oncpu != raw_smp_processor_id() &&
> 
> That _must_ be smp_processor_id(). If that gives a warning (ie.
> preemption is not disabled or we're not affine to this one cpu) then the
> warning is valid.
> 
>> +	    event->ctx->task != current)
> 
> Write it like:
> 
> 	if (event->ctx->task != current &&
> 	    event->oncpu != smp_processor_id())
> 
> That way you'll not evaluate smp_processor_id() for current task events.
> 
>> +		return -EINVAL;
>> +
>> +	if (unlikely(event->attr.inherit))
>> +		return -EINVAL;
> 
> This should be in your accept function, inherited events should never
> get this far.
> 
> You need IRQs disabled while calling __perf_event_read(), did you test
> with lockdep enabled?

Thanks for your review! I have not tested with lockdep enabled yet; I will do so from now on.
> 
>> +	__perf_event_read(event);
>> +	return perf_event_count(event);
>> +}
> 
> Also, you probably want a WARN_ON(in_nmi()) there, this function is
> _NOT_ NMI safe.




* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 10:04   ` Peter Zijlstra
  2015-08-05 10:15     ` Peter Zijlstra
  2015-08-05 10:31     ` xiakaixu
@ 2015-08-05 13:53     ` Peter Zijlstra
  2015-08-05 13:59       ` Peter Zijlstra
                         ` (2 more replies)
  2015-08-05 15:59     ` Alexei Starovoitov
  3 siblings, 3 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05 13:53 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Wed, Aug 05, 2015 at 12:04:25PM +0200, Peter Zijlstra wrote:
> Also, you probably want a WARN_ON(in_nmi()) there, this function is
> _NOT_ NMI safe.

I had a wee think about that, and I think the below is safe.

(with the obvious problem that WARN from NMI context is not safe)

It does not give you up-to-date overcommit times but your version didn't
either so I'm assuming you don't need those, if you do need those it
needs more but we can do that too.

---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 53 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809433b3..64e821dd64f0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -659,6 +659,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
 				void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
 				int src_cpu, int dst_cpu);
+extern u64 perf_event_read_local(struct perf_event *event);
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 39753bfd9520..7105d37763c1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3222,6 +3222,59 @@ static inline u64 perf_event_count(struct perf_event *event)
 	return __perf_event_count(event);
 }
 
+/*
+ * NMI-safe method to read a local event, that is an event that
+ * is:
+ *   - either for the current task, or for this CPU
+ *   - does not have inherit set, for inherited task events
+ *     will not be local and we cannot read them atomically
+ *   - must not have a pmu::count method
+ */
+u64 perf_event_read_local(struct perf_event *event)
+{
+	unsigned long flags;
+	u64 val;
+
+	/*
+	 * Disabling interrupts avoids all counter scheduling (context
+	 * switches, timer based rotation and IPIs).
+	 */
+	local_irq_safe(flags);
+
+	/* If this is a per-task event, it must be for current */
+	WARN_ON_ONCE((event->attach_state & PERF_ATTACH_TASK) &&
+		     event->hw.target != current);
+
+	/* If this is a per-CPU event, it must be for this CPU */
+	WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK) &&
+		     event->cpu != smp_processor_id());
+
+	/*
+	 * It must not be an event with inherit set, we cannot read
+	 * all child counters from atomic context.
+	 */
+	WARN_ON_ONCE(event->attr.inherit);
+
+	/*
+	 * It must not have a pmu::count method, those are not
+	 * NMI safe.
+	 */
+	WARN_ON_ONCE(event->pmu->count);
+
+	/*
+	 * If the event is currently on this CPU, its either a per-task event,
+	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
+	 * oncpu == -1).
+	 */
+	if (event->oncpu == smp_processor_id())
+		event->pmu->read(event);
+
+	val = local64_read(&event->count);
+	local_irq_restore(flags);
+
+	return val;
+}
+
 static u64 perf_event_read(struct perf_event *event)
 {
 	/*


* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 13:53     ` Peter Zijlstra
@ 2015-08-05 13:59       ` Peter Zijlstra
  2015-08-05 16:08       ` Alexei Starovoitov
  2015-08-06  2:49       ` xiakaixu
  2 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05 13:59 UTC (permalink / raw)
  To: Kaixu Xia
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Wed, Aug 05, 2015 at 03:53:17PM +0200, Peter Zijlstra wrote:
> +/*
> + * NMI-safe method to read a local event, that is an event that
> + * is:
> + *   - either for the current task, or for this CPU
> + *   - does not have inherit set, for inherited task events
> + *     will not be local and we cannot read them atomically
> + *   - must not have a pmu::count method
> + */
> +u64 perf_event_read_local(struct perf_event *event)
> +{
> +	unsigned long flags;
> +	u64 val;
> +
> +	/*
> +	 * Disabling interrupts avoids all counter scheduling (context
> +	 * switches, timer based rotation and IPIs).
> +	 */
> +	local_irq_safe(flags);

Hmm, I think local_irq_save(flags) will compile much better. Obviously
this patch hasn't seen a compiler up close.


* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 10:04   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2015-08-05 13:53     ` Peter Zijlstra
@ 2015-08-05 15:59     ` Alexei Starovoitov
  3 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2015-08-05 15:59 UTC (permalink / raw)
  To: Peter Zijlstra, Kaixu Xia
  Cc: davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel, wangnan0,
	linux-kernel, pi3orama, hekuang, netdev

On 8/5/15 3:04 AM, Peter Zijlstra wrote:
>> >+	__perf_event_read(event);
>> >+	return perf_event_count(event);
>> >+}
> Also, you probably want a WARN_ON(in_nmi()) there, this function is
> _NOT_  NMI safe.

we check that very early on:
unsigned int trace_call_bpf(struct bpf_prog *prog, void *ctx)
{
         unsigned int ret;

         if (in_nmi()) /* not supported yet */
                 return 1;
...



* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 10:15     ` Peter Zijlstra
@ 2015-08-05 16:00       ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2015-08-05 16:00 UTC (permalink / raw)
  To: Peter Zijlstra, Kaixu Xia
  Cc: davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel, wangnan0,
	linux-kernel, pi3orama, hekuang, netdev

On 8/5/15 3:15 AM, Peter Zijlstra wrote:
> On Wed, Aug 05, 2015 at 12:04:25PM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 04, 2015 at 08:58:15AM +0000, Kaixu Xia wrote:
>
>>> +	    event->ctx->task != current)
>
> Strictly speaking we should hold rcu_read_lock around dereferencing
> event->ctx (or have IRQs disabled -- although I know Paul doesn't like
> us relying on that).

programs are always executing under rcu_read_lock, so we should be good
here too.



* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 13:53     ` Peter Zijlstra
  2015-08-05 13:59       ` Peter Zijlstra
@ 2015-08-05 16:08       ` Alexei Starovoitov
  2015-08-05 16:21         ` Peter Zijlstra
  2015-08-06  2:49       ` xiakaixu
  2 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2015-08-05 16:08 UTC (permalink / raw)
  To: Peter Zijlstra, Kaixu Xia
  Cc: davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel, wangnan0,
	linux-kernel, pi3orama, hekuang, netdev

On 8/5/15 6:53 AM, Peter Zijlstra wrote:
> +	/*
> +	 * If the event is currently on this CPU, its either a per-task event,
> +	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
> +	 * oncpu == -1).
> +	 */
> +	if (event->oncpu == smp_processor_id())
> +		event->pmu->read(event);
> +
> +	val = local64_read(&event->count);
> +	local_irq_restore(flags);
> +

nice! cleaner and faster.
so raw_spin_lock(&ctx->lock) is not needed, because
update_*(event) methods are not called, right?



* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 16:08       ` Alexei Starovoitov
@ 2015-08-05 16:21         ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-08-05 16:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kaixu Xia, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On Wed, Aug 05, 2015 at 09:08:32AM -0700, Alexei Starovoitov wrote:
> On 8/5/15 6:53 AM, Peter Zijlstra wrote:
> >+	/*
> >+	 * If the event is currently on this CPU, its either a per-task event,
> >+	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
> >+	 * oncpu == -1).
> >+	 */
> >+	if (event->oncpu == smp_processor_id())
> >+		event->pmu->read(event);
> >+
> >+	val = local64_read(&event->count);
> >+	local_irq_restore(flags);
> >+
> 
> nice! cleaner and faster.
> so raw_spin_lock(&ctx->lock) is not needed, because
> update_*(event) methods are not called, right?

Indeed, and by ensuring the event is indeed local (by force of WARN_ON)
disabling IRQs will avoid counter scheduling and result in a stable
event state.
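
The read path discussed here can be modelled in user space roughly as follows. This is a sketch under stated assumptions: the structs are hypothetical, `hw_pending` stands in for whatever the hardware counter has accumulated since the last update, and the IRQ disabling that makes the real function safe is deliberately left out because it has no user-space equivalent.

```c
#include <stdint.h>

struct event;

/* Hypothetical stand-in for the pmu callback table. */
struct pmu {
	void (*read)(struct event *event);
};

struct event {
	const struct pmu *pmu;
	int oncpu;		/* -1 when the event is not active */
	uint64_t count;		/* last value folded in by the pmu */
	uint64_t hw_pending;	/* models the not yet folded hardware delta */
};

/* Models pmu->read(): fold the pending hardware delta into the count. */
static void fold_hw_delta(struct event *event)
{
	event->count += event->hw_pending;
	event->hw_pending = 0;
}

static const struct pmu demo_pmu = { .read = fold_hw_delta };

/*
 * Refresh the count only when the event is live on this CPU; otherwise
 * the stored count is already the most recent value available locally.
 */
static uint64_t event_read_local(struct event *event, int this_cpu)
{
	if (event->oncpu == this_cpu)
		event->pmu->read(event);

	return event->count;
}
```

An event that is active on the calling CPU gets its count refreshed before it is returned; an inactive event (oncpu == -1) simply reports the count saved when it was last scheduled out.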




* Re: [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter
  2015-08-05 13:53     ` Peter Zijlstra
  2015-08-05 13:59       ` Peter Zijlstra
  2015-08-05 16:08       ` Alexei Starovoitov
@ 2015-08-06  2:49       ` xiakaixu
  2 siblings, 0 replies; 22+ messages in thread
From: xiakaixu @ 2015-08-06  2:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: ast, davem, acme, mingo, masami.hiramatsu.pt, jolsa, daniel,
	wangnan0, linux-kernel, pi3orama, hekuang, netdev

On 2015/8/5 21:53, Peter Zijlstra wrote:
> On Wed, Aug 05, 2015 at 12:04:25PM +0200, Peter Zijlstra wrote:
>> Also, you probably want a WARN_ON(in_nmi()) there, this function is
>> _NOT_ NMI safe.
> 
> I had a wee think about that, and I think the below is safe.
> 
> (with the obvious problem that WARN from NMI context is not safe)
> 
> It does not give you up-to-date overcommit times but your version didn't
> either so I'm assuming you don't need those, if you do need those it
> needs more but we can do that too.
> 
> ---
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c       | 53 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2027809433b3..64e821dd64f0 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -659,6 +659,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
>  				void *context);
>  extern void perf_pmu_migrate_context(struct pmu *pmu,
>  				int src_cpu, int dst_cpu);
> +extern u64 perf_event_read_local(struct perf_event *event);
>  extern u64 perf_event_read_value(struct perf_event *event,
>  				 u64 *enabled, u64 *running);
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 39753bfd9520..7105d37763c1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3222,6 +3222,59 @@ static inline u64 perf_event_count(struct perf_event *event)
>  	return __perf_event_count(event);
>  }
>  
> +/*
> + * NMI-safe method to read a local event, that is an event that
> + * is:
> + *   - either for the current task, or for this CPU
> + *   - does not have inherit set, for inherited task events
> + *     will not be local and we cannot read them atomically
> + *   - must not have a pmu::count method
> + */
> +u64 perf_event_read_local(struct perf_event *event)
> +{
> +	unsigned long flags;
> +	u64 val;
> +
> +	/*
> +	 * Disabling interrupts avoids all counter scheduling (context
> +	 * switches, timer based rotation and IPIs).
> +	 */
> +	local_irq_safe(flags);

s/local_irq_safe/local_irq_save, and I have compiled and tested this function
and it is fine. Will use it in the next set.

Thanks.
> +
> +	/* If this is a per-task event, it must be for current */
> +	WARN_ON_ONCE((event->attach_state & PERF_ATTACH_TASK) &&
> +		     event->hw.target != current);
> +
> +	/* If this is a per-CPU event, it must be for this CPU */
> +	WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK) &&
> +		     event->cpu != smp_processor_id());
> +
> +	/*
> +	 * It must not be an event with inherit set, we cannot read
> +	 * all child counters from atomic context.
> +	 */
> +	WARN_ON_ONCE(event->attr.inherit);
> +
> +	/*
> +	 * It must not have a pmu::count method, those are not
> +	 * NMI safe.
> +	 */
> +	WARN_ON_ONCE(event->pmu->count);
> +
> +	/*
> +	 * If the event is currently on this CPU, its either a per-task event,
> +	 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
> +	 * oncpu == -1).
> +	 */
> +	if (event->oncpu == smp_processor_id())
> +		event->pmu->read(event);
> +
> +	val = local64_read(&event->count);
> +	local_irq_restore(flags);
> +
> +	return val;
> +}
> +
>  static u64 perf_event_read(struct perf_event *event)
>  {
>  	/*




end of thread, other threads:[~2015-08-06  2:49 UTC | newest]

Thread overview: 22+ messages
2015-08-04  8:58 [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Kaixu Xia
2015-08-04  8:58 ` [PATCH v6 1/4] bpf: Make the bpf_prog_array_map more generic Kaixu Xia
2015-08-04  8:58 ` [PATCH v6 2/4] bpf: Add new bpf map type to store the pointer to struct perf_event Kaixu Xia
2015-08-04 17:43   ` Alexei Starovoitov
2015-08-05  9:41     ` Peter Zijlstra
2015-08-05  9:39   ` Peter Zijlstra
2015-08-05  9:43   ` Peter Zijlstra
2015-08-04  8:58 ` [PATCH v6 3/4] bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter Kaixu Xia
2015-08-04 17:55   ` Alexei Starovoitov
2015-08-05  2:09     ` xiakaixu
2015-08-05 10:04   ` Peter Zijlstra
2015-08-05 10:15     ` Peter Zijlstra
2015-08-05 16:00       ` Alexei Starovoitov
2015-08-05 10:31     ` xiakaixu
2015-08-05 13:53     ` Peter Zijlstra
2015-08-05 13:59       ` Peter Zijlstra
2015-08-05 16:08       ` Alexei Starovoitov
2015-08-05 16:21         ` Peter Zijlstra
2015-08-06  2:49       ` xiakaixu
2015-08-05 15:59     ` Alexei Starovoitov
2015-08-04  8:58 ` [PATCH v6 4/4] samples/bpf: example of get selected PMU counter value Kaixu Xia
2015-08-05 10:06 ` [PATCH v6 0/4] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter Peter Zijlstra
