* [PATCH bpf-next 0/2] Two libbpf RB walk improvements
@ 2018-10-11 14:02 Daniel Borkmann
2018-10-11 14:02 ` [PATCH bpf-next 1/2] bpf, libbpf: use proper barriers in perf RB walk Daniel Borkmann
2018-10-11 14:02 ` [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates Daniel Borkmann
0 siblings, 2 replies; 7+ messages in thread
From: Daniel Borkmann @ 2018-10-11 14:02 UTC (permalink / raw)
To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
Two small improvements to libbpf's perf RB walk: the first one fixes
the barriers used and the second one simplifies the ring walking code.
For details see the individual patches.
Thanks!
Daniel Borkmann (2):
bpf, libbpf: use proper barriers in perf RB walk
bpf, libbpf: simplify perf RB walk and do incremental updates
tools/bpf/bpftool/map_perf_ring.c | 10 ++-
tools/lib/bpf/libbpf.c | 117 ++++++++++++++++++----------
tools/lib/bpf/libbpf.h | 14 ++--
tools/testing/selftests/bpf/trace_helpers.c | 7 +-
4 files changed, 93 insertions(+), 55 deletions(-)
--
2.9.5
* [PATCH bpf-next 1/2] bpf, libbpf: use proper barriers in perf RB walk
2018-10-11 14:02 [PATCH bpf-next 0/2] Two libbpf RB walk improvements Daniel Borkmann
@ 2018-10-11 14:02 ` Daniel Borkmann
2018-10-11 14:02 ` [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates Daniel Borkmann
1 sibling, 0 replies; 7+ messages in thread
From: Daniel Borkmann @ 2018-10-11 14:02 UTC (permalink / raw)
To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
Use a proper CPU barrier instead of just a compiler barrier when fetching
the ring's data_head in bpf_perf_event_read_simple(); the current code is
not correct. Also, add two small helpers bpf_perf_read_head() and
bpf_perf_write_tail() to make the barriers in use more obvious, and a
comment on what they pair with.
Fixes: d0cabbb021be ("tools: bpf: move the event reading loop to libbpf")
Fixes: 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
tools/lib/bpf/libbpf.c | 52 ++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 46 insertions(+), 6 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 176cf55..1ac8856 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -20,6 +20,7 @@
#include <fcntl.h>
#include <errno.h>
#include <asm/unistd.h>
+#include <asm/barrier.h>
#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/bpf.h>
@@ -27,6 +28,7 @@
#include <linux/list.h>
#include <linux/limits.h>
#include <linux/perf_event.h>
+#include <linux/compiler.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/vfs.h>
@@ -2404,18 +2406,58 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
return 0;
}
+/*
+ * Comment from kernel/events/ring_buffer.c:
+ *
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ * kernel user
+ *
+ * if (LOAD ->data_tail) { LOAD ->data_head
+ * (A) smp_rmb() (C)
+ * STORE $data LOAD $data
+ * smp_wmb() (B) smp_mb() (D)
+ * STORE ->data_head STORE ->data_tail
+ * }
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * In our case (A) is a control dependency that separates the load of
+ * the ->data_tail and the stores of $data. In case ->data_tail
+ * indicates there is no room in the buffer to store $data we do not.
+ *
+ * D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for C
+ * an RMB is sufficient since it separates two READs.
+ */
+static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
+{
+ __u64 data_head = READ_ONCE(header->data_head);
+
+ rmb();
+ return data_head;
+}
+
+static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
+ __u64 data_tail)
+{
+ mb();
+ header->data_tail = data_tail;
+}
+
enum bpf_perf_event_ret
bpf_perf_event_read_simple(void *mem, unsigned long size,
unsigned long page_size, void **buf, size_t *buf_len,
bpf_perf_event_print_t fn, void *priv)
{
- volatile struct perf_event_mmap_page *header = mem;
+ struct perf_event_mmap_page *header = mem;
+ __u64 data_head = bpf_perf_read_head(header);
__u64 data_tail = header->data_tail;
- __u64 data_head = header->data_head;
int ret = LIBBPF_PERF_EVENT_ERROR;
void *base, *begin, *end;
- asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
if (data_head == data_tail)
return LIBBPF_PERF_EVENT_CONT;
@@ -2458,8 +2500,6 @@ bpf_perf_event_read_simple(void *mem, unsigned long size,
data_tail += ehdr->size;
}
- __sync_synchronize(); /* smp_mb() */
- header->data_tail = data_tail;
-
+ bpf_perf_write_tail(header, data_tail);
return ret;
}
--
2.9.5
* [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates
2018-10-11 14:02 [PATCH bpf-next 0/2] Two libbpf RB walk improvements Daniel Borkmann
2018-10-11 14:02 ` [PATCH bpf-next 1/2] bpf, libbpf: use proper barriers in perf RB walk Daniel Borkmann
@ 2018-10-11 14:02 ` Daniel Borkmann
2018-10-12 3:04 ` Jakub Kicinski
1 sibling, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2018-10-11 14:02 UTC (permalink / raw)
To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann
Clean up and improve bpf_perf_event_read_simple() ring walk a bit
to use similar tail update scheme as in perf and bcc allowing the
kernel to make forward progress not only after full timely walk.
Also few other improvements to use realloc() instead of free() and
malloc() combination and for the callback use proper perf_event_header
instead of void pointer, so that real applications can use container_of()
macro with proper type checking.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
tools/bpf/bpftool/map_perf_ring.c | 10 ++--
tools/lib/bpf/libbpf.c | 77 +++++++++++++----------------
tools/lib/bpf/libbpf.h | 14 +++---
tools/testing/selftests/bpf/trace_helpers.c | 7 +--
4 files changed, 53 insertions(+), 55 deletions(-)
diff --git a/tools/bpf/bpftool/map_perf_ring.c b/tools/bpf/bpftool/map_perf_ring.c
index 6d41323..bdaf406 100644
--- a/tools/bpf/bpftool/map_perf_ring.c
+++ b/tools/bpf/bpftool/map_perf_ring.c
@@ -50,15 +50,17 @@ static void int_exit(int signo)
stop = true;
}
-static enum bpf_perf_event_ret print_bpf_output(void *event, void *priv)
+static enum bpf_perf_event_ret
+print_bpf_output(struct perf_event_header *event, void *private_data)
{
- struct event_ring_info *ring = priv;
- struct perf_event_sample *e = event;
+ struct perf_event_sample *e = container_of(event, struct perf_event_sample,
+ header);
+ struct event_ring_info *ring = private_data;
struct {
struct perf_event_header header;
__u64 id;
__u64 lost;
- } *lost = event;
+ } *lost = (typeof(lost))event;
if (json_output) {
jsonw_start_object(json_wtr);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1ac8856..35f4a77 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2448,58 +2448,51 @@ static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
}
enum bpf_perf_event_ret
-bpf_perf_event_read_simple(void *mem, unsigned long size,
- unsigned long page_size, void **buf, size_t *buf_len,
- bpf_perf_event_print_t fn, void *priv)
-{
- struct perf_event_mmap_page *header = mem;
- __u64 data_head = bpf_perf_read_head(header);
- __u64 data_tail = header->data_tail;
- int ret = LIBBPF_PERF_EVENT_ERROR;
- void *base, *begin, *end;
-
- if (data_head == data_tail)
- return LIBBPF_PERF_EVENT_CONT;
-
- base = ((char *)header) + page_size;
-
- begin = base + data_tail % size;
- end = base + data_head % size;
-
- while (begin != end) {
- struct perf_event_header *ehdr;
-
- ehdr = begin;
- if (begin + ehdr->size > base + size) {
- long len = base + size - begin;
-
- if (*buf_len < ehdr->size) {
- free(*buf);
- *buf = malloc(ehdr->size);
- if (!*buf) {
+bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size,
+ void **copy_mem, size_t *copy_size,
+ bpf_perf_event_print_t fn, void *private_data)
+{
+ struct perf_event_mmap_page *header = mmap_mem;
+ void *base = ((__u8 *)header) + page_size;
+ int ret = LIBBPF_PERF_EVENT_CONT;
+ struct perf_event_header *ehdr;
+ __u64 data_head, data_tail;
+ size_t ehdr_size;
+
+ for (data_tail = header->data_tail,
+ data_head = bpf_perf_read_head(header);
+ data_head != data_tail;
+ data_head = bpf_perf_read_head(header)) {
+ ehdr = base + (data_tail & (mmap_size - 1));
+ ehdr_size = ehdr->size;
+ if (((void *)ehdr) + ehdr_size > base + mmap_size) {
+ void *copy_start = ehdr;
+ size_t len_first = base + mmap_size - copy_start;
+ size_t len_secnd = ehdr_size - len_first;
+
+ if (*copy_size < ehdr_size) {
+ void *ptr = realloc(*copy_mem, ehdr_size);
+
+ if (!ptr) {
ret = LIBBPF_PERF_EVENT_ERROR;
break;
}
- *buf_len = ehdr->size;
+ *copy_mem = ptr;
+ *copy_size = ehdr_size;
}
- memcpy(*buf, begin, len);
- memcpy(*buf + len, base, ehdr->size - len);
- ehdr = (void *)*buf;
- begin = base + ehdr->size - len;
- } else if (begin + ehdr->size == base + size) {
- begin = base;
- } else {
- begin += ehdr->size;
+ memcpy(*copy_mem, copy_start, len_first);
+ memcpy(*copy_mem + len_first, base, len_secnd);
+ ehdr = *copy_mem;
}
- ret = fn(ehdr, priv);
+ ret = fn(ehdr, private_data);
+
+ data_tail += ehdr_size;
+ bpf_perf_write_tail(header, data_tail);
if (ret != LIBBPF_PERF_EVENT_CONT)
break;
-
- data_tail += ehdr->size;
}
- bpf_perf_write_tail(header, data_tail);
return ret;
}
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 8af8d36..16b5c95 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -284,12 +284,14 @@ enum bpf_perf_event_ret {
LIBBPF_PERF_EVENT_CONT = -2,
};
-typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(void *event,
- void *priv);
-int bpf_perf_event_read_simple(void *mem, unsigned long size,
- unsigned long page_size,
- void **buf, size_t *buf_len,
- bpf_perf_event_print_t fn, void *priv);
+struct perf_event_header;
+typedef enum bpf_perf_event_ret
+ (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
+ void *private_data);
+enum bpf_perf_event_ret
+bpf_perf_event_read_simple(void *mmap_mem, size_t mmap_size, size_t page_size,
+ void **copy_mem, size_t *copy_size,
+ bpf_perf_event_print_t fn, void *private_data);
struct nlattr;
typedef int (*libbpf_dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb);
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index cabe2a3..1764093 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -124,10 +124,11 @@ struct perf_event_sample {
char data[];
};
-static enum bpf_perf_event_ret bpf_perf_event_print(void *event, void *priv)
+static enum bpf_perf_event_ret
+bpf_perf_event_print(struct perf_event_header *hdr, void *private_data)
{
- struct perf_event_sample *e = event;
- perf_event_print_fn fn = priv;
+ struct perf_event_sample *e = (struct perf_event_sample *)hdr;
+ perf_event_print_fn fn = private_data;
int ret;
if (e->header.type == PERF_RECORD_SAMPLE) {
--
2.9.5
* Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates
2018-10-11 14:02 ` [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates Daniel Borkmann
@ 2018-10-12 3:04 ` Jakub Kicinski
2018-10-12 8:39 ` Daniel Borkmann
0 siblings, 1 reply; 7+ messages in thread
From: Jakub Kicinski @ 2018-10-12 3:04 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: alexei.starovoitov, netdev
On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
> to use similar tail update scheme as in perf and bcc allowing the
> kernel to make forward progress not only after full timely walk.
The extra memory barriers won't impact performance? If I read the code
correctly we now have:
while (bla) {
	head = HEAD
	rmb()

	...

	mb()
	TAIL = tail
}
Would it make sense to try to piggy back on the mb() for head re-read
at least? Perhaps that's a non-issue, just wondering.
> Also few other improvements to use realloc() instead of free() and
> malloc() combination and for the callback use proper perf_event_header
> instead of void pointer, so that real applications can use container_of()
> macro with proper type checking.
FWIW the free() + malloc() was to avoid the needless copy of the
previous event realloc() may do. It makes sense to use realloc()
especially if you want to put extra info in front of the buffer, just
sayin' it wasn't a complete braino ;)
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates
2018-10-12 3:04 ` Jakub Kicinski
@ 2018-10-12 8:39 ` Daniel Borkmann
2018-10-12 13:30 ` Daniel Borkmann
0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2018-10-12 8:39 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: alexei.starovoitov, netdev
On 10/12/2018 05:04 AM, Jakub Kicinski wrote:
> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
>> to use similar tail update scheme as in perf and bcc allowing the
>> kernel to make forward progress not only after full timely walk.
>
> The extra memory barriers won't impact performance? If I read the code
> correctly we now have:
>
> while (bla) {
> head = HEAD
> rmb()
>
> ...
>
> mb()
> TAIL = tail
> }
>
> Would it make sense to try to piggy back on the mb() for head re-read
> at least? Perhaps that's a non-issue, just wondering.
From the scheme specified in the comment in prior patch my understanding
would be that they don't pair (see B and C) so there would be no guarantee
that load of head would happen before load of data. Fwiw, I've been using
the exact same semantics as user space perf tool walks the perf mmap'ed
ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop
pushing into ring buffer while user space walks it and indicates how far
it has consumed data via tail update, it would allow for making room
successively and not only after full run has complete, so we don't make
any assumptions in the generic libbpf library helper on how slow/quick
the callback would be processing resp. how full ring is, etc, and kernel
pushing new data can be processed in the same run if necessary. One thing
we could consider is to batch tail updates, say, every 8 elements and a
final update once we break out walking the ring; probably okay'ish as a
heuristic..
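Just to illustrate, such a batching heuristic could look roughly like the
below; walk_ring_batched() and BATCH are made-up names for the sketch, the
wrap-around copy for records crossing the end of the buffer is left out,
and the helpers are the ones added in patch 1:

#include <stddef.h>
#include <linux/perf_event.h>

#define BATCH	8

/* From patch 1: rmb() after loading data_head, mb() before storing
 * data_tail; definitions not repeated here.
 */
__u64 bpf_perf_read_head(struct perf_event_mmap_page *header);
void bpf_perf_write_tail(struct perf_event_mmap_page *header,
			 __u64 data_tail);

/* mmap_size is the size of the data pages (power of two), excluding
 * the leading control page.
 */
void walk_ring_batched(struct perf_event_mmap_page *header, void *base,
		       size_t mmap_size,
		       void (*process)(struct perf_event_header *ehdr))
{
	__u64 data_tail = header->data_tail;
	__u64 data_head = bpf_perf_read_head(header);
	unsigned int pending = 0;

	while (data_head != data_tail) {
		struct perf_event_header *ehdr;

		ehdr = base + (data_tail & (mmap_size - 1));
		process(ehdr);
		data_tail += ehdr->size;

		/* Publish consumed space only every BATCH-th record ... */
		if (++pending == BATCH) {
			bpf_perf_write_tail(header, data_tail);
			pending = 0;
		}
		/* ... but re-read head each time so data the kernel pushed
		 * meanwhile is still picked up in this run.
		 */
		data_head = bpf_perf_read_head(header);
	}
	/* Final update so the kernel sees everything as consumed. */
	if (pending)
		bpf_perf_write_tail(header, data_tail);
}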
>> Also few other improvements to use realloc() instead of free() and
>> malloc() combination and for the callback use proper perf_event_header
>> instead of void pointer, so that real applications can use container_of()
>> macro with proper type checking.
>
> FWIW the free() + malloc() was to avoid the the needless copy of the
> previous event realloc() may do. It makes sense to use realloc()
> especially if you want to put extra info in front of the buffer, just
> sayin' it wasn't a complete braino ;)
No strong preference from my side, I'd think that it might be sensible in
any case from applications to call the bpf_perf_event_read_simple() with a
already preallocated buffer, depending on the expected max element size from
BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
the BPF prog and MTU 1500 this would more than suffice to avoid new
allocations altogether. Anyway, given we only grow the new memory area I
did some testing on realloc() with an array of pointers to prior malloc()'ed
buffers, running randomly 10M realloc()s to increase size over them and
saw <1% where area had to be moved, so we're hitting corner case of a corner
case, I'm also ok to leave the combination, though. :)
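For illustration, handing in a preallocated, page-sized copy buffer would
look roughly like the below; drain_ring() and print_event() are placeholder
names, the mmap setup is assumed to exist in the application, and the
prototype is the one after patch 2:

#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <bpf/libbpf.h>

enum bpf_perf_event_ret print_event(struct perf_event_header *hdr,
				    void *private_data);

/* mmap_mem: the perf mmap'ed area, mmap_size: size of the data pages
 * (power of two), excluding the leading control page.
 */
int drain_ring(void *mmap_mem, size_t mmap_size)
{
	size_t page_size = sysconf(_SC_PAGESIZE);
	size_t copy_size = page_size;
	void *copy_mem = malloc(copy_size);
	enum bpf_perf_event_ret ret;

	if (!copy_mem)
		return -ENOMEM;

	/* The realloc() path inside is only hit for records larger
	 * than the preallocated page.
	 */
	ret = bpf_perf_event_read_simple(mmap_mem, mmap_size, page_size,
					 &copy_mem, &copy_size,
					 print_event, NULL);
	free(copy_mem);
	return ret == LIBBPF_PERF_EVENT_ERROR ? -1 : 0;
}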
Thanks,
Daniel
* Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates
2018-10-12 8:39 ` Daniel Borkmann
@ 2018-10-12 13:30 ` Daniel Borkmann
2018-10-12 15:39 ` Jakub Kicinski
0 siblings, 1 reply; 7+ messages in thread
From: Daniel Borkmann @ 2018-10-12 13:30 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: alexei.starovoitov, netdev
On 10/12/2018 10:39 AM, Daniel Borkmann wrote:
> On 10/12/2018 05:04 AM, Jakub Kicinski wrote:
>> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
>>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
>>> to use similar tail update scheme as in perf and bcc allowing the
>>> kernel to make forward progress not only after full timely walk.
>>
>> The extra memory barriers won't impact performance? If I read the code
>> correctly we now have:
>>
>> while (bla) {
>> head = HEAD
>> rmb()
>>
>> ...
>>
>> mb()
>> TAIL = tail
>> }
>>
>> Would it make sense to try to piggy back on the mb() for head re-read
>> at least? Perhaps that's a non-issue, just wondering.
>
> From the scheme specified in the comment in prior patch my understanding
> would be that they don't pair (see B and C) so there would be no guarantee
> that load of head would happen before load of data. Fwiw, I've been using
> the exact same semantics as user space perf tool walks the perf mmap'ed
> ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop
On that note, I'll also respin: after some clarification with PeterZ on
why perf is using {rmb,mb}() barriers today as opposed to the more
lightweight smp_{rmb,mb}() ones, it turns out there is no real reason
other than a historic one, and perf can be changed and made more
efficient as well. ;)
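Roughly, the helpers would then become the below, assuming smp_rmb() and
smp_mb() get wired up on the tools include side first (they are not there
today, so take this as a sketch only):

#include <linux/compiler.h>
#include <linux/perf_event.h>
#include <asm/barrier.h>	/* would need to provide smp_{rmb,mb}() */

static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
{
	__u64 data_head = READ_ONCE(header->data_head);

	smp_rmb();	/* pairs with smp_wmb() (B) on the kernel side */
	return data_head;
}

static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
				__u64 data_tail)
{
	smp_mb();	/* pairs with the load of ->data_tail (A) in the kernel */
	header->data_tail = data_tail;
}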
> pushing into ring buffer while user space walks it and indicates how far
> it has consumed data via tail update, it would allow for making room
> successively and not only after full run has complete, so we don't make
> any assumptions in the generic libbpf library helper on how slow/quick
> the callback would be processing resp. how full ring is, etc, and kernel
> pushing new data can be processed in the same run if necessary. One thing
> we could consider is to batch tail updates, say, every 8 elements and a
> final update once we break out walking the ring; probably okay'ish as a
> heuristic..
>
>>> Also few other improvements to use realloc() instead of free() and
>>> malloc() combination and for the callback use proper perf_event_header
>>> instead of void pointer, so that real applications can use container_of()
>>> macro with proper type checking.
>>
>> FWIW the free() + malloc() was to avoid the the needless copy of the
>> previous event realloc() may do. It makes sense to use realloc()
>> especially if you want to put extra info in front of the buffer, just
>> sayin' it wasn't a complete braino ;)
>
> No strong preference from my side, I'd think that it might be sensible in
> any case from applications to call the bpf_perf_event_read_simple() with a
> already preallocated buffer, depending on the expected max element size from
> BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
> the BPF prog and MTU 1500 this would more than suffice to avoid new
> allocations altogether. Anyway, given we only grow the new memory area I
> did some testing on realloc() with an array of pointers to prior malloc()'ed
> buffers, running randomly 10M realloc()s to increase size over them and
> saw <1% where area had to be moved, so we're hitting corner case of a corner
> case, I'm also ok to leave the combination, though. :)
>
> Thanks,
> Daniel
* Re: [PATCH bpf-next 2/2] bpf, libbpf: simplify perf RB walk and do incremental updates
2018-10-12 13:30 ` Daniel Borkmann
@ 2018-10-12 15:39 ` Jakub Kicinski
0 siblings, 0 replies; 7+ messages in thread
From: Jakub Kicinski @ 2018-10-12 15:39 UTC (permalink / raw)
To: Daniel Borkmann; +Cc: alexei.starovoitov, netdev
On Fri, 12 Oct 2018 15:30:57 +0200, Daniel Borkmann wrote:
> On 10/12/2018 10:39 AM, Daniel Borkmann wrote:
> > On 10/12/2018 05:04 AM, Jakub Kicinski wrote:
> >> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote:
> >>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit
> >>> to use similar tail update scheme as in perf and bcc allowing the
> >>> kernel to make forward progress not only after full timely walk.
> >>
> >> The extra memory barriers won't impact performance? If I read the code
> >> correctly we now have:
> >>
> >> while (bla) {
> >> head = HEAD
> >> rmb()
> >>
> >> ...
> >>
> >> mb()
> >> TAIL = tail
> >> }
> >>
> >> Would it make sense to try to piggy back on the mb() for head re-read
> >> at least? Perhaps that's a non-issue, just wondering.
> >
> > From the scheme specified in the comment in prior patch my understanding
> > would be that they don't pair (see B and C) so there would be no guarantee
> > that load of head would happen before load of data. Fwiw, I've been using
> > the exact same semantics as user space perf tool walks the perf mmap'ed
> > ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop
>
> On that note, I'll also respin, after some clarification with PeterZ on
> why perf is using {rmb,mb}() barriers today as opposed to more lightweight
> smp_{rmb,mb}() ones it turns out there is no real reason other than
> historic one and perf can be changed and made more efficient as well. ;)
Cool, to be clear I meant something like:
head = HEAD
rmb()

while (bla) {
	...

	head = HEAD
	mb()
	TAIL = tail
}
> > pushing into ring buffer while user space walks it and indicates how far
> > it has consumed data via tail update, it would allow for making room
> > successively and not only after full run has complete, so we don't make
> > any assumptions in the generic libbpf library helper on how slow/quick
> > the callback would be processing resp. how full ring is, etc, and kernel
> > pushing new data can be processed in the same run if necessary. One thing
> > we could consider is to batch tail updates, say, every 8 elements and a
> > final update once we break out walking the ring; probably okay'ish as a
> > heuristic..
Makes sense, I would be tempted to do some batching, but since I'm
mostly working with I/O devices I may think that memory barriers cost
more than they do for main memory-only accesses :)
> >>> Also few other improvements to use realloc() instead of free() and
> >>> malloc() combination and for the callback use proper perf_event_header
> >>> instead of void pointer, so that real applications can use container_of()
> >>> macro with proper type checking.
> >>
> >> FWIW the free() + malloc() was to avoid the the needless copy of the
> >> previous event realloc() may do. It makes sense to use realloc()
> >> especially if you want to put extra info in front of the buffer, just
> >> sayin' it wasn't a complete braino ;)
> >
> > No strong preference from my side, I'd think that it might be sensible in
> > any case from applications to call the bpf_perf_event_read_simple() with a
> > already preallocated buffer, depending on the expected max element size from
> > BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from
> > the BPF prog and MTU 1500 this would more than suffice to avoid new
> > allocations altogether. Anyway, given we only grow the new memory area I
> > did some testing on realloc() with an array of pointers to prior malloc()'ed
> > buffers, running randomly 10M realloc()s to increase size over them and
> > saw <1% where area had to be moved, so we're hitting corner case of a corner
> > case, I'm also ok to leave the combination, though. :)
Thanks for doing the experiment, no strong preference here either, so
realloc() seems good!