Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Arnaldo Carvalho de Melo @ 2018-10-17 12:50 UTC (permalink / raw)
  To: Song Liu
  Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, Alexey Budankov, David S . Miller,
	Daniel Borkmann, Namhyung Kim, Jiri Olsa, Networking, kernel-team
In-Reply-To: <20181017121140.GA31465@kernel.org>

Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu:
> Adding Alexey, Jiri and Namhyung as they worked/are working on
> multithreading 'perf record'.
> 
> Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu:
> > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote:
> > > On 10/15/18 4:33 PM, Song Liu wrote:
> > > > I am working with Alexei on the idea of fetching BPF program information via
> > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > > to perf_event_type, and dumped these events to perf event ring buffer.
> 
> > > > I found that perf will not process event until the end of perf-record:
> 
> > > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > > ...... 10 seconds later
> > > > [ perf record: Woken up 34 times to write data ]
> > > > machine__process_bpf_event: prog_id 6 loaded
> > > > machine__process_bpf_event: prog_id 6 unloaded
> > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> 
> > > > In this example, the bpf program was loaded and then unloaded in
> > > > another terminal. When machine__process_bpf_event() processes
> > > > the load event, the bpf program is already unloaded. Therefore,
> > > > machine__process_bpf_event() will not be able to get information
> > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> 
> > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > > as soon as perf get the event from kernel. I looked around the perf
> > > > code for a while. But I haven't found a good example where some
> > > > events are processed before the end of perf-record. Could you
> > > > please help me with this?
> 
> > > perf record does not process events as they are generated. Its sole job
> > > is pushing data from the maps to a file as fast as possible meaning in
> > > bulk based on current read and write locations.
> 
> > > Adding code to process events will add significant overhead to the
> > > record command and will not really solve your race problem.
> 
> > I agree that processing events while recording has significant overhead.
> > In this case, perf user space need to know details about the the jited BPF
> > program. It is impossible to pass all these details to user space through
> > the relatively stable ring_buffer API. Therefore, some processing of the
> > data is necessary (get bpf prog_id from ring buffer, and then fetch program
> > details via BPF_OBJ_GET_INFO_BY_FD.
>  
> > I have some idea on processing important data with relatively low overhead.
> > Let me try implement it.
> 
> Well, you could have a separate thread processing just those kinds of
> events, associate it with a dummy event where you only ask for
> PERF_RECORD_BPF_EVENTs.
> 
> Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
> perf_event_attr:
> 
> [root@seventh ~]# perf record -vv -e dummy sleep 01
> ------------------------------------------------------------
> perf_event_attr:
>   type                             1
>   size                             112
>   config                           0x9
>   { sample_period, sample_freq }   4000
>   sample_type                      IP|TID|TIME|PERIOD
>   disabled                         1
>   inherit                          1

These you would have disabled, no need for
PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT

>   mmap                             1
>   comm                             1
>   task                             1
>   mmap2                            1
>   comm_exec                        1


>   freq                             1
>   enable_on_exec                   1
>   sample_id_all                    1
>   exclude_guest                    1
> ------------------------------------------------------------
> sys_perf_event_open: pid 12046  cpu 0  group_fd -1  flags 0x8 = 4
> sys_perf_event_open: pid 12046  cpu 1  group_fd -1  flags 0x8 = 5
> sys_perf_event_open: pid 12046  cpu 2  group_fd -1  flags 0x8 = 6
> sys_perf_event_open: pid 12046  cpu 3  group_fd -1  flags 0x8 = 8
> mmap size 528384B
> perf event ring buffer mmapped per cpu
> Synthesizing TSC conversion information
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.014 MB perf.data ]
> [root@seventh ~]#
> 
> [root@seventh ~]# perf evlist -v
> dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1
> [root@seventh ~]# 
> 
> There is work ongoing in dumping one file per cpu and then, at post
> processing time merging all those files to get ordering, so one more
> file, for these VIP events, that require per-event processing would be
> ordered at that time with all the other per-cpu files.
> 
> - Arnaldo

^ permalink raw reply

* [PATCH net] sctp: fix the data size calculation in sctp_data_size
From: Xin Long @ 2018-10-17 13:11 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

sctp data size should be calculated by subtracting data chunk header's
length from chunk_hdr->length, not just data header.

Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 include/net/sctp/sm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 5ef1bad..9e3d327 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -347,7 +347,7 @@ static inline __u16 sctp_data_size(struct sctp_chunk *chunk)
 	__u16 size;
 
 	size = ntohs(chunk->chunk_hdr->length);
-	size -= sctp_datahdr_len(&chunk->asoc->stream);
+	size -= sctp_datachk_len(&chunk->asoc->stream);
 
 	return size;
 }
-- 
2.1.0

^ permalink raw reply related

* Re: Bug in MACSec - stops passing traffic after approx 5TB
From: Josh Coombs @ 2018-10-17 13:45 UTC (permalink / raw)
  To: sd; +Cc: netdev
In-Reply-To: <CACcUnf8kTNEYW8ZisbEYMr+Pu7rmP7pKS=w+-ynUJ_sT75HZ0w@mail.gmail.com>

I've got wpa_supplicant working with macsec on Fedora, my test bed has
shuffled 16 billion packets so far without interruption.  I am a bit
concerned that I've just pushed the resource exhaustion issue down the
road though, looking at the output of ip macsec show I see four SAs
for TX and RX, it appears to negotiate a new pair every 3 to 3.5
billion packets.  It doesn't appear to be ripping down old SAs.  What
happens when available SA slots run out?

Joshua Coombs
GWI

office 207-494-2140
www.gwi.net

On Mon, Oct 15, 2018 at 11:45 AM Josh Coombs <jcoombs@staff.gwi.net> wrote:
>
> And confirmed, starting with a high packet number results in a very
> short testbed run, 296 packets and then nothing, just as you surmised.
> Sorry for raising the alarm falsely.  Looks like I need to roll my own
> build of wpa_supplicant as the ubuntu builds don't include the macsec
> driver, haven't tested Gentoo's ebuilds yet to see if they do.
>
> Josh Coombs
>
> On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs <jcoombs@staff.gwi.net> wrote:
> >
> > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> > >
> > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote:
> > > > I initially mistook this for a traffic control issue, but after
> > > > stripping the test beds down to just the MACSec component, I can still
> > > > replicate the issue.  After approximately 5TB of transfer / 4 billion
> > > > packets over a MACSec link it stops passing traffic.
> > >
> > > I think you're just hitting packet number exhaustion. After 2^32
> > > packets, the packet number would wrap to 0 and start being reused,
> > > which breaks the crypto used by macsec. Before this point, you have to
> > > add a new SA, and tell the macsec device to switch to it.
> >
> > I had not considered that, I naively thought as long as I didn't
> > specify a replay window, it'd roll the PN over on it's own and life
> > would be good.  I'll test that theory tomorrow, should be easy to
> > prove out.
> >
> > > That's why you should be using wpa_supplicant. It will monitor the
> > > growth of the packet number, and handle the rekey for you.
> >
> > Thank you for the heads up, I'll read up on this as well.
> >
> > Josh C

^ permalink raw reply

* my subject
From: test @ 2018-10-17 11:31 UTC (permalink / raw)
  To: Recipients

I am Peter Wong director of operations, Hong Kong and Shanghai Banking
Corporation Limited Hong Kong. I have a very confidential business
proposition involving transfer of $18.350.000.00 that will be of great
benefit for both of us. Reply for more details as regards this
transaction

Best Regards
Peter Wong
ec

^ permalink raw reply

* [PATCH bpf] bpf: fix doc of bpf_skb_adjust_room() in uapi
From: Nicolas Dichtel @ 2018-10-17 14:24 UTC (permalink / raw)
  To: ast, daniel, davem; +Cc: netdev, Nicolas Dichtel, Quentin Monnet

len_diff is signed.

Fixes: fa15601ab31e ("bpf: add documentation for eBPF helpers (33-41)")
CC: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 include/uapi/linux/bpf.h       | 2 +-
 tools/include/uapi/linux/bpf.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..c4ffe91d5598 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1430,7 +1430,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
+ * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
  * 	Description
  * 		Grow or shrink the room for data in the packet associated to
  * 		*skb* by *len_diff*, and according to the selected *mode*.
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..c4ffe91d5598 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1430,7 +1430,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
+ * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
  * 	Description
  * 		Grow or shrink the room for data in the packet associated to
  * 		*skb* by *len_diff*, and according to the selected *mode*.
-- 
2.18.0

^ permalink raw reply related

* [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann

This set first adds smp_* barrier variants to tools infrastructure
and in a second step updates perf and libbpf to make use of them.
For details, please see individual patches, thanks!

Arnaldo, if there are no objections, could this be routed via bpf-next
with Acked-by's due to later dependencies in libbpf? Alternatively,
I could also get the 2nd patch out during merge window, but perhaps
it's okay to do in one go as there shouldn't be much conflict in perf.

Thanks!

Daniel Borkmann (3):
  tools: add smp_* barrier variants to include infrastructure
  tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  bpf, libbpf: use proper barriers in perf ring buffer walk

 tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
 tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
 tools/include/asm/barrier.h            | 11 +++++++++++
 tools/lib/bpf/libbpf.c                 | 25 +++++++++++++++++++------
 tools/perf/util/mmap.h                 |  5 +++--
 5 files changed, 49 insertions(+), 11 deletions(-)

-- 
2.9.5

^ permalink raw reply

* [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann
In-Reply-To: <20181017144156.16639-1-daniel@iogearbox.net>

Add the definition for smp_rmb(), smp_wmb(), and smp_mb() to the
tools include infrastructure. This patch adds the implementation
for x86-64 and arm64, and have it fall back for other archs which
do not have it implemented at this point such that others can be
added successively for those who have access to test machines. The
x86-64 one uses lock + add combination for smp_mb() with address
below red zone.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
 tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
 tools/include/asm/barrier.h            | 11 +++++++++++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 40bde6b..acf1f06 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -14,4 +14,14 @@
 #define wmb()		asm volatile("dmb ishst" ::: "memory")
 #define rmb()		asm volatile("dmb ishld" ::: "memory")
 
+/*
+ * Kernel uses dmb variants on arm64 for smp_*() barriers. Pretty much the same
+ * implementation as above mb()/wmb()/rmb(), though for the latter kernel uses
+ * dsb. In any case, should above mb()/wmb()/rmb() change, make sure the below
+ * smp_*() don't.
+ */
+#define smp_mb()	asm volatile("dmb ish" ::: "memory")
+#define smp_wmb()	asm volatile("dmb ishst" ::: "memory")
+#define smp_rmb()	asm volatile("dmb ishld" ::: "memory")
+
 #endif /* _TOOLS_LINUX_ASM_AARCH64_BARRIER_H */
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 8774dee..c97c0c5 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -21,9 +21,12 @@
 #define rmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define wmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #elif defined(__x86_64__)
-#define mb() 	asm volatile("mfence":::"memory")
-#define rmb()	asm volatile("lfence":::"memory")
-#define wmb()	asm volatile("sfence" ::: "memory")
+#define mb()      asm volatile("mfence" ::: "memory")
+#define rmb()     asm volatile("lfence" ::: "memory")
+#define wmb()     asm volatile("sfence" ::: "memory")
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+#define smp_mb()  asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
 #endif
 
 #endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 391d942..e4c8845 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/compiler.h>
 #if defined(__i386__) || defined(__x86_64__)
 #include "../../arch/x86/include/asm/barrier.h"
 #elif defined(__arm__)
@@ -26,3 +27,13 @@
 #else
 #include <asm-generic/barrier.h>
 #endif
+/* Fallback definitions for archs that haven't been updated yet. */
+#ifndef smp_rmb
+# define smp_rmb()	rmb()
+#endif
+#ifndef smp_wmb
+# define smp_wmb()	wmb()
+#endif
+#ifndef smp_mb
+# define smp_mb()	mb()
+#endif
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann
In-Reply-To: <20181017144156.16639-1-daniel@iogearbox.net>

Switch both rmb()/mb() barriers to more lightweight smp_rmb()/smp_mb()
ones. When walking the perf ring buffer they pair the following way,
quoting kernel/events/ring_buffer.c:

  Since the mmap() consumer (userspace) can run on a different CPU:

    kernel                               user

    if (LOAD ->data_tail) {              LOAD ->data_head
                          (A)            smp_rmb()       (C)
      STORE $data                        LOAD $data
      smp_wmb()           (B)            smp_mb()        (D)
      STORE ->data_head                  STORE ->data_tail
    }

  Where A pairs with D, and B pairs with C.

  In our case (A) is a control dependency that separates the load
  of the ->data_tail and the stores of $data. In case ->data_tail
  indicates there is no room in the buffer to store $data we do not.

  D needs to be a full barrier since it separates the data READ from
  the tail WRITE.

  For B a WMB is sufficient since it separates two WRITEs, and for C
  an RMB is sufficient since it separates two READs.

Currently, on x86-64, perf uses LFENCE and MFENCE which is overkill
as we can do more lightweight in particular given this is fast-path.

According to Peter rmb()/mb() were added back then via a94d342b9cb0
("tools/perf: Add required memory barriers") at a time where kernel
still supported chips that needed it, but nowadays support for these
has been ditched completely, therefore we can fix them up as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/mmap.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..de6dc2e 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->base;
 	u64 head = READ_ONCE(pc->data_head);
-	rmb();
+
+	smp_rmb();
 	return head;
 }
 
@@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 	/*
 	 * ensure all reads are done before we write the tail out.
 	 */
-	mb();
+	smp_mb();
 	pc->data_tail = tail;
 }
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann
In-Reply-To: <20181017144156.16639-1-daniel@iogearbox.net>

Add bpf_perf_read_head() and bpf_perf_write_tail() helpers to make it
more clear in what context barriers are used here, and use smp_rmb()
as well as smp_mb() barriers. Given libbpf is not restricted to x86-64
only, the compiler barrier needs to be replaced with smp_rmb(). Also
the __sync_synchronize() emits mfence whereas faster lock + add can
be used on x86-64 via smp_mb().

Fixes: d0cabbb021be ("tools: bpf: move the event reading loop to libbpf")
Fixes: 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/libbpf.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index bd71efc..e8ae8db 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -20,6 +20,7 @@
 #include <fcntl.h>
 #include <errno.h>
 #include <asm/unistd.h>
+#include <asm/barrier.h>
 #include <linux/err.h>
 #include <linux/kernel.h>
 #include <linux/bpf.h>
@@ -2413,18 +2414,32 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
 	return 0;
 }
 
+static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
+{
+	__u64 data_head = READ_ONCE(header->data_head);
+
+	smp_rmb();
+	return data_head;
+}
+
+static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
+				__u64 data_tail)
+{
+	smp_mb();
+	header->data_tail = data_tail;
+}
+
 enum bpf_perf_event_ret
 bpf_perf_event_read_simple(void *mem, unsigned long size,
 			   unsigned long page_size, void **buf, size_t *buf_len,
 			   bpf_perf_event_print_t fn, void *priv)
 {
-	volatile struct perf_event_mmap_page *header = mem;
+	struct perf_event_mmap_page *header = mem;
+	__u64 data_head = bpf_perf_read_head(header);
 	__u64 data_tail = header->data_tail;
-	__u64 data_head = header->data_head;
 	int ret = LIBBPF_PERF_EVENT_ERROR;
 	void *base, *begin, *end;
 
-	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
 	if (data_head == data_tail)
 		return LIBBPF_PERF_EVENT_CONT;
 
@@ -2467,8 +2482,6 @@ bpf_perf_event_read_simple(void *mem, unsigned long size,
 		data_tail += ehdr->size;
 	}
 
-	__sync_synchronize(); /* smp_mb() */
-	header->data_tail = data_tail;
-
+	bpf_perf_write_tail(header, data_tail);
 	return ret;
 }
-- 
2.9.5

^ permalink raw reply related

* Re: Bug in MACSec - stops passing traffic after approx 5TB
From: Josh Coombs @ 2018-10-17 14:46 UTC (permalink / raw)
  To: sd; +Cc: netdev
In-Reply-To: <CACcUnf-0wXVb5UUqJJLap-CD9DD82rS7YL=jayPsnPXQb_c9NQ@mail.gmail.com>

I see it reusing SAs, so I'm good.

Joshua Coombs


On Wed, Oct 17, 2018 at 9:45 AM Josh Coombs <jcoombs@staff.gwi.net> wrote:
>
> I've got wpa_supplicant working with macsec on Fedora, my test bed has
> shuffled 16 billion packets so far without interruption.  I am a bit
> concerned that I've just pushed the resource exhaustion issue down the
> road though, looking at the output of ip macsec show I see four SAs
> for TX and RX, it appears to negotiate a new pair every 3 to 3.5
> billion packets.  It doesn't appear to be ripping down old SAs.  What
> happens when available SA slots run out?
>
> Joshua Coombs
> GWI
>
> office 207-494-2140
> www.gwi.net
>
> On Mon, Oct 15, 2018 at 11:45 AM Josh Coombs <jcoombs@staff.gwi.net> wrote:
> >
> > And confirmed, starting with a high packet number results in a very
> > short testbed run, 296 packets and then nothing, just as you surmised.
> > Sorry for raising the alarm falsely.  Looks like I need to roll my own
> > build of wpa_supplicant as the ubuntu builds don't include the macsec
> > driver, haven't tested Gentoo's ebuilds yet to see if they do.
> >
> > Josh Coombs
> >
> > On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs <jcoombs@staff.gwi.net> wrote:
> > >
> > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca <sd@queasysnail.net> wrote:
> > > >
> > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote:
> > > > > I initially mistook this for a traffic control issue, but after
> > > > > stripping the test beds down to just the MACSec component, I can still
> > > > > replicate the issue.  After approximately 5TB of transfer / 4 billion
> > > > > packets over a MACSec link it stops passing traffic.
> > > >
> > > > I think you're just hitting packet number exhaustion. After 2^32
> > > > packets, the packet number would wrap to 0 and start being reused,
> > > > which breaks the crypto used by macsec. Before this point, you have to
> > > > add a new SA, and tell the macsec device to switch to it.
> > >
> > > I had not considered that, I naively thought as long as I didn't
> > > specify a replay window, it'd roll the PN over on it's own and life
> > > would be good.  I'll test that theory tomorrow, should be easy to
> > > prove out.
> > >
> > > > That's why you should be using wpa_supplicant. It will monitor the
> > > > growth of the packet number, and handle the rekey for you.
> > >
> > > Thank you for the heads up, I'll read up on this as well.
> > >
> > > Josh C

^ permalink raw reply

* Re: [PATCH bpf-next v2 4/8] tls: convert to generic sk_msg interface
From: Eric Dumazet @ 2018-10-17 14:55 UTC (permalink / raw)
  To: Daniel Borkmann, alexei.starovoitov; +Cc: john.fastabend, davejwatson, netdev
In-Reply-To: <20181013004603.3747-5-daniel@iogearbox.net>



On 10/12/2018 05:45 PM, Daniel Borkmann wrote:
> Convert kTLS over to make use of sk_msg interface for plaintext and
> encrypted scattergather data, so it reuses all the sk_msg helpers
> and data structure which later on in a second step enables to glue
> this to BPF.
> 
> This also allows to remove quite a bit of open coded helpers which
> are covered by the sk_msg API. Recent changes in kTLs 80ece6a03aaf
> ("tls: Remove redundant vars from tls record structure") and
> 4e6d47206c32 ("tls: Add support for inplace records encryption")
> changed the data path handling a bit; while we've kept the latter
> optimization intact, we had to undo the former change to better
> fit the sk_msg model, hence the sg_aead_in and sg_aead_out have
> been brought back and are linked into the sk_msg sgs. Now the kTLS
> record contains a msg_plaintext and msg_encrypted sk_msg each.
> 
> In the original code, the zerocopy_from_iter() has been used out
> of TX but also RX path. For the strparser skb-based RX path,
> we've left the zerocopy_from_iter() in decrypt_internal() mostly
> untouched, meaning it has been moved into tls_setup_from_iter()
> with charging logic removed (as not used from RX). Given RX path
> is not based on sk_msg objects, we haven't pursued setting up a
> dummy sk_msg to call into sk_msg_zerocopy_from_iter(), but it
> could be an option to prusue in a later step.
> 
> Joint work with John.
> 
> 

Something looks broken and needs this fix :

(I have to run, I might submit this formally later if needed)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3c0e44cb811a4bed2e02aa107854d74eb3ae7358..3e55dedc038b22132b5b6d042bfedda9e4f3157e 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -175,12 +175,12 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
                }
        }
 
+       if (!sk_has_psock(sk)) {
+               ret = -EBUSY;
+               goto out_progs;
+       }
        psock = sk_psock_get(sk);
        if (psock) {
-               if (!sk_has_psock(sk)) {
-                       ret = -EBUSY;
-                       goto out_progs;
-               }
                if ((msg_parser && READ_ONCE(psock->progs.msg_parser)) ||
                    (skb_progs  && READ_ONCE(psock->progs.skb_parser))) {
                        sk_psock_put(sk, psock);


BUG: KASAN: slab-out-of-bounds in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: slab-out-of-bounds in refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
Read of size 4 at addr ffff88019548be58 by task syz-executor4/22387

CPU: 1 PID: 22387 Comm: syz-executor4 Not tainted 4.19.0-rc7+ #264
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
 print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 refcount_inc_not_zero_checked+0x97/0x2f0 lib/refcount.c:120
 sk_psock_get include/linux/skmsg.h:379 [inline]
 sock_map_link.isra.6+0x41f/0xe30 net/core/sock_map.c:178
 sock_hash_update_common+0x19b/0x11e0 net/core/sock_map.c:669
 sock_hash_update_elem+0x306/0x470 net/core/sock_map.c:738
 map_update_elem+0x819/0xdf0 kernel/bpf/syscall.c:818
 __do_sys_bpf kernel/bpf/syscall.c:2400 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:2371 [inline]
 __x64_sys_bpf+0x32d/0x510 kernel/bpf/syscall.c:2371
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457569
Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f71507e0c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457569
RDX: 0000000000000020 RSI: 0000000020000180 RDI: 0000000000000002
RBP: 000000000072bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f71507e16d4
R13: 00000000004bd926 R14: 00000000004cc2b0 R15: 00000000ffffffff

Allocated by task 22387:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
 kmem_cache_alloc+0x12e/0x730 mm/slab.c:3554
 kmem_cache_zalloc include/linux/slab.h:697 [inline]
 kcm_attach net/kcm/kcmsock.c:1405 [inline]
 kcm_attach_ioctl net/kcm/kcmsock.c:1490 [inline]
 kcm_ioctl+0xca7/0x18c0 net/kcm/kcmsock.c:1696
 sock_do_ioctl+0xeb/0x420 net/socket.c:950
 sock_ioctl+0x313/0x690 net/socket.c:1074
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:501 [inline]
 do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
 __do_sys_ioctl fs/ioctl.c:709 [inline]
 __se_sys_ioctl fs/ioctl.c:707 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 0:
(stack is not available)

The buggy address belongs to the object at ffff88019548bc00
 which belongs to the cache kcm_psock_cache of size 544
The buggy address is located 56 bytes to the right of
 544-byte region [ffff88019548bc00, ffff88019548be20)
The buggy address belongs to the page:
page:ffffea0006552280 count:1 mapcount:0 mapping:ffff8801cb789b40 index:0x0 compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801cb78c748 ffff8801cb78c748 ffff8801cb789b40
kobject: 'loop2' (0000000086df2c8c): kobject_uevent_env
raw: 0000000000000000 ffff88019548a080 000000010000000b 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88019548bd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88019548bd80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88019548be00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
                                                    ^
 ffff88019548be80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88019548bf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

^ permalink raw reply related

* Re: [PATCH net] sctp: not free the new asoc when sctp_wait_for_connect returns err
From: Marcelo Ricardo Leitner @ 2018-10-17 14:57 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman
In-Reply-To: <6ee31cce8c2e36a7e189de11c13c257e8c45b42f.1539716772.git.lucien.xin@gmail.com>

On Wed, Oct 17, 2018 at 03:06:12AM +0800, Xin Long wrote:
> When sctp_wait_for_connect is called to wait for connect ready
> for sp->strm_interleave in sctp_sendmsg_to_asoc, a panic could
> be triggered if cpu is scheduled out and the new asoc is freed
> elsewhere, as it will return err and later the asoc gets freed
> again in sctp_sendmsg.
> 
> [  285.840764] list_del corruption, ffff9f0f7b284078->next is LIST_POISON1 (dead000000000100)
> [  285.843590] WARNING: CPU: 1 PID: 8861 at lib/list_debug.c:47 __list_del_entry_valid+0x50/0xa0
> [  285.846193] Kernel panic - not syncing: panic_on_warn set ...
> [  285.846193]
> [  285.848206] CPU: 1 PID: 8861 Comm: sctp_ndata Kdump: loaded Not tainted 4.19.0-rc7.label #584
> [  285.850559] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [  285.852164] Call Trace:
> ...
> [  285.872210]  ? __list_del_entry_valid+0x50/0xa0
> [  285.872894]  sctp_association_free+0x42/0x2d0 [sctp]
> [  285.873612]  sctp_sendmsg+0x5a4/0x6b0 [sctp]
> [  285.874236]  sock_sendmsg+0x30/0x40
> [  285.874741]  ___sys_sendmsg+0x27a/0x290
> [  285.875304]  ? __switch_to_asm+0x34/0x70
> [  285.875872]  ? __switch_to_asm+0x40/0x70
> [  285.876438]  ? ptep_set_access_flags+0x2a/0x30
> [  285.877083]  ? do_wp_page+0x151/0x540
> [  285.877614]  __sys_sendmsg+0x58/0xa0
> [  285.878138]  do_syscall_64+0x55/0x180
> [  285.878669]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> This is a similar issue with the one fixed in Commit ca3af4dd28cf
> ("sctp: do not free asoc when it is already dead in sctp_sendmsg").
> But this one can't be fixed by returning -ESRCH for the dead asoc
> in sctp_wait_for_connect, as it will break sctp_connect's return
> value to users.
> 
> This patch is to simply set err to -ESRCH before it returns to
> sctp_sendmsg when any err is returned by sctp_wait_for_connect
> for sp->strm_interleave, so that no asoc would be freed due to
> this.
> 
> When users see this error, they will know the packet hasn't been
> sent. And it also makes sense to not free asoc because waiting
> connect fails, like the second call for sctp_wait_for_connect in
> sctp_sendmsg_to_asoc.
> 
> Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  net/sctp/socket.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index e25a20f..1baa9d9 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1946,8 +1946,10 @@ static int sctp_sendmsg_to_asoc(struct sctp_association *asoc,
>  		if (sp->strm_interleave) {
>  			timeo = sock_sndtimeo(sk, 0);
>  			err = sctp_wait_for_connect(asoc, &timeo);
> -			if (err)
> +			if (err) {
> +				err = -ESRCH;
>  				goto err;
> +			}
>  		} else {
>  			wait_connect = true;
>  		}
> -- 
> 2.1.0
> 

^ permalink raw reply

* Re: [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb
From: Arnaldo Carvalho de Melo @ 2018-10-17 15:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, peterz, paulmck, will.deacon, yhs,
	john.fastabend, netdev
In-Reply-To: <20181017144156.16639-1-daniel@iogearbox.net>

Em Wed, Oct 17, 2018 at 04:41:53PM +0200, Daniel Borkmann escreveu:
> This set first adds smp_* barrier variants to tools infrastructure
> and in a second step updates perf and libbpf to make use of them.
> For details, please see individual patches, thanks!
> 
> Arnaldo, if there are no objections, could this be routed via bpf-next
> with Acked-by's due to later dependencies in libbpf? Alternatively,
> I could also get the 2nd patch out during merge window, but perhaps
> it's okay to do in one go as there shouldn't be much conflict in perf.

Right, when updating kernel/events/ring_buffer.c the corresponding
code in tools/ should've been changed :-)

Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>

- Arnaldo
 
> Thanks!
> 
> Daniel Borkmann (3):
>   tools: add smp_* barrier variants to include infrastructure
>   tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
>   bpf, libbpf: use proper barriers in perf ring buffer walk
> 
>  tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
>  tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
>  tools/include/asm/barrier.h            | 11 +++++++++++
>  tools/lib/bpf/libbpf.c                 | 25 +++++++++++++++++++------
>  tools/perf/util/mmap.h                 |  5 +++--
>  5 files changed, 49 insertions(+), 11 deletions(-)
> 
> -- 
> 2.9.5

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: David Ahern @ 2018-10-17 15:09 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov,
	David S . Miller, Daniel Borkmann, Networking, kernel-team
In-Reply-To: <CAPhsuW7zZE51ibma__y8SDVUU_YQMjGyHRhYDhBsuJF_b89h6g@mail.gmail.com>

On 10/16/18 11:43 PM, Song Liu wrote:
> I agree that processing events while recording has significant overhead.
> In this case, perf user space need to know details about the the jited BPF
> program. It is impossible to pass all these details to user space through
> the relatively stable ring_buffer API. Therefore, some processing of the
> data is necessary (get bpf prog_id from ring buffer, and then fetch program
> details via BPF_OBJ_GET_INFO_BY_FD.
> 
> I have some idea on processing important data with relatively low overhead.
> Let me try implement it.
> 

As I understand it, you want this series:

 kernel: add event to perf buffer on bpf prog load

 userspace: perf reads the event and grabs information about the program
from the fd

Is that correct?

Userpsace is not awakened immediately when an event is added the the
ring. It is awakened once the number of events crosses a watermark. That
means there is an unknown - and potentially long - time window where the
program can be unloaded before perf reads the event.

So no matter what you do expecting perf record to be able to process the
event quickly is an unreasonable expectation.

^ permalink raw reply

* [PATCH] net: mscc: ocelot: Fix comment in ocelot_vlant_wait_for_completion()
From: Gregory CLEMENT @ 2018-10-17 15:26 UTC (permalink / raw)
  To: David S. Miller, linux-kernel, netdev, Alexandre Belloni; +Cc: Gregory CLEMENT

The ocelot_vlant_wait_for_completion() function is very similar to the
ocelot_mact_wait_for_completion(). It seemed to have be copied but the
comment was not updated, so let's fix it.

Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
---
 drivers/net/ethernet/mscc/ocelot.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index 1a4f2bb48ead..ed4e298cd823 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -133,9 +133,9 @@ static inline int ocelot_vlant_wait_for_completion(struct ocelot *ocelot)
 {
 	unsigned int val, timeout = 10;
 
-	/* Wait for the issued mac table command to be completed, or timeout.
-	 * When the command read from ANA_TABLES_MACACCESS is
-	 * MACACCESS_CMD_IDLE, the issued command completed successfully.
+	/* Wait for the issued vlan table command to be completed, or timeout.
+	 * When the command read from ANA_TABLES_VLANACCESS is
+	 * VLANACCESS_CMD_IDLE, the issued command completed successfully.
 	 */
 	do {
 		val = ocelot_read(ocelot, ANA_TABLES_VLANACCESS);
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH rdma-next v1 0/4] Scatter to CQE
From: Doug Ledford @ 2018-10-17 15:27 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, RDMA mailing list, Guy Levi, Yonatan Cohen,
	Saeed Mahameed, linux-netdev
In-Reply-To: <20181016190008.GL3606@mtr-leonro.mtl.com>

[-- Attachment #1: Type: text/plain, Size: 2584 bytes --]

On Tue, 2018-10-16 at 22:00 +0300, Leon Romanovsky wrote:
> On Tue, Oct 16, 2018 at 02:39:01PM -0400, Doug Ledford wrote:
> > On Tue, 2018-10-09 at 12:05 +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > > 
> > > Changelog v0->v1:
> > >  * Changed patch #3 to use check_mask function from rdma-core instead define.
> > > 
> > > --------------------------------------------------------------------------
> > > From Yonatan,
> > > 
> > > Scatter to CQE is a HW offload feature that saves PCI writes by
> > > scattering the payload to the CQE.
> > > 
> > > The feature depends on the CQE size and if the CQE size is 64B, it will
> > > work for payload smaller than 32. If the CQE size is 128B, it will work for
> > > payload smaller than 64.
> > > 
> > > The feature works for responder and requestor:
> > > 1. For responder, if the payload is small as required above, the data
> > > will be part of the CQE, and thus we save another PCI transaction the recv buffers.
> > > 2. For requestor, this can be used to get the RDMA_READ response and
> > > RDMA_ATOMIC response in the CQE. This feature is already supported in upstream.
> > > 
> > > As part of this series, we are adding support for DC transport type and
> > > ability to enable the feature (force enable) in the requestor when SQ
> > > is not configured to signal all WRs.
> > > 
> > > Thanks
> > > 
> > > Yonatan Cohen (4):
> > >   net/mlx5: Expose DC scatter to CQE capability bit
> > >   IB/mlx5: Support scatter to CQE for DC transport type
> > >   IB/mlx5: Verify that driver supports user flags
> > >   IB/mlx5: Allow scatter to CQE without global signaled WRs
> > > 
> > >  drivers/infiniband/hw/mlx5/cq.c      |  2 +-
> > >  drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 +-
> > >  drivers/infiniband/hw/mlx5/qp.c      | 93 ++++++++++++++++++++++++++++--------
> > >  include/linux/mlx5/mlx5_ifc.h        |  3 +-
> > >  include/uapi/rdma/mlx5-abi.h         |  1 +
> > >  5 files changed, 79 insertions(+), 22 deletions(-)
> > > 
> > > --
> > > 2.14.4
> > > 
> > 
> > 
> > Hi Leon,
> > 
> > This series looks fine.  Let me know when the net/mlx5 portion has been
> > committed.
> 
> Thanks Doug,
> I pushed first patch to mlx5-next
> 94a04d1d3d36 ("net/mlx5: Expose DC scatter to CQE capability bit")

Thanks Leon, mlx5-next merged in, then remainder of series applied to
for-next.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH bpf] bpf: fix doc of bpf_skb_adjust_room() in uapi
From: Quentin Monnet @ 2018-10-17 15:40 UTC (permalink / raw)
  To: Nicolas Dichtel, ast, daniel, davem; +Cc: netdev
In-Reply-To: <20181017142448.15071-1-nicolas.dichtel@6wind.com>

2018-10-17 16:24 UTC+0200 ~ Nicolas Dichtel <nicolas.dichtel@6wind.com>
> len_diff is signed.
> 
> Fixes: fa15601ab31e ("bpf: add documentation for eBPF helpers (33-41)")
> CC: Quentin Monnet <quentin.monnet@netronome.com>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> ---
>   include/uapi/linux/bpf.h       | 2 +-
>   tools/include/uapi/linux/bpf.h | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 66917a4eba27..c4ffe91d5598 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1430,7 +1430,7 @@ union bpf_attr {
>    * 	Return
>    * 		0 on success, or a negative error in case of failure.
>    *
> - * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
> + * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
>    * 	Description
>    * 		Grow or shrink the room for data in the packet associated to
>    * 		*skb* by *len_diff*, and according to the selected *mode*.
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 66917a4eba27..c4ffe91d5598 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -1430,7 +1430,7 @@ union bpf_attr {
>    * 	Return
>    * 		0 on success, or a negative error in case of failure.
>    *
> - * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags)
> + * int bpf_skb_adjust_room(struct sk_buff *skb, s32 len_diff, u32 mode, u64 flags)
>    * 	Description
>    * 		Grow or shrink the room for data in the packet associated to
>    * 		*skb* by *len_diff*, and according to the selected *mode*.
> 

Correct, thank you Nicolas! :)

Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Peter Zijlstra @ 2018-10-17 15:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev
In-Reply-To: <20181017144156.16639-3-daniel@iogearbox.net>

On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
>  {
>  	struct perf_event_mmap_page *pc = mm->base;
>  	u64 head = READ_ONCE(pc->data_head);
> -	rmb();
> +
> +	smp_rmb();
>  	return head;
>  }
>  
> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
>  	/*
>  	 * ensure all reads are done before we write the tail out.
>  	 */
> -	mb();
> +	smp_mb();
>  	pc->data_tail = tail;

Ideally that would be a WRITE_ONCE() to avoid store tearing.

Alternatively, I think we can use smp_store_release() here, all we care
about is that the prior loads stay prior.

Similarly, I suppose, we could use smp_load_acquire() for the data_head
load above.

>  }
>  
> -- 
> 2.9.5
> 

^ permalink raw reply

* Re: [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk
From: Peter Zijlstra @ 2018-10-17 15:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev
In-Reply-To: <20181017144156.16639-4-daniel@iogearbox.net>

On Wed, Oct 17, 2018 at 04:41:56PM +0200, Daniel Borkmann wrote:
> +static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
> +{
> +	__u64 data_head = READ_ONCE(header->data_head);
> +
> +	smp_rmb();
> +	return data_head;
> +}
> +
> +static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
> +				__u64 data_tail)
> +{
> +	smp_mb();
> +	header->data_tail = data_tail;
> +}

Same coments, either smp_load_acquire()/smp_store_release() or at the
very least a WRITE_ONCE() there.

^ permalink raw reply

* Re: [PATCH net-next] ixgbe: fix XFRM_ALGO dependency
From: Jeff Kirsher @ 2018-10-17 15:56 UTC (permalink / raw)
  To: Shannon Nelson, Arnd Bergmann, David S. Miller, Steffen Klassert,
	Herbert Xu
  Cc: Jesse Brandeburg, Björn Töpel, Alexander Duyck,
	intel-wired-lan, netdev, linux-kernel
In-Reply-To: <07e372cb-06da-bf95-bc53-64ee09fc8e76@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 6462 bytes --]

On Tue, 2018-10-16 at 09:35 -0700, Shannon Nelson wrote:
> On 10/16/2018 3:03 AM, Arnd Bergmann wrote:
> > When XFRM_ALGO is not enabled, the new igxge ipsec code produces a
> > link error:
> 
> s/igxge/ixgbe/
> 
> > 
> > drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.o: In function
> > `ixgbe_ipsec_vf_add_sa':
> > ixgbe_ipsec.c:(.text+0x1266): undefined reference to
> > `xfrm_aead_get_byname'
> > 
> > Simply selectin XFRM_ALGO from here causes circular dependencies,
> > so
> 
> s/selectin/selecting/
> 
> > to fix it, we probably want this slightly more complex solution
> > that is
> > similar to what other drivers with XFRM offload do:
> > 
> > A separate Kconfig symbol now controls whether we include the ipsec
> > offload code. To keep the old behavior, this is left as 'default
> > y'. The
> > dependency in XFRM_OFFLOAD still causes a circular dependency but
> > is
> > not actually needed because this symbol is not user visible, so
> > removing
> > that dependency on top makes it all work.
> > 
> > Fixes: eda0333ac293 ("ixgbe: add VF IPsec management")
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>

I agree with Shannon's suggested changes.  Arnd, are you working on v2?
Or would you like me to take care of it?

> > ---
> >   drivers/net/ethernet/intel/Kconfig            | 10 ++++++++++
> >   drivers/net/ethernet/intel/ixgbe/Makefile     |  2 +-
> >   drivers/net/ethernet/intel/ixgbe/ixgbe.h      |  8 ++++----
> >   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  6 +++---
> >   net/xfrm/Kconfig                              |  1 -
> >   5 files changed, 18 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/Kconfig
> > b/drivers/net/ethernet/intel/Kconfig
> > index b542aba6f0e8..d1db382d8299 100644
> > --- a/drivers/net/ethernet/intel/Kconfig
> > +++ b/drivers/net/ethernet/intel/Kconfig
> > @@ -194,6 +194,16 @@ config IXGBE_DCB
> >   
> >   	  If unsure, say N.
> >   
> > +config IXGBE_IPSEC
> > +	bool "IPSec XFRM cryptography-offload accelaration"
> > +	default n
> 
> remove this "default n" line?
> 
> > +	depends on IXGBE
> > +	depends on XFRM_OFFLOAD
> > +	default y
> > +	select XFRM_ALGO
> > +	---help---
> > +	  Enable support for IPSec offload in ixgbe.ko
> > +
> >   config IXGBEVF
> >   	tristate "Intel(R) 10GbE PCI Express Virtual Function Ethernet
> > support"
> >   	depends on PCI_MSI
> > diff --git a/drivers/net/ethernet/intel/ixgbe/Makefile
> > b/drivers/net/ethernet/intel/ixgbe/Makefile
> > index ca6b0c458e4a..4fb0d9e3f2da 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/Makefile
> > +++ b/drivers/net/ethernet/intel/ixgbe/Makefile
> > @@ -17,4 +17,4 @@ ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o
> > ixgbe_dcb_82598.o \
> >   ixgbe-$(CONFIG_IXGBE_HWMON) += ixgbe_sysfs.o
> >   ixgbe-$(CONFIG_DEBUG_FS) += ixgbe_debugfs.o
> >   ixgbe-$(CONFIG_FCOE:m=y) += ixgbe_fcoe.o
> > -ixgbe-$(CONFIG_XFRM_OFFLOAD) += ixgbe_ipsec.o
> > +ixgbe-$(CONFIG_IXGBE_IPSEC) += ixgbe_ipsec.o
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > index 7a7679e7be84..1d5d66436eac 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> > @@ -770,9 +770,9 @@ struct ixgbe_adapter {
> >   #define IXGBE_RSS_KEY_SIZE     40  /* size of RSS Hash Key in
> > bytes */
> >   	u32 *rss_key;
> >   
> > -#ifdef CONFIG_XFRM_OFFLOAD
> > +#ifdef CONFIG_IXGBE_IPSEC
> >   	struct ixgbe_ipsec *ipsec;
> > -#endif /* CONFIG_XFRM_OFFLOAD */
> > +#endif /* CONFIG_IXGBE_IPSEC */
> 
> You'll probably want to apply these changes to the ixgbevf code as
> well.
> 
> Cheers,
> sln
> 
> >   
> >   	/* AF_XDP zero-copy */
> >   	struct xdp_umem **xsk_umems;
> > @@ -1009,7 +1009,7 @@ void ixgbe_store_key(struct ixgbe_adapter
> > *adapter);
> >   void ixgbe_store_reta(struct ixgbe_adapter *adapter);
> >   s32 ixgbe_negotiate_fc(struct ixgbe_hw *hw, u32 adv_reg, u32
> > lp_reg,
> >   		       u32 adv_sym, u32 adv_asm, u32 lp_sym, u32
> > lp_asm);
> > -#ifdef CONFIG_XFRM_OFFLOAD
> > +#ifdef CONFIG_IXGBE_IPSEC
> >   void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter);
> >   void ixgbe_stop_ipsec_offload(struct ixgbe_adapter *adapter);
> >   void ixgbe_ipsec_restore(struct ixgbe_adapter *adapter);
> > @@ -1037,5 +1037,5 @@ static inline int
> > ixgbe_ipsec_vf_add_sa(struct ixgbe_adapter *adapter,
> >   					u32 *mbuf, u32 vf) { return
> > -EACCES; }
> >   static inline int ixgbe_ipsec_vf_del_sa(struct ixgbe_adapter
> > *adapter,
> >   					u32 *mbuf, u32 vf) { return
> > -EACCES; }
> > -#endif /* CONFIG_XFRM_OFFLOAD */
> > +#endif /* CONFIG_IXGBE_IPSEC */
> >   #endif /* _IXGBE_H_ */
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 0049a2becd7e..113b38e0defb 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -8694,7 +8694,7 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct
> > sk_buff *skb,
> >   
> >   #endif /* IXGBE_FCOE */
> >   
> > -#ifdef CONFIG_XFRM_OFFLOAD
> > +#ifdef CONFIG_IXGBE_IPSEC
> >   	if (skb->sp && !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
> >   		goto out_drop;
> >   #endif
> > @@ -10190,7 +10190,7 @@ ixgbe_features_check(struct sk_buff *skb,
> > struct net_device *dev,
> >   	 * the TSO, so it's the exception.
> >   	 */
> >   	if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID)) {
> > -#ifdef CONFIG_XFRM_OFFLOAD
> > +#ifdef CONFIG_IXGBE_IPSEC
> >   		if (!skb->sp)
> >   #endif
> >   			features &= ~NETIF_F_TSO;
> > @@ -10883,7 +10883,7 @@ static int ixgbe_probe(struct pci_dev
> > *pdev, const struct pci_device_id *ent)
> >   	if (hw->mac.type >= ixgbe_mac_82599EB)
> >   		netdev->features |= NETIF_F_SCTP_CRC;
> >   
> > -#ifdef CONFIG_XFRM_OFFLOAD
> > +#ifdef CONFIG_IXGBE_IPSEC
> >   #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
> >   				 NETIF_F_HW_ESP_TX_CSUM | \
> >   				 NETIF_F_GSO_ESP)
> > diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
> > index 4a9ee2d83158..140270a13d54 100644
> > --- a/net/xfrm/Kconfig
> > +++ b/net/xfrm/Kconfig
> > @@ -8,7 +8,6 @@ config XFRM
> >   
> >   config XFRM_OFFLOAD
> >          bool
> > -       depends on XFRM
> >   
> >   config XFRM_ALGO
> >   	tristate
> > 


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH rdma-next] IB/mlx5: Add support for extended atomic operations
From: Doug Ledford @ 2018-10-17 15:55 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: Yonatan Cohen, RDMA mailing list, Guy Levi, Saeed Mahameed,
	linux-netdev, Leon Romanovsky
In-Reply-To: <20181010062516.5996-1-leon@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 876 bytes --]

On Wed, 2018-10-10 at 09:25 +0300, Leon Romanovsky wrote:
> From: Yonatan Cohen <yonatanc@mellanox.com>
> 
> Extended atomic operations cmp&swp and fetch&add is a Mellanox
> feature extending the standard atomic operation to use, varied
> operand sizes, as apposed to normal atomic operation that use
> an 8 byte operand only.
> Extended atomics allows masking the results and arguments.
> 
> This patch configures QP to support extended atomic operation
> with the maximum size possible, as exposed by HCA capabilities.
> 
> Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
> Reviewed-by: Guy Levi <guyle@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>

Thanks, applied to for-next.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Song Liu @ 2018-10-17 16:06 UTC (permalink / raw)
  To: arnaldo.melo
  Cc: David Ahern, Alexei Starovoitov, Peter Zijlstra,
	Alexei Starovoitov, Alexey Budankov, David S . Miller,
	Daniel Borkmann, namhyung, Jiri Olsa, Networking, kernel-team
In-Reply-To: <20181017125028.GA4589@kernel.org>

On Wed, Oct 17, 2018 at 5:50 AM Arnaldo Carvalho de Melo
<arnaldo.melo@gmail.com> wrote:
>
> Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu:
> > Adding Alexey, Jiri and Namhyung as they worked/are working on
> > multithreading 'perf record'.
> >
> > Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu:
> > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern <dsahern@gmail.com> wrote:
> > > > On 10/15/18 4:33 PM, Song Liu wrote:
> > > > > I am working with Alexei on the idea of fetching BPF program information via
> > > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > > > to perf_event_type, and dumped these events to perf event ring buffer.
> >
> > > > > I found that perf will not process event until the end of perf-record:
> >
> > > > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > > > ...... 10 seconds later
> > > > > [ perf record: Woken up 34 times to write data ]
> > > > > machine__process_bpf_event: prog_id 6 loaded
> > > > > machine__process_bpf_event: prog_id 6 unloaded
> > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> >
> > > > > In this example, the bpf program was loaded and then unloaded in
> > > > > another terminal. When machine__process_bpf_event() processes
> > > > > the load event, the bpf program is already unloaded. Therefore,
> > > > > machine__process_bpf_event() will not be able to get information
> > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> >
> > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > > > as soon as perf get the event from kernel. I looked around the perf
> > > > > code for a while. But I haven't found a good example where some
> > > > > events are processed before the end of perf-record. Could you
> > > > > please help me with this?
> >
> > > > perf record does not process events as they are generated. Its sole job
> > > > is pushing data from the maps to a file as fast as possible meaning in
> > > > bulk based on current read and write locations.
> >
> > > > Adding code to process events will add significant overhead to the
> > > > record command and will not really solve your race problem.
> >
> > > I agree that processing events while recording has significant overhead.
> > > In this case, perf user space need to know details about the the jited BPF
> > > program. It is impossible to pass all these details to user space through
> > > the relatively stable ring_buffer API. Therefore, some processing of the
> > > data is necessary (get bpf prog_id from ring buffer, and then fetch program
> > > details via BPF_OBJ_GET_INFO_BY_FD.
> >
> > > I have some idea on processing important data with relatively low overhead.
> > > Let me try implement it.
> >
> > Well, you could have a separate thread processing just those kinds of
> > events, associate it with a dummy event where you only ask for
> > PERF_RECORD_BPF_EVENTs.
> >
> > Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
> > perf_event_attr:
> >
> > [root@seventh ~]# perf record -vv -e dummy sleep 01
> > ------------------------------------------------------------
> > perf_event_attr:
> >   type                             1
> >   size                             112
> >   config                           0x9
> >   { sample_period, sample_freq }   4000
> >   sample_type                      IP|TID|TIME|PERIOD
> >   disabled                         1
> >   inherit                          1
>
> These you would have disabled, no need for
> PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT
>
> >   mmap                             1
> >   comm                             1
> >   task                             1
> >   mmap2                            1
> >   comm_exec                        1
>
>

Thanks Arnaldo! This looks better than my original idea (using POLLPRI
to highlight
special events). I will try implement the BPF_EVENT in this direction.

Song

^ permalink raw reply

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Edward Cree @ 2018-10-17 16:13 UTC (permalink / raw)
  To: Yonghong Song, ast, kafai, daniel, netdev; +Cc: kernel-team
In-Reply-To: <20181017072315.2766920-3-yhs@fb.com>

On 17/10/18 08:23, Yonghong Song wrote:
> This patch adds BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
> support to the type section. BTF_KIND_FUNC_PROTO is used
> to specify the type of a function pointer. With this,
> BTF has a complete set of C types (except float).
>
> BTF_KIND_FUNC is used to specify the signature of a
> defined subprogram. BTF_KIND_FUNC_PROTO can be referenced
> by another type, e.g., a pointer type, and BTF_KIND_FUNC
> type cannot be referenced by another type.
Why are distinct kinds created for these?  A function body is
 a value of function type, and since there's no way (in C) to
 declare a variable of function type (only pointer-to-
 function), any declaration of function type must necessarily
 be a BTF_KIND_FUNC, whereas any other reference to a function
 type (e.g. a declaration of type pointer to function type)
 must, as you state above, be a BTF_KIND_FUNC_PROTO.
In fact, you can tell the difference just from name_off, since
 a (C-legal) BTF_KIND_FUNC_PROTO will always be anonymous (as
 the pointee of a pointer type), while a BTF_KIND_FUNC will
 have the name of the subprogram.

-Ed

^ permalink raw reply

* Re: [PATCH bpf-next v2 00/13] bpf: add btf func info support
From: Yonghong Song @ 2018-10-17 16:13 UTC (permalink / raw)
  To: Edward Cree, Alexei Starovoitov, Martin Lau, daniel@iogearbox.net,
	netdev@vger.kernel.org
  Cc: Kernel Team
In-Reply-To: <eac35eaf-5164-2bba-5cca-c165302a4b4a@solarflare.com>



On 10/17/18 4:02 AM, Edward Cree wrote:
> I think the BTF work needs to be better documented; at the moment the only way
>   to determine how BTF sections are structured is to read through the headers,
>   and cross-reference with the DWARF spec to guess at the semantics of various
>   fields.  I've been working on adding BTF support to ebpf_asm, and finding
>   very frustrating the amount of guesswork required.
> Therefore please make sure that each patch extending the BTF format includes
>   documentation patches describing both the layout and the semantics of the new

Make sense. I will add some comments to describe the layout in patch #9.

>   extensions.  For example in patch #9 there is no explanation of
>   btf_ext_header.line_info_off and btf_ext_header.line_info_len (they're not
>   even used by the code, so one cannot reverse-engineer it); while it's fairly
>   clear that they indicate the bounds of the line_info subsection, there is no

The line_info field is added because it is implemented in llvm. I 
imported it to kernel tools directory to be compatible with what llvm 
generates although we did not process it yet. I will add a comment on this.

In the long term, I guess we should add description of format etc.
in Documentation/bpf directory like BTF.rst.

>   specification of what this subsection contains.
> 
> -Ed
> 

^ permalink raw reply

* Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
From: Song Liu @ 2018-10-17 16:18 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, acme, Peter Zijlstra, Alexei Starovoitov,
	David S . Miller, Daniel Borkmann, Networking, kernel-team
In-Reply-To: <9b9f5977-df19-3f76-afd4-7e9419bd7752@gmail.com>

Hi David,

On Wed, Oct 17, 2018 at 8:09 AM David Ahern <dsahern@gmail.com> wrote:
>
> On 10/16/18 11:43 PM, Song Liu wrote:
> > I agree that processing events while recording has significant overhead.
> > In this case, perf user space need to know details about the the jited BPF
> > program. It is impossible to pass all these details to user space through
> > the relatively stable ring_buffer API. Therefore, some processing of the
> > data is necessary (get bpf prog_id from ring buffer, and then fetch program
> > details via BPF_OBJ_GET_INFO_BY_FD.
> >
> > I have some idea on processing important data with relatively low overhead.
> > Let me try implement it.
> >
>
> As I understand it, you want this series:
>
>  kernel: add event to perf buffer on bpf prog load
>
>  userspace: perf reads the event and grabs information about the program
> from the fd
>
> Is that correct?

Yes, this is correct.

>
> Userpsace is not awakened immediately when an event is added the the
> ring. It is awakened once the number of events crosses a watermark. That
> means there is an unknown - and potentially long - time window where the
> program can be unloaded before perf reads the event.
>
> So no matter what you do expecting perf record to be able to process the
> event quickly is an unreasonable expectation.

In this case, we don't really need to gather the information immediately. We
will lost information about some short-living BPF programs. These BPF
programs are less important for the performance analysis anyway. I guess
the only missing case is when some BPF program get load/unload many
times, so each instance is short-living, but the overall perf impact is
significant. I think this case is not interesting either, as BPF programs
should not be used like this (considering the kernel need to translate and
jit the program, unloading and reloading soon doesn't make any sense).

Thanks,
Song

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox