Netdev List
 help / color / mirror / Atom feed
* [PATCH bpf-next 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Adds field bpf_sock_ops_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set, per connection and
as necessary, when the connection is established.

It also adds support for reading and writting the field within a
sock_ops BPF program.

Examples of where to call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/tcp.h      | 8 ++++++++
 include/uapi/linux/bpf.h | 1 +
 net/core/filter.c        | 7 +++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..62f4388 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -373,6 +373,14 @@ struct tcp_sock {
 	 */
 	struct request_sock *fastopen_rsk;
 	u32	*saved_syn;
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+	u32     bpf_sock_ops_flags;     /* values defined in uapi/linux/tcp.h */
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
 };
 
 enum tsq_enum {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1fea987..62b2c89 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -959,6 +959,7 @@ struct bpf_sock_ops {
 				 */
 	__u32 snd_cwnd;
 	__u32 srtt_us;		/* Averaged RTT << 3 in usecs */
+	__u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 978ac78..76fd6e9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3843,6 +3843,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 		switch (off) {
 		case offsetof(struct bpf_sock_ops, op) ...
 		     offsetof(struct bpf_sock_ops, replylong[3]):
+		case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
 			break;
 		default:
 			return false;
@@ -4525,6 +4526,12 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_sock_ops, srtt_us):
 		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
 		break;
+
+	case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
+		SOCK_OPS_GET_OR_SET_FIELD(bpf_sock_ops_flags,
+					  bpf_sock_ops_flags,
+					  struct tcp_sock, type);
+		break;
 	}
 	return insn - insn_buf;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Make SOCK_OPS_GET_TCP helper macro size independent (before only worked
with 4-byte fields.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/core/filter.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 130b842..099ff9fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4449,9 +4449,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 		break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)					      \
+#define SOCK_OPS_GET_TCP(FIELD_NAME)					      \
 	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
 						struct bpf_sock_ops_kern,     \
 						is_fullsock),		      \
@@ -4463,16 +4464,18 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 						struct bpf_sock_ops_kern, sk),\
 				      si->dst_reg, si->src_reg,		      \
 				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,        \
+		*insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,	      \
+						   FIELD_NAME), si->dst_reg,  \
+				      si->dst_reg,			      \
 				      offsetof(struct tcp_sock, FIELD_NAME)); \
 	} while (0)
 
 	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_TCP32(snd_cwnd);
+		SOCK_OPS_GET_TCP(snd_cwnd);
 		break;
 
 	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_TCP32(srtt_us);
+		SOCK_OPS_GET_TCP(srtt_us);
 		break;
 	}
 	return insn - insn_buf;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 0/11] bpf: more sock_ops callbacks
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng

This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>

Consists of the following patches:
[PATCH bpf 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH bpf 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent
[PATCH bpf 03/11] bpf: Add write access to tcp_sock and sock fields
[PATCH bpf 04/11] bpf: Support passing args to sock_ops bpf function
[PATCH bpf 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock
[PATCH bpf 06/11] bpf: Add sock_ops RTO callback
[PATCH bpf 07/11] bpf: Add support for reading sk_state and more
[PATCH bpf 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash
[PATCH bpf 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf 11/11] bpf: add selftest for tcpbpf

include/linux/filter.h                         |   4 +
include/linux/tcp.h                            |   8 ++
include/net/tcp.h                              |  66 +++++++++-
include/uapi/linux/bpf.h                       |  39 +++++-
include/uapi/linux/tcp.h                       |   5 +
net/core/filter.c                              | 221 ++++++++++++++++++++++++++++++--
net/ipv4/tcp.c                                 |   4 +-
net/ipv4/tcp_nv.c                              |   2 +-
net/ipv4/tcp_output.c                          |   5 +-
net/ipv4/tcp_timer.c                           |   9 ++
tools/include/uapi/linux/bpf.h                 |  45 ++++++-
tools/testing/selftests/bpf/Makefile           |   4 +-
tools/testing/selftests/bpf/tcp_client.py      |  52 ++++++++
tools/testing/selftests/bpf/tcp_server.py      |  79 ++++++++++++
tools/testing/selftests/bpf/test_tcpbpf_kern.c | 125 ++++++++++++++++++
tools/testing/selftests/bpf/test_tcpbpf_user.c | 113 ++++++++++++++++
16 files changed, 757 insertions(+), 24 deletions(-)

^ permalink raw reply

* [PATCH bpf-next 06/11] bpf: Add sock_ops RTO callback
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 2 arguments: icsk_retransmits and whether the
RTO has expired.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 5 +++++
 include/uapi/linux/tcp.h | 3 +++
 net/ipv4/tcp_timer.c     | 9 +++++++++
 3 files changed, 17 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 62b2c89..3cf9014 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -995,6 +995,11 @@ enum {
 					 * a congestion threshold. RTTs above
 					 * this indicate congestion
 					 */
+	BPF_SOCK_OPS_RTO_CB,		/* Called when an RTO has triggered.
+					 * Arg1: value of icsk_retransmits
+					 * Arg2: value of icsk_rto
+					 * Arg3: whether RTO has expired
+					 */
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..089c19e 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -259,6 +259,9 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/* Definitions for bpf_sock_ops_flags */
+#define BPF_SOCK_OPS_RTO_CB_FLAG	(1<<0)
+
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
 	__u8	tcpm_family;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..f9c57e2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -215,9 +215,18 @@ static int tcp_write_timeout(struct sock *sk)
 	tcp_fastopen_active_detect_blackhole(sk, expired);
 	if (expired) {
 		/* Has it gone just too far? */
+		if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+			tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+					  icsk->icsk_retransmits,
+					  icsk->icsk_rto, 1);
 		tcp_write_err(sk);
 		return 1;
 	}
+
+	if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+		tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+				  icsk->icsk_retransmits,
+				  icsk->icsk_rto, 0);
 	return 0;
 }
 
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:      SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.x. 0 or 0xffffffff), or
  2) Make the verifier fail if that field is accessed (i.e. program
    fails to load) so the user will know that field is no longer
    supported.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/core/filter.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 099ff9fd..7064862 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4448,11 +4448,11 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 					       is_fullsock));
 		break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME)					      \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
 	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
 						struct bpf_sock_ops_kern,     \
 						is_fullsock),		      \
@@ -4464,18 +4464,18 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 						struct bpf_sock_ops_kern, sk),\
 				      si->dst_reg, si->src_reg,		      \
 				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,	      \
-						   FIELD_NAME), si->dst_reg,  \
-				      si->dst_reg,			      \
-				      offsetof(struct tcp_sock, FIELD_NAME)); \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
+						       OBJ_FIELD),	      \
+				      si->dst_reg, si->dst_reg,		      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
 	} while (0)
 
 	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_TCP(snd_cwnd);
+		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
 		break;
 
 	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_TCP(srtt_us);
+		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
 		break;
 	}
 	return insn - insn_buf;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Adds support for calling sock_ops BPF program when there is a
retransmission. Two arguments are used; one for the sequence number and
other for the number of segments retransmitted. Does not include syn-ack
retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 4 ++++
 include/uapi/linux/tcp.h | 1 +
 net/ipv4/tcp_output.c    | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5bead0e..415b951 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1020,6 +1020,10 @@ enum {
 					 * Arg2: value of icsk_rto
 					 * Arg3: whether RTO has expired
 					 */
+	BPF_SOCK_OPS_RETRANS_CB,	/* Called when skb is retransmitted.
+					 * Arg1: sequence number of 1st byte
+					 * Arg2: # segments
+					 */
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 089c19e..dc36d3c 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -261,6 +261,7 @@ struct tcp_md5sig {
 
 /* Definitions for bpf_sock_ops_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG	(1<<0)
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1<<1)
 
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b093985..8109675 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2907,6 +2907,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 	if (likely(!err)) {
 		TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
 		trace_tcp_retransmit_skb(sk, skb);
+		if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+			tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+					  TCP_SKB_CB(skb)->seq, segs);
 	} else if (err != -EBUSY) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
 	}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 07/11] bpf: Add support for reading sk_state and more
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Add support for reading many more tcp_sock fields

  state,	same as sk->sk_state
  rtt_min	same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  bytes_received (__u64)
  bytes_acked    (__u64)

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h |  19 +++++++++
 net/core/filter.c        | 101 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3cf9014..c81fdec 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -960,6 +960,25 @@ struct bpf_sock_ops {
 	__u32 snd_cwnd;
 	__u32 srtt_us;		/* Averaged RTT << 3 in usecs */
 	__u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
+	__u32 state;
+	__u32 rtt_min;
+	__u32 snd_ssthresh;
+	__u32 rcv_nxt;
+	__u32 snd_nxt;
+	__u32 snd_una;
+	__u32 mss_cache;
+	__u32 ecn_flags;
+	__u32 rate_delivered;
+	__u32 rate_interval_us;
+	__u32 packets_out;
+	__u32 retrans_out;
+	__u32 total_retrans;
+	__u32 segs_in;
+	__u32 data_segs_in;
+	__u32 segs_out;
+	__u32 data_segs_out;
+	__u64 bytes_received;
+	__u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 76fd6e9..d4c5c1a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3829,7 +3829,7 @@ static bool __is_valid_sock_ops_access(int off, int size)
 	/* The verifier guarantees that size > 0. */
 	if (off % size != 0)
 		return false;
-	if (size != sizeof(__u32))
+	if (size != sizeof(__u32) && size != sizeof(__u64))
 		return false;
 
 	return true;
@@ -4449,6 +4449,32 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 					       is_fullsock));
 		break;
 
+	case offsetof(struct bpf_sock_ops, state):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_state));
+		break;
+
+	case offsetof(struct bpf_sock_ops, rtt_min):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+			     sizeof(struct minmax));
+		BUILD_BUG_ON(sizeof(struct minmax) <
+			     sizeof(struct minmax_sample));
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct tcp_sock, rtt_min) +
+				      FIELD_SIZEOF(struct minmax_sample, t));
+		break;
+
 /* Helper macro for adding read access to tcp_sock or sock fields. */
 #define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
 	do {								      \
@@ -4532,6 +4558,79 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 					  bpf_sock_ops_flags,
 					  struct tcp_sock, type);
 		break;
+
+	case offsetof(struct bpf_sock_ops, snd_ssthresh):
+		SOCK_OPS_GET_FIELD(snd_ssthresh, snd_ssthresh, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, rcv_nxt):
+		SOCK_OPS_GET_FIELD(rcv_nxt, rcv_nxt, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, snd_nxt):
+		SOCK_OPS_GET_FIELD(snd_nxt, snd_nxt, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, snd_una):
+		SOCK_OPS_GET_FIELD(snd_una, snd_una, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, mss_cache):
+		SOCK_OPS_GET_FIELD(mss_cache, mss_cache, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, ecn_flags):
+		SOCK_OPS_GET_FIELD(ecn_flags, ecn_flags, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, rate_delivered):
+		SOCK_OPS_GET_FIELD(rate_delivered, rate_delivered,
+				   struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, rate_interval_us):
+		SOCK_OPS_GET_FIELD(rate_interval_us, rate_interval_us,
+				   struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, packets_out):
+		SOCK_OPS_GET_FIELD(packets_out, packets_out, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, retrans_out):
+		SOCK_OPS_GET_FIELD(retrans_out, retrans_out, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, total_retrans):
+		SOCK_OPS_GET_FIELD(total_retrans, total_retrans,
+				   struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, segs_in):
+		SOCK_OPS_GET_FIELD(segs_in, segs_in, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, data_segs_in):
+		SOCK_OPS_GET_FIELD(data_segs_in, data_segs_in, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, segs_out):
+		SOCK_OPS_GET_FIELD(segs_out, segs_out, struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, data_segs_out):
+		SOCK_OPS_GET_FIELD(data_segs_out, data_segs_out,
+				   struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, bytes_received):
+		SOCK_OPS_GET_FIELD(bytes_received, bytes_received,
+				   struct tcp_sock);
+		break;
+
+	case offsetof(struct bpf_sock_ops, bytes_acked):
+		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
+		break;
 	}
 	return insn - insn_buf;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 03/11] bpf: Add write access to tcp_sock and sock fields
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/filter.h |  3 +++
 include/net/tcp.h      |  2 +-
 net/core/filter.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e872b4e..8fe4df4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -996,6 +996,9 @@ struct bpf_sock_ops_kern {
 		u32 replylong[4];
 	};
 	u32	is_fullsock;
+	u64	temp;			/* Used by sock_ops_convert_ctx_access
+					 * as temporary storaage of a register
+					 */
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69..108d16a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2010,7 +2010,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 	struct bpf_sock_ops_kern sock_ops;
 	int ret;
 
-	memset(&sock_ops, 0, sizeof(sock_ops));
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 	if (sk_fullsock(sk)) {
 		sock_ops.is_fullsock = 1;
 		sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index 7064862..978ac78 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,6 +4470,54 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      offsetof(OBJ, OBJ_FIELD));	      \
 	} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		int reg = BPF_REG_9;					      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
+				      reg, si->src_reg,			      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+	} while (0)
+
+#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
+	do {								      \
+		if (TYPE == BPF_WRITE)					      \
+			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+		else							      \
+			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+	} while (0)
+
 	case offsetof(struct bpf_sock_ops, snd_cwnd):
 		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
 		break;
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Adds direct R/W access to sk_txhash and access to tclass for ipv6 flows
through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c        | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c81fdec..5bead0e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,7 @@ struct bpf_sock_ops {
 	__u32 data_segs_out;
 	__u64 bytes_received;
 	__u64 bytes_acked;
+	__u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index d4c5c1a..4d2ff88 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3230,6 +3230,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 			ret = -EINVAL;
 		}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (level == SOL_IPV6) {
+		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+			return -EINVAL;
+
+		val = *((int *)optval);
+		/* Only some options are supported */
+		switch (optname) {
+		case IPV6_TCLASS:
+			if (val < -1 || val > 0xff) {
+				ret = -EINVAL;
+			} else {
+				struct ipv6_pinfo *np = inet6_sk(sk);
+
+				if (val == -1)
+					val = 0;
+				np->tclass = val;
+			}
+			break;
+		default:
+			ret = -EINVAL;
+		}
+#endif
 	} else if (level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
 		if (optname == TCP_CONGESTION) {
@@ -3239,7 +3262,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 			strncpy(name, optval, min_t(long, optlen,
 						    TCP_CA_NAME_MAX-1));
 			name[TCP_CA_NAME_MAX-1] = 0;
-			ret = tcp_set_congestion_control(sk, name, false, reinit);
+			ret = tcp_set_congestion_control(sk, name, false,
+							 reinit);
 		} else {
 			struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3305,6 +3329,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 		} else {
 			goto err_clear;
 		}
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (level == SOL_IPV6) {
+		struct ipv6_pinfo *np = inet6_sk(sk);
+
+		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+			goto err_clear;
+
+		/* Only some options are supported */
+		switch (optname) {
+		case IPV6_TCLASS:
+			*((int *)optval) = (int)np->tclass;
+			break;
+		default:
+			goto err_clear;
+		}
+#endif
 	} else {
 		goto err_clear;
 	}
@@ -3844,6 +3884,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 		case offsetof(struct bpf_sock_ops, op) ...
 		     offsetof(struct bpf_sock_ops, replylong[3]):
 		case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
+		case offsetof(struct bpf_sock_ops, sk_txhash):
 			break;
 		default:
 			return false;
@@ -4631,6 +4672,11 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_sock_ops, bytes_acked):
 		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
 		break;
+
+	case offsetof(struct bpf_sock_ops, sk_txhash):
+		SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+					  struct sock, type);
+		break;
 	}
 	return insn - insn_buf;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 04/11] bpf: Support passing args to sock_ops bpf function
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h        | 64 ++++++++++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/bpf.h |  5 ++--
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_nv.c        |  2 +-
 net/ipv4/tcp_output.c    |  2 +-
 6 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 8fe4df4..57ecd30 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -992,6 +992,7 @@ struct bpf_sock_ops_kern {
 	struct	sock *sk;
 	u32	op;
 	union {
+		u32 args[4];
 		u32 reply;
 		u32 replylong[4];
 	};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 108d16a..8e9111f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2005,7 +2005,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
 	struct bpf_sock_ops_kern sock_ops;
 	int ret;
@@ -2018,6 +2018,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
 	sock_ops.sk = sk;
 	sock_ops.op = op;
+	if (nargs > 0)
+		memcpy(sock_ops.args, args, nargs*sizeof(u32));
 
 	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
 	if (ret == 0)
@@ -2026,18 +2028,70 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 		ret = -1;
 	return ret;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+	return tcp_call_bpf(sk, op, 1, &arg);
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2)
+{
+	u32 args[2] = {arg1, arg2};
+
+	return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+				    u32 arg3)
+{
+	u32 args[3] = {arg1, arg2, arg3};
+
+	return tcp_call_bpf(sk, op, 3, args);
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+				    u32 arg3, u32 arg4)
+{
+	u32 args[4] = {arg1, arg2, arg3, arg4};
+
+	return tcp_call_bpf(sk, op, 4, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
 	return -EPERM;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+	return -EPERM;
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2)
+{
+	return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+	u32 arg3)
+{
+	return -EPERM;
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+				    u32 arg3, u32 arg4)
+{
+	return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
 	int timeout;
 
-	timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+	timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
 	if (timeout <= 0)
 		timeout = TCP_TIMEOUT_INIT;
@@ -2048,7 +2102,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
 	int rwnd;
 
-	rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+	rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
 	if (rwnd < 0)
 		rwnd = 0;
@@ -2057,7 +2111,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-	return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+	return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69eabfc..1fea987 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -942,8 +942,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
 	__u32 op;
 	union {
-		__u32 reply;
-		__u32 replylong[4];
+		__u32 args[4];		/* Optionally passed to bpf program */
+		__u32 reply;		/* Returned by bpf program	    */
+		__u32 replylong[4];	/* Optionally returned by bpf prog  */
 	};
 	__u32 family;
 	__u32 remote_ip4;	/* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c470fec..587d65c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -465,7 +465,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
 	tcp_mtup_init(sk);
 	icsk->icsk_af_ops->rebuild_header(sk);
 	tcp_init_metrics(sk);
-	tcp_call_bpf(sk, bpf_op);
+	tcp_call_bpf(sk, bpf_op, 0, NULL);
 	tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 	 * within a datacenter, where we have reasonable estimates of
 	 * RTTs
 	 */
-	base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+	base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);
 	if (base_rtt > 0) {
 		ca->nv_base_rtt = base_rtt;
 		ca->nv_lower_bound_rtt = (base_rtt * 205) >> 8; /* 80% */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 04be9f83..b093985 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3468,7 +3468,7 @@ int tcp_connect(struct sock *sk)
 	struct sk_buff *buff;
 	int err;
 
-	tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB);
+	tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB, 0, NULL);
 
 	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
 		return -EHOSTUNREACH; /* Routing failure or similar. */
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 11/11] bpf: add selftest for tcpbpf
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occured and that it can access tcp_sock fields and that their
values are correct.

Run with command: ./test_tcpbpf_user

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 tools/include/uapi/linux/bpf.h                 |  45 ++++++++-
 tools/testing/selftests/bpf/Makefile           |   4 +-
 tools/testing/selftests/bpf/tcp_client.py      |  52 ++++++++++
 tools/testing/selftests/bpf/tcp_server.py      |  79 ++++++++++++++++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 125 +++++++++++++++++++++++++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 113 ++++++++++++++++++++++
 6 files changed, 414 insertions(+), 4 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index db1b092..79088f7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -942,8 +942,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
 	__u32 op;
 	union {
-		__u32 reply;
-		__u32 replylong[4];
+		__u32 args[4];		/* Optionally passed to bpf program */
+		__u32 reply;		/* Returned by bpf program	    */
+		__u32 replylong[4];	/* Optionally returned by bpf prog  */
 	};
 	__u32 family;
 	__u32 remote_ip4;	/* Stored in network byte order */
@@ -952,6 +953,33 @@ struct bpf_sock_ops {
 	__u32 local_ip6[4];	/* Stored in network byte order */
 	__u32 remote_port;	/* Stored in network byte order */
 	__u32 local_port;	/* stored in host byte order */
+	__u32 is_fullsock;	/* Some TCP fields are only valid if
+				 * there is a full socket. If not, the
+				 * fields read as zero.
+				 */
+	__u32 snd_cwnd;
+	__u32 srtt_us;		/* Averaged RTT << 3 in usecs */
+	__u32 bpf_sock_ops_flags; /* flags defined in uapi/linux/tcp.h */
+	__u32 state;
+	__u32 rtt_min;
+	__u32 snd_ssthresh;
+	__u32 rcv_nxt;
+	__u32 snd_nxt;
+	__u32 snd_una;
+	__u32 mss_cache;
+	__u32 ecn_flags;
+	__u32 rate_delivered;
+	__u32 rate_interval_us;
+	__u32 packets_out;
+	__u32 retrans_out;
+	__u32 total_retrans;
+	__u32 segs_in;
+	__u32 data_segs_in;
+	__u32 segs_out;
+	__u32 data_segs_out;
+	__u64 bytes_received;
+	__u64 bytes_acked;
+	__u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
@@ -987,6 +1015,19 @@ enum {
 					 * a congestion threshold. RTTs above
 					 * this indicate congestion
 					 */
+	BPF_SOCK_OPS_RTO_CB,		/* Called when an RTO has triggered.
+					 * Arg1: value of icsk_retransmits
+					 * Arg2: value of icsk_rto
+					 * Arg3: whether RTO has expired
+					 */
+	BPF_SOCK_OPS_RETRANS_CB,	/* Called when skb is retransmitted.
+					 * Arg1: sequence number of 1st byte
+					 * Arg2: # segments
+					 */
+	BPF_SOCK_OPS_STATE_CB,		/* Called when TCP changes state.
+					 * Arg1: old_state
+					 * Arg2: new_state
+					 */
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index a1fcb0c..7b7d0be 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -14,12 +14,12 @@ CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../i
 LDLIBS += -lcap -lelf
 
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \
-	test_align test_verifier_log test_dev_cgroup
+	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
 	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
-	test_l4lb_noinline.o test_xdp_noinline.o
+	test_l4lb_noinline.o test_xdp_noinline.o test_tcpbpf_kern.o
 
 TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
 	test_offload.py
diff --git a/tools/testing/selftests/bpf/tcp_client.py b/tools/testing/selftests/bpf/tcp_client.py
new file mode 100755
index 0000000..ac2ce32
--- /dev/null
+++ b/tools/testing/selftests/bpf/tcp_client.py
@@ -0,0 +1,52 @@
+#!/usr/local/bin/python
+#
+# SPDX-License-Identifier: GPL-2.0
+#
+
+import sys, os, os.path, getopt
+import socket, time
+import subprocess
+import select
+
+def read(sock, n):
+    buf = ''
+    while len(buf) < n:
+        rem = n - len(buf)
+        try: s = sock.recv(rem)
+        except (socket.error), e: return ''
+        buf += s
+    return buf
+
+def send(sock, s):
+    total = len(s)
+    count = 0
+    while count < total:
+        try: n = sock.send(s)
+        except (socket.error), e: n = 0
+        if n == 0:
+            return count;
+        count += n
+    return count
+
+
+serverPort = int(sys.argv[1])
+HostName = socket.gethostname()
+
+time.sleep(1)
+
+# create active socket
+sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
+try:
+    sock.connect((HostName, serverPort))
+except socket.error as e:
+    sys.exit(1)
+
+buf = ''
+n = 0
+while n < 1000:
+    buf += '+'
+    n += 1
+
+n = send(sock, buf)
+n = read(sock, 500)
+sys.exit(0)
diff --git a/tools/testing/selftests/bpf/tcp_server.py b/tools/testing/selftests/bpf/tcp_server.py
new file mode 100755
index 0000000..9e5db0d
--- /dev/null
+++ b/tools/testing/selftests/bpf/tcp_server.py
@@ -0,0 +1,79 @@
+#!/usr/local/bin/python
+#
+# SPDX-License-Identifier: GPL-2.0
+#
+
+import sys, os, os.path, getopt
+import socket, time
+import subprocess
+import select
+
+def read(sock, n):
+    buf = ''
+    while len(buf) < n:
+        rem = n - len(buf)
+        try: s = sock.recv(rem)
+        except (socket.error), e: return ''
+        buf += s
+    return buf
+
+def send(sock, s):
+    total = len(s)
+    count = 0
+    while count < total:
+        try: n = sock.send(s)
+        except (socket.error), e: n = 0
+        if n == 0:
+            return count;
+        count += n
+    return count
+
+
+SERVER_PORT = 12877
+MAX_PORTS = 2
+
+serverPort = SERVER_PORT
+serverSocket = None
+
+HostName = socket.gethostname()
+
+# create passive socket
+serverSocket = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
+host = socket.gethostname()
+
+while serverPort < SERVER_PORT + 5:
+	try: serverSocket.bind((host, serverPort))
+	except socket.error as msg:
+            serverPort += 1
+            continue
+	break
+
+cmdStr = ("./tcp_client.py %d &") % (serverPort)
+os.system(cmdStr)
+
+buf = ''
+n = 0
+while n < 500:
+    buf += '.'
+    n += 1
+
+serverSocket.listen(MAX_PORTS)
+readList = [serverSocket]
+
+while True:
+    readyRead, readyWrite, inError = \
+        select.select(readList, [], [], 10)
+
+    if len(readyRead) > 0:
+        waitCount = 0
+        for sock in readyRead:
+            if sock == serverSocket:
+                (clientSocket, address) = serverSocket.accept()
+                address = str(address[0])
+                readList.append(clientSocket)
+            else:
+                s = read(sock, 1000)
+                n = send(sock, buf)
+                sock.close()
+                time.sleep(1)
+                sys.exit(0)
diff --git a/tools/testing/selftests/bpf/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
new file mode 100644
index 0000000..8260130
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/in6.h>
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <netinet/in.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+
+struct globals {
+	__u32 event_map;
+	__u32 total_retrans;
+	__u32 data_segs_in;
+	__u32 data_segs_out;
+	__u64 bytes_received;
+	__u64 bytes_acked;
+};
+
+struct bpf_map_def SEC("maps") global_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct globals),
+	.max_entries = 2,
+};
+
+static inline void update_event_map(int event)
+{
+	__u32 key = 0;
+	struct globals g, *gp;
+
+	gp = bpf_map_lookup_elem(&global_map, &key);
+	if (gp == NULL) {
+		struct globals g = {0, 0, 0, 0, 0, 0};
+
+		g.event_map |= (1 << event);
+		bpf_map_update_elem(&global_map, &key, &g,
+			    BPF_ANY);
+	} else {
+		g = *gp;
+		g.event_map |= (1 << event);
+		bpf_map_update_elem(&global_map, &key, &g,
+			    BPF_ANY);
+	}
+}
+
+SEC("sockops")
+int bpf_testcb(struct bpf_sock_ops *skops)
+{
+	int rv = -1;
+	int op;
+	int init_seq = 0;
+	int ret = 0;
+	int v = 0;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if remote port number is in the range 12877..12887
+	 * I.e. the active side of the connection
+	 */
+	if ((bpf_ntohl(skops->remote_port) < 12877 ||
+	     bpf_ntohl(skops->remote_port) >= 12887)) {
+		skops->reply = -1;
+		return 1;
+	}
+
+	op = (int) skops->op;
+
+	/* Check that both hosts are within same datacenter. For this example
+	 * it is the case when the first 5.5 bytes of their IPv6 addresses are
+	 * the same.
+	 */
+	if (1) {
+		update_event_map(op);
+
+		switch (op) {
+		case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+			skops->bpf_sock_ops_flags = 0xfff;
+			init_seq = skops->snd_nxt;
+			break;
+		case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+			init_seq = skops->snd_nxt;
+			skops->bpf_sock_ops_flags = 0xfff;
+			skops->sk_txhash = 0x12345f;
+			v = 0xff;
+			ret = bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v,
+					     sizeof(v));
+			break;
+		case BPF_SOCK_OPS_RTO_CB:
+			break;
+		case BPF_SOCK_OPS_RETRANS_CB:
+			break;
+		case BPF_SOCK_OPS_STATE_CB:
+			if (skops->args[1] == 7) {
+				__u32 key = 0;
+				struct globals g, *gp;
+
+				gp = bpf_map_lookup_elem(&global_map, &key);
+				if (!gp)
+					break;
+				g = *gp;
+				g.total_retrans = skops->total_retrans;
+				g.data_segs_in = skops->data_segs_in;
+				g.data_segs_out = skops->data_segs_out;
+				g.bytes_received = skops->bytes_received;
+				g.bytes_acked = skops->bytes_acked;
+				bpf_map_update_elem(&global_map, &key, &g,
+						    BPF_ANY);
+			}
+			break;
+		default:
+			rv = -1;
+		}
+	} else {
+		rv = -1;
+	}
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcpbpf_user.c b/tools/testing/selftests/bpf/test_tcpbpf_user.c
new file mode 100644
index 0000000..e6723c6
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpbpf_user.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <string.h>
+#include <assert.h>
+#include <linux/perf_event.h>
+#include <linux/ptrace.h>
+#include <linux/bpf.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include "bpf_util.h"
+#include <linux/perf_event.h>
+
+struct globals {
+	__u32 event_map;
+	__u32 total_retrans;
+	__u32 data_segs_in;
+	__u32 data_segs_out;
+	__u64 bytes_received;
+	__u64 bytes_acked;
+};
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map) {
+		printf("%s:FAIL:map '%s' not found\n", test, name);
+		return -1;
+	}
+	return bpf_map__fd(map);
+}
+
+#define SYSTEM(CMD)						\
+	do {							\
+		if (system(CMD)) {				\
+			printf("system(%s) FAILS!\n", CMD);	\
+		}						\
+	} while (0)
+
+int main(int argc, char **argv)
+{
+	struct globals g = {0, 0, 0, 0, 0, 0};
+	__u32 key = 0;
+	int rv;
+	int pid;
+	int error = EXIT_FAILURE;
+	int cg_fd, prog_fd, map_fd;
+	char cmd[100], *dir;
+	const char *file = "./test_tcpbpf_kern.o";
+	struct bpf_object *obj;
+	struct stat buffer;
+
+	dir = "/tmp/cgroupv2/foo";
+
+	if (stat(dir, &buffer) != 0) {
+		SYSTEM("mkdir -p /tmp/cgroupv2");
+		SYSTEM("mount -t cgroup2 none /tmp/cgroupv2");
+		SYSTEM("mkdir -p /tmp/cgroupv2/foo");
+	}
+	pid = (int) getpid();
+	sprintf(cmd, "echo %d >> /tmp/cgroupv2/foo/cgroup.procs", pid);
+	SYSTEM(cmd);
+
+	cg_fd = open(dir, O_DIRECTORY, O_RDONLY);
+	if (bpf_prog_load(file, BPF_PROG_TYPE_SOCK_OPS, &obj, &prog_fd)) {
+//	if (load_bpf_file(prog)) {
+		printf("FAILED: load_bpf_file failed for: %s\n", file);
+//		printf("%s", bpf_log_buf);
+		goto err;
+	}
+
+	rv = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_SOCK_OPS, 0);
+	if (rv) {
+		printf("FAILED: bpf_prog_attach: %d (%s)\n",
+		       error, strerror(errno));
+		goto err;
+	}
+
+	SYSTEM("./tcp_server.py");
+
+	map_fd = bpf_find_map(__func__, obj, "global_map");
+	if (map_fd < 0)
+		goto err;
+
+	rv = bpf_map_lookup_elem(map_fd, &key, &g);
+	if (rv != 0) {
+		printf("FAILED: bpf_map_lookup_elem returns %d\n", rv);
+		goto err;
+	}
+
+	if (g.bytes_received != 501 || g.bytes_acked != 1002 ||
+	    g.data_segs_in != 1 || g.data_segs_out != 1 ||
+		g.event_map != 0x45e) {
+		printf("FAILED: Wrong stats\n");
+		goto err;
+	}
+	printf("PASSED!\n");
+	error = 0;
+err:
+	bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS);
+	return error;
+}
-- 
2.9.5

^ permalink raw reply related

* [PATCH bpf-next 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
From: Lawrence Brakmo @ 2017-12-20 21:16 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

New op: BPF_SOCK_OPS_STATE_CB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 4 ++++
 include/uapi/linux/tcp.h | 1 +
 net/ipv4/tcp.c           | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 415b951..14040e0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1024,6 +1024,10 @@ enum {
 					 * Arg1: sequence number of 1st byte
 					 * Arg2: # segments
 					 */
+	BPF_SOCK_OPS_STATE_CB,		/* Called when TCP changes state.
+					 * Arg1: old_state
+					 * Arg2: new_state
+					 */
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dc36d3c..211322c 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -262,6 +262,7 @@ struct tcp_md5sig {
 /* Definitions for bpf_sock_ops_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG	(1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1<<1)
+#define BPF_SOCK_OPS_STATE_CB_FLAG	(1<<2)
 
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 587d65c..f342f31 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2041,6 +2041,8 @@ void tcp_set_state(struct sock *sk, int state)
 	int oldstate = sk->sk_state;
 
 	trace_tcp_set_state(sk, oldstate, state);
+	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+		tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
 
 	switch (state) {
 	case TCP_ESTABLISHED:
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH v2] net: ibm: emac: support RGMII-[RX|TX]ID phymode
From: David Miller @ 2017-12-20 21:20 UTC (permalink / raw)
  To: chunkeey; +Cc: netdev, andrew, christophe.jaillet, benh
In-Reply-To: <5154461.jTBeEfsTQW@debian64>

From: Christian Lamparter <chunkeey@gmail.com>
Date: Wed, 20 Dec 2017 22:07:43 +0100

> Yeah, I can do that. no problem. 
> 
> Question is, should I also replace the rgmii_mode_name() with phy_modes() too?
> 
> The only user of rgmii_mode_name() is this notice printk in rgmii_attach():
> <http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/ibm/emac/rgmii.c#L117>

I would recommend you use phy_modes() instead of that custom rgmii_mode_name() thing,
yes.

Thanks for asking.

^ permalink raw reply

* Re: [PATCH v5 3/6] perf: implement pmu perf_kprobe
From: Peter Zijlstra @ 2017-12-20 21:25 UTC (permalink / raw)
  To: Song Liu
  Cc: Steven Rostedt, mingo@redhat.com, David Miller,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Daniel Borkmann, Kernel Team
In-Reply-To: <F626E159-8271-4D3A-9DF0-9FC4E6AF0964@fb.com>

On Wed, Dec 20, 2017 at 06:10:11PM +0000, Song Liu wrote:
> I think there is one more thing to change:

OK, folded that too; it should all be at:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core

Can you verify it all looks/works right?

^ permalink raw reply

* Re: [patch net-next 00/10] Add support for resource abstraction
From: David Ahern @ 2017-12-20 21:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, arkadis, mlxsw, andrew, vivien.didelot, f.fainelli,
	michael.chan, ganeshgr, saeedm, matanb, leonro, idosch,
	jakub.kicinski, ast, daniel, simon.horman, pieter.jansenvanvuuren,
	john.hurley, alexander.h.duyck, linville, gospo, steven.lin1,
	yuvalm, ogerlitz, roopa
In-Reply-To: <20171220200310.GE1760@nanopsycho>

On 12/20/17 1:03 PM, Jiri Pirko wrote:
>>> Userspace part prototype can be found at https://github.com/arkadis/iproute2/
>>> at resource_dev branch.

My devlink command getting no output is because you forgot to update
that iproute2 tree with a version that works with the patches that were
sent.

^ permalink raw reply

* Re: [RFC PATCH net-next] tools/bpftool: use version from the kernel source tree
From: Jakub Kicinski @ 2017-12-20 21:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: netdev, linux-kernel, kernel-team, scientist, Alexei Starovoitov,
	Daniel Borkmann, Arnaldo Carvalho de Melo
In-Reply-To: <20171220205332.GA28352@castle>

On Wed, 20 Dec 2017 20:53:41 +0000, Roman Gushchin wrote:
> On Wed, Dec 20, 2017 at 12:29:21PM -0800, Jakub Kicinski wrote:
> > On Wed, 20 Dec 2017 20:19:43 +0000, Roman Gushchin wrote:  
> > > Bpftool determines it's own version based on the kernel
> > > version, which is picked from the linux/version.h header.
> > > 
> > > It's strange to use the version of the installed kernel
> > > headers, and makes much more sense to use the version
> > > of the actual source tree, where bpftool sources are.
> > > 
> > > This patch adds $(srctree)/usr/include to the list
> > > of include files, which causes bpftool to use the version
> > > from the source tree.
> > > 
> > > Example:
> > > before:
> > > 
> > > $ bpftool version
> > > bpftool v4.14.6
> > > 
> > > after:
> > > $ bpftool version
> > > bpftool v4.15.0  
> > 
> > Thanks for the patch, this would indeed use some improvement.
> > 
> > How about we just run make to get the version like liblockdep does?
> > 
> > LIBLOCKDEP_VERSION=$(shell make --no-print-directory -sC ../../.. kernelversion)
> > 
> > probably s@../../..@$(srctree)@
> > 
> > $(srctree)/usr/include is not going to be there for out-of-source builds.  
> 
> Hm, why it's better? It's not only about the kernel version,
> IMO it's generally better to use includes from the source tree,
> rather then system-wide installed kernel headers.

Right I agree the kernel headers are preferred.  I'm not entirely sure
why we don't use them, if it was OK to assume usr/ is there we wouldn't
need the tools/include/uapi/ contraption.  Maybe Arnaldo could explain?

> I've got about out-of-source builds, but do we support it in general?
> How can I build bpftool outside of the kernel tree?
> I've tried a bit, but failed.

This is what I do:

make -C tools/bpf/bpftool/ W=1 O=/tmp/builds/bpftool

> > > diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
> > > index 9c089cfa5f3f..6864d416c49e 100644
> > > --- a/tools/bpf/bpftool/Makefile
> > > +++ b/tools/bpf/bpftool/Makefile
> > > @@ -37,7 +37,9 @@ CC = gcc
> > >  
> > >  CFLAGS += -O2
> > >  CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
> > > -CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/> > > tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
> > > +CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi
> > > +CFLAGS += -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf
> > > +CFLAGS += -I$(srctree)/kernel/bpf/ -I$(srctree)/usr/include
> > >  LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
> > >  
> > >  INSTALL ?= install

^ permalink raw reply

* Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"
From: Jakub Kicinski @ 2017-12-20 21:59 UTC (permalink / raw)
  To: John Fastabend; +Cc: xiyou.wangcong, jiri, davem, netdev, eric.dumazet
In-Reply-To: <20171220200919.6233.48192.stgit@john-Precision-Tower-5810>

On Wed, 20 Dec 2017 12:09:19 -0800, John Fastabend wrote:
> RCU grace period is needed for lockless qdiscs added in the commit
> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
> 
> It is needed now that qdiscs may be lockless otherwise we risk
> free'ing a qdisc that is still in use from datapath. Additionally,
> push list cleanup into RCU callback. Otherwise we risk the datapath
> adding skbs during removal.
> 
> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Seems like this revert may be too heavy handed:

# ./tools/testing/selftests/bpf/test_offload.py --log /tmp/log
Test destruction of generic XDP...
Test TC non-offloaded...
Test TC non-offloaded isn't getting bound...
Test TC offloads are off by default...
Test TC offload by default...
Test TC cBPF bytcode tries offload by default...
Test TC cBPF unbound bytecode doesn't offload...
Test TC offloads work...
FAIL: TC filter did not load with TC offloads enabled

And it's triggering:

WARNING: CPU: 15 PID: 1853 at ../drivers/net/netdevsim/bpf.c:372 nsim_bpf_uninit+0x2e/0x41 [netdevsim]

Which is:

   368	void nsim_bpf_uninit(struct netdevsim *ns)
   369	{
   370		WARN_ON(!list_empty(&ns->bpf_bound_progs));
   371		WARN_ON(ns->xdp_prog);
>> 372		WARN_ON(ns->bpf_offloaded);
   373	}

(Meaning the offload was not stopped by the stack before ndo_uninit.)

^ permalink raw reply

* [PATCH v3 3/3] net: ibm: emac: support RGMII-[RX|TX]ID phymode
From: Christian Lamparter @ 2017-12-20 22:01 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Andrew Lunn, Christophe Jaillet
In-Reply-To: <2cb74d50c22d01873d1d976ec384917dc799be08.1513806256.git.chunkeey@gmail.com>

The RGMII spec allows compliance for devices that implement an internal
delay on TXC and/or RXC inside the transmitter. This patch adds the
necessary RGMII_[RX|TX]ID mode code to handle such PHYs with the
emac driver.

Signed-off-by: Christian Lamparter <chunkeey@gmail.com>

---
v3: - replace PHY_MODE_* with PHY_INTERFACE_MODE_*
    - replace rgmii_mode_name() with phy_modes[]

v2: - utilize phy_interface_mode_is_rgmii()
---
 drivers/net/ethernet/ibm/emac/core.c  | 4 ++--
 drivers/net/ethernet/ibm/emac/rgmii.c | 7 +++++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 27f592da7cfc..71ddad13baf4 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -199,8 +199,8 @@ static void __emac_set_multicast_list(struct emac_instance *dev);
 
 static inline int emac_phy_supports_gige(int phy_mode)
 {
-	return  phy_mode == PHY_INTERFACE_MODE_GMII ||
-		phy_mode == PHY_INTERFACE_MODE_RGMII ||
+	return  phy_interface_mode_is_rgmii(phy_mode) ||
+		phy_mode == PHY_INTERFACE_MODE_GMII ||
 		phy_mode == PHY_INTERFACE_MODE_SGMII ||
 		phy_mode == PHY_INTERFACE_MODE_TBI ||
 		phy_mode == PHY_INTERFACE_MODE_RTBI;
diff --git a/drivers/net/ethernet/ibm/emac/rgmii.c b/drivers/net/ethernet/ibm/emac/rgmii.c
index 5db225831e81..00f5999de3cf 100644
--- a/drivers/net/ethernet/ibm/emac/rgmii.c
+++ b/drivers/net/ethernet/ibm/emac/rgmii.c
@@ -52,9 +52,9 @@
 /* RGMII bridge supports only GMII/TBI and RGMII/RTBI PHYs */
 static inline int rgmii_valid_mode(int phy_mode)
 {
-	return  phy_mode == PHY_INTERFACE_MODE_GMII ||
+	return  phy_interface_mode_is_rgmii(phy_mode) ||
+		phy_mode == PHY_INTERFACE_MODE_GMII ||
 		phy_mode == PHY_INTERFACE_MODE_MII ||
-		phy_mode == PHY_INTERFACE_MODE_RGMII ||
 		phy_mode == PHY_INTERFACE_MODE_TBI ||
 		phy_mode == PHY_INTERFACE_MODE_RTBI;
 }
@@ -63,6 +63,9 @@ static inline u32 rgmii_mode_mask(int mode, int input)
 {
 	switch (mode) {
 	case PHY_INTERFACE_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII_ID:
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+	case PHY_INTERFACE_MODE_RGMII_TXID:
 		return RGMII_FER_RGMII(input);
 	case PHY_INTERFACE_MODE_TBI:
 		return RGMII_FER_TBI(input);
-- 
2.15.1

^ permalink raw reply related

* [PATCH v3 1/3] net: ibm: emac: replace custom rgmii_mode_name with phy_modes
From: Christian Lamparter @ 2017-12-20 22:01 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Andrew Lunn, Christophe Jaillet

phy_modes() in the common phy.h already defines the same phy mode
names in lower case. The deleted rgmii_mode_name() is used only
in one place and for a "notice-level" printk. Hence, it will not
be missed.

Signed-off-by: Christian Lamparter <chunkeey@gmail.com>
---
 drivers/net/ethernet/ibm/emac/rgmii.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/rgmii.c b/drivers/net/ethernet/ibm/emac/rgmii.c
index c4a1ac38bba8..9a1c06f2471d 100644
--- a/drivers/net/ethernet/ibm/emac/rgmii.c
+++ b/drivers/net/ethernet/ibm/emac/rgmii.c
@@ -59,24 +59,6 @@ static inline int rgmii_valid_mode(int phy_mode)
 		phy_mode == PHY_MODE_RTBI;
 }
 
-static inline const char *rgmii_mode_name(int mode)
-{
-	switch (mode) {
-	case PHY_MODE_RGMII:
-		return "RGMII";
-	case PHY_MODE_TBI:
-		return "TBI";
-	case PHY_MODE_GMII:
-		return "GMII";
-	case PHY_MODE_MII:
-		return "MII";
-	case PHY_MODE_RTBI:
-		return "RTBI";
-	default:
-		BUG();
-	}
-}
-
 static inline u32 rgmii_mode_mask(int mode, int input)
 {
 	switch (mode) {
@@ -115,7 +97,7 @@ int rgmii_attach(struct platform_device *ofdev, int input, int mode)
 	out_be32(&p->fer, in_be32(&p->fer) | rgmii_mode_mask(mode, input));
 
 	printk(KERN_NOTICE "%pOF: input %d in %s mode\n",
-	       ofdev->dev.of_node, input, rgmii_mode_name(mode));
+	       ofdev->dev.of_node, input, phy_modes(mode));
 
 	++dev->users;
 
-- 
2.15.1

^ permalink raw reply related

* [PATCH v3 2/3] net: ibm: emac: replace custom PHY_MODE_* macros
From: Christian Lamparter @ 2017-12-20 22:01 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Andrew Lunn, Christophe Jaillet
In-Reply-To: <a9482f4f4037f6eb732de327290a432539648bcd.1513806256.git.chunkeey@gmail.com>

The ibm_emac driver predates the PHY_INTERFACE_MODE_*
enums by a few years.

And while the driver has been retrofitted to use the PHYLIB,
the old definitions have stuck around to this day.

This patch replaces all occurences of PHY_MODE_* with
the respective equivalent PHY_INTERFACE_MODE_* enum.
And finally, it purges the old macros for good.

Signed-off-by: Christian Lamparter <chunkeey@gmail.com>

---
Extra change in zmii.c to make checkpatch happy.
---
 drivers/net/ethernet/ibm/emac/core.c  | 20 +++++++++---------
 drivers/net/ethernet/ibm/emac/emac.h  | 13 ------------
 drivers/net/ethernet/ibm/emac/phy.c   | 10 ++++-----
 drivers/net/ethernet/ibm/emac/rgmii.c | 20 +++++++++---------
 drivers/net/ethernet/ibm/emac/zmii.c  | 38 +++++++++++++++++------------------
 5 files changed, 44 insertions(+), 57 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c b/drivers/net/ethernet/ibm/emac/core.c
index 7feff2450ed6..27f592da7cfc 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -199,18 +199,18 @@ static void __emac_set_multicast_list(struct emac_instance *dev);
 
 static inline int emac_phy_supports_gige(int phy_mode)
 {
-	return  phy_mode == PHY_MODE_GMII ||
-		phy_mode == PHY_MODE_RGMII ||
-		phy_mode == PHY_MODE_SGMII ||
-		phy_mode == PHY_MODE_TBI ||
-		phy_mode == PHY_MODE_RTBI;
+	return  phy_mode == PHY_INTERFACE_MODE_GMII ||
+		phy_mode == PHY_INTERFACE_MODE_RGMII ||
+		phy_mode == PHY_INTERFACE_MODE_SGMII ||
+		phy_mode == PHY_INTERFACE_MODE_TBI ||
+		phy_mode == PHY_INTERFACE_MODE_RTBI;
 }
 
 static inline int emac_phy_gpcs(int phy_mode)
 {
-	return  phy_mode == PHY_MODE_SGMII ||
-		phy_mode == PHY_MODE_TBI ||
-		phy_mode == PHY_MODE_RTBI;
+	return  phy_mode == PHY_INTERFACE_MODE_SGMII ||
+		phy_mode == PHY_INTERFACE_MODE_TBI ||
+		phy_mode == PHY_INTERFACE_MODE_RTBI;
 }
 
 static inline void emac_tx_enable(struct emac_instance *dev)
@@ -2865,7 +2865,7 @@ static int emac_init_config(struct emac_instance *dev)
 	/* PHY mode needs some decoding */
 	dev->phy_mode = of_get_phy_mode(np);
 	if (dev->phy_mode < 0)
-		dev->phy_mode = PHY_MODE_NA;
+		dev->phy_mode = PHY_INTERFACE_MODE_NA;
 
 	/* Check EMAC version */
 	if (of_device_is_compatible(np, "ibm,emac4sync")) {
@@ -3168,7 +3168,7 @@ static int emac_probe(struct platform_device *ofdev)
 	printk(KERN_INFO "%s: EMAC-%d %pOF, MAC %pM\n",
 	       ndev->name, dev->cell_index, np, ndev->dev_addr);
 
-	if (dev->phy_mode == PHY_MODE_SGMII)
+	if (dev->phy_mode == PHY_INTERFACE_MODE_SGMII)
 		printk(KERN_NOTICE "%s: in SGMII mode\n", ndev->name);
 
 	if (dev->phy.address >= 0)
diff --git a/drivers/net/ethernet/ibm/emac/emac.h b/drivers/net/ethernet/ibm/emac/emac.h
index 5afcc27ceebb..bc14dcf27b6b 100644
--- a/drivers/net/ethernet/ibm/emac/emac.h
+++ b/drivers/net/ethernet/ibm/emac/emac.h
@@ -104,19 +104,6 @@ struct emac_regs {
 	} u1;
 };
 
-/*
- * PHY mode settings (EMAC <-> ZMII/RGMII bridge <-> PHY)
- */
-#define PHY_MODE_NA	PHY_INTERFACE_MODE_NA
-#define PHY_MODE_MII	PHY_INTERFACE_MODE_MII
-#define PHY_MODE_RMII	PHY_INTERFACE_MODE_RMII
-#define PHY_MODE_SMII	PHY_INTERFACE_MODE_SMII
-#define PHY_MODE_RGMII	PHY_INTERFACE_MODE_RGMII
-#define PHY_MODE_TBI	PHY_INTERFACE_MODE_TBI
-#define PHY_MODE_GMII	PHY_INTERFACE_MODE_GMII
-#define PHY_MODE_RTBI	PHY_INTERFACE_MODE_RTBI
-#define PHY_MODE_SGMII	PHY_INTERFACE_MODE_SGMII
-
 /* EMACx_MR0 */
 #define EMAC_MR0_RXI			0x80000000
 #define EMAC_MR0_TXI			0x40000000
diff --git a/drivers/net/ethernet/ibm/emac/phy.c b/drivers/net/ethernet/ibm/emac/phy.c
index 874949faa424..71ee13c34192 100644
--- a/drivers/net/ethernet/ibm/emac/phy.c
+++ b/drivers/net/ethernet/ibm/emac/phy.c
@@ -96,7 +96,7 @@ int emac_mii_reset_gpcs(struct mii_phy *phy)
 	if ((val & BMCR_ISOLATE) && limit > 0)
 		gpcs_phy_write(phy, MII_BMCR, val & ~BMCR_ISOLATE);
 
-	if (limit > 0 && phy->mode == PHY_MODE_SGMII) {
+	if (limit > 0 && phy->mode == PHY_INTERFACE_MODE_SGMII) {
 		/* Configure GPCS interface to recommended setting for SGMII */
 		gpcs_phy_write(phy, 0x04, 0x8120); /* AsymPause, FDX */
 		gpcs_phy_write(phy, 0x07, 0x2801); /* msg_pg, toggle */
@@ -313,16 +313,16 @@ static int cis8201_init(struct mii_phy *phy)
 	epcr &= ~EPCR_MODE_MASK;
 
 	switch (phy->mode) {
-	case PHY_MODE_TBI:
+	case PHY_INTERFACE_MODE_TBI:
 		epcr |= EPCR_TBI_MODE;
 		break;
-	case PHY_MODE_RTBI:
+	case PHY_INTERFACE_MODE_RTBI:
 		epcr |= EPCR_RTBI_MODE;
 		break;
-	case PHY_MODE_GMII:
+	case PHY_INTERFACE_MODE_GMII:
 		epcr |= EPCR_GMII_MODE;
 		break;
-	case PHY_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII:
 	default:
 		epcr |= EPCR_RGMII_MODE;
 	}
diff --git a/drivers/net/ethernet/ibm/emac/rgmii.c b/drivers/net/ethernet/ibm/emac/rgmii.c
index 9a1c06f2471d..5db225831e81 100644
--- a/drivers/net/ethernet/ibm/emac/rgmii.c
+++ b/drivers/net/ethernet/ibm/emac/rgmii.c
@@ -52,25 +52,25 @@
 /* RGMII bridge supports only GMII/TBI and RGMII/RTBI PHYs */
 static inline int rgmii_valid_mode(int phy_mode)
 {
-	return  phy_mode == PHY_MODE_GMII ||
-		phy_mode == PHY_MODE_MII ||
-		phy_mode == PHY_MODE_RGMII ||
-		phy_mode == PHY_MODE_TBI ||
-		phy_mode == PHY_MODE_RTBI;
+	return  phy_mode == PHY_INTERFACE_MODE_GMII ||
+		phy_mode == PHY_INTERFACE_MODE_MII ||
+		phy_mode == PHY_INTERFACE_MODE_RGMII ||
+		phy_mode == PHY_INTERFACE_MODE_TBI ||
+		phy_mode == PHY_INTERFACE_MODE_RTBI;
 }
 
 static inline u32 rgmii_mode_mask(int mode, int input)
 {
 	switch (mode) {
-	case PHY_MODE_RGMII:
+	case PHY_INTERFACE_MODE_RGMII:
 		return RGMII_FER_RGMII(input);
-	case PHY_MODE_TBI:
+	case PHY_INTERFACE_MODE_TBI:
 		return RGMII_FER_TBI(input);
-	case PHY_MODE_GMII:
+	case PHY_INTERFACE_MODE_GMII:
 		return RGMII_FER_GMII(input);
-	case PHY_MODE_MII:
+	case PHY_INTERFACE_MODE_MII:
 		return RGMII_FER_MII(input);
-	case PHY_MODE_RTBI:
+	case PHY_INTERFACE_MODE_RTBI:
 		return RGMII_FER_RTBI(input);
 	default:
 		BUG();
diff --git a/drivers/net/ethernet/ibm/emac/zmii.c b/drivers/net/ethernet/ibm/emac/zmii.c
index 89c42d362292..fdcc734541fe 100644
--- a/drivers/net/ethernet/ibm/emac/zmii.c
+++ b/drivers/net/ethernet/ibm/emac/zmii.c
@@ -49,20 +49,20 @@
  */
 static inline int zmii_valid_mode(int mode)
 {
-	return  mode == PHY_MODE_MII ||
-		mode == PHY_MODE_RMII ||
-		mode == PHY_MODE_SMII ||
-		mode == PHY_MODE_NA;
+	return  mode == PHY_INTERFACE_MODE_MII ||
+		mode == PHY_INTERFACE_MODE_RMII ||
+		mode == PHY_INTERFACE_MODE_SMII ||
+		mode == PHY_INTERFACE_MODE_NA;
 }
 
 static inline const char *zmii_mode_name(int mode)
 {
 	switch (mode) {
-	case PHY_MODE_MII:
+	case PHY_INTERFACE_MODE_MII:
 		return "MII";
-	case PHY_MODE_RMII:
+	case PHY_INTERFACE_MODE_RMII:
 		return "RMII";
-	case PHY_MODE_SMII:
+	case PHY_INTERFACE_MODE_SMII:
 		return "SMII";
 	default:
 		BUG();
@@ -72,11 +72,11 @@ static inline const char *zmii_mode_name(int mode)
 static inline u32 zmii_mode_mask(int mode, int input)
 {
 	switch (mode) {
-	case PHY_MODE_MII:
+	case PHY_INTERFACE_MODE_MII:
 		return ZMII_FER_MII(input);
-	case PHY_MODE_RMII:
+	case PHY_INTERFACE_MODE_RMII:
 		return ZMII_FER_RMII(input);
-	case PHY_MODE_SMII:
+	case PHY_INTERFACE_MODE_SMII:
 		return ZMII_FER_SMII(input);
 	default:
 		return 0;
@@ -106,27 +106,27 @@ int zmii_attach(struct platform_device *ofdev, int input, int *mode)
 	 * Please, always specify PHY mode in your board port to avoid
 	 * any surprises.
 	 */
-	if (dev->mode == PHY_MODE_NA) {
-		if (*mode == PHY_MODE_NA) {
+	if (dev->mode == PHY_INTERFACE_MODE_NA) {
+		if (*mode == PHY_INTERFACE_MODE_NA) {
 			u32 r = dev->fer_save;
 
 			ZMII_DBG(dev, "autodetecting mode, FER = 0x%08x" NL, r);
 
 			if (r & (ZMII_FER_MII(0) | ZMII_FER_MII(1)))
-				dev->mode = PHY_MODE_MII;
+				dev->mode = PHY_INTERFACE_MODE_MII;
 			else if (r & (ZMII_FER_RMII(0) | ZMII_FER_RMII(1)))
-				dev->mode = PHY_MODE_RMII;
+				dev->mode = PHY_INTERFACE_MODE_RMII;
 			else
-				dev->mode = PHY_MODE_SMII;
-		} else
+				dev->mode = PHY_INTERFACE_MODE_SMII;
+		} else {
 			dev->mode = *mode;
-
+		}
 		printk(KERN_NOTICE "%pOF: bridge in %s mode\n",
 		       ofdev->dev.of_node,
 		       zmii_mode_name(dev->mode));
 	} else {
 		/* All inputs must use the same mode */
-		if (*mode != PHY_MODE_NA && *mode != dev->mode) {
+		if (*mode != PHY_INTERFACE_MODE_NA && *mode != dev->mode) {
 			printk(KERN_ERR
 			       "%pOF: invalid mode %d specified for input %d\n",
 			       ofdev->dev.of_node, *mode, input);
@@ -246,7 +246,7 @@ static int zmii_probe(struct platform_device *ofdev)
 
 	mutex_init(&dev->lock);
 	dev->ofdev = ofdev;
-	dev->mode = PHY_MODE_NA;
+	dev->mode = PHY_INTERFACE_MODE_NA;
 
 	rc = -ENXIO;
 	if (of_address_to_resource(np, 0, &regs)) {
-- 
2.15.1

^ permalink raw reply related

* Re: [PATCH v5 3/6] perf: implement pmu perf_kprobe
From: Song Liu @ 2017-12-20 22:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, mingo@redhat.com, David Miller,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Daniel Borkmann, Kernel Team
In-Reply-To: <20171220212554.v4obwcp25hee556y@hirez.programming.kicks-ass.net>


> On Dec 20, 2017, at 1:25 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Dec 20, 2017 at 06:10:11PM +0000, Song Liu wrote:
>> I think there is one more thing to change:
> 
> OK, folded that too; it should all be at:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
> 
> Can you verify it all looks/works right?

Thanks Peter! The patches look right. And they work as expected in my tests. 

Best,
Song

^ permalink raw reply

* Re: RCU callback crashes
From: Cong Wang @ 2017-12-20 22:25 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jiri Pirko, Jakub Kicinski, netdev@vger.kernel.org
In-Reply-To: <d82f1fdf-5d7b-db1b-6efb-85ff957f8a38@gmail.com>

On Wed, Dec 20, 2017 at 12:14 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Hi,
>
> Just sent a patch to complete qdisc_destroy from rcu callback. This
> is needed to resolve a race with the lockless qdisc patches.
>
> But I guess it should fix the miniq issue as well?


If you ever look into tools/testing/selftests/bpf/test_offload.py, it is
ingress qdisc. Don't know why you keep believing it is related
to your patches.

^ permalink raw reply

* [PATCH] hv_netvsc: update VF after name has changed.
From: Stephen Hemminger @ 2017-12-20 22:33 UTC (permalink / raw)
  To: davem; +Cc: netdev, Stephen Hemminger

Since commit 6123c66854c1 ("netvsc: delay setup of VF device")
the automatic bring up of the VF is delayed to allow userspace (udev)
a chance to rename the device. This delay is problematic because
it delays boot and may not be long enough for some cases.

Instead, use the rename can be used to trigger the next step
in setup to happen immediately.

The VF initialization sequence now looks like:
   * hotplug causes VF network device probe to create network device
   * netvsc notifier joins VF with netvsc and schedules VF to be
     setup after timer expires
   * udev in userspace renames device
   * if netvsc notifier detects rename, it can cancel timer
     and do immediate setup

The delay can also be increased to allow for slower rules.
Still need the delayed work to handle the case where rename is
not done.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 drivers/net/hyperv/netvsc_drv.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index c5584c2d440e..7771d6c54b55 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -49,7 +49,7 @@
 #define RING_SIZE_MIN		64
 
 #define LINKCHANGE_INT (2 * HZ)
-#define VF_TAKEOVER_INT (HZ / 10)
+#define VF_TAKEOVER_INT (HZ / 4)
 
 static unsigned int ring_size __ro_after_init = 128;
 module_param(ring_size, uint, S_IRUGO);
@@ -1894,6 +1894,26 @@ static int netvsc_vf_changed(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
+/* If VF was renamed, then udev (or other tool) has done its
+ * work and the VF can be brought up.
+ */
+static int netvsc_vf_renamed(struct net_device *vf_netdev)
+{
+	struct net_device_context *ndev_ctx;
+	struct net_device *ndev;
+
+	ndev = get_netvsc_byref(vf_netdev);
+	if (!ndev)
+		return NOTIFY_DONE;
+
+	ndev_ctx = netdev_priv(ndev);
+
+	if (cancel_delayed_work(&ndev_ctx->vf_takeover))
+		__netvsc_vf_setup(ndev, vf_netdev);
+
+	return NOTIFY_OK;
+}
+
 static int netvsc_unregister_vf(struct net_device *vf_netdev)
 {
 	struct net_device *ndev;
@@ -2110,6 +2130,8 @@ static int netvsc_netdev_event(struct notifier_block *this,
 	case NETDEV_UP:
 	case NETDEV_DOWN:
 		return netvsc_vf_changed(event_dev);
+	case NETDEV_CHANGENAME:
+		return netvsc_vf_renamed(event_dev);
 	default:
 		return NOTIFY_DONE;
 	}
-- 
2.11.0

^ permalink raw reply related

* Re: [RFC PATCH] virtio_net: Extend virtio to use VF datapath when available
From: Jakub Kicinski @ 2017-12-20 22:33 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: mst, stephen, netdev, virtualization, alexander.duyck
In-Reply-To: <1513644036-45230-1-git-send-email-sridhar.samudrala@intel.com>

On Mon, 18 Dec 2017 16:40:36 -0800, Sridhar Samudrala wrote:
> +static int virtio_netdev_event(struct notifier_block *this,
> +			       unsigned long event, void *ptr)
> +{
> +	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> +	/* Skip our own events */
> +	if (event_dev->netdev_ops == &virtnet_netdev)
> +		return NOTIFY_DONE;

I wonder how does this work WRT loop prevention.  What if I have two
virtio devices with the same MAC, what is preventing them from claiming
each other?  Is it only the check above (not sure what is "own" in
comment referring to)?

I'm worried the check above will not stop virtio from enslaving hyperv
interfaces and vice versa, potentially leading to a loop, no?  There is
also the fact that it would be preferable to share the code between
paravirt drivers, to avoid duplicated bugs.

My suggestion during the previous discussion was to create a paravirt
bond device, which will explicitly tell the OS which interfaces to
bond, regardless of which driver they're using.  Could be some form of
ACPI/FW driver too, I don't know enough about x86 FW to suggest
something fully fleshed :(

^ permalink raw reply

* [PATCH net 0/2] zerocopy fixes
From: Willem de Bruijn @ 2017-12-20 22:37 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The removal of UFO hardware offload support exposed a bug in
segmentation of zerocopy skbs.

The reference counting mechanism for msg_zerocopy was incorrectly
applied to vhost_net zerocopy packets.

The other issue observed through analysis. We do not cook skbs
with skb_zcopy(skb) but no frags, but this is possible in principle.
Correctly call skb_zcopy_clear on those.

Willem de Bruijn (2):
  skbuff: orphan frags before zerocopy clone
  skbuff: skb_copy_ubufs must release uarg even without user frags

 net/core/skbuff.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox