Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v2 bpf-next 03/11] bpf: Add write access to tcp_sock and sock fields
From: Lawrence Brakmo @ 2017-12-22  1:20 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171222012101.3899534-1-brakmo@fb.com>

This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/filter.h |  3 +++
 include/net/tcp.h      |  2 +-
 net/core/filter.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2b0df27..2e6e889 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1005,6 +1005,9 @@ struct bpf_sock_ops_kern {
 		u32 replylong[4];
 	};
 	u32	is_fullsock;
+	u64	temp;			/* Used by sock_ops_convert_ctx_access
+					 * as temporary storaage of a register
+					 */
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69..108d16a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2010,7 +2010,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 	struct bpf_sock_ops_kern sock_ops;
 	int ret;
 
-	memset(&sock_ops, 0, sizeof(sock_ops));
+	memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
 	if (sk_fullsock(sk)) {
 		sock_ops.is_fullsock = 1;
 		sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index 7064862..978ac78 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,6 +4470,54 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      offsetof(OBJ, OBJ_FIELD));	      \
 	} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		int reg = BPF_REG_9;					      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
+				      reg, si->src_reg,			      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+	} while (0)
+
+#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
+	do {								      \
+		if (TYPE == BPF_WRITE)					      \
+			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+		else							      \
+			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+	} while (0)
+
 	case offsetof(struct bpf_sock_ops, snd_cwnd):
 		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
 		break;
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
From: Lawrence Brakmo @ 2017-12-22  1:21 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171222012101.3899534-1-brakmo@fb.com>

Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

New op: BPF_SOCK_OPS_STATE_CB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 4 ++++
 include/uapi/linux/tcp.h | 1 +
 net/ipv4/tcp.c           | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 415b951..14040e0 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1024,6 +1024,10 @@ enum {
 					 * Arg1: sequence number of 1st byte
 					 * Arg2: # segments
 					 */
+	BPF_SOCK_OPS_STATE_CB,		/* Called when TCP changes state.
+					 * Arg1: old_state
+					 * Arg2: new_state
+					 */
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index dc36d3c..211322c 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -262,6 +262,7 @@ struct tcp_md5sig {
 /* Definitions for bpf_sock_ops_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG	(1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG	(1<<1)
+#define BPF_SOCK_OPS_STATE_CB_FLAG	(1<<2)
 
 /* INET_DIAG_MD5SIG */
 struct tcp_diag_md5sig {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 587d65c..f342f31 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2041,6 +2041,8 @@ void tcp_set_state(struct sock *sk, int state)
 	int oldstate = sk->sk_state;
 
 	trace_tcp_set_state(sk, oldstate, state);
+	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+		tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
 
 	switch (state) {
 	case TCP_ESTABLISHED:
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent
From: Lawrence Brakmo @ 2017-12-22  1:20 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171222012101.3899534-1-brakmo@fb.com>

Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:      SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.x. 0 or 0xffffffff), or
  2) Make the verifier fail if that field is accessed (i.e. program
    fails to load) so the user will know that field is no longer
    supported.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/core/filter.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 099ff9fd..7064862 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4448,11 +4448,11 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 					       is_fullsock));
 		break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME)					      \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
 	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
 		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
 						struct bpf_sock_ops_kern,     \
 						is_fullsock),		      \
@@ -4464,18 +4464,18 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 						struct bpf_sock_ops_kern, sk),\
 				      si->dst_reg, si->src_reg,		      \
 				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,	      \
-						   FIELD_NAME), si->dst_reg,  \
-				      si->dst_reg,			      \
-				      offsetof(struct tcp_sock, FIELD_NAME)); \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
+						       OBJ_FIELD),	      \
+				      si->dst_reg, si->dst_reg,		      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
 	} while (0)
 
 	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_TCP(snd_cwnd);
+		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
 		break;
 
 	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_TCP(srtt_us);
+		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
 		break;
 	}
 	return insn - insn_buf;
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash
From: Lawrence Brakmo @ 2017-12-22  1:20 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171222012101.3899534-1-brakmo@fb.com>

Adds direct R/W access to sk_txhash and access to tclass for ipv6 flows
through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c        | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c81fdec..5bead0e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,7 @@ struct bpf_sock_ops {
 	__u32 data_segs_out;
 	__u64 bytes_received;
 	__u64 bytes_acked;
+	__u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index d4c5c1a..4d2ff88 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3230,6 +3230,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 			ret = -EINVAL;
 		}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (level == SOL_IPV6) {
+		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+			return -EINVAL;
+
+		val = *((int *)optval);
+		/* Only some options are supported */
+		switch (optname) {
+		case IPV6_TCLASS:
+			if (val < -1 || val > 0xff) {
+				ret = -EINVAL;
+			} else {
+				struct ipv6_pinfo *np = inet6_sk(sk);
+
+				if (val == -1)
+					val = 0;
+				np->tclass = val;
+			}
+			break;
+		default:
+			ret = -EINVAL;
+		}
+#endif
 	} else if (level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
 		if (optname == TCP_CONGESTION) {
@@ -3239,7 +3262,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 			strncpy(name, optval, min_t(long, optlen,
 						    TCP_CA_NAME_MAX-1));
 			name[TCP_CA_NAME_MAX-1] = 0;
-			ret = tcp_set_congestion_control(sk, name, false, reinit);
+			ret = tcp_set_congestion_control(sk, name, false,
+							 reinit);
 		} else {
 			struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3305,6 +3329,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 		} else {
 			goto err_clear;
 		}
+#if IS_ENABLED(CONFIG_IPV6)
+	} else if (level == SOL_IPV6) {
+		struct ipv6_pinfo *np = inet6_sk(sk);
+
+		if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+			goto err_clear;
+
+		/* Only some options are supported */
+		switch (optname) {
+		case IPV6_TCLASS:
+			*((int *)optval) = (int)np->tclass;
+			break;
+		default:
+			goto err_clear;
+		}
+#endif
 	} else {
 		goto err_clear;
 	}
@@ -3844,6 +3884,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 		case offsetof(struct bpf_sock_ops, op) ...
 		     offsetof(struct bpf_sock_ops, replylong[3]):
 		case offsetof(struct bpf_sock_ops, bpf_sock_ops_flags):
+		case offsetof(struct bpf_sock_ops, sk_txhash):
 			break;
 		default:
 			return false;
@@ -4631,6 +4672,11 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_sock_ops, bytes_acked):
 		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
 		break;
+
+	case offsetof(struct bpf_sock_ops, sk_txhash):
+		SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+					  struct sock, type);
+		break;
 	}
 	return insn - insn_buf;
 }
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 04/11] bpf: Support passing args to sock_ops bpf function
From: Lawrence Brakmo @ 2017-12-22  1:20 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171222012101.3899534-1-brakmo@fb.com>

Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reusues the reply union, so the bpf_sock_ops structures are not
increased in size.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h        | 64 ++++++++++++++++++++++++++++++++++++++++++++----
 include/uapi/linux/bpf.h |  5 ++--
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_nv.c        |  2 +-
 net/ipv4/tcp_output.c    |  2 +-
 6 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2e6e889..97f9a02 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1001,6 +1001,7 @@ struct bpf_sock_ops_kern {
 	struct	sock *sk;
 	u32	op;
 	union {
+		u32 args[4];
 		u32 reply;
 		u32 replylong[4];
 	};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 108d16a..8e9111f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2005,7 +2005,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
 	struct bpf_sock_ops_kern sock_ops;
 	int ret;
@@ -2018,6 +2018,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
 	sock_ops.sk = sk;
 	sock_ops.op = op;
+	if (nargs > 0)
+		memcpy(sock_ops.args, args, nargs*sizeof(u32));
 
 	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
 	if (ret == 0)
@@ -2026,18 +2028,70 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 		ret = -1;
 	return ret;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+	return tcp_call_bpf(sk, op, 1, &arg);
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2)
+{
+	u32 args[2] = {arg1, arg2};
+
+	return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+				    u32 arg3)
+{
+	u32 args[3] = {arg1, arg2, arg3};
+
+	return tcp_call_bpf(sk, op, 3, args);
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+				    u32 arg3, u32 arg4)
+{
+	u32 args[4] = {arg1, arg2, arg3, arg4};
+
+	return tcp_call_bpf(sk, op, 4, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
 	return -EPERM;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+	return -EPERM;
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2)
+{
+	return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+	u32 arg3)
+{
+	return -EPERM;
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 arg2,
+				    u32 arg3, u32 arg4)
+{
+	return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
 	int timeout;
 
-	timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+	timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
 	if (timeout <= 0)
 		timeout = TCP_TIMEOUT_INIT;
@@ -2048,7 +2102,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
 	int rwnd;
 
-	rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+	rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
 	if (rwnd < 0)
 		rwnd = 0;
@@ -2057,7 +2111,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-	return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+	return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69eabfc..1fea987 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -942,8 +942,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
 	__u32 op;
 	union {
-		__u32 reply;
-		__u32 replylong[4];
+		__u32 args[4];		/* Optionally passed to bpf program */
+		__u32 reply;		/* Returned by bpf program	    */
+		__u32 replylong[4];	/* Optionally returned by bpf prog  */
 	};
 	__u32 family;
 	__u32 remote_ip4;	/* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c470fec..587d65c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -465,7 +465,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
 	tcp_mtup_init(sk);
 	icsk->icsk_af_ops->rebuild_header(sk);
 	tcp_init_metrics(sk);
-	tcp_call_bpf(sk, bpf_op);
+	tcp_call_bpf(sk, bpf_op, 0, NULL);
 	tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 	 * within a datacenter, where we have reasonable estimates of
 	 * RTTs
 	 */
-	base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+	base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);
 	if (base_rtt > 0) {
 		ca->nv_base_rtt = base_rtt;
 		ca->nv_lower_bound_rtt = (base_rtt * 205) >> 8; /* 80% */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 04be9f83..b093985 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3468,7 +3468,7 @@ int tcp_connect(struct sock *sk)
 	struct sk_buff *buff;
 	int err;
 
-	tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB);
+	tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB, 0, NULL);
 
 	if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk))
 		return -EHOSTUNREACH; /* Routing failure or similar. */
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 bpf-next 00/11]
From: Lawrence Brakmo @ 2017-12-22  1:20 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Neal Cardwell, Yuchung Cheng

This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
    below used "bpf" and Patchwork didn't work correctly.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>

Consists of the following patches:
[PATCH v2 bpf-next 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH v2 bpf-next 02/11] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH v2 bpf-next 03/11] bpf: Add write access to tcp_sock and sock
[PATCH v2 bpf-next 04/11] bpf: Support passing args to sock_ops bpf
[PATCH v2 bpf-next 05/11] bpf: Adds field bpf_sock_ops_flags to
[PATCH v2 bpf-next 06/11] bpf: Add sock_ops RTO callback
[PATCH v2 bpf-next 07/11] bpf: Add support for reading sk_state and
[PATCH v2 bpf-next 08/11] bpf: Add sock_ops R/W access to tclass &
[PATCH v2 bpf-next 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH v2 bpf-next 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH v2 bpf-next 11/11] bpf: add selftest for tcpbpf

 include/linux/filter.h                         |   4 +
 include/linux/tcp.h                            |   8 ++
 include/net/tcp.h                              |  66 +++++++++-
 include/uapi/linux/bpf.h                       |  39 +++++-
 include/uapi/linux/tcp.h                       |   5 +
 net/core/filter.c                              | 221 ++++++++++++++++++++++++++++++--
 net/ipv4/tcp.c                                 |   4 +-
 net/ipv4/tcp_nv.c                              |   2 +-
 net/ipv4/tcp_output.c                          |   5 +-
 net/ipv4/tcp_timer.c                           |   9 ++
 tools/include/uapi/linux/bpf.h                 |  45 ++++++-
 tools/testing/selftests/bpf/Makefile           |   4 +-
 tools/testing/selftests/bpf/tcp_client.py      |  52 ++++++++
 tools/testing/selftests/bpf/tcp_server.py      |  79 ++++++++++++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 125 ++++++++++++++++++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 113 ++++++++++++++++
 16 files changed, 757 insertions(+), 24 deletions(-)

^ permalink raw reply

* correctness of BPF stack size checking logic for multi-function programs?
From: Jann Horn @ 2017-12-22  1:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: kernel list, Network Development

Hi!

I saw the recently-added support for multiple functions in a single
program in BPF. I've stumbled over something that looks like it might
be a bug; I haven't verified it yet, but I thought I should give you a
heads-up before this lands in a release in case I'm right. If I'm
wrong, it might be worth adding a comment to stacksafe() that explains
why.

stacksafe() has the following code:

/* if explored stack has more populated slots than current stack
* such stacks are not equivalent
*/
if (old->allocated_stack > cur->allocated_stack)
return false;

Note that if the old state had a smaller stack than the new state,
that is permitted because it is guaranteed that none of the extra
space in the new state will be used.

However, as far as I can tell, this can be used to smuggle a call
chain with a total stack size bigger than the permitted maximum
through the verifier, with code roughly as follows:

void b(void) {
  <allocate 300 bytes stack>
}
void main(void) {
  if (<condition>) {
    <allocate 300 bytes stack>
  }
  b();
}

AFAICS, if the verifier first verifies the branch of main() where
<condition> is false, it will go down into b, seeing a total stack
size of around 300 bytes. Afterwards, it will verify the branch of
main() where <condition> is true, but the states will converge after
the branch, preventing the verifier from going down into b() again and
discovering through update_stack_depth() that actually, the total
stack size is around 600 bytes. From a coarse look, it seems like this
might be usable to overflow the kernel stack, which would be
exploitable on systems without vmapped stack?

And actually, you could probably even trigger it with something like
this, since the stack is always fully allocated on function entry:

void b(void) {
  <allocate 300 bytes stack>
}
void main(void) {
  b();
  <allocate 300 bytes stack>
}

So I think it might be necessary to do an extra pass for the stack
depth checking when all the stackframe sizes are known?

^ permalink raw reply

* [PATCH] tcp: make function tcp_recv_timestamp static
From: Colin King @ 2017-12-22  1:01 UTC (permalink / raw)
  To: David S . Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI, netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

The function tcp_recv_timestamp is local to the source and does not
need to be in global scope, so make it static.

Cleans up sparse warning:
symbol 'tcp_recv_timestamp' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 net/ipv4/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ca042cdf8496..3d93ae649bbd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1729,7 +1729,7 @@ static void tcp_update_recv_tstamps(struct sk_buff *skb,
 }

 /* Similar to __sock_recv_timestamp, but does not require an skb */
-void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk,
+static void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk,
 			       struct scm_timestamping *tss)
 {
 	struct timeval tv;
-- 
2.14.1

^ permalink raw reply related

* linux-next: build failure after merge of the net-next tree
From: Stephen Rothwell @ 2017-12-22  0:45 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List, Ido Schimmel,
	Eric Dumazet

Hi all,

After merging the net-next tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

net/ipv6/route.c: In function 'inet6_rtm_getroute':
net/ipv6/route.c:4324:25: error: 'struct dst_entry' has no member named 'from'
  if (fibmatch && rt->dst.from) {
                         ^
In file included from include/linux/uio.h:12:0,
                 from include/linux/socket.h:8,
                 from net/ipv6/route.c:34:
net/ipv6/route.c:4325:46: error: 'struct dst_entry' has no member named 'from'
   struct rt6_info *ort = container_of(rt->dst.from,
                                              ^
include/linux/kernel.h:929:26: note: in definition of macro 'container_of'
  void *__mptr = (void *)(ptr);     \
                          ^
In file included from include/linux/kernel.h:10:0,
                 from include/linux/uio.h:12,
                 from include/linux/socket.h:8,
                 from net/ipv6/route.c:34:
net/ipv6/route.c:4325:46: error: 'struct dst_entry' has no member named 'from'
   struct rt6_info *ort = container_of(rt->dst.from,
                                              ^
include/linux/compiler.h:301:19: note: in definition of macro '__compiletime_assert'
   bool __cond = !(condition);    \
                   ^
include/linux/compiler.h:324:2: note: in expansion of macro '_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^
include/linux/build_bug.h:47:37: note: in expansion of macro 'compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^
include/linux/kernel.h:930:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
  ^
include/linux/kernel.h:930:20: note: in expansion of macro '__same_type'
  BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
                    ^
net/ipv6/route.c:4325:26: note: in expansion of macro 'container_of'
   struct rt6_info *ort = container_of(rt->dst.from,
                          ^
net/ipv6/route.c:4325:46: error: 'struct dst_entry' has no member named 'from'
   struct rt6_info *ort = container_of(rt->dst.from,
                                              ^
include/linux/compiler.h:301:19: note: in definition of macro '__compiletime_assert'
   bool __cond = !(condition);    \
                   ^
include/linux/compiler.h:324:2: note: in expansion of macro '_compiletime_assert'
  _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
  ^
include/linux/build_bug.h:47:37: note: in expansion of macro 'compiletime_assert'
 #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                     ^
include/linux/kernel.h:930:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
  BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
  ^
include/linux/kernel.h:931:6: note: in expansion of macro '__same_type'
     !__same_type(*(ptr), void),   \
      ^
net/ipv6/route.c:4325:26: note: in expansion of macro 'container_of'
   struct rt6_info *ort = container_of(rt->dst.from,
                          ^

Caused by commit

  3a2232e92e87 ("ipv6: Move dst->from into struct rt6_info")

interacting with commit

  58acfd714e6b ("ipv6: Honor specified parameters in fibmatch lookup"

from the net tree.

I have added the following merge fix patch for today (I am guessing
a bit here):

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Fri, 22 Dec 2017 11:25:13 +1100
Subject: [PATCH] ipv6: fix up for "ipv6: Move dst->from into struct rt6_info"

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/ipv6/route.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4efaac956f0c..2490280b3394 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4321,9 +4321,8 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 		goto errout;
 	}
 
-	if (fibmatch && rt->dst.from) {
-		struct rt6_info *ort = container_of(rt->dst.from,
-						    struct rt6_info, dst);
+	if (fibmatch && rt->from) {
+		struct rt6_info *ort = rt->from;
 
 		dst_hold(&ort->dst);
 		ip6_rt_put(rt);
-- 
2.15.0

-- 
Cheers,
Stephen Rothwell

^ permalink raw reply related

* Re: [PATCH bpf-next] samples/bpf: adjust rlimit RLIMIT_MEMLOCK for sampleip
From: Daniel Borkmann @ 2017-12-22  0:24 UTC (permalink / raw)
  To: Prashant Bhole, Alexei Starovoitov; +Cc: netdev, bgregg
In-Reply-To: <20171221094905.4872-1-bhole_prashant_q7@lab.ntt.co.jp>

On 12/21/2017 10:49 AM, Prashant Bhole wrote:
> The default memlock rlimit is 64KB, which causes failure in
> creating a map
> 
> For example:
> test@test# ./sampleip
> failed to create a map: 1 Operation not permitted
> ERROR: loading BPF program (errno 1):
> Try: ulimit -l unlimited
> 
> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
> ---
>  samples/bpf/sampleip_user.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
> index 4ed690b907ff..f240a7db7c0a 100644
> --- a/samples/bpf/sampleip_user.c
> +++ b/samples/bpf/sampleip_user.c
> @@ -19,6 +19,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/bpf.h>
>  #include <sys/ioctl.h>
> +#include <sys/resource.h>
>  #include "libbpf.h"
>  #include "bpf_load.h"
>  #include "perf-sys.h"
> @@ -132,8 +133,9 @@ static void int_exit(int sig)
>  
>  int main(int argc, char **argv)
>  {
> -	char filename[256];
>  	int *pmu_fd, opt, freq = DEFAULT_FREQ, secs = DEFAULT_SECS;
> +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
> +	char filename[256];
>  
>  	/* process arguments */
>  	while ((opt = getopt(argc, argv, "F:h")) != -1) {
> @@ -154,6 +156,11 @@ int main(int argc, char **argv)
>  		return 1;
>  	}
>  
> +	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
> +		perror("Failed to set memlock rlimit");
> +		return 1;
> +	}
> +
>  	/* initialize kernel symbol translation */
>  	if (load_kallsyms()) {
>  		fprintf(stderr, "ERROR: loading /proc/kallsyms\n");
> @@ -171,12 +178,8 @@ int main(int argc, char **argv)
>  	/* load BPF program */
>  	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
>  	if (load_bpf_file(filename)) {
> -		fprintf(stderr, "ERROR: loading BPF program (errno %d):\n",
> -			errno);
> -		if (strcmp(bpf_log_buf, "") == 0)
> -			fprintf(stderr, "Try: ulimit -l unlimited\n");

Given the author of that sample code clearly gave this as a hint to make
the decision up to the user to tweak ulimit, I don't think we should
then do it unconditionally in the sample program here. Therefore, I'm
not taking this, sorry.

> -		else
> -			fprintf(stderr, "%s", bpf_log_buf);
> +		fprintf(stderr, "ERROR: loading BPF program (errno %d): %s\n",
> +			errno, bpf_log_buf);
>  		return 1;
>  	}
>  	signal(SIGINT, int_exit);
> 

^ permalink raw reply

* Re: [PATCH bpf-next 0/11] bpf: more sock_ops callbacks
From: Lawrence Brakmo @ 2017-12-22  0:23 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng
In-Reply-To: <c77a7fb5-5bd2-7b2b-8fca-b2e6f9502499@iogearbox.net>

Daniel,

Dam, by mistake I copied the “consists of the following pachtes” from the previous bpf branch commit. I will send a corrected patch set in a few minutes.

Thanks,

- Lawrence

On 12/21/17, 4:03 PM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:

    On 12/20/2017 10:16 PM, Lawrence Brakmo wrote:
    > This patchset adds support for:
    > 
    > - direct R or R/W access to many tcp_sock fields
    > - passing up to 4 arguments to sock_ops BPF functions
    > - tcp_sock field bpf_sock_ops_flags for controlling callbacks
    > - optionally calling sock_ops BPF program when RTO fires
    > - optionally calling sock_ops BPF program when packet is retransmitted
    > - optionally calling sock_ops BPF program when TCP state changes
    > - access to tclass and sk_txhash
    > - new selftest
    > 
    > Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
    > 
    > Consists of the following patches:
    > [PATCH bpf 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
    > [PATCH bpf 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent
    > [PATCH bpf 03/11] bpf: Add write access to tcp_sock and sock fields
    > [PATCH bpf 04/11] bpf: Support passing args to sock_ops bpf function
    > [PATCH bpf 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock
    > [PATCH bpf 06/11] bpf: Add sock_ops RTO callback
    > [PATCH bpf 07/11] bpf: Add support for reading sk_state and more
    > [PATCH bpf 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash
    > [PATCH bpf 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
    > [PATCH bpf 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
    > [PATCH bpf 11/11] bpf: add selftest for tcpbpf
    
    Hmm, looks like only ever [1] and [2] made it into patchwork for some
    reason and both under a different series. Something wrong with mailer
    config?
    
    Cheers,
    Daniel
    
      [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_patch_851690_&d=DwICaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_g&m=Kg_lciwL9AOJWdB5GjpWeRoRn3Vx0n3O4ttPPITzmf0&s=bl0Hj1SWmDCUF9_ZkT6QI-kbMiTyUOh0xhoy0FIsS9A&e=
      [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_patch_851689_&d=DwICaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_g&m=Kg_lciwL9AOJWdB5GjpWeRoRn3Vx0n3O4ttPPITzmf0&s=BitYJKyncTLIJ35HMAPqjXpU5gm4B4B5tDgk1KOLU6o&e=
    
      (First two in: https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_project_netdev_list_-3Fsubmitter-3D66772-26state-3D-2A&d=DwICaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_g&m=Kg_lciwL9AOJWdB5GjpWeRoRn3Vx0n3O4ttPPITzmf0&s=0BGtuzYNs3pIzBEnWZUpCVEyT0DcZqccyQwAk5H1SH8&e=)
    


^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2017-12-22  0:11 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List, Jann Horn,
	Daniel Borkmann, Alexei Starovoitov

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  kernel/bpf/verifier.c

between commit:

  0c17d1d2c619 ("bpf: fix incorrect tracking of register size truncation")

from the net tree and commits:

  f4d7e40a5b71 ("bpf: introduce function calls (verification)")
  1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc kernel/bpf/verifier.c
index 04b24876cd23,48b2901cf483..000000000000
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@@ -1072,29 -1425,54 +1430,77 @@@ static int check_ptr_alignment(struct b
  					   strict);
  }
  
 +/* truncate register to smaller size (in bytes)
 + * must be called with size < BPF_REG_SIZE
 + */
 +static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
 +{
 +	u64 mask;
 +
 +	/* clear high bits in bit representation */
 +	reg->var_off = tnum_cast(reg->var_off, size);
 +
 +	/* fix arithmetic bounds */
 +	mask = ((u64)1 << (size * 8)) - 1;
 +	if ((reg->umin_value & ~mask) == (reg->umax_value & ~mask)) {
 +		reg->umin_value &= mask;
 +		reg->umax_value &= mask;
 +	} else {
 +		reg->umin_value = 0;
 +		reg->umax_value = mask;
 +	}
 +	reg->smin_value = reg->umin_value;
 +	reg->smax_value = reg->umax_value;
 +}
 +
+ static int update_stack_depth(struct bpf_verifier_env *env,
+ 			      const struct bpf_func_state *func,
+ 			      int off)
+ {
+ 	u16 stack = env->subprog_stack_depth[func->subprogno], total = 0;
+ 	struct bpf_verifier_state *cur = env->cur_state;
+ 	int i;
+ 
+ 	if (stack >= -off)
+ 		return 0;
+ 
+ 	/* update known max for given subprogram */
+ 	env->subprog_stack_depth[func->subprogno] = -off;
+ 
+ 	/* compute the total for current call chain */
+ 	for (i = 0; i <= cur->curframe; i++) {
+ 		u32 depth = env->subprog_stack_depth[cur->frame[i]->subprogno];
+ 
+ 		/* round up to 32-bytes, since this is granularity
+ 		 * of interpreter stack sizes
+ 		 */
+ 		depth = round_up(depth, 32);
+ 		total += depth;
+ 	}
+ 
+ 	if (total > MAX_BPF_STACK) {
+ 		verbose(env, "combined stack size of %d calls is %d. Too large\n",
+ 			cur->curframe, total);
+ 		return -EACCES;
+ 	}
+ 	return 0;
+ }
+ 
+ static int get_callee_stack_depth(struct bpf_verifier_env *env,
+ 				  const struct bpf_insn *insn, int idx)
+ {
+ 	int start = idx + insn->imm + 1, subprog;
+ 
+ 	subprog = find_subprog(env, start);
+ 	if (subprog < 0) {
+ 		WARN_ONCE(1, "verifier bug. No program starts at insn %d\n",
+ 			  start);
+ 		return -EFAULT;
+ 	}
+ 	subprog++;
+ 	return env->subprog_stack_depth[subprog];
+ }
+ 
  /* check whether memory at (regno + off) is accessible for t = (read | write)
   * if t==write, value_regno is a register which value is stored into memory
   * if t==read, value_regno is a register which will receive the value from memory
@@@ -1302,15 -1678,14 +1704,15 @@@ static int check_stack_boundary(struct 
  	}
  
  	/* Only allow fixed-offset stack reads */
- 	if (!tnum_is_const(regs[regno].var_off)) {
+ 	if (!tnum_is_const(reg->var_off)) {
  		char tn_buf[48];
  
- 		tnum_strn(tn_buf, sizeof(tn_buf), regs[regno].var_off);
+ 		tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
  		verbose(env, "invalid variable stack read R%d var_off=%s\n",
  			regno, tn_buf);
 +		return -EACCES;
  	}
- 	off = regs[regno].off + regs[regno].var_off.value;
+ 	off = reg->off + reg->var_off.value;
  	if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 ||
  	    access_size < 0 || (access_size == 0 && !zero_size_allowed)) {
  		verbose(env, "invalid stack type R%d off=%d access_size=%d\n",
@@@ -2294,9 -2758,12 +2828,11 @@@ static int adjust_scalar_min_max_vals(s
  static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
  				   struct bpf_insn *insn)
  {
- 	struct bpf_reg_state *regs = cur_regs(env), *dst_reg, *src_reg;
+ 	struct bpf_verifier_state *vstate = env->cur_state;
+ 	struct bpf_func_state *state = vstate->frame[vstate->curframe];
+ 	struct bpf_reg_state *regs = state->regs, *dst_reg, *src_reg;
  	struct bpf_reg_state *ptr_reg = NULL, off_reg = {0};
  	u8 opcode = BPF_OP(insn->code);
 -	int rc;
  
  	dst_reg = &regs[insn->dst_reg];
  	src_reg = NULL;

^ permalink raw reply

* [Patch net-next] net_sched: remove the unsafe __skb_array_empty()
From: Cong Wang @ 2017-12-22  0:03 UTC (permalink / raw)
  To: netdev; +Cc: jakub.kicinski, Cong Wang, John Fastabend
In-Reply-To: <20171222000330.29009-1-xiyou.wangcong@gmail.com>

__skb_array_empty() is only safe if array is never resized.
pfifo_fast_dequeue() is called in TX BH context and without
qdisc lock, so even after we disable BH on ->reset() path
we can still race with other CPU's.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_generic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 00ddb5f8f430..9279258ce060 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
 		struct skb_array *q = band2list(priv, band);
 
-		if (__skb_array_empty(q))
-			continue;
-
 		skb = skb_array_consume_bh(q);
 	}
 	if (likely(skb)) {
-- 
2.13.0

^ permalink raw reply related

* [Patch net-next] net_sched: call qdisc_reset() with qdisc lock
From: Cong Wang @ 2017-12-22  0:03 UTC (permalink / raw)
  To: netdev; +Cc: jakub.kicinski, Cong Wang, John Fastabend

qdisc_reset() should always be called with qdisc spinlock
and with BH disabled, otherwise qdisc ->reset() could race
with TX BH.

Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_generic.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10aaa3b615ce..00ddb5f8f430 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
 {
 	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
 
-	if (qdisc)
+	if (qdisc) {
+		spin_lock_bh(qdisc_lock(qdisc));
 		qdisc_reset(qdisc);
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
 }
 
 /**
-- 
2.13.0

^ permalink raw reply related

* Re: [PATCH bpf-next 0/11] bpf: more sock_ops callbacks
From: Daniel Borkmann @ 2017-12-22  0:03 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng
In-Reply-To: <20171220211644.3993770-1-brakmo@fb.com>

On 12/20/2017 10:16 PM, Lawrence Brakmo wrote:
> This patchset adds support for:
> 
> - direct R or R/W access to many tcp_sock fields
> - passing up to 4 arguments to sock_ops BPF functions
> - tcp_sock field bpf_sock_ops_flags for controlling callbacks
> - optionally calling sock_ops BPF program when RTO fires
> - optionally calling sock_ops BPF program when packet is retransmitted
> - optionally calling sock_ops BPF program when TCP state changes
> - access to tclass and sk_txhash
> - new selftest
> 
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> 
> Consists of the following patches:
> [PATCH bpf 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
> [PATCH bpf 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent
> [PATCH bpf 03/11] bpf: Add write access to tcp_sock and sock fields
> [PATCH bpf 04/11] bpf: Support passing args to sock_ops bpf function
> [PATCH bpf 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock
> [PATCH bpf 06/11] bpf: Add sock_ops RTO callback
> [PATCH bpf 07/11] bpf: Add support for reading sk_state and more
> [PATCH bpf 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash
> [PATCH bpf 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
> [PATCH bpf 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
> [PATCH bpf 11/11] bpf: add selftest for tcpbpf

Hmm, looks like only ever [1] and [2] made it into patchwork for some
reason and both under a different series. Something wrong with mailer
config?

Cheers,
Daniel

  [1] https://patchwork.ozlabs.org/patch/851690/
  [2] https://patchwork.ozlabs.org/patch/851689/

  (First two in: https://patchwork.ozlabs.org/project/netdev/list/?submitter=66772&state=*)

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: fix Makefile for passing LLC to the command line
From: Daniel Borkmann @ 2017-12-21 23:55 UTC (permalink / raw)
  To: Jakub Kicinski, netdev, alexei.starovoitov; +Cc: oss-drivers, Quentin Monnet
In-Reply-To: <20171221165250.21138-1-jakub.kicinski@netronome.com>

On 12/21/2017 05:52 PM, Jakub Kicinski wrote:
> From: Quentin Monnet <quentin.monnet@netronome.com>
> 
> Makefile has a LLC variable that is initialised to "llc", but can
> theoretically be overridden from the command line ("make LLC=llc-6.0").
> However, this fails because for LLVM probe check, "llc" is called
> directly. Use the $(LLC) variable instead to fix this.
> 
> Fixes: 22c8852624fc ("bpf: improve selftests and add tests for meta pointer")
> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied to bpf tree, thanks Jakub!

^ permalink raw reply

* [PATCH net-next] net: erspan: remove md NULL check
From: William Tu @ 2017-12-21 23:51 UTC (permalink / raw)
  To: netdev; +Cc: Haishuang Yan

The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and
since we've checked 'tun_dst', 'md' will never be NULL.
The patch removes it at both ipv4 and ipv6 erspan.

Fixes: afb4c97d90e6 ("ip6_gre: fix potential memory leak in ip6erspan_rcv")
Fixes: 50670b6ee9bc ("ip_gre: fix potential memory leak in erspan_rcv")
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: William Tu <u9012063@gmail.com>
---
 net/ipv4/ip_gre.c  | 5 -----
 net/ipv6/ip6_gre.c | 4 ----
 2 files changed, 9 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 90c912307814..47c7de3ca458 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -313,11 +313,6 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 				return PACKET_REJECT;
 
 			md = ip_tunnel_info_opts(&tun_dst->u.tun_info);
-			if (!md) {
-				dst_release((struct dst_entry *)tun_dst);
-				return PACKET_REJECT;
-			}
-
 			memcpy(md, pkt_md, sizeof(*md));
 			md->version = ver;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 8451d00b210b..1aabc8df7cb7 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -550,10 +550,6 @@ static int ip6erspan_rcv(struct sk_buff *skb, int gre_hdr_len,
 
 			info = &tun_dst->u.tun_info;
 			md = ip_tunnel_info_opts(info);
-			if (!md) {
-				dst_release((struct dst_entry *)tun_dst);
-				return PACKET_REJECT;
-			}
 
 			memcpy(md, pkt_md, sizeof(*md));
 			md->version = ver;
-- 
2.7.4

^ permalink raw reply related

* Re: pull-request: bpf-next 2017-12-18
From: Daniel Borkmann @ 2017-12-21 23:48 UTC (permalink / raw)
  To: David Miller, alexei.starovoitov; +Cc: ast, netdev
In-Reply-To: <20171221.112819.2190571738067203400.davem@davemloft.net>

On 12/21/2017 05:28 PM, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Wed, 20 Dec 2017 16:16:44 -0500 (EST)
> 
>> I think I understand how this new stuff works, I'll take a stab at
>> doing the sparc64 JIT bits.
> 
> This patch should do it, please queue up for bpf-next.
> 
> But this is really overkill on sparc64.
> 
> No matter where you relocate the call destination to, the size of the
> program and the code output will be identical except for the call
> instruction PC relative offset field.
> 
> So at some point as a follow-up I should change this code to simply
> scan the insns for the function calls and fixup the offsets, rather
> than do a full set of code generation passes.
> 
> Thanks.
> 
> ====================
> bpf: sparc64: Add JIT support for multi-function programs.
> 
> Modelled strongly upon the arm64 implementation.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
> index a2f1b5e..4ee417f 100644
> --- a/arch/sparc/net/bpf_jit_comp_64.c
> +++ b/arch/sparc/net/bpf_jit_comp_64.c
> @@ -1507,11 +1507,19 @@ static void jit_fill_hole(void *area, unsigned int size)
>  		*ptr++ = 0x91d02005; /* ta 5 */
>  }
>  
> +struct sparc64_jit_data {
> +	struct bpf_binary_header *header;
> +	u8 *image;
> +	struct jit_ctx ctx;
> +};
> +
>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  {
>  	struct bpf_prog *tmp, *orig_prog = prog;
> +	struct sparc64_jit_data *jit_data;
>  	struct bpf_binary_header *header;
>  	bool tmp_blinded = false;
> +	bool extra_pass = false;
>  	struct jit_ctx ctx;
>  	u32 image_size;
>  	u8 *image_ptr;
> @@ -1531,13 +1539,30 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  		prog = tmp;
>  	}
>  
> +	jit_data = prog->aux->jit_data;
> +	if (!jit_data) {
> +		jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL);
> +		if (!jit_data) {
> +			prog = orig_prog;
> +			goto out;
> +		}

Looks good, one thing: If I spot this correctly, isn't here a ...

		prog->aux->jit_data = jit_data;

... missing? Otherwise the context from the initial pass is neither
saved for the extra pass nor freed.

> +	}
> +	if (jit_data->ctx.offset) {
> +		ctx = jit_data->ctx;
> +		image_ptr = jit_data->image;
> +		header = jit_data->header;
> +		extra_pass = true;
> +		image_size = sizeof(u32) * ctx.idx;
> +		goto skip_init_ctx;
> +	}
> +
>  	memset(&ctx, 0, sizeof(ctx));
>  	ctx.prog = prog;
>  
>  	ctx.offset = kcalloc(prog->len, sizeof(unsigned int), GFP_KERNEL);
>  	if (ctx.offset == NULL) {
>  		prog = orig_prog;
> -		goto out;
> +		goto out_off;
>  	}
>  
>  	/* Fake pass to detect features used, and get an accurate assessment
> @@ -1560,7 +1585,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  	}
>  
>  	ctx.image = (u32 *)image_ptr;
> -
> +skip_init_ctx:
>  	for (pass = 1; pass < 3; pass++) {
>  		ctx.idx = 0;
>  
> @@ -1591,14 +1616,24 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
>  
>  	bpf_flush_icache(header, (u8 *)header + (header->pages * PAGE_SIZE));
>  
> -	bpf_jit_binary_lock_ro(header);
> +	if (!prog->is_func || extra_pass) {
> +		bpf_jit_binary_lock_ro(header);
> +	} else {
> +		jit_data->ctx = ctx;
> +		jit_data->image = image_ptr;
> +		jit_data->header = header;
> +	}
>  
>  	prog->bpf_func = (void *)ctx.image;
>  	prog->jited = 1;
>  	prog->jited_len = image_size;
>  
> +	if (!prog->is_func || extra_pass) {
>  out_off:
> -	kfree(ctx.offset);
> +		kfree(ctx.offset);
> +		kfree(jit_data);
> +		prog->aux->jit_data = NULL;
> +	}
>  out:
>  	if (tmp_blinded)
>  		bpf_jit_prog_release_other(prog, prog == orig_prog ?
> 

^ permalink raw reply

* [PATCH net-next v2] enic: add wq clean up budget
From: Govindarajulu Varadarajan @ 2017-12-21 16:12 UTC (permalink / raw)
  To: davem, netdev; +Cc: govindarajulu90, benve, Govindarajulu Varadarajan

In case of tx clean up, we set '-1' as budget. This means clean up until
wq is empty or till (1 << 32) pkts are cleaned. Under heavy load this
will run for long time and cause
"watchdog: BUG: soft lockup - CPU#25 stuck for 21s!" warning.

This patch sets wq clean up budget to 256.

Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
---
v2: resubmit: previous discussion: https://patchwork.ozlabs.org/patch/845011/

 drivers/net/ethernet/cisco/enic/enic.h      | 2 ++
 drivers/net/ethernet/cisco/enic/enic_main.c | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic.h b/drivers/net/ethernet/cisco/enic/enic.h
index 6a9527004cb1..9b218f0e5a4c 100644
--- a/drivers/net/ethernet/cisco/enic/enic.h
+++ b/drivers/net/ethernet/cisco/enic/enic.h
@@ -43,6 +43,8 @@
 #define ENIC_CQ_MAX		(ENIC_WQ_MAX + ENIC_RQ_MAX)
 #define ENIC_INTR_MAX		(ENIC_CQ_MAX + 2)
 
+#define ENIC_WQ_NAPI_BUDGET	256
+
 #define ENIC_AIC_LARGE_PKT_DIFF	3
 
 struct enic_msix_entry {
diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c b/drivers/net/ethernet/cisco/enic/enic_main.c
index d98676e43e03..f202ba72a811 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -1500,7 +1500,7 @@ static int enic_poll(struct napi_struct *napi, int budget)
 	unsigned int cq_wq = enic_cq_wq(enic, 0);
 	unsigned int intr = enic_legacy_io_intr();
 	unsigned int rq_work_to_do = budget;
-	unsigned int wq_work_to_do = -1; /* no limit */
+	unsigned int wq_work_to_do = ENIC_WQ_NAPI_BUDGET;
 	unsigned int  work_done, rq_work_done = 0, wq_work_done;
 	int err;
 
@@ -1598,7 +1598,7 @@ static int enic_poll_msix_wq(struct napi_struct *napi, int budget)
 	struct vnic_wq *wq = &enic->wq[wq_index];
 	unsigned int cq;
 	unsigned int intr;
-	unsigned int wq_work_to_do = -1; /* clean all desc possible */
+	unsigned int wq_work_to_do = ENIC_WQ_NAPI_BUDGET;
 	unsigned int wq_work_done;
 	unsigned int wq_irq;
 
-- 
2.15.1

^ permalink raw reply related

* Re: [PATCH net 1/2] dt-bindings: net: mediatek: add condition to property mediatek,pctl
From: Rob Herring @ 2017-12-21 22:55 UTC (permalink / raw)
  To: sean.wang-NuS5LvNUpcJWk0Htik3J/w
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, mark.rutland-5wv7dgnIgG8,
	matthias.bgg-Re5JQEeQqe8AvxtiuMwx3w, john-Pj+rj9U5foFAfugRpC6u6w,
	nbd-p3rKhJxN3npAfugRpC6u6w, nelson.chang-NuS5LvNUpcJWk0Htik3J/w,
	devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mediatek-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <e366efc29985d3292c8a1afb1389b5eac57c9037.1513762066.git.sean.wang-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org>

On Wed, Dec 20, 2017 at 05:47:05PM +0800, sean.wang-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org wrote:
> From: Sean Wang <sean.wang-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org>
> 
> The property "mediatek,pctl" is only required for SoCs such as MT2701 and
> MT7623, so adding a few words for stating the condition.
> 
> Signed-off-by: Sean Wang <sean.wang-NuS5LvNUpcJWk0Htik3J/w@public.gmane.org>
> ---
>  Documentation/devicetree/bindings/net/mediatek-net.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH ipsec-next] xfrm: update the stats documentation
From: Shannon Nelson @ 2017-12-21 22:26 UTC (permalink / raw)
  To: steffen.klassert; +Cc: netdev

Add a couple of stats that aren't in the documentation file
and rework the top description to be a little more readable.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
 Documentation/networking/xfrm_proc.txt | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt
index d0d8baf..2eae619 100644
--- a/Documentation/networking/xfrm_proc.txt
+++ b/Documentation/networking/xfrm_proc.txt
@@ -5,13 +5,15 @@ Masahide NAKAMURA <nakam@linux-ipv6.org>
 
 Transformation Statistics
 -------------------------
-xfrm_proc is a statistics shown factor dropped by transformation
-for developer.
-It is a counter designed from current transformation source code
-and defined like linux private MIB.
 
-Inbound statistics
-~~~~~~~~~~~~~~~~~~
+The xfrm_proc code is a set of statistics showing numbers of packets
+dropped by the transformation code and why.  These counters are defined
+as part of the linux private MIB.  These counters can be viewed in
+/proc/net/xfrm_stat.
+
+
+Inbound errors
+~~~~~~~~~~~~~~
 XfrmInError:
 	All errors which is not matched others
 XfrmInBufferError:
@@ -46,6 +48,10 @@ XfrmInPolBlock:
 	Policy discards
 XfrmInPolError:
 	Policy error
+XfrmAcquireError:
+	State hasn't been fully acquired before use
+XfrmFwdHdrError:
+	Forward routing of a packet is not allowed
 
 Outbound errors
 ~~~~~~~~~~~~~~~
@@ -72,3 +78,5 @@ XfrmOutPolDead:
 	Policy is dead
 XfrmOutPolError:
 	Policy error
+XfrmOutStateInvalid:
+	State is invalid, perhaps expired
-- 
2.7.4

^ permalink raw reply related

* [PATCH net] rtnetlink: fix struct net reference leak
From: Craig Gallek @ 2017-12-21 22:18 UTC (permalink / raw)
  To: David Miller, Jiri Benc; +Cc: netdev, Nicolas Dichtel, Jason A . Donenfeld

From: Craig Gallek <kraig@google.com>

The below referenced commit extended the RTM_GETLINK interface to
allow querying by netns id.  The netnsid property was previously
defined as a signed integer, but this patch assumes that the user
always passes a positive integer.  syzkaller discovered this problem
by setting a negative netnsid and then calling the get-link path
in a tight loop.  This surprisingly quickly overflows the reference
count on the associated struct net, potentially destroying it.  When the
default namespace is used, the machine crashes in strange and interesting
ways.

Unfortunately, this is not easy to reproduce with just the ip tool
as it enforces unsigned integer parsing despite the interface interpeting
the NETNSID attribute as signed.

I'm not sure why this attribute is signed in the first place, but
the first commit that introduced it (6621dd29eb9b) is in v4.15-rc4,
so I assume it's too late to change.

This patch removes the positive netns id assumption, but adds another
assumption that the netns id 0 is always the 'self' identifying id (for
which an additional struct net reference is not necessary).

Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
CC: Jiri Benc <jbenc@redhat.com>
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Craig Gallek <kraig@google.com>
---
 net/core/rtnetlink.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..3de033b7e4b9 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1451,7 +1451,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	ifm->ifi_flags = dev_get_flags(dev);
 	ifm->ifi_change = change;

-	if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
+	if (tgt_netnsid && nla_put_s32(skb, IFLA_IF_NETNSID, tgt_netnsid))
 		goto nla_put_failure;

 	if (nla_put_string(skb, IFLA_IFNAME, dev->name) ||
@@ -1712,7 +1712,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 	const struct rtnl_link_ops *kind_ops = NULL;
 	unsigned int flags = NLM_F_MULTI;
 	int master_idx = 0;
-	int netnsid = -1;
+	int netnsid = 0;
 	int err;
 	int hdrlen;

@@ -1733,10 +1733,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 			ifla_policy, NULL) >= 0) {
 		if (tb[IFLA_IF_NETNSID]) {
 			netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-			tgt_net = get_target_net(skb, netnsid);
-			if (IS_ERR(tgt_net)) {
-				tgt_net = net;
-				netnsid = -1;
+			if (netnsid) {
+				tgt_net = get_target_net(skb, netnsid);
+				if (IS_ERR(tgt_net)) {
+					tgt_net = net;
+					netnsid = 0;
+				}
 			}
 		}

@@ -1786,7 +1788,7 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 	cb->args[0] = h;
 	cb->seq = net->dev_base_seq;
 	nl_dump_check_consistent(cb, nlmsg_hdr(skb));
-	if (netnsid >= 0)
+	if (netnsid)
 		put_net(tgt_net);

 	return err;
@@ -2873,7 +2875,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	struct nlattr *tb[IFLA_MAX+1];
 	struct net_device *dev = NULL;
 	struct sk_buff *nskb;
-	int netnsid = -1;
+	int netnsid = 0;
 	int err;
 	u32 ext_filter_mask = 0;

@@ -2883,9 +2885,11 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,

 	if (tb[IFLA_IF_NETNSID]) {
 		netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-		tgt_net = get_target_net(skb, netnsid);
-		if (IS_ERR(tgt_net))
-			return PTR_ERR(tgt_net);
+		if (netnsid) {
+			tgt_net = get_target_net(skb, netnsid);
+			if (IS_ERR(tgt_net))
+				return PTR_ERR(tgt_net);
+		}
 	}

 	if (tb[IFLA_IFNAME])
@@ -2923,7 +2927,7 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 	} else
 		err = rtnl_unicast(nskb, net, NETLINK_CB(skb).portid);
 out:
-	if (netnsid >= 0)
+	if (netnsid)
 		put_net(tgt_net);

 	return err;
-- 
2.15.1.620.gb9897f4670-goog

^ permalink raw reply related

* Re: [PATCH net-next] tcp: md5: Handle RCU dereference of md5sig_info
From: Christoph Paasch @ 2017-12-21 22:13 UTC (permalink / raw)
  To: Mat Martineau; +Cc: netdev
In-Reply-To: <20171221182910.4785-2-mathew.j.martineau@linux.intel.com>

On Thu, Dec 21, 2017 at 10:29 AM, Mat Martineau
<mathew.j.martineau@linux.intel.com> wrote:
> Dereference tp->md5sig_info in tcp_v4_destroy_sock() the same way it is
> done in the adjacent call to tcp_clear_md5_list().
>
> Resolves this sparse warning:
>
> net/ipv4/tcp_ipv4.c:1914:17: warning: incorrect type in argument 1 (different address spaces)
> net/ipv4/tcp_ipv4.c:1914:17:    expected struct callback_head *head
> net/ipv4/tcp_ipv4.c:1914:17:    got struct callback_head [noderef] <asn:4>*<noident>
>
> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
> ---
>  net/ipv4/tcp_ipv4.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index dd945b114215..5d203248123e 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1911,7 +1911,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
>         /* Clean up the MD5 key list, if any */
>         if (tp->md5sig_info) {
>                 tcp_clear_md5_list(sk);
> -               kfree_rcu(tp->md5sig_info, rcu);
> +               kfree_rcu(rcu_dereference_protected(tp->md5sig_info, 1), rcu);

Acked-by: Christoph Paasch <cpaasch@apple.com>

^ permalink raw reply

* Re: [patch iproute2 v2] tc: add -bs option for batch mode
From: David Ahern @ 2017-12-21 22:04 UTC (permalink / raw)
  To: Chris Mi, netdev; +Cc: gerlitz.or
In-Reply-To: <20171220092628.6673-1-chrism@mellanox.com>

On 12/20/17 2:26 AM, Chris Mi wrote:
> Currently in tc batch mode, only one command is read from the batch
> file and sent to kernel to process. With this patch, we can accumulate
> several commands before sending to kernel. The batch size is specified
> using option -bs or -batchsize.
> 
> To accumulate the commands in tc, we allocate an array of struct iovec.
> If batchsize is bigger than 1 and we haven't accumulated enough
> commands, rtnl_talk() will return without actually sending the message.
> One exception is that there is no more command in the batch file.
> 
> But please note that kernel still processes the requests one by one.
> To process the requests in parallel in kernel is another effort.
> The time we're saving in this patch is the user mode and kernel mode
> context switch. So this patch works on top of the current kernel.
> 
> Using the following script in kernel, we can generate 1,000,000 rules.
>         tools/testing/selftests/tc-testing/tdc_batch.py
> 
> Without this patch, 'tc -b $file' exection time is:
> 
> real    0m14.916s
> user    0m6.808s
> sys     0m8.046s
> 
> With this patch, 'tc -b $file -bs 10' exection time is:
> 
> real    0m12.286s
> user    0m5.903s
> sys     0m6.312s
> 
> The insertion rate is improved more than 10%.
> 
> Signed-off-by: Chris Mi <chrism@mellanox.com>
> ---
>  include/libnetlink.h |  6 ++++
>  include/utils.h      |  4 +++
>  lib/libnetlink.c     | 25 ++++++++++-----
>  lib/utils.c          | 20 ++++++++++++
>  man/man8/tc.8        |  5 +++
>  tc/m_action.c        | 62 +++++++++++++++++++++++++++---------
>  tc/tc.c              | 24 ++++++++++++--
>  tc/tc_common.h       |  3 ++
>  tc/tc_filter.c       | 88 ++++++++++++++++++++++++++++++++++++----------------
>  9 files changed, 186 insertions(+), 51 deletions(-)
> 
> diff --git a/include/libnetlink.h b/include/libnetlink.h
> index a4d83b9e..775f830b 100644
> --- a/include/libnetlink.h
> +++ b/include/libnetlink.h
> @@ -13,6 +13,8 @@
>  #include <linux/netconf.h>
>  #include <arpa/inet.h>
>  
> +#define MSG_IOV_MAX 256
> +
>  struct rtnl_handle {
>  	int			fd;
>  	struct sockaddr_nl	local;
> @@ -93,6 +95,10 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
>  			void *arg, __u16 nc_flags);
>  #define rtnl_dump_filter(rth, filter, arg) \
>  	rtnl_dump_filter_nc(rth, filter, arg, 0)
> +
> +extern int msg_iov_index;
> +extern int batch_size;
> +
>  int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
>  	      struct nlmsghdr **answer)
>  	__attribute__((warn_unused_result));
> diff --git a/include/utils.h b/include/utils.h
> index d3895d56..113a8c31 100644
> --- a/include/utils.h
> +++ b/include/utils.h
> @@ -235,6 +235,10 @@ void print_nlmsg_timestamp(FILE *fp, const struct nlmsghdr *n);
>  
>  extern int cmdlineno;
>  ssize_t getcmdline(char **line, size_t *len, FILE *in);
> +
> +extern int cmdlinetotal;
> +void setcmdlinetotal(const char *name);
> +
>  int makeargs(char *line, char *argv[], int maxargs);
>  int inet_get_addr(const char *src, __u32 *dst, struct in6_addr *dst6);
>  
> diff --git a/lib/libnetlink.c b/lib/libnetlink.c
> index 00e6ce0c..7ff32d64 100644
> --- a/lib/libnetlink.c
> +++ b/lib/libnetlink.c
> @@ -24,6 +24,7 @@
>  #include <sys/uio.h>
>  
>  #include "libnetlink.h"
> +#include "utils.h"
>  
>  #ifndef SOL_NETLINK
>  #define SOL_NETLINK 270
> @@ -581,6 +582,10 @@ static void rtnl_talk_error(struct nlmsghdr *h, struct nlmsgerr *err,
>  		strerror(-err->error));
>  }
>  
> +static struct iovec msg_iov[MSG_IOV_MAX];
> +int msg_iov_index;
> +int batch_size = 1;
> +
>  static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
>  		       struct nlmsghdr **answer,
>  		       bool show_rtnl_err, nl_ext_ack_fn_t errfn)
> @@ -589,23 +594,29 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
>  	unsigned int seq;
>  	struct nlmsghdr *h;
>  	struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };
> -	struct iovec iov = {
> -		.iov_base = n,
> -		.iov_len = n->nlmsg_len
> -	};
> +	struct iovec *iov = &msg_iov[msg_iov_index];
> +	char *buf;
> +
> +	iov->iov_base = n;
> +	iov->iov_len = n->nlmsg_len;
> +
>  	struct msghdr msg = {
>  		.msg_name = &nladdr,
>  		.msg_namelen = sizeof(nladdr),
> -		.msg_iov = &iov,
> -		.msg_iovlen = 1,
> +		.msg_iov = msg_iov,
> +		.msg_iovlen = ++msg_iov_index,
>  	};
> -	char *buf;
>  
>  	n->nlmsg_seq = seq = ++rtnl->seq;
>  
>  	if (answer == NULL)
>  		n->nlmsg_flags |= NLM_F_ACK;
>  
> +	msg_iov_index %= batch_size;
> +	if (msg_iov_index != 0 && batch_size > 1 && cmdlineno != cmdlinetotal &&
> +	    (n->nlmsg_type == RTM_NEWTFILTER || n->nlmsg_type == RTM_NEWACTION))
> +		return 3;
> +
>  	status = sendmsg(rtnl->fd, &msg, 0);
>  	if (status < 0) {
>  		perror("Cannot talk to rtnetlink");

I have a fair idea of why you did it this way, but relying on global
variables and magic return codes is not a great solution.

Why not refactor rtnl_talk -- move the sendmsg piece to a helper that
takes an iov or a msg as input arg. Then have the tc code piece together
the iov and call rtnl_talk. If batch_size == 1 it calls rtnl_talk; > 1
it puts together the iov and hands it off.

^ permalink raw reply

* Re: RCU callback crashes
From: Jakub Kicinski @ 2017-12-21 21:45 UTC (permalink / raw)
  To: Cong Wang; +Cc: John Fastabend, Jiri Pirko, netdev@vger.kernel.org
In-Reply-To: <CAM_iQpV8+NknPgGbNDzF+=S8Px4rxO2=PMwV5BLDQJhX5CGmDQ@mail.gmail.com>

On Thu, 21 Dec 2017 13:31:01 -0800, Cong Wang wrote:
> >    629          if (likely(skb)) {
> >    630                  qdisc_qstats_cpu_backlog_dec(qdisc, skb);
> >    631                  qdisc_bstats_cpu_update(qdisc, skb);
> >    632                  qdisc_qstats_cpu_qlen_dec(qdisc);
> >    633          }
> >    634
> >    635          return skb;
> >    636  }  
> 
> Hi, Jakub
> 
> Could you test the attached patch? It looks like the __skb_array_empty()
> use is unsafe.

I don't have a reproducer, unfortunately, I haven't seen the splat
since :(  FWIW the kernel config was with all debug/checks disabled,
only KASAN on.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox