* [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
@ 2020-05-06 12:54 Jakub Sitnicki
2020-05-06 13:16 ` Lorenz Bauer
` (11 more replies)
0 siblings, 12 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2020-05-06 12:54 UTC (permalink / raw)
To: dccp
Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
called BPF_SK_LOOKUP. The new program kind is to be invoked by the
transport layer when looking up a socket for a received packet.
When called, the SK_LOOKUP program can select a socket that will receive the
packet. This serves as a mechanism to overcome the limits of what the bind()
API can express. Two use cases driving this work are:
(1) steer packets destined to an IP range, fixed port to a socket
192.0.2.0/24, port 80 -> NGINX socket
(2) steer packets destined to an IP address, any port to a socket
198.51.100.1, any port -> L7 proxy socket
In its run-time context, the program receives information about the packet
that triggered the socket lookup: the IP version, the L4 protocol identifier,
and the address 4-tuple. The context can later be extended to include the
ingress interface identifier.
To select a socket, the BPF program fetches it from a map holding socket
references, such as SOCKMAP or SOCKHASH, and calls the bpf_sk_assign(ctx, sk, ...)
helper to record the selection. The transport layer then uses the selected
socket as the result of the socket lookup.
This patch only enables the user to attach an SK_LOOKUP program to a
network namespace. Subsequent patches hook it up to run on the local delivery
path in the ipv4 and ipv6 stacks.
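[ For illustration, a BPF program implementing use-case (1) against this
interface could look roughly like the sketch below. This is a hedged
example, not part of the patch: the map name is made up, and the
"sk_lookup" section name and the SK_PASS verdict semantics are settled
only by later patches in the series and libbpf conventions. ]

```c
/* Hypothetical sketch: steer TCP packets to 192.0.2.0/24 port 80
 * into one NGINX socket held in a SOCKMAP. Map and section names
 * are illustrative only.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define AF_INET		2
#define IPPROTO_TCP	6

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} nginx_sock SEC(".maps");	/* hypothetical map holding the NGINX socket */

SEC("sk_lookup")
int steer_http(struct bpf_sk_lookup *ctx)
{
	__u32 key = 0;
	struct bpf_sock *sk;

	if (ctx->family != AF_INET || ctx->protocol != IPPROTO_TCP)
		return SK_PASS;
	/* Addresses are in network byte order; mask out the /24 host part. */
	if ((ctx->dst_ip4 & bpf_htonl(0xffffff00)) != bpf_htonl(0xc0000200))
		return SK_PASS;
	if (ctx->dst_port != 80)	/* dst_port is in host byte order */
		return SK_PASS;

	sk = bpf_map_lookup_elem(&nginx_sock, &key);
	if (!sk)
		return SK_PASS;
	bpf_sk_assign(ctx, sk, 0);	/* record the selection */
	bpf_sk_release(sk);
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```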
Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/linux/bpf_types.h | 2 +
include/linux/filter.h | 23 ++++
include/net/net_namespace.h | 1 +
include/uapi/linux/bpf.h | 53 ++++++++
kernel/bpf/syscall.c | 9 ++
net/core/filter.c | 247 ++++++++++++++++++++++++++++++++++++
scripts/bpf_helpers_doc.py | 9 +-
7 files changed, 343 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 8345cdf553b8..08c2aef674ac 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
#ifdef CONFIG_INET
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
struct sk_reuseport_md, struct sk_reuseport_kern)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
+ struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
#endif
#if defined(CONFIG_BPF_JIT)
BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index af37318bb1c5..33254e840c8d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1280,4 +1280,27 @@ struct bpf_sockopt_kern {
s32 retval;
};
+struct bpf_sk_lookup_kern {
+ unsigned short family;
+ u16 protocol;
+ union {
+ struct {
+ __be32 saddr;
+ __be32 daddr;
+ } v4;
+ struct {
+ struct in6_addr saddr;
+ struct in6_addr daddr;
+ } v6;
+ };
+ __be16 sport;
+ u16 dport;
+ struct sock *selected_sk;
+};
+
+int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int sk_lookup_prog_detach(const union bpf_attr *attr);
+int sk_lookup_prog_query(const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index ab96fb59131c..70bf4888c94d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -163,6 +163,7 @@ struct net {
struct net_generic __rcu *gen;
struct bpf_prog __rcu *flow_dissector_prog;
+ struct bpf_prog __rcu *sk_lookup_prog;
/* Note : following structs are cache line aligned */
#ifdef CONFIG_XFRM
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b3643e27e264..e4c61b63d4bc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -187,6 +187,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_STRUCT_OPS,
BPF_PROG_TYPE_EXT,
BPF_PROG_TYPE_LSM,
+ BPF_PROG_TYPE_SK_LOOKUP,
};
enum bpf_attach_type {
@@ -218,6 +219,7 @@ enum bpf_attach_type {
BPF_TRACE_FEXIT,
BPF_MODIFY_RETURN,
BPF_LSM_MAC,
+ BPF_SK_LOOKUP,
__MAX_BPF_ATTACH_TYPE
};
@@ -3041,6 +3043,10 @@ union bpf_attr {
*
* int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
* Description
+ * Helper is overloaded depending on BPF program type. This
+ * description applies to **BPF_PROG_TYPE_SCHED_CLS** and
+ * **BPF_PROG_TYPE_SCHED_ACT** programs.
+ *
* Assign the *sk* to the *skb*. When combined with appropriate
* routing configuration to receive the packet towards the socket,
* will cause *skb* to be delivered to the specified socket.
@@ -3061,6 +3067,39 @@ union bpf_attr {
* call from outside of TC ingress.
* * **-ESOCKTNOSUPPORT** Socket type not supported (reuseport).
*
+ * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
+ * Description
+ * Helper is overloaded depending on BPF program type. This
+ * description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
+ *
+ * Select the *sk* as a result of a socket lookup.
+ *
+ * For the operation to succeed, the passed socket must be
+ * compatible with the packet description provided by the *ctx*
+ * object.
+ *
+ * The L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an
+ * exact match, while the IP family (*AF_INET* or *AF_INET6*)
+ * must be compatible; that is, IPv6 sockets that are not
+ * v6-only can be selected for IPv4 packets.
+ *
+ * Only full sockets can be selected. However, there is no need
+ * to call bpf_sk_fullsock() before passing a socket as an
+ * argument to this helper.
+ *
+ * The *flags* argument must be zero.
+ * Return
+ * 0 on success, or a negative errno in case of failure.
+ *
+ * **-EAFNOSUPPORT** if socket family (*sk->family*) is not
+ * compatible with packet family (*ctx->family*).
+ *
+ * **-EINVAL** if unsupported flags were specified.
+ *
+ * **-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
+ * match packet protocol (*ctx->protocol*).
+ *
+ * **-ESOCKTNOSUPPORT** if socket is not a full socket.
+ *
* u64 bpf_ktime_get_boot_ns(void)
* Description
* Return the time elapsed since system boot, in nanoseconds.
@@ -4012,4 +4051,18 @@ struct bpf_pidns_info {
__u32 pid;
__u32 tgid;
};
+
+/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
+struct bpf_sk_lookup {
+ __u32 family; /* AF_INET, AF_INET6 */
+ __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
+ /* IP addresses allow 1-, 2-, and 4-byte access */
+ __u32 src_ip4;
+ __u32 src_ip6[4];
+ __u32 src_port; /* network byte order */
+ __u32 dst_ip4;
+ __u32 dst_ip6[4];
+ __u32 dst_port; /* host byte order */
+};
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index bb1ab7da6103..26d643c171fd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2729,6 +2729,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
case BPF_CGROUP_GETSOCKOPT:
case BPF_CGROUP_SETSOCKOPT:
return BPF_PROG_TYPE_CGROUP_SOCKOPT;
+ case BPF_SK_LOOKUP:
+ return BPF_PROG_TYPE_SK_LOOKUP;
default:
return BPF_PROG_TYPE_UNSPEC;
}
@@ -2778,6 +2780,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_PROG_TYPE_FLOW_DISSECTOR:
ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
break;
+ case BPF_PROG_TYPE_SK_LOOKUP:
+ ret = sk_lookup_prog_attach(attr, prog);
+ break;
case BPF_PROG_TYPE_CGROUP_DEVICE:
case BPF_PROG_TYPE_CGROUP_SKB:
case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -2818,6 +2823,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
return lirc_prog_detach(attr);
case BPF_PROG_TYPE_FLOW_DISSECTOR:
return skb_flow_dissector_bpf_prog_detach(attr);
+ case BPF_PROG_TYPE_SK_LOOKUP:
+ return sk_lookup_prog_detach(attr);
case BPF_PROG_TYPE_CGROUP_DEVICE:
case BPF_PROG_TYPE_CGROUP_SKB:
case BPF_PROG_TYPE_CGROUP_SOCK:
@@ -2867,6 +2874,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
return lirc_prog_query(attr, uattr);
case BPF_FLOW_DISSECTOR:
return skb_flow_dissector_prog_query(attr, uattr);
+ case BPF_SK_LOOKUP:
+ return sk_lookup_prog_query(attr, uattr);
default:
return -EINVAL;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index bc25bb1085b1..a00bdc70041c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9054,6 +9054,253 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
const struct bpf_prog_ops sk_reuseport_prog_ops = {
};
+
+static DEFINE_MUTEX(sk_lookup_prog_mutex);
+
+int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+ struct net *net = current->nsproxy->net_ns;
+ int ret;
+
+ if (unlikely(attr->attach_flags))
+ return -EINVAL;
+
+ mutex_lock(&sk_lookup_prog_mutex);
+ ret = bpf_prog_attach_one(&net->sk_lookup_prog,
+ &sk_lookup_prog_mutex, prog,
+ attr->attach_flags);
+ mutex_unlock(&sk_lookup_prog_mutex);
+
+ return ret;
+}
+
+int sk_lookup_prog_detach(const union bpf_attr *attr)
+{
+ struct net *net = current->nsproxy->net_ns;
+ int ret;
+
+ if (unlikely(attr->attach_flags))
+ return -EINVAL;
+
+ mutex_lock(&sk_lookup_prog_mutex);
+ ret = bpf_prog_detach_one(&net->sk_lookup_prog,
+ &sk_lookup_prog_mutex);
+ mutex_unlock(&sk_lookup_prog_mutex);
+
+ return ret;
+}
+
+int sk_lookup_prog_query(const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ struct net *net;
+ int ret;
+
+ net = get_net_ns_by_fd(attr->query.target_fd);
+ if (IS_ERR(net))
+ return PTR_ERR(net);
+
+ ret = bpf_prog_query_one(&net->sk_lookup_prog, attr, uattr);
+
+ put_net(net);
+ return ret;
+}
+
+BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
+ struct sock *, sk, u64, flags)
+{
+ if (unlikely(flags != 0))
+ return -EINVAL;
+ if (unlikely(!sk_fullsock(sk)))
+ return -ESOCKTNOSUPPORT;
+
+ /* Check if socket is suitable for packet L3/L4 protocol */
+ if (sk->sk_protocol != ctx->protocol)
+ return -EPROTOTYPE;
+ if (sk->sk_family != ctx->family &&
+ (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
+ return -EAFNOSUPPORT;
+
+ /* Select socket as lookup result */
+ ctx->selected_sk = sk;
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_sk_lookup_assign_proto = {
+ .func = bpf_sk_lookup_assign,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_SOCK_COMMON,
+ .arg3_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+sk_lookup_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_sk_assign:
+ return &bpf_sk_lookup_assign_proto;
+ case BPF_FUNC_sk_release:
+ return &bpf_sk_release_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+static bool sk_lookup_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ const int size_default = sizeof(__u32);
+
+ if (off < 0 || off >= sizeof(struct bpf_sk_lookup))
+ return false;
+ if (off % size != 0)
+ return false;
+ if (type != BPF_READ)
+ return false;
+
+ switch (off) {
+ case bpf_ctx_range(struct bpf_sk_lookup, src_ip4):
+ case bpf_ctx_range(struct bpf_sk_lookup, dst_ip4):
+ case bpf_ctx_range_till(struct bpf_sk_lookup,
+ src_ip6[0], src_ip6[3]):
+ case bpf_ctx_range_till(struct bpf_sk_lookup,
+ dst_ip6[0], dst_ip6[3]):
+ if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+ return false;
+ bpf_ctx_record_field_size(info, size_default);
+ break;
+
+ case bpf_ctx_range(struct bpf_sk_lookup, family):
+ case bpf_ctx_range(struct bpf_sk_lookup, protocol):
+ case bpf_ctx_range(struct bpf_sk_lookup, src_port):
+ case bpf_ctx_range(struct bpf_sk_lookup, dst_port):
+ if (size != size_default)
+ return false;
+ break;
+
+ default:
+ return false;
+ }
+
+ return true;
+}
+
+#define CHECK_FIELD_SIZE(BPF_TYPE, BPF_FIELD, KERN_TYPE, KERN_FIELD) \
+ BUILD_BUG_ON(sizeof_field(BPF_TYPE, BPF_FIELD) < \
+ sizeof_field(KERN_TYPE, KERN_FIELD))
+
+#define LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, OFF) \
+ BPF_LDX_MEM(SIZE, si->dst_reg, si->src_reg, \
+ bpf_target_off(TYPE, FIELD, \
+ sizeof_field(TYPE, FIELD), \
+ target_size) + (OFF))
+
+#define LOAD_FIELD_SIZE(TYPE, FIELD, SIZE) \
+ LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, 0)
+
+#define LOAD_FIELD(TYPE, FIELD) \
+ LOAD_FIELD_SIZE(TYPE, FIELD, BPF_FIELD_SIZEOF(TYPE, FIELD))
+
+static u32 sk_lookup_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog,
+ u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+ int off;
+
+ switch (si->off) {
+ case offsetof(struct bpf_sk_lookup, family):
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, family,
+ struct bpf_sk_lookup_kern, family);
+ *insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, family);
+ break;
+
+ case offsetof(struct bpf_sk_lookup, protocol):
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, protocol,
+ struct bpf_sk_lookup_kern, protocol);
+ *insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, protocol);
+ break;
+
+ case offsetof(struct bpf_sk_lookup, src_ip4):
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, src_ip4,
+ struct bpf_sk_lookup_kern, v4.saddr);
+ *insn++ = LOAD_FIELD_SIZE(struct bpf_sk_lookup_kern, v4.saddr,
+ BPF_SIZE(si->code));
+ break;
+
+ case offsetof(struct bpf_sk_lookup, dst_ip4):
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, dst_ip4,
+ struct bpf_sk_lookup_kern, v4.daddr);
+ *insn++ = LOAD_FIELD_SIZE(struct bpf_sk_lookup_kern, v4.daddr,
+ BPF_SIZE(si->code));
+
+ break;
+
+ case bpf_ctx_range_till(struct bpf_sk_lookup,
+ src_ip6[0], src_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, src_ip6[0],
+ struct bpf_sk_lookup_kern,
+ v6.saddr.s6_addr32[0]);
+ off = si->off;
+ off -= offsetof(struct bpf_sk_lookup, src_ip6[0]);
+ *insn++ = LOAD_FIELD_SIZE_OFF(struct bpf_sk_lookup_kern,
+ v6.saddr.s6_addr32[0],
+ BPF_SIZE(si->code), off);
+#else
+ (void)off;
+ *insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+ break;
+
+ case bpf_ctx_range_till(struct bpf_sk_lookup,
+ dst_ip6[0], dst_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, dst_ip6[0],
+ struct bpf_sk_lookup_kern,
+ v6.daddr.s6_addr32[0]);
+ off = si->off;
+ off -= offsetof(struct bpf_sk_lookup, dst_ip6[0]);
+ *insn++ = LOAD_FIELD_SIZE_OFF(struct bpf_sk_lookup_kern,
+ v6.daddr.s6_addr32[0],
+ BPF_SIZE(si->code), off);
+#else
+ (void)off;
+ *insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+ break;
+
+ case offsetof(struct bpf_sk_lookup, src_port):
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, src_port,
+ struct bpf_sk_lookup_kern, sport);
+ *insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, sport);
+ break;
+
+ case offsetof(struct bpf_sk_lookup, dst_port):
+ CHECK_FIELD_SIZE(struct bpf_sk_lookup, dst_port,
+ struct bpf_sk_lookup_kern, dport);
+ *insn++ = LOAD_FIELD(struct bpf_sk_lookup_kern, dport);
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
+const struct bpf_prog_ops sk_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops sk_lookup_verifier_ops = {
+ .get_func_proto = sk_lookup_func_proto,
+ .is_valid_access = sk_lookup_is_valid_access,
+ .convert_ctx_access = sk_lookup_convert_ctx_access,
+};
+
#endif /* CONFIG_INET */
DEFINE_BPF_DISPATCHER(xdp)
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index f43d193aff3a..70b1b033c721 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -398,6 +398,7 @@ class PrinterHelpers(Printer):
type_fwds = [
'struct bpf_fib_lookup',
+ 'struct bpf_sk_lookup',
'struct bpf_perf_event_data',
'struct bpf_perf_event_value',
'struct bpf_pidns_info',
@@ -437,6 +438,7 @@ class PrinterHelpers(Printer):
'struct bpf_perf_event_data',
'struct bpf_perf_event_value',
'struct bpf_pidns_info',
+ 'struct bpf_sk_lookup',
'struct bpf_sock',
'struct bpf_sock_addr',
'struct bpf_sock_ops',
@@ -467,6 +469,11 @@ class PrinterHelpers(Printer):
'struct sk_msg_buff': 'struct sk_msg_md',
'struct xdp_buff': 'struct xdp_md',
}
+ # Helpers overloaded for different context types.
+ overloaded_helpers = [
+ 'bpf_get_socket_cookie',
+ 'bpf_sk_assign',
+ ]
def print_header(self):
header = '''\
@@ -523,7 +530,7 @@ class PrinterHelpers(Printer):
for i, a in enumerate(proto['args']):
t = a['type']
n = a['name']
- if proto['name'] == 'bpf_get_socket_cookie' and i == 0:
+ if proto['name'] in self.overloaded_helpers and i == 0:
t = 'void'
n = 'ctx'
one_arg = '{}{}'.format(comma, self.map_type(t))
--
2.25.3
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
2020-05-06 12:54 [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
@ 2020-05-06 13:16 ` Lorenz Bauer
2020-05-06 13:53 ` Jakub Sitnicki
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Lorenz Bauer @ 2020-05-06 13:16 UTC (permalink / raw)
To: dccp
On Wed, 6 May 2020 at 13:55, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
> called BPF_SK_LOOKUP. The new program kind is to be invoked by the
> transport layer when looking up a socket for a received packet.
>
> When called, SK_LOOKUP program can select a socket that will receive the
> packet. This serves as a mechanism to overcome the limits of what bind()
> API allows to express. Two use-cases driving this work are:
>
> (1) steer packets destined to an IP range, fixed port to a socket
>
> 192.0.2.0/24, port 80 -> NGINX socket
>
> (2) steer packets destined to an IP address, any port to a socket
>
> 198.51.100.1, any port -> L7 proxy socket
>
> In its run-time context, program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
>
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> include/linux/bpf_types.h | 2 +
> include/linux/filter.h | 23 ++++
> include/net/net_namespace.h | 1 +
> include/uapi/linux/bpf.h | 53 ++++++++
> kernel/bpf/syscall.c | 9 ++
> net/core/filter.c | 247 ++++++++++++++++++++++++++++++++++++
> scripts/bpf_helpers_doc.py | 9 +-
> 7 files changed, 343 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 8345cdf553b8..08c2aef674ac 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
> #ifdef CONFIG_INET
> BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
> struct sk_reuseport_md, struct sk_reuseport_kern)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
> + struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
> #endif
> #if defined(CONFIG_BPF_JIT)
> BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index af37318bb1c5..33254e840c8d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1280,4 +1280,27 @@ struct bpf_sockopt_kern {
> s32 retval;
> };
>
> +struct bpf_sk_lookup_kern {
> + unsigned short family;
> + u16 protocol;
> + union {
> + struct {
> + __be32 saddr;
> + __be32 daddr;
> + } v4;
> + struct {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + } v6;
> + };
> + __be16 sport;
> + u16 dport;
> + struct sock *selected_sk;
> +};
> +
> +int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int sk_lookup_prog_detach(const union bpf_attr *attr);
> +int sk_lookup_prog_query(const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
> +
> #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index ab96fb59131c..70bf4888c94d 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -163,6 +163,7 @@ struct net {
> struct net_generic __rcu *gen;
>
> struct bpf_prog __rcu *flow_dissector_prog;
> + struct bpf_prog __rcu *sk_lookup_prog;
>
> /* Note : following structs are cache line aligned */
> #ifdef CONFIG_XFRM
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b3643e27e264..e4c61b63d4bc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -187,6 +187,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_STRUCT_OPS,
> BPF_PROG_TYPE_EXT,
> BPF_PROG_TYPE_LSM,
> + BPF_PROG_TYPE_SK_LOOKUP,
> };
>
> enum bpf_attach_type {
> @@ -218,6 +219,7 @@ enum bpf_attach_type {
> BPF_TRACE_FEXIT,
> BPF_MODIFY_RETURN,
> BPF_LSM_MAC,
> + BPF_SK_LOOKUP,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -3041,6 +3043,10 @@ union bpf_attr {
> *
> * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> * Description
> + * Helper is overloaded depending on BPF program type. This
> + * description applies to **BPF_PROG_TYPE_SCHED_CLS** and
> + * **BPF_PROG_TYPE_SCHED_ACT** programs.
> + *
> * Assign the *sk* to the *skb*. When combined with appropriate
> * routing configuration to receive the packet towards the socket,
> * will cause *skb* to be delivered to the specified socket.
> @@ -3061,6 +3067,39 @@ union bpf_attr {
> * call from outside of TC ingress.
> * * **-ESOCKTNOSUPPORT** Socket type not supported (reuseport).
> *
> + * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
> + * Description
> + * Helper is overloaded depending on BPF program type. This
> + * description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
> + *
> + * Select the *sk* as a result of a socket lookup.
> + *
> + * For the operation to succeed passed socket must be compatible
> + * with the packet description provided by the *ctx* object.
> + *
> + * L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an exact
> + * match. While IP family (*AF_INET* or *AF_INET6*) must be
> + * compatible, that is IPv6 sockets that are not v6-only can be
> + * selected for IPv4 packets.
> + *
> + * Only full sockets can be selected. However, there is no need to
> + * call bpf_fullsock() before passing a socket as an argument to
> + * this helper.
> + *
> + * The *flags* argument must be zero.
> + * Return
> + * 0 on success, or a negative errno in case of failure.
> + *
> + * **-EAFNOSUPPORT** if socket family (*sk->family*) is not
> + * compatible with packet family (*ctx->family*).
> + *
> + * **-EINVAL** if unsupported flags were specified.
> + *
> + * **-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
> + * match packet protocol (*ctx->protocol*).
> + *
> + * **-ESOCKTNOSUPPORT** if socket is not a full socket.
> + *
> * u64 bpf_ktime_get_boot_ns(void)
> * Description
> * Return the time elapsed since system boot, in nanoseconds.
> @@ -4012,4 +4051,18 @@ struct bpf_pidns_info {
> __u32 pid;
> __u32 tgid;
> };
> +
> +/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
> +struct bpf_sk_lookup {
> + __u32 family; /* AF_INET, AF_INET6 */
> + __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
> + /* IP addresses allows 1, 2, and 4 bytes access */
> + __u32 src_ip4;
> + __u32 src_ip6[4];
> + __u32 src_port; /* network byte order */
> + __u32 dst_ip4;
> + __u32 dst_ip6[4];
> + __u32 dst_port; /* host byte order */
Jakub and I have discussed this off-list, but we couldn't come to an
agreement and decided to invite your opinion.
I think that dst_port should be in network byte order, since it's one
less exception to the rule to think about when writing BPF programs.
Jakub's argument is that this follows the __sk_buff->local_port
precedent, which is also in host byte order.
--
Lorenz Bauer | Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK
www.cloudflare.com
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
2020-05-06 12:54 [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
2020-05-06 13:16 ` Lorenz Bauer
@ 2020-05-06 13:53 ` Jakub Sitnicki
2020-05-07 20:55 ` Martin KaFai Lau
` (9 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2020-05-06 13:53 UTC (permalink / raw)
To: dccp
On Wed, May 06, 2020 at 03:16 PM CEST, Lorenz Bauer wrote:
> On Wed, 6 May 2020 at 13:55, Jakub Sitnicki <jakub@cloudflare.com> wrote:
[...]
>> @@ -4012,4 +4051,18 @@ struct bpf_pidns_info {
>> __u32 pid;
>> __u32 tgid;
>> };
>> +
>> +/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
>> +struct bpf_sk_lookup {
>> + __u32 family; /* AF_INET, AF_INET6 */
>> + __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
>> + /* IP addresses allows 1, 2, and 4 bytes access */
>> + __u32 src_ip4;
>> + __u32 src_ip6[4];
>> + __u32 src_port; /* network byte order */
>> + __u32 dst_ip4;
>> + __u32 dst_ip6[4];
>> + __u32 dst_port; /* host byte order */
>
> Jakub and I have discussed this off-list, but we couldn't come to an
> agreement and decided to invite
> your opinion.
>
> I think that dst_port should be in network byte order, since it's one
> less exception to the
> rule to think about when writing BPF programs.
>
> Jakub's argument is that this follows __sk_buff->local_port precedent,
> which is also in host
> byte order.
Yes, it would be great to hear if there is a preference here.
Small correction: the proposed sk_lookup program doesn't have access to
__sk_buff, so perhaps that case matters less.
bpf_sk_lookup->dst_port, the packet destination port, is in host byte
order so that it can be compared against bpf_sock->src_port, the socket
local port, without conversion.
But I also see how it can be a surprise for a BPF user that one field
has a different byte order.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
2020-05-06 12:54 [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
2020-05-06 13:16 ` Lorenz Bauer
2020-05-06 13:53 ` Jakub Sitnicki
@ 2020-05-07 20:55 ` Martin KaFai Lau
2020-05-08 7:06 ` Martin KaFai Lau
` (8 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2020-05-07 20:55 UTC (permalink / raw)
To: dccp
On Wed, May 06, 2020 at 03:53:35PM +0200, Jakub Sitnicki wrote:
> On Wed, May 06, 2020 at 03:16 PM CEST, Lorenz Bauer wrote:
> > On Wed, 6 May 2020 at 13:55, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> [...]
>
> >> @@ -4012,4 +4051,18 @@ struct bpf_pidns_info {
> >> __u32 pid;
> >> __u32 tgid;
> >> };
> >> +
> >> +/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
> >> +struct bpf_sk_lookup {
> >> + __u32 family; /* AF_INET, AF_INET6 */
> >> + __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
> >> + /* IP addresses allows 1, 2, and 4 bytes access */
> >> + __u32 src_ip4;
> >> + __u32 src_ip6[4];
> >> + __u32 src_port; /* network byte order */
> >> + __u32 dst_ip4;
> >> + __u32 dst_ip6[4];
> >> + __u32 dst_port; /* host byte order */
> >
> > Jakub and I have discussed this off-list, but we couldn't come to an
> > agreement and decided to invite
> > your opinion.
> >
> > I think that dst_port should be in network byte order, since it's one
> > less exception to the
> > rule to think about when writing BPF programs.
> >
> > Jakub's argument is that this follows __sk_buff->local_port precedent,
> > which is also in host
> > byte order.
>
> Yes, would be great to hear if there is a preference here.
>
> Small correction, proposed sk_lookup program doesn't have access to
> __sk_buff, so perhaps that case matters less.
>
> bpf_sk_lookup->dst_port, the packet destination port, is in host byte
> order so that it can be compared against bpf_sock->src_port, socket
> local port, without conversion.
>
> But I also see how it can be a surprise for a BPF user that one field has
> a different byte order.
I would also prefer that port and addr were all in the same byte order.
However, that is not the case for the other prog_type contexts.
People have stumbled on it from time to time. Maybe something
can be done in libbpf to hide this difference.
I think uapi consistency with the other existing ctx is more important here
(i.e. keep the "local" port in host order). Otherwise, the user will
be slapped left and right when writing bpf_progs in different prog_types.
Armed with the knowledge of skc_num: the "local" port is
in host byte order in the current existing prog ctx. It is
unfortunate that the "dst"_port in this patch is the "local" port.
The "local" port in "struct bpf_sock" is actually the "src"_port. :/
Would "local"/"remote" be clearer than "src"/"dst" in this patch?
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
2020-05-06 12:54 [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
` (2 preceding siblings ...)
2020-05-07 20:55 ` Martin KaFai Lau
@ 2020-05-08 7:06 ` Martin KaFai Lau
2020-05-08 8:54 ` Jakub Sitnicki
` (7 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2020-05-08 7:06 UTC (permalink / raw)
To: dccp
On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
> Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
> called BPF_SK_LOOKUP. The new program kind is to be invoked by the
> transport layer when looking up a socket for a received packet.
>
> When called, SK_LOOKUP program can select a socket that will receive the
> packet. This serves as a mechanism to overcome the limits of what bind()
> API allows to express. Two use-cases driving this work are:
>
> (1) steer packets destined to an IP range, fixed port to a socket
>
> 192.0.2.0/24, port 80 -> NGINX socket
>
> (2) steer packets destined to an IP address, any port to a socket
>
> 198.51.100.1, any port -> L7 proxy socket
>
> In its run-time context, program receives information about the packet that
> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> address 4-tuple. Context can be further extended to include ingress
> interface identifier.
>
> To select a socket BPF program fetches it from a map holding socket
> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> helper to record the selection. Transport layer then uses the selected
> socket as a result of socket lookup.
>
> This patch only enables the user to attach an SK_LOOKUP program to a
> network namespace. Subsequent patches hook it up to run on local delivery
> path in ipv4 and ipv6 stacks.
>
> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> include/linux/bpf_types.h | 2 +
> include/linux/filter.h | 23 ++++
> include/net/net_namespace.h | 1 +
> include/uapi/linux/bpf.h | 53 ++++++++
> kernel/bpf/syscall.c | 9 ++
> net/core/filter.c | 247 ++++++++++++++++++++++++++++++++++++
> scripts/bpf_helpers_doc.py | 9 +-
> 7 files changed, 343 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 8345cdf553b8..08c2aef674ac 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -64,6 +64,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2,
> #ifdef CONFIG_INET
> BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport,
> struct sk_reuseport_md, struct sk_reuseport_kern)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_LOOKUP, sk_lookup,
> + struct bpf_sk_lookup, struct bpf_sk_lookup_kern)
> #endif
> #if defined(CONFIG_BPF_JIT)
> BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops,
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index af37318bb1c5..33254e840c8d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1280,4 +1280,27 @@ struct bpf_sockopt_kern {
> s32 retval;
> };
>
> +struct bpf_sk_lookup_kern {
> + unsigned short family;
> + u16 protocol;
> + union {
> + struct {
> + __be32 saddr;
> + __be32 daddr;
> + } v4;
> + struct {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + } v6;
> + };
> + __be16 sport;
> + u16 dport;
> + struct sock *selected_sk;
> +};
> +
> +int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog);
> +int sk_lookup_prog_detach(const union bpf_attr *attr);
> +int sk_lookup_prog_query(const union bpf_attr *attr,
> + union bpf_attr __user *uattr);
> +
> #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index ab96fb59131c..70bf4888c94d 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -163,6 +163,7 @@ struct net {
> struct net_generic __rcu *gen;
>
> struct bpf_prog __rcu *flow_dissector_prog;
> + struct bpf_prog __rcu *sk_lookup_prog;
>
> /* Note : following structs are cache line aligned */
> #ifdef CONFIG_XFRM
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index b3643e27e264..e4c61b63d4bc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -187,6 +187,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_STRUCT_OPS,
> BPF_PROG_TYPE_EXT,
> BPF_PROG_TYPE_LSM,
> + BPF_PROG_TYPE_SK_LOOKUP,
> };
>
> enum bpf_attach_type {
> @@ -218,6 +219,7 @@ enum bpf_attach_type {
> BPF_TRACE_FEXIT,
> BPF_MODIFY_RETURN,
> BPF_LSM_MAC,
> + BPF_SK_LOOKUP,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -3041,6 +3043,10 @@ union bpf_attr {
> *
> * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags)
> * Description
> + * Helper is overloaded depending on BPF program type. This
> + * description applies to **BPF_PROG_TYPE_SCHED_CLS** and
> + * **BPF_PROG_TYPE_SCHED_ACT** programs.
> + *
> * Assign the *sk* to the *skb*. When combined with appropriate
> * routing configuration to receive the packet towards the socket,
> * will cause *skb* to be delivered to the specified socket.
> @@ -3061,6 +3067,39 @@ union bpf_attr {
> * call from outside of TC ingress.
> * * **-ESOCKTNOSUPPORT** Socket type not supported (reuseport).
> *
> + * int bpf_sk_assign(struct bpf_sk_lookup *ctx, struct bpf_sock *sk, u64 flags)
> + * Description
> + * Helper is overloaded depending on BPF program type. This
> + * description applies to **BPF_PROG_TYPE_SK_LOOKUP** programs.
> + *
> + * Select the *sk* as a result of a socket lookup.
> + *
> + * For the operation to succeed, the passed socket must be
> + * compatible with the packet description provided by the *ctx*
> + * object.
> + *
> + * The L4 protocol (*IPPROTO_TCP* or *IPPROTO_UDP*) must be an
> + * exact match, while the IP family (*AF_INET* or *AF_INET6*)
> + * must be compatible; that is, IPv6 sockets that are not
> + * v6-only can be selected for IPv4 packets.
> + *
> + * Only full sockets can be selected. However, there is no need to
> + * call bpf_sk_fullsock() before passing a socket as an argument to
> + * this helper.
> + *
> + * The *flags* argument must be zero.
> + * Return
> + * 0 on success, or a negative errno in case of failure.
> + *
> + * **-EAFNOSUPPORT** if socket family (*sk->family*) is not
> + * compatible with packet family (*ctx->family*).
> + *
> + * **-EINVAL** if unsupported flags were specified.
> + *
> + * **-EPROTOTYPE** if socket L4 protocol (*sk->protocol*) doesn't
> + * match packet protocol (*ctx->protocol*).
> + *
> + * **-ESOCKTNOSUPPORT** if socket is not a full socket.
> + *
> * u64 bpf_ktime_get_boot_ns(void)
> * Description
> * Return the time elapsed since system boot, in nanoseconds.
> @@ -4012,4 +4051,18 @@ struct bpf_pidns_info {
> __u32 pid;
> __u32 tgid;
> };
> +
> +/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
> +struct bpf_sk_lookup {
> + __u32 family; /* AF_INET, AF_INET6 */
> + __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
> + /* IP addresses allow 1-, 2-, and 4-byte access */
> + __u32 src_ip4;
> + __u32 src_ip6[4];
> + __u32 src_port; /* network byte order */
> + __u32 dst_ip4;
> + __u32 dst_ip6[4];
> + __u32 dst_port; /* host byte order */
> +};
> +
> #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index bb1ab7da6103..26d643c171fd 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2729,6 +2729,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
> case BPF_CGROUP_GETSOCKOPT:
> case BPF_CGROUP_SETSOCKOPT:
> return BPF_PROG_TYPE_CGROUP_SOCKOPT;
> + case BPF_SK_LOOKUP:
It may be a good idea to enforce the "expected_attach_type =
BPF_SK_LOOKUP" during prog load time in bpf_prog_load_check_attach().
The attr->expected_attach_type could be anything right now if I read
it correctly.
> + return BPF_PROG_TYPE_SK_LOOKUP;
> default:
> return BPF_PROG_TYPE_UNSPEC;
> }
> @@ -2778,6 +2780,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> case BPF_PROG_TYPE_FLOW_DISSECTOR:
> ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> break;
> + case BPF_PROG_TYPE_SK_LOOKUP:
> + ret = sk_lookup_prog_attach(attr, prog);
> + break;
> case BPF_PROG_TYPE_CGROUP_DEVICE:
> case BPF_PROG_TYPE_CGROUP_SKB:
> case BPF_PROG_TYPE_CGROUP_SOCK:
> @@ -2818,6 +2823,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> return lirc_prog_detach(attr);
> case BPF_PROG_TYPE_FLOW_DISSECTOR:
> return skb_flow_dissector_bpf_prog_detach(attr);
> + case BPF_PROG_TYPE_SK_LOOKUP:
> + return sk_lookup_prog_detach(attr);
> case BPF_PROG_TYPE_CGROUP_DEVICE:
> case BPF_PROG_TYPE_CGROUP_SKB:
> case BPF_PROG_TYPE_CGROUP_SOCK:
> @@ -2867,6 +2874,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> return lirc_prog_query(attr, uattr);
> case BPF_FLOW_DISSECTOR:
> return skb_flow_dissector_prog_query(attr, uattr);
> + case BPF_SK_LOOKUP:
> + return sk_lookup_prog_query(attr, uattr);
"# CONFIG_NET is not set" needs to be taken care of.
> default:
> return -EINVAL;
> }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bc25bb1085b1..a00bdc70041c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -9054,6 +9054,253 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>
> const struct bpf_prog_ops sk_reuseport_prog_ops = {
> };
> +
> +static DEFINE_MUTEX(sk_lookup_prog_mutex);
> +
> +int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
> +{
> + struct net *net = current->nsproxy->net_ns;
> + int ret;
> +
> + if (unlikely(attr->attach_flags))
> + return -EINVAL;
> +
> + mutex_lock(&sk_lookup_prog_mutex);
> + ret = bpf_prog_attach_one(&net->sk_lookup_prog,
> + &sk_lookup_prog_mutex, prog,
> + attr->attach_flags);
> + mutex_unlock(&sk_lookup_prog_mutex);
> +
> + return ret;
> +}
> +
> +int sk_lookup_prog_detach(const union bpf_attr *attr)
> +{
> + struct net *net = current->nsproxy->net_ns;
> + int ret;
> +
> + if (unlikely(attr->attach_flags))
> + return -EINVAL;
> +
> + mutex_lock(&sk_lookup_prog_mutex);
> + ret = bpf_prog_detach_one(&net->sk_lookup_prog,
> + &sk_lookup_prog_mutex);
> + mutex_unlock(&sk_lookup_prog_mutex);
> +
> + return ret;
> +}
> +
> +int sk_lookup_prog_query(const union bpf_attr *attr,
> + union bpf_attr __user *uattr)
> +{
> + struct net *net;
> + int ret;
> +
> + net = get_net_ns_by_fd(attr->query.target_fd);
> + if (IS_ERR(net))
> + return PTR_ERR(net);
> +
> + ret = bpf_prog_query_one(&net->sk_lookup_prog, attr, uattr);
> +
> + put_net(net);
> + return ret;
> +}
> +
> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> + struct sock *, sk, u64, flags)
> +{
> + if (unlikely(flags != 0))
> + return -EINVAL;
> + if (unlikely(!sk_fullsock(sk)))
Maybe ARG_PTR_TO_SOCKET instead?
> + return -ESOCKTNOSUPPORT;
> +
> + /* Check if socket is suitable for packet L3/L4 protocol */
> + if (sk->sk_protocol != ctx->protocol)
> + return -EPROTOTYPE;
> + if (sk->sk_family != ctx->family &&
> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> + return -EAFNOSUPPORT;
> +
> + /* Select socket as lookup result */
> + ctx->selected_sk = sk;
Could sk be a TCP_ESTABLISHED sk?
> + return 0;
> +}
> +
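For illustration, the compatibility rules the helper enforces can be modeled as a standalone userspace function. `sk_assign_check()` below is a sketch of those rules (exact L4 protocol match; IPv4 packets may select dual-stack IPv6 sockets), not the kernel function itself, and its parameter list is invented for the example:

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>		/* IPPROTO_TCP, IPPROTO_UDP */
#include <sys/socket.h>		/* AF_INET, AF_INET6 */

/* Model of the checks in bpf_sk_lookup_assign(): the L4 protocol must
 * match exactly, and the socket family must be compatible with the
 * packet family. An IPv6 socket that is not v6-only may also receive
 * IPv4 packets; an IPv4 socket never receives IPv6 packets.
 */
static int sk_assign_check(int pkt_family, int pkt_proto,
			   int sk_family, int sk_proto, int sk_v6only)
{
	if (sk_proto != pkt_proto)
		return -EPROTOTYPE;
	if (sk_family != pkt_family &&
	    (sk_family == AF_INET || sk_v6only))
		return -EAFNOSUPPORT;
	return 0;
}
```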
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Jakub Sitnicki @ 2020-05-08 8:54 UTC (permalink / raw)
To: dccp
On Thu, May 07, 2020 at 10:55 PM CEST, Martin KaFai Lau wrote:
> On Wed, May 06, 2020 at 03:53:35PM +0200, Jakub Sitnicki wrote:
>> On Wed, May 06, 2020 at 03:16 PM CEST, Lorenz Bauer wrote:
>> > On Wed, 6 May 2020 at 13:55, Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> [...]
>>
>> >> @@ -4012,4 +4051,18 @@ struct bpf_pidns_info {
>> >> __u32 pid;
>> >> __u32 tgid;
>> >> };
>> >> +
>> >> +/* User accessible data for SK_LOOKUP programs. Add new fields at the end. */
>> >> +struct bpf_sk_lookup {
>> >> + __u32 family; /* AF_INET, AF_INET6 */
>> >> + __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
>> >> + /* IP addresses allow 1-, 2-, and 4-byte access */
>> >> + __u32 src_ip4;
>> >> + __u32 src_ip6[4];
>> >> + __u32 src_port; /* network byte order */
>> >> + __u32 dst_ip4;
>> >> + __u32 dst_ip6[4];
>> >> + __u32 dst_port; /* host byte order */
>> >
>> > Jakub and I have discussed this off-list, but we couldn't come to an
>> > agreement and decided to invite
>> > your opinion.
>> >
>> > I think that dst_port should be in network byte order, since it's one
>> > less exception to the
>> > rule to think about when writing BPF programs.
>> >
>> > Jakub's argument is that this follows __sk_buff->local_port precedent,
>> > which is also in host
>> > byte order.
>>
>> Yes, would be great to hear if there is a preference here.
>>
>> Small correction: the proposed sk_lookup program doesn't have access
>> to __sk_buff, so perhaps that case matters less.
>>
>> bpf_sk_lookup->dst_port, the packet destination port, is in host byte
>> order so that it can be compared against bpf_sock->src_port, socket
>> local port, without conversion.
>>
>> But I also see how it can be a surprise for a BPF user that one field has
>> a different byte order.
I would also prefer port and addr were all in the same byte order.
However, it is not the case for the other prog_type ctx.
People have stumbled over it from time to time. Maybe something
can be done in libbpf to hide this difference.
>
> I think uapi consistency with other existing ctx is more important here.
(i.e. keep the "local" port in host order). Otherwise, the user will
be slapped left and right when writing bpf_prog in different prog_types.
>
> Armed with the knowledge on skc_num, the "local" port is
> in host byte order in the current existing prog ctx. It is
> unfortunate that the "dst"_port in this patch is the "local" port.
> The "local" port in "struct bpf_sock" is actually the "src"_port. :/
Would "local"/"remote" be clearer than "src"/"dst" in this patch?
I went and compared the field naming and byte order in existing structs:
| struct | field | byte order |
|----------------+------------+------------|
| __sk_buff | local_port | host |
| sk_msg_md | local_port | host |
| bpf_sock_ops | local_port | host |
| bpf_sock | src_port | host |
| bpf_fib_lookup | dport | network |
| bpf_flow_keys | dport | network |
| bpf_sock_tuple | dport | network |
| bpf_sock_addr | user_port | network |
It does look like "local"/"remote" prefix is the sensible choice.
I got carried away trying to match the field names with bpf_sock, which
actually doesn't follow the naming convention.
Will rename fields to local_*, remote_* in v2.
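Whichever naming wins, the byte-order split in the table above means a program comparing ports across context types must convert explicitly. A small illustration (the field pairing in the comment is hypothetical):

```c
#include <arpa/inet.h>		/* htons, ntohs */
#include <assert.h>
#include <stdint.h>

/* A port read from a network-byte-order field (e.g. bpf_sock_tuple.dport)
 * can only be compared with a host-byte-order field (e.g. bpf_sock.src_port)
 * after conversion. Comparing the raw values happens to work on big-endian
 * hosts and silently fails on little-endian ones.
 */
static int ports_match(uint16_t port_net_order, uint32_t port_host_order)
{
	return ntohs(port_net_order) == port_host_order;
}
```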
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Jakub Sitnicki @ 2020-05-08 10:45 UTC (permalink / raw)
To: dccp
On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
> On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
>> Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
>> called BPF_SK_LOOKUP. The new program kind is to be invoked by the
>> transport layer when looking up a socket for a received packet.
>>
>> When called, SK_LOOKUP program can select a socket that will receive the
>> packet. This serves as a mechanism to overcome the limits of what bind()
>> API allows to express. Two use-cases driving this work are:
>>
>> (1) steer packets destined to an IP range, fixed port to a socket
>>
>> 192.0.2.0/24, port 80 -> NGINX socket
>>
>> (2) steer packets destined to an IP address, any port to a socket
>>
>> 198.51.100.1, any port -> L7 proxy socket
>>
>> In its run-time context, program receives information about the packet that
>> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
>> address 4-tuple. Context can be further extended to include ingress
>> interface identifier.
>>
>> To select a socket BPF program fetches it from a map holding socket
>> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
>> helper to record the selection. Transport layer then uses the selected
>> socket as a result of socket lookup.
>>
>> This patch only enables the user to attach an SK_LOOKUP program to a
>> network namespace. Subsequent patches hook it up to run on local delivery
>> path in ipv4 and ipv6 stacks.
>>
>> Suggested-by: Marek Majkowski <marek@cloudflare.com>
>> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
[...]
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index bb1ab7da6103..26d643c171fd 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2729,6 +2729,8 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
>> case BPF_CGROUP_GETSOCKOPT:
>> case BPF_CGROUP_SETSOCKOPT:
>> return BPF_PROG_TYPE_CGROUP_SOCKOPT;
>> + case BPF_SK_LOOKUP:
> It may be a good idea to enforce the "expected_attach_type =
> BPF_SK_LOOKUP" during prog load time in bpf_prog_load_check_attach().
> The attr->expected_attach_type could be anything right now if I read
> it correctly.
I'll extend bpf_prog_attach_check_attach_type to enforce it for SK_LOOKUP.
>
>> + return BPF_PROG_TYPE_SK_LOOKUP;
>> default:
>> return BPF_PROG_TYPE_UNSPEC;
>> }
>> @@ -2778,6 +2780,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>> case BPF_PROG_TYPE_FLOW_DISSECTOR:
>> ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
>> break;
>> + case BPF_PROG_TYPE_SK_LOOKUP:
>> + ret = sk_lookup_prog_attach(attr, prog);
>> + break;
>> case BPF_PROG_TYPE_CGROUP_DEVICE:
>> case BPF_PROG_TYPE_CGROUP_SKB:
>> case BPF_PROG_TYPE_CGROUP_SOCK:
>> @@ -2818,6 +2823,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>> return lirc_prog_detach(attr);
>> case BPF_PROG_TYPE_FLOW_DISSECTOR:
>> return skb_flow_dissector_bpf_prog_detach(attr);
>> + case BPF_PROG_TYPE_SK_LOOKUP:
>> + return sk_lookup_prog_detach(attr);
>> case BPF_PROG_TYPE_CGROUP_DEVICE:
>> case BPF_PROG_TYPE_CGROUP_SKB:
>> case BPF_PROG_TYPE_CGROUP_SOCK:
>> @@ -2867,6 +2874,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
>> return lirc_prog_query(attr, uattr);
>> case BPF_FLOW_DISSECTOR:
>> return skb_flow_dissector_prog_query(attr, uattr);
>> + case BPF_SK_LOOKUP:
>> + return sk_lookup_prog_query(attr, uattr);
> "# CONFIG_NET is not set" needs to be taken care.
Sorry, embarrassing mistake. Will add stubs returning -EINVAL like
flow_dissector and cgroup_bpf progs have.
>
>> default:
>> return -EINVAL;
>> }
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index bc25bb1085b1..a00bdc70041c 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -9054,6 +9054,253 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
>>
>> const struct bpf_prog_ops sk_reuseport_prog_ops = {
>> };
>> +
>> +static DEFINE_MUTEX(sk_lookup_prog_mutex);
>> +
>> +int sk_lookup_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog)
>> +{
>> + struct net *net = current->nsproxy->net_ns;
>> + int ret;
>> +
>> + if (unlikely(attr->attach_flags))
>> + return -EINVAL;
>> +
>> + mutex_lock(&sk_lookup_prog_mutex);
>> + ret = bpf_prog_attach_one(&net->sk_lookup_prog,
>> + &sk_lookup_prog_mutex, prog,
>> + attr->attach_flags);
>> + mutex_unlock(&sk_lookup_prog_mutex);
>> +
>> + return ret;
>> +}
>> +
>> +int sk_lookup_prog_detach(const union bpf_attr *attr)
>> +{
>> + struct net *net = current->nsproxy->net_ns;
>> + int ret;
>> +
>> + if (unlikely(attr->attach_flags))
>> + return -EINVAL;
>> +
>> + mutex_lock(&sk_lookup_prog_mutex);
>> + ret = bpf_prog_detach_one(&net->sk_lookup_prog,
>> + &sk_lookup_prog_mutex);
>> + mutex_unlock(&sk_lookup_prog_mutex);
>> +
>> + return ret;
>> +}
>> +
>> +int sk_lookup_prog_query(const union bpf_attr *attr,
>> + union bpf_attr __user *uattr)
>> +{
>> + struct net *net;
>> + int ret;
>> +
>> + net = get_net_ns_by_fd(attr->query.target_fd);
>> + if (IS_ERR(net))
>> + return PTR_ERR(net);
>> +
>> + ret = bpf_prog_query_one(&net->sk_lookup_prog, attr, uattr);
>> +
>> + put_net(net);
>> + return ret;
>> +}
>> +
>> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
>> + struct sock *, sk, u64, flags)
>> +{
>> + if (unlikely(flags != 0))
>> + return -EINVAL;
>> + if (unlikely(!sk_fullsock(sk)))
> May be ARG_PTR_TO_SOCKET instead?
I had ARG_PTR_TO_SOCKET initially, then switched to SOCK_COMMON to match
the TC bpf_sk_assign proto. Now that you point it out, it makes more
sense to be more specific in the helper proto.
>
>> + return -ESOCKTNOSUPPORT;
>> +
>> + /* Check if socket is suitable for packet L3/L4 protocol */
>> + if (sk->sk_protocol != ctx->protocol)
>> + return -EPROTOTYPE;
>> + if (sk->sk_family != ctx->family &&
>> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
>> + return -EAFNOSUPPORT;
>> +
>> + /* Select socket as lookup result */
>> + ctx->selected_sk = sk;
> Could sk be a TCP_ESTABLISHED sk?
Yes, and what's worse, it could be ref-counted. This is a bug. I should
be rejecting ref-counted sockets here.
Callers of __inet_lookup_listener() and inet6_lookup_listener() expect
an RCU-freed socket on return.
For UDP lookup, returning a TCP_ESTABLISHED (connected) socket is okay.
Thank you for valuable comments. Will fix all of the above in v2.
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Martin KaFai Lau @ 2020-05-08 18:39 UTC (permalink / raw)
To: dccp
On Fri, May 08, 2020 at 12:45:14PM +0200, Jakub Sitnicki wrote:
> On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
> > On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
> >> Add a new program type BPF_PROG_TYPE_SK_LOOKUP and a dedicated attach type
> >> called BPF_SK_LOOKUP. The new program kind is to be invoked by the
> >> transport layer when looking up a socket for a received packet.
> >>
> >> When called, SK_LOOKUP program can select a socket that will receive the
> >> packet. This serves as a mechanism to overcome the limits of what bind()
> >> API allows to express. Two use-cases driving this work are:
> >>
> >> (1) steer packets destined to an IP range, fixed port to a socket
> >>
> >> 192.0.2.0/24, port 80 -> NGINX socket
> >>
> >> (2) steer packets destined to an IP address, any port to a socket
> >>
> >> 198.51.100.1, any port -> L7 proxy socket
> >>
> >> In its run-time context, program receives information about the packet that
> >> triggered the socket lookup. Namely IP version, L4 protocol identifier, and
> >> address 4-tuple. Context can be further extended to include ingress
> >> interface identifier.
> >>
> >> To select a socket BPF program fetches it from a map holding socket
> >> references, like SOCKMAP or SOCKHASH, and calls bpf_sk_assign(ctx, sk, ...)
> >> helper to record the selection. Transport layer then uses the selected
> >> socket as a result of socket lookup.
> >>
> >> This patch only enables the user to attach an SK_LOOKUP program to a
> >> network namespace. Subsequent patches hook it up to run on local delivery
> >> path in ipv4 and ipv6 stacks.
> >>
> >> Suggested-by: Marek Majkowski <marek@cloudflare.com>
> >> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
>
[...]
> >> +BPF_CALL_3(bpf_sk_lookup_assign, struct bpf_sk_lookup_kern *, ctx,
> >> + struct sock *, sk, u64, flags)
> >> +{
> >> + if (unlikely(flags != 0))
> >> + return -EINVAL;
> >> + if (unlikely(!sk_fullsock(sk)))
> > May be ARG_PTR_TO_SOCKET instead?
>
> I had ARG_PTR_TO_SOCKET initially, then switched to SOCK_COMMON to match
> the TC bpf_sk_assign proto. Now that you point it out, it makes more
> sense to be more specific in the helper proto.
>
> >
> >> + return -ESOCKTNOSUPPORT;
> >> +
> >> + /* Check if socket is suitable for packet L3/L4 protocol */
> >> + if (sk->sk_protocol != ctx->protocol)
> >> + return -EPROTOTYPE;
> >> + if (sk->sk_family != ctx->family &&
> >> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> >> + return -EAFNOSUPPORT;
> >> +
> >> + /* Select socket as lookup result */
> >> + ctx->selected_sk = sk;
> > Could sk be a TCP_ESTABLISHED sk?
>
> Yes, and what's worse, it could be ref-counted. This is a bug. I should
> be rejecting ref counted sockets here.
Agree. ref-counted (i.e. checking rcu protected or not) is the right check
here.
An unrelated quick thought, it may still be fine for the
TCP_ESTABLISHED tcp_sk returned from sock_map because of the
"call_rcu(&psock->rcu, sk_psock_destroy);" in sk_psock_drop().
I was more thinking about in the future, what if this helper can take
other sk not coming from sock_map.
>
> Callers of __inet_lookup_listener() and inet6_lookup_listener() expect
> an RCU-freed socket on return.
>
> For UDP lookup, returning a TCP_ESTABLISHED (connected) socket is okay.
>
>
> Thank you for valuable comments. Will fix all of the above in v2.
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Jakub Sitnicki @ 2020-05-11 9:08 UTC (permalink / raw)
To: dccp
On Fri, May 08, 2020 at 08:39 PM CEST, Martin KaFai Lau wrote:
> On Fri, May 08, 2020 at 12:45:14PM +0200, Jakub Sitnicki wrote:
>> On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
>> > On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
[...]
>> >> + return -ESOCKTNOSUPPORT;
>> >> +
>> >> + /* Check if socket is suitable for packet L3/L4 protocol */
>> >> + if (sk->sk_protocol != ctx->protocol)
>> >> + return -EPROTOTYPE;
>> >> + if (sk->sk_family != ctx->family &&
>> >> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
>> >> + return -EAFNOSUPPORT;
>> >> +
>> >> + /* Select socket as lookup result */
>> >> + ctx->selected_sk = sk;
>> > Could sk be a TCP_ESTABLISHED sk?
>>
>> Yes, and what's worse, it could be ref-counted. This is a bug. I should
>> be rejecting ref counted sockets here.
> Agree. ref-counted (i.e. checking rcu protected or not) is the right check
> here.
>
> An unrelated quick thought, it may still be fine for the
> TCP_ESTABLISHED tcp_sk returned from sock_map because of the
> "call_rcu(&psock->rcu, sk_psock_destroy);" in sk_psock_drop().
> I was more thinking about in the future, what if this helper can take
> other sk not coming from sock_map.
I see, psock holds a sock reference and will not release it until a full
grace period has elapsed.
Even if holding a ref wasn't a problem, I'm not sure if returning a
TCP_ESTABLISHED socket wouldn't trip up callers of inet_lookup_listener
(tcp_v4_rcv and nf_tproxy_handle_time_wait4), that look for a listener
when processing a SYN to TIME_WAIT socket.
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Martin KaFai Lau @ 2020-05-11 18:59 UTC (permalink / raw)
To: dccp
On Mon, May 11, 2020 at 11:08:15AM +0200, Jakub Sitnicki wrote:
> On Fri, May 08, 2020 at 08:39 PM CEST, Martin KaFai Lau wrote:
> > On Fri, May 08, 2020 at 12:45:14PM +0200, Jakub Sitnicki wrote:
> >> On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
> >> > On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
>
> [...]
>
> >> >> + return -ESOCKTNOSUPPORT;
> >> >> +
> >> >> + /* Check if socket is suitable for packet L3/L4 protocol */
> >> >> + if (sk->sk_protocol != ctx->protocol)
> >> >> + return -EPROTOTYPE;
> >> >> + if (sk->sk_family != ctx->family &&
> >> >> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> >> >> + return -EAFNOSUPPORT;
> >> >> +
> >> >> + /* Select socket as lookup result */
> >> >> + ctx->selected_sk = sk;
> >> > Could sk be a TCP_ESTABLISHED sk?
> >>
> >> Yes, and what's worse, it could be ref-counted. This is a bug. I should
> >> be rejecting ref counted sockets here.
> > Agree. ref-counted (i.e. checking rcu protected or not) is the right check
> > here.
> >
> > An unrelated quick thought, it may still be fine for the
> > TCP_ESTABLISHED tcp_sk returned from sock_map because of the
> > "call_rcu(&psock->rcu, sk_psock_destroy);" in sk_psock_drop().
> > I was more thinking about in the future, what if this helper can take
> > other sk not coming from sock_map.
>
> I see, psock holds a sock reference and will not release it until a full
> grace period has elapsed.
>
> Even if holding a ref wasn't a problem, I'm not sure if returning a
> TCP_ESTABLISHED socket wouldn't trip up callers of inet_lookup_listener
> (tcp_v4_rcv and nf_tproxy_handle_time_wait4), that look for a listener
> when processing a SYN to TIME_WAIT socket.
Not suggesting the sk_assign helper has to support TCP_ESTABLISHED in TCP
if there is no use case for it.
Do you have a use case on supporting TCP_ESTABLISHED sk in UDP?
From the cover letter use cases, it is not clear to me it is
required.
Or should both support only unconnected sockets?
Regardless, these details will be useful in the helper's doc.
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Jakub Sitnicki @ 2020-05-11 19:26 UTC (permalink / raw)
To: dccp
On Mon, May 11, 2020 at 08:59 PM CEST, Martin KaFai Lau wrote:
> On Mon, May 11, 2020 at 11:08:15AM +0200, Jakub Sitnicki wrote:
>> On Fri, May 08, 2020 at 08:39 PM CEST, Martin KaFai Lau wrote:
>> > On Fri, May 08, 2020 at 12:45:14PM +0200, Jakub Sitnicki wrote:
>> >> On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
>> >> > On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
>>
>> [...]
>>
>> >> >> + return -ESOCKTNOSUPPORT;
>> >> >> +
>> >> >> + /* Check if socket is suitable for packet L3/L4 protocol */
>> >> >> + if (sk->sk_protocol != ctx->protocol)
>> >> >> + return -EPROTOTYPE;
>> >> >> + if (sk->sk_family != ctx->family &&
>> >> >> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
>> >> >> + return -EAFNOSUPPORT;
>> >> >> +
>> >> >> + /* Select socket as lookup result */
>> >> >> + ctx->selected_sk = sk;
>> >> > Could sk be a TCP_ESTABLISHED sk?
>> >>
>> >> Yes, and what's worse, it could be ref-counted. This is a bug. I should
>> >> be rejecting ref counted sockets here.
>> > Agree. ref-counted (i.e. checking rcu protected or not) is the right check
>> > here.
>> >
>> > An unrelated quick thought, it may still be fine for the
>> > TCP_ESTABLISHED tcp_sk returned from sock_map because of the
>> > "call_rcu(&psock->rcu, sk_psock_destroy);" in sk_psock_drop().
>> > I was more thinking about in the future, what if this helper can take
>> > other sk not coming from sock_map.
>>
>> I see, psock holds a sock reference and will not release it until a full
>> grace period has elapsed.
>>
>> Even if holding a ref wasn't a problem, I'm not sure if returning a
>> TCP_ESTABLISHED socket wouldn't trip up callers of inet_lookup_listener
>> (tcp_v4_rcv and nf_tproxy_handle_time_wait4), that look for a listener
>> when processing a SYN to TIME_WAIT socket.
> Not suggesting the sk_assign helper has to support TCP_ESTABLISHED in TCP
> if there is no use case for it.
Ack, I didn't think you were. Just explored the consequences.
> Do you have a use case on supporting TCP_ESTABLISHED sk in UDP?
> From the cover letter use cases, it is not clear to me it is
> required.
>
> or both should only support unconnected sk?
No, we don't have a use case for selecting a connected UDP socket.
I left it as a possibility because __udp[46]_lib_lookup, where BPF
sk_lookup is invoked from, can return one.
Perhaps the user would like to connect the selected receiving socket
(for instance to itself) to ensure it's not used for TX?
I've pulled this scenario out of the hat. Happy to limit bpf_sk_assign
to select only unconnected UDP sockets, if returning a connected one
doesn't make sense.
> Regardless, this details will be useful in the helper's doc.
I've reworded the helper doc in v2 to say:
Description
...
Only TCP listeners and UDP sockets, that is, sockets
which have the *SOCK_RCU_FREE* flag set, can be selected.
...
Return
...
**-ESOCKTNOSUPPORT** if socket does not use RCU freeing.
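The reworded rule reduces to a single flag test, which can be modeled in plain C for illustration. The flag bit value below is a stand-in, not the kernel's actual enum sock_flags encoding:

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's SOCK_RCU_FREE flag bit. */
#define MOCK_SOCK_RCU_FREE	(1u << 0)

/* Only sockets freed via RCU (TCP listeners and UDP sockets, which
 * carry SOCK_RCU_FREE) may be selected as a lookup result; anything
 * else is rejected with -ESOCKTNOSUPPORT.
 */
static int may_select(unsigned int sk_flags)
{
	return (sk_flags & MOCK_SOCK_RCU_FREE) ? 0 : -ESOCKTNOSUPPORT;
}
```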
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
From: Martin KaFai Lau @ 2020-05-11 20:54 UTC (permalink / raw)
To: dccp
On Mon, May 11, 2020 at 09:26:02PM +0200, Jakub Sitnicki wrote:
> On Mon, May 11, 2020 at 08:59 PM CEST, Martin KaFai Lau wrote:
> > On Mon, May 11, 2020 at 11:08:15AM +0200, Jakub Sitnicki wrote:
> >> On Fri, May 08, 2020 at 08:39 PM CEST, Martin KaFai Lau wrote:
> >> > On Fri, May 08, 2020 at 12:45:14PM +0200, Jakub Sitnicki wrote:
> >> >> On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
> >> >> > On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
> >>
> >> [...]
> >>
> >> >> >> + return -ESOCKTNOSUPPORT;
> >> >> >> +
> >> >> >> + /* Check if socket is suitable for packet L3/L4 protocol */
> >> >> >> + if (sk->sk_protocol != ctx->protocol)
> >> >> >> + return -EPROTOTYPE;
> >> >> >> + if (sk->sk_family != ctx->family &&
> >> >> >> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
> >> >> >> + return -EAFNOSUPPORT;
> >> >> >> +
> >> >> >> + /* Select socket as lookup result */
> >> >> >> + ctx->selected_sk = sk;
> >> >> > Could sk be a TCP_ESTABLISHED sk?
> >> >>
> >> >> Yes, and what's worse, it could be ref-counted. This is a bug. I should
> >> >> be rejecting ref counted sockets here.
> >> > Agree. ref-counted (i.e. checking rcu protected or not) is the right check
> >> > here.
> >> >
> >> > An unrelated quick thought, it may still be fine for the
> >> > TCP_ESTABLISHED tcp_sk returned from sock_map because of the
> >> > "call_rcu(&psock->rcu, sk_psock_destroy);" in sk_psock_drop().
> >> > I was more thinking about in the future, what if this helper can take
> >> > other sk not coming from sock_map.
> >>
> >> I see, psock holds a sock reference and will not release it until a full
> >> grace period has elapsed.
> >>
> >> Even if holding a ref wasn't a problem, I'm not sure if returning a
> >> TCP_ESTABLISHED socket wouldn't trip up callers of inet_lookup_listener
> >> (tcp_v4_rcv and nf_tproxy_handle_time_wait4), that look for a listener
> >> when processing a SYN to TIME_WAIT socket.
> > Not suggesting the sk_assign helper has to support TCP_ESTABLISHED in TCP
> > if there is no use case for it.
>
> Ack, I didn't think you were. Just explored the consequences.
>
> > Do you have a use case on supporting TCP_ESTABLISHED sk in UDP?
> > From the cover letter use cases, it is not clear to me it is
> > required.
> >
> > or both should only support unconnected sk?
>
> No, we don't have a use case for selecting a connected UDP socket.
>
> I left it as a possibility because __udp[46]_lib_lookup, where BPF
> sk_lookup is invoked from, can return one.
>
> Perhaps the user would like to connect the selected receiving socket
> (for instance to itself) to ensure it's not used for TX?
>
> I've pulled this scenario out of the hat. Happy to limit bpf_sk_assign
> to select only unconnected UDP sockets, if returning a connected one
> doesn't make sense.
OTOH, my concern is:
TCP's SK_LOOKUP can override the kernel choice on TCP_LISTEN sk.
UDP's SK_LOOKUP can override the kernel choice on unconnected sk but
not the connected sk.
It could be quite confusing to a bpf user if a bpf_prog were written to
return both connected and unconnected UDP sks, logically expecting both
to take effect before the kernel's choice.
>
> > Regardless, these details will be useful in the helper's doc.
>
> I've reworded the helper doc in v2 to say:
>
> Description
> ...
>
> Only TCP listeners and UDP sockets, that is sockets
> which have *SOCK_RCU_FREE* flag set, can be selected.
>
> ...
> Return
> ...
>
> **-ESOCKTNOSUPPORT** if socket does not use RCU freeing.
* Re: [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point
2020-05-06 12:54 [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
` (10 preceding siblings ...)
2020-05-11 20:54 ` Martin KaFai Lau
@ 2020-05-12 14:16 ` Jakub Sitnicki
11 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2020-05-12 14:16 UTC (permalink / raw)
To: dccp
On Mon, May 11, 2020 at 10:54 PM CEST, Martin KaFai Lau wrote:
> On Mon, May 11, 2020 at 09:26:02PM +0200, Jakub Sitnicki wrote:
>> On Mon, May 11, 2020 at 08:59 PM CEST, Martin KaFai Lau wrote:
>> > On Mon, May 11, 2020 at 11:08:15AM +0200, Jakub Sitnicki wrote:
>> >> On Fri, May 08, 2020 at 08:39 PM CEST, Martin KaFai Lau wrote:
>> >> > On Fri, May 08, 2020 at 12:45:14PM +0200, Jakub Sitnicki wrote:
>> >> >> On Fri, May 08, 2020 at 09:06 AM CEST, Martin KaFai Lau wrote:
>> >> >> > On Wed, May 06, 2020 at 02:54:58PM +0200, Jakub Sitnicki wrote:
>> >>
>> >> [...]
>> >>
>> >> >> >> + return -ESOCKTNOSUPPORT;
>> >> >> >> +
>> >> >> >> + /* Check if socket is suitable for packet L3/L4 protocol */
>> >> >> >> + if (sk->sk_protocol != ctx->protocol)
>> >> >> >> + return -EPROTOTYPE;
>> >> >> >> + if (sk->sk_family != ctx->family &&
>> >> >> >> + (sk->sk_family == AF_INET || ipv6_only_sock(sk)))
>> >> >> >> + return -EAFNOSUPPORT;
>> >> >> >> +
>> >> >> >> + /* Select socket as lookup result */
>> >> >> >> + ctx->selected_sk = sk;
>> >> >> > Could sk be a TCP_ESTABLISHED sk?
>> >> >>
>> >> >> Yes, and what's worse, it could be ref-counted. This is a bug. I should
>> >> >> be rejecting ref counted sockets here.
>> >> > Agree. ref-counted (i.e. checking rcu protected or not) is the right check
>> >> > here.
>> >> >
>> >> > An unrelated quick thought, it may still be fine for the
>> >> > TCP_ESTABLISHED tcp_sk returned from sock_map because of the
>> >> > "call_rcu(&psock->rcu, sk_psock_destroy);" in sk_psock_drop().
>> >> > I was more thinking about in the future, what if this helper can take
>> >> > other sk not coming from sock_map.
>> >>
>> >> I see, psock holds a sock reference and will not release it until a full
>> >> grace period has elapsed.
>> >>
>> >> Even if holding a ref wasn't a problem, I'm not sure if returning a
>> >> TCP_ESTABLISHED socket wouldn't trip up callers of inet_lookup_listener
>> >> (tcp_v4_rcv and nf_tproxy_handle_time_wait4), that look for a listener
>> >> when processing a SYN to TIME_WAIT socket.
>> > Not suggesting the sk_assign helper has to support TCP_ESTABLISHED in TCP
>> > if there is no use case for it.
>>
>> Ack, I didn't think you were. Just explored the consequences.
>>
>> > Do you have a use case on supporting TCP_ESTABLISHED sk in UDP?
>> > From the cover letter use cases, it is not clear to me it is
>> > required.
>> >
>> > or both should only support unconnected sk?
>>
>> No, we don't have a use case for selecting a connected UDP socket.
>>
>> I left it as a possibility because __udp[46]_lib_lookup, where BPF
>> sk_lookup is invoked from, can return one.
>>
>> Perhaps the user would like to connect the selected receiving socket
>> (for instance to itself) to ensure it's not used for TX?
>>
>> I've pulled this scenario out of the hat. Happy to limit bpf_sk_assign
>> to select only unconnected UDP sockets, if returning a connected one
>> doesn't make sense.
> OTOH, my concern is:
> TCP's SK_LOOKUP can override the kernel choice on TCP_LISTEN sk.
> UDP's SK_LOOKUP can override the kernel choice on unconnected sk but
> not the connected sk.
>
> It could be quite confusing to a bpf user if a bpf_prog were written to
> return both connected and unconnected UDP sks, logically expecting both
> to take effect before the kernel's choice.
>
That's a fair point. I've been looking at this from the PoV of in-kernel
callers of udp socket lookup, which now seems wrong.
I agree it would be a surprising, if not confusing, UAPI. Will limit it
to just unconnected UDP sockets in v3.
Thanks for raising the concern,
Jakub
Thread overview: 13+ messages
2020-05-06 12:54 [PATCH bpf-next 02/17] bpf: Introduce SK_LOOKUP program type with a dedicated attach point Jakub Sitnicki
2020-05-06 13:16 ` Lorenz Bauer
2020-05-06 13:53 ` Jakub Sitnicki
2020-05-07 20:55 ` Martin KaFai Lau
2020-05-08 7:06 ` Martin KaFai Lau
2020-05-08 8:54 ` Jakub Sitnicki
2020-05-08 10:45 ` Jakub Sitnicki
2020-05-08 18:39 ` Martin KaFai Lau
2020-05-11 9:08 ` Jakub Sitnicki
2020-05-11 18:59 ` Martin KaFai Lau
2020-05-11 19:26 ` Jakub Sitnicki
2020-05-11 20:54 ` Martin KaFai Lau
2020-05-12 14:16 ` Jakub Sitnicki