[bpf-next RFC 0/3] Introduce eBPF flow dissector

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [bpf-next RFC 0/3] Introduce eBPF flow dissector
@ 2018-08-16 16:44 Petar Penkov
  2018-08-16 16:44 ` [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov

From: Petar Penkov <ppenkov@google.com>

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program and attach type.

Patch 2 adds a flow dissector program in BPF. This parses most protocols in
__skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
and address types).

Patch 3 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

This RFC patchset exposes a few design considerations:

1/ Because the flow dissector key definitions live in
include/linux/net/flow_dissector.h, they are not visible from userspace,
and the flow keys definitions need to be copied in the BPF program.

2/ An alternative to adding a new hook would have been to attach flow
dissection programs at the XDP hook. Because this hook is executed before
GRO, it would have to execute on every MSS, which would be more
computationally expensive. Furthermore, the XDP hook is executed before an
SKB has been allocated and there is no clear way to move the dissected keys
into the SKB after it has been allocated. Eventually, perhaps a single pass
can implement both GRO and flow dissection -- but napi_gro_cb shows that a
lot more flow state would need to be parsed for this.

3/ The BPF program cannot use direct packet access everywhere because it
uses an offset, initially supplied by the flow dissector.  Because the
initial value of this non-constant offset comes from outside of the
program, the verifier does not know what its value is, and it cannot verify
that it is within packet bounds. Therefore, direct packet access programs
get rejected.

4/ Loading and attaching the BPF program requires capable(), as opposed to
ns_capable(), because a malicious program might be able to return bad
values that would trigger bugs in the kernel, such as the nhoff value bug
fixed in commit 324f8305e59b ("net-backports: flow_dissector: properly cap
thoff field").

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (3):
  flow_dissector: implements flow dissector BPF hook
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf_types.h                     |   1 +
 include/linux/skbuff.h                        |   7 +
 include/net/flow_dissector.h                  |  16 +
 include/uapi/linux/bpf.h                      |  14 +-
 kernel/bpf/syscall.c                          |   8 +
 kernel/bpf/verifier.c                         |   2 +
 net/core/filter.c                             | 157 ++++
 net/core/flow_dissector.c                     |  76 ++
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   5 +-
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c        | 542 ++++++++++++
 tools/testing/selftests/bpf/bpf_helpers.h     |   3 +
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 21 files changed, 1967 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

-- 
2.18.0.865.gffc8e1a3cd6-goog

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-16 16:44 [bpf-next RFC 0/3] Introduce eBPF flow dissector Petar Penkov
@ 2018-08-16 16:44 ` Petar Penkov
  2018-08-16 18:34   ` Edward Cree
  2018-08-16 22:40   ` Song Liu
  2018-08-16 16:44 ` [bpf-next RFC 2/3] flow_dissector: implements eBPF parser Petar Penkov
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 20+ messages in thread
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn

From: Petar Penkov <ppenkov@google.com>

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is kept as a global variable so it is
accessible to all flow dissectors.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/bpf_types.h                 |   1 +
 include/linux/skbuff.h                    |   7 +
 include/net/flow_dissector.h              |  16 +++
 include/uapi/linux/bpf.h                  |  14 +-
 kernel/bpf/syscall.c                      |   8 ++
 kernel/bpf/verifier.c                     |   2 +
 net/core/filter.c                         | 157 ++++++++++++++++++++++
 net/core/flow_dissector.c                 |  76 +++++++++++
 tools/bpf/bpftool/prog.c                  |   1 +
 tools/include/uapi/linux/bpf.h            |   5 +-
 tools/lib/bpf/libbpf.c                    |   2 +
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 12 files changed, 290 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 			     const struct flow_dissector_key *key,
 			     unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 6a4586dcdede..edb919d320c1 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
 extern struct flow_dissector flow_keys_dissector;
 extern struct flow_dissector flow_keys_basic_dissector;
 
+/* struct bpf_flow_dissect_cb:
+ *
+ * This struct is used to pass parameters to BPF programs of type
+ * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
+ * the control block of the skb to be a struct of this type. The first field is
+ * used to communicate the next header offset between the BPF programs and the
+ * first value of it is passed from the kernel. The last two fields are used for
+ * writing out flow keys.
+ */
+struct bpf_flow_dissect_cb {
+	u16 nhoff;
+	u16 unused;
+	void *target_container;
+	struct flow_dissector *flow_dissector;
+};
+
 /* struct flow_keys_digest:
  *
  * This structure is used to hold a digest of the full flow keys. This is a
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..8bc0fdab685d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2141,6 +2143,15 @@ union bpf_attr {
  *		request in the skb.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
+ *	Description
+ *		Try to write *len* bytes from the source pointer into the offset
+ *		of the key with id *key_id*. If *len* is different from the
+ *		size of the key, an error is returned. If the key is not used,
+ *		this function exits with no effect and code 0.
+ *	Return
+ *		0 on success, negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2226,7 +2237,8 @@ union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(flow_dissector_write_keys),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 43727ed0d94a..a06568841a92 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_LIRC_MODE2:
 		ptype = BPF_PROG_TYPE_LIRC_MODE2;
 		break;
+	case BPF_FLOW_DISSECTOR:
+		ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_LIRC_MODE2:
 		ret = lirc_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
+		break;
 	default:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
@@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
 	case BPF_LIRC_MODE2:
 		return lirc_prog_detach(attr);
+	case BPF_FLOW_DISSECTOR:
+		return skb_flow_dissector_bpf_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ca90679a7fe5..6d3f268fa8e0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		if (meta)
 			return meta->pkt_access;
 
@@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index fd423ce3da34..03d3037e6508 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
 	return false;
 }
 
+BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
+	   const void *, from, u32, len, enum flow_dissector_key_id, key_id)
+{
+	struct bpf_flow_dissect_cb *cb;
+	void *dest;
+
+	cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
+
+	/* Make sure the dissector actually uses the key. It is not an error if
+	 * it does not, but we should not continue past this point in that case
+	 */
+	if (!dissector_uses_key(cb->flow_dissector, key_id))
+		return 0;
+
+	/* Make sure the length is correct */
+	switch (key_id) {
+	case FLOW_DISSECTOR_KEY_CONTROL:
+	case FLOW_DISSECTOR_KEY_ENC_CONTROL:
+		if (len != sizeof(struct flow_dissector_key_control))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_BASIC:
+		if (len != sizeof(struct flow_dissector_key_basic))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+	case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
+		if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+	case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
+		if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_ICMP:
+		if (len != sizeof(struct flow_dissector_key_icmp))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_PORTS:
+	case FLOW_DISSECTOR_KEY_ENC_PORTS:
+		if (len != sizeof(struct flow_dissector_key_ports))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_ETH_ADDRS:
+		if (len != sizeof(struct flow_dissector_key_eth_addrs))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_TIPC:
+		if (len != sizeof(struct flow_dissector_key_tipc))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_ARP:
+		if (len != sizeof(struct flow_dissector_key_arp))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_VLAN:
+	case FLOW_DISSECTOR_KEY_CVLAN:
+		if (len != sizeof(struct flow_dissector_key_vlan))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_FLOW_LABEL:
+		if (len != sizeof(struct flow_dissector_key_tags))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_GRE_KEYID:
+	case FLOW_DISSECTOR_KEY_ENC_KEYID:
+	case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
+		if (len != sizeof(struct flow_dissector_key_keyid))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_MPLS:
+		if (len != sizeof(struct flow_dissector_key_mpls))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_TCP:
+		if (len != sizeof(struct flow_dissector_key_tcp))
+			return -EINVAL;
+		break;
+	case FLOW_DISSECTOR_KEY_IP:
+	case FLOW_DISSECTOR_KEY_ENC_IP:
+		if (len != sizeof(struct flow_dissector_key_ip))
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
+					 cb->target_container);
+
+	memcpy(dest, from, len);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
+	.func		= bpf_flow_dissector_write_keys,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_flow_dissector_write_keys:
+		return &bpf_flow_dissector_write_keys_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
 static const struct bpf_func_proto *
 lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
 	return true;
 }
 
+static bool flow_dissector_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case bpf_ctx_range(struct __sk_buff, cb[0]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+	case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
+		return false;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
 const struct bpf_prog_ops sk_msg_prog_ops = {
 };
 
+const struct bpf_verifier_ops flow_dissector_verifier_ops = {
+	.get_func_proto		= flow_dissector_func_proto,
+	.is_valid_access	= flow_dissector_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+};
+
+const struct bpf_prog_ops flow_dissector_prog_ops = {
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index ce9eeeb7c024..767daa231f04 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -25,6 +25,11 @@
 #include <net/flow_dissector.h>
 #include <scsi/fc/fc_fcoe.h>
 #include <uapi/linux/batadv_packet.h>
+#include <linux/bpf.h>
+
+/* BPF program accessible by all flow dissectors */
+static struct bpf_prog __rcu *flow_dissector_prog;
+static DEFINE_MUTEX(flow_dissector_mutex);
 
 static void dissector_set_key(struct flow_dissector *flow_dissector,
 			      enum flow_dissector_key_id key_id)
@@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 }
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog)
+{
+	struct bpf_prog *attached;
+
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (attached) {
+		/* Only one BPF program can be attached at a time */
+		mutex_unlock(&flow_dissector_mutex);
+		return -EEXIST;
+	}
+	rcu_assign_pointer(flow_dissector_prog, prog);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct bpf_prog *attached;
+
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (!flow_dissector_prog) {
+		mutex_unlock(&flow_dissector_mutex);
+		return -EINVAL;
+	}
+	bpf_prog_put(attached);
+	RCU_INIT_POINTER(flow_dissector_prog, NULL);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
 /**
  * skb_flow_get_be16 - extract be16 entity
  * @skb: sk_buff to extract from
@@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_vlan *key_vlan;
 	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
+	struct bpf_prog *attached;
 	int num_hdrs = 0;
 	u8 ip_proto = 0;
 	bool ret;
@@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 					      FLOW_DISSECTOR_KEY_BASIC,
 					      target_container);
 
+	rcu_read_lock();
+	attached = rcu_dereference(flow_dissector_prog);
+	if (attached) {
+		/* Note that even though the const qualifier is discarded
+		 * throughout the execution of the BPF program, all changes(the
+		 * control block) are reverted after the BPF program returns.
+		 * Therefore, __skb_flow_dissect does not alter the skb.
+		 */
+		struct bpf_flow_dissect_cb *cb;
+		u8 cb_saved[BPF_SKB_CB_LEN];
+		u32 result;
+
+		cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
+
+		/* Save Control Block */
+		memcpy(cb_saved, cb, sizeof(cb_saved));
+		memset(cb, 0, sizeof(cb_saved));
+
+		/* Pass parameters to the BPF program */
+		cb->nhoff = nhoff;
+		cb->target_container = target_container;
+		cb->flow_dissector = flow_dissector;
+
+		bpf_compute_data_pointers((struct sk_buff *)skb);
+		result = BPF_PROG_RUN(attached, skb);
+
+		/* Restore state */
+		memcpy(cb, cb_saved, sizeof(cb_saved));
+
+		key_control->thoff = min_t(u16, key_control->thoff,
+					   skb ? skb->len : hlen);
+		rcu_read_unlock();
+		return result == BPF_OK;
+	}
+	rcu_read_unlock();
+
 	if (dissector_uses_key(flow_dissector,
 			       FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
 		struct ethhdr *eth = eth_hdr(skb);
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d22106..b1cd3bc8db70 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_FLOW_DISSECTOR]	= "flow_dissector",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..acd74a0dd063 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2226,7 +2228,8 @@ union bpf_attr {
 	FN(get_current_cgroup_id),	\
 	FN(get_local_storage),		\
 	FN(sk_select_reuseport),	\
-	FN(skb_ancestor_cgroup_id),
+	FN(skb_ancestor_cgroup_id),	\
+	FN(flow_dissector_write_keys),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f112627..0c749ce1b717 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2121,6 +2122,7 @@ static const struct {
 	BPF_PROG_SEC("sk_skb",		BPF_PROG_TYPE_SK_SKB),
 	BPF_PROG_SEC("sk_msg",		BPF_PROG_TYPE_SK_MSG),
 	BPF_PROG_SEC("lirc_mode2",	BPF_PROG_TYPE_LIRC_MODE2),
+	BPF_PROG_SEC("flow_dissector",	BPF_PROG_TYPE_FLOW_DISSECTOR),
 	BPF_SA_PROG_SEC("cgroup/bind4",	BPF_CGROUP_INET4_BIND),
 	BPF_SA_PROG_SEC("cgroup/bind6",	BPF_CGROUP_INET6_BIND),
 	BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e4be7730222d..4204c496a04f 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
 	(void *) BPF_FUNC_skb_cgroup_id;
 static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
 	(void *) BPF_FUNC_skb_ancestor_cgroup_id;
+static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
+					    int key) =
+	(void *) BPF_FUNC_flow_dissector_write_keys;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.18.0.865.gffc8e1a3cd6-goog

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [bpf-next RFC 2/3] flow_dissector: implements eBPF parser
  2018-08-16 16:44 [bpf-next RFC 0/3] Introduce eBPF flow dissector Petar Penkov
  2018-08-16 16:44 ` [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
@ 2018-08-16 16:44 ` Petar Penkov
  2018-08-18 15:50   ` Tom Herbert
  2018-08-16 16:44 ` [bpf-next RFC 3/3] selftests/bpf: test bpf flow dissection Petar Penkov
  2018-08-20 20:52 ` [bpf-next RFC 0/3] Introduce eBPF flow dissector Alexei Starovoitov
  3 siblings, 1 reply; 20+ messages in thread
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn

From: Petar Penkov <ppenkov@google.com>

This eBPF program extracts basic/control/ip address/ports keys from
incoming packets. It supports recursive parsing for IP
encapsulation, MPLS, GUE, and VLAN, along with IPv4/IPv6 and extension
headers. This program is meant to show how flow dissection and key
extraction can be done in eBPF.

It is initially meant to be used for demonstration rather than as a
complete replacement of the existing flow dissector.

This includes parsing of GUE and MPLS payload, which cannot be done
in production in general, as GUE tunnels and MPLS payloads cannot
unambiguously be detected in general.

In closed environments, however, it can be enabled. Another example
where the programmability of BPF aids flow dissection.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 542 +++++++++++++++++++++++++
 2 files changed, 543 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
 	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-	test_skb_cgroup_id_kern.o
+	test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index 000000000000..9c11c644b713
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,542 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/pkt_cls.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/icmp.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_packet.h>
+#include <sys/socket.h>
+#include <linux/if_tunnel.h>
+#include <linux/mpls.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Name is limited to 16 characters, with the terminating character and
+ * bpf_func_ above, we have only 6 to work with, anything after will be cropped.
+ */
+enum {
+	IP,
+	IPV6,
+	IPV6OP,	/* Destination/Hop-by-Hop Options IPv6 Extension header */
+	IPV6FR,	/* Fragmentation IPv6 Extension Header */
+	MPLS,
+	VLAN,
+	GUE,
+};
+
+#define IP_MF		0x2000
+#define IP_OFFSET	0x1FFF
+#define IP6_MF		0x0001
+#define IP6_OFFSET	0xFFF8
+
+struct vlan_hdr {
+	__be16 h_vlan_TCI;
+	__be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+	__be16 flags;
+	__be16 proto;
+};
+
+#define GUE_PORT 6080
+/* Taken from include/net/gue.h. Move that to uapi, instead? */
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:5,
+				control:1,
+				version:2;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:2,
+				control:1,
+				hlen:5;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8	proto_ctype;
+			__be16	flags;
+		};
+		__be32	word;
+	};
+};
+
+enum flow_dissector_key_id {
+	FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
+	FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
+	FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
+	FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
+	FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
+	FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
+	FLOW_DISSECTOR_KEY_TIPC, /* struct flow_dissector_key_tipc */
+	FLOW_DISSECTOR_KEY_ARP, /* struct flow_dissector_key_arp */
+	FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
+	FLOW_DISSECTOR_KEY_FLOW_LABEL, /* struct flow_dissector_key_flow_tags */
+	FLOW_DISSECTOR_KEY_GRE_KEYID, /* struct flow_dissector_key_keyid */
+	FLOW_DISSECTOR_KEY_MPLS_ENTROPY, /* struct flow_dissector_key_keyid */
+	FLOW_DISSECTOR_KEY_ENC_KEYID, /* struct flow_dissector_key_keyid */
+	FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
+	FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
+	FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */
+	FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */
+	FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
+	FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
+	FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
+	FLOW_DISSECTOR_KEY_CVLAN, /* struct flow_dissector_key_flow_vlan */
+
+	FLOW_DISSECTOR_KEY_MAX,
+};
+
+struct flow_dissector_key_control {
+	__u16	thoff;
+	__u16	addr_type;
+	__u32	flags;
+};
+
+#define FLOW_DIS_IS_FRAGMENT	(1 << 0)
+#define FLOW_DIS_FIRST_FRAG	(1 << 1)
+#define FLOW_DIS_ENCAPSULATION	(1 << 2)
+
+struct flow_dissector_key_basic {
+	__be16	n_proto;
+	__u8	ip_proto;
+	__u8	padding;
+};
+
+struct flow_dissector_key_ipv4_addrs {
+	__be32 src;
+	__be32 dst;
+};
+
+struct flow_dissector_key_ipv6_addrs {
+	struct in6_addr src;
+	struct in6_addr dst;
+};
+
+struct flow_dissector_key_addrs {
+	union {
+		struct flow_dissector_key_ipv4_addrs v4addrs;
+		struct flow_dissector_key_ipv6_addrs v6addrs;
+	};
+};
+
+struct flow_dissector_key_ports {
+	union {
+		__be32 ports;
+		struct {
+			__be16 src;
+			__be16 dst;
+		};
+	};
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 8
+};
+
+struct bpf_dissect_cb {
+	__u16 nhoff;
+	__u16 flags;
+};
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+	switch (proto) {
+	case bpf_htons(ETH_P_IP):
+		bpf_tail_call(skb, &jmp_table, IP);
+		break;
+	case bpf_htons(ETH_P_IPV6):
+		bpf_tail_call(skb, &jmp_table, IPV6);
+		break;
+	case bpf_htons(ETH_P_MPLS_MC):
+	case bpf_htons(ETH_P_MPLS_UC):
+		bpf_tail_call(skb, &jmp_table, MPLS);
+		break;
+	case bpf_htons(ETH_P_8021Q):
+	case bpf_htons(ETH_P_8021AD):
+		bpf_tail_call(skb, &jmp_table, VLAN);
+		break;
+	default:
+		/* Protocol not supported */
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline int write_ports(struct __sk_buff *skb, __u8 proto)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct flow_dissector_key_ports ports;
+
+	/* The supported protocols always start with the ports */
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &ports, sizeof(ports)))
+		return BPF_DROP;
+
+	if (proto == IPPROTO_UDP && ports.dst == bpf_htons(GUE_PORT)) {
+		/* GUE encapsulation */
+		cb->nhoff += sizeof(struct udphdr);
+		bpf_tail_call(skb, &jmp_table, GUE);
+		return BPF_DROP;
+	}
+
+	if (bpf_flow_dissector_write_keys(skb, &ports, sizeof(ports),
+					  FLOW_DISSECTOR_KEY_PORTS))
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+SEC("dissect")
+int dissect(struct __sk_buff *skb)
+{
+	if (!skb->vlan_present)
+		return parse_eth_proto(skb, skb->protocol);
+	else
+		return parse_eth_proto(skb, skb->vlan_proto);
+}
+
+/* Parses on IPPROTO_* */
+static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__u8 *data_end = (__u8 *)(long)skb->data_end;
+	__u8 *data = (__u8 *)(long)skb->data;
+	__u32 data_len = data_end - data;
+	struct gre_hdr gre;
+	struct ethhdr eth;
+	struct tcphdr tcp;
+
+	switch (proto) {
+	case IPPROTO_ICMP:
+		if (cb->nhoff + sizeof(struct icmphdr) > data_len)
+			return BPF_DROP;
+		return BPF_OK;
+	case IPPROTO_IPIP:
+		cb->flags |= FLOW_DIS_ENCAPSULATION;
+		bpf_tail_call(skb, &jmp_table, IP);
+		break;
+	case IPPROTO_IPV6:
+		cb->flags |= FLOW_DIS_ENCAPSULATION;
+		bpf_tail_call(skb, &jmp_table, IPV6);
+		break;
+	case IPPROTO_GRE:
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &gre, sizeof(gre)))
+			return BPF_DROP;
+
+		if (bpf_htons(gre.flags & GRE_VERSION))
+			/* Only inspect standard GRE packets with version 0 */
+			return BPF_OK;
+
+		cb->nhoff += sizeof(gre); /* Step over GRE Flags and Protocol */
+		if (GRE_IS_CSUM(gre.flags))
+			cb->nhoff += 4; /* Step over chksum and Padding */
+		if (GRE_IS_KEY(gre.flags))
+			cb->nhoff += 4; /* Step over key */
+		if (GRE_IS_SEQ(gre.flags))
+			cb->nhoff += 4; /* Step over sequence number */
+
+		cb->flags |= FLOW_DIS_ENCAPSULATION;
+
+		if (gre.proto == bpf_htons(ETH_P_TEB)) {
+			if (bpf_skb_load_bytes(skb, cb->nhoff, &eth,
+					       sizeof(eth)))
+				return BPF_DROP;
+
+			cb->nhoff += sizeof(eth);
+
+			return parse_eth_proto(skb, eth.h_proto);
+		} else {
+			return parse_eth_proto(skb, gre.proto);
+		}
+
+	case IPPROTO_TCP:
+		if (cb->nhoff + sizeof(struct tcphdr) > data_len)
+			return BPF_DROP;
+
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &tcp, sizeof(tcp)))
+			return BPF_DROP;
+
+		if (tcp.doff < 5)
+			return BPF_DROP;
+
+		if (cb->nhoff + (tcp.doff << 2) > data_len)
+			return BPF_DROP;
+
+		return write_ports(skb, proto);
+	case IPPROTO_UDP:
+	case IPPROTO_UDPLITE:
+		if (cb->nhoff + sizeof(struct udphdr) > data_len)
+			return BPF_DROP;
+
+		return write_ports(skb, proto);
+	default:
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct flow_dissector_key_control control;
+	struct flow_dissector_key_basic basic;
+
+	switch (nexthdr) {
+	case IPPROTO_HOPOPTS:
+	case IPPROTO_DSTOPTS:
+		bpf_tail_call(skb, &jmp_table, IPV6OP);
+		break;
+	case IPPROTO_FRAGMENT:
+		bpf_tail_call(skb, &jmp_table, IPV6FR);
+		break;
+	default:
+		control.thoff = cb->nhoff;
+		control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+		control.flags = cb->flags;
+		if (bpf_flow_dissector_write_keys(skb, &control,
+						  sizeof(control),
+						  FLOW_DISSECTOR_KEY_CONTROL))
+			return BPF_DROP;
+
+		memset(&basic, 0, sizeof(basic));
+		basic.n_proto = bpf_htons(ETH_P_IPV6);
+		basic.ip_proto = nexthdr;
+		if (bpf_flow_dissector_write_keys(skb, &basic, sizeof(basic),
+					      FLOW_DISSECTOR_KEY_BASIC))
+			return BPF_DROP;
+
+		return parse_ip_proto(skb, nexthdr);
+	}
+
+	return BPF_DROP;
+}
+
+PROG(IP)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__u8 *data_end = (__u8 *)(long)skb->data_end;
+	struct flow_dissector_key_control control;
+	struct flow_dissector_key_addrs addrs;
+	struct flow_dissector_key_basic basic;
+	__u8 *data = (__u8 *)(long)skb->data;
+	__u32 data_len = data_end - data;
+	bool done = false;
+	struct iphdr iph;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &iph, sizeof(iph)))
+		return BPF_DROP;
+
+	/* IP header cannot be smaller than 20 bytes */
+	if (iph.ihl < 5)
+		return BPF_DROP;
+
+	addrs.v4addrs.src = iph.saddr;
+	addrs.v4addrs.dst = iph.daddr;
+	if (bpf_flow_dissector_write_keys(skb, &addrs, sizeof(addrs.v4addrs),
+				      FLOW_DISSECTOR_KEY_IPV4_ADDRS))
+		return BPF_DROP;
+
+	cb->nhoff += iph.ihl << 2;
+	if (cb->nhoff > data_len)
+		return BPF_DROP;
+
+	if (iph.frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
+		cb->flags |= FLOW_DIS_IS_FRAGMENT;
+		if (iph.frag_off & bpf_htons(IP_OFFSET))
+			/* From second fragment on, packets do not have headers
+			 * we can parse.
+			 */
+			done = true;
+		else
+			cb->flags |= FLOW_DIS_FIRST_FRAG;
+	}
+
+
+	control.thoff = cb->nhoff;
+	control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+	control.flags = cb->flags;
+	if (bpf_flow_dissector_write_keys(skb, &control, sizeof(control),
+					  FLOW_DISSECTOR_KEY_CONTROL))
+		return BPF_DROP;
+
+	memset(&basic, 0, sizeof(basic));
+	basic.n_proto = bpf_htons(ETH_P_IP);
+	basic.ip_proto = iph.protocol;
+	if (bpf_flow_dissector_write_keys(skb, &basic, sizeof(basic),
+				      FLOW_DISSECTOR_KEY_BASIC))
+		return BPF_DROP;
+
+	if (done)
+		return BPF_OK;
+
+	return parse_ip_proto(skb, iph.protocol);
+}
+
+PROG(IPV6)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct flow_dissector_key_addrs addrs;
+	struct ipv6hdr ip6h;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &ip6h, sizeof(ip6h)))
+		return BPF_DROP;
+
+	addrs.v6addrs.src = ip6h.saddr;
+	addrs.v6addrs.dst = ip6h.daddr;
+	if (bpf_flow_dissector_write_keys(skb, &addrs, sizeof(addrs.v6addrs),
+				      FLOW_DISSECTOR_KEY_IPV6_ADDRS))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(struct ipv6hdr);
+
+	return parse_ipv6_proto(skb, ip6h.nexthdr);
+}
+
+PROG(IPV6OP)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__u8 proto;
+	__u8 hlen;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &proto, sizeof(proto)))
+		return BPF_DROP;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff + sizeof(proto), &hlen,
+			       sizeof(hlen)))
+		return BPF_DROP;
+	/* hlen is in 8-octects and does not include the first 8 bytes
+	 * of the header
+	 */
+	cb->nhoff += (1 + hlen) << 3;
+
+	return parse_ipv6_proto(skb, proto);
+}
+
+PROG(IPV6FR)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	__be16 frag_off;
+	__u8 proto;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &proto, sizeof(proto)))
+		return BPF_DROP;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff + 2, &frag_off, sizeof(frag_off)))
+		return BPF_DROP;
+
+	cb->nhoff += 8;
+	cb->flags |= FLOW_DIS_IS_FRAGMENT;
+	if (!(frag_off & bpf_htons(IP6_OFFSET)))
+		cb->flags |= FLOW_DIS_FIRST_FRAG;
+
+	return parse_ipv6_proto(skb, proto);
+}
+
+PROG(MPLS)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct mpls_label mpls;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &mpls, sizeof(mpls)))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(mpls);
+
+	if (mpls.entry & MPLS_LS_S_MASK) {
+		/* This is the last MPLS header. The network layer packet always
+		 * follows the MPLS header. Peek forward and dispatch based on
+		 * that.
+		 */
+		__u8 version;
+
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &version,
+				       sizeof(version)))
+			return BPF_DROP;
+
+		/* IP version is always the first 4 bits of the header */
+		switch (version & 0xF0) {
+		case 4:
+			bpf_tail_call(skb, &jmp_table, IP);
+			break;
+		case 6:
+			bpf_tail_call(skb, &jmp_table, IPV6);
+			break;
+		default:
+			return BPF_DROP;
+		}
+	} else {
+		bpf_tail_call(skb, &jmp_table, MPLS);
+	}
+
+	return BPF_DROP;
+}
+
+PROG(VLAN)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct vlan_hdr vlan;
+	__be16 proto;
+
+	/* Peek back to see if single or double-tagging */
+	if (bpf_skb_load_bytes(skb, cb->nhoff - sizeof(proto), &proto,
+			       sizeof(proto)))
+		return BPF_DROP;
+
+	/* Account for double-tagging */
+	if (proto == bpf_htons(ETH_P_8021AD)) {
+		if (bpf_skb_load_bytes(skb, cb->nhoff, &vlan, sizeof(vlan)))
+			return BPF_DROP;
+
+		if (vlan.h_vlan_encapsulated_proto != bpf_htons(ETH_P_8021Q))
+			return BPF_DROP;
+
+		cb->nhoff += sizeof(vlan);
+	}
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &vlan, sizeof(vlan)))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(vlan);
+	/* Only allow 8021AD + 8021Q double tagging and no triple tagging.*/
+	if (vlan.h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021AD) ||
+	    vlan.h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021Q))
+		return BPF_DROP;
+
+	return parse_eth_proto(skb, vlan.h_vlan_encapsulated_proto);
+}
+
+PROG(GUE)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct guehdr gue;
+
+	if (bpf_skb_load_bytes(skb, cb->nhoff, &gue, sizeof(gue)))
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(gue);
+	cb->nhoff += gue.hlen << 2;
+
+	cb->flags |= FLOW_DIS_ENCAPSULATION;
+	return parse_ip_proto(skb, gue.proto_ctype);
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.18.0.865.gffc8e1a3cd6-goog

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [bpf-next RFC 3/3] selftests/bpf: test bpf flow dissection
  2018-08-16 16:44 [bpf-next RFC 0/3] Introduce eBPF flow dissector Petar Penkov
  2018-08-16 16:44 ` [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
  2018-08-16 16:44 ` [bpf-next RFC 2/3] flow_dissector: implements eBPF parser Petar Penkov
@ 2018-08-16 16:44 ` Petar Penkov
  2018-08-20 20:52 ` [bpf-next RFC 0/3] Introduce eBPF flow dissector Alexei Starovoitov
  3 siblings, 0 replies; 20+ messages in thread
From: Petar Penkov @ 2018-08-16 16:44 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn

From: Petar Penkov <ppenkov@google.com>

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 49938d72cf63..e61a85ac4b79 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -19,3 +19,5 @@ test_btf
 test_sockmap
 test_lirc_mode2_user
 get_cgroup_id_user
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
 	test_tunnel.sh \
 	test_lwt_seg6local.sh \
 	test_lirc_mode2.sh \
-	test_skb_cgroup_id.sh
+	test_skb_cgroup_id.sh \
+	test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
+	flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index 000000000000..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <error.h>
+#include <errno.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+	struct bpf_program *prog, *main_prog;
+	struct bpf_map *prog_array;
+	int i, fd, prog_fd, ret;
+	struct bpf_object *obj;
+	int prog_array_fd;
+
+	ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+			    &prog_fd);
+	if (ret)
+		error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+	main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+	if (!main_prog)
+		error(1, 0, "bpf_object__find_program_by_title %s",
+		      cfg_section_name);
+
+	prog_fd = bpf_program__fd(main_prog);
+	if (prog_fd < 0)
+		error(1, 0, "bpf_program__fd");
+
+	prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+	if (!prog_array)
+		error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+	prog_array_fd = bpf_map__fd(prog_array);
+	if (prog_array_fd < 0)
+		error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+	i = 0;
+	bpf_object__for_each_program(prog, obj) {
+		fd = bpf_program__fd(prog);
+		if (fd < 0)
+			error(1, 0, "bpf_program__fd");
+
+		if (fd != prog_fd) {
+			printf("%d: %s\n", i, bpf_program__title(prog, false));
+			bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+			++i;
+		}
+	}
+
+	ret = bpf_prog_attach(prog_fd, 0 /* Ignore */, BPF_FLOW_DISSECTOR, 0);
+	if (ret)
+		error(1, 0, "bpf_prog_attach %s", cfg_path_name);
+
+	ret = bpf_object__pin(obj, cfg_pin_path);
+	if (ret)
+		error(1, 0, "bpf_object__pin %s", cfg_pin_path);
+
+}
+
+static void detach_program(void)
+{
+	char command[64];
+	int ret;
+
+	ret = bpf_prog_detach(0, BPF_FLOW_DISSECTOR);
+	if (ret)
+		error(1, 0, "bpf_prog_detach");
+
+	/* To unpin, it is necessary and sufficient to just remove this dir */
+	sprintf(command, "rm -r %s", cfg_pin_path);
+	ret = system(command);
+	if (ret)
+		error(1, errno, command);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	bool attach = false;
+	bool detach = false;
+	int c;
+
+	while ((c = getopt(argc, argv, "adp:s:")) != -1) {
+		switch (c) {
+		case 'a':
+			if (detach)
+				error(1, 0, "attach/detach are exclusive");
+			attach = true;
+			break;
+		case 'd':
+			if (attach)
+				error(1, 0, "attach/detach are exclusive");
+			detach = true;
+			break;
+		case 'p':
+			if (cfg_path_name)
+				error(1, 0, "only one prog name can be given");
+
+			cfg_path_name = optarg;
+			break;
+		case 's':
+			if (cfg_section_name)
+				error(1, 0, "only one section can be given");
+
+			cfg_section_name = optarg;
+			break;
+		}
+	}
+
+	if (detach)
+		cfg_attach = false;
+
+	if (cfg_attach && !cfg_path_name)
+		error(1, 0, "must provide a path to the BPF program");
+
+	if (cfg_attach && !cfg_section_name)
+		error(1, 0, "must provide a section name");
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	if (cfg_attach)
+		load_and_attach_program();
+	else
+		detach_program();
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.c b/tools/testing/selftests/bpf/test_flow_dissector.c
new file mode 100644
index 000000000000..12b784afba31
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.c
@@ -0,0 +1,782 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Inject packets with all sorts of encapsulation into the kernel.
+ *
+ * IPv4/IPv6	outer layer 3
+ * GRE/GUE/BARE outer layer 4, where bare is IPIP/SIT/IPv4-in-IPv6/..
+ * IPv4/IPv6    inner layer 3
+ */
+
+#define _GNU_SOURCE
+
+#include <stddef.h>
+#include <arpa/inet.h>
+#include <asm/byteorder.h>
+#include <error.h>
+#include <errno.h>
+#include <linux/if_packet.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ipv6.h>
+#include <netinet/ip.h>
+#include <netinet/in.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define CFG_PORT_INNER	8000
+
+/* Add some protocol definitions that do not exist in userspace */
+
+struct grehdr {
+	uint16_t unused;
+	uint16_t protocol;
+} __attribute__((packed));
+
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:5,
+				control:1,
+				version:2;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:2,
+				control:1,
+				hlen:5;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8	proto_ctype;
+			__be16	flags;
+		};
+		__be32	word;
+	};
+};
+
+static uint8_t	cfg_dsfield_inner;
+static uint8_t	cfg_dsfield_outer;
+static uint8_t	cfg_encap_proto;
+static bool	cfg_expect_failure = false;
+static int	cfg_l3_extra = AF_UNSPEC;	/* optional SIT prefix */
+static int	cfg_l3_inner = AF_UNSPEC;
+static int	cfg_l3_outer = AF_UNSPEC;
+static int	cfg_num_pkt = 10;
+static int	cfg_num_secs = 0;
+static char	cfg_payload_char = 'a';
+static int	cfg_payload_len = 100;
+static int	cfg_port_gue = 6080;
+static bool	cfg_only_rx;
+static bool	cfg_only_tx;
+static int	cfg_src_port = 9;
+
+static char	buf[ETH_DATA_LEN];
+
+#define INIT_ADDR4(name, addr4, port)				\
+	static struct sockaddr_in name = {			\
+		.sin_family = AF_INET,				\
+		.sin_port = __constant_htons(port),		\
+		.sin_addr.s_addr = __constant_htonl(addr4),	\
+	};
+
+#define INIT_ADDR6(name, addr6, port)				\
+	static struct sockaddr_in6 name = {			\
+		.sin6_family = AF_INET6,			\
+		.sin6_port = __constant_htons(port),		\
+		.sin6_addr = addr6,				\
+	};
+
+INIT_ADDR4(in_daddr4, INADDR_LOOPBACK, CFG_PORT_INNER)
+INIT_ADDR4(in_saddr4, INADDR_LOOPBACK + 2, 0)
+INIT_ADDR4(out_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(out_saddr4, INADDR_LOOPBACK + 1, 0)
+INIT_ADDR4(extra_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(extra_saddr4, INADDR_LOOPBACK + 1, 0)
+
+INIT_ADDR6(in_daddr6, IN6ADDR_LOOPBACK_INIT, CFG_PORT_INNER)
+INIT_ADDR6(in_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+
+static unsigned long util_gettime(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void util_printaddr(const char *msg, struct sockaddr *addr)
+{
+	unsigned long off = 0;
+	char nbuf[INET6_ADDRSTRLEN];
+
+	switch (addr->sa_family) {
+	case PF_INET:
+		off = __builtin_offsetof(struct sockaddr_in, sin_addr);
+		break;
+	case PF_INET6:
+		off = __builtin_offsetof(struct sockaddr_in6, sin6_addr);
+		break;
+	default:
+		error(1, 0, "printaddr: unsupported family %u\n",
+		      addr->sa_family);
+	}
+
+	if (!inet_ntop(addr->sa_family, ((void *) addr) + off, nbuf,
+		       sizeof(nbuf)))
+		error(1, errno, "inet_ntop");
+
+	fprintf(stderr, "%s: %s\n", msg, nbuf);
+}
+
+static unsigned long add_csum_hword(const uint16_t *start, int num_u16)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_u16; i++)
+		sum += start[i];
+
+	return sum;
+}
+
+static uint16_t build_ip_csum(const uint16_t *start, int num_u16,
+			      unsigned long sum)
+{
+	sum += add_csum_hword(start, num_u16);
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	return ~sum;
+}
+
+static void build_ipv4_header(void *header, uint8_t proto,
+			      uint32_t src, uint32_t dst,
+			      int payload_len, uint8_t tos)
+{
+	struct iphdr *iph = header;
+
+	iph->ihl = 5;
+	iph->version = 4;
+	iph->tos = tos;
+	iph->ttl = 8;
+	iph->tot_len = htons(sizeof(*iph) + payload_len);
+	iph->id = htons(1337);
+	iph->protocol = proto;
+	iph->saddr = src;
+	iph->daddr = dst;
+	iph->check = build_ip_csum((void *) iph, iph->ihl << 1, 0);
+}
+
+static void ipv6_set_dsfield(struct ipv6hdr *ip6h, uint8_t dsfield)
+{
+	uint16_t val, *ptr = (uint16_t *)ip6h;
+
+	val = ntohs(*ptr);
+	val &= 0xF00F;
+	val |= ((uint16_t) dsfield) << 4;
+	*ptr = htons(val);
+}
+
+static void build_ipv6_header(void *header, uint8_t proto,
+			      struct sockaddr_in6 *src,
+			      struct sockaddr_in6 *dst,
+			      int payload_len, uint8_t dsfield)
+{
+	struct ipv6hdr *ip6h = header;
+
+	ip6h->version = 6;
+	ip6h->payload_len = htons(payload_len);
+	ip6h->nexthdr = proto;
+	ip6h->hop_limit = 8;
+	ipv6_set_dsfield(ip6h, dsfield);
+
+	memcpy(&ip6h->saddr, &src->sin6_addr, sizeof(ip6h->saddr));
+	memcpy(&ip6h->daddr, &dst->sin6_addr, sizeof(ip6h->daddr));
+}
+
+static uint16_t build_udp_v4_csum(const struct iphdr *iph,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(iph->saddr);	/* halfwords: twice byte len */
+
+	pseudo_sum = add_csum_hword((void *) &iph->saddr, num_u16);
+	pseudo_sum += htons(IPPROTO_UDP);
+	pseudo_sum += udph->len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static uint16_t build_udp_v6_csum(const struct ipv6hdr *ip6h,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(ip6h->saddr);	/* halfwords: twice byte len */
+
+	pseudo_sum = add_csum_hword((void *) &ip6h->saddr, num_u16);
+	pseudo_sum += htons(ip6h->nexthdr);
+	pseudo_sum += ip6h->payload_len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static void build_udp_header(void *header, int payload_len,
+			     uint16_t dport, int family)
+{
+	struct udphdr *udph = header;
+	int len = sizeof(*udph) + payload_len;
+
+	udph->source = htons(cfg_src_port);
+	udph->dest = htons(dport);
+	udph->len = htons(len);
+	udph->check = 0;
+	if (family == AF_INET)
+		udph->check = build_udp_v4_csum(header - sizeof(struct iphdr),
+						udph, len >> 1);
+	else
+		udph->check = build_udp_v6_csum(header - sizeof(struct ipv6hdr),
+						udph, len >> 1);
+}
+
+static void build_gue_header(void *header, uint8_t proto)
+{
+	struct guehdr *gueh = header;
+
+	gueh->proto_ctype = proto;
+}
+
+static void build_gre_header(void *header, uint16_t proto)
+{
+	struct grehdr *greh = header;
+
+	greh->protocol = htons(proto);
+}
+
+static int l3_length(int family)
+{
+	if (family == AF_INET)
+		return sizeof(struct iphdr);
+	else
+		return sizeof(struct ipv6hdr);
+}
+
+static int build_packet(void)
+{
+	int ol3_len = 0, ol4_len = 0, il3_len = 0, il4_len = 0;
+	int el3_len = 0;
+
+	if (cfg_l3_extra)
+		el3_len = l3_length(cfg_l3_extra);
+
+	/* calculate header offsets */
+	if (cfg_encap_proto) {
+		ol3_len = l3_length(cfg_l3_outer);
+
+		if (cfg_encap_proto == IPPROTO_GRE)
+			ol4_len = sizeof(struct grehdr);
+		else if (cfg_encap_proto == IPPROTO_UDP)
+			ol4_len = sizeof(struct udphdr) + sizeof(struct guehdr);
+	}
+
+	il3_len = l3_length(cfg_l3_inner);
+	il4_len = sizeof(struct udphdr);
+
+	if (el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len >=
+	    sizeof(buf))
+		error(1, 0, "packet too large\n");
+
+	/*
+	 * Fill packet from inside out, to calculate correct checksums.
+	 * But create ip before udp headers, as udp uses ip for pseudo-sum.
+	 */
+	memset(buf + el3_len + ol3_len + ol4_len + il3_len + il4_len,
+	       cfg_payload_char, cfg_payload_len);
+
+	/* add zero byte for udp csum padding */
+	buf[el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len] = 0;
+
+	switch (cfg_l3_inner) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  in_saddr4.sin_addr.s_addr,
+				  in_daddr4.sin_addr.s_addr,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  &in_saddr6, &in_daddr6,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	}
+
+	build_udp_header(buf + el3_len + ol3_len + ol4_len + il3_len,
+			 cfg_payload_len, CFG_PORT_INNER, cfg_l3_inner);
+
+	if (!cfg_encap_proto)
+		return il3_len + il4_len + cfg_payload_len;
+
+	switch (cfg_l3_outer) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len, cfg_encap_proto,
+				  out_saddr4.sin_addr.s_addr,
+				  out_daddr4.sin_addr.s_addr,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len, cfg_encap_proto,
+				  &out_saddr6, &out_daddr6,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	}
+
+	switch (cfg_encap_proto) {
+	case IPPROTO_UDP:
+		build_gue_header(buf + el3_len + ol3_len + ol4_len -
+				 sizeof(struct guehdr),
+				 cfg_l3_inner == PF_INET ? IPPROTO_IPIP
+							 : IPPROTO_IPV6);
+		build_udp_header(buf + el3_len + ol3_len,
+				 sizeof(struct guehdr) + il3_len + il4_len +
+				 cfg_payload_len,
+				 cfg_port_gue, cfg_l3_outer);
+		break;
+	case IPPROTO_GRE:
+		build_gre_header(buf + el3_len + ol3_len,
+				 cfg_l3_inner == PF_INET ? ETH_P_IP
+							 : ETH_P_IPV6);
+		break;
+	}
+
+	switch (cfg_l3_extra) {
+	case PF_INET:
+		build_ipv4_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  extra_saddr4.sin_addr.s_addr,
+				  extra_daddr4.sin_addr.s_addr,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  &extra_saddr6, &extra_daddr6,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	}
+
+	return el3_len + ol3_len + ol4_len + il3_len + il4_len +
+	       cfg_payload_len;
+}
+
+/* sender transmits encapsulated over RAW or unencap'd over UDP */
+static int setup_tx(void)
+{
+	int family, fd, ret;
+
+	if (cfg_l3_extra)
+		family = cfg_l3_extra;
+	else if (cfg_l3_outer)
+		family = cfg_l3_outer;
+	else
+		family = cfg_l3_inner;
+
+	fd = socket(family, SOCK_RAW, IPPROTO_RAW);
+	if (fd == -1)
+		error(1, errno, "socket tx");
+
+	if (cfg_l3_extra) {
+		if (cfg_l3_extra == PF_INET)
+			ret = connect(fd, (void *) &extra_daddr4,
+				      sizeof(extra_daddr4));
+		else
+			ret = connect(fd, (void *) &extra_daddr6,
+				      sizeof(extra_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else if (cfg_l3_outer) {
+		/* connect to destination if not encapsulated */
+		if (cfg_l3_outer == PF_INET)
+			ret = connect(fd, (void *) &out_daddr4,
+				      sizeof(out_daddr4));
+		else
+			ret = connect(fd, (void *) &out_daddr6,
+				      sizeof(out_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else {
+		/* otherwise using loopback */
+		if (cfg_l3_inner == PF_INET)
+			ret = connect(fd, (void *) &in_daddr4,
+				      sizeof(in_daddr4));
+		else
+			ret = connect(fd, (void *) &in_daddr6,
+				      sizeof(in_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	}
+
+	return fd;
+}
+
+/* receiver reads unencapsulated UDP */
+static int setup_rx(void)
+{
+	int fd, ret;
+
+	fd = socket(cfg_l3_inner, SOCK_DGRAM, 0);
+	if (fd == -1)
+		error(1, errno, "socket rx");
+
+	if (cfg_l3_inner == PF_INET)
+		ret = bind(fd, (void *) &in_daddr4, sizeof(in_daddr4));
+	else
+		ret = bind(fd, (void *) &in_daddr6, sizeof(in_daddr6));
+	if (ret)
+		error(1, errno, "bind rx");
+
+	return fd;
+}
+
+static int do_tx(int fd, const char *pkt, int len)
+{
+	int ret;
+
+	ret = write(fd, pkt, len);
+	if (ret == -1)
+		error(1, errno, "send");
+	if (ret != len)
+		error(1, errno, "send: len (%d < %d)\n", ret, len);
+
+	return 1;
+}
+
+static int do_poll(int fd, short events, int timeout)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.fd = fd;
+	pfd.events = events;
+
+	ret = poll(&pfd, 1, timeout);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret && !(pfd.revents & POLLIN))
+		error(1, errno, "poll: unexpected event 0x%x\n", pfd.revents);
+
+	return ret;
+}
+
+static int do_rx(int fd)
+{
+	char rbuf;
+	int ret, num = 0;
+
+	while (1) {
+		ret = recv(fd, &rbuf, 1, MSG_DONTWAIT);
+		if (ret == -1 && errno == EAGAIN)
+			break;
+		if (ret == -1)
+			error(1, errno, "recv");
+		if (rbuf != cfg_payload_char)
+			error(1, 0, "recv: payload mismatch");
+		num++;
+	};
+
+	return num;
+}
+
+static int do_main(void)
+{
+	unsigned long tstop, treport, tcur;
+	int fdt = -1, fdr = -1, len, tx = 0, rx = 0;
+
+	if (!cfg_only_tx)
+		fdr = setup_rx();
+	if (!cfg_only_rx)
+		fdt = setup_tx();
+
+	len = build_packet();
+
+	tcur = util_gettime();
+	treport = tcur + 1000;
+	tstop = tcur + (cfg_num_secs * 1000);
+
+	while (1) {
+		if (!cfg_only_rx)
+			tx += do_tx(fdt, buf, len);
+
+		if (!cfg_only_tx)
+			rx += do_rx(fdr);
+
+		if (cfg_num_secs) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+			if (tcur >= treport) {
+				fprintf(stderr, "pkts: tx=%u rx=%u\n", tx, rx);
+				tx = 0;
+				rx = 0;
+				treport = tcur + 1000;
+			}
+		} else {
+			if (tx == cfg_num_pkt)
+				break;
+		}
+	}
+
+	/* read straggler packets, if any */
+	if (rx < tx) {
+		tstop = util_gettime() + 100;
+		while (rx < tx) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+
+			do_poll(fdr, POLLIN, tstop - tcur);
+			rx += do_rx(fdr);
+		}
+	}
+
+	fprintf(stderr, "pkts: tx=%u rx=%u\n", tx, rx);
+
+	if (fdr != -1 && close(fdr))
+		error(1, errno, "close rx");
+	if (fdt != -1 && close(fdt))
+		error(1, errno, "close tx");
+
+	/*
+	 * success (== 0) only if received all packets
+	 * unless failure is expected, in which case none must arrive.
+	 */
+	if (cfg_expect_failure)
+		return rx != 0;
+	else
+		return rx != tx;
+}
+
+
+static void __attribute__((noreturn)) usage(const char *filepath)
+{
+	fprintf(stderr, "Usage: %s [-e gre|gue|bare|none] [-i 4|6] [-l len] "
+			"[-O 4|6] [-o 4|6] [-n num] [-t secs] [-R] [-T] "
+			"[-s <osrc> [-d <odst>] [-S <isrc>] [-D <idst>] "
+			"[-x <otos>] [-X <itos>] [-f <isport>] [-F]\n",
+		filepath);
+	exit(1);
+}
+
+static void parse_addr(int family, void *addr, const char *optarg)
+{
+	int ret;
+
+	ret = inet_pton(family, optarg, addr);
+	if (ret == -1)
+		error(1, errno, "inet_pton");
+	if (ret == 0)
+		error(1, 0, "inet_pton: bad string");
+}
+
+static void parse_addr4(struct sockaddr_in *addr, const char *optarg)
+{
+	parse_addr(AF_INET, &addr->sin_addr, optarg);
+}
+
+static void parse_addr6(struct sockaddr_in6 *addr, const char *optarg)
+{
+	parse_addr(AF_INET6, &addr->sin6_addr, optarg);
+}
+
+static int parse_protocol_family(const char *filepath, const char *optarg)
+{
+	if (!strcmp(optarg, "4"))
+		return PF_INET;
+	if (!strcmp(optarg, "6"))
+		return PF_INET6;
+
+	usage(filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "d:D:e:f:Fhi:l:n:o:O:Rs:S:t:Tx:X:")) != -1) {
+		switch (c) {
+		case 'd':
+			if (cfg_l3_outer == AF_UNSPEC)
+				error(1, 0, "-d must be preceded by -o");
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_daddr4, optarg);
+			else
+				parse_addr6(&out_daddr6, optarg);
+			break;
+		case 'D':
+			if (cfg_l3_inner == AF_UNSPEC)
+				error(1, 0, "-D must be preceded by -i");
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_daddr4, optarg);
+			else
+				parse_addr6(&in_daddr6, optarg);
+			break;
+		case 'e':
+			if (!strcmp(optarg, "gre"))
+				cfg_encap_proto = IPPROTO_GRE;
+			else if (!strcmp(optarg, "gue"))
+				cfg_encap_proto = IPPROTO_UDP;
+			else if (!strcmp(optarg, "bare"))
+				cfg_encap_proto = IPPROTO_IPIP;
+			else if (!strcmp(optarg, "none"))
+				cfg_encap_proto = IPPROTO_IP;	/* == 0 */
+			else
+				usage(argv[0]);
+			break;
+		case 'f':
+			cfg_src_port = strtol(optarg, NULL, 0);
+			break;
+		case 'F':
+			cfg_expect_failure = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			break;
+		case 'i':
+			if (!strcmp(optarg, "4"))
+				cfg_l3_inner = PF_INET;
+			else if (!strcmp(optarg, "6"))
+				cfg_l3_inner = PF_INET6;
+			else
+				usage(argv[0]);
+			break;
+		case 'l':
+			cfg_payload_len = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			cfg_num_pkt = strtol(optarg, NULL, 0);
+			break;
+		case 'o':
+			cfg_l3_outer = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'O':
+			cfg_l3_extra = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'R':
+			cfg_only_rx = true;
+			break;
+		case 's':
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_saddr4, optarg);
+			else
+				parse_addr6(&out_saddr6, optarg);
+			break;
+		case 'S':
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_saddr4, optarg);
+			else
+				parse_addr6(&in_saddr6, optarg);
+			break;
+		case 't':
+			cfg_num_secs = strtol(optarg, NULL, 0);
+			break;
+		case 'T':
+			cfg_only_tx = true;
+			break;
+		case 'x':
+			cfg_dsfield_outer = strtol(optarg, NULL, 0);
+			break;
+		case 'X':
+			cfg_dsfield_inner = strtol(optarg, NULL, 0);
+			break;
+		}
+	}
+
+	if (cfg_only_rx && cfg_only_tx)
+		error(1, 0, "options: cannot combine rx-only and tx-only");
+
+	if (cfg_encap_proto && cfg_l3_outer == AF_UNSPEC)
+		error(1, 0, "options: must specify outer with encap");
+	else if ((!cfg_encap_proto) && cfg_l3_outer != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and outer");
+	else if ((!cfg_encap_proto) && cfg_l3_extra != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and extra");
+
+	if (cfg_l3_inner == AF_UNSPEC)
+		cfg_l3_inner = AF_INET6;
+	if (cfg_l3_inner == AF_INET6 && cfg_encap_proto == IPPROTO_IPIP)
+		cfg_encap_proto = IPPROTO_IPV6;
+
+	/* RFC 6040 4.2:
+	 *   on decap, if outer encountered congestion (CE == 0x3),
+	 *   but inner cannot encode ECN (NoECT == 0x0), then drop packet.
+	 */
+	if (((cfg_dsfield_outer & 0x3) == 0x3) &&
+	    ((cfg_dsfield_inner & 0x3) == 0x0))
+		cfg_expect_failure = true;
+}
+
+static void print_opts(void)
+{
+	if (cfg_l3_inner == PF_INET6) {
+		util_printaddr("inner.dest6", (void *) &in_daddr6);
+		util_printaddr("inner.source6", (void *) &in_saddr6);
+	} else {
+		util_printaddr("inner.dest4", (void *) &in_daddr4);
+		util_printaddr("inner.source4", (void *) &in_saddr4);
+	}
+
+	if (!cfg_l3_outer)
+		return;
+
+	fprintf(stderr, "encap proto:   %u\n", cfg_encap_proto);
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("outer.dest6", (void *) &out_daddr6);
+		util_printaddr("outer.source6", (void *) &out_saddr6);
+	} else {
+		util_printaddr("outer.dest4", (void *) &out_daddr4);
+		util_printaddr("outer.source4", (void *) &out_saddr4);
+	}
+
+	if (!cfg_l3_extra)
+		return;
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("extra.dest6", (void *) &extra_daddr6);
+		util_printaddr("extra.source6", (void *) &extra_saddr6);
+	} else {
+		util_printaddr("extra.dest4", (void *) &extra_daddr4);
+		util_printaddr("extra.source4", (void *) &extra_saddr4);
+	}
+
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	print_opts();
+	return do_main();
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh b/tools/testing/selftests/bpf/test_flow_dissector.sh
new file mode 100755
index 000000000000..c0fb073b5eab
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.sh
@@ -0,0 +1,115 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Load BPF flow dissector and verify it correctly dissects traffic
+export TESTNAME=test_flow_dissector
+unmount=0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+msg="skip all tests:"
+if [ $UID != 0 ]; then
+	echo $msg please run this as root >&2
+	exit $ksft_skip
+fi
+
+# This test needs to be run in a network namespace with in_netns.sh. Check if
+# this is the case and run it with in_netns.sh if it is being run in the root
+# namespace.
+if [[ -z $(ip netns identify $$) ]]; then
+	../net/in_netns.sh "$0" "$@"
+	exit $?
+fi
+
+# Determine selftest success via shell exit code
+exit_handler()
+{
+	if (( $? == 0 )); then
+		echo "selftests: $TESTNAME [PASS]";
+	else
+		echo "selftests: $TESTNAME [FAILED]";
+	fi
+
+	set +e
+
+	# Cleanup
+	tc filter del dev lo ingress pref 1337 2> /dev/null
+	tc qdisc del dev lo ingress 2> /dev/null
+	./flow_dissector_load -d 2> /dev/null
+	if [ $unmount -ne 0 ]; then
+		umount bpffs 2> /dev/null
+	fi
+}
+
+# Exit script immediately (well catched by trap handler) if any
+# program/thing exits with a non-zero status.
+set -e
+
+# (Use 'trap -l' to list meaning of numbers)
+trap exit_handler 0 2 3 6 9
+
+# Mount BPF file system
+if /bin/mount | grep /sys/fs/bpf > /dev/null; then
+	echo "bpffs already mounted"
+else
+	echo "bpffs not mounted. Mounting..."
+	unmount=1
+	/bin/mount bpffs /sys/fs/bpf -t bpf
+fi
+
+# Attach BPF program
+./flow_dissector_load -p bpf_flow.o -s dissect
+
+# Setup
+tc qdisc add dev lo ingress
+
+echo "Testing IPv4..."
+# Drops all IP/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ip pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv4/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 4 -f 8
+# Send 10 IPv4/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 4 -f 9 -F
+# Send 10 IPv4/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 4 -f 10
+
+echo "Testing IPIP..."
+# Send 10 IPv4/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+echo "Testing IPv4 + GRE..."
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+tc filter del dev lo ingress pref 1337
+
+echo "Testing IPv6..."
+# Drops all IPv6/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ipv6 pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv6/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 6 -f 8
+# Send 10 IPv6/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 6 -f 9 -F
+# Send 10 IPv6/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 6 -f 10
+
+exit 0
diff --git a/tools/testing/selftests/bpf/with_addr.sh b/tools/testing/selftests/bpf/with_addr.sh
new file mode 100755
index 000000000000..ffcd3953f94c
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_addr.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# add private ipv4 and ipv6 addresses to loopback
+
+readonly V6_INNER='100::a/128'
+readonly V4_INNER='192.168.0.1/32'
+
+if getopts ":s" opt; then
+  readonly SIT_DEV_NAME='sixtofourtest0'
+  readonly V6_SIT='2::/64'
+  readonly V4_SIT='172.17.0.1/32'
+  shift
+fi
+
+fail() {
+  echo "error: $*" 1>&2
+  exit 1
+}
+
+setup() {
+  ip -6 addr add "${V6_INNER}" dev lo || fail 'failed to setup v6 address'
+  ip -4 addr add "${V4_INNER}" dev lo || fail 'failed to setup v4 address'
+
+  if [[ -n "${V6_SIT}" ]]; then
+    ip link add "${SIT_DEV_NAME}" type sit remote any local any \
+	    || fail 'failed to add sit'
+    ip link set dev "${SIT_DEV_NAME}" up \
+	    || fail 'failed to bring sit device up'
+    ip -6 addr add "${V6_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v6 SIT address'
+    ip -4 addr add "${V4_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v4 SIT address'
+  fi
+
+  sleep 2	# avoid race causing bind to fail
+}
+
+cleanup() {
+  if [[ -n "${V6_SIT}" ]]; then
+    ip -4 addr del "${V4_SIT}" dev "${SIT_DEV_NAME}"
+    ip -6 addr del "${V6_SIT}" dev "${SIT_DEV_NAME}"
+    ip link del "${SIT_DEV_NAME}"
+  fi
+
+  ip -4 addr del "${V4_INNER}" dev lo
+  ip -6 addr del "${V6_INNER}" dev lo
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
diff --git a/tools/testing/selftests/bpf/with_tunnels.sh b/tools/testing/selftests/bpf/with_tunnels.sh
new file mode 100755
index 000000000000..e24949ed3a20
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_tunnels.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# setup tunnels for flow dissection test
+
+readonly SUFFIX="test_$(mktemp -u XXXX)"
+CONFIG="remote 127.0.0.2 local 127.0.0.1 dev lo"
+
+setup() {
+  ip link add "ipip_${SUFFIX}" type ipip ${CONFIG}
+  ip link add "gre_${SUFFIX}" type gre ${CONFIG}
+  ip link add "sit_${SUFFIX}" type sit ${CONFIG}
+
+  echo "tunnels before test:"
+  ip tunnel show
+
+  ip link set "ipip_${SUFFIX}" up
+  ip link set "gre_${SUFFIX}" up
+  ip link set "sit_${SUFFIX}" up
+}
+
+
+cleanup() {
+  ip tunnel del "ipip_${SUFFIX}"
+  ip tunnel del "gre_${SUFFIX}"
+  ip tunnel del "sit_${SUFFIX}"
+
+  echo "tunnels after test:"
+  ip tunnel show
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
-- 
2.18.0.865.gffc8e1a3cd6-goog

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-16 16:44 ` [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
@ 2018-08-16 18:34   ` Edward Cree
  2018-08-16 22:37     ` Willem de Bruijn
  2018-08-16 22:40   ` Song Liu
  1 sibling, 1 reply; 20+ messages in thread
From: Edward Cree @ 2018-08-16 18:34 UTC (permalink / raw)
  To: Petar Penkov, netdev
  Cc: davem, ast, daniel, simon.horman, Petar Penkov, Willem de Bruijn

On 16/08/18 17:44, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is kept as a global variable so it is
> accessible to all flow dissectors.
>
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---

This looks really great.

> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> +{
> +	struct bpf_prog *attached;
> +
> +	mutex_lock(&flow_dissector_mutex);
> +	attached = rcu_dereference_protected(flow_dissector_prog,
> +					     lockdep_is_held(&flow_dissector_mutex));
> +	if (!flow_dissector_prog) {
> +		mutex_unlock(&flow_dissector_mutex);
> +		return -EINVAL;
Wouldn't -ENOENT be more usual here (as the counterpart to -EEXIST in
 the skb_flow_dissector_bpf_prog_attach() version just above)?

-Ed

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-16 18:34   ` Edward Cree
@ 2018-08-16 22:37     ` Willem de Bruijn
  0 siblings, 0 replies; 20+ messages in thread
From: Willem de Bruijn @ 2018-08-16 22:37 UTC (permalink / raw)
  To: ecree
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, Petar Penkov,
	Willem de Bruijn

On Thu, Aug 16, 2018 at 2:51 PM Edward Cree <ecree@solarflare.com> wrote:
>
> On 16/08/18 17:44, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is kept as a global variable so it is
> > accessible to all flow dissectors.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
>
> This looks really great.
>
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> > +{
> > +     struct bpf_prog *attached;
> > +
> > +     mutex_lock(&flow_dissector_mutex);
> > +     attached = rcu_dereference_protected(flow_dissector_prog,
> > +                                          lockdep_is_held(&flow_dissector_mutex));
> > +     if (!flow_dissector_prog) {
> > +             mutex_unlock(&flow_dissector_mutex);
> > +             return -EINVAL;
> Wouldn't -ENOENT be more usual here (as the counterpart to -EEXIST in
>  the skb_flow_dissector_bpf_prog_attach() version just above)?

Absolutely. That better matches bpf_detach behavior, too.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-16 16:44 ` [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
  2018-08-16 18:34   ` Edward Cree
@ 2018-08-16 22:40   ` Song Liu
  2018-08-16 23:14     ` Petar Penkov
  1 sibling, 1 reply; 20+ messages in thread
From: Song Liu @ 2018-08-16 22:40 UTC (permalink / raw)
  To: Petar Penkov
  Cc: Networking, David S . Miller, Alexei Starovoitov, Daniel Borkmann,
	simon.horman, Petar Penkov, Willem de Bruijn

On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> path. The BPF program is kept as a global variable so it is
> accessible to all flow dissectors.
>
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  include/linux/bpf_types.h                 |   1 +
>  include/linux/skbuff.h                    |   7 +
>  include/net/flow_dissector.h              |  16 +++
>  include/uapi/linux/bpf.h                  |  14 +-
>  kernel/bpf/syscall.c                      |   8 ++
>  kernel/bpf/verifier.c                     |   2 +
>  net/core/filter.c                         | 157 ++++++++++++++++++++++
>  net/core/flow_dissector.c                 |  76 +++++++++++
>  tools/bpf/bpftool/prog.c                  |   1 +
>  tools/include/uapi/linux/bpf.h            |   5 +-
>  tools/lib/bpf/libbpf.c                    |   2 +
>  tools/testing/selftests/bpf/bpf_helpers.h |   3 +
>  12 files changed, 290 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index cd26c090e7c0..22083712dd18 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
>  #ifdef CONFIG_INET
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
>  #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
>
>  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 17a13e4785fc..ce0e863f02a2 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -243,6 +243,8 @@ struct scatterlist;
>  struct pipe_inode_info;
>  struct iov_iter;
>  struct napi_struct;
> +struct bpf_prog;
> +union bpf_attr;
>
>  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
>  struct nf_conntrack {
> @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>                              const struct flow_dissector_key *key,
>                              unsigned int key_count);
>
> +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> +                                      struct bpf_prog *prog);
> +
> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
> +
>  bool __skb_flow_dissect(const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
>                         void *target_container,
> diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> index 6a4586dcdede..edb919d320c1 100644
> --- a/include/net/flow_dissector.h
> +++ b/include/net/flow_dissector.h
> @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
>  extern struct flow_dissector flow_keys_dissector;
>  extern struct flow_dissector flow_keys_basic_dissector;
>
> +/* struct bpf_flow_dissect_cb:
> + *
> + * This struct is used to pass parameters to BPF programs of type
> + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
> + * the control block of the skb to be a struct of this type. The first field is
> + * used to communicate the next header offset between the BPF programs and the
> + * first value of it is passed from the kernel. The last two fields are used for
> + * writing out flow keys.
> + */
> +struct bpf_flow_dissect_cb {
> +       u16 nhoff;
> +       u16 unused;
> +       void *target_container;
> +       struct flow_dissector *flow_dissector;
> +};
> +
>  /* struct flow_keys_digest:
>   *
>   * This structure is used to hold a digest of the full flow keys. This is a
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 66917a4eba27..8bc0fdab685d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_LWT_SEG6LOCAL,
>         BPF_PROG_TYPE_LIRC_MODE2,
>         BPF_PROG_TYPE_SK_REUSEPORT,
> +       BPF_PROG_TYPE_FLOW_DISSECTOR,
>  };
>
>  enum bpf_attach_type {
> @@ -172,6 +173,7 @@ enum bpf_attach_type {
>         BPF_CGROUP_UDP4_SENDMSG,
>         BPF_CGROUP_UDP6_SENDMSG,
>         BPF_LIRC_MODE2,
> +       BPF_FLOW_DISSECTOR,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -2141,6 +2143,15 @@ union bpf_attr {
>   *             request in the skb.
>   *     Return
>   *             0 on success, or a negative error in case of failure.
> + *
> + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
> + *     Description
> + *             Try to write *len* bytes from the source pointer into the offset
> + *             of the key with id *key_id*. If *len* is different from the
> + *             size of the key, an error is returned. If the key is not used,
> + *             this function exits with no effect and code 0.
> + *     Return
> + *             0 on success, negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2226,7 +2237,8 @@ union bpf_attr {
>         FN(get_current_cgroup_id),      \
>         FN(get_local_storage),          \
>         FN(sk_select_reuseport),        \
> -       FN(skb_ancestor_cgroup_id),
> +       FN(skb_ancestor_cgroup_id),     \
> +       FN(flow_dissector_write_keys),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 43727ed0d94a..a06568841a92 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         case BPF_LIRC_MODE2:
>                 ptype = BPF_PROG_TYPE_LIRC_MODE2;
>                 break;
> +       case BPF_FLOW_DISSECTOR:
> +               ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
> +               break;
>         default:
>                 return -EINVAL;
>         }
> @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>         case BPF_PROG_TYPE_LIRC_MODE2:
>                 ret = lirc_prog_attach(attr, prog);
>                 break;
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> +               ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> +               break;
>         default:
>                 ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>         }
> @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>                 return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
>         case BPF_LIRC_MODE2:
>                 return lirc_prog_detach(attr);
> +       case BPF_FLOW_DISSECTOR:
> +               return skb_flow_dissector_bpf_prog_detach(attr);
>         default:
>                 return -EINVAL;
>         }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ca90679a7fe5..6d3f268fa8e0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>         case BPF_PROG_TYPE_LWT_XMIT:
>         case BPF_PROG_TYPE_SK_SKB:
>         case BPF_PROG_TYPE_SK_MSG:
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>                 if (meta)
>                         return meta->pkt_access;
>
> @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_SOCKET_FILTER:
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>                 return true;
>         default:
>                 return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index fd423ce3da34..03d3037e6508 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
>         return false;
>  }
>
> +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
> +          const void *, from, u32, len, enum flow_dissector_key_id, key_id)
> +{
> +       struct bpf_flow_dissect_cb *cb;
> +       void *dest;
> +
> +       cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
> +
> +       /* Make sure the dissector actually uses the key. It is not an error if
> +        * it does not, but we should not continue past this point in that case
> +        */
> +       if (!dissector_uses_key(cb->flow_dissector, key_id))
> +               return 0;
> +
> +       /* Make sure the length is correct */
> +       switch (key_id) {
> +       case FLOW_DISSECTOR_KEY_CONTROL:
> +       case FLOW_DISSECTOR_KEY_ENC_CONTROL:
> +               if (len != sizeof(struct flow_dissector_key_control))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_BASIC:
> +               if (len != sizeof(struct flow_dissector_key_basic))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> +       case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
> +               if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> +       case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
> +               if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_ICMP:
> +               if (len != sizeof(struct flow_dissector_key_icmp))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_PORTS:
> +       case FLOW_DISSECTOR_KEY_ENC_PORTS:
> +               if (len != sizeof(struct flow_dissector_key_ports))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_ETH_ADDRS:
> +               if (len != sizeof(struct flow_dissector_key_eth_addrs))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_TIPC:
> +               if (len != sizeof(struct flow_dissector_key_tipc))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_ARP:
> +               if (len != sizeof(struct flow_dissector_key_arp))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_VLAN:
> +       case FLOW_DISSECTOR_KEY_CVLAN:
> +               if (len != sizeof(struct flow_dissector_key_vlan))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_FLOW_LABEL:
> +               if (len != sizeof(struct flow_dissector_key_tags))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_GRE_KEYID:
> +       case FLOW_DISSECTOR_KEY_ENC_KEYID:
> +       case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
> +               if (len != sizeof(struct flow_dissector_key_keyid))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_MPLS:
> +               if (len != sizeof(struct flow_dissector_key_mpls))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_TCP:
> +               if (len != sizeof(struct flow_dissector_key_tcp))
> +                       return -EINVAL;
> +               break;
> +       case FLOW_DISSECTOR_KEY_IP:
> +       case FLOW_DISSECTOR_KEY_ENC_IP:
> +               if (len != sizeof(struct flow_dissector_key_ip))
> +                       return -EINVAL;
> +               break;
> +       default:
> +               return -EINVAL;
> +       }
> +
> +       dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
> +                                        cb->target_container);
> +
> +       memcpy(dest, from, len);
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
> +       .func           = bpf_flow_dissector_write_keys,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_PTR_TO_MEM,
> +       .arg3_type      = ARG_CONST_SIZE,
> +       .arg4_type      = ARG_ANYTHING,
> +};
> +
>  static const struct bpf_func_proto *
>  bpf_base_func_proto(enum bpf_func_id func_id)
>  {
> @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_skb_load_bytes:
> +               return &bpf_skb_load_bytes_proto;
> +       case BPF_FUNC_flow_dissector_write_keys:
> +               return &bpf_flow_dissector_write_keys_proto;
> +       default:
> +               return bpf_base_func_proto(func_id);
> +       }
> +}
> +
>  static const struct bpf_func_proto *
>  lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  {
> @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
>         return true;
>  }
>
> +static bool flow_dissector_is_valid_access(int off, int size,
> +                                          enum bpf_access_type type,
> +                                          const struct bpf_prog *prog,
> +                                          struct bpf_insn_access_aux *info)
> +{
> +       if (type == BPF_WRITE) {
> +               switch (off) {
> +               case bpf_ctx_range(struct __sk_buff, cb[0]):
> +                       break;
> +               default:
> +                       return false;
> +               }
> +       }
> +
> +       switch (off) {
> +       case bpf_ctx_range(struct __sk_buff, data):
> +               info->reg_type = PTR_TO_PACKET;
> +               break;
> +       case bpf_ctx_range(struct __sk_buff, data_end):
> +               info->reg_type = PTR_TO_PACKET_END;
> +               break;
> +       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> +       case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
> +               return false;
> +       }
> +
> +       return bpf_skb_is_valid_access(off, size, type, prog, info);
> +}
> +
>  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>                                   const struct bpf_insn *si,
>                                   struct bpf_insn *insn_buf,
> @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
>  const struct bpf_prog_ops sk_msg_prog_ops = {
>  };
>
> +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
> +       .get_func_proto         = flow_dissector_func_proto,
> +       .is_valid_access        = flow_dissector_is_valid_access,
> +       .convert_ctx_access     = bpf_convert_ctx_access,
> +       .gen_ld_abs             = bpf_gen_ld_abs,
> +};
> +
> +const struct bpf_prog_ops flow_dissector_prog_ops = {
> +};
> +
>  int sk_detach_filter(struct sock *sk)
>  {
>         int ret = -ENOENT;
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index ce9eeeb7c024..767daa231f04 100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -25,6 +25,11 @@
>  #include <net/flow_dissector.h>
>  #include <scsi/fc/fc_fcoe.h>
>  #include <uapi/linux/batadv_packet.h>
> +#include <linux/bpf.h>
> +
> +/* BPF program accessible by all flow dissectors */
> +static struct bpf_prog __rcu *flow_dissector_prog;
> +static DEFINE_MUTEX(flow_dissector_mutex);
>
>  static void dissector_set_key(struct flow_dissector *flow_dissector,
>                               enum flow_dissector_key_id key_id)
> @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>  }
>  EXPORT_SYMBOL(skb_flow_dissector_init);
>
> +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> +                                      struct bpf_prog *prog)
> +{
> +       struct bpf_prog *attached;
> +
> +       mutex_lock(&flow_dissector_mutex);
> +       attached = rcu_dereference_protected(flow_dissector_prog,
> +                                            lockdep_is_held(&flow_dissector_mutex));
> +       if (attached) {
> +               /* Only one BPF program can be attached at a time */
> +               mutex_unlock(&flow_dissector_mutex);
> +               return -EEXIST;
> +       }
> +       rcu_assign_pointer(flow_dissector_prog, prog);
> +       mutex_unlock(&flow_dissector_mutex);
> +       return 0;
> +}
> +
> +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> +{
> +       struct bpf_prog *attached;
> +
> +       mutex_lock(&flow_dissector_mutex);
> +       attached = rcu_dereference_protected(flow_dissector_prog,
> +                                            lockdep_is_held(&flow_dissector_mutex));
> +       if (!flow_dissector_prog) {
> +               mutex_unlock(&flow_dissector_mutex);
> +               return -EINVAL;
> +       }
> +       bpf_prog_put(attached);
> +       RCU_INIT_POINTER(flow_dissector_prog, NULL);
> +       mutex_unlock(&flow_dissector_mutex);
> +       return 0;
> +}
>  /**
>   * skb_flow_get_be16 - extract be16 entity
>   * @skb: sk_buff to extract from
> @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>         struct flow_dissector_key_vlan *key_vlan;
>         enum flow_dissect_ret fdret;
>         enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
> +       struct bpf_prog *attached;
>         int num_hdrs = 0;
>         u8 ip_proto = 0;
>         bool ret;
> @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>                                               FLOW_DISSECTOR_KEY_BASIC,
>                                               target_container);
>
> +       rcu_read_lock();
> +       attached = rcu_dereference(flow_dissector_prog);
> +       if (attached) {
> +               /* Note that even though the const qualifier is discarded
> +                * throughout the execution of the BPF program, all changes(the
> +                * control block) are reverted after the BPF program returns.
> +                * Therefore, __skb_flow_dissect does not alter the skb.
> +                */
> +               struct bpf_flow_dissect_cb *cb;
> +               u8 cb_saved[BPF_SKB_CB_LEN];
> +               u32 result;
> +
> +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> +
> +               /* Save Control Block */
> +               memcpy(cb_saved, cb, sizeof(cb_saved));
> +               memset(cb, 0, sizeof(cb_saved));
> +
> +               /* Pass parameters to the BPF program */
> +               cb->nhoff = nhoff;
> +               cb->target_container = target_container;
> +               cb->flow_dissector = flow_dissector;
> +
> +               bpf_compute_data_pointers((struct sk_buff *)skb);
> +               result = BPF_PROG_RUN(attached, skb);
> +
> +               /* Restore state */
> +               memcpy(cb, cb_saved, sizeof(cb_saved));
> +
> +               key_control->thoff = min_t(u16, key_control->thoff,
> +                                          skb ? skb->len : hlen);
> +               rcu_read_unlock();
> +               return result == BPF_OK;
> +       }

If the BPF program cannot handle certain protocol, shall we fall back
to the built-in logic? Otherwise, all BPF programs need to have some
code for all protocols.

Song

> +       rcu_read_unlock();
> +
>         if (dissector_uses_key(flow_dissector,
>                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>                 struct ethhdr *eth = eth_hdr(skb);
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index dce960d22106..b1cd3bc8db70 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
>         [BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
>         [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
>         [BPF_PROG_TYPE_LIRC_MODE2]      = "lirc_mode2",
> +       [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
>  };
>
>  static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 66917a4eba27..acd74a0dd063 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_LWT_SEG6LOCAL,
>         BPF_PROG_TYPE_LIRC_MODE2,
>         BPF_PROG_TYPE_SK_REUSEPORT,
> +       BPF_PROG_TYPE_FLOW_DISSECTOR,
>  };
>
>  enum bpf_attach_type {
> @@ -172,6 +173,7 @@ enum bpf_attach_type {
>         BPF_CGROUP_UDP4_SENDMSG,
>         BPF_CGROUP_UDP6_SENDMSG,
>         BPF_LIRC_MODE2,
> +       BPF_FLOW_DISSECTOR,
>         __MAX_BPF_ATTACH_TYPE
>  };
>
> @@ -2226,7 +2228,8 @@ union bpf_attr {
>         FN(get_current_cgroup_id),      \
>         FN(get_local_storage),          \
>         FN(sk_select_reuseport),        \
> -       FN(skb_ancestor_cgroup_id),
> +       FN(skb_ancestor_cgroup_id),     \
> +       FN(flow_dissector_write_keys),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 2abd0f112627..0c749ce1b717 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>         case BPF_PROG_TYPE_LIRC_MODE2:
>         case BPF_PROG_TYPE_SK_REUSEPORT:
> +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>                 return false;
>         case BPF_PROG_TYPE_UNSPEC:
>         case BPF_PROG_TYPE_KPROBE:
> @@ -2121,6 +2122,7 @@ static const struct {
>         BPF_PROG_SEC("sk_skb",          BPF_PROG_TYPE_SK_SKB),
>         BPF_PROG_SEC("sk_msg",          BPF_PROG_TYPE_SK_MSG),
>         BPF_PROG_SEC("lirc_mode2",      BPF_PROG_TYPE_LIRC_MODE2),
> +       BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
>         BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
>         BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
>         BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
> diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> index e4be7730222d..4204c496a04f 100644
> --- a/tools/testing/selftests/bpf/bpf_helpers.h
> +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> @@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
>         (void *) BPF_FUNC_skb_cgroup_id;
>  static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
>         (void *) BPF_FUNC_skb_ancestor_cgroup_id;
> +static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
> +                                           int key) =
> +       (void *) BPF_FUNC_flow_dissector_write_keys;
>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> --
> 2.18.0.865.gffc8e1a3cd6-goog
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-16 22:40   ` Song Liu
@ 2018-08-16 23:14     ` Petar Penkov
  2018-08-20  5:44       ` Song Liu
  0 siblings, 1 reply; 20+ messages in thread
From: Petar Penkov @ 2018-08-16 23:14 UTC (permalink / raw)
  To: Song Liu
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is kept as a global variable so it is
> > accessible to all flow dissectors.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/linux/bpf_types.h                 |   1 +
> >  include/linux/skbuff.h                    |   7 +
> >  include/net/flow_dissector.h              |  16 +++
> >  include/uapi/linux/bpf.h                  |  14 +-
> >  kernel/bpf/syscall.c                      |   8 ++
> >  kernel/bpf/verifier.c                     |   2 +
> >  net/core/filter.c                         | 157 ++++++++++++++++++++++
> >  net/core/flow_dissector.c                 |  76 +++++++++++
> >  tools/bpf/bpftool/prog.c                  |   1 +
> >  tools/include/uapi/linux/bpf.h            |   5 +-
> >  tools/lib/bpf/libbpf.c                    |   2 +
> >  tools/testing/selftests/bpf/bpf_helpers.h |   3 +
> >  12 files changed, 290 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > index cd26c090e7c0..22083712dd18 100644
> > --- a/include/linux/bpf_types.h
> > +++ b/include/linux/bpf_types.h
> > @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
> >  #ifdef CONFIG_INET
> >  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
> >  #endif
> > +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
> >
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> >  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 17a13e4785fc..ce0e863f02a2 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -243,6 +243,8 @@ struct scatterlist;
> >  struct pipe_inode_info;
> >  struct iov_iter;
> >  struct napi_struct;
> > +struct bpf_prog;
> > +union bpf_attr;
> >
> >  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
> >  struct nf_conntrack {
> > @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> >                              const struct flow_dissector_key *key,
> >                              unsigned int key_count);
> >
> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > +                                      struct bpf_prog *prog);
> > +
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
> > +
> >  bool __skb_flow_dissect(const struct sk_buff *skb,
> >                         struct flow_dissector *flow_dissector,
> >                         void *target_container,
> > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> > index 6a4586dcdede..edb919d320c1 100644
> > --- a/include/net/flow_dissector.h
> > +++ b/include/net/flow_dissector.h
> > @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
> >  extern struct flow_dissector flow_keys_dissector;
> >  extern struct flow_dissector flow_keys_basic_dissector;
> >
> > +/* struct bpf_flow_dissect_cb:
> > + *
> > + * This struct is used to pass parameters to BPF programs of type
> > + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
> > + * the control block of the skb to be a struct of this type. The first field is
> > + * used to communicate the next header offset between the BPF programs and the
> > + * first value of it is passed from the kernel. The last two fields are used for
> > + * writing out flow keys.
> > + */
> > +struct bpf_flow_dissect_cb {
> > +       u16 nhoff;
> > +       u16 unused;
> > +       void *target_container;
> > +       struct flow_dissector *flow_dissector;
> > +};
> > +
> >  /* struct flow_keys_digest:
> >   *
> >   * This structure is used to hold a digest of the full flow keys. This is a
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 66917a4eba27..8bc0fdab685d 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_LWT_SEG6LOCAL,
> >         BPF_PROG_TYPE_LIRC_MODE2,
> >         BPF_PROG_TYPE_SK_REUSEPORT,
> > +       BPF_PROG_TYPE_FLOW_DISSECTOR,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
> >         BPF_CGROUP_UDP4_SENDMSG,
> >         BPF_CGROUP_UDP6_SENDMSG,
> >         BPF_LIRC_MODE2,
> > +       BPF_FLOW_DISSECTOR,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -2141,6 +2143,15 @@ union bpf_attr {
> >   *             request in the skb.
> >   *     Return
> >   *             0 on success, or a negative error in case of failure.
> > + *
> > + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
> > + *     Description
> > + *             Try to write *len* bytes from the source pointer into the offset
> > + *             of the key with id *key_id*. If *len* is different from the
> > + *             size of the key, an error is returned. If the key is not used,
> > + *             this function exits with no effect and code 0.
> > + *     Return
> > + *             0 on success, negative error in case of failure.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -2226,7 +2237,8 @@ union bpf_attr {
> >         FN(get_current_cgroup_id),      \
> >         FN(get_local_storage),          \
> >         FN(sk_select_reuseport),        \
> > -       FN(skb_ancestor_cgroup_id),
> > +       FN(skb_ancestor_cgroup_id),     \
> > +       FN(flow_dissector_write_keys),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 43727ed0d94a..a06568841a92 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >         case BPF_LIRC_MODE2:
> >                 ptype = BPF_PROG_TYPE_LIRC_MODE2;
> >                 break;
> > +       case BPF_FLOW_DISSECTOR:
> > +               ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
> > +               break;
> >         default:
> >                 return -EINVAL;
> >         }
> > @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> >         case BPF_PROG_TYPE_LIRC_MODE2:
> >                 ret = lirc_prog_attach(attr, prog);
> >                 break;
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> > +               ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
> > +               break;
> >         default:
> >                 ret = cgroup_bpf_prog_attach(attr, ptype, prog);
> >         }
> > @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> >                 return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
> >         case BPF_LIRC_MODE2:
> >                 return lirc_prog_detach(attr);
> > +       case BPF_FLOW_DISSECTOR:
> > +               return skb_flow_dissector_bpf_prog_detach(attr);
> >         default:
> >                 return -EINVAL;
> >         }
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index ca90679a7fe5..6d3f268fa8e0 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> >         case BPF_PROG_TYPE_LWT_XMIT:
> >         case BPF_PROG_TYPE_SK_SKB:
> >         case BPF_PROG_TYPE_SK_MSG:
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >                 if (meta)
> >                         return meta->pkt_access;
> >
> > @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
> >         case BPF_PROG_TYPE_SOCKET_FILTER:
> >         case BPF_PROG_TYPE_SCHED_CLS:
> >         case BPF_PROG_TYPE_SCHED_ACT:
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >                 return true;
> >         default:
> >                 return false;
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index fd423ce3da34..03d3037e6508 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
> >         return false;
> >  }
> >
> > +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
> > +          const void *, from, u32, len, enum flow_dissector_key_id, key_id)
> > +{
> > +       struct bpf_flow_dissect_cb *cb;
> > +       void *dest;
> > +
> > +       cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
> > +
> > +       /* Make sure the dissector actually uses the key. It is not an error if
> > +        * it does not, but we should not continue past this point in that case
> > +        */
> > +       if (!dissector_uses_key(cb->flow_dissector, key_id))
> > +               return 0;
> > +
> > +       /* Make sure the length is correct */
> > +       switch (key_id) {
> > +       case FLOW_DISSECTOR_KEY_CONTROL:
> > +       case FLOW_DISSECTOR_KEY_ENC_CONTROL:
> > +               if (len != sizeof(struct flow_dissector_key_control))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_BASIC:
> > +               if (len != sizeof(struct flow_dissector_key_basic))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
> > +       case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
> > +               if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
> > +       case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
> > +               if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_ICMP:
> > +               if (len != sizeof(struct flow_dissector_key_icmp))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_PORTS:
> > +       case FLOW_DISSECTOR_KEY_ENC_PORTS:
> > +               if (len != sizeof(struct flow_dissector_key_ports))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_ETH_ADDRS:
> > +               if (len != sizeof(struct flow_dissector_key_eth_addrs))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_TIPC:
> > +               if (len != sizeof(struct flow_dissector_key_tipc))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_ARP:
> > +               if (len != sizeof(struct flow_dissector_key_arp))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_VLAN:
> > +       case FLOW_DISSECTOR_KEY_CVLAN:
> > +               if (len != sizeof(struct flow_dissector_key_vlan))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_FLOW_LABEL:
> > +               if (len != sizeof(struct flow_dissector_key_tags))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_GRE_KEYID:
> > +       case FLOW_DISSECTOR_KEY_ENC_KEYID:
> > +       case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
> > +               if (len != sizeof(struct flow_dissector_key_keyid))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_MPLS:
> > +               if (len != sizeof(struct flow_dissector_key_mpls))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_TCP:
> > +               if (len != sizeof(struct flow_dissector_key_tcp))
> > +                       return -EINVAL;
> > +               break;
> > +       case FLOW_DISSECTOR_KEY_IP:
> > +       case FLOW_DISSECTOR_KEY_ENC_IP:
> > +               if (len != sizeof(struct flow_dissector_key_ip))
> > +                       return -EINVAL;
> > +               break;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +
> > +       dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
> > +                                        cb->target_container);
> > +
> > +       memcpy(dest, from, len);
> > +       return 0;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
> > +       .func           = bpf_flow_dissector_write_keys,
> > +       .gpl_only       = false,
> > +       .ret_type       = RET_INTEGER,
> > +       .arg1_type      = ARG_PTR_TO_CTX,
> > +       .arg2_type      = ARG_PTR_TO_MEM,
> > +       .arg3_type      = ARG_CONST_SIZE,
> > +       .arg4_type      = ARG_ANYTHING,
> > +};
> > +
> >  static const struct bpf_func_proto *
> >  bpf_base_func_proto(enum bpf_func_id func_id)
> >  {
> > @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >         }
> >  }
> >
> > +static const struct bpf_func_proto *
> > +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +       switch (func_id) {
> > +       case BPF_FUNC_skb_load_bytes:
> > +               return &bpf_skb_load_bytes_proto;
> > +       case BPF_FUNC_flow_dissector_write_keys:
> > +               return &bpf_flow_dissector_write_keys_proto;
> > +       default:
> > +               return bpf_base_func_proto(func_id);
> > +       }
> > +}
> > +
> >  static const struct bpf_func_proto *
> >  lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >  {
> > @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
> >         return true;
> >  }
> >
> > +static bool flow_dissector_is_valid_access(int off, int size,
> > +                                          enum bpf_access_type type,
> > +                                          const struct bpf_prog *prog,
> > +                                          struct bpf_insn_access_aux *info)
> > +{
> > +       if (type == BPF_WRITE) {
> > +               switch (off) {
> > +               case bpf_ctx_range(struct __sk_buff, cb[0]):
> > +                       break;
> > +               default:
> > +                       return false;
> > +               }
> > +       }
> > +
> > +       switch (off) {
> > +       case bpf_ctx_range(struct __sk_buff, data):
> > +               info->reg_type = PTR_TO_PACKET;
> > +               break;
> > +       case bpf_ctx_range(struct __sk_buff, data_end):
> > +               info->reg_type = PTR_TO_PACKET_END;
> > +               break;
> > +       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> > +       case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
> > +               return false;
> > +       }
> > +
> > +       return bpf_skb_is_valid_access(off, size, type, prog, info);
> > +}
> > +
> >  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
> >                                   const struct bpf_insn *si,
> >                                   struct bpf_insn *insn_buf,
> > @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
> >  const struct bpf_prog_ops sk_msg_prog_ops = {
> >  };
> >
> > +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
> > +       .get_func_proto         = flow_dissector_func_proto,
> > +       .is_valid_access        = flow_dissector_is_valid_access,
> > +       .convert_ctx_access     = bpf_convert_ctx_access,
> > +       .gen_ld_abs             = bpf_gen_ld_abs,
> > +};
> > +
> > +const struct bpf_prog_ops flow_dissector_prog_ops = {
> > +};
> > +
> >  int sk_detach_filter(struct sock *sk)
> >  {
> >         int ret = -ENOENT;
> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > index ce9eeeb7c024..767daa231f04 100644
> > --- a/net/core/flow_dissector.c
> > +++ b/net/core/flow_dissector.c
> > @@ -25,6 +25,11 @@
> >  #include <net/flow_dissector.h>
> >  #include <scsi/fc/fc_fcoe.h>
> >  #include <uapi/linux/batadv_packet.h>
> > +#include <linux/bpf.h>
> > +
> > +/* BPF program accessible by all flow dissectors */
> > +static struct bpf_prog __rcu *flow_dissector_prog;
> > +static DEFINE_MUTEX(flow_dissector_mutex);
> >
> >  static void dissector_set_key(struct flow_dissector *flow_dissector,
> >                               enum flow_dissector_key_id key_id)
> > @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
> >  }
> >  EXPORT_SYMBOL(skb_flow_dissector_init);
> >
> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > +                                      struct bpf_prog *prog)
> > +{
> > +       struct bpf_prog *attached;
> > +
> > +       mutex_lock(&flow_dissector_mutex);
> > +       attached = rcu_dereference_protected(flow_dissector_prog,
> > +                                            lockdep_is_held(&flow_dissector_mutex));
> > +       if (attached) {
> > +               /* Only one BPF program can be attached at a time */
> > +               mutex_unlock(&flow_dissector_mutex);
> > +               return -EEXIST;
> > +       }
> > +       rcu_assign_pointer(flow_dissector_prog, prog);
> > +       mutex_unlock(&flow_dissector_mutex);
> > +       return 0;
> > +}
> > +
> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
> > +{
> > +       struct bpf_prog *attached;
> > +
> > +       mutex_lock(&flow_dissector_mutex);
> > +       attached = rcu_dereference_protected(flow_dissector_prog,
> > +                                            lockdep_is_held(&flow_dissector_mutex));
> > +       if (!flow_dissector_prog) {
> > +               mutex_unlock(&flow_dissector_mutex);
> > +               return -EINVAL;
> > +       }
> > +       bpf_prog_put(attached);
> > +       RCU_INIT_POINTER(flow_dissector_prog, NULL);
> > +       mutex_unlock(&flow_dissector_mutex);
> > +       return 0;
> > +}
> >  /**
> >   * skb_flow_get_be16 - extract be16 entity
> >   * @skb: sk_buff to extract from
> > @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >         struct flow_dissector_key_vlan *key_vlan;
> >         enum flow_dissect_ret fdret;
> >         enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
> > +       struct bpf_prog *attached;
> >         int num_hdrs = 0;
> >         u8 ip_proto = 0;
> >         bool ret;
> > @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >                                               FLOW_DISSECTOR_KEY_BASIC,
> >                                               target_container);
> >
> > +       rcu_read_lock();
> > +       attached = rcu_dereference(flow_dissector_prog);
> > +       if (attached) {
> > +               /* Note that even though the const qualifier is discarded
> > +                * throughout the execution of the BPF program, all changes(the
> > +                * control block) are reverted after the BPF program returns.
> > +                * Therefore, __skb_flow_dissect does not alter the skb.
> > +                */
> > +               struct bpf_flow_dissect_cb *cb;
> > +               u8 cb_saved[BPF_SKB_CB_LEN];
> > +               u32 result;
> > +
> > +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> > +
> > +               /* Save Control Block */
> > +               memcpy(cb_saved, cb, sizeof(cb_saved));
> > +               memset(cb, 0, sizeof(cb_saved));
> > +
> > +               /* Pass parameters to the BPF program */
> > +               cb->nhoff = nhoff;
> > +               cb->target_container = target_container;
> > +               cb->flow_dissector = flow_dissector;
> > +
> > +               bpf_compute_data_pointers((struct sk_buff *)skb);
> > +               result = BPF_PROG_RUN(attached, skb);
> > +
> > +               /* Restore state */
> > +               memcpy(cb, cb_saved, sizeof(cb_saved));
> > +
> > +               key_control->thoff = min_t(u16, key_control->thoff,
> > +                                          skb ? skb->len : hlen);
> > +               rcu_read_unlock();
> > +               return result == BPF_OK;
> > +       }
>
> If the BPF program cannot handle certain protocol, shall we fall back
> to the built-in logic? Otherwise, all BPF programs need to have some
> code for all protocols.
>
> Song

I believe that if we fall back to the built-in logic we lose all security
guarantees from BPF and this is why the code does not support
fall back.

Petar
>
>
> > +       rcu_read_unlock();
> > +
> >         if (dissector_uses_key(flow_dissector,
> >                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
> >                 struct ethhdr *eth = eth_hdr(skb);
> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > index dce960d22106..b1cd3bc8db70 100644
> > --- a/tools/bpf/bpftool/prog.c
> > +++ b/tools/bpf/bpftool/prog.c
> > @@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
> >         [BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
> >         [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
> >         [BPF_PROG_TYPE_LIRC_MODE2]      = "lirc_mode2",
> > +       [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
> >  };
> >
> >  static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 66917a4eba27..acd74a0dd063 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_LWT_SEG6LOCAL,
> >         BPF_PROG_TYPE_LIRC_MODE2,
> >         BPF_PROG_TYPE_SK_REUSEPORT,
> > +       BPF_PROG_TYPE_FLOW_DISSECTOR,
> >  };
> >
> >  enum bpf_attach_type {
> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
> >         BPF_CGROUP_UDP4_SENDMSG,
> >         BPF_CGROUP_UDP6_SENDMSG,
> >         BPF_LIRC_MODE2,
> > +       BPF_FLOW_DISSECTOR,
> >         __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > @@ -2226,7 +2228,8 @@ union bpf_attr {
> >         FN(get_current_cgroup_id),      \
> >         FN(get_local_storage),          \
> >         FN(sk_select_reuseport),        \
> > -       FN(skb_ancestor_cgroup_id),
> > +       FN(skb_ancestor_cgroup_id),     \
> > +       FN(flow_dissector_write_keys),
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 2abd0f112627..0c749ce1b717 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
> >         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> >         case BPF_PROG_TYPE_LIRC_MODE2:
> >         case BPF_PROG_TYPE_SK_REUSEPORT:
> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
> >                 return false;
> >         case BPF_PROG_TYPE_UNSPEC:
> >         case BPF_PROG_TYPE_KPROBE:
> > @@ -2121,6 +2122,7 @@ static const struct {
> >         BPF_PROG_SEC("sk_skb",          BPF_PROG_TYPE_SK_SKB),
> >         BPF_PROG_SEC("sk_msg",          BPF_PROG_TYPE_SK_MSG),
> >         BPF_PROG_SEC("lirc_mode2",      BPF_PROG_TYPE_LIRC_MODE2),
> > +       BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
> >         BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
> >         BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
> >         BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
> > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
> > index e4be7730222d..4204c496a04f 100644
> > --- a/tools/testing/selftests/bpf/bpf_helpers.h
> > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
> > @@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
> >         (void *) BPF_FUNC_skb_cgroup_id;
> >  static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
> >         (void *) BPF_FUNC_skb_ancestor_cgroup_id;
> > +static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
> > +                                           int key) =
> > +       (void *) BPF_FUNC_flow_dissector_write_keys;
> >
> >  /* llvm builtin functions that eBPF C program may use to
> >   * emit BPF_LD_ABS and BPF_LD_IND instructions
> > --
> > 2.18.0.865.gffc8e1a3cd6-goog
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 2/3] flow_dissector: implements eBPF parser
  2018-08-16 16:44 ` [bpf-next RFC 2/3] flow_dissector: implements eBPF parser Petar Penkov
@ 2018-08-18 15:50   ` Tom Herbert
  2018-08-18 19:49     ` Willem de Bruijn
  0 siblings, 1 reply; 20+ messages in thread
From: Tom Herbert @ 2018-08-18 15:50 UTC (permalink / raw)
  To: Petar Penkov
  Cc: Linux Kernel Network Developers, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Simon Horman, Petar Penkov,
	Willem de Bruijn

On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> This eBPF program extracts basic/control/ip address/ports keys from
> incoming packets. It supports recursive parsing for IP
> encapsulation, MPLS, GUE, and VLAN, along with IPv4/IPv6 and extension
> headers. This program is meant to show how flow dissection and key
> extraction can be done in eBPF.
>
> It is initially meant to be used for demonstration rather than as a
> complete replacement of the existing flow dissector.
>
> This includes parsing of GUE and MPLS payload, which cannot be done
> in production in general, as GUE tunnels and MPLS payloads cannot
> unambiguously be detected in general.
>
> In closed environments, however, it can be enabled. Another example
> where the programmability of BPF aids flow dissection.
>
> Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  tools/testing/selftests/bpf/Makefile   |   2 +-
>  tools/testing/selftests/bpf/bpf_flow.c | 542 +++++++++++++++++++++++++
>  2 files changed, 543 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/bpf_flow.c
>
> diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
> index fff7fb1285fc..e65f50f9185e 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
>         test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
>         test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
>         get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
> -       test_skb_cgroup_id_kern.o
> +       test_skb_cgroup_id_kern.o bpf_flow.o
>
>  # Order correspond to 'make run_tests' order
>  TEST_PROGS := test_kmod.sh \
> diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c
> new file mode 100644
> index 000000000000..9c11c644b713
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bpf_flow.c
> @@ -0,0 +1,542 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <stddef.h>
> +#include <stdbool.h>
> +#include <string.h>
> +#include <linux/pkt_cls.h>
> +#include <linux/bpf.h>
> +#include <linux/in.h>
> +#include <linux/if_ether.h>
> +#include <linux/icmp.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include <linux/tcp.h>
> +#include <linux/udp.h>
> +#include <linux/if_packet.h>
> +#include <sys/socket.h>
> +#include <linux/if_tunnel.h>
> +#include <linux/mpls.h>
> +#include "bpf_helpers.h"
> +#include "bpf_endian.h"
> +
> +int _version SEC("version") = 1;
> +#define PROG(F) SEC(#F) int bpf_func_##F
> +
> +/* These are the identifiers of the BPF programs that will be used in tail
> + * calls. Name is limited to 16 characters, with the terminating character and
> + * bpf_func_ above, we have only 6 to work with, anything after will be cropped.
> + */
> +enum {
> +       IP,
> +       IPV6,
> +       IPV6OP, /* Destination/Hop-by-Hop Options IPv6 Extension header */
> +       IPV6FR, /* Fragmentation IPv6 Extension Header */
> +       MPLS,
> +       VLAN,
> +       GUE,
> +};
> +
> +#define IP_MF          0x2000
> +#define IP_OFFSET      0x1FFF
> +#define IP6_MF         0x0001
> +#define IP6_OFFSET     0xFFF8
> +
> +struct vlan_hdr {
> +       __be16 h_vlan_TCI;
> +       __be16 h_vlan_encapsulated_proto;
> +};
> +
> +struct gre_hdr {
> +       __be16 flags;
> +       __be16 proto;
> +};
> +
> +#define GUE_PORT 6080
> +/* Taken from include/net/gue.h. Move that to uapi, instead? */
> +struct guehdr {
> +       union {
> +               struct {
> +#if defined(__LITTLE_ENDIAN_BITFIELD)
> +                       __u8    hlen:5,
> +                               control:1,
> +                               version:2;
> +#elif defined (__BIG_ENDIAN_BITFIELD)
> +                       __u8    version:2,
> +                               control:1,
> +                               hlen:5;
> +#else
> +#error  "Please fix <asm/byteorder.h>"
> +#endif
> +                       __u8    proto_ctype;
> +                       __be16  flags;
> +               };
> +               __be32  word;
> +       };
> +};
> +
> +enum flow_dissector_key_id {
> +       FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
> +       FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
> +       FLOW_DISSECTOR_KEY_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
> +       FLOW_DISSECTOR_KEY_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
> +       FLOW_DISSECTOR_KEY_PORTS, /* struct flow_dissector_key_ports */
> +       FLOW_DISSECTOR_KEY_ICMP, /* struct flow_dissector_key_icmp */
> +       FLOW_DISSECTOR_KEY_ETH_ADDRS, /* struct flow_dissector_key_eth_addrs */
> +       FLOW_DISSECTOR_KEY_TIPC, /* struct flow_dissector_key_tipc */
> +       FLOW_DISSECTOR_KEY_ARP, /* struct flow_dissector_key_arp */
> +       FLOW_DISSECTOR_KEY_VLAN, /* struct flow_dissector_key_flow_vlan */
> +       FLOW_DISSECTOR_KEY_FLOW_LABEL, /* struct flow_dissector_key_flow_tags */
> +       FLOW_DISSECTOR_KEY_GRE_KEYID, /* struct flow_dissector_key_keyid */
> +       FLOW_DISSECTOR_KEY_MPLS_ENTROPY, /* struct flow_dissector_key_keyid */
> +       FLOW_DISSECTOR_KEY_ENC_KEYID, /* struct flow_dissector_key_keyid */
> +       FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, /* struct flow_dissector_key_ipv4_addrs */
> +       FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, /* struct flow_dissector_key_ipv6_addrs */
> +       FLOW_DISSECTOR_KEY_ENC_CONTROL, /* struct flow_dissector_key_control */
> +       FLOW_DISSECTOR_KEY_ENC_PORTS, /* struct flow_dissector_key_ports */
> +       FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
> +       FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
> +       FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
> +       FLOW_DISSECTOR_KEY_CVLAN, /* struct flow_dissector_key_flow_vlan */
> +
> +       FLOW_DISSECTOR_KEY_MAX,
> +};
> +
> +struct flow_dissector_key_control {
> +       __u16   thoff;
> +       __u16   addr_type;
> +       __u32   flags;
> +};
> +
> +#define FLOW_DIS_IS_FRAGMENT   (1 << 0)
> +#define FLOW_DIS_FIRST_FRAG    (1 << 1)
> +#define FLOW_DIS_ENCAPSULATION (1 << 2)
> +
> +struct flow_dissector_key_basic {
> +       __be16  n_proto;
> +       __u8    ip_proto;
> +       __u8    padding;
> +};
> +
> +struct flow_dissector_key_ipv4_addrs {
> +       __be32 src;
> +       __be32 dst;
> +};
> +
> +struct flow_dissector_key_ipv6_addrs {
> +       struct in6_addr src;
> +       struct in6_addr dst;
> +};
> +
> +struct flow_dissector_key_addrs {
> +       union {
> +               struct flow_dissector_key_ipv4_addrs v4addrs;
> +               struct flow_dissector_key_ipv6_addrs v6addrs;
> +       };
> +};
> +
> +struct flow_dissector_key_ports {
> +       union {
> +               __be32 ports;
> +               struct {
> +                       __be16 src;
> +                       __be16 dst;
> +               };
> +       };
> +};
> +
> +struct bpf_map_def SEC("maps") jmp_table = {
> +       .type = BPF_MAP_TYPE_PROG_ARRAY,
> +       .key_size = sizeof(__u32),
> +       .value_size = sizeof(__u32),
> +       .max_entries = 8
> +};
> +
> +struct bpf_dissect_cb {
> +       __u16 nhoff;
> +       __u16 flags;
> +};
> +
> +/* Dispatches on ETHERTYPE */
> +static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
> +{
> +       switch (proto) {
> +       case bpf_htons(ETH_P_IP):
> +               bpf_tail_call(skb, &jmp_table, IP);
> +               break;
> +       case bpf_htons(ETH_P_IPV6):
> +               bpf_tail_call(skb, &jmp_table, IPV6);
> +               break;
> +       case bpf_htons(ETH_P_MPLS_MC):
> +       case bpf_htons(ETH_P_MPLS_UC):
> +               bpf_tail_call(skb, &jmp_table, MPLS);
> +               break;
> +       case bpf_htons(ETH_P_8021Q):
> +       case bpf_htons(ETH_P_8021AD):
> +               bpf_tail_call(skb, &jmp_table, VLAN);
> +               break;
> +       default:
> +               /* Protocol not supported */
> +               return BPF_DROP;
> +       }
> +
> +       return BPF_DROP;
> +}
> +
> +static __always_inline int write_ports(struct __sk_buff *skb, __u8 proto)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       struct flow_dissector_key_ports ports;
> +
> +       /* The supported protocols always start with the ports */
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &ports, sizeof(ports)))
> +               return BPF_DROP;
> +
> +       if (proto == IPPROTO_UDP && ports.dst == bpf_htons(GUE_PORT)) {
> +               /* GUE encapsulation */
> +               cb->nhoff += sizeof(struct udphdr);
> +               bpf_tail_call(skb, &jmp_table, GUE);
> +               return BPF_DROP;

It's a nice sentiment to support GUE, but this really isn't the right
way to do it. What would be much better is a means to generically
support all the various UDP encapsulations like GUE, VXLAN, Geneve,
GRE/UDP, MPLS/UDP, etc. I think there's two ways to do that:

1) A UDP socket lookup that returns an encapsulation socket containing
a flow dissector function that can be called. This is the safest
method because of the UDP are reserved numbers problem. I implement
this in kernel flow dissector, not upstreamed though.
2) Create a lookup table based on destination port that returns the
flow dissector function to call. This doesn't have the socket lookup
so it isn't quite as robust as the socket lookup. But, at least it's a
generic interface and programmable so it might be appropriate in the
BPF flow dissector case.

Tom

> +       }
> +
> +       if (bpf_flow_dissector_write_keys(skb, &ports, sizeof(ports),
> +                                         FLOW_DISSECTOR_KEY_PORTS))
> +               return BPF_DROP;
> +
> +       return BPF_OK;
> +}
> +
> +SEC("dissect")
> +int dissect(struct __sk_buff *skb)
> +{
> +       if (!skb->vlan_present)
> +               return parse_eth_proto(skb, skb->protocol);
> +       else
> +               return parse_eth_proto(skb, skb->vlan_proto);
> +}
> +
> +/* Parses on IPPROTO_* */
> +static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       __u8 *data_end = (__u8 *)(long)skb->data_end;
> +       __u8 *data = (__u8 *)(long)skb->data;
> +       __u32 data_len = data_end - data;
> +       struct gre_hdr gre;
> +       struct ethhdr eth;
> +       struct tcphdr tcp;
> +
> +       switch (proto) {
> +       case IPPROTO_ICMP:
> +               if (cb->nhoff + sizeof(struct icmphdr) > data_len)
> +                       return BPF_DROP;
> +               return BPF_OK;
> +       case IPPROTO_IPIP:
> +               cb->flags |= FLOW_DIS_ENCAPSULATION;
> +               bpf_tail_call(skb, &jmp_table, IP);
> +               break;
> +       case IPPROTO_IPV6:
> +               cb->flags |= FLOW_DIS_ENCAPSULATION;
> +               bpf_tail_call(skb, &jmp_table, IPV6);
> +               break;
> +       case IPPROTO_GRE:
> +               if (bpf_skb_load_bytes(skb, cb->nhoff, &gre, sizeof(gre)))
> +                       return BPF_DROP;
> +
> +               if (bpf_htons(gre.flags & GRE_VERSION))
> +                       /* Only inspect standard GRE packets with version 0 */
> +                       return BPF_OK;
> +
> +               cb->nhoff += sizeof(gre); /* Step over GRE Flags and Protocol */
> +               if (GRE_IS_CSUM(gre.flags))
> +                       cb->nhoff += 4; /* Step over chksum and Padding */
> +               if (GRE_IS_KEY(gre.flags))
> +                       cb->nhoff += 4; /* Step over key */
> +               if (GRE_IS_SEQ(gre.flags))
> +                       cb->nhoff += 4; /* Step over sequence number */
> +
> +               cb->flags |= FLOW_DIS_ENCAPSULATION;
> +
> +               if (gre.proto == bpf_htons(ETH_P_TEB)) {
> +                       if (bpf_skb_load_bytes(skb, cb->nhoff, &eth,
> +                                              sizeof(eth)))
> +                               return BPF_DROP;
> +
> +                       cb->nhoff += sizeof(eth);
> +
> +                       return parse_eth_proto(skb, eth.h_proto);
> +               } else {
> +                       return parse_eth_proto(skb, gre.proto);
> +               }
> +
> +       case IPPROTO_TCP:
> +               if (cb->nhoff + sizeof(struct tcphdr) > data_len)
> +                       return BPF_DROP;
> +
> +               if (bpf_skb_load_bytes(skb, cb->nhoff, &tcp, sizeof(tcp)))
> +                       return BPF_DROP;
> +
> +               if (tcp.doff < 5)
> +                       return BPF_DROP;
> +
> +               if (cb->nhoff + (tcp.doff << 2) > data_len)
> +                       return BPF_DROP;
> +
> +               return write_ports(skb, proto);
> +       case IPPROTO_UDP:
> +       case IPPROTO_UDPLITE:
> +               if (cb->nhoff + sizeof(struct udphdr) > data_len)
> +                       return BPF_DROP;
> +
> +               return write_ports(skb, proto);
> +       default:
> +               return BPF_DROP;
> +       }
> +
> +       return BPF_DROP;
> +}
> +
> +static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       struct flow_dissector_key_control control;
> +       struct flow_dissector_key_basic basic;
> +
> +       switch (nexthdr) {
> +       case IPPROTO_HOPOPTS:
> +       case IPPROTO_DSTOPTS:
> +               bpf_tail_call(skb, &jmp_table, IPV6OP);
> +               break;
> +       case IPPROTO_FRAGMENT:
> +               bpf_tail_call(skb, &jmp_table, IPV6FR);
> +               break;
> +       default:
> +               control.thoff = cb->nhoff;
> +               control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
> +               control.flags = cb->flags;
> +               if (bpf_flow_dissector_write_keys(skb, &control,
> +                                                 sizeof(control),
> +                                                 FLOW_DISSECTOR_KEY_CONTROL))
> +                       return BPF_DROP;
> +
> +               memset(&basic, 0, sizeof(basic));
> +               basic.n_proto = bpf_htons(ETH_P_IPV6);
> +               basic.ip_proto = nexthdr;
> +               if (bpf_flow_dissector_write_keys(skb, &basic, sizeof(basic),
> +                                             FLOW_DISSECTOR_KEY_BASIC))
> +                       return BPF_DROP;
> +
> +               return parse_ip_proto(skb, nexthdr);
> +       }
> +
> +       return BPF_DROP;
> +}
> +
> +PROG(IP)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       __u8 *data_end = (__u8 *)(long)skb->data_end;
> +       struct flow_dissector_key_control control;
> +       struct flow_dissector_key_addrs addrs;
> +       struct flow_dissector_key_basic basic;
> +       __u8 *data = (__u8 *)(long)skb->data;
> +       __u32 data_len = data_end - data;
> +       bool done = false;
> +       struct iphdr iph;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &iph, sizeof(iph)))
> +               return BPF_DROP;
> +
> +       /* IP header cannot be smaller than 20 bytes */
> +       if (iph.ihl < 5)
> +               return BPF_DROP;
> +
> +       addrs.v4addrs.src = iph.saddr;
> +       addrs.v4addrs.dst = iph.daddr;
> +       if (bpf_flow_dissector_write_keys(skb, &addrs, sizeof(addrs.v4addrs),
> +                                     FLOW_DISSECTOR_KEY_IPV4_ADDRS))
> +               return BPF_DROP;
> +
> +       cb->nhoff += iph.ihl << 2;
> +       if (cb->nhoff > data_len)
> +               return BPF_DROP;
> +
> +       if (iph.frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
> +               cb->flags |= FLOW_DIS_IS_FRAGMENT;
> +               if (iph.frag_off & bpf_htons(IP_OFFSET))
> +                       /* From second fragment on, packets do not have headers
> +                        * we can parse.
> +                        */
> +                       done = true;
> +               else
> +                       cb->flags |= FLOW_DIS_FIRST_FRAG;
> +       }
> +
> +
> +       control.thoff = cb->nhoff;
> +       control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
> +       control.flags = cb->flags;
> +       if (bpf_flow_dissector_write_keys(skb, &control, sizeof(control),
> +                                         FLOW_DISSECTOR_KEY_CONTROL))
> +               return BPF_DROP;
> +
> +       memset(&basic, 0, sizeof(basic));
> +       basic.n_proto = bpf_htons(ETH_P_IP);
> +       basic.ip_proto = iph.protocol;
> +       if (bpf_flow_dissector_write_keys(skb, &basic, sizeof(basic),
> +                                     FLOW_DISSECTOR_KEY_BASIC))
> +               return BPF_DROP;
> +
> +       if (done)
> +               return BPF_OK;
> +
> +       return parse_ip_proto(skb, iph.protocol);
> +}
> +
> +PROG(IPV6)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       struct flow_dissector_key_addrs addrs;
> +       struct ipv6hdr ip6h;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &ip6h, sizeof(ip6h)))
> +               return BPF_DROP;
> +
> +       addrs.v6addrs.src = ip6h.saddr;
> +       addrs.v6addrs.dst = ip6h.daddr;
> +       if (bpf_flow_dissector_write_keys(skb, &addrs, sizeof(addrs.v6addrs),
> +                                     FLOW_DISSECTOR_KEY_IPV6_ADDRS))
> +               return BPF_DROP;
> +
> +       cb->nhoff += sizeof(struct ipv6hdr);
> +
> +       return parse_ipv6_proto(skb, ip6h.nexthdr);
> +}
> +
> +PROG(IPV6OP)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       __u8 proto;
> +       __u8 hlen;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &proto, sizeof(proto)))
> +               return BPF_DROP;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff + sizeof(proto), &hlen,
> +                              sizeof(hlen)))
> +               return BPF_DROP;
> +       /* hlen is in 8-octects and does not include the first 8 bytes
> +        * of the header
> +        */
> +       cb->nhoff += (1 + hlen) << 3;
> +
> +       return parse_ipv6_proto(skb, proto);
> +}
> +
> +PROG(IPV6FR)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       __be16 frag_off;
> +       __u8 proto;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &proto, sizeof(proto)))
> +               return BPF_DROP;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff + 2, &frag_off, sizeof(frag_off)))
> +               return BPF_DROP;
> +
> +       cb->nhoff += 8;
> +       cb->flags |= FLOW_DIS_IS_FRAGMENT;
> +       if (!(frag_off & bpf_htons(IP6_OFFSET)))
> +               cb->flags |= FLOW_DIS_FIRST_FRAG;
> +
> +       return parse_ipv6_proto(skb, proto);
> +}
> +
> +PROG(MPLS)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       struct mpls_label mpls;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &mpls, sizeof(mpls)))
> +               return BPF_DROP;
> +
> +       cb->nhoff += sizeof(mpls);
> +
> +       if (mpls.entry & MPLS_LS_S_MASK) {
> +               /* This is the last MPLS header. The network layer packet always
> +                * follows the MPLS header. Peek forward and dispatch based on
> +                * that.
> +                */
> +               __u8 version;
> +
> +               if (bpf_skb_load_bytes(skb, cb->nhoff, &version,
> +                                      sizeof(version)))
> +                       return BPF_DROP;
> +
> +               /* IP version is always the first 4 bits of the header */
> +               switch (version & 0xF0) {
> +               case 4:
> +                       bpf_tail_call(skb, &jmp_table, IP);
> +                       break;
> +               case 6:
> +                       bpf_tail_call(skb, &jmp_table, IPV6);
> +                       break;
> +               default:
> +                       return BPF_DROP;
> +               }
> +       } else {
> +               bpf_tail_call(skb, &jmp_table, MPLS);
> +       }
> +
> +       return BPF_DROP;
> +}
> +
> +PROG(VLAN)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       struct vlan_hdr vlan;
> +       __be16 proto;
> +
> +       /* Peek back to see if single or double-tagging */
> +       if (bpf_skb_load_bytes(skb, cb->nhoff - sizeof(proto), &proto,
> +                              sizeof(proto)))
> +               return BPF_DROP;
> +
> +       /* Account for double-tagging */
> +       if (proto == bpf_htons(ETH_P_8021AD)) {
> +               if (bpf_skb_load_bytes(skb, cb->nhoff, &vlan, sizeof(vlan)))
> +                       return BPF_DROP;
> +
> +               if (vlan.h_vlan_encapsulated_proto != bpf_htons(ETH_P_8021Q))
> +                       return BPF_DROP;
> +
> +               cb->nhoff += sizeof(vlan);
> +       }
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &vlan, sizeof(vlan)))
> +               return BPF_DROP;
> +
> +       cb->nhoff += sizeof(vlan);
> +       /* Only allow 8021AD + 8021Q double tagging and no triple tagging.*/
> +       if (vlan.h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021AD) ||
> +           vlan.h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021Q))
> +               return BPF_DROP;
> +
> +       return parse_eth_proto(skb, vlan.h_vlan_encapsulated_proto);
> +}
> +
> +PROG(GUE)(struct __sk_buff *skb)
> +{
> +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> +       struct guehdr gue;
> +
> +       if (bpf_skb_load_bytes(skb, cb->nhoff, &gue, sizeof(gue)))
> +               return BPF_DROP;
> +
> +       cb->nhoff += sizeof(gue);
> +       cb->nhoff += gue.hlen << 2;
> +
> +       cb->flags |= FLOW_DIS_ENCAPSULATION;
> +       return parse_ip_proto(skb, gue.proto_ctype);
> +}
> +
> +char __license[] SEC("license") = "GPL";
> --
> 2.18.0.865.gffc8e1a3cd6-goog
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 2/3] flow_dissector: implements eBPF parser
  2018-08-18 15:50   ` Tom Herbert
@ 2018-08-18 19:49     ` Willem de Bruijn
  0 siblings, 0 replies; 20+ messages in thread
From: Willem de Bruijn @ 2018-08-18 19:49 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman, Petar Penkov,
	Willem de Bruijn

On Sat, Aug 18, 2018 at 11:56 AM Tom Herbert <tom@herbertland.com> wrote:
>
> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > This eBPF program extracts basic/control/ip address/ports keys from
> > incoming packets. It supports recursive parsing for IP
> > encapsulation, MPLS, GUE, and VLAN, along with IPv4/IPv6 and extension
> > headers. This program is meant to show how flow dissection and key
> > extraction can be done in eBPF.
> >
> > It is initially meant to be used for demonstration rather than as a
> > complete replacement of the existing flow dissector.
> >
> > This includes parsing of GUE and MPLS payload, which cannot be done
> > in production in general, as GUE tunnels and MPLS payloads cannot
> > unambiguously be detected in general.
> >
> > In closed environments, however, it can be enabled. Another example
> > where the programmability of BPF aids flow dissection.

> > +static __always_inline int write_ports(struct __sk_buff *skb, __u8 proto)
> > +{
> > +       struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
> > +       struct flow_dissector_key_ports ports;
> > +
> > +       /* The supported protocols always start with the ports */
> > +       if (bpf_skb_load_bytes(skb, cb->nhoff, &ports, sizeof(ports)))
> > +               return BPF_DROP;
> > +
> > +       if (proto == IPPROTO_UDP && ports.dst == bpf_htons(GUE_PORT)) {
> > +               /* GUE encapsulation */
> > +               cb->nhoff += sizeof(struct udphdr);
> > +               bpf_tail_call(skb, &jmp_table, GUE);
> > +               return BPF_DROP;
>
> It's a nice sentiment to support GUE, but this really isn't the right
> way to do it.

Yes, this was just for demonstration purposes. The same for
unconditionally parsing MPLS payload as IP.

Though note the point in the commit message that within a closed
network with fixed reserved GUE ports, a custom BPF program
like this could be sufficient. That's true not only for UDP tunnels.

> What would be much better is a means to generically
> support all the various UDP encapsulations like GUE, VXLAN, Geneve,
> GRE/UDP, MPLS/UDP, etc. I think there's two ways to do that:
>
> 1) A UDP socket lookup that returns an encapsulation socket containing
> a flow dissector function that can be called. This is the safest
> method because of the UDP are reserved numbers problem. I implement
> this in kernel flow dissector, not upstreamed though.

Yes, similar to udp_gro_receive. Socket lookup is not free, however,
and this is a relatively rarely used feature.

I want to move the one in udp_gro_receive behind a static key.
udp_encap_needed_key is the likely target. Then the same can
eventually be done for flow dissection inside UDP tunnels.

> 2) Create a lookup table based on destination port that returns the
> flow dissector function to call. This doesn't have the socket lookup
> so it isn't quite as robust as the socket lookup. But, at least it's a
> generic interface and programmable so it might be appropriate in the
> BPF flow dissector case.

Option 1 sounds preferable to me.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-16 23:14     ` Petar Penkov
@ 2018-08-20  5:44       ` Song Liu
  2018-08-20 14:13         ` Willem de Bruijn
  0 siblings, 1 reply; 20+ messages in thread
From: Song Liu @ 2018-08-20  5:44 UTC (permalink / raw)
  To: Petar Penkov
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Thu, Aug 16, 2018 at 4:14 PM, Petar Penkov <ppenkov@google.com> wrote:
> On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>>
>> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
>> > From: Petar Penkov <ppenkov@google.com>
>> >
>> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
>> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
>> > path. The BPF program is kept as a global variable so it is
>> > accessible to all flow dissectors.
>> >
>> > Signed-off-by: Petar Penkov <ppenkov@google.com>
>> > Signed-off-by: Willem de Bruijn <willemb@google.com>
>> > ---
>> >  include/linux/bpf_types.h                 |   1 +
>> >  include/linux/skbuff.h                    |   7 +
>> >  include/net/flow_dissector.h              |  16 +++
>> >  include/uapi/linux/bpf.h                  |  14 +-
>> >  kernel/bpf/syscall.c                      |   8 ++
>> >  kernel/bpf/verifier.c                     |   2 +
>> >  net/core/filter.c                         | 157 ++++++++++++++++++++++
>> >  net/core/flow_dissector.c                 |  76 +++++++++++
>> >  tools/bpf/bpftool/prog.c                  |   1 +
>> >  tools/include/uapi/linux/bpf.h            |   5 +-
>> >  tools/lib/bpf/libbpf.c                    |   2 +
>> >  tools/testing/selftests/bpf/bpf_helpers.h |   3 +
>> >  12 files changed, 290 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> > index cd26c090e7c0..22083712dd18 100644
>> > --- a/include/linux/bpf_types.h
>> > +++ b/include/linux/bpf_types.h
>> > @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
>> >  #ifdef CONFIG_INET
>> >  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
>> >  #endif
>> > +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
>> >
>> >  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>> >  BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
>> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> > index 17a13e4785fc..ce0e863f02a2 100644
>> > --- a/include/linux/skbuff.h
>> > +++ b/include/linux/skbuff.h
>> > @@ -243,6 +243,8 @@ struct scatterlist;
>> >  struct pipe_inode_info;
>> >  struct iov_iter;
>> >  struct napi_struct;
>> > +struct bpf_prog;
>> > +union bpf_attr;
>> >
>> >  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
>> >  struct nf_conntrack {
>> > @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>> >                              const struct flow_dissector_key *key,
>> >                              unsigned int key_count);
>> >
>> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
>> > +                                      struct bpf_prog *prog);
>> > +
>> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
>> > +
>> >  bool __skb_flow_dissect(const struct sk_buff *skb,
>> >                         struct flow_dissector *flow_dissector,
>> >                         void *target_container,
>> > diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>> > index 6a4586dcdede..edb919d320c1 100644
>> > --- a/include/net/flow_dissector.h
>> > +++ b/include/net/flow_dissector.h
>> > @@ -270,6 +270,22 @@ __be32 flow_get_u32_dst(const struct flow_keys *flow);
>> >  extern struct flow_dissector flow_keys_dissector;
>> >  extern struct flow_dissector flow_keys_basic_dissector;
>> >
>> > +/* struct bpf_flow_dissect_cb:
>> > + *
>> > + * This struct is used to pass parameters to BPF programs of type
>> > + * BPF_PROG_TYPE_FLOW_DISSECTOR. Before such a program is run, the caller sets
>> > + * the control block of the skb to be a struct of this type. The first field is
>> > + * used to communicate the next header offset between the BPF programs and the
>> > + * first value of it is passed from the kernel. The last two fields are used for
>> > + * writing out flow keys.
>> > + */
>> > +struct bpf_flow_dissect_cb {
>> > +       u16 nhoff;
>> > +       u16 unused;
>> > +       void *target_container;
>> > +       struct flow_dissector *flow_dissector;
>> > +};
>> > +
>> >  /* struct flow_keys_digest:
>> >   *
>> >   * This structure is used to hold a digest of the full flow keys. This is a
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index 66917a4eba27..8bc0fdab685d 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
>> >         BPF_PROG_TYPE_LWT_SEG6LOCAL,
>> >         BPF_PROG_TYPE_LIRC_MODE2,
>> >         BPF_PROG_TYPE_SK_REUSEPORT,
>> > +       BPF_PROG_TYPE_FLOW_DISSECTOR,
>> >  };
>> >
>> >  enum bpf_attach_type {
>> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
>> >         BPF_CGROUP_UDP4_SENDMSG,
>> >         BPF_CGROUP_UDP6_SENDMSG,
>> >         BPF_LIRC_MODE2,
>> > +       BPF_FLOW_DISSECTOR,
>> >         __MAX_BPF_ATTACH_TYPE
>> >  };
>> >
>> > @@ -2141,6 +2143,15 @@ union bpf_attr {
>> >   *             request in the skb.
>> >   *     Return
>> >   *             0 on success, or a negative error in case of failure.
>> > + *
>> > + * int bpf_flow_dissector_write_keys(const struct sk_buff *skb, const void *from, u32 len, enum flow_dissector_key_id key_id)
>> > + *     Description
>> > + *             Try to write *len* bytes from the source pointer into the offset
>> > + *             of the key with id *key_id*. If *len* is different from the
>> > + *             size of the key, an error is returned. If the key is not used,
>> > + *             this function exits with no effect and code 0.
>> > + *     Return
>> > + *             0 on success, negative error in case of failure.
>> >   */
>> >  #define __BPF_FUNC_MAPPER(FN)          \
>> >         FN(unspec),                     \
>> > @@ -2226,7 +2237,8 @@ union bpf_attr {
>> >         FN(get_current_cgroup_id),      \
>> >         FN(get_local_storage),          \
>> >         FN(sk_select_reuseport),        \
>> > -       FN(skb_ancestor_cgroup_id),
>> > +       FN(skb_ancestor_cgroup_id),     \
>> > +       FN(flow_dissector_write_keys),
>> >
>> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> >   * function eBPF program intends to call
>> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> > index 43727ed0d94a..a06568841a92 100644
>> > --- a/kernel/bpf/syscall.c
>> > +++ b/kernel/bpf/syscall.c
>> > @@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>> >         case BPF_LIRC_MODE2:
>> >                 ptype = BPF_PROG_TYPE_LIRC_MODE2;
>> >                 break;
>> > +       case BPF_FLOW_DISSECTOR:
>> > +               ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
>> > +               break;
>> >         default:
>> >                 return -EINVAL;
>> >         }
>> > @@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>> >         case BPF_PROG_TYPE_LIRC_MODE2:
>> >                 ret = lirc_prog_attach(attr, prog);
>> >                 break;
>> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>> > +               ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
>> > +               break;
>> >         default:
>> >                 ret = cgroup_bpf_prog_attach(attr, ptype, prog);
>> >         }
>> > @@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>> >                 return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
>> >         case BPF_LIRC_MODE2:
>> >                 return lirc_prog_detach(attr);
>> > +       case BPF_FLOW_DISSECTOR:
>> > +               return skb_flow_dissector_bpf_prog_detach(attr);
>> >         default:
>> >                 return -EINVAL;
>> >         }
>> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> > index ca90679a7fe5..6d3f268fa8e0 100644
>> > --- a/kernel/bpf/verifier.c
>> > +++ b/kernel/bpf/verifier.c
>> > @@ -1321,6 +1321,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>> >         case BPF_PROG_TYPE_LWT_XMIT:
>> >         case BPF_PROG_TYPE_SK_SKB:
>> >         case BPF_PROG_TYPE_SK_MSG:
>> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>> >                 if (meta)
>> >                         return meta->pkt_access;
>> >
>> > @@ -3976,6 +3977,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>> >         case BPF_PROG_TYPE_SOCKET_FILTER:
>> >         case BPF_PROG_TYPE_SCHED_CLS:
>> >         case BPF_PROG_TYPE_SCHED_ACT:
>> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>> >                 return true;
>> >         default:
>> >                 return false;
>> > diff --git a/net/core/filter.c b/net/core/filter.c
>> > index fd423ce3da34..03d3037e6508 100644
>> > --- a/net/core/filter.c
>> > +++ b/net/core/filter.c
>> > @@ -4820,6 +4820,111 @@ bool bpf_helper_changes_pkt_data(void *func)
>> >         return false;
>> >  }
>> >
>> > +BPF_CALL_4(bpf_flow_dissector_write_keys, const struct sk_buff *, skb,
>> > +          const void *, from, u32, len, enum flow_dissector_key_id, key_id)
>> > +{
>> > +       struct bpf_flow_dissect_cb *cb;
>> > +       void *dest;
>> > +
>> > +       cb = (struct bpf_flow_dissect_cb *)bpf_skb_cb(skb);
>> > +
>> > +       /* Make sure the dissector actually uses the key. It is not an error if
>> > +        * it does not, but we should not continue past this point in that case
>> > +        */
>> > +       if (!dissector_uses_key(cb->flow_dissector, key_id))
>> > +               return 0;
>> > +
>> > +       /* Make sure the length is correct */
>> > +       switch (key_id) {
>> > +       case FLOW_DISSECTOR_KEY_CONTROL:
>> > +       case FLOW_DISSECTOR_KEY_ENC_CONTROL:
>> > +               if (len != sizeof(struct flow_dissector_key_control))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_BASIC:
>> > +               if (len != sizeof(struct flow_dissector_key_basic))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
>> > +       case FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS:
>> > +               if (len != sizeof(struct flow_dissector_key_ipv4_addrs))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
>> > +       case FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS:
>> > +               if (len != sizeof(struct flow_dissector_key_ipv6_addrs))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_ICMP:
>> > +               if (len != sizeof(struct flow_dissector_key_icmp))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_PORTS:
>> > +       case FLOW_DISSECTOR_KEY_ENC_PORTS:
>> > +               if (len != sizeof(struct flow_dissector_key_ports))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_ETH_ADDRS:
>> > +               if (len != sizeof(struct flow_dissector_key_eth_addrs))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_TIPC:
>> > +               if (len != sizeof(struct flow_dissector_key_tipc))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_ARP:
>> > +               if (len != sizeof(struct flow_dissector_key_arp))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_VLAN:
>> > +       case FLOW_DISSECTOR_KEY_CVLAN:
>> > +               if (len != sizeof(struct flow_dissector_key_vlan))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_FLOW_LABEL:
>> > +               if (len != sizeof(struct flow_dissector_key_tags))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_GRE_KEYID:
>> > +       case FLOW_DISSECTOR_KEY_ENC_KEYID:
>> > +       case FLOW_DISSECTOR_KEY_MPLS_ENTROPY:
>> > +               if (len != sizeof(struct flow_dissector_key_keyid))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_MPLS:
>> > +               if (len != sizeof(struct flow_dissector_key_mpls))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_TCP:
>> > +               if (len != sizeof(struct flow_dissector_key_tcp))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       case FLOW_DISSECTOR_KEY_IP:
>> > +       case FLOW_DISSECTOR_KEY_ENC_IP:
>> > +               if (len != sizeof(struct flow_dissector_key_ip))
>> > +                       return -EINVAL;
>> > +               break;
>> > +       default:
>> > +               return -EINVAL;
>> > +       }
>> > +
>> > +       dest = skb_flow_dissector_target(cb->flow_dissector, key_id,
>> > +                                        cb->target_container);
>> > +
>> > +       memcpy(dest, from, len);
>> > +       return 0;
>> > +}
>> > +
>> > +static const struct bpf_func_proto bpf_flow_dissector_write_keys_proto = {
>> > +       .func           = bpf_flow_dissector_write_keys,
>> > +       .gpl_only       = false,
>> > +       .ret_type       = RET_INTEGER,
>> > +       .arg1_type      = ARG_PTR_TO_CTX,
>> > +       .arg2_type      = ARG_PTR_TO_MEM,
>> > +       .arg3_type      = ARG_CONST_SIZE,
>> > +       .arg4_type      = ARG_ANYTHING,
>> > +};
>> > +
>> >  static const struct bpf_func_proto *
>> >  bpf_base_func_proto(enum bpf_func_id func_id)
>> >  {
>> > @@ -5100,6 +5205,19 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> >         }
>> >  }
>> >
>> > +static const struct bpf_func_proto *
>> > +flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> > +{
>> > +       switch (func_id) {
>> > +       case BPF_FUNC_skb_load_bytes:
>> > +               return &bpf_skb_load_bytes_proto;
>> > +       case BPF_FUNC_flow_dissector_write_keys:
>> > +               return &bpf_flow_dissector_write_keys_proto;
>> > +       default:
>> > +               return bpf_base_func_proto(func_id);
>> > +       }
>> > +}
>> > +
>> >  static const struct bpf_func_proto *
>> >  lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> >  {
>> > @@ -5738,6 +5856,35 @@ static bool sk_msg_is_valid_access(int off, int size,
>> >         return true;
>> >  }
>> >
>> > +static bool flow_dissector_is_valid_access(int off, int size,
>> > +                                          enum bpf_access_type type,
>> > +                                          const struct bpf_prog *prog,
>> > +                                          struct bpf_insn_access_aux *info)
>> > +{
>> > +       if (type == BPF_WRITE) {
>> > +               switch (off) {
>> > +               case bpf_ctx_range(struct __sk_buff, cb[0]):
>> > +                       break;
>> > +               default:
>> > +                       return false;
>> > +               }
>> > +       }
>> > +
>> > +       switch (off) {
>> > +       case bpf_ctx_range(struct __sk_buff, data):
>> > +               info->reg_type = PTR_TO_PACKET;
>> > +               break;
>> > +       case bpf_ctx_range(struct __sk_buff, data_end):
>> > +               info->reg_type = PTR_TO_PACKET_END;
>> > +               break;
>> > +       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
>> > +       case bpf_ctx_range_till(struct __sk_buff, cb[1], cb[4]):
>> > +               return false;
>> > +       }
>> > +
>> > +       return bpf_skb_is_valid_access(off, size, type, prog, info);
>> > +}
>> > +
>> >  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>> >                                   const struct bpf_insn *si,
>> >                                   struct bpf_insn *insn_buf,
>> > @@ -6995,6 +7142,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
>> >  const struct bpf_prog_ops sk_msg_prog_ops = {
>> >  };
>> >
>> > +const struct bpf_verifier_ops flow_dissector_verifier_ops = {
>> > +       .get_func_proto         = flow_dissector_func_proto,
>> > +       .is_valid_access        = flow_dissector_is_valid_access,
>> > +       .convert_ctx_access     = bpf_convert_ctx_access,
>> > +       .gen_ld_abs             = bpf_gen_ld_abs,
>> > +};
>> > +
>> > +const struct bpf_prog_ops flow_dissector_prog_ops = {
>> > +};
>> > +
>> >  int sk_detach_filter(struct sock *sk)
>> >  {
>> >         int ret = -ENOENT;
>> > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> > index ce9eeeb7c024..767daa231f04 100644
>> > --- a/net/core/flow_dissector.c
>> > +++ b/net/core/flow_dissector.c
>> > @@ -25,6 +25,11 @@
>> >  #include <net/flow_dissector.h>
>> >  #include <scsi/fc/fc_fcoe.h>
>> >  #include <uapi/linux/batadv_packet.h>
>> > +#include <linux/bpf.h>
>> > +
>> > +/* BPF program accessible by all flow dissectors */
>> > +static struct bpf_prog __rcu *flow_dissector_prog;
>> > +static DEFINE_MUTEX(flow_dissector_mutex);
>> >
>> >  static void dissector_set_key(struct flow_dissector *flow_dissector,
>> >                               enum flow_dissector_key_id key_id)
>> > @@ -62,6 +67,40 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>> >  }
>> >  EXPORT_SYMBOL(skb_flow_dissector_init);
>> >
>> > +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
>> > +                                      struct bpf_prog *prog)
>> > +{
>> > +       struct bpf_prog *attached;
>> > +
>> > +       mutex_lock(&flow_dissector_mutex);
>> > +       attached = rcu_dereference_protected(flow_dissector_prog,
>> > +                                            lockdep_is_held(&flow_dissector_mutex));
>> > +       if (attached) {
>> > +               /* Only one BPF program can be attached at a time */
>> > +               mutex_unlock(&flow_dissector_mutex);
>> > +               return -EEXIST;
>> > +       }
>> > +       rcu_assign_pointer(flow_dissector_prog, prog);
>> > +       mutex_unlock(&flow_dissector_mutex);
>> > +       return 0;
>> > +}
>> > +
>> > +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
>> > +{
>> > +       struct bpf_prog *attached;
>> > +
>> > +       mutex_lock(&flow_dissector_mutex);
>> > +       attached = rcu_dereference_protected(flow_dissector_prog,
>> > +                                            lockdep_is_held(&flow_dissector_mutex));
>> > +       if (!flow_dissector_prog) {
>> > +               mutex_unlock(&flow_dissector_mutex);
>> > +               return -EINVAL;
>> > +       }
>> > +       bpf_prog_put(attached);
>> > +       RCU_INIT_POINTER(flow_dissector_prog, NULL);
>> > +       mutex_unlock(&flow_dissector_mutex);
>> > +       return 0;
>> > +}
>> >  /**
>> >   * skb_flow_get_be16 - extract be16 entity
>> >   * @skb: sk_buff to extract from
>> > @@ -619,6 +658,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>> >         struct flow_dissector_key_vlan *key_vlan;
>> >         enum flow_dissect_ret fdret;
>> >         enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
>> > +       struct bpf_prog *attached;
>> >         int num_hdrs = 0;
>> >         u8 ip_proto = 0;
>> >         bool ret;
>> > @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>> >                                               FLOW_DISSECTOR_KEY_BASIC,
>> >                                               target_container);
>> >
>> > +       rcu_read_lock();
>> > +       attached = rcu_dereference(flow_dissector_prog);
>> > +       if (attached) {
>> > +               /* Note that even though the const qualifier is discarded
>> > +                * throughout the execution of the BPF program, all changes(the
>> > +                * control block) are reverted after the BPF program returns.
>> > +                * Therefore, __skb_flow_dissect does not alter the skb.
>> > +                */
>> > +               struct bpf_flow_dissect_cb *cb;
>> > +               u8 cb_saved[BPF_SKB_CB_LEN];
>> > +               u32 result;
>> > +
>> > +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
>> > +
>> > +               /* Save Control Block */
>> > +               memcpy(cb_saved, cb, sizeof(cb_saved));
>> > +               memset(cb, 0, sizeof(cb_saved));
>> > +
>> > +               /* Pass parameters to the BPF program */
>> > +               cb->nhoff = nhoff;
>> > +               cb->target_container = target_container;
>> > +               cb->flow_dissector = flow_dissector;
>> > +
>> > +               bpf_compute_data_pointers((struct sk_buff *)skb);
>> > +               result = BPF_PROG_RUN(attached, skb);
>> > +
>> > +               /* Restore state */
>> > +               memcpy(cb, cb_saved, sizeof(cb_saved));
>> > +
>> > +               key_control->thoff = min_t(u16, key_control->thoff,
>> > +                                          skb ? skb->len : hlen);
>> > +               rcu_read_unlock();
>> > +               return result == BPF_OK;
>> > +       }
>>
>> If the BPF program cannot handle certain protocol, shall we fall back
>> to the built-in logic? Otherwise, all BPF programs need to have some
>> code for all protocols.
>>
>> Song
>
> I believe that if we fall back to the built-in logic we lose all security
> guarantees from BPF and this is why the code does not support
> fall back.
>
> Petar

I am not really sure we are on the same page. I am proposing 3
different return values from BPF_PROG_RUN(), and they should be
handled as

1. result == BPF_OK                      => return true;
2. result == BPF_DROP                 => return false;
3. result == something else            => fall back.

Does this proposal make any sense?

Thanks,
Song

>>
>>
>> > +       rcu_read_unlock();
>> > +
>> >         if (dissector_uses_key(flow_dissector,
>> >                                FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
>> >                 struct ethhdr *eth = eth_hdr(skb);
>> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
>> > index dce960d22106..b1cd3bc8db70 100644
>> > --- a/tools/bpf/bpftool/prog.c
>> > +++ b/tools/bpf/bpftool/prog.c
>> > @@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
>> >         [BPF_PROG_TYPE_RAW_TRACEPOINT]  = "raw_tracepoint",
>> >         [BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
>> >         [BPF_PROG_TYPE_LIRC_MODE2]      = "lirc_mode2",
>> > +       [BPF_PROG_TYPE_FLOW_DISSECTOR]  = "flow_dissector",
>> >  };
>> >
>> >  static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
>> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> > index 66917a4eba27..acd74a0dd063 100644
>> > --- a/tools/include/uapi/linux/bpf.h
>> > +++ b/tools/include/uapi/linux/bpf.h
>> > @@ -152,6 +152,7 @@ enum bpf_prog_type {
>> >         BPF_PROG_TYPE_LWT_SEG6LOCAL,
>> >         BPF_PROG_TYPE_LIRC_MODE2,
>> >         BPF_PROG_TYPE_SK_REUSEPORT,
>> > +       BPF_PROG_TYPE_FLOW_DISSECTOR,
>> >  };
>> >
>> >  enum bpf_attach_type {
>> > @@ -172,6 +173,7 @@ enum bpf_attach_type {
>> >         BPF_CGROUP_UDP4_SENDMSG,
>> >         BPF_CGROUP_UDP6_SENDMSG,
>> >         BPF_LIRC_MODE2,
>> > +       BPF_FLOW_DISSECTOR,
>> >         __MAX_BPF_ATTACH_TYPE
>> >  };
>> >
>> > @@ -2226,7 +2228,8 @@ union bpf_attr {
>> >         FN(get_current_cgroup_id),      \
>> >         FN(get_local_storage),          \
>> >         FN(sk_select_reuseport),        \
>> > -       FN(skb_ancestor_cgroup_id),
>> > +       FN(skb_ancestor_cgroup_id),     \
>> > +       FN(flow_dissector_write_keys),
>> >
>> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> >   * function eBPF program intends to call
>> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> > index 2abd0f112627..0c749ce1b717 100644
>> > --- a/tools/lib/bpf/libbpf.c
>> > +++ b/tools/lib/bpf/libbpf.c
>> > @@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
>> >         case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>> >         case BPF_PROG_TYPE_LIRC_MODE2:
>> >         case BPF_PROG_TYPE_SK_REUSEPORT:
>> > +       case BPF_PROG_TYPE_FLOW_DISSECTOR:
>> >                 return false;
>> >         case BPF_PROG_TYPE_UNSPEC:
>> >         case BPF_PROG_TYPE_KPROBE:
>> > @@ -2121,6 +2122,7 @@ static const struct {
>> >         BPF_PROG_SEC("sk_skb",          BPF_PROG_TYPE_SK_SKB),
>> >         BPF_PROG_SEC("sk_msg",          BPF_PROG_TYPE_SK_MSG),
>> >         BPF_PROG_SEC("lirc_mode2",      BPF_PROG_TYPE_LIRC_MODE2),
>> > +       BPF_PROG_SEC("flow_dissector",  BPF_PROG_TYPE_FLOW_DISSECTOR),
>> >         BPF_SA_PROG_SEC("cgroup/bind4", BPF_CGROUP_INET4_BIND),
>> >         BPF_SA_PROG_SEC("cgroup/bind6", BPF_CGROUP_INET6_BIND),
>> >         BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
>> > diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
>> > index e4be7730222d..4204c496a04f 100644
>> > --- a/tools/testing/selftests/bpf/bpf_helpers.h
>> > +++ b/tools/testing/selftests/bpf/bpf_helpers.h
>> > @@ -143,6 +143,9 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
>> >         (void *) BPF_FUNC_skb_cgroup_id;
>> >  static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
>> >         (void *) BPF_FUNC_skb_ancestor_cgroup_id;
>> > +static int (*bpf_flow_dissector_write_keys)(void *ctx, void *src, int len,
>> > +                                           int key) =
>> > +       (void *) BPF_FUNC_flow_dissector_write_keys;
>> >
>> >  /* llvm builtin functions that eBPF C program may use to
>> >   * emit BPF_LD_ABS and BPF_LD_IND instructions
>> > --
>> > 2.18.0.865.gffc8e1a3cd6-goog
>> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-20  5:44       ` Song Liu
@ 2018-08-20 14:13         ` Willem de Bruijn
  2018-08-20 14:20           ` Daniel Borkmann
  2018-08-20 16:04           ` Song Liu
  0 siblings, 2 replies; 20+ messages in thread
From: Willem de Bruijn @ 2018-08-20 14:13 UTC (permalink / raw)
  To: liu.song.a23
  Cc: Petar Penkov, Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Mon, Aug 20, 2018 at 1:47 AM Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Thu, Aug 16, 2018 at 4:14 PM, Petar Penkov <ppenkov@google.com> wrote:
> > On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
> >>
> >> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
> >> > From: Petar Penkov <ppenkov@google.com>
> >> >
> >> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> >> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> >> > path. The BPF program is kept as a global variable so it is
> >> > accessible to all flow dissectors.
> >> >
> >> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> >> > Signed-off-by: Willem de Bruijn <willemb@google.com>

> >> > @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >> >                                               FLOW_DISSECTOR_KEY_BASIC,
> >> >                                               target_container);
> >> >
> >> > +       rcu_read_lock();
> >> > +       attached = rcu_dereference(flow_dissector_prog);
> >> > +       if (attached) {
> >> > +               /* Note that even though the const qualifier is discarded
> >> > +                * throughout the execution of the BPF program, all changes(the
> >> > +                * control block) are reverted after the BPF program returns.
> >> > +                * Therefore, __skb_flow_dissect does not alter the skb.
> >> > +                */
> >> > +               struct bpf_flow_dissect_cb *cb;
> >> > +               u8 cb_saved[BPF_SKB_CB_LEN];
> >> > +               u32 result;
> >> > +
> >> > +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
> >> > +
> >> > +               /* Save Control Block */
> >> > +               memcpy(cb_saved, cb, sizeof(cb_saved));
> >> > +               memset(cb, 0, sizeof(cb_saved));
> >> > +
> >> > +               /* Pass parameters to the BPF program */
> >> > +               cb->nhoff = nhoff;
> >> > +               cb->target_container = target_container;
> >> > +               cb->flow_dissector = flow_dissector;
> >> > +
> >> > +               bpf_compute_data_pointers((struct sk_buff *)skb);
> >> > +               result = BPF_PROG_RUN(attached, skb);
> >> > +
> >> > +               /* Restore state */
> >> > +               memcpy(cb, cb_saved, sizeof(cb_saved));
> >> > +
> >> > +               key_control->thoff = min_t(u16, key_control->thoff,
> >> > +                                          skb ? skb->len : hlen);
> >> > +               rcu_read_unlock();
> >> > +               return result == BPF_OK;
> >> > +       }
> >>
> >> If the BPF program cannot handle certain protocol, shall we fall back
> >> to the built-in logic? Otherwise, all BPF programs need to have some
> >> code for all protocols.
> >>
> >> Song
> >
> > I believe that if we fall back to the built-in logic we lose all security
> > guarantees from BPF and this is why the code does not support
> > fall back.
> >
> > Petar
>
> I am not really sure we are on the same page. I am proposing 3
> different return values from BPF_PROG_RUN(), and they should be
> handled as
>
> 1. result == BPF_OK                      => return true;
> 2. result == BPF_DROP                 => return false;
> 3. result == something else            => fall back.
>
> Does this proposal make any sense?
>
> Thanks,
> Song

It certainly makes sense. We debated it initially, as well.

In the short term, it allows for simpler BPF programs, as they can
off-load some protocols to the C implementation.

But the RFC patchset already implements most protocols in BPF.
I had not expected that when we started out.

Eventually, I think it is preferable to just deprecate the C
implementation. Which is not possible if we make this opt-out
a part of the BPF flow dissector interface.

There is also the lesser issue that a buggy BPF program might
accidentally pass the third value and unknowing open itself up
to the large attack surface. Without this option, the security
audit is much simpler.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-20 14:13         ` Willem de Bruijn
@ 2018-08-20 14:20           ` Daniel Borkmann
  2018-08-20 16:04           ` Song Liu
  1 sibling, 0 replies; 20+ messages in thread
From: Daniel Borkmann @ 2018-08-20 14:20 UTC (permalink / raw)
  To: Willem de Bruijn, liu.song.a23
  Cc: Petar Penkov, Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, simon.horman, Willem de Bruijn

On 08/20/2018 04:13 PM, Willem de Bruijn wrote:
> On Mon, Aug 20, 2018 at 1:47 AM Song Liu <liu.song.a23@gmail.com> wrote:
>> On Thu, Aug 16, 2018 at 4:14 PM, Petar Penkov <ppenkov@google.com> wrote:
>>> On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>>>> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
>>>>> From: Petar Penkov <ppenkov@google.com>
>>>>>
>>>>> Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
>>>>> attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
>>>>> path. The BPF program is kept as a global variable so it is
>>>>> accessible to all flow dissectors.
>>>>>
>>>>> Signed-off-by: Petar Penkov <ppenkov@google.com>
>>>>> Signed-off-by: Willem de Bruijn <willemb@google.com>
> 
>>>>> @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>>>>>                                               FLOW_DISSECTOR_KEY_BASIC,
>>>>>                                               target_container);
>>>>>
>>>>> +       rcu_read_lock();
>>>>> +       attached = rcu_dereference(flow_dissector_prog);
>>>>> +       if (attached) {
>>>>> +               /* Note that even though the const qualifier is discarded
>>>>> +                * throughout the execution of the BPF program, all changes(the
>>>>> +                * control block) are reverted after the BPF program returns.
>>>>> +                * Therefore, __skb_flow_dissect does not alter the skb.
>>>>> +                */
>>>>> +               struct bpf_flow_dissect_cb *cb;
>>>>> +               u8 cb_saved[BPF_SKB_CB_LEN];
>>>>> +               u32 result;
>>>>> +
>>>>> +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
>>>>> +
>>>>> +               /* Save Control Block */
>>>>> +               memcpy(cb_saved, cb, sizeof(cb_saved));
>>>>> +               memset(cb, 0, sizeof(cb_saved));
>>>>> +
>>>>> +               /* Pass parameters to the BPF program */
>>>>> +               cb->nhoff = nhoff;
>>>>> +               cb->target_container = target_container;
>>>>> +               cb->flow_dissector = flow_dissector;
>>>>> +
>>>>> +               bpf_compute_data_pointers((struct sk_buff *)skb);
>>>>> +               result = BPF_PROG_RUN(attached, skb);
>>>>> +
>>>>> +               /* Restore state */
>>>>> +               memcpy(cb, cb_saved, sizeof(cb_saved));
>>>>> +
>>>>> +               key_control->thoff = min_t(u16, key_control->thoff,
>>>>> +                                          skb ? skb->len : hlen);
>>>>> +               rcu_read_unlock();
>>>>> +               return result == BPF_OK;
>>>>> +       }
>>>>
>>>> If the BPF program cannot handle certain protocol, shall we fall back
>>>> to the built-in logic? Otherwise, all BPF programs need to have some
>>>> code for all protocols.
>>>>
>>>> Song
>>>
>>> I believe that if we fall back to the built-in logic we lose all security
>>> guarantees from BPF and this is why the code does not support
>>> fall back.
>>>
>>> Petar
>>
>> I am not really sure we are on the same page. I am proposing 3
>> different return values from BPF_PROG_RUN(), and they should be
>> handled as
>>
>> 1. result == BPF_OK                      => return true;
>> 2. result == BPF_DROP                 => return false;
>> 3. result == something else            => fall back.
>>
>> Does this proposal make any sense?
>>
>> Thanks,
>> Song
> 
> It certainly makes sense. We debated it initially, as well.
> 
> In the short term, it allows for simpler BPF programs, as they can
> off-load some protocols to the C implementation.
> 
> But the RFC patchset already implements most protocols in BPF.
> I had not expected that when we started out.
> 
> Eventually, I think it is preferable to just deprecate the C
> implementation. Which is not possible if we make this opt-out
> a part of the BPF flow dissector interface.

+1

> There is also the lesser issue that a buggy BPF program might
> accidentally pass the third value and unknowing open itself up
> to the large attack surface. Without this option, the security
> audit is much simpler.

Fully agree, I'm all for dropping such option.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook
  2018-08-20 14:13         ` Willem de Bruijn
  2018-08-20 14:20           ` Daniel Borkmann
@ 2018-08-20 16:04           ` Song Liu
  1 sibling, 0 replies; 20+ messages in thread
From: Song Liu @ 2018-08-20 16:04 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Petar Penkov, Petar Penkov, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Mon, Aug 20, 2018 at 7:13 AM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Mon, Aug 20, 2018 at 1:47 AM Song Liu <liu.song.a23@gmail.com> wrote:
>>
>> On Thu, Aug 16, 2018 at 4:14 PM, Petar Penkov <ppenkov@google.com> wrote:
>> > On Thu, Aug 16, 2018 at 3:40 PM, Song Liu <liu.song.a23@gmail.com> wrote:
>> >>
>> >> On Thu, Aug 16, 2018 at 9:44 AM, Petar Penkov <peterpenkov96@gmail.com> wrote:
>> >> > From: Petar Penkov <ppenkov@google.com>
>> >> >
>> >> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
>> >> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
>> >> > path. The BPF program is kept as a global variable so it is
>> >> > accessible to all flow dissectors.
>> >> >
>> >> > Signed-off-by: Petar Penkov <ppenkov@google.com>
>> >> > Signed-off-by: Willem de Bruijn <willemb@google.com>
>
>> >> > @@ -658,6 +698,42 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>> >> >                                               FLOW_DISSECTOR_KEY_BASIC,
>> >> >                                               target_container);
>> >> >
>> >> > +       rcu_read_lock();
>> >> > +       attached = rcu_dereference(flow_dissector_prog);
>> >> > +       if (attached) {
>> >> > +               /* Note that even though the const qualifier is discarded
>> >> > +                * throughout the execution of the BPF program, all changes(the
>> >> > +                * control block) are reverted after the BPF program returns.
>> >> > +                * Therefore, __skb_flow_dissect does not alter the skb.
>> >> > +                */
>> >> > +               struct bpf_flow_dissect_cb *cb;
>> >> > +               u8 cb_saved[BPF_SKB_CB_LEN];
>> >> > +               u32 result;
>> >> > +
>> >> > +               cb = (struct bpf_flow_dissect_cb *)(bpf_skb_cb((struct sk_buff *)skb));
>> >> > +
>> >> > +               /* Save Control Block */
>> >> > +               memcpy(cb_saved, cb, sizeof(cb_saved));
>> >> > +               memset(cb, 0, sizeof(cb_saved));
>> >> > +
>> >> > +               /* Pass parameters to the BPF program */
>> >> > +               cb->nhoff = nhoff;
>> >> > +               cb->target_container = target_container;
>> >> > +               cb->flow_dissector = flow_dissector;
>> >> > +
>> >> > +               bpf_compute_data_pointers((struct sk_buff *)skb);
>> >> > +               result = BPF_PROG_RUN(attached, skb);
>> >> > +
>> >> > +               /* Restore state */
>> >> > +               memcpy(cb, cb_saved, sizeof(cb_saved));
>> >> > +
>> >> > +               key_control->thoff = min_t(u16, key_control->thoff,
>> >> > +                                          skb ? skb->len : hlen);
>> >> > +               rcu_read_unlock();
>> >> > +               return result == BPF_OK;
>> >> > +       }
>> >>
>> >> If the BPF program cannot handle certain protocol, shall we fall back
>> >> to the built-in logic? Otherwise, all BPF programs need to have some
>> >> code for all protocols.
>> >>
>> >> Song
>> >
>> > I believe that if we fall back to the built-in logic we lose all security
>> > guarantees from BPF and this is why the code does not support
>> > fall back.
>> >
>> > Petar
>>
>> I am not really sure we are on the same page. I am proposing 3
>> different return values from BPF_PROG_RUN(), and they should be
>> handled as
>>
>> 1. result == BPF_OK                      => return true;
>> 2. result == BPF_DROP                 => return false;
>> 3. result == something else            => fall back.
>>
>> Does this proposal make any sense?
>>
>> Thanks,
>> Song
>
> It certainly makes sense. We debated it initially, as well.
>
> In the short term, it allows for simpler BPF programs, as they can
> off-load some protocols to the C implementation.
>
> But the RFC patchset already implements most protocols in BPF.
> I had not expected that when we started out.
>
> Eventually, I think it is preferable to just deprecate the C
> implementation. Which is not possible if we make this opt-out
> a part of the BPF flow dissector interface.
>
> There is also the lesser issue that a buggy BPF program might
> accidentally pass the third value and unknowing open itself up
> to the large attack surface. Without this option, the security
> audit is much simpler.

Thanks for the explanation. I didn't realize that the end goal is to
deprecate the C implementation.

Acked-by: Song Liu <songliubraving@fb.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
  2018-08-16 16:44 [bpf-next RFC 0/3] Introduce eBPF flow dissector Petar Penkov
                   ` (2 preceding siblings ...)
  2018-08-16 16:44 ` [bpf-next RFC 3/3] selftests/bpf: test bpf flow dissection Petar Penkov
@ 2018-08-20 20:52 ` Alexei Starovoitov
  2018-08-21  2:24   ` David Miller
  2018-08-22  0:19   ` Petar Penkov
  3 siblings, 2 replies; 20+ messages in thread
From: Alexei Starovoitov @ 2018-08-20 20:52 UTC (permalink / raw)
  To: Petar Penkov
  Cc: netdev, davem, ast, daniel, simon.horman, Petar Penkov, willemb

On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
> 
> This patch series hardens the RX stack by allowing flow dissection in BPF,
> as previously discussed [1]. Because of the rigorous checks of the BPF
> verifier, this provides significant security guarantees. In particular, the
> BPF flow dissector cannot get inside of an infinite loop, as with
> CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
> read outside of packet bounds, because all memory accesses are checked.
> Also, with BPF the administrator can decide which protocols to support,
> reducing potential attack surface. Rarely encountered protocols can be
> excluded from dissection and the program can be updated without kernel
> recompile or reboot if a bug is discovered.
> 
> Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
> This includes a new BPF program and attach type.
> 
> Patch 2 adds a flow dissector program in BPF. This parses most protocols in
> __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
> and address types).
> 
> Patch 3 adds a selftest that attaches the BPF program to the flow dissector
> and sends traffic with different levels of encapsulation.

Overall I fully support the direction. Few things to consider:

> This RFC patchset exposes a few design considerations:
> 
> 1/ Because the flow dissector key definitions live in
> include/linux/net/flow_dissector.h, they are not visible from userspace,
> and the flow keys definitions need to be copied in the BPF program.

I don't think copy-paste avoids the issue of uapi.
Anything used by BPF program is uapi.
The only exception is offsets of kernel internal structures
passed into bpf_probe_read().
So we have several options:
1. be honest and say 'struct flow_dissect_key*' is now uapi
2. wrap all of them into 'struct bpf_flow_dissect_key*' and do rewrites
  when/if 'struct flow_dissect_key*' changes
3. wait for BTF to solve it for tracing use case and for this one two.
The idea is that kernel internal structs can be defined in bpf prog
and since they will be described precisely in BTF that comes with the prog
the kernel can validate that prog's BTF matches what kernel thinks it has.
imo that's the most flexible, but BTF for all of vmlinux won't be ready
tomorrow and looks like this patch set is ready to go, so I would go with 1 or 2.

> 2/ An alternative to adding a new hook would have been to attach flow
> dissection programs at the XDP hook. Because this hook is executed before
> GRO, it would have to execute on every MSS, which would be more
> computationally expensive. Furthermore, the XDP hook is executed before an
> SKB has been allocated and there is no clear way to move the dissected keys
> into the SKB after it has been allocated. Eventually, perhaps a single pass
> can implement both GRO and flow dissection -- but napi_gro_cb shows that a
> lot more flow state would need to be parsed for this.

global flow_dissect bpf hook semantics are problematic for testing.
like patch 3 test affects the whole system including all containers.
should the hook be per-netns or may be per netdevice?

> 3/ The BPF program cannot use direct packet access everywhere because it
> uses an offset, initially supplied by the flow dissector.  Because the
> initial value of this non-constant offset comes from outside of the
> program, the verifier does not know what its value is, and it cannot verify
> that it is within packet bounds. Therefore, direct packet access programs
> get rejected.

this part doesn't seem to match the code.
direct packet access is allowed and usable even for fragmented skbs.
in such case only linear part of skb is in "direct access".

Last bit, I'm curious about, how the 'demo' flow dissector program
from patch 2 fairs vs in-kernel dissector when performance tested?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
  2018-08-20 20:52 ` [bpf-next RFC 0/3] Introduce eBPF flow dissector Alexei Starovoitov
@ 2018-08-21  2:24   ` David Miller
  2018-08-22  0:19   ` Petar Penkov
  1 sibling, 0 replies; 20+ messages in thread
From: David Miller @ 2018-08-21  2:24 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterpenkov96, netdev, ast, daniel, simon.horman, ppenkov,
	willemb

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Mon, 20 Aug 2018 13:52:07 -0700

> I don't think copy-paste avoids the issue of uapi.
> Anything used by BPF program is uapi.
> The only exception is offsets of kernel internal structures
> passed into bpf_probe_read().
> So we have several options:
> 1. be honest and say 'struct flow_dissect_key*' is now uapi
> 2. wrap all of them into 'struct bpf_flow_dissect_key*' and do rewrites
>   when/if 'struct flow_dissect_key*' changes
> 3. wait for BTF to solve it for tracing use case and for this one two.
 ...
> The idea is that kernel internal structs can be defined in bpf prog
> and since they will be described precisely in BTF that comes with the prog
> the kernel can validate that prog's BTF matches what kernel thinks it has.
> imo that's the most flexible, but BTF for all of vmlinux won't be ready
> tomorrow and looks like this patch set is ready to go, so I would go with 1 or 2.

I would definitely prefer #2 or #3.

I personally would like to see us avoid preventing interesting
optimizations of the flow key layout and/or accesses in the future.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
  2018-08-20 20:52 ` [bpf-next RFC 0/3] Introduce eBPF flow dissector Alexei Starovoitov
  2018-08-21  2:24   ` David Miller
@ 2018-08-22  0:19   ` Petar Penkov
  2018-08-22  7:22     ` Daniel Borkmann
  1 sibling, 1 reply; 20+ messages in thread
From: Petar Penkov @ 2018-08-22  0:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Mon, Aug 20, 2018 at 1:52 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
>> From: Petar Penkov <ppenkov@google.com>
>>
>> This patch series hardens the RX stack by allowing flow dissection in BPF,
>> as previously discussed [1]. Because of the rigorous checks of the BPF
>> verifier, this provides significant security guarantees. In particular, the
>> BPF flow dissector cannot get inside of an infinite loop, as with
>> CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
>> read outside of packet bounds, because all memory accesses are checked.
>> Also, with BPF the administrator can decide which protocols to support,
>> reducing potential attack surface. Rarely encountered protocols can be
>> excluded from dissection and the program can be updated without kernel
>> recompile or reboot if a bug is discovered.
>>
>> Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
>> This includes a new BPF program and attach type.
>>
>> Patch 2 adds a flow dissector program in BPF. This parses most protocols in
>> __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
>> and address types).
>>
>> Patch 3 adds a selftest that attaches the BPF program to the flow dissector
>> and sends traffic with different levels of encapsulation.
>
> Overall I fully support the direction. Few things to consider:
>
>> This RFC patchset exposes a few design considerations:
>>
>> 1/ Because the flow dissector key definitions live in
>> include/linux/net/flow_dissector.h, they are not visible from userspace,
>> and the flow keys definitions need to be copied in the BPF program.
>
> I don't think copy-paste avoids the issue of uapi.
> Anything used by BPF program is uapi.
> The only exception is offsets of kernel internal structures
> passed into bpf_probe_read().
> So we have several options:
> 1. be honest and say 'struct flow_dissect_key*' is now uapi
> 2. wrap all of them into 'struct bpf_flow_dissect_key*' and do rewrites
>   when/if 'struct flow_dissect_key*' changes
> 3. wait for BTF to solve it for tracing use case and for this one two.
> The idea is that kernel internal structs can be defined in bpf prog
> and since they will be described precisely in BTF that comes with the prog
> the kernel can validate that prog's BTF matches what kernel thinks it has.
> imo that's the most flexible, but BTF for all of vmlinux won't be ready
> tomorrow and looks like this patch set is ready to go, so I would go with 1 or 2.

I will pursue #2 then as both you and David are in support of it.

>
>> 2/ An alternative to adding a new hook would have been to attach flow
>> dissection programs at the XDP hook. Because this hook is executed before
>> GRO, it would have to execute on every MSS, which would be more
>> computationally expensive. Furthermore, the XDP hook is executed before an
>> SKB has been allocated and there is no clear way to move the dissected keys
>> into the SKB after it has been allocated. Eventually, perhaps a single pass
>> can implement both GRO and flow dissection -- but napi_gro_cb shows that a
>> lot more flow state would need to be parsed for this.
>
> global flow_dissect bpf hook semantics are problematic for testing.
> like patch 3 test affects the whole system including all containers.
> should the hook be per-netns or may be per netdevice?
>

Having the hook be per-netns would definitely be cleaner for testing.
I will look into
refactoring the hook from global to per-netns.

>> 3/ The BPF program cannot use direct packet access everywhere because it
>> uses an offset, initially supplied by the flow dissector.  Because the
>> initial value of this non-constant offset comes from outside of the
>> program, the verifier does not know what its value is, and it cannot verify
>> that it is within packet bounds. Therefore, direct packet access programs
>> get rejected.
>
> this part doesn't seem to match the code.
> direct packet access is allowed and usable even for fragmented skbs.
> in such case only linear part of skb is in "direct access".

I am not sure I understand. What I meant was that I use bpf_skb_load_bytes
rather than direct packet access because the offset at which I read headers,
nhoff, depends on an initial value that cannot be statically verified - namely
what __skb_flow_dissect provides. Is there an alternative approach I should
be taking here, and/or am I misunderstanding direct access?

>
> Last bit, I'm curious about, how the 'demo' flow dissector program
> from patch 2 fairs vs in-kernel dissector when performance tested?
>

I used the test in patch 3 to compare the two implementations and the
difference seemed to be small, with the BPF flow dissector being  slightly
slower.

Thank you so much for your feedback, Alexei and David!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
  2018-08-22  0:19   ` Petar Penkov
@ 2018-08-22  7:22     ` Daniel Borkmann
  2018-08-22  7:28       ` Daniel Borkmann
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Borkmann @ 2018-08-22  7:22 UTC (permalink / raw)
  To: Petar Penkov, Alexei Starovoitov
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	simon.horman, Willem de Bruijn

On 08/22/2018 02:19 AM, Petar Penkov wrote:
> On Mon, Aug 20, 2018 at 1:52 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
>>> From: Petar Penkov <ppenkov@google.com>
[...]
>>> 3/ The BPF program cannot use direct packet access everywhere because it
>>> uses an offset, initially supplied by the flow dissector.  Because the
>>> initial value of this non-constant offset comes from outside of the
>>> program, the verifier does not know what its value is, and it cannot verify
>>> that it is within packet bounds. Therefore, direct packet access programs
>>> get rejected.
>>
>> this part doesn't seem to match the code.
>> direct packet access is allowed and usable even for fragmented skbs.
>> in such case only linear part of skb is in "direct access".
> 
> I am not sure I understand. What I meant was that I use bpf_skb_load_bytes
> rather than direct packet access because the offset at which I read headers,
> nhoff, depends on an initial value that cannot be statically verified - namely
> what __skb_flow_dissect provides. Is there an alternative approach I should
> be taking here, and/or am I misunderstanding direct access?

You can still use direct packet access with it, the only thing you would
need to make sure is that the initial offset is bounded (e.g. test if
larger than some const and then drop the packet, or '& <const>') so that
the verifier can make sure the alu op won't cause overflow, then you can
add this to pkt_data, and later on open an access range with the usual test
like pkt_data' + <const> > pkt_end.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
  2018-08-22  7:22     ` Daniel Borkmann
@ 2018-08-22  7:28       ` Daniel Borkmann
  2018-08-23  0:10         ` Petar Penkov
  0 siblings, 1 reply; 20+ messages in thread
From: Daniel Borkmann @ 2018-08-22  7:28 UTC (permalink / raw)
  To: Petar Penkov, Alexei Starovoitov
  Cc: Petar Penkov, Networking, David S . Miller, Alexei Starovoitov,
	simon.horman, Willem de Bruijn

"On 08/22/2018 09:22 AM, Daniel Borkmann wrote:
> On 08/22/2018 02:19 AM, Petar Penkov wrote:
>> On Mon, Aug 20, 2018 at 1:52 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
>>>> From: Petar Penkov <ppenkov@google.com>
> [...]
>>>> 3/ The BPF program cannot use direct packet access everywhere because it
>>>> uses an offset, initially supplied by the flow dissector.  Because the
>>>> initial value of this non-constant offset comes from outside of the
>>>> program, the verifier does not know what its value is, and it cannot verify
>>>> that it is within packet bounds. Therefore, direct packet access programs
>>>> get rejected.
>>>
>>> this part doesn't seem to match the code.
>>> direct packet access is allowed and usable even for fragmented skbs.
>>> in such case only linear part of skb is in "direct access".
>>
>> I am not sure I understand. What I meant was that I use bpf_skb_load_bytes
>> rather than direct packet access because the offset at which I read headers,
>> nhoff, depends on an initial value that cannot be statically verified - namely
>> what __skb_flow_dissect provides. Is there an alternative approach I should
>> be taking here, and/or am I misunderstanding direct access?
> 
> You can still use direct packet access with it, the only thing you would
> need to make sure is that the initial offset is bounded (e.g. test if
> larger than some const and then drop the packet, or '& <const>') so that
> the verifier can make sure the alu op won't cause overflow, then you can
> add this to pkt_data, and later on open an access range with the usual test
> like pkt_data' + <const> > pkt_end.

And for non-linear data, you could use the bpf_skb_pull_data() helper as
we have in tc/BPF case 36bbef52c7eb ("bpf: direct packet write and access
for helpers for clsact progs") to pull it into linear area and make it
accessible for direct packet access.

> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [bpf-next RFC 0/3] Introduce eBPF flow dissector
  2018-08-22  7:28       ` Daniel Borkmann
@ 2018-08-23  0:10         ` Petar Penkov
  0 siblings, 0 replies; 20+ messages in thread
From: Petar Penkov @ 2018-08-23  0:10 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Petar Penkov, Networking, David S . Miller,
	Alexei Starovoitov, simon.horman, Willem de Bruijn

On Wed, Aug 22, 2018 at 12:28 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> "On 08/22/2018 09:22 AM, Daniel Borkmann wrote:
>> On 08/22/2018 02:19 AM, Petar Penkov wrote:
>>> On Mon, Aug 20, 2018 at 1:52 PM, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Thu, Aug 16, 2018 at 09:44:20AM -0700, Petar Penkov wrote:
>>>>> From: Petar Penkov <ppenkov@google.com>
>> [...]
>>>>> 3/ The BPF program cannot use direct packet access everywhere because it
>>>>> uses an offset, initially supplied by the flow dissector.  Because the
>>>>> initial value of this non-constant offset comes from outside of the
>>>>> program, the verifier does not know what its value is, and it cannot verify
>>>>> that it is within packet bounds. Therefore, direct packet access programs
>>>>> get rejected.
>>>>
>>>> this part doesn't seem to match the code.
>>>> direct packet access is allowed and usable even for fragmented skbs.
>>>> in such case only linear part of skb is in "direct access".
>>>
>>> I am not sure I understand. What I meant was that I use bpf_skb_load_bytes
>>> rather than direct packet access because the offset at which I read headers,
>>> nhoff, depends on an initial value that cannot be statically verified - namely
>>> what __skb_flow_dissect provides. Is there an alternative approach I should
>>> be taking here, and/or am I misunderstanding direct access?
>>
>> You can still use direct packet access with it, the only thing you would
>> need to make sure is that the initial offset is bounded (e.g. test if
>> larger than some const and then drop the packet, or '& <const>') so that
>> the verifier can make sure the alu op won't cause overflow, then you can
>> add this to pkt_data, and later on open an access range with the usual test
>> like pkt_data' + <const> > pkt_end.
>
> And for non-linear data, you could use the bpf_skb_pull_data() helper as
> we have in tc/BPF case 36bbef52c7eb ("bpf: direct packet write and access
> for helpers for clsact progs") to pull it into linear area and make it
> accessible for direct packet access.
>
>> Thanks,
>> Daniel

Thanks for the clarification! With direct packet access the flow
dissector in patch 2
is as fast as the in-kernel flow dissector when tested with the test in patch 3.

To bound the initial offset and use direct access I check if the
initial offset is larger
than 1500. This is sufficient for the verifier but I was wondering if there is a
better constant to use.

Thanks once again for your feedback,
Petar

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-08-23  3:37 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-08-16 16:44 [bpf-next RFC 0/3] Introduce eBPF flow dissector Petar Penkov
2018-08-16 16:44 ` [bpf-next RFC 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
2018-08-16 18:34   ` Edward Cree
2018-08-16 22:37     ` Willem de Bruijn
2018-08-16 22:40   ` Song Liu
2018-08-16 23:14     ` Petar Penkov
2018-08-20  5:44       ` Song Liu
2018-08-20 14:13         ` Willem de Bruijn
2018-08-20 14:20           ` Daniel Borkmann
2018-08-20 16:04           ` Song Liu
2018-08-16 16:44 ` [bpf-next RFC 2/3] flow_dissector: implements eBPF parser Petar Penkov
2018-08-18 15:50   ` Tom Herbert
2018-08-18 19:49     ` Willem de Bruijn
2018-08-16 16:44 ` [bpf-next RFC 3/3] selftests/bpf: test bpf flow dissection Petar Penkov
2018-08-20 20:52 ` [bpf-next RFC 0/3] Introduce eBPF flow dissector Alexei Starovoitov
2018-08-21  2:24   ` David Miller
2018-08-22  0:19   ` Petar Penkov
2018-08-22  7:22     ` Daniel Borkmann
2018-08-22  7:28       ` Daniel Borkmann
2018-08-23  0:10         ` Petar Penkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).