* [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation
@ 2016-11-29 13:21 Thomas Graf
2016-11-29 13:21 ` [PATCH net-next v3 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Thomas Graf @ 2016-11-29 13:21 UTC (permalink / raw)
To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
skb as context. Programs attached to input and output are read-only.
Programs attached to lwtunnel_xmit() can modify packet content, push
headers, and redirect packets.
The facility can be used to:
- Collect statistics and generate sampling data for a subset of traffic
based on the dst utilized by the packet, thus allowing the existing
realms to be extended.
- Apply additional per route/dst filters to prohibit certain outgoing
or incoming packets based on BPF filters. In particular, this makes it
possible to maintain custom per-dst state across multiple packets in
BPF maps and to apply filters based on statistics and behaviour
observed over time.
- Attachment of L2 headers at transmit where resolving the L2 address
is not required.
- Possibly many more.
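As an example, the lwt_len_hist sample in patch 4/4 attaches a program
to the output hook of a route as follows (this relies on the
corresponding iproute2 support for the encap bpf keyword; device and
address are the ones used by the sample script):

  ip route add 192.168.253.2/32 encap bpf out obj lwt_len_hist_kern.o \
     section len_hist dev tst_lwt1a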
v2 -> v3:
- Added a real-world sample, lwt_len_hist_kern.c, which demonstrates
how to collect a histogram of packet sizes for all packets flowing
through a number of routes.
- Restricted output to be read-only. Since the header can no longer
be modified, the rerouting functionality has been removed again.
- Added a test case which covers destructive modification of packet data.
v1 -> v2:
- Added a new BPF_LWT_REROUTE return code for programs to indicate
that a new route lookup should be performed. Suggested by Tom.
- New sample to illustrate rerouting
- New patch 05: Recursion limit for lwtunnel_output for the case
when a user creates a circular dst redirection. Also resolves the
issue for ILA.
- Fix to ensure that headroom for a potential future L2 header is
still guaranteed
Thomas Graf (4):
route: Set orig_output when redirecting to lwt on locally generated
traffic
route: Set lwtstate for local traffic and cached input dsts
bpf: BPF for lightweight tunnel infrastructure
bpf: Add tests and samples for LWT-BPF
include/linux/filter.h | 2 +-
include/uapi/linux/bpf.h | 32 +++-
include/uapi/linux/lwtunnel.h | 23 +++
kernel/bpf/verifier.c | 14 +-
net/Kconfig | 8 +
net/core/Makefile | 1 +
net/core/filter.c | 148 ++++++++++++++-
net/core/lwt_bpf.c | 397 ++++++++++++++++++++++++++++++++++++++++
net/core/lwtunnel.c | 2 +
net/ipv4/route.c | 37 ++--
samples/bpf/Makefile | 4 +
samples/bpf/bpf_helpers.h | 4 +
samples/bpf/lwt_len_hist.sh | 37 ++++
samples/bpf/lwt_len_hist_kern.c | 82 +++++++++
samples/bpf/lwt_len_hist_user.c | 76 ++++++++
samples/bpf/test_lwt_bpf.c | 247 +++++++++++++++++++++++++
samples/bpf/test_lwt_bpf.sh | 385 ++++++++++++++++++++++++++++++++++++++
17 files changed, 1482 insertions(+), 17 deletions(-)
create mode 100644 net/core/lwt_bpf.c
create mode 100755 samples/bpf/lwt_len_hist.sh
create mode 100644 samples/bpf/lwt_len_hist_kern.c
create mode 100644 samples/bpf/lwt_len_hist_user.c
create mode 100644 samples/bpf/test_lwt_bpf.c
create mode 100755 samples/bpf/test_lwt_bpf.sh
--
2.7.4
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH net-next v3 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic
2016-11-29 13:21 [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-11-29 13:21 ` Thomas Graf
2016-11-29 13:21 ` [PATCH net-next v3 2/4] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Thomas Graf @ 2016-11-29 13:21 UTC (permalink / raw)
To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.
Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
net/ipv4/route.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d37fc6f..0c39b62 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2154,8 +2154,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
- if (lwtunnel_output_redirect(rth->dst.lwtstate))
+ if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+ rth->dst.lwtstate->orig_output = rth->dst.output;
rth->dst.output = lwtunnel_output;
+ }
return rth;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH net-next v3 2/4] route: Set lwtstate for local traffic and cached input dsts
2016-11-29 13:21 [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
2016-11-29 13:21 ` [PATCH net-next v3 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
@ 2016-11-29 13:21 ` Thomas Graf
2016-11-29 13:21 ` [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure Thomas Graf
` (2 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Thomas Graf @ 2016-11-29 13:21 UTC (permalink / raw)
To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. lwt redirection for input must therefore
be done for dsts allocated in the output path as well.
Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.
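With this change, an input program attached to a local route will see
packets destined to a local address after they have traversed the
loopback device. The test script in patch 4/4 attaches such a program
as follows (address and section name are the ones used by the script):

  ip route add table local local 192.168.99.1/32 encap bpf headroom 14 \
     in obj test_lwt_bpf.o section test_ctx dev lo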
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
net/ipv4/route.c | 39 ++++++++++++++++++++++++++-------------
1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 0c39b62..f4c4d7a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1602,6 +1602,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
spin_unlock_bh(&fnhe_lock);
}
+static void set_lwt_redirect(struct rtable *rth)
+{
+ if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+ rth->dst.lwtstate->orig_output = rth->dst.output;
+ rth->dst.output = lwtunnel_output;
+ }
+
+ if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+ rth->dst.lwtstate->orig_input = rth->dst.input;
+ rth->dst.input = lwtunnel_input;
+ }
+}
+
/* called in rcu_read_lock() section */
static int __mkroute_input(struct sk_buff *skb,
const struct fib_result *res,
@@ -1691,14 +1704,7 @@ static int __mkroute_input(struct sk_buff *skb,
rth->dst.input = ip_forward;
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
- if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
- rth->dst.lwtstate->orig_output = rth->dst.output;
- rth->dst.output = lwtunnel_output;
- }
- if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
- rth->dst.lwtstate->orig_input = rth->dst.input;
- rth->dst.input = lwtunnel_input;
- }
+ set_lwt_redirect(rth);
skb_dst_set(skb, &rth->dst);
out:
err = 0;
@@ -1925,8 +1931,18 @@ out: return err;
rth->dst.error= -err;
rth->rt_flags &= ~RTCF_LOCAL;
}
+
if (do_cache) {
- if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
+ struct fib_nh *nh = &FIB_RES_NH(res);
+
+ rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+ if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+ WARN_ON(rth->dst.input == lwtunnel_input);
+ rth->dst.lwtstate->orig_input = rth->dst.input;
+ rth->dst.input = lwtunnel_input;
+ }
+
+ if (unlikely(!rt_cache_route(nh, rth))) {
rth->dst.flags |= DST_NOCACHE;
rt_add_uncached_list(rth);
}
@@ -2154,10 +2170,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
}
rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
- if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
- rth->dst.lwtstate->orig_output = rth->dst.output;
- rth->dst.output = lwtunnel_output;
- }
+ set_lwt_redirect(rth);
return rth;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-29 13:21 [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
2016-11-29 13:21 ` [PATCH net-next v3 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
2016-11-29 13:21 ` [PATCH net-next v3 2/4] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
@ 2016-11-29 13:21 ` Thomas Graf
2016-11-30 0:15 ` Alexei Starovoitov
2016-11-29 13:21 ` [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF Thomas Graf
2016-11-29 14:15 ` [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
4 siblings, 1 reply; 16+ messages in thread
From: Thomas Graf @ 2016-11-29 13:21 UTC (permalink / raw)
To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
Registers new BPF program types which correspond to the LWT hooks:
- BPF_PROG_TYPE_LWT_IN => dst_input()
- BPF_PROG_TYPE_LWT_OUT => dst_output()
- BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
The separate program types are required to differentiate between the
capabilities each LWT hook allows:
* Programs attached to dst_input() or dst_output() are restricted and
may only read the data of an skb. This prevents modification and
possible invalidation of already validated packet headers on receive
and the construction of illegal headers while the IP headers are
still being assembled.
* Programs attached to lwtunnel_xmit() are allowed to modify packet
content as well as to prepend an L2 header via the newly introduced
helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
after the IP header has been assembled completely.
All BPF programs receive an skb with L3 headers attached and may return
one of the following error codes:
BPF_OK - Continue routing as per nexthop
BPF_DROP - Drop skb and return EPERM
BPF_REDIRECT - Redirect skb to device as per redirect() helper.
(Only valid in lwtunnel_xmit() context)
The return codes are binary compatible with their TC_ACT_ counterparts
to allow existing SCHED_CLS and SCHED_ACT programs to be reused.
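As an illustration of the lwtunnel_xmit() capabilities, a program
which pushes an Ethernet header and redirects the skb could look
roughly like the condensed sketch below, based on the test program in
patch 4/4. SRC_MAC, DST_MAC and DST_IFINDEX are assumed to be passed
in via -D at compile time, as done by the test script:

  #include <stdint.h>
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <string.h>
  #include "bpf_helpers.h"

  SEC("push_l2")
  int do_push_l2(struct __sk_buff *skb)
  {
          uint64_t smac = SRC_MAC, dmac = DST_MAC;
          struct ethhdr ehdr;

          /* Reserve headroom for the L2 header in front of the L3 skb */
          if (bpf_skb_push(skb, sizeof(ehdr), 0) < 0)
                  return BPF_DROP;        /* drop, sender sees EPERM */

          ehdr.h_proto = __constant_htons(ETH_P_IP);
          memcpy(&ehdr.h_source, &smac, 6);
          memcpy(&ehdr.h_dest, &dmac, 6);
          if (bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0) < 0)
                  return BPF_DROP;

          if (bpf_redirect(DST_IFINDEX, 0) < 0)
                  return BPF_DROP;

          return BPF_REDIRECT;    /* skb_do_redirect() transmits the skb */
  }

  char _license[] SEC("license") = "GPL";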
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
include/linux/filter.h | 2 +-
include/uapi/linux/bpf.h | 32 +++-
include/uapi/linux/lwtunnel.h | 23 +++
kernel/bpf/verifier.c | 14 +-
net/Kconfig | 8 +
net/core/Makefile | 1 +
net/core/filter.c | 148 +++++++++++++++-
net/core/lwt_bpf.c | 397 ++++++++++++++++++++++++++++++++++++++++++
net/core/lwtunnel.c | 2 +
9 files changed, 621 insertions(+), 6 deletions(-)
create mode 100644 net/core/lwt_bpf.c
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7f246a2..7ba6446 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
};
/* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
*/
static inline void bpf_compute_data_end(struct sk_buff *skb)
{
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1370a9d..81c02c5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,6 +101,9 @@ enum bpf_prog_type {
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
+ BPF_PROG_TYPE_LWT_IN,
+ BPF_PROG_TYPE_LWT_OUT,
+ BPF_PROG_TYPE_LWT_XMIT,
};
enum bpf_attach_type {
@@ -409,6 +412,16 @@ union bpf_attr {
*
* int bpf_get_numa_node_id()
* Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ * Adds room to the beginning of the skb and adjusts the MAC header offset accordingly.
+ * Extends/reallocates the needed skb headroom automatically.
+ * May change skb data pointer and will thus invalidate any check done
+ * for direct packet access.
+ * @skb: pointer to skb
+ * @len: length of header to be pushed in front
+ * @flags: Flags (unused for now)
+ * Return: 0 on success or negative error
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -453,7 +466,8 @@ union bpf_attr {
FN(skb_pull_data), \
FN(csum_update), \
FN(set_hash_invalid), \
- FN(get_numa_node_id),
+ FN(get_numa_node_id), \
+ FN(skb_push),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -537,6 +551,22 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
};
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counterparts to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+ BPF_OK = 0,
+ /* 1 reserved */
+ BPF_DROP = 2,
+ /* 3-6 reserved */
+ BPF_REDIRECT = 7,
+ /* >127 are reserved for prog type specific return codes */
+};
+
/* User return codes for XDP prog type.
* A valid XDP program must return one of these defined values. All other
* return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index 453cc62..937f660 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -10,6 +10,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_ILA,
LWTUNNEL_ENCAP_IP6,
LWTUNNEL_ENCAP_SEG6,
+ LWTUNNEL_ENCAP_BPF,
__LWTUNNEL_ENCAP_MAX,
};
@@ -43,4 +44,26 @@ enum lwtunnel_ip6_t {
#define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
+enum {
+ LWT_BPF_PROG_UNSPEC,
+ LWT_BPF_PROG_FD,
+ LWT_BPF_PROG_NAME,
+ __LWT_BPF_PROG_MAX,
+};
+
+#define LWT_BPF_PROG_MAX (__LWT_BPF_PROG_MAX - 1)
+
+enum {
+ LWT_BPF_UNSPEC,
+ LWT_BPF_IN,
+ LWT_BPF_OUT,
+ LWT_BPF_XMIT,
+ LWT_BPF_XMIT_HEADROOM,
+ __LWT_BPF_MAX,
+};
+
+#define LWT_BPF_MAX (__LWT_BPF_MAX - 1)
+
+#define LWT_BPF_MAX_HEADROOM 128
+
#endif /* _UAPI_LWTUNNEL_H_ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8740c5f..8135cb1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -633,12 +633,19 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
#define MAX_PACKET_OFF 0xffff
static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
- const struct bpf_call_arg_meta *meta)
+ const struct bpf_call_arg_meta *meta,
+ enum bpf_access_type t)
{
switch (env->prog->type) {
+ case BPF_PROG_TYPE_LWT_IN:
+ case BPF_PROG_TYPE_LWT_OUT:
+ /* dst_input() and dst_output() can't write for now */
+ if (t == BPF_WRITE)
+ return false;
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
case BPF_PROG_TYPE_XDP:
+ case BPF_PROG_TYPE_LWT_XMIT:
if (meta)
return meta->pkt_access;
@@ -837,7 +844,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
err = check_stack_read(state, off, size, value_regno);
}
} else if (state->regs[regno].type == PTR_TO_PACKET) {
- if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL)) {
+ if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
verbose("cannot write into packet\n");
return -EACCES;
}
@@ -970,7 +977,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
return 0;
}
- if (type == PTR_TO_PACKET && !may_access_direct_pkt_data(env, meta)) {
+ if (type == PTR_TO_PACKET &&
+ !may_access_direct_pkt_data(env, meta, BPF_READ)) {
verbose("helper access to the packet is not allowed\n");
return -EACCES;
}
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..a100500 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -402,6 +402,14 @@ config LWTUNNEL
weight tunnel endpoint. Tunnel encapsulation parameters are stored
with light weight tunnel state associated with fib routes.
+config LWTUNNEL_BPF
+ bool "Execute BPF program as route nexthop action"
+ depends on LWTUNNEL
+ default y if LWTUNNEL=y
+ ---help---
+ Allows running BPF programs as a nexthop action following a route
+ lookup for incoming and outgoing packets.
+
config DST_CACHE
bool
default n
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..f6761b6 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
+obj-$(CONFIG_LWTUNNEL_BPF) += lwt_bpf.o
obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
obj-$(CONFIG_NET_DEVLINK) += devlink.o
diff --git a/net/core/filter.c b/net/core/filter.c
index 698a262..1dafe37 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2188,6 +2188,50 @@ static const struct bpf_func_proto bpf_skb_change_tail_proto = {
.arg3_type = ARG_ANYTHING,
};
+static bool len_fits_mtu(struct sk_buff *skb, u32 len)
+{
+ u32 new_len = skb->len + len;
+
+ if (skb_is_gso(skb))
+ return skb_gso_validate_mtu(skb, skb->dev->mtu - len);
+
+ return new_len <= skb->dev->mtu && new_len >= skb->len;
+}
+
+BPF_CALL_3(bpf_skb_push, struct sk_buff *, skb, __u32, len, u64, flags)
+{
+ if (!len_fits_mtu(skb, len))
+ return -ERANGE;
+
+ if (flags)
+ return -EINVAL;
+
+ if (len > 0) {
+ int ret;
+
+ ret = skb_cow(skb, len);
+ if (unlikely(ret < 0))
+ return ret;
+
+ __skb_push(skb, len);
+ memset(skb->data, 0, len);
+ }
+
+ skb_reset_mac_header(skb);
+ bpf_compute_data_end(skb);
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_skb_push_proto = {
+ .func = bpf_skb_push,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+};
+
bool bpf_helper_changes_skb_data(void *func)
{
if (func == bpf_skb_vlan_push ||
@@ -2197,7 +2241,8 @@ bool bpf_helper_changes_skb_data(void *func)
func == bpf_skb_change_tail ||
func == bpf_skb_pull_data ||
func == bpf_l3_csum_replace ||
- func == bpf_l4_csum_replace)
+ func == bpf_l4_csum_replace ||
+ func == bpf_skb_push)
return true;
return false;
@@ -2639,6 +2684,68 @@ cg_skb_func_proto(enum bpf_func_id func_id)
}
}
+static const struct bpf_func_proto *
+lwt_inout_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_skb_load_bytes:
+ return &bpf_skb_load_bytes_proto;
+ case BPF_FUNC_skb_pull_data:
+ return &bpf_skb_pull_data_proto;
+ case BPF_FUNC_csum_diff:
+ return &bpf_csum_diff_proto;
+ case BPF_FUNC_get_cgroup_classid:
+ return &bpf_get_cgroup_classid_proto;
+ case BPF_FUNC_get_route_realm:
+ return &bpf_get_route_realm_proto;
+ case BPF_FUNC_get_hash_recalc:
+ return &bpf_get_hash_recalc_proto;
+ case BPF_FUNC_perf_event_output:
+ return &bpf_skb_event_output_proto;
+ case BPF_FUNC_get_smp_processor_id:
+ return &bpf_get_smp_processor_id_proto;
+ case BPF_FUNC_skb_under_cgroup:
+ return &bpf_skb_under_cgroup_proto;
+ default:
+ return sk_filter_func_proto(func_id);
+ }
+}
+
+static const struct bpf_func_proto *
+lwt_xmit_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_skb_get_tunnel_key:
+ return &bpf_skb_get_tunnel_key_proto;
+ case BPF_FUNC_skb_set_tunnel_key:
+ return bpf_get_skb_set_tunnel_proto(func_id);
+ case BPF_FUNC_skb_get_tunnel_opt:
+ return &bpf_skb_get_tunnel_opt_proto;
+ case BPF_FUNC_skb_set_tunnel_opt:
+ return bpf_get_skb_set_tunnel_proto(func_id);
+ case BPF_FUNC_redirect:
+ return &bpf_redirect_proto;
+ case BPF_FUNC_clone_redirect:
+ return &bpf_clone_redirect_proto;
+ case BPF_FUNC_skb_change_tail:
+ return &bpf_skb_change_tail_proto;
+ case BPF_FUNC_skb_push:
+ return &bpf_skb_push_proto;
+ case BPF_FUNC_skb_store_bytes:
+ return &bpf_skb_store_bytes_proto;
+ case BPF_FUNC_csum_update:
+ return &bpf_csum_update_proto;
+ case BPF_FUNC_l3_csum_replace:
+ return &bpf_l3_csum_replace_proto;
+ case BPF_FUNC_l4_csum_replace:
+ return &bpf_l4_csum_replace_proto;
+ case BPF_FUNC_set_hash_invalid:
+ return &bpf_set_hash_invalid_proto;
+ default:
+ return lwt_inout_func_proto(func_id);
+ }
+}
+
static bool __is_valid_access(int off, int size, enum bpf_access_type type)
{
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -3007,6 +3114,27 @@ static const struct bpf_verifier_ops cg_skb_ops = {
.convert_ctx_access = sk_filter_convert_ctx_access,
};
+static const struct bpf_verifier_ops lwt_in_ops = {
+ .get_func_proto = lwt_inout_func_proto,
+ .is_valid_access = tc_cls_act_is_valid_access,
+ .convert_ctx_access = sk_filter_convert_ctx_access,
+ .gen_prologue = tc_cls_act_prologue,
+};
+
+static const struct bpf_verifier_ops lwt_out_ops = {
+ .get_func_proto = lwt_inout_func_proto,
+ .is_valid_access = tc_cls_act_is_valid_access,
+ .convert_ctx_access = sk_filter_convert_ctx_access,
+ .gen_prologue = tc_cls_act_prologue,
+};
+
+static const struct bpf_verifier_ops lwt_xmit_ops = {
+ .get_func_proto = lwt_xmit_func_proto,
+ .is_valid_access = tc_cls_act_is_valid_access,
+ .convert_ctx_access = sk_filter_convert_ctx_access,
+ .gen_prologue = tc_cls_act_prologue,
+};
+
static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops = &sk_filter_ops,
.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -3032,6 +3160,21 @@ static struct bpf_prog_type_list cg_skb_type __read_mostly = {
.type = BPF_PROG_TYPE_CGROUP_SKB,
};
+static struct bpf_prog_type_list lwt_in_type __read_mostly = {
+ .ops = &lwt_in_ops,
+ .type = BPF_PROG_TYPE_LWT_IN,
+};
+
+static struct bpf_prog_type_list lwt_out_type __read_mostly = {
+ .ops = &lwt_out_ops,
+ .type = BPF_PROG_TYPE_LWT_OUT,
+};
+
+static struct bpf_prog_type_list lwt_xmit_type __read_mostly = {
+ .ops = &lwt_xmit_ops,
+ .type = BPF_PROG_TYPE_LWT_XMIT,
+};
+
static int __init register_sk_filter_ops(void)
{
bpf_register_prog_type(&sk_filter_type);
@@ -3039,6 +3182,9 @@ static int __init register_sk_filter_ops(void)
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
bpf_register_prog_type(&cg_skb_type);
+ bpf_register_prog_type(&lwt_in_type);
+ bpf_register_prog_type(&lwt_out_type);
+ bpf_register_prog_type(&lwt_xmit_type);
return 0;
}
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
new file mode 100644
index 0000000..fdbf6f4
--- /dev/null
+++ b/net/core/lwt_bpf.c
@@ -0,0 +1,397 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <net/lwtunnel.h>
+
+struct bpf_lwt_prog {
+ struct bpf_prog *prog;
+ char *name;
+};
+
+struct bpf_lwt {
+ struct bpf_lwt_prog in;
+ struct bpf_lwt_prog out;
+ struct bpf_lwt_prog xmit;
+ int family;
+};
+
+#define MAX_PROG_NAME 256
+
+static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
+{
+ return (struct bpf_lwt *)lwt->data;
+}
+
+#define NO_REDIRECT false
+#define CAN_REDIRECT true
+
+static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
+ struct dst_entry *dst, bool can_redirect)
+{
+ int ret;
+
+ /* Preempt disable is needed to protect per-cpu redirect_info between
+ * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
+ * access to maps strictly require a rcu_read_lock() for protection,
+ * mixing with BH RCU lock doesn't work.
+ */
+ preempt_disable();
+ rcu_read_lock();
+ bpf_compute_data_end(skb);
+ ret = BPF_PROG_RUN(lwt->prog, skb);
+ rcu_read_unlock();
+
+ switch (ret) {
+ case BPF_OK:
+ break;
+
+ case BPF_REDIRECT:
+ if (!can_redirect) {
+ WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
+ lwt->name ? : "<unknown>");
+ ret = BPF_OK;
+ } else {
+ ret = skb_do_redirect(skb);
+ if (ret == 0)
+ ret = BPF_REDIRECT;
+ }
+ break;
+
+ case BPF_DROP:
+ kfree_skb(skb);
+ ret = -EPERM;
+ break;
+
+ default:
+ WARN_ONCE(1, "Illegal LWT BPF return value %u, expect packet loss\n",
+ ret);
+ kfree_skb(skb);
+ ret = -EINVAL;
+ break;
+ }
+
+ preempt_enable();
+
+ return ret;
+}
+
+static int bpf_input(struct sk_buff *skb)
+{
+ struct dst_entry *dst = skb_dst(skb);
+ struct bpf_lwt *bpf;
+ int ret;
+
+ bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+ if (bpf->in.prog) {
+ ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (unlikely(!dst->lwtstate->orig_input)) {
+ WARN_ONCE(1, "orig_input not set on dst for prog %s\n",
+ bpf->in.name);
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+
+ return dst->lwtstate->orig_input(skb);
+}
+
+static int bpf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+ struct dst_entry *dst = skb_dst(skb);
+ struct bpf_lwt *bpf;
+ int ret;
+
+ bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+ if (bpf->out.prog) {
+ ret = run_lwt_bpf(skb, &bpf->out, dst, NO_REDIRECT);
+ if (ret < 0)
+ return ret;
+ }
+
+ if (unlikely(!dst->lwtstate->orig_output)) {
+ WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
+ bpf->out.name);
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+
+ return dst->lwtstate->orig_output(net, sk, skb);
+}
+
+static int xmit_check_hhlen(struct sk_buff *skb)
+{
+ int hh_len = skb_dst(skb)->dev->hard_header_len;
+
+ if (skb_headroom(skb) < hh_len) {
+ int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));
+
+ if (pskb_expand_head(skb, nhead, 0, GFP_ATOMIC))
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int bpf_xmit(struct sk_buff *skb)
+{
+ struct dst_entry *dst = skb_dst(skb);
+ struct bpf_lwt *bpf;
+
+ bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+ if (bpf->xmit.prog) {
+ int ret;
+
+ ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
+ switch (ret) {
+ case BPF_OK:
+ /* If the L3 header was expanded, headroom might be too
+ * small for L2 header now, expand as needed.
+ */
+ ret = xmit_check_hhlen(skb);
+ if (unlikely(ret))
+ return ret;
+
+ return LWTUNNEL_XMIT_CONTINUE;
+ case BPF_REDIRECT:
+ return LWTUNNEL_XMIT_DONE;
+ default:
+ return ret;
+ }
+ }
+
+ return LWTUNNEL_XMIT_CONTINUE;
+}
+
+static void bpf_lwt_prog_destroy(struct bpf_lwt_prog *prog)
+{
+ if (prog->prog)
+ bpf_prog_put(prog->prog);
+
+ kfree(prog->name);
+}
+
+static void bpf_destroy_state(struct lwtunnel_state *lwt)
+{
+ struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+ bpf_lwt_prog_destroy(&bpf->in);
+ bpf_lwt_prog_destroy(&bpf->out);
+ bpf_lwt_prog_destroy(&bpf->xmit);
+}
+
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+ [LWT_BPF_PROG_FD] = { .type = NLA_U32, },
+ [LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+ .len = MAX_PROG_NAME },
+};
+
+static int bpf_parse_prog(struct nlattr *attr, struct bpf_lwt_prog *prog,
+ enum bpf_prog_type type)
+{
+ struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+ struct bpf_prog *p;
+ int ret;
+ u32 fd;
+
+ ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attr, bpf_prog_policy);
+ if (ret < 0)
+ return ret;
+
+ if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+ return -EINVAL;
+
+ prog->name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+ if (!prog->name)
+ return -ENOMEM;
+
+ fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+ p = bpf_prog_get_type(fd, type);
+ if (IS_ERR(p))
+ return PTR_ERR(p);
+
+ prog->prog = p;
+
+ return 0;
+}
+
+static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = {
+ [LWT_BPF_IN] = { .type = NLA_NESTED, },
+ [LWT_BPF_OUT] = { .type = NLA_NESTED, },
+ [LWT_BPF_XMIT] = { .type = NLA_NESTED, },
+ [LWT_BPF_XMIT_HEADROOM] = { .type = NLA_U32 },
+};
+
+static int bpf_build_state(struct net_device *dev, struct nlattr *nla,
+ unsigned int family, const void *cfg,
+ struct lwtunnel_state **ts)
+{
+ struct nlattr *tb[LWT_BPF_MAX + 1];
+ struct lwtunnel_state *newts;
+ struct bpf_lwt *bpf;
+ int ret;
+
+ if (family != AF_INET && family != AF_INET6)
+ return -EAFNOSUPPORT;
+
+ ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy);
+ if (ret < 0)
+ return ret;
+
+ if (!tb[LWT_BPF_IN] && !tb[LWT_BPF_OUT] && !tb[LWT_BPF_XMIT])
+ return -EINVAL;
+
+ newts = lwtunnel_state_alloc(sizeof(*bpf));
+ if (!newts)
+ return -ENOMEM;
+
+ newts->type = LWTUNNEL_ENCAP_BPF;
+ bpf = bpf_lwt_lwtunnel(newts);
+
+ if (tb[LWT_BPF_IN]) {
+ newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
+ ret = bpf_parse_prog(tb[LWT_BPF_IN], &bpf->in,
+ BPF_PROG_TYPE_LWT_IN);
+ if (ret < 0)
+ goto errout;
+ }
+
+ if (tb[LWT_BPF_OUT]) {
+ newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
+ ret = bpf_parse_prog(tb[LWT_BPF_OUT], &bpf->out,
+ BPF_PROG_TYPE_LWT_OUT);
+ if (ret < 0)
+ goto errout;
+ }
+
+ if (tb[LWT_BPF_XMIT]) {
+ newts->flags |= LWTUNNEL_STATE_XMIT_REDIRECT;
+ ret = bpf_parse_prog(tb[LWT_BPF_XMIT], &bpf->xmit,
+ BPF_PROG_TYPE_LWT_XMIT);
+ if (ret < 0)
+ goto errout;
+ }
+
+ if (tb[LWT_BPF_XMIT_HEADROOM]) {
+ u32 headroom = nla_get_u32(tb[LWT_BPF_XMIT_HEADROOM]);
+
+ if (headroom > LWT_BPF_MAX_HEADROOM) {
+ ret = -ERANGE;
+ goto errout;
+ }
+
+ newts->headroom = headroom;
+ }
+
+ bpf->family = family;
+ *ts = newts;
+
+ return 0;
+
+errout:
+ bpf_destroy_state(newts);
+ kfree(newts);
+ return ret;
+}
+
+static int bpf_fill_lwt_prog(struct sk_buff *skb, int attr,
+ struct bpf_lwt_prog *prog)
+{
+ struct nlattr *nest;
+
+ if (!prog->prog)
+ return 0;
+
+ nest = nla_nest_start(skb, attr);
+ if (!nest)
+ return -EMSGSIZE;
+
+ if (prog->name &&
+ nla_put_string(skb, LWT_BPF_PROG_NAME, prog->name))
+ return -EMSGSIZE;
+
+ return nla_nest_end(skb, nest);
+}
+
+static int bpf_fill_encap_info(struct sk_buff *skb, struct lwtunnel_state *lwt)
+{
+ struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+ if (bpf_fill_lwt_prog(skb, LWT_BPF_IN, &bpf->in) < 0 ||
+ bpf_fill_lwt_prog(skb, LWT_BPF_OUT, &bpf->out) < 0 ||
+ bpf_fill_lwt_prog(skb, LWT_BPF_XMIT, &bpf->xmit) < 0)
+ return -EMSGSIZE;
+
+ return 0;
+}
+
+static int bpf_encap_nlsize(struct lwtunnel_state *lwtstate)
+{
+ int nest_len = nla_total_size(sizeof(struct nlattr)) +
+ nla_total_size(MAX_PROG_NAME) + /* LWT_BPF_PROG_NAME */
+ 0;
+
+ return nest_len + /* LWT_BPF_IN */
+ nest_len + /* LWT_BPF_OUT */
+ nest_len + /* LWT_BPF_XMIT */
+ 0;
+}
+
+static int bpf_lwt_prog_cmp(struct bpf_lwt_prog *a, struct bpf_lwt_prog *b)
+{
+ /* FIXME:
+ * The LWT state is currently rebuilt for delete requests which
+ * results in a new bpf_prog instance. Comparing names for now.
+ */
+ if (!a->name && !b->name)
+ return 0;
+
+ if (!a->name || !b->name)
+ return 1;
+
+ return strcmp(a->name, b->name);
+}
+
+static int bpf_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b)
+{
+ struct bpf_lwt *a_bpf = bpf_lwt_lwtunnel(a);
+ struct bpf_lwt *b_bpf = bpf_lwt_lwtunnel(b);
+
+ return bpf_lwt_prog_cmp(&a_bpf->in, &b_bpf->in) ||
+ bpf_lwt_prog_cmp(&a_bpf->out, &b_bpf->out) ||
+ bpf_lwt_prog_cmp(&a_bpf->xmit, &b_bpf->xmit);
+}
+
+static const struct lwtunnel_encap_ops bpf_encap_ops = {
+ .build_state = bpf_build_state,
+ .destroy_state = bpf_destroy_state,
+ .input = bpf_input,
+ .output = bpf_output,
+ .xmit = bpf_xmit,
+ .fill_encap = bpf_fill_encap_info,
+ .get_encap_size = bpf_encap_nlsize,
+ .cmp_encap = bpf_encap_cmp,
+};
+
+static int __init bpf_lwt_init(void)
+{
+ return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
+}
+
+subsys_initcall(bpf_lwt_init)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 03976e9..a5d4e86 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -41,6 +41,8 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
return "ILA";
case LWTUNNEL_ENCAP_SEG6:
return "SEG6";
+ case LWTUNNEL_ENCAP_BPF:
+ return "BPF";
case LWTUNNEL_ENCAP_IP6:
case LWTUNNEL_ENCAP_IP:
case LWTUNNEL_ENCAP_NONE:
--
2.7.4
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF
2016-11-29 13:21 [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
` (2 preceding siblings ...)
2016-11-29 13:21 ` [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure Thomas Graf
@ 2016-11-29 13:21 ` Thomas Graf
2016-11-30 0:17 ` Alexei Starovoitov
2016-11-29 14:15 ` [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
4 siblings, 1 reply; 16+ messages in thread
From: Thomas Graf @ 2016-11-29 13:21 UTC (permalink / raw)
To: davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa, hannes
Adds a series of tests to verify the functionality of attaching
BPF programs at LWT hooks.
Also adds a sample which collects a histogram of the sizes of packets
passing through an LWT hook.
$ ./lwt_len_hist.sh
Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    39857.69
1 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 22 | |
64 -> 127 : 98 | |
128 -> 255 : 213 | |
256 -> 511 : 1444251 |******** |
512 -> 1023 : 660610 |*** |
1024 -> 2047 : 535241 |** |
2048 -> 4095 : 19 | |
4096 -> 8191 : 180 | |
8192 -> 16383 : 5578023 |************************************* |
16384 -> 32767 : 632099 |*** |
32768 -> 65535 : 6575 | |
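The test suite itself can be run directly; as a rough sketch of the
prerequisites, it expects clang with a bpf target to build
test_lwt_bpf.o as well as netperf/netserver for the throughput tests:

  # ./test_lwt_bpf.sh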
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
samples/bpf/Makefile | 4 +
samples/bpf/bpf_helpers.h | 4 +
samples/bpf/lwt_len_hist.sh | 37 ++++
samples/bpf/lwt_len_hist_kern.c | 82 +++++++++
samples/bpf/lwt_len_hist_user.c | 76 ++++++++
samples/bpf/test_lwt_bpf.c | 247 ++++++++++++++++++++++++++
samples/bpf/test_lwt_bpf.sh | 385 ++++++++++++++++++++++++++++++++++++++++
7 files changed, 835 insertions(+)
create mode 100755 samples/bpf/lwt_len_hist.sh
create mode 100644 samples/bpf/lwt_len_hist_kern.c
create mode 100644 samples/bpf/lwt_len_hist_user.c
create mode 100644 samples/bpf/test_lwt_bpf.c
create mode 100755 samples/bpf/test_lwt_bpf.sh
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 22b6407e..3161f82 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -29,6 +29,7 @@ hostprogs-y += test_current_task_under_cgroup
hostprogs-y += trace_event
hostprogs-y += sampleip
hostprogs-y += tc_l2_redirect
+hostprogs-y += lwt_len_hist
test_lru_dist-objs := test_lru_dist.o libbpf.o
sock_example-objs := sock_example.o libbpf.o
@@ -59,6 +60,7 @@ test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
+lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@@ -89,6 +91,7 @@ always += xdp2_kern.o
always += test_current_task_under_cgroup_kern.o
always += trace_event_kern.o
always += sampleip_kern.o
+always += lwt_len_hist_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(objtree)/tools/testing/selftests/bpf/
@@ -117,6 +120,7 @@ HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
HOSTLOADLIBES_trace_event += -lelf
HOSTLOADLIBES_sampleip += -lelf
HOSTLOADLIBES_tc_l2_redirect += -l elf
+HOSTLOADLIBES_lwt_len_hist += -l elf
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..f34e417 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
unsigned int map_flags;
};
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+ (void *) BPF_FUNC_skb_load_bytes;
static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
(void *) BPF_FUNC_skb_store_bytes;
static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
(void *) BPF_FUNC_l4_csum_replace;
static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_push)(void *, int len, int flags) =
+ (void *) BPF_FUNC_skb_push;
#if defined(__x86_64__)
diff --git a/samples/bpf/lwt_len_hist.sh b/samples/bpf/lwt_len_hist.sh
new file mode 100755
index 0000000..3a8ee52
--- /dev/null
+++ b/samples/bpf/lwt_len_hist.sh
@@ -0,0 +1,37 @@
+#!/bin/bash
+
+NS1=lwt_ns1
+VETH0=tst_lwt1a
+VETH1=tst_lwt1b
+
+TRACE_ROOT=/sys/kernel/debug/tracing
+
+function cleanup {
+ ip route del 192.168.253.2/32 dev $VETH0 2> /dev/null
+ ip link del $VETH0 2> /dev/null
+ ip link del $VETH1 2> /dev/null
+ ip netns exec $NS1 killall netserver
+ ip netns delete $NS1 2> /dev/null
+}
+
+cleanup
+
+ip netns add $NS1
+ip link add $VETH0 type veth peer name $VETH1
+ip link set dev $VETH0 up
+ip addr add 192.168.253.1/24 dev $VETH0
+ip link set $VETH1 netns $NS1
+ip netns exec $NS1 ip link set dev $VETH1 up
+ip netns exec $NS1 ip addr add 192.168.253.2/24 dev $VETH1
+ip netns exec $NS1 netserver
+
+echo 1 > ${TRACE_ROOT}/tracing_on
+cp /dev/null ${TRACE_ROOT}/trace
+ip route add 192.168.253.2/32 encap bpf out obj lwt_len_hist_kern.o section len_hist dev $VETH0
+netperf -H 192.168.253.2 -t TCP_STREAM
+cat ${TRACE_ROOT}/trace | grep -v '^#'
+./lwt_len_hist
+cleanup
+echo 0 > ${TRACE_ROOT}/tracing_on
+
+exit 0
diff --git a/samples/bpf/lwt_len_hist_kern.c b/samples/bpf/lwt_len_hist_kern.c
new file mode 100644
index 0000000..df75383
--- /dev/null
+++ b/samples/bpf/lwt_len_hist_kern.c
@@ -0,0 +1,82 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/in.h>
+#include "bpf_helpers.h"
+
+# define printk(fmt, ...) \
+ ({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+ })
+
+struct bpf_elf_map {
+ __u32 type;
+ __u32 size_key;
+ __u32 size_value;
+ __u32 max_elem;
+ __u32 flags;
+ __u32 id;
+ __u32 pinning;
+};
+
+struct bpf_elf_map SEC("maps") lwt_len_hist_map = {
+ .type = BPF_MAP_TYPE_PERCPU_HASH,
+ .size_key = sizeof(__u64),
+ .size_value = sizeof(__u64),
+ .pinning = 2,
+ .max_elem = 1024,
+};
+
+static unsigned int log2(unsigned int v)
+{
+ unsigned int r;
+ unsigned int shift;
+
+ r = (v > 0xFFFF) << 4; v >>= r;
+ shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
+ shift = (v > 0xF) << 2; v >>= shift; r |= shift;
+ shift = (v > 0x3) << 1; v >>= shift; r |= shift;
+ r |= (v >> 1);
+ return r;
+}
+
+static unsigned int log2l(unsigned long v)
+{
+ unsigned int hi = v >> 32;
+ if (hi)
+ return log2(hi) + 32;
+ else
+ return log2(v);
+}
+
+SEC("len_hist")
+int do_len_hist(struct __sk_buff *skb)
+{
+ __u64 *value, key, init_val = 1;
+
+ key = log2l(skb->len);
+
+ value = bpf_map_lookup_elem(&lwt_len_hist_map, &key);
+ if (value)
+ __sync_fetch_and_add(value, 1);
+ else
+ bpf_map_update_elem(&lwt_len_hist_map, &key, &init_val, BPF_ANY);
+
+ return BPF_OK;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/lwt_len_hist_user.c b/samples/bpf/lwt_len_hist_user.c
new file mode 100644
index 0000000..05d783f
--- /dev/null
+++ b/samples/bpf/lwt_len_hist_user.c
@@ -0,0 +1,76 @@
+#include <linux/unistd.h>
+#include <linux/bpf.h>
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <arpa/inet.h>
+
+#include "libbpf.h"
+#include "bpf_util.h"
+
+#define MAX_INDEX 64
+#define MAX_STARS 38
+
+static void stars(char *str, long val, long max, int width)
+{
+ int i;
+
+ for (i = 0; i < (width * val / max) - 1 && i < width - 1; i++)
+ str[i] = '*';
+ if (val > max)
+ str[i - 1] = '+';
+ str[i] = '\0';
+}
+
+int main(int argc, char **argv)
+{
+ unsigned int nr_cpus = bpf_num_possible_cpus();
+ const char *map_filename = "/sys/fs/bpf/tc/globals/lwt_len_hist_map";
+ uint64_t values[nr_cpus], sum, max_value = 0, data[MAX_INDEX] = {};
+ uint64_t key = 0, next_key, max_key = 0;
+ char starstr[MAX_STARS];
+ int i, map_fd;
+
+ map_fd = bpf_obj_get(map_filename);
+ if (map_fd < 0) {
+ fprintf(stderr, "bpf_obj_get(%s): %s(%d)\n",
+ map_filename, strerror(errno), errno);
+ return -1;
+ }
+
+ while (bpf_get_next_key(map_fd, &key, &next_key) == 0) {
+ if (next_key >= MAX_INDEX) {
+ fprintf(stderr, "Key %lu out of bounds\n", next_key);
+ continue;
+ }
+
+ bpf_lookup_elem(map_fd, &next_key, values);
+
+ sum = 0;
+ for (i = 0; i < nr_cpus; i++)
+ sum += values[i];
+
+ data[next_key] = sum;
+ if (sum && next_key > max_key)
+ max_key = next_key;
+
+ if (sum > max_value)
+ max_value = sum;
+
+ key = next_key;
+ }
+
+ for (i = 1; i <= max_key + 1; i++) {
+ stars(starstr, data[i - 1], max_value, MAX_STARS);
+ printf("%8ld -> %-8ld : %-8ld |%-*s|\n",
+ (1l << i) >> 1, (1l << i) - 1, data[i - 1],
+ MAX_STARS, starstr);
+ }
+
+ close(map_fd);
+
+ return 0;
+}
diff --git a/samples/bpf/test_lwt_bpf.c b/samples/bpf/test_lwt_bpf.c
new file mode 100644
index 0000000..9c1773e
--- /dev/null
+++ b/samples/bpf/test_lwt_bpf.c
@@ -0,0 +1,247 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <stdint.h>
+#include <stddef.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/icmpv6.h>
+#include <linux/if_ether.h>
+#include "bpf_helpers.h"
+#include <string.h>
+
+# define printk(fmt, ...) \
+ ({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+ })
+
+#define CB_MAGIC 1234
+
+/* Test: Pass all packets through */
+SEC("nop")
+int do_nop(struct __sk_buff *skb)
+{
+ return BPF_OK;
+}
+
+/* Test: Verify context information can be accessed */
+SEC("test_ctx")
+int do_test_ctx(struct __sk_buff *skb)
+{
+ skb->cb[0] = CB_MAGIC;
+ printk("len %d hash %d protocol %d\n", skb->len, skb->hash,
+ skb->protocol);
+ printk("cb %d ingress_ifindex %d ifindex %d\n", skb->cb[0],
+ skb->ingress_ifindex, skb->ifindex);
+
+ return BPF_OK;
+}
+
+/* Test: Ensure skb->cb[] buffer is cleared */
+SEC("test_cb")
+int do_test_cb(struct __sk_buff *skb)
+{
+ printk("cb0: %x cb1: %x cb2: %x\n", skb->cb[0], skb->cb[1],
+ skb->cb[2]);
+ printk("cb3: %x cb4: %x\n", skb->cb[3], skb->cb[4]);
+
+ return BPF_OK;
+}
+
+/* Test: Verify skb data can be read */
+SEC("test_data")
+int do_test_data(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct iphdr *iph = data;
+
+ if (data + sizeof(*iph) > data_end) {
+ printk("packet truncated\n");
+ return BPF_DROP;
+ }
+
+ printk("src: %x dst: %x\n", iph->saddr, iph->daddr);
+
+ return BPF_OK;
+}
+
+#define IP_CSUM_OFF offsetof(struct iphdr, check)
+#define IP_DST_OFF offsetof(struct iphdr, daddr)
+#define IP_SRC_OFF offsetof(struct iphdr, saddr)
+#define IP_PROTO_OFF offsetof(struct iphdr, protocol)
+#define TCP_CSUM_OFF offsetof(struct tcphdr, check)
+#define UDP_CSUM_OFF offsetof(struct udphdr, check)
+#define IS_PSEUDO 0x10
+
+static inline int rewrite(struct __sk_buff *skb, uint32_t old_ip,
+ uint32_t new_ip, int rw_daddr)
+{
+ int ret, off = 0, flags = IS_PSEUDO;
+ uint8_t proto;
+
+ ret = bpf_skb_load_bytes(skb, IP_PROTO_OFF, &proto, 1);
+ if (ret < 0) {
+ printk("bpf_skb_load_bytes failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ switch (proto) {
+ case IPPROTO_TCP:
+ off = TCP_CSUM_OFF;
+ break;
+
+ case IPPROTO_UDP:
+ off = UDP_CSUM_OFF;
+ flags |= BPF_F_MARK_MANGLED_0;
+ break;
+
+ case IPPROTO_ICMPV6:
+ off = offsetof(struct icmp6hdr, icmp6_cksum);
+ break;
+ }
+
+ if (off) {
+ ret = bpf_l4_csum_replace(skb, off, old_ip, new_ip,
+ flags | sizeof(new_ip));
+ if (ret < 0) {
+ printk("bpf_l4_csum_replace failed: %d\n", ret);
+ return BPF_DROP;
+ }
+ }
+
+ ret = bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ip, new_ip, sizeof(new_ip));
+ if (ret < 0) {
+ printk("bpf_l3_csum_replace failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ if (rw_daddr)
+ ret = bpf_skb_store_bytes(skb, IP_DST_OFF, &new_ip, sizeof(new_ip), 0);
+ else
+ ret = bpf_skb_store_bytes(skb, IP_SRC_OFF, &new_ip, sizeof(new_ip), 0);
+
+ if (ret < 0) {
+ printk("bpf_skb_store_bytes() failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ return BPF_OK;
+}
+
+/* Test: Verify skb data can be modified */
+SEC("test_rewrite")
+int do_test_rewrite(struct __sk_buff *skb)
+{
+ uint32_t old_ip, new_ip = 0x3fea8c0;
+ int ret;
+
+ ret = bpf_skb_load_bytes(skb, IP_DST_OFF, &old_ip, 4);
+ if (ret < 0) {
+ printk("bpf_skb_load_bytes failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ if (old_ip == 0x2fea8c0) {
+ printk("out: rewriting from %x to %x\n", old_ip, new_ip);
+ return rewrite(skb, old_ip, new_ip, 1);
+ }
+
+ return BPF_OK;
+}
+
+static inline int __do_push_ll_and_redirect(struct __sk_buff *skb)
+{
+ uint64_t smac = SRC_MAC, dmac = DST_MAC;
+ int ret, ifindex = DST_IFINDEX;
+ struct ethhdr ehdr;
+
+ ret = bpf_skb_push(skb, 14, 0);
+ if (ret < 0) {
+ printk("skb_push() failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ ehdr.h_proto = __constant_htons(ETH_P_IP);
+ memcpy(&ehdr.h_source, &smac, 6);
+ memcpy(&ehdr.h_dest, &dmac, 6);
+
+ ret = bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0);
+ if (ret < 0) {
+ printk("skb_store_bytes() failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ ret = bpf_redirect(ifindex, 0);
+ if (ret < 0) {
+ printk("bpf_redirect() failed: %d\n", ret);
+ return BPF_DROP;
+ }
+
+ return BPF_REDIRECT;
+}
+
+SEC("push_ll_and_redirect_silent")
+int do_push_ll_and_redirect_silent(struct __sk_buff *skb)
+{
+ return __do_push_ll_and_redirect(skb);
+}
+
+SEC("push_ll_and_redirect")
+int do_push_ll_and_redirect(struct __sk_buff *skb)
+{
+ int ret, ifindex = DST_IFINDEX;
+
+ ret = __do_push_ll_and_redirect(skb);
+ if (ret >= 0)
+ printk("redirected to %d\n", ifindex);
+
+ return ret;
+}
+
+SEC("fill_garbage")
+int do_fill_garbage(struct __sk_buff *skb)
+{
+ uint64_t f = 0xFFFFFFFFFFFFFFFF;
+
+ bpf_skb_store_bytes(skb, 0, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 8, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 16, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 24, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 32, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 40, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 48, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 56, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 64, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 72, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 80, &f, sizeof(f), 0);
+ bpf_skb_store_bytes(skb, 88, &f, sizeof(f), 0);
+
+ printk("Set initial 96 bytes of header to FF\n");
+
+ return BPF_OK;
+}
+
+/* Drop all packets */
+SEC("drop_all")
+int do_drop_all(struct __sk_buff *skb)
+{
+ printk("dropping with: %d\n", BPF_DROP);
+ return BPF_DROP;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_lwt_bpf.sh b/samples/bpf/test_lwt_bpf.sh
new file mode 100755
index 0000000..d016961
--- /dev/null
+++ b/samples/bpf/test_lwt_bpf.sh
@@ -0,0 +1,385 @@
+#!/bin/bash
+
+# Uncomment to see generated bytecode
+#VERBOSE=verbose
+
+NS1=lwt_ns1
+NS2=lwt_ns2
+VETH0=tst_lwt1a
+VETH1=tst_lwt1b
+VETH2=tst_lwt2a
+VETH3=tst_lwt2b
+IPVETH0="192.168.254.1"
+IPVETH1="192.168.254.2"
+IPVETH1b="192.168.254.3"
+
+IPVETH2="192.168.111.1"
+IPVETH3="192.168.111.2"
+
+IP_LOCAL="192.168.99.1"
+
+TRACE_ROOT=/sys/kernel/debug/tracing
+
+function lookup_mac()
+{
+ set +x
+ if [ ! -z "$2" ]; then
+ MAC=$(ip netns exec $2 ip link show $1 | grep ether | awk '{print $2}')
+ else
+ MAC=$(ip link show $1 | grep ether | awk '{print $2}')
+ fi
+ MAC="${MAC//:/}"
+ echo "0x${MAC:10:2}${MAC:8:2}${MAC:6:2}${MAC:4:2}${MAC:2:2}${MAC:0:2}"
+ set -x
+}
+
+function cleanup {
+ set +ex
+ rm test_lwt_bpf.o 2> /dev/null
+ ip link del $VETH0 2> /dev/null
+ ip link del $VETH1 2> /dev/null
+ ip link del $VETH2 2> /dev/null
+ ip link del $VETH3 2> /dev/null
+ ip netns exec $NS1 killall netserver
+ ip netns delete $NS1 2> /dev/null
+ ip netns delete $NS2 2> /dev/null
+ set -ex
+}
+
+function setup_one_veth {
+ ip netns add $1
+ ip link add $2 type veth peer name $3
+ ip link set dev $2 up
+ ip addr add $4/24 dev $2
+ ip link set $3 netns $1
+ ip netns exec $1 ip link set dev $3 up
+ ip netns exec $1 ip addr add $5/24 dev $3
+
+ if [ "$6" ]; then
+ ip netns exec $1 ip addr add $6/32 dev $3
+ fi
+}
+
+function get_trace {
+ set +x
+ cat ${TRACE_ROOT}/trace | grep -v '^#'
+ set -x
+}
+
+function cleanup_routes {
+ ip route del ${IPVETH1}/32 dev $VETH0 2> /dev/null || true
+ ip route del table local local ${IP_LOCAL}/32 dev lo 2> /dev/null || true
+}
+
+function install_test {
+ cleanup_routes
+ cp /dev/null ${TRACE_ROOT}/trace
+
+ OPTS="encap bpf headroom 14 $1 obj test_lwt_bpf.o section $2 $VERBOSE"
+
+ if [ "$1" == "in" ]; then
+ ip route add table local local ${IP_LOCAL}/32 $OPTS dev lo
+ else
+ ip route add ${IPVETH1}/32 $OPTS dev $VETH0
+ fi
+}
+
+function remove_prog {
+ if [ "$1" == "in" ]; then
+ ip route del table local local ${IP_LOCAL}/32 dev lo
+ else
+ ip route del ${IPVETH1}/32 dev $VETH0
+ fi
+}
+
+function filter_trace {
+ # Add newline to allow starting EXPECT= variables on newline
+ NL=$'\n'
+ echo "${NL}$*" | sed -e 's/^.*: : //g'
+}
+
+function expect_fail {
+ set +x
+ echo "FAIL:"
+ echo "Expected: $1"
+ echo "Got: $2"
+ set -x
+ exit 1
+}
+
+function match_trace {
+ set +x
+ RET=0
+ TRACE=$1
+ EXPECT=$2
+ GOT="$(filter_trace "$TRACE")"
+
+ [ "$GOT" != "$EXPECT" ] && {
+ expect_fail "$EXPECT" "$GOT"
+ RET=1
+ }
+ set -x
+ return $RET
+}
+
+function test_start {
+ set +x
+ echo "----------------------------------------------------------------"
+ echo "Starting test: $*"
+ echo "----------------------------------------------------------------"
+ set -x
+}
+
+function failure {
+ get_trace
+ echo "FAIL: $*"
+ exit 1
+}
+
+function test_ctx_xmit {
+ test_start "test_ctx on lwt xmit"
+ install_test xmit test_ctx
+ ping -c 3 $IPVETH1 || {
+ failure "test_ctx xmit: packets are dropped"
+ }
+ match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX" || exit 1
+ remove_prog xmit
+}
+
+function test_ctx_out {
+ test_start "test_ctx on lwt out"
+ install_test out test_ctx
+ ping -c 3 $IPVETH1 || {
+ failure "test_ctx out: packets are dropped"
+ }
+ match_trace "$(get_trace)" "
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0" || exit 1
+ remove_prog out
+}
+
+function test_ctx_in {
+ test_start "test_ctx on lwt in"
+ install_test in test_ctx
+ ping -c 3 $IP_LOCAL || {
+ failure "test_ctx in: packets are dropped"
+ }
+ # We will see both request & reply packets as the packets will
+ # be from $IP_LOCAL => $IP_LOCAL
+ match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1" || exit 1
+ remove_prog in
+}
+
+function test_data {
+ test_start "test_data on lwt $1"
+ install_test $1 test_data
+ ping -c 3 $IPVETH1 || {
+ failure "test_data ${1}: packets are dropped"
+ }
+ match_trace "$(get_trace)" "
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0" || exit 1
+ remove_prog $1
+}
+
+function test_data_in {
+ test_start "test_data on lwt in"
+ install_test in test_data
+ ping -c 3 $IP_LOCAL || {
+ failure "test_data in: packets are dropped"
+ }
+ # We will see both request & reply packets as the packets will
+ # be from $IP_LOCAL => $IP_LOCAL
+ match_trace "$(get_trace)" "
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0" || exit 1
+ remove_prog in
+}
+
+function test_cb {
+ test_start "test_cb on lwt $1"
+ install_test $1 test_cb
+ ping -c 3 $IPVETH1 || {
+ failure "test_cb ${1}: packets are dropped"
+ }
+ match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+ remove_prog $1
+}
+
+function test_cb_in {
+ test_start "test_cb on lwt in"
+ install_test in test_cb
+ ping -c 3 $IP_LOCAL || {
+ failure "test_cb in: packets are dropped"
+ }
+ # We will see both request & reply packets as the packets will
+ # be from $IP_LOCAL => $IP_LOCAL
+ match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+ remove_prog in
+}
+
+function test_drop_all {
+ test_start "test_drop_all on lwt $1"
+ install_test $1 drop_all
+ ping -c 3 $IPVETH1 && {
+ failure "test_drop_all ${1}: Unexpected success of ping"
+ }
+ match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+ remove_prog $1
+}
+
+function test_drop_all_in {
+ test_start "test_drop_all on lwt in"
+ install_test in drop_all
+ ping -c 3 $IP_LOCAL && {
+ failure "test_drop_all in: Unexpected success of ping"
+ }
+ match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+ remove_prog in
+}
+
+function test_push_ll_and_redirect {
+ test_start "test_push_ll_and_redirect on lwt xmit"
+ install_test xmit push_ll_and_redirect
+ ping -c 3 $IPVETH1 || {
+ failure "Redirected packets appear to be dropped"
+ }
+ match_trace "$(get_trace)" "
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX" || exit 1
+ remove_prog xmit
+}
+
+function test_rewrite {
+ test_start "test_rewrite on lwt xmit"
+ install_test xmit test_rewrite
+ ping -c 3 $IPVETH1 || {
+ failure "Rewritten packets appear to be dropped"
+ }
+ match_trace "$(get_trace)" "
+out: rewriting from 2fea8c0 to 3fea8c0
+out: rewriting from 2fea8c0 to 3fea8c0
+out: rewriting from 2fea8c0 to 3fea8c0" || exit 1
+ remove_prog xmit
+}
+
+function test_fill_garbage {
+ test_start "test_fill_garbage on lwt xmit"
+ install_test xmit fill_garbage
+ ping -c 3 $IPVETH1 && {
+ failure "test_fill_garbage: Unexpected success of ping"
+ }
+ match_trace "$(get_trace)" "
+Set initial 96 bytes of header to FF
+Set initial 96 bytes of header to FF
+Set initial 96 bytes of header to FF" || exit 1
+ remove_prog xmit
+}
+
+function test_netperf_nop {
+ test_start "test_netperf_nop on lwt xmit"
+ install_test xmit nop
+ netperf -H $IPVETH1 -t TCP_STREAM || {
+ failure "packets appear to be dropped"
+ }
+ match_trace "$(get_trace)" ""|| exit 1
+ remove_prog xmit
+}
+
+function test_netperf_redirect {
+ test_start "test_netperf_redirect on lwt xmit"
+ install_test xmit push_ll_and_redirect_silent
+ netperf -H $IPVETH1 -t TCP_STREAM || {
+ failure "Rewritten packets appear to be dropped"
+ }
+ match_trace "$(get_trace)" ""|| exit 1
+ remove_prog xmit
+}
+
+cleanup
+setup_one_veth $NS1 $VETH0 $VETH1 $IPVETH0 $IPVETH1 $IPVETH1b
+setup_one_veth $NS2 $VETH2 $VETH3 $IPVETH2 $IPVETH3
+ip netns exec $NS1 netserver
+echo 1 > ${TRACE_ROOT}/tracing_on
+
+DST_MAC=$(lookup_mac $VETH1 $NS1)
+SRC_MAC=$(lookup_mac $VETH0)
+DST_IFINDEX=$(cat /sys/class/net/$VETH0/ifindex)
+
+CLANG_OPTS="-O2 -target bpf -I ../include/"
+CLANG_OPTS+=" -DSRC_MAC=$SRC_MAC -DDST_MAC=$DST_MAC -DDST_IFINDEX=$DST_IFINDEX"
+clang $CLANG_OPTS -c test_lwt_bpf.c -o test_lwt_bpf.o
+
+test_ctx_xmit
+test_ctx_out
+test_ctx_in
+test_data "xmit"
+test_data "out"
+test_data_in
+test_cb "xmit"
+test_cb "out"
+test_cb_in
+test_drop_all "xmit"
+test_drop_all "out"
+test_drop_all_in
+test_rewrite
+test_push_ll_and_redirect
+test_fill_garbage
+test_netperf_nop
+test_netperf_redirect
+
+cleanup
+echo 0 > ${TRACE_ROOT}/tracing_on
+exit 0
--
2.7.4
* Re: [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation
2016-11-29 13:21 [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
` (3 preceding siblings ...)
2016-11-29 13:21 ` [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF Thomas Graf
@ 2016-11-29 14:15 ` Hannes Frederic Sowa
2016-11-29 14:58 ` Thomas Graf
4 siblings, 1 reply; 16+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-29 14:15 UTC (permalink / raw)
To: Thomas Graf, davem; +Cc: netdev, alexei.starovoitov, daniel, tom, roopa
Hi Thomas,
On 29.11.2016 14:21, Thomas Graf wrote:
> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure. The BPF program can be attached to
> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and see an L3
> skb as context. Programs attached to input and output are read-only.
> Programs attached to lwtunnel_xmit() can modify and redirect, push headers
> and redirect packets.
>
> The facility can be used to:
> - Collect statistics and generate sampling data for a subset of traffic
> based on the dst utilized by the packet thus allowing to extend the
> existing realms.
> - Apply additional per route/dst filters to prohibit certain outgoing
> or incoming packets based on BPF filters. In particular, this allows
> to maintain per dst custom state across multiple packets in BPF maps
> and apply filters based on statistics and behaviour observed over time.
> - Attachment of L2 headers at transmit where resolving the L2 address
> is not required.
> - Possibly many more.
Did you look at the cgroup based hooks which were added recently in
ip_finish_output for cgroup ebpf support, and at the cgroup bpf
subsystem in general? Does some of this solve the problem for you
already? It would be interesting to hear your opinion on that.
Thanks,
Hannes
* Re: [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation
2016-11-29 14:15 ` [PATCH v3 net-next 0/4] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
@ 2016-11-29 14:58 ` Thomas Graf
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Graf @ 2016-11-29 14:58 UTC (permalink / raw)
To: Hannes Frederic Sowa
Cc: davem, netdev, alexei.starovoitov, daniel, tom, roopa
Hi Hannes,
On 11/29/16 at 03:15pm, Hannes Frederic Sowa wrote:
> Did you look at the cgroup based hooks which were added recently in
> ip_finish_output for cgroup ebpf support, and at the cgroup bpf
> subsystem in general? Does some of this solve the problem for you
> already? It would be interesting to hear your opinion on that.
What I'm looking for is the ability to collect statistics and generate
samples for a subset of the traffic, e.g. all intra-data-center
traffic, all packets hitting the default route in a network namespace,
or all packets which use a dst tying a certain endpoint to particular
TCP metrics. For the examples above, LWT provides a very intuitive and
natural way to do so while amortizing the cost of the route lookup,
which is required anyway.
The cgroup hook provides similar semantics but only if the application
context is of interest. Obviously, tasks in a cgroup may be sharing
routes, so I can't use it as a replacement. However, using the two in
combination will become highly useful as it allows gathering statistics
individually for both application context and routing context and then
aggregating them to see how applications are using different network
segments.
Aside from the different context matching, the cgroup hook does not
allow modifying the packet as the lwtunnel_xmit() hook after
ip_finish_output() does.
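
To make the statistics use case concrete, a minimal LWT program along
these lines could count packets and bytes for all traffic hitting a
given route in a per-CPU array map. This is only a sketch; the map
layout, section name and the route in the attach command below are
illustrative, not taken from the series:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") dst_stats = {
          .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
          .key_size    = sizeof(__u32),
          .value_size  = sizeof(__u64),
          .max_entries = 2,       /* slot 0: packets, slot 1: bytes */
  };

  SEC("count_dst")
  int do_count(struct __sk_buff *skb)
  {
          __u32 pkts = 0, bytes = 1;
          __u64 *cnt;

          cnt = bpf_map_lookup_elem(&dst_stats, &pkts);
          if (cnt)
                  (*cnt)++;

          cnt = bpf_map_lookup_elem(&dst_stats, &bytes);
          if (cnt)
                  *cnt += skb->len;

          return BPF_OK;  /* continue routing as per nexthop */
  }

  char _license[] SEC("license") = "GPL";

Attaching it to a route would then look something like:

  ip route add 192.168.253.0/24 encap bpf in obj count_dst_kern.o \
          section count_dst dev veth0

so only packets using that dst are accounted, amortizing the route
lookup as described above.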
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-29 13:21 ` [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure Thomas Graf
@ 2016-11-30 0:15 ` Alexei Starovoitov
2016-11-30 2:52 ` John Fastabend
2016-11-30 6:48 ` Thomas Graf
0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2016-11-30 0:15 UTC (permalink / raw)
To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> Registers new BPF program types which correspond to the LWT hooks:
> - BPF_PROG_TYPE_LWT_IN => dst_input()
> - BPF_PROG_TYPE_LWT_OUT => dst_output()
> - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>
> The separate program types are required to differentiate between the
> capabilities each LWT hook allows:
>
> * Programs attached to dst_input() or dst_output() are restricted and
> may only read the data of an skb. This prevents modification and
> possible invalidation of already validated packet headers on receive
> and the construction of illegal headers while the IP headers are
> still being assembled.
>
> * Programs attached to lwtunnel_xmit() are allowed to modify packet
> content as well as prepending an L2 header via a newly introduced
> helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
> after the IP header has been assembled completely.
>
> All BPF programs receive an skb with L3 headers attached and may return
> one of the following error codes:
>
> BPF_OK - Continue routing as per nexthop
> BPF_DROP - Drop skb and return EPERM
> BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> (Only valid in lwtunnel_xmit() context)
>
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
...
> +#define LWT_BPF_MAX_HEADROOM 128
why 128?
btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> + struct dst_entry *dst, bool can_redirect)
> +{
> + int ret;
> +
> + /* Preempt disable is needed to protect per-cpu redirect_info between
> + * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> + * access to maps strictly require a rcu_read_lock() for protection,
> + * mixing with BH RCU lock doesn't work.
> + */
> + preempt_disable();
> + rcu_read_lock();
> + bpf_compute_data_end(skb);
> + ret = BPF_PROG_RUN(lwt->prog, skb);
> + rcu_read_unlock();
> +
> + switch (ret) {
> + case BPF_OK:
> + break;
> +
> + case BPF_REDIRECT:
> + if (!can_redirect) {
> + WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> + lwt->name ? : "<unknown>");
> + ret = BPF_OK;
> + } else {
> + ret = skb_do_redirect(skb);
I think this assumes that the program did bpf_skb_push() and an L2 header is present.
Would it make sense to check that mac_header < network_header here to make
sure that it actually happened? I think the cost of a single 'if' isn't much.
Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
so the program shouldn't be doing bpf_skb_push() in such a case...
Maybe rename bpf_skb_push to bpf_skb_push_l2,
since it's doing skb_reset_mac_header(skb) at the end of it?
Or it's probably better to use the 'flags' argument to tell whether
bpf_skb_push() should set mac_header or not? Then this bit:
> + case BPF_OK:
> + /* If the L3 header was expanded, headroom might be too
> + * small for L2 header now, expand as needed.
> + */
> + ret = xmit_check_hhlen(skb);
will work fine as well...
which probably needs a "mac_header wasn't set" check? Or is it fine?
All bpf bits look great. Thanks!
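
For reference, the push-then-redirect pattern under discussion looks
roughly like this from the program side. A sketch only, loosely
modeled on the series' test program: the bpf_skb_push(skb, len, flags)
signature is the one proposed here, and SRC_MAC/DST_MAC/DST_IFINDEX
are the build-time defines the test script passes to clang:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include "bpf_helpers.h"

  SEC("push_ll_and_redirect")
  int push_ll_and_redirect(struct __sk_buff *skb)
  {
          __u64 smac = SRC_MAC, dmac = DST_MAC;
          struct ethhdr ehdr;

          /* Make room for an L2 header; this also resets mac_header,
           * so mac_header < network_header holds afterwards.
           */
          if (bpf_skb_push(skb, sizeof(ehdr), 0) < 0)
                  return BPF_DROP;

          ehdr.h_proto = __constant_htons(ETH_P_IP);
          __builtin_memcpy(&ehdr.h_source, &smac, ETH_ALEN);
          __builtin_memcpy(&ehdr.h_dest, &dmac, ETH_ALEN);
          if (bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0) < 0)
                  return BPF_DROP;

          /* bpf_redirect() records the target ifindex in the per-cpu
           * redirect_info and returns BPF_REDIRECT, which hands the
           * skb to skb_do_redirect() above.
           */
          return bpf_redirect(DST_IFINDEX, 0);
  }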
* Re: [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF
2016-11-29 13:21 ` [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF Thomas Graf
@ 2016-11-30 0:17 ` Alexei Starovoitov
2016-11-30 6:52 ` Thomas Graf
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-11-30 0:17 UTC (permalink / raw)
To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
On Tue, Nov 29, 2016 at 02:21:23PM +0100, Thomas Graf wrote:
> Adds a series of tests to verify the functionality of attaching
> BPF programs at LWT hooks.
>
> Also adds a sample which collects a histogram of packet sizes which
> pass through an LWT hook.
>
> $ ./lwt_len_hist.sh
> Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
> Recv Send Send
> Socket Socket Message Elapsed
> Size Size Size Time Throughput
> bytes bytes bytes secs. 10^6bits/sec
>
> 87380 16384 16384 10.00 39857.69
Nice!
> + ret = bpf_redirect(ifindex, 0);
> + if (ret < 0) {
> + printk("bpf_redirect() failed: %d\n", ret);
> + return BPF_DROP;
> + }
this 'if' looks a bit weird. You're passing 0 as flags,
so this helper will always succeed.
Other sample code often does 'return bpf_redirect(...)'
due to this reasoning.
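
I.e. the sample can collapse the call and the return code into one
statement:

          /* With flags == 0, bpf_redirect() only records the target
           * ifindex and returns BPF_REDIRECT; errors surface later
           * in skb_do_redirect().
           */
          return bpf_redirect(ifindex, 0);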
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-30 0:15 ` Alexei Starovoitov
@ 2016-11-30 2:52 ` John Fastabend
2016-11-30 5:37 ` Alexei Starovoitov
2016-11-30 6:48 ` Thomas Graf
1 sibling, 1 reply; 16+ messages in thread
From: John Fastabend @ 2016-11-30 2:52 UTC (permalink / raw)
To: Alexei Starovoitov, Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
>> Registers new BPF program types which correspond to the LWT hooks:
>> - BPF_PROG_TYPE_LWT_IN => dst_input()
>> - BPF_PROG_TYPE_LWT_OUT => dst_output()
>> - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>>
>> The separate program types are required to differentiate between the
>> capabilities each LWT hook allows:
>>
>> * Programs attached to dst_input() or dst_output() are restricted and
>> may only read the data of an skb. This prevents modification and
>> possible invalidation of already validated packet headers on receive
>> and the construction of illegal headers while the IP headers are
>> still being assembled.
>>
>> * Programs attached to lwtunnel_xmit() are allowed to modify packet
>> content as well as prepending an L2 header via a newly introduced
>> helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>> after the IP header has been assembled completely.
>>
>> All BPF programs receive an skb with L3 headers attached and may return
>> one of the following error codes:
>>
>> BPF_OK - Continue routing as per nexthop
>> BPF_DROP - Drop skb and return EPERM
>> BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>> (Only valid in lwtunnel_xmit() context)
>>
>> The return codes are binary compatible with their TC_ACT_
>> relatives to ease compatibility.
>>
>> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ...
>> +#define LWT_BPF_MAX_HEADROOM 128
>
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
>
hopefully not too off-topic but for XDP I would like to see this get
passed down with the program. It would be more generic and drivers could
configure the headroom on demand and more importantly verify that a
program pushing data is not going to fail at runtime.
.John
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-30 2:52 ` John Fastabend
@ 2016-11-30 5:37 ` Alexei Starovoitov
2016-11-30 16:57 ` John Fastabend
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-11-30 5:37 UTC (permalink / raw)
To: John Fastabend; +Cc: Thomas Graf, davem, netdev, daniel, tom, roopa, hannes
On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> >> Registers new BPF program types which correspond to the LWT hooks:
> >> - BPF_PROG_TYPE_LWT_IN => dst_input()
> >> - BPF_PROG_TYPE_LWT_OUT => dst_output()
> >> - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
> >>
> >> The separate program types are required to differentiate between the
> >> capabilities each LWT hook allows:
> >>
> >> * Programs attached to dst_input() or dst_output() are restricted and
> >> may only read the data of an skb. This prevents modification and
> >> possible invalidation of already validated packet headers on receive
> >> and the construction of illegal headers while the IP headers are
> >> still being assembled.
> >>
> >> * Programs attached to lwtunnel_xmit() are allowed to modify packet
> >> content as well as prepending an L2 header via a newly introduced
> >> helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
> >> after the IP header has been assembled completely.
> >>
> >> All BPF programs receive an skb with L3 headers attached and may return
> >> one of the following error codes:
> >>
> >> BPF_OK - Continue routing as per nexthop
> >> BPF_DROP - Drop skb and return EPERM
> >> BPF_REDIRECT - Redirect skb to device as per redirect() helper.
> >> (Only valid in lwtunnel_xmit() context)
> >>
> >> The return codes are binary compatible with their TC_ACT_
> >> relatives to ease compatibility.
> >>
> >> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > ...
> >> +#define LWT_BPF_MAX_HEADROOM 128
> >
> > why 128?
> > btw I'm thinking for XDP to use 256, so metadata can be stored in there.
> >
>
> hopefully not too off-topic but for XDP I would like to see this get
Definitely off-topic. lwt->headroom is an existing concept. Too late
to do anything about it.
> passed down with the program. It would be more generic and drivers could
> configure the headroom on demand and more importantly verify that a
> program pushing data is not going to fail at runtime.
For xdp I think it will be problematic, since we'd have to check for
that value at prog array access to make sure tail calls are not broken.
Mix and match won't be possible.
So what does 'configure the headroom on demand' buy us?
Isn't it much easier to tell all drivers "always reserve this much"?
We burn the page anyway.
If it's configurable per driver, then we'd need an API
to retrieve it. Yet the program author doesn't care what the value is.
If a program needs to do udp encap, it will try to do it. No matter what.
If the xdp_adjust_head() helper fails, the program will likely decide
to drop the packet. In some cases it may decide to punt to the stack
for further processing, but for high performance dataplane code
it's highly unlikely.
If it's configurable to something that is not a cache line boundary,
hw dma performance may suffer and so on.
So I see only cons in such a 'configurable headroom' and propose
to have a fixed 256 byte headroom for XDP,
which is enough for any sensible encap and metadata.
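
With a fixed headroom, the program side reduces to the try-and-drop
pattern described above. A sketch; bpf_xdp_adjust_head() had not been
merged at this point, so the helper name and (ctx, delta) signature
are assumptions, and ENCAP_LEN is a placeholder for the outer header
size:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  #define ENCAP_LEN 28    /* hypothetical outer IPv4 + UDP header */

  SEC("xdp_encap")
  int xdp_encap(struct xdp_md *ctx)
  {
          /* A negative delta grows the headroom available at
           * ctx->data by moving it towards the buffer start.
           */
          if (bpf_xdp_adjust_head(ctx, -ENCAP_LEN) < 0)
                  return XDP_DROP;        /* no room left, drop */

          /* ... write the encapsulation header at ctx->data ... */

          return XDP_TX;
  }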
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-30 0:15 ` Alexei Starovoitov
2016-11-30 2:52 ` John Fastabend
@ 2016-11-30 6:48 ` Thomas Graf
2016-11-30 7:01 ` Alexei Starovoitov
1 sibling, 1 reply; 16+ messages in thread
From: Thomas Graf @ 2016-11-30 6:48 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: davem, netdev, daniel, tom, roopa, hannes
On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> ...
> > +#define LWT_BPF_MAX_HEADROOM 128
>
> why 128?
> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
fine with bumping it to 256.
> > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > + struct dst_entry *dst, bool can_redirect)
> > +{
> > + int ret;
> > +
> > + /* Preempt disable is needed to protect per-cpu redirect_info between
> > + * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > + * access to maps strictly require a rcu_read_lock() for protection,
> > + * mixing with BH RCU lock doesn't work.
> > + */
> > + preempt_disable();
> > + rcu_read_lock();
> > + bpf_compute_data_end(skb);
> > + ret = BPF_PROG_RUN(lwt->prog, skb);
> > + rcu_read_unlock();
> > +
> > + switch (ret) {
> > + case BPF_OK:
> > + break;
> > +
> > + case BPF_REDIRECT:
> > + if (!can_redirect) {
> > + WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> > + lwt->name ? : "<unknown>");
> > + ret = BPF_OK;
> > + } else {
> > + ret = skb_do_redirect(skb);
>
> I think this assumes that the program did bpf_skb_push() and an L2 header is present.
> Would it make sense to check that mac_header < network_header here to make
> sure that it actually happened? I think the cost of a single 'if' isn't much.
> Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> so the program shouldn't be doing bpf_skb_push() in such a case...
We are currently guaranteeing mac_header <= network_header given that
bpf_skb_push() is calling skb_reset_mac_header() unconditionally.
Even if a program were to push an L2 header and then redirect to an l3
tunnel, __bpf_redirect_no_mac will pull it off again and correct the
mac_header offset.
Should we check in __bpf_redirect_common() whether mac_header <
network_header then, or add it to lwt-bpf conditional on
dev_is_mac_header_xmit()?
> Maybe rename bpf_skb_push to bpf_skb_push_l2,
> since it's doing skb_reset_mac_header(skb) at the end of it?
> Or it's probably better to use the 'flags' argument to tell whether
> bpf_skb_push() should set mac_header or not?
I added the flags argument for this purpose but left it unused for
now. This will allow adding a flag later to extend the l3 header
prior to pushing an l2 header, hence the current helper name. But
even then, we would always reset the mac header as well to ensure we
never leave an skb with mac_header > network_header. Are you seeing a
scenario where we would want to omit the mac header reset?
> Then this bit:
>
> > + case BPF_OK:
> > + /* If the L3 header was expanded, headroom might be too
> > + * small for L2 header now, expand as needed.
> > + */
> > + ret = xmit_check_hhlen(skb);
>
> will work fine as well...
> which probably needs a "mac_header wasn't set" check? Or is it fine?
Right. The check is not strictly required right now but is a safety
net to ensure that the skb passed to dst_neigh_output(), which will
add the l2 header in the non-redirect case, always has sufficient
headroom. It will currently be a NOP as we do not yet allow extending
the l3 header in bpf_skb_push(). I left this in there to ensure that
we do not forget to add this later.
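
For context, such a safety net boils down to expanding the skb head
whenever the output device needs more headroom than is left, along
these lines (a reconstruction from the discussion, not necessarily
the exact patch code):

  static int xmit_check_hhlen(struct sk_buff *skb)
  {
          int hh_len = skb_dst(skb)->dev->needed_headroom;

          if (skb_headroom(skb) < hh_len) {
                  int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));

                  /* Reallocate the head so the later L2 header push
                   * by the neighbour output path cannot run out of
                   * room.
                   */
                  if (pskb_expand_head(skb, nhead, 0, GFP_ATOMIC))
                          return -ENOMEM;
          }

          return 0;
  }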
* Re: [PATCH net-next v3 4/4] bpf: Add tests and samples for LWT-BPF
2016-11-30 0:17 ` Alexei Starovoitov
@ 2016-11-30 6:52 ` Thomas Graf
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Graf @ 2016-11-30 6:52 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: davem, netdev, daniel, tom, roopa, hannes
On 11/29/16 at 04:17pm, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 02:21:23PM +0100, Thomas Graf wrote:
> > Adds a series of tests to verify the functionality of attaching
> > BPF programs at LWT hooks.
> >
> > Also adds a sample which collects a histogram of packet sizes which
> > pass through an LWT hook.
> >
> > $ ./lwt_len_hist.sh
> > Starting netserver with host 'IN(6)ADDR_ANY' port '12865' and family AF_UNSPEC
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.253.2 () port 0 AF_INET : demo
> > Recv Send Send
> > Socket Socket Message Elapsed
> > Size Size Size Time Throughput
> > bytes bytes bytes secs. 10^6bits/sec
> >
> > 87380 16384 16384 10.00 39857.69
>
> Nice!
>
> > + ret = bpf_redirect(ifindex, 0);
> > + if (ret < 0) {
> > + printk("bpf_redirect() failed: %d\n", ret);
> > + return BPF_DROP;
> > + }
>
> this 'if' looks a bit weird. You're passing 0 as flags,
> so this helper will always succeed.
> Other sample code often does 'return bpf_redirect(...)'
> due to this reasoning.
Right, the if branch is absolutely useless. I will remove it.
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-30 6:48 ` Thomas Graf
@ 2016-11-30 7:01 ` Alexei Starovoitov
2016-11-30 8:57 ` Thomas Graf
0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2016-11-30 7:01 UTC (permalink / raw)
To: Thomas Graf; +Cc: davem, netdev, daniel, tom, roopa, hannes
On Wed, Nov 30, 2016 at 07:48:51AM +0100, Thomas Graf wrote:
> On 11/29/16 at 04:15pm, Alexei Starovoitov wrote:
> > On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
> > ...
> > > +#define LWT_BPF_MAX_HEADROOM 128
> >
> > why 128?
> > btw I'm thinking for XDP to use 256, so metadata can be stored in there.
>
> It's an arbitrary limit to catch obvious misconfiguration. I'm absolutely
> fine with bumping it to 256.
>
> > > +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> > > + struct dst_entry *dst, bool can_redirect)
> > > +{
> > > + int ret;
> > > +
> > > + /* Preempt disable is needed to protect per-cpu redirect_info between
> > > + * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> > > + * access to maps strictly require a rcu_read_lock() for protection,
> > > + * mixing with BH RCU lock doesn't work.
> > > + */
> > > + preempt_disable();
> > > + rcu_read_lock();
> > > + bpf_compute_data_end(skb);
> > > + ret = BPF_PROG_RUN(lwt->prog, skb);
> > > + rcu_read_unlock();
> > > +
> > > + switch (ret) {
> > > + case BPF_OK:
> > > + break;
> > > +
> > > + case BPF_REDIRECT:
> > > + if (!can_redirect) {
> > > + WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> > > + lwt->name ? : "<unknown>");
> > > + ret = BPF_OK;
> > > + } else {
> > > + ret = skb_do_redirect(skb);
> >
> > I think this assumes that the program did bpf_skb_push() and an L2 header is present.
> > Would it make sense to check that mac_header < network_header here to make
> > sure that it actually happened? I think the cost of a single 'if' isn't much.
> > Also skb_do_redirect() can redirect to l3 tunnels like ipip ;)
> > so the program shouldn't be doing bpf_skb_push() in such a case...
>
> We are currently guaranteeing mac_header <= network_header given that
> bpf_skb_push() is calling skb_reset_mac_header() unconditionally.
>
> Even if a program were to push an L2 header and then redirect to an l3
> tunnel, __bpf_redirect_no_mac will pull it off again and correct the
> mac_header offset.
Yes, that part is fine.
> Should we check in __bpf_redirect_common() whether mac_header <
> network_header then, or add it to lwt-bpf conditional on
> dev_is_mac_header_xmit()?
Maybe an extra 'if' in lwt-bpf is all we need?
I'm still missing what will happen if we 'forget' to do
bpf_skb_push() inside the lwt-bpf program, but still do a redirect
in the lwt_xmit stage to an l2 netdev...
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-30 7:01 ` Alexei Starovoitov
@ 2016-11-30 8:57 ` Thomas Graf
0 siblings, 0 replies; 16+ messages in thread
From: Thomas Graf @ 2016-11-30 8:57 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: davem, netdev, daniel, tom, roopa, hannes
On 11/29/16 at 11:01pm, Alexei Starovoitov wrote:
> On Wed, Nov 30, 2016 at 07:48:51AM +0100, Thomas Graf wrote:
> > Should we check in __bpf_redirect_common() whether mac_header <
> > network_header then, or add it to lwt-bpf conditional on
> > dev_is_mac_header_xmit()?
>
> Maybe an extra 'if' in lwt-bpf is all we need?
Agreed, I will add a mac_header < network_header check to lwt-bpf if we
redirect to an l2 device.
> I'm still missing what will happen if we 'forget' to do
> bpf_skb_push() inside the lwt-bpf program, but still do a redirect
> in the lwt_xmit stage to an l2 netdev...
The same as for an AF_PACKET socket not providing an actual L2 header.
I will add a test case to cover this scenario as well.
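
Whether it ends up in lwt-bpf conditional on dev_is_mac_header_xmit()
or in __bpf_redirect_common() itself, the check would look roughly
like this (the error code here is an assumption):

          /* Verify that a link layer header is actually carried,
           * i.e. that the program called bpf_skb_push(); otherwise
           * an L2 device would transmit garbage in place of the
           * missing header.
           */
          if (unlikely(skb->mac_header >= skb->network_header)) {
                  kfree_skb(skb);
                  return -ERANGE;
          }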
* Re: [PATCH net-next v3 3/4] bpf: BPF for lightweight tunnel infrastructure
2016-11-30 5:37 ` Alexei Starovoitov
@ 2016-11-30 16:57 ` John Fastabend
0 siblings, 0 replies; 16+ messages in thread
From: John Fastabend @ 2016-11-30 16:57 UTC (permalink / raw)
To: Alexei Starovoitov; +Cc: Thomas Graf, davem, netdev, daniel, tom, roopa, hannes
On 16-11-29 09:37 PM, Alexei Starovoitov wrote:
> On Tue, Nov 29, 2016 at 06:52:36PM -0800, John Fastabend wrote:
>> On 16-11-29 04:15 PM, Alexei Starovoitov wrote:
>>> On Tue, Nov 29, 2016 at 02:21:22PM +0100, Thomas Graf wrote:
>>>> Registers new BPF program types which correspond to the LWT hooks:
>>>> - BPF_PROG_TYPE_LWT_IN => dst_input()
>>>> - BPF_PROG_TYPE_LWT_OUT => dst_output()
>>>> - BPF_PROG_TYPE_LWT_XMIT => lwtunnel_xmit()
>>>>
>>>> The separate program types are required to differentiate between the
>>>> capabilities each LWT hook allows:
>>>>
>>>> * Programs attached to dst_input() or dst_output() are restricted and
>>>> may only read the data of an skb. This prevents modification and
>>>> possible invalidation of already validated packet headers on receive
>>>> and the construction of illegal headers while the IP headers are
>>>> still being assembled.
>>>>
>>>> * Programs attached to lwtunnel_xmit() are allowed to modify packet
>>>> content as well as prepending an L2 header via a newly introduced
>>>> helper bpf_skb_push(). This is safe as lwtunnel_xmit() is invoked
>>>> after the IP header has been assembled completely.
>>>>
>>>> All BPF programs receive an skb with L3 headers attached and may return
>>>> one of the following error codes:
>>>>
>>>> BPF_OK - Continue routing as per nexthop
>>>> BPF_DROP - Drop skb and return EPERM
>>>> BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>>>> (Only valid in lwtunnel_xmit() context)
>>>>
>>>> The return codes are binary compatible with their TC_ACT_
>>>> relatives to ease compatibility.
>>>>
>>>> Signed-off-by: Thomas Graf <tgraf@suug.ch>
>>> ...
>>>> +#define LWT_BPF_MAX_HEADROOM 128
>>>
>>> why 128?
>>> btw I'm thinking for XDP to use 256, so metadata can be stored in there.
>>>
>>
>> hopefully not too off-topic but for XDP I would like to see this get
>
> Definitely off-topic. lwt->headroom is an existing concept. Too late
> to do anything about it.
>
>> passed down with the program. It would be more generic and drivers could
>> configure the headroom on demand and more importantly verify that a
>> program pushing data is not going to fail at runtime.
>
> For xdp I think it will be problematic, since we'd have to check for
> that value at prog array access to make sure tail calls are not broken.
> Mix and match won't be possible.
> So what does 'configure the headroom on demand' buy us?
> Isn't it much easier to tell all drivers "always reserve this much"?
> We burn the page anyway.
> If it's configurable per driver, then we'd need an API
> to retrieve it. Yet the program author doesn't care what the value is.
> If a program needs to do udp encap, it will try to do it. No matter what.
> If the xdp_adjust_head() helper fails, the program will likely decide
> to drop the packet. In some cases it may decide to punt to the stack
> for further processing, but for high performance dataplane code
> it's highly unlikely.
> If it's configurable to something that is not a cache line boundary,
> hw dma performance may suffer and so on.
> So I see only cons in such a 'configurable headroom' and propose
> to have a fixed 256 byte headroom for XDP,
> which is enough for any sensible encap and metadata.
>
OK, I'm convinced; let it be fixed at some conservative value.