Netdev List

Netdev List
 help / color / mirror / Atom feed

* [bpf-next v1 7/9] net/ipv6: Add fib lookup stubs for use in bpf helper
From: David Ahern @ 2018-05-03  3:53 UTC (permalink / raw)
  To: netdev, borkmann, ast
  Cc: davem, shm, roopa, brouer, toke, john.fastabend, David Ahern
In-Reply-To: <20180503035319.18290-1-dsahern@gmail.com>

Add stubs to retrieve a handle to an IPv6 FIB table, fib6_get_table,
a stub to do a lookup in a specific table, fib6_table_lookup, and
a stub for a full route lookup.

The stubs are needed for core bpf code to handle the case when the
IPv6 module is not builtin.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/addrconf.h   | 14 ++++++++++++++
 net/ipv6/addrconf_core.c | 33 ++++++++++++++++++++++++++++++++-
 net/ipv6/af_inet6.c      |  6 +++++-
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 8312cc25a3af..ff766ab207e0 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -223,6 +223,20 @@ struct ipv6_stub {
 				 const struct in6_addr *addr);
 	int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
 			       struct dst_entry **dst, struct flowi6 *fl6);
+
+	struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
+	struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
+					 struct flowi6 *fl6, int flags);
+	struct fib6_info *(*fib6_table_lookup)(struct net *net,
+					      struct fib6_table *table,
+					      int oif, struct flowi6 *fl6,
+					      int flags);
+	struct fib6_info *(*fib6_multipath_select)(const struct net *net,
+						   struct fib6_info *f6i,
+						   struct flowi6 *fl6, int oif,
+						   const struct sk_buff *skb,
+						   int strict);
+
 	void (*udpv6_encap_enable)(void);
 	void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr,
 			      const struct in6_addr *solicited_addr,
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 32b564dfd02a..2fe754fd4f5e 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -134,8 +134,39 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, struct sock *u1,
 	return -EAFNOSUPPORT;
 }
 
+static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
+{
+	return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table,
+			       int oif, struct flowi6 *fl6, int flags)
+{
+	return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
+			 int flags)
+{
+	return NULL;
+}
+
+static struct fib6_info *
+eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info *f6i,
+				   struct flowi6 *fl6, int oif,
+				   const struct sk_buff *skb, int strict)
+{
+	return f6i;
+}
+
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
-	.ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
+	.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
+	.fib6_get_table    = eafnosupport_fib6_get_table,
+	.fib6_table_lookup = eafnosupport_fib6_table_lookup,
+	.fib6_lookup       = eafnosupport_fib6_lookup,
+	.fib6_multipath_select = eafnosupport_fib6_multipath_select,
 };
 EXPORT_SYMBOL_GPL(ipv6_stub);
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 36d622c477b1..c0e8255d50bb 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -887,7 +887,11 @@ static struct pernet_operations inet6_net_ops = {
 static const struct ipv6_stub ipv6_stub_impl = {
 	.ipv6_sock_mc_join = ipv6_sock_mc_join,
 	.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
-	.ipv6_dst_lookup = ip6_dst_lookup,
+	.ipv6_dst_lookup   = ip6_dst_lookup,
+	.fib6_get_table	   = fib6_get_table,
+	.fib6_table_lookup = fib6_table_lookup,
+	.fib6_lookup       = fib6_lookup,
+	.fib6_multipath_select = fib6_multipath_select,
 	.udpv6_encap_enable = udpv6_encap_enable,
 	.ndisc_send_na = ndisc_send_na,
 	.nd_tbl	= &nd_tbl,
-- 
2.11.0

^ permalink raw reply related

* [bpf-next v1 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Ahern @ 2018-05-03  3:53 UTC (permalink / raw)
  To: netdev, borkmann, ast
  Cc: davem, shm, roopa, brouer, toke, john.fastabend, David Ahern
In-Reply-To: <20180503035319.18290-1-dsahern@gmail.com>

Provide a helper for doing a FIB and neighbor lookup in the kernel
tables from an XDP program. The helper provides a fastpath for forwarding
packets. If the packet is a local delivery or for any reason is not a
simple lookup and forward, the packet continues up the stack.

If it is to be forwarded, the forwarding can be done directly if the
neighbor is already known. If the neighbor does not exist, the first
few packets go up the stack for neighbor resolution. Once resolved, the
xdp program provides the fast path.

On successful lookup the nexthop dmac, current device smac and egress
device index are returned.

The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
are implemented in this patch. The API includes layer 4 parameters if
the XDP program chooses to do deep packet inspection to allow compare
against ACLs implemented as FIB rules.

Header rewrite is left to the XDP program.

The lookup takes 2 flags:
- BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
  straight to the table associated with the device (expert setting for
  those looking to maximize throughput)

- BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
  Default is an ingress lookup.

Initial performance numbers collected by Jesper, forwarded packets/sec:

       Full stack    XDP FIB lookup    XDP Direct lookup
IPv4   1,947,969       7,074,156          7,415,333
IPv6   1,728,000       6,165,504          7,262,720

These number are single CPU core forwarding on a Broadwell
E5-1650 v4 @ 3.60GHz.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/uapi/linux/bpf.h |  83 ++++++++++++++-
 net/core/filter.c        | 263 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 345 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8daef7326bb7..360a1168c353 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -10,6 +10,8 @@
 
 #include <linux/types.h>
 #include <linux/bpf_common.h>
+#include <linux/if_ether.h>
+#include <linux/in6.h>
 
 /* Extended instruction set based on top of classic BPF */
 
@@ -1801,6 +1803,33 @@ union bpf_attr {
  * 	Return
  * 		a non-negative value equal to or less than size on success, or
  * 		a negative error in case of failure.
+ *
+ * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
+ *	Description
+ *		Do FIB lookup in kernel tables using parameters in *params*.
+ *		If lookup is successful and result shows packets is to be
+ *		forwarded, the neighbor tables are searched for the nexthop.
+ *		If successful (ie., FIB lookup shows forwarding and nexthop
+ *		is resolved), the nexthop address is returned in ipv4_dst,
+ *		ipv6_dst or mpls_out based on family, smac is set to mac
+ *		address of egress device, dmac is set to nexthop mac address,
+ *		rt_metric is set to metric from route.
+ *
+ *		*plen* argument is the size of the passed in struct.
+ *		*flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
+ *
+ *		**BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
+ *		full lookup using FIB rules
+ *		**BPF_FIB_LOOKUP_OUTPUT** mmeans do lookup from an egress
+ *		perspective (default is ingress)
+ *
+ *		*ctx* is either **struct xdp_md** for XDP programs or
+ *		**struct sk_buff** tc cls_act programs.
+ *
+ *	Return
+ *		Egress device index on success, 0 if packet needs to continue
+ *		up the stack for further processing or a negative error in case
+ *		of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -1870,7 +1899,8 @@ union bpf_attr {
 	FN(bind),			\
 	FN(xdp_adjust_tail),		\
 	FN(skb_get_xfrm_state),		\
-	FN(get_stack),
+	FN(get_stack),			\
+	FN(fib_lookup),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2278,4 +2308,55 @@ struct bpf_raw_tracepoint_args {
 	__u64 args[0];
 };
 
+/* DIRECT:  Skip the FIB rules and go to FIB table associated with device
+ * OUTPUT:  Do lookup from egress perspective; default is ingress
+ */
+#define BPF_FIB_LOOKUP_DIRECT  BIT(0)
+#define BPF_FIB_LOOKUP_OUTPUT  BIT(1)
+
+struct bpf_fib_lookup {
+	/* input */
+	__u8	family;   /* network family, AF_INET, AF_INET6, AF_MPLS */
+
+	/* set if lookup is to consider L4 data - e.g., FIB rules */
+	__u8	l4_protocol;
+	__be16	sport;
+	__be16	dport;
+
+	/* total length of packet from network header - used for MTU check */
+	__u16	tot_len;
+	__u32	ifindex;  /* L3 device index for lookup */
+
+	union {
+		/* inputs to lookup */
+		__u8	tos;		/* AF_INET  */
+		__be32	flowlabel;	/* AF_INET6 */
+
+		/* output: metric of fib result */
+		__u32 rt_metric;
+	};
+
+	union {
+		__be32		mpls_in;
+		__be32		ipv4_src;
+		struct in6_addr	ipv6_src;
+	};
+
+	/* input to bpf_fib_lookup, *dst is destination address.
+	 * output: bpf_fib_lookup sets to gateway address
+	 */
+	union {
+		/* return for MPLS lookups */
+		__be32		mpls_out[4];  /* support up to 4 labels */
+		__be32		ipv4_dst;
+		struct in6_addr	ipv6_dst;
+	};
+
+	/* output */
+	__be16	h_vlan_proto;
+	__be16	h_vlan_TCI;
+	__u8	smac[ETH_ALEN];
+	__u8	dmac[ETH_ALEN];
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index d3781daa26ab..c34ba2675a98 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -59,6 +59,10 @@
 #include <net/tcp.h>
 #include <net/xfrm.h>
 #include <linux/bpf_trace.h>
+#include <linux/inetdevice.h>
+#include <net/ip_fib.h>
+#include <net/flow.h>
+#include <net/arp.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -3788,6 +3792,261 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
+static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
+				  const struct neighbour *neigh,
+				  const struct net_device *dev)
+{
+	memcpy(params->dmac, neigh->ha, ETH_ALEN);
+	memcpy(params->smac, dev->dev_addr, ETH_ALEN);
+	params->h_vlan_TCI = 0;
+	params->h_vlan_proto = 0;
+
+	return dev->ifindex;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_INET)
+static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+			       u32 flags)
+{
+	struct in_device *in_dev;
+	struct neighbour *neigh;
+	struct net_device *dev;
+	struct fib_result res;
+	struct fib_nh *nh;
+	struct flowi4 fl4;
+	int err;
+
+	dev = dev_get_by_index_rcu(net, params->ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	/* verify forwarding is enabled on this interface */
+	in_dev = __in_dev_get_rcu(dev);
+	if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
+		return 0;
+
+	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
+		fl4.flowi4_iif = 1;
+		fl4.flowi4_oif = params->ifindex;
+	} else {
+		fl4.flowi4_iif = params->ifindex;
+		fl4.flowi4_oif = 0;
+	}
+	fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
+	fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
+	fl4.flowi4_flags = 0;
+
+	fl4.flowi4_proto = params->l4_protocol;
+	fl4.daddr = params->ipv4_dst;
+	fl4.saddr = params->ipv4_src;
+	fl4.fl4_sport = params->sport;
+	fl4.fl4_dport = params->dport;
+
+	if (flags & BPF_FIB_LOOKUP_DIRECT) {
+		u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
+		struct fib_table *tb;
+
+		tb = fib_get_table(net, tbid);
+		if (unlikely(!tb))
+			return 0;
+
+		err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
+	} else {
+		fl4.flowi4_mark = 0;
+		fl4.flowi4_secid = 0;
+		fl4.flowi4_tun_key.tun_id = 0;
+		fl4.flowi4_uid = sock_net_uid(net, NULL);
+
+		err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
+	}
+
+	if (err || res.type != RTN_UNICAST)
+		return 0;
+
+	if (res.fi->fib_nhs > 1)
+		fib_select_path(net, &res, &fl4, NULL);
+
+	nh = &res.fi->fib_nh[res.nh_sel];
+
+	/* do not handle lwt encaps right now */
+	if (nh->nh_lwtstate)
+		return 0;
+
+	dev = nh->nh_dev;
+	if (unlikely(!dev))
+		return 0;
+
+	if (nh->nh_gw)
+		params->ipv4_dst = nh->nh_gw;
+
+	params->rt_metric = res.fi->fib_priority;
+
+	/* xdp and cls_bpf programs are run in RCU-bh so
+	 * rcu_read_lock_bh is not needed here
+	 */
+	neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
+	if (neigh)
+		return bpf_fib_set_fwd_params(params, neigh, dev);
+
+	return 0;
+}
+#endif
+
+#if IS_ENABLED(CONFIG_IPV6)
+static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
+			       u32 flags)
+{
+	struct neighbour *neigh;
+	struct net_device *dev;
+	struct fib6_info *f6i;
+	struct flowi6 fl6;
+	int strict = 0;
+	int oif;
+
+	/* link local addresses are never forwarded */
+	if (rt6_need_strict(&params->ipv6_dst) ||
+	    rt6_need_strict(&params->ipv6_src))
+		return 0;
+
+	dev = dev_get_by_index_rcu(net, params->ifindex);
+	if (unlikely(!dev))
+		return -ENODEV;
+
+	if (flags & BPF_FIB_LOOKUP_OUTPUT) {
+		fl6.flowi6_iif = 1;
+		oif = fl6.flowi6_oif = params->ifindex;
+	} else {
+		oif = fl6.flowi6_iif = params->ifindex;
+		fl6.flowi6_oif = 0;
+		strict = RT6_LOOKUP_F_HAS_SADDR;
+	}
+	fl6.flowlabel = params->flowlabel;
+	fl6.flowi6_scope = 0;
+	fl6.flowi6_flags = 0;
+	fl6.mp_hash = 0;
+
+	fl6.flowi6_proto = params->l4_protocol;
+	fl6.daddr = params->ipv6_dst;
+	fl6.saddr = params->ipv6_src;
+	fl6.fl6_sport = params->sport;
+	fl6.fl6_dport = params->dport;
+
+	if (flags & BPF_FIB_LOOKUP_DIRECT) {
+		u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
+		struct fib6_table *tb;
+
+		tb = ipv6_stub->fib6_get_table(net, tbid);
+		if (unlikely(!tb))
+			return 0;
+
+		f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
+	} else {
+		fl6.flowi6_mark = 0;
+		fl6.flowi6_secid = 0;
+		fl6.flowi6_tun_key.tun_id = 0;
+		fl6.flowi6_uid = sock_net_uid(net, NULL);
+
+		f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict);
+	}
+
+	if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
+		return 0;
+
+	if (unlikely(f6i->fib6_flags & RTF_REJECT ||
+	    f6i->fib6_type != RTN_UNICAST))
+		return 0;
+
+	if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
+		f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
+						       fl6.flowi6_oif, NULL,
+						       strict);
+
+	if (f6i->fib6_nh.nh_lwtstate)
+		return 0;
+
+	if (f6i->fib6_flags & RTF_GATEWAY)
+		params->ipv6_dst = f6i->fib6_nh.nh_gw;
+
+	dev = f6i->fib6_nh.nh_dev;
+	params->rt_metric = f6i->fib6_metric;
+
+	/* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
+	 * not needed here. Can not use __ipv6_neigh_lookup_noref here
+	 * because we need to get nd_tbl via the stub
+	 */
+	neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
+				      ndisc_hashfn, &params->ipv6_dst, dev);
+	if (neigh)
+		return bpf_fib_set_fwd_params(params, neigh, dev);
+
+	return 0;
+}
+#endif
+
+BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
+	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
+{
+	if (plen < sizeof(*params))
+		return -EINVAL;
+
+	switch (params->family) {
+#if IS_ENABLED(CONFIG_INET)
+	case AF_INET:
+		return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
+					   flags);
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
+					   flags);
+#endif
+	}
+	return -ENOTSUPP;
+}
+
+static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
+	.func		= bpf_xdp_fib_lookup,
+	.gpl_only	= true,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_MEM,
+	.arg3_type      = ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
+	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
+{
+	if (plen < sizeof(*params))
+		return -EINVAL;
+
+	switch (params->family) {
+#if IS_ENABLED(CONFIG_INET)
+	case AF_INET:
+		return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
+#endif
+	}
+	return -ENOTSUPP;
+}
+
+static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
+	.func		= bpf_skb_fib_lookup,
+	.gpl_only	= true,
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_MEM,
+	.arg3_type      = ARG_CONST_SIZE,
+	.arg4_type	= ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3933,6 +4192,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
 #endif
+	case BPF_FUNC_fib_lookup:
+		return &bpf_skb_fib_lookup_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -3958,6 +4219,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_map_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
+	case BPF_FUNC_fib_lookup:
+		return &bpf_xdp_fib_lookup_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.11.0

^ permalink raw reply related

* [bpf-next v1 9/9] samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP
From: David Ahern @ 2018-05-03  3:53 UTC (permalink / raw)
  To: netdev, borkmann, ast
  Cc: davem, shm, roopa, brouer, toke, john.fastabend, David Ahern
In-Reply-To: <20180503035319.18290-1-dsahern@gmail.com>

Simple example of fast-path forwarding. It has a serious flaw
in not verifying the egress device index supports XDP forwarding.
If the egress device does not packets are dropped.

Take this only as a simple example of fast-path forwarding.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 samples/bpf/Makefile                      |   4 +
 samples/bpf/xdp_fwd_kern.c                | 113 +++++++++++++++++++++++++
 samples/bpf/xdp_fwd_user.c                | 136 ++++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 4 files changed, 256 insertions(+)
 create mode 100644 samples/bpf/xdp_fwd_kern.c
 create mode 100644 samples/bpf/xdp_fwd_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 5e31770ac087..393dac1c43f4 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdp_fwd
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -98,6 +99,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdp_fwd-objs := bpf_load.o $(LIBBPF) xdp_fwd_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdp_fwd_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdp_fwd += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
new file mode 100644
index 000000000000..7eeaa32538b1
--- /dev/null
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include "bpf_helpers.h"
+
+#define IPV6_FLOWINFO_MASK              cpu_to_be32(0x0FFFFFFF)
+
+struct bpf_map_def SEC("maps") tx_port = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 64,
+};
+
+static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct bpf_fib_lookup fib_params;
+	struct ethhdr *eth = data;
+	int out_index;
+	u16 h_proto;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	__builtin_memset(&fib_params, 0, sizeof(fib_params));
+
+	h_proto = eth->h_proto;
+	if (h_proto == htons(ETH_P_IP)) {
+		struct iphdr *iph = data + nh_off;
+
+		if (iph + 1 > data_end)
+			return XDP_DROP;
+
+		fib_params.family	= AF_INET;
+		fib_params.tos		= iph->tos;
+		fib_params.l4_protocol	= iph->protocol;
+		fib_params.sport	= 0;
+		fib_params.dport	= 0;
+		fib_params.tot_len	= ntohs(iph->tot_len);
+		fib_params.ipv4_src	= iph->saddr;
+		fib_params.ipv4_dst	= iph->daddr;
+	} else if (h_proto == htons(ETH_P_IPV6)) {
+		struct ipv6hdr *iph = data + nh_off;
+
+		if (iph + 1 > data_end)
+			return XDP_DROP;
+
+		fib_params.family	= AF_INET6;
+		fib_params.flowlabel	= *(__be32 *)iph & IPV6_FLOWINFO_MASK;
+		fib_params.l4_protocol	= iph->nexthdr;
+		fib_params.sport	= 0;
+		fib_params.dport	= 0;
+		fib_params.tot_len	= ntohs(iph->payload_len);
+		fib_params.ipv6_src	= iph->saddr;
+		fib_params.ipv6_dst	= iph->daddr;
+	} else {
+		return XDP_PASS;
+	}
+
+	fib_params.ifindex = ctx->ingress_ifindex;
+
+	out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
+
+	/* verify egress index has xdp support
+	 * TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
+	 *       cannot pass map_type 14 into func bpf_map_lookup_elem#1:
+	 * NOTE: without verification that egress index supports XDP
+	 *       forwarding packets are dropped.
+	 */
+	if (out_index > 0) {
+		memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
+		memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
+		return bpf_redirect_map(&tx_port, out_index, 0);
+	}
+
+	return XDP_PASS;
+}
+
+SEC("xdp_fwd")
+int xdp_fwd_prog(struct xdp_md *ctx)
+{
+	return xdp_fwd_flags(ctx, 0);
+}
+
+SEC("xdp_fwd_direct")
+int xdp_fwd_direct_prog(struct xdp_md *ctx)
+{
+	return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
new file mode 100644
index 000000000000..9c6606f57126
--- /dev/null
+++ b/samples/bpf/xdp_fwd_user.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/limits.h>
+#include <net/if.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libgen.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+
+static int do_attach(int idx, int fd, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, fd, 0);
+	if (err < 0)
+		printf("ERROR: failed to attach program to %s\n", name);
+
+	return err;
+}
+
+static int do_detach(int idx, const char *name)
+{
+	int err;
+
+	err = bpf_set_link_xdp_fd(idx, -1, 0);
+	if (err < 0)
+		printf("ERROR: failed to detach program from %s\n", name);
+
+	return err;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] interface-list\n"
+		"\nOPTS:\n"
+		"    -d    detach program\n"
+		"    -D    direct table lookups (skip fib rules)\n",
+		prog);
+}
+
+int main(int argc, char **argv)
+{
+	char filename[PATH_MAX];
+	int opt, i, idx, err;
+	int prog_id = 0;
+	int attach = 1;
+	int ret = 0;
+
+	while ((opt = getopt(argc, argv, ":dD")) != -1) {
+		switch (opt) {
+		case 'd':
+			attach = 0;
+			break;
+		case 'D':
+			prog_id = 1;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (optind == argc) {
+		usage(basename(argv[0]));
+		return 1;
+	}
+
+	if (attach) {
+		snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+		if (access(filename, O_RDONLY) < 0) {
+			printf("error accessing file %s: %s\n",
+				filename, strerror(errno));
+			return 1;
+		}
+
+		if (load_bpf_file(filename)) {
+			printf("%s", bpf_log_buf);
+			return 1;
+		}
+
+		if (!prog_fd[prog_id]) {
+			printf("load_bpf_file: %s\n", strerror(errno));
+			return 1;
+		}
+	}
+	if (attach) {
+		for (i = 1; i < 64; ++i)
+			bpf_map_update_elem(map_fd[0], &i, &i, 0);
+	}
+
+	for (i = optind; i < argc; ++i) {
+		idx = if_nametoindex(argv[i]);
+		if (!idx)
+			idx = strtoul(argv[i], NULL, 0);
+
+		if (!idx) {
+			fprintf(stderr, "Invalid arg\n");
+			return 1;
+		}
+		if (!attach) {
+			err = do_detach(idx, argv[i]);
+			if (err)
+				ret = err;
+		} else {
+			err = do_attach(idx, prog_fd[prog_id], argv[i]);
+			if (err)
+				ret = err;
+		}
+	}
+
+	return ret;
+}
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 265f8e0e8ada..2375d06c706b 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -103,6 +103,9 @@ static int (*bpf_skb_get_xfrm_state)(void *ctx, int index, void *state,
 	(void *) BPF_FUNC_skb_get_xfrm_state;
 static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
 	(void *) BPF_FUNC_get_stack;
+static int (*bpf_fib_lookup)(void *ctx, struct bpf_fib_lookup *params,
+			     int plen, __u32 flags) =
+	(void *) BPF_FUNC_fib_lookup;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.11.0

^ permalink raw reply related

* [PATCH 0/2] multi-threading device shutdown
From: Pavel Tatashin @ 2018-05-03  3:59 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh

Do a faster shutdown by calling dev->*->shutdown(dev) in parallel.
device_shutdown() calls these functions for every single device but
only using one thread.

Since, nothing else is running on the machine by the device_shutdown()
s called, there is no reason not to utilize all the available CPU
resources.

Pavel Tatashin (2):
  ixgbe: release lock for the duration of ixgbe_suspend_close()
  drivers core: multi-threading device shutdown

 drivers/base/core.c                           | 238 ++++++++++++++----
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   9 +-
 2 files changed, 197 insertions(+), 50 deletions(-)

-- 
2.17.0

^ permalink raw reply

* [PATCH 1/2] ixgbe: release lock for the duration of ixgbe_suspend_close()
From: Pavel Tatashin @ 2018-05-03  3:59 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh
In-Reply-To: <20180503035931.22439-1-pasha.tatashin@oracle.com>

Currently, during device_shutdown() ixgbe holds rtnl_lock for the duration
of lengthy ixgbe_close_suspend(). On machines with multiple ixgbe cards
this lock prevents scaling if device_shutdown() function is multi-threaded.

It is not necessary to hold this lock during ixgbe_close_suspend()
as it is not held when ixgbe_close() is called also during shutdown but for
kexec case.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index afadba99f7b8..e7875b58854b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6748,8 +6748,15 @@ static int __ixgbe_shutdown(struct pci_dev *pdev, bool *enable_wake)
 	rtnl_lock();
 	netif_device_detach(netdev);
 
-	if (netif_running(netdev))
+	if (netif_running(netdev)) {
+		/* Suspend takes a long time, device_shutdown may be
+		 * parallelized this function, so drop lock for the
+		 * duration of this call.
+		 */
+		rtnl_unlock();
 		ixgbe_close_suspend(adapter);
+		rtnl_lock();
+	}
 
 	ixgbe_clear_interrupt_scheme(adapter);
 	rtnl_unlock();
-- 
2.17.0

^ permalink raw reply related

* [PATCH 2/2] drivers core: multi-threading device shutdown
From: Pavel Tatashin @ 2018-05-03  3:59 UTC (permalink / raw)
  To: pasha.tatashin, steven.sistare, daniel.m.jordan, linux-kernel,
	jeffrey.t.kirsher, intel-wired-lan, netdev, gregkh
In-Reply-To: <20180503035931.22439-1-pasha.tatashin@oracle.com>

When system is rebooted, halted or kexeced device_shutdown() is
called.

This function shuts down every single device by calling either:
	dev->bus->shutdown(dev)
	dev->driver->shutdown(dev)

Even on a machine just with a moderate amount of devices, device_shutdown()
may take multiple seconds to complete. Because many devices require a
specific delays to perform this operation.

Here is sample analysis of time it takes to call device_shutdown() on
two socket Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz machine.

device_shutdown		2.95s
 mlx4_shutdown		1.14s
 megasas_shutdown	0.24s
 ixgbe_shutdown		0.37s x 4 (four ixgbe devices on my machine).
 the rest		0.09s

In mlx4 we spent the most time, but that is because there is a 1 second
sleep:
mlx4_shutdown
 mlx4_unload_one
  mlx4_free_ownership
   msleep(1000)

With megasas we spend quoter of second, but sometimes longer (up-to 0.5s)
in this path:

    megasas_shutdown
      megasas_flush_cache
        megasas_issue_blocked_cmd
          wait_event_timeout

Finally, with ixgbe_shutdown() it takes 0.37 for each device, but that time
is spread all over the place, with bigger offenders:

    ixgbe_shutdown
      __ixgbe_shutdown
        ixgbe_close_suspend
          ixgbe_down
            ixgbe_init_hw_generic
              ixgbe_reset_hw_X540
                msleep(100);                        0.104483472
                ixgbe_get_san_mac_addr_generic      0.048414851
                ixgbe_get_wwn_prefix_generic        0.048409893
              ixgbe_start_hw_X540
                ixgbe_start_hw_generic
                  ixgbe_clear_hw_cntrs_generic      0.048581502
                  ixgbe_setup_fc_generic            0.024225800

    All the ixgbe_*generic functions end-up calling:
    ixgbe_read_eerd_X540()
      ixgbe_acquire_swfw_sync_X540
        usleep_range(5000, 6000);
      ixgbe_release_swfw_sync_X540
        usleep_range(5000, 6000);

While these are short sleeps, they end-up calling them over 24 times!
24 * 0.0055s = 0.132s. Adding-up to 0.528s for four devices.

While we should keep optimizing the individual device drivers, in some
cases this is simply a hardware property that forces a specific delay, and
we must wait.

So, the solution for this problem is to shutdown devices in parallel.
However, we must shutdown children before shutting down parents, so parent
device must wait for its children to finish.

With this patch, on the same machine devices_shutdown() takes 1.142s, and
without mlx4 one second delay only 0.38s

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
---
 drivers/base/core.c | 238 +++++++++++++++++++++++++++++++++++---------
 1 file changed, 189 insertions(+), 49 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..f370369a303b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -25,6 +25,7 @@
 #include <linux/netdevice.h>
 #include <linux/sched/signal.h>
 #include <linux/sysfs.h>
+#include <linux/kthread.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -2102,6 +2103,59 @@ const char *device_get_devnode(struct device *dev,
 	return *tmp = s;
 }
 
+/**
+ * device_children_count - device children count
+ * @parent: parent struct device.
+ *
+ * Returns number of children for this device or 0 if nonde.
+ */
+static int device_children_count(struct device *parent)
+{
+	struct klist_iter i;
+	int children = 0;
+
+	if (!parent->p)
+		return 0;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while (next_device(&i))
+		children++;
+	klist_iter_exit(&i);
+
+	return children;
+}
+
+/**
+ * device_get_child_by_index - Return child using the provide index.
+ * @parent: parent struct device.
+ * @index:  Index of the child, where 0 is the first child in the children list,
+ * and so on.
+ *
+ * Returns child or NULL if child with this index is not present.
+ */
+static struct device *
+device_get_child_by_index(struct device *parent, int index)
+{
+	struct klist_iter i;
+	struct device *dev = NULL, *d;
+	int child_index = 0;
+
+	if (!parent->p || index < 0)
+		return NULL;
+
+	klist_iter_init(&parent->p->klist_children, &i);
+	while ((d = next_device(&i)) != NULL) {
+		if (child_index == index) {
+			dev = d;
+			break;
+		}
+		child_index++;
+	}
+	klist_iter_exit(&i);
+
+	return dev;
+}
+
 /**
  * device_for_each_child - device child iterator.
  * @parent: parent struct device.
@@ -2765,71 +2819,157 @@ int device_move(struct device *dev, struct device *new_parent,
 }
 EXPORT_SYMBOL_GPL(device_move);
 
+/*
+ * device_shutdown_one - call ->shutdown() for the device passed as
+ * argument.
+ */
+static void device_shutdown_one(struct device *dev)
+{
+	/* Don't allow any more runtime suspends */
+	pm_runtime_get_noresume(dev);
+	pm_runtime_barrier(dev);
+
+	if (dev->class && dev->class->shutdown_pre) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown_pre\n");
+		dev->class->shutdown_pre(dev);
+	}
+	if (dev->bus && dev->bus->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->bus->shutdown(dev);
+	} else if (dev->driver && dev->driver->shutdown) {
+		if (initcall_debug)
+			dev_info(dev, "shutdown\n");
+		dev->driver->shutdown(dev);
+	}
+
+	/* Release device lock, and decrement the reference counter */
+	device_unlock(dev);
+	put_device(dev);
+}
+
+static DECLARE_COMPLETION(device_root_tasks_complete);
+static void device_shutdown_tree(struct device *dev);
+static atomic_t device_root_tasks;
+
+/*
+ * Passed as an argument to to device_shutdown_task().
+ * child_next_index	the next available child index.
+ * tasks_running	number of tasks still running. Each tasks decrements it
+ *			when job is finished and the last tasks signals that the
+ *			job is complete.
+ * complete		Used to signal job competition.
+ * parent		Parent device.
+ */
+struct device_shutdown_task_data {
+	atomic_t		child_next_index;
+	atomic_t		tasks_running;
+	struct completion	complete;
+	struct device		*parent;
+};
+
+static int device_shutdown_task(void *data)
+{
+	struct device_shutdown_task_data *tdata =
+		(struct device_shutdown_task_data *)data;
+	int child_idx = atomic_inc_return(&tdata->child_next_index) - 1;
+	struct device *dev = device_get_child_by_index(tdata->parent,
+						       child_idx);
+
+	if (dev)
+		device_shutdown_tree(dev);
+	if (atomic_dec_return(&tdata->tasks_running) == 0)
+		complete(&tdata->complete);
+	return 0;
+}
+
+/*
+ * Shutdown device tree with root started in dev. If dev has no children
+ * simply shutdown only this device. If dev has children recursively shutdown
+ * children first, and only then the parent. For performance reasons children
+ * are shutdown in parallel using kernel threads.
+ */
+static void device_shutdown_tree(struct device *dev)
+{
+	int children_count = device_children_count(dev);
+
+	if (children_count) {
+		struct device_shutdown_task_data tdata;
+		int i;
+
+		init_completion(&tdata.complete);
+		atomic_set(&tdata.child_next_index, 0);
+		atomic_set(&tdata.tasks_running, children_count);
+		tdata.parent = dev;
+
+		for (i = 0; i < children_count; i++) {
+			kthread_run(device_shutdown_task,
+				    &tdata, "device_shutdown.%s",
+				    dev_name(dev));
+		}
+		wait_for_completion(&tdata.complete);
+	}
+	device_shutdown_one(dev);
+}
+
+/*
+ * On shutdown each root device (the one that does not have a parent) goes
+ * through this function.
+ */
+static int
+device_shutdown_root_task(void *data)
+{
+	struct device *dev = (struct device *)data;
+
+	device_shutdown_tree(dev);
+	if (atomic_dec_return(&device_root_tasks) == 0)
+		complete(&device_root_tasks_complete);
+	return 0;
+}
+
 /**
  * device_shutdown - call ->shutdown() on each device to shutdown.
  */
 void device_shutdown(void)
 {
-	struct device *dev, *parent;
+	struct list_head *pos, *next;
+	int root_devices = 0;
+	struct device *dev;
 
 	spin_lock(&devices_kset->list_lock);
 	/*
-	 * Walk the devices list backward, shutting down each in turn.
-	 * Beware that device unplug events may also start pulling
-	 * devices offline, even as the system is shutting down.
+	 * Prepare devices for shutdown: lock, and increment references in every
+	 * devices. Remove child devices from the list, and count number of root
+	 * devices.
 	 */
-	while (!list_empty(&devices_kset->list)) {
-		dev = list_entry(devices_kset->list.prev, struct device,
-				kobj.entry);
+	list_for_each_safe(pos, next, &devices_kset->list) {
+		dev = list_entry(pos, struct device, kobj.entry);
 
-		/*
-		 * hold reference count of device's parent to
-		 * prevent it from being freed because parent's
-		 * lock is to be held
-		 */
-		parent = get_device(dev->parent);
 		get_device(dev);
-		/*
-		 * Make sure the device is off the kset list, in the
-		 * event that dev->*->shutdown() doesn't remove it.
-		 */
-		list_del_init(&dev->kobj.entry);
-		spin_unlock(&devices_kset->list_lock);
-
-		/* hold lock to avoid race with probe/release */
-		if (parent)
-			device_lock(parent);
 		device_lock(dev);
 
-		/* Don't allow any more runtime suspends */
-		pm_runtime_get_noresume(dev);
-		pm_runtime_barrier(dev);
-
-		if (dev->class && dev->class->shutdown_pre) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown_pre\n");
-			dev->class->shutdown_pre(dev);
-		}
-		if (dev->bus && dev->bus->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->bus->shutdown(dev);
-		} else if (dev->driver && dev->driver->shutdown) {
-			if (initcall_debug)
-				dev_info(dev, "shutdown\n");
-			dev->driver->shutdown(dev);
-		}
-
-		device_unlock(dev);
-		if (parent)
-			device_unlock(parent);
-
-		put_device(dev);
-		put_device(parent);
-
+		if (!dev->parent)
+			root_devices++;
+		else
+			list_del_init(&dev->kobj.entry);
+	}
+	atomic_set(&device_root_tasks, root_devices);
+	/*
+	 * Shutdown the root devices in parallel. The children are going to be
+	 * shutdown first.
+	 */
+	list_for_each_safe(pos, next, &devices_kset->list) {
+		dev = list_entry(pos, struct device, kobj.entry);
+		list_del_init(&dev->kobj.entry);
+		spin_unlock(&devices_kset->list_lock);
+		kthread_run(device_shutdown_root_task,
+			    dev, "device_root_shutdown.%s",
+			    dev_name(dev));
 		spin_lock(&devices_kset->list_lock);
 	}
 	spin_unlock(&devices_kset->list_lock);
+	wait_for_completion(&device_root_tasks_complete);
 }
 
 /*
-- 
2.17.0

^ permalink raw reply related

* [PATCH net] macmace: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/apple/macmace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/apple/macmace.c b/drivers/net/ethernet/apple/macmace.c
index 137cbb470af2..98292c49ecf0 100644
--- a/drivers/net/ethernet/apple/macmace.c
+++ b/drivers/net/ethernet/apple/macmace.c
@@ -203,6 +203,10 @@ static int mace_probe(struct platform_device *pdev)
 	unsigned char checksum = 0;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(PRIV_BYTES);
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macmace: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/apple/macmace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/apple/macmace.c b/drivers/net/ethernet/apple/macmace.c
index 137cbb470af2..98292c49ecf0 100644
--- a/drivers/net/ethernet/apple/macmace.c
+++ b/drivers/net/ethernet/apple/macmace.c
@@ -203,6 +203,10 @@ static int mace_probe(struct platform_device *pdev)
 	unsigned char checksum = 0;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(PRIV_BYTES);
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macmace: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/apple/macmace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/apple/macmace.c b/drivers/net/ethernet/apple/macmace.c
index 137cbb470af2..98292c49ecf0 100644
--- a/drivers/net/ethernet/apple/macmace.c
+++ b/drivers/net/ethernet/apple/macmace.c
@@ -203,6 +203,10 @@ static int mace_probe(struct platform_device *pdev)
 	unsigned char checksum = 0;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(PRIV_BYTES);
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macmace: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Tested-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/apple/macmace.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/apple/macmace.c b/drivers/net/ethernet/apple/macmace.c
index 137cbb470af2..98292c49ecf0 100644
--- a/drivers/net/ethernet/apple/macmace.c
+++ b/drivers/net/ethernet/apple/macmace.c
@@ -203,6 +203,10 @@ static int mace_probe(struct platform_device *pdev)
 	unsigned char checksum = 0;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(PRIV_BYTES);
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macsonic: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/natsemi/macsonic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c
index 0937fc2a928e..37b1ffa8bb61 100644
--- a/drivers/net/ethernet/natsemi/macsonic.c
+++ b/drivers/net/ethernet/natsemi/macsonic.c
@@ -523,6 +523,10 @@ static int mac_sonic_platform_probe(struct platform_device *pdev)
 	struct sonic_local *lp;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(sizeof(struct sonic_local));
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macsonic: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/natsemi/macsonic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c
index 0937fc2a928e..37b1ffa8bb61 100644
--- a/drivers/net/ethernet/natsemi/macsonic.c
+++ b/drivers/net/ethernet/natsemi/macsonic.c
@@ -523,6 +523,10 @@ static int mac_sonic_platform_probe(struct platform_device *pdev)
 	struct sonic_local *lp;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(sizeof(struct sonic_local));
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macsonic: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/natsemi/macsonic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c
index 0937fc2a928e..37b1ffa8bb61 100644
--- a/drivers/net/ethernet/natsemi/macsonic.c
+++ b/drivers/net/ethernet/natsemi/macsonic.c
@@ -523,6 +523,10 @@ static int mac_sonic_platform_probe(struct platform_device *pdev)
 	struct sonic_local *lp;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(sizeof(struct sonic_local));
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* [PATCH net] macsonic: Set platform device coherent_dma_mask
From: Finn Thain @ 2018-05-03  4:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-m68k, netdev, linux-kernel

Set the device's coherent_dma_mask to avoid a WARNING splat.
Please see commit 205e1b7f51e4 ("dma-mapping: warn when there is
no coherent_dma_mask").

Cc: linux-m68k@lists.linux-m68k.org
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
---
 drivers/net/ethernet/natsemi/macsonic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/natsemi/macsonic.c b/drivers/net/ethernet/natsemi/macsonic.c
index 0937fc2a928e..37b1ffa8bb61 100644
--- a/drivers/net/ethernet/natsemi/macsonic.c
+++ b/drivers/net/ethernet/natsemi/macsonic.c
@@ -523,6 +523,10 @@ static int mac_sonic_platform_probe(struct platform_device *pdev)
 	struct sonic_local *lp;
 	int err;

+	err = dma_coerce_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
+	if (err)
+		return err;
+
 	dev = alloc_etherdev(sizeof(struct sonic_local));
 	if (!dev)
 		return -ENOMEM;
-- 
2.16.1

^ permalink raw reply related

* Re: [lkp-robot] 486ad79630 [   15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
From: Andrew Morton @ 2018-05-03  4:27 UTC (permalink / raw)
  To: kernel test robot
  Cc: kernel test robot, Linux Memory Management List, Johannes Weiner,
	LKP, David Miller, netdev, Cong Wang
In-Reply-To: <20180503041450.pq2njvkssxtay64o@shao2-debian>


(networking cc's added)

On Thu, 3 May 2018 12:14:50 +0800 kernel test robot <shun.hao@intel.com> wrote:

> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> git://git.cmpxchg.org/linux-mmotm.git master
> 
> commit 486ad79630d0ba0b7205a8db9fe15ba392f5ee32
> Author:     Andrew Morton <akpm@linux-foundation.org>
> AuthorDate: Fri Apr 20 22:00:53 2018 +0000
> Commit:     Johannes Weiner <hannes@cmpxchg.org>
> CommitDate: Fri Apr 20 22:00:53 2018 +0000
> 
>     origin


OK, this got confusing.  origin.patch is the diff between 4.17-rc3 and
current mainline.

>
> [many lines deleted]
>
> [main] Setsockopt(101 c 1b24000 a) on fd 177 [3:5:240]
> [main] Setsockopt(1 2c 1b24000 4) on fd 178 [5:2:0]
> [main] Setsockopt(29 8 1b24000 4) on fd 180 [10:1:0]
> [main] Setsockopt(1 20 1b24000 4) on fd 181 [26:2:125]
> [main] Setsockopt(11 1 1b24000 4) on fd 183 [2:2:17]
> [   15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
> [   15.534143] PGD 800000001734b067 P4D 800000001734b067 PUD 17350067 PMD 0 
> [   15.535516] Oops: 0002 [#1] PTI
> [   15.536165] Modules linked in:
> [   15.536798] CPU: 0 PID: 363 Comm: trinity-main Not tainted 4.17.0-rc1-00001-g486ad79 #2
> [   15.538396] RIP: 0010:llc_ui_release+0x3a/0xd0
> [   15.539293] RSP: 0018:ffffc9000015bd70 EFLAGS: 00010202
> [   15.540345] RAX: 0000000000000001 RBX: ffff88001fa60008 RCX: 0000000000000006
> [   15.541802] RDX: 0000000000000006 RSI: ffff88001fdda660 RDI: ffff88001fa60008
> [   15.543139] RBP: ffffc9000015bd80 R08: 0000000000000000 R09: 0000000000000000
> [   15.544725] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [   15.546287] R13: ffff88001fa61730 R14: ffff88001e130a60 R15: ffff880019bdb3f0
> [   15.547962] FS:  00007f2221bb1700(0000) GS:ffffffff82034000(0000) knlGS:0000000000000000
> [   15.549848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   15.551186] CR2: 0000000000000004 CR3: 000000001734e000 CR4: 00000000000006b0
> [   15.552671] DR0: 0000000002232000 DR1: 0000000000000000 DR2: 0000000000000000
> [   15.554105] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [   15.555534] Call Trace:
> [   15.556049]  sock_release+0x14/0x60
> [   15.556767]  sock_close+0xd/0x20
> [   15.557427]  __fput+0xba/0x1f0
> [   15.558058]  ____fput+0x9/0x10
> [   15.558682]  task_work_run+0x73/0xa0
> [   15.559416]  do_exit+0x231/0xab0
> [   15.560079]  do_group_exit+0x3f/0xc0
> [   15.560810]  __x64_sys_exit_group+0x13/0x20
> [   15.561656]  do_syscall_64+0x58/0x2f0
> [   15.562407]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [   15.563360]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   15.564471] RIP: 0033:0x7f2221696408
> [   15.565264] RSP: 002b:00007ffe5c544c48 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
> [   15.566924] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2221696408
> [   15.568485] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [   15.570046] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffffa0
> [   15.571603] R10: 00007ffe5c5449e0 R11: 0000000000000206 R12: 0000000000000004
> [   15.573160] R13: 00007ffe5c544e30 R14: 0000000000000000 R15: 0000000000000000
> [   15.574720] Code: 7b ff 43 78 0f 88 a5 6f 14 00 31 f6 48 89 df e8 ad 33 fb ff 48 89 df e8 55 94 ff ff 85 c0 0f 84 84 00 00 00 4c 8b a3 d8 04 00 00 <41> ff 44 24 04 0f 88 7f 6f 14 00 48 8b 43 58 f6 c4 01 74 58 48 
> [   15.578679] RIP: llc_ui_release+0x3a/0xd0 RSP: ffffc9000015bd70
> [   15.579874] CR2: 0000000000000004
> [   15.580553] ---[ end trace 0dd8fdc6b7182234 ]---
>

So it's saying that something which got committed into Linus's tree
after 4.17-rc3 has caused a NULL deref in
sock_release->llc_ui_release+0x3a/0xd0

^ permalink raw reply

* [PATCH v2 net-next 0/4] bpfilter
From: Alexei Starovoitov @ 2018-05-03  4:36 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team

Hi All,

v1->v2:
this patch set is almost a full rewrite of the earlier umh modules approach
The v1 of patches and follow up discussion was covered by LWN:
https://lwn.net/Articles/749108/

I believe the v2 addresses all issues brought up by Andy and others.
Mainly there are zero changes to kernel/module.c
Instead of teaching module loading logic to recognize special
umh module, let normal kernel modules execute part of its own
.init.rodata as a new user space process (Andy's idea)
Patch 1 introduces this new helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
Input:
  data + len == executable file
Output:
  struct umh_info {
       struct file *pipe_to_umh;
       struct file *pipe_from_umh;
       pid_t pid;
  };

Advantages vs v1:
- the embedded user mode executable is stored as .init.rodata inside
  normal kernel module. These pages are freed when .ko finishes loading
- the elf file is copied into tmpfs file. The user mode process is swappable.
- the communication between user mode process and 'parent' kernel module
  is done via two unix pipes, hence protocol is not exposed to
  user space
- impossible to launch umh on its own (that was the main issue of v1)
  and impossible to be man-in-the-middle due to pipes
- bpfilter.ko consists of tiny kernel part that passes the data
  between kernel and umh via pipes and much bigger umh part that
  doing all the work
- 'lsmod' shows bpfilter.ko as usual.
  'rmmod bpfilter' removes kernel module and kills corresponding umh
- signed bpfilter.ko covers the whole image including umh code

Few issues:
- architecturally bpfilter.ko can be builtin, but doesn't work yet.
  Still debugging. Kinda cool to have user mode executables
  to be part of vmlinux
- the user can still attach to the process and debug it with
  'gdb /proc/pid/exe pid', but 'gdb -p pid' doesn't work.
  (a bit worse comparing to v1)
- tinyconfig will notice a small increase in .text
  +766 | TEXT | 7c8b94806bec umh: introduce fork_usermode_blob() helper

More details in patches 1 and 2 that are ready to land.
Patches 3 and 4 are still rough. They were mainly used for
testing and to demonstrate how bpfilter is building on top.
The patch 4 approach of converting one iptable rule to few bpf
instructions will certainly change in the future, since it doesn't
scale to thousands of rules.

Alexei Starovoitov (2):
  umh: introduce fork_usermode_blob() helper
  net: add skeleton of bpfilter kernel module

Daniel Borkmann (1):
  bpfilter: rough bpfilter codegen example hack

David S. Miller (1):
  bpfilter: add iptable get/set parsing

 fs/exec.c                     |  38 ++++-
 include/linux/binfmts.h       |   1 +
 include/linux/bpfilter.h      |  15 ++
 include/linux/umh.h           |  12 ++
 include/uapi/linux/bpfilter.h | 200 ++++++++++++++++++++++
 kernel/umh.c                  | 176 +++++++++++++++++++-
 net/Kconfig                   |   2 +
 net/Makefile                  |   1 +
 net/bpfilter/Kconfig          |  17 ++
 net/bpfilter/Makefile         |  24 +++
 net/bpfilter/bpfilter_kern.c  |  93 +++++++++++
 net/bpfilter/bpfilter_mod.h   | 373 ++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/ctor.c           |  91 +++++++++++
 net/bpfilter/gen.c            | 290 ++++++++++++++++++++++++++++++++
 net/bpfilter/init.c           |  36 ++++
 net/bpfilter/main.c           | 117 +++++++++++++
 net/bpfilter/msgfmt.h         |  17 ++
 net/bpfilter/sockopt.c        | 236 ++++++++++++++++++++++++++
 net/bpfilter/tables.c         |  73 +++++++++
 net/bpfilter/targets.c        |  51 ++++++
 net/bpfilter/tgts.c           |  26 +++
 net/ipv4/Makefile             |   2 +
 net/ipv4/bpfilter/Makefile    |   2 +
 net/ipv4/bpfilter/sockopt.c   |  42 +++++
 net/ipv4/ip_sockglue.c        |  17 ++
 25 files changed, 1940 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter_kern.c
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/gen.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/main.c
 create mode 100644 net/bpfilter/msgfmt.h
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

-- 
2.9.5

^ permalink raw reply

* [PATCH v2 net-next 1/4] umh: introduce fork_usermode_blob() helper
From: Alexei Starovoitov @ 2018-05-03  4:36 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>

Introduce helper:
int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
struct umh_info {
       struct file *pipe_to_umh;
       struct file *pipe_from_umh;
       pid_t pid;
};

that GPLed kernel modules (signed or unsigned) can use it to execute part
of its own data as swappable user mode process.

The kernel will do:
- mount "tmpfs"
- allocate a unique file in tmpfs
- populate that file with [data, data + len] bytes
- user-mode-helper code will do_execve that file and, before the process
  starts, the kernel will create two unix pipes for bidirectional
  communication between kernel module and umh
- close tmpfs file, effectively deleting it
- the fork_usermode_blob will return zero on success and populate
  'struct umh_info' with two unix pipes and the pid of the user process

As the first step in the development of the bpfilter project
the fork_usermode_blob() helper is introduced to allow user mode code
to be invoked from a kernel module. The idea is that user mode code plus
normal kernel module code are built as part of the kernel build
and installed as traditional kernel module into distro specified location,
such that from a distribution point of view, there is
no difference between regular kernel modules and kernel modules + umh code.
Such modules can be signed, modprobed, rmmod, etc. The use of this new helper
by a kernel module doesn't make it any special from kernel and user space
tooling point of view.

Such approach enables kernel to delegate functionality traditionally done
by the kernel modules into the user space processes (either root or !root) and
reduces security attack surface of the new code. The buggy umh code would crash
the user process, but not the kernel. Another advantage is that umh code
of the kernel module can be debugged and tested out of user space
(e.g. opening the possibility to run clang sanitizers, fuzzers or
user space test suites on the umh code).
In case of the bpfilter project such architecture allows complex control plane
to be done in the user space while bpf based data plane stays in the kernel.

Since umh can crash, can be oom-ed by the kernel, killed by the admin,
the kernel module that uses them (like bpfilter) needs to manage life
time of umh on its own via two unix pipes and the pid of umh.

The exit code of such kernel module should kill the umh it started,
so that rmmod of the kernel module will cleanup the corresponding umh.
Just like if the kernel module does kmalloc() it should kfree() it in the exit code.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 fs/exec.c               |  38 ++++++++---
 include/linux/binfmts.h |   1 +
 include/linux/umh.h     |  12 ++++
 kernel/umh.c            | 176 +++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 215 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..30a36c2a39bf 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1706,14 +1706,13 @@ static int exec_binprm(struct linux_binprm *bprm)
 /*
  * sys_execve() executes a new program.
  */
-static int do_execveat_common(int fd, struct filename *filename,
-			      struct user_arg_ptr argv,
-			      struct user_arg_ptr envp,
-			      int flags)
+static int __do_execve_file(int fd, struct filename *filename,
+			    struct user_arg_ptr argv,
+			    struct user_arg_ptr envp,
+			    int flags, struct file *file)
 {
 	char *pathbuf = NULL;
 	struct linux_binprm *bprm;
-	struct file *file;
 	struct files_struct *displaced;
 	int retval;
 
@@ -1752,7 +1751,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	check_unsafe_exec(bprm);
 	current->in_execve = 1;
 
-	file = do_open_execat(fd, filename, flags);
+	if (!file)
+		file = do_open_execat(fd, filename, flags);
 	retval = PTR_ERR(file);
 	if (IS_ERR(file))
 		goto out_unmark;
@@ -1760,7 +1760,9 @@ static int do_execveat_common(int fd, struct filename *filename,
 	sched_exec();
 
 	bprm->file = file;
-	if (fd == AT_FDCWD || filename->name[0] == '/') {
+	if (!filename) {
+		bprm->filename = "none";
+	} else if (fd == AT_FDCWD || filename->name[0] == '/') {
 		bprm->filename = filename->name;
 	} else {
 		if (filename->name[0] == '\0')
@@ -1826,7 +1828,8 @@ static int do_execveat_common(int fd, struct filename *filename,
 	task_numa_free(current);
 	free_bprm(bprm);
 	kfree(pathbuf);
-	putname(filename);
+	if (filename)
+		putname(filename);
 	if (displaced)
 		put_files_struct(displaced);
 	return retval;
@@ -1849,10 +1852,27 @@ static int do_execveat_common(int fd, struct filename *filename,
 	if (displaced)
 		reset_files_struct(displaced);
 out_ret:
-	putname(filename);
+	if (filename)
+		putname(filename);
 	return retval;
 }
 
+static int do_execveat_common(int fd, struct filename *filename,
+			      struct user_arg_ptr argv,
+			      struct user_arg_ptr envp,
+			      int flags)
+{
+	return __do_execve_file(fd, filename, argv, envp, flags, NULL);
+}
+
+int do_execve_file(struct file *file, void *__argv, void *__envp)
+{
+	struct user_arg_ptr argv = { .ptr.native = __argv };
+	struct user_arg_ptr envp = { .ptr.native = __envp };
+
+	return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
+}
+
 int do_execve(struct filename *filename,
 	const char __user *const __user *__argv,
 	const char __user *const __user *__envp)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 4955e0863b83..c05f24fac4f6 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -150,5 +150,6 @@ extern int do_execveat(int, struct filename *,
 		       const char __user * const __user *,
 		       const char __user * const __user *,
 		       int);
+int do_execve_file(struct file *file, void *__argv, void *__envp);
 
 #endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/umh.h b/include/linux/umh.h
index 244aff638220..5c812acbb80a 100644
--- a/include/linux/umh.h
+++ b/include/linux/umh.h
@@ -22,8 +22,10 @@ struct subprocess_info {
 	const char *path;
 	char **argv;
 	char **envp;
+	struct file *file;
 	int wait;
 	int retval;
+	pid_t pid;
 	int (*init)(struct subprocess_info *info, struct cred *new);
 	void (*cleanup)(struct subprocess_info *info);
 	void *data;
@@ -38,6 +40,16 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp,
 			  int (*init)(struct subprocess_info *info, struct cred *new),
 			  void (*cleanup)(struct subprocess_info *), void *data);
 
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
+			  int (*init)(struct subprocess_info *info, struct cred *new),
+			  void (*cleanup)(struct subprocess_info *), void *data);
+struct umh_info {
+	struct file *pipe_to_umh;
+	struct file *pipe_from_umh;
+	pid_t pid;
+};
+int fork_usermode_blob(void *data, size_t len, struct umh_info *info);
+
 extern int
 call_usermodehelper_exec(struct subprocess_info *info, int wait);
 
diff --git a/kernel/umh.c b/kernel/umh.c
index f76b3ff876cf..c3f418d7d51a 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -25,6 +25,8 @@
 #include <linux/ptrace.h>
 #include <linux/async.h>
 #include <linux/uaccess.h>
+#include <linux/shmem_fs.h>
+#include <linux/pipe_fs_i.h>
 
 #include <trace/events/module.h>
 
@@ -97,9 +99,13 @@ static int call_usermodehelper_exec_async(void *data)
 
 	commit_creds(new);
 
-	retval = do_execve(getname_kernel(sub_info->path),
-			   (const char __user *const __user *)sub_info->argv,
-			   (const char __user *const __user *)sub_info->envp);
+	if (sub_info->file)
+		retval = do_execve_file(sub_info->file,
+					sub_info->argv, sub_info->envp);
+	else
+		retval = do_execve(getname_kernel(sub_info->path),
+				   (const char __user *const __user *)sub_info->argv,
+				   (const char __user *const __user *)sub_info->envp);
 out:
 	sub_info->retval = retval;
 	/*
@@ -185,6 +191,8 @@ static void call_usermodehelper_exec_work(struct work_struct *work)
 		if (pid < 0) {
 			sub_info->retval = pid;
 			umh_complete(sub_info);
+		} else {
+			sub_info->pid = pid;
 		}
 	}
 }
@@ -393,6 +401,168 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
 }
 EXPORT_SYMBOL(call_usermodehelper_setup);
 
+struct subprocess_info *call_usermodehelper_setup_file(struct file *file,
+		int (*init)(struct subprocess_info *info, struct cred *new),
+		void (*cleanup)(struct subprocess_info *info), void *data)
+{
+	struct subprocess_info *sub_info;
+
+	sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
+	if (!sub_info)
+		return NULL;
+
+	INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
+	sub_info->path = "none";
+	sub_info->file = file;
+	sub_info->init = init;
+	sub_info->cleanup = cleanup;
+	sub_info->data = data;
+	return sub_info;
+}
+
+static struct vfsmount *umh_fs;
+
+static int init_tmpfs(void)
+{
+	struct file_system_type *type;
+
+	if (umh_fs)
+		return 0;
+	type = get_fs_type("tmpfs");
+	if (!type)
+		return -ENODEV;
+	umh_fs = kern_mount(type);
+	if (IS_ERR(umh_fs)) {
+		int err = PTR_ERR(umh_fs);
+
+		put_filesystem(type);
+		umh_fs = NULL;
+		return err;
+	}
+	return 0;
+}
+
+static int alloc_tmpfs_file(size_t size, struct file **filp)
+{
+	struct file *file;
+	int err;
+
+	err = init_tmpfs();
+	if (err)
+		return err;
+	file = shmem_file_setup_with_mnt(umh_fs, "umh", size, VM_NORESERVE);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+	*filp = file;
+	return 0;
+}
+
+static int populate_file(struct file *file, const void *data, size_t size)
+{
+	size_t offset = 0;
+	int err;
+
+	do {
+		unsigned int len = min_t(typeof(size), size, PAGE_SIZE);
+		struct page *page;
+		void *pgdata, *vaddr;
+
+		err = pagecache_write_begin(file, file->f_mapping, offset, len,
+					    0, &page, &pgdata);
+		if (err < 0)
+			goto fail;
+
+		vaddr = kmap(page);
+		memcpy(vaddr, data, len);
+		kunmap(page);
+
+		err = pagecache_write_end(file, file->f_mapping, offset, len,
+					  len, page, pgdata);
+		if (err < 0)
+			goto fail;
+
+		size -= len;
+		data += len;
+		offset += len;
+	} while (size);
+	return 0;
+fail:
+	return err;
+}
+
+static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
+{
+	struct umh_info *umh_info = info->data;
+	struct file *from_umh[2];
+	struct file *to_umh[2];
+	int err;
+
+	/* create pipe to send data to umh */
+	err = create_pipe_files(to_umh, 0);
+	if (err)
+		return err;
+	err = replace_fd(0, to_umh[0], 0);
+	fput(to_umh[0]);
+	if (err < 0) {
+		fput(to_umh[1]);
+		return err;
+	}
+
+	/* create pipe to receive data from umh */
+	err = create_pipe_files(from_umh, 0);
+	if (err) {
+		fput(to_umh[1]);
+		replace_fd(0, NULL, 0);
+		return err;
+	}
+	err = replace_fd(1, from_umh[1], 0);
+	fput(from_umh[1]);
+	if (err < 0) {
+		fput(to_umh[1]);
+		replace_fd(0, NULL, 0);
+		fput(from_umh[0]);
+		return err;
+	}
+
+	umh_info->pipe_to_umh = to_umh[1];
+	umh_info->pipe_from_umh = from_umh[0];
+	return 0;
+}
+
+static void umh_save_pid(struct subprocess_info *info)
+{
+	struct umh_info *umh_info = info->data;
+
+	umh_info->pid = info->pid;
+}
+
+int fork_usermode_blob(void *data, size_t len, struct umh_info *info)
+{
+	struct subprocess_info *sub_info;
+	struct file *file = NULL;
+	int err;
+
+	err = alloc_tmpfs_file(len, &file);
+	if (err)
+		return err;
+
+	err = populate_file(file, data, len);
+	if (err)
+		goto out;
+
+	err = -ENOMEM;
+	sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup,
+						  umh_save_pid, info);
+	if (!sub_info)
+		goto out;
+
+	err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+out:
+	fput(file);
+	return err;
+}
+EXPORT_SYMBOL_GPL(fork_usermode_blob);
+
 /**
  * call_usermodehelper_exec - start a usermode application
  * @sub_info: information about the subprocessa
-- 
2.9.5

^ permalink raw reply related

* [PATCH v2 net-next 2/4] net: add skeleton of bpfilter kernel module
From: Alexei Starovoitov @ 2018-05-03  4:36 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>

bpfilter.ko consists of bpfilter_kern.c (normal kernel module code)
and user mode helper code that is embedded into bpfilter.ko

The steps to build bpfilter.ko are the following:
- main.c is compiled by HOSTCC into the bpfilter_umh elf executable file
- with quite a bit of objcopy and Makefile magic the bpfilter_umh elf file
  is converted into bpfilter_umh.o object file
  with _binary_net_bpfilter_bpfilter_umh_start and _end symbols
  Example:
  $ nm ./bld_x64/net/bpfilter/bpfilter_umh.o
  0000000000004cf8 T _binary_net_bpfilter_bpfilter_umh_end
  0000000000004cf8 A _binary_net_bpfilter_bpfilter_umh_size
  0000000000000000 T _binary_net_bpfilter_bpfilter_umh_start
- bpfilter_umh.o and bpfilter_kern.o are linked together into bpfilter.ko

bpfilter_kern.c is a normal kernel module code that calls
the fork_usermode_blob() helper to execute part of its own data
as a user mode process.

Notice that _binary_net_bpfilter_bpfilter_umh_start - end
is placed into .init.rodata section, so it's freed as soon as __init
function of bpfilter.ko is finished.
As part of __init the bpfilter.ko does first request/reply action
via two unix pipe provided by fork_usermode_blob() helper to
make sure that umh is healthy. If not it will kill it via pid.

Later bpfilter_process_sockopt() will be called from bpfilter hooks
in get/setsockopt() to pass iptable commands into umh via bpfilter.ko

If admin does 'rmmod bpfilter' the __exit code bpfilter.ko will
kill umh as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpfilter.h      | 15 +++++++
 include/uapi/linux/bpfilter.h | 21 ++++++++++
 net/Kconfig                   |  2 +
 net/Makefile                  |  1 +
 net/bpfilter/Kconfig          | 17 ++++++++
 net/bpfilter/Makefile         | 24 +++++++++++
 net/bpfilter/bpfilter_kern.c  | 93 +++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/main.c           | 63 +++++++++++++++++++++++++++++
 net/bpfilter/msgfmt.h         | 17 ++++++++
 net/ipv4/Makefile             |  2 +
 net/ipv4/bpfilter/Makefile    |  2 +
 net/ipv4/bpfilter/sockopt.c   | 42 +++++++++++++++++++
 net/ipv4/ip_sockglue.c        | 17 ++++++++
 13 files changed, 316 insertions(+)
 create mode 100644 include/linux/bpfilter.h
 create mode 100644 include/uapi/linux/bpfilter.h
 create mode 100644 net/bpfilter/Kconfig
 create mode 100644 net/bpfilter/Makefile
 create mode 100644 net/bpfilter/bpfilter_kern.c
 create mode 100644 net/bpfilter/main.c
 create mode 100644 net/bpfilter/msgfmt.h
 create mode 100644 net/ipv4/bpfilter/Makefile
 create mode 100644 net/ipv4/bpfilter/sockopt.c

diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h
new file mode 100644
index 000000000000..687b1760bb9f
--- /dev/null
+++ b/include/linux/bpfilter.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_H
+#define _LINUX_BPFILTER_H
+
+#include <uapi/linux/bpfilter.h>
+
+struct sock;
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char *optval,
+			    unsigned int optlen);
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char *optval,
+			    int *optlen);
+extern int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+				       char __user *optval,
+				       unsigned int optlen, bool is_set);
+#endif
diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
new file mode 100644
index 000000000000..2ec3cc99ea4c
--- /dev/null
+++ b/include/uapi/linux/bpfilter.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BPFILTER_H
+#define _UAPI_LINUX_BPFILTER_H
+
+#include <linux/if.h>
+
+enum {
+	BPFILTER_IPT_SO_SET_REPLACE = 64,
+	BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65,
+	BPFILTER_IPT_SET_MAX,
+};
+
+enum {
+	BPFILTER_IPT_SO_GET_INFO = 64,
+	BPFILTER_IPT_SO_GET_ENTRIES = 65,
+	BPFILTER_IPT_SO_GET_REVISION_MATCH = 66,
+	BPFILTER_IPT_SO_GET_REVISION_TARGET = 67,
+	BPFILTER_IPT_GET_MAX,
+};
+
+#endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/Kconfig b/net/Kconfig
index b62089fb1332..ed6368b306fa 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -201,6 +201,8 @@ source "net/bridge/netfilter/Kconfig"
 
 endif
 
+source "net/bpfilter/Kconfig"
+
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
 source "net/rds/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..7f982b7682bd 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_TLS)		+= tls/
 obj-$(CONFIG_XFRM)		+= xfrm/
 obj-$(CONFIG_UNIX)		+= unix/
 obj-$(CONFIG_NET)		+= ipv6/
+obj-$(CONFIG_BPFILTER)		+= bpfilter/
 obj-$(CONFIG_PACKET)		+= packet/
 obj-$(CONFIG_NET_KEY)		+= key/
 obj-$(CONFIG_BRIDGE)		+= bridge/
diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
new file mode 100644
index 000000000000..782a732b9a5c
--- /dev/null
+++ b/net/bpfilter/Kconfig
@@ -0,0 +1,17 @@
+menuconfig BPFILTER
+	bool "BPF based packet filtering framework (BPFILTER)"
+	default n
+	depends on NET && BPF
+	help
+	  This builds experimental bpfilter framework that is aiming to
+	  provide netfilter compatible functionality via BPF
+
+if BPFILTER
+config BPFILTER_UMH
+	tristate "bpftiler kernel module with user mode helper"
+	default m
+	depends on m
+	help
+	  This builds bpfilter kernel module with embedded user mode helper
+endif
+
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
new file mode 100644
index 000000000000..897eedae523e
--- /dev/null
+++ b/net/bpfilter/Makefile
@@ -0,0 +1,24 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the Linux BPFILTER layer.
+#
+
+hostprogs-y := bpfilter_umh
+bpfilter_umh-objs := main.o
+HOSTCFLAGS += -I. -Itools/include/
+
+# a bit of elf magic to convert bpfilter_umh binary into a binary blob
+# inside bpfilter_umh.o elf file referenced by
+# _binary_net_bpfilter_bpfilter_umh_start symbol
+# which bpfilter_kern.c passes further into umh blob loader at run-time
+quiet_cmd_copy_umh = GEN $@
+      cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
+      $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+      -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
+      --rename-section .data=.init.rodata $< $@
+
+$(obj)/bpfilter_umh.o: $(obj)/bpfilter_umh
+	$(call cmd,copy_umh)
+
+obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o
+bpfilter-objs += bpfilter_kern.o bpfilter_umh.o
diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c
new file mode 100644
index 000000000000..e0a6fdd5842b
--- /dev/null
+++ b/net/bpfilter/bpfilter_kern.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/umh.h>
+#include <linux/bpfilter.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include "msgfmt.h"
+
+#define UMH_start _binary_net_bpfilter_bpfilter_umh_start
+#define UMH_end _binary_net_bpfilter_bpfilter_umh_end
+
+extern char UMH_start;
+extern char UMH_end;
+
+static struct umh_info info;
+
+static void shutdown_umh(struct umh_info *info)
+{
+	struct task_struct *tsk;
+
+	tsk = pid_task(find_vpid(info->pid), PIDTYPE_PID);
+	if (tsk)
+		force_sig(SIGKILL, tsk);
+	fput(info->pipe_to_umh);
+	fput(info->pipe_from_umh);
+}
+
+static void stop_umh(void)
+{
+	if (bpfilter_process_sockopt) {
+		bpfilter_process_sockopt = NULL;
+		shutdown_umh(&info);
+	}
+}
+
+static int __bpfilter_process_sockopt(struct sock *sk, int optname,
+				      char __user *optval,
+				      unsigned int optlen, bool is_set)
+{
+	struct mbox_request req;
+	struct mbox_reply reply;
+	loff_t pos;
+	ssize_t n;
+
+	req.is_set = is_set;
+	req.pid = current->pid;
+	req.cmd = optname;
+	req.addr = (long)optval;
+	req.len = optlen;
+	n = __kernel_write(info.pipe_to_umh, &req, sizeof(req), &pos);
+	if (n != sizeof(req)) {
+		pr_err("write fail %zd\n", n);
+		stop_umh();
+		return -EFAULT;
+	}
+	pos = 0;
+	n = kernel_read(info.pipe_from_umh, &reply, sizeof(reply), &pos);
+	if (n != sizeof(reply)) {
+		pr_err("read fail %zd\n", n);
+		stop_umh();
+		return -EFAULT;
+	}
+	return reply.status;
+}
+
+static int __init load_umh(void)
+{
+	int err;
+
+	err = fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+	if (err)
+		return err;
+	pr_info("Loaded umh pid %d\n", info.pid);
+	bpfilter_process_sockopt = &__bpfilter_process_sockopt;
+
+	if (__bpfilter_process_sockopt(NULL, 0, 0, 0, 0) != 0) {
+		stop_umh();
+		return -EFAULT;
+	}
+	return 0;
+}
+
+static void __exit fini_umh(void)
+{
+	stop_umh();
+}
+module_init(load_umh);
+module_exit(fini_umh);
+MODULE_LICENSE("GPL");
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
new file mode 100644
index 000000000000..81bbc1684896
--- /dev/null
+++ b/net/bpfilter/main.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <errno.h>
+#include <stdio.h>
+#include <sys/socket.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include "include/uapi/linux/bpf.h"
+#include <asm/unistd.h>
+#include "msgfmt.h"
+
+int debug_fd;
+
+static int handle_get_cmd(struct mbox_request *cmd)
+{
+	switch (cmd->cmd) {
+	case 0:
+		return 0;
+	default:
+		break;
+	}
+	return -ENOPROTOOPT;
+}
+
+static int handle_set_cmd(struct mbox_request *cmd)
+{
+	return -ENOPROTOOPT;
+}
+
+static void loop(void)
+{
+	while (1) {
+		struct mbox_request req;
+		struct mbox_reply reply;
+		int n;
+
+		n = read(0, &req, sizeof(req));
+		if (n != sizeof(req)) {
+			dprintf(debug_fd, "invalid request %d\n", n);
+			return;
+		}
+
+		reply.status = req.is_set ?
+			handle_set_cmd(&req) :
+			handle_get_cmd(&req);
+
+		n = write(1, &reply, sizeof(reply));
+		if (n != sizeof(reply)) {
+			dprintf(debug_fd, "reply failed %d\n", n);
+			return;
+		}
+	}
+}
+
+int main(void)
+{
+	debug_fd = open("/dev/console", 00000002 | 00000100);
+	dprintf(debug_fd, "Started bpfilter\n");
+	loop();
+	close(debug_fd);
+	return 0;
+}
diff --git a/net/bpfilter/msgfmt.h b/net/bpfilter/msgfmt.h
new file mode 100644
index 000000000000..94b9ac9e5114
--- /dev/null
+++ b/net/bpfilter/msgfmt.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_BPFILTER_MSGFMT_H
+#define _NET_BPFTILER_MSGFMT_H
+
+struct mbox_request {
+	__u64 addr;
+	__u32 len;
+	__u32 is_set;
+	__u32 cmd;
+	__u32 pid;
+};
+
+struct mbox_reply {
+	__u32 status;
+};
+
+#endif
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index b379520f9133..7018f91c5a39 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -16,6 +16,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
 	     metrics.o
 
+obj-$(CONFIG_BPFILTER) += bpfilter/
+
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
 obj-$(CONFIG_PROC_FS) += proc.o
diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile
new file mode 100644
index 000000000000..ce262d76cc48
--- /dev/null
+++ b/net/ipv4/bpfilter/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_BPFILTER) += sockopt.o
+
diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c
new file mode 100644
index 000000000000..42a96d2d8d05
--- /dev/null
+++ b/net/ipv4/bpfilter/sockopt.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/uaccess.h>
+#include <linux/bpfilter.h>
+#include <uapi/linux/bpf.h>
+#include <linux/wait.h>
+#include <linux/kmod.h>
+
+int (*bpfilter_process_sockopt)(struct sock *sk, int optname,
+				char __user *optval,
+				unsigned int optlen, bool is_set);
+EXPORT_SYMBOL_GPL(bpfilter_process_sockopt);
+
+int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval,
+			  unsigned int optlen, bool is_set)
+{
+	if (!bpfilter_process_sockopt) {
+		int err = request_module("bpfilter");
+
+		if (err)
+			return err;
+		if (!bpfilter_process_sockopt)
+			return -ECHILD;
+	}
+	return bpfilter_process_sockopt(sk, optname, optval, optlen, is_set);
+}
+
+int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval,
+			    unsigned int optlen)
+{
+	return bpfilter_mbox_request(sk, optname, optval, optlen, true);
+}
+
+int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval,
+			    int __user *optlen)
+{
+	int len;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+
+	return bpfilter_mbox_request(sk, optname, optval, len, false);
+}
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 5ad2d8ed3a3f..e0791faacb24 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -47,6 +47,8 @@
 #include <linux/errqueue.h>
 #include <linux/uaccess.h>
 
+#include <linux/bpfilter.h>
+
 /*
  *	SOL_IP control messages.
  */
@@ -1244,6 +1246,11 @@ int ip_setsockopt(struct sock *sk, int level,
 		return -ENOPROTOOPT;
 
 	err = do_ip_setsockopt(sk, level, optname, optval, optlen);
+#ifdef CONFIG_BPFILTER
+	if (optname >= BPFILTER_IPT_SO_SET_REPLACE &&
+	    optname < BPFILTER_IPT_SET_MAX)
+		err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen);
+#endif
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IP_HDRINCL &&
@@ -1552,6 +1559,11 @@ int ip_getsockopt(struct sock *sk, int level,
 	int err;
 
 	err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0);
+#ifdef CONFIG_BPFILTER
+	if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+	    optname < BPFILTER_IPT_GET_MAX)
+		err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
@@ -1584,6 +1596,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname,
 	err = do_ip_getsockopt(sk, level, optname, optval, optlen,
 		MSG_CMSG_COMPAT);
 
+#ifdef CONFIG_BPFILTER
+	if (optname >= BPFILTER_IPT_SO_GET_INFO &&
+	    optname < BPFILTER_IPT_GET_MAX)
+		err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen);
+#endif
 #ifdef CONFIG_NETFILTER
 	/* we need to exclude all possible ENOPROTOOPTs except default case */
 	if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS &&
-- 
2.9.5

^ permalink raw reply related

* [PATCH RFC v2 net-next 3/4] bpfilter: add iptable get/set parsing
From: Alexei Starovoitov @ 2018-05-03  4:36 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>

From: "David S. Miller" <davem@davemloft.net>

parse iptable binary blobs into bpfilter internal data structures
bpfilter.ko only passing the [gs]etsockopt commands from kernel to umh
All parsing is done inside umh

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/uapi/linux/bpfilter.h | 179 ++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/Makefile         |   2 +-
 net/bpfilter/bpfilter_mod.h   |  96 ++++++++++++++++++++++
 net/bpfilter/ctor.c           |  80 +++++++++++++++++++
 net/bpfilter/init.c           |  33 ++++++++
 net/bpfilter/main.c           |  51 ++++++++++++
 net/bpfilter/sockopt.c        | 153 ++++++++++++++++++++++++++++++++++++
 net/bpfilter/tables.c         |  70 +++++++++++++++++
 net/bpfilter/targets.c        |  51 ++++++++++++
 net/bpfilter/tgts.c           |  25 ++++++
 10 files changed, 739 insertions(+), 1 deletion(-)
 create mode 100644 net/bpfilter/bpfilter_mod.h
 create mode 100644 net/bpfilter/ctor.c
 create mode 100644 net/bpfilter/init.c
 create mode 100644 net/bpfilter/sockopt.c
 create mode 100644 net/bpfilter/tables.c
 create mode 100644 net/bpfilter/targets.c
 create mode 100644 net/bpfilter/tgts.c

diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h
index 2ec3cc99ea4c..38d54e9947a1 100644
--- a/include/uapi/linux/bpfilter.h
+++ b/include/uapi/linux/bpfilter.h
@@ -18,4 +18,183 @@ enum {
 	BPFILTER_IPT_GET_MAX,
 };
 
+enum {
+	BPFILTER_XT_TABLE_MAXNAMELEN = 32,
+};
+
+enum {
+	BPFILTER_NF_DROP = 0,
+	BPFILTER_NF_ACCEPT = 1,
+	BPFILTER_NF_STOLEN = 2,
+	BPFILTER_NF_QUEUE = 3,
+	BPFILTER_NF_REPEAT = 4,
+	BPFILTER_NF_STOP = 5,
+	BPFILTER_NF_MAX_VERDICT = BPFILTER_NF_STOP,
+};
+
+enum {
+	BPFILTER_INET_HOOK_PRE_ROUTING	= 0,
+	BPFILTER_INET_HOOK_LOCAL_IN	= 1,
+	BPFILTER_INET_HOOK_FORWARD	= 2,
+	BPFILTER_INET_HOOK_LOCAL_OUT	= 3,
+	BPFILTER_INET_HOOK_POST_ROUTING	= 4,
+	BPFILTER_INET_HOOK_MAX,
+};
+
+enum {
+	BPFILTER_PROTO_UNSPEC	= 0,
+	BPFILTER_PROTO_INET	= 1,
+	BPFILTER_PROTO_IPV4	= 2,
+	BPFILTER_PROTO_ARP	= 3,
+	BPFILTER_PROTO_NETDEV	= 5,
+	BPFILTER_PROTO_BRIDGE	= 7,
+	BPFILTER_PROTO_IPV6	= 10,
+	BPFILTER_PROTO_DECNET	= 12,
+	BPFILTER_PROTO_NUMPROTO,
+};
+
+#ifndef INT_MAX
+#define INT_MAX		((int)(~0U>>1))
+#endif
+#ifndef INT_MIN
+#define INT_MIN         (-INT_MAX - 1)
+#endif
+
+enum {
+	BPFILTER_IP_PRI_FIRST			= INT_MIN,
+	BPFILTER_IP_PRI_CONNTRACK_DEFRAG	= -400,
+	BPFILTER_IP_PRI_RAW			= -300,
+	BPFILTER_IP_PRI_SELINUX_FIRST		= -225,
+	BPFILTER_IP_PRI_CONNTRACK		= -200,
+	BPFILTER_IP_PRI_MANGLE			= -150,
+	BPFILTER_IP_PRI_NAT_DST			= -100,
+	BPFILTER_IP_PRI_FILTER			= 0,
+	BPFILTER_IP_PRI_SECURITY		= 50,
+	BPFILTER_IP_PRI_NAT_SRC			= 100,
+	BPFILTER_IP_PRI_SELINUX_LAST		= 225,
+	BPFILTER_IP_PRI_CONNTRACK_HELPER	= 300,
+	BPFILTER_IP_PRI_CONNTRACK_CONFIRM	= INT_MAX,
+	BPFILTER_IP_PRI_LAST			= INT_MAX,
+};
+
+#define BPFILTER_FUNCTION_MAXNAMELEN	30
+#define BPFILTER_EXTENSION_MAXNAMELEN	29
+#define BPFILTER_TABLE_MAXNAMELEN	32
+
+struct bpfilter_match;
+struct bpfilter_entry_match {
+	union {
+		struct {
+			__u16		match_size;
+			char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+			__u8		revision;
+		} user;
+		struct {
+			__u16			match_size;
+			struct bpfilter_match	*match;
+		} kernel;
+		__u16		match_size;
+	} u;
+	unsigned char	data[0];
+};
+
+struct bpfilter_target;
+struct bpfilter_entry_target {
+	union {
+		struct {
+			__u16		target_size;
+			char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+			__u8		revision;
+		} user;
+		struct {
+			__u16			target_size;
+			struct bpfilter_target	*target;
+		} kernel;
+		__u16		target_size;
+	} u;
+	unsigned char	data[0];
+};
+
+struct bpfilter_standard_target {
+	struct bpfilter_entry_target	target;
+	int				verdict;
+};
+
+struct bpfilter_error_target {
+	struct bpfilter_entry_target	target;
+	char				error_name[BPFILTER_FUNCTION_MAXNAMELEN];
+};
+
+#define __ALIGN_KERNEL(x, a)            __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
+#define __ALIGN_KERNEL_MASK(x, mask)    (((x) + (mask)) & ~(mask))
+
+#define BPFILTER_ALIGN(__X)	\
+	__ALIGN_KERNEL(__X, __alignof__(__u64))
+
+#define BPFILTER_TARGET_INIT(__name, __size)			\
+{								\
+	.target.u.user = {					\
+		.target_size	= BPFILTER_ALIGN(__size),	\
+		.name		= (__name),			\
+	},							\
+}
+#define BPFILTER_STANDARD_TARGET	""
+#define BPFILTER_ERROR_TARGET		"ERROR"
+
+struct bpfilter_xt_counters {
+	__u64	packet_cnt;
+	__u64	byte_cnt;
+};
+
+struct bpfilter_ipt_ip {
+	__u32	src;
+	__u32	dst;
+	__u32	src_mask;
+	__u32	dst_mask;
+	char	in_iface[IFNAMSIZ];
+	char	out_iface[IFNAMSIZ];
+	__u8	in_iface_mask[IFNAMSIZ];
+	__u8	out_iface_mask[IFNAMSIZ];
+	__u16	protocol;
+	__u8	flags;
+	__u8	inv_flags;
+};
+
+struct bpfilter_ipt_entry {
+	struct bpfilter_ipt_ip		ip;
+	__u32				bfcache;
+	__u16				target_offset;
+	__u16				next_offset;
+	__u32				camefrom;
+	struct bpfilter_xt_counters	cntrs;
+	__u8				elems[0];
+};
+
+struct bpfilter_ipt_get_info {
+	char				name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	__u32				valid_hooks;
+	__u32				hook_entry[BPFILTER_INET_HOOK_MAX];
+	__u32				underflow[BPFILTER_INET_HOOK_MAX];
+	__u32				num_entries;
+	__u32				size;
+};
+
+struct bpfilter_ipt_get_entries {
+	char				name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	__u32				size;
+	struct bpfilter_ipt_entry	entries[0];
+};
+
+struct bpfilter_ipt_replace {
+	char				name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	__u32				valid_hooks;
+	__u32				num_entries;
+	__u32				size;
+	__u32				hook_entry[BPFILTER_INET_HOOK_MAX];
+	__u32				underflow[BPFILTER_INET_HOOK_MAX];
+	__u32				num_counters;
+	struct bpfilter_xt_counters	*cntrs;
+	struct bpfilter_ipt_entry	entries[0];
+};
+
 #endif /* _UAPI_LINUX_BPFILTER_H */
diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 897eedae523e..bec6181de995 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
 #
 
 hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
 HOSTCFLAGS += -I. -Itools/include/
 
 # a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
new file mode 100644
index 000000000000..f0de41b20793
--- /dev/null
+++ b/net/bpfilter/bpfilter_mod.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_BPFILTER_INTERNAL_H
+#define _LINUX_BPFILTER_INTERNAL_H
+
+#include "include/uapi/linux/bpfilter.h"
+#include <linux/list.h>
+
+struct bpfilter_table {
+	struct hlist_node	hash;
+	u32			valid_hooks;
+	struct			bpfilter_table_info *info;
+	int			hold;
+	u8			family;
+	int			priority;
+	const char		name[BPFILTER_XT_TABLE_MAXNAMELEN];
+};
+
+struct bpfilter_table_info {
+	unsigned int		size;
+	u32			num_entries;
+	unsigned int		initial_entries;
+	unsigned int		hook_entry[BPFILTER_INET_HOOK_MAX];
+	unsigned int		underflow[BPFILTER_INET_HOOK_MAX];
+	unsigned int		stacksize;
+	void			***jumpstack;
+	unsigned char		entries[0] __aligned(8);
+};
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len);
+void bpfilter_table_put(struct bpfilter_table *tbl);
+int bpfilter_table_add(struct bpfilter_table *tbl);
+
+struct bpfilter_ipt_standard {
+	struct bpfilter_ipt_entry	entry;
+	struct bpfilter_standard_target	target;
+};
+
+struct bpfilter_ipt_error {
+	struct bpfilter_ipt_entry	entry;
+	struct bpfilter_error_target	target;
+};
+
+#define BPFILTER_IPT_ENTRY_INIT(__sz) 				\
+{								\
+	.target_offset = sizeof(struct bpfilter_ipt_entry),	\
+	.next_offset = (__sz),					\
+}
+
+#define BPFILTER_IPT_STANDARD_INIT(__verdict) 					\
+{										\
+	.entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_standard)),	\
+	.target = BPFILTER_TARGET_INIT(BPFILTER_STANDARD_TARGET,		\
+				       sizeof(struct bpfilter_standard_target)),\
+	.target.verdict = -(__verdict) - 1,					\
+}
+
+#define BPFILTER_IPT_ERROR_INIT							\
+{										\
+	.entry = BPFILTER_IPT_ENTRY_INIT(sizeof(struct bpfilter_ipt_error)),	\
+	.target = BPFILTER_TARGET_INIT(BPFILTER_ERROR_TARGET,			\
+				       sizeof(struct bpfilter_error_target)),	\
+	.target.error_name = "ERROR",						\
+}
+
+struct bpfilter_target {
+	struct list_head	all_target_list;
+	const char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+	unsigned int		size;
+	int			hold;
+	u16			family;
+	u8			rev;
+};
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
+void bpfilter_target_put(struct bpfilter_target *tgt);
+int bpfilter_target_add(struct bpfilter_target *tgt);
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+int bpfilter_ipv4_register_targets(void);
+void bpfilter_tables_init(void);
+int bpfilter_get_info(void *addr, int len);
+int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_ipv4_init(void);
+
+int copy_from_user(void *dst, void *addr, int len);
+int copy_to_user(void *addr, const void *src, int len);
+#define put_user(x, ptr) \
+({ \
+	__typeof__(*(ptr)) __x = (x); \
+	copy_to_user(ptr, &__x, sizeof(*(ptr))); \
+})
+extern int pid;
+extern int debug_fd;
+#define ENOTSUPP        524
+
+#endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
new file mode 100644
index 000000000000..efb7feef3c42
--- /dev/null
+++ b/net/bpfilter/ctor.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <linux/bitops.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+unsigned int __sw_hweight32(unsigned int w)
+{
+	w -= (w >> 1) & 0x55555555;
+	w =  (w & 0x33333333) + ((w >> 2) & 0x33333333);
+	w =  (w + (w >> 4)) & 0x0f0f0f0f;
+	return (w * 0x01010101) >> 24;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+{
+	unsigned int num_hooks = hweight32(tbl->valid_hooks);
+	struct bpfilter_ipt_standard *tgts;
+	struct bpfilter_table_info *info;
+	struct bpfilter_ipt_error *term;
+	unsigned int mask, offset, h, i;
+	unsigned int size, alloc_size;
+
+	size  = sizeof(struct bpfilter_ipt_standard) * num_hooks;
+	size += sizeof(struct bpfilter_ipt_error);
+
+	alloc_size = size + sizeof(struct bpfilter_table_info);
+
+	info = malloc(alloc_size);
+	if (!info)
+		return NULL;
+
+	info->num_entries = num_hooks + 1;
+	info->size = size;
+
+	tgts = (struct bpfilter_ipt_standard *) (info + 1);
+	term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+
+	mask = tbl->valid_hooks;
+	offset = 0;
+	h = 0;
+	i = 0;
+	dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
+	while (mask) {
+		struct bpfilter_ipt_standard *t;
+
+		if (!(mask & 1))
+			goto next;
+
+		info->hook_entry[h] = offset;
+		info->underflow[h] = offset;
+		t = &tgts[i++];
+		*t = (struct bpfilter_ipt_standard)
+			BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
+		t->target.target.u.kernel.target =
+			bpfilter_target_get_by_name(t->target.target.u.user.name);
+		dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
+		if (!t->target.target.u.kernel.target)
+			goto out_fail;
+
+		offset += sizeof(struct bpfilter_ipt_standard);
+	next:
+		mask >>= 1;
+		h++;
+	}
+	*term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
+	term->target.target.u.kernel.target =
+		bpfilter_target_get_by_name(term->target.target.u.user.name);
+	dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
+	if (!term->target.target.u.kernel.target)
+		goto out_fail;
+
+	dprintf(debug_fd, "info %p\n", info);
+	return info;
+
+out_fail:
+	free(info);
+	return NULL;
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
new file mode 100644
index 000000000000..699f3f623189
--- /dev/null
+++ b/net/bpfilter/init.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include "bpfilter_mod.h"
+
+static struct bpfilter_table filter_table_ipv4 = {
+	.name		= "filter",
+	.valid_hooks	= ((1<<BPFILTER_INET_HOOK_LOCAL_IN) |
+			   (1<<BPFILTER_INET_HOOK_FORWARD) |
+			   (1<<BPFILTER_INET_HOOK_LOCAL_OUT)),
+	.family		= BPFILTER_PROTO_IPV4,
+	.priority	= BPFILTER_IP_PRI_FILTER,
+};
+
+int bpfilter_ipv4_init(void)
+{
+	struct bpfilter_table *t = &filter_table_ipv4;
+	struct bpfilter_table_info *info;
+	int err;
+
+	err = bpfilter_ipv4_register_targets();
+	if (err)
+		return err;
+
+	info = bpfilter_ipv4_table_ctor(t);
+	if (!info)
+		return -ENOMEM;
+
+	t->info = info;
+
+	return bpfilter_table_add(&filter_table_ipv4);
+}
+
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index 81bbc1684896..e0273ca201ad 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -8,13 +8,52 @@
 #include <unistd.h>
 #include "include/uapi/linux/bpf.h"
 #include <asm/unistd.h>
+#include "bpfilter_mod.h"
 #include "msgfmt.h"
 
+extern long int syscall (long int __sysno, ...);
+
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+			  unsigned int size)
+{
+	return syscall(321, cmd, attr, size);
+}
+
+int pid;
 int debug_fd;
 
+int copy_from_user(void *dst, void *addr, int len)
+{
+	struct iovec local;
+	struct iovec remote;
+
+	local.iov_base = dst;
+	local.iov_len = len;
+	remote.iov_base = addr;
+	remote.iov_len = len;
+	return process_vm_readv(pid, &local, 1, &remote, 1, 0) != len;
+}
+
+int copy_to_user(void *addr, const void *src, int len)
+{
+	struct iovec local;
+	struct iovec remote;
+
+	local.iov_base = (void *) src;
+	local.iov_len = len;
+	remote.iov_base = addr;
+	remote.iov_len = len;
+	return process_vm_writev(pid, &local, 1, &remote, 1, 0) != len;
+}
+
 static int handle_get_cmd(struct mbox_request *cmd)
 {
+	pid = cmd->pid;
 	switch (cmd->cmd) {
+	case BPFILTER_IPT_SO_GET_INFO:
+		return bpfilter_get_info((void *)(long)cmd->addr, cmd->len);
+	case BPFILTER_IPT_SO_GET_ENTRIES:
+		return bpfilter_get_entries((void *)(long)cmd->addr, cmd->len);
 	case 0:
 		return 0;
 	default:
@@ -25,11 +64,23 @@ static int handle_get_cmd(struct mbox_request *cmd)
 
 static int handle_set_cmd(struct mbox_request *cmd)
 {
+	pid = cmd->pid;
+	switch (cmd->cmd) {
+	case BPFILTER_IPT_SO_SET_REPLACE:
+		return bpfilter_set_replace((void *)(long)cmd->addr, cmd->len);
+	case BPFILTER_IPT_SO_SET_ADD_COUNTERS:
+		return bpfilter_set_add_counters((void *)(long)cmd->addr, cmd->len);
+	default:
+		break;
+	}
 	return -ENOPROTOOPT;
 }
 
 static void loop(void)
 {
+	bpfilter_tables_init();
+	bpfilter_ipv4_init();
+
 	while (1) {
 		struct mbox_request req;
 		struct mbox_reply reply;
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
new file mode 100644
index 000000000000..43687daf51a3
--- /dev/null
+++ b/net/bpfilter/sockopt.c
@@ -0,0 +1,153 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include "bpfilter_mod.h"
+
+static int fetch_name(void *addr, int len, char *name, int name_len)
+{
+	if (copy_from_user(name, addr, name_len))
+		return -EFAULT;
+
+	name[BPFILTER_XT_TABLE_MAXNAMELEN-1] = '\0';
+	return 0;
+}
+
+int bpfilter_get_info(void *addr, int len)
+{
+	char name[BPFILTER_XT_TABLE_MAXNAMELEN];
+	struct bpfilter_ipt_get_info resp;
+	struct bpfilter_table_info *info;
+	struct bpfilter_table *tbl;
+	int err;
+
+	if (len != sizeof(struct bpfilter_ipt_get_info))
+		return -EINVAL;
+
+	err = fetch_name(addr, len, name, sizeof(name));
+	if (err)
+		return err;
+
+	tbl = bpfilter_table_get_by_name(name, strlen(name));
+	if (!tbl)
+		return -ENOENT;
+
+	info = tbl->info;
+	if (!info) {
+		err = -ENOENT;
+		goto out_put;
+	}
+
+	memset(&resp, 0, sizeof(resp));
+	memcpy(resp.name, name, sizeof(resp.name));
+	resp.valid_hooks = tbl->valid_hooks;
+	memcpy(&resp.hook_entry, info->hook_entry, sizeof(resp.hook_entry));
+	memcpy(&resp.underflow, info->underflow, sizeof(resp.underflow));
+	resp.num_entries = info->num_entries;
+	resp.size = info->size;
+
+	err = 0;
+	if (copy_to_user(addr, &resp, len))
+		err = -EFAULT;
+out_put:
+	bpfilter_table_put(tbl);
+	return err;
+}
+
+static int copy_target(struct bpfilter_standard_target *ut,
+		       struct bpfilter_standard_target *kt)
+{
+	struct bpfilter_target *tgt;
+	int sz;
+
+
+	if (put_user(kt->target.u.target_size,
+		     &ut->target.u.target_size))
+		return -EFAULT;
+
+	tgt = kt->target.u.kernel.target;
+	if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
+		return -EFAULT;
+
+	if (put_user(tgt->rev, &ut->target.u.user.revision))
+		return -EFAULT;
+
+	sz = tgt->size;
+	if (copy_to_user(ut->target.data, kt->target.data, sz))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int do_get_entries(void *up,
+			  struct bpfilter_table *tbl,
+			  struct bpfilter_table_info *info)
+{
+	unsigned int total_size = info->size;
+	const struct bpfilter_ipt_entry *ent;
+	unsigned int off;
+	void *base;
+
+	base = info->entries;
+
+	for (off = 0; off < total_size; off += ent->next_offset) {
+		struct bpfilter_xt_counters *cntrs;
+		struct bpfilter_standard_target *tgt;
+
+		ent = base + off;
+		if (copy_to_user(up + off, ent, sizeof(*ent)))
+			return -EFAULT;
+
+		/* XXX Just clear counters for now. XXX */
+		cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
+		if (put_user(0, &cntrs->packet_cnt) ||
+		    put_user(0, &cntrs->byte_cnt))
+			return -EINVAL;
+
+		tgt = (void *) ent + ent->target_offset;
+		dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
+		if (copy_target(up + off + ent->target_offset, tgt))
+			return -EFAULT;
+	}
+	return 0;
+}
+
+int bpfilter_get_entries(void *cmd, int len)
+{
+	struct bpfilter_ipt_get_entries *uptr = cmd;
+	struct bpfilter_ipt_get_entries req;
+	struct bpfilter_table_info *info;
+	struct bpfilter_table *tbl;
+	int err;
+
+	if (len < sizeof(struct bpfilter_ipt_get_entries))
+		return -EINVAL;
+
+	if (copy_from_user(&req, cmd, sizeof(req)))
+		return -EFAULT;
+
+	tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+	if (!tbl)
+		return -ENOENT;
+
+	info = tbl->info;
+	if (!info) {
+		err = -ENOENT;
+		goto out_put;
+	}
+
+	if (info->size != req.size) {
+		err = -EINVAL;
+		goto out_put;
+	}
+
+	err = do_get_entries(uptr->entries, tbl, info);
+	dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
+
+out_put:
+	bpfilter_table_put(tbl);
+
+	return err;
+}
+
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
new file mode 100644
index 000000000000..9a96599be634
--- /dev/null
+++ b/net/bpfilter/tables.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include <linux/hashtable.h>
+#include "bpfilter_mod.h"
+
+static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
+{
+	unsigned int hash = 0;
+	int i;
+
+	for (i = 0; i < len; i++)
+		hash ^= *(name + i);
+	return hash;
+}
+
+DEFINE_HASHTABLE(bpfilter_tables, 4);
+//DEFINE_MUTEX(bpfilter_table_mutex);
+
+struct bpfilter_table *bpfilter_table_get_by_name(const char *name, int name_len)
+{
+	unsigned int hval = full_name_hash(NULL, name, name_len);
+	struct bpfilter_table *tbl;
+
+//	mutex_lock(&bpfilter_table_mutex);
+	hash_for_each_possible(bpfilter_tables, tbl, hash, hval) {
+		if (!strcmp(name, tbl->name)) {
+			tbl->hold++;
+			goto out;
+		}
+	}
+	tbl = NULL;
+out:
+//	mutex_unlock(&bpfilter_table_mutex);
+	return tbl;
+}
+
+void bpfilter_table_put(struct bpfilter_table *tbl)
+{
+//	mutex_lock(&bpfilter_table_mutex);
+	tbl->hold--;
+//	mutex_unlock(&bpfilter_table_mutex);
+}
+
+int bpfilter_table_add(struct bpfilter_table *tbl)
+{
+	unsigned int hval = full_name_hash(NULL, tbl->name, strlen(tbl->name));
+	struct bpfilter_table *srch;
+
+//	mutex_lock(&bpfilter_table_mutex);
+	hash_for_each_possible(bpfilter_tables, srch, hash, hval) {
+		if (!strcmp(srch->name, tbl->name))
+			goto exists;
+	}
+	hash_add(bpfilter_tables, &tbl->hash, hval);
+//	mutex_unlock(&bpfilter_table_mutex);
+
+	return 0;
+
+exists:
+//	mutex_unlock(&bpfilter_table_mutex);
+	return -EEXIST;
+}
+
+void bpfilter_tables_init(void)
+{
+	hash_init(bpfilter_tables);
+}
+
diff --git a/net/bpfilter/targets.c b/net/bpfilter/targets.c
new file mode 100644
index 000000000000..4086ac82eaf5
--- /dev/null
+++ b/net/bpfilter/targets.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include <errno.h>
+#include <string.h>
+#include "bpfilter_mod.h"
+
+//DEFINE_MUTEX(bpfilter_target_mutex);
+static LIST_HEAD(bpfilter_targets);
+
+struct bpfilter_target *bpfilter_target_get_by_name(const char *name)
+{
+	struct bpfilter_target *tgt;
+
+//	mutex_lock(&bpfilter_target_mutex);
+	list_for_each_entry(tgt, &bpfilter_targets, all_target_list) {
+		if (!strcmp(tgt->name, name)) {
+			tgt->hold++;
+			goto out;
+		}
+	}
+	tgt = NULL;
+out:
+//	mutex_unlock(&bpfilter_target_mutex);
+	return tgt;
+}
+
+void bpfilter_target_put(struct bpfilter_target *tgt)
+{
+//	mutex_lock(&bpfilter_target_mutex);
+	tgt->hold--;
+//	mutex_unlock(&bpfilter_target_mutex);
+}
+
+int bpfilter_target_add(struct bpfilter_target *tgt)
+{
+	struct bpfilter_target *srch;
+
+//	mutex_lock(&bpfilter_target_mutex);
+	list_for_each_entry(srch, &bpfilter_targets, all_target_list) {
+		if (!strcmp(srch->name, tgt->name))
+			goto exists;
+	}
+	list_add_tail(&tgt->all_target_list, &bpfilter_targets);
+//	mutex_unlock(&bpfilter_target_mutex);
+	return 0;
+
+exists:
+//	mutex_unlock(&bpfilter_target_mutex);
+	return -EEXIST;
+}
+
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
new file mode 100644
index 000000000000..eac5e8ac0b4b
--- /dev/null
+++ b/net/bpfilter/tgts.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/socket.h>
+#include "bpfilter_mod.h"
+
+struct bpfilter_target std_tgt = {
+	.name = BPFILTER_STANDARD_TARGET,
+	.family = BPFILTER_PROTO_IPV4,
+	.size = sizeof(int),
+};
+
+struct bpfilter_target err_tgt = {
+	.name = BPFILTER_ERROR_TARGET,
+	.family = BPFILTER_PROTO_IPV4,
+	.size = BPFILTER_FUNCTION_MAXNAMELEN,
+};
+
+int bpfilter_ipv4_register_targets(void)
+{
+	int err = bpfilter_target_add(&std_tgt);
+
+	if (err)
+		return err;
+	return bpfilter_target_add(&err_tgt);
+}
+
-- 
2.9.5

^ permalink raw reply related

* [PATCH RFC v2 net-next 4/4] bpfilter: rough bpfilter codegen example hack
From: Alexei Starovoitov @ 2018-05-03  4:36 UTC (permalink / raw)
  To: davem; +Cc: daniel, torvalds, gregkh, luto, netdev, linux-kernel, kernel-team
In-Reply-To: <20180503043604.1604587-1-ast@kernel.org>

From: Daniel Borkmann <daniel@iogearbox.net>

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 net/bpfilter/Makefile       |   2 +-
 net/bpfilter/bpfilter_mod.h | 285 ++++++++++++++++++++++++++++++++++++++++++-
 net/bpfilter/ctor.c         |  57 +++++----
 net/bpfilter/gen.c          | 290 ++++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/init.c         |  11 +-
 net/bpfilter/main.c         |  15 ++-
 net/bpfilter/sockopt.c      | 137 ++++++++++++++++-----
 net/bpfilter/tables.c       |   5 +-
 net/bpfilter/tgts.c         |   1 +
 9 files changed, 737 insertions(+), 66 deletions(-)
 create mode 100644 net/bpfilter/gen.c

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index bec6181de995..3796651c76cb 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -4,7 +4,7 @@
 #
 
 hostprogs-y := bpfilter_umh
-bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o
+bpfilter_umh-objs := main.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
 HOSTCFLAGS += -I. -Itools/include/
 
 # a bit of elf magic to convert bpfilter_umh binary into a binary blob
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index f0de41b20793..b4209985efff 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -21,8 +21,8 @@ struct bpfilter_table_info {
 	unsigned int		initial_entries;
 	unsigned int		hook_entry[BPFILTER_INET_HOOK_MAX];
 	unsigned int		underflow[BPFILTER_INET_HOOK_MAX];
-	unsigned int		stacksize;
-	void			***jumpstack;
+//	unsigned int		stacksize;
+//	void			***jumpstack;
 	unsigned char		entries[0] __aligned(8);
 };
 
@@ -64,22 +64,55 @@ struct bpfilter_ipt_error {
 
 struct bpfilter_target {
 	struct list_head	all_target_list;
-	const char		name[BPFILTER_EXTENSION_MAXNAMELEN];
+	char			name[BPFILTER_EXTENSION_MAXNAMELEN];
 	unsigned int		size;
 	int			hold;
 	u16			family;
 	u8			rev;
 };
 
+struct bpfilter_gen_ctx {
+	struct bpf_insn		*img;
+	u32			len_cur;
+	u32			len_max;
+	u32			default_verdict;
+	int			fd;
+	int			ifindex;
+	bool			offloaded;
+};
+
+union bpf_attr;
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+			struct bpfilter_ipt_ip *ent, int verdict);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
+
 struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
 void bpfilter_target_put(struct bpfilter_target *tgt);
 int bpfilter_target_add(struct bpfilter_target *tgt);
 
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl, __u32 size_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+			     struct bpfilter_table_info *info,
+			     __u32 size_ents, __u32 num_ents);
+struct bpfilter_table_info *
+bpfilter_ipv4_table_finalize2(struct bpfilter_table *tbl,
+			      struct bpfilter_table_info *info,
+			      __u32 size_ents, __u32 num_ents);
+
 int bpfilter_ipv4_register_targets(void);
 void bpfilter_tables_init(void);
 int bpfilter_get_info(void *addr, int len);
 int bpfilter_get_entries(void *cmd, int len);
+int bpfilter_set_replace(void *cmd, int len);
+int bpfilter_set_add_counters(void *cmd, int len);
 int bpfilter_ipv4_init(void);
 
 int copy_from_user(void *dst, void *addr, int len);
@@ -93,4 +126,248 @@ extern int pid;
 extern int debug_fd;
 #define ENOTSUPP        524
 
+/* Helper macros for filter block array initializers. */
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Endianess conversion, cpu_to_{l,b}e(), {l,b}e_to_cpu() */
+
+#define BPF_ENDIAN(TYPE, DST, LEN)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_END | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = LEN })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM)					\
+	BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = (__u32) (IMM) }),			\
+	((struct bpf_insn) {					\
+		.code  = 0, /* zero is reserved opcode */	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((__u64) (IMM)) >> 32 })
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD)				\
+	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
+
+/* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */
+
+#define BPF_MOV64_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_RAW(TYPE, DST, SRC, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_SRC(TYPE),	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Indirect packet access, R0 = *(uint *) (skb->data + src_reg + imm32) */
+
+#define BPF_LD_IND(SIZE, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_IND,	\
+		.dst_reg = 0,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Unconditional jumps, goto pc + off16 */
+
+#define BPF_JMP_A(OFF)						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_JA,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Function call */
+
+#define BPF_EMIT_CALL(FUNC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_CALL,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((FUNC) - __bpf_call_base) })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = CODE,					\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN()						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_EXIT,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
 #endif
diff --git a/net/bpfilter/ctor.c b/net/bpfilter/ctor.c
index efb7feef3c42..ba44c21cacfa 100644
--- a/net/bpfilter/ctor.c
+++ b/net/bpfilter/ctor.c
@@ -1,8 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
-#include <linux/bitops.h>
 #include <stdlib.h>
 #include <stdio.h>
+#include <string.h>
+
+#include <sys/socket.h>
+
+#include <linux/bitops.h>
+
 #include "bpfilter_mod.h"
 
 unsigned int __sw_hweight32(unsigned int w)
@@ -13,35 +17,47 @@ unsigned int __sw_hweight32(unsigned int w)
 	return (w * 0x01010101) >> 24;
 }
 
-struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
+struct bpfilter_table_info *bpfilter_ipv4_table_alloc(struct bpfilter_table *tbl,
+						      __u32 size_ents)
 {
 	unsigned int num_hooks = hweight32(tbl->valid_hooks);
-	struct bpfilter_ipt_standard *tgts;
 	struct bpfilter_table_info *info;
-	struct bpfilter_ipt_error *term;
-	unsigned int mask, offset, h, i;
 	unsigned int size, alloc_size;
 
 	size  = sizeof(struct bpfilter_ipt_standard) * num_hooks;
 	size += sizeof(struct bpfilter_ipt_error);
+	size += size_ents;
 
 	alloc_size = size + sizeof(struct bpfilter_table_info);
 
 	info = malloc(alloc_size);
-	if (!info)
-		return NULL;
+	if (info) {
+		memset(info, 0, alloc_size);
+		info->size = size;
+	}
+	return info;
+}
+
+struct bpfilter_table_info *bpfilter_ipv4_table_finalize(struct bpfilter_table *tbl,
+							 struct bpfilter_table_info *info,
+							 __u32 size_ents, __u32 num_ents)
+{
+	unsigned int num_hooks = hweight32(tbl->valid_hooks);
+	struct bpfilter_ipt_standard *tgts;
+	struct bpfilter_ipt_error *term;
+	struct bpfilter_ipt_entry *ent;
+	unsigned int mask, offset, h, i;
 
-	info->num_entries = num_hooks + 1;
-	info->size = size;
+	info->num_entries = num_ents + num_hooks + 1;
 
-	tgts = (struct bpfilter_ipt_standard *) (info + 1);
-	term = (struct bpfilter_ipt_error *) (tgts + num_hooks);
+	ent  = (struct bpfilter_ipt_entry *)(info + 1);
+	tgts = (struct bpfilter_ipt_standard *)((u8 *)ent + size_ents);
+	term = (struct bpfilter_ipt_error *)(tgts + num_hooks);
 
 	mask = tbl->valid_hooks;
 	offset = 0;
 	h = 0;
 	i = 0;
-	dprintf(debug_fd, "mask %x num_hooks %d\n", mask, num_hooks);
 	while (mask) {
 		struct bpfilter_ipt_standard *t;
 
@@ -55,7 +71,6 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
 			BPFILTER_IPT_STANDARD_INIT(BPFILTER_NF_ACCEPT);
 		t->target.target.u.kernel.target =
 			bpfilter_target_get_by_name(t->target.target.u.user.name);
-		dprintf(debug_fd, "user.name %s\n", t->target.target.u.user.name);
 		if (!t->target.target.u.kernel.target)
 			goto out_fail;
 
@@ -67,14 +82,10 @@ struct bpfilter_table_info *bpfilter_ipv4_table_ctor(struct bpfilter_table *tbl)
 	*term = (struct bpfilter_ipt_error) BPFILTER_IPT_ERROR_INIT;
 	term->target.target.u.kernel.target =
 		bpfilter_target_get_by_name(term->target.target.u.user.name);
-	dprintf(debug_fd, "user.name %s\n", term->target.target.u.user.name);
-	if (!term->target.target.u.kernel.target)
-		goto out_fail;
-
-	dprintf(debug_fd, "info %p\n", info);
-	return info;
-
+	if (!term->target.target.u.kernel.target) {
 out_fail:
-	free(info);
-	return NULL;
+		free(info);
+		return NULL;
+	}
+	return info;
 }
diff --git a/net/bpfilter/gen.c b/net/bpfilter/gen.c
new file mode 100644
index 000000000000..8e08561b78f1
--- /dev/null
+++ b/net/bpfilter/gen.c
@@ -0,0 +1,290 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <string.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_link.h>
+#include <linux/rtnetlink.h>
+#include <linux/bpf.h>
+typedef __u16 __bitwise __sum16; /* hack */
+#include <linux/ip.h>
+
+#include <arpa/inet.h>
+
+#include "bpfilter_mod.h"
+
+unsigned int if_nametoindex(const char *ifname);
+
+static inline __u64 bpf_ptr_to_u64(const void *ptr)
+{
+	return (__u64)(unsigned long)ptr;
+}
+
+static int bpf_prog_load(enum bpf_prog_type type,
+			 const struct bpf_insn *insns,
+			 unsigned int insn_num,
+			 __u32 offload_ifindex)
+{
+	union bpf_attr attr = {};
+
+	attr.prog_type		= type;
+	attr.insns		= bpf_ptr_to_u64(insns);
+	attr.insn_cnt		= insn_num;
+	attr.license		= bpf_ptr_to_u64("GPL");
+	attr.prog_ifindex	= offload_ifindex;
+
+	return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+}
+
+static int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+
+	/* started nested attribute for XDP */
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+	nla->nla_len = NLA_HDRLEN;
+
+	/* add XDP fd */
+	nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len += nla_xdp->nla_len;
+
+	/* if user passed in any flags, add those too */
+	if (flags) {
+		nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+		nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
+		nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+		memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+		nla->nla_len += nla_xdp->nla_len;
+	}
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+static int bpfilter_load_dev(struct bpfilter_gen_ctx *ctx)
+{
+	u32 xdp_flags = 0;
+
+	if (ctx->offloaded)
+		xdp_flags |= XDP_FLAGS_HW_MODE;
+	return bpf_set_link_xdp_fd(ctx->ifindex, ctx->fd, xdp_flags);
+}
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx)
+{
+	unsigned int len_max = BPF_MAXINSNS;
+
+	memset(ctx, 0, sizeof(*ctx));
+	ctx->img = calloc(len_max, sizeof(struct bpf_insn));
+	if (!ctx->img)
+		return -ENOMEM;
+	ctx->len_max = len_max;
+	ctx->fd = -1;
+	ctx->default_verdict = XDP_PASS;
+
+	return 0;
+}
+
+#define EMIT(x)						\
+	do {						\
+		if (ctx->len_cur + 1 > ctx->len_max)	\
+			return -ENOMEM;			\
+		ctx->img[ctx->len_cur++] = x;		\
+	} while (0)
+
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx)
+{
+	EMIT(BPF_MOV64_REG(BPF_REG_9, BPF_REG_1));
+	EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_9,
+			 offsetof(struct xdp_md, data)));
+	EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_9,
+			 offsetof(struct xdp_md, data_end)));
+	EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+	EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, ETH_HLEN));
+	EMIT(BPF_JMP_REG(BPF_JLE, BPF_REG_1, BPF_REG_3, 2));
+	EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+	EMIT(BPF_EXIT_INSN());
+	return 0;
+}
+
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx)
+{
+	EMIT(BPF_MOV32_IMM(BPF_REG_0, ctx->default_verdict));
+	EMIT(BPF_EXIT_INSN());
+	return 0;
+}
+
+static int bpfilter_gen_check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF	"\xff\xff\xff\xff"
+	static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+	static const __u8 mask0[IFNAMSIZ] = { };
+	int ones = strlen(ent->in_iface); ones += ones > 0;
+#undef M_FF
+	if (strlen(ent->out_iface) > 0)
+		return -ENOTSUPP;
+	if (memcmp(ent->in_iface_mask, mask1, ones) ||
+	    memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+		return -ENOTSUPP;
+	if ((ent->src_mask != 0 && ent->src_mask != 0xffffffff) ||
+	    (ent->dst_mask != 0 && ent->dst_mask != 0xffffffff))
+		return -ENOTSUPP;
+
+	return 0;
+}
+
+int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
+			struct bpfilter_ipt_ip *ent, int verdict)
+{
+	u32 match_xdp = verdict == -1 ? XDP_DROP : XDP_PASS;
+	int ret, ifindex, match_state = 0;
+
+	/* convention R1: tmp, R2: data, R3: data_end, R9: xdp_buff */
+	ret = bpfilter_gen_check_entry(ent);
+	if (ret < 0)
+		return ret;
+	if (ent->src_mask == 0 && ent->dst_mask == 0)
+		return 0;
+
+	ifindex = if_nametoindex(ent->in_iface);
+	if (!ifindex)
+		return 0;
+	if (ctx->ifindex && ctx->ifindex != ifindex)
+		return -ENOTSUPP;
+
+	ctx->ifindex = ifindex;
+	match_state = !!ent->src_mask + !!ent->dst_mask;
+
+	EMIT(BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+	EMIT(BPF_MOV32_IMM(BPF_REG_5, 0));
+	EMIT(BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_1,
+			 offsetof(struct ethhdr, h_proto)));
+	EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, htons(ETH_P_IP),
+			 3 + match_state * 3));
+	EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+			   sizeof(struct ethhdr) + sizeof(struct iphdr)));
+	EMIT(BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 1 + match_state * 3));
+	EMIT(BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -(int)sizeof(struct iphdr)));
+	if (ent->src_mask) {
+		EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				 offsetof(struct iphdr, saddr)));
+		EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->src, 1));
+		EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+	}
+	if (ent->dst_mask) {
+		EMIT(BPF_LDX_MEM(BPF_W, BPF_REG_4, BPF_REG_1,
+				 offsetof(struct iphdr, daddr)));
+		EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_4, ent->dst, 1));
+		EMIT(BPF_ALU32_IMM(BPF_ADD, BPF_REG_5, 1));
+	}
+	EMIT(BPF_JMP_IMM(BPF_JNE, BPF_REG_5, match_state, 2));
+	EMIT(BPF_MOV32_IMM(BPF_REG_0, match_xdp));
+	EMIT(BPF_EXIT_INSN());
+	return 0;
+}
+
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx)
+{
+	int ret;
+
+	ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+			    ctx->len_cur, ctx->ifindex);
+	if (ret > 0)
+		ctx->offloaded = true;
+	if (ret < 0)
+		ret = bpf_prog_load(BPF_PROG_TYPE_XDP, ctx->img,
+				    ctx->len_cur, 0);
+	if (ret > 0) {
+		ctx->fd = ret;
+		ret = bpfilter_load_dev(ctx);
+	}
+
+	return ret < 0 ? ret : 0;
+}
+
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx)
+{
+	free(ctx->img);
+	close(ctx->fd);
+}
diff --git a/net/bpfilter/init.c b/net/bpfilter/init.c
index 699f3f623189..14e621a03217 100644
--- a/net/bpfilter/init.c
+++ b/net/bpfilter/init.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
 #include <errno.h>
+
+#include <sys/socket.h>
+
 #include "bpfilter_mod.h"
 
 static struct bpfilter_table filter_table_ipv4 = {
@@ -22,12 +24,13 @@ int bpfilter_ipv4_init(void)
 	if (err)
 		return err;
 
-	info = bpfilter_ipv4_table_ctor(t);
+	info = bpfilter_ipv4_table_alloc(t, 0);
+	if (!info)
+		return -ENOMEM;
+	info = bpfilter_ipv4_table_finalize(t, info, 0, 0);
 	if (!info)
 		return -ENOMEM;
-
 	t->info = info;
-
 	return bpfilter_table_add(&filter_table_ipv4);
 }
 
diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c
index e0273ca201ad..ebd8a4fb1e95 100644
--- a/net/bpfilter/main.c
+++ b/net/bpfilter/main.c
@@ -1,20 +1,23 @@
 // SPDX-License-Identifier: GPL-2.0
 #define _GNU_SOURCE
-#include <sys/uio.h>
 #include <errno.h>
 #include <stdio.h>
-#include <sys/socket.h>
 #include <fcntl.h>
 #include <unistd.h>
-#include "include/uapi/linux/bpf.h"
+
+#include <sys/uio.h>
+#include <sys/socket.h>
+
 #include <asm/unistd.h>
+
+#include "include/uapi/linux/bpf.h"
+
 #include "bpfilter_mod.h"
 #include "msgfmt.h"
 
 extern long int syscall (long int __sysno, ...);
 
-static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
-			  unsigned int size)
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
 {
 	return syscall(321, cmd, attr, size);
 }
@@ -39,7 +42,7 @@ int copy_to_user(void *addr, const void *src, int len)
 	struct iovec local;
 	struct iovec remote;
 
-	local.iov_base = (void *) src;
+	local.iov_base = (void *)src;
 	local.iov_len = len;
 	remote.iov_base = addr;
 	remote.iov_len = len;
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
index 43687daf51a3..26ad12a11736 100644
--- a/net/bpfilter/sockopt.c
+++ b/net/bpfilter/sockopt.c
@@ -1,10 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
 #include <errno.h>
 #include <string.h>
 #include <stdio.h>
+#include <stdlib.h>
+
+#include <sys/socket.h>
+
 #include "bpfilter_mod.h"
 
+/* TODO: Get all of this in here properly done in encoding/decoding layer. */
 static int fetch_name(void *addr, int len, char *name, int name_len)
 {
 	if (copy_from_user(name, addr, name_len))
@@ -55,12 +59,17 @@ int bpfilter_get_info(void *addr, int len)
 	return err;
 }
 
-static int copy_target(struct bpfilter_standard_target *ut,
-		       struct bpfilter_standard_target *kt)
+static int target_u2k(struct bpfilter_standard_target *kt)
 {
-	struct bpfilter_target *tgt;
-	int sz;
+	kt->target.u.kernel.target =
+		bpfilter_target_get_by_name(kt->target.u.user.name);
+	return kt->target.u.kernel.target ? 0 : -EINVAL;
+}
 
+static int target_k2u(struct bpfilter_standard_target *ut,
+		      struct bpfilter_standard_target *kt)
+{
+	struct bpfilter_target *tgt;
 
 	if (put_user(kt->target.u.target_size,
 		     &ut->target.u.target_size))
@@ -69,12 +78,9 @@ static int copy_target(struct bpfilter_standard_target *ut,
 	tgt = kt->target.u.kernel.target;
 	if (copy_to_user(ut->target.u.user.name, tgt->name, strlen(tgt->name)))
 		return -EFAULT;
-
 	if (put_user(tgt->rev, &ut->target.u.user.revision))
 		return -EFAULT;
-
-	sz = tgt->size;
-	if (copy_to_user(ut->target.data, kt->target.data, sz))
+	if (copy_to_user(ut->target.data, kt->target.data, tgt->size))
 		return -EFAULT;
 
 	return 0;
@@ -84,30 +90,25 @@ static int do_get_entries(void *up,
 			  struct bpfilter_table *tbl,
 			  struct bpfilter_table_info *info)
 {
-	unsigned int total_size = info->size;
 	const struct bpfilter_ipt_entry *ent;
+	unsigned int total_size = info->size;
+	void *base = info->entries;
 	unsigned int off;
-	void *base;
-
-	base = info->entries;
 
 	for (off = 0; off < total_size; off += ent->next_offset) {
-		struct bpfilter_xt_counters *cntrs;
 		struct bpfilter_standard_target *tgt;
+		struct bpfilter_xt_counters *cntrs;
 
 		ent = base + off;
 		if (copy_to_user(up + off, ent, sizeof(*ent)))
 			return -EFAULT;
-
-		/* XXX Just clear counters for now. XXX */
+		/* XXX: Just clear counters for now. */
 		cntrs = up + off + offsetof(struct bpfilter_ipt_entry, cntrs);
 		if (put_user(0, &cntrs->packet_cnt) ||
 		    put_user(0, &cntrs->byte_cnt))
 			return -EINVAL;
-
-		tgt = (void *) ent + ent->target_offset;
-		dprintf(debug_fd, "target.verdict %d\n", tgt->verdict);
-		if (copy_target(up + off + ent->target_offset, tgt))
+		tgt = (void *)ent + ent->target_offset;
+		if (target_k2u(up + off + ent->target_offset, tgt))
 			return -EFAULT;
 	}
 	return 0;
@@ -123,31 +124,113 @@ int bpfilter_get_entries(void *cmd, int len)
 
 	if (len < sizeof(struct bpfilter_ipt_get_entries))
 		return -EINVAL;
-
 	if (copy_from_user(&req, cmd, sizeof(req)))
 		return -EFAULT;
-
 	tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
 	if (!tbl)
 		return -ENOENT;
-
 	info = tbl->info;
 	if (!info) {
 		err = -ENOENT;
 		goto out_put;
 	}
-
 	if (info->size != req.size) {
 		err = -EINVAL;
 		goto out_put;
 	}
-
 	err = do_get_entries(uptr->entries, tbl, info);
-	dprintf(debug_fd, "do_get_entries %d req.size %d\n", err, req.size);
-
 out_put:
 	bpfilter_table_put(tbl);
+	return err;
+}
 
+static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
+			  struct bpfilter_table *tbl)
+{
+	unsigned int total_size = req->size;
+	struct bpfilter_table_info *info;
+	struct bpfilter_ipt_entry *ent;
+	struct bpfilter_gen_ctx ctx;
+	unsigned int off, sents = 0, ents = 0;
+	int ret;
+
+	ret = bpfilter_gen_init(&ctx);
+	if (ret < 0)
+		return ret;
+	ret = bpfilter_gen_prologue(&ctx);
+	if (ret < 0)
+		return ret;
+	info = bpfilter_ipv4_table_alloc(tbl, total_size);
+	if (!info)
+		return -ENOMEM;
+	if (copy_from_user(&info->entries[0], base, req->size)) {
+		free(info);
+		return -EFAULT;
+	}
+	base = &info->entries[0];
+	for (off = 0; off < total_size; off += ent->next_offset) {
+		struct bpfilter_standard_target *tgt;
+		ent = base + off;
+		ents++;
+		sents += ent->next_offset;
+		tgt = (void *) ent + ent->target_offset;
+		target_u2k(tgt);
+		ret = bpfilter_gen_append(&ctx, &ent->ip, tgt->verdict);
+                if (ret < 0)
+                        goto err;
+	}
+	info->num_entries = ents;
+	info->size = sents;
+	memcpy(info->hook_entry, req->hook_entry, sizeof(info->hook_entry));
+	memcpy(info->underflow, req->underflow, sizeof(info->hook_entry));
+	ret = bpfilter_gen_epilogue(&ctx);
+	if (ret < 0)
+		goto err;
+	ret = bpfilter_gen_commit(&ctx);
+	if (ret < 0)
+		goto err;
+	free(tbl->info);
+	tbl->info = info;
+	bpfilter_gen_destroy(&ctx);
+	dprintf(debug_fd, "offloaded %u\n", ctx.offloaded);
+	return ret;
+err:
+	free(info);
+	return ret;
+}
+
+int bpfilter_set_replace(void *cmd, int len)
+{
+	struct bpfilter_ipt_replace *uptr = cmd;
+	struct bpfilter_ipt_replace req;
+	struct bpfilter_table_info *info;
+	struct bpfilter_table *tbl;
+	int err;
+
+	if (len < sizeof(req))
+		return -EINVAL;
+	if (copy_from_user(&req, cmd, sizeof(req)))
+		return -EFAULT;
+	if (req.num_counters >= INT_MAX / sizeof(struct bpfilter_xt_counters))
+		return -ENOMEM;
+	if (req.num_counters == 0)
+		return -EINVAL;
+	req.name[sizeof(req.name) - 1] = 0;
+	tbl = bpfilter_table_get_by_name(req.name, strlen(req.name));
+	if (!tbl)
+		return -ENOENT;
+	info = tbl->info;
+	if (!info) {
+		err = -ENOENT;
+		goto out_put;
+	}
+	err = do_set_replace(&req, uptr->entries, tbl);
+out_put:
+	bpfilter_table_put(tbl);
 	return err;
 }
 
+int bpfilter_set_add_counters(void *cmd, int len)
+{
+	return 0;
+}
diff --git a/net/bpfilter/tables.c b/net/bpfilter/tables.c
index 9a96599be634..e0dab283092d 100644
--- a/net/bpfilter/tables.c
+++ b/net/bpfilter/tables.c
@@ -1,8 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <sys/socket.h>
 #include <errno.h>
 #include <string.h>
+
+#include <sys/socket.h>
+
 #include <linux/hashtable.h>
+
 #include "bpfilter_mod.h"
 
 static unsigned int full_name_hash(const void *salt, const char *name, unsigned int len)
diff --git a/net/bpfilter/tgts.c b/net/bpfilter/tgts.c
index eac5e8ac0b4b..0a00bc289d3d 100644
--- a/net/bpfilter/tgts.c
+++ b/net/bpfilter/tgts.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <sys/socket.h>
+
 #include "bpfilter_mod.h"
 
 struct bpfilter_target std_tgt = {
-- 
2.9.5

^ permalink raw reply related

* Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
From: Cong Wang @ 2018-05-03  4:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kernel test robot, kernel test robot,
	Linux Memory Management List, Johannes Weiner, LKP, David Miller,
	Linux Kernel Network Developers
In-Reply-To: <20180502212735.7660515ac03cf61630f5ff6b@linux-foundation.org>

On Wed, May 2, 2018 at 9:27 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> So it's saying that something which got committed into Linus's tree
> after 4.17-rc3 has caused a NULL deref in
> sock_release->llc_ui_release+0x3a/0xd0

Do you mean it contains commit 3a04ce7130a7
("llc: fix NULL pointer deref for SOCK_ZAPPED")?

^ permalink raw reply

* Re: Silently dropped UDP packets on kernel 4.14
From: Florian Westphal @ 2018-05-03  5:03 UTC (permalink / raw)
  To: Kristian Evensen
  Cc: Florian Westphal, Netfilter Development Mailing list,
	Network Development
In-Reply-To: <CAKfDRXiO7qKHGLf5faSy7g_f1p9budJLXS9-uZWLPgyx3OAJbA@mail.gmail.com>

Kristian Evensen <kristian.evensen@gmail.com> wrote:
> I went for the early-insert approached and have patched

I'm sorry for suggesting that.

It doesn't work, because of NAT.
NAT rewrites packet content and changes the reply tuple, but the tuples
determine the hash insertion location.

I don't know how to solve this problem.

^ permalink raw reply

* Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Michael Chan @ 2018-05-03  5:04 UTC (permalink / raw)
  To: Zumeng Chen
  Cc: Netdev, open list, Siva Reddy Kallam,
	prashant.sreedharan@broadcom.com, David Miller, Zumeng Chen
In-Reply-To: <8f73e98f-0c55-2d96-a1b7-0890bf90bf41@gmail.com>

On Wed, May 2, 2018 at 5:30 PM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
> On 2018年05月03日 01:32, Michael Chan wrote:
>>
>> On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
>>>
>>> On 2018年05月02日 13:12, Michael Chan wrote:
>>>>
>>>> On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen <zumeng.chen@gmail.com>
>>>> wrote:
>>>>
>>>>> diff --git a/drivers/net/ethernet/broadcom/tg3.h
>>>>> b/drivers/net/ethernet/broadcom/tg3.h
>>>>> index 3b5e98e..c61d83c 100644
>>>>> --- a/drivers/net/ethernet/broadcom/tg3.h
>>>>> +++ b/drivers/net/ethernet/broadcom/tg3.h
>>>>> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>>>>>           TG3_FLAG_ROBOSWITCH,
>>>>>           TG3_FLAG_ONE_DMA_AT_ONCE,
>>>>>           TG3_FLAG_RGMII_MODE,
>>>>> +       TG3_FLAG_HALT,
>>>>
>>>> I think you should be able to use the existing INIT_COMPLETE flag
>>>
>>>
>>> No,  it will bring the uncertain factors into the existed complicate
>>> logic
>>> of INIT_COMPLETE.
>>> And I think it's very simple logic here to fix the meaningless hw_stats
>>> reading and the problem
>>> of commit f5992b72. I even suspect if you have read INIT_COMPLETE related
>>> codes carefully.
>>>
>> We should use an existing flag whenever appropriate
>
>
> I disagree. This is sort of blahblah...
>>

I don't want to see another flag added that is practically the same as
!INIT_COMPLETE.  The driver already has close to one hundred flags.
Adding a new flag that is similar to an existing flag will just make
the code more difficult to understand and maintain.

If you don't want to fix it the cleaner way, Siva or I will fix it.

^ permalink raw reply

* Re: [PATCH net-next v7 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: kbuild test robot @ 2018-05-03  5:05 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: kbuild-all, netdev, cake
In-Reply-To: <152527386316.14936.5409621935637217368.stgit@alrua-kau>

Hi Toke,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Toke-H-iland-J-rgensen/sched-Add-Common-Applications-Kept-Enhanced-cake-qdisc/20180503-073002


coccinelle warnings: (new ones prefixed by >>)

>> net/sched/sch_cake.c:580:2-3: Unneeded semicolon

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH] sched: fix semicolon.cocci warnings
From: kbuild test robot @ 2018-05-03  5:05 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: kbuild-all, netdev, cake
In-Reply-To: <152527386316.14936.5409621935637217368.stgit@alrua-kau>

From: Fengguang Wu <fengguang.wu@intel.com>

net/sched/sch_cake.c:580:2-3: Unneeded semicolon


 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: 907a16741a03 ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
CC: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---

 sch_cake.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -577,7 +577,7 @@ cake_hash(struct cake_tin_data *q, const
 	default:
 		dsthost_hash = 0;
 		srchost_hash = 0;
-	};
+	}
 
 	/* This *must* be after the above switch, since as a
 	 * side-effect it sorts the src and dst addresses.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox