Netdev List
 help / color / mirror / Atom feed
* [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers
From: Arseny Maslennikov @ 2018-08-30 18:22 UTC (permalink / raw)
  To: linux-rdma; +Cc: Arseny Maslennikov, Doug Ledford, Jason Gunthorpe, netdev
In-Reply-To: <20180830182238.16361-1-ar@cs.msu.ru>

Some InfiniBand network devices have multiple ports on the same PCI
function. This initializes the `dev_port' sysfs field of those
network interfaces with their port number.

Prior to this the kernel erroneously used the `dev_id' sysfs
field of those network interfaces to convey the port number to userspace.

The use of `dev_id' was considered correct until Linux 3.15,
when another field, `dev_port', was defined for this particular
purpose and `dev_id' was reserved for distinguishing stacked ifaces
(e.g: VLANs) with the same hardware address as their parent device.

Similar fixes to net/mlx4_en and many other drivers, which started
exporting this information through `dev_id' before 3.15, were accepted
into the kernel 4 years ago.
See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').

Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e3d28f9ad9c0..ba16a63ee303 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev)
 	       sizeof(union ib_gid));
 
 	SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
-	priv->dev->dev_id = priv->port - 1;
+	priv->dev->dev_port = priv->port - 1;
 
 	return 0;
 }
-- 
2.18.0

^ permalink raw reply related

* [PATCH v2 0/2] IB/ipoib: Use dev_port to disambiguate port numbers
From: Arseny Maslennikov @ 2018-08-30 18:22 UTC (permalink / raw)
  To: linux-rdma; +Cc: Arseny Maslennikov, Doug Ledford, Jason Gunthorpe, netdev

Pre-3.15 userspace had trouble distinguishing different ports
of a NIC on a single PCI bus/device/function. To solve this,
a sysfs field `dev_port' was introduced quite a while ago
(commit v3.14-rc3-739-g3f85944fe207), and some relevant device
drivers were fixed to use it, but not in case of IPoIB.

The convention for some reason never got documented in the kernel, but
was immediately adopted by userspace (notably udev[1][2], biosdevname[3])

3/3 documents the sysfs field — that's why I'm CC-ing netdev.

This series was tested on and applies to 4.19-rc1.

[1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html
[2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html
[3] https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38

v1->v2: replace a line instead of inserting and then removing.

Arseny Maslennikov (2):
  IB/ipoib: Use dev_port to expose network interface port numbers
  Documentation/ABI: document /sys/class/net/*/dev_port

 Documentation/ABI/testing/sysfs-class-net | 10 ++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

-- 
2.18.0

^ permalink raw reply

* [PATCH v2 2/2] Documentation/ABI: document /sys/class/net/*/dev_port
From: Arseny Maslennikov @ 2018-08-30 18:22 UTC (permalink / raw)
  To: linux-rdma; +Cc: Arseny Maslennikov, Doug Ledford, Jason Gunthorpe, netdev
In-Reply-To: <20180830182238.16361-1-ar@cs.msu.ru>

The sysfs field was introduced 4 years ago along with fixes to various
drivers that erroneously used `dev_id' for that purpose, but it was not
properly documented anywhere.
See commit v3.14-rc3-739-g3f85944fe207.

Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
---
 Documentation/ABI/testing/sysfs-class-net | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net
index 2f1788111cd9..1593d8997ade 100644
--- a/Documentation/ABI/testing/sysfs-class-net
+++ b/Documentation/ABI/testing/sysfs-class-net
@@ -91,6 +91,16 @@ Description:
 		stacked (e.g: VLAN interfaces) but still have the same MAC
 		address as their parent device.
 
+What:		/sys/class/net/<iface>/dev_port
+Date:		February 2014
+KernelVersion:	3.15
+Contact:	netdev@vger.kernel.org
+Description:
+		Indicates the port number of this network device, formatted
+		as a decimal value. Some NICs have multiple independent ports
+		on the same PCI bus, device and function. This field allows
+		userspace to distinguish the respective interfaces.
+
 What:		/sys/class/net/<iface>/dormant
 Date:		March 2006
 KernelVersion:	2.6.17
-- 
2.18.0

^ permalink raw reply related

* [bpf-next 0/3] Introduce eBPF flow dissector
From: Petar Penkov @ 2018-08-30 18:22 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov

From: Petar Penkov <ppenkov@google.com>

This patch series hardens the RX stack by allowing flow dissection in BPF,
as previously discussed [1]. Because of the rigorous checks of the BPF
verifier, this provides significant security guarantees. In particular, the
BPF flow dissector cannot get inside of an infinite loop, as with
CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot
read outside of packet bounds, because all memory accesses are checked.
Also, with BPF the administrator can decide which protocols to support,
reducing potential attack surface. Rarely encountered protocols can be
excluded from dissection and the program can be updated without kernel
recompile or reboot if a bug is discovered.

Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect.
This includes a new BPF program and attach type.

Patch 2 adds a flow dissector program in BPF. This parses most protocols in
__skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports,
and address types).

Patch 3 adds a selftest that attaches the BPF program to the flow dissector
and sends traffic with different levels of encapsulation.

Performance Evaluation:
The in-kernel implementation was compared against the demo program from
patch 2 using the test in patch 3 with IPv4/UDP traffic over 10 seconds.
	$perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \
		-t 10

In-kernel Dissector:
	__skb_flow_dissect overhead: 2.12%
	Total Packets: 3,272,597 (from output of ./test_flow_dissector)

BPF Dissector:
	__skb_flow_dissect overhead: 1.63% 
	Total Packets: 3,232,356 (from output of ./test_flow_dissector)

No-op Dissector:
	__skb_flow_dissect overhead: 1.52% 
	Total Packets: 3,330,635 (from output of ./test_flow_dissector)

Changes since RFC:

1/ (Patch 1) Flow dissector hook is no longer global. Instead, it is
per-netns

2/ (Patch 1) Defined struct bpf_flow_keys to be used in BPF flow dissector
programs instead of exposing the internal flow keys layout. Added a
function to translate from bpf_flow_keys to the internal layout after BPF
dissection is complete. The pointer to this struct is stored in
qdisc_skb_cb rather than inside of the 20 byte control block which
simplifies verification and allows access to all 20 bytes of the cb.

3/ (Patch 2) Removed GUE parsing as it relied on a hardcoded port

4/ (Patch 2) MPLS parsing now stops at the first label which is consistent
with the in-kernel flow dissector

5/ (Patch 2) Refactored to use direct packet access and to write out to
struct bpf_flow_keys

[1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Petar Penkov (3):
  flow_dissector: implements flow dissector BPF hook
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

 include/linux/bpf.h                           |   1 +
 include/linux/bpf_types.h                     |   1 +
 include/linux/skbuff.h                        |   7 +
 include/net/net_namespace.h                   |   3 +
 include/net/sch_generic.h                     |  12 +-
 include/uapi/linux/bpf.h                      |  25 +
 kernel/bpf/syscall.c                          |   8 +
 kernel/bpf/verifier.c                         |  33 +
 net/core/filter.c                             |  67 ++
 net/core/flow_dissector.c                     | 136 +++
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |  25 +
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   8 +-
 tools/testing/selftests/bpf/bpf_flow.c        | 390 +++++++++
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 22 files changed, 1843 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

^ permalink raw reply

* [bpf-next 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov @ 2018-08-30 18:22 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov, Willem de Bruijn
In-Reply-To: <20180830182301.89435-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/bpf.h            |   1 +
 include/linux/bpf_types.h      |   1 +
 include/linux/skbuff.h         |   7 ++
 include/net/net_namespace.h    |   3 +
 include/net/sch_generic.h      |  12 ++-
 include/uapi/linux/bpf.h       |  25 ++++++
 kernel/bpf/syscall.c           |   8 ++
 kernel/bpf/verifier.c          |  33 ++++++++
 net/core/filter.c              |  67 ++++++++++++++++
 net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c       |   1 +
 tools/include/uapi/linux/bpf.h |  25 ++++++
 tools/lib/bpf/libbpf.c         |   2 +
 13 files changed, 318 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 523481a3471b..988a00797bcd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -212,6 +212,7 @@ enum bpf_reg_type {
 	PTR_TO_PACKET_META,	 /* skb->data - meta_len */
 	PTR_TO_PACKET,		 /* reg points to skb->data */
 	PTR_TO_PACKET_END,	 /* skb->data + headlen */
+	PTR_TO_FLOW_KEYS,	 /* reg points to bpf_flow_keys */
 };
 
 /* The information passed from prog-specific *_is_valid_access
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c090e7c0..22083712dd18 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
 #endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector)
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 17a13e4785fc..ce0e863f02a2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -243,6 +243,8 @@ struct scatterlist;
 struct pipe_inode_info;
 struct iov_iter;
 struct napi_struct;
+struct bpf_prog;
+union bpf_attr;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 			     const struct flow_dissector_key *key,
 			     unsigned int key_count);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog);
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr);
+
 bool __skb_flow_dissect(const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 9b5fdc50519a..99d4148e0f90 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -43,6 +43,7 @@ struct ctl_table_header;
 struct net_generic;
 struct uevent_sock;
 struct netns_ipvs;
+struct bpf_prog;
 
 
 #define NETDEV_HASHBITS    8
@@ -145,6 +146,8 @@ struct net {
 #endif
 	struct net_generic __rcu	*gen;
 
+	struct bpf_prog __rcu	*flow_dissector_prog;
+
 	/* Note : following structs are cache line aligned */
 #ifdef CONFIG_XFRM
 	struct netns_xfrm	xfrm;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index a6d00093f35e..1b81ba85fd2d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@ struct Qdisc_ops;
 struct qdisc_walker;
 struct tcf_walker;
 struct module;
+struct bpf_flow_keys;
 
 typedef int tc_setup_cb_t(enum tc_setup_type type,
 			  void *type_data, void *cb_priv);
@@ -307,9 +308,14 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
-	unsigned int		pkt_len;
-	u16			slave_dev_queue_mapping;
-	u16			tc_classid;
+	union {
+		struct {
+			unsigned int		pkt_len;
+			u16			slave_dev_queue_mapping;
+			u16			tc_classid;
+		};
+		struct bpf_flow_keys *flow_keys;
+	};
 #define QDISC_CB_PRIV_LEN 20
 	unsigned char		data[QDISC_CB_PRIV_LEN];
 };
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4eba27..3064706fcaaa 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
 	/* ... here. */
 
 	__u32 data_meta;
+	__u32 flow_keys;
 };
 
 struct bpf_tunnel_key {
@@ -2778,4 +2781,26 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+struct bpf_flow_keys {
+	__u16	thoff;
+	__u16	addr_proto;			/* ETH_P_* of valid addrs */
+	__u8	is_frag;
+	__u8	is_first_frag;
+	__u8	is_encap;
+	__be16	n_proto;
+	__u8	ip_proto;
+	union {
+		struct {
+			__be32	ipv4_src;
+			__be32	ipv4_dst;
+		};
+		struct {
+			__u32	ipv6_src[4];	/* in6_addr; network order */
+			__u32	ipv6_dst[4];	/* in6_addr; network order */
+		};
+	};
+	__be16	sport;
+	__be16	dport;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8339d81cba1d..043c1a72e382 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1616,6 +1616,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_LIRC_MODE2:
 		ptype = BPF_PROG_TYPE_LIRC_MODE2;
 		break;
+	case BPF_FLOW_DISSECTOR:
+		ptype = BPF_PROG_TYPE_FLOW_DISSECTOR;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1637,6 +1640,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_LIRC_MODE2:
 		ret = lirc_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
+		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
+		break;
 	default:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
@@ -1689,6 +1695,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		return sockmap_get_from_fd(attr, BPF_PROG_TYPE_SK_SKB, NULL);
 	case BPF_LIRC_MODE2:
 		return lirc_prog_detach(attr);
+	case BPF_FLOW_DISSECTOR:
+		return skb_flow_dissector_bpf_prog_detach(attr);
 	default:
 		return -EINVAL;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 92246117d2b0..ac9aeed325b5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -261,6 +261,7 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_PACKET]		= "pkt",
 	[PTR_TO_PACKET_META]	= "pkt_meta",
 	[PTR_TO_PACKET_END]	= "pkt_end",
+	[PTR_TO_FLOW_KEYS]	= "flow_keys",
 };
 
 static void print_liveness(struct bpf_verifier_env *env,
@@ -993,6 +994,7 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_PACKET:
 	case PTR_TO_PACKET_META:
 	case PTR_TO_PACKET_END:
+	case PTR_TO_FLOW_KEYS:
 	case CONST_PTR_TO_MAP:
 		return true;
 	default:
@@ -1321,6 +1323,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		if (meta)
 			return meta->pkt_access;
 
@@ -1404,6 +1407,18 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off,
 	return -EACCES;
 }
 
+static int check_flow_keys_access(struct bpf_verifier_env *env, int off,
+				  int size)
+{
+	if (size < 0 || off < 0 ||
+	    (u64)off + size > sizeof(struct bpf_flow_keys)) {
+		verbose(env, "invalid access to flow keys off=%d size=%d\n",
+			off, size);
+		return -EACCES;
+	}
+	return 0;
+}
+
 static bool __is_pointer_value(bool allow_ptr_leaks,
 			       const struct bpf_reg_state *reg)
 {
@@ -1505,6 +1520,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 		 * right in front, treat it the very same way.
 		 */
 		return check_pkt_ptr_alignment(env, reg, off, size, strict);
+	case PTR_TO_FLOW_KEYS:
+		pointer_desc = "flow keys ";
+		break;
 	case PTR_TO_MAP_VALUE:
 		pointer_desc = "value ";
 		break;
@@ -1778,6 +1796,17 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_packet_access(env, regno, off, size, false);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
+	} else if (reg->type == PTR_TO_FLOW_KEYS) {
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose(env, "R%d leaks addr into flow keys\n",
+				value_regno);
+			return -EACCES;
+		}
+
+		err = check_flow_keys_access(env, off, size);
+		if (!err && t == BPF_READ && value_regno >= 0)
+			mark_reg_unknown(env, regs, value_regno);
 	} else {
 		verbose(env, "R%d invalid mem access '%s'\n", regno,
 			reg_type_str[reg->type]);
@@ -1925,6 +1954,8 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	case PTR_TO_PACKET_META:
 		return check_packet_access(env, regno, reg->off, access_size,
 					   zero_size_allowed);
+	case PTR_TO_FLOW_KEYS:
+		return check_flow_keys_access(env, reg->off, access_size);
 	case PTR_TO_MAP_VALUE:
 		return check_map_access(env, regno, reg->off, access_size,
 					zero_size_allowed);
@@ -3976,6 +4007,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return true;
 	default:
 		return false;
@@ -4451,6 +4483,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_CTX:
 	case CONST_PTR_TO_MAP:
 	case PTR_TO_PACKET_END:
+	case PTR_TO_FLOW_KEYS:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
diff --git a/net/core/filter.c b/net/core/filter.c
index c25eb36f1320..0143b9c0c67e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5092,6 +5092,17 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
 static const struct bpf_func_proto *
 lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -5207,6 +5218,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		if (size != size_default)
 			return false;
 		break;
@@ -5235,6 +5247,7 @@ static bool sk_filter_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
 	}
@@ -5260,6 +5273,7 @@ static bool lwt_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		return false;
 	}
 
@@ -5470,6 +5484,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data_end):
 		info->reg_type = PTR_TO_PACKET_END;
 		break;
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
 	}
@@ -5671,6 +5686,7 @@ static bool sk_skb_is_valid_access(int off, int size,
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
 		return false;
 	}
 
@@ -5730,6 +5746,38 @@ static bool sk_msg_is_valid_access(int off, int size,
 	return true;
 }
 
+static bool flow_dissector_is_valid_access(int off, int size,
+					   enum bpf_access_type type,
+					   const struct bpf_prog *prog,
+					   struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+		case bpf_ctx_range(struct __sk_buff, flow_keys):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	switch (off) {
+	case bpf_ctx_range(struct __sk_buff, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case bpf_ctx_range(struct __sk_buff, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	case bpf_ctx_range(struct __sk_buff, flow_keys):
+		info->reg_type = PTR_TO_FLOW_KEYS;
+		break;
+	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+		return false;
+	}
+
+	return bpf_skb_is_valid_access(off, size, type, prog, info);
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -6024,6 +6072,15 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				      bpf_target_off(struct sock_common,
 						     skc_num, 2, target_size));
 		break;
+
+	case offsetof(struct __sk_buff, flow_keys):
+		off  = si->off;
+		off -= offsetof(struct __sk_buff, flow_keys);
+		off += offsetof(struct sk_buff, cb);
+		off += offsetof(struct qdisc_skb_cb, flow_keys);
+		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
+				      si->src_reg, off);
+		break;
 	}
 
 	return insn - insn_buf;
@@ -6987,6 +7044,16 @@ const struct bpf_verifier_ops sk_msg_verifier_ops = {
 const struct bpf_prog_ops sk_msg_prog_ops = {
 };
 
+const struct bpf_verifier_ops flow_dissector_verifier_ops = {
+	.get_func_proto		= flow_dissector_func_proto,
+	.is_valid_access	= flow_dissector_is_valid_access,
+	.convert_ctx_access	= bpf_convert_ctx_access,
+	.gen_ld_abs		= bpf_gen_ld_abs,
+};
+
+const struct bpf_prog_ops flow_dissector_prog_ops = {
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index ce9eeeb7c024..a5f7da69911a 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -25,6 +25,9 @@
 #include <net/flow_dissector.h>
 #include <scsi/fc/fc_fcoe.h>
 #include <uapi/linux/batadv_packet.h>
+#include <linux/bpf.h>
+
+static DEFINE_MUTEX(flow_dissector_mutex);
 
 static void dissector_set_key(struct flow_dissector *flow_dissector,
 			      enum flow_dissector_key_id key_id)
@@ -62,6 +65,44 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 }
 EXPORT_SYMBOL(skb_flow_dissector_init);
 
+int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
+				       struct bpf_prog *prog)
+{
+	struct bpf_prog *attached;
+	struct net *net;
+
+	net = current->nsproxy->net_ns;
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(net->flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (attached) {
+		/* Only one BPF program can be attached at a time */
+		mutex_unlock(&flow_dissector_mutex);
+		return -EEXIST;
+	}
+	rcu_assign_pointer(net->flow_dissector_prog, prog);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
+
+int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
+{
+	struct bpf_prog *attached;
+	struct net *net;
+
+	net = current->nsproxy->net_ns;
+	mutex_lock(&flow_dissector_mutex);
+	attached = rcu_dereference_protected(net->flow_dissector_prog,
+					     lockdep_is_held(&flow_dissector_mutex));
+	if (!attached) {
+		mutex_unlock(&flow_dissector_mutex);
+		return -ENOENT;
+	}
+	bpf_prog_put(attached);
+	RCU_INIT_POINTER(net->flow_dissector_prog, NULL);
+	mutex_unlock(&flow_dissector_mutex);
+	return 0;
+}
 /**
  * skb_flow_get_be16 - extract be16 entity
  * @skb: sk_buff to extract from
@@ -588,6 +629,60 @@ static bool skb_flow_dissect_allowed(int *num_hdrs)
 	return (*num_hdrs <= MAX_FLOW_DISSECT_HDRS);
 }
 
+static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
+				     struct flow_dissector *flow_dissector,
+				     void *target_container)
+{
+	struct flow_dissector_key_control *key_control;
+	struct flow_dissector_key_basic *key_basic;
+	struct flow_dissector_key_addrs *key_addrs;
+	struct flow_dissector_key_ports *key_ports;
+
+	key_control = skb_flow_dissector_target(flow_dissector,
+						FLOW_DISSECTOR_KEY_CONTROL,
+						target_container);
+	key_control->thoff = flow_keys->thoff;
+	if (flow_keys->is_frag)
+		key_control->flags |= FLOW_DIS_IS_FRAGMENT;
+	if (flow_keys->is_first_frag)
+		key_control->flags |= FLOW_DIS_FIRST_FRAG;
+	if (flow_keys->is_encap)
+		key_control->flags |= FLOW_DIS_ENCAPSULATION;
+
+	key_basic = skb_flow_dissector_target(flow_dissector,
+					      FLOW_DISSECTOR_KEY_BASIC,
+					      target_container);
+	key_basic->n_proto = flow_keys->n_proto;
+	key_basic->ip_proto = flow_keys->ip_proto;
+
+	if (flow_keys->addr_proto == ETH_P_IP &&
+	    dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
+		key_addrs = skb_flow_dissector_target(flow_dissector,
+						      FLOW_DISSECTOR_KEY_IPV4_ADDRS,
+						      target_container);
+		key_addrs->v4addrs.src = flow_keys->ipv4_src;
+		key_addrs->v4addrs.dst = flow_keys->ipv4_dst;
+		key_control->addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+	} else if (flow_keys->addr_proto == ETH_P_IPV6 &&
+		   dissector_uses_key(flow_dissector,
+				      FLOW_DISSECTOR_KEY_IPV6_ADDRS)) {
+		key_addrs = skb_flow_dissector_target(flow_dissector,
+						      FLOW_DISSECTOR_KEY_IPV6_ADDRS,
+						      target_container);
+		memcpy(&key_addrs->v6addrs, &flow_keys->ipv6_src,
+		       sizeof(key_addrs->v6addrs));
+		key_control->addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+	}
+
+	if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS)) {
+		key_ports = skb_flow_dissector_target(flow_dissector,
+						      FLOW_DISSECTOR_KEY_PORTS,
+						      target_container);
+		key_ports->src = flow_keys->sport;
+		key_ports->dst = flow_keys->dport;
+	}
+}
+
 /**
  * __skb_flow_dissect - extract the flow_keys struct and return it
  * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
@@ -619,6 +714,7 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 	struct flow_dissector_key_vlan *key_vlan;
 	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
+	struct bpf_prog *attached;
 	int num_hdrs = 0;
 	u8 ip_proto = 0;
 	bool ret;
@@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 					      FLOW_DISSECTOR_KEY_BASIC,
 					      target_container);
 
+	rcu_read_lock();
+	attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
+	if (attached) {
+		/* Note that even though the const qualifier is discarded
+		 * throughout the execution of the BPF program, all changes(the
+		 * control block) are reverted after the BPF program returns.
+		 * Therefore, __skb_flow_dissect does not alter the skb.
+		 */
+		struct bpf_flow_keys flow_keys = {};
+		struct qdisc_skb_cb cb_saved;
+		struct qdisc_skb_cb *cb;
+		u16 *pseudo_cb;
+		u32 result;
+
+		cb = qdisc_skb_cb(skb);
+		pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
+
+		/* Save Control Block */
+		memcpy(&cb_saved, cb, sizeof(cb_saved));
+		memset(cb, 0, sizeof(cb_saved));
+
+		/* Pass parameters to the BPF program */
+		cb->flow_keys = &flow_keys;
+		*pseudo_cb = nhoff;
+
+		bpf_compute_data_pointers((struct sk_buff *)skb);
+		result = BPF_PROG_RUN(attached, skb);
+
+		/* Restore state */
+		memcpy(cb, &cb_saved, sizeof(cb_saved));
+
+		__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
+					 target_container);
+		key_control->thoff = min_t(u16, key_control->thoff,
+					   skb ? skb->len : hlen);
+		rcu_read_unlock();
+		return result == BPF_OK;
+	}
+	rcu_read_unlock();
+
 	if (dissector_uses_key(flow_dissector,
 			       FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
 		struct ethhdr *eth = eth_hdr(skb);
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d22106..b1cd3bc8db70 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@ static const char * const prog_type_name[] = {
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_FLOW_DISSECTOR]	= "flow_dissector",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4eba27..3064706fcaaa 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_FLOW_DISSECTOR,
 };
 
 enum bpf_attach_type {
@@ -172,6 +173,7 @@ enum bpf_attach_type {
 	BPF_CGROUP_UDP4_SENDMSG,
 	BPF_CGROUP_UDP6_SENDMSG,
 	BPF_LIRC_MODE2,
+	BPF_FLOW_DISSECTOR,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -2333,6 +2335,7 @@ struct __sk_buff {
 	/* ... here. */
 
 	__u32 data_meta;
+	__u32 flow_keys;
 };
 
 struct bpf_tunnel_key {
@@ -2778,4 +2781,26 @@ enum bpf_task_fd_type {
 	BPF_FD_TYPE_URETPROBE,		/* filename + offset */
 };
 
+struct bpf_flow_keys {
+	__u16	thoff;
+	__u16	addr_proto;			/* ETH_P_* of valid addrs */
+	__u8	is_frag;
+	__u8	is_first_frag;
+	__u8	is_encap;
+	__be16	n_proto;
+	__u8	ip_proto;
+	union {
+		struct {
+			__be32	ipv4_src;
+			__be32	ipv4_dst;
+		};
+		struct {
+			__u32	ipv6_src[4];	/* in6_addr; network order */
+			__u32	ipv6_dst[4];	/* in6_addr; network order */
+		};
+	};
+	__be16	sport;
+	__be16	dport;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f112627..0c749ce1b717 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2121,6 +2122,7 @@ static const struct {
 	BPF_PROG_SEC("sk_skb",		BPF_PROG_TYPE_SK_SKB),
 	BPF_PROG_SEC("sk_msg",		BPF_PROG_TYPE_SK_MSG),
 	BPF_PROG_SEC("lirc_mode2",	BPF_PROG_TYPE_LIRC_MODE2),
+	BPF_PROG_SEC("flow_dissector",	BPF_PROG_TYPE_FLOW_DISSECTOR),
 	BPF_SA_PROG_SEC("cgroup/bind4",	BPF_CGROUP_INET4_BIND),
 	BPF_SA_PROG_SEC("cgroup/bind6",	BPF_CGROUP_INET6_BIND),
 	BPF_SA_PROG_SEC("cgroup/connect4", BPF_CGROUP_INET4_CONNECT),
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

^ permalink raw reply related

* [bpf-next 2/3] flow_dissector: implements eBPF parser
From: Petar Penkov @ 2018-08-30 18:23 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov, Willem de Bruijn
In-Reply-To: <20180830182301.89435-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This eBPF program extracts basic/control/ip address/ports keys from
incoming packets. It supports recursive parsing for IP encapsulation,
and VLAN, along with IPv4/IPv6 and extension headers.  This program is
meant to show how flow dissection and key extraction can be done in
eBPF.

Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/Makefile   |   2 +-
 tools/testing/selftests/bpf/bpf_flow.c | 390 +++++++++++++++++++++++++
 2 files changed, 391 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_flow.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index fff7fb1285fc..e65f50f9185e 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
 	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-	test_skb_cgroup_id_kern.o
+	test_skb_cgroup_id_kern.o bpf_flow.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c
new file mode 100644
index 000000000000..6d8388735d1d
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_flow.c
@@ -0,0 +1,390 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <limits.h>
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/pkt_cls.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/icmp.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/if_packet.h>
+#include <sys/socket.h>
+#include <linux/if_tunnel.h>
+#include <linux/mpls.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+int _version SEC("version") = 1;
+#define PROG(F) SEC(#F) int bpf_func_##F
+
+/* These are the identifiers of the BPF programs that will be used in tail
+ * calls. Name is limited to 16 characters, with the terminating character and
+ * bpf_func_ above, we have only 6 to work with, anything after will be cropped.
+ */
+enum {
+	IP,
+	IPV6,
+	IPV6OP,	/* Destination/Hop-by-Hop Options IPv6 Extension header */
+	IPV6FR,	/* Fragmentation IPv6 Extension Header */
+	MPLS,
+	VLAN,
+};
+
+#define IP_MF		0x2000
+#define IP_OFFSET	0x1FFF
+#define IP6_MF		0x0001
+#define IP6_OFFSET	0xFFF8
+
+struct vlan_hdr {
+	__be16 h_vlan_TCI;
+	__be16 h_vlan_encapsulated_proto;
+};
+
+struct gre_hdr {
+	__be16 flags;
+	__be16 proto;
+};
+
+struct frag_hdr {
+	__u8 nexthdr;
+	__u8 reserved;
+	__be16 frag_off;
+	__be32 identification;
+};
+
+struct bpf_map_def SEC("maps") jmp_table = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 8
+};
+
+struct bpf_dissect_cb {
+	__u16 nhoff;
+};
+
+static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb,
+							 __u16 hdr_size)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data_end = (__u8 *)(long)skb->data_end;
+	void *data = (__u8 *)(long)skb->data;
+	__u8 *hdr;
+
+	/* Verifies this variable offset does not overflow */
+	if (cb->nhoff > (USHRT_MAX - hdr_size))
+		return NULL;
+
+	hdr = data + cb->nhoff;
+	if (hdr + hdr_size > data_end)
+		return NULL;
+
+	return hdr;
+}
+
+/* Dispatches on ETHERTYPE */
+static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	keys->n_proto = proto;
+
+	switch (proto) {
+	case bpf_htons(ETH_P_IP):
+		bpf_tail_call(skb, &jmp_table, IP);
+		break;
+	case bpf_htons(ETH_P_IPV6):
+		bpf_tail_call(skb, &jmp_table, IPV6);
+		break;
+	case bpf_htons(ETH_P_MPLS_MC):
+	case bpf_htons(ETH_P_MPLS_UC):
+		bpf_tail_call(skb, &jmp_table, MPLS);
+		break;
+	case bpf_htons(ETH_P_8021Q):
+	case bpf_htons(ETH_P_8021AD):
+		bpf_tail_call(skb, &jmp_table, VLAN);
+		break;
+	default:
+		/* Protocol not supported */
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline void write_ports(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data = (__u8 *)(long)skb->data;
+	__be16 *ports = data + cb->nhoff; /* TCP/UDP start with the ports */
+
+	keys->sport = ports[0];
+	keys->dport = ports[1];
+}
+
+SEC("dissect")
+int dissect(struct __sk_buff *skb)
+{
+	if (!skb->vlan_present)
+		return parse_eth_proto(skb, skb->protocol);
+	else
+		return parse_eth_proto(skb, skb->vlan_proto);
+}
+
+/* Parses on IPPROTO_* */
+static __always_inline int parse_ip_proto(struct __sk_buff *skb, __u8 proto)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data_end = (void *)(long)skb->data_end;
+	struct icmphdr *icmp;
+	struct gre_hdr *gre;
+	struct ethhdr *eth;
+	struct tcphdr *tcp;
+	struct udphdr *udp;
+
+	keys->ip_proto = proto;
+	switch (proto) {
+	case IPPROTO_ICMP:
+		icmp = bpf_flow_dissect_get_header(skb, sizeof(*icmp));
+		if (!icmp)
+			return BPF_DROP;
+		return BPF_OK;
+	case IPPROTO_IPIP:
+		keys->is_encap = true;
+		return parse_eth_proto(skb, bpf_htons(ETH_P_IP));
+	case IPPROTO_IPV6:
+		keys->is_encap = true;
+		return parse_eth_proto(skb, bpf_htons(ETH_P_IPV6));
+	case IPPROTO_GRE:
+		gre = bpf_flow_dissect_get_header(skb, sizeof(*gre));
+		if (!gre)
+			return BPF_DROP;
+
+		if (bpf_htons(gre->flags & GRE_VERSION))
+			/* Only inspect standard GRE packets with version 0 */
+			return BPF_OK;
+
+		cb->nhoff += sizeof(*gre); /* Step over GRE Flags and Proto */
+		if (GRE_IS_CSUM(gre->flags))
+			cb->nhoff += 4; /* Step over chksum and Padding */
+		if (GRE_IS_KEY(gre->flags))
+			cb->nhoff += 4; /* Step over key */
+		if (GRE_IS_SEQ(gre->flags))
+			cb->nhoff += 4; /* Step over sequence number */
+
+		keys->is_encap = true;
+
+		if (gre->proto == bpf_htons(ETH_P_TEB)) {
+			eth = bpf_flow_dissect_get_header(skb, sizeof(*eth));
+			if (!eth)
+				return BPF_DROP;
+
+			cb->nhoff += sizeof(*eth);
+
+			return parse_eth_proto(skb, eth->h_proto);
+		} else {
+			return parse_eth_proto(skb, gre->proto);
+		}
+	case IPPROTO_TCP:
+		tcp = bpf_flow_dissect_get_header(skb, sizeof(*tcp));
+		if (!tcp)
+			return BPF_DROP;
+
+		if (tcp->doff < 5)
+			return BPF_DROP;
+
+		if ((__u8 *)tcp + (tcp->doff << 2) > data_end)
+			return BPF_DROP;
+
+		keys->thoff = cb->nhoff;
+		write_ports(skb);
+		return BPF_OK;
+	case IPPROTO_UDP:
+	case IPPROTO_UDPLITE:
+		udp = bpf_flow_dissect_get_header(skb, sizeof(*udp));
+		if (!udp)
+			return BPF_DROP;
+
+		keys->thoff = cb->nhoff;
+		write_ports(skb);
+		return BPF_OK;
+	default:
+		return BPF_DROP;
+	}
+
+	return BPF_DROP;
+}
+
+static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+
+	keys->ip_proto = nexthdr;
+	switch (nexthdr) {
+	case IPPROTO_HOPOPTS:
+	case IPPROTO_DSTOPTS:
+		bpf_tail_call(skb, &jmp_table, IPV6OP);
+		break;
+	case IPPROTO_FRAGMENT:
+		bpf_tail_call(skb, &jmp_table, IPV6FR);
+		break;
+	default:
+		return parse_ip_proto(skb, nexthdr);
+	}
+
+	return BPF_DROP;
+}
+
+PROG(IP)(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data_end = (__u8 *)(long)skb->data_end;
+	void *data = (__u8 *)(long)skb->data;
+	bool done = false;
+	struct iphdr *iph;
+
+	iph = bpf_flow_dissect_get_header(skb, sizeof(*iph));
+	if (!iph)
+		return BPF_DROP;
+
+	/* IP header cannot be smaller than 20 bytes */
+	if (iph->ihl < 5)
+		return BPF_DROP;
+
+	keys->addr_proto = ETH_P_IP;
+	keys->ipv4_src = iph->saddr;
+	keys->ipv4_dst = iph->daddr;
+
+	cb->nhoff += iph->ihl << 2;
+	if (data + cb->nhoff > data_end)
+		return BPF_DROP;
+
+	if (iph->frag_off & bpf_htons(IP_MF | IP_OFFSET)) {
+		keys->is_frag = true;
+		if (iph->frag_off & bpf_htons(IP_OFFSET))
+			/* From second fragment on, packets do not have headers
+			 * we can parse.
+			 */
+			done = true;
+		else
+			keys->is_first_frag = true;
+	}
+
+	if (done)
+		return BPF_OK;
+
+	return parse_ip_proto(skb, iph->protocol);
+}
+
+PROG(IPV6)(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct ipv6hdr *ip6h;
+
+	ip6h = bpf_flow_dissect_get_header(skb, sizeof(*ip6h));
+	if (!ip6h)
+		return BPF_DROP;
+
+	keys->addr_proto = ETH_P_IPV6;
+	memcpy(&keys->ipv6_src, &ip6h->saddr, 2*sizeof(ip6h->saddr));
+
+	cb->nhoff += sizeof(struct ipv6hdr);
+
+	return parse_ipv6_proto(skb, ip6h->nexthdr);
+}
+
+PROG(IPV6OP)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct ipv6_opt_hdr *ip6h;
+
+	ip6h = bpf_flow_dissect_get_header(skb, sizeof(*ip6h));
+	if (!ip6h)
+		return BPF_DROP;
+
+	/* hlen is in 8-octects and does not include the first 8 bytes
+	 * of the header
+	 */
+	cb->nhoff += (1 + ip6h->hdrlen) << 3;
+
+	return parse_ipv6_proto(skb, ip6h->nexthdr);
+}
+
+PROG(IPV6FR)(struct __sk_buff *skb)
+{
+	struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys);
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct frag_hdr *fragh;
+
+	fragh = bpf_flow_dissect_get_header(skb, sizeof(*fragh));
+	if (!fragh)
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(*fragh);
+	keys->is_frag = true;
+	if (!(fragh->frag_off & bpf_htons(IP6_OFFSET)))
+		keys->is_first_frag = true;
+
+	return parse_ipv6_proto(skb, fragh->nexthdr);
+}
+
+PROG(MPLS)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	void *data = (void *)(long)skb->data;
+	struct mpls_label *mpls;
+	__u8 *version;
+
+	mpls = bpf_flow_dissect_get_header(skb, sizeof(*mpls));
+	if (!mpls)
+		return BPF_DROP;
+
+	return BPF_OK;
+}
+
+PROG(VLAN)(struct __sk_buff *skb)
+{
+	struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb);
+	struct vlan_hdr *vlan;
+	__be16 proto;
+
+	/* Peek back to see if single or double-tagging */
+	if (bpf_skb_load_bytes(skb, cb->nhoff - sizeof(proto), &proto,
+			       sizeof(proto)))
+		return BPF_DROP;
+
+	/* Account for double-tagging */
+	if (proto == bpf_htons(ETH_P_8021AD)) {
+		vlan = bpf_flow_dissect_get_header(skb, sizeof(*vlan));
+		if (!vlan)
+			return BPF_DROP;
+
+		if (vlan->h_vlan_encapsulated_proto != bpf_htons(ETH_P_8021Q))
+			return BPF_DROP;
+
+		cb->nhoff += sizeof(*vlan);
+	}
+
+	vlan = bpf_flow_dissect_get_header(skb, sizeof(*vlan));
+	if (!vlan)
+		return BPF_DROP;
+
+	cb->nhoff += sizeof(*vlan);
+	/* Only allow 8021AD + 8021Q double tagging and no triple tagging.*/
+	if (vlan->h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021AD) ||
+	    vlan->h_vlan_encapsulated_proto == bpf_htons(ETH_P_8021Q))
+		return BPF_DROP;
+
+	return parse_eth_proto(skb, vlan->h_vlan_encapsulated_proto);
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

^ permalink raw reply related

* [bpf-next 3/3] selftests/bpf: test bpf flow dissection
From: Petar Penkov @ 2018-08-30 18:23 UTC (permalink / raw)
  To: netdev
  Cc: davem, ast, daniel, simon.horman, ecree, songliubraving, tom,
	Petar Penkov, Willem de Bruijn
In-Reply-To: <20180830182301.89435-1-peterpenkov96@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Adds a test that sends different types of packets over multiple
tunnels and verifies that valid packets are dissected correctly.  To do
so, a tc-flower rule is added to drop packets on UDP src port 9, and
packets are sent from ports 8, 9, and 10. Only the packets on port 9
should be dropped. Because tc-flower relies on the flow dissector to
match flows, correct classification demonstrates correct dissection.

Also add support logic to load the BPF program and to inject the test
packets.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   2 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/flow_dissector_load.c       | 140 ++++
 .../selftests/bpf/test_flow_dissector.c       | 782 ++++++++++++++++++
 .../selftests/bpf/test_flow_dissector.sh      | 115 +++
 tools/testing/selftests/bpf/with_addr.sh      |  54 ++
 tools/testing/selftests/bpf/with_tunnels.sh   |  36 +
 8 files changed, 1134 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c
 create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c
 create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh
 create mode 100755 tools/testing/selftests/bpf/with_addr.sh
 create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 49938d72cf63..e61a85ac4b79 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -19,3 +19,5 @@ test_btf
 test_sockmap
 test_lirc_mode2_user
 get_cgroup_id_user
+test_flow_dissector
+flow_dissector_load
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e65f50f9185e..fd3851d5c079 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \
 	test_tunnel.sh \
 	test_lwt_seg6local.sh \
 	test_lirc_mode2.sh \
-	test_skb_cgroup_id.sh
+	test_skb_cgroup_id.sh \
+	test_flow_dissector.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
+	flow_dissector_load test_flow_dissector
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index b4994a94968b..3655508f95fd 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m
 CONFIG_CRYPTO_SHA256=m
 CONFIG_VXLAN=y
 CONFIG_GENEVE=y
+CONFIG_NET_CLS_FLOWER=m
diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c
new file mode 100644
index 000000000000..d3273b5b3173
--- /dev/null
+++ b/tools/testing/selftests/bpf/flow_dissector_load.c
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <error.h>
+#include <errno.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector";
+const char *cfg_map_name = "jmp_table";
+bool cfg_attach = true;
+char *cfg_section_name;
+char *cfg_path_name;
+
+static void load_and_attach_program(void)
+{
+	struct bpf_program *prog, *main_prog;
+	struct bpf_map *prog_array;
+	int i, fd, prog_fd, ret;
+	struct bpf_object *obj;
+	int prog_array_fd;
+
+	ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj,
+			    &prog_fd);
+	if (ret)
+		error(1, 0, "bpf_prog_load %s", cfg_path_name);
+
+	main_prog = bpf_object__find_program_by_title(obj, cfg_section_name);
+	if (!main_prog)
+		error(1, 0, "bpf_object__find_program_by_title %s",
+		      cfg_section_name);
+
+	prog_fd = bpf_program__fd(main_prog);
+	if (prog_fd < 0)
+		error(1, 0, "bpf_program__fd");
+
+	prog_array = bpf_object__find_map_by_name(obj, cfg_map_name);
+	if (!prog_array)
+		error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name);
+
+	prog_array_fd = bpf_map__fd(prog_array);
+	if (prog_array_fd < 0)
+		error(1, 0, "bpf_map__fd %s", cfg_map_name);
+
+	i = 0;
+	bpf_object__for_each_program(prog, obj) {
+		fd = bpf_program__fd(prog);
+		if (fd < 0)
+			error(1, 0, "bpf_program__fd");
+
+		if (fd != prog_fd) {
+			printf("%d: %s\n", i, bpf_program__title(prog, false));
+			bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY);
+			++i;
+		}
+	}
+
+	ret = bpf_prog_attach(prog_fd, 0 /* Ignore */, BPF_FLOW_DISSECTOR, 0);
+	if (ret)
+		error(1, 0, "bpf_prog_attach %s", cfg_path_name);
+
+	ret = bpf_object__pin(obj, cfg_pin_path);
+	if (ret)
+		error(1, 0, "bpf_object__pin %s", cfg_pin_path);
+
+}
+
+static void detach_program(void)
+{
+	char command[64];
+	int ret;
+
+	ret = bpf_prog_detach(0, BPF_FLOW_DISSECTOR);
+	if (ret)
+		error(1, 0, "bpf_prog_detach");
+
+	/* To unpin, it is necessary and sufficient to just remove this dir */
+	sprintf(command, "rm -r %s", cfg_pin_path);
+	ret = system(command);
+	if (ret)
+		error(1, errno, command);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	bool attach = false;
+	bool detach = false;
+	int c;
+
+	while ((c = getopt(argc, argv, "adp:s:")) != -1) {
+		switch (c) {
+		case 'a':
+			if (detach)
+				error(1, 0, "attach/detach are exclusive");
+			attach = true;
+			break;
+		case 'd':
+			if (attach)
+				error(1, 0, "attach/detach are exclusive");
+			detach = true;
+			break;
+		case 'p':
+			if (cfg_path_name)
+				error(1, 0, "only one prog name can be given");
+
+			cfg_path_name = optarg;
+			break;
+		case 's':
+			if (cfg_section_name)
+				error(1, 0, "only one section can be given");
+
+			cfg_section_name = optarg;
+			break;
+		}
+	}
+
+	if (detach)
+		cfg_attach = false;
+
+	if (cfg_attach && !cfg_path_name)
+		error(1, 0, "must provide a path to the BPF program");
+
+	if (cfg_attach && !cfg_section_name)
+		error(1, 0, "must provide a section name");
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	if (cfg_attach)
+		load_and_attach_program();
+	else
+		detach_program();
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.c b/tools/testing/selftests/bpf/test_flow_dissector.c
new file mode 100644
index 000000000000..12b784afba31
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.c
@@ -0,0 +1,782 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Inject packets with all sorts of encapsulation into the kernel.
+ *
+ * IPv4/IPv6	outer layer 3
+ * GRE/GUE/BARE outer layer 4, where bare is IPIP/SIT/IPv4-in-IPv6/..
+ * IPv4/IPv6    inner layer 3
+ */
+
+#define _GNU_SOURCE
+
+#include <stddef.h>
+#include <arpa/inet.h>
+#include <asm/byteorder.h>
+#include <error.h>
+#include <errno.h>
+#include <linux/if_packet.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ipv6.h>
+#include <netinet/ip.h>
+#include <netinet/in.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define CFG_PORT_INNER	8000
+
+/* Add some protocol definitions that do not exist in userspace */
+
+struct grehdr {
+	uint16_t unused;
+	uint16_t protocol;
+} __attribute__((packed));
+
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:5,
+				control:1,
+				version:2;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:2,
+				control:1,
+				hlen:5;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8	proto_ctype;
+			__be16	flags;
+		};
+		__be32	word;
+	};
+};
+
+static uint8_t	cfg_dsfield_inner;
+static uint8_t	cfg_dsfield_outer;
+static uint8_t	cfg_encap_proto;
+static bool	cfg_expect_failure = false;
+static int	cfg_l3_extra = AF_UNSPEC;	/* optional SIT prefix */
+static int	cfg_l3_inner = AF_UNSPEC;
+static int	cfg_l3_outer = AF_UNSPEC;
+static int	cfg_num_pkt = 10;
+static int	cfg_num_secs = 0;
+static char	cfg_payload_char = 'a';
+static int	cfg_payload_len = 100;
+static int	cfg_port_gue = 6080;
+static bool	cfg_only_rx;
+static bool	cfg_only_tx;
+static int	cfg_src_port = 9;
+
+static char	buf[ETH_DATA_LEN];
+
+#define INIT_ADDR4(name, addr4, port)				\
+	static struct sockaddr_in name = {			\
+		.sin_family = AF_INET,				\
+		.sin_port = __constant_htons(port),		\
+		.sin_addr.s_addr = __constant_htonl(addr4),	\
+	};
+
+#define INIT_ADDR6(name, addr6, port)				\
+	static struct sockaddr_in6 name = {			\
+		.sin6_family = AF_INET6,			\
+		.sin6_port = __constant_htons(port),		\
+		.sin6_addr = addr6,				\
+	};
+
+INIT_ADDR4(in_daddr4, INADDR_LOOPBACK, CFG_PORT_INNER)
+INIT_ADDR4(in_saddr4, INADDR_LOOPBACK + 2, 0)
+INIT_ADDR4(out_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(out_saddr4, INADDR_LOOPBACK + 1, 0)
+INIT_ADDR4(extra_daddr4, INADDR_LOOPBACK, 0)
+INIT_ADDR4(extra_saddr4, INADDR_LOOPBACK + 1, 0)
+
+INIT_ADDR6(in_daddr6, IN6ADDR_LOOPBACK_INIT, CFG_PORT_INNER)
+INIT_ADDR6(in_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(out_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_daddr6, IN6ADDR_LOOPBACK_INIT, 0)
+INIT_ADDR6(extra_saddr6, IN6ADDR_LOOPBACK_INIT, 0)
+
+static unsigned long util_gettime(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void util_printaddr(const char *msg, struct sockaddr *addr)
+{
+	unsigned long off = 0;
+	char nbuf[INET6_ADDRSTRLEN];
+
+	switch (addr->sa_family) {
+	case PF_INET:
+		off = __builtin_offsetof(struct sockaddr_in, sin_addr);
+		break;
+	case PF_INET6:
+		off = __builtin_offsetof(struct sockaddr_in6, sin6_addr);
+		break;
+	default:
+		error(1, 0, "printaddr: unsupported family %u\n",
+		      addr->sa_family);
+	}
+
+	if (!inet_ntop(addr->sa_family, ((void *) addr) + off, nbuf,
+		       sizeof(nbuf)))
+		error(1, errno, "inet_ntop");
+
+	fprintf(stderr, "%s: %s\n", msg, nbuf);
+}
+
+static unsigned long add_csum_hword(const uint16_t *start, int num_u16)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_u16; i++)
+		sum += start[i];
+
+	return sum;
+}
+
+static uint16_t build_ip_csum(const uint16_t *start, int num_u16,
+			      unsigned long sum)
+{
+	sum += add_csum_hword(start, num_u16);
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	return ~sum;
+}
+
+static void build_ipv4_header(void *header, uint8_t proto,
+			      uint32_t src, uint32_t dst,
+			      int payload_len, uint8_t tos)
+{
+	struct iphdr *iph = header;
+
+	iph->ihl = 5;
+	iph->version = 4;
+	iph->tos = tos;
+	iph->ttl = 8;
+	iph->tot_len = htons(sizeof(*iph) + payload_len);
+	iph->id = htons(1337);
+	iph->protocol = proto;
+	iph->saddr = src;
+	iph->daddr = dst;
+	iph->check = build_ip_csum((void *) iph, iph->ihl << 1, 0);
+}
+
+static void ipv6_set_dsfield(struct ipv6hdr *ip6h, uint8_t dsfield)
+{
+	uint16_t val, *ptr = (uint16_t *)ip6h;
+
+	val = ntohs(*ptr);
+	val &= 0xF00F;
+	val |= ((uint16_t) dsfield) << 4;
+	*ptr = htons(val);
+}
+
+static void build_ipv6_header(void *header, uint8_t proto,
+			      struct sockaddr_in6 *src,
+			      struct sockaddr_in6 *dst,
+			      int payload_len, uint8_t dsfield)
+{
+	struct ipv6hdr *ip6h = header;
+
+	ip6h->version = 6;
+	ip6h->payload_len = htons(payload_len);
+	ip6h->nexthdr = proto;
+	ip6h->hop_limit = 8;
+	ipv6_set_dsfield(ip6h, dsfield);
+
+	memcpy(&ip6h->saddr, &src->sin6_addr, sizeof(ip6h->saddr));
+	memcpy(&ip6h->daddr, &dst->sin6_addr, sizeof(ip6h->daddr));
+}
+
+static uint16_t build_udp_v4_csum(const struct iphdr *iph,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(iph->saddr);	/* halfwords: twice byte len */
+
+	pseudo_sum = add_csum_hword((void *) &iph->saddr, num_u16);
+	pseudo_sum += htons(IPPROTO_UDP);
+	pseudo_sum += udph->len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static uint16_t build_udp_v6_csum(const struct ipv6hdr *ip6h,
+				  const struct udphdr *udph,
+				  int num_words)
+{
+	unsigned long pseudo_sum;
+	int num_u16 = sizeof(ip6h->saddr);	/* halfwords: twice byte len */
+
+	pseudo_sum = add_csum_hword((void *) &ip6h->saddr, num_u16);
+	pseudo_sum += htons(ip6h->nexthdr);
+	pseudo_sum += ip6h->payload_len;
+	return build_ip_csum((void *) udph, num_words, pseudo_sum);
+}
+
+static void build_udp_header(void *header, int payload_len,
+			     uint16_t dport, int family)
+{
+	struct udphdr *udph = header;
+	int len = sizeof(*udph) + payload_len;
+
+	udph->source = htons(cfg_src_port);
+	udph->dest = htons(dport);
+	udph->len = htons(len);
+	udph->check = 0;
+	if (family == AF_INET)
+		udph->check = build_udp_v4_csum(header - sizeof(struct iphdr),
+						udph, len >> 1);
+	else
+		udph->check = build_udp_v6_csum(header - sizeof(struct ipv6hdr),
+						udph, len >> 1);
+}
+
+static void build_gue_header(void *header, uint8_t proto)
+{
+	struct guehdr *gueh = header;
+
+	gueh->proto_ctype = proto;
+}
+
+static void build_gre_header(void *header, uint16_t proto)
+{
+	struct grehdr *greh = header;
+
+	greh->protocol = htons(proto);
+}
+
+static int l3_length(int family)
+{
+	if (family == AF_INET)
+		return sizeof(struct iphdr);
+	else
+		return sizeof(struct ipv6hdr);
+}
+
+static int build_packet(void)
+{
+	int ol3_len = 0, ol4_len = 0, il3_len = 0, il4_len = 0;
+	int el3_len = 0;
+
+	if (cfg_l3_extra)
+		el3_len = l3_length(cfg_l3_extra);
+
+	/* calculate header offsets */
+	if (cfg_encap_proto) {
+		ol3_len = l3_length(cfg_l3_outer);
+
+		if (cfg_encap_proto == IPPROTO_GRE)
+			ol4_len = sizeof(struct grehdr);
+		else if (cfg_encap_proto == IPPROTO_UDP)
+			ol4_len = sizeof(struct udphdr) + sizeof(struct guehdr);
+	}
+
+	il3_len = l3_length(cfg_l3_inner);
+	il4_len = sizeof(struct udphdr);
+
+	if (el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len >=
+	    sizeof(buf))
+		error(1, 0, "packet too large\n");
+
+	/*
+	 * Fill packet from inside out, to calculate correct checksums.
+	 * But create ip before udp headers, as udp uses ip for pseudo-sum.
+	 */
+	memset(buf + el3_len + ol3_len + ol4_len + il3_len + il4_len,
+	       cfg_payload_char, cfg_payload_len);
+
+	/* add zero byte for udp csum padding */
+	buf[el3_len + ol3_len + ol4_len + il3_len + il4_len + cfg_payload_len] = 0;
+
+	switch (cfg_l3_inner) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  in_saddr4.sin_addr.s_addr,
+				  in_daddr4.sin_addr.s_addr,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len + ol3_len + ol4_len,
+				  IPPROTO_UDP,
+				  &in_saddr6, &in_daddr6,
+				  il4_len + cfg_payload_len,
+				  cfg_dsfield_inner);
+		break;
+	}
+
+	build_udp_header(buf + el3_len + ol3_len + ol4_len + il3_len,
+			 cfg_payload_len, CFG_PORT_INNER, cfg_l3_inner);
+
+	if (!cfg_encap_proto)
+		return il3_len + il4_len + cfg_payload_len;
+
+	switch (cfg_l3_outer) {
+	case PF_INET:
+		build_ipv4_header(buf + el3_len, cfg_encap_proto,
+				  out_saddr4.sin_addr.s_addr,
+				  out_daddr4.sin_addr.s_addr,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf + el3_len, cfg_encap_proto,
+				  &out_saddr6, &out_daddr6,
+				  ol4_len + il3_len + il4_len + cfg_payload_len,
+				  cfg_dsfield_outer);
+		break;
+	}
+
+	switch (cfg_encap_proto) {
+	case IPPROTO_UDP:
+		build_gue_header(buf + el3_len + ol3_len + ol4_len -
+				 sizeof(struct guehdr),
+				 cfg_l3_inner == PF_INET ? IPPROTO_IPIP
+							 : IPPROTO_IPV6);
+		build_udp_header(buf + el3_len + ol3_len,
+				 sizeof(struct guehdr) + il3_len + il4_len +
+				 cfg_payload_len,
+				 cfg_port_gue, cfg_l3_outer);
+		break;
+	case IPPROTO_GRE:
+		build_gre_header(buf + el3_len + ol3_len,
+				 cfg_l3_inner == PF_INET ? ETH_P_IP
+							 : ETH_P_IPV6);
+		break;
+	}
+
+	switch (cfg_l3_extra) {
+	case PF_INET:
+		build_ipv4_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  extra_saddr4.sin_addr.s_addr,
+				  extra_daddr4.sin_addr.s_addr,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	case PF_INET6:
+		build_ipv6_header(buf,
+				  cfg_l3_outer == PF_INET ? IPPROTO_IPIP
+							  : IPPROTO_IPV6,
+				  &extra_saddr6, &extra_daddr6,
+				  ol3_len + ol4_len + il3_len + il4_len +
+				  cfg_payload_len, 0);
+		break;
+	}
+
+	return el3_len + ol3_len + ol4_len + il3_len + il4_len +
+	       cfg_payload_len;
+}
+
+/* sender transmits encapsulated over RAW or unencap'd over UDP */
+static int setup_tx(void)
+{
+	int family, fd, ret;
+
+	if (cfg_l3_extra)
+		family = cfg_l3_extra;
+	else if (cfg_l3_outer)
+		family = cfg_l3_outer;
+	else
+		family = cfg_l3_inner;
+
+	fd = socket(family, SOCK_RAW, IPPROTO_RAW);
+	if (fd == -1)
+		error(1, errno, "socket tx");
+
+	if (cfg_l3_extra) {
+		if (cfg_l3_extra == PF_INET)
+			ret = connect(fd, (void *) &extra_daddr4,
+				      sizeof(extra_daddr4));
+		else
+			ret = connect(fd, (void *) &extra_daddr6,
+				      sizeof(extra_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else if (cfg_l3_outer) {
+		/* connect to destination if not encapsulated */
+		if (cfg_l3_outer == PF_INET)
+			ret = connect(fd, (void *) &out_daddr4,
+				      sizeof(out_daddr4));
+		else
+			ret = connect(fd, (void *) &out_daddr6,
+				      sizeof(out_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	} else {
+		/* otherwise using loopback */
+		if (cfg_l3_inner == PF_INET)
+			ret = connect(fd, (void *) &in_daddr4,
+				      sizeof(in_daddr4));
+		else
+			ret = connect(fd, (void *) &in_daddr6,
+				      sizeof(in_daddr6));
+		if (ret)
+			error(1, errno, "connect tx");
+	}
+
+	return fd;
+}
+
+/* receiver reads unencapsulated UDP */
+static int setup_rx(void)
+{
+	int fd, ret;
+
+	fd = socket(cfg_l3_inner, SOCK_DGRAM, 0);
+	if (fd == -1)
+		error(1, errno, "socket rx");
+
+	if (cfg_l3_inner == PF_INET)
+		ret = bind(fd, (void *) &in_daddr4, sizeof(in_daddr4));
+	else
+		ret = bind(fd, (void *) &in_daddr6, sizeof(in_daddr6));
+	if (ret)
+		error(1, errno, "bind rx");
+
+	return fd;
+}
+
+static int do_tx(int fd, const char *pkt, int len)
+{
+	int ret;
+
+	ret = write(fd, pkt, len);
+	if (ret == -1)
+		error(1, errno, "send");
+	if (ret != len)
+		error(1, errno, "send: len (%d < %d)\n", ret, len);
+
+	return 1;
+}
+
+static int do_poll(int fd, short events, int timeout)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.fd = fd;
+	pfd.events = events;
+
+	ret = poll(&pfd, 1, timeout);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret && !(pfd.revents & POLLIN))
+		error(1, errno, "poll: unexpected event 0x%x\n", pfd.revents);
+
+	return ret;
+}
+
+static int do_rx(int fd)
+{
+	char rbuf;
+	int ret, num = 0;
+
+	while (1) {
+		ret = recv(fd, &rbuf, 1, MSG_DONTWAIT);
+		if (ret == -1 && errno == EAGAIN)
+			break;
+		if (ret == -1)
+			error(1, errno, "recv");
+		if (rbuf != cfg_payload_char)
+			error(1, 0, "recv: payload mismatch");
+		num++;
+	};
+
+	return num;
+}
+
+static int do_main(void)
+{
+	unsigned long tstop, treport, tcur;
+	int fdt = -1, fdr = -1, len, tx = 0, rx = 0;
+
+	if (!cfg_only_tx)
+		fdr = setup_rx();
+	if (!cfg_only_rx)
+		fdt = setup_tx();
+
+	len = build_packet();
+
+	tcur = util_gettime();
+	treport = tcur + 1000;
+	tstop = tcur + (cfg_num_secs * 1000);
+
+	while (1) {
+		if (!cfg_only_rx)
+			tx += do_tx(fdt, buf, len);
+
+		if (!cfg_only_tx)
+			rx += do_rx(fdr);
+
+		if (cfg_num_secs) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+			if (tcur >= treport) {
+				fprintf(stderr, "pkts: tx=%u rx=%u\n", tx, rx);
+				tx = 0;
+				rx = 0;
+				treport = tcur + 1000;
+			}
+		} else {
+			if (tx == cfg_num_pkt)
+				break;
+		}
+	}
+
+	/* read straggler packets, if any */
+	if (rx < tx) {
+		tstop = util_gettime() + 100;
+		while (rx < tx) {
+			tcur = util_gettime();
+			if (tcur >= tstop)
+				break;
+
+			do_poll(fdr, POLLIN, tstop - tcur);
+			rx += do_rx(fdr);
+		}
+	}
+
+	fprintf(stderr, "pkts: tx=%u rx=%u\n", tx, rx);
+
+	if (fdr != -1 && close(fdr))
+		error(1, errno, "close rx");
+	if (fdt != -1 && close(fdt))
+		error(1, errno, "close tx");
+
+	/*
+	 * success (== 0) only if received all packets
+	 * unless failure is expected, in which case none must arrive.
+	 */
+	if (cfg_expect_failure)
+		return rx != 0;
+	else
+		return rx != tx;
+}
+
+
+static void __attribute__((noreturn)) usage(const char *filepath)
+{
+	fprintf(stderr, "Usage: %s [-e gre|gue|bare|none] [-i 4|6] [-l len] "
+			"[-O 4|6] [-o 4|6] [-n num] [-t secs] [-R] [-T] "
+			"[-s <osrc> [-d <odst>] [-S <isrc>] [-D <idst>] "
+			"[-x <otos>] [-X <itos>] [-f <isport>] [-F]\n",
+		filepath);
+	exit(1);
+}
+
+static void parse_addr(int family, void *addr, const char *optarg)
+{
+	int ret;
+
+	ret = inet_pton(family, optarg, addr);
+	if (ret == -1)
+		error(1, errno, "inet_pton");
+	if (ret == 0)
+		error(1, 0, "inet_pton: bad string");
+}
+
+static void parse_addr4(struct sockaddr_in *addr, const char *optarg)
+{
+	parse_addr(AF_INET, &addr->sin_addr, optarg);
+}
+
+static void parse_addr6(struct sockaddr_in6 *addr, const char *optarg)
+{
+	parse_addr(AF_INET6, &addr->sin6_addr, optarg);
+}
+
+static int parse_protocol_family(const char *filepath, const char *optarg)
+{
+	if (!strcmp(optarg, "4"))
+		return PF_INET;
+	if (!strcmp(optarg, "6"))
+		return PF_INET6;
+
+	usage(filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "d:D:e:f:Fhi:l:n:o:O:Rs:S:t:Tx:X:")) != -1) {
+		switch (c) {
+		case 'd':
+			if (cfg_l3_outer == AF_UNSPEC)
+				error(1, 0, "-d must be preceded by -o");
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_daddr4, optarg);
+			else
+				parse_addr6(&out_daddr6, optarg);
+			break;
+		case 'D':
+			if (cfg_l3_inner == AF_UNSPEC)
+				error(1, 0, "-D must be preceded by -i");
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_daddr4, optarg);
+			else
+				parse_addr6(&in_daddr6, optarg);
+			break;
+		case 'e':
+			if (!strcmp(optarg, "gre"))
+				cfg_encap_proto = IPPROTO_GRE;
+			else if (!strcmp(optarg, "gue"))
+				cfg_encap_proto = IPPROTO_UDP;
+			else if (!strcmp(optarg, "bare"))
+				cfg_encap_proto = IPPROTO_IPIP;
+			else if (!strcmp(optarg, "none"))
+				cfg_encap_proto = IPPROTO_IP;	/* == 0 */
+			else
+				usage(argv[0]);
+			break;
+		case 'f':
+			cfg_src_port = strtol(optarg, NULL, 0);
+			break;
+		case 'F':
+			cfg_expect_failure = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			break;
+		case 'i':
+			if (!strcmp(optarg, "4"))
+				cfg_l3_inner = PF_INET;
+			else if (!strcmp(optarg, "6"))
+				cfg_l3_inner = PF_INET6;
+			else
+				usage(argv[0]);
+			break;
+		case 'l':
+			cfg_payload_len = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			cfg_num_pkt = strtol(optarg, NULL, 0);
+			break;
+		case 'o':
+			cfg_l3_outer = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'O':
+			cfg_l3_extra = parse_protocol_family(argv[0], optarg);
+			break;
+		case 'R':
+			cfg_only_rx = true;
+			break;
+		case 's':
+			if (cfg_l3_outer == AF_INET)
+				parse_addr4(&out_saddr4, optarg);
+			else
+				parse_addr6(&out_saddr6, optarg);
+			break;
+		case 'S':
+			if (cfg_l3_inner == AF_INET)
+				parse_addr4(&in_saddr4, optarg);
+			else
+				parse_addr6(&in_saddr6, optarg);
+			break;
+		case 't':
+			cfg_num_secs = strtol(optarg, NULL, 0);
+			break;
+		case 'T':
+			cfg_only_tx = true;
+			break;
+		case 'x':
+			cfg_dsfield_outer = strtol(optarg, NULL, 0);
+			break;
+		case 'X':
+			cfg_dsfield_inner = strtol(optarg, NULL, 0);
+			break;
+		}
+	}
+
+	if (cfg_only_rx && cfg_only_tx)
+		error(1, 0, "options: cannot combine rx-only and tx-only");
+
+	if (cfg_encap_proto && cfg_l3_outer == AF_UNSPEC)
+		error(1, 0, "options: must specify outer with encap");
+	else if ((!cfg_encap_proto) && cfg_l3_outer != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and outer");
+	else if ((!cfg_encap_proto) && cfg_l3_extra != AF_UNSPEC)
+		error(1, 0, "options: cannot combine no-encap and extra");
+
+	if (cfg_l3_inner == AF_UNSPEC)
+		cfg_l3_inner = AF_INET6;
+	if (cfg_l3_inner == AF_INET6 && cfg_encap_proto == IPPROTO_IPIP)
+		cfg_encap_proto = IPPROTO_IPV6;
+
+	/* RFC 6040 4.2:
+	 *   on decap, if outer encountered congestion (CE == 0x3),
+	 *   but inner cannot encode ECN (NoECT == 0x0), then drop packet.
+	 */
+	if (((cfg_dsfield_outer & 0x3) == 0x3) &&
+	    ((cfg_dsfield_inner & 0x3) == 0x0))
+		cfg_expect_failure = true;
+}
+
+static void print_opts(void)
+{
+	if (cfg_l3_inner == PF_INET6) {
+		util_printaddr("inner.dest6", (void *) &in_daddr6);
+		util_printaddr("inner.source6", (void *) &in_saddr6);
+	} else {
+		util_printaddr("inner.dest4", (void *) &in_daddr4);
+		util_printaddr("inner.source4", (void *) &in_saddr4);
+	}
+
+	if (!cfg_l3_outer)
+		return;
+
+	fprintf(stderr, "encap proto:   %u\n", cfg_encap_proto);
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("outer.dest6", (void *) &out_daddr6);
+		util_printaddr("outer.source6", (void *) &out_saddr6);
+	} else {
+		util_printaddr("outer.dest4", (void *) &out_daddr4);
+		util_printaddr("outer.source4", (void *) &out_saddr4);
+	}
+
+	if (!cfg_l3_extra)
+		return;
+
+	if (cfg_l3_outer == PF_INET6) {
+		util_printaddr("extra.dest6", (void *) &extra_daddr6);
+		util_printaddr("extra.source6", (void *) &extra_saddr6);
+	} else {
+		util_printaddr("extra.dest4", (void *) &extra_daddr4);
+		util_printaddr("extra.source4", (void *) &extra_saddr4);
+	}
+
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+	print_opts();
+	return do_main();
+}
diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh b/tools/testing/selftests/bpf/test_flow_dissector.sh
new file mode 100755
index 000000000000..c0fb073b5eab
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_flow_dissector.sh
@@ -0,0 +1,115 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Load BPF flow dissector and verify it correctly dissects traffic
+export TESTNAME=test_flow_dissector
+unmount=0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+msg="skip all tests:"
+if [ $UID != 0 ]; then
+	echo $msg please run this as root >&2
+	exit $ksft_skip
+fi
+
+# This test needs to be run in a network namespace with in_netns.sh. Check if
+# this is the case and run it with in_netns.sh if it is being run in the root
+# namespace.
+if [[ -z $(ip netns identify $$) ]]; then
+	../net/in_netns.sh "$0" "$@"
+	exit $?
+fi
+
+# Determine selftest success via shell exit code
+exit_handler()
+{
+	if (( $? == 0 )); then
+		echo "selftests: $TESTNAME [PASS]";
+	else
+		echo "selftests: $TESTNAME [FAILED]";
+	fi
+
+	set +e
+
+	# Cleanup
+	tc filter del dev lo ingress pref 1337 2> /dev/null
+	tc qdisc del dev lo ingress 2> /dev/null
+	./flow_dissector_load -d 2> /dev/null
+	if [ $unmount -ne 0 ]; then
+		umount bpffs 2> /dev/null
+	fi
+}
+
+# Exit script immediately (well catched by trap handler) if any
+# program/thing exits with a non-zero status.
+set -e
+
+# (Use 'trap -l' to list meaning of numbers)
+trap exit_handler 0 2 3 6 9
+
+# Mount BPF file system
+if /bin/mount | grep /sys/fs/bpf > /dev/null; then
+	echo "bpffs already mounted"
+else
+	echo "bpffs not mounted. Mounting..."
+	unmount=1
+	/bin/mount bpffs /sys/fs/bpf -t bpf
+fi
+
+# Attach BPF program
+./flow_dissector_load -p bpf_flow.o -s dissect
+
+# Setup
+tc qdisc add dev lo ingress
+
+echo "Testing IPv4..."
+# Drops all IP/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ip pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv4/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 4 -f 8
+# Send 10 IPv4/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 4 -f 9 -F
+# Send 10 IPv4/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 4 -f 10
+
+echo "Testing IPIP..."
+# Send 10 IPv4/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e bare -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+echo "Testing IPv4 + GRE..."
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 8. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 8
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 9. Filter should drop all.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 9 -F
+# Send 10 IPv4/GRE/IPv4/UDP packets from port 10. Filter should not drop any.
+./with_addr.sh ./with_tunnels.sh ./test_flow_dissector -o 4 -e gre -i 4 \
+	-D 192.168.0.1 -S 1.1.1.1 -f 10
+
+tc filter del dev lo ingress pref 1337
+
+echo "Testing IPv6..."
+# Drops all IPv6/UDP packets coming from port 9
+tc filter add dev lo parent ffff: protocol ipv6 pref 1337 flower ip_proto \
+	udp src_port 9 action drop
+
+# Send 10 IPv6/UDP packets from port 8. Filter should not drop any.
+./test_flow_dissector -i 6 -f 8
+# Send 10 IPv6/UDP packets from port 9. Filter should drop all.
+./test_flow_dissector -i 6 -f 9 -F
+# Send 10 IPv6/UDP packets from port 10. Filter should not drop any.
+./test_flow_dissector -i 6 -f 10
+
+exit 0
diff --git a/tools/testing/selftests/bpf/with_addr.sh b/tools/testing/selftests/bpf/with_addr.sh
new file mode 100755
index 000000000000..ffcd3953f94c
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_addr.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# add private ipv4 and ipv6 addresses to loopback
+
+readonly V6_INNER='100::a/128'
+readonly V4_INNER='192.168.0.1/32'
+
+if getopts ":s" opt; then
+  readonly SIT_DEV_NAME='sixtofourtest0'
+  readonly V6_SIT='2::/64'
+  readonly V4_SIT='172.17.0.1/32'
+  shift
+fi
+
+fail() {
+  echo "error: $*" 1>&2
+  exit 1
+}
+
+setup() {
+  ip -6 addr add "${V6_INNER}" dev lo || fail 'failed to setup v6 address'
+  ip -4 addr add "${V4_INNER}" dev lo || fail 'failed to setup v4 address'
+
+  if [[ -n "${V6_SIT}" ]]; then
+    ip link add "${SIT_DEV_NAME}" type sit remote any local any \
+	    || fail 'failed to add sit'
+    ip link set dev "${SIT_DEV_NAME}" up \
+	    || fail 'failed to bring sit device up'
+    ip -6 addr add "${V6_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v6 SIT address'
+    ip -4 addr add "${V4_SIT}" dev "${SIT_DEV_NAME}" \
+	    || fail 'failed to setup v4 SIT address'
+  fi
+
+  sleep 2	# avoid race causing bind to fail
+}
+
+cleanup() {
+  if [[ -n "${V6_SIT}" ]]; then
+    ip -4 addr del "${V4_SIT}" dev "${SIT_DEV_NAME}"
+    ip -6 addr del "${V6_SIT}" dev "${SIT_DEV_NAME}"
+    ip link del "${SIT_DEV_NAME}"
+  fi
+
+  ip -4 addr del "${V4_INNER}" dev lo
+  ip -6 addr del "${V6_INNER}" dev lo
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
diff --git a/tools/testing/selftests/bpf/with_tunnels.sh b/tools/testing/selftests/bpf/with_tunnels.sh
new file mode 100755
index 000000000000..e24949ed3a20
--- /dev/null
+++ b/tools/testing/selftests/bpf/with_tunnels.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# setup tunnels for flow dissection test
+
+readonly SUFFIX="test_$(mktemp -u XXXX)"
+CONFIG="remote 127.0.0.2 local 127.0.0.1 dev lo"
+
+setup() {
+  ip link add "ipip_${SUFFIX}" type ipip ${CONFIG}
+  ip link add "gre_${SUFFIX}" type gre ${CONFIG}
+  ip link add "sit_${SUFFIX}" type sit ${CONFIG}
+
+  echo "tunnels before test:"
+  ip tunnel show
+
+  ip link set "ipip_${SUFFIX}" up
+  ip link set "gre_${SUFFIX}" up
+  ip link set "sit_${SUFFIX}" up
+}
+
+
+cleanup() {
+  ip tunnel del "ipip_${SUFFIX}"
+  ip tunnel del "gre_${SUFFIX}"
+  ip tunnel del "sit_${SUFFIX}"
+
+  echo "tunnels after test:"
+  ip tunnel show
+}
+
+trap cleanup EXIT
+
+setup
+"$@"
+exit "$?"
-- 
2.19.0.rc0.228.g281dcd1b4d0-goog

^ permalink raw reply related

* KMSAN: uninit-value in rds_connect
From: syzbot @ 2018-08-30 18:31 UTC (permalink / raw)
  To: davem, linux-kernel, linux-rdma, netdev, rds-devel,
	santosh.shilimkar, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    2dca2cbde67a kmsan: fix build warnings with CONFIG_KMSAN=n
git tree:       https://github.com/google/kmsan.git/master
console output: https://syzkaller.appspot.com/x/log.txt?x=1519734a400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=820d6393634b55e3
dashboard link: https://syzkaller.appspot.com/bug?extid=0049bebbf3042dbd2e8f
compiler:       clang version 8.0.0 (trunk 339414)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+0049bebbf3042dbd2e8f@syzkaller.appspotmail.com

RDX: 0000000000000000 RSI: 0000000020000380 RDI: 00000000000001ff
RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 00000000004cd2e8 R14: 00000000004c3d93 R15: 0000000000000003
==================================================================
BUG: KMSAN: uninit-value in rds_connect+0x213/0x950 net/rds/af_rds.c:511
CPU: 0 PID: 8117 Comm: syz-executor6 Not tainted 4.19.0-rc1+ #36
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x14b/0x190 lib/dump_stack.c:113
  kmsan_report+0x183/0x2b0 mm/kmsan/kmsan.c:956
  __msan_warning+0x70/0xc0 mm/kmsan/kmsan_instr.c:645
  rds_connect+0x213/0x950 net/rds/af_rds.c:511
  __sys_connect+0x64e/0x750 net/socket.c:1662
  __do_sys_connect net/socket.c:1673 [inline]
  __se_sys_connect net/socket.c:1670 [inline]
  __x64_sys_connect+0xd8/0x120 net/socket.c:1670
  do_syscall_64+0x15b/0x220 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x457089
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f7090fc1c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 00007f7090fc26d4 RCX: 0000000000457089
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000004cb710 R14: 00000000004c320b R15: 0000000000000000

Local variable description: ----address@__sys_connect
Variable was created at:
  __sys_connect+0x6a/0x750 net/socket.c:1645
  __do_sys_connect net/socket.c:1673 [inline]
  __se_sys_connect net/socket.c:1670 [inline]
  __x64_sys_connect+0xd8/0x120 net/socket.c:1670
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* KMSAN: uninit-value in rds_bind
From: syzbot @ 2018-08-30 18:31 UTC (permalink / raw)
  To: davem, linux-kernel, linux-rdma, netdev, rds-devel,
	santosh.shilimkar, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    2dca2cbde67a kmsan: fix build warnings with CONFIG_KMSAN=n
git tree:       https://github.com/google/kmsan.git/master
console output: https://syzkaller.appspot.com/x/log.txt?x=16db895a400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=820d6393634b55e3
dashboard link: https://syzkaller.appspot.com/bug?extid=915c9f99f3dbc4bd6cd1
compiler:       clang version 8.0.0 (trunk 339414)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1137bffe400000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1521a7fe400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+915c9f99f3dbc4bd6cd1@syzkaller.appspotmail.com

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
==================================================================
BUG: KMSAN: uninit-value in rds_bind+0x1f1/0x2360 net/rds/bind.c:174
CPU: 1 PID: 4497 Comm: syz-executor098 Not tainted 4.19.0-rc1+ #36
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x14b/0x190 lib/dump_stack.c:113
  kmsan_report+0x183/0x2b0 mm/kmsan/kmsan.c:956
  __msan_warning+0x70/0xc0 mm/kmsan/kmsan_instr.c:645
  rds_bind+0x1f1/0x2360 net/rds/bind.c:174
  __sys_bind+0x594/0x6f0 net/socket.c:1481
  __do_sys_bind net/socket.c:1492 [inline]
  __se_sys_bind net/socket.c:1490 [inline]
  __x64_sys_bind+0xd8/0x120 net/socket.c:1490
  do_syscall_64+0x15b/0x220 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x440099
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffb43cf578 EFLAGS: 00000213 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440099
RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000401920
R13: 00000000004019b0 R14: 0000000000000000 R15: 0000000000000000

Local variable description: ----address@__sys_bind
Variable was created at:
  __sys_bind+0x6a/0x6f0 net/socket.c:1468
  __do_sys_bind net/socket.c:1492 [inline]
  __se_sys_bind net/socket.c:1490 [inline]
  __x64_sys_bind+0xd8/0x120 net/socket.c:1490
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
From: Wolfgang Walter @ 2018-08-30 18:53 UTC (permalink / raw)
  To: netdev; +Cc: Wei Wang, Steffen Klassert
In-Reply-To: <3482600.6PjfSIYROA@stwm.de>

Hello,

kernels > 4.12 do not work on one of our main routers. They crash as soon
as ipsec-tunnels are configured and ipsec-traffic actually flows.
 
Just configuring ipsec (that is starting strongswan) does not trigger the
oops.
 
I finally found time to bisect that. It bisected down to

	b838d5e1c5b6e57b10ec8af2268824041e3ea911
	ipv4: mark DST_NOGC and remove the operation of dst_free()

Now we have other machines which run just fine with the very same kernels
doing ipsec. They differ insofar as they have much less cores, do not use
the ixgbe driver, do not have 10G and terminate only a few tunnels instead
of hundreds.

I already tested distribution kernels > 4.12 from debian, they also crash.

All kernels I created in the bisection run fine if I didn't use ipsec.
The bad ones all oopsed/crashed exactly as vanilla 4.14 described above.


Here is the bisect-log:

# bad: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14
# good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
git bisect start 'v4.14' 'v4.9'
# good: [d82dd0e34d0347be201fd274dc84cd645dccc064] raid1: prefer disk without bad blocks
git bisect good d82dd0e34d0347be201fd274dc84cd645dccc064
# bad: [9967468c0a109644e4a1f5b39b39bf86fe7507a7] Merge branch 'akpm' (patches from Andrew)
git bisect bad 9967468c0a109644e4a1f5b39b39bf86fe7507a7
# bad: [17d9aa66b08de445645bd0688fc1635bed77a57b] Merge tag 'iwlwifi-next-for-kalle-2017-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next
git bisect bad 17d9aa66b08de445645bd0688fc1635bed77a57b
# good: [de4d195308ad589626571dbe5789cebf9695a204] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good de4d195308ad589626571dbe5789cebf9695a204
# good: [9376906c17fa975bf6a7ea9dd124be697bcda289] Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 9376906c17fa975bf6a7ea9dd124be697bcda289
# good: [40e86a3619a1e84ad73c716c943f65fc38eb1e28] iwlwifi: mvm: use scnprintf() instead of snprintf()
git bisect good 40e86a3619a1e84ad73c716c943f65fc38eb1e28
# bad: [c66f2091c9248ddf42504c74cd327ae8619b04a4] net/mlx5e: Prevent PFC call for non ethernet ports
git bisect bad c66f2091c9248ddf42504c74cd327ae8619b04a4
# good: [a090bd4ff8387c409732a8e059fbf264ea0bdd56] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good a090bd4ff8387c409732a8e059fbf264ea0bdd56
# good: [1947030645b6012aeee98da764d6dd47071a6aad] Merge branch 'dsa-prefix-Global-macros'
git bisect good 1947030645b6012aeee98da764d6dd47071a6aad
# good: [69137ea60c9dad58773a1918de6c1b00b088520c] pktgen: Specify num packets per thread
git bisect good 69137ea60c9dad58773a1918de6c1b00b088520c
# good: [d24406c85d123df773bc4df88ad5da2233896919] udp: call dst_hold_safe() in udp_sk_rx_set_dst()
git bisect good d24406c85d123df773bc4df88ad5da2233896919
# bad: [5b7c9a8ff828287af5aebe93e707271bf1a82cc3] net: remove dst gc related code
git bisect bad 5b7c9a8ff828287af5aebe93e707271bf1a82cc3
# bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()
git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911
# good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put()
git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36
# good: [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly
git bisect good 95c47f9cf5e028d1ae77dc6c767c1edc8a18025b
# good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly
git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f
# first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free()


In my first email I wrote >= 4.12, but I think 4.12 works. I bisected between
4.9 and 4.14 as we actually run 4.9 on the machine with the problem and 4.14
on most other routers.

I also tested 4.18.5 and it still shows this bug.


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* Re: WARNING in handle_irq (3)
From: John Fastabend @ 2018-08-30 23:05 UTC (permalink / raw)
  To: Dmitry Vyukov, syzbot, netdev, Alexei Starovoitov,
	Daniel Borkmann
  Cc: Greg Kroah-Hartman, H. Peter Anvin, Kate Stewart, LKML,
	Andy Lutomirski, Ingo Molnar, nstange, syzkaller-bugs,
	Thomas Gleixner, the arch/x86 maintainers
In-Reply-To: <CACT4Y+Yuwcm9N-_QwyPzj==xFUuziW24eQzw318Y=VKXQ=8_nw@mail.gmail.com>

On 08/30/2018 08:39 AM, Dmitry Vyukov wrote:
> On Thu, Aug 30, 2018 at 8:31 AM, syzbot
> <syzbot+a58b558e3e62d0604e5c@syzkaller.appspotmail.com> wrote:
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    58c3f14f86c9 Merge tag 'riscv-for-linus-4.19-rc2' of git:/..
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=10be176a400000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=531a917630d2a492
>> dashboard link: https://syzkaller.appspot.com/bug?extid=a58b558e3e62d0604e5c
>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
> 
> +bpf maintainers
> 
> Looks suspiciously similar to:
> https://groups.google.com/d/msg/syzkaller-bugs/4v7MtbIT1hY/A87hInzyAwAJ
> 
> Note this commit seems to already have "bpf, sockmap: fix
> sock_hash_alloc and reject zero-sized keys ".
> 
> Tentative reproducer from the log is:
> 
> 14:08:59 executing program 5:
> socketpair(0x20000, 0x0, 0x0, &(0x7f0000000140))
> r0 = socket$inet6_tcp(0xa, 0x1, 0x0)
> r1 = socket$inet6_tcp(0xa, 0x1, 0x0)
> bind$inet6(r1, &(0x7f00000000c0)={0xa, 0x4e22}, 0x1c)
> listen(r1, 0x0)
> sendto$inet6(r0, &(0x7f0000000140), 0x2d6, 0x20000004,
> &(0x7f0000000080)={0xa, 0x100000004e22, 0x0, @loopback}, 0x1c)
> setsockopt$inet6_tcp_TCP_ULP(r0, 0x6, 0x1f, &(0x7f0000000080)='tls\x00', 0x152)
> r2 = bpf$MAP_CREATE(0x0, &(0x7f0000000280)={0xf, 0x4, 0x4, 0x70}, 0x2c)
> bpf$MAP_UPDATE_ELEM(0x2, &(0x7f0000000180)={r2, &(0x7f0000000000),
> &(0x7f0000000140)}, 0x20)
> 
> Which does not create a 0-key map.
> 
> 

Hi Dmitry,

Testing a fix for this now, we have an error path that can
call module_put and/or null the ulp ops erroneously. Should
have something out later tonight or worst case early tomorrow.
Thanks for the snippet.

Thanks,
John

^ permalink raw reply

* Re: [PATCH net-next] packet: add sockopt to ignore outgoing packets
From: Willem de Bruijn @ 2018-08-30 19:44 UTC (permalink / raw)
  To: vincent.whitchurch
  Cc: David Miller, Network Development, Willem de Bruijn, rabinv
In-Reply-To: <20180830100116.12016-1-vincent.whitchurch@axis.com>

On Thu, Aug 30, 2018 at 6:12 AM Vincent Whitchurch
<vincent.whitchurch@axis.com> wrote:
>
> Currently, the only way to ignore outgoing packets on a packet socket is
> via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
> AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
> if the filter run from packet_rcv() would reject them.  So the presence
> of a packet socket on the interface takes away the benefits of
> MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
> packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
> cloned, but the cost for that is much lower.)
>
> Add a socket option to allow AF_PACKET sockets to ignore outgoing
> packets to solve this.  Note that the *BSDs already have something
> similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.
>
> The first intended user is lldpd.

Clear description of the use case, thanks. I don't see a simple alternative
to introducing a new socket option, either (a new ETH_P_xx protocol
wildcard different from ETH_P_ALL, perhaps).

> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
> ---
>  include/linux/netdevice.h      |  1 +
>  include/uapi/linux/if_packet.h |  1 +
>  net/core/dev.c                 |  3 +++
>  net/packet/af_packet.c         | 15 +++++++++++++++
>  4 files changed, 20 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ca5ab98053c8..8ef14d9edc58 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2317,6 +2317,7 @@ static inline struct sk_buff *call_gro_receive_sk(gro_receive_sk_t cb,
>
>  struct packet_type {
>         __be16                  type;   /* This is really htons(ether_type). */
> +       bool                    ignore_outgoing;
>         struct net_device       *dev;   /* NULL is wildcarded here           */
>         int                     (*func) (struct sk_buff *,
>                                          struct net_device *,
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index 67b61d91d89b..467b654bd4c7 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -57,6 +57,7 @@ struct sockaddr_ll {
>  #define PACKET_QDISC_BYPASS            20
>  #define PACKET_ROLLOVER_STATS          21
>  #define PACKET_FANOUT_DATA             22
> +#define PACKET_IGNORE_OUTGOING         23


>
>  #define PACKET_FANOUT_HASH             0
>  #define PACKET_FANOUT_LB               1
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 325fc5088370..0addb4f0abfe 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
>         if (!ptype->af_packet_priv || !skb->sk)
>                 return false;
>
> +       if (ptype->ignore_outgoing)
> +               return true;
> +

This probably does not belong in skb_loop_sk, but in
dev_queue_xmit_nit directly.

>         if (ptype->id_match)
>                 return ptype->id_match(ptype, skb->sk);
>         else if ((struct sock *)ptype->af_packet_priv == skb->sk)

^ permalink raw reply

* [RFC PATCH net-next 0/4] net: batched receive in GRO path
From: Edward Cree @ 2018-08-30 19:57 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev

This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
I have not listified dev_gro_receive() itself, or the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
 Nonetheless the batching of the calling code yields some performance gains
 in the GRO case as well.

Herewith the performance figures obtained in a NetPerf TCP stream test (with
 four streams, and irqs bound to a single core):
net-next: 7.166 Gbit/s (sigma 0.435)
after #2: 7.715 Gbit/s (sigma 0.145) = datum + 7.7%
after #4: 7.890 Gbit/s (sigma 0.217) = datum + 10.1%
(Note that the 'net-next' results were distinctly bimodal, with two results
 of about 8 Gbit/s and the remaining ten around 7 Gbit/s.  I don't have a
 good explanation for this.)
And with GRO disabled through ethtool -K (thus simulating traffic which is
 not GRO-able but, being TCP, is still passed to the GRO entry point):
net-next: 4.756 Gbit/s (sigma 0.240)
after #4: 5.355 Gbit/s (sigma 0.232) = datum + 12.6%

Edward Cree (4):
  net: introduce list entry point for GRO
  sfc: use batched receive for GRO
  net: make listified RX functions return number of good packets
  net/core: handle GRO_NORMAL skbs as a list in napi_gro_receive_list

 drivers/net/ethernet/sfc/efx.c        |  11 +++-
 drivers/net/ethernet/sfc/net_driver.h |   1 +
 drivers/net/ethernet/sfc/rx.c         |  16 +++++-
 include/linux/netdevice.h             |   6 +-
 include/net/ip.h                      |   4 +-
 include/net/ipv6.h                    |   4 +-
 net/core/dev.c                        | 104 ++++++++++++++++++++++++++--------
 net/ipv4/ip_input.c                   |  39 ++++++++-----
 net/ipv6/ip6_input.c                  |  37 +++++++-----
 9 files changed, 157 insertions(+), 65 deletions(-)

^ permalink raw reply

* [RFC PATCH net-next 1/4] net: introduce list entry point for GRO
From: Edward Cree @ 2018-08-30 19:59 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev
In-Reply-To: <08c963dd-de14-7548-deb3-ec87570140a0@solarflare.com>

Also export napi_frags_skb() so that drivers using the napi_gro_frags()
 interface can prepare their SKBs properly for submitting on such a list.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 28 +++++++++++++++++++++++++++-
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..a4f55142d3eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3521,8 +3521,10 @@ int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
 void netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
+int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
+struct sk_buff *napi_frags_skb(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
diff --git a/net/core/dev.c b/net/core/dev.c
index 325fc5088370..d1e55bef84eb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5596,6 +5596,31 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(napi_gro_receive);
 
+/* Returns the number of SKBs on the list successfully received */
+int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head)
+{
+	struct sk_buff *skb, *next;
+	gro_result_t result;
+	int kept = 0;
+
+	list_for_each_entry(skb, head, list) {
+		skb_mark_napi_id(skb, napi);
+		trace_napi_gro_receive_entry(skb);
+		skb_gro_reset_offset(skb);
+	}
+
+	list_for_each_entry_safe(skb, next, head, list) {
+		list_del(&skb->list);
+		skb->next = NULL;
+		result = dev_gro_receive(napi, skb);
+		result = napi_skb_finish(result, skb);
+		if (result != GRO_DROP)
+			kept++;
+	}
+	return kept;
+}
+EXPORT_SYMBOL(napi_gro_receive_list);
+
 static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 {
 	if (unlikely(skb->pfmemalloc)) {
@@ -5667,7 +5692,7 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
  * Drivers could call both napi_gro_frags() and napi_gro_receive()
  * We copy ethernet header into skb->data to have a common layout.
  */
-static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
+struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 {
 	struct sk_buff *skb = napi->skb;
 	const struct ethhdr *eth;
@@ -5703,6 +5728,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 
 	return skb;
 }
+EXPORT_SYMBOL(napi_frags_skb);
 
 gro_result_t napi_gro_frags(struct napi_struct *napi)
 {

^ permalink raw reply related

* [RFC PATCH net-next 2/4] sfc: use batched receive for GRO
From: Edward Cree @ 2018-08-30 20:00 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev
In-Reply-To: <08c963dd-de14-7548-deb3-ec87570140a0@solarflare.com>

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/efx.c        | 11 +++++++++--
 drivers/net/ethernet/sfc/net_driver.h |  1 +
 drivers/net/ethernet/sfc/rx.c         | 16 +++++++++++++---
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 330233286e78..dba13a28014c 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -263,9 +263,9 @@ static int efx_check_disabled(struct efx_nic *efx)
  */
 static int efx_process_channel(struct efx_channel *channel, int budget)
 {
+	struct list_head rx_list, gro_list;
 	struct efx_tx_queue *tx_queue;
-	struct list_head rx_list;
-	int spent;
+	int spent, gro_count;
 
 	if (unlikely(!channel->enabled))
 		return 0;
@@ -275,6 +275,10 @@ static int efx_process_channel(struct efx_channel *channel, int budget)
 	INIT_LIST_HEAD(&rx_list);
 	channel->rx_list = &rx_list;
 
+	EFX_WARN_ON_PARANOID(channel->gro_list != NULL);
+	INIT_LIST_HEAD(&gro_list);
+	channel->gro_list = &gro_list;
+
 	efx_for_each_channel_tx_queue(tx_queue, channel) {
 		tx_queue->pkts_compl = 0;
 		tx_queue->bytes_compl = 0;
@@ -300,6 +304,9 @@ static int efx_process_channel(struct efx_channel *channel, int budget)
 	/* Receive any packets we queued up */
 	netif_receive_skb_list(channel->rx_list);
 	channel->rx_list = NULL;
+	gro_count = napi_gro_receive_list(&channel->napi_str, channel->gro_list);
+	channel->irq_mod_score += gro_count * 2;
+	channel->gro_list = NULL;
 
 	return spent;
 }
diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h
index 961b92979640..72addac7a84a 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -502,6 +502,7 @@ struct efx_channel {
 	unsigned int rx_pkt_index;
 
 	struct list_head *rx_list;
+	struct list_head *gro_list;
 
 	struct efx_rx_queue rx_queue;
 	struct efx_tx_queue tx_queue[EFX_TXQ_TYPES];
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 396ff01298cd..0534a54048c6 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -453,9 +453,19 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf,
 
 	skb_record_rx_queue(skb, channel->rx_queue.core_index);
 
-	gro_result = napi_gro_frags(napi);
-	if (gro_result != GRO_DROP)
-		channel->irq_mod_score += 2;
+	/* Pass the packet up */
+	if (channel->gro_list != NULL) {
+		/* Clear napi->skb and prepare skb for GRO */
+		skb = napi_frags_skb(napi);
+		if (skb)
+			/* Add to list, will pass up later */
+			list_add_tail(&skb->list, channel->gro_list);
+	} else {
+		/* No list, so pass it up now */
+		gro_result = napi_gro_frags(napi);
+		if (gro_result != GRO_DROP)
+			channel->irq_mod_score += 2;
+	}
 }
 
 /* Allocate and construct an SKB around page fragments */

^ permalink raw reply related

* [RFC PATCH net-next 3/4] net: make listified RX functions return number of good packets
From: Edward Cree @ 2018-08-30 20:00 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev
In-Reply-To: <08c963dd-de14-7548-deb3-ec87570140a0@solarflare.com>

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 include/linux/netdevice.h |  4 +--
 include/net/ip.h          |  4 +--
 include/net/ipv6.h        |  4 +--
 net/core/dev.c            | 63 +++++++++++++++++++++++++++++------------------
 net/ipv4/ip_input.c       | 39 ++++++++++++++++++-----------
 net/ipv6/ip6_input.c      | 37 +++++++++++++++++-----------
 6 files changed, 92 insertions(+), 59 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a4f55142d3eb..ba5f985688eb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2322,7 +2322,7 @@ struct packet_type {
 					 struct net_device *,
 					 struct packet_type *,
 					 struct net_device *);
-	void			(*list_func) (struct list_head *,
+	int			(*list_func) (struct list_head *,
 					      struct packet_type *,
 					      struct net_device *);
 	bool			(*id_match)(struct packet_type *ptype,
@@ -3519,7 +3519,7 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
-void netif_receive_skb_list(struct list_head *head);
+int netif_receive_skb_list(struct list_head *head);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
 int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
diff --git a/include/net/ip.h b/include/net/ip.h
index e44b1a44f67a..aab1f7eea1e1 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -152,8 +152,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk,
 			  struct ip_options_rcu *opt);
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 	   struct net_device *orig_dev);
-void ip_list_rcv(struct list_head *head, struct packet_type *pt,
-		 struct net_device *orig_dev);
+int ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ff33f498c137..f15651eabfe0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -914,8 +914,8 @@ static inline __be32 flowi6_get_flowlabel(const struct flowi6 *fl6)
 
 int ipv6_rcv(struct sk_buff *skb, struct net_device *dev,
 	     struct packet_type *pt, struct net_device *orig_dev);
-void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
-		   struct net_device *orig_dev);
+int ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
+		  struct net_device *orig_dev);
 
 int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index d1e55bef84eb..ac9741273c62 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4920,24 +4920,27 @@ int netif_receive_skb_core(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(netif_receive_skb_core);
 
-static inline void __netif_receive_skb_list_ptype(struct list_head *head,
-						  struct packet_type *pt_prev,
-						  struct net_device *orig_dev)
+static inline int __netif_receive_skb_list_ptype(struct list_head *head,
+						 struct packet_type *pt_prev,
+						 struct net_device *orig_dev)
 {
 	struct sk_buff *skb, *next;
+	int kept = 0;
 
 	if (!pt_prev)
-		return;
+		return 0;
 	if (list_empty(head))
-		return;
+		return 0;
 	if (pt_prev->list_func != NULL)
-		pt_prev->list_func(head, pt_prev, orig_dev);
+		kept = pt_prev->list_func(head, pt_prev, orig_dev);
 	else
 		list_for_each_entry_safe(skb, next, head, list)
-			pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+			if (pt_prev->func(skb, skb->dev, pt_prev, orig_dev) == NET_RX_SUCCESS)
+				kept++;
+	return kept;
 }
 
-static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
+static int __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc)
 {
 	/* Fast-path assumptions:
 	 * - There is no RX handler.
@@ -4954,6 +4957,7 @@ static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemallo
 	struct net_device *od_curr = NULL;
 	struct list_head sublist;
 	struct sk_buff *skb, *next;
+	int kept = 0, ret;
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
@@ -4961,12 +4965,15 @@ static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemallo
 		struct packet_type *pt_prev = NULL;
 
 		list_del(&skb->list);
-		__netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
-		if (!pt_prev)
+		ret = __netif_receive_skb_core(skb, pfmemalloc, &pt_prev);
+		if (!pt_prev) {
+			if (ret == NET_RX_SUCCESS)
+				kept++;
 			continue;
+		}
 		if (pt_curr != pt_prev || od_curr != orig_dev) {
 			/* dispatch old sublist */
-			__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+			kept += __netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
 			/* start new sublist */
 			INIT_LIST_HEAD(&sublist);
 			pt_curr = pt_prev;
@@ -4976,7 +4983,8 @@ static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemallo
 	}
 
 	/* dispatch final sublist */
-	__netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+	kept += __netif_receive_skb_list_ptype(&sublist, pt_curr, od_curr);
+	return kept;
 }
 
 static int __netif_receive_skb(struct sk_buff *skb)
@@ -5004,11 +5012,12 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
-static void __netif_receive_skb_list(struct list_head *head)
+static int __netif_receive_skb_list(struct list_head *head)
 {
 	unsigned long noreclaim_flag = 0;
 	struct sk_buff *skb, *next;
 	bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
+	int kept = 0;
 
 	list_for_each_entry_safe(skb, next, head, list) {
 		if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) {
@@ -5017,7 +5026,7 @@ static void __netif_receive_skb_list(struct list_head *head)
 			/* Handle the previous sublist */
 			list_cut_before(&sublist, head, &skb->list);
 			if (!list_empty(&sublist))
-				__netif_receive_skb_list_core(&sublist, pfmemalloc);
+				kept += __netif_receive_skb_list_core(&sublist, pfmemalloc);
 			pfmemalloc = !pfmemalloc;
 			/* See comments in __netif_receive_skb */
 			if (pfmemalloc)
@@ -5028,10 +5037,11 @@ static void __netif_receive_skb_list(struct list_head *head)
 	}
 	/* Handle the remaining sublist */
 	if (!list_empty(head))
-		__netif_receive_skb_list_core(head, pfmemalloc);
+		kept += __netif_receive_skb_list_core(head, pfmemalloc);
 	/* Restore pflags */
 	if (pfmemalloc)
 		memalloc_noreclaim_restore(noreclaim_flag);
+	return kept;
 }
 
 static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
@@ -5107,17 +5117,20 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	return ret;
 }
 
-static void netif_receive_skb_list_internal(struct list_head *head)
+static int netif_receive_skb_list_internal(struct list_head *head)
 {
 	struct bpf_prog *xdp_prog = NULL;
 	struct sk_buff *skb, *next;
 	struct list_head sublist;
+	int kept = 0;
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
 		net_timestamp_check(netdev_tstamp_prequeue, skb);
 		list_del(&skb->list);
-		if (!skb_defer_rx_timestamp(skb))
+		if (skb_defer_rx_timestamp(skb))
+			kept++;
+		else
 			list_add_tail(&skb->list, &sublist);
 	}
 	list_splice_init(&sublist, head);
@@ -5147,13 +5160,15 @@ static void netif_receive_skb_list_internal(struct list_head *head)
 			if (cpu >= 0) {
 				/* Will be handled, remove from list */
 				list_del(&skb->list);
-				enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+				if (enqueue_to_backlog(skb, cpu, &rflow->last_qtail) == NET_RX_SUCCESS)
+					kept++;
 			}
 		}
 	}
 #endif
-	__netif_receive_skb_list(head);
+	kept += __netif_receive_skb_list(head);
 	rcu_read_unlock();
+	return kept;
 }
 
 /**
@@ -5183,21 +5198,21 @@ EXPORT_SYMBOL(netif_receive_skb);
  *	netif_receive_skb_list - process many receive buffers from network
  *	@head: list of skbs to process.
  *
- *	Since return value of netif_receive_skb() is normally ignored, and
- *	wouldn't be meaningful for a list, this function returns void.
+ *	Returns the number of skbs for which netif_receive_skb() would have
+ *	returned %NET_RX_SUCCESS.
  *
  *	This function may only be called from softirq context and interrupts
  *	should be enabled.
  */
-void netif_receive_skb_list(struct list_head *head)
+int netif_receive_skb_list(struct list_head *head)
 {
 	struct sk_buff *skb;
 
 	if (list_empty(head))
-		return;
+		return 0;
 	list_for_each_entry(skb, head, list)
 		trace_netif_receive_skb_list_entry(skb);
-	netif_receive_skb_list_internal(head);
+	return netif_receive_skb_list_internal(head);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 3196cf58f418..75cc5a6ef9b8 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -526,9 +526,10 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt,
 		       ip_rcv_finish);
 }
 
-static void ip_sublist_rcv_finish(struct list_head *head)
+static int ip_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
+	int kept = 0;
 
 	list_for_each_entry_safe(skb, next, head, list) {
 		list_del(&skb->list);
@@ -536,16 +537,19 @@ static void ip_sublist_rcv_finish(struct list_head *head)
 		 * another kind of SKB-list usage (see validate_xmit_skb_list)
 		 */
 		skb->next = NULL;
-		dst_input(skb);
+		if (dst_input(skb) == NET_RX_SUCCESS)
+			kept++;
 	}
+	return kept;
 }
 
-static void ip_list_rcv_finish(struct net *net, struct sock *sk,
-			       struct list_head *head)
+static int ip_list_rcv_finish(struct net *net, struct sock *sk,
+			      struct list_head *head)
 {
 	struct dst_entry *curr_dst = NULL;
 	struct sk_buff *skb, *next;
 	struct list_head sublist;
+	int kept = 0;
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
@@ -556,8 +560,10 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
 		 * skb to its handler for processing
 		 */
 		skb = l3mdev_ip_rcv(skb);
-		if (!skb)
+		if (!skb) {
+			kept++;
 			continue;
+		}
 		if (ip_rcv_finish_core(net, sk, skb) == NET_RX_DROP)
 			continue;
 
@@ -565,7 +571,7 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
 		if (curr_dst != dst) {
 			/* dispatch old sublist */
 			if (!list_empty(&sublist))
-				ip_sublist_rcv_finish(&sublist);
+				kept += ip_sublist_rcv_finish(&sublist);
 			/* start new sublist */
 			INIT_LIST_HEAD(&sublist);
 			curr_dst = dst;
@@ -573,25 +579,27 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
 		list_add_tail(&skb->list, &sublist);
 	}
 	/* dispatch final sublist */
-	ip_sublist_rcv_finish(&sublist);
+	kept += ip_sublist_rcv_finish(&sublist);
+	return kept;
 }
 
-static void ip_sublist_rcv(struct list_head *head, struct net_device *dev,
-			   struct net *net)
+static int ip_sublist_rcv(struct list_head *head, struct net_device *dev,
+			  struct net *net)
 {
 	NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
 		     head, dev, NULL, ip_rcv_finish);
-	ip_list_rcv_finish(net, NULL, head);
+	return ip_list_rcv_finish(net, NULL, head);
 }
 
-/* Receive a list of IP packets */
-void ip_list_rcv(struct list_head *head, struct packet_type *pt,
-		 struct net_device *orig_dev)
+/* Receive a list of IP packets; return number of successful receives */
+int ip_list_rcv(struct list_head *head, struct packet_type *pt,
+		struct net_device *orig_dev)
 {
 	struct net_device *curr_dev = NULL;
 	struct net *curr_net = NULL;
 	struct sk_buff *skb, *next;
 	struct list_head sublist;
+	int kept = 0;
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
@@ -606,7 +614,7 @@ void ip_list_rcv(struct list_head *head, struct packet_type *pt,
 		if (curr_dev != dev || curr_net != net) {
 			/* dispatch old sublist */
 			if (!list_empty(&sublist))
-				ip_sublist_rcv(&sublist, curr_dev, curr_net);
+				kept += ip_sublist_rcv(&sublist, curr_dev, curr_net);
 			/* start new sublist */
 			INIT_LIST_HEAD(&sublist);
 			curr_dev = dev;
@@ -615,5 +623,6 @@ void ip_list_rcv(struct list_head *head, struct packet_type *pt,
 		list_add_tail(&skb->list, &sublist);
 	}
 	/* dispatch final sublist */
-	ip_sublist_rcv(&sublist, curr_dev, curr_net);
+	kept += ip_sublist_rcv(&sublist, curr_dev, curr_net);
+	return kept;
 }
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 6242682be876..e64b830c9f0f 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -76,20 +76,24 @@ int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
 	return dst_input(skb);
 }
 
-static void ip6_sublist_rcv_finish(struct list_head *head)
+static int ip6_sublist_rcv_finish(struct list_head *head)
 {
 	struct sk_buff *skb, *next;
+	int kept = 0;
 
 	list_for_each_entry_safe(skb, next, head, list)
-		dst_input(skb);
+		if (dst_input(skb) == NET_RX_SUCCESS)
+			kept++;
+	return kept;
 }
 
-static void ip6_list_rcv_finish(struct net *net, struct sock *sk,
-				struct list_head *head)
+static int ip6_list_rcv_finish(struct net *net, struct sock *sk,
+			       struct list_head *head)
 {
 	struct dst_entry *curr_dst = NULL;
 	struct sk_buff *skb, *next;
 	struct list_head sublist;
+	int kept = 0;
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
@@ -100,14 +104,16 @@ static void ip6_list_rcv_finish(struct net *net, struct sock *sk,
 		 * skb to its handler for processing
 		 */
 		skb = l3mdev_ip6_rcv(skb);
-		if (!skb)
+		if (!skb) {
+			kept++;
 			continue;
+		}
 		ip6_rcv_finish_core(net, sk, skb);
 		dst = skb_dst(skb);
 		if (curr_dst != dst) {
 			/* dispatch old sublist */
 			if (!list_empty(&sublist))
-				ip6_sublist_rcv_finish(&sublist);
+				kept += ip6_sublist_rcv_finish(&sublist);
 			/* start new sublist */
 			INIT_LIST_HEAD(&sublist);
 			curr_dst = dst;
@@ -115,7 +121,8 @@ static void ip6_list_rcv_finish(struct net *net, struct sock *sk,
 		list_add_tail(&skb->list, &sublist);
 	}
 	/* dispatch final sublist */
-	ip6_sublist_rcv_finish(&sublist);
+	kept += ip6_sublist_rcv_finish(&sublist);
+	return kept;
 }
 
 static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev,
@@ -273,22 +280,23 @@ int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt
 		       ip6_rcv_finish);
 }
 
-static void ip6_sublist_rcv(struct list_head *head, struct net_device *dev,
-			    struct net *net)
+static int ip6_sublist_rcv(struct list_head *head, struct net_device *dev,
+			   struct net *net)
 {
 	NF_HOOK_LIST(NFPROTO_IPV6, NF_INET_PRE_ROUTING, net, NULL,
 		     head, dev, NULL, ip6_rcv_finish);
-	ip6_list_rcv_finish(net, NULL, head);
+	return ip6_list_rcv_finish(net, NULL, head);
 }
 
 /* Receive a list of IPv6 packets */
-void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
-		   struct net_device *orig_dev)
+int ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
+		  struct net_device *orig_dev)
 {
 	struct net_device *curr_dev = NULL;
 	struct net *curr_net = NULL;
 	struct sk_buff *skb, *next;
 	struct list_head sublist;
+	int kept = 0;
 
 	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
@@ -303,7 +311,7 @@ void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
 		if (curr_dev != dev || curr_net != net) {
 			/* dispatch old sublist */
 			if (!list_empty(&sublist))
-				ip6_sublist_rcv(&sublist, curr_dev, curr_net);
+				kept += ip6_sublist_rcv(&sublist, curr_dev, curr_net);
 			/* start new sublist */
 			INIT_LIST_HEAD(&sublist);
 			curr_dev = dev;
@@ -312,7 +320,8 @@ void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
 		list_add_tail(&skb->list, &sublist);
 	}
 	/* dispatch final sublist */
-	ip6_sublist_rcv(&sublist, curr_dev, curr_net);
+	kept += ip6_sublist_rcv(&sublist, curr_dev, curr_net);
+	return kept;
 }
 
 /*

^ permalink raw reply related

* [RFC PATCH net-next 4/4] net/core: handle GRO_NORMAL skbs as a list in napi_gro_receive_list
From: Edward Cree @ 2018-08-30 20:01 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev
In-Reply-To: <08c963dd-de14-7548-deb3-ec87570140a0@solarflare.com>

Allows GRO-using drivers to get the benefits of batching for non-GROable
 traffic.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/core/dev.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ac9741273c62..f9391f76feb7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5615,6 +5615,7 @@ EXPORT_SYMBOL(napi_gro_receive);
 int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head)
 {
 	struct sk_buff *skb, *next;
+	struct list_head sublist;
 	gro_result_t result;
 	int kept = 0;
 
@@ -5624,14 +5625,26 @@ int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head)
 		skb_gro_reset_offset(skb);
 	}
 
+	INIT_LIST_HEAD(&sublist);
 	list_for_each_entry_safe(skb, next, head, list) {
 		list_del(&skb->list);
 		skb->next = NULL;
 		result = dev_gro_receive(napi, skb);
-		result = napi_skb_finish(result, skb);
-		if (result != GRO_DROP)
-			kept++;
+		if (result == GRO_NORMAL) {
+			list_add_tail(&skb->list, &sublist);
+			continue;
+		} else {
+			if (!list_empty(&sublist)) {
+				/* Handle the GRO_NORMAL skbs to prevent OoO */
+				kept += netif_receive_skb_list_internal(&sublist);
+				INIT_LIST_HEAD(&sublist);
+			}
+			result = napi_skb_finish(result, skb);
+			if (result != GRO_DROP)
+				kept++;
+		}
 	}
+	kept += netif_receive_skb_list_internal(&sublist);
 	return kept;
 }
 EXPORT_SYMBOL(napi_gro_receive_list);

^ permalink raw reply related

* Re: [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers
From: Doug Ledford @ 2018-08-30 20:17 UTC (permalink / raw)
  To: Arseny Maslennikov, linux-rdma; +Cc: Jason Gunthorpe, netdev
In-Reply-To: <20180830182238.16361-2-ar@cs.msu.ru>

[-- Attachment #1: Type: text/plain, Size: 2112 bytes --]

On Thu, 2018-08-30 at 21:22 +0300, Arseny Maslennikov wrote:
> Some InfiniBand network devices have multiple ports on the same PCI
> function. This initializes the `dev_port' sysfs field of those
> network interfaces with their port number.
> 
> Prior to this the kernel erroneously used the `dev_id' sysfs
> field of those network interfaces to convey the port number to userspace.
> 
> The use of `dev_id' was considered correct until Linux 3.15,
> when another field, `dev_port', was defined for this particular
> purpose and `dev_id' was reserved for distinguishing stacked ifaces
> (e.g: VLANs) with the same hardware address as their parent device.
> 
> Similar fixes to net/mlx4_en and many other drivers, which started
> exporting this information through `dev_id' before 3.15, were accepted
> into the kernel 4 years ago.
> See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
> 
> Signed-off-by: Arseny Maslennikov <ar@cs.msu.ru>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index e3d28f9ad9c0..ba16a63ee303 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev)
>  	       sizeof(union ib_gid));
>  
>  	SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent);
> -	priv->dev->dev_id = priv->port - 1;
> +	priv->dev->dev_port = priv->port - 1;

I don't know that we can't do this.  At least not yet.  Expose the new
item to make us compliant with the new docs, and deprecate the old sysfs
item, but we can't just yank the old item.  Existing tools/scripts might
(probably) rely on it (existing tools already special case IPoIB
interfaces and we'll need to make sure they don't special case this
element too).

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH net-next 0/2] netlink: multicast join notifications
From: Patrick Ruddy @ 2018-08-30  9:35 UTC (permalink / raw)
  To: netdev; +Cc: roopa, jiri, stephen

This patch is an update to https://patchwork.ozlabs.org/patch/571127/. The
previous patch was based on sending multicast MAC addresses in the
netlink messages to allow the programming of hardware. It was agreed to
rework this to use RTM_NEW/DELLINK messages which were more appropriate
for layer 2 addresses.
In the interim period it has become apparent that the applications actually
needs to see the L3 multicast addresses which are joined for FORUS
processing so this patch has been reworked to send the L3 multicast
addresses using RTM_NEW/DELADDR.
These new multicast L3 netlink notifications should use the IFA_MULTICAST
address type but this has been dropped in favour of IFA_ADDRESS as during
testing it was noticed that some applications - notably getaddrinfo in
lib6c assume that there is an IFA_ADDRESS in a RTM_NEW/DELADDR and
blindly dereference it.
Finally the RTM_GETADDR for both address families has been modified to
include the multicast l3 addresses.

Patrick Ruddy (2):
  netlink: ipv4 IGMP join notifications
  netlink: ipv6 MLD join notifications

 include/linux/igmp.h |  2 +
 net/ipv4/devinet.c   | 39 +++++++++++++------
 net/ipv4/igmp.c      | 90 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv6/addrconf.c  | 44 ++++++++++++++++------
 net/ipv6/mcast.c     | 66 ++++++++++++++++++++++++++++++++
 5 files changed, 218 insertions(+), 23 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [RFC PATCH v1 0/3] device property: Support MAC address in VPD
From: Brian Norris @ 2018-08-31  0:55 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Brian Norris, Florian Fainelli, Rob Herring, Greg Kroah-Hartman,
	Rafael J. Wysocki, Andrew Lunn, Dmitry Torokhov, Guenter Roeck,
	netdev, devicetree, linux-kernel, Julius Werner,
	Srinivas Kandagatla
In-Reply-To: <153437082071.28926.11684780766239178367@swboyd.mtv.corp.google.com>

(sorry for the delayed response here)

Hi,

On Wed, Aug 15, 2018 at 03:07:00PM -0700, Stephen Boyd wrote:
> Quoting Brian Norris (2018-08-14 18:44:36)
> > But anyway, would the idea be that you just put 'ethernet{0,1,...}' and
> > 'wifi{0,1,...}' aliases in the /chosen node, then require boot firmware
> > to insert any {ethernet,wifi}_mac{0,1,...} into the paths represented by
> > the corresponding aliases? I suppose that would reduce the problems with
> > (1), but it still doesn't really help with (2).
> 
> Yes. Aliases are the way to do this. It obviates much of this discussion
> about finding things in DT by directly pointing to the node the
> bootloader wants to go modify.

Sure, that does seem to help.

> > > > And finally, this may be surmountable, but the existing APIs seem very
> > > > device tree centric. We use this same format on ACPI systems, and the
> > > > current series would theoretically work on both [1]. I'd have to rewrite
> > > > the current (OF-only) helpers to get equivalent support...
> 
> Where does it go on ACPI systems? Does the firmware stick it into some
> ACPI table by reading from VPD?

I'm not aware of any ACPI system that needs to do something quite like
this. The closest I see is a Coreboot board for some Chromeboxes [1],
which handles Ethernet MAC addresses -- it does this by reading similar
VPD fields, but then it just goes and programs the MAC directly into the
Ethernet card's registers. This is a Realtek PCI Ethernet chip, and
apparently it even needs firmware to program its product and vendor ID
for it.

I'm not really sure what lesson to get out of that though. That
discoverable hardware is a giant dirty mess (even PCI gets a lot of help
from system firmware!)?

Or I guess maybe the lesson is: ACPI systems will always find a way to
do it in firmware, no matter how much bloat it costs. So I don't really
need to consider "does this solution work with ACPI"? But just, does it
work with {device,fwnode}_get_mac_address() -- e.g., by teaching them to
call of_get_nvmem_mac_address()?

Brian

[1] Here, and similar variants in other boards throughout the tree:
https://review.coreboot.org/cgit/coreboot.git/tree/src/mainboard/google/beltino/lan.c

^ permalink raw reply

* Re: KMSAN: uninit-value in rds_bind
From: Dmitry Vyukov @ 2018-08-30 21:05 UTC (permalink / raw)
  To: Santosh Shilimkar
  Cc: syzbot, linux-rdma, netdev, syzkaller-bugs, David Miller, LKML
In-Reply-To: <355351e8-71fa-6db9-ad2a-43115cc2e80e@oracle.com>

On Thu, Aug 30, 2018 at 1:56 PM, Santosh Shilimkar
<santosh.shilimkar@oracle.com> wrote:
> On 8/30/2018 11:31 AM, syzbot wrote:
>>
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    2dca2cbde67a kmsan: fix build warnings with CONFIG_KMSAN=n
>> git tree:       https://github.com/google/kmsan.git/master
>
>
> BTW, can you please fix your git url since this one doesn't work.

I will fix this. Thanks.

But just to unblock you, it's git repo/branch, i.e.
https://github.com/google/kmsan.git is repo, and branch is master.


> This tree is not vanila kernel.org tree(4.19.0-rc1+ #36), so would be
> good to get the line numbers correct for sources.
>
> Regards,
> Snatosh
>
> --
> You received this message because you are subscribed to the Google Groups
> "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to syzkaller-bugs+unsubscribe@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/syzkaller-bugs/355351e8-71fa-6db9-ad2a-43115cc2e80e%40oracle.com.
>
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: nixge: Add support for fixed-link subnodes
From: Moritz Fischer @ 2018-08-30 21:09 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David S. Miller, Kees Cook, Florian Fainelli,
	Linux Kernel Mailing List, netdev, Alex Williams
In-Reply-To: <20180830175419.GD31581@lunn.ch>

Andrew,

On Thu, Aug 30, 2018 at 10:54 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>> > The hardware has MDIO, but you don't have a PHY connected on it, and
>> > use fixed link.
>>
>> Since it's an FPGA design in that case we'd probably build the hardware without
>> MDIO to save resources.
>
> You can save resources, but is it worth the complexity else where,
> like in the software?

Depends. For now I definitely have build versions that don't have MDIO regs
there. I might be able to chat with HW folks ...

>
>> > It is important you have the mdio subnode, with PHYs and switches as
>> > children. The driver currently gets this wrong, it uses
>> > pdev->dev.of_node.
>>
>> Oh, whoops.
>
> Yes, and i also missed it. I generally review all new network drivers
> and look at their MDIO and PHY code.

I had looked at macb as an example and there were a bunch of other cases
where there was no 'mdio' subnode.

>
>> Any good examples of drivers doing it right? Is the one going with
>> the DT snippet above a good example?
>
> That comes from the Freescale fec_main.c. It only supports DT, and
> always uses of_mdiobus_register. You need to be a bit more flexible
> for when you don't have DT. I'm not sure there are good example of
> this, since they either don't need this flexibility, or they get it
> wrong :-(

Alright, no problem. I'll take a stab at it. And come back with a v2 ;-)
Need to look at your response in the other patch.

Cheers,

Moritz

^ permalink raw reply

* [net-next v2 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-30
From: Jeff Kirsher @ 2018-08-30 21:11 UTC (permalink / raw)
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, jogreene

This series contains updates to i40e, i40evf and virtchnl.

Jake implements helper functions to use an array to handle the queue
stats which reduces the boiler plate code as well as keep the complexity
localized to a few functions.

Paweł adds the ability to change a VF's MAC address from the host side
without having to reload the VF driver on the guest side.

Paul adds a check to ensure that the number of queues that the PF sends
to the VF is equal to or less than the maximum number of queues the VF
can support.

Mitch fixes an issue caught by GCC 8, where we need to not include the
terminating null in the length of the string for strncpy().

Lihong fixes a VF issue to ensure that it does not enter into
promiscuous mode when macvlan is added to the VF.  Fixed a potential
crash after a VF is removed, since the workqueue sync for the adminq
task was not being cancelled.

Harshitha fixes the type for field_flags in the virtchnl_filter struct.

Martyna removes an unnecessary check in a conditional if statement.

Björn fixes an issue reported by Jesper Dangaard Brouer, where the
driver was reporting incorrect statistics when XDP was enabled.

Jan fixes the potential reporting of incorrect speed settings.

Patryk fixed an issue where the flag
I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING was getting set when any offload is
set via ethtool.  Resolved by only setting this flag when VLAN offload
is enabled.  Also ensure we hold the rtnl lock when we are clearing the
interrupt scheme.  Added a check when deleting the MAC address from the
VF to ensure that the MAC address was not set by the PF and if it was,
do not delete it. 

v2: updated patch 2 in the series based on community feedback from David
    Miller to inline a function

The following are changes since commit f0259b6ac4a3d27f6b7e938d6fce367cea377063:
  Merge tag 'mac80211-next-for-davem-2018-08-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Björn Töpel (1):
  i40e: report correct statistics when XDP is enabled

Harshitha Ramamurthy (1):
  virtchnl: use u8 type for a field in the virtchnl_filter struct

Jacob Keller (3):
  i40e: convert queue stats to i40e_stats array
  i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h
  i40evf: update ethtool stats code and use helper functions

Jan Sokolowski (1):
  i40e: Check and correct speed values for link on open

Lihong Yang (2):
  i40evf: set IFF_UNICAST_FLT flag for the VF
  i40evf: cancel workqueue sync for adminq when a VF is removed

Martyna Szapar (1):
  i40e: static analysis report from community

Mitch Williams (1):
  i40e: use correct length for strncpy

Patryk Małek (3):
  i40evf: Don't enable vlan stripping when rx offload is turned on
  i40e: hold the rtnl lock on clearing interrupt scheme
  i40e: Prevent deleting MAC address from VF when set by PF

Paul M Stillwell Jr (1):
  i40evf: Validate the number of queues a PF sends

Paweł Jabłoński (1):
  i40evf: Change a VF mac without reloading the VF driver

 .../net/ethernet/intel/i40e/i40e_ethtool.c    | 238 ++++--------------
 .../ethernet/intel/i40e/i40e_ethtool_stats.h  | 221 ++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  61 +++--
 drivers/net/ethernet/intel/i40e/i40e_ptp.c    |   3 +-
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c    |  18 +-
 .../intel/i40evf/i40e_ethtool_stats.h         | 221 ++++++++++++++++
 .../ethernet/intel/i40evf/i40evf_ethtool.c    | 137 +++++-----
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  15 +-
 .../ethernet/intel/i40evf/i40evf_virtchnl.c   |  43 +++-
 include/linux/avf/virtchnl.h                  |   2 +-
 10 files changed, 678 insertions(+), 281 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h
 create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h

-- 
2.17.1

^ permalink raw reply

* [net-next v2 01/15] i40e: convert queue stats to i40e_stats array
From: Jeff Kirsher @ 2018-08-30 21:11 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180830211147.17008-1-jeffrey.t.kirsher@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

Use an i40e_stats array to handle the queue stats, instead of coding
similar functionality separately. Because of how the queue stats are
accessed on some kernels, we can't easily use i40e_add_ethtool_stats.

Instead, implement a separate helper, i40e_add_queue_stats, which we'll
use instead. This helper will correctly implement the
u64_stats_fetch_begin_irq logic and allow retries until successful. We
share the most complex code by re-using i40e_add_one_ethtool_stat.

This logic additionally easily supports skipping disabled rings by using
a ternary operator before calling the u64_stats_fetch_begin_irq()
function, so that we correctly zero-out the stats values without having
to perform two separate sections of code.

This significantly reduces the boiler plate code in
i40e_get_ethtool_stats, and helps keep the complex logic contained to as
few functions as possible.

With this patch, we've finally converted all the statistics to use the
helpers and the i40e_stats function.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c    | 148 +++++++++++-------
 1 file changed, 89 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 5ff6caa83948..235012b3bd42 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -33,6 +33,8 @@ struct i40e_stats {
 	I40E_STAT(struct i40e_veb, _name, _stat)
 #define I40E_PFC_STAT(_name, _stat) \
 	I40E_STAT(struct i40e_pfc_stats, _name, _stat)
+#define I40E_QUEUE_STAT(_name, _stat) \
+	I40E_STAT(struct i40e_ring, _name, _stat)
 
 static const struct i40e_stats i40e_gstrings_net_stats[] = {
 	I40E_NETDEV_STAT(rx_packets),
@@ -171,20 +173,16 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = {
 	I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff),
 };
 
-/* We use num_tx_queues here as a proxy for the maximum number of queues
- * available because we always allocate queues symmetrically.
- */
-#define I40E_MAX_NUM_QUEUES(n) ((n)->num_tx_queues)
-#define I40E_QUEUE_STATS_LEN(n)                                              \
-	   (I40E_MAX_NUM_QUEUES(n)                                           \
-	    * 2 /* Tx and Rx together */                                     \
-	    * (sizeof(struct i40e_queue_stats) / sizeof(u64)))
-#define I40E_GLOBAL_STATS_LEN	ARRAY_SIZE(i40e_gstrings_stats)
+static const struct i40e_stats i40e_gstrings_queue_stats[] = {
+	I40E_QUEUE_STAT("%s-%u.packets", stats.packets),
+	I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes),
+};
+
 #define I40E_NETDEV_STATS_LEN	ARRAY_SIZE(i40e_gstrings_net_stats)
+
 #define I40E_MISC_STATS_LEN	ARRAY_SIZE(i40e_gstrings_misc_stats)
-#define I40E_VSI_STATS_LEN(n)	(I40E_NETDEV_STATS_LEN + \
-				 I40E_MISC_STATS_LEN + \
-				 I40E_QUEUE_STATS_LEN((n)))
+
+#define I40E_VSI_STATS_LEN	(I40E_NETDEV_STATS_LEN + I40E_MISC_STATS_LEN)
 
 #define I40E_PFC_STATS_LEN	(ARRAY_SIZE(i40e_gstrings_pfc_stats) * \
 				 I40E_MAX_USER_PRIORITY)
@@ -193,10 +191,15 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = {
 				 (ARRAY_SIZE(i40e_gstrings_veb_tc_stats) * \
 				  I40E_MAX_TRAFFIC_CLASS))
 
-#define I40E_PF_STATS_LEN(n)	(I40E_GLOBAL_STATS_LEN + \
+#define I40E_GLOBAL_STATS_LEN	ARRAY_SIZE(i40e_gstrings_stats)
+
+#define I40E_PF_STATS_LEN	(I40E_GLOBAL_STATS_LEN + \
 				 I40E_PFC_STATS_LEN + \
 				 I40E_VEB_STATS_LEN + \
-				 I40E_VSI_STATS_LEN((n)))
+				 I40E_VSI_STATS_LEN)
+
+/* Length of stats for a single queue */
+#define I40E_QUEUE_STATS_LEN	ARRAY_SIZE(i40e_gstrings_queue_stats)
 
 enum i40e_ethtool_test_id {
 	I40E_ETH_TEST_REG = 0,
@@ -1701,11 +1704,30 @@ static int i40e_get_stats_count(struct net_device *netdev)
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
 	struct i40e_pf *pf = vsi->back;
+	int stats_len;
 
 	if (vsi == pf->vsi[pf->lan_vsi] && pf->hw.partition_id == 1)
-		return I40E_PF_STATS_LEN(netdev);
+		stats_len = I40E_PF_STATS_LEN;
 	else
-		return I40E_VSI_STATS_LEN(netdev);
+		stats_len = I40E_VSI_STATS_LEN;
+
+	/* The number of stats reported for a given net_device must remain
+	 * constant throughout the life of that device.
+	 *
+	 * This is because the API for obtaining the size, strings, and stats
+	 * is spread out over three separate ethtool ioctls. There is no safe
+	 * way to lock the number of stats across these calls, so we must
+	 * assume that they will never change.
+	 *
+	 * Due to this, we report the maximum number of queues, even if not
+	 * every queue is currently configured. Since we always allocate
+	 * queues in pairs, we'll just use netdev->num_tx_queues * 2. This
+	 * works because the num_tx_queues is set at device creation and never
+	 * changes.
+	 */
+	stats_len += I40E_QUEUE_STATS_LEN * 2 * netdev->num_tx_queues;
+
+	return stats_len;
 }
 
 static int i40e_get_sset_count(struct net_device *netdev, int sset)
@@ -1734,8 +1756,8 @@ static int i40e_get_sset_count(struct net_device *netdev, int sset)
  * @stat: the stat definition
  *
  * Copies the stat data defined by the pointer and stat structure pair into
- * the memory supplied as data. Used to implement i40e_add_ethtool_stats.
- * If the pointer is null, data will be zero'd.
+ * the memory supplied as data. Used to implement i40e_add_ethtool_stats and
+ * i40e_add_queue_stats. If the pointer is null, data will be zero'd.
  */
 static inline void
 i40e_add_one_ethtool_stat(u64 *data, void *pointer,
@@ -1810,6 +1832,45 @@ __i40e_add_ethtool_stats(u64 **data, void *pointer,
 #define i40e_add_ethtool_stats(data, pointer, stats) \
 	__i40e_add_ethtool_stats(data, pointer, stats, ARRAY_SIZE(stats))
 
+/**
+ * i40e_add_queue_stats - copy queue statistics into supplied buffer
+ * @data: ethtool stats buffer
+ * @ring: the ring to copy
+ *
+ * Queue statistics must be copied while protected by
+ * u64_stats_fetch_begin_irq, so we can't directly use i40e_add_ethtool_stats.
+ * Assumes that queue stats are defined in i40e_gstrings_queue_stats. If the
+ * ring pointer is null, zero out the queue stat values and update the data
+ * pointer. Otherwise safely copy the stats from the ring into the supplied
+ * buffer and update the data pointer when finished.
+ *
+ * This function expects to be called while under rcu_read_lock().
+ **/
+static inline void
+i40e_add_queue_stats(u64 **data, struct i40e_ring *ring)
+{
+	const unsigned int size = ARRAY_SIZE(i40e_gstrings_queue_stats);
+	const struct i40e_stats *stats = i40e_gstrings_queue_stats;
+	unsigned int start;
+	unsigned int i;
+
+	/* To avoid invalid statistics values, ensure that we keep retrying
+	 * the copy until we get a consistent value according to
+	 * u64_stats_fetch_retry_irq. But first, make sure our ring is
+	 * non-null before attempting to access its syncp.
+	 */
+	do {
+		start = !ring ? 0 : u64_stats_fetch_begin_irq(&ring->syncp);
+		for (i = 0; i < size; i++) {
+			i40e_add_one_ethtool_stat(&(*data)[i], ring,
+						  &stats[i]);
+		}
+	} while (ring && u64_stats_fetch_retry_irq(&ring->syncp, start));
+
+	/* Once we successfully copy the stats in, update the data pointer */
+	data += size;
+}
+
 /**
  * i40e_get_pfc_stats - copy HW PFC statistics to formatted structure
  * @pf: the PF device structure
@@ -1853,12 +1914,10 @@ static void i40e_get_ethtool_stats(struct net_device *netdev,
 				   struct ethtool_stats *stats, u64 *data)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
-	struct i40e_ring *tx_ring, *rx_ring;
 	struct i40e_vsi *vsi = np->vsi;
 	struct i40e_pf *pf = vsi->back;
 	struct i40e_veb *veb = pf->veb[pf->lan_veb];
 	unsigned int i;
-	unsigned int start;
 	bool veb_stats;
 	u64 *p = data;
 
@@ -1870,38 +1929,12 @@ static void i40e_get_ethtool_stats(struct net_device *netdev,
 	i40e_add_ethtool_stats(&data, vsi, i40e_gstrings_misc_stats);
 
 	rcu_read_lock();
-	for (i = 0; i < I40E_MAX_NUM_QUEUES(netdev) ; i++) {
-		tx_ring = READ_ONCE(vsi->tx_rings[i]);
-
-		if (!tx_ring) {
-			/* Bump the stat counter to skip these stats, and make
-			 * sure the memory is zero'd
-			 */
-			*(data++) = 0;
-			*(data++) = 0;
-			*(data++) = 0;
-			*(data++) = 0;
-			continue;
-		}
-
-		/* process Tx ring statistics */
-		do {
-			start = u64_stats_fetch_begin_irq(&tx_ring->syncp);
-			data[0] = tx_ring->stats.packets;
-			data[1] = tx_ring->stats.bytes;
-		} while (u64_stats_fetch_retry_irq(&tx_ring->syncp, start));
-		data += 2;
-
-		/* Rx ring is the 2nd half of the queue pair */
-		rx_ring = &tx_ring[1];
-		do {
-			start = u64_stats_fetch_begin_irq(&rx_ring->syncp);
-			data[0] = rx_ring->stats.packets;
-			data[1] = rx_ring->stats.bytes;
-		} while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start));
-		data += 2;
+	for (i = 0; i < netdev->num_tx_queues; i++) {
+		i40e_add_queue_stats(&data, READ_ONCE(vsi->tx_rings[i]));
+		i40e_add_queue_stats(&data, READ_ONCE(vsi->rx_rings[i]));
 	}
 	rcu_read_unlock();
+
 	if (vsi != pf->vsi[pf->lan_vsi] || pf->hw.partition_id != 1)
 		goto check_data_pointer;
 
@@ -1990,16 +2023,13 @@ static void i40e_get_stat_strings(struct net_device *netdev, u8 *data)
 
 	i40e_add_stat_strings(&data, i40e_gstrings_misc_stats);
 
-	for (i = 0; i < I40E_MAX_NUM_QUEUES(netdev); i++) {
-		snprintf(data, ETH_GSTRING_LEN, "tx-%u.tx_packets", i);
-		data += ETH_GSTRING_LEN;
-		snprintf(data, ETH_GSTRING_LEN, "tx-%u.tx_bytes", i);
-		data += ETH_GSTRING_LEN;
-		snprintf(data, ETH_GSTRING_LEN, "rx-%u.rx_packets", i);
-		data += ETH_GSTRING_LEN;
-		snprintf(data, ETH_GSTRING_LEN, "rx-%u.rx_bytes", i);
-		data += ETH_GSTRING_LEN;
+	for (i = 0; i < netdev->num_tx_queues; i++) {
+		i40e_add_stat_strings(&data, i40e_gstrings_queue_stats,
+				      "tx", i);
+		i40e_add_stat_strings(&data, i40e_gstrings_queue_stats,
+				      "rx", i);
 	}
+
 	if (vsi != pf->vsi[pf->lan_vsi] || pf->hw.partition_id != 1)
 		return;
 
-- 
2.17.1

^ permalink raw reply related

* [net-next v2 06/15] i40e: use correct length for strncpy
From: Jeff Kirsher @ 2018-08-30 21:11 UTC (permalink / raw)
  To: davem; +Cc: Mitch Williams, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <20180830211147.17008-1-jeffrey.t.kirsher@intel.com>

From: Mitch Williams <mitch.a.williams@intel.com>

Caught by GCC 8. When we provide a length for strncpy, we should not
include the terminating null. So we must tell it one less than the size
of the destination buffer.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ptp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
index 35f2866b38c6..1199f0502d6d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c
@@ -694,7 +694,8 @@ static long i40e_ptp_create_clock(struct i40e_pf *pf)
 	if (!IS_ERR_OR_NULL(pf->ptp_clock))
 		return 0;
 
-	strncpy(pf->ptp_caps.name, i40e_driver_name, sizeof(pf->ptp_caps.name));
+	strncpy(pf->ptp_caps.name, i40e_driver_name,
+		sizeof(pf->ptp_caps.name) - 1);
 	pf->ptp_caps.owner = THIS_MODULE;
 	pf->ptp_caps.max_adj = 999999999;
 	pf->ptp_caps.n_ext_ts = 0;
-- 
2.17.1

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox