Netdev List


* [PATCH bpf-next v2 14/15] xsk: statistics support
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit, a new getsockopt is added: XDP_STATISTICS. This is
used to obtain stats from the sockets.

v2: getsockopt now returns size of stats structure.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  7 +++++++
 net/xdp/xsk.c               | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h         |  5 +++++
 3 files changed, 56 insertions(+), 1 deletion(-)
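
A minimal user-space sketch of how the new option might be queried. It
assumes the option number (6) and the three-field structure layout from
this revision of the patch. The helper name xsk_read_stats and the local
struct xsk_stats_sketch are hypothetical; real code would take both
definitions from <linux/if_xdp.h> in the installed kernel headers rather
than redefining them.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

#ifndef SOL_XDP
#define SOL_XDP 283		/* value used by the xdpsock sample in this series */
#endif
#ifndef XDP_STATISTICS
#define XDP_STATISTICS 6	/* assumption: taken from this revision's if_xdp.h */
#endif

/* Hypothetical local mirror of struct xdp_statistics as added by this patch. */
struct xsk_stats_sketch {
	uint64_t rx_dropped;		/* dropped for reasons other than invalid desc */
	uint64_t rx_invalid_descs;	/* RX drops due to invalid descriptor */
	uint64_t tx_invalid_descs;	/* TX drops due to invalid descriptor */
};

/* Hypothetical helper: returns 0 on success, a negative errno on failure. */
static int xsk_read_stats(int xsk_fd, struct xsk_stats_sketch *out)
{
	socklen_t optlen = sizeof(*out);

	if (getsockopt(xsk_fd, SOL_XDP, XDP_STATISTICS, out, &optlen) < 0)
		return -errno;

	/* v2 of the patch writes the structure size back through optlen. */
	if (optlen != sizeof(*out))
		return -EINVAL;

	return 0;
}
```

On success the helper returns 0 and fills in the three counters;
otherwise it returns the negative errno reported by getsockopt(), for
example -EBADF for an invalid descriptor.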

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e2ea878d025c..77b88c4efe98 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -38,6 +38,7 @@ struct sockaddr_xdp {
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
+#define XDP_STATISTICS			6
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -46,6 +47,12 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+struct xdp_statistics {
+	__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+};
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_PGOFF_TX_RING		 0x80000000
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b33c535c7996..009c5af5bba5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -468,6 +468,49 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int len;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case XDP_STATISTICS:
+	{
+		struct xdp_statistics stats;
+
+		if (len < sizeof(stats))
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		stats.rx_dropped = xs->rx_dropped;
+		stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
+		stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
+		mutex_unlock(&xs->mutex);
+
+		if (copy_to_user(optval, &stats, sizeof(stats)))
+			return -EFAULT;
+		if (put_user(sizeof(stats), optlen))
+			return -EFAULT;
+
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int xsk_mmap(struct file *file, struct socket *sock,
 		    struct vm_area_struct *vma)
 {
@@ -524,7 +567,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.getsockopt =	xsk_getsockopt,
 	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3497e8808608..7aa9a535db0e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -36,6 +36,11 @@ struct xsk_queue {
 
 /* Common functions operating for both RXTX and umem queues */
 
+static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
+{
+	return q ? q->invalid_descs : 0;
+}
+
 static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 {
 	u32 entries = q->prod_tail - q->cons_tail;
-- 
2.14.1


* [PATCH bpf-next v2 15/15] samples/bpf: sample application and documentation for AF_XDP sockets
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	Björn Töpel
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.

To showcase simple round-robin load balancing between a set of
sockets in an xskmap, set the RR_LB compile time define option to 1 in
"xdpsock.h".

v2: The entries variable was calculated twice in {umem,xq}_nb_avail.

Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 Documentation/networking/af_xdp.rst | 297 +++++++++++
 Documentation/networking/index.rst  |   1 +
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
 6 files changed, 1317 insertions(+)
 create mode 100644 Documentation/networking/af_xdp.rst
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c
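
The documentation in this patch describes the UMEM as a set of
equal-sized frames addressed by index, with the example that a 64k UMEM
with 4k frames has 16 frames (indices 0 to 15). The arithmetic can be
sketched in user space as follows. All names here (umem_sketch,
desc_sketch, umem_nframes, umem_desc_data) are hypothetical, and the
bounds check is an illustration, not the kernel's own validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the idx/len/offset fields listed for struct xdp_desc. */
struct desc_sketch {
	uint32_t idx;		/* UMEM frame index */
	uint32_t len;		/* length of the packet data */
	uint32_t offset;	/* offset of the data within the frame */
};

/* Hypothetical description of a registered UMEM area. */
struct umem_sketch {
	char *base;		/* start of the packet data area */
	size_t size;		/* total size in bytes */
	uint32_t frame_size;	/* size of each frame in bytes */
};

/* Number of whole frames in the UMEM. */
static uint32_t umem_nframes(const struct umem_sketch *u)
{
	return (uint32_t)(u->size / u->frame_size);
}

/* Returns the packet data for a descriptor, or NULL if it is out of bounds. */
static char *umem_desc_data(const struct umem_sketch *u,
			    const struct desc_sketch *d)
{
	if (d->idx >= umem_nframes(u))
		return NULL;
	if (d->offset > u->frame_size || d->len > u->frame_size - d->offset)
		return NULL;

	return u->base + (size_t)d->idx * u->frame_size + d->offset;
}
```

A descriptor with idx 15 is the last valid frame in the 64k/4k example;
idx 16, or a length that runs past the end of the frame, is rejected.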

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
new file mode 100644
index 000000000000..91928d9ee4bf
--- /dev/null
+++ b/Documentation/networking/af_xdp.rst
@@ -0,0 +1,297 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+AF_XDP
+======
+
+Overview
+========
+
+AF_XDP is an address family that is optimized for high performance
+packet processing.
+
+This document assumes that the reader is familiar with BPF and XDP. If
+not, the Cilium project has an excellent reference guide at
+http://cilium.readthedocs.io/en/doc-1.0/bpf/.
+
+Using the XDP_REDIRECT action from an XDP program, the program can
+redirect ingress frames to other XDP enabled netdevs, using the
+bpf_redirect_map() function. AF_XDP sockets enable the possibility for
+XDP programs to redirect frames to a memory buffer in a user-space
+application.
+
+An AF_XDP socket (XSK) is created with the normal socket()
+syscall. Associated with each XSK are two rings: the RX ring and the
+TX ring. A socket can receive packets on the RX ring and it can send
+packets on the TX ring. These rings are registered and sized with the
+setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
+to have at least one of these rings for each socket. An RX or TX
+descriptor ring points to a data buffer in a memory area called a
+UMEM. RX and TX can share the same UMEM so that a packet does not have
+to be copied between RX and TX. Moreover, if a packet needs to be kept
+for a while due to a possible retransmit, the descriptor that points
+to that packet can be changed to point to another and reused right
+away. This again avoids copying data.
+
+The UMEM consists of a number of equally sized frames and each frame
+has a unique frame id. A descriptor in one of the rings references a
+frame by referencing its frame id. The user space allocates memory for
+this UMEM using whatever means it feels is most appropriate (malloc,
+mmap, huge pages, etc). This memory area is then registered with the
+kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
+rings: the FILL ring and the COMPLETION ring. The fill ring is used by
+the application to send down frame ids for the kernel to fill in with
+RX packet data. References to these frames will then appear in the RX
+ring once each packet has been received. The completion ring, on the
+other hand, contains frame ids that the kernel has transmitted
+completely and can now be used again by user space, for either TX or
+RX. Thus, the frame ids appearing in the completion ring are ids that
+were previously transmitted using the TX ring. In summary, the RX and
+FILL rings are used for the RX path and the TX and COMPLETION rings
+are used for the TX path.
+
+The socket is then finally bound with a bind() call to a device and a
+specific queue id on that device, and it is not until bind is
+completed that traffic starts to flow.
+
+The UMEM can be shared between processes, if desired. If a process
+wants to do this, it simply skips the registration of the UMEM and its
+corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
+call and submits the XSK of the process it would like to share UMEM
+with as well as its own newly created XSK socket. The new process will
+then receive frame id references in its own RX ring that point to this
+shared UMEM. Note that since the ring structures are single-consumer /
+single-producer (for performance reasons), the new process has to
+create its own socket with associated RX and TX rings, since it cannot
+share this with the other process. This is also the reason that there
+is only one set of FILL and COMPLETION rings per UMEM. It is the
+responsibility of a single process to handle the UMEM.
+
+How are packets then distributed from an XDP program to the XSKs? There
+is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
+user-space application can place an XSK at an arbitrary place in this
+map. The XDP program can then redirect a packet to a specific index in
+this map and at this point XDP validates that the XSK in that map was
+indeed bound to that device and ring number. If not, the packet is
+dropped. If the map is empty at that index, the packet is also
+dropped. This also means that it is currently mandatory to have an XDP
+program loaded (and one XSK in the XSKMAP) to be able to get any
+traffic to user space through the XSK.
+
+AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
+driver does not have support for XDP, or XDP_SKB is explicitly chosen
+when loading the XDP program, XDP_SKB mode is employed. It uses SKBs
+together with the generic XDP support and copies the data out to user
+space. This is a fallback mode that works for any network device. On
+the other hand, if the driver has support for XDP, it will be used by
+the AF_XDP code to provide better performance, but there is still a
+copy of the data into user space.
+
+Concepts
+========
+
+In order to use an AF_XDP socket, a number of associated objects need
+to be set up.
+
+Jonathan Corbet has also written an excellent article on LWN,
+"Accelerating networking with AF_XDP". It can be found at
+https://lwn.net/Articles/750845/.
+
+UMEM
+----
+
+UMEM is a region of virtually contiguous memory, divided into
+equal-sized frames. A UMEM is associated with a netdev and a specific
+queue id of that netdev. It is created and configured (frame size,
+frame headroom, start address and size) by using the XDP_UMEM_REG
+setsockopt system call. A UMEM is bound to a netdev and queue id via
+the bind() system call.
+
+An AF_XDP socket is linked to a single UMEM, but one UMEM can have
+multiple AF_XDP sockets. To share a UMEM created via one socket A,
+the next socket B sets the XDP_SHARED_UMEM flag in struct
+sockaddr_xdp member sxdp_flags, and passes the file descriptor of A
+in struct sockaddr_xdp member sxdp_shared_umem_fd.
+
+The UMEM has two single-producer/single-consumer rings, that are used
+to transfer ownership of UMEM frames between the kernel and the
+user-space application.
+
+Rings
+-----
+
+There are four different kinds of rings: Fill, Completion, RX and
+TX. All rings are single-producer/single-consumer, so the user-space
+application needs explicit synchronization if multiple
+processes/threads are reading/writing to them.
+
+The UMEM uses two rings: Fill and Completion. Each socket associated
+with the UMEM must have an RX queue, TX queue or both. Say that there
+is a setup with four sockets (all doing TX and RX). Then there will be
+one Fill ring, one Completion ring, four TX rings and four RX rings.
+
+The rings are head(producer)/tail(consumer) based rings. A producer
+writes the data ring at the index pointed out by the struct xdp_ring
+producer member, and then increases the producer index. A consumer
+reads the data ring at the index pointed out by the struct xdp_ring
+consumer member, and then increases the consumer index.
+
+The rings are configured and created via the _RING setsockopt system
+calls and mmapped to user-space using the appropriate offset to mmap()
+(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
+XDP_UMEM_PGOFF_COMPLETION_RING).
+
+The size of each ring needs to be a power of two.
+
+UMEM Fill Ring
+~~~~~~~~~~~~~~
+
+The Fill ring is used to transfer ownership of UMEM frames from
+user-space to kernel-space. The UMEM indices are passed in the
+ring. As an example, if the UMEM is 64k and each frame is 4k, then the
+UMEM has 16 frames and can pass indices between 0 and 15.
+
+Frames passed to the kernel are used for the ingress path (RX rings).
+
+The user application produces UMEM indices to this ring.
+
+UMEM Completion Ring
+~~~~~~~~~~~~~~~~~~~~
+
+The Completion Ring is used to transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the Fill ring, UMEM indices are
+used.
+
+Frames passed from the kernel to user-space are frames that have been
+sent (TX ring) and can be used by user-space again.
+
+The user application consumes UMEM indices from this ring.
+
+
+RX Ring
+~~~~~~~
+
+The RX ring is the receiving side of a socket. Each entry in the ring
+is a struct xdp_desc descriptor. The descriptor contains the UMEM
+index (idx), the length of the data (len) and the offset into the
+frame (offset).
+
+If no frames have been passed to the kernel via the Fill ring, no
+descriptors will (or can) appear on the RX ring.
+
+The user application consumes struct xdp_desc descriptors from this
+ring.
+
+TX Ring
+~~~~~~~
+
+The TX ring is used to send frames. The struct xdp_desc descriptor is
+filled (index, length and offset) and passed into the ring.
+
+To start the transfer a sendmsg() system call is required. This might
+be relaxed in the future.
+
+The user application produces struct xdp_desc descriptors to this
+ring.
+
+XSKMAP / BPF_MAP_TYPE_XSKMAP
+----------------------------
+
+On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
+is used in conjunction with bpf_redirect_map() to pass the ingress
+frame to a socket.
+
+The user application inserts the socket into the map, via the bpf()
+system call.
+
+Note that if an XDP program tries to redirect to a socket that does
+not match the queue configuration and netdev, the frame will be
+dropped. For example, if an AF_XDP socket is bound to netdev eth0 and
+queue 17, only the XDP program executing for eth0 and queue 17 will
+successfully pass data to the socket. Please refer to the sample
+application (in samples/bpf/) for an example.
+
+Usage
+=====
+
+In order to use AF_XDP sockets two parts are needed: the user-space
+application and the XDP program. For a complete setup and usage
+example, please refer to the sample application. The user-space
+side is xdpsock_user.c and the XDP side is xdpsock_kern.c.
+
+Naive ring dequeue and enqueue could look like this::
+
+    // typedef struct xdp_rxtx_ring RING;
+    // typedef struct xdp_umem_ring RING;
+
+    // typedef struct xdp_desc RING_TYPE;
+    // typedef __u32 RING_TYPE;
+
+    int dequeue_one(RING *ring, RING_TYPE *item)
+    {
+        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+
+        if (entries == 0)
+            return -1;
+
+        // read-barrier!
+
+        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
+        ring->ptrs.consumer++;
+        return 0;
+    }
+
+    int enqueue_one(RING *ring, const RING_TYPE *item)
+    {
+        __u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+
+        if (free_entries == 0)
+            return -1;
+
+        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+
+        // write-barrier!
+
+        ring->ptrs.producer++;
+        return 0;
+    }
+
+
+For a more optimized version, please refer to the sample application.
+
+Sample application
+==================
+
+There is an xdpsock benchmarking/test application included that
+demonstrates how to use AF_XDP sockets with both private and shared
+UMEMs. Say that you would like your UDP traffic from port 4242 to end
+up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
+for this::
+
+      ethtool -N p3p2 rx-flow-hash udp4 fn
+      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
+          action 16
+
+Running the rxdrop benchmark in XDP_DRV mode can then be done
+using::
+
+      samples/bpf/xdpsock -i p3p2 -q 16 -r -N
+
+For XDP_SKB mode, use the switch "-S" instead of "-N". All options
+can be displayed with "-h", as usual.
+
+Credits
+=======
+
+- Björn Töpel (AF_XDP core)
+- Magnus Karlsson (AF_XDP core)
+- Alexander Duyck
+- Alexei Starovoitov
+- Daniel Borkmann
+- Jesper Dangaard Brouer
+- John Fastabend
+- Jonathan Corbet (LWN coverage)
+- Michael S. Tsirkin
+- Qi Z Zhang
+- Willem de Bruijn
+
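
The "read-barrier!" and "write-barrier!" placeholders in the naive
example above can be made concrete with C11 atomics. What follows is a
standalone single-producer/single-consumer model, not the uapi ring
layout: struct ring_sketch and its small fixed RING_SIZE are
hypothetical, and the rings shared with the kernel use plain __u32
indices with explicit barriers, as the sample application does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical size; must be a power of two */

struct ring_sketch {
	_Atomic uint32_t producer;	/* free-running, written by the producer */
	_Atomic uint32_t consumer;	/* free-running, written by the consumer */
	uint32_t desc[RING_SIZE];
};

static int enqueue_one(struct ring_sketch *r, uint32_t item)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	/* Acquire: the consumer has finished reading any slot it released. */
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return -1;	/* ring is full */

	r->desc[prod & (RING_SIZE - 1)] = item;

	/* Release: the entry is written before the new index becomes visible. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 0;
}

static int dequeue_one(struct ring_sketch *r, uint32_t *item)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	/* Acquire: pairs with the release store in enqueue_one(). */
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return -1;	/* ring is empty */

	*item = r->desc[cons & (RING_SIZE - 1)];

	/* Release: the slot has been read before it is handed back. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 0;
}
```

Because the indices are free-running 32-bit counters, the subtraction
producer - consumer stays correct across wrap-around, and the
power-of-two size lets the slot be selected with a mask.
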
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index f204eaff657d..cbd9bdd4a79e 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,6 +6,7 @@ Contents:
 .. toctree::
    :maxdepth: 2
 
+   af_xdp
    batman-adv
    can
    dpaa2/index
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b853581592fd..c03f8358f12c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -150,6 +152,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -196,6 +199,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
new file mode 100644
index 000000000000..533ab81adfa1
--- /dev/null
+++ b/samples/bpf/xdpsock.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef XDPSOCK_H_
+#define XDPSOCK_H_
+
+/* Power-of-2 number of sockets */
+#define MAX_SOCKS 4
+
+/* Round-robin receive */
+#define RR_LB 0
+
+#endif /* XDPSOCK_H_ */
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..d8806c41362e
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#include "xdpsock.h"
+
+struct bpf_map_def SEC("maps") qidconf_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(int),
+	.max_entries	= 1,
+};
+
+struct bpf_map_def SEC("maps") xsks_map = {
+	.type = BPF_MAP_TYPE_XSKMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") rr_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(unsigned int),
+	.max_entries = 1,
+};
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	int *qidconf, key = 0, idx;
+	unsigned int *rr;
+
+	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
+	if (!qidconf)
+		return XDP_ABORTED;
+
+	if (*qidconf != ctx->rx_queue_index)
+		return XDP_PASS;
+
+#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
+	rr = bpf_map_lookup_elem(&rr_map, &key);
+	if (!rr)
+		return XDP_ABORTED;
+
+	*rr = (*rr + 1) & (MAX_SOCKS - 1);
+	idx = *rr;
+#else
+	idx = 0;
+#endif
+
+	return bpf_redirect_map(&xsks_map, idx, 0);
+}
+
+char _license[] SEC("license") = "GPL";
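
The RR_LB branch above relies on MAX_SOCKS being a power of two, so
that the AND with (MAX_SOCKS - 1) acts as a modulo. A small user-space
model of that step, with the hypothetical helper rr_next standing in
for the update of the rr_map entry:

```c
#include <assert.h>

#define MAX_SOCKS 4	/* power of two, as in xdpsock.h */

_Static_assert((MAX_SOCKS & (MAX_SOCKS - 1)) == 0,
	       "MAX_SOCKS must be a power of two for the mask to work");

/* Hypothetical model: advance the counter and return the XSKMAP index. */
static unsigned int rr_next(unsigned int *rr)
{
	*rr = (*rr + 1) & (MAX_SOCKS - 1);
	return *rr;
}
```

Starting from a zeroed counter the sequence of indices is 1, 2, 3, 0,
and so on. Since rr_map is a per-CPU array, each CPU advances its own
counter independently.
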
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
new file mode 100644
index 000000000000..4b8a7cf3e63b
--- /dev/null
+++ b/samples/bpf/xdpsock_user.c
@@ -0,0 +1,948 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2017 - 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/ethernet.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+#include <sys/types.h>
+#include <poll.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define NUM_FRAMES 131072
+#define FRAME_HEADROOM 0
+#define FRAME_SIZE 2048
+#define NUM_DESCS 1024
+#define BATCH_SIZE 16
+
+#define FQ_NUM_DESCS 1024
+#define CQ_NUM_DESCS 1024
+
+#define DEBUG_HEXDUMP 0
+
+typedef __u32 u32;
+
+static unsigned long prev_time;
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static u32 opt_xdp_flags;
+static const char *opt_if = "";
+static int opt_ifindex;
+static int opt_queue;
+static int opt_poll;
+static int opt_shared_packet_buffer;
+static int opt_interval = 1;
+
+struct xdp_umem_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_umem_ring *ring;
+};
+
+struct xdp_umem {
+	char (*frames)[FRAME_SIZE];
+	struct xdp_umem_uqueue fq;
+	struct xdp_umem_uqueue cq;
+	int fd;
+};
+
+struct xdp_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_rxtx_ring *ring;
+};
+
+struct xdpsock {
+	struct xdp_uqueue rx;
+	struct xdp_uqueue tx;
+	int sfd;
+	struct xdp_umem *umem;
+	u32 outstanding_tx;
+	unsigned long rx_npkts;
+	unsigned long tx_npkts;
+	unsigned long prev_rx_npkts;
+	unsigned long prev_tx_npkts;
+};
+
+#define MAX_SOCKS 4
+static int num_socks;
+struct xdpsock *xsks[MAX_SOCKS];
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
+static void dump_stats(void);
+
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
+				#expr ": errno: %d/\"%s\"\n",		\
+				__FILE__, __func__, __LINE__,		\
+				errno, strerror(errno));		\
+			dump_stats();					\
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+#define u_smp_rmb() barrier()
+#define u_smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+static const char pkt_data[] =
+	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+
+	if (free_entries >= nb)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer;
+
+	return q->size - (q->cached_prod - q->cached_cons);
+}
+
+static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 free_entries = q->cached_cons - q->cached_prod;
+
+	if (free_entries >= ndescs)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
+	return q->cached_cons - q->cached_prod;
+}
+
+static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0) {
+		q->cached_prod = q->ring->ptrs.producer;
+		entries = q->cached_prod - q->cached_cons;
+	}
+
+	return (entries > nb) ? nb : entries;
+}
+
+static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0) {
+		q->cached_prod = q->ring->ptrs.producer;
+		entries = q->cached_prod - q->cached_cons;
+	}
+
+	return (entries > ndescs) ? ndescs : entries;
+}
+
+static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
+					 struct xdp_desc *d,
+					 size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i].idx;
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
+				      size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i];
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
+					       u32 *d, size_t nb)
+{
+	u32 idx, i, entries = umem_nb_avail(cq, nb);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = cq->cached_cons++ & cq->mask;
+		d[i] = cq->ring->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		cq->ring->ptrs.consumer = cq->cached_cons;
+	}
+
+	return entries;
+}
+
+static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
+{
+	lassert(idx < NUM_FRAMES);
+	return &xsk->umem->frames[idx][off];
+}
+
+static inline int xq_enq(struct xdp_uqueue *uq,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 idx = uq->cached_prod++ & uq->mask;
+
+		r->desc[idx].idx = descs[i].idx;
+		r->desc[idx].len = descs[i].len;
+		r->desc[idx].offset = descs[i].offset;
+	}
+
+	u_smp_wmb();
+
+	r->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
+				 __u32 idx, unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *q = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 slot = uq->cached_prod++ & uq->mask;
+
+		q->desc[slot].idx	= idx + i;
+		q->desc[slot].len	= sizeof(pkt_data) - 1;
+		q->desc[slot].offset	= 0;
+	}
+
+	u_smp_wmb();
+
+	q->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_uqueue *uq,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int idx;
+	int i, entries;
+
+	entries = xq_nb_avail(uq, ndescs);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = uq->cached_cons++ & uq->mask;
+		descs[i] = r->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		r->ptrs.consumer = uq->cached_cons;
+	}
+
+	return entries;
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+#if DEBUG_HEXDUMP
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("length = %zu\n", length);
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame)
+{
+	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
+	return sizeof(pkt_data) - 1;
+}
+
+static struct xdp_umem *xdp_umem_configure(int sfd)
+{
+	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
+	struct xdp_umem_reg mr;
+	struct xdp_umem *umem;
+	void *bufs;
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+
+	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
+			       NUM_FRAMES * FRAME_SIZE) == 0);
+
+	mr.addr = (__u64)bufs;
+	mr.len = NUM_FRAMES * FRAME_SIZE;
+	mr.frame_size = FRAME_SIZE;
+	mr.frame_headroom = FRAME_HEADROOM;
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
+			   sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
+			   sizeof(int)) == 0);
+
+	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     FQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_FILL_RING);
+	lassert(umem->fq.ring != MAP_FAILED);
+
+	umem->fq.mask = FQ_NUM_DESCS - 1;
+	umem->fq.size = FQ_NUM_DESCS;
+
+	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     CQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_COMPLETION_RING);
+	lassert(umem->cq.ring != MAP_FAILED);
+
+	umem->cq.mask = CQ_NUM_DESCS - 1;
+	umem->cq.size = CQ_NUM_DESCS;
+
+	umem->frames = (char (*)[FRAME_SIZE])bufs;
+	umem->fd = sfd;
+
+	if (opt_bench == BENCH_TXONLY) {
+		int i;
+
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(&umem->frames[i][0]);
+	}
+
+	return umem;
+}
+
+static struct xdpsock *xsk_configure(struct xdp_umem *umem)
+{
+	struct sockaddr_xdp sxdp = {};
+	int sfd, ndescs = NUM_DESCS;
+	struct xdpsock *xsk;
+	bool shared = true;
+	u32 i;
+
+	sfd = socket(PF_XDP, SOCK_RAW, 0);
+	lassert(sfd >= 0);
+
+	xsk = calloc(1, sizeof(*xsk));
+	lassert(xsk);
+
+	xsk->sfd = sfd;
+	xsk->outstanding_tx = 0;
+
+	if (!umem) {
+		shared = false;
+		xsk->umem = xdp_umem_configure(sfd);
+	} else {
+		xsk->umem = umem;
+	}
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
+			   &ndescs, sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
+			   &ndescs, sizeof(int)) == 0);
+
+	/* Rx */
+	xsk->rx.ring = mmap(NULL,
+			    sizeof(struct xdp_ring) +
+			    NUM_DESCS * sizeof(struct xdp_desc),
+			    PROT_READ | PROT_WRITE,
+			    MAP_SHARED | MAP_POPULATE, sfd,
+			    XDP_PGOFF_RX_RING);
+	lassert(xsk->rx.ring != MAP_FAILED);
+
+	if (!shared) {
+		for (i = 0; i < NUM_DESCS / 2; i++)
+			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
+				== 0);
+	}
+
+	/* Tx */
+	xsk->tx.ring = mmap(NULL,
+			 sizeof(struct xdp_ring) +
+			 NUM_DESCS * sizeof(struct xdp_desc),
+			 PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_POPULATE, sfd,
+			 XDP_PGOFF_TX_RING);
+	lassert(xsk->tx.ring != MAP_FAILED);
+
+	xsk->rx.mask = NUM_DESCS - 1;
+	xsk->rx.size = NUM_DESCS;
+
+	xsk->tx.mask = NUM_DESCS - 1;
+	xsk->tx.size = NUM_DESCS;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = opt_ifindex;
+	sxdp.sxdp_queue_id = opt_queue;
+	if (shared) {
+		sxdp.sxdp_flags = XDP_SHARED_UMEM;
+		sxdp.sxdp_shared_umem_fd = umem->fd;
+	}
+
+	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
+
+	return xsk;
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
+	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
+		printf("xdp-skb ");
+	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
+		printf("xdp-drv ");
+	else
+		printf("	");
+
+	if (opt_poll)
+		printf("poll() ");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void dump_stats(void)
+{
+	unsigned long now = get_nsecs();
+	long dt = now - prev_time;
+	int i;
+
+	prev_time = now;
+
+	for (i = 0; i < num_socks; i++) {
+		char *fmt = "%-15s %'-11.0f %'-11lu\n";
+		double rx_pps, tx_pps;
+
+		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
+			 1000000000. / dt;
+		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
+			 1000000000. / dt;
+
+		printf("\n sock%d@", i);
+		print_benchmark(false);
+		printf("\n");
+
+		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
+		       dt / 1000000000.);
+		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
+		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
+
+		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
+		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
+	}
+}
+
+static void *poller(void *arg)
+{
+	(void)arg;
+	for (;;) {
+		sleep(opt_interval);
+		dump_stats();
+	}
+
+	return NULL;
+}
+
+static void int_exit(int sig)
+{
+	(void)sig;
+	dump_stats();
+	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
+	exit(EXIT_SUCCESS);
+}
+
+static struct option long_options[] = {
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"interface", required_argument, 0, 'i'},
+	{"queue", required_argument, 0, 'q'},
+	{"poll", no_argument, 0, 'p'},
+	{"shared-buffer", no_argument, 0, 's'},
+	{"xdp-skb", no_argument, 0, 'S'},
+	{"xdp-native", no_argument, 0, 'N'},
+	{"interval", required_argument, 0, 'n'},
+	{0, 0, 0, 0}
+};
+
+static void usage(const char *prog)
+{
+	const char *str =
+		"  Usage: %s [OPTIONS]\n"
+		"  Options:\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"  -q, --queue=n	Use queue n (default 0)\n"
+		"  -p, --poll		Use poll syscall\n"
+		"  -s, --shared-buffer	Use shared packet buffer\n"
+		"  -S, --xdp-skb		Use XDP skb-mode\n"
+		"  -N, --xdp-native	Enforce XDP native mode\n"
+		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
+		"\n";
+	fprintf(stderr, str, prog);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		case 'q':
+			opt_queue = atoi(optarg);
+			break;
+		case 's':
+			opt_shared_packet_buffer = 1;
+			break;
+		case 'p':
+			opt_poll = 1;
+			break;
+		case 'S':
+			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
+			break;
+		case 'n':
+			opt_interval = atoi(optarg);
+			break;
+		default:
+			usage(basename(argv[0]));
+		}
+	}
+
+	opt_ifindex = if_nametoindex(opt_if);
+	if (!opt_ifindex) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
+			opt_if);
+		usage(basename(argv[0]));
+	}
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
+		return;
+	lassert(0);
+}
+
+static inline void complete_tx_l2fwd(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+	size_t ndescs;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		 xsk->outstanding_tx;
+
+	/* re-add completed Tx buffers */
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
+	if (rcvd > 0) {
+		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static inline void complete_tx_only(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
+	if (rcvd > 0) {
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static void rx_drop(struct xdpsock *xsk)
+{
+	struct xdp_desc descs[BATCH_SIZE];
+	unsigned int rcvd, i;
+
+	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+	if (!rcvd)
+		return;
+
+	for (i = 0; i < rcvd; i++) {
+		u32 idx = descs[i].idx;
+
+		lassert(idx < NUM_FRAMES);
+#if DEBUG_HEXDUMP
+		char *pkt;
+		char buf[32];
+
+		pkt = xq_get_data(xsk, idx, descs[i].offset);
+		sprintf(buf, "idx=%u", idx);
+		hex_dump(pkt, descs[i].len, buf);
+#endif
+	}
+
+	xsk->rx_npkts += rcvd;
+
+	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
+}
+
+static void rx_drop_all(void)
+{
+	struct pollfd fds[MAX_SOCKS + 1];
+	int i, ret, timeout, nfds = num_socks;
+
+	memset(fds, 0, sizeof(fds));
+
+	for (i = 0; i < num_socks; i++) {
+		fds[i].fd = xsks[i]->sfd;
+		fds[i].events = POLLIN;
+		timeout = 1000; /* 1s */
+	}
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+		}
+
+		for (i = 0; i < num_socks; i++)
+			rx_drop(xsks[i]);
+	}
+}
+
+static void tx_only(struct xdpsock *xsk)
+{
+	int timeout, ret, nfds = 1;
+	struct pollfd fds[nfds + 1];
+	unsigned int idx = 0;
+
+	memset(fds, 0, sizeof(fds));
+	fds[0].fd = xsk->sfd;
+	fds[0].events = POLLOUT;
+	timeout = 1000; /* 1s */
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+
+			if (fds[0].fd != xsk->sfd ||
+			    !(fds[0].revents & POLLOUT))
+				continue;
+		}
+
+		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
+			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
+
+			xsk->outstanding_tx += BATCH_SIZE;
+			idx += BATCH_SIZE;
+			idx %= NUM_FRAMES;
+		}
+
+		complete_tx_only(xsk);
+	}
+}
+
+static void l2fwd(struct xdpsock *xsk)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			complete_tx_l2fwd(xsk);
+
+			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			char *pkt = xq_get_data(xsk, descs[i].idx,
+						descs[i].offset);
+
+			swap_mac_addresses(pkt);
+#if DEBUG_HEXDUMP
+			char buf[32];
+			u32 idx = descs[i].idx;
+
+			sprintf(buf, "idx=%u", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		xsk->rx_npkts += rcvd;
+
+		ret = xq_enq(&xsk->tx, descs, rcvd);
+		lassert(ret == 0);
+		xsk->outstanding_tx += rcvd;
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char xdp_filename[256];
+	int i, ret, key = 0;
+	pthread_t pt;
+
+	parse_command_line(argc, argv);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(xdp_filename)) {
+		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
+		fprintf(stderr, "ERROR: link set xdp fd failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
+	if (ret) {
+		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Create sockets... */
+	xsks[num_socks++] = xsk_configure(NULL);
+
+#if RR_LB
+	for (i = 0; i < MAX_SOCKS - 1; i++)
+		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
+#endif
+
+	/* ...and insert them into the map. */
+	for (i = 0; i < num_socks; i++) {
+		key = i;
+		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
+		if (ret) {
+			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+	signal(SIGABRT, int_exit);
+
+	setlocale(LC_ALL, "");
+
+	ret = pthread_create(&pt, NULL, poller, NULL);
+	lassert(ret == 0);
+
+	prev_time = get_nsecs();
+
+	if (opt_bench == BENCH_RXDROP)
+		rx_drop_all();
+	else if (opt_bench == BENCH_TXONLY)
+		tx_only(xsks[0]);
+	else
+		l2fwd(xsks[0]);
+
+	return 0;
+}
-- 
2.14.1
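
Editorial sketch, not part of the patch: the sample above sizes every
ring as a power of two and stores mask = size - 1 next to it. A minimal
single-threaded model of that indexing scheme could look like the code
below. The struct layout and the helper names (ring_init, ring_nb_free,
ring_enq, ring_deq) are hypothetical and are not the sample's own
helpers. The memory barriers that a real producer/consumer pair needs
are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u /* assumption: must be a power of two */

/* Hypothetical ring of frame ids; not the kernel's ring layout. */
struct ring {
	uint32_t producer;	/* free-running, wraps naturally at 2^32 */
	uint32_t consumer;	/* free-running, wraps naturally at 2^32 */
	uint32_t mask;		/* RING_SIZE - 1 */
	uint32_t size;
	uint32_t desc[RING_SIZE];
};

static void ring_init(struct ring *r)
{
	r->producer = 0;
	r->consumer = 0;
	r->mask = RING_SIZE - 1;
	r->size = RING_SIZE;
}

/* Number of slots the producer may still fill. */
static uint32_t ring_nb_free(const struct ring *r)
{
	return r->size - (r->producer - r->consumer);
}

/* Enqueue n frame ids, all or nothing. Returns 0, or -1 if no room. */
static int ring_enq(struct ring *r, const uint32_t *ids, uint32_t n)
{
	uint32_t i;

	if (ring_nb_free(r) < n)
		return -1;

	for (i = 0; i < n; i++)
		r->desc[(r->producer + i) & r->mask] = ids[i];
	r->producer += n;
	return 0;
}

/* Dequeue up to max frame ids. Returns the number dequeued. */
static uint32_t ring_deq(struct ring *r, uint32_t *ids, uint32_t max)
{
	uint32_t avail = r->producer - r->consumer;
	uint32_t n = avail < max ? avail : max;
	uint32_t i;

	for (i = 0; i < n; i++)
		ids[i] = r->desc[(r->consumer + i) & r->mask];
	r->consumer += n;
	return n;
}
```

Because the producer and consumer counters are never reduced modulo the
ring size, "full" and "empty" stay distinguishable without sacrificing a
slot; only the array index is masked.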

^ permalink raw reply related

* Re: [PATCH bpf-next v2 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-27 12:21 UTC (permalink / raw)
  To: Bjorn Topel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Michael S. Tsirkin, Netdev
  Cc: Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

2018-04-27 14:17 GMT+02:00 Björn Töpel <bjorn.topel@gmail.com>:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of a
> number of equally sized frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this patch set,
> all packet data is copied out to user-space.
>
> A new feature in this patch set is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
>
> How are packets then distributed between these two XSKs? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
>
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space. This is a fallback mode that works for any network device. On the
> other hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
>
> There is an xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, on which we will enable AF_XDP. Here, we use ethtool
> for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> that generates packets at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from V1 in parentheses.
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      3.0(2.9)   9.5(9.4)
> txpush      2.5(2.5)   NA*
> l2fwd       1.9(1.9)   2.5(2.4) (TX using XDP_SKB in both cases)
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB    XDP_DRV
> rxdrop      2.2(2.1)   3.3(3.3)
> l2fwd       1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this patch set. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
>
> XDP performance on our system as a baseline:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
>
> Changes from V1:
>
> * Fixes to bugs spotted by Will in his review
> * Implemented the performance optimization to BPF_MAP_TYPE_XSKMAP
>   suggested by Will
> * Refactored packet_direct_xmit to become a common function
>   in core/dev.c as suggested by Will
> * Added documentation as suggested by Jesper
> * Proper page unpinning as suggested by MST
> * Some minor code cleanups
>
> The structure of the patch set is as follows:
>
> Patches 1-3: Basic socket and umem plumbing
> Patches 4-9: RX support together with the new XSKMAP
> Patches 10-13: TX support
> Patch 14: Statistics support with getsockopt()
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit
>

Oops, I pressed play on tape too soon. We based it on commit
79741a38b4a2 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next").


Björn

> To do for this patch set:
>
> * Syzkaller torture session being worked on
>
> Post-series plan:
>
> * Optimize performance
>
> * Kernel selftest
>
> * Loadable kernel module support for AF_XDP would be nice. Unclear how to
>   achieve this though since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to the user space socket.
>
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
>
> * And of course getting to zero-copy support in small increments.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (7):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (8):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   dev: packet: make packet_direct_xmit a common function
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application and documentation for AF_XDP sockets
>
>  Documentation/networking/af_xdp.rst | 297 +++++++++++
>  Documentation/networking/index.rst  |   1 +
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/netdevice.h           |   1 +
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  66 +++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 272 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  73 ++-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/packet/af_packet.c              |  42 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 260 ++++++++++
>  net/xdp/xdp_umem.h                  |  67 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 656 +++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 247 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  36 files changed, 3255 insertions(+), 72 deletions(-)
>  create mode 100644 Documentation/networking/af_xdp.rst
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>
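
Editorial sketch: the XSKMAP paragraph in the quoted cover letter
describes three outcomes for a redirect (empty slot: drop; socket bound
to a different device or queue: drop; otherwise deliver). The userspace
model below only illustrates that decision. All names in it (model_xsk,
model_xskmap, model_redirect, the verdict enum) are hypothetical and do
not correspond to kernel structures, and treating an out-of-range index
as a drop is an assumption of the model.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_ENTRIES 4

enum verdict { VERDICT_DROP, VERDICT_REDIRECT };

/* Hypothetical stand-in for an AF_XDP socket after bind(). */
struct model_xsk {
	int ifindex;	/* device the socket was bound to */
	int queue_id;	/* queue on that device */
};

/* Hypothetical stand-in for the map: each slot holds a socket or NULL. */
struct model_xskmap {
	const struct model_xsk *slot[MAP_ENTRIES];
};

/*
 * Decide what happens to a packet that arrived on (rx_ifindex, rx_queue)
 * and that the XDP program wants to redirect to map slot 'index'.
 */
static enum verdict model_redirect(const struct model_xskmap *map,
				   unsigned int index,
				   int rx_ifindex, int rx_queue)
{
	const struct model_xsk *xs;

	if (index >= MAP_ENTRIES)	/* assumption: treated as a drop */
		return VERDICT_DROP;

	xs = map->slot[index];
	if (!xs)			/* empty slot */
		return VERDICT_DROP;

	/* The socket must be bound to the receiving device and queue. */
	if (xs->ifindex != rx_ifindex || xs->queue_id != rx_queue)
		return VERDICT_DROP;

	return VERDICT_REDIRECT;
}
```

With a socket bound to ifindex 3, queue 16 placed in slot 1, only a
packet received on that device and queue and redirected to slot 1 is
delivered in this model; every other combination is dropped.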

^ permalink raw reply

* Re: [PATCH net] pppoe: check sockaddr length in pppoe_connect()
From: Kevin Easton @ 2018-04-27 12:23 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev, Michal Ostrowski
In-Reply-To: <387ca48810af36f2626049008a795d1adc375cb8.1524494257.git.g.nault@alphalink.fr>

On Mon, Apr 23, 2018 at 04:38:27PM +0200, Guillaume Nault wrote:
> We must validate sockaddr_len, otherwise userspace can pass less data
> than we expect and we end up accessing invalid data.
> 
> Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
> Reported-by: syzbot+4f03bdf92fdf9ef5ddab@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
> ---
>  drivers/net/ppp/pppoe.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
> index 1483bc7b01e1..7df07337d69c 100644
> --- a/drivers/net/ppp/pppoe.c
> +++ b/drivers/net/ppp/pppoe.c
> @@ -620,6 +620,10 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>  	lock_sock(sk);
>  
>  	error = -EINVAL;
> +
> +	if (sockaddr_len != sizeof(struct sockaddr_pppox))
> +		goto end;
> +
>  	if (sp->sa_protocol != PX_PROTO_OE)
>  		goto end;

There's another bug here - pppoe_connect() should also be validating
sp->sa_family.  My suggested patch was going to be:

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 1483bc7..90eb3fd 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -620,6 +620,14 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
        lock_sock(sk);
 
        error = -EINVAL;
+       if (sockaddr_len < sizeof(struct sockaddr_pppox))
+               goto end;
+
+       error = -EAFNOSUPPORT;
+       if (sp->sa_family != AF_PPPOX)
+               goto end;
+
+       error = -EINVAL;
        if (sp->sa_protocol != PX_PROTO_OE)
                goto end;
 
Should I rework this on top of net.git HEAD?

(The same applies to pppol2tp_connect()).

    - Kevin
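
Editorial sketch: the ordering in the suggested diff above (length
first, then address family, then protocol) can be modelled in plain
userspace C as below. The struct, the constants and the function name
are hypothetical stand-ins rather than the kernel's sockaddr_pppox
definitions; only the order of the checks and the error codes follow the
diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative values for this model; not taken from kernel headers. */
#define MODEL_AF_PPPOX	24
#define MODEL_PROTO_OE	0

/* Hypothetical address structure with a family and a protocol field. */
struct model_sockaddr {
	unsigned short sa_family;
	unsigned int sa_protocol;
	unsigned char payload[24];
};

/*
 * Validate a caller-supplied address. The length is checked before any
 * field is read, so a short buffer is never dereferenced past its end.
 */
static int model_connect_check(const void *addr, size_t addr_len)
{
	const struct model_sockaddr *sp = addr;

	if (addr_len < sizeof(*sp))
		return -EINVAL;

	if (sp->sa_family != MODEL_AF_PPPOX)
		return -EAFNOSUPPORT;

	if (sp->sa_protocol != MODEL_PROTO_OE)
		return -EINVAL;

	return 0;
}
```

The model uses the "<" comparison from the suggested diff, which accepts
oversized addresses; the patch under review uses "!=" and would reject
them.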

^ permalink raw reply related

* Re: [PATCH net-next v8 2/4] net: Introduce generic failover module
From: kbuild test robot @ 2018-04-27 12:39 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: kbuild-all, mst, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	sridhar.samudrala, jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524700768-38627-3-git-send-email-sridhar.samudrala@intel.com>

Hi Sridhar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Sridhar-Samudrala/Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device/20180427-183842
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/core/net_failover.c:544:39: sparse: incorrect type in argument 1 (different address spaces) @@    expected struct net_device *dev @@    got struct net_devicestruct net_device *dev @@
   net/core/net_failover.c:544:39:    expected struct net_device *dev
   net/core/net_failover.c:544:39:    got struct net_device [noderef] <asn:4>*standby_dev
   net/core/net_failover.c:547:39: sparse: incorrect type in argument 1 (different address spaces) @@    expected struct net_device *dev @@    got struct net_devicestruct net_device *dev @@
   net/core/net_failover.c:547:39:    expected struct net_device *dev
   net/core/net_failover.c:547:39:    got struct net_device [noderef] <asn:4>*primary_dev
>> net/core/net_failover.c:112:12: sparse: context imbalance in 'net_failover_select_queue' - wrong count at exit

vim +544 net/core/net_failover.c

   446	
   447	static int net_failover_slave_register(struct net_device *slave_dev)
   448	{
   449		struct net_failover_info *nfo_info;
   450		struct net_failover_ops *nfo_ops;
   451		struct net_device *failover_dev;
   452		bool slave_is_standby;
   453		u32 orig_mtu;
   454		int err;
   455	
   456		ASSERT_RTNL();
   457	
   458		failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
   459		if (!failover_dev)
   460			goto done;
   461	
   462		if (failover_dev->type != slave_dev->type)
   463			goto done;
   464	
   465		if (nfo_ops && nfo_ops->slave_register)
   466			return nfo_ops->slave_register(slave_dev, failover_dev);
   467	
   468		nfo_info = netdev_priv(failover_dev);
   469		slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);
   470		if (slave_is_standby ? rtnl_dereference(nfo_info->standby_dev) :
   471				rtnl_dereference(nfo_info->primary_dev)) {
   472			netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
   473				   slave_dev->name,
   474				   slave_is_standby ? "standby" : "primary");
   475			goto done;
   476		}
   477	
   478		/* We want to allow only a direct attached VF device as a primary
   479		 * netdev. As there is no easy way to check for a VF device, restrict
   480		 * this to a pci device.
   481		 */
   482		if (!slave_is_standby && (!slave_dev->dev.parent ||
   483					  !dev_is_pci(slave_dev->dev.parent)))
   484			goto done;
   485	
   486		if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
   487		    vlan_uses_dev(failover_dev)) {
   488			netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
   489				   failover_dev->name);
   490			goto done;
   491		}
   492	
   493		/* Align MTU of slave with failover dev */
   494		orig_mtu = slave_dev->mtu;
   495		err = dev_set_mtu(slave_dev, failover_dev->mtu);
   496		if (err) {
   497			netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
   498				   slave_dev->name, failover_dev->mtu);
   499			goto done;
   500		}
   501	
   502		dev_hold(slave_dev);
   503	
   504		if (netif_running(failover_dev)) {
   505			err = dev_open(slave_dev);
   506			if (err && (err != -EBUSY)) {
   507				netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
   508					   slave_dev->name, err);
   509				goto err_dev_open;
   510			}
   511		}
   512	
   513		netif_addr_lock_bh(failover_dev);
   514		dev_uc_sync_multiple(slave_dev, failover_dev);
   515		dev_uc_sync_multiple(slave_dev, failover_dev);
   516		netif_addr_unlock_bh(failover_dev);
   517	
   518		err = vlan_vids_add_by_dev(slave_dev, failover_dev);
   519		if (err) {
   520			netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
   521				   slave_dev->name, err);
   522			goto err_vlan_add;
   523		}
   524	
   525		err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
   526						 failover_dev);
   527		if (err) {
   528			netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
   529				   err);
   530			goto err_handler_register;
   531		}
   532	
   533		err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
   534		if (err) {
   535			netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
   536				   failover_dev->name, err);
   537			goto err_upper_link;
   538		}
   539	
   540		slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
   541	
   542		if (slave_is_standby) {
   543			rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
 > 544			dev_get_stats(nfo_info->standby_dev, &nfo_info->standby_stats);
   545		} else {
   546			rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
   547			dev_get_stats(nfo_info->primary_dev, &nfo_info->primary_stats);
   548			failover_dev->min_mtu = slave_dev->min_mtu;
   549			failover_dev->max_mtu = slave_dev->max_mtu;
   550		}
   551	
   552		net_failover_compute_features(failover_dev);
   553	
   554		call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
   555	
   556		netdev_info(failover_dev, "failover %s slave:%s registered\n",
   557			    slave_is_standby ? "standby" : "primary", slave_dev->name);
   558	
   559		goto done;
   560	
   561	err_upper_link:
   562		netdev_rx_handler_unregister(slave_dev);
   563	err_handler_register:
   564		vlan_vids_del_by_dev(slave_dev, failover_dev);
   565	err_vlan_add:
   566		dev_uc_unsync(slave_dev, failover_dev);
   567		dev_mc_unsync(slave_dev, failover_dev);
   568		dev_close(slave_dev);
   569	err_dev_open:
   570		dev_put(slave_dev);
   571		dev_set_mtu(slave_dev, orig_mtu);
   572	done:
   573		return NOTIFY_DONE;
   574	}
   575	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCHv2 net] bridge: check iface upper dev when setting master via ioctl
From: Hangbin Liu @ 2018-04-27 12:59 UTC (permalink / raw)
  To: netdev
  Cc: Nikolay Aleksandrov, Dmitry Vyukov, syzbot, David Miller,
	Hangbin Liu
In-Reply-To: <1524750986-23904-1-git-send-email-liuhangbin@gmail.com>

When we set a bond slave's master to bridge via ioctl, we only check
the IFF_BRIDGE_PORT flag. Although we will find the slave's real master
at netdev_master_upper_dev_link() later, it already does some settings
and allocates some resources. It would be better to return as early
as possible.

v1 -> v2:
use netdev_master_upper_dev_get() instead of netdev_has_any_upper_dev()
to check if we have a master, because not all upper devs are masters,
e.g. vlan device.

Reported-by: syzbot+de73361ee4971b6e6f75@syzkaller.appspotmail.com
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/bridge/br_if.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 82c1a6f..5bb6681 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -518,8 +518,8 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 		return -ELOOP;
 	}

-	/* Device is already being bridged */
-	if (br_port_exists(dev))
+	/* Device has master upper dev */
+	if (netdev_master_upper_dev_get(dev))
 		return -EBUSY;

 	/* No bridging devices that dislike that (e.g. wireless) */
-- 
2.5.5

^ permalink raw reply related

* Re: [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
From: Eric Dumazet @ 2018-04-27 13:03 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, David Miller, netdev, Andy Lutomirski, LKML, linux-mm,
	Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <201804271455.cJQuTeDc%fengguang.wu@intel.com>

On Fri, Apr 27, 2018 at 1:45 AM kbuild test robot <lkp@intel.com> wrote:

> Hi Eric,

> Thank you for the patch! Yet something to improve:

> [auto build test ERROR on net-next/master]

> url: https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-add-TCP_ZEROCOPY_RECEIVE-support-for-zerocopy-receive/20180427-122234
> config: sh-rsk7269_defconfig (attached as .config)
> compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # save the attached .config to linux build tree
>          make.cross ARCH=sh

> All errors (new ones prefixed by >>):

>     net/ipv4/tcp.o: In function `tcp_setsockopt':
> >> tcp.c:(.text+0x3f80): undefined reference to `zap_page_range'

I guess this tcp zerocopy stuff depends on CONFIG_MMU

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v4] Add Common Applications Kept Enhanced (cake) qdisc
From: Eric Dumazet @ 2018-04-27 13:17 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, netdev; +Cc: cake, Dave Taht
In-Reply-To: <20180427121706.23273-1-toke@toke.dk>



On 04/27/2018 05:17 AM, Toke Høiland-Jørgensen wrote:

...

> +
> +static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
> +				       struct cake_flow *flow)
> +{
> +	int seglen;
> +	struct sk_buff *skb = flow->tail, *skb_check, *skb_check_prev;
> +	struct iphdr *iph, *iph_check;
> +	struct ipv6hdr *ipv6h, *ipv6h_check;
> +	struct tcphdr *tcph, *tcph_check;
> +	bool otherconn_ack_seen = false;
> +	struct sk_buff *otherconn_checked_to = NULL;
> +	bool thisconn_redundant_seen = false, thisconn_seen_last = false;
> +	struct sk_buff *thisconn_checked_to = NULL, *thisconn_ack = NULL;
> +	bool aggressive = q->ack_filter == CAKE_ACK_AGGRESSIVE;
> +
> +	/* no other possible ACKs to filter */
> +	if (flow->head == skb)
> +		return NULL;
> +
> +	iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
> +	ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb) : ipv6_hdr(skb);
> +
> +	/* check that the innermost network header is v4/v6, and contains TCP */
> +	if (pskb_may_pull(skb, ((unsigned char *)iph - skb->head) + sizeof(struct iphdr)) &&
> +	    iph->version == 4) {
> +		if (iph->protocol != IPPROTO_TCP)
> +			return NULL;
> +		seglen = ntohs(iph->tot_len) - (4 * iph->ihl);
> +		tcph = (struct tcphdr *)((void *)iph + (4 * iph->ihl));
> +		if (!pskb_may_pull(skb, ((unsigned char *)tcph - skb->head) + sizeof(struct tcphdr)))
> +			return NULL;
> +	} else if (pskb_may_pull(skb, ((unsigned char *)ipv6h - skb->head) + sizeof(struct ipv6hdr) + sizeof(struct tcphdr)) &&
> +	           ipv6h->version == 6) {
> +		if (ipv6h->nexthdr != IPPROTO_TCP)
> +			return NULL;
> +		seglen = ntohs(ipv6h->payload_len);
> +		tcph = (struct tcphdr *)((void *)ipv6h +
> +					 sizeof(struct ipv6hdr));
> +	} else {
> +		return NULL;
> +	}
> +


This is still broken.

After pskb_may_pull(), skb->head might have been reallocated.

You need to recompute iph, ipv6h and tcph; otherwise you are reading freed memory and will crash kernels
with sufficient debugging (KASAN and other CONFIG_DEBUG_PAGEALLOC / CONFIG_DEBUG_SLAB-like options).
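
The hazard described above can be modelled in plain userspace C. This is
only a sketch under stated assumptions: struct toy_skb, toy_may_pull() and
toy_ip_version() are hypothetical stand-ins invented for illustration, not
kernel APIs, and the toy helper moves the buffer on every call to model the
worst case, in which pskb_may_pull() reallocates skb->head. The idea is to
carry an offset across the call and derive the header pointer only
afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an sk_buff: just a linear buffer and its length. */
struct toy_skb {
	unsigned char *head;
	size_t len;
};

/*
 * Toy stand-in for pskb_may_pull(). It moves the buffer on every
 * successful call, so any pointer taken before the call is stale.
 * Returns 1 on success, 0 if fewer than 'need' bytes are available
 * or the allocation fails.
 */
static int toy_may_pull(struct toy_skb *skb, size_t need)
{
	unsigned char *fresh;

	if (need > skb->len)
		return 0;
	fresh = malloc(skb->len);
	if (!fresh)
		return 0;
	memcpy(fresh, skb->head, skb->len);
	free(skb->head);		/* old pointers now dangle */
	skb->head = fresh;
	return 1;
}

/*
 * Returns the IP version nibble of the header at 'net_off', or -1 if
 * a 20-byte header is not available. Only the offset is kept across
 * the pull; the pointer is computed afterwards.
 */
static int toy_ip_version(struct toy_skb *skb, size_t net_off)
{
	const unsigned char *iph;

	if (!toy_may_pull(skb, net_off + 20))
		return -1;
	iph = skb->head + net_off;	/* derived after the pull */
	return iph[0] >> 4;
}
```

Calling toy_ip_version() on a 64-byte buffer whose byte at offset 14 is
0x45 yields 4. Taking the iph pointer before toy_may_pull() instead would
read freed memory, which is the pattern flagged in the review.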

^ permalink raw reply

* Re: [PATCH] ptp_pch: use helpers function for converting between ns and timespec
From: Richard Cochran @ 2018-04-27 13:26 UTC (permalink / raw)
  To: YueHaibing; +Cc: davem, netdev, linux-kernel
In-Reply-To: <20180427073618.12036-1-yuehaibing@huawei.com>

On Fri, Apr 27, 2018 at 03:36:18PM +0800, YueHaibing wrote:
> use ns_to_timespec64() and timespec64_to_ns() instead of open coding

Acked-by: Richard Cochran <richardcochran@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next v4] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-04-27 13:38 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: cake, Dave Taht
In-Reply-To: <efb401ac-bc79-cbc5-cd03-120803b65b4d@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 04/27/2018 05:17 AM, Toke Høiland-Jørgensen wrote:
>
> ...
>
>> +
>> +static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
>> +				       struct cake_flow *flow)
>> +{
>> +	int seglen;
>> +	struct sk_buff *skb = flow->tail, *skb_check, *skb_check_prev;
>> +	struct iphdr *iph, *iph_check;
>> +	struct ipv6hdr *ipv6h, *ipv6h_check;
>> +	struct tcphdr *tcph, *tcph_check;
>> +	bool otherconn_ack_seen = false;
>> +	struct sk_buff *otherconn_checked_to = NULL;
>> +	bool thisconn_redundant_seen = false, thisconn_seen_last = false;
>> +	struct sk_buff *thisconn_checked_to = NULL, *thisconn_ack = NULL;
>> +	bool aggressive = q->ack_filter == CAKE_ACK_AGGRESSIVE;
>> +
>> +	/* no other possible ACKs to filter */
>> +	if (flow->head == skb)
>> +		return NULL;
>> +
>> +	iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
>> +	ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb) : ipv6_hdr(skb);
>> +
>> +	/* check that the innermost network header is v4/v6, and contains TCP */
>> +	if (pskb_may_pull(skb, ((unsigned char *)iph - skb->head) + sizeof(struct iphdr)) &&
>> +	    iph->version == 4) {
>> +		if (iph->protocol != IPPROTO_TCP)
>> +			return NULL;
>> +		seglen = ntohs(iph->tot_len) - (4 * iph->ihl);
>> +		tcph = (struct tcphdr *)((void *)iph + (4 * iph->ihl));
>> +		if (!pskb_may_pull(skb, ((unsigned char *)tcph - skb->head) + sizeof(struct tcphdr)))
>> +			return NULL;
>> +	} else if (pskb_may_pull(skb, ((unsigned char *)ipv6h - skb->head) + sizeof(struct ipv6hdr) + sizeof(struct tcphdr)) &&
>> +	           ipv6h->version == 6) {
>> +		if (ipv6h->nexthdr != IPPROTO_TCP)
>> +			return NULL;
>> +		seglen = ntohs(ipv6h->payload_len);
>> +		tcph = (struct tcphdr *)((void *)ipv6h +
>> +					 sizeof(struct ipv6hdr));
>> +	} else {
>> +		return NULL;
>> +	}
>> +
>
>
> This is still broken.
>
> After pskb_may_pull(), skb->head might have been reallocated.
>
> You need to recompute iph , ipv6h, tcph, otherwise you are reading
> freed memory and crash kernels with sufficient debugging (KASAN and
> other CONFIG_DEBUG_PAGEALLOC / CONFIG_DEBUG_SLAB like options)

Ah, right. Will fix.

Is it safe to dereference the iph pointer before calling
pskb_may_pull()?

-Toke

^ permalink raw reply

* Re: [PATCH net-next v2 4/7] net: mscc: Add initial Ocelot switch support
From: Alexandre Belloni @ 2018-04-27 13:44 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: David S . Miller, Allan Nielsen, razvan.stefanescu, po.liu,
	Thomas Petazzoni, Florian Fainelli, netdev, devicetree,
	linux-kernel, linux-mips
In-Reply-To: <20180426210915.GE23481@lunn.ch>

On 26/04/2018 23:09:15+0200, Andrew Lunn wrote:
> > +/* Checks if the net_device instance given to us originate from our driver. */
> > +static bool ocelot_netdevice_dev_check(const struct net_device *dev)
> > +{
> > +	return dev->netdev_ops == &ocelot_port_netdev_ops;
> > +}
> 
> This is probably O.K. now, but when you add support for controlling
> the switch over PCIe, i think it breaks. A board could have two
> switches...
> 
> It might be possible to do something with dev->parent. All ports of a
> switch should have the same parent.
> 

Actually, that is fine because it simply ensures netdev_priv(dev); is a
struct ocelot_port.

Later on, we get ocelot_port->ocelot and do the right thing.

The only thing that would not work with several of these switches on the
same platform is putting interfaces from different switches in the same
bridge. Anyway, this is definitely not something we want because of the
limited bandwidth of the CPU port.


-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com

^ permalink raw reply

* Re: [PATCH net-next v4] Add Common Applications Kept Enhanced (cake) qdisc
From: Eric Dumazet @ 2018-04-27 13:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Eric Dumazet, netdev; +Cc: cake, Dave Taht
In-Reply-To: <87in8c4vn7.fsf@toke.dk>



On 04/27/2018 06:38 AM, Toke Høiland-Jørgensen wrote:
> 
> Ah, right. Will fix.
> 
> Is it safe to dereference the iph pointer before calling
> pskb_may_pull()?

No, please take a look at ip_rcv() for a typical use case.

^ permalink raw reply

* Re: [PATCH net-next v4] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-04-27 13:45 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: cake, Dave Taht
In-Reply-To: <673fc1b9-809c-2e14-a054-7eb8beb9a8fd@gmail.com>

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 04/27/2018 06:38 AM, Toke Høiland-Jørgensen wrote:
>> 
>> Ah, right. Will fix.
>> 
>> Is it safe to dereference the iph pointer before calling
>> pskb_may_pull()?
>
> No, please take a look at ip_rcv() for a typical use case.

Will do, thanks.

-Toke

^ permalink raw reply

* Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally
From: Greg KH @ 2018-04-27 13:51 UTC (permalink / raw)
  To: Thomas Deutschmann; +Cc: stable, davem, nicolas.dichtel, netdev
In-Reply-To: <40404f68-8328-8ed2-15bf-9de38830f796@gentoo.org>

On Fri, Apr 27, 2018 at 02:20:07PM +0200, Thomas Deutschmann wrote:
> On 2018-04-22 23:50, Thomas Deutschmann wrote:
> > Hi,
> > 
> > please add
> > 
> >> From f15ca723c1ebe6c1a06bc95fda6b62cd87b44559 Mon Sep 17 00:00:00 2001
> >> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> >> Date: Thu, 25 Jan 2018 19:03:03 +0100
> >> Subject: net: don't call update_pmtu unconditionally
> >>
> >> Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
> >> "BUG: unable to handle kernel NULL pointer dereference at           (null)"
> >>
> >> Let's add a helper to check if update_pmtu is available before calling it.
> >>
> >> Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
> >> Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
> >> CC: Roman Kapl <code@rkapl.cz>
> >> CC: Xin Long <lucien.xin@gmail.com>
> >> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> >> Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > to 4.14.x.
> > 
> > This fixes NULL derefs caused by a93bf0ff4490 ("vxlan: update
> > skb dst pmtu on tx path"), which was backported to 4.14.24.
> 
> *ping* - Not yet applied and not yet queued. Is there a problem with the
> patch which prevents a cherry-pick for 4.14.x?

This looks like an "obvious" fix for me to pick up.

Dave, any objections for me just grabbing it as-is?

thanks,

greg k-h

^ permalink raw reply

* [PATCH 1/2] bpf: btf: silence uninitialize variable warnings
From: Dan Carpenter @ 2018-04-27 14:04 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin KaFai Lau
  Cc: Daniel Borkmann, netdev, linux-kernel, kernel-janitors

Smatch complains that size can be uninitialized if btf_type_id_size()
returns NULL.  It seems reasonable enough to check for that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
This goes to the BPF tree (linux-next).

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 22e1046a1a86..e631b6fd60d3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1229,7 +1229,8 @@ static int btf_array_check_member(struct btf_verifier_env *env,
 	}
 
 	array_type_id = member->type;
-	btf_type_id_size(btf, &array_type_id, &array_size);
+	if (!btf_type_id_size(btf, &array_type_id, &array_size))
+		return -EINVAL;
 	struct_size = struct_type->size;
 	bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off);
 	if (struct_size - bytes_offset < array_size) {
@@ -1351,6 +1352,8 @@ static void btf_array_seq_show(const struct btf *btf, const struct btf_type *t,
 
 	elem_type_id = array->type;
 	elem_type = btf_type_id_size(btf, &elem_type_id, &elem_size);
+	if (!elem_type)
+		return;
 	elem_ops = btf_type_ops(elem_type);
 	seq_puts(m, "[");
 	for (i = 0; i < array->nelems; i++) {

^ permalink raw reply related

* [PATCH 2/2] bpf: btf: remove a couple conditions
From: Dan Carpenter @ 2018-04-27 14:04 UTC (permalink / raw)
  To: Alexei Starovoitov, Martin KaFai Lau
  Cc: Daniel Borkmann, netdev, linux-kernel, kernel-janitors

We know "err" is zero so we can remove these and pull the code in one
indent level.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
This applies to the BPF tree (linux-next)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index e631b6fd60d3..7cb0905f37c2 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1973,16 +1973,14 @@ static struct btf *btf_parse(void __user *btf_data, u32 btf_data_size,
 	if (err)
 		goto errout;
 
-	if (!err && log->level && bpf_verifier_log_full(log)) {
+	if (log->level && bpf_verifier_log_full(log)) {
 		err = -ENOSPC;
 		goto errout;
 	}
 
-	if (!err) {
-		btf_verifier_env_free(env);
-		btf_get(btf);
-		return btf;
-	}
+	btf_verifier_env_free(env);
+	btf_get(btf);
+	return btf;
 
 errout:
 	btf_verifier_env_free(env);

^ permalink raw reply related

* [PATCH net 0/2] sfc: more ARFS fixes
From: Edward Cree @ 2018-04-27 14:07 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev

A couple more bits of breakage in my recent ARFS and async filters work.
Patch #1 in particular fixes a bug that leads to memory trampling and
 consequent crashes.

Edward Cree (2):
  sfc: Use filter index rather than ID for rps_flow_id table
  sfc: fix ARFS expiry check on EF10

 drivers/net/ethernet/sfc/ef10.c | 5 +++--
 drivers/net/ethernet/sfc/rx.c   | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

^ permalink raw reply

* [PATCH net 1/2] sfc: Use filter index rather than ID for rps_flow_id table
From: Edward Cree @ 2018-04-27 14:08 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev
In-Reply-To: <480b987f-2dad-96d9-22ee-d2c25f0c3d92@solarflare.com>

efx->type->filter_insert() returns an ID rather than the index that
 efx->type->filter_async_insert() used to, which causes it to exceed
 efx->type->max_rx_ip_filters on some EF10 configurations, leading to out-
 of-bounds array writes.
So, in efx_filter_rfs_work(), convert this back into an index (which is
 what the remove call in the expiry path expects, anyway).

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/rx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 64a94f242027..d2e254f2f72b 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -839,6 +839,8 @@ static void efx_filter_rfs_work(struct work_struct *data)
 	int rc;
 
 	rc = efx->type->filter_insert(efx, &req->spec, true);
+	if (rc >= 0)
+		rc %= efx->type->max_rx_ip_filters;
 	if (efx->rps_hash_table) {
 		spin_lock_bh(&efx->rps_hash_lock);
 		rule = efx_rps_hash_find(efx, &req->spec);

^ permalink raw reply related

* [PATCH net 2/2] sfc: fix ARFS expiry check on EF10
From: Edward Cree @ 2018-04-27 14:08 UTC (permalink / raw)
  To: linux-net-drivers, David Miller; +Cc: netdev
In-Reply-To: <480b987f-2dad-96d9-22ee-d2c25f0c3d92@solarflare.com>

Owing to a missing conditional, the result of rps_may_expire_flow() was
 being ignored and filters were being removed even if we'd decided not to
 expire them.

Fixes: f8d6203780b7 ("sfc: ARFS filter IDs")
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/ef10.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c
index 63036d9bf3e6..d90a7b1f4088 100644
--- a/drivers/net/ethernet/sfc/ef10.c
+++ b/drivers/net/ethernet/sfc/ef10.c
@@ -4784,8 +4784,9 @@ static bool efx_ef10_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id,
 	 * will set rule->filter_id to EFX_ARFS_FILTER_ID_PENDING, meaning that
 	 * the rule is not removed by efx_rps_hash_del() below.
 	 */
-	ret = efx_ef10_filter_remove_internal(efx, 1U << spec->priority,
-					      filter_idx, true) == 0;
+	if (ret)
+		ret = efx_ef10_filter_remove_internal(efx, 1U << spec->priority,
+						      filter_idx, true) == 0;
 	/* While we can't safely dereference rule (we dropped the lock), we can
 	 * still test it for NULL.
 	 */

^ permalink raw reply related

* tc: Using u32 filter
From: Jose Abreu @ 2018-04-27 14:15 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: Joao Pinto

Hi,

I'm trying to use u32 filter to filter specific fields of packets
by HW *only* but I'm having a hard time in trying to run tc to
configure it.
I implemented a dummy .ndo_setup_tc callback which always returns
success and I set NETIF_F_HW_TC field in hw_features. Then I run
tc, like this:

    # tc filter add dev eth0 u32 skip_sw sample u32 20 ffff at 0

At this stage I don't really care about the packet content (the
"20 ffff at 0"); I just want to see the configuration reaching my
driver, but I'm getting an "RTNETLINK answers: Operation not
supported" error.

Can you tell me what I am doing wrong?

Thanks and Best Regards,
Jose Miguel Abreu

^ permalink raw reply

* Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally
From: David Miller @ 2018-04-27 14:39 UTC (permalink / raw)
  To: gregkh; +Cc: whissi, stable, nicolas.dichtel, netdev
In-Reply-To: <20180427135125.GA31860@kroah.com>

From: Greg KH <gregkh@linuxfoundation.org>
Date: Fri, 27 Apr 2018 15:51:25 +0200

> On Fri, Apr 27, 2018 at 02:20:07PM +0200, Thomas Deutschmann wrote:
>> On 2018-04-22 23:50, Thomas Deutschmann wrote:
>> > Hi,
>> > 
>> > please add
>> > 
>> >> From f15ca723c1ebe6c1a06bc95fda6b62cd87b44559 Mon Sep 17 00:00:00 2001
>> >> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> >> Date: Thu, 25 Jan 2018 19:03:03 +0100
>> >> Subject: net: don't call update_pmtu unconditionally
>> >>
>> >> Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
>> >> "BUG: unable to handle kernel NULL pointer dereference at           (null)"
>> >>
>> >> Let's add a helper to check if update_pmtu is available before calling it.
>> >>
>> >> Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
>> >> Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
>> >> CC: Roman Kapl <code@rkapl.cz>
>> >> CC: Xin Long <lucien.xin@gmail.com>
>> >> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
>> >> Signed-off-by: David S. Miller <davem@davemloft.net>
>> > 
>> > to 4.14.x.
>> > 
>> > This fixes NULL derefs caused by a93bf0ff4490 ("vxlan: update
>> > skb dst pmtu on tx path"), which was backported to 4.14.24.
>> 
>> *ping* - Not yet applied and not yet queued. Is there a problem with the
>> patch which prevents a cherry-pick for 4.14.x?
> 
> This looks like an "obvious" fix for me to pick up.
> 
> Dave, any objections for me just grabbing it as-is?

No objections, thanks everyone.

^ permalink raw reply

* Re: Request for stable 4.14.x inclusion: net: don't call update_pmtu unconditionally
From: Greg KH @ 2018-04-27 14:45 UTC (permalink / raw)
  To: Thomas Deutschmann; +Cc: stable, davem, nicolas.dichtel, netdev
In-Reply-To: <20180427135125.GA31860@kroah.com>

On Fri, Apr 27, 2018 at 03:51:25PM +0200, Greg KH wrote:
> On Fri, Apr 27, 2018 at 02:20:07PM +0200, Thomas Deutschmann wrote:
> > On 2018-04-22 23:50, Thomas Deutschmann wrote:
> > > Hi,
> > > 
> > > please add
> > > 
> > >> From f15ca723c1ebe6c1a06bc95fda6b62cd87b44559 Mon Sep 17 00:00:00 2001
> > >> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> > >> Date: Thu, 25 Jan 2018 19:03:03 +0100
> > >> Subject: net: don't call update_pmtu unconditionally
> > >>
> > >> Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to:
> > >> "BUG: unable to handle kernel NULL pointer dereference at           (null)"
> > >>
> > >> Let's add a helper to check if update_pmtu is available before calling it.
> > >>
> > >> Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
> > >> Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
> > >> CC: Roman Kapl <code@rkapl.cz>
> > >> CC: Xin Long <lucien.xin@gmail.com>
> > >> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> > >> Signed-off-by: David S. Miller <davem@davemloft.net>
> > > 
> > > to 4.14.x.
> > > 
> > > This fixes NULL derefs caused by a93bf0ff4490 ("vxlan: update
> > > skb dst pmtu on tx path"), which was backported to 4.14.24.
> > 
> > *ping* - Not yet applied and not yet queued. Is there a problem with the
> > patch which prevents a cherry-pick for 4.14.x?
> 
> This looks like an "obvious" fix for me to pick up.

Well, it would be "obvious" if it actually applied to the 4.14.y tree :(

Thomas, did you try this patch out?  I can't apply it as-is, it will
need a backport.  Please work on that, and test it out, as I don't get
the impression that you did that here.

Then post the working backport and I'll be glad to consider it for
future 4.14.y releases.

thanks,

greg k-h

^ permalink raw reply

* Re: ip6-in-ip{4,6} ipsec tunnel issues with 1280 MTU
From: David Ahern @ 2018-04-27 14:48 UTC (permalink / raw)
  To: Ashwanth Goli, Paolo Abeni; +Cc: netdev, maloney, edumazet, netdev-owner
In-Reply-To: <a3e39b94de731f86d2e9ecd8f0230643@codeaurora.org>

On 4/27/18 5:02 AM, Ashwanth Goli wrote:
> On 2018-04-26 17:21, Paolo Abeni wrote:
>> Hi,
>>
>> [fixed CC list]
>>
>> On Wed, 2018-04-25 at 21:43 +0530, Ashwanth Goli wrote:
>>> Hi Pablo,
>>
>> Actually I'm Paolo, but yours is a recurring mistake ;)
>>
>>> I am noticing an issue similar to the one reported by Alexis Perez
>>> [Regression for ip6-in-ip4 IPsec tunnel in 4.14.16]
>>>
>>> In my IPsec setup outer MTU is set to 1280, ip6_setup_cork sees an MTU
>>> less than IPV6_MIN_MTU because of the tunnel headers. -EINVAL is being
>>> returned as a result of the MTU check that got added with below patch.

If you know you are running ipsec over the link why are you setting the
outer MTU to 1280? RFC 2460 suggests the fragmentation of packets for
links with MTU < 1280 should be done below the IPv6 layer:

5. Packet Size Issues

   IPv6 requires that every link in the internet have an MTU of 1280
   octets or greater.  On any link that cannot convey a 1280-octet
   packet in one piece, link-specific fragmentation and reassembly must
   be provided at a layer below IPv6.

   Links that have a configurable MTU (for example, PPP links [RFC-
   1661]) must be configured to have an MTU of at least 1280 octets; it
   is recommended that they be configured with an MTU of 1500 octets or
   greater, to accommodate possible encapsulations (i.e., tunneling)
   without incurring IPv6-layer fragmentation.
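
The arithmetic behind the -EINVAL can be sketched in a few lines of C. This
is a minimal illustration, not kernel code: inner_mtu() and
inner_mtu_ok_for_ipv6() are hypothetical helper names, and the
encapsulation overhead is passed in by the caller because the real figure
depends on the IPsec mode and algorithms in use.

```c
#include <assert.h>
#include <stdbool.h>

#define IPV6_MIN_MTU 1280	/* minimum link MTU from RFC 2460, section 5 */

/*
 * MTU left for the inner packet once 'overhead' bytes of tunnel
 * encapsulation are taken out of the outer link MTU. Returns 0 if
 * the overhead does not fit at all.
 */
static unsigned int inner_mtu(unsigned int outer_mtu, unsigned int overhead)
{
	return overhead < outer_mtu ? outer_mtu - overhead : 0;
}

/* True if the inner MTU still meets the IPv6 minimum. */
static bool inner_mtu_ok_for_ipv6(unsigned int outer_mtu, unsigned int overhead)
{
	return inner_mtu(outer_mtu, overhead) >= IPV6_MIN_MTU;
}
```

With an outer MTU of 1280, any non-zero overhead leaves less than
IPV6_MIN_MTU for the inner IPv6 packet, so the check fails. With an outer
MTU of 1500 and an assumed overhead of 73 bytes, the inner MTU is 1427 and
the check passes.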

^ permalink raw reply

* Re: [PATCH v2] net: qrtr: Expose tunneling endpoint to user space
From: David Miller @ 2018-04-27 14:55 UTC (permalink / raw)
  To: bjorn.andersson; +Cc: clew, linux-kernel, netdev, linux-arm-msm
In-Reply-To: <20180423214653.10016-1-bjorn.andersson@linaro.org>

From: Bjorn Andersson <bjorn.andersson@linaro.org>
Date: Mon, 23 Apr 2018 14:46:53 -0700

> +	count = min_t(size_t, iov_iter_count(to), skb->len);
> +	if (copy_to_iter(skb->data, count, to) != count)
> +		count = -EFAULT;
> +
> +	kfree_skb(skb);

As noted by Chris, you should be using consume_skb() here.

Thanks.

^ permalink raw reply
