Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Francois Romieu @ 2016-12-07 23:15 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue,
	pavel, davem, linux-kernel, netdev
In-Reply-To: <1481141138-19466-2-git-send-email-LinoSanfilippo@gmx.de>

Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> The driver uses a private lock for synchronization between the xmit
> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> is not set, the xmit function is also called with the xmit_lock held.
> 
> On the other hand the xmit completion handler first takes the private lock
> and (in case that the tx queue has been stopped) the xmit_lock, leading
> to a reverse locking order and the potential danger of a deadlock.

netif_tx_stop_queue is used by:
1. xmit function before releasing lock and returning.
2. sxgbe_restart_tx_queue()
   <- sxgbe_tx_interrupt
   <- sxgbe_reset_all_tx_queues()
      <- sxgbe_tx_timeout()

Given xmit won't be called again until tx queue is enabled, it's not clear
how a deadlock could happen due to #1.

Regardless of deadlocks anywhere else, #2 has some serious problem due to
the lack of exclusion between the tx queue restart handler and the xmit
handler.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
From: Pavel Machek @ 2016-12-07 23:21 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue,
	davem, linux-kernel, netdev
In-Reply-To: <afcae723-5804-305f-2756-c4dbf9045943@gmx.de>

[-- Attachment #1: Type: text/plain, Size: 1945 bytes --]

On Wed 2016-12-07 23:34:19, Lino Sanfilippo wrote:
> On 07.12.2016 22:43, Lino Sanfilippo wrote:
> > Hi Pavel,
> > 
> > On 07.12.2016 22:37, Pavel Machek wrote:
> >> On Wed 2016-12-07 21:05:38, Lino Sanfilippo wrote:
> >>> The driver uses a private lock for synchronization between the xmit
> >>> function and the xmit completion handler, but since the NETIF_F_LLTX flag
> >>> is not set, the xmit function is also called with the xmit_lock held.
> >>> 
> >>> On the other hand the xmit completion handler first takes the private lock
> >>> and (in case that the tx queue has been stopped) the xmit_lock, leading to
> >>> a reverse locking order and the potential danger of a deadlock.
> >>> 
> >>> Fix this by removing the private lock completely and synchronizing the xmit
> >>> function and completion handler solely by means of the xmit_lock. By doing
> >>> this remove also the now unnecessary double check for a stopped tx queue.
> >>> 
> >> 
> >> FYI, here's modified version. I believe _bh versions are needed, and
> >> I'm testing that version now. (Oh and I also ported it to net-next).
> >> 
> >> It survived 30 minutes of testing so far...
> >> 
> > 
> > First off, thanks for testing.
> > Hmm. I dont understand why _bh would be needed. We call that function from
> > BH context only (napi poll and timer).
> > Any idea?
> > 
> 
> Could this once again be caused by irq coalescing? When the tx queue has been stopped
> the cleanup handler has to wakeup the queue within a certain time span, otherwise the
> watchdog will complain (as it happened in your test). Could you retest this with
> irq coalescing disabled?

I actually had TX coalescing disabled, with

-#define STMMAC_TX_FRAMES       64
+#define STMMAC_TX_FRAMES       0


										Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] net: ethernet: stmmac: remove private tx queue lock
From: David Miller @ 2016-12-07 23:41 UTC (permalink / raw)
  To: pavel
  Cc: LinoSanfilippo, bh74.an, ks.giri, vipul.pandya, peppe.cavallaro,
	alexandre.torgue, linux-kernel, netdev
In-Reply-To: <20161207213757.GC2250@amd>

From: Pavel Machek <pavel@ucw.cz>
Date: Wed, 7 Dec 2016 22:37:57 +0100

> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 982c952..7415bc2 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -1308,7 +1308,7 @@ static void stmmac_tx_clean(struct stmmac_priv *priv)
>  	unsigned int bytes_compl = 0, pkts_compl = 0;
>  	unsigned int entry = priv->dirty_tx;
>  
> -	spin_lock(&priv->tx_lock);
> +	netif_tx_lock_bh(priv->dev);
>  
>  	priv->xstats.tx_clean++;
>  

stmmac_tx_clean() runs from either the timer or the NAPI poll handler,
both execute from software interrupts, therefore _bh() should be
unnecessary.

^ permalink raw reply

* [PATCH v4 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
From: Martin KaFai Lau @ 2016-12-07 23:53 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team
In-Reply-To: <1481154794-2311034-1-git-send-email-kafai@fb.com>

When XDP is active in mlx4, mlx4 is using one page/pkt.
At the same time (i.e. when XDP is active), it is currently
limiting MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN)
which is 1514 in x86.  AFAICT, we can at least raise the MTU
limit up to PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this
patch is doing.  It will be useful in the next patch which
allows XDP program to extend the packet by adding new header(s).

Note: In the earlier XDP patches, there is already existing guard
to ensure the page/pkt scheme only applies when XDP is active
in mlx4.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 28 +++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 46 ++++++++++++++------------
 2 files changed, 44 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index f441eda63bec..c97d25b06444 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,6 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2249,6 +2251,19 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 	free_netdev(dev);
 }
 
+static bool mlx4_en_check_xdp_mtu(struct net_device *dev, int mtu)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	if (mtu > MLX4_EN_MAX_XDP_MTU) {
+		en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
+		       mtu, MLX4_EN_MAX_XDP_MTU);
+		return false;
+	}
+
+	return true;
+}
+
 static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2258,11 +2273,10 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 	en_dbg(DRV, priv, "Change MTU called - current:%d new:%d\n",
 		 dev->mtu, new_mtu);
 
-	if (priv->tx_ring_num[TX_XDP] && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
-		en_err(priv, "MTU size:%d requires frags but XDP running\n",
-		       new_mtu);
-		return -EOPNOTSUPP;
-	}
+	if (priv->tx_ring_num[TX_XDP] &&
+	    !mlx4_en_check_xdp_mtu(dev, new_mtu))
+		return -ENOTSUPP;
+
 	dev->mtu = new_mtu;
 
 	if (netif_running(dev)) {
@@ -2715,10 +2729,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		return 0;
 	}
 
-	if (priv->num_frags > 1) {
-		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+	if (!mlx4_en_check_xdp_mtu(dev, dev->mtu))
 		return -EOPNOTSUPP;
-	}
 
 	tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
 	if (!tmp)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6562f78b07f4..23e9d04d1ef4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1164,37 +1164,39 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
-	enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
-	int order = MLX4_EN_ALLOC_PREFER_ORDER;
-	u32 align = SMP_CACHE_BYTES;
-	int buf_size = 0;
 	int i = 0;
 
 	/* bpf requires buffers to be set up as 1 packet per page.
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->tx_ring_num[TX_XDP]) {
-		dma_dir = PCI_DMA_BIDIRECTIONAL;
-		/* This will gain efficient xdp frame recycling at the expense
-		 * of more costly truesize accounting
+		priv->frag_info[0].order = 0;
+		priv->frag_info[0].frag_size = eff_mtu;
+		priv->frag_info[0].frag_prefix_size = 0;
+		/* This will gain efficient xdp frame recycling at the
+		 * expense of more costly truesize accounting
 		 */
-		align = PAGE_SIZE;
-		order = 0;
-	}
-
-	while (buf_size < eff_mtu) {
-		priv->frag_info[i].order = order;
-		priv->frag_info[i].frag_size =
-			(eff_mtu > buf_size + frag_sizes[i]) ?
-				frag_sizes[i] : eff_mtu - buf_size;
-		priv->frag_info[i].frag_prefix_size = buf_size;
-		priv->frag_info[i].frag_stride =
-				ALIGN(priv->frag_info[i].frag_size, align);
-		priv->frag_info[i].dma_dir = dma_dir;
-		buf_size += priv->frag_info[i].frag_size;
-		i++;
+		priv->frag_info[0].frag_stride = PAGE_SIZE;
+		priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
+		i = 1;
+	} else {
+		int buf_size = 0;
+
+		while (buf_size < eff_mtu) {
+			priv->frag_info[i].order = MLX4_EN_ALLOC_PREFER_ORDER;
+			priv->frag_info[i].frag_size =
+				(eff_mtu > buf_size + frag_sizes[i]) ?
+					frag_sizes[i] : eff_mtu - buf_size;
+			priv->frag_info[i].frag_prefix_size = buf_size;
+			priv->frag_info[i].frag_stride =
+				ALIGN(priv->frag_info[i].frag_size,
+				      SMP_CACHE_BYTES);
+			priv->frag_info[i].dma_dir = PCI_DMA_FROMDEVICE;
+			buf_size += priv->frag_info[i].frag_size;
+			i++;
+		}
 	}
 
 	priv->num_frags = i;
-- 
2.5.1

^ permalink raw reply related

* [PATCH v4 net-next 4/4] bpf: xdp: Add XDP example for head adjustment
From: Martin KaFai Lau @ 2016-12-07 23:53 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team
In-Reply-To: <1481154794-2311034-1-git-send-email-kafai@fb.com>

The XDP prog checks if the incoming packet matches any VIP:PORT
combination in the BPF hashmap.  If it is, it will encapsulate
the packet with a IPv4/v6 header as instructed by the value of
the BPF hashmap and then XDP_TX it out.

The VIP:PORT -> IP-Encap-Info can be specified by the cmd args
of the user prog.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 samples/bpf/Makefile                 |   4 +
 samples/bpf/bpf_helpers.h            |   2 +
 samples/bpf/bpf_load.c               |  94 +++++++++++++
 samples/bpf/bpf_load.h               |   1 +
 samples/bpf/xdp1_user.c              |  93 -------------
 samples/bpf/xdp_tx_iptunnel_common.h |  37 +++++
 samples/bpf/xdp_tx_iptunnel_kern.c   | 236 ++++++++++++++++++++++++++++++++
 samples/bpf/xdp_tx_iptunnel_user.c   | 256 +++++++++++++++++++++++++++++++++++
 8 files changed, 630 insertions(+), 93 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptunnel_common.h
 create mode 100644 samples/bpf/xdp_tx_iptunnel_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptunnel_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 00cd3081c038..f2219c1489e5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -33,6 +33,7 @@ hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
+hostprogs-y += xdp_tx_iptunnel
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -67,6 +68,7 @@ trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
+xdp_tx_iptunnel-objs := bpf_load.o libbpf.o xdp_tx_iptunnel_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -99,6 +101,7 @@ always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
 always += lwt_len_hist_kern.o
+always += xdp_tx_iptunnel_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
@@ -129,6 +132,7 @@ HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
 HOSTLOADLIBES_lwt_len_hist += -l elf
+HOSTLOADLIBES_xdp_tx_iptunnel += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 8370a6e3839d..faaffe2e139a 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -57,6 +57,8 @@ static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
 	(void *) BPF_FUNC_skb_set_tunnel_opt;
 static unsigned long long (*bpf_get_prandom_u32)(void) =
 	(void *) BPF_FUNC_get_prandom_u32;
+static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
+	(void *) BPF_FUNC_xdp_adjust_head;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 49b45ccbe153..e30b6de94f2e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -12,6 +12,10 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/perf_event.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <sys/types.h>
+#include <sys/socket.h>
 #include <sys/syscall.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
@@ -450,3 +454,93 @@ struct ksym *ksym_search(long key)
 	/* out of range. return _stext */
 	return &syms[0];
 }
+
+int set_link_xdp_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 4adeeef53ad6..fb46a421ab41 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -31,4 +31,5 @@ struct ksym {
 
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
+int set_link_xdp_fd(int ifindex, int fd);
 #endif
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index 2b2150d6d6f7..5f040a0d7712 100644
--- a/samples/bpf/xdp1_user.c
+++ b/samples/bpf/xdp1_user.c
@@ -5,111 +5,18 @@
  * License as published by the Free Software Foundation.
  */
 #include <linux/bpf.h>
-#include <linux/netlink.h>
-#include <linux/rtnetlink.h>
 #include <assert.h>
 #include <errno.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
-#include <sys/socket.h>
 #include <unistd.h>
 
 #include "bpf_load.h"
 #include "bpf_util.h"
 #include "libbpf.h"
 
-static int set_link_xdp_fd(int ifindex, int fd)
-{
-	struct sockaddr_nl sa;
-	int sock, seq = 0, len, ret = -1;
-	char buf[4096];
-	struct nlattr *nla, *nla_xdp;
-	struct {
-		struct nlmsghdr  nh;
-		struct ifinfomsg ifinfo;
-		char             attrbuf[64];
-	} req;
-	struct nlmsghdr *nh;
-	struct nlmsgerr *err;
-
-	memset(&sa, 0, sizeof(sa));
-	sa.nl_family = AF_NETLINK;
-
-	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
-	if (sock < 0) {
-		printf("open netlink socket: %s\n", strerror(errno));
-		return -1;
-	}
-
-	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
-		printf("bind to netlink: %s\n", strerror(errno));
-		goto cleanup;
-	}
-
-	memset(&req, 0, sizeof(req));
-	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
-	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
-	req.nh.nlmsg_type = RTM_SETLINK;
-	req.nh.nlmsg_pid = 0;
-	req.nh.nlmsg_seq = ++seq;
-	req.ifinfo.ifi_family = AF_UNSPEC;
-	req.ifinfo.ifi_index = ifindex;
-	nla = (struct nlattr *)(((char *)&req)
-				+ NLMSG_ALIGN(req.nh.nlmsg_len));
-	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
-
-	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
-	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
-	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
-	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
-	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
-
-	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
-
-	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
-		printf("send to netlink: %s\n", strerror(errno));
-		goto cleanup;
-	}
-
-	len = recv(sock, buf, sizeof(buf), 0);
-	if (len < 0) {
-		printf("recv from netlink: %s\n", strerror(errno));
-		goto cleanup;
-	}
-
-	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
-	     nh = NLMSG_NEXT(nh, len)) {
-		if (nh->nlmsg_pid != getpid()) {
-			printf("Wrong pid %d, expected %d\n",
-			       nh->nlmsg_pid, getpid());
-			goto cleanup;
-		}
-		if (nh->nlmsg_seq != seq) {
-			printf("Wrong seq %d, expected %d\n",
-			       nh->nlmsg_seq, seq);
-			goto cleanup;
-		}
-		switch (nh->nlmsg_type) {
-		case NLMSG_ERROR:
-			err = (struct nlmsgerr *)NLMSG_DATA(nh);
-			if (!err->error)
-				continue;
-			printf("nlmsg error %s\n", strerror(-err->error));
-			goto cleanup;
-		case NLMSG_DONE:
-			break;
-		}
-	}
-
-	ret = 0;
-
-cleanup:
-	close(sock);
-	return ret;
-}
-
 static int ifindex;
 
 static void int_exit(int sig)
diff --git a/samples/bpf/xdp_tx_iptunnel_common.h b/samples/bpf/xdp_tx_iptunnel_common.h
new file mode 100644
index 000000000000..dd12cc35110f
--- /dev/null
+++ b/samples/bpf/xdp_tx_iptunnel_common.h
@@ -0,0 +1,37 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#ifndef _SAMPLES_BPF_XDP_TX_IPTNL_COMMON_H
+#define _SAMPLES_BPF_XDP_TX_IPTNL_COMMON_H
+
+#include <linux/types.h>
+
+#define MAX_IPTNL_ENTRIES 256U
+
+struct vip {
+	union {
+		__u32 v6[4];
+		__u32 v4;
+	} daddr;
+	__u16 dport;
+	__u16 family;
+	__u8 protocol;
+};
+
+struct iptnl_info {
+	union {
+		__u32 v6[4];
+		__u32 v4;
+	} saddr;
+	union {
+		__u32 v6[4];
+		__u32 v4;
+	} daddr;
+	__u16 family;
+	__u8 dmac[6];
+};
+
+#endif
diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c
new file mode 100644
index 000000000000..85c38ecd3a2d
--- /dev/null
+++ b/samples/bpf/xdp_tx_iptunnel_kern.c
@@ -0,0 +1,236 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program shows how to use bpf_xdp_adjust_head() by
+ * encapsulating the incoming packet in an IPv4/v6 header
+ * and then XDP_TX it out.
+ */
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+#include "xdp_tx_iptunnel_common.h"
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u64),
+	.max_entries = 256,
+};
+
+struct bpf_map_def SEC("maps") vip2tnl = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(struct vip),
+	.value_size = sizeof(struct iptnl_info),
+	.max_entries = MAX_IPTNL_ENTRIES,
+};
+
+static __always_inline void count_tx(u32 protocol)
+{
+	u64 *rxcnt_count;
+
+	rxcnt_count = bpf_map_lookup_elem(&rxcnt, &protocol);
+	if (rxcnt_count)
+		*rxcnt_count += 1;
+}
+
+static __always_inline int get_dport(void *trans_data, void *data_end,
+				     u8 protocol)
+{
+	struct tcphdr *th;
+	struct udphdr *uh;
+
+	switch (protocol) {
+	case IPPROTO_TCP:
+		th = (struct tcphdr *)trans_data;
+		if (th + 1 > data_end)
+			return -1;
+		return th->dest;
+	case IPPROTO_UDP:
+		uh = (struct udphdr *)trans_data;
+		if (uh + 1 > data_end)
+			return -1;
+		return uh->dest;
+	default:
+		return 0;
+	}
+}
+
+static __always_inline void set_ethhdr(struct ethhdr *new_eth,
+				       const struct ethhdr *old_eth,
+				       const struct iptnl_info *tnl,
+				       __be16 h_proto)
+{
+	memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source));
+	memcpy(new_eth->h_dest, tnl->dmac, sizeof(new_eth->h_dest));
+	new_eth->h_proto = h_proto;
+}
+
+static __always_inline int handle_ipv4(struct xdp_md *xdp)
+{
+	void *data_end = (void *)(long)xdp->data_end;
+	void *data = (void *)(long)xdp->data;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	struct iphdr *iph = data + sizeof(struct ethhdr);
+	u16 *next_iph_u16;
+	u16 payload_len;
+	struct vip vip = {};
+	int dport;
+	u32 csum = 0;
+	int i;
+
+	if (iph + 1 > data_end)
+		return XDP_DROP;
+
+	dport = get_dport(iph + 1, data_end, iph->protocol);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = iph->protocol;
+	vip.family = AF_INET;
+	vip.daddr.v4 = iph->daddr;
+	vip.dport = dport;
+	payload_len = ntohs(iph->tot_len);
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v4-in-v4 */
+	if (!tnl || tnl->family != AF_INET)
+		return XDP_PASS;
+
+	/* The vip key is found.  Add an IP header and send it out */
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct iphdr)))
+		return XDP_DROP;
+
+	data = (void *)(long)xdp->data;
+	data_end = (void *)(long)xdp->data_end;
+
+	new_eth = data;
+	iph = data + sizeof(*new_eth);
+	old_eth = data + sizeof(*iph);
+
+	if (new_eth + 1 > data_end ||
+	    old_eth + 1 > data_end ||
+	    iph + 1 > data_end)
+		return XDP_DROP;
+
+	set_ethhdr(new_eth, old_eth, tnl, htons(ETH_P_IP));
+
+	iph->version = 4;
+	iph->ihl = sizeof(*iph) >> 2;
+	iph->frag_off =	0;
+	iph->protocol = IPPROTO_IPIP;
+	iph->check = 0;
+	iph->tos = 0;
+	iph->tot_len = htons(payload_len + sizeof(*iph));
+	iph->daddr = tnl->daddr.v4;
+	iph->saddr = tnl->saddr.v4;
+	iph->ttl = 8;
+
+	next_iph_u16 = (u16 *)iph;
+#pragma clang loop unroll(full)
+	for (i = 0; i < sizeof(*iph) >> 1; i++)
+		csum += *next_iph_u16++;
+
+	iph->check = ~((csum & 0xffff) + (csum >> 16));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+static __always_inline int handle_ipv6(struct xdp_md *xdp)
+{
+	void *data_end = (void *)(long)xdp->data_end;
+	void *data = (void *)(long)xdp->data;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	struct ipv6hdr *ip6h = data + sizeof(struct ethhdr);
+	__u16 payload_len;
+	struct vip vip = {};
+	int dport;
+
+	if (ip6h + 1 > data_end)
+		return XDP_DROP;
+
+	dport = get_dport(ip6h + 1, data_end, ip6h->nexthdr);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = ip6h->nexthdr;
+	vip.family = AF_INET6;
+	memcpy(vip.daddr.v6, ip6h->daddr.s6_addr32, sizeof(vip.daddr));
+	vip.dport = dport;
+	payload_len = ip6h->payload_len;
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v6-in-v6 */
+	if (!tnl || tnl->family != AF_INET6)
+		return XDP_PASS;
+
+	/* The vip key is found.  Add an IP header and send it out */
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct ipv6hdr)))
+		return XDP_DROP;
+
+	data = (void *)(long)xdp->data;
+	data_end = (void *)(long)xdp->data_end;
+
+	new_eth = data;
+	ip6h = data + sizeof(*new_eth);
+	old_eth = data + sizeof(*ip6h);
+
+	if (new_eth + 1 > data_end ||
+	    old_eth + 1 > data_end ||
+	    ip6h + 1 > data_end)
+		return XDP_DROP;
+
+	set_ethhdr(new_eth, old_eth, tnl, htons(ETH_P_IPV6));
+
+	ip6h->version = 6;
+	ip6h->priority = 0;
+	memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl));
+	ip6h->payload_len = htons(ntohs(payload_len) + sizeof(*ip6h));
+	ip6h->nexthdr = IPPROTO_IPV6;
+	ip6h->hop_limit = 8;
+	memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6));
+	memcpy(ip6h->daddr.s6_addr32, tnl->daddr.v6, sizeof(tnl->daddr.v6));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+SEC("xdp_tx_iptunnel")
+int _xdp_tx_iptunnel(struct xdp_md *xdp)
+{
+	void *data_end = (void *)(long)xdp->data_end;
+	void *data = (void *)(long)xdp->data;
+	struct ethhdr *eth = data;
+	__u16 h_proto;
+
+	if (eth + 1 > data_end)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		return handle_ipv4(xdp);
+	else if (h_proto == htons(ETH_P_IPV6))
+
+		return handle_ipv6(xdp);
+	else
+		return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_tx_iptunnel_user.c b/samples/bpf/xdp_tx_iptunnel_user.c
new file mode 100644
index 000000000000..7a71f5c74684
--- /dev/null
+++ b/samples/bpf/xdp_tx_iptunnel_user.c
@@ -0,0 +1,256 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/resource.h>
+#include <arpa/inet.h>
+#include <netinet/ether.h>
+#include <unistd.h>
+#include <time.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+#include "bpf_util.h"
+#include "xdp_tx_iptunnel_common.h"
+
+#define STATS_INTERVAL_S 2U
+
+static int ifindex = -1;
+
+static void int_exit(int sig)
+{
+	if (ifindex > -1)
+		set_link_xdp_fd(ifindex, -1);
+	exit(0);
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(unsigned int kill_after_s)
+{
+	const unsigned int nr_protos = 256;
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	time_t started_at = time(NULL);
+	__u64 values[nr_cpus], prev[nr_protos][nr_cpus];
+	__u32 proto;
+	int i;
+
+	memset(prev, 0, sizeof(prev));
+
+	while (!kill_after_s || time(NULL) - started_at <= kill_after_s) {
+		sleep(STATS_INTERVAL_S);
+
+		for (proto = 0; proto < nr_protos; proto++) {
+			__u64 sum = 0;
+
+			assert(bpf_lookup_elem(map_fd[0], &proto, values) == 0);
+			for (i = 0; i < nr_cpus; i++)
+				sum += (values[i] - prev[proto][i]);
+
+			if (sum)
+				printf("proto %u: sum:%10llu pkts, rate:%10llu pkts/s\n",
+				       proto, sum, sum / STATS_INTERVAL_S);
+			memcpy(prev[proto], values, sizeof(values));
+		}
+	}
+}
+
+static void usage(const char *cmd)
+{
+	printf("Start a XDP prog which encapsulates incoming packets\n"
+	       "in an IPv4/v6 header and XDP_TX it out.  The dst <VIP:PORT>\n"
+	       "is used to select packets to encapsulate\n\n");
+	printf("Usage: %s [...]\n", cmd);
+	printf("    -i <ifindex> Interface Index\n");
+	printf("    -a <vip-service-address> IPv4 or IPv6\n");
+	printf("    -p <vip-service-port> A port range (e.g. 433-444) is also allowed\n");
+	printf("    -s <source-ip> Used in the IPTunnel header\n");
+	printf("    -d <dest-ip> Used in the IPTunnel header\n");
+	printf("    -m <dest-MAC> Used in sending the IP Tunneled pkt\n");
+	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
+	printf("    -P <IP-Protocol> Default is TCP\n");
+	printf("    -h Display this help\n");
+}
+
+static int parse_ipstr(const char *ipstr, unsigned int *addr)
+{
+	if (inet_pton(AF_INET6, ipstr, addr) == 1) {
+		return AF_INET6;
+	} else if (inet_pton(AF_INET, ipstr, addr) == 1) {
+		addr[1] = addr[2] = addr[3] = 0;
+		return AF_INET;
+	}
+
+	fprintf(stderr, "%s is an invalid IP\n", ipstr);
+	return AF_UNSPEC;
+}
+
+static int parse_ports(const char *port_str, int *min_port, int *max_port)
+{
+	char *end;
+	long tmp_min_port;
+	long tmp_max_port;
+
+	tmp_min_port = strtol(optarg, &end, 10);
+	if (tmp_min_port < 1 || tmp_min_port > 65535) {
+		fprintf(stderr, "Invalid port(s):%s\n", optarg);
+		return 1;
+	}
+
+	if (*end == '-') {
+		end++;
+		tmp_max_port = strtol(end, NULL, 10);
+		if (tmp_max_port < 1 || tmp_max_port > 65535) {
+			fprintf(stderr, "Invalid port(s):%s\n", optarg);
+			return 1;
+		}
+	} else {
+		tmp_max_port = tmp_min_port;
+	}
+
+	if (tmp_min_port > tmp_max_port) {
+		fprintf(stderr, "Invalid port(s):%s\n", optarg);
+		return 1;
+	}
+
+	if (tmp_max_port - tmp_min_port + 1 > MAX_IPTNL_ENTRIES) {
+		fprintf(stderr, "Port range (%s) is larger than %u\n",
+			port_str, MAX_IPTNL_ENTRIES);
+		return 1;
+	}
+	*min_port = tmp_min_port;
+	*max_port = tmp_max_port;
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned char opt_flags[256] = {};
+	unsigned int kill_after_s = 0;
+	const char *optstr = "i:a:p:s:d:m:T:P:h";
+	int min_port = 0, max_port = 0;
+	struct iptnl_info tnl = {};
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	struct vip vip = {};
+	char filename[256];
+	int opt;
+	int i;
+
+	tnl.family = AF_UNSPEC;
+	vip.protocol = IPPROTO_TCP;
+
+	for (i = 0; i < strlen(optstr); i++)
+		if (optstr[i] != 'h' && 'a' <= optstr[i] && optstr[i] <= 'z')
+			opt_flags[(unsigned char)optstr[i]] = 1;
+
+	while ((opt = getopt(argc, argv, optstr)) != -1) {
+		unsigned short family;
+		unsigned int *v6;
+
+		switch (opt) {
+		case 'i':
+			ifindex = atoi(optarg);
+			break;
+		case 'a':
+			vip.family = parse_ipstr(optarg, vip.daddr.v6);
+			if (vip.family == AF_UNSPEC)
+				return 1;
+			break;
+		case 'p':
+			if (parse_ports(optarg, &min_port, &max_port))
+				return 1;
+			break;
+		case 'P':
+			vip.protocol = atoi(optarg);
+			break;
+		case 's':
+		case 'd':
+			if (opt == 's')
+				v6 = tnl.saddr.v6;
+			else
+				v6 = tnl.daddr.v6;
+
+			family = parse_ipstr(optarg, v6);
+			if (family == AF_UNSPEC)
+				return 1;
+			if (tnl.family == AF_UNSPEC) {
+				tnl.family = family;
+			} else if (tnl.family != family) {
+				fprintf(stderr,
+					"The IP version of the src and dst addresses used in the IP encapsulation does not match\n");
+				return 1;
+			}
+			break;
+		case 'm':
+			if (!ether_aton_r(optarg,
+					  (struct ether_addr *)tnl.dmac)) {
+				fprintf(stderr, "Invalid mac address:%s\n",
+					optarg);
+				return 1;
+			}
+			break;
+		case 'T':
+			kill_after_s = atoi(optarg);
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+		opt_flags[opt] = 0;
+	}
+
+	for (i = 0; i < strlen(optstr); i++) {
+		if (opt_flags[(unsigned int)optstr[i]]) {
+			fprintf(stderr, "Missing argument -%c\n", optstr[i]);
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)");
+		return 1;
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+
+	while (min_port <= max_port) {
+		vip.dport = htons(min_port++);
+		if (bpf_update_elem(map_fd[1], &vip, &tnl, BPF_NOEXIST)) {
+			perror("bpf_update_elem(&vip2tnl)");
+			return 1;
+		}
+	}
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	poll_stats(kill_after_s);
+
+	set_link_xdp_fd(ifindex, -1);
+
+	return 0;
+}
-- 
2.5.1

^ permalink raw reply related

* [PATCH v4 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog
From: Martin KaFai Lau @ 2016-12-07 23:53 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team
In-Reply-To: <1481154794-2311034-1-git-send-email-kafai@fb.com>

This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

This patch adds one "xdp_adjust_head" bit to bpf_prog for the
XDP-capable driver to check if the XDP prog requires
bpf_xdp_adjust_head() support.  The driver can then decide
to error out during XDP_SETUP_PROG.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 arch/powerpc/net/bpf_jit_comp64.c                  |  4 ++--
 arch/s390/net/bpf_jit_comp.c                       |  2 +-
 arch/x86/net/bpf_jit_comp.c                        |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  5 ++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  5 ++++
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |  4 ++++
 drivers/net/ethernet/qlogic/qede/qede_main.c       |  5 ++++
 include/linux/filter.h                             |  6 +++--
 include/uapi/linux/bpf.h                           | 11 ++++++++-
 kernel/bpf/core.c                                  |  2 +-
 kernel/bpf/syscall.c                               |  2 ++
 kernel/bpf/verifier.c                              |  2 +-
 net/core/filter.c                                  | 28 ++++++++++++++++++++--
 13 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 0fe98a567125..73a5cf18fd84 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -766,7 +766,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			func = (u8 *) __bpf_call_base + imm;
 
 			/* Save skb pointer if we need to re-cache skb data */
-			if (bpf_helper_changes_skb_data(func))
+			if (bpf_helper_changes_pkt_data(func))
 				PPC_BPF_STL(3, 1, bpf_jit_stack_local(ctx));
 
 			bpf_jit_emit_func_call(image, ctx, (u64)func);
@@ -775,7 +775,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			PPC_MR(b2p[BPF_REG_0], 3);
 
 			/* refresh skb cache */
-			if (bpf_helper_changes_skb_data(func)) {
+			if (bpf_helper_changes_pkt_data(func)) {
 				/* reload skb pointer to r3 */
 				PPC_BPF_LL(3, 1, bpf_jit_stack_local(ctx));
 				bpf_jit_emit_skb_loads(image, ctx);
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index bee281f3163d..167b31b186c1 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -981,7 +981,7 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, struct bpf_prog *fp, int i
 		EMIT2(0x0d00, REG_14, REG_W1);
 		/* lgr %b0,%r2: load return value into %b0 */
 		EMIT4(0xb9040000, BPF_REG_0, REG_2);
-		if (bpf_helper_changes_skb_data((void *)func)) {
+		if (bpf_helper_changes_pkt_data((void *)func)) {
 			jit->seen |= SEEN_SKB_CHANGE;
 			/* lg %b1,ST_OFF_SKBP(%r15) */
 			EMIT6_DISP_LH(0xe3000000, 0x0004, BPF_REG_1, REG_0,
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index fe04a04dab8e..e76d1af60f7a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -853,7 +853,7 @@ xadd:			if (is_imm8(insn->off))
 			func = (u8 *) __bpf_call_base + imm32;
 			jmp_offset = func - (image + addrs[i]);
 			if (seen_ld_abs) {
-				reload_skb_data = bpf_helper_changes_skb_data(func);
+				reload_skb_data = bpf_helper_changes_pkt_data(func);
 				if (reload_skb_data) {
 					EMIT1(0x57); /* push %rdi */
 					jmp_offset += 22; /* pop, mov, sub, mov */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 49a81f1fc1d6..f441eda63bec 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2686,6 +2686,11 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	int err;
 	int i;
 
+	if (prog && prog->xdp_adjust_head) {
+		en_err(priv, "Does not support bpf_xdp_adjust_head()\n");
+		return -EOPNOTSUPP;
+	}
+
 	xdp_ring_num = prog ? priv->rx_ring_num : 0;
 
 	/* No need to reconfigure buffers when simply swapping the
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9def5cc378a3..504ca130f88d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3180,6 +3180,11 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	bool reset, was_opened;
 	int i;
 
+	if (prog && prog->xdp_adjust_head) {
+		netdev_err(netdev, "Does not support bpf_xdp_adjust_head()\n");
+		return -EOPNOTSUPP;
+	}
+
 	mutex_lock(&priv->state_lock);
 
 	if ((netdev->features & NETIF_F_LRO) && prog) {
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 00d9a03be31d..e8d448109e03 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2946,6 +2946,10 @@ static int nfp_net_xdp_setup(struct nfp_net *nn, struct bpf_prog *prog)
 	};
 	int err;
 
+	if (prog && prog->xdp_adjust_head) {
+		nn_err(nn, "Does not support bpf_xdp_adjust_head()\n");
+		return -EOPNOTSUPP;
+	}
 	if (!prog && !nn->xdp_prog)
 		return 0;
 	if (prog && nn->xdp_prog) {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
index cf1dd1436d93..aecdd1c5c0ea 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -2507,6 +2507,11 @@ static int qede_xdp_set(struct qede_dev *edev, struct bpf_prog *prog)
 {
 	struct qede_reload_args args;
 
+	if (prog && prog->xdp_adjust_head) {
+		DP_ERR(edev, "Does not support bpf_xdp_adjust_head()\n");
+		return -EOPNOTSUPP;
+	}
+
 	/* If we're called, there was already a bpf reference increment */
 	args.func = &qede_xdp_reload_func;
 	args.u.new_prog = prog;
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f078d2b1cff6..6a1658308612 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -406,7 +406,8 @@ struct bpf_prog {
 	u16			jited:1,	/* Is our filter JIT'ed? */
 				gpl_compatible:1, /* Is filter GPL compatible? */
 				cb_access:1,	/* Is control block accessed? */
-				dst_needed:1;	/* Do we need dst entry? */
+				dst_needed:1,	/* Do we need dst entry? */
+				xdp_adjust_head:1; /* Adjusting pkt head? */
 	kmemcheck_bitfield_end(meta);
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	u32			len;		/* Number of filter blocks */
@@ -440,6 +441,7 @@ struct bpf_skb_data_end {
 struct xdp_buff {
 	void *data;
 	void *data_end;
+	void *data_hard_start;
 };
 
 /* compute the linear packet data range [data, data_end) which
@@ -595,7 +597,7 @@ void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
 u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
-bool bpf_helper_changes_skb_data(void *func);
+bool bpf_helper_changes_pkt_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6123d9b8e828..0eb0e87dbe9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -424,6 +424,12 @@ union bpf_attr {
  *     @len: length of header to be pushed in front
  *     @flags: Flags (unused for now)
  *     Return: 0 on success or negative error
+ *
+ * int bpf_xdp_adjust_head(xdp_md, delta)
+ *     Adjust the xdp_md.data by delta
+ *     @xdp_md: pointer to xdp_md
+ *     @delta: An positive/negative integer to be added to xdp_md.data
+ *     Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -469,7 +475,8 @@ union bpf_attr {
 	FN(csum_update),		\
 	FN(set_hash_invalid),		\
 	FN(get_numa_node_id),		\
-	FN(skb_change_head),
+	FN(skb_change_head),		\
+	FN(xdp_adjust_head),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -576,6 +583,8 @@ struct bpf_sock {
 	__u32 protocol;
 };
 
+#define XDP_PACKET_HEADROOM 256
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index bdcc9f4ba767..83e0d153b0b4 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1143,7 +1143,7 @@ struct bpf_prog * __weak bpf_int_jit_compile(struct bpf_prog *prog)
 	return prog;
 }
 
-bool __weak bpf_helper_changes_skb_data(void *func)
+bool __weak bpf_helper_changes_pkt_data(void *func)
 {
 	return false;
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c0d2b423ce93..add93ecd7e69 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -579,6 +579,8 @@ static void fixup_bpf_calls(struct bpf_prog *prog)
 				prog->dst_needed = 1;
 			if (insn->imm == BPF_FUNC_get_prandom_u32)
 				bpf_user_rnd_init_once();
+			if (insn->imm == BPF_FUNC_xdp_adjust_head)
+				prog->xdp_adjust_head = 1;
 			if (insn->imm == BPF_FUNC_tail_call) {
 				/* mark bpf_tail_call as different opcode
 				 * to avoid conditional branch in
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cb37339ca0da..f5fa326518c9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1216,7 +1216,7 @@ static int check_call(struct bpf_verifier_env *env, int func_id)
 		return -EINVAL;
 	}
 
-	changes_data = bpf_helper_changes_skb_data(fn->func);
+	changes_data = bpf_helper_changes_pkt_data(fn->func);
 
 	memset(&meta, 0, sizeof(meta));
 	meta.pkt_access = fn->pkt_access;
diff --git a/net/core/filter.c b/net/core/filter.c
index b751202e12f8..b1461708a977 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2234,7 +2234,28 @@ static const struct bpf_func_proto bpf_skb_change_head_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-bool bpf_helper_changes_skb_data(void *func)
+BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
+{
+	void *data = xdp->data + offset;
+
+	if (unlikely(data < xdp->data_hard_start ||
+		     data > xdp->data_end - ETH_HLEN))
+		return -EINVAL;
+
+	xdp->data = data;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
+	.func		= bpf_xdp_adjust_head,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
+bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
 	    func == bpf_skb_vlan_pop ||
@@ -2244,7 +2265,8 @@ bool bpf_helper_changes_skb_data(void *func)
 	    func == bpf_skb_change_tail ||
 	    func == bpf_skb_pull_data ||
 	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace)
+	    func == bpf_l4_csum_replace ||
+	    func == bpf_xdp_adjust_head)
 		return true;
 
 	return false;
@@ -2670,6 +2692,8 @@ xdp_func_proto(enum bpf_func_id func_id)
 		return &bpf_xdp_event_output_proto;
 	case BPF_FUNC_get_smp_processor_id:
 		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_xdp_adjust_head:
+		return &bpf_xdp_adjust_head_proto;
 	default:
 		return sk_filter_func_proto(func_id);
 	}
-- 
2.5.1

^ permalink raw reply related

* [PATCH v4 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active
From: Martin KaFai Lau @ 2016-12-07 23:53 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team
In-Reply-To: <1481154794-2311034-1-git-send-email-kafai@fb.com>

Reserve XDP_PACKET_HEADROOM for packet and enable bpf_xdp_adjust_head()
support.  This patch only affects the code path when XDP is active.

After testing, the tx_dropped counter is incremented if the xdp_prog sends
more than wire MTU.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  8 ++------
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 24 ++++++++++++++++++------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |  9 +++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  3 ++-
 4 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c97d25b06444..bcd955339058 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,7 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
-#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
+				   XDP_PACKET_HEADROOM))
 
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
@@ -2700,11 +2701,6 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	int err;
 	int i;
 
-	if (prog && prog->xdp_adjust_head) {
-		en_err(priv, "Does not support bpf_xdp_adjust_head()\n");
-		return -EOPNOTSUPP;
-	}
-
 	xdp_ring_num = prog ? priv->rx_ring_num : 0;
 
 	/* No need to reconfigure buffers when simply swapping the
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 23e9d04d1ef4..3c37e216bbf3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -96,7 +96,6 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
 	const struct mlx4_en_frag_info *frag_info;
 	struct page *page;
-	dma_addr_t dma;
 	int i;
 
 	for (i = 0; i < priv->num_frags; i++) {
@@ -115,9 +114,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 
 	for (i = 0; i < priv->num_frags; i++) {
 		frags[i] = ring_alloc[i];
-		dma = ring_alloc[i].dma + ring_alloc[i].page_offset;
+		frags[i].page_offset += priv->frag_info[i].rx_headroom;
+		rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
+						    frags[i].page_offset);
 		ring_alloc[i] = page_alloc[i];
-		rx_desc->data[i].addr = cpu_to_be64(dma);
 	}
 
 	return 0;
@@ -250,7 +250,8 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 
 	if (ring->page_cache.index > 0) {
 		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
-		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
+						    frags[0].page_offset);
 		return 0;
 	}
 
@@ -889,6 +890,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		if (xdp_prog) {
 			struct xdp_buff xdp;
 			dma_addr_t dma;
+			void *orig_data;
 			u32 act;
 
 			dma = be64_to_cpu(rx_desc->data[0].addr);
@@ -896,11 +898,19 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp.data = page_address(frags[0].page) +
-							frags[0].page_offset;
+			xdp.data_hard_start = page_address(frags[0].page);
+			xdp.data = xdp.data_hard_start + frags[0].page_offset;
 			xdp.data_end = xdp.data + length;
+			orig_data = xdp.data;
 
 			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+			if (xdp.data != orig_data) {
+				length = xdp.data_end - xdp.data;
+				frags[0].page_offset = xdp.data -
+					xdp.data_hard_start;
+			}
+
 			switch (act) {
 			case XDP_PASS:
 				break;
@@ -1180,6 +1190,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 		 */
 		priv->frag_info[0].frag_stride = PAGE_SIZE;
 		priv->frag_info[0].dma_dir = PCI_DMA_BIDIRECTIONAL;
+		priv->frag_info[0].rx_headroom = XDP_PACKET_HEADROOM;
 		i = 1;
 	} else {
 		int buf_size = 0;
@@ -1194,6 +1205,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 				ALIGN(priv->frag_info[i].frag_size,
 				      SMP_CACHE_BYTES);
 			priv->frag_info[i].dma_dir = PCI_DMA_FROMDEVICE;
+			priv->frag_info[i].rx_headroom = 0;
 			buf_size += priv->frag_info[i].frag_size;
 			i++;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 4b597dca5c52..5886ad78058f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -354,7 +354,7 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc frame = {
 		.page = tx_info->page,
 		.dma = tx_info->map0_dma,
-		.page_offset = 0,
+		.page_offset = XDP_PACKET_HEADROOM,
 		.page_size = PAGE_SIZE,
 	};
 
@@ -1132,7 +1132,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	tx_info->page = frame->page;
 	frame->page = NULL;
 	tx_info->map0_dma = dma;
-	tx_info->map0_byte_count = length;
+	tx_info->map0_byte_count = PAGE_SIZE;
 	tx_info->nr_txbb = nr_txbb;
 	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
@@ -1141,9 +1141,10 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
 	tx_info->linear = 1;
 	tx_info->inl = 0;
 
-	dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);
+	dma_sync_single_range_for_device(priv->ddev, dma, frame->page_offset,
+					 length, PCI_DMA_TODEVICE);
 
-	data->addr = cpu_to_be64(dma);
+	data->addr = cpu_to_be64(dma + frame->page_offset);
 	data->lkey = ring->mr_key;
 	dma_wmb();
 	data->byte_count = cpu_to_be32(length);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 20a936428f4a..ba1c6cd0cc79 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -475,7 +475,8 @@ struct mlx4_en_frag_info {
 	u16 frag_prefix_size;
 	u32 frag_stride;
 	enum dma_data_direction dma_dir;
-	int order;
+	u16 order;
+	u16 rx_headroom;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
-- 
2.5.1

^ permalink raw reply related

* [PATCH v4 net-next 0/4]: Allow head adjustment in XDP prog
From: Martin KaFai Lau @ 2016-12-07 23:53 UTC (permalink / raw)
  To: netdev
  Cc: Alexei Starovoitov, Brenden Blanco, Daniel Borkmann, David Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Saeed Mahameed, Tariq Toukan, Kernel Team

This series adds a helper to allow head adjusting in XDP prog.  mlx4
driver has been modified to support this feature.  An example is written
to encapsulate a packet with an IPv4/v6 header and then XDP_TX it
out.

v4:
1. Remove XDP_QUERY_FEATURES command.  Instead, check
   the prog->xdp_adjust_head bit inside the driver itself
   during XDP_SETUP_PROG in patch 1of4.
   Thanks for everybody's ideas.
2. Nit changes on sample code per Jesper

v3:
1. Check if the driver supports head adjustment before
   setting the xdp_prog fd to the device in patch 1of4.
2. Remove the page alignment assumption on the data_hard_start.
   Instead, add data_hard_start to the struct xdp_buff and the
   driver has to fill it if it supports head adjustment.
3. Keep the wire MTU as before in mlx4
4. Set map0_byte_count to PAGE_SIZE in patch 3of4

v2:
1. Make a variable name change in bpf_xdp_adjust_head() in patch 1
2. Ensure no less than ETH_HLEN data in bpf_xdp_adjust_head() in patch 1
3. Some clarifications in commit log messages of patch 2 and 3

Thanks,
Martin

Martin KaFai Lau (4):
  bpf: xdp: Allow head adjustment in XDP prog
  mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
  mlx4: xdp: Reserve headroom for receiving packet when XDP prog is
    active
  bpf: xdp: Add XDP example for head adjustment

 arch/powerpc/net/bpf_jit_comp64.c                  |   4 +-
 arch/s390/net/bpf_jit_comp.c                       |   2 +-
 arch/x86/net/bpf_jit_comp.c                        |   2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |  29 ++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |  70 +++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |   9 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +
 .../net/ethernet/netronome/nfp/nfp_net_common.c    |   4 +
 drivers/net/ethernet/qlogic/qede/qede_main.c       |   5 +
 include/linux/filter.h                             |   6 +-
 include/uapi/linux/bpf.h                           |  11 +-
 kernel/bpf/core.c                                  |   2 +-
 kernel/bpf/syscall.c                               |   2 +
 kernel/bpf/verifier.c                              |   2 +-
 net/core/filter.c                                  |  28 ++-
 samples/bpf/Makefile                               |   4 +
 samples/bpf/bpf_helpers.h                          |   2 +
 samples/bpf/bpf_load.c                             |  94 ++++++++
 samples/bpf/bpf_load.h                             |   1 +
 samples/bpf/xdp1_user.c                            |  93 --------
 samples/bpf/xdp_tx_iptunnel_common.h               |  37 +++
 samples/bpf/xdp_tx_iptunnel_kern.c                 | 236 +++++++++++++++++++
 samples/bpf/xdp_tx_iptunnel_user.c                 | 256 +++++++++++++++++++++
 24 files changed, 762 insertions(+), 145 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptunnel_common.h
 create mode 100644 samples/bpf/xdp_tx_iptunnel_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptunnel_user.c

-- 
2.5.1

^ permalink raw reply

* Re: [net-next v2 00/19][pull request] 40GbE Intel Wired LAN Driver Updates 2016-12-07
From: David Miller @ 2016-12-08  0:15 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann, jogreene, guru.anbalagane
In-Reply-To: <20161207221918.57932-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  7 Dec 2016 14:18:59 -0800

> This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [net-next v2 02/19] i40e: simplify txd use count calculation
From: Eric Dumazet @ 2016-12-08  0:16 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: davem, Mitch Williams, netdev, nhorman, sassmann, jogreene,
	guru.anbalagane, Alexander Duyck
In-Reply-To: <20161207221918.57932-3-jeffrey.t.kirsher@intel.com>

On Wed, 2016-12-07 at 14:19 -0800, Jeff Kirsher wrote:
> From: Mitch Williams <mitch.a.williams@intel.com>
> 
> The i40e_txd_use_count function was fast but confusing. In the comments,
> it even admits that it's ugly. So replace it with a new function that is
> (very) slightly faster and has extensive commenting to help the thicker
> among us (including the author, who will forget in a week) understand
> how it works.
> 
> Change-ID: Ifb533f13786a0bf39cb29f77969a5be2c83d9a87
> Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   | 45 +++++++++++++++++----------
>  drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 45 +++++++++++++++++----------
>  2 files changed, 56 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> index de8550f..e065321 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> @@ -173,26 +173,37 @@ static inline bool i40e_test_staterr(union i40e_rx_desc *rx_desc,
>  #define I40E_MAX_DATA_PER_TXD_ALIGNED \
>  	(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
>  
> -/* This ugly bit of math is equivalent to DIV_ROUNDUP(size, X) where X is
> - * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the fact
> - * that 12K is not a power of 2 and division is expensive.  It is used to
> - * approximate the number of descriptors used per linear buffer.  Note
> - * that this will overestimate in some cases as it doesn't account for the
> - * fact that we will add up to 4K - 1 in aligning the 12K buffer, however
> - * the error should not impact things much as large buffers usually mean
> - * we will use fewer descriptors then there are frags in an skb.
> +/**
> + * i40e_txd_use_count  - estimate the number of descriptors needed for Tx
> + * @size: transmit request size in bytes
> + *
> + * Due to hardware alignment restrictions (4K alignment), we need to
> + * assume that we can have no more than 12K of data per descriptor, even
> + * though each descriptor can take up to 16K - 1 bytes of aligned memory.
> + * Thus, we need to divide by 12K. But division is slow! Instead,
> + * we decompose the operation into shifts and one relatively cheap
> + * multiply operation.
> + *
> + * To divide by 12K, we first divide by 4K, then divide by 3:
> + *     To divide by 4K, shift right by 12 bits
> + *     To divide by 3, multiply by 85, then divide by 256
> + *     (Divide by 256 is done by shifting right by 8 bits)
> + * Finally, we add one to round up. Because 256 isn't an exact multiple of
> + * 3, we'll underestimate near each multiple of 12K. This is actually more
> + * accurate as we have 4K - 1 of wiggle room that we can fit into the last
> + * segment.  For our purposes this is accurate out to 1M which is orders of
> + * magnitude greater than our largest possible GSO size.
> + *
> + * This would then be implemented as:
> + *     return (((size >> 12) * 85) >> 8) + 1;
> + *
> + * Since multiplication and division are commutative, we can reorder
> + * operations into:
> + *     return ((size * 85) >> 20) + 1;
>   */
>  static inline unsigned int i40e_txd_use_count(unsigned int size)
>  {
> -	const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
> -	const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
> -	unsigned int adjust = ~(u32)0;
> -
> -	/* if we rounded up on the reciprocal pull down the adjustment */
> -	if ((max * reciprocal) > adjust)
> -		adjust = ~(u32)(reciprocal - 1);
> -
> -	return (u32)((((u64)size * reciprocal) + adjust) >> 32);
> +	return ((size * 85) >> 20) + 1;
>  }

But...

I thought gcc already implemented reciprocal divides ?


$ cat div.c
unsigned int foo(unsigned int size)
{
	return size / 0x3000;
}
$ gcc -O2 -S div.c && cat div.s
foo:
.LFB0:
	.cfi_startproc
	movl	%edi, %eax
	movl	$-1431655765, %edx  // 0xaaaaaaab
	mull	%edx
	shrl	$13, %edx
	movl	%edx, %eax
	ret

^ permalink raw reply

* Re: [PATCH 2/2] [v2] net: qcom/emac: add support for the Qualcomm Technologies QDF2400
From: kbuild test robot @ 2016-12-08  0:26 UTC (permalink / raw)
  To: Timur Tabi; +Cc: kbuild-all, David Miller, netdev, alokc
In-Reply-To: <1481143186-20137-3-git-send-email-timur@codeaurora.org>

[-- Attachment #1: Type: text/plain, Size: 1658 bytes --]

Hi Timur,

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.9-rc8 next-20161207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Timur-Tabi/net-qcom-emac-simplify-support-for-different-SOCs/20161208-064231
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/qualcomm/emac/emac-sgmii.c: In function 'emac_sgmii_acpi_match':
>> drivers/net/ethernet/qualcomm/emac/emac-sgmii.c:178:3: error: implicit declaration of function 'acpi_evaluate_integer' [-Werror=implicit-function-declaration]
      status = acpi_evaluate_integer(handle, "_HRV", NULL, &hrv);
      ^
   cc1: some warnings being treated as errors

vim +/acpi_evaluate_integer +178 drivers/net/ethernet/qualcomm/emac/emac-sgmii.c

   172	
   173		if (id) {
   174			acpi_handle handle = ACPI_HANDLE(dev);
   175			unsigned long long hrv;
   176			acpi_status status;
   177	
 > 178			status = acpi_evaluate_integer(handle, "_HRV", NULL, &hrv);
   179			if (status) {
   180				if (status == AE_NOT_FOUND)
   181					/* Older versions of the QDF2432 ACPI tables do

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 47257 bytes --]

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Jason A. Donenfeld @ 2016-12-08  0:29 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev, WireGuard mailing list, LKML, linux-mips
In-Reply-To: <20161207.145240.1636297838792223189.davem@davemloft.net>

On Wed, Dec 7, 2016 at 8:52 PM, David Miller <davem@davemloft.net> wrote:
> The only truly difficult case to handle is GRE encapsulation.  Is
> that the situation you are running into?
>
> If not, please figure out what the header configuration looks like
> in the case that hits for you, and what the originating device is
> just in case it is a device driver issue.

My case is my own driver and my own protocol, which uses a 13 byte
header. I can, if absolutely necessary, change the protocol to add
another byte of padding. Or I can choose not to decrypt in place but
rather use a different trick, like overwriting the header during
decryption, though this removes some of the scatterwalk optimizations
when src and dst are the same. Or something else. I wrote the top
email of this thread inquiring about just exactly how bad it is to
call netif_rx(skb) when skb->data is unaligned.

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: Hannes Frederic Sowa @ 2016-12-08  0:30 UTC (permalink / raw)
  To: Jason A. Donenfeld, Netdev, linux-mips; +Cc: LKML, WireGuard mailing list
In-Reply-To: <CAHmME9o_eCNXpVztOZKW55kpRtE+1KSEQTQOjUBVn68Y2+or2g@mail.gmail.com>

Hi Jason,

On 07.12.2016 19:35, Jason A. Donenfeld wrote:
> I receive encrypted packets with a 13 byte header. I decrypt the
> ciphertext in place, and then discard the header. I then pass the
> plaintext to the rest of the networking stack. The plaintext is an IP
> packet. Due to the 13 byte header that was discarded, the plaintext
> possibly begins at an unaligned location (depending on whether
> dev->needed_headroom was respected).
> 
> Does this matter? Is this bad? Will there be a necessary performance hit?

Your custom protocol should be designed in a way you get an aligned ip
header. Most protocols of the IETF follow this mantra and it is always
possible to e.g. pad options so you end up on aligned boundaries for the
next header.

GRE-TEB for example needs skb_copy_bits to extract the header so it can
access them in an aligned way.

> In order to find out, I instrumented the MIPS unaligned access
> exception handler to see where I was actually in trouble.
> Surprisingly, the only part of the stack that seemed to be upset was
> on calls to ip_hdr(skb)->version.
> 
> Two things disturb me about this. First, this seems too good to be
> true. Does it seem reasonable to you that this is actually the only
> place that would be problematic? Or was my testing methodology wrong
> to arrive at such an optimistic conclusion?
> 
> Secondly, why should a call to ip_hdr(skb)->version cause an unaligned
> access anyway? This struct member is simply the second half of a
> single byte in a bit field. I'd expect for the compiler to generate a
> single byte load, followed by a bitshift or a mask. Instead, the
> compiler appears to generate a double byte load, hence the exception.
> What's up with this? Stupid compiler that should be fixed? Some odd
> optimization? What to do?

I don't see an issue with that at all. Why do you think it could be a
problem?

Bye,
Hannes

^ permalink raw reply

* RE: [net-next v2 02/19] i40e: simplify txd use count calculation
From: Duyck, Alexander H @ 2016-12-08  0:35 UTC (permalink / raw)
  To: Eric Dumazet, Kirsher, Jeffrey T
  Cc: davem@davemloft.net, Williams, Mitch A, netdev@vger.kernel.org,
	nhorman@redhat.com, sassmann@redhat.com, jogreene@redhat.com,
	guru.anbalagane@oracle.com
In-Reply-To: <1481156184.4930.77.camel@edumazet-glaptop3.roam.corp.google.com>

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, December 7, 2016 4:16 PM
> To: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>
> Cc: davem@davemloft.net; Williams, Mitch A <mitch.a.williams@intel.com>;
> netdev@vger.kernel.org; nhorman@redhat.com; sassmann@redhat.com;
> jogreene@redhat.com; guru.anbalagane@oracle.com; Duyck, Alexander H
> <alexander.h.duyck@intel.com>
> Subject: Re: [net-next v2 02/19] i40e: simplify txd use count calculation
> 
> On Wed, 2016-12-07 at 14:19 -0800, Jeff Kirsher wrote:
> > From: Mitch Williams <mitch.a.williams@intel.com>
> >
> > The i40e_txd_use_count function was fast but confusing. In the
> > comments, it even admits that it's ugly. So replace it with a new
> > function that is
> > (very) slightly faster and has extensive commenting to help the
> > thicker among us (including the author, who will forget in a week)
> > understand how it works.
> >
> > Change-ID: Ifb533f13786a0bf39cb29f77969a5be2c83d9a87
> > Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> > Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
> > Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > ---
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h   | 45 +++++++++++++++++---------
> -
> >  drivers/net/ethernet/intel/i40evf/i40e_txrx.h | 45
> > +++++++++++++++++----------
> >  2 files changed, 56 insertions(+), 34 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > index de8550f..e065321 100644
> > --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
> > @@ -173,26 +173,37 @@ static inline bool i40e_test_staterr(union
> > i40e_rx_desc *rx_desc,  #define I40E_MAX_DATA_PER_TXD_ALIGNED \
> >  	(I40E_MAX_DATA_PER_TXD & ~(I40E_MAX_READ_REQ_SIZE - 1))
> >
> > -/* This ugly bit of math is equivalent to DIV_ROUNDUP(size, X) where
> > X is
> > - * the value I40E_MAX_DATA_PER_TXD_ALIGNED.  It is needed due to the
> > fact
> > - * that 12K is not a power of 2 and division is expensive.  It is
> > used to
> > - * approximate the number of descriptors used per linear buffer.
> > Note
> > - * that this will overestimate in some cases as it doesn't account
> > for the
> > - * fact that we will add up to 4K - 1 in aligning the 12K buffer,
> > however
> > - * the error should not impact things much as large buffers usually
> > mean
> > - * we will use fewer descriptors then there are frags in an skb.
> > +/**
> > + * i40e_txd_use_count  - estimate the number of descriptors needed
> > +for Tx
> > + * @size: transmit request size in bytes
> > + *
> > + * Due to hardware alignment restrictions (4K alignment), we need to
> > + * assume that we can have no more than 12K of data per descriptor,
> > +even
> > + * though each descriptor can take up to 16K - 1 bytes of aligned memory.
> > + * Thus, we need to divide by 12K. But division is slow! Instead,
> > + * we decompose the operation into shifts and one relatively cheap
> > + * multiply operation.
> > + *
> > + * To divide by 12K, we first divide by 4K, then divide by 3:
> > + *     To divide by 4K, shift right by 12 bits
> > + *     To divide by 3, multiply by 85, then divide by 256
> > + *     (Divide by 256 is done by shifting right by 8 bits)
> > + * Finally, we add one to round up. Because 256 isn't an exact
> > +multiple of
> > + * 3, we'll underestimate near each multiple of 12K. This is actually
> > +more
> > + * accurate as we have 4K - 1 of wiggle room that we can fit into the
> > +last
> > + * segment.  For our purposes this is accurate out to 1M which is
> > +orders of
> > + * magnitude greater than our largest possible GSO size.
> > + *
> > + * This would then be implemented as:
> > + *     return (((size >> 12) * 85) >> 8) + 1;
> > + *
> > + * Since multiplication and division are commutative, we can reorder
> > + * operations into:
> > + *     return ((size * 85) >> 20) + 1;
> >   */
> >  static inline unsigned int i40e_txd_use_count(unsigned int size)  {
> > -	const unsigned int max = I40E_MAX_DATA_PER_TXD_ALIGNED;
> > -	const unsigned int reciprocal = ((1ull << 32) - 1 + (max / 2)) / max;
> > -	unsigned int adjust = ~(u32)0;
> > -
> > -	/* if we rounded up on the reciprocal pull down the adjustment */
> > -	if ((max * reciprocal) > adjust)
> > -		adjust = ~(u32)(reciprocal - 1);
> > -
> > -	return (u32)((((u64)size * reciprocal) + adjust) >> 32);
> > +	return ((size * 85) >> 20) + 1;
> >  }
> 
> But...
> 
> I thought gcc already implemented reciprocal divides ?
> 
> 
> $ cat div.c
> unsigned int foo(unsigned int size)
> {
> 	return size / 0x3000;
> }
> $ gcc -O2 -S div.c && cat div.s
> foo:
> .LFB0:
> 	.cfi_startproc
> 	movl	%edi, %eax
> 	movl	$-1431655765, %edx  // 0xaaaaaaab
> 	mull	%edx
> 	shrl	$13, %edx
> 	movl	%edx, %eax
> 	ret
> 

Well there ends up being a few aspects to it.  First we don't need the precision of a full 64b inverse multiplication, that is why we can get away with multiple by 85 and shift.  The assumption is we should never see a buffer larger than 64K for a TSO frame.  That being the case we can do the same thing without having to use a 64b value which isn't an option on 32b architectures.

So basically what it comes down to is dealing with the "optimized for size" kernel option, and 32b architectures not being able to do this.  Arguably both are corner cases but better to deal with them than take a performance hit we don't have to.

- Alex

^ permalink raw reply

* Re: Misalignment, MIPS, and ip_hdr(skb)->version
From: David Miller @ 2016-12-08  0:37 UTC (permalink / raw)
  To: Jason; +Cc: dave.taht, netdev, linux-mips, linux-kernel, wireguard
In-Reply-To: <CAHmME9qxz6wcOZuurqahXeY6QSF=zz0gH7gY4RZujLAJPNSWBA@mail.gmail.com>

From: "Jason A. Donenfeld" <Jason@zx2c4.com>
Date: Thu, 8 Dec 2016 01:29:42 +0100

> On Wed, Dec 7, 2016 at 8:52 PM, David Miller <davem@davemloft.net> wrote:
>> The only truly difficult case to handle is GRE encapsulation.  Is
>> that the situation you are running into?
>>
>> If not, please figure out what the header configuration looks like
>> in the case that hits for you, and what the originating device is
>> just in case it is a device driver issue.
> 
> My case is my own driver and my own protocol, which uses a 13 byte
> header. I can, if absolutely necessary, change the protocol to add
> another byte of padding. Or I can choose not to decrypt in place but
> rather use a different trick, like overwriting the header during
> decryption, though this removes some of the scatterwalk optimizations
> when src and dst are the same. Or something else. I wrote the top
> email of this thread inquiring about just exactly how bad it is to
> call netif_rx(skb) when skb->data is unaligned.

You really have to land the IP header on a proper 4 byte boundary.

I would suggest pushing 3 dummy garbage bytes of padding at the front
or the end of your header.

^ permalink raw reply

* Re: [PATCH 00/50] Netfilter/IPVS updates for net-next
From: David Miller @ 2016-12-08  0:29 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1481147576-5690-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Wed,  7 Dec 2016 22:52:06 +0100

> The following patchset contains a large Netfilter update for net-next,
> to summarise:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Pulled, thanks a lot Pablo.

^ permalink raw reply

* [PATCH 1/1] orinoco: fix improper return value
From: Pan Bian @ 2016-12-08  0:40 UTC (permalink / raw)
  To: Kalle Valo, linux-wireless, netdev; +Cc: linux-kernel, Pan Bian

Function orinoco_ioctl_commit() returns 0 (indicates success) when the
call to orinoco_lock() fails. Thus, the return value is inconsistent with
the execution status. It may be better to return "-EBUSY" when the call 
to orinoco_lock() fails.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188671

Signed-off-by: Pan Bian <bianpan2016@163.com>
---
 drivers/net/wireless/intersil/orinoco/wext.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/intersil/orinoco/wext.c b/drivers/net/wireless/intersil/orinoco/wext.c
index 1d4dae4..fee57ea 100644
--- a/drivers/net/wireless/intersil/orinoco/wext.c
+++ b/drivers/net/wireless/intersil/orinoco/wext.c
@@ -1314,7 +1314,7 @@ static int orinoco_ioctl_commit(struct net_device *dev,
 		return 0;
 
 	if (orinoco_lock(priv, &flags) != 0)
-		return err;
+		return -EBUSY;
 
 	err = orinoco_commit(priv);
 
-- 
1.9.1

^ permalink raw reply related

* Re: [net-next v2 02/19] i40e: simplify txd use count calculation
From: Eric Dumazet @ 2016-12-08  1:03 UTC (permalink / raw)
  To: Duyck, Alexander H
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, Williams, Mitch A,
	netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	jogreene@redhat.com, guru.anbalagane@oracle.com
In-Reply-To: <B1C1DF2ACD01FD4881736AA51731BAB2A54ADC@ORSMSX107.amr.corp.intel.com>

On Thu, 2016-12-08 at 00:35 +0000, Duyck, Alexander H wrote:

> Well there ends up being a few aspects to it.  First we don't need the
> precision of a full 64b inverse multiplication, that is why we can get
> away with multiple by 85 and shift.  The assumption is we should never
> see a buffer larger than 64K for a TSO frame.  That being the case we
> can do the same thing without having to use a 64b value which isn't an
> option on 32b architectures.
> 
> So basically what it comes down to is dealing with the "optimized for
> size" kernel option, and 32b architectures not being able to do this.
> Arguably both are corner cases but better to deal with them than take
> a performance hit we don't have to.

ok ok ;)

Too bad the 65536 value is accepted, (is it ?) otherwise

unsigned int foo(unsigned short size)
{
	return size / 0x3000;
}

-> generates the same kind of instructions, with maybe a better
precision.

foo:
	movzwl	%di, %eax
	imull	$43691, %eax, %eax
	shrl	$29, %eax
	ret

^ permalink raw reply

* RE: [net-next] icmp: correct return value of icmp_rcv()
From: 张胜举 @ 2016-12-08  1:09 UTC (permalink / raw)
  To: 'Eric Dumazet'; +Cc: netdev
In-Reply-To: <1481120298.4930.3.camel@edumazet-glaptop3.roam.corp.google.com>

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, December 07, 2016 10:18 PM
> To: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> Cc: netdev@vger.kernel.org
> Subject: Re: [net-next] icmp: correct return value of icmp_rcv()
> 
> On Wed, 2016-12-07 at 14:52 +0800, Zhang Shengju wrote:
> > Currently, icmp_rcv() always return zero on a packet delivery upcall.
> >
> > To make its behavior more compliant with the way this API should be
> > used, this patch changes this to let it return NET_RX_SUCCESS when the
> > packet is proper handled, and NET_RX_DROP otherwise.
> >
> > Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> > ---
> >  net/ipv4/icmp.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index 691146a..f79d7a8
> > 100644
> > --- a/net/ipv4/icmp.c
> > +++ b/net/ipv4/icmp.c
> > @@ -1047,12 +1047,12 @@ int icmp_rcv(struct sk_buff *skb)
> >
> >  	if (success)  {
> >  		consume_skb(skb);
> > -		return 0;
> > +		return NET_RX_SUCCESS;
> >  	}
> >
> >  drop:
> >  	kfree_skb(skb);
> > -	return 0;
> > +	return NET_RX_DROP;
> >  csum_error:
> >  	__ICMP_INC_STATS(net, ICMP_MIB_CSUMERRORS);
> >  error:
> 
> 
> I am curious, what external/visible effects do you expect from such a change ?
> 
> We now have a very precise monitoring of where packets are dropped
> (consume_skb()/kfree_skb())
> 
> 

I know that the return value is always ignored, I just to want to make it 
compliant with the way this API required like I said in the comment.

Thanks,

^ permalink raw reply

* RE: [net-next v2 02/19] i40e: simplify txd use count calculation
From: Duyck, Alexander H @ 2016-12-08  1:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kirsher, Jeffrey T, davem@davemloft.net, Williams, Mitch A,
	netdev@vger.kernel.org, nhorman@redhat.com, sassmann@redhat.com,
	jogreene@redhat.com, guru.anbalagane@oracle.com
In-Reply-To: <1481159012.4930.85.camel@edumazet-glaptop3.roam.corp.google.com>

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, December 7, 2016 5:04 PM
> To: Duyck, Alexander H <alexander.h.duyck@intel.com>
> Cc: Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; davem@davemloft.net;
> Williams, Mitch A <mitch.a.williams@intel.com>; netdev@vger.kernel.org;
> nhorman@redhat.com; sassmann@redhat.com; jogreene@redhat.com;
> guru.anbalagane@oracle.com
> Subject: Re: [net-next v2 02/19] i40e: simplify txd use count calculation
> 
> On Thu, 2016-12-08 at 00:35 +0000, Duyck, Alexander H wrote:
> 
> > Well there ends up being a few aspects to it.  First we don't need the
> > precision of a full 64b inverse multiplication, that is why we can get
> > away with multiple by 85 and shift.  The assumption is we should never
> > see a buffer larger than 64K for a TSO frame.  That being the case we
> > can do the same thing without having to use a 64b value which isn't an
> > option on 32b architectures.
> >
> > So basically what it comes down to is dealing with the "optimized for
> > size" kernel option, and 32b architectures not being able to do this.
> > Arguably both are corner cases but better to deal with them than take
> > a performance hit we don't have to.
> 
> ok ok ;)
> 
> Too bad the 65536 value is accepted, (is it ?) otherwise
> 
> unsigned int foo(unsigned short size)
> {
> 	return size / 0x3000;
> }
> 
> -> generates the same kind of instructions, with maybe a better
> precision.
> 
> foo:
> 	movzwl	%di, %eax
> 	imull	$43691, %eax, %eax
> 	shrl	$29, %eax
> 	ret

I haven't ever looked all that closely.  I'm assuming the frame size can probably exceed by at least ETH_HLEN - 1 since the IP header length can theoretically reach 64K - 1 before we are maxed out.

We don't really need the extra precision anyway.  We will be off by 1 most likely anyway in many cases since the actual hardware can handle up to 16K - 1 in either the first and/or last buffer of any series since the actual limit is the 14 bits in the Tx descriptor for reporting the buffer length and we try to keep the split between buffers 4K aligned.

- Alex

^ permalink raw reply

* Re: [PATCH 1/1] ixgbe: fcoe: return value of skb_linearize should be handled
From: Zhouyi Zhou @ 2016-12-08  1:21 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: intel-wired-lan, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Zhouyi Zhou
In-Reply-To: <1481131804.2404.10.camel@intel.com>

Thanks Jeff for your advice,
Sorry for the  my innocence as a Linux kernel rookie.

Zhouyi

On Thu, Dec 8, 2016 at 1:30 AM, Jeff Kirsher
<jeffrey.t.kirsher@intel.com> wrote:
> On Wed, 2016-12-07 at 15:43 +0800, Zhouyi Zhou wrote:
>> Signed-off-by: Zhouyi Zhou <yizhouzhou@ict.ac.cn>
>> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
>> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
>> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_fcoe.c | 6 +++++-
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 +--
>>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> Did Cong, Yuval and Eric give their Reviewed-by offline?  I see they made
> comments and suggests, but never saw them actually give you their Reviewed-
> by.  You cannot automatically add their Reviewed-by, Signed-off-by, etc
> just because someone provides feedback on your patch.

^ permalink raw reply

* [PATCH net 1/1] driver: ipvlan: Unlink the upper dev when ipvlan_link_new failed
From: fgao @ 2016-12-08  1:21 UTC (permalink / raw)
  To: davem, maheshb, edumazet, netdev, gfree.wind; +Cc: Gao Feng

From: Gao Feng <fgao@ikuai8.com>

When netdev_upper_dev_unlink failed in ipvlan_link_new, need to
unlink the ipvlan dev with upper dev.

Signed-off-by: Gao Feng <fgao@ikuai8.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 0fef178..189adbc 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -546,13 +546,15 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev,
 	}
 	err = ipvlan_set_port_mode(port, mode);
 	if (err) {
-		goto unregister_netdev;
+		goto dev_unlink;
 	}
 
 	list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans);
 	netif_stacked_transfer_operstate(phy_dev, dev);
 	return 0;
 
+dev_unlink:
+	netdev_upper_dev_unlink(phy_dev, dev);
 unregister_netdev:
 	unregister_netdevice(dev);
 destroy_ipvlan_port:
-- 
1.9.1

^ permalink raw reply related

* Re: [Intel-wired-lan] [PATCH 1/1] ixgbe: fcoe: return value of skb_linearize should be handled
From: Rustad, Mark D @ 2016-12-08  1:32 UTC (permalink / raw)
  To: Zhouyi Zhou
  Cc: Kirsher, Jeffrey T, intel-wired-lan@lists.osuosl.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Zhouyi Zhou
In-Reply-To: <1481096614-25295-1-git-send-email-zhouzhouyi@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

Zhouyi Zhou <zhouzhouyi@gmail.com> wrote:

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index fee1f29..4926d48 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -2173,8 +2173,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector  
> *q_vector,
>  				total_rx_bytes += ddp_bytes;
>  				total_rx_packets += DIV_ROUND_UP(ddp_bytes,
>  								 mss);
> -			}
> -			if (!ddp_bytes) {
> +			} else {
>  				dev_kfree_skb_any(skb);
>  				continue;
>  			}

This is changing the logic by treating a negative ddp_bytes value (an error  
return) the same as a 0 value. This is probably wrong and inappropriate for  
this patch in any case.

--
Mark Rustad, Networking Division, Intel Corporation

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 841 bytes --]

^ permalink raw reply

* Re: [PATCH net 1/1] driver: ipvlan: Unlink the upper dev when ipvlan_link_new failed
From: Mahesh Bandewar (महेश बंडेवार) @ 2016-12-08  1:39 UTC (permalink / raw)
  To: fgao; +Cc: David Miller, Eric Dumazet, linux-netdev, gfree.wind
In-Reply-To: <1481160060-28730-1-git-send-email-fgao@ikuai8.com>

On Wed, Dec 7, 2016 at 5:21 PM,  <fgao@ikuai8.com> wrote:
> From: Gao Feng <fgao@ikuai8.com>
>
> When netdev_upper_dev_unlink failed in ipvlan_link_new, need to
> unlink the ipvlan dev with upper dev.
>
> Signed-off-by: Gao Feng <fgao@ikuai8.com>
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 0fef178..189adbc 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -546,13 +546,15 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev,
>         }
>         err = ipvlan_set_port_mode(port, mode);
>         if (err) {
> -               goto unregister_netdev;
> +               goto dev_unlink;
>         }
>
>         list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans);
>         netif_stacked_transfer_operstate(phy_dev, dev);
>         return 0;
>
> +dev_unlink:
probably 'unlink_netdev' label inline with other labels used. thanks
> +       netdev_upper_dev_unlink(phy_dev, dev);
>  unregister_netdev:
>         unregister_netdevice(dev);
>  destroy_ipvlan_port:
> --
> 1.9.1
>
>

^ permalink raw reply

* Re: [PATCH net 1/1] driver: ipvlan: Unlink the upper dev when ipvlan_link_new failed
From: Gao Feng @ 2016-12-08  1:44 UTC (permalink / raw)
  To: Mahesh Bandewar (महेश बंडेवार)
  Cc: David Miller, Eric Dumazet, linux-netdev
In-Reply-To: <CAF2d9jgfO=__ympFr-RSeWnKcwzDYDee5kh+Dp_KPk1fsMCHYQ@mail.gmail.com>

On Thu, Dec 8, 2016 at 9:39 AM, Mahesh Bandewar (महेश बंडेवार)
<maheshb@google.com> wrote:
> On Wed, Dec 7, 2016 at 5:21 PM,  <fgao@ikuai8.com> wrote:
>> From: Gao Feng <fgao@ikuai8.com>
>>
>> When netdev_upper_dev_unlink failed in ipvlan_link_new, need to
>> unlink the ipvlan dev with upper dev.
>>
>> Signed-off-by: Gao Feng <fgao@ikuai8.com>
>> ---
>>  drivers/net/ipvlan/ipvlan_main.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
>> index 0fef178..189adbc 100644
>> --- a/drivers/net/ipvlan/ipvlan_main.c
>> +++ b/drivers/net/ipvlan/ipvlan_main.c
>> @@ -546,13 +546,15 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev,
>>         }
>>         err = ipvlan_set_port_mode(port, mode);
>>         if (err) {
>> -               goto unregister_netdev;
>> +               goto dev_unlink;
>>         }
>>
>>         list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans);
>>         netif_stacked_transfer_operstate(phy_dev, dev);
>>         return 0;
>>
>> +dev_unlink:
> probably 'unlink_netdev' label inline with other labels used. thanks

OK, it is better name.
I will follow it and send v2 update.

Regards
Feng

>> +       netdev_upper_dev_unlink(phy_dev, dev);
>>  unregister_netdev:
>>         unregister_netdevice(dev);
>>  destroy_ipvlan_port:
>> --
>> 1.9.1
>>
>>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox