Netdev List
* Re: [PATCH 26/39] rtc/proc: switch to proc_create_single_data
From: Christoph Hellwig @ 2018-04-24 14:15 UTC (permalink / raw)
  To: Alexandre Belloni
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, Alexey Dobriyan,
	Greg Kroah-Hartman, Jiri Slaby, Corey Minyard, Alessandro Zummo,
	linux-acpi, drbd-dev, linux-ide, netdev, linux-rtc,
	megaraidlinux.pdl, linux-scsi, devel, linux-afs, linux-ext4,
	jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <20180419131027.GC7369@piout.net>

On Thu, Apr 19, 2018 at 03:10:27PM +0200, Alexandre Belloni wrote:
> On 19/04/2018 14:41:27+0200, Christoph Hellwig wrote:
> > And stop trying to get a reference on the submodule; the procfs code deals
> > with release after an unloaded module and thus a removed proc entry.
> > 
> 
> Are you sure about that? The rtc module is not the one adding the procfs
> file so I'm not sure how the procfs code can handle it.

The proc file is removed from this call chain:

  <driver>_exit (module_exit handler)
    -> rtc_device_unregister
      -> rtc_proc_del_device
        -> remove_proc_entry

remove_proc_entry takes care of waiting for currently active file
operation instances and makes sure no new operation calls into the
actual proc file ops.  The same pattern as in RTC exists all over the
kernel.
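
The guarantee described above can be pictured with a small
single-threaded userspace model.  This is only a sketch: every name in
it (model_pde, model_use, model_unuse, model_try_remove) is hypothetical
and not the procfs API, and the real code has to block and handle
concurrency, which the model reduces to a boolean result.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is not the kernel's proc_dir_entry. */
struct model_pde {
	int in_use;	/* file operations currently running */
	bool removed;	/* set once removal has started */
};

/* Called before entering a file operation; fails once removal began. */
bool model_use(struct model_pde *e)
{
	if (e->removed)
		return false;
	e->in_use++;
	return true;
}

/* Called when the file operation returns. */
void model_unuse(struct model_pde *e)
{
	e->in_use--;
}

/*
 * Start removal: new callers are turned away from now on.  Returns true
 * only when no operation is still running, i.e. when it would be safe to
 * let the module that owns the file ops go away.
 */
bool model_try_remove(struct model_pde *e)
{
	e->removed = true;
	return e->in_use == 0;
}
```

In this model the module_exit path corresponds to calling
model_try_remove() until it returns true; the kernel waits instead of
polling.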


* Re: [PATCH 16/39] ipmi: simplify procfs code
From: Christoph Hellwig @ 2018-04-24 14:16 UTC (permalink / raw)
  To: Corey Minyard
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, Alexey Dobriyan,
	Greg Kroah-Hartman, Jiri Slaby, Alessandro Zummo,
	Alexandre Belloni, linux-acpi, drbd-dev, linux-ide, netdev,
	linux-rtc, megaraidlinux.pdl, linux-scsi, devel, linux-afs,
	linux-ext4, jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <f322f243-9ab1-7e9f-00a4-9652cd288ca2@acm.org>

On Thu, Apr 19, 2018 at 10:29:29AM -0500, Corey Minyard wrote:
> On 04/19/2018 07:41 AM, Christoph Hellwig wrote:
>> Use remove_proc_subtree to remove the whole subtree on cleanup instead
>> of a hand rolled list of proc entries, unwind the registration loop into
>> individual calls.  Switch to use proc_create_single to further simplify
>> the code.
>
> I'm yanking all the proc code out of the IPMI driver in 3.18.  So this is
> probably not necessary.

Ok, I'll drop this patch.


* Re: [PATCH net-next V3 0/3] Introduce adaptive TX interrupt moderation to net DIM
From: David Miller @ 2018-04-24 14:18 UTC (permalink / raw)
  To: talgi; +Cc: netdev, tariqt, saeedm, f.fainelli, andrew.gospodarek
In-Reply-To: <1524566163-41563-1-git-send-email-talgi@mellanox.com>

From: Tal Gilboa <talgi@mellanox.com>
Date: Tue, 24 Apr 2018 13:36:00 +0300

> Net DIM is a library designed for dynamic interrupt moderation. It was
> implemented and optimized with receive side interrupts in mind, since these
> are usually the CPU expensive ones. This patch-set introduces adaptive transmit
> interrupt moderation to net DIM, complete with a usage in the mlx5e driver.
> Using adaptive TX behavior would reduce interrupt rate for multiple scenarios.
> Furthermore, it is essential for increasing bandwidth on cases where payload
> aggregation is required.
> 
> v3: Remove "inline" from functions in .c files (requested by DaveM). Revert
> adding "enabled" field from struct net_dim and applied mlx5e structural
> suggestions (suggested by SaeedM).
> 
> v2: Rebase over proper tree.
> 
> v1: Fix compilation issues due to missed function renaming.

I have no problem with this, series applied, thanks.

Although I have to say that I've always been suspicious of adaptive moderation
schemes, especially if implemented in software.

My thinking was that at these kinds of link speeds, the conditions of the link
change so fast that whatever state you've measured changes by the time you
commit new settings to the chip.

It obviously helps, so I must be missing some piece of the puzzle in my mental
analysis :-)


* Re: [PATCH 03/39] proc: introduce proc_create_seq_private
From: Christoph Hellwig @ 2018-04-24 14:19 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro, linux-rtc,
	Alessandro Zummo, Alexandre Belloni, devel, linux-kernel,
	linux-scsi, Corey Minyard, linux-ide, Greg Kroah-Hartman,
	jfs-discussion, linux-afs, linux-acpi, netdev, netfilter-devel,
	Jiri Slaby, linux-ext4, Alexey Dobriyan, megaraidlinux.pdl,
	drbd-dev
In-Reply-To: <20180419141818.pjys7at4xmz2h6ho@mwanda>

On Thu, Apr 19, 2018 at 05:18:18PM +0300, Dan Carpenter wrote:
> > -static const struct file_operations cio_ignore_proc_fops = {
> > -	.open    = cio_ignore_proc_open,
> > -	.read    = seq_read,
> > -	.llseek  = seq_lseek,
> > -	.release = seq_release_private,
> > -	.write   = cio_ignore_write,
>                    ^^^^^^^^^^^^^^^^
> The cio_ignore_write() function isn't used any more so compilers will
> complain.

No compiler in the buildbot farm complained, but nevertheless this


* Re: simplify procfs code for seq_file instances
From: Christoph Hellwig @ 2018-04-24 14:23 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-rtc, Alessandro Zummo, Alexandre Belloni, devel,
	linux-kernel, linux-scsi, Corey Minyard, linux-ide,
	Greg Kroah-Hartman, jfs-discussion, linux-afs, linux-acpi, netdev,
	netfilter-devel, Alexander Viro, Jiri Slaby, Andrew Morton,
	linux-ext4, Christoph Hellwig, megaraidlinux.pdl, drbd-dev
In-Reply-To: <20180419185750.GD2066@avx2>

On Thu, Apr 19, 2018 at 09:57:50PM +0300, Alexey Dobriyan wrote:
> >     git://git.infradead.org/users/hch/misc.git proc_create
> 
> 
> I want to ask if it is time to start using poor man's function overloading
> with _b_c_e(). There are millions of allocation functions for example,
> all slightly different, and people will add more. Seeing /proc interfaces
> doubled like this is painful.

Function overloading is totally unacceptable.

And I very much disagree with a tradeoff that keeps 5000 lines of 
code vs a few new helpers.


* Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
From: kbuild test robot @ 2018-04-24 14:27 UTC (permalink / raw)
  To: Björn Töpel
  Cc: kbuild-all, bjorn.topel, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, john.fastabend, ast, brouer,
	willemdebruijn.kernel, daniel, mst, netdev, Björn Töpel,
	michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180423135619.7179-3-bjorn.topel@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

Hi Björn,

I love your patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introducing-AF_XDP-support/20180424-085240
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: m68k-allyesconfig (attached as .config)
compiler: m68k-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   net/xdp/xdp_umem.o: In function `xdp_umem_reg':
>> xdp_umem.c:(.text+0x200): undefined reference to `__udivdi3'
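
For context: `__udivdi3` is the libgcc routine GCC calls for an unsigned
64-bit division on targets that have no native instruction for it.  The
kernel does not link against libgcc on such targets, so a plain `/` on a
64-bit value leaves the symbol unresolved.  The report does not show which
expression in xdp_umem_reg is responsible.  Kernel code usually resolves
this with helpers such as div_u64(); the userspace sketch below shows a
different option, a shift, under the assumption (not confirmed by the
report) that the divisor is a non-zero power of two.  The names
model_ilog2 and model_nframes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: log2 of a non-zero power-of-two value. */
unsigned int model_ilog2(uint32_t v)
{
	unsigned int shift = 0;

	while (v > 1) {
		v >>= 1;
		shift++;
	}
	return shift;
}

/*
 * Number of whole frames in a region of `size` bytes, computed without
 * the 64-bit '/' operator.  Assumes frame_size is a non-zero power of two.
 */
uint64_t model_nframes(uint64_t size, uint32_t frame_size)
{
	return size >> model_ilog2(frame_size);
}
```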

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 45403 bytes --]


* Re: [PATCH 03/39] proc: introduce proc_create_seq_private
From: Christoph Hellwig @ 2018-04-24 14:29 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-rtc, Alessandro Zummo, Alexandre Belloni, devel,
	linux-kernel, linux-scsi, Corey Minyard, linux-ide,
	Greg Kroah-Hartman, jfs-discussion, linux-afs, linux-acpi, netdev,
	netfilter-devel, Alexander Viro, Jiri Slaby, Andrew Morton,
	linux-ext4, Christoph Hellwig, megaraidlinux.pdl, drbd-dev
In-Reply-To: <20180419185027.GC2066@avx2>

On Thu, Apr 19, 2018 at 09:50:27PM +0300, Alexey Dobriyan wrote:
> On Thu, Apr 19, 2018 at 02:41:04PM +0200, Christoph Hellwig wrote:
> > Variant of proc_create_data that directly takes a struct seq_operations
> 
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -45,6 +45,7 @@ struct proc_dir_entry {
> >  	const struct inode_operations *proc_iops;
> >  	const struct file_operations *proc_fops;
> >  	const struct seq_operations *seq_ops;
> > +	size_t state_size;
> 
> "unsigned int" please.
> 
> Where have you seen 4GB priv states?

We're passing the result of sizeof, which happens to be a size_t.
But if it makes you happy I can switch to unsigned int.


* Re: [PATCH 02/39] proc: introduce proc_create_seq{,_data}
From: Christoph Hellwig @ 2018-04-24 14:29 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Christoph Hellwig, Andrew Morton, Alexander Viro,
	Greg Kroah-Hartman, Jiri Slaby, Corey Minyard, Alessandro Zummo,
	Alexandre Belloni, linux-acpi, drbd-dev, linux-ide, netdev,
	linux-rtc, megaraidlinux.pdl, linux-scsi, devel, linux-afs,
	linux-ext4, jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <20180419184106.GA2066@avx2>

On Thu, Apr 19, 2018 at 09:41:06PM +0300, Alexey Dobriyan wrote:
> Should be oopsable.
> Once proc_create_data() returns, entry is live, ->open can be called.

Ok, switching to opencoding proc_create_data instead.
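
The ordering problem raised here can be illustrated with a userspace
model.  It is a sketch under stated assumptions: model_entry, model_dir,
model_open and model_create_seq are hypothetical names, not procfs
functions, and the model is single-threaded, so it only shows the order
of the steps and not the memory barriers a real implementation needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a proc entry. */
struct model_entry {
	const char *name;
	const void *seq_ops;	/* must be valid before anyone can open the entry */
};

/* Hypothetical stand-in for the directory that makes an entry visible. */
struct model_dir {
	struct model_entry *live;	/* non-NULL: open can reach the entry */
};

/* Models ->open, which uses seq_ops as soon as the entry is visible. */
int model_open(const struct model_dir *dir)
{
	if (!dir->live)
		return -1;			/* not registered yet */
	return dir->live->seq_ops ? 0 : -2;	/* -2 stands in for the oops */
}

/* Safe order: fill in every field first, publish the entry last. */
void model_create_seq(struct model_dir *dir, struct model_entry *e,
		      const char *name, const void *ops)
{
	e->name = name;
	e->seq_ops = ops;
	dir->live = e;		/* registration is the final step */
}
```

Setting seq_ops after the assignment to dir->live would open a window in
which model_open() returns -2, which is the case Alexey describes.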


* [PATCH RFC 0/9] veth: Driver XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This patch set introduces driver XDP for veth.
It is mainly meant to be used together with the redirect action of
another XDP program.

  NIC -----------> veth===veth
 (XDP) (redirect)        (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.

The envisioned use cases are:

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.

With a single core and simple XDP programs that only redirect and drop
packets, I got a 10.5 Mpps redirect/drop rate with an i40e 25G NIC + veth.

XXV710 (i40e) --- (XDP redirect) --> veth===veth (XDP drop)

This changeset uses NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT, mainly to avoid stack inflation caused by recursive
calls of XDP programs.

As an RFC, this does not yet implement the recently introduced
xdp_adjust_tail, and it is based on top of Jesper's redirect memory
return API patch set (684009d4fdaf).
Any feedback is welcome. Thanks!

Toshiaki Makita (9):
  net: Export skb_headers_offset_update and skb_copy_header
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Use NAPI for XDP
  veth: Handle xdp_frame in xdp napi ring
  veth: Add ndo_xdp_xmit
  veth: Add XDP TX and REDIRECT
  veth: Avoid per-packet spinlock of XDP napi ring on dequeueing
  veth: Avoid per-packet spinlock of XDP napi ring on enqueueing

 drivers/net/veth.c     | 688 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/filter.h |  16 ++
 include/linux/skbuff.h |   2 +
 net/core/filter.c      |  11 +-
 net/core/skbuff.c      |  12 +-
 5 files changed, 699 insertions(+), 30 deletions(-)

-- 
2.14.3


* [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 12 +++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9065477ed255..fdf80a9d4582 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1030,6 +1030,8 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b51837ca8..531354900177 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1290,7 +1290,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1304,8 +1304,9 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_network_header += off;
 	skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1313,6 +1314,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1355,7 +1357,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1419,7 +1421,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1599,7 +1601,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3


* [PATCH RFC 2/9] veth: Add driver XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is a basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

By itself this is not very useful, but it is a starting point for
implementing other useful veth XDP features like TX and REDIRECT.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 210 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 205 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..9c4197306716 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,15 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -30,9 +35,11 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 /*
@@ -98,6 +105,25 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_link_ksettings	= veth_get_link_ksettings,
 };
 
+/* general routines */
+
+static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+					struct sk_buff *skb);
+
+static int veth_xdp_rx(struct net_device *dev, struct sk_buff *skb)
+{
+	skb = veth_xdp_rcv_skb(dev, skb);
+	if (!skb)
+		return NET_RX_DROP;
+
+	return netif_rx(skb);
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb)
+{
+	return __dev_forward_skb(dev, skb) ?: veth_xdp_rx(dev, skb);
+}
+
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -111,7 +137,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -126,10 +152,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -179,19 +201,152 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		buflen = SKB_DATA_ALIGN(headroom + len) +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}
+
+static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+					struct sk_buff *skb)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	u32 pktlen, headroom, act, metalen;
+	int size, mac_len, delta, off;
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	void *orig_data;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	mac_len = skb->data - skb_mac_header(skb);
+	pktlen = skb->len + mac_len;
+	size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
+	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (size > PAGE_SIZE)
+		goto drop;
+
+	headroom = skb_headroom(skb) - mac_len;
+	if (skb_shared(skb) || skb_head_is_locked(skb) ||
+	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
+		struct sk_buff *nskb;
+		void *head, *start;
+		struct page *page;
+		int head_off;
+
+		page = alloc_page(GFP_ATOMIC);
+		if (!page)
+			goto drop;
+
+		head = page_address(page);
+		start = head + VETH_XDP_HEADROOM;
+		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		nskb = veth_build_skb(head,
+				      VETH_XDP_HEADROOM + mac_len, skb->len,
+				      PAGE_SIZE);
+		if (!nskb) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		skb_copy_header(nskb, skb);
+		head_off = skb_headroom(nskb) - skb_headroom(skb);
+		skb_headers_offset_update(nskb, head_off);
+		dev_consume_skb_any(skb);
+		skb = nskb;
+	}
+
+	xdp.data_hard_start = skb->head;
+	xdp.data = skb_mac_header(skb);
+	xdp.data_end = xdp.data + pktlen;
+	xdp.data_meta = xdp.data;
+	xdp.rxq = &priv->xdp_rxq;
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	switch (act) {
+	case XDP_PASS:
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(dev, xdp_prog, act);
+	case XDP_DROP:
+		goto drop;
+	}
+	rcu_read_unlock();
+
+	delta = orig_data - xdp.data;
+	off = mac_len + delta;
+	if (off > 0)
+		__skb_push(skb, off);
+	else if (off < 0)
+		__skb_pull(skb, -off);
+	skb->mac_header -= delta;
+	skb->protocol = eth_type_trans(skb, dev);
+
+	metalen = xdp.data - xdp.data_meta;
+	if (metalen)
+		skb_metadata_set(skb, metalen);
+out:
+	return skb;
+drop:
+	rcu_read_unlock();
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer = rtnl_dereference(priv->peer);
+	int err;
 
 	if (!peer)
 		return -ENOTCONN;
 
+	err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
+	if (err < 0)
+		return err;
+
+	err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err < 0)
+		goto err_reg_mem;
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
 	}
+
 	return 0;
+err_reg_mem:
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
+	return err;
 }
 
 static int veth_close(struct net_device *dev)
@@ -203,6 +358,8 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
 	return 0;
 }
 
@@ -276,6 +433,48 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+			struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	old_prog = rtnl_dereference(priv->xdp_prog);
+
+	rcu_assign_pointer(priv->xdp_prog, prog);
+
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static u32 veth_xdp_query(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog;
+
+	xdp_prog = rtnl_dereference(priv->xdp_prog);
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = veth_xdp_query(dev);
+		xdp->prog_attached = !!xdp->prog_id;
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -290,6 +489,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_iflink		= veth_get_iflink,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
+	.ndo_bpf		= veth_xdp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3


* [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

All oversized packets, including GSO packets, are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.
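
As a worked example of the MTU cap, the sketch below evaluates the
max_mtu expression from veth_xdp_set() in the diff with illustrative
numbers.  The values are assumptions: 4 KiB pages, NET_IP_ALIGN of 0, a
14-byte Ethernet header, and 320 bytes for the aligned skb_shared_info.
The real values depend on architecture and kernel configuration, and
the MODEL_* names are hypothetical.

```c
#include <assert.h>

/* Illustrative values only; see the assumptions stated above. */
#define MODEL_PAGE_SIZE		4096	/* assumption: 4 KiB pages */
#define MODEL_XDP_HEADROOM	256	/* XDP_PACKET_HEADROOM */
#define MODEL_NET_IP_ALIGN	0	/* assumption: x86-style value */
#define MODEL_HARD_HEADER_LEN	14	/* Ethernet header */
#define MODEL_SHINFO_SIZE	320	/* assumption: aligned skb_shared_info */

/* Mirrors the max_mtu expression used when a program is first attached. */
int model_veth_xdp_max_mtu(void)
{
	return MODEL_PAGE_SIZE
	       - (MODEL_XDP_HEADROOM + MODEL_NET_IP_ALIGN)
	       - MODEL_HARD_HEADER_LEN
	       - MODEL_SHINFO_SIZE;
}
```

With these numbers the cap is 4096 - 256 - 14 - 320 = 3506 bytes, so a
standard 1500-byte MTU is unaffected while a 9000-byte jumbo MTU on the
peer would be lowered.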

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9c4197306716..7271d9582b4a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -410,6 +410,23 @@ static int veth_get_iflink(const struct net_device *dev)
 	return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		if (rtnl_dereference(peer_priv->xdp_prog))
+			features &= ~NETIF_F_GSO_SOFTWARE;
+	}
+
+	return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
 	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -438,13 +455,32 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
+	struct net_device *peer;
 
 	old_prog = rtnl_dereference(priv->xdp_prog);
+	peer = rtnl_dereference(priv->peer);
+
+	if (!old_prog && prog && peer) {
+		peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+		peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+			peer->hard_header_len -
+			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		if (peer->mtu > peer->max_mtu)
+			dev_set_mtu(peer, peer->max_mtu);
+	}
 
 	rcu_assign_pointer(priv->xdp_prog, prog);
 
-	if (old_prog)
+	if (old_prog) {
 		bpf_prog_put(old_prog);
+		if (!prog && peer) {
+			peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = ETH_MAX_MTU;
+		}
+	}
+
+	if ((!!old_prog ^ !!prog) && peer)
+		netdev_update_features(peer);
 
 	return 0;
 }
@@ -487,6 +523,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_poll_controller	= veth_poll_controller,
 #endif
 	.ndo_get_iflink		= veth_get_iflink,
+	.ndo_fix_features	= veth_fix_features,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
-- 
2.14.3


* [PATCH RFC 4/9] veth: Use NAPI for XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

In order to avoid stack inflation caused by recursive XDP program calls
from ndo_xdp_xmit, this change introduces NAPI in veth.

Add veth's own NAPI handler when XDP is enabled.
Use ptr_ring to emulate a NIC ring. The Tx function enqueues packets to
the ring and the peer NAPI handler drains the ring.

This also keeps the REDIRECT bulk interface simple. When ndo_xdp_xmit is
implemented later, ndo_xdp_flush schedules NAPI of the peer veth device
and NAPI handles xdp frames enqueued by previous ndo_xdp_xmit, which is
quite similar to a physical NIC tx function using DMA ring descriptors
and an MMIO doorbell.

Currently only one ring is allocated for each veth device, so it does
not scale in multiqueue environments. This can be resolved in the future
by allocating rings on a per-queue basis.

Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.
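
The enqueue/flush/poll split described above can be sketched as a
single-threaded userspace model.  All names (model_rx, model_enqueue,
model_flush, model_poll) are hypothetical; the model leaves out the
memory barriers and the re-check of the ring after unmasking that the
real handler in the diff performs, because nothing runs concurrently
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RING_SIZE 4	/* tiny on purpose; the patch uses 256 */

/* Hypothetical model of the per-device receive state. */
struct model_rx {
	void *slot[MODEL_RING_SIZE];
	unsigned int head, tail;	/* producer / consumer positions */
	bool notify_masked;		/* true while a poll is already scheduled */
	int scheduled;			/* how often the poller was scheduled */
};

/* Producer side: fails when the ring is full, like a NIC TX ring. */
int model_enqueue(struct model_rx *rx, void *pkt)
{
	if (rx->head - rx->tail == MODEL_RING_SIZE)
		return -1;
	rx->slot[rx->head++ % MODEL_RING_SIZE] = pkt;
	return 0;
}

/* "Doorbell": schedule the poller only if it is not already pending. */
void model_flush(struct model_rx *rx)
{
	if (!rx->notify_masked) {
		rx->notify_masked = true;
		rx->scheduled++;
	}
}

/* Consumer side: drain at most `budget` packets, unmask if done early. */
int model_poll(struct model_rx *rx, int budget)
{
	int done = 0;

	while (done < budget && rx->tail != rx->head) {
		rx->tail++;	/* a real handler would run XDP on the packet */
		done++;
	}
	if (done < budget)
		rx->notify_masked = false;
	return done;
}
```

Because the sender only enqueues and rings the doorbell, the receiving
XDP program runs later from the poll loop rather than on the sender's
stack, which is the recursion the commit message wants to avoid.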

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 197 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 164 insertions(+), 33 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7271d9582b4a..452771f31c30 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -21,11 +21,13 @@
 #include <linux/module.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
 struct pcpu_vstats {
@@ -35,10 +37,14 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	bool			rx_notify_masked;
+	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
 };
 
@@ -107,28 +113,56 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
-static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
-					struct sk_buff *skb);
+static void veth_ptr_free(void *ptr)
+{
+	if (!ptr)
+		return;
+	dev_kfree_skb_any(ptr);
+}
 
-static int veth_xdp_rx(struct net_device *dev, struct sk_buff *skb)
+static void veth_xdp_flush(struct veth_priv *priv)
 {
-	skb = veth_xdp_rcv_skb(dev, skb);
-	if (!skb)
+	/* Write ptr_ring before reading rx_notify_masked */
+	smp_mb();
+	if (!priv->rx_notify_masked) {
+		priv->rx_notify_masked = true;
+		napi_schedule(&priv->xdp_napi);
+	}
+}
+
+static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
+{
+	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
+		return -ENOSPC;
+
+	return 0;
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+	if (unlikely(veth_xdp_enqueue(priv, skb))) {
+		dev_kfree_skb_any(skb);
 		return NET_RX_DROP;
+	}
 
-	return netif_rx(skb);
+	return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: veth_xdp_rx(dev, skb);
+	struct veth_priv *priv = netdev_priv(dev);
+
+	return __dev_forward_skb(dev, skb) ?: xdp ?
+		veth_xdp_rx(priv, skb) :
+		netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct veth_priv *priv = netdev_priv(dev);
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	int length = skb->len;
+	bool rcv_xdp = false;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -137,7 +171,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(veth_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	rcv_priv = netdev_priv(rcv);
+	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -148,7 +185,13 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 		atomic64_inc(&priv->dropped);
 	}
+
+	/* TODO: check xmit_more and tx_stopped */
+	if (rcv_xdp)
+		veth_xdp_flush(rcv_priv);
+
 	rcu_read_unlock();
+
 	return NETDEV_TX_OK;
 }
 
@@ -220,10 +263,9 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
-static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
-	struct veth_priv *priv = netdev_priv(dev);
 	u32 pktlen, headroom, act, metalen;
 	int size, mac_len, delta, off;
 	struct bpf_prog *xdp_prog;
@@ -293,7 +335,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
-		trace_xdp_exception(dev, xdp_prog, act);
+		trace_xdp_exception(priv->dev, xdp_prog, act);
 	case XDP_DROP:
 		goto drop;
 	}
@@ -306,7 +348,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	else if (off < 0)
 		__skb_pull(skb, -off);
 	skb->mac_header -= delta;
-	skb->protocol = eth_type_trans(skb, dev);
+	skb->protocol = eth_type_trans(skb, priv->dev);
 
 	metalen = xdp.data - xdp.data_meta;
 	if (metalen)
@@ -319,6 +361,72 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	return NULL;
 }
 
+static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+{
+	int i, done = 0;
+
+	for (i = 0; i < budget; i++) {
+		void *ptr = ptr_ring_consume(&priv->xdp_ring);
+		struct sk_buff *skb;
+
+		if (!ptr)
+			break;
+
+		skb = veth_xdp_rcv_skb(priv, ptr);
+
+		if (skb)
+			napi_gro_receive(&priv->xdp_napi, skb);
+
+		done++;
+	}
+
+	return done;
+}
+
+static int veth_poll(struct napi_struct *napi, int budget)
+{
+	struct veth_priv *priv =
+		container_of(napi, struct veth_priv, xdp_napi);
+	int done;
+
+	done = veth_xdp_rcv(priv, budget);
+
+	if (done < budget && napi_complete_done(napi, done)) {
+		/* Write rx_notify_masked before reading ptr_ring */
+		smp_store_mb(priv->rx_notify_masked, false);
+		if (unlikely(!ptr_ring_empty(&priv->xdp_ring))) {
+			priv->rx_notify_masked = true;
+			napi_schedule(&priv->xdp_napi);
+		}
+	}
+
+	return done;
+}
+
+static int veth_napi_add(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (err)
+		return err;
+
+	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+	napi_enable(&priv->xdp_napi);
+
+	return 0;
+}
+
+static void veth_napi_del(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	napi_disable(&priv->xdp_napi);
+	netif_napi_del(&priv->xdp_napi);
+	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -337,6 +445,12 @@ static int veth_open(struct net_device *dev)
 	if (err < 0)
 		goto err_reg_mem;
 
+	if (rtnl_dereference(priv->xdp_prog)) {
+		err = veth_napi_add(dev);
+		if (err)
+			goto err_reg_mem;
+	}
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
@@ -358,6 +472,9 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	if (rtnl_dereference(priv->xdp_prog))
+		veth_napi_del(dev);
+
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 
 	return 0;
@@ -384,15 +501,12 @@ static void veth_dev_free(struct net_device *dev)
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void veth_poll_controller(struct net_device *dev)
 {
-	/* veth only receives frames when its peer sends one
-	 * Since it's a synchronous operation, we are guaranteed
-	 * never to have pending data when we poll for it so
-	 * there is nothing to do here.
-	 *
-	 * We need this though so netpoll recognizes us as an interface that
-	 * supports polling, which enables bridge devices in virt setups to
-	 * still use netconsole
-	 */
+	struct veth_priv *priv = netdev_priv(dev);
+
+	rcu_read_lock();
+	if (rcu_access_pointer(priv->xdp_prog))
+		veth_xdp_flush(priv);
+	rcu_read_unlock();
 }
 #endif	/* CONFIG_NET_POLL_CONTROLLER */
 
@@ -456,26 +570,40 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 	struct net_device *peer;
+	int err;
 
 	old_prog = rtnl_dereference(priv->xdp_prog);
 	peer = rtnl_dereference(priv->peer);
 
-	if (!old_prog && prog && peer) {
-		peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
-		peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
-			peer->hard_header_len -
-			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-		if (peer->mtu > peer->max_mtu)
-			dev_set_mtu(peer, peer->max_mtu);
+	if (!old_prog && prog) {
+		if (dev->flags & IFF_UP) {
+			err = veth_napi_add(dev);
+			if (err)
+				return err;
+		}
+
+		if (peer) {
+			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+				peer->hard_header_len -
+				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+			if (peer->mtu > peer->max_mtu)
+				dev_set_mtu(peer, peer->max_mtu);
+		}
 	}
 
 	rcu_assign_pointer(priv->xdp_prog, prog);
 
 	if (old_prog) {
 		bpf_prog_put(old_prog);
-		if (!prog && peer) {
-			peer->hw_features |= NETIF_F_GSO_SOFTWARE;
-			peer->max_mtu = ETH_MAX_MTU;
+		if (!prog) {
+			if (dev->flags & IFF_UP)
+				veth_napi_del(dev);
+
+			if (peer) {
+				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+				peer->max_mtu = ETH_MAX_MTU;
+			}
 		}
 	}
 
@@ -688,10 +816,13 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
+	priv->dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
+	priv->dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
+
 	return 0;
 
 err_register_dev:
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is preparation for XDP TX and ndo_xdp_xmit.

The napi ring now accepts both skbs and xdp_frames. When an xdp_frame is
enqueued, no skb is allocated until the XDP program on veth returns
XDP_PASS. This will speed up XDP processing once ndo_xdp_xmit is
implemented and xdp_frames are enqueued by the peer device.
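
For readers following along, here is a minimal userspace sketch of the
pointer-tagging idea that lets one ring carry two object types. It is
not the kernel code: fake_skb, fake_frame and classify() are made up
for illustration, and the tagging helper corresponds to the
veth_xdp_to_ptr() that only appears in patch 6/9. The assumption is
that both object types are at least 2-byte aligned, so bit 0 of a
valid pointer is always clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XDP_FLAG 0x1UL	/* mirrors VETH_XDP_FLAG from the patch */

/* Hypothetical stand-ins for struct sk_buff and struct xdp_frame. */
struct fake_skb   { int len; };
struct fake_frame { int len; };

/* True if the ring entry carries the xdp_frame tag in bit 0. */
static bool is_xdp_frame(void *ptr)
{
	return (uintptr_t)ptr & XDP_FLAG;
}

/* Tag a frame pointer before it is put on the shared ring. */
static void *xdp_to_ptr(struct fake_frame *frame)
{
	/* Tagging only works if bit 0 of the real pointer is clear. */
	assert(((uintptr_t)frame & XDP_FLAG) == 0);
	return (void *)((uintptr_t)frame | XDP_FLAG);
}

/* Strip the tag to recover the original frame pointer. */
static struct fake_frame *ptr_to_xdp(void *ptr)
{
	return (struct fake_frame *)((uintptr_t)ptr & ~(uintptr_t)XDP_FLAG);
}

/* Returns 1 for an xdp_frame entry, 0 for an untagged (skb) entry. */
static int classify(void *ring_entry)
{
	return is_xdp_frame(ring_entry) ? 1 : 0;
}
```

An skb pointer goes onto the ring unchanged, so the consumer only has
to test bit 0 to pick the right receive path and the right free
function.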

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 452771f31c30..89c91c1c9935 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,7 @@
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_FLAG		0x1UL
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -48,6 +49,16 @@ struct veth_priv {
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
 /*
  * ethtool interface
  */
@@ -117,7 +128,14 @@ static void veth_ptr_free(void *ptr)
 {
 	if (!ptr)
 		return;
-	dev_kfree_skb_any(ptr);
+
+	if (veth_is_xdp_frame(ptr)) {
+		struct xdp_frame *frame = veth_ptr_to_xdp(ptr);
+
+		xdp_return_frame(frame);
+	} else {
+		dev_kfree_skb_any(ptr);
+	}
 }
 
 static void veth_xdp_flush(struct veth_priv *priv)
@@ -263,6 +281,60 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+					struct xdp_frame *frame)
+{
+	struct bpf_prog *xdp_prog;
+	unsigned int headroom;
+	struct sk_buff *skb;
+	int len, delta = 0;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (xdp_prog) {
+		struct xdp_buff xdp;
+		u32 act;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = frame->data + frame->len;
+		xdp.data_meta = frame->data - frame->metasize;
+		xdp.rxq = &priv->xdp_rxq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_PASS:
+			delta = frame->data - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+		case XDP_ABORTED:
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+	rcu_read_unlock();
+
+	headroom = frame->data - delta - (void *)frame;
+	len = frame->len + delta;
+	skb = veth_build_skb(frame, headroom, len, 0);
+	if (!skb) {
+		xdp_return_frame(frame);
+		goto err;
+	}
+
+	skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+	return skb;
+err_xdp:
+	rcu_read_unlock();
+	xdp_return_frame(frame);
+
+	return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
@@ -372,7 +444,10 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr))
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+		else
+			skb = veth_xdp_rcv_skb(priv, ptr);
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows a NIC's XDP program to redirect packets to veth. The
destination veth device enqueues redirected packets to the napi ring of
its peer, where they are processed by XDP on the peer veth device.
This can be thought of as one XDP program calling another via REDIRECT,
when the peer enables driver XDP.

Note that whether an XDP program is loaded on the redirect target veth
device does not affect how xdp_frames sent by ndo_xdp_xmit are handled,
since the ring sits on the rx (peer) side. What matters is whether an
XDP program is loaded on the peer veth.

When the peer veth device has driver XDP, ndo_xdp_xmit forwards
xdp_frames to its peer without modification.
If not, ndo_xdp_xmit converts xdp_frames to skbs on the sender side and
invokes netif_rx rather than dropping them. Although this will not
perform well, I think dropping redirected packets when XDP is not
loaded on the peer device is too restrictive, so I added this fallback.
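
The decision described above can be modelled in userspace roughly as
follows. This is only a sketch: struct fake_dev, enum xmit_path and
choose_path() are hypothetical names, and only ok_fwd_dev() follows the
xdp_ok_fwd_dev() helper from the diff below line for line.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define VLAN_HLEN 4	/* size of a VLAN tag, as in the kernel */

/* Hypothetical, trimmed-down model of the fields the check reads. */
struct fake_dev {
	bool up;			/* models dev->flags & IFF_UP */
	unsigned int mtu;
	unsigned int hard_header_len;
	bool has_xdp_prog;		/* models a non-NULL priv->xdp_prog */
};

/* Same logic as xdp_ok_fwd_dev() in the patch. */
static int ok_fwd_dev(const struct fake_dev *fwd, unsigned int pktlen)
{
	unsigned int len;

	if (!fwd->up)
		return -ENETDOWN;

	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
	if (pktlen > len)
		return -EMSGSIZE;

	return 0;
}

enum xmit_path { PATH_RING, PATH_SKB_FALLBACK };

/*
 * Pick the path from the peer's XDP state. As in the patch, only the
 * ring path runs the length check here.
 */
static int choose_path(const struct fake_dev *peer, unsigned int pktlen,
		       enum xmit_path *path)
{
	if (peer->has_xdp_prog) {
		int err = ok_fwd_dev(peer, pktlen);

		if (err)
			return err;
		*path = PATH_RING;
	} else {
		*path = PATH_SKB_FALLBACK;
	}

	return 0;
}
```

With mtu 1500 and a 14-byte header the limit is 1518 bytes, so a
1518-byte frame is accepted and a 1519-byte frame is rejected with
-EMSGSIZE.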

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c     | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/filter.h | 16 +++++++++++
 net/core/filter.c      | 11 +-------
 3 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 89c91c1c9935..b1d591be0eba 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -54,6 +54,11 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -138,7 +143,7 @@ static void veth_ptr_free(void *ptr)
 	}
 }
 
-static void veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_priv *priv)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
@@ -206,7 +211,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* TODO: check xmit_more and tx_stopped */
 	if (rcv_xdp)
-		veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rcv_priv);
 
 	rcu_read_unlock();
 
@@ -281,6 +286,66 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	int headroom = frame->data - (void *)frame;
+	struct net_device *rcv;
+	int err = 0;
+
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		return -ENXIO;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
+		err = xdp_ok_fwd_dev(rcv, frame->len);
+		if (unlikely(err))
+			return err;
+
+		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
+	} else {
+		struct sk_buff *skb;
+
+		skb = veth_build_skb(frame, headroom, frame->len, 0);
+		if (unlikely(!skb))
+			return -ENOMEM;
+
+		/* Get page ref in case skb is dropped in netif_rx.
+		 * The caller is responsible for freeing the page on error.
+		 */
+		get_page(virt_to_page(frame->data));
+		if (unlikely(veth_forward_skb(rcv, skb, false) != NET_RX_SUCCESS))
+			return -ENXIO;
+
+		/* Put page ref on success */
+		page_frag_free(frame->data);
+	}
+
+	return err;
+}
+
+static void veth_xdp_flush(struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto out;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+		goto out;
+
+	__veth_xdp_flush(rcv_priv);
+out:
+	rcu_read_unlock();
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 					struct xdp_frame *frame)
 {
@@ -580,7 +645,7 @@ static void veth_poll_controller(struct net_device *dev)
 
 	rcu_read_lock();
 	if (rcu_access_pointer(priv->xdp_prog))
-		veth_xdp_flush(priv);
+		__veth_xdp_flush(priv);
 	rcu_read_unlock();
 }
 #endif	/* CONFIG_NET_POLL_CONTROLLER */
@@ -730,6 +795,8 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
+	.ndo_xdp_xmit		= veth_xdp_xmit,
+	.ndo_xdp_flush		= veth_xdp_flush,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b2308174..7d043f51d1d7 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -19,6 +19,7 @@
 #include <linux/cryptohash.h>
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
+#include <linux/if_vlan.h>
 
 #include <net/sch_generic.h>
 
@@ -752,6 +753,21 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+static __always_inline int
+xdp_ok_fwd_dev(const struct net_device *fwd, unsigned int pktlen)
+{
+	unsigned int len;
+
+	if (unlikely(!(fwd->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
+	if (pktlen > len)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..25ae8ffaa968 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2923,16 +2923,7 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
 static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 {
-	unsigned int len;
-
-	if (unlikely(!(fwd->flags & IFF_UP)))
-		return -ENETDOWN;
-
-	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
-	if (skb->len > len)
-		return -EMSGSIZE;
-
-	return 0;
+	return xdp_ok_fwd_dev(fwd, skb->len);
 }
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP program, which redirects packets from the NIC to the
other veth, reuses the xdp_mem info from the NIC so that the NIC's page
recycling keeps working on the destination veth's XDP.
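
A small userspace model of the save/overwrite/restore pattern for the
mem info may help when reading the diff. All names here (mem_info,
model_open() and so on) are invented for the sketch; only the
assignments mirror what the patch does with priv->xdp_mem and
priv->xdp_rxq.mem.

```c
#include <assert.h>

/* Hypothetical stand-ins for xdp_mem_info and xdp_rxq_info. */
struct mem_info { int type; int id; };
struct rxq      { struct mem_info mem; };

struct priv {
	struct rxq	xdp_rxq;
	struct mem_info	xdp_mem;	/* saved original, as in the patch */
};

/* open: remember the device's own mem info before it gets overwritten. */
static void model_open(struct priv *p)
{
	p->xdp_mem = p->xdp_rxq.mem;
}

/* TX/REDIRECT of an xdp_frame: hand on the originating NIC's mem info. */
static void model_fwd_frame(struct priv *p, struct mem_info frame_mem)
{
	p->xdp_rxq.mem = frame_mem;
}

/* TX/REDIRECT of an skb-backed buffer: use the device's own mem info. */
static void model_fwd_skb(struct priv *p)
{
	p->xdp_rxq.mem = p->xdp_mem;
}

/* close: put the original back before the rxq is unregistered. */
static void model_close(struct priv *p)
{
	p->xdp_rxq.mem = p->xdp_mem;
}
```

The point of the saved copy is that xdp_rxq.mem is a scratch value
while frames are in flight, so the original has to be kept elsewhere.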

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 85 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b1d591be0eba..98fc91a64e29 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@ struct veth_priv {
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
@@ -346,9 +347,21 @@ static void veth_xdp_flush(struct net_device *dev)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return veth_xdp_xmit(dev, frame);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-					struct xdp_frame *frame)
+					struct xdp_frame *frame, bool *xdp_xmit,
+					bool *xdp_redir)
 {
+	struct xdp_frame orig_frame;
 	struct bpf_prog *xdp_prog;
 	unsigned int headroom;
 	struct sk_buff *skb;
@@ -372,6 +385,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		case XDP_PASS:
 			delta = frame->data - xdp.data;
 			break;
+		case XDP_TX:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(priv->dev, &xdp))) {
+				trace_xdp_exception(priv->dev, xdp_prog, act);
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit = true;
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_REDIRECT:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_redir = true;
+			rcu_read_unlock();
+			goto xdp_xmit;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
@@ -396,12 +432,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
 	rcu_read_unlock();
 	xdp_return_frame(frame);
-
+xdp_xmit:
 	return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb)
+					struct sk_buff *skb, bool *xdp_xmit,
+					bool *xdp_redir)
 {
 	u32 pktlen, headroom, act, metalen;
 	int size, mac_len, delta, off;
@@ -469,6 +506,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	switch (act) {
 	case XDP_PASS:
 		break;
+	case XDP_TX:
+		get_page(virt_to_page(xdp.data));
+		dev_consume_skb_any(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (unlikely(veth_xdp_tx(priv->dev, &xdp))) {
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit = true;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_REDIRECT:
+		get_page(virt_to_page(xdp.data));
+		dev_consume_skb_any(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+			goto err_xdp;
+		*xdp_redir = true;
+		rcu_read_unlock();
+		goto xdp_xmit;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -496,9 +553,15 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	rcu_read_unlock();
 	dev_kfree_skb_any(skb);
 	return NULL;
+err_xdp:
+	rcu_read_unlock();
+	page_frag_free(xdp.data);
+xdp_xmit:
+	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+static int veth_xdp_rcv(struct veth_priv *priv, int budget, bool *xdp_xmit,
+			bool *xdp_redir)
 {
 	int i, done = 0;
 
@@ -509,10 +572,12 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		if (veth_is_xdp_frame(ptr))
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
-		else
-			skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr)) {
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+					       xdp_xmit, xdp_redir);
+		} else {
+			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit, xdp_redir);
+		}
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -527,9 +592,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_priv *priv =
 		container_of(napi, struct veth_priv, xdp_napi);
+	bool xdp_xmit = false;
+	bool xdp_redir = false;
 	int done;
 
-	done = veth_xdp_rcv(priv, budget);
+	done = veth_xdp_rcv(priv, budget, &xdp_xmit, &xdp_redir);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
@@ -540,6 +607,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	if (xdp_xmit)
+		veth_xdp_flush(priv->dev);
+	if (xdp_redir)
+		xdp_do_flush_map();
+
 	return done;
 }
 
@@ -585,6 +657,9 @@ static int veth_open(struct net_device *dev)
 	if (err < 0)
 		goto err_reg_mem;
 
+	/* Save original mem info as it can be overwritten */
+	priv->xdp_mem = priv->xdp_rxq.mem;
+
 	if (rtnl_dereference(priv->xdp_prog)) {
 		err = veth_napi_add(dev);
 		if (err)
@@ -615,6 +690,7 @@ static int veth_close(struct net_device *dev)
 	if (rtnl_dereference(priv->xdp_prog))
 		veth_napi_del(dev);
 
+	priv->xdp_rxq.mem = priv->xdp_mem;
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 
 	return 0;
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Use per-cpu temporary storage to avoid taking the ring spinlock for
every packet on dequeue.
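
The loop structure of the batched dequeue can be tried out in userspace
with the sketch below. The toy_ring type and consume_batched() are
simplified stand-ins for ptr_ring and ptr_ring_consume_batched(), a
stack array stands in for the per-cpu storage, and the batches counter
exists only so the behaviour can be observed.

```c
#include <assert.h>

#define QUEUE_SIZE 64	/* plays the role of VETH_XDP_QUEUE_SIZE */

/* Hypothetical single-threaded ring without wrap-around. */
struct toy_ring {
	void **items;
	int head;	/* next entry to consume */
	int tail;	/* one past the last produced entry */
};

/* Pull up to n entries in one go; models one lock round trip. */
static int consume_batched(struct toy_ring *r, void **out, int n)
{
	int got = 0;

	while (got < n && r->head < r->tail)
		out[got++] = r->items[r->head++];

	return got;
}

/* Same loop shape as veth_xdp_rcv() after the patch. */
static int rcv(struct toy_ring *r, int budget, int *batches)
{
	void *q[QUEUE_SIZE];
	int num, lim, done = 0;

	*batches = 0;
	do {
		lim = budget - done < QUEUE_SIZE ? budget - done : QUEUE_SIZE;
		num = consume_batched(r, q, lim);
		/* per-entry XDP processing of q[0..num-1] would go here */
		done += num;
		(*batches)++;
	} while (num == lim && done < budget);

	return done;
}
```

The loop stops either when a batch comes back short, meaning the ring
was drained, or when the budget is used up.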

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 98fc91a64e29..1592119e3873 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -30,6 +30,7 @@
 #define VETH_XDP_FLAG		0x1UL
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+#define VETH_XDP_QUEUE_SIZE	NAPI_POLL_WEIGHT
 
 struct pcpu_vstats {
 	u64			packets;
@@ -50,6 +51,8 @@ struct veth_priv {
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+static DEFINE_PER_CPU(void *[VETH_XDP_QUEUE_SIZE], xdp_consume_q);
+
 static bool veth_is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & VETH_XDP_FLAG;
@@ -563,27 +566,32 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 static int veth_xdp_rcv(struct veth_priv *priv, int budget, bool *xdp_xmit,
 			bool *xdp_redir)
 {
-	int i, done = 0;
-
-	for (i = 0; i < budget; i++) {
-		void *ptr = ptr_ring_consume(&priv->xdp_ring);
-		struct sk_buff *skb;
-
-		if (!ptr)
-			break;
+	void **q = this_cpu_ptr(xdp_consume_q);
+	int num, lim, done = 0;
+
+	do {
+		int i;
+
+		lim = min(budget - done, VETH_XDP_QUEUE_SIZE);
+		num = ptr_ring_consume_batched(&priv->xdp_ring, q, lim);
+		for (i = 0; i < num; i++) {
+			struct sk_buff *skb;
+			void *ptr = q[i];
+
+			if (veth_is_xdp_frame(ptr)) {
+				skb = veth_xdp_rcv_one(priv,
+						       veth_ptr_to_xdp(ptr),
+						       xdp_xmit, xdp_redir);
+			} else {
+				skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit,
+						       xdp_redir);
+			}
 
-		if (veth_is_xdp_frame(ptr)) {
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
-					       xdp_xmit, xdp_redir);
-		} else {
-			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit, xdp_redir);
+			if (skb)
+				napi_gro_receive(&priv->xdp_napi, skb);
 		}
-
-		if (skb)
-			napi_gro_receive(&priv->xdp_napi, skb);
-
-		done++;
-	}
+		done += num;
+	} while (unlikely(num == lim && done < budget));
 
 	return done;
 }
-- 
2.14.3

^ permalink raw reply related

* [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing
From: Toshiaki Makita @ 2018-04-24 14:39 UTC (permalink / raw)
  To: netdev; +Cc: Toshiaki Makita
In-Reply-To: <20180424143923.26519-1-toshiaki.makita1@gmail.com>

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Use per-cpu temporary storage to avoid taking the ring spinlock for
every packet. Unlike the dequeue side, multiple veth devices can be
redirect targets within one napi loop, so the per-cpu storage is
allocated in the veth private structure.
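
As a rough userspace illustration of the enqueue side: entries are
staged in a local queue and moved to the ring in one pass under a
single lock acquisition. The types and the lock_acquisitions/dropped
counters are invented for this sketch; in the patch the staging queue
is per-cpu, the lock is the ring's producer_lock, and entries that do
not fit are freed with veth_ptr_free().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define QUEUE_SIZE 64	/* plays the role of VETH_XDP_QUEUE_SIZE */
#define RING_SIZE  256	/* plays the role of VETH_RING_SIZE */

/* Models struct xdp_queue from the patch. */
struct staging_queue {
	void *q[QUEUE_SIZE];
	unsigned int len;
};

/* Hypothetical ring that just counts what happens to it. */
struct toy_ring {
	void *slots[RING_SIZE];
	unsigned int used;
	unsigned int lock_acquisitions;
	unsigned int dropped;
};

/* Stage one entry without touching the ring or its lock. */
static int enqueue(struct staging_queue *sq, void *ptr)
{
	if (sq->len >= QUEUE_SIZE)
		return -ENOSPC;

	sq->q[sq->len++] = ptr;

	return 0;
}

/* Move all staged entries to the ring; returns false if none were staged. */
static bool flush_queue(struct staging_queue *sq, struct toy_ring *ring)
{
	unsigned int i;

	if (!sq->len)
		return false;

	ring->lock_acquisitions++;	/* stands in for spin_lock() */
	for (i = 0; i < sq->len; i++) {
		if (ring->used < RING_SIZE)
			ring->slots[ring->used++] = sq->q[i];
		else
			ring->dropped++;	/* the patch frees the entry here */
	}
	sq->len = 0;

	return true;
}
```

Staging 64 entries and flushing once takes the lock a single time
instead of 64 times, which is the saving this patch is after.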

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 1592119e3873..5978d76f2c00 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -38,12 +38,18 @@ struct pcpu_vstats {
 	struct u64_stats_sync	syncp;
 };
 
+struct xdp_queue {
+	void *q[VETH_XDP_QUEUE_SIZE];
+	unsigned int len;
+};
+
 struct veth_priv {
 	struct napi_struct	xdp_napi;
 	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_queue __percpu *xdp_produce_q;
 	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
@@ -147,8 +153,48 @@ static void veth_ptr_free(void *ptr)
 	}
 }
 
+static void veth_xdp_cleanup_queues(struct veth_priv *priv)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct xdp_queue *q = per_cpu_ptr(priv->xdp_produce_q, cpu);
+		int i;
+
+		for (i = 0; i < q->len; i++)
+			veth_ptr_free(q->q[i]);
+
+		q->len = 0;
+	}
+}
+
+static bool veth_xdp_flush_queue(struct veth_priv *priv)
+{
+	struct xdp_queue *q = this_cpu_ptr(priv->xdp_produce_q);
+	int i;
+
+	if (unlikely(!q->len))
+		return false;
+
+	spin_lock(&priv->xdp_ring.producer_lock);
+	for (i = 0; i < q->len; i++) {
+		void *ptr = q->q[i];
+
+		if (unlikely(__ptr_ring_produce(&priv->xdp_ring, ptr)))
+			veth_ptr_free(ptr);
+	}
+	spin_unlock(&priv->xdp_ring.producer_lock);
+
+	q->len = 0;
+
+	return true;
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
+	if (unlikely(!veth_xdp_flush_queue(priv)))
+		return;
+
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
 	if (!priv->rx_notify_masked) {
@@ -159,9 +205,13 @@ static void __veth_xdp_flush(struct veth_priv *priv)
 
 static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
 {
-	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
+	struct xdp_queue *q = this_cpu_ptr(priv->xdp_produce_q);
+
+	if (unlikely(q->len >= VETH_XDP_QUEUE_SIZE))
 		return -ENOSPC;
 
+	q->q[q->len++] = ptr;
+
 	return 0;
 }
 
@@ -644,6 +694,7 @@ static void veth_napi_del(struct net_device *dev)
 
 	napi_disable(&priv->xdp_napi);
 	netif_napi_del(&priv->xdp_napi);
+	veth_xdp_cleanup_queues(priv);
 	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
@@ -711,15 +762,28 @@ static int is_valid_veth_mtu(int mtu)
 
 static int veth_dev_init(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
 	if (!dev->vstats)
 		return -ENOMEM;
+
+	priv->xdp_produce_q = __alloc_percpu(sizeof(*priv->xdp_produce_q),
+					     sizeof (void *));
+	if (!priv->xdp_produce_q) {
+		free_percpu(dev->vstats);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	free_percpu(dev->vstats);
+	free_percpu(priv->xdp_produce_q);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-- 
2.14.3

^ permalink raw reply related

* Re: [PATCH] net: aquantia: fix aq_ndev_start_xmit()'s return type
From: David Miller @ 2018-04-24 14:42 UTC (permalink / raw)
  To: luc.vanoostenryck; +Cc: linux-kernel, igor.russkikh, netdev
In-Reply-To: <20180424131623.3505-1-luc.vanoostenryck@gmail.com>


Luc, please don't submit such a huge number of patches all at one time.

Also, please fix the indentation of the functions whose arguments
span multiple lines as has been pointed out to you in patch feedback.

Finally, make this a true patch series.  It is so much easier for
maintainers to work with a set of changes all doing the same thing if
you make them a proper patch series with an appropriate "[PATCH 0/N] ..."
header posting.

Thank you.

^ permalink raw reply

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
From: Stephen Hemminger @ 2018-04-24 14:44 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake
In-Reply-To: <20180424123046.21247-1-toke@toke.dk>

On Tue, 24 Apr 2018 14:30:46 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> diff --git a/tc/q_cake.c b/tc/q_cake.c
> new file mode 100644
> index 00000000..12263361
> --- /dev/null
> +++ b/tc/q_cake.c
> @@ -0,0 +1,778 @@
> +/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
> +/*
> + * Common Applications Kept Enhanced  --  CAKE
> + *
> + *  Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
> + *  Copyright (C) 2017-2018 Toke Høiland-Jørgensen <toke@toke.dk>
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions, and the following disclaimer,
> + *    without modification.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. The names of the authors may not be used to endorse or promote products
> + *    derived from this software without specific prior written permission.
> + *
> + * Alternatively, provided that this notice is retained in full, this
> + * software may be distributed under the terms of the GNU General
> + * Public License ("GPL") version 2, in which case the provisions of the
> + * GPL apply INSTEAD OF those given above.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> + * DAMAGE.
> + *
> + */

If you have an SPDX tag, you no longer need all the GPL boilerplate.

^ permalink raw reply

* Re: [Cake] [PATCH iproute2-next v3] Add support for cake qdisc
From: Stephen Hemminger @ 2018-04-24 14:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake
In-Reply-To: <20180424123046.21247-1-toke@toke.dk>

On Tue, 24 Apr 2018 14:30:46 +0200
Toke Høiland-Jørgensen <toke@toke.dk> wrote:

> +static void cake_print_json_tin(struct tc_cake_tin_stats *tst, uint version)
> +{
> +	open_json_object(NULL);
> +	print_uint(PRINT_JSON, "threshold_rate", NULL, tst->threshold_rate);
> +	print_uint(PRINT_JSON, "target", NULL, tst->target_us);
> +	print_uint(PRINT_JSON, "interval", NULL, tst->interval_us);
> +	print_uint(PRINT_JSON, "peak_delay", NULL, tst->peak_delay_us);
> +	print_uint(PRINT_JSON, "average_delay", NULL, tst->avge_delay_us);
> +	print_uint(PRINT_JSON, "base_delay", NULL, tst->base_delay_us);
> +	print_uint(PRINT_JSON, "sent_packets", NULL, tst->sent.packets);
> +	print_uint(PRINT_JSON, "sent_bytes", NULL, tst->sent.bytes);
> +	print_uint(PRINT_JSON, "way_indirect_hits", NULL, tst->way_indirect_hits);
> +	print_uint(PRINT_JSON, "way_misses", NULL, tst->way_misses);
> +	print_uint(PRINT_JSON, "way_collisions", NULL, tst->way_collisions);
> +	print_uint(PRINT_JSON, "drops", NULL, tst->dropped.packets);
> +	print_uint(PRINT_JSON, "ecn_mark", NULL, tst->ecn_marked.packets);
> +	print_uint(PRINT_JSON, "ack_drops", NULL, tst->ack_drops.packets);
> +	print_uint(PRINT_JSON, "sparse_flows", NULL, tst->sparse_flows);
> +	print_uint(PRINT_JSON, "bulk_flows", NULL, tst->bulk_flows);
> +	print_uint(PRINT_JSON, "unresponsive_flows", NULL, tst->unresponse_flows);
> +	print_uint(PRINT_JSON, "max_pkt_len", NULL, tst->max_skblen);
> +	if (version >= 0x102)
> +		print_uint(PRINT_JSON, "flow_quantum", NULL, tst->flow_quantum);

Please don't version objects in netlink. That is not how netlink is
supposed to be used.

^ permalink raw reply

* Re: [PATCH] net: sh-eth: fix sh_eth_start_xmit()'s return type
From: Geert Uytterhoeven @ 2018-04-24 14:49 UTC (permalink / raw)
  To: Luc Van Oostenryck
  Cc: Linux Kernel Mailing List, Sergei Shtylyov, David S. Miller,
	Geert Uytterhoeven, Thomas Petazzoni, Laurent Pinchart,
	Simon Horman, Niklas Söderlund, netdev, Linux-Renesas
In-Reply-To: <20180424131720.4357-1-luc.vanoostenryck@gmail.com>

On Tue, Apr 24, 2018 at 3:17 PM, Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
> The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
>
> Fix this by returning 'netdev_tx_t' in this driver too.
>
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>

Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH net] sfc: ARFS filter IDs
From: kbuild test robot @ 2018-04-24 14:49 UTC (permalink / raw)
  To: Edward Cree; +Cc: kbuild-all, linux-net-drivers, David Miller, netdev
In-Reply-To: <2ee1ef47-886d-278d-4a8d-234d74e26ad7@solarflare.com>

[-- Attachment #1: Type: text/plain, Size: 1098 bytes --]

Hi Edward,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:    https://github.com/0day-ci/linux/commits/Edward-Cree/sfc-ARFS-filter-IDs/20180424-080737
config: i386-randconfig-x0-04242110 (attached as .config)
compiler: gcc-5 (Debian 5.5.0-3) 5.4.1 20171010
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from drivers/net/ethernet/sfc/efx.c:28:0:
>> drivers/net/ethernet/sfc/efx.h:194:4: warning: 'struct efx_arfs_rule' declared inside parameter list
       bool *force);
       ^
>> drivers/net/ethernet/sfc/efx.h:194:4: warning: its scope is only this definition or declaration, which is probably not what you want

vim +194 drivers/net/ethernet/sfc/efx.h

   192	
   193	bool efx_rps_check_rule(struct efx_arfs_rule *rule, unsigned int filter_idx,
 > 194				bool *force);
   195	
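
For readers unfamiliar with this pair of warnings: gcc emits them when a prototype names a struct tag that has not been declared at file scope by that point, so the tag is scoped to the prototype alone. A common remedy is a forward declaration ahead of the prototype. The sketch below uses hypothetical demo_* names and is not the actual sfc fix, which may instead involve moving the prototype or adjusting an #ifdef.

```c
#include <stdbool.h>

/* Forward declaration: makes the tag visible at file scope before any
 * prototype mentions it.
 */
struct demo_rule;

bool demo_check_rule(struct demo_rule *rule, unsigned int filter_idx,
		     bool *force);

/* The full definition can follow later, or live in another header. */
struct demo_rule {
	unsigned int filter_id;
};

bool demo_check_rule(struct demo_rule *rule, unsigned int filter_idx,
		     bool *force)
{
	*force = false;
	return rule->filter_id == filter_idx;
}
```

Without the forward declaration, the struct named in the prototype would be a different, prototype-local type, and callers passing a pointer to the file-scope struct would trigger incompatible-pointer diagnostics.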

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29751 bytes --]

^ permalink raw reply

* [PATCH bpf-next,v3 0/2] bpf: add helper for getting xfrm states
From: Eyal Birger @ 2018-04-24 14:50 UTC (permalink / raw)
  To: netdev; +Cc: shmulik, ast, daniel, Eyal Birger

This patchset adds support for fetching XFRM state information from
an eBPF program called from TC.

The first patch introduces a helper for fetching an XFRM state from the
skb's secpath. The XFRM state is modeled using a new virtual struct which
contains the SPI, peer address, and reqid values of the state; this struct
can be extended in the future to provide additional state information.

The second patch adds a test example in test_tunnel_bpf.sh. The sample
validates the correct extraction of state information by the eBPF program.

---
v3:
  - Kept SPI and peer IPv4 address in state in network byte order
    following suggestion from Alexei Starovoitov
v2:
  - Fixed two comments by Daniel Borkmann:
    - disallow reserved flags in helper call
    - avoid compiling in helper code when CONFIG_XFRM is off

Eyal Birger (2):
  bpf: add helper for getting xfrm states
  samples/bpf: extend test_tunnel_bpf.sh with xfrm state test

 include/uapi/linux/bpf.h                  | 25 ++++++++++-
 net/core/filter.c                         | 48 +++++++++++++++++++++
 samples/bpf/tcbpf2_kern.c                 | 16 +++++++
 samples/bpf/test_tunnel_bpf.sh            | 71 +++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            | 25 ++++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 6 files changed, 186 insertions(+), 3 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH bpf-next,v3 1/2] bpf: add helper for getting xfrm states
From: Eyal Birger @ 2018-04-24 14:50 UTC (permalink / raw)
  To: netdev; +Cc: shmulik, ast, daniel, Eyal Birger
In-Reply-To: <1524581430-11921-1-git-send-email-eyal.birger@gmail.com>

This commit introduces a helper which allows fetching xfrm state
parameters by eBPF programs attached to TC.

Prototype:
bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)

skb: pointer to skb
index: the index in the skb xfrm_state secpath array
xfrm_state: pointer to 'struct bpf_xfrm_state'
size: size of 'struct bpf_xfrm_state'
flags: reserved for future extensions

The helper returns 0 on success, and a non-zero value if no xfrm state is
found at the index or if none exists at all.

struct bpf_xfrm_state currently includes the SPI, peer IPv4/IPv6
address and the reqid; it can be further extended by adding elements to
its end - indicating the populated fields by the 'size' argument -
keeping backwards compatibility.

Typical usage:

struct bpf_xfrm_state x = {};
bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0);
...
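
Since the SPI and the IPv4 peer address are kept in network byte order, a caller comparing them against host-order constants has to convert one side. The following user-space sketch illustrates that with a local mirror of the struct layout from this patch; the demo_* names are hypothetical, and a real eBPF program would use the uapi definition and the BPF byte-order helpers rather than <arpa/inet.h>.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/* Local mirror of the proposed struct bpf_xfrm_state, for illustration. */
struct demo_xfrm_state {
	uint32_t reqid;
	uint32_t spi;			/* network byte order */
	uint16_t family;
	union {
		uint32_t remote_ipv4;		/* network byte order */
		uint32_t remote_ipv6[4];	/* network byte order */
	};
};

/* Convert the stored SPI to host order before comparing. */
static bool demo_spi_matches(const struct demo_xfrm_state *x,
			     uint32_t host_spi)
{
	return ntohl(x->spi) == host_spi;
}

/* Equivalent approach: convert the host-order constant instead. */
static bool demo_peer_is(const struct demo_xfrm_state *x,
			 uint32_t host_ipv4)
{
	return x->family == AF_INET && x->remote_ipv4 == htonl(host_ipv4);
}
```

Converting the constant rather than the field is usually the cheaper choice when the constant is known at compile time.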

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
---
 include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++-
 net/core/filter.c        | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8383a2..e667939 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -774,6 +774,15 @@ union bpf_attr {
  *     @xdp_md: pointer to xdp_md
  *     @delta: A negative integer to be added to xdp_md.data_end
  *     Return: 0 on success or negative on error
+ *
+ * int bpf_skb_get_xfrm_state(skb, index, xfrm_state, size, flags)
+ *     retrieve XFRM state
+ *     @skb: pointer to skb
+ *     @index: index of the xfrm state in the secpath
+ *     @xfrm_state: pointer to 'struct bpf_xfrm_state'
+ *     @size: size of 'struct bpf_xfrm_state'
+ *     @flags: room for future extensions
+ *     Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -841,7 +850,8 @@ union bpf_attr {
 	FN(msg_cork_bytes),		\
 	FN(msg_pull_data),		\
 	FN(bind),			\
-	FN(xdp_adjust_tail),
+	FN(xdp_adjust_tail),		\
+	FN(skb_get_xfrm_state),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -947,6 +957,19 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* user accessible mirror of in-kernel xfrm_state.
+ * new fields can only be added to the end of this structure
+ */
+struct bpf_xfrm_state {
+	__u32 reqid;
+	__u32 spi;	/* Stored in network byte order */
+	__u16 family;
+	union {
+		__u32 remote_ipv4;	/* Stored in network byte order */
+		__u32 remote_ipv6[4];	/* Stored in network byte order */
+	};
+};
+
 /* Generic BPF return codes which all BPF program types may support.
  * The values are binary compatible with their TC_ACT_* counter-part to
  * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
diff --git a/net/core/filter.c b/net/core/filter.c
index e25bc4a..8e45c6c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -57,6 +57,7 @@
 #include <net/sock_reuseport.h>
 #include <net/busy_poll.h>
 #include <net/tcp.h>
+#include <net/xfrm.h>
 #include <linux/bpf_trace.h>
 
 /**
@@ -3743,6 +3744,49 @@ static const struct bpf_func_proto bpf_bind_proto = {
 	.arg3_type	= ARG_CONST_SIZE,
 };
 
+#ifdef CONFIG_XFRM
+BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index,
+	   struct bpf_xfrm_state *, to, u32, size, u64, flags)
+{
+	const struct sec_path *sp = skb_sec_path(skb);
+	const struct xfrm_state *x;
+
+	if (!sp || unlikely(index >= sp->len || flags))
+		goto err_clear;
+
+	x = sp->xvec[index];
+
+	if (unlikely(size != sizeof(struct bpf_xfrm_state)))
+		goto err_clear;
+
+	to->reqid = x->props.reqid;
+	to->spi = x->id.spi;
+	to->family = x->props.family;
+	if (to->family == AF_INET6) {
+		memcpy(to->remote_ipv6, x->props.saddr.a6,
+		       sizeof(to->remote_ipv6));
+	} else {
+		to->remote_ipv4 = x->props.saddr.a4;
+	}
+
+	return 0;
+err_clear:
+	memset(to, 0, size);
+	return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
+	.func		= bpf_skb_get_xfrm_state,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+	.arg5_type	= ARG_ANYTHING,
+};
+#endif
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3884,6 +3928,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_cookie_proto;
 	case BPF_FUNC_get_socket_uid:
 		return &bpf_get_socket_uid_proto;
+#ifdef CONFIG_XFRM
+	case BPF_FUNC_skb_get_xfrm_state:
+		return &bpf_skb_get_xfrm_state_proto;
+#endif
 	default:
 		return bpf_base_func_proto(func_id);
 	}
-- 
2.7.4

^ permalink raw reply related

