Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net v2 1/1] driver: ipvlan: Unlink the upper dev when ipvlan_link_new failed
From: David Miller @ 2016-12-08 19:30 UTC (permalink / raw)
  To: fgao; +Cc: maheshb, edumazet, netdev, gfree.wind
In-Reply-To: <1481167018-559-1-git-send-email-fgao@ikuai8.com>

From: fgao@ikuai8.com
Date: Thu,  8 Dec 2016 11:16:58 +0800

> From: Gao Feng <fgao@ikuai8.com>
> 
> When netdev_upper_dev_unlink failed in ipvlan_link_new, need to
> unlink the ipvlan dev with upper dev.
> 
> Signed-off-by: Gao Feng <fgao@ikuai8.com>

Applied.

^ permalink raw reply

* Re: [PATCH] cxgb4/cxgb4vf: Remove deprecated module parameters
From: David Miller @ 2016-12-08 19:31 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, hariprasad
In-Reply-To: <1481183185-14636-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Thu,  8 Dec 2016 13:16:25 +0530

> Remove deprecated module parameters num_vf, dflt_msg_enable and
> force_init.
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
From: Eric Dumazet @ 2016-12-08 19:32 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert
In-Reply-To: <1481223008.8408.6.camel@redhat.com>

On Thu, 2016-12-08 at 19:50 +0100, Paolo Abeni wrote:

> UDP GRO will require connected socket - very likely no DNS server. The
> use-case is an application using long lived UDP sockets doing a lot of
> traffic, like fix protocol feeds over UDP. 

You mean, the kind of traffic that should use TCP in the first place,
right ? ;)

NFS switched to TCP a long time ago.

^ permalink raw reply

* Re: pull request: bluetooth-next 2016-12-08
From: David Miller @ 2016-12-08 19:33 UTC (permalink / raw)
  To: johan.hedberg; +Cc: linux-bluetooth, netdev
In-Reply-To: <20161208075222.GA5772@x1c.lan>

From: Johan Hedberg <johan.hedberg@gmail.com>
Date: Thu, 8 Dec 2016 09:52:22 +0200

>   git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream

Pulled.

^ permalink raw reply

* Re: [net-next PATCH v5 0/6] XDP for virtio_net
From: Alexei Starovoitov @ 2016-12-08 19:38 UTC (permalink / raw)
  To: David Miller
  Cc: john.fastabend, daniel, mst, shm, tgraf, john.r.fastabend, netdev,
	brouer
In-Reply-To: <20161208.141702.1346950420275854265.davem@davemloft.net>

On Thu, Dec 08, 2016 at 02:17:02PM -0500, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Wed, 07 Dec 2016 12:10:47 -0800
> 
> > This implements virtio_net for the mergeable buffers and big_packet
> > modes. I tested this with vhost_net running on qemu and did not see
> > any issues. For testing num_buf > 1 I added a hack to vhost driver
> > to only but 100 bytes per buffer.
>  ...
> 
> So where are we with this?
> 
> I'm not too thrilled with the idea of making XDP_TX optional or
> something like that.  If someone enables XDP, there is a tradeoff.
> 
> I also have reservations about the idea to make jumbo frames work
> without giving XDP access to the whole packet.  If it wants to push or
> pop a header, it might need to know the whole packet length.  How will
> you pass that to the XDP program?
> 
> Some kinds of encapsulation require trailers, thus preclusing access
> to the entire packet precludes those kinds of transformations.

+1

> This is why we want simple, linear, buffer access for XDP.
> 
> Even the most seemingly minor exception turns into a huge complicated
> mess.

+1

and from the other thread:
> > Can't we disable XDP_TX somehow? Many people might only want RX drop,
> > and extra queues are not always there.
> >
> 
> Alexei, Daniel, any thoughts on this?

I don't like it.

> I know we were trying to claim some base level of feature support for
> all XDP drivers. I am sympathetic to this argument though for DDOS we
> do not need XDP_TX support. And virtio can become queue constrained
> in some cases.

especially for ddos case doing lro/gro is not helpful.
I frankly don't see a use case where you'd want to steer a packet
all the way into VM just to drop them there?
Without XDP_TX it's too crippled. adjust_head() won't be possible,
packet mangling would have to be disabled and so on.
If xdp program doesn't see raw packet it can only parse the headers of
this jumbo meta-packet and drop it, but for virtio it's really too late.
ddos protection needs to be done at the earliest hw nic receive.
I think if driver claims xdp support it needs to support
drop/pass/tx and adjust_head. For metadata passing up into stack from xdp
we need adjust_head, for encap/decap we need it too. And lro is in the way
of such transformations.
We struggled a lot with cls_bpf due to all metadata inside skb that needs
to be kept correct. Feeding non-raw packets into xdp is a rat hole.

^ permalink raw reply

* [PATCH net 0/2] net: ethernet: Make sure we set dev->dev.parent
From: Florian Fainelli @ 2016-12-08 19:41 UTC (permalink / raw)
  To: netdev; +Cc: davem, madalin.bucur, johan, Florian Fainelli

Hi David,

This patch series builds atop:

ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461 ("phy: Don't increment MDIO bus
refcount unless it's a different owner")

FMAN is the one that potentially needs patching as well (call SET_NETDEV_DEV),
but there appears to be no way that init_phy is called right now, or there is
not such an in-tree user. Madalin, can you comment on that?

Florian Fainelli (2):
  net: ethernet: lantiq_etop: Call SET_NETDEV_DEV()
  net: ethernet: cpmac: Call SET_NETDEV_DEV()

 drivers/net/ethernet/lantiq_etop.c | 1 +
 drivers/net/ethernet/ti/cpmac.c    | 1 +
 2 files changed, 2 insertions(+)

-- 
2.9.3

^ permalink raw reply

* [PATCH net 1/2] net: ethernet: lantiq_etop: Call SET_NETDEV_DEV()
From: Florian Fainelli @ 2016-12-08 19:41 UTC (permalink / raw)
  To: netdev; +Cc: davem, madalin.bucur, johan, Florian Fainelli
In-Reply-To: <20161208194125.13264-1-f.fainelli@gmail.com>

The Lantiq Etop driver calls into PHYLIB which now checks for
net_device->dev.parent, so make sure we do set it before calling into
any MDIO/PHYLIB related function.

Fixes: ec988ad78ed6 ("phy: Don't increment MDIO bus refcount unless it's a different owner")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/lantiq_etop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index 91e09d68b7e2..a167fd7ee13e 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -704,6 +704,7 @@ ltq_etop_probe(struct platform_device *pdev)
 	priv->pldata = dev_get_platdata(&pdev->dev);
 	priv->netdev = dev;
 	spin_lock_init(&priv->lock);
+	SET_NETDEV_DEV(dev, &pdev->dev);
 
 	for (i = 0; i < MAX_DMA_CHAN; i++) {
 		if (IS_TX(i))
-- 
2.9.3

^ permalink raw reply related

* [PATCH net 2/2] net: ethernet: cpmac: Call SET_NETDEV_DEV()
From: Florian Fainelli @ 2016-12-08 19:41 UTC (permalink / raw)
  To: netdev; +Cc: davem, madalin.bucur, johan, Florian Fainelli
In-Reply-To: <20161208194125.13264-1-f.fainelli@gmail.com>

The TI CPMAC driver calls into PHYLIB which now checks for
net_device->dev.parent, so make sure we do set it before calling into
any MDIO/PHYLIB related function.

Fixes: ec988ad78ed6 ("phy: Don't increment MDIO bus refcount unless it's a different owner")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/ti/cpmac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ti/cpmac.c b/drivers/net/ethernet/ti/cpmac.c
index fa0cfda24fd9..28097be2ff28 100644
--- a/drivers/net/ethernet/ti/cpmac.c
+++ b/drivers/net/ethernet/ti/cpmac.c
@@ -1113,6 +1113,7 @@ static int cpmac_probe(struct platform_device *pdev)
 	if (!dev)
 		return -ENOMEM;
 
+	SET_NETDEV_DEV(dev, &pdev->dev);
 	platform_set_drvdata(pdev, dev);
 	priv = netdev_priv(dev);
 
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH net-next] openvswitch: fix VxLAN-gpe port can't be created in ovs compat mode
From: Pravin Shelar @ 2016-12-08 19:41 UTC (permalink / raw)
  To: Yi Yang; +Cc: ovs dev, Linux Kernel Network Developers, Jiri Benc
In-Reply-To: <1481185210-13056-1-git-send-email-yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Thu, Dec 8, 2016 at 12:20 AM, Yi Yang <yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> In ovs compat mode, ovs won't use LWT in current kernel, this is to
> make sure ovs can work on the old kernels, Linux kernel v4.7 includes
> VxLAN-gpe support but many Linux distributions' kernels are odler than
> v4.7, this fix will ensure that ovs can create VxLAN-gpe port correctly
> on old kernels, it has been verified on Ubuntu 16.04 x86_64 with Linux
> kernel 4.4.0-53-generic.
>
> This does touch compat code, but it is necessary as Pravin commented.
>
> Without this fix, ovs can't create VxLAN-gpe port, it is still a VxLAN
> port.
>
> vxlan_sys_4790 Link encap:Ethernet  HWaddr 72:23:60:c2:8b:8d
>           inet6 addr: fe80::7023:60ff:fec2:8b8d/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65485  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:8 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
> But with this fix applied, a real L3 port is created
>
> vxlan_sys_4790 Link encap:UNSPEC  HWaddr
> 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
>           UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:65485  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
> Signed-off-by: Yi Yang <yi.y.yang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  include/uapi/linux/openvswitch.h |  1 +
>  net/openvswitch/vport-vxlan.c    | 15 +++++++++++++++
>  2 files changed, 16 insertions(+)
>
There is no need for this patch in upstream kernel module. I am open
to having such a patch in out of tree kernel if it simplifies feature
compatibility code.

^ permalink raw reply

* [PATCH v3 net-next 0/4] udp: receive path optimizations
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

This patch series provides about 100 % performance increase under flood. 

v2: added Paolo feedback on udp_rmem_release() for tiny sk_rcvbuf
    added the last patch touching sk_rmem_alloc later

Eric Dumazet (4):
  udp: add busylocks in RX path
  udp: copy skb->truesize in the first cache line
  udp: add batching to udp_rmem_release()
  udp: udp_rmem_release() should touch sk_rmem_alloc later

 include/linux/skbuff.h |  9 ++++++-
 include/linux/udp.h    |  3 +++
 net/ipv4/udp.c         | 71 ++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 77 insertions(+), 6 deletions(-)

-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply

* [PATCH v3 net-next 1/4] udp: add busylocks in RX path
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
In-Reply-To: <1481226117-31288-1-git-send-email-edumazet@google.com>

Idea of busylocks is to let producers grab an extra spinlock
to relieve pressure on the receive_queue spinlock shared by consumer.

This behavior is requested only once socket receive queue is above
half occupancy.

Under flood, this means that only one producer can be in line
trying to acquire the receive_queue spinlock.

These busylock can be allocated on a per cpu manner, instead of a
per socket one (that would consume a cache line per socket)

This patch considerably improves UDP behavior under stress,
depending on number of NIC RX queues and/or RPS spread.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f5628ada47b53f0d92d08210e5d7e4132a10..e6a68d66f3b218998d409527301d2a994e95 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1195,10 +1195,36 @@ void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
+/* Idea of busylocks is to let producers grab an extra spinlock
+ * to relieve pressure on the receive_queue spinlock shared by consumer.
+ * Under flood, this means that only one producer can be in line
+ * trying to acquire the receive_queue spinlock.
+ * These busylock can be allocated on a per cpu manner, instead of a
+ * per socket one (that would consume a cache line per socket)
+ */
+static int udp_busylocks_log __read_mostly;
+static spinlock_t *udp_busylocks __read_mostly;
+
+static spinlock_t *busylock_acquire(void *ptr)
+{
+	spinlock_t *busy;
+
+	busy = udp_busylocks + hash_ptr(ptr, udp_busylocks_log);
+	spin_lock(busy);
+	return busy;
+}
+
+static void busylock_release(spinlock_t *busy)
+{
+	if (busy)
+		spin_unlock(busy);
+}
+
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
+	spinlock_t *busy = NULL;
 	int size;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
@@ -1214,8 +1240,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 * - Less cache line misses at copyout() time
 	 * - Less work at consume_skb() (less alien page frag freeing)
 	 */
-	if (rmem > (sk->sk_rcvbuf >> 1))
+	if (rmem > (sk->sk_rcvbuf >> 1)) {
 		skb_condense(skb);
+
+		busy = busylock_acquire(sk);
+	}
 	size = skb->truesize;
 
 	/* we drop only if the receive buf is full and the receive
@@ -1252,6 +1281,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk);
 
+	busylock_release(busy);
 	return 0;
 
 uncharge_drop:
@@ -1259,6 +1289,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 
 drop:
 	atomic_inc(&sk->sk_drops);
+	busylock_release(busy);
 	return err;
 }
 EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
@@ -2613,6 +2644,7 @@ EXPORT_SYMBOL(udp_flow_hashrnd);
 void __init udp_init(void)
 {
 	unsigned long limit;
+	unsigned int i;
 
 	udp_table_init(&udp_table, "UDP");
 	limit = nr_free_buffer_pages() / 8;
@@ -2623,4 +2655,13 @@ void __init udp_init(void)
 
 	sysctl_udp_rmem_min = SK_MEM_QUANTUM;
 	sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+	/* 16 spinlocks per cpu */
+	udp_busylocks_log = ilog2(nr_cpu_ids) + 4;
+	udp_busylocks = kmalloc(sizeof(spinlock_t) << udp_busylocks_log,
+				GFP_KERNEL);
+	if (!udp_busylocks)
+		panic("UDP: failed to alloc udp_busylocks\n");
+	for (i = 0; i < (1U << udp_busylocks_log); i++)
+		spin_lock_init(udp_busylocks + i);
 }
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 2/4] udp: copy skb->truesize in the first cache line
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
In-Reply-To: <1481226117-31288-1-git-send-email-edumazet@google.com>

In UDP RX handler, we currently clear skb->dev before skb
is added to receive queue, because device pointer is no longer
available once we exit from RCU section.

Since this first cache line is always hot, lets reuse this space
to store skb->truesize and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor time while receive queue
spinlock is held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  9 ++++++++-
 net/ipv4/udp.c         | 13 ++++++++++---
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0cd92b0f2af5fe5a7c153435b8dc75833818..332e76756f54bee3b625ed8872c024a2bf53 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -645,8 +645,15 @@ struct sk_buff {
 		struct rb_node	rbnode; /* used in netem & tcp stack */
 	};
 	struct sock		*sk;
-	struct net_device	*dev;
 
+	union {
+		struct net_device	*dev;
+		/* Some protocols might use this space to store information,
+		 * while device pointer would be NULL.
+		 * UDP receive path is one user.
+		 */
+		unsigned long		dev_scratch;
+	};
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e6a68d66f3b218998d409527301d2a994e95..c608334d99aa5620858d9cceec500b2be944 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1188,10 +1188,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
 }
 
-/* Note: called with sk_receive_queue.lock held */
+/* Note: called with sk_receive_queue.lock held.
+ * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch
+ * This avoids a cache line miss while receive_queue lock is held.
+ * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
+ */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->truesize, 1);
+	udp_rmem_release(sk, skb->dev_scratch, 1);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
@@ -1246,6 +1250,10 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 		busy = busylock_acquire(sk);
 	}
 	size = skb->truesize;
+	/* Copy skb->truesize into skb->dev_scratch to avoid a cache line miss
+	 * in udp_skb_destructor()
+	 */
+	skb->dev_scratch = size;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
@@ -1272,7 +1280,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	/* no need to setup a destructor, we will explicitly release the
 	 * forward allocated memory on dequeue
 	 */
-	skb->dev = NULL;
 	sock_skb_set_dropcount(sk, skb);
 
 	__skb_queue_tail(list, skb);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 3/4] udp: add batching to udp_rmem_release()
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
In-Reply-To: <1481226117-31288-1-git-send-email-edumazet@google.com>

If udp_recvmsg() constantly releases sk_rmem_alloc
for every read packet, it gives opportunity for
producers to immediately grab spinlocks and desperatly
try adding another packet, causing false sharing.

We can add a simple heuristic to give the signal
by batches of ~25 % of the queue capacity.

This patch considerably increases performance under
flood by about 50 %, since the thread draining the queue
is no longer slowed by false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/udp.h |  3 +++
 net/ipv4/udp.c      | 12 ++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd39478b635ef5396b5ae1c63f8c965..c0f530809d1f3db7323e51a52224eb49d8f9 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,9 @@ struct udp_sock {
 	int			(*gro_complete)(struct sock *sk,
 						struct sk_buff *skb,
 						int nhoff);
+
+	/* This field is dirtied by udp_recvmsg() */
+	int		forward_deficit;
 };

 static inline struct udp_sock *udp_sk(const struct sock *sk)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c608334d99aa5620858d9cceec500b2be944..5a38faa12cde7fdcd5b6d86cdc0f4bc33de4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1177,8 +1177,20 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 /* fully reclaim rmem/fwd memory allocated for skb */
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
+	struct udp_sock *up = udp_sk(sk);
 	int amt;

+	if (likely(partial)) {
+		up->forward_deficit += size;
+		size = up->forward_deficit;
+		if (size < (sk->sk_rcvbuf >> 2) &&
+		    !skb_queue_empty(&sk->sk_receive_queue))
+			return;
+	} else {
+		size += up->forward_deficit;
+	}
+	up->forward_deficit = 0;
+
 	atomic_sub(size, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* [PATCH v3 net-next 4/4] udp: udp_rmem_release() should touch sk_rmem_alloc later
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
In-Reply-To: <1481226117-31288-1-git-send-email-edumazet@google.com>

In flood situations, keeping sk_rmem_alloc at a high value
prevents producers from touching the socket.

It makes sense to lower sk_rmem_alloc only at the end
of udp_rmem_release() after the thread draining receive
queue in udp_recvmsg() finished the writes to sk_forward_alloc.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5a38faa12cde7fdcd5b6d86cdc0f4bc33de4..9ca279b130d51f6feaa97785b1c906775810 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,13 +1191,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 	}
 	up->forward_deficit = 0;

-	atomic_sub(size, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
 	sk->sk_forward_alloc -= amt;

 	if (amt)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+
+	atomic_sub(size, &sk->sk_rmem_alloc);
 }

 /* Note: called with sk_receive_queue.lock held.
-- 
2.8.0.rc3.226.g39d4020

^ permalink raw reply related

* Re: [PATCH v3 net-next 1/4] udp: add busylocks in RX path
From: Hannes Frederic Sowa @ 2016-12-08 19:45 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Paolo Abeni, Eric Dumazet
In-Reply-To: <1481226117-31288-2-git-send-email-edumazet@google.com>

Hi Eric,

On Thu, Dec 8, 2016, at 20:41, Eric Dumazet wrote:
> Idea of busylocks is to let producers grab an extra spinlock
> to relieve pressure on the receive_queue spinlock shared by consumer.
> 
> This behavior is requested only once socket receive queue is above
> half occupancy.
> 
> Under flood, this means that only one producer can be in line
> trying to acquire the receive_queue spinlock.
> 
> These busylock can be allocated on a per cpu manner, instead of a
> per socket one (that would consume a cache line per socket)
> 
> This patch considerably improves UDP behavior under stress,
> depending on number of NIC RX queues and/or RPS spread.

This patch mostly improves situation for non-connected sockets. Do you
think it makes sense to acquire the spinlock depending on the sockets
state? Connected UDP sockets flow in on one CPU anyway?

Otherwise the series looks really great, thanks!

^ permalink raw reply

* RE: [PATCH v2 net-next 0/2] phy: lan78xx: add phy fixup unregister functions & LAN7801 update
From: Woojung.Huh @ 2016-12-08 19:50 UTC (permalink / raw)
  To: davem; +Cc: f.fainelli, andrew, netdev, UNGLinuxDriver
In-Reply-To: <20161208.142225.407549200157627922.davem@davemloft.net>

> > V2 patch of adding phy fixup unregister function with use in LAN7801
> > update.
> 
> Series applied, thanks.

Thanks.

^ permalink raw reply

* Re: [PATCH v3 net-next 1/4] udp: add busylocks in RX path
From: Eric Dumazet @ 2016-12-08 19:57 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni
In-Reply-To: <1481226350.1892859.812996409.6D78D4D7@webmail.messagingengine.com>

On Thu, 2016-12-08 at 20:45 +0100, Hannes Frederic Sowa wrote:
> Hi Eric,
> 

> This patch mostly improves situation for non-connected sockets. Do you
> think it makes sense to acquire the spinlock depending on the sockets
> state? Connected UDP sockets flow in on one CPU anyway?

We could do that, definitely.

However I could not measure any difference with a conditional test here
for connected socket.

Maybe it would show up if we are unlucky and multiple cpus compete on
the same cache line because of the hashed spinlock array.

^ permalink raw reply

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
From: Hannes Frederic Sowa @ 2016-12-08 20:05 UTC (permalink / raw)
  To: Tom Herbert, Eric Dumazet
  Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn
In-Reply-To: <CALx6S35B3ja3pLHnKK4JQZrpKxtSc8nuZ_LBJT7E8qQJj2ronQ@mail.gmail.com>

Hello,

On Thu, Dec 8, 2016, at 20:15, Tom Herbert wrote:
> On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
> > On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
> >
> >> Of course that would only help on systems where no one enable encaps,
> >> ie. looks good in the the simple benchmarks but in real life if just
> >> one socket enables encap everyone else takes the hit. Alternatively,
> >> maybe we could do early demux when we do the lookup in GRO to
> >> eliminate the extra lookup?
> >
> > Well, if you do the lookup in GRO, wont it be done for every incoming
> > MSS, instead of once per GRO packet ?
> 
> We should be able to avoid that. We already do the lookup for every
> UDP packet going into GRO, would only need to take the refcnt once for
> the whole GRO packet.
> 
> >
> > Anyway, the flooded UDP sockets out there are not normally connected
> 
> We still should be able to use early demux in that case, just can't
> avoid the route lookup. I wonder if we might be able to cache a soft
> route maybe for the last local destination received to help the
> unconnected sockets case...
> 
> In any case, I can take a look at of doing early demux from with UDP GRO.

Early demux already breaks ip rules: we might set up a rule so an
incoming packet might depending on the rule not find an input route at
all and would be forwarded. Same problem might occur with VRF, when you
have multiple ip addresses in different "realms".

That said, I don't see why we can't be more aggressive for GRO in the
unconnected case: we simply must make sure that the current namespace
holds the ip address, which is simply a hash lookup. After that we can
even accept packets for a wildcard bounded socket.

Probably we should disable this logic as soon as soon as vrf and/or
rules are active to have correct semantics.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH 2/3] i40e: Add XDP_TX support
From: Björn Töpel @ 2016-12-08 20:22 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, jeffrey.t.kirsher, intel-wired-lan,
	Björn Töpel, john.r.fastabend, magnus.karlsson, netdev
In-Reply-To: <201612090331.1UJP1l3g%fengguang.wu@intel.com>

2016-12-08 20:20 GMT+01:00 kbuild test robot <lkp@intel.com>:
> Hi Björn,
>
> [auto build test ERROR on next-20161208]
> [cannot apply to jkirsher-next-queue/dev-queue v4.9-rc8 v4.9-rc7 v4.9-rc6 v4.9-rc8]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/i40e-Support-for-XDP/20161209-013138
> config: sparc64-allmodconfig (attached as .config)
> compiler: sparc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=sparc64
>
> All error/warnings (new ones prefixed by >>):
>
>    In file included from include/linux/cache.h:4:0,
>                     from include/linux/printk.h:8,
>                     from include/linux/kernel.h:13,
>                     from include/linux/list.h:8,
>                     from include/linux/timer.h:4,
>                     from include/linux/workqueue.h:8,
>                     from include/linux/bpf.h:11,
>                     from drivers/net/ethernet/intel/i40e/i40e_txrx.c:27:
>    drivers/net/ethernet/intel/i40e/i40e_txrx.c: In function 'i40e_try_flip_rx_page':
>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c:1613:32: error: 'size' undeclared (first use in this function)
>      unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
>                                    ^
>    include/uapi/linux/kernel.h:10:41: note: in definition of macro '__ALIGN_KERNEL_MASK'
>     #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
>                                             ^
>    include/linux/kernel.h:49:22: note: in expansion of macro '__ALIGN_KERNEL'
>     #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
>                          ^~~~~~~~~~~~~~
>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c:1613:26: note: in expansion of macro 'ALIGN'
>      unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
>                              ^~~~~
>    drivers/net/ethernet/intel/i40e/i40e_txrx.c:1613:32: note: each undeclared identifier is reported only once for each function it appears in
>      unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
>                                    ^
>    include/uapi/linux/kernel.h:10:41: note: in definition of macro '__ALIGN_KERNEL_MASK'
>     #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
>                                             ^
>    include/linux/kernel.h:49:22: note: in expansion of macro '__ALIGN_KERNEL'
>     #define ALIGN(x, a)  __ALIGN_KERNEL((x), (a))
>                          ^~~~~~~~~~~~~~
>>> drivers/net/ethernet/intel/i40e/i40e_txrx.c:1613:26: note: in expansion of macro 'ALIGN'
>      unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
>                              ^~~~~
>
> vim +/size +1613 drivers/net/ethernet/intel/i40e/i40e_txrx.c
>
>   1607   */
>   1608  static bool i40e_try_flip_rx_page(struct i40e_rx_buffer *rx_buffer)
>   1609  {
>   1610  #if (PAGE_SIZE < 8192)
>   1611          unsigned int truesize = I40E_RXBUFFER_2048;
>   1612  #else
>> 1613          unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
>   1614          unsigned int last_offset = PAGE_SIZE - I40E_RXBUFFER_2048;
>   1615  #endif
>   1616
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Ick. I'll respin this.


Björn

^ permalink raw reply

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
From: Tom Herbert @ 2016-12-08 20:30 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Paolo Abeni, David Miller, netdev, Willem de Bruijn
In-Reply-To: <1481227516.1898563.813013233.52CC6646@webmail.messagingengine.com>

On Thu, Dec 8, 2016 at 12:05 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hello,
>
> On Thu, Dec 8, 2016, at 20:15, Tom Herbert wrote:
>> On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>> > On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
>> >
>> >> Of course that would only help on systems where no one enable encaps,
>> >> ie. looks good in the the simple benchmarks but in real life if just
>> >> one socket enables encap everyone else takes the hit. Alternatively,
>> >> maybe we could do early demux when we do the lookup in GRO to
>> >> eliminate the extra lookup?
>> >
>> > Well, if you do the lookup in GRO, wont it be done for every incoming
>> > MSS, instead of once per GRO packet ?
>>
>> We should be able to avoid that. We already do the lookup for every
>> UDP packet going into GRO, would only need to take the refcnt once for
>> the whole GRO packet.
>>
>> >
>> > Anyway, the flooded UDP sockets out there are not normally connected
>>
>> We still should be able to use early demux in that case, just can't
>> avoid the route lookup. I wonder if we might be able to cache a soft
>> route maybe for the last local destination received to help the
>> unconnected sockets case...
>>
>> In any case, I can take a look at of doing early demux from with UDP GRO.
>
> Early demux already breaks ip rules: we might set up a rule so an
> incoming packet might depending on the rule not find an input route at
> all and would be forwarded. Same problem might occur with VRF, when you
> have multiple ip addresses in different "realms".
>
> That said, I don't see why we can't be more aggressive for GRO in the
> unconnected case: we simply must make sure that the current namespace
> holds the ip address, which is simply a hash lookup. After that we can
> even accept packets for a wildcard bounded socket.
>
> Probably we should disable this logic as soon as soon as vrf and/or
> rules are active to have correct semantics.
>
All this gets dicey in the presence of encapsulation. One problem is
that we can't tell when or if a packet crosses network namespace just
by parsing the packet. It's a subset of the general problem of
correctly identifying packets outside of the terminal protocol
processing (we've already talked about the incorrectness of devices
that identify UDP encapsulation based on port numbers). But even
without encapsulation there is still the problem as you point out with
vrf, IPvlan, etc. I think the answer thus far has been to hand wave
and rely on probability, for instance identifying UDP encapsulation by
port number in device probably works almost all of the time. Matching
an unconnected UDP socket in GRO and then accepting a route associated
with that probably would also work nearly all the time. Maybe if we
can quantify the dependencies that early parsing has in this area,
other mechanisms (vrf) might be able to do something intelligent to
ensure correctness-- does seem like a hard problem though!

Tom

> Bye,
> Hannes

^ permalink raw reply

* Re: [PATCH 1/2] net: ethernet: sxgbe: remove private tx queue lock
From: Lino Sanfilippo @ 2016-12-08 20:32 UTC (permalink / raw)
  To: Francois Romieu
  Cc: bh74.an, ks.giri, vipul.pandya, peppe.cavallaro, alexandre.torgue,
	pavel, davem, linux-kernel, netdev
In-Reply-To: <20161207231534.GB5889@electric-eye.fr.zoreil.com>

Hi,

On 08.12.2016 00:15, Francois Romieu wrote:
> Lino Sanfilippo <LinoSanfilippo@gmx.de> :
>> The driver uses a private lock for synchronization between the xmit
>> function and the xmit completion handler, but since the NETIF_F_LLTX flag
>> is not set, the xmit function is also called with the xmit_lock held.
>> 
>> On the other hand the xmit completion handler first takes the private lock
>> and (in case that the tx queue has been stopped) the xmit_lock, leading
>> to a reverse locking order and the potential danger of a deadlock.
> 
> netif_tx_stop_queue is used by:
> 1. xmit function before releasing lock and returning.
> 2. sxgbe_restart_tx_queue()
>    <- sxgbe_tx_interrupt
>    <- sxgbe_reset_all_tx_queues()
>       <- sxgbe_tx_timeout()
> 
> Given xmit won't be called again until tx queue is enabled, it's not clear
> how a deadlock could happen due to #1.
> 

After spending more thoughts on this I tend to agree with you. Yes, we have the
different locking order for the xmit_lock and the private lock in two concurrent
threads. And one of the first things one learns about locking is that this is a
good way to create a deadlock sooner or later. But in our case the deadlock 
can only occur if the xmit function and the tx completion handler perceive different
 states for the tx queue, or to be more specific: 
the completion handler sees the tx queue in state "stopped" while the xmit handler 
sees it in state "running" at the same time. Only then both functions would try to
take both locks, which could lead to a deadlock.

OTOH Pavel said that he actually could produce a deadlock. Now I wonder if this is caused
by that locking scheme (in a way I have not figured out yet) or if it is a different issue.

Regards,
Lino

^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net/sched: cls_flower: Support matching on ICMP
From: Or Gerlitz @ 2016-12-08 20:43 UTC (permalink / raw)
  To: David Miller; +Cc: Simon Horman, Jiri Pirko, Linux Netdev List
In-Reply-To: <20161208.115810.554381162900423199.davem@davemloft.net>

On Thu, Dec 8, 2016 at 6:58 PM, David Miller <davem@davemloft.net> wrote:

> Simon and Or, you both added extensions to cls_flower at the same
> time.  Or's changes went in first, so his UAPI numbers did not change.
> Simons, your changes went in next so your numbers did change and
> therefore you will have to recompile any userland components you were
> using for testing.

Yeah, I guess you had to do some rebasing there for Simon's series...
thanks for taking care

^ permalink raw reply

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
From: Tom Herbert @ 2016-12-08 20:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Paolo Abeni, David Miller, netdev, Willem de Bruijn
In-Reply-To: <1481227516.1898563.813013233.52CC6646@webmail.messagingengine.com>

On Thu, Dec 8, 2016 at 12:05 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hello,
>
> On Thu, Dec 8, 2016, at 20:15, Tom Herbert wrote:
>> On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>> > On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
>> >
>> >> Of course that would only help on systems where no one enable encaps,
>> >> ie. looks good in the the simple benchmarks but in real life if just
>> >> one socket enables encap everyone else takes the hit. Alternatively,
>> >> maybe we could do early demux when we do the lookup in GRO to
>> >> eliminate the extra lookup?
>> >
>> > Well, if you do the lookup in GRO, wont it be done for every incoming
>> > MSS, instead of once per GRO packet ?
>>
>> We should be able to avoid that. We already do the lookup for every
>> UDP packet going into GRO, would only need to take the refcnt once for
>> the whole GRO packet.
>>
>> >
>> > Anyway, the flooded UDP sockets out there are not normally connected
>>
>> We still should be able to use early demux in that case, just can't
>> avoid the route lookup. I wonder if we might be able to cache a soft
>> route maybe for the last local destination received to help the
>> unconnected sockets case...
>>
>> In any case, I can take a look at of doing early demux from with UDP GRO.
>
> Early demux already breaks ip rules: we might set up a rule so an
> incoming packet might depending on the rule not find an input route at
> all and would be forwarded. Same problem might occur with VRF, when you
> have multiple ip addresses in different "realms".
>
> That said, I don't see why we can't be more aggressive for GRO in the
> unconnected case: we simply must make sure that the current namespace
> holds the ip address, which is simply a hash lookup. After that we can
> even accept packets for a wildcard bounded socket.
>
Or just depend on encapsulation sockets to bind to an address. That
would eliminate most the ambiguity especially if it can be pushed into
a device that is trying to parse encapsulation. We would need new
interfaces to support that in HW, or use n-tuple filtering (which I
still maintain is the only right way to do it).

Tom

> Probably we should disable this logic as soon as soon as vrf and/or
> rules are active to have correct semantics.
>
> Bye,
> Hannes

^ permalink raw reply

* Re: [net-next PATCH v5 0/6] XDP for virtio_net
From: John Fastabend @ 2016-12-08 20:46 UTC (permalink / raw)
  To: Alexei Starovoitov, David Miller
  Cc: daniel, mst, shm, tgraf, john.r.fastabend, netdev, brouer,
	Michael S. Tsirkin
In-Reply-To: <20161208193814.GA1954@ast-mbp.thefacebook.com>

On 16-12-08 11:38 AM, Alexei Starovoitov wrote:
> On Thu, Dec 08, 2016 at 02:17:02PM -0500, David Miller wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>> Date: Wed, 07 Dec 2016 12:10:47 -0800
>>
>>> This implements virtio_net for the mergeable buffers and big_packet
>>> modes. I tested this with vhost_net running on qemu and did not see
>>> any issues. For testing num_buf > 1 I added a hack to vhost driver
>>> to only but 100 bytes per buffer.
>>  ...
>>
>> So where are we with this?

There is one possible issue with a hang that Michael pointed out. I can
either spin a v6 or if you pull this v5 series in I can post a bugfix
for it. I am not seeing the issue in practice XDP virtio has been up
and running on my box here for days without issue.

All the concerns below are really future XDP ideas and unrelated to
this series or at least not required for this series to applied IMO.

>>
>> I'm not too thrilled with the idea of making XDP_TX optional or
>> something like that.  If someone enables XDP, there is a tradeoff.
>>
>> I also have reservations about the idea to make jumbo frames work
>> without giving XDP access to the whole packet.  If it wants to push or
>> pop a header, it might need to know the whole packet length.  How will
>> you pass that to the XDP program?
>>
>> Some kinds of encapsulation require trailers, thus preclusing access
>> to the entire packet precludes those kinds of transformations.
> 
> +1

This was sort of speculative on my side it is certainly not dependent on
the series here. I agree that we don't want to get into a state where
program X runs here and not there and only runs after doing magic
incantations, etc. I would only propose it if there is a clean way to
implement this.

> 
>> This is why we want simple, linear, buffer access for XDP.
>>
>> Even the most seemingly minor exception turns into a huge complicated
>> mess.
> 
> +1

Yep.

> 
> and from the other thread:
>>> Can't we disable XDP_TX somehow? Many people might only want RX drop,
>>> and extra queues are not always there.
>>>
>>
>> Alexei, Daniel, any thoughts on this?
> 
> I don't like it.
> 

OK alternatively we can make more queues available in virtio which might
be the better solution.

>> I know we were trying to claim some base level of feature support for
>> all XDP drivers. I am sympathetic to this argument though for DDOS we
>> do not need XDP_TX support. And virtio can become queue constrained
>> in some cases.
> 
> especially for ddos case doing lro/gro is not helpful.

Fair enough but disabling LRO to handle the case where you "might" get
a DDOS will hurt normal good traffic.

> I frankly don't see a use case where you'd want to steer a packet
> all the way into VM just to drop them there?

VM to VM traffic is my use case. And in that model we need XDP at the
virtio or vhost layer in case of malicious/broke/untrusted VM. I have
some vhost patches under development for when net-next opens up again.

> Without XDP_TX it's too crippled. adjust_head() won't be possible,

Just a nit but any reason not to support adjust_head and then XDP_PASS.
I don't have a use case in mind but also see no reason to preclude it.

> packet mangling would have to be disabled and so on.
> If xdp program doesn't see raw packet it can only parse the headers of
> this jumbo meta-packet and drop it, but for virtio it's really too late.
> ddos protection needs to be done at the earliest hw nic receive.

VM to VM traffic never touches hw nic.

> I think if driver claims xdp support it needs to support
> drop/pass/tx and adjust_head. For metadata passing up into stack from xdp
> we need adjust_head, for encap/decap we need it too. And lro is in the way
> of such transformations.
> We struggled a lot with cls_bpf due to all metadata inside skb that needs
> to be kept correct. Feeding non-raw packets into xdp is a rat hole.
> 

In summary:

I think its worth investigating getting LRO working but agree we can't
sacrifice any of the existing features or complicate the code to do it.
If the result of investigating is it can't be done then that is how it
is.

Jumbo frames I care very little about in reality so should not have
mentioned it.

Requiring XDP drivers to support all features is fine for me I can make
the virtio queue scheme a bit more flexible. Michael might have some
opinion on this though.

This series shouldn't be blocked by any of the above.

Thanks,
.John

^ permalink raw reply

* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
From: Jesper Dangaard Brouer @ 2016-12-08 20:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, David S . Miller, netdev, Paolo Abeni, Eric Dumazet
In-Reply-To: <1481218739-27089-1-git-send-email-edumazet@google.com>

On Thu,  8 Dec 2016 09:38:55 -0800
Eric Dumazet <edumazet@google.com> wrote:

> This patch series provides about 100 % performance increase under flood. 

Could you please explain a bit more about what kind of testing you are
doing that can show 100% performance improvement?

I've tested this patchset and my tests show *huge* speeds ups, but
reaping the performance benefit depend heavily on setup and enabling
the right UDP socket settings, and most importantly where the
performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).

Basic setup: Unload all netfilter, and enable ip_early_demux.
 sysctl net/ipv4/ip_early_demux=1

Test generator pktgen UDP packets single flow, 50Gbit/s mlx5 NICs.
 - Vary packet size between 64 and 1514.

Packet-size: 64
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	537.70	1859756.90	2155
recvmsg   	run: 0 10000000	510.84	1957541.83	2047
read      	run: 0 10000000	583.40	1714077.14	2338
recvfrom  	run: 0 10000000	600.09	1666411.49	2405

The ksoftirq thread "cost" more than udp_sink, which is idle, and UDP
queue does not get full-enough. Thus, patchset does not have any
effect.

Try to increase pktgen packet size, as this increase the copy cost of
udp_sink.  Thus, a queue can now form, and udp_sink CPU almost have no
idle cycles.  The "read" and "readfrom" did experience some idle
cycles.

Packet-size: 1514
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	435.88	2294204.11	1747
recvmsg   	run: 0 10000000	458.06	2183100.64	1835
read      	run: 0 10000000	520.34	1921826.18	2085
recvfrom  	run: 0 10000000	515.48	1939935.27	2066

Next trick connected UDP:

Use connected UDP socket (combined with ip_early_demux), removes the
FIB_lookup from the ksoftirq, and cause tipping point to be better.

Packet-size: 64
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	391.18	2556361.62	1567
recvmsg   	run: 0 10000000	422.95	2364349.69	1695
read      	run: 0 10000000	425.29	2351338.10	1704
recvfrom  	run: 0 10000000	476.74	2097577.57	1910

Change/increase packet size:

Packet-size: 1514
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	457.56	2185481.94	1833
recvmsg   	run: 0 10000000	479.42	2085837.49	1921
read      	run: 0 10000000	398.05	2512233.13	1595
recvfrom  	run: 0 10000000	391.07	2557096.95	1567

A bit strange, changing the packet size, flipped what is the fastest
syscall.

It is also interesting to see that ksoftirq limit is:

Result from "nstat" while using recvmsg, show that ksoftirq is
handling 2.6 Mpps, and consumer/udp_sink is bottleneck with 2Mpps.

[skylake ~]$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    2667577            0.0
IpInDelivers                    2667577            0.0
UdpInDatagrams                  2083580            0.0
UdpInErrors                     583995             0.0
UdpRcvbufErrors                 583995             0.0
IpExtInOctets                   4001340000         0.0
IpExtInNoECTPkts                2667559            0.0

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox