netdev.vger.kernel.org archive mirror
* [RFC] udp: some improvements on RX path.
@ 2016-12-05  2:43 Eric Dumazet
  2016-12-05 13:22 ` Paolo Abeni
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-05  2:43 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev

We currently access 3 cache lines from an skb in the receive queue while
holding the receive queue lock:

First cache line (contains ->next/prev pointers)
2nd cache line (skb->peeked)
3rd cache line (skb->truesize)

I believe we could get rid of skb->peeked completely.

I will cook a patch, but basically the idea is that the last owner of a
skb (right before skb->users becomes 0) can have the 'ownership' and
thus increase stats.

The 3rd cache line miss is easily avoided by the following patch.

But I also want to work on the idea I gave a few days back, having a
separate queue and using splice to transfer the 'softirq queue' into
a calm queue in a different cache line.

I expect a 50 % performance increase under load, maybe 1.5 Mpps.
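
A minimal sketch of that two-queue idea (illustrative names, not an
actual patch; it assumes a single reader thread, as is typical for one
UDP socket):

/* The softirq producer keeps appending to sk_receive_queue. The reader
 * drains a private queue, and only takes the producer's lock - splicing
 * the whole backlog in one go - when that private queue is empty.
 */
static struct sk_buff *udp_try_dequeue(struct sock *sk,
				       struct sk_buff_head *reader_queue)
{
	struct sk_buff_head *input = &sk->sk_receive_queue;
	struct sk_buff *skb;

	skb = __skb_dequeue(reader_queue);	/* uncontended fast path */
	if (skb)
		return skb;

	spin_lock_irq(&input->lock);
	skb_queue_splice_tail_init(input, reader_queue);
	spin_unlock_irq(&input->lock);

	return __skb_dequeue(reader_queue);
}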

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 16d88ba9ff1c..37d4e8da6482 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,7 +1191,13 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 /* Note: called with sk_receive_queue.lock held */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->truesize, 1);
+	/* HACK HACK HACK :
+	 * Instead of using skb->truesize here, find a copy of it in skb->dev.
+	 * This avoids a cache line miss in this path,
+	 * while sk_receive_queue lock is held.
+	 * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
+	 */
+	udp_rmem_release(sk, (int)(unsigned long)skb->dev, 1);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
@@ -1201,6 +1207,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;
 
+	/* help udp_skb_destructor() to get skb->truesize from skb->dev
+	 * without a cache line miss.
+	 */
+	skb->dev = (struct net_device *)(unsigned long)size;
+
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
@@ -1233,7 +1244,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	/* no need to setup a destructor, we will explicitly release the
 	 * forward allocated memory on dequeue
 	 */
-	skb->dev = NULL;
 	sock_skb_set_dropcount(sk, skb);
 
 	__skb_queue_tail(list, skb);


* Re: [RFC] udp: some improvements on RX path.
  2016-12-05  2:43 [RFC] udp: some improvements on RX path Eric Dumazet
@ 2016-12-05 13:22 ` Paolo Abeni
  2016-12-05 14:28   ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-05 13:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

Hi Eric,

On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:
> We currently access 3 cache lines from an skb in receive queue while
> holding receive queue lock :
> 
> First cache line (contains ->next / prev pointers )
> 2nd cache line (skb->peeked)
> 3rd cache line (skb->truesize)
> 
> I believe we could get rid of skb->peeked completely.
> 
> I will cook a patch, but basically the idea is that the last owner of a
> skb (right before skb->users becomes 0) can have the 'ownership' and
> thus increase stats.

Agreed.

> The 3rd cache line miss is easily avoided by the following patch.

I ran some performance tests on top of your patch "net: reorganize
struct sock for better data locality", and I see an additional ~7%
improvement on top of that in the udp flood scenario.

In my tests, the topmost perf offenders for the u/s process are now:

   9.98%  udp_sink  [kernel.kallsyms]  [k] udp_rmem_release
   8.76%  udp_sink  [kernel.kallsyms]  [k] inet_recvmsg
   6.71%  udp_sink  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   5.40%  udp_sink  [kernel.kallsyms]  [k] __skb_try_recv_datagram
   5.19%  udp_sink  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string

udp_rmem_release() spends most of its time doing:

    atomic_sub(size, &sk->sk_rmem_alloc);

I see a cacheline miss while accessing sk_rmem_alloc; most probably due
to sk_rmem_alloc being updated by the writer outside the rx queue lock.

Moving that update inside the lock should remove this cache miss, but will
increase the pressure on the rx lock. What do you think?

inet_recvmsg() is there because with "net: reorganize struct sock for
better data locality" we get a cache miss while accessing skc_rxhash in
sock_rps_record_flow(); touching sk_drops is dirtying that cacheline -
sorry for not noticing this before. Do you have CONFIG_RPS disabled ?

> But I also want to work on the idea I gave few days back, having a
> separate queue and use splice to transfer the 'softirq queue' into
> a calm queue in a different cache line.
> 
> I expect a 50 % performance increase under load, maybe 1.5 Mpps.

It should work nicely under contention, but won't that increase the
overhead for the uncontended/single flow scenario? The user space
reader needs to acquire 2 locks when splicing the 'softirq queue'. On my
system ksoftirqd and the u/s process work at similar speeds, so splicing
will happen quite often.

Paolo


* Re: [RFC] udp: some improvements on RX path.
  2016-12-05 13:22 ` Paolo Abeni
@ 2016-12-05 14:28   ` Eric Dumazet
  2016-12-05 15:37     ` Jesper Dangaard Brouer
  2016-12-05 17:57     ` [PATCH] net/udp: do not touch skb->peeked unless really needed Eric Dumazet
  0 siblings, 2 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-05 14:28 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: netdev

On Mon, 2016-12-05 at 14:22 +0100, Paolo Abeni wrote:
> Hi Eric,
> 
> On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:
> > We currently access 3 cache lines from an skb in receive queue while
> > holding receive queue lock :
> > 
> > First cache line (contains ->next / prev pointers )
> > 2nd cache line (skb->peeked)
> > 3rd cache line (skb->truesize)
> > 
> > I believe we could get rid of skb->peeked completely.
> > 
> > I will cook a patch, but basically the idea is that the last owner of a
> > skb (right before skb->users becomes 0) can have the 'ownership' and
> > thus increase stats.
> 
> Agreed.
> 
> > The 3rd cache line miss is easily avoided by the following patch.
> 
> I run some performance tests on top of your patch "net: reorganize
> struct sock for better data locality", and I see an additional ~7%
> improvement on top of that, in the udp flood scenario. 
> 
> In my tests, the topmost perf offenders for the u/s process are now:
> 
>    9.98%  udp_sink  [kernel.kallsyms]  [k] udp_rmem_release
>    8.76%  udp_sink  [kernel.kallsyms]  [k] inet_recvmsg
>    6.71%  udp_sink  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    5.40%  udp_sink  [kernel.kallsyms]  [k] __skb_try_recv_datagram
>    5.19%  udp_sink  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> 
> udp_rmem_release() spends most of its time doing:
> 
>     atomic_sub(size, &sk->sk_rmem_alloc);
> 
> I see a cacheline miss while accessing sk_rmem_alloc; most probably due
> to sk_rmem_alloc being updated by the writer outside the rx queue lock.
> 
> Moving such update inside the lock show remove this cache miss but will
> increase the pressure on the rx lock. What do you think ?

I want to accumulate the sk_rmem_alloc deficit in another variable in
the same cache line as the secondary list, updated from process
context at udp recvmsg() time.

And only transfer the accumulated deficit in one go when the second
queue is emptied, or if the accumulated deficit is > (rcvbuf/2).

If done right, the softirq flooder(s) and the process thread should
interfere with each other only once per batch.
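
A minimal sketch of that deferred release (illustrative only; the
accumulator name and its placement are assumptions, not the eventual
patch):

/* 'fwd_deficit' is assumed to sit in a reader-owned cache line,
 * e.g. next to the secondary queue head.
 */
static void udp_rmem_defer_release(struct sock *sk, struct sk_buff *skb,
				   int *fwd_deficit, bool queue_empty)
{
	*fwd_deficit += skb->truesize;

	/* flush at most once per batch, or when the deficit grows
	 * beyond half of the receive buffer
	 */
	if (queue_empty || *fwd_deficit > sk->sk_rcvbuf / 2) {
		atomic_sub(*fwd_deficit, &sk->sk_rmem_alloc);
		*fwd_deficit = 0;
	}
}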

> 
> inet_recvmsg() is there because with "net: reorganize struct sock for
> better data locality" we get a cache miss while accessing skc_rxhash in
> sock_rps_record_flow(); touching sk_drops is dirtying that cacheline -
> sorry for not noticing this before. Do you have CONFIG_RPS disabled ?
> 
> > But I also want to work on the idea I gave few days back, having a
> > separate queue and use splice to transfer the 'softirq queue' into
> > a calm queue in a different cache line.
> > 
> > I expect a 50 % performance increase under load, maybe 1.5 Mpps.
> 
> It should work nicely under contention, but won't that increase the
> overhead for the uncontended/single flow scenario ? the user space
> reader needs to acquire 2 lock when splicing the 'softirq queue'. On my
> system ksoftirqd and the u/s process work at similar speeds, so splicing
> will happen quite often. 

Well, the splice would happen only if you have more than one message in
the softirq queue. So no real overhead for the uncontended flow scenario.


This reminds me of the busylock I added in __dev_xmit_skb(), which
basically is acquired only when we detect possible contention on the
qdisc lock.
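
For reference, the busylock pattern looks roughly like this (heavily
simplified from net/core/dev.c, error paths and details elided):

	bool contended = qdisc_is_running(q);

	/* Park extra contenders on a secondary lock, so that at most one
	 * additional cpu spins on the hot qdisc root lock at any time.
	 */
	if (unlikely(contended))
		spin_lock(&q->busylock);

	spin_lock(qdisc_lock(q));
	/* ... enqueue the skb, possibly run the qdisc ... */
	spin_unlock(qdisc_lock(q));

	if (unlikely(contended))
		spin_unlock(&q->busylock);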

Thanks.


* Re: [RFC] udp: some improvements on RX path.
  2016-12-05 14:28   ` Eric Dumazet
@ 2016-12-05 15:37     ` Jesper Dangaard Brouer
  2016-12-05 15:54       ` Eric Dumazet
  2016-12-05 17:57     ` [PATCH] net/udp: do not touch skb->peeked unless really needed Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-05 15:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, Paolo Abeni, netdev


On Mon, 05 Dec 2016 06:28:53 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Mon, 2016-12-05 at 14:22 +0100, Paolo Abeni wrote:
> > 
> > On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:  
[...]

> > > But I also want to work on the idea I gave few days back, having a
> > > separate queue and use splice to transfer the 'softirq queue' into
> > > a calm queue in a different cache line.
> > > 
> > > I expect a 50 % performance increase under load, maybe 1.5 Mpps.  

I also have high hopes for such a solution. I'm very excited that you
are working on this! :-)

 
> > It should work nicely under contention, but won't that increase the
> > overhead for the uncontended/single flow scenario ? the user space
> > reader needs to acquire 2 lock when splicing the 'softirq queue'.
> > On my system ksoftirqd and the u/s process work at similar speeds,
> > so splicing will happen quite often.   
> 
> Well, the splice would happen only if you have more than one message
> in the softirq queue. So no real overhead for uncontended flow
> scenario.
> 
> 
> This reminds me of the busylock I added in __dev_xmit_skb(), which
> basically is acquired only when we detect a possible contention on
> qdisc lock.

Do you think the splice technique would have the same performance
benefit as having an MPMC queue with separate enqueue and dequeue locking
(like we have with skb_array/ptr_ring, which avoids cache bouncing)?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC] udp: some improvements on RX path.
  2016-12-05 15:37     ` Jesper Dangaard Brouer
@ 2016-12-05 15:54       ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-05 15:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Paolo Abeni, netdev

On Mon, 2016-12-05 at 16:37 +0100, Jesper Dangaard Brouer wrote:

> Do you think the splice technique would, have the same performance
> benefit as having a MPMC queue with separate enqueue and dequeue locking?
> (like we have with skb_array/ptr_ring that avoids cache bouncing)?

I believe ring buffers make sense for critical points in the kernel,
but for an arbitrary number of TCP/UDP sockets in a host, they are a big
increase of memory, and a practical problem when SO_RCVBUF is changed,
since dynamic resize of the ring buffer would be needed.

If you think about it, most sockets have few outstanding packets, like
0, 1, 2. But they also might have ~100 packets, sometimes...

For most TCP/UDP sockets, a linked list is simply good enough.
(We only very recently converted the out-of-order receive queue to an
RB tree.)

Now, if _two_ linked lists are also good in the very rare case of floods,
I would use two linked lists, if they can offer us a 50 % increase at
small memory cost.

Then for very special cases, we have af_packet which should be optimized
for all the fancy stuff.

If an application really receives more than 1.5 Mpps per UDP socket,
then the author should seriously consider SO_REUSEPORT, and have more
than 1 vcpu on its VM. I think we have cheap cloud offers available from
many providers.
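
For completeness, a minimal user-space sketch of that SO_REUSEPORT setup
(error handling omitted; each receiving thread or process opens one such
socket, all bound to the same port):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

static int open_reuseport_socket(uint16_t port)
{
	struct sockaddr_in addr;
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* allow several sockets to bind the same UDP port; the kernel
	 * then spreads incoming datagrams across all of them
	 */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	return fd;
}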

The ring buffer queue might make sense in net/core/dev.c, since
we currently have 2 queues per cpu.

So you might want to experiment with that, because it looks like we
might go to a model where a single cpu does all the low-level (busypoll)
RX processing from a single queue per NUMA node, then dispatches the
IP/{TCP|UDP} processing to other cpus.


* [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-05 14:28   ` Eric Dumazet
  2016-12-05 15:37     ` Jesper Dangaard Brouer
@ 2016-12-05 17:57     ` Eric Dumazet
  2016-12-06  9:53       ` Paolo Abeni
                         ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-05 17:57 UTC (permalink / raw)
  To: Paolo Abeni, David Miller; +Cc: netdev, Willem de Bruijn

From: Eric Dumazet <edumazet@google.com>

In UDP recvmsg() path we currently access 3 cache lines from an skb
while holding receive queue lock, plus another one if packet is
dequeued, since we need to change skb->next->prev

1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)

skb->peeked is only needed to make sure 0-length packets are properly
handled while MSG_PEEK is operated.

I had first the intent to remove skb->peeked but the "MSG_PEEK at
non-zero offset" support added by Sam Kumar makes this not possible.

This patch avoids one cache line miss during the locked section, when
skb->len and skb->peeked do not have to be read.

It also avoids the skb_set_peeked() cost for non empty UDP datagrams.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/datagram.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 	if (error)
 		goto no_packet;
 
+	*peeked = 0;
 	do {
 		/* Again only user level code calls this function, so nothing
 		 * interrupt level will suddenly eat the receive_queue.
@@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
 		spin_lock_irqsave(&queue->lock, cpu_flags);
 		skb_queue_walk(queue, skb) {
 			*last = skb;
-			*peeked = skb->peeked;
 			if (flags & MSG_PEEK) {
 				if (_off >= skb->len && (skb->len || _off ||
 							 skb->peeked)) {
 					_off -= skb->len;
 					continue;
 				}
-
-				skb = skb_set_peeked(skb);
-				error = PTR_ERR(skb);
-				if (IS_ERR(skb)) {
-					spin_unlock_irqrestore(&queue->lock,
-							       cpu_flags);
-					goto no_packet;
+				if (!skb->len) {
+					skb = skb_set_peeked(skb);
+					if (IS_ERR(skb)) {
+						error = PTR_ERR(skb);
+						spin_unlock_irqrestore(&queue->lock,
+								       cpu_flags);
+						goto no_packet;
+					}
 				}
-
+				*peeked = 1;
 				atomic_inc(&skb->users);
 			} else {
 				__skb_unlink(skb, queue);


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-05 17:57     ` [PATCH] net/udp: do not touch skb->peeked unless really needed Eric Dumazet
@ 2016-12-06  9:53       ` Paolo Abeni
  2016-12-06 12:10         ` Paolo Abeni
  2016-12-06 14:34         ` Eric Dumazet
  2016-12-06 10:34       ` Paolo Abeni
  2016-12-06 15:42       ` David Miller
  2 siblings, 2 replies; 43+ messages in thread
From: Paolo Abeni @ 2016-12-06  9:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

Hi Eric,

On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
> 
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> 
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
> 
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.

I'm wondering if peeking with offset is going to complicate the 2 queues
patch, too.

> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
> 
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/datagram.c |   19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
>  	if (error)
>  		goto no_packet;
>  
> +	*peeked = 0;
>  	do {
>  		/* Again only user level code calls this function, so nothing
>  		 * interrupt level will suddenly eat the receive_queue.
> @@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
>  		spin_lock_irqsave(&queue->lock, cpu_flags);
>  		skb_queue_walk(queue, skb) {
>  			*last = skb;
> -			*peeked = skb->peeked;
>  			if (flags & MSG_PEEK) {
>  				if (_off >= skb->len && (skb->len || _off ||
>  							 skb->peeked)) {
>  					_off -= skb->len;
>  					continue;
>  				}
> -
> -				skb = skb_set_peeked(skb);
> -				error = PTR_ERR(skb);
> -				if (IS_ERR(skb)) {
> -					spin_unlock_irqrestore(&queue->lock,
> -							       cpu_flags);
> -					goto no_packet;
> +				if (!skb->len) {
> +					skb = skb_set_peeked(skb);
> +					if (IS_ERR(skb)) {
> +						error = PTR_ERR(skb);
> +						spin_unlock_irqrestore(&queue->lock,
> +								       cpu_flags);
> +						goto no_packet;
> +					}
>  				}

I don't understand why we can avoid setting skb->peeked if len > 0. I
think that will change the kernel behavior if:
- peek with offset is set
- 3 skbs with len > 0 are enqueued
- the u/s peeks (with offset) the second one
- the u/s disables peeking with offset and peeks 2 more skbs.

With the current code, in the last step the u/s is going to peek the 1#
and the 3# skbs; after this patch it will peek the 1# and the 2#. Am I
missing something? Probably the new behavior is more correct, but it is
still a change.

I gave this a run in my test bed on top of your udp-related patches; I
see an additional ~3% improvement in the udp flood scenario, and a bit
more in the uncontended scenario.

Thank you,

Paolo


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-05 17:57     ` [PATCH] net/udp: do not touch skb->peeked unless really needed Eric Dumazet
  2016-12-06  9:53       ` Paolo Abeni
@ 2016-12-06 10:34       ` Paolo Abeni
  2016-12-06 17:08         ` Paolo Abeni
  2016-12-06 15:42       ` David Miller
  2 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-06 10:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

Hi Eric,

On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
> 
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> 
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
> 
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.
> 
> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
> 
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Thank you for all the good work.

After all your improvements, I see the cacheline miss in inet_recvmsg()
as a major perf offender for the user space process in the udp flood
scenario, due to skc_rxhash sharing the same cacheline as sk_drops.

Using a udp-specific drop counter (and an sk_drops accessor to wrap
sk_drops access where needed), we could avoid such a cache miss. With that
- patch for udp.h only below - I get a 3% improvement on top of all the
pending udp patches, and the gain should be more relevant after the 2
queues rework. What do you think?

Cheers,

Paolo.
---
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..2bbf5db 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,11 @@ struct udp_sock {
        unsigned int     corkflag;      /* Cork is required */
        __u8             encap_type;    /* Is this an Encapsulation socket? */
        unsigned char    no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-                        no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+                        no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+                        pcflag:6;      /* UDP-Lite specific, moved here to */
+                                       /* fill an hole, marks socket as */
+                                       /* UDP-Lite if > 0    */
+
        /*
         * Following member retains the information to create a UDP header
         * when the socket is uncorked.
@@ -64,8 +68,7 @@ struct udp_sock {
 #define UDPLITE_BIT      0x1           /* set by udplite proto init function */
 #define UDPLITE_SEND_CC  0x2           /* set via udplite setsockopt         */
 #define UDPLITE_RECV_CC  0x4           /* set via udplite setsocktopt        */
-       __u8             pcflag;        /* marks socket as UDP-Lite if > 0    */
-       __u8             unused[3];
+
        /*
         * For encapsulation sockets.
         */
@@ -79,6 +82,9 @@ struct udp_sock {
        int                     (*gro_complete)(struct sock *sk,
                                                struct sk_buff *skb,
                                                int nhoff);
+
+       /* since we are prone to drops, avoid dirtying any sk cacheline */
+       atomic_t                drops ____cacheline_aligned_in_smp;
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06  9:53       ` Paolo Abeni
@ 2016-12-06 12:10         ` Paolo Abeni
  2016-12-06 14:35           ` Eric Dumazet
  2016-12-06 14:34         ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-06 12:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 10:53 +0100, Paolo Abeni wrote:
> On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > In UDP recvmsg() path we currently access 3 cache lines from an skb
> > while holding receive queue lock, plus another one if packet is
> > dequeued, since we need to change skb->next->prev
> > 
> > 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> > 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> > 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> > 
> > skb->peeked is only needed to make sure 0-length packets are properly
> > handled while MSG_PEEK is operated.
> > 
> > I had first the intent to remove skb->peeked but the "MSG_PEEK at
> > non-zero offset" support added by Sam Kumar makes this not possible.
> 
> I'm wondering if peeking with offset is going to complicate the 2 queues
> patch, too.
> 
> > This patch avoids one cache line miss during the locked section, when
> > skb->len and skb->peeked do not have to be read.
> > 
> > It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  net/core/datagram.c |   19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index 49816af8586bb832e806972b486588041a99524c..9482037a5c8c64aec79e42c65bd2691bdd9450a3 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -214,6 +214,7 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> >  	if (error)
> >  		goto no_packet;
> >  
> > +	*peeked = 0;
> >  	do {
> >  		/* Again only user level code calls this function, so nothing
> >  		 * interrupt level will suddenly eat the receive_queue.
> > @@ -227,22 +228,22 @@ struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned int flags,
> >  		spin_lock_irqsave(&queue->lock, cpu_flags);
> >  		skb_queue_walk(queue, skb) {
> >  			*last = skb;
> > -			*peeked = skb->peeked;
> >  			if (flags & MSG_PEEK) {
> >  				if (_off >= skb->len && (skb->len || _off ||
> >  							 skb->peeked)) {
> >  					_off -= skb->len;
> >  					continue;
> >  				}
> > -
> > -				skb = skb_set_peeked(skb);
> > -				error = PTR_ERR(skb);
> > -				if (IS_ERR(skb)) {
> > -					spin_unlock_irqrestore(&queue->lock,
> > -							       cpu_flags);
> > -					goto no_packet;
> > +				if (!skb->len) {
> > +					skb = skb_set_peeked(skb);
> > +					if (IS_ERR(skb)) {
> > +						error = PTR_ERR(skb);
> > +						spin_unlock_irqrestore(&queue->lock,
> > +								       cpu_flags);
> > +						goto no_packet;
> > +					}
> >  				}
> 
> I don't understand why we can avoid setting skb->peek if len > 0. I
> think that will change the kernel behavior if:
> - peek with offset is set
> - 3 skbs with len > 0 are enqueued
> - the u/s peek (with offset) the second one
> - the u/s disable peeking with offset and peeks 2 more skbs.
> 
> With the current code in the last step the u/s is going to peek the 1#
> and the 3# skbs, after this patch will peek the 1# and the 2#. Am I
> missing something ? Probably the new behavior is more correct, but still
> is a change. 

Please ignore the above dumb comment. I misread the 'skip condition'.

I'm fine with the patch in its current form.

Acked-by: Paolo Abeni <pabeni@redhat.com>


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06  9:53       ` Paolo Abeni
  2016-12-06 12:10         ` Paolo Abeni
@ 2016-12-06 14:34         ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-06 14:34 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 10:53 +0100, Paolo Abeni wrote:
> Hi Eric,

> I don't understand why we can avoid setting skb->peek if len > 0. I
> think that will change the kernel behavior if:
> - peek with offset is set
> - 3 skbs with len > 0 are enqueued
> - the u/s peek (with offset) the second one
> - the u/s disable peeking with offset and peeks 2 more skbs.
> 
> With the current code in the last step the u/s is going to peek the 1#
> and the 3# skbs, after this patch will peek the 1# and the 2#. Am I
> missing something ? Probably the new behavior is more correct, but still
> is a change. 
> 
> I gave this a run in my test bed on top of your udp-related patches I
> see additional ~3 improvement in the udp flood scenario, and a bit more
> in the un-contended scenario.
> 
> Thank you,

MSG_PEEK always grabs the first skb in the queue, regardless of its
skb->peeked status.

Unless an offset (given in bytes) is given. Then we skip X bytes in the
queue, again regardless of the skb->peeked bit.

skb->peeked is _needed_ to avoid peeking the same 0-byte skb over and
over, since user land could not skip it, as the offset used to skip skbs
is computed by summing lengths. An infinite loop of recvmsg() would
happen, stuck on the same skb.

For regular non-0-byte payloads, we can skip over them without even
looking at skb->peeked.

Initially, skb->peeked was only a (bad) way to make sure UDP would
increment its stats only for non-peeked messages. We can implement that
using the MSG_PEEK flag.
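
From user space the two peek modes look like this (a sketch; fd is an
already bound UDP socket, and buf is assumed large enough to hold a
whole datagram):

	char buf[2048];
	int off = 0;

	/* plain MSG_PEEK: always looks at the head of the queue */
	recv(fd, buf, sizeof(buf), MSG_PEEK);

	/* enable "peek with offset": each subsequent MSG_PEEK advances
	 * the byte offset, so successive peeks walk through the queue
	 */
	setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
	recv(fd, buf, sizeof(buf), MSG_PEEK);	/* datagram #1 */
	recv(fd, buf, sizeof(buf), MSG_PEEK);	/* datagram #2 */

	/* a regular read still dequeues from the head of the queue */
	recv(fd, buf, sizeof(buf), 0);

A 0-byte datagram is exactly the case where the byte offset cannot
advance, which is why skb->peeked is still needed there.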


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 12:10         ` Paolo Abeni
@ 2016-12-06 14:35           ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-06 14:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 13:10 +0100, Paolo Abeni wrote:

> Please ignore the above dumb comment. I misread the 'skip condition'.
> 
> I'm fine with the patch in its current form.
> 
> Acked-by: Paolo Abeni <pabeni@redhat.com>

No worries, I prefer having multiple eyes on this stuff before doing the
next step ;)

Thanks !


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-05 17:57     ` [PATCH] net/udp: do not touch skb->peeked unless really needed Eric Dumazet
  2016-12-06  9:53       ` Paolo Abeni
  2016-12-06 10:34       ` Paolo Abeni
@ 2016-12-06 15:42       ` David Miller
  2 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2016-12-06 15:42 UTC (permalink / raw)
  To: eric.dumazet; +Cc: pabeni, netdev, willemb

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 05 Dec 2016 09:57:19 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> In UDP recvmsg() path we currently access 3 cache lines from an skb
> while holding receive queue lock, plus another one if packet is
> dequeued, since we need to change skb->next->prev
> 
> 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> 
> skb->peeked is only needed to make sure 0-length packets are properly
> handled while MSG_PEEK is operated.
> 
> I had first the intent to remove skb->peeked but the "MSG_PEEK at
> non-zero offset" support added by Sam Kumar makes this not possible.
> 
> This patch avoids one cache line miss during the locked section, when
> skb->len and skb->peeked do not have to be read.
> 
> It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 10:34       ` Paolo Abeni
@ 2016-12-06 17:08         ` Paolo Abeni
  2016-12-06 17:47           ` Eric Dumazet
  2016-12-07 17:09           ` [PATCH] net/udp: do not touch skb->peeked unless really needed David Laight
  0 siblings, 2 replies; 43+ messages in thread
From: Paolo Abeni @ 2016-12-06 17:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 11:34 +0100, Paolo Abeni wrote:
> On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > In UDP recvmsg() path we currently access 3 cache lines from an skb
> > while holding receive queue lock, plus another one if packet is
> > dequeued, since we need to change skb->next->prev
> > 
> > 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> > 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> > 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> > 
> > skb->peeked is only needed to make sure 0-length packets are properly
> > handled while MSG_PEEK is operated.
> > 
> > I had first the intent to remove skb->peeked but the "MSG_PEEK at
> > non-zero offset" support added by Sam Kumar makes this not possible.
> > 
> > This patch avoids one cache line miss during the locked section, when
> > skb->len and skb->peeked do not have to be read.
> > 
> > It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> Thank you for all the good work.
> 
> After all your improvement, I see the cacheline miss in inet_recvmsg()
> as a major perf offender for the user space process in the udp flood
> scenario due to skc_rxhash sharing the same sk_drops cacheline.
> 
> Using an udp-specific drop counter (and an sk_drops accessor to wrap
> sk_drops access where needed), we could avoid such cache miss. With that
> - patch for udp.h only below - I get 3% improvement on top of all the
> pending udp patches, and the gain should be more relevant after the 2
> queues rework. What do you think ?

Here follows what I'm experimenting with.

The 'pcflag' change is not strictly needed, but it shrinks the udp_sock
struct a bit, so that the newly added cacheline does not create
additional holes - with my kconfig, at least. I can use a separate patch
for that chunk.
---
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd..a21baaf 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,11 @@ struct udp_sock {
 	unsigned int	 corkflag;	/* Cork is required */
 	__u8		 encap_type;	/* Is this an Encapsulation socket? */
 	unsigned char	 no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-			 no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+			 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+			 pcflag:6;	/* UDP-Lite specific, moved here to */
+					/* fill an hole, marks socket as */
+					/* UDP-Lite if > 0    */
+
 	/*
 	 * Following member retains the information to create a UDP header
 	 * when the socket is uncorked.
@@ -64,8 +68,7 @@ struct udp_sock {
 #define UDPLITE_BIT      0x1  		/* set by udplite proto init function */
 #define UDPLITE_SEND_CC  0x2  		/* set via udplite setsockopt         */
 #define UDPLITE_RECV_CC  0x4		/* set via udplite setsocktopt        */
-	__u8		 pcflag;        /* marks socket as UDP-Lite if > 0    */
-	__u8		 unused[3];
+
 	/*
 	 * For encapsulation sockets.
 	 */
@@ -79,6 +82,9 @@ struct udp_sock {
 	int			(*gro_complete)(struct sock *sk,
 						struct sk_buff *skb,
 						int nhoff);
+
+	/* since we are prone to drops, avoid dirtying any sk cacheline */
+	atomic_t		drops ____cacheline_aligned_in_smp;
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)
@@ -106,6 +112,17 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
 	return udp_sk(sk)->no_check6_rx;
 }
 
+static inline int udp_drops_read(const struct sock *sk)
+{
+	return atomic_read(&udp_sk(sk)->drops);
+}
+
+static inline void
+udp_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
+{
+	SOCK_SKB_CB(skb)->dropcount = udp_drops_read(sk);
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
 	hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/include/net/sock.h b/include/net/sock.h
index ed75dec..113e495 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2138,6 +2138,8 @@ struct sock_skb_cb {
 	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+int sk_drops_read(const struct sock *sk);
+
 static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
 {
 	int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 6b10573..dc41727 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -9,6 +9,7 @@
 #include <net/sock.h>
 #include <linux/kernel.h>
 #include <linux/tcp.h>
+#include <linux/udp.h>
 #include <linux/workqueue.h>
 
 #include <linux/inet_diag.h>
@@ -19,6 +20,14 @@
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
 
+int sk_drops_read(const struct sock *sk)
+{
+	if (sk->sk_protocol == IPPROTO_UDP)
+		return udp_drops_read(sk);
+	else
+		return atomic_read(&sk->sk_drops);
+}
+
 static u64 sock_gen_cookie(struct sock *sk)
 {
 	while (1) {
@@ -67,7 +76,7 @@ int sock_diag_put_meminfo(struct sock *sk, struct sk_buff *skb, int attrtype)
 	mem[SK_MEMINFO_WMEM_QUEUED] = sk->sk_wmem_queued;
 	mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc);
 	mem[SK_MEMINFO_BACKLOG] = sk->sk_backlog.len;
-	mem[SK_MEMINFO_DROPS] = atomic_read(&sk->sk_drops);
+	mem[SK_MEMINFO_DROPS] = sk_drops_read(sk);
 
 	return nla_put(skb, attrtype, sizeof(mem), &mem);
 }
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index fbd6b69..d7c4980 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1174,6 +1174,11 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 	return ret;
 }
 
+static void udp_drops_inc(struct sock *sk)
+{
+	atomic_inc(&udp_sk(sk)->drops);
+}
+
 /* fully reclaim rmem/fwd memory allocated for skb */
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
@@ -1244,7 +1249,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	/* no need to setup a destructor, we will explicitly release the
 	 * forward allocated memory on dequeue
 	 */
-	sock_skb_set_dropcount(sk, skb);
+	udp_skb_set_dropcount(sk, skb);
 
 	__skb_queue_tail(list, skb);
 	spin_unlock(&list->lock);
@@ -1258,7 +1263,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
 
 drop:
-	atomic_inc(&sk->sk_drops);
+	udp_drops_inc(sk);
 	return err;
 }
 EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
@@ -1319,7 +1324,7 @@ static int first_packet_length(struct sock *sk)
 				IS_UDPLITE(sk));
 		__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
 				IS_UDPLITE(sk));
-		atomic_inc(&sk->sk_drops);
+		udp_drops_inc(sk);
 		__skb_unlink(skb, rcvq);
 		total += skb->truesize;
 		kfree_skb(skb);
@@ -1417,7 +1422,7 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int noblock,
 
 	if (unlikely(err)) {
 		if (!peeked) {
-			atomic_inc(&sk->sk_drops);
+			udp_drops_inc(sk);
 			UDP_INC_STATS(sock_net(sk),
 				      UDP_MIB_INERRORS, is_udplite);
 		}
@@ -1714,7 +1719,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_CSUMERRORS, is_udplite);
 drop:
 	__UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	atomic_inc(&sk->sk_drops);
+	udp_drops_inc(sk);
 	kfree_skb(skb);
 	return -1;
 }
@@ -1772,7 +1777,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 		nskb = skb_clone(skb, GFP_ATOMIC);
 
 		if (unlikely(!nskb)) {
-			atomic_inc(&sk->sk_drops);
+			udp_drops_inc(sk);
 			__UDP_INC_STATS(net, UDP_MIB_RCVBUFERRORS,
 					IS_UDPLITE(sk));
 			__UDP_INC_STATS(net, UDP_MIB_INERRORS,
@@ -2491,7 +2496,7 @@ static void udp4_format_sock(struct sock *sp, struct seq_file *f,
 		from_kuid_munged(seq_user_ns(f), sock_i_uid(sp)),
 		0, sock_i_ino(sp),
 		atomic_read(&sp->sk_refcnt), sp,
-		atomic_read(&sp->sk_drops));
+		udp_drops_read(sp));
 }
 
 int udp4_seq_show(struct seq_file *seq, void *v)
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index c5d76d2..9f46dff 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -1038,5 +1038,5 @@ void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp,
 		   0,
 		   sock_i_ino(sp),
 		   atomic_read(&sp->sk_refcnt), sp,
-		   atomic_read(&sp->sk_drops));
+		   sk_drops_read(sp));
 }


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 17:08         ` Paolo Abeni
@ 2016-12-06 17:47           ` Eric Dumazet
  2016-12-06 18:31             ` Paolo Abeni
  2016-12-07 17:09           ` [PATCH] net/udp: do not touch skb->peeked unless really needed David Laight
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-06 17:47 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 18:08 +0100, Paolo Abeni wrote:
> On Tue, 2016-12-06 at 11:34 +0100, Paolo Abeni wrote:
> > On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > In UDP recvmsg() path we currently access 3 cache lines from an skb
> > > while holding receive queue lock, plus another one if packet is
> > > dequeued, since we need to change skb->next->prev
> > > 
> > > 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> > > 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> > > 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> > > 
> > > skb->peeked is only needed to make sure 0-length packets are properly
> > > handled while MSG_PEEK is operated.
> > > 
> > > I had first the intent to remove skb->peeked but the "MSG_PEEK at
> > > non-zero offset" support added by Sam Kumar makes this not possible.
> > > 
> > > This patch avoids one cache line miss during the locked section, when
> > > skb->len and skb->peeked do not have to be read.
> > > 
> > > It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> > > 
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > 
> > Thank you for all the good work.
> > 
> > After all your improvement, I see the cacheline miss in inet_recvmsg()
> > as a major perf offender for the user space process in the udp flood
> > scenario due to skc_rxhash sharing the same sk_drops cacheline.
> > 
> > Using an udp-specific drop counter (and an sk_drops accessor to wrap
> > sk_drops access where needed), we could avoid such cache miss. With that
> > - patch for udp.h only below - I get 3% improvement on top of all the
> > pending udp patches, and the gain should be more relevant after the 2
> > queues rework. What do you think ?
> 
> Here follow what I'm experimenting. 

Well, the new socket layout makes this kind of patch not really needed?

	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	socket_lock_t              sk_lock;              /*  0x88  0x20 */
	atomic_t                   sk_drops;             /*  0xa8   0x4 */
	int                        sk_rcvlowat;          /*  0xac   0x4 */
	struct sk_buff_head        sk_error_queue;       /*  0xb0  0x18 */
	/* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */

Nothing in the UDP fast path needs to access a field from this cache line, except sk_drops.

So wherever we put sk_drops (even in a cache line of its own), we will still have
false sharing on it.

And really this should be fine.
Eventually we could have a per-NUMA-node counter to help a bit.

I mentioned at some point that we can very easily instruct
sock_skb_set_dropcount() to not read sk_drops if the application
does not care about getting sk_drops ;)

https://patchwork.kernel.org/patch/9405677/

Now sk_drops was moved, the plan is to submit this patch in an official way.
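
The idea, roughly (a sketch based on the description above, not
necessarily the exact code behind the patchwork link): only pay for the
sk_drops read when the application asked for drop counts via SO_RXQ_OVFL.

static inline void sock_skb_set_dropcount(const struct sock *sk,
					  struct sk_buff *skb)
{
	/* skip the (possibly contended) sk_drops cache line unless the
	 * application requested SO_RXQ_OVFL ancillary data
	 */
	SOCK_SKB_CB(skb)->dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ?
					atomic_read(&sk->sk_drops) : 0;
}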

Thanks !

 


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 17:47           ` Eric Dumazet
@ 2016-12-06 18:31             ` Paolo Abeni
  2016-12-06 18:58               ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-06 18:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 09:47 -0800, Eric Dumazet wrote:
> On Tue, 2016-12-06 at 18:08 +0100, Paolo Abeni wrote:
> > On Tue, 2016-12-06 at 11:34 +0100, Paolo Abeni wrote:
> > > On Mon, 2016-12-05 at 09:57 -0800, Eric Dumazet wrote:
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > 
> > > > In UDP recvmsg() path we currently access 3 cache lines from an skb
> > > > while holding receive queue lock, plus another one if packet is
> > > > dequeued, since we need to change skb->next->prev
> > > > 
> > > > 1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
> > > > 2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
> > > > 3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)
> > > > 
> > > > skb->peeked is only needed to make sure 0-length packets are properly
> > > > handled while MSG_PEEK is operated.
> > > > 
> > > > I had first the intent to remove skb->peeked but the "MSG_PEEK at
> > > > non-zero offset" support added by Sam Kumar makes this not possible.
> > > > 
> > > > This patch avoids one cache line miss during the locked section, when
> > > > skb->len and skb->peeked do not have to be read.
> > > > 
> > > > It also avoids the skb_set_peeked() cost for non empty UDP datagrams.
> > > > 
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > 
> > > Thank you for all the good work.
> > > 
> > > After all your improvement, I see the cacheline miss in inet_recvmsg()
> > > as a major perf offender for the user space process in the udp flood
> > > scenario due to skc_rxhash sharing the same sk_drops cacheline.
> > > 
> > > Using an udp-specific drop counter (and an sk_drops accessor to wrap
> > > sk_drops access where needed), we could avoid such cache miss. With that
> > > - patch for udp.h only below - I get 3% improvement on top of all the
> > > pending udp patches, and the gain should be more relevant after the 2
> > > queues rework. What do you think ?
> > 
> > Here follow what I'm experimenting. 
> 
> Well, new socket layout makes this kind of patches not really needed ?
> 
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	socket_lock_t              sk_lock;              /*  0x88  0x20 */
> 	atomic_t                   sk_drops;             /*  0xa8   0x4 */
> 	int                        sk_rcvlowat;          /*  0xac   0x4 */
> 	struct sk_buff_head        sk_error_queue;       /*  0xb0  0x18 */
> 	/* --- cacheline 3 boundary (192 bytes) was 8 bytes ago --- */

cacheline 2 boundary (128 bytes) is 8 bytes before sk_lock: cacheline 2
also includes skc_refcnt and skc_rxhash from __sk_common (I use 'pahole
-E ...' to get the full-blown output). skc_rxhash is read for each
packet in inet_recvmsg()/sock_rps_record_flow() if CONFIG_RPS is set. I
get a cache miss per packet there, and inet_recvmsg() in my test takes
about 8% of the whole u/s processing time.

> I mentioned at some point that we can very easily instruct 
> sock_skb_set_dropcount() to not read sk_drops if application
> does not care about getting sk_drops ;)
> 
> https://patchwork.kernel.org/patch/9405677/
> 
> Now sk_drops was moved, the plan is to submit this patch in an official way.

Looking forward to that!

Thank you,

Paolo


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 18:31             ` Paolo Abeni
@ 2016-12-06 18:58               ` Eric Dumazet
  2016-12-06 19:16                 ` Paolo Abeni
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-06 18:58 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 19:31 +0100, Paolo Abeni wrote:

> cacheline 2 boundary (128 bytes) is 8 bytes before sk_lock: cacheline 2
> includes also skc_refcnt and skc_rxhash from __sk_common (I use 'pahole
> -E ...' to get the full blown output). skc_rxhash is read for each
> packet in inet_recvmsg()/sock_rps_record_flow() if CONFIG_RPS is set. I
> get a cache miss per packet there and inet_recvmsg() in my test takes
> about 8% of the whole u/s processing time.



Wait a minute, this sk->sk_rxhash should only be read on a connected
socket. Relying on it being 0 was okay only if we did not care
about false sharing. And UDP sockets used to grab the socket refcount,
so we had false sharing a _lot_ in the past.

We must fix this if it is not already done properly.

Can you take care of this problem ?

Thanks !


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 18:58               ` Eric Dumazet
@ 2016-12-06 19:16                 ` Paolo Abeni
  2016-12-06 19:35                   ` Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-06 19:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 10:58 -0800, Eric Dumazet wrote:
> On Tue, 2016-12-06 at 19:31 +0100, Paolo Abeni wrote:
> 
> > cacheline 2 boundary (128 bytes) is 8 bytes before sk_lock: cacheline 2
> > includes also skc_refcnt and skc_rxhash from __sk_common (I use 'pahole
> > -E ...' to get the full blown output). skc_rxhash is read for each
> > packet in inet_recvmsg()/sock_rps_record_flow() if CONFIG_RPS is set. I
> > get a cache miss per packet there and inet_recvmsg() in my test takes
> > about 8% of the whole u/s processing time.
> 
> Wait a minute, this sk->sk_rxhash should only be read on connected
> socket. Relying on it being 0 was okay only if we did not care
> of false sharing. And UDP sockets used to grab socket refcount, so we
> had false sharing a _lot_ in the past.

Thank you for the pointer.

> We must fix this if not already done properly.
> 
> Can you take care of this problem ?

I'll try, but it can't be very soon: I'll have limited time and a bad
internet connection up to next week.

Cheers,

Paolo


* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 19:16                 ` Paolo Abeni
@ 2016-12-06 19:35                   ` Eric Dumazet
  2016-12-07  3:32                     ` [PATCH net-next] net: sock_rps_record_flow() is for connected sockets Eric Dumazet
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-06 19:35 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 20:16 +0100, Paolo Abeni wrote:
> On Tue, 2016-12-06 at 10:58 -0800, Eric Dumazet wrote:
> > On Tue, 2016-12-06 at 19:31 +0100, Paolo Abeni wrote:
> > 
> > > cacheline 2 boundary (128 bytes) is 8 bytes before sk_lock: cacheline 2
> > > includes also skc_refcnt and skc_rxhash from __sk_common (I use 'pahole
> > > -E ...' to get the full blown output). skc_rxhash is read for each
> > > packet in inet_recvmsg()/sock_rps_record_flow() if CONFIG_RPS is set. I
> > > get a cache miss per packet there and inet_recvmsg() in my test takes
> > > about 8% of the whole u/s processing time.
> > 
> > Wait a minute, this sk->sk_rxhash should only be read on connected
> > socket. Relying on it being 0 was okay only if we did not care
> > of false sharing. And UDP sockets used to grab socket refcount, so we
> > had false sharing a _lot_ in the past.
> 
> Thank you for the pointer.
> 
> > We must fix this if not already done properly.
> > 
> > Can you take care of this problem ?
> 
> I'll try, but it can't be very soon: I'll have limited time and a bad
> internet connection up to next week.

Do not worry, I had a better idea anyway. I am testing it ;)


* [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-06 19:35                   ` Eric Dumazet
@ 2016-12-07  3:32                     ` Eric Dumazet
  2016-12-07  6:47                       ` Eric Dumazet
                                         ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07  3:32 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

From: Eric Dumazet <edumazet@google.com>

Paolo noticed a cache line miss in UDP recvmsg() to access
sk_rxhash, sharing a cache line with sk_drops.

sk_drops might be heavily incremented by cpus handling a flood targeting
this socket.

We might place sk_drops on a separate cache line, but lets try
to avoid wasting 64 bytes per socket just for this, since we have
other bottlenecks to take care of.

sock_rps_record_flow() should only access sk_rxhash for connected
flows.

Testing sk_state for TCP_ESTABLISHED covers most of the cases for
connected sockets, for a zero cost, since system calls using
sock_rps_record_flow() also access sk->sk_prot which is on the
same cache line.

A follow up patch will provide a static_key (Jump Label) since most
hosts do not even use RFS.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sock.h |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 6dfe3aa22b970eecfab4d4a0753804b1cc82a200..a7ddab993b496f1f4060f0b41831a161c284df9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -913,7 +913,17 @@ static inline void sock_rps_record_flow_hash(__u32 hash)
 static inline void sock_rps_record_flow(const struct sock *sk)
 {
 #ifdef CONFIG_RPS
-	sock_rps_record_flow_hash(sk->sk_rxhash);
+	/* Reading sk->sk_rxhash might incur an expensive cache line miss.
+	 *
+	 * TCP_ESTABLISHED does cover almost all states where RFS
+	 * might be useful, and is cheaper [1] than testing :
+	 *	IPv4: inet_sk(sk)->inet_daddr
+	 * 	IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
+	 * OR	an additional socket flag
+	 * [1] : sk_state and sk_prot are in the same cache line.
+	 */
+	if (sk->sk_state == TCP_ESTABLISHED)
+		sock_rps_record_flow_hash(sk->sk_rxhash);
 #endif
 }
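
For context, the follow-up static_key mentioned in the changelog could
look roughly like this (a sketch; the key name is hypothetical and the
code that enables it when RFS is configured is left out):

#ifdef CONFIG_RPS
extern struct static_key rfs_needed;	/* hypothetical, off by default */
#endif

static inline void sock_rps_record_flow(const struct sock *sk)
{
#ifdef CONFIG_RPS
	/* most hosts never configure RFS: skip even the sk_state test */
	if (static_key_false(&rfs_needed)) {
		if (sk->sk_state == TCP_ESTABLISHED)
			sock_rps_record_flow_hash(sk->sk_rxhash);
	}
#endif
}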
 


* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  3:32                     ` [PATCH net-next] net: sock_rps_record_flow() is for connected sockets Eric Dumazet
@ 2016-12-07  6:47                       ` Eric Dumazet
  2016-12-07  7:57                         ` Paolo Abeni
  2016-12-08 17:49                         ` Tom Herbert
  2016-12-07  7:59                       ` Paolo Abeni
  2016-12-07 15:47                       ` David Miller
  2 siblings, 2 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07  6:47 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> A follow up patch will provide a static_key (Jump Label) since most
> hosts do not even use RFS.

Speaking of static_key, it appears we now have GRO on UDP, and this
consumes a considerable amount of cpu cycles.

Turning off GRO allows me to get +20 % more packets on my single UDP
socket. (1.2 Mpps instead of 1.0 Mpps)

Surely udp_gro_receive() should be bypassed if no UDP socket has
registered a udp_sk(sk)->gro_receive handler.

And/or delay the inet_add_offload(&udpv{4|6}_offload, IPPROTO_UDP); to
the first UDP socket setting a udp_sk(sk)->gro_receive handler,
i.e. udp_encap_enable() and udpv6_encap_enable().


:(
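
One possible shape for the first suggestion (purely illustrative; the
key name and the check placement are assumptions, and the real GRO
handler body is elided):

static DEFINE_STATIC_KEY_FALSE(udp_gro_needed);

void udp_gro_enable(void)
{
	/* called when a socket installs a udp_sk(sk)->gro_receive handler */
	static_branch_enable(&udp_gro_needed);
}

static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
					 struct sk_buff *skb)
{
	/* no encap socket exists: tell GRO to flush right away and skip
	 * the per-packet work entirely
	 */
	if (!static_branch_unlikely(&udp_gro_needed)) {
		NAPI_GRO_CB(skb)->flush = 1;
		return NULL;
	}

	/* ... existing udp_gro_receive() path ... */
	return NULL;
}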


* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  6:47                       ` Eric Dumazet
@ 2016-12-07  7:57                         ` Paolo Abeni
  2016-12-07 14:26                           ` Eric Dumazet
                                             ` (2 more replies)
  2016-12-08 17:49                         ` Tom Herbert
  1 sibling, 3 replies; 43+ messages in thread
From: Paolo Abeni @ 2016-12-07  7:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Tue, 2016-12-06 at 22:47 -0800, Eric Dumazet wrote:
> On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> > A follow up patch will provide a static_key (Jump Label) since most
> > hosts do not even use RFS.
> 
> Speaking of static_key, it appears we now have GRO on UDP, and this
> consumes a considerable amount of cpu cycles.
> 
> Turning off GRO allows me to get +20 % more packets on my single UDP
> socket. (1.2 Mpps instead of 1.0 Mpps)

I also see an improvement for single-flow tests when disabling GRO, but
on a smaller scale (~5% if I recall correctly).

> Surely udp_gro_receive() should be bypassed if no UDP socket has
> registered a udp_sk(sk)->gro_receive handler 
> 
> And/or delay the inet_add_offload(&udpv{4|6}_offload, IPPROTO_UDP); to
> the first UDP sockets setting udp_sk(sk)->gro_receive handler,
> ie udp_encap_enable() and udpv6_encap_enable()

I had some patches adding explicit static keys for udp_gro_receive, but
they were ugly and I did not get that much gain (I measured ~1-2%
skipping udp_gro_receive only). I can try to refresh them anyway.

We have some experimental patches implementing GRO for plain connected
UDP sockets, using frag_list to preserve the individual skb lengths
and deliver the packets to user space individually. With that I got
~3 Mpps with a single queue/user-space sink - before the recent udp
improvements. I would like to present these patches on netdev soon (no
sooner than next week, anyway).

Cheers,

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  3:32                     ` [PATCH net-next] net: sock_rps_record_flow() is for connected sockets Eric Dumazet
  2016-12-07  6:47                       ` Eric Dumazet
@ 2016-12-07  7:59                       ` Paolo Abeni
  2016-12-07 13:58                         ` Eric Dumazet
  2016-12-07 15:47                       ` David Miller
  2 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-07  7:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn

On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Paolo noticed a cache line miss in UDP recvmsg() to access
> sk_rxhash, sharing a cache line with sk_drops.
> 
> sk_drops might be heavily incremented by cpus handling a flood targeting
> this socket.
> 
> We might place sk_drops on a separate cache line, but lets try
> to avoid wasting 64 bytes per socket just for this, since we have
> other bottlenecks to take care of.
> 
> sock_rps_record_flow() should only access sk_rxhash for connected
> flows.
> 
> Testing sk_state for TCP_ESTABLISHED covers most of the cases for
> connected sockets, for a zero cost, since system calls using
> sock_rps_record_flow() also access sk->sk_prot which is on the
> same cache line.
> 
> A follow up patch will provide a static_key (Jump Label) since most
> hosts do not even use RFS.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> ---
>  include/net/sock.h |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 6dfe3aa22b970eecfab4d4a0753804b1cc82a200..a7ddab993b496f1f4060f0b41831a161c284df9e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -913,7 +913,17 @@ static inline void sock_rps_record_flow_hash(__u32 hash)
>  static inline void sock_rps_record_flow(const struct sock *sk)
>  {
>  #ifdef CONFIG_RPS
> -	sock_rps_record_flow_hash(sk->sk_rxhash);
> +	/* Reading sk->sk_rxhash might incur an expensive cache line miss.
> +	 *
> +	 * TCP_ESTABLISHED does cover almost all states where RFS
> +	 * might be useful, and is cheaper [1] than testing :
> +	 *	IPv4: inet_sk(sk)->inet_daddr
> +	 * 	IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
> +	 * OR	an additional socket flag
> +	 * [1] : sk_state and sk_prot are in the same cache line.
> +	 */
> +	if (sk->sk_state == TCP_ESTABLISHED)
> +		sock_rps_record_flow_hash(sk->sk_rxhash);
>  #endif
>  }

Thank you for the very prompt patch!

You made me curious about your other idea on this topic; is this what
you initially talked about?

LGTM.

Acked-by: Paolo Abeni <pabeni@redhat.com>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  7:59                       ` Paolo Abeni
@ 2016-12-07 13:58                         ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 13:58 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn

On Wed, 2016-12-07 at 08:59 +0100, Paolo Abeni wrote:
> On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Paolo noticed a cache line miss in UDP recvmsg() to access
> > sk_rxhash, sharing a cache line with sk_drops.
> > 
> > sk_drops might be heavily incremented by cpus handling a flood targeting
> > this socket.
> > 
> > We might place sk_drops on a separate cache line, but lets try
> > to avoid wasting 64 bytes per socket just for this, since we have
> > other bottlenecks to take care of.
> > 
> > sock_rps_record_flow() should only access sk_rxhash for connected
> > flows.
> > 
> > Testing sk_state for TCP_ESTABLISHED covers most of the cases for
> > connected sockets, for a zero cost, since system calls using
> > sock_rps_record_flow() also access sk->sk_prot which is on the
> > same cache line.
> > 
> > A follow up patch will provide a static_key (Jump Label) since most
> > hosts do not even use RFS.
> > 
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Reported-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> >  include/net/sock.h |   12 +++++++++++-
> >  1 file changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 6dfe3aa22b970eecfab4d4a0753804b1cc82a200..a7ddab993b496f1f4060f0b41831a161c284df9e 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -913,7 +913,17 @@ static inline void sock_rps_record_flow_hash(__u32 hash)
> >  static inline void sock_rps_record_flow(const struct sock *sk)
> >  {
> >  #ifdef CONFIG_RPS
> > -	sock_rps_record_flow_hash(sk->sk_rxhash);
> > +	/* Reading sk->sk_rxhash might incur an expensive cache line miss.
> > +	 *
> > +	 * TCP_ESTABLISHED does cover almost all states where RFS
> > +	 * might be useful, and is cheaper [1] than testing :
> > +	 *	IPv4: inet_sk(sk)->inet_daddr
> > +	 * 	IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
> > +	 * OR	an additional socket flag
> > +	 * [1] : sk_state and sk_prot are in the same cache line.
> > +	 */
> > +	if (sk->sk_state == TCP_ESTABLISHED)
> > +		sock_rps_record_flow_hash(sk->sk_rxhash);
> >  #endif
> >  }
> 
> Thank you for the very prompt patch!
> 
> You made me curious about your other idea on this topic, this what you
> initially talked about, right ?

That was what I first thought, but then having an rfs_key would also
help here. I will submit the patch on top of this one, so that final
code looks like :

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1ff5ea6e12214db818c2cfa8a9b8ed5cbddc307c..994f7423a74bd622884c3b646f4123d28697b8ad 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -192,6 +192,7 @@ struct net_device_stats {
 #ifdef CONFIG_RPS
 #include <linux/static_key.h>
 extern struct static_key rps_needed;
+extern struct static_key rfs_needed;
 #endif
 
 struct neighbour;
diff --git a/include/net/sock.h b/include/net/sock.h
index 6dfe3aa22b970eecfab4d4a0753804b1cc82a200..f1f6a07da06facb308a5c76e4de21b804d25ec53 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -913,7 +913,19 @@ static inline void sock_rps_record_flow_hash(__u32 hash)
 static inline void sock_rps_record_flow(const struct sock *sk)
 {
 #ifdef CONFIG_RPS
-	sock_rps_record_flow_hash(sk->sk_rxhash);
+	if (static_key_false(&rfs_needed)) {
+		/* Reading sk->sk_rxhash might incur an expensive cache line miss.
+		 *
+		 * TCP_ESTABLISHED does not cover all states where RFS
+		 * might be useful, but looks cheaper [1] than testing :
+		 *	IPv4: inet_sk(sk)->inet_daddr
+		 * 	IPv6: ipv6_addr_any(&sk->sk_v6_daddr)
+		 * OR	an additional socket flag
+		 * [1] : sk_state and sk_prot are in the same cache line.
+		 */
+		if (sk->sk_state == TCP_ESTABLISHED)
+			sock_rps_record_flow_hash(sk->sk_rxhash);
+	}
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index bffb5253e77867b1d6a0ada7cc99f4605e03ad28..1d33ce03365f1e10996ad5274e86bf351a526284 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3447,6 +3447,8 @@ EXPORT_SYMBOL(rps_cpu_mask);
 
 struct static_key rps_needed __read_mostly;
 EXPORT_SYMBOL(rps_needed);
+struct static_key rfs_needed __read_mostly;
+EXPORT_SYMBOL(rfs_needed);
 
 static struct rps_dev_flow *
 set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 0df2aa6525308a365d89f57f6da76d57a24238f0..2a46e4009f62d8c2ac8949789ae9626b0c016a11 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -79,10 +79,13 @@ static int rps_sock_flow_sysctl(struct ctl_table *table, int write,
 
 		if (sock_table != orig_sock_table) {
 			rcu_assign_pointer(rps_sock_flow_table, sock_table);
-			if (sock_table)
+			if (sock_table) {
 				static_key_slow_inc(&rps_needed);
+				static_key_slow_inc(&rfs_needed);
+			}
 			if (orig_sock_table) {
 				static_key_slow_dec(&rps_needed);
+				static_key_slow_dec(&rfs_needed);
 				synchronize_rcu();
 				vfree(orig_sock_table);
 			}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  7:57                         ` Paolo Abeni
@ 2016-12-07 14:26                           ` Eric Dumazet
  2016-12-08 17:49                             ` Paolo Abeni
  2016-12-07 14:29                           ` Eric Dumazet
  2016-12-08 19:20                           ` Edward Cree
  2 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 14:26 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Wed, 2016-12-07 at 08:57 +0100, Paolo Abeni wrote:
> On Tue, 2016-12-06 at 22:47 -0800, Eric Dumazet wrote:
> > On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> > > A follow up patch will provide a static_key (Jump Label) since most
> > > hosts do not even use RFS.
> > 
> > Speaking of static_key, it appears we now have GRO on UDP, and this
> > consumes a considerable amount of cpu cycles.
> > 
> > Turning off GRO allows me to get +20 % more packets on my single UDP
> > socket. (1.2 Mpps instead of 1.0 Mpps)
> 
> I see also an improvement for single flow tests disabling GRO, but on a
> smaller scale (~5% if I recall correctly).

Was it on a NUMA host ?

My tests are spreading packets on 8 RX queues, 50/50 split on two NUMA
nodes.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  7:57                         ` Paolo Abeni
  2016-12-07 14:26                           ` Eric Dumazet
@ 2016-12-07 14:29                           ` Eric Dumazet
  2016-12-07 15:59                             ` Eric Dumazet
  2016-12-08 18:50                             ` Paolo Abeni
  2016-12-08 19:20                           ` Edward Cree
  2 siblings, 2 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 14:29 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Wed, 2016-12-07 at 08:57 +0100, Paolo Abeni wrote:

> We have some experimental patches to implement GRO for plain UDP
> connected sockets, using frag_list to preserve the individual skb len,
> and deliver the packet to user space individually. With that I got
> ~3mpps with a single queue/user space sink - before the recent udp
> improvements. I would like to present these patches on netdev soon (no
> sooner than next week, anyway).
> 

Make sure you properly handle all netfilter helpers :(

Keeping frag_list means you keep one sk_buff per segment, so it really
looks like a legacy UDP server (like a DNS server) won't benefit from
this anyway.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  3:32                     ` [PATCH net-next] net: sock_rps_record_flow() is for connected sockets Eric Dumazet
  2016-12-07  6:47                       ` Eric Dumazet
  2016-12-07  7:59                       ` Paolo Abeni
@ 2016-12-07 15:47                       ` David Miller
  2 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2016-12-07 15:47 UTC (permalink / raw)
  To: eric.dumazet; +Cc: pabeni, netdev, willemb

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 06 Dec 2016 19:32:50 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Paolo noticed a cache line miss in UDP recvmsg() to access
> sk_rxhash, sharing a cache line with sk_drops.
> 
> sk_drops might be heavily incremented by cpus handling a flood targeting
> this socket.
> 
> We might place sk_drops on a separate cache line, but lets try
> to avoid wasting 64 bytes per socket just for this, since we have
> other bottlenecks to take care of.
> 
> sock_rps_record_flow() should only access sk_rxhash for connected
> flows.
> 
> Testing sk_state for TCP_ESTABLISHED covers most of the cases for
> connected sockets, for a zero cost, since system calls using
> sock_rps_record_flow() also access sk->sk_prot which is on the
> same cache line.
> 
> A follow up patch will provide a static_key (Jump Label) since most
> hosts do not even use RFS.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Paolo Abeni <pabeni@redhat.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07 14:29                           ` Eric Dumazet
@ 2016-12-07 15:59                             ` Eric Dumazet
  2016-12-08 18:50                             ` Paolo Abeni
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 15:59 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Wed, 2016-12-07 at 06:29 -0800, Eric Dumazet wrote:

> Keeping frag_list means you keep one sk_buff per segment, so this really
> looks like a legacy UDP server (like a DNS server) wont benefit from
> this anyway.

I played with the idea of preparing the skb for minimal overhead for the
process doing udp_recvmsg().

If the socket is under pressure, the softirq handler(s) can try to pull
the packet payload into skb->head if it fits.

Meaning the softirq handler can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later.

( Sort of copybreak, but without reallocating skb->head )

Gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF KB.
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.
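
A rough sketch of the softirq-side pull described above; the helper name
and its call site are assumptions, and error handling is omitted:

	static void udp_try_pull_payload(struct sk_buff *skb)
	{
		/* nothing in page frags, or data shared with a clone: leave it */
		if (!skb->data_len || skb_cloned(skb))
			return;

		/* only do it when the payload fits in the existing tailroom,
		 * so skb->head is not reallocated
		 */
		if (skb->data_len > skb_tailroom(skb))
			return;

		/* move the fragmented bytes into the linear area; the page
		 * fragment(s) can then be freed right away in softirq context
		 */
		__pskb_pull_tail(skb, skb->data_len);

		/* truesize now only has to cover the linear buffer */
		skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
	}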

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-06 17:08         ` Paolo Abeni
  2016-12-06 17:47           ` Eric Dumazet
@ 2016-12-07 17:09           ` David Laight
  2016-12-07 17:32             ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: David Laight @ 2016-12-07 17:09 UTC (permalink / raw)
  To: 'Paolo Abeni', Eric Dumazet
  Cc: David Miller, netdev, Willem de Bruijn

From: Paolo Abeni
> Sent: 06 December 2016 17:08
...
> @@ -79,6 +82,9 @@ struct udp_sock {
>  	int			(*gro_complete)(struct sock *sk,
>  						struct sk_buff *skb,
>  						int nhoff);
> +
> +	/* since we are prone to drops, avoid dirtying any sk cacheline */
> +	atomic_t		drops ____cacheline_aligned_in_smp;
>  };

Isn't that likely to create a large hole on systems with large cache lines?
(Same as any other use of ____cacheline_aligned_in_smp.)

	David


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-07 17:09           ` [PATCH] net/udp: do not touch skb->peeked unless really needed David Laight
@ 2016-12-07 17:32             ` Eric Dumazet
  2016-12-07 17:37               ` Hannes Frederic Sowa
  0 siblings, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 17:32 UTC (permalink / raw)
  To: David Laight
  Cc: 'Paolo Abeni', David Miller, netdev, Willem de Bruijn

On Wed, 2016-12-07 at 17:09 +0000, David Laight wrote:
> From: Paolo Abeni
> > Sent: 06 December 2016 17:08
> ...
> > @@ -79,6 +82,9 @@ struct udp_sock {
> >  	int			(*gro_complete)(struct sock *sk,
> >  						struct sk_buff *skb,
> >  						int nhoff);
> > +
> > +	/* since we are prone to drops, avoid dirtying any sk cacheline */
> > +	atomic_t		drops ____cacheline_aligned_in_smp;
> >  };
> 
> Isn't that likely to create a large hole on systems with large cache lines.
> (Same as any other use of ____cacheline_aligned_in_smp.)

Yes, I would like to avoid that, unless we come to the conclusion it is
absolutely needed.

I feel that we could simply use a pointer, and allocate memory on
demand, since many sockets do not ever experience a drop.

The pointer could stay in a read mostly section.

We even could use per cpu or node counter for some heavy drop cases. 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-07 17:32             ` Eric Dumazet
@ 2016-12-07 17:37               ` Hannes Frederic Sowa
  2016-12-07 17:52                 ` Eric Dumazet
  2016-12-07 17:55                 ` Eric Dumazet
  0 siblings, 2 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-07 17:37 UTC (permalink / raw)
  To: Eric Dumazet, David Laight
  Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Wed, Dec 7, 2016, at 18:32, Eric Dumazet wrote:
> On Wed, 2016-12-07 at 17:09 +0000, David Laight wrote:
> > From: Paolo Abeni
> > > Sent: 06 December 2016 17:08
> > ...
> > > @@ -79,6 +82,9 @@ struct udp_sock {
> > >  	int			(*gro_complete)(struct sock *sk,
> > >  						struct sk_buff *skb,
> > >  						int nhoff);
> > > +
> > > +	/* since we are prone to drops, avoid dirtying any sk cacheline */
> > > +	atomic_t		drops ____cacheline_aligned_in_smp;
> > >  };
> > 
> > Isn't that likely to create a large hole on systems with large cache lines.
> > (Same as any other use of ____cacheline_aligned_in_smp.)
> 
> Yes, I would like to avoid that, unless we come to the conclusion it is
> absolutely needed.
> 
> I feel that we could simply use a pointer, and allocate memory on
> demand, since many sockets do not ever experience a drop.
> 
> The pointer could stay in a read mostly section.
> 
> We even could use per cpu or node counter for some heavy drop cases. 

I had the same idea while discussing that with Paolo, merely using an
*atomic_t = kmalloc(sizeof(atomic_t)) out of band of the socket.

My fear was that those could be aggregated by the slab cache into one
cache line, causing even more heating on cachelines.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-07 17:37               ` Hannes Frederic Sowa
@ 2016-12-07 17:52                 ` Eric Dumazet
  2016-12-07 17:55                 ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 17:52 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Wed, 2016-12-07 at 18:37 +0100, Hannes Frederic Sowa wrote:

> I had the same idea while discussing that with Paolo, merely using an
> *atomic_t = kmalloc(sizeof(atomic_t)) out of band of the socket.
> 
> My fear was that those could be aggregated by the slab cache into one
> cache line, causing even more heating on cachelines.

For hot stuff, better use
	kmalloc(max_t(size_t, L1_CACHE_BYTES, sizeof(...)), ...)
to avoid false sharing, unless this is per cpu data of course.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] net/udp: do not touch skb->peeked unless really needed
  2016-12-07 17:37               ` Hannes Frederic Sowa
  2016-12-07 17:52                 ` Eric Dumazet
@ 2016-12-07 17:55                 ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-07 17:55 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Laight, Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Wed, 2016-12-07 at 18:37 +0100, Hannes Frederic Sowa wrote:

> I had the same idea while discussing that with Paolo, merely using an
> *atomic_t = kmalloc(sizeof(atomic_t)) out of band of the socket.
> 
> My fear was that those could be aggregated by the slab cache into one
> cache line, causing even more heating on cachelines.

My exact idea was to let up to 4095 (or PAGE_SIZE - 1) increments be
done on the counter before switching to dynamically allocated memory.

(Some packets might be dropped by TCP sockets; that is not necessarily a
sign of an attack, just some spurious retransmits.)
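
A hedged sketch combining the two ideas above (an inline count while the
value stays below PAGE_SIZE, then a cache-line-sized allocation). All
names are invented, and a failed cmpxchg simply loses one increment,
which is acceptable for a statistics counter:

	static void sk_drops_bump(unsigned long *drops_word)
	{
		unsigned long v = READ_ONCE(*drops_word);

		if (v < PAGE_SIZE - 1) {
			/* still an inline count: no kernel pointer is this small */
			cmpxchg(drops_word, v, v + 1);
			return;
		}
		if (v == PAGE_SIZE - 1) {
			/* switch to an out-of-line, cache-line-sized counter */
			atomic_t *p = kmalloc(max_t(size_t, L1_CACHE_BYTES,
						    sizeof(*p)), GFP_ATOMIC);

			if (!p)
				return;		/* stay saturated at the inline max */
			atomic_set(p, v + 1);
			if (cmpxchg(drops_word, v, (unsigned long)p) != v)
				kfree(p);	/* another cpu already switched */
			return;
		}
		atomic_inc((atomic_t *)v);	/* word already holds a pointer */
	}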

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  6:47                       ` Eric Dumazet
  2016-12-07  7:57                         ` Paolo Abeni
@ 2016-12-08 17:49                         ` Tom Herbert
  2016-12-08 18:02                           ` Eric Dumazet
  2016-12-08 18:07                           ` Eric Dumazet
  1 sibling, 2 replies; 43+ messages in thread
From: Tom Herbert @ 2016-12-08 17:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Tue, Dec 6, 2016 at 10:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
>> A follow up patch will provide a static_key (Jump Label) since most
>> hosts do not even use RFS.
>
> Speaking of static_key, it appears we now have GRO on UDP, and this
> consumes a considerable amount of cpu cycles.
>
> Turning off GRO allows me to get +20 % more packets on my single UDP
> socket. (1.2 Mpps instead of 1.0 Mpps)
>
> Surely udp_gro_receive() should be bypassed if no UDP socket has
> registered a udp_sk(sk)->gro_receive handler
>
> And/or delay the inet_add_offload(&udpv{4|6}_offload, IPPROTO_UDP); to
> the first UDP sockets setting udp_sk(sk)->gro_receive handler,
> ie udp_encap_enable() and udpv6_encap_enable()
>
Of course that would only help on systems where no one enables encaps,
i.e. it looks good in the simple benchmarks but in real life if just
one socket enables encap everyone else takes the hit. Alternatively,
maybe we could do early demux when we do the lookup in GRO to
eliminate the extra lookup?

Tom

>
> :(
>
>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07 14:26                           ` Eric Dumazet
@ 2016-12-08 17:49                             ` Paolo Abeni
  0 siblings, 0 replies; 43+ messages in thread
From: Paolo Abeni @ 2016-12-08 17:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Wed, 2016-12-07 at 06:26 -0800, Eric Dumazet wrote:
> On Wed, 2016-12-07 at 08:57 +0100, Paolo Abeni wrote:
> > On Tue, 2016-12-06 at 22:47 -0800, Eric Dumazet wrote:
> > > On Tue, 2016-12-06 at 19:32 -0800, Eric Dumazet wrote:
> > > > A follow up patch will provide a static_key (Jump Label) since most
> > > > hosts do not even use RFS.
> > > 
> > > Speaking of static_key, it appears we now have GRO on UDP, and this
> > > consumes a considerable amount of cpu cycles.
> > > 
> > > Turning off GRO allows me to get +20 % more packets on my single UDP
> > > socket. (1.2 Mpps instead of 1.0 Mpps)
> > 
> > I see also an improvement for single flow tests disabling GRO, but on a
> > smaller scale (~5% if I recall correctly).
> 
> Was it on a NUMA host ?

I'm using a single socket host, with 12 cores/24 threads and 16 RX
queues. 
But my data is old. I'll re-run the test on top of current net-next.

Paolo

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 17:49                         ` Tom Herbert
@ 2016-12-08 18:02                           ` Eric Dumazet
  2016-12-08 19:15                             ` Tom Herbert
  2016-12-08 18:07                           ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:02 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:

> Of course that would only help on systems where no one enable encaps,
> ie. looks good in the the simple benchmarks but in real life if just
> one socket enables encap everyone else takes the hit. Alternatively,
> maybe we could do early demux when we do the lookup in GRO to
> eliminate the extra lookup?

Well, if you do the lookup in GRO, won't it be done for every incoming
MSS, instead of once per GRO packet?

Anyway, the flooded UDP sockets out there are not normally connected
ones.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 17:49                         ` Tom Herbert
  2016-12-08 18:02                           ` Eric Dumazet
@ 2016-12-08 18:07                           ` Eric Dumazet
  1 sibling, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:07 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:

> Of course that would only help on systems where no one enable encaps,
> ie. looks good in the the simple benchmarks but in real life if just
> one socket enables encap everyone else takes the hit.

Well, in real life most linux hosts do not use any UDP encapsulation.

Or if they do, maybe they still have to handle a lot of UDP traffic
which does not hit a tunnel in the kernel.

Anyway, the difference I saw between GRO on/off was caused by copybreak
in the mlx4 driver.

GRO off --> mlx4 uses copybreak for small messages (all protocols)
GRO on  --> no copybreak for native protocols (IP+TCP IP+UDP)

The lookup being done twice is not that expensive, if the first two
cache lines of the socket stay shared (mostly read)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07 14:29                           ` Eric Dumazet
  2016-12-07 15:59                             ` Eric Dumazet
@ 2016-12-08 18:50                             ` Paolo Abeni
  2016-12-08 19:32                               ` Eric Dumazet
  1 sibling, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2016-12-08 18:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Wed, 2016-12-07 at 06:29 -0800, Eric Dumazet wrote:
> On Wed, 2016-12-07 at 08:57 +0100, Paolo Abeni wrote:
> 
> > We have some experimental patches to implement GRO for plain UDP
> > connected sockets, using frag_list to preserve the individual skb len,
> > and deliver the packet to user space individually. With that I got
> > ~3mpps with a single queue/user space sink - before the recent udp
> > improvements. I would like to present these patches on netdev soon (no
> > sooner than next week, anyway).
> > 
> 
> Make sure you handle properly all netfilter helpers :(

Thank you for the heads-up!
UDP-GRO will be enabled by a specific netdev feature bit, disabled by
default, so it should not impact any setup by default.

> Keeping frag_list means you keep one sk_buff per segment, so this really
> looks like a legacy UDP server (like a DNS server) wont benefit from
> this anyway.

I'm sorry, I do not follow.

UDP GRO will require a connected socket - very likely not a DNS server.
The use-case is an application using long-lived UDP sockets doing a lot
of traffic, like FIX protocol feeds over UDP.

Thank you,

Paolo
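
A very rough sketch of the coalescing step (not the actual patches):
same-flow UDP skbs are chained on the head skb's frag_list so every
segment keeps its own length; flow matching, flushing and the rest of
the GRO machinery are omitted:

	static void udp_gro_fraglist_chain(struct sk_buff *head,
					   struct sk_buff *skb)
	{
		struct sk_buff **tail = &skb_shinfo(head)->frag_list;

		/* append the new segment, untouched, at the end of the chain */
		while (*tail)
			tail = &(*tail)->next;
		*tail = skb;

		head->data_len += skb->len;
		head->len      += skb->len;
		head->truesize += skb->truesize;
		NAPI_GRO_CB(head)->count++;	/* segments coalesced so far */
	}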

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 18:02                           ` Eric Dumazet
@ 2016-12-08 19:15                             ` Tom Herbert
  2016-12-08 20:05                               ` Hannes Frederic Sowa
  0 siblings, 1 reply; 43+ messages in thread
From: Tom Herbert @ 2016-12-08 19:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
>
>> Of course that would only help on systems where no one enable encaps,
>> ie. looks good in the the simple benchmarks but in real life if just
>> one socket enables encap everyone else takes the hit. Alternatively,
>> maybe we could do early demux when we do the lookup in GRO to
>> eliminate the extra lookup?
>
> Well, if you do the lookup in GRO, wont it be done for every incoming
> MSS, instead of once per GRO packet ?

We should be able to avoid that. We already do the lookup for every
UDP packet going into GRO; we would only need to take the refcnt once for
the whole GRO packet.

>
> Anyway, the flooded UDP sockets out there are not normally connected

We should still be able to use early demux in that case, we just can't
avoid the route lookup. I wonder if we might be able to cache a soft
route, maybe for the last local destination received, to help the
unconnected sockets case...

In any case, I can take a look at doing early demux from within UDP GRO.

Tom

> ones.
>
>
>
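
A sketch of what stashing the GRO-time lookup result could look like;
the helper name is invented, while sock_efree() is the destructor the
existing early-demux code already uses to drop the reference:

	static void udp_gro_set_owner(struct sk_buff *skb, struct sock *sk)
	{
		if (!atomic_inc_not_zero(&sk->sk_refcnt))
			return;			/* socket is going away */
		skb->sk = sk;
		skb->destructor = sock_efree;	/* sock_put() when skb is freed */
	}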

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-07  7:57                         ` Paolo Abeni
  2016-12-07 14:26                           ` Eric Dumazet
  2016-12-07 14:29                           ` Eric Dumazet
@ 2016-12-08 19:20                           ` Edward Cree
  2 siblings, 0 replies; 43+ messages in thread
From: Edward Cree @ 2016-12-08 19:20 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet
  Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert,
	Jesper Dangaard Brouer

On 07/12/16 07:57, Paolo Abeni wrote:
> We have some experimental patches to implement GRO for plain UDP
> connected sockets, using frag_list to preserve the individual skb len,
> and deliver the packet to user space individually. With that I got
> ~3mpps with a single queue/user space sink - before the recent udp
> improvements.
You might want to benchmark these against my batched receive patches
from a while ago [1]; both seem to have broadly the same objective.
In my benchmarking (obviously with different hardware) I was using
multiple sink processes, but all (processes and irqs) on a single
core; the unpatched kernel was getting ~5Mpps.  Then with my patches
I was getting ~6.4Mpps.  (Limitations of my test scripts meant that
having a single sink process meant also having a single source
process, in which case I was TX limited to ~3Mpps, and using about
60% CPU on the RX side.)

Let me know if you're interested in doing this comparison; if so I'll
post updated patches against net-next.  My own attempts to benchmark
them more have been held up by lack of time and not really knowing
what constitutes a realistic netfilter setup.
Of course if you're using a device other than sfc you'll need to add
your own equivalent of patch #2 to call the netif_receive_skb_list()
entry point from the driver.

-Ed

[1] https://www.spinics.net/lists/netdev/msg373769.html
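
For reference, a sketch of the driver-side change mentioned above (the
patch #2 equivalent): the skb list linkage and the
netif_receive_skb_list() entry point come from that series, and
mydrv_next_rx_skb() is an invented stand-in for the driver's per-packet
skb build:

	static int mydrv_napi_poll(struct napi_struct *napi, int budget)
	{
		LIST_HEAD(rx_list);
		struct sk_buff *skb;
		int done = 0;

		while (done < budget && (skb = mydrv_next_rx_skb(napi)) != NULL) {
			list_add_tail(&skb->list, &rx_list);
			done++;
		}

		/* hand the whole batch to the stack in one call */
		if (!list_empty(&rx_list))
			netif_receive_skb_list(&rx_list);

		if (done < budget)
			napi_complete_done(napi, done);
		return done;
	}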

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 18:50                             ` Paolo Abeni
@ 2016-12-08 19:32                               ` Eric Dumazet
  0 siblings, 0 replies; 43+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:32 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David Miller, netdev, Willem de Bruijn, Tom Herbert

On Thu, 2016-12-08 at 19:50 +0100, Paolo Abeni wrote:

> UDP GRO will require connected socket - very likely no DNS server. The
> use-case is an application using long lived UDP sockets doing a lot of
> traffic, like fix protocol feeds over UDP. 

You mean, the kind of traffic that should use TCP in the first place,
right ? ;)

NFS switched to TCP a long time ago.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 19:15                             ` Tom Herbert
@ 2016-12-08 20:05                               ` Hannes Frederic Sowa
  2016-12-08 20:30                                 ` Tom Herbert
  2016-12-08 20:44                                 ` Tom Herbert
  0 siblings, 2 replies; 43+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-08 20:05 UTC (permalink / raw)
  To: Tom Herbert, Eric Dumazet
  Cc: Paolo Abeni, David Miller, netdev, Willem de Bruijn

Hello,

On Thu, Dec 8, 2016, at 20:15, Tom Herbert wrote:
> On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com>
> wrote:
> > On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
> >
> >> Of course that would only help on systems where no one enable encaps,
> >> ie. looks good in the the simple benchmarks but in real life if just
> >> one socket enables encap everyone else takes the hit. Alternatively,
> >> maybe we could do early demux when we do the lookup in GRO to
> >> eliminate the extra lookup?
> >
> > Well, if you do the lookup in GRO, wont it be done for every incoming
> > MSS, instead of once per GRO packet ?
> 
> We should be able to avoid that. We already do the lookup for every
> UDP packet going into GRO, would only need to take the refcnt once for
> the whole GRO packet.
> 
> >
> > Anyway, the flooded UDP sockets out there are not normally connected
> 
> We still should be able to use early demux in that case, just can't
> avoid the route lookup. I wonder if we might be able to cache a soft
> route maybe for the last local destination received to help the
> unconnected sockets case...
> 
> In any case, I can take a look at of doing early demux from with UDP GRO.

Early demux already breaks ip rules: we might set up a rule such that an
incoming packet, depending on the rule, does not find an input route at
all and gets forwarded instead. The same problem might occur with VRF,
when you have multiple ip addresses in different "realms".

That said, I don't see why we can't be more aggressive for GRO in the
unconnected case: we simply must make sure that the current namespace
holds the ip address, which is simply a hash lookup. After that we can
even accept packets for a wildcard-bound socket.

Probably we should disable this logic as soon as vrf and/or rules are
active, to have correct semantics.

Bye,
Hannes
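
A sketch of the check described above, with an invented helper name:
verify the destination address is local to the namespace via the inet
address hash, and back off when policy rules are configured:

	static bool udp_gro_daddr_is_local(struct net *net,
					   const struct iphdr *iph)
	{
	#ifdef CONFIG_IP_MULTIPLE_TABLES
		if (net->ipv4.fib_has_custom_rules)
			return false;	/* ip rules active: keep current behaviour */
	#endif
		/* hash lookup on the namespace's local addresses (rcu read side) */
		return __ip_dev_find(net, iph->daddr, false) != NULL;
	}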

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 20:05                               ` Hannes Frederic Sowa
@ 2016-12-08 20:30                                 ` Tom Herbert
  2016-12-08 20:44                                 ` Tom Herbert
  1 sibling, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2016-12-08 20:30 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Thu, Dec 8, 2016 at 12:05 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hello,
>
> On Thu, Dec 8, 2016, at 20:15, Tom Herbert wrote:
>> On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>> > On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
>> >
>> >> Of course that would only help on systems where no one enable encaps,
>> >> ie. looks good in the the simple benchmarks but in real life if just
>> >> one socket enables encap everyone else takes the hit. Alternatively,
>> >> maybe we could do early demux when we do the lookup in GRO to
>> >> eliminate the extra lookup?
>> >
>> > Well, if you do the lookup in GRO, wont it be done for every incoming
>> > MSS, instead of once per GRO packet ?
>>
>> We should be able to avoid that. We already do the lookup for every
>> UDP packet going into GRO, would only need to take the refcnt once for
>> the whole GRO packet.
>>
>> >
>> > Anyway, the flooded UDP sockets out there are not normally connected
>>
>> We still should be able to use early demux in that case, just can't
>> avoid the route lookup. I wonder if we might be able to cache a soft
>> route maybe for the last local destination received to help the
>> unconnected sockets case...
>>
>> In any case, I can take a look at of doing early demux from with UDP GRO.
>
> Early demux already breaks ip rules: we might set up a rule so an
> incoming packet might depending on the rule not find an input route at
> all and would be forwarded. Same problem might occur with VRF, when you
> have multiple ip addresses in different "realms".
>
> That said, I don't see why we can't be more aggressive for GRO in the
> unconnected case: we simply must make sure that the current namespace
> holds the ip address, which is simply a hash lookup. After that we can
> even accept packets for a wildcard bounded socket.
>
> Probably we should disable this logic as soon as soon as vrf and/or
> rules are active to have correct semantics.
>
All this gets dicey in the presence of encapsulation. One problem is
that we can't tell when or if a packet crosses a network namespace just
by parsing the packet. It's a subset of the general problem of
correctly identifying packets outside of the terminal protocol
processing (we've already talked about the incorrectness of devices
that identify UDP encapsulation based on port numbers). But even
without encapsulation there is still the problem as you point out with
vrf, IPvlan, etc. I think the answer thus far has been to hand wave
and rely on probability, for instance identifying UDP encapsulation by
port number in the device probably works almost all of the time. Matching
an unconnected UDP socket in GRO and then accepting a route associated
with that probably would also work nearly all the time. Maybe if we
can quantify the dependencies that early parsing has in this area,
other mechanisms (vrf) might be able to do something intelligent to
ensure correctness -- it does seem like a hard problem though!

Tom

> Bye,
> Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH net-next] net: sock_rps_record_flow() is for connected sockets
  2016-12-08 20:05                               ` Hannes Frederic Sowa
  2016-12-08 20:30                                 ` Tom Herbert
@ 2016-12-08 20:44                                 ` Tom Herbert
  1 sibling, 0 replies; 43+ messages in thread
From: Tom Herbert @ 2016-12-08 20:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, Paolo Abeni, David Miller, netdev, Willem de Bruijn

On Thu, Dec 8, 2016 at 12:05 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hello,
>
> On Thu, Dec 8, 2016, at 20:15, Tom Herbert wrote:
>> On Thu, Dec 8, 2016 at 10:02 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>> > On Thu, 2016-12-08 at 09:49 -0800, Tom Herbert wrote:
>> >
>> >> Of course that would only help on systems where no one enable encaps,
>> >> ie. looks good in the the simple benchmarks but in real life if just
>> >> one socket enables encap everyone else takes the hit. Alternatively,
>> >> maybe we could do early demux when we do the lookup in GRO to
>> >> eliminate the extra lookup?
>> >
>> > Well, if you do the lookup in GRO, wont it be done for every incoming
>> > MSS, instead of once per GRO packet ?
>>
>> We should be able to avoid that. We already do the lookup for every
>> UDP packet going into GRO, would only need to take the refcnt once for
>> the whole GRO packet.
>>
>> >
>> > Anyway, the flooded UDP sockets out there are not normally connected
>>
>> We still should be able to use early demux in that case, just can't
>> avoid the route lookup. I wonder if we might be able to cache a soft
>> route maybe for the last local destination received to help the
>> unconnected sockets case...
>>
>> In any case, I can take a look at of doing early demux from with UDP GRO.
>
> Early demux already breaks ip rules: we might set up a rule so an
> incoming packet might depending on the rule not find an input route at
> all and would be forwarded. Same problem might occur with VRF, when you
> have multiple ip addresses in different "realms".
>
> That said, I don't see why we can't be more aggressive for GRO in the
> unconnected case: we simply must make sure that the current namespace
> holds the ip address, which is simply a hash lookup. After that we can
> even accept packets for a wildcard bounded socket.
>
Or just depend on encapsulation sockets binding to an address. That
would eliminate most of the ambiguity, especially if it can be pushed into
a device that is trying to parse encapsulation. We would need new
interfaces to support that in HW, or use n-tuple filtering (which I
still maintain is the only right way to do it).

Tom

> Probably we should disable this logic as soon as soon as vrf and/or
> rules are active to have correct semantics.
>
> Bye,
> Hannes

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2016-12-08 20:44 UTC | newest]

Thread overview: 43+ messages
2016-12-05  2:43 [RFC] udp: some improvements on RX path Eric Dumazet
2016-12-05 13:22 ` Paolo Abeni
2016-12-05 14:28   ` Eric Dumazet
2016-12-05 15:37     ` Jesper Dangaard Brouer
2016-12-05 15:54       ` Eric Dumazet
2016-12-05 17:57     ` [PATCH] net/udp: do not touch skb->peeked unless really needed Eric Dumazet
2016-12-06  9:53       ` Paolo Abeni
2016-12-06 12:10         ` Paolo Abeni
2016-12-06 14:35           ` Eric Dumazet
2016-12-06 14:34         ` Eric Dumazet
2016-12-06 10:34       ` Paolo Abeni
2016-12-06 17:08         ` Paolo Abeni
2016-12-06 17:47           ` Eric Dumazet
2016-12-06 18:31             ` Paolo Abeni
2016-12-06 18:58               ` Eric Dumazet
2016-12-06 19:16                 ` Paolo Abeni
2016-12-06 19:35                   ` Eric Dumazet
2016-12-07  3:32                     ` [PATCH net-next] net: sock_rps_record_flow() is for connected sockets Eric Dumazet
2016-12-07  6:47                       ` Eric Dumazet
2016-12-07  7:57                         ` Paolo Abeni
2016-12-07 14:26                           ` Eric Dumazet
2016-12-08 17:49                             ` Paolo Abeni
2016-12-07 14:29                           ` Eric Dumazet
2016-12-07 15:59                             ` Eric Dumazet
2016-12-08 18:50                             ` Paolo Abeni
2016-12-08 19:32                               ` Eric Dumazet
2016-12-08 19:20                           ` Edward Cree
2016-12-08 17:49                         ` Tom Herbert
2016-12-08 18:02                           ` Eric Dumazet
2016-12-08 19:15                             ` Tom Herbert
2016-12-08 20:05                               ` Hannes Frederic Sowa
2016-12-08 20:30                                 ` Tom Herbert
2016-12-08 20:44                                 ` Tom Herbert
2016-12-08 18:07                           ` Eric Dumazet
2016-12-07  7:59                       ` Paolo Abeni
2016-12-07 13:58                         ` Eric Dumazet
2016-12-07 15:47                       ` David Miller
2016-12-07 17:09           ` [PATCH] net/udp: do not touch skb->peeked unless really needed David Laight
2016-12-07 17:32             ` Eric Dumazet
2016-12-07 17:37               ` Hannes Frederic Sowa
2016-12-07 17:52                 ` Eric Dumazet
2016-12-07 17:55                 ` Eric Dumazet
2016-12-06 15:42       ` David Miller
