* [PATCH v2 net-next 0/4] udp: receive path optimizations
@ 2016-12-08 17:38 Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

This patch series provides about a 100% performance increase under flood.

David, please scratch it if you prefer to wait for linux-4.11,
thanks !

Eric Dumazet (4):
  udp: under rx pressure, try to condense skbs
  udp: add busylocks in RX path
  udp: copy skb->truesize in the first cache line
  udp: add batching to udp_rmem_release()

 include/linux/skbuff.h | 11 +++++++-
 include/linux/udp.h    |  3 ++
 net/core/skbuff.c      | 28 ++++++++++++++++++
 net/ipv4/udp.c         | 77 +++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 114 insertions(+), 5 deletions(-)

-- 
2.8.0.rc3.226.g39d4020


* [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs
  2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 2/4] udp: add busylocks in RX path Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

Under UDP flood, many softirq producers try to add packets to the
UDP receive queue, while one user thread burns a CPU trying
to dequeue packets as fast as possible.

Two parts of the per-packet cost are:
- copying the payload from kernel space to user space,
- freeing the memory pieces associated with the skb.

If the socket is under pressure, the softirq handler(s) can try to pull
the packet payload into skb->head, if it fits.

This means the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do it hundreds of usec
later, possibly from another NUMA node.

Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.

This comes at the cost of a copy, bounded by the available tail room,
which is usually small. (We might have to fix GRO_MAX_HEAD, which looks
bigger than necessary.)

This patch gave me about a 5% increase in throughput in my tests.

The skb_condense() helper could probably be used in other contexts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 28 ++++++++++++++++++++++++++++
 net/ipv4/udp.c         | 12 +++++++++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fbccf2c..0cd92b0f2af5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1966,6 +1966,8 @@ static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
 	return __pskb_pull_tail(skb, len - skb_headlen(skb)) != NULL;
 }
 
+void skb_condense(struct sk_buff *skb);
+
 /**
  *	skb_headroom - bytes at buffer head
  *	@skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b45cd1494243..d27e0352ae2a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4931,3 +4931,31 @@ struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
 	return clone;
 }
 EXPORT_SYMBOL(pskb_extract);
+
+/**
+ * skb_condense - try to get rid of fragments/frag_list if possible
+ * @skb: buffer
+ *
+ * Can be used to save memory before skb is added to a busy queue.
+ * If the packet has bytes in frags and enough tail room in skb->head,
+ * pull all of them, so that we can free the frags right now and adjust
+ * truesize.
+ * Notes:
+ *	We do not reallocate skb->head thus can not fail.
+ *	Caller must re-evaluate skb->truesize if needed.
+ */
+void skb_condense(struct sk_buff *skb)
+{
+	if (!skb->data_len ||
+	    skb->data_len > skb->end - skb->tail ||
+	    skb_cloned(skb))
+		return;
+
+	/* Nice, we can free page frag(s) right now */
+	__pskb_pull_tail(skb, skb->data_len);
+
+	/* Now adjust skb->truesize, since __pskb_pull_tail() does
+	 * not do this.
+	 */
+	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
+}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 16d88ba9ff1c..110414903f9e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1199,7 +1199,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
-	int size = skb->truesize;
+	int size;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
@@ -1208,6 +1208,16 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	if (rmem > sk->sk_rcvbuf)
 		goto drop;
 
+	/* Under mem pressure, it might be helpful to give udp_recvmsg()
+	 * linear skbs :
+	 * - Reduce memory overhead and thus increase receive queue capacity
+	 * - Less cache line misses at copyout() time
+	 * - Less work at consume_skb() (less alien page frag freeing)
+	 */
+	if (rmem > (sk->sk_rcvbuf >> 1))
+		skb_condense(skb);
+	size = skb->truesize;
+
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v2 net-next 2/4] udp: add busylocks in RX path
  2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 3/4] udp: copy skb->truesize in the first cache line Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

The idea of busylocks is to let producers grab an extra spinlock
to relieve pressure on the receive_queue spinlock shared by the consumer.

This behavior is enabled only once the socket receive queue is above
half occupancy.

Under flood, this means that only one producer can be in line
trying to acquire the receive_queue spinlock.

These busylocks can be allocated in a per-cpu manner, instead of
per socket (which would consume a cache line per socket).

This patch considerably improves UDP behavior under stress,
depending on the number of NIC RX queues and/or RPS spread.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 110414903f9e..77875712405f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1195,10 +1195,36 @@ void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
+/* Idea of busylocks is to let producers grab an extra spinlock
+ * to relieve pressure on the receive_queue spinlock shared by consumer.
+ * Under flood, this means that only one producer can be in line
+ * trying to acquire the receive_queue spinlock.
+ * These busylock can be allocated on a per cpu manner, instead of a
+ * per socket one (that would consume a cache line per socket)
+ */
+static int udp_busylocks_log __read_mostly;
+static spinlock_t *udp_busylocks __read_mostly;
+
+static spinlock_t *busylock_acquire(void *ptr)
+{
+	spinlock_t *busy;
+
+	busy = udp_busylocks + hash_ptr(ptr, udp_busylocks_log);
+	spin_lock(busy);
+	return busy;
+}
+
+static void busylock_release(spinlock_t *busy)
+{
+	if (busy)
+		spin_unlock(busy);
+}
+
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
+	spinlock_t *busy = NULL;
 	int size;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
@@ -1214,8 +1240,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 * - Less cache line misses at copyout() time
 	 * - Less work at consume_skb() (less alien page frag freeing)
 	 */
-	if (rmem > (sk->sk_rcvbuf >> 1))
+	if (rmem > (sk->sk_rcvbuf >> 1)) {
 		skb_condense(skb);
+
+		busy = busylock_acquire(sk);
+	}
 	size = skb->truesize;
 
 	/* we drop only if the receive buf is full and the receive
@@ -1252,6 +1281,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk);
 
+	busylock_release(busy);
 	return 0;
 
 uncharge_drop:
@@ -1259,6 +1289,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 
 drop:
 	atomic_inc(&sk->sk_drops);
+	busylock_release(busy);
 	return err;
 }
 EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
@@ -2613,6 +2644,7 @@ EXPORT_SYMBOL(udp_flow_hashrnd);
 void __init udp_init(void)
 {
 	unsigned long limit;
+	unsigned int i;
 
 	udp_table_init(&udp_table, "UDP");
 	limit = nr_free_buffer_pages() / 8;
@@ -2623,4 +2655,13 @@ void __init udp_init(void)
 
 	sysctl_udp_rmem_min = SK_MEM_QUANTUM;
 	sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+	/* 16 spinlocks per cpu */
+	udp_busylocks_log = ilog2(nr_cpu_ids) + 4;
+	udp_busylocks = kmalloc(sizeof(spinlock_t) << udp_busylocks_log,
+				GFP_KERNEL);
+	if (!udp_busylocks)
+		panic("UDP: failed to alloc udp_busylocks\n");
+	for (i = 0; i < (1U << udp_busylocks_log); i++)
+		spin_lock_init(udp_busylocks + i);
 }
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v2 net-next 3/4] udp: copy skb->truesize in the first cache line
  2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 2/4] udp: add busylocks in RX path Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
  2016-12-08 17:38 ` [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release() Eric Dumazet
  2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
  4 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

In the UDP RX handler, we currently clear skb->dev before the skb
is added to the receive queue, because the device pointer is no longer
available once we exit the RCU section.

Since this first cache line is always hot, let's reuse this space
to store skb->truesize, and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor() time while the receive queue
spinlock is held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  9 ++++++++-
 net/ipv4/udp.c         | 13 ++++++++++---
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0cd92b0f2af5..332e76756f54 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -645,8 +645,15 @@ struct sk_buff {
 		struct rb_node	rbnode; /* used in netem & tcp stack */
 	};
 	struct sock		*sk;
-	struct net_device	*dev;
 
+	union {
+		struct net_device	*dev;
+		/* Some protocols might use this space to store information,
+		 * while device pointer would be NULL.
+		 * UDP receive path is one user.
+		 */
+		unsigned long		dev_scratch;
+	};
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 77875712405f..880cd3d84abf 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1188,10 +1188,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
 }
 
-/* Note: called with sk_receive_queue.lock held */
+/* Note: called with sk_receive_queue.lock held.
+ * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch
+ * This avoids a cache line miss while receive_queue lock is held.
+ * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
+ */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->truesize, 1);
+	udp_rmem_release(sk, skb->dev_scratch, 1);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
@@ -1246,6 +1250,10 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 		busy = busylock_acquire(sk);
 	}
 	size = skb->truesize;
+	/* Copy skb->truesize into skb->dev_scratch to avoid a cache line miss
+	 * in udp_skb_destructor()
+	 */
+	skb->dev_scratch = size;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
@@ -1272,7 +1280,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	/* no need to setup a destructor, we will explicitly release the
 	 * forward allocated memory on dequeue
 	 */
-	skb->dev = NULL;
 	sock_skb_set_dropcount(sk, skb);
 
 	__skb_queue_tail(list, skb);
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
  2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
                   ` (2 preceding siblings ...)
  2016-12-08 17:38 ` [PATCH v2 net-next 3/4] udp: copy skb->truesize in the first cache line Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
  2016-12-08 18:24   ` Paolo Abeni
  2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
  4 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

If udp_recvmsg() constantly releases sk_rmem_alloc
for every read packet, it gives producers an opportunity
to immediately grab spinlocks and desperately
try adding another packet, causing false sharing.

We can add a simple heuristic to give the signal
in batches of ~25% of the queue capacity.

This patch considerably increases performance under
flood, by about 50%, since the thread draining the queue
is no longer slowed down by false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/udp.h |  3 +++
 net/ipv4/udp.c      | 11 +++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd39478..c0f530809d1f 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,9 @@ struct udp_sock {
 	int			(*gro_complete)(struct sock *sk,
 						struct sk_buff *skb,
 						int nhoff);
+
+	/* This field is dirtied by udp_recvmsg() */
+	int		forward_deficit;
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 880cd3d84abf..f0096d088104 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1177,8 +1177,19 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 /* fully reclaim rmem/fwd memory allocated for skb */
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
+	struct udp_sock *up = udp_sk(sk);
 	int amt;
 
+	if (likely(partial)) {
+		up->forward_deficit += size;
+		size = up->forward_deficit;
+		if (size < (sk->sk_rcvbuf >> 2))
+			return;
+	} else {
+		size += up->forward_deficit;
+	}
+	up->forward_deficit = 0;
+
 	atomic_sub(size, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
-- 
2.8.0.rc3.226.g39d4020


* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
  2016-12-08 17:38 ` [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release() Eric Dumazet
@ 2016-12-08 18:24   ` Paolo Abeni
  2016-12-08 18:36     ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2016-12-08 18:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet

On Thu, 2016-12-08 at 09:38 -0800, Eric Dumazet wrote:
> If udp_recvmsg() constantly releases sk_rmem_alloc
> for every read packet, it gives opportunity for
> producers to immediately grab spinlocks and desperatly
> try adding another packet, causing false sharing.
> 
> We can add a simple heuristic to give the signal
> by batches of ~25 % of the queue capacity.
> 
> This patch considerably increases performance under
> flood by about 50 %, since the thread draining the queue
> is no longer slowed by false sharing.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/udp.h |  3 +++
>  net/ipv4/udp.c      | 11 +++++++++++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index d1fd8cd39478..c0f530809d1f 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -79,6 +79,9 @@ struct udp_sock {
>  	int			(*gro_complete)(struct sock *sk,
>  						struct sk_buff *skb,
>  						int nhoff);
> +
> +	/* This field is dirtied by udp_recvmsg() */
> +	int		forward_deficit;
>  };
>  
>  static inline struct udp_sock *udp_sk(const struct sock *sk)
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 880cd3d84abf..f0096d088104 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1177,8 +1177,19 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
>  /* fully reclaim rmem/fwd memory allocated for skb */
>  static void udp_rmem_release(struct sock *sk, int size, int partial)
>  {
> +	struct udp_sock *up = udp_sk(sk);
>  	int amt;
>  
> +	if (likely(partial)) {
> +		up->forward_deficit += size;
> +		size = up->forward_deficit;
> +		if (size < (sk->sk_rcvbuf >> 2))
> +			return;
> +	} else {
> +		size += up->forward_deficit;
> +	}
> +	up->forward_deficit = 0;
> +
>  	atomic_sub(size, &sk->sk_rmem_alloc);
>  	sk->sk_forward_alloc += size;
>  	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);

Nice one! This sounds like a relevant improvement! 

I'm wondering if it may cause regressions with small values of
sk_rcvbuf?!? e.g. with:

netperf -t UDP_STREAM  -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024

I'm sorry, I fear I will be unable to do any tests before next week.

Cheers,

Paolo


* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
  2016-12-08 18:24   ` Paolo Abeni
@ 2016-12-08 18:36     ` Eric Dumazet
  2016-12-08 18:38       ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:36 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David S . Miller, netdev, Eric Dumazet

On Thu, Dec 8, 2016 at 10:24 AM, Paolo Abeni <pabeni@redhat.com> wrote:

> Nice one! This sounds like a relevant improvement!
>
> I'm wondering if it may cause regressions with small value of
> sk_rcvbuf ?!? e.g. with:
>
> netperf -t UDP_STREAM  -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
>

Possibly; then we can simply refine the test to:

size = up->forward_deficit;
if (size < (sk->sk_rcvbuf >> 2) && !skb_queue_empty(&sk->sk_receive_queue))
     return;


* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
  2016-12-08 18:36     ` Eric Dumazet
@ 2016-12-08 18:38       ` Eric Dumazet
  2016-12-08 18:52         ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:38 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David S . Miller, netdev, Eric Dumazet

On Thu, Dec 8, 2016 at 10:36 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 8, 2016 at 10:24 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>
>> Nice one! This sounds like a relevant improvement!
>>
>> I'm wondering if it may cause regressions with small value of
>> sk_rcvbuf ?!? e.g. with:
>>
>> netperf -t UDP_STREAM  -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
>>
>
> Possibly, then simply we can refine the test to :
>
> size = up->forward_deficit;
> if (size < (sk->sk_rcvbuf >> 2) && !skb_queue_empty(&sk->sk_receive_queue))
>      return;

BTW, I tried:

lpaa6:~# ./netperf -t UDP_STREAM -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

  4608    1024   10.00     4499400      0    3685.88
  2560           10.00     4498670           3685.28

So it looks like it is working.

However I have no doubt there might be a corner case for tiny
SO_RCVBUF values or for some message sizes.


* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
  2016-12-08 18:38       ` Eric Dumazet
@ 2016-12-08 18:52         ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:52 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: David S . Miller, netdev, Eric Dumazet

On Thu, Dec 8, 2016 at 10:38 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 8, 2016 at 10:36 AM, Eric Dumazet <edumazet@google.com> wrote:
>> On Thu, Dec 8, 2016 at 10:24 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>>
>>> Nice one! This sounds like a relevant improvement!
>>>
>>> I'm wondering if it may cause regressions with small value of
>>> sk_rcvbuf ?!? e.g. with:
>>>
>>> netperf -t UDP_STREAM  -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
>>>
>>
>> Possibly, then simply we can refine the test to :
>>
>> size = up->forward_deficit;
>> if (size < (sk->sk_rcvbuf >> 2) && !skb_queue_empty(&sk->sk_receive_queue))
>>      return;
>

I will also add this patch:

This really makes sure our changes to sk_forward_alloc won't be slowed
down by producers seeing the change to sk_rmem_alloc too soon.

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8400d6954558..6bdcbe103390 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,13 +1191,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
        }
        up->forward_deficit = 0;

-       atomic_sub(size, &sk->sk_rmem_alloc);
        sk->sk_forward_alloc += size;
        amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
        sk->sk_forward_alloc -= amt;

        if (amt)
                __sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+
+       atomic_sub(size, &sk->sk_rmem_alloc);
 }

 /* Note: called with sk_receive_queue.lock held.


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
  2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
                   ` (3 preceding siblings ...)
  2016-12-08 17:38 ` [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release() Eric Dumazet
@ 2016-12-08 20:48 ` Jesper Dangaard Brouer
  2016-12-08 21:13   ` Eric Dumazet
  2016-12-08 21:17   ` Jesper Dangaard Brouer
  4 siblings, 2 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-08 20:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, David S . Miller, netdev, Paolo Abeni, Eric Dumazet

On Thu,  8 Dec 2016 09:38:55 -0800
Eric Dumazet <edumazet@google.com> wrote:

> This patch series provides about 100 % performance increase under flood. 

Could you please explain a bit more about what kind of testing you are
doing that can show 100% performance improvement?

I've tested this patchset and my tests show *huge* speed-ups, but
reaping the performance benefit depends heavily on the setup and enabling
the right UDP socket settings, and most importantly on where the
performance bottleneck is: ksoftirqd (producer) or udp_sink (consumer).

Basic setup: Unload all netfilter, and enable ip_early_demux.
 sysctl net/ipv4/ip_early_demux=1

Test generator: pktgen UDP packets, single flow, 50Gbit/s mlx5 NICs.
 - Vary packet size between 64 and 1514.


Packet-size: 64
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	537.70	1859756.90	2155
recvmsg   	run: 0 10000000	510.84	1957541.83	2047
read      	run: 0 10000000	583.40	1714077.14	2338
recvfrom  	run: 0 10000000	600.09	1666411.49	2405

The ksoftirqd thread "costs" more than udp_sink, which is idle, and the UDP
queue does not get full enough. Thus, the patchset does not have any
effect.
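
For reference, the recvMmsg/32 mode above corresponds roughly to a
batched receive loop like the minimal sketch below.  This is not the
actual udp_sink source; only port 9, the 10^7 packet count and the
batch of 32 match the test above, the rest is illustrative.  The other
rows are the same loop with one datagram per syscall.

#define _GNU_SOURCE
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#define BATCH 32
#define BUFSZ 2048

int main(void)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(9),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        static char bufs[BATCH][BUFSZ];
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];
        unsigned long pkts = 0;
        int i, fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                return 1;

        for (i = 0; i < BATCH; i++) {
                iovs[i].iov_base = bufs[i];
                iovs[i].iov_len = BUFSZ;
                memset(&msgs[i], 0, sizeof(msgs[i]));
                msgs[i].msg_hdr.msg_iov = &iovs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }

        while (pkts < 10000000) {
                /* dequeue up to BATCH datagrams with one syscall */
                int n = recvmmsg(fd, msgs, BATCH, 0, NULL);

                if (n < 0)
                        return 1;
                pkts += n;
        }
        printf("received %lu datagrams\n", pkts);
        return 0;
}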


Try increasing the pktgen packet size, as this increases the copy cost in
udp_sink.  Thus, a queue can now form, and the udp_sink CPU has almost no
idle cycles.  The "read" and "recvfrom" cases did experience some idle
cycles.

Packet-size: 1514
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	435.88	2294204.11	1747
recvmsg   	run: 0 10000000	458.06	2183100.64	1835
read      	run: 0 10000000	520.34	1921826.18	2085
recvfrom  	run: 0 10000000	515.48	1939935.27	2066


Next trick: connected UDP.

Using a connected UDP socket (combined with ip_early_demux) removes the
FIB lookup from the ksoftirqd path, and moves the tipping point in the
right direction.
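
The --connect option boils down to a connect() call roughly like this
minimal sketch (illustrative only, not the actual udp_sink code; the
198.18.0.1 address is a placeholder for the generator, and port 9
assumes pktgen's default UDP source port):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Restrict the already bound UDP socket to the traffic generator, so
 * that ip_early_demux can steer packets straight to the connected
 * socket in the softirq path, skipping the FIB lookup.
 */
int connect_sink_to_generator(int fd)
{
        struct sockaddr_in peer;

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(9);
        if (inet_pton(AF_INET, "198.18.0.1", &peer.sin_addr) != 1)
                return -1;

        /* from now on, only datagrams from 198.18.0.1:9 are delivered */
        return connect(fd, (struct sockaddr *)&peer, sizeof(peer));
}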

Packet-size: 64
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	391.18	2556361.62	1567
recvmsg   	run: 0 10000000	422.95	2364349.69	1695
read      	run: 0 10000000	425.29	2351338.10	1704
recvfrom  	run: 0 10000000	476.74	2097577.57	1910

Change/increase packet size:

Packet-size: 1514
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
                                ns/pkt  pps             cycles/pkt
recvMmsg/32  	run: 0 10000000	457.56	2185481.94	1833
recvmsg   	run: 0 10000000	479.42	2085837.49	1921
read      	run: 0 10000000	398.05	2512233.13	1595
recvfrom  	run: 0 10000000	391.07	2557096.95	1567

A bit strange: changing the packet size flipped which syscall is the
fastest.

It is also interesting to see where the ksoftirqd limit is:

Results from "nstat" while using recvmsg show that ksoftirqd is
handling 2.6 Mpps, and the consumer/udp_sink is the bottleneck at 2 Mpps.

[skylake ~]$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    2667577            0.0
IpInDelivers                    2667577            0.0
UdpInDatagrams                  2083580            0.0
UdpInErrors                     583995             0.0
UdpRcvbufErrors                 583995             0.0
IpExtInOctets                   4001340000         0.0
IpExtInNoECTPkts                2667559            0.0

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
  2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
@ 2016-12-08 21:13   ` Eric Dumazet
  2016-12-09 16:05     ` Jesper Dangaard Brouer
  2016-12-08 21:17   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 21:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni

On Thu, 2016-12-08 at 21:48 +0100, Jesper Dangaard Brouer wrote:
> On Thu,  8 Dec 2016 09:38:55 -0800
> Eric Dumazet <edumazet@google.com> wrote:
> 
> > This patch series provides about 100 % performance increase under flood. 
> 
> Could you please explain a bit more about what kind of testing you are
> doing that can show 100% performance improvement?
> 
> I've tested this patchset and my tests show *huge* speeds ups, but
> reaping the performance benefit depend heavily on setup and enabling
> the right UDP socket settings, and most importantly where the
> performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).

Right.

So here at Google we do not try (yet) to downgrade our expensive
multiqueue NICs into dumb NICs from the last decade by using a single queue
on them. Maybe it will happen when we can process 10 Mpps per core,
but we are not there yet ;)

So my test is using a NIC, programmed with 8 queues, on a dual-socket
machine. (2 physical packages)

4 queues are handled by 4 cpus on socket0 (NUMA node 0)
4 queues are handled by 4 cpus on socket1 (NUMA node 1)

So I explicitly put my poor single thread UDP application in the worst
condition, having skbs produced on two NUMA nodes. 

Then my load generator uses trafgen, with spoofed UDP source addresses,
like a UDP flood would use; or typical DNS traffic, malicious or not.

So I have 8 cpus all trying to queue packets in a single UDP socket.

Of course, a real high-performance server would use 8 UDP sockets, and
SO_REUSEPORT with a nice eBPF filter to spread the packets based on the
queue/CPU they arrived on.


In the case where you have one CPU that you need to share between ksoftirqd
and all user threads, your test results depend on process scheduler
decisions more than on anything we can code in network land.

It is actually easy for user space to get more than 50% of the cycles,
and 'starve' ksoftirqd.


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
  2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
  2016-12-08 21:13   ` Eric Dumazet
@ 2016-12-08 21:17   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-08 21:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, David S . Miller, netdev, Paolo Abeni, Eric Dumazet

On Thu, 8 Dec 2016 21:48:19 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Thu,  8 Dec 2016 09:38:55 -0800
> Eric Dumazet <edumazet@google.com> wrote:
> 
> > This patch series provides about 100 % performance increase under flood.   
> 
> Could you please explain a bit more about what kind of testing you are
> doing that can show 100% performance improvement?
> 
> I've tested this patchset and my tests show *huge* speeds ups, but
> reaping the performance benefit depend heavily on setup and enabling
> the right UDP socket settings, and most importantly where the
> performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).
> 
> Basic setup: Unload all netfilter, and enable ip_early_demux.
>  sysctl net/ipv4/ip_early_demux=1
> 
> Test generator pktgen UDP packets single flow, 50Gbit/s mlx5 NICs.
>  - Vary packet size between 64 and 1514.

Below, I've added the baseline tests.

Baseline test on net-next at commit c9fba3ed3a4

> Packet-size: 64
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
>                                 ns/pkt  pps             cycles/pkt
> recvMmsg/32  	run: 0 10000000	537.70	1859756.90	2155
> recvmsg   	run: 0 10000000	510.84	1957541.83	2047
> read      	run: 0 10000000	583.40	1714077.14	2338
> recvfrom  	run: 0 10000000	600.09	1666411.49	2405

Packet-size: 64 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
recvMmsg/32  	run: 0 10000000	499.75	2001016.09	2003
recvmsg   	run: 0 10000000	455.84	2193740.92	1827
read      	run: 0 10000000	566.99	1763703.49	2272
recvfrom  	run: 0 10000000	581.02	1721098.87	2328

 
> The ksoftirq thread "cost" more than udp_sink, which is idle, and UDP
> queue does not get full-enough. Thus, patchset does not have any
> effect.
> 
> 
> Try to increase pktgen packet size, as this increase the copy cost of
> udp_sink.  Thus, a queue can now form, and udp_sink CPU almost have no
> idle cycles.  The "read" and "readfrom" did experience some idle
> cycles.
> 
> Packet-size: 1514
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
>                                 ns/pkt  pps             cycles/pkt
> recvMmsg/32  	run: 0 10000000	435.88	2294204.11	1747
> recvmsg   	run: 0 10000000	458.06	2183100.64	1835
> read      	run: 0 10000000	520.34	1921826.18	2085
> recvfrom  	run: 0 10000000	515.48	1939935.27	2066

Packet-size: 1514 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
recvMmsg/32  	run: 0 10000000	453.88	2203231.81	1819
recvmsg   	run: 0 10000000	488.31	2047869.13	1957
read      	run: 0 10000000	480.99	2079058.69	1927
recvfrom  	run: 0 10000000	522.64	1913349.26	2094


> Next trick connected UDP:
> 
> Use connected UDP socket (combined with ip_early_demux), removes the
> FIB_lookup from the ksoftirq, and cause tipping point to be better.
> 
> Packet-size: 64
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
>                                 ns/pkt  pps             cycles/pkt
> recvMmsg/32  	run: 0 10000000	391.18	2556361.62	1567
> recvmsg   	run: 0 10000000	422.95	2364349.69	1695
> read      	run: 0 10000000	425.29	2351338.10	1704
> recvfrom  	run: 0 10000000	476.74	2097577.57	1910

Packet-size: 64 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
recvMmsg/32  	run: 0 10000000	438.55	2280255.77	1757
recvmsg   	run: 0 10000000	496.73	2013156.99	1990
read      	run: 0 10000000	412.17	2426170.58	1652
recvfrom  	run: 0 10000000	471.77	2119662.99	1890


> Change/increase packet size:
> 
> Packet-size: 1514
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
>                                 ns/pkt  pps             cycles/pkt
> recvMmsg/32  	run: 0 10000000	457.56	2185481.94	1833
> recvmsg   	run: 0 10000000	479.42	2085837.49	1921
> read      	run: 0 10000000	398.05	2512233.13	1595
> recvfrom  	run: 0 10000000	391.07	2557096.95	1567

Packet-size: 1514 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
recvMmsg/32  	run: 0 10000000	491.11	2036205.63	1968
recvmsg   	run: 0 10000000	514.37	1944138.31	2061
read      	run: 0 10000000	444.02	2252147.84	1779
recvfrom  	run: 0 10000000	426.58	2344247.20	1709


> A bit strange, changing the packet size, flipped what is the fastest
> syscall.
> 
> It is also interesting to see that ksoftirq limit is:
> 
> Result from "nstat" while using recvmsg, show that ksoftirq is
> handling 2.6 Mpps, and consumer/udp_sink is bottleneck with 2Mpps.
> 
> [skylake ~]$ nstat > /dev/null && sleep 1  && nstat
> #kernel
> IpInReceives                    2667577            0.0
> IpInDelivers                    2667577            0.0
> UdpInDatagrams                  2083580            0.0
> UdpInErrors                     583995             0.0
> UdpRcvbufErrors                 583995             0.0
> IpExtInOctets                   4001340000         0.0
> IpExtInNoECTPkts                2667559            0.0

(baseline 1514 bytes recvmsg)
$ nstat > /dev/null && sleep 1  && nstat
#kernel
IpInReceives                    2702424            0.0
IpInDelivers                    2702423            0.0
UdpInDatagrams                  1950184            0.0
UdpInErrors                     752239             0.0
UdpRcvbufErrors                 752239             0.0
IpExtInOctets                   4053642000         0.0
IpExtInNoECTPkts                2702428            0.0

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
  2016-12-08 21:13   ` Eric Dumazet
@ 2016-12-09 16:05     ` Jesper Dangaard Brouer
  2016-12-09 16:26       ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-09 16:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni, brouer

On Thu, 08 Dec 2016 13:13:15 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-12-08 at 21:48 +0100, Jesper Dangaard Brouer wrote:
> > On Thu,  8 Dec 2016 09:38:55 -0800
> > Eric Dumazet <edumazet@google.com> wrote:
> >   
> > > This patch series provides about 100 % performance increase under flood.   
> > 
> > Could you please explain a bit more about what kind of testing you are
> > doing that can show 100% performance improvement?
> > 
> > I've tested this patchset and my tests show *huge* speeds ups, but
> > reaping the performance benefit depend heavily on setup and enabling
> > the right UDP socket settings, and most importantly where the
> > performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).  
> 
> Right.
> 
> So here at Google we do not try (yet) to downgrade our expensive
> Multiqueue Nics into dumb NICS from last decade by using a single queue
> on them. Maybe it will happen when we can process 10Mpps per core,
> but we are not there yet  ;)
> 
> So my test is using a NIC, programmed with 8 queues, on a dual-socket
> machine. (2 physical packages)
> 
> 4 queues are handled by 4 cpus on socket0 (NUMA node 0)
> 4 queues are handled by 4 cpus on socket1 (NUMA node 1)

Interesting setup; it will be good for catching cache-line bouncing and
false sharing, which the streak of recent patches shows ;-) (Hopefully
such setups are avoided in production.)


> So I explicitly put my poor single thread UDP application in the worst
> condition, having skbs produced on two NUMA nodes. 

On which CPU do you place the single-thread UDP application?

E.g. do you allow it to run on a CPU that also processes ksoftirq?
My experience is that performance is approximately halved if ksoftirqd
and the UDP thread share a CPU (after you fixed the softirq issue).


> Then my load generator use trafgen, with spoofed UDP source addresses,
> like a UDP flood would use. Or typical DNS traffic, malicious or not.

I also like trafgen
 https://github.com/netoptimizer/network-testing/tree/master/trafgen

> So I have 8 cpus all trying to queue packets in a single UDP socket.
> 
> Of course, a real high performance server would use 8 UDP sockets, and
> SO_REUSEPORT with nice eBPF filter to spread the packets based on the
> queue/cpu they arrived.

Once the ksoftirqd and UDP threads are siloed like that, it should
basically correspond to the benchmarks of my single-queue test,
multiplied by the number of CPUs/UDP threads.

I think it might be a good idea (for me) to implement such a
UDP multi-threaded sink example program (with SO_REUSEPORT and an eBPF
filter) to demonstrate and make sure the stack scales (and every
time we/I improve single-queue performance, the numbers should multiply
with the scaling). Maybe you already have such an example program?


> In the case you have one cpu that you need to share between ksoftirq and
> all user threads, then your test results depend on process scheduler
> decisions more than anything we can code in network land.

Yes, that is also my experience; the scheduler has a large influence.
 
> It is actually easy for user space to get more than 50% of the cycles,
> and 'starve' ksoftirqd.

FYI, Paolo recently added an option for parsing the pktgen payload in
the udp_sink.c program; this way we can simulate the app doing some work.

I've started testing with 4 CPUs doing ksoftirq, multiple flows
(pktgen_sample04_many_flows.sh), and then incrementally adding udp_sink
--reuse-port programs on the other 4 CPUs, and it looks like it scales
nicely :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
  2016-12-09 16:05     ` Jesper Dangaard Brouer
@ 2016-12-09 16:26       ` Eric Dumazet
       [not found]         ` <CALx6S35roMkor_0maXk-SwdXeF4GxBfbxXLEXLGnn6mRRaut6g@mail.gmail.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-09 16:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni

On Fri, 2016-12-09 at 17:05 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 08 Dec 2016 13:13:15 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-12-08 at 21:48 +0100, Jesper Dangaard Brouer wrote:
> > > On Thu,  8 Dec 2016 09:38:55 -0800
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >   
> > > > This patch series provides about 100 % performance increase under flood.   
> > > 
> > > Could you please explain a bit more about what kind of testing you are
> > > doing that can show 100% performance improvement?
> > > 
> > > I've tested this patchset and my tests show *huge* speeds ups, but
> > > reaping the performance benefit depend heavily on setup and enabling
> > > the right UDP socket settings, and most importantly where the
> > > performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).  
> > 
> > Right.
> > 
> > So here at Google we do not try (yet) to downgrade our expensive
> > Multiqueue Nics into dumb NICS from last decade by using a single queue
> > on them. Maybe it will happen when we can process 10Mpps per core,
> > but we are not there yet  ;)
> > 
> > So my test is using a NIC, programmed with 8 queues, on a dual-socket
> > machine. (2 physical packages)
> > 
> > 4 queues are handled by 4 cpus on socket0 (NUMA node 0)
> > 4 queues are handled by 4 cpus on socket1 (NUMA node 1)
> 
> Interesting setup, it will be good to catch cache-line bouncing and
> false-sharing, which the streak of recent patches show ;-) (Hopefully
> such setup are avoided for production).

Well, if you have a 100Gbit NIC and 2 NUMA nodes, what do you suggest
exactly, when jobs run on both nodes?

If you suggest removing one package, or forcing jobs to run on socket0
just because the NIC is attached to it, that won't be an option.

Most of the traffic is TCP, so RSS comes in nicely here to affine traffic
to one RX queue of the NIC.

Now, if for some reason an innocent UDP socket is the target of a flood,
we need to avoid having all CPUs blocked on a spinlock just to eventually
queue a packet.

Be assured that high-performance UDP servers already use kernel bypass or
SO_REUSEPORT. My effort is not targeting these special users,
since they already have good performance.

My effort is to provide some isolation, a bit like the effort I did for
SYN flood attacks (CPUs were all spinning on the listener spinlock).




> 
> 
> > So I explicitly put my poor single thread UDP application in the worst
> > condition, having skbs produced on two NUMA nodes. 
> 
> On which CPU do you place the single thread UDP application?

It does not matter in this case. You can either force it to run on a group
of CPUs, or let the scheduler choose.

If you let the scheduler choose, then it might help in the single-tuple
flood attack, since the user thread will be moved to a different CPU
than the ksoftirqd.

> 
> E.g. do you allow it to run on a CPU that also process ksoftirq?
> My experience is that performance is approx half, if ksoftirq and
> UDP-thread share a CPU (after you fixed the softirq issue).

Well, this is exactly what I said earlier. Your choices about CPU
pinning might help or might hurt in different scenarios.

> 
> 
> > Then my load generator use trafgen, with spoofed UDP source addresses,
> > like a UDP flood would use. Or typical DNS traffic, malicious or not.
> 
> I also like trafgen
>  https://github.com/netoptimizer/network-testing/tree/master/trafgen
> 
> > So I have 8 cpus all trying to queue packets in a single UDP socket.
> > 
> > Of course, a real high performance server would use 8 UDP sockets, and
> > SO_REUSEPORT with nice eBPF filter to spread the packets based on the
> > queue/cpu they arrived.
> 
> Once the ksoftirq and UDP-threads are silo'ed like that, it should
> basically correspond to the benchmarks of my single queue test,
> multiplied by the number of CPUs/UDP-threads.

Well, if one CPU is shared by the producer and the consumer, then packets
are hot in the caches, so trying to avoid cache line misses as I did is not
really helping.

I optimized the case where we do not assume both parties run on the same
CPU. If you let the process scheduler do its job, then your throughput can
be doubled ;)

Now if for some reason you are stuck with a single CPU, this is a very
different problem, and af_packet might be better.


> 
> I think it might be a good idea (for me) to implement such a
> UDP-multi-threaded sink example program (with SO_REUSEPORT and eBPF
> filter) to demonstrate and make sure the stack scales (and every
> time we/I improve single queue performance, the numbers should multiply
> with the scaling). Maybe you already have such an example program?


Well, I do have something using SO_REUSEPORT, but not yet BPF, so not in
a state I can share at this moment.
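
The filter part is small anyway. A minimal sketch of an SO_REUSEPORT
group that spreads datagrams to the socket indexed by the receiving CPU,
using a classic BPF filter, could look like the code below. This is
illustrative only (it is not the program mentioned above; port 9 and the
group size are made up), and tools/testing/selftests/net/reuseport_bpf.c
has a complete example.

#include <linux/filter.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51    /* older libc headers may miss it */
#endif

#define NR_SOCKS 8      /* e.g. one socket per RX queue / cpu */

static void attach_cpu_spread_filter(int fd)
{
        /* A = current cpu; A %= NR_SOCKS; return A as the socket index */
        struct sock_filter code[] = {
                BPF_STMT(BPF_LD  | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_CPU),
                BPF_STMT(BPF_ALU | BPF_MOD | BPF_K, NR_SOCKS),
                BPF_STMT(BPF_RET | BPF_A, 0),
        };
        struct sock_fprog prog = {
                .len = sizeof(code) / sizeof(code[0]),
                .filter = code,
        };

        setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                   &prog, sizeof(prog));
}

int make_reuseport_group(int socks[NR_SOCKS])
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port = htons(9),
                .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        int i, one = 1;

        for (i = 0; i < NR_SOCKS; i++) {
                socks[i] = socket(AF_INET, SOCK_DGRAM, 0);
                if (socks[i] < 0 ||
                    setsockopt(socks[i], SOL_SOCKET, SO_REUSEPORT,
                               &one, sizeof(one)) < 0 ||
                    bind(socks[i], (struct sockaddr *)&addr, sizeof(addr)) < 0)
                        return -1;
        }
        /* the filter is shared by the whole reuseport group */
        attach_cpu_spread_filter(socks[0]);
        return 0;
}

Each socket in the group is then drained by one user thread, ideally
pinned to the matching CPU, so producers and consumers stay siloed.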


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
       [not found]         ` <CALx6S35roMkor_0maXk-SwdXeF4GxBfbxXLEXLGnn6mRRaut6g@mail.gmail.com>
@ 2016-12-09 16:53           ` Eric Dumazet
  2016-12-09 17:13             ` Tom Herbert
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-09 16:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Eric Dumazet, David S . Miller, netdev,
	Paolo Abeni

On Fri, 2016-12-09 at 08:43 -0800, Tom Herbert wrote:
> 
> 


> Are you thinking of allowing unconnected socket to have multiple input
> queues? Sort of an automatic and transparent SO_REUSEPORT... 

It all depends on whether the user application is using a single thread or
multiple threads to drain the queue.

Since we used to grab the socket lock in udp_recvmsg(), I guess nobody uses
multiple threads to read packets from a single socket.

So heavy users must already use SO_REUSEPORT; I am not sure what we would
gain by trying to go to a single socket, with the complexity of memory
charging.


> 


* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
  2016-12-09 16:53           ` Eric Dumazet
@ 2016-12-09 17:13             ` Tom Herbert
  0 siblings, 0 replies; 16+ messages in thread
From: Tom Herbert @ 2016-12-09 17:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Eric Dumazet, David S . Miller, netdev,
	Paolo Abeni

On Fri, Dec 9, 2016 at 8:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-12-09 at 08:43 -0800, Tom Herbert wrote:
>>
>>
>
>
>> Are you thinking of allowing unconnected socket to have multiple input
>> queues? Sort of an automatic and transparent SO_REUSEPORT...
>
> It all depends if the user application is using a single thread or
> multiple threads to drain the queue.
>
If they're using multiple threads, hopefully there's no reason they
can't use SO_REUSEPORT. Since we should always assume a DDoS is a
possibility, it seems like that should be a general recommendation: if
you have multiple threads listening on a port, use SO_REUSEPORT.

> Since we used to grab socket lock in udp_recvmsg(), I guess nobody uses
> multiple threads to read packets from a single socket.
>
That's the hope! So the problem at hand is multiple producer CPUs and
one consumer CPU.

> So heavy users must use SO_REUSEPORT already, not sure what we would
> gain trying to go to a single socket, with the complexity of mem
> charging.
>
I think you're making a good point about the possibility that any
unconnected UDP socket could be subject to an attack, so any use of
unconnected UDP has the potential to become a "heavy user" (in fact
we've seen this bring down whole networks in production before). Therefore
the single-thread reader case is relevant to consider.

Tom

>
>>
>
>

