* [PATCH v3 net-next 0/4] udp: receive path optimizations
@ 2016-12-08 19:41 Eric Dumazet
  2016-12-08 19:41 ` [PATCH v3 net-next 1/4] udp: add busylocks in RX path Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

This patch series provides about a 100% performance increase under flood.

v2: incorporated Paolo's feedback on udp_rmem_release() for tiny sk_rcvbuf;
    added the last patch, touching sk_rmem_alloc later

Eric Dumazet (4):
  udp: add busylocks in RX path
  udp: copy skb->truesize in the first cache line
  udp: add batching to udp_rmem_release()
  udp: udp_rmem_release() should touch sk_rmem_alloc later

 include/linux/skbuff.h |  9 ++++++-
 include/linux/udp.h    |  3 +++
 net/ipv4/udp.c         | 71 ++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 77 insertions(+), 6 deletions(-)

-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 net-next 1/4] udp: add busylocks in RX path
  2016-12-08 19:41 [PATCH v3 net-next 0/4] udp: receive path optimizations Eric Dumazet
@ 2016-12-08 19:41 ` Eric Dumazet
  2016-12-08 19:45   ` Hannes Frederic Sowa
  2016-12-08 19:41 ` [PATCH v3 net-next 2/4] udp: copy skb->truesize in the first cache line Eric Dumazet
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

The idea of busylocks is to let producers grab an extra spinlock
to relieve pressure on the receive_queue spinlock shared by the consumer.

This behavior is requested only once the socket receive queue is above
half occupancy.

Under flood, this means that only one producer can be in line
trying to acquire the receive_queue spinlock.

These busylocks can be allocated in a per-cpu manner, instead of a
per-socket one (which would consume a cache line per socket).

This patch considerably improves UDP behavior under stress,
depending on the number of NIC RX queues and/or RPS spread.
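
As a rough sizing illustration, assuming 64 possible cpus and a 4-byte
spinlock_t (i.e. no lock debugging), the array allocated in udp_init()
stays tiny:

	/*
	 *	udp_busylocks_log = ilog2(64) + 4 = 10
	 *	busylock count    = 1 << 10 = 1024      (16 per cpu)
	 *	footprint         = 1024 * 4 bytes     ~= 4 KB for the whole host
	 */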

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f5628ada47b53f0d92d08210e5d7e4132a10..e6a68d66f3b218998d409527301d2a994e95 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1195,10 +1195,36 @@ void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
+/* The idea of busylocks is to let producers grab an extra spinlock
+ * to relieve pressure on the receive_queue spinlock shared by the consumer.
+ * Under flood, this means that only one producer can be in line
+ * trying to acquire the receive_queue spinlock.
+ * These busylocks can be allocated in a per-cpu manner, instead of a
+ * per-socket one (which would consume a cache line per socket).
+ */
+static int udp_busylocks_log __read_mostly;
+static spinlock_t *udp_busylocks __read_mostly;
+
+static spinlock_t *busylock_acquire(void *ptr)
+{
+	spinlock_t *busy;
+
+	busy = udp_busylocks + hash_ptr(ptr, udp_busylocks_log);
+	spin_lock(busy);
+	return busy;
+}
+
+static void busylock_release(spinlock_t *busy)
+{
+	if (busy)
+		spin_unlock(busy);
+}
+
 int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
+	spinlock_t *busy = NULL;
 	int size;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
@@ -1214,8 +1240,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	 * - Less cache line misses at copyout() time
 	 * - Less work at consume_skb() (less alien page frag freeing)
 	 */
-	if (rmem > (sk->sk_rcvbuf >> 1))
+	if (rmem > (sk->sk_rcvbuf >> 1)) {
 		skb_condense(skb);
+
+		busy = busylock_acquire(sk);
+	}
 	size = skb->truesize;
 
 	/* we drop only if the receive buf is full and the receive
@@ -1252,6 +1281,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk);
 
+	busylock_release(busy);
 	return 0;
 
 uncharge_drop:
@@ -1259,6 +1289,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 
 drop:
 	atomic_inc(&sk->sk_drops);
+	busylock_release(busy);
 	return err;
 }
 EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
@@ -2613,6 +2644,7 @@ EXPORT_SYMBOL(udp_flow_hashrnd);
 void __init udp_init(void)
 {
 	unsigned long limit;
+	unsigned int i;
 
 	udp_table_init(&udp_table, "UDP");
 	limit = nr_free_buffer_pages() / 8;
@@ -2623,4 +2655,13 @@ void __init udp_init(void)
 
 	sysctl_udp_rmem_min = SK_MEM_QUANTUM;
 	sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+	/* 16 spinlocks per cpu */
+	udp_busylocks_log = ilog2(nr_cpu_ids) + 4;
+	udp_busylocks = kmalloc(sizeof(spinlock_t) << udp_busylocks_log,
+				GFP_KERNEL);
+	if (!udp_busylocks)
+		panic("UDP: failed to alloc udp_busylocks\n");
+	for (i = 0; i < (1U << udp_busylocks_log); i++)
+		spin_lock_init(udp_busylocks + i);
 }
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 net-next 2/4] udp: copy skb->truesize in the first cache line
  2016-12-08 19:41 [PATCH v3 net-next 0/4] udp: receive path optimizations Eric Dumazet
  2016-12-08 19:41 ` [PATCH v3 net-next 1/4] udp: add busylocks in RX path Eric Dumazet
@ 2016-12-08 19:41 ` Eric Dumazet
  2016-12-08 19:41 ` [PATCH v3 net-next 3/4] udp: add batching to udp_rmem_release() Eric Dumazet
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

In the UDP RX handler, we currently clear skb->dev before the skb
is added to the receive queue, because the device pointer is no longer
available once we exit the RCU section.

Since this first cache line is always hot, let's reuse this space
to store skb->truesize and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor() time while the receive queue
spinlock is held.
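
A rough layout sketch of the trick (illustration only; exact offsets and
sizes are config dependent):

	/*
	 *	struct sk_buff {
	 *		...				first cache line (hot)
	 *		union { dev; dev_scratch; }	now also carries truesize
	 *		...
	 *		unsigned int	truesize;	tail of the struct (likely cold)
	 *		atomic_t	users;
	 *	};
	 */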

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |  9 ++++++++-
 net/ipv4/udp.c         | 13 ++++++++++---
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0cd92b0f2af5fe5a7c153435b8dc75833818..332e76756f54bee3b625ed8872c024a2bf53 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -645,8 +645,15 @@ struct sk_buff {
 		struct rb_node	rbnode; /* used in netem & tcp stack */
 	};
 	struct sock		*sk;
-	struct net_device	*dev;
 
+	union {
+		struct net_device	*dev;
+		/* Some protocols might use this space to store information,
+		 * while device pointer would be NULL.
+		 * UDP receive path is one user.
+		 */
+		unsigned long		dev_scratch;
+	};
 	/*
 	 * This is the control buffer. It is free to use for every
 	 * layer. Please put your private variables there. If you
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e6a68d66f3b218998d409527301d2a994e95..c608334d99aa5620858d9cceec500b2be944 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1188,10 +1188,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
 }
 
-/* Note: called with sk_receive_queue.lock held */
+/* Note: called with sk_receive_queue.lock held.
+ * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch.
+ * This avoids a cache line miss while the receive_queue lock is held.
+ * Look at __udp_enqueue_schedule_skb() to see where this copy is done.
+ */
 void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
 {
-	udp_rmem_release(sk, skb->truesize, 1);
+	udp_rmem_release(sk, skb->dev_scratch, 1);
 }
 EXPORT_SYMBOL(udp_skb_destructor);
 
@@ -1246,6 +1250,10 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 		busy = busylock_acquire(sk);
 	}
 	size = skb->truesize;
+	/* Copy skb->truesize into skb->dev_scratch to avoid a cache line miss
+	 * in udp_skb_destructor()
+	 */
+	skb->dev_scratch = size;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
@@ -1272,7 +1280,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	/* no need to setup a destructor, we will explicitly release the
 	 * forward allocated memory on dequeue
 	 */
-	skb->dev = NULL;
 	sock_skb_set_dropcount(sk, skb);
 
 	__skb_queue_tail(list, skb);
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 net-next 3/4] udp: add batching to udp_rmem_release()
  2016-12-08 19:41 [PATCH v3 net-next 0/4] udp: receive path optimizations Eric Dumazet
  2016-12-08 19:41 ` [PATCH v3 net-next 1/4] udp: add busylocks in RX path Eric Dumazet
  2016-12-08 19:41 ` [PATCH v3 net-next 2/4] udp: copy skb->truesize in the first cache line Eric Dumazet
@ 2016-12-08 19:41 ` Eric Dumazet
  2016-12-08 19:41 ` [PATCH v3 net-next 4/4] udp: udp_rmem_release() should touch sk_rmem_alloc later Eric Dumazet
  2016-12-10  3:13 ` [PATCH v3 net-next 0/4] udp: receive path optimizations David Miller
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

If udp_recvmsg() releases sk_rmem_alloc for every packet it
reads, it gives producers the opportunity to immediately grab
spinlocks and desperately try to add another packet, causing
false sharing.

We can add a simple heuristic and give that signal only in
batches of ~25% of the queue capacity.

This patch considerably increases performance under flood,
by about 50%, since the thread draining the queue
is no longer slowed down by false sharing.
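
As a worked example, assuming the common 212992-byte default sk_rcvbuf
(net.core.rmem_default):

	/*
	 *	threshold = sk->sk_rcvbuf >> 2 = 53248 bytes
	 *
	 * so the atomic_sub() on sk_rmem_alloc and the sk_forward_alloc
	 * update run once per ~50 KB of dequeued skbs (a few dozen
	 * MTU-sized packets) instead of once per packet.
	 */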

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/udp.h |  3 +++
 net/ipv4/udp.c      | 12 ++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd39478b635ef5396b5ae1c63f8c965..c0f530809d1f3db7323e51a52224eb49d8f9 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,9 @@ struct udp_sock {
 	int			(*gro_complete)(struct sock *sk,
 						struct sk_buff *skb,
 						int nhoff);
+
+	/* This field is dirtied by udp_recvmsg() */
+	int		forward_deficit;
 };
 
 static inline struct udp_sock *udp_sk(const struct sock *sk)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c608334d99aa5620858d9cceec500b2be944..5a38faa12cde7fdcd5b6d86cdc0f4bc33de4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1177,8 +1177,20 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
 /* fully reclaim rmem/fwd memory allocated for skb */
 static void udp_rmem_release(struct sock *sk, int size, int partial)
 {
+	struct udp_sock *up = udp_sk(sk);
 	int amt;
 
+	if (likely(partial)) {
+		up->forward_deficit += size;
+		size = up->forward_deficit;
+		if (size < (sk->sk_rcvbuf >> 2) &&
+		    !skb_queue_empty(&sk->sk_receive_queue))
+			return;
+	} else {
+		size += up->forward_deficit;
+	}
+	up->forward_deficit = 0;
+
 	atomic_sub(size, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 net-next 4/4] udp: udp_rmem_release() should touch sk_rmem_alloc later
  2016-12-08 19:41 [PATCH v3 net-next 0/4] udp: receive path optimizations Eric Dumazet
                   ` (2 preceding siblings ...)
  2016-12-08 19:41 ` [PATCH v3 net-next 3/4] udp: add batching to udp_rmem_release() Eric Dumazet
@ 2016-12-08 19:41 ` Eric Dumazet
  2016-12-10  3:13 ` [PATCH v3 net-next 0/4] udp: receive path optimizations David Miller
  4 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:41 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet

In flood situations, keeping sk_rmem_alloc at a high value
prevents producers from touching the socket.

It makes sense to lower sk_rmem_alloc only at the end
of udp_rmem_release(), after the thread draining the receive
queue in udp_recvmsg() has finished its writes to sk_forward_alloc.
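
Producers already bail out early on a full socket (the check at the top
of __udp_enqueue_schedule_skb(), slightly simplified here):

	rmem = atomic_read(&sk->sk_rmem_alloc);
	if (rmem > sk->sk_rcvbuf)
		goto drop;

so deferring the atomic_sub() on sk_rmem_alloc until after the
sk_forward_alloc updates keeps flooding producers on that early drop
path while the consumer is still writing.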

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 5a38faa12cde7fdcd5b6d86cdc0f4bc33de4..9ca279b130d51f6feaa97785b1c906775810 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,13 +1191,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
 	}
 	up->forward_deficit = 0;
 
-	atomic_sub(size, &sk->sk_rmem_alloc);
 	sk->sk_forward_alloc += size;
 	amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
 	sk->sk_forward_alloc -= amt;
 
 	if (amt)
 		__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+
+	atomic_sub(size, &sk->sk_rmem_alloc);
 }
 
 /* Note: called with sk_receive_queue.lock held.
-- 
2.8.0.rc3.226.g39d4020


* Re: [PATCH v3 net-next 1/4] udp: add busylocks in RX path
  2016-12-08 19:41 ` [PATCH v3 net-next 1/4] udp: add busylocks in RX path Eric Dumazet
@ 2016-12-08 19:45   ` Hannes Frederic Sowa
  2016-12-08 19:57     ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Hannes Frederic Sowa @ 2016-12-08 19:45 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Paolo Abeni, Eric Dumazet

Hi Eric,

On Thu, Dec 8, 2016, at 20:41, Eric Dumazet wrote:
> The idea of busylocks is to let producers grab an extra spinlock
> to relieve pressure on the receive_queue spinlock shared by the consumer.
> 
> This behavior is requested only once the socket receive queue is above
> half occupancy.
> 
> Under flood, this means that only one producer can be in line
> trying to acquire the receive_queue spinlock.
> 
> These busylocks can be allocated in a per-cpu manner, instead of a
> per-socket one (which would consume a cache line per socket).
> 
> This patch considerably improves UDP behavior under stress,
> depending on the number of NIC RX queues and/or RPS spread.

This patch mostly improves the situation for non-connected sockets. Do you
think it makes sense to acquire the spinlock depending on the socket's
state? Connected UDP sockets flow in on one CPU anyway?

Otherwise the series looks really great, thanks!


* Re: [PATCH v3 net-next 1/4] udp: add busylocks in RX path
  2016-12-08 19:45   ` Hannes Frederic Sowa
@ 2016-12-08 19:57     ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2016-12-08 19:57 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni

On Thu, 2016-12-08 at 20:45 +0100, Hannes Frederic Sowa wrote:
> Hi Eric,
> 

> This patch mostly improves the situation for non-connected sockets. Do you
> think it makes sense to acquire the spinlock depending on the socket's
> state? Connected UDP sockets flow in on one CPU anyway?

We could do that, definitely.

However, I could not measure any difference with a conditional test here
for connected sockets.

Maybe it would show up if we are unlucky and multiple CPUs compete on
the same cache line because of the hashed spinlock array.
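
For reference, a minimal sketch of such a check, assuming connected UDP
sockets can be identified by sk->sk_state == TCP_ESTABLISHED (hypothetical,
not part of this series):

	static spinlock_t *busylock_acquire(struct sock *sk)
	{
		spinlock_t *busy;

		/* Connected sockets are normally fed from a single cpu,
		 * so skip the extra lock for them.
		 */
		if (sk->sk_state == TCP_ESTABLISHED)
			return NULL;

		busy = udp_busylocks + hash_ptr(sk, udp_busylocks_log);
		spin_lock(busy);
		return busy;
	}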


* Re: [PATCH v3 net-next 0/4] udp: receive path optimizations
  2016-12-08 19:41 [PATCH v3 net-next 0/4] udp: receive path optimizations Eric Dumazet
                   ` (3 preceding siblings ...)
  2016-12-08 19:41 ` [PATCH v3 net-next 4/4] udp: udp_rmem_release() should touch sk_rmem_alloc later Eric Dumazet
@ 2016-12-10  3:13 ` David Miller
  4 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2016-12-10  3:13 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, pabeni, eric.dumazet

From: Eric Dumazet <edumazet@google.com>
Date: Thu,  8 Dec 2016 11:41:53 -0800

> This patch series provides about a 100% performance increase under flood.
> 
> v2: incorporated Paolo's feedback on udp_rmem_release() for tiny sk_rcvbuf;
>     added the last patch, touching sk_rmem_alloc later

Series applied, thanks.

