* [PATCH v2 net-next 0/4] udp: receive path optimizations
@ 2016-12-08 17:38 Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs Eric Dumazet
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
This patch series provides about 100 % performance increase under flood.
David, please scratch it if you prefer to wait for linux-4.11,
thanks !
Eric Dumazet (4):
udp: under rx pressure, try to condense skbs
udp: add busylocks in RX path
udp: copy skb->truesize in the first cache line
udp: add batching to udp_rmem_release()
include/linux/skbuff.h | 11 +++++++-
include/linux/udp.h | 3 ++
net/core/skbuff.c | 28 ++++++++++++++++++
net/ipv4/udp.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++---
4 files changed, 114 insertions(+), 5 deletions(-)
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs
2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 2/4] udp: add busylocks in RX path Eric Dumazet
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
Under UDP flood, many softirq producers try to add packets to the
UDP receive queue, while one user thread burns one cpu trying
to dequeue packets as fast as possible.
Two parts of the per-packet cost are :
- copying the payload from kernel space to user space,
- freeing the memory pieces associated with the skb.
If the socket is under pressure, the softirq handler(s) can try to pull
the packet payload into skb->head if it fits.
This means the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another NUMA node.
Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.
This comes at the cost of a copy, bounded by the available tail room,
which is usually small. (We might have to fix GRO_MAX_HEAD, which looks
bigger than necessary.)
This patch gave me about a 5 % increase in throughput in my tests.
The skb_condense() helper could probably be used in other contexts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
---
include/linux/skbuff.h | 2 ++
net/core/skbuff.c | 28 ++++++++++++++++++++++++++++
net/ipv4/udp.c | 12 +++++++++++-
3 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fbccf2c..0cd92b0f2af5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1966,6 +1966,8 @@ static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
return __pskb_pull_tail(skb, len - skb_headlen(skb)) != NULL;
}
+void skb_condense(struct sk_buff *skb);
+
/**
* skb_headroom - bytes at buffer head
* @skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b45cd1494243..d27e0352ae2a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4931,3 +4931,31 @@ struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
return clone;
}
EXPORT_SYMBOL(pskb_extract);
+
+/**
+ * skb_condense - try to get rid of fragments/frag_list if possible
+ * @skb: buffer
+ *
+ * Can be used to save memory before skb is added to a busy queue.
+ * If the packet has bytes in frags and enough tail room in skb->head,
+ * pull all of them, so that we can free the frags right now and adjust
+ * truesize.
+ * Notes:
+ * We do not reallocate skb->head thus can not fail.
+ * Caller must re-evaluate skb->truesize if needed.
+ */
+void skb_condense(struct sk_buff *skb)
+{
+ if (!skb->data_len ||
+ skb->data_len > skb->end - skb->tail ||
+ skb_cloned(skb))
+ return;
+
+ /* Nice, we can free page frag(s) right now */
+ __pskb_pull_tail(skb, skb->data_len);
+
+ /* Now adjust skb->truesize, since __pskb_pull_tail() does
+ * not do this.
+ */
+ skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
+}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 16d88ba9ff1c..110414903f9e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1199,7 +1199,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
{
struct sk_buff_head *list = &sk->sk_receive_queue;
int rmem, delta, amt, err = -ENOMEM;
- int size = skb->truesize;
+ int size;
/* try to avoid the costly atomic add/sub pair when the receive
* queue is full; always allow at least a packet
@@ -1208,6 +1208,16 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
if (rmem > sk->sk_rcvbuf)
goto drop;
+ /* Under mem pressure, it might be helpful to give udp_recvmsg()
+ * linear skbs :
+ * - Reduce memory overhead and thus increase receive queue capacity
+ * - Less cache line misses at copyout() time
+ * - Less work at consume_skb() (less alien page frag freeing)
+ */
+ if (rmem > (sk->sk_rcvbuf >> 1))
+ skb_condense(skb);
+ size = skb->truesize;
+
/* we drop only if the receive buf is full and the receive
* queue contains some other skb
*/
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 2/4] udp: add busylocks in RX path
2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 3/4] udp: copy skb->truesize in the first cache line Eric Dumazet
` (2 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
The idea of busylocks is to let producers grab an extra spinlock
to relieve pressure on the receive_queue spinlock shared by the consumer.
This behavior is triggered only once the socket receive queue is above
half occupancy.
Under flood, this means that only one producer can be in line
trying to acquire the receive_queue spinlock.
These busylocks can be allocated per cpu, instead of
per socket (which would consume a cache line per socket).
This patch considerably improves UDP behavior under stress,
depending on the number of NIC RX queues and/or RPS spread.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/udp.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 42 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 110414903f9e..77875712405f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1195,10 +1195,36 @@ void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
}
EXPORT_SYMBOL(udp_skb_destructor);
+/* Idea of busylocks is to let producers grab an extra spinlock
+ * to relieve pressure on the receive_queue spinlock shared by consumer.
+ * Under flood, this means that only one producer can be in line
+ * trying to acquire the receive_queue spinlock.
+ * These busylock can be allocated on a per cpu manner, instead of a
+ * per socket one (that would consume a cache line per socket)
+ */
+static int udp_busylocks_log __read_mostly;
+static spinlock_t *udp_busylocks __read_mostly;
+
+static spinlock_t *busylock_acquire(void *ptr)
+{
+ spinlock_t *busy;
+
+ busy = udp_busylocks + hash_ptr(ptr, udp_busylocks_log);
+ spin_lock(busy);
+ return busy;
+}
+
+static void busylock_release(spinlock_t *busy)
+{
+ if (busy)
+ spin_unlock(busy);
+}
+
int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
{
struct sk_buff_head *list = &sk->sk_receive_queue;
int rmem, delta, amt, err = -ENOMEM;
+ spinlock_t *busy = NULL;
int size;
/* try to avoid the costly atomic add/sub pair when the receive
@@ -1214,8 +1240,11 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
* - Less cache line misses at copyout() time
* - Less work at consume_skb() (less alien page frag freeing)
*/
- if (rmem > (sk->sk_rcvbuf >> 1))
+ if (rmem > (sk->sk_rcvbuf >> 1)) {
skb_condense(skb);
+
+ busy = busylock_acquire(sk);
+ }
size = skb->truesize;
/* we drop only if the receive buf is full and the receive
@@ -1252,6 +1281,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
if (!sock_flag(sk, SOCK_DEAD))
sk->sk_data_ready(sk);
+ busylock_release(busy);
return 0;
uncharge_drop:
@@ -1259,6 +1289,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
drop:
atomic_inc(&sk->sk_drops);
+ busylock_release(busy);
return err;
}
EXPORT_SYMBOL_GPL(__udp_enqueue_schedule_skb);
@@ -2613,6 +2644,7 @@ EXPORT_SYMBOL(udp_flow_hashrnd);
void __init udp_init(void)
{
unsigned long limit;
+ unsigned int i;
udp_table_init(&udp_table, "UDP");
limit = nr_free_buffer_pages() / 8;
@@ -2623,4 +2655,13 @@ void __init udp_init(void)
sysctl_udp_rmem_min = SK_MEM_QUANTUM;
sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+ /* 16 spinlocks per cpu */
+ udp_busylocks_log = ilog2(nr_cpu_ids) + 4;
+ udp_busylocks = kmalloc(sizeof(spinlock_t) << udp_busylocks_log,
+ GFP_KERNEL);
+ if (!udp_busylocks)
+ panic("UDP: failed to alloc udp_busylocks\n");
+ for (i = 0; i < (1U << udp_busylocks_log); i++)
+ spin_lock_init(udp_busylocks + i);
}
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 3/4] udp: copy skb->truesize in the first cache line
2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 1/4] udp: under rx pressure, try to condense skbs Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 2/4] udp: add busylocks in RX path Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
2016-12-08 17:38 ` [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release() Eric Dumazet
2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
4 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
In the UDP RX handler, we currently clear skb->dev before the skb
is added to the receive queue, because the device pointer is no longer
available once we exit the RCU section.
Since this first cache line is always hot, let's reuse this space
to store skb->truesize and thus avoid a cache line miss at
udp_recvmsg()/udp_skb_destructor() time while the receive queue
spinlock is held.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/skbuff.h | 9 ++++++++-
net/ipv4/udp.c | 13 ++++++++++---
2 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0cd92b0f2af5..332e76756f54 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -645,8 +645,15 @@ struct sk_buff {
struct rb_node rbnode; /* used in netem & tcp stack */
};
struct sock *sk;
- struct net_device *dev;
+ union {
+ struct net_device *dev;
+ /* Some protocols might use this space to store information,
+ * while device pointer would be NULL.
+ * UDP receive path is one user.
+ */
+ unsigned long dev_scratch;
+ };
/*
* This is the control buffer. It is free to use for every
* layer. Please put your private variables there. If you
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 77875712405f..880cd3d84abf 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1188,10 +1188,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
}
-/* Note: called with sk_receive_queue.lock held */
+/* Note: called with sk_receive_queue.lock held.
+ * Instead of using skb->truesize here, find a copy of it in skb->dev_scratch
+ * This avoids a cache line miss while receive_queue lock is held.
+ * Look at __udp_enqueue_schedule_skb() to find where this copy is done.
+ */
void udp_skb_destructor(struct sock *sk, struct sk_buff *skb)
{
- udp_rmem_release(sk, skb->truesize, 1);
+ udp_rmem_release(sk, skb->dev_scratch, 1);
}
EXPORT_SYMBOL(udp_skb_destructor);
@@ -1246,6 +1250,10 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
busy = busylock_acquire(sk);
}
size = skb->truesize;
+ /* Copy skb->truesize into skb->dev_scratch to avoid a cache line miss
+ * in udp_skb_destructor()
+ */
+ skb->dev_scratch = size;
/* we drop only if the receive buf is full and the receive
* queue contains some other skb
@@ -1272,7 +1280,6 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
/* no need to setup a destructor, we will explicitly release the
* forward allocated memory on dequeue
*/
- skb->dev = NULL;
sock_skb_set_dropcount(sk, skb);
__skb_queue_tail(list, skb);
--
2.8.0.rc3.226.g39d4020
* [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
` (2 preceding siblings ...)
2016-12-08 17:38 ` [PATCH v2 net-next 3/4] udp: copy skb->truesize in the first cache line Eric Dumazet
@ 2016-12-08 17:38 ` Eric Dumazet
2016-12-08 18:24 ` Paolo Abeni
2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
4 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 17:38 UTC (permalink / raw)
To: David S . Miller; +Cc: netdev, Eric Dumazet, Paolo Abeni, Eric Dumazet
If udp_recvmsg() constantly releases sk_rmem_alloc
for every read packet, it gives the opportunity for
producers to immediately grab spinlocks and desperately
try adding another packet, causing false sharing.
We can add a simple heuristic to give that signal
in batches of ~25 % of the queue capacity.
This patch considerably increases performance under
flood, by about 50 %, since the thread draining the queue
is no longer slowed by false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/udp.h | 3 +++
net/ipv4/udp.c | 11 +++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/linux/udp.h b/include/linux/udp.h
index d1fd8cd39478..c0f530809d1f 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,9 @@ struct udp_sock {
int (*gro_complete)(struct sock *sk,
struct sk_buff *skb,
int nhoff);
+
+ /* This field is dirtied by udp_recvmsg() */
+ int forward_deficit;
};
static inline struct udp_sock *udp_sk(const struct sock *sk)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 880cd3d84abf..f0096d088104 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1177,8 +1177,19 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
/* fully reclaim rmem/fwd memory allocated for skb */
static void udp_rmem_release(struct sock *sk, int size, int partial)
{
+ struct udp_sock *up = udp_sk(sk);
int amt;
+ if (likely(partial)) {
+ up->forward_deficit += size;
+ size = up->forward_deficit;
+ if (size < (sk->sk_rcvbuf >> 2))
+ return;
+ } else {
+ size += up->forward_deficit;
+ }
+ up->forward_deficit = 0;
+
atomic_sub(size, &sk->sk_rmem_alloc);
sk->sk_forward_alloc += size;
amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
--
2.8.0.rc3.226.g39d4020
* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
2016-12-08 17:38 ` [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release() Eric Dumazet
@ 2016-12-08 18:24 ` Paolo Abeni
2016-12-08 18:36 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Paolo Abeni @ 2016-12-08 18:24 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
On Thu, 2016-12-08 at 09:38 -0800, Eric Dumazet wrote:
> If udp_recvmsg() constantly releases sk_rmem_alloc
> for every read packet, it gives opportunity for
> producers to immediately grab spinlocks and desperately
> try adding another packet, causing false sharing.
>
> We can add a simple heuristic to give the signal
> by batches of ~25 % of the queue capacity.
>
> This patch considerably increases performance under
> flood by about 50 %, since the thread draining the queue
> is no longer slowed by false sharing.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/udp.h | 3 +++
> net/ipv4/udp.c | 11 +++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/linux/udp.h b/include/linux/udp.h
> index d1fd8cd39478..c0f530809d1f 100644
> --- a/include/linux/udp.h
> +++ b/include/linux/udp.h
> @@ -79,6 +79,9 @@ struct udp_sock {
> int (*gro_complete)(struct sock *sk,
> struct sk_buff *skb,
> int nhoff);
> +
> + /* This field is dirtied by udp_recvmsg() */
> + int forward_deficit;
> };
>
> static inline struct udp_sock *udp_sk(const struct sock *sk)
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 880cd3d84abf..f0096d088104 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1177,8 +1177,19 @@ int udp_sendpage(struct sock *sk, struct page *page, int offset,
> /* fully reclaim rmem/fwd memory allocated for skb */
> static void udp_rmem_release(struct sock *sk, int size, int partial)
> {
> + struct udp_sock *up = udp_sk(sk);
> int amt;
>
> + if (likely(partial)) {
> + up->forward_deficit += size;
> + size = up->forward_deficit;
> + if (size < (sk->sk_rcvbuf >> 2))
> + return;
> + } else {
> + size += up->forward_deficit;
> + }
> + up->forward_deficit = 0;
> +
> atomic_sub(size, &sk->sk_rmem_alloc);
> sk->sk_forward_alloc += size;
> amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
Nice one! This sounds like a relevant improvement!
I'm wondering if it may cause regressions with small values of
sk_rcvbuf ?!? e.g. with:
netperf -t UDP_STREAM -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
I'm sorry, I fear I will not be able to do any test before next week.
Cheers,
Paolo
* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
2016-12-08 18:24 ` Paolo Abeni
@ 2016-12-08 18:36 ` Eric Dumazet
2016-12-08 18:38 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:36 UTC (permalink / raw)
To: Paolo Abeni; +Cc: David S . Miller, netdev, Eric Dumazet
On Thu, Dec 8, 2016 at 10:24 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> Nice one! This sounds like a relevant improvement!
>
> I'm wondering if it may cause regressions with small value of
> sk_rcvbuf ?!? e.g. with:
>
> netperf -t UDP_STREAM -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
>
Possibly; then we can simply refine the test to :
size = up->forward_deficit;
if (size < (sk->sk_rcvbuf >> 2) && !skb_queue_empty(&sk->sk_receive_queue))
	return;
* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
2016-12-08 18:36 ` Eric Dumazet
@ 2016-12-08 18:38 ` Eric Dumazet
2016-12-08 18:52 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:38 UTC (permalink / raw)
To: Paolo Abeni; +Cc: David S . Miller, netdev, Eric Dumazet
On Thu, Dec 8, 2016 at 10:36 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 8, 2016 at 10:24 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>
>> Nice one! This sounds like a relevant improvement!
>>
>> I'm wondering if it may cause regressions with small value of
>> sk_rcvbuf ?!? e.g. with:
>>
>> netperf -t UDP_STREAM -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
>>
>
> Possibly, then simply we can refine the test to :
>
> size = up->forward_deficit;
> if (size < (sk->sk_rcvbuf >> 2) && !skb_queue_empty(&sk->sk_receive_queue))
> return;
BTW, I tried :
lpaa6:~# ./netperf -t UDP_STREAM -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
127.0.0.1 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
4608 1024 10.00 4499400 0 3685.88
2560 10.00 4498670 3685.28
So it looks like it is working.
However I have no doubt there might be a corner case for tiny
SO_RCVBUF values or for some message sizes.
* Re: [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release()
2016-12-08 18:38 ` Eric Dumazet
@ 2016-12-08 18:52 ` Eric Dumazet
0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 18:52 UTC (permalink / raw)
To: Paolo Abeni; +Cc: David S . Miller, netdev, Eric Dumazet
On Thu, Dec 8, 2016 at 10:38 AM, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 8, 2016 at 10:36 AM, Eric Dumazet <edumazet@google.com> wrote:
>> On Thu, Dec 8, 2016 at 10:24 AM, Paolo Abeni <pabeni@redhat.com> wrote:
>>
>>> Nice one! This sounds like a relevant improvement!
>>>
>>> I'm wondering if it may cause regressions with small value of
>>> sk_rcvbuf ?!? e.g. with:
>>>
>>> netperf -t UDP_STREAM -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024
>>>
>>
>> Possibly, then simply we can refine the test to :
>>
>> size = up->forward_deficit;
>> if (size < (sk->sk_rcvbuf >> 2) && !skb_queue_empty(&sk->sk_receive_queue))
>> return;
>
I will also add this patch :
It really makes sure our changes to sk_forward_alloc won't be slowed down,
because producers would otherwise see the change to sk_rmem_alloc too soon.
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8400d6954558..6bdcbe103390 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1191,13 +1191,14 @@ static void udp_rmem_release(struct sock *sk, int size, int partial)
}
up->forward_deficit = 0;
- atomic_sub(size, &sk->sk_rmem_alloc);
sk->sk_forward_alloc += size;
amt = (sk->sk_forward_alloc - partial) & ~(SK_MEM_QUANTUM - 1);
sk->sk_forward_alloc -= amt;
if (amt)
__sk_mem_reduce_allocated(sk, amt >> SK_MEM_QUANTUM_SHIFT);
+
+ atomic_sub(size, &sk->sk_rmem_alloc);
}
/* Note: called with sk_receive_queue.lock held.
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
2016-12-08 17:38 [PATCH v2 net-next 0/4] udp: receive path optimizations Eric Dumazet
` (3 preceding siblings ...)
2016-12-08 17:38 ` [PATCH v2 net-next 4/4] udp: add batching to udp_rmem_release() Eric Dumazet
@ 2016-12-08 20:48 ` Jesper Dangaard Brouer
2016-12-08 21:13 ` Eric Dumazet
2016-12-08 21:17 ` Jesper Dangaard Brouer
4 siblings, 2 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-08 20:48 UTC (permalink / raw)
To: Eric Dumazet; +Cc: brouer, David S . Miller, netdev, Paolo Abeni, Eric Dumazet
On Thu, 8 Dec 2016 09:38:55 -0800
Eric Dumazet <edumazet@google.com> wrote:
> This patch series provides about 100 % performance increase under flood.
Could you please explain a bit more about what kind of testing you are
doing that can show 100% performance improvement?
I've tested this patchset and my tests show *huge* speed-ups, but
reaping the performance benefit depends heavily on the setup, on enabling
the right UDP socket settings, and most importantly on where the
performance bottleneck is: ksoftirqd (producer) or udp_sink (consumer).
Basic setup: Unload all netfilter, and enable ip_early_demux.
sysctl net/ipv4/ip_early_demux=1
Test generator pktgen UDP packets single flow, 50Gbit/s mlx5 NICs.
- Vary packet size between 64 and 1514.
Packet-size: 64
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
ns/pkt pps cycles/pkt
recvMmsg/32 run: 0 10000000 537.70 1859756.90 2155
recvmsg run: 0 10000000 510.84 1957541.83 2047
read run: 0 10000000 583.40 1714077.14 2338
recvfrom run: 0 10000000 600.09 1666411.49 2405
The ksoftirq thread "costs" more than udp_sink, which is idle, and the UDP
queue does not get full enough. Thus, the patchset does not have any
effect.
Try to increase the pktgen packet size, as this increases the copy cost in
udp_sink. Thus, a queue can now form, and the udp_sink CPU has almost no
idle cycles. The "read" and "recvfrom" tests did experience some idle
cycles.
Packet-size: 1514
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
ns/pkt pps cycles/pkt
recvMmsg/32 run: 0 10000000 435.88 2294204.11 1747
recvmsg run: 0 10000000 458.06 2183100.64 1835
read run: 0 10000000 520.34 1921826.18 2085
recvfrom run: 0 10000000 515.48 1939935.27 2066
Next trick, connected UDP:
Using a connected UDP socket (combined with ip_early_demux) removes the
FIB lookup from the ksoftirq path and improves the tipping point.
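(For reference, "connected" here simply means calling connect() on the
receiving UDP socket, so that early demux can cache the socket lookup and
skip the FIB lookup in softirq context. Below is a minimal sketch of such
a sink, not the actual udp_sink code; the port and the peer address
198.18.0.1 are only placeholders for the traffic generator.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in local = { .sin_family = AF_INET,
				     .sin_port = htons(9),
				     .sin_addr.s_addr = htonl(INADDR_ANY) };
	struct sockaddr_in peer = { .sin_family = AF_INET,
				    .sin_port = htons(9) };
	char buf[2048];
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	/* placeholder address of the packet generator */
	inet_pton(AF_INET, "198.18.0.1", &peer.sin_addr);

	if (fd < 0 ||
	    bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0 ||
	    connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("setup");
		return 1;
	}
	/* drain the receive queue as fast as possible */
	for (;;)
		if (recv(fd, buf, sizeof(buf), 0) < 0)
			break;
	close(fd);
	return 0;
}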
Packet-size: 64
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
ns/pkt pps cycles/pkt
recvMmsg/32 run: 0 10000000 391.18 2556361.62 1567
recvmsg run: 0 10000000 422.95 2364349.69 1695
read run: 0 10000000 425.29 2351338.10 1704
recvfrom run: 0 10000000 476.74 2097577.57 1910
Change/increase packet size:
Packet-size: 1514
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
ns/pkt pps cycles/pkt
recvMmsg/32 run: 0 10000000 457.56 2185481.94 1833
recvmsg run: 0 10000000 479.42 2085837.49 1921
read run: 0 10000000 398.05 2512233.13 1595
recvfrom run: 0 10000000 391.07 2557096.95 1567
A bit strange: changing the packet size flipped which syscall is the
fastest.
It is also interesting to see where the ksoftirq limit is:
Results from "nstat" while using recvmsg show that ksoftirq is
handling 2.6 Mpps, and the consumer/udp_sink is the bottleneck at 2 Mpps.
[skylake ~]$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 2667577 0.0
IpInDelivers 2667577 0.0
UdpInDatagrams 2083580 0.0
UdpInErrors 583995 0.0
UdpRcvbufErrors 583995 0.0
IpExtInOctets 4001340000 0.0
IpExtInNoECTPkts 2667559 0.0
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
@ 2016-12-08 21:13 ` Eric Dumazet
2016-12-09 16:05 ` Jesper Dangaard Brouer
2016-12-08 21:17 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-08 21:13 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni
On Thu, 2016-12-08 at 21:48 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 8 Dec 2016 09:38:55 -0800
> Eric Dumazet <edumazet@google.com> wrote:
>
> > This patch series provides about 100 % performance increase under flood.
>
> Could you please explain a bit more about what kind of testing you are
> doing that can show 100% performance improvement?
>
> I've tested this patchset and my tests show *huge* speeds ups, but
> reaping the performance benefit depend heavily on setup and enabling
> the right UDP socket settings, and most importantly where the
> performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).
Right.
So here at Google we do not try (yet) to downgrade our expensive
multiqueue NICs into dumb NICs from the last decade by using a single queue
on them. Maybe it will happen when we can process 10 Mpps per core,
but we are not there yet ;)
So my test is using a NIC, programmed with 8 queues, on a dual-socket
machine. (2 physical packages)
4 queues are handled by 4 cpus on socket0 (NUMA node 0)
4 queues are handled by 4 cpus on socket1 (NUMA node 1)
So I explicitly put my poor single thread UDP application in the worst
condition, having skbs produced on two NUMA nodes.
Then my load generator uses trafgen, with spoofed UDP source addresses,
like a UDP flood would use, or typical DNS traffic, malicious or not.
So I have 8 cpus all trying to queue packets in a single UDP socket.
Of course, a real high-performance server would use 8 UDP sockets and
SO_REUSEPORT with a nice eBPF filter to spread the packets based on the
queue/cpu they arrived on.
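(For illustration only; this is not code from this series, nor any tool
mentioned in this thread. A sketch of how such a reuseport group could be
built with the classic-BPF variant, SO_ATTACH_REUSEPORT_CBPF from kernels
>= 4.5, assuming one socket per cpu created in cpu order; the helper name
is made up.)

#include <linux/filter.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef SO_ATTACH_REUSEPORT_CBPF
#define SO_ATTACH_REUSEPORT_CBPF 51
#endif

/* The filter returns the cpu id handling the packet; reuseport uses the
 * returned value as an index into the socket group, so sockets must be
 * created in cpu order for the mapping to line up (out-of-range values
 * fall back to hash-based selection).
 */
static int reuseport_cpu_socket(int port, int attach_filter)
{
	struct sockaddr_in addr = { .sin_family = AF_INET,
				    .sin_port = htons(port),
				    .sin_addr.s_addr = htonl(INADDR_ANY) };
	struct sock_filter code[] = {
		/* A = cpu id (SKF_AD_CPU ancillary load) */
		{ BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
		/* return A */
		{ BPF_RET | BPF_A, 0, 0, 0 },
	};
	struct sock_fprog prog = { .len = 2, .filter = code };
	int one = 1;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
	/* one filter, attached to the first socket, serves the whole group */
	if (attach_filter)
		setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
			   &prog, sizeof(prog));
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return -1;
	return fd;
}

Each consumer thread would pin itself to cpu N and open its socket through
such a helper, so that its socket lands at index N in the reuseport group.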
In the case you have one cpu that you need to share between ksoftirq and
all user threads, then your test results depend on process scheduler
decisions more than anything we can code in network land.
It is actually easy for user space to get more than 50% of the cycles,
and 'starve' ksoftirqd.
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
2016-12-08 20:48 ` [PATCH v2 net-next 0/4] udp: receive path optimizations Jesper Dangaard Brouer
2016-12-08 21:13 ` Eric Dumazet
@ 2016-12-08 21:17 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-08 21:17 UTC (permalink / raw)
To: Eric Dumazet; +Cc: brouer, David S . Miller, netdev, Paolo Abeni, Eric Dumazet
On Thu, 8 Dec 2016 21:48:19 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Thu, 8 Dec 2016 09:38:55 -0800
> Eric Dumazet <edumazet@google.com> wrote:
>
> > This patch series provides about 100 % performance increase under flood.
>
> Could you please explain a bit more about what kind of testing you are
> doing that can show 100% performance improvement?
>
> I've tested this patchset and my tests show *huge* speeds ups, but
> reaping the performance benefit depend heavily on setup and enabling
> the right UDP socket settings, and most importantly where the
> performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).
>
> Basic setup: Unload all netfilter, and enable ip_early_demux.
> sysctl net/ipv4/ip_early_demux=1
>
> Test generator pktgen UDP packets single flow, 50Gbit/s mlx5 NICs.
> - Vary packet size between 64 and 1514.
Below, I've added the baseline tests.
Baseline test on net-next at commit c9fba3ed3a4
> Packet-size: 64
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
> ns/pkt pps cycles/pkt
> recvMmsg/32 run: 0 10000000 537.70 1859756.90 2155
> recvmsg run: 0 10000000 510.84 1957541.83 2047
> read run: 0 10000000 583.40 1714077.14 2338
> recvfrom run: 0 10000000 600.09 1666411.49 2405
Packet-size: 64 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
recvMmsg/32 run: 0 10000000 499.75 2001016.09 2003
recvmsg run: 0 10000000 455.84 2193740.92 1827
read run: 0 10000000 566.99 1763703.49 2272
recvfrom run: 0 10000000 581.02 1721098.87 2328
> The ksoftirq thread "cost" more than udp_sink, which is idle, and UDP
> queue does not get full-enough. Thus, patchset does not have any
> effect.
>
>
> Try to increase pktgen packet size, as this increase the copy cost of
> udp_sink. Thus, a queue can now form, and udp_sink CPU almost have no
> idle cycles. The "read" and "readfrom" did experience some idle
> cycles.
>
> Packet-size: 1514
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
> ns/pkt pps cycles/pkt
> recvMmsg/32 run: 0 10000000 435.88 2294204.11 1747
> recvmsg run: 0 10000000 458.06 2183100.64 1835
> read run: 0 10000000 520.34 1921826.18 2085
> recvfrom run: 0 10000000 515.48 1939935.27 2066
Packet-size: 1514 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7))
recvMmsg/32 run: 0 10000000 453.88 2203231.81 1819
recvmsg run: 0 10000000 488.31 2047869.13 1957
read run: 0 10000000 480.99 2079058.69 1927
recvfrom run: 0 10000000 522.64 1913349.26 2094
> Next trick connected UDP:
>
> Use connected UDP socket (combined with ip_early_demux), removes the
> FIB_lookup from the ksoftirq, and cause tipping point to be better.
>
> Packet-size: 64
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
> ns/pkt pps cycles/pkt
> recvMmsg/32 run: 0 10000000 391.18 2556361.62 1567
> recvmsg run: 0 10000000 422.95 2364349.69 1695
> read run: 0 10000000 425.29 2351338.10 1704
> recvfrom run: 0 10000000 476.74 2097577.57 1910
Packet-size: 64 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
recvMmsg/32 run: 0 10000000 438.55 2280255.77 1757
recvmsg run: 0 10000000 496.73 2013156.99 1990
read run: 0 10000000 412.17 2426170.58 1652
recvfrom run: 0 10000000 471.77 2119662.99 1890
> Change/increase packet size:
>
> Packet-size: 1514
> $ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
> ns/pkt pps cycles/pkt
> recvMmsg/32 run: 0 10000000 457.56 2185481.94 1833
> recvmsg run: 0 10000000 479.42 2085837.49 1921
> read run: 0 10000000 398.05 2512233.13 1595
> recvfrom run: 0 10000000 391.07 2557096.95 1567
Packet-size: 1514 (baseline)
$ sudo taskset -c 4 ./udp_sink --port 9 --count $((10**7)) --connect
recvMmsg/32 run: 0 10000000 491.11 2036205.63 1968
recvmsg run: 0 10000000 514.37 1944138.31 2061
read run: 0 10000000 444.02 2252147.84 1779
recvfrom run: 0 10000000 426.58 2344247.20 1709
> A bit strange, changing the packet size, flipped what is the fastest
> syscall.
>
> It is also interesting to see that ksoftirq limit is:
>
> Result from "nstat" while using recvmsg, show that ksoftirq is
> handling 2.6 Mpps, and consumer/udp_sink is bottleneck with 2Mpps.
>
> [skylake ~]$ nstat > /dev/null && sleep 1 && nstat
> #kernel
> IpInReceives 2667577 0.0
> IpInDelivers 2667577 0.0
> UdpInDatagrams 2083580 0.0
> UdpInErrors 583995 0.0
> UdpRcvbufErrors 583995 0.0
> IpExtInOctets 4001340000 0.0
> IpExtInNoECTPkts 2667559 0.0
(baseline 1514 bytes recvmsg)
$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives 2702424 0.0
IpInDelivers 2702423 0.0
UdpInDatagrams 1950184 0.0
UdpInErrors 752239 0.0
UdpRcvbufErrors 752239 0.0
IpExtInOctets 4053642000 0.0
IpExtInNoECTPkts 2702428 0.0
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
2016-12-08 21:13 ` Eric Dumazet
@ 2016-12-09 16:05 ` Jesper Dangaard Brouer
2016-12-09 16:26 ` Eric Dumazet
0 siblings, 1 reply; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2016-12-09 16:05 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni, brouer
On Thu, 08 Dec 2016 13:13:15 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-12-08 at 21:48 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 8 Dec 2016 09:38:55 -0800
> > Eric Dumazet <edumazet@google.com> wrote:
> >
> > > This patch series provides about 100 % performance increase under flood.
> >
> > Could you please explain a bit more about what kind of testing you are
> > doing that can show 100% performance improvement?
> >
> > I've tested this patchset and my tests show *huge* speeds ups, but
> > reaping the performance benefit depend heavily on setup and enabling
> > the right UDP socket settings, and most importantly where the
> > performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).
>
> Right.
>
> So here at Google we do not try (yet) to downgrade our expensive
> Multiqueue Nics into dumb NICS from last decade by using a single queue
> on them. Maybe it will happen when we can process 10Mpps per core,
> but we are not there yet ;)
>
> So my test is using a NIC, programmed with 8 queues, on a dual-socket
> machine. (2 physical packages)
>
> 4 queues are handled by 4 cpus on socket0 (NUMA node 0)
> 4 queues are handled by 4 cpus on socket1 (NUMA node 1)
Interesting setup; it will be good for catching cache-line bouncing and
false sharing, which the streak of recent patches shows ;-) (Hopefully
such setups are avoided in production.)
> So I explicitly put my poor single thread UDP application in the worst
> condition, having skbs produced on two NUMA nodes.
On which CPU do you place the single thread UDP application?
E.g. do you allow it to run on a CPU that also processes ksoftirq?
My experience is that performance is roughly halved if ksoftirq and
the UDP thread share a CPU (after you fixed the softirq issue).
> Then my load generator use trafgen, with spoofed UDP source addresses,
> like a UDP flood would use. Or typical DNS traffic, malicious or not.
I also like trafgen
https://github.com/netoptimizer/network-testing/tree/master/trafgen
> So I have 8 cpus all trying to queue packets in a single UDP socket.
>
> Of course, a real high performance server would use 8 UDP sockets, and
> SO_REUSEPORT with nice eBPF filter to spread the packets based on the
> queue/cpu they arrived.
Once the ksoftirq and UDP-threads are silo'ed like that, it should
basically correspond to the benchmarks of my single queue test,
multiplied by the number of CPUs/UDP-threads.
I think it might be a good idea (for me) to implement such a
UDP-multi-threaded sink example program (with SO_REUSEPORT and eBPF
filter) to demonstrate and make sure the stack scales (and every
time we/I improve single queue performance, the numbers should multiply
with the scaling). Maybe you already have such an example program?
> In the case you have one cpu that you need to share between ksoftirq and
> all user threads, then your test results depend on process scheduler
> decisions more than anything we can code in network land.
Yes, also my experience, the scheduler have large influence.
> It is actually easy for user space to get more than 50% of the cycles,
> and 'starve' ksoftirqd.
FYI, Paolo recently added an option for parsing the pktgen payload in
the udp_sink.c program; this way we can simulate the app doing something.
I've started testing with 4 CPUs doing ksoftirq and multiple flows
(pktgen_sample04_many_flows.sh), and then progressively adding udp_sink
--reuse-port programs on the other 4 CPUs, and it looks like it scales
nicely :-)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
2016-12-09 16:05 ` Jesper Dangaard Brouer
@ 2016-12-09 16:26 ` Eric Dumazet
[not found] ` <CALx6S35roMkor_0maXk-SwdXeF4GxBfbxXLEXLGnn6mRRaut6g@mail.gmail.com>
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-09 16:26 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Eric Dumazet, David S . Miller, netdev, Paolo Abeni
On Fri, 2016-12-09 at 17:05 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 08 Dec 2016 13:13:15 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > On Thu, 2016-12-08 at 21:48 +0100, Jesper Dangaard Brouer wrote:
> > > On Thu, 8 Dec 2016 09:38:55 -0800
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > > This patch series provides about 100 % performance increase under flood.
> > >
> > > Could you please explain a bit more about what kind of testing you are
> > > doing that can show 100% performance improvement?
> > >
> > > I've tested this patchset and my tests show *huge* speeds ups, but
> > > reaping the performance benefit depend heavily on setup and enabling
> > > the right UDP socket settings, and most importantly where the
> > > performance bottleneck is: ksoftirqd(producer) or udp_sink(consumer).
> >
> > Right.
> >
> > So here at Google we do not try (yet) to downgrade our expensive
> > Multiqueue Nics into dumb NICS from last decade by using a single queue
> > on them. Maybe it will happen when we can process 10Mpps per core,
> > but we are not there yet ;)
> >
> > So my test is using a NIC, programmed with 8 queues, on a dual-socket
> > machine. (2 physical packages)
> >
> > 4 queues are handled by 4 cpus on socket0 (NUMA node 0)
> > 4 queues are handled by 4 cpus on socket1 (NUMA node 1)
>
> Interesting setup, it will be good to catch cache-line bouncing and
> false-sharing, which the streak of recent patches show ;-) (Hopefully
> such setup are avoided for production).
Well, if you have a 100Gbit NIC and 2 NUMA nodes, what do you suggest
exactly when jobs run on both nodes ?
If you suggest removing one package, or forcing jobs to run on socket0
just because the NIC is attached to it, that won't be an option.
Most of the traffic is TCP, so RSS comes nicely here to affine traffic
on one RX queue of the NIC.
Now, if for some reason an innocent UDP socket is the target of a flood,
we must not have all cpus blocked on a spinlock just to eventually queue a
packet.
Be assured that high performance UDP servers use kernel bypass, or
SO_REUSEPORT already. My effort is not targeting these special users,
since they already have good performance.
My effort is to provide some isolation, a bit like the effort I did for
SYN flood attacks (cpus were all spinning on the listener spinlock).
>
>
> > So I explicitly put my poor single thread UDP application in the worst
> > condition, having skbs produced on two NUMA nodes.
>
> On which CPU do you place the single thread UDP application?
It does not matter in this case. You can either force it to run on a group
of cpus, or let the scheduler choose.
If you let the scheduler choose, then it might help against the single-tuple
flood attack, since the user thread will be moved to a different cpu
than ksoftirqd.
>
> E.g. do you allow it to run on a CPU that also process ksoftirq?
> My experience is that performance is approx half, if ksoftirq and
> UDP-thread share a CPU (after you fixed the softirq issue).
Well, this is exactly what I said earlier. Your choices about cpu
pinning might help or might hurt in different scenarios.
>
>
> > Then my load generator use trafgen, with spoofed UDP source addresses,
> > like a UDP flood would use. Or typical DNS traffic, malicious or not.
>
> I also like trafgen
> https://github.com/netoptimizer/network-testing/tree/master/trafgen
>
> > So I have 8 cpus all trying to queue packets in a single UDP socket.
> >
> > Of course, a real high performance server would use 8 UDP sockets, and
> > SO_REUSEPORT with nice eBPF filter to spread the packets based on the
> > queue/cpu they arrived.
>
> Once the ksoftirq and UDP-threads are silo'ed like that, it should
> basically correspond to the benchmarks of my single queue test,
> multiplied by the number of CPUs/UDP-threads.
Well, if one cpu is shared by the producer and consumer then packets are
hot in caches, so trying to avoid cache line misses as I did is not
really helping.
I optimized the case where we do not assume both parties run on the same
cpu. If you let the process scheduler do its job, then your throughput can
be doubled ;)
Now if for some reason you are stuck with a single CPU, this is a very
different problem, and af_packet might be better.
>
> I think it might be a good idea (for me) to implement such a
> UDP-multi-threaded sink example program (with SO_REUSEPORT and eBPF
> filter) to demonstrate and make sure the stack scales (and every
> time we/I improve single queue performance, the numbers should multiply
> with the scaling). Maybe you already have such an example program?
Well, I do have something using SO_REUSEPORT, but not yet BPF, so not in
a state I can share at this moment.
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
[not found] ` <CALx6S35roMkor_0maXk-SwdXeF4GxBfbxXLEXLGnn6mRRaut6g@mail.gmail.com>
@ 2016-12-09 16:53 ` Eric Dumazet
2016-12-09 17:13 ` Tom Herbert
0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2016-12-09 16:53 UTC (permalink / raw)
To: Tom Herbert
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S . Miller, netdev,
Paolo Abeni
On Fri, 2016-12-09 at 08:43 -0800, Tom Herbert wrote:
>
>
> Are you thinking of allowing unconnected socket to have multiple input
> queues? Sort of an automatic and transparent SO_REUSEPORT...
It all depends if the user application is using a single thread or
multiple threads to drain the queue.
Since we used to grab socket lock in udp_recvmsg(), I guess nobody uses
multiple threads to read packets from a single socket.
So heavy users must use SO_REUSEPORT already, not sure what we would
gain trying to go to a single socket, with the complexity of mem
charging.
>
* Re: [PATCH v2 net-next 0/4] udp: receive path optimizations
2016-12-09 16:53 ` Eric Dumazet
@ 2016-12-09 17:13 ` Tom Herbert
0 siblings, 0 replies; 16+ messages in thread
From: Tom Herbert @ 2016-12-09 17:13 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jesper Dangaard Brouer, Eric Dumazet, David S . Miller, netdev,
Paolo Abeni
On Fri, Dec 9, 2016 at 8:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-12-09 at 08:43 -0800, Tom Herbert wrote:
>>
>>
>
>
>> Are you thinking of allowing unconnected socket to have multiple input
>> queues? Sort of an automatic and transparent SO_REUSEPORT...
>
> It all depends if the user application is using a single thread or
> multiple threads to drain the queue.
>
If they're using multiple threads, hopefully there's no reason they
can't use SO_REUSEPORT. Since we should always assume a DDoS is a
possibility, it seems like that should be a general recommendation: if
you have multiple threads listening on a port, use SO_REUSEPORT.
> Since we used to grab socket lock in udp_recvmsg(), I guess nobody uses
> multiple threads to read packets from a single socket.
>
That's the hope! So the problem at hand is multiple producer CPUs and
one consumer CPU.
> So heavy users must use SO_REUSEPORT already, not sure what we would
> gain trying to go to a single socket, with the complexity of mem
> charging.
>
I think you're making a good point about the possibility that any
unconnected UDP socket could be subject to an attack, so any use of
unconnected UDP has the potential to become a "heavy user" (in fact
we've seen this bring down whole networks in production before). Therefore
the single-threaded reader case is relevant to consider.
Tom
>
>>
>
>