netdev.vger.kernel.org archive mirror
* UDP multi-core performance on a single socket and SO_REUSEPORT
@ 2012-12-28 10:01 Mark Zealey
  2013-01-04 18:50 ` Mark Zealey
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Zealey @ 2012-12-28 10:01 UTC (permalink / raw)
  To: netdev

I appreciate that this question has come up a number of times over the 
years, most recently as far as I can see in this thread: 
http://markmail.org/message/hcc7zn5ln5wktypv . I'm going to explain my 
problem and present some performance numbers to back this up.

The problem: I'm doing some research on scaling a DNS server (powerdns) 
to work well on multi-core boxes (in this case testing with 2*E5-2650 
processors, i.e. Linux sees 32 cores).

My powerdns configuration uses a shared socket, with one thread per 
core in the box listening on that socket using poll()/recvmsg(). I've 
modified powerdns so that in my tests it does the absolute minimum of 
work to answer packets (all queries are for the same record; it keeps 
the response in memory and just changes a few fields before calling 
sendmsg()). I'm binding to a single 10.xxx address and using this for 
all local and remote tests.
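
Concretely, each worker thread sits in a loop along these lines (a 
minimal sketch of the shared-socket model, not the actual powerdns 
code; buffer size and the DNS handling are elided):

#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* One of N identical worker threads; all share the same bound UDP fd.
 * Illustrative only - error handling and the real DNS logic omitted. */
static void *worker(void *arg)
{
    int fd = *(int *)arg;                 /* shared, already-bound UDP socket */
    char buf[512];
    struct sockaddr_in peer;

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, -1) <= 0)
            continue;

        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct msghdr msg = {
            .msg_name = &peer, .msg_namelen = sizeof(peer),
            .msg_iov = &iov, .msg_iovlen = 1,
        };
        ssize_t len = recvmsg(fd, &msg, 0);
        if (len <= 0)
            continue;

        /* ...patch a few fields of the canned response here... */

        iov.iov_len = len;                /* reply to the same peer */
        sendmsg(fd, &msg, 0);
    }
    return NULL;
}

All of the threads block in recvmsg() on the same fd, which is exactly 
where the contention shows up in the perf output further down.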

The numbers below are generated using 16 parallel queryperf instances 
on localhost (it doesn't really matter whether the queries come from 
remote hosts or from localhost; the numbers don't change much).

Using the stock CentOS 6.3 kernel I see powerdns performing at around 
120kqps (using at most about 12 CPUs).

Using a 3.7.1 kernel (from elrepo) this increases to 200-240kqps, 
maxing out all CPUs in the box (soft-interrupt CPU time is about 8x 
higher than on the CentOS 6.3 kernel, at 40%, and system CPU time is 
at 50%; powerdns itself only uses 10% of the CPU time).

Using the stock CentOS 6.3 kernel with the Google SO_REUSEPORT patch 
from 2010 (modified slightly so it applies) I see 500-600kqps from 
remote hosts, or 1Mqps when doing localhost queries. powerdns doesn't 
go past using 8 CPUs; the limit it hits at that point appears to be 
some lock in sendmsg().

I've not been able to get the 2010 SO_REUSEPORT patch working on the 
3.7.1 kernel; I suspect it would give even better performance there, 
as sendmsg() should have been significantly improved.
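
For reference, the way the patch is used is that each thread opens its 
own socket, sets SO_REUSEPORT before bind(), and binds to the same 
address and port; the kernel then spreads incoming datagrams across 
the sockets. A minimal sketch of the per-thread setup (illustrative 
only; I'm assuming the long-reserved value 15 for SO_REUSEPORT since 
it isn't in older glibc headers - adjust to whatever your patched 
headers define):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15   /* assumed Linux value; check your patched headers */
#endif

/* Called once per worker thread; each thread then runs its own
 * recvmsg()/sendmsg() loop on the fd it gets back.  Sketch only -
 * error handling omitted. */
static int make_listener(const char *ip, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;

    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}

With this, each thread has its own socket and hence its own receive 
queue, which is presumably why the recvmsg() lock contention goes away.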

Now, I don't believe that SO_REUSEPORT is strictly needed in the 
kernel for this case; however, the numbers above clearly show that the 
current UDP recvmsg() implementation on a single socket across 
multiple cores still locks badly on kernel 3.7.1. A perf report on 
3.7.1 (using 16 local queryperf instances) shows:

     68.34%  pdns_server  [kernel.kallsyms]    [k] _raw_spin_lock_bh
             |
             --- 0x7fa472023a2d
                 system_call_fastpath
                 sys_recvmsg
                 __sys_recvmsg
                 sock_recvmsg
                 inet_recvmsg
                 udp_recvmsg
                 skb_free_datagram_locked
                |
                |--100.00%-- lock_sock_fast
                |          _raw_spin_lock_bh
                 --0.00%-- [...]

      3.10%  pdns_server  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
             |
             --- 0x7fa472023a2d
                 system_call_fastpath
                 sys_recvmsg
                 __sys_recvmsg
                 sock_recvmsg
                 inet_recvmsg
                 udp_recvmsg
                |
                |--99.69%-- __skb_recv_datagram
                |          |
                |          |--77.68%-- _raw_spin_lock_irqsave
                |          |
                |          |--14.56%-- prepare_to_wait_exclusive
                |          |          _raw_spin_lock_irqsave
                |          |
                |           --7.76%-- finish_wait
                |                     _raw_spin_lock_irqsave
                 --0.31%-- [...]
                ...

Any advice or patches welcome... :-)

Mark


* Re: UDP multi-core performance on a single socket and SO_REUSEPORT
  2012-12-28 10:01 UDP multi-core performance on a single socket and SO_REUSEPORT Mark Zealey
@ 2013-01-04 18:50 ` Mark Zealey
  2013-01-04 19:37   ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Mark Zealey @ 2013-01-04 18:50 UTC (permalink / raw)
  To: netdev

I have written two small test scripts now which can be found at 
http://mark.zealey.org/uploads/ - one launches 16 listening threads for 
a single UDP socket, the other needs to be run as

for i in `seq 16`; do ./udp_test_client & done
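
The client itself boils down to roughly the following (my sketch of 
the idea only, not the actual script, which is at the URL above; the 
address and port are placeholders):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Sketch of a trivial UDP load generator: connect once, then send small
 * datagrams as fast as possible.  Placeholder address/port - the real
 * udp_test_client is at the URL above. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);                     /* placeholder */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr); /* placeholder */

    connect(fd, (struct sockaddr *)&srv, sizeof(srv));

    char payload[32] = "test";
    for (;;)
        send(fd, payload, sizeof(payload), 0);
}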

On my test server (32-core), stock kernel 3.7.1, 90% of the time is 
spent in the kernel waiting on spinlocks. Perf output:

     44.95%  udp_test_server  [kernel.kallsyms]   [k] _raw_spin_lock_bh
             |
             --- _raw_spin_lock_bh
                |
                |--100.00%-- lock_sock_fast
                |          skb_free_datagram_locked
                |          udp_recvmsg
                |          inet_recvmsg
                |          sock_recvmsg
                |          __sys_recvmsg
                |          sys_recvmsg
                |          system_call_fastpath
                |          0x7fd8c4702a2d
                |          start_thread
                 --0.00%-- [...]

     43.48%  udp_test_client  [kernel.kallsyms]   [k] _raw_spin_lock
             |
             --- _raw_spin_lock
                |
                |--99.80%-- udp_queue_rcv_skb
                |          __udp4_lib_rcv
                |          udp_rcv
                |          ip_local_deliver_finish
                |          ip_local_deliver
                |          ip_rcv_finish
                |          ip_rcv

Thanks,

Mark


* Re: UDP multi-core performance on a single socket and SO_REUSEPORT
  2013-01-04 18:50 ` Mark Zealey
@ 2013-01-04 19:37   ` Eric Dumazet
  2013-01-04 20:47     ` Tom Herbert
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2013-01-04 19:37 UTC (permalink / raw)
  To: Mark Zealey; +Cc: netdev

On Fri, 2013-01-04 at 18:50 +0000, Mark Zealey wrote:
> I have written two small test scripts now which can be found at 
> http://mark.zealey.org/uploads/ - one launches 16 listening threads for 
> a single UDP socket, the other needs to be run as
> 
> for i in `seq 16`; do ./udp_test_client & done
> 
> On my test server (32-core), stock kernel 3.7.1, 90% of the time is 
> spent in the kernel waiting on spinlocks. Perf output:

Mark

We know about the scalability issue of using a single socket from many threads.

The send path was fixed at some point so that it does not require the socket lock.

But the receive path uses a single receive_queue, protected by a
spinlock.

SO_REUSEPORT would be nice, but had known issues.

The af_packet fanout implementation was nicer.
You could try:

1) Use af_packet FANOUT instead of UDP sockets (see the sketch after 
this list)

2) Rewrite SO_REUSEPORT to use a FANOUT-like implementation

3) Extend UDP sockets to be able to use a configurable number of 
receive queues instead of a single one.
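
For option 1, the per-thread setup would look roughly like the sketch 
below: every thread opens its own AF_PACKET socket and joins the same 
fanout group, and the kernel spreads incoming frames across the group 
members by flow hash. This needs kernel/headers >= 3.1 for 
PACKET_FANOUT, and the application then has to parse the IP/UDP 
headers itself; the interface name, group id and fanout mode are just 
examples.

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>

/* One call per worker thread: all threads pass the same group id so the
 * kernel fans incoming frames out across their sockets.  Sketch only -
 * error handling omitted, and CAP_NET_RAW is required. */
static int open_fanout_socket(const char *ifname, int group_id)
{
    int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_IP));

    struct sockaddr_ll ll;
    memset(&ll, 0, sizeof(ll));
    ll.sll_family   = AF_PACKET;
    ll.sll_protocol = htons(ETH_P_IP);
    ll.sll_ifindex  = if_nametoindex(ifname);
    bind(fd, (struct sockaddr *)&ll, sizeof(ll));

    /* low 16 bits: group id shared by all members, high 16 bits: mode */
    int fanout_arg = (group_id & 0xffff) | (PACKET_FANOUT_HASH << 16);
    setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg));

    return fd;
}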


* Re: UDP multi-core performance on a single socket and SO_REUSEPORT
  2013-01-04 19:37   ` Eric Dumazet
@ 2013-01-04 20:47     ` Tom Herbert
  2013-01-04 21:46       ` Mark Zealey
  0 siblings, 1 reply; 5+ messages in thread
From: Tom Herbert @ 2013-01-04 20:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Mark Zealey, netdev

I believe the hard part of making SO_REUSEPORT work was on the TCP 
side, in dealing with state in the req structs, which we have not 
resolved. UDP SO_REUSEPORT seems to be working pretty well.

Tom


* Re: UDP multi-core performance on a single socket and SO_REUSEPORT
  2013-01-04 20:47     ` Tom Herbert
@ 2013-01-04 21:46       ` Mark Zealey
  0 siblings, 0 replies; 5+ messages in thread
From: Mark Zealey @ 2013-01-04 21:46 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Eric Dumazet, netdev

On 04/01/13 20:47, Tom Herbert wrote:
> I believe the hard part of making SO_REUSEPORT work was on the TCP
> side, in dealing with state in the req structs, which we have not
> resolved. UDP SO_REUSEPORT seems to be working pretty well.

Does anyone have an SO_REUSEPORT UDP patch for a modern networking 
stack? There have been a number of changes and new functions added 
since the 2010 patch that I found, so I couldn't get it to work 
properly on 3.7.1.

Thanks,

Mark
